Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News Sign UpSign In
| Download
Views: 94
Kernel: Python 3 (Anaconda)

Practical examples using numpy and Pandas

This week we will focus on using practical examples using numpy and Pandas.


Let's do some analysis of daily weather records from Champaign-Urbana. The records go back to 1888!

#load in numpy and pandas import numpy as np import pandas as pd #We'll use pa daily_data = pd.read_csv('daily_wx_data_champaign.csv',sep=',',skiprows=7)
daily_data['TMIN']
0 64 1 57 2 51 3 62 4 60 5 50 6 37 7 46 8 48 9 64 10 63 11 54 12 45 13 50 14 53 15 44 16 39 17 42 18 47 19 48 20 53 21 52 22 63 23 54 24 45 25 54 26 52 27 36 28 33 29 49 .. 46915 22 46916 22 46917 11 46918 34 46919 26 46920 18 46921 15 46922 11 46923 26 46924 25 46925 32 46926 16 46927 9 46928 13 46929 39 46930 33 46931 24 46932 31 46933 27 46934 24 46935 37 46936 40 46937 44 46938 43 46939 48 46940 43 46941 44 46942 37 46943 M 46944 M Name: TMIN, Length: 46945, dtype: object

Let's do some basic stats. Calculate the mean of the minimum temperature values,

daily_data['TMIN'].mean()
WARNING: 1 intermediate output message was discarded.

Oops, we need to do some cleanup, because the column is being handled as a string because it contains M values mixed in with the numbers. We can use fancy indexing to help. First, create a boolean condition that pattern matches the M values:

daily_data['TMIN'] == 'M'
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 46915 False 46916 False 46917 False 46918 False 46919 False 46920 False 46921 False 46922 False 46923 False 46924 False 46925 False 46926 False 46927 False 46928 False 46929 False 46930 False 46931 False 46932 False 46933 False 46934 False 46935 False 46936 False 46937 False 46938 False 46939 False 46940 False 46941 False 46942 False 46943 True 46944 True Name: TMIN, Length: 46945, dtype: bool

Then use the boolean array as a fancy index to do the job.

daily_data['TMIN'][daily_data['TMIN'] == 'M']=None daily_data['TMIN'].astype(float)
/ext/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy if __name__ == '__main__':
0 64.0 1 57.0 2 51.0 3 62.0 4 60.0 5 50.0 6 37.0 7 46.0 8 48.0 9 64.0 10 63.0 11 54.0 12 45.0 13 50.0 14 53.0 15 44.0 16 39.0 17 42.0 18 47.0 19 48.0 20 53.0 21 52.0 22 63.0 23 54.0 24 45.0 25 54.0 26 52.0 27 36.0 28 33.0 29 49.0 ... 46915 22.0 46916 22.0 46917 11.0 46918 34.0 46919 26.0 46920 18.0 46921 15.0 46922 11.0 46923 26.0 46924 25.0 46925 32.0 46926 16.0 46927 9.0 46928 13.0 46929 39.0 46930 33.0 46931 24.0 46932 31.0 46933 27.0 46934 24.0 46935 37.0 46936 40.0 46937 44.0 46938 43.0 46939 48.0 46940 43.0 46941 44.0 46942 37.0 46943 NaN 46944 NaN Name: TMIN, Length: 46945, dtype: float64

Don't forget to store the data back in memory!

daily_data['TMIN']=daily_data['TMIN'].astype(float)
daily_data['TMIN'].median()
43.0

In-class exercise 1: Use a loop to replace all of the columns as floats, ignoring M values.

daily_data = pd.read_csv('daily_wx_data_champaign.csv',sep=',',skiprows=7) cols = ['TMIN','TMAX','PRCP','SNOW','SNWD'] for col in cols: daily_data[col][daily_data[col] == 'M'] = None daily_data[col][daily_data[col] == 'T'] = 0.005 daily_data[col] = daily_data[col].astype('float') daily_data
/ext/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy /ext/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Date PRCP SNOW SNWD TMAX TMIN Unnamed: 6
0 1888-08-17 0.000 NaN NaN 84.0 64.0 NaN
1 1888-08-18 0.000 NaN NaN 81.0 57.0 NaN
2 1888-08-19 0.000 NaN NaN 85.0 51.0 NaN
3 1888-08-20 0.005 NaN NaN 72.0 62.0 NaN
4 1888-08-21 0.100 NaN NaN 81.0 60.0 NaN
5 1888-08-22 0.000 NaN NaN 71.0 50.0 NaN
6 1888-08-23 0.000 NaN NaN 84.0 37.0 NaN
7 1888-08-24 0.000 NaN NaN 84.0 46.0 NaN
8 1888-08-25 0.000 NaN NaN 88.0 48.0 NaN
9 1888-08-26 0.000 NaN NaN 87.0 64.0 NaN
10 1888-08-27 0.000 NaN NaN 82.0 63.0 NaN
11 1888-08-28 0.000 NaN NaN 80.0 54.0 NaN
12 1888-08-29 0.000 NaN NaN 84.0 45.0 NaN
13 1888-08-30 0.000 NaN NaN 85.0 50.0 NaN
14 1888-08-31 0.000 NaN NaN 73.0 53.0 NaN
15 1888-09-01 0.000 NaN NaN 72.0 44.0 NaN
16 1888-09-02 0.000 NaN NaN 79.0 39.0 NaN
17 1888-09-03 0.000 NaN NaN 82.0 42.0 NaN
18 1888-09-04 0.000 NaN NaN 83.0 47.0 NaN
19 1888-09-05 0.000 NaN NaN 76.0 48.0 NaN
20 1888-09-06 0.000 NaN NaN 78.0 53.0 NaN
21 1888-09-07 0.000 NaN NaN 86.0 52.0 NaN
22 1888-09-08 0.030 NaN NaN 89.0 63.0 NaN
23 1888-09-09 0.000 NaN NaN 81.0 54.0 NaN
24 1888-09-10 0.000 NaN NaN 84.0 45.0 NaN
25 1888-09-11 0.000 NaN NaN 89.0 54.0 NaN
26 1888-09-12 0.000 NaN NaN 76.0 52.0 NaN
27 1888-09-13 0.000 NaN NaN 69.0 36.0 NaN
28 1888-09-14 0.000 NaN NaN 76.0 33.0 NaN
29 1888-09-15 0.550 NaN NaN 69.0 49.0 NaN
... ... ... ... ... ... ... ...
46915 2017-01-28 0.005 0.005 0.000 37.0 22.0 NaN
46916 2017-01-29 0.005 0.005 0.005 29.0 22.0 NaN
46917 2017-01-30 0.090 0.800 1.000 42.0 11.0 NaN
46918 2017-01-31 0.000 0.000 0.005 44.0 34.0 NaN
46919 2017-02-01 0.000 0.000 0.000 42.0 26.0 NaN
46920 2017-02-02 0.000 0.000 0.000 30.0 18.0 NaN
46921 2017-02-03 0.000 0.000 0.000 31.0 15.0 NaN
46922 2017-02-04 0.000 0.000 0.000 37.0 11.0 NaN
46923 2017-02-05 0.000 0.000 0.000 46.0 26.0 NaN
46924 2017-02-06 0.000 0.000 0.000 47.0 25.0 NaN
46925 2017-02-07 0.150 0.000 0.000 62.0 32.0 NaN
46926 2017-02-08 0.000 0.000 0.000 33.0 16.0 NaN
46927 2017-02-09 0.130 1.500 2.000 20.0 9.0 NaN
46928 2017-02-10 0.000 0.000 2.000 50.0 13.0 NaN
46929 2017-02-11 0.000 0.000 0.005 67.0 39.0 NaN
46930 2017-02-12 0.000 0.000 0.000 52.0 33.0 NaN
46931 2017-02-13 0.000 0.000 0.000 46.0 24.0 NaN
46932 2017-02-14 0.000 0.000 0.000 53.0 31.0 NaN
46933 2017-02-15 0.000 0.000 0.000 43.0 27.0 NaN
46934 2017-02-16 0.000 0.000 0.000 52.0 24.0 NaN
46935 2017-02-17 0.000 0.000 0.000 70.0 37.0 NaN
46936 2017-02-18 0.000 0.000 0.000 70.0 40.0 NaN
46937 2017-02-19 0.000 0.000 0.000 68.0 44.0 NaN
46938 2017-02-20 0.000 0.000 0.000 71.0 43.0 NaN
46939 2017-02-21 0.020 0.000 0.000 62.0 48.0 NaN
46940 2017-02-22 0.005 0.000 0.000 70.0 43.0 NaN
46941 2017-02-23 0.000 0.000 0.000 60.0 44.0 NaN
46942 2017-02-24 0.005 0.000 0.000 68.0 37.0 NaN
46943 2017-02-25 0.005 0.000 0.000 NaN NaN NaN
46944 2017-02-26 NaN NaN NaN NaN NaN NaN

46945 rows × 7 columns

daily_data['SNOW'] == np.nan
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 46915 False 46916 False 46917 False 46918 False 46919 False 46920 False 46921 False 46922 False 46923 False 46924 False 46925 False 46926 False 46927 False 46928 False 46929 False 46930 False 46931 False 46932 False 46933 False 46934 False 46935 False 46936 False 46937 False 46938 False 46939 False 46940 False 46941 False 46942 False 46943 False 46944 False Name: SNOW, Length: 46945, dtype: bool

Now, let's do some sorting.

Let's sort the days to find the top 10 coldest minimum temperatures.

newdf=daily_data.sort(columns='TMIN',ascending=1) newdf.loc[:,['Date','TMIN']]
/projects/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....) if __name__ == '__main__':
Date TMIN
40317 1999-01-05 -25.0
3828 1899-02-09 -25.0
38505 1994-01-19 -25.0
6023 1905-02-13 -25.0
35219 1985-01-20 -24.0
34113 1982-01-10 -23.0
8542 1912-01-07 -22.0
38504 1994-01-18 -21.0
10739 1918-01-12 -21.0
15128 1930-01-18 -21.0
2366 1895-02-08 -21.0
1987 1894-01-25 -21.0
17324 1936-01-23 -21.0
32294 1977-01-17 -20.0
3831 1899-02-12 -20.0
35218 1985-01-19 -20.0
13281 1924-12-28 -20.0
34826 1983-12-24 -20.0
1614 1893-01-17 -19.0
3829 1899-02-10 -19.0
3832 1899-02-13 -18.0
12923 1924-01-05 -18.0
37017 1989-12-23 -18.0
37014 1989-12-20 -18.0
22813 1951-02-02 -18.0
1612 1893-01-15 -18.0
34120 1982-01-17 -18.0
12939 1924-01-21 -17.0
43981 2009-01-16 -17.0
32293 1977-01-16 -17.0
... ... ...
36430 1988-05-15 NaN
36820 1989-06-09 NaN
36877 1989-08-05 NaN
36879 1989-08-07 NaN
36891 1989-08-19 NaN
36904 1989-09-01 NaN
37065 1990-02-09 NaN
37067 1990-02-11 NaN
37143 1990-04-28 NaN
37161 1990-05-16 NaN
37395 1991-01-05 NaN
37579 1991-07-08 NaN
37668 1991-10-05 NaN
38881 1995-01-30 NaN
39663 1997-03-22 NaN
40328 1999-01-16 NaN
40561 1999-09-06 NaN
40717 2000-02-09 NaN
41026 2000-12-14 NaN
41033 2000-12-21 NaN
42430 2004-10-18 NaN
42646 2005-05-22 NaN
42654 2005-05-30 NaN
42961 2006-04-02 NaN
43590 2007-12-22 NaN
43594 2007-12-26 NaN
43595 2007-12-27 NaN
43949 2008-12-15 NaN
46943 2017-02-25 NaN
46944 2017-02-26 NaN

46945 rows × 2 columns

In-class exercise two: Create a new data frame with just the top 10 coldest temperature values and the dates on which they occurred (It's ok if there is a tie, just get the first 10). Write that data frame to a csv file using pd.to_csv().

Download both files to your computer, and open them in Microsoft Excel. Do the times and temperatures match up in the raw data with your result?

In-class exercise three: Compute what fraction of good observations had lows below 32 degrees F before 1-1-1950 with including and after 1-1-1950. We might want to do this to assess whether there is a time trend in the coldest temperatures in this dataset (with possible impacts on agriculture).

Hint! Just as before, we need to convert our column from a string to a time. pandas works well with datetime objects, easily converted with pd.to_datetime.

To define a pandas datetime object to compare against, use pd.datetime().

Here is a start!

daily_data['Date']=pd.to_datetime(daily_data['Date']) daily_data['Date'][0]
Timestamp('1888-08-17 00:00:00')
daily_data['Date'][daily_data['Date'] >= pd.datetime(1950,1,1)]
22416 1950-01-01 22417 1950-01-02 22418 1950-01-03 22419 1950-01-04 22420 1950-01-05 22421 1950-01-06 22422 1950-01-07 22423 1950-01-08 22424 1950-01-09 22425 1950-01-10 22426 1950-01-11 22427 1950-01-12 22428 1950-01-13 22429 1950-01-14 22430 1950-01-15 22431 1950-01-16 22432 1950-01-17 22433 1950-01-18 22434 1950-01-19 22435 1950-01-20 22436 1950-01-21 22437 1950-01-22 22438 1950-01-23 22439 1950-01-24 22440 1950-01-25 22441 1950-01-26 22442 1950-01-27 22443 1950-01-28 22444 1950-01-29 22445 1950-01-30 ... 46915 2017-01-28 46916 2017-01-29 46917 2017-01-30 46918 2017-01-31 46919 2017-02-01 46920 2017-02-02 46921 2017-02-03 46922 2017-02-04 46923 2017-02-05 46924 2017-02-06 46925 2017-02-07 46926 2017-02-08 46927 2017-02-09 46928 2017-02-10 46929 2017-02-11 46930 2017-02-12 46931 2017-02-13 46932 2017-02-14 46933 2017-02-15 46934 2017-02-16 46935 2017-02-17 46936 2017-02-18 46937 2017-02-19 46938 2017-02-20 46939 2017-02-21 46940 2017-02-22 46941 2017-02-23 46942 2017-02-24 46943 2017-02-25 46944 2017-02-26 Name: Date, dtype: datetime64[ns]

Now we can get more complex. Let's say we wanted to calculate the mean temperature of all the January maximum temperatures in the data.

As usual, there are many ways we could do this.

  1. We could loop over all of the values, and test if the observation is in the month. Then, as we have done before, if the test returns True, we can calculate a running sum and a counter.

  2. We could create a new column which contains just the month, then use fancy indexing to find the values we want.

  3. We could let pandas do all the work for us.

Let's work through all of these.

Before we get started, pandas has some really useful tools to work with dates and times. I found some useful code by googling http://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python

df['month'] = df['Date'].dt.month

Adapt the above code to our data frame and try it! Be sure to print some values to the screen to make sure that the code is working.

Ok, let's code up the loop for approach 1:

Trying approach 2: Let's do the fancy indexing

#Remember the two tests #Create boolean test, test it #Apply the fancy index, check it

Approach 3: The pandas way. We can use the groupby function to group data according to values. It saves us a lot of work! It even can do all of the months for us!

daily_data.groupby(['month'])['TMIN'].mean()

From approach 3, print a sentence that says the fraction of below 32 days for each period, using the data you calculated.