We'll continue where we left off on Monday, working with our Champaign daily weather observations.
# Data prep again
import numpy as np
import pandas as pd

daily_data = pd.read_csv('daily_wx_data_champaign.csv', sep=',', skiprows=7)
cols = ['TMIN', 'TMAX', 'PRCP', 'SNOW', 'SNWD']
for col in cols:
    # use .loc to avoid chained-indexing assignment warnings
    daily_data.loc[daily_data[col] == 'M', col] = np.nan   # 'M' marks missing values
    daily_data.loc[daily_data[col] == 'T', col] = '0.005'  # 'T' marks trace amounts
    daily_data[col] = daily_data[col].astype(float)
daily_data['Date'] = pd.to_datetime(daily_data['Date'])
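As a quick sanity check that the cleaning worked, peek at the first few rows and the column types:
print(daily_data.head())
print(daily_data.dtypes)  # the five cleaned columns should now be float64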
Remember, we can do statistics on a pandas DataFrame column easily. Let's try printing the mean, median, and standard deviation:
daily_data.describe()
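describe() reports the median as the 50% row. You can also ask for individual statistics directly; for example, for the maximum temperature column:
print(daily_data['TMAX'].mean())    # mean
print(daily_data['TMAX'].median())  # median
print(daily_data['TMAX'].std())     # standard deviation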
Now we can get more complex. Let's say we wanted to calculate the mean of all the January minimum temperatures (TMIN) in the data.
As usual, there are many ways we could do this.
We could:
1. Loop over all of the values and test whether each observation is in the month. Then, as we have done before, if the test returns True, update a running sum and a counter.
2. Create a new column that contains just the month, then use fancy indexing to find the values we want.
3. Let pandas do all the work for us.
Let's work through all of these.
Before we get started: pandas has some really useful tools for working with dates and times. I found some useful code by googling: http://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
df['month'] = df['Date'].dt.month
Adapt the above code to our data frame and try it! Be sure to print some values to the screen to make sure that the code is working.
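One possible solution (calling the new column Month is our choice):
daily_data['Month'] = daily_data['Date'].dt.month
print(daily_data[['Date', 'Month']].head())  # confirm the month matches the date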
Ok, let's code up the loop for approach 1:
# dt.month, dt.year, dt.day, dt.hour all work on a datetime column
daily_data['Month'] = daily_data['Date'].dt.month  # adding a column to daily_data called Month
np.shape(daily_data)  # how many rows and columns we have
nrows, ncols = np.shape(daily_data)
tmin_sum = 0
tmin_num = 0
for i in range(nrows):
    # only count January observations that aren't missing
    if (daily_data['Month'].loc[i] == 1) and not np.isnan(daily_data['TMIN'].loc[i]):
        #print(daily_data['TMIN'].loc[i])
        tmin_sum = daily_data['TMIN'].loc[i] + tmin_sum
        tmin_num = tmin_num + 1
print(tmin_sum, tmin_num, tmin_sum/tmin_num)
Approach 2: let's do the fancy indexing.
#Remember the two tests
#Create boolean test, test it
#Apply the fancy index, check it
daily_data['TMIN'][(daily_data['Month'] == 1) & (np.isnan(daily_data['TMIN']) == False)].mean()
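Note that pandas' mean() skips NaN values by default, so the isnan test is optional here; this shorter version gives the same answer:
daily_data['TMIN'][daily_data['Month'] == 1].mean()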
Approach 3: the pandas way. We can use the groupby function to group data according to values. It saves us a lot of work, and it can even do all of the months for us!
new_df = daily_data.groupby(by='Month').mean(numeric_only=True)  # monthly mean of every numeric column
new_df['TMIN']
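groupby isn't limited to a single statistic; agg() can compute several at once. For example:
daily_data.groupby('Month')['TMIN'].agg(['mean', 'min', 'max'])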
From approach 3, print a sentence that gives the fraction of below-32 days for each month, using the data you calculated.
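Here is a minimal sketch of one way to do it, assuming a "below 32" day is one whose maximum temperature stayed below 32 °F:
# fraction of days per month with TMAX below 32 F
# (NaN TMAX values count as not-below-32 in this comparison)
frac_below32 = daily_data.groupby('Month')['TMAX'].apply(lambda x: (x < 32).mean())
for month, frac in frac_below32.items():
    print('In month {}, {:.1%} of days stayed below 32 F.'.format(month, frac))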
pandas operations

Sometimes data doesn't quite come in a nice file format that is easily structured. In the homework, there is an example of such files.
We may need to do some minor surgery on data to get it into a format that is easily used. Let's take a look at one of the homework files, censo.data. Open it in a text editor and examine the format.
# assumes censo.data is whitespace-delimited, one row per year: the year, then 12 monthly values
data = pd.read_csv('censo.data', sep=r'\s+', header=None)
newdata = pd.DataFrame()  # empty dataframe
newdata['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')  # MS = start of each month
newdata['ENSO'] = data.loc[:, 1:].stack().values
newdata
Let's create a new dataframe that will contain our reorganized data. We can use pd.DataFrame() to do this.
Now, we can create a column called Date that will contain the time range of the data. One way to do this is with the pd.date_range() function. It's easy to use: just give it a start, end, and frequency. Frequencies are given by codes in the following table: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.
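For example, the 'MS' code generates the first day of each month:
pd.date_range(start='1948-01-01', periods=4, freq='MS')
# DatetimeIndex(['1948-01-01', '1948-02-01', '1948-03-01', '1948-04-01'], dtype='datetime64[ns]', freq='MS')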
Now we need to change our data from a year x month array into a one-dimensional list of values, one per month. We can use .loc to select the second through the last columns of our original data frame, then use the stack() method to turn the 2-D array into a 1-D array. Grab the numbers with the values attribute, and save them in our new data frame as a column called ENSO.
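A quick spot check that the reshaping worked, assuming the first row of censo.data is the earliest year:
print(data.iloc[0, 1])          # first monthly value straight from the raw table
print(newdata['ENSO'].iloc[0])  # the same value in the reshaped frame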
Now let's load in another file: data from the Pacific Decadal Oscillation Index (PDO), in the file pdo.data.
Go through the same procedure as above, making sure to check the results with the original file.
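Here is a minimal sketch, assuming pdo.data has the same year-by-12-months layout and covers the same period (adjust the start and end dates to match the actual file). The merge cells below expect the two frames to be named newdf and newdf2:
newdf = newdata  # the ENSO frame we built above
data2 = pd.read_csv('pdo.data', sep=r'\s+', header=None)
newdf2 = pd.DataFrame()
newdf2['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')
newdf2['PDO'] = data2.loc[:, 1:].stack().values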
newdf_all = pd.merge(newdf, newdf2, how='inner', on='Date')
# with a third index frame (say, newdf3), you could chain another merge:
# newdf_all2 = pd.merge(newdf_all, newdf3, how='inner', on='Date')
newdf_all
Now, we can combine the two data frames into one using the pd.merge() command.
For example:
alldata = pd.merge(newdf, newdf2, how='inner', on='Date')
If you want to visualize the data and take a sneak peek, run the following code (make sure your DataFrame is called the correct name!):
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(alldata['ENSO'], alldata['PDO'], s=5, c=alldata['Date'].dt.year)
plt.xlabel('Normalized ENSO index')
plt.ylabel('Normalized PDO index')
plt.colorbar()