ATMS 305: Mar 1 in-class exercises

We'll continue where we left off working with our Champaign daily weather observations from Monday.

In [78]:
# Data prep again

import numpy as np
import pandas as pd

daily_data = pd.read_csv('daily_wx_data_champaign.csv',sep=',',skiprows=7)

cols = ['TMIN', 'TMAX', 'PRCP', 'SNOW', 'SNWD']

for col in cols:
    daily_data.loc[daily_data[col] == 'M', col] = None     # 'M' marks missing data
    daily_data.loc[daily_data[col] == 'T', col] = '0.005'  # 'T' marks a trace amount
    daily_data[col] = daily_data[col].astype('float')

daily_data['Date'] = pd.to_datetime(daily_data['Date'])

daily_data['TMIN'].astype(float)  # sanity check: the column converts cleanly to float
Out[78]:
0        64.0
1        57.0
2        51.0
3        62.0
4        60.0
5        50.0
6        37.0
7        46.0
8        48.0
9        64.0
10       63.0
11       54.0
12       45.0
13       50.0
14       53.0
15       44.0
16       39.0
17       42.0
18       47.0
19       48.0
20       53.0
21       52.0
22       63.0
23       54.0
24       45.0
25       54.0
26       52.0
27       36.0
28       33.0
29       49.0
         ... 
46915    22.0
46916    22.0
46917    11.0
46918    34.0
46919    26.0
46920    18.0
46921    15.0
46922    11.0
46923    26.0
46924    25.0
46925    32.0
46926    16.0
46927     9.0
46928    13.0
46929    39.0
46930    33.0
46931    24.0
46932    31.0
46933    27.0
46934    24.0
46935    37.0
46936    40.0
46937    44.0
46938    43.0
46939    48.0
46940    43.0
46941    44.0
46942    37.0
46943     NaN
46944     NaN
Name: TMIN, dtype: float64

Remember, we can easily do statistics on a pandas DataFrame column. Let's try printing the mean, median, and standard deviation:

In [13]:
daily_data.describe()
Out[13]:
               PRCP          SNOW          SNWD          TMAX          TMIN  Unnamed: 6
count  46757.000000  41687.000000  38593.000000  46863.000000  46861.000000         0.0
mean       0.103152      0.062820      0.267036     61.756439     42.045240         NaN
std        0.294664      0.456103      1.295213     21.303182     18.741104         NaN
min        0.000000      0.000000      0.000000    -12.000000    -25.000000         NaN
25%        0.000000      0.000000      0.000000     44.000000     29.000000         NaN
50%        0.000000      0.000000      0.000000     65.000000     43.000000         NaN
75%        0.030000      0.000000      0.000000     80.000000     58.000000         NaN
max        5.320000     14.000000     19.000000    109.000000     80.000000         NaN
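
describe() gives several of these at once, but you can also ask for each statistic directly. For example, for the minimum temperature column:

print('TMIN mean:  ', daily_data['TMIN'].mean())
print('TMIN median:', daily_data['TMIN'].median())
print('TMIN std:   ', daily_data['TMIN'].std())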

Now we can get more complex. Let's say we wanted to calculate the mean of all the January minimum temperatures in the data.

As usual, there are many ways we could do this.

  1. We could loop over all of the values and test whether the observation falls in the month. Then, as we have done before, if the test returns True, we update a running sum and a counter.

  2. We could create a new column which contains just the month, then use fancy indexing to find the values we want.

  3. We could let pandas do all the work for us.

Let's work through all of these.

Before we get started: pandas has some really useful tools for working with dates and times. I found some useful code by googling: http://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python

df['month'] = df['Date'].dt.month

Adapt the above code to our data frame and try it! Be sure to print some values to the screen to make sure that the code is working.
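
Adapted to our frame, that looks like the sketch below (Month is just the column name we chose; the print is there to spot-check the result):

daily_data['Month'] = daily_data['Date'].dt.month
print(daily_data[['Date', 'Month']].head())  # check that the months match the dates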

Ok, let's code up the loop for approach 1:

In [43]:
#dt.month, dt.year, dt.day, dt.hour

daily_data['Month'] = daily_data['Date'].dt.month  # add a column to daily_data called Month
nrows, ncols = np.shape(daily_data)  # how many rows and columns we have

tmin_sum=0
tmin_num=0


for i in range(1000):  # just the first 1000 rows, to keep the in-class demo quick
    # count the day if it is in January and TMIN is not missing
    if (daily_data['Month'].loc[i] == 1) and (np.isnan(daily_data['TMIN'].loc[i]) == False):
        tmin_sum = daily_data['TMIN'].loc[i] + tmin_sum
        tmin_num = tmin_num + 1

print(tmin_sum, tmin_num, tmin_sum/tmin_num)
2062.0 93 22.1720430108

Now let's try approach 2, the fancy indexing:

In [53]:
# Remember the two steps:
# 1. Create the boolean test, and check it
# 2. Apply the fancy index, and check it
# Note: & binds more tightly than ==, so each comparison needs its own parentheses
daily_data['TMIN'][(daily_data['Month'] == 1) & (np.isnan(daily_data['TMIN']) == False)].mean()
Out[53]:
18.125877632898696

Approach 3: the pandas way. We can use the groupby function to group the data according to the values in a column. It saves us a lot of work, and it can even do all of the months for us!

In [83]:
new_df = daily_data.groupby(by='Month').mean()  # group the rows by month, then average every column
new_df['TMIN']  # mean minimum temperature for each of the 12 months

From approach 3, print a sentence that gives the fraction of days below 32°F in each month, using the data you calculated.

In [ ]:
 
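
Here is one possible sketch, building on the groupby idea from approach 3 (the names follow the daily_data frame built above; dropping the missing TMIN values first keeps them out of the denominator):

below_32 = daily_data.dropna(subset=['TMIN']).groupby('Month')['TMIN'].apply(lambda t: (t < 32.0).mean())
for month in below_32.index:
    print('In month {}, {:.1%} of days had a minimum temperature below 32°F'.format(month, below_32[month]))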

More complex pandas operations

Sometimes data doesn't come in a nice file format that is easily structured. The homework has examples of such files.

We may need to do some minor surgery on the data to get it into a format that is easily used. Let's take a look at one of the homework files, censo.data. Open it in a text editor and examine the format.

In [70]:
# censo.data is a year-by-month table: the year in the first column, then 12 monthly values.
# (The read_csv arguments here are assumptions -- adjust them to the file's actual header/footer.)
data = pd.read_csv('censo.data', delim_whitespace=True, header=None, skiprows=1)

newdata = pd.DataFrame()  # empty DataFrame to hold the reorganized data
newdata['Date'] = pd.DatetimeIndex(start=pd.datetime(1948,1,1), end=pd.datetime(2016,12,1), freq='MS')  # MS = start of each month
newdata['ENSO'] = data.loc[:, 1:].stack().values  # flatten the year x month block into one long column
newdata
Out[70]:
Date ENSO
0 1948-01-01 0.26
1 1948-02-01 0.43
2 1948-03-01 0.72
3 1948-04-01 0.21
4 1948-05-01 0.24
5 1948-06-01 0.41
6 1948-07-01 0.08
7 1948-08-01 0.25
8 1948-09-01 0.45
9 1948-10-01 -0.64
10 1948-11-01 -0.21
11 1948-12-01 0.58
12 1949-01-01 0.18
13 1949-02-01 0.13
14 1949-03-01 -0.55
15 1949-04-01 0.15
16 1949-05-01 0.43
17 1949-06-01 0.00
18 1949-07-01 -0.06
19 1949-08-01 0.08
20 1949-09-01 -0.42
21 1949-10-01 -0.59
22 1949-11-01 -0.29
23 1949-12-01 -0.91
24 1950-01-01 -0.74
25 1950-02-01 -1.70
26 1950-03-01 -1.43
27 1950-04-01 -1.29
28 1950-05-01 -1.22
29 1950-06-01 -1.69
... ... ...
798 2014-07-01 0.42
799 2014-08-01 0.70
800 2014-09-01 0.97
801 2014-10-01 0.76
802 2014-11-01 1.24
803 2014-12-01 0.87
804 2015-01-01 0.94
805 2015-02-01 0.14
806 2015-03-01 0.91
807 2015-04-01 0.83
808 2015-05-01 1.69
809 2015-06-01 1.53
810 2015-07-01 2.34
811 2015-08-01 2.60
812 2015-09-01 2.92
813 2015-10-01 2.77
814 2015-11-01 1.99
815 2015-12-01 1.91
816 2016-01-01 2.96
817 2016-02-01 3.05
818 2016-03-01 1.23
819 2016-04-01 1.78
820 2016-05-01 0.07
821 2016-06-01 -0.40
822 2016-07-01 -0.56
823 2016-08-01 -0.93
824 2016-09-01 -1.25
825 2016-10-01 -0.25
826 2016-11-01 -0.14
827 2016-12-01 -0.39

828 rows × 2 columns

Let's create a new dataframe that will contain our reorganized data. We can use pd.DataFrame() to do this.

In [ ]:
 
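
If you want a starting point, it is just one line, mirroring the In [70] cell above:

newdata = pd.DataFrame()  # an empty DataFrame that we will fill column by column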

Now we can create a column called Date that will contain the time range of the data. One way to do this is the pd.DatetimeIndex() function. It's easy to use: just give it a start, an end, and a frequency. Frequencies are given by the codes in the following table: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.

In [ ]:
 
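
A minimal sketch, equivalent to the pd.DatetimeIndex call in In [70] (pd.date_range is the more common spelling of the same thing):

newdata['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')  # 'MS' = month start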

Now we need to change our data from a year x month array into a one-dimensional list of values, one per month. We can use .loc to select the second through the last columns of our original data frame, then use the stack() command to turn the 2-D array into a 1-D one. Grab the numbers with the values attribute, and save them in our new data frame as a column called ENSO.

In [ ]:
 
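
A sketch of the reshaping step, assuming data is the raw year-by-month frame read from censo.data (year in column 0, monthly values in columns 1 through 12):

newdata['ENSO'] = data.loc[:, 1:].stack().values  # stack() walks each row left to right, one value per month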

Now let's load in another file: data for the Pacific Decadal Oscillation (PDO) index, in the file pdo.data. Go through the same procedure as above, making sure to check the results against the original file.

In [ ]:
 
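
If you get stuck, here is a sketch that follows the censo.data recipe; the read_csv arguments and the start and end dates are assumptions, so match them to pdo.data's actual layout:

pdo_raw = pd.read_csv('pdo.data', delim_whitespace=True, header=None, skiprows=1)  # adjust to the real header/footer
newdf2 = pd.DataFrame()
newdf2['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')  # match the file's first and last year
newdf2['PDO'] = pdo_raw.loc[:, 1:].stack().values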

Now, we can combine the two data frames into one using the pd.merge() command. For example:

alldata = pd.merge(newdf,newdf2,how='inner',on='Date')
In [ ]:
newdf_all = pd.merge(newdf, newdf2, how='inner', on='Date')
newdf_all2 = pd.merge(newdf_all, newdf3, how='inner', on='Date')  # chain a second merge if you have a third frame
newdf_all2

If you want to visualize the data and take a sneak peek, run the following code (make sure your DataFrame is called the correct name!).

In [73]:
%pylab inline
plt.figure(figsize=(8,6))
# Plot from the merged frame (here called alldata, as in the pd.merge example above)
plt.scatter(alldata['ENSO'], alldata['PDO'], 5, alldata['Date'].dt.year)
plt.xlabel('Normalized ENSO index')
plt.ylabel('Normalized PDO index')
plt.colorbar()
Populating the interactive namespace from numpy and matplotlib
In [ ]: