We'll continue where we left off on Monday, working with our Champaign daily weather observations.
# Data prep again
import numpy as np
import pandas as pd

daily_data = pd.read_csv('daily_wx_data_champaign.csv', sep=',', skiprows=7)
cols = ['TMIN', 'TMAX', 'PRCP', 'SNOW', 'SNWD']
for col in cols:
    # use .loc to avoid chained-indexing assignment warnings
    daily_data.loc[daily_data[col] == 'M', col] = np.nan   # 'M' marks missing values
    daily_data.loc[daily_data[col] == 'T', col] = '0.005'  # 'T' marks trace amounts
    daily_data[col] = daily_data[col].astype(float)
daily_data['Date'] = pd.to_datetime(daily_data['Date'])
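As a quick sanity check that the cleaning worked, peek at the first few rows and the column types:
print(daily_data.head())
print(daily_data.dtypes)  # the five cleaned columns should now be float64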
Remember, we can do statistics on a pandas DataFrame column easily. Let's try printing the mean, median, and standard deviation:
daily_data.describe()
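describe() reports the median as the 50% row. You can also ask for individual statistics directly; for example, for the maximum temperature column:
print(daily_data['TMAX'].mean())    # mean
print(daily_data['TMAX'].median())  # median
print(daily_data['TMAX'].std())     # standard deviation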
Now we can get more complex. Let's say we wanted to calculate the mean of all the January minimum temperatures (TMIN) in the data.
As usual, there are many ways we could do this.
We could:
1. Loop over all of the values and test whether each observation is in the month. Then, as we have done before, if the test returns True, update a running sum and a counter.
2. Create a new column that contains just the month, then use fancy indexing to find the values we want.
3. Let pandas do all the work for us.
Let's work through all of these.
Before we get started: pandas has some really useful tools for working with dates and times. I found some useful code by googling: http://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
df['month'] = df['Date'].dt.month
Adapt the above code to our data frame and try it! Be sure to print some values to the screen to make sure that the code is working.
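One possible solution (calling the new column Month is our choice):
daily_data['Month'] = daily_data['Date'].dt.month
print(daily_data[['Date', 'Month']].head())  # confirm the month matches the date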
Ok, let's code up the loop for approach 1:
# dt.month, dt.year, dt.day, dt.hour all work on a datetime column
daily_data['Month'] = daily_data['Date'].dt.month  # adding a column to daily_data called Month
np.shape(daily_data)  # how many rows and columns we have
nrows, ncols = np.shape(daily_data)
tmin_sum = 0
tmin_num = 0
for i in range(nrows):
    # only count January observations that aren't missing
    if (daily_data['Month'].loc[i] == 1) and not np.isnan(daily_data['TMIN'].loc[i]):
        #print(daily_data['TMIN'].loc[i])
        tmin_sum = daily_data['TMIN'].loc[i] + tmin_sum
        tmin_num = tmin_num + 1
print(tmin_sum, tmin_num, tmin_sum/tmin_num)
Approach 2: let's do the fancy indexing.
#Remember the two tests
#Create boolean test, test it
#Apply the fancy index, check it
daily_data['TMIN'][(daily_data['Month'] == 1) & (np.isnan(daily_data['TMIN']) == False)].mean()
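Note that pandas' mean() skips NaN values by default, so the isnan test is optional here; this shorter version gives the same answer:
daily_data['TMIN'][daily_data['Month'] == 1].mean()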
Approach 3: the pandas way. We can use the groupby function to group data according to values. It saves us a lot of work, and it can even do all of the months for us!
new_df = daily_data.groupby(by='Month').mean(numeric_only=True)  # monthly mean of every numeric column
new_df['TMIN']
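groupby isn't limited to a single statistic; agg() can compute several at once. For example:
daily_data.groupby('Month')['TMIN'].agg(['mean', 'min', 'max'])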
From approach 3, print a sentence that gives the fraction of below-32 days for each month, using the data you calculated.
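Here is a minimal sketch of one way to do it, assuming a "below 32" day is one whose maximum temperature stayed below 32 °F:
# fraction of days per month with TMAX below 32 F
# (NaN TMAX values count as not-below-32 in this comparison)
frac_below32 = daily_data.groupby('Month')['TMAX'].apply(lambda x: (x < 32).mean())
for month, frac in frac_below32.items():
    print('In month {}, {:.1%} of days stayed below 32 F.'.format(month, frac))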
pandas operations

Sometimes data doesn't quite come in a nice file format that is easily structured. In the homework, there is an example of such files.
We may need to do some minor surgery on data to get it into a format that is easily used. Let's take a look at one of the homework files, censo.data. Open it in a text editor and examine the format.
# assumes censo.data is whitespace-delimited, one row per year: the year, then 12 monthly values
data = pd.read_csv('censo.data', sep=r'\s+', header=None)
newdata = pd.DataFrame()  # empty dataframe
newdata['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')  # MS = start of each month
newdata['ENSO'] = data.loc[:, 1:].stack().values
newdata
Let's create a new dataframe that will contain our reorganized data. We can use pd.DataFrame() to do this.
Now, we can create a column called Date that will contain the time range of the data. One way to do this is with the pd.date_range() function. It's easy to use: just give it a start, end, and frequency. Frequencies are given by codes in the following table: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.
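For example, the 'MS' code generates the first day of each month:
pd.date_range(start='1948-01-01', periods=4, freq='MS')
# DatetimeIndex(['1948-01-01', '1948-02-01', '1948-03-01', '1948-04-01'], dtype='datetime64[ns]', freq='MS')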
Now we need to change our data from a year x month array into a one-dimensional list of values, one per month. We can use .loc to select the second through the last columns of our original data frame, then use the stack() method to turn the 2-D array into a 1-D array. Grab the numbers with the values attribute, and save them in our new data frame as a column called ENSO.
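A quick spot check that the reshaping worked, assuming the first row of censo.data is the earliest year:
print(data.iloc[0, 1])          # first monthly value straight from the raw table
print(newdata['ENSO'].iloc[0])  # the same value in the reshaped frame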
Now let's load in another file: data from the Pacific Decadal Oscillation Index (PDO), in the file pdo.data.
Go through the same procedure as above, making sure to check the results with the original file.
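Here is a minimal sketch, assuming pdo.data has the same year-by-12-months layout and covers the same period (adjust the start and end dates to match the actual file). The merge cells below expect the two frames to be named newdf and newdf2:
newdf = newdata  # the ENSO frame we built above
data2 = pd.read_csv('pdo.data', sep=r'\s+', header=None)
newdf2 = pd.DataFrame()
newdf2['Date'] = pd.date_range(start='1948-01-01', end='2016-12-01', freq='MS')
newdf2['PDO'] = data2.loc[:, 1:].stack().values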
newdf_all = pd.merge(newdf, newdf2, how='inner', on='Date')
# with a third index frame (say, newdf3), you could chain another merge:
# newdf_all2 = pd.merge(newdf_all, newdf3, how='inner', on='Date')
newdf_all
Now, we can combine the two data frames into one using the pd.merge() command.
For example:
alldata = pd.merge(newdf, newdf2, how='inner', on='Date')
If you want to visualize the data and take a sneak peek, run the following code (make sure your DataFrame is called the correct name!):
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(alldata['ENSO'], alldata['PDO'], s=5, c=alldata['Date'].dt.year)
plt.xlabel('Normalized ENSO index')
plt.ylabel('Normalized PDO index')
plt.colorbar()