Practical examples using numpy
and Pandas
This week we will focus on using practical examples using numpy
and Pandas
.
Let's do some analysis of daily weather records from Champaign-Urbana. The records go back to 1888!
Let's do some basic stats. Calculate the mean of the minimum temperature values,
Oops, we need to do some cleanup, because the column is being handled as a string because it contains M
values mixed in with the numbers. We can use fancy indexing to help. First, create a boolean condition that pattern matches the M
values:
Then use the boolean array as a fancy index to do the job.
Don't forget to store the data back in memory!
In-class exercise 1: Use a loop to replace all of the columns as floats, ignoring M
values.
Date | PRCP | SNOW | SNWD | TMAX | TMIN | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | 1888-08-17 | 0.000 | NaN | NaN | 84.0 | 64.0 | NaN |
1 | 1888-08-18 | 0.000 | NaN | NaN | 81.0 | 57.0 | NaN |
2 | 1888-08-19 | 0.000 | NaN | NaN | 85.0 | 51.0 | NaN |
3 | 1888-08-20 | 0.005 | NaN | NaN | 72.0 | 62.0 | NaN |
4 | 1888-08-21 | 0.100 | NaN | NaN | 81.0 | 60.0 | NaN |
5 | 1888-08-22 | 0.000 | NaN | NaN | 71.0 | 50.0 | NaN |
6 | 1888-08-23 | 0.000 | NaN | NaN | 84.0 | 37.0 | NaN |
7 | 1888-08-24 | 0.000 | NaN | NaN | 84.0 | 46.0 | NaN |
8 | 1888-08-25 | 0.000 | NaN | NaN | 88.0 | 48.0 | NaN |
9 | 1888-08-26 | 0.000 | NaN | NaN | 87.0 | 64.0 | NaN |
10 | 1888-08-27 | 0.000 | NaN | NaN | 82.0 | 63.0 | NaN |
11 | 1888-08-28 | 0.000 | NaN | NaN | 80.0 | 54.0 | NaN |
12 | 1888-08-29 | 0.000 | NaN | NaN | 84.0 | 45.0 | NaN |
13 | 1888-08-30 | 0.000 | NaN | NaN | 85.0 | 50.0 | NaN |
14 | 1888-08-31 | 0.000 | NaN | NaN | 73.0 | 53.0 | NaN |
15 | 1888-09-01 | 0.000 | NaN | NaN | 72.0 | 44.0 | NaN |
16 | 1888-09-02 | 0.000 | NaN | NaN | 79.0 | 39.0 | NaN |
17 | 1888-09-03 | 0.000 | NaN | NaN | 82.0 | 42.0 | NaN |
18 | 1888-09-04 | 0.000 | NaN | NaN | 83.0 | 47.0 | NaN |
19 | 1888-09-05 | 0.000 | NaN | NaN | 76.0 | 48.0 | NaN |
20 | 1888-09-06 | 0.000 | NaN | NaN | 78.0 | 53.0 | NaN |
21 | 1888-09-07 | 0.000 | NaN | NaN | 86.0 | 52.0 | NaN |
22 | 1888-09-08 | 0.030 | NaN | NaN | 89.0 | 63.0 | NaN |
23 | 1888-09-09 | 0.000 | NaN | NaN | 81.0 | 54.0 | NaN |
24 | 1888-09-10 | 0.000 | NaN | NaN | 84.0 | 45.0 | NaN |
25 | 1888-09-11 | 0.000 | NaN | NaN | 89.0 | 54.0 | NaN |
26 | 1888-09-12 | 0.000 | NaN | NaN | 76.0 | 52.0 | NaN |
27 | 1888-09-13 | 0.000 | NaN | NaN | 69.0 | 36.0 | NaN |
28 | 1888-09-14 | 0.000 | NaN | NaN | 76.0 | 33.0 | NaN |
29 | 1888-09-15 | 0.550 | NaN | NaN | 69.0 | 49.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
46915 | 2017-01-28 | 0.005 | 0.005 | 0.000 | 37.0 | 22.0 | NaN |
46916 | 2017-01-29 | 0.005 | 0.005 | 0.005 | 29.0 | 22.0 | NaN |
46917 | 2017-01-30 | 0.090 | 0.800 | 1.000 | 42.0 | 11.0 | NaN |
46918 | 2017-01-31 | 0.000 | 0.000 | 0.005 | 44.0 | 34.0 | NaN |
46919 | 2017-02-01 | 0.000 | 0.000 | 0.000 | 42.0 | 26.0 | NaN |
46920 | 2017-02-02 | 0.000 | 0.000 | 0.000 | 30.0 | 18.0 | NaN |
46921 | 2017-02-03 | 0.000 | 0.000 | 0.000 | 31.0 | 15.0 | NaN |
46922 | 2017-02-04 | 0.000 | 0.000 | 0.000 | 37.0 | 11.0 | NaN |
46923 | 2017-02-05 | 0.000 | 0.000 | 0.000 | 46.0 | 26.0 | NaN |
46924 | 2017-02-06 | 0.000 | 0.000 | 0.000 | 47.0 | 25.0 | NaN |
46925 | 2017-02-07 | 0.150 | 0.000 | 0.000 | 62.0 | 32.0 | NaN |
46926 | 2017-02-08 | 0.000 | 0.000 | 0.000 | 33.0 | 16.0 | NaN |
46927 | 2017-02-09 | 0.130 | 1.500 | 2.000 | 20.0 | 9.0 | NaN |
46928 | 2017-02-10 | 0.000 | 0.000 | 2.000 | 50.0 | 13.0 | NaN |
46929 | 2017-02-11 | 0.000 | 0.000 | 0.005 | 67.0 | 39.0 | NaN |
46930 | 2017-02-12 | 0.000 | 0.000 | 0.000 | 52.0 | 33.0 | NaN |
46931 | 2017-02-13 | 0.000 | 0.000 | 0.000 | 46.0 | 24.0 | NaN |
46932 | 2017-02-14 | 0.000 | 0.000 | 0.000 | 53.0 | 31.0 | NaN |
46933 | 2017-02-15 | 0.000 | 0.000 | 0.000 | 43.0 | 27.0 | NaN |
46934 | 2017-02-16 | 0.000 | 0.000 | 0.000 | 52.0 | 24.0 | NaN |
46935 | 2017-02-17 | 0.000 | 0.000 | 0.000 | 70.0 | 37.0 | NaN |
46936 | 2017-02-18 | 0.000 | 0.000 | 0.000 | 70.0 | 40.0 | NaN |
46937 | 2017-02-19 | 0.000 | 0.000 | 0.000 | 68.0 | 44.0 | NaN |
46938 | 2017-02-20 | 0.000 | 0.000 | 0.000 | 71.0 | 43.0 | NaN |
46939 | 2017-02-21 | 0.020 | 0.000 | 0.000 | 62.0 | 48.0 | NaN |
46940 | 2017-02-22 | 0.005 | 0.000 | 0.000 | 70.0 | 43.0 | NaN |
46941 | 2017-02-23 | 0.000 | 0.000 | 0.000 | 60.0 | 44.0 | NaN |
46942 | 2017-02-24 | 0.005 | 0.000 | 0.000 | 68.0 | 37.0 | NaN |
46943 | 2017-02-25 | 0.005 | 0.000 | 0.000 | NaN | NaN | NaN |
46944 | 2017-02-26 | NaN | NaN | NaN | NaN | NaN | NaN |
46945 rows × 7 columns
Now, let's do some sorting.
Let's sort the days to find the top 10 coldest minimum temperatures.
Date | TMIN | |
---|---|---|
40317 | 1999-01-05 | -25.0 |
3828 | 1899-02-09 | -25.0 |
38505 | 1994-01-19 | -25.0 |
6023 | 1905-02-13 | -25.0 |
35219 | 1985-01-20 | -24.0 |
34113 | 1982-01-10 | -23.0 |
8542 | 1912-01-07 | -22.0 |
38504 | 1994-01-18 | -21.0 |
10739 | 1918-01-12 | -21.0 |
15128 | 1930-01-18 | -21.0 |
2366 | 1895-02-08 | -21.0 |
1987 | 1894-01-25 | -21.0 |
17324 | 1936-01-23 | -21.0 |
32294 | 1977-01-17 | -20.0 |
3831 | 1899-02-12 | -20.0 |
35218 | 1985-01-19 | -20.0 |
13281 | 1924-12-28 | -20.0 |
34826 | 1983-12-24 | -20.0 |
1614 | 1893-01-17 | -19.0 |
3829 | 1899-02-10 | -19.0 |
3832 | 1899-02-13 | -18.0 |
12923 | 1924-01-05 | -18.0 |
37017 | 1989-12-23 | -18.0 |
37014 | 1989-12-20 | -18.0 |
22813 | 1951-02-02 | -18.0 |
1612 | 1893-01-15 | -18.0 |
34120 | 1982-01-17 | -18.0 |
12939 | 1924-01-21 | -17.0 |
43981 | 2009-01-16 | -17.0 |
32293 | 1977-01-16 | -17.0 |
... | ... | ... |
36430 | 1988-05-15 | NaN |
36820 | 1989-06-09 | NaN |
36877 | 1989-08-05 | NaN |
36879 | 1989-08-07 | NaN |
36891 | 1989-08-19 | NaN |
36904 | 1989-09-01 | NaN |
37065 | 1990-02-09 | NaN |
37067 | 1990-02-11 | NaN |
37143 | 1990-04-28 | NaN |
37161 | 1990-05-16 | NaN |
37395 | 1991-01-05 | NaN |
37579 | 1991-07-08 | NaN |
37668 | 1991-10-05 | NaN |
38881 | 1995-01-30 | NaN |
39663 | 1997-03-22 | NaN |
40328 | 1999-01-16 | NaN |
40561 | 1999-09-06 | NaN |
40717 | 2000-02-09 | NaN |
41026 | 2000-12-14 | NaN |
41033 | 2000-12-21 | NaN |
42430 | 2004-10-18 | NaN |
42646 | 2005-05-22 | NaN |
42654 | 2005-05-30 | NaN |
42961 | 2006-04-02 | NaN |
43590 | 2007-12-22 | NaN |
43594 | 2007-12-26 | NaN |
43595 | 2007-12-27 | NaN |
43949 | 2008-12-15 | NaN |
46943 | 2017-02-25 | NaN |
46944 | 2017-02-26 | NaN |
46945 rows × 2 columns
In-class exercise two: Create a new data frame with just the top 10 coldest temperature values and the dates on which they occurred (It's ok if there is a tie, just get the first 10). Write that data frame to a csv file using pd.to_csv()
.
Download both files to your computer, and open them in Microsoft Excel. Do the times and temperatures match up in the raw data with your result?
In-class exercise three: Compute what fraction of good observations had lows below 32 degrees F before 1-1-1950 with including and after 1-1-1950. We might want to do this to assess whether there is a time trend in the coldest temperatures in this dataset (with possible impacts on agriculture).
Hint! Just as before, we need to convert our column from a string to a time. pandas
works well with datetime
objects, easily converted with pd.to_datetime
.
To define a pandas
datetime object to compare against, use pd.datetime()
.
Here is a start!
Now we can get more complex. Let's say we wanted to calculate the mean temperature of all the January maximum temperatures in the data.
As usual, there are many ways we could do this.
We could loop over all of the values, and test if the observation is in the month. Then, as we have done before, if the test returns
True
, we can calculate a running sum and a counter.We could create a new column which contains just the month, then use fancy indexing to find the values we want.
We could let
pandas
do all the work for us.
Let's work through all of these.
Before we get started, pandas
has some really useful tools to work with dates and times. I found some useful code by googling http://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
Adapt the above code to our data frame and try it! Be sure to print some values to the screen to make sure that the code is working.
Ok, let's code up the loop for approach 1:
Trying approach 2: Let's do the fancy indexing
Approach 3: The pandas
way. We can use the groupby
function to group data according to values. It saves us a lot of work! It even can do all of the months for us!
From approach 3, print a sentence that says the fraction of below 32 days for each period, using the data you calculated.