Lab 6
Let it Rain!
Source: 120 Years of Australian Rainfall, Bureau of Meteorology
As an 'oasis' on the bottom corner of a desert, the Western Australian southwest region, including Perth, is more at risk from any decline in rainfall than many other aspects of climate change.
The 2019 rainfall map (shown above) paints an alarming picture!
In this lab, we will examine raw data from the Bureau of Meteorology, focussing specifically on the Perth region, and see if there are any trends we can observe for ourselves.
While we have access to a relatively small period of time geologically speaking, its a very important period of time from the perspective of man-made climate change. It will be interesting to see whether, even in such a short period of time, we can observe any trends.
Data Acquisition and Inspection
The Australian Government's Bureau of Meteorology (BOM) provides a wealth of climate data through its Climate Data Online service.
Find the daily rainfall data for Perth.
You should find a number of available weather stations in the Perth area. Notice that when you select a station, a graph is provided that indicates over what timespan data from that station is available, and how complete the data is.
Q: Which station has the longest record of data available?
Which station has a complete record for the longest period?
Download the data for the Midland and Airport stations.
Data is typically accompanied by a 'README' file or similar, explaining what is in the files.
Read the
..Note.txt
file for one of the datasets. Look out in particular for:information on the data fields contained in the files
information on missing data
information on quality of the data
As you read the file, start to think about what you might need to do when inspecting and cleaning the data.
Inspect the data to see if it matches what you expect to see from the Note.
Upload the data to CoCalc, and set up constants to point to the data.
(The files should not be more than 2MB.)
The csv
library
So far we have parsed csv files ourselves. As this is a common task, a library has been provided.
Look up the documentation for the csv module in the latest version of the Python Standard Library documentation.
Import the
csv
package, and use a csvreader
to read the Perth Airport data into a list of lists.
How many lines of data are there?
*Hint: the reader is an iterable, so it can be used with list()
to create a list of lists.
Data Cleaning and Conversion
Print the first 10 lines of the Airport data. What do you see?
Use a list comprehension (one line) to extract all the lines where the daily rainfall field is not empty and starts with a number.
How many lines of data remain? What percentage of the data is this (to 1 decimal place/3 significant figures)?
Use a list comprehension (one line each) to check whether there are any lines in the remaining data that:
have a period over which rain was measured greater than 1
have a flag that is not either 'Y' or 'N'
numpy datetime64
The python ecosystem contains a (confusing) number modules for dealing with dates and times (again a product of the way it evolved), including the modules datetime
, time
and calendar
(we will see another later).
numpy also has its own datetime type, called datetime64
to distinguish it from the datetime module.
numpy's datetime has a fixed-size representation (a 64-bit integer) which allows it to be used efficiently in standard numpy arrays. This means however it has a precision limit, and there is a trade-off (albeit very large) between the 'bigness' of timespan that can be represented, and the 'smallness' of the time interval that can be represented.
To make it more useful, numpy lets you choose the time units it represents, from attoseconds (for particle physicists?) through to years (for geologists?). This is explained in the section Datetime Units which provides a table showing the timespan that can be used for each unit size.
The limits are far more than we will ever need for our time-series data, but understanding why it works this way makes it easy to understand the reason for including date or time units. Since our data is daily, we'll use days ('D'
) as the unit.
To see how datetime64
can be used as an array type, try the following. (Each should only take one line.)
Assign today's date to a variable
today
.Use
dtype
to check that you are using 'Day' as the date units.Use
arange
to create an array of dates from January 1st this year to December 31st.How many days will there be this year?
What day of the year is it today?
How many more days are there this year?
*Hint: Use arithmetic rather than search.
Data types
In order to plot the rainfall data, we would like to have an array of days (dates), and a corresponding array of rainfall for those days.
We have already extracted all the lines of data that have numerical data in the rainfall field. We have determined that there are no periods of measurement greater than 1 (we'll ignore this column), and all the data have a valid quality flag.
Now we want to get out the relevant fields.
Q: What data types would you use?
Pre-allocating arrays
We'll use a loop to extract the data directly into arrays (we won't use an 'accumulator' list). As we saw in the lectures, to do this efficiently, we need to pre-allocate arrays of the right size.
Using numpy's
full
method, define arrays for:days, with type
datetime64[D]
rain, with type
float
flags, with type
boolean
Using
enumerate
, write a loop to inspect each line in your list of cleaned data, and store the correct date, rainfall, and flag in the three arrays.
Putting it all together [Checked answer, 1 mark]
Complete the function raindata(filename)
which:
uses a csv reader to read the data from the file filename into a list of lists
uses a list comprehension to extract the lines in which the rainfall measurement is nonempty and begins with a numeric character
allocates space in arrays for
days
,rain
, andflags
with typesdatetime64[D]
,float
, andboolean
respectivelyassigns the data to the arrays (with boolean True representing the flag 'Y')
returns a triple
(days, rain, flags)
containing the three arrays
For this exercise you don't need to check the period of rainfall or the validity of the flags, you can assume they're OK. (But if you do, that's OK too.)
The tests will assume the data is stored in the same place as in the zip file downloaded from the BOM - that is, in the file: IDCJAC0009_009021_1800/IDCJAC0009_009021_1800_Data.csv
Data Visualisation and Analysis
Do a scatter plot of the amount of rain against days for the Perth Airport data.
Your plot might look something like this:
Can you see any trend in the data?
There is a lot of overlap between the sample points. Once way to see better what is happening is to set the opaqueness (or, conversely, transparency) or alpha value.
Check the documentation for plt.scatter() and set the alpha value to 0.1.
Can you see any trends yet?
Comparing with averages
Plot the data against a line showing the average rainfall received at the weather station.
Can you see any trends?
Create a boolean mask (over the days) that can be used to extract the data between 1950 inclusive and 1960 exclusive.
'Inclusive' means it includes 1950 (the interval is closed on the left), 'exclusive' means it excludes 1960 (the interval is open on the right).
You should use datetime64
objects to set the boundaries.
Use your mask to plot the rainfall for the 1950s, against the average for the 1950s.
As its a bit less crowded, you might choose to use a little higher alpha value.
Your output might look something like this:
Your boss says "I've changed my mind, I want to see the data from 1950 to 1951".
Change your mask to meet the new spec and replot.
You shouldn't need to change anything except the mask - the plotting logic stays the same. Even though we haven't written a function (yet), this shows the power and elegance of reusable programming elements!
Let's go ahead and write a function.
Write a function
get_averages(days, rain, first, last)
that:takes the
days
andrain
arrays and two dates,first
andlast
returns a pair
(selected_days, avs)
whereselected_days
are those days that lie between first (inclusive) and last (exclusive)avs
is an array of values the same size asselected_days
, which are all the same and are equal to the average rainfall over that period
As always, you should check your code on some values to satisfy yourself that it works.
Generate your plot again, but this time use the above method to get the average between 1950 and 1960 and plot that as a line on the graph.
Increase the default width of the average line so its easier to see.
Use
arange
(one line) to generate an array containing all the decade start dates from 1940 to 2020.
Your array might look like this:
array(['1940', '1950', '1960', '1970', '1980', '1990', '2000', '2010', '2020'], dtype='datetime64[Y]')
Using this array as an iterable, call
get_averages
to successively plot each decade's average on your scatter plot.
Can you see any trend yet?
Plot just the decadal averages on their own (without the scatter plot).
Can you see a trend?
Accumulating the decadal means [2 lab marks]
Write a function all_decades (days, rain)
that returns two arrays, one of x-values and one of y-values, that can be used to plot a single line of all the decadal averages as follows:
Here mean_xs
should be dates (datetime64) and mean_ys
should be average rainfall in mms (float).
As usual you may break it down into more than one function if you wish.
Plot the average rainfall line.
Flags
So far we have not taken into account the flags that indicate the data that has been checked and determined valid.
Change your plot of the decadal means so that it plots a green line for all data that has been verified, and an amber (orange) line for all data that has yet to be verified. Include a legend that informs the reader of what the colours mean.
Midland weather station data
We saw from our initial inspection that the Midland data has missing portions.
Plot a line of decadal averages for the Midland data. This time use green for verified data, orange for unverified data, and red for any periods of missing data.
Putting it all together
Finally, plot the Midland and Perth Airport decadal lines on the same plot, using the 'traffic light' (red, orange, green) colour scheme, along with a legend.
What observations can you make?
© Cara MacNish