Lab 7
Feel the Heat!
Source: 110 Years of Australian Temperatures, Bureau of Meteorology
We might love a sunburnt country, but Dorothea Mackellar in 1908 couldn't have envisioned quite how sunburnt it's getting!
The 2019 temperature anomalies map (shown above) paints another alarming picture!
In this lab, we will again examine raw data from the Bureau of Meteorology. This time we will focus on maximum temperatures for Western Australian towns, from the north to the south.
Working with 2-Dimensional Arrays
This lab assignment provides a lot of practice working with 2-d arrays. It introduces some important new concepts.
Because of this, and the length of the lab, the due date will be a little later than usual: 10am on Tuesday 29th September.
Because the lab involves new concepts, it may take a few sessions to work through, so it's important not to leave it until close to the due date.
Data Acquisition and Inspection
Once again we will source our data from the Australian Government's Bureau of Meteorology (BOM) Climate Data Online service.
Find the monthly average maximum temperature data.
Find and download the historical data for:
Perth Airport
Geraldton Town
Albany
Broome Airport
Upload the data to your CoCalc project. You should upload the whole folder for each town; each folder should be no more than about 60KB.
Read one of the Note.txt files so that you can see what is in each file, and the structure.
Inspect one of each type of data file to see that they conform to your expectations from reading the Note.
Set up
We will set up our constants as follows.
In this lab we will use the second kind of file discussed in Note.txt (12 months per line).
We will use a dictionary to map town names to the station codes. This will allow us to easily find the data for each town.
For simplicity we will only use one product (monthly maximum temperatures), however you can see that we could easily generalise this to other data sets (products) in a similar way to the stations.
File paths
Let's start with a small utility function.
Using str.format(), write a function temps_file(town) that returns the path to the file containing the twelve-monthly temperature table for town, where town is the town name (Albany, Perth, etc).
For example:
You can find out about string formatting using format() at: https://docs.python.org/3.8/library/string.html#format-string-syntax. If you are not familiar with specification languages (you are not expected to be), it is better to look at the examples for this one.
(You should use the "new" {} formatting, not the "old" % formatting.)
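As a minimal sketch of the "new" {} formatting with named fields: the template, folder name, prefix and code below are invented for illustration only, and are not the lab's constants or the BOM's real file layout.

```python
# Hypothetical path layout, used only to illustrate str.format().
# The real folder and file names come from your own downloads.
TEMPLATE = "{folder}/{prefix}_{code}.csv"


def example_path(code, folder="data", prefix="temps"):
    """Return a file path built from named replacement fields."""
    return TEMPLATE.format(folder=folder, prefix=prefix, code=code)


# Positional fields are filled in the order the arguments are given.
joined = "{}_{}".format("temps", "000123")

print(example_path("000123"))
print(joined)
```

Named fields tend to be easier to read than positional ones when the same value appears more than once in the template.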
np.genfromtxt
While csv is particularly targeted at reading csv (like) files, a range of other libraries include their own bespoke file/buffer/string parsers.
For example, numpy has fromstring, loadtxt and the more general genfromtxt (as well as others for different file/buffer types).
genfromtxt is extremely versatile - for example, it allows us to skip rows, select columns and deal with missing values.
We've seen that our data has:
a header row, that we won't need for analysis
16 columns, but not all of them are needed
missing values, represented by the string "null"
This makes genfromtxt a great choice.
Read the API for genfromtxt in the latest version of the numpy manual.
Read the data for Albany into an array albany (1 line of code). Your array should:
exclude the header line
exclude the first two columns (code and station number)
ensure any occurrences of "null" are replaced with nan
Hint: It is not immediately obvious from the API documentation, but by default missing values identified by genfromtxt for an array of floats will be entered as np.nan.
The resulting array should be of type float64. This should happen by default, so you don't need to specify it, but you should check it (see np.dtype). As we know, the array is homogeneous, so this includes the years and the missing values.
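To get a feel for these genfromtxt options before pointing it at a real file, here is a small sketch that reads from an in-memory string instead. The column names and numbers are made up, and the sample has fewer columns than the BOM files, so the usecols range will differ for the real data.

```python
import io

import numpy as np

# Made-up sample shaped loosely like the lab's files: a header row,
# two leading columns we do not need, and "null" for missing values.
sample = io.StringIO(
    "Code,Station,Year,Jan,Feb,Annual\n"
    "X,1,1880,25.1,null,24.0\n"
    "X,1,1881,26.0,25.5,null\n"
)

table = np.genfromtxt(
    sample,
    delimiter=",",
    skip_header=1,          # drop the header row
    usecols=range(2, 6),    # keep Year..Annual only (sample-specific range)
    missing_values="null",  # treat the string "null" as missing
)

print(table.dtype)             # element type of the whole array
print(table.shape)             # (rows, columns)
print(np.isnan(table[0, 2]))   # was the "null" entry turned into nan?
```

Because the array is homogeneous, the years are stored as floats too, alongside the readings and the nan entries.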
Print the shape of the array for Albany.
Selecting items in 2-d arrays
We know how to select an item in a 1-d array by indexing. To select an item in a 2-d array, we just select the row and the column.
For example, albany[0,-1] will select the item in the last column of the first (zeroth) row, that is, the annual average maximum temperature for 1880.
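The row-then-column pattern can be tried on a toy array first. The values below are invented and the table is much narrower than the real one; it is only a sketch of the indexing syntax.

```python
import numpy as np

# Toy table: each row is [year, value_a, value_b, annual]. Values are invented.
toy = np.array([
    [1880.0, 20.0, 21.0, 20.5],
    [1881.0, 22.0, 23.0, 22.5],
    [1882.0, 24.0, 25.0, 24.5],
])

print(toy[0, -1])   # first row, last column
print(toy[1, 0])    # second row, first column
print(toy[-1, 0])   # last row, first column (negative indices count from the end)
```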
Print out the annual average temperature for the last year in the table (1 line of code).
Print out the annual average temperature for the second last year in the table.
Inspecting the year range for a town [1 lab mark]
Write a function year_range(town) that returns a pair of integers representing the first year and the last year included in that town's table (you may assume the tables are always in the correct format and in chronological order).
You should use genfromtxt to read the data, and then use array selection (no loops). This should only take a couple of lines of code!
Check your answers for the various towns are correct.
Collecting all the data
Write a function get_temps() that, by iterating through the STATIONS dictionary in alphabetical order, returns:
a list of town names in alphabetical order
a list of arrays containing the corresponding tables of temperature data (from the year to the annual average temperature)
So that you can observe the progress and get a better idea of what data is there, output the town, rows of data and first and last years for each town. For example:
A 'production' version [1 lab mark]
In this version the user has more control, including being able to turn the printed output on or off.
Write a function get_temperatures(stations, quiet=True) with the same specification as get_temps except that:
a stations dictionary is passed to the function (rather than 'hard wired' to STATIONS)
it takes one optional argument, quiet, that defaults to True. If quiet is False, then it should print the output as it reads in the data, allowing the user to monitor it, otherwise it reads the data in 'silently' (can be useful when called by other functions)
Data Cleaning and Conversion
Selecting a row or column
We can use the slice operators in the usual way to mean all the data between some bounds (optionally with step size). So to get all the years, for example, I can select all rows and the first (zeroth) column: albany[:,0]
Using Albany as an example, use array selection to return:
the first row of data
the last row of data
all the annual averages in the last column
Selecting an area
Use a slice to select the first 4 months of data for the first 3 years recorded.
Your code should be of the form: albany[select the 3 rows here, select the 4 columns here].
Check against the original csv file to ensure you have selected the correct data.
Two ways to skin a cat?
Use selection to print the temperature for January 1882 (albany[2,1]).
Now use selection to extract the row for 1882, then use selection on the result to extract the January reading (albany[2][1]).
What do you notice?
Now use selection to extract the area containing the first 2 months of temperature data for the first 3 years, albany[:3,1:3].
Then try albany[:3][1:3].
What do you find?
Why is it different? Why did it work for selecting one item?
What would you need to put in the second set of brackets to get the same result? Try this out.
Verifying the data
Using np.mean() and array selection, determine whether the mean of the first 12 months of temperatures is equal to the annual average reported for Albany for that year (1 line of code).
Complete the following without using any loops. (Each should take 1 line to do the operation, and one line to check the shape where relevant.)
Select all of the monthly temperature data into an array (that is, all columns except the year and the annual average). Check the shape is as you would expect.
Use np.mean with the axis argument to get an array of the means for each year, rounded to 1 decimal place. Check the shape (to ensure you used the right axis).
Get a boolean array of all those years where the stated annual average is not equal to your calculated average. Check the shape.
Use numpy's any to check whether there are any that are different.
Use the boolean array from above as a mask to select all the rows in the Albany data where the averages don't match. (1 line of code)
Change your line of code from above so that it prints just the year for each of the rows that don't match.
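The building blocks used in these steps (a mean along one axis, rounding, an element-wise comparison, any, and masking) can be sketched on a tiny invented array. The names readings and reported are hypothetical stand-ins, not the Albany data.

```python
import numpy as np

# Invented numbers: three readings per row, plus a separately reported average.
readings = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
])
reported = np.array([2.0, 2.5])

# axis=1 averages across the columns, giving one mean per row.
row_means = np.round(np.mean(readings, axis=1), 1)

# Element-wise comparison gives a boolean array of the same length.
differs = row_means != reported

print(row_means.shape)    # one entry per row
print(np.any(differs))    # is at least one entry True?
print(readings[differs])  # the boolean array used as a row mask
```

Checking the shape after the mean is a quick way to confirm the right axis was used: with axis=0 the result would have one entry per column instead.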
Let's say we want to extract both the year and annual average columns from the temperature table. We could do that using boolean masks with something like this:
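One possible form of such a selection, sketched on an invented toy table rather than the Albany data (the names toy, rows and cols are assumptions of this sketch):

```python
import numpy as np

# Toy table with columns [year, m1, m2, annual]; values are invented.
toy = np.array([
    [1880.0, 20.0, 21.0, 20.5],
    [1881.0, 22.0, 23.0, 22.5],
    [1882.0, 24.0, 25.0, 24.5],
])

rows = np.array([True, False, True])  # hypothetical row mask
cols = [True, False, False, True]     # plain list of booleans: year and annual

# Select the masked rows first, then the masked columns of the result.
selected = toy[rows][:, cols]
print(selected)
```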
Try this out. What is the shape of the resulting array?
Note that cols was automagically cast to an array in this demonstration. It would be better to define cols as an array.
Repeat this with cols defined as an array.
Using a boolean array to select the columns in the above example is a bit unwieldy. Luckily, numpy allows integer masks as well.
For example, if I want the columns for January and April, I could use albany[:,[1,4]]. Give this a try.
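As a sketch of how an integer list compares with a boolean mask, again on an invented toy table:

```python
import numpy as np

# Toy table with columns [year, m1, m2, annual]; values are invented.
toy = np.array([
    [1880.0, 20.0, 21.0, 20.5],
    [1881.0, 22.0, 23.0, 22.5],
    [1882.0, 24.0, 25.0, 24.5],
])

# Integer list: pick columns by position (negative positions count from the end).
picked = toy[:, [0, -1]]

# The equivalent boolean mask needs one entry for every column.
same = toy[:, [True, False, False, True]]

print(picked.shape)
print(np.array_equal(picked, same))
```

The integer form only names the columns we want, which is why it stays short however wide the table is.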
Now print the year and annual average columns for the mismatched years using an integer array.
You should get the same output as above.
np.stack
Another way of extracting the two columns is to stack them in a new array. Look up the API for numpy.stack.
Assuming mismatch is my boolean mask, I can say:
Give this a try.
This has taken each column of data as a 1-d (flattened) array and stacked them to create a 2-d array.
If we want to see it vertically again, we can transpose it:
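The stacking and transposing steps described above could look something like the following sketch. It uses an invented toy table and a made-up mismatch mask, so the shapes will differ from those for Albany.

```python
import numpy as np

# Toy table with columns [year, m1, m2, annual]; values are invented.
toy = np.array([
    [1880.0, 20.0, 21.0, 20.5],
    [1881.0, 22.0, 23.0, 22.5],
    [1882.0, 24.0, 25.0, 24.5],
    [1883.0, 26.0, 27.0, 26.5],
])
mismatch = np.array([True, False, True, True])  # hypothetical mask

# Each selection is a 1-d array; stack joins them along a new first axis.
stacked = np.stack((toy[mismatch, 0], toy[mismatch, -1]))
print(stacked.shape)    # one row per stacked column

# .T swaps the axes, so each year is back on its own row.
print(stacked.T.shape)
print(stacked.T)
```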
While this is less efficient than the previous approach, since it creates a new array, it is probably easier to read. More importantly, however, we can include other data in the new array, such as our calculated means.
Using this approach, create an array that has the year, reported average, and calculated mean, for each year where the reported and calculated averages don't match. You should have a 3x10 array.
To make it a little easier to read, transpose it to a 10x3 array and print it. Your output should start like this:
Could all of your differences be explained by rounding errors and missing data?
np.hstack, np.vstack and np.concatenate
Stacking (and concatenating) can be used to create a bigger array in the same dimensions, or in a new dimension (as we did with stack above).
Let's say we have the years in 3 1-d arrays
and we want to put them in one long array. We can use hstack to stack them "horizontally":
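A minimal sketch of this, with three invented 1-d arrays of years (the names early, middle and late are hypothetical):

```python
import numpy as np

# Three invented 1-d arrays of years.
early = np.array([1880, 1881])
middle = np.array([1882, 1883])
late = np.array([1884])

# hstack takes a tuple of arrays and joins them end to end.
all_years = np.hstack((early, middle, late))

print(all_years)
print(all_years.shape)
```

The arrays do not need to be the same length for this, since they are joined along their only axis.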
Try this using concatenate.
Try using vstack. What is the difference?
Putting it all together [1 lab mark]
Write a function that:
reads the temperature data for all towns specified in the stations dictionary in alphabetical order of town name (using get_temps_data)
returns an n x 3 array which has one line for each of the cases in which the recorded annual average does not reconcile with the calculated average
Note that:
the array should be ordered (vertically) by town and then by year
nan is considered as not matching any number (including nan)
your code should be general:
you should not assume that the towns are Albany, Perth and Broome
you can assume that all files will take the same format as downloaded from the BOM (that is, a temps_file method that uses PRODUCT, MONTHS12 and stations will correctly generate the data filename for any town in stations, and the columns will be the same)
(This should only take about 8 lines of code, without being overly terse.)
We will make the following validating and data-cleaning assumptions:
If the calculated mean is within 0.1 of the reported annual average, we'll assume the data is valid.
Since the reported averages may have been calculated on more precise figures and reported to 1 decimal place, we will take the reported average when considering annual figures (that is, we delay rounding until as late as possible to avoid compounding rounding errors).
When considering annual figures, we will remove years where the annual figure is a null value.
Write a function clean(table) that takes a 14 column table, as returned by get_temps_data(), and cleans it according to the above assumptions.
Write a function clean_all(tables) that takes a list of tables and returns a cleaned list of tables.
To avoid a warning caused by nans, it is suggested you remove the nan rows first.
Therefore, it is suggested you (in 4 lines of code):
make a mask that keeps all the rows that don't have a null value for annual average temperature (you may find np.logical_not useful)
apply the mask to keep those rows
make a mask that keeps all the rows that have a mean temperature within 0.1 of the reported annual average
apply the mask to keep those rows
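The two-stage masking idea can be sketched on a small invented array. This is only an illustration of the technique on a 4 column toy table, not the 14 column table the lab uses, and the extra 1e-9 tolerance is an assumption added here to guard against floating-point round-off.

```python
import numpy as np

# Invented rows: [year, reading_1, reading_2, reported_average].
toy = np.array([
    [2000.0, 20.0, 22.0, 21.0],    # consistent
    [2001.0, 20.0, 22.0, np.nan],  # no reported average
    [2002.0, 20.0, 22.0, 25.0],    # reported average disagrees
])

# Stage 1: keep rows whose reported average is not nan.
has_average = np.logical_not(np.isnan(toy[:, -1]))
toy = toy[has_average]

# Stage 2: keep rows whose calculated mean is within 0.1 of the reported one.
# The 1e-9 slack is an assumption of this sketch, not part of the lab spec.
close = np.abs(np.mean(toy[:, 1:-1], axis=1) - toy[:, -1]) <= 0.1 + 1e-9
toy = toy[close]

print(toy[:, 0])  # years that survive both filters
```

Removing the nan rows first means the comparison in the second stage never has to deal with nan.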
Visualisation
For convenience, define a method get_table(town, stations) that returns the temperature table for a named town.
Using the table for Albany, plot the annual averages (last column) against the years (first column).
You will notice that the plot has not treated missing data well. Rather than a line joining the missing data, it would be more instructive to have a gap. (Our plot of the cleaned data will also not include the final year in the x-axis.)
One way to leave gaps is to ensure all missing years are set to nan. Matplotlib will not plot these years.
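A small sketch of this behaviour, using invented values rather than the Albany table: the two nan entries stay in the data handed to matplotlib, but no line is drawn through them.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented series with two missing readings in the middle.
years = np.arange(2000, 2006)
values = np.array([20.1, 20.4, np.nan, np.nan, 21.0, 21.2])

fig, ax = plt.subplots()
(line,) = ax.plot(years, values)  # the line breaks where values are nan
ax.set_xlabel("Year")
ax.set_ylabel("Value (invented)")

# The nan entries are still part of the plotted data; they are just not drawn.
print(np.isnan(line.get_ydata()).sum())
```

In a notebook the figure is displayed automatically; in a plain script, a call to plt.show() would be needed to see it.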
To demonstrate this, plot the uncleaned data for Albany.
At first sight this appears to have solved our problem! We can just use the uncleaned data.
But actually this is just a coincidence of our specific data, not a general solution.
Have a look at the data file for Albany. It happens that the gap of missing data from 1966 to 2001 has a null value either side. This causes matplotlib not to put a line between those values.
To demonstrate this, we can leave some other gaps without null values. A quick way to do this is by plotting every tenth value.
Generate this plot. What do you see?
Augmenting the data
Matplotlib 'does its best' to anticipate what we want to see, and does a pretty good job in general! After all, it filled in all the years on the x-axis for us (allowing us to be a bit lazy), even though we only fed it x values for some of the years.
But it can only do so much. We will need to do it properly.
We will now 'augment' the data by including all of the years ranging from the first to the last year in the data. We will set the missing years to np.nan so that they are not plotted.
We can do all of this in numpy without using any loops.
Identifying missing years [1 lab mark]
Write a method missing_years(table) that takes an (uncleaned) table, and returns an array of years (as integers) that fall within the range of years from the first in the table to the last in the table, that either:
don't have a reported annual average in the table
have a reported annual average of "null"
Your method should not use any loops. You can use previously defined methods.
Hint: It is suggested that you break it down into the following steps:
use the information from the (uncleaned) table to generate an array of all the years in the range of your table
use the cleaned table to get an array of the years that have valid data
determine the indices of the valid years in the full range of years
generate a boolean mask over the full range of years that selects the non-valid years
Tip: You may also find the array method astype() useful.
Check your results are correct by comparing (visually) with the downloaded tables.
Augmenting with nan [1 lab mark]
Write a function augmented(table) that returns a pair of arrays:
the first is an array of all the years in the range of years covered in the table, as integers
the second is an array of floats which contains:
the reported annual average for the corresponding year
np.nan where either the data was null or missing
Again the function should have no loops.
Hint: Use a similar structure to missing_years.
Again, check you are getting the right results on the downloaded tables.
Plot the augmented table for Albany. Is it what you expected?
Visualising the Complete Data!
At last it's time to plot the historic temperatures. But because we've prepared well, this is very straightforward! It shouldn't take more than about 4 lines of code to get and plot the data for all the towns in STATIONS (plus a few lines to label the graph).
Plot all the historical data for average annual temperatures for all towns on a single chart.
Part of the chart should look like this (yours should include all towns and all years):
Can you see any trends? How might you quantify this?
© Cara MacNish