Project: Ruida He - CITS2402 Introduction to Data Science

Views: ⁵³¹³
Image: default

Kernel: Python 3 (system-wide)

Lab 7

Feel the Heat!

Source: 110 Years of Australian Temperatures, Bureau of Meteorology

We might love a sunburnt country, but Dorothea Mackellar in 1908 couldn't have envisioned quite how sunburnt its getting!

The 2019 temperature anomolies map (shown above) paints another alarming picture!

In this lab, we will again examine raw data from the Bureau of Meteorology. This time we will focus on maximum temperatures for Western Australian towns, form the north to the south.

Working with 2-Dimensional Arrays

This lab assignment provides a lot of practice working with 2-d arrays. It introduces some important new concepts.

Because of this and the length of lab the due date will be at little later than usual - the due date will be 10am on Tuesday 29th September.

Because it will involve working through new concepts it may take a few sessions to work though it, so its important not to leave it until close to the due date.

Data Acquisition and Inspection

Once again we will source our data from the Australian Government's Bureau of Meteorology (BOM) Climate Data Online service.

Find the monthly average maximum temperature data.
Find and download the historical data for:
- Perth Airport
- Geraldton Town
- Albany
- Broome Airport
Upload the data to your CoCalc project. You should upload the whole folder for each town, it should not be more than about 60KB each.
Read one of the Note.txt files so that you can see what is in each file, and the structure.
Inspect one of each type of data file to see that they conform to your expectations from reading the Note.

Set up

We will set up our constants as follows.

In this lab we will use the second kind of file discussed in Note.txt (12 months per line).
We will use a dictionary to map town names to the station codes. This will allow us to easily find the data for each town.
For simplicity we will only use one product (monthly maximum temperatures), however you can see that we could easily generalise this to other data sets (products) in a similar way to the stations.

In [1]:

PRODUCT = "IDCJAC0002"
MONTH12 = "Data12.csv"

STATIONS = {"Albany": "009500",
            "Perth": "009021",
            "Broome": "003003",
            "Geraldton": "008050"
           }

File paths

Lets start with a small utility function.

Using str.format(), write a function temps_file(town) that returns the path to the file containing the twelve-monthly temperature table for town, where town is the town name (Albany, Perth, etc).

For example:

temps_file("Albany")

'IDCJAC0002_009500/IDCJAC0002_009500_Data12.csv'

You can find out about string formatting using format() at: https://docs.python.org/3.8/library/string.html#format-string-syntax. If you are not familiar with specification languages (you are not expected to be) it is better to look at the examples for this one.

(You should use the the "new" {} formatting, not the "old" % formatting.)

In [2]:

def temps_file(town):
    return "{product}_{station}/{product}_{station}_{month}".format(
        product = PRODUCT,
        station = STATIONS[town],
        month = MONTH12)

temps_file("Albany")

'IDCJAC0002_009500/IDCJAC0002_009500_Data12.csv'

In [3]:

import numpy as np
import csv

COLUMN = 16

path = temps_file("Albany")

albany = np.genfromtxt(path, delimiter = ",",
                       skip_header = 1,
                       replace_space = ('null', np.nan),
                       usecols = tuple(range(2, COLUMN)))

`np.genfromtxt`

While csv is particularly targeted at reading csv (like) files, a range of other libraries include their own bespoke file/buffer/string parsers.

For example, numpy has fromstring, loadtxt and the more general genfromtxt (as well as others for different file/buffer types).

genfromtxt is extremely versatile - for example, it allows us to skip rows, select columns and deal with missing values.

We've seen that our data has:

a header row, that we won't need for analysis
16 columns, but not all of them are needed
missing values, represented by the string "null"

This makes genfromtxt a great choice.

Read the API for genfromtxt in the latest version of the numpy manual.
Read the data for Albany into an array albany (1 line of code). Your array should:
- exclude the header line
- exclude the first two columns (code and station number)
- ensure any occurrences of "null" are replaced with nan

print(albany[-4:])

[[2017.    22.1   22.2   21.8   20.1   19.4   19.    15.9   16.7   17.5
    19.1   20.6   21.5   19.7]
 [2018.    22.4   23.    22.4   21.2   20.7   17.4   17.    16.2   17.
    19.    18.7   21.1   19.7]
 [2019.    22.2   22.2   21.2   21.2   18.7   17.6   16.8   17.9   19.3
    19.3   19.4   23.5   19.9]
 [2020.    22.2   23.4   23.1   21.2   19.6   18.9   17.3    nan    nan
     nan    nan    nan    nan]]

Hint: It is not immediately obvious from the API documentation, but by default missing values identified by genfromtxt for an array of floats will be entered as np.nan.

The the resulting array should be of type float64. This should happen by default, you don't need to specify it - but you should check it (see np.dtype). As we know the array is homogeneous, so this includes the years, and the missing values.

Print the shape of the array for Albany.

Selecting items in 2-d arrays

We know how to select an item in a 1-d array using a slice. To select an item in a 2-d array, we just select the row and the column.

For example, albany[0,-1] will select the item in the last column of the first (zeroth) row, that is, the annual rainfall for 1880.

Print out the annual rainfall for the last year in the table (1 line of code).
Print out the annual rainfall for the second last year in the table.

Inspecting the year range for a town [1 lab mark]

Write a function year_range(town) that returns a pair of integers representing the first year and the last year included in that town's table (you may assume the tables are always in the correct format and in chronological order).

You should use genfromtxt to read the data, and then use array selection (no loops). This should only take a couple of lines of code!

Check your answers for the various towns are correct.

In [4]:

def year_range(town):

    path = temps_file(town)

    data = np.genfromtxt(path, delimiter = ",", skip_header = 1, replace_space = ('null', np.nan), usecols = tuple(range(2, 16)))

    return (int(data[0,0]), int(data[-1, 0]))

In [5]:

from nose.tools import assert_equal
assert_equal(year_range("Albany"), (1880, 2020))
print("So far, so good. Additional tests will be applied.")

So far, so good. Additional tests will be applied.

Collecting all the data

Write a function get_temps() that, by iterating through the STATIONS dictionary in alphabetical order, returns:
- a list of town names in alphabetical order
- a list of arrays containing the corresponding tables of temperature data (from the year to the average annual rainfall)

So that you can observe the progress and get a better idea of what data is there, output the town, rows of data and first and last years for each town. For example:

(towns, tables) = get_temps()

Collecting data for Albany   :  105 years recorded between 1880 and 2020
Collecting data for Broome   :   82 years recorded between 1939 and 2020
Collecting data for Geraldton:   73 years recorded between 1880 and 1953
Collecting data for Perth    :   77 years recorded between 1944 and 2020

In [6]:

def get_temps():

    towns  = []
    tables = []

    # First we just sort the dic alphabetically --> turn it to list with tuples as elements
    sorted_stations = sorted(STATIONS.items(), key=lambda x:x[0].lower())

    for (key, item) in sorted_stations:

        towns.append(key)

        path = temps_file(key)

        data = np.genfromtxt(path, delimiter = ",", skip_header = 1, replace_space = ('null', np.nan), usecols = tuple(range(2, 16)))

        tables.append(data)

    return (towns, tables)

A 'production' version [1 lab mark]

In this version the user has more control, including being able to turn the printed output on or off.

Write a function get_temperatures (stations, quiet=True) with the same specification as get_temps except that:
- a stations dictionary is passed to the function (rather than 'hard wired' to STATIONS)
- it takes one optional argument, quiet, that defaults to True. If quiet is False, then it should print the output as it reads in the data, allowing the user to monitor it, otherwise it reads the data in `silently' (can be useful when called by other functions)

In [7]:

def get_temperatures (stations, quiet=True):

    towns  = []
    tables = []

    # First we just sort the dic alphabetically --> turn it to list with tuples as elements
    sorted_stations = sorted(stations.items(), key = lambda x:x[0].lower())

    for (key, item) in sorted_stations:

        path = temps_file(key)

        data = np.genfromtxt(path, delimiter = ",", skip_header = 1, replace_space = ('null', np.nan), usecols = tuple(range(2, 16)))

        towns.append(key)

        tables.append(data)

        if quiet == False:
            print(key, "\n")
            print(data, "\n")

    return (towns, tables)

In [8]:

from nose.tools import assert_equal
(towns, tables) = get_temperatures(STATIONS, quiet=True)
assert_equal(towns, ['Albany', 'Broome', 'Geraldton', 'Perth'])
geraldton = tables[towns.index("Geraldton")]
assert_equal(geraldton[0,0],1880.)
assert_equal(geraldton[1,-2],28.9)

Data Cleaning and Conversion

Selecting a row or column

We can use the slice operators in the usual way to mean all the data between some bounds (optionally with step size). So to get all the years, for example, I can select all rows and the first (zeroth) column:

(towns, tables) = get_temperatures(STATIONS)
albany = tables[towns.index("Albany")]
albany[:,0]

array([1880., 1881., 1882., 1883., 1884., 1885., 1886., 1887., 1888.,
       1889., 1890., 1891., 1892., 1893., 1894., 1895., 1896., 1897.,
       1898., 1899., 1900., 1901., 1902., 1903., 1904., 1905., 1906.,
       1907., 1908., 1909., 1910., 1911., 1912., 1913., 1914., 1915.,
       1916., 1917., 1918., 1919., 1920., 1921., 1922., 1923., 1924.,
...

Using Albany as an example, use array selection to return:
- the first row of data
- the last row of data
- all the annual averages in the last column

In [9]:

(towns, tables) = get_temperatures(STATIONS)
albany = tables[towns.index("Albany")]

albany[0, :]
albany[-1, :]
albany[:, -1]

array([18.2, 18.2, 17.8, 17.8, 17.9, 17.8, 17.6, 17.3, 18.4, 17.1, 17. ,
       18.3, 18.8, 18.9, 19.8, 19.6, 20.1, 19.6, 20.1, 19.7, 19. , 19.8,
       19.4, 18.8, 19.7, 18.9, 19.7, 20.1, 19.7, 20. , 20.2, 20. , 19.9,
       19.5, 19.8, 20. , 20. , 19.5, 20.8, 20.3, 20.6, 21. , 19.9, 20. ,
       20. , 19.5, 19.2, 19.4, 19.3, 18.9, 20.4, 19.3, 19.8, 20.3, 20.3,
       19.4, 19.8, 20. ,  nan, 19.9, 20.4, 20. , 19.6, 19.1, 19.6, 19.7,
       19.5, 19.4, 19.9, 20.1, 20. , 19.3, 19.3, 19.4, 19.7, 19. , 19.3,
       20.2, 19.8, 20.1, 18.8, 20.3, 20.1, 20.1, 19.1,  nan,  nan, 19.8,
       19.6, 19.4, 19.6, 19.7, 19.5, 19.6, 19.8, 20.4, 20.8, 20.3, 20.3,
       20.3, 19.4, 19.7, 19.7, 19.9,  nan])

Selecting an area

Use a slice to select the first 4 months of data for the first 3 years recorded.

Your code should be of the form:
albany[select the 3 rows here, select the 4 columns here].

Check against the original csv file to ensure you have selected the correct data.

In [10]:

albany[:3, 1:5]

array([[22.1, 22.8, 20.6, 19.2],
       [20.1, 22.2, 22.2, 19.6],
       [20.2, 22.1, 20.1, 18.8]])

Two ways to skin a cat?

Use selection to print the temperature for January 1882 (albany[2,1]).
Now use selection to extract the row for 1882, then use selection on the result to extract the January reading (albany[2][1]).

What do you notice?

In [11]:

albany[2,1]

20.2

In [12]:

albany[2][1]

20.2

Now use selection to extract the area containing the first 2 months of temperature data for the first 3 years, albany[:3,1:3].
Then try albany[:3][1:3].

What do you find?

Why is it different? Why did it work for selecting one item?

What would you need to put in the second set of brackets to get the same result? Try this out.

In [13]:

albany[:3,1:3]

array([[22.1, 22.8],
       [20.1, 22.2],
       [20.2, 22.1]])

In [14]:

albany[:3]

array([[1880. ,   22.1,   22.8,   20.6,   19.2,   17.5,   14.7,   15.1,
          15.2,   16.2,   16.2,   18. ,   20.8,   18.2],
       [1881. ,   20.1,   22.2,   22.2,   19.6,   17.1,   14.4,   15.3,
          15.5,   16.4,   16.1,   18.7,   20.7,   18.2],
       [1882. ,   20.2,   22.1,   20.1,   18.8,   16.3,   14.6,   14.4,
          14.9,   16.4,   17.3,   17.8,   20.8,   17.8]])

In [15]:

albany[:3][1:3]

array([[1881. ,   20.1,   22.2,   22.2,   19.6,   17.1,   14.4,   15.3,
          15.5,   16.4,   16.1,   18.7,   20.7,   18.2],
       [1882. ,   20.2,   22.1,   20.1,   18.8,   16.3,   14.6,   14.4,
          14.9,   16.4,   17.3,   17.8,   20.8,   17.8]])

Verifying the data

Using np.mean() and array selection, determine whether mean of the first 12 months of rainfall is equal to the average annual rainfall reported for Albany for that year (1 line of code).

In [16]:

albany[:10]

# mask = (np.isclose(np.mean(albany[:, 1:13], axis = 1), albany[:, 13]))

mask = (np.round(np.mean(albany[:, 1:13], axis = 1), 1) == albany[:, 13])


print(np.round(np.mean(albany[:, 1:13], axis = 1), 1))

print(albany[:, 13])

print(mask)

print(np.all(mask))

[18.2 18.2 17.8 17.8 17.9 17.8 17.6 17.3 18.4 17.1 17.  18.3 18.8 18.9
 19.8 19.6 20.1 19.6 20.1 19.7 19.  19.8 19.4 18.8 19.7 19.  19.7 20.1
 19.7 20.  20.2 20.  19.9 19.5 19.8 20.  20.  19.5 20.8 20.3 20.7 21.
 19.9 20.  20.  19.5 19.2 19.4 19.3 18.8 20.4 19.3 19.8 20.3 20.3 19.4
 19.8 20.   nan 19.9 20.4 20.  19.6 19.1 19.6 19.7 19.5 19.4 20.  20.1
 20.  19.3 19.3 19.4 19.7 19.  19.3 20.2 19.8 20.1 18.8 20.3 20.1 20.1
 19.1  nan  nan 19.8 19.6 19.4 19.6 19.7 19.4 19.6 19.8 20.4 20.8 20.3
 20.3 20.3 19.3 19.7 19.7 19.9  nan]
[18.2 18.2 17.8 17.8 17.9 17.8 17.6 17.3 18.4 17.1 17.  18.3 18.8 18.9
 19.8 19.6 20.1 19.6 20.1 19.7 19.  19.8 19.4 18.8 19.7 18.9 19.7 20.1
 19.7 20.  20.2 20.  19.9 19.5 19.8 20.  20.  19.5 20.8 20.3 20.6 21.
 19.9 20.  20.  19.5 19.2 19.4 19.3 18.9 20.4 19.3 19.8 20.3 20.3 19.4
 19.8 20.   nan 19.9 20.4 20.  19.6 19.1 19.6 19.7 19.5 19.4 19.9 20.1
 20.  19.3 19.3 19.4 19.7 19.  19.3 20.2 19.8 20.1 18.8 20.3 20.1 20.1
 19.1  nan  nan 19.8 19.6 19.4 19.6 19.7 19.5 19.6 19.8 20.4 20.8 20.3
 20.3 20.3 19.4 19.7 19.7 19.9  nan]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True False  True  True  True  True  True  True  True
  True False  True  True  True  True  True  True  True  True False  True
  True  True  True  True  True  True  True  True False  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True False False  True  True  True  True  True False  True  True  True
  True  True  True  True False  True  True  True False]
False

Complete the following without using any loops. (Each should take 1 line to do the operation, and one line to check the shape where relevant.)

Select all of the monthly rainfall data into an array (that is, all columns except the year and the rainfall). Check the shape is as you would expect.
Use np.mean with the axis argument to get an array of the means for each year, rounded to 1 decimal place. Check the shape (to ensure you used the right axis).
Get a boolean array of all those years where the stated annual average is not equal to your calculated average. Check the shape.
Use numpy's any to check whether there are any that are different.

Use the boolean array from above as a mask to select all the rows in the Albany data where the averages don't match. (1 line of code)
Change your line of code from above so that it prints just the year for each of the rows that don't match.

In [17]:

# This data is only for albany

monthly_heat = albany[:, 1:13]

mean_heat = np.round(np.mean(monthly_heat, axis = 1), 1)

# Find whether the annual data we cpompute is not equal to the average data it provided
mismatch = (mean_heat != albany[:, 13])

# Use numpy's any to check whether there are any that are different
print(np.any(mismatch))

wrong_data_for_albany = albany[mismatch]
print(wrong_data_for_albany)

True
[[1905.    22.8   21.6   20.9   21.1   18.2   16.2   15.9   15.    15.7
    17.3   20.2   22.5   18.9]
 [1920.    25.3   24.2   23.8   22.8   19.    16.2   15.8   15.8   18.2
    20.8   22.5   23.4   20.6]
 [1929.    21.9   22.5   20.5   20.1   17.7   15.5   15.2   15.2   17.7
    18.1   20.1   21.7   18.9]
 [1938.    22.4   21.9   22.2   19.9   18.5   16.1   16.    16.9   18.
    19.7   21.4    nan    nan]
 [1948.    23.8   23.8   23.7   20.6   20.1   17.3   15.9   17.4   17.6
    18.5   19.7   21.    19.9]
 [1965.    24.4   23.6   23.1   22.3    nan    nan    nan    nan    nan
     nan    nan    nan    nan]
 [2002.     nan    nan    nan    nan    nan    nan    nan    nan   17.5
    19.    20.2   21.3    nan]
 [2008.    21.8   22.8   21.9   21.3   19.9   17.9   15.6   16.6   17.9
    18.4   18.6   20.7   19.5]
 [2016.    22.7   23.4   22.6   20.6   18.6   16.7   16.2   16.1   15.9
    18.    20.6   20.8   19.4]
 [2020.    22.2   23.4   23.1   21.2   19.6   18.9   17.3   16.9    nan
     nan    nan    nan    nan]]

In [18]:

# cols is a mask used for print the first column and the last column
cols = [True, False, False, False, False, False, False, False, False, False, False, False, False, True]
print(albany[mismatch][:,cols])
print(np.shape(albany[mismatch][:,cols]))

# Or we can use this way
print(albany[mismatch][:,[0, -1]])

[[1905.    18.9]
 [1920.    20.6]
 [1929.    18.9]
 [1938.     nan]
 [1948.    19.9]
 [1965.     nan]
 [2002.     nan]
 [2008.    19.5]
 [2016.    19.4]
 [2020.     nan]]
(10, 2)
[[1905.    18.9]
 [1920.    20.6]
 [1929.    18.9]
 [1938.     nan]
 [1948.    19.9]
 [1965.     nan]
 [2002.     nan]
 [2008.    19.5]
 [2016.    19.4]
 [2020.     nan]]

Let's say we want to extract both the year and annual average columns from the rainfall table. We could do that using boolean masks with something like this:

cols = [True, False, False, False, False, False, False, False, False, False, False, False, False, True]
print(albany[mismatch][:,cols])

Try this out. What is the shape of the resulting array?

Note that cols was automagically cast to an array in this demonstration. Better would be to define cols as an array.

Repeat this with cols defined as an array.

Using a boolean array to select the columns in the above example is a bit unweildy. Luckily numpy allow integer masks as well.

For example, if I want the rows for January and April, I could use albany[:,[1,4]]. Give this a try.

Now print the year and annual average columns for the mismatched years using an integer array.

You should get the same output as above.

In [19]:

# np.stack() takes each column of the data as 1-d array
#    and stack them to create a 2-d array

print(np.stack((albany[mismatch, 0], albany[mismatch, 13])))

# If we want it vertically again, we can transpose it
print(np.stack((albany[mismatch, 0], albany[mismatch, 13])).T)

[[1905.  1920.  1929.  1938.  1948.  1965.  2002.  2008.  2016.  2020. ]
 [  18.9   20.6   18.9    nan   19.9    nan    nan   19.5   19.4    nan]]
[[1905.    18.9]
 [1920.    20.6]
 [1929.    18.9]
 [1938.     nan]
 [1948.    19.9]
 [1965.     nan]
 [2002.     nan]
 [2008.    19.5]
 [2016.    19.4]
 [2020.     nan]]

`np.stack`

Another way of extracting the two columns is to stack them in a new array. Look up the API for numpy.stack.

Assuming mismatch is my boolean mask, I can say:

print(np.stack((albany[mismatch,0], albany[mismatch,13])))

[[1905.  1920.  1929.  1938.  1948.  1965.  2002.  2008.  2016.  2020. ]
 [  18.9   20.6   18.9    nan   19.9    nan    nan   19.5   19.4    nan]]

Give this a try.

This has taken each column of data as a 1-d (flattened) array and stacked them to create a 2-d array.

If we want to see it vertically again, we can transpose it:

print(np.stack((albany[mismatch,0], albany[mismatch,13])).T)

While this is less efficient than the previous approach, since it creates a new array, it is probably easier to read. More importantly, however, we can include other data in the new array, such as our calculated means.

Using this approach, create an array that has the year, reported average, and calculated mean, for each year where the reported and calculated averaqes don't match. You should have a 3x10 array.
To make it a little easier to read, transpose it to a 10x3 array and print it. Your output should start like this:

[[1905.    18.9   19. ]
 [1920.    20.6   20.7]
 [1929.    18.9   18.8]
 [1938.     nan    nan]
...

Could all of your difference be explained by rounding errors and missing data?

In [20]:

mean_heat[mismatch]

print(np.stack((albany[mismatch, 0], albany[mismatch, 13], mean_heat[mismatch])).T)

[[1905.    18.9   19. ]
 [1920.    20.6   20.7]
 [1929.    18.9   18.8]
 [1938.     nan    nan]
 [1948.    19.9   20. ]
 [1965.     nan    nan]
 [2002.     nan    nan]
 [2008.    19.5   19.4]
 [2016.    19.4   19.3]
 [2020.     nan    nan]]

`np.hstack`, `np.vstack` and `np.concatenate`

Stacking (and concatenating) can be used to create a bigger array in the same dimensions, or in a new dimension (as we did with stack above).

Let's say we have the years in 3 1-d arrays

(albany, broome, perth) = (tables[towns.index("Albany")], tables[towns.index("Albany")], tables[towns.index("Albany")])

and we want to put them in one long array. We can use hstack to stack them "horizontally":

all_years = np.hstack((albany[:,0], broome[:,0], perth[:,0]))
print(all_years.shape)
print(all_years)

(315,)
[1880. 1881. 1882. 1883. 1884. 1885. 1886. 1887. 1888. 1889. 1890. 1891.
 1892. 1893. 1894. 1895. 1896. 1897. 1898. 1899. 1900. 1901. 1902. 1903.
 1904. 1905. 1906. 1907. 1908. 1909. 1910. 1911. 1912. 1913. 1914. 1915.
...

Try this using concatenate.
Try using vstack. What is the difference?

In [31]:

(albany, broome, perth) = (tables[towns.index("Albany")],
                           tables[towns.index("Albany")],
                           tables[towns.index("Albany")])

print(albany[:,0].shape)

# Their size are all 105x1 ---> 315x1
all_years = np.hstack((albany[:,0], broome[:,0], perth[:,0]))
print(all_years.shape)

# Using concatenate function (no need to use axis since they are 1-d) --> 315x1
print(np.concatenate((albany[:,0], broome[:,0], perth[:,0])).shape)

# Using vstack, you will get a 3x105
print(np.vstack((albany[:,0], broome[:,0], perth[:,0])).shape)

(105,)
(315,)
(315,)
(3, 105)

Putting it all together [1 lab mark]

Write a function that:

reads the temperature data for all towns specified in the stations dictionary in alphabetical order of town name (using get_temps_data)
returns an $n$ x 3 array which has one line for each of the $n$ cases in which the recorded annual average does not reconcile with the calculated average

Note that:

the array should be ordered (vertically) by town and then by year
nan is considered as not matching any number (including nan)
your code should be general:
- you should not assume that the towns are Albany, Perth and Broome
- you can assume that all files will take the same format as downloaded from the BOM (that is, a temps_file method that uses PRODUCT, MONTHS12 and stations) will correctly generate the data filename for any town in stations, and the columns will be the same

(This should only take about 8 lines of code, without being overly terse.)

In [22]:

def validate_averages():
    (towns, tables) = get_temperatures(STATIONS, quiet=True)
    stacks = np.array([])
    for table in tables:
        mean_heat = np.round(np.mean(table[:, 1:13], axis = 1), 1)
        mismatch  = (mean_heat != table[:, 13])
        stack = np.stack((table[mismatch, 0], table[mismatch, 13], mean_heat[mismatch])).T
        stacks = stack if stacks.size == 0 else np.concatenate((stacks, stack), axis = 0)
    return stacks

In [23]:

from nose.tools import assert_equal
answer = validate_averages()
assert_equal(answer[0,0], 1905.)
assert_equal(answer[10,0], 1939.)
assert_equal(answer[11,2], 32.6)
assert_equal(np.all(np.equal(np.isnan(answer[3]), np.array([False, True, True]))), True)

We will make the following validating and data-cleaning assumptions:

If the calculated mean is within 0.1 of the reported annual average, we'll assume the data is valid.
Since the reported averages may have been calculated on more precise figures and reported to 1 decimal place, we will take the reported average when considering annual figures (that is, we delay rounding until as late as possible to avoid compounding rounding errors).
When considering annual figures, we will remove years where the annual figure is a null value.

In [24]:

def clean_all(tables):

    result = []

    for table in tables:

        # First get rid of all the row containing nan
        #     Since we already transfer the NULL to nan, we only need to deal with nan
        nan_mask = np.logical_not(np.isnan(table[:,13]))
        filter_table = table[nan_mask]

        # Then keep all the valid rows
        mean_heat = np.mean(filter_table[:, 1:13], axis = 1)
        match = (np.absolute((filter_table[:,13] - mean_heat)) < 0.1)
        final_mean = filter_table[match]

        # Return a list of tables
        result.append(final_mean)

    return result

(towns, tables) = get_temperatures(STATIONS, quiet=True)
result = clean_all(tables)
result

[array([[1880. ,   22.1,   22.8, ...,   18. ,   20.8,   18.2],
        [1881. ,   20.1,   22.2, ...,   18.7,   20.7,   18.2],
        [1882. ,   20.2,   22.1, ...,   17.8,   20.8,   17.8],
        ...,
        [2017. ,   22.1,   22.2, ...,   20.6,   21.5,   19.7],
        [2018. ,   22.4,   23. , ...,   18.7,   21.1,   19.7],
        [2019. ,   22.2,   22.2, ...,   19.4,   23.5,   19.9]]),
 array([[1940. ,   31.8,   33.1, ...,   35.4,   34.4,   32.3],
        [1941. ,   33.5,   34.9, ...,   33.2,   34.7,   31.8],
        [1942. ,   31.6,   31.3, ...,   34.2,   33.5,   31.8],
        ...,
        [2017. ,   32.8,   32.2, ...,   33.7,   34.5,   32.9],
        [2018. ,   32.6,   31. , ...,   32.8,   33.7,   32.3],
        [2019. ,   34. ,   33.4, ...,   36.9,   36.7,   33.7]]),
 array([[1880. ,   31.3,   26.8,   26.4,   23.9,   21.8,   18.8,   18.7,
           19.1,   21.4,   20.8,   23.7,   25.6,   23.2],
        [1881. ,   26.9,   28.9,   29.7,   25.1,   21.5,   18.9,   20. ,
           20.4,   21.6,   22.7,   24.4,   28.9,   24.1],
        [1882. ,   26.6,   28.8,   25.7,   23.2,   20.9,   18.2,   17.8,
           18.2,   19.8,   22.4,   23.6,   26.3,   22.6],
        [1883. ,   26.4,   31.9,   25.7,   26.1,   22.9,   20.6,   18.8,
           18.8,   20.4,   22.9,   24.1,   25.7,   23.7],
        [1885. ,   27.1,   28. ,   28.1,   25.2,   21.1,   20.4,   19.4,
           19.4,   20.4,   23.4,   29.1,   31.7,   24.4],
        [1886. ,   32.2,   34.6,   30.7,   28. ,   26.1,   22.1,   22.3,
           20.7,   22.3,   23.6,   26.6,   32.2,   26.8],
        [1887. ,   30.9,   33.2,   32.4,   27.8,   24.2,   21.2,   19.6,
           19.9,   20.3,   23. ,   27.3,   28.8,   25.7],
        [1888. ,   29.3,   28.8,   33.8,   25.2,   22.7,   21.1,   19.1,
           19.9,   21.8,   24.8,   26.9,   28.2,   25.1],
        [1889. ,   29.4,   27.2,   30.3,   27.6,   22.4,   19.8,   20.3,
           20.1,   21.6,   22.4,   23.7,   26.9,   24.3],
        [1890. ,   29.2,   27.8,   28.1,   28.1,   22.4,   19.3,   18.1,
           19.4,   20.2,   21.7,   26.2,   28.1,   24.1],
        [1892. ,   28.1,   32.3,   27.2,   26.8,   23.4,   19.7,   19.7,
           19.5,   21.7,   21.9,   24.1,   26.5,   24.2],
        [1893. ,   30.3,   29.1,   28.5,   24.4,   21.6,   19.3,   19.7,
           20.4,   21.8,   24.2,   25.5,   26.4,   24.3],
        [1894. ,   32.3,   32.9,   29.1,   26.7,   23.9,   21.6,   20.6,
           20.1,   21.1,   23.2,   26.4,   28.8,   25.6],
        [1895. ,   26.9,   27.7,   31.2,   26.3,   22.8,   21.3,   20.5,
           21.2,   21.4,   24.1,   28.1,   26.8,   24.9],
        [1896. ,   30.9,   30.5,   26.4,   24.7,   23.2,   21.7,   18.6,
           20.4,   22.3,   22.5,   23.9,   25.3,   24.2],
        [1897. ,   28.2,   27.9,   26.2,   26.7,   23.3,   21.1,   22.2,
           21.6,   21.9,   22.8,   26.4,   30.8,   24.9],
        [1898. ,   27.6,   30.7,   31.2,   26.2,   24.3,   19.6,   21.6,
           20.7,   22.3,   22.3,   25.7,   27. ,   24.9],
        [1899. ,   29.5,   27.7,   28.5,   24.8,   23.2,   20.1,   20. ,
           20.4,   21.7,   22.3,   23.1,   27.8,   24.1],
        [1900. ,   30.5,   30.2,   26.9,   24.7,   22.2,   20.6,   19.4,
           19.3,   21.7,   23.7,   24.6,   26.2,   24.2],
        [1901. ,   27.2,   29.9,   30.3,   26.7,   23.2,   20.9,   20.2,
           20.3,   21.4,   22.8,   24.9,   24.8,   24.4],
        [1902. ,   26.6,   26.3,   30.2,   25.6,   23.4,   21. ,   19.8,
           22.2,   21.9,   22. ,   25.4,   25.1,   24.1],
        [1903. ,   29.5,   31.6,   26.5,   24.6,   24.2,   21.6,   19.5,
           19.7,   20.3,   22.7,   24.8,   29. ,   24.5],
        [1904. ,   32.1,   29.9,   31.8,   26.7,   21.7,   20.9,   19.5,
           21.1,   20.9,   23.2,   24.5,   27.1,   24.9],
        [1905. ,   26.3,   30.3,   27.7,   26.9,   23.2,   21.2,   20.4,
           20.4,   22.2,   23.9,   24.2,   26.9,   24.5],
        [1906. ,   28.4,   25.9,   29.3,   28.3,   23.5,   21.4,   19.1,
           20.2,   20.1,   23.8,   24.6,   29.2,   24.5],
        [1907. ,   32.3,   30.5,   29.4,   27.9,   24.6,   21.2,   20.3,
           21. ,   21.2,   22.1,   24.4,   26.5,   25.1],
        [1908. ,   28.9,   29.7,   27.9,   26.8,   22.7,   19.2,   20.2,
           20.4,   22.2,   23.5,   24.9,   26.7,   24.4],
        [1909. ,   30.7,   32.3,   28.2,   26.7,   23.2,   21.2,   19.8,
           20.3,   22.1,   21.8,   24.1,   31.6,   25.2],
        [1910. ,   28.5,   31.8,   31.5,   28.1,   22.1,   20.4,   18.7,
           20.6,   21.7,   21.7,   24.5,   26.8,   24.7],
        [1911. ,   28.5,   31.5,   28.1,   26.9,   22.8,   20.6,   20.3,
           19.9,   21.3,   24.6,   23.9,   24.1,   24.4],
        [1912. ,   27.2,   29.8,   26.8,   26.8,   22.7,   22.3,   19.9,
           21.4,   20.5,   22.2,   26.5,   27.1,   24.4],
        [1913. ,   28.8,   29.8,   28.5,   26. ,   26.1,   22.7,   20.4,
           19.9,   22.7,   22.5,   24.7,   25.3,   24.8],
        [1914. ,   29.4,   28.5,   28.1,   23.9,   22.5,   21.8,   19.8,
           22.9,   24.8,   25. ,   25.1,   27.3,   24.9],
        [1915. ,   30.2,   29. ,   28.9,   28.2,   24.4,   21.9,   20.3,
           20.8,   19.9,   22.5,   27.8,   30.4,   25.4],
        [1916. ,   26.9,   28.1,   26.9,   28.6,   23.5,   20. ,   19.1,
           19.8,   24.1,   22.6,   26.4,   28.5,   24.5],
        [1917. ,   29.2,   30.6,   26. ,   26.8,   22.2,   20.7,   19.5,
           19.5,   20. ,   23.4,   26.2,   27.3,   24.3],
        [1918. ,   32.5,   29.8,   29.4,   27.9,   23.6,   21.5,   20.9,
           20.1,   22.4,   22.8,   25.8,   27.7,   25.4],
        [1919. ,   27. ,   29. ,   29.1,   27. ,   24.6,   21.3,   21.3,
           20.4,   21.8,   22.4,   25. ,   25.3,   24.5],
        [1920. ,   32. ,   29.4,   28.1,   28.9,   23.1,   20.4,   19.4,
           18.5,   22.3,   23.8,   26.6,   28. ,   25. ],
        [1921. ,   30.7,   33. ,   31.5,   27.8,   22.6,   21.6,   20.7,
           21.7,   21.8,   23.5,   25. ,   28.3,   25.7],
        [1922. ,   28.2,   30.1,   30.1,   26.2,   22.9,   20.3,   19.3,
           20.7,   21.4,   24.2,   25.8,   25.4,   24.5],
        [1923. ,   28.1,   29.7,   29.8,   27.5,   23.3,   19.9,   19.9,
           20.9,   20.2,   22.6,   26. ,   26.6,   24.5],
        [1924. ,   28.6,   30.3,   28. ,   29.4,   24.1,   21.3,   21.2,
           20.2,   22. ,   21.6,   25.5,   30.3,   25.2],
        [1925. ,   29. ,   28.9,   28.9,   25.7,   22.8,   21.7,   19.4,
           21.2,   22.4,   21.6,   28.3,   27.1,   24.8],
        [1926. ,   30.6,   26.7,   28.9,   24.4,   23.4,   20.9,   19.7,
           19.4,   22.2,   23.1,   26.4,   30.3,   24.7],
        [1927. ,   29.1,   28.8,   26.4,   29. ,   23.9,   20.1,   19.2,
           20.4,   22.8,   23.9,   25.5,   31.1,   25. ],
        [1928. ,   29.5,   29.8,   27.8,   26. ,   24.2,   21.4,   19.8,
           20.6,   20.8,   21.5,   25.4,   26.6,   24.5],
        [1929. ,   30.4,   29. ,   29.7,   25.9,   22.4,   20. ,   18.7,
           19.9,   21.6,   22.7,   23.5,   27.8,   24.3],
        [1930. ,   28.1,   27.3,   29.9,   26.4,   23.8,   21.4,   20. ,
           20.5,   22.5,   23.1,   25. ,   26.7,   24.6],
        [1931. ,   31.2,   27.9,   29.8,   26. ,   21.2,   18.8,   19.4,
           19.9,   20.2,   24.6,   27.9,   31.5,   24.9],
        [1932. ,   29.4,   31.1,   28.8,   26.2,   24.5,   20.8,   20. ,
           19.4,   21.3,   24.1,   25.3,   27.7,   24.9],
        [1933. ,   29.5,   30.9,   28.8,   28.6,   23.2,   21.4,   19.5,
           20.8,   22.5,   23.2,   29.1,   29.3,   25.6],
        [1934. ,   30.2,   30.9,   29.7,   25.6,   25.3,   19.8,   19. ,
           20.9,   21.1,   22.2,   26. ,   27.3,   24.8],
        [1935. ,   29.2,   29.9,   28.6,   26.9,   23.5,   20.3,   19.4,
           20.1,   19.9,   22.1,   23.5,   26.6,   24.2],
        [1936. ,   28.3,   28.9,   29.3,   27.5,   22.6,   20.7,   19.3,
           21. ,   24.2,   22.6,   25.9,   27.4,   24.8],
        [1937. ,   28.5,   28.1,   28.1,   28.5,   23.1,   21.5,   20.7,
           20.4,   21.1,   23.4,   25.7,   26. ,   24.6],
        [1938. ,   29.8,   27.7,   27.6,   25.6,   23.6,   20.5,   20.2,
           20.6,   23.4,   22.9,   24.7,   26.8,   24.4],
        [1939. ,   26. ,   30.4,   30.3,   26.6,   22.8,   21.1,   19.1,
           19.8,   22.6,   23.3,   26.4,   28.1,   24.7],
        [1940. ,   29.3,   32.4,   27.6,   26.5,   24. ,   22.9,   20.4,
           21.8,   22.9,   22.3,   26.6,   26.4,   25.3],
        [1941. ,   29.9,   29.1,   27.7,   25. ,   25.5,   21. ,   20.7,
           21.6,   21.3,   23.6,   24.8,   29. ,   24.9],
        [1943. ,   27.1,   29. ,   28.6,   26.2,   24. ,   21.2,   19.1,
           20.4,   21.5,   22.3,   26.8,   27.5,   24.5],
        [1944. ,   27.2,   28.9,   28. ,   26.5,   22.7,   22.3,   20.6,
           20.8,   21.7,   25.3,   27.5,   26.8,   24.9],
        [1945. ,   30.1,   31.2,   29.1,   28. ,   24.2,   21.6,   19.6,
           19.9,   21. ,   24.4,   27.1,   26.9,   25.3],
        [1946. ,   27.2,   29.3,   28.2,   27.4,   23.2,   20.6,   19.2,
           20.1,   22.5,   24.5,   24.6,   26. ,   24.4],
        [1947. ,   26.9,   29. ,   30.4,   26.1,   23.1,   21.8,   19.8,
           20.4,   21.7,   23.9,   24.1,   25.6,   24.4],
        [1948. ,   28.7,   30.1,   32.2,   26.3,   25.4,   22.3,   19.6,
           22.1,   22.1,   23.8,   24.7,   25.7,   25.3],
        [1949. ,   30.6,   30.6,   30.5,   28.1,   24. ,   22.7,   21.5,
           21.8,   22.4,   22.3,   26.8,   27.5,   25.7],
        [1950. ,   31.9,   30.8,   30.2,   28.5,   22.8,   21.5,   20.5,
           21.5,   22. ,   22.2,   25.1,   26.6,   25.3],
        [1951. ,   26.6,   29.3,   29.5,   26.6,   23. ,   20.2,   18.6,
           19.7,   22.2,   23.5,   24.4,   26.4,   24.2],
        [1952. ,   29.3,   31.5,   30.6,   29. ,   22.8,   22. ,   19.5,
           20.8,   21.9,   23.6,   23.8,   27.7,   25.2]]),
 array([[1945. ,   31.5,   32.1, ...,   25.8,   28.4,   24.4],
        [1946. ,   28.2,   29.8, ...,   24.4,   27.4,   23.6],
        [1947. ,   29.6,   32. , ...,   24.9,   26. ,   23.7],
        ...,
        [2017. ,   32. ,   30.3, ...,   30. ,   30.6,   25.3],
        [2018. ,   31.7,   30.8, ...,   25.7,   30.3,   25.1],
        [2019. ,   31.3,   32.2, ...,   29.8,   33.9,   26.1]])]

In [25]:

def clean(table):

    # Get rid of all the row containing nan
    nan_mask = np.logical_not(np.isnan(table[:,13]))
    filter_table = table[nan_mask]

    # Then keep all the valid rows
    mean_heat = np.mean(filter_table[:, 1:13], axis = 1)
    match = (np.absolute((filter_table[:,13] - mean_heat)) < 0.1)
    final_mean = filter_table[match]

    return final_mean

Write a function clean (table) that takes a 14 column table, as returned by get_temps_data() and cleans it according to the above assumptions.
Write a function clean_all (tables) that takes a list of tables and returns a cleaned list of tables.

To avoid a warning caused by nan's, it is suggested your remove the nan rows first.

Therefore, it is suggested you (in 4 lines of code):

make a mask that keeps all the rows that don't have null value for annual average temperature (you may find np.logical_not useful)
mask out those rows
make a mask that keeps all the rows that have a mean temperature within 0.1 of the reported annual average
mask out those rows

Visualisation

For convenience, define a method get_table (town, stations) that returns the temperature table for a named town.

Using the table for Albany, plot the annual averages (last column) against the years (first column).

You will notice that the plot has not treated missing data well. Rather than a line joining the missing data, it would be more instructive to have a gap. (Our plot of the cleaned data will also not include the final year in the x-axis.)

One way to leave gaps is to ensure all missing years are set to nan. Matplotlib will not plot these years.

To demonstrate this, plot the uncleaned data for Albany.

At first sight this appears to have solved our problem! We can just use the uncleaned data.

But actually this is just a coincidence of our specific data, not a general solution.

Have a look at the data file for Albany. It happens that the gap of missing data from 1966 to 2001 has a null value either side. This causes matplotlib not put a line between those values.

To demonstrate this, we can leave some other gaps without null values. A quick way to do this is by plotting every tenth value.

table = get_table("Albany",STATIONS)
plt.plot(table[::10,0], table[::10,-1])
plt.show()

Generate this plot. What do you see?

In [26]:

import matplotlib.pyplot as plt

def get_table(town, stations):

    # Uncleaned data (we only transfer NULL to nan in get_temparature function)
    (towns, tables) = get_temperatures(stations, quiet=True)
    return tables[towns.index(town)]

table = get_table("Albany",STATIONS)
plt.plot(table[:,0], table[:,-1], color = "red")
plt.plot(table[::10,0], table[::10,-1])
plt.show()

Augmenting the data

Matplotlib 'does its best' to anticpate what we want to see, and does a pretty good job in general! After all, it filled in all the years on the x-axis for us (allowing us to be a bit lazy), even though we only fed it x values for some of the years.

But it can only do so much. We will need to do it properly.

We will now 'augment' the data by including all of the years ranging from the first to the last year in the data. We will set the missing years to np.nan so that they are not plotted.

We can do all of this in numpy without using any loops.

Identifying missing years [1 lab mark]

Write a method missing_years (table) that takes a (uncleaned) table, and returns an array of years (as integers) that fall within the range of years from the first in the table to the last in the table, that either:

don't have a reported annual average in the table
have a reported annual average of "null"

Your method should not use any loops. You can use previously defined methods.

Hint: It is suggested that you break it down into the following steps:

use the information from the (uncleaned) table to generate an array of all the years in the range of your table
use the cleaned table to get an array of the years that have valid data
determine the indices of the valid years in the full range of years
generate a boolean mask over the full range of years that selects the non-valid years

Tip: You may also find the method np.astype() useful.

Check your results are correct by comparing (visually) with the downloaded tables.

In [27]:

def missing_years (table):

    first = table[0, 0].astype('int')
    last  = table[-1, 0].astype('int')

    # This is the year range in the table
    years = np.array(range(first, last + 1))

    filter_table = clean(table)

    valid_years = filter_table[:, 0]

    invalid_years = np.setdiff1d(years, valid_years)

    return invalid_years

In [28]:

from nose.tools import assert_equal
assert_equal(missing_years(get_table("Albany", STATIONS))[2], 1966)
assert_equal(len(missing_years(get_table("Albany", STATIONS))), 40)
assert_equal(len(missing_years(get_table("Perth", STATIONS))), 2)

Augmenting with `nan` [1 lab mark]

Write a function augmented (table) that returns a pair of arrays:
- the first is an array of all the years in the range of years covered in the table, as integers
- the second is an array of floats which contains:
  - the reported annual average for the corresponding year
  - np.nan where either the data was null or missing

Again the function should have no loops.

Hint: Use a similar structure to missing_years.

Again, check you are getting the right results on the downloaded tables.

In [29]:

def augmented (table):

    years = table[:, 0]

    results = table[:,-1]

    return (years, results)

In [30]:

# [Modify the tests below for your own problem]
# Check that squares returns the correct output for several inputs:
from nose.tools import assert_equal
assert_equal(augmented(get_table("Albany", STATIONS))[1][84], 19.1)
assert_equal(np.isnan(augmented(get_table("Albany", STATIONS))[1][85]), True)
assert_equal(np.isnan(augmented(get_table("Albany", STATIONS))[1][86]), True)

Plot the augmented table for Albany. Is it whate you expected?

Visualising the Complete Data!

At last its time to plot the historic rainfall. But because we've prepared well, this is very straightforward! It shouldn't more than about 4 lines of code to get and plot the data for all the towns in STATIONS (plus a few lines to label the graph).

Plot all the historical data for average annual temperatures for all towns on a single chart.

Part of the chart should look like this (yours should include all towns and all years):

Can you see any trends? How might you quantify this?

Feel the Heat!

Working with 2-Dimensional Arrays

Data Acquisition and Inspection

Set up

File paths

`np.genfromtxt`

Selecting items in 2-d arrays

Inspecting the year range for a town [1 lab mark]

Collecting all the data

A 'production' version [1 lab mark]

Data Cleaning and Conversion

Selecting a row or column

Selecting an area

Two ways to skin a cat?

Verifying the data

`np.stack`

`np.hstack`, `np.vstack` and `np.concatenate`

Putting it all together [1 lab mark]

Visualisation

Augmenting the data

Identifying missing years [1 lab mark]

Augmenting with `nan` [1 lab mark]

Visualising the Complete Data!

Product

Resources

Company

Feel the Heat!

Working with 2-Dimensional Arrays

Data Acquisition and Inspection

Set up

File paths

np.genfromtxt

Selecting items in 2-d arrays

Inspecting the year range for a town [1 lab mark]

Collecting all the data

A 'production' version [1 lab mark]

Data Cleaning and Conversion

Selecting a row or column

Selecting an area

Two ways to skin a cat?

Verifying the data

np.stack

np.hstack, np.vstack and np.concatenate

Putting it all together [1 lab mark]

Visualisation

Augmenting the data

Identifying missing years [1 lab mark]

Augmenting with nan [1 lab mark]

Visualising the Complete Data!

`np.genfromtxt`

`np.stack`

`np.hstack`, `np.vstack` and `np.concatenate`

Augmenting with `nan` [1 lab mark]