Project: Ruida He - CITS2402 Introduction to Data Science

Path: Labs / CensusAge / CensusAge.ipynb

Views: ⁵³¹³
Image: default

Kernel: Python 3 (system-wide)

Lab 3

Exploring the Census Age Data

In The Census and the "Typical" Australian?", we discussed the age chosen for Clare (38), the "Typical" Australian, and Paul (37), the "Typical" Western Australian.

We saw that the ABS chose to use the mode for most attributes, while for age they chose to use the median. We asked why they may have made that decision. (Noting as well that in their Quickstats they also used some means (averages), somewhat confusingly, for the same attributes.)

In this lab we'll explore the data, and see if you think that choice is justified.

It is recommended that the case study videos are viewed in conjunction with this lab - either beforehand to get some direction and/or afterwards to review your approach, depending on your learning strategy.

Data Acquisition

Download the 2016 Census General Community Profile datapack from the ABS.

Open the Metadata spreadsheet (in your preferred spreadsheet software) and find the table that reports age data from the Census.

Go to the Cell Descriptors for that data. You should find age data starting from the age of zero.

Q: What happens when you get to 80?

You should see that people over 80 do not get the same representation as those under 80. Can you think of any good reason to justify this blatant ageism?

Before you upload the data to CoCalc, check the file size. If you choose to upload the whole zip file, it should be less than 5MB (if not, check that you have only downloaded the data for Australia as a whole, not all the separate regions as well.) It is recommended that you remove an additional 3.5MB by unzipping the package and deleting the file 2016Census_geog_desc_1st_and_2nd_release.xlsx which is not needed for the labs.
Alternatively, you can just upload the components you need. The tables (csv files) you need for the lab are each less than 5KB.

Upload the data to the directory for this Lab.
Set up suitable constants to point to the age data. Read in and print out the data to ensure you have accessed it correctly.

Data Cleaning

We are going to examine the age profile of the population as a whole.

Have another look at the Cell Descriptors in the spreadsheet. Which rows of data do you need for this task?

Let's start by pulling the data into lists.

Write some code to read in the data and store the data you need for this task in lists. You can strip out some of the rows you don't need at this stage, however don't strip out the 5 year totals yet, as we'll need those for the next step. Make any appropriate type conversions, and tidy up category names if you wish.

Tip: A similar example is covered in the videos.

For comparison, I have two lists with 101 items in each after this step.

Data Augmentation

As the ABS have inexplicably left off people over 80, we will have to reconstruct the data as best we can. This is akin to the Statistical Imputation discussed in the videos (where we discussed Hot Decking, Mean Substitution, Probabilistic and Regression methods).

Checked Answer [1 prac mark]

As a first approximation we will simply use an averaging approach. We'll make an assumption that the number of people in a five year category over 80 is the approximately the same for each year.

Write a function spread(age_cat, num) that:
- takes an argument age_cat that represents a date range in the format Age_yr_n1_n2 (eg. Age_yr_80_84) and an integer num
- returns a pair (2-tuple) of lists:
  - the first list should contain strings in the form Age_yr_n where n ranges from the first to the last year in the range (eg. Age_yr_80, Age_yr_81,...)
  - the second list contain a list of integers adding up to num such that:
    - if num is divisible by the number of years, then each integer should be the same (ie. the people are evenly spread)
    - if there is a remainder from dividing num by the number of years, the remainder should be added one at a time to the year groups starting at the lowest age (80 in this example) until there are no more remaining

For example, spread('Age_yr_80_84', 7) should return:

(['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [2, 2, 1, 1, 1])

Tip: You might find divmod useful. Also, recall that + and * can be used to create lists.

In [1]:

def Pre_process(DATA1, DATA2):

    f1 = open(DATA1, "r")
    f2 = open(DATA2, "r")

    age_cat1 = []
    age_num1 = []

    age_cat2 = []
    age_num2 = []

    age_cat = []
    age_num = []


    # Dealing with the file A
    lst1 = f1.readlines()
    lst1[0] = lst1[0].split(",")
    lst1[1] = lst1[1].split(",")

    for col in range(3, len(lst1[0]), 3):
        age_cat1.append(lst1[0][col])

    for col in range(3, len(lst1[1]), 3):
        age_num1.append(int(lst1[1][col]))


    # Dealing with the file B
    lst2 = f2.readlines()
    lst2[0] = lst2[0].split(",")
    lst2[1] = lst2[1].split(",")

    # Get rid of the nect line character
    lst2[0][-1] = lst2[0][-1].strip("\n")

    for col in range(1, len(lst2[0]), 3):
        age_cat2.append(lst2[0][col])

    for col in range(1, len(lst2[1]), 3):
        age_num2.append(int(lst2[1][col]))

    age_cat = age_cat1 + age_cat2
    age_num = age_num1 + age_num2

    for i in range(len(age_cat)):
        age_cat[i] = age_cat[i].replace("_P", "")

    return age_cat, age_num

DATA1 = "2016Census_G04A_AUS.csv"
DATA2 = "2016Census_G04B_AUS.csv"
# Cat = Pre_process(DATA1, DATA2)[0]
# Num = Pre_process(DATA1, DATA2)[1]
Cat, Num = Pre_process(DATA1, DATA2)
# print(Cat)
# print(Num)

In [2]:

def spread(age_cat, num):

    initial = age_cat[:7]

    age_cat = age_cat.split("_")

    group_num = int(age_cat[-1]) - int(age_cat[-2]) + 1

    quotient, remainder = divmod(num, group_num)

    result_num = []
    result_cat = []

    for i in range(group_num):

        result_cat.append(initial + str(int(age_cat[-2]) + i))

        result_num.append(quotient)
        if remainder != 0:
            result_num[i] += 1
            remainder = remainder - 1

    return result_cat, result_num



print(spread('Age_yr_80_84',0))
print(460549%5)
print(int(460549/5))

(['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [0, 0, 0, 0, 0])
4
92109

In [3]:

from nose.tools import assert_equal
assert_equal(spread('Age_yr_80_84', 7), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [2, 2, 1, 1, 1]))
assert_equal(spread('Age_yr_80_89', 7), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84', 'Age_yr_85', 'Age_yr_86', 'Age_yr_87', 'Age_yr_88', 'Age_yr_89'], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))
assert_equal(spread('Age_yr_80_84', 0), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [0, 0, 0, 0, 0]))
assert_equal(spread('Age_yr_80_84', 460549), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [92110, 92110, 92110, 92110, 92109]))
print("So far, so good. Additional test cases will be applied.")

So far, so good. Additional test cases will be applied.

Using your spread function, complete (reconstruct) the lists of age categories, following the pattern of lower age years, as follows:
- for the age ranges from 80-84 to 95-99, insert the categories and ages for each year calculated by spreading the total number over the previous 5 years
- for the age range 100 years and over, assume the ages are spread over the range 100-104 (again a poor assumption) and insert as above

Combine your list of categories and your list of numbers into a list of pairs containing only entries for each single year.
In the process also strip out "yr_" from the category names.

Tip: Use zip to combine the lists.

Your list should now start like this:

[('Age_0', 276227),
 ('Age_1', 293503),
 ('Age_2', 295142),
 ('Age_3', 299725),
 ('Age_4', 300184),
 ('Age_5', 298271),
 ('Age_6', 302901),
 ('Age_7', 299413),
 ...

Check your list has the number of entries that you would expect.

Checked Answer [1 prac mark]

Drawing from your work so far, define a function cleaned_data() that:
- reads in the age data from the ABS datapack
- augments the data with the missing age data as described above
- keeps only the necessary data (number of persons for each year)
- shortens the category names to the form Age_n
- returns the data in a list of pairs in the form (category_name, number)

Check that your output is as you would expect, before applying the sample tests.

Note that your function can call functions you have defined previously. As usual, however, check that it works with a fresh kernel.

In [4]:

def cleaned_data():

    # My preprocessing
    DATA1 = "2016Census_G04A_AUS.csv"
    DATA2 = "2016Census_G04B_AUS.csv"
    Cat, Num = Pre_process(DATA1, DATA2)

    age_cat = []
    age_num = []

    idx1 = Cat.index("Age_yr_80_84")
    idx2 = Cat.index("Age_yr_100_yr_over")
    Cat[idx2] = "Age_yr_100_105"

    for i in range(idx1):
        if len(Cat[i]) < 10:
            age_cat.append(Cat[i])
            age_num.append(Num[i])

    for i in range(idx1, idx2 + 1):
        age_cat1, age_num1 = spread(Cat[i], Num[i])

        age_cat += age_cat1
        age_num += age_num1

    for k in range(len(age_cat)):
        age_cat[k] = age_cat[k].replace("_yr", "")

    result = list(zip(age_cat, age_num))

    return result

print(cleaned_data())

[('Age_0', 276227), ('Age_1', 293503), ('Age_2', 295142), ('Age_3', 299725), ('Age_4', 300184), ('Age_5', 298271), ('Age_6', 302901), ('Age_7', 299413), ('Age_8', 301400), ('Age_9', 300664), ('Age_10', 291803), ('Age_11', 280279), ('Age_12', 277765), ('Age_13', 274328), ('Age_14', 273014), ('Age_15', 278837), ('Age_16', 282097), ('Age_17', 282478), ('Age_18', 284249), ('Age_19', 293937), ('Age_20', 302373), ('Age_21', 310938), ('Age_22', 311388), ('Age_23', 318881), ('Age_24', 323217), ('Age_25', 329436), ('Age_26', 336235), ('Age_27', 331744), ('Age_28', 334236), ('Age_29', 332945), ('Age_30', 342321), ('Age_31', 341084), ('Age_32', 342332), ('Age_33', 343142), ('Age_34', 334962), ('Age_35', 329654), ('Age_36', 317170), ('Age_37', 308794), ('Age_38', 304021), ('Age_39', 302040), ('Age_40', 305551), ('Age_41', 306170), ('Age_42', 315801), ('Age_43', 321509), ('Age_44', 334220), ('Age_45', 342135), ('Age_46', 319874), ('Age_47', 315888), ('Age_48', 306342), ('Age_49', 297225), ('Age_50', 298409), ('Age_51', 296133), ('Age_52', 308770), ('Age_53', 311129), ('Age_54', 309113), ('Age_55', 307724), ('Age_56', 296245), ('Age_57', 288988), ('Age_58', 284495), ('Age_59', 276881), ('Age_60', 273932), ('Age_61', 263946), ('Age_62', 257389), ('Age_63', 255829), ('Age_64', 248301), ('Age_65', 245354), ('Age_66', 239654), ('Age_67', 231175), ('Age_68', 230653), ('Age_69', 242161), ('Age_70', 200020), ('Age_71', 189011), ('Age_72', 180760), ('Age_73', 159896), ('Age_74', 158024), ('Age_75', 145679), ('Age_76', 138588), ('Age_77', 130681), ('Age_78', 122201), ('Age_79', 115511), ('Age_80', 92110), ('Age_81', 92110), ('Age_82', 92110), ('Age_83', 92110), ('Age_84', 92109), ('Age_85', 61792), ('Age_86', 61792), ('Age_87', 61792), ('Age_88', 61792), ('Age_89', 61792), ('Age_90', 28080), ('Age_91', 28080), ('Age_92', 28080), ('Age_93', 28079), ('Age_94', 28079), ('Age_95', 6784), ('Age_96', 6784), ('Age_97', 6784), ('Age_98', 6784), ('Age_99', 6784), ('Age_100', 595), ('Age_101', 595), ('Age_102', 595), ('Age_103', 595), ('Age_104', 595), ('Age_105', 594)]

In [5]:

from nose.tools import assert_equal
assert_equal(cleaned_data()[0], ('Age_0', 276227))
assert_equal(cleaned_data()[-1], ('Age_105', 594))
assert_equal(len(cleaned_data()), 106)
print("So far, so good. Additional test cases will be applied.")

So far, so good. Additional test cases will be applied.

Checked Answer [1 prac mark]

Without using any libraries, write a function central_measures(clean_data) that takes a list of (age_year, number) pairs, and returns a triple (mean, median, mode), with mean as the nearest integer, containing the measures of central tendency for age for the list.

Test your function thoroughly. Some example tests are included below. Do you agree with the answers given?

In [22]:

def central_measures (clean_data):

    length = len(clean_data);
    all  = 0
    mean = 0
    temp = 0

    # Get the mode
    for i in range(length):
        all += clean_data[i][-1]
        if temp < clean_data[i][-1]:
            temp = clean_data[i][-1]
            mode = int(clean_data[i][0].split("_")[-1])


    for i in range(length):
        year = int(clean_data[i][0].split("_")[-1])
        mean += year * clean_data[i][-1] / all

    # Get the nearest integer for mean
    mean = round(mean)

    temp = 0
    median = 0
    median_num = int(all / 2)

    for k in range(length - 1):
        temp += clean_data[k][-1]
        if median_num > temp:
            median = int(clean_data[k + 1][0].split("_")[-1])

    return (mean, median, mode)


print(central_measures(cleaned_data()[:3]))

(1, 1, 2)

In [24]:

from nose.tools import assert_equal
assert_equal(central_measures(cleaned_data()[:1]), (0,0,0))
assert_equal(central_measures(cleaned_data()[:2]), (1,1,1))
assert_equal(central_measures(cleaned_data()[:3]), (1,1,2))
assert_equal(central_measures(cleaned_data()[:4]), (2,2,3))

Q: What are the measures of central tendency for your data as a whole? How do they compare with the "typical" Australians?

Bear in mind that these numbers will depend on the assumptions we made when filling in the missing data!

Visualisation

Use pyplot to produce a basic vertical bar chart for your data.

How would you improve your chart to better understand the data from the plot?

Improve your plot to make it 'production-ready':
- choose an appropriate width for the plot that allows individual years to be better distinguished
- place the ticks at 5 year intervals along the x-axis
- label the ticks, with the labels written vertically
- label your axes and give the chart a title
- save your chart to a png file

The bottom of your plot should look like this:

Open the png file in CoCalc and check that it looks as you would expect.

Q: What can you learn from the plot? What do you make of the ABS' assertion that the 'typical Australians' are in the 36-38 year range? What age would you choose looking at the distribution?

Q: Peter Costello introduced the 'baby bonus' around 2002, and made his plea to parents in 2004. The census was taken in 2016. Do you see any (circumstantial) evidence that parents may have heeded his plea? (Bearing in mind, of course, that correlation does not imply causation.))

Challenge

We filled in the missing data using the average for the age range (in a similar spirit to Mean Substitution). This hasn't turned out too badly, but we could do better.

While staying true to the total number in each age range, improve your method for filling in the missing data to make it as realistic as you can. Remember that you would like to try to use a general approach that would also work on other sets of data, rather than one that is specific to this set of data.

(Fitting the data to a trend is a bit like using a Regression approach to Statistical Imputation, although you are not being asked to use regression here. We will come to this later.)

Which of the measures of central tendency could be affected by this? Check whether the measures have changed.

More Babies

The Census and the 'Typical' Australian?, Part 6 includes an exercise to change the line plot so that it uses the percentage on the y-axis rather than the raw numbers. Part 7 asks if you see evidence for the Treasurer's baby boom.

Go ahead and complete the exercise if you haven't done so already.

Exploring the Census Age Data

Data Acquisition

Data Cleaning

Data Augmentation

Checked Answer [1 prac mark]

Checked Answer [1 prac mark]

Checked Answer [1 prac mark]

Visualisation

Challenge

More Babies

Product

Resources

Company