Lab 2 (Fake News! Case Study)
Covid Tests per Capita: Is the US Leading the World?
President Trump has repeatedly said that the US tests more than any other country in the world by far, and sometimes more than all the other countries put together. See, for example, the official White House transcript: Remarks by President Trump on Supporting our Nation's Small Businesses Through the Paycheck Protection Program, 28th April, 2020.
In this practical work, as part of our Fake News! case study, we'll investigate the veracity of one aspect of these claims, the per capita test rates.
Data Acquisition
Our World in Data
In the videos for this case study we obtained data, originally from the European CDC, through Our World in Data: https://ourworldindata.org/.
Read the About page to find out what this initiative is, what they hope to achieve, and who it is backed by. It is always important to know who is behind a data source in order to make an assessment about what degree of credibility to give the source.
Follow the link to Health | Coronavirus Pandemic, and then to Tests.
Note that the Data Scientists behind this site have provided a number of different ways of looking at and interpreting test rates. There is also a lot of background provided.
Scroll down to the section "Our checklist for COVID-19 testing data" and read through the ten items on the checklist.
This is a terrific example of Data Science done well!
Per-capita testing
Find the section entitled "How many tests are performed each day".
Have a look at how the Map is presented, with the time slider that allows you to see snapshots over time.
Then look at the Chart view and scroll the cursor over the chart.
Again, these are great examples of interactive data presentation.
Note: Consistent with many of the media sites, we will refer to these data in general terms as the "per capita" data to distinguish them from the totals data. More precisely, however, they are tests per 1000 people. That is, they are exactly 1000 times the per capita rate. The only reason for multiplying by 1000 is that it's easier to read, say, 0.08 than 0.00008.
Downloading and uploading the data
Go to DOWNLOAD and download the csv file daily-tests-per-thousand-people-smoothed-7-day.csv. Then open the folder icon in CoCalc, make sure you are in the same directory as this lab sheet, and drop (upload) the csv file into the directory.
Click on the file to open it in CoCalc and have a look at the file format.
You should see, as anticipated from the filename extension, that this is a comma separated values (csv) file: in each row, the fields are separated by commas. There is also a header line, indicating what the data in each field represents.
Reading in the data
As usual, start by setting up a constant with the path (in this case, it's just the name) of the data file, so you don't need to keep typing it. For this lab we'll use the version of this file from the 29th July, distributed with the lab, so that we are all using the same file. (Feel free to run your code on your own version too - just be aware that the outputs and test results may be different to those in this sheet.)
You can access this file with:
DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"
Note that in this lab we won't put empty cells for you to complete your code in. You can create cells as you need them using the '+' button.
Read and print out the first 5 lines of data. The output should start like this:
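A sketch of what this might look like follows. To keep the example self-contained, a tiny made-up sample file stands in for the real download, so the rows here are illustrative only, not real figures; with the real file you would simply open the DATA constant defined above and skip the sample-writing step.

```python
# A tiny made-up sample file standing in for the real download.
# The rows below are illustrative only, not real figures.
DATA = "sample-daily-tests.csv"
with open(DATA, "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n')
    f.write('Australia,AUS,"Jun 30, 2020",0.50\n')
    f.write('Australia,AUS,"Jul 1, 2020",0.52\n')

# Print the first 5 lines of the data file.
with open(DATA) as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line, end="")
```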
Data Conversion and Cleaning
From strings to lists
Read the first 5 lines again. This time use the split method to turn each line into a list before printing it out.
Your output should start like this:
Notice that we still have the newline character in the last item.
Use the strip method to remove whitespace before splitting the lines.
Your output should now start like this:
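A minimal sketch of the strip-then-split step, using a couple of made-up sample lines in place of the file (the values are illustrative only):

```python
# Two hypothetical lines in the file's format (the figures are made up)
sample_lines = [
    'Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n',
    'Australia,AUS,"Jul 1, 2020",0.52\n',
]

for line in sample_lines:
    fields = line.strip().split(",")   # strip the newline first, then split
    print(fields)
```

Note how the second line splits into five fields rather than four, because of the comma inside the quoted date.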
You may have noticed another problem, caused by the fact that the dates include commas. In fact, you may conclude that with this date format, the choice of a comma as the delimiter (separator) was not a particularly good one, and an alternative such as tab separated (tsv) would have been a better choice for these data.
Nevertheless, it is not ambiguous, because commas that are not intended as delimiters only appear within double quotes. The 'user' (our code in this case) is expected to take the quotes into account when splitting the lines.
There are many ways to deal with dates, but for now, we can just remove the comma and the quotes, since neither provide us with any information. The fields within the quotes (month, day and year) can be distinguished by their order (the comma is just for human consumption) and the quotes are redundant since it is a text file and all the fields will be read as strings.
Change your code so that it has a preprocessing step before it splits the lines. For each line after the header line your preprocessing step should:
find the positions of the two double quotes
replace the comma between the quotes with a space
replace the old date with the new date without a comma
throw a ValueError if the line doesn't have double quotes
We'll break it down into steps.
Print each line (of the first 5, other than the header, and with the whitespace removed) followed by the indices of the two double quotes:
Tip: You can get the double quote character by enclosing it in single quotes ('"').
Hint: Compare the find and index methods. Why would you choose one or the other?
Next, print the line followed by the date string:
Now do the same, except with the comma removed from the date string:
Next, print the original lines with the old date field replaced by the cleaned date field:
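One possible walk-through of these steps on a single made-up line (here the comma is simply removed rather than replaced, since the date already has a space after it; the exact date format in your file may differ):

```python
line = 'Australia,AUS,"Jul 1, 2020",0.52'      # hypothetical data line

# str.index (unlike str.find) raises a ValueError if no quote is found
first = line.index('"')
second = line.index('"', first + 1)
print(line, first, second)

old_date = line[first + 1:second]               # 'Jul 1, 2020'
new_date = old_date.replace(",", "")            # 'Jul 1 2020' (comma removed)
print(line, old_date)
print(line, new_date)

# Rebuild the line with the cleaned date in place of the quoted one
cleaned = line[:first] + new_date + line[second + 1:]
print(cleaned)
```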
Checked Solution [1 lab mark]
Now that we have this working, we don't want this preprocessing step 'muddying' up our code, so let's put it in a separate function.
Write a function clean(data_row) that takes as its argument a string, and:
strips any unnecessary whitespace characters from the ends
if it contains a string in double quotes, strips the quotes and the comma between them
if it doesn't contain any quotes (this will be the header line), strips everything in the line from the space before the first parenthesis
If called with the following code:
the output should start like this:
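One possible sketch of such a function is below. It is not the only (or necessarily the best) approach: it removes the comma inside the quotes rather than replacing it, and it assumes at most one quoted field per line, so test it against your own file before relying on it.

```python
def clean(data_row):
    """Strip whitespace; remove quotes and the comma between them,
    or, for the header (no quotes), drop the parenthesised text."""
    row = data_row.strip()
    first = row.find('"')
    if first != -1:
        second = row.find('"', first + 1)
        inner = row[first + 1:second]          # e.g. 'Jul 1, 2020'
        row = row[:first] + inner.replace(",", "") + row[second + 1:]
    else:
        paren = row.find('(')
        if paren != -1:
            row = row[:paren - 1]              # drop from the space before '('
    return row

print(clean('Australia,AUS,"Jul 1, 2020",0.52\n'))   # hypothetical line
```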
From a file to a list
Next, rather than printing the lines, store them all in a list (of lists).
Define a variable data_lists as an empty list ([]). Write code that cleans and splits the (entire) input into lists, and appends them to data_lists.
For example, the following should print the first five rows as a list of lists:
How many entries (lines of data) are there in the file?
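A sketch of this step, again using a tiny made-up sample file so the example is self-contained (clean_and_split here is a simplified stand-in for your own clean function followed by split):

```python
def clean_and_split(line):
    """Simplified stand-in for clean() followed by split()."""
    line = line.strip()
    if '"' in line:
        first = line.index('"')
        second = line.index('"', first + 1)
        line = (line[:first] + line[first + 1:second].replace(",", "")
                + line[second + 1:])
    return line.split(",")

# Hypothetical sample rows; the real code would read your DATA file
DATA = "sample-daily-tests.csv"
with open(DATA, "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand\n')
    f.write('Australia,AUS,"Jun 30, 2020",0.50\n')
    f.write('Australia,AUS,"Jul 1, 2020",0.52\n')

data_lists = []
with open(DATA) as f:
    for line in f:
        data_lists.append(clean_and_split(line))

print(data_lists[:5])
print(len(data_lists))    # number of lines, including the header
```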
Checking our cleaning so far
We now have a tidy list of lists, each with the four fields. Or do we?
With big data it may not be possible to manually look at every entry to see if we've accounted for every possibility. We should try to make our cleaning or preprocessing as general as possible so that we catch unexpected variations or bad data.
In practice, we often have to make some assumptions about the data. However we should endeavour to test these.
In this case, we've assumed that the patterns we see at the start of the file continue through the file. So let's check that assumption.
Write code that checks whether all your entries have 4 fields.
If not, why not? What have we missed?
Hint: Print out the first row (if there is one) where this is not true. Print out the number of that row, open the data file in CoCalc, and have a look at the data in that row. What do you find?
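A sketch of such a check follows. The data_lists here is a made-up example containing one deliberately bad row so the sketch is self-contained; in your code it would be the list built from the whole file, and what causes the bad rows there is for you to discover.

```python
# Made-up example data with one deliberately malformed row
data_lists = [
    ['Entity', 'Code', 'Date', 'Daily tests per thousand'],
    ['Australia', 'AUS', 'Jul 1 2020', '0.52'],
    ['Somewhere', 'Jul 1 2020', '0.10'],   # only 3 fields - why?
]

bad_row = None
for i, row in enumerate(data_lists):   # i counts rows from 0 (the header)
    if len(row) != 4:
        bad_row = i
        break

if bad_row is None:
    print("All rows have 4 fields")
else:
    print("Row", bad_row, "has", len(data_lists[bad_row]),
          "fields:", data_lists[bad_row])
```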
General vs Specific
The variations you see in the data are not unusual - remember that this is the collated official government data, it's not an 'exercise'. It is exactly the kind of thing you would deal with as a working Data Scientist.
We'll need to come back to some differences in the content of the data, but for now let's focus on the formatting. Or, more precisely, transforming the data from the format in which it is provided to a format that is suitable for our use.
Alter your code to do more cleaning as necessary so that it is transformed into a list of lists, each with four fields.
You should try to make your code as general as possible. This means that, rather than just adjust for the specific case, think about patterns.
For example, we have seen that the data format uses double quotes around fields containing the delimiter (a comma in this case). If the data is not corrupt (which is an assumption) therefore, we would expect the double quotes to always occur in pairs. By focussing only on dates, we have been more specific than we need to be, so we may miss other cases. A more general solution will assume that the same pattern may occur in other fields.
Ensure you re-test your code after any changes you make to ensure it satisfies the requirements. (You might find it useful to write them all down.)
Type casting
Finally, the last field, tests per 1000 people, should be a float.
Alter your code to change the tests ratio to a float.
Your first few lines should now look like this:
Again, we could make an assumption that the fourth string is always able to be converted to a float, but it's possible that somewhere in the file that is not true. Later we will deal with this using Exceptions. For now, it is good practice to do some checks before attempting to cast. Before casting, check that:
the fourth field is not an empty string
the first character in the string is a number
Tip: You may find string methods like isnumeric useful.
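A minimal sketch of guarded casting, using made-up rows in place of the full data (row[3] is the daily tests field; one row has a missing reading to show the guard in action):

```python
# Hypothetical cleaned rows; the second has a missing reading
rows = [
    ['Australia', 'AUS', 'Jul 1 2020', '0.52'],
    ['Australia', 'AUS', 'Jul 2 2020', ''],
]

for row in rows:
    field = row[3]
    # Only cast when the field is non-empty and starts with a digit
    if field != "" and field[0].isnumeric():
        row[3] = float(field)

print(rows)
```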
Remember to run your own tests, not rely on my tests or my sample output. For example, I've included the first few lines of output (the file is too large to include them all) but as we've seen those lines may not be indicative of the file as a whole. For one thing, they are all cases where the daily test ratio is zero.
Again, remember that a Data Scientist is like a detective - always thinking about what we could possibly have missed, and testing for it.
Checked Solution [2 marks]
Complete the function get_cleaned_lists(filename) so that it returns a list of lists, each containing the fields from the data file, with quotes, commas, and leading/trailing whitespace characters removed, any parenthesised text removed, and the daily tests ratio as a float.
Note: Your checked functions may call preceding functions that you have written - they do not have to all be in one long function. As always, however, you should ensure you validate your function with a "clean" kernel.
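One possible shape for this function is sketched below, pulling the earlier pieces together. It is a sketch only - it assumes at most one quoted field per line, and the demonstration uses a tiny made-up sample file - so generalise and test it yourself against the real data as discussed above.

```python
def clean(data_row):
    """Strip whitespace; remove quotes and the comma inside them;
    for the header (no quotes), drop the parenthesised text."""
    row = data_row.strip()
    first = row.find('"')
    if first != -1:
        second = row.find('"', first + 1)
        row = (row[:first] + row[first + 1:second].replace(",", "")
               + row[second + 1:])
    else:
        paren = row.find('(')
        if paren != -1:
            row = row[:paren - 1]
    return row

def get_cleaned_lists(filename):
    """Return the file as a list of lists, with the ratio as a float."""
    data_lists = []
    with open(filename) as file:
        for line in file:
            fields = clean(line).split(",")
            ratio = fields[-1]
            if ratio != "" and ratio[0].isnumeric():
                fields[-1] = float(ratio)
            data_lists.append(fields)
    return data_lists

# Demonstration with a tiny hypothetical sample file
with open("sample.csv", "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n'
            'Australia,AUS,"Jul 1, 2020",0.52\n')

print(get_cleaned_lists("sample.csv"))
```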
Data Presentation - Using Dictionaries
For this part it is left to you to break down the task into subtasks and test them as you go.
We want to be able to access the daily test ratio for a country on a given date without using loops.
Dictionaries provide much faster access to data by hashing the dictionary keys.
Store the daily tests data in a dictionary of dictionaries. The outer dictionary should use the countries as keys. The inner dictionary should use the dates as keys.
For example, if my outer dictionary is called country_dict, then evaluating country_dict["Australia"] should return:
and country_dict["Australia"]["Jul 1 2020"] should return:
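One way to build such a dictionary of dictionaries is sketched below, using a few made-up cleaned rows in place of the full data (the figures are illustrative only):

```python
# Hypothetical cleaned rows: country, code, date, tests per 1000 people
rows = [
    ['Australia', 'AUS', 'Jun 30 2020', 0.50],
    ['Australia', 'AUS', 'Jul 1 2020', 0.52],
    ['Poland', 'POL', 'Jul 1 2020', 0.31],
]

country_dict = {}
for country, code, date, ratio in rows:
    if country not in country_dict:
        country_dict[country] = {}     # new inner dictionary for this country
    country_dict[country][date] = ratio

print(country_dict["Australia"]["Jul 1 2020"])
```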
How many "countries" are there?
Print out each country, followed by its number of tests per 1000 people, on 1st July, 2020. [If the country doesn't have a reading for that day, it can simply be skipped.]
It will be clear from this that some countries have more than one entry - for example, "Poland" and "Poland people tested" will show as two separate "countries". While we could easily clean these out, we haven't looked closely enough at the sources of the data to determine the reason for the different figures, and whether one is better than the other, to have grounds for choosing one over the other (you may wish to follow this up). Therefore we need to leave them all in for now.
Presenting rankings in a table
Finally, write a function print_ranking(date) that takes a date (as a string) and prints a table of testing results for that date, ranked from highest testing rate to the lowest.
So, for example, print_ranking("Jul 1 2020") will start as follows:
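A minimal sketch of one way to write print_ranking, assuming a country_dict built as above (the tiny country_dict here uses made-up figures so the example is self-contained; your table formatting can be more elaborate):

```python
# Made-up figures for illustration only
country_dict = {
    "Australia": {"Jul 1 2020": 0.52},
    "Poland": {"Jul 1 2020": 0.31},
    "Denmark": {"Jul 1 2020": 1.91},
}

def print_ranking(date):
    # collect (ratio, country) pairs for countries reporting on that date
    results = [(tests[date], country)
               for country, tests in country_dict.items()
               if date in tests]
    # sorting (ratio, country) tuples in reverse ranks highest first
    for ratio, country in sorted(results, reverse=True):
        print(country, ratio)

print_ranking("Jul 1 2020")
```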
Checking the news
The White House statement cited in the introduction, above, was made on April 28th.
What do you make of President Trump's statements from a per capita perspective?
How does that compare with more recently?
In a May 11 Rose Garden briefing President Trump stated:
We’re testing more people per capita than South Korea, the United Kingdom, France, Japan, Sweden, Finland, and many other countries — and, in some cases, combined.
The BBC's May 15th article Coronavirus: President Trump’s testing claims fact-checked "fact-checks" this claim (Claim One).
Modify your function to the signature print_ranking(date, countries=[]) so that:
if countries is omitted, it still prints the table for all countries reporting on that day
if a list of countries is passed to the function, then it only prints the table for those countries
Print the table for the US and those six countries on 11th May.
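A sketch of the modified function follows, again with a tiny made-up country_dict (the figures are invented for illustration; run it on your real dictionary to answer the questions below). Since the sketch only reads the countries list, the mutable default argument is harmless here.

```python
# Made-up figures for illustration only
country_dict = {
    "United States": {"May 11 2020": 0.8},
    "South Korea": {"May 11 2020": 0.1},
    "Denmark": {"May 11 2020": 1.2},
}

def print_ranking(date, countries=[]):
    # an empty countries list means "all countries reporting on that date"
    results = [(tests[date], country)
               for country, tests in country_dict.items()
               if date in tests and (countries == [] or country in countries)]
    for ratio, country in sorted(results, reverse=True):
        print(country, ratio)

print_ranking("May 11 2020", ["United States", "South Korea"])
```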
Is the BBC's fact check for the 11th May borne out by these data?
Can you think of a possible reason for these discrepancies? (Hint: Should "we're testing" be interpreted as a rate or as a cumulative total?)
On 22nd June Newsweek, in Why Trump Is Both Right and Wrong About U.S. Coronavirus Testing Numbers, compares the US with Russia, Spain, Germany and Portugal on cumulative per capita figures.
How does this compare with the picture you get for the daily rate at this date?
Congratulations - you can now get a job as a fact checking journalist!
© Cara MacNish, University of Western Australia