Lab 2 (Fake News! Case Study)
Covid Tests per Capita: Is the US Leading the World?
President Trump has repeatedly said that the US tests more than any other country in the world by far, and sometimes more than all the other countries put together. See, for example, the official White House transcript: Remarks by President Trump on Supporting our Nation's Small Businesses Through the Paycheck Protection Program, 28th April, 2020.
In this practical work, as part of our Fake News! case study, we'll investigate the veracity of one aspect of these claims, the per capita test rates.
Data Acquisition
Our World in Data
In the videos for this case study we obtained data, originally from the European CDC, through Our World in Data: https://ourworldindata.org/.
Read the About page to find out what this initiative is, what they hope to achieve, and who it is backed by. It is always important to know who is behind a data source in order to make an assessment about what degree of credibility to give the source.
Follow the link to Health | Coronavirus Pandemic, and then to Tests.
Note that the Data Scientists behind this site have provided a number of different ways of looking at and interpreting test rates. There is also a lot of background provided.
Scroll down to the section "Our checklist for COVID-19 testing data" and read through the ten items on the checklist.
This is a terrific example of Data Science done well!
Per-capita testing
Find the section entitled "How many tests are performed each day".
Have a look at how the Map is presented, with the time slider that allows you to see snapshots over time.
Then look at the Chart view and scroll the cursor over the chart.
Again, these are great examples of interactive data presentation.
Note: Consistent with many of the media sites, we will refer to these data in general terms as the "per capita" data to distinguish them from the totals data. More precisely, however, they are tests per 1000 people. That is, they are exactly 1000 times the per capita rate. The only reason for multiplying by 1000 is that it's easier to read, say, 0.08 than 0.00008.
Downloading and uploading the data
Go to DOWNLOAD and download the csv file daily-tests-per-thousand-people-smoothed-7-day.csv. Then open the folder icon in CoCalc, make sure you are in the same directory as this lab sheet, and drop (upload) the csv file into the directory.
Click on the file to open it in CoCalc and have a look at the file format.
You should see, as anticipated from the filename extension, that this is a comma separated values (csv) file: in each row, the fields are separated by commas. There is also a header line, indicating what the data in each field represents.
Reading in the data
As usual, start by setting up a constant with the path (in this case, it's just the name) of the data file, so you don't need to keep typing it. For this lab we'll use the version of this file from the 29th July, distributed with the lab, so that we are all using the same file. (Feel free to run your code on your own version too - just be aware that the outputs and test results may be different to those in this sheet.)
You can access this file with:
DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"
Note that in this lab we won't put empty cells for you to complete your code in. You can create cells as you need them using the '+' button.
Read and print out the first 5 lines of data. The output should start like this:
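A sketch of what this might look like follows. To keep the example self-contained, a tiny made-up sample file stands in for the real download, so the rows here are illustrative only, not real figures; with the real file you would simply open the DATA constant defined above and skip the sample-writing step.

```python
# A tiny made-up sample file standing in for the real download.
# The rows below are illustrative only, not real figures.
DATA = "sample-daily-tests.csv"
with open(DATA, "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n')
    f.write('Australia,AUS,"Jun 30, 2020",0.50\n')
    f.write('Australia,AUS,"Jul 1, 2020",0.52\n')

# Print the first 5 lines of the data file.
with open(DATA) as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line, end="")
```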
Data Conversion and Cleaning
From strings to lists
Read the first 5 lines again. This time use the split method to turn each line into a list before printing it out.
Your output should start like this:
Notice that we still have the newline character in the last item.
Use the strip method to remove whitespace before splitting the lines.
Your output should now start like this:
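A minimal sketch of the strip-then-split step, using a couple of made-up sample lines in place of the file (the values are illustrative only):

```python
# Two hypothetical lines in the file's format (the figures are made up)
sample_lines = [
    'Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n',
    'Australia,AUS,"Jul 1, 2020",0.52\n',
]

for line in sample_lines:
    fields = line.strip().split(",")   # strip the newline first, then split
    print(fields)
```

Note how the second line splits into five fields rather than four, because of the comma inside the quoted date.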
You may have noticed another problem, caused by the fact that the dates include commas. In fact, you may conclude that with this date format, the choice of a comma as the delimiter (separator) was not a particularly good one, and an alternative such as tab separated (tsv) would have been a better choice for these data.
Nevertheless, it is not ambiguous, because commas that are not intended as delimiters only appear within double quotes. The 'user' (our code in this case) is expected to take the quotes into account when splitting the lines.
There are many ways to deal with dates, but for now, we can just remove the comma and the quotes, since neither provide us with any information. The fields within the quotes (month, day and year) can be distinguished by their order (the comma is just for human consumption) and the quotes are redundant since it is a text file and all the fields will be read as strings.
Change your code so that it has a preprocessing step before it splits the lines. For each line after the header line your preprocessing step should:
find the positions of the two double quotes
replace the comma between the quotes with a space
replace the old date with the new date without a comma
throw a ValueError if the line doesn't have double quotes
We'll break it down into steps.
Print each line (of the first 5, other than the header, and with the whitespace removed) followed by the indices of the two double quotes:
Tip: You can get the double quote character by enclosing it in single quotes ('"').
Hint: Compare the find and index methods. Why would you choose one or the other?
Next, print the line followed by the date string:
Now do the same, except with the comma removed from the date string:
Next, print the original lines with the old date field replaced by the cleaned date field:
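One possible walk-through of these steps on a single made-up line (here the comma is simply removed rather than replaced, since the date already has a space after it; the exact date format in your file may differ):

```python
line = 'Australia,AUS,"Jul 1, 2020",0.52'      # hypothetical data line

# str.index (unlike str.find) raises a ValueError if no quote is found
first = line.index('"')
second = line.index('"', first + 1)
print(line, first, second)

old_date = line[first + 1:second]               # 'Jul 1, 2020'
new_date = old_date.replace(",", "")            # 'Jul 1 2020' (comma removed)
print(line, old_date)
print(line, new_date)

# Rebuild the line with the cleaned date in place of the quoted one
cleaned = line[:first] + new_date + line[second + 1:]
print(cleaned)
```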
Checked Solution [1 lab mark]
Now that we have this working, we don't want this preprocessing step 'muddying' up our code, so let's put it in a separate function.
Write a function clean(data_row) that takes as its argument a string, and:
strips any unnecessary whitespace characters from the ends
if it contains a string in double quotes, strips the quotes and the comma between them
if it doesn't contain any quotes (this will be the header line), strips everything in the line from the space before the first parenthesis
If called with the following code:
the output should start like this:
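One possible sketch of such a function is below. It is not the only (or necessarily the best) approach: it removes the comma inside the quotes rather than replacing it, and it assumes at most one quoted field per line, so test it against your own file before relying on it.

```python
def clean(data_row):
    """Strip whitespace; remove quotes and the comma between them,
    or, for the header (no quotes), drop the parenthesised text."""
    row = data_row.strip()
    first = row.find('"')
    if first != -1:
        second = row.find('"', first + 1)
        inner = row[first + 1:second]          # e.g. 'Jul 1, 2020'
        row = row[:first] + inner.replace(",", "") + row[second + 1:]
    else:
        paren = row.find('(')
        if paren != -1:
            row = row[:paren - 1]              # drop from the space before '('
    return row

print(clean('Australia,AUS,"Jul 1, 2020",0.52\n'))   # hypothetical line
```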
From a file to a list
Next, rather than printing the lines, store them all in a list (of lists).
Define a variable data_lists as an empty list ([]). Write code that cleans and splits the (entire) input into lists, and appends them to data_lists.
For example, the following should print the first five rows as a list of lists:
How many entries (lines of data) are there in the file?
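A sketch of this step, again using a tiny made-up sample file so the example is self-contained (clean_and_split here is a simplified stand-in for your own clean function followed by split):

```python
def clean_and_split(line):
    """Simplified stand-in for clean() followed by split()."""
    line = line.strip()
    if '"' in line:
        first = line.index('"')
        second = line.index('"', first + 1)
        line = (line[:first] + line[first + 1:second].replace(",", "")
                + line[second + 1:])
    return line.split(",")

# Hypothetical sample rows; the real code would read your DATA file
DATA = "sample-daily-tests.csv"
with open(DATA, "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand\n')
    f.write('Australia,AUS,"Jun 30, 2020",0.50\n')
    f.write('Australia,AUS,"Jul 1, 2020",0.52\n')

data_lists = []
with open(DATA) as f:
    for line in f:
        data_lists.append(clean_and_split(line))

print(data_lists[:5])
print(len(data_lists))    # number of lines, including the header
```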
Checking our cleaning so far
We now have a tidy list of lists, each with the four fields. Or do we?
With big data it may not be possible to manually look at every entry to see if we've accounted for every possibility. We should try to make our cleaning or preprocessing as general as possible so that we catch unexpected variations or bad data.
In practice, we often have to make some assumptions about the data. However we should endeavour to test these.
In this case, we've assumed that the patterns we see at the start of the file continue through the file. So let's check that assumption.
Write code that checks whether all your entries have 4 fields.
If not, why not? What have we missed?
Hint: Print out the first row (if there is one) where this is not true. Print out the number of that row, open the data file in CoCalc, and have a look at the data in that row. What do you find?
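A sketch of such a check follows. The data_lists here is a made-up example containing one deliberately bad row so the sketch is self-contained; in your code it would be the list built from the whole file, and what causes the bad rows there is for you to discover.

```python
# Made-up example data with one deliberately malformed row
data_lists = [
    ['Entity', 'Code', 'Date', 'Daily tests per thousand'],
    ['Australia', 'AUS', 'Jul 1 2020', '0.52'],
    ['Somewhere', 'Jul 1 2020', '0.10'],   # only 3 fields - why?
]

bad_row = None
for i, row in enumerate(data_lists):   # i counts rows from 0 (the header)
    if len(row) != 4:
        bad_row = i
        break

if bad_row is None:
    print("All rows have 4 fields")
else:
    print("Row", bad_row, "has", len(data_lists[bad_row]),
          "fields:", data_lists[bad_row])
```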
General vs Specific
The variations you see in the data are not unusual - remember that this is the collated official government data, it's not an 'exercise'. It is exactly the kind of thing you would deal with as a working Data Scientist.
We'll need to come back to some differences in the content of the data, but for now let's focus on the formatting. Or, more precisely, transforming the data from the format in which it is provided to a format that is suitable for our use.
Alter your code to do more cleaning as necessary so that it is transformed into a list of lists, each with four fields.
You should try to make your code as general as possible. This means that, rather than just adjust for the specific case, think about patterns.
For example, we have seen that the data format uses double quotes around fields containing the delimiter (a comma in this case). If the data is not corrupt (which is an assumption) therefore, we would expect the double quotes to always occur in pairs. By focussing only on dates, we have been more specific than we need to be, so we may miss other cases. A more general solution will assume that the same pattern may occur in other fields.
Ensure you re-test your code after any changes you make to ensure it satisfies the requirements. (You might find it useful to write them all down.)
Type casting
Finally, the last field, tests per 1000 people, should be a float.
Alter your code to change the tests ratio to a float.
Your first few lines should now look like this:
Again, we could make an assumption that the fourth string is always able to be converted to a float, but it's possible that somewhere in the file that is not true. Later we will deal with this using Exceptions. For now, it is good practice to do some checks before attempting to cast. Before casting, check that:
the fourth field is not an empty string
the first character in the string is a number
Tip: You may find string methods like isnumeric useful.
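A minimal sketch of guarded casting, using made-up rows in place of the full data (row[3] is the daily tests field; one row has a missing reading to show the guard in action):

```python
# Hypothetical cleaned rows; the second has a missing reading
rows = [
    ['Australia', 'AUS', 'Jul 1 2020', '0.52'],
    ['Australia', 'AUS', 'Jul 2 2020', ''],
]

for row in rows:
    field = row[3]
    # Only cast when the field is non-empty and starts with a digit
    if field != "" and field[0].isnumeric():
        row[3] = float(field)

print(rows)
```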
Remember to run your own tests, not rely on my tests or my sample output. For example, I've included the first few lines of output (the file is too large to include them all) but as we've seen those lines may not be indicative of the file as a whole. For one thing, they are all cases where the daily test ratio is zero.
Again, remember that a Data Scientist is like a detective - always thinking about what we could possibly have missed, and testing for it.
Checked Solution [2 marks]
Complete the function get_cleaned_lists(filename) so that it returns a list of lists, each containing the fields from the data file, with quotes, commas, and leading/trailing whitespace characters removed, any parenthesised text removed, and the daily tests ratio as a float.
Note: Your checked functions may call preceding functions that you have written - they do not have to all be in one long function. As always, however, you should ensure you validate your function with a "clean" kernel.
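One possible shape for this function is sketched below, pulling the earlier pieces together. It is a sketch only - it assumes at most one quoted field per line, and the demonstration uses a tiny made-up sample file - so generalise and test it yourself against the real data as discussed above.

```python
def clean(data_row):
    """Strip whitespace; remove quotes and the comma inside them;
    for the header (no quotes), drop the parenthesised text."""
    row = data_row.strip()
    first = row.find('"')
    if first != -1:
        second = row.find('"', first + 1)
        row = (row[:first] + row[first + 1:second].replace(",", "")
               + row[second + 1:])
    else:
        paren = row.find('(')
        if paren != -1:
            row = row[:paren - 1]
    return row

def get_cleaned_lists(filename):
    """Return the file as a list of lists, with the ratio as a float."""
    data_lists = []
    with open(filename) as file:
        for line in file:
            fields = clean(line).split(",")
            ratio = fields[-1]
            if ratio != "" and ratio[0].isnumeric():
                fields[-1] = float(ratio)
            data_lists.append(fields)
    return data_lists

# Demonstration with a tiny hypothetical sample file
with open("sample.csv", "w") as f:
    f.write('Entity,Code,Date,Daily tests per thousand (7-day smoothed)\n'
            'Australia,AUS,"Jul 1, 2020",0.52\n')

print(get_cleaned_lists("sample.csv"))
```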
Data Presentation - Using Dictionaries
For this part it is left to you to break down the task into subtasks and test them as you go.
We want to be able to access the daily test ratio for a country on a given date without using loops.
Dictionaries provide much faster access to data by hashing the dictionary keys.
Store the daily tests data in a dictionary of dictionaries. The outer dictionary should use the countries as keys. The inner dictionary should use the dates as keys.
For example, if my outer dictionary is called country_dict, then evaluating country_dict["Australia"] should return:
and country_dict["Australia"]["Jul 1 2020"] should return:
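One way to build such a dictionary of dictionaries is sketched below, using a few made-up cleaned rows in place of the full data (the figures are illustrative only):

```python
# Hypothetical cleaned rows: country, code, date, tests per 1000 people
rows = [
    ['Australia', 'AUS', 'Jun 30 2020', 0.50],
    ['Australia', 'AUS', 'Jul 1 2020', 0.52],
    ['Poland', 'POL', 'Jul 1 2020', 0.31],
]

country_dict = {}
for country, code, date, ratio in rows:
    if country not in country_dict:
        country_dict[country] = {}     # new inner dictionary for this country
    country_dict[country][date] = ratio

print(country_dict["Australia"]["Jul 1 2020"])
```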
How many "countries" are there?
Print out each country, followed by its number of tests per 1000 people, on 1st July, 2020. [If the country doesn't have a reading for that day, it can simply be skipped.]
It will be clear from this that some countries have more than one entry - for example, "Poland" and "Poland people tested" will show as two separate "countries". While we could easily clean these out, we haven't looked closely enough at the sources of the data to determine the reason for the different figures, and whether one is better than the other, to have grounds for choosing one over the other (you may wish to follow this up). Therefore we need to leave them all in for now.
Presenting rankings in a table
Finally, write a function print_ranking(date) that takes a date (as a string) and prints a table of testing results for that date, ranked from highest testing rate to the lowest.
So, for example, print_ranking("Jul 1 2020") will start as follows:
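A minimal sketch of one way to write print_ranking, assuming a country_dict built as above (the tiny country_dict here uses made-up figures so the example is self-contained; your table formatting can be more elaborate):

```python
# Made-up figures for illustration only
country_dict = {
    "Australia": {"Jul 1 2020": 0.52},
    "Poland": {"Jul 1 2020": 0.31},
    "Denmark": {"Jul 1 2020": 1.91},
}

def print_ranking(date):
    # collect (ratio, country) pairs for countries reporting on that date
    results = [(tests[date], country)
               for country, tests in country_dict.items()
               if date in tests]
    # sorting (ratio, country) tuples in reverse ranks highest first
    for ratio, country in sorted(results, reverse=True):
        print(country, ratio)

print_ranking("Jul 1 2020")
```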
Checking the news
The White House statement cited in the introduction, above, was made on April 28th.
What do you make of President Trump's statements from a per capita perspective?
How does that compare with more recently?
In a May 11 Rose Garden briefing President Trump stated:
We’re testing more people per capita than South Korea, the United Kingdom, France, Japan, Sweden, Finland, and many other countries — and, in some cases, combined.
The BBC's May 15th article Coronavirus: President Trump’s testing claims fact-checked "fact-checks" this claim (Claim One).
Modify your function to the signature print_ranking(date, countries=[]) so that:
if countries is omitted, it still prints the table for all countries reporting on that day
if a list of countries is passed to the function, then it only prints the table for those countries
Print the table for the US and those six countries on 11th May.
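A sketch of the modified function follows, again with a tiny made-up country_dict (the figures are invented for illustration; run it on your real dictionary to answer the questions below). Since the sketch only reads the countries list, the mutable default argument is harmless here.

```python
# Made-up figures for illustration only
country_dict = {
    "United States": {"May 11 2020": 0.8},
    "South Korea": {"May 11 2020": 0.1},
    "Denmark": {"May 11 2020": 1.2},
}

def print_ranking(date, countries=[]):
    # an empty countries list means "all countries reporting on that date"
    results = [(tests[date], country)
               for country, tests in country_dict.items()
               if date in tests and (countries == [] or country in countries)]
    for ratio, country in sorted(results, reverse=True):
        print(country, ratio)

print_ranking("May 11 2020", ["United States", "South Korea"])
```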
Is the BBC's fact check for the 11th May borne out by these data?
Can you think of a possible reason for these discrepancies? (Hint: Should "we're testing" be interpreted as a rate or as a cumulative total?)
On 22nd June Newsweek, in Why Trump Is Both Right and Wrong About U.S. Coronavirus Testing Numbers, compares the US with Russia, Spain, Germany and Portugal on cumulative per capita figures.
How does this compare with the picture you get for the daily rate at this date?
Congratulations - you can now get a job as a fact checking journalist!
© Cara MacNish, University of Western Australia