Author: Jez Phipps

# Project 1: Deaths by tuberculosis

#### produced by Jez Phipps, with earlier assistance from Michel Wermelinger, on 11th October 2016.

This is the project notebook for Week 1 of The Open University's Learn to code for Data Analysis course.

In 2000, the United Nations set eight Millennium Development Goals (MDGs) to reduce poverty and diseases, improve gender equality and environmental sustainability, etc. Each goal is quantified and time-bound, to be achieved by the end of 2015. Goal 6 is to have halted and started reversing the spread of HIV, malaria and tuberculosis (TB). TB doesn't make headlines like Ebola, SARS (severe acute respiratory syndrome) and other epidemics, but is far deadlier. For more information, see the World Health Organization (WHO) page http://www.who.int/gho/tb/en/.

Given the population and number of deaths due to TB in some countries during one year, the following questions will be answered:

• What is the total, maximum, minimum and average number of deaths in that year?
• Which countries have the most and the least deaths?
• What is the death rate (deaths per 100,000 inhabitants) for each country?
• Which countries have the lowest and highest death rate?

The death rate allows for a better comparison of countries with widely different population sizes.

In [51]:
from IPython.display import display, HTML
HTML('''Note: <p style="display:inline;color:darkred;">Option to toggle code visibility on/off is at <a href="#Bottom">bottom</a> of page.</p>''')

Note:

Option to toggle code visibility on/off is at bottom of page.

## The data

The data consists of the total population and the total number of deaths due to TB (excluding HIV) in 2013 for all countries covered by the WHO data.

The data was taken in July 2015 from http://apps.who.int/gho/data/node.main.POP107?lang=en (population) and http://apps.who.int/gho/data/node.main.593?lang=en (deaths). The uncertainty bounds of the number of deaths were ignored.

The data was collected into an Excel file which should be in the same folder as this notebook.

In [52]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

def roundToInt(number):
    return int(round(number, -1)) # Round to nearest 10

from pandas import *
data = read_excel('WHO POP TB all.xls', encoding='utf-8')
numberOfRows = len(data.index) # Get number of rows
set_option('max_rows', numberOfRows) # Set max_rows option to display numberOfRows
data


## The range of the problem

The column of interest is the last one.

In [53]:
tbColumn = data['TB deaths']


The total number of deaths in 2013 is:

In [54]:
tbColumn.sum()

1072677.97

The largest and smallest number of deaths in a single country are:

In [55]:
tbColumn.max()

240000.0

In [56]:
tbColumn.min()

0.0

From zero to almost a quarter of a million deaths is a huge range. The average number of deaths, over all countries in the data, can give a better idea of the seriousness of the problem in each country. The average can be computed as the mean or the median. Given the wide range of deaths, the median is probably a more sensible average measure.

In [57]:
tbColumn.mean()

5529.2678865979378

In [58]:
tbColumn.median()

315.0

The median is far lower than the mean. This indicates that some of the countries had a very high number of TB deaths in 2013, pushing the value of the mean up.
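
The effect of a few very large values on the mean can be illustrated with a minimal sketch. The figures are made up for illustration and are not taken from the WHO data, and the example uses Python's standard `statistics` module rather than pandas so that it runs on its own.

```python
from statistics import mean, median

# Hypothetical death counts for seven countries (illustrative values only):
# most are small, but two are very large.
deaths = [0, 20, 150, 300, 900, 160000, 240000]

average = mean(deaths)    # pulled upwards by the two large values
middle = median(deaths)   # the middle value of the sorted list

print("mean:  ", round(average))
print("median:", middle)
```

In this sketch the two large values dominate the mean, while the median stays at the middle of the sorted list, which is the same pattern seen in the TB data.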

## The most affected

To see the most affected countries, the table is sorted in ascending order by the last column, which puts those countries in the last rows.
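
The cell below calls `DataFrame.sort`, which works in the pandas 0.15.0 release used for this notebook but was deprecated in pandas 0.17 and later removed in favour of `sort_values`. As a sketch for readers on a newer pandas, the same ascending sort might be written as follows. The three-row table, its country names and its figures are hypothetical stand-ins for the WHO spreadsheet.

```python
import pandas as pd

# Hypothetical miniature table (values are illustrative, not WHO figures).
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'TB deaths': [1200.0, 15.0, 240.0],
})

# sort_values replaces the older DataFrame.sort; ascending order is the
# default, so the most affected country ends up in the last row.
ordered = sample.sort_values('TB deaths')

print(ordered)
print("Most affected:", ordered['Country'].iloc[-1])
```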

In [59]:
set_option('max_rows', numberOfRows) # Set max_rows option to display numberOfRows
data.sort('TB deaths')


The table raises the possibility that a large number of deaths may be partly due to a large population. To compare the countries on an equal footing, the death rate per 100,000 inhabitants is computed below:
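
The factor of 100 in the rate calculation comes from the units: the population column is given in thousands, so deaths / (population × 1,000) × 100,000 simplifies to deaths × 100 / population. A minimal sketch with made-up figures shows that the two forms agree. The helper name `deaths_per_100k` is hypothetical and not part of the notebook.

```python
def deaths_per_100k(deaths, population_thousands):
    """Return deaths per 100,000 inhabitants, given population in thousands."""
    inhabitants = population_thousands * 1000
    return deaths / inhabitants * 100000

# Hypothetical country: 5,000 deaths among 50 million inhabitants,
# i.e. a population of 50,000 when expressed in thousands.
rate = deaths_per_100k(5000, 50000)
shortcut = 5000 * 100 / 50000   # the simplified form: deaths * 100 / population

print(rate, shortcut)
```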

In [60]:
populationColumn = data['Population (1000s)']
data['TB deaths (per 100,000)'] = tbColumn * 100 / populationColumn
data = data.sort('TB deaths (per 100,000)')
data

In [61]:
print("\nUsing the table above, it can be seen that:\n")
print(u"\u2022 the two least affected countries were " + str(data['Country'].iloc[0]) + " and " + str(data['Country'].iloc[1]) + " with about " + str(roundToInt(data['TB deaths (per 100,000)'].iloc[1])) + " deaths per 100 thousand inhabitants.\n")
print(u"\u2022 the two worst affected countries were " + str(data['Country'].iloc[-1]) + " and " + str(data['Country'].iloc[-2]) + " with over " + str(roundToInt(data['TB deaths (per 100,000)'].iloc[-2])) + " deaths per 100 thousand inhabitants.")


Using the table above, it can be seen that:

• the two least affected countries were San Marino and Monaco with about 0 deaths per 100 thousand inhabitants.

• the two worst affected countries were Djibouti and Nigeria with over 90 deaths per 100 thousand inhabitants.

## Conclusions

There were over a million deaths due to TB in 2013. The median shows that half of the countries had fewer than 315 deaths. The much higher mean (over 5,500) indicates that some countries had a very high number. By number of deaths, the least affected were San Marino and Niue, with 0 and 0.01 deaths respectively, and the most affected were Nigeria and India, with 160 thousand and 240 thousand deaths in a single year. However, taking the population size into account, the least affected were San Marino and Monaco, with less than 0.08 deaths per 100,000 inhabitants, and the most affected were Nigeria and Djibouti, with over 90 deaths per 100,000 inhabitants.

One should not forget that most values are estimates. Nevertheless, they convey the message that TB is still a major cause of fatalities, and that there is a huge disparity between countries, with several being highly affected.

In [62]:
print("\nPandas version", pandas.__version__, end="")
from IPython.display import HTML
HTML('''<script>
function toggler() {
$('div.input').toggle(); location.href="#Bottom"; } </script> <p style="display:inline;"><center>Click <a href="javascript:toggler();">here</a> to toggle code visibility on/off</center></p> <script>$('div.input').show();
location.href="#Top";</script></div>''')

Pandas version 0.15.0