
Jupyter notebook TB deaths all world – Chris Pyves.ipynb

Kernel: Python 3

Project 1: Tuberculosis & Mortality - a study by Chris Pyves

Based on an original article published by Michel Wermelinger, 14 July 2015, with recent updates & contributions by Chris Pyves dated 13 October 2016

Introduction

In 2000, the United Nations set eight Millennium Development Goals (MDGs) to reduce poverty and disease, improve gender equality and environmental sustainability, etc. Each goal is quantified and time-bound, to be achieved by the end of 2015. Goal 6 is to have halted and started reversing the spread of HIV, malaria and tuberculosis (TB). TB doesn't make headlines like Ebola, SARS (severe acute respiratory syndrome) and other epidemics, but is far deadlier. For more information, see the World Health Organisation (WHO) page http://www.who.int/gho/tb/en/.


Tuberculosis is history’s deadliest disease: it has killed one out of every seven people to ever live on the planet!

“If the importance of a disease for mankind is measured by the number of fatalities it causes, then tuberculosis must be considered much more important than those most feared infectious diseases, plague, cholera and the like. One in seven of all human beings dies from tuberculosis.” –Robert Koch, 1882

Given the population and number of deaths due to TB in some countries during one year, the following questions will be answered:

  • What is the total, maximum, minimum and average number of deaths in that year?

  • Which countries have the most and the least deaths?

  • What is the death rate (deaths per 100,000 inhabitants) for each country?

  • Which countries have the lowest and highest death rate?

The death rate allows for a better comparison of countries with widely different population sizes. The reporting standard used here is the estimated number of deaths from TB in a year per 100,000 people.

The source data

The data consists of total population and total number of deaths due to TB (excluding HIV) in 2014 for the 194 countries in the WHO data set.

The original study covered the year 2013. WHO has since updated those figures, and new figures are now available for 2014 for both population and TB deaths by country. A decision was therefore made to combine deaths from TB with population totals by country for 2014 and review the resulting data.

The original data for 2013 was taken in June 2015, whilst the data for 2014 was downloaded in October 2016 from the WHO website: [World Population data for 2014](http://apps.who.int/gho/data/node.main.POP107?lang=en) (population) and [World TB deaths for 2014](http://apps.who.int/gho/data/node.main.593?lang=en) (deaths).
(The uncertainty bounds of the number of deaths were ignored.)

  • Data download CSV file 5.5kb Population by Country 2014 Last updated: (2016-04-14)

  • Data download CSV file 10.0kb Deaths due to TB by Country for 2014 Last updated: (2015-11-25)

  • The raw data was parsed using Google Sheets and saved as 'WHO POP TB 2014 Chris Pyves.xlsx'

  • Link to Excel data file (read only). This file should be placed in the same folder as this notebook.

# First, we need to import the standard pandas libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # "import matplotlib.pyplot as plt" is the standard form (new in pandas version 0.11.0)

pd.__version__  # outputs the pandas version being used
'0.15.0'
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Next, we read in the sample data (and get a summary of how it looks).
# Basic syntax to return a DataFrame:
#   read_excel('path_to_file.xls', sheetname='Sheet1')
# Original form, used when pandas was imported with a wildcard import:
#   data = read_excel('WHO POP TB 2014.xlsx')
# With "import pandas as pd" the pd. prefix is needed:
data = pd.read_excel('WHO POP TB 2014 Chris Pyves.xlsx')

Examining the data file

Data Validation checks:

Because this data has come directly from the WHO website and is relatively small, it has already been visually checked and found to be in good condition. No further validation steps are required. Once the Excel file has been read, the size of the data can be defined as follows:

  • Number of records in the data set (rows) x (columns)

  • Column Header Names (useful if further reports are required)

# Output the number of rows & columns
numrows = len(data)          # 194 rows in this example
numcols = len(data.columns)  # 3 columns in this example
# Report how large the Excel import file is
print("The data file contains {0} Total rows x {1} Total columns of data (each row is indexed)".format(numrows, numcols))
The data file contains 194 Total rows x 3 Total columns of data (each row is indexed)
# Display how the DataFrame sees the column names
# data.columns.tolist()
# These names will be required for coding reports (by column)
print("Column header names used in this data file are as follows: {0} ".format(data.columns.tolist()))
Column header names used in this data file are as follows: ['Country', 'Population (1000s)', 'TB deaths']

Max_row settings:

There seems to be a limit on the number of rows that pandas displays.

  • pd.options.display.max_rows # will report the max_rows value that pandas is set to

  • pd.options.display.max_rows = 999 allows this number to be increased

Given that this data file contains 194 rows, the setting only needs to be changed to 200 to cover everything.

# This cell sets and reports the maximum number of rows that will be displayed.
# To change the current setting, edit the value below, then run the cell to check that the change has been made.
pd.options.display.max_rows = 200  # this number can be changed if required [between 0 - 999]
# The value can be increased to 999 but text wrapping may still occur with column data
print("Note: maximum row setting is currently set at {0} rows".format(pd.options.display.max_rows))
Note: maximum row setting is currently set at 200 rows
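As a hedged alternative to changing the global setting, the row limit can be raised only for the duration of one block. The sketch below uses a small stand-in frame (not the WHO data) and assumes a pandas version that provides `pd.option_context` and `pd.get_option`.

```python
import pandas as pd

# Stand-in frame with 194 rows; illustrative only, not the WHO data.
frame = pd.DataFrame({"value": range(194)})

default_rows = pd.get_option("display.max_rows")

# Raise the limit only inside the with-block; the previous value is restored on exit.
with pd.option_context("display.max_rows", 200):
    inside_rows = pd.get_option("display.max_rows")
    listing = repr(frame)  # rendered while the higher limit is active

restored_rows = pd.get_option("display.max_rows")
print(default_rows, inside_rows, restored_rows)
```

This keeps the notebook's default display behaviour unchanged for every other cell.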
# Now show the first 5 rows of the table
data.head()  # this command is helpful to see how the imported data is set out
Country Population (1000s) TB deaths
0 Afghanistan 31627.5 14000.00
1 Albania 2889.7 17.00
2 Algeria 38934.3 4400.00
3 Andorra 72.8 0.55
4 Angola 24227.5 13000.00
# Check the bottom of the data set (last five records)
data[-5:]  # note: column widths are different to the head() command
Country Population (1000s) TB deaths
189 Venezuela (Bolivarian Republic of) 30693.8 540
190 Viet Nam 92423.3 17000
191 Yemen 26183.7 1100
192 Zambia 15721.3 5100
193 Zimbabwe 15245.9 2300

Now that we know that the data appears to have loaded successfully we can start analysing it to see what information is contained within.
(Note in the original data upload an error was detected & the excel file was corrected at source)

Data total checks

We now start to look at the data to get a better feel for the scope of the data

  • Data Totals (by column) for Population & TB deaths

  • Population statistics:

    • minimum maximum & Data range

    • mean median & mode

    • Upper & lower quartiles

What software tools are available to analyse this data?

The following shows the use of the describe() method:

# Getting a handle on the data set: how large is the file and how many records does it contain?
# [When was it issued? Source & date of acquisition]
# Data statistics: number of records, value range & distribution
data.describe()  # a very useful command to provide an initial overview of your data
Population (1000s) TB deaths
count 194.000000 194.000000
mean 37407.166495 5768.637835
std 141406.103139 22706.403800
min 1.600000 0.000000
25% 1832.425000 24.250000
50% 7950.600000 265.000000
75% 25894.475000 2300.000000
max 1400000.000000 220000.000000

How to Extract useful totals from this data?

Let's just check what the total population was for the world in 2014.

According to the United Nations, world population reached 7 Billion on October 31, 2011. [Google search]

# Creating a variable called popColumn enables us to call up all population column data & perform calculations
popColumn = data['Population (1000s)']
# popColumn.describe()  # Returns Name: Population (1000s), dtype: float64
popColumn.sum()
7256990.2999999998

That looks reasonable, so let's now look for the largest and smallest populations in our data set:

popColumn.max()
# An earlier attempt on the whole DataFrame raised "NameError: name 'index' is not defined"
# For reference, the axis argument is documented as: axis : {index (0), columns (1)}
1400000.0
popColumn.min()
1.6000000000000001

The Range of Population data can be calculated as: Population (Max) - Population (Min)

popRange = popColumn.max() - popColumn.min()
popRange
1399998.3999999999

What else can we do to analyse Population data a bit further?

  • mean (a number expressing the central or typical value in a set of data, which is calculated by dividing the sum of the values in the set by their number)

  • median (a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities)

  • mode (the value that occurs most frequently in a given set of data)

popColumn.mean()
# Return the mean of the values for the requested axis
# mean : Series or DataFrame (if level specified)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html
37407.166494845362
popColumn.median()
# Return the median of the values for the requested axis
# median : Series or DataFrame (if level specified)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html
7950.6

Comment on Population data:

As the median value is significantly lower than the mean it suggests that Population is not evenly distributed by country. This fact will be confirmed later on.
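A small sketch of why a mean well above the median points to an uneven distribution. It uses only the standard library and the six largest populations that appear in the top-six listing further down in this notebook; treating those six as a sample is an illustrative choice, not part of the original analysis.

```python
from statistics import mean, median

# Populations in thousands: China, India, United States of America,
# Indonesia, Brazil, Pakistan (values as listed later in this notebook).
populations = [1400000, 1300000, 319449, 254455, 206078, 185044]

avg = mean(populations)
mid = median(populations)

# Two very large values pull the mean far above the midpoint of the sample.
print("mean   = {0:,.1f}".format(avg))
print("median = {0:,.1f}".format(mid))
```

The same pattern holds for the full data set, where the mean (37,407) is several times the median (7,951).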

# mode: finding the values that occur most frequently in a given set of data
# Unless this data is grouped into ranges it is not going to produce meaningful results
popColSort = data.sort('Population (1000s)', ascending = 0)  # .mode()
# popColSort['Population (1000s)']  # produces list of all Populations in descending order
# http://pandas.pydata.org/pandas-docs/stable/groupby.html
# modes : DataFrame (sorted)  # This could be a clue!
# pandas.DataFrame.mode applies to ver 0.18.0
# df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})
# df.mode()
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html

Comment on the Mode command:

Using the mode command is not going to yield any useful data unless we can group the data into range categories. This is a topic that will be returned to later. See for reference:
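One possible way to group populations into range categories before taking a mode is to band them by order of magnitude. The sketch below is an assumption about how that grouping could be done, not the author's method; the helper `magnitude_band` is hypothetical, and the nine countries are simply rows that are displayed elsewhere in this notebook.

```python
import math
from collections import Counter

# Populations in thousands, copied from rows displayed in this notebook.
populations = {
    "Afghanistan": 31627.5, "Albania": 2889.7, "Algeria": 38934.3,
    "Andorra": 72.8, "Angola": 24227.5, "Viet Nam": 92423.3,
    "Yemen": 26183.7, "Zambia": 15721.3, "Zimbabwe": 15245.9,
}

def magnitude_band(pop_thousands):
    """Hypothetical helper: label a population by its power-of-ten band."""
    people = pop_thousands * 1000
    exponent = int(math.floor(math.log10(people)))
    return "10^{0} to 10^{1}".format(exponent, exponent + 1)

bands = Counter(magnitude_band(p) for p in populations.values())
modal_band, modal_count = bands.most_common(1)[0]
print(modal_band, modal_count)
```

With banded values the mode becomes meaningful: it reports the most common size class rather than a single repeated population figure.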

How to return Sum Control Totals for all columns in the file?

# The following is a useful command to return total values for all columns
data.sum()
# To apply this command to a range of rows, set the start, end & step parameters:
# data[0:].sum()  # both commands return totals stacked one on top of the other
# I have yet to work out how to show values horizontally (Transpose does NOT work here)
Country               AfghanistanAlbaniaAlgeriaAndorraAngolaAntigua ...
Population (1000s)    7256990
TB deaths             1119116
dtype: object
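The comment in the cell above asks how to show the totals horizontally. One approach, sketched here under the assumption that only the numeric columns are wanted, is to convert the summed Series into a one-column frame and transpose that. The sample frame contains just the five rows shown by `data.head()`, and the row label "Total" is an illustrative choice.

```python
import pandas as pd

# First five rows, as shown by data.head() earlier in this notebook.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Population (1000s)": [31627.5, 2889.7, 38934.3, 72.8, 24227.5],
    "TB deaths": [14000.0, 17.0, 4400.0, 0.55, 13000.0],
})

numeric_cols = ["Population (1000s)", "TB deaths"]
totals = sample[numeric_cols].sum()             # Series: one total per column, stacked vertically
totals_row = totals.to_frame(name="Total").T    # one-row DataFrame: totals side by side
print(totals_row)
```

Selecting the numeric columns first also avoids the concatenated country names that appear in the `data.sum()` output.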

How is the Population data distributed?

What does it look like?

One way to get a better idea of how our data is distributed is to display it graphically. I have tried a number of options and the following seems to show the data best.

temps = pd.DataFrame(data)
# temps['Population (1000s)']  # returns index & data from specified col
# type(temps['Population (1000s)'])  # returns pandas.core.series.Series
temps['Population (1000s)'].plot()  # plots population data by index position in the data file *not sorted*
# How do we report this data as a sorted list?
# If we assign to a new variable we get a Series:
# newtemp = temps['Population (1000s)']
# newtemp
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e491dc908>
Image in a Jupyter notebook

What the plot above shows are the populations for each country according to their position in the data file.

So we can see at a glance that only a handful of countries appear to have populations above 200m, whilst the rest are all below 200m.

However it might be better if we could sort this list into descending Population order to see the distribution a bit more clearly.

# Used to plot country populations (sorted into descending order by country population)
popIndex = data.sort('Population (1000s)', ascending = 0)
popIndex['Sorted Country List'] = list(range(len(popIndex.index)))
# popIndex
# Plots Population against an index of countries sorted by population
popIndex.plot(kind='scatter', x= 'Sorted Country List', y = 'Population (1000s)')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e491c2630>
Image in a Jupyter notebook

This shows that global population is concentrated in just a few countries. This graph appears to describe a hyperbolic curve and suggests that perhaps a logarithmic scale might be a better tool for analysing this data. From the plots above it appears that around five countries had a Population above 200m in 2014. This can be confirmed by printing out the top six countries.

# popIndex['Sorted Country List']
data.sort('Population (1000s)', ascending = 0)[:6]
Country Population (1000s) TB deaths
35 China 1400000 38000
77 India 1300000 220000
185 United States of America 319449 460
78 Indonesia 254455 100000
23 Brazil 206078 5300
128 Pakistan 185044 48000
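The `data.sort(...)` call works in the pandas 0.15.0 version reported at the top of this notebook, but `DataFrame.sort` was deprecated in pandas 0.17 and later removed. A hedged sketch of the equivalent in newer versions, using `sort_values` and `nlargest` on just the six rows listed above:

```python
import pandas as pd

# The six rows from the listing above (population in thousands),
# deliberately entered out of order and keyed by their original index codes.
sample = pd.DataFrame(
    {
        "Country": ["Brazil", "China", "India", "Indonesia",
                    "Pakistan", "United States of America"],
        "Population (1000s)": [206078, 1400000, 1300000, 254455, 185044, 319449],
        "TB deaths": [5300, 38000, 220000, 100000, 48000, 460],
    },
    index=[23, 35, 77, 78, 128, 185],
)

# Equivalent of data.sort('Population (1000s)', ascending = 0)[:5]
top5_sorted = sample.sort_values("Population (1000s)", ascending=False).head(5)

# nlargest gives the same rows without sorting the whole frame first
top5_nlargest = sample.nlargest(5, "Population (1000s)")
print(top5_sorted)
```

Either form returns the same five index codes that are quoted in the text.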

The top five countries account for almost half the world's population.

  • Number of countries: 5

  • Total sum of combined populations: 3,479,982k = 3.48bn people

  • Total sum of TB deaths: 363,760

If we add up the populations of the top five countries (those above 200m) we get a total of 3.48bn, which when divided by the world's total population of 7.26bn gives a result of 48%.

Top Five Countries by Population: Country Index Codes [35,77,185,78,23]
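The 48% figure can be checked with a few lines of plain Python. This is a minimal sketch that hard-codes the five populations from the listing above and the world total returned by `popColumn.sum()`; with the full data frame loaded, the same sums could be taken directly from the sorted rows.

```python
# Populations in thousands, from the top-six listing above.
top_five = {
    "China": 1400000,
    "India": 1300000,
    "United States of America": 319449,
    "Indonesia": 254455,
    "Brazil": 206078,
}
world_total = 7256990.3  # popColumn.sum(), in thousands

combined = sum(top_five.values())
share = 100 * combined / world_total
print("{0:,} thousand people = {1:.1f}% of the world total".format(combined, share))
```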

Turning our attention to the TB problem

One third of the world's population is infected with TB. In 2014, 9.6 million people around the world became sick with TB disease.
There were 1.5 million TB-related deaths worldwide. TB is a leading killer of people who are HIV infected.

Whilst tuberculosis is an entirely curable disease, it is also one of the top 10 global killers.

What data can we extract to present a better picture of the situation?

  • TB statistics:

    • minimum maximum & Data range

    • mean median & mode

    • Upper & lower quartiles

**The actual WHO data for TB deaths in 2014 is a bit smaller than the headline figures would suggest: 1.5m becomes 1.1m.**

# Creating a variable called tbColumn enables us to call up all TB death column data & perform calculations
tbColumn = data['TB deaths']
# tbColumn.describe()
tbColumn.sum()
# Chigozie Onyekwelu - code test
# tbColumn = data['Population (1000s)']
# tbColumn.sum()
1119115.7400000002

The largest and smallest number of deaths in a single country are:

tbColumn.max()
220000.0
tbColumn.min()
0.0

So the Range of TB deaths can be calculated as: TB deaths (Max) - TB deaths (Min)

tbRange = tbColumn.max() - tbColumn.min()
tbRange
220000.0

What else can we do to analyse the TB deaths data a bit further?

  • mean (a number expressing the central or typical value in a set of data, which is calculated by dividing the sum of the values in the set by their number)

  • median (a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities)

  • mode (the value that occurs most frequently in a given set of data)

tbColumn.mean()
# Return the mean of the values for the requested axis
# mean : Series or DataFrame (if level specified)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html
5768.6378350515479
tbColumn.median()
# Return the median of the values for the requested axis
# median : Series or DataFrame (if level specified)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html
265.0

Comment on Tuberculosis data:

From 0.0 to almost a quarter of a million deaths is a huge range. The average number of deaths, over all countries in the data, gives a better idea of the seriousness of the problem in each country. But given the wide range of deaths, the median is probably a more sensible average measure.

As was found when looking at population, TB deaths are not evenly distributed across all countries. Some countries are more affected than others.

# Looking at the TB deaths in the data file by country
temps2 = pd.DataFrame(data)
# temps['Population (1000s)']  # returns index & data from specified col
# type(temps['Population (1000s)'])  # returns pandas.core.series.Series
temps2['TB deaths'].plot()  # plots TB deaths by index position in the data file *not sorted*
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e438e2588>
Image in a Jupyter notebook

It would be interesting to try and overlay the two graphs above for Population & TB deaths as they are both in the same order.

But a more revealing analysis might be to plot TB deaths by Population (see below).

# Used to plot TB deaths by country (sorted into descending order)
tbIndex = data.sort('TB deaths', ascending = 0)
tbIndex['Sorted TB List'] = list(range(len(tbIndex.index)))
# tbIndex
# Plots TB deaths against an index of countries sorted by TB deaths
tbIndex.plot(kind='scatter', x= 'Sorted TB List', y = 'TB deaths')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e4386b6d8>
Image in a Jupyter notebook

This graph shows the distribution of TB deaths by country (sorted in descending order of TB deaths). The points describe a hyperbolic curve similar to the distribution of population by country. From the points above it appears that around five countries had more than 50,000 TB deaths in 2014. This can be confirmed by listing the top six countries (see below).

# popIndex['Sorted Country List']
data.sort('TB deaths', ascending = 0)[:6]
Country Population (1000s) TB deaths
77 India 1300000 220000
124 Nigeria 177476 170000
78 Indonesia 254455 100000
13 Bangladesh 159078 81000
47 Democratic Republic of the Congo 74877 52000
128 Pakistan 185044 48000

Results:

5 Countries that account for 56% of global deaths caused by TB

  • number of countries 5

  • Total sum of combined populations 1,965,886k = 1.965bn people

  • Total sum of TB deaths 623,000

  • Country population as a percentage of the world total 1.965/7.257 = 27.0%

  • Country TB deaths as a percentage of world TB deaths 623,000/1,119,116 = 55.7%

Top 5 Countries by TB deaths: Country Index Codes [77,124,78,13,47] India, Nigeria, Indonesia, Bangladesh, Democratic Republic of the Congo

The top five countries sorted by TB deaths (above 50,000) have a combined total of 623,000 which when divided by Total TB deaths in the World of 1.1m gives a result of 56%.

At this point in the research it might be appropriate to extend the list of countries with the most TB deaths to look at this data in more detail.

This is very easy to implement as there is only one number in the code to change (see below)

data.sort('TB deaths', ascending = 0)[:15]
Country Population (1000s) TB deaths
77 India 1300000.0 220000
124 Nigeria 177476.0 170000
78 Indonesia 254455.0 100000
13 Bangladesh 159078.0 81000
47 Democratic Republic of the Congo 74877.0 52000
128 Pakistan 185044.0 48000
35 China 1400000.0 38000
58 Ethiopia 96958.7 32000
184 United Republic of Tanzania 51822.6 30000
116 Myanmar 53437.2 28000
159 South Africa 53969.1 24000
115 Mozambique 27216.3 18000
190 Viet Nam 92423.3 17000
141 Russian Federation 143429.0 16000
0 Afghanistan 31627.5 14000

Results:

Top 15 Countries account for about 80% of worldwide deaths caused by TB

  • number of countries 15

  • Total sum of combined populations 4,101,813.7k = 4.1bn people

  • Total sum of TB deaths 888,000

  • Country population as a percentage of the world total 4.1/7.257 = 56.5%

  • Country TB deaths as a percentage of world TB deaths 888,000/1,119,116 = 79.3% (about 80%)

Adding 10 more countries to the previous list of 5 countries creates a group with a combined population of 4.1bn people, which represents 56.5% of worldwide population. Total TB deaths for the group are 888,000, which represents roughly 80% of all worldwide deaths from TB.

The 10 extra countries that are added to the previous five countries are (in order): Pakistan, China, Ethiopia, United Republic of Tanzania, Myanmar, South Africa, Mozambique, Viet Nam, Russian Federation, Afghanistan

The top 15 countries can be seen in the list above. The range of actual deaths due to TB is quite wide starting with India at 220,000 and running down to Afghanistan with 14,000 TB related deaths.
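The top-5 and top-15 shares can be reproduced with a running total. This is a minimal sketch using only the standard library; the death counts are copied from the top-15 listing above and the world total is the value returned by `tbColumn.sum()`.

```python
from itertools import accumulate

# TB deaths for the 15 countries listed above, already in descending order.
deaths = [220000, 170000, 100000, 81000, 52000, 48000, 38000, 32000,
          30000, 28000, 24000, 18000, 17000, 16000, 14000]
world_deaths = 1119115.74  # tbColumn.sum()

running = list(accumulate(deaths))                      # cumulative deaths after each country
shares = [100 * total / world_deaths for total in running]

print("Top 5:  {0:,} deaths, {1:.1f}% of the world total".format(running[4], shares[4]))
print("Top 15: {0:,} deaths, {1:.1f}% of the world total".format(running[14], shares[14]))
```

Reading along `shares` shows how quickly the cumulative percentage flattens out after the first few countries.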

Whilst it is quite difficult to put a classification on this collection of 15 countries, it does include all the BRICs countries except for Brazil (see BRICs table lower down in this report). China was retained because its TB mortality is above 14,000, whereas Brazil with 5,300 TB deaths was excluded because it was below the cutoff threshold of 14,000.

Link to map showing the top 15 countries by location

One way to classify these 15 countries might be to divide them into their continental groupings; African or Asian countries.

**The African countries include:** Nigeria, Democratic Republic of the Congo, Ethiopia, United Republic of Tanzania, South Africa, Mozambique (6)

**The Asian countries include:** India, Indonesia, Bangladesh, Pakistan, China, Myanmar, Viet Nam, Russian Federation, Afghanistan (9)

Top 15 Countries by TB deaths: Country Index Codes: [77,124,78,13,47,128,35,58,184,116,159,115,190,141,0]

If we plot both Population & TB Deaths together (Population on the Y axis and TB deaths on the X axis) we will get a better view of this data to identify which countries are on the data extremities (see below).

# This graph plots deaths by population
# data['TB deaths'].plot()  # 'Population (1000s)'
# Produces a graphical representation of Population data by country index number
# It would be nice to sort this data into descending order & plot a histogram
# Question: How do you plot a sorted list of Populations in descending order?
data.plot(kind='scatter', x= 'TB deaths', y = 'Population (1000s)')  # plots Population against TB deaths
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e43847c18>
Image in a Jupyter notebook

The cutoff for the top 15 Countries is 14,000 TB deaths in a year. Countries that are above this level will appear in the list whereas those that are below will not.

An interesting graph with some extreme results:

Whilst the bulk of the data is concentrated in the bottom left-hand corner, it is the countries outside this area that warrant close attention.

  • The country with the highest population but fewer than 50,000 TB deaths is China [35].

  • The country with the second highest population but the highest number of TB deaths is India [77].

  • The country with the second highest number of TB deaths but with a population just below 200 million is Nigeria [124].

(For full details see the list below)

# How to report different rows of all data together
# countryCodes = [35,77,124,]
data.loc[data.index.isin([35,77,124])]
Country Population (1000s) TB deaths
35 China 1400000 38000
77 India 1300000 220000
124 Nigeria 177476 170000

It is worth noting that focusing our attention strictly on Countries with the largest Populations or the largest TB deaths could be misleading.

In order to make a fair comparison the rate of TB deaths per 100,000 people will now be calculated and then analysed.

Identifying the Countries that have the highest TB death rates per 100,000

The calculation is performed as follows, and a new group of 15 countries (ordered by TB death rate per 100,000) is listed in descending order:

# Calculate the rate per 100,000 and add the results to the data table with column heading 'TB deaths (per 100,000)'
popColumn = data['Population (1000s)']
tbColumn = data['TB deaths']
data['TB deaths (per 100,000)'] = tbColumn * 100 / popColumn
# data
data.sort('TB deaths (per 100,000)', ascending = 0)[0:15]
Country Population (1000s) TB deaths TB deaths (per 100,000)
49 Djibouti 876.2 1100 125.542114
124 Nigeria 177476.0 170000 95.787599
172 Timor-Leste 1157.4 1100 95.040608
47 Democratic Republic of the Congo 74877.0 52000 69.447227
96 Liberia 4396.6 3000 68.234545
158 Somalia 10517.6 7000 66.555108
95 Lesotho 2109.2 1400 66.375877
115 Mozambique 27216.3 18000 66.136837
117 Namibia 2402.9 1500 62.424570
71 Guinea-Bissau 1800.5 1100 61.094141
29 Cambodia 15328.1 8900 58.063296
184 United Republic of Tanzania 51822.6 30000 57.889801
62 Gabon 1687.7 940 55.697103
92 Lao People's Democratic Republic 6689.3 3700 55.312215
4 Angola 24227.5 13000 53.658033
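The factor of 100 in `tbColumn * 100 / popColumn` follows from the population being stored in thousands: deaths / (population × 1,000) × 100,000 simplifies to deaths × 100 / population. The sketch below spells this out for Djibouti; the function name `deaths_per_100k` is hypothetical and introduced only for illustration.

```python
def deaths_per_100k(tb_deaths, population_thousands):
    """Deaths per 100,000 inhabitants, given a population expressed in thousands."""
    people = population_thousands * 1000   # convert thousands to individuals
    return tb_deaths / people * 100000     # scale to a group of 100,000 people

# Djibouti: 1,100 TB deaths, population 876.2 thousand (from the table above).
long_form = deaths_per_100k(1100, 876.2)
shortcut = 1100 * 100 / 876.2              # the form used in the notebook cell

print(round(long_form, 6), round(shortcut, 6))
```

Both forms agree with the 125.542114 shown for Djibouti in the listing.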

Results:

28% of all TB deaths occur in countries with 5.5% of the world's population

  • number of countries 15

  • Total sum of combined populations 402,584.9k = 402.6m people

  • Total sum of TB deaths 312,740

  • Country population as a percentage of the world total 0.402/7.257 = 5.5%

  • Country TB deaths as a percentage of world TB deaths 312,740/1,119,116 = 28%

This analysis has reshuffled the list considerably. Only four countries from the previous list remain (see below); the other 11 have dropped off and been replaced by 11 new countries. Previously the cutoff was 14,000 TB deaths, and countries below that figure were excluded. Now that the focus has turned to the rate of TB deaths per 100,000, countries with smaller populations have been admitted, whilst some countries with larger populations have been removed because they fall below the new threshold, which the listing shows to be about 53 TB deaths per 100,000. Countries above this rate are included; countries below it are not.

But what has actually happened here? Previously, selecting the 15 countries with the most TB deaths produced a list that covered 56.5% of the world's population and accounted for about 80% of worldwide deaths caused by TB. Now, by focusing on the countries with the highest rates of TB death per 100,000 people, the list covers only 5.5% of the world's population but accounts for 28% of worldwide deaths caused by TB.

The change in countries represents a shift of attention towards the countries that have the highest rates of TB death. One can only assume that finding a balance between reducing total numbers and having a qualitative impact lies somewhere between these two sets of data.

The four countries that stay in the top 15 list are Democratic Republic of the Congo, Mozambique, Nigeria, United Republic of Tanzania

The 11 countries that come off the top 15 list are Afghanistan, Bangladesh, China, Ethiopia, India, Indonesia, Myanmar, Pakistan, Russian Federation, South Africa, Viet Nam

The 11 countries that are added to the top 15 list are Angola, Cambodia, Djibouti, Gabon, Guinea-Bissau, Lao People's Democratic Republic, Lesotho, Liberia, Namibia, Somalia, Timor-Leste

Link to map of countries included in this list

It seems that the previous method of classifying these 15 countries, dividing them into their continental groupings (African or Asian), also applies here.

**The African countries include:** Djibouti, Nigeria, Democratic Republic of the Congo, Liberia, Somalia, Lesotho, Mozambique, Namibia, Guinea-Bissau, United Republic of Tanzania, Gabon, Angola (12)

**The Asian countries include:** Timor-Leste, Cambodia, Lao People's Democratic Republic (3)

Top 15 Countries by Rate of TB deaths/100,000: Country Index Codes: [49,124,172,47,96,158,95,115,117,71,29,184,62,92,4]

  • 11 Countries that came off the list [-77,-78,-13,-128,-35,-58,-116,-159,-190,-141,-0] -11 deduct

  • 4 Countries that stayed on the list [124,47,184,115] 4 stay

  • 11 Countries that were added to the list [+49,+172,+96,+158,+95,+117,+71,+29,+62,+92,+4] +11 add
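The stay / drop / add breakdown can be derived mechanically from the two lists of index codes using Python sets. A minimal sketch, with the codes copied from the two "Country Index Codes" lists above:

```python
# Index codes of the top 15 countries by number of TB deaths.
by_deaths = {77, 124, 78, 13, 47, 128, 35, 58, 184, 116, 159, 115, 190, 141, 0}

# Index codes of the top 15 countries by TB deaths per 100,000.
by_rate = {49, 124, 172, 47, 96, 158, 95, 115, 117, 71, 29, 184, 62, 92, 4}

stayed = by_deaths & by_rate    # in both lists
dropped = by_deaths - by_rate   # only in the deaths list
added = by_rate - by_deaths     # only in the rate list

print("stayed :", sorted(stayed))
print("dropped:", sorted(dropped))
print("added  :", sorted(added))
```

Working from the codes rather than from typed country names makes it harder to misplace a country between the three groups.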

Putting all this information together - to better see things in context

Now that we have calculated TB rates of death per 100,000 heads of population we can plot Population on the Y axis against TB Death rates per 100,000 heads of population on the X axis.

The size of each circle is scaled to represent the actual number of TB deaths per country. Using the same approach as before we focus on those countries which exhibit some form of data extremity

# A picture is worth 1,000 words: Population vs Death rates per 100,000
# To set this up, before any plotting is performed you must execute the %matplotlib command
data.plot(kind='scatter', x= 'TB deaths (per 100,000)', y = 'Population (1000s)', s=data['TB deaths']/100 )
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e437b0c88>
Image in a Jupyter notebook
This graph helps to see the three data elements better & in context: (Population, Rate of deaths/100,000, TB Deaths)

The following observations can be made:

  • The country with the Highest rate of TB deaths is Djibouti [49] with 1,100 deaths from TB and a population 0.876m. This country is a primary target for International health care and the prevention of TB.

  • The two countries with the next Highest rate of death are Nigeria [124] and Timor-Leste [172] although actual numbers of TB deaths are quite different. Nigeria 170,000 Timor-Leste 1,100.

  • The country next worthy of attention is the large blue circle at the top left-hand side: this is India [77], with a population of 1.3bn and 220,000 TB deaths, giving a rate of 16.9 deaths per 100,000.

  • The small blue circle next to that represents China [35] with a population of 1.4bn and only 38,000 TB deaths giving a rate of 2.7 deaths per 100,000.

(Note: The five countries that are mentioned are all listed below in index number order) Just looking at these five countries highlights the difficulty in trying to make sense of just three data elements as the countries are so varied and different.

# How to report different rows of all data together
# countryCodes = [49,124,172,77,35]
data.loc[data.index.isin([49,124,172,77,35])]
Country Population (1000s) TB deaths TB deaths (per 100,000)
35 China 1400000.0 38000 2.714286
49 Djibouti 876.2 1100 125.542114
77 India 1300000.0 220000 16.923077
124 Nigeria 177476.0 170000 95.787599
172 Timor-Leste 1157.4 1100 95.040608
# Creating a variable called rateTBColumn enables us to call up all TB death rate column data & perform calculations
rateTBColumn = data['TB deaths (per 100,000)']
# rateTBColumn.describe()
# rateTBColumn.sum()  # not appropriate or relevant here

What else can we do to analyse the rate of TB deaths per 100,000 data further?

  • mean (a number expressing the central or typical value in a set of data, which is calculated by dividing the sum of the values in the set by their number)

  • median (a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities)

  • mode (the value that occurs most frequently in a given set of data)

rateTBColumn.mean()
# Return the mean of the values for the requested axis
# mean : Series or DataFrame (if level specified)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html
14.10920912596119
rateTBColumn.median()
3.5427927696619586

With a mean average rate of 14.1 TB deaths per 100,000 how many countries are above this number but below the top 15 list?

The following report lists the top 56 countries which have more than 14.0 deaths per 100,000 per annum.

56 countries out of 194 represent 28.9% of all countries

data.sort('TB deaths (per 100,000)', ascending = 0)[0:56]
Country Population (1000s) TB deaths TB deaths (per 100,000)
49 Djibouti 876.2 1100.0 125.542114
124 Nigeria 177476.0 170000.0 95.787599
172 Timor-Leste 1157.4 1100.0 95.040608
47 Democratic Republic of the Congo 74877.0 52000.0 69.447227
96 Liberia 4396.6 3000.0 68.234545
158 Somalia 10517.6 7000.0 66.555108
95 Lesotho 2109.2 1400.0 66.375877
115 Mozambique 27216.3 18000.0 66.136837
117 Namibia 2402.9 1500.0 62.424570
71 Guinea-Bissau 1800.5 1100.0 61.094141
29 Cambodia 15328.1 8900.0 58.063296
184 United Republic of Tanzania 51822.6 30000.0 57.889801
62 Gabon 1687.7 940.0 55.697103
92 Lao People's Democratic Republic 6689.3 3700.0 55.312215
4 Angola 24227.5 13000.0 53.658033
116 Myanmar 53437.2 28000.0 52.397955
165 Swaziland 1269.1 650.0 51.217398
13 Bangladesh 159078.0 81000.0 50.918417
100 Madagascar 23571.7 12000.0 50.908505
89 Kiribati 110.5 54.0 48.868778
32 Central African Republic 4804.3 2300.0 47.873780
38 Congo 4505.0 2100.0 46.614872
159 South Africa 53969.1 24000.0 44.469891
153 Sierra Leone 6315.6 2800.0 44.334663
0 Afghanistan 31627.5 14000.0 44.265275
131 Papua New Guinea 7463.6 3000.0 40.195080
78 Indonesia 254455.0 100000.0 39.299680
106 Marshall Islands 52.9 20.0 37.807183
66 Ghana 26786.6 9700.0 36.212136
58 Ethiopia 96958.7 32000.0 33.003743
192 Zambia 15721.3 5100.0 32.440065
30 Cameroon 22773.0 7100.0 31.177271
28 Cabo Verde 513.9 160.0 31.134462
70 Guinea 12275.5 3600.0 29.326708
160 South Sudan 11911.2 3400.0 28.544563
22 Botswana 2219.9 620.0 27.929186
128 Pakistan 185044.0 48000.0 25.939776
27 Burundi 10816.9 2500.0 23.111982
33 Chad 13587.1 3100.0 22.815759
107 Mauritania 3969.6 890.0 22.420395
41 Cote d'Ivoire 22157.1 4800.0 21.663485
150 Senegal 14672.6 3100.0 21.127816
163 Sudan 39350.3 8300.0 21.092596
88 Kenya 44863.6 9400.0 20.952398
72 Guyana 763.9 160.0 20.945150
46 Democratic People's Republic of Korea 25026.8 5000.0 19.978583
73 Haiti 10572.0 2100.0 19.863791
190 Viet Nam 92423.3 17000.0 18.393630
63 Gambia 1928.2 350.0 18.151644
123 Niger 19113.7 3400.0 17.788288
119 Nepal 28174.7 4900.0 17.391490
77 India 1300000.0 220000.0 16.923077
101 Malawi 16695.3 2800.0 16.771187
110 Micronesia (Federated States of) 104.0 17.0 16.346154
193 Zimbabwe 15245.9 2300.0 15.086023
179 Tuvalu 9.9 1.4 14.141414

The following code was used to generate totals for the following analysis:

  • 56 countries Total Population

  • 56 countries TB deaths total

# Special list - Extracting totals for a selected number of countries sorted in descending order of TB death rate per 100,000 population
temp3 = data.sort_values('TB deaths (per 100,000)', ascending=False)[0:56]
spopTB = temp3['Population (1000s)']
stbTB = temp3['TB deaths']
srateTB = temp3['TB deaths (per 100,000)']
spopTB.sum()
3036923.3999999994
stbTB.sum()
982462.40000000002

Summary of final results:

Just under a third (29%) of the world's countries account for 88% of worldwide TB deaths

**Ranking the top 56 countries with rates of TB deaths greater than the mean of 14 deaths per 100,000**

  • number of countries 56

  • Total sum of combined populations 3,036,923.4k = 3.0bn people

  • Total sum of TB deaths 982,462.4

  • Country population as a percentage of the world total 3.0/7.257 = 41%

  • Country TB deaths as a percentage of world TB deaths 982,462.4/1,119,116 = 88%
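The two percentages above are simple ratios of group totals to world totals. A sketch of that arithmetic, using the totals reported in this section; the helper name `share_of_world` is made up for this example:

```python
def share_of_world(group_total, world_total):
    """Return group_total as a percentage of world_total."""
    return 100 * group_total / world_total

# Figures reported above: populations in thousands, deaths as counts.
group_population = 3036923.4   # spopTB.sum()
group_deaths = 982462.4        # stbTB.sum()
world_population = 7257000.0   # 7.257bn expressed in thousands
world_deaths = 1119116.0

print(round(share_of_world(group_population, world_population), 1))
print(round(share_of_world(group_deaths, world_deaths), 1))
```

Using the unrounded population total gives a slightly higher share than dividing a population already rounded to 3.0bn.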

Unfortunately, due to time constraints, we are unable to pursue this investigation any further.

Conclusions

A conclusion summarising your findings, with qualitative analysis of the quantitative results and critical reflection on any shortcomings in the data or analysis process.

Despite the United Nations making the fight against tuberculosis part of its Millennium Development Goals, the results show little sign of improvement.

# Top 5 by Population
# [35,77,185,78,23]
data.loc[data.index.isin([35,77,185,78,23])]
Country Population (1000s) TB deaths TB deaths (per 100,000)
23 Brazil 206078 5300 2.571842
35 China 1400000 38000 2.714286
77 India 1300000 220000 16.923077
78 Indonesia 254455 100000 39.299680
185 United States of America 319449 460 0.143998

Qualitative analysis of the quantitative results

Top Five Countries by Population: Country Index Codes [23,35,77,78,185]

Top 5 countries by population (48% of the world total) account for 33% of worldwide deaths from TB

Because the distribution of populations is uneven, five countries alone account for almost half the world's population. They are, in descending order of population: China, India, United States of America, Indonesia, Brazil. Their combined population is 3.48bn, or 48% of the world total of 7.257bn. The combined deaths from TB in this group in 2014 came to 363,760 (the sum of the TB deaths column above), which divided by the worldwide figure of 1,119,116 deaths due to TB represents 33%. Whilst it is reasonable to expect that bigger countries will have more TB deaths, this assumption does not always hold true, and there are some notable exceptions which have a TB death rate lower than the worldwide mean of 14 deaths per 100,000 per annum.

If we sort the countries by TB deaths in descending order, three countries drop out of the list (China, United States of America, Brazil) and three new countries take their place (Nigeria, Bangladesh, Democratic Republic of the Congo).

# Top 5 by TB deaths
# [77,124,78,13,47]
data.loc[data.index.isin([77,124,78,13,47])]
Country Population (1000s) TB deaths TB deaths (per 100,000)
13 Bangladesh 159078 81000 50.918417
47 Democratic Republic of the Congo 74877 52000 69.447227
77 India 1300000 220000 16.923077
78 Indonesia 254455 100000 39.299680
124 Nigeria 177476 170000 95.787599

Top 5 countries for TB deaths account for 56% of worldwide deaths from TB

Top 5 countries by TB deaths: Country Index Codes [13,47,77,78,124] India, Nigeria, Indonesia, Bangladesh, Democratic Republic of the Congo

Whilst the share of worldwide population for this group falls from 48% to 27% (1.97bn people), the share of deaths caused by TB rises from 33% to 56% (623,000 deaths). All of these countries have rates of death caused by TB which are greater than the worldwide mean of 14.1 people per 100,000. So if you want to tackle tuberculosis, these countries ought to be included in your list.

The country with the largest number of deaths due to TB is India with 220,000, but its rate of TB deaths per 100,000 of population is 16.9, which is just short of 3.0 above the worldwide mean. The Democratic Republic of the Congo, by contrast, has a much higher death rate of 69.4 per 100,000 but a much smaller population of 75m, with 52,000 deaths. So there are wide discrepancies between country population and TB-related deaths.

In order to get a better picture of the problem, an investigation was carried out to discover how many countries would have to be included to cover 80% of all deaths from TB worldwide. It turns out that we would have to extend our original list of countries, sorted in descending order of TB deaths, from five to fifteen. This increases the percentage of worldwide TB deaths covered from 56% to about 80%, whilst the worldwide population covered increases from 27% to 57%.

data.sort_values('TB deaths', ascending=False)[:15]
Country Population (1000s) TB deaths TB deaths (per 100,000)
77 India 1300000.0 220000 16.923077
124 Nigeria 177476.0 170000 95.787599
78 Indonesia 254455.0 100000 39.299680
13 Bangladesh 159078.0 81000 50.918417
47 Democratic Republic of the Congo 74877.0 52000 69.447227
128 Pakistan 185044.0 48000 25.939776
35 China 1400000.0 38000 2.714286
58 Ethiopia 96958.7 32000 33.003743
184 United Republic of Tanzania 51822.6 30000 57.889801
116 Myanmar 53437.2 28000 52.397955
159 South Africa 53969.1 24000 44.469891
115 Mozambique 27216.3 18000 66.136837
190 Viet Nam 92423.3 17000 18.393630
141 Russian Federation 143429.0 16000 11.155345
0 Afghanistan 31627.5 14000 44.265275

Top 15 countries for TB deaths account for 80% of worldwide deaths from TB

Top 15 countries by TB deaths: Country Index Codes: [77,124,78,13,47,128,35,58,184,116,159,115,190,141,0]

Adding 10 more countries to the previous list creates a group with a combined population of 4.1bn people, which represents 57% of worldwide population. Total TB deaths for the group are 888,000, which represents about 80% of all worldwide deaths from TB.
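One way to find how many countries are needed to reach a given share of deaths is a running total over the sorted column. A sketch using the fifteen TB death figures from the table above and the worldwide total of 1,119,116 quoted in this report; the function name `countries_needed` is hypothetical:

```python
import pandas as pd

# TB deaths for the top 15 countries, already in descending order (from the table above).
deaths = pd.Series(
    [220000, 170000, 100000, 81000, 52000, 48000, 38000, 32000,
     30000, 28000, 24000, 18000, 17000, 16000, 14000]
)
WORLD_DEATHS = 1119116  # worldwide total quoted in this report

# Running total of deaths, expressed as a fraction of the worldwide total.
cumulative_share = deaths.cumsum() / WORLD_DEATHS

def countries_needed(shares, target):
    """Return how many leading countries are needed to reach `target` (a fraction).

    Assumes `shares` has the default 0-based index.
    Returns None if the listed countries never reach the target.
    """
    reached = shares[shares >= target]
    if reached.empty:
        return None
    return int(reached.index[0]) + 1

print(countries_needed(cumulative_share, 0.5))
print(round(cumulative_share.iloc[-1] * 100, 1))
```

On the full data set the same function could be applied to `data.sort_values('TB deaths', ascending=False)['TB deaths'].reset_index(drop=True)`, so that the index again counts positions from 0.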

Large countries in this group with notable rates of TB mortality include Bangladesh (population 159.1m, TB mortality rate 50.9), followed by Pakistan (population 185.0m, TB mortality rate 25.9) and then Russia (population 143.4m, TB mortality rate 11.2). Beyond these, country rates rise from around 30 to over 60 per 100,000 whilst populations fall from about 100m to about 30m.

**It is hardly surprising that countries with larger populations are likely to have greater numbers of deaths due to TB than countries with smaller populations.** But if we look at the rate of deaths per 100,000 within this group, it ranges from 2.7/100,000 in China to 95.8/100,000 in Nigeria.

Trying to classify this group of countries proves problematic, as they fall roughly into two groups: African or Asian. However, it is worth noting that all but one of the BRICS countries (i.e. Brazil) are present in the Top 15 group, and looking at the following data extract one can see why (see BRICs country summary below). Brazil was excluded because it was below the threshold of 14,000 TB deaths per annum, whereas China was retained.

BRICs country analysis:

# BRICS countries [Brazil, China, India, Russia, South Africa]
# countryCodes = [23,35,77,141,159]
data.loc[data.index.isin([23,35,77,141,159])]
Country Population (1000s) TB deaths TB deaths (per 100,000)
23 Brazil 206078.0 5300 2.571842
35 China 1400000.0 38000 2.714286
77 India 1300000.0 220000 16.923077
141 Russian Federation 143429.0 16000 11.155345
159 South Africa 53969.1 24000 44.469891

BRICs country summary:

The 5 BRICS countries account for 27% of all worldwide deaths caused by TB

BRICS: Country Index Codes: [23,35,77,141,159] Brazil, Russia, India, China, South Africa

  • number of countries 5

  • Total sum of combined populations 3,103,476.1k = 3.1bn people

  • Total sum of TB deaths 303,300

  • Country population as a percentage of the world total 3.1/7.257 = 42.7%

  • Country TB deaths as a percentage of world TB deaths 303,300/1,119,116 = 27%

The 5 BRICS countries had in total 303,300 deaths due to TB in 2014, which is 27% of the worldwide total. The bulk of these deaths, 220,000, come from India, which has the second largest population after China. South Africa has the highest death rate of these countries, with 44.5 deaths per 100,000 per annum. But the star performer has to be China, which has the world's largest population (1.4bn people) but only 38,000 deaths from TB. This equates to a rate of 2.71 deaths per 100,000, which puts it below the mean, and the country is ranked 108th out of 194 countries. Russia also falls below the mean TB death rate, coming in at 11.2 per 100,000.

This study fell short of a thorough analysis of TB rates of death per 100,000 due to time restrictions, but further investigation of this area is recommended. Some of the countries already mentioned might also warrant further investigation to identify actions that might help other countries in the fight against TB. Two countries spring to mind immediately: China for its barefoot doctor programme and Djibouti for being top in the world ranking by rate of TB deaths.

Also it has to be noted that these statistics are prone to subsequent variation. A prime example is Djibouti which in 2013 was quoted as having 6,000 deaths and then later in the light of further information this figure was 'revised' to 12,000.

End of Report

Post submission changes:

18/10/16 13:49

Q: How can I display the data so that it really shows the problem?

A: Try plotting deaths from TB on the X axis and rate of deaths from TB on the Y axis. That should sum up the situation that you have been trying to describe. This should be fairly easy as we already have the code and just need to make a few changes:

# TB Deaths vs Death rates per 100,000
# To set this up, before any plotting is performed you must execute the %matplotlib magic command
data.plot(kind='scatter', x = 'TB deaths', y = 'TB deaths (per 100,000)')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1e4374c6a0>
[Scatter plot: TB deaths (x axis) against TB deaths per 100,000 (y axis)]

Bingo! Just need to identify each dot on the plot...

  • Top left is Djibouti (highest rate of death)

  • Bottom far right is India (highest TB deaths)

  • Top right is Nigeria (rate just under 100; second highest in total TB deaths)
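To label the dots rather than identify them by eye, matplotlib's `annotate` can write each country name next to its point. A sketch assuming matplotlib is installed, using four rows copied from the tables in this report; the `Agg` backend is chosen only so that the example runs without a display, and would not be needed inside the notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

X, Y = "TB deaths", "TB deaths (per 100,000)"

# Four rows copied from the tables in this report (illustration only).
sample = pd.DataFrame(
    {
        "Country": ["Djibouti", "India", "Nigeria", "China"],
        X: [1100, 220000, 170000, 38000],
        Y: [125.542114, 16.923077, 95.787599, 2.714286],
    }
)

ax = sample.plot(kind="scatter", x=X, y=Y)

# Write each country's name just to the right of its dot.
labels = []
for _, row in sample.iterrows():
    labels.append(
        ax.annotate(row["Country"], (row[X], row[Y]),
                    xytext=(5, 0), textcoords="offset points")
    )
```

Labelling all 194 countries would clutter the plot, so on the full data set it may be better to label only a few rows, for example those returned by `data.nlargest(3, 'TB deaths')`.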

Project Appendix: Coding notes that are useful

# Coding Tips: How to use Markdown
# YouTube guide
# Markdown & LaTeX - Jupyter Tutorial (IPython 3)
# https://www.youtube.com/watch?v=-F4WS8o-G2A
# Coding Tip: display_max_rows
#Hiding most of the data in the middle is the default.
#It assumes you only need to see a few rows to check that all is working correctly.
#However you can set a maximum number for rows to display:
#import pandas as pd
#pd.options.display.max_rows = 999
# Coding Tip: Row selection
# data[a:b]
# where a = start row position (counting from 0) and b = one past the last row wanted (b = a+1 selects a single row)
# data.sort_values('TB deaths', ascending=False)[:5] # Reports first five rows
# data.sort_values('TB deaths', ascending=False)[-5:] # Reports last five rows
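Because a slice stops just before its end position, `[:5]` returns five rows and `[:6]` returns six. A quick check on a made-up DataFrame (the column values are invented):

```python
import pandas as pd

# Invented data: ten rows with a single numeric column.
df = pd.DataFrame({"TB deaths": [30, 10, 50, 20, 90, 70, 40, 80, 60, 100]})

ranked = df.sort_values("TB deaths", ascending=False)

first_five = ranked[:5]   # positions 0, 1, 2, 3, 4
last_five = ranked[-5:]   # the final five positions
one_row = ranked[2:3]     # a single row: b = a + 1

print(len(first_five), len(ranked[:6]), len(one_row))
print(first_five["TB deaths"].tolist())
```

Checking `len()` of a slice is a cheap way to catch an off-by-one before totals are calculated from it.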
# Coding Tip: Data File indexing: Specifying a range of Rows & Columns to be selected
# Note: .ix has since been deprecated and removed from pandas; .loc (by label) is used here instead
#data.loc[2:4,['Country','Population (1000s)','TB deaths']] # works for cols & rows
#data.loc[2:4,['Population (1000s)','TB deaths','Country']] # change the cols around if you want them ordered differently
#data.sort_values('TB deaths', ascending=False)[['TB deaths','Country']].head(10) #Sorted
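In current pandas, row and column selections of this kind are written with `.loc` (by label, end label included) or `.iloc` (by position, end position excluded). A sketch of the difference on the first five rows of the data set, copied by hand from the tables in this report:

```python
import pandas as pd

# Rows 0 to 4 of the data set, copied by hand (population in thousands).
sample = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
        "Population (1000s)": [31627.5, 2889.7, 38934.3, 72.8, 24227.5],
        "TB deaths": [14000.0, 17.0, 4400.0, 0.55, 13000.0],
    }
)

cols = ["Country", "TB deaths"]

by_label = sample.loc[2:4, cols]      # labels 2, 3 and 4 (end included)
by_position = sample.iloc[2:4][cols]  # positions 2 and 3 (end excluded)

print(by_label["Country"].tolist())
print(by_position["Country"].tolist())
```

The two only agree on which rows they return while the index still runs 0, 1, 2, ... in order; after sorting, `.iloc` or `.head()` is the safer choice for "the first n rows".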
# Coding Tips: Data Sorting
# Note: DataFrame.sort has been deprecated: ver 0.19.0
#DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#
#data_sorted = data.sort_values(['Col 1 heading','Col 2 heading'], ascending=False)
#
#data_sorted[['Col 1 heading','Col 2 heading']].head(10)
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values
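`sort_values` also accepts a list of columns with a matching list of directions, which gives a "sort within a sort". A sketch with three rows from the tables in this report; Djibouti and Timor-Leste both have 1,100 deaths, so the second key decides their order:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Country": ["Djibouti", "Timor-Leste", "Kiribati"],
        "Population (1000s)": [876.2, 1157.4, 110.5],
        "TB deaths": [1100, 1100, 54],
    }
)

# Sort by deaths (largest first); ties are broken by population (largest first).
ordered = sample.sort_values(
    ["TB deaths", "Population (1000s)"], ascending=[False, False]
)

# Same first key, but ties are now broken by population (smallest first).
reordered = sample.sort_values(
    ["TB deaths", "Population (1000s)"], ascending=[False, True]
)

print(ordered["Country"].tolist())
print(reordered["Country"].tolist())
```

The second key only has a visible effect where the first key has ties, which is worth bearing in mind when a second sort column appears to change nothing.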
# Coding Tips: Method guide
# The following commands are all equivalent:
# 1. popColumn == data['Population (1000s)']
# 2. popColumn.max() == data['Population (1000s)'].max()
# 3. max(popColumn) == max(data['Population (1000s)'])
# Peter's response:
# 1. popColumn.max()
# 2. max(popColumn)
#Python used to use a mix of both methods, but is now trying to consistently use the first method
#In the first method .max() is a method of popColumn
# Graph plotting commands
# 1. To plot histogram of data in cols
#data.hist()
# 2. To plot Population data by country index number (unsorted country list)
#data['Population (1000s)'].plot() # 'Population (1000s)'
# Produces a graphical representation of Population data by country index number
#Question: How do you plot a sorted list of Populations in descending order ?
# 3. To plot two columns of data against each other: deaths against Population
#data.plot(kind='scatter', x= 'TB deaths', y = 'Population (1000s)') # This also works by plotting deaths against Population
# Outstanding Task: Trying to control the formatting of results
#popColumn = data['Population (1000s)']
#popColumn.max()
#max(popColumn)
#print ("%.2f".format(popColumn.sum()) # SyntaxError: unexpected EOF while parsing (the closing bracket of print is missing)
#print("Population sum total is: {0} ".format(max(data['Population (1000s)'])))
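A possible way to control the formatting: `str.format` uses `{:.2f}` rather than `%.2f`, and the `print` call needs its closing bracket. A sketch using the population total reported earlier in this report:

```python
total_population = 3036923.3999999994  # value returned by spopTB.sum() earlier

# Two decimal places with str.format.
two_places = "Population sum total is: {:.2f}".format(total_population)
print(two_places)

# Thousands separators and one decimal place with an f-string (Python 3.6+).
with_commas = f"{total_population:,.1f}"
print(with_commas)
```

The same format specifications work on any float, including the result of `popColumn.sum()`.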
# to test sort within sort
# songs.sort_index(by=['Peak', 'Weeks'], ascending=[True, False])
#data.sort_index(by=['TB deaths', 'Country'], ascending=[True, False])
test = data.sort_values(['TB deaths','Population (1000s)'], ascending = [False, False])
#test = data.sort_values(['TB deaths'], ascending = True) # 'Population (1000s)'
#test[23:24]# Produces a list of the top 10 countries by TB deaths but changing population has no effect
#data.iloc[[23,24,28,32,33]]
data.iloc[55:71] # rows at positions 55 to 70 (irow has been replaced by iloc)
Country Population (1000s) TB deaths TB deaths (per 100,000)
55 Equatorial Guinea 820.9 54.00 6.578146
56 Eritrea 5110.4 710.00 13.893237
57 Estonia 1316.2 27.00 2.051360
58 Ethiopia 96958.7 32000.00 33.003743
59 Fiji 886.5 41.00 4.624929
60 Finland 5479.7 11.00 0.200741
61 France 64121.2 370.00 0.577032
62 Gabon 1687.7 940.00 55.697103
63 Gambia 1928.2 350.00 18.151644
64 Georgia 4034.8 270.00 6.691782
65 Germany 80646.3 330.00 0.409194
66 Ghana 26786.6 9700.00 36.212136
67 Greece 11000.8 110.00 0.999927
68 Grenada 106.3 0.47 0.442145
69 Guatemala 16015.5 260.00 1.623427
70 Guinea 12275.5 3600.00 29.326708
#data[[3,0]].head()
#data.irow([7,5,2])
#data.columns() # does not work
#data[[3,2,0,1]].head()
#data[3:6] # not working
#data[[3,2,0,1]].irow([4,3,2])
data.iloc[:, [0,1,2,3]].iloc[[1,2,3]] # this is pretty cool - cols first then rows
Country Population (1000s) TB deaths TB deaths (per 100,000)
1 Albania 2889.7 17.00 0.588296
2 Algeria 38934.3 4400.00 11.301089
3 Andorra 72.8 0.55 0.755495