This is the project notebook for Week 3 of The Open University's Learn to code for Data Analysis course.
Richer countries can afford to invest more in healthcare, in work and road safety, and in other measures that reduce mortality. On the other hand, richer countries may have less healthy lifestyles. Is there any relation between the wealth of a country and the life expectancy of its inhabitants?
The following analysis checks whether there is any correlation between the total gross domestic product (GDP) of a country in 2013 and the life expectancy of people born in that country in 2013.
The project has also been extended to answer further questions, including whether GDP per capita has a greater bearing on life expectancy than total GDP.
from IPython.display import display, HTML
HTML('''Note: <p style="display:inline;color:darkred;">Option to toggle code visibility on/off is at <a href="#Bottom">bottom</a> of page.</p>''')
Two datasets of the World Bank are considered. One dataset, available at http://data.worldbank.org/indicator/NY.GDP.MKTP.CD, lists the GDP of the world's countries in current US dollars, for various years. The use of a common currency allows us to compare GDP values across countries. The other dataset, available at http://data.worldbank.org/indicator/SP.DYN.LE00.IN, lists the life expectancy of the world's countries. The datasets were downloaded as CSV files in March 2016.
import warnings
warnings.simplefilter('ignore', FutureWarning)
from pandas import *
YEAR = 2013
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
gdpReset = read_csv('WB GDP 2013.csv')
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
lifeReset = read_csv('WB LE 2013.csv')
lifeReset.head()
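Inspecting the data with head() and tail() shows that the first 34 rows are not data for individual countries and that some values are missing.
The data is therefore cleaned by removing the first 34 rows and dropping the rows with missing values.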
gdpCountries = gdpReset[34:].dropna()
lifeCountries = lifeReset[34:].dropna()
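The two cleaning steps can be tried on a small table first. The sketch below is only an illustration: the table, its column names and the `HEADER_ROWS` constant are hypothetical stand-ins, not the World Bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: two unwanted leading rows followed by three country rows.
toy = pd.DataFrame({
    'country': ['Region X', 'Region Y', 'A', 'B', 'C'],
    'value': [10.0, 20.0, 1.0, np.nan, 3.0],
})

HEADER_ROWS = 2  # stands in for the 34 leading rows of the real files

# Slice off the leading rows by position, then drop rows with any missing value.
countries = toy[HEADER_ROWS:].dropna()

print(countries)
```

Note that `dropna()` with no arguments removes a row if any of its columns is missing, which is why country B disappears entirely.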
The World Bank reports GDP in US dollars and cents. To make the data easier to read, the GDP is converted to millions of British pounds (the author's local currency) with the following auxiliary functions, using the average 2013 dollar-to-pound conversion rate provided by http://www.ukforex.co.uk/forex-tools/historical-rate-tools/yearly-average-rates.
def roundToMillions (value):
    return round(value / 1000000)
def usdToGBP (usd):
    return usd / 1.564768
GDP = 'GDP (£m)'
gdpCountries[GDP] = gdpCountries[GDP_INDICATOR].apply(usdToGBP).apply(roundToMillions)
gdpCountries.head()
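As an alternative to chaining two `apply()` calls, the same conversion can be written as arithmetic on the whole column. This is a sketch rather than the notebook's method: the function name `usd_to_gbp_millions` and the sample dollar amounts are made up for illustration, and only the exchange rate comes from the text above.

```python
import pandas as pd

USD_PER_GBP = 1.564768  # average 2013 rate quoted in the text

def usd_to_gbp_millions(usd):
    """Convert a Series of US dollar amounts to whole millions of pounds."""
    return (usd / USD_PER_GBP / 1000000).round()

# Hypothetical dollar amounts, chosen so the result is easy to check by hand.
usd = pd.Series([1564768000.0, 3129536000.0])
gbp_m = usd_to_gbp_millions(usd)

print(gbp_m.tolist())
```

The order of operations matches the original functions: convert the currency first, then scale to millions and round.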
The unnecessary columns can be dropped.
COUNTRY = 'country'
headings = [COUNTRY, GDP]
gdpClean = gdpCountries[headings]
gdpClean.head()
The World Bank reports the life expectancy with several decimal places. After rounding, the original column is discarded.
LIFE = 'Life expectancy (years)'
lifeCountries[LIFE] = lifeCountries[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
lifeClean = lifeCountries[headings]
lifeClean.head()
The tables are combined through an inner join on the common 'country' column.
gdpVsLife = merge(gdpClean, lifeClean, on=COUNTRY, how='inner')
gdpVsLife.head()
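An inner join silently discards any country that appears in only one of the two tables. The following sketch, on hypothetical toy tables, shows one way to list the discarded rows by running an outer join with `indicator=True`.

```python
import pandas as pd

# Hypothetical toy tables: 'A' has no life expectancy, 'D' has no GDP.
gdp = pd.DataFrame({'country': ['A', 'B', 'C'], 'gdp': [100, 200, 300]})
life = pd.DataFrame({'country': ['B', 'C', 'D'], 'life': [70, 80, 60]})

# The inner join keeps only countries present in both tables.
inner = pd.merge(gdp, life, on='country', how='inner')

# The outer join keeps everything; the added '_merge' column records
# whether each row came from the left table, the right table, or both.
outer = pd.merge(gdp, life, on='country', how='outer', indicator=True)
dropped = outer[outer['_merge'] != 'both']['country'].tolist()

print(inner)
print('Dropped by the inner join:', dropped)
```

Checking the dropped rows is a cheap way to spot spelling differences in country names between datasets.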
To measure whether life expectancy and GDP grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. A perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 is conventionally taken to mean that the correlation is statistically significant.
from scipy.stats import spearmanr
gdpColumn = gdpVsLife[GDP]
lifeColumn = gdpVsLife[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation using Spearman ranking is', correlation)
print('The p-value is', pValue)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
The value shows a direct correlation, i.e. richer countries tend to have longer life expectancy, but it is not very strong.
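To see what "rank correlation" means in practice, the sketch below compares it with the ordinary (Pearson) correlation on made-up numbers. Spearman's coefficient is the Pearson correlation computed on the ranks of the values, so it is written here with pandas' `rank()` and `corr()` rather than with `spearmanr`; the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always rises with x, but far from linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)

# Pearson correlation on the raw values measures linear association.
pearson = x.corr(y)

# Pearson correlation on the ranks is the Spearman rank correlation.
spearman = x.rank().corr(y.rank())

print('Pearson: ', pearson)
print('Spearman:', spearman)
```

Because only the ordering matters, the rank correlation is perfect here even though the linear correlation is not, which suits indicators such as GDP that span several orders of magnitude.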
Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot. The GDP axis uses a logarithmic scale to better display the vast range of GDP values, from a few million to several trillion (million million) pounds.
%matplotlib inline
gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4))
The plot shows there is no clear correlation: there are rich countries with low life expectancy, poor countries with high expectancy, and countries with around 10 thousand (10^4) million pounds GDP have almost the full range of values, from below 50 to over 80 years. Towards the lower and higher end of GDP, the variation diminishes. Above 40 thousand million pounds of GDP (3rd tick mark to the right of 10^4), most countries have an expectancy of 70 years or more, whilst below that threshold most countries' life expectancy is below 70 years.
Comparing the 10 poorest countries and the 10 countries with the lowest life expectancy shows that total GDP is a rather crude measure. The population size should be taken into account for a more precise definition of what 'poor' and 'rich' mean. Furthermore, looking at the countries below, droughts and internal conflicts may also play a role in life expectancy.
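# the 10 countries with lowest GDP
# (sort_values() replaces the DataFrame.sort() method, which newer pandas versions no longer provide)
gdpVsLife.sort_values(GDP).head(10)
# the 10 countries with highest GDP
gdpVsLife.sort_values(GDP, ascending=False).head(10)
# the 10 countries with lowest life expectancy
gdpVsLife.sort_values(LIFE).head(10)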
To what extent do the ten countries with the highest GDP coincide with the ten countries with the longest life expectancy?
#################################
# Extending the project - Task 1
#################################
# To what extent do the ten countries with the highest GDP coincide
# with the ten countries with the longest life expectancy?
# Which to use? sort_values(LIFE, ascending=False).head(10) or sort_values(LIFE).tail(10)?
# The two approaches don't return the same data frames, because countries with
# equal (tied) life expectancy can end up on either side of the cut-off:
# - Using sort_values(LIFE, ascending=False).head(10) gives you just Japan and France.
# - Using sort_values(LIFE).tail(10) gives Italy as well as Japan and France.
# See PROOF below.
# the 10 countries with highest GDP
gdpTopTenHead = gdpVsLife.sort_values(GDP, ascending=False).head(10)
#gdpTopTenTail = gdpVsLife.sort_values(GDP).tail(10)
# the 10 countries with highest life expectancy
lifeTopTenHead = gdpVsLife.sort_values(LIFE, ascending=False).head(10)
#lifeTopTenTail = gdpVsLife.sort_values(LIFE).tail(10)
###### PROOF ######
# Proof that sort_values(LIFE, ascending=False).head(10) and sort_values(LIFE).tail(10) give different results:
#display(gdpTopTenHead)
#display(gdpTopTenTail)
#display(lifeTopTenHead)
#display(lifeTopTenTail)
# What are the differences regarding countries included in each resultant table?
#lifeTopTenTailList = lifeTopTenTail[COUNTRY].tolist()
#lifeTopTenHeadList = lifeTopTenHead[COUNTRY].tolist()
#print(set(lifeTopTenTailList) ^ set(lifeTopTenHeadList)) # Calculate symmetrical difference
####################
print("From the data, just Japan and France have some of the highest GDP figures that coincide with the longest life expectancy figures.")
print("Hence there appears to be no strong correlation between a country's wealth and the life expectancy of its inhabitants.")
display(merge(gdpTopTenHead, lifeTopTenHead,how='inner')) # Gives just Japan and France
#display(merge(gdpTopTenTail[[COUNTRY]], lifeTopTenTail[[COUNTRY]],how='inner')) # Gives Italy, Japan and France
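The head-versus-tail discrepancy discussed in the comments above can only arise when several countries share the same rounded life expectancy at the cut-off. The sketch below reproduces the effect on a hypothetical toy table and shows one possible way around it, `nlargest()` with `keep='all'`, which keeps every row tied with the last qualifying value. This is a suggestion, not the approach the notebook used.

```python
import pandas as pd

# Hypothetical toy table with a three-way tie at 82 years.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'life': [83, 82, 82, 82, 80],
})

# Taking a "top 2" has to cut through the tie, so which tied country is
# kept depends on the sort direction and on the sorting algorithm.
top_desc = toy.sort_values('life', ascending=False).head(2)
top_asc = toy.sort_values('life').tail(2)

# keep='all' returns more than 2 rows rather than breaking the tie arbitrarily.
top_with_ties = toy.nlargest(2, 'life', keep='all')

print(sorted(top_desc['country']))
print(sorted(top_asc['country']))
print(sorted(top_with_ties['country']))
```

With ties kept, the comparison between the two top-ten lists no longer depends on how the sort happened to order equal values.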
Which are the two countries in the right half of the plot (higher GDP) with life expectancy below 60 years? What factors could explain their lower life expectancy compared to countries with similar GDP? Hint: use the filtering techniques you learned in Week 2 to find the two countries.
gdpVsLife[(gdpVsLife[GDP] > 10E4) & (gdpVsLife[LIFE] < 60)]
The two countries with higher GDP (i.e. above £10^5 million) but a life expectancy below 60 are Nigeria and South Africa.
Redo the analysis using the countries’ GDP per capita (i.e. per inhabitant) instead of their total GDP. If you’ve done the workbook exercises, you already have a column with the population data. Hint: write an expression involving the GDP and population columns, as you learned in Calculating over columns in Week 1. Think about the units in which you display GDP per capita.
# 1 Get the GDP data from earlier (missing data and first 34 rows already removed).
# 2 Convert GDP from $ to £ (not £m because we don't want to round it too early and introduce rounding errors).
# 3 Get rid of columns we don't want
GDP = 'GDP (£)'
gdpCountries[GDP] = gdpCountries[GDP_INDICATOR].apply(usdToGBP) # US$ to GB£
headings = [COUNTRY, GDP] # Put headings we want to keep in a list
gdpClean = gdpCountries[headings] # Create new dataframe with selected headings
display(gdpClean.head())
# 1 Get the POP data
# 2 Clean it by removing missing data and first 34 rows of unwanted data.
# 3 Get rid of columns we don't want
POP = 'SP.POP.TOTL' # NOTE - This 'indicator' is a column name in csv file
popReset = read_csv('WB POP 2013.csv')
popCountries = popReset[34:].dropna() #1
headings = [COUNTRY, POP] #2 Put headings we want to keep in a list
popClean = popCountries[headings] #3 Create new dataframe with selected headings
popClean.head()
# OK, let's merge the GDP and population tables to get one table we can use to calculate 'GDP per capita'.
gdpVsPop = merge(gdpClean, popClean, on=COUNTRY, how='inner')
gdpVsPop.head()
# Create 'GDP per capita' column using an appropriate calculation on the other two columns.
def roundTo2dp (value):
    return round(value, 2)
GDPPC = 'GDP per capita (£)'
gdpVsPop[GDPPC] = gdpVsPop[GDP] / gdpVsPop[POP]
gdpVsPop[GDPPC] = gdpVsPop[GDPPC].apply(roundTo2dp) # Round GDPPC to 2 decimal places
headings = [COUNTRY, GDPPC] # Put headings we want to keep in a list
gdppc = gdpVsPop[headings] # Create new dataframe with selected headings
gdppc.head()
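The earlier comment about not rounding GDP to millions too early can be checked with a small calculation. The figures below are hypothetical and chosen only to make the effect visible for a small economy.

```python
import pandas as pd

# Hypothetical small country: GDP of 37.4 million pounds, 10,000 inhabitants.
gdp = pd.Series([37400000.0])
pop = pd.Series([10000])

# Dividing the unrounded GDP by the population, then rounding to pence.
exact = (gdp / pop).round(2)

# Rounding GDP to whole millions first loses 0.4 million before the division.
rounded_early = ((gdp / 1000000).round() * 1000000 / pop).round(2)

print('Per capita from unrounded GDP:', exact[0])
print('Per capita from GDP in whole millions:', rounded_early[0])
```

Dividing one column by another works element by element, so the same expression applies unchanged to the full table.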
# OK, let's merge the gdppc and life expectancy tables to get one table we can work with...
gdppcVsLife = merge(gdppc, lifeClean, on=COUNTRY, how='inner')
# sort_values() replaces the DataFrame.sort() method, which newer pandas versions no longer provide
gdppcTop3 = gdppcVsLife.sort_values(GDPPC, ascending=False).head(3)
gdppcBottom3 = gdppcVsLife.sort_values(GDPPC, ascending=False).tail(3)
lifeTop3 = gdppcVsLife.sort_values(LIFE, ascending=False).head(3)
lifeBottom3 = gdppcVsLife.sort_values(LIFE, ascending=False).tail(3)
print("Top 3 for GDP per capita (£):")
display(gdppcTop3)
print("Bottom 3 for GDP per capita (£):")
display(gdppcBottom3)
print("Top 3 for life expectancy (years)")
display(lifeTop3)
print("Bottom 3 for life expectancy (years):")
display(lifeBottom3)
print("The UK GDP per capita (£) and life expectancy figures are as shown below:")
display(gdppcVsLife[(gdppcVsLife[COUNTRY] == 'United Kingdom')]) # Sanity check!
gdppcVsLife.plot(x=GDPPC, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 5))
from scipy.stats import spearmanr
gdppcColumn = gdppcVsLife[GDPPC]
lifeColumn = gdppcVsLife[LIFE]
(correlation, pValue) = spearmanr(gdppcColumn, lifeColumn)
print('The correlation using Spearman ranking is', correlation)
print('The p-value is', pValue)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
From our analysis, there appears to be no strong correlation between a country's wealth and the life expectancy of its inhabitants: there is often a wide variation of life expectancy for countries with similar GDP, countries with the lowest life expectancy are not the poorest countries, and countries with the highest expectancy are not the richest countries. Nevertheless there is some relationship, because the vast majority of countries with a life expectancy below 70 years are in the left half of the scatterplot.
From the chart above, however, we can see that, generally, as GDP per capita increases so does life expectancy. Having said that, there is one country with a high GDP per capita (above £10,000) but a life expectancy below 60. This country is Equatorial Guinea, as shown below. This, alongside the moderate correlation (0.501, p<0.001), indicates that although there is a general trend between these two variables, there are likely to be other factors involved in this relationship, which should be identified and further investigated.
gdppcVsLife[(gdppcVsLife[GDPPC] > 10000) & (gdppcVsLife[LIFE] < 60)]
numberOfRows = len(gdppcVsLife.index) # Get number of rows
set_option('display.max_rows', numberOfRows) # Set max_rows option (full option name avoids ambiguity)
gdppcSort = gdppcVsLife.sort_values(GDPPC, ascending=False)
gdppcSort = gdppcSort.reset_index(drop=True) # Reset index and drop old one
print("\nNOTE:\n\nFor GDP per capita, Equatorial Guinea is " + str(gdppcSort[(gdppcSort[COUNTRY] == 'Equatorial Guinea')].index[0]+1) + " out of " + str(len(gdppcSort)) + " countries.")
lifeSort = gdppcVsLife.sort_values(LIFE, ascending=False)
lifeSort = lifeSort.reset_index(drop=True) # Reset index and drop old one
print("But for life expectancy, Equatorial Guinea is number " + str(lifeSort[(lifeSort[COUNTRY] == 'Equatorial Guinea')].index[0]+1) + "!")
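A country's position in each ordering can also be obtained without sorting and re-indexing, by ranking the columns directly. The sketch below uses a hypothetical toy table (the country labels, column names and values are invented) and `rank()` with `method='min'`, so that tied countries share the best position among them.

```python
import pandas as pd

# Hypothetical toy table, not the World Bank figures.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'gdppc': [40000.0, 12000.0, 900.0, 300.0],
    'life': [81, 55, 70, 62],
})

# Rank both columns from highest (1) to lowest, indexed by country.
ranks = (toy.set_index('country')[['gdppc', 'life']]
            .rank(ascending=False, method='min')
            .astype(int))

print(ranks.loc['B'])
```

A large gap between the two ranks of one country, as for B here, flags the kind of outlier discussed above.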
import pandas # 'from pandas import *' does not reliably bind the name 'pandas' itself
print("\nPandas version", pandas.__version__, end="")
from IPython.display import HTML
HTML('''<script>
function toggler() {
$('div.input').toggle();
location.href="#Bottom";
}
</script>
<p style="display:inline;"><center>Click <a href="javascript:toggler();">here</a> to toggle code visibility on/off</center></p>
<script>
$('div.input').show();
location.href="#Top";</script></div>''')