Although we were directed to look at the World Bank list of indicators I found the range of data sets presented somewhat overwhelming and initially I had some difficulty searching through the website to try and find some suitable data that I could work with. When I followed up a particular subject that interested me I found that the data belonged to a third party or was almost non existant or partial. There were lots of avenues to persue and it was easy to get lost in all the profusion of data. However eventually I stumbled across the World Bank page listing all their indicators (which is what I should have looked at from the outsett) and this was where I made my selection from http://data.worldbank.org/indicator I chose to download the following datasets for 2014 for all countries:
I wanted to look at whether there was any statistical correlation between GDP per Capita and a counties labour force its rate of unemployment or energy use. I suspected that energy use went 'hand-in-hand' with GDP per Capita. But wanted to see if there was any relation to the size of the Labour force or unemployment. In particular I wanted to plot this information on a chart to see what it looked like.
The Datasets were all downloaded as csv files without problems and then uploaded to SageMathCloud again without any problems.
import warnings
warnings.simplefilter('ignore', FutureWarning)
from pandas import *
POP_INDICATOR = 'SP.POP.TOTL' # - Population, total
GDP_INDICATOR = 'NY.GDP.PCAP.CD' # - GDP per capita (current US$)
LAB_INDICATOR = 'SL.TLF.TOTL.IN' # - Labor force, total
UNEMP_INDICATOR = 'SL.UEM.TOTL.ZS' # - Unemployment, total (% of total labor force) (modeled ILO estimate)
ENERGY_INDICATOR = 'EG.USE.PCAP.KG.OE' # - Energy use (kg of oil equivalent per capita) (2013 Data)
YEAR = 2014
########## Download command ####################################################
# popWB = download(indicator=POP_INDICATOR, country='all', start=YEAR, end=YEAR)
# download(indicator=string, country='all', start=number, end=number) is a function in the pandas.io.wb module
# downloads World Bank data for the given indicator and all countries and country groups for a given range of years
###############################################################################
popReset = read_csv('SP.POP.TOTL.csv', skip_blank_lines=True )
gdpReset = read_csv('NY.GDP.PCAP.CD.csv')
labReset = read_csv('SL.TLF.TOTL.IN.csv')
unempReset = read_csv('SL.UEM.TOTL.ZS.csv')
energyReset = read_csv('EG.USE.PCAP.KG.OE.csv')
# The following functions may be useful during the data cleaning process and so are loaded here
def roundToMillions (value):
return round(value / 1000000)
def roundToThousands (value):
return round(value / 1000)
# The following function will not be required:
#def usdToGBP (usd):
# return usd / 1.564768# note if this is used check the correct rate for 2014 is used (this contains the 2013 rate)
Each Reset file was then cleaned in the following order and saved as a Country file:
Whilst the data appeared to be in good order I had difficulty trying to change the data type of the data column with Population values from object to type int64. After some testing this was found to be cause by a single cell containing two dots which presumably is used to signify that no data is currently available for that country. Whilst this was quite frustrating to have everything stopped by just one cell it turned out to a useful learning experience.
A simple solution was used and this cell was assigned the value 0.00
After that the asType method was applied to the whole column and the column type was changed to int64.
"asType(aType) when applied to a dataframe column, the method changes the data type of each value in that column to the type given by the string aType." Glossary Week 2
#Identifying the cell causing the problem that prevent asType('int64') from being applied
POP_INDICATOR = '2014 [YR2014]'
popReset[(popReset[POP_INDICATOR]=='..')]
#POP_INDICATOR = 'SP.POP.TOTL' Population, total
#popReset.head()
#popReset.columns# column names Index([u'Series Name', u'Series Code', u'Country Name', u'Country Code', u'2014 [YR2014]'],dtype='object')
popCountries = popReset[:217].dropna()
POP_INDICATOR = '2014 [YR2014]' # 'SP.POP.TOTL' replace with header name
#popCountries.dtypes# check before: all objects
######################################################################################
popCountries.loc[(popCountries[POP_INDICATOR]=='..'),POP_INDICATOR]=0.0 #
######################################################################################
popCountries[POP_INDICATOR] = popCountries[POP_INDICATOR].astype('int64')#
#popCountries.dtypes
#GDP_INDICATOR = 'NY.GDP.PCAP.CD' GDP per capita (current USD)
#gdpReset.head()
#popReset.columns#column names Index([u'Series Name', u'Series Code', u'Country Name', u'Country Code', u'2014 [YR2014]'],dtype='object')
gdpCountries = gdpReset[:217].dropna()
GDP_INDICATOR = '2014 [YR2014]' # 'NY.GDP.PCAP.CD' replace with header name
gdpCountries.dtypes# check before: all objects
#############Clean: Try using .loc[row_indexer,col_indexer] = value instead###########
gdpCountries.loc[(gdpCountries[GDP_INDICATOR]=='..'),GDP_INDICATOR]=0.0 #
######################################################################################
gdpCountries[GDP_INDICATOR] = gdpCountries[GDP_INDICATOR].astype('float64')#
#gdpCountries.dtypes
#LAB_INDICATOR = 'SL.TLF.TOTL.IN' - Labor force, total
#labReset.head()
#popReset.columns# column names Index([u'Series Name', u'Series Code', u'Country Name', u'Country Code', u'2014 [YR2014]'],dtype='object')
labCountries = labReset[:217].dropna()
# rename(columns={'2014 [YR2014]' : POP}) renames the column oldName to newName.
# london = london.rename(columns={'WindDirDegrees<br />':'WindDirDegrees'})
lab_INDICATOR = '2014 [YR2014]' # 'SL.TLF.TOTL.IN' replace with header name
labCountries.dtypes# check before: all objects
#############Clean: Try using .loc[row_indexer,col_indexer] = value instead###########
labCountries.loc[(labCountries[lab_INDICATOR]=='..'),lab_INDICATOR]=0.0 #
######################################################################################
labCountries[lab_INDICATOR] = labCountries[lab_INDICATOR].astype('float64')#
#labCountries.dtypes
#UNEMP_INDICATOR = 'SL.UEM.TOTL.ZS' # - Unemployment, total (% of total labor force) (modeled ILO estimate)
#unempReset.head()
#popReset.columns# column names Index([u'Series Name', u'Series Code', u'Country Name', u'Country Code', u'2014 [YR2014]'],dtype='object')
unempCountries = unempReset[:217].dropna()
unemp_INDICATOR = '2014 [YR2014]' # 'SL.UEM.TOTL.ZS' replace with header name
unempCountries.dtypes# check before: all objects
#############Clean: Try using .loc[row_indexer,col_indexer] = value instead###########
unempCountries.loc[(unempCountries[unemp_INDICATOR]=='..'),unemp_INDICATOR]=0.0 #
######################################################################################
unempCountries[unemp_INDICATOR] = unempCountries[unemp_INDICATOR].astype('float64')#
#unempCountries.dtypes
This file was cleaned just like all the others before. But later on it was felt that replacing the cells with '..' with zeros might not be the best course of action and a decision was made to remove all these rows from this particular data set. This can be seen by the addition of a drop command into the code. When the resultat file was checked later on it returned a better correlation coefficient and made the graph look a lot clearer.
#ENERGY_INDICATOR = 'EG.USE.PCAP.KG.OE' # - Energy use (kg of oil equivalent per capita) (2013 Data)
#energyReset.head()
#popReset.columns#column names Index([u'Series Name', u'Series Code', u'Country Name', u'Country Code', u'2014 [YR2014]'],dtype='object')
energyCountries = energyReset[:217].dropna()
energy_INDICATOR = '2013 [YR2013]' # 'SL.UEM.TOTL.ZS' replace with header name
energyCountries.dtypes# check before: all objects
#############Clean: Try using .loc[row_indexer,col_indexer] = value instead###########
#energyCountries.loc[(energyCountries[energy_INDICATOR]=='..'),energy_INDICATOR]=0.0 # try deleting these rows.
# df.drop(df.index[[1,3]])
energyColumn = energyCountries.drop(energyCountries.index[energyCountries[energy_INDICATOR]=='..']) # deletes all these rows.
######################################################################################
energyCountries = energyColumn
#energyCountries.dtypes
#energyCountries['2013 [YR2013]']
energyCountries[energy_INDICATOR] = energyCountries[energy_INDICATOR].astype('float64')#
#energyCountries.dtypes
Each individual Country file was reduced to just essential columns and saved as a Clean file. I kept the Country Codes column as I felt that this would prove to be a better 'Common Column' join all the tables with.
"A join is the merging of two tables on a common column. The resulting table has all columns of both tables (the common column isn’t duplicated), and the rows are determined by the type of join. Rows in the two tables that have the same value in the common column become a joined row in the resulting table." Glossary Week 3
#POP_INDICATOR = 'SP.POP.TOTL' Population, total
# Round Population to 1000s
COUNTRY = 'Country Name'
CODE = 'Country Code'
POP = 'Population (1000s)'
popCountries[POP]= popCountries[POP_INDICATOR].apply(roundToThousands)# cant run this till data cleaned
# Consolidates data & gets rid of year & Source data [checked]
headings = [COUNTRY, CODE, POP]
popClean = popCountries[headings]
#popClean.head()
#GDP_INDICATOR = 'NY.GDP.PCAP.CD' GDP per capita (current USD)
# Round GDP to? leave as is for time being
COUNTRY = 'Country Name'
CODE = 'Country Code'
GDP = 'Per Capita GDP $'
gdpCountries[GDP]= gdpCountries[GDP_INDICATOR]#.apply(roundToThousands)# cant run this till data cleaned
# Consolidates data & gets rid of year & Source data [checked]
headings = [COUNTRY, CODE, GDP]
gdpClean = gdpCountries[headings]
#gdpClean.head()
#LAB_INDICATOR = 'SL.TLF.TOTL.IN' - Labor force, total
# Round LAB to ?? leave as is for the time being
COUNTRY = 'Country Name'
CODE = 'Country Code'
LAB = 'Labour force total'
labCountries[LAB]= labCountries[lab_INDICATOR]#.apply(roundToThousands)# cant run this till data cleaned
# Consolidates data & gets rid of year & Source data [checked]
headings = [COUNTRY, CODE, LAB]
labClean = labCountries[headings]
#labClean.head()
#UNEMP_INDICATOR = 'SL.UEM.TOTL.ZS' # - Unemployment, total (% of total labor force) (modeled ILO estimate)
# Round Unemp to %
COUNTRY = 'Country Name'
CODE = 'Country Code'
UNEMP = 'Unemployment (% total force)' #Unemployment, total (% of total labor force) (modeled ILO estimate)
unempCountries[UNEMP]= unempCountries[unemp_INDICATOR]#
# Consolidates data & gets rid of year & Source data [checked]
headings = [COUNTRY, CODE, UNEMP]
unempClean = unempCountries[headings]
#unempClean.head()
#ENERGY_INDICATOR = 'EG.USE.PCAP.KG.OE' # - Energy use (kg of oil equivalent per capita) (2013 Data)
# Round Energy to ??
COUNTRY = 'Country Name'
CODE = 'Country Code'
ENERGY = 'kg of oil / capita' # Energy use (kg of oil equivalent per capita) (2013 Data)
energyCountries[ENERGY]= energyCountries[energy_INDICATOR]#.apply(roundToThousands)# cant run this till data cleaned
# Consolidates data & gets rid of year & Source data [checked]
headings = [COUNTRY, CODE, ENERGY]
energyClean = energyCountries[headings]
#energyClean.head()
All the Clean files were then merged together into one data file and I was able to test out Method Chaining using the Merge function to merge five files all at the same time. I was half expecting this process to blow up in my face but it actually worked! The data file once checked was then reduced down to a finaldata file with just the columns required for analysis.
# Perform Method Chain Merge:
data = popClean.merge(gdpClean, on=CODE).merge(labClean,on=CODE).merge(unempClean,on=CODE).merge(energyClean,on=CODE)
data.head()
# Could make use of drop axis command inplace to remove specific commands or use method in next cell
#df.drop('Column name',axis=1,inplace=True)
# Now Consolidate all data into one big table [checked]
headings = [COUNTRY, CODE, POP, GDP, LAB, UNEMP, ENERGY]
finalData = data[headings]
finalData.head()
#finalData.tail()
A number of tests were then performed on this data set to determine the strength of any correlation between the data sets.
Per Capita GDP & Labour Force Surprisingly there was onlt a small correlation of 0.193 between Per Capita GDP and the size of the Labour force. Yet current economic theory suggest that where more people are gathered incomes tend to be higher. This does not appear to hold true when it comes to looking at Countries
Per Capita GDP & Unemployment This returned a slightly higher correlation of 0.274. Although as unemployment is measured as a % of the total labout force the relationship here between Per Capitas GDP and unemployment will be an inverse one. The more Kg of oil a country consumes the less its unemployment percentage is likely to be. Similar curves were in fact found with the next data set.
Kg Oil per Capita & Unemployment This returned a marginal increase in correlation of 0.332. The graph of this data set is shown below and seems to describe an inverse realtionship in the form 1/x. Although it is quite difficult to make much sense of this data an increase in Kg of oil per Capital would suggests lower unemployment. But there is a large group of countries just sitting in the bottom left corner.
Kg Oil per Capita & Labour Force This has a higher correlation coefficient of 0.44668 and would seem to make sense. The bigger the size of the Labour force the higher the KgOil consumption per Capita.
GDP per Capita & Kg Oil per Capita This returned the highest Correlation Coefficient of all the data sets tested of 0.666 and is in line with common sense. However the dataset when plotted did have a number of results on the zero line for Kg Oil per Capita. I wondered whether this might be distoring the data somewhat so I took the decision to eliminat all the rows from my dataset that I had replaced with a zero. The code was amended and rerun to return a correlation factor of 0.799 the highest correlation coefficient so far.
The next step was to graph and plot this data.
from scipy.stats import spearmanr
# A number of test were performed to identify which data set had the most significant correlation
# However imported Energy data had had null values set to zero and it was felt this may be distorting results so they were stripped
gdpColumn = finalData[GDP]#
energyColumn = finalData[ENERGY]#
(correlation, pValue) = spearmanr(energyColumn, gdpColumn)
print('The correlation is', correlation)
if pValue < 0.05:
print('It is statistically significant.')
else:
print('It is not statistically significant.')
%matplotlib inline
finalData.plot(y=ENERGY, x=GDP, kind='scatter', grid=True, figsize=(10, 4))# Graph changed logx=True,
# I tried both layouts; this graph seemed to be the easiest to read
If you look at the Per Capital GDP there is a distinct group of four countries all to the right of the vertical line measuring $
80,000. These countries were investigated using the boolean search below as their per capita GDP appears to be the highest in the world. Swizerland Luxembourg & Norway all have Per Capital Kg of Oil ranging from $
3.3k - $
7.3k whilst Qatar is in a league of its own just short of $
19k.
finalData[(finalData[GDP] > 80000) & (finalData[ENERGY] > 0)].sort(GDP,ascending=0)# These are the outliers for GDP
I found this dataset difficult to understand. But researching these countries provided explanations which seem to support the facts
Luxembourg has a total population equivalent to a small city. It is easy to find other areas the same size that have higher GDP per capita, they just aren't countries so don't appear in the rankings. Luxembourg's GDP also includes a lot of businesses that are set up there to take advantage of low taxes. Most of the revenue is actually generated elsewhere, but directed through Luxembourg by clever accounting tricks - that boosts the GDP figures. Whilst Luxembourg’s high energy use per capita can not be explained by access to abundant and cheap energy sources. The high ratio of energy use in Luxembourg can be partly explained by the low sales taxes on petroleum products, which encourage motorists and other consumers from neighbouring countries (Belgium, France and Germany) to buy their supplies in Luxembourg.
Norway has enormous oil and gas resources and is a major producer of hydropower. It is Europe's largest oil producer and the third-largest natural gas exporter in the entire world behind Russia and Qatar.
Qatar has a well-developed oil exploration industry where the petroleum industry accounts for 70% of its government revenue, 60% of its GDP and 85% of its export earnings. Qatar is also the most conterversial country on the list for its use of forced labour from nations throughout Asia and parts of Africa. Qatar belches out more CO2 per capita than anywhere else.
Swiss banking and financial institutions keep Switzerland and its economy afloat. It is important to note that some of the wealthiest people and companies in the world own Swiss bank accounts and therefore Switzerland has excess capital to use for investment purposes. Zurich and Geneva, Switzerland's most well-known cities, have consistently ranked among the top ten highest living standard cities in the world.
Energy use refers to use of primary energy before transformation to other end-use fuels, which is equal to indigenous production plus imports and stock changes, minus exports and fuels supplied to ships and aircraft engaged in international transport.
The next dataset to look at are those countries with per Capita GDP less than $80000 but with Kg of Oil / Capita greater than $9,500 which seem to contain about 5 countries. They are shown below sorted in descending order of Kg of Oil / Capita.
finalData[(finalData[GDP] < 80000) & (finalData[ENERGY] > 9500)].sort(ENERGY,ascending=0)# These are the outliers for GDP
This data extract threw up another curious collection of countries which apart from Kuwait & Bahrain I would not normally associate as being amongst the highest per Capita users of energy in the World.
With a population of just 327 thousand Iceland is the world’s largest hydropower country per capita, the world’s largest geothermal energy producer per capita, and the world’s largest electricity producer per capita. It's competitively priced electricity (from hydro- and geothermal power) has attracted numerous industries and services. Currently, the aluminum industry in Iceland consumes close to 75% of all the electricity produced in Iceland. A very high share of Iceland’s energy comes from renewable sources, making Iceland one of the greenest country in the world with regard to energy consumption. In total, approximately 86% of Iceland’s consumption of primary energy comes from renewable sources. And what is especially interesting, is the fact that Iceland still has access to numerous competitive renewable energy sources yet to be harnessed.
Trinidad and Tobago is the wealthiest country in the Caribbean as well as the third richest country by GDP (PPP) per capita in the Americas after the United States and Canada. It is the region's leading producer of oil and gas; and houses one of the largest natural gas processing facilities in the Western Hemisphere. Trinidad and Tobago is the leading Caribbean producer of oil and gas, and its economy is heavily dependent upon these resources but it also supplies manufactured goods, notably food and beverages, as well as cement to the Caribbean region. Oil and gas account for about 40% of GDP and 80% of exports, but only 5% of employment. https://en.wikipedia.org/wiki/Economy_of_Trinidad_and_Tobago
Curaçao. With a population of 156 thousand Curaçao is an island country in the southern Caribbean Sea, approximately 65 kilometres north of the Venezuelan coast. It has one of the highest standards of living in the Caribbean. The island has a well-developed infrastructure with strong tourism and financial services sectors. Shipping, international trade, oil refining, and other activities related to the port of Willemstad (like the Free Trade Zone) also make a significant contribution to the economy. The discovery of oil in the Maracaibo Basin of Venezuela in the early 20th century forced the Venezuelan government to search for ideal locations for large scale refining. Curaçao's proximity to the country, naturally deep harbors, and stable government led Royal Dutch Shell to construct the Isla Refinery, the largest refinery in the world at the time. The refinery is located in Willemstad harbour and began operating in 1918. In 2013, the unemployment rate was 13%. Though Curaçao is considered a tax haven, it adheres to the EU Code of Conduct against harmful tax practices. https://en.wikipedia.org/wiki/Economy_of_Cura%C3%A7ao
Bahrain. Bahrain has the freest economy in the Middle East and North Africa region and is the tenth freest economy in the world. Bahrain was recognised by the World Bank as a high income economy. Petroleum production is Bahrain's most exported product, accounting for 60% of export receipts, 70% of government revenues, and 11% of GDP. Aluminium production is the second most exported product, followed by finance and construction materials. https://en.wikipedia.org/wiki/Economy_of_Bahrain
Kuwait is a constitutional monarchy with a high income economy backed by the world's sixth largest oil reserves. According to the World Bank, the country has the fourth highest per capita income in the world. 70% of the population are expatriates, while only 30% of the population are Kuwaiti citizens. https://en.wikipedia.org/wiki/Kuwait
Development Relevance: In developing economies growth in energy use is closely related to growth in the modern sectors - industry, motorized transport, and urban areas - but energy use also reflects climatic, geographic, and economic factors (such as the relative price of energy). Energy use has been growing rapidly in low- and middle-income economies, but high-income economies still use almost five times as much energy on a per capita basis. Governments in many countries are increasingly aware of the urgent need to make better use of the world's energy resources. Improved energy efficiency is often the most economic and readily available means of improving energy security and reducing greenhouse gas emissions.
Limitations and Exceptions: The IEA makes these estimates in consultation with national statistical offices, oil companies, electric utilities, and national energy experts. The IEA occasionally revises its time series to reflect political changes, and energy statistics undergo continual changes in coverage or methodology as more detailed energy accounts become available. Breaks in series are therefore unavoidable.
%matplotlib inline
finalData.plot(y=UNEMP, x=ENERGY, kind='scatter', grid=True, figsize=(10, 4))# Graph changed logx=True,
Whilst a significant correlation was also found between energy and Unemployment logic would suggest that this is an inverse relationship and plotting the county values suggest a curve comming down the y axis and turning to run along the x axis. This a clasical inverse relationship with most of the data being bunched in the bottom left corner. As we have already considered the high per Capita energy & GNP countries the following extract focuses on those countries with low energy per Capita & high unemployment. The list has been sorted in descending order of unemployment as % of the Labour force.
finalData[(finalData[UNEMP] > 15) & (finalData[ENERGY] <5000)].sort(UNEMP,ascending=0)#
The top 10 Countries in this list have unemployment rates around or above 20%: Greece South Africa Spain & Mozambique all have large populations & struggling economies. What is noticeable is the inclusion of Iraq & Libya which are two of the major oil producing economies yet due War their economies have suffered major setbacks.
The data acquisition was not difficult and the data cleaning although initially laborious got quicker with each file that was repeated.
However the analysis of the final data proved to be more difficult than anticipated to understand. The decision to reanalyse the data after removing all countries with no per Capita energy results turned out to be the right one as the correlation coefficients were improved and it was easier to focus on remaining data.
What surprised me most about this study was the list of countries from the two data extracts that were so diverse and out of the ordinary but which on further investigation turned out to be substantialy correct.