by Tony Hirst and Michel Wermelinger, 19 November 2015 and 28 March 2016 with ammendments by Chris Pyves 17th November 2016.
This is the project notebook for Week 4 of The Open University's Learn to Code for Data Analysis course.
A country's economy depends, sometimes heavily, on its exports and imports. The United Nations Comtrade database provides data on global trade. It will be used to analyse the UK's imports and exports of milk and cream in 2015:
# Now Repeat the whole analysis for January–May 2014 and compare the results:
import warnings
warnings.simplefilter('ignore', FutureWarning)
from pandas import *
%matplotlib inline
The data is obtained from the United Nations Comtrade website, by selecting the following configuration:
Clicking on 'Preview' results in a message that the data exceeds 500 rows. Data was downloaded using the Download CSV button and the download file renamed appropriately.
#Type of Product: goods
#Frequency: monthly
#Periods: January to May of 2015
#Reporter: United Kingdom
#Partners: all
#Flows: imports and exports
#HS (as reported) commodity codes: 0401 (Milk and cream, neither concentrated nor sweetened) and 0402 (Milk and cream, concentrated or sweetened)
# http://comtrade.un.org/api/get?max=5000&type=C&freq=M&px=HS&ps=201501%2C201502%2C201503%2C201504%2C201505&r=826&p=all&rg=1%2C2&cc=0401%2C0402&fmt=csv
#downloaded comtrade(12).csv
#Type of Product: goods
#Frequency: monthly
#Periods: January to May of 2014[ Note: systems limit: can only select 5 items]
#Reporter: United Kingdom
#Partners: all
#Flows: imports and exports
#HS (as reported) commodity codes: 0401 (Milk and cream, neither concentrated nor sweetened) and 0402 (Milk and cream, concentrated or sweetened)
# http://comtrade.un.org/api/get?max=5000&type=C&freq=M&px=HS&ps=201401%2C201402%2C201403%2C201404%2C201405&r=826&p=all&rg=1%2C2&cc=0401%2C0402&fmt=csv
#downloaded comtrade(13).csv
# Now Repeat the whole analysis for January–May 2014 and compare the results:
# there seem to be two ways of approaching this problem
# 1. Repeat what has already been done but with a new datafile, or
# 2. Add the previous year data to the file and work through the questions using all the techniques and methods that we have learnt
LOCATION2 = 'comtrade_milk_uk_jan_jul_15.csv'# OK
LOCATION1 = 'comtrade_milk_uk_jan_may_14.csv'# OK
The data can also be downloaded directly from Comtrade using the "View API Call" URL, modified in two ways:
max=500
is increased to max=5000
to make sure all data is loaded,&fmt=csv
is added at the end to obtain the data in CSV format.# LOCATION2 = 'http://comtrade.un.org/api/get?max=5000&type=C&freq=M&px=HS&ps=201505%2C201504%2C201503%2C201502%2C201501&r=826&p=all&rg=1%2C2&cc=0401%2C0402&fmt=csv'
# LOCATION1 = 'http://comtrade.un.org/api/get?max=5000&type=C&freq=M&px=HS&ps=201401%2C201402%2C201403%2C201404%2C201405&r=826&p=all&rg=1%2C2&cc=0401%2C0402&fmt=csv
On reading in the data, the commodity code has to be read as a string, to not lose the leading zero.
milk1 = read_csv(LOCATION1, dtype={'Commodity Code':str})# 2014
milk1.tail(2)
#type(milk1)# pandas.core.frame.DataFrame
#milk1.describe# [606 rows x 35 columns]
milk2 = read_csv(LOCATION2, dtype={'Commodity Code':str})# 2015
milk2.tail(2)
The data in the two files above only covers the first five months of 2014 & 2015. Most columns are irrelevant for this analysis, or contain always the same value, like the year and reporter columns. The commodity code is transformed into a short but descriptive text and only the relevant columns are selected.
def milkType(code):
if code == '0401': # neither concentrated nor sweetened
return 'unprocessed'
if code == '0402': # concentrated or sweetened
return 'processed'
return 'unknown'
COMMODITY = 'Milk and cream'
milk1[COMMODITY] = milk1['Commodity Code'].apply(milkType)
YEAR = 'Year'
PERIOD = 'Period'
PARTNER = 'Partner'
FLOW = 'Trade Flow'
VALUE = 'Trade Value (US$)'
headings = [YEAR, PERIOD, PARTNER, FLOW, COMMODITY, VALUE]
milk1 = milk1[headings]
milk1.tail()
COMMODITY = 'Milk and cream'
milk2[COMMODITY] = milk2['Commodity Code'].apply(milkType)
YEAR = 'Year'
PERIOD = 'Period'
PARTNER = 'Partner'
FLOW = 'Trade Flow'
VALUE = 'Trade Value (US$)'
headings = [YEAR, PERIOD, PARTNER, FLOW, COMMODITY, VALUE]
milk2 = milk2[headings]
milk2.tail()
# merge(left=frame1, right=frame2, on=columnName, how=string) returns a new dataframe,
# obtained by joining the two frames on the columns with the given common name.
# The how argument can be one of ‘left’, ‘right’, ‘inner’ and ‘outer’.
# Join the two data files using append
milk = milk1.append(milk2)
# reset the index 605 + 1 + 636 + 1 = 1242 rows (0 - 1242)
# frame = frame.reset_index(drop=True) returns a new dataframe in which rows are labelled from 0 onwards.
milk = milk.reset_index(drop=True)
milk.tail()
#milk.describe # [1243 rows x 6 columns]
The data contains the total imports and exports per month, under the 'World' partner. Those rows are removed to keep only the per-country data.
milk = milk[milk[PARTNER] != 'World']
milk.tail()#
#milk[350:400]
# Check your data:
result = pivot_table(milk,
index=[YEAR],
columns= PERIOD,
values=VALUE,
aggfunc=sum)
result
# This suggest you may need to create a MONTH column - which when combined with Year will give you the granularity you require
# see next cell:
MONTH = 'Month'
milk[MONTH]=milk[PERIOD].astype(str).str[5:].astype('int64')# removes year from Period and puts month value in MONTH column
#milk[MONTH]# operation successful
#milk.dtypes
# Now check the data (using MONTH):
result = pivot_table(milk,
index=[MONTH],
columns= YEAR,
values=VALUE,
aggfunc=sum)
result
To answer the first question, 'how much does the UK export and import and is the balance positive (more exports than imports)?', the dataframe is split into two groups: exports from the UK and imports into the UK. The trade values within each group are summed up to get the total trading.
# Original answer:
# This is a neat way to get the two totals - adding another YEAR of data just requires adding YEAR to groupby
grouped = milk.groupby([YEAR,FLOW])[VALUE]# puts it all in one statement
# grouped.groups.keys()# dict_keys([(2014, 'Imports'), (2015, 'Imports'), (2015, 'Exports'), (2014, 'Exports')])
# grouped.get_group(groupName)
print ('in 2014 the trade surplus was: ',(grouped.get_group((2014, 'Exports')).sum())-(grouped.get_group((2014, 'Imports')).sum()))
print ('in 2015 the trade surplus was: ',(grouped.get_group((2015, 'Exports')).sum())-(grouped.get_group((2015, 'Imports')).sum()))
result = pivot_table(milk,
index=[MONTH,COMMODITY],
columns= [YEAR , FLOW],
values=VALUE,
aggfunc=sum)
result
#note these totals are all positive - may have to create InFlows & OutFlows to report Net Trade Flow (see cell below)
NET='Netflow'
values= milk['Trade Value (US$)']# set up values
milk[NET] = values.where(milk[FLOW] == 'Exports', other=-values)#apply condition
#milk.head()
result = pivot_table(milk,
index=[MONTH,COMMODITY],
columns= [YEAR, FLOW],
values=NET,
aggfunc=sum)
print('2014 Net Flows',result[2014].sum().sum())
print('2015 Net Flows',result[2015].sum().sum())
result#
These figures indicate that in the first five months Jan - May the UK trade surplus has declined by 100m dollars from 209m dollars in 2014 to 109m dollars in 2015.
To address the second question, 'Which are the main trading partners, i.e. from/to which countries does the UK import/export the most?', the dataframe is split by country instead, and then each group aggregated for the total trade value. This is done separately for imports and exports. The result is sorted in descending order so that the main partners are at the top.
# The following cells are various attempts at reworking the data with different coding approaches to achieve the desired result
# First pass - note: table now contains two years of data instead of one - results will by combined (see cell below)
# make your selection by FLOW then use groupby
imports = milk[milk[FLOW] == 'Imports']# with additional year data an additional filter is required (see cell below)
grouped = imports.groupby([PARTNER])
#print('The UK imports from', len(grouped), 'countries.')
#print('The 5 biggest exporters to the UK are:')
totalImports = grouped[VALUE].aggregate(sum).sort(inplace=False,ascending=False)# this now includes both years!!!!!!
#totalImports.head()
# second pass using same approach as first to return results by year
imports2014 = milk[((milk[FLOW] == 'Imports') & (milk[YEAR] == 2014))]# simply add a year to your filter
grouped = imports2014.groupby([PARTNER])
print('In 2014 the UK imports from', len(grouped), 'countries.')
print('The 5 biggest exporters to the UK are:')
totalImports = grouped[VALUE].aggregate(sum).sort(inplace=False,ascending=False)
totalImports.head()
# How to use groupby as a filter (without resorting to the filter function)
# Problem: to return total number of countries imported/exported with during a given year
# grouped = milk.groupby(['Year','Trade Flow'])# extracts key components 'Year' & 'Flow' (does not incluse country)
# next line strips out a unique list of countries traded with and returns total count:
# len(grouped.get_group((2014,'Imports'))['Partner'].unique())# how identify which countries you imported/exported with in a given year
# consolidated into one line
# len(milk.groupby(['Year','Trade Flow']).get_group((2014,'Imports'))['Partner'].unique())
# second pass using pivot table to return both year results
# make your selection (at detail level) then summarise:
# How to return total countries you imported/exported with in a given year using groupby without filter
print('In 2014 the UK imported from',len(milk.groupby(['Year','Trade Flow']).get_group((2014,'Imports'))['Partner'].unique()),'countries')
print('In 2015 the UK imported from',len(milk.groupby(['Year','Trade Flow']).get_group((2015,'Imports'))['Partner'].unique()),'countries')
print('The following table shows the top 5 importers for 2015')
result = pivot_table(milk,
index=[PARTNER],
columns= [YEAR , FLOW],
values=VALUE,
aggfunc=sum)
result.dropna().sort((2015,'Imports'),ascending=False).head(5)# sorts by latest year Imports descending order
#note the results in original cell (first pass) will now report results for two years
The UK import values by country can be plotted as a bar chart, making differences between countries & year easier to see.
#totalImports.head(10).plot(kind='barh')# worked with original data
# How to plot bar chart for both years Imports?
result.dropna().sort((2015,'Imports'),ascending=False)[[(2014,'Imports'),(2015,'Imports')]].head(10).plot(kind='barh')
From these results it can be seen that in 2015 the UK reduced imports from its top 5 suppliers (in the first five months compared to 2014).
# Additional question: Who are regular exporters to the UK (i.e. more than 10 months of sales)
def sellsEveryMonth(g):
return (len(g) >= 10)
importer = milk[(milk[FLOW]=='Imports')]# simple but powerful comand to return imports only
regular = importer.groupby([YEAR,PARTNER]).filter(sellsEveryMonth)
countries = pivot_table(regular,
index=[PARTNER],
columns=[YEAR,FLOW],
values=VALUE,
aggfunc=[len,sum])
countries.sort(('sum',2015,'Imports'),ascending=False)# how to specify a pivot table col to sortby
# note: can put .dropna() after countries to exclude places like Denmark but that would be wrong.
Now lets look at who the UK exports to:
# Note original code - worked for 2015 but now you have added 2014 it will report total for both years
exports = milk[milk[FLOW] == 'Exports']
grouped = exports.groupby([PARTNER])
print('The UK exports to', len(grouped), 'countries.')
grouped[VALUE].aggregate(sum).order(ascending=False).head()
# second pass using pivot table to return both year results
# make your selection (at detail level) then summarise:
# How to return total countries you imported/exported with in a given year using groupby without filter
exports = milk[milk[FLOW] == 'Exports']
print('Len(exports. In 2014 the UK exported to',len(exports.groupby(['Year','Trade Flow']).get_group((2014,'Exports'))['Partner'].unique()),'countries')
print('In 2015 the UK exported to',len(exports.groupby(['Year','Trade Flow']).get_group((2015,'Exports'))['Partner'].unique()),'countries')
print('The 5 biggest importers from the UK are:')
result = pivot_table(exports,
index=[PARTNER],
columns= [YEAR , FLOW],
values=VALUE,
aggfunc=sum)
result.dropna().sort((2015,'Exports'),ascending=False).head(5)
#note the results in original cell (first pss) will now report results for two years
# Where could the export market be further developed, i.e. which countries import the least? Do the figures look realistic?
countries = pivot_table(exports,
index=[PARTNER],
columns=[YEAR,FLOW],
values=VALUE,
aggfunc=sum)
countries.sort((2015,'Exports')).head(10)
# The figures are too low to start with - may heve to start higer up >greater than
# Poor countries that have better cattle & grazing than we do. Need to study each market for niche opportunities
# What are they buying when & why?
Given that there are two commodities, the third question, 'Which are the regular customers, i.e. which countries buy milk from the UK every month?', is meant in the sense that a regular customer imports both commodities every month. This means that if the exports dataframe is grouped by country, each group has exactly ten rows (two commodities bought each of the five months). To see the countries, only the first month of one commodity has to be listed, as by definition it's the same countries every month and for the other commodity.
# Original code - see improved code in next cell
#def buysEveryMonth(group):
# return len(group) == 10
#grouped = exports.groupby([PARTNER])
#regular = grouped.filter(buysEveryMonth)
#regular[(regular[MONTH] == 201501) & (regular[COMMODITY] == 'processed')]
# which countries buy milk from the UK every month?
# a regular customer = imports both commodities every month: Partner Year 'Exports' 'Both Commodoities' = 10
def bothImpExp(g):
return (len(g)==10) & (g[VALUE].min() > 0)# set at 20 cos you have 5 months of data for each year x 2 years
# Partner year commodity flow =5 (combine flow = 10 = both imports & exports)
exports = milk[milk[FLOW] == 'Exports']
table = exports.groupby([YEAR,PARTNER]).filter(bothImpExp)#
PT = pivot_table(table,
index = [PARTNER],
columns = [YEAR, FLOW],
values = VALUE,
# margins = True,# True (default) or False seems to make no difference
aggfunc = [len,sum])
PT.dropna() #
# Whilst you might be willing to drop a contry that has a NaN in latest year does the same apply for a new customer who had NaN in prior year?
# see also calculation problems encountered as a result of using dropna below - does not return the correct resultant table data
#regular[(regular[MONTH] == 201501) & (regular[COMMODITY] == 'processed')]
PT[('sum',2014,'Exports')].sum() / milk.groupby([YEAR,FLOW]).get_group((2014,'Exports'))[VALUE].sum(VALUE)#
#regular[(regular[MONTH] == 201501) & (regular[COMMODITY] == 'processed')]
PT[('sum',2015,'Exports')].sum() / milk.groupby([YEAR,FLOW]).get_group((2015,'Exports'))[VALUE].sum(VALUE)#
Whilst in 2014 these regular customers represented 73% of total UK exports in 2015 it increased to 75%.
# note problem with dropna series total reports all in group.
PT[('sum',2014,'Exports')]
To address the fourth question, 'Which countries does the UK both import from and export to?', a pivot table is used to list the total export and import value for each country.
# Which countries does the UK both import from and export to?',
# This whole task can be done so easily with pivot table = create NET column
# NET='Netflow'
# values= milk['Trade Value (US$)']# set up values
# milk[NET] = values.where(milk[FLOW] == 'Exports', other=-values)#apply correect sign to net column Exports=+ Imports=-
# sum is calculated by adding margins-True
# filter by year
table = milk[milk[YEAR] == 2014]# Need to select year
result = pivot_table(table,index=[PARTNER],columns=[YEAR,FLOW],values=[NET],margins=True,aggfunc=[sum]).dropna()
result
# What is the total amount of exports to and imports from the bi-lateral trade countries?
#[Hint: pivot tables can have ‘marginal’ values.]
# Series of pivot table has colum total 'All' - need to access this result
count=len(result)
print('Total Net Trade for UK Milk Products:',result[('sum', 'Netflow', 'All')][count-1])# correct result is 209527379.0
# How to get Net Trade result - series
# What is the total amount of exports to and imports from the bi-lateral trade countries? [Hint: pivot tables can have ‘marginal’ values.]
# just report sum 2,099,527,379
#result[('sum','Netflow','All')]#.sum()
#'Which countries does the UK both import from and export to?'
# What is the total amount of exports to and imports from the bi-lateral trade countries? [Hint: pivot tables can have ‘marginal’ values.]
# This whole task can be done so easily with pivot table - sum is calulated by adding margins-True
#
countries = pivot_table(milk, index=[PARTNER], columns=[YEAR,FLOW],
values=VALUE, dropna= True, margins=True, aggfunc=sum,)
countries# returns table of imports & exports by year by country total: 1087599674
#How to sort the list using 'All' as the criteria
countries.sort('All',ascending=False).dropna().head(10)
#
NET='Netflow'
values= milk['Trade Value (US$)']# set up values
milk[NET] = values.where(milk[FLOW] == 'Exports', other=-values)#apply condition
#milk.head()
result = pivot_table(milk,
index=[MONTH,COMMODITY],
columns= [YEAR, FLOW],
values=NET,
#margins=True,# including this creates problems when summing values
aggfunc=sum)
print('2014 Net Flows',result[2014].sum().sum())
print('2015 Net Flows',result[2015].sum().sum())
result#
# How to return a total value from a datFrame that has totals included
#result['All'][len(result)-1]
# How to return a total value from a datFrame that DOES NOT HAVE totals included
result[2014].sum()
# 2014
# 437806707-228279328 = 209527379 net trade
# 2015
# 265029661-156483978 = 108545683 net trade
# total net
# 209527379+108545683 = 318073062
The milk and cream trade of the UK for 2014 & 2015 was analyse over the periods January to May in terms of which countries the UK mostly depends on for income (exports) and goods (imports). Over the period, the UK trade surplus fell by 50% form 200 million US dollars in 2014 to just over 100 million US dollars 2015.
print('In 2014 the UK imported from',len(milk.groupby(['Year','Trade Flow']).get_group((2014,'Imports'))['Partner'].unique()),'countries')
print('In 2015 the UK imported from',len(milk.groupby(['Year','Trade Flow']).get_group((2015,'Imports'))['Partner'].unique()),'countries')
print('The following table shows the top 5 importers for 2015')
result = pivot_table(milk,
index=[PARTNER],
columns= [YEAR , FLOW],
values=VALUE,
aggfunc=sum)
result.dropna().sort((2015,'Imports'),ascending=False).head(5)
Ireland remains the UK's number one importer for Milk Products; although exports fell by 73m from 200m in 2014 to 128m in 2015. Whilst imports of Milk products from Ireland also fell during the same period the reduction of 29m from 75m in 2014 to 46m in 2015 was not sufficient to halt decline in the UK Net Balance of trade of 43m with Ireland.
exports = milk[milk[FLOW] == 'Exports']
print('In 2014 the UK exported to',len(exports.groupby(['Year','Trade Flow']).get_group((2014,'Exports'))['Partner'].unique()),'countries')
print('In 2015 the UK exported to',len(exports.groupby(['Year','Trade Flow']).get_group((2015,'Exports'))['Partner'].unique()),'countries')
print('The 5 biggest importers from the UK are:')
result = pivot_table(exports,
index=[PARTNER],
columns= [YEAR , FLOW],
values=VALUE,
aggfunc=sum)
result.dropna().sort((2015,'Exports'),ascending=False).head(5)
The UK exported to over 100 countries during the period 2014-2015, the num,ner of countries that the UK imported from reduced from 23 to 21, the main ones (top five by trade value) all being geographically close to the UK. China and Hong Kong are the main importers that are not also main exporters.
def bothImpExp(g):
return (len(g)==10) & (g[VALUE].min() > 0)# set at 20 cos you have 5 months of data for each year x 2 years
exports = milk[milk[FLOW] == 'Exports']
table = exports.groupby([YEAR,PARTNER]).filter(bothImpExp)#
PT = pivot_table(table,
index = [PARTNER],
columns = [YEAR, FLOW],
values = VALUE,
# margins = True,# True (default) or False seems to make no difference
aggfunc = [len,sum])
PT.dropna() #
The UK is heavily dependent on its regular customers, the 16 countries (see above) that buy all types of milk and cream every month. They contribute three quarters of the total export value.
#'Which countries does the UK both import from and export to?'
# What is the total amount of exports to and imports from the bi-lateral trade countries? [Hint: pivot tables can have ‘marginal’ values.]
# This whole task can be done so easily with pivot table - sum is calulated by adding margins-True
#
countries = pivot_table(milk, index=[PARTNER], columns=[YEAR,FLOW],
values=VALUE, dropna= True, margins=True, aggfunc=sum,)
countries# returns table of imports & exports by year by country total: 1087599674
#How to sort the list using 'All' as the criteria
countries.sort('All',ascending=False).dropna().head(20)
The UK has bi-directional trade (i.e. both exports and imports) with 20 countries, although for some the trade value (in US dollars) is suspiciously low, which raises questions about the data's accuracy.