Suzanne Lees Project 2 - Moscow

Project 2: Holiday weather

by Rob Griffiths and Suzanne Lees, 11 September 2015, updated 11 April 2017, 18 October 2017, 20 December 2017, and 6 August 2017

This is the project notebook for the second part of The Open University's Learn to code for Data Analysis course.

There is nothing I like better than taking a holiday. In the winter I like to have a two week break in a country where I can be guaranteed sunny dry days. In the summer I like to have two weeks off relaxing in my garden in London. However I'm often disappointed because I pick a fortnight when the weather is dull and it rains. So in this project I am going to use the historic weather data from the Weather Underground for London to try to predict two good weather weeks to take off as holiday next summer. Of course the weather in the summer of 2016 may be very different to 2014 but it should give me some indication of when would be a good time to take a summer break.

Getting the data

Weather Underground keeps historical weather data collected in many airports around the world. Right-click on the following URL and choose 'Open Link in New Window' (or similar, depending on your browser):

http://www.wunderground.com/history

When the new page opens start typing 'LHR' in the 'Location' input box and when the pop up menu comes up with the option 'LHR, United Kingdom' select it and then click on 'Submit'.

When the next page opens with London Heathrow data, click on the 'Custom' tab and select the time period From: 1 January 2014 to: 31 December 2014 and then click on 'Get History'. The data for that year should then be displayed further down the page.

You can copy each month's data directly from the browser to a text editor like Notepad or TextEdit, to obtain a single file with as many months as you wish.

Weather Underground has changed the way it provides data in the past and may do so again in the future. I have therefore collated all of the 2014 data in the provided 'London_2014.csv' file.

Now load the CSV file into a dataframe making sure that any extra spaces are skipped:

import warnings
warnings.simplefilter('ignore', FutureWarning)

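from pandas import *
# datetime is used further down for date slicing; recent pandas versions no longer export it
from datetime import datetime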
london = read_csv('London_2014.csv', skipinitialspace=True)
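To see what `skipinitialspace=True` changes, here is a minimal sketch that reads a made-up two-row sample from memory rather than the real 'London_2014.csv' file. The sample text and the variable names `sample`, `raw` and `clean` are illustrative assumptions only.

```python
import io
import pandas as pd

# Made-up sample (not the real file) with a space after every comma
sample = (
    "GMT, Mean TemperatureC, Events\n"
    "2014-1-1, 8, Rain\n"
    "2014-1-2, 9, Rain-Thunderstorm\n"
)

raw = pd.read_csv(io.StringIO(sample))                            # default behaviour
clean = pd.read_csv(io.StringIO(sample), skipinitialspace=True)   # spaces after commas skipped

print(list(raw.columns))     # column names keep their leading space
print(list(clean.columns))   # column names are clean
print(clean['Events'].tolist())
```

Without the option, a lookup such as `raw['Events']` would fail because the column is actually named `' Events'`.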

Cleaning the data

First we need to clean up the data. I'm not going to make use of 'WindDirDegrees' in my analysis, but you might in yours, so we'll rename 'WindDirDegrees<br />' to 'WindDirDegrees'.

london = london.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})

Then remove the <br /> HTML line breaks from the values in the 'WindDirDegrees' column,

london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')

and change the values in the 'WindDirDegrees' column to float64:

london['WindDirDegrees'] = london['WindDirDegrees'].astype('float64')   
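One detail worth knowing: `rstrip('<br />')` does not remove the literal text `<br />`; it strips any trailing characters drawn from that set. This is harmless for this column because digits are not in the set. The sketch below uses made-up values (not taken from the CSV file) to compare it with a literal replacement, shown only as a possible alternative.

```python
import pandas as pd

# Made-up values in the same shape as the raw 'WindDirDegrees<br />' column
wind = pd.Series(['186<br />', '90<br />', '7<br />'])

# rstrip removes any trailing '<', 'b', 'r', ' ', '/' or '>' characters
stripped = wind.str.rstrip('<br />')
# replace with regex=False removes the literal substring '<br />'
replaced = wind.str.replace('<br />', '', regex=False)

print(stripped.tolist())
print(replaced.astype('float64').tolist())

# On text ending in one of those characters, rstrip removes too much
over_stripped = pd.Series(['number<br />']).str.rstrip('<br />')
print(over_stripped.tolist())   # ['numbe']
```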

We definitely need to change the values in the 'GMT' column into values of the datetime64 date type.

london['GMT'] = to_datetime(london['GMT'])

We also need to change the index from the default to the datetime64 values in the 'GMT' column so that it is easier to pull out rows between particular dates and display more meaningful graphs:

london.index = london['GMT']

Finding a summer break

According to meteorologists, summer extends for the whole months of June, July, and August in the northern hemisphere and the whole months of December, January, and February in the southern hemisphere. So as I'm in the northern hemisphere I'm going to create a dataframe that holds just those months using the datetime index, like this:

summer = london.loc[datetime(2014,6,1) : datetime(2014,8,31)]
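Slicing with `.loc` on a datetime index includes both end dates, which is why 31 August is part of the summer dataframe. A minimal sketch with made-up temperatures (the dataframe `df` and its values are illustrative, not taken from the London data):

```python
from datetime import datetime
import pandas as pd

# Made-up daily values around the summer boundaries
df = pd.DataFrame({'GMT': ['2014-5-31', '2014-6-1', '2014-8-31', '2014-9-1'],
                   'Mean TemperatureC': [14, 15, 18, 17]})
df['GMT'] = pd.to_datetime(df['GMT'])   # strings become datetime64 values
df.index = df['GMT']                    # index rows by date

# Label-based slicing keeps both the start and the end date
summer_demo = df.loc[datetime(2014, 6, 1) : datetime(2014, 8, 31)]
print(summer_demo['Mean TemperatureC'].tolist())   # [15, 18]

# Date strings work as slice bounds too
same = df.loc['2014-06-01':'2014-08-31']
print(len(same))   # 2
```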

I now look for the days with warm temperatures.

summer[summer['Mean TemperatureC'] >= 25]
GMT Max TemperatureC Mean TemperatureC Min TemperatureC Dew PointC MeanDew PointC Min DewpointC Max Humidity Mean Humidity Min Humidity ... Max VisibilityKm Mean VisibilityKm Min VisibilitykM Max Wind SpeedKm/h Mean Wind SpeedKm/h Max Gust SpeedKm/h Precipitationmm CloudCover Events WindDirDegrees
GMT

0 rows × 23 columns
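An empty result can look like a mistake, so it helps to check it against the column's maximum. A small sketch with made-up values (the names `temps`, `mask` and `warm` are illustrative):

```python
import pandas as pd

temps = pd.DataFrame({'Mean TemperatureC': [18, 21, 23, 19]})   # made-up values

mask = temps['Mean TemperatureC'] >= 25   # boolean Series, one entry per row
warm = temps[mask]                        # keeps only the rows where mask is True

print(len(warm))                          # 0
print(temps['Mean TemperatureC'].max())   # 23
```

If the maximum is below the threshold, the empty table is the expected answer rather than an error in the filter.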

Summer 2014 was rather cool in London: there are no days with a mean temperature of 25 degrees Celsius or higher. It is best to plot the temperature and look for the warmest period.

So next we tell Jupyter to display any graph created inside this notebook:

%matplotlib inline

Now let's plot the 'Mean TemperatureC' for the summer:

summer['Mean TemperatureC'].plot(grid=True, figsize=(10,5))
[Plot: 'Mean TemperatureC' for London, 1 June to 31 August 2014]

Looking at the graph, the second half of July looks good, with mean temperatures over 20 degrees C, so let's put precipitation on the graph too:

summer[['Mean TemperatureC', 'Precipitationmm']].plot(grid=True, figsize=(10,5))
[Plot: 'Mean TemperatureC' and 'Precipitationmm' for London, 1 June to 31 August 2014]

The second half of July is still looking good, with just a couple of peaks showing heavy rain. Let's have a closer look by just plotting mean temperature and precipitation for July.

july = summer.loc[datetime(2014,7,1) : datetime(2014,7,31)]
july[['Mean TemperatureC', 'Precipitationmm']].plot(grid=True, figsize=(10,5))
[Plot: 'Mean TemperatureC' and 'Precipitationmm' for London, 1 to 31 July 2014]

Yes, the second half of July looks pretty good: only two days have significant rain, the 25th and the 28th, and only one day has a mean temperature below 20 degrees, also the 28th.

Conclusions

The graphs have shown the volatility of a British summer, but a couple of weeks were found when the weather wasn't too bad in 2014. Of course this is no guarantee that the weather pattern will repeat itself in future years. To make a sensible prediction we would need to analyse the summers for many more years. By the time you have finished this course you should be able to do that.

Moscow

I will now analyse the data for Moscow!

Getting the Data:

from pandas import *
moscow = read_csv('Moscow_SVO_2014.csv', skipinitialspace=True)

Cleaning the Data:

moscow = moscow.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})
moscow['WindDirDegrees'] = moscow['WindDirDegrees'].str.rstrip('<br />')
moscow['WindDirDegrees'] = moscow['WindDirDegrees'].astype('float64') 
moscow['Date'] = to_datetime(moscow['Date'])
moscow.index = moscow['Date']

Finding a summer break:

summer = moscow.loc[datetime(2014,6,1) : datetime(2014,8,31)]
summer[(summer['Mean TemperatureC'] <= 25) & (summer['Mean TemperatureC'] >= 20)]

Date Max TemperatureC Mean TemperatureC Min TemperatureC Dew PointC MeanDew PointC Min DewpointC Max Humidity Mean Humidity Min Humidity ... Max VisibilityKm Mean VisibilityKm Min VisibilitykM Max Wind SpeedKm/h Mean Wind SpeedKm/h Max Gust SpeedKm/h Precipitationmm CloudCover Events WindDirDegrees
Date
2014-06-02 2014-06-02 27 22 17 12 8 4 68 42 23 ... 10.0 10.0 10.0 29 16 43.0 0.0 7.0 Rain 115.0
2014-06-03 2014-06-03 29 20 12 11 6 3 77 40 19 ... NaN NaN NaN 21 13 35.0 0.0 NaN NaN 137.0
2014-06-04 2014-06-04 29 20 11 12 8 2 82 45 18 ... 8.0 8.0 8.0 26 11 43.0 0.0 1.0 NaN 108.0
2014-06-05 2014-06-05 31 22 14 14 8 2 77 44 16 ... NaN NaN NaN 18 10 35.0 0.0 NaN NaN 116.0
2014-06-06 2014-06-06 31 22 14 13 9 6 82 45 21 ... 10.0 10.0 10.0 32 5 NaN 0.0 6.0 Thunderstorm 154.0
2014-06-07 2014-06-07 30 22 14 17 13 10 94 64 31 ... 10.0 9.0 2.0 29 5 29.0 0.0 6.0 Rain-Thunderstorm 144.0
2014-06-08 2014-06-08 27 21 15 15 13 12 94 63 39 ... 10.0 9.0 8.0 21 10 NaN 0.0 4.0 NaN 256.0
2014-07-01 2014-07-01 28 22 17 14 12 11 72 52 37 ... NaN NaN NaN 26 19 43.0 0.0 NaN NaN 186.0
2014-07-02 2014-07-02 29 22 14 16 13 9 100 61 32 ... 10.0 10.0 7.0 32 13 47.0 0.0 6.0 Rain-Thunderstorm 155.0
2014-07-14 2014-07-14 30 21 13 13 10 8 94 49 25 ... 9.0 9.0 9.0 14 10 29.0 0.0 1.0 NaN 138.0
2014-07-15 2014-07-15 30 22 14 15 11 8 88 51 27 ... NaN NaN NaN 18 5 32.0 0.0 NaN NaN 208.0
2014-07-16 2014-07-16 31 23 16 16 13 11 82 52 31 ... 10.0 10.0 10.0 18 5 21.0 0.0 5.0 Thunderstorm 178.0
2014-07-17 2014-07-17 27 21 16 19 17 15 94 75 48 ... 10.0 9.0 5.0 21 6 40.0 0.0 5.0 Rain-Thunderstorm 12.0
2014-07-20 2014-07-20 27 20 13 13 11 9 88 57 34 ... 10.0 10.0 9.0 14 8 32.0 0.0 3.0 NaN 2.0
2014-07-26 2014-07-26 26 21 17 16 12 8 73 54 36 ... 10.0 10.0 10.0 26 16 47.0 0.0 3.0 NaN 327.0
2014-07-27 2014-07-27 29 20 11 13 10 8 88 54 30 ... NaN NaN NaN 18 8 29.0 0.0 NaN NaN 319.0
2014-07-28 2014-07-28 32 23 15 13 12 10 82 49 26 ... 9.0 9.0 9.0 18 8 NaN 0.0 1.0 NaN 278.0
2014-07-29 2014-07-29 32 23 15 14 12 8 82 47 22 ... 7.0 5.0 2.0 26 8 50.0 0.0 6.0 NaN 233.0
2014-07-31 2014-07-31 32 24 18 17 12 8 88 48 22 ... 9.0 7.0 5.0 21 10 40.0 0.0 1.0 NaN 237.0
2014-08-01 2014-08-01 33 24 16 14 12 10 77 47 24 ... 9.0 9.0 9.0 14 5 26.0 0.0 6.0 NaN 172.0
2014-08-03 2014-08-03 29 24 19 19 16 11 94 65 35 ... 10.0 9.0 7.0 26 10 NaN 0.0 6.0 Rain 14.0
2014-08-04 2014-08-04 29 22 16 14 13 10 82 56 34 ... NaN NaN NaN 29 13 35.0 0.0 NaN NaN 14.0
2014-08-05 2014-08-05 29 23 17 16 13 12 82 56 35 ... 10.0 10.0 10.0 29 14 47.0 0.0 3.0 NaN 29.0
2014-08-06 2014-08-06 30 23 17 16 14 12 88 58 37 ... 10.0 10.0 9.0 29 14 43.0 0.0 4.0 NaN 43.0
2014-08-07 2014-08-07 29 23 16 17 15 13 100 72 37 ... 10.0 10.0 9.0 32 11 35.0 0.0 6.0 Rain-Thunderstorm 43.0
2014-08-09 2014-08-09 29 22 15 17 14 11 100 66 33 ... 10.0 10.0 10.0 21 10 32.0 0.0 3.0 NaN 347.0
2014-08-10 2014-08-10 28 23 18 17 16 15 83 65 45 ... 10.0 10.0 10.0 32 10 35.0 0.0 6.0 Rain 298.0
2014-08-11 2014-08-11 29 22 16 16 14 12 94 63 37 ... 10.0 10.0 10.0 14 8 NaN 0.0 4.0 Fog 7.0
2014-08-12 2014-08-12 31 23 17 17 15 13 88 59 33 ... 10.0 10.0 10.0 21 8 32.0 0.0 6.0 NaN 162.0
2014-08-13 2014-08-13 24 20 16 19 16 12 100 75 60 ... 10.0 9.0 5.0 26 11 NaN 0.0 7.0 Rain 234.0
2014-08-15 2014-08-15 26 21 16 20 18 15 100 85 61 ... 10.0 10.0 6.0 29 14 32.0 0.0 6.0 Rain-Thunderstorm 229.0
2014-08-21 2014-08-21 27 22 16 13 12 9 77 54 34 ... 10.0 10.0 10.0 14 8 32.0 0.0 7.0 Rain 245.0

32 rows × 23 columns

For my ideal holiday, I would have little rain, low humidity, high visibility, and temperatures between 20 and 25 degrees C. The temperature condition was satisfied for the 32 dates above. I will now disregard any of these 32 dates for which there was rain.

temp = summer[(summer['Mean TemperatureC'] <= 25) & (summer['Mean TemperatureC'] >= 20) & (summer['Precipitationmm'] == 0)]
temp

32 rows × 23 columns
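The table above lists 0.0 in 'Precipitationmm' for every row, yet the 'Events' column reports rain on several of the same dates. The sketch below shows one way to cross-check the two columns. The four rows are copied from the table above (three columns only); the variable names and the use of `between` and `str.contains` are my own illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Four rows copied from the table above, three columns only
rows = pd.DataFrame(
    {'Mean TemperatureC': [22, 20, 22, 22],
     'Precipitationmm': [0.0, 0.0, 0.0, 0.0],
     'Events': ['Rain', np.nan, 'Thunderstorm', 'Rain-Thunderstorm']},
    index=pd.to_datetime(['2014-06-02', '2014-06-03', '2014-06-06', '2014-06-07']))

in_range = rows['Mean TemperatureC'].between(20, 25)         # inclusive at both ends
dry = rows['Precipitationmm'] == 0
rain_event = rows['Events'].str.contains('Rain', na=False)   # missing events count as no rain

# Days recorded as dry although a rain event was logged
suspect = rows[in_range & dry & rain_event]
print(suspect.index.strftime('%Y-%m-%d').tolist())   # ['2014-06-02', '2014-06-07']
```

Each condition is a boolean Series, and `&` combines them row by row, just as in the filter used in this notebook.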

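In fact, no precipitation was recorded on any of these 32 dates! Note, however, that the 'Events' column reports rain or thunderstorms on several of them, so the precipitation values for Moscow may be incomplete. I will show graphically the temperature and precipitation between the earliest and latest of the above dates.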

temp = summer.loc[temp['Date'].min() : temp['Date'].max()]
temp[['Mean TemperatureC', 'Precipitationmm']].plot(grid=True, figsize=(10,5))
[Plot: 'Mean TemperatureC' and 'Precipitationmm' for Moscow, 2 June to 21 August 2014]

I will now include the visibility and humidity in this graph. Since no precipitation was recorded during this time period, I will leave it out.

temp[['Mean TemperatureC', 'Mean Humidity', 'Mean VisibilityKm']].plot(grid=True, figsize=(10,5))
[Plot: 'Mean TemperatureC', 'Mean Humidity' and 'Mean VisibilityKm' for Moscow, 2 June to 21 August 2014]

The graph looks misleading because the values for humidity are much higher than those for visibility and temperature. In an attempt to combat this, I will scale the humidity down.

temperature = temp['Mean TemperatureC']
humidity = temp['Mean Humidity']
humidity = (temperature.mean()*humidity)/humidity.mean()
temp['Scaled Mean Humidity'] = humidity
temp[['Mean TemperatureC', 'Scaled Mean Humidity', 'Mean VisibilityKm']].plot(grid=True, figsize=(10,5))
/usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
[Plot: 'Mean TemperatureC', 'Scaled Mean Humidity' and 'Mean VisibilityKm' for Moscow, 2 June to 21 August 2014]
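The SettingWithCopyWarning above appears because a new column is assigned to a dataframe that was itself obtained by slicing another one. One common way to avoid it is to take an explicit `.copy()` of the slice before adding columns. The sketch below uses made-up values and hypothetical names (`frame`, `subset`, `factor`). It also shows what the scaling does: multiplying humidity by the ratio of the two means gives the scaled series the same mean as the temperature.

```python
import pandas as pd

# Made-up values, only to illustrate the pattern
frame = pd.DataFrame({'Mean TemperatureC': [20, 22, 24],
                      'Mean Humidity': [40, 60, 80]},
                     index=pd.date_range('2014-07-01', periods=3))

# An explicit copy is independent of `frame`, so adding a column is safe
subset = frame.loc['2014-07-01':'2014-07-03'].copy()

# One constant factor: mean temperature divided by mean humidity
factor = subset['Mean TemperatureC'].mean() / subset['Mean Humidity'].mean()
subset['Scaled Mean Humidity'] = subset['Mean Humidity'] * factor

print(subset['Scaled Mean Humidity'].mean())     # matches the mean temperature
print('Scaled Mean Humidity' in frame.columns)   # False: the original is untouched
```

Because every value is multiplied by the same factor, the shape of the humidity curve is preserved; only its vertical scale changes.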

It now appears that visibility does not vary enough to affect my decision. I want to choose the two-week period with the most days in my chosen temperature range and with low humidity. The end of July and the beginning of August look best, so I will restrict my consideration to the period from 15 July to 15 August. I will now show the maximum and minimum humidity and temperature.

# Restrict to 15 July - 15 August; .copy() lets us add columns without a SettingWithCopyWarning
temp = temp.loc[datetime(2014,7,15) : datetime(2014,8,15)].copy()
temperature = temp['Mean TemperatureC']
humidity = temp['Mean Humidity']
# Use one constant factor (ratio of the two means), as for 'Scaled Mean Humidity',
# rather than dividing by the daily humidity values
scale = temperature.mean() / humidity.mean()
temp['Scaled Min Humidity'] = scale * temp['Min Humidity']
temp['Scaled Max Humidity'] = scale * temp['Max Humidity']

temp[['Min TemperatureC', 'Scaled Max Humidity', 'Scaled Min Humidity',  'Max TemperatureC', 'Mean TemperatureC']].plot(grid=True, figsize=(10,5))
[Plot: 'Min TemperatureC', 'Scaled Max Humidity', 'Scaled Min Humidity', 'Max TemperatureC' and 'Mean TemperatureC' for Moscow, 15 July to 15 August 2014]

With this amount of variation, it is difficult to choose the best 2 weeks.
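When a graph does not make the choice obvious, one possible approach (not the method used in this notebook) is to score every 14-day window numerically. The sketch below uses made-up values and hypothetical names. It counts, for each window, how many days fall in the 20 to 25 degrees C range and also computes the window's mean humidity.

```python
import pandas as pd

# Made-up daily values for 20 days, only to illustrate the technique
days = pd.date_range('2014-07-01', periods=20)
demo = pd.DataFrame({'Mean TemperatureC': [18] * 3 + [22] * 14 + [27] * 3,
                     'Mean Humidity': [70] * 3 + [50] * 14 + [65] * 3},
                    index=days)

# 1 for days inside the preferred range, 0 otherwise
in_range = demo['Mean TemperatureC'].between(20, 25).astype(int)

# Each value describes the 14-day window ending on that day
days_in_range = in_range.rolling(14).sum()
mean_humidity = demo['Mean Humidity'].rolling(14).mean()

best_end = days_in_range.idxmax()                 # last day of the best window
best_start = best_end - pd.Timedelta(days=13)     # first day of that window
print(best_start.date(), best_end.date(), mean_humidity[best_end])
```

Ties between windows with the same count could then be broken by picking the one with the lowest mean humidity.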

Conclusion

Using the 2014 data for Moscow and my personal weather preferences, I have concluded that the best time to visit Moscow is between late July and early August. However, within this period humidity and temperature did not vary enough to single out a specific two-week period.