Jupyter notebook Staycation_Project.ipynb
Project 2: Holiday weather
by Jake Stokes, 2nd of July 2016.
This is the project for Week 2 of The Open University's Learn to code for Data Analysis course.
The purpose of the project is to examine historic weather data from the Weather Underground for London to try to predict the best dates this year to take a nice warm staycation. My aim will be to:
obtain weather data for the year of 2015
clean the obtained data
run some basic data analysis techniques on the data set to:
find two weeks with the highest mean temperature; and,
avoid precipitation where possible.
The weather this year may of course be very different from the weather of 2015, but it should give me some indication of when would be a good time to take a break.
Getting the data
The weather data was obtained from the Weather Underground website, using the dates 1st Jan 2015 to 31st Dec 2015, and saved as 'London_2015.csv'.
To obtain the data you must first enter London, United Kingdom as the location, and hit submit. On the following page there are some tabs - select 'custom', and from here you can enter the dates. The option to see the data in a CSV format is at the very bottom of the page underneath the data. This can be right-click-saved, and renamed from a .html to a .csv ready for use.
If you don't have the 'London_2015.csv' file, you can obtain the data as follows. Right-click on the following URL and choose 'Open Link in New Window' (or similar, depending on your browser):
http://www.wunderground.com/history
When the new page opens start typing 'London' in the 'Location' input box and when the pop up menu comes up with the option 'London, United Kingdom' select it and then click on 'Submit'.
Once ready, as shown below, I have loaded the dataframe, ensuring that any extra spaces at the start of values are removed. I have also imported the whole pandas module for data analytics.
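A minimal sketch of what this loading step might look like. The variable name london and the tiny inline sample standing in for 'London_2015.csv' are assumptions for illustration, not taken from the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for 'London_2015.csv'.
# Note the space after each comma, as in the downloaded file.
sample_csv = io.StringIO(
    "GMT, Mean TemperatureC, Precipitationmm, Events\n"
    "2015-1-1, 8, 0.51, Rain\n"
    "2015-1-2, 7, 0.00, Rain\n"
)

# skipinitialspace=True drops the blanks that follow each delimiter,
# so neither column names nor values start with a stray space.
london = pd.read_csv(sample_csv, skipinitialspace=True)

# With the real file this would presumably be:
# london = pd.read_csv('London_2015.csv', skipinitialspace=True)
print(london.columns.tolist())
```

Without skipinitialspace=True the second column would be named ' Mean TemperatureC', with a leading blank, which makes later column lookups easy to get wrong.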
Cleaning the data
First I will display some of the data to see if there are any obvious issues.
First we need to clean up the data. I'm not going to make use of 'WindDirDegrees' in my analysis, but you might in yours so we'll rename 'WindDirDegrees<br />' to 'WindDirDegrees'.
GMT | Max TemperatureC | Mean TemperatureC | Min TemperatureC | Dew PointC | MeanDew PointC | Min DewpointC | Max Humidity | Mean Humidity | Min Humidity | ... | Max VisibilityKm | Mean VisibilityKm | Min VisibilitykM | Max Wind SpeedKm/h | Mean Wind SpeedKm/h | Max Gust SpeedKm/h | Precipitationmm | CloudCover | Events | WindDirDegrees<br /> | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-1-1 | 12 | 8 | 4 | 11 | 7 | 3 | 94 | 88 | 78 | ... | 18 | 9 | 5 | 39 | 21 | 60 | 0.51 | 7 | Rain | 209<br /> |
1 | 2015-1-2 | 11 | 7 | 4 | 12 | 4 | 0 | 94 | 70 | 41 | ... | 31 | 16 | 3 | 35 | 24 | 50 | 0.00 | 2 | Rain | 258<br /> |
2 | 2015-1-3 | 6 | 4 | 2 | 6 | 3 | 1 | 100 | 91 | 70 | ... | 31 | 10 | 2 | 19 | 10 | NaN | 7.11 | 5 | Rain | 19<br /> |
3 | 2015-1-4 | 3 | 1 | -2 | 3 | 1 | -2 | 100 | 97 | 90 | ... | 13 | 4 | 0 | 13 | 6 | 27 | 0.00 | 6 | Fog | 225<br /> |
4 | 2015-1-5 | 10 | 6 | 2 | 8 | 5 | 2 | 100 | 86 | 67 | ... | 31 | 10 | 3 | 19 | 10 | NaN | 0.25 | 6 | NaN | 199<br /> |
5 rows × 23 columns
There are some immediately obvious issues with the data:
the final column, 'WindDirDegrees<br />', and its contents have retained the HTML line breaks on the end of the data line; this means that the final column will be an object dtype as opposed to int64 as intended (as shown below)
the GMT column has the dtype 'object' as opposed to 'datetime' (as shown below)
there are various NaN values in the results
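The dtype issues in this list can be checked through the dataframe's dtypes attribute. The sketch below uses a small made-up sample and an assumed dataframe name (london); with the pandas versions current in 2016 the two text columns would be reported as 'object'.

```python
import io
import pandas as pd

# Hypothetical sample that reproduces the trailing '<br />' problem.
sample_csv = io.StringIO(
    "GMT,Mean TemperatureC,WindDirDegrees<br />\n"
    "2015-1-1,8,209<br />\n"
    "2015-1-2,7,258<br />\n"
)
london = pd.read_csv(sample_csv)

# One dtype per column: the temperature is read as an integer,
# while the dates and the wind direction are read as text.
print(london.dtypes)
```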
First I will rename the column to remove the html line breaks:
This next one is just me being fussy about a consistent title format: 'Min VisibilitykM' becomes 'Min VisibilityKm'.
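Both renames could be written as a single rename() call, sketched here on a one-row stand-in. The dataframe name london is an assumption, and the second mapping ('Min VisibilitykM' to 'Min VisibilityKm') is inferred from the column titles in the two output tables.

```python
import pandas as pd

# Hypothetical one-row stand-in keeping only the two affected columns.
london = pd.DataFrame({
    'Min VisibilitykM': [5],
    'WindDirDegrees<br />': ['209<br />'],
})

# rename() returns a new dataframe; columns not in the mapping are left alone.
london = london.rename(columns={
    'WindDirDegrees<br />': 'WindDirDegrees',
    'Min VisibilitykM': 'Min VisibilityKm',
})
print(london.columns.tolist())
```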
Now to remove the <br /> HTML line breaks from the values in the 'WindDirDegrees' column:
Here I change the values in the 'WindDirDegrees' column to the int64 dtype:
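A sketch of these two steps on a few sample values, assuming a dataframe called london. The original notebook may have used a different string method (str.rstrip is a common choice); str.replace with regex=False is used here because it removes the literal text '<br />' and nothing else.

```python
import pandas as pd

# Hypothetical stand-in: wind directions still carrying the HTML line break.
london = pd.DataFrame({'WindDirDegrees': ['209<br />', '258<br />', '19<br />']})

# Step 1: remove the literal '<br />' from every value (plain text match, not a regex).
london['WindDirDegrees'] = london['WindDirDegrees'].str.replace('<br />', '', regex=False)

# Step 2: the values are now digit strings, so they can be converted to 64-bit integers.
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')
print(london['WindDirDegrees'].dtype)
```

The order matters: astype('int64') would raise an error if any value still contained the '<br />' text.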
Finally, I change the values in the 'GMT' column to the datetime64 dtype:
I also need to change the index from the default to the datetime64 values in the 'GMT' column so that it is easier to pull out rows between particular dates and display more meaningful graphs:
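The date conversion and re-indexing might look like the following sketch. The dataframe name london and the three sample rows are assumptions; assigning the column to the index keeps 'GMT' as an ordinary column too, which matches the later output table where the dates appear twice.

```python
import pandas as pd

# Hypothetical stand-in with dates written the way the CSV has them.
london = pd.DataFrame({
    'GMT': ['2015-1-1', '2015-1-2', '2015-1-3'],
    'Mean TemperatureC': [8, 7, 4],
})

# Convert the date strings to datetime64 values.
london['GMT'] = pd.to_datetime(london['GMT'])

# Use the dates as the row index, keeping the 'GMT' column as well.
london.index = london['GMT']

# With a datetime index, rows between two dates can be selected by label;
# both end dates are included in the result.
first_two = london.loc['2015-01-01':'2015-01-02']
print(first_two)
```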
Now I need to address the 'NaN' values in the data and then decide what to do with them. The intentions for this project are to use the 'Mean TemperatureC' and 'Precipitationmm' column values to establish the best dates for the staycation, so first I will check if there are any NaN values in these columns:
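One way to run this check is to count the missing values per column, as in the sketch below. The dataframe name london and the sample rows are illustrative assumptions; the third column is included only to show what a non-zero count looks like.

```python
import pandas as pd

# Hypothetical stand-in: two complete columns and one with a missing value.
london = pd.DataFrame({
    'Mean TemperatureC': [8, 7, 4],
    'Precipitationmm': [0.51, 0.00, 7.11],
    'Max Gust SpeedKm/h': [60, 50, float('nan')],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
columns_used = ['Mean TemperatureC', 'Precipitationmm']
missing_counts = london[columns_used].isnull().sum()
print(missing_counts)
```

A count of zero for both columns means no rows need to be dropped or filled before the analysis.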
Considering that there are no NaN values in the data I will actually be utilising for this project, I am able to ignore the NaN values in the dataframe for this project.
Finding a summer break
According to meteorologists, summer extends for the whole months of June, July, and August in the northern hemisphere and the whole months of December, January, and February in the southern hemisphere. I'm in the northern hemisphere, and today is July 2nd 2016, so I'm going to create a dataframe that holds just the rest of the summer, from tomorrow's date (July 3rd) to the end of August:
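A sketch of the date slice, assuming the dataframe is indexed by date as described earlier. The names london and summer and the constant placeholder temperature are assumptions made only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in: one row per day of 2015 with a placeholder temperature.
dates = pd.date_range('2015-01-01', '2015-12-31', freq='D')
london = pd.DataFrame({'Mean TemperatureC': 15}, index=dates)

# Label-based slicing on a datetime index includes both end dates.
summer = london.loc['2015-07-03':'2015-08-31']
print(len(summer), summer.index.min(), summer.index.max())
```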
I now look for the days with warm temperatures.
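Selecting warm days comes down to a boolean filter on the temperature column. In this sketch the 20 Celsius threshold follows the summary given after the table, while the dataframe name summer and the four sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the summer dataframe (temperatures are illustrative).
summer = pd.DataFrame(
    {'Mean TemperatureC': [20, 22, 18, 17]},
    index=pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06']),
)

# The comparison yields one True/False per row; indexing with it keeps the True rows.
warm_days = summer[summer['Mean TemperatureC'] >= 20]
print(warm_days)
```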
GMT | Max TemperatureC | Mean TemperatureC | Min TemperatureC | Dew PointC | MeanDew PointC | Min DewpointC | Max Humidity | Mean Humidity | Min Humidity | ... | Max VisibilityKm | Mean VisibilityKm | Min VisibilityKm | Max Wind SpeedKm/h | Mean Wind SpeedKm/h | Max Gust SpeedKm/h | Precipitationmm | CloudCover | Events | WindDirDegrees | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GMT | |||||||||||||||||||||
2015-07-03 | 2015-07-03 | 27 | 20 | 13 | 15 | 11 | 8 | 83 | 56 | 23 | ... | 31 | 21 | 9 | 27 | 10 | NaN | 8.89 | 2 | Rain | 83 |
2015-07-04 | 2015-07-04 | 27 | 22 | 17 | 18 | 14 | 10 | 100 | 67 | 33 | ... | 31 | 8 | 2 | 27 | 14 | 42 | 0.00 | 2 | Rain-Thunderstorm | 220 |
2015-07-10 | 2015-07-10 | 27 | 20 | 13 | 11 | 7 | 2 | 82 | 45 | 11 | ... | 31 | 23 | 10 | 23 | 11 | 39 | 0.00 | NaN | NaN | 182 |
2015-07-11 | 2015-07-11 | 26 | 20 | 14 | 14 | 10 | 8 | 77 | 49 | 24 | ... | 31 | 19 | 10 | 27 | 13 | 42 | 0.00 | 2 | Rain | 274 |
2015-07-14 | 2015-07-14 | 23 | 20 | 17 | 18 | 15 | 14 | 100 | 78 | 51 | ... | 31 | 13 | 3 | 24 | 18 | NaN | 2.03 | 6 | Rain | 252 |
2015-07-16 | 2015-07-16 | 25 | 20 | 14 | 15 | 13 | 9 | 88 | 65 | 44 | ... | 26 | 14 | 6 | 24 | 13 | NaN | 7.11 | 4 | Rain-Thunderstorm | 90 |
2015-07-17 | 2015-07-17 | 25 | 20 | 14 | 17 | 13 | 9 | 94 | 67 | 35 | ... | 23 | 11 | 6 | 35 | 16 | NaN | 0.25 | 3 | Rain | 242 |
2015-08-03 | 2015-08-03 | 25 | 20 | 16 | 16 | 13 | 8 | 83 | 64 | 40 | ... | 31 | 14 | 10 | 34 | 18 | 47 | 0.00 | 3 | Rain | 200 |
2015-08-08 | 2015-08-08 | 26 | 20 | 14 | 15 | 13 | 10 | 94 | 62 | 28 | ... | 31 | 15 | 10 | 23 | 10 | NaN | 0.00 | 2 | NaN | 148 |
2015-08-21 | 2015-08-21 | 26 | 22 | 17 | 17 | 16 | 13 | 88 | 70 | 39 | ... | 31 | 13 | 10 | 26 | 18 | 37 | 0.00 | 4 | NaN | 199 |
2015-08-22 | 2015-08-22 | 31 | 23 | 15 | 17 | 14 | 12 | 94 | 63 | 27 | ... | 31 | 16 | 7 | 26 | 10 | NaN | 0.00 | 4 | NaN | 114 |
11 rows × 23 columns
Summer 2015 had a total of 11 days with a mean temperature of 20 Celsius or higher.
From here it would be best to see a graph of the temperature for a better look at the trends in temperature, so next I tell Jupyter to display any graph created inside this notebook:
Now to plot the 'Mean TemperatureC' for the summer:
The graph shows that the mean temperature was generally in the 18°C to 20°C range, with the exceptions of the final weeks of both July and August.
To get a better idea of when would be best for a staycation I will also put precipitation onto the graph:
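Plotting both columns on one set of axes could look like the sketch below. The dataframe name summer and the five sample rows are assumptions; inside Jupyter the %matplotlib inline magic would display the figure, whereas here a non-interactive backend is selected so the code also runs as a plain script.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical stand-in for the summer dataframe (values are illustrative).
summer = pd.DataFrame(
    {
        'Mean TemperatureC': [20, 22, 18, 17, 19],
        'Precipitationmm': [8.9, 0.0, 0.0, 2.0, 0.3],
    },
    index=pd.date_range('2015-07-03', periods=5, freq='D'),
)

# Selecting two columns and calling plot() draws one line per column,
# with the dates from the index along the x-axis.
ax = summer[['Mean TemperatureC', 'Precipitationmm']].plot(grid=True, figsize=(10, 5))
```

Both series share one y-axis here, so degrees Celsius and millimetres are drawn on the same scale; that is fine for spotting wet spells but not for reading exact values.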
It seems that there were days of high precipitation in the last week of both July and August. Both months seem very similar at face value, so for the best chance of an enjoyable staycation, I will look at the mean of the 'Mean TemperatureC' for each month to see if there is a statistically better option:
The mean temperatures for the two months vary by only 0.3°C, which is not a noticeable difference, so I have decided to also examine the mean precipitation of both months to see if the result makes a particular month a clearer best choice:
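Both monthly comparisons can be computed in one step by grouping the rows by calendar month. This is a sketch on made-up numbers, with summer as an assumed dataframe name; the original notebook may instead have sliced each month separately and called mean() on each slice.

```python
import pandas as pd

# Hypothetical stand-in: two days from each month (values are illustrative).
summer = pd.DataFrame(
    {
        'Mean TemperatureC': [20, 18, 19, 17],
        'Precipitationmm': [0.0, 1.0, 2.0, 4.0],
    },
    index=pd.to_datetime(['2015-07-03', '2015-07-04', '2015-08-01', '2015-08-02']),
)

# Group rows by the month number of the index (7 = July, 8 = August)
# and average each column within each month.
monthly = summer.groupby(summer.index.month)[['Mean TemperatureC', 'Precipitationmm']].mean()

# Difference in mean daily precipitation between August and July.
precip_gap = monthly.loc[8, 'Precipitationmm'] - monthly.loc[7, 'Precipitationmm']
print(monthly)
print(precip_gap)
```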
July had 0.9mm less precipitation on average per day than August.
Conclusions
The graphs have shown that July and August had very similar weather throughout. Ultimately, July had both a higher average mean temperature across the month and a lower precipitation level, so I will take my two-week staycation this month, starting immediately! (That way I can get more practice with Python.)
Of course these results are no guarantee that the weather pattern will repeat itself in future years. To make a sensible prediction I would need to analyse the summers for many more years. I am currently studying to expand my skills to be able to achieve this in future projects.