
Problem 1. Using the August Chicago data, test the hypothesis that the means of hourly temperature for the first half of the month are equal to the second half of the month. Use 00Z on Aug 15 as the start of the second half of the month. Compute p values, assuming the Gaussian distribution is an adequate approximation to the null distribution of the test statistic.


In [1]:

    import pandas as pd
    import scipy.stats as st

    df = pd.read_csv('chicago_hourly_aug_2015.csv', header=6)
    firstHalf = df['DryBulbCelsius'][df['Date'] < 20150815]
    secondHalf = df['DryBulbCelsius'][df['Date'] >= 20150815]

    # use two-sample t-test: http://stattrek.com/hypothesis-test/difference-in-means.aspx
    stat, p = st.ttest_ind(firstHalf, secondHalf)
    print("p-value is %e" % p)


p-value is 6.833892e-25

The p-value (about 6.8e-25) is far below any conventional significance level, so we reject the null hypothesis that the two halves of the month have equal mean hourly temperature.
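The problem statement asks for a p-value under a Gaussian approximation to the null distribution, while the cell above uses Student's t-test; with several hundred hourly observations per half the two should give nearly the same answer. A minimal sketch of the Gaussian (z) version is shown here, assuming independent samples with possibly unequal variances. The helper name `two_sample_z_test` and the synthetic arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import scipy.stats as st

def two_sample_z_test(a, b):
    """Two-sided z-test for equal means (hypothetical helper, not from the notebook)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standard error of the difference in means, from the sample variances (ddof=1).
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    # Two-sided p-value from the standard normal survival function.
    p = 2.0 * st.norm.sf(abs(z))
    return z, p

# Synthetic stand-in data (made up for illustration; not the Chicago observations).
rng = np.random.default_rng(0)
first = rng.normal(loc=24.0, scale=4.0, size=400)
second = rng.normal(loc=21.0, scale=4.0, size=400)

z, p = two_sample_z_test(first, second)
print("z = %.3f, p-value = %e" % (z, p))
```

With the notebook's own variables, the call would be `two_sample_z_test(firstHalf, secondHalf)`.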


Problem 2. (a) Using the same dataset, calculate the correlation coefficient between hourly temperature and dewpoint for August 2015.


In [2]:

    df1 = pd.DataFrame({'Temperature': df['DryBulbCelsius'],
                        'Dewpoint': df['DewPointCelsius']}).dropna()
    df1.corr()


| | Dewpoint | Temperature |
|---|---|---|
| Dewpoint | 1.000000 | 0.401586 |
| Temperature | 0.401586 | 1.000000 |

(b) Is this correlation statistically significant at the 99% level?


In [4]:

    df1['data1'] = df1['Temperature'] > df1['Temperature'].median()
    df1['data2'] = df1['Dewpoint'] > df1['Dewpoint'].median()
    out1 = pd.crosstab(df1['data1'], df1['data2'])
    print(out1)
    p_val = st.chi2_contingency(out1)[1]
    print("p-value is %f" % p_val)


data2 False True
data1
False 281 185
True 183 220
p-value is 0.000016

The p-value is less than 0.01, so the association between temperature and dewpoint is statistically significant at the 99% level.
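The chi-square test above checks whether above-median temperature and above-median dewpoint are independent; it does not use the correlation coefficient itself. A more direct check, sketched here, uses `scipy.stats.pearsonr`, which returns the coefficient together with a two-sided p-value for the null hypothesis of zero correlation. The sketch assumes independent, roughly normal samples, and the data is synthetic (made up for illustration), so the printed numbers are not the notebook's results.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in data (not the Chicago observations).
rng = np.random.default_rng(1)
n = 800
temperature = rng.normal(22.0, 4.0, size=n)
dewpoint = 0.4 * (temperature - 22.0) + rng.normal(15.0, 3.5, size=n)

# Pearson correlation and its two-sided p-value under H0: no correlation.
r, p_val = st.pearsonr(temperature, dewpoint)

# Equivalent textbook form: t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
t_stat = r * np.sqrt((n - 2) / (1.0 - r**2))
p_from_t = 2.0 * st.t.sf(abs(t_stat), df=n - 2)

print("r = %.4f, p-value = %e" % (r, p_val))
print("significant at 99%% level: %s" % (p_val < 0.01))
```

With the notebook's frame, the corresponding call would be `st.pearsonr(df1['Temperature'], df1['Dewpoint'])`.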


(c) Repeat (a) and (b) for daytime temperatures only (6 AM-6PM local time). Does your conclusion change?


In [62]:

    data_set = df[(df['Time'] >= 600) & (df['Time'] <= 1800)]

    # Part a
    df_new = pd.DataFrame({'Temperature': data_set['DryBulbCelsius'],
                           'Dewpoint': data_set['DewPointCelsius']}).dropna()
    print(df_new.corr())

    # Part b
    df_new['data1'] = df_new['Temperature'] > df_new['Temperature'].median()
    df_new['data2'] = df_new['Dewpoint'] > df_new['Dewpoint'].median()
    out2 = pd.crosstab(df_new['data1'], df_new['data2'])
    p_val = st.chi2_contingency(out2)[1]
    print("p-value is %f" % p_val)


Dewpoint Temperature
Dewpoint 1.000000 0.359294
Temperature 0.359294 1.000000
p-value is 0.142380

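The p-value is larger than 0.01, so the daytime association is not statistically significant at the 99% level. The conclusion therefore changes when only daytime hours are used: the correlation coefficient drops from 0.40 to 0.36 and the test no longer rejects independence.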
