Author: Steve Nesbitt

ATMS 391: Geophysical Data Analysis

Homework 9

Problem 1. Using the August Chicago data, test the hypothesis that the mean hourly temperature for the first half of the month equals that for the second half of the month. Use 00Z on Aug 15 as the start of the second half of the month. Compute the p-value, assuming the Gaussian distribution is an adequate approximation to the null distribution of the test statistic.

In [1]:
import pandas as pd
import scipy.stats as st

df = pd.read_csv('chicago_hourly_aug_2015.csv', header=6)

# Split at Aug 15 (Date is compared as an integer, YYYYMMDD)
firstHalf = df['DryBulbCelsius'][df['Date'] < 20150815]
secondHalf = df['DryBulbCelsius'][df['Date'] >= 20150815]

# Two-sample t-test: http://stattrek.com/hypothesis-test/difference-in-means.aspx
# With this many hourly observations the t distribution is close to Gaussian.
stat, p = st.ttest_ind(firstHalf, secondHalf)
print("p-value is %e" % p)

p-value is 6.833892e-25

The p-value is far below any conventional significance level (for example 0.05 or 0.01), so we can reject the null hypothesis of equal means.
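Since the problem asks for a Gaussian null distribution, the test statistic can also be written out by hand. The following is a minimal sketch of a two-sample z-test, not the method used in the cell above: the function name `two_sample_z_test` and the temperature values are hypothetical, chosen only for illustration, and the samples are treated as independent.

```python
import math
from statistics import mean, variance


def two_sample_z_test(a, b):
    """Two-sided test of equal means using a Gaussian null distribution."""
    n_a, n_b = len(a), len(b)
    # Standard error of the difference in means (sample variances, unequal allowed)
    se = math.sqrt(variance(a) / n_a + variance(b) / n_b)
    z = (mean(a) - mean(b)) / se
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical temperatures (deg C), for illustration only
first = [24.1, 25.3, 26.0, 23.8, 24.9, 25.5, 26.2, 24.4]
second = [21.0, 22.4, 20.8, 21.9, 22.7, 21.3, 20.5, 22.0]
z, p = two_sample_z_test(first, second)
print("z = %.2f, p-value is %e" % (z, p))
```

The Gaussian approximation is only reasonable for large samples such as the hourly data in this problem; the eight-value lists here are just small enough to read.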

Problem 2. (a) Using the same dataset, calculate the correlation coefficient between hourly temperature and dewpoint for August 2015.

In [2]:
df1 = pd.DataFrame({'Temperature': df['DryBulbCelsius'],
                    'Dewpoint': df['DewPointCelsius']}).dropna()
df1.corr()

             Dewpoint  Temperature
Dewpoint     1.000000     0.401586
Temperature  0.401586     1.000000
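By default `DataFrame.corr()` returns Pearson correlation coefficients, so the off-diagonal entry is the quantity asked for in (a). As a rough sketch of what that number represents, the coefficient can be computed from deviations about the means; the helper `pearson_r` and the sample values below are hypothetical and are not taken from the Chicago dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical temperature / dewpoint pairs (deg C), for illustration only
temp = [20.0, 22.5, 25.0, 27.5, 30.0]
dewp = [15.0, 16.0, 18.5, 17.0, 19.0]
print("r = %.6f" % pearson_r(temp, dewp))
```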

(b) Is this correlation statistically significant at the 99% level?

In [4]:
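# Median split: flag each observation as above its median (True) or not (False)
df1['data1'] = df1['Temperature'] > df1['Temperature'].median()
df1['data2'] = df1['Dewpoint'] > df1['Dewpoint'].median()

# 2x2 contingency table of the two flags, then a chi-square test of independence
out1 = pd.crosstab(df1['data1'], df1['data2'])
print(out1)
p_val = st.chi2_contingency(out1)[1]  # element [1] is the p-value
print("p-value is %f" % p_val)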

data2  False  True
data1
False    281   185
True     183   220
p-value is 0.000016

The p-value is less than 0.01, so the relationship between temperature and dewpoint is statistically significant at the 99% level.
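For a 2x2 table, `st.chi2_contingency` applies Yates' continuity correction by default and uses one degree of freedom. The sketch below spells out that calculation with the standard library only, using the counts from the crosstab printed above; the function name `chi2_2x2_yates` is hypothetical, and the code is an illustration of the technique rather than a copy of SciPy's implementation.

```python
import math


def chi2_2x2_yates(table):
    """Chi-square test of independence for a 2x2 table with Yates' correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            # Yates' continuity correction: shrink |O - E| by 0.5 (not below 0)
            diff = max(abs(observed - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # With 1 degree of freedom, P(chi2 >= x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Counts copied from the crosstab printed in part (b)
chi2, p = chi2_2x2_yates([[281, 185], [183, 220]])
print("chi2 = %.2f, p-value is %f" % (chi2, p))
```

Note that this tests whether the two median-split flags are independent, which is related to, but not the same as, a test on the correlation coefficient itself.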

(c) Repeat (a) and (b) for daytime temperatures only (6 AM to 6 PM local time). Does your conclusion change?

In [62]:
# Keep daytime observations only (Time is assumed to be HHMM local time)
data_set = df[(df['Time'] >= 600) & (df['Time'] <= 1800)]

# Part a
df_new = pd.DataFrame({'Temperature': data_set['DryBulbCelsius'],
                       'Dewpoint': data_set['DewPointCelsius']}).dropna()
print(df_new.corr())

# Part b
df_new['data1'] = df_new['Temperature'] > df_new['Temperature'].median()
df_new['data2'] = df_new['Dewpoint'] > df_new['Dewpoint'].median()
out2 = pd.crosstab(df_new['data1'], df_new['data2'])
p_val = st.chi2_contingency(out2)[1]
print("p-value is %f" % p_val)

             Dewpoint  Temperature
Dewpoint     1.000000     0.359294
Temperature  0.359294     1.000000
p-value is 0.142380

The p-value is larger than 0.01, so the relationship is not statistically significant at the 99% level. The conclusion therefore changes when only daytime observations are used.
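An alternative to the median-split chi-square test is to test the correlation coefficient directly. The sketch below uses the Fisher z-transform with a Gaussian null distribution. It is offered as a cross-check under stated assumptions, not as the method used in this answer key: the function name `corr_p_value_fisher` is hypothetical, the observations are treated as independent, r is the all-hours value from part (a), and n = 869 is the sum of the four counts in the part (b) crosstab. The daytime sample size is not printed, so that case is not evaluated here.

```python
import math


def corr_p_value_fisher(r, n):
    """Two-sided p-value for H0: rho = 0 via the Fisher z-transform."""
    # Under H0, atanh(r) is approximately normal with standard error 1/sqrt(n - 3)
    z = math.atanh(r) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# r from the part (a) output; n = 281 + 185 + 183 + 220 = 869 from the part (b) crosstab
z, p = corr_p_value_fisher(0.401586, 869)
print("z = %.2f, p-value is %e" % (z, p))
print("significant at the 99% level:", p < 0.01)
```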