 CoCalc Public FilesWeek 9 / hw8-turnin.ipynb
Author: Michelle Arroyo
Views : 67
Compute Environment: Ubuntu 18.04 (Deprecated)

# Homework 9

Name: Michelle Arroyo

I collaborated with: Andrew Awadallah and Anthony Reyna

In :
# 1 and 3
from IPython.display import Image
Image(filename='Question#2.png', height=900)
#The answer in purple is for number 1 and the answer in black is for question 3. In :
# 2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt # Adds sub-library to plot line
import scipy.stats as stats # Adds sub-library to calculate regression line and correlation coefficient

import pandas as pd
NEN

p=sns.lmplot('Ice Breakup Day of Year','Years Since 1900', data=NEN,fit_reg=False)
p.set(title='Scatter Plot of Ice Breakup Day of Year  vs Years Since 1900',xlabel='Ice Breakup Day of Year',ylabel='Years Since 1900');

#We should look back at our plot from Lab 8 that showed the distributions of X and Y
sns.jointplot('Ice Breakup Day of Year', 'Years Since 1900', data=NEN)


<seaborn.axisgrid.JointGrid at 0x7f79ed6e4f98>  In :
rho = stats.pearsonr(NEN['Ice Breakup Day of Year'],NEN['Years Since 1900']) #we imported scipy.stats and called it stats, so in order to use scipy.stats.pearsonr we should say stats.pearsonr
print("the pearson correlation coefficient is",rho)

regr=stats.linregress(NEN['Ice Breakup Day of Year'],NEN['Years Since 1900'])
slopeobs=regr.slope
regr

#Pick min and max by finding what your min and max x values are in the dataset
datamin = np.min(NEN['Ice Breakup Day of Year'])
datamax = np.max(NEN['Ice Breakup Day of Year'])
print(datamin,datamax)

X_plot = np.linspace(datamin,datamax,100)
Y_plot = regr.slope*X_plot + regr.intercept #you can get the slope and intercept from your regression result using this syntax

p=sns.lmplot('Ice Breakup Day of Year','Years Since 1900',data=NEN,fit_reg=False)
p.set(title='Scatter Plot of Ice Breakup Day of Year vs Years Since 1900 with Regression Line',xlabel='Ice Breakup Day of Year',ylabel='Years Since 1900')
plt.plot(X_plot,Y_plot); #these are collections of X and Y values that will be plotted as (X,Y) pairs

the pearson correlation coefficient is -0.38614431815900013 104.014 140.487 In :
#2b
ntimes=10000
results=np.zeros(ntimes)
for i in range(ntimes):
sim=NEN.sample(len(NEN),replace=True)
results[i]=stats.linregress(sim['Ice Breakup Day of Year'],sim['Years Since 1900']).slope

results.sort()
CIlower=2*slopeobs-results # Calculate lower limit of 95% confidence interval using index at top of middle 95% of data (M_upper)
CIupper=2*slopeobs-results # Calculate upper limit of 95% confidence interval using index at bottom of middle 95% of data (M_lower)
p=sns.distplot(results,kde=False)
p.set(xlabel = 'Regression slope',ylabel = 'Count',title = 'Simulations for Confidence Interval')
p.axvline(CIlower,color='green') # Lower CI limit
p.axvline(CIupper,color='green') # Upper CI limit
p.axvline(slopeobs,color='red'); # Observed value
print("the effect size is", (CIlower, CIupper))
print("the observe slope is",slopeobs)

the effect size is (-2.963697116424238, -0.5966914414737978) the observe slope is -1.766460334550153 In :
# 2c
#based on the data above, it shows that there is a negative trend in the days when the ice melts, with the line of best fit having a slope of -1.766. Also, when the effect size is taken into account, it shows that 99% of the data is from about -.55 to -3.0, and since it does not include 0 it suggests that there is a difference.

In :
# 3
#The answer to this quesion is in question one's cell.

In :
# 4
# The article mentions that an increase in sample size one will have a higher power. Also, the article mentioned that P-values less than 0.05 and happen very often by the relative extreme. One difference though is that although the sample size is large in the article, the sample size we use in lab is larger, at 10000. Having a small sample size can give you a lower p value, but the power value will be low as well. In lab, we were told to increase sample size to see how that many affect the power, and by increasing sample size and effect size it increased power significantly.

In :
#5
#Three things that effect the power of a study, based off of figure 6 in the study is the sample size of the study, the population itself, and the alpha value of the study.

In :
#6 Since we were not taught how to use the power analysis for anything else other than a 2 x 2 table this is all I could think of to do. I used the website to figure out the rest.
plant_eph = [34,51,20,47,22,19,24,18,33,45,36,42,30,25,44,33,47,35,51,15,24,13,25,32,49,28,37,17,13,32,24,18,22]
plant_perm = [31,40,43,37,23,45,53,46,44,25,47,34,41,31,37,34,42,37,14,29,18,15,14,28]

eph_plant=np.median(plant_eph)
print(eph_plant)
perm_plant=np.median(plant_perm)
print(perm_plant)

obs1= np.array([[eph_plant,perm_plant]])
#len(plant_eph)
len(plant_perm)

from IPython.display import Image
Image(filename='2020-03-13 (2).png', height=400)
#the statistical power is 22.2%

30.0 35.5 In :
#7
#We might've gotten a significant p-value for one dataset and not the other because the data sets could have contained different populations and yield different powers, thus not allowing us to catch the difference if there was one. Another reason, of us catching a difference (Significant p-value), we could have had a small sample size and been hit with a "Winner's Curse".

In :
#8
plant_eph = [34,51,20,47,22,19,24,18,33,45,36,42,30,25,44,33,47,35,51,15,24,13,25,32,49,28,37,17,13,32,24,18,22]
plant_perm = [31,40,43,37,23,45,53,46,44,25,47,34,41,31,37,34,42,37,14,29,18,15,14,28]

eph_plant=np.median(plant_eph)
print(eph_plant)
perm_plant=np.median(plant_perm)
print(perm_plant)

obs1= np.array([[eph_plant,perm_plant]])
#len(plant_eph)
len(plant_perm)

from IPython.display import Image
Image(filename='2020-03-13 (4).png', height=400)
#the statistical power is 80.2%
#The minimum sample size for each group would be n=100 in order to have at least 80% power for the effect size measured in the original data

30.0 35.5 In [ ]: