
Homework 9

Name: Michelle Arroyo

I collaborated with: Andrew Awadallah and Anthony Reyna

In [6]:
# 1 and 3
from IPython.display import Image
Image(filename='Question#2.png', height=900)
# The answer in purple is for number 1 and the answer in black is for question 3.
In [31]:
# 2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # Adds sub-library to plot line
import scipy.stats as stats      # Adds sub-library to calculate regression line and correlation coefficient

NEN = pd.read_csv('nenana.txt', sep="\t")
NEN

p = sns.lmplot('Ice Breakup Day of Year', 'Years Since 1900', data=NEN, fit_reg=False)
p.set(title='Scatter Plot of Ice Breakup Day of Year vs Years Since 1900',
      xlabel='Ice Breakup Day of Year', ylabel='Years Since 1900');

# We should look back at our plot from Lab 8 that showed the distributions of X and Y
sns.jointplot('Ice Breakup Day of Year', 'Years Since 1900', data=NEN)
<seaborn.axisgrid.JointGrid at 0x7f79ed6e4f98>
In [32]:
rho = stats.pearsonr(NEN['Ice Breakup Day of Year'], NEN['Years Since 1900'])[0]  # we imported scipy.stats and called it stats, so to use scipy.stats.pearsonr we say stats.pearsonr
print("the pearson correlation coefficient is", rho)

regr = stats.linregress(NEN['Ice Breakup Day of Year'], NEN['Years Since 1900'])
slopeobs = regr.slope
regr

# Pick min and max by finding what your min and max x values are in the dataset
datamin = np.min(NEN['Ice Breakup Day of Year'])
datamax = np.max(NEN['Ice Breakup Day of Year'])
print(datamin, datamax)

X_plot = np.linspace(datamin, datamax, 100)
Y_plot = regr.slope*X_plot + regr.intercept  # you can get the slope and intercept from your regression result using this syntax

p = sns.lmplot('Ice Breakup Day of Year', 'Years Since 1900', data=NEN, fit_reg=False)
p.set(title='Scatter Plot of Ice Breakup Day of Year vs Years Since 1900 with Regression Line',
      xlabel='Ice Breakup Day of Year', ylabel='Years Since 1900')
plt.plot(X_plot, Y_plot);  # these are collections of X and Y values that will be plotted as (X,Y) pairs
the pearson correlation coefficient is -0.38614431815900013
104.014 140.487
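Side note (not part of the original turn-in): for a simple linear regression the slope equals the correlation coefficient times the ratio of the standard deviations, so the correlation and regression results above can be cross-checked against each other. A minimal sketch, assuming rho, regr, NEN, and np from the cells above:

# Cross-check (illustrative only): slope = r * SD(Y) / SD(X) for simple linear regression.
sx = np.std(NEN['Ice Breakup Day of Year'], ddof=1)
sy = np.std(NEN['Years Since 1900'], ddof=1)
print("slope from r * sy/sx:", rho * sy / sx)
print("slope from linregress:", regr.slope)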
In [33]:
#2b
ntimes = 10000
results = np.zeros(ntimes)
for i in range(ntimes):
    sim = NEN.sample(len(NEN), replace=True)
    results[i] = stats.linregress(sim['Ice Breakup Day of Year'], sim['Years Since 1900']).slope
results.sort()

CIlower = 2*slopeobs - results[9949]  # Lower limit of the 99% confidence interval: index 9949 is the top of the middle 99% of the sorted bootstrap slopes (M_upper)
CIupper = 2*slopeobs - results[49]    # Upper limit of the 99% confidence interval: index 49 is the bottom of the middle 99% of the sorted bootstrap slopes (M_lower)

p = sns.distplot(results, kde=False)
p.set(xlabel='Regression slope', ylabel='Count', title='Simulations for Confidence Interval')
p.axvline(CIlower, color='green')  # Lower CI limit
p.axvline(CIupper, color='green')  # Upper CI limit
p.axvline(slopeobs, color='red');  # Observed value

print("the effect size is", (CIlower, CIupper))
print("the observed slope is", slopeobs)
the effect size is (-2.963697116424238, -0.5966914414737978)
the observed slope is -1.766460334550153
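As a cross-check on the hard-coded sort indices above (not part of the original answer), the same basic ("reverse percentile") bootstrap interval can be written with np.percentile. A minimal sketch, assuming results and slopeobs from the cell above:

# Basic bootstrap: each CI limit is 2*observed minus the opposite-tail quantile of the bootstrap slopes.
q_low, q_high = np.percentile(results, [0.5, 99.5])   # middle 99% of the bootstrap slopes
print("lower limit:", 2*slopeobs - q_high)
print("upper limit:", 2*slopeobs - q_low)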
In [34]:
# 2c
# Based on the data above, there is a negative relationship between ice breakup day of year and years since 1900, with the line of best fit having a slope of about -1.766.
# The 99% bootstrap confidence interval for the slope runs from about -3.0 to -0.6, and since that interval does not include 0, it suggests there really is a (negative) trend.
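One way to back up the "does not include 0" argument (not asked for in the homework) is a permutation test on the slope. A rough sketch, assuming NEN, stats, np, and slopeobs from the cells above:

# Permutation test (illustrative only): shuffle the years relative to the breakup days
# and see how often a slope as extreme as the observed one appears by chance.
nperm = 10000
perm_slopes = np.zeros(nperm)
x = NEN['Ice Breakup Day of Year'].values
y = NEN['Years Since 1900'].values
for i in range(nperm):
    perm_slopes[i] = stats.linregress(x, np.random.permutation(y)).slope
print("two-sided permutation p-value:", np.mean(np.abs(perm_slopes) >= np.abs(slopeobs)))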
In [35]:
# 3
# The answer to this question is in question one's cell.
In [36]:
# 4
# The article points out that a study with a larger sample size will have higher power. It also notes that p-values below 0.05 can happen fairly often just by chance.
# One difference is that although the sample sizes in the article are large, the 10000 we use in lab (the number of simulations) is larger still.
# A small sample size can still give you a low p-value, but the power of the study will be low as well. In lab, we were told to increase the sample size to see how that affects the power, and increasing both the sample size and the effect size increased the power significantly, as sketched below.
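To make the sample-size point concrete (illustrative only, not from the article or the lab), here is a small simulation with made-up numbers (true difference of 5, SD of 10, alpha = 0.05) showing power rising as n grows:

# Estimated power of a two-sample t-test at several sample sizes (all numbers are assumptions).
rng = np.random.default_rng(0)
for n in [10, 30, 100]:
    hits = 0
    for _ in range(2000):
        a = rng.normal(0, 10, n)    # control group
        b = rng.normal(5, 10, n)    # group with a true shift of 5
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    print(n, "per group -> estimated power:", hits / 2000)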
In [2]:
#5
# Three things that affect the power of a study, based on Figure 6 in the study, are the sample size of the study, the population itself, and the alpha value of the study.
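If the statsmodels package is available (it was not used in this lab), its power calculator makes those three knobs explicit. A sketch with assumed values; the effect sizes and alphas below are arbitrary illustrations, not numbers from the study:

# Power of a two-sample t-test as a function of effect size, sample size, and alpha.
from statsmodels.stats.power import TTestIndPower
calc = TTestIndPower()
for alpha in [0.05, 0.01]:
    for d in [0.2, 0.5, 0.8]:   # Cohen's d (standardized effect size)
        pw = calc.power(effect_size=d, nobs1=50, alpha=alpha)
        print("alpha =", alpha, " d =", d, " n1 = 50 -> power =", round(pw, 3))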
In [28]:
#6
# Since we were not taught how to use power analysis for anything other than a 2 x 2 table, this is all I could think of to do. I used the website to figure out the rest.
plant_eph = [34,51,20,47,22,19,24,18,33,45,36,42,30,25,44,33,47,35,51,15,24,13,25,32,49,28,37,17,13,32,24,18,22]
plant_perm = [31,40,43,37,23,45,53,46,44,25,47,34,41,31,37,34,42,37,14,29,18,15,14,28]

eph_plant = np.median(plant_eph)
print(eph_plant)
perm_plant = np.median(plant_perm)
print(perm_plant)
obs1 = np.array([[eph_plant, perm_plant]])
#len(plant_eph)
#len(plant_perm)

from IPython.display import Image
Image(filename='2020-03-13 (2).png', height=400)
# the statistical power is 22.2%
30.0
35.5
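As an alternative to the website calculator (an illustrative sketch only, not part of the original answer): power at the current sample sizes can also be estimated by resampling from the two observed groups and counting how often a rank-based test rejects. This assumes the observed lists are a reasonable stand-in for their populations and uses a Mann-Whitney U test, which was not used above.

# Simulation-based power estimate at the current sample sizes (illustrative assumptions).
rng = np.random.default_rng(1)
nsim = 2000
hits = 0
for _ in range(nsim):
    a = rng.choice(plant_eph, size=len(plant_eph), replace=True)
    b = rng.choice(plant_perm, size=len(plant_perm), replace=True)
    if stats.mannwhitneyu(a, b, alternative='two-sided').pvalue < 0.05:
        hits += 1
print("estimated power at the current sample sizes:", hits / nsim)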
In [2]:
#7
# We might have gotten a significant p-value for one dataset and not the other because the datasets could have come from different populations and therefore had different powers, so the lower-powered test could miss the difference even if one was there.
# Another possibility is that where we did catch a difference (a significant p-value), we had a small sample size and were hit with the "Winner's Curse", where the effect we detect is exaggerated.
In [38]:
#8
plant_eph = [34,51,20,47,22,19,24,18,33,45,36,42,30,25,44,33,47,35,51,15,24,13,25,32,49,28,37,17,13,32,24,18,22]
plant_perm = [31,40,43,37,23,45,53,46,44,25,47,34,41,31,37,34,42,37,14,29,18,15,14,28]

eph_plant = np.median(plant_eph)
print(eph_plant)
perm_plant = np.median(plant_perm)
print(perm_plant)
obs1 = np.array([[eph_plant, perm_plant]])
#len(plant_eph)
#len(plant_perm)

from IPython.display import Image
Image(filename='2020-03-13 (4).png', height=400)
# the statistical power is 80.2%
# The minimum sample size for each group would be n=100 in order to have at least 80% power for the effect size measured in the original data.
30.0
35.5
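A rough simulation check of the n = 100 claim (not part of the original answer): resample both groups at several sizes, assuming for illustration that both groups grow to the same size n, and see where the estimated power crosses 80%.

# Power vs. sample size by resampling from the observed lists (illustrative only).
rng = np.random.default_rng(2)
for n in [50, 100, 150]:
    hits = 0
    for _ in range(1000):
        a = rng.choice(plant_eph, size=n, replace=True)
        b = rng.choice(plant_perm, size=n, replace=True)
        if stats.mannwhitneyu(a, b, alternative='two-sided').pvalue < 0.05:
            hits += 1
    print(n, "per group -> estimated power:", hits / 1000)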
In [ ]: