
Homework 8

Name: Andrew Awadallah

I collaborated with: Michelle, Anthony

In [1]:
# 1
#a) There is a main effect of note-taking type because the Z scores for longhand are consistently higher than those for laptop.
#b) There is a main effect of question type because the Z scores for both factual and conceptual questions are higher with longhand.
#c) Yes, there is an interaction, because the lines are not parallel.
#d) The study shows that students who take notes by longhand score better than those who take them by laptop. This is especially evident for conceptual questions.
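For reference, a plot like the one described in (c) can be sketched with matplotlib; the Z scores below are hypothetical placeholders chosen to show non-parallel lines, not the study's actual values.

# Sketch of an interaction plot (hypothetical Z scores, not the study's data)
import matplotlib.pyplot as plt

questions = ["Factual", "Conceptual"]
longhand = [0.3, 0.2]    # hypothetical values
laptop = [0.1, -0.4]     # hypothetical values

plt.plot(questions, longhand, marker="o", label="Longhand")
plt.plot(questions, laptop, marker="o", label="Laptop")
plt.ylabel("Z score")
plt.legend()
plt.title("Non-parallel lines indicate an interaction")
plt.show()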
In [2]:
#2
#It is important to understand that correlation hints at a relationship between two variables. It does not mean that one variable causes the other, but rather that the two may be tied or connected. In particular, two things can be positively or negatively correlated because of a third factor that influences them both.
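A small simulation can make this concrete. In the sketch below (illustrative variables, not homework data), x and y are each driven by a hidden third variable z, so they correlate strongly even though neither causes the other.

# Two variables with a common cause correlate without any causal link between them
import numpy as np

rng = np.random.RandomState(0)
z = rng.normal(size=1000)             # hidden third factor
x = z + 0.5 * rng.normal(size=1000)   # x depends only on z
y = z + 0.5 * rng.normal(size=1000)   # y depends only on z
print(np.corrcoef(x, y)[0, 1])        # strong positive correlation, no causation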
In [3]:
#3
#One possible explanation is that inflation increased the prices of goods (including alcohol), which also pushed administrators to pay teachers more. Another possibility is a rise in the minimum wage, which could have increased the price of alcohol and also led administrators to raise teachers' salaries. Finally, a surge of immigration could also produce this correlation: the population increase would raise demand for alcohol (and thus its price) and would also send more students to school, increasing teachers' workload and, in turn, their salaries.
In [4]:
#5a
import pandas as pd
import numpy as np
import seaborn as sns

CO2 = pd.read_csv("co2.csv")
In [5]:
sns.jointplot("CO2 Concentration","CO2 Uptake Rate", data=CO2)
<seaborn.axisgrid.JointGrid at 0x7f1ab5b0ee48>
In [11]:
#5b
#Since the plot shows some association, but it is not linear, it is best to use Spearman's correlation coefficient.
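As a quick illustration of why Spearman suits nonlinear-but-monotonic data better than Pearson (toy data, not the CO2 dataset):

# Toy illustration: Spearman captures a monotonic trend that Pearson understates
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 50)
y = np.log(x)                 # monotonic but clearly nonlinear
print(pearsonr(x, y)[0])      # less than 1 because the trend is curved
print(spearmanr(x, y)[0])     # exactly 1.0: the ranks agree perfectly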
In [6]:
#5c
from scipy.stats import spearmanr

spearmanr(CO2["CO2 Concentration"], CO2["CO2 Uptake Rate"])
SpearmanrResult(correlation=0.6768225136164059, pvalue=8.605567282157638e-07)
In [16]:
#5d
#The null hypothesis is that there is no correlation between the rate of CO2 uptake and the CO2 concentration. The sample size is 43.
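The stated sample size can be verified with a one-line check, assuming the CO2 DataFrame loaded in 5a (the same check used for question 7 below):

len(CO2)  # number of datapoint pairs; expected to match the stated sample size of 43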
In [12]:
#5e
#debugging sample (10 simulations)
CO2_copy = CO2.copy()
CO2_Concentration = list(CO2_copy["CO2 Concentration"])
CO2_Uptake_Rate = list(CO2_copy["CO2 Uptake Rate"])

B = 10  # Hint: if you define the number of simulations as a variable, you can run a few as a test and then increase it
sims = np.zeros(B)
for i in range(B):
    np.random.shuffle(CO2_Uptake_Rate)  # Shuffle CO2 uptake data
    sims[i] = spearmanr(CO2_Concentration, CO2_Uptake_Rate)[0]  # Spearman correlation coefficient for shuffled data

corrObs = spearmanr(CO2["CO2 Concentration"], CO2["CO2 Uptake Rate"])[0]  # First element of the tuple output by spearmanr (the correlation coefficient), by indexing just like a list
corrAbs = np.abs(corrObs)

p = sns.distplot(sims, kde=False)
p.set(xlabel="Spearman's Correlation Coefficient", ylabel="Count", title="Null Distribution")
p.axvline(corrObs, color='red')
p.axvline(-corrObs, color='red')
<matplotlib.lines.Line2D at 0x7f1aaa52d668>
In [13]:
#10,000 simulations
B = 10000
sims = np.zeros(B)
for i in range(B):
    np.random.shuffle(CO2_Uptake_Rate)  # Shuffle CO2 uptake data
    sims[i] = spearmanr(CO2_Concentration, CO2_Uptake_Rate)[0]  # Spearman correlation coefficient for shuffled data

corrObs = spearmanr(CO2["CO2 Concentration"], CO2["CO2 Uptake Rate"])[0]  # First element of the tuple output by spearmanr (the correlation coefficient)
corrAbs = np.abs(corrObs)

p = sns.distplot(sims, kde=False)
p.set(xlabel="Spearman's Correlation Coefficient", ylabel="Count", title="Null Distribution")
p.axvline(corrObs, color='red')
p.axvline(-corrObs, color='red')
<matplotlib.lines.Line2D at 0x7f1aa9e6cfd0>
In [14]:
one_tail_pvalue = np.sum(sims >= corrObs) / B
two_tail_pvalue = (np.sum(sims >= corrAbs) + np.sum(sims <= -corrAbs)) / B
print(one_tail_pvalue)
print(two_tail_pvalue)
0.0
0.0001
In [25]:
#5f
correlation_pairs = np.zeros(10000)  # Array to store the correlation value for each simulation
for i in range(10000):
    new_CO2 = CO2.sample(len(CO2_Uptake_Rate), replace=True)  # Sample datapoint pairs with replacement
    correlation_pairs[i] = spearmanr(new_CO2["CO2 Concentration"], new_CO2["CO2 Uptake Rate"])[0]  # Spearman correlation coefficient for each resample

correlation_pairs.sort()
CIlower = 2 * corrObs - correlation_pairs[9749]  # Lower limit of 95% CI, using the index at the top of the middle 95% of the data (M_upper)
CIupper = 2 * corrObs - correlation_pairs[49]    # Upper limit of 95% CI, using the index at the bottom of the middle 95% of the data (M_lower)

p = sns.distplot(correlation_pairs, kde=False)
p.set(xlabel="Spearman's correlation coefficient", ylabel="Count", title="Simulations for Confidence Interval")
p.axvline(CIlower, color='green')  # Lower CI limit
p.axvline(CIupper, color='green')  # Upper CI limit
p.axvline(corrObs, color='red')    # Observed value
print("The Spearman's correlation coefficient for the observed sample is", corrObs)
print("The 95% confidence interval for Spearman's correlation coefficient is", (CIlower, CIupper))
The Spearman's correlation coefficient for the observed sample is 0.6768225136164059
The 95% confidence interval for Spearman's correlation coefficient is (0.5114712931910269, 1.0421276612490509)
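Note that the upper limit above (about 1.04) exceeds 1, the maximum possible value of a correlation coefficient. This can happen with the reflected ("basic") bootstrap interval computed in the cell above, which reflects the bootstrap percentiles q around the observed value r_obs:

CIlower = 2*r_obs - q_0.975,  CIupper = 2*r_obs - q_0.025

When the bootstrap distribution is skewed toward values below r_obs, the reflected upper limit can spill past 1.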
In [21]:
#5i
#The results in steps e-f suggest that there is a correlation between CO2 concentration and CO2 uptake rate. The correlation coefficient is about 0.677, which is fairly high. The p-value was well below 0.05, which suggests a significant relationship: there is almost a zero percent chance of getting our result, or something more extreme, if the null hypothesis were true. Finally, the 95% confidence interval runs from about 0.51 to 1.04, which also supports a correlation because the interval does not include 0. In other words, if we repeated this experiment many times, about 95 percent of the intervals constructed this way would contain the true effect size. We still cannot say that one variable causes the other; correlation alone cannot establish causation, and a third variable could be influencing both.
In [16]:
#6a
NEN = pd.read_csv('nenana.txt', sep="\t")
sns.jointplot("Years Since 1900", "Ice Breakup Day of Year", data=NEN)
<seaborn.axisgrid.JointGrid at 0x7f1aaa5b1cc0>
In [ ]:
#6b
#The data suggest using the Pearson correlation because the relationship appears roughly linear and there is some visible correlation.
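As a quick visual check of that linearity assumption, a least-squares line can be overlaid on the scatter; this sketch uses sns.regplot with the same older positional-argument seaborn style as the jointplot calls above:

# Overlay a straight-line fit to eyeball linearity before using Pearson
sns.regplot("Years Since 1900", "Ice Breakup Day of Year", data=NEN)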
In [17]:
#6c
from scipy.stats import pearsonr

pearsonr(NEN["Ice Breakup Day of Year"], NEN["Years Since 1900"])
(-0.38614431815900013, 5.612942930757996e-05)
In [ ]:
#6d
#The null hypothesis is that there is no correlation between ice breakup day of year and years since 1900.
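As with the other questions, the sample size here can be checked with a one-line sketch using the NEN DataFrame loaded in 6a:

len(NEN)  # number of year/ice-breakup pairs in the Nenana dataset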
In [18]:
#6e
NEN_copy = NEN.copy()
Ice_Breakup_Day_of_Year = list(NEN_copy["Ice Breakup Day of Year"])
Years_Since_1900 = list(NEN_copy["Years Since 1900"])

C = 10000
sims = np.zeros(C)
for i in range(C):
    np.random.shuffle(Ice_Breakup_Day_of_Year)  # Shuffle ice breakup data
    sims[i] = pearsonr(Years_Since_1900, Ice_Breakup_Day_of_Year)[0]  # Pearson correlation coefficient for shuffled data

corrObss = pearsonr(NEN["Years Since 1900"], NEN["Ice Breakup Day of Year"])[0]  # First element of the tuple output by pearsonr (the correlation coefficient)
corrAbss = np.abs(corrObss)

p = sns.distplot(sims, kde=False)
p.set(xlabel="Pearson's Correlation Coefficient", ylabel="Count", title="Null Distribution")
p.axvline(corrObss, color='red')
p.axvline(-corrObss, color='red')

print(corrObss)
one_tail_pvalue = np.sum(sims <= corrObss) / C  # Observed correlation is negative, so the one-tail test uses the lower tail
two_tail_pvalue = (np.sum(sims >= corrAbss) + np.sum(sims <= -corrAbss)) / C
print(one_tail_pvalue)
print(two_tail_pvalue)
-0.38614431815900013
0.0
0.0
In [24]:
correlation_pairss = np.zeros(10000)  # Array to store the correlation value for each simulation
for i in range(10000):
    new_NEN = NEN.sample(len(Ice_Breakup_Day_of_Year), replace=True)  # Sample datapoint pairs with replacement
    correlation_pairss[i] = pearsonr(new_NEN["Years Since 1900"], new_NEN["Ice Breakup Day of Year"])[0]  # Pearson correlation coefficient for each resample

correlation_pairss.sort()
CIlower = 2 * corrObss - correlation_pairss[9749]  # Lower limit of 95% CI, using the index at the top of the middle 95% of the data (M_upper)
CIupper = 2 * corrObss - correlation_pairss[49]    # Upper limit of 95% CI, using the index at the bottom of the middle 95% of the data (M_lower)

p = sns.distplot(correlation_pairss, kde=False)  # Plot this question's resampled coefficients (not the array from question 5)
p.set(xlabel="Pearson's correlation coefficient", ylabel="Count", title="Simulations for Confidence Interval")
p.axvline(CIlower, color='green')   # Lower CI limit
p.axvline(CIupper, color='green')   # Upper CI limit
p.axvline(corrObss, color='red')    # Observed value
print("The Pearson's correlation coefficient for the observed sample is", corrObss)
print("The 95% confidence interval for Pearson's correlation coefficient is", (CIlower, CIupper))
The Pearson's correlation coefficient for the observed sample is -0.38614431815900013
The 95% confidence interval for Pearson's correlation coefficient is (-0.5813196700792721, -0.1634528230157879)
In [ ]:
#6i
#Based on the results from e, there is a significant result given the p-value we obtained. The confidence interval, which runs from about -0.58 to -0.16, suggests a low-to-moderate negative correlation between years since 1900 and the day of year when the ice breaks up. The correlation coefficient is not especially high, so it would not be appropriate to conclude that one variable is causing a change in the other; it may be more appropriate to analyze other variables. In addition, the p-value is very low, which suggests that the correlation coefficient of -0.386 is significant. In other words, assuming the null hypothesis is correct, there is a very low chance of getting our result, or something more extreme, by chance.
In [23]:
#7a
acid = pd.read_csv("acid-phosphatase-corrected.csv")
sns.jointplot("Temperature", "Initial Reaction Rate", data=acid)
<seaborn.axisgrid.JointGrid at 0x7f650be75898>
In [24]:
#7b
#For this data, Spearman's correlation seems most appropriate: the data appear correlated, but the relationship is not linear.
In [26]:
#7c
from scipy.stats import spearmanr

spearmanr(acid["Temperature"], acid["Initial Reaction Rate"])
SpearmanrResult(correlation=0.6037137142130132, pvalue=0.0037572399881533225)
In [27]:
#7d
#The null hypothesis is that there is no correlation between the acid's initial reaction rate and the temperature. The sample size is 21.
In [30]:
len(acid)
21
In [35]:
#7e
#debugging sample (10 simulations)
acid_copy = acid.copy()
Temp = list(acid_copy["Temperature"])
IRR = list(acid_copy["Initial Reaction Rate"])

B = 10
sims = np.zeros(B)
for i in range(B):
    np.random.shuffle(IRR)  # Shuffle reaction rate data
    sims[i] = spearmanr(Temp, IRR)[0]  # Spearman correlation coefficient for shuffled data

corrObs = spearmanr(acid["Temperature"], acid["Initial Reaction Rate"])[0]  # First element of the tuple output by spearmanr (the correlation coefficient)
corrAbs = np.abs(corrObs)

p = sns.distplot(sims, kde=False)
p.set(xlabel="Spearman's Correlation Coefficient", ylabel="Count", title="Null Distribution")
p.axvline(corrObs, color='red')
p.axvline(-corrObs, color='red')
<matplotlib.lines.Line2D at 0x7f650b81af28>
In [39]:
#10,000 simulations
acid_copy = acid.copy()
Temp = list(acid_copy["Temperature"])
IRR = list(acid_copy["Initial Reaction Rate"])

B = 10000
sims = np.zeros(B)
for i in range(B):
    np.random.shuffle(IRR)  # Shuffle reaction rate data
    sims[i] = spearmanr(Temp, IRR)[0]  # Spearman correlation coefficient for shuffled data

corrObs = spearmanr(acid["Temperature"], acid["Initial Reaction Rate"])[0]  # First element of the tuple output by spearmanr (the correlation coefficient)
corrAbs = np.abs(corrObs)

p = sns.distplot(sims, kde=False)
p.set(xlabel="Spearman's Correlation Coefficient", ylabel="Count", title="Null Distribution")
p.axvline(corrObs, color='red')
p.axvline(-corrObs, color='red')

one_tail_pvalue = np.sum(sims >= corrObs) / B
two_tail_pvalue = (np.sum(sims >= corrAbs) + np.sum(sims <= -corrAbs)) / B
print(one_tail_pvalue)
print(two_tail_pvalue)
0.0026
0.004
In [40]:
#7f
correlation_pairs = np.zeros(10000)  # Array to store the correlation value for each simulation
for i in range(10000):
    newrate = acid.sample(len(IRR), replace=True)  # Sample datapoint pairs with replacement
    correlation_pairs[i] = spearmanr(newrate["Temperature"], newrate["Initial Reaction Rate"])[0]  # Spearman correlation coefficient for each resample

correlation_pairs.sort()
CIlower = 2 * corrObs - correlation_pairs[9749]  # Lower limit of 95% CI (note the 2* factor, as in 5f and 6f), using the index at the top of the middle 95% of the data (M_upper)
CIupper = 2 * corrObs - correlation_pairs[49]    # Upper limit of 95% CI, using the index at the bottom of the middle 95% of the data (M_lower)

p = sns.distplot(correlation_pairs, kde=False)
p.set(xlabel="Spearman's correlation coefficient", ylabel="Count", title="Simulations for Confidence Interval")
p.axvline(CIlower, color='green')  # Lower CI limit
p.axvline(CIupper, color='green')  # Upper CI limit
p.axvline(corrObs, color='red')    # Observed value
print("The Spearman's correlation coefficient for the observed sample is", corrObs)
print("The 95% confidence interval for Spearman's correlation coefficient is", (CIlower, CIupper))
The Spearman's correlation coefficient for the observed sample is 0.6037137142130132
The 95% confidence interval for Spearman's correlation coefficient is (0.30571665764614805, 1.3702983396367459)
In [ ]:
#7i
#Based on the evidence from e-f, the data do support a correlation. The p-value is very low (well below 0.05), indicating a very low chance of getting a correlation coefficient of 0.604, or something more extreme, by chance (assuming the null is true). The 95% confidence interval, roughly (0.31, 1.37), does not include 0, which also supports a correlation. Even so, we cannot say that one variable is causing the other; correlation alone cannot establish causation, and a third variable could be influencing both.