Week 8 / hw8-turnin.ipynb
Author: Jose Haro-Aldana
Compute Environment: Ubuntu 18.04 (Deprecated)

# Homework 8

Name: Jose Haro

I collaborated with: Cinthya Montoya

In [2]:
import numpy as np
import seaborn as sns
import pandas as pd

In [3]:
#1
#a) There is an A effect among longhand students, but no A effect for laptop users, whose A2 value is lower than their A1 value. The difference between longhand and laptop users depends on the type of question asked, factual or conceptual: there is little to no difference between the two groups on factual questions, but a clear difference appears on conceptual questions.

#b) There is also an effect of question type: as noted above, longhand students have a higher z-score than laptop users in the conceptual column. There is a B effect between longhand and laptop users, since the gap between the lines is large.

#c) There is no interaction effect, as the lines do not intersect and remain parallel to each other.

#d) Students who take notes longhand show a greater understanding on conceptual questions than students who take notes on a laptop. There is no apparent difference between longhand and laptop users on factual questions.

In [4]:
#2
# Correlation is significant because it can support several different conclusions about where the relationship comes from. For instance, A could cause B, B could cause A, A and B could depend on each other, or a completely different external factor (C) could cause both A and B, which would make them related to each other. Correlation also measures the strength of the relationship between two datasets and whether they are positively or negatively correlated.
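
The external-factor case (C causes both A and B) can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `a`, `b`, `c`, the noise scale, and the seed are all assumptions chosen for illustration, not part of the homework data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical hidden factor C that drives both A and B.
c = rng.normal(size=1000)
a = c + rng.normal(scale=0.5, size=1000)  # A depends on C plus its own noise
b = c + rng.normal(scale=0.5, size=1000)  # B depends on C plus its own noise

# A and B never influence each other, yet they end up strongly correlated.
r_ab = np.corrcoef(a, b)[0, 1]

# Subtracting C's contribution leaves only the independent noise terms.
r_ab_without_c = np.corrcoef(a - c, b - c)[0, 1]

print(r_ab, r_ab_without_c)
```

Once the shared factor is removed, the remaining correlation should sit near zero, which is why a correlation alone cannot tell us which of the causal stories is true.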

In [5]:
#3
# One possible explanation for the correlation is inflation over the 25-year gap, which would cause both professor salaries and alcohol prices to rise together. Inflation tends to raise wages and salaries while also raising consumer prices, which makes this explanation very plausible.

#Another possible explanation is that the correlation arises purely by chance, or that a completely different variable besides inflation affects both salaries and alcohol prices. Chance alone seems less plausible, as a correlation with a significant p-value is unlikely to occur randomly.

#A final possible explanation is that the statisticians performed an inappropriate statistical test. The data is not visualized; we are only told that there is a correlation and a significant p-value. The relationship between salaries and alcohol prices could be non-linear, and the statisticians may have used a linear correlation test anyway, producing a misleading correlation value. This is plausible because, without a visual representation of the data, we cannot know for sure until we graph it.
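
To see why the choice of test matters when the relationship is non-linear, here is a minimal sketch on made-up data (an exponential curve, not the salary or alcohol data, which we do not have). It compares Pearson's linear coefficient with Spearman's rank-based coefficient on a perfectly increasing but curved relationship.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y always increases with x, but along a curve.
x = np.arange(1, 21)
y = np.exp(x / 2.0)

pearson_r = pearsonr(x, y)[0]    # measures how close the points are to a straight line
spearman_r = spearmanr(x, y)[0]  # measures how consistently y increases with x

print(pearson_r, spearman_r)
```

Spearman's coefficient reaches 1 here because the ordering is perfect, while Pearson's coefficient is noticeably lower because the points do not lie on a line.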

In [6]:
#5a
co2

CO2 Concentration CO2 Uptake Rate
0 95 16.0
1 95 13.6
2 95 16.2
3 175 30.4
4 175 27.3
5 175 32.4
6 250 34.8
7 250 37.1
8 250 40.3
9 350 37.2
10 350 41.8
11 350 42.1
12 500 35.3
13 500 40.6
14 500 42.9
15 675 39.2
16 675 41.4
17 675 43.9
18 1000 39.7
19 1000 44.3
20 1000 45.5
21 95 10.6
22 95 12.0
23 95 11.3
24 175 19.2
25 175 22.0
26 175 19.4
27 250 26.2
28 250 30.6
29 250 25.8
30 350 30.0
31 350 31.8
32 350 27.9
33 500 30.9
34 500 32.4
35 500 28.5
36 675 32.4
37 675 31.1
38 675 28.1
39 1000 35.5
40 1000 31.5
41 1000 27.8
In [7]:
sns.jointplot(x="CO2 Concentration", y="CO2 Uptake Rate", data=co2)

<seaborn.axisgrid.JointGrid at 0x7effc972e198>
In [8]:
#5b
# It would be appropriate to use Spearman's correlation coefficient rather than Pearson's because the jointplot displays a non-linear relationship. CO2 uptake rate rises with CO2 concentration, so the two variables are positively correlated, but the relationship is not linear.
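
Spearman's coefficient can be understood as Pearson's coefficient computed on the ranks of the data, which is why it tolerates curved relationships. The sketch below checks this on a small subset of the table above (the first measurement at each concentration, rows 0, 3, 6, 9, 12, 15 and 18); the variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# First measurement at each CO2 concentration, copied from the co2 table.
concentration = np.array([95, 175, 250, 350, 500, 675, 1000])
uptake = np.array([16.0, 30.4, 34.8, 37.2, 35.3, 39.2, 39.7])

# Replace each value by its rank (1 = smallest), then apply Pearson to the ranks.
rho_from_ranks = pearsonr(rankdata(concentration), rankdata(uptake))[0]

# scipy's built-in Spearman coefficient on the raw values.
rho_scipy = spearmanr(concentration, uptake)[0]

print(rho_from_ranks, rho_scipy)
```

The two numbers agree, because only the ordering of the values matters to Spearman's coefficient, not their spacing.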

In [9]:
#5c
from scipy.stats import spearmanr

In [10]:
spearman_coefficient, spearman_p_value = spearmanr(co2["CO2 Concentration"], co2["CO2 Uptake Rate"]) #spearmanr returns the coefficient and its p-value together

print(spearman_coefficient, spearman_p_value)

0.6768225136164059 8.605567282157638e-07
In [11]:
#5d
#Null (H0): There is no correlation between CO2 concentration and CO2 uptake rate.
#The sample size of the study is 42, as there are 42 paired measurements of the two variables.

In [12]:
#5e
co2_copy = co2.copy() #creates a copy to use for the NHST; the copy preserves the original data, since shuffling for the NHST destroys the pairing (and therefore any correlation) between the two columns.

co2_concentration = list(co2_copy["CO2 Concentration"]) #set co2_concentration to the column of values that lie under the CO2 Concentration column
co2_uptake_rate = list(co2_copy["CO2 Uptake Rate"]) #set co2_uptake_rate to the column of values that lie under the CO2 Uptake Rate column.

np.random.shuffle(co2_concentration) #shuffles the values in co2_concentration
np.random.shuffle(co2_uptake_rate) #shuffles the values in co2_uptake_rate

In [13]:
#Calculating p-value
sims = 10000
zeros = np.zeros(sims)
for i in range(sims): #repeat the shuffle sims times
    np.random.shuffle(co2_concentration) #shuffles co2_concentration to break the pairing
    s_correlation_copy = spearmanr(co2_concentration, co2_uptake_rate)[0] #finds the Spearman correlation value for the shuffled data
    zeros[i] = s_correlation_copy #stores the Spearman correlation value in zeros

p_value = np.sum(zeros>=spearman_coefficient)/sims #proportion of shuffles at or above the observed correlation
p_value_inverse = np.sum(zeros<=-1*(spearman_coefficient))/sims #proportion of shuffles at or below its mirror image

p = sns.distplot(zeros)
p.axvline(spearman_coefficient, color = "green")
p.axvline(-1*(spearman_coefficient), color = 'red')
p.set(title = "Null Distribution", xlabel = "Null Values", ylabel = 'Count')

print(p_value, p_value_inverse)

0.0 0.0
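
The shuffle-based NHST above can be wrapped in a reusable function. The following is a sketch under stated assumptions: `permutation_p_value` is a hypothetical helper name, it uses NumPy's `default_rng` instead of `np.random.shuffle`, and the demo data is made up rather than taken from `co2`.

```python
import numpy as np
from scipy.stats import spearmanr


def permutation_p_value(x, y, n_sims=10000, seed=0):
    """Two-tailed permutation p-value for Spearman's correlation (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    observed = spearmanr(x, y)[0]
    null = np.empty(n_sims)
    for i in range(n_sims):
        # Shuffling one variable breaks the pairing, which simulates the null hypothesis.
        null[i] = spearmanr(rng.permutation(x), y)[0]
    # Proportion of shuffles at least as extreme as the observed value, in either tail.
    p = np.mean(np.abs(null) >= abs(observed))
    return observed, p


# Made-up demo data with a clear positive trend.
rng_demo = np.random.default_rng(1)
x_demo = np.arange(30)
y_demo = x_demo + rng_demo.normal(scale=5.0, size=30)

observed_demo, p_demo = permutation_p_value(x_demo, y_demo, n_sims=2000)
print(observed_demo, p_demo)
```

Only one of the two variables needs to be shuffled: permuting either one is enough to destroy the pairing between them.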
In [14]:
#5f
#Confidence Interval
zeros1 = np.zeros(10000)
for i in range(10000):
    co2_resample = co2.sample(len(co2), replace = True) #resamples rows of the original data with replacement, which keeps each pair together and preserves the correlation
    s_CI = spearmanr(co2_resample["CO2 Concentration"], co2_resample["CO2 Uptake Rate"])[0] #finds the Spearman correlation value of the resampled data
    zeros1[i] = s_CI #stores the Spearman correlation value in zeros1

#Looking for Upper and Lower Bounds
zeros1.sort()
M_lower = zeros1[49]
M_upper = zeros1[9949]

lower_bound = (2*spearman_coefficient - M_upper)
upper_bound = (2*spearman_coefficient - M_lower)

q = sns.distplot(zeros1)
q.axvline(lower_bound, color = 'red')
q.axvline(upper_bound, color = 'blue')
q.axvline(spearman_coefficient, color = 'green')
q.set(title = "Confidence Interval", xlabel = "Correlation Values", ylabel = "Count")

[Text(0, 0.5, 'Count'), Text(0.5, 0, 'Null Values'), Text(0.5, 1.0, 'Confidence Interval')]
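
The confidence-interval procedure above (resample pairs with replacement, take the 0.5th and 99.5th percentiles, then reflect them around the observed value) can be sketched as a function. `pivotal_bootstrap_ci` is a hypothetical helper name, `np.percentile` stands in for the manual sort-and-index step, and the demo data is made up.

```python
import numpy as np
from scipy.stats import spearmanr


def pivotal_bootstrap_ci(x, y, level=0.99, n_boot=10000, seed=0):
    """Pivotal bootstrap confidence interval for Spearman's correlation (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    observed = spearmanr(x, y)[0]
    boot = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        boot[i] = spearmanr(x[idx], y[idx])[0]  # pairs stay together
    alpha = 1 - level
    m_lower, m_upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # Reflect the bootstrap percentiles around the observed value.
    return observed, 2 * observed - m_upper, 2 * observed - m_lower


# Made-up demo data with a clear positive trend.
rng_demo = np.random.default_rng(2)
x_demo = np.arange(40)
y_demo = x_demo + rng_demo.normal(scale=5.0, size=40)

observed_demo, lower_demo, upper_demo = pivotal_bootstrap_ci(x_demo, y_demo, n_boot=2000)
print(observed_demo, lower_demo, upper_demo)
```

Because of the reflection step, the upper bound of a pivotal interval can exceed 1 when the observed correlation is already close to 1.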
In [ ]:
#5i
#With the 99% confidence interval built around the observed correlation value, it is reasonable to conclude that there is a true correlation between CO2 concentration and CO2 uptake rate. With a p-value that is at or close to 0 (none of the 10,000 shuffles reached the observed value), the correlation between these two variables is also significant.

In [18]:
#6a
nenana

Ice Breakup Day of Year Years Since 1900 Year
0 119.479 17 1917
1 130.398 18 1918
2 122.606 19 1919
3 131.448 20 1920
4 130.279 21 1921
5 131.556 22 1922
6 128.083 23 1923
7 131.632 24 1924
8 126.772 25 1925
9 115.669 26 1926
10 132.238 27 1927
11 126.684 28 1928
12 124.653 29 1929
13 127.794 30 1930
14 129.391 31 1931
15 121.427 32 1932
16 127.813 33 1933
17 119.588 34 1934
18 134.564 35 1935
19 120.540 36 1936
20 131.836 37 1937
21 125.843 38 1938
22 118.560 39 1939
23 110.644 40 1940
24 122.076 41 1941
25 119.561 42 1942
26 117.807 43 1943
27 124.589 44 1944
28 135.404 45 1945
29 124.694 46 1946
... ... ... ...
73 113.722 90 1990
74 120.003 91 1991
75 134.269 92 1992
76 112.542 93 1993
77 118.959 94 1994
78 115.557 95 1995
79 125.522 96 1996
80 119.436 97 1997
81 109.704 98 1998
82 118.908 99 1999
83 121.449 100 2000
84 127.542 101 2001
85 126.894 102 2002
86 118.765 103 2003
87 114.594 104 2004
88 117.501 105 2005
89 121.729 106 2006
90 116.658 107 2007
91 126.953 108 2008
92 120.862 109 2009
93 118.379 110 2010
94 123.685 111 2011
95 114.819 112 2012
96 139.612 113 2013
97 114.658 114 2014
98 113.601 115 2015
99 112.652 116 2016
100 120.500 117 2017
101 121.554 118 2018
102 104.014 119 2019

103 rows × 3 columns

In [19]:
sns.jointplot(x="Ice Breakup Day of Year", y="Years Since 1900", data=nenana)

<seaborn.axisgrid.JointGrid at 0x7f5d1377c588>
In [20]:
#6b
#It would be appropriate to use Pearson's correlation coefficient rather than Spearman's because the jointplot, although noisy, shows a roughly linear negative relationship.

In [21]:
#6c
from scipy.stats import pearsonr

In [22]:
pearson_coefficient, pearson_p_value = pearsonr(nenana["Ice Breakup Day of Year"], nenana["Years Since 1900"])

print(pearson_coefficient, pearson_p_value)

-0.38614431815900013 5.6129429307579826e-05
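
For reference, Pearson's coefficient is the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand and compares it with `pearsonr` on the first five rows of the table above; `pearson_by_hand` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_by_hand(x, y):
    """Pearson's r from its definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# First five rows of the nenana table (1917 to 1921).
years_since_1900 = [17, 18, 19, 20, 21]
breakup_day = [119.479, 130.398, 122.606, 131.448, 130.279]

r_manual = pearson_by_hand(breakup_day, years_since_1900)
r_scipy = pearsonr(breakup_day, years_since_1900)[0]
print(r_manual, r_scipy)
```

Five rows are far too few to say anything about the trend; the subset is only there to confirm that the formula and the library call agree.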
In [23]:
#6d
#Null (H0): There is no correlation between the day of the year on which the ice breaks up and the number of years that have passed since 1900.

#The sample size is 103, as there are 103 rows (one pair of values per year) in the data.

In [24]:
#6e
nenana_copy = nenana.copy()

nenana_days = list(nenana_copy["Ice Breakup Day of Year"])
nenana_years = list(nenana_copy["Years Since 1900"])

np.random.shuffle(nenana_days)
np.random.shuffle(nenana_years)

In [25]:
#Calculating p-value
zeros2 = np.zeros(10000)
for i in range(10000):
    np.random.shuffle(nenana_days) #shuffles the breakup days to break the pairing with the years
    p_correlation_copy = pearsonr(nenana_days, nenana_years)[0]
    zeros2[i] = p_correlation_copy

p_value = (np.sum(zeros2<=pearson_coefficient))/10000 #proportion of shuffles at or below the observed (negative) correlation
p_value_inverse = np.sum(zeros2>=-1*(pearson_coefficient))/10000 #proportion of shuffles at or above its mirror image

p = sns.distplot(zeros2)
p.axvline(pearson_coefficient, color = "green")
p.axvline(-1*(pearson_coefficient), color = 'red')

print(p_value, p_value_inverse)

0.0 0.0
In [27]:
#6f
#Confidence Interval Testing
zeros3 = np.zeros(10000)
for i in range(10000):
    nenana_resample = nenana.sample(len(nenana_copy), replace = True)
    p_CI = pearsonr(nenana_resample["Ice Breakup Day of Year"], nenana_resample["Years Since 1900"])[0]
    zeros3[i] = p_CI

#Looking for Upper and Lower Bounds
zeros3.sort()
M_lower = zeros3[49]
M_upper = zeros3[9949]

lower_bound = (2*pearson_coefficient - M_upper)
upper_bound = (2*pearson_coefficient - M_lower)

r = sns.distplot(zeros3)
r.axvline(lower_bound, color = 'red')
r.axvline(upper_bound, color = 'blue')
r.axvline(pearson_coefficient, color = 'green')

<matplotlib.lines.Line2D at 0x7f5d130a8e80>
In [ ]:
#6i
#The observed correlation is negative, as the jointplot in 6a also shows, so it is appropriate to conclude that the two variables are negatively correlated. The 99% confidence interval gives a range of plausible values for the true correlation: intervals constructed this way capture the true value 99% of the time. Additionally, with a p-value that is at or close to 0, the correlation is significant. Finally, the results of this NHST and CI are consistent with the negative linear relationship suggested by the jointplot.

In [28]:
#7a
acid_phosphatase

Temperature Initial Reaction Rate
0 298.0 0.05
1 303.0 0.07
2 308.0 0.12
3 313.0 0.20
4 313.0 0.18
5 318.0 0.34
6 323.0 0.48
7 328.0 0.79
8 333.0 0.98
9 335.0 1.02
10 333.5 1.04
11 338.0 1.10
12 343.0 0.98
13 298.0 0.04
14 343.7 1.00
15 353.0 0.53
16 353.0 0.58
17 353.0 0.61
18 338.0 1.07
19 348.0 0.74
20 348.0 0.72
In [29]:
sns.jointplot(x="Temperature", y="Initial Reaction Rate", data=acid_phosphatase)

<seaborn.axisgrid.JointGrid at 0x7f5d139a0668>
In [30]:
#7b
#Spearman's coefficient would be appropriate to use, as the jointplot above displays a non-linear relationship, thereby excluding Pearson's as a viable coefficient.
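
Since Spearman's coefficient measures how consistently one variable increases with the other, it can help to check whether the reaction rate keeps rising across the whole temperature range. The sketch below copies the 21 rows from the table above into arrays and computes Spearman's coefficient separately on each side of the temperature at which the rate peaks; splitting at the peak is an illustrative choice, not part of the assignment.

```python
import numpy as np
from scipy.stats import spearmanr

# The 21 rows of the acid_phosphatase table, in order.
temperature = np.array([298.0, 303.0, 308.0, 313.0, 313.0, 318.0, 323.0,
                        328.0, 333.0, 335.0, 333.5, 338.0, 343.0, 298.0,
                        343.7, 353.0, 353.0, 353.0, 338.0, 348.0, 348.0])
rate = np.array([0.05, 0.07, 0.12, 0.20, 0.18, 0.34, 0.48,
                 0.79, 0.98, 1.02, 1.04, 1.10, 0.98, 0.04,
                 1.00, 0.53, 0.58, 0.61, 1.07, 0.74, 0.72])

# Temperature at which the highest reaction rate was recorded.
peak_temp = temperature[np.argmax(rate)]

rising = temperature <= peak_temp   # rows up to and including the peak
falling = temperature >= peak_temp  # rows from the peak onward

rho_rising = spearmanr(temperature[rising], rate[rising])[0]
rho_falling = spearmanr(temperature[falling], rate[falling])[0]

print(peak_temp, rho_rising, rho_falling)
```

The two halves have coefficients of opposite sign, so a single coefficient over the full range averages an increasing trend with a decreasing one.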

In [31]:
#7c
from scipy.stats import spearmanr

In [33]:
spearman_correlation1, spearman_correlation_p_value1 = spearmanr(acid_phosphatase["Temperature"], acid_phosphatase["Initial Reaction Rate"])

print(spearman_correlation1, spearman_correlation_p_value1)

0.6037137142130132 0.0037572399881533225
In [34]:
#7d
#Null (H0): The null hypothesis claims that there is no correlation between the initial reaction rate and temperature.
#The sample size for this NHST would be 21 as there are 21 rows.

In [35]:
#7e
acid_phosphatase_copy = acid_phosphatase.copy()

acid_phosphatase_temp = list(acid_phosphatase_copy["Temperature"])
acid_phosphatase_rate = list(acid_phosphatase_copy["Initial Reaction Rate"])

np.random.shuffle(acid_phosphatase_temp)
np.random.shuffle(acid_phosphatase_rate)

In [36]:
#Calculating p-value
zeros4 = np.zeros(10000)
for i in range(10000):
    np.random.shuffle(acid_phosphatase_temp) #shuffles the temperatures to break the pairing with the rates
    s_correlation_copy2 = spearmanr(acid_phosphatase_temp, acid_phosphatase_rate)[0]
    zeros4[i] = s_correlation_copy2

p_value = np.sum(zeros4>=spearman_correlation1)/10000 #proportion of shuffles at or above the observed correlation
p_value_inverse = np.sum(zeros4<=-1*(spearman_correlation1))/10000 #proportion of shuffles at or below its mirror image

p = sns.distplot(zeros4)
p.axvline(spearman_correlation1, color = "green")
p.axvline(-1*(spearman_correlation1), color = 'red')

print(p_value, p_value_inverse)

0.0021 0.0027
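
The two numbers printed above are the proportions of shuffled correlations in each tail of the null distribution. A two-tailed p-value adds them together, and the sum can then be compared with the alpha of 0.01 used in this homework. A short worked check, with illustrative variable names:

```python
# Tail proportions printed by the simulation above.
p_upper_tail = 0.0021  # shuffles at or above the observed correlation
p_lower_tail = 0.0027  # shuffles at or below the negative of the observed correlation
alpha = 0.01

# Two-tailed p-value: both tails count as "at least as extreme".
p_two_tailed = p_upper_tail + p_lower_tail

print(p_two_tailed, p_two_tailed < alpha)
```

The combined value is about 0.0048, which is still below 0.01.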
In [38]:
#Confidence Interval
zeros5 = np.zeros(10000)
for i in range(10000):
    acid_phosphatase_resample = acid_phosphatase.sample(len(acid_phosphatase), replace = True)
    s_CI1 = spearmanr(acid_phosphatase_resample["Temperature"], acid_phosphatase_resample["Initial Reaction Rate"])[0]
    zeros5[i] = s_CI1

#Looking for Upper and Lower Bounds
zeros5.sort()
M_lower = zeros5[49]
M_upper = zeros5[9949]

lower_bound = (2*spearman_correlation1 - M_upper)
upper_bound = (2*spearman_correlation1 - M_lower)

q = sns.distplot(zeros5)
q.axvline(lower_bound, color = 'red')
q.axvline(upper_bound, color = 'blue')
q.axvline(spearman_correlation1, color = 'green')

<matplotlib.lines.Line2D at 0x7f5d132db8d0>
In [ ]:
#7i
#With both the NHST and the 99% confidence interval conducted, it is reasonable to conclude that there is a correlation between temperature and the initial reaction rate. The confidence interval gives a range of plausible values for the true correlation; intervals constructed this way capture the true value 99% of the time. Additionally, the NHST shows that the correlation is significant, as the p-value is less than the critical alpha value of 0.01.