
Name: Jose Haro

I collaborated with: Cinthya Montoya


In [2]:

import numpy as np
import seaborn as sns
import pandas as pd


In [3]:

#1
#a) There is an A effect among longhand students, but no A effect at all for laptop users, whose A2 value is lower than their A1 value. The difference between longhand and laptop users comes from the type of question being asked, factual or conceptual. There is little to no difference between longhand and laptop users on factual questions, but on conceptual questions the differences appear.
#b) There is also an effect of the question type: as mentioned above, longhand students have a higher z-score than laptop users in the conceptual column. A B effect is present between longhand and laptop users, as the gap between the lines is large.
#c) There is no interaction effect, as the lines do not intersect and remain parallel to each other.
#d) Students who take notes in longhand show greater understanding of conceptual questions than students who take notes on laptops. However, there is no apparent difference in understanding between longhand and laptop users when asked to record factual information.


In [4]:

#2
# Correlation is significant because it can lead to several different conclusions about where the relationship lies. For instance, A could cause B, B could cause A, A and B could depend on each other, or a completely different external factor C could cause both A and B, in which case they are correlated through C. Another reason correlation is significant is that it also measures the strength of the relationship between two datasets, whether they are positively or negatively correlated.
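The common-cause scenario above can be simulated directly. This is a hypothetical sketch (the variables A, B, and C are synthetic, not from any dataset in this assignment): a lurking variable C drives both A and B, so they correlate strongly even though neither causes the other.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
C = rng.normal(size=1000)             # the hidden common cause
A = C + 0.5 * rng.normal(size=1000)   # A depends on C, not on B
B = C + 0.5 * rng.normal(size=1000)   # B depends on C, not on A

r, p = pearsonr(A, B)
print(r, p)  # r is strongly positive despite no A->B or B->A link
```

Because both variables share most of their variance with C, the observed correlation is large and highly significant even though A and B never interact.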


In [5]:

#3
# One possible explanation for the correlation is inflation over the 25-year span, which would cause both the professors' salaries and alcohol prices to rise together. This is often the case, as inflation raises wages and salaries but also raises consumer prices, making this explanation very plausible.
# Another possible explanation is that the correlation arises completely by chance, or that some variable besides inflation affects both the salaries and the alcohol prices. Pure chance seems less plausible here given the significant p-value, although testing enough variable pairs will eventually produce spurious correlations.
# A final possible explanation is that the statisticians performed a faulty statistical test. The data are not visualized; we are only told there is a correlation and a significant p-value. The true relationship between salaries and alcohol prices could be non-linear, and a linear correlation test applied to it could still report a high correlation value. This is plausible because without a visual representation of the data, we cannot know for sure until we graph it.
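The chance explanation can also be illustrated with a hypothetical sketch (all data here are synthetic): among many pairs of completely unrelated variables, roughly 5% will show a "significant" correlation at alpha = 0.05 purely by chance.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
false_positives = 0
for _ in range(1000):
    x = rng.normal(size=25)
    y = rng.normal(size=25)        # independent of x by construction
    _, p = pearsonr(x, y)
    if p < 0.05:
        false_positives += 1

print(false_positives)  # roughly 5% of the 1000 pairs, i.e. around 50
```

This is why a single significant p-value between two arbitrary time series (salaries and alcohol prices) is weaker evidence than it first appears.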


In [6]:

#5a
co2 = pd.read_csv("co2.csv")
co2


 | CO2 Concentration | CO2 Uptake Rate |
---|---|---|
0 | 95 | 16.0 |
1 | 95 | 13.6 |
2 | 95 | 16.2 |
3 | 175 | 30.4 |
4 | 175 | 27.3 |
5 | 175 | 32.4 |
6 | 250 | 34.8 |
7 | 250 | 37.1 |
8 | 250 | 40.3 |
9 | 350 | 37.2 |
10 | 350 | 41.8 |
11 | 350 | 42.1 |
12 | 500 | 35.3 |
13 | 500 | 40.6 |
14 | 500 | 42.9 |
15 | 675 | 39.2 |
16 | 675 | 41.4 |
17 | 675 | 43.9 |
18 | 1000 | 39.7 |
19 | 1000 | 44.3 |
20 | 1000 | 45.5 |
21 | 95 | 10.6 |
22 | 95 | 12.0 |
23 | 95 | 11.3 |
24 | 175 | 19.2 |
25 | 175 | 22.0 |
26 | 175 | 19.4 |
27 | 250 | 26.2 |
28 | 250 | 30.6 |
29 | 250 | 25.8 |
30 | 350 | 30.0 |
31 | 350 | 31.8 |
32 | 350 | 27.9 |
33 | 500 | 30.9 |
34 | 500 | 32.4 |
35 | 500 | 28.5 |
36 | 675 | 32.4 |
37 | 675 | 31.1 |
38 | 675 | 28.1 |
39 | 1000 | 35.5 |
40 | 1000 | 31.5 |
41 | 1000 | 27.8 |

In [7]:

sns.jointplot("CO2 Concentration", "CO2 Uptake Rate", data = co2)


<seaborn.axisgrid.JointGrid at 0x7effc972e198>

In [8]:

#5b
# It would be appropriate to use Spearman's correlation coefficient test rather than Pearson's because the jointplot displays a non-linear relationship. The positive trend between CO2 concentration and CO2 uptake rate shows that the two variables are positively correlated, but the relationship is monotonic rather than linear.
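The distinction between the two coefficients can be shown with a hypothetical sketch (synthetic data shaped like a saturating uptake curve, not the co2.csv values): a strictly increasing but curved relationship gives a Spearman coefficient of 1, while the curvature pulls Pearson's coefficient well below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(50, 1000, 40)
y = x / (x + 150.0)            # strictly increasing but non-linear (saturating)

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(r_pearson, r_spearman)   # Spearman is (essentially) 1.0; Pearson is noticeably lower
```

Spearman works on ranks, so any perfectly monotonic relationship scores 1 regardless of its shape, which is exactly the property needed for this saturating uptake pattern.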


In [9]:

#5c
from scipy.stats import spearmanr


In [10]:

spearman_coefficient, spearman_p_value = spearmanr(co2["CO2 Concentration"], co2["CO2 Uptake Rate"])
print(spearman_coefficient)
print(spearman_p_value)

0.6768225136164059
8.605567282157638e-07

In [11]:

#5d
#Null (H0): There is no correlation between CO2 concentration and CO2 uptake rate.
#The sample size of the study is 42, as there are 42 paired observations.


In [12]:

#5e
co2_copy = co2.copy() #copy the data so the original is preserved; the NHST shuffling below destroys any correlation between pairs of values
co2_concentration = list(co2_copy["CO2 Concentration"]) #values from the CO2 Concentration column
co2_uptake_rate = list(co2_copy["CO2 Uptake Rate"]) #values from the CO2 Uptake Rate column
np.random.shuffle(co2_concentration) #shuffles the values in co2_concentration
np.random.shuffle(co2_uptake_rate) #shuffles the values in co2_uptake_rate


In [13]:

#Calculating p-value
sims = 10000
zeros = np.zeros(sims)
for i in range(sims):
    np.random.shuffle(co2_concentration) #re-shuffle each iteration to break the pairing
    zeros[i] = spearmanr(co2_concentration, co2_uptake_rate)[0] #Spearman correlation for this shuffled (null) sample
p_value = np.sum(zeros >= spearman_coefficient)/sims
p_value_inverse = np.sum(zeros <= -spearman_coefficient)/sims
p = sns.distplot(zeros)
p.axvline(spearman_coefficient, color = "green")
p.axvline(-spearman_coefficient, color = 'red')
p.set(title = "Null Distribution", xlabel = "Null Values", ylabel = 'Count')
print(p_value, p_value_inverse)


0.0 0.0

In [14]:

#5f
#Confidence Interval
zeros1 = np.zeros(10000)
for i in range(10000):
    co2_resample = co2.sample(len(co2), replace = True) #resample the original (not the shuffled copy) so the pairing of the data is preserved
    zeros1[i] = spearmanr(co2_resample["CO2 Concentration"], co2_resample["CO2 Uptake Rate"])[0] #Spearman correlation of the bootstrap resample
#Looking for Upper and Lower Bounds
zeros1.sort()
M_lower = zeros1[49]   #0.5th percentile of the bootstrap distribution
M_upper = zeros1[9949] #99.5th percentile of the bootstrap distribution
lower_bound = 2*spearman_coefficient - M_upper
upper_bound = 2*spearman_coefficient - M_lower
q = sns.distplot(zeros1)
q.axvline(lower_bound, color = 'red')
q.axvline(upper_bound, color = 'blue')
q.axvline(spearman_coefficient, color = 'green')
q.set(title = "Confidence Interval", xlabel = "Correlation Values", ylabel = "Count")


[Text(0, 0.5, 'Count'),
Text(0.5, 0, 'Correlation Values'),
Text(0.5, 1.0, 'Confidence Interval')]
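The bounds computed in the cell above follow the reversed-percentile (basic) bootstrap. Writing \(\hat{\theta}\) for the observed Spearman coefficient and \(q^{*}_{\alpha}\) for the \(\alpha\)-quantile of the bootstrap distribution (zeros1[49] and zeros1[9949] approximate the 0.5% and 99.5% quantiles), a sketch of the 99% interval is:

```latex
\left[\; 2\hat{\theta} - q^{*}_{0.995},\;\; 2\hat{\theta} - q^{*}_{0.005} \;\right]
```

Reflecting the bootstrap quantiles around \(2\hat{\theta}\) corrects for bias in the bootstrap distribution, which is why the upper bootstrap quantile produces the lower bound and vice versa.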

In [ ]:

#5i
#With the 99% confidence interval surrounding the observed correlation value, it is safe to conclude that there is a true correlation between CO2 concentration and CO2 uptake rate. With a p-value that is at or near 0, it is also safe to conclude that the correlation between these two variables is significant.


In [18]:

#6a
nenana = pd.read_csv('nenana.csv', sep="\t")
nenana


 | Ice Breakup Day of Year | Years Since 1900 | Year |
---|---|---|---|
0 | 119.479 | 17 | 1917 |
1 | 130.398 | 18 | 1918 |
2 | 122.606 | 19 | 1919 |
3 | 131.448 | 20 | 1920 |
4 | 130.279 | 21 | 1921 |
5 | 131.556 | 22 | 1922 |
6 | 128.083 | 23 | 1923 |
7 | 131.632 | 24 | 1924 |
8 | 126.772 | 25 | 1925 |
9 | 115.669 | 26 | 1926 |
10 | 132.238 | 27 | 1927 |
11 | 126.684 | 28 | 1928 |
12 | 124.653 | 29 | 1929 |
13 | 127.794 | 30 | 1930 |
14 | 129.391 | 31 | 1931 |
15 | 121.427 | 32 | 1932 |
16 | 127.813 | 33 | 1933 |
17 | 119.588 | 34 | 1934 |
18 | 134.564 | 35 | 1935 |
19 | 120.540 | 36 | 1936 |
20 | 131.836 | 37 | 1937 |
21 | 125.843 | 38 | 1938 |
22 | 118.560 | 39 | 1939 |
23 | 110.644 | 40 | 1940 |
24 | 122.076 | 41 | 1941 |
25 | 119.561 | 42 | 1942 |
26 | 117.807 | 43 | 1943 |
27 | 124.589 | 44 | 1944 |
28 | 135.404 | 45 | 1945 |
29 | 124.694 | 46 | 1946 |
... | ... | ... | ... |
73 | 113.722 | 90 | 1990 |
74 | 120.003 | 91 | 1991 |
75 | 134.269 | 92 | 1992 |
76 | 112.542 | 93 | 1993 |
77 | 118.959 | 94 | 1994 |
78 | 115.557 | 95 | 1995 |
79 | 125.522 | 96 | 1996 |
80 | 119.436 | 97 | 1997 |
81 | 109.704 | 98 | 1998 |
82 | 118.908 | 99 | 1999 |
83 | 121.449 | 100 | 2000 |
84 | 127.542 | 101 | 2001 |
85 | 126.894 | 102 | 2002 |
86 | 118.765 | 103 | 2003 |
87 | 114.594 | 104 | 2004 |
88 | 117.501 | 105 | 2005 |
89 | 121.729 | 106 | 2006 |
90 | 116.658 | 107 | 2007 |
91 | 126.953 | 108 | 2008 |
92 | 120.862 | 109 | 2009 |
93 | 118.379 | 110 | 2010 |
94 | 123.685 | 111 | 2011 |
95 | 114.819 | 112 | 2012 |
96 | 139.612 | 113 | 2013 |
97 | 114.658 | 114 | 2014 |
98 | 113.601 | 115 | 2015 |
99 | 112.652 | 116 | 2016 |
100 | 120.500 | 117 | 2017 |
101 | 121.554 | 118 | 2018 |
102 | 104.014 | 119 | 2019 |

103 rows × 3 columns

In [19]:

sns.jointplot("Ice Breakup Day of Year", "Years Since 1900", data = nenana)


<seaborn.axisgrid.JointGrid at 0x7f5d1377c588>

In [20]:

#6b
#It would be appropriate to use Pearson's correlation coefficient test rather than Spearman's because the jointplot, although not distinctly linear, shows a negative, roughly linear relationship.


In [21]:

#6c
from scipy.stats import pearsonr


In [22]:

pearson_coefficient, pearson_p_value = pearsonr(nenana["Ice Breakup Day of Year"], nenana["Years Since 1900"])
print(pearson_coefficient, pearson_p_value)


-0.38614431815900013 5.6129429307579826e-05

In [23]:

#6d
#Null (H0): There is no correlation between the ice breakup day of year and the number of years since 1900.
#The sample size is 103, as there are 103 rows for each of the two variables being compared.


In [24]:

#6e
nenana_copy = nenana.copy()
nenana_days = list(nenana_copy["Ice Breakup Day of Year"])
nenana_years = list(nenana_copy["Years Since 1900"])
np.random.shuffle(nenana_days)
np.random.shuffle(nenana_years)


In [25]:

#Calculating p-value
zeros2 = np.zeros(10000)
for i in range(10000):
    np.random.shuffle(nenana_days) #re-shuffle each iteration to break the pairing
    zeros2[i] = pearsonr(nenana_days, nenana_years)[0] #Pearson correlation for this shuffled (null) sample
p_value = np.sum(zeros2 <= pearson_coefficient)/10000
p_value_inverse = np.sum(zeros2 >= -pearson_coefficient)/10000
p = sns.distplot(zeros2)
p.axvline(pearson_coefficient, color = "green")
p.axvline(-pearson_coefficient, color = 'red')
print(p_value, p_value_inverse)


0.0 0.0

In [27]:

#6f
#Confidence Interval Testing
zeros3 = np.zeros(10000)
for i in range(10000):
    nenana_resample = nenana.sample(len(nenana), replace = True) #resample the original so the pairing of the data is preserved
    zeros3[i] = pearsonr(nenana_resample["Ice Breakup Day of Year"], nenana_resample["Years Since 1900"])[0]
#Looking for Upper and Lower Bounds
zeros3.sort()
M_lower = zeros3[49]
M_upper = zeros3[9949]
lower_bound = 2*pearson_coefficient - M_upper
upper_bound = 2*pearson_coefficient - M_lower
r = sns.distplot(zeros3)
r.axvline(lower_bound, color = 'red')
r.axvline(upper_bound, color = 'blue')
r.axvline(pearson_coefficient, color = 'green')


<matplotlib.lines.Line2D at 0x7f5d130a8e80>

In [ ]:

#6i
#With the observed correlation being negative, as shown by the jointplot in 6a, it is reasonable to conclude that the two variables are negatively correlated. The 99% confidence interval comes from a procedure that captures the true correlation value between the lower and upper bounds 99% of the time. Additionally, with a p-value that is at or near 0, it is also safe to conclude that the correlation value is significant. Finally, based on the results of this NHST/CI, the relationship appears to be linear.


In [28]:

#7a
acid_phosphatase = pd.read_csv("acid-phosphatase-corrected.csv")
acid_phosphatase


 | Temperature | Initial Reaction Rate |
---|---|---|
0 | 298.0 | 0.05 |
1 | 303.0 | 0.07 |
2 | 308.0 | 0.12 |
3 | 313.0 | 0.20 |
4 | 313.0 | 0.18 |
5 | 318.0 | 0.34 |
6 | 323.0 | 0.48 |
7 | 328.0 | 0.79 |
8 | 333.0 | 0.98 |
9 | 335.0 | 1.02 |
10 | 333.5 | 1.04 |
11 | 338.0 | 1.10 |
12 | 343.0 | 0.98 |
13 | 298.0 | 0.04 |
14 | 343.7 | 1.00 |
15 | 353.0 | 0.53 |
16 | 353.0 | 0.58 |
17 | 353.0 | 0.61 |
18 | 338.0 | 1.07 |
19 | 348.0 | 0.74 |
20 | 348.0 | 0.72 |

In [29]:

sns.jointplot("Temperature", "Initial Reaction Rate", data = acid_phosphatase)


<seaborn.axisgrid.JointGrid at 0x7f5d139a0668>

In [30]:

#7b
#Spearman's coefficient test would be appropriate to use, as the jointplot above displays a non-linear relationship, thereby excluding Pearson's as a viable coefficient test.


In [31]:

#7c
from scipy.stats import spearmanr


In [33]:

spearman_correlation1, spearman_correlation_p_value1 = spearmanr(acid_phosphatase["Temperature"], acid_phosphatase["Initial Reaction Rate"])
print(spearman_correlation1, spearman_correlation_p_value1)


0.6037137142130132 0.0037572399881533225

In [34]:

#7d
#Null (H0): There is no correlation between the initial reaction rate and temperature.
#The sample size for this NHST is 21, as there are 21 rows.


In [35]:

#7e
acid_phosphatase_copy = acid_phosphatase.copy()
acid_phosphatase_temp = list(acid_phosphatase_copy["Temperature"])
acid_phosphatase_rate = list(acid_phosphatase_copy["Initial Reaction Rate"])
np.random.shuffle(acid_phosphatase_temp)
np.random.shuffle(acid_phosphatase_rate)


In [36]:

#Calculating p-value
zeros4 = np.zeros(10000)
for i in range(10000):
    np.random.shuffle(acid_phosphatase_temp) #re-shuffle each iteration to break the pairing
    zeros4[i] = spearmanr(acid_phosphatase_temp, acid_phosphatase_rate)[0] #Spearman correlation for this shuffled (null) sample
p_value = np.sum(zeros4 >= spearman_correlation1)/10000
p_value_inverse = np.sum(zeros4 <= -spearman_correlation1)/10000
p = sns.distplot(zeros4)
p.axvline(spearman_correlation1, color = "green")
p.axvline(-spearman_correlation1, color = 'red')
print(p_value, p_value_inverse)


0.0021 0.0027

In [38]:

#Confidence Interval
zeros5 = np.zeros(10000)
for i in range(10000):
    acid_phosphatase_resample = acid_phosphatase.sample(len(acid_phosphatase), replace = True) #resample the original so the pairing of the data is preserved
    zeros5[i] = spearmanr(acid_phosphatase_resample["Temperature"], acid_phosphatase_resample["Initial Reaction Rate"])[0]
#Looking for Upper and Lower Bounds
zeros5.sort()
M_lower = zeros5[49]
M_upper = zeros5[9949]
lower_bound = 2*spearman_correlation1 - M_upper
upper_bound = 2*spearman_correlation1 - M_lower
q = sns.distplot(zeros5)
q.axvline(lower_bound, color = 'red')
q.axvline(upper_bound, color = 'blue')
q.axvline(spearman_correlation1, color = 'green')


<matplotlib.lines.Line2D at 0x7f5d132db8d0>

In [ ]:

#7i
#With both the NHST and the 99% confidence interval conducted, it is safe to conclude that there is a correlation between temperature and the initial reaction rate. The confidence interval procedure captures the true correlation value between the lower and upper bounds 99% of the time. Additionally, the NHST shows that the correlation value is significant, as the p-value is less than the critical alpha value of 0.01.
