Week 6 / hw6-turnin.ipynb
Author: Carlos Gomez-Guzman

# Homework 6

Name: Carlos Gomez

I collaborated with: Jewel Smith

In [ ]:
#1
# We started using np.zeros to store the results of our simulations instead of appending to a list. Preallocating an array gives us one slot per simulation result, and arrays are much more powerful and convenient than lists when analyzing the statistics afterward.
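As a minimal sketch of the difference (the simulation itself is made up for illustration, not one from this homework):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 1000  # number of simulations (hypothetical)

# List approach: grow a Python list one result at a time, then convert
results_list = []
for i in range(B):
    results_list.append(rng.integers(1, 7, size=10).mean())
results_list = np.array(results_list)

# np.zeros approach: preallocate one slot per simulation and fill in place
results = np.zeros(B)
for i in range(B):
    results[i] = rng.integers(1, 7, size=10).mean()

# Either way we end with an array ready for vectorized statistics,
# but the preallocated version avoids growing a list inside the loop
print(results.mean(), np.percentile(results, [2.5, 97.5]))
```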

In [ ]:
# 2

In [1]:
# 3
# The Bonferroni correction is the more conservative of the two because it is more stringent: it controls the number of false positives by dividing the alpha cutoff by the number of pairwise tests conducted. For example, with alpha = 0.05 and six pairwise tests, each comparison must yield a p-value below 0.05/6 ≈ 0.0083 to be called significant. The advantage is strict control of false positives; the disadvantage is that such a small per-test cutoff makes real differences easy to miss.
# The Benjamini-Hochberg correction differs from the Bonferroni correction: the p-values from the pairwise tests are first sorted from smallest to largest, and each p-value is compared to a*i/m, where m is the number of pairwise tests and i is the rank of that p-value. The next step is to find the largest rank k whose p-value is still less than a*k/m; the comparisons at ranks 1 through k are then called significant. In LS 40 we are using the Benjamini-Hochberg method because its advantage is that it corrects the p-values more accurately, retaining more power. Its disadvantage is that it is less stringent than the Bonferroni correction, so it allows somewhat more false positives.
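As a sketch of the steps described above, using the same p-values and alpha = 0.01 as in question 4, both cutoff rules can be written out by hand:

```python
import numpy as np

pvals = np.array([0.001, 0.003, 0.006, 0.0084, 0.013, 0.34])
alpha = 0.01
m = len(pvals)

# Bonferroni: one shared cutoff of alpha/m for every comparison
bonferroni_cutoff = alpha / m

# Benjamini-Hochberg: sort p-values, compare the i-th smallest to alpha*i/m
order = np.argsort(pvals)
sorted_p = pvals[order]
thresholds = alpha * np.arange(1, m + 1) / m

# Find the largest rank k with p_(k) <= alpha*k/m; reject ranks 1..k
below = np.nonzero(sorted_p <= thresholds)[0]
k = below[-1] + 1 if below.size else 0
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True
print(k, reject)  # k = 2: only the two smallest p-values are significant
```

This reproduces the rejection vector that `smm.multipletests(..., method='fdr_bh')` returns for these p-values in question 4.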

In [1]:
# 4
# There is a significant difference between treatments 1 and 3 and also a significant difference between treatments 2 and 3.

# Benjamini-Hochberg Analysis
pvals = [0.001, 0.003, 0.006, 0.0084, 0.013, 0.34]
critical_alpha=0.01
import statsmodels.stats.multitest as smm
smm.multipletests(pvals, alpha=critical_alpha, method='fdr_bh')

(array([ True, True, False, False, False, False]), array([0.006 , 0.009 , 0.012 , 0.0126, 0.0156, 0.34 ]), 0.0016736538523104416, 0.0016666666666666668)
In [2]:
# 5
# You would interpret this by stating that the study failed the omnibus test because the p-value was greater than the pre-selected ⍺ cutoff. An appropriate next step to further investigate this issue would be to run more simulations of the study to see whether this particular study was an outlier. We could also increase the sample size of the study and use Python to run more simulations.
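The "larger sample size, more simulations" idea can be sketched with hypothetical numbers (the populations, effect size, and counts below are made up for illustration, not taken from any study here):

```python
import numpy as np

rng = np.random.default_rng(1)

def run_study(n):
    """Simulate one hypothetical two-group study with n subjects per group
    and return the observed difference in medians."""
    group1 = rng.normal(0.0, 1.0, n)   # made-up control population
    group2 = rng.normal(0.5, 1.0, n)   # made-up treated population (true shift 0.5)
    return np.median(group2) - np.median(group1)

# Repeating the study at larger sample sizes: the estimated effect settles
# near the true shift and its spread shrinks
for n in (10, 100, 1000):
    diffs = np.array([run_study(n) for _ in range(200)])
    print(n, round(diffs.mean(), 2), round(diffs.std(), 2))
```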

In [83]:
import pandas as pd
import numpy as np
import seaborn as sns

In [84]:
# 6
ct

Bath Bath+Exercise Exercise
0 5 6 -12
1 10 10 -10
2 -4 0 -7
3 11 14 -1
4 -3 0 -1
5 13 15 0
6 0 4 0
7 2 5 0
8 10 11 0
9 6 7 0
10 -1 20 2
11 8 9 4
12 10 11 5
13 -9 21 5
In [85]:
# 1)
ct=pd.read_csv("contrast-baths.csv") #assigns the data to ct
# Made a csv file on my Week 6 folder because txt file was not working from CCLE
p=sns.violinplot(data=ct) #A) plots the data using a violin plot
p.set(ylabel="Frequency") #labels the y-axis of the graph

[Text(0, 0.5, 'Frequency')]
In [86]:
p=sns.swarmplot(data=ct) #plots the same data using a swarm plot
p.set(ylabel="Frequency")

[Text(0, 0.5, 'Frequency')]
In [87]:
# 2)
# The Null hypothesis of this study is that there is no difference in hand volume or that the reduction in hand swelling is the same with any treatment.
# The appropriate measure to describe the groups is the median because there is overlap between all three groups.
# The appropriate measure to compare the groups is the difference in medians between the groups.
# The box model of the study would be Big Box, which is where we would resample all of our data because the null hypothesis states that there is no difference between the three groups.
# The sample size of the data is 42 because the sample size of every group (Bath, Bath+Exercise, and Exercise) is 14, and 14 multiplied by 3 equals 42.

In [88]:
samplesize = len(ct) *3
samplesize

42
In [89]:
# 3)
def computeFlike(ct): #defines the F-like function with the data as input
    medians = ct.median() #finds the median of each group
    grandMedian = np.median(ct) #calculates the grand median: the median of all the data pooled together
    var_bt = np.sum(np.abs(medians-grandMedian)*len(ct)) #variation among groups: absolute difference between each group median and the grand median, weighted by group size
    # print(medians)
    # print(var_bt)
    sumabsdev = np.sum(np.abs(ct - medians)) #variation within groups: absolute difference between each data point and its group median
    var_within = np.sum(sumabsdev)
    Flike = var_bt/var_within #ratio of among-group to within-group variation
    return Flike #returns the F-like statistic

obsFlike = computeFlike(ct)
print(obsFlike)

0.75
In [96]:
alldata = np.concatenate([ct["Bath"], ct["Bath+Exercise"], ct["Exercise"]]) #pools all the data under one name

B = 10000
n = len(ct)
sims = np.zeros(B) #preallocates an array to hold the B simulated F-like values
for i in range(B):
    group1 = np.random.choice(alldata,n)
    group2 = np.random.choice(alldata,n)
    group3 = np.random.choice(alldata,n)
    sim = pd.DataFrame(np.column_stack((group1,group2,group3)))
    sims[i] = computeFlike(sim)
p=sns.distplot(sims,kde = True)
p.set(ylabel="Frequency",xlabel="F-like statistic value")
p.axvline(obsFlike, color="orange")

<matplotlib.lines.Line2D at 0x7fe4d54ce3c8>
In [97]:
pvalue = sum(sims>=obsFlike)/B
print(pvalue)

obsFlike

0.0181
0.75
In [101]:
def BigBox(group1, group2):
    group1 = np.array(group1)
    group2 = np.array(group2)
    n1 = len(group1)
    n2 = len(group2)
    differenceMedian = np.median(group2) - np.median(group1) # Calculate effect size (difference in medians) for observed sample
    population = np.concatenate([group1, group2]) # Make Big Box model
    sims = [] # Using append
    for i in range(10000):
        sim_group1 = np.random.choice(population,n1)
        sim_group2 = np.random.choice(population,n2)
        sims.append(np.median(sim_group2)-np.median(sim_group1)) # Calculate difference of medians for each pseudosample
    sims = np.array(sims) # Convert to an array for the comparisons below
    # Define positive and negative versions of observed effect size
    posLimit = abs(differenceMedian)
    negLimit = -1*posLimit
    # Make a histogram of simulations under the null hypothesis with the observed effect size and its opposite
    p = sns.distplot(sims,kde=False)
    p.set(xlabel="Difference in Medians", ylabel="Count", title="Null Distribution with Observed Difference in Median")
    p.axvline(posLimit, color="red")
    p.axvline(negLimit, color="red")
    # Calculate & report p-value
    differenceMedian = np.round(differenceMedian,3)
    extreme = sum(sims >= posLimit) + sum(sims <= negLimit)
    pvalue = extreme/10000
    print("The observed difference in medians is",differenceMedian)
    print("The p-value is",pvalue)

In [102]:
BigBox(ct["Bath"],ct["Bath+Exercise"])

The observed difference in medians is 4.0 The p-value is 0.2621
In [103]:
BigBox(ct["Bath+Exercise"],ct["Exercise"])

The observed difference in medians is -9.5 The p-value is 0.0096
In [104]:
BigBox(ct["Bath"],ct["Exercise"])

The observed difference in medians is -5.5 The p-value is 0.056
In [105]:
pvals = [0.0096, 0.056, 0.2621]
critical_alpha=0.01
import statsmodels.stats.multitest as smm
smm.multipletests(pvals, alpha=0.05, method='fdr_bh')

(array([ True, False, False]), array([0.0288, 0.084 , 0.2621]), 0.016952427508441503, 0.016666666666666666)
In [118]:
# 4)
AvsBobs=np.median(ct["Bath"])-np.median(ct["Bath+Exercise"])
AvsCobs=np.median(ct["Bath"])-np.median(ct["Exercise"])
BvsCobs=np.median(ct["Bath+Exercise"])-np.median(ct["Exercise"])

In [119]:
b=10000
arrAvsB=np.zeros(b)
arrAvsC=np.zeros(b)
arrBvsC=np.zeros(b)
for i in range(b):
    A=np.random.choice(alldata, len(ct["Bath"]))
    B=np.random.choice(alldata, len(ct["Bath+Exercise"]))
    C=np.random.choice(alldata, len(ct["Exercise"]))
    arrAvsB[i]=np.median(A)-np.median(B)
    arrAvsC[i]=np.median(A)-np.median(C)
    arrBvsC[i]=np.median(B)-np.median(C)

In [125]:
arrAvsB.sort()
ablower=arrAvsB[49]
abupper=arrAvsB[9949]
ABlower=2*AvsBobs - abupper
ABupper=2*AvsBobs - ablower
AB=sns.distplot(arrAvsB, kde=False)
AB.set(xlabel="Difference in medians in Bath and Bath+Exercise", ylabel="Frequency")
AB.axvline(ABlower, color="orange")
AB.axvline(ABupper, color="orange")

<matplotlib.lines.Line2D at 0x7fe4d53d96d8>
In [57]:
arrAvsC.sort()
aclower=arrAvsC[49]
acupper=arrAvsC[9949]
AClower=2*AvsCobs - acupper
ACupper=2*AvsCobs - aclower
AC=sns.distplot(arrAvsC, kde=False)
AC.set(xlabel="Difference in medians in group Bath and Exercise", ylabel="Frequency")
AC.axvline(AClower, color="purple")
AC.axvline(ACupper, color="purple")

<matplotlib.lines.Line2D at 0x7f4e541a3710>
In [58]:
arrBvsC.sort()
bclower=arrBvsC[49]
bcupper=arrBvsC[9949]
BClower=2*BvsCobs - bcupper
BCupper=2*BvsCobs - bclower
BC=sns.distplot(arrBvsC, kde=False)
BC.set(xlabel="Difference in medians in group Bath+Exercise and Exercise", ylabel="Frequency")
BC.axvline(BClower, color="gold")
BC.axvline(BCupper, color="gold")

<matplotlib.lines.Line2D at 0x7f4e53fdfe10>
In [ ]:
# 5)
# After doing the F-like statistic test for the three groups that were tested to treat carpal tunnel, I calculated an F-like statistic of 0.75. This value is the ratio of among-group variability to within-group variability for the three groups in the study. The p-value I calculated is 0.0181, which is less than the cutoff alpha value of 0.05. Therefore, we can reject the null hypothesis and conclude that there is a difference in hand volume: the reduction in hand swelling differs between the treatments of bath, exercise, or both. I also did a post hoc analysis using the Benjamini-Hochberg correction. This allowed me to compare the different methods and determine which treatments produced different results. Through this analysis, I came to the conclusion that we can reject the null hypothesis for Bath+Exercise vs. Exercise, because the Benjamini-Hochberg correction returned "True" for that comparison; the other comparisons returned "False". For the 4th part of this problem, I calculated 99% confidence intervals for each comparison, which we can interpret as follows: if zero is not within the range, we can reject the null hypothesis; if zero is within the range, we fail to reject the null hypothesis.
# The treatment that I would recommend to reduce hand volume the most would be Exercise, because it is the treatment in the one comparison that rejected the null hypothesis.
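The confidence-interval rule from part 4 can be written as a small helper; the interval endpoints below are placeholders for illustration, not the values computed above:

```python
def fails_to_reject(lower, upper):
    """Return True if the confidence interval [lower, upper] contains 0,
    meaning we fail to reject the null hypothesis at this confidence level."""
    return lower <= 0 <= upper

# Placeholder intervals for illustration only
print(fails_to_reject(-2.0, 6.0))    # zero inside the interval: fail to reject
print(fails_to_reject(-14.0, -3.0))  # zero outside the interval: reject
```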

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns

In [26]:
# 7
sub_use

Alcohol MDMA THC NoDrug
0 38.956475 19.008920 49.805912 55.895314
1 33.921949 20.087996 42.051608 38.546327
2 49.762751 20.767624 34.573858 21.917550
3 32.641102 21.794142 14.817698 356.965032
4 43.977121 22.498652 32.071967 55.032420
5 47.385251 22.552377 27.385088 58.719569
6 26.237367 24.883327 30.431792 25.942951
7 28.908599 26.303733 48.494563 34.508789
8 29.914909 27.363866 54.329265 29.655055
9 38.306418 27.535141 27.857918 21.268192
10 44.109048 28.130536 65.664941 34.413163
11 37.693984 28.235663 28.921727 28.588107
12 68.339650 29.149041 28.687429 33.241967
13 17.885190 31.116597 32.964899 47.492813
14 63.031868 35.683037 39.481022 43.251451
15 27.700828 56.611745 37.055159 25.937962
In [27]:
sub_use=pd.read_csv("tailgating.csv") #assigns the data to sub_use
# Made a csv file on my Week 6 folder because txt file was not working from CCLE
p=sns.violinplot(data=sub_use) #A) plots the data using a violin plot
p.set(ylabel="Frequency") #labels the y-axis of the graph

[Text(0, 0.5, 'Frequency')]
In [28]:
p=sns.swarmplot(data=sub_use) #plots the same data using a swarm plot
p.set(ylabel="Frequency")

[Text(0, 0.5, 'Frequency')]
In [29]:
# 2
# The Null hypothesis of this study is that there is no difference between the use of each substance or no substance in correlation to the distance traveled behind a car, or tailgating. In other words, the use of a substance or no substance does not lead to risky driving.
# The appropriate measure to describe the groups is the median because there is overlap between all four groups.
# The appropriate measure to compare the groups is the difference in medians between the groups.
# The box model of the study would be Big Box, which is where we would resample all of our data because the null hypothesis states that there is no difference between the four groups.
# The sample size of the data is 64 because the sample size of every group (Alcohol, MDMA, THC, and NoDrug) is 16, and 16 multiplied by 4 equals 64.

In [34]:
samplesize = len(sub_use) * 4
samplesize

64
In [35]:
# 3)
def computeFlike(sub_use): #defines the F-like function with the data as input
    medians = sub_use.median() #finds the median of each group
    grandMedian = np.median(sub_use) #calculates the grand median: the median of all the data pooled together
    var_bt = np.sum(np.abs(medians-grandMedian)*len(sub_use)) #variation among groups: absolute difference between each group median and the grand median, weighted by group size
    # print(medians)
    # print(var_bt)
    sumabsdev = np.sum(np.abs(sub_use - medians)) #variation within groups: absolute difference between each data point and its group median
    var_within = np.sum(sumabsdev)
    Flike = var_bt/var_within #ratio of among-group to within-group variation
    return Flike #returns the F-like statistic

obsFlike = computeFlike(sub_use)
print(obsFlike)

0.27234311626577623
In [36]:
alldata = np.concatenate([sub_use["Alcohol"], sub_use["MDMA"], sub_use["THC"], sub_use["NoDrug"]]) #pools all the data under one name

B = 10000
n = len(sub_use)
sims = np.zeros(B) #preallocates an array to hold the B simulated F-like values
for i in range(B):
    group1 = np.random.choice(alldata,n)
    group2 = np.random.choice(alldata,n)
    group3 = np.random.choice(alldata,n)
    group4 = np.random.choice(alldata,n)
    sim = pd.DataFrame(np.column_stack((group1,group2,group3,group4)))
    sims[i] = computeFlike(sim)
p=sns.distplot(sims,kde = True)
p.set(ylabel="Frequency",xlabel="F-like statistic value")
p.axvline(obsFlike, color="orange")

<matplotlib.lines.Line2D at 0x7f4e54762be0>
In [37]:
pvalue = sum(sims>=obsFlike)/B
print(pvalue)

obsFlike

0.2045
0.27234311626577623
In [38]:
def BigBox(group1, group2):
    group1 = np.array(group1)
    group2 = np.array(group2)
    n1 = len(group1)
    n2 = len(group2)
    differenceMedian = np.median(group2) - np.median(group1) # Calculate effect size (difference in medians) for observed sample
    population = np.concatenate([group1, group2]) # Make Big Box model
    sims = [] # Using append
    for i in range(10000):
        sim_group1 = np.random.choice(population,n1)
        sim_group2 = np.random.choice(population,n2)
        sims.append(np.median(sim_group2)-np.median(sim_group1)) # Calculate difference of medians for each pseudosample
    sims = np.array(sims) # Convert to an array for the comparisons below
    # Define positive and negative versions of observed effect size
    posLimit = abs(differenceMedian)
    negLimit = -1*posLimit
    # Make a histogram of simulations under the null hypothesis with the observed effect size and its opposite
    p = sns.distplot(sims,kde=False)
    p.set(xlabel="Difference in Medians", ylabel="Count", title="Null Distribution with Observed Difference in Median")
    p.axvline(posLimit, color="red")
    p.axvline(negLimit, color="red")
    # Calculate & report p-value
    differenceMedian = np.round(differenceMedian,3)
    extreme = sum(sims >= posLimit) + sum(sims <= negLimit)
    pvalue = extreme/10000
    print("The observed difference in medians is",differenceMedian)
    print("The p-value is",pvalue)

In [39]:
BigBox(sub_use["Alcohol"],sub_use["MDMA"])

The observed difference in medians is -11.166 The p-value is 0.0167
In [40]:
BigBox(sub_use["Alcohol"],sub_use["THC"])

The observed difference in medians is -4.231 The p-value is 0.4374
In [41]:
BigBox(sub_use["Alcohol"],sub_use["NoDrug"])

The observed difference in medians is -3.539 The p-value is 0.5586
In [42]:
BigBox(sub_use["MDMA"],sub_use["THC"])

The observed difference in medians is 6.936 The p-value is 0.049
In [43]:
BigBox(sub_use["MDMA"],sub_use["NoDrug"])

The observed difference in medians is 7.627 The p-value is 0.0629
In [44]:
BigBox(sub_use["THC"],sub_use["NoDrug"])

The observed difference in medians is 0.692 The p-value is 0.8811
In [70]:
pvals = [0.0167, 0.049, 0.0629, 0.4374, 0.5586, 0.8811]
critical_alpha=0.01
import statsmodels.stats.multitest as smm
smm.multipletests(pvals, alpha=0.05, method='fdr_bh')

(array([False, False, False, False, False, False]), array([0.1002 , 0.1258 , 0.1258 , 0.6561 , 0.67032, 0.8811 ]), 0.008512444610847103, 0.008333333333333333)
In [48]:
# 4)
AvsBobs=np.median(sub_use["Alcohol"])-np.median(sub_use["MDMA"])
AvsCobs=np.median(sub_use["Alcohol"])-np.median(sub_use["THC"])
AvsDobs=np.median(sub_use["Alcohol"])-np.median(sub_use["NoDrug"])
BvsCobs=np.median(sub_use["MDMA"])-np.median(sub_use["THC"])
BvsDobs=np.median(sub_use["MDMA"])-np.median(sub_use["NoDrug"])
CvsDobs=np.median(sub_use["THC"])-np.median(sub_use["NoDrug"])

In [49]:
b=10000
arrAvsB=np.zeros(b)
arrAvsC=np.zeros(b)
arrAvsD=np.zeros(b)
arrBvsC=np.zeros(b)
arrBvsD=np.zeros(b)
arrCvsD=np.zeros(b)
for i in range(b):
    A=np.random.choice(alldata, len(sub_use["Alcohol"]))
    B=np.random.choice(alldata, len(sub_use["MDMA"]))
    C=np.random.choice(alldata, len(sub_use["THC"]))
    D=np.random.choice(alldata, len(sub_use["NoDrug"]))
    arrAvsB[i]=np.median(A)-np.median(B)
    arrAvsC[i]=np.median(A)-np.median(C)
    arrAvsD[i]=np.median(A)-np.median(D)
    arrBvsC[i]=np.median(B)-np.median(C)
    arrBvsD[i]=np.median(B)-np.median(D)
    arrCvsD[i]=np.median(C)-np.median(D)

In [64]:
arrAvsB.sort()
ablower=arrAvsB[49]
abupper=arrAvsB[9949]
ABlower=2*AvsBobs - abupper
ABupper=2*AvsBobs - ablower
AB=sns.distplot(arrAvsB, kde=False)
AB.set(xlabel="Difference in medians in Alcohol and MDMA", ylabel="Frequency")
AB.axvline(ABlower, color="orange")
AB.axvline(ABupper, color="orange")

<matplotlib.lines.Line2D at 0x7f4e53d4b208>
In [65]:
arrAvsC.sort()
aclower=arrAvsC[49]
acupper=arrAvsC[9949]
AClower=2*AvsCobs - acupper
ACupper=2*AvsCobs - aclower
AC=sns.distplot(arrAvsC, kde=False)
AC.set(xlabel="Difference in medians in group Alcohol and THC", ylabel="Frequency")
AC.axvline(AClower, color="purple")
AC.axvline(ACupper, color="purple")

<matplotlib.lines.Line2D at 0x7f4e53b98c50>
In [66]:
arrAvsD.sort()
adlower=arrAvsD[49]
adupper=arrAvsD[9949]
ADlower=2*AvsDobs - adupper
ADupper=2*AvsDobs - adlower
AD=sns.distplot(arrAvsD, kde=False)
AD.set(xlabel="Difference in medians in group Alcohol and NoDrug", ylabel="Frequency")
AD.axvline(ADlower, color="blue")
AD.axvline(ADupper, color="blue")

<matplotlib.lines.Line2D at 0x7f4e53aa5828>
In [67]:
arrBvsC.sort()
bclower=arrBvsC[49]
bcupper=arrBvsC[9949]
BClower=2*BvsCobs - bcupper
BCupper=2*BvsCobs - bclower
BC=sns.distplot(arrBvsC, kde=False)
BC.set(xlabel="Difference in medians in group MDMA and THC", ylabel="Frequency")
BC.axvline(BClower, color="black")
BC.axvline(BCupper, color="black")

<matplotlib.lines.Line2D at 0x7f4e53aecba8>
In [68]:
arrBvsD.sort()
bdlower=arrBvsD[49]
bdupper=arrBvsD[9949]
BDlower=2*BvsDobs - bdupper
BDupper=2*BvsDobs - bdlower
BD=sns.distplot(arrBvsD, kde=False)
BD.set(xlabel="Difference in medians in group MDMA and NoDrug", ylabel="Frequency")
BD.axvline(BDlower, color="green")
BD.axvline(BDupper, color="green")

<matplotlib.lines.Line2D at 0x7f4e539bbb38>
In [69]:
arrCvsD.sort()
cdlower=arrCvsD[49]
cdupper=arrCvsD[9949]
CDlower=2*CvsDobs - cdupper
CDupper=2*CvsDobs - cdlower
CD=sns.distplot(arrCvsD, kde=False)
CD.set(xlabel="Difference in medians in group THC and NoDrug", ylabel="Frequency")
CD.axvline(CDlower, color="red")
CD.axvline(CDupper, color="red")

<matplotlib.lines.Line2D at 0x7f4e5377ed30>
In [ ]:
# 5)
# After doing the F-like statistic test for the four substance groups in the tailgating study, I calculated an F-like statistic of 0.27. This value is the ratio of among-group variability to within-group variability for the four groups in the study. The p-value I calculated is 0.2045, which is greater than the cutoff alpha value of 0.05. Therefore, we fail to reject the null hypothesis and conclude that there may be no difference between the use of the different substances (alcohol, MDMA, THC, or no drug) with respect to risky driving such as tailgating. I also did a post hoc analysis using the Benjamini-Hochberg correction. This allowed me to compare the different substances and determine which ones produced different results. Through this analysis, I came to the conclusion that we fail to reject the null hypothesis for all of the substances studied, because the Benjamini-Hochberg correction returned "False" for every p-value tested. For the 4th part of this problem, I calculated 99% confidence intervals for each comparison, which we can interpret as follows: if zero is not within the range, we can reject the null hypothesis; if zero is within the range, we fail to reject the null hypothesis. Only one confidence interval does not contain zero, the Alcohol vs. MDMA interval; all the others do contain zero. This one interval is not enough evidence to reject the null hypothesis.
# According to this dataset, the type of drug routinely used does not affect tailgating, because the tests we ran mainly show that we fail to reject the null hypothesis that there is no difference between the substances. This means the substances may have a similar effect on risky driving such as tailgating. The distribution of the data also shows that the groups are not very different in variability.