SharedWeek 6 / hw6-turnin.ipynbOpen in CoCalc

Homework 6

Name: Colette White

I collaborated with:

# 1
#Paired design and analysis allows for data to be compared with a before and after effect, and thus results in conclusions to be drawn from what takes place or changes over time, rather than what can be done with independent samples, which are taken at only one point in time and cannot account for as many variables simultanously being effected as with paired data.
#2
from IPython.display import Image
Image("unnamed.jpg")
#3
#Why do we always calculate a one-tailed p-value when using an F-statistic or F-like statistic?
#We always calculate a one-tailed p-value when using an F-statistic or an F-like statistic, because these statistics calculate the 
#4.
#We use a multiple-testing correction when we are conducting multiple tests comparing two groups to determine which differences between two groups are statistically significant after a significant omnibus test. This is important because conducting multiple tests comparing two groups increases the Type I error rate for the entire study being done, and we need to keep the error rate at the chosen alpha value.
#5
#A Bonferroni correction adjusts the ⍺ cut-off where the adjusted α is α/m , where m is the number of pairwise (two-group) tests conducted. This method often leads to the Type 1 error rate being lower than the original alpha value, which means that most studies will not find any significant differences even if the omnibus test finds a significant difference. Thus, the Benjamini-Hochberg is what we are using in LS 40 instead. In this method we find index k, and find a significant difference for the comparisons with the k smallest p-values, and are also able to correct p-values while comparing them to the alpha values and find if they are significant. This method results in being able to find more significant differences in comparison to the Bonferroni method and allows for more comparison in the data results.
#6
#You conduct an ANOVA of the number of mosquito bites volunteers receive with three different types of insect repellents. There are 30 volunteers in each of the three groups (total of 90). You calculate a p-value that is greater than the alpha you pre-selected for this study. How do you interpret this? What further steps should you take to further investigate this issue?
#If the calculated p-value is greater than the pre-selected alpha for this study, then we interpret this to mean that no significant differences were found. Further steps that we should take to investigate this issue would be performing the Benjamini-Hochberg method, in order to find the k index and correct the alpha value. This way we can find a significant difference using the k smallest p-values and be more accurate in th interpretation of the data.
#7(a)
import pandas as pd
df = pd.DataFrame()
import numpy as np
import matplotlib as mat
import seaborn as sns
%matplotlib inline
granny = pd.read_csv("granny.csv")
granny
Placebo Remedy
0 5.3 3.5
1 3.6 2.3
2 4.3 4.7
3 5.7 1.5
4 6.7 3.7
p=sns.violinplot(data=granny, palette="Paired")
p=sns.stripplot(data=granny, palette="Reds")
#7(b)
# The null hypothesis would be that the grandmother's remedy has no effect on the duration of colds. The appropriate measure to describe the groups would be with a 99% pivotal confidence interval concerning the median values of the data. The measure to compare the groups would be the difference in medians, and a big box model using the ranking method. The sample size for the study would be 5.
#7(c)
#Calculate p-value for NHST using the appropriate method. Make sure you specify which method you’re using from the list in #6 above (or similar for comparing two groups). Include your histogram of simulations results with observed result indicated. Comment on any lines that are different from the standard Big Box method. If you are comparing three or more groups, make sure you determine between which two groups any difference occurs.
#Using the big box model with ranking.
#First I will put all of the values (both the placebo and the remedy) into a big box. Then I will sort those and assign ranks to the values.
placeboranks=[4,6,8,9,10]
remedyranks=[1,2,3,5,7]
pranksmed=np.median(placeboranks)
pranksmed
8.0
rranksmed=np.median(remedyranks)
rranksmed
3.0
dobs=pranksmed-rranksmed
dobs
5.0
population = [1,2,3,4,5,6,7,8,9,10]
zerothing = 10000
effect_difference = np.zeros(zerothing)
other_limit = -dobs
for i in range(10000): 
    random_pla = np.random.choice(population,5)
    random_rem = np.random.choice(population,5)
    difference = np.median(random_pla) - np.median(random_rem)
    effect_difference[i] = difference 
p=sns.distplot(effect_difference, kde=True, bins=65, axlabel="Difference in effect of remedy vs placebo") 
p.set(ylabel="Count")
p.axvline(dobs, color="red");
p.axvline(other_limit, color="red");
p.axvline(np.median(effect_difference), color="orange");
obscount_positive = sum(effect_difference>=dobs) 
obscount_negative = sum(effect_difference<=other_limit)# find values lower than the lower limit
count = obscount_positive + obscount_negative
pvalue = count/10000 #find the p-value
display("Two-tailed p-value",pvalue)
'Two-tailed p-value'
0.0962
#7(d)
effect_difference.sort() #sort the differences in species medians from resampling in order of values
M_lower = effect_difference[49] #find the bottom 0.5%
M_upper = effect_difference[9949] #find the top 0.5%
M_observed = np.median(effect_difference) #find the observed median
M_upperp = 2*M_observed - M_lower #find the upper pivotal median value
M_lowerp = 2*M_observed - M_upper #find the lower pivotal median value
display("M Lower : Red Line", M_lowerp, "M Upper: Green Line", M_upperp) #display these found values
p=sns.distplot(effect_difference, kde=False, bins=25, axlabel="Difference in effect of remedy vs placebo") #create a histogram of the resamples
p.set(ylabel="Count")
p.axvline(M_lowerp, color="red");
p.axvline(M_upperp, color="green");
p.axvline(M_observed, color="purple");
p.axvline(dobs, color="orange");
p.axvline(other_limit, color="orange");
'M Lower : Red Line'
-6.0
'M Upper: Green Line'
7.0
#7(e) 
#Based on the results from steps (c) and (d), the study reuslts in the p-value and the 0 being within the 99% confidence interval. This means that we fail to reject the null hypothesis.
#7(f)
#They should collect a far greater pool of data in their study, in order to have more accurate confidence intervals, and p-values. With more data, there would be no need to use rank based testing in the big box method.
#8
drugs = pd.read_csv("https://s3.amazonaws.com/pbreheny-data-sets/cystic-fibrosis.txt",sep='\t')
drugs
Drug Placebo
0 213 224
1 95 80
2 33 75
3 440 541
4 -32 74
5 -28 85
6 445 293
7 -178 -23
8 367 525
9 140 -38
10 323 508
11 10 255
12 65 525
13 343 1023
#8(a)
p=sns.violinplot(data=drug, palette="Paired")
p=sns.stripplot(data=drug, palette="Reds")
#8(b)
#The null hypothesis for this study would be that there is no difference in the effects of the placebo over time versus the drug over time on the improvement of lung function in patients with cystic fibrosis. The appropriate measure to describe the groups would be the median, and the groups should be compared using 99% pivotal confidence intervals and p values using the paired test. The sample size for the study would be 14.
#8(c)
drug = list(drugs["Drug"])
placebo = list(drugs["Placebo"])
drug.sort
placebo.sort
zerostuff=10000
effectdiff=np.zeros(len(drug))
for i in range(14):
    effectdiff[i] = placebo[i]-drug[i]
obsmedian=np.median(effectdiff)
otherlim=-obsmedian
difffinal=np.zeros(zerostuff)
assign=[1,-1]
new_median=[]
for i in range(10000):
    randomassign=np.random.choice(assign,14)
    nextdiff = effectdiff*randomassign
    new_median.append(np.median(nextdiff))
p=sns.distplot(new_median, kde=False, bins=6)
p.axvline(obsmedian, color="red");
p.set(xlabel="Median Between Groups", ylabel="Frequency");
count_pos = np.sum(new_median>=obsmedian)
count_neg = np.sum(new_median<=otherlim)
count = np.sum(count_pos+count_neg)
pvalue = count/10000
display("P-value", pvalue)
obsmedian
'P-value'
0.1127
109.5
#8(d)
new_median.sort() 
M_lower = new_median[49] #find the bottom 0.5%
M_upper = new_median[9949] #find the top 0.5%
M_observed = np.median(new_median) #find the observed median
M_upperp = 2*M_observed - M_lower #find the upper pivotal median value
M_lowerp = 2*M_observed - M_upper #find the lower pivotal median value
display("M Lower : Red Line", M_lowerp, "M Upper: Green Line", M_upperp) #display these found values
p=sns.distplot(new_median, kde=False, bins=25, axlabel="Difference in effect of drug vs placebo") #create a histogram of the resamples
p.set(ylabel="Count")
p.axvline(M_lowerp, color="red");
p.axvline(M_upperp, color="green");
p.axvline(M_observed, color="purple");
p.axvline(obsmedian, color="orange");
p.axvline(other_limit, color="orange");
'M Lower : Red Line'
-130.0
'M Upper: Green Line'
138.0
#8(e) Since zero is within the confidence interval, we fail to reject the null hypothesis. 
#9
baths = pd.read_csv("contrast-baths.txt",sep='\t')
baths
Bath Bath+Exercise Exercise
0 5 6 -12
1 10 10 -10
2 -4 0 -7
3 11 14 -1
4 -3 0 -1
5 13 15 0
6 0 4 0
7 2 5 0
8 10 11 0
9 6 7 0
10 -1 20 2
11 8 9 4
12 10 11 5
13 -9 21 5
#9(a)
p=sns.violinplot(data=baths, palette="Paired")
p=sns.stripplot(data=baths, palette="Reds")
#9(b)
#The null hypothesis for this study would be that there is no difference between the effects of the treatments to reduce hand swelling. The most appropriate measure to describe the groups would be by median, and the most appropriate measure to compare the groups would be a 99% pivotal confidence interval using multiple box re-sampling, with re-centering. The sample size would be n=14.
#9(c)
#Calculate p-value for NHST using the appropriate method. Make sure you specifywhich method you’re using from the list in #6 above (or similar for comparing two groups). Include your histogram of simulations results with observed result indicated. Comment on any lines that are different from the standard Big Box method. If you are comparing three or more groups, make sure you determine between which two groups any difference occurs.
def Flike(df):
    columnmed=df.median()
    grandmed=np.median(df)
    num=(len(df))*sum(abs(columnmed-grandmed))
    denom=np.sum(np.sum(abs(df-columnmed)))
    return(num/denom)
Fobs=Flike(baths)
Fobs
0.75
Flikesnew=[]
total=10000
B = [5,10,-4,11,-3,13,0,2,10,6,-1,8,10,-9]
BE = [6,10,0,14,0,15,4,5,11,7,20,9,11,21]
E = [-12,-10,-7,-1,-1,0,0,0,0,0,2,4,5,5]
alldata=np.concatenate([B, BE, E])
for i in range(total):
    Brand = np.random.choice(alldata, len(B))
    BErand = np.random.choice(alldata, len(BE))
    Erand = np.random.choice(alldata, len(E))
    column = np.column_stack((Brand,BErand,Erand))
    randomdata = pd.DataFrame(column)
    newFlike = Flike(randomdata)
    Flikesnew.append(newFlike)
pvalue = (sum(Flikesnew>=Fobs))/total
display("P-value",pvalue)
'P-value'
0.0177
B = [5,10,-4,11,-3,13,0,2,10,6,-1,8,10,-9]
BE = [6,10,0,14,0,15,4,5,11,7,20,9,11,21]
E = [-12,-10,-7,-1,-1,0,0,0,0,0,2,4,5,5]
B1=np.abs(B-np.median(B))
BE1=np.abs(BE-np.median(BE))
E1=np.abs(E-np.median(E))
alldata = np.concatenate([B1,BE1,E1])
bath1=np.random.choice(alldata,14)
bathexercise1=np.random.choice(alldata,14)
exercise1=np.random.choice(alldata,14)
q=np.zeros(10000)
for i in range(10000):
    bath1=np.random.choice(alldata,14)
    bathexercise1=np.random.choice(alldata,14)
    exercise1=np.random.choice(alldata,14)
    together1=bath1, bathexercise1, exercise1
    togetherstack=np.column_stack(together1)
    finalstack=pd.DataFrame(togetherstack)
    f=Flike(finalstack)
    q[i]=f
p=sns.distplot(q,bins=30,kde=False)
p.set(xlabel="Flike value", ylabel="Number of times for Flike value or greater")
p.axvline(0.75, color="purple")
<matplotlib.lines.Line2D at 0x7faa744aa048>
q.sort()
M_lower = q[49] #find the bottom 0.5%
M_upper = q[9949] #find the top 0.5%
M_observed = np.median(q) #find the observed median
M_upperp = 2*M_observed - M_lower #find the upper pivotal median value
M_lowerp = 2*M_observed - M_upper #find the lower pivotal median value
display("M Lower : Red Line", M_lowerp, "M Upper: Green Line", M_upperp) #display these found values
p=sns.distplot(q, kde=False, bins=25, axlabel="Difference in effect of drug vs placebo") #create a histogram of the resamples
p.set(ylabel="Count")
p.axvline(M_lowerp, color="red");
p.axvline(M_upperp, color="green");
p.axvline(M_observed, color="purple");
p.axvline(Fobs, color="orange");
p.axvline(-Fobs, color="orange");
'M Lower : Red Line'
-0.33200407262854237
'M Upper: Green Line'
0.5060240963855421
#Since zero is within the confidence interval, we cannot reject the null hypothesis.