Name: Dean Neutel

I collaborated with: Vennis

#1 Paired design and analysis lets each subject serve as their own control: measurements are compared before and after a treatment, so conclusions can be drawn from what changes within each subject over time. Independent samples, by contrast, are taken at only one point in time per subject, so subject-to-subject variability cannot be separated from the treatment effect the way it can with paired data.
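The contrast above can be sketched with made-up cold-duration numbers (all values here are illustrative, not from the assignment's data):

```python
# A minimal sketch contrasting what a paired analysis can see
# versus an independent-samples analysis (made-up numbers).
import numpy as np

before = np.array([5.1, 6.3, 4.8, 7.0, 5.5])  # each subject before treatment
after = np.array([4.2, 5.9, 4.1, 6.1, 5.0])   # the same subjects after

# Paired design: each subject is their own control, so we work with
# per-subject changes and subject-to-subject variation cancels out.
paired_differences = after - before
print(np.mean(paired_differences))   # average change per subject

# Independent design: only the group summaries can be compared, and the
# between-subject spread stays in the comparison.
print(np.mean(after) - np.mean(before))

# The spread of the paired differences is much smaller than the spread
# within either group, which is why paired tests have more power here.
print(np.std(paired_differences), np.std(after))
```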

#2 Attached as another link

#3 We calculate a one-tailed p-value when using an F-like statistic because the statistic is built from absolute deviations from the grand median, so it can only be zero or positive: the most extreme values all lie in one direction. A two-tailed p-value would count both positive and negative extremes, which doesn't make sense for what the F-like statistic is telling us.
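This can be checked directly; the sketch below uses a simplified, stand-alone version of the `F_like_function` from question 9 and resamples it many times under the null:

```python
# Sketch: an F-like statistic built from absolute deviations is never
# negative, so only the upper tail of its resampling distribution matters.
import numpy as np

rng = np.random.default_rng(0)

def f_like(groups):
    grand_median = np.median(np.concatenate(groups))
    # between-group variation: each group median's absolute distance
    # from the grand median, weighted by group size
    between = sum(len(g) * abs(np.median(g) - grand_median) for g in groups)
    # within-group variation: absolute deviations from each group median
    within = sum(np.sum(np.abs(g - np.median(g))) for g in groups)
    return between / within

# Resample three null groups many times: every statistic is >= 0,
# so "more extreme" can only mean "further right".
stats = [f_like([rng.normal(size=10) for _ in range(3)]) for _ in range(1000)]
print(min(stats) >= 0)  # prints True
```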

#4 We use a multiple-testing correction when we conduct several pairwise tests, after a significant omnibus test, to determine which differences between groups are statistically significant. This is important because each additional pairwise test inflates the Type I error rate for the study as a whole, and the correction keeps the overall error rate at the chosen alpha value.
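The inflation is easy to quantify; assuming independent tests each run at an illustrative alpha of 0.05, the chance of at least one false positive grows quickly with the number of tests m:

```python
# Family-wise Type I error rate for m independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 3, 10):
    family_wise = 1 - (1 - alpha) ** m
    print(m, round(family_wise, 3))
# prints:
# 1 0.05
# 3 0.143
# 10 0.401
```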

#5 A Bonferroni correction adjusts the significance cut-off to alpha/m, where m is the number of pairwise tests conducted. This method is very stringent and produces many false negatives. The Benjamini-Hochberg method instead sorts the p-values, finds the largest index k such that the k-th smallest p-value is at most (k/m) * alpha, and declares the comparisons with the k smallest p-values significant; equivalently, the p-values can be corrected and compared directly to alpha. This method finds more significant differences than the Bonferroni method and so lets more of the comparisons in the data be detected.
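A sketch of both procedures on a made-up set of five pairwise p-values (the numbers are illustrative only):

```python
# Bonferroni vs. Benjamini-Hochberg on the same made-up p-values.
import numpy as np

pvalues = np.array([0.001, 0.012, 0.030, 0.045, 0.200])  # illustrative
alpha = 0.05
m = len(pvalues)

# Bonferroni: compare every p-value to alpha / m (here 0.01).
bonferroni_hits = np.sum(pvalues <= alpha / m)

# Benjamini-Hochberg: sort the p-values, find the largest index k with
# p_(k) <= (k/m) * alpha, and call the k smallest p-values significant.
ranked = np.sort(pvalues)
below = ranked <= (np.arange(1, m + 1) / m) * alpha
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0

print(bonferroni_hits, k)  # prints 1 3: BH keeps more comparisons
```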

#6 You conduct an ANOVA of the number of mosquito bites volunteers receive with three different types of insect repellent. There are 30 volunteers in each of three groups (90 total). You calculate a p-value that is greater than the alpha you pre-selected for this study. Since the calculated p-value is greater than the pre-selected alpha, we interpret this to mean that no significant differences in bites were found among the repellents. As a further step, we could apply the Benjamini-Hochberg method to the pairwise comparisons in order to find the index k and correct the p-values; this way we could check whether any of the k smallest p-values indicate a significant difference and be more accurate in the interpretation of the data.

In [36]:

#7a
import pandas as pd
import numpy as np
import seaborn as sns

remedy = [3.5, 2.3, 4.7, 1.5, 3.7]
placebo = [5.3, 3.6, 4.3, 5.7, 6.7]
column = np.column_stack((remedy, placebo))
data_frame = pd.DataFrame(column)
p = sns.swarmplot(data=data_frame)
p.set(xlabel="Treatment Group", ylabel="Cold Duration (Days)");

In [126]:

p = sns.violinplot(data=data_frame)
p.set(xlabel="Treatment Group", ylabel="Cold Duration (Days)");

#7b Null Hypothesis: there is no difference between the medians of these groups. Since the first group doesn't follow a clean bell curve, we use the median to summarize the data. I will use a p-value to compare the two groups, with a rank-based test as my box model. The sample size for this study is 5 per group.

In [127]:

#7c
remedy = [3.5, 2.3, 4.7, 1.5, 3.7]
placebo = [5.3, 3.6, 4.3, 5.7, 6.7]

# Ranks of the pooled observations (1 = smallest value overall)
ranked_remedy = [1, 2, 3, 5, 7]
ranked_placebo = [4, 6, 8, 9, 10]

dobs = np.mean(ranked_placebo) - np.mean(ranked_remedy)
other_limit = -dobs
total = 10000
mixed = np.concatenate([ranked_remedy, ranked_placebo])
new_d = []
for i in range(total):
    # Permute the pooled ranks and split them into two new groups of 5
    new_ranked = np.random.choice(mixed, len(mixed), replace=False)
    new_remedy = new_ranked[0:5]
    new_placebo = new_ranked[5:10]
    new_d.append(np.mean(new_placebo) - np.mean(new_remedy))

p = sns.distplot(new_d, kde=False, bins=20)
p.axvline(dobs, color="red")

# Two-tailed p-value: count resampled differences at least as extreme
count_positive = np.sum(np.array(new_d) >= dobs)
count_negative = np.sum(np.array(new_d) <= other_limit)
pvalue = (count_positive + count_negative) / total
display("P-Value", pvalue)

'P-Value'

0.0551

#7d We can't create a meaningful confidence interval for this rank-based test since the sample size is extremely small and the statistic is based on integer ranks alone.

#7e Since our p-value is larger than our alpha of 0.05, we retain the null hypothesis and conclude that the difference between the groups is not statistically significant.

#7f They should recruit a larger sample, and instead of a rank-based test on independent groups they should use a paired test (each volunteer receiving both treatments) with that larger sample size.

In [7]:

#8a
df = pd.read_csv("https://s3.amazonaws.com/pbreheny-data-sets/cystic-fibrosis.txt", sep='\t')
p = sns.swarmplot(data=df)
p.set(xlabel="Treatment Group", ylabel="Reduction in Lung Function");

#8b Null Hypothesis: there is no difference between the medians of these two groups. I would use the median as the measurement since the plots show the groups don't follow a bell curve, which makes the mean a poor summary. I will use paired testing for the box model, and both a p-value and a confidence interval to compare the groups. The sample size for this study is 14 pairs.

In [8]:

df

| | Drug | Placebo |
|---|---|---|
| 0 | 213 | 224 |
| 1 | 95 | 80 |
| 2 | 33 | 75 |
| 3 | 440 | 541 |
| 4 | -32 | 74 |
| 5 | -28 | 85 |
| 6 | 445 | 293 |
| 7 | -178 | -23 |
| 8 | 367 | 525 |
| 9 | 140 | -38 |
| 10 | 323 | 508 |
| 11 | 10 | 255 |
| 12 | 65 | 525 |
| 13 | 343 | 1023 |

In [123]:

#8c
drug = list(df["Drug"])
placebo = list(df["Placebo"])
drug.sort()
placebo.sort()

difference = np.zeros(len(drug))
for i in range(len(drug)):
    difference[i] = placebo[i] - drug[i]
median_observed = np.median(difference)
other_limit = -median_observed

# Box model: randomly flip the sign of each difference
multiply = [1, -1]
new_median = []
total = 10000
for i in range(total):
    sign = np.random.choice(multiply, len(drug))
    new_difference = difference * sign
    new_median.append(np.median(new_difference))

p = sns.distplot(new_median, kde=False, bins=6)
p.axvline(median_observed, color="red");
p.set(xlabel="Median Difference Between Groups", ylabel="Frequency");

count_positive = np.sum(np.array(new_median) >= median_observed)
count_negative = np.sum(np.array(new_median) <= other_limit)
pvalue = (count_positive + count_negative) / total
display("P-Value", pvalue)

'P-Value'

0.0082

In [10]:

#8d
drug = list(df["Drug"])
placebo = list(df["Placebo"])
drug.sort()
placebo.sort()

total = 10000
difference = np.zeros(len(drug))
for i in range(len(drug)):
    difference[i] = placebo[i] - drug[i]
median_observed = np.median(difference)

assign = [1, -1]
new_median = []
for i in range(total):
    sign = np.random.choice(assign, len(drug))
    new_difference = difference * sign
    new_median.append(np.median(new_difference))

p = sns.distplot(new_median, kde=False, bins=6)
p.axvline(median_observed, color="red");
p.set(xlabel="Median Between Groups", ylabel="Frequency");

# 99% confidence interval via the pivotal method
new_median.sort()
m_lower = new_median[49]
m_upper = new_median[9949]
m_upper_pivotal = 2 * median_observed - m_lower
m_lower_pivotal = 2 * median_observed - m_upper
p.axvline(m_upper_pivotal, color="green");
p.axvline(m_lower_pivotal, color="green");

#8e For this data set, the confidence interval doesn't contain our null value of 0 and the p-value is less than our alpha of 0.05, so there is a statistically significant difference between the medians of these two groups.

In [11]:

#9a
baths = pd.read_csv("contrast-baths.txt", sep='\t')
baths

| | Bath | Bath+Exercise | Exercise |
|---|---|---|---|
| 0 | 5 | 6 | -12 |
| 1 | 10 | 10 | -10 |
| 2 | -4 | 0 | -7 |
| 3 | 11 | 14 | -1 |
| 4 | -3 | 0 | -1 |
| 5 | 13 | 15 | 0 |
| 6 | 0 | 4 | 0 |
| 7 | 2 | 5 | 0 |
| 8 | 10 | 11 | 0 |
| 9 | 6 | 7 | 0 |
| 10 | -1 | 20 | 2 |
| 11 | 8 | 9 | 4 |
| 12 | 10 | 11 | 5 |
| 13 | -9 | 21 | 5 |

In [118]:

p = sns.swarmplot(data=baths)
p.set(xlabel="Treatment Group", ylabel="Change in Hand Volume");

In [117]:

p = sns.violinplot(data=baths)
p.set(xlabel="Treatment Group", ylabel="Change in Hand Volume");

#9b Null Hypothesis: there is no difference among any of these groups. I would use the median as the measurement since the violin plot shows the groups don't follow a bell curve, which makes the mean a poor summary. I will use ANOVA-style omnibus testing to find a p-value across the groups, followed by pairwise comparisons, with the big-box resampling method as my box model. The sample size for this study is 14 per group.

In [13]:

#9c
def F_like_function(data):
    n = len(data)
    # Within-group variation: total absolute deviation from each group median
    variation_within_groups = sum(np.sum(abs(data - data.median())))
    # Between-group variation: each group median's absolute distance
    # from the grand median, weighted by group size
    variation_between_groups = (n * abs(data.median()[0] - np.median(data))
                                + n * abs(data.median()[1] - np.median(data))
                                + n * abs(data.median()[2] - np.median(data)))
    F_like = variation_between_groups / variation_within_groups
    return F_like

In [79]:

total = 10000
differenceMedian = np.median(baths["Bath"]) - np.median(baths["Bath+Exercise"])
other_limit = -differenceMedian
limits = [differenceMedian, other_limit]
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Bath"], baths["Bath+Exercise"]])
for i in range(total):
    Random_Bath = np.random.choice(alldata, len(baths["Bath"]))
    Random_Bath_Exercise = np.random.choice(alldata, len(baths["Bath+Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Bath) - np.median(Random_Bath_Exercise)
pvalue12 = (sum(bath_exercise_difference >= max(limits)) + sum(bath_exercise_difference <= min(limits))) / total
display("P-Value (Bath-Bath+Exercise)", pvalue12)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")
p.axvline(differenceMedian, color="red");

'P-Value (Bath-Bath+Exercise)'

0.2646

In [124]:

total = 10000
differenceMedian = np.median(baths["Bath"]) - np.median(baths["Exercise"])
other_limit = -differenceMedian
limits = [differenceMedian, other_limit]
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Bath"], baths["Exercise"]])
for i in range(total):
    Random_Bath = np.random.choice(alldata, len(baths["Bath"]))
    Random_Exercise = np.random.choice(alldata, len(baths["Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Bath) - np.median(Random_Exercise)
pvalue23 = (sum(bath_exercise_difference >= max(limits)) + sum(bath_exercise_difference <= min(limits))) / total
display("P-Value (Bath-Exercise)", pvalue23)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")
p.axvline(differenceMedian, color="red");

'P-Value (Bath-Exercise)'

0.0552

In [125]:

total = 10000
differenceMedian = np.median(baths["Exercise"]) - np.median(baths["Bath+Exercise"])
other_limit = -differenceMedian
limits = [differenceMedian, other_limit]
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Exercise"], baths["Bath+Exercise"]])
for i in range(total):
    Random_Exercise = np.random.choice(alldata, len(baths["Exercise"]))
    Random_Bath_Exercise = np.random.choice(alldata, len(baths["Bath+Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Exercise) - np.median(Random_Bath_Exercise)
pvalue34 = (sum(bath_exercise_difference >= max(limits)) + sum(bath_exercise_difference <= min(limits))) / total
display("P-Value (Exercise-Bath+Exercise)", pvalue34)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")
p.axvline(differenceMedian, color="red");

'P-Value (Exercise-Bath+Exercise)'

0.0089

In [85]:

import statsmodels.stats.multitest as smm
display("Bath/Bath+Exercise P-Value",
        smm.multipletests(pvalue12, alpha=0.05, method='fdr_bh'),
        "Bath/Exercise P-Value",
        smm.multipletests(pvalue23, alpha=0.05, method='fdr_bh'),
        "Exercise/Bath+Exercise P-Value",
        smm.multipletests(pvalue34, alpha=0.05, method='fdr_bh'))

'Bath/Bath+Exercise P-Value'

(array([False]), array([0.2646]), 0.050000000000000044, 0.05)

'Bath/Exercise P-Value'

(array([False]), array([0.0579]), 0.050000000000000044, 0.05)

'Exercise/Bath+Exercise P-Value'

(array([ True]), array([0.0003]), 0.050000000000000044, 0.05)

In [86]:

#9d
total = 10000
differenceMedian = np.median(baths["Bath"]) - np.median(baths["Bath+Exercise"])
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Bath"], baths["Bath+Exercise"]])
for i in range(total):
    Random_Bath = np.random.choice(alldata, len(baths["Bath"]))
    Random_Bath_Exercise = np.random.choice(alldata, len(baths["Bath+Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Bath) - np.median(Random_Bath_Exercise)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")

# 99% confidence interval via the pivotal method
bath_exercise_difference.sort()
m_lower = bath_exercise_difference[49]
m_upper = bath_exercise_difference[9949]
m_upper_pivotal = 2 * differenceMedian - m_lower
m_lower_pivotal = 2 * differenceMedian - m_upper
p.axvline(differenceMedian, color="red");
p.axvline(m_upper_pivotal, color="green");
p.axvline(m_lower_pivotal, color="green");

In [87]:

total = 10000
differenceMedian = np.median(baths["Bath"]) - np.median(baths["Exercise"])
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Bath"], baths["Exercise"]])
for i in range(total):
    Random_Bath = np.random.choice(alldata, len(baths["Bath"]))
    Random_Exercise = np.random.choice(alldata, len(baths["Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Bath) - np.median(Random_Exercise)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")

# 99% confidence interval via the pivotal method
bath_exercise_difference.sort()
m_lower = bath_exercise_difference[49]
m_upper = bath_exercise_difference[9949]
m_upper_pivotal = 2 * differenceMedian - m_lower
m_lower_pivotal = 2 * differenceMedian - m_upper
p.axvline(differenceMedian, color="red");
p.axvline(m_upper_pivotal, color="green");
p.axvline(m_lower_pivotal, color="green");

In [96]:

total = 10000
differenceMedian = np.median(baths["Exercise"]) - np.median(baths["Bath+Exercise"])
bath_exercise_difference = np.zeros(total)
alldata = np.concatenate([baths["Exercise"], baths["Bath+Exercise"]])
for i in range(total):
    Random_Exercise = np.random.choice(alldata, len(baths["Exercise"]))
    Random_Bath_Exercise = np.random.choice(alldata, len(baths["Bath+Exercise"]))
    bath_exercise_difference[i] = np.median(Random_Exercise) - np.median(Random_Bath_Exercise)
p = sns.distplot(bath_exercise_difference, kde=False)
p.set(xlabel="Difference Between Medians", ylabel="Frequency")

# 99% confidence interval via the pivotal method
bath_exercise_difference.sort()
m_lower = bath_exercise_difference[49]
m_upper = bath_exercise_difference[9949]
m_upper_pivotal = 2 * differenceMedian - m_lower
m_lower_pivotal = 2 * differenceMedian - m_upper
p.axvline(differenceMedian, color="red");
p.axvline(m_upper_pivotal, color="green");
p.axvline(m_lower_pivotal, color="green");

#9e In part c, when we test the p-values between each pair of groups, the only p-value below our alpha of 0.05 is the one between exercise only and bath plus exercise. The confidence intervals from part d tell a consistent story. The interval for bath versus bath plus exercise includes 0, so that difference is not statistically significant. The interval for bath versus exercise does not include 0, but since its p-value is above 0.05 we do not accept that comparison as significant either. That leaves exercise versus bath plus exercise: this comparison is statistically significant, and the median change in hand volume for the exercise-only group is lower than for the bath-plus-exercise group, so we should recommend exercise alone, since it produces the statistically significant reduction in hand volume.

In [ ]: