CoCalc Shared FilesWeek 6 / hw6-turnin.ipynbOpen in CoCalc with one click!
Author: Jose Haro-Aldana
Views : 98

Homework 6

Name:

I collaborated with:

In [2]:
import pandas as pd import numpy as np import seaborn as sns
In [21]:
#1 # It is suggested to use the np.zeros input rather than .append() because np.zeros creates an array of placeholder zeros than can later be appended; although .append() performs the same function as np.zeros, .append() has to create the array of values in which the subsequent value of the empty list is dependednt upon the previous value that was appended. However with np.zeros, it is as if the computer copy and pasted the simulated values onto the np.zeros array, making it slightly more accurate to the values of the simulation than np.zeros.

#2

In [23]:
#3 #The Benjamini - Hochberg correction is a more conservative correction method than the Bonferroni method. #The Bonferroni correction method finds the adjusted p-value by dividing the NHST p-value by the number of comparisons made. This correction method is used to reduce the number of possible false postives to a minimum, however increasing the number of comparisons--although it will lower the false positive probability-- will raise the probability of obtaining a false negative, a great disadvantage. #The Benjamini - Hochberg is less sensitive in comparison to the Bonferroni in which the Benjamini - Hochberg method can calculate the adjusted alpha value with the number of comparisons while taking into consideration the changes of the alpha value as the number of comparisons increase. The Benjamini - Hochberg method is less sensitive to the changes of alpha valeues to comparisons compared to the Bonferroni method.
In [4]:
#4 p_values= np.sort([0.34 , 0.001, 0.0084 , 0.003 ,0.013 , 0.006]) B_H_method = (p_values*0.01)/6 np.sum(B_H_method <= 0.01) #Using the Benjamini - Hochberg correction method, it is determined that all six p_values in p_values contain a statistical difference.
6
In [25]:
#5 # The F-like statistic assists by determining whether two groups are significant before any NHST can be done, this is done to save time and to provide assurance that a statistical significance is present between two groups as to not waste time in performing a NHST between two groups out of a four-group sample only to find that they are not statistically significant. Once the F-like statistic determines a significance, use NHST to determine the p-value of the null hypothesis.
In [34]:
#6 contrast_bath = pd.read_csv("contrast-baths.txt",sep='\t') contrast_bath
Bath Bath+Exercise Exercise
0 5 6 -12
1 10 10 -10
2 -4 0 -7
3 11 14 -1
4 -3 0 -1
5 13 15 0
6 0 4 0
7 2 5 0
8 10 11 0
9 6 7 0
10 -1 20 2
11 8 9 4
12 10 11 5
13 -9 21 5
In [35]:
#6.1 sns.swarmplot(data = contrast_bath, palette = "Greens") sns.violinplot(data = contrast_bath, palette = "Reds") group_median = contrast_bath.median() length = len(contrast_bath) group_median, length
(Bath 5.5 Bath+Exercise 9.5 Exercise 0.0 dtype: float64, 14)
In [44]:
def MADAM(data): return np.median(np.abs(data - np.median(data)))
In [46]:
#MADAM of Bath Group MADAM([5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9])
5.0
In [47]:
#MADAM of Bath + Exercise Group MADAM([6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21])
4.5
In [49]:
#MADAM of Exercise Group MADAM([-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5])
1.5
In [39]:
#6.2 #It is shown above that the MADAM values of both the Bath and the Bath + Exercise are close to each other, however the MADAM value of the Exercise group strongly differs from that of the other two. Hence, a two box model with re-centering is needed for NHST. The null hypothesis, in this case, assumes that because the MADAM values are different, the "Bath" and "Bath + Exercise" groups derive from the same population because of their identical MADAM values, but the large difference between the Bath to the Exercise groups and the Bath + Exercise and the Exercise groups can account that these groups, together, do not derive from the same population but will have the same median distribution regardless. Hence a one box method will be first used to note the p-value between the Bath and Bath + Exercise groups. Then a two-box method is used to compare between the Bath & Bath + Exercise groups with the Exercise group. #The median will be used as the measure of central tendency, while the MADAM will be used as the comparator. The sample size of this study will be 14, as each group contains 14 values, and will be seperated from each other.
In [82]:
#6.3 #Big Box Method bath_group = [5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9] np_bath = np.array(bath_group) bath_exercise_group = [6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21] np_bath_exercise = np.array(bath_exercise_group) bath_len = len(np_bath) bath_exercise_len = len(np_bath_exercise) M_obs = np.abs(np.median(np_bath) - np.median(np_bath_exercise)) population = np.concatenate([np_bath , np_bath_exercise]) pop_len = len(population) sims = [] for i in range(10000): psuedo_bath = np.random.choice(population, pop_len) psuedo_bath_exercise = np.random.choice(population, pop_len) sims.append(np.median(psuedo_bath) - np.median(psuedo_bath_exercise)) Neg_M_obs = -1*M_obs p_value_calc = sum(sims >= M_obs) + sum(sims<=Neg_M_obs) p_value = p_value_calc/10000 print("p-value is" ,p_value) p = sns.distplot(sims, kde = False) p.set(xlabel = 'Difference in Medians') p.set(ylabel = 'Count') p.axvline(M_obs, color = 'purple') p.axvline(Neg_M_obs, color='red')
p-value is 0.1561
<matplotlib.lines.Line2D at 0x7f56b2cb8be0>
In [93]:
# One-Box Confidence Interval bath_group = [5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9] np_bath = np.array(bath_group) bath_exercise_group = [6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21] np_bath_exercise = np.array(bath_exercise_group) bath_len = len(np_bath) bath_exercise_len = len(np_bath_exercise) M_obs = (np.median(np_bath) - np.median(np_bath_exercise)) CI_sims = [] for i in range(10000): sim_bath = np.random.choice(np_bath, bath_len) sim_bath_exercise = np.random.choice(np_bath_exercise, bath_exercise_len) sim_diff = (np.median(sim_bath) - np.median(sim_bath_exercise)) CI_sims.append(sim_diff) CI_sims.sort() Mlower = sims[49] Mupper = sims[9949] lower_bound = (2*M_obs - Mupper) upper_bound = (2*M_obs - Mlower) k = sns.distplot(CI_sims, kde = False) k.axvline(lower_bound, color = 'blue') k.axvline(upper_bound, color = 'green') k.axvline(M_obs, color = 'red') print('The Confidence Interval sits between', (lower_bound, upper_bound)) print('M_obs is', M_obs)
The Confidence Interval sits between (-3.5, -10.0) M_obs is -4.0
In [75]:
#Two Box with Recentering exercise_group = [-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5] np_exercise = np.array(exercise_group) exercise_len = len(np_exercise) pop_len = len(population) population_median = np.median(population) exercise_median = np.median(np_exercise) M_obs = np.abs(population_median - exercise_median) #Re-centering: re_data_pop = (population - population_median) re_data_exercise = (np_exercise - exercise_median) sim = [] for i in range(10000): psuedo_pop = np.random.choice(re_data_pop, pop_len) psuedo_exercise = np.random.choice(re_data_exercise, exercise_len) sim.append(np.median(psuedo_pop) - np.median(psuedo_exercise)) q = sns.distplot(sim, kde = False) q.set(xlabel = 'Difference in Medians') q.set(ylabel = 'Count') q.axvline(M_obs, color = 'red') q.axvline(-1*M_obs, color = 'red') p_value_calc = sum(sim>= M_obs) + sum(sim<=-1*M_obs) p_value = p_value_calc/10000 print('The p-value is', p_value)
The p-value is 0.003
In [91]:
#Two-Box Confidence Interval exercise_group = [-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5] np_exercise = np.array(exercise_group) exercise_len = len(np_exercise) pop_len = len(population) population_median = np.median(population) exercise_median = np.median(np_exercise) M_obs = np.abs(population_median - exercise_median) CI_sim = [] for i in range(10000): sim_pop = np.random.choice(population, pop_len) sim_exercise = np.random.choice(np_exercise, exercise_len) sims_diff = np.median(sim_pop) - np.median(sim_exercise) CI_sim.append(sims_diff) #Confidence Interval CI_sim.sort() M_lower = CI_sim[49] M_upper = CI_sim[9949] lower_bound2 = (2*M_obs - M_upper) upper_bound2 = (2*M_obs - M_lower) r = sns.distplot(CI_sim, kde = False) r.set(xlabel = 'Difference in Medians') r.set(ylabel = 'Count') r.axvline(lower_bound2, color = 'red') r.axvline(upper_bound2, color = 'blue') r.axvline(M_obs, color = 'green') print("The Confidence Interval is", (lower_bound2, upper_bound2)) print("The M observed is", M_obs)
The Confidence Interval is (1.5, 13.0) The M observed is 7.5
In [99]:
#Determining whether the p-value significance lies between the Bath and Exercise group: def TwoBoxwithRecentering(group1, group2): np_group1 = np.array(group1) group1_len = len(group1) np_group2 = np.array(group2) group2_len = len(group2) group1_median = np.median(group1) group2_median = np.median(group2) M_obs = np.abs(group1_median - group2_median) #Re-centering: re_data_group1 = (group1 - group1_median) re_data_group2 = (group2 - group2_median) sim = [] for i in range(10000): psuedo_group1 = np.random.choice(re_data_group1, group1_len) psuedo_group2 = np.random.choice(re_data_group2, group2_len) sim.append(np.median(psuedo_group1) - np.median(psuedo_group2)) q = sns.distplot(sim, kde = False) q.set(xlabel = 'Difference in Medians') q.set(ylabel = 'Count') q.axvline(M_obs, color = 'red') q.axvline(-1*M_obs, color = 'red') p_value_calc = sum(sim>= M_obs) + sum(sim<=-1*M_obs) p_value = p_value_calc/10000 print('The p-value is', p_value) return q
In [143]:
#Confidence Interval Function def ConfidenceInterval(group1, group2): np_group1 = np.array(group1) np_group2 = np.array(group2) group1_len = len(group1) group2_len = len(group2) group1_median = np.median(group1) group2_median = np.median(group2) M_obs = np.abs(group1_median - group2_median) CI_sim = [] for i in range(10000): sim_group1 = np.random.choice(group1, group1_len) sim_group2 = np.random.choice(group2, group2_len) sims_diff = np.abs(np.median(sim_group1) - np.median(sim_group2)) CI_sim.append(sims_diff) #Confidence Interval CI_sim.sort() M_lower = CI_sim[49] M_upper = CI_sim[9949] lower_bound = (2*M_obs - M_upper) upper_bound = (2*M_obs - M_lower) r = sns.distplot(CI_sim, kde = False) r.set(xlabel = 'Difference in Medians') r.set(ylabel = 'Count') r.axvline(lower_bound, color = 'red') r.axvline(upper_bound, color = 'blue') r.axvline(M_obs, color = 'green') print("The Confidence Interval is", (lower_bound, upper_bound)) print("The M observed is", M_obs)
In [108]:
TwoBoxwithRecentering(bath_group, exercise_group)
The p-value is 0.095
<matplotlib.axes._subplots.AxesSubplot at 0x7f56b2b74518>
In [109]:
ConfidenceInterval(bath_group, exercise_group)
The Confidence Interval is (1.5, 13.0) The M observed is 5.5
In [105]:
TwoBoxwithRecentering(bath_exercise_group, exercise_group)
The p-value is 0.0009
<matplotlib.axes._subplots.AxesSubplot at 0x7f56b227ca58>
In [144]:
ConfidenceInterval(bath_exercise_group, exercise_group)
The Confidence Interval is (3.0, 15.0) The M observed is 9.5
In [ ]:
#6.5 #The graphs tell us that the significance lies between bath_exercise_group and the exercise_group. I would recommend the Bath & Exercise treatment to be the most effective against carpal tunnel syndrome.
In [115]:
#7.1 tailgating_file = pd.read_csv("tailgating.txt",sep='\t') tailgating_file
Alcohol MDMA THC NoDrug
0 38.956475 19.008920 49.805912 55.895314
1 33.921949 20.087996 42.051608 38.546327
2 49.762751 20.767624 34.573858 21.917550
3 32.641102 21.794142 14.817698 356.965032
4 43.977121 22.498652 32.071967 55.032420
5 47.385251 22.552377 27.385088 58.719569
6 26.237367 24.883327 30.431792 25.942951
7 28.908599 26.303733 48.494563 34.508789
8 29.914909 27.363866 54.329265 29.655055
9 38.306418 27.535141 27.857918 21.268192
10 44.109048 28.130536 65.664941 34.413163
11 37.693984 28.235663 28.921727 28.588107
12 68.339650 29.149041 28.687429 33.241967
13 17.885190 31.116597 32.964899 47.492813
14 63.031868 35.683037 39.481022 43.251451
15 27.700828 56.611745 37.055159 25.937962
In [142]:
sns.violinplot(data = tailgating_file) sns.swarmplot(data = tailgating_file)
<matplotlib.axes._subplots.AxesSubplot at 0x7f56af97cb38>
In [114]:
#Determining what the MADAM is for each group to determine which box model to use: def MADAM(data): return np.median(np.abs(data - np.median(data)))
In [118]:
MADAM(tailgating_file["Alcohol"])
8.588446745000004
In [120]:
MADAM(tailgating_file["MDMA"])
4.2821099700000005
In [121]:
MADAM(tailgating_file["THC"])
5.811551785000001
In [122]:
MADAM(tailgating_file["NoDrug"])
8.656744849999999
In [ ]:
#7.2 #Because the MADAM values of both the Alcohol and the NoDrug group are identical, the null hypothesis assumes that they are part of the same population. The same can be said for both the THC and the MDMA group. Because of this, a big-box method will be used for these two pairs of groups, and then a two-box with recentering method will be used to determine significance between the groups, and then individual box methods will be used to determine among which group does the significance lie. #The sample size will be 30 for both big-box methods.
In [129]:
#7.3 def One_Box(group1, group2): np_group1 = np.array(group1) np_group2 = np.array(group2) group1_len = len(group1) group2_leb = len(group2) M_obs = np.abs(np.median(group1) - np.median(group2)) population = np.concatenate([np_group1, np_group2]) pop_len = len(population) sims = [] for i in range(10000): psuedo_group1 = np.random.choice(population, pop_len) psuedo_group2 = np.random.choice(population, pop_len) sims.append(np.median(psuedo_group1) - np.median(psuedo_group2)) p_value_calc = sum(sims >= M_obs) + sum(sims<=-1*M_obs) p_value = p_value_calc/10000 print("p-value is" ,p_value) p = sns.distplot(sims, kde = False) p.set(xlabel = 'Difference in Medians') p.set(ylabel = 'Count') p.axvline(M_obs, color = 'purple') p.axvline(-1*M_obs, color = 'red')
In [ ]:
One_Box(tailgating_file["Alcohol"], tailgating_file["NoDrug"])
In [145]:
ConfidenceInterval(tailgating_file["Alcohol"], tailgating_file["NoDrug"])
The Confidence Interval is (-11.881612319999995, 7.031878605000003) The M observed is 3.5392249750000033
In [132]:
One_Box(tailgating_file["MDMA"], tailgating_file["THC"])
p-value is 0.0061
In [146]:
ConfidenceInterval(tailgating_file["MDMA"], tailgating_file["THC"])
The Confidence Interval is (-8.366611724999999, 13.079967380000003) The M observed is 6.935579295
In [136]:
TwoBoxwithRecentering(tailgating_file["MDMA"], tailgating_file["Alcohol"]) #significance between MDMA and Alcohol
The p-value is 0.0049
<matplotlib.axes._subplots.AxesSubplot at 0x7f56b27c9f98>
In [147]:
ConfidenceInterval(tailgating_file["MDMA"], tailgating_file['Alcohol'])
The Confidence Interval is (0.08386001000000931, 20.753886945000012) The M observed is 11.166401485000005
In [139]:
TwoBoxwithRecentering(tailgating_file["MDMA"], tailgating_file["NoDrug"])
The p-value is 0.1463
<matplotlib.axes._subplots.AxesSubplot at 0x7f56b29f8ba8>
In [148]:
ConfidenceInterval(tailgating_file["MDMA"], tailgating_file["NoDrug"])
The Confidence Interval is (-13.654470704999994, 14.891076075) The M observed is 7.627176510000002
In [150]:
TwoBoxwithRecentering(tailgating_file["THC"], tailgating_file["Alcohol"])
The p-value is 0.4083
<matplotlib.axes._subplots.AxesSubplot at 0x7f56af5840b8>
In [151]:
ConfidenceInterval(tailgating_file["THC"], tailgating_file["Alcohol"])
The Confidence Interval is (-6.853712699999992, 8.455102025000016) The M observed is 4.230822190000005
In [153]:
TwoBoxwithRecentering(tailgating_file["THC"], tailgating_file["NoDrug"])
The p-value is 0.8918
<matplotlib.axes._subplots.AxesSubplot at 0x7f56b2eab208>
In [154]:
ConfidenceInterval(tailgating_file['THC'], tailgating_file['NoDrug'])
The Confidence Interval is (-19.075367449999995, 1.3250081500000022) The M observed is 0.6915972150000016
In [ ]:
#7d) #The true significance lies between MDMA and Alcohol groups in which the p-value between the groups is presently shown to be 0.0049.