1

Name:

I collaborated with:

2

In [2]:

import pandas as pd import numpy as np import seaborn as sns

3

In [21]:

#1 # It is suggested to use the np.zeros input rather than .append() because np.zeros creates an array of placeholder zeros than can later be appended; although .append() performs the same function as np.zeros, .append() has to create the array of values in which the subsequent value of the empty list is dependednt upon the previous value that was appended. However with np.zeros, it is as if the computer copy and pasted the simulated values onto the np.zeros array, making it slightly more accurate to the values of the simulation than np.zeros.

4

#2

5

In [23]:

#3 #The Benjamini - Hochberg correction is a more conservative correction method than the Bonferroni method. #The Bonferroni correction method finds the adjusted p-value by dividing the NHST p-value by the number of comparisons made. This correction method is used to reduce the number of possible false postives to a minimum, however increasing the number of comparisons--although it will lower the false positive probability-- will raise the probability of obtaining a false negative, a great disadvantage. #The Benjamini - Hochberg is less sensitive in comparison to the Bonferroni in which the Benjamini - Hochberg method can calculate the adjusted alpha value with the number of comparisons while taking into consideration the changes of the alpha value as the number of comparisons increase. The Benjamini - Hochberg method is less sensitive to the changes of alpha valeues to comparisons compared to the Bonferroni method.

6

In [4]:

#4 p_values= np.sort([0.34 , 0.001, 0.0084 , 0.003 ,0.013 , 0.006]) B_H_method = (p_values*0.01)/6 np.sum(B_H_method <= 0.01) #Using the Benjamini - Hochberg correction method, it is determined that all six p_values in p_values contain a statistical difference.

7

6

In [25]:

#5 # The F-like statistic assists by determining whether two groups are significant before any NHST can be done, this is done to save time and to provide assurance that a statistical significance is present between two groups as to not waste time in performing a NHST between two groups out of a four-group sample only to find that they are not statistically significant. Once the F-like statistic determines a significance, use NHST to determine the p-value of the null hypothesis.

8

In [34]:

#6 contrast_bath = pd.read_csv("contrast-baths.txt",sep='\t') contrast_bath

9

Bath | Bath+Exercise | Exercise | |
---|---|---|---|

0 | 5 | 6 | -12 |

1 | 10 | 10 | -10 |

2 | -4 | 0 | -7 |

3 | 11 | 14 | -1 |

4 | -3 | 0 | -1 |

5 | 13 | 15 | 0 |

6 | 0 | 4 | 0 |

7 | 2 | 5 | 0 |

8 | 10 | 11 | 0 |

9 | 6 | 7 | 0 |

10 | -1 | 20 | 2 |

11 | 8 | 9 | 4 |

12 | 10 | 11 | 5 |

13 | -9 | 21 | 5 |

In [35]:

#6.1 sns.swarmplot(data = contrast_bath, palette = "Greens") sns.violinplot(data = contrast_bath, palette = "Reds") group_median = contrast_bath.median() length = len(contrast_bath) group_median, length

10

(Bath 5.5
Bath+Exercise 9.5
Exercise 0.0
dtype: float64, 14)

In [44]:

def MADAM(data): return np.median(np.abs(data - np.median(data)))

11

In [46]:

#MADAM of Bath Group MADAM([5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9])

12

5.0

In [47]:

#MADAM of Bath + Exercise Group MADAM([6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21])

13

4.5

In [49]:

#MADAM of Exercise Group MADAM([-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5])

14

1.5

In [39]:

#6.2 #It is shown above that the MADAM values of both the Bath and the Bath + Exercise are close to each other, however the MADAM value of the Exercise group strongly differs from that of the other two. Hence, a two box model with re-centering is needed for NHST. The null hypothesis, in this case, assumes that because the MADAM values are different, the "Bath" and "Bath + Exercise" groups derive from the same population because of their identical MADAM values, but the large difference between the Bath to the Exercise groups and the Bath + Exercise and the Exercise groups can account that these groups, together, do not derive from the same population but will have the same median distribution regardless. Hence a one box method will be first used to note the p-value between the Bath and Bath + Exercise groups. Then a two-box method is used to compare between the Bath & Bath + Exercise groups with the Exercise group. #The median will be used as the measure of central tendency, while the MADAM will be used as the comparator. The sample size of this study will be 14, as each group contains 14 values, and will be seperated from each other.

15

In [82]:

#6.3 #Big Box Method bath_group = [5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9] np_bath = np.array(bath_group) bath_exercise_group = [6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21] np_bath_exercise = np.array(bath_exercise_group) bath_len = len(np_bath) bath_exercise_len = len(np_bath_exercise) M_obs = np.abs(np.median(np_bath) - np.median(np_bath_exercise)) population = np.concatenate([np_bath , np_bath_exercise]) pop_len = len(population) sims = [] for i in range(10000): psuedo_bath = np.random.choice(population, pop_len) psuedo_bath_exercise = np.random.choice(population, pop_len) sims.append(np.median(psuedo_bath) - np.median(psuedo_bath_exercise)) Neg_M_obs = -1*M_obs p_value_calc = sum(sims >= M_obs) + sum(sims<=Neg_M_obs) p_value = p_value_calc/10000 print("p-value is" ,p_value) p = sns.distplot(sims, kde = False) p.set(xlabel = 'Difference in Medians') p.set(ylabel = 'Count') p.axvline(M_obs, color = 'purple') p.axvline(Neg_M_obs, color='red')

16

p-value is 0.1561

<matplotlib.lines.Line2D at 0x7f56b2cb8be0>

In [93]:

# One-Box Confidence Interval bath_group = [5, 10, -4, 11, -3, 13, 0, 2, 10, 6, -1, 8, 10, -9] np_bath = np.array(bath_group) bath_exercise_group = [6, 10, 0, 14, 0, 15, 4, 5, 11, 7, 20, 9, 11, 21] np_bath_exercise = np.array(bath_exercise_group) bath_len = len(np_bath) bath_exercise_len = len(np_bath_exercise) M_obs = (np.median(np_bath) - np.median(np_bath_exercise)) CI_sims = [] for i in range(10000): sim_bath = np.random.choice(np_bath, bath_len) sim_bath_exercise = np.random.choice(np_bath_exercise, bath_exercise_len) sim_diff = (np.median(sim_bath) - np.median(sim_bath_exercise)) CI_sims.append(sim_diff) CI_sims.sort() Mlower = sims[49] Mupper = sims[9949] lower_bound = (2*M_obs - Mupper) upper_bound = (2*M_obs - Mlower) k = sns.distplot(CI_sims, kde = False) k.axvline(lower_bound, color = 'blue') k.axvline(upper_bound, color = 'green') k.axvline(M_obs, color = 'red') print('The Confidence Interval sits between', (lower_bound, upper_bound)) print('M_obs is', M_obs)

17

The Confidence Interval sits between (-3.5, -10.0)
M_obs is -4.0

In [75]:

#Two Box with Recentering exercise_group = [-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5] np_exercise = np.array(exercise_group) exercise_len = len(np_exercise) pop_len = len(population) population_median = np.median(population) exercise_median = np.median(np_exercise) M_obs = np.abs(population_median - exercise_median) #Re-centering: re_data_pop = (population - population_median) re_data_exercise = (np_exercise - exercise_median) sim = [] for i in range(10000): psuedo_pop = np.random.choice(re_data_pop, pop_len) psuedo_exercise = np.random.choice(re_data_exercise, exercise_len) sim.append(np.median(psuedo_pop) - np.median(psuedo_exercise)) q = sns.distplot(sim, kde = False) q.set(xlabel = 'Difference in Medians') q.set(ylabel = 'Count') q.axvline(M_obs, color = 'red') q.axvline(-1*M_obs, color = 'red') p_value_calc = sum(sim>= M_obs) + sum(sim<=-1*M_obs) p_value = p_value_calc/10000 print('The p-value is', p_value)

18

The p-value is 0.003

In [91]:

#Two-Box Confidence Interval exercise_group = [-12, -10, -7, -1, -1, 0, 0, 0, 0, 0, 2, 4, 5, 5] np_exercise = np.array(exercise_group) exercise_len = len(np_exercise) pop_len = len(population) population_median = np.median(population) exercise_median = np.median(np_exercise) M_obs = np.abs(population_median - exercise_median) CI_sim = [] for i in range(10000): sim_pop = np.random.choice(population, pop_len) sim_exercise = np.random.choice(np_exercise, exercise_len) sims_diff = np.median(sim_pop) - np.median(sim_exercise) CI_sim.append(sims_diff) #Confidence Interval CI_sim.sort() M_lower = CI_sim[49] M_upper = CI_sim[9949] lower_bound2 = (2*M_obs - M_upper) upper_bound2 = (2*M_obs - M_lower) r = sns.distplot(CI_sim, kde = False) r.set(xlabel = 'Difference in Medians') r.set(ylabel = 'Count') r.axvline(lower_bound2, color = 'red') r.axvline(upper_bound2, color = 'blue') r.axvline(M_obs, color = 'green') print("The Confidence Interval is", (lower_bound2, upper_bound2)) print("The M observed is", M_obs)

19

The Confidence Interval is (1.5, 13.0)
The M observed is 7.5

In [99]:

#Determining whether the p-value significance lies between the Bath and Exercise group: def TwoBoxwithRecentering(group1, group2): np_group1 = np.array(group1) group1_len = len(group1) np_group2 = np.array(group2) group2_len = len(group2) group1_median = np.median(group1) group2_median = np.median(group2) M_obs = np.abs(group1_median - group2_median) #Re-centering: re_data_group1 = (group1 - group1_median) re_data_group2 = (group2 - group2_median) sim = [] for i in range(10000): psuedo_group1 = np.random.choice(re_data_group1, group1_len) psuedo_group2 = np.random.choice(re_data_group2, group2_len) sim.append(np.median(psuedo_group1) - np.median(psuedo_group2)) q = sns.distplot(sim, kde = False) q.set(xlabel = 'Difference in Medians') q.set(ylabel = 'Count') q.axvline(M_obs, color = 'red') q.axvline(-1*M_obs, color = 'red') p_value_calc = sum(sim>= M_obs) + sum(sim<=-1*M_obs) p_value = p_value_calc/10000 print('The p-value is', p_value) return q

20

In [143]:

#Confidence Interval Function def ConfidenceInterval(group1, group2): np_group1 = np.array(group1) np_group2 = np.array(group2) group1_len = len(group1) group2_len = len(group2) group1_median = np.median(group1) group2_median = np.median(group2) M_obs = np.abs(group1_median - group2_median) CI_sim = [] for i in range(10000): sim_group1 = np.random.choice(group1, group1_len) sim_group2 = np.random.choice(group2, group2_len) sims_diff = np.abs(np.median(sim_group1) - np.median(sim_group2)) CI_sim.append(sims_diff) #Confidence Interval CI_sim.sort() M_lower = CI_sim[49] M_upper = CI_sim[9949] lower_bound = (2*M_obs - M_upper) upper_bound = (2*M_obs - M_lower) r = sns.distplot(CI_sim, kde = False) r.set(xlabel = 'Difference in Medians') r.set(ylabel = 'Count') r.axvline(lower_bound, color = 'red') r.axvline(upper_bound, color = 'blue') r.axvline(M_obs, color = 'green') print("The Confidence Interval is", (lower_bound, upper_bound)) print("The M observed is", M_obs)

21

In [108]:

TwoBoxwithRecentering(bath_group, exercise_group)

22

The p-value is 0.095

<matplotlib.axes._subplots.AxesSubplot at 0x7f56b2b74518>

In [109]:

ConfidenceInterval(bath_group, exercise_group)

23

The Confidence Interval is (1.5, 13.0)
The M observed is 5.5

In [105]:

TwoBoxwithRecentering(bath_exercise_group, exercise_group)

24

The p-value is 0.0009

<matplotlib.axes._subplots.AxesSubplot at 0x7f56b227ca58>

In [144]:

ConfidenceInterval(bath_exercise_group, exercise_group)

25

The Confidence Interval is (3.0, 15.0)
The M observed is 9.5

In [ ]:

#6.5 #The graphs tell us that the significance lies between bath_exercise_group and the exercise_group. I would recommend the Bath & Exercise treatment to be the most effective against carpal tunnel syndrome.

26

In [115]:

#7.1 tailgating_file = pd.read_csv("tailgating.txt",sep='\t') tailgating_file

27

Alcohol | MDMA | THC | NoDrug | |
---|---|---|---|---|

0 | 38.956475 | 19.008920 | 49.805912 | 55.895314 |

1 | 33.921949 | 20.087996 | 42.051608 | 38.546327 |

2 | 49.762751 | 20.767624 | 34.573858 | 21.917550 |

3 | 32.641102 | 21.794142 | 14.817698 | 356.965032 |

4 | 43.977121 | 22.498652 | 32.071967 | 55.032420 |

5 | 47.385251 | 22.552377 | 27.385088 | 58.719569 |

6 | 26.237367 | 24.883327 | 30.431792 | 25.942951 |

7 | 28.908599 | 26.303733 | 48.494563 | 34.508789 |

8 | 29.914909 | 27.363866 | 54.329265 | 29.655055 |

9 | 38.306418 | 27.535141 | 27.857918 | 21.268192 |

10 | 44.109048 | 28.130536 | 65.664941 | 34.413163 |

11 | 37.693984 | 28.235663 | 28.921727 | 28.588107 |

12 | 68.339650 | 29.149041 | 28.687429 | 33.241967 |

13 | 17.885190 | 31.116597 | 32.964899 | 47.492813 |

14 | 63.031868 | 35.683037 | 39.481022 | 43.251451 |

15 | 27.700828 | 56.611745 | 37.055159 | 25.937962 |

In [142]:

sns.violinplot(data = tailgating_file) sns.swarmplot(data = tailgating_file)

28

<matplotlib.axes._subplots.AxesSubplot at 0x7f56af97cb38>

In [114]:

#Determining what the MADAM is for each group to determine which box model to use: def MADAM(data): return np.median(np.abs(data - np.median(data)))

29

In [118]:

MADAM(tailgating_file["Alcohol"])

30

8.588446745000004

In [120]:

MADAM(tailgating_file["MDMA"])

31

4.2821099700000005

In [121]:

MADAM(tailgating_file["THC"])

32

5.811551785000001

In [122]:

MADAM(tailgating_file["NoDrug"])

33

8.656744849999999

In [ ]:

#7.2 #Because the MADAM values of both the Alcohol and the NoDrug group are identical, the null hypothesis assumes that they are part of the same population. The same can be said for both the THC and the MDMA group. Because of this, a big-box method will be used for these two pairs of groups, and then a two-box with recentering method will be used to determine significance between the groups, and then individual box methods will be used to determine among which group does the significance lie. #The sample size will be 30 for both big-box methods.

34

In [129]:

#7.3 def One_Box(group1, group2): np_group1 = np.array(group1) np_group2 = np.array(group2) group1_len = len(group1) group2_leb = len(group2) M_obs = np.abs(np.median(group1) - np.median(group2)) population = np.concatenate([np_group1, np_group2]) pop_len = len(population) sims = [] for i in range(10000): psuedo_group1 = np.random.choice(population, pop_len) psuedo_group2 = np.random.choice(population, pop_len) sims.append(np.median(psuedo_group1) - np.median(psuedo_group2)) p_value_calc = sum(sims >= M_obs) + sum(sims<=-1*M_obs) p_value = p_value_calc/10000 print("p-value is" ,p_value) p = sns.distplot(sims, kde = False) p.set(xlabel = 'Difference in Medians') p.set(ylabel = 'Count') p.axvline(M_obs, color = 'purple') p.axvline(-1*M_obs, color = 'red')

35

In [ ]:

One_Box(tailgating_file["Alcohol"], tailgating_file["NoDrug"])

36

In [145]:

ConfidenceInterval(tailgating_file["Alcohol"], tailgating_file["NoDrug"])

37

The Confidence Interval is (-11.881612319999995, 7.031878605000003)
The M observed is 3.5392249750000033

In [132]:

One_Box(tailgating_file["MDMA"], tailgating_file["THC"])

38

p-value is 0.0061

In [146]:

ConfidenceInterval(tailgating_file["MDMA"], tailgating_file["THC"])

39

The Confidence Interval is (-8.366611724999999, 13.079967380000003)
The M observed is 6.935579295

In [136]:

TwoBoxwithRecentering(tailgating_file["MDMA"], tailgating_file["Alcohol"]) #significance between MDMA and Alcohol

40

The p-value is 0.0049

<matplotlib.axes._subplots.AxesSubplot at 0x7f56b27c9f98>

In [147]:

ConfidenceInterval(tailgating_file["MDMA"], tailgating_file['Alcohol'])

41

The Confidence Interval is (0.08386001000000931, 20.753886945000012)
The M observed is 11.166401485000005

In [139]:

TwoBoxwithRecentering(tailgating_file["MDMA"], tailgating_file["NoDrug"])

42

The p-value is 0.1463

<matplotlib.axes._subplots.AxesSubplot at 0x7f56b29f8ba8>

In [148]:

ConfidenceInterval(tailgating_file["MDMA"], tailgating_file["NoDrug"])

43

The Confidence Interval is (-13.654470704999994, 14.891076075)
The M observed is 7.627176510000002

In [150]:

TwoBoxwithRecentering(tailgating_file["THC"], tailgating_file["Alcohol"])

44

The p-value is 0.4083

<matplotlib.axes._subplots.AxesSubplot at 0x7f56af5840b8>

In [151]:

ConfidenceInterval(tailgating_file["THC"], tailgating_file["Alcohol"])

45

The Confidence Interval is (-6.853712699999992, 8.455102025000016)
The M observed is 4.230822190000005

In [153]:

TwoBoxwithRecentering(tailgating_file["THC"], tailgating_file["NoDrug"])

46

The p-value is 0.8918

<matplotlib.axes._subplots.AxesSubplot at 0x7f56b2eab208>

In [154]:

ConfidenceInterval(tailgating_file['THC'], tailgating_file['NoDrug'])

47

The Confidence Interval is (-19.075367449999995, 1.3250081500000022)
The M observed is 0.6915972150000016

In [ ]:

#7d) #The true significance lies between MDMA and Alcohol groups in which the p-value between the groups is presently shown to be 0.0049.

48