Name: Colette White

I collaborated with:

In [ ]:

#1. We suggested using np.zeros to store the results of our simulations, instead of appending to an empty list, because this creates an array rather than a list. The array holds 10000 zeros, each of which is replaced by the result of one simulation. This is useful because the storage is created beforehand, eliminating extra code when graphing results, and because the chi-squared test can be applied directly to an array. Preallocating also guarantees that the number of stored values is exactly the number of zeros assigned as placeholders.
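A minimal sketch of the difference, using a trivial placeholder computation in place of a real simulation (the names here are illustrative):

```python
import numpy as np

n_sims = 10000

# Growing a list: storage is created as results arrive
results_list = []
for i in range(n_sims):
    results_list.append(i % 6)  # placeholder for one simulation result

# Preallocating with np.zeros: all 10000 slots exist up front
results = np.zeros(n_sims)
for i in range(n_sims):
    results[i] = i % 6  # each zero is overwritten by one simulation result

# The array is immediately ready for vectorized follow-up work (plotting, chi-squared, etc.)
```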

In [3]:

#2. α = 0.01
#Sorted p-values:
#p(1) = 0.001 (T1 vs T3)
#p(2) = 0.003 (T2 vs T3)
#p(3) = 0.006 (T3 vs T4)
#p(4) = 0.0084 (T1 vs T4)
#p(5) = 0.013 (T2 vs T4)
#p(6) = 0.34 (T1 vs T2)
0.01*(1/4), 0.01*(2/4), 0.01*(3/4), 0.01*(4/4), 0.01*(5/4), 0.01*(6/4)

(0.0025, 0.005, 0.0075, 0.01, 0.0125, 0.015)

In [ ]:

#p(1) is less than its threshold α·(1/4), therefore T1 and T3 are significantly different from each other (reject null hypothesis).
#p(2) is less than its threshold α·(2/4), therefore T2 and T3 are significantly different from each other (reject null hypothesis).
#p(3) is less than its threshold α·(3/4), therefore T3 and T4 are significantly different from each other (reject null hypothesis).
#p(4) is less than its threshold α·(4/4), therefore T1 and T4 are significantly different from each other (reject null hypothesis).
#p(5) is greater than its threshold α·(5/4), therefore T2 and T4 are not significantly different from each other (fail to reject null hypothesis).
#p(6) is greater than its threshold α·(6/4), therefore T1 and T2 are not significantly different from each other (fail to reject null hypothesis).
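The six comparisons above can be checked in a short loop; this is a sketch using the sorted p-values and the thresholds computed in the previous cell:

```python
alpha = 0.01
pvals = [0.001, 0.003, 0.006, 0.0084, 0.013, 0.34]  # sorted p(1)..p(6) from above
thresholds = [alpha*k/4 for k in range(1, 7)]        # 0.0025, 0.005, ..., 0.015

verdicts = []
for p, t in zip(pvals, thresholds):
    verdicts.append("reject" if p < t else "fail to reject")

# The first four comparisons reject; the last two fail to reject
```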

In [4]:

#3.

In [7]:

#contingency table for expected results
#First, I found the expected values for the no-cancer row.
((8167+8117)/(8167+8117+529+620))*(8167+529)

8122.851144381346

In [8]:

((8167+8117)/(8167+8117+529+620))*(8117+620)

8161.148855618655

In [10]:

#Then, I subtracted these expected values from the column totals to find the expected values for the prostate-cancer row.
(8167+529)-8122.851144381346

573.1488556186541

In [9]:

(8117+620)-8161.148855618655

575.851144381345
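The same four expected values can be computed all at once from the row and column totals; this is a sketch, with `obs` holding the observed counts from the calculations above:

```python
import numpy as np

# Observed counts: rows = no cancer / prostate cancer, columns as in the calculations above
obs = np.array([[8167, 8117],
                [529, 620]])

row_totals = obs.sum(axis=1)   # [16284, 1149]
col_totals = obs.sum(axis=0)   # [8696, 8737]
grand_total = obs.sum()        # 17433

# expected[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
```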

In [4]:

import numpy as np
import matplotlib as mat
import seaborn as sns
%matplotlib inline

In [14]:

expectedarray = np.array([[8122.85, 8161.15], [573.15, 575.85]])
expectedarray

array([[8122.85, 8161.15],
[ 573.15, 575.85]])

In [18]:

#chi-squared statistic for a 2x2 table, using the expected values computed above
def chisqrd(data):
    aobs = data[0,0]
    bobs = data[0,1]
    cobs = data[1,0]
    dobs = data[1,1]
    aexp = 8122.85
    bexp = 8161.15
    cexp = 573.15
    dexp = 575.85
    difa = (aobs - aexp)**2
    difb = (bobs - bexp)**2
    difc = (cobs - cexp)**2
    difd = (dobs - dexp)**2
    chistat = (difa/aexp) + (difb/bexp) + (difc/cexp) + (difd/dexp)
    return chistat

In [19]:

obsarray=np.array([[8167, 8117], [529,620]])

In [20]:

chisqrd(obsarray)

7.2646519714768445

In [21]:

#relative risk
totalvitE = 8117 + 620
totalplac = 8167 + 529
p_drug = obsarray[0,0]/totalvitE
p_wo_drug = obsarray[0,1]/totalplac
relative_risk_obs = p_drug/p_wo_drug
print(relative_risk_obs)

1.0014383184893125

In [22]:

#4. Why should we use right-tailed p-values for chi-square/chi-abs and two-tailed p-values for relative risk?
#When we square the deviations between observed and expected values (or take their absolute value), the test statistic is forced to be greater than or equal to zero. If the sum of these deviations is large, we can reject the claim that the groups were taken from the same distribution; if it is small, the observed frequencies are similar to the expected ones, and we fail to reject that they come from the same distribution. Because the statistic can only be zero or positive, extremeness lies entirely in the right tail, so chi-squared/chi-abs p-values are always right-tailed. Relative risk, by contrast, can deviate from its null value in either direction, so a two-tailed p-value is used.
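A sketch of the two conventions, using synthetic stand-in null distributions (the observed values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in null distributions (illustrative only)
null_chi = rng.chisquare(df=1, size=10000)          # nonnegative by construction
null_rr = np.exp(rng.normal(0.0, 0.1, size=10000))  # spread around 1 in both directions

chi_obs = 3.8   # hypothetical observed chi-squared statistic
rr_obs = 1.2    # hypothetical observed relative risk

# Right-tailed: only large statistics count as extreme
p_chi = np.mean(null_chi >= chi_obs)

# Two-tailed: rr_obs and its reciprocal are treated as equally extreme
p_rr = np.mean(null_rr >= rr_obs) + np.mean(null_rr <= 1/rr_obs)
```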

In [23]:

#5. #We should use chi-abs instead of chi-squared to compare proportions because chi-abs gives us more than null hypothesis significance testing: it allows us to define an effect size and put a confidence interval on that effect size.

In [24]:

#6. Use the formula for chi-squared/chi-abs to explain why the null distribution for chi-squared/chi-abs is high near 0.
#The formula only produces values greater than or equal to 0, and under the null hypothesis most simulated results show little difference between the groups, so the deviations from the expected values are small. As a result, the null distribution piles up near 0.
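A quick illustration of this clustering near 0, sketched with a made-up two-category box (100 counts split 50/50, so the numbers are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
expected = np.array([50.0, 50.0])

sims = np.zeros(10000)
for i in range(10000):
    counts = rng.multinomial(100, [0.5, 0.5])             # one draw from the null box
    sims[i] = np.sum(np.abs(counts - expected)/expected)  # chi-abs for that draw

# Typical draws land close to the expected counts, so most chi-abs values sit near 0
```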

In [45]:

#7. Use the formula for relative risk to explain why the null distribution for relative risk is high near 1.
#Relative risk is the proportion that survived in the drug-yes group divided by the proportion that survived in the drug-no group, so RR = 1 means no effect. When many simulated outcomes show little to no difference between the groups, the null distribution concentrates at and near 1.
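A sketch of the same idea for relative risk: both groups are drawn from one box (group size and survival probability here are assumptions for illustration), so the ratio clusters around 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000         # hypothetical group size
p_survive = 0.3  # hypothetical shared survival probability under the null

rrs = np.zeros(10000)
for i in range(10000):
    survived_a = rng.binomial(n, p_survive)  # survivors in group A
    survived_b = rng.binomial(n, p_survive)  # survivors in group B
    rrs[i] = (survived_a/n)/(survived_b/n)   # relative risk for one null draw

# Both groups come from the same box, so the ratios pile up around 1
```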

In [46]:

#8. If the relative risk (risk ratio) for a particular treatment compared to control is 1.4, then the equally extreme relative risk that you should use to calculate a two-tailed p-value is 0.71. Why does this make sense?
#This makes sense because relative risk is a ratio, and 0.71 is the reciprocal of 1.4 (1/1.4 ≈ 0.71). A ratio and its reciprocal describe the same imbalance with the groups swapped, so 0.71 is equally extreme and can be interpreted to reach the same conclusion as 1.4.
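The arithmetic is just a reciprocal:

```python
rr = 1.4
equally_extreme = 1/rr  # the same imbalance with the groups swapped
print(round(equally_extreme, 2))  # prints 0.71
```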

In [27]:

#9. We also compared proportions using Monte Carlo simulations at the beginning of LS 40 when we first studied NHST in Lab 2. How were those studies different from this one? Why are we using different strategies now?
#Those studies were different because the Lab 2 Monte Carlo simulations did not use control groups and simulated each group individually, whereas here the simulated groups to which relative risk is applied are related to each other.

In [1]:

import pandas as pd
df = pd.DataFrame()

In [2]:

#10.
#(a)
tailgating = pd.read_csv("tailgating.txt", sep='\t')
tailgating

| | Alcohol | MDMA | THC | NoDrug |
|---|---|---|---|---|
| 0 | 38.956475 | 19.008920 | 49.805912 | 55.895314 |
| 1 | 33.921949 | 20.087996 | 42.051608 | 38.546327 |
| 2 | 49.762751 | 20.767624 | 34.573858 | 21.917550 |
| 3 | 32.641102 | 21.794142 | 14.817698 | 43.341763 |
| 4 | 43.977121 | 22.498652 | 32.071967 | 55.032420 |
| 5 | 47.385251 | 22.552377 | 27.385088 | 58.719569 |
| 6 | 26.237367 | 24.883327 | 30.431792 | 25.942951 |
| 7 | 28.908599 | 26.303733 | 48.494563 | 34.508789 |
| 8 | 29.914909 | 27.363866 | 54.329265 | 29.655055 |
| 9 | 38.306418 | 27.535141 | 27.857918 | 21.268192 |
| 10 | 44.109048 | 28.130536 | 65.664941 | 34.413163 |
| 11 | 37.693984 | 28.235663 | 28.921727 | 28.588107 |
| 12 | 68.339650 | 29.149041 | 28.687429 | 33.241967 |
| 13 | 17.885190 | 31.116597 | 32.964899 | 47.492813 |
| 14 | 63.031868 | 35.683037 | 39.481022 | 43.251451 |
| 15 | 27.700828 | 56.611745 | 37.055159 | 25.937962 |

In [5]:

p=sns.violinplot(data=tailgating, palette="Paired")

In [6]:

p=sns.stripplot(data=tailgating, palette="Reds")

In [7]:

#10(b) Identify the independent variable and dependent variable. Classify each as categorical or continuous. #The independent variable is the type of substance, which is categorical, and the dependent variable is the average following distance in meters during a car-following task, which is continuous.

In [8]:

#10(c) Describe the appropriate null hypothesis, measure to describe the groups, measure to compare the groups, box model, and sample size for the study. #The null hypothesis states that the drug condition has no effect on following distance. The appropriate measure to describe the groups is the median, since the groups are not normally distributed. To choose a box model, we will compare the MADAMs. The sample size is n=16 per group.

In [11]:

def MADAM(values):
    med = np.median(values)
    absdevs = abs(values - med)
    return np.median(absdevs)

MADAM(tailgating["Alcohol"]), MADAM(tailgating["MDMA"]), MADAM(tailgating["THC"]), MADAM(tailgating["NoDrug"])

(8.588446745000004, 4.2821099700000005, 5.811551785000001, 8.656744849999999)

In [12]:

#Since the MADAMs of groups 1 and 4 are larger than those of groups 2 and 3, we will use four-box recentering. In each simulation, we will sample 16 individuals with replacement from each recentered group to obtain four pseudo-groups. The recentered Alcohol, MDMA, THC, and NoDrug groups will be our box models.

In [13]:

#10(d) Calculate p-value for NHST using the appropriate method. Make sure you specify which method you're using. Include your histogram of simulation results with observed result indicated. If you are comparing three or more groups, make sure you determine between which two groups any difference occurs.
def computeFlike(df):
    medians = df.median()
    grandMedian = np.median(df)
    var_bt = np.sum(np.abs(medians - grandMedian)*len(df))
    sumabsdev = np.sum(np.abs(df - medians))
    var_within = np.sum(sumabsdev)
    Flike = var_bt/var_within
    return Flike

In [14]:

Fobs = computeFlike(tailgating)
Fobs

0.4279127376215329

In [15]:

tailgating_meds = tailgating.median()
recentered_tailgating = tailgating - tailgating_meds
recentered_tailgating

| | Alcohol | MDMA | THC | NoDrug |
|---|---|---|---|---|
| 0 | 0.956274 | -7.824879 | 16.036533 | 21.434338 |
| 1 | -4.078252 | -6.745804 | 8.282229 | 4.085351 |
| 2 | 11.762550 | -6.066175 | 0.804480 | -12.543426 |
| 3 | -5.359098 | -5.039658 | -18.951681 | 8.880788 |
| 4 | 5.976920 | -4.335147 | -1.697412 | 20.571444 |
| 5 | 9.385050 | -4.281422 | -6.384291 | 24.258593 |
| 6 | -11.762834 | -1.950472 | -3.337586 | -8.518025 |
| 7 | -9.091602 | -0.530066 | 14.725185 | 0.047813 |
| 8 | -8.085291 | 0.530066 | 20.559886 | -4.805921 |
| 9 | 0.306217 | 0.701341 | -5.911460 | -13.192783 |
| 10 | 6.108847 | 1.296736 | 31.895562 | -0.047813 |
| 11 | -0.306217 | 1.401864 | -4.847652 | -5.872868 |
| 12 | 30.339449 | 2.315241 | -5.081950 | -1.219009 |
| 13 | -20.115011 | 4.282798 | -0.804480 | 13.031837 |
| 14 | 25.031667 | 8.849238 | 5.711643 | 8.790475 |
| 15 | -10.299373 | 29.777946 | 3.285781 | -8.523014 |

In [16]:

flikes = np.zeros(10000)
for i in range(10000):
    group1 = np.random.choice(recentered_tailgating["Alcohol"], len(recentered_tailgating["Alcohol"]))
    group2 = np.random.choice(recentered_tailgating["MDMA"], len(recentered_tailgating["MDMA"]))
    group3 = np.random.choice(recentered_tailgating["THC"], len(recentered_tailgating["THC"]))
    group4 = np.random.choice(recentered_tailgating["NoDrug"], len(recentered_tailgating["NoDrug"]))
    sim_data = pd.DataFrame(np.column_stack((group1, group2, group3, group4)))
    flikes[i] = computeFlike(sim_data)

In [17]:

p = sns.distplot(flikes, kde=False)
p.axvline(Fobs, color="red")
p.set(xlabel="Simulated F-like values", ylabel="Counts")

[Text(0, 0.5, 'Counts'), Text(0.5, 0, 'Simulated F-like values')]

In [18]:

#10(d)
p_value = np.sum(flikes >= Fobs)/len(flikes)
p_value

0.1517

In [19]:

#10(e) #This p-value is greater than the given alpha value of 0.05, so we fail to reject the null hypothesis that there is no difference among groups. Thus, the result is not significant.

In [64]:

#11. Of the 2201 passengers on board, 367 males survived, 1364 males died, 344 females survived, and 126 females died. Determine how the relative risk of dying differed between males and females.
obsarray1 = np.array([[1364, 126], [367, 344]])
obsarray1

array([[1364, 126],
[ 367, 344]])

In [65]:

#11(a) #The independent variable is the gender of the passenger, which is categorical. The dependent variable is whether or not they survived, which is also categorical.

#11(b)

| | Male | Female |
|---|---|---|
| Died | 1364 | 126 |
| Survived | 367 | 344 |

In [66]:

#11(c) #The null hypothesis would state that gender did not have any effect on whether or not an individual survived the sinking of the Titanic.

In [67]:

#11(d) Calculate relative risk of dying for females vs. males in the observed data.
RRdying = (1364/(1364+367))/(126/(126+344))
RRdying

2.939304741731085

In [68]:

#11(e) Calculate p-value for NHST using relative risk. Include your histogram of simulation results with observed result indicated. Comment every line unless the line is standard to past re-sampling analyses (i.e., it's boilerplate!).
def RR(df):
    #risk of dying in column 0 divided by risk of dying in column 1
    relrisk = (df[0,0]/(df[0,0]+df[1,0]))/(df[0,1]/(df[0,1]+df[1,1]))
    return relrisk

In [69]:

RR(obsarray1)

2.939304741731085

In [70]:

population = 1490*['D'] + 711*['S']  #box with 1490 deaths and 711 survivals
ntimes = 10000
results = np.zeros(ntimes)
for i in range(ntimes):
    female_resample = np.random.choice(population, 470)  #randomly draw outcomes for the 470 female passengers
    male_resample = np.random.choice(population, 1731)   #randomly draw outcomes for the 1731 male passengers
    res = np.zeros([2,2])
    res[0,0] = np.sum(female_resample == 'D')  #build the 2x2 table for this simulation
    res[1,0] = np.sum(female_resample == 'S')
    res[0,1] = np.sum(male_resample == 'D')
    res[1,1] = np.sum(male_resample == 'S')
    results[i] = RR(res)  #relative risk for the simulated table

In [71]:

p = sns.distplot(results, kde=False)
p.set(xlabel="Relative Risk", ylabel="Count", title="Null Distribution")
p.axvline(RR(obsarray1), color="green")

<matplotlib.lines.Line2D at 0x7fa4f4533588>

In [72]:

#Two-tailed p-value: simulated RRs at least as extreme as the observed RR in either direction
print((sum(results >= RR(obsarray1)) + sum(results <= 1/RR(obsarray1)))/ntimes)
#The p-value is 0.0

0.0

In [75]:

#11(f)
results.sort()
M_lower = results[49]              #find the bottom 0.5%
M_upper = results[9949]            #find the top 0.5%
M_observed = np.median(results)    #find the median of the resamples
M_upperp = 2*M_observed - M_lower  #find the upper pivotal value
M_lowerp = 2*M_observed - M_upper  #find the lower pivotal value
display("M Lower : Red Line", M_lowerp, "M Upper: Green Line", M_upperp)  #display these found values
p = sns.distplot(results, kde=False, bins=25, axlabel="Relative Risks for deaths of females vs men")  #create a histogram of the resamples
p.set(ylabel="Count")
p.axvline(M_lowerp, color="red")
p.axvline(M_upperp, color="green")
p.axvline(M_observed, color="purple")
p.axvline(RR(obsarray1), color="orange")
p.axvline(1/RR(obsarray1), color="orange")  #the equally extreme reciprocal of the observed RR

'M Lower : Red Line'

0.9061305821300083

'M Upper: Green Line'

1.0915195784764515

In [76]:

#11(g) Interpret (e) and (f) in the context of this study.
#(e) shows that we can reject the null hypothesis, since the p-value is below the alpha value. (f) is consistent with this: the interval of roughly 0.91 to 1.09 describes the relative risks we would expect if gender had no effect, and the observed relative risk of about 2.94 falls far outside it. In the context of this study, these results indicate that the pattern of survival on the Titanic was not a chance fluctuation; if the sinking were repeated many times, gender would be expected to affect an individual's chance of survival.

In [36]:

#12(a) #The independent variable is the genotype (A/A, A/B, or B/B), which is categorical, and the dependent variable is the frequency of each genotype, which is numerical.

#12(b) #Observed

| | A/A | A/B | B/B |
|---|---|---|---|
| Frequency | 1203 | 2919 | 1678 |

#Expected

| | A/A | A/B | B/B |
|---|---|---|---|
| Frequency | 1200.6 | 2876.8 | 1722.6 |

In [38]:

#12(c) #The null hypothesis is that genotype frequencies fit the genotype proportions predicted by the Hardy-Weinberg Formula.
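The expected counts in (b) are consistent with Hardy-Weinberg proportions n·p², n·2pq, n·q² for allele frequencies of roughly p = 0.455 and q = 0.545; this is a sketch, and the frequencies are inferred here from the expected table rather than given in this notebook:

```python
import numpy as np

n = 1203 + 2919 + 1678  # total sample size, 5800
p = 0.455               # assumed frequency of allele A (inferred, not given)
q = 1 - p               # frequency of allele B

expected_hw = n*np.array([p**2, 2*p*q, q**2])  # expected A/A, A/B, B/B counts
```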

In [41]:

#12(d) Calculate p-value for NHST using chi-abs. Include your histogram of simulation results with observed result indicated.
obs = np.array([1203, 2919, 1678])
expected = np.array([1200.6, 2876.8, 1722.6])

def chi_abs(obs, ex):
    temp = np.abs(obs - ex)/ex  # / is element-wise division here
    return np.sum(temp)         # sum up all elements

chi_abs_obs = chi_abs(obs, expected)
print(chi_abs_obs)

0.042559172108308785

In [44]:

population = 1203*['A/A'] + 2919*['A/B'] + 1678*['B/B']
ntimes = 10000
results = np.zeros(ntimes)
for i in range(ntimes):
    AA_resample = np.random.choice(population, 1203)
    AB_resample = np.random.choice(population, 2919)
    BB_resample = np.random.choice(population, 1678)
    res = np.zeros(3)  #1-D so it lines up element-wise with expected
    res[0] = np.sum(AA_resample == 'A/A')
    res[1] = np.sum(AB_resample == 'A/B')
    res[2] = np.sum(BB_resample == 'B/B')
    results[i] = chi_abs(res, expected)
p = sns.distplot(results)
p.set(xlabel="Absolute Value of Chi Statistic")
p.axvline(chi_abs(obs, expected), color="green")
pval = sum(results >= chi_abs(obs, expected))/ntimes
pval

1.0

In [ ]:

#12(e) #A p-value of 1.0 means that every simulated chi-abs value was at least as large as the observed one; the observed genotype frequencies deviate from the Hardy-Weinberg expectations no more than chance resampling does. Thus, we fail to reject the null hypothesis that the genotype frequencies fit the Hardy-Weinberg proportions.