Week 7 / hw7-turnin.ipynb
Author: Colette White

Homework 7

Name: Colette White

I collaborated with:

In [ ]:
#1.
#We suggested using np.zeros to store simulation results, rather than appending to an empty list, because it preallocates an array of 10000 zeros that the simulation results then overwrite one by one. A NumPy array is ready for graphing without extra conversion code, and we can apply the chi-squared test directly to it. Preallocation also guarantees that the number of stored values is exactly the number of placeholder zeros, no more and no less.
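As a minimal sketch of this preallocation pattern (a random draw stands in for one simulation result; the names here are illustrative):

```python
import numpy as np

n_sims = 10000
results = np.zeros(n_sims)  # preallocate one slot per simulation
for i in range(n_sims):
    results[i] = np.random.rand()  # stand-in for a single simulation result

# the array holds exactly n_sims values, ready for plotting or a chi-squared test
print(results.shape)
```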
In [3]:
#2.
#⍺ = 0.01
#Sorted p-values:
#0.001 (T1 vs T3)
#0.003 (T2 vs T3)
#0.006 (T3 vs T4)
#0.0084 (T1 vs T4)
#0.013 (T2 vs T4)
#0.34 (T1 vs T2)
0.01*(1/4), 0.01*(2/4), 0.01*(3/4), 0.01*(4/4), 0.01*(5/4), 0.01*(6/4)
(0.0025, 0.005, 0.0075, 0.01, 0.0125, 0.015)
In [ ]:
#p(1) is less than 𝛂⋅1/m, therefore T1 and T3 are significantly different from each other (reject null hypothesis).
#p(2) is less than 𝛂⋅2/m, therefore T2 and T3 are significantly different from each other (reject null hypothesis).
#p(3) is less than 𝛂⋅3/m, therefore T3 and T4 are significantly different from each other (reject null hypothesis).
#p(4) is less than 𝛂⋅4/m, therefore T1 and T4 are significantly different from each other (reject null hypothesis).
#p(5) is greater than 𝛂⋅5/m, therefore T2 and T4 are not significantly different from each other (fail to reject null hypothesis).
#p(6) is greater than 𝛂⋅6/m, therefore T1 and T2 are not significantly different from each other (fail to reject null hypothesis).
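The comparisons above can be reproduced in one vectorized step, using the sorted p-values and the 𝛂⋅k/m cutoffs computed in the cell above:

```python
import numpy as np

alpha = 0.01
pvals = np.array([0.001, 0.003, 0.006, 0.0084, 0.013, 0.34])  # sorted p-values from above
thresholds = alpha * np.arange(1, 7) / 4  # the cutoffs listed in the previous cell
reject = pvals < thresholds
print(reject)  # [ True  True  True  True False False]
```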
In [4]:
In [7]:
#contingency table for expected results
#First, I found the expected value for the no-cancer row:
((8167+8117)/(8167+8117+529+620))*(8167+529)
In [8]:
In [10]:
#Then, subtract this expected no-cancer count from the column total to get the expected prostate-cancer count for that column.
(8167+529)-8122.851144381346
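The same expected counts can be generated for the whole table at once from the row and column totals. A sketch (the observed table is repeated here so the cell stands alone):

```python
import numpy as np

obs = np.array([[8167, 8117], [529, 620]])  # observed 2x2 table from above
row_tot = obs.sum(axis=1)  # no-cancer and prostate-cancer row totals
col_tot = obs.sum(axis=0)  # column totals for the two groups
expected = np.outer(row_tot, col_tot) / obs.sum()  # (row total * column total) / grand total
print(expected[0, 0])  # matches the hand calculation above (about 8122.85)
```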
In [9]:
In [4]:
import numpy as np
import matplotlib as mat
import seaborn as sns
%matplotlib inline
In [14]:
expectedarray = np.array([[8122.85, 8161.15], [573.15, 575.85]])
expectedarray
array([[8122.85, 8161.15],
       [ 573.15,  575.85]])
In [18]:
#chi-abs
def chisqrd(data):
    aobs = data[0,0]
    bobs = data[0,1]
    cobs = data[1,0]
    dobs = data[1,1]
    aexp = 8122.85
    bexp = 8161.15
    cexp = 573.15
    dexp = 575.85
    difa = (aobs-aexp)**2
    difb = (bobs-bexp)**2
    difc = (cobs-cexp)**2
    difd = (dobs-dexp)**2
    chistat = (difa/aexp) + (difb/bexp) + (difc/cexp) + (difd/dexp)
    return chistat
In [19]:
obsarray=np.array([[8167, 8117], [529,620]])
In [20]:
In [21]:
#relative risk
totalvitE = 8117 + 620   #column 1 totals (vitamin E group)
totalplac = 8167 + 529   #column 0 totals (placebo group)
p_drug = obsarray[0,1]/totalvitE     #no-cancer proportion in the vitamin E group (column 1)
p_wo_drug = obsarray[0,0]/totalplac  #no-cancer proportion in the placebo group (column 0)
relative_risk_obs = p_drug / p_wo_drug
print(relative_risk_obs)
In [22]:
#4. Why should we use right-tailed p-values for chi-square/chi-abs and two-tailed p-values for relative risk?
#Squaring (or taking the absolute value of) the deviations between observed and expected values guarantees that the test statistic is zero or positive. If the sum of these deviations is large, we can reject the claim that the observations came from the expected distribution; if it is small, the observed frequencies are similar to the expected ones and we fail to reject. Because the statistic can only be zero (a perfect fit) or positive, only the right tail is informative, so chi-squared/chi-abs p-values are always right-tailed. Relative risk, by contrast, can deviate from the null value in either direction (a ratio above or below it), so a two-tailed p-value is needed.
In [23]:
#5.
#We should use chi-abs to compare proportions instead of chi-squared because chi-abs gives us more than null hypothesis significance testing: it allows us to define an effect size and put a confidence interval on that effect size.
In [24]:
#6. Use the formula for chi-squared/chi-abs to explain why the null distribution for chi-squared/chi-abs is high near 0.
#The formula only produces values greater than or equal to zero, and under the null hypothesis the simulated counts are usually close to the expected counts, so the deviations, and therefore the statistic, are usually small. Most simulations thus land just above 0, making the null distribution high near 0.
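A quick simulation illustrates this, using a hypothetical box with four equally likely categories (the numbers here are illustrative, not from the homework):

```python
import numpy as np

rng = np.random.default_rng(0)
expected = np.array([250.0, 250.0, 250.0, 250.0])  # expected counts: 1000 draws, 4 equal categories
counts = rng.multinomial(1000, [0.25]*4, size=10000)  # 10000 null simulations
chi_abs_stats = np.sum(np.abs(counts - expected) / expected, axis=1)  # chi-abs per simulation

# the statistic is never negative and is usually small under the null
print(np.median(chi_abs_stats))
```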
In [45]:
#7. Use the formula for relative risk to explain why the null distribution for relative risk is high near 1.
#Relative risk is the proportion with the outcome in the drug-yes group divided by the proportion in the drug-no group, so RR = 1 means no effect. Under the null hypothesis the two simulated proportions are usually nearly equal (little to no effect), so their ratio is usually close to 1, and the null distribution concentrates at and near 1.
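A small simulation shows this concentration, using two hypothetical groups with the same true proportion (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000   # subjects per group (illustrative)
p = 0.7    # same true proportion in both groups, i.e. no effect
p_drug = rng.binomial(n, p, 10000) / n  # simulated proportions, drug group
p_ctrl = rng.binomial(n, p, 10000) / n  # simulated proportions, control group
rr = p_drug / p_ctrl                    # null relative risks

# with identical true proportions, the simulated RRs cluster tightly around 1
print(np.median(rr))
```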
In [46]:
#8. If the relative risk (risk ratio) for a particular treatment compared to control is 1.4, then the equally extreme relative risk that you should use to calculate a two-tailed p-value is 0.71. Why does this make sense?
#This makes sense because relative risk is a ratio, and 0.71 is the reciprocal of 1.4. Swapping which group sits in the numerator turns a ratio of 1.4 into a ratio of 1/1.4, about 0.71, so the two values represent the same size of effect in opposite directions and lead to the same conclusion.
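A one-line check of the reciprocal:

```python
rr = 1.4
equally_extreme = 1 / rr  # the same effect size with the groups swapped
print(round(equally_extreme, 2))  # 0.71
```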
In [27]:
#9. We also compared proportions using Monte Carlo simulations at the beginning of LS 40 when we first studied NHST in Lab 2. How were those studies different from this one? Why are we using different strategies now?
#Those studies were different because the Lab 2 Monte Carlo simulations had no control groups: each group was simulated on its own against a fixed expectation. Here, where relative risk is applied, the two groups are compared directly to each other, so the simulation must resample both groups from a shared null box.
In [1]:
import pandas as pd
df = pd.DataFrame()
In [2]:
#10.
#(a)
tailgating = pd.read_csv("tailgating.txt", sep='\t')
tailgating
Alcohol MDMA THC NoDrug
0 38.956475 19.008920 49.805912 55.895314
1 33.921949 20.087996 42.051608 38.546327
2 49.762751 20.767624 34.573858 21.917550
3 32.641102 21.794142 14.817698 43.341763
4 43.977121 22.498652 32.071967 55.032420
5 47.385251 22.552377 27.385088 58.719569
6 26.237367 24.883327 30.431792 25.942951
7 28.908599 26.303733 48.494563 34.508789
8 29.914909 27.363866 54.329265 29.655055
9 38.306418 27.535141 27.857918 21.268192
10 44.109048 28.130536 65.664941 34.413163
11 37.693984 28.235663 28.921727 28.588107
12 68.339650 29.149041 28.687429 33.241967
13 17.885190 31.116597 32.964899 47.492813
14 63.031868 35.683037 39.481022 43.251451
15 27.700828 56.611745 37.055159 25.937962
In [5]:
p=sns.violinplot(data=tailgating, palette="Paired")
In [6]:
p=sns.stripplot(data=tailgating, palette="Reds")
In [7]:
#10(b) Identify the independent variable and dependent variable. Classify each as categorical or continuous. #The independent variable is the type of substance, which is categorical, and the dependent variable is the average following distance in meters during a car-following task, which is continuous.
In [8]:
#10(c) Describe the appropriate null hypothesis, measure to describe the groups, measure to compare the groups, box model, and sample size for the study.
#The null hypothesis states that the type of drug has no effect on following distance. The appropriate measure to describe the groups is the median, since the groups are not normally distributed. To decide which box model to use, we will compare the MADAMs. The sample size is n = 16 per group.
In [11]:
def MADAM(data):
    med = np.median(data)
    new_list = abs(data - med)
    new_med = np.median(new_list)
    return new_med

MADAM(tailgating["Alcohol"]), MADAM(tailgating["MDMA"]), MADAM(tailgating["THC"]), MADAM(tailgating["NoDrug"])
(8.588446745000004, 4.2821099700000005, 5.811551785000001, 8.656744849999999)
In [12]:
#Since the MADAMs of groups 1 and 4 are larger than those of groups 2 and 3, we will use four-box recentering. In each simulation, we will resample 16 individuals from each recentered subgroup to obtain four pseudo-groups. The recentered Alcohol, MDMA, THC, and NoDrug groups will be our box models.
In [13]:
#10(d) Calculate p-value for NHST using the appropriate method. Make sure you specify which method you're using. Include your histogram of simulation results with observed result indicated. If you are comparing three or more groups, make sure you determine between which two groups any difference occurs.
def computeFlike(df):
    medians = df.median()
    grandMedian = np.median(df)
    var_bt = np.sum(np.abs(medians - grandMedian)*len(df))
    sumabsdev = np.sum(np.abs(df - medians))
    var_within = np.sum(sumabsdev)
    Flike = var_bt/var_within
    return Flike
In [14]:
Fobs = computeFlike(tailgating)
Fobs
In [15]:
tailgating_meds = tailgating.median()
recentered_tailgating = tailgating - tailgating_meds
recentered_tailgating
Alcohol MDMA THC NoDrug
0 0.956274 -7.824879 16.036533 21.434338
1 -4.078252 -6.745804 8.282229 4.085351
2 11.762550 -6.066175 0.804480 -12.543426
3 -5.359098 -5.039658 -18.951681 8.880788
4 5.976920 -4.335147 -1.697412 20.571444
5 9.385050 -4.281422 -6.384291 24.258593
6 -11.762834 -1.950472 -3.337586 -8.518025
7 -9.091602 -0.530066 14.725185 0.047813
8 -8.085291 0.530066 20.559886 -4.805921
9 0.306217 0.701341 -5.911460 -13.192783
10 6.108847 1.296736 31.895562 -0.047813
11 -0.306217 1.401864 -4.847652 -5.872868
12 30.339449 2.315241 -5.081950 -1.219009
13 -20.115011 4.282798 -0.804480 13.031837
14 25.031667 8.849238 5.711643 8.790475
15 -10.299373 29.777946 3.285781 -8.523014
In [16]:
flikes = np.zeros(10000)
for i in range(10000):
    group1 = np.random.choice(recentered_tailgating["Alcohol"], len(recentered_tailgating["Alcohol"]))
    group2 = np.random.choice(recentered_tailgating["MDMA"], len(recentered_tailgating["MDMA"]))
    group3 = np.random.choice(recentered_tailgating["THC"], len(recentered_tailgating["THC"]))
    group4 = np.random.choice(recentered_tailgating["NoDrug"], len(recentered_tailgating["NoDrug"]))
    sim_data = pd.DataFrame(np.column_stack((group1, group2, group3, group4)))
    flikes[i] = computeFlike(sim_data)
In [17]:
p = sns.distplot(flikes, kde=False)
p.axvline(Fobs, color="red")
p.set(xlabel="Simulated F-like values", ylabel="Counts")
[Text(0, 0.5, 'Counts'), Text(0.5, 0, 'Simulated F-like values')]
In [18]:
#10(d)
p_value = np.sum(flikes >= Fobs)/len(flikes)
p_value
In [19]:
#10(e) #This p-value is greater than the given alpha value of 0.05, so we fail to reject the null hypothesis that there is no difference among groups. Thus, the result is not significant.
In [64]:
#11.
#Of the 2201 passengers on board, 367 males survived, 1364 males died, 344 females survived, and 126 females died. Determine how the relative risk of dying differed between males and females.
obsarray1 = np.array([[1364, 126], [367, 344]])
obsarray1
array([[1364,  126],
       [ 367,  344]])
In [ ]:
In [65]:
#11(a) #The independent variable is the gender of the passenger, which is categorical. The dependent variable is whether or not they survived, which is also categorical.


#11(b) Observed contingency table:

          Male   Female
Died      1364   126
Survived  367    344
In [ ]:
In [66]:
#11(c) #The null hypothesis would state that gender did not have any effect on whether or not an individual survived the sinking of the Titanic.
In [67]:
#11(d) Calculate relative risk of dying for females vs. males in the observed data.
RRdying = (1364/(1364+367))/(126/(126+344))  #males' death rate divided by females', matching RR(obsarray1) below
RRdying
In [68]:
#11(e) Calculate p-value for NHST using relative risk. Include your histogram of simulation results with observed result indicated. Comment every line unless the line is standard to past re-sampling analyses (i.e., it's boilerplate!).
def RR(df):
    relrisk = (df[0,0]/(df[0,0]+df[1,0]))/(df[0,1]/(df[0,1]+df[1,1]))
    return relrisk
In [69]:
In [70]:
population = 1490*['D'] + 711*['S']  # pooled outcomes: 1490 died, 711 survived
ntimes = 10000
results = np.zeros(ntimes)
for i in range(ntimes):
    female_resample = np.random.choice(population, 470)   # randomly select outcomes for the 470 female subjects
    male_resample = np.random.choice(population, 1731)    # randomly select outcomes for the 1731 male subjects
    res = np.zeros([2,2])
    res[0,0] = np.sum(female_resample == 'D')  # create 2x2 table for this simulation
    res[1,0] = np.sum(female_resample == 'S')
    res[0,1] = np.sum(male_resample == 'D')
    res[1,1] = np.sum(male_resample == 'S')
    results[i] = RR(res)  # calculate relative risk for this simulation
In [71]:
p = sns.distplot(results, kde=False)
p.set(xlabel="Relative Risk", ylabel="Count", title="Null Distribution")
p.axvline(RR(obsarray1), color="green")
<matplotlib.lines.Line2D at 0x7fa4f4533588>
In [72]:
print(sum(results >= RR(obsarray1))/ntimes)
#The p-value is 0.0
In [75]:
#11(f)
results.sort()
M_lower = results[49]     #find the bottom 0.5%
M_upper = results[9949]   #find the top 0.5%
M_observed = np.median(results)     #find the median of the resamples
M_upperp = 2*M_observed - M_lower   #find the upper pivotal value
M_lowerp = 2*M_observed - M_upper   #find the lower pivotal value
display("M Lower : Red Line", M_lowerp, "M Upper: Green Line", M_upperp)  #display these found values
p = sns.distplot(results, kde=False, bins=25, axlabel="Relative risk of death, males vs. females")  #create a histogram of the resamples
p.set(ylabel="Count")
p.axvline(M_lowerp, color="red")
p.axvline(M_upperp, color="green")
p.axvline(M_observed, color="purple")
p.axvline(RR(obsarray1), color="orange")
p.axvline(1/RR(obsarray1), color="orange")  #the equally extreme RR on the other side is the reciprocal (see #8), not the negative
'M Lower : Red Line'
'M Upper: Green Line'
In [76]:
#11(g) Interpret (e) and (f) in the context of this study.
#The p-value in (e) is below the alpha value, so we reject the null hypothesis that gender had no effect on survival. Consistent with this, the observed relative risk in (f) lies far outside the confidence interval built from the null distribution, which is centered near 1 (no effect). In the context of this study, the difference in death rates between males and females aboard the Titanic is much larger than chance alone would produce: if the Titanic had sunk many times, gender would almost always have affected who survived.
In [36]:
#12(a) #The independent variables are the two alleles, which are categorical, and the dependent variable is the allele frequency, which is continuous.

#12(b)
#Observed
#Genotype   A/A    A/B    B/B
#Frequency  1203   2919   1678

#Expected (Hardy-Weinberg)
#Genotype   A/A     A/B     B/B
#Frequency  1200.6  2876.8  1722.6
In [38]:
#12(c) #The null hypothesis is that genotype frequencies fit the genotype proportions predicted by the Hardy-Weinberg Formula.
In [41]:
#12(d)
#Calculate p-value for NHST using chi-abs. Include your histogram of simulation results with observed result indicated.
obs = np.array([1203, 2919, 1678])
expected = np.array([1200.6, 2876.8, 1722.6])

def chi_abs(obs, ex):
    temp = np.abs(obs - ex)/ex  # / is element-wise divide here
    return np.sum(temp)  # sum up all elements

chi_abs_obs = chi_abs(obs, expected)
print(chi_abs_obs)
In [44]:
population = 1203*['A/A'] + 2919*['A/B'] + 1678*['B/B']
ntimes = 10000
results = np.zeros(ntimes)
n = len(population)  # 5800 individuals in total
for i in range(ntimes):
    resample = np.random.choice(population, n)  # resample all 5800 genotypes with replacement
    res = np.zeros(3)  # count each genotype in the full resample
    res[0] = np.sum(resample == 'A/A')
    res[1] = np.sum(resample == 'A/B')
    res[2] = np.sum(resample == 'B/B')
    results[i] = chi_abs(res, expected)
p = sns.distplot(results)
p.set(xlabel="Absolute Value of Chi Statistic")
p.axvline(chi_abs(obs, expected), color="green")
pval = sum(results >= chi_abs(obs, expected))/ntimes
pval
In [ ]:
#12(e)
#The p-value is far greater than alpha, meaning deviations from the Hardy-Weinberg expectations as large as the observed ones arise routinely by chance. Thus, we fail to reject the null hypothesis that the genotype frequencies fit the Hardy-Weinberg proportions.