1

Name: Michelle Arroyo

I collaborated with:

2

In [33]:

# 1 import numpy as np import seaborn as sns i=[36,64] ni=[1215,1184] alldata=[i,ni] obs=np.array([alldata]) #observed table infected_rate= 100/2499 print("the infected rate is", infected_rate) drug=1251 placebo=1248 expected_drug=infected_rate*drug print("the expected infected value for drug is", expected_drug) expected_placebo= infected_rate*placebo print("the expected infected value for placebo is", expected_placebo) ei=[expected_drug,expected_placebo] #the expected infected a=1251-expected_drug b=1248-expected_placebo eni=[(a),b] #the expected noninfected exp=np.array([ei,eni]) #expected table print(exp) def abschi(obs,exp): chi_squared=np.sum((abs(obs-exp))/exp) return(chi_squared) abschi(obs,exp)

3

the infected rate is 0.040016006402561026
the expected infected value for drug is 50.06002400960384
the expected infected value for placebo is 49.939975990396164
[[ 50.06002401 49.93997599]
[1200.93997599 1198.06002401]]

0.5858449460010663

In [8]:

#2 import numpy as np import seaborn as sns nc=[8167,8117] pc=[529,620] obsp=np.array([nc,pc]) #makes a matrix of the observed print("the observed table is", obsp) r_risk=(8167/(8167+8117))/(529/(529+620)) print("the relative risk rate is", r_risk) cancer_rate=1149/17433 #this is the relative risk ratio. Calculated by taking those that developed prostate cancer under both treatment groups divided by the total sample size. placebo=8696 #total placebo sample size: 8167+529 vitE=8737 #total vitamin E sample size: 8117+620 expected_vit=cancer_rate*vitE #takes the number expected of participants that will develop cancer from the vitamin E treatment group print("the expected people with cancer under the vitamin E treatment is", expected_vit) expected_placebo= cancer_rate*placebo expected_placebo= cancer_rate*placebo #takes the number expected of participants that will develop cancer from the placebo group print("the expected people with cancer under the placebo treatment is", expected_placebo) ec=[expected_vit,expected_placebo] #makes a list of number of expected cancer development a=vitE-expected_vit b=placebo-expected_placebo enc=[a,b] #number of those that will not develop expp=np.array([enc,ec]) print(expp) #matrix of the expected values def abschi(obs,exp): chi_squared=np.sum((abs(obs-exp))/exp) return(chi_squared) abschi(obsp,expp) #this is the absolute chi value. print(abschi) from IPython.display import Image Image(filename='#2.png', height=300)

4

the observed table is [[8167 8117]
[ 529 620]]
the relative risk rate is 1.0893459385138742
the expected people with cancer under the vitamin E treatment is 575.8511443813458
the expected people with cancer under the placebo treatment is 573.1488556186544
[[8161.14885562 8122.85114438]
[ 575.85114438 573.14885562]]
<function abschi at 0x7fcc8bafe048>

In [4]:

#3 #Generally, we use the one-tailed p-values when using chi because we are testing if there is a significant change in a certain direction of the distribution in our hypothesis. Additionally, the results would be more significant for a one-tailed tested in comparison to a two-tailed using the case above, since one-tailed provides more power to detect an effect.

5

In [5]:

#4 The advantage in using absolute chi instead of chi squared is that absolute chi gives a more precise value while chi squared increases or decreases the chi value since we are squaring to get rid of any negatives.

6

In [6]:

#5 Use the formula (x)^2/abs(x) to explain why the null distribution for relative risk is high near 0. #the relative risk is higher near zero because we are expecting our observed and expected values to be the same which will give us 0 using this formula, since the numerator substracts the observed from the expected.

7

In [7]:

#6 Use the formula for RR to explain why the null distribution for relative rick is high near 1. #when using a relative risk for a null distribution, the relative risk will be high near one because it signifies that there is no difference in risk between the group's proportions, thus supporting null distribution. In other words, since we are not subtracting or recentering and we are just take the proportions, if the proportions are nearly the same the result will be near 1.

8

In [8]:

#7 if RR for a treatment compared to #0.71 should be used to calculate a two-tailed p-value because it is the inverse of the RR risk ratio, and as such it can be used for a two tailed p value.

9

In [9]:

#8 The difference between both methods is that now we have increased the variables when comparing proportions. For example, if we are testing a drug that prevents an illness, we have a control and an experimental and we calculate whether the groups develop the illness or not. While Monte Carlo only simulates whether those that received the treatment. Lastly, we are trying to find out if there's a significant differences between the two groups that are testing a certain drug or variable and whthere

10

In [14]:

#9 #a The independent variable of the study is new vs old procedure, and the dependent variable is if they survived or died. They are both categorial. #9b l=[34,16] d=[6,19] alldata=[l,d] obs1=np.array([alldata]) obs1 #9c #The null hypothesis for the study is that there is no significant difference in the mortality rate when doctors used the new technique compared to the old technique. #9d r_risk=(6/(6+34))/(19/(16+19)) print("the relative risk rate is", r_risk)

11

the relative risk rate is 0.27631578947368424

In [33]:

#9e import numpy as np import seaborn as sns death_rate=25/75 Sterile=50 #total placebo sample size: 34+6 NSterile=25 #total non sterile sample size: 16+19 expected_dead=death_rate*NSterile #takes the number expected of participants that will die from the nonsterile group print("the expected people who will die under the Nonsterile treatment is", expected_dead) expected_sterile= death_rate*Sterile #takes the number expected of participants that will die from the sterile group print("the expected people who will die under the sterile treatment is", expected_sterile) ed=[expected_dead,expected_sterile] #makes a list of number of expected dead a1=NSterile-expected_dead b1=Sterile-expected_sterile enc1=[a1,b1] #number of those that will not develop exp1=np.array([enc1,ed]) print(exp1) #matrix of the expected values abschi(obs1,exp1) #this is the absolute chi value. print("The observed absolute chi value is", abschi) sterile= 50*['S'] + 25*['NS'] ar=np.array(sterile) reps=10000 #number of reps rr_data=np.zeros(reps) table=np.zeros([2,2]) #makes a 2x2 table for i in range(reps): random_s=np.random.choice(ar,50) c=np.sum(random_s=='S') d=np.sum(random_s=='NS') table[0,0]=c table[1,0]=d random_ns=np.random.choice(ar,25) a=np.sum(random_ns=='S') b=np.sum(random_ns=='NS') table[0,1]=a table[1,1]=b rr_value= ((a/(a+b)) / (c/(c+d))) #formula for relative risk rr_data[i]=rr_value #records all 10000 relative risk values #r=(0.27631578947368424/2)+0.01 #if relative risk is above 1 then we divide it in half and add 0.01 r=0.14815789 dis=sns.distplot(rr_data,kde=False) p_value = (np.sum(rr_data<= r))/reps print("p_value is", p_value) dis.axvline(r, color="salmon")

12

the expected people who will die under the Nonsterile treatment is 8.333333333333332
the expected people who will die under the sterile treatment is 16.666666666666664
[[16.66666667 33.33333333]
[ 8.33333333 16.66666667]]
The observed absolute chi value is <function abschi at 0x7fcc8bafe048>
p_value is 0.0

<matplotlib.lines.Line2D at 0x7fcc83517eb8>

In [34]:

#9f calculate 99%confidence interval r=(r_risk/2)+0.01 ntimes = 10000 results = np.zeros(ntimes) for i in range(ntimes): # Randomly sample each group from its box model with observed sample size s = np.random.choice(ar, 50) ns = np.random.choice(ar,25) # Calculate probability of infection for each group p_ster = np.sum(s == 'N')/50 p_noster = np.sum(ns == 'N')/25 # Calculate and store relative risk relative_risk = p_ster/p_noster results[i] = relative_risk p = sns.distplot(results, kde = False) p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations") results.sort() percent = 0.99 il = int(ntimes*(1-percent)/2)-1 # index for M_lower should be 49 for 99% iu = int(ntimes*(1-(1-percent)/2))-1 # index for M_upper should be 9949 for 99% print("Lower and upper indices are {} and {}".format(il, iu)) M_lower = results[il] M_upper = results[iu] #### CI_lower = 2*r - M_upper CI_upper = 2*r_risk - M_lower print("{}% CI for Relative Risk {}".format(int(percent*100), (CI_lower, CI_upper))) p = sns.distplot(results, kde = False) p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations for Calculating Confidence Interval") p.axvline(CI_lower, color='green') p.axvline(CI_upper, color='green') p.axvline(relative_risk_obs, color='red')

13

/ext/anaconda5/lib/python3.6/site-packages/ipykernel/__main__.py:17: RuntimeWarning: invalid value encountered in double_scalars
/ext/anaconda5/lib/python3.6/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims, initial)
/ext/anaconda5/lib/python3.6/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims, initial)

```
---------------------------------------------------------------------------
```

```
ValueError Traceback (most recent call last)
```

```
<ipython-input-34-aa53ee023e87> in <module>()
18 results[i] = relative_risk
19
---> 20 p = sns.distplot(results, kde = False)
21 p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations")
22
```

```
/ext/anaconda5/lib/python3.6/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
213 if hist:
214 if bins is None:
--> 215 bins = min(_freedman_diaconis_bins(a), 50)
216 hist_kws.setdefault("alpha", 0.4)
217 if LooseVersion(mpl.__version__) < LooseVersion("2.2"):
```

```
/ext/anaconda5/lib/python3.6/site-packages/seaborn/distributions.py in _freedman_diaconis_bins(a)
37 return int(np.sqrt(a.size))
38 else:
---> 39 return int(np.ceil((a.max() - a.min()) / h))
40
41
```

```
ValueError: cannot convert float NaN to integer
```

In [ ]:

#9g

14

In [ ]:

#10a There is no independent variable and the dependent variable is the genotype frequency. Both are categorical. #10b contiguency observed table m=[1203,1459.5] f=[1678,1459.5] all=[m,f] obs2=np.array([all]) obs2 #expected table em=[1200.6,1435.5] ef=[1722.6,1435.5] exp2=np.array([em,ef]) #10c null hypothesis #The null hypothesis is that the population is at hardy weinburg equilibrium. #10d pvalue import numpy as np import seaborn as sns Sterile=40 #total placebo sample size: 34+6 NSterile=35 #total non sterile sample size: 16+19 expected_dead=death_rate*NSterile #takes the number expected of participants that will die from the nonsterile group print("the expected people who will die under the Nonsterile treatment is", expected_dead) expected_sterile= death_rate*Sterile #takes the number expected of participants that will die from the sterile group print("the expected people who will die under the sterile treatment is", expected_sterile) ed=[expected_dead,expected_sterile] #makes a list of number of expected dead a1=NSterile-expected_dead b1=Sterile-expected_sterile enc1=[a1,b1] #number of those that will not develop exp=np.array([enc1,ed]) print(exp) #matrix of the expected values abschi(obs,exp) #this is the absolute chi value. box_model= 40*['S'] + 35*['NS'] reps=10 chi_data=np.zeros(10) table=np.zeros([2,2]) ar=np.array(box_model) for i in range(reps): random_s=np.random.choice(ar,40) id=np.sum(random_s=='S') nid=np.sum(random_s=='NS') table[0,0]=id table[1,0]=nid random_ns=np.random.choice(ar,35) ip=np.sum(random_ns=='S') nip=np.sum(random_ns=='NS') table[0,1]=ip table[1,1]=nip chi_value=abschi(table,exp) chi_data[i]=chi_value dis=sns.distplot(chi_data,kde=False) p_value = np.sum(chi_data>=)

15

In [ ]:

16

In [ ]:

17

In [ ]:

18

In [ ]:

19