CoCalc Shared FilesWeek 7 / hw7-turnin.ipynb
Author: Michelle Arroyo
Views : 121

Homework 7

Name: Michelle Arroyo

I collaborated with:

In [33]:
# 1
import numpy as np
import seaborn as sns

i=[36,64]
ni=[1215,1184]
alldata=[i,ni]
obs=np.array([alldata]) #observed table

infected_rate= 100/2499
print("the infected rate is", infected_rate)
drug=1251
placebo=1248
expected_drug=infected_rate*drug
print("the expected infected value for drug is", expected_drug)
expected_placebo= infected_rate*placebo
print("the expected infected value for placebo is", expected_placebo)
ei=[expected_drug,expected_placebo] #the expected infected
a=1251-expected_drug
b=1248-expected_placebo
eni=[(a),b]    #the expected noninfected
exp=np.array([ei,eni])  #expected table
print(exp)

def abschi(obs,exp):
chi_squared=np.sum((abs(obs-exp))/exp)
return(chi_squared)
abschi(obs,exp)

the infected rate is 0.040016006402561026 the expected infected value for drug is 50.06002400960384 the expected infected value for placebo is 49.939975990396164 [[ 50.06002401 49.93997599] [1200.93997599 1198.06002401]]
0.5858449460010663
In [8]:
#2
import numpy as np
import seaborn as sns

nc=[8167,8117]
pc=[529,620]
obsp=np.array([nc,pc]) #makes a matrix of the observed
print("the observed table is", obsp)

r_risk=(8167/(8167+8117))/(529/(529+620))
print("the relative risk rate is", r_risk)

cancer_rate=1149/17433 #this is the relative risk ratio. Calculated by taking those that developed prostate cancer under both treatment groups divided by the total sample size.

placebo=8696 #total placebo sample size: 8167+529
vitE=8737 #total vitamin E sample size: 8117+620

expected_vit=cancer_rate*vitE  #takes the number expected of participants that will develop cancer from the vitamin E treatment group
print("the expected people with cancer under the vitamin E treatment is", expected_vit)

expected_placebo= cancer_rate*placebo
expected_placebo= cancer_rate*placebo #takes the number expected of participants that will develop cancer from the placebo group
print("the expected people with cancer under the placebo treatment is", expected_placebo)

ec=[expected_vit,expected_placebo] #makes a list of number of expected cancer development
a=vitE-expected_vit
b=placebo-expected_placebo
enc=[a,b] #number of those that will not develop
expp=np.array([enc,ec])
print(expp) #matrix of the expected values

def abschi(obs,exp):
chi_squared=np.sum((abs(obs-exp))/exp)
return(chi_squared)
abschi(obsp,expp) #this is the absolute chi value.
print(abschi)

from IPython.display import Image
Image(filename='#2.png', height=300)


the observed table is [[8167 8117] [ 529 620]] the relative risk rate is 1.0893459385138742 the expected people with cancer under the vitamin E treatment is 575.8511443813458 the expected people with cancer under the placebo treatment is 573.1488556186544 [[8161.14885562 8122.85114438] [ 575.85114438 573.14885562]] <function abschi at 0x7fcc8bafe048>
In [4]:
#3
#Generally, we use the one-tailed p-values when using chi because we are testing if there is a significant change in a certain direction of the distribution in our hypothesis. Additionally, the results would be more significant for a one-tailed tested in comparison to a two-tailed using the case above, since one-tailed provides more power to detect an effect.

In [5]:
#4 The advantage in using absolute chi instead of chi squared is that absolute chi gives a more precise value while chi squared increases or decreases the chi value since we are squaring to get rid of any negatives.

In [6]:
#5 Use the formula (x)^2/abs(x) to explain why the null distribution for relative risk is high near 0.

#the relative risk is higher near zero because we are expecting our observed and expected values to be the same which will give us 0 using this formula, since the numerator substracts the observed from the expected.

In [7]:
#6 Use the formula for RR to explain why the null distribution for relative rick is high near 1.
#when using a relative risk for a null distribution, the relative risk will be high near one because it signifies that there is no difference in risk between the group's proportions, thus supporting null distribution. In other words, since we are not subtracting or recentering and we are just take the proportions, if the proportions are nearly the same the result will be near 1.


In [8]:
#7 if RR for a treatment compared to
#0.71 should be used to calculate a two-tailed p-value because it is the inverse of the RR risk ratio, and as such it can be used for a two tailed p value.

In [9]:
#8 The difference between both methods is that now we have increased the variables when comparing proportions. For example, if we are testing a drug that prevents an illness, we have a control and an experimental and we calculate whether the groups develop the illness or not. While Monte Carlo only simulates whether those that received the treatment. Lastly, we are trying to find out if there's a significant differences between the two groups that are testing a certain drug or variable and whthere

In [14]:
#9 #a The independent variable of the study is new vs old procedure, and the dependent variable is if they survived or died. They are both categorial.

#9b
l=[34,16]
d=[6,19]
alldata=[l,d]
obs1=np.array([alldata])
obs1

#9c
#The null hypothesis for the study is that there is no significant difference in the mortality rate when doctors used the new technique compared to the old technique.

#9d
r_risk=(6/(6+34))/(19/(16+19))
print("the relative risk rate is", r_risk)

the relative risk rate is 0.27631578947368424
In [33]:
#9e
import numpy as np
import seaborn as sns

death_rate=25/75

Sterile=50 #total placebo sample size: 34+6
NSterile=25 #total non sterile sample size: 16+19

expected_dead=death_rate*NSterile  #takes the number expected of participants that will die from the nonsterile group
print("the expected people who will die under the Nonsterile treatment is", expected_dead)

expected_sterile= death_rate*Sterile #takes the number expected of participants that will die from the sterile group
print("the expected people who will die under the sterile treatment is", expected_sterile)

b1=Sterile-expected_sterile
enc1=[a1,b1] #number of those that will not develop
exp1=np.array([enc1,ed])
print(exp1) #matrix of the expected values

abschi(obs1,exp1) #this is the absolute chi value.
print("The observed absolute chi value is", abschi)

sterile= 50*['S'] + 25*['NS']
ar=np.array(sterile)
reps=10000    #number of reps
rr_data=np.zeros(reps)
table=np.zeros([2,2]) #makes a 2x2 table
for i in range(reps):
random_s=np.random.choice(ar,50)
c=np.sum(random_s=='S')
d=np.sum(random_s=='NS')
table[0,0]=c
table[1,0]=d
random_ns=np.random.choice(ar,25)
a=np.sum(random_ns=='S')
b=np.sum(random_ns=='NS')
table[0,1]=a
table[1,1]=b
rr_value= ((a/(a+b)) / (c/(c+d))) #formula for relative risk
rr_data[i]=rr_value  #records all 10000 relative risk values

#r=(0.27631578947368424/2)+0.01 #if relative risk is above 1 then we divide it in half and add 0.01
r=0.14815789
dis=sns.distplot(rr_data,kde=False)
p_value = (np.sum(rr_data<= r))/reps
print("p_value is", p_value)
dis.axvline(r, color="salmon")

the expected people who will die under the Nonsterile treatment is 8.333333333333332 the expected people who will die under the sterile treatment is 16.666666666666664 [[16.66666667 33.33333333] [ 8.33333333 16.66666667]] The observed absolute chi value is <function abschi at 0x7fcc8bafe048> p_value is 0.0
<matplotlib.lines.Line2D at 0x7fcc83517eb8>
In [34]:
#9f calculate 99%confidence interval
r=(r_risk/2)+0.01

ntimes = 10000

results = np.zeros(ntimes)
for i in range(ntimes):
# Randomly sample each group from its box model with observed sample size
s = np.random.choice(ar, 50)
ns = np.random.choice(ar,25)

# Calculate probability of infection for each group
p_ster = np.sum(s == 'N')/50
p_noster = np.sum(ns == 'N')/25

# Calculate and store relative risk
relative_risk = p_ster/p_noster
results[i] = relative_risk

p = sns.distplot(results, kde = False)
p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations")

results.sort()
percent = 0.99
il = int(ntimes*(1-percent)/2)-1 # index for M_lower should be 49 for 99%
iu = int(ntimes*(1-(1-percent)/2))-1 # index for M_upper should be 9949 for 99%
print("Lower and upper indices are {} and {}".format(il, iu))
M_lower = results[il]
M_upper = results[iu]
####

CI_lower = 2*r - M_upper
CI_upper = 2*r_risk - M_lower

print("{}% CI for Relative Risk {}".format(int(percent*100), (CI_lower, CI_upper)))

p = sns.distplot(results, kde = False)
p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations for Calculating Confidence Interval")
p.axvline(CI_lower, color='green')
p.axvline(CI_upper, color='green')
p.axvline(relative_risk_obs, color='red')


/ext/anaconda5/lib/python3.6/site-packages/ipykernel/__main__.py:17: RuntimeWarning: invalid value encountered in double_scalars /ext/anaconda5/lib/python3.6/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce return umr_maximum(a, axis, None, out, keepdims, initial) /ext/anaconda5/lib/python3.6/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce return umr_minimum(a, axis, None, out, keepdims, initial)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-34-aa53ee023e87> in <module>() 18 results[i] = relative_risk 19 ---> 20 p = sns.distplot(results, kde = False) 21 p.set(xlabel = "Relative Risk", ylabel = "Count", title = "Distribution of Simulations") 22 /ext/anaconda5/lib/python3.6/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax) 213 if hist: 214 if bins is None: --> 215 bins = min(_freedman_diaconis_bins(a), 50) 216 hist_kws.setdefault("alpha", 0.4) 217 if LooseVersion(mpl.__version__) < LooseVersion("2.2"): /ext/anaconda5/lib/python3.6/site-packages/seaborn/distributions.py in _freedman_diaconis_bins(a) 37 return int(np.sqrt(a.size)) 38 else: ---> 39 return int(np.ceil((a.max() - a.min()) / h)) 40 41 ValueError: cannot convert float NaN to integer 
In [ ]:
#9g

In [ ]:
#10a There is no independent variable and the dependent variable is the genotype frequency. Both are categorical.

#10b contiguency observed table
m=[1203,1459.5]
f=[1678,1459.5]
all=[m,f]
obs2=np.array([all])
obs2

#expected table
em=[1200.6,1435.5]
ef=[1722.6,1435.5]
exp2=np.array([em,ef])

#10c null hypothesis
#The null hypothesis is that the population is at hardy weinburg equilibrium.

#10d pvalue
import numpy as np
import seaborn as sns

Sterile=40 #total placebo sample size: 34+6
NSterile=35 #total non sterile sample size: 16+19

expected_dead=death_rate*NSterile  #takes the number expected of participants that will die from the nonsterile group
print("the expected people who will die under the Nonsterile treatment is", expected_dead)

expected_sterile= death_rate*Sterile #takes the number expected of participants that will die from the sterile group
print("the expected people who will die under the sterile treatment is", expected_sterile)

b1=Sterile-expected_sterile
enc1=[a1,b1] #number of those that will not develop
exp=np.array([enc1,ed])
print(exp) #matrix of the expected values

abschi(obs,exp) #this is the absolute chi value.
box_model= 40*['S'] + 35*['NS']
reps=10
chi_data=np.zeros(10)
table=np.zeros([2,2])
ar=np.array(box_model)
for i in range(reps):
random_s=np.random.choice(ar,40)
id=np.sum(random_s=='S')
nid=np.sum(random_s=='NS')
table[0,0]=id
table[1,0]=nid
random_ns=np.random.choice(ar,35)
ip=np.sum(random_ns=='S')
nip=np.sum(random_ns=='NS')
table[0,1]=ip
table[1,1]=nip
chi_value=abschi(table,exp)
chi_data[i]=chi_value

dis=sns.distplot(chi_data,kde=False)
p_value = np.sum(chi_data>=)

In [ ]:


In [ ]:


In [ ]:


In [ ]: