Week 10 / Lab 10 - Power.ipynb
Author: Michelle Arroyo

# Lab 9: Power

The power of a study is the probability that the study will detect a result of a given size, assuming that result actually exists. In this lab, you will study power in the context of a $2 \times 2$ table.
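As a preview, power can be estimated by simulation: invent a world where the effect is real, run many simulated studies in it, and count what fraction reach significance. A minimal sketch with a toy coin-flip example (the numbers here are illustrative, not the lab's data):

```python
import numpy as np

np.random.seed(0)

# Phantom world: a coin that truly lands heads 60% of the time,
# tested against the null hypothesis p = 0.5 with n = 100 flips.
n, p_true, p_null = 100, 0.6, 0.5
n_studies = 1000

significant = 0
for _ in range(n_studies):
    heads = np.random.binomial(n, p_true)            # run one "study"
    # two-tailed p-value by simulating the null distribution
    null_heads = np.random.binomial(n, p_null, 1000)
    pval = np.mean(abs(null_heads - n*p_null) >= abs(heads - n*p_null))
    if pval < 0.05:
        significant += 1

power = significant / n_studies  # fraction of studies that detect the real effect
print(power)
```

The last exercises of this lab apply the same idea to the Bendectin table: fix a "phantom" effect size and sample size, then ask how often a study of that design would notice the effect.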

Bendectin was a drug given to pregnant women to treat morning sickness. A study was done to determine whether Bendectin increased the risk of birth defects. Here, we will focus specifically on limb reduction disorder, a birth defect in which children are born with shortened or missing limbs. The data from the study is given below.

|                   | Bendectin | No Bendectin |
|-------------------|-----------|--------------|
| Limb reduction    | 13        | 382          |
| No limb reduction | 1156      | 48731        |

First, we want to see whether these results lead us to conclude that the drug in question causes limb reduction.

1. Import Numpy and Seaborn.
In [2]:
#TODO
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # for plotting the simulated distributions
import scipy.stats as stats  # for regression lines and correlation coefficients

1. Enter the data above into a Numpy array. HINT: For this and several later exercises, you might want to refer back to the "Comparing Proportions" lab.
In [5]:
#TODO
obs = np.array([[13,382],[1156,48731]])
totalben = obs[:,0].sum()  # number who took Bendectin
totLR = obs[0,:].sum()     # number with limb reduction
n = np.sum(obs)            # total number of patients
totNoBen = obs[:,1].sum()  # number who did not take Bendectin
totNoLR = obs[1,:].sum()   # number without limb reduction

1. Compute the relative risk of limb reduction for the Bendectin group compared to the no Bendectin group.
In [8]:
#TODO

def RR_Ben(tab):  # Relative risk from a Bendectin contingency table
    probDB = tab[0,0]/np.sum(tab[:,0])   # Probability of defect if mother used Bendectin
    probDNB = tab[0,1]/np.sum(tab[:,1])  # Probability of defect if mother didn't use Bendectin
    return probDB/probDNB

relrisk = RR_Ben(obs)
print("The relative risk is", relrisk)

The relative risk is 1.4297560451273965
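As a sanity check on the printed value:

$$\mathrm{RR} = \frac{13/1169}{382/49113} \approx \frac{0.01112}{0.00778} \approx 1.43$$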

## Computing a p-value

1. Find the two-tailed p-value for this study in the usual way. In this and all other bootstrap analyses in this lab, only do 1000 simulations, not our customary 10,000.
In [9]:
def relRisk(obs):
    gr1Rate = obs[0,0]/sum(obs[:,0])  # limb reduction rate in column 0 group
    gr2Rate = obs[0,1]/sum(obs[:,1])  # limb reduction rate in column 1 group
    return gr1Rate/gr2Rate

numBen = np.sum(obs[:,0])  # calculate marginal sums
numNo = np.sum(obs[:,1])
num = numBen + numNo
numL = np.sum(obs[0,:])
numNoL = np.sum(obs[1,:])  # note: a separate name, so numNo isn't overwritten

box = ["L"]*numL + ["N"]*numNoL

ntimes = 1000
simsRR = np.zeros(ntimes)
sim = np.zeros([2,2])

for i in range(ntimes):
    # randomly deal limb and no-limb cards out with replacement
    simBen = np.random.choice(box, numBen, True)
    simNo = np.random.choice(box, numNo, True)

    sim[0,0] = np.sum(simBen == "L")  # fill 2x2 contingency table
    sim[1,0] = np.sum(simBen == "N")
    sim[0,1] = np.sum(simNo == "L")
    sim[1,1] = np.sum(simNo == "N")
    simsRR[i] = relRisk(sim)  # calculate relative risk and store

if relRisk(obs) > 1:  # the larger cutoff is the observed RR or its reciprocal
    upper_cutoff = relRisk(obs)
else:
    upper_cutoff = 1/relRisk(obs)

lower_cutoff = 1/upper_cutoff  # opposite extreme cutoff for a ratio is the reciprocal
pval = (np.sum(simsRR <= lower_cutoff) + np.sum(simsRR >= upper_cutoff))/ntimes

p = sns.distplot(simsRR, kde=False)
p.axvline(lower_cutoff, color='red')
p.axvline(upper_cutoff, color='red')

print(pval)

0.289
In [ ]:


1. What conclusion does the p-value lead you to draw?
In [ ]:
#TODO


Drawing values from a list or array works fine for relatively small simulations but can get very slow for larger ones. Therefore, we'll use a faster alternative. To simulate drawing $n$ items from a box that has probability $p$ of an outcome, use np.random.binomial(n,p). (To get the number of times the other outcome occurred, subtract your result from the total number of events.) This is much faster than what we did before, and speed is important in power analysis!
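For example, one `np.random.choice` deal of the Bendectin column collapses to a single `np.random.binomial` call, and the optional `size` argument returns many simulated counts at once (the 0.0079 rate below is illustrative):

```python
import numpy as np

np.random.seed(0)

# One simulated group: how many of 1169 patients have the outcome,
# when each patient has probability 0.0079 of it.
count = np.random.binomial(1169, 0.0079)

# The size argument vectorizes the simulation: 1000 phantom counts at once,
# with no 50,000-element list ever being built.
counts = np.random.binomial(1169, 0.0079, size=1000)
print(count, counts.mean())
```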

1. Use np.random.binomial to simulate drawing 20 balls from a bag in which 25% are green. Note that you're interested in the number of balls, not the sequence in which they come out.
In [7]:
#TODO
# np.random.binomial is equivalent to building a list of 25 green and 75
# non-green balls and drawing 20 of them with np.random.choice, but faster.
sample = np.random.binomial(20, 0.25)
sample

4
1. Find the p-value again, this time using np.random.binomial.
In [20]:
#TODO

ntimes = 1000
simsRR = np.zeros(ntimes)
sim = np.zeros([2,2])
pNull = numL/num  # pooled limb-reduction rate under the null hypothesis

for i in range(ntimes):
    sim[0,0] = np.random.binomial(numBen, pNull)  # fill 2x2 contingency table
    sim[1,0] = numBen - sim[0,0]
    sim[0,1] = np.random.binomial(numNo, pNull)
    sim[1,1] = numNo - sim[0,1]
    simsRR[i] = relRisk(sim)  # calculate relative risk and store

if relRisk(obs) > 1:  # the larger cutoff is the observed RR or its reciprocal
    upper_cutoff = relRisk(obs)
else:
    upper_cutoff = 1/relRisk(obs)

lower_cutoff = 1/upper_cutoff  # opposite extreme cutoff for a ratio is the reciprocal
pval = (np.sum(simsRR <= lower_cutoff) + np.sum(simsRR >= upper_cutoff))/ntimes

p = sns.distplot(simsRR, kde=False)
p.axvline(lower_cutoff, color='red')
p.axvline(upper_cutoff, color='red')

print(pval)

0.298
1. Create a function that will take data from a study similar to this one and use np.random.binomial to return a p-value. Make sure to test your function.
In [3]:
#TODO
# sim[row, column]

def binPval(tab):  # renamed from bin so it doesn't shadow Python's built-in bin()
    c = tab[0,1]/np.sum(tab[:,1])  # null rate: defect rate in the No Bendectin group
    ntimes = 1000
    simsRR = np.zeros(ntimes)  # local array, not the global one
    new = np.zeros([2,2])

    for i in range(ntimes):
        new[0,0] = np.random.binomial(np.sum(tab[:,0]), c)  # fill 2x2 contingency table
        new[1,0] = np.sum(tab[:,0]) - new[0,0]
        new[0,1] = np.random.binomial(np.sum(tab[:,1]), c)
        new[1,1] = np.sum(tab[:,1]) - new[0,1]
        simsRR[i] = relRisk(new)  # calculate relative risk and store

    if relRisk(tab) > 1:  # the larger cutoff is the observed RR or its reciprocal
        upper_cutoff = relRisk(tab)
    else:
        upper_cutoff = 1/relRisk(tab)

    lower_cutoff = 1/upper_cutoff  # opposite extreme cutoff for a ratio is the reciprocal
    pval = (np.sum(simsRR <= lower_cutoff) + np.sum(simsRR >= upper_cutoff))/ntimes

    p = sns.distplot(simsRR, kde=False)
    p.axvline(lower_cutoff, color='red')
    p.axvline(upper_cutoff, color='red')

    return pval

print(binPval(obs))  # test the function on the original data


## Performing Power Analysis

We will now examine situations in which the relative risk of Bendectin for limb defects or the sample size changes. Your TA will give you the values you should use for relative risk and sample size.

9a. If you're doing a different relative risk, create a “phantom” data table in which the column totals are the same but the fraction of children with birth defects in the Bendectin group is appropriately different from that in the no Bendectin group. Calculate the relative risk to confirm you did this correctly.

In [4]:
# You can either calculate the appropriate numbers using algebra and the
# relative risk formula, or you can guess and check, which is what I did here.
# We want the column totals to stay the same, so I also output the Bendectin
# column total (since I only changed that column) for both the original and
# new data. I chose to simply increase the risk for Bendectin users, so I only
# changed the starred items in the table below.

# Example: effect size of RR = 3 & same sample size

#  Original data:
#                    Bendectin       No Bendectin
#  Limb defects         *13              382
#  No defects         *1156             48731

# For effect size = 3
es3 = np.array([[28,382],[1141,48731]])
total_es3 = np.sum(es3)
print(RR_Ben(es3),np.sum(es3[:,0]),np.sum(obs[:,0]))
# Yay, my relative risk is ~3 (not 3*original), and the sample size is the same!
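The algebra route mentioned in the comment above looks like this: hold the No Bendectin rate and the Bendectin column total fixed, so the needed defect count is $a = \mathrm{RR}_{target} \times (382/49113) \times 1169$. A sketch (rounding means the achieved RR is only approximately the target; it lands on 27 rather than the 28 guessed above):

```python
import numpy as np

obs = np.array([[13, 382], [1156, 48731]])

rr_target = 3
n_ben = obs[:, 0].sum()                # 1169 Bendectin users (kept fixed)
p_noben = obs[0, 1] / obs[:, 1].sum()  # defect rate without Bendectin (kept fixed)

a = round(rr_target * p_noben * n_ben)  # defects needed in the Bendectin column
phantom = np.array([[a, obs[0, 1]], [n_ben - a, obs[1, 1]]])

rr = (phantom[0, 0]/phantom[:, 0].sum()) / (phantom[0, 1]/phantom[:, 1].sum())
print(phantom, rr)
```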



9b. If you're doing a different sample size, create a "phantom" data table in which the number of subjects in each Bendectin group is multiplied by the factor you're given (we're only changing the size of the Bendectin group, since the non-Bendectin group is much larger). Check that your relative risk is what it should be.

In [3]:
#TODO
# Example: sample size of 2,338 (2*original Bendectin group) & same effect size (Kristin's assignment)

#  Original data:
#                    Bendectin       No Bendectin
#  Limb defects         *13              382
#  No defects         *1156             48731

ss2 = np.array([[13*2,382],[1156*2,48731]])
total_ss2 = np.sum(ss2)
print(RR_Ben(ss2),np.sum(ss2[:,0]),RR_Ben(obs),np.sum(obs[:,0]))
# My sample size for the Bendectin group is double what it was, and the effect size is the same.

1. Using the function you previously wrote, find a p-value for this simulated data. HINT: Remember that, once you've written a function, you can call it elsewhere in your code. Don't copy the function.
In [ ]:
#TODO


Since we are using random simulation, we have to run the “study” multiple times to find its power.
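The full power loop has this shape (a sketch with helper names of my own and a pooled-rate null, like the earlier binomial p-value cell; simulation counts are kept smaller than the lab's 1000 so it runs quickly):

```python
import numpy as np

np.random.seed(0)

def rel_risk(tab):
    return (tab[0, 0]/tab[:, 0].sum()) / (tab[0, 1]/tab[:, 1].sum())

def pvalue(tab, ntimes=200):
    # two-tailed bootstrap p-value under the pooled null rate
    n1, n2 = tab[:, 0].sum(), tab[:, 1].sum()
    p_null = tab[0, :].sum() / tab.sum()
    sims = np.zeros(ntimes)
    for i in range(ntimes):
        a = np.random.binomial(n1, p_null)
        b = np.random.binomial(n2, p_null)
        sims[i] = rel_risk(np.array([[a, b], [n1 - a, n2 - b]]))
    upper = rel_risk(tab) if rel_risk(tab) > 1 else 1/rel_risk(tab)
    lower = 1/upper
    return (np.sum(sims <= lower) + np.sum(sims >= upper)) / ntimes

# Phantom world with RR ~ 3 and the original group sizes
phantom = np.array([[28, 382], [1141, 48731]])
n_ben, n_no = phantom[:, 0].sum(), phantom[:, 1].sum()
p_ben, p_no = phantom[0, 0]/n_ben, phantom[0, 1]/n_no

n_studies = 200
pvals = np.zeros(n_studies)
for j in range(n_studies):
    a = np.random.binomial(n_ben, p_ben)  # one simulated study in the phantom world
    b = np.random.binomial(n_no, p_no)
    pvals[j] = pvalue(np.array([[a, b], [n_ben - a, n_no - b]]))

power = np.mean(pvals < 0.05)  # fraction of phantom studies that reach significance
print(power)
```

Swapping in a different phantom table (or a different significance threshold) answers the remaining exercises with the same loop.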

1. Run 1000 simulations of your phantom world, computing relative risk for each. The simulation will be similar to those used for confidence intervals.
In [ ]:
#TODO

1. Modify your code from the previous problem to find a p-value for each phantom world. Keep a list of the p-values.
In [ ]:
#TODO

1. Using a significance threshold of 0.05, what fraction of your simulated studies return a significant result? With a significance threshold of 0.01?
In [ ]:
#TODO

1. What do this and your classmates' results imply for how we can interpret the original data? What sample size of Bendectin users would we need to get at least 80% power for our original relative risk? (80% is a common standard in the medical literature.)
In [ ]:
#TODO