SharedWeek 7 / Lab 7 - Comparing Proportions.ipynbOpen in CoCalc
Author: Dean Neutel
Views : 11

# Comparing Proportions

The HIV virus is transmitted through bodily fluids and sexual activity. Although condoms are effective at preventing HIV transmission, many people at high risk of infection do not use them reliably, either because of personal preference or inability to do so. A different approach uses low doses of the same drugs used to treat HIV to prevent infection. The basic principle is that the drugs prevent any HIV viruses entering the body from reproducing, thus not allowing an infection to establish. This approach is called pre-exposure prophylaxis, or PrEP.

Here is some real data from a clinical trial of a PrEP drug.

Drug Placebo
Infected 36 64
Not Infected 1215 1184

We need to run a statistical analysis to make sure the drug works.

In [1]:
36+1215

1251
1. As usual, import Numpy and Seaborn.
In [45]:
#1
import numpy as np
import seaborn as sns

1. Enter the data above into a Numpy array using the function np.array. To do this, give the input as a list of lists, with each list representing a row of the data table.
In [46]:
#2
infected = [36,64]
not_infected = [1215,1184]
observed = np.array([infected,not_infected])
observed

array([[ 36, 64], [1215, 1184]])

We want to compare the observed data to what we would expect if the drug did not prevent HIV infection. In that case, the probability of infection for both groups would be the one observed in the whole sample. The total numbers of people receiving the drug vs. placebo would be the same as in the real study, but the numbers of infected and uninfected individuals would be different.

1. By hand (you can use the computer as a calculator), compute the values you would expect to observe if the drug did not work and enter this array into Jupyter. Decimals are OK.
In [47]:
#3
infected_placebo_expected = (36+64)/(np.sum(observed))*(64+1184)
infected_drug_expected = (36+64)/(np.sum(observed))*(36+1215)
not_infected_drug_expected = (1215+1184)/(np.sum(observed))*(36+1215)
not_infected_placebo_expected = (1215+1184)/(np.sum(observed))*(64+1184)
infected_expected = [infected_drug_expected,infected_placebo_expected]
not_infected_expected = [not_infected_drug_expected,not_infected_placebo_expected]
expected = np.array([infected_expected,not_infected_expected])
expected

array([[ 50.06002401, 49.93997599], [1200.93997599, 1198.06002401]])

We can use the $\chi^{2}$ statistic, defined as $\chi^{2}=\sum\frac{(\text{observed-expected})^{2}}{\text{expected}}$, to compare the observed values to the expected ones.

1. Compute $\chi^{2}$ for your data. NOTE: Since we are using pure Python, not SageMath, you'll need to use ** for powers.
In [48]:
#4
chi_squared_observed = np.sum(((observed-expected)**2)/expected)
chi_squared_observed

8.236994006680051
1. Write a function to compute $\chi^2$ for any arrays of observed and expected data.
In [49]:
#5
def chi_squared_function(observed_data,expected_data):
numerator = (observed_data-expected_data)**2
chi_squared = np.sum(numerator/expected_data)
return(chi_squared)


## Bootstrapping

By itself, the $\chi^{2}$ statistic doesn't give us much information. Its purpose in life is to be compared to what we would get if the null hypothesis was true. To simulate the null hypothesis, we randomly assign outcomes to “drug” and “placebo” groups.

1. Make a list of infected and not-infected individuals, putting in as many of each as there are in the whole sample. HINT: Remember the [entry]*n syntax for making a list of n copies of entry.
In [17]:
#6
new_population = ["infected"]*100+["not_infected"]*2399

1. Use the function np.zeros([rows,cols]) to make an array of zeros to store your simulated data.
In [18]:
#7
new_observed = np.zeros([2,2])

1. From your list of infected and not-infected individuals, take a sample (with replacement) of as many outcomes as there were people in the drug group. Then, count how many infected and not-infected individuals there are and place the counts in the appropriate cells in your storage array. HINT: To access an element in an array, use the notation arr[row, column], where counting starts at 0.
In [39]:
#8
new_drug = np.random.choice(new_population,1251)
count_infected_drug = np.sum(new_drug=="infected")
count_not_infected_drug = np.sum(new_drug=="not_infected")
new_observed[0,0] = count_infected_drug
new_observed[1,0] = count_not_infected_drug

1. Do the same thing for the placebo group.
In [40]:
#9
new_placebo = np.random.choice(new_population,1248)
count_infected_placebo = np.sum(new_placebo=="infected")
count_not_infected_placebo = np.sum(new_placebo=="not_infected")
new_observed[0,1] = count_infected_placebo
new_observed[1,1] = count_not_infected_placebo
new_observed

array([[ 49., 46.], [1202., 1202.]])
1. Compute $\chi^{2}$ to compare your simulated array to your expected array.
In [41]:
#10
chi_squared_function(new_observed,expected)

0.34718021257473786
1. Set up an array in which to store your simulated $\chi^{2}$ values and perform the simulation 10,000 times.
In [42]:
#11
total = 10000
chi_squared = np.zeros(total)
for i in range(total):
new_observed = np.zeros([2,2])
new_drug = np.random.choice(new_population,1251)
new_placebo = np.random.choice(new_population,1248)
count_infected_drug = np.sum(new_drug=="infected")
count_not_infected_drug = np.sum(new_drug=="not_infected")
count_infected_placebo = np.sum(new_placebo=="infected")
count_not_infected_placebo = np.sum(new_placebo=="not_infected")
new_observed[0,0] = count_infected_drug
new_observed[1,0] = count_not_infected_drug
new_observed[0,1] = count_infected_placebo
new_observed[1,1] = count_not_infected_placebo
chi_squared[i] = chi_squared_function(new_observed,expected)

1. Make a histogram of your simulated results. Mark the observed result with a vertical line.
In [51]:
#12
p=sns.distplot(chi_squared,kde=False)
p.axvline(chi_squared_observed,color="red");

1. Compute a p-value for your data. What do you conclude about the drug?
In [53]:
#13
count = np.sum(chi_squared>=chi_squared_observed)
pvalue = count/total
display("P-Value",pvalue)

'P-Value'
0.0183

## A Different Comparison

As you saw previously in the course, we can replace squaring with absolute values.

1. Write a function to compute $|\chi|=\sum\frac{|\text{observed-expected}|}{\text{expected}}$ ("chi-abs"). HINT: Use the function np.abs to compute the absolute value of each entry in an array.
In [54]:
#14
def chi_abs_function(observed_data,expected_data):
numerator = np.abs(observed_data-expected_data)
chi_abs = np.sum(numerator/expected_data)
return(chi_abs)

1. Repeat the analysis using $|\chi|$ as your test statistic. Do your conclusions change?
In [58]:
#15
total = 10000
chi_abs = np.zeros(total)
chi_abs_observed = chi_abs_function(observed,expected)
for i in range(total):
new_observed = np.zeros([2,2])
new_drug = np.random.choice(new_population,1251)
new_placebo = np.random.choice(new_population,1248)
count_infected_drug = np.sum(new_drug=="infected")
count_not_infected_drug = np.sum(new_drug=="not_infected")
count_infected_placebo = np.sum(new_placebo=="infected")
count_not_infected_placebo = np.sum(new_placebo=="not_infected")
new_observed[0,0] = count_infected_drug
new_observed[1,0] = count_not_infected_drug
new_observed[0,1] = count_infected_placebo
new_observed[1,1] = count_not_infected_placebo
chi_abs[i] = chi_abs_function(new_observed,expected)
p=sns.distplot(chi_abs,kde=False)
p.axvline(chi_abs_observed,color="red");
count = np.sum(chi_abs>=chi_abs_observed)
pvalue = count/total
display("P-Value",pvalue)

'P-Value'
0.0069

Our p value becomes more significant (smaller) and the entire spread of data is changed to be over a smaller range. The shape of the histogram also changes.

## A More Meaningful Analysis

The $\chi^2$ and $|\chi|$ tests you just performed told you whether the study found a statistically significant effect. However, they said nothing about the magnitude of the effect or the uncertainty associated with it. In the exercises that follow, you will quantify the study's effect and put a confidence interval on this number.

One commonly used effect size measure for binary outcomes (like "infected" vs. "not infected" or "lived" vs. "died") is the relative risk, also called the risk ratio. This is just the ratio of the probabilities of an outcome for the two groups. In this case, it is the probability of infection for the drug group divided by the probability of infection for the placebo group.

1. Compute the relative risk of infection for this study. Write a sentence interpreting it.
In [76]:
#16
probability_infection_drug = (36)/(36+64)
probability_infection_placebo = (64)/(36+64)
relative_risk_observed = probability_infection_drug/probability_infection_placebo
relative_risk_observed

0.5625

Now we want to put a confidence interval on the relative risk. To do this, we will need to resample the drug and placebo groups separately, counting how many infected and uninfected individuals are in each group. We can then compute the relative risk for this resampled data and construct a confidence interval as usual.

1. Compute the relative risk for the original data again, using indexing.
In [77]:
#17
probability_infection_drug = (observed[0,0])/(observed[0,0]+observed[0,1])
probability_infection_placebo = (observed[0,1])/(observed[0,0]+observed[0,1])
relative_risk_observed = probability_infection_drug/probability_infection_placebo
relative_risk_observed

0.5625
1. Make lists of outcomes for the drug and placebo groups.
In [63]:
#18
new_drug_list = ["infected"]*36+["not infected"]*1215
new_placebo_list = ["infected"]*64+["not infected"]*1184

1. Resampling from the lists, fill in an array with simulated data. HINT: You did something very similar in #8.
In [72]:
#19
new_data = np.zeros([2,2])
random_drug = np.random.choice(new_drug_list,1251)
count_infected_drug = np.sum(random_drug=="infected")
count_not_infected_drug = np.sum(random_drug=="not infected")
random_placebo = np.random.choice(new_placebo_list,1248)
count_infected_placebo = np.sum(random_placebo=="infected")
count_not_infected_placebo = np.sum(random_placebo=="not infected")
new_data[0,0] = count_infected_drug
new_data[1,0] = count_not_infected_drug
new_data[0,1] = count_infected_placebo
new_data[1,1] = count_not_infected_placebo
new_data

array([[ 37., 57.], [1214., 1191.]])
1. Compute the relative risk for your simulated data.
In [73]:
#20
probability_infection_drug = (new_data[0,0])/(new_data[0,0]+new_data[0,1])
probability_infection_placebo = (new_data[0,1])/(new_data[0,0]+new_data[0,1])
relative_risk = probability_infection_drug/probability_infection_placebo
relative_risk

0.6491228070175439
1. Repeat this procedure 10,000 times.
In [74]:
#21
total = 10000
new_relative_risk = np.zeros(total)
for i in range(total):
new_data = np.zeros([2,2])
random_drug = np.random.choice(new_drug_list,1251)
count_infected_drug = np.sum(random_drug=="infected")
count_not_infected_drug = np.sum(random_drug=="not infected")
random_placebo = np.random.choice(new_placebo_list,1248)
count_infected_placebo = np.sum(random_placebo=="infected")
count_not_infected_placebo = np.sum(random_placebo=="not infected")
new_data[0,0] = count_infected_drug
new_data[1,0] = count_not_infected_drug
new_data[0,1] = count_infected_placebo
new_data[1,1] = count_not_infected_placebo
probability_infection_drug = (new_data[0,0])/(new_data[0,0]+new_data[0,1])
probability_infection_placebo = (new_data[0,1])/(new_data[0,0]+new_data[0,1])
new_relative_risk[i] = probability_infection_drug/probability_infection_placebo

1. Find a 95% pivotal confidence interval for the relative risk.
In [81]:
#22
p=sns.distplot(new_relative_risk,kde=False)
p.axvline(relative_risk_observed,color="red");
new_relative_risk.sort()
m_lower = new_relative_risk[49]
m_upper = new_relative_risk[9949]
m_upper_pivotal = 2*relative_risk_observed - m_lower
m_lower_pivotal = 2*relative_risk_observed - m_upper
p.axvline(m_upper_pivotal, color="green");
p.axvline(m_lower_pivotal, color="green");
display("M Upper Pivotal", m_upper_pivotal, "M Lower Pivotal", m_lower_pivotal)

'M Upper Pivotal'
0.8151408450704225
'M Lower Pivotal'
0.17763157894736836
#23