HIV is transmitted through bodily fluids, including during sexual activity. Although condoms are effective at preventing HIV transmission, many people at high risk of infection do not use them reliably, either by personal preference or because they are unable to. A different approach uses low doses of the same drugs used to treat HIV to prevent infection. The basic principle is that the drugs stop any HIV entering the body from reproducing, so an infection cannot establish itself. This approach is called pre-exposure prophylaxis, or PrEP.
Here is some real data from a clinical trial of a PrEP drug.
We need to run a statistical analysis to determine whether the drug works.
As usual, import NumPy and Seaborn.
Enter the data above into a NumPy array using the function np.array. To do this, give the input as a list of lists, with each list representing a row of the data table.
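As an illustration of the list-of-lists syntax, here is a minimal sketch. The counts are placeholders invented for this example, not the trial's data, and the layout (rows are groups, columns are outcomes) is an assumption about how the table is arranged.

```python
import numpy as np

# PLACEHOLDER counts for illustration only; these are NOT the trial's data.
# Assumed layout: rows = [drug, placebo], columns = [infected, not infected].
observed = np.array([
    [10, 110],  # drug group
    [20, 60],   # placebo group
])

print(observed.shape)  # (number of rows, number of columns)
print(observed[1, 0])  # infected count in the placebo row
```

Each inner list becomes one row of the array, so the array has the same shape as the table.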
We want to compare the observed data to what we would expect if the drug did not prevent HIV infection. In that case, the probability of infection for both groups would be the one observed in the whole sample. The total numbers of people receiving the drug vs. placebo would be the same as in the real study, but the numbers of infected and uninfected individuals would be different.
By hand (you can use the computer as a calculator), compute the values you would expect to observe if the drug did not work and enter this array into Jupyter. Decimals are OK.
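To check a hand calculation, the same arithmetic can be sketched in NumPy. The counts below are placeholders (not the trial's data), and the layout, with rows for groups and columns for outcomes, is assumed. Each expected count is the group size multiplied by the overall proportion of that outcome.

```python
import numpy as np

# PLACEHOLDER counts, not the trial's data.
# Assumed layout: rows = [drug, placebo], columns = [infected, not infected].
observed = np.array([[10, 110],
                     [20, 60]])

group_totals = observed.sum(axis=1)    # people per group: [120, 80]
outcome_totals = observed.sum(axis=0)  # people per outcome: [30, 170]
n_total = observed.sum()               # everyone in the study: 200

# Overall probability of each outcome, ignoring group: [0.15, 0.85]
p_outcome = outcome_totals / n_total

# Expected count = group size * overall outcome probability.
# For the placeholder counts this is approximately [[18, 102], [12, 68]].
expected = np.outer(group_totals, p_outcome)
print(expected)
```

The row totals of the expected table match the observed group sizes, as the text requires; only the split between infected and not infected changes.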
By itself, the χ² statistic doesn't give us much information. Its purpose in life is to be compared to what we would get if the null hypothesis were true. To simulate the null hypothesis, we randomly assign outcomes to “drug” and “placebo” groups.
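For reference, here is a minimal sketch of computing the statistic itself, assuming it is the standard Pearson form: the sum over all cells of (observed − expected)² / expected. The function name chi_squared is hypothetical, and the observed and expected tables are placeholders, not the trial's data.

```python
import numpy as np

# PLACEHOLDER tables, not the trial's data.
observed = np.array([[10, 110], [20, 60]], dtype=float)
expected = np.array([[18, 102], [12, 68]], dtype=float)

def chi_squared(obs, exp):
    """Sum over all cells of (observed - expected)^2 / expected."""
    return np.sum((obs - exp) ** 2 / exp)

chi2_obs = chi_squared(observed, expected)
print(chi2_obs)
```

The same function can be applied to each simulated table, which gives the null distribution that the observed value is compared against.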
Make a list of infected and not-infected individuals, putting in as many of each as there are in the whole sample. HINT: Remember the [entry]*n syntax for making a list of n copies of entry.
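A minimal sketch of the [entry]*n syntax follows. The totals are placeholders, not the trial's data, and the labels "infected" and "not infected" are just one possible choice.

```python
# PLACEHOLDER totals for the whole sample, not the trial's data.
n_infected = 30
n_not_infected = 170

# [entry] * n makes a list of n copies; + joins the two lists.
outcomes = ["infected"] * n_infected + ["not infected"] * n_not_infected

print(len(outcomes))  # total number of people in the sample
```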
Use the function np.zeros([rows,cols]) to make an array of zeros to store your simulated data.
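For a table with two groups and two outcomes, the storage array might look like the sketch below. The 2-by-2 shape is an assumption based on the drug/placebo and infected/not-infected categories.

```python
import numpy as np

# 2 rows (assumed: drug, placebo) by 2 columns (assumed: infected, not infected).
simulated = np.zeros([2, 2])

print(simulated)        # every cell starts at 0
print(simulated.dtype)  # np.zeros stores floats by default
```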
From your list of infected and not-infected individuals, take a sample (with replacement) of as many outcomes as there were people in the drug group. Then, count how many infected and not-infected individuals there are and place the counts in the appropriate cells in your storage array. HINT: To access an element in an array, use the notation arr[row, column], where counting starts at 0.
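One simulated table could be built as in the sketch below. All counts are placeholders (not the trial's data), and the row and column layout is assumed. The sketch uses NumPy's Generator.choice for the sampling; np.random.choice would work similarly. The placebo row is filled in the same way as the drug row, which is an assumption about the step that follows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# PLACEHOLDER counts, not the trial's data.
outcomes = np.array(["infected"] * 30 + ["not infected"] * 170)
n_drug, n_placebo = 120, 80

# Assumed layout: rows = [drug, placebo], columns = [infected, not infected].
simulated = np.zeros([2, 2])

# Drug group: sample with replacement, then count each outcome.
drug_sample = rng.choice(outcomes, size=n_drug, replace=True)
simulated[0, 0] = np.sum(drug_sample == "infected")
simulated[0, 1] = np.sum(drug_sample == "not infected")

# Placebo group: the same steps with the placebo group size.
placebo_sample = rng.choice(outcomes, size=n_placebo, replace=True)
simulated[1, 0] = np.sum(placebo_sample == "infected")
simulated[1, 1] = np.sum(placebo_sample == "not infected")

print(simulated)
```

Because both groups draw from the same pooled list, any difference between the two rows is due to chance alone, which is exactly what the null hypothesis says.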
The p-value becomes smaller (more significant), and the data are spread over a narrower range. The shape of the histogram also changes.
A More Meaningful Analysis
The χ² and |χ| tests you just performed told you whether the study found a statistically significant effect. However, they said nothing about the magnitude of the effect or the uncertainty associated with it. In the exercises that follow, you will quantify the study's effect and put a confidence interval on this number.
One commonly used effect size measure for binary outcomes (like "infected" vs. "not infected" or "lived" vs. "died") is the relative risk, also called the risk ratio. This is just the ratio of the probabilities of an outcome for the two groups. In this case, it is the probability of infection for the drug group divided by the probability of infection for the placebo group.
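As a worked illustration of the ratio, here is a minimal sketch with placeholder counts (not the trial's data).

```python
# PLACEHOLDER counts, not the trial's data.
drug_infected, drug_total = 10, 120
placebo_infected, placebo_total = 20, 80

risk_drug = drug_infected / drug_total           # probability of infection on the drug
risk_placebo = placebo_infected / placebo_total  # probability of infection on placebo

relative_risk = risk_drug / risk_placebo
print(relative_risk)
```

A relative risk below 1 means the drug group had a lower probability of infection than the placebo group. With these placeholder counts the ratio is one third, so the drug group's risk would be a third of the placebo group's.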
Compute the relative risk of infection for this study. Write a sentence interpreting it.
Now we want to put a confidence interval on the relative risk. To do this, we will need to resample the drug and placebo groups separately, counting how many infected and uninfected individuals are in each group. We can then compute the relative risk for this resampled data and construct a confidence interval as usual.
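The procedure above might be sketched as follows. The counts are placeholders (not the trial's data), and the table layout is assumed. The interval uses the simple percentile method, which is an assumption and may differ from the construction used elsewhere in this course; the number of resamples (2000) is also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# PLACEHOLDER counts, not the trial's data.
# Assumed layout: rows = [drug, placebo], columns = [infected, not infected].
observed = np.array([[10, 110],
                     [20, 60]])

# Rebuild each group as individual outcomes: 1 = infected, 0 = not infected.
drug = np.repeat([1, 0], observed[0])
placebo = np.repeat([1, 0], observed[1])

# Relative risk of the original (placeholder) data, using indexing.
rr_obs = (observed[0, 0] / observed[0].sum()) / (observed[1, 0] / observed[1].sum())

n_boot = 2000
boot_rr = np.zeros(n_boot)
for i in range(n_boot):
    # Resample each group separately, keeping its original size.
    drug_re = rng.choice(drug, size=len(drug), replace=True)
    placebo_re = rng.choice(placebo, size=len(placebo), replace=True)
    # The mean of a 0/1 array is the proportion infected.
    boot_rr[i] = drug_re.mean() / placebo_re.mean()

# Percentile-method 95% interval (an assumed choice of method).
ci_low, ci_high = np.percentile(boot_rr, [2.5, 97.5])
print(rr_obs, ci_low, ci_high)
```

Resampling the groups separately keeps each group's own infection rate, unlike the null-hypothesis simulation, which pools everyone together.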
Compute the relative risk for the original data again, using indexing.