Final / Final.ipynb
Author: Bianka Aguirre

# Life Sciences 40

## Final Exam Part II

Type the names of all members of your group below. By submitting these names, you affirm that you have neither given nor received unauthorized help on this exam.

Bianka Aguirre, Annalis Soto, Anissa Torres, Cherina Dominguez

## Instructions (read all instructions carefully before proceeding, they differ from the Midterm):

• This is Part II of the final. It consists of 24 parts organized into three questions. The total point value is 150 points (43% of final exam grade), including 10 points for group participation, which will be assessed individually over CCLE.
• Please enter all of your answers either as code or markdown text into the appropriate place in a copy of this notebook on CoCalc (similar to homework). Some problems will require Python programming.
• Then, you will submit only one version of the final, for your entire group, via Gradescope. The Gradescope assignment allows group submission, so make sure you select and enter all members of your group.
• While screenshots are an acceptable alternative to uploading a PDF, we do not recommend photographing your computer screen with your phone due to low resolution, unless absolutely necessary.
• You may use your notes, assignments, slides, readings, solutions, and other resources on our LS 40 CCLE site and your CoCalc project (but not elsewhere on the internet).
• However, as always, you must show all of your work to receive full credit for each problem.
• If you have a clarifying question about the exam at any point during the exam period, email Professor Tingley. Questions about content or your own progress will not be answered.
• For technical glitches with Python, try "Kernel menu > Restart kernel" or Backups in files view first.
• Gradescope will forbid uploads after 3:00 pm Pacific Time on Thursday, March 18, 2021. Please plan accordingly.

We recommend importing libraries first:

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import linregress
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from PIL import Image  # Image.open is used below to display screenshots of the table and passage


Suppose you purchase a bag of 35 M&Ms and count the following colors:

• Blue: 7
• Orange: 10
• Green: 2
• Yellow: 5
• Red: 4
• Brown: 7

#### a. You first wonder whether the probability of getting any given color of M&M is more likely than any other color. What single statistical test would you use to test this hypothesis? If multiple suitable tests are possible, justify your choice. (5 points)

We would use a chi-squared goodness-of-fit test, because it compares the observed category frequencies with the frequencies expected under a hypothesized distribution, which is exactly what this situation requires.

#### b. What would be the observed test statistic for the appropriate test suggested in (a)? [note: in your calculations, do not round to whole M&Ms, fractions of M&Ms are OK] (5 points)

In [5]:
def chi_squ(obs, exp):
    # chi-squared statistic: sum of (observed - expected)^2 / expected
    result = np.sum(((obs - exp)**2) / exp)
    return result

In [6]:
exp=35/6 #expected count of each color in a bag of 35 M&Ms
exp

5.833333333333333

Using the function above with the observed counts [7, 10, 2, 5, 4, 7] and an expected count of ≈5.83 per color, the chi-squared statistic is ((7−5.83)² + (10−5.83)² + (2−5.83)² + (5−5.83)² + (4−5.83)² + (7−5.83)²)/5.83 ≈ 6.7
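As a quick cross-check (not part of the required hand calculation), scipy's built-in goodness-of-fit test reproduces this statistic; by default it compares the observed counts against uniform expected frequencies (35/6 per color):

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([7, 10, 2, 5, 4, 7])  # blue, orange, green, yellow, red, brown
stat, p = chisquare(obs)             # expected frequencies default to uniform (35/6 each)
print(stat, p)                       # statistic ≈ 6.66, consistent with the ≈ 6.7 above
```

The small difference from 6.7 comes from rounding the expected count to 5.83 in the hand calculation.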

#### c. Calculate the probability of observing a test statistic of equal or greater magnitude purely due to random chance under the null hypothesis that frequencies are equal. (5 points)

In [43]:
sim=np.zeros(10000)
null=10*["B"]+10*["O"]+10*["G"]+10*["Y"]+10*["R"]+10*["BR"]  #"big box" with equal color proportions

for i in range(10000):
    sample=np.random.choice(null,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-5.83)**2/5.83)+((o-5.83)**2/5.83)+((g-5.83)**2/5.83)+((y-5.83)**2/5.83)+((r-5.83)**2/5.83)+((br-5.83)**2/5.83)
    sim[i]=chi

p=sns.displot(sim)
plt.axvline(6.7,color="red")

pval=(np.sum(sim>=6.7))/10000
print("The p value for our simulation is:", pval)

The p value for our simulation is: 0.2359

#### d. Is the probability calculated in part (c) one-tailed or two-tailed? Why or why not? (5 points)

This is a one-tailed (right-tailed) probability, because the chi-squared statistic takes only positive values: it is a sum of squared distances between observed and expected counts. If the observed data exactly matched the expected data, the statistic would be 0, and it grows as the differences between observed and expected become more extreme. "More extreme" can therefore only mean larger, i.e. the right tail.

#### e. What is your inference from the probability calculated in part (c)? (5 points)

From the p-value we calculated, 0.2359, we cannot reject the null hypothesis that M&M packs contain equal proportions of each color, because it is not below the significance threshold of 0.05. Our observed color distribution is therefore consistent with random chance.

#### f. Emotionally invested in this discovery, you show your results to your roommate. Looking at the sorted colors of M&Ms on your desk, she comments, “it seems unlikely that you would draw 10 orange and only 2 green just by chance if the probabilities are truly equal. Can you just calculate that specific probability for me?” Knowing everything you learned in LS40 and based off the preceding analysis, do you comply with her request? If yes, explain how you would calculate the requested probability. If not, defend your choice. (5 points)

Yes, we can find that probability using a basic big-box null-hypothesis test. We want the probability of getting exactly 10 orange and 2 green M&Ms in the same bag. With a big box of equal color proportions, we can repeatedly resample bags of 35 M&Ms and count how often that specific combination occurs. The null hypothesis is that there is no difference in the proportions of each M&M color, and the resampling tells us how surprising 10 orange with 2 green in the same bag would be under that hypothesis.
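A minimal sketch of that resampling (my own illustration, not the exam's required solution): draw bags of 35 from the equal-proportions box and count how often exactly 10 orange and exactly 2 green appear together.

```python
import numpy as np

rng = np.random.default_rng(0)
box = 10*["B"] + 10*["O"] + 10*["G"] + 10*["Y"] + 10*["R"] + 10*["BR"]  # equal proportions

n_sims = 10_000
hits = 0
for _ in range(n_sims):
    sample = rng.choice(box, 35)
    # exactly 10 orange AND exactly 2 green in the same simulated bag
    if np.sum(sample == "O") == 10 and np.sum(sample == "G") == 2:
        hits += 1

prob = hits / n_sims
print("Estimated probability:", prob)
```

Note that this estimates the probability of one exact outcome, which is much narrower than the "equal or more extreme" tail probability from part (c).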

#### After a night of google searches, you discover that all M&Ms are manufactured at just two different plants in the United States, each packaging a slightly different proportion of colors. Each plant produces approximately 50% of all M&Ms sold in the United States per year. The color frequencies for these two factories are shown in the table below:

In [7]:
Image.open("Table.PNG")


#### g. These new factory percentages have changed your expectations. Using the same analytic framework as before, calculate two observed test statistics based on the data from your original bag of M&Ms. One test statistic should relate to the expectations from the New Jersey factory and the other test statistic should relate to the expectations from the Tennessee factory. [note: again, do not round to whole M&Ms] (8 points)

In [38]:
#new jersey: expected counts are 35 x (0.25, 0.25, 0.125, 0.125, 0.125, 0.125)
newjersey=250*["B"]+250*["O"]+125*["G"]+125*["Y"]+125*["R"]+125*["BR"]
newjer_chi_obs=((7-8.75)**2/8.75)+((10-8.75)**2/8.75)+((2-4.375)**2/4.375)+((5-4.375)**2/4.375)+((4-4.375)**2/4.375)+((7-4.375)**2/4.375)
print("Our new chi squared for new jersey is:",newjer_chi_obs)

#tennessee: expected counts are 35 x (0.207, 0.205, 0.198, 0.135, 0.131, 0.124)
tennessee=207*["B"]+205*["O"]+198*["G"]+135*["Y"]+131*["R"]+124*["BR"]
tenn_chi_obs=((7-7.245)**2/7.245)+((10-7.175)**2/7.175)+((2-6.93)**2/6.93)+((5-4.725)**2/4.725)+((4-4.585)**2/4.585)+((7-4.34)**2/4.34)
print("Our new chi squared for tennessee is:",tenn_chi_obs)


Our new chi squared for new jersey is: 3.5142857142857142 Our new chi squared for tennessee is: 6.348735833832281

#### h. Calculate the probability of opening a bag of M&Ms and finding the frequency of colors that you observed (or a frequency distribution more extreme), given the expected frequencies from each factory. Your answer should calculate two probabilities, one for each factory. [hint: For your box model, imagine each factory produces 1000 M&Ms of the expected proportions. Refer to homework 7 solutions if you’re having trouble coding it.] (8 points)

In [39]:
#new jersey factory
sim_nj=np.zeros(10000)
newjersey=250*["B"]+250*["O"]+125*["G"]+125*["Y"]+125*["R"]+125*["BR"]

for i in range(10000):
    sample=np.random.choice(newjersey,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-8.75)**2/8.75)+((o-8.75)**2/8.75)+((g-4.375)**2/4.375)+((y-4.375)**2/4.375)+((r-4.375)**2/4.375)+((br-4.375)**2/4.375)
    sim_nj[i]=chi

p=sns.displot(sim_nj)
plt.axvline(newjer_chi_obs,color="red")

pval_nj=(np.sum(sim_nj>=newjer_chi_obs))/10000
print("The p value for our simulation for new jersey is:", pval_nj)

The p value for our simulation for new jersey is: 0.6439
In [41]:
#tennessee
sim_ten=np.zeros(10000)
tennessee=207*["B"]+205*["O"]+198*["G"]+135*["Y"]+131*["R"]+124*["BR"]

for i in range(10000):
    sample=np.random.choice(tennessee,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-7.245)**2/7.245)+((o-7.175)**2/7.175)+((g-6.93)**2/6.93)+((y-4.725)**2/4.725)+((r-4.585)**2/4.585)+((br-4.34)**2/4.34)
    sim_ten[i]=chi

p=sns.displot(sim_ten)
plt.axvline(tenn_chi_obs,color="red")

pval_tenn=(np.sum(sim_ten>=tenn_chi_obs))/10000
print("The p value for our simulation for tennessee is:", pval_tenn)

The p value for our simulation for tennessee is: 0.2675

#### i. Based on your calculations, which factory is more likely to produce a color frequency matching your bag of M&Ms? How certain are you? Can you rule out one of the two factories as the producer? Support your answer only with information from your previous calculations. (5 points)

The New Jersey factory gave a p-value of about 0.6439, which is above the significance threshold of 0.05, so we cannot reject the null hypothesis that the bag came from New Jersey. The New Jersey p-value is also much higher than the Tennessee p-value of 0.2675, which tells us our bag's color frequencies are more consistent with the New Jersey factory. We are fairly confident of this, because the New Jersey proportions align closely with our bag. However, we cannot explicitly rule out Tennessee as the producer: its p-value is also above the significance threshold, so the observed counts could plausibly arise from either factory's proportions by random chance.

#### j. Given your calculations, what is the probability that your bag of M&Ms was made in Tennessee? [Hint: Note the difference between this question and that asked in part h. Remember, from above, that the factories produce equal numbers.] (6 points)

We would build a box model that allocates 50% of bags to each factory location (reflecting their equal production), run the simulation 10,000 times, and estimate the probability that a bag matching our observed counts came from the Tennessee factory. This is a big-box test.
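One rough shortcut (my assumption, not necessarily the intended simulation): treat the simulated tail probabilities from part (h) as relative likelihoods of the bag under each factory, and combine them with the equal 50/50 priors via Bayes' rule.

```python
# Simulated p-values from part (h), used here as rough relative likelihoods
p_nj, p_tn = 0.6439, 0.2675
prior = 0.5  # each factory produces 50% of all M&Ms sold

post_tn = (p_tn * prior) / (p_tn * prior + p_nj * prior)
print("P(Tennessee | bag) is roughly", round(post_tn, 3))  # ≈ 0.29
```

Because the priors are equal they cancel, and the posterior is simply the Tennessee value divided by the sum of the two.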

#### k. Determined to know the truth, you open 100 more bags, carefully counting up the total numbers of each color. You re-do your simulations, eventually finding that the expected color frequency from Tennessee would produce your observed frequency 85% of the time, while the expected color frequency from New Jersey would produce your observed frequency only 4% of the time. Given this new knowledge, what is the probability that the M&Ms you purchased were made in Tennessee? [Assume that all M&Ms purchased came from the same factory.] (6 points)

We would use the same framework as in (j), but now we have new data to account for: with 100 more bags, the likelihood of our observed frequencies is 85% under Tennessee and only 4% under New Jersey. Updating our calculation with these values gives a revised probability that the bag was made in Tennessee.
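A sketch of that update (assuming, as the problem states, equal 50/50 priors and treating the quoted 85% and 4% as the likelihoods of the observed counts under each factory):

```python
# Bayes' rule with equal 50/50 priors; 85% and 4% are the stated
# likelihoods of the observed frequencies under each factory
like_tn, like_nj = 0.85, 0.04
post_tn = (like_tn * 0.5) / (like_tn * 0.5 + like_nj * 0.5)
print("P(Tennessee | data) is roughly", round(post_tn, 3))  # ≈ 0.955
```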

#### l. Explain why your probability changed from part (j) to (k), referring explicitly to the “reallocation of credibility”. (5 points)

Reallocation of credibility tells us that when our circumstances change, so does the probability we calculate. The more information we gather, the more we must "reallocate" credibility from our old beliefs toward ones updated by the new data, to keep our probabilities as accurate as possible. This is relevant to (j) and (k) because in (j) we used just 1 bag of candy, while in (k) we have 100 more bags. That new knowledge gets integrated into our calculation, shifting credibility strongly toward the Tennessee factory and improving accuracy.

#### The following passage was published in the Journal of Hospital Infection (2020, vol. 105, pp. 104–105) on the role of N95 masks in preventing transmission of the virus SARS-CoV-2:

In [8]:
Image.open("Covid.PNG")


#### a. For each of the bolded and underlined numbers, classify each into one of the following categories: effect size, p-value, confidence interval, sample size. (8 points)

The effect size is 4.65%.

The p-value is 2.2e-16.

The confidence interval is the 95% interval (1.75%, infinite).

The sample size is 491 medical staff.

#### b. For each p-value in part a, give a plausible null hypothesis that is being tested. (5 points)

There is no difference in 2019-nCoV infection rates among medical staff between the no-mask group and the N95 respirator group.

#### c. Note that the authors report something as “infinite.” Based on the information provided in the above passage, why did the authors calculate the upper bound as “infinite”? Is infinite a plausible upper bound? (5 points)

The authors calculated the upper bound as infinite because, at the time of the study, the scientists did not have enough data to set an actual upper bound. Infection rates can fluctuate, and many other factors about the novel virus were not yet known. An infinite upper bound is not very plausible; it served as a placeholder for the time being.

## 3. (S)querulous Correlations

#### A set of wildlife biologists set out to study the relationship between tail length and tail bushy-ness in UCLA squirrels. Below are sets of x-y measurements (x = length in cm, y = bushy-ness in mm) for each of 11 individual squirrels:

In [105]:
X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]


#### a. Calculate the Pearson correlation coefficient for the above set of data. (5 points)

In [106]:
xarr=np.array(X1)
yarr=np.array(Y1)

In [107]:
pearsonr(xarr,yarr)

(0.8162365060002428, 0.0021788162369107975)

The Pearson correlation coefficient is 0.8162365060002428 (the second value returned, 0.0022, is the associated p-value).
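As a sanity check (not required by the exam), r can also be computed directly from its definition as the covariance scaled by the two standard deviations:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])

# Pearson r = covariance(x, y) / (sd_x * sd_y), written with deviations from the mean
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(r)  # ≈ 0.816, matching pearsonr
```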

#### b. Use a suitable resampling method to estimate whether there is a significant correlation between tail length and tail bushy-ness. (5 points)

In [108]:
df = pd.DataFrame(np.column_stack([X1,Y1]))

df.columns = ["X1","Y1"]


In [109]:
corrObs = pearsonr(df["X1"],df["Y1"])[0]

In [110]:
tail=list(df["X1"])
bushy=list(df["Y1"])
df_data=np.zeros(10000)
for i in range(10000):
    np.random.shuffle(tail)
    df_data[i]=pearsonr(tail,bushy)[0]

p=sns.displot(df_data,kde=False)
plt.axvline(-corrObs,color='red')
plt.axvline(corrObs,color='red')
pval=(np.sum(df_data<=-corrObs)+np.sum(df_data>=corrObs))/10000
print ("The observed correlation coefficient is",corrObs)
print("The p-value is",pval)

The observed correlation coefficient is 0.8162365060002428 The p-value is 0.0005

#### c. In part (b), did you calculate a 1- or 2-sided p-value? Justify your decision in the context of the data, the test, and the research goals. (5 points)

This p-value is two-sided because we counted statistical significance in both directions. We are testing whether there is a relationship between tail length and tail bushy-ness in UCLA squirrels in either direction, positive or negative, since the researchers had no prior directional hypothesis.

#### d. Encouraged by the results, the researchers are interested in learning how much bushier UCLA squirrel tails are for every cm of length. Calculate this statistic and provide an appropriate confidence interval. [Assume that tail lengths are measured and reported exactly.] (6 points)

In [111]:
reg = linregress(df["X1"], df["Y1"])
slope=reg.slope
intercept=reg.intercept
X_plot = np.linspace(4, 13, 100)
Y_plot = slope*X_plot+intercept
p=sns.lmplot(x="X1",y="Y1",data=df,fit_reg=False)
plt.plot(X_plot,Y_plot)
print("The slope for the regression line is",slope)
print("The y-intercept for the regression line is",intercept)

The slope for the regression line is 0.5000000000000001 The y-intercept for the regression line is 3.000909090909089
In [112]:
reg_slope=np.zeros(10000)
for i in range(10000):
    rand_samp=df.sample(len(df),replace=True)
    t=list(rand_samp["X1"])
    bu=list(rand_samp["Y1"])
    reg = linregress(t, bu)
    reg_slope[i]=reg.slope

reg_slope.sort()
#reverse-percentile bootstrap: indices 49 and 9949 bracket the middle 99% of 10000 sorted slopes
slope_upper=2*slope-reg_slope[49]
slope_lower=2*slope-reg_slope[9949]
print ("The 99% confidence intervals are",
(slope_lower,slope_upper))

The 99% confidence intervals are (0.007389162561576401, 0.9844117647058825)

The statistic shows that with every 1 cm increase in tail length, bushy-ness increases by about 0.5 mm. The 99% confidence interval for the slope is approximately (0.0074, 0.9844), per the printed output above (bootstrap intervals vary slightly from run to run).

#### e. Part (d) asks you to assume that tail lengths are measured and reported exactly. How would your analysis in part (d) change if tail length measurements had instead been rounded to the nearest cm? [You don't have to conduct this alternative analysis, just describe how your analysis would change] (5 points)

This would change our analysis because the tail lengths would no longer be exact measurements. Ordinary least squares assumes the x-values (tail lengths) are exact, attributing all error to y. Since rounding to the nearest cm introduces measurement error in the tail lengths, we would instead use orthogonal regression, which accounts for error in both variables. Because measurement error in x attenuates the OLS slope toward zero, the orthogonal-regression slope would typically be steeper.
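One way to sketch the alternative analysis (an illustration, not the solution the exam requires) is scipy's orthogonal distance regression, fit here to the UCLA data from earlier:

```python
import numpy as np
from scipy import odr

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])

def linear(beta, x):
    # beta[0] = slope, beta[1] = intercept
    return beta[0] * x + beta[1]

model = odr.Model(linear)
data = odr.RealData(x, y)               # could also carry measurement uncertainties sx, sy
fit = odr.ODR(data, model, beta0=[0.5, 3.0]).run()  # start near the OLS estimates
print("orthogonal slope, intercept:", fit.beta)
```

Because perpendicular distances are minimized, the fitted slope comes out somewhat steeper than the OLS slope of 0.5.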

#### f. A competing set of wildlife biologists from USC decide to replicate the squirrel study and come up with the following measurements from 11 individuals:

In [15]:
X2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


#### With this new set of data, calculate the Pearson's correlation coefficient and associated p-value (as in parts a–b). (8 points)

In [16]:
xarr2=np.array(X2)
yarr2=np.array(Y2)

In [17]:
pearsonr(xarr2,yarr2)

(0.8162867394895982, 0.002176305279228025)

The Pearson correlation coefficient is 0.8162867394895982, essentially identical to the UCLA value, and the associated p-value is 0.0022.

In [22]:
df2 = pd.DataFrame(np.column_stack([X2,Y2]))

df2.columns = ["X2","Y2"]

In [23]:
corrObs2 = pearsonr(df2["X2"],df2["Y2"])[0]

In [26]:
reg = linregress(df2["X2"], df2["Y2"])
slope=reg.slope
intercept=reg.intercept
X_plot2 = np.linspace(4, 13, 100)
Y_plot2 = slope*X_plot2+intercept
p2=sns.lmplot(x="X2",y="Y2",data=df2,fit_reg=False)
plt.plot(X_plot2,Y_plot2)
print("The slope for the regression line is",slope)
print("The y-intercept for the regression line is",intercept)

The slope for the regression line is 0.4997272727272729 The y-intercept for the regression line is 3.002454545454544

#### g. From these two studies, what can we conclude about whether longer tails cause bushier tails in squirrels? (5 points)

We can conclude that tail length and tail bushy-ness in squirrels are strongly correlated, given the high correlation coefficient of about 0.816 in each study. However, this data does not allow us to conclude anything about causation, because these are observational studies rather than experiments.

#### h. Looking at the data produced, the UCLA and USC scientists come together and conclude that the relationship between tail length and tail bushy-ness in squirrels is the same on both campuses. You don’t agree. Convince them otherwise, using at least one appropriate graph to support your position. (10 points)

The graphs above show that we cannot conclude the relationship between tail length and tail bushy-ness is the same on both campuses. Although the two datasets yield nearly identical correlation coefficients and regression lines, plotting them reveals very different patterns: the UCLA points follow a curved arc around the regression line, while the USC points lie close to a straight line except for one extreme outlier. Identical summary statistics are masking different underlying relationships.
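A side-by-side plot makes the argument directly (a sketch using the two datasets above; the panel layout is my choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y_ucla = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])
y_usc = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, y, school in zip(axes, [y_ucla, y_usc], ["UCLA", "USC"]):
    reg = linregress(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    ax.scatter(x, y)
    ax.plot(xs, reg.slope * xs + reg.intercept, color="red")
    # nearly identical slope and r in both panels, despite very different scatter
    ax.set_title(f"{school}: slope = {reg.slope:.3f}, r = {reg.rvalue:.3f}")
    ax.set_xlabel("tail length (cm)")
axes[0].set_ylabel("bushy-ness (mm)")
```

The matching titles and clashing point patterns are exactly the contrast that should convince the two teams.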

#### i. Thanks to your persuasive evidence, the two teams of researchers conclude that the relationship between tail length and bushy-ness is not the same between the two schools. In order for them to not make the same mistake again, what final lesson would you impart? [Note: this final lesson should be considered as one of the most enduring mantras of LS40!] (5 points)

We cannot automatically conclude that two relationships are the same just by looking at a significant p-value (below 0.05) and a correlation coefficient. The lesson is: always visualize your data before drawing conclusions. A p-value below 0.05 is not the be-all and end-all; identical summary statistics can arise from very different underlying patterns, and only plotting the data reveals this.
