Type the names of all members of your group below. By submitting these names, you affirm that you have neither given nor received unauthorized help on this exam.


Bianka Aguirre, Annalis Soto, Anissa Torres, Cherina Dominguez


- This is Part II of the final. It consists of 24 parts organized into three questions. The total point value is 150 points (43% of final exam grade), including 10 points for group participation, which will be assessed individually over CCLE.
- Please enter all of your answers either as code or markdown text into the appropriate place in a copy of this notebook on CoCalc (similar to homework). Some problems will require Python programming.
- Then, you will only submit **one** version of the midterm, for your entire group, via Gradescope. The Gradescope assignment for the Midterm allows group participation, so make sure you select and enter all members of your group.
- Since you are completing this on CoCalc but uploading to Gradescope, we recommend that you wait to upload until you have completed the full exam on CoCalc. At that point, on CoCalc, select File / Download as… / PDF. CoCalc will convert to a PDF that you can download. This entire PDF can then be easily uploaded to Gradescope, and you can manually select where each answer is for each question.
- Please be aware, when uploading PDFs and your answers to Gradescope, that page-breaks can accidentally hide text. Additionally, if you answer questions via code comments (as opposed to markdown text), unless you manually create line breaks in your comment, the full text answer may not be legible. After uploading to Gradescope, please double-check all of your answers to make sure that they are legible and clear.
- While screen shots are an acceptable alternative to uploading PDF, due to low resolution, we do not recommend taking photos of your computer screen with your phone, unless absolutely necessary.
- Again, **we will not grade any material left on CoCalc**. In order to receive a grade, you must submit – as a group – to Gradescope. Don't forget to add group members to your submission on Gradescope!
- You may use your notes, assignments, slides, readings, solutions, and other resources on our LS 40 CCLE site and your CoCalc project (but not elsewhere on the internet).
- However, as always, you must show all of your work to receive full credit for each problem.
- If you have a clarifying question about the exam at any point during the exam period, email Professor Tingley. Questions about content or your own progress will not be answered.
- For technical glitches with Python, try "Kernel menu > Restart kernel" or Backups in files view first.
- Gradescope will forbid uploads after 3:00 pm Pacific Time on Thursday, March 18, 2021. Please plan accordingly.


We recommend importing libraries first:


In [4]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import linregress, spearmanr, pearsonr
from PIL import Image  # used below for Image.open(); the duplicate IPython.display.Image import was dropped, since PIL's Image shadowed it anyway


- Blue: 7
- Orange: 10
- Green: 2
- Yellow: 5
- Red: 4
- Brown: 7


This would be a chi-squared goodness-of-fit test, because that is the test that compares observed frequencies with the theoretical (expected) frequencies, which is exactly what this situation calls for.


In [5]:

def chi_squ(obs, exp):
    result = np.sum(((obs - exp)**2) / exp)
    return result


In [6]:

exp = 35/6  # expected count per color under the null
exp


5.833333333333333

Chi-squared value: ((7-5.83)**2/5.83) + ((10-5.83)**2/5.83) + ((2-5.83)**2/5.83) + ((5-5.83)**2/5.83) + ((4-5.83)**2/5.83) + ((7-5.83)**2/5.83) ≈ 6.7
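As a sanity check, scipy's built-in goodness-of-fit test reproduces this statistic (a cross-check, not part of the original solution; using the exact expected count 35/6 rather than the rounded 5.83 gives a slightly smaller value):

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([7, 10, 2, 5, 4, 7])   # B, O, G, Y, R, BR
exp = np.full(6, 35 / 6)              # equal expected counts under the null
stat, p_theory = chisquare(obs, f_exp=exp)
print(stat)                           # ~6.657 with exact expecteds; 6.7 above used the rounded 5.83
```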


In [43]:

sim = np.zeros(10000)
null = 10*["B"] + 10*["O"] + 10*["G"] + 10*["Y"] + 10*["R"] + 10*["BR"]
for i in range(10000):
    sample = np.random.choice(null, 35)
    b = np.sum(sample == "B")
    o = np.sum(sample == "O")
    g = np.sum(sample == "G")
    y = np.sum(sample == "Y")
    r = np.sum(sample == "R")
    br = np.sum(sample == "BR")
    chi = ((b-5.83)**2/5.83) + ((o-5.83)**2/5.83) + ((g-5.83)**2/5.83) + ((y-5.83)**2/5.83) + ((r-5.83)**2/5.83) + ((br-5.83)**2/5.83)
    sim[i] = chi
p = sns.displot(sim)
plt.axvline(6.7, color="red")
pval = np.sum(sim >= 6.7) / 10000
print("The p value for our simulation is:", pval)


The p value for our simulation is: 0.2359


This will be one-tailed (right-tailed) because the chi-squared statistic takes only positive values and represents a sum of squared distances from the expected values. If the observed data exactly equal the expected data, the chi-squared statistic is 0, since the two are identical under the null hypothesis. The statistic grows as the differences between observed and expected become more extreme.


From the p-value we calculated, 0.2359, we cannot reject the null hypothesis that M&M packs have the same proportions of each color, because it is not under the significance threshold of 0.05. This tells us that our observed color distribution was plausibly due to random chance.


Yes, we can find that probability. We can use a basic big-box null hypothesis statistical test, because we want the probability of getting 10 orange and 2 green M&Ms in the same bag. With a big box, we can resample at the same sample size of 35 M&Ms and find the probability of that specific combination. The null hypothesis here would be that there is no difference in the ratios of the M&M colors, and we could then see how probable getting 10 orange and 2 green in the same bag is under that null.
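A minimal sketch of the big-box simulation just described, assuming the question asks for the probability of exactly 10 orange and exactly 2 green under the equal-proportions null (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility
colors = ["B", "O", "G", "Y", "R", "BR"]
n_sim = 10000
hits = 0
for _ in range(n_sim):
    bag = rng.choice(colors, 35)      # equal-proportions "big box" null
    if np.sum(bag == "O") == 10 and np.sum(bag == "G") == 2:
        hits += 1
p_est = hits / n_sim
print("Estimated probability of exactly 10 orange and 2 green:", p_est)
```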


In [7]:

Image.open("Table.PNG")


In [38]:

# New Jersey
newjersey = 250*["B"] + 250*["O"] + 125*["G"] + 125*["Y"] + 125*["R"] + 125*["BR"]
newjer_chi_obs = ((7-8.75)**2/8.75) + ((10-8.75)**2/8.75) + ((2-4.375)**2/4.375) + ((5-4.375)**2/4.375) + ((4-4.375)**2/4.375) + ((7-4.375)**2/4.375)
print("Our new chi squared for new jersey is:", newjer_chi_obs)

# Tennessee
tennessee = 207*["B"] + 205*["O"] + 198*["G"] + 135*["Y"] + 131*["R"] + 124*["BR"]
tenn_chi_obs = ((7-7.245)**2/7.245) + ((10-7.175)**2/7.175) + ((2-6.93)**2/6.93) + ((5-4.725)**2/4.725) + ((4-4.585)**2/4.585) + ((7-4.34)**2/4.34)
print("Our new chi squared for tennessee is:", tenn_chi_obs)


Our new chi squared for new jersey is: 3.5142857142857142
Our new chi squared for tennessee is: 6.348735833832281
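scipy's built-in test reproduces both factory statistics; the expected counts are just 35 times each factory's color proportions (a cross-check, not part of the original solution):

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([7, 10, 2, 5, 4, 7])                             # observed bag: B,O,G,Y,R,BR
exp_nj = 35 * np.array([0.25, 0.25, 0.125, 0.125, 0.125, 0.125])
exp_tn = 35 * np.array([0.207, 0.205, 0.198, 0.135, 0.131, 0.124])
stat_nj, _ = chisquare(obs, f_exp=exp_nj)
stat_tn, _ = chisquare(obs, f_exp=exp_tn)
print(stat_nj, stat_tn)   # ≈ 3.514 and ≈ 6.349, matching the printed values above
```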


In [39]:

# New Jersey factory
sim_nj = np.zeros(10000)
newjersey = 250*["B"] + 250*["O"] + 125*["G"] + 125*["Y"] + 125*["R"] + 125*["BR"]
for i in range(10000):
    sample = np.random.choice(newjersey, 35)
    b = np.sum(sample == "B")
    o = np.sum(sample == "O")
    g = np.sum(sample == "G")
    y = np.sum(sample == "Y")
    r = np.sum(sample == "R")
    br = np.sum(sample == "BR")
    chi = ((b-8.75)**2/8.75) + ((o-8.75)**2/8.75) + ((g-4.375)**2/4.375) + ((y-4.375)**2/4.375) + ((r-4.375)**2/4.375) + ((br-4.375)**2/4.375)
    sim_nj[i] = chi
p = sns.displot(sim_nj)
plt.axvline(newjer_chi_obs, color="red")
pval_nj = np.sum(sim_nj >= newjer_chi_obs) / 10000
print("The p value for our simulation for new jersey is:", pval_nj)


The p value for our simulation for new jersey is: 0.6439

In [41]:

# Tennessee factory
sim_ten = np.zeros(10000)
tennessee = 207*["B"] + 205*["O"] + 198*["G"] + 135*["Y"] + 131*["R"] + 124*["BR"]
for i in range(10000):
    sample = np.random.choice(tennessee, 35)
    b = np.sum(sample == "B")
    o = np.sum(sample == "O")
    g = np.sum(sample == "G")
    y = np.sum(sample == "Y")
    r = np.sum(sample == "R")
    br = np.sum(sample == "BR")
    chi = ((b-7.245)**2/7.245) + ((o-7.175)**2/7.175) + ((g-6.93)**2/6.93) + ((y-4.725)**2/4.725) + ((r-4.585)**2/4.585) + ((br-4.34)**2/4.34)
    sim_ten[i] = chi
p = sns.displot(sim_ten)
plt.axvline(tenn_chi_obs, color="red")
pval_tenn = np.sum(sim_ten >= tenn_chi_obs) / 10000
print("The p value for our simulation for tennessee is:", pval_tenn)


The p value for our simulation for tennessee is: 0.2675


The New Jersey factory showed a p-value of about 0.6439, so we cannot reject the null hypothesis, because it is over the significance threshold of 0.05. Additionally, the New Jersey p-value is much higher than the p-value for the Tennessee factory, which was 0.2675. This tells us there is a higher chance our pack came from the New Jersey factory. We are fairly confident of this because the proportions produced in New Jersey align closely with our bag, but we cannot explicitly rule out Tennessee: both factories are above the significance threshold, meaning our bag's proportions are consistent with random chance from either. Under random chance, either factory is technically possible.


We would make a new data frame and allocate 50% to each factory location. We would then run a simulation 10,000 times to see what the probability is of getting our bag from the Tennessee factory. This would be a big-box test.
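The 50/50 scheme described above can also be collapsed into an exact calculation: with a 50/50 prior over factories, Bayes' rule with multinomial likelihoods gives the probability the bag came from Tennessee. This is only a sketch of one concrete implementation, not the resampling method the answer describes:

```python
from scipy.stats import multinomial

obs = [7, 10, 2, 5, 4, 7]                              # observed bag: B,O,G,Y,R,BR
p_nj = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]        # New Jersey color proportions
p_tn = [0.207, 0.205, 0.198, 0.135, 0.131, 0.124]      # Tennessee color proportions
lik_nj = multinomial.pmf(obs, n=35, p=p_nj)
lik_tn = multinomial.pmf(obs, n=35, p=p_tn)
p_tenn = 0.5 * lik_tn / (0.5 * lik_nj + 0.5 * lik_tn)  # 50/50 prior over factories
print("P(Tennessee | bag):", p_tenn)
```

Consistent with the chi-squared comparison above (3.51 for New Jersey vs. 6.35 for Tennessee), this probability comes out below one half.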


We would use the same simulation as in (j), but now we have new data to account for, so we would alter our assumed frequencies to match the findings in the 100 new bags. This would give us an updated probability that our bag was made in Tennessee.


Reallocation of credibility tells us that when our circumstances change, so does the probability we calculated. The more information we gather, the more we have to "reallocate" our credibility from the old data to the updated data to get the most accurate probabilities. This is relevant to (j) and (k) because in (j) we used just 1 bag of candy, while in (k) we now have 100 more bags. This gives us new knowledge to integrate into our calculations, and more accuracy as well.


In [8]:

Image.open("Covid.PNG")


The effect size is 4.65%.


The p-value is 2.2e-16.


The confidence interval is a 95% CI, ranging from 1.75% to infinity.


The sample size is 491 medical staff.


There is no difference in the 2019-nCoV infection rate of medical staff between the no-mask group and the N95 respirator group.


The authors report an infinite upper bound because, at the time of the study, the scientists did not have enough data to actually set an upper bound. Infection rates can fluctuate, and many other factors about the novel virus were not yet known. An infinite upper bound is not very plausible in itself, but it served as a placeholder for the time being.


In [105]:

X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]


In [106]:

xarr = np.array(X1)
yarr = np.array(Y1)


In [107]:

pearsonr(xarr,yarr)


(0.8162365060002428, 0.0021788162369107975)

The Pearson correlation coefficient is 0.8162365060002428.
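Since spearmanr was imported above, a rank-based coefficient offers a quick supplementary check (not required by the problem): a noticeably lower Spearman value hints that the relationship, while strong, is not a clean monotonic-linear one.

```python
from scipy.stats import spearmanr

X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
rho, p_rho = spearmanr(X1, Y1)
print(rho)   # ≈ 0.691, noticeably below the Pearson value of ≈ 0.816
```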


In [108]:

df = pd.DataFrame(np.column_stack([X1, Y1]))
df.columns = ["X1", "Y1"]


In [109]:

corrObs = pearsonr(df["X1"],df["Y1"])[0]


In [110]:

tail = list(df["X1"])
bushy = list(df["Y1"])
df_data = np.zeros(10000)
for i in range(10000):
    np.random.shuffle(tail)
    df_data[i] = pearsonr(tail, bushy)[0]
p = sns.displot(df_data, kde=False)
plt.axvline(-corrObs, color='red')
plt.axvline(corrObs, color='red')
pval = (np.sum(df_data <= -corrObs) + np.sum(df_data >= corrObs)) / 10000
print("The observed correlation coefficient is", corrObs)
print("The p-value is", pval)


The observed correlation coefficient is 0.8162365060002428
The p-value is 0.0005


This p-value is two-sided because we calculated statistical significance in both directions. We are testing for a relationship between tail length and tail bushiness in UCLA squirrels in either direction, positive or negative, which matches the goals of the study.
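As a cross-check on the shuffle test, pearsonr itself returns an analytic two-sided p-value as its second element; under the usual normality assumptions it comes out broadly consistent with the permutation result:

```python
from scipy.stats import pearsonr

X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
r, p_analytic = pearsonr(X1, Y1)
print(p_analytic)   # ≈ 0.0022, two-sided, same ballpark as the permutation p of 0.0005
```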


In [111]:

reg = linregress(df["X1"], df["Y1"])
slope = reg.slope
intercept = reg.intercept
X_plot = np.linspace(4, 13, 100)
Y_plot = slope*X_plot + intercept
p = sns.lmplot(x="X1", y="Y1", data=df, fit_reg=False)
plt.plot(X_plot, Y_plot)
print("The slope for the regression line is", slope)
print("The y-intercept for the regression line is", intercept)


The slope for the regression line is 0.5000000000000001
The y-intercept for the regression line is 3.000909090909089

In [112]:

reg_slope = np.zeros(10000)
for i in range(10000):
    rand_samp = df.sample(len(df), replace=True)
    t = list(rand_samp["X1"])
    bu = list(rand_samp["Y1"])
    reg = linregress(t, bu)
    reg_slope[i] = reg.slope
reg_slope.sort()
slope_upper = 2*slope - reg_slope[49]
slope_lower = 2*slope - reg_slope[9949]
print("The 99% confidence intervals are", (slope_lower, slope_upper))


The 99% confidence intervals are (0.007389162561576401, 0.9844117647058825)

The statistic shows that with every 1 cm increase in tail length, bushiness increases by around 0.5 mm. The 99% confidence interval from the bootstrap above is approximately (0.007, 0.984).
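linregress also reports a standard error for the slope; a normal-approximation 99% interval (slope ± 2.576·stderr) is a sketch of an alternative to the bootstrap above, not the method the problem asks for. It comes out much narrower, which is expected since it leans on stronger distributional assumptions:

```python
from scipy.stats import linregress

X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
reg = linregress(X1, Y1)
lo = reg.slope - 2.576 * reg.stderr   # 2.576 = z for a 99% normal interval
hi = reg.slope + 2.576 * reg.stderr
print(lo, hi)                         # roughly (0.20, 0.80) around the slope of 0.5
```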


This would change our analysis because the tail lengths are rounded rather than exact. Ordinary least squares assumes the x-variable (tail length) is measured without error, but rounding to the nearest centimeter introduces error in the tail lengths. In that case we would use orthogonal regression, which allows for error in both variables. Since measurement error in x attenuates the OLS slope toward zero, accounting for it would give a somewhat steeper slope.
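A minimal total-least-squares (orthogonal) sketch via the principal axis of the covariance matrix, assuming equal error variances in both variables; on this dataset the orthogonal slope does come out slightly above the OLS slope of 0.5:

```python
import numpy as np

X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
x = np.array(X1, float)
y = np.array(Y1, float)
# the principal eigenvector of the covariance matrix gives the orthogonal-fit direction
evals, evecs = np.linalg.eigh(np.cov(x, y))
vx, vy = evecs[:, np.argmax(evals)]
slope_tls = vy / vx
print(slope_tls)   # ≈ 0.55, a bit steeper than the OLS slope of 0.5
```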


In [15]:

X2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


In [16]:

xarr2 = np.array(X2)
yarr2 = np.array(Y2)


In [17]:

pearsonr(xarr2,yarr2)


(0.8162867394895982, 0.002176305279228025)

The Pearson correlation coefficient is 0.8162867394895982.


In [22]:

df2 = pd.DataFrame(np.column_stack([X2, Y2]))
df2.columns = ["X2", "Y2"]


In [23]:

corrObs2 = pearsonr(df2["X2"],df2["Y2"])[0]


In [26]:

reg = linregress(df2["X2"], df2["Y2"])
slope = reg.slope
intercept = reg.intercept
X_plot2 = np.linspace(4, 13, 100)
Y_plot2 = slope*X_plot2 + intercept   # fixed: use X_plot2, not X_plot from the earlier cell
p2 = sns.lmplot(x="X2", y="Y2", data=df2, fit_reg=False)
plt.plot(X_plot2, Y_plot2)
print("The slope for the regression line is", slope)
print("The y-intercept for the regression line is", intercept)


The slope for the regression line is 0.4997272727272729
The y-intercept for the regression line is 3.002454545454544


We can conclude that longer tails and bushier tails in squirrels are highly correlated, given the high correlation coefficient of about 0.816 in each study. This data, however, does not allow us to conclude anything about causation, because this is an observational study, not an experimental one.


The graphs shown above indicate that we cannot conclude the relationship between squirrel tail length and tail bushiness is the same between the UCLA and USC campuses. When we graph the data, the UCLA points do not follow the regression line closely, showing a weaker visual fit, while the USC points lie very close to the line, showing a tighter visual fit.


We cannot automatically conclude the relationship is the same just by looking at a significant p-value below 0.05 and a matching correlation coefficient. This tells us we should always visualize data before drawing conclusions; a p-value below 0.05 is not the be-all, end-all statistic. More checks can always be done to build the most accurate picture of the data.
