Final / Final.ipynb
Author: Bianka Aguirre

# Life Sciences 40

## Final Exam Part II

Type the names of all members of your group below. By submitting these names, you affirm that you have neither given nor received unauthorized help on this exam.

Bianka Aguirre, Annalis Soto, Anissa Torres, Cherina Dominguez

## Instructions (read all instructions carefully before proceeding, they differ from the Midterm):

• This is Part II of the final. It consists of 24 parts organized into three questions. The total point value is 150 points (43% of final exam grade), including 10 points for group participation, which will be assessed individually over CCLE.
• Please enter all of your answers either as code or markdown text into the appropriate place in a copy of this notebook on CoCalc (similar to homework). Some problems will require Python programming.
• Then, you will submit only one version of the final, for your entire group, via Gradescope. The Gradescope assignment allows group submission, so make sure you select and enter all members of your group.
• While screenshots are an acceptable alternative to uploading a PDF, we do not recommend photographing your computer screen with your phone due to low resolution, unless absolutely necessary.
• You may use your notes, assignments, slides, readings, solutions, and other resources on our LS 40 CCLE site and your CoCalc project (but not elsewhere on the internet).
• However, as always, you must show all of your work to receive full credit for each problem.
• If you have a clarifying question about the exam at any point during the exam period, email Professor Tingley. Questions about content or your own progress will not be answered.
• For technical glitches with Python, try "Kernel menu > Restart kernel" or Backups in files view first.
• Gradescope will forbid uploads after 3:00 pm Pacific Time on Thursday, March 18, 2021. Please plan accordingly.

We recommend importing libraries first:

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import linregress
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from PIL import Image  # Image.open is used below to display screenshots of the table and passage


Suppose you purchase a bag of 35 M&Ms and count the following colors:

• Blue: 7
• Orange: 10
• Green: 2
• Yellow: 5
• Red: 4
• Brown: 7

#### a. You first wonder whether the probability of getting any given color of M&M is more likely than any other color. What single statistical test would you use to test this hypothesis? If multiple suitable tests are possible, justify your choice. (5 points)

We would use a chi-squared goodness-of-fit test, because it compares the observed category frequencies with the frequencies expected under a hypothesized distribution, which is exactly what this situation requires.

#### b. What would be the observed test statistic for the appropriate test suggested in (a)? [note: in your calculations, do not round to whole M&Ms, fractions of M&Ms are OK] (5 points)

In [5]:
def chi_squ(obs, exp):
    # chi-squared statistic: sum of (observed - expected)^2 / expected
    result = np.sum(((obs - exp)**2) / exp)
    return result

In [6]:
exp=35/6 #expected count of each color in a bag of 35 M&Ms
exp

5.833333333333333

Using the function above with the observed counts [7, 10, 2, 5, 4, 7] and an expected count of ≈5.83 per color, the chi-squared statistic is ((7−5.83)² + (10−5.83)² + (2−5.83)² + (5−5.83)² + (4−5.83)² + (7−5.83)²)/5.83 ≈ 6.7
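As a quick cross-check (not part of the required hand calculation), scipy's built-in goodness-of-fit test reproduces this statistic; by default it compares the observed counts against uniform expected frequencies (35/6 per color):

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([7, 10, 2, 5, 4, 7])  # blue, orange, green, yellow, red, brown
stat, p = chisquare(obs)             # expected frequencies default to uniform (35/6 each)
print(stat, p)                       # statistic ≈ 6.66, consistent with the ≈ 6.7 above
```

The small difference from 6.7 comes from rounding the expected count to 5.83 in the hand calculation.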

#### c. Calculate the probability of observing a test statistic of equal or greater magnitude purely due to random chance under the null hypothesis that frequencies are equal. (5 points)

In [43]:
sim=np.zeros(10000)
null=10*["B"]+10*["O"]+10*["G"]+10*["Y"]+10*["R"]+10*["BR"]  #"big box" with equal color proportions

for i in range(10000):
    sample=np.random.choice(null,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-5.83)**2/5.83)+((o-5.83)**2/5.83)+((g-5.83)**2/5.83)+((y-5.83)**2/5.83)+((r-5.83)**2/5.83)+((br-5.83)**2/5.83)
    sim[i]=chi

p=sns.displot(sim)
plt.axvline(6.7,color="red")

pval=(np.sum(sim>=6.7))/10000
print("The p value for our simulation is:", pval)

The p value for our simulation is: 0.2359

#### d. Is the probability calculated in part (c) one-tailed or two-tailed? Why or why not? (5 points)

This is a one-tailed (right-tailed) probability, because the chi-squared statistic takes only positive values: it is a sum of squared distances between observed and expected counts. If the observed data exactly matched the expected data, the statistic would be 0, and it grows as the differences between observed and expected become more extreme. "More extreme" can therefore only mean larger, i.e. the right tail.

#### e. What is your inference from the probability calculated in part (c)? (5 points)

From the p-value we calculated, 0.2359, we cannot reject the null hypothesis that M&M packs contain equal proportions of each color, because it is not below the significance threshold of 0.05. Our observed color distribution is therefore consistent with random chance.

#### f. Emotionally invested in this discovery, you show your results to your roommate. Looking at the sorted colors of M&Ms on your desk, she comments, “it seems unlikely that you would draw 10 orange and only 2 green just by chance if the probabilities are truly equal. Can you just calculate that specific probability for me?” Knowing everything you learned in LS40 and based off the preceding analysis, do you comply with her request? If yes, explain how you would calculate the requested probability. If not, defend your choice. (5 points)

Yes, we can find that probability using a basic big-box null-hypothesis test. We want the probability of getting exactly 10 orange and 2 green M&Ms in the same bag. With a big box of equal color proportions, we can repeatedly resample bags of 35 M&Ms and count how often that specific combination occurs. The null hypothesis is that there is no difference in the proportions of each M&M color, and the resampling tells us how surprising 10 orange with 2 green in the same bag would be under that hypothesis.
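A minimal sketch of that resampling (my own illustration, not the exam's required solution): draw bags of 35 from the equal-proportions box and count how often exactly 10 orange and exactly 2 green appear together.

```python
import numpy as np

rng = np.random.default_rng(0)
box = 10*["B"] + 10*["O"] + 10*["G"] + 10*["Y"] + 10*["R"] + 10*["BR"]  # equal proportions

n_sims = 10_000
hits = 0
for _ in range(n_sims):
    sample = rng.choice(box, 35)
    # exactly 10 orange AND exactly 2 green in the same simulated bag
    if np.sum(sample == "O") == 10 and np.sum(sample == "G") == 2:
        hits += 1

prob = hits / n_sims
print("Estimated probability:", prob)
```

Note that this estimates the probability of one exact outcome, which is much narrower than the "equal or more extreme" tail probability from part (c).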

#### After a night of google searches, you discover that all M&Ms are manufactured at just two different plants in the United States, each packaging a slightly different proportion of colors. Each plant produces approximately 50% of all M&Ms sold in the United States per year. The color frequencies for these two factories are shown in the table below:

In [7]:
Image.open("Table.PNG")


#### g. These new factory percentages have changed your expectations. Using the same analytic framework as before, calculate two observed test statistics based on the data from your original bag of M&Ms. One test statistic should relate to the expectations from the New Jersey factory and the other test statistic should relate to the expectations from the Tennessee factory. [note: again, do not round to whole M&Ms] (8 points)

In [38]:
#new jersey: expected counts are 35 x (0.25, 0.25, 0.125, 0.125, 0.125, 0.125)
newjersey=250*["B"]+250*["O"]+125*["G"]+125*["Y"]+125*["R"]+125*["BR"]
newjer_chi_obs=((7-8.75)**2/8.75)+((10-8.75)**2/8.75)+((2-4.375)**2/4.375)+((5-4.375)**2/4.375)+((4-4.375)**2/4.375)+((7-4.375)**2/4.375)
print("Our new chi squared for new jersey is:",newjer_chi_obs)

#tennessee: expected counts are 35 x (0.207, 0.205, 0.198, 0.135, 0.131, 0.124)
tennessee=207*["B"]+205*["O"]+198*["G"]+135*["Y"]+131*["R"]+124*["BR"]
tenn_chi_obs=((7-7.245)**2/7.245)+((10-7.175)**2/7.175)+((2-6.93)**2/6.93)+((5-4.725)**2/4.725)+((4-4.585)**2/4.585)+((7-4.34)**2/4.34)
print("Our new chi squared for tennessee is:",tenn_chi_obs)


Our new chi squared for new jersey is: 3.5142857142857142 Our new chi squared for tennessee is: 6.348735833832281

#### h. Calculate the probability of opening a bag of M&Ms and finding the frequency of colors that you observed (or a frequency distribution more extreme), given the expected frequencies from each factory. Your answer should calculate two probabilities, one for each factory. [hint: For your box model, imagine each factory produces 1000 M&Ms of the expected proportions. Refer to homework 7 solutions if you’re having trouble coding it.] (8 points)

In [39]:
#new jersey factory
sim_nj=np.zeros(10000)
newjersey=250*["B"]+250*["O"]+125*["G"]+125*["Y"]+125*["R"]+125*["BR"]

for i in range(10000):
    sample=np.random.choice(newjersey,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-8.75)**2/8.75)+((o-8.75)**2/8.75)+((g-4.375)**2/4.375)+((y-4.375)**2/4.375)+((r-4.375)**2/4.375)+((br-4.375)**2/4.375)
    sim_nj[i]=chi

p=sns.displot(sim_nj)
plt.axvline(newjer_chi_obs,color="red")

pval_nj=(np.sum(sim_nj>=newjer_chi_obs))/10000
print("The p value for our simulation for new jersey is:", pval_nj)

The p value for our simulation for new jersey is: 0.6439
In [41]:
#tennessee
sim_ten=np.zeros(10000)
tennessee=207*["B"]+205*["O"]+198*["G"]+135*["Y"]+131*["R"]+124*["BR"]

for i in range(10000):
    sample=np.random.choice(tennessee,35)
    b=np.sum(sample=="B")
    o=np.sum(sample=="O")
    g=np.sum(sample=="G")
    y=np.sum(sample=="Y")
    r=np.sum(sample=="R")
    br=np.sum(sample=="BR")
    chi=((b-7.245)**2/7.245)+((o-7.175)**2/7.175)+((g-6.93)**2/6.93)+((y-4.725)**2/4.725)+((r-4.585)**2/4.585)+((br-4.34)**2/4.34)
    sim_ten[i]=chi

p=sns.displot(sim_ten)
plt.axvline(tenn_chi_obs,color="red")

pval_tenn=(np.sum(sim_ten>=tenn_chi_obs))/10000
print("The p value for our simulation for tennessee is:", pval_tenn)

The p value for our simulation for tennessee is: 0.2675

#### i. Based on your calculations, which factory is more likely to produce a color frequency matching your bag of M&Ms? How certain are you? Can you rule out one of the two factories as the producer? Support your answer only with information from your previous calculations. (5 points)

The New Jersey factory gave a p-value of about 0.6439, which is above the significance threshold of 0.05, so we cannot reject the null hypothesis that the bag came from New Jersey. The New Jersey p-value is also much higher than the Tennessee p-value of 0.2675, which tells us our bag's color frequencies are more consistent with the New Jersey factory. We are fairly confident of this, because the New Jersey proportions align closely with our bag. However, we cannot explicitly rule out Tennessee as the producer: its p-value is also above the significance threshold, so the observed counts could plausibly arise from either factory's proportions by random chance.

#### j. Given your calculations, what is the probability that your bag of M&Ms was made in Tennessee? [Hint: Note the difference between this question and that asked in part h. Remember, from above, that the factories produce equal numbers.] (6 points)

We would build a box model that allocates 50% of bags to each factory location (reflecting their equal production), run the simulation 10,000 times, and estimate the probability that a bag matching our observed counts came from the Tennessee factory. This is a big-box test.
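One rough shortcut (my assumption, not necessarily the intended simulation): treat the simulated tail probabilities from part (h) as relative likelihoods of the bag under each factory, and combine them with the equal 50/50 priors via Bayes' rule.

```python
# Simulated p-values from part (h), used here as rough relative likelihoods
p_nj, p_tn = 0.6439, 0.2675
prior = 0.5  # each factory produces 50% of all M&Ms sold

post_tn = (p_tn * prior) / (p_tn * prior + p_nj * prior)
print("P(Tennessee | bag) is roughly", round(post_tn, 3))  # ≈ 0.29
```

Because the priors are equal they cancel, and the posterior is simply the Tennessee value divided by the sum of the two.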

#### k. Determined to know the truth, you open 100 more bags, carefully counting up the total numbers of each color. You re-do your simulations, eventually finding that the expected color frequency from Tennessee would produce your observed frequency 85% of the time, while the expected color frequency from New Jersey would produce your observed frequency only 4% of the time. Given this new knowledge, what is the probability that the M&Ms you purchased were made in Tennessee? [Assume that all M&Ms purchased came from the same factory.] (6 points)

We would use the same framework as in (j), but now we have new data to account for: with 100 more bags, the likelihood of our observed frequencies is 85% under Tennessee and only 4% under New Jersey. Updating our calculation with these values gives a revised probability that the bag was made in Tennessee.
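A sketch of that update (assuming, as the problem states, equal 50/50 priors and treating the quoted 85% and 4% as the likelihoods of the observed counts under each factory):

```python
# Bayes' rule with equal 50/50 priors; 85% and 4% are the stated
# likelihoods of the observed frequencies under each factory
like_tn, like_nj = 0.85, 0.04
post_tn = (like_tn * 0.5) / (like_tn * 0.5 + like_nj * 0.5)
print("P(Tennessee | data) is roughly", round(post_tn, 3))  # ≈ 0.955
```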

#### l. Explain why your probability changed from part (j) to (k), referring explicitly to the “reallocation of credibility”. (5 points)

Reallocation of credibility tells us that when our circumstances change, so does the probability we calculate. The more information we gather, the more we must "reallocate" credibility from our old beliefs toward ones updated by the new data, to keep our probabilities as accurate as possible. This is relevant to (j) and (k) because in (j) we used just 1 bag of candy, while in (k) we have 100 more bags. That new knowledge gets integrated into our calculation, shifting credibility strongly toward the Tennessee factory and improving accuracy.

#### The following passage was published in the Journal of Hospital Infection (2020, vol. 105, pp. 104–105) on the role of N95 masks in preventing transmission of the virus SARS-CoV-2:

In [8]:
Image.open("Covid.PNG")


#### a. For each of the bolded and underlined numbers, classify each into one of the following categories: effect size, p-value, confidence interval, sample size. (8 points)

The effect size is 4.65%.

The p-value is 2.2e-16.

The confidence interval is the 95% interval (1.75%, infinite).

The sample size is 491 medical staff.

#### b. For each p-value in part a, give a plausible null hypothesis that is being tested. (5 points)

There is no difference in 2019-nCoV infection rates among medical staff between the no-mask group and the N95 respirator group.

#### c. Note that the authors report something as “infinite.” Based on the information provided in the above passage, why did the authors calculate the upper bound as “infinite”? Is infinite a plausible upper bound? (5 points)

The authors calculated the upper bound as infinite because, at the time of the study, the scientists did not have enough data to set an actual upper bound. Infection rates can fluctuate, and many other factors about the novel virus were not yet known. An infinite upper bound is not very plausible; it served as a placeholder for the time being.

## 3. (S)querulous Correlations

#### A set of wildlife biologists set out to study the relationship between tail length and tail bushy-ness in UCLA squirrels. Below are sets of x-y measurements (x = length in cm, y = bushy-ness in mm) for each of 11 individual squirrels:

In [105]:
X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]


#### a. Calculate the Pearson correlation coefficient for the above set of data. (5 points)

In [106]:
xarr=np.array(X1)
yarr=np.array(Y1)

In [107]:
pearsonr(xarr,yarr)

(0.8162365060002428, 0.0021788162369107975)

The Pearson correlation coefficient is 0.8162365060002428 (the second value returned, 0.0022, is the associated p-value).
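As a sanity check (not required by the exam), r can also be computed directly from its definition as the covariance scaled by the two standard deviations:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])

# Pearson r = covariance(x, y) / (sd_x * sd_y), written with deviations from the mean
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(r)  # ≈ 0.816, matching pearsonr
```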

#### b. Use a suitable resampling method to estimate whether there is a significant correlation between tail length and tail bushy-ness. (5 points)

In [108]:
df = pd.DataFrame(np.column_stack([X1,Y1]))

df.columns = ["X1","Y1"]


In [109]:
corrObs = pearsonr(df["X1"],df["Y1"])[0]

In [110]:
tail=list(df["X1"])
bushy=list(df["Y1"])
df_data=np.zeros(10000)
for i in range(10000):
    np.random.shuffle(tail)
    df_data[i]=pearsonr(tail,bushy)[0]

p=sns.displot(df_data,kde=False)
plt.axvline(-corrObs,color='red')
plt.axvline(corrObs,color='red')
pval=(np.sum(df_data<=-corrObs)+np.sum(df_data>=corrObs))/10000
print ("The observed correlation coefficient is",corrObs)
print("The p-value is",pval)

The observed correlation coefficient is 0.8162365060002428 The p-value is 0.0005

#### c. In part (b), did you calculate a 1- or 2-sided p-value? Justify your decision in the context of the data, the test, and the research goals. (5 points)

This p-value is two-sided because we counted statistical significance in both directions. We are testing whether there is a relationship between tail length and tail bushy-ness in UCLA squirrels in either direction, positive or negative, since the researchers had no prior directional hypothesis.

#### d. Encouraged by the results, the researchers are interested in learning how much bushier UCLA squirrel tails are for every cm of length. Calculate this statistic and provide an appropriate confidence interval. [Assume that tail lengths are measured and reported exactly.] (6 points)

In [111]:
reg = linregress(df["X1"], df["Y1"])
slope=reg.slope
intercept=reg.intercept
X_plot = np.linspace(4, 13, 100)
Y_plot = slope*X_plot+intercept
p=sns.lmplot(x="X1",y="Y1",data=df,fit_reg=False)
plt.plot(X_plot,Y_plot)
print("The slope for the regression line is",slope)
print("The y-intercept for the regression line is",intercept)

The slope for the regression line is 0.5000000000000001 The y-intercept for the regression line is 3.000909090909089
In [112]:
reg_slope=np.zeros(10000)
for i in range(10000):
    rand_samp=df.sample(len(df),replace=True)
    t=list(rand_samp["X1"])
    bu=list(rand_samp["Y1"])
    reg = linregress(t, bu)
    reg_slope[i]=reg.slope

reg_slope.sort()
#reverse-percentile bootstrap: indices 49 and 9949 bracket the middle 99% of 10000 sorted slopes
slope_upper=2*slope-reg_slope[49]
slope_lower=2*slope-reg_slope[9949]
print ("The 99% confidence intervals are",
(slope_lower,slope_upper))

The 99% confidence intervals are (0.007389162561576401, 0.9844117647058825)

The statistic shows that with every 1 cm increase in tail length, bushy-ness increases by about 0.5 mm. The 99% confidence interval for the slope is approximately (0.0074, 0.9844), per the printed output above (bootstrap intervals vary slightly from run to run).

#### e. Part (d) asks you to assume that tail lengths are measured and reported exactly. How would your analysis in part (d) change if tail length measurements had instead been rounded to the nearest cm? [You don't have to conduct this alternative analysis, just describe how your analysis would change] (5 points)

This would change our analysis because the tail lengths would no longer be exact measurements. Ordinary least squares assumes the x-values (tail lengths) are exact, attributing all error to y. Since rounding to the nearest cm introduces measurement error in the tail lengths, we would instead use orthogonal regression, which accounts for error in both variables. Because measurement error in x attenuates the OLS slope toward zero, the orthogonal-regression slope would typically be steeper.
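One way to sketch the alternative analysis (an illustration, not the solution the exam requires) is scipy's orthogonal distance regression, fit here to the UCLA data from earlier:

```python
import numpy as np
from scipy import odr

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])

def linear(beta, x):
    # beta[0] = slope, beta[1] = intercept
    return beta[0] * x + beta[1]

model = odr.Model(linear)
data = odr.RealData(x, y)               # could also carry measurement uncertainties sx, sy
fit = odr.ODR(data, model, beta0=[0.5, 3.0]).run()  # start near the OLS estimates
print("orthogonal slope, intercept:", fit.beta)
```

Because perpendicular distances are minimized, the fitted slope comes out somewhat steeper than the OLS slope of 0.5.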

#### f. A competing set of wildlife biologists from USC decide to replicate the squirrel study and come up with the following measurements from 11 individuals:

In [15]:
X2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


#### With this new set of data, calculate the Pearson's correlation coefficient and associated p-value (as in parts a–b). (8 points)

In [16]:
xarr2=np.array(X2)
yarr2=np.array(Y2)

In [17]:
pearsonr(xarr2,yarr2)

(0.8162867394895982, 0.002176305279228025)

The Pearson correlation coefficient is 0.8162867394895982, essentially identical to the UCLA value, and the associated p-value is 0.0022.

In [22]:
df2 = pd.DataFrame(np.column_stack([X2,Y2]))

df2.columns = ["X2","Y2"]

In [23]:
corrObs2 = pearsonr(df2["X2"],df2["Y2"])[0]

In [26]:
reg = linregress(df2["X2"], df2["Y2"])
slope=reg.slope
intercept=reg.intercept
X_plot2 = np.linspace(4, 13, 100)
Y_plot2 = slope*X_plot2+intercept
p2=sns.lmplot(x="X2",y="Y2",data=df2,fit_reg=False)
plt.plot(X_plot2,Y_plot2)
print("The slope for the regression line is",slope)
print("The y-intercept for the regression line is",intercept)

The slope for the regression line is 0.4997272727272729 The y-intercept for the regression line is 3.002454545454544

#### g. From these two studies, what can we conclude about whether longer tails cause bushier tails in squirrels? (5 points)

We can conclude that tail length and tail bushy-ness in squirrels are strongly correlated, given the high correlation coefficient of about 0.816 in each study. However, this data does not allow us to conclude anything about causation, because these are observational studies rather than experiments.

#### h. Looking at the data produced, the UCLA and USC scientists come together and conclude that the relationship between tail length and tail bushy-ness in squirrels is the same on both campuses. You don’t agree. Convince them otherwise, using at least one appropriate graph to support your position. (10 points)

The graphs above show that we cannot conclude the relationship between tail length and tail bushy-ness is the same on both campuses. Although the two datasets yield nearly identical correlation coefficients and regression lines, plotting them reveals very different patterns: the UCLA points follow a curved arc around the regression line, while the USC points lie close to a straight line except for one extreme outlier. Identical summary statistics are masking different underlying relationships.
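A side-by-side plot makes the argument directly (a sketch using the two datasets above; the panel layout is my choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y_ucla = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])
y_usc = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, y, school in zip(axes, [y_ucla, y_usc], ["UCLA", "USC"]):
    reg = linregress(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    ax.scatter(x, y)
    ax.plot(xs, reg.slope * xs + reg.intercept, color="red")
    # nearly identical slope and r in both panels, despite very different scatter
    ax.set_title(f"{school}: slope = {reg.slope:.3f}, r = {reg.rvalue:.3f}")
    ax.set_xlabel("tail length (cm)")
axes[0].set_ylabel("bushy-ness (mm)")
```

The matching titles and clashing point patterns are exactly the contrast that should convince the two teams.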

#### i. Thanks to your persuasive evidence, the two teams of researchers conclude that the relationship between tail length and bushy-ness is not the same between the two schools. In order for them to not make the same mistake again, what final lesson would you impart? [Note: this final lesson should be considered as one of the most enduring mantras of LS40!] (5 points)

We cannot automatically conclude that two relationships are the same just by looking at a significant p-value (below 0.05) and a correlation coefficient. The lesson is: always visualize your data before drawing conclusions. A p-value below 0.05 is not the be-all and end-all; identical summary statistics can arise from very different underlying patterns, and only plotting the data reveals this.
