1

Often, we have to compare data from three or more groups to see if one or more of them is different from the others. To do this, scientists use a statistic called the *F*- statistic, defined as $F=\frac{Var_{between}}{Var_{within}}$. The variability between groups is the sum of the squared differences between the group means and the grand mean. The variability within groups is the sum of the group variances. In LS 40, we will use a variation of the F-statistic that does not require squaring, which is called the *F*-like statistic.

2

- Import pandas, Numpy, Seaborn and Pyplot.

3

In [2]:

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt

4

In this lab, we will examine the results of a pharmaceutical company's study comparing the effectiveness of different pain relief medications on migraine headaches. For the experiment, 27 volunteers were selected and 9 were randomly assigned to one of three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 = no pain to 10 = extreme pain 30 minutes after taking the drug.

5

- Using the pandas
`read_csv`

function, import the file`migraines.csv`

and show the data.

6

In [3]:

migrainedata=pd.read_csv("migraines.csv")

7

In [4]:

migrainedata

8

Drug A | Drug B | Drug C | |
---|---|---|---|

0 | 4 | 7 | 6 |

1 | 5 | 8 | 7 |

2 | 4 | 4 | 6 |

3 | 2 | 5 | 6 |

4 | 2 | 4 | 7 |

5 | 4 | 6 | 5 |

6 | 3 | 5 | 6 |

7 | 3 | 8 | 5 |

8 | 3 | 7 | 5 |

- Visualize the data. Use as many different plots as you need to get a sense of the distributions. What are your initial impressions?

9

In [5]:

#My initial impressions are that Drug A has mostly lower pain ratings with an almost symmetrical unimodal distribution. On the other hand, Drug B has higher pain ratings with symmetrical bimodality. Drug C is almost in the middle of the two, with above average pain ratings and appears to be unimodal with a slight right skew.

10

In [6]:

p=sns.swarmplot(data=migrainedata) p.set(ylabel="Patient Pain Rating After Taking the Drug")

11

[Text(0, 0.5, 'Patient Pain Rating After Taking the Drug')]

In [7]:

p=sns.displot(data=migrainedata) p.set(xlabel="Patient Pain Rating After Taking the Drug",ylabel="Count")

12

<seaborn.axisgrid.FacetGrid at 0x7fe4a4d94ee0>

In [8]:

p=sns.stripplot(data=migrainedata) p.set(xlabel="Patient Pain Rating After Taking the Drug",ylabel="Count")

13

[Text(0.5, 0, 'Patient Pain Rating After Taking the Drug'),
Text(0, 0.5, 'Count')]

In [9]:

p=sns.violinplot(data=migrainedata) p.set(xlabel="Patient Pain Rating After Taking the Drug",ylabel="Count")

14

[Text(0.5, 0, 'Patient Pain Rating After Taking the Drug'),
Text(0, 0.5, 'Count')]

15

In the next several exercises, you will compute the *F*-like statistic for your data.

*F*-like=$\frac{n_{a}|\widetilde{a}-\widetilde{G}|+n_{b}|\widetilde{b}-\widetilde{G}|+n_{c}|\widetilde{c}-\widetilde{G}|}{\Sigma|a_{i}-\widetilde{a}|+\Sigma|b_{i}-\widetilde{b}|+\Sigma|c_{i}-\widetilde{c}|}$

Each quantity you compute in this section should be assigned to a variable.

16

- Find the median of each column. HINT: pandas data frames have a built-in function for columnwise medians, so you can just use
`df.median()`

, where`df`

is the name of your data frame.

17

In [10]:

druga=migrainedata["Drug A"] drugb=migrainedata["Drug B"] drugc=migrainedata["Drug C"]

18

In [11]:

groupmedians=migrainedata.median() groupmedians

19

Drug A 3.0
Drug B 6.0
Drug C 6.0
dtype: float64

- Find the grand median (the median of the whole sample). HINT: Use
`np.median`

.

20

In [12]:

grandmedian=np.median(migrainedata) grandmedian

21

5.0

- Find the numerator of the
*F*-like statistic (variation among groups). HINT: When working with data frames and NumPy arrays, you can do computations like addition and multiplication directly, without for loops (unlike in regular Python). Also, Numpy has`abs`

and`sum`

functions.

22

In [13]:

lem = len(migrainedata) print(lem)

23

9

In [14]:

numerator=np.sum(lem*np.abs(groupmedians-grandmedian)) numerator

24

36.0

- Compute the denominator of the
*F*-like statistic. This represents variation within groups. HINT: Numpy and pandas will handle columns automatically.

25

In [15]:

denom=np.sum(np.sum(np.abs(migrainedata-groupmedians))) denom

26

24.0

- Compute
*F*-like, which is the ratio $\frac{\text{variation among groups}}{\text{variation within groups}}$.

27

In [16]:

Flike=numerator/denom Flike

28

1.5

29

We now want to find a p-value for our data by simulating the null hypothesis. This, of course, means computing the *F*-like statistic each time, which takes a lot of code and would make a mess in the bootstrap loop. Instead, we will package our code into a function and call this function whereever necessary.

30

- Write a function that will compute the
*F*-like statistic for this dataset or one of the same size.

31

In [17]:

def fstat(df): medians=df.median() grandmedian=np.median(df) numerator=np.sum(len(df)*abs(medians-grandmedian)) denominator=np.sum(np.sum(abs(df-df.median()))) Flike=numerator/denominator return Flike

32

We now want to simulate the null hypothesis that there is no difference between the groups. To do this, we have to make all the data into one dataset, sample pseudo-groups from it, and compute the *F*-like statistic for the resampled data.

33

- Use the code
`alldata = np.concatenate([migraine["Drug A"], migraine["Drug B"], migraine["Drug C"]])`

(this assumes your data frame is called "migraine") to put all the data into one 1-D array.

34

In [18]:

alldata = np.concatenate([migrainedata["Drug A"], migrainedata["Drug B"], migrainedata["Drug C"]])

35

- Sample three groups of the appropriate size from
`alldata`

. Assign each to a variable.

36

In [19]:

a=np.random.choice(alldata,len(migrainedata)) b=np.random.choice(alldata,len(migrainedata)) c=np.random.choice(alldata,len(migrainedata))

37

- Make the three samples into a data frame. To do this, use the NumPy function
`column_stack`

to put the 1-D arrays side by side and then use the pandas function`DataFrame`

to convert the result into a data frame.

38

In [20]:

data=pd.DataFrame(np.column_stack([a,b,c])) data

39

0 | 1 | 2 | |
---|---|---|---|

0 | 7 | 3 | 6 |

1 | 3 | 3 | 5 |

2 | 5 | 5 | 5 |

3 | 6 | 6 | 4 |

4 | 4 | 6 | 4 |

5 | 4 | 6 | 2 |

6 | 2 | 3 | 7 |

7 | 7 | 6 | 3 |

8 | 6 | 3 | 2 |

- Compute the
*F*-like statistic for your resampled data.

40

In [21]:

fstat(data)

41

0.24324324324324326

- Do the above steps 10,000 times to simulate the null hypothesis, storing the results.

42

In [22]:

results=np.zeros(10000) alldata=np.concatenate([migrainedata["Drug A"], migrainedata["Drug B"], migrainedata["Drug C"]]) for i in range(10000): a=np.random.choice(alldata,len(migrainedata)) b=np.random.choice(alldata,len(migrainedata)) c=np.random.choice(alldata,len(migrainedata)) data=pd.DataFrame(np.column_stack([a,b,c])) results[i]=fstat(data) results

43

array([0.25714286, 0. , 0. , ..., 0.25714286, 0.21428571,
0. ])

- Find the p-value for your data. What do you conclude about the migraine treatments? Use $\alpha$=0.05 for NHST.

44

In [23]:

d1=np.sum(results>=Flike) pvalue=(d1)/10000 pvalue #I can conclude that the migraine treatments did have an effect on the patients that was not due to random chance since the p-value I obtained was 0.0008 which is smaller than 0.05, therefore it is statistically significant.

45

0.0005

46

Having obtained a significant result from the omnibus test, we can now try to track down the source of the significance. Which groups are actually different from which?

47

- Perform pairwise two-group comparisons and record the p-value for each.

48

In [24]:

total = 10000 differenceMedian = np.median(migrainedata["Drug A"]) - np.median(migrainedata["Drug B"]) other_limit = -differenceMedian limits = [differenceMedian,other_limit] a_b_difference = np.zeros(total) alldata = np.concatenate([migrainedata["Drug A"], migrainedata["Drug B"]]) for i in range(total): Random_Drug_A = np.random.choice(alldata, len(migrainedata["Drug A"])) Random_Drug_B = np.random.choice(alldata, len(migrainedata["Drug B"])) difference = np.median(Random_Drug_A) - np.median(Random_Drug_B) a_b_difference[i] = difference pvalueAB = (sum(a_b_difference>=max(limits))+sum(a_b_difference<=min(limits)))/total display("P-Value (A-B)", pvalueAB)

49

'P-Value (A-B)'

0.044

In [25]:

total = 10000 differenceMedian = np.median(migrainedata["Drug C"]) - np.median(migrainedata["Drug B"]) other_limit = -differenceMedian limits = [differenceMedian,other_limit] c_b_difference = np.zeros(total) alldata = np.concatenate([migrainedata["Drug C"], migrainedata["Drug B"]]) for i in range(total): Random_Drug_C = np.random.choice(alldata, len(migrainedata["Drug C"])) Random_Drug_B = np.random.choice(alldata, len(migrainedata["Drug B"])) difference = np.median(Random_Drug_C) - np.median(Random_Drug_B) c_b_difference[i] = difference pvalueBC = np.sum(np.sum(c_b_difference>=max(limits))+np.sum(c_b_difference<=min(limits)))/total display("P-Value (B-C)", pvalueBC)

50

'P-Value (B-C)'

1.4576

In [26]:

total = 10000 differenceMedian = np.median(migrainedata["Drug A"]) - np.median(migrainedata["Drug C"]) other_limit = -differenceMedian limits = [differenceMedian,other_limit] a_b_difference = np.zeros(total) alldata = np.concatenate([migrainedata["Drug A"], migrainedata["Drug C"]]) for i in range(total): Random_Drug_A = np.random.choice(alldata, len(migrainedata["Drug A"])) Random_Drug_B = np.random.choice(alldata, len(migrainedata["Drug B"])) difference = np.median(Random_Drug_A) - np.median(Random_Drug_B) a_b_difference[i] = difference pvalueAC = (sum(a_b_difference>=max(limits))+sum(a_b_difference<=min(limits)))/total display("P-Value (A-C)", pvalueAC)

51

'P-Value (A-C)'

0.0225

- Use the list of p-values from your two-group comparisons and the Benjamini-Hochberg method below to determine which two-group comparisons have a significant difference at $\alpha$=0.05. See more details in "Controlling the False Positive Rate with Multiple Comparisons Corrections" PDF on Week 6 on CCLE site.

`import statsmodels.stats.multitest as smm`

`smm.multipletests(pvals, alpha=my_alpha, method='fdr_bh')`

52

In [27]:

import statsmodels.stats.multitest as smm smm.multipletests(pvalueAB, alpha=0.05, method='fdr_bh')+smm.multipletests(pvalueBC, alpha=0.05, method='fdr_bh')+smm.multipletests(pvalueAC, alpha=0.05, method='fdr_bh')

53

(array([ True]),
array([0.044]),
0.050000000000000044,
0.05,
array([False]),
array([1.]),
0.050000000000000044,
0.05,
array([ True]),
array([0.0225]),
0.050000000000000044,
0.05)

- Write a sentence or two describing your findings.

54

In [ ]:

55