SharedWeek 6 / Lab 6 - ANOVA-corrected.ipynbOpen in CoCalc
Author: Dean Neutel
Views : 14

# Lab 6: Comparing Three or More Groups with ANOVA

Often, we have to compare data from three or more groups to see if one or more of them is different from the others. To do this, scientists use a statistic called the F- statistic, defined as the ratio of the among-group variances to the within-group variance. The between-group variance is the sum of the squared differences between the group means and the mean of all the samples taken together, called the grand mean. The within-group variance is the sum of the individual group variances. In LS 40, we will use a variation of the F-statistic that does not require squaring, called the F-like statistic.

1. Import pandas, Numpy and Seaborn.
In [2]:
#1
import pandas as pd
import numpy as np
import seaborn as sns


In this lab, we will examine the results of a hypothetical study comparing the effectiveness of different pain relief medications on migraine headaches. For the experiment, 27 volunteers with migraines were selected and 9 were randomly assigned to one of three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 = no pain to 10 =extreme pain 30 minutes after taking the drug.

1. Using the pandas read_csv function, import the file migraines.csv and show the data.
In [3]:
#2
migraines

Drug A Drug B Drug C
0 4 7 6
1 5 8 7
2 4 4 6
3 2 5 6
4 2 4 7
5 4 6 5
6 3 5 6
7 3 8 5
8 3 7 5
1. Visualize the data. Use as many different plots as you need to get a sense of the distributions. What are your initial impressions?
In [4]:
#3
p=sns.swarmplot(data=migraines)
p.set(xlabel="Drug Group",ylabel="Count");

In [5]:
drug_a = list(migraines["Drug A"])
drug_b = list(migraines["Drug B"])
drug_c = list(migraines["Drug C"])
p=sns.distplot(drug_a, kde=True, bins=4)
p=sns.distplot(drug_b, kde=True, color="green", bins=4)
p=sns.distplot(drug_c, kde=True, color="orange", bins=4)
p.set(xlabel="Drug Group", ylabel="Count");


## Computing the F-like statistic

In the next several exercises, you will compute the F-like statistic for your data.

## $F-like= \frac{n_{a}|\widetilde{a}-\widetilde{G}|+n_{b}|\widetilde{b}-\widetilde{G}|+n_{c}|\widetilde{c}-\widetilde{G}|}{\Sigma|a_{i}-\widetilde{a}|+\Sigma|b_{i}-\widetilde{b}|+\Sigma|c_{i}-\widetilde{c}|}$

Here, $n_x$ is the size of group $x$ and symbols with ~s over them represent medians, with $\widetilde{G}$ being the grand median -- the median of all the data taken together.

Each quantity you compute in this section should be assigned to a variable.

1. Find the median of each column. HINT: pandas data frames have a built-in function for columnwise medians, so you can just use df.median(), where df is the name of your data frame.
In [6]:
#4
medians = migraines.median()
medians

Drug A 3.0 Drug B 6.0 Drug C 6.0 dtype: float64
1. Find the grand median (the median of the whole sample). HINT: Use np.median.
In [7]:
#5
grand_median = np.median(migraines)
grand_median

5.0
1. Find the numerator of the F-like statistic (variation among groups). HINT: When working with data frames and NumPy arrays, you can do computations like addition and multiplication directly, without for loops (unlike in regular Python). Also, Numpy has abs and sum functions.
In [8]:
#6
n = len(migraines)
variation_between_groups = (n*abs(medians["Drug A"]-grand_median))+(n*abs(medians["Drug B"]-grand_median))+(n*abs(medians["Drug C"]-grand_median))
variation_between_groups

36.0
1. Compute the denominator of the F-like statistic. This represents variation within groups. HINT: Numpy and pandas will handle columns automatically.
In [9]:
#7
variation_within_groups = sum(np.sum(abs(migraines-medians)))
variation_within_groups

24.0
1. Compute F-like statistic, which is the ratio $\frac{\text{variation among groups}}{\text{variation within groups}}$.
In [10]:
#8
F_observed = (variation_between_groups)/(variation_within_groups)
F_observed

1.5

## Bootstrapping

We now want to find a p-value for our data by simulating the null hypothesis. This, of course, means computing the F-like statistic each time, which takes a lot of code and would make a mess in the bootstrap loop. Instead, we will package our code into a function and call this function whereever necessary.

1. Write a function that will compute the F-like statistic for this dataset or one of the same size.
In [11]:
#9
def F_like_function(data):
n = len(data)
variation_within_groups = sum(np.sum(abs(data-data.median())))
variation_between_groups = (n*abs(data.median()[0]-np.median(data)))+(n*abs(data.median()[1]-np.median(data)))+(n*abs(data.median()[2]-np.median(data)))
F_like = (variation_between_groups)/(variation_within_groups)
return(F_like)
F_like_function(migraines)

1.5

We now want to simulate the null hypothesis that there is no difference between the groups. To do this, we have to make all the data into one dataset, sample pseudo-groups from it, and compute the F-like statistic for the resampled data.

1. Use the code alldata = np.concatenate([migraine["Drug A"], migraine["Drug B"], migraine["Drug C"]]) (this assumes your data frame is called "migraine") to put all the data into one 1-D array.
In [12]:
#10
alldata = np.concatenate([migraines["Drug A"], migraines["Drug B"], migraines["Drug C"]])
alldata

array([4, 5, 4, 2, 2, 4, 3, 3, 3, 7, 8, 4, 5, 4, 6, 5, 8, 7, 6, 7, 6, 6, 7, 5, 6, 5, 5])
1. Sample three groups of the appropriate size from alldata. Assign each to a variable.
In [13]:
#11
a_random = np.random.choice(alldata,len(migraines["Drug A"]))
b_random = np.random.choice(alldata,len(migraines["Drug B"]))
c_random = np.random.choice(alldata,len(migraines["Drug C"]))
display("New Drug A Group", a_random,"New Drug B Group", b_random,"New Drug C Group", c_random)

'New Drug A Group'
array([7, 8, 5, 4, 3, 7, 5, 5, 7])
'New Drug B Group'
array([6, 6, 7, 4, 5, 5, 6, 4, 2])
'New Drug C Group'
array([8, 4, 6, 6, 7, 4, 7, 5, 4])
1. Make the three samples into a data frame. To do this, use the NumPy function column_stack to put the 1-D arrays side by side and then use the pandas function DataFrame to convert the result into a data frame.
In [14]:
#12
column = np.column_stack((a_random,b_random,c_random))
random_data_frame = pd.DataFrame(column)
random_data_frame

0 1 2
0 7 6 8
1 8 6 4
2 5 7 6
3 4 4 6
4 3 5 7
5 7 5 4
6 5 6 7
7 5 4 5
8 7 2 4
1. Compute the F-like statistic for your resampled data.
In [15]:
#13
F_like_function(random_data_frame)

0.2727272727272727
1. Do the above steps 10,000 times to simulate the null hypothesis, storing the results.
In [33]:
#14
F_like = []
alldata = np.concatenate([migraines["Drug A"], migraines["Drug B"], migraines["Drug C"]])
total = 10000
for i in range(total):
a_random = np.random.choice(alldata,len(migraines["Drug A"]))
b_random = np.random.choice(alldata,len(migraines["Drug B"]))
c_random = np.random.choice(alldata,len(migraines["Drug C"]))
column = np.column_stack((a_random,b_random,c_random))
random_data_frame = pd.DataFrame(column)
new_F_like = F_like_function(random_data_frame)
F_like.append(new_F_like)

--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-33-95801e4e4ec7> in <module>() 9 column = np.column_stack((a_random,b_random,c_random)) 10 random_data_frame = pd.DataFrame(column) ---> 11 new_F_like = F_like_function(random_data_frame) 12 F_like.append(new_F_like) 13 alldata <ipython-input-11-c18fa83f0a13> in F_like_function(data) 2 def F_like_function(data): 3 n = len(data) ----> 4 variation_within_groups = sum(np.sum(abs(data-data.median()))) 5 variation_between_groups = (n*abs(data.median()[0]-np.median(data)))+(n*abs(data.median()[1]-np.median(data)))+(n*abs(data.median()[2]-np.median(data))) 6 F_like = (variation_between_groups)/(variation_within_groups) /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in f(self, other, axis, level, fill_value) 2028 return _combine_series_frame(self, other, pass_op, 2029 fill_value=fill_value, axis=axis, -> 2030 level=level) 2031 else: 2032 if fill_value is not None: /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in _combine_series_frame(self, other, func, fill_value, axis, level) 1928 1929 # default axis is columns -> 1930 return self._combine_match_columns(other, func, level=level) 1931 1932 /usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in _combine_match_columns(self, other, func, level) 5114 copy=False) 5115 assert left.columns.equals(right.index) -> 5116 return ops.dispatch_to_series(left, right, func, axis="columns") 5117 5118 def _combine_const(self, other, func): /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in dispatch_to_series(left, right, func, str_rep, axis) 1155 raise NotImplementedError(right) 1156 -> 1157 new_data = expressions.evaluate(column_op, str_rep, left, right) 1158 1159 result = left._constructor(new_data, index=left.index, copy=False) /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs) 206 use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b) 207 if use_numexpr: --> 208 return _evaluate(op, op_str, a, b, **eval_kwargs) 209 return _evaluate_standard(op, op_str, a, b) 210 /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs) 121 122 if result is None: --> 123 result = _evaluate_standard(op, op_str, a, b) 124 125 return result /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs) 66 _store_test_result(False) 67 with np.errstate(all='ignore'): ---> 68 return op(a, b) 69 70 /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in column_op(a, b) 1142 def column_op(a, b): 1143 return {i: func(a.iloc[:, i], b.iloc[i]) -> 1144 for i in range(len(a.columns))} 1145 1146 elif isinstance(right, ABCSeries): /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in <dictcomp>(.0) 1142 def column_op(a, b): 1143 return {i: func(a.iloc[:, i], b.iloc[i]) -> 1144 for i in range(len(a.columns))} 1145 1146 elif isinstance(right, ABCSeries): /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in wrapper(left, right) 1581 rvalues = rvalues.values 1582 -> 1583 result = safe_na_op(lvalues, rvalues) 1584 return construct_result(left, result, 1585 index=left.index, name=res_name, dtype=None) /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues) 1527 try: 1528 with np.errstate(all='ignore'): -> 1529 return na_op(lvalues, rvalues) 1530 except Exception: 1531 if is_object_dtype(lvalues): /usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in na_op(x, y) 1503 import pandas.core.computation.expressions as expressions 1504 try: -> 1505 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs) 1506 except TypeError: 1507 result = masked_arith_op(x, y, op) /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs) 206 use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b) 207 if use_numexpr: --> 208 return _evaluate(op, op_str, a, b, **eval_kwargs) 209 return _evaluate_standard(op, op_str, a, b) 210 /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs) 121 122 if result is None: --> 123 result = _evaluate_standard(op, op_str, a, b) 124 125 return result /usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs) 65 if _TEST_MODE: 66 _store_test_result(False) ---> 67 with np.errstate(all='ignore'): 68 return op(a, b) 69 /usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in __enter__(self) 2890 2891 def __enter__(self): -> 2892 self.oldstate = seterr(**self.kwargs) 2893 if self.call is not _Unspecified: 2894 self.oldcall = seterrcall(self.call) /usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in seterr(all, divide, over, under, invalid) 2586 2587 pyvals[1] = maskvalue -> 2588 umath.seterrobj(pyvals) 2589 return old 2590 KeyboardInterrupt: 
In [ ]:


1. Find the p-value for your data. What do you conclude about the migraine treatments? Use $\alpha$=0.05 for NHST.
In [ ]:
#15
F_like = []
alldata = np.concatenate([migraines["Drug A"], migraines["Drug B"], migraines["Drug C"]])
total = 10000
for i in range(total):
a_random = np.random.choice(alldata,len(migraines["Drug A"]))
b_random = np.random.choice(alldata,len(migraines["Drug B"]))
c_random = np.random.choice(alldata,len(migraines["Drug C"]))
column = np.column_stack((a_random,b_random,c_random))
random_data_frame = pd.DataFrame(column)
new_F_like = F_like_function(random_data_frame)
F_like.append(new_F_like)
pvalue = (sum(F_like>=F_observed))/total
display("P-Value", pvalue)


## Post Hoc Analysis

Having obtained a significant result from the omnibus test, we can now try to track down the source of the significance. Which groups are actually different from which?

1. Perform pairwise two-group comparisons and record the p-value for each.
In [20]:
#16
total = 10000
differenceMedian = np.median(migraines["Drug A"]) - np.median(migraines["Drug B"])
other_limit = -differenceMedian
limits = [differenceMedian,other_limit]
a_b_difference = np.zeros(total)
alldata = np.concatenate([migraines["Drug A"], migraines["Drug B"]])
for i in range(total):
Random_Drug_A = np.random.choice(alldata, len(migraines["Drug A"]))
Random_Drug_B = np.random.choice(alldata, len(migraines["Drug B"]))
difference = np.median(Random_Drug_A) - np.median(Random_Drug_B)
a_b_difference[i] = difference
pvalueAB = (sum(a_b_difference>=max(limits))+sum(a_b_difference<=min(limits)))/total
display("P-Value (A-B)", pvalueAB)

'P-Value (A-B)'
0.0437
In [29]:
total = 10000
differenceMedian = np.median(migraines["Drug C"]) - np.median(migraines["Drug B"])
other_limit = -differenceMedian
limits = [differenceMedian,other_limit]
c_b_difference = np.zeros(total)
alldata = np.concatenate([migraines["Drug C"], migraines["Drug B"]])
for i in range(total):
Random_Drug_C = np.random.choice(alldata, len(migraines["Drug C"]))
Random_Drug_B = np.random.choice(alldata, len(migraines["Drug B"]))
difference = np.median(Random_Drug_C) - np.median(Random_Drug_B)
c_b_difference[i] = difference
pvalueBC = np.sum(np.sum(c_b_difference>=max(limits))+np.sum(c_b_difference<=min(limits)))/total
display("P-Value (B-C)", pvalueBC)

'P-Value (B-C)'
1.461
In [22]:
total = 10000
differenceMedian = np.median(migraines["Drug A"]) - np.median(migraines["Drug C"])
other_limit = -differenceMedian
limits = [differenceMedian,other_limit]
a_b_difference = np.zeros(total)
alldata = np.concatenate([migraines["Drug A"], migraines["Drug C"]])
for i in range(total):
Random_Drug_A = np.random.choice(alldata, len(migraines["Drug A"]))
Random_Drug_B = np.random.choice(alldata, len(migraines["Drug B"]))
difference = np.median(Random_Drug_A) - np.median(Random_Drug_B)
a_b_difference[i] = difference
pvalueAC = (sum(a_b_difference>=max(limits))+sum(a_b_difference<=min(limits)))/total
display("P-Value (A-C)", pvalueAC)

'P-Value (A-C)'
0.0203
1. Use the list of p-values from your two-group comparisons and the Benjamini-Hochberg method below to determine which two-group comparisons have a significant difference at $\alpha$=0.05. See more details in "Controlling the False Positive Rate with Multiple Comparisons Corrections" PDF on Week 6 on CCLE site.

import statsmodels.stats.multitest as smm
smm.multipletests(pvals, alpha=critical_alpha, method='fdr_bh')
In [25]:
#17
import statsmodels.stats.multitest as smm
smm.multipletests(pvalueAB, alpha=0.05, method='fdr_bh')+smm.multipletests(pvalueBC, alpha=0.05, method='fdr_bh')+smm.multipletests(pvalueAC, alpha=0.05, method='fdr_bh')

(array([ True]), array([0.0437]), 0.050000000000000044, 0.05, array([False]), array([1.]), 0.050000000000000044, 0.05, array([ True]), array([0.0203]), 0.050000000000000044, 0.05)
1. Write a sentence or two describing your findings.
In [22]:
#18