Think Stats by Allen B. Downey. Think Stats is an introduction to probability and statistics for Python programmers.
This is the accompanying code for this book.
Website: http://greenteapress.com/wp/think-stats-2e/
Path: think-stats-code / chap09soln.py
License: GPL3
"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function, division

import first
import hypothesis
import scatter
import thinkstats2

import numpy as np


"""This file contains a solution to exercises in Think Stats:

As sample size increases, the power of a hypothesis test increases,
which means it is more likely to be positive if the effect is real.
Conversely, as sample size decreases, the test is less likely to
be positive even if the effect is real.

To investigate this behavior, run the tests in this chapter with
different subsets of the NSFG data.  You can use thinkstats2.SampleRows
to select a random subset of the rows in a DataFrame.

What happens to the p-values of these tests as sample size decreases?
What is the smallest sample size that yields a positive test?

My results:

test1: difference in mean pregnancy length
test2: difference in mean birth weight
test3: correlation of mother's age and birth weight
test4: chi-square test of pregnancy length

n       test1   test2   test3   test4
9148	0.16	0.00	0.00	0.00
4574	0.10	0.01	0.00	0.00
2287	0.25	0.06	0.00	0.00
1143	0.24	0.03	0.39	0.03
571	0.81	0.00	0.04	0.04
285	0.57	0.41	0.48	0.83
142	0.45	0.08	0.60	0.04

Conclusion: As expected, tests that are positive with large sample
sizes become negative as we take away data.  But the pattern is
erratic, with some positive tests even at small sample sizes.


Earlier in this chapter, we simulated the null hypothesis by
permutation; that is, we treated the observed values as if they
represented the entire population, and randomly assigned the
members of the population to the two groups.

An alternative is to use the sample to estimate the distribution for
the population, then draw a random sample from that distribution.
This process is called resampling.  There are several ways to
implement the resampling, but one of the simplest is to draw a sample,
with replacement, from the observed values, as in the section on power.

Write a class named DiffMeansResample that inherits from
DiffMeansPermute and overrides RunModel to implement
resampling, rather than permutation.

Use this model to test the differences in pregnancy length and
birth weight.  How much does the model affect the results?

Results:

means permute preglength
p-value = 0.1674
actual = 0.0780372667775
ts max = 0.226752436104

means permute birthweight
p-value = 0.0
actual = 0.124761184535
ts max = 0.112243501197


Conclusions: Using resampling instead of permutation has very
little effect on the results.

The two models are based on slightly different assumptions, and in
this example there is no compelling reason to choose one or the other.
But in general p-values depend on the choice of the null hypothesis;
different models can yield very different results.


"""

class DiffMeansResample(hypothesis.DiffMeansPermute):
    """Tests a difference in means using resampling."""

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        # sample each group independently from the pooled values,
        # with replacement
        group1 = np.random.choice(self.pool, self.n, replace=True)
        group2 = np.random.choice(self.pool, self.m, replace=True)
        return group1, group2
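

# A minimal, self-contained sketch (not part of the book's code) of how the
# permutation and resampling models of the null hypothesis differ. All three
# helper names are hypothetical, and the sketch uses only the standard
# library instead of the `hypothesis` module. It assumes the test statistic
# is the absolute difference in means, and that the p-value is the fraction
# of simulated statistics at least as large as the observed one.
import random


def SketchPermute(pool, n, rng):
    """Shuffles a copy of the pool and splits it into the first n and the rest."""
    shuffled = list(pool)
    rng.shuffle(shuffled)
    return shuffled[:n], shuffled[n:]


def SketchResample(pool, n, m, rng):
    """Draws groups of size n and m from the pool, with replacement."""
    group1 = [rng.choice(pool) for _ in range(n)]
    group2 = [rng.choice(pool) for _ in range(m)]
    return group1, group2


def SketchPValue(group1, group2, iters, rng, resample=True):
    """Estimates a p-value for the absolute difference in means."""
    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    actual = stat(group1, group2)
    pool = list(group1) + list(group2)
    n, m = len(group1), len(group2)

    count = 0
    for _ in range(iters):
        if resample:
            simulated = SketchResample(pool, n, m, rng)
        else:
            simulated = SketchPermute(pool, n, rng)
        if stat(*simulated) >= actual:
            count += 1
    return count / iters

# Permutation uses every pooled value exactly once per simulation, while
# resampling can repeat some values and leave others out.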
  

def RunResampleTest(firsts, others):
    """Tests differences in means by resampling.

    firsts: DataFrame
    others: DataFrame
    """
    # compare pregnancy lengths
    data = firsts.prglngth.values, others.prglngth.values
    ht = DiffMeansResample(data)
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute preglength')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())

    # compare birth weights, dropping missing values; use the resampling
    # model here as well, since this function tests by resampling
    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = DiffMeansResample(data)
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute birthweight')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())


def RunTests(live, iters=1000):
    """Runs the tests from Chapter 9 with a subset of the data.

    live: DataFrame
    iters: how many iterations to run
    """
    n = len(live)
    firsts = live[live.birthord == 1]
    others = live[live.birthord != 1]

    # compare pregnancy lengths
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.DiffMeansPermute(data)
    p1 = ht.PValue(iters=iters)

    # compare birth weights
    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = hypothesis.DiffMeansPermute(data)
    p2 = ht.PValue(iters=iters)

    # test correlation
    live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    data = live2.agepreg.values, live2.totalwgt_lb.values
    ht = hypothesis.CorrelationPermute(data)
    p3 = ht.PValue(iters=iters)

    # compare pregnancy lengths (chi-squared)
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.PregLengthTest(data)
    p4 = ht.PValue(iters=iters)

    print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))
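

# A small sketch of the sample sizes that main() works through: each pass
# halves n with integer division. The helper name is hypothetical and is
# not part of thinkstats2.
def SketchSampleSizes(n, steps):
    """Returns the sizes visited when n is halved after each of `steps` passes."""
    sizes = []
    for _ in range(steps):
        sizes.append(n)
        n //= 2
    return sizes

# For example, SketchSampleSizes(9148, 7) reproduces the first column of the
# results table in the docstring near the top of this file.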


def main():
    thinkstats2.RandomSeed(18)

    live, firsts, others = first.MakeFrames()
    RunResampleTest(firsts, others)

    # rerun the tests on random subsets, halving the sample size each time
    n = len(live)
    for _ in range(7):
        sample = thinkstats2.SampleRows(live, n)
        RunTests(sample)
        n //= 2


if __name__ == '__main__':
    main()