Think Stats by Allen B. Downey

Think Stats is an introduction to probability and statistics for Python programmers.

This is the accompanying code for this book.

Website: http://greenteapress.com/wp/think-stats-2e/

License: GPL3
"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function, division

import first
import hypothesis
import scatter
import thinkstats2

import numpy as np


"""This file contains a solution to exercises in Think Stats:

As sample size increases, the power of a hypothesis test increases,
which means it is more likely to be positive if the effect is real.
Conversely, as sample size decreases, the test is less likely to
be positive even if the effect is real.

To investigate this behavior, run the tests in this chapter with
different subsets of the NSFG data.  You can use thinkstats2.SampleRows
to select a random subset of the rows in a DataFrame.

What happens to the p-values of these tests as sample size decreases?
What is the smallest sample size that yields a positive test?

My results:

test1: difference in mean pregnancy length
test2: difference in mean birth weight
test3: correlation of mother's age and birth weight
test4: chi-square test of pregnancy length

n       test1   test2   test3   test4
9148    0.16    0.00    0.00    0.00
4574    0.10    0.01    0.00    0.00
2287    0.25    0.06    0.00    0.00
1143    0.24    0.03    0.39    0.03
571     0.81    0.00    0.04    0.04
285     0.57    0.41    0.48    0.83
142     0.45    0.08    0.60    0.04

Conclusion: As expected, tests that are positive with large sample
sizes become negative as we take away data.  But the pattern is
erratic, with some positive tests even at small sample sizes.


Earlier in this chapter, we simulated the null hypothesis by
permutation; that is, we treated the observed values as if they
represented the entire population, and randomly assigned the
members of the population to the two groups.

An alternative is to use the sample to estimate the distribution for
the population, then draw a random sample from that distribution.
This process is called resampling.  There are several ways to
implement resampling, but one of the simplest is to draw a sample,
with replacement, from the observed values, as in the section on power.

Write a class named DiffMeansResample that inherits from
DiffMeansPermute and overrides RunModel to implement
resampling, rather than permutation.

Use this model to test the differences in pregnancy length and
birth weight.  How much does the model affect the results?

Results:

means permute preglength
p-value = 0.1674
actual = 0.0780372667775
ts max = 0.226752436104

means permute birthweight
p-value = 0.0
actual = 0.124761184535
ts max = 0.112243501197


Conclusions: Using resampling instead of permutation has very
little effect on the results.

The two models are based on slightly different assumptions, and in
this example there is no compelling reason to choose one or the other.
But in general p-values depend on the choice of the null hypothesis;
different models can yield very different results.


"""

class DiffMeansResample(hypothesis.DiffMeansPermute):
    """Tests a difference in means using resampling."""

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        group1 = np.random.choice(self.pool, self.n, replace=True)
        group2 = np.random.choice(self.pool, self.m, replace=True)
        return group1, group2


def RunResampleTest(firsts, others):
    """Tests differences in means by resampling.

    firsts: DataFrame
    others: DataFrame
    """
    # pregnancy length
    data = firsts.prglngth.values, others.prglngth.values
    ht = DiffMeansResample(data)
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute preglength')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())

    # birth weight; the exercise asks for the resampling model here as well
    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = DiffMeansResample(data)
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute birthweight')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())


def RunTests(live, iters=1000):
    """Runs the tests from Chapter 9 with a subset of the data.

    live: DataFrame
    iters: how many iterations to run
    """
    n = len(live)
    firsts = live[live.birthord == 1]
    others = live[live.birthord != 1]

    # compare pregnancy lengths
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.DiffMeansPermute(data)
    p1 = ht.PValue(iters=iters)

    # compare birth weights
    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = hypothesis.DiffMeansPermute(data)
    p2 = ht.PValue(iters=iters)

    # test correlation
    live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    data = live2.agepreg.values, live2.totalwgt_lb.values
    ht = hypothesis.CorrelationPermute(data)
    p3 = ht.PValue(iters=iters)

    # compare pregnancy lengths (chi-squared)
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.PregLengthTest(data)
    p4 = ht.PValue(iters=iters)

    print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))


def main():
    thinkstats2.RandomSeed(18)

    live, firsts, others = first.MakeFrames()
    RunResampleTest(firsts, others)

    n = len(live)
    for _ in range(7):
        sample = thinkstats2.SampleRows(live, n)
        RunTests(sample)
        n //= 2


if __name__ == '__main__':
    main()
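
The difference between the two null models described in the module docstring (permutation versus resampling) can be seen in a small standalone sketch. It does not use the book's `hypothesis` or `thinkstats2` modules. The function names `diff_means` and `p_value`, the `model` argument, and the use of the standard-library `random` module in place of NumPy are all assumptions made for illustration, not the author's implementation.

```python
import random
from statistics import mean


def diff_means(group1, group2):
    """Test statistic: absolute difference between the group means."""
    return abs(mean(group1) - mean(group2))


def p_value(group1, group2, model, iters=1000, seed=0):
    """Estimate a p-value under a null model ('permute' or 'resample').

    This is an illustrative helper, not part of the Think Stats code.
    """
    rng = random.Random(seed)
    pool = list(group1) + list(group2)
    n, m = len(group1), len(group2)
    actual = diff_means(group1, group2)

    count = 0
    for _ in range(iters):
        if model == 'permute':
            # shuffle the pooled values and split them: each observed
            # value appears exactly once across the two simulated groups
            rng.shuffle(pool)
            sim1, sim2 = pool[:n], pool[n:]
        elif model == 'resample':
            # draw each simulated group from the pool with replacement:
            # a value can appear several times or not at all
            sim1 = rng.choices(pool, k=n)
            sim2 = rng.choices(pool, k=m)
        else:
            raise ValueError('unknown model: %r' % model)

        # count simulated statistics at least as large as the observed one
        if diff_means(sim1, sim2) >= actual:
            count += 1

    return count / iters
```

Calling `p_value(a, b, 'permute')` and `p_value(a, b, 'resample')` on the same two lists gives a rough sense of how much the choice of null model matters for a given data set.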
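
The relationship between sample size and power described in the first exercise can also be explored without the NSFG data. The sketch below is a simulation under stated assumptions: two normal populations whose means differ by a hypothetical `effect` of 0.5 standard deviations, a permutation test on the difference in means, and a 5% threshold for a "positive" test. The names `permutation_p_value` and `estimate_power` are invented for this example.

```python
import random


def permutation_p_value(group1, group2, rng, iters=200):
    """P-value for the absolute difference in means, by permutation."""
    pool = group1 + group2
    n = len(group1)
    actual = abs(sum(group1) / n - sum(group2) / len(group2))

    count = 0
    for _ in range(iters):
        rng.shuffle(pool)
        sim1, sim2 = pool[:n], pool[n:]
        stat = abs(sum(sim1) / len(sim1) - sum(sim2) / len(sim2))
        if stat >= actual:
            count += 1
    return count / iters


def estimate_power(n, effect=0.5, trials=40, alpha=0.05, seed=0):
    """Fraction of simulated experiments that yield p < alpha.

    n: number of values in each group
    effect: true difference in means, in standard deviations (assumed)
    """
    rng = random.Random(seed)
    positives = 0
    for _ in range(trials):
        # the effect is real by construction: the population means differ
        group1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group2 = [rng.gauss(effect, 1.0) for _ in range(n)]
        if permutation_p_value(group1, group2, rng) < alpha:
            positives += 1
    return positives / trials
```

Evaluating `estimate_power(n)` for a few values of `n` (for example 10, 40 and 160) shows the fraction of positive tests rising with sample size. With only 40 trials per estimate the values are noisy, which fits the erratic pattern noted in the conclusion above.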