Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.

This is the accompanying code for this book.

Website: http://greenteapress.com/wp/think-stats-2e/

Views: 7138
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3
4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7
8
from __future__ import print_function, division
9
10
import thinkstats2
11
import thinkplot
12
13
import math
14
import random
15
import numpy as np
16
17
from scipy import stats
18
from estimation import RMSE, MeanError
19
20
21
"""This file contains a solution to exercises in Think Stats:
22
23
Exercise 8.1
24
25
In this chapter we used $\xbar$ and median to estimate $\mu$, and
26
found that $\xbar$ yields lower MSE.
27
Also, we used $S^2$ and $S_{n-1}^2$ to estimate $\sigma$, and found that
28
$S^2$ is biased and $S_{n-1}^2$ unbiased.
29
30
Run similar experiments to see if $\xbar$ and median are biased estimates
31
of $\mu$.
32
Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
33
34
My conclusions:
35
36
1) xbar and median yield lower mean error as m increases, so neither
37
one is obviously biased, as far as we can tell from the experiment.
38
39
2) The biased estimator of variance yields lower RMSE than the unbiased
40
estimator, by about 10%. And the difference holds up as m increases.
41
42
43
Exercise 8.2
44
45
Suppose you draw a sample with size $n=10$ from a population
46
with an exponential disrtribution with $\lambda=2$. Simulate
47
this experiment 1000 times and plot the sampling distribution of
48
the estimate $\lamhat$. Compute the standard error of the estimate
49
and the 90\% confidence interval.
50
51
Repeat the experiment with a few different values of $n$ and make
52
a plot of standard error versus $n$.
53
54
1) With sample size 10:
55
56
standard error 0.896717911545
57
confidence interval (1.2901330772324622, 3.8692334892427911)
58
59
2) As sample size increases, standard error and the width of
60
the CI decrease:
61
62
10 0.90 (1.3, 3.9)
63
100 0.21 (1.7, 2.4)
64
1000 0.06 (1.9, 2.1)
65
66
All three confidence intervals contain the actual value, 2.
67
68
69
Exercise 8.3
70
71
In games like hockey and soccer, the time between goals is
72
roughly exponential. So you could estimate a team's goal-scoring rate
73
by observing the number of goals they score in a game. This
74
estimation process is a little different from sampling the time
75
between goals, so let's see how it works.
76
77
Write a function that takes a goal-scoring rate, {\tt lam}, in goals
78
per game, and simulates a game by generating the time between goals
79
until the total time exceeds 1 game, then returns the number of goals
80
scored.
81
82
Write another function that simulates many games, stores the
83
estimates of {\tt lam}, then computes their mean error and RMSE.
84
85
Is this way of making an estimate biased? Plot the sampling
86
distribution of the estimates and the 90\% confidence interval. What
87
is the standard error? What happens to sampling error for increasing
88
values of {\tt lam}?
89
90
My conclusions:
91
92
1) RMSE for this way of estimating lambda is 1.4
93
94
2) The mean error is small and decreases with m, so this estimator
95
appears to be unbiased.
96
97
One note: If the time between goals is exponential, the distribution
98
of goals scored in a game is Poisson.
99
100
See https://en.wikipedia.org/wiki/Poisson_distribution
101
102
"""
103
104
def Estimate1(n=7, m=100000):
105
"""Mean error for xbar and median as estimators of population mean.
106
107
n: sample size
108
m: number of iterations
109
"""
110
mu = 0
111
sigma = 1
112
113
means = []
114
medians = []
115
for _ in range(m):
116
xs = [random.gauss(mu, sigma) for i in range(n)]
117
xbar = np.mean(xs)
118
median = np.median(xs)
119
means.append(xbar)
120
medians.append(median)
121
122
print('Experiment 1')
123
print('mean error xbar', MeanError(means, mu))
124
print('mean error median', MeanError(medians, mu))
125
126
127
def Estimate2(n=7, m=100000):
128
"""RMSE for biased and unbiased estimators of population variance.
129
130
n: sample size
131
m: number of iterations
132
"""
133
mu = 0
134
sigma = 1
135
136
estimates1 = []
137
estimates2 = []
138
for _ in range(m):
139
xs = [random.gauss(mu, sigma) for i in range(n)]
140
biased = np.var(xs)
141
unbiased = np.var(xs, ddof=1)
142
estimates1.append(biased)
143
estimates2.append(unbiased)
144
145
print('Experiment 2')
146
print('RMSE biased', RMSE(estimates1, sigma**2))
147
print('RMSE unbiased', RMSE(estimates2, sigma**2))
148
149
150
def SimulateSample(lam=2, n=10, m=1000):
151
"""Sampling distribution of L as an estimator of exponential parameter.
152
153
lam: parameter of an exponential distribution
154
n: sample size
155
m: number of iterations
156
"""
157
def VertLine(x, y=1):
158
thinkplot.Plot([x, x], [0, y], color='0.8', linewidth=3)
159
160
estimates = []
161
for j in range(m):
162
xs = np.random.exponential(1/lam, n)
163
lamhat = 1/np.mean(xs)
164
estimates.append(lamhat)
165
166
stderr = RMSE(estimates, lam)
167
print('standard error', stderr)
168
169
cdf = thinkstats2.Cdf(estimates)
170
ci = cdf.Percentile(5), cdf.Percentile(95)
171
print('confidence interval', ci)
172
VertLine(ci[0])
173
VertLine(ci[1])
174
175
# plot the CDF
176
thinkplot.Cdf(cdf)
177
thinkplot.Save(root='estimation2',
178
xlabel='estimate',
179
ylabel='CDF',
180
title='Sampling distribution')
181
182
return stderr
183
184
185
def SimulateGame(lam):
186
"""Simulates a game and returns the estimated goal-scoring rate.
187
188
lam: actual goal scoring rate in goals per game
189
"""
190
goals = 0
191
t = 0
192
while True:
193
time_between_goals = random.expovariate(lam)
194
t += time_between_goals
195
if t > 1:
196
break
197
goals += 1
198
199
# estimated goal-scoring rate is the actual number of goals scored
200
L = goals
201
return L
202
203
204
def Estimate4(lam=2, m=1000000):
205
206
estimates = []
207
for i in range(m):
208
L = SimulateGame(lam)
209
estimates.append(L)
210
211
print('Experiment 4')
212
print('rmse L', RMSE(estimates, lam))
213
print('mean error L', MeanError(estimates, lam))
214
215
pmf = thinkstats2.Pmf(estimates)
216
217
thinkplot.Hist(pmf)
218
thinkplot.Show()
219
220
221
def main():
222
thinkstats2.RandomSeed(17)
223
224
Estimate1()
225
Estimate2()
226
227
print('Experiment 3')
228
for n in [10, 100, 1000]:
229
stderr = SimulateSample(n=n)
230
print(n, stderr)
231
232
Estimate4()
233
234
235
if __name__ == '__main__':
236
main()
237
238