CoCalc -- chap08soln.py

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
Website: http://greenteapress.com/wp/think-stats-2e/
Path: think-stats-code / chap08soln.py
Views: ⁷¹³⁸
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3

4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7

8
from __future__ import print_function, division
9

10
import thinkstats2
11
import thinkplot
12

13
import math
14
import random
15
import numpy as np
16

17
from scipy import stats
18
from estimation import RMSE, MeanError
19

20

21
"""This file contains a solution to exercises in Think Stats:
22

23
Exercise 8.1
24

25
In this chapter we used $\xbar$ and median to estimate $\mu$, and
26
found that $\xbar$  yields lower MSE.
27
Also, we used $S^2$ and $S_{n-1}^2$ to estimate $\sigma$, and found that
28
$S^2$ is biased and $S_{n-1}^2$ unbiased.
29

30
Run similar experiments to see if $\xbar$ and median are biased estimates
31
of $\mu$.
32
Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
33

34
My conclusions:
35

36
1) xbar and median yield lower mean error as m increases, so neither
37
one is obviously biased, as far as we can tell from the experiment.
38

39
2) The biased estimator of variance yields lower RMSE than the unbiased
40
estimator, by about 10%.  And the difference holds up as m increases.
41

42

43
Exercise 8.2
44

45
Suppose you draw a sample with size $n=10$ from a population 
46
with an exponential disrtribution with $\lambda=2$.  Simulate
47
this experiment 1000 times and plot the sampling distribution of
48
the estimate $\lamhat$.  Compute the standard error of the estimate
49
and the 90\% confidence interval.
50

51
Repeat the experiment with a few different values of $n$ and make
52
a plot of standard error versus $n$.
53

54
1) With sample size 10:
55

56
standard error 0.896717911545
57
confidence interval (1.2901330772324622, 3.8692334892427911)
58

59
2) As sample size increases, standard error and the width of
60
the CI decrease:
61

62
10      0.90    (1.3, 3.9)
63
100     0.21    (1.7, 2.4)
64
1000    0.06    (1.9, 2.1)
65

66
All three confidence intervals contain the actual value, 2.
67

68

69
Exercise 8.3
70

71
In games like hockey and soccer, the time between goals is
72
roughly exponential.  So you could estimate a team's goal-scoring rate
73
by observing the number of goals they score in a game.  This
74
estimation process is a little different from sampling the time
75
between goals, so let's see how it works.
76

77
Write a function that takes a goal-scoring rate, {\tt lam}, in goals
78
per game, and simulates a game by generating the time between goals
79
until the total time exceeds 1 game, then returns the number of goals
80
scored.
81

82
Write another function that simulates many games, stores the
83
estimates of {\tt lam}, then computes their mean error and RMSE.
84

85
Is this way of making an estimate biased?  Plot the sampling
86
distribution of the estimates and the 90\% confidence interval.  What
87
is the standard error?  What happens to sampling error for increasing
88
values of {\tt lam}?
89

90
My conclusions:
91

92
1) RMSE for this way of estimating lambda is 1.4
93

94
2) The mean error is small and decreases with m, so this estimator
95
appears to be unbiased.
96

97
One note: If the time between goals is exponential, the distribution
98
of goals scored in a game is Poisson.
99

100
See https://en.wikipedia.org/wiki/Poisson_distribution
101

102
"""
103

104
def Estimate1(n=7, m=100000):
105
    """Mean error for xbar and median as estimators of population mean.
106

107
    n: sample size
108
    m: number of iterations
109
    """
110
    mu = 0
111
    sigma = 1
112

113
    means = []
114
    medians = []
115
    for _ in range(m):
116
        xs = [random.gauss(mu, sigma) for i in range(n)]
117
        xbar = np.mean(xs)
118
        median = np.median(xs)
119
        means.append(xbar)
120
        medians.append(median)
121

122
    print('Experiment 1')
123
    print('mean error xbar', MeanError(means, mu))
124
    print('mean error median', MeanError(medians, mu))
125

126

127
def Estimate2(n=7, m=100000):
128
    """RMSE for biased and unbiased estimators of population variance.
129

130
    n: sample size
131
    m: number of iterations
132
    """
133
    mu = 0
134
    sigma = 1
135

136
    estimates1 = []
137
    estimates2 = []
138
    for _ in range(m):
139
        xs = [random.gauss(mu, sigma) for i in range(n)]
140
        biased = np.var(xs)
141
        unbiased = np.var(xs, ddof=1)
142
        estimates1.append(biased)
143
        estimates2.append(unbiased)
144

145
    print('Experiment 2')
146
    print('RMSE biased', RMSE(estimates1, sigma**2))
147
    print('RMSE unbiased', RMSE(estimates2, sigma**2))
148

149

150
def SimulateSample(lam=2, n=10, m=1000):
151
    """Sampling distribution of L as an estimator of exponential parameter.
152

153
    lam: parameter of an exponential distribution
154
    n: sample size
155
    m: number of iterations
156
    """
157
    def VertLine(x, y=1):
158
        thinkplot.Plot([x, x], [0, y], color='0.8', linewidth=3)
159

160
    estimates = []
161
    for j in range(m):
162
        xs = np.random.exponential(1/lam, n)
163
        lamhat = 1/np.mean(xs)
164
        estimates.append(lamhat)
165

166
    stderr = RMSE(estimates, lam)
167
    print('standard error', stderr)
168

169
    cdf = thinkstats2.Cdf(estimates)
170
    ci = cdf.Percentile(5), cdf.Percentile(95)
171
    print('confidence interval', ci)
172
    VertLine(ci[0])
173
    VertLine(ci[1])
174

175
    # plot the CDF
176
    thinkplot.Cdf(cdf)
177
    thinkplot.Save(root='estimation2',
178
                   xlabel='estimate',
179
                   ylabel='CDF',
180
                   title='Sampling distribution')
181

182
    return stderr
183

184

185
def SimulateGame(lam):
186
    """Simulates a game and returns the estimated goal-scoring rate.
187

188
    lam: actual goal scoring rate in goals per game
189
    """
190
    goals = 0
191
    t = 0
192
    while True:
193
        time_between_goals = random.expovariate(lam)
194
        t += time_between_goals
195
        if t > 1:
196
            break
197
        goals += 1
198

199
    # estimated goal-scoring rate is the actual number of goals scored
200
    L = goals
201
    return L
202

203

204
def Estimate4(lam=2, m=1000000):
205

206
    estimates = []
207
    for i in range(m):
208
        L = SimulateGame(lam)
209
        estimates.append(L)
210

211
    print('Experiment 4')
212
    print('rmse L', RMSE(estimates, lam))
213
    print('mean error L', MeanError(estimates, lam))
214
    
215
    pmf = thinkstats2.Pmf(estimates)
216

217
    thinkplot.Hist(pmf)
218
    thinkplot.Show()
219
        
220

221
def main():
222
    thinkstats2.RandomSeed(17)
223

224
    Estimate1()
225
    Estimate2()
226

227
    print('Experiment 3')
228
    for n in [10, 100, 1000]:
229
        stderr = SimulateSample(n=n)
230
        print(n, stderr)
231

232
    Estimate4()
233

234

235
if __name__ == '__main__':
236
    main()
237

238
Product

Resources

Company