Think Stats by Allen B. Downey
Think Stats is an introduction to probability and statistics for Python programmers.
This is the accompanying code for this book.
License: GPL3
Examples and Exercises from Think Stats, 2nd Edition
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
The estimation game
Root mean squared error is one of several ways to summarize the average error of an estimation process.
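As a minimal sketch of the definition, RMSE can be computed as the square root of the mean of the squared differences between each estimate and the true value. The function name `rmse` and its arguments are illustrative choices, not necessarily the names used in the book's code.

```python
import math


def rmse(estimates, actual):
    """Root mean squared error of a sequence of estimates.

    estimates: sequence of numbers produced by an estimation process
    actual: the true value being estimated
    """
    squared_errors = [(estimate - actual) ** 2 for estimate in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([1, 3], 2)` is 1.0, because both estimates miss the true value by exactly 1.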
The following function simulates experiments where we try to estimate the mean of a population based on a sample with size n=7. We run iters=1000 experiments and collect the mean and median of each sample.
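One way such an experiment might look, using only the standard library. This sketch assumes the population is normal with mu=0 and sigma=1; those parameters, the `seed` argument, and the function names are assumptions made for illustration, since the text does not specify them.

```python
import math
import random
import statistics


def rmse(estimates, actual):
    """Root mean squared error of a sequence of estimates."""
    return math.sqrt(sum((e - actual) ** 2 for e in estimates) / len(estimates))


def estimate_mean(n=7, iters=1000, mu=0.0, sigma=1.0, seed=None):
    """Compare the sample mean and sample median as estimators of mu.

    Assumes a normal population (an illustrative choice).
    Returns (RMSE of the sample means, RMSE of the sample medians).
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(xs))
        medians.append(statistics.median(xs))
    return rmse(means, mu), rmse(medians, mu)


rmse_mean, rmse_median = estimate_mean()
print('rmse of sample mean', rmse_mean)
print('rmse of sample median', rmse_median)
```

The exact numbers vary from run to run unless a seed is passed.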
Using the sample mean, $\bar{x}$, to estimate the population mean works a little better than using the median; in the long run, it minimizes RMSE. But using the median is more robust in the presence of outliers or large errors.
Estimating variance
The obvious way to estimate the variance of a population is to compute the variance of the sample, $S^2$, but that turns out to be a biased estimator; that is, in the long run, the average error doesn't converge to 0.
The following function computes the mean error for a collection of estimates.
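A minimal sketch of such a function: the mean error is the average of the signed differences between the estimates and the true value, so positive and negative errors can cancel. The name `mean_error` is an illustrative choice.

```python
def mean_error(estimates, actual):
    """Mean of the signed errors of a sequence of estimates.

    Unlike RMSE, errors of opposite sign cancel, so a value near zero
    over many experiments suggests the estimator is unbiased.
    """
    errors = [estimate - actual for estimate in estimates]
    return sum(errors) / len(errors)
```

For example, `mean_error([1, 3], 2)` is 0.0 even though neither estimate is correct.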
The following function simulates experiments where we try to estimate the variance of a population based on a sample with size n=7. We run iters=1000 experiments and compute two estimates for each sample, $S^2$ and $S_{n-1}^2$.
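The variance experiment could be sketched as follows. It assumes a normal population with mu=0 and sigma=1 (an assumption, not stated in the text) and uses the standard library's `statistics.pvariance`, which divides by n, and `statistics.variance`, which divides by n-1, to stand in for the two estimators.

```python
import random
import statistics


def mean_error(estimates, actual):
    """Mean of the signed errors of a sequence of estimates."""
    return sum(e - actual for e in estimates) / len(estimates)


def estimate_variance(n=7, iters=1000, mu=0.0, sigma=1.0, seed=None):
    """Compare the two variance estimators on simulated normal samples.

    Returns (mean error dividing by n, mean error dividing by n-1).
    """
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        biased.append(statistics.pvariance(xs))   # sum of squares / n
        unbiased.append(statistics.variance(xs))  # sum of squares / (n-1)
    true_variance = sigma ** 2
    return mean_error(biased, true_variance), mean_error(unbiased, true_variance)


err_biased, err_unbiased = estimate_variance()
print('mean error, dividing by n', err_biased)
print('mean error, dividing by n-1', err_unbiased)
```

Under these assumptions the first mean error should settle near -sigma²/n, which is about -0.14 for n=7, while the second should hover around zero.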
The mean error for $S^2$ is non-zero, which suggests that it is biased. The mean error for $S_{n-1}^2$ is close to zero, and gets even smaller if we increase iters.
The sampling distribution
The following function simulates experiments where we estimate the mean of a population using $\bar{x}$, and returns a list of estimates, one from each experiment.
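A sketch of what this simulation might look like. The population parameters (mu=90, sigma=7.5), the sample size n=9, and the function name are illustrative assumptions; the text does not state them.

```python
import random
import statistics


def simulate_sample_means(mu=90.0, sigma=7.5, n=9, iters=1000, seed=None):
    """Run `iters` experiments and return the sample mean from each one.

    Each experiment draws n values from a normal population with the
    given mu and sigma (hypothetical values chosen for illustration).
    """
    rng = random.Random(seed)
    xbars = []
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbars.append(statistics.mean(xs))
    return xbars


xbars = simulate_sample_means()
print('mean of the sample means', statistics.mean(xbars))
```

The returned list is an empirical version of the sampling distribution: plotting it as a histogram or CDF shows how much the estimate varies between experiments.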
Here's the "sampling distribution of the mean" which shows how much we should expect $\bar{x}$ to vary from one experiment to the next.
The mean of the sample means is close to the actual value of µ.
An interval that contains 90% of the values in the sampling distribution is called a 90% confidence interval.
And the RMSE of the sample means is called the standard error.
Confidence intervals and standard errors quantify the variability in the estimate due to random sampling.
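Both quantities can be read off a list of simulated estimates. The sketch below takes the 5th and 95th percentiles as the ends of the 90% confidence interval and computes the standard error as the RMSE relative to the true value. The simulated population (normal, mu=90, sigma=7.5, n=9) and all function names are assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_sample_means(mu=90.0, sigma=7.5, n=9, iters=1000, seed=None):
    """Return the sample mean from each of `iters` simulated experiments."""
    rng = random.Random(seed)
    return [statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(iters)]


def summarize_sampling_distribution(estimates, actual):
    """Return ((low, high), stderr) for a list of estimates.

    (low, high) are the 5th and 95th percentiles, so the interval
    contains about 90% of the estimates; stderr is their RMSE
    relative to the true value `actual`.
    """
    cuts = statistics.quantiles(estimates, n=20)  # cut points at 5%, 10%, ..., 95%
    ci = (cuts[0], cuts[-1])
    stderr = math.sqrt(sum((x - actual) ** 2 for x in estimates) / len(estimates))
    return ci, stderr


xbars = simulate_sample_means()
ci, stderr = summarize_sampling_distribution(xbars, actual=90.0)
print('90% confidence interval', ci)
print('standard error', stderr)
```

With these assumed parameters the standard error should come out near sigma/sqrt(n) = 2.5. Note that `statistics.quantiles` requires Python 3.8 or later.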
Estimating rates
The following function simulates experiments where we try to estimate the rate parameter λ of an exponential distribution using the mean and median of a sample.
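A sketch of this experiment, under stated assumptions: the true rate is lam=2, the mean-based estimate is 1 / mean (an exponential distribution has mean 1/λ), and the median-based estimate is ln(2) / median (its median is ln(2)/λ). The function name and the returned dictionary keys are hypothetical.

```python
import math
import random
import statistics


def estimate_rate(n=7, iters=1000, lam=2.0, seed=None):
    """Estimate the rate of an exponential distribution two ways.

    Returns a dict with the RMSE and mean error of each estimator:
    'L' uses the sample mean, 'Lm' uses the sample median.
    """
    rng = random.Random(seed)
    from_mean, from_median = [], []
    for _ in range(iters):
        xs = [rng.expovariate(lam) for _ in range(n)]
        from_mean.append(1 / statistics.mean(xs))
        from_median.append(math.log(2) / statistics.median(xs))

    def rmse(estimates):
        return math.sqrt(sum((e - lam) ** 2 for e in estimates) / len(estimates))

    def mean_error(estimates):
        return sum(e - lam for e in estimates) / len(estimates)

    return {
        'rmse L': rmse(from_mean),
        'rmse Lm': rmse(from_median),
        'mean error L': mean_error(from_mean),
        'mean error Lm': mean_error(from_median),
    }


for name, value in estimate_rate().items():
    print(name, value)
```

Comparing the two mean errors with zero is what indicates whether either estimator is biased.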
The RMSE is smaller for the sample mean than for the sample median.
But neither estimator is unbiased.
Exercises
Exercise: In this chapter we used $\bar{x}$ and median to estimate µ, and found that $\bar{x}$ yields lower MSE. Also, we used $S^2$ and $S_{n-1}^2$ to estimate σ², and found that $S^2$ is biased and $S_{n-1}^2$ unbiased. Run similar experiments to see if $\bar{x}$ and median are biased estimates of µ. Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
Exercise: Suppose you draw a sample with size n=10 from an exponential distribution with λ=2. Simulate this experiment 1000 times and plot the sampling distribution of the estimate L. Compute the standard error of the estimate and the 90% confidence interval.
Repeat the experiment with a few different values of n and make a plot of standard error versus n.
Exercise: In games like hockey and soccer, the time between goals is roughly exponential. So you could estimate a team’s goal-scoring rate by observing the number of goals they score in a game. This estimation process is a little different from sampling the time between goals, so let’s see how it works.
Write a function that takes a goal-scoring rate, lam, in goals per game, and simulates a game by generating the time between goals until the total time exceeds 1 game, then returns the number of goals scored.
Write another function that simulates many games, stores the estimates of lam, then computes their mean error and RMSE.
Is this way of making an estimate biased?