Examples and Exercises from Think Stats, 2nd Edition
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
Time series analysis
Load the data from "Price of Weed".
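Here's a loading sketch, assuming the cleaned dataset is in `mj-clean.csv` (the filename used in the book's repository), with the date in column 5:

```python
import numpy as np
import pandas as pd

# parse column 5 ('date') as datetimes so we can group by day later
transactions = pd.read_csv("mj-clean.csv", parse_dates=[5])
transactions.head()
```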
| | city | state | price | amount | quality | date | ppg | state.name | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annandale | VA | 100 | 7.075 | high | 2010-09-02 | 14.13 | Virginia | 38.830345 | -77.213870 |
| 1 | Auburn | AL | 60 | 28.300 | high | 2010-09-02 | 2.12 | Alabama | 32.578185 | -85.472820 |
| 2 | Austin | TX | 60 | 28.300 | medium | 2010-09-02 | 2.12 | Texas | 30.326374 | -97.771258 |
| 3 | Belleville | IL | 400 | 28.300 | high | 2010-09-02 | 14.13 | Illinois | 38.532311 | -89.983521 |
| 4 | Boone | NC | 55 | 3.540 | high | 2010-09-02 | 15.54 | North Carolina | 36.217052 | -81.687983 |
The following function takes a DataFrame of transactions and computes daily averages.
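A sketch along the lines of the book's `GroupByDay`, updated for current pandas:

```python
def GroupByDay(transactions, func="mean"):
    """Groups transactions by day and computes the daily average ppg."""
    grouped = transactions[["date", "ppg"]].groupby("date")
    daily = grouped.aggregate(func)

    # add elapsed time in years, which we'll use as the regression variable
    daily["date"] = daily.index
    start = daily.date.iloc[0]
    one_year = np.timedelta64(1, "Y")
    daily["years"] = (daily.date - start) / one_year

    return daily
```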
The following function returns a map from quality name to a DataFrame of daily averages.
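A sketch, reusing `GroupByDay`:

```python
def GroupByQualityAndDay(transactions):
    """Returns a map from quality name to a DataFrame of daily averages."""
    dailies = {}
    for name, group in transactions.groupby("quality"):
        dailies[name] = GroupByDay(group)
    return dailies
```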
`dailies` is the map from quality name to DataFrame.
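For example:

```python
dailies = GroupByQualityAndDay(transactions)
```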
The following plots the daily average price for each quality.
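A matplotlib sketch (the book uses its own `thinkplot` wrappers, so details differ):

```python
import matplotlib.pyplot as plt

def PlotDailies(dailies):
    """Plots the daily average price for each quality, one panel per quality."""
    for i, (name, daily) in enumerate(sorted(dailies.items())):
        plt.subplot(3, 1, i + 1)
        plt.plot(daily.index, daily.ppg, ".", markersize=3, label=name)
        plt.ylabel("price per gram ($)")
        plt.legend()
    plt.xlabel("date")
    plt.show()

PlotDailies(dailies)
```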
We can use `statsmodels` to run a linear model of price as a function of time.
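A sketch following the book's `RunLinearModel`; the loop prints a regression summary for each quality:

```python
import statsmodels.formula.api as smf

def RunLinearModel(daily):
    model = smf.ols("ppg ~ years", data=daily)
    results = model.fit()
    return model, results

for name, daily in sorted(dailies.items()):
    model, results = RunLinearModel(daily)
    print(name)
    print(results.summary())
```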
Here's what the results look like for the high, low, and medium quality categories, respectively.
| Dep. Variable: | ppg | R-squared: | 0.444 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.444 |
| Method: | Least Squares | F-statistic: | 989.7 |
| Date: | Thu, 24 Jan 2019 | Prob (F-statistic): | 3.69e-160 |
| Time: | 09:23:26 | Log-Likelihood: | -1510.1 |
| No. Observations: | 1241 | AIC: | 3024. |
| Df Residuals: | 1239 | BIC: | 3035. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 13.4496 | 0.045 | 296.080 | 0.000 | 13.361 | 13.539 |
| years | -0.7082 | 0.023 | -31.460 | 0.000 | -0.752 | -0.664 |

| Omnibus: | 56.254 | Durbin-Watson: | 1.847 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 128.992 |
| Skew: | 0.252 | Prob(JB): | 9.76e-29 |
| Kurtosis: | 4.497 | Cond. No. | 4.71 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
| Dep. Variable: | ppg | R-squared: | 0.030 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.029 |
| Method: | Least Squares | F-statistic: | 35.90 |
| Date: | Thu, 24 Jan 2019 | Prob (F-statistic): | 2.76e-09 |
| Time: | 09:23:26 | Log-Likelihood: | -3091.3 |
| No. Observations: | 1179 | AIC: | 6187. |
| Df Residuals: | 1177 | BIC: | 6197. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.3616 | 0.194 | 27.671 | 0.000 | 4.981 | 5.742 |
| years | 0.5683 | 0.095 | 5.991 | 0.000 | 0.382 | 0.754 |

| Omnibus: | 649.338 | Durbin-Watson: | 1.820 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6347.614 |
| Skew: | 2.373 | Prob(JB): | 0.00 |
| Kurtosis: | 13.329 | Cond. No. | 4.85 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
| Dep. Variable: | ppg | R-squared: | 0.050 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.049 |
| Method: | Least Squares | F-statistic: | 64.92 |
| Date: | Thu, 24 Jan 2019 | Prob (F-statistic): | 1.82e-15 |
| Time: | 09:23:26 | Log-Likelihood: | -2053.9 |
| No. Observations: | 1238 | AIC: | 4112. |
| Df Residuals: | 1236 | BIC: | 4122. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 8.8791 | 0.071 | 125.043 | 0.000 | 8.740 | 9.018 |
| years | 0.2832 | 0.035 | 8.057 | 0.000 | 0.214 | 0.352 |

| Omnibus: | 133.025 | Durbin-Watson: | 1.767 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 630.863 |
| Skew: | 0.385 | Prob(JB): | 1.02e-137 |
| Kurtosis: | 6.411 | Cond. No. | 4.73 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now let's plot the fitted model with the data.
The following function plots the original data and the fitted curve.
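A sketch, pulling the observed and fitted values out of the statsmodels objects:

```python
def PlotFittedValues(model, results, label=""):
    """Plots the original data and the fitted line."""
    years = model.exog[:, 1]   # second column of the design matrix is 'years'
    values = model.endog       # observed ppg
    plt.scatter(years, values, s=15, alpha=0.5, label=label)
    plt.plot(years, results.fittedvalues, color="red", label="model")
    plt.xlabel("years")
    plt.ylabel("price per gram ($)")
    plt.legend()
```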
Here are the results for the high quality category:
Moving averages
As a simple example, I'll show the rolling average of the numbers from 1 to 10.
With a "window" of size 3, we get the average of the previous 3 elements, or nan when there are fewer than 3.
The following function plots the rolling mean.
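A sketch; reindexing to a complete date range makes the gaps in the data explicit as NaNs:

```python
def PlotRollingMean(daily, name):
    """Plots the raw daily series and a 30-day rolling mean."""
    dates = pd.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)

    plt.scatter(reindexed.index, reindexed.ppg, s=15, alpha=0.3, label=name)
    roll_mean = reindexed.ppg.rolling(30).mean()
    plt.plot(roll_mean, color="red", label="rolling mean")
    plt.xlabel("date")
    plt.ylabel("price per gram ($)")
    plt.legend()
```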
Here's what it looks like for the high quality category.
The exponentially-weighted moving average gives more weight to more recent points.
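With pandas this is one line; `span=30` makes it roughly comparable to the 30-day rolling mean (this assumes `reindexed` from the previous sketch):

```python
ewma = reindexed.ppg.ewm(span=30).mean()
plt.plot(ewma, color="red", label="EWMA")
```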
We can use resampling to generate missing values with the right amount of noise.
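A sketch of the idea: reindex to a full date range, compute the EWMA, then fill each missing day with the EWMA value plus a residual resampled from the observed deviations:

```python
def FillMissing(daily, span=30):
    """Fills missing days with EWMA values plus resampled noise."""
    dates = pd.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)

    ewma = reindexed.ppg.ewm(span=span).mean()
    resid = (reindexed.ppg - ewma).dropna()
    fake_data = ewma + np.random.choice(resid.values, len(reindexed))
    reindexed["ppg"] = reindexed.ppg.fillna(fake_data)

    # keep the smoothed values and deviations for later sections
    reindexed["ewma"] = ewma
    reindexed["resid"] = reindexed.ppg - ewma
    return reindexed
```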
Here's what the EWMA model looks like with missing values filled.
Serial correlation
The following function computes serial correlation with the given lag.
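A sketch using `shift` to lag the series:

```python
def SerialCorr(series, lag=1):
    """Correlation between the series and itself shifted by lag."""
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return np.corrcoef(xs, ys)[0, 1]
```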
Before computing correlations, we'll fill missing values.
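Using `FillMissing` from above:

```python
filled_dailies = {name: FillMissing(daily) for name, daily in dailies.items()}
```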
Here are the serial correlations for raw price data.
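Assuming lag 1, that is, consecutive days:

```python
for name, filled in sorted(filled_dailies.items()):
    print(name, SerialCorr(filled.ppg, lag=1))
```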
It's not surprising that there are correlations between consecutive days, because there are obvious trends in the data.
It is more interesting to see whether there are still correlations after we subtract away the trends.
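Here the EWMA serves as the trend, so we can look at the deviations stored in the `resid` column:

```python
for name, filled in sorted(filled_dailies.items()):
    print(name, SerialCorr(filled.resid, lag=1))
```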
Even if the correlations between consecutive days are weak, there might be correlations across intervals of one week, one month, or one year.
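We can check by varying the lag, as sketched here:

```python
for lag in [1, 7, 30, 365]:
    corrs = [SerialCorr(filled.resid, lag)
             for _, filled in sorted(filled_dailies.items())]
    print(lag, corrs)
```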
The strongest correlation is a weekly cycle in the medium quality category.
Autocorrelation
The autocorrelation function is the serial correlation computed for all lags.
We can use it to replicate the results from the previous section.
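`statsmodels` provides the ACF directly; a sketch for the high quality category, reading off the same lags as the previous section:

```python
import statsmodels.tsa.stattools as smtsa

filled = filled_dailies["high"]
acf = smtsa.acf(filled.resid, nlags=365, fft=False)
print(acf[1], acf[7], acf[30], acf[365])
```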
To get a sense of how much autocorrelation we should expect by chance, we can resample the data (which eliminates any actual autocorrelation) and compute the ACF.
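A sketch: resample the deviations with replacement, compute the ACF each time, and keep the 97.5th percentile of the magnitudes as the boundary of the gray region:

```python
def SimulateAutocorrelation(daily, iters=1001, nlags=40):
    """Plots the range of ACF values we'd expect from uncorrelated data."""
    t = []
    for _ in range(iters):
        filled = FillMissing(daily)
        resid = np.random.choice(filled.resid.values, len(filled))
        acf = smtsa.acf(resid, nlags=nlags, fft=False)[1:]
        t.append(np.abs(acf))

    high = np.percentile(t, 97.5, axis=0)
    lags = np.arange(1, nlags + 1)
    plt.fill_between(lags, -high, high, alpha=0.2, color="gray")
```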
The following function plots the actual autocorrelation for lags up to 40 days.
The flag `add_weekly` indicates whether we should add a simulated weekly cycle. To show what a strong weekly cycle would look like, we have the option of adding a price increase of 1-2 dollars on Fridays and Saturdays.
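Sketches of both pieces, assuming the 1-2 dollar bump is drawn uniformly:

```python
def AddWeeklySeasonality(daily):
    """Adds a fake price increase of $1-2 on Fridays and Saturdays."""
    fri_or_sat = (daily.index.dayofweek == 4) | (daily.index.dayofweek == 5)
    fake = daily.copy()
    fake.loc[fri_or_sat, "ppg"] += np.random.uniform(1, 2, fri_or_sat.sum())
    return fake

def PlotAutoCorrelation(dailies, nlags=40, add_weekly=False):
    """Plots the ACF of the EWMA deviations for each quality."""
    for name, daily in sorted(dailies.items()):
        if add_weekly:
            daily = AddWeeklySeasonality(daily)
        filled = FillMissing(daily)
        acf = smtsa.acf(filled.resid, nlags=nlags, fft=False)[1:]
        plt.plot(np.arange(1, nlags + 1), acf, label=name)
    plt.xlabel("lag (days)")
    plt.ylabel("correlation")
    plt.legend()
```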
Here's what the real ACFs look like. The gray regions indicate the levels we expect by chance.
Here's what it would look like if there were a weekly cycle.
Prediction
The simplest way to generate predictions is to use `statsmodels` to fit a model to the data, then use the `predict` method from the results.
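A sketch; with a formula model, `predict` takes a DataFrame containing the explanatory variables:

```python
def GenerateSimplePrediction(results, years):
    """Predicted ppg at the given times, using the fitted model."""
    predict_df = pd.DataFrame(dict(years=years))
    return results.predict(predict_df)

_, results = RunLinearModel(dailies["high"])
years = np.linspace(0, 5, 101)
predict = GenerateSimplePrediction(results, years)
```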
Here's what the prediction looks like for the high quality category, using the linear model.
When we generate predictions, we want to quantify the uncertainty in the prediction. We can do that by resampling. The following function fits a model to the data, computes residuals, then resamples from the residuals to generate fake datasets. It fits the same model to each fake dataset and returns a list of results.
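A sketch, resampling the residuals with replacement:

```python
def SimulateResults(daily, iters=101, func=RunLinearModel):
    """Fits the model to fake datasets built from resampled residuals."""
    _, results = func(daily)
    fake = daily.copy()

    result_seq = []
    for _ in range(iters):
        fake["ppg"] = results.fittedvalues + np.random.choice(
            results.resid.values, len(results.resid))
        _, fake_results = func(fake)
        result_seq.append(fake_results)

    return result_seq
```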
To generate predictions, we take the list of results fitted to resampled data. For each model, we use the `predict` method to generate predictions, and return a sequence of predictions.

If `add_resid` is true, we add resampled residuals to the predicted values, which generates predictions that include predictive uncertainty (due to random noise) as well as modeling uncertainty (due to random sampling).
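A sketch:

```python
def GeneratePredictions(result_seq, years, add_resid=False):
    """One sequence of predictions per fitted model."""
    predict_df = pd.DataFrame(dict(years=years))

    predict_seq = []
    for fake_results in result_seq:
        predict = fake_results.predict(predict_df)
        if add_resid:
            predict += np.random.choice(fake_results.resid.values, len(years))
        predict_seq.append(predict)

    return predict_seq
```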
To visualize predictions, I show a darker region that quantifies modeling uncertainty and a lighter region that quantifies predictive uncertainty.
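A sketch with a 90% interval: the `add_resid=True` band (noise plus sampling error) is wider and lighter; the `add_resid=False` band (sampling error only) is narrower and darker:

```python
def PlotPredictions(daily, years, iters=101, percent=90, func=RunLinearModel):
    """Plots predictive (light) and modeling (dark) uncertainty bands."""
    result_seq = SimulateResults(daily, iters=iters, func=func)
    p = (100 - percent) / 2
    percents = p, 100 - p

    predict_seq = GeneratePredictions(result_seq, years, add_resid=True)
    low, high = np.percentile(predict_seq, percents, axis=0)
    plt.fill_between(years, low, high, alpha=0.2, color="gray")

    predict_seq = GeneratePredictions(result_seq, years, add_resid=False)
    low, high = np.percentile(predict_seq, percents, axis=0)
    plt.fill_between(years, low, high, alpha=0.5, color="gray")
```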
Here are the results for the high quality category.
But there is one more source of uncertainty: how much past data should we use to build the model?
The following function generates a sequence of models based on different amounts of past data.
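A sketch: each model sees only the data from some starting point onward, with residual resampling applied within each interval as before:

```python
def SimulateIntervals(daily, iters=101, func=RunLinearModel):
    """Fits models to intervals with progressively later start dates."""
    result_seq = []
    starts = np.linspace(0, len(daily), iters).astype(int)

    for start in starts[:-2]:   # skip the last starts, which leave too little data
        interval = daily[start:].copy()
        _, results = func(interval)
        interval["ppg"] = results.fittedvalues + np.random.choice(
            results.resid.values, len(results.resid))
        _, fake_results = func(interval)
        result_seq.append(fake_results)

    return result_seq
```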
And this function plots the results.
Here's what the high quality category looks like if we take into account uncertainty about how much past data to use.
Exercises
Exercise: The linear model I used in this chapter has the obvious drawback that it is linear, and there is no reason to expect prices to change linearly over time. We can add flexibility to the model by adding a quadratic term, as we did in Section 11.3.

Use a quadratic model to fit the time series of daily prices, and use the model to generate predictions. You will have to write a version of `RunLinearModel` that runs the quadratic model, but after that you should be able to reuse code from the chapter to generate predictions.
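One possible shape for that function (the name `RunQuadraticModel` is my own); note that `GeneratePredictions` would then also need a `years2` column in the DataFrame it passes to `predict`:

```python
def RunQuadraticModel(daily):
    """Like RunLinearModel, but adds a quadratic term."""
    daily["years2"] = daily.years ** 2
    model = smf.ols("ppg ~ years + years2", data=daily)
    results = model.fit()
    return model, results
```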
Exercise: Write a definition for a class named `SerialCorrelationTest` that extends `HypothesisTest` from Section 9.2. It should take a series and a lag as data, compute the serial correlation of the series with the given lag, and then compute the p-value of the observed correlation.
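A possible skeleton, assuming `HypothesisTest` from the book's `thinkstats2` module (where `RunModel` generates data under the null hypothesis):

```python
import thinkstats2

class SerialCorrelationTest(thinkstats2.HypothesisTest):
    """Tests whether an observed serial correlation could arise by chance."""

    def TestStatistic(self, data):
        series, lag = data
        return abs(SerialCorr(series, lag))

    def RunModel(self):
        series, lag = self.data
        # shuffling the series destroys any serial correlation
        permutation = series.reindex(np.random.permutation(series.index))
        return permutation, lag
```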
Use this class to test whether the serial correlation in raw price data is statistically significant. Also test the residuals of the linear model and (if you did the previous exercise) the quadratic model.
Worked example: There are several ways to extend the EWMA model to generate predictions. One of the simplest is something like this (see the sketch after the list):

1. Compute the EWMA of the time series and use the last point as an intercept, `inter`.
2. Compute the EWMA of differences between successive elements in the time series and use the last point as a slope, `slope`.
3. To predict values at future times, compute `inter + slope * dt`, where `dt` is the difference between the time of the prediction and the time of the last observation.
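A sketch of those three steps for the high quality category, assuming `span=30` for both EWMAs:

```python
filled = FillMissing(dailies["high"])

# intercept: last point of the EWMA of the series
inter = filled.ppg.ewm(span=30).mean().iloc[-1]

# slope: last point of the EWMA of day-to-day differences
slope = filled.ppg.diff().ewm(span=30).mean().iloc[-1]

# predict up to a year past the last observation
dts = np.arange(1, 366)
predict = inter + slope * dts
```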
As an exercise, run this analysis again for the other quality categories.