Examples and Exercises from Think Stats, 2nd Edition
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
Multiple regression
Let's load up the NSFG data again.
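Something like the following does the job, assuming the `first` module from the book's code repository is importable:

```python
import first

# Load the NSFG pregnancy data, split into live births,
# first babies, and others (as in previous chapters).
live, firsts, others = first.MakeFrames()

# Drop pregnancies shorter than 31 weeks.
live = live[live.prglngth > 30]
```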
Here's birth weight as a function of mother's age (which we saw in the previous chapter).
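A minimal sketch of the model, using the StatsModels formula API:

```python
import statsmodels.formula.api as smf

# Simple regression: birth weight as a function of mother's age.
formula = 'totalwgt_lb ~ agepreg'
model = smf.ols(formula, data=live)
results = model.fit()
results.summary()
```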
Dep. Variable: | totalwgt_lb | R-squared: | 0.005 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.005 |
Method: | Least Squares | F-statistic: | 43.02 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 5.72e-11 |
Time: | 16:34:15 | Log-Likelihood: | -15897. |
No. Observations: | 9038 | AIC: | 3.180e+04 |
Df Residuals: | 9036 | BIC: | 3.181e+04 |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 6.8304 | 0.068 | 100.470 | 0.000 | 6.697 | 6.964 |
agepreg | 0.0175 | 0.003 | 6.559 | 0.000 | 0.012 | 0.023 |
Omnibus: | 1024.052 | Durbin-Watson: | 1.618 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3081.833 |
Skew: | -0.601 | Prob(JB): | 0.00 |
Kurtosis: | 5.596 | Cond. No. | 118. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can extract the parameters.
And the p-value of the slope estimate.
And the coefficient of determination.
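All three come from attributes of the fitted `results` object:

```python
# Estimated intercept and slope.
inter = results.params['Intercept']
slope = results.params['agepreg']

# p-value of the slope estimate.
slope_pvalue = results.pvalues['agepreg']

# Coefficient of determination (R^2).
r2 = results.rsquared
```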
The difference in birth weight between first babies and others.
The difference in age between mothers of first babies and others.
The age difference plausibly explains about half of the difference in weight.
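A quick check of that claim, using `firsts` and `others` from the data loaded above and `slope` from the regression:

```python
# Mean differences between first babies and others.
diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
diff_age = firsts.agepreg.mean() - others.agepreg.mean()

# Weight difference predicted by the age difference alone,
# as a fraction of the observed weight difference.
slope * diff_age / diff_weight
```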
Running a simple regression with a categorical variable, `isfirst`:
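In a Patsy formula, a boolean variable is treated as categorical automatically:

```python
# Flag first babies and regress birth weight on that flag alone.
live['isfirst'] = live.birthord == 1
formula = 'totalwgt_lb ~ isfirst'
results = smf.ols(formula, data=live).fit()
results.summary()
```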
Dep. Variable: | totalwgt_lb | R-squared: | 0.002 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.002 |
Method: | Least Squares | F-statistic: | 17.74 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 2.55e-05 |
Time: | 16:34:16 | Log-Likelihood: | -15909. |
No. Observations: | 9038 | AIC: | 3.182e+04 |
Df Residuals: | 9036 | BIC: | 3.184e+04 |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 7.3259 | 0.021 | 356.007 | 0.000 | 7.286 | 7.366 |
isfirst[T.True] | -0.1248 | 0.030 | -4.212 | 0.000 | -0.183 | -0.067 |
Omnibus: | 988.919 | Durbin-Watson: | 1.613 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2897.107 |
Skew: | -0.589 | Prob(JB): | 0.00 |
Kurtosis: | 5.511 | Cond. No. | 2.58 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now finally running a multiple regression:
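Adding `agepreg` to the formula gives a model with both explanatory variables:

```python
# Multiple regression: isfirst and mother's age together.
formula = 'totalwgt_lb ~ isfirst + agepreg'
results = smf.ols(formula, data=live).fit()
results.summary()
```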
Dep. Variable: | totalwgt_lb | R-squared: | 0.005 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.005 |
Method: | Least Squares | F-statistic: | 24.02 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 3.95e-11 |
Time: | 16:34:16 | Log-Likelihood: | -15894. |
No. Observations: | 9038 | AIC: | 3.179e+04 |
Df Residuals: | 9035 | BIC: | 3.182e+04 |
Df Model: | 2 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 6.9142 | 0.078 | 89.073 | 0.000 | 6.762 | 7.066 |
isfirst[T.True] | -0.0698 | 0.031 | -2.236 | 0.025 | -0.131 | -0.009 |
agepreg | 0.0154 | 0.003 | 5.499 | 0.000 | 0.010 | 0.021 |
Omnibus: | 1019.945 | Durbin-Watson: | 1.618 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3063.682 |
Skew: | -0.599 | Prob(JB): | 0.00 |
Kurtosis: | 5.588 | Cond. No. | 137. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As expected, when we control for mother's age, the apparent difference due to `isfirst` is cut in half.
If we add age squared, we can control for a quadratic relationship between age and weight.
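We add the squared term as a new column and include it in the formula:

```python
# Quadratic model of age.
live['agepreg2'] = live.agepreg ** 2
formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
results = smf.ols(formula, data=live).fit()
results.summary()
```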
Dep. Variable: | totalwgt_lb | R-squared: | 0.007 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.007 |
Method: | Least Squares | F-statistic: | 22.64 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 1.35e-14 |
Time: | 16:34:16 | Log-Likelihood: | -15884. |
No. Observations: | 9038 | AIC: | 3.178e+04 |
Df Residuals: | 9034 | BIC: | 3.181e+04 |
Df Model: | 3 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 5.6923 | 0.286 | 19.937 | 0.000 | 5.133 | 6.252 |
isfirst[T.True] | -0.0504 | 0.031 | -1.602 | 0.109 | -0.112 | 0.011 |
agepreg | 0.1124 | 0.022 | 5.113 | 0.000 | 0.069 | 0.155 |
agepreg2 | -0.0018 | 0.000 | -4.447 | 0.000 | -0.003 | -0.001 |
Omnibus: | 1007.149 | Durbin-Watson: | 1.616 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3003.343 |
Skew: | -0.594 | Prob(JB): | 0.00 |
Kurtosis: | 5.562 | Cond. No. | 1.39e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
When we do that, the apparent effect of `isfirst` gets even smaller, and is no longer statistically significant.
These results suggest that the apparent difference in weight between first babies and others might be explained, at least in part, by the difference in mothers' ages.
Data Mining
We can use `join` to combine variables from the pregnancy and respondent tables.
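A sketch of the join, assuming the book's `nsfg` module for reading the respondent file; both tables share `caseid`:

```python
import nsfg

# Read the respondent file and index it by caseid so we can
# join it to the pregnancy records.
resp = nsfg.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')
```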
And we can search for variables with explanatory power.
Because we don't clean most of the variables, we are probably missing some good ones.
The following functions report the variables with the highest values of $R^2$.
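A minimal sketch of the mining loop: for each candidate variable, fit a model that includes it along with `agepreg`, and record $R^2$; the screening conditions here are illustrative.

```python
import patsy

def go_mining(df):
    """Fit a model for each candidate variable; return (R^2, name) pairs."""
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue  # skip (near-)constant columns
            formula = 'totalwgt_lb ~ agepreg + ' + name
            model = smf.ols(formula, data=df)
            if model.nobs < len(df) / 2:
                continue  # skip variables with too many missing values
            results = model.fit()
        except (ValueError, TypeError, patsy.PatsyError):
            continue  # skip variables the formula machinery can't handle
        variables.append((results.rsquared, name))
    return variables

# Report the variables with the highest R^2.
for r2, name in sorted(go_mining(join), reverse=True)[:30]:
    print(name, r2)
```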
Some of the variables that do well are not useful for prediction because they are not known ahead of time.
Combining the variables that seem to have the most explanatory power.
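Here is the combined model; comparisons like `babysex == 1` turn raw codes into booleans inside the formula:

```python
formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '
           'nbrnaliv>1 + paydu==1 + totincr')
results = smf.ols(formula, data=join).fit()
results.summary()
```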
Dep. Variable: | totalwgt_lb | R-squared: | 0.060 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.059 |
Method: | Least Squares | F-statistic: | 79.98 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 4.86e-113 |
Time: | 16:35:03 | Log-Likelihood: | -14295. |
No. Observations: | 8781 | AIC: | 2.861e+04 |
Df Residuals: | 8773 | BIC: | 2.866e+04 |
Df Model: | 7 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 6.6303 | 0.065 | 102.223 | 0.000 | 6.503 | 6.757 |
C(race)[T.2] | 0.3570 | 0.032 | 11.215 | 0.000 | 0.295 | 0.419 |
C(race)[T.3] | 0.2665 | 0.051 | 5.175 | 0.000 | 0.166 | 0.367 |
babysex == 1[T.True] | 0.2952 | 0.026 | 11.216 | 0.000 | 0.244 | 0.347 |
nbrnaliv > 1[T.True] | -1.3783 | 0.108 | -12.771 | 0.000 | -1.590 | -1.167 |
paydu == 1[T.True] | 0.1196 | 0.031 | 3.861 | 0.000 | 0.059 | 0.180 |
agepreg | 0.0074 | 0.003 | 2.921 | 0.004 | 0.002 | 0.012 |
totincr | 0.0122 | 0.004 | 3.110 | 0.002 | 0.005 | 0.020 |
Omnibus: | 398.813 | Durbin-Watson: | 1.604 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1388.362 |
Skew: | -0.037 | Prob(JB): | 3.32e-302 |
Kurtosis: | 4.947 | Cond. No. | 221. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Logistic regression
Example: suppose we are trying to predict `y` using explanatory variables `x1` and `x2`.
According to the logit model, the log odds for the $i$th element of $y$ is

$$\log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
So let's start with an arbitrary guess about the elements of $\beta$:
Plugging in the model, we get log odds.
Which we can convert to odds.
And then convert to probabilities.
The likelihoods of the actual outcomes are $p$ where $y$ is 1 and $1-p$ where $y$ is 0.
The likelihood of $y$ given $\beta$ is the product of `likes`:
Logistic regression works by searching for the values in $\beta$ that maximize `like`.
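Here are all of the steps above as one small worked example; the data and the guess for $\beta$ are arbitrary illustrations:

```python
import numpy as np

# Toy data: 4 outcomes and 2 explanatory variables.
y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])

# An arbitrary guess for the elements of beta.
beta = [-1.5, 2.8, 1.1]

# Plug in the model to get log odds ...
log_o = beta[0] + beta[1] * x1 + beta[2] * x2

# ... convert to odds, then to probabilities.
o = np.exp(log_o)
p = o / (o + 1)

# Likelihood of each actual outcome: p where y is 1, 1-p where y is 0.
likes = np.where(y, p, 1 - p)

# Likelihood of y given beta is the product.
like = np.prod(likes)
```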
Here's an example using variables in the NSFG respondent file to predict whether a baby will be a boy or a girl.
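First we recode the sex of the baby as a boolean, then fit a logit model with `agepreg` as the only explanatory variable:

```python
# In the NSFG coding of babysex, 1 means boy.
live['boy'] = (live.babysex == 1).astype(int)
model = smf.logit('boy ~ agepreg', data=live)
results = model.fit()
results.summary()
```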
The mother's age seems to have a small effect.
Dep. Variable: | boy | No. Observations: | 8884 |
---|---|---|---|
Model: | Logit | Df Residuals: | 8882 |
Method: | MLE | Df Model: | 1 |
Date: | Mon, 25 Feb 2019 | Pseudo R-squ.: | 6.144e-06 |
Time: | 16:35:05 | Log-Likelihood: | -6156.7 |
converged: | True | LL-Null: | -6156.8 |
| | | LLR p-value: | 0.7833 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 0.0058 | 0.098 | 0.059 | 0.953 | -0.185 | 0.197 |
agepreg | 0.0010 | 0.004 | 0.275 | 0.783 | -0.006 | 0.009 |
Here are the variables that seemed most promising.
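The model with those variables, matching the summary below (`hpagelb` is the husband or partner's age at the baby's birth, per the NSFG codebook):

```python
formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'
model = smf.logit(formula, data=live)
results = model.fit()
results.summary()
```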
Dep. Variable: | boy | No. Observations: | 8782 |
---|---|---|---|
Model: | Logit | Df Residuals: | 8776 |
Method: | MLE | Df Model: | 5 |
Date: | Mon, 25 Feb 2019 | Pseudo R-squ.: | 0.0001440 |
Time: | 16:35:05 | Log-Likelihood: | -6085.4 |
converged: | True | LL-Null: | -6086.3 |
| | | LLR p-value: | 0.8822 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | -0.0301 | 0.104 | -0.290 | 0.772 | -0.234 | 0.173 |
C(race)[T.2] | -0.0224 | 0.051 | -0.439 | 0.660 | -0.122 | 0.077 |
C(race)[T.3] | -0.0005 | 0.083 | -0.005 | 0.996 | -0.163 | 0.162 |
agepreg | -0.0027 | 0.006 | -0.484 | 0.629 | -0.014 | 0.008 |
hpagelb | 0.0047 | 0.004 | 1.112 | 0.266 | -0.004 | 0.013 |
birthord | 0.0050 | 0.022 | 0.227 | 0.821 | -0.038 | 0.048 |
To make a prediction, we have to extract the exogenous and endogenous variables.
The baseline prediction strategy is to guess "boy". In that case, we're right almost 51% of the time.
If we use the previous model, we can compute the number of predictions we get right.
And the accuracy, which is slightly higher than the baseline.
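A sketch of that bookkeeping, pulling the model's variables back out as DataFrames:

```python
import pandas as pd

# Endogenous (dependent) and exogenous (explanatory) variables.
endog = pd.DataFrame(model.endog, columns=[model.endog_names])
exog = pd.DataFrame(model.exog, columns=model.exog_names)

# Baseline: always guess "boy"; accuracy is the fraction of boys.
actual = endog['boy']
baseline = actual.mean()

# Model predictions: probability >= 0.5 means we guess "boy".
predict = (results.predict() >= 0.5)
true_pos = predict * actual
true_neg = (1 - predict) * (1 - actual)

# Accuracy of the model's predictions.
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
```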
To make a prediction for an individual, we have to get their information into a `DataFrame`.
This person has a 51% chance of having a boy (according to the model).
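For example, for a hypothetical 35-year-old mother whose husband is 39, expecting her third baby, with race code 2:

```python
columns = ['agepreg', 'hpagelb', 'birthord', 'race']
new = pd.DataFrame([[35, 39, 3, 2]], columns=columns)
results.predict(new)
```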
Exercises
Exercise: Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.
The following are the only variables I found that have a statistically significant effect on pregnancy length.
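The model, using boolean comparisons in the formula:

```python
formula = 'prglngth ~ birthord==1 + race==2 + nbrnaliv>1'
results = smf.ols(formula, data=join).fit()
results.summary()
```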
Dep. Variable: | prglngth | R-squared: | 0.011 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.011 |
Method: | Least Squares | F-statistic: | 34.28 |
Date: | Mon, 25 Feb 2019 | Prob (F-statistic): | 5.09e-22 |
Time: | 16:35:08 | Log-Likelihood: | -18247. |
No. Observations: | 8884 | AIC: | 3.650e+04 |
Df Residuals: | 8880 | BIC: | 3.653e+04 |
Df Model: | 3 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 38.7617 | 0.039 | 1006.410 | 0.000 | 38.686 | 38.837 |
birthord == 1[T.True] | 0.1015 | 0.040 | 2.528 | 0.011 | 0.023 | 0.180 |
race == 2[T.True] | 0.1390 | 0.042 | 3.311 | 0.001 | 0.057 | 0.221 |
nbrnaliv > 1[T.True] | -1.4944 | 0.164 | -9.086 | 0.000 | -1.817 | -1.172 |
Omnibus: | 1587.470 | Durbin-Watson: | 1.619 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6160.751 |
Skew: | -0.852 | Prob(JB): | 0.00 |
Kurtosis: | 6.707 | Cond. No. | 10.9 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise: The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis
Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.
As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?
Exercise: If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let's use it to predict how many children a woman has born; in the NSFG dataset, this variable is called `numbabes`.
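A sketch of the model; the respondent-file variable names (`ager`, `totincr`, `educat`) and the treatment of the "did not answer" code 97 follow the NSFG codebook:

```python
import numpy as np

# Code 97 means "did not answer"; treat it as missing.
join.numbabes.replace([97], np.nan, inplace=True)

formula = 'numbabes ~ ager + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary()
```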
Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has born?
Now we can predict the number of children for a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000.
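Using the NSFG codes as I understand them (race 1 is black, totincr 14 is $75,000 or more, educat 16 is four years of college):

```python
columns = ['ager', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 1, 14, 16]], columns=columns)
results.predict(new)
```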
Exercise: If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let's use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.
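A sketch, reusing the same explanatory variables as the Poisson model:

```python
formula = 'rmarital ~ ager + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary()
```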
Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc?
Make a prediction for a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000.
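Again using assumed NSFG codes (race 2 is white, totincr 11 is roughly $45,000, educat 12 is a high school graduate); `predict` returns one probability per category of `rmarital`:

```python
columns = ['ager', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 2, 11, 12]], columns=columns)
results.predict(new)
```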