Think Stats by Allen B. Downey
Think Stats is an introduction to probability and statistics for Python programmers.
This is the accompanying code for this book.
License: GPL3
Examples and Exercises from Think Stats, 2nd Edition
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
Least squares
One more time, let's load up the NSFG data.
The following function computes the intercept and slope of the least squares fit.
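As a minimal sketch of what such a function might look like, here is a NumPy version. The name `least_squares` and the synthetic data are assumptions for illustration; this is not necessarily the implementation in the book's library.

```python
import numpy as np

def least_squares(xs, ys):
    """Return (inter, slope) of the least squares line ys ~ inter + slope * xs."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    mean_x, mean_y = xs.mean(), ys.mean()
    # slope = Cov(x, y) / Var(x); the 1/n factors cancel
    dx = xs - mean_x
    slope = np.dot(dx, ys - mean_y) / np.dot(dx, dx)
    # the fitted line passes through the point of means
    inter = mean_y - slope * mean_x
    return inter, slope

# Synthetic, noise-free data (not NSFG): the fit should recover the line.
xs = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
ys = 6.8 + 0.02 * xs
inter, slope = least_squares(xs, ys)
```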
Here's the least squares fit to birth weight as a function of mother's age.
The intercept is often easier to interpret if we evaluate it at the mean of the independent variable.
And the slope is easier to interpret if we express it in pounds per decade (or ounces per year).
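The two conversions above are simple arithmetic. In the sketch below, the values of `inter`, `slope`, and `mean_age` are placeholders chosen for illustration, not the NSFG estimates.

```python
# Placeholder values for illustration only; they are not the NSFG estimates.
inter = 6.8        # intercept in pounds (the fitted value at age 0)
slope = 0.02       # slope in pounds per year of mother's age
mean_age = 25.0    # mean of the independent variable, in years

# Fitted birth weight at the mean age, easier to interpret than the value at age 0.
weight_at_mean_age = inter + slope * mean_age

# The same slope in more readable units.
slope_lb_per_decade = slope * 10    # 10 years per decade
slope_oz_per_year = slope * 16      # 16 ounces per pound
```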
The following function evaluates the fitted line at the given xs.
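A sketch of such a function, assuming the fit is described by an intercept and a slope. The name `fit_line` and the choice to sort the xs (convenient for plotting) are assumptions.

```python
import numpy as np

def fit_line(xs, inter, slope):
    """Return the sorted xs and the fitted values inter + slope * xs."""
    fit_xs = np.sort(np.asarray(xs, dtype=float))
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

# Made-up inputs for illustration.
fit_xs, fit_ys = fit_line([30, 20, 40], inter=1.0, slope=0.5)
```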
And here's an example.
Here's a scatterplot of the data with the fitted line.
Residuals
The following function computes the residuals.
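A residual is the difference between an observed value and the fitted value at the same x. A minimal sketch, with the hypothetical name `residuals` and made-up inputs:

```python
import numpy as np

def residuals(xs, ys, inter, slope):
    """Return ys minus the fitted line evaluated at xs."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return ys - (inter + slope * xs)

# Made-up data: only the middle point lies off the line y = 1 + 2x.
res = residuals([0, 1, 2], [1.0, 3.5, 5.0], inter=1.0, slope=2.0)
```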
Now we can add the residuals as a column in the DataFrame.
To visualize the residuals, I'll split the respondents into groups by age, then plot the percentiles of the residuals versus the average age in each group.
First I'll make the groups and compute the average age in each group.
Next I'll compute the CDF of the residuals in each group.
The following function plots percentiles of the residuals against the average age in each group.
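The grouping and percentile steps can be sketched without a plotting library. The function name, the bin edges, and the synthetic data below are assumptions. The code computes the numbers that would be plotted and leaves the plotting itself out.

```python
import numpy as np

def residual_percentiles_by_group(ages, res, bin_edges, percents=(25, 50, 75)):
    """Group by age bin; return the mean age and residual percentiles per group."""
    ages = np.asarray(ages, dtype=float)
    res = np.asarray(res, dtype=float)
    indices = np.digitize(ages, bin_edges)   # bin index for each respondent
    mean_ages, rows = [], []
    for i in range(1, len(bin_edges)):       # skip values outside the edges
        mask = indices == i
        if not mask.any():
            continue
        mean_ages.append(ages[mask].mean())
        rows.append(np.percentile(res[mask], percents))
    return np.array(mean_ages), np.array(rows)

# Synthetic ages and residuals (not NSFG data).
rng = np.random.default_rng(0)
ages = rng.uniform(15, 45, size=1000)
res = rng.normal(0, 1, size=1000)
mean_ages, rows = residual_percentiles_by_group(ages, res, np.arange(15, 50, 5))
```

Each column of `rows` holds one percentile, so plotting a column against `mean_ages` gives one line of the figure.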
The following figure shows the 25th, 50th, and 75th percentiles.
Curvature in the residuals suggests a non-linear relationship.
Sampling distribution
To estimate the sampling distribution of inter and slope, I'll use resampling.
The following function resamples the given DataFrame and returns lists of estimates for inter and slope.
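Here is a sketch of the idea that works on arrays rather than a DataFrame. The name `sampling_distributions`, the use of `np.polyfit` for the fit, and the synthetic data are assumptions.

```python
import numpy as np

def sampling_distributions(xs, ys, iters=101, seed=None):
    """Resample (x, y) pairs with replacement and refit the line each time."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = len(xs)
    inters, slopes = [], []
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)       # row indices, with replacement
        slope, inter = np.polyfit(xs[idx], ys[idx], 1)
        inters.append(inter)
        slopes.append(slope)
    return inters, slopes

# Synthetic data (not NSFG): true intercept 7.0 and true slope 0.02.
rng = np.random.default_rng(0)
xs = rng.uniform(15, 45, size=500)
ys = 7.0 + 0.02 * xs + rng.normal(0, 1.0, size=500)
inters, slopes = sampling_distributions(xs, ys, iters=101, seed=1)
```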
Here's an example.
The following function takes a list of estimates and prints the mean, standard error, and 90% confidence interval.
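A sketch of such a summary, assuming the standard error is the standard deviation of the estimates and the 90% confidence interval runs from the 5th to the 95th percentile. The name `summarize` and the optional `actual` argument are assumptions.

```python
import numpy as np

def summarize(estimates, actual=None):
    """Print and return the mean, standard error, and 90% CI of the estimates."""
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean()
    # Measure spread around the actual value if one is known, else the mean.
    center = mean if actual is None else actual
    stderr = np.sqrt(np.mean((estimates - center) ** 2))
    ci = tuple(np.percentile(estimates, [5, 95]))
    print('mean, SE, CI', mean, stderr, ci)
    return mean, stderr, ci

# Made-up estimates: the integers 0 through 100.
mean, stderr, ci = summarize(np.arange(101))
```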
Here's the summary for inter.
And for slope.
Exercise: Use ResampleRows to generate a list of estimates for the mean birth weight. Use Summarize to compute the SE and CI for these estimates.
Visualizing uncertainty
To show the uncertainty of the estimated slope and intercept, we can generate a fitted line for each resampled estimate and plot them on top of each other.
Or we can make a neater (and more efficient) plot by computing fitted lines and finding percentiles of the fits for each value of the independent variable.
This example shows the confidence interval for the fitted values at each mother's age.
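The percentile approach can be sketched as follows: evaluate every resampled line at every x, then take percentiles down each column. The name `fit_band` and the three made-up lines are assumptions.

```python
import numpy as np

def fit_band(xs, inters, slopes, percent=90):
    """Return sorted xs and the lower and upper percentiles of the fitted values."""
    xs = np.sort(np.asarray(xs, dtype=float))
    inters = np.asarray(inters, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    # One row per resampled line, one column per x value.
    fits = inters[:, None] + slopes[:, None] * xs[None, :]
    tail = (100 - percent) / 2
    low, high = np.percentile(fits, [tail, 100 - tail], axis=0)
    return xs, low, high

# Made-up estimates standing in for the resampled inter and slope lists.
band_xs, low, high = fit_band([20, 30, 40],
                              inters=[6.7, 6.8, 6.9],
                              slopes=[0.01, 0.02, 0.03])
```

Filling the region between `low` and `high` draws the band in one call, instead of drawing one line per resampled estimate.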
Coefficient of determination
The coefficient of determination, $R^2$, compares the variance of the residuals to the variance of the dependent variable.
For birth weight and mother's age, $R^2$ is very small, indicating that the mother's age predicts a small part of the variance in birth weight.
We can confirm that $R^2 = \rho^2$:
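For a least squares line with an intercept, the coefficient of determination equals the square of Pearson's correlation. The sketch below checks this on synthetic data. The data and variable names are assumptions, and `np.polyfit` stands in for the fitting function.

```python
import numpy as np

# Synthetic data (not NSFG).
rng = np.random.default_rng(0)
xs = rng.uniform(15, 45, size=200)
ys = 7.0 + 0.02 * xs + rng.normal(0, 1.0, size=200)

# Least squares fit and its residuals.
slope, inter = np.polyfit(xs, ys, 1)
res = ys - (inter + slope * xs)

# Coefficient of determination: 1 - Var(res) / Var(ys).
r2 = 1 - np.var(res) / np.var(ys)

# Pearson's correlation between xs and ys.
rho = np.corrcoef(xs, ys)[0, 1]
```

Here `r2` and `rho**2` agree up to floating-point error.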
To express predictive power, I think it's useful to compare the standard deviation of the residuals to the standard deviation of the dependent variable. These are the RMSEs you get if you guess birth weight with and without taking mother's age into account.
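A sketch of this comparison on synthetic data. The numbers are made up, so the size of the reduction is illustrative only.

```python
import numpy as np

# Synthetic data (not NSFG).
rng = np.random.default_rng(0)
xs = rng.uniform(15, 45, size=200)
ys = 7.0 + 0.02 * xs + rng.normal(0, 1.0, size=200)

slope, inter = np.polyfit(xs, ys, 1)
res = ys - (inter + slope * xs)

# RMSE if we always guess the mean of ys (no information about xs).
rmse_without = np.std(ys)

# RMSE if we guess using the fitted line.
rmse_with = np.std(res)

# Fractional reduction in RMSE from knowing xs.
reduction = 1 - rmse_with / rmse_without
```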
As another example of the same idea, here's how much we can improve guesses about IQ if we know someone's SAT scores.
Hypothesis testing with slopes
Here's a HypothesisTest that uses permutation to test whether the observed slope is statistically significant.
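The permutation idea can be sketched as a standalone function rather than a HypothesisTest subclass. Shuffling the ys breaks any relationship with the xs, which simulates the null hypothesis. The name `slope_permutation_test`, the one-sided p-value, and the synthetic data are assumptions.

```python
import numpy as np

def slope_permutation_test(xs, ys, iters=1000, seed=None):
    """Return the observed slope, the null slopes, and a one-sided p-value."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    observed = np.polyfit(xs, ys, 1)[0]
    # Refit after shuffling ys, which simulates no relationship with xs.
    null_slopes = np.array([np.polyfit(xs, rng.permutation(ys), 1)[0]
                            for _ in range(iters)])
    # Fraction of simulated slopes at least as large as the observed one.
    p_value = np.mean(null_slopes >= observed)
    return observed, null_slopes, p_value

# Synthetic data (not NSFG) with a clear positive slope.
rng = np.random.default_rng(0)
xs = rng.uniform(15, 45, size=300)
ys = 7.0 + 0.05 * xs + rng.normal(0, 1.0, size=300)
observed, null_slopes, p_value = slope_permutation_test(xs, ys, seed=1)
```

Comparing `null_slopes.max()` with `observed` shows how far the observed slope lies outside the null distribution.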
And it is.
Under the null hypothesis, the largest slope we observe after 1000 tries is substantially less than the observed value.
We can also use resampling to estimate the sampling distribution of the slope.
The distribution of slopes under the null hypothesis, and the sampling distribution of the slope under resampling, have the same shape, but one has mean at 0 and the other has mean at the observed slope.
To compute a p-value, we can count how often the estimated slope under the null hypothesis exceeds the observed slope, or how often the estimated slope under resampling falls below 0.
Here's how to get a p-value from the sampling distribution.
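In terms of a list of resampled slopes, this is the fraction of estimates at or below 0. A minimal sketch, with a hypothetical function name and made-up slopes:

```python
import numpy as np

def p_value_from_sampling(slopes):
    """Return the fraction of resampled slopes that are at or below zero."""
    slopes = np.asarray(slopes, dtype=float)
    return np.mean(slopes <= 0)

# Made-up slopes: one of the four is at or below zero.
p = p_value_from_sampling([0.02, 0.01, -0.005, 0.03])
```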
Resampling with weights
Resampling provides a convenient way to take into account the sampling weights associated with respondents in a stratified survey design.
The following function resamples rows with probabilities proportional to weights.
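A sketch of weighted resampling on arrays rather than DataFrame rows. The name `resample_weighted` is an assumption, and the example puts all the weight on one value so the result is easy to check.

```python
import numpy as np

def resample_weighted(values, weights, seed=None):
    """Draw len(values) values with replacement, with probability proportional to weight."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()          # normalize weights to probabilities
    idx = rng.choice(len(values), size=len(values), replace=True, p=probs)
    return values[idx]

# Made-up values; all the weight is on the last one.
values = np.array([6.0, 7.0, 8.0])
weights = np.array([0.0, 0.0, 5.0])
sample = resample_weighted(values, weights, seed=1)
```

Taking the mean of each weighted resample, repeated many times, gives a list of estimates to summarize like any other.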
We can use it to estimate the mean birth weight and compute SE and CI.
And here's what the same calculation looks like if we ignore the weights.
The difference is non-negligible, which suggests that there are differences in birth weight between the strata in the survey.
Exercises
Exercise: Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?
Read the BRFSS data and extract heights and log weights.
Estimate intercept and slope.
Make a scatter plot of the data and show the fitted line.
Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.
Plot percentiles of the residuals.
Compute correlation.
Compute coefficient of determination.
Confirm that $R^2 = \rho^2$.
Compute Std(ys), which is the RMSE of predictions that don't use height.
Compute Std(res), the RMSE of predictions that do use height.
How much does height information reduce RMSE?
Use resampling to compute sampling distributions for inter and slope.
Plot the sampling distribution of slope.
Compute the p-value of the slope.
Compute the 90% confidence interval of slope.
Compute the mean of the sampling distribution.
Compute the standard deviation of the sampling distribution, which is the standard error.
Resample rows without weights, compute mean height, and summarize results.
Resample rows with weights. Note that the weight column in this dataset is called finalwt, not totalwt as stated in the exercise above.