Here are some of the functions from Chapter 5.
Read the GSS data again.
Most variables use special codes to indicate missing data. We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with NaN, which Pandas recognizes as a missing value.
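As a rough sketch of this step, here's how special codes can be replaced with NaN. The toy values and the codes 98 and 99 are assumptions for illustration; the codebook for each variable lists the actual codes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a survey column. The codes 98 and 99 are
# illustrative assumptions, not taken from the GSS codebook.
age = pd.Series([25, 47, 98, 33, 99, 61], name="age")

# Replace the special codes with NaN so they are excluded from statistics.
age_clean = age.replace([98, 99], np.nan)

print(age.mean())        # distorted by the special codes
print(age_clean.mean())  # NaN values are skipped by default
```

Comparing the two means shows how much a couple of unhandled codes can distort a summary statistic.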
Distribution of age
Here's the CDF of ages.
Exercise: Each of the following cells shows the distribution of ages under various transforms, compared to various models. In each text cell, add a sentence or two that interprets the result. What can we say about the distribution of ages based on each figure?
Here's the CDF of ages compared to a normal distribution with the same mean and standard deviation.
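A comparison like this can be computed along the following lines. This is a minimal sketch that uses synthetic values in place of the GSS ages; the variable names and the uniform stand-in data are assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(17)
# Synthetic stand-in for the ages; not the GSS data.
ages = rng.uniform(18, 89, size=1000)

# Empirical CDF: fraction of values <= each sorted value.
xs = np.sort(ages)
ps = np.arange(1, len(xs) + 1) / len(xs)

# Normal model with the same mean and standard deviation as the data.
mu, sigma = ages.mean(), ages.std()
model_ps = np.array(
    [0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) for x in xs]
)

# Largest vertical gap between the empirical CDF and the model.
max_gap = np.max(np.abs(ps - model_ps))
print(f"max |CDF - model| = {max_gap:.3f}")
```

Plotting ps and model_ps against xs gives the kind of figure described here; the largest gap is one crude summary of how far the model is from the data.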
Interpretation:
Here's a normal probability plot for the distribution of ages.
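One common way to build a normal probability plot is to pair the sorted data with quantiles of the standard normal distribution. The sketch below takes that approach with synthetic data; it is not necessarily how the plotting function in this notebook works, and the plotting positions (i - 0.5) / n are one convention among several.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(18)
sample = rng.normal(45, 17, size=500)  # synthetic values, not survey data

ys = np.sort(sample)
n = len(ys)

# Plotting positions (i - 0.5) / n avoid infinite quantiles at 0 and 1.
probs = (np.arange(1, n + 1) - 0.5) / n
xs = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the data are roughly normal, ys versus xs falls near a straight
# line whose slope estimates sigma and whose intercept estimates mu.
slope, intercept = np.polyfit(xs, ys, 1)
print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
```

Systematic curvature away from the fitted line, especially in the tails, is the signal to look for when interpreting the plot.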
Interpretation:
Here's the complementary CDF on a log-y scale.
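The reason for this transform is that an exponential distribution has a complementary CDF of exp(-x / scale), which is a straight line on a log-y scale. Here's a small sketch with synthetic exponential data (an assumption, chosen so the expected slope is known).

```python
import numpy as np

rng = np.random.default_rng(19)
values = rng.exponential(scale=20.0, size=2000)  # synthetic data

xs = np.sort(values)
n = len(xs)

# CCDF = 1 - CDF. Drop the last point, where the CCDF is 0
# and its logarithm is undefined.
ccdf = 1 - np.arange(1, n + 1) / n
xs, ccdf = xs[:-1], ccdf[:-1]

# For exponential data, log(CCDF) is roughly linear in x
# with slope -1 / scale.
slope, _ = np.polyfit(xs, np.log(ccdf), 1)
print(f"fitted slope = {slope:.3f}, expected about {-1 / 20.0:.3f}")
```

If the real data bend away from a straight line under this transform, an exponential model is probably not a good description.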
Interpretation:
Here's the CDF of ages on a log-x scale.
Interpretation:
Here's the CDF of the logarithm of ages, compared to a normal model.
Interpretation:
Here's a normal probability plot for the logarithm of ages.
Interpretation:
Here's the complementary CDF on a log-log scale.
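A log-log scale is the natural check for a Pareto model, whose complementary CDF is (x / xm) ** -alpha and so appears as a straight line with slope -alpha. The sketch below uses synthetic Pareto data; alpha and xm are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(20)
alpha, xm = 2.0, 1.0  # assumed parameters for the synthetic sample

# NumPy's pareto draws from the Lomax form; adding 1 and scaling
# by xm gives the classical Pareto distribution.
values = xm * (1 + rng.pareto(alpha, size=5000))

xs = np.sort(values)
n = len(xs)
ccdf = 1 - np.arange(1, n + 1) / n
xs, ccdf = xs[:-1], ccdf[:-1]  # drop the point where the CCDF is 0

# For Pareto data, log(CCDF) is roughly linear in log(x)
# with slope -alpha.
slope, _ = np.polyfit(np.log(xs), np.log(ccdf), 1)
print(f"fitted slope = {slope:.2f}, expected about {-alpha:.2f}")
```

A least-squares fit to the transformed CDF is only a rough diagnostic, not a recommended way to estimate alpha.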
Interpretation:
Here's a test to see whether ages are well-modeled by a Weibull distribution.
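One way to run such a test relies on the Weibull CDF, 1 - exp(-(x / lam) ** k): taking log(-log(CCDF)) gives k * log(x) - k * log(lam), a straight line in log(x) with slope k. The sketch below applies that transform to synthetic Weibull data; k and lam are assumed values, and this may differ from the test used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(21)
k, lam = 1.5, 40.0  # assumed shape and scale for the synthetic sample
values = lam * rng.weibull(k, size=3000)

xs = np.sort(values)
n = len(xs)
ccdf = 1 - np.arange(1, n + 1) / n
xs, ccdf = xs[:-1], ccdf[:-1]  # drop the point where the CCDF is 0

# Weibull transform: log(-log(CCDF)) versus log(x).
tx = np.log(xs)
ty = np.log(-np.log(ccdf))

# For Weibull data the transformed points fall near a line with slope k.
slope, _ = np.polyfit(tx, ty, 1)
print(f"fitted slope = {slope:.2f}, expected about {k:.2f}")
```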
Interpretation:
Distribution of income
Here's the CDF of realinc.
Exercise: Use visualizations like the ones in the previous exercise to see whether there is an analytic model that describes the distribution of gss.realinc well.
Here's a normal probability plot for the values.
Here's the complementary CDF on a log-y scale.
Here's the CDF on a log-x scale.
Here's the CDF of the logarithm of the values, compared to a normal model.
Here's a normal probability plot for the logarithm of the values.
Here's the complementary CDF on a log-log scale.
Here's a test to see whether the values are well-modeled by a Weibull distribution.
Interpretation:
BRFSS
Let's look at the distribution of height in the BRFSS dataset. Here's the CDF.
To see whether a normal model describes this data well, we can use kernel density estimation (KDE) to estimate the PDF.
Here's an example using the default bandwidth method.
It doesn't work very well; we can improve it by overriding the bandwidth with a constant.
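Here's a sketch of both steps using scipy.stats.gaussian_kde, which is an assumption about the tool involved. The heights are synthetic and rounded to whole centimeters to mimic coarsely reported data, and the constant 0.3 is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(22)
# Synthetic heights in cm, rounded to mimic coarsely reported values.
heights = np.round(rng.normal(170, 10, size=5000))

xs = np.linspace(130, 210, 161)

# Default bandwidth method (Scott's rule).
kde_default = gaussian_kde(heights)

# A scalar bw_method is used directly as the bandwidth factor,
# which multiplies the standard deviation of the data.
kde_fixed = gaussian_kde(heights, bw_method=0.3)

pdf_default = kde_default(xs)
pdf_fixed = kde_fixed(xs)

# Both estimates should integrate to about 1 over a wide enough range.
dx = xs[1] - xs[0]
print(kde_default.factor, kde_fixed.factor)
print(pdf_default.sum() * dx, pdf_fixed.sum() * dx)
```

A larger factor smooths over the spikes that rounding creates, at the cost of slightly widening the estimated distribution.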
Now we can generate a normal model with the same mean and standard deviation.
Here's the model compared to the estimated PDF.
The data don't fit the model particularly well, possibly because the distribution of heights is a mixture of two distributions, for men and women.
Exercise: Generate a similar figure for just women's heights and see if the normal model does any better.
Exercise: Generate a similar figure for men's weights, brfss.WTKG3. How well does the normal model fit?
Exercise: Try it one more time with the log of men's weights. How well does the normal model fit? What does that imply about the distribution of weight?
Skewness
Let's look at the skewness of the distribution of weights for men and women.
As we've seen, these distributions are skewed to the right, so we expect the mean to be higher than the median.
We can compute the moment-based sample skewness using Pandas or thinkstats2. The results are almost the same.
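But moment-based sample skewness is a terrible statistic! Because it depends on cubed deviations from the mean, a few extreme values can dominate it. A more robust alternative is Pearson's median skewness, which is three times the difference between the mean and the median, divided by the standard deviation.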
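To make the contrast concrete, here's a minimal sketch of both statistics written with NumPy. The function names and the synthetic lognormal weights are assumptions for illustration; they are not the thinkstats2 implementations.

```python
import numpy as np

def sample_skewness(xs):
    """Moment-based sample skewness, m3 / m2**1.5."""
    xs = np.asarray(xs, dtype=float)
    dev = xs - xs.mean()
    m2 = np.mean(dev**2)
    m3 = np.mean(dev**3)
    return m3 / m2**1.5

def pearson_median_skewness(xs):
    """Pearson's median skewness, 3 * (mean - median) / std."""
    xs = np.asarray(xs, dtype=float)
    return 3 * (xs.mean() - np.median(xs)) / xs.std()

rng = np.random.default_rng(23)
# Synthetic right-skewed values standing in for weights in kg.
weights = rng.lognormal(mean=4.4, sigma=0.2, size=5000)

# Add one extreme value to see how each statistic responds.
with_outlier = np.append(weights, 1000.0)

for label, data in [("original", weights), ("with outlier", with_outlier)]:
    print(label, sample_skewness(data), pearson_median_skewness(data))
```

In this setup a single extreme value should move the moment-based statistic much more than Pearson's median skewness, which is the sense in which the latter is more robust.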
Exercise: Compute the same statistics for women. Which distribution is more skewed?
Exercise: Explore the GSS or BRFSS dataset and find something interesting!