Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In
| Download

📚 The CoCalc Library - books, templates and other resources

Views: 96160
License: OTHER
Kernel: Python 3

Homework 2

Visualize, describe, and model distributions

Allen Downey

MIT License

%matplotlib inline import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns sns.set(style='white') from utils import decorate from thinkstats2 import Pmf, Cdf import thinkstats2 import thinkplot

Here are some of the functions from Chapter 5.

def MakeNormalModel(values, label=''): """Plots a CDF with a Normal model. values: sequence """ cdf = thinkstats2.Cdf(values, label=label) mean, var = thinkstats2.TrimmedMeanVar(values) std = np.sqrt(var) print('n, mean, std', len(values), mean, std) xmin = mean - 4 * std xmax = mean + 4 * std xs, ps = thinkstats2.RenderNormalCdf(mean, std, xmin, xmax) thinkplot.Plot(xs, ps, label='model', linewidth=4, color='0.8') thinkplot.Cdf(cdf)
def MakeNormalPlot(values, label=''): """Generates a normal probability plot. values: sequence """ mean, var = thinkstats2.TrimmedMeanVar(values, p=0.01) std = np.sqrt(var) xs = [-5, 5] xs, ys = thinkstats2.FitLine(xs, mean, std) thinkplot.Plot(xs, ys, color='0.8', label='model') xs, ys = thinkstats2.NormalProbability(values) thinkplot.Plot(xs, ys, '+', alpha=0.3, label=label)

Read the GSS data again.

%time gss = pd.read_hdf('gss.hdf5', 'gss') gss.shape
gss.head()

Most variables use special codes to indicate missing data. We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with NaN, which Pandas recognizes as a missing value.

def replace_invalid(df): df.realinc.replace([0], np.nan, inplace=True) df.educ.replace([98,99], np.nan, inplace=True) # 89 means 89 or older df.age.replace([98, 99], np.nan, inplace=True) df.cohort.replace([9999], np.nan, inplace=True) df.adults.replace([9], np.nan, inplace=True) replace_invalid(gss)

Distribution of age

Here's the CDF of ages.

cdf_age = Cdf(gss.age) thinkplot.cdf(cdf_age, label='age') decorate(title='Distribution of age', xlabel='Age (years)', ylabel='CDF')

Exercise: Each of the following cells shows the distribution of ages under various transforms, compared to various models. In each text cell, add a sentence or two that interprets the result. What can we say about the distribution of ages based on each figure?

  1. Here's the CDF of ages compared to a normal distribution with the same mean and standard deviation.

Interpretation:

MakeNormalModel(gss.age.dropna(), label='') decorate(title='Distribution of age', xlabel='Age (years)', ylabel='CDF')
  1. Here's a normal probability plot for the distribution of ages.

Interpretation:

MakeNormalPlot(gss.age.dropna(), label='') decorate(title='Normal probability plot', xlabel='Standard normal sample', ylabel='Age (years)')
  1. Here's the complementary CDF on a log-y scale.

Interpretation:

thinkplot.cdf(cdf_age, label='age', complement=True) decorate(title='Distribution of age', xlabel='Age (years)', ylabel='Complementary CDF, log scale', yscale='log')
  1. Here's the CDF of ages on a log-x scale.

Interpretation:

thinkplot.cdf(cdf_age, label='age') decorate(title='Distribution of age', xlabel='Age (years)', ylabel='CDF', xscale='log')
  1. Here's the CDF of the logarithm of ages, compared to a normal model.

Interpretation:

values = np.log10(gss.age.dropna()) MakeNormalModel(values, label='') decorate(title='Distribution of age', xlabel='Age (log10 years)', ylabel='CDF')
  1. Here's a normal probability plot for the logarithm of ages.

Interpretation:

MakeNormalPlot(values, label='') decorate(title='Distribution of age', xlabel='Standard normal sample', ylabel='Age (log10 years)')
  1. Here's the complementary CDF on a log-log scale.

Interpretation:

thinkplot.cdf(cdf_age, label='age', complement=True) decorate(title='Distribution of age', xlabel='Age (years)', ylabel='Complementary CDF, log scale', xscale='log', yscale='log')
  1. Here's a test to see whether ages are well-modeled by a Weibull distribution.

Interpretation:

thinkplot.cdf(cdf_age, label='age', transform='Weibull') decorate(title='Distribution of age', xlabel='Age (years)', ylabel='log Complementary CDF, log scale', xscale='log', yscale='log')

Distribution of income

Here's the CDF of realinc.

cdf_realinc = Cdf(gss.realinc) thinkplot.cdf(cdf_realinc, label='income') decorate(title='Distribution of income', xlabel='Income (1986 $)', ylabel='CDF')

Exercise: Use visualizations like the ones in the previous exercise to see whether there is an analytic model that describes the distribution of gss.realinc well.

# Solution goes here
  1. Here's a normal probability plot for the values.

# Solution goes here
  1. Here's the complementary CDF on a log-y scale.

# Solution goes here
  1. Here's the CDF on a log-x scale.

# Solution goes here
  1. Here's the CDF of the logarithm of the values, compared to a normal model.

# Solution goes here
  1. Here's a normal probability plot for the logarithm of the values.

# Solution goes here
  1. Here's the complementary CDF on a log-log scale.

# Solution goes here
  1. Here's a test to see whether the values are well-modeled by a Weibull distribution.

Interpretation:

# Solution goes here

BRFSS

%time brfss = pd.read_hdf('brfss.hdf5', 'brfss') brfss.head()

Let's look at the distribution of height in the BRFSS dataset. Here's the CDF.

heights = brfss.HTM4 cdf_heights = Cdf(heights) thinkplot.Cdf(cdf_heights) decorate(xlabel='Height (cm)', ylabel='CDF')

To see whether a normal model describes this data well, we can use KDE to estimate the PDF.

from scipy.stats import gaussian_kde

Here's an example using the default bandwidth method.

kde = gaussian_kde(heights.dropna()) xs = np.linspace(heights.min(), heights.max()) ds = kde.evaluate(xs) ds /= ds.sum() plt.plot(xs, ds, label='KDE heights') decorate(xlabel='Height (cm)', ylabel='PDF')

It doesn't work very well; we can improve it by overriding the bandwidth with a constant.

kde = gaussian_kde(heights.dropna(), bw_method=0.3) ds = kde.evaluate(xs) ds /= ds.sum() plt.plot(xs, ds, label='KDE heights') decorate(xlabel='Height (cm)', ylabel='PDF')

Now we can generate a normal model with the same mean and standard deviation.

mean = heights.mean() std = heights.std() mean, std

Here's the model compared to the estimated PDF.

normal_pdf = thinkstats2.NormalPdf(mean, std) ps = normal_pdf.Density(xs) ps /= ps.sum() plt.plot(xs, ps, color='gray', label='Normal model') plt.plot(xs, ds, label='KDE heights') decorate(xlabel='Height (cm)', ylabel='PDF')

The data don't fit the model particularly well, possibly because the distribution of heights is a mixture of two distributions, for men and women.

Exercise: Generate a similar figure for just women's heights and see if the normal model does any better.

# Solution goes here
# Solution goes here

Exercise: Generate a similar figure for men's weights, brfss.WTKG3. How well does the normal model fit?

# Solution goes here
# Solution goes here

Exercise: Try it one more time with the log of men's weights. How well does the normal model fit? What does that imply about the distribution of weight?

# Solution goes here
# Solution goes here

Skewness

Let's look at the skewness of the distribution of weights for men and women.

male = (brfss.SEX == 1) male_weights = brfss.loc[male, 'WTKG3']
female = (brfss.SEX == 2) female_weights = brfss.loc[female, 'WTKG3']

As we've seen, these distributions are skewed to the right, so we expect the mean to be higher than the median.

male_weights.mean(), male_weights.median()

We can compute the moment-based sample skewness using Pandas or thinkstats2. The results are almost the same.

male_weights.skew(), thinkstats2.Skewness(male_weights.dropna())

But moment-based sample skewness is a terrible statistic! A more robust alternative is Pearson's median skewness:

thinkstats2.PearsonMedianSkewness(male_weights.dropna())

Exercise: Compute the same statistics for women. Which distribution is more skewed?

# Solution goes here
# Solution goes here
# Solution goes here

Exercise: Explore the GSS or BRFSS dataset and find something interesting!