| Download
Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
Project: Support and Testing
Views: 7115License: GPL3
"""This file contains code used in "Think Stats",1by Allen B. Downey, available from greenteapress.com23Copyright 2014 Allen B. Downey4License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html5"""67from __future__ import print_function89import numpy as np10import pandas1112import hinc13import thinkplot14import thinkstats21516"""This file contains a solution to an exercise in Think Stats:1718The distributions of wealth and income are sometimes modeled using19lognormal and Pareto distributions. To see which is better, let's20look at some data.2122The Current Population Survey (CPS) is joint effort of the Bureau23of Labor Statistics and the Census Bureau to study income and related24variables. Data collected in 2013 is available from25http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm.26I downloaded hinc06.xls, which is an Excel spreadsheet with27information about household income, and converted it to hinc06.csv,28a CSV file you will find in the repository for this book. You29will also find hinc.py, which reads the CSV file.3031Extract the distribution of incomes from this dataset. Are any of the32analytic distributions in this chapter a good model of the data? A33solution to this exercise is in hinc_soln.py.3435My solution generates three figures:36371) The CDF of income on a linear scale.38392) The CCDF on a log-log scale along with a Pareto model intended40to match the tail behavior.41423) The CDF on a log-x scale along with a lognormal model chose to43match the median and inter-quartile range.4445My conclusions based on these figures are:46471) The Pareto model is probably a reasonable choice for the top4810-20% of incomes.49502) The lognormal model captures the shape of the distribution better,51but the data deviate substantially from the model. With different52choices for sigma, you could match the upper or lower tail, but not53both at the same time.5455In summary I would say that neither model captures the whole distribution,56so you might have to57581) look for another analytic model,592) choose one that captures the part of the distribution that is most60relevent, or613) avoid using an analytic model altogether.6263"""646566class SmoothCdf(thinkstats2.Cdf):67"""Represents a CDF based on calculated quantiles.68"""69def Render(self):70"""Because this CDF was not computed from a sample, it71should not be rendered as a step function.72"""73return self.xs, self.ps7475def Prob(self, x):76"""Compute CDF(x), interpolating between known values.77"""78return np.interp(x, self.xs, self.ps)7980def Value(self, p):81"""Compute inverse CDF(x), interpolating between probabilities.82"""83return np.interp(p, self.ps, self.xs)848586def MakeFigures(df):87"""Plots the CDF of income in several forms.88"""89xs, ps = df.income.values, df.ps.values90cdf = SmoothCdf(xs, ps, label='data')91cdf_log = SmoothCdf(np.log10(xs), ps, label='data')9293# linear plot94thinkplot.Cdf(cdf)95thinkplot.Save(root='hinc_linear',96xlabel='household income',97ylabel='CDF')9899# pareto plot100# for the model I chose parameters by hand to fit the tail101xs, ys = thinkstats2.RenderParetoCdf(xmin=55000, alpha=2.5,102low=0, high=250000)103thinkplot.Plot(xs, 1-ys, label='model', color='0.8')104105thinkplot.Cdf(cdf, complement=True)106thinkplot.Save(root='hinc_pareto',107xlabel='log10 household income',108ylabel='CCDF',109xscale='log',110yscale='log')111112# lognormal plot113# for the model I estimate mu and sigma using114# percentile-based statistics115median = cdf_log.Percentile(50)116iqr = cdf_log.Percentile(75) - cdf_log.Percentile(25)117std = iqr / 1.349118119# choose std to match the upper tail120std = 0.35121print(median, std)122123xs, ps = thinkstats2.RenderNormalCdf(median, std, low=3.5, high=5.5)124thinkplot.Plot(xs, ps, label='model', color='0.8')125126thinkplot.Cdf(cdf_log)127thinkplot.Save(root='hinc_normal',128xlabel='log10 household income',129ylabel='CDF')130131132def main():133df = hinc.ReadData()134MakeFigures(df)135136137if __name__ == "__main__":138main()139140141