CoCalc -- hinc

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
Website: http://greenteapress.com/wp/think-stats-2e/
Path: think-stats-code / hinc_soln.py
Views: ⁷¹¹⁵
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3

4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7

8
from __future__ import print_function
9

10
import numpy as np
11
import pandas
12

13
import hinc
14
import thinkplot
15
import thinkstats2
16

17
"""This file contains a solution to an exercise in Think Stats:
18

19
The distributions of wealth and income are sometimes modeled using
20
lognormal and Pareto distributions.  To see which is better, let's
21
look at some data.
22

23
The Current Population Survey (CPS) is joint effort of the Bureau
24
of Labor Statistics and the Census Bureau to study income and related
25
variables.  Data collected in 2013 is available from
26
http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm.
27
I downloaded hinc06.xls, which is an Excel spreadsheet with
28
information about household income, and converted it to hinc06.csv,
29
a CSV file you will find in the repository for this book.  You
30
will also find hinc.py, which reads the CSV file.
31

32
Extract the distribution of incomes from this dataset.  Are any of the
33
analytic distributions in this chapter a good model of the data?  A
34
solution to this exercise is in hinc_soln.py.
35

36
My solution generates three figures:
37

38
1) The CDF of income on a linear scale.
39

40
2) The CCDF on a log-log scale along with a Pareto model intended
41
to match the tail behavior.
42

43
3) The CDF on a log-x scale along with a lognormal model chose to
44
match the median and inter-quartile range.
45

46
My conclusions based on these figures are:
47

48
1) The Pareto model is probably a reasonable choice for the top
49
   10-20% of incomes.
50

51
2) The lognormal model captures the shape of the distribution better,
52
   but the data deviate substantially from the model.  With different
53
   choices for sigma, you could match the upper or lower tail, but not
54
   both at the same time.
55
 
56
In summary I would say that neither model captures the whole distribution,
57
so you might have to 
58

59
1) look for another analytic model, 
60
2) choose one that captures the part of the distribution that is most 
61
   relevent, or 
62
3) avoid using an analytic model altogether.
63

64
"""
65

66

67
class SmoothCdf(thinkstats2.Cdf):
68
    """Represents a CDF based on calculated quantiles.
69
    """
70
    def Render(self):
71
        """Because this CDF was not computed from a sample, it
72
        should not be rendered as a step function.
73
        """
74
        return self.xs, self.ps
75

76
    def Prob(self, x):
77
        """Compute CDF(x), interpolating between known values.
78
        """
79
        return np.interp(x, self.xs, self.ps)
80

81
    def Value(self, p):
82
        """Compute inverse CDF(x), interpolating between probabilities.
83
        """
84
        return np.interp(p, self.ps, self.xs)
85

86

87
def MakeFigures(df):
88
    """Plots the CDF of income in several forms.
89
    """
90
    xs, ps = df.income.values, df.ps.values
91
    cdf = SmoothCdf(xs, ps, label='data')
92
    cdf_log = SmoothCdf(np.log10(xs), ps, label='data')
93
    
94
    # linear plot
95
    thinkplot.Cdf(cdf) 
96
    thinkplot.Save(root='hinc_linear',
97
                   xlabel='household income',
98
                   ylabel='CDF')
99

100
    # pareto plot
101
    # for the model I chose parameters by hand to fit the tail
102
    xs, ys = thinkstats2.RenderParetoCdf(xmin=55000, alpha=2.5, 
103
                                         low=0, high=250000)
104
    thinkplot.Plot(xs, 1-ys, label='model', color='0.8')
105

106
    thinkplot.Cdf(cdf, complement=True) 
107
    thinkplot.Save(root='hinc_pareto',
108
                   xlabel='log10 household income',
109
                   ylabel='CCDF',
110
                   xscale='log',
111
                   yscale='log')
112

113
    # lognormal plot
114
    # for the model I estimate mu and sigma using 
115
    # percentile-based statistics
116
    median = cdf_log.Percentile(50)
117
    iqr = cdf_log.Percentile(75) - cdf_log.Percentile(25)
118
    std = iqr / 1.349
119

120
    # choose std to match the upper tail
121
    std = 0.35
122
    print(median, std)
123

124
    xs, ps = thinkstats2.RenderNormalCdf(median, std, low=3.5, high=5.5)
125
    thinkplot.Plot(xs, ps, label='model', color='0.8')
126

127
    thinkplot.Cdf(cdf_log) 
128
    thinkplot.Save(root='hinc_normal',
129
                   xlabel='log10 household income',
130
                   ylabel='CDF')
131
    
132

133
def main():
134
    df = hinc.ReadData()
135
    MakeFigures(df)
136

137

138
if __name__ == "__main__":
139
    main()
140

141