Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.

This is the accompanying code for this book.

Website: http://greenteapress.com/wp/think-stats-2e/

Views: 7115
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3
4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7
8
from __future__ import print_function
9
10
import numpy as np
11
import pandas
12
13
import hinc
14
import thinkplot
15
import thinkstats2
16
17
"""This file contains a solution to an exercise in Think Stats:
18
19
The distributions of wealth and income are sometimes modeled using
20
lognormal and Pareto distributions. To see which is better, let's
21
look at some data.
22
23
The Current Population Survey (CPS) is joint effort of the Bureau
24
of Labor Statistics and the Census Bureau to study income and related
25
variables. Data collected in 2013 is available from
26
http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm.
27
I downloaded hinc06.xls, which is an Excel spreadsheet with
28
information about household income, and converted it to hinc06.csv,
29
a CSV file you will find in the repository for this book. You
30
will also find hinc.py, which reads the CSV file.
31
32
Extract the distribution of incomes from this dataset. Are any of the
33
analytic distributions in this chapter a good model of the data? A
34
solution to this exercise is in hinc_soln.py.
35
36
My solution generates three figures:
37
38
1) The CDF of income on a linear scale.
39
40
2) The CCDF on a log-log scale along with a Pareto model intended
41
to match the tail behavior.
42
43
3) The CDF on a log-x scale along with a lognormal model chose to
44
match the median and inter-quartile range.
45
46
My conclusions based on these figures are:
47
48
1) The Pareto model is probably a reasonable choice for the top
49
10-20% of incomes.
50
51
2) The lognormal model captures the shape of the distribution better,
52
but the data deviate substantially from the model. With different
53
choices for sigma, you could match the upper or lower tail, but not
54
both at the same time.
55
56
In summary I would say that neither model captures the whole distribution,
57
so you might have to
58
59
1) look for another analytic model,
60
2) choose one that captures the part of the distribution that is most
61
relevent, or
62
3) avoid using an analytic model altogether.
63
64
"""
65
66
67
class SmoothCdf(thinkstats2.Cdf):
68
"""Represents a CDF based on calculated quantiles.
69
"""
70
def Render(self):
71
"""Because this CDF was not computed from a sample, it
72
should not be rendered as a step function.
73
"""
74
return self.xs, self.ps
75
76
def Prob(self, x):
77
"""Compute CDF(x), interpolating between known values.
78
"""
79
return np.interp(x, self.xs, self.ps)
80
81
def Value(self, p):
82
"""Compute inverse CDF(x), interpolating between probabilities.
83
"""
84
return np.interp(p, self.ps, self.xs)
85
86
87
def MakeFigures(df):
88
"""Plots the CDF of income in several forms.
89
"""
90
xs, ps = df.income.values, df.ps.values
91
cdf = SmoothCdf(xs, ps, label='data')
92
cdf_log = SmoothCdf(np.log10(xs), ps, label='data')
93
94
# linear plot
95
thinkplot.Cdf(cdf)
96
thinkplot.Save(root='hinc_linear',
97
xlabel='household income',
98
ylabel='CDF')
99
100
# pareto plot
101
# for the model I chose parameters by hand to fit the tail
102
xs, ys = thinkstats2.RenderParetoCdf(xmin=55000, alpha=2.5,
103
low=0, high=250000)
104
thinkplot.Plot(xs, 1-ys, label='model', color='0.8')
105
106
thinkplot.Cdf(cdf, complement=True)
107
thinkplot.Save(root='hinc_pareto',
108
xlabel='log10 household income',
109
ylabel='CCDF',
110
xscale='log',
111
yscale='log')
112
113
# lognormal plot
114
# for the model I estimate mu and sigma using
115
# percentile-based statistics
116
median = cdf_log.Percentile(50)
117
iqr = cdf_log.Percentile(75) - cdf_log.Percentile(25)
118
std = iqr / 1.349
119
120
# choose std to match the upper tail
121
std = 0.35
122
print(median, std)
123
124
xs, ps = thinkstats2.RenderNormalCdf(median, std, low=3.5, high=5.5)
125
thinkplot.Plot(xs, ps, label='model', color='0.8')
126
127
thinkplot.Cdf(cdf_log)
128
thinkplot.Save(root='hinc_normal',
129
xlabel='log10 household income',
130
ylabel='CDF')
131
132
133
def main():
134
df = hinc.ReadData()
135
MakeFigures(df)
136
137
138
if __name__ == "__main__":
139
main()
140
141