Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.

This is the accompanying code for this book.

Website: http://greenteapress.com/wp/think-stats-2e/

Views: 7116
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3
4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7
8
from __future__ import print_function
9
10
import numpy as np
11
12
import density
13
import hinc
14
import thinkplot
15
import thinkstats2
16
17
"""This file contains a solution to an exercise in Think Stats:
18
19
The distribution of income is famously skewed to the right. In this
20
exercise, we'll measure how strong that skew is.
21
...
22
23
Compute the median, mean, skewness and Pearson's skewness of the
24
resulting sample. What fraction of households reports a taxable
25
income below the mean? How do the results depend on the assumed
26
upper bound?
27
28
My results with log_upper=6
29
30
mean 74278.7075312
31
std 93946.9299635
32
median 51226.4544789
33
skewness 4.94992024443
34
pearson skewness 0.736125801914
35
cdf[mean] 0.660005879567
36
37
With log_upper=7
38
39
mean 124267.397222
40
std 559608.501374
41
median 51226.4544789
42
skewness 11.6036902675
43
pearson skewness 0.391564509277
44
cdf[mean] 0.856563066521
45
46
With a higher upper bound, the moment-based skewness increases, as
47
expected. Surprisingly, the Pearson skewness goes down! The reason
48
seems to be that increasing the upper bound has a modest effect on the
49
mean, and a stronger effect on standard deviation. Since std is in
50
the denominator with exponent 3, it has a stronger effect on the
51
result.
52
53
So this is apparently an example where Pearson skewness is not working
54
well as a summary statistic. A better choice is a statistic that has
55
meaning in context, like the fraction of people with income below the
56
mean. Or something like the Gini coefficient designed to quantify a
57
property of the distribution (like the relative difference we expect
58
between two random people).
59
60
"""
61
62
def InterpolateSample(df, log_upper=6.0):
63
"""Makes a sample of log10 household income.
64
65
Assumes that log10 income is uniform in each range.
66
67
df: DataFrame with columns income and freq
68
log_upper: log10 of the assumed upper bound for the highest range
69
70
returns: NumPy array of log10 household income
71
"""
72
# compute the log10 of the upper bound for each range
73
df['log_upper'] = np.log10(df.income)
74
75
# get the lower bounds by shifting the upper bound and filling in
76
# the first element
77
df['log_lower'] = df.log_upper.shift(1)
78
df.log_lower[0] = 3.0
79
80
# plug in a value for the unknown upper bound of the highest range
81
df.log_upper[41] = log_upper
82
83
# use the freq column to generate the right number of values in
84
# each range
85
arrays = []
86
for _, row in df.iterrows():
87
vals = np.linspace(row.log_lower, row.log_upper, row.freq)
88
arrays.append(vals)
89
90
# collect the arrays into a single sample
91
log_sample = np.concatenate(arrays)
92
return log_sample
93
94
95
def main():
96
df = hinc.ReadData()
97
log_sample = InterpolateSample(df, log_upper=6.0)
98
99
log_cdf = thinkstats2.Cdf(log_sample)
100
thinkplot.Cdf(log_cdf)
101
thinkplot.Show(xlabel='household income',
102
ylabel='CDF')
103
104
sample = np.power(10, log_sample)
105
mean, median = density.Summarize(sample)
106
107
cdf = thinkstats2.Cdf(sample)
108
print('cdf[mean]', cdf[mean])
109
110
pdf = thinkstats2.EstimatedPdf(sample)
111
thinkplot.Pdf(pdf)
112
thinkplot.Show(xlabel='household income',
113
ylabel='PDF')
114
115
116
if __name__ == "__main__":
117
main()
118
119