CoCalc -- chap06soln.py

Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
Website: http://greenteapress.com/wp/think-stats-2e/
Path: think-stats-code / chap06soln.py
Views: ⁷¹¹⁶
License: GPL3
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3

4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7

8
from __future__ import print_function
9

10
import numpy as np
11

12
import density
13
import hinc
14
import thinkplot
15
import thinkstats2
16

17
"""This file contains a solution to an exercise in Think Stats:
18

19
The distribution of income is famously skewed to the right.  In this
20
exercise, we'll measure how strong that skew is.
21
...
22

23
Compute the median, mean, skewness and Pearson's skewness of the
24
resulting sample.  What fraction of households reports a taxable
25
income below the mean?  How do the results depend on the assumed
26
upper bound?
27

28
My results with log_upper=6
29

30
mean 74278.7075312
31
std 93946.9299635
32
median 51226.4544789
33
skewness 4.94992024443
34
pearson skewness 0.736125801914
35
cdf[mean] 0.660005879567
36

37
With log_upper=7
38

39
mean 124267.397222
40
std 559608.501374
41
median 51226.4544789
42
skewness 11.6036902675
43
pearson skewness 0.391564509277
44
cdf[mean] 0.856563066521
45

46
With a higher upper bound, the moment-based skewness increases, as
47
expected.  Surprisingly, the Pearson skewness goes down!  The reason
48
seems to be that increasing the upper bound has a modest effect on the
49
mean, and a stronger effect on standard deviation.  Since std is in
50
the denominator with exponent 3, it has a stronger effect on the
51
result.
52

53
So this is apparently an example where Pearson skewness is not working
54
well as a summary statistic.  A better choice is a statistic that has
55
meaning in context, like the fraction of people with income below the
56
mean.  Or something like the Gini coefficient designed to quantify a
57
property of the distribution (like the relative difference we expect
58
between two random people).
59

60
"""
61

62
def InterpolateSample(df, log_upper=6.0):
63
    """Makes a sample of log10 household income.
64

65
    Assumes that log10 income is uniform in each range.
66

67
    df: DataFrame with columns income and freq
68
    log_upper: log10 of the assumed upper bound for the highest range
69

70
    returns: NumPy array of log10 household income
71
    """
72
    # compute the log10 of the upper bound for each range
73
    df['log_upper'] = np.log10(df.income)
74

75
    # get the lower bounds by shifting the upper bound and filling in
76
    # the first element
77
    df['log_lower'] = df.log_upper.shift(1)
78
    df.log_lower[0] = 3.0
79

80
    # plug in a value for the unknown upper bound of the highest range
81
    df.log_upper[41] = log_upper
82

83
    # use the freq column to generate the right number of values in
84
    # each range
85
    arrays = []
86
    for _, row in df.iterrows():
87
        vals = np.linspace(row.log_lower, row.log_upper, row.freq)
88
        arrays.append(vals)
89

90
    # collect the arrays into a single sample
91
    log_sample = np.concatenate(arrays)
92
    return log_sample
93

94

95
def main():
96
    df = hinc.ReadData()
97
    log_sample = InterpolateSample(df, log_upper=6.0)
98

99
    log_cdf = thinkstats2.Cdf(log_sample)
100
    thinkplot.Cdf(log_cdf)
101
    thinkplot.Show(xlabel='household income',
102
                   ylabel='CDF')
103

104
    sample = np.power(10, log_sample)
105
    mean, median = density.Summarize(sample)
106

107
    cdf = thinkstats2.Cdf(sample)
108
    print('cdf[mean]', cdf[mean])
109

110
    pdf = thinkstats2.EstimatedPdf(sample)
111
    thinkplot.Pdf(pdf)
112
    thinkplot.Show(xlabel='household income',
113
                   ylabel='PDF')
114

115

116
if __name__ == "__main__":
117
    main()
118

119