| Download
Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
Project: Support and Testing
Views: 7116License: GPL3
"""This file contains code used in "Think Stats",1by Allen B. Downey, available from greenteapress.com23Copyright 2014 Allen B. Downey4License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html5"""67from __future__ import print_function89import numpy as np1011import density12import hinc13import thinkplot14import thinkstats21516"""This file contains a solution to an exercise in Think Stats:1718The distribution of income is famously skewed to the right. In this19exercise, we'll measure how strong that skew is.20...2122Compute the median, mean, skewness and Pearson's skewness of the23resulting sample. What fraction of households reports a taxable24income below the mean? How do the results depend on the assumed25upper bound?2627My results with log_upper=62829mean 74278.707531230std 93946.929963531median 51226.454478932skewness 4.9499202444333pearson skewness 0.73612580191434cdf[mean] 0.6600058795673536With log_upper=73738mean 124267.39722239std 559608.50137440median 51226.454478941skewness 11.603690267542pearson skewness 0.39156450927743cdf[mean] 0.8565630665214445With a higher upper bound, the moment-based skewness increases, as46expected. Surprisingly, the Pearson skewness goes down! The reason47seems to be that increasing the upper bound has a modest effect on the48mean, and a stronger effect on standard deviation. Since std is in49the denominator with exponent 3, it has a stronger effect on the50result.5152So this is apparently an example where Pearson skewness is not working53well as a summary statistic. A better choice is a statistic that has54meaning in context, like the fraction of people with income below the55mean. Or something like the Gini coefficient designed to quantify a56property of the distribution (like the relative difference we expect57between two random people).5859"""6061def InterpolateSample(df, log_upper=6.0):62"""Makes a sample of log10 household income.6364Assumes that log10 income is uniform in each range.6566df: DataFrame with columns income and freq67log_upper: log10 of the assumed upper bound for the highest range6869returns: NumPy array of log10 household income70"""71# compute the log10 of the upper bound for each range72df['log_upper'] = np.log10(df.income)7374# get the lower bounds by shifting the upper bound and filling in75# the first element76df['log_lower'] = df.log_upper.shift(1)77df.log_lower[0] = 3.07879# plug in a value for the unknown upper bound of the highest range80df.log_upper[41] = log_upper8182# use the freq column to generate the right number of values in83# each range84arrays = []85for _, row in df.iterrows():86vals = np.linspace(row.log_lower, row.log_upper, row.freq)87arrays.append(vals)8889# collect the arrays into a single sample90log_sample = np.concatenate(arrays)91return log_sample929394def main():95df = hinc.ReadData()96log_sample = InterpolateSample(df, log_upper=6.0)9798log_cdf = thinkstats2.Cdf(log_sample)99thinkplot.Cdf(log_cdf)100thinkplot.Show(xlabel='household income',101ylabel='CDF')102103sample = np.power(10, log_sample)104mean, median = density.Summarize(sample)105106cdf = thinkstats2.Cdf(sample)107print('cdf[mean]', cdf[mean])108109pdf = thinkstats2.EstimatedPdf(sample)110thinkplot.Pdf(pdf)111thinkplot.Show(xlabel='household income',112ylabel='PDF')113114115if __name__ == "__main__":116main()117118119