Think Stats by Allen B. Downey
Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
License: GPL3
"""This file contains code used in "Think Stats",1by Allen B. Downey, available from greenteapress.com23Copyright 2014 Allen B. Downey4License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html5"""67from __future__ import print_function, division89import first10import hypothesis11import scatter12import thinkstats21314import numpy as np151617"""This file contains a solution to exercises in Think Stats:1819As sample size increases, the power of a hypothesis test increases,20which means it is more likely to be positive if the effect is real.21Conversely, as sample size decreases, the test is less likely to22be positive even if the effect is real.2324To investigate this behavior, run the tests in this chapter with25different subsets of the NSFG data. You can use thinkstats2.SampleRows26to select a random subset of the rows in a DataFrame.2728What happens to the p-values of these tests as sample size decreases?29What is the smallest sample size that yields a positive test?3031My results:3233test1: difference in mean pregnancy length34test2: difference in mean birth weight35test3: correlation of mother's age and birth weight36test4: chi-square test of pregnancy length3738n test1 test2 test2 test4399148 0.16 0.00 0.00 0.00404574 0.10 0.01 0.00 0.00412287 0.25 0.06 0.00 0.00421143 0.24 0.03 0.39 0.0343571 0.81 0.00 0.04 0.0444285 0.57 0.41 0.48 0.8345142 0.45 0.08 0.60 0.044647Conclusion: As expected, tests that are positive with large sample48sizes become negative as we take away data. But the pattern is49erratic, with some positive tests even at small sample sizes.505152In Section~\ref{testing}, we simulated the null hypothesis by53permutation; that is, we treated the observed values as if they54represented the entire population, and randomly assigned the55members of the population to the two groups.5657An alternative is to use the sample to estimate the distribution for58the population, then draw a random sample from that distribution.59This process is called resampling. 
There are several ways to60implement the resampling, but one of the simplest is to draw a sample,61with replacement, from the observed values, as in Section~\ref{power}.6263Write a class named {\tt DiffMeansResample} that inherits from64{\tt DiffMeansPermute} and overrides {\tt RunModel} to implement65resampling, rather than permutation.6667Use this model to test the differences in pregnancy length and68birth weight. How much does the model affect the results?6970Results:7172means permute preglength73p-value = 0.167474actual = 0.078037266777575ts max = 0.2267524361047677means permute birthweight78p-value = 0.079actual = 0.12476118453580ts max = 0.112243501197818283Conclusions: Using resampling instead of permutation has very84little effect on the results.8586The two models are based on slightly difference assumptions, and in87this example there is no compelling reason to choose one or the other.88But in general p-values depend on the choice of the null hypothesis;89different models can yield very different results.909192"""9394class DiffMeansResample(hypothesis.DiffMeansPermute):95"""Tests a difference in means using resampling."""9697def RunModel(self):98"""Run the model of the null hypothesis.99100returns: simulated data101"""102group1 = np.random.choice(self.pool, self.n, replace=True)103group2 = np.random.choice(self.pool, self.m, replace=True)104return group1, group2105106107def RunResampleTest(firsts, others):108"""Tests differences in means by resampling.109110firsts: DataFrame111others: DataFrame112"""113data = firsts.prglngth.values, others.prglngth.values114ht = DiffMeansResample(data)115p_value = ht.PValue(iters=10000)116print('\nmeans permute preglength')117print('p-value =', p_value)118print('actual =', ht.actual)119print('ts max =', ht.MaxTestStat())120121data = (firsts.totalwgt_lb.dropna().values,122others.totalwgt_lb.dropna().values)123ht = hypothesis.DiffMeansPermute(data)124p_value = ht.PValue(iters=10000)125print('\nmeans permute 
birthweight')126print('p-value =', p_value)127print('actual =', ht.actual)128print('ts max =', ht.MaxTestStat())129130131def RunTests(live, iters=1000):132"""Runs the tests from Chapter 9 with a subset of the data.133134live: DataFrame135iters: how many iterations to run136"""137n = len(live)138firsts = live[live.birthord == 1]139others = live[live.birthord != 1]140141# compare pregnancy lengths142data = firsts.prglngth.values, others.prglngth.values143ht = hypothesis.DiffMeansPermute(data)144p1 = ht.PValue(iters=iters)145146data = (firsts.totalwgt_lb.dropna().values,147others.totalwgt_lb.dropna().values)148ht = hypothesis.DiffMeansPermute(data)149p2 = ht.PValue(iters=iters)150151# test correlation152live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])153data = live2.agepreg.values, live2.totalwgt_lb.values154ht = hypothesis.CorrelationPermute(data)155p3 = ht.PValue(iters=iters)156157# compare pregnancy lengths (chi-squared)158data = firsts.prglngth.values, others.prglngth.values159ht = hypothesis.PregLengthTest(data)160p4 = ht.PValue(iters=iters)161162print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))163164165def main():166thinkstats2.RandomSeed(18)167168live, firsts, others = first.MakeFrames()169RunResampleTest(firsts, others)170171n = len(live)172for _ in range(7):173sample = thinkstats2.SampleRows(live, n)174RunTests(sample)175n //= 2176177178if __name__ == '__main__':179main()180181182
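
The difference between the permutation and resampling models of the null hypothesis can also be sketched without the book's modules. The block below is a minimal standalone illustration using only NumPy and synthetic data in place of the NSFG DataFrames. The helper names (`diff_means`, `simulate_null`, `p_value`) and the synthetic group parameters are assumptions made for this sketch; they are not part of `hypothesis` or `thinkstats2`.

```python
import numpy as np


def diff_means(group1, group2):
    """Test statistic: absolute difference between the group means."""
    return abs(np.mean(group1) - np.mean(group2))


def simulate_null(group1, group2, rng, resample):
    """Draws one simulated pair of groups under the null hypothesis.

    resample=False: permutation (shuffle the pooled values, then split).
    resample=True: resampling (draw each group from the pool with replacement).
    """
    pool = np.concatenate([group1, group2])
    n, m = len(group1), len(group2)
    if resample:
        return (rng.choice(pool, n, replace=True),
                rng.choice(pool, m, replace=True))
    shuffled = rng.permutation(pool)
    return shuffled[:n], shuffled[n:]


def p_value(group1, group2, rng, resample, iters=1000):
    """Fraction of simulated test statistics at least as large as the observed one."""
    actual = diff_means(group1, group2)
    stats = np.array([
        diff_means(*simulate_null(group1, group2, rng, resample))
        for _ in range(iters)
    ])
    return float(np.mean(stats >= actual))


# Synthetic stand-ins for two groups of pregnancy lengths (hypothetical values).
rng = np.random.default_rng(18)
sample_a = rng.normal(38.6, 2.7, size=400)
sample_b = rng.normal(38.5, 2.7, size=500)

p_perm = p_value(sample_a, sample_b, rng, resample=False)
p_res = p_value(sample_a, sample_b, rng, resample=True)
print('permutation p-value =', p_perm)
print('resampling  p-value =', p_res)
```

Permutation keeps the pooled values fixed and only reassigns them to groups, while resampling lets a value appear more than once or not at all. With samples of this size, the two approaches would usually be expected to give similar p-values.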