1
% LaTeX source for ``Think Stats:
2
% Exploratory data analysis in Python''
3
% Copyright 2014 Allen B. Downey.
4
5
% License: Creative Commons
6
% Attribution-NonCommercial-ShareAlike 4.0 International
7
% http://creativecommons.org/licenses/by-nc-sa/4.0/
8
%
9
10
%\documentclass[10pt,b5paper]{book}
11
\documentclass[12pt]{book}
12
13
%\usepackage[width=5.5in,height=8.5in,
14
% hmarginratio=3:2,vmarginratio=1:1]{geometry}
15
16
% for some of these packages, you might have to install
17
% texlive-latex-extra (in Ubuntu)
18
19
%\usepackage[T1]{fontenc}
20
%\usepackage{textcomp}
21
%\usepackage{mathpazo}
22
%\usepackage{pslatex}
23
24
\usepackage{url}
25
\usepackage{hyperref}
26
\usepackage{fancyhdr}
27
\usepackage{graphicx}
28
\usepackage{subfig}
29
\usepackage{amsmath}
30
\usepackage{amsthm}
31
%\usepackage{amssymb}
32
\usepackage{makeidx}
33
\usepackage{setspace}
34
\usepackage{hevea}
35
\usepackage{upquote}
36
37
\title{Think Stats}
38
\author{Allen B. Downey}
39
40
\newcommand{\thetitle}{Think Stats}
41
\newcommand{\thesubtitle}{Exploratory Data Analysis in Python}
42
\newcommand{\theversion}{2.0.38}
43
44
% these styles get translated in CSS for the HTML version
45
\newstyle{a:link}{color:black;}
46
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
47
\newstyle{img}{border:0px}
48
49
% change the arrows in the HTML version
50
\setlinkstext
51
{\imgsrc[ALT="Previous"]{back.png}}
52
{\imgsrc[ALT="Up"]{up.png}}
53
{\imgsrc[ALT="Next"]{next.png}}
54
55
\makeindex
56
57
\newif\ifplastex
58
\plastexfalse
59
60
\begin{document}
61
62
\frontmatter
63
64
\newcommand{\Erdos}{Erd\H{o}s}
65
\newcommand{\nhat}{\hat{N}}
66
\newcommand{\eps}{\varepsilon}
67
\newcommand{\slope}{\mathrm{slope}}
68
\newcommand{\inter}{\mathrm{inter}}
69
\newcommand{\xs}{\mathrm{xs}}
70
\newcommand{\ys}{\mathrm{ys}}
71
\newcommand{\res}{\mathrm{res}}
72
\newcommand{\xbar}{\bar{x}}
73
\newcommand{\ybar}{\bar{y}}
74
\newcommand{\PMF}{\mathrm{PMF}}
75
\newcommand{\PDF}{\mathrm{PDF}}
76
\newcommand{\CDF}{\mathrm{CDF}}
77
\newcommand{\ICDF}{\mathrm{ICDF}}
78
\newcommand{\Prob}{\mathrm{P}}
79
\newcommand{\Corr}{\mathrm{Corr}}
80
\newcommand{\normal}{\mathcal{N}}
81
\newcommand{\given}{|}
82
%\newcommand{\goodchi}{\protect\raisebox{2pt}{$\chi$}}
83
\newcommand{\goodchi}{\chi}
84
85
\ifplastex
86
\usepackage{localdef}
87
\maketitle
88
89
\newcount\anchorcnt
90
\newcommand*{\Anchor}[1]{%
91
\@bsphack%
92
\Hy@GlobalStepCount\anchorcnt%
93
\edef\@currentHref{anchor.\the\anchorcnt}%
94
\Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}%
95
\M@gettitle{}\label{#1}%
96
\@esphack%
97
}
98
99
100
\else
101
102
%%% EXERCISE
103
104
\newtheoremstyle{exercise}% name of the style to be used
105
{\topsep}% measure of space to leave above the theorem. E.g.: 3pt
106
{\topsep}% measure of space to leave below the theorem. E.g.: 3pt
107
{}% name of font to use in the body of the theorem
108
{}% measure of space to indent
109
{\bfseries}% name of head font
110
{}% punctuation between head and body
111
{ }% space after theorem head; " " = normal interword space
112
{}% Manually specify head
113
114
\theoremstyle{exercise}
115
\newtheorem{exercise}{Exercise}[chapter]
116
117
%\newcounter{exercise}[chapter]
118
%\newcommand{\nextexercise}{\refstepcounter{exercise}}
119
120
%\newenvironment{exercise}{\nextexercise \noindent \textbf{Exercise \thechapter.\theexercise} \begin{itshape} \noindent}{\end{itshape}}
121
122
\input{latexonly}
123
124
\begin{latexonly}
125
126
\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}
127
128
%\blankpage
129
%\blankpage
130
131
% TITLE PAGES FOR LATEX VERSION
132
133
%-half title--------------------------------------------------
134
\thispagestyle{empty}
135
136
\begin{flushright}
137
\vspace*{2.0in}
138
139
\begin{spacing}{3}
140
{\huge \thetitle}\\
141
{\Large \thesubtitle }
142
\end{spacing}
143
144
\vspace{0.25in}
145
146
Version \theversion
147
148
\vfill
149
150
\end{flushright}
151
152
%--verso------------------------------------------------------
153
154
\blankpage
155
\blankpage
156
%\clearemptydoublepage
157
%\pagebreak
158
%\thispagestyle{empty}
159
%\vspace*{6in}
160
161
%--title page--------------------------------------------------
162
\pagebreak
163
\thispagestyle{empty}
164
165
\begin{flushright}
166
\vspace*{2.0in}
167
168
\begin{spacing}{3}
169
{\huge \thetitle}\\
170
{\Large \thesubtitle}
171
\end{spacing}
172
173
\vspace{0.25in}
174
175
Version \theversion
176
177
\vspace{1in}
178
179
180
{\Large
181
Allen B. Downey\\
182
}
183
184
185
\vspace{0.5in}
186
187
{\Large Green Tea Press}
188
189
{\small Needham, Massachusetts}
190
191
%\includegraphics[width=1in]{figs/logo1.eps}
192
\vfill
193
194
\end{flushright}
195
196
197
%--copyright--------------------------------------------------
198
\pagebreak
199
\thispagestyle{empty}
200
201
{\small
202
Copyright \copyright ~2014 Allen B. Downey.
203
204
205
\vspace{0.2in}
206
207
\begin{flushleft}
208
Green Tea Press \\
209
9 Washburn Ave \\
210
Needham MA 02492
211
\end{flushleft}
212
213
Permission is granted to copy, distribute, and/or modify this document
214
under the terms of the Creative Commons
215
Attribution-NonCommercial-ShareAlike 4.0 International License, which
216
is available at
217
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.
218
219
The original form of this book is \LaTeX\ source code. Compiling this
220
code has the effect of generating a device-independent representation
221
of a textbook, which can be converted to other formats and printed.
222
223
The \LaTeX\ source for this book is available from
224
\url{http://thinkstats2.com}.
225
226
\vspace{0.2in}
227
228
} % end small
229
230
\end{latexonly}
231
232
233
% HTMLONLY
234
235
\begin{htmlonly}
236
237
% TITLE PAGE FOR HTML VERSION
238
239
{\Large \thetitle: \thesubtitle}
240
241
{\large Allen B. Downey}
242
243
Version \theversion
244
245
\vspace{0.25in}
246
247
Copyright 2014 Allen B. Downey
248
249
\vspace{0.25in}
250
251
Permission is granted to copy, distribute, and/or modify this document
252
under the terms of the Creative Commons
253
Attribution-NonCommercial-ShareAlike 4.0 International License, which
is available at
255
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.
256
257
\setcounter{chapter}{-1}
258
259
\end{htmlonly}
260
261
\fi
262
% END OF THE PART WE SKIP FOR PLASTEX
263
264
\chapter{Preface}
265
\label{preface}
266
267
This book is an
268
introduction to the practical tools of exploratory data analysis.
269
The organization of the book follows the process I use
270
when I start working with a dataset:
271
272
\begin{itemize}
273
274
\item Importing and cleaning: Whatever format the data is in, it
275
usually takes some time and effort to read the data, clean and
276
transform it, and check that everything made it through the
277
translation process intact.
278
\index{cleaning}
279
280
\item Single variable explorations: I usually start by examining one
281
variable at a time, finding out what the variables mean, looking
282
at distributions of the values, and choosing appropriate
283
summary statistics.
284
\index{distribution}
285
286
\item Pair-wise explorations: To identify possible relationships
287
between variables, I look at tables and scatter plots, and compute
288
correlations and linear fits.
289
\index{correlation}
290
\index{linear fit}
291
292
\item Multivariate analysis: If there are apparent relationships
293
between variables, I use multiple regression to add control variables
294
and investigate more complex relationships.
295
\index{multiple regression}
296
\index{control variable}
297
298
\item Estimation and hypothesis testing: When reporting statistical
299
results, it is important to answer three questions: How big is
300
the effect? How much variability should we expect if we run the same
301
measurement again? Is it possible that the apparent effect is
302
due to chance?
303
\index{estimation}
304
\index{hypothesis testing}
305
306
\item Visualization: During exploration, visualization is an important
307
tool for finding possible relationships and effects. Then if an
308
apparent effect holds up to scrutiny, visualization is an effective
309
way to communicate results.
310
\index{visualization}
311
312
\end{itemize}
313
314
This book takes a computational approach, which has several
315
advantages over mathematical approaches:
316
\index{computational methods}
317
318
\begin{itemize}
319
320
\item I present most ideas using Python code, rather than
321
mathematical notation. In general, Python code is more readable;
322
also, because it is executable, readers can download it, run it,
323
and modify it.
324
325
\item Each chapter includes exercises readers can do to develop
326
and solidify their learning. When you write programs, you
327
express your understanding in code; while you are debugging the
328
program, you are also correcting your understanding.
329
\index{debugging}
330
331
\item Some exercises involve experiments to test statistical
332
behavior. For example, you can explore the Central Limit Theorem
333
(CLT) by generating random samples and computing their sums. The
334
resulting visualizations demonstrate why the CLT works and when
335
it doesn't.
336
\index{Central Limit Theorem}
337
\index{CLT}
338
339
\item Some ideas that are hard to grasp mathematically are easy to
340
understand by simulation. For example, we approximate p-values by
341
running random simulations, which reinforces the meaning of the
342
p-value.
343
\index{p-value}
344
345
\item Because the book is based on a general-purpose programming
346
language (Python), readers can import data from almost any source.
347
They are not limited to datasets that have been cleaned and
348
formatted for a particular statistics tool.
349
350
\end{itemize}
351
352
The book lends itself to a project-based approach. In my class,
353
students work on a semester-long project that requires them to pose a
354
statistical question, find a dataset that can address it, and apply
355
each of the techniques they learn to their own data.
356
357
To demonstrate my approach to statistical analysis, the book
358
presents a case study that runs through all of the chapters. It uses
359
data from two sources:
360
361
\begin{itemize}
362
363
\item The National Survey of Family Growth (NSFG), conducted by the
364
U.S. Centers for Disease Control and Prevention (CDC) to gather
365
``information on family life, marriage and divorce, pregnancy,
366
infertility, use of contraception, and men's and women's health.''
367
(See \url{http://cdc.gov/nchs/nsfg.htm}.)
368
369
\item The Behavioral Risk Factor Surveillance System (BRFSS),
370
conducted by the National Center for Chronic Disease Prevention and
371
Health Promotion to ``track health conditions and risk behaviors in
372
the United States.'' (See \url{http://cdc.gov/BRFSS/}.)
373
374
\end{itemize}
375
376
Other examples use data from the IRS, the U.S. Census, and
377
the Boston Marathon.
378
379
This second edition of {\it Think Stats\/} includes the chapters from
380
the first edition, many of them substantially revised, and new
381
chapters on regression, time series analysis, survival analysis,
382
and analytic methods. The previous edition did not use pandas,
383
SciPy, or StatsModels, so all of that material is new.
384
385
386
\section{How I wrote this book}
387
388
When people write a new textbook, they usually start by
389
reading a stack of old textbooks. As a result, most books
390
contain the same material in pretty much the same order.
391
392
I did not do that. In fact, I used almost no printed material while I
393
was writing this book, for several reasons:
394
395
\begin{itemize}
396
397
\item My goal was to explore a new approach to this material, so I didn't
398
want much exposure to existing approaches.
399
400
\item Since I am making this book available under a free license, I wanted
401
to make sure that no part of it was encumbered by copyright restrictions.
402
403
\item Many readers of my books don't have access to libraries of
404
printed material, so I tried to make references to resources that are
405
freely available on the Internet.
406
407
\item Some proponents of old media think that the exclusive
408
use of electronic resources is lazy and unreliable. They might be right
409
about the first part, but I think they are wrong about the second, so
410
I wanted to test my theory.
411
412
% http://www.ala.org/ala/mgrps/rts/nmrt/news/footnotes/may2010/in_defense_of_wikipedia_bonnett.cfm
413
414
\end{itemize}
415
416
The resource I used more than any other is Wikipedia. In general, the
417
articles I read on statistical topics were very good (although I made
418
a few small changes along the way). I include references to Wikipedia
419
pages throughout the book and I encourage you to follow those links;
420
in many cases, the Wikipedia page picks up where my description leaves
421
off. The vocabulary and notation in this book are generally
422
consistent with Wikipedia, unless I had a good reason to deviate.
423
Other resources I found useful were Wolfram MathWorld and
424
the Reddit statistics forum, \url{http://www.reddit.com/r/statistics}.
425
426
427
\section{Using the code}
428
\label{code}
429
430
The code and data used in this book are available from
431
\url{https://github.com/AllenDowney/ThinkStats2}. Git is a version
432
control system that allows you to keep track of the files that
433
make up a project. A collection of files under Git's control is
434
called a {\bf repository}. GitHub is a hosting service that provides
435
storage for Git repositories and a convenient web interface.
436
\index{repository}
437
\index{Git}
438
\index{GitHub}
439
440
The GitHub homepage for my repository provides several ways to
441
work with the code:
442
443
\begin{itemize}
444
445
\item You can create a copy of my repository
446
on GitHub by pressing the {\sf Fork} button. If you don't already
447
have a GitHub account, you'll need to create one. After forking, you'll
448
have your own repository on GitHub that you can use to keep track
449
of code you write while working on this book. Then you can
450
clone the repo, which means that you make a copy of the files
451
on your computer.
452
\index{fork}
453
454
\item Or you could clone
455
my repository. You don't need a GitHub account to do this, but you
456
won't be able to write your changes back to GitHub.
457
\index{clone}
458
459
\item If you don't want to use Git at all, you can download the files
460
in a Zip file using the button in the lower-right corner of the
461
GitHub page.
462
463
\end{itemize}
464
465
All of the code is written to work in both Python 2 and Python 3
466
with no translation.
467
468
I developed this book using Anaconda from
469
Continuum Analytics, which is a free Python distribution that includes
470
all the packages you'll need to run the code (and lots more).
471
I found Anaconda easy to install. By default it does a user-level
472
installation, not system-level, so you don't need administrative
473
privileges. And it supports both Python 2 and Python 3. You can
474
download Anaconda from \url{http://continuum.io/downloads}.
475
\index{Anaconda}
476
477
If you don't want to use Anaconda, you will need the following
478
packages:
479
480
\begin{itemize}
481
482
\item pandas for representing and analyzing data,
483
\url{http://pandas.pydata.org/};
484
\index{pandas}
485
486
\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
487
\index{NumPy}
488
489
\item SciPy for scientific computation including statistics,
490
\url{http://www.scipy.org/};
491
\index{SciPy}
492
493
\item StatsModels for regression and other statistical analysis,
494
\url{http://statsmodels.sourceforge.net/}; and
495
\index{StatsModels}
496
497
\item matplotlib for visualization, \url{http://matplotlib.org/}.
498
\index{matplotlib}
499
500
\end{itemize}
501
502
Although these are commonly used packages, they are not included with
503
all Python installations, and they can be hard to install in some
504
environments. If you have trouble installing them, I strongly
505
recommend using Anaconda or one of the other Python distributions
506
that include these packages.
507
\index{installation}
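
A quick way to confirm that these packages are importable (this check
is mine, not part of the book's code) is to run a few imports and
print the versions:

\begin{verbatim}
# Not part of the ThinkStats2 code: a quick sanity check that the
# required packages are installed.
import pandas
import numpy
import scipy
import statsmodels
import matplotlib

print(pandas.__version__, numpy.__version__)
print(scipy.__version__, matplotlib.__version__)
\end{verbatim}

If any of these imports fails, that package is missing or broken in
your environment.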
508
509
After you clone the repository or unzip the zip file, you should have
510
a folder called {\tt ThinkStats2/code} with a file called {\tt nsfg.py}.
511
If you run {\tt nsfg.py}, it should read a data file, run some tests, and print a
512
message like, ``All tests passed.'' If you get import errors, it
513
probably means there are packages you need to install.
514
515
Most exercises use Python scripts, but some also use the IPython
516
notebook. If you have not used IPython notebook before, I suggest
517
you start with the documentation at
518
\url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.
519
\index{IPython}
520
521
I wrote this book assuming that the reader is familiar with core Python,
522
including object-oriented features, but not pandas,
523
NumPy, and SciPy. If you are already familiar with these modules, you
524
can skip a few sections.
525
526
I assume that the reader knows basic mathematics, including
527
logarithms, for example, and summations. I refer to calculus concepts
528
in a few places, but you don't have to do any calculus.
529
530
If you have never studied statistics, I think this book is a good place
531
to start. And if you have taken
532
a traditional statistics class, I hope this book will help repair the
533
damage.
534
535
536
537
---
538
539
Allen B. Downey is a Professor of Computer Science at
540
the Franklin W. Olin College of Engineering in Needham, MA.
541
542
543
544
545
\section*{Contributor List}
546
547
If you have a suggestion or correction, please send email to
548
{\tt downey@allendowney.com}. If I make a change based on your
549
feedback, I will add you to the contributor list
550
(unless you ask to be omitted).
551
\index{contributors}
552
553
If you include at least part of the sentence the
554
error appears in, that makes it easy for me to search. Page and
555
section numbers are fine, too, but not quite as easy to work with.
556
Thanks!
557
558
\small
559
560
\begin{itemize}
561
562
\item Lisa Downey and June Downey read an early draft and made many
563
corrections and suggestions.
564
565
\item Steven Zhang found several errors.
566
567
\item Andy Pethan and Molly Farison helped debug some of the solutions,
568
and Molly spotted several typos.
569
570
\item Dr. Nikolas Akerblom knows how big a Hyracotherium is.
571
572
\item Alex Morrow clarified one of the code examples.
573
574
\item Jonathan Street caught an error in the nick of time.
575
576
\item Many thanks to Kevin Smith and Tim Arnold for their work on
577
plasTeX, which I used to convert this book to DocBook.
578
579
\item George Caplan sent several suggestions for improving clarity.
580
581
\item Julian Ceipek found an error and a number of typos.
582
583
\item Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson
584
found errors in the first print edition.
585
586
\item J\"{o}rg Beyer found typos in the book and made many corrections
587
in the docstrings of the accompanying code.
588
589
\item Tommie Gannert sent a patch file with a number of corrections.
590
591
\item Christoph Lendenmann submitted several errata.
592
593
\item Michael Kearney sent me many excellent suggestions.
594
595
\item Alex Birch made a number of helpful suggestions.
596
597
\item Lindsey Vanderlyn, Griffin Tschurwald, and Ben Small read an
598
early version of this book and found many errors.
599
600
\item John Roth, Carol Willing, and Carol Novitsky performed technical
601
reviews of the book. They found many errors and made many
602
helpful suggestions.
603
604
\item David Palmer sent many helpful suggestions and corrections.
605
606
\item Erik Kulyk found many typos.
607
608
\item Nir Soffer sent several excellent pull requests for both the
609
book and the supporting code.
610
611
\item GitHub user flothesof sent a number of corrections.
612
613
\item Toshiaki Kurokawa, who is working on the Japanese translation of
614
this book, has sent many corrections and helpful suggestions.
615
616
\item Benjamin White suggested more idiomatic Pandas code.
617
618
\item Takashi Sato spotted a code error.
619
620
% ENDCONTRIB
621
622
\end{itemize}
623
624
Other people who found typos and similar errors are Andrew Heine,
625
G\'{a}bor Lipt\'{a}k,
626
Dan Kearney,
627
Alexander Gryzlov,
628
Martin Veillette,
629
Haitao Ma,
630
Jeff Pickhardt,
631
Rohit Deshpande,
632
Joanne Pratt,
633
Lucian Ursu,
634
Paul Glezen,
635
Ting-kuang Lin,
636
Scott Miller,
637
Luigi Patruno.
638
639
640
641
\normalsize
642
643
\clearemptydoublepage
644
645
% TABLE OF CONTENTS
646
\begin{latexonly}
647
648
\tableofcontents
649
650
\clearemptydoublepage
651
652
\end{latexonly}
653
654
% START THE BOOK
655
\mainmatter
656
657
658
\chapter{Exploratory data analysis}
659
\label{intro}
660
661
The thesis of this book is that data combined with practical
662
methods can answer questions and guide decisions under uncertainty.
663
664
As an example, I present a case study motivated by a question
665
I heard when my wife and I were expecting our first child: do first
666
babies tend to arrive late?
667
\index{first babies}
668
669
If you Google this question, you will find plenty of discussion. Some
670
people claim it's true, others say it's a myth, and some people say
671
it's the other way around: first babies come early.
672
673
In many of these discussions, people provide data to support their
674
claims. I found many examples like these:
675
676
\begin{quote}
677
678
``My two friends that have given birth recently to their first babies,
679
BOTH went almost 2 weeks overdue before going into labour or being
680
induced.''
681
682
``My first one came 2 weeks late and now I think the second one is
683
going to come out two weeks early!!''
684
685
``I don't think that can be true because my sister was my mother's
686
first and she was early, as with many of my cousins.''
687
688
\end{quote}
689
690
Reports like these are called {\bf anecdotal evidence} because they
691
are based on data that is unpublished and usually personal. In casual
692
conversation, there is nothing wrong with anecdotes, so I don't mean
693
to pick on the people I quoted.
694
\index{anecdotal evidence}
695
696
But we might want evidence that is more persuasive and
697
an answer that is more reliable. By those standards, anecdotal
698
evidence usually fails, because:
699
700
\begin{itemize}
701
702
\item Small number of observations: If pregnancy length is longer
703
for first babies, the difference is probably small compared to
704
natural variation. In that case, we might have to compare a large
705
number of pregnancies to be sure that a difference exists.
706
\index{pregnancy length}
707
708
\item Selection bias: People who join a discussion of this question
709
might be interested because their first babies were late. In that
710
case the process of selecting data would bias the results.
711
\index{selection bias}
712
\index{bias!selection}
713
714
\item Confirmation bias: People who believe the claim might be more
715
likely to contribute examples that confirm it. People who doubt the
716
claim are more likely to cite counterexamples.
717
\index{confirmation bias}
718
\index{bias!confirmation}
719
720
\item Inaccuracy: Anecdotes are often personal stories, and often
721
misremembered, misrepresented, repeated
722
inaccurately, etc.
723
724
\end{itemize}
725
726
So how can we do better?
727
728
729
\section{A statistical approach}
730
731
To address the limitations of anecdotes, we will use the tools
732
of statistics, which include:
733
734
\begin{itemize}
735
736
\item Data collection: We will use data from a large national survey
737
that was designed explicitly with the goal of generating
738
statistically valid inferences about the U.S. population.
739
\index{data collection}
740
741
\item Descriptive statistics: We will generate statistics that
742
summarize the data concisely, and evaluate different ways to
743
visualize data.
744
\index{descriptive statistics}
745
746
\item Exploratory data analysis: We will look for
747
patterns, differences, and other features that address the questions
748
we are interested in. At the same time we will check for
749
inconsistencies and identify limitations.
750
\index{exploratory data analysis}
751
752
\item Estimation: We will use data from a sample to estimate
753
characteristics of the general population.
754
\index{estimation}
755
756
\item Hypothesis testing: Where we see apparent effects, like a
757
difference between two groups, we will evaluate whether the effect
758
might have happened by chance.
759
\index{hypothesis testing}
760
761
\end{itemize}
762
763
By performing these steps with care to avoid pitfalls, we can
764
reach conclusions that are more justifiable and more likely to be
765
correct.
766
767
768
\section{The National Survey of Family Growth}
769
\label{nsfg}
770
771
Since 1973 the U.S. Centers for Disease Control and Prevention (CDC)
772
have conducted the National Survey of Family Growth (NSFG),
773
which is intended to gather ``information on family life, marriage and
774
divorce, pregnancy, infertility, use of contraception, and men's and
775
women's health. The survey results are used \ldots to plan health services and
776
health education programs, and to do statistical studies of families,
777
fertility, and health.'' See
778
\url{http://cdc.gov/nchs/nsfg.htm}.
779
\index{National Survey of Family Growth}
780
\index{NSFG}
781
782
We will use data collected by this survey to investigate whether first
783
babies tend to come late, and other questions. In order to use this
784
data effectively, we have to understand the design of the study.
785
786
The NSFG is a {\bf cross-sectional} study, which means that it
787
captures a snapshot of a group at a point in time. The most
788
common alternative is a {\bf longitudinal} study, which observes a
789
group repeatedly over a period of time.
790
\index{cross-sectional study}
791
\index{study!cross-sectional}
792
\index{longitudinal study}
793
\index{study!longitudinal}
794
795
The NSFG has been conducted seven times; each deployment is called a
796
{\bf cycle}. We will use data from Cycle 6, which was conducted from
797
January 2002 to March 2003. \index{cycle}
798
799
The goal of the survey is to draw conclusions about a {\bf
800
population}; the target population of the NSFG is people in the
801
United States aged 15-44. Ideally surveys would collect data from
802
every member of the population, but that's seldom possible. Instead
803
we collect data from a subset of the population called a {\bf sample}.
804
The people who participate in a survey are called {\bf respondents}.
805
\index{population}
806
807
In general,
808
cross-sectional studies are meant to be {\bf representative}, which
809
means that every member of the target population has an equal chance
810
of participating. That ideal is hard to achieve in
811
practice, but people who conduct surveys come as close as they can.
812
\index{respondent} \index{representative}
813
814
The NSFG is not representative; instead it is deliberately {\bf
815
oversampled}. The designers of the study recruited three
816
groups---Hispanics, African-Americans and teenagers---at rates higher
817
than their representation in the U.S. population, in order to
818
make sure that the number of respondents in each of
819
these groups is large enough to draw valid statistical inferences.
820
\index{oversampling}
821
822
Of course, the drawback of oversampling is that it is not as easy
823
to draw conclusions about the general population based on statistics
824
from the survey. We will come back to this point later.
825
826
When working with this kind of data, it is important to be familiar
827
with the {\bf codebook}, which documents the design of the study, the
828
survey questions, and the encoding of the responses. The codebook and
829
user's guide for the NSFG data are available from
830
\url{http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm}.
831
832
833
\section{Importing the data}
834
835
The code and data used in this book are available from
836
\url{https://github.com/AllenDowney/ThinkStats2}. For information
837
about downloading and working with this code,
838
see Section~\ref{code}.
839
840
Once you download the code, you should have a file called {\tt
841
ThinkStats2/code/nsfg.py}. If you run it, it should read a data
842
file, run some tests, and print a message like, ``All tests passed.''
843
844
Let's see what it does. Pregnancy data from Cycle 6 of the NSFG is in
845
a file called {\tt 2002FemPreg.dat.gz}; it
846
is a gzip-compressed data file in plain text (ASCII), with fixed width
847
columns. Each line in the file is a {\bf record} that
848
contains data about one pregnancy.
849
850
The format of the file is documented in {\tt 2002FemPreg.dct}, which
851
is a Stata dictionary file. Stata is a statistical software system;
852
a ``dictionary'' in this context is a list of variable names, types,
853
and indices that identify where in each line to find each variable.
854
855
For example, here are a few lines from {\tt 2002FemPreg.dct}:
856
%
857
\begin{verbatim}
858
infile dictionary {
    _column(1)  str12  caseid    %12s  "RESPONDENT ID NUMBER"
    _column(13) byte   pregordr  %2f   "PREGNANCY ORDER (NUMBER)"
}
\end{verbatim}
863
864
This dictionary describes two variables: {\tt caseid} is a 12-character
865
string that represents the respondent ID; {\tt pregordr} is a
866
one-byte integer that indicates which pregnancy this record
867
describes for this respondent.
868
869
The code you downloaded includes {\tt thinkstats2.py}, which is a Python
870
module
871
that contains many classes and functions used in this book,
872
including functions that read the Stata dictionary and
873
the NSFG data file. Here's how they are used in {\tt nsfg.py}:
874
875
\begin{verbatim}
876
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df
\end{verbatim}
883
884
{\tt ReadStataDct} takes the name of the dictionary file
885
and returns {\tt dct}, a {\tt FixedWidthVariables} object that contains the
886
information from the dictionary file. {\tt dct} provides {\tt
887
ReadFixedWidth}, which reads the data file.
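
Here is a minimal sketch of how you might call these two functions
yourself, outside of {\tt nsfg.py} (it repeats what {\tt ReadFemPreg}
does, minus the cleaning step):

\begin{verbatim}
# A minimal sketch, not from nsfg.py: read the pregnancy file
# directly using the functions described above.
import thinkstats2

dct = thinkstats2.ReadStataDct('2002FemPreg.dct')
df = dct.ReadFixedWidth('2002FemPreg.dat.gz', compression='gzip')
print(len(df))    # number of pregnancy records
\end{verbatim}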
888
889
890
\section{DataFrames}
891
\label{dataframe}
892
893
The result of {\tt ReadFixedWidth} is a DataFrame, which is the
894
fundamental data structure provided by pandas, which is a Python
895
data and statistics package we'll use throughout this book.
896
A DataFrame contains a
897
row for each record, in this case one row per pregnancy, and a column
898
for each variable.
899
\index{pandas}
900
\index{DataFrame}
901
902
In addition to the data, a DataFrame also contains the variable
903
names and their types, and it provides methods for accessing and modifying
904
the data.
905
906
If you print {\tt df} you get a truncated view of the rows and
907
columns, and the shape of the DataFrame, which is 13593
908
rows/records and 244 columns/variables.
909
910
\begin{verbatim}
911
>>> import nsfg
912
>>> df = nsfg.ReadFemPreg()
913
>>> df
914
...
915
[13593 rows x 244 columns]
916
\end{verbatim}
917
918
The DataFrame is too big to display, so the output is truncated. The
919
last line reports the number of rows and columns.
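
You can also get this information directly from the {\tt shape}
attribute (this example is mine, not from the book's code):

\begin{verbatim}
>>> df.shape
(13593, 244)
\end{verbatim}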
920
921
The attribute {\tt columns} returns a sequence of column
922
names as Unicode strings:
923
924
\begin{verbatim}
925
>>> df.columns
926
Index([u'caseid', u'pregordr', u'howpreg_n', u'howpreg_p', ... ])
927
\end{verbatim}
928
929
The result is an Index, which is another pandas data structure.
930
We'll learn more about Index later, but for
931
now we'll treat it like a list:
932
\index{pandas}
933
\index{Index}
934
935
\begin{verbatim}
936
>>> df.columns[1]
937
'pregordr'
938
\end{verbatim}
939
940
To access a column from a DataFrame, you can use the column
941
name as a key:
942
\index{DataFrame}
943
944
\begin{verbatim}
945
>>> pregordr = df['pregordr']
946
>>> type(pregordr)
947
<class 'pandas.core.series.Series'>
948
\end{verbatim}
949
950
The result is a Series, yet another pandas data structure.
951
A Series is like a Python list with some additional features.
952
When you print a Series, you get the indices and the
953
corresponding values:
954
\index{Series}
955
956
\begin{verbatim}
957
>>> pregordr
0        1
1        2
2        1
3        2
...
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64
\end{verbatim}
968
969
In this example the indices are integers from 0 to 13592, but in
970
general they can be any sortable type. The elements
971
are also integers, but they can be any type.
972
973
The last line includes the variable name, Series length, and data type;
974
{\tt int64} is one of the types provided by NumPy. If you run
975
this example on a 32-bit machine you might see {\tt int32}.
976
\index{NumPy}
977
978
You can access the elements of a Series using integer indices
979
and slices:
980
981
\begin{verbatim}
982
>>> pregordr[0]
1
>>> pregordr[2:5]
2    1
3    2
4    3
Name: pregordr, dtype: int64
\end{verbatim}
990
991
The result of the index operator is an {\tt int64}; the
992
result of the slice is another Series.
993
994
You can also access the columns of a DataFrame using dot notation:
995
\index{DataFrame}
996
997
\begin{verbatim}
998
>>> pregordr = df.pregordr
999
\end{verbatim}
1000
1001
This notation only works if the column name is a valid Python
1002
identifier, so it has to begin with a letter, can't contain spaces, etc.
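
For example (this snippet is mine, using a made-up DataFrame rather
than NSFG data), a column whose name contains a space can only be
accessed with bracket syntax:

\begin{verbatim}
import pandas as pd

d = pd.DataFrame({'pregordr': [1, 2], 'birth weight': [7.5, 6.2]})
print(d.pregordr)          # dot notation works: valid identifier
print(d['birth weight'])   # the space rules out dot notation
\end{verbatim}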
1003
1004
1005
\section{Variables}
1006
1007
We have already seen two variables in the NSFG dataset, {\tt caseid}
1008
and {\tt pregordr}, and we have seen that there are 244 variables in
1009
total. For the explorations in this book, I use the following
1010
variables:
1011
1012
\begin{itemize}
1013
1014
\item {\tt caseid} is the integer ID of the respondent.
1015
1016
\item {\tt prglngth} is the integer duration of the pregnancy in weeks.
1017
\index{pregnancy length}
1018
1019
\item {\tt outcome} is an integer code for the outcome of the
1020
pregnancy. The code 1 indicates a live birth.
1021
1022
\item {\tt pregordr} is a pregnancy serial number; for example, the
1023
code for a respondent's first pregnancy is 1, for the second
1024
pregnancy is 2, and so on.
1025
1026
\item {\tt birthord} is a serial number for live
1027
births; the code for a respondent's first child is 1, and so on.
1028
For outcomes other than live birth, this field is blank.
1029
1030
\item \verb"birthwgt_lb" and \verb"birthwgt_oz" contain the pounds and
1031
ounces parts of the birth weight of the baby.
1032
\index{birth weight}
1033
\index{weight!birth}
1034
1035
\item {\tt agepreg} is the mother's age at the end of the pregnancy.
1036
1037
\item {\tt finalwgt} is the statistical weight associated with the
1038
respondent. It is a floating-point value that indicates the number
1039
of people in the U.S. population this respondent represents.
1040
\index{weight!sample}
1041
1042
\end{itemize}
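
As a quick illustration (not from the book's code), you can pull just
these columns out of the DataFrame and look at the first few records:

\begin{verbatim}
# Not from nsfg.py: select only the variables used in this book.
import nsfg

df = nsfg.ReadFemPreg()
columns = ['caseid', 'prglngth', 'outcome', 'pregordr', 'birthord',
           'birthwgt_lb', 'birthwgt_oz', 'agepreg', 'finalwgt']
print(df[columns].head())
\end{verbatim}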
1043
1044
If you read the codebook carefully, you will see that many of the
1045
variables are {\bf recodes}, which means that they are not part of the
1046
{\bf raw data} collected by the survey; they are calculated using
1047
the raw data. \index{recode} \index{raw data}
1048
1049
For example, {\tt prglngth} for live births is equal to the raw
1050
variable {\tt wksgest} (weeks of gestation) if it is available;
1051
otherwise it is estimated using {\tt mosgest * 4.33} (months of
1052
gestation times the average number of weeks in a month).
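
As a rough sketch (this is not the NSFG's code, and it assumes the
raw columns {\tt wksgest} and {\tt mosgest} are present in the
DataFrame), the recode logic looks something like this:

\begin{verbatim}
# Hypothetical reconstruction of the prglngth recode for live births.
import numpy as np
import nsfg

df = nsfg.ReadFemPreg()
est = np.where(df.wksgest.notnull(),    # use weeks of gestation
               df.wksgest,              # ...when it is available
               df.mosgest * 4.33)       # otherwise months * 4.33
\end{verbatim}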
1053
1054
Recodes are often based on logic that checks the consistency and
1055
accuracy of the data. In general it is a good idea to use recodes
1056
when they are available, unless there is a compelling reason to
1057
process the raw data yourself.
1058
1059
1060
\section{Transformation}
1061
\label{cleaning}
1062
1063
When you import data like this, you often have to check for errors,
1064
deal with special values, convert data into different formats, and
1065
perform calculations. These operations are called {\bf data cleaning}.
1066
1067
{\tt nsfg.py} includes {\tt CleanFemPreg}, a function that cleans
1068
the variables I am planning to use.
1069
1070
\begin{verbatim}
1071
def CleanFemPreg(df):
    df.agepreg /= 100.0

    na_vals = [97, 98, 99]
    df.birthwgt_lb.replace(na_vals, np.nan, inplace=True)
    df.birthwgt_oz.replace(na_vals, np.nan, inplace=True)

    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}
1080
1081
{\tt agepreg} contains the mother's age at the end of the
1082
pregnancy. In the data file, {\tt agepreg} is encoded as an integer
1083
number of centiyears. So the first line divides each element
1084
of {\tt agepreg} by 100, yielding a floating-point value in
1085
years.
1086
1087
\verb"birthwgt_lb" and \verb"birthwgt_oz" contain the weight of the
1088
baby, in pounds and ounces, for pregnancies that end in live birth.
1089
In addition, these variables use several special codes:
1090
1091
\begin{verbatim}
1092
97 NOT ASCERTAINED
1093
98 REFUSED
1094
99 DON'T KNOW
1095
\end{verbatim}
1096
1097
Special values encoded as numbers are {\em dangerous\/} because if they
1098
are not handled properly, they can generate bogus results, like
1099
a 99-pound baby. The {\tt replace} method replaces these values with
1100
{\tt np.nan}, a special floating-point value that represents ``not a
1101
number.'' The {\tt inplace} flag tells {\tt replace} to modify the
1102
existing Series rather than create a new one.
1103
\index{NaN}
1104
1105
As part of the IEEE floating-point standard, all mathematical
1106
operations return {\tt nan} if either argument is {\tt nan}:
1107
1108
\begin{verbatim}
1109
>>> import numpy as np
1110
>>> np.nan / 100.0
1111
nan
1112
\end{verbatim}
1113
1114
So computations with {\tt nan} tend to do the right thing, and most
1115
pandas functions handle {\tt nan} appropriately. But dealing with
1116
missing data will be a recurring issue.
1117
\index{pandas}
1118
\index{missing values}
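
As a small illustration (mine, not from the book's code), NumPy
propagates {\tt nan} through arithmetic, while pandas aggregations
skip it by default:

\begin{verbatim}
import numpy as np
import pandas as pd

s = pd.Series([8.0, np.nan, 7.0])
print(s.mean())                               # 7.5; the nan is skipped
print(np.mean(np.array([8.0, np.nan, 7.0])))  # nan
\end{verbatim}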
1119
1120
The last line of {\tt CleanFemPreg} creates a new
1121
column \verb"totalwgt_lb" that combines pounds and ounces into
1122
a single quantity, in pounds.
1123
1124
One important note: when you add a new column to a DataFrame, you
1125
must use dictionary syntax, like this
1126
\index{DataFrame}
1127
1128
\begin{verbatim}
1129
# CORRECT
1130
df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
1131
\end{verbatim}
1132
1133
Not dot notation, like this:
1134
1135
\begin{verbatim}
1136
# WRONG!
1137
df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0
1138
\end{verbatim}
1139
1140
The version with dot notation adds an attribute to the DataFrame
1141
object, but that attribute is not treated as a new column.
1142
1143
1144
\section{Validation}
1145
1146
When data is exported from one software environment and imported into
1147
another, errors might be introduced. And when you are
1148
getting familiar with a new dataset, you might interpret data
1149
incorrectly or introduce other misunderstandings. If you take
1150
time to validate the data, you can save time later and avoid errors.
1151
1152
One way to validate data is to compute basic statistics and compare
1153
them with published results. For example, the NSFG codebook includes
1154
tables that summarize each variable. Here is the table for
1155
{\tt outcome}, which encodes the outcome of each pregnancy:
1156
1157
\begin{verbatim}
1158
value   label               Total
1       LIVE BIRTH           9148
2       INDUCED ABORTION     1862
3       STILLBIRTH            120
4       MISCARRIAGE          1921
5       ECTOPIC PREGNANCY     190
6       CURRENT PREGNANCY     352
\end{verbatim}
1166
1167
The Series class provides a method, \verb"value_counts", that
1168
counts the number of times each value appears. If we select the {\tt
1169
outcome} Series from the DataFrame, we can use \verb"value_counts"
1170
to compare with the published data:
1171
\index{DataFrame}
1172
\index{Series}
1173
1174
\begin{verbatim}
1175
>>> df.outcome.value_counts().sort_index()
1    9148
2    1862
3     120
4    1921
5     190
6     352
\end{verbatim}
1183
1184
The result of \verb"value_counts" is a Series;
1185
\verb"sort_index()" sorts the Series by index, so the values
1186
appear in order.
1187
1188
Comparing the results with the published table, it looks like the
1189
values in {\tt outcome} are correct. Similarly, here is the published
1190
table for \verb"birthwgt_lb"
1191
1192
\begin{verbatim}
1193
value   label               Total
.       INAPPLICABLE         4449
0-5     UNDER 6 POUNDS       1125
6       6 POUNDS             2223
7       7 POUNDS             3049
8       8 POUNDS             1889
9-95    9 POUNDS OR MORE      799
\end{verbatim}
1201
1202
And here are the value counts:
1203
1204
\begin{verbatim}
1205
>>> df.birthwgt_lb.value_counts(sort=False)
0        8
1       40
2       53
3       98
4      229
5      697
6     2223
7     3049
8     1889
9      623
10     132
11      26
12      10
13       3
14       3
15       1
51       1
\end{verbatim}
1224
1225
The counts for 6, 7, and 8 pounds check out, and if you add
1226
up the counts for 0-5 and 9-95, they check out, too. But
1227
if you look more closely, you will notice one value that has to be
1228
an error, a 51 pound baby!
1229
1230
To deal with this error, I added a line to {\tt CleanFemPreg}:
1231
1232
\begin{verbatim}
1233
df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
1234
\end{verbatim}
1235
1236
This statement replaces invalid values with {\tt np.nan}.
1237
The attribute {\tt loc} provides several ways to select
1238
rows and columns from a DataFrame. In this example, the
1239
first expression in brackets is the row indexer; the second
1240
expression selects the column.
1241
\index{loc indexer}
1242
\index{indexer!loc}
1243
1244
The expression \verb"df.birthwgt_lb > 20" yields a Series of type
1245
{\tt bool}, where True indicates that the condition is true. When a
1246
boolean Series is used as an index, it selects only the elements that
1247
satisfy the condition.
1248
\index{Series} \index{boolean} \index{NaN}
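
For example (this check is mine, not part of the book's code), you
can use the same kind of boolean Series on its own to select rows;
after {\tt CleanFemPreg} runs, the check below should find no
remaining suspect records:

\begin{verbatim}
# Not from nsfg.py: rows selected by a boolean Series. After the
# cleaning step, no birthwgt_lb value should exceed 20 pounds.
import nsfg

df = nsfg.ReadFemPreg()
suspect = df[df.birthwgt_lb > 20]
print(len(suspect))    # expect 0
\end{verbatim}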
1249
1250
1251
1252
\section{Interpretation}
1253
1254
To work with data effectively, you have to think on two levels at the
1255
same time: the level of statistics and the level of context.
1256
1257
As an example, let's look at the sequence of outcomes for a few
1258
respondents. Because of the way the data files are organized, we have
1259
to do some processing to collect the pregnancy data for each respondent.
1260
Here's a function that does that:
1261
1262
\begin{verbatim}
1263
def MakePregMap(df):
    d = defaultdict(list)
    for index, caseid in df.caseid.iteritems():
        d[caseid].append(index)
    return d
\end{verbatim}
1269
1270
{\tt df} is the DataFrame with pregnancy data. The {\tt iteritems}
1271
method enumerates the index (row number)
1272
and {\tt caseid} for each pregnancy.
1273
\index{DataFrame}
1274
1275
{\tt d} is a dictionary that maps from each case ID to a list of
1276
indices. If you are not familiar with {\tt defaultdict}, it is in
1277
the Python {\tt collections} module.
1278
Using {\tt d}, we can look up a respondent and get the
1279
indices of that respondent's pregnancies.
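
Here is a quick illustration of {\tt defaultdict} on its own (my
example, not from the book's code):

\begin{verbatim}
from collections import defaultdict

d = defaultdict(list)
d['x'].append(1)     # no KeyError; an empty list is created first
d['x'].append(2)
print(d['x'])        # [1, 2]
\end{verbatim}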
1280
1281
This example looks up one respondent and prints a list of outcomes
1282
for her pregnancies:
1283
1284
\begin{verbatim}
1285
>>> caseid = 10229
1286
>>> preg_map = nsfg.MakePregMap(df)
1287
>>> indices = preg_map[caseid]
1288
>>> df.outcome[indices].values
1289
[4 4 4 4 4 4 1]
1290
\end{verbatim}
1291
1292
{\tt indices} is the list of indices for pregnancies corresponding
1293
to respondent {\tt 10229}.
1294
1295
Using this list as an index into {\tt df.outcome} selects the
1296
indicated rows and yields a Series. Instead of printing the
1297
whole Series, I selected the {\tt values} attribute, which is
1298
a NumPy array.
1299
\index{NumPy}
1300
\index{Series}
1301
1302
The outcome code {\tt 1} indicates a live birth. Code {\tt 4} indicates
1303
a miscarriage; that is, a pregnancy that ended spontaneously, usually
1304
with no known medical cause.
1305
1306
Statistically this respondent is not unusual. Miscarriages are common
1307
and there are other respondents who reported as many or more.
1308
1309
But remembering the context, this data tells the story of a woman who
1310
was pregnant six times, each time ending in miscarriage. Her seventh
1311
and most recent pregnancy ended in a live birth. If we consider this
1312
data with empathy, it is natural to be moved by the story it tells.
1313
1314
Each record in the NSFG dataset represents a person who provided
1315
honest answers to many personal and difficult questions. We can use
1316
this data to answer statistical questions about family life,
1317
reproduction, and health. At the same time, we have an obligation
1318
to consider the people represented by the data, and to afford them
1319
respect and gratitude.
1320
\index{ethics}
1321
1322
1323
\section{Exercises}
1324
1325
\begin{exercise}
1326
In the repository you downloaded, you should find a file named
\verb"chap01ex.ipynb", which is an IPython notebook. If you are not
familiar with IPython, I suggest you start with the documentation at
\url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.
\index{IPython}

To launch the IPython notebook server, run:

\begin{verbatim}
$ ipython notebook &
\end{verbatim}

If IPython is installed, it should launch a server that runs in the
background and open a browser window to view the notebook. If it does
not open a browser automatically, the startup message provides a URL
you can load in a browser, usually \url{http://localhost:8888}. The
new window should list the notebooks in the repository.
1350
1351
Open \verb"chap01ex.ipynb". Some cells are already filled in, and
1352
you should execute them. Other cells give you instructions for
1353
exercises you should try.
1354
1355
A solution to this exercise is in \verb"chap01soln.ipynb"
1356
\end{exercise}
1357
1358
1359
\begin{exercise}
1360
In the repository you downloaded, you should find a file named
1361
\verb"chap01ex.py"; using this file as a starting place, write a
1362
function that reads the respondent file, {\tt 2002FemResp.dat.gz}.
1363
1364
The variable {\tt pregnum} is a recode that indicates how many
1365
times each respondent has been pregnant. Print the value counts
1366
for this variable and compare them to the published results in
1367
the NSFG codebook.
1368
1369
You can also cross-validate the respondent and pregnancy files by
1370
comparing {\tt pregnum} for each respondent with the number of
1371
records in the pregnancy file.
1372
1373
You can use {\tt nsfg.MakePregMap} to make a dictionary that maps
1374
from each {\tt caseid} to a list of indices into the pregnancy
1375
DataFrame.
1376
\index{DataFrame}
1377
1378
A solution to this exercise is in \verb"chap01soln.py"
1379
\end{exercise}
1380
1381
1382
\begin{exercise}
1383
The best way to learn about statistics is to work on a project you are
1384
interested in. Is there a question like, ``Do first babies arrive
1385
late,'' that you want to investigate?
1386
1387
Think about questions you find personally interesting, or items of
1388
conventional wisdom, or controversial topics, or questions that have
1389
political consequences, and see if you can formulate a question that
1390
lends itself to statistical inquiry.
1391
1392
Look for data to help you address the question. Governments are good
1393
sources because data from public research is often freely
1394
available. Good places to start include \url{http://www.data.gov/},
1395
and \url{http://www.science.gov/}, and in the United Kingdom,
1396
\url{http://data.gov.uk/}.
1397
1398
Two of my favorite data sets are the General Social Survey at
1399
\url{http://www3.norc.org/gss+website/}, and the European Social
1400
Survey at \url{http://www.europeansocialsurvey.org/}.
1401
1402
If it seems like someone has already answered your question, look
1403
closely to see whether the answer is justified. There might be flaws
1404
in the data or the analysis that make the conclusion unreliable. In
1405
that case you could perform a different analysis of the same data, or
1406
look for a better source of data.
1407
1408
If you find a published paper that addresses your question, you
1409
should be able to get the raw data. Many authors make their data
1410
available on the web, but for sensitive data you might have to
1411
write to the authors, provide information about how you plan to use
1412
the data, or agree to certain terms of use. Be persistent!
1413
1414
\end{exercise}
1415
1416
1417
\section{Glossary}
1418
1419
\begin{itemize}
1420
1421
\item {\bf anecdotal evidence}: Evidence, often personal, that is collected
1422
casually rather than by a well-designed study.
1423
\index{anecdotal evidence}
1424
1425
\item {\bf population}: A group we are interested in studying.
1426
``Population'' often refers to a
1427
group of people, but the term is used for other subjects,
1428
too.
1429
\index{population}
1430
1431
\item {\bf cross-sectional study}: A study that collects data about a
1432
population at a particular point in time.
1433
\index{cross-sectional study}
1434
\index{study!cross-sectional}
1435
1436
\item {\bf cycle}: In a repeated cross-sectional study, each repetition
1437
of the study is called a cycle.
1438
1439
\item {\bf longitudinal study}: A study that follows a population over
1440
time, collecting data from the same group repeatedly.
1441
\index{longitudinal study}
1442
\index{study!longitudinal}
1443
1444
\item {\bf record}: In a dataset, a collection of information about
1445
a single person or other subject.
1446
\index{record}
1447
1448
\item {\bf respondent}: A person who responds to a survey.
1449
\index{respondent}
1450
1451
\item {\bf sample}: The subset of a population used to collect data.
1452
\index{sample}
1453
1454
\item {\bf representative}: A sample is representative if every member
1455
of the population has the same chance of being in the sample.
1456
\index{representative}
1457
1458
\item {\bf oversampling}: The technique of increasing the representation
1459
of a sub-population in order to avoid errors due to small sample
1460
sizes.
1461
\index{oversampling}
1462
1463
\item {\bf raw data}: Values collected and recorded with little or no
1464
checking, calculation or interpretation.
1465
\index{raw data}
1466
1467
\item {\bf recode}: A value that is generated by calculation and other
1468
logic applied to raw data.
1469
\index{recode}
1470
1471
\item {\bf data cleaning}: Processes that include validating data,
1472
identifying errors, translating between data types and
1473
representations, etc.
1474
1475
\end{itemize}
1476
1477
1478
1479
\chapter{Distributions}
1480
\label{descriptive}
1481
1482
1483
\section{Histograms}
1484
\label{histograms}
1485
1486
One of the best ways to describe a variable is to report the values
1487
that appear in the dataset and how many times each value appears.
1488
This description is called the {\bf distribution} of the variable.
1489
\index{distribution}
1490
1491
The most common representation of a distribution is a {\bf histogram},
1492
which is a graph that shows the {\bf frequency} of each value. In
1493
this context, ``frequency'' means the number of times the value
1494
appears. \index{histogram} \index{frequency}
1495
\index{dictionary}
1496
1497
In Python, an efficient way to compute frequencies is with a
1498
dictionary. Given a sequence of values, {\tt t}:
1499
%
1500
\begin{verbatim}
1501
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
\end{verbatim}
1505
1506
The result is a dictionary that maps from values to frequencies.
1507
Alternatively, you could use the {\tt Counter} class defined in the
1508
{\tt collections} module:
1509
1510
\begin{verbatim}
1511
from collections import Counter
1512
counter = Counter(t)
1513
\end{verbatim}
1514
1515
The result is a {\tt Counter} object, which is a subclass of
1516
dictionary.
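
For example (my illustration, not from the book's code), you can look
up frequencies in a {\tt Counter} just as you would in the dictionary
version:

\begin{verbatim}
from collections import Counter

t = [1, 2, 2, 3, 5]
counter = Counter(t)
print(counter[2])    # 2: the value 2 appears twice
print(counter[4])    # 0: missing values have frequency 0
\end{verbatim}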
1517
1518
Another option is to use the pandas method \verb"value_counts", which
1519
we saw in the previous chapter. But for this book I created a class,
1520
Hist, that represents histograms and provides the methods
1521
that operate on them.
1522
\index{pandas}
1523
1524
1525
\section{Representing histograms}
1526
\index{histogram}
1527
\index{Hist}
1528
1529
The Hist constructor can take a sequence, dictionary, pandas
1530
Series, or another Hist. You can instantiate a Hist object like this:
1531
%
1532
\begin{verbatim}
1533
>>> import thinkstats2
1534
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
1535
>>> hist
1536
Hist({1: 1, 2: 2, 3: 1, 5: 1})
1537
\end{verbatim}
1538
1539
Hist objects provide {\tt Freq}, which takes a value and
1540
returns its frequency: \index{frequency}
1541
%
1542
\begin{verbatim}
1543
>>> hist.Freq(2)
1544
2
1545
\end{verbatim}
1546
1547
The bracket operator does the same thing: \index{bracket operator}
1548
%
1549
\begin{verbatim}
1550
>>> hist[2]
1551
2
1552
\end{verbatim}
1553
1554
If you look up a value that has never appeared, the frequency is 0.
1555
%
1556
\begin{verbatim}
1557
>>> hist.Freq(4)
1558
0
1559
\end{verbatim}
1560
1561
{\tt Values} returns an unsorted list of the values in the Hist:
1562
%
1563
\begin{verbatim}
1564
>>> hist.Values()
1565
[1, 5, 3, 2]
1566
\end{verbatim}
1567
1568
To loop through the values in order, you can use the built-in function
1569
{\tt sorted}:
1570
%
1571
\begin{verbatim}
1572
for val in sorted(hist.Values()):
    print(val, hist.Freq(val))
\end{verbatim}
1575
1576
Or you can use {\tt Items} to iterate through
1577
value-frequency pairs: \index{frequency}
1578
%
1579
\begin{verbatim}
1580
for val, freq in hist.Items():
    print(val, freq)
\end{verbatim}
1583
1584
1585
\section{Plotting histograms}
1586
\index{pyplot}
1587
1588
\begin{figure}
1589
% first.py
1590
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_lb_hist.pdf}}
1591
\caption{Histogram of the pound part of birth weight.}
1592
\label{first_wgt_lb_hist}
1593
\end{figure}
1594
1595
For this book I wrote a module called {\tt thinkplot.py} that provides
1596
functions for plotting Hists and other objects defined in {\tt
1597
thinkstats2.py}. It is based on {\tt pyplot}, which is part of the
1598
{\tt matplotlib} package. See Section~\ref{code} for information
1599
about installing {\tt matplotlib}. \index{thinkplot}
1600
\index{matplotlib}
1601
1602
To plot {\tt hist} with {\tt thinkplot}, try this:
1603
\index{Hist}
1604
1605
\begin{verbatim}
1606
>>> import thinkplot
1607
>>> thinkplot.Hist(hist)
1608
>>> thinkplot.Show(xlabel='value', ylabel='frequency')
1609
\end{verbatim}
1610
1611
You can read the documentation for {\tt thinkplot} at
1612
\url{http://greenteapress.com/thinkstats2/thinkplot.html}.
1613
1614
1615
\begin{figure}
1616
% first.py
1617
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_oz_hist.pdf}}
1618
\caption{Histogram of the ounce part of birth weight.}
1619
\label{first_wgt_oz_hist}
1620
\end{figure}
1621
1622
1623
\section{NSFG variables}
1624
1625
Now let's get back to the data from the NSFG. The code in this
1626
chapter is in {\tt first.py}.
1627
For information about downloading and
1628
working with this code, see Section~\ref{code}.
1629
1630
When you start working with a new dataset, I suggest you explore
1631
the variables you are planning to use one at a time, and a good
1632
way to start is by looking at histograms.
1633
\index{histogram}
1634
1635
In Section~\ref{cleaning} we transformed {\tt agepreg}
1636
from centiyears to years, and combined \verb"birthwgt_lb" and
1637
\verb"birthwgt_oz" into a single quantity, \verb"totalwgt_lb".
1638
In this section I use these variables to demonstrate some
1639
features of histograms.
1640
1641
\begin{figure}
1642
% first.py
1643
\centerline{\includegraphics[height=2.5in]{figs/first_agepreg_hist.pdf}}
1644
\caption{Histogram of mother's age at end of pregnancy.}
1645
\label{first_agepreg_hist}
1646
\end{figure}
1647
1648
I'll start by reading the data and selecting records for live
1649
births:
1650
1651
\begin{verbatim}
1652
preg = nsfg.ReadFemPreg()
1653
live = preg[preg.outcome == 1]
1654
\end{verbatim}
1655
1656
The expression in brackets is a boolean Series that
1657
selects rows from the DataFrame and returns a new DataFrame.
1658
Next I generate and plot the histogram of
1659
\verb"birthwgt_lb" for live births.
1660
\index{DataFrame}
1661
\index{Series}
1662
\index{Hist}
1663
\index{bracket operator}
1664
\index{boolean}
1665
1666
\begin{verbatim}
1667
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
1668
thinkplot.Hist(hist)
1669
thinkplot.Show(xlabel='pounds', ylabel='frequency')
1670
\end{verbatim}
1671
1672
When the argument passed to Hist is a pandas Series, any
1673
{\tt nan} values are dropped. {\tt label} is a string that appears
1674
in the legend when the Hist is plotted.
1675
\index{pandas}
1676
\index{Series}
1677
\index{thinkplot}
1678
\index{NaN}
1679
1680
\begin{figure}
1681
% first.py
1682
\centerline{\includegraphics[height=2.5in]{figs/first_prglngth_hist.pdf}}
1683
\caption{Histogram of pregnancy length in weeks.}
1684
\label{first_prglngth_hist}
1685
\end{figure}
1686
1687
Figure~\ref{first_wgt_lb_hist} shows the result. The most common
1688
value, called the {\bf mode}, is 7 pounds. The distribution is
1689
approximately bell-shaped, which is the shape of the {\bf normal}
1690
distribution, also called a {\bf Gaussian} distribution. But unlike a
1691
true normal distribution, this distribution is asymmetric; it has
1692
a {\bf tail} that extends farther to the left than to the right.
1693
1694
Figure~\ref{first_wgt_oz_hist} shows the histogram of
1695
\verb"birthwgt_oz", which is the ounces part of birth weight. In
1696
theory we expect this distribution to be {\bf uniform}; that is, all
1697
values should have the same frequency. In fact, 0 is more common than
1698
the other values, and 1 and 15 are less common, probably because
1699
respondents round off birth weights that are close to an integer
1700
value.
1701
\index{birth weight}
1702
\index{weight!birth}
1703
1704
Figure~\ref{first_agepreg_hist} shows the histogram of \verb"agepreg",
1705
the mother's age at the end of pregnancy. The mode is 21 years. The
1706
distribution is very roughly bell-shaped, but in this case the tail
extends farther to the right than to the left; most mothers are in
their 20s, fewer in their 30s.
1709
1710
Figure~\ref{first_prglngth_hist} shows the histogram of
1711
\verb"prglngth", the length of the pregnancy in weeks. By far the
1712
most common value is 39 weeks. The left tail is longer than the
1713
right; early babies are common, but pregnancies seldom go past 43
1714
weeks, and doctors often intervene if they do.
1715
\index{pregnancy length}
1716
1717
1718
\section{Outliers}
1719
1720
Looking at histograms, it is easy to identify the most common
1721
values and the shape of the distribution, but rare values are
1722
not always visible.
1723
\index{histogram}
1724
1725
Before going on, it is a good idea to check for {\bf
1726
outliers}, which are extreme values that might be errors in
1727
measurement and recording, or might be accurate reports of rare
1728
events.
1729
\index{outlier}
1730
1731
Hist provides methods {\tt Largest} and {\tt Smallest}, which take
1732
an integer {\tt n} and return the {\tt n} largest or smallest
1733
values from the histogram:
1734
\index{Hist}
1735
1736
\begin{verbatim}
1737
for weeks, freq in hist.Smallest(10):
    print(weeks, freq)
1739
\end{verbatim}
1740
1741
In the list of pregnancy lengths for live births, the 10 lowest values
1742
are {\tt [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]}. Values below 10 weeks
1743
are certainly errors; the most likely explanation is that the outcome
1744
was not coded correctly. Values higher than 30 weeks are probably
1745
legitimate. Between 10 and 30 weeks, it is hard to be sure; some
1746
values are probably errors, but some represent premature babies.
1747
\index{pregnancy length}
1748
1749
On the other end of the range, the highest values are:
1750
%
1751
\begin{verbatim}
1752
weeks  count
43       148
44        46
45        10
46         1
47         1
48         7
50         2
1760
\end{verbatim}
1761
1762
Most doctors recommend induced labor if a pregnancy exceeds 42 weeks,
1763
so some of the longer values are surprising. In particular, 50 weeks
1764
seems medically unlikely.
1765
1766
The best way to handle outliers depends on ``domain knowledge'';
1767
that is, information about where the data come from and what they
1768
mean. And it depends on what analysis you are planning to perform.
1769
\index{outlier}
1770
1771
In this example, the motivating question is whether first babies
1772
tend to be early (or late). When people ask this question, they are
1773
usually interested in full-term pregnancies, so for this analysis
1774
I will focus on pregnancies longer than 27 weeks.
1775
1776
1777
\section{First babies}
1778
1779
Now we can compare the distribution of pregnancy lengths for first
1780
babies and others. I divided the DataFrame of live births using
1781
{\tt birthord}, and computed their histograms:
1782
\index{DataFrame}
1783
\index{Hist}
1784
\index{pregnancy length}
1785
1786
\begin{verbatim}
1787
firsts = live[live.birthord == 1]
1788
others = live[live.birthord != 1]
1789
1790
first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
1791
other_hist = thinkstats2.Hist(others.prglngth, label='other')
1792
\end{verbatim}
1793
1794
Then I plotted their histograms on the same axis:
1795
1796
\begin{verbatim}
1797
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency',
               xlim=[27, 46])
1803
\end{verbatim}
1804
1805
{\tt thinkplot.PrePlot} takes the number of histograms
1806
we are planning to plot; it uses this information to choose
1807
an appropriate collection of colors.
1808
\index{thinkplot}
1809
1810
\begin{figure}
1811
% first.py
1812
\centerline{\includegraphics[height=2.5in]{figs/first_nsfg_hist.pdf}}
1813
\caption{Histogram of pregnancy lengths.}
1814
\label{first_nsfg_hist}
1815
\end{figure}
1816
1817
{\tt thinkplot.Hist} normally uses {\tt align='center'} so that
1818
each bar is centered over its value. For this figure, I use
1819
{\tt align='right'} and {\tt align='left'} to place
1820
corresponding bars on either side of the value.
1821
\index{Hist}
1822
1823
With {\tt width=0.45}, the total width of the two bars is 0.9,
1824
leaving some space between each pair.
1825
1826
Finally, I adjust the axis to show only data between 27 and 46 weeks.
1827
Figure~\ref{first_nsfg_hist} shows the result.
1828
\index{pregnancy length}
1829
\index{length!pregnancy}
1830
1831
Histograms are useful because they make the most frequent values
1832
immediately apparent. But they are not the best choice for comparing
1833
two distributions. In this example, there are fewer ``first babies''
1834
than ``others,'' so some of the apparent differences in the histograms
1835
are due to sample sizes. In the next chapter we address this problem
1836
using probability mass functions.
1837
1838
1839
\section{Summarizing distributions}
1840
\label{mean}
1841
1842
A histogram is a complete description of the distribution of a sample;
1843
that is, given a histogram, we could reconstruct the values in the
1844
sample (although not their order).
1845
1846
If the details of the distribution are important, it might be
1847
necessary to present a histogram. But often we want to
1848
summarize the distribution with a few descriptive statistics.
1849
1850
Some of the characteristics we might want to report are:
1851
1852
\begin{itemize}
1853
1854
\item central tendency: Do the values tend to cluster around
1855
a particular point?
1856
\index{central tendency}
1857
1858
\item modes: Is there more than one cluster?
1859
\index{mode}
1860
1861
\item spread: How much variability is there in the values?
1862
\index{spread}
1863
1864
\item tails: How quickly do the probabilities drop off as we
1865
move away from the modes?
1866
\index{tail}
1867
1868
\item outliers: Are there extreme values far from the modes?
1869
\index{outlier}
1870
1871
\end{itemize}
1872
1873
Statistics designed to answer these questions are called {\bf summary
1874
statistics}. By far the most common summary statistic is the {\bf
1875
mean}, which is meant to describe the central tendency of the
1876
distribution. \index{mean} \index{average} \index{summary statistic}
1877
1878
If you have a sample of {\tt n} values, $x_i$, the mean, $\xbar$, is
1879
the sum of the values divided by the number of values; in other words
1880
%
1881
\[ \xbar = \frac{1}{n} \sum_i x_i \]
1882
%
1883
The words ``mean'' and ``average'' are sometimes used interchangeably,
1884
but I make this distinction:
1885
1886
\begin{itemize}
1887
1888
\item The ``mean'' of a sample is the summary statistic computed with
1889
the previous formula.
1890
1891
\item An ``average'' is one of several summary statistics you might
1892
choose to describe a central tendency.
1893
\index{central tendency}
1894
1895
\end{itemize}
1896
1897
Sometimes the mean is a good description of a set of values. For
1898
example, apples are all pretty much the same size (at least the ones
1899
sold in supermarkets). So if I buy 6 apples and the total weight is 3
1900
pounds, it would be a reasonable summary to say they are about a half
1901
pound each.
1902
\index{weight!pumpkin}
1903
1904
But pumpkins are more diverse. Suppose I grow several varieties in my
1905
garden, and one day I harvest three decorative pumpkins that are 1
1906
pound each, two pie pumpkins that are 3 pounds each, and one Atlantic
1907
Giant\textregistered~pumpkin that weighs 591 pounds. The mean of this
1908
sample is 100 pounds, but if I told you ``The average pumpkin in my
1909
garden is 100 pounds,'' that would be misleading. In this example,
1910
there is no meaningful average because there is no typical pumpkin.
1911
\index{pumpkin}
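
As a quick sanity check of the mean formula, here is the pumpkin
computation in plain Python, using the six weights listed above:

\begin{verbatim}
>>> pumpkins = [1, 1, 1, 3, 3, 591]
>>> sum(pumpkins) / len(pumpkins)
100.0
\end{verbatim}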
1912
1913
1914
1915
\section{Variance}
1916
\index{variance}
1917
1918
If there is no single number that summarizes pumpkin weights,
1919
we can do a little better with two numbers: mean and {\bf variance}.
1920
1921
Variance is a summary statistic intended to describe the variability
1922
or spread of a distribution. The variance of a set of values is
1923
%
1924
\[ S^2 = \frac{1}{n} \sum_i (x_i - \xbar)^2 \]
1925
%
1926
The term $x_i - \xbar$ is called the ``deviation from the mean,'' so
1927
variance is the mean squared deviation. The square root of variance,
1928
$S$, is the {\bf standard deviation}. \index{deviation}
1929
\index{standard deviation}
1930
\index{deviation}
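
Here is a minimal sketch of this formula in NumPy, again using the
pumpkin weights from the previous section (the variable names are
just for illustration):

\begin{verbatim}
import numpy as np

xs = np.array([1, 1, 1, 3, 3, 591])
xbar = xs.mean()                  # sample mean
var = ((xs - xbar)**2).mean()     # mean squared deviation
std = np.sqrt(var)                # standard deviation
\end{verbatim}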
1931
1932
If you have prior experience, you might have seen a formula for
1933
variance with $n-1$ in the denominator, rather than {\tt n}. This
1934
statistic is used to estimate the variance in a population using a
1935
sample. We will come back to this in Chapter~\ref{estimation}.
1936
\index{sample variance}
1937
1938
Pandas data structures provide methods to compute the mean, variance,
and standard deviation:
1940
\index{pandas}
1941
1942
\begin{verbatim}
1943
mean = live.prglngth.mean()
1944
var = live.prglngth.var()
1945
std = live.prglngth.std()
1946
\end{verbatim}
1947
1948
For all live births, the mean pregnancy length is 38.6 weeks and the
standard deviation is 2.7 weeks, which means we should expect
deviations of 2--3 weeks to be common.
1951
\index{pregnancy length}
1952
1953
Variance of pregnancy length is 7.3, which is hard to interpret,
1954
especially since the units are weeks$^2$, or ``square weeks.''
1955
Variance is useful in some calculations, but it is not
1956
a good summary statistic.
1957
1958
1959
\section{Effect size}
1960
\index{effect size}
1961
1962
An {\bf effect size} is a summary statistic intended to describe (wait
1963
for it) the size of an effect. For example, to describe the
1964
difference between two groups, one obvious choice is the difference in
1965
the means. \index{effect size}
1966
1967
Mean pregnancy length for first babies is 38.601; for
1968
other babies it is 38.523. The difference is 0.078 weeks, which works
1969
out to 13 hours. As a fraction of the typical pregnancy length, this
1970
difference is about 0.2\%.
1971
\index{pregnancy length}
1972
1973
If we assume this estimate is accurate, such a difference
1974
would have no practical consequences. In fact, without
1975
observing a large number of pregnancies, it is unlikely that anyone
1976
would notice this difference at all.
1977
\index{effect size}
1978
1979
Another way to convey the size of the effect is to compare the
1980
difference between groups to the variability within groups.
1981
Cohen's $d$ is a statistic intended to do that; it is defined as
1982
%
1983
\[ d = \frac{\bar{x_1} - \bar{x_2}}{s} \]
1984
%
1985
where $\bar{x_1}$ and $\bar{x_2}$ are the means of the groups and
1986
$s$ is the ``pooled standard deviation''. Here's the Python
1987
code that computes Cohen's $d$:
1988
\index{standard deviation!pooled}
1989
1990
\begin{verbatim}
1991
import math

def CohenEffectSize(group1, group2):
    diff = group1.mean() - group2.mean()

    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)

    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d
2001
\end{verbatim}
2002
2003
In this example, the difference in means is 0.029 standard deviations,
2004
which is small. To put that in perspective, the difference in
2005
height between men and women is about 1.7 standard deviations (see
2006
\url{https://en.wikipedia.org/wiki/Effect_size}).
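
As a usage sketch, you can reproduce that number by applying the
function to the two groups defined earlier in this chapter (this
assumes {\tt firsts} and {\tt others} are in scope):

\begin{verbatim}
d = CohenEffectSize(firsts.prglngth, others.prglngth)
# d is about 0.029, as reported above
\end{verbatim}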
2007
2008
2009
\section{Reporting results}
2010
2011
We have seen several ways to describe the difference in pregnancy
2012
length (if there is one) between first babies and others. How should
2013
we report these results?
2014
\index{pregnancy length}
2015
2016
The answer depends on who is asking the question. A scientist might
2017
be interested in any (real) effect, no matter how small. A doctor
2018
might only care about effects that are {\bf clinically significant};
2019
that is, differences that affect treatment decisions. A pregnant
2020
woman might be interested in results that are relevant to her, like
2021
the probability of delivering early or late.
2022
\index{clinically significant} \index{significant}
2023
2024
How you report results also depends on your goals. If you are trying
2025
to demonstrate the importance of an effect, you might choose summary
2026
statistics that emphasize differences. If you are trying to reassure
2027
a patient, you might choose statistics that put the differences in
2028
context.
2029
2030
Of course your decisions should also be guided by professional ethics.
2031
It's ok to be persuasive; you {\em should\/} design statistical reports
2032
and visualizations that tell a story clearly. But you should also do
2033
your best to make your reports honest, and to acknowledge uncertainty
2034
and limitations.
2035
\index{ethics}
2036
2037
2038
\section{Exercises}
2039
2040
\begin{exercise}
2041
Based on the results in this chapter, suppose you were asked to
2042
summarize what you learned about whether first babies arrive late.
2043
2044
Which summary statistics would you use if you wanted to get a story
2045
on the evening news? Which ones would you use if you wanted to
2046
reassure an anxious patient?
2047
\index{Adams, Cecil}
2048
\index{Straight Dope, The}
2049
2050
Finally, imagine that you are Cecil Adams, author of {\it The Straight
2051
Dope\/} (\url{http://straightdope.com}), and your job is to answer the
2052
question, ``Do first babies arrive late?'' Write a paragraph that
2053
uses the results in this chapter to answer the question clearly,
2054
precisely, and honestly.
2055
\index{ethics}
2056
2057
\end{exercise}
2058
2059
\begin{exercise}
2060
In the repository you downloaded, you should find a file named
2061
\verb"chap02ex.ipynb"; open it. Some cells are already filled in, and
2062
you should execute them. Other cells give you instructions for
2063
exercises. Follow the instructions and fill in the answers.
2064
2065
A solution to this exercise is in \verb"chap02soln.ipynb".
2066
\end{exercise}
2067
2068
In the repository you downloaded, you should find a file named
2069
\verb"chap02ex.py"; you can use this file as a starting place
2070
for the following exercises.
2071
My solution is in \verb"chap02soln.py".
2072
2073
\begin{exercise}
2074
The mode of a distribution is the most frequent value; see
2075
\url{http://wikipedia.org/wiki/Mode_(statistics)}. Write a function
2076
called {\tt Mode} that takes a Hist and returns the most
2077
frequent value.\index{mode}
2078
\index{Hist}
2079
2080
As a more challenging exercise, write a function called {\tt AllModes}
2081
that returns a list of value-frequency pairs in descending order of
2082
frequency.
2083
\index{frequency}
2084
\end{exercise}
2085
2086
\begin{exercise}
2087
Using the variable \verb"totalwgt_lb", investigate whether first
2088
babies are lighter or heavier than others. Compute Cohen's $d$
2089
to quantify the difference between the groups. How does it
2090
compare to the difference in pregnancy length?
2091
\index{pregnancy length}
2092
\end{exercise}
2093
2094
2095
\section{Glossary}
2096
2097
\begin{itemize}
2098
2099
\item distribution: The values that appear in a sample
2100
and the frequency of each.
2101
\index{distribution}
2102
2103
\item histogram: A mapping from values to frequencies, or a graph
2104
that shows this mapping.
2105
\index{histogram}
2106
2107
\item frequency: The number of times a value appears in a sample.
2108
\index{frequency}
2109
2110
\item mode: The most frequent value in a sample, or one of the
2111
most frequent values.
2112
\index{mode}
2113
2114
\item normal distribution: An idealization of a bell-shaped distribution;
2115
also known as a Gaussian distribution.
2116
\index{Gaussian distribution}
2117
\index{normal distribution}
2118
2119
\item uniform distribution: A distribution in which all values have
2120
the same frequency.
2121
\index{uniform distribution}
2122
2123
\item tail: The part of a distribution at the high and low extremes.
2124
\index{tail}
2125
2126
\item central tendency: A characteristic of a sample or population;
2127
intuitively, it is an average or typical value.
2128
\index{central tendency}
2129
2130
\item outlier: A value far from the central tendency.
2131
\index{outlier}
2132
2133
\item spread: A measure of how spread out the values in a distribution
2134
are.
2135
\index{spread}
2136
2137
\item summary statistic: A statistic that quantifies some aspect
2138
of a distribution, like central tendency or spread.
2139
\index{summary statistic}
2140
2141
\item variance: A summary statistic often used to quantify spread.
2142
\index{variance}
2143
2144
\item standard deviation: The square root of variance, also used
2145
as a measure of spread.
2146
\index{standard deviation}
2147
2148
\item effect size: A summary statistic intended to quantify the size
2149
of an effect like a difference between groups.
2150
\index{effect size}
2151
2152
\item clinically significant: A result, like a difference between groups,
2153
that is relevant in practice.
2154
\index{clinically significant}
2155
2156
\end{itemize}
2157
2158
2159
2160
2161
\chapter{Probability mass functions}
2162
\index{probability mass function}
2163
2164
The code for this chapter is in {\tt probability.py}.
2165
For information about downloading and
2166
working with this code, see Section~\ref{code}.
2167
2168
2169
\section{Pmfs}
2170
\index{Pmf}
2171
2172
Another way to represent a distribution is a {\bf probability mass
2173
function} (PMF), which maps from each value to its probability. A
2174
{\bf probability} is a frequency expressed as a fraction of the sample
2175
size, {\tt n}. To get from frequencies to probabilities, we divide
2176
through by {\tt n}, which is called {\bf normalization}.
2177
\index{frequency}
2178
\index{probability}
2179
\index{normalization}
2180
\index{PMF}
2181
\index{probability mass function}
2182
2183
Given a Hist, we can make a dictionary that maps from each
2184
value to its probability: \index{Hist}
2185
%
2186
\begin{verbatim}
2187
n = hist.Total()
d = {}
for x, freq in hist.Items():
    d[x] = freq / n
2191
\end{verbatim}
2192
%
2193
Or we can use the Pmf class provided by {\tt thinkstats2}.
2194
Like Hist, the Pmf constructor can take a list, pandas
2195
Series, dictionary, Hist, or another Pmf object. Here's an example
2196
with a simple list:
2197
%
2198
\begin{verbatim}
2199
>>> import thinkstats2
2200
>>> pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
2201
>>> pmf
2202
Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})
2203
\end{verbatim}
2204
2205
The Pmf is normalized so total probability is 1.
2206
2207
Pmf and Hist objects are similar in many ways; in fact, they inherit
2208
many of their methods from a common parent class. For example, the
2209
methods {\tt Values} and {\tt Items} work the same way for both. The
2210
biggest difference is that a Hist maps from values to integer
2211
counters; a Pmf maps from values to floating-point probabilities.
2212
\index{Hist}
2213
2214
To look up the probability associated with a value, use {\tt Prob}:
2215
%
2216
\begin{verbatim}
2217
>>> pmf.Prob(2)
2218
0.4
2219
\end{verbatim}
2220
2221
The bracket operator is equivalent:
2222
\index{bracket operator}
2223
2224
\begin{verbatim}
2225
>>> pmf[2]
2226
0.4
2227
\end{verbatim}
2228
2229
You can modify an existing Pmf by incrementing the probability
2230
associated with a value:
2231
%
2232
\begin{verbatim}
2233
>>> pmf.Incr(2, 0.2)
2234
>>> pmf.Prob(2)
2235
0.6
2236
\end{verbatim}
2237
2238
Or you can multiply a probability by a factor:
2239
%
2240
\begin{verbatim}
2241
>>> pmf.Mult(2, 0.5)
2242
>>> pmf.Prob(2)
2243
0.3
2244
\end{verbatim}
2245
2246
If you modify a Pmf, the result may not be normalized; that is, the
2247
probabilities may no longer add up to 1. To check, you can call {\tt
2248
Total}, which returns the sum of the probabilities:
2249
%
2250
\begin{verbatim}
2251
>>> pmf.Total()
2252
0.9
2253
\end{verbatim}
2254
2255
To renormalize, call {\tt Normalize}:
2256
%
2257
\begin{verbatim}
2258
>>> pmf.Normalize()
2259
>>> pmf.Total()
2260
1.0
2261
\end{verbatim}
2262
2263
Pmf objects provide a {\tt Copy} method so you can make
2264
and modify a copy without affecting the original.
2265
\index{Pmf}
2266
2267
My notation in this section might seem inconsistent, but there is a
2268
system: I use Pmf for the name of the class, {\tt pmf} for an instance
2269
of the class, and PMF for the mathematical concept of a
2270
probability mass function.
2271
2272
2273
\section{Plotting PMFs}
2274
\index{PMF}
2275
2276
{\tt thinkplot} provides two ways to plot Pmfs:
2277
\index{thinkplot}
2278
2279
\begin{itemize}
2280
2281
\item To plot a Pmf as a bar graph, you can use
2282
{\tt thinkplot.Hist}. Bar graphs are most useful if the number
2283
of values in the Pmf is small.
2284
\index{bar plot}
2285
\index{plot!bar}
2286
2287
\item To plot a Pmf as a step function, you can use
2288
{\tt thinkplot.Pmf}. This option is most useful if there are
2289
a large number of values and the Pmf is smooth. This function
2290
also works with Hist objects.
2291
\index{line plot}
2292
\index{plot!line}
2293
\index{Hist}
2294
\index{Pmf}
2295
2296
\end{itemize}
2297
2298
In addition, {\tt pyplot} provides a function called {\tt hist} that
2299
takes a sequence of values, computes a histogram, and plots it.
2300
Since I use Hist objects, I usually don't use {\tt pyplot.hist}.
2301
\index{pyplot}
2302
2303
\begin{figure}
2304
% probability.py
2305
\centerline{\includegraphics[height=3.0in]{figs/probability_nsfg_pmf.pdf}}
2306
\caption{PMF of pregnancy lengths for first babies and others, using
2307
bar graphs and step functions.}
2308
\label{probability_nsfg_pmf}
2309
\end{figure}
2310
\index{pregnancy length}
2311
\index{length!pregnancy}
2312
2313
Figure~\ref{probability_nsfg_pmf} shows PMFs of pregnancy length for
2314
first babies and others using bar graphs (left) and step functions
2315
(right).
2316
\index{pregnancy length}
2317
2318
By plotting the PMF instead of the histogram, we can compare the two
2319
distributions without being misled by the difference in sample
size. Based on this figure, first babies seem to be less likely than
others to arrive on time (week 39) and more likely to be late (weeks
41 and 42).
2323
2324
Here's the code that generates Figure~\ref{probability_nsfg_pmf}:
2325
2326
\begin{verbatim}
2327
thinkplot.PrePlot(2, cols=2)
thinkplot.Hist(first_pmf, align='right', width=width)
thinkplot.Hist(other_pmf, align='left', width=width)
thinkplot.Config(xlabel='weeks',
                 ylabel='probability',
                 axis=[27, 46, 0, 0.6])

thinkplot.PrePlot(2)
thinkplot.SubPlot(2)
thinkplot.Pmfs([first_pmf, other_pmf])
thinkplot.Show(xlabel='weeks',
               axis=[27, 46, 0, 0.6])
2339
\end{verbatim}
2340
2341
{\tt PrePlot} takes optional parameters {\tt rows} and {\tt cols}
2342
to make a grid of figures, in this case one row of two figures.
2343
The first figure (on the left) displays the Pmfs using {\tt thinkplot.Hist},
2344
as we have seen before.
2345
\index{thinkplot}
2346
\index{Hist}
2347
2348
The second call to {\tt PrePlot} resets the color generator. Then
2349
{\tt SubPlot} switches to the second figure (on the right) and
2350
displays the Pmfs using {\tt thinkplot.Pmfs}. I used the {\tt axis} option
2351
to ensure that the two figures are on the same axes, which is
2352
generally a good idea if you intend to compare two figures.
2353
2354
2355
\section{Other visualizations}
2356
\label{visualization}
2357
2358
Histograms and PMFs are useful while you are exploring data and
2359
trying to identify patterns and relationships.
2360
Once you have an idea what is going on, a good next step is to
2361
design a visualization that makes the patterns you have identified
2362
as clear as possible.
2363
\index{exploratory data analysis}
2364
\index{visualization}
2365
2366
In the NSFG data, the biggest differences in the distributions are
2367
near the mode. So it makes sense to zoom in on that part of the
2368
graph, and to transform the data to emphasize differences:
2369
\index{National Survey of Family Growth}
2370
\index{NSFG}
2371
2372
\begin{verbatim}
2373
weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    diff = 100 * (p1 - p2)
    diffs.append(diff)

thinkplot.Bar(weeks, diffs)
2382
\end{verbatim}
2383
2384
In this code, {\tt weeks} is the range of weeks; {\tt diffs} is the
2385
difference between the two PMFs in percentage points.
2386
Figure~\ref{probability_nsfg_diffs} shows the result as a bar chart.
2387
This figure makes the pattern clearer: first babies are less likely to
2388
be born in week 39, and somewhat more likely to be born in weeks 41
2389
and 42.
2390
\index{thinkplot}
2391
2392
\begin{figure}
2393
% probability.py
2394
\centerline{\includegraphics[height=2.5in]{figs/probability_nsfg_diffs.pdf}}
2395
\caption{Difference, in percentage points, by week.}
2396
\label{probability_nsfg_diffs}
2397
\end{figure}
2398
2399
For now we should hold this conclusion only tentatively.
2400
We used the same dataset to identify an
2401
apparent difference and then chose a visualization that makes the
2402
difference apparent. We can't be sure this effect is real;
2403
it might be due to random variation. We'll address this concern
2404
later.
2405
2406
2407
\section{The class size paradox}
2408
\index{class size}
2409
2410
Before we go on, I want to demonstrate
2411
one kind of computation you can do with Pmf objects; I call
2412
this example the ``class size paradox.''
2413
\index{Pmf}
2414
2415
At many American colleges and universities, the student-to-faculty
2416
ratio is about 10:1. But students are often surprised to discover
2417
that their average class size is bigger than 10. There
2418
are two reasons for the discrepancy:
2419
2420
\begin{itemize}
2421
2422
\item Students typically take 4--5 classes per semester, but
2423
professors often teach 1 or 2.
2424
2425
\item The number of students who enjoy a small class is small,
2426
but the number of students in a large class is (ahem!) large.
2427
2428
\end{itemize}
2429
2430
The first effect is obvious, at least once it is pointed out;
2431
the second is more subtle. Let's look at an example. Suppose
2432
that a college offers 65 classes in a given semester, with the
2433
following distribution of sizes:
2434
%
2435
\begin{verbatim}
2436
 size   count
 5- 9       8
10-14       8
15-19      14
20-24       4
25-29       6
30-34      12
35-39       8
40-44       3
45-49       2
2446
\end{verbatim}
2447
2448
If you ask the Dean for the average class size, he would
2449
construct a PMF, compute the mean, and report that the
2450
average class size is 23.7. Here's the code:
2451
2452
\begin{verbatim}
2453
d = { 7: 8, 12: 8, 17: 14, 22: 4,
      27: 6, 32: 12, 37: 8, 42: 3, 47: 2 }

pmf = thinkstats2.Pmf(d, label='actual')
print('mean', pmf.Mean())
2458
\end{verbatim}
2459
2460
But if you survey a group of students, ask them how many
2461
students are in their classes, and compute the mean, you would
2462
think the average class was bigger. Let's see how
2463
much bigger.
2464
2465
First, I compute the
2466
distribution as observed by students, where the probability
2467
associated with each class size is ``biased'' by the number
2468
of students in the class.
2469
\index{observer bias}
2470
\index{bias!observer}
2471
2472
\begin{verbatim}
2473
def BiasPmf(pmf, label):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
2481
\end{verbatim}
2482
2483
For each class size, {\tt x}, we multiply the probability by
2484
{\tt x}, the number of students who observe that class size.
2485
The result is a new Pmf that represents the biased distribution.
2486
2487
Now we can plot the actual and observed distributions:
2488
\index{thinkplot}
2489
2490
\begin{verbatim}
2491
biased_pmf = BiasPmf(pmf, label='observed')
2492
thinkplot.PrePlot(2)
2493
thinkplot.Pmfs([pmf, biased_pmf])
2494
thinkplot.Show(xlabel='class size', ylabel='PMF')
2495
\end{verbatim}
2496
2497
\begin{figure}
2498
% probability.py
2499
\centerline{\includegraphics[height=3.0in]{figs/class_size1.pdf}}
2500
\caption{Distribution of class sizes, actual and as observed by students.}
2501
\label{class_size1}
2502
\end{figure}
2503
2504
Figure~\ref{class_size1} shows the result. In the biased distribution
2505
there are fewer small classes and more large ones.
2506
The mean of the biased distribution is 29.1, almost 25\% higher
2507
than the actual mean.
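
To see those numbers, you can compare the means directly; here is a
short sketch using the {\tt Mean} method shown earlier:

\begin{verbatim}
print('actual mean', pmf.Mean())            # about 23.7
print('observed mean', biased_pmf.Mean())   # about 29.1
\end{verbatim}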
2508
2509
It is also possible to invert this operation. Suppose you want to
2510
find the distribution of class sizes at a college, but you can't get
2511
reliable data from the Dean. An alternative is to choose a random
2512
sample of students and ask how many students are in their
2513
classes. \index{bias!oversampling} \index{oversampling}
2514
2515
The result would be biased for the reasons we've just seen, but you
2516
can use it to estimate the actual distribution. Here's the function
2517
that unbiases a Pmf:
2518
2519
\begin{verbatim}
2520
def UnbiasPmf(pmf, label):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf.Mult(x, 1.0/x)

    new_pmf.Normalize()
    return new_pmf
2528
\end{verbatim}
2529
2530
It's similar to {\tt BiasPmf}; the only difference is that it
2531
divides each probability by {\tt x} instead of multiplying.
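
As a quick sanity check, unbiasing the biased distribution from the
previous section should recover the actual distribution; in
particular, the means should match (this sketch assumes \verb"pmf"
and \verb"biased_pmf" are still in scope):

\begin{verbatim}
unbiased_pmf = UnbiasPmf(biased_pmf, label='unbiased')
print('unbiased mean', unbiased_pmf.Mean())   # about 23.7 again
\end{verbatim}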
2532
2533
2534
\section{DataFrame indexing}
2535
2536
In Section~\ref{dataframe} we read a pandas DataFrame and used it to
2537
select and modify data columns. Now let's look at row selection.
2538
To start, I create a NumPy array of random numbers and use it
2539
to initialize a DataFrame:
2540
\index{NumPy}
2541
\index{pandas}
2542
\index{DataFrame}
2543
2544
\begin{verbatim}
2545
>>> import numpy as np
2546
>>> import pandas
2547
>>> array = np.random.randn(4, 2)
2548
>>> df = pandas.DataFrame(array)
2549
>>> df
2550
          0         1
0 -0.143510  0.616050
1 -1.489647  0.300774
2 -0.074350  0.039621
3 -1.369968  0.545897
2555
\end{verbatim}
2556
2557
By default, the rows and columns are numbered starting at zero, but
2558
you can provide column names:
2559
2560
\begin{verbatim}
2561
>>> columns = ['A', 'B']
2562
>>> df = pandas.DataFrame(array, columns=columns)
2563
>>> df
2564
          A         B
0 -0.143510  0.616050
1 -1.489647  0.300774
2 -0.074350  0.039621
3 -1.369968  0.545897
2569
\end{verbatim}
2570
2571
You can also provide row names. The set of row names is called the
2572
{\bf index}; the row names themselves are called {\bf labels}.
2573
2574
\begin{verbatim}
2575
>>> index = ['a', 'b', 'c', 'd']
2576
>>> df = pandas.DataFrame(array, columns=columns, index=index)
2577
>>> df
2578
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
c -0.074350  0.039621
d -1.369968  0.545897
2583
\end{verbatim}
2584
2585
As we saw in the previous chapter, simple indexing selects a
2586
column, returning a Series:
2587
\index{Series}
2588
2589
\begin{verbatim}
2590
>>> df['A']
2591
a -0.143510
2592
b -1.489647
2593
c -0.074350
2594
d -1.369968
2595
Name: A, dtype: float64
2596
\end{verbatim}
2597
2598
To select a row by label, you can use the {\tt loc} attribute, which
2599
returns a Series:
2600
2601
\begin{verbatim}
2602
>>> df.loc['a']
2603
A -0.14351
2604
B 0.61605
2605
Name: a, dtype: float64
2606
\end{verbatim}
2607
2608
If you know the integer position of a row, rather than its label, you
2609
can use the {\tt iloc} attribute, which also returns a Series.
2610
2611
\begin{verbatim}
2612
>>> df.iloc[0]
2613
A -0.14351
2614
B 0.61605
2615
Name: a, dtype: float64
2616
\end{verbatim}
2617
2618
{\tt loc} can also take a list of labels; in that case,
2619
the result is a DataFrame.
2620
2621
\begin{verbatim}
2622
>>> indices = ['a', 'c']
2623
>>> df.loc[indices]
2624
         A         B
a -0.14351  0.616050
c -0.07435  0.039621
2627
\end{verbatim}
2628
2629
Finally, you can use a slice to select a range of rows by label:
2630
2631
\begin{verbatim}
2632
>>> df['a':'c']
2633
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
c -0.074350  0.039621
2637
\end{verbatim}
2638
2639
Or by integer position:
2640
2641
\begin{verbatim}
2642
>>> df[0:2]
2643
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
2646
\end{verbatim}
2647
2648
The result in either case is a DataFrame, but notice that the first
2649
result includes the end of the slice; the second doesn't.
2650
\index{DataFrame}
2651
2652
My advice: if your rows have labels that are not simple integers, use
2653
the labels consistently and avoid using integer positions.
2654
2655
2656
2657
\section{Exercises}
2658
2659
Solutions to these exercises are in \verb"chap03soln.ipynb"
and \verb"chap03soln.py".
2661
2662
\begin{exercise}
2663
Something like the class size paradox appears if you survey children
2664
and ask how many children are in their family. Families with many
2665
children are more likely to appear in your sample, and
2666
families with no children have no chance to be in the sample.
2667
\index{observer bias}
2668
\index{bias!observer}
2669
2670
Use the NSFG respondent variable \verb"NUMKDHH" to construct the actual
2671
distribution for the number of children under 18 in the household.
2672
2673
Now compute the biased distribution we would see if we surveyed the
2674
children and asked them how many children under 18 (including themselves)
2675
are in their household.
2676
2677
Plot the actual and biased distributions, and compute their means.
2678
As a starting place, you can use \verb"chap03ex.ipynb".
2679
\end{exercise}
2680
2681
2682
\begin{exercise}
2683
\index{mean}
2684
\index{variance}
2685
\index{PMF}
2686
2687
In Section~\ref{mean} we computed the mean of a sample by adding up
2688
the elements and dividing by n. If you are given a PMF, you can
2689
still compute the mean, but the process is slightly different:
2690
%
2691
\[ \xbar = \sum_i p_i~x_i \]
2692
%
2693
where the $x_i$ are the unique values in the PMF and $p_i=\PMF(x_i)$.
2694
Similarly, you can compute variance like this:
2695
%
2696
\[ S^2 = \sum_i p_i~(x_i - \xbar)^2\]
2697
%
2698
Write functions called {\tt PmfMean} and {\tt PmfVar} that take a
2699
Pmf object and compute the mean and variance. To test these methods,
2700
check that they are consistent with the methods {\tt Mean} and {\tt
2701
Var} provided by Pmf.
2702
\index{Pmf}
2703
2704
\end{exercise}
2705
2706
2707
\begin{exercise}
2708
I started with the question, ``Are first babies more likely
2709
to be late?'' To address it, I computed the difference in
2710
means between groups of babies, but I ignored the possibility
2711
that there might be a difference between first babies and
2712
others {\em for the same woman}.
2713
2714
To address this version of the question, select respondents who
2715
have at least two babies and compute pairwise differences. Does
2716
this formulation of the question yield a different result?
2717
2718
Hint: use {\tt nsfg.MakePregMap}.
2719
\end{exercise}
2720
2721
2722
\begin{exercise}
2723
\label{relay}
2724
2725
In most foot races, everyone starts at the same time. If you are a
2726
fast runner, you usually pass a lot of people at the beginning of the
2727
race, but after a few miles everyone around you is going at the same
2728
speed.
2729
\index{relay race}
2730
2731
When I ran a long-distance (209 miles) relay race for the first
2732
time, I noticed an odd phenomenon: when I overtook another runner, I
2733
was usually much faster, and when another runner overtook me, he was
2734
usually much faster.
2735
2736
At first I thought that the distribution of speeds might be bimodal;
2737
that is, there were many slow runners and many fast runners, but few
2738
at my speed.
2739
2740
Then I realized that I was the victim of a bias similar to the
2741
effect of class size. The race
2742
was unusual in two ways: it used a staggered start, so teams started
2743
at different times; also, many teams included runners at different
2744
levels of ability. \index{bias!selection} \index{selection bias}
2745
2746
As a result, runners were spread out along the course with little
2747
relationship between speed and location. When I joined the race, the
2748
runners near me were (pretty much) a random sample of the runners in
2749
the race.
2750
2751
So where does the bias come from? During my time on the course, the
2752
chance of overtaking a runner, or being overtaken, is proportional to
2753
the difference in our speeds. I am more likely to catch a slow
2754
runner, and more likely to be caught by a fast runner. But runners
2755
at the same speed are unlikely to see each other.
2756
2757
Write a function called {\tt ObservedPmf} that takes a Pmf representing
2758
the actual distribution of runners' speeds, and the speed of a running
2759
observer, and returns a new Pmf representing the distribution of
2760
runners' speeds as seen by the observer.
2761
\index{observer bias}
2762
\index{bias!observer}
2763
2764
To test your function, you can use {\tt relay.py}, which reads the
2765
results from the James Joyce Ramble 10K in Dedham MA and converts the
2766
pace of each runner to mph.
2767
2768
Compute the distribution of speeds you would observe if you ran a
2769
relay race at 7.5 mph with this group of runners. A solution to this
2770
exercise is in \verb"relay_soln.py".
2771
\end{exercise}
2772
2773
2774
\section{Glossary}
2775
2776
\begin{itemize}
2777
2778
\item Probability mass function (PMF): a representation of a distribution
2779
as a function that maps from values to probabilities.
2780
\index{PMF}
2781
\index{probability mass function}
2782
2783
\item probability: A frequency expressed as a fraction of the sample
2784
size.
2785
\index{frequency}
2786
\index{probability}
2787
2788
\item normalization: The process of dividing a frequency by a sample
2789
size to get a probability.
2790
\index{normalization}
2791
2792
\item index: In a pandas DataFrame, the index is a special column
2793
that contains the row labels.
2794
\index{pandas}
2795
\index{DataFrame}
2796
2797
\end{itemize}
2798
2799
2800
\chapter{Cumulative distribution functions}
2801
\label{cumulative}
2802
2803
The code for this chapter is in {\tt cumulative.py}.
2804
For information about downloading and
2805
working with this code, see Section~\ref{code}.
2806
2807
2808
\section{The limits of PMFs}
2809
\index{PMF}
2810
2811
PMFs work well if the number of values is small. But as the number of
2812
values increases, the probability associated with each value gets
2813
smaller and the effect of random noise increases.
2814
2815
For example, we might be interested in the distribution of birth
2816
weights. In the NSFG data, the variable \verb"totalwgt_lb" records
2817
weight at birth in pounds. Figure~\ref{nsfg_birthwgt_pmf} shows
2818
the PMF of these values for first babies and others.
2819
\index{National Survey of Family Growth} \index{NSFG} \index{birth weight}
2820
\index{weight!birth}
2821
2822
\begin{figure}
2823
% cumulative.py
2824
\centerline{\includegraphics[height=2.5in]{figs/nsfg_birthwgt_pmf.pdf}}
2825
\caption{PMF of birth weights. This figure shows a limitation
2826
of PMFs: they are hard to compare visually.}
2827
\label{nsfg_birthwgt_pmf}
2828
\end{figure}
2829
2830
Overall, these distributions resemble the bell shape of a normal
2831
distribution, with many values near the mean and a few values much
2832
higher and lower.
2833
2834
But parts of this figure are hard to interpret. There are many spikes
2835
and valleys, and some apparent differences between the distributions.
2836
It is hard to tell which of these features are meaningful. Also, it
2837
is hard to see overall patterns; for example, which distribution do
2838
you think has the higher mean?
2839
\index{binning}
2840
2841
These problems can be mitigated by binning the data; that is, dividing
2842
the range of values into non-overlapping intervals and counting the
2843
number of values in each bin. Binning can be useful, but it is tricky
2844
to get the size of the bins right. If they are big enough to smooth
2845
out noise, they might also smooth out useful information.
2846
2847
An alternative that avoids these problems is the cumulative
2848
distribution function (CDF), which is the subject of this chapter.
2849
But before I can explain CDFs, I have to explain percentiles.
2850
\index{CDF}
2851
2852
2853
\section{Percentiles}
2854
\index{percentile rank}
2855
2856
If you have taken a standardized test, you probably got your
2857
results in the form of a raw score and a {\bf percentile rank}.
2858
In this context, the percentile rank is the fraction of people who
2859
scored lower than you (or the same). So if you are ``in the 90th
2860
percentile,'' you did as well as or better than 90\% of the people who
2861
took the exam.
2862
2863
Here's how you could compute the percentile rank of a value,
2864
\verb"your_score", relative to the values in the sequence {\tt
2865
scores}:
2866
%
2867
\begin{verbatim}
2868
def PercentileRank(scores, your_score):
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1

    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank
2876
\end{verbatim}
2877
2878
As an example, if the
2879
scores in the sequence were 55, 66, 77, 88 and 99, and you got the 88,
2880
then your percentile rank would be {\tt 100 * 4 / 5} which is 80.
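
Here is that example as a quick check of the function above:

\begin{verbatim}
>>> scores = [55, 66, 77, 88, 99]
>>> PercentileRank(scores, 88)
80.0
\end{verbatim}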
2881
2882
If you are given a value, it is easy to find its percentile rank; going
2883
the other way is slightly harder. If you are given a percentile rank
2884
and you want to find the corresponding value, one option is to
2885
sort the values and search for the one you want:
2886
%
2887
\begin{verbatim}
2888
def Percentile(scores, percentile_rank):
    scores.sort()
    for score in scores:
        if PercentileRank(scores, score) >= percentile_rank:
            return score
2893
\end{verbatim}
2894
2895
The result of this calculation is a {\bf percentile}. For example,
2896
the 50th percentile is the value with percentile rank 50. In the
2897
distribution of exam scores, the 50th percentile is 77.
2898
\index{percentile}
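
Continuing the example, a quick check of {\tt Percentile}:

\begin{verbatim}
>>> Percentile(scores, 50)
77
\end{verbatim}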
2899
2900
This implementation of {\tt Percentile} is not efficient. A
2901
better approach is to use the percentile rank to compute the index of
2902
the corresponding percentile:
2903
2904
\begin{verbatim}
2905
def Percentile2(scores, percentile_rank):
    scores.sort()
    index = percentile_rank * (len(scores)-1) // 100
    return scores[index]
2909
\end{verbatim}
2910
2911
The difference between ``percentile'' and ``percentile rank'' can
2912
be confusing, and people do not always use the terms precisely.
2913
To summarize, {\tt PercentileRank} takes a value and computes
2914
its percentile rank in a set of values; {\tt Percentile} takes
2915
a percentile rank and computes the corresponding value.
2916
\index{percentile rank}
2917
2918
2919
\section{CDFs}
2920
\index{CDF}
2921
2922
Now that we understand percentiles and percentile ranks,
2923
we are ready to tackle the {\bf cumulative distribution function}
2924
(CDF). The CDF is the function that maps from a value to its
2925
percentile rank.
2926
\index{cumulative distribution function}
2927
\index{percentile rank}
2928
2929
The CDF is a function of $x$, where $x$ is any value that might appear
2930
in the distribution. To evaluate $\CDF(x)$ for a particular value of
2931
$x$, we compute the fraction of values in the distribution less
2932
than or equal to $x$.
2933
2934
Here's what that looks like as a function that takes a sequence,
2935
{\tt sample}, and a value, {\tt x}:
2936
%
2937
\begin{verbatim}
2938
def EvalCdf(sample, x):
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1

    prob = count / len(sample)
    return prob
2946
\end{verbatim}
2947
2948
This function is almost identical to {\tt PercentileRank}, except that
2949
the result is a probability in the range 0--1 rather than a
2950
percentile rank in the range 0--100.
2951
\index{sample}
2952
2953
As an example, suppose we collect a sample with the values
2954
{\tt [1, 2, 2, 3, 5]}. Here are some values from its CDF:
2955
%
2956
\[ CDF(0) = 0 \]
2957
%
2958
\[ CDF(1) = 0.2\]
2959
%
2960
\[ CDF(2) = 0.6\]
2961
%
2962
\[ CDF(3) = 0.8\]
2963
%
2964
\[ CDF(4) = 0.8\]
2965
%
2966
\[ CDF(5) = 1\]
2967
%
2968
We can evaluate the CDF for any value of $x$, not just
2969
values that appear in the sample.
2970
If $x$ is less than the smallest value in the sample, $\CDF(x)$ is 0.
2971
If $x$ is greater than the largest value, $\CDF(x)$ is 1.
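
Here are a couple of these values computed with {\tt EvalCdf}, as a
quick check:

\begin{verbatim}
>>> sample = [1, 2, 2, 3, 5]
>>> EvalCdf(sample, 2)
0.6
>>> EvalCdf(sample, 4)
0.8
\end{verbatim}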
2972
2973
\begin{figure}
2974
% cumulative.py
2975
\centerline{\includegraphics[height=2.5in]{figs/cumulative_example_cdf.pdf}}
2976
\caption{Example of a CDF.}
2977
\label{example_cdf}
2978
\end{figure}
2979
2980
Figure~\ref{example_cdf} is a graphical representation of this CDF.
2981
The CDF of a sample is a step function.
2982
\index{step function}
2983
2984
2985
\section{Representing CDFs}
2986
\index{Cdf}
2987
2988
{\tt thinkstats2} provides a class named Cdf that represents
2989
CDFs. The fundamental methods Cdf provides are:
2990
2991
\begin{itemize}
2992
2993
\item {\tt Prob(x)}: Given a value {\tt x}, computes the probability
2994
$p = \CDF(x)$. The bracket operator is equivalent to {\tt Prob}.
2995
\index{bracket operator}
2996
2997
\item {\tt Value(p)}: Given a probability {\tt p}, computes the
2998
corresponding value, {\tt x}; that is, the {\bf inverse CDF} of {\tt p}.
2999
\index{inverse CDF}
3000
\index{CDF, inverse}
3001
3002
\end{itemize}
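
For example, here is how these two methods behave for the small
sample from the previous section (this assumes, as described below,
that the Cdf constructor accepts a list of values):

\begin{verbatim}
>>> cdf = thinkstats2.Cdf([1, 2, 2, 3, 5])
>>> cdf.Prob(2)
0.6
>>> cdf.Value(0.6)
2
\end{verbatim}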
3003
3004
\begin{figure}
3005
% cumulative.py
3006
\centerline{\includegraphics[height=2.5in]{figs/cumulative_prglngth_cdf.pdf}}
3007
\caption{CDF of pregnancy length.}
3008
\label{cumulative_prglngth_cdf}
3009
\end{figure}
3010
3011
The Cdf constructor can take as an argument a list of values,
3012
a pandas Series, a Hist, Pmf, or another Cdf. The following
3013
code makes a Cdf for the distribution of pregnancy lengths in
3014
the NSFG:
3015
\index{NSFG}
3016
\index{pregnancy length}
3017
3018
\begin{verbatim}
3019
live, firsts, others = first.MakeFrames()
3020
cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')
3021
\end{verbatim}
3022
3023
{\tt thinkplot} provides a function named {\tt Cdf} that
3024
plots Cdfs as lines:
3025
\index{thinkplot}
3026
3027
\begin{verbatim}
3028
thinkplot.Cdf(cdf)
3029
thinkplot.Show(xlabel='weeks', ylabel='CDF')
3030
\end{verbatim}
3031
3032
Figure~\ref{cumulative_prglngth_cdf} shows the result. One way to
3033
read a CDF is to look up percentiles. For example, it looks like
3034
about 10\% of pregnancies are shorter than 36 weeks, and about 90\%
3035
are shorter than 41 weeks. The CDF also provides a visual
3036
representation of the shape of the distribution. Common values appear
3037
as steep or vertical sections of the CDF; in this example, the mode at
3038
39 weeks is apparent. There are few values below 30 weeks, so
3039
the CDF in this range is flat.
3040
\index{CDF, interpreting}
3041
3042
It takes some time to get used to CDFs, but once you
3043
do, I think you will find that they show more information, more
3044
clearly, than PMFs.
3045
3046
3047
\section{Comparing CDFs}
3048
\label{birth_weights}
3049
\index{National Survey of Family Growth}
3050
\index{NSFG}
3051
\index{birth weight}
3052
\index{weight!birth}
3053
3054
CDFs are especially useful for comparing distributions. For
3055
example, here is the code that plots the CDF of birth
3056
weight for first babies and others.
3057
\index{thinkplot}
3058
\index{distributions, comparing}
3059
3060
\begin{verbatim}
3061
first_cdf = thinkstats2.Cdf(firsts.totalwgt_lb, label='first')
3062
other_cdf = thinkstats2.Cdf(others.totalwgt_lb, label='other')
3063
3064
thinkplot.PrePlot(2)
3065
thinkplot.Cdfs([first_cdf, other_cdf])
3066
thinkplot.Show(xlabel='weight (pounds)', ylabel='CDF')
3067
\end{verbatim}
3068
3069
\begin{figure}
3070
% cumulative.py
3071
\centerline{\includegraphics[height=2.5in]{figs/cumulative_birthwgt_cdf.pdf}}
3072
\caption{CDF of birth weights for first babies and others.}
3073
\label{cumulative_birthwgt_cdf}
3074
\end{figure}
3075
3076
Figure~\ref{cumulative_birthwgt_cdf} shows the result.
3077
Compared to Figure~\ref{nsfg_birthwgt_pmf},
3078
this figure makes the shape of the distributions, and the differences
3079
between them, much clearer. We can see that first babies are slightly
3080
lighter throughout the distribution, with a larger discrepancy above
3081
the mean.
3082
\index{shape}
3083
3084
3085
3086
3087
\section{Percentile-based statistics}
3088
\index{summary statistic}
3089
\index{interquartile range}
3090
\index{quartile}
3091
\index{percentile}
3092
\index{median}
3093
\index{central tendency}
3094
\index{spread}
3095
3096
Once you have computed a CDF, it is easy to compute percentiles
3097
and percentile ranks. The Cdf class provides these two methods:
3098
\index{Cdf}
3099
\index{percentile rank}
3100
3101
\begin{itemize}
3102
3103
\item {\tt PercentileRank(x)}: Given a value {\tt x}, computes its
3104
percentile rank, $100 \cdot \CDF(x)$.
3105
3106
\item {\tt Percentile(p)}: Given a percentile rank {\tt p},
3107
computes the corresponding value, {\tt x}. Equivalent to {\tt
3108
Value(p/100)}.
3109
3110
\end{itemize}
3111
3112
{\tt Percentile} can be used to compute percentile-based summary
3113
statistics. For example, the 50th percentile is the value that
3114
divides the distribution in half, also known as the {\bf median}.
3115
Like the mean, the median is a measure of the central tendency
3116
of a distribution.
3117
3118
Actually, there are several definitions of ``median,'' each with
3119
different properties. But {\tt Percentile(50)} is simple and
3120
efficient to compute.
3121
3122
Another percentile-based statistic is the {\bf interquartile range} (IQR),
3123
which is a measure of the spread of a distribution. The IQR
3124
is the difference between the 75th and 25th percentiles.
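
For example, using the Cdf of pregnancy lengths computed earlier,
here is a short sketch:

\begin{verbatim}
median = cdf.Percentile(50)
iqr = cdf.Percentile(75) - cdf.Percentile(25)
\end{verbatim}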
3125
3126
More generally, percentiles are often used to summarize the shape
3127
of a distribution. For example, the distribution of income is
3128
often reported in ``quintiles''; that is, it is split at the
3129
20th, 40th, 60th and 80th percentiles. Other distributions
3130
are divided into ten ``deciles''. Statistics like these that represent
3131
equally-spaced points in a CDF are called {\bf quantiles}.
3132
For more, see \url{https://en.wikipedia.org/wiki/Quantile}.
3133
\index{quantile}
3134
\index{quintile}
3135
\index{decile}
3136
3137
3138
3139
\section{Random numbers}
3140
\label{random}
3141
\index{random number}
3142
3143
Suppose we choose a random sample from the population of live
3144
births and look up the percentile rank of their birth weights.
3145
Now suppose we compute the CDF of the percentile ranks. What do
3146
you think the distribution will look like?
3147
\index{percentile rank}
3148
\index{birth weight}
3149
\index{weight!birth}
3150
3151
Here's how we can compute it. First, we make the Cdf of
3152
birth weights:
3153
\index{Cdf}
3154
3155
\begin{verbatim}
3156
weights = live.totalwgt_lb
3157
cdf = thinkstats2.Cdf(weights, label='totalwgt_lb')
3158
\end{verbatim}
3159
3160
Then we generate a sample and compute the percentile rank of
3161
each value in the sample.
3162
3163
\begin{verbatim}
3164
sample = np.random.choice(weights, 100, replace=True)
3165
ranks = [cdf.PercentileRank(x) for x in sample]
3166
\end{verbatim}
3167
3168
{\tt sample}
3169
is a random sample of 100 birth weights, chosen with {\bf replacement};
3170
that is, the same value could be chosen more than once. {\tt ranks}
3171
is a list of percentile ranks.
3172
\index{replacement}
3173
3174
Finally we make and plot the Cdf of the percentile ranks.
3175
\index{thinkplot}
3176
3177
\begin{verbatim}
3178
rank_cdf = thinkstats2.Cdf(ranks)
3179
thinkplot.Cdf(rank_cdf)
3180
thinkplot.Show(xlabel='percentile rank', ylabel='CDF')
3181
\end{verbatim}
3182
3183
\begin{figure}
3184
% cumulative.py
3185
\centerline{\includegraphics[height=2.5in]{figs/cumulative_random.pdf}}
3186
\caption{CDF of percentile ranks for a random sample of birth weights.}
3187
\label{cumulative_random}
3188
\end{figure}
3189
3190
Figure~\ref{cumulative_random} shows the result. The CDF is
3191
approximately a straight line, which means that the distribution
3192
is uniform.
3193
3194
That outcome might be non-obvious, but it is a consequence of
3195
the way the CDF is defined. What this figure shows is that 10\%
3196
of the sample is below the 10th percentile, 20\% is below the
3197
20th percentile, and so on, exactly as we should expect.
3198
3199
So, regardless of the shape of the CDF, the distribution of
3200
percentile ranks is uniform. This property is useful, because it
3201
is the basis of a simple and efficient algorithm for generating
3202
random numbers with a given CDF. Here's how:
3203
\index{inverse CDF algorithm}
3204
\index{random number}
3205
3206
\begin{itemize}
3207
3208
\item Choose a percentile rank uniformly from the range 0--100.
3209
3210
\item Use {\tt Cdf.Percentile} to find the value in the distribution
3211
that corresponds to the percentile rank you chose.
3212
\index{Cdf}
3213
3214
\end{itemize}
3215
3216
Cdf provides an implementation of this algorithm, called
3217
{\tt Random}:
3218
3219
\begin{verbatim}
3220
# class Cdf:
    def Random(self):
        return self.Percentile(random.uniform(0, 100))
3223
\end{verbatim}
3224
3225
Cdf also provides {\tt Sample}, which takes an integer,
3226
{\tt n}, and returns a list of {\tt n} values chosen at random
3227
from the Cdf.
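
For example, here is one way you might check that values drawn with
{\tt Sample} follow the original distribution (a sketch; the label
is arbitrary):

\begin{verbatim}
resample = cdf.Sample(1000)
thinkplot.Cdfs([cdf, thinkstats2.Cdf(resample, label='resample')])
thinkplot.Show(xlabel='weight (pounds)', ylabel='CDF')
\end{verbatim}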
3228
3229
3230
\section{Comparing percentile ranks}
3231
3232
Percentile ranks are useful for comparing measurements across
3233
different groups. For example, people who compete in foot races are
3234
usually grouped by age and gender. To compare people in different
3235
age groups, you can convert race times to percentile ranks.
3236
\index{percentile rank}
3237
3238
A few years ago I ran the James Joyce Ramble 10K in
3239
Dedham MA; I finished in 42:44, which was 97th in a field of 1633. I beat or
3240
tied 1537 runners out of 1633, so my percentile rank in the field is
3241
94\%. \index{James Joyce Ramble} \index{race time}
3242
3243
More generally, given position and field size, we can compute
3244
percentile rank:
3245
\index{field size}
3246
3247
\begin{verbatim}
3248
def PositionToPercentile(position, field_size):
    beat = field_size - position + 1
    percentile = 100.0 * beat / field_size
    return percentile
3252
\end{verbatim}
3253
3254
In my age group, denoted M4049 for ``male between 40 and 49 years of
3255
age'', I came in 26th out of 256. So my percentile rank in my age
3256
group was 90\%.
3257
\index{age group}
3258
3259
If I am still running in 10 years (and I hope I am), I will be in
3260
the M5059 division. Assuming that my percentile rank in my division
3261
is the same, how much slower should I expect to be?
3262
3263
I can answer that question by converting my percentile rank in M4049
3264
to a position in M5059. Here's the code:
3265
3266
\begin{verbatim}
3267
def PercentileToPosition(percentile, field_size):
3268
beat = percentile * field_size / 100.0
3269
position = field_size - beat + 1
3270
return position
3271
\end{verbatim}
3272
3273
There were 171 people in M5059, so I would have to come in between
3274
17th and 18th place to have the same percentile rank. The finishing
3275
time of the 17th runner in M5059 was 46:05, so that's the time I will
3276
have to beat to maintain my percentile rank.
3277
3278
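As a quick check, here is how you might chain these two functions to
convert a position in one field to the equivalent position in another
(a sketch; the exact values depend on floating-point rounding):

\begin{verbatim}
# my percentile rank in M4049 (about 90.2)
percentile = PositionToPercentile(26, 256)

# the equivalent position in a field of 171 (about 17.7)
position = PercentileToPosition(percentile, 171)
\end{verbatim}
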
\section{Exercises}

For the following exercises, you can start with \verb"chap04ex.ipynb".
My solution is in \verb"chap04soln.ipynb".

\begin{exercise}
How much did you weigh at birth? If you don't know, call your mother
or someone else who knows. Using the NSFG data (all live births),
compute the distribution of birth weights and use it to find your
percentile rank. If you were a first baby, find your percentile rank
in the distribution for first babies. Otherwise use the distribution
for others. If you are in the 90th percentile or higher, call your
mother back and apologize.
\index{birth weight}
\index{weight!birth}

\end{exercise}

\begin{exercise}
The numbers generated by {\tt random.random} are supposed to be
uniform between 0 and 1; that is, every value in the range
should have the same probability.

Generate 1000 numbers from {\tt random.random} and plot their
PMF and CDF. Is the distribution uniform?
\index{uniform distribution}
\index{distribution!uniform}
\index{random number}

\end{exercise}

\section{Glossary}

\begin{itemize}

\item percentile rank: The percentage of values in a distribution that are
less than or equal to a given value.
\index{percentile rank}

\item percentile: The value associated with a given percentile rank.
\index{percentile}

\item cumulative distribution function (CDF): A function that maps
from values to their cumulative probabilities. $\CDF(x)$ is the
fraction of the sample less than or equal to $x$. \index{CDF}
\index{cumulative probability}

\item inverse CDF: A function that maps from a cumulative probability,
$p$, to the corresponding value.
\index{inverse CDF}
\index{CDF, inverse}

\item median: The 50th percentile, often used as a measure of central
tendency. \index{median}

\item interquartile range: The difference between
the 75th and 25th percentiles, used as a measure of spread.
\index{interquartile range}

\item quantile: A sequence of values that correspond to equally spaced
percentile ranks; for example, the quartiles of a distribution are
the 25th, 50th and 75th percentiles.
\index{quantile}

\item replacement: A property of a sampling process. ``With replacement''
means that the same value can be chosen more than once; ``without
replacement'' means that once a value is chosen, it is removed from
the population.
\index{replacement}

\end{itemize}

\chapter{Modeling distributions}
\label{modeling}

The distributions we have used so far are called {\bf empirical
distributions} because they are based on empirical observations,
which are necessarily finite samples.
\index{analytic distribution}
\index{distribution!analytic}
\index{empirical distribution}
\index{distribution!empirical}

The alternative is an {\bf analytic distribution}, which is
characterized by a CDF that is a mathematical function.
Analytic distributions can be used to model empirical distributions.
In this context, a {\bf model} is a simplification that leaves out
unneeded details. This chapter presents common analytic distributions
and uses them to model data from a variety of sources.
\index{model}

The code for this chapter is in {\tt analytic.py}. For information
about downloading and working with this code, see Section~\ref{code}.

\section{The exponential distribution}
\label{exponential}
\index{exponential distribution}
\index{distribution!exponential}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_expo_cdf.pdf}}
\caption{CDFs of exponential distributions with various parameters.}
\label{analytic_expo_cdf}
\end{figure}

I'll start with the {\bf exponential distribution} because it is
relatively simple. The CDF of the exponential distribution is
%
\[ \CDF(x) = 1 - e^{-\lambda x} \]
%
The parameter, $\lambda$, determines the shape of the distribution.
Figure~\ref{analytic_expo_cdf} shows what this CDF looks like with
$\lambda = $ 0.5, 1, and 2.
\index{parameter}

In the real world, exponential distributions
come up when we look at a series of events and measure the
times between events, called {\bf interarrival times}.
If the events are equally likely to occur at any time, the distribution
of interarrival times tends to look like an exponential distribution.
\index{interarrival time}

As an example, we will look at the interarrival time of births.
On December 18, 1997, 44 babies were born in a hospital in Brisbane,
Australia.\footnote{This example is based on information and data from
Dunn, ``A Simple Dataset for Demonstrating Common Distributions,''
Journal of Statistics Education v.7, n.3 (1999).} The time of
birth for all 44 babies was reported in the local paper; the
complete dataset is in a file called {\tt babyboom.dat}, in the
{\tt ThinkStats2} repository.
\index{birth time}
\index{Australia} \index{Brisbane}

\begin{verbatim}
df = ReadBabyBoom()
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='minutes', ylabel='CDF')
\end{verbatim}

{\tt ReadBabyBoom} reads the data file and returns a DataFrame
with columns {\tt time}, {\tt sex}, \verb"weight_g", and {\tt minutes},
where {\tt minutes} is time of birth converted to minutes since
midnight.
\index{DataFrame}
\index{thinkplot}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals.pdf}}
\caption{CDF of interarrival times (left) and CCDF on a log-y scale (right).}
\label{analytic_interarrival_cdf}
\end{figure}

%\begin{figure}
% analytic.py
%\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals_logy.pdf}}
%\caption{CCDF of interarrival times.}
%\label{analytic_interarrival_ccdf}
%\end{figure}

{\tt diffs} is the difference between consecutive birth times, and
{\tt cdf} is the distribution of these interarrival times.
Figure~\ref{analytic_interarrival_cdf} (left) shows the CDF. It seems
to have the general shape of an exponential distribution, but how can
we tell?

One way is to plot the {\bf complementary CDF}, which is $1 - \CDF(x)$,
on a log-y scale. For data from an exponential distribution, the
result is a straight line. Let's see why that works.
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}

If you plot the complementary CDF (CCDF) of a dataset that you think is
exponential, you expect to see a function like:
%
\[ y \approx e^{-\lambda x} \]
%
Taking the log of both sides yields:
%
\[ \log y \approx -\lambda x\]
%
So on a log-y scale the CCDF is a straight line
with slope $-\lambda$. Here's how we can generate a plot like that:
\index{logarithmic scale}
\index{complementary CDF}
\index{CDF!complementary}
\index{CCDF}

\begin{verbatim}
thinkplot.Cdf(cdf, complement=True)
thinkplot.Show(xlabel='minutes',
               ylabel='CCDF',
               yscale='log')
\end{verbatim}

With the argument {\tt complement=True}, {\tt thinkplot.Cdf} computes
the complementary CDF before plotting. And with {\tt yscale='log'},
{\tt thinkplot.Show} sets the {\tt y} axis to a logarithmic scale.
\index{thinkplot}
\index{Cdf}

Figure~\ref{analytic_interarrival_cdf} (right) shows the result. It is not
exactly straight, which indicates that the exponential distribution is
not a perfect model for this data. Most likely the underlying
assumption---that a birth is equally likely at any time of day---is
not exactly true. Nevertheless, it might be reasonable to model this
dataset with an exponential distribution. With that simplification, we can
summarize the distribution with a single parameter.
\index{model}

The parameter, $\lambda$, can be interpreted as a rate; that is, the
number of events that occur, on average, in a unit of time. In this
example, 44 babies are born in 24 hours, so the rate is $\lambda =
0.0306$ births per minute. The mean of an exponential distribution is
$1/\lambda$, so the mean time between births is 32.7 minutes.

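As a quick sanity check on these numbers (a sketch; it assumes nothing
beyond the arithmetic in the previous paragraph):

\begin{verbatim}
lam = 44 / (24 * 60.0)    # about 0.0306 births per minute
mean = 1 / lam            # about 32.7 minutes between births
\end{verbatim}
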
\section{The normal distribution}
\label{normal}

The {\bf normal distribution}, also called Gaussian, is commonly
used because it describes many phenomena, at least approximately.
It turns out that there is a good reason for its ubiquity, which we
will get to in Section~\ref{CLT}.
\index{CDF}
\index{parameter}
\index{mean}
\index{standard deviation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

%
%\[ \CDF(z) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^z e^{-t^2/2} dt \]
%

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_gaussian_cdf.pdf}}
\caption{CDF of normal distributions with a range of parameters.}
\label{analytic_gaussian_cdf}
\end{figure}

The normal distribution is characterized by two parameters: the mean,
$\mu$, and standard deviation $\sigma$. The normal distribution with
$\mu=0$ and $\sigma=1$ is called the {\bf standard normal
distribution}. Its CDF is defined by an integral that does not have
a closed form solution, but there are algorithms that evaluate it
efficiently. One of them is provided by SciPy: {\tt scipy.stats.norm}
is an object that represents a normal distribution; it provides a
method, {\tt cdf}, that evaluates the standard normal CDF:
\index{SciPy}
\index{closed form}

\begin{verbatim}
>>> import scipy.stats
>>> scipy.stats.norm.cdf(0)
0.5
\end{verbatim}

This result is correct: the median of the standard normal distribution
is 0 (the same as the mean), and half of the values fall below the
median, so $\CDF(0)$ is 0.5.

{\tt norm.cdf} takes optional parameters: {\tt loc}, which
specifies the mean, and {\tt scale}, which specifies the
standard deviation.

{\tt thinkstats2} makes this function a little easier to use
by providing {\tt EvalNormalCdf}, which takes parameters {\tt mu}
and {\tt sigma} and evaluates the CDF at {\tt x}:
\index{normal distribution}

\begin{verbatim}
def EvalNormalCdf(x, mu=0, sigma=1):
    return scipy.stats.norm.cdf(x, loc=mu, scale=sigma)
\end{verbatim}

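For example, here is one way you could use {\tt EvalNormalCdf} to
compute the probability that a value from a standard normal
distribution falls within one standard deviation of the mean
(a sketch; the answer is about 68\%):

\begin{verbatim}
p = EvalNormalCdf(1) - EvalNormalCdf(-1)    # about 0.68
\end{verbatim}
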
Figure~\ref{analytic_gaussian_cdf} shows CDFs for normal
distributions with a range of parameters. The sigmoid shape of these
curves is a recognizable characteristic of a normal distribution.

In the previous chapter we looked at the distribution of birth
weights in the NSFG. Figure~\ref{analytic_birthwgt_model} shows the
empirical CDF of weights for all live births and the CDF of
a normal distribution with the same mean and variance.
\index{National Survey of Family Growth}
\index{NSFG}
\index{birth weight}
\index{weight!birth}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_model.pdf}}
\caption{CDF of birth weights with a normal model.}
\label{analytic_birthwgt_model}
\end{figure}

The normal distribution is a good model for this dataset, so
if we summarize the distribution with the parameters
$\mu = 7.28$ and $\sigma = 1.24$, the resulting error
(difference between the model and the data) is small.
\index{model}
\index{percentile}

Below the 10th percentile there is a discrepancy between the data
and the model; there are more light babies than we would expect in
a normal distribution. If we are specifically interested in preterm
babies, it would be important to get this part of the distribution
right, so it might not be appropriate to use the normal
model.

\section{Normal probability plot}

For the exponential distribution, and a few others, there are
simple transformations we can use to test whether an analytic
distribution is a good model for a dataset.
\index{exponential distribution}
\index{distribution!exponential}
\index{model}

For the normal distribution there is no such transformation, but there
is an alternative called a {\bf normal probability plot}. There
are two ways to generate a normal probability plot: the hard way
and the easy way. If you are interested in the hard way, you can
read about it at \url{https://en.wikipedia.org/wiki/Normal_probability_plot}.
Here's the easy way:
\index{normal probability plot}
\index{plot!normal probability}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

\begin{enumerate}

\item Sort the values in the sample.

\item From a standard normal distribution ($\mu=0$ and $\sigma=1$),
generate a random sample with the same size as the sample, and sort it.
\index{random number}

\item Plot the sorted values from the sample versus the random values.

\end{enumerate}

If the distribution of the sample is approximately normal, the result
is a straight line with intercept {\tt mu} and slope {\tt sigma}.
{\tt thinkstats2} provides {\tt NormalProbability}, which takes a
sample and returns two NumPy arrays:
\index{NumPy}

\begin{verbatim}
xs, ys = thinkstats2.NormalProbability(sample)
\end{verbatim}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_normal_prob_example.pdf}}
\caption{Normal probability plot for random samples from normal distributions.}
\label{analytic_normal_prob_example}
\end{figure}

{\tt ys} contains the sorted values from {\tt sample}; {\tt xs}
contains the random values from the standard normal distribution.

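To make the procedure concrete, here is a minimal sketch of how a
function like {\tt NormalProbability} could be implemented; the actual
{\tt thinkstats2} version may differ in details:

\begin{verbatim}
import numpy as np

def NormalProbabilitySketch(sample):
    # sorted random values from a standard normal distribution
    xs = np.sort(np.random.normal(0, 1, len(sample)))
    # sorted values from the sample
    ys = np.sort(sample)
    return xs, ys
\end{verbatim}
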
To test {\tt NormalProbability} I generated some fake samples that
were actually drawn from normal distributions with various parameters.
Figure~\ref{analytic_normal_prob_example} shows the results.
The lines are approximately straight, with values in the tails
deviating more than values near the mean.

Now let's try it with real data. Here's code to generate
a normal probability plot for the birth weight data from the
previous section. It plots a gray line that represents the model
and a blue line that represents the data.
\index{birth weight}
\index{weight!birth}

\begin{verbatim}
def MakeNormalPlot(weights):
    mean = weights.mean()
    std = weights.std()

    xs = [-4, 4]
    fxs, fys = thinkstats2.FitLine(xs, inter=mean, slope=std)
    thinkplot.Plot(fxs, fys, color='gray', label='model')

    xs, ys = thinkstats2.NormalProbability(weights)
    thinkplot.Plot(xs, ys, label='birth weights')
\end{verbatim}

{\tt weights} is a pandas Series of birth weights;
{\tt mean} and {\tt std} are the mean and standard deviation.
\index{pandas}
\index{Series}
\index{thinkplot}
\index{standard deviation}

{\tt FitLine} takes a sequence of {\tt xs}, an intercept, and a
slope; it returns {\tt xs} and {\tt ys} that represent a line
with the given parameters, evaluated at the values in {\tt xs}.

{\tt NormalProbability} returns {\tt xs} and {\tt ys} that
contain values from the standard normal distribution and values
from {\tt weights}. If the distribution of weights is normal,
the data should match the model.
\index{model}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_normal.pdf}}
\caption{Normal probability plot of birth weights.}
\label{analytic_birthwgt_normal}
\end{figure}

Figure~\ref{analytic_birthwgt_normal} shows the results for
all live births, and also for full term births (pregnancy length greater
than 36 weeks). Both curves match the model near the mean and
deviate in the tails. The heaviest babies are heavier than what
the model expects, and the lightest babies are lighter.
\index{pregnancy length}

When we select only full term births, we remove some of the lightest
weights, which reduces the discrepancy in the lower tail of the
distribution.

This plot suggests that the normal model describes the distribution
well within a few standard deviations from the mean, but not in the
tails. Whether it is good enough for practical purposes depends
on the purposes.
\index{model}
\index{birth weight}
\index{weight!birth}
\index{standard deviation}

\section{The lognormal distribution}
\label{brfss}
\label{lognormal}

If the logarithms of a set of values have a normal distribution, the
values have a {\bf lognormal distribution}. The CDF of the lognormal
distribution is the same as the CDF of the normal distribution,
with $\log x$ substituted for $x$.
%
\[ CDF_{lognormal}(x) = CDF_{normal}(\log x)\]
%
The parameters of the lognormal distribution are usually denoted
$\mu$ and $\sigma$. But remember that these parameters are {\em not\/}
the mean and standard deviation; the mean of a lognormal distribution
is $\exp(\mu +\sigma^2/2)$ and the standard deviation is
ugly (see \url{http://wikipedia.org/wiki/Log-normal_distribution}).
\index{parameter} \index{weight!adult} \index{adult weight}
\index{lognormal distribution}
\index{distribution!lognormal}
\index{CDF}

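As a quick check of that formula (a sketch using NumPy; with a large
sample the sample mean should be close to $\exp(\mu + \sigma^2/2)$):

\begin{verbatim}
import numpy as np

mu, sigma = 0.0, 0.5
sample = np.random.lognormal(mu, sigma, size=100000)
print(sample.mean())               # close to the analytic mean
print(np.exp(mu + sigma**2 / 2))   # about 1.133
\end{verbatim}
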
\begin{figure}
% brfss.py
\centerline{
\includegraphics[height=2.5in]{figs/brfss_weight.pdf}}
\caption{CDF of adult weights on a linear scale (left) and
log scale (right).}
\label{brfss_weight}
\end{figure}

If a sample is approximately lognormal and you plot its CDF on a
log-x scale, it will have the characteristic shape of a normal
distribution. To test how well the sample fits a lognormal model, you
can make a normal probability plot using the log of the values
in the sample.
\index{normal probability plot}
\index{model}

As an example, let's look at the distribution of adult weights, which
is approximately lognormal.\footnote{I was tipped off to this
possibility by a comment (without citation) at
\url{http://mathworld.wolfram.com/LogNormalDistribution.html}.
Subsequently I found a paper that proposes the log transform and
suggests a cause: Penman and Johnson, ``The Changing Shape of the
Body Mass Index Distribution Curve in the Population,'' Preventing
Chronic Disease, 2006 July; 3(3): A74. Online at
\url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1636707}.}

The National Center for Chronic Disease
Prevention and Health Promotion conducts an annual survey as part of
the Behavioral Risk Factor Surveillance System
(BRFSS).\footnote{Centers for Disease Control and Prevention
(CDC). Behavioral Risk Factor Surveillance System Survey
Data. Atlanta, Georgia: U.S. Department of Health and Human
Services, Centers for Disease Control and Prevention, 2008.} In
2008, they interviewed 414,509 respondents and asked about their
demographics, health, and health risks.
Among the data they collected are the weights in kilograms of
398,484 respondents.
\index{Behavioral Risk Factor Surveillance System}
\index{BRFSS}

The repository for this book contains {\tt CDBRFS08.ASC.gz},
a fixed-width ASCII file that contains data from the BRFSS,
and {\tt brfss.py}, which reads the file and analyzes the data.

\begin{figure}
% brfss.py
\centerline{
\includegraphics[height=2.5in]{figs/brfss_weight_normal.pdf}}
\caption{Normal probability plots for adult weight on a linear scale
(left) and log scale (right).}
\label{brfss_weight_normal}
\end{figure}

Figure~\ref{brfss_weight} (left) shows the distribution of adult
weights on a linear scale with a normal model.
Figure~\ref{brfss_weight} (right) shows the same distribution on a log
scale with a lognormal model. The lognormal model is a better fit,
but this representation of the data does not make the difference
particularly dramatic. \index{respondent} \index{model}

Figure~\ref{brfss_weight_normal} shows normal probability plots for
adult weights, $w$, and for their logarithms, $\log_{10} w$. Now it
is apparent that the data deviate substantially from the normal model.
On the other hand, the lognormal model is a good match for the data.
\index{normal distribution} \index{distribution!normal}
\index{Gaussian distribution} \index{distribution!Gaussian}
\index{lognormal distribution} \index{distribution!lognormal}
\index{standard deviation} \index{adult weight} \index{weight!adult}
\index{model} \index{normal probability plot}

\section{The Pareto distribution}
\index{Pareto distribution}
\index{distribution!Pareto}
\index{Pareto, Vilfredo}

The {\bf Pareto distribution} is named after the economist Vilfredo Pareto,
who used it to describe the distribution of wealth (see
\url{http://wikipedia.org/wiki/Pareto_distribution}). Since then, it
has been used to describe phenomena in the natural and social sciences
including sizes of cities and towns, sand particles and meteorites,
forest fires and earthquakes. \index{CDF}

The CDF of the Pareto distribution is:
%
\[ CDF(x) = 1 - \left( \frac{x}{x_m} \right) ^{-\alpha} \]
%
The parameters $x_{m}$ and $\alpha$ determine the location and shape
of the distribution. $x_{m}$ is the minimum possible value.
Figure~\ref{analytic_pareto_cdf} shows CDFs of Pareto
distributions with $x_{m} = 0.5$ and different values
of $\alpha$.
\index{parameter}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_pareto_cdf.pdf}}
\caption{CDFs of Pareto distributions with different parameters.}
\label{analytic_pareto_cdf}
\end{figure}

There is a simple visual test that indicates whether an empirical
distribution fits a Pareto distribution: on a log-log scale, the CCDF
looks like a straight line. Let's see why that works.

If you plot the CCDF of a sample from a Pareto distribution on a
linear scale, you expect to see a function like:
%
\[ y \approx \left( \frac{x}{x_m} \right) ^{-\alpha} \]
%
Taking the log of both sides yields:
%
\[ \log y \approx -\alpha (\log x - \log x_{m})\]
%
So if you plot $\log y$ versus $\log x$, it should look like a straight
line with slope $-\alpha$ and intercept
$\alpha \log x_{m}$.

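Here is a sketch that checks this property with a random sample, using
{\tt random.paretovariate}, which draws from a Pareto distribution
with $x_m = 1$:

\begin{verbatim}
import random
import numpy as np

alpha = 1.7
sample = np.array([random.paretovariate(alpha)
                   for _ in range(10000)])

# empirical CCDF: fraction of values greater than each sorted value
xs = np.sort(sample)
ccdf = 1.0 - np.arange(1, len(xs) + 1) / float(len(xs))

# on a log-log scale the points should fall near a line with slope
# close to -alpha (drop the last point, where the CCDF is 0)
slope, inter = np.polyfit(np.log(xs[:-1]), np.log(ccdf[:-1]), 1)
\end{verbatim}
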
As an example, let's look at the sizes of cities and towns.
The U.S.~Census Bureau publishes the
population of every incorporated city and town in the United States.
\index{Pareto distribution} \index{distribution!Pareto}
\index{U.S.~Census Bureau} \index{population} \index{city size}

\begin{figure}
% populations.py
\centerline{\includegraphics[height=2.5in]{figs/populations_pareto.pdf}}
\caption{CCDFs of city and town populations, on a log-log scale.}
\label{populations_pareto}
\end{figure}

I downloaded their data from
\url{http://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012-3.html};
it is in the repository for this book in a file named
\verb"PEP_2012_PEPANNRES_with_ann.csv". The repository also
contains {\tt populations.py}, which reads the file and plots
the distribution of populations.

Figure~\ref{populations_pareto} shows the CCDF of populations on a
log-log scale. The largest 1\% of cities and towns, where the CCDF
falls below $10^{-2}$, lie along a straight line. So we could
conclude, as some researchers have, that the tail of this distribution
fits a Pareto model.
\index{model}

On the other hand, a lognormal distribution also models the data well.
Figure~\ref{populations_normal} shows the CDF of populations and a
lognormal model (left), and a normal probability plot (right). Both
plots show good agreement between the data and the model.
\index{normal probability plot}

Neither model is perfect.
The Pareto model only applies to the largest 1\% of cities, but it
is a better fit for that part of the distribution. The lognormal
model is a better fit for the other 99\%.
Which model is appropriate depends on which part of the distribution
is relevant.

\begin{figure}
% populations.py
\centerline{\includegraphics[height=2.5in]{figs/populations_normal.pdf}}
\caption{CDF of city and town populations on a log-x scale (left), and
normal probability plot of log-transformed populations (right).}
\label{populations_normal}
\end{figure}

\section{Generating random numbers}
\index{exponential distribution}
\index{distribution!exponential}
\index{random number}
\index{CDF}
\index{inverse CDF algorithm}
\index{uniform distribution}
\index{distribution!uniform}

Analytic CDFs can be used to generate random numbers with a given
distribution function, $p = \CDF(x)$. If there is an efficient way to
compute the inverse CDF, we can generate random values
with the appropriate distribution by choosing $p$ from a uniform
distribution between 0 and 1, then choosing
$x = ICDF(p)$.
\index{inverse CDF}
\index{CDF, inverse}

For example, the CDF of the exponential distribution is
%
\[ p = 1 - e^{-\lambda x} \]
%
Solving for $x$ yields:
%
\[ x = -\log (1 - p) / \lambda \]
%
So in Python we can write
%
\begin{verbatim}
def expovariate(lam):
    p = random.random()
    x = -math.log(1-p) / lam
    return x
\end{verbatim}

{\tt expovariate} takes {\tt lam} and returns a random value chosen
from the exponential distribution with parameter {\tt lam}.

Two notes about this implementation:
I called the parameter \verb"lam" because \verb"lambda" is a Python
keyword. Also, since $\log 0$ is undefined, we have to
be a little careful. The implementation of {\tt random.random}
can return 0 but not 1, so $1 - p$ can be 1 but not 0, so
{\tt log(1-p)} is always defined. \index{random module}

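The same approach works for any distribution whose CDF we can invert.
For example, here is a sketch of a Pareto generator based on inverting
the Pareto CDF given earlier in this chapter; the standard {\tt random}
module provides {\tt random.paretovariate} for the special case
$x_m = 1$:

\begin{verbatim}
def paretovariate(xm, alpha):
    # invert p = 1 - (x / xm) ** -alpha
    p = random.random()
    return xm * (1 - p) ** (-1.0 / alpha)
\end{verbatim}
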
\section{Why model?}
\index{model}

At the beginning of this chapter, I said that many real world phenomena
can be modeled with analytic distributions. ``So,'' you might ask,
``what?'' \index{abstraction}

Like all models, analytic distributions are abstractions, which
means they leave out details that are considered irrelevant.
For example, an observed distribution might have measurement errors
or quirks that are specific to the sample; analytic models smooth
out these idiosyncrasies.
\index{smoothing}

Analytic models are also a form of data compression. When a model
fits a dataset well, a small set of parameters can summarize a
large amount of data.
\index{parameter}
\index{compression}

It is sometimes surprising when data from a natural phenomenon fit an
analytic distribution, but these observations can provide insight
into physical systems. Sometimes we can explain why an observed
distribution has a particular form. For example, Pareto distributions
are often the result of generative processes with positive feedback
(so-called preferential attachment processes; see
\url{http://wikipedia.org/wiki/Preferential_attachment}).
\index{preferential attachment}
\index{generative process}
\index{Pareto distribution}
\index{distribution!Pareto}
\index{analysis}

Also, analytic distributions lend themselves to mathematical
analysis, as we will see in Chapter~\ref{analysis}.

But it is important to remember that all models are imperfect.
Data from the real world never fit an analytic distribution perfectly.
People sometimes talk as if data are generated by models; for example,
they might say that the distribution of human heights is normal,
or the distribution of income is lognormal. Taken literally, these
claims cannot be true; there are always differences between the
real world and mathematical models.

Models are useful if they capture the relevant aspects of the
real world and leave out unneeded details. But what is ``relevant''
or ``unneeded'' depends on what you are planning to use the model
for.

\section{Exercises}

For the following exercises, you can start with \verb"chap05ex.ipynb".
My solution is in \verb"chap05soln.ipynb".

\begin{exercise}
In the BRFSS (see Section~\ref{lognormal}), the distribution of
heights is roughly normal with parameters $\mu = 178$ cm and
$\sigma = 7.7$ cm for men, and $\mu = 163$ cm and $\sigma = 7.3$ cm for
women.
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{height}
\index{Blue Man Group}
\index{Group, Blue Man}

In order to join Blue Man Group, you have to be male between 5'10''
and 6'1'' (see \url{http://bluemancasting.com}). What percentage of
the U.S. male population is in this range? Hint: use {\tt
scipy.stats.norm.cdf}.
\index{SciPy}

\end{exercise}

\begin{exercise}
To get a feel for the Pareto distribution, let's see how different
the world
would be if the distribution of human height were Pareto.
With the parameters $x_{m} = 1$ m and $\alpha = 1.7$, we
get a distribution with a reasonable minimum, 1 m,
and median, 1.5 m.
\index{height}
\index{Pareto distribution}
\index{distribution!Pareto}

Plot this distribution. What is the mean human height in Pareto
world? What fraction of the population is shorter than the mean? If
there are 7 billion people in Pareto world, how many do we expect to
be taller than 1 km? How tall do we expect the tallest person to be?
\index{Pareto World}

\end{exercise}

\begin{exercise}
\label{weibull}

The Weibull distribution is a generalization of the exponential
distribution that comes up in failure analysis
(see \url{http://wikipedia.org/wiki/Weibull_distribution}). Its CDF is
%
\[ CDF(x) = 1 - e^{-(x / \lambda)^k} \]
%
Can you find a transformation that makes a Weibull distribution look
like a straight line? What do the slope and intercept of the
line indicate?
\index{Weibull distribution}
\index{distribution!Weibull}
\index{exponential distribution}
\index{distribution!exponential}
\index{random module}

Use {\tt random.weibullvariate} to generate a sample from a
Weibull distribution and use it to test your transformation.

\end{exercise}

\begin{exercise}
For small values of $n$, we don't expect an empirical distribution
to fit an analytic distribution exactly. One way to evaluate
the quality of fit is to generate a sample from an analytic
distribution and see how well it matches the data.
\index{empirical distribution}
\index{distribution!empirical}
\index{random module}

For example, in Section~\ref{exponential} we plotted the distribution
of time between births and saw that it is approximately exponential.
But the distribution is based on only 44 data points. To see whether
the data might have come from an exponential distribution, generate 44
values from an exponential distribution with the same mean as the
data, about 33 minutes between births.

Plot the distribution of the random values and compare it to the
actual distribution. You can use {\tt random.expovariate}
to generate the values.

\end{exercise}

\begin{exercise}
In the repository for this book, you'll find a set of data files
called {\tt mystery0.dat}, {\tt mystery1.dat}, and so on. Each
contains a sequence of random numbers generated from an analytic
distribution.
\index{random number}

You will also find \verb"test_models.py", a script that reads
data from a file and plots the CDF under a variety of transforms.
You can run it like this:

\begin{verbatim}
$ python test_models.py mystery0.dat
\end{verbatim}

Based on these plots, you should be able to infer what kind of
distribution generated each file. If you are stumped, you can
look in {\tt mystery.py}, which contains the code that generated
the files.

\end{exercise}

\begin{exercise}
\label{income}

The distributions of wealth and income are sometimes modeled using
lognormal and Pareto distributions. To see which is better, let's
look at some data.
\index{Pareto distribution}
\index{distribution!Pareto}
\index{lognormal distribution}
\index{distribution!lognormal}

The Current Population Survey (CPS) is a joint effort of the Bureau
of Labor Statistics and the Census Bureau to study income and related
variables. Data collected in 2013 is available from
\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.
I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with
information about household income, and converted it to {\tt hinc06.csv},
a CSV file you will find in the repository for this book. You
will also find {\tt hinc.py}, which reads this file.

Extract the distribution of incomes from this dataset. Are any of the
analytic distributions in this chapter a good model of the data? A
solution to this exercise is in {\tt hinc_soln.py}.
\index{model}

\end{exercise}

\section{Glossary}

\begin{itemize}

\item empirical distribution: The distribution of values in a sample.
\index{empirical distribution} \index{distribution!empirical}

\item analytic distribution: A distribution whose CDF is an analytic
function.
\index{analytic distribution}
\index{distribution!analytic}

\item model: A useful simplification. Analytic distributions are
often good models of more complex empirical distributions.
\index{model}

\item interarrival time: The elapsed time between two events.
\index{interarrival time}

\item complementary CDF: A function that maps from a value, $x$,
to the fraction of values that exceed $x$, which is $1 - \CDF(x)$.
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}

\item standard normal distribution: The normal distribution with
mean 0 and standard deviation 1.
\index{standard normal distribution}

\item normal probability plot: A plot of the values in a sample versus
random values from a standard normal distribution.
\index{normal probability plot}
\index{plot!normal probability}

\end{itemize}

\chapter{Probability density functions}
\label{density}
\index{PDF}
\index{probability density function}
\index{exponential distribution}
\index{distribution!exponential}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{CDF}
\index{derivative}

The code for this chapter is in {\tt density.py}. For information
about downloading and working with this code, see Section~\ref{code}.

\section{PDFs}

The derivative of a CDF is called a {\bf probability density function},
or PDF. For example, the PDF of an exponential distribution is
%
\[ \PDF_{expo}(x) = \lambda e^{-\lambda x} \]
%
The PDF of a normal distribution is
%
\[ \PDF_{normal}(x) = \frac{1}{\sigma \sqrt{2 \pi}}
\exp \left[ -\frac{1}{2}
\left( \frac{x - \mu}{\sigma} \right)^2 \right] \]
%
Evaluating a PDF for a particular value of $x$ is usually not useful.
The result is not a probability; it is a probability {\em density}.
\index{density}
\index{mass}

In physics, density is mass per unit of
volume; in order to get a mass, you have to multiply by volume or,
if the density is not constant, you have to integrate over volume.

Similarly, {\bf probability density} measures probability per unit of $x$.
In order to get a probability mass, you have to integrate over $x$.

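As a concrete example (a sketch using SciPy), integrating the standard
normal PDF over an interval gives the same probability as the
difference between the CDF evaluated at the endpoints:

\begin{verbatim}
import scipy.stats
from scipy.integrate import quad

dist = scipy.stats.norm(0, 1)

# probability that a value falls between -1 and 1
p1 = dist.cdf(1) - dist.cdf(-1)
p2, _ = quad(dist.pdf, -1, 1)
# p1 and p2 are both about 0.68
\end{verbatim}
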
{\tt thinkstats2} provides a class called Pdf that represents
a probability density function. Every Pdf object provides the
following methods:

\begin{itemize}

\item {\tt Density}, which takes a value, {\tt x}, and returns the
density of the distribution at {\tt x}.

\item {\tt Render}, which evaluates the density at a discrete set of
values and returns a pair of sequences: the sorted values, {\tt xs},
and their probability densities, {\tt ds}.

\item {\tt MakePmf}, which evaluates {\tt Density}
at a discrete set of values and returns a normalized Pmf that
approximates the Pdf.
\index{Pmf}

\item {\tt GetLinspace}, which returns the default set of points used
by {\tt Render} and {\tt MakePmf}.

\end{itemize}

Pdf is an abstract parent class, which means you should not
instantiate it; that is, you cannot create a Pdf object. Instead, you
should define a child class that inherits from Pdf and provides
definitions of {\tt Density} and {\tt GetLinspace}. Pdf provides
{\tt Render} and {\tt MakePmf}.

For example, {\tt thinkstats2} provides a class named {\tt
NormalPdf} that evaluates the normal density function.

\begin{verbatim}
class NormalPdf(Pdf):

    def __init__(self, mu=0, sigma=1, label=''):
        self.mu = mu
        self.sigma = sigma
        self.label = label

    def Density(self, xs):
        return scipy.stats.norm.pdf(xs, self.mu, self.sigma)

    def GetLinspace(self):
        low, high = self.mu-3*self.sigma, self.mu+3*self.sigma
        return np.linspace(low, high, 101)
\end{verbatim}

The NormalPdf object contains the parameters {\tt mu} and
{\tt sigma}. {\tt Density} uses
{\tt scipy.stats.norm}, which is an object that represents a normal
distribution and provides {\tt cdf} and {\tt pdf}, among other
methods (see Section~\ref{normal}).
\index{SciPy}

The following example creates a NormalPdf with the mean and variance
of adult female heights, in cm, from the BRFSS (see
Section~\ref{brfss}). Then it computes the density of the
distribution at a location one standard deviation from the mean.
\index{standard deviation}

\begin{verbatim}
>>> mean, var = 163, 52.8
>>> std = math.sqrt(var)
>>> pdf = thinkstats2.NormalPdf(mean, std)
>>> pdf.Density(mean + std)
0.0333001
\end{verbatim}

The result is about 0.03, in units of probability mass per cm.
Again, a probability density doesn't mean much by itself. But if
we plot the Pdf, we can see the shape of the distribution:

\begin{verbatim}
>>> thinkplot.Pdf(pdf, label='normal')
>>> thinkplot.Show()
\end{verbatim}

{\tt thinkplot.Pdf} plots the Pdf as a smooth function,
as contrasted with {\tt thinkplot.Pmf}, which renders a Pmf as a
step function. Figure~\ref{pdf_example} shows the result, as well
as a PDF estimated from a sample, which we'll compute in the next
section.
\index{thinkplot}

You can use {\tt MakePmf} to approximate the Pdf:

\begin{verbatim}
>>> pmf = pdf.MakePmf()
\end{verbatim}

By default, the resulting Pmf contains 101 points equally spaced from
{\tt mu - 3*sigma} to {\tt mu + 3*sigma}. Optionally, {\tt MakePmf}
and {\tt Render} can take keyword arguments {\tt low}, {\tt high},
and {\tt n}.

\begin{figure}
% pdf_example.py
\centerline{\includegraphics[height=2.2in]{figs/pdf_example.pdf}}
\caption{A normal PDF that models adult female height in the U.S.,
and the kernel density estimate of a sample with $n=500$.}
\label{pdf_example}
\end{figure}

\section{Kernel density estimation}

{\bf Kernel density estimation} (KDE) is an algorithm that takes
a sample and finds an appropriately smooth PDF that fits
the data. You can read details at
\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.
\index{KDE}
\index{kernel density estimation}

{\tt scipy} provides an implementation of KDE and {\tt thinkstats2}
provides a class called {\tt EstimatedPdf} that uses it:
\index{SciPy}
\index{NumPy}

\begin{verbatim}
class EstimatedPdf(Pdf):

    def __init__(self, sample):
        self.kde = scipy.stats.gaussian_kde(sample)

    def Density(self, xs):
        return self.kde.evaluate(xs)
\end{verbatim}

\verb"__init__" takes a sample
and computes a kernel density estimate. The result is a
\verb"gaussian_kde" object that provides an {\tt evaluate}
method.

{\tt Density} takes a value or sequence, calls
\verb"gaussian_kde.evaluate", and returns the resulting density. The
word ``Gaussian'' appears in the name because it uses a filter based
on a Gaussian distribution to smooth the KDE. \index{density}

Here's an example that generates a sample from a normal
distribution and then makes an EstimatedPdf to fit it:
\index{NumPy}
\index{EstimatedPdf}

\begin{verbatim}
>>> sample = [random.gauss(mean, std) for i in range(500)]
>>> sample_pdf = thinkstats2.EstimatedPdf(sample)
>>> thinkplot.Pdf(sample_pdf, label='sample KDE')
\end{verbatim}

\verb"sample" is a list of 500 random heights.
\verb"sample_pdf" is a Pdf object that contains the estimated
KDE of the sample.
\index{thinkplot}
\index{Pmf}

Figure~\ref{pdf_example} shows the normal density function and a KDE
based on a sample of 500 random heights. The estimate is a good
match for the original distribution.

Estimating a density function with KDE is useful for several purposes:

\begin{itemize}

\item {\it Visualization:\/} During the exploration phase of a project, CDFs
are usually the best visualization of a distribution. After you
look at a CDF, you can decide whether an estimated PDF is an
appropriate model of the distribution. If so, it can be a better
choice for presenting the distribution to an audience that is
unfamiliar with CDFs.
\index{visualization}
\index{model}

\item {\it Interpolation:\/} An estimated PDF is a way to get from a sample
to a model of the population. If you have reason to believe that
the population distribution is smooth, you can use KDE to interpolate
the density for values that don't appear in the sample.
\index{interpolation}

\item {\it Simulation:\/} Simulations are often based on the distribution
of a sample. If the sample size is small, it
might be appropriate to smooth the sample distribution using KDE,
which allows the simulation to explore more possible outcomes,
rather than replicating the observed data.
\index{simulation}

\end{itemize}

\section{The distribution framework}
\index{distribution framework}

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/distribution_functions.pdf}}
\caption{A framework that relates representations of distribution
functions.}
\label{dist_framework}
\end{figure}

At this point we have seen PMFs, CDFs and PDFs; let's take a minute
to review. Figure~\ref{dist_framework} shows how these functions relate
to each other.
\index{Pmf}
\index{Cdf}
\index{Pdf}

We started with PMFs, which represent the probabilities for a discrete
set of values. To get from a PMF to a CDF, you add up the probability
masses to get cumulative probabilities.
To get from a CDF back to a PMF, you compute differences in cumulative
probabilities. We'll see the implementation of these operations
in the next few sections.
\index{cumulative probability}

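In outline (a sketch using NumPy arrays of probabilities for a sorted
set of values), the two operations are a cumulative sum and its
difference:

\begin{verbatim}
import numpy as np

pmf_probs = np.array([0.2, 0.3, 0.5])
cdf_probs = np.cumsum(pmf_probs)         # [0.2, 0.5, 1.0]
back = np.diff(cdf_probs, prepend=0)     # recovers [0.2, 0.3, 0.5]
\end{verbatim}
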
A PDF is the derivative of a continuous CDF; or, equivalently,
a CDF is the integral of a PDF. Remember that a PDF maps from
values to probability densities; to get a probability, you have to
integrate.
\index{discrete distribution}
\index{continuous distribution}
\index{smoothing}

To get from a discrete to a continuous distribution, you can perform
various kinds of smoothing. One form of smoothing is to assume that
the data come from an analytic continuous distribution
(like exponential or normal) and to estimate the parameters of that
distribution. Another option is kernel density estimation.
\index{exponential distribution}
\index{distribution!exponential}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

The opposite of smoothing is {\bf discretizing}, or quantizing. If you
evaluate a PDF at discrete points, you can generate a PMF that is an
approximation of the PDF. You can get a better approximation using
numerical integration. \index{discretize}
\index{quantize}
\index{binning}

To distinguish between continuous and discrete CDFs, it might be
better for a discrete CDF to be a ``cumulative mass function,'' but as
far as I can tell no one uses that term. \index{CDF}

\section{Hist implementation}

At this point you should know how to use the basic types provided
by {\tt thinkstats2}: Hist, Pmf, Cdf, and Pdf. The next few sections
provide details about how they are implemented. This material
might help you use these classes more effectively, but it is not
strictly necessary.
\index{Hist}

Hist and Pmf inherit from a parent class called \verb"_DictWrapper".
The leading underscore indicates that this class is ``internal;'' that
is, it should not be used by code in other modules. The name
indicates what it is: a dictionary wrapper. Its primary attribute is
{\tt d}, the dictionary that maps from values to their frequencies.
\index{DictWrapper}
\index{internal class}
\index{wrapper}

The values can be any hashable type. The frequencies should be integers,
but can be any numeric type.
\index{hashable}

\verb"_DictWrapper" contains methods appropriate for both
Hist and Pmf, including \verb"__init__", {\tt Values},
{\tt Items} and {\tt Render}. It also provides modifier
methods {\tt Set}, {\tt Incr}, {\tt Mult}, and {\tt Remove}. These
methods are all implemented with dictionary operations. For example:
\index{dictionary}

\begin{verbatim}
# class _DictWrapper

    def Incr(self, x, term=1):
        self.d[x] = self.d.get(x, 0) + term

    def Mult(self, x, factor):
        self.d[x] = self.d.get(x, 0) * factor

    def Remove(self, x):
        del self.d[x]
\end{verbatim}

Hist also provides {\tt Freq}, which looks up the frequency
of a given value.
\index{frequency}

Because Hist operators and methods are based on dictionaries,
these methods are constant time operations;
that is, their run time does not increase as the Hist gets bigger.
\index{Hist}

\section{Pmf implementation}

Pmf and Hist are almost the same thing, except that a Pmf
maps values to floating-point probabilities, rather than integer
frequencies. If the sum of the probabilities is 1, the Pmf is normalized.
\index{Pmf}

Pmf provides {\tt Normalize}, which computes the sum of the
probabilities and divides through by a factor:

\begin{verbatim}
# class Pmf

    def Normalize(self, fraction=1.0):
        total = self.Total()
        if total == 0.0:
            raise ValueError('Total probability is zero.')

        factor = float(fraction) / total
        for x in self.d:
            self.d[x] *= factor

        return total
\end{verbatim}

{\tt fraction} determines the sum of the probabilities after
normalizing; the default value is 1. If the total probability is 0,
the Pmf cannot be normalized, so {\tt Normalize} raises {\tt
ValueError}.

Hist and Pmf have the same constructor. It can take
as an argument a {\tt dict}, Hist, Pmf or Cdf, a pandas
Series, a list of (value, frequency) pairs, or a sequence of values.
\index{Hist}

If you instantiate a Pmf, the result is normalized. If you
instantiate a Hist, it is not. To construct an unnormalized Pmf,
you can create an empty Pmf and modify it. The Pmf modifiers do
not renormalize the Pmf.

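For example (a sketch of typical usage), modifying a normalized Pmf
leaves it unnormalized until you call {\tt Normalize} again:

\begin{verbatim}
pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])   # constructor normalizes
pmf.Mult(2, 0.5)                         # modifiers do not renormalize
print(pmf.Total())                       # less than 1 now
pmf.Normalize()                          # sums to 1 again
\end{verbatim}
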
\section{Cdf implementation}

A CDF maps from values to cumulative probabilities, so I could have
implemented Cdf as a \verb"_DictWrapper". But the values in a CDF are
ordered and the values in a \verb"_DictWrapper" are not. Also, it is
often useful to compute the inverse CDF; that is, the map from
cumulative probability to value. So the implementation I chose is two
sorted lists. That way I can use binary search to do a forward or
inverse lookup in logarithmic time.
\index{Cdf}
\index{binary search}
\index{cumulative probability}
\index{DictWrapper}
\index{inverse CDF}
\index{CDF, inverse}

The Cdf constructor can take as a parameter a sequence of values
or a pandas Series, a dictionary that maps from values to
probabilities, a sequence of (value, probability) pairs, a Hist, Pmf,
or Cdf. Or if it is given two parameters, it treats them as a sorted
sequence of values and the sequence of corresponding cumulative
probabilities.

Given a sequence, pandas Series, or dictionary, the constructor makes
a Hist. Then it uses the Hist to initialize the attributes:

\begin{verbatim}
self.xs, freqs = zip(*sorted(dw.Items()))
self.ps = np.cumsum(freqs, dtype=np.float)
self.ps /= self.ps[-1]
\end{verbatim}

{\tt xs} is the sorted list of values; {\tt freqs} is the list
of corresponding frequencies. {\tt np.cumsum} computes
the cumulative sum of the frequencies. Dividing through by the
total frequency yields cumulative probabilities.
For {\tt n} values, the time to construct the
Cdf is proportional to $n \log n$.
\index{frequency}

Here is the implementation of {\tt Prob}, which takes a value
and returns its cumulative probability:

\begin{verbatim}
# class Cdf
    def Prob(self, x):
        if x < self.xs[0]:
            return 0.0
        index = bisect.bisect(self.xs, x)
        p = self.ps[index - 1]
        return p
\end{verbatim}

The {\tt bisect} module provides an implementation of binary search.
And here is the implementation of {\tt Value}, which takes a
cumulative probability and returns the corresponding value:

\begin{verbatim}
# class Cdf
    def Value(self, p):
        if p < 0 or p > 1:
            raise ValueError('p must be in range [0, 1]')

        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]
\end{verbatim}

Given a Cdf, we can compute the Pmf by computing differences between
consecutive cumulative probabilities. If you call the Cdf constructor
and pass a Pmf, it computes differences by calling {\tt Cdf.Items}:
\index{Pmf}
\index{Cdf}

\begin{verbatim}
# class Cdf
    def Items(self):
        a = self.ps
        b = np.roll(a, 1)
        b[0] = 0
        return zip(self.xs, a-b)
\end{verbatim}

{\tt np.roll} shifts the elements of {\tt a} to the right, and ``rolls''
the last one back to the beginning. We replace the first element of
{\tt b} with 0 and then compute the difference {\tt a-b}. The result
is a NumPy array of probabilities.
\index{NumPy}

Cdf provides {\tt Shift} and {\tt Scale}, which modify the
values in the Cdf, but the probabilities should be treated as
immutable.

\section{Moments}
\index{moment}

Any time you take a sample and reduce it to a single number, that
number is a statistic. The statistics we have seen so far include
mean, variance, median, and interquartile range.

A {\bf raw moment} is a kind of statistic. If you have a sample of
values, $x_i$, the $k$th raw moment is:
%
\[ m'_k = \frac{1}{n} \sum_i x_i^k \]
%
Or if you prefer Python notation:

\begin{verbatim}
def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)
\end{verbatim}

When $k=1$ the result is the sample mean, $\xbar$. The other
raw moments don't mean much by themselves, but they are used
in some computations.

The {\bf central moments} are more useful. The
$k$th central moment is:
%
\[ m_k = \frac{1}{n} \sum_i (x_i - \xbar)^k \]
%
Or in Python:

\begin{verbatim}
def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)
\end{verbatim}

When $k=2$ the result is the second central moment, which you might
recognize as variance. The definition of variance gives a hint about
why these statistics are called moments. If we attach a weight along a
ruler at each location, $x_i$, and then spin the ruler around
the mean, the moment of inertia of the spinning weights is the variance
of the values. If you are not familiar with moment of inertia, see
\url{http://en.wikipedia.org/wiki/Moment_of_inertia}. \index{moment
of inertia}

When you report moment-based statistics, it is important to think
about the units. For example, if the values $x_i$ are in cm, the
first raw moment is also in cm. But the second moment is in
cm$^2$, the third moment is in cm$^3$, and so on.

Because of these units, moments are hard to interpret by themselves.
That's why, for the second moment, it is common to report standard
deviation, which is the square root of variance, so it is in the same
units as $x_i$.
\index{standard deviation}

4724
\section{Skewness}
4725
\index{skewness}
4726
4727
{\bf Skewness} is a property that describes the shape of a distribution.
4728
If the distribution is symmetric around its central tendency, it is
4729
unskewed. If the values extend farther to the right, it is ``right
4730
skewed'' and if the values extend left, it is ``left skewed.''
4731
\index{central tendency}
4732
4733
This use of ``skewed'' does not have the usual connotation of
4734
``biased.'' Skewness only describes the shape of the distribution;
4735
it says nothing about whether the sampling process might have been
4736
biased.
4737
\index{bias}
4738
\index{sample skewness}
4739
4740
Several statistics are commonly used to quantify the skewness of a
4741
distribution. Given a sequence of values, $x_i$, the {\bf sample
4742
skewness}, $g_1$, can be computed like this:
4743
4744
\begin{verbatim}
4745
def StandardizedMoment(xs, k):
4746
var = CentralMoment(xs, 2)
4747
std = math.sqrt(var)
4748
return CentralMoment(xs, k) / std**k
4749
4750
def Skewness(xs):
4751
return StandardizedMoment(xs, 3)
4752
\end{verbatim}

$g_1$ is the third {\bf standardized moment}, which means that it has
been normalized so it has no units.
\index{standardized moment}

Negative skewness indicates that a distribution
skews left; positive skewness indicates
that a distribution skews right. The magnitude of $g_1$ indicates
the strength of the skewness, but by itself it is not easy to
interpret.

In practice, computing sample skewness is usually not
a good idea. If there are any outliers, they
have a disproportionate effect on $g_1$.
\index{outlier}

Another way to evaluate the asymmetry of a distribution is to look
at the relationship between the mean and median.
Extreme values have more effect on the mean than the median, so
in a distribution that skews left, the mean is less than the median.
In a distribution that skews right, the mean is greater.
\index{symmetric}
\index{Pearson median skewness}

{\bf Pearson's median skewness coefficient} is a measure
of skewness based on the difference between the
sample mean and median:
%
\[ g_p = 3 (\xbar - m) / S \]
%
Where $\xbar$ is the sample mean, $m$ is the median, and
$S$ is the standard deviation. Or in Python:
\index{standard deviation}

\begin{verbatim}
def Median(xs):
    cdf = thinkstats2.Cdf(xs)
    return cdf.Value(0.5)

def PearsonMedianSkewness(xs):
    median = Median(xs)
    mean = RawMoment(xs, 1)
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    gp = 3 * (mean - median) / std
    return gp
\end{verbatim}
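
To get a feel for the two statistics, here is a sketch that compares
them on a synthetic right-skewed sample, using the functions defined
above (this assumes they are in scope, along with {\tt math} and
{\tt thinkstats2}; the exact output depends on the random seed, so the
numbers in the comments are only approximate):

\begin{verbatim}
import numpy as np

np.random.seed(17)
xs = np.random.exponential(size=1000)
print(Skewness(xs))               # close to 2, the skewness of
                                  # an exponential distribution
print(PearsonMedianSkewness(xs))  # positive, roughly 0.9
\end{verbatim}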

This statistic is {\bf robust}, which means that it is less vulnerable
to the effect of outliers.
\index{robust}
\index{outlier}

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/density_totalwgt_kde.pdf}}
\caption{Estimated PDF of birthweight data from the NSFG.}
\label{density_totalwgt_kde}
\end{figure}

As an example, let's look at the skewness of birth weights in the
NSFG pregnancy data. Here's the code to estimate and plot the PDF:
\index{thinkplot}

\begin{verbatim}
live, firsts, others = first.MakeFrames()
data = live.totalwgt_lb.dropna()
pdf = thinkstats2.EstimatedPdf(data)
thinkplot.Pdf(pdf, label='birth weight')
\end{verbatim}

Figure~\ref{density_totalwgt_kde} shows the result. The left tail appears
longer than the right, so we suspect the distribution is skewed left.
The mean, 7.27 lbs, is a bit less than
the median, 7.38 lbs, so that is consistent with left skew.
And both skewness coefficients are negative:
sample skewness is -0.59;
Pearson's median skewness is -0.23.
\index{skewness}
\index{dropna}
\index{NaN}

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/density_wtkg2_kde.pdf}}
\caption{Estimated PDF of adult weight data from the BRFSS.}
\label{density_wtkg2_kde}
\end{figure}

Now let's compare this distribution to the distribution of adult
weight in the BRFSS. Again, here's the code:
\index{thinkplot}

\begin{verbatim}
df = brfss.ReadBrfss(nrows=None)
data = df.wtkg2.dropna()
pdf = thinkstats2.EstimatedPdf(data)
thinkplot.Pdf(pdf, label='adult weight')
\end{verbatim}

Figure~\ref{density_wtkg2_kde} shows the result. The distribution
appears skewed to the right. Sure enough, the mean, 79.0, is bigger
than the median, 77.3. The sample skewness is 1.1 and Pearson's
median skewness is 0.26.
\index{dropna}
\index{NaN}

The sign of a skewness coefficient indicates whether the distribution
skews left or right, but beyond that, these coefficients are hard to
interpret. Sample skewness is less robust; that is, it is more
susceptible to outliers. As a result it is less reliable
when applied to skewed distributions, exactly when it would be most
relevant.
\index{outlier}
\index{robust}

Pearson's median skewness is based on a computed mean and variance,
so it is also susceptible to outliers, but since it does not depend
on a third moment, it is somewhat more robust.
\index{Pearson median skewness}


\section{Exercises}

A solution to this exercise is in \verb"chap06soln.py".

\begin{exercise}

The distribution of income is famously skewed to the right. In this
exercise, we'll measure how strong that skew is.
\index{skewness}
\index{income}

The Current Population Survey (CPS) is a joint effort of the Bureau
of Labor Statistics and the Census Bureau to study income and related
variables. Data collected in 2013 is available from
\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.
I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with
information about household income, and converted it to {\tt hinc06.csv},
a CSV file you will find in the repository for this book. You
will also find {\tt hinc2.py}, which reads this file and transforms
the data.
\index{Current Population Survey}
\index{Bureau of Labor Statistics}
\index{Census Bureau}

The dataset is in the form of a series of income ranges and the number
of respondents who fell in each range. The lowest range includes
respondents who reported annual household income ``Under \$5000.''
The highest range includes respondents who made ``\$250,000 or
more.''

To estimate mean and other statistics from these data, we have to
make some assumptions about the lower and upper bounds, and how
the values are distributed in each range. {\tt hinc2.py} provides
{\tt InterpolateSample}, which shows one way to model
this data. It takes a DataFrame with a column, {\tt income}, that
contains the upper bound of each range, and {\tt freq}, which contains
the number of respondents in each range.
\index{DataFrame}
\index{model}

It also takes \verb"log_upper", which is an assumed upper bound
on the highest range, expressed in {\tt log10} dollars.
The default value, \verb"log_upper=6.0", represents the assumption
that the largest income among the respondents is
$10^6$, or one million dollars.

{\tt InterpolateSample} generates a pseudo-sample; that is, a sample
of household incomes that yields the same number of respondents
in each range as the actual data. It assumes that incomes in
each range are equally spaced on a log10 scale.
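
As a rough sketch, an interpolation like that might look as follows;
this is not necessarily how {\tt hinc2.py} implements it, and the
function name and the \verb"log_lower" parameter (an assumed lower
bound in {\tt log10} dollars) are hypothetical:

\begin{verbatim}
import numpy as np

def InterpolateIncomes(df, log_upper=6.0, log_lower=3.0):
    # upper bound of each range, in log10 dollars
    log_bounds = np.log10(df.income.values.astype(float))
    # the top range has no upper bound in the data, so assume one
    log_bounds[-1] = log_upper

    arrays = []
    low = log_lower
    for high, freq in zip(log_bounds, df.freq.values):
        # freq values equally spaced in log10 dollars within the range
        arrays.append(np.linspace(low, high, int(freq)))
        low = high

    return np.concatenate(arrays)
\end{verbatim}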

Compute the median, mean, skewness and Pearson's skewness of the
resulting sample. What fraction of households reports a taxable
income below the mean? How do the results depend on the assumed
upper bound?
\end{exercise}


\section{Glossary}

\begin{itemize}

\item Probability density function (PDF): The derivative of a continuous CDF,
a function that maps a value to its probability density.
\index{PDF}
\index{probability density function}

\item Probability density: A quantity that can be integrated over a
range of values to yield a probability. If the values are in units
of cm, for example, probability density is in units of probability
per cm.
\index{probability density}

\item Kernel density estimation (KDE): An algorithm that estimates a PDF
based on a sample.
\index{kernel density estimation}
\index{KDE}

\item discretize: To approximate a continuous function or distribution
with a discrete function. The opposite of smoothing.
\index{discretize}

\item raw moment: A statistic based on the sum of data raised to a power.
\index{raw moment}

\item central moment: A statistic based on deviation from the mean,
raised to a power.
\index{central moment}

\item standardized moment: A ratio of moments that has no units.
\index{standardized moment}

\item skewness: A measure of how asymmetric a distribution is.
\index{skewness}

\item sample skewness: A moment-based statistic intended to quantify
the skewness of a distribution.
\index{sample skewness}

\item Pearson's median skewness coefficient: A statistic intended to
quantify the skewness of a distribution based on the median, mean,
and standard deviation.
\index{Pearson median skewness}

\item robust: A statistic is robust if it is relatively immune to the
effect of outliers.
\index{robust}

\end{itemize}

\chapter{Relationships between variables}

So far we have only looked at one variable at a time. In this
chapter we look at relationships between variables. Two variables are
related if knowing one gives you information about the other. For
example, height and weight are related; people who are taller tend to
be heavier. Of course, it is not a perfect relationship: there
are short heavy people and tall light ones. But if you are
trying to guess someone's weight, you will be more accurate if you
know their height than if you don't.
\index{adult weight}
\index{adult height}

The code for this chapter is in {\tt scatter.py}.
For information about downloading and
working with this code, see Section~\ref{code}.


\section{Scatter plots}
\index{scatter plot}
\index{plot!scatter}

The simplest way to check for a relationship between two variables
is a {\bf scatter plot}, but making a good scatter plot is not always easy.
As an example, I'll plot weight versus height for the respondents
in the BRFSS (see Section~\ref{lognormal}).
\index{BRFSS}

Here's the code that reads the data file and extracts height and
weight:

\begin{verbatim}
df = brfss.ReadBrfss(nrows=None)
sample = thinkstats2.SampleRows(df, 5000)
heights, weights = sample.htm3, sample.wtkg2
\end{verbatim}

{\tt SampleRows} chooses a random subset of the data:
\index{SampleRows}

\begin{verbatim}
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample
\end{verbatim}

{\tt df} is the DataFrame, {\tt nrows} is the number of rows to choose,
and {\tt replace} is a boolean indicating whether sampling should be
done with replacement; in other words, whether the same row could be
chosen more than once.
\index{DataFrame}
\index{thinkplot}
\index{boolean}
\index{replacement}

{\tt thinkplot} provides {\tt Scatter}, which makes scatter plots:
%
\begin{verbatim}
thinkplot.Scatter(heights, weights)
thinkplot.Show(xlabel='Height (cm)',
               ylabel='Weight (kg)',
               axis=[140, 210, 20, 200])
\end{verbatim}

The result, in Figure~\ref{scatter1} (left), shows the shape of
the relationship. As we expected, taller
people tend to be heavier.

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=3.0in]{figs/scatter1.pdf}}
\caption{Scatter plots of weight versus height for the respondents
in the BRFSS, unjittered (left), jittered (right).}
\label{scatter1}
\end{figure}

But this is not the best representation of
the data, because the data are packed into columns. The problem is
that the heights are rounded to the nearest inch, converted to
centimeters, and then rounded again. Some information is lost in
translation. \index{height} \index{weight} \index{jitter}

We can't get that information back, but we can minimize the effect on
the scatter plot by {\bf jittering} the data, which means adding random
noise to reverse the effect of rounding off. Since these measurements
were rounded to the nearest inch, they might be off by up to 0.5 inches or
1.3 cm. Similarly, the weights might be off by 0.5 kg.
\index{uniform distribution}
\index{distribution!uniform}
\index{noise}

%
\begin{verbatim}
heights = thinkstats2.Jitter(heights, 1.3)
weights = thinkstats2.Jitter(weights, 0.5)
\end{verbatim}

Here's the implementation of {\tt Jitter}:

\begin{verbatim}
def Jitter(values, jitter=0.5):
    n = len(values)
    return np.random.uniform(-jitter, +jitter, n) + values
\end{verbatim}

The values can be any sequence; the result is a NumPy array.
\index{NumPy}

Figure~\ref{scatter1} (right) shows the result. Jittering reduces the
visual effect of rounding and makes the shape of the relationship
clearer. But in general you should only jitter data for purposes of
visualization and avoid using jittered data for analysis.

Even with jittering, this is not the best way to represent the data.
There are many overlapping points, which hides data
in the dense parts of the figure and gives disproportionate emphasis
to outliers. This effect is called {\bf saturation}.
\index{outlier}
\index{saturation}

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=3.0in]{figs/scatter2.pdf}}
\caption{Scatter plot with jittering and transparency (left),
hexbin plot (right).}
\label{scatter2}
\end{figure}

We can solve this problem with the {\tt alpha} parameter, which makes
the points partly transparent:
%
\begin{verbatim}
thinkplot.Scatter(heights, weights, alpha=0.2)
\end{verbatim}
%
Figure~\ref{scatter2} (left) shows the result. Overlapping data
points look darker, so darkness is proportional to density. In this
version of the plot we can see two details that were not apparent before:
vertical clusters at several heights and a horizontal line near 90 kg
or 200 pounds. Since this data is based on self-reports in pounds,
the most likely explanation is that some respondents reported
rounded values.
\index{thinkplot}
\index{alpha}
\index{transparency}

Using transparency works well for moderate-sized datasets, but this
figure only shows the first 5000 records in the BRFSS, out of a total
of 414 509.
\index{hexbin plot}
\index{plot!hexbin}

To handle larger datasets, another option is a hexbin plot, which
divides the graph into hexagonal bins and colors each bin according to
how many data points fall in it. {\tt thinkplot} provides
{\tt HexBin}:
%
\begin{verbatim}
thinkplot.HexBin(heights, weights)
\end{verbatim}
%
Figure~\ref{scatter2} (right) shows the result. An advantage of a
hexbin is that it shows the shape of the relationship well, and it is
efficient for large datasets, both in time and in the size of the file
it generates. A drawback is that it makes the outliers invisible.
\index{thinkplot}
\index{outlier}

The point of this example is that it is
not easy to make a scatter plot that shows relationships clearly
without introducing misleading artifacts.
\index{artifact}

\section{Characterizing relationships}
\label{characterizing}

Scatter plots provide a general impression of the relationship between
variables, but there are other visualizations that provide more
insight into the nature of the relationship. One option is to bin one
variable and plot percentiles of the other.
\index{binning}

NumPy and pandas provide functions for binning data:
\index{NumPy}
\index{pandas}

\begin{verbatim}
df = df.dropna(subset=['htm3', 'wtkg2'])
bins = np.arange(135, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)
\end{verbatim}

{\tt dropna} drops rows with {\tt nan} in any of the listed columns.
{\tt arange} makes a NumPy array of bins from 135 to, but not including,
210, in increments of 5.
\index{dropna}
\index{digitize}
\index{NaN}

{\tt digitize} computes the index of the bin that contains each value
in {\tt df.htm3}. The result is a NumPy array of integer indices.
Values that fall below the lowest bin are mapped to index 0. Values
above the highest bin are mapped to {\tt len(bins)}.
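
Here is a tiny, hypothetical example of that mapping, using the same
{\tt bins} as above (the heights are made up):

\begin{verbatim}
import numpy as np

bins = np.arange(135, 210, 5)
print(np.digitize([130, 135, 162, 250], bins))
# indices 0, 1, 6, and 15; note that 15 == len(bins)
\end{verbatim}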

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=2.5in]{figs/scatter3.pdf}}
\caption{Percentiles of weight for a range of height bins.}
\label{scatter3}
\end{figure}

{\tt groupby} is a DataFrame method that returns a GroupBy object;
used in a {\tt for} loop, {\tt groups} iterates over the group names
and the DataFrames that represent them. So, for example, we can
print the number of rows in each group like this:
\index{DataFrame}
\index{groupby}

\begin{verbatim}
for i, group in groups:
    print(i, len(group))
\end{verbatim}

Now for each group we can compute the mean height and the CDF
of weight:
\index{Cdf}

\begin{verbatim}
heights = [group.htm3.mean() for i, group in groups]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]
\end{verbatim}

Finally, we can
plot percentiles of weight versus height:
\index{percentile}

\begin{verbatim}
for percent in [75, 50, 25]:
    weights = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(heights, weights, label=label)
\end{verbatim}

Figure~\ref{scatter3} shows the result. Between 140 and 200 cm
the relationship between these variables is roughly linear. This range
includes more than 99\% of the data, so we don't have to worry
too much about the extremes.
\index{thinkplot}

\section{Correlation}

A {\bf correlation} is a statistic intended to quantify the strength
of the relationship between two variables.
\index{correlation}

A challenge in measuring correlation is that the variables we want to
compare are often not expressed in the same units. And even if they
are in the same units, they come from different distributions.
\index{units}

There are two common solutions to these problems:

\begin{enumerate}

\item Transform each value to a {\bf standard score}, which is the
number of standard deviations from the mean.
This transform leads to
the ``Pearson product-moment correlation coefficient.''
\index{standard score}
\index{standard deviation}
\index{Pearson coefficient of correlation}

\item Transform each value to its {\bf rank}, which is its index in
the sorted list of values. This transform
leads to the ``Spearman rank correlation coefficient.''
\index{rank}
\index{percentile rank}
\index{Spearman coefficient of correlation}

\end{enumerate}

If $X$ is a series of $n$ values, $x_i$, we can convert to standard
scores by subtracting the mean and dividing by the standard deviation:
$z_i = (x_i - \mu) / \sigma$.
\index{mean}
\index{standard deviation}

The numerator is a deviation: the distance from the mean. Dividing by
$\sigma$ {\bf standardizes} the deviation, so the values of $Z$ are
dimensionless (no units) and their distribution has mean 0 and
variance 1.
\index{standardize}
\index{deviation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

If $X$ is normally distributed, so is $Z$. But if $X$ is skewed or has
outliers, so does $Z$; in those cases, it is more robust to use
percentile ranks. If we compute a new variable, $R$, so that $r_i$ is
the rank of $x_i$, the distribution of $R$ is uniform
from 1 to $n$, regardless of the distribution of $X$.
\index{uniform distribution} \index{distribution!uniform}
\index{robust}
\index{skewness}
\index{outlier}
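
Both transforms are easy to express with NumPy and pandas. Here is a
minimal sketch on made-up values; note that the outlier dominates the
standard scores but gets an unremarkable rank:

\begin{verbatim}
import numpy as np
import pandas

xs = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
zs = (xs - xs.mean()) / xs.std()   # standard scores: mean 0, variance 1
rs = pandas.Series(xs).rank()      # ranks 1 through n
\end{verbatim}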


\section{Covariance}
\index{covariance}
\index{deviation}

{\bf Covariance} is a measure of the tendency of two variables
to vary together. If we have two series, $X$ and $Y$, their
deviations from the mean are
%
\[ dx_i = x_i - \xbar \]
\[ dy_i = y_i - \ybar \]
%
where $\xbar$ is the sample mean of $X$ and $\ybar$ is the sample mean
of $Y$. If $X$ and $Y$ vary together, their deviations tend to have
the same sign.

If we multiply them together, the product is positive when the
deviations have the same sign and negative when they have the opposite
sign. So adding up the products gives a measure of the tendency to
vary together.

Covariance is the mean of these products:
%
\[ Cov(X,Y) = \frac{1}{n} \sum dx_i~dy_i \]
%
where $n$ is the length of the two series (they have to be the same
length).

If you have studied linear algebra, you might recognize that
{\tt Cov} is the dot product of the deviations, divided
by their length. So the covariance is maximized if the two vectors
are identical, 0 if they are orthogonal, and negative if they
point in opposite directions. {\tt thinkstats2} uses {\tt np.dot} to
implement {\tt Cov} efficiently:
\index{linear algebra}
\index{dot product}
\index{orthogonal vector}

\begin{verbatim}
def Cov(xs, ys, meanx=None, meany=None):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    if meanx is None:
        meanx = np.mean(xs)
    if meany is None:
        meany = np.mean(ys)

    cov = np.dot(xs-meanx, ys-meany) / len(xs)
    return cov
\end{verbatim}

By default {\tt Cov} computes deviations from the sample means,
or you can provide known means. If {\tt xs} and {\tt ys} are
Python sequences, {\tt np.asarray} converts them to NumPy arrays.
If they are already NumPy arrays, {\tt np.asarray} does nothing.
\index{NumPy}

This implementation of covariance is meant to be simple for purposes
of explanation. NumPy and pandas also provide implementations of
covariance, but both of them apply a correction for small sample sizes
that we have not covered yet, and {\tt np.cov} returns a covariance
matrix, which is more than we need for now.
\index{pandas}
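
With the correction turned off, {\tt np.cov} should agree with the
simpler implementation. Here is a quick check on made-up values:

\begin{verbatim}
import numpy as np
import thinkstats2

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 7]
print(thinkstats2.Cov(xs, ys))        # 2.0
print(np.cov(xs, ys, ddof=0)[0][1])   # 2.0, the off-diagonal entry
\end{verbatim}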


\section{Pearson's correlation}
\index{correlation}
\index{standard score}

Covariance is useful in some computations, but it is seldom reported
as a summary statistic because it is hard to interpret. Among other
problems, its units are the product of the units of $X$ and $Y$. For
example, the covariance of weight and height in the BRFSS dataset is
113 kilogram-centimeters, whatever that means.
\index{deviation}
\index{units}

One solution to this problem is to divide the deviations by the standard
deviation, which yields standard scores, and compute the product of
standard scores:
%
\[ p_i = \frac{(x_i - \xbar)}{S_X} \frac{(y_i - \ybar)}{S_Y} \]
%
Where $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$.
The mean of these products is \index{standard deviation}
%
\[ \rho = \frac{1}{n} \sum p_i \]
%
Or we can rewrite $\rho$ by factoring out $S_X$ and
$S_Y$:
%
\[ \rho = \frac{Cov(X,Y)}{S_X S_Y} \]
%
This value is called {\bf Pearson's correlation} after Karl Pearson,
an influential early statistician. It is easy to compute and easy to
interpret. Because standard scores are dimensionless, so is $\rho$.
\index{Pearson, Karl}
\index{Pearson coefficient of correlation}

Here is the implementation in {\tt thinkstats2}:

\begin{verbatim}
def Corr(xs, ys):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    meanx, varx = MeanVar(xs)
    meany, vary = MeanVar(ys)

    corr = Cov(xs, ys, meanx, meany) / math.sqrt(varx * vary)
    return corr
\end{verbatim}

{\tt MeanVar} computes mean and variance slightly more efficiently
than separate calls to {\tt np.mean} and {\tt np.var}.
\index{MeanVar}
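
As another cross-check, {\tt np.corrcoef} should give the same value
as {\tt Corr} (the sample values are made up):

\begin{verbatim}
import numpy as np
import thinkstats2

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 7]
print(thinkstats2.Corr(xs, ys))    # about 0.87
print(np.corrcoef(xs, ys)[0][1])   # the same value
\end{verbatim}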

Pearson's correlation is always between -1 and +1 (including both).
If $\rho$ is positive, we say that the correlation is positive,
which means that when one variable is high, the other tends to be
high. If $\rho$ is negative, the correlation is negative, so
when one variable is high, the other is low.

The magnitude of $\rho$ indicates the strength of the correlation. If
$\rho$ is 1 or -1, the variables are perfectly correlated, which means
that if you know one, you can make a perfect prediction about the
other. \index{prediction}

Most correlation in the real world is not perfect, but it is still
useful. The correlation of height and weight is 0.51, which is a
strong correlation compared to similar human-related variables.


\section{Nonlinear relationships}

If Pearson's correlation is near 0, it is tempting to conclude
that there is no relationship between the variables, but that
conclusion is not valid. Pearson's correlation only measures {\em
linear\/} relationships. If there's a nonlinear relationship, $\rho$
understates its strength. \index{linear relationship}
\index{nonlinear}
\index{Pearson coefficient of correlation}

\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/Correlation_examples.png}}
\caption{Examples of datasets with a range of correlations.}
\label{corr_examples}
\end{figure}

Figure~\ref{corr_examples} is from
\url{http://wikipedia.org/wiki/Correlation_and_dependence}. It shows
scatter plots and correlation coefficients for several
carefully constructed datasets.
\index{scatter plot}
\index{plot!scatter}

The top row shows linear relationships with a range of correlations;
you can use this row to get a sense of what different values of
$\rho$ look like. The second row shows perfect correlations with a
range of slopes, which demonstrates that correlation is unrelated to
slope (we'll talk about estimating slope soon). The third row shows
variables that are clearly related, but because the relationship is
nonlinear, the correlation coefficient is 0.
\index{nonlinear}

The moral of this story is that you should always look at a scatter
plot of your data before blindly computing a correlation coefficient.
\index{correlation}


\section{Spearman's rank correlation}

Pearson's correlation works well if the relationship between variables
is linear and if the variables are roughly normal. But it is not
robust in the presence of outliers.
\index{Pearson coefficient of correlation}
\index{Spearman coefficient of correlation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{robust}
Spearman's rank correlation is an alternative that mitigates the
effect of outliers and skewed distributions. To compute Spearman's
correlation, we have to compute the {\bf rank} of each value, which is its
index in the sorted sample. For example, in the sample {\tt [1, 2, 5, 7]}
the rank of the value 5 is 3, because it appears third in the sorted
list. Then we compute Pearson's correlation for the ranks.
\index{skewness}
\index{outlier}
\index{rank}

{\tt thinkstats2} provides a function that computes Spearman's rank
correlation:

\begin{verbatim}
def SpearmanCorr(xs, ys):
    xranks = pandas.Series(xs).rank()
    yranks = pandas.Series(ys).rank()
    return Corr(xranks, yranks)
\end{verbatim}

I convert the arguments to pandas Series objects so I can use
{\tt rank}, which computes the rank for each value and returns
a Series. Then I use {\tt Corr} to compute the correlation
of the ranks.
\index{pandas}
\index{Series}

I could also use {\tt Series.corr} directly and specify
Spearman's method:

\begin{verbatim}
def SpearmanCorr(xs, ys):
    xs = pandas.Series(xs)
    ys = pandas.Series(ys)
    return xs.corr(ys, method='spearman')
\end{verbatim}
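
The effect of an outlier is easy to see with a small, made-up example:

\begin{verbatim}
import thinkstats2

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 100]                   # one extreme outlier
print(thinkstats2.Corr(xs, ys))          # about 0.72, pulled down
                                         # by the outlier
print(thinkstats2.SpearmanCorr(xs, ys))  # 1.0, because the ranks are
                                         # still in perfect order
\end{verbatim}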

The Spearman rank correlation for the BRFSS data is 0.54, a little
higher than the Pearson correlation, 0.51. There are several possible
reasons for the difference, including:
\index{rank correlation}
\index{BRFSS}

\begin{itemize}

\item If the relationship is
nonlinear, Pearson's correlation tends to underestimate the strength
of the relationship, and
\index{nonlinear}

\item Pearson's correlation can be affected (in either direction)
if one of the distributions is skewed or contains outliers. Spearman's
rank correlation is more robust.
\index{skewness}
\index{outlier}
\index{robust}

\end{itemize}

In the BRFSS example, we know that the distribution of weights is
roughly lognormal; under a log transform it approximates a normal
distribution, so it has no skew.
So another way to eliminate the effect of skewness is to
compute Pearson's
correlation with log-weight and height:
\index{lognormal distribution}
\index{distribution!lognormal}

\begin{verbatim}
thinkstats2.Corr(df.htm3, np.log(df.wtkg2))
\end{verbatim}

The result is 0.53, close to the rank correlation, 0.54. So that
suggests that skewness in the distribution of weight explains most of
the difference between Pearson's and Spearman's correlation.
\index{skewness}
\index{Spearman coefficient of correlation}
\index{Pearson coefficient of correlation}


\section{Correlation and causation}
\index{correlation}
\index{causation}

If variables A and B are correlated, there are three possible
explanations: A causes B, or B causes A, or some other set of factors
causes both A and B. These explanations are called ``causal
relationships''.
\index{causal relationship}

Correlation alone does not distinguish between these explanations,
so it does not tell you which ones are true.
This rule is often summarized with the phrase ``Correlation
does not imply causation,'' which is so pithy it has its own
Wikipedia page: \url{http://wikipedia.org/wiki/Correlation_does_not_imply_causation}.

So what can you do to provide evidence of causation?

\begin{enumerate}

\item Use time. If A comes before B, then A can cause B but not the
other way around (at least according to our common understanding of
causation). The order of events can help us infer the direction
of causation, but it does not preclude the possibility that something
else causes both A and B.

\item Use randomness. If you divide a large sample into two
groups at random and compute the means of almost any variable, you
expect the difference to be small.
If the groups are nearly identical in all variables but one, you
can eliminate spurious relationships.
\index{spurious relationship}

This works even if you don't know what the relevant variables
are, but it works even better if you do, because you can check that
the groups are identical.

\end{enumerate}

These ideas are the motivation for the {\bf randomized controlled
trial}, in which subjects are assigned randomly to two (or more)
groups: a {\bf treatment group} that receives some kind of intervention,
like a new medicine, and a {\bf control group} that receives
no intervention, or another treatment whose effects are known.
\index{randomized controlled trial}
\index{controlled trial}
\index{treatment group}
\index{control group}
\index{medicine}

A randomized controlled trial is the most reliable way to demonstrate
a causal relationship, and the foundation of science-based medicine
(see \url{http://wikipedia.org/wiki/Randomized_controlled_trial}).

Unfortunately, controlled trials are only possible in the laboratory
sciences, medicine, and a few other disciplines. In the social sciences,
controlled experiments are rare, usually because they are impossible
or unethical.
\index{ethics}

An alternative is to look for a {\bf natural experiment}, where
different ``treatments'' are applied to groups that are otherwise
similar. One danger of natural experiments is that the groups might
differ in ways that are not apparent. You can read more about this
topic at \url{http://wikipedia.org/wiki/Natural_experiment}.
\index{natural experiment}

In some cases it is possible to infer causal relationships using {\bf
regression analysis}, which is the topic of Chapter~\ref{regression}.
\index{regression analysis}


\section{Exercises}

A solution to this exercise is in \verb"chap07soln.py".

\begin{exercise}
Using data from the NSFG, make a scatter plot of birth weight
versus mother's age. Plot percentiles of birth weight
versus mother's age. Compute Pearson's and Spearman's correlations.
How would you characterize the relationship
between these variables?
\index{birth weight}
\index{weight!birth}
\index{Pearson coefficient of correlation}
\index{Spearman coefficient of correlation}
\end{exercise}


\section{Glossary}

\begin{itemize}

\item scatter plot: A visualization of the relationship between
two variables, showing one point for each row of data.
\index{scatter plot}

\item jitter: Random noise added to data for purposes of
visualization.
\index{jitter}

\item saturation: Loss of information when multiple points are
plotted on top of each other.
\index{saturation}

\item correlation: A statistic that measures the strength of the
relationship between two variables.
\index{correlation}

\item standardize: To transform a set of values so that their mean is 0 and
their variance is 1.
\index{standardize}

\item standard score: A value that has been standardized so that it is
expressed in standard deviations from the mean.
\index{standard score}
\index{standard deviation}

\item covariance: A measure of the tendency of two variables
to vary together.
\index{covariance}

\item rank: The index where an element appears in a sorted list.
\index{rank}

\item randomized controlled trial: An experimental design in which subjects
are divided into groups at random, and different groups are given different
treatments.
\index{randomized controlled trial}

\item treatment group: A group in a controlled trial that receives
some kind of intervention.
\index{treatment group}

\item control group: A group in a controlled trial that receives no
treatment, or a treatment whose effect is known.
\index{control group}

\item natural experiment: An experimental design that takes advantage of
a natural division of subjects into groups in ways that are at least
approximately random.
\index{natural experiment}

\end{itemize}


\chapter{Estimation}
\label{estimation}
\index{estimation}

The code for this chapter is in {\tt estimation.py}. For information
about downloading and working with this code, see Section~\ref{code}.


\section{The estimation game}

Let's play a game. I think of a distribution, and you have to guess
what it is. I'll give you two hints: it's a
normal distribution, and here's a random sample drawn from it:
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}

What do you think is the mean parameter, $\mu$, of this distribution?
\index{mean}
\index{parameter}

One choice is to use the sample mean, $\xbar$, as an estimate of $\mu$.
In this example, $\xbar$ is 0.155, so it would
be reasonable to guess $\mu$ = 0.155.
This process is called {\bf estimation}, and the statistic we used
(the sample mean) is called an {\bf estimator}.
\index{estimator}
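
Here is that arithmetic as a one-line check:

\begin{verbatim}
import numpy as np

xs = [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]
print(np.mean(xs))   # 0.155, rounded to three decimal places
\end{verbatim}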

Using the sample mean to estimate $\mu$ is so obvious that it is hard
to imagine a reasonable alternative. But suppose we change the game by
introducing outliers.
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\em I'm thinking of a distribution.\/} It's a normal distribution, and
here's a sample that was collected by an unreliable surveyor who
occasionally puts the decimal point in the wrong place.
\index{measurement error}

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -213.8]}

Now what's your estimate of $\mu$? If you use the sample mean, your
guess is -35.12. Is that the best choice? What are the alternatives?
\index{outlier}

One option is to identify and discard outliers, then compute the sample
mean of the rest. Another option is to use the median as an estimator.
\index{median}

Which estimator is best depends on the circumstances (for example,
whether there are outliers) and on what the goal is. Are you
trying to minimize errors, or maximize your chance of getting the
right answer?
\index{error}
\index{MSE}
\index{mean squared error}

If there are no outliers, the sample mean minimizes the {\bf mean squared
error} (MSE). That is, if we play the game many times, and each time
compute the error $\xbar - \mu$, the sample mean minimizes
%
\[ MSE = \frac{1}{m} \sum (\xbar - \mu)^2 \]
%
Where $m$ is the number of times you play the estimation game, not
to be confused with $n$, which is the size of the sample used to
compute $\xbar$.

Here is a function that simulates the estimation game and computes
the root mean squared error (RMSE), which is the square root of
MSE:
\index{mean squared error}
\index{MSE}
\index{RMSE}

\begin{verbatim}
def Estimate1(n=7, m=1000):
    mu = 0
    sigma = 1

    means = []
    medians = []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        xbar = np.mean(xs)
        median = np.median(xs)
        means.append(xbar)
        medians.append(median)

    print('rmse xbar', RMSE(means, mu))
    print('rmse median', RMSE(medians, mu))
\end{verbatim}

Again, {\tt n} is the size of the sample, and {\tt m} is the
number of times we play the game. {\tt means} is the list of
estimates based on $\xbar$. {\tt medians} is the list of medians.
\index{median}

Here's the function that computes RMSE:

\begin{verbatim}
def RMSE(estimates, actual):
    e2 = [(estimate-actual)**2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)
\end{verbatim}

{\tt estimates} is a list of estimates; {\tt actual} is the
actual value being estimated. In practice, of course, we don't
know {\tt actual}; if we did, we wouldn't have to estimate it.
The purpose of this experiment is to compare the performance of
the two estimators.
\index{estimator}

When I ran this code, the RMSE of the sample mean was 0.41, which
means that if we use $\xbar$ to estimate the mean of this
distribution, based on a sample with $n=7$, we should expect to be off
by 0.41 on average. Using the median to estimate the mean yields
RMSE 0.53, which confirms that $\xbar$ yields lower RMSE, at least
for this example.

Minimizing MSE is a nice property, but it's not always the best
strategy. For example, suppose we are estimating the distribution of
wind speeds at a building site. If the estimate is too high, we might
overbuild the structure, increasing its cost. But if it's too
low, the building might collapse. Because cost as a function of
error is not symmetric, minimizing MSE is not the best strategy.
\index{prediction}
\index{cost function}
\index{MSE}

As another example, suppose I roll three six-sided dice and ask you to
predict the total. If you get it exactly right, you get a prize;
otherwise you get nothing. In this case the value that minimizes MSE
is 10.5, but that would be a bad guess, because the total of three
dice is never 10.5. For this game, you want an estimator that has the
highest chance of being right, which is a {\bf maximum likelihood
estimator} (MLE). If you pick 10 or 11, your chance of winning is 1
in 8, and that's the best you can do. \index{MLE}
\index{maximum likelihood estimator}
\index{dice}

\section{Guess the variance}
\index{variance}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\em I'm thinking of a distribution.\/} It's a normal distribution, and
here's a (familiar) sample:

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}

What do you think is the variance, $\sigma^2$, of my distribution?
Again, the obvious choice is to use the sample variance, $S^2$, as an
estimator.
%
\[ S^2 = \frac{1}{n} \sum (x_i - \xbar)^2 \]
%
For large samples, $S^2$ is an adequate estimator, but for small
samples it tends to be too low. Because of this unfortunate
property, it is called a {\bf biased} estimator.
An estimator is {\bf unbiased} if the expected total (or mean) error,
after many iterations of the estimation game, is 0.
\index{sample variance}
\index{biased estimator}
\index{estimator!biased}
\index{unbiased estimator}
\index{estimator!unbiased}

Fortunately, there is another simple statistic that is an unbiased
estimator of $\sigma^2$:
%
\[ S_{n-1}^2 = \frac{1}{n-1} \sum (x_i - \xbar)^2 \]
%
For an explanation of why $S^2$ is biased, and a proof that
$S_{n-1}^2$ is unbiased, see
\url{http://wikipedia.org/wiki/Bias_of_an_estimator}.

The biggest problem with this estimator is that its name and symbol
are used inconsistently. The name ``sample variance'' can refer to
either $S^2$ or $S_{n-1}^2$, and the symbol $S^2$ is used
for either or both.

Here is a function that simulates the estimation game and tests
the performance of $S^2$ and $S_{n-1}^2$:

\begin{verbatim}
def Estimate2(n=7, m=1000):
    mu = 0
    sigma = 1

    estimates1 = []
    estimates2 = []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        biased = np.var(xs)
        unbiased = np.var(xs, ddof=1)
        estimates1.append(biased)
        estimates2.append(unbiased)

    print('mean error biased', MeanError(estimates1, sigma**2))
    print('mean error unbiased', MeanError(estimates2, sigma**2))
\end{verbatim}

Again, {\tt n} is the sample size and {\tt m} is the number of times
we play the game. {\tt np.var} computes $S^2$ by default and
$S_{n-1}^2$ if you provide the argument {\tt ddof=1}, which stands for
``delta degrees of freedom.'' I won't explain that term, but you can read
about it at
\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.
\index{degrees of freedom}

{\tt MeanError} computes the mean difference between the estimates
and the actual value:

\begin{verbatim}
def MeanError(estimates, actual):
    errors = [estimate-actual for estimate in estimates]
    return np.mean(errors)
\end{verbatim}

When I ran this code, the mean error for $S^2$ was -0.13. As
expected, this biased estimator tends to be too low. For $S_{n-1}^2$,
the mean error was 0.014, about 10 times smaller. As {\tt m}
increases, we expect the mean error for $S_{n-1}^2$ to approach 0.
\index{mean error}
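
This result is consistent with the standard expression for the
expectation of $S^2$ (stated here without proof):
%
\[ E[S^2] = \frac{n-1}{n} \sigma^2 \]
%
so the expected error is $-\sigma^2/n$, which is $-1/7 \approx -0.14$
for this simulation, close to the observed -0.13.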

Properties like MSE and bias are long-term expectations based on
many iterations of the estimation game. By running simulations like
the ones in this chapter, we can compare estimators and check whether
they have desired properties.
\index{biased estimator}
\index{estimator!biased}

But when you apply an estimator to real
data, you just get one estimate. It would not be meaningful to say
that the estimate is unbiased; being unbiased is a property of the
estimator, not the estimate.

After you choose an estimator with appropriate properties, and use it to
generate an estimate, the next step is to characterize the
uncertainty of the estimate, which is the topic of the next
section.


\section{Sampling distributions}
\label{gorilla}

Suppose you are a scientist studying gorillas in a wildlife
preserve. You want to know the average weight of the adult
female gorillas in the preserve. To weigh them, you have
to tranquilize them, which is dangerous, expensive, and possibly
harmful to the gorillas. But if it is important to obtain this
information, it might be acceptable to weigh a sample of 9
gorillas. Let's assume that the population of the preserve is
well known, so we can choose a representative sample of adult
females. We could use the sample mean, $\xbar$, to estimate the
unknown population mean, $\mu$.
\index{gorilla}
\index{population}
\index{sample}

Having weighed 9 female gorillas, you might find $\xbar=90$ kg and
sample standard deviation, $S=7.5$ kg. The sample mean
is an unbiased estimator of $\mu$, and in the long run it
minimizes MSE. So if you report a single
estimate that summarizes the results, you would report 90 kg.
\index{MSE}
\index{sample mean}
\index{biased estimator}
\index{estimator!biased}
\index{standard deviation}

But how confident should you be in this estimate? If you only weigh
$n=9$ gorillas out of a much larger population, you might be unlucky
and choose the 9 heaviest gorillas (or the 9 lightest ones) just by
chance. Variation in the estimate caused by random selection is
called {\bf sampling error}.
\index{sampling error}

To quantify sampling error, we can simulate the
sampling process with hypothetical values of $\mu$ and $\sigma$, and
see how much $\xbar$ varies.

Since we don't know the actual values of
$\mu$ and $\sigma$ in the population, we'll use the estimates
$\xbar$ and $S$.
So the question we answer is:
``If the actual values of $\mu$ and $\sigma$ were 90 kg and 7.5 kg,
and we ran the same experiment many times, how much would the
estimated mean, $\xbar$, vary?''

The following function answers that question:

\begin{verbatim}
def SimulateSample(mu=90, sigma=7.5, n=9, m=1000):
    means = []
    for j in range(m):
        xs = np.random.normal(mu, sigma, n)
        xbar = np.mean(xs)
        means.append(xbar)

    cdf = thinkstats2.Cdf(means)
    ci = cdf.Percentile(5), cdf.Percentile(95)
    stderr = RMSE(means, mu)
\end{verbatim}

{\tt mu} and {\tt sigma} are the {\em hypothetical\/} values of
the parameters. {\tt n} is the sample size, the number of
gorillas we measured. {\tt m} is the number of times we run
the simulation.
\index{gorilla}
\index{sample size}
\index{simulation}

\begin{figure}
% estimation.py
\centerline{\includegraphics[height=2.5in]{figs/estimation1.pdf}}
\caption{Sampling distribution of $\xbar$, with confidence interval.}
\label{estimation1}
\end{figure}

In each iteration, we choose {\tt n} values from a normal
distribution with the given parameters, and compute the sample mean,
{\tt xbar}. We run 1000 simulations and then compute the
distribution, {\tt cdf}, of the estimates. The result is shown in
Figure~\ref{estimation1}. This distribution is called the {\bf
sampling distribution} of the estimator. It shows how much the
estimates would vary if we ran the experiment over and over.
\index{sampling distribution}

The mean of the sampling distribution is pretty close
to the hypothetical value of $\mu$, which means that the experiment
yields the right answer, on average. After 1000 tries, the lowest
result is 82 kg, and the highest is 98 kg. This range suggests that
the estimate might be off by as much as 8 kg.

There are two common ways to summarize the sampling distribution:

\begin{itemize}

\item {\bf Standard error} (SE) is a measure of how far we expect the
estimate to be off, on average. For each simulated experiment, we
compute the error, $\xbar - \mu$, and then compute the root mean
squared error (RMSE). In this example, it is roughly 2.5 kg
(see the check after this list).
\index{standard error}

\item A {\bf confidence interval} (CI) is a range that includes a
given fraction of the sampling distribution. For example, the 90\%
confidence interval is the range from the 5th to the 95th
percentile. In this example, the 90\% CI is $(86, 94)$ kg.
\index{confidence interval}
\index{sampling distribution}

\end{itemize}
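
As a check on the standard error, for a sample of size $n$ drawn from
a distribution with standard deviation $\sigma$, the standard error of
the sample mean is
%
\[ SE = \frac{\sigma}{\sqrt{n}} = \frac{7.5}{\sqrt{9}} = 2.5 \mbox{ kg} \]
%
which agrees with the simulation.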

Standard errors and confidence intervals are the source of much confusion:

\begin{itemize}

\item People often confuse standard error and standard deviation.
Remember that standard deviation describes variability in a measured
quantity; in this example, the standard deviation of gorilla weight
is 7.5 kg. Standard error describes variability in an estimate. In
this example, the standard error of the mean, based on a sample of 9
measurements, is 2.5 kg.
\index{gorilla}
\index{standard deviation}

One way to remember the difference is that, as sample size
increases, standard error gets smaller; standard deviation does not.

\item People often think that there is a 90\% probability that the
actual parameter, $\mu$, falls in the 90\% confidence interval.
Sadly, that is not true. If you want to make a claim like that, you
have to use Bayesian methods (see my book, {\it Think Bayes\/}).
\index{Bayesian statistics}

The sampling distribution answers a different question: it gives you
a sense of how reliable an estimate is by telling you how much it
would vary if you ran the experiment again.
\index{sampling distribution}

\end{itemize}

It is important to remember that confidence intervals
and standard errors only quantify sampling error; that is,
error due to measuring only part of the population.
The sampling distribution does not account for other
sources of error, notably sampling bias and measurement error,
which are the topics of the next section.


\section{Sampling bias}

Suppose that instead of the weight of gorillas in a nature preserve,
you want to know the average weight of women in the city where you
live. It is unlikely that you would be allowed
to choose a representative sample of women and
weigh them.
\index{gorilla}
\index{adult weight}
\index{sampling bias}
\index{bias!sampling}
\index{measurement error}

A simple alternative would be
``telephone sampling;'' that is,
you could choose random numbers from the phone book, call and ask to
speak to an adult woman, and ask how much she weighs.
\index{telephone sampling}
\index{random number}

Telephone sampling has obvious limitations. For example, the sample
is limited to people whose telephone numbers are listed, so it
eliminates people without phones (who might be poorer than average)
and people with unlisted numbers (who might be richer). Also, if you
call home telephones during the day, you are less likely to sample
people with jobs. And if you only sample the person who answers the
phone, you are less likely to sample people who share a phone line.

If factors like income, employment, and household size are related
to weight---and it is plausible that they are---the results of your
survey would be affected one way or another. This problem is
called {\bf sampling bias} because it is a property of the sampling
process.
\index{sampling bias}

This sampling process is also vulnerable to self-selection, which is a
kind of sampling bias. Some people will refuse to answer the
question, and if the tendency to refuse is related to weight, that
would affect the results.
\index{self-selection}

Finally, if you ask people how much they weigh, rather than weighing
them, the results might not be accurate. Even helpful respondents
might round up or down if they are uncomfortable with their actual
weight. And not all respondents are helpful. These inaccuracies are
examples of {\bf measurement error}.
\index{measurement error}

When you report an estimated quantity, it is useful to report
standard error, or a confidence interval, or both, in order to
quantify sampling error. But it is also important to remember that
sampling error is only one source of error, and often it is not the
biggest.
\index{standard error}
\index{confidence interval}

6165
\section{Exponential distributions}
6166
\index{exponential distribution}
6167
\index{distribution!exponential}
6168
6169
Let's play one more round of the estimation game.
6170
{\em I'm thinking of a distribution.\/} It's an exponential distribution, and
6171
here's a sample:
6172
6173
{\tt [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]}
6174
6175
What do you think is the parameter, $\lambda$, of this distribution?
6176
\index{parameter}
6177
\index{mean}
6178
6179
\newcommand{\lamhat}{L}
6180
\newcommand{\lamhatmed}{L_m}
6181
6182
In general, the mean of an exponential distribution is $1/\lambda$,
6183
so working backwards, we might choose
6184
%
6185
\[ \lamhat = 1 / \xbar\]
6186
%
6187
$\lamhat$ is an
6188
estimator of $\lambda$. And not just any estimator; it is also the
6189
maximum likelihood estimator (see
6190
\url{http://wikipedia.org/wiki/Exponential_distribution#Maximum_likelihood}).
6191
So if you want to maximize your chance of guessing $\lambda$ exactly,
6192
$\lamhat$ is the way to go.
6193
\index{MLE}
6194
\index{maximum likelihood estimator}
6195
6196
But we know that $\xbar$ is not robust in the presence of outliers, so
6197
we expect $\lamhat$ to have the same problem.
6198
\index{robust}
6199
\index{outlier}
6200
\index{sample median}
6201
6202
We can choose an alternative based on the sample median.
6203
The median of an exponential distribution is $\ln(2) / \lambda$,
6204
so working backwards again, we can define an estimator
6205
%
6206
\[ \lamhatmed = \ln(2) / m \]
6207
%
6208
where $m$ is the sample median.
6209
\index{median}
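As a quick check, here is a sketch (mine, not part of {\tt
estimation.py}) that computes both estimates for the sample above,
using NumPy instead of {\tt thinkstats2.Median}; both come out near
0.12:

\begin{verbatim}
import numpy as np

sample = [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]
L = 1 / np.mean(sample)             # roughly 0.118
Lm = np.log(2) / np.median(sample)  # roughly 0.120
\end{verbatim}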
6210
6211
To test the performance of these estimators, we can simulate the
6212
sampling process:
6213
6214
\begin{verbatim}
def Estimate3(n=7, m=1000):
    lam = 2

    means = []
    medians = []
    for _ in range(m):
        xs = np.random.exponential(1.0/lam, n)
        L = 1 / np.mean(xs)
        Lm = math.log(2) / thinkstats2.Median(xs)
        means.append(L)
        medians.append(Lm)

    print('rmse L', RMSE(means, lam))
    print('rmse Lm', RMSE(medians, lam))
    print('mean error L', MeanError(means, lam))
    print('mean error Lm', MeanError(medians, lam))
\end{verbatim}
6232
6233
When I run this experiment with $\lambda=2$, the RMSE of $L$ is
6234
1.1. For the median-based estimator $L_m$, RMSE is 1.8. We can't
6235
tell from this experiment whether $L$ minimizes MSE, but at least
6236
it seems better than $L_m$.
6237
\index{MSE}
6238
\index{RMSE}
6239
6240
Sadly, it seems that both estimators are biased. For $L$ the mean
6241
error is 0.33; for $L_m$ it is 0.45. And neither converges to 0
6242
as {\tt m} increases.
6243
\index{biased estimator}
6244
\index{estimator!biased}
6245
6246
It turns out that $\xbar$ is an unbiased estimator of the mean
6247
of the distribution, $1 / \lambda$, but $L$ is not an unbiased
6248
estimator of $\lambda$.
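If you want to see this directly, here is a sketch (mine, not part of
{\tt estimation.py}) that computes the mean error of both quantities;
as the number of experiments grows, the mean error of $\xbar$ as an
estimate of $1/\lambda$ approaches 0, but the mean error of $L$ does
not:

\begin{verbatim}
import numpy as np

def MeanErrors(lam=2, n=7, m=100000):
    means = []    # estimates of 1/lam
    lams = []     # estimates of lam
    for _ in range(m):
        xs = np.random.exponential(1.0/lam, n)
        means.append(np.mean(xs))
        lams.append(1 / np.mean(xs))

    print('mean error of xbar', np.mean(means) - 1/lam)
    print('mean error of L', np.mean(lams) - lam)
\end{verbatim}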
6249
6250
6251
\section{Exercises}
6252
6253
For the following exercises, you might want to start with a copy of
6254
{\tt estimation.py}. Solutions are in \verb"chap08soln.py".
6255
6256
\begin{exercise}
6257
6258
In this chapter we used $\xbar$ and median to estimate $\mu$, and
found that $\xbar$ yields lower MSE.
Also, we used $S^2$ and $S_{n-1}^2$ to estimate $\sigma^2$, and found that
$S^2$ is biased and $S_{n-1}^2$ is unbiased.
6262
6263
Run similar experiments to see if $\xbar$ and median are biased estimates
6264
of $\mu$.
6265
Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
6266
\index{sample mean}
6267
\index{sample median}
6268
\index{estimator!biased}
6269
6270
\end{exercise}
6271
6272
6273
\begin{exercise}
6274
6275
Suppose you draw a sample with size $n=10$ from
6276
an exponential distribution with $\lambda=2$. Simulate
6277
this experiment 1000 times and plot the sampling distribution of
6278
the estimate $\lamhat$. Compute the standard error of the estimate
6279
and the 90\% confidence interval.
6280
\index{standard error}
6281
\index{confidence interval}
6282
\index{sampling distribution}
6283
6284
Repeat the experiment with a few different values of $n$ and make
6285
a plot of standard error versus $n$.
6286
\index{exponential distribution}
6287
\index{distribution!exponential}
6288
6289
6290
\end{exercise}
6291
6292
6293
\begin{exercise}
6294
6295
In games like hockey and soccer, the time between goals is
6296
roughly exponential. So you could estimate a team's goal-scoring rate
6297
by observing the number of goals they score in a game. This
6298
estimation process is a little different from sampling the time
6299
between goals, so let's see how it works.
6300
\index{hockey}
6301
\index{soccer}
6302
6303
Write a function that takes a goal-scoring rate, {\tt lam}, in goals
6304
per game, and simulates a game by generating the time between goals
6305
until the total time exceeds 1 game, then returns the number of goals
6306
scored.
6307
6308
Write another function that simulates many games, stores the
6309
estimates of {\tt lam}, then computes their mean error and RMSE.
6310
6311
Is this way of making an estimate biased? Plot the sampling
6312
distribution of the estimates and the 90\% confidence interval. What
6313
is the standard error? What happens to sampling error for increasing
6314
values of {\tt lam}?
6315
\index{estimator!biased}
6316
\index{biased estimator}
6317
\index{standard error}
6318
\index{confidence interval}
6319
6320
\end{exercise}
6321
6322
6323
\section{Glossary}
6324
6325
\begin{itemize}
6326
6327
\item estimation: The process of inferring the parameters of a distribution
6328
from a sample.
6329
\index{estimation}
6330
6331
\item estimator: A statistic used to estimate a parameter.
6332
\index{estimation}
6333
6334
\item mean squared error (MSE): A measure of estimation error.
6335
\index{mean squared error}
6336
\index{MSE}
6337
6338
\item root mean squared error (RMSE): The square root of MSE,
6339
a more meaningful representation of typical error magnitude.
6340
\index{mean squared error}
6341
\index{MSE}
6342
6343
\item maximum likelihood estimator (MLE): An estimator that computes the
6344
point estimate most likely to be correct.
6345
\index{MLE}
6346
\index{maximum likelihood estimator}
6347
6348
\item bias (of an estimator): The tendency of an estimator to be above or
6349
below the actual value of the parameter, when averaged over repeated
6350
experiments. \index{biased estimator}
6351
6352
\item sampling error: Error in an estimate due to the limited
6353
size of the sample and variation due to chance. \index{point estimation}
6354
6355
\item sampling bias: Error in an estimate due to a sampling process
6356
that is not representative of the population. \index{sampling bias}
6357
6358
\item measurement error: Error in an estimate due to inaccuracy collecting
6359
or recording data. \index{measurement error}
6360
6361
\item sampling distribution: The distribution of a statistic if an
6362
experiment is repeated many times. \index{sampling distribution}
6363
6364
\item standard error: The RMSE of an estimate,
6365
which quantifies variability due to sampling error (but not
6366
other sources of error).
6367
\index{standard error}
6368
6369
\item confidence interval: An interval that represents the expected
6370
range of an estimator if an experiment is repeated many times.
6371
\index{confidence interval} \index{interval!confidence}
6372
6373
\end{itemize}
6374
6375
6376
\chapter{Hypothesis testing}
6377
\label{testing}
6378
6379
The code for this chapter is in {\tt hypothesis.py}. For information
6380
about downloading and working with this code, see Section~\ref{code}.
6381
6382
\section{Classical hypothesis testing}
6383
\index{hypothesis testing}
6384
\index{apparent effect}
6385
6386
Exploring the data from the NSFG, we saw several ``apparent effects,''
6387
including differences between first babies and others.
6388
So far we have taken these effects at face value; in this chapter,
6389
we put them to the test.
6390
\index{National Survey of Family Growth}
6391
\index{NSFG}
6392
6393
The fundamental question we want to address is whether the effects
6394
we see in a sample are likely to appear in the larger population.
6395
For example, in the NSFG sample we see a difference in mean pregnancy
6396
length for first babies and others. We would like to know if
6397
that effect reflects a real difference for women
6398
in the U.S., or if it might appear in the sample by chance.
6399
\index{pregnancy length} \index{length!pregnancy}
6400
6401
There are several ways we could formulate this question, including
6402
Fisher null hypothesis testing, Neyman-Pearson decision theory, and
6403
Bayesian inference\footnote{For more about Bayesian inference, see the
6404
sequel to this book, {\it Think Bayes}.}. What I present here is a
6405
subset of all three that makes up most of what people use in practice,
6406
which I will call {\bf classical hypothesis testing}.
6407
\index{Bayesian inference}
6408
\index{null hypothesis}
6409
6410
The goal of classical hypothesis testing is to answer the question,
6411
``Given a sample and an apparent effect, what is the probability of
6412
seeing such an effect by chance?'' Here's how we answer that question:
6413
6414
\begin{itemize}
6415
6416
\item The first step is to quantify the size of the apparent effect by
6417
choosing a {\bf test statistic}. In the NSFG example, the apparent
6418
effect is a difference in pregnancy length between first babies and
6419
others, so a natural choice for the test statistic is the difference
6420
in means between the two groups.
6421
\index{test statistic}
6422
6423
\item The second step is to define a {\bf null hypothesis}, which is a
6424
model of the system based on the assumption that the apparent effect
6425
is not real. In the NSFG example the null hypothesis is that there
6426
is no difference between first babies and others; that is, that
6427
pregnancy lengths for both groups have the same distribution.
6428
\index{null hypothesis}
6429
\index{pregnancy length}
6430
\index{model}
6431
6432
\item The third step is to compute a {\bf p-value}, which is the
6433
probability of seeing the apparent effect if the null hypothesis is
6434
true. In the NSFG example, we would compute the actual difference
6435
in means, then compute the probability of seeing a
6436
difference as big, or bigger, under the null hypothesis.
6437
\index{p-value}
6438
6439
\item The last step is to interpret the result. If the p-value is
6440
low, the effect is said to be {\bf statistically significant}, which
6441
means that it is unlikely to have occurred by chance. In that case
6442
we infer that the effect is more likely to appear in the larger
6443
population. \index{statistically significant} \index{significant}
6444
6445
\end{itemize}
6446
6447
The logic of this process is similar to a proof by
6448
contradiction. To prove a mathematical statement, A, you assume
6449
temporarily that A is false. If that assumption leads to a
6450
contradiction, you conclude that A must actually be true.
6451
\index{contradiction, proof by}
6452
\index{proof by contradiction}
6453
6454
Similarly, to test a hypothesis like, ``This effect is real,'' we
6455
assume, temporarily, that it is not. That's the null hypothesis.
6456
Based on that assumption, we compute the probability of the apparent
6457
effect. That's the p-value. If the p-value is low, we
6458
conclude that the null hypothesis is unlikely to be true.
6459
\index{p-value}
6460
\index{null hypothesis}
6461
6462
6463
\section{HypothesisTest}
6464
\label{hypotest}
6465
\index{mean!difference in}
6466
6467
{\tt thinkstats2} provides {\tt HypothesisTest}, a
6468
class that represents the structure of a classical hypothesis
6469
test. Here is the definition:
6470
\index{HypothesisTest}
6471
6472
\begin{verbatim}
class HypothesisTest(object):

    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)

    def PValue(self, iters=1000):
        self.test_stats = [self.TestStatistic(self.RunModel())
                           for _ in range(iters)]

        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters

    def TestStatistic(self, data):
        raise UnimplementedMethodException()

    def MakeModel(self):
        pass

    def RunModel(self):
        raise UnimplementedMethodException()
\end{verbatim}
6496
6497
{\tt HypothesisTest} is an abstract parent class that provides
6498
complete definitions for some methods and place-keepers for others.
6499
Child classes based on {\tt HypothesisTest} inherit \verb"__init__"
6500
and {\tt PValue} and provide {\tt TestStatistic},
6501
{\tt RunModel}, and optionally {\tt MakeModel}.
6502
\index{HypothesisTest}
6503
6504
\verb"__init__" takes the data in whatever form is appropriate. It
6505
calls {\tt MakeModel}, which builds a representation of the null
6506
hypothesis, then passes the data to {\tt TestStatistic}, which
6507
computes the size of the effect in the sample.
6508
\index{test statistic}
6509
\index{null hypothesis}
6510
6511
{\tt PValue} computes the probability of the apparent effect under
6512
the null hypothesis. It takes as a parameter {\tt iters}, which is
6513
the number of simulations to run. The first line generates simulated
6514
data, computes test statistics, and stores them in
6515
\verb"test_stats".
6516
The result is
6517
the fraction of elements in \verb"test_stats" that
6518
exceed or equal the observed test statistic, {\tt self.actual}.
6519
\index{simulation}
6520
6521
As a simple example\footnote{Adapted from MacKay, {\it Information
6522
Theory, Inference, and Learning Algorithms}, 2003.}, suppose we
6523
toss a coin 250 times and see 140 heads and 110 tails. Based on this
6524
result, we might suspect that the coin is biased; that is, more likely
6525
to land heads. To test this hypothesis, we compute the
6526
probability of seeing such a difference if the coin is actually
6527
fair:
6528
\index{biased coin}
6529
\index{MacKay, David}
6530
6531
\begin{verbatim}
class CoinTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        heads, tails = data
        test_stat = abs(heads - tails)
        return test_stat

    def RunModel(self):
        heads, tails = self.data
        n = heads + tails
        sample = [random.choice('HT') for _ in range(n)]
        hist = thinkstats2.Hist(sample)
        data = hist['H'], hist['T']
        return data
\end{verbatim}
6547
6548
The parameter, {\tt data}, is a pair of
6549
integers: the number of heads and tails. The test statistic is
6550
the absolute difference between them, so {\tt self.actual}
6551
is 30.
6552
\index{HypothesisTest}
6553
6554
{\tt RunModel} simulates coin tosses assuming that the coin is
6555
actually fair. It generates a sample of 250 tosses, uses Hist
6556
to count the number of heads and tails, and returns a pair of
6557
integers.
6558
\index{Hist}
6559
\index{model}
6560
6561
Now all we have to do is instantiate {\tt CoinTest} and call
6562
{\tt PValue}:
6563
6564
\begin{verbatim}
6565
ct = CoinTest((140, 110))
6566
pvalue = ct.PValue()
6567
\end{verbatim}
6568
6569
The result is about 0.07, which means that if the coin is
6570
fair, we expect to see a difference as big as 30 about 7\% of the
6571
time.
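As a sanity check, you could also compute this probability
analytically from the binomial distribution. Here is a sketch that
uses SciPy (an assumption on my part; it is not used in {\tt
hypothesis.py}):

\begin{verbatim}
from scipy.stats import binom

# two-sided: P(heads >= 140) + P(heads <= 110) for Binomial(250, 0.5)
p = binom.sf(139, 250, 0.5) + binom.cdf(110, 250, 0.5)
\end{verbatim}

The result is close to the simulated p-value.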
6572
6573
How should we interpret this result? By convention,
6574
5\% is the threshold of statistical significance. If the
6575
p-value is less than 5\%, the effect is considered significant; otherwise
6576
it is not.
6577
\index{p-value}
6578
\index{statistically significant} \index{significant}
6579
6580
But the choice of 5\% is arbitrary, and (as we will see later) the
6581
p-value depends on the choice of the test statistic and
6582
the model of the null hypothesis. So p-values should not be considered
6583
precise measurements.
6584
\index{null hypothesis}
6585
6586
I recommend interpreting p-values according to their order of
6587
magnitude: if the p-value is less than 1\%, the effect is unlikely to
6588
be due to chance; if it is greater than 10\%, the effect can plausibly
6589
be explained by chance. P-values between 1\% and 10\% should be
6590
considered borderline. So in this example I conclude that the
data do not provide strong evidence either way about whether the
coin is biased.
6592
6593
6594
\section{Testing a difference in means}
6595
\label{testdiff}
6596
\index{mean!difference in}
6597
6598
One of the most common effects to test is a difference in mean
6599
between two groups. In the NSFG data, we saw that the mean pregnancy
6600
length for first babies is slightly longer, and the mean birth weight
6601
is slightly smaller. Now we will see if those effects are
6602
statistically significant.
6603
\index{National Survey of Family Growth}
6604
\index{NSFG}
6605
\index{pregnancy length}
6606
\index{length!pregnancy}
6607
6608
For these examples, the null hypothesis is that the distributions
6609
for the two groups are the same. One way to model the null
6610
hypothesis is by {\bf permutation}; that is, we can take values
6611
for first babies and others and shuffle them, treating
6612
the two groups as one big group:
6613
\index{null hypothesis}
6614
\index{permutation}
6615
\index{model}
6616
6617
\begin{verbatim}
class DiffMeansPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = abs(group1.mean() - group2.mean())
        return test_stat

    def MakeModel(self):
        group1, group2 = self.data
        self.n, self.m = len(group1), len(group2)
        self.pool = np.hstack((group1, group2))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
\end{verbatim}
6635
6636
{\tt data} is a pair of sequences, one for each
6637
group. The test statistic is the absolute difference in the means.
6638
\index{HypothesisTest}
6639
6640
{\tt MakeModel} records the sizes of the groups, {\tt n} and
6641
{\tt m}, and combines the groups into one NumPy
6642
array, {\tt self.pool}.
6643
\index{NumPy}
6644
6645
{\tt RunModel} simulates the null hypothesis by shuffling the
6646
pooled values and splitting them into two groups with sizes {\tt n}
6647
and {\tt m}. As always, the return value from {\tt RunModel} has
6648
the same format as the observed data.
6649
\index{null hypothesis}
6650
\index{model}
6651
6652
To test the difference in pregnancy length, we run:
6653
6654
\begin{verbatim}
6655
live, firsts, others = first.MakeFrames()
6656
data = firsts.prglngth.values, others.prglngth.values
6657
ht = DiffMeansPermute(data)
6658
pvalue = ht.PValue()
6659
\end{verbatim}
6660
6661
{\tt MakeFrames} reads the NSFG data and returns DataFrames
6662
representing all live births, first babies, and others.
6663
We extract pregnancy lengths as NumPy arrays, pass them as
6664
data to {\tt DiffMeansPermute}, and compute the p-value. The
6665
result is about 0.17, which means that we expect to see a difference
6666
as big as the observed effect about 17\% of the time. So
6667
this effect is not statistically significant.
6668
\index{DataFrame}
6669
\index{p-value}
6670
\index{significant} \index{statistically significant}
6671
\index{pregnancy length}
6672
6673
\begin{figure}
6674
% hypothesis.py
6675
\centerline{\includegraphics[height=2.5in]{figs/hypothesis1.pdf}}
6676
\caption{CDF of difference in mean pregnancy length under the null
6677
hypothesis.}
6678
\label{hypothesis1}
6679
\end{figure}
6680
6681
{\tt HypothesisTest} provides {\tt PlotCdf}, which plots the
6682
distribution of the test statistic and a gray line indicating
6683
the observed effect size:
6684
\index{thinkplot}
6685
\index{HypothesisTest}
6686
\index{Cdf}
6687
\index{effect size}
6688
6689
\begin{verbatim}
ht.PlotCdf()
thinkplot.Show(xlabel='test statistic',
               ylabel='CDF')
\end{verbatim}
6694
6695
Figure~\ref{hypothesis1} shows the result. The CDF intersects the
6696
observed difference at 0.83, which is the complement of the p-value,
6697
0.17.
6698
\index{p-value}
6699
6700
If we run the same analysis with birth weight, the computed p-value
6701
is 0; after 1000 attempts,
6702
the simulation never yields an effect
6703
as big as the observed difference, 0.12 lbs. So we would
6704
report $p < 0.001$, and
6705
conclude that the difference in birth weight is statistically
6706
significant.
6707
\index{birth weight}
6708
\index{weight!birth}
6709
\index{significant} \index{statistically significant}
6710
6711
6712
\section{Other test statistics}
6713
6714
Choosing the best test statistic depends on what question you are
6715
trying to address. For example, if the relevant question is whether
6716
pregnancy lengths are different for first
6717
babies, then it makes sense to test the absolute difference in means,
6718
as we did in the previous section.
6719
\index{test statistic}
6720
\index{pregnancy length}
6721
6722
If we had some reason to think that first babies are likely
6723
to be late, then we would not take the absolute value of the difference;
6724
instead we would use this test statistic:
6725
6726
\begin{verbatim}
class DiffMeansOneSided(DiffMeansPermute):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = group1.mean() - group2.mean()
        return test_stat
\end{verbatim}
6734
6735
{\tt DiffMeansOneSided} inherits {\tt MakeModel} and {\tt RunModel}
6736
from {\tt DiffMeansPermute}; the only difference is that
6737
{\tt TestStatistic} does not take the absolute value of the
6738
difference. This kind of test is called {\bf one-sided} because
6739
it only counts one side of the distribution of differences. The
6740
previous test, using both sides, is {\bf two-sided}.
6741
\index{one-sided test}
6742
\index{two-sided test}
6743
6744
For this version of the test, the p-value is 0.09. In general
6745
the p-value for a one-sided test is about half the p-value for
6746
a two-sided test, depending on the shape of the distribution.
6747
\index{p-value}
6748
6749
The one-sided hypothesis, that first babies are born late, is more
6750
specific than the two-sided hypothesis, so the p-value is smaller.
6751
But even for the stronger hypothesis, the difference is
6752
not statistically significant.
6753
\index{significant} \index{statistically significant}
6754
6755
We can use the same framework to test for a difference in standard
6756
deviation. In Section~\ref{visualization}, we saw some evidence that
6757
first babies are more likely to be early or late, and less likely to
6758
be on time. So we might hypothesize that the standard deviation is
6759
higher. Here's how we can test that:
6760
\index{standard deviation}
6761
6762
\begin{verbatim}
class DiffStdPermute(DiffMeansPermute):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = group1.std() - group2.std()
        return test_stat
\end{verbatim}
6770
6771
This is a one-sided test because the hypothesis is that the standard
6772
deviation for first babies is higher, not just different. The p-value
6773
is 0.09, which is not statistically significant.
6774
\index{p-value}
6775
\index{permutation}
6776
\index{significant} \index{statistically significant}
6777
6778
6779
\section{Testing a correlation}
6780
\label{corrtest}
6781
6782
This framework can also test correlations. For example, in the NSFG
6783
data set, the correlation between birth weight and mother's age is
6784
about 0.07. It seems like older mothers have heavier babies. But
6785
could this effect be due to chance?
6786
\index{correlation}
6787
\index{test statistic}
6788
6789
For the test statistic, I use
6790
Pearson's correlation, but Spearman's would work as well.
6791
If we had reason to expect positive correlation, we would do a
6792
one-sided test. But since we have no such reason, I'll
6793
do a two-sided test using the absolute value of correlation.
6794
\index{Pearson coefficient of correlation}
6795
\index{Spearman coefficient of correlation}
6796
6797
The null hypothesis is that there is no correlation between mother's
6798
age and birth weight. By shuffling the observed values, we can
6799
simulate a world where the distributions of age and
6800
birth weight are the same, but where the variables are unrelated:
6801
\index{birth weight}
6802
\index{weight!birth}
6803
\index{null hypothesis}
6804
6805
\begin{verbatim}
class CorrelationPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        xs, ys = data
        test_stat = abs(thinkstats2.Corr(xs, ys))
        return test_stat

    def RunModel(self):
        xs, ys = self.data
        xs = np.random.permutation(xs)
        return xs, ys
\end{verbatim}
6818
6819
{\tt data} is a pair of sequences. {\tt TestStatistic} computes the
6820
absolute value of Pearson's correlation. {\tt RunModel} shuffles the
6821
{\tt xs} and returns simulated data.
6822
\index{HypothesisTest}
6823
\index{permutation}
6824
\index{Pearson coefficient of correlation}
6825
6826
Here's the code that reads the data and runs the test:
6827
6828
\begin{verbatim}
6829
live, firsts, others = first.MakeFrames()
6830
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
6831
data = live.agepreg.values, live.totalwgt_lb.values
6832
ht = CorrelationPermute(data)
6833
pvalue = ht.PValue()
6834
\end{verbatim}
6835
6836
I use {\tt dropna} with the {\tt subset} argument to drop rows
6837
that are missing either of the variables we need.
6838
\index{dropna}
6839
\index{NaN}
6840
\index{missing values}
6841
6842
The actual correlation is 0.07. The computed p-value is 0; after 1000
6843
iterations the largest simulated correlation is 0.04. So although the
6844
observed correlation is small, it is statistically significant.
6845
\index{p-value}
6846
\index{significant} \index{statistically significant}
6847
6848
This example is a reminder that ``statistically significant'' does not
6849
always mean that an effect is important, or significant in practice.
6850
It only means that it is unlikely to have occurred by chance.
6851
6852
6853
\section{Testing proportions}
6854
\label{casino}
6855
\index{chi-squared test}
6856
6857
Suppose you run a casino and you suspect that a customer is
6858
using a crooked die; that
6859
is, one that has been modified to make one of the faces more
6860
likely than the others. You apprehend the alleged
6861
cheater and confiscate the die, but now you have to prove that it
6862
is crooked. You roll the die 60 times and get the following results:
6863
\index{casino}
6864
\index{dice}
6865
\index{crooked die}
6866
6867
\begin{center}
6868
\begin{tabular}{|l|c|c|c|c|c|c|}
6869
\hline
6870
Value & 1 & 2 & 3 & 4 & 5 & 6 \\
6871
\hline
6872
Frequency & 8 & 9 & 19 & 5 & 8 & 11 \\
6873
\hline
6874
\end{tabular}
6875
\end{center}
6876
6877
On average you expect each value to appear 10 times. In this
6878
dataset, the value 3 appears more often than expected, and the value 4
6879
appears less often. But are these differences statistically
6880
significant?
6881
\index{frequency}
6882
\index{significant} \index{statistically significant}
6883
6884
To test this hypothesis, we can compute the expected frequency for
6885
each value, the difference between the expected and observed
6886
frequencies, and the total absolute difference. In this
6887
example, we expect each side to come up 10 times out of 60; the
6888
deviations from this expectation are -2, -1, 9, -5, -2, and 1; so the
6889
total absolute difference is 20. How often would we see such a
6890
difference by chance?
6891
\index{deviation}
6892
6893
Here's a version of {\tt HypothesisTest} that answers that question:
6894
\index{HypothesisTest}
6895
6896
\begin{verbatim}
class DiceTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum(abs(observed - expected))
        return test_stat

    def RunModel(self):
        n = sum(self.data)
        values = [1, 2, 3, 4, 5, 6]
        rolls = np.random.choice(values, n, replace=True)
        hist = thinkstats2.Hist(rolls)
        freqs = hist.Freqs(values)
        return freqs
\end{verbatim}
6914
6915
The data are represented as a list of frequencies: the observed
6916
values are {\tt [8, 9, 19, 5, 8, 11]}; the expected frequencies
6917
are all 10. The test statistic is the sum of the absolute differences.
6918
\index{frequency}
6919
6920
The null hypothesis is that the die is fair, so we simulate that by
6921
drawing random samples from {\tt values}. {\tt RunModel} uses {\tt
6922
Hist} to compute and return the list of frequencies.
6923
\index{Hist}
6924
\index{null hypothesis}
6925
\index{model}
6926
6927
The p-value for this data is 0.13, which means that if the die is
6928
fair we expect to see the observed total deviation, or more, about
6929
13\% of the time. So the apparent effect is not statistically
6930
significant.
6931
\index{p-value}
6932
\index{deviation}
6933
\index{significant} \index{statistically significant}
6934
6935
6936
\section{Chi-squared tests}
6937
\label{casino2}
6938
6939
In the previous section we used total deviation as the test statistic.
6940
But for testing proportions it is more common to use the chi-squared
6941
statistic:
6942
%
6943
\[ \goodchi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
6944
%
6945
%% TODO: Consider using upper case chi, which is more strictly correct,
6946
%% but harder to distinguish from X.
6947
%
6948
where $O_i$ are the observed frequencies and $E_i$ are the expected
6949
frequencies. Here's the Python code:
6950
\index{chi-squared test}
6951
\index{chi-squared statistic}
6952
\index{test statistic}
6953
6954
\begin{verbatim}
class DiceChiTest(DiceTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum((observed - expected)**2 / expected)
        return test_stat
\end{verbatim}
6964
6965
Squaring the deviations (rather than taking absolute values) gives
6966
more weight to large deviations. Dividing through by {\tt expected}
6967
standardizes the deviations, although in this case it has no effect
6968
because the expected frequencies are all equal.
6969
\index{deviation}
6970
6971
The p-value using the chi-squared statistic is 0.04,
6972
substantially smaller than what we got using total deviation, 0.13.
6973
If we take the 5\% threshold seriously, we would consider this effect
6974
statistically significant. But considering the two tests together, I
6975
would say that the results are borderline. I would not rule out the
6976
possibility that the die is crooked, but I would not convict the
6977
accused cheater.
6978
\index{p-value}
6979
\index{significant} \index{statistically significant}
6980
6981
This example demonstrates an important point: the p-value depends
6982
on the choice of test statistic and the model of the null hypothesis,
6983
and sometimes these choices determine whether an effect is
6984
statistically significant or not.
6985
\index{null hypothesis}
6986
\index{model}
6987
6988
6989
\section{First babies again}
6990
6991
Earlier in this chapter we looked at pregnancy lengths for first
6992
babies and others, and concluded that the apparent differences in
6993
mean and standard deviation are not statistically significant. But in
6994
Section~\ref{visualization}, we saw several apparent differences
6995
in the distribution of pregnancy length, especially in the range from
6996
35 to 43 weeks. To see whether those differences are statistically
6997
significant, we can use a test based on a chi-squared statistic.
6998
\index{standard deviation}
6999
\index{statistically significant} \index{significant}
7000
\index{pregnancy length}
7001
7002
The code combines elements from previous examples:
7003
\index{HypothesisTest}
7004
7005
\begin{verbatim}
class PregLengthTest(thinkstats2.HypothesisTest):

    def MakeModel(self):
        firsts, others = self.data
        self.n = len(firsts)
        self.pool = np.hstack((firsts, others))

        pmf = thinkstats2.Pmf(self.pool)
        self.values = range(35, 44)
        self.expected_probs = np.array(pmf.Probs(self.values))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
\end{verbatim}
7022
7023
The data are represented as two lists of pregnancy lengths. The null
7024
hypothesis is that both samples are drawn from the same distribution.
7025
{\tt MakeModel} models that distribution by pooling the two
7026
samples using {\tt hstack}. Then {\tt RunModel} generates
7027
simulated data by shuffling the pooled sample and splitting it
7028
into two parts.
7029
\index{null hypothesis}
7030
\index{model}
7031
\index{hstack}
7032
\index{pregnancy length}
7033
7034
{\tt MakeModel} also defines {\tt values}, which is the
7035
range of weeks we'll use, and \verb"expected_probs",
7036
which is the probability of each value in the pooled distribution.
7037
7038
Here's the code that computes the test statistic:
7039
7040
\begin{verbatim}
# class PregLengthTest:

    def TestStatistic(self, data):
        firsts, others = data
        stat = self.ChiSquared(firsts) + self.ChiSquared(others)
        return stat

    def ChiSquared(self, lengths):
        hist = thinkstats2.Hist(lengths)
        observed = np.array(hist.Freqs(self.values))
        expected = self.expected_probs * len(lengths)
        stat = sum((observed - expected)**2 / expected)
        return stat
\end{verbatim}
7055
7056
{\tt TestStatistic} computes the chi-squared statistic for
7057
first babies and others, and adds them.
7058
\index{chi-squared statistic}
7059
7060
{\tt ChiSquared} takes a sequence of pregnancy lengths, computes
7061
its histogram, and computes {\tt observed}, which is a list of
7062
frequencies corresponding to {\tt self.values}.
7063
To compute the list of expected frequencies, it multiplies the
7064
pre-computed probabilities, \verb"expected_probs", by the sample
7065
size. It returns the chi-squared statistic, {\tt stat}.
7066
7067
For the NSFG data the total chi-squared statistic is 102, which
7068
doesn't mean much by itself. But after 1000 iterations, the largest
7069
test statistic generated under the null hypothesis is 32. We conclude
7070
that the observed chi-squared statistic is unlikely under the null
7071
hypothesis, so the apparent effect is statistically significant.
7072
\index{null hypothesis}
7073
\index{statistically significant} \index{significant}
7074
7075
This example demonstrates a limitation of chi-squared tests: they
7076
indicate that there is a difference between the two groups,
7077
but they don't say anything specific about what the difference is.
7078
7079
7080
\section{Errors}
7081
\index{error}
7082
7083
In classical hypothesis testing, an effect is considered statistically
7084
significant if the p-value is below some threshold, commonly 5\%.
7085
This procedure raises two questions:
7086
\index{p-value}
7087
\index{threshold}
7088
\index{statistically significant} \index{significant}
7089
7090
\begin{itemize}
7091
7092
\item If the effect is actually due to chance, what is the probability
7093
that we will wrongly consider it significant? This
7094
probability is the {\bf false positive rate}.
7095
\index{false positive}
7096
7097
\item If the effect is real, what is the chance that the hypothesis
7098
test will fail? This probability is the {\bf false negative rate}.
7099
\index{false negative}
7100
7101
\end{itemize}
7102
7103
The false positive rate is relatively easy to compute: if the
7104
threshold is 5\%, the false positive rate is 5\%. Here's why:
7105
7106
\begin{itemize}
7107
7108
\item If there is no real effect, the null hypothesis is true, so we
7109
can compute the distribution of the test statistic by simulating the
7110
null hypothesis. Call this distribution $\CDF_T$.
7111
\index{null hypothesis}
7112
\index{CDF}
7113
7114
\item Each time we run an experiment, we get a test statistic, $t$,
which is drawn from $\CDF_T$. Then we compute a p-value, which is
the probability that a random value from $\CDF_T$ exceeds $t$,
so that's $1 - \CDF_T(t)$.

\item The p-value is less than 5\% if $\CDF_T(t)$ is greater
than 95\%; that is, if $t$ exceeds the 95th percentile.
And how often does a value chosen from $\CDF_T$ exceed
the 95th percentile? 5\% of the time.
7123
7124
\end{itemize}
7125
7126
So if you perform one hypothesis test with a 5\% threshold, you expect
7127
a false positive 1 time in 20.
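You can check this reasoning by simulation. Here is a sketch (mine,
not part of {\tt hypothesis.py}) that draws both ``groups'' from the
same distribution, so the null hypothesis is actually true, and counts
how often {\tt DiffMeansPermute} reports a p-value below 5\%; the
result should be near 5\%:

\begin{verbatim}
import numpy as np

def FalsePosRate(num_runs=1000, n=100):
    count = 0
    for _ in range(num_runs):
        group1 = np.random.normal(0, 1, n)
        group2 = np.random.normal(0, 1, n)
        ht = DiffMeansPermute((group1, group2))
        if ht.PValue(iters=101) < 0.05:
            count += 1

    return count / num_runs
\end{verbatim}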
7128
7129
7130
\section{Power}
7131
\label{power}
7132
7133
The false negative rate is harder to compute because it depends on
7134
the actual effect size, and normally we don't know that.
7135
One option is to compute a rate
7136
conditioned on a hypothetical effect size.
7137
\index{effect size}
7138
7139
For example, if we assume that the observed difference between groups
7140
is accurate, we can use the observed samples as a model of the
7141
population and run hypothesis tests with simulated data:
7142
\index{model}
7143
7144
\begin{verbatim}
def FalseNegRate(data, num_runs=100):
    group1, group2 = data
    count = 0

    for i in range(num_runs):
        sample1 = thinkstats2.Resample(group1)
        sample2 = thinkstats2.Resample(group2)

        ht = DiffMeansPermute((sample1, sample2))
        pvalue = ht.PValue(iters=101)
        if pvalue > 0.05:
            count += 1

    return count / num_runs
\end{verbatim}
7160
7161
{\tt FalseNegRate} takes data in the form of two sequences, one for
7162
each group. Each time through the loop, it simulates an experiment by
7163
drawing a random sample from each group and running a hypothesis test.
7164
Then it checks the result and counts the number of false negatives.
7165
\index{Resample}
7166
\index{permutation}
7167
7168
{\tt Resample} takes a sequence and draws a sample with the same
7169
length, with replacement:
7170
\index{replacement}
7171
7172
\begin{verbatim}
def Resample(xs):
    return np.random.choice(xs, len(xs), replace=True)
\end{verbatim}
7176
7177
Here's the code that tests pregnancy lengths:
7178
7179
\begin{verbatim}
7180
live, firsts, others = first.MakeFrames()
7181
data = firsts.prglngth.values, others.prglngth.values
7182
neg_rate = FalseNegRate(data)
7183
\end{verbatim}
7184
7185
The result is about 70\%, which means that if the actual difference in
7186
mean pregnancy length is 0.078 weeks, we expect an experiment with this
7187
sample size to yield a negative test 70\% of the time.
7188
\index{pregnancy length}
7189
7190
This result is often presented the other way around: if the actual
7191
difference is 0.078 weeks, we should expect a positive test only 30\%
7192
of the time. This ``correct positive rate'' is called the {\bf power}
7193
of the test, or sometimes ``sensitivity''. It reflects the ability of
7194
the test to detect an effect of a given size.
7195
\index{power}
7196
\index{sensitivity}
7197
\index{correct positive}
7198
7199
In this example, the test had only a 30\% chance of yielding a
7200
positive result (again, assuming that the difference is 0.078 weeks).
7201
As a rule of thumb, a power of 80\% is considered acceptable, so
7202
we would say that this test was ``underpowered.''
7203
\index{underpowered}
7204
7205
In general a negative hypothesis test does not imply that there is no
7206
difference between the groups; instead it suggests that if there is a
7207
difference, it is too small to detect with this sample size.
7208
7209
7210
\section{Replication}
7211
\label{replication}
7212
7213
The hypothesis testing process I demonstrated in this chapter is not,
7214
strictly speaking, good practice.
7215
7216
First, I performed multiple tests. If you run one hypothesis test,
7217
the chance of a false positive is about 1 in 20, which might be
7218
acceptable. But if you run 20 tests, you should expect at least one
false positive most of the time.
7220
\index{multiple tests}
7221
7222
Second, I used the same dataset for exploration and testing. If
7223
you explore a large dataset, find a surprising effect, and then test
7224
whether it is significant, you have a good chance of generating a
7225
false positive.
7226
\index{statistically significant} \index{significant}
7227
7228
To compensate for multiple tests, you can adjust the p-value
7229
threshold (see
7230
\url{https://en.wikipedia.org/wiki/Holm-Bonferroni_method}). Or you
7231
can address both problems by partitioning the data, using one set for
7232
exploration and the other for testing.
7233
\index{p-value}
7234
\index{Holm-Bonferroni method}
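For reference, here is a minimal sketch of the Holm-Bonferroni
adjustment described at the link above (my own summary, not code from
this book): sort the p-values, compare the smallest to $\alpha/k$,
the next to $\alpha/(k-1)$, and so on, stopping at the first one that
fails.

\begin{verbatim}
def HolmBonferroni(pvalues, alpha=0.05):
    """Returns a list of booleans: True where the test is significant."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break
    return reject
\end{verbatim}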
7235
7236
In some fields these practices are required or at least encouraged.
7237
But it is also common to address these problems implicitly by
7238
replicating published results. Typically the first paper to report a
7239
new result is considered exploratory. Subsequent papers that
7240
replicate the result with new data are considered confirmatory.
7241
\index{confirmatory result}
7242
7243
As it happens, we have an opportunity to replicate the results in this
7244
chapter. The first edition of this book is based on Cycle 6 of the
7245
NSFG, which was released in 2002. In October 2011, the CDC released
7246
additional data based on interviews conducted from 2006--2010. {\tt
7247
nsfg2.py} contains code to read and clean this data. In the new
7248
dataset:
7249
\index{NSFG}
7250
7251
\begin{itemize}
7252
7253
\item The difference in mean pregnancy length is
7254
0.16 weeks and statistically significant with $p < 0.001$ (compared
7255
to 0.078 weeks in the original dataset).
7256
\index{statistically significant} \index{significant}
7257
\index{pregnancy length}
7258
7259
\item The difference in birth weight is 0.17 pounds with $p < 0.001$
7260
(compared to 0.12 lbs in the original dataset).
7261
\index{birth weight}
7262
\index{weight!birth}
7263
7264
\item The correlation between birth weight and mother's age is
7265
0.08 with $p < 0.001$ (compared to 0.07).
7266
7267
\item The chi-squared test is statistically significant with
7268
$p < 0.001$ (as it was in the original).
7269
7270
\end{itemize}
7271
7272
In summary, all of the effects that were statistically significant
7273
in the original dataset were replicated in the new dataset, and the
7274
difference in pregnancy length, which was not significant in the
7275
original, is bigger in the new dataset and significant.
7276
7277
7278
\section{Exercises}
7279
7280
A solution to these exercises is in \verb"chap09soln.py".
7281
7282
\begin{exercise}
7283
As sample size increases, the power of a hypothesis test increases,
7284
which means it is more likely to be positive if the effect is real.
7285
Conversely, as sample size decreases, the test is less likely to
7286
be positive even if the effect is real.
7287
\index{sample size}
7288
7289
To investigate this behavior, run the tests in this chapter with
7290
different subsets of the NSFG data. You can use {\tt thinkstats2.SampleRows}
7291
to select a random subset of the rows in a DataFrame.
7292
\index{National Survey of Family Growth}
7293
\index{NSFG}
7294
\index{DataFrame}
7295
7296
What happens to the p-values of these tests as sample size decreases?
7297
What is the smallest sample size that yields a positive test?
7298
\index{p-value}
7299
\end{exercise}
7300
7301
7302
7303
\begin{exercise}
7304
7305
In Section~\ref{testdiff}, we simulated the null hypothesis by
7306
permutation; that is, we treated the observed values as if they
7307
represented the entire population, and randomly assigned the
7308
members of the population to the two groups.
7309
\index{null hypothesis}
7310
\index{permutation}
7311
7312
An alternative is to use the sample to estimate the distribution for
7313
the population, then draw a random sample from that distribution.
7314
This process is called {\bf resampling}. There are several ways to
7315
implement resampling, but one of the simplest is to draw a sample
7316
with replacement from the observed values, as in Section~\ref{power}.
7317
\index{resampling}
7318
\index{replacement}
7319
7320
Write a class named {\tt DiffMeansResample} that inherits from
7321
{\tt DiffMeansPermute} and overrides {\tt RunModel} to implement
7322
resampling, rather than permutation.
7323
\index{permutation}
7324
7325
Use this model to test the differences in pregnancy length and
7326
birth weight. How much does the model affect the results?
7327
\index{model}
7328
\index{birth weight}
7329
\index{weight!birth}
7330
\index{pregnancy length}
7331
7332
\end{exercise}
7333
7334
7335
\section{Glossary}
7336
7337
\begin{itemize}
7338
7339
\item hypothesis testing: The process of determining whether an apparent
7340
effect is statistically significant.
7341
\index{hypothesis testing}
7342
7343
\item test statistic: A statistic used to quantify an effect size.
7344
\index{test statistic}
7345
\index{effect size}
7346
7347
\item null hypothesis: A model of a system based on the assumption that
7348
an apparent effect is due to chance.
7349
\index{null hypothesis}
7350
7351
\item p-value: The probability that an effect could occur by chance.
7352
\index{p-value}
7353
7354
\item statistically significant: An effect is statistically
7355
significant if it is unlikely to occur by chance.
7356
\index{significant} \index{statistically significant}
7357
7358
\item permutation test: A way to compute p-values by generating
7359
permutations of an observed dataset.
7360
\index{permutation test}
7361
7362
\item resampling test: A way to compute p-values by generating
7363
samples, with replacement, from an observed dataset.
7364
\index{resampling test}
7365
7366
\item two-sided test: A test that asks, ``What is the chance of an effect
7367
as big as the observed effect, positive or negative?''
7368
7369
\item one-sided test: A test that asks, ``What is the chance of an effect
7370
as big as the observed effect, and with the same sign?''
7371
\index{one-sided test}
7372
\index{two-sided test}
7373
\index{test!one-sided}
7374
\index{test!two-sided}
7375
7376
\item chi-squared test: A test that uses the chi-squared statistic as
7377
the test statistic.
7378
\index{chi-squared test}
7379
7380
\item false positive: The conclusion that an effect is real when it is not.
7381
\index{false positive}
7382
7383
\item false negative: The conclusion that an effect is due to chance when it
7384
is not.
7385
\index{false negative}
7386
7387
\item power: The probability of a positive test if the null hypothesis
7388
is false.
7389
\index{power}
7390
\index{null hypothesis}
7391
7392
\end{itemize}
7393
7394
7395
\chapter{Linear least squares}
7396
\label{linear}
7397
7398
The code for this chapter is in {\tt linear.py}. For information
7399
about downloading and working with this code, see Section~\ref{code}.
7400
7401
7402
\section{Least squares fit}
7403
7404
Correlation coefficients measure the strength and sign of a
7405
relationship, but not the slope. There are several ways to estimate
7406
the slope; the most common is a {\bf linear least squares fit}. A
7407
``linear fit'' is a line intended to model the relationship between
7408
variables. A ``least squares'' fit is one that minimizes the mean
7409
squared error (MSE) between the line and the data.
7410
\index{least squares fit}
7411
\index{linear least squares}
7412
\index{model}
7413
7414
Suppose we have a sequence of points, {\tt ys}, that we want to
7415
express as a function of another sequence {\tt xs}. If there is a
7416
linear relationship between {\tt xs} and {\tt ys} with intercept {\tt
7417
inter} and slope {\tt slope}, we expect each {\tt y[i]} to be
7418
{\tt inter + slope * x[i]}. \index{residuals}
7419
7420
But unless the correlation is perfect, this prediction is only
7421
approximate. The vertical deviation from the line, or {\bf residual},
7422
is
7423
\index{deviation}
7424
7425
\begin{verbatim}
7426
res = ys - (inter + slope * xs)
7427
\end{verbatim}
7428
7429
The residuals might be due to random factors like measurement error,
7430
or non-random factors that are unknown. For example, if we are
7431
trying to predict weight as a function of height, unknown factors
7432
might include diet, exercise, and body type.
7433
\index{slope}
7434
\index{intercept}
7435
\index{measurement error}
7436
7437
If we get the parameters {\tt inter} and {\tt slope} wrong, the residuals
7438
get bigger, so it makes intuitive sense that the parameters we want
7439
are the ones that minimize the residuals.
7440
\index{parameter}
7441
7442
We might try to minimize the absolute value of the
7443
residuals, or their squares, or their cubes; but the most common
7444
choice is to minimize the sum of squared residuals,
7445
{\tt sum(res**2)}.
7446
7447
Why? There are three good reasons and one less important one:
7448
7449
\begin{itemize}
7450
7451
\item Squaring has the feature of treating positive and
7452
negative residuals the same, which is usually what we want.
7453
7454
\item Squaring gives more weight to large residuals, but not
7455
so much weight that the largest residual always dominates.
7456
7457
\item If the residuals are uncorrelated and normally distributed with
7458
mean 0 and constant (but unknown) variance, then the least squares
7459
fit is also the maximum likelihood estimator of {\tt inter} and {\tt
7460
slope}. See
7461
\url{https://en.wikipedia.org/wiki/Linear_regression}. \index{MLE}
7462
\index{maximum likelihood estimator}
7463
\index{correlation}
7464
7465
\item The values of {\tt inter} and {\tt slope} that minimize
7466
the squared residuals can be computed efficiently.
7467
7468
\end{itemize}
7469
7470
The last reason made sense when computational efficiency was more
7471
important than choosing the method most appropriate to the problem
7472
at hand. That's no longer the case, so it is worth considering
7473
whether squared residuals are the right thing to minimize.
7474
\index{computational methods}
7475
\index{squared residuals}
7476
7477
For example, if you are using {\tt xs} to predict values of {\tt ys},
7478
guessing too high might be better (or worse) than guessing too low.
7479
In that case you might want to compute some cost function for each
7480
residual, and minimize total cost, {\tt sum(cost(res))}.
7481
However, computing a least squares fit is quick, easy and often good
7482
enough.
7483
\index{cost function}
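If you do want to minimize a different cost function, one option is a
numerical optimizer. Here is a sketch using SciPy (an illustration of
the idea, not something {\tt thinkstats2} provides), assuming {\tt xs}
and {\tt ys} are NumPy arrays and that guessing too high is three
times as costly as guessing too low:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def AsymmetricFit(xs, ys, penalty=3):
    def TotalCost(params):
        inter, slope = params
        res = ys - (inter + slope * xs)
        # a negative residual means the prediction was too high
        cost = np.where(res < 0, penalty * res**2, res**2)
        return np.sum(cost)

    result = minimize(TotalCost, x0=[np.mean(ys), 0])
    return result.x    # inter, slope
\end{verbatim}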
7484
7485
7486
\section{Implementation}
7487
7488
{\tt thinkstats2} provides simple functions that demonstrate
7489
linear least squares:
7490
\index{LeastSquares}
7491
7492
\begin{verbatim}
def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope
\end{verbatim}
7502
7503
{\tt LeastSquares} takes sequences
7504
{\tt xs} and {\tt ys} and returns the estimated parameters {\tt inter}
7505
and {\tt slope}.
7506
For details on how it works, see
7507
\url{http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares}.
7508
\index{parameter}
7509
7510
{\tt thinkstats2} also provides {\tt FitLine}, which takes {\tt inter}
7511
and {\tt slope} and returns the fitted line for a sequence
7512
of {\tt xs}.
7513
\index{FitLine}
7514
7515
\begin{verbatim}
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys
\end{verbatim}
7521
7522
We can use these functions to compute the least squares fit for
7523
birth weight as a function of mother's age.
7524
\index{birth weight}
7525
\index{weight!birth}
7526
\index{age}
7527
7528
\begin{verbatim}
7529
live, firsts, others = first.MakeFrames()
7530
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
7531
ages = live.agepreg
7532
weights = live.totalwgt_lb
7533
7534
inter, slope = thinkstats2.LeastSquares(ages, weights)
7535
fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)
7536
\end{verbatim}
7537
7538
The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year.
7539
These values are hard to interpret in this form: the intercept is
7540
the expected weight of a baby whose mother is 0 years old, which
7541
doesn't make sense in context, and the slope is too small to
7542
grasp easily.
7543
\index{slope}
7544
\index{intercept}
7545
\index{dropna}
7546
\index{NaN}
7547
7548
Instead of presenting the intercept at $x=0$, it
7549
is often helpful to present the intercept at the mean of $x$. In
7550
this case the mean age is about 25 years and the mean baby weight
for a 25-year-old mother is 7.3 pounds. The slope is 0.27 ounces
per year, or 0.17 pounds per decade.
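Those numbers are just arithmetic on the estimates; continuing the
code above:

\begin{verbatim}
mean_age = ages.mean()                     # about 25 years
weight_at_mean = inter + slope * mean_age  # about 7.3 lbs
slope_oz_per_year = slope * 16             # about 0.27 ounces per year
\end{verbatim}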
7553
7554
\begin{figure}
7555
% linear.py
7556
\centerline{\includegraphics[height=2.5in]{figs/linear1.pdf}}
7557
\caption{Scatter plot of birth weight and mother's age with
7558
a linear fit.}
7559
\label{linear1}
7560
\end{figure}
7561
7562
Figure~\ref{linear1} shows a scatter plot of birth weight and age
7563
along with the fitted line. It's a good idea to look at a figure like
7564
this to assess whether the relationship is linear and whether the
7565
fitted line seems like a good model of the relationship.
7566
\index{birth weight}
7567
\index{weight!birth}
7568
\index{scatter plot}
7569
\index{plot!scatter}
7570
\index{model}
7571
7572
7573
\section{Residuals}
7574
\label{residuals}
7575
7576
Another useful test is to plot the residuals.
7577
{\tt thinkstats2} provides a function that computes residuals:
7578
\index{residuals}
7579
7580
\begin{verbatim}
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
\end{verbatim}
7587
7588
{\tt Residuals} takes sequences {\tt xs} and {\tt ys} and
7589
estimated parameters {\tt inter} and {\tt slope}. It returns
7590
the differences between the actual values and the fitted line.
7591
7592
\begin{figure}
7593
% linear.py
7594
\centerline{\includegraphics[height=2.5in]{figs/linear2.pdf}}
7595
\caption{Residuals of the linear fit.}
7596
\label{linear2}
7597
\end{figure}
7598
7599
To visualize the residuals, I group respondents by age and compute
7600
percentiles in each group, as we saw in Section~\ref{characterizing}.
7601
Figure~\ref{linear2} shows the 25th, 50th and 75th percentiles of
7602
the residuals for each age group. The median is near zero, as
7603
expected, and the interquartile range is about 2 pounds. So if we
7604
know the mother's age, we can guess the baby's weight within a pound,
7605
about 50\% of the time.
7606
\index{visualization}
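Here is a sketch of that procedure, following the pattern from
Section~\ref{characterizing} (the variable names are mine):

\begin{verbatim}
live['residual'] = thinkstats2.Residuals(ages, weights, inter, slope)

bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)

age_means = [group.agepreg.mean() for _, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]

for percent in [75, 50, 25]:
    ys = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(age_means, ys, label=label)
\end{verbatim}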
7607
7608
Ideally these lines should be flat, indicating that the residuals are
7609
random, and parallel, indicating that the variance of the residuals is
7610
the same for all age groups. In fact, the lines are close to
7611
parallel, so that's good; but they have some curvature, indicating
7612
that the relationship is nonlinear. Nevertheless, the linear fit
7613
is a simple model that is probably good enough for some purposes.
7614
\index{model}
7615
\index{nonlinear}
7616
7617
7618
\section{Estimation}
7619
\label{regest}
7620
7621
The parameters {\tt slope} and {\tt inter} are estimates based on a
7622
sample; like other estimates, they are vulnerable to sampling bias,
7623
measurement error, and sampling error. As discussed in
7624
Chapter~\ref{estimation}, sampling bias is caused by non-representative
7625
sampling, measurement error is caused by errors in collecting
7626
and recording data, and sampling error is the result of measuring a
7627
sample rather than the entire population.
7628
\index{sampling bias}
7629
\index{bias!sampling}
7630
\index{measurement error}
7631
\index{sampling error}
7632
\index{estimation}
7633
7634
To assess sampling error, we ask, ``If we run this experiment again,
7635
how much variability do we expect in the estimates?'' We can
7636
answer this question by running simulated experiments and computing
7637
sampling distributions of the estimates.
7638
\index{sampling error}
7639
\index{sampling distribution}
7640
7641
I simulate the experiments by resampling the data; that is, I treat
7642
the observed pregnancies as if they were the entire population
7643
and draw samples, with replacement, from the observed sample.
7644
\index{simulation}
7645
\index{replacement}
7646
7647
\begin{verbatim}
def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = thinkstats2.ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = thinkstats2.LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes
\end{verbatim}
7660
7661
{\tt SamplingDistributions} takes a DataFrame with one row per live
7662
birth, and {\tt iters}, the number of experiments to simulate. It
7663
uses {\tt ResampleRows} to resample the observed pregnancies. We've
7664
already seen {\tt SampleRows}, which chooses random rows from a
7665
DataFrame. {\tt thinkstats2} also provides {\tt ResampleRows}, which
7666
returns a sample the same size as the original:
7667
\index{DataFrame}
7668
\index{resampling}
7669
7670
\begin{verbatim}
7671
def ResampleRows(df):
7672
return SampleRows(df, len(df), replace=True)
7673
\end{verbatim}
7674
7675
After resampling, we use the simulated sample to estimate parameters.
7676
The result is two sequences: the estimated intercepts and estimated
7677
slopes.
7678
\index{parameter}
7679
7680
I summarize the sampling distributions by printing the standard
7681
error and confidence interval:
7682
\index{sampling distribution}
7683
7684
\begin{verbatim}
7685
def Summarize(estimates, actual=None):
7686
mean = thinkstats2.Mean(estimates)
7687
stderr = thinkstats2.Std(estimates, mu=actual)
7688
cdf = thinkstats2.Cdf(estimates)
7689
ci = cdf.ConfidenceInterval(90)
7690
print('mean, SE, CI', mean, stderr, ci)
7691
\end{verbatim}
7692
7693
{\tt Summarize} takes a sequence of estimates and the actual value.
7694
It prints the mean of the estimates, the standard error and
7695
a 90\% confidence interval.
7696
\index{standard error}
7697
\index{confidence interval}
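Putting the pieces together, a typical run might look like this (the
number of iterations, 1001, is my choice):

\begin{verbatim}
inters, slopes = SamplingDistributions(live, iters=1001)
Summarize(inters)
Summarize(slopes)
\end{verbatim}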
7698
7699
For the intercept, the mean estimate is 6.83, with standard error
7700
0.07 and 90\% confidence interval (6.71, 6.94). The estimated slope, in
7701
more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220).
7702
There is almost a factor of two between the low and high ends of
7703
this CI, so it should be considered a rough estimate.
7704
7705
%inter 6.83039697331 6.83174035366
7706
%SE, CI 0.0699814482068 (6.7146843084406846, 6.9447797068631871)
7707
%slope 0.0174538514718 0.0173840926936
7708
%SE, CI 0.00276116142884 (0.012635074392201724, 0.021975282350381781)
7709
7710
To visualize the sampling error of the estimate, we could plot
7711
all of the fitted lines, or for a less cluttered representation,
7712
plot a 90\% confidence interval for each age. Here's the code:
7713
7714
\begin{verbatim}
7715
def PlotConfidenceIntervals(xs, inters, slopes,
7716
percent=90, **options):
7717
fys_seq = []
7718
for inter, slope in zip(inters, slopes):
7719
fxs, fys = thinkstats2.FitLine(xs, inter, slope)
7720
fys_seq.append(fys)
7721
7722
p = (100 - percent) / 2
7723
percents = p, 100 - p
7724
low, high = thinkstats2.PercentileRows(fys_seq, percents)
7725
thinkplot.FillBetween(fxs, low, high, **options)
7726
\end{verbatim}
7727
7728
{\tt xs} is the sequence of mothers' ages. {\tt inters} and {\tt slopes}

7729
are the estimated parameters generated by {\tt SamplingDistributions}.
7730
{\tt percent} indicates which confidence interval to plot.
7731
7732
{\tt PlotConfidenceIntervals} generates a fitted line for each pair
7733
of {\tt inter} and {\tt slope} and stores the results in a sequence,
7734
\verb"fys_seq". Then it uses {\tt PercentileRows} to select the
7735
upper and lower percentiles of {\tt y} for each value of {\tt x}.
7736
For a 90\% confidence interval, it selects the 5th and 95th percentiles.
7737
{\tt FillBetween} draws a polygon that fills the space between two
7738
lines.
7739
\index{thinkplot}
7740
\index{FillBetween}
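To produce a figure like Figure~\ref{linear3}, we might call it once for
each interval, using the observed ages as {\tt xs} (a sketch; the plotting
options are my assumptions):

\begin{verbatim}
xs = live.agepreg
PlotConfidenceIntervals(xs, inters, slopes, percent=90,
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(xs, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')
\end{verbatim}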
7741
7742
\begin{figure}
7743
% linear.py
7744
\centerline{\includegraphics[height=2.5in]{figs/linear3.pdf}}
7745
\caption{50\% and 90\% confidence intervals showing variability in the
7746
fitted line due to sampling error of {\tt inter} and {\tt slope}.}
7747
\label{linear3}
7748
\end{figure}
7749
7750
Figure~\ref{linear3} shows the 50\% and 90\% confidence
7751
intervals for curves fitted to birth weight as a function of
7752
mother's age.
7753
The vertical width of the region represents the effect of
7754
sampling error; the effect is smaller for values near the mean and
7755
larger for the extremes.
7756
7757
7758
\section{Goodness of fit}
7759
\label{goodness}
7760
\index{goodness of fit}
7761
7762
There are several ways to measure the quality of a linear model, or
7763
{\bf goodness of fit}. One of the simplest is the standard deviation
7764
of the residuals.
7765
\index{standard deviation}
7766
\index{model}
7767
7768
If you use a linear model to make predictions, {\tt Std(res)}
7769
is the root mean squared error (RMSE) of your predictions. For
7770
example, if you use mother's age to guess birth weight, the RMSE of
7771
your guess would be 1.40 lbs.
7772
\index{birth weight}
7773
\index{weight!birth}
7774
7775
If you guess birth weight without knowing the mother's age, the RMSE
7776
of your guess is {\tt Std(ys)}, which is 1.41 lbs. So in this
7777
example, knowing a mother's age does not improve the predictions
7778
substantially.
7779
\index{prediction}
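Here is a sketch of that comparison, assuming {\tt ages} and {\tt weights}
are the same columns used to fit the line:

\begin{verbatim}
inter, slope = thinkstats2.LeastSquares(ages, weights)
res = thinkstats2.Residuals(ages, weights, inter, slope)

print('Std(ys)', thinkstats2.Std(weights))   # about 1.41
print('Std(res)', thinkstats2.Std(res))      # about 1.40
\end{verbatim}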
7780
7781
Another way to measure goodness of fit is the {\bf
7782
coefficient of determination}, usually denoted $R^2$ and
7783
called ``R-squared'':
7784
\index{coefficient of determination}
7785
\index{r-squared}
7786
7787
\begin{verbatim}
7788
def CoefDetermination(ys, res):
7789
return 1 - Var(res) / Var(ys)
7790
\end{verbatim}
7791
7792
{\tt Var(res)} is the MSE of your guesses using the model,
7793
{\tt Var(ys)} is the MSE without it. So their ratio is the fraction
7794
of MSE that remains if you use the model, and $R^2$ is the fraction
7795
of MSE the model eliminates.
7796
\index{MSE}
7797
7798
For birth weight and mother's age, $R^2$ is 0.0047, which means
7799
that mother's age predicts about half of 1\% of variance in
7800
birth weight.
7801
7802
There is a simple relationship between the coefficient of
7803
determination and Pearson's coefficient of correlation: $R^2 = \rho^2$.
7804
For example, if $\rho$ is 0.8 or -0.8, $R^2 = 0.64$.
7805
\index{Pearson coefficient of correlation}
7806
7807
Although $\rho$ and $R^2$ are often used to quantify the strength of a
7808
relationship, they are not easy to interpret in terms of predictive
7809
power. In my opinion, {\tt Std(res)} is the best representation
7810
of the quality of prediction, especially if it is presented
7811
in relation to {\tt Std(ys)}.
7812
\index{coefficient of determination}
7813
\index{r-squared}
7814
7815
For example, when people talk about the validity of the SAT
7816
(a standardized test used for college admission in the U.S.), they
7817
often talk about correlations between SAT scores and other measures of
7818
intelligence.
7819
\index{SAT}
7820
\index{IQ}
7821
7822
According to one study, there is a Pearson correlation of
7823
$\rho=0.72$ between total SAT scores and IQ scores, which sounds like
7824
a strong correlation. But $R^2 = \rho^2 = 0.52$, so SAT scores
7825
account for only 52\% of variance in IQ.
7826
7827
IQ scores are normalized with {\tt Std(ys) = 15}, so
7828
7829
\begin{verbatim}
>>> import math
>>> var_ys = 15**2
>>> rho = 0.72
>>> r2 = rho**2
>>> var_res = (1 - r2) * var_ys
>>> std_res = math.sqrt(var_res)
>>> std_res
10.4096
\end{verbatim}
7837
7838
So using SAT score to predict IQ reduces RMSE from 15 points to 10.4
7839
points. A correlation of 0.72 yields a reduction in RMSE of only
7840
31\%.
7841
7842
If you see a correlation that looks impressive, remember that $R^2$ is
7843
a better indicator of reduction in MSE, and reduction in RMSE is a
7844
better indicator of predictive power.
7845
\index{coefficient of determination}
7846
\index{r-squared}
7847
\index{prediction}
7848
7849
7850
\section{Testing a linear model}
7851
7852
The effect of mother's age on birth weight is small, and has little
7853
predictive power. So is it possible that the apparent relationship
7854
is due to chance? There are several ways we might test the
7855
results of a linear fit.
7856
\index{birth weight}
7857
\index{weight!birth}
7858
\index{model}
7859
\index{linear model}
7860
7861
One option is to test whether the apparent reduction in MSE is due to
7862
chance. In that case, the test statistic is $R^2$ and the null
7863
hypothesis is that there is no relationship between the variables. We
7864
can simulate the null hypothesis by permutation, as in
7865
Section~\ref{corrtest}, when we tested the correlation between
7866
mother's age and birth weight. In fact, because $R^2 = \rho^2$, a
7867
one-sided test of $R^2$ is equivalent to a two-sided test of $\rho$.
7868
We've already done that test, and found $p < 0.001$, so we conclude
7869
that the apparent relationship between mother's age and birth weight
7870
is statistically significant.
7871
\index{null hypothesis}
7872
\index{permutation}
7873
\index{coefficient of determination}
7874
\index{r-squared}
7875
\index{significant} \index{statistically significant}
7876
7877
Another approach is to test whether the apparent slope is due to chance.
7878
The null hypothesis is that the slope is actually zero; in that case
7879
we can model the birth weights as random variations around their mean.
7880
Here's a HypothesisTest for this model:
7881
\index{HypothesisTest}
7882
\index{model}
7883
7884
\begin{verbatim}
7885
class SlopeTest(thinkstats2.HypothesisTest):
7886
7887
def TestStatistic(self, data):
7888
ages, weights = data
7889
_, slope = thinkstats2.LeastSquares(ages, weights)
7890
return slope
7891
7892
def MakeModel(self):
7893
_, weights = self.data
7894
self.ybar = weights.mean()
7895
self.res = weights - self.ybar
7896
7897
def RunModel(self):
7898
ages, _ = self.data
7899
weights = self.ybar + np.random.permutation(self.res)
7900
return ages, weights
7901
\end{verbatim}
7902
7903
The data are represented as sequences of ages and weights. The
7904
test statistic is the slope estimated by {\tt LeastSquares}.
7905
The model of the null hypothesis is represented by the mean weight
7906
of all babies and the deviations from the mean. To
7907
generate simulated data, we permute the deviations and add them to
7908
the mean.
7909
\index{deviation}
7910
\index{null hypothesis}
7911
\index{permutation}
7912
7913
Here's the code that runs the hypothesis test:
7914
7915
\begin{verbatim}
7916
live, firsts, others = first.MakeFrames()
7917
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
7918
ht = SlopeTest((live.agepreg, live.totalwgt_lb))
7919
pvalue = ht.PValue()
7920
\end{verbatim}
7921
7922
The p-value is less than $0.001$, so although the estimated
7923
slope is small, it is unlikely to be due to chance.
7924
\index{p-value}
7925
\index{dropna}
7926
\index{NaN}
7927
7928
Estimating the p-value by simulating the null hypothesis is strictly
7929
correct, but there is a simpler alternative. Remember that we already
7930
computed the sampling distribution of the slope, in
7931
Section~\ref{regest}. To do that, we assumed that the observed slope
7932
was correct and simulated experiments by resampling.
7933
\index{null hypothesis}
7934
7935
Figure~\ref{linear4} shows the sampling distribution of the
7936
slope, from Section~\ref{regest}, and the distribution of slopes
7937
generated under the null hypothesis. The sampling distribution
7938
is centered about the estimated slope, 0.017 lbs/year, and the slopes
7939
under the null hypothesis are centered around 0; but other than
7940
that, the distributions are identical. The distributions are
7941
also symmetric, for reasons we will see in Section~\ref{CLT}.
7942
\index{symmetric}
7943
\index{sampling distribution}
7944
7945
\begin{figure}
7946
% linear.py
7947
\centerline{\includegraphics[height=2.5in]{figs/linear4.pdf}}
7948
\caption{The sampling distribution of the estimated
7949
slope and the distribution of slopes
7950
generated under the null hypothesis. The vertical lines are at 0
7951
and the observed slope, 0.017 lbs/year.}
7952
\label{linear4}
7953
\end{figure}
7954
7955
So we could estimate the p-value two ways:
7956
\index{p-value}
7957
7958
\begin{itemize}
7959
7960
\item Compute the probability that the slope under the null
7961
hypothesis exceeds the observed slope.
7962
\index{null hypothesis}
7963
7964
\item Compute the probability that the slope in the sampling
7965
distribution falls below 0. (If the estimated slope were negative,
7966
we would compute the probability that the slope in the sampling
7967
distribution exceeds 0.)
7968
7969
\end{itemize}
7970
7971
The second option is easier because we normally want to compute the
7972
sampling distribution of the parameters anyway. And it is a good
7973
approximation unless the sample size is small {\em and\/} the
7974
distribution of residuals is skewed. Even then, it is usually good
7975
enough, because p-values don't have to be precise.
7976
\index{skewness}
7977
\index{parameter}
7978
7979
Here's the code that estimates the p-value of the slope using the
7980
sampling distribution:
7981
\index{sampling distribution}
7982
7983
\begin{verbatim}
7984
inters, slopes = SamplingDistributions(live, iters=1001)
7985
slope_cdf = thinkstats2.Cdf(slopes)
7986
pvalue = slope_cdf[0]
7987
\end{verbatim}
7988
7989
Again, we find $p < 0.001$.
7990
7991
7992
\section{Weighted resampling}
7993
\label{weighted}
7994
7995
So far we have treated the NSFG data as if it were a representative
7996
sample, but as I mentioned in Section~\ref{nsfg}, it is not. The
7997
survey deliberately oversamples several groups in order to
7998
improve the chance of getting statistically significant results; that
7999
is, in order to improve the power of tests involving these groups.
8000
\index{significant} \index{statistically significant}
8001
8002
This survey design is useful for many purposes, but it means that we
8003
cannot use the sample to estimate values for the general
8004
population without accounting for the sampling process.
8005
8006
For each respondent, the NSFG data includes a variable called {\tt
8007
finalwgt}, which is the number of people in the general population
8008
the respondent represents. This value is called a {\bf sampling
8009
weight}, or just ``weight.''
8010
\index{sampling weight}
8011
\index{weight}
8012
\index{weighted resampling}
8013
\index{resampling!weighted}
8014
8015
As an example, if you survey 100,000 people in a country of 300
8016
million, each respondent represents 3,000 people. If you oversample
8017
one group by a factor of 2, each person in the oversampled
8018
group would have a lower weight, about 1,500.
8019
8020
To correct for oversampling, we can use resampling; that is, we
8021
can draw samples from the survey using probabilities proportional
8022
to sampling weights. Then, for any quantity we want to estimate, we can
8023
generate sampling distributions, standard errors, and confidence
8024
intervals. As an example, I will estimate mean birth weight with
8025
and without sampling weights.
8026
\index{standard error}
8027
\index{confidence interval}
8028
\index{birth weight}
8029
\index{weight!birth}
8030
\index{sampling distribution}
8031
\index{oversampling}
8032
8033
In Section~\ref{regest}, we saw {\tt ResampleRows}, which chooses
8034
rows from a DataFrame, giving each row the same probability.
8035
Now we need to do the same thing using probabilities
8036
proportional to sampling weights.
8037
{\tt ResampleRowsWeighted} takes a DataFrame, resamples rows according
8038
to the weights in {\tt finalwgt}, and returns a DataFrame containing
8039
the resampled rows:
8040
\index{DataFrame}
8041
\index{resampling}
8042
8043
\begin{verbatim}
8044
def ResampleRowsWeighted(df, column='finalwgt'):
8045
weights = df[column]
8046
cdf = Cdf(dict(weights))
8047
indices = cdf.Sample(len(weights))
8048
sample = df.loc[indices]
8049
return sample
8050
\end{verbatim}
8051
8052
{\tt weights} is a Series; converting it to a dictionary makes
8053
a map from the indices to the weights. In {\tt cdf} the values
8054
are indices and the probabilities are proportional to the
8055
weights.
8056
8057
{\tt indices} is a sequence of row indices; {\tt sample} is a
8058
DataFrame that contains the selected rows. Since we sample with
8059
replacement, the same row might appear more than once. \index{Cdf}
8060
\index{replacement}
8061
8062
Now we can compare the effect of resampling with and without
8063
weights. Without weights, we generate the sampling distribution
8064
like this:
8065
\index{sampling distribution}
8066
8067
\begin{verbatim}
8068
estimates = [ResampleRows(live).totalwgt_lb.mean()
8069
for _ in range(iters)]
8070
\end{verbatim}
8071
8072
With weights, it looks like this:
8073
8074
\begin{verbatim}
8075
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
8076
for _ in range(iters)]
8077
\end{verbatim}
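Either way, we can summarize the resulting sampling distribution with the
{\tt Summarize} function from Section~\ref{regest}:

\begin{verbatim}
Summarize(estimates)   # prints mean, standard error, and 90% CI
\end{verbatim}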
8078
8079
The following table summarizes the results:
8080
8081
\begin{center}
8082
\begin{tabular}{|l|c|c|c|}
8083
\hline
8084
& mean birth & standard & 90\% CI \\
8085
& weight (lbs) & error & \\
8086
\hline
8087
Unweighted & 7.27 & 0.014 & (7.24, 7.29) \\
8088
Weighted & 7.35 & 0.014 & (7.32, 7.37) \\
8089
\hline
8090
\end{tabular}
8091
\end{center}
8092
8093
%mean 7.26580789518
8094
%stderr 0.0141683527792
8095
%ci (7.2428565501217079, 7.2890814917127074)
8096
%mean 7.34778034718
8097
%stderr 0.0142738972319
8098
%ci (7.3232804012858885, 7.3704916897506925)
8099
8100
In this example, the effect of weighting is small but non-negligible.
8101
The difference in estimated means, with and without weighting, is
8102
about 0.08 pounds, or 1.3 ounces. This difference is substantially
8103
larger than the standard error of the estimate, 0.014 pounds, which
8104
implies that the difference is not due to chance.
8105
\index{standard error}
8106
\index{confidence interval}
8107
8108
8109
\section{Exercises}
8110
8111
A solution to this exercise is in \verb"chap10soln.ipynb".
8112
8113
\begin{exercise}
8114
8115
Using the data from the BRFSS, compute the linear least squares
8116
fit for log(weight) versus height.
8117
How would you best present the estimated parameters for a model
8118
like this where one of the variables is log-transformed?
8119
If you were trying to guess
8120
someone's weight, how much would it help to know their height?
8121
\index{Behavioral Risk Factor Surveillance System}
8122
\index{BRFSS}
8123
\index{model}
8124
8125
Like the NSFG, the BRFSS oversamples some groups and provides
8126
a sampling weight for each respondent. In the BRFSS data, the variable
8127
name for these weights is {\tt finalwt}.
8128
Use resampling, with and without weights, to estimate the mean height
8129
of respondents in the BRFSS, the standard error of the mean, and a
8130
90\% confidence interval. How much does correct weighting affect the
8131
estimates?
8132
\index{confidence interval}
8133
\index{standard error}
8134
\index{oversampling}
8135
\index{sampling weight}
8136
\end{exercise}
8137
8138
8139
\section{Glossary}
8140
8141
\begin{itemize}
8142
8143
\item linear fit: a line intended to model the relationship between
8144
variables. \index{linear fit}
8145
8146
\item least squares fit: A model of a dataset that minimizes the
8147
sum of squares of the residuals.
8148
\index{least squares fit}
8149
8150
\item residual: The deviation of an actual value from a model.
8151
\index{residuals}
8152
8153
\item goodness of fit: A measure of how well a model fits data.
8154
\index{goodness of fit}
8155
8156
\item coefficient of determination: A statistic intended to
8157
quantify goodness of fit.
8158
\index{coefficient of determination}
8159
8160
\item sampling weight: A value associated with an observation in a
8161
sample that indicates what part of the population it represents.
8162
\index{sampling weight}
8163
8164
\end{itemize}
8165
8166
8167
8168
\chapter{Regression}
8169
\label{regression}
8170
8171
The linear least squares fit in the previous chapter is an example of
8172
{\bf regression}, which is the more general problem of fitting any
8173
kind of model to any kind of data. This use of the term ``regression''
8174
is a historical accident; it is only indirectly related to the
8175
original meaning of the word.
8176
\index{model}
8177
\index{regression}
8178
8179
The goal of regression analysis is to describe the relationship
8180
between one set of variables, called the {\bf dependent variables},
8181
and another set of variables, called independent or {\bf
8182
explanatory variables}.
8183
\index{explanatory variable}
8184
\index{dependent variable}
8185
8186
In the previous chapter we used mother's age as an explanatory
8187
variable to predict birth weight as a dependent variable. When there
8188
is only one dependent and one explanatory variable, that's {\bf
8189
simple regression}. In this chapter, we move on to {\bf multiple
8190
regression}, with more than one explanatory variable. If there is
8191
more than one dependent variable, that's multivariate
8192
regression.
8193
\index{birth weight}
8194
\index{weight!birth}
8195
\index{simple regression}
8196
\index{multiple regression}
8197
8198
If the relationship between the dependent and explanatory variable
8199
is linear, that's {\bf linear regression}. For example,
8200
if the dependent variable is $y$ and the explanatory variables
8201
are $x_1$ and $x_2$, we would write the following linear
8202
regression model:
8203
%
8204
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
8205
%
8206
where $\beta_0$ is the intercept, $\beta_1$ is the parameter
8207
associated with $x_1$, $\beta_2$ is the parameter associated with
8208
$x_2$, and $\eps$ is the residual due to random variation or other
8209
unknown factors.
8210
\index{regression model}
8211
\index{linear regression}
8212
8213
Given a sequence of values for $y$ and sequences for $x_1$ and $x_2$,
8214
we can find the parameters, $\beta_0$, $\beta_1$, and $\beta_2$, that
8215
minimize the sum of $\eps^2$. This process is called
8216
{\bf ordinary least squares}. The computation is similar to {\tt
8217
thinkstats2.LeastSquares}, but generalized to deal with more than one
8218
explanatory variable. You can find the details at
8219
\url{https://en.wikipedia.org/wiki/Ordinary_least_squares}
8220
\index{explanatory variable}
8221
\index{ordinary least squares}
8222
\index{parameter}
8223
8224
The code for this chapter is in {\tt regression.py}. For information
8225
about downloading and working with this code, see Section~\ref{code}.
8226
8227
\section{StatsModels}
8228
\label{statsmodels}
8229
8230
In the previous chapter I presented {\tt thinkstats2.LeastSquares}, an
8231
implementation of simple linear regression intended to be easy to
8232
read. For multiple regression we'll switch to StatsModels, a Python
8233
package that provides several forms of regression and other
8234
analyses. If you are using Anaconda, you already have StatsModels;
8235
otherwise you might have to install it.
8236
\index{Anaconda}
8237
8238
As an example, I'll run the model from the previous chapter with
8239
StatsModels:
8240
\index{StatsModels}
8241
\index{model}
8242
8243
\begin{verbatim}
8244
import statsmodels.formula.api as smf
8245
8246
live, firsts, others = first.MakeFrames()
8247
formula = 'totalwgt_lb ~ agepreg'
8248
model = smf.ols(formula, data=live)
8249
results = model.fit()
8250
\end{verbatim}
8251
8252
{\tt statsmodels} provides two interfaces (APIs); the ``formula''
8253
API uses strings to identify the dependent and explanatory variables.
8254
It uses a syntax called {\tt patsy}; in this example, the \verb"~"
8255
operator separates the dependent variable on the left from the
8256
explanatory variables on the right.
8257
\index{explanatory variable}
8258
\index{dependent variable}
8259
\index{Patsy}
8260
8261
{\tt smf.ols} takes the formula string and the DataFrame, {\tt live},
8262
and returns an OLS object that represents the model. The name {\tt ols}
8263
stands for ``ordinary least squares.''
8264
\index{DataFrame}
8265
\index{model}
8266
\index{ordinary least squares}
8267
8268
The {\tt fit} method fits the model to the data and returns a
8269
RegressionResults object that contains the results.
8270
\index{RegressionResults}
8271
8272
The results are also available as attributes. {\tt params}
8273
is a Series that maps from variable names to their parameters, so we can
8274
get the intercept and slope like this:
8275
\index{Series}
8276
8277
\begin{verbatim}
8278
inter = results.params['Intercept']
8279
slope = results.params['agepreg']
8280
\end{verbatim}
8281
8282
The estimated parameters are 6.83 and 0.0175, the same as
8283
from {\tt LeastSquares}.
8284
\index{parameter}
8285
8286
{\tt pvalues} is a Series that maps from variable names to the associated
8287
p-values, so we can check whether the estimated slope is statistically
8288
significant:
8289
\index{p-value}
8290
\index{significant} \index{statistically significant}
8291
8292
\begin{verbatim}
8293
slope_pvalue = results.pvalues['agepreg']
8294
\end{verbatim}
8295
8296
The p-value associated with {\tt agepreg} is {\tt 5.7e-11}, which
8297
is less than $0.001$, as expected.
8298
\index{age}
8299
8300
{\tt results.rsquared} contains $R^2$, which is $0.0047$. {\tt
8301
results} also provides \verb"f_pvalue", which is the p-value
8302
associated with the model as a whole, similar to testing whether $R^2$
8303
is statistically significant.
8304
\index{model}
8305
\index{coefficient of determination}
8306
\index{r-squared}
8307
8308
And {\tt results} provides {\tt resid}, a sequence of residuals, and
8309
{\tt fittedvalues}, a sequence of fitted values corresponding to
8310
{\tt agepreg}.
8311
\index{residuals}
8312
8313
The results object provides {\tt summary()}, which
8314
represents the results in a readable format.
8315
8316
\begin{verbatim}
8317
print(results.summary())
8318
\end{verbatim}
8319
8320
But it prints a lot of information that is not relevant (yet), so
8321
I use a simpler function called {\tt SummarizeResults}. Here are
8322
the results of this model:
8323
8324
\begin{verbatim}
8325
Intercept 6.83 (0)
8326
agepreg 0.0175 (5.72e-11)
8327
R^2 0.004738
8328
Std(ys) 1.408
8329
Std(res) 1.405
8330
\end{verbatim}
8331
8332
{\tt Std(ys)} is the standard deviation of the dependent variable,
8333
which is the RMSE if you have to guess birth weights without the benefit of
8334
any explanatory variables. {\tt Std(res)} is the standard deviation
8335
of the residuals, which is the RMSE if your guesses are informed
8336
by the mother's age. As we have already seen, knowing the mother's
8337
age provides no substantial improvement to the predictions.
8338
\index{standard deviation}
8339
\index{birth weight}
8340
\index{weight!birth}
8341
\index{explanatory variable}
8342
\index{dependent variable}
8343
\index{RMSE}
8344
\index{predictive power}
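{\tt SummarizeResults} is not listed in the text; a minimal version might
look something like this (a sketch of my own, not necessarily the exact
implementation in {\tt regression.py}):

\begin{verbatim}
def SummarizeResults(results):
    """Prints parameters, p-values, and goodness-of-fit measures."""
    for name, param in results.params.items():
        pvalue = results.pvalues[name]
        print('%s   %0.3g   (%.3g)' % (name, param, pvalue))

    print('R^2 %.4g' % results.rsquared)
    ys = results.model.endog
    print('Std(ys) %.4g' % ys.std())
    print('Std(res) %.4g' % results.resid.std())
\end{verbatim}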
8345
8346
8347
\section{Multiple regression}
8348
\label{multiple}
8349
8350
In Section~\ref{birth_weights} we saw that first babies tend to be
8351
lighter than others, and this effect is statistically significant.
8352
But it is a strange result because there is no obvious mechanism that
8353
would cause first babies to be lighter. So we might wonder whether
8354
this relationship is {\bf spurious}.
8355
\index{multiple regression}
8356
\index{spurious relationship}
8357
8358
In fact, there is a possible explanation for this effect. We have
8359
seen that birth weight depends on mother's age, and we might expect
8360
that mothers of first babies are younger than others.
8361
\index{weight}
8362
\index{age}
8363
8364
With a few calculations we can check whether this explanation
8365
is plausible. Then we'll use multiple regression to investigate
8366
more carefully. First, let's see how big the difference in weight
8367
is:
8368
8369
\begin{verbatim}
8370
diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
8371
\end{verbatim}
8372
8373
First babies are 0.125 lbs lighter, or 2 ounces. And the difference
8374
in ages:
8375
8376
\begin{verbatim}
8377
diff_age = firsts.agepreg.mean() - others.agepreg.mean()
8378
\end{verbatim}
8379
8380
The mothers of first babies are 3.59 years younger. Running the
8381
linear model again, we get the change in birth weight as a function
8382
of age:
8383
\index{birth weight}
8384
\index{weight!birth}
8385
8386
\begin{verbatim}
8387
results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
8388
slope = results.params['agepreg']
8389
\end{verbatim}
8390
8391
The slope is 0.0175 pounds per year. If we multiply the slope by
8392
the difference in ages, we get the expected difference in birth
8393
weight for first babies and others, due to mother's age:
8394
8395
\begin{verbatim}
8396
slope * diff_age
8397
\end{verbatim}
8398
8399
The result is 0.063, just about half of the observed difference.
8400
So we conclude, tentatively, that the observed difference in birth
8401
weight can be partly explained by the difference in mother's age.
8402
8403
Using multiple regression, we can explore these relationships
8404
more systematically.
8405
\index{multiple regression}
8406
8407
\begin{verbatim}
8408
live['isfirst'] = live.birthord == 1
8409
formula = 'totalwgt_lb ~ isfirst'
8410
results = smf.ols(formula, data=live).fit()
8411
\end{verbatim}
8412
8413
The first line creates a new column named {\tt isfirst} that is
8414
True for first babies and False otherwise. Then we fit a model
8415
using {\tt isfirst} as an explanatory variable.
8416
\index{model}
8417
\index{explanatory variable}
8418
8419
Here are the results:
8420
8421
\begin{verbatim}
8422
Intercept 7.33 (0)
8423
isfirst[T.True] -0.125 (2.55e-05)
8424
R^2 0.00196
8425
\end{verbatim}
8426
8427
Because {\tt isfirst} is a boolean, {\tt ols} treats it as a
8428
{\bf categorical variable}, which means that the values fall
8429
into categories, like True and False, and should not be treated
8430
as numbers. The estimated parameter is the effect on birth
8431
weight when {\tt isfirst} is true, so the result,
8432
-0.125 lbs, is the difference in
8433
birth weight between first babies and others.
8434
\index{birth weight}
8435
\index{weight!birth}
8436
\index{categorical variable}
8437
\index{boolean}
8438
8439
The slope and the intercept are statistically significant,
8440
which means that they were unlikely to occur by chance, but
the $R^2$ value for this model is small, which means that
8442
{\tt isfirst} doesn't account for a substantial part of the
8443
variation in birth weight.
8444
\index{coefficient of determination}
8445
\index{r-squared}
8446
8447
The results are similar with {\tt agepreg}:
8448
8449
\begin{verbatim}
8450
Intercept 6.83 (0)
8451
agepreg 0.0175 (5.72e-11)
8452
R^2 0.004738
8453
\end{verbatim}
8454
8455
Again, the parameters are statistically significant, but
8456
$R^2$ is low.
8457
\index{coefficient of determination}
8458
\index{r-squared}
8459
8460
These models confirm results we have already seen. But now we
8461
can fit a single model that includes both variables. With the
8462
formula \verb"totalwgt_lb ~ isfirst + agepreg", we get:
8463
8464
\begin{verbatim}
8465
Intercept 6.91 (0)
8466
isfirst[T.True] -0.0698 (0.0253)
8467
agepreg 0.0154 (3.93e-08)
8468
R^2 0.005289
8469
\end{verbatim}
8470
8471
In the combined model, the parameter for {\tt isfirst} is smaller
8472
by about half, which means that part of the apparent effect of
8473
{\tt isfirst} is actually accounted for by {\tt agepreg}. And
8474
the p-value for {\tt isfirst} is about 2.5\%, which is on the
8475
border of statistical significance.
8476
\index{p-value}
8477
\index{model}
8478
8479
$R^2$ for this model is a little higher, which indicates that the
8480
two variables together account for more variation in birth weight
8481
than either alone (but not by much).
8482
\index{birth weight}
8483
\index{weight!birth}
8484
\index{coefficient of determination}
8485
\index{r-squared}
8486
8487
8488
\section{Nonlinear relationships}
8489
\label{nonlinear}
8490
8491
Remembering that the contribution of {\tt agepreg} might be nonlinear,
8492
we might consider adding a variable to capture more of this
8493
relationship. One option is to create a column, {\tt agepreg2},
8494
that contains the squares of the ages:
8495
\index{nonlinear}
8496
8497
\begin{verbatim}
8498
live['agepreg2'] = live.agepreg**2
8499
formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
8500
\end{verbatim}
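As before, we fit the model with {\tt smf.ols}:

\begin{verbatim}
results = smf.ols(formula, data=live).fit()
SummarizeResults(results)
\end{verbatim}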
8501
8502
Now by estimating parameters for {\tt agepreg} and {\tt agepreg2},
8503
we are effectively fitting a parabola:
8504
8505
\begin{verbatim}
8506
Intercept 5.69 (1.38e-86)
8507
isfirst[T.True] -0.0504 (0.109)
8508
agepreg 0.112 (3.23e-07)
8509
agepreg2 -0.00185 (8.8e-06)
8510
R^2 0.007462
8511
\end{verbatim}
8512
8513
The parameter of {\tt agepreg2} is negative, so the parabola
8514
curves downward, which is consistent with the shape of the lines
8515
in Figure~\ref{linear2}.
8516
\index{parabola}
8517
8518
The quadratic model of {\tt agepreg} accounts for more of the
8519
variability in birth weight; the parameter for {\tt isfirst}
8520
is smaller in this model, and no longer statistically significant.
8521
\index{birth weight}
8522
\index{weight!birth}
8523
\index{quadratic model}
8524
\index{model}
8525
\index{significant} \index{statistically significant}
8526
8527
Using computed variables like {\tt agepreg2} is a common way to
8528
fit polynomials and other functions to data.
8529
This process is still considered linear
8530
regression, because the dependent variable is a linear function of
8531
the explanatory variables, regardless of whether some variables
8532
are nonlinear functions of others.
8533
\index{explanatory variable}
8534
\index{dependent variable}
8535
\index{nonlinear}
8536
8537
The following table summarizes the results of these regressions:
8538
8539
\begin{center}
8540
\begin{tabular}{|l|c|c|c|c|}
8541
\hline & isfirst & agepreg & agepreg2 & $R^2$ \\ \hline
8542
Model 1 & -0.125 * & -- & -- & 0.002 \\
8543
Model 2 & -- & 0.0175 * & -- & 0.0047 \\
8544
Model 3 & -0.0698 (0.025) & 0.0154 * & -- & 0.0053 \\
8545
Model 4 & -0.0504 (0.11) & 0.112 * & -0.00185 * & 0.0075 \\
8546
\hline
8547
\end{tabular}
8548
\end{center}
8549
8550
The columns in this table are the explanatory variables and
8551
the coefficient of determination, $R^2$. Each entry is an estimated
8552
parameter and either a p-value in parentheses or an asterisk to
8553
indicate a p-value less than 0.001.
8554
\index{p-value}
8555
\index{coefficient of determination}
8556
\index{r-squared}
8557
\index{explanatory variable}
8558
8559
We conclude that the apparent difference in birth weight
8560
is explained, at least in part, by the difference in mother's age.
8561
When we include mother's age in the model, the effect of
8562
{\tt isfirst} gets smaller, and the remaining effect might be
8563
due to chance.
8564
\index{age}
8565
8566
In this example, mother's age acts as a {\bf control variable};
8567
including {\tt agepreg} in the model ``controls for'' the
8568
difference in age between first-time mothers and others, making
8569
it possible to isolate the effect (if any) of {\tt isfirst}.
8570
\index{control variable}
8571
8572
8573
\section{Data mining}
8574
\label{mining}
8575
8576
So far we have used regression models for explanation; for example,
8577
in the previous section we discovered that an apparent difference
8578
in birth weight is actually due to a difference in mother's age.
8579
But the $R^2$ values of those models are very low, which means that
8580
they have little predictive power. In this section we'll try to
8581
do better.
8582
\index{birth weight}
8583
\index{weight!birth}
8584
\index{regression model}
8585
\index{coefficient of determination}
8586
\index{r-squared}
8587
8588
Suppose one of your co-workers is expecting a baby and
8589
there is an office pool to guess the baby's birth weight (if you are
8590
not familiar with betting pools, see
8591
\url{https://en.wikipedia.org/wiki/Betting_pool}).
8592
\index{betting pool}
8593
8594
Now suppose that you {\em really\/} want to win the pool. What could
8595
you do to improve your chances? Well,
8596
the NSFG dataset includes 244 variables about each pregnancy and another
8597
3087 variables about each respondent. Maybe some of those variables
8598
have predictive power. To find out which ones are most useful,
8599
why not try them all?
8600
\index{NSFG}
8601
8602
Testing the variables in the pregnancy table is easy, but in order to
8603
use the variables in the respondent table, we have to match up each
8604
pregnancy with a respondent. In theory we could iterate through the
8605
rows of the pregnancy table, use the {\tt caseid} to find the
8606
corresponding respondent, and copy the values from the
8607
respondent table into the pregnancy table. But that would be slow.
8608
\index{join}
8609
\index{SQL}
8610
8611
A better option is to recognize this process as a {\bf join} operation
8612
as defined in SQL and other relational database languages (see
8613
\url{https://en.wikipedia.org/wiki/Join_(SQL)}). Join is implemented
8614
as a DataFrame method, so we can perform the operation like this:
8615
\index{DataFrame}
8616
8617
\begin{verbatim}
8618
live = live[live.prglngth>30]
8619
resp = chap01soln.ReadFemResp()
8620
resp.index = resp.caseid
8621
join = live.join(resp, on='caseid', rsuffix='_r')
8622
\end{verbatim}
8623
8624
The first line selects records for pregnancies longer than 30 weeks,
8625
assuming that the office pool is formed several weeks before the
8626
due date.
8627
\index{betting pool}
8628
8629
The next line reads the respondent file. The result is a DataFrame
8630
with integer indices; in order to look up respondents efficiently,
8631
I replace {\tt resp.index} with {\tt resp.caseid}.
8632
8633
The {\tt join} method is invoked on {\tt live}, which is considered
8634
the ``left'' table, and passed {\tt resp}, which is the ``right'' table.
8635
The keyword argument {\tt on} indicates the variable used to match up
8636
rows from the two tables.
8637
8638
In this example some column names appear in both tables,
8639
so we have to provide {\tt rsuffix}, which is a string that will be
8640
appended to the names of overlapping columns from the right table.
8641
For example, both tables have a column named {\tt race} that encodes
8642
the race of the respondent. The result of the join contains two
8643
columns named {\tt race} and \verb"race_r".
8644
\index{race}
8645
8646
The pandas implementation is fast. Joining the NSFG tables takes
8647
less than a second on an ordinary desktop computer.
8648
Now we can start testing variables.
8649
\index{pandas}
8650
\index{join}
8651
8652
\begin{verbatim}
8653
t = []
8654
for name in join.columns:
8655
try:
8656
if join[name].var() < 1e-7:
8657
continue
8658
8659
formula = 'totalwgt_lb ~ agepreg + ' + name
8660
model = smf.ols(formula, data=join)
8661
if model.nobs < len(join)/2:
8662
continue
8663
8664
results = model.fit()
8665
except (ValueError, TypeError):
8666
continue
8667
8668
t.append((results.rsquared, name))
8669
\end{verbatim}
8670
8671
For each variable we construct a model, compute $R^2$, and append
8672
the results to a list. The models all include {\tt agepreg}, since
8673
we already know that it has some predictive power.
8674
\index{model}
8675
\index{coefficient of determination}
8676
\index{r-squared}
8677
8678
I check that each explanatory variable has some variability; otherwise
8679
the results of the regression are unreliable. I also check the number
8680
of observations for each model. Variables that contain a large number
8681
of {\tt nan}s are not good candidates for prediction.
8682
\index{explanatory variable}
8683
\index{NaN}
8684
8685
For most of these variables, we haven't done any cleaning. Some of them
8686
are encoded in ways that don't work very well for linear regression.
8687
As a result, we might overlook some variables that would be useful if
8688
they were cleaned properly. But maybe we will find some good candidates.
8689
\index{cleaning}
8690
8691
8692
\section{Prediction}
8693
8694
The next step is to sort the results and select the variables that
8695
yield the highest values of $R^2$.
8696
\index{prediction}
8697
8698
\begin{verbatim}
8699
t.sort(reverse=True)
for r2, name in t[:30]:
    print(name, r2)
8702
\end{verbatim}
8703
8704
The first variable on the list is \verb"totalwgt_lb",
8705
followed by \verb"birthwgt_lb". Obviously, we can't use birth
8706
weight to predict birth weight.
8707
\index{birth weight}
8708
\index{weight!birth}
8709
8710
Similarly {\tt prglngth} has useful predictive power, but for the
8711
office pool we assume pregnancy length (and the related variables)
8712
are not known yet.
8713
\index{predictive power}
8714
\index{pregnancy length}
8715
8716
The first useful predictive variable is {\tt babysex}, which indicates
8717
whether the baby is male or female. In the NSFG dataset, boys are
8718
about 0.3 lbs heavier. So, assuming that the sex of the baby is
8719
known, we can use it for prediction.
8720
\index{sex}
8721
8722
Next is {\tt race}, which indicates whether the respondent is white,
8723
black, or other. As an explanatory variable, race can be problematic.
8724
In datasets like the NSFG, race is correlated with many other
8725
variables, including income and other socioeconomic factors. In a
8726
regression model, race acts as a {\bf proxy variable},
8727
so apparent correlations with race are often caused, at least in
8728
part, by other factors.
8729
\index{explanatory variable}
8730
\index{race}
8731
8732
The next variable on the list is {\tt nbrnaliv}, which indicates
8733
whether the pregnancy yielded multiple births. Twins and triplets
8734
tend to be smaller than other babies, so if we know whether our
8735
hypothetical co-worker is expecting twins, that would help.
8736
\index{multiple birth}
8737
8738
Next on the list is {\tt paydu}, which indicates whether the
8739
respondent owns her home. It is one of several income-related
8740
variables that turn out to be predictive. In datasets like the NSFG,
8741
income and wealth are correlated with just about everything. In this
8742
example, income is related to diet, health, health care, and other
8743
factors likely to affect birth weight.
8744
\index{birth weight}
8745
\index{weight!birth}
8746
\index{income}
8747
\index{wealth}
8748
8749
Some of the other variables on the list are things that would not
8750
be known until later, like {\tt bfeedwks}, the number of weeks
8751
the baby was breast fed. We can't use these variables for prediction,
8752
but you might want to speculate on reasons
8753
{\tt bfeedwks} might be correlated with birth weight.
8754
8755
Sometimes you start with a theory and use data to test it. Other
8756
times you start with data and go looking for possible theories.
8757
The second approach, which this section demonstrates, is
8758
called {\bf data mining}. An advantage of data mining is that it
8759
can discover unexpected patterns. A hazard is that many of the
8760
patterns it discovers are either random or spurious.
8761
\index{theory}
8762
\index{data mining}
8763
8764
Having identified potential explanatory variables, I tested a few
8765
models and settled on this one:
8766
\index{model}
8767
\index{explanatory variable}
8768
8769
\begin{verbatim}
8770
formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '
8771
'nbrnaliv>1 + paydu==1 + totincr')
8772
results = smf.ols(formula, data=join).fit()
8773
\end{verbatim}
8774
8775
This formula uses some syntax we have not seen yet:
8776
{\tt C(race)} tells the formula parser (Patsy) to treat race as a
8777
categorical variable, even though it is encoded numerically.
8778
\index{Patsy}
8779
\index{categorical variable}
8780
8781
The encoding for {\tt babysex} is 1 for male, 2 for female; writing
8782
{\tt babysex==1} converts it to boolean, True for male and False for
8783
female.
8784
\index{boolean}
8785
8786
Similarly {\tt nbrnaliv>1} is True for multiple births and
8787
{\tt paydu==1} is True for respondents who own their houses.
8788
8789
{\tt totincr} is encoded numerically from 1-14, with each increment
8790
representing about \$5000 in annual income. So we can treat these
8791
values as numerical, expressed in units of \$5000.
8792
\index{income}
8793
8794
Here are the results of the model:
8795
8796
\begin{verbatim}
8797
Intercept 6.63 (0)
8798
C(race)[T.2] 0.357 (5.43e-29)
8799
C(race)[T.3] 0.266 (2.33e-07)
8800
babysex == 1[T.True] 0.295 (5.39e-29)
8801
nbrnaliv > 1[T.True] -1.38 (5.1e-37)
8802
paydu == 1[T.True] 0.12 (0.000114)
8803
agepreg 0.00741 (0.0035)
8804
totincr 0.0122 (0.00188)
8805
\end{verbatim}
8806
8807
The estimated parameters for race are larger than I expected,
8808
especially since we control for income. The encoding
8809
is 1 for black, 2 for white, and 3 for other. Babies of black
8810
mothers are lighter than babies of other races by 0.27--0.36 lbs.
8811
\index{control variable}
8812
\index{race}
8813
8814
As we've already seen, boys are heavier by about 0.3 lbs;
8815
twins and other multiplets are lighter by 1.4 lbs.
8816
\index{weight}
8817
8818
People who own their homes have heavier babies by about 0.12 lbs,
8819
even when we control for income. The parameter for mother's
8820
age is smaller than what we saw in Section~\ref{multiple}, which
8821
suggests that some of the other variables are correlated with
8822
age, probably including {\tt paydu} and {\tt totincr}.
8823
\index{income}
8824
8825
All of these variables are statistically significant, some with
8826
very low p-values, but
8827
$R^2$ is only 0.06, still quite small.
8828
RMSE without using the model is 1.27 lbs; with the model it drops
8829
to 1.23. So your chance of winning the pool is not substantially
8830
improved. Sorry!
8831
\index{p-value}
8832
\index{model}
8833
\index{coefficient of determination}
8834
\index{r-squared}
8835
\index{significant} \index{statistically significant}
8836
8837
8838
8839
\section{Logistic regression}
8840
8841
In the previous examples, some of the explanatory variables were
8842
numerical and some categorical (including boolean). But the dependent
8843
variable was always numerical.
8844
\index{explanatory variable}
8845
\index{dependent variable}
8846
\index{categorical variable}
8847
8848
Linear regression can be generalized to handle other kinds of
8849
dependent variables. If the dependent variable is boolean, the
8850
generalized model is called {\bf logistic regression}. If the dependent
8851
variable is an integer count, it's called {\bf Poisson
8852
regression}.
8853
\index{model}
8854
\index{logistic regression}
8855
\index{Poisson regression}
8856
\index{boolean}
8857
8858
As an example of logistic regression, let's consider a variation
8859
on the office pool scenario.
8860
Suppose
8861
a friend of yours is pregnant and you want to predict whether the
8862
baby is a boy or a girl. You could use data from the NSFG to find
8863
factors that affect the ``sex ratio'', which is conventionally
8864
defined to be the probability
8865
of having a boy.
8866
\index{betting pool}
8867
\index{sex}
8868
8869
If you encode the dependent variable numerically, for example 0 for a
8870
girl and 1 for a boy, you could apply ordinary least squares, but
8871
there would be problems. The linear model might be something like
8872
this:
8873
%
8874
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
8875
%
8876
Where $y$ is the dependent variable, and $x_1$ and $x_2$ are
8877
explanatory variables. Then we could find the parameters that
8878
minimize the residuals.
8879
\index{regression model}
8880
\index{explanatory variable}
8881
\index{dependent variable}
8882
\index{ordinary least squares}
8883
8884
The problem with this approach is that it produces predictions that
8885
are hard to interpret. Given estimated parameters and values for
8886
$x_1$ and $x_2$, the model might predict $y=0.5$, but the only
8887
meaningful values of $y$ are 0 and 1.
8888
\index{parameter}
8889
8890
It is tempting to interpret a result like that as a probability; for
8891
example, we might say that a respondent with particular values of
8892
$x_1$ and $x_2$ has a 50\% chance of having a boy. But it is also
8893
possible for this model to predict $y=1.1$ or $y=-0.1$, and those
8894
are not valid probabilities.
8895
\index{probability}
8896
8897
Logistic regression avoids this problem by expressing predictions in
8898
terms of {\bf odds} rather than probabilities. If you are not
8899
familiar with odds, ``odds in favor'' of an event is the ratio of the
8900
probability it will occur to the probability that it will not.
8901
\index{odds}
8902
8903
So if I think my team has a 75\% chance of winning, I would
8904
say that the odds in their favor are three to one, because
8905
the chance of winning is three times the chance of losing.
8906
8907
Odds and probabilities are different representations of the same
8908
information. Given a probability, you can compute the odds like this:
8909
8910
\begin{verbatim}
8911
o = p / (1-p)
8912
\end{verbatim}
8913
8914
Given odds in favor, you can convert to
8915
probability like this:
8916
8917
\begin{verbatim}
8918
p = o / (o+1)
8919
\end{verbatim}
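For example, the 75\% probability above corresponds to odds of three to
one, and back again:

\begin{verbatim}
>>> p = 0.75
>>> p / (1 - p)
3.0
>>> o = 3
>>> o / (o + 1)
0.75
\end{verbatim}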
8920
8921
Logistic regression is based on the following model:
8922
%
8923
\[ \log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
8924
%
8925
Where $o$ is the odds in favor of a particular outcome; in the
8926
example, $o$ would be the odds of having a boy.
8927
\index{regression model}
8928
8929
Suppose we have estimated the parameters $\beta_0$, $\beta_1$, and
8930
$\beta_2$ (I'll explain how in a minute). And suppose we are given
8931
values for $x_1$ and $x_2$. We can compute the predicted value of
8932
$\log o$, and then convert to a probability:
8933
8934
\begin{verbatim}
8935
o = np.exp(log_o)
8936
p = o / (o+1)
8937
\end{verbatim}
8938
8939
So in the office pool scenario we could compute the predictive
8940
probability of having a boy. But how do we estimate the parameters?
8941
\index{parameter}
8942
8943
8944
\section{Estimating parameters}
8945
8946
Unlike linear regression, logistic regression does not have a
8947
closed form solution, so it is solved by guessing an initial
8948
solution and improving it iteratively.
8949
\index{logistic regression}
8950
\index{closed form}
8951
8952
The usual goal is to find the maximum-likelihood estimate (MLE),
8953
which is the set of parameters that maximizes the likelihood of the
8954
data. For example, suppose we have the following data:
8955
\index{MLE}
8956
\index{maximum likelihood estimator}
8957
8958
\begin{verbatim}
8959
>>> y = np.array([0, 1, 0, 1])
8960
>>> x1 = np.array([0, 0, 0, 1])
8961
>>> x2 = np.array([0, 1, 1, 1])
8962
\end{verbatim}
8963
8964
And we start with the initial guesses $\beta_0=-1.5$, $\beta_1=2.8$,
8965
and $\beta_2=1.1$:
8966
8967
\begin{verbatim}
8968
>>> beta = [-1.5, 2.8, 1.1]
8969
\end{verbatim}
8970
8971
Then for each row we can compute \verb"log_o":
8972
8973
\begin{verbatim}
8974
>>> log_o = beta[0] + beta[1] * x1 + beta[2] * x2
8975
[-1.5 -0.4 -0.4 2.4]
8976
\end{verbatim}
8977
8978
And convert from log odds to probabilities:
8979
\index{log odds}
8980
8981
\begin{verbatim}
8982
>>> o = np.exp(log_o)
8983
[ 0.223 0.670 0.670 11.02 ]
8984
8985
>>> p = o / (o+1)
8986
[ 0.182 0.401 0.401 0.916 ]
8987
\end{verbatim}
8988
8989
Notice that when \verb"log_o" is greater than 0, {\tt o}
8990
is greater than 1 and {\tt p} is greater than 0.5.
8991
8992
The likelihood of an outcome is {\tt p} when {\tt y==1} and {\tt 1-p}
8993
when {\tt y==0}. For example, if we think the probability of a boy is
8994
0.8 and the outcome is a boy, the likelihood is 0.8; if
8995
the outcome is a girl, the likelihood is 0.2. We can compute that
8996
like this:
8997
\index{likelihood}
8998
8999
\begin{verbatim}
9000
>>> likes = y * p + (1-y) * (1-p)
9001
[ 0.817 0.401 0.598 0.916 ]
9002
\end{verbatim}
9003
9004
The overall likelihood of the data is the product of {\tt likes}:
9005
9006
\begin{verbatim}
9007
>>> like = np.prod(likes)
9008
0.18
9009
\end{verbatim}
9010
9011
For these values of {\tt beta}, the likelihood of the data is 0.18.
9012
The goal of logistic regression is to find parameters that maximize
9013
this likelihood. To do that, most statistics packages use an
9014
iterative solver like Newton's method (see
9015
\url{https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting}).
9016
\index{Newton's method}
9017
\index{iterative solver}
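To make the idea of iterative improvement concrete, here is a sketch that
wraps the computation above in a function, so a solver (or a simple loop)
can evaluate candidate parameters; the function name and the second set of
parameters are mine, chosen only for illustration:

\begin{verbatim}
import numpy as np

def Likelihood(beta, x1, x2, y):
    """Likelihood of the data for the given parameters."""
    log_o = beta[0] + beta[1] * x1 + beta[2] * x2
    o = np.exp(log_o)
    p = o / (o + 1)
    likes = y * p + (1 - y) * (1 - p)
    return np.prod(likes)

y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])

print(Likelihood([-1.5, 2.8, 1.1], x1, x2, y))  # about 0.18
print(Likelihood([-1.5, 2.9, 1.1], x1, x2, y))  # a small step that does better
\end{verbatim}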
9018
9019
9020
\section{Implementation}
9021
\label{implementation}
9022
9023
StatsModels provides an implementation of logistic regression
9024
called {\tt logit}, named for the function that converts from
9025
probability to log odds. To demonstrate its use, I'll look for
9026
variables that affect the sex ratio.
9027
\index{StatsModels}
9028
\index{sex ratio}
9029
\index{logit function}
9030
9031
Again, I load the NSFG data and select pregnancies longer than
9032
30 weeks:
9033
9034
\begin{verbatim}
9035
live, firsts, others = first.MakeFrames()
9036
df = live[live.prglngth>30]
9037
\end{verbatim}
9038
9039
{\tt logit} requires the dependent variable to be binary (rather than
9040
boolean), so I create a new column named {\tt boy}, using {\tt
9041
astype(int)} to convert to binary integers:
9042
\index{dependent variable}
9043
\index{boolean}
9044
\index{binary}
9045
9046
\begin{verbatim}
9047
df['boy'] = (df.babysex==1).astype(int)
9048
\end{verbatim}
9049
9050
Factors that have been found to affect sex ratio include parents'
9051
age, birth order, race, and social status. We can use logistic
9052
regression to see if these effects appear in the NSFG data. I'll
9053
start with the mother's age:
9054
\index{age}
9055
\index{race}
9056
9057
\begin{verbatim}
9058
import statsmodels.formula.api as smf
9059
9060
model = smf.logit('boy ~ agepreg', data=df)
9061
results = model.fit()
9062
SummarizeResults(results)
9063
\end{verbatim}
9064
9065
{\tt logit} takes the same arguments as {\tt ols}, a formula
9066
in Patsy syntax and a DataFrame. The result is a Logit object
9067
that represents the model. It contains attributes called
9068
{\tt endog} and {\tt exog} that contain the {\bf endogenous
9069
variable}, another name for the dependent variable,
9070
and the {\bf exogenous variables}, another name for the
9071
explanatory variables. Since they are NumPy arrays, it is
9072
sometimes convenient to convert them to DataFrames:
9073
\index{NumPy}
9074
\index{pandas}
9075
\index{DataFrame}
9076
\index{explanatory variable}
9077
\index{dependent variable}
9078
\index{exogenous variable}
9079
\index{endogenous variable}
9080
\index{Patsy}
9081
9082
\begin{verbatim}
9083
endog = pandas.DataFrame(model.endog, columns=[model.endog_names])
9084
exog = pandas.DataFrame(model.exog, columns=model.exog_names)
9085
\end{verbatim}
9086
9087
The result of {\tt model.fit} is a BinaryResults object, which is
9088
similar to the RegressionResults object we got from {\tt ols}.
9089
Here is a summary of the results:
9090
9091
\begin{verbatim}
9092
Intercept 0.00579 (0.953)
9093
agepreg 0.00105 (0.783)
9094
R^2 6.144e-06
9095
\end{verbatim}
9096
9097
The parameter of {\tt agepreg} is positive, which suggests that
9098
older mothers are more likely to have boys, but the p-value is
9099
0.783, which means that the apparent effect could easily be due
9100
to chance.
9101
\index{p-value}
9102
\index{age}
9103
9104
The coefficient of determination, $R^2$, does not apply to logistic
9105
regression, but there are several alternatives that are used
9106
as ``pseudo $R^2$ values.'' These values can be useful for comparing
9107
models. For example, here's a model that includes several factors
9108
believed to be associated with sex ratio:
9109
\index{model}
9110
\index{coefficient of determination}
9111
\index{r-squared}
9112
\index{pseudo r-squared}
9113
9114
\begin{verbatim}
9115
formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'
9116
model = smf.logit(formula, data=df)
9117
results = model.fit()
9118
\end{verbatim}
9119
9120
Along with mother's age, this model includes father's age at
9121
birth ({\tt hpagelb}), birth order ({\tt birthord}), and
9122
race as a categorical variable. Here are the results:
9123
\index{categorical variable}
9124
9125
\begin{verbatim}
9126
Intercept -0.0301 (0.772)
9127
C(race)[T.2] -0.0224 (0.66)
9128
C(race)[T.3] -0.000457 (0.996)
9129
agepreg -0.00267 (0.629)
9130
hpagelb 0.0047 (0.266)
9131
birthord 0.00501 (0.821)
9132
R^2 0.000144
9133
\end{verbatim}
9134
9135
None of the estimated parameters are statistically significant. The
9136
pseudo-$R^2$ value is a little higher, but that could be due to
9137
chance.
9138
\index{pseudo r-squared}
9139
\index{significant} \index{statistically significant}
9140
9141
9142
\section{Accuracy}
9143
\label{accuracy}
9144
9145
In the office pool scenario,
9146
we are most interested in the accuracy of the model:
9147
the number of successful predictions, compared with what we would
9148
expect by chance.
9149
\index{model}
9150
\index{accuracy}
9151
9152
In the NSFG data, there are more boys than girls, so the baseline
9153
strategy is to guess ``boy'' every time. The accuracy of this
9154
strategy is just the fraction of boys:
9155
9156
\begin{verbatim}
9157
actual = endog['boy']
9158
baseline = actual.mean()
9159
\end{verbatim}
9160
9161
Since {\tt actual} is encoded in binary integers, the mean is the
9162
fraction of boys, which is 0.507.
9163
9164
Here's how we compute the accuracy of the model:
9165
9166
\begin{verbatim}
9167
predict = (results.predict() >= 0.5)
9168
true_pos = predict * actual
9169
true_neg = (1 - predict) * (1 - actual)
9170
\end{verbatim}
9171
9172
{\tt results.predict} returns a NumPy array of probabilities, which we
9173
round off to 0 or 1. Multiplying by {\tt actual}
9174
yields 1 if we predict a boy and get it right, 0 otherwise. So,
9175
\verb"true_pos" indicates ``true positives''.
9176
\index{NumPy}
9177
\index{true positive}
9178
\index{true negative}
9179
9180
Similarly, \verb"true_neg" indicates the cases where we guess ``girl''
9181
and get it right. Accuracy is the fraction of correct guesses:
9182
9183
\begin{verbatim}
9184
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
9185
\end{verbatim}
9186
9187
The result is 0.512, slightly better than the
9188
baseline, 0.507. But, you should not take this result too seriously.
9189
We used the same data to build and test the model, so the model
9190
may not have predictive power on new data.
9191
\index{model}
9192
9193
Nevertheless, let's use the model to make a prediction for the office
9194
pool. Suppose your friend is 35 years old and white,
9195
her husband is 39, and they are expecting their third child:
9196
9197
\begin{verbatim}
9198
columns = ['agepreg', 'hpagelb', 'birthord', 'race']
9199
new = pandas.DataFrame([[35, 39, 3, 2]], columns=columns)
9200
y = results.predict(new)
9201
\end{verbatim}
9202
9203
To invoke {\tt results.predict} for a new case, you have to construct
9204
a DataFrame with a column for each variable in the model. The result
9205
in this case is 0.52, so you should guess ``boy.'' But even if the model
improves your chances of winning, the difference is very small.
9207
\index{DataFrame}
9208
9209
9210
9211
\section{Exercises}
9212
9213
My solution to these exercises is in \verb"chap11soln.ipynb".
9214
9215
\begin{exercise}
9216
Suppose one of your co-workers is expecting a baby and you are
9217
participating in an office pool to predict the date of birth.
9218
Assuming that bets are placed during the 30th week of pregnancy, what
9219
variables could you use to make the best prediction? You should limit
9220
yourself to variables that are known before the birth, and likely to
9221
be available to the people in the pool.
9222
\index{betting pool}
9223
\index{date of birth}
9224
9225
\end{exercise}
9226
9227
9228
\begin{exercise}
9229
The Trivers-Willard hypothesis suggests that for many mammals the
9230
sex ratio depends on ``maternal condition''; that is,
9231
factors like the mother's age, size, health, and social status.
9232
See \url{https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis}
9233
\index{Trivers-Willard hypothesis}
9234
\index{sex ratio}
9235
9236
Some studies have shown this effect among humans, but results are
9237
mixed. In this chapter we tested some variables related to these
9238
factors, but didn't find any with a statistically significant effect
9239
on sex ratio.
9240
\index{significant} \index{statistically significant}
9241
9242
As an exercise, use a data mining approach to test the other variables
9243
in the pregnancy and respondent files. Can you find any factors with
9244
a substantial effect?
9245
\index{data mining}
9246
9247
\end{exercise}
9248
9249
9250
\begin{exercise}
9251
If the quantity you want to predict is a count, you can use Poisson
9252
regression, which is implemented in StatsModels with a function called
9253
{\tt poisson}. It works the same way as {\tt ols} and {\tt logit}.
9254
As an exercise, let's use it to predict how many children a woman
has borne; in the NSFG dataset, this variable is called {\tt numbabes}.
9256
\index{StatsModels}
9257
\index{Poisson regression}
9258
9259
Suppose you meet a woman who is 35 years old, black, and a college
9260
graduate whose annual household income exceeds \$75,000. How many
9261
children would you predict she has borne?
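
As a starting point, the call has the same shape as {\tt ols} and
{\tt logit}.  In the following sketch the explanatory variables are
placeholders, and {\tt resp} stands for whatever DataFrame you assemble
from the respondent (and possibly pregnancy) files:

\begin{verbatim}
import statsmodels.formula.api as smf

# sketch only: 'age' and 'race' stand for whichever columns you choose
model = smf.poisson('numbabes ~ age + C(race)', data=resp)
results = model.fit()
\end{verbatim}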
9262
\end{exercise}
9263
9264
9265
\begin{exercise}
9266
If the quantity you want to predict is categorical, you can use
9267
multinomial logistic regression, which is implemented in StatsModels
9268
with a function called {\tt mnlogit}. As an exercise, let's use it to
9269
guess whether a woman is married, cohabitating, widowed, divorced,
9270
separated, or never married; in the NSFG dataset, marital status is
9271
encoded in a variable called {\tt rmarital}.
9272
\index{categorical variable}
9273
\index{marital status}
9274
9275
Suppose you meet a woman who is 25 years old, white, and a high
9276
school graduate whose annual household income is about \$45,000.
9277
What is the probability that she is married, cohabitating, etc?
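
As with {\tt poisson}, the call mirrors {\tt logit}.  In the following
sketch the explanatory variables are placeholders and {\tt resp} is
whatever respondent DataFrame you build; for a fitted model,
{\tt results.predict} returns one column of probabilities for each
category of {\tt rmarital}:

\begin{verbatim}
# sketch only: 'age' and 'race' stand for whichever columns you choose
model = smf.mnlogit('rmarital ~ age + C(race)', data=resp)
results = model.fit()
\end{verbatim}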
9278
\end{exercise}
9279
9280
9281
9282
9283
\section{Glossary}
9284
9285
\begin{itemize}
9286
9287
\item regression: One of several related processes for estimating parameters
9288
that fit a model to data.
9289
\index{regression}
9290
9291
\item dependent variables: The variables in a regression model we would
9292
like to predict. Also known as endogenous variables.
9293
\index{dependent variable}
9294
\index{endogenous variable}
9295
9296
\item explanatory variables: The variables used to predict or explain
9297
the dependent variables. Also known as independent, or exogenous,
9298
variables.
9299
\index{explanatory variable}
9300
\index{exogenous variable}
9301
9302
\item simple regression: A regression with only one dependent and
9303
one explanatory variable.
9304
\index{simple regression}
9305
9306
\item multiple regression: A regression with multiple explanatory
9307
variables, but only one dependent variable.
9308
\index{multiple regression}
9309
9310
\item linear regression: A regression based on a linear model.
9311
\index{linear regression}
9312
9313
\item ordinary least squares: A linear regression that estimates
9314
parameters by minimizing the squared error of the residuals.
9315
\index{ordinary least squares}
9316
9317
\item spurious relationship: A relationship between two variables that is
9318
caused by a statistical artifact, or by a factor that is related to
both variables but not included in the model.
9320
\index{spurious relationship}
9321
9322
\item control variable: A variable included in a regression to
9323
eliminate or ``control for'' a spurious relationship.
9324
\index{control variable}
9325
9326
\item proxy variable: A variable that contributes information to
9327
a regression model indirectly because of a relationship with another
9328
factor, so it acts as a proxy for that factor.
9329
\index{proxy variable}
9330
9331
\item categorical variable: A variable that can have one of a
9332
discrete set of unordered values.
9333
\index{categorical variable}
9334
9335
\item join: An operation that combines data from two DataFrames
9336
using a key to match up rows in the two frames.
9337
\index{join}
9338
\index{DataFrame}
9339
9340
\item data mining: An approach to finding relationships between
9341
variables by testing a large number of models.
9342
\index{data mining}
9343
9344
\item logistic regression: A form of regression used when the
9345
dependent variable is boolean.
9346
\index{logistic regression}
9347
9348
\item Poisson regression: A form of regression used when the
9349
dependent variable is a non-negative integer, usually a count.
9350
\index{Poisson regression}
9351
9352
\item odds: An alternative way of representing a probability, $p$, as
9353
the ratio of the probability and its complement, $p / (1-p)$.
9354
\index{odds}
9355
9356
\end{itemize}
9357
9358
9359
9360
\chapter{Time series analysis}
9361
9362
A {\bf time series} is a sequence of measurements from a system that
9363
varies in time. One famous example is the ``hockey stick graph'' that
9364
shows global average temperature over time (see
9365
\url{https://en.wikipedia.org/wiki/Hockey_stick_graph}).
9366
\index{time series}
9367
\index{hockey stick graph}
9368
9369
The example I work with in this chapter comes from Zachary M. Jones, a
9370
researcher in political science who studies the black market for
9371
cannabis in the U.S. (\url{http://zmjones.com/marijuana}). He
9372
collected data from a web site called ``Price of Weed'' that
9373
crowdsources market information by asking participants to report the
9374
price, quantity, quality, and location of cannabis transactions
9375
(\url{http://www.priceofweed.com/}). The goal of his project is to
9376
investigate the effect of policy decisions, like legalization, on
9377
markets. I find this project appealing because it is an example that
9378
uses data to address important political questions, like drug policy.
9379
\index{Price of Weed}
9380
\index{cannabis}
9381
9382
I hope you will
9383
find this chapter interesting, but I'll take this opportunity to
9384
reiterate the importance of maintaining a professional attitude to
9385
data analysis. Whether and which drugs should be illegal are
9386
important and difficult public policy questions; our decisions should
9387
be informed by accurate data reported honestly.
9388
\index{ethics}
9389
9390
The code for this chapter is in {\tt timeseries.py}. For information
9391
about downloading and working with this code, see Section~\ref{code}.
9392
9393
9394
\section{Importing and cleaning}
9395
9396
The data I downloaded from
9397
Mr. Jones's site is in the repository for this book.
9398
The following code reads it into a
9399
pandas DataFrame:
9400
\index{pandas}
9401
\index{DataFrame}
9402
9403
\begin{verbatim}
9404
transactions = pandas.read_csv('mj-clean.csv', parse_dates=[5])
9405
\end{verbatim}
9406
9407
\verb"parse_dates" tells \verb"read_csv" to interpret values in column 5
9408
as dates and convert them to NumPy {\tt datetime64} objects.
9409
\index{NumPy}
9410
9411
The DataFrame has a row for each reported transaction and
9412
the following columns:
9413
9414
\begin{itemize}
9415
9416
\item city: string city name.
9417
9418
\item state: two-letter state abbreviation.
9419
9420
\item price: price paid in dollars.
9421
\index{price}
9422
9423
\item amount: quantity purchased in grams.
9424
9425
\item quality: high, medium, or low quality, as reported by the purchaser.
9426
9427
\item date: date of report, presumed to be shortly after date of purchase.
9428
9429
\item ppg: price per gram, in dollars.
9430
9431
\item state.name: string state name.
9432
9433
\item lat: approximate latitude of the transaction, based on city name.
9434
9435
\item lon: approximate longitude of the transaction.
9436
9437
\end{itemize}
9438
9439
Each transaction is an event in time, so we could treat this dataset
9440
as a time series. But the events are not equally spaced in time; the
9441
number of transactions reported each day varies from 0 to several
9442
hundred. Many methods used to analyze time series require the
9443
measurements to be equally spaced, or at least things are simpler if
9444
they are.
9445
\index{transaction}
9446
\index{equally spaced data}
9447
9448
In order to demonstrate these methods, I divide the dataset
9449
into groups by reported quality, and then transform each group into
9450
an equally spaced series by computing the mean daily price per gram.
9451
9452
\begin{verbatim}
def GroupByQualityAndDay(transactions):
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)

    return dailies
\end{verbatim}
9461
9462
{\tt groupby} is a DataFrame method that returns a GroupBy object,
9463
{\tt groups}; used in a for loop, it iterates over the names of the groups
9464
and the DataFrames that represent them. Since the values of {\tt
9465
quality} are {\tt low}, {\tt medium}, and {\tt high}, we get three
9466
groups with those names. \index{DataFrame} \index{groupby}
9467
9468
The loop iterates through the groups and calls {\tt GroupByDay},
9469
which computes the daily average price and returns a new DataFrame:
9470
9471
\begin{verbatim}
def GroupByDay(transactions, func=np.mean):
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)

    daily['date'] = daily.index
    start = daily.date[0]
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year

    return daily
\end{verbatim}
9483
9484
The parameter, {\tt transactions}, is a DataFrame that contains
9485
columns {\tt date} and {\tt ppg}. We select these two
9486
columns, then group by {\tt date}.
9487
\index{groupby}
9488
9489
The result, {\tt grouped}, is a map from each date to a DataFrame that
9490
contains prices reported on that date. {\tt aggregate} is a
9491
GroupBy method that iterates through the groups and applies a
9492
function to each column of the group; in this case there is only one
9493
column, {\tt ppg}. So the result of {\tt aggregate} is a DataFrame
9494
with one row for each date and one column, {\tt ppg}.
9495
\index{aggregate}
9496
9497
Dates in these DataFrames are stored as NumPy {\tt datetime64}
9498
objects, which are represented as 64-bit integers in nanoseconds.
9499
For some of the analyses coming up, it will be convenient to
9500
work with time in more human-friendly units, like years. So
9501
{\tt GroupByDay} adds a column named {\tt date} by copying
9502
the {\tt index}, then adds {\tt years}, which contains the number
9503
of years since the first transaction as a floating-point number.
9504
\index{NumPy}
9505
\index{datetime64}
9506
9507
The resulting DataFrame has columns {\tt ppg}, {\tt date}, and
9508
{\tt years}.
9509
\index{DataFrame}
9510
9511
9512
\section{Plotting}
9513
9514
The result from {\tt GroupByQualityAndDay} is a map from each quality
9515
to a DataFrame of daily prices. Here's the code I use to plot
9516
the three time series:
9517
\index{DataFrame}
9518
\index{visualization}
9519
9520
\begin{verbatim}
thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i+1)
    title = 'price per gram ($)' if i==0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.index, daily.ppg, s=10, label=name)
    if i == 2:
        pyplot.xticks(rotation=30)
    else:
        thinkplot.Config(xticks=[])
\end{verbatim}
9532
9533
{\tt PrePlot} with {\tt rows=3} means that we are planning to
9534
make three subplots laid out in three rows. The loop iterates
9535
through the DataFrames and creates a scatter plot for each. It is
9536
common to plot time series with line segments between the points,
9537
but in this case there are many data points and prices are highly
9538
variable, so adding lines would not help.
9539
\index{thinkplot}
9540
9541
Since the labels on the x-axis are dates, I use {\tt pyplot.xticks}
9542
to rotate the ``ticks'' 30 degrees, making them more readable.
9543
\index{pyplot}
9544
\index{ticks}
9545
\index{xticks}
9546
9547
\begin{figure}
9548
% timeseries.py
9549
\centerline{\includegraphics[width=3.5in]{figs/timeseries1.pdf}}
9550
\caption{Time series of daily price per gram for high, medium, and low
9551
quality cannabis.}
9552
\label{timeseries1}
9553
\end{figure}
9554
9555
Figure~\ref{timeseries1} shows the result. One apparent feature in
9556
these plots is a gap around November 2013. It's possible that data
9557
collection was not active during this time, or the data might not
9558
be available. We will consider ways to deal with this missing data
9559
later.
9560
\index{missing values}
9561
9562
Visually, it looks like the price of high quality cannabis is
9563
declining during this period, and the price of medium quality is
9564
increasing. The price of low quality might also be increasing, but it
9565
is harder to tell, since it seems to be more volatile. Keep in mind
9566
that quality data is reported by volunteers, so trends over time
9567
might reflect changes in how participants apply these labels.
9568
\index{price}
9569
9570
9571
\section{Linear regression}
9572
\label{timeregress}
9573
9574
Although there are methods specific to time series analysis, for many
9575
problems a simple way to get started is by applying general-purpose
9576
tools like linear regression. The following function takes a
9577
DataFrame of daily prices and computes a least squares fit, returning
9578
the model and results objects from StatsModels:
9579
\index{DataFrame}
9580
\index{StatsModels}
9581
\index{linear regression}
9582
9583
\begin{verbatim}
def RunLinearModel(daily):
    model = smf.ols('ppg ~ years', data=daily)
    results = model.fit()
    return model, results
\end{verbatim}
9589
9590
Then we can iterate through the qualities and fit a model to
9591
each:
9592
9593
\begin{verbatim}
for name, daily in dailies.items():
    model, results = RunLinearModel(daily)
    print(name)
    regression.SummarizeResults(results)
\end{verbatim}
9599
9600
Here are the results:
9601
9602
\begin{center}
9603
\begin{tabular}{|l|l|l|c|} \hline
9604
quality & intercept & slope & $R^2$ \\ \hline
9605
high & 13.450 & -0.708 & 0.444 \\
9606
medium & 8.879 & 0.283 & 0.050 \\
9607
low & 5.362 & 0.568 & 0.030 \\
9608
\hline
9609
\end{tabular}
9610
\end{center}
9611
9612
The estimated slopes indicate that the price of high quality cannabis
9613
dropped by about 71 cents per year during the observed interval; for
9614
medium quality it increased by 28 cents per year, and for low quality
9615
it increased by 57 cents per year. These estimates are all
9616
statistically significant with very small p-values.
9617
\index{p-value}
9618
\index{significant} \index{statistically significant}
9619
9620
The $R^2$ value for high quality cannabis is 0.44, which means
9621
that time as an explanatory variable accounts for 44\% of the observed
9622
variability in price. For the other qualities, the change in price
9623
is smaller, and variability in prices is higher, so the values
9624
of $R^2$ are smaller (but still statistically significant).
9625
\index{explanatory variable}
9626
\index{significant} \index{statistically significant}
9627
9628
The following code plots the observed prices and the fitted values:
9629
9630
\begin{verbatim}
def PlotFittedValues(model, results, label=''):
    years = model.exog[:,1]
    values = model.endog
    thinkplot.Scatter(years, values, s=15, label=label)
    thinkplot.Plot(years, results.fittedvalues, label='model')
\end{verbatim}
9637
9638
As we saw in Section~\ref{implementation}, {\tt model} contains
9639
{\tt exog} and {\tt endog}, NumPy arrays with the exogenous
9640
(explanatory) and endogenous (dependent) variables.
9641
\index{NumPy}
9642
\index{explanatory variable}
9643
\index{dependent variable}
9644
\index{exogenous variable}
9645
\index{endogenous variable}
9646
9647
\begin{figure}
9648
% timeseries.py
9649
\centerline{\includegraphics[height=2.5in]{figs/timeseries2.pdf}}
9650
\caption{Time series of daily price per gram for high quality cannabis,
9651
and a linear least squares fit.}
9652
\label{timeseries2}
9653
\end{figure}
9654
9655
{\tt PlotFittedValues} makes a scatter plot of the data points and a line
9656
plot of the fitted values. Figure~\ref{timeseries2} shows the results
9657
for high quality cannabis. The model seems like a good linear fit
9658
for the data; nevertheless, linear regression is not the most
9659
appropriate choice for this data:
9660
\index{model}
9661
\index{fitted values}
9662
9663
\begin{itemize}
9664
9665
\item First, there is no reason to expect the long-term trend to be a
9666
line or any other simple function. In general, prices are
9667
determined by supply and demand, both of which vary over time in
9668
unpredictable ways.
9669
\index{trend}
9670
9671
\item Second, the linear regression model gives equal weight to all
9672
data, recent and past. For purposes of prediction, we should
9673
probably give more weight to recent data.
9674
\index{weight}
9675
9676
\item Finally, one of the assumptions of linear regression is that the
9677
residuals are uncorrelated noise. With time series data, this
9678
assumption is often false because successive values are correlated.
9679
\index{residuals}
9680
9681
\end{itemize}
9682
9683
The next section presents an alternative that is more appropriate
9684
for time series data.
9685
9686
9687
\section{Moving averages}
9688
9689
Most time series analysis is based on the modeling assumption that the
9690
observed series is the sum of three components:
9691
\index{model}
9692
\index{moving average}
9693
9694
\begin{itemize}
9695
9696
\item Trend: A smooth function that captures persistent changes.
9697
\index{trend}
9698
9699
\item Seasonality: Periodic variation, possibly including daily,
9700
weekly, monthly, or yearly cycles.
9701
\index{seasonality}
9702
9703
\item Noise: Random variation around the long-term trend.
9704
\index{noise}
9705
9706
\end{itemize}
9707
9708
Regression is one way to extract the trend from a series, as we
9709
saw in the previous section. But if the trend is not a simple
9710
function, a good alternative is a {\bf moving average}. A moving
9711
average divides the series into overlapping regions, called {\bf windows},
9712
and computes the average of the values in each window.
9713
\index{window}
9714
9715
One of the simplest moving averages is the {\bf rolling mean}, which
9716
computes the mean of the values in each window. For example, if
9717
the window size is 3, the rolling mean computes the mean of
9718
values 0 through 2, 1 through 3, 2 through 4, etc.
9719
\index{rolling mean}
9720
\index{mean!rolling}
9721
9722
pandas provides \verb"rolling_mean", which takes a Series and a
9723
window size and returns a new Series.
9724
\index{pandas}
9725
\index{Series}
9726
9727
\begin{verbatim}
9728
>>> series = np.arange(10)
9729
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
9730
9731
>>> pandas.rolling_mean(series, 3)
9732
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
9733
\end{verbatim}
9734
9735
The first two values are {\tt nan}; the next value is the mean of
9736
the first three elements, 0, 1, and 2. The next value is the mean
9737
of 1, 2, and 3. And so on.
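
In more recent versions of pandas, \verb"rolling_mean" has been removed;
the equivalent computation uses the {\tt rolling} method.  Assuming a
current version, it looks something like this:

\begin{verbatim}
series = pandas.Series(np.arange(10))
roll_mean = series.rolling(3).mean()
\end{verbatim}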
9738
9739
Before we can apply \verb"rolling_mean" to the cannabis data, we
9740
have to deal with missing values. There are a few days in the
9741
observed interval with no reported transactions for one or more
9742
quality categories, and a period in 2013 when data collection was
9743
not active.
9744
\index{missing values}
9745
9746
In the DataFrames we have used so far, these dates are absent;
9747
the index skips days with no data. For the analysis that follows,
9748
we need to represent this missing data explicitly. We can do
9749
that by ``reindexing'' the DataFrame:
9750
\index{DataFrame}
9751
\index{reindex}
9752
9753
\begin{verbatim}
9754
dates = pandas.date_range(daily.index.min(), daily.index.max())
9755
reindexed = daily.reindex(dates)
9756
\end{verbatim}
9757
9758
The first line computes a date range that includes every day from the
9759
beginning to the end of the observed interval. The second line
9760
creates a new DataFrame with all of the data from {\tt daily}, but
9761
including rows for all dates, filled with {\tt nan}.
9762
\index{interval}
9763
\index{date range}
9764
9765
Now we can plot the rolling mean like this:
9766
9767
\begin{verbatim}
9768
roll_mean = pandas.rolling_mean(reindexed.ppg, 30)
9769
thinkplot.Plot(roll_mean.index, roll_mean)
9770
\end{verbatim}
9771
9772
The window size is 30, so each value in \verb"roll_mean" is
9773
the mean of 30 values from {\tt reindexed.ppg}.
9774
\index{pandas}
9775
\index{window}
9776
9777
\begin{figure}
9778
% timeseries.py
9779
\centerline{\includegraphics[height=2.5in]{figs/timeseries10.pdf}}
9780
\caption{Daily price and a rolling mean (left) and exponentially-weighted
9781
moving average (right).}
9782
\label{timeseries10}
9783
\end{figure}
9784
9785
Figure~\ref{timeseries10} (left)
9786
shows the result.
9787
The rolling mean seems to do a good job of smoothing out the noise and
9788
extracting the trend. The first 29 values are {\tt nan}, and wherever
9789
there's a missing value, it's followed by another 29 {\tt nan}s.
9790
There are ways to fill in these gaps, but they are a minor nuisance.
9791
\index{missing values}
9792
\index{noise}
9793
\index{smoothing}
9794
9795
An alternative is the {\bf exponentially-weighted moving average} (EWMA),
9796
which has two advantages. First, as the name suggests, it computes
9797
a weighted average where the most recent value has the highest weight
9798
and the weights for previous values drop off exponentially.
9799
Second, the pandas implementation of EWMA handles missing values
9800
better.
9801
\index{reindex}
9802
\index{exponentially-weighted moving average}
9803
\index{EWMA}
9804
9805
\begin{verbatim}
9806
ewma = pandas.ewma(reindexed.ppg, span=30)
9807
thinkplot.Plot(ewma.index, ewma)
9808
\end{verbatim}
9809
9810
The {\bf span} parameter corresponds roughly to the window size of
9811
a moving average; it controls how fast the weights drop off, so it
9812
determines the number of points that make a non-negligible contribution
9813
to each average.
9814
\index{span}
9815
\index{window}
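
Like \verb"rolling_mean", the {\tt ewma} function has been removed from
recent versions of pandas; assuming a current version, the equivalent is
the {\tt ewm} method:

\begin{verbatim}
ewma = reindexed.ppg.ewm(span=30).mean()
thinkplot.Plot(ewma.index, ewma)
\end{verbatim}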
9816
9817
Figure~\ref{timeseries10} (right) shows the EWMA for the same data.
9818
It is similar to the rolling mean, where they are both defined,
9819
but it has no missing values, which makes it easier to work with. The
9820
values are noisy at the beginning of the time series, because they are
9821
based on fewer data points.
9822
\index{missing values}
9823
9824
9825
\section{Missing values}
9826
9827
Now that we have characterized the trend of the time series, the
9828
next step is to investigate seasonality, which is periodic behavior.
9829
Time series data based on human behavior often exhibits daily,
9830
weekly, monthly, or yearly cycles. In the next section I present
9831
methods to test for seasonality, but they don't work well with
9832
missing data, so we have to solve that problem first.
9833
\index{missing values}
9834
\index{seasonality}
9835
9836
A simple and common way to fill missing data is to use a moving
9837
average. The Series method {\tt fillna} does just what we want:
9838
\index{Series}
9839
\index{fillna}
9840
9841
\begin{verbatim}
9842
reindexed.ppg.fillna(ewma, inplace=True)
9843
\end{verbatim}
9844
9845
Wherever {\tt reindexed.ppg} is {\tt nan}, {\tt fillna} replaces
9846
it with the corresponding value from {\tt ewma}. The {\tt inplace}
9847
flag tells {\tt fillna} to modify the existing Series rather than
9848
create a new one.
9849
9850
A drawback of this method is that it understates the noise in the
9851
series. We can solve that problem by adding in resampled
9852
residuals:
9853
\index{resampling}
9854
\index{noise}
9855
9856
\begin{verbatim}
9857
resid = (reindexed.ppg - ewma).dropna()
9858
fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))
9859
reindexed.ppg.fillna(fake_data, inplace=True)
9860
\end{verbatim}
9861
9862
% (One note on vocabulary: in this book I am using
9863
%``resampling'' in the statistical sense, which is drawing a random
9864
%sample from a population that is, itself, a sample. In the context
9865
%of time series analysis, it has another meaning: changing the
9866
%time between measurements in a series. I don't use the second
9867
%meaning in this book, but you might encounter it.)
9868
9869
{\tt resid} contains the residual values, not including days
9870
when {\tt ppg} is {\tt nan}. \verb"fake_data" contains the
9871
sum of the moving average and a random sample of residuals.
9872
Finally, {\tt fillna} replaces {\tt nan} with values from
9873
\verb"fake_data".
9874
\index{dropna}
9875
\index{fillna}
9876
\index{NaN}
9877
9878
\begin{figure}
9879
% timeseries.py
9880
\centerline{\includegraphics[height=2.5in]{figs/timeseries8.pdf}}
9881
\caption{Daily price with filled data.}
9882
\label{timeseries8}
9883
\end{figure}
9884
9885
Figure~\ref{timeseries8} shows the result. The filled data is visually
9886
similar to the actual values. Since the resampled residuals are
9887
random, the results are different every time; later we'll see how
9888
to characterize the error created by missing values.
9889
\index{resampling}
9890
\index{missing values}
9891
9892
9893
\section{Serial correlation}
9894
9895
As prices vary from day to day, you might expect to see patterns.
9896
If the price is high on Monday,
9897
you might expect it to be high for a few more days; and
9898
if it's low, you might expect it to stay low. A pattern
9899
like this is called {\bf serial
9900
correlation}, because each value is correlated with the next one
9901
in the series.
9902
\index{correlation!serial}
9903
\index{serial correlation}
9904
9905
To compute serial correlation, we can shift the time series
9906
by an interval called a {\bf lag}, and then compute the correlation
9907
of the shifted series with the original:
9908
\index{lag}
9909
9910
\begin{verbatim}
def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr
\end{verbatim}
9917
9918
After the shift, the first {\tt lag} values are {\tt nan}, so
9919
I use a slice to remove them before computing {\tt Corr}.
9920
\index{NaN}
9921
9922
%high 0.480121816154
9923
%medium 0.164600078362
9924
%low 0.103373620131
9925
9926
If we apply {\tt SerialCorr} to the raw price data with lag 1, we find
9927
serial correlation 0.48 for the high quality category, 0.16 for
9928
medium and 0.10 for low. In any time series with a long-term trend,
9929
we expect to see strong serial correlations; for example, if prices
9930
are falling, we expect to see values above the mean in the first
9931
half of the series and values below the mean in the second half.
9932
9933
It is more interesting to see if the correlation persists if you
9934
subtract away the trend. For example, we can compute the residual
9935
of the EWMA and then compute its serial correlation:
9936
\index{EWMA}
9937
9938
\begin{verbatim}
9939
ewma = pandas.ewma(reindexed.ppg, span=30)
9940
resid = reindexed.ppg - ewma
9941
corr = SerialCorr(resid, 1)
9942
\end{verbatim}
9943
9944
With lag=1, the serial correlations for the de-trended data are
9945
-0.022 for high quality, -0.015 for medium, and 0.036 for low.
9946
These values are small, indicating that there is little or
9947
no one-day serial correlation in these series.
9948
\index{pandas}
9949
9950
To check for weekly, monthly, and yearly seasonality, I ran
9951
the analysis again with different lags. Here are the results:
9952
\index{seasonality}
9953
9954
\begin{center}
9955
\begin{tabular}{|c|c|c|c|}
9956
\hline
9957
lag & high & medium & low \\ \hline
9958
1 & -0.029 & -0.014 & 0.034 \\
9959
7 & 0.02 & -0.042 & -0.0097 \\
9960
30 & 0.014 & -0.0064 & -0.013 \\
9961
365 & 0.045 & 0.015 & 0.033 \\
9962
\hline
9963
\end{tabular}
9964
\end{center}
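
A loop along the following lines reproduces this kind of analysis for
one quality category; the exact values depend on how missing days are
filled:

\begin{verbatim}
ewma = pandas.ewma(reindexed.ppg, span=30)
resid = reindexed.ppg - ewma

for lag in [1, 7, 30, 365]:
    print(lag, SerialCorr(resid, lag))
\end{verbatim}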
9965
9966
In the next section we'll test whether these correlations are
9967
statistically significant (they are not), but at this point we can
9968
tentatively conclude that there are no substantial seasonal patterns
9969
in these series, at least not with these lags.
9970
\index{significant} \index{statistically significant}
9971
9972
9973
\section{Autocorrelation}
9974
9975
If you think a series might have some serial correlation, but you
9976
don't know which lags to test, you can test them all! The {\bf
9977
autocorrelation function} is a function that maps from lag to the
9978
serial correlation with the given lag. ``Autocorrelation'' is another
9979
name for serial correlation, used more often when the lag is not 1.
9980
\index{autocorrelation function}
9981
9982
StatsModels, which we used for linear regression in
9983
Section~\ref{statsmodels}, also provides functions for time series
9984
analysis, including {\tt acf}, which computes the autocorrelation
9985
function:
9986
\index{StatsModels}
9987
9988
\begin{verbatim}
9989
import statsmodels.tsa.stattools as smtsa
9990
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
9991
\end{verbatim}
9992
9993
{\tt acf} computes serial correlations with
9994
lags from 0 through {\tt nlags}. The {\tt unbiased} flag tells
9995
{\tt acf} to correct the estimates for the sample size. The result
9996
is an array of correlations. If we select daily prices for high
9997
quality, and extract correlations for lags 1, 7, 30, and 365, we can
9998
confirm that {\tt acf} and {\tt SerialCorr} yield approximately
9999
the same results:
10000
\index{acf}
10001
10002
\begin{verbatim}
10003
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
10004
1.000, -0.029, 0.020, 0.014, 0.044
10005
\end{verbatim}
10006
10007
With {\tt lag=0}, {\tt acf} computes the correlation of the series
10008
with itself, which is always 1.
10009
\index{lag}
10010
10011
\begin{figure}
10012
% timeseries.py
10013
\centerline{\includegraphics[height=2.5in]{figs/timeseries9.pdf}}
10014
\caption{Autocorrelation function for daily prices (left), and
10015
daily prices with a simulated weekly seasonality (right).}
10016
\label{timeseries9}
10017
\end{figure}
10018
10019
Figure~\ref{timeseries9} (left) shows autocorrelation functions for
10020
the three quality categories, with {\tt nlags=40}. The gray region
10021
shows the normal variability we would expect if there is no actual
10022
autocorrelation; anything that falls outside this range is
10023
statistically significant, with a p-value less than 5\%. Since
10024
the false positive rate is 5\%, and
10025
we are computing 120 correlations (40 lags for each of 3 time series),
10026
we expect to see about 6 points outside this region. In fact, there
10027
are 7. We conclude that there are no autocorrelations
10028
in these series that could not be explained by chance.
10029
\index{p-value}
10030
\index{significant} \index{statistically significant}
10031
\index{false positive}
10032
10033
I computed the gray regions by resampling the residuals. You
10034
can see my code in {\tt timeseries.py}; the function is called
10035
{\tt SimulateAutocorrelation}.
10036
\index{resampling}
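
The idea is to resample the residuals many times, compute the
autocorrelation function of each simulated series, and take percentiles
across iterations for each lag.  Here is a simplified sketch of that
process (not the actual implementation in {\tt timeseries.py}):

\begin{verbatim}
def SimulateAutocorrelationSketch(filled, iters=1001, nlags=40):
    # resample the residuals, compute the ACF of each fake series,
    # and take the 2.5th and 97.5th percentiles for each lag
    resid = filled.resid.dropna()
    acfs = []
    for _ in range(iters):
        fake = thinkstats2.Resample(resid, len(resid))
        acfs.append(smtsa.acf(fake, nlags=nlags, unbiased=True)[1:])

    low, high = thinkstats2.PercentileRows(acfs, (2.5, 97.5))
    return low, high
\end{verbatim}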
10037
10038
To see what the autocorrelation function looks like when there is a
10039
seasonal component, I generated simulated data by adding a weekly
10040
cycle. Assuming that demand for cannabis is higher on weekends, we
10041
might expect the price to be higher. To simulate this effect, I
10042
select dates that fall on Friday or Saturday and add a random amount
10043
to the price, chosen from a uniform distribution from \$0 to \$2.
10044
\index{simulation}
10045
\index{uniform distribution}
10046
\index{distribution!uniform}
10047
10048
\begin{verbatim}
def AddWeeklySeasonality(daily):
    frisat = (daily.index.dayofweek==4) | (daily.index.dayofweek==5)
    fake = daily.copy()
    fake.ppg[frisat] += np.random.uniform(0, 2, frisat.sum())
    return fake
\end{verbatim}
10055
10056
{\tt frisat} is a boolean Series, {\tt True} if the day of the
10057
week is Friday or Saturday. {\tt fake} is a new DataFrame, initially
10058
a copy of {\tt daily}, which we modify by adding random values
10059
to {\tt ppg}. {\tt frisat.sum()} is the total number of Fridays
10060
and Saturdays, which is the number of random values we have to
10061
generate.
10062
\index{DataFrame}
10063
\index{Series}
10064
\index{boolean}
10065
10066
Figure~\ref{timeseries9} (right) shows autocorrelation functions for
10067
prices with this simulated seasonality. As expected, the
10068
correlations are highest when the lag is a multiple of 7. For
10069
high and medium quality, the new correlations are statistically
10070
significant. For low quality they are not, because residuals in this
10071
category are large; the effect would have to be bigger
10072
to be visible through the noise.
10073
\index{significant} \index{statistically significant}
10074
\index{residuals}
10075
\index{lag}
10076
10077
10078
\section{Prediction}
10079
10080
Time series analysis can be used to investigate, and sometimes
10081
explain, the behavior of systems that vary in time. It can also
10082
make predictions.
10083
\index{prediction}
10084
10085
The linear regressions we used in Section~\ref{timeregress} can be
10086
used for prediction. The RegressionResults class provides {\tt
10087
predict}, which takes a DataFrame containing the explanatory
10088
variables and returns a sequence of predictions. Here's the code:
10089
\index{explanatory variable}
10090
\index{linear regression}
10091
10092
\begin{verbatim}
def GenerateSimplePrediction(results, years):
    n = len(years)
    inter = np.ones(n)
    d = dict(Intercept=inter, years=years)
    predict_df = pandas.DataFrame(d)
    predict = results.predict(predict_df)
    return predict
\end{verbatim}
10101
10102
{\tt results} is a RegressionResults object; {\tt years} is the
10103
sequence of time values we want predictions for. The function
10104
constructs a DataFrame, passes it to {\tt predict}, and
10105
returns the result.
10106
\index{pandas}
10107
\index{DataFrame}
10108
10109
If all we want is a single, best-guess prediction, we're done. But
10110
for most purposes it is important to quantify error. In other words,
10111
we want to know how accurate the prediction is likely to be.
10112
10113
There are three sources of error we should take into account:
10114
10115
\begin{itemize}
10116
10117
\item Sampling error: The prediction is based on estimated
10118
parameters, which depend on random variation
10119
in the sample. If we run the experiment again, we expect
10120
the estimates to vary.
10121
\index{sampling error}
10122
\index{parameter}
10123
10124
\item Random variation: Even if the estimated parameters are
10125
perfect, the observed data varies randomly around the long-term
10126
trend, and we expect this variation to continue in the future.
10127
\index{noise}
10128
10129
\item Modeling error: We have already seen evidence that the long-term
10130
trend is not linear, so predictions based on a linear model will
10131
eventually fail.
10132
\index{modeling error}
10133
10134
\end{itemize}
10135
10136
Another source of error to consider is unexpected future events.
10137
Agricultural prices are affected by weather, and all prices are
10138
affected by politics and law. As I write this, cannabis is legal in
10139
two states and legal for medical purposes in 20 more. If more states
10140
legalize it, the price is likely to go down. But if
10141
the federal government cracks down, the price might go up.
10142
10143
Modeling errors and unexpected future events are hard to quantify.
10144
Sampling error and random variation are easier to deal with, so we'll
10145
do that first.
10146
10147
To quantify sampling error, I use resampling, as we did in
10148
Section~\ref{regest}. As always, the goal is to use the actual
10149
observations to simulate what would happen if we ran the experiment
10150
again. The simulations are based on the assumption that the estimated
10151
parameters are correct, but the random residuals could have been
10152
different. Here is a function that runs the simulations:
10153
\index{resampling}
10154
10155
\begin{verbatim}
def SimulateResults(daily, iters=101):
    model, results = RunLinearModel(daily)
    fake = daily.copy()

    result_seq = []
    for i in range(iters):
        fake.ppg = results.fittedvalues + Resample(results.resid)
        _, fake_results = RunLinearModel(fake)
        result_seq.append(fake_results)

    return result_seq
\end{verbatim}
10168
10169
{\tt daily} is a DataFrame containing the observed prices;
10170
{\tt iters} is the number of simulations to run.
10171
\index{DataFrame}
10172
\index{price}
10173
10174
{\tt SimulateResults} uses {\tt RunLinearModel}, from
10175
Section~\ref{timeregress}, to estimate the slope and intercept
10176
of the observed values.
10177
10178
Each time through the loop, it generates a ``fake'' dataset by
10179
resampling the residuals and adding them to the fitted values. Then
10180
it runs a linear model on the fake data and stores the RegressionResults
10181
object.
10182
\index{model}
10183
\index{residuals}
10184
10185
The next step is to use the simulated results to generate predictions:
10186
10187
\begin{verbatim}
def GeneratePredictions(result_seq, years, add_resid=False):
    n = len(years)
    d = dict(Intercept=np.ones(n), years=years, years2=years**2)
    predict_df = pandas.DataFrame(d)

    predict_seq = []
    for fake_results in result_seq:
        predict = fake_results.predict(predict_df)
        if add_resid:
            predict += thinkstats2.Resample(fake_results.resid, n)
        predict_seq.append(predict)

    return predict_seq
\end{verbatim}
10202
10203
{\tt GeneratePredictions} takes the sequence of results from the
10204
previous step, as well as {\tt years}, which is a sequence of
10205
floats that specifies the interval to generate predictions for,
10206
and \verb"add_resid", which indicates whether it should add resampled
10207
residuals to the straight-line prediction.
10208
{\tt GeneratePredictions} iterates through the sequence of
10209
RegressionResults and generates a sequence of predictions.
10210
\index{resampling}
10211
10212
\begin{figure}
10213
% timeseries.py
10214
\centerline{\includegraphics[height=2.5in]{figs/timeseries4.pdf}}
10215
\caption{Predictions based on linear fits, showing variation due
10216
to sampling error and prediction error.}
10217
\label{timeseries4}
10218
\end{figure}
10219
10220
Finally, here's the code that plots a 90\% confidence interval for
10221
the predictions:
10222
\index{confidence interval}
10223
10224
\begin{verbatim}
def PlotPredictions(daily, years, iters=101, percent=90):
    result_seq = SimulateResults(daily, iters=iters)
    p = (100 - percent) / 2
    percents = p, 100-p

    predict_seq = GeneratePredictions(result_seq, years, True)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.3, color='gray')

    predict_seq = GeneratePredictions(result_seq, years, False)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.5, color='gray')
\end{verbatim}
10238
10239
{\tt PlotPredictions} calls {\tt GeneratePredictions} twice: once
10240
with \verb"add_resid=True" and again with \verb"add_resid=False".
10241
It uses {\tt PercentileRows} to select the 5th and 95th percentiles
10242
for each year, then plots a gray region between these bounds.
10243
\index{FillBetween}
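
For example, a call for one quality category might look like this
(the prediction interval in {\tt years} is illustrative):

\begin{verbatim}
name = 'high'
daily = dailies[name]

years = np.linspace(0, 5, 101)
PlotPredictions(daily, years)
\end{verbatim}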
10244
10245
Figure~\ref{timeseries4} shows the result.
10246
The dark gray region represents a 90\% confidence interval for
10247
the sampling error; that is, uncertainty about the estimated
10248
slope and intercept due to sampling.
10249
\index{sampling error}
10250
10251
The lighter region shows
10252
a 90\% confidence interval for prediction error, which is the
10253
sum of sampling error and random variation.
10254
\index{noise}
10255
10256
These regions quantify sampling error and random variation, but
10257
not modeling error. In general modeling error is hard to quantify,
10258
but in this case we can address at least one source of error,
10259
unpredictable external events.
10260
\index{modeling error}
10261
10262
The regression model is based on the assumption that the system
10263
is {\bf stationary}; that is, that the parameters of the model
10264
don't change over time.
10265
Specifically, it assumes that the slope and
10266
intercept are constant, as well as the distribution of residuals.
10267
\index{stationary model}
10268
\index{parameter}
10269
10270
But looking at the moving averages in Figure~\ref{timeseries10}, it
10271
seems like the slope changes at least once during the observed
10272
interval, and the variance of the residuals seems bigger in the first
10273
half than the second.
10274
\index{slope}
10275
10276
As a result, the parameters we get depend on the interval we
10277
observe. To see how much effect this has on the predictions,
10278
we can extend {\tt SimulateResults} to use intervals of observation
10279
with different start and end dates. My implementation is in
10280
{\tt timeseries.py}.
10281
\index{simulation}
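
A simplified sketch of that extension (not the implementation in
{\tt timeseries.py}) fits the model to successively shorter intervals
by moving the start date forward:

\begin{verbatim}
def SimulateIntervalsSketch(daily, iters=101):
    # drop progressively more of the oldest data, keeping at least
    # half of the observations, and refit the model each time
    result_seq = []
    starts = np.linspace(0, len(daily)//2, iters).astype(int)
    for start in starts:
        subset = daily[start:]
        _, results = RunLinearModel(subset)
        result_seq.append(results)

    return result_seq
\end{verbatim}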
10282
10283
\begin{figure}
10284
% timeseries.py
10285
\centerline{\includegraphics[height=2.5in]{figs/timeseries5.pdf}}
10286
\caption{Predictions based on linear fits, showing
10287
variation due to the interval of observation.}
10288
\label{timeseries5}
10289
\end{figure}
10290
10291
Figure~\ref{timeseries5} shows the result for the medium quality
10292
category. The lightest gray area shows a confidence interval that
10293
includes uncertainty due to sampling error, random variation, and
10294
variation in the interval of observation.
10295
\index{confidence interval}
10296
\index{interval}
10297
10298
The model based on the entire interval has positive slope, indicating
10299
that prices were increasing. But the most recent interval shows signs
10300
of decreasing prices, so models based on the most recent data have
10301
negative slope. As a result, the widest predictive interval includes
10302
the possibility of decreasing prices over the next year.
10303
\index{model}
10304
10305
10306
\section{Further reading}
10307
10308
Time series analysis is a big topic; this chapter has only scratched
10309
the surface. An important tool for working with time series data
10310
is autoregression, which I did not cover here, mostly because it turns
10311
out not to be useful for the example data I worked with.
10312
\index{time series}
10313
10314
But once you
10315
have learned the material in this chapter, you are well prepared
10316
to learn about autoregression. One resource I recommend is
10317
Philipp Janert's book, {\it Data Analysis with Open Source Tools},
10318
O'Reilly Media, 2011. His chapter on time series analysis picks up
10319
where this one leaves off.
10320
\index{Janert, Philipp}
10321
10322
10323
\section{Exercises}
10324
10325
My solution to these exercises is in \verb"chap12soln.py".
10326
10327
\begin{exercise}
10328
The linear model I used in this chapter has the obvious drawback
10329
that it is linear, and there is no reason to expect prices to
10330
change linearly over time.
10331
We can add flexibility to the model by adding a quadratic term,
10332
as we did in Section~\ref{nonlinear}.
10333
\index{nonlinear}
10334
\index{linear model}
10335
\index{quadratic model}
10336
10337
Use a quadratic model to fit the time series of daily prices,
10338
and use the model to generate predictions. You will have to
10339
write a version of {\tt RunLinearModel} that runs that quadratic
10340
model, but after that you should be able to reuse code in
10341
{\tt timeseries.py} to generate predictions.
10342
\index{prediction}
10343
10344
\end{exercise}
10345
10346
\begin{exercise}
10347
Write a definition for a class named {\tt SerialCorrelationTest}
10348
that extends {\tt HypothesisTest} from Section~\ref{hypotest}.
10349
It should take a series and a lag as data, compute the serial
10350
correlation of the series with the given lag, and then compute
10351
the p-value of the observed correlation.
10352
\index{HypothesisTest}
10353
\index{p-value}
10354
\index{lag}
10355
10356
Use this class to test whether the serial correlation in raw
10357
price data is statistically significant. Also test the residuals
10358
of the linear model and (if you did the previous exercise),
10359
the quadratic model.
10360
\index{quadratic model}
10361
\index{significant} \index{statistically significant}
10362
10363
\end{exercise}
10364
10365
\begin{exercise}
10366
There are several ways to extend the EWMA model to generate predictions.
10367
One of the simplest is something like this:
10368
\index{EWMA}
10369
10370
\begin{enumerate}
10371
10372
\item Compute the EWMA of the time series and use the last point
10373
as an intercept, {\tt inter}.
10374
10375
\item Compute the EWMA of differences between successive elements in
10376
the time series and use the last point as a slope, {\tt slope}.
10377
\index{slope}
10378
10379
\item To predict values at future times, compute {\tt inter + slope * dt},
10380
where {\tt dt} is the difference between the time of the prediction and
10381
the time of the last observation.
10382
\index{prediction}
10383
10384
\end{enumerate}
10385
10386
Use this method to generate predictions for a year after the last
10387
observation. A few hints:
10388
10389
\begin{itemize}
10390
10391
\item Use {\tt timeseries.FillMissing} to fill in missing values
10392
before running this analysis. That way the time between consecutive
10393
elements is consistent.
10394
\index{missing values}
10395
10396
\item Use {\tt Series.diff} to compute differences between successive
10397
elements.
10398
\index{Series}
10399
10400
\item Use {\tt reindex} to extend the DataFrame index into the future.
10401
\index{reindex}
10402
10403
\item Use {\tt fillna} to put your predicted values into the DataFrame.
10404
\index{fillna}
10405
10406
\end{itemize}
10407
10408
\end{exercise}
10409
10410
10411
\section{Glossary}
10412
10413
\begin{itemize}
10414
10415
\item time series: A dataset where each value is associated with
10416
a timestamp, often a series of measurements and the times they
10417
were collected.
10418
\index{time series}
10419
10420
\item window: A sequence of consecutive values in a time series,
10421
often used to compute a moving average.
10422
\index{window}
10423
10424
\item moving average: One of several statistics intended to estimate
10425
the underlying trend in a time series by computing averages (of
10426
some kind) for a series of overlapping windows.
10427
\index{moving average}
10428
10429
\item rolling mean: A moving average based on the mean value in
10430
each window.
10431
\index{rolling mean}
10432
10433
\item exponentially-weighted moving average (EWMA): A moving
10434
average based on a weighted mean that gives the highest weight
10435
to the most recent values, and exponentially decreasing weights
10436
to earlier values. \index{exponentially-weighted moving average} \index{EWMA}
10437
10438
\item span: A parameter of EWMA that determines how quickly the
10439
weights decrease.
10440
\index{span}
10441
10442
\item serial correlation: Correlation between a time series and
10443
a shifted or lagged version of itself.
10444
\index{serial correlation}
10445
10446
\item lag: The size of the shift in a serial correlation or
10447
autocorrelation.
10448
\index{lag}
10449
10450
\item autocorrelation: A more general term for a serial correlation
10451
with any amount of lag.
10452
\index{autocorrelation function}
10453
10454
\item autocorrelation function: A function that maps from lag to
10455
serial correlation.
10456
10457
\item stationary: A model is stationary if the parameters and the
10458
distribution of residuals do not change over time.
10459
\index{model}
10460
\index{stationary model}
10461
10462
\end{itemize}
10463
10464
10465
10466
\chapter{Survival analysis}
10467
10468
{\bf Survival analysis} is a way to describe how long things last.
10469
It is often used to study human lifetimes, but it
10470
also applies to ``survival'' of mechanical and electronic components, or
10471
more generally to intervals in time before an event.
10472
\index{survival analysis}
10473
\index{mechanical component}
10474
\index{electrical component}
10475
10476
If someone you know has been diagnosed with a life-threatening
10477
disease, you might have seen a ``5-year survival rate,'' which
10478
is the probability of surviving five years after diagnosis. That
10479
estimate and related statistics are the result of survival analysis.
10480
\index{survival rate}
10481
10482
The code in this chapter is in {\tt survival.py}. For information
10483
about downloading and working with this code, see Section~\ref{code}.
10484
10485
10486
\section{Survival curves}
10487
\label{survival}
10488
10489
The fundamental concept in survival analysis is the {\bf survival
10490
curve}, $S(t)$, which is a function that maps from a duration, $t$, to the
10491
probability of surviving longer than $t$. If you know the distribution
10492
of durations, or ``lifetimes'', finding the survival curve is easy;
10493
it's just the complement of the CDF: \index{survival curve}
10494
%
10495
\[ S(t) = 1 - \CDF(t) \]
10496
%
10497
where $\CDF(t)$ is the probability of a lifetime less than or equal
10498
to $t$.
10499
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}
10500
10501
For example, in the NSFG dataset, we know the duration of 11189
10502
complete pregnancies. We can read this data and compute the CDF:
10503
\index{pregnancy length}
10504
10505
\begin{verbatim}
10506
preg = nsfg.ReadFemPreg()
10507
complete = preg.query('outcome in [1, 3, 4]').prglngth
10508
cdf = thinkstats2.Cdf(complete, label='cdf')
10509
\end{verbatim}
10510
10511
The outcome codes {\tt 1, 3, 4} indicate live birth, stillbirth,
10512
and miscarriage. For this analysis I am excluding induced abortions,
10513
ectopic pregnancies, and pregnancies that were in progress when
10514
the respondent was interviewed.
10515
10516
The DataFrame method {\tt query} takes a boolean expression and
10517
evaluates it for each row, selecting the rows that yield True.
10518
\index{DataFrame}
10519
\index{boolean}
10520
\index{query}
10521
10522
\begin{figure}
10523
% survival.py
10524
\centerline{\includegraphics[height=3.0in]{figs/survival1.pdf}}
10525
\caption{Cdf and survival curve for pregnancy length (top),
10526
hazard curve (bottom).}
10527
\label{survival1}
10528
\end{figure}
10529
10530
Figure~\ref{survival1} (top) shows the CDF of pregnancy length
10531
and its complement, the survival curve. To represent the
10532
survival curve, I define an object that wraps a Cdf and
10533
adapts the interface:
10534
\index{Cdf}
10535
\index{pregnancy length}
10536
\index{SurvivalFunction}
10537
10538
\begin{verbatim}
class SurvivalFunction(object):
    def __init__(self, cdf, label=''):
        self.cdf = cdf
        self.label = label or cdf.label

    @property
    def ts(self):
        return self.cdf.xs

    @property
    def ss(self):
        return 1 - self.cdf.ps
\end{verbatim}
10552
10553
{\tt SurvivalFunction} provides two properties: {\tt ts}, which
10554
is the sequence of lifetimes, and {\tt ss}, which is the survival
10555
curve. In Python, a ``property'' is a method that can be
10556
invoked as if it were a variable.
10557
10558
We can instantiate a {\tt SurvivalFunction} by passing
10559
the CDF of lifetimes:
10560
\index{property}
10561
10562
\begin{verbatim}
10563
sf = SurvivalFunction(cdf)
10564
\end{verbatim}
10565
10566
{\tt SurvivalFunction} also provides \verb"__getitem__" and
10567
{\tt Prob}, which evaluates the survival curve:
10568
10569
\begin{verbatim}
# class SurvivalFunction

    def __getitem__(self, t):
        return self.Prob(t)

    def Prob(self, t):
        return 1 - self.cdf.Prob(t)
\end{verbatim}
10578
10579
For example, {\tt sf[13]} is the fraction of pregnancies that
10580
proceed past the first trimester:
10581
\index{trimester}
10582
10583
\begin{verbatim}
10584
>>> sf[13]
10585
0.86022
10586
>>> cdf[13]
10587
0.13978
10588
\end{verbatim}
10589
10590
About 86\% of pregnancies proceed past the first trimester;
10591
about 14\% do not.
10592
10593
{\tt SurvivalFunction} provides {\tt Render}, so we can
10594
plot {\tt sf} using the functions in {\tt thinkplot}:
10595
\index{thinkplot}
10596
10597
\begin{verbatim}
10598
thinkplot.Plot(sf)
10599
\end{verbatim}
10600
10601
Figure~\ref{survival1} (top) shows the result. The curve is nearly
10602
flat between 13 and 26 weeks, which shows that few pregnancies
10603
end in the second trimester. And the curve is steepest around 39
10604
weeks, which is the most common pregnancy length.
10605
\index{pregnancy length}
10606
10607
10608
\section{Hazard function}
10609
\label{hazard}
10610
10611
From the survival curve we can derive the {\bf hazard function};
10612
for pregnancy lengths, the hazard function maps from a time, $t$, to
10613
the fraction of pregnancies that continue until $t$ and then end at
10614
$t$. To be more precise:
10615
%
10616
\[ \lambda(t) = \frac{S(t) - S(t+1)}{S(t)} \]
10617
%
10618
The numerator is the fraction of lifetimes that end at $t$, which
10619
is also $\PMF(t)$.
10620
\index{hazard function}
10621
10622
{\tt SurvivalFunction} provides {\tt MakeHazard}, which calculates
10623
the hazard function:
10624
10625
\begin{verbatim}
# class SurvivalFunction

    def MakeHazard(self, label=''):
        ss = self.ss
        lams = {}
        for i, t in enumerate(self.ts[:-1]):
            hazard = (ss[i] - ss[i+1]) / ss[i]
            lams[t] = hazard

        return HazardFunction(lams, label=label)
\end{verbatim}
10637
10638
The {\tt HazardFunction} object is a wrapper for a pandas
10639
Series:
10640
\index{pandas}
10641
\index{Series}
10642
\index{wrapper}
10643
10644
\begin{verbatim}
class HazardFunction(object):

    def __init__(self, d, label=''):
        self.series = pandas.Series(d)
        self.label = label
\end{verbatim}
10651
10652
{\tt d} can be a dictionary or any other type that can initialize
10653
a Series, including another Series. {\tt label} is a string used
10654
to identify the HazardFunction when plotted.
10655
\index{HazardFunction}
10656
10657
{\tt HazardFunction} provides \verb"__getitem__", so we can evaluate
10658
it like this:
10659
10660
\begin{verbatim}
10661
>>> hf = sf.MakeHazard()
10662
>>> hf[39]
10663
0.49689
10664
\end{verbatim}
10665
10666
So of all pregnancies that proceed until week 39, about
10667
50\% end in week 39.
10668
10669
Figure~\ref{survival1} (bottom) shows the hazard function for
10670
pregnancy lengths. For times after week 42, the hazard function
10671
is erratic because it is based on a small number of cases.
10672
Other than that, the shape of the curve is as expected: it is
10673
highest around 39 weeks, and a little higher in the first
10674
trimester than in the second.
10675
\index{pregnancy length}
10676
10677
The hazard function is useful in its own right, but it is also an
10678
important tool for estimating survival curves, as we'll see in the
10679
next section.
10680
10681
10682
\section{Inferring survival curves}
10683
10684
If someone gives you the CDF of lifetimes, it is easy to compute the
10685
survival and hazard functions. But in many real-world
10686
scenarios, we can't measure the distribution of lifetimes directly.
10687
We have to infer it.
10688
\index{survival curve}
10689
\index{CDF}
10690
10691
For example, suppose you are following a group of patients to see how
10692
long they survive after diagnosis. Not all patients are diagnosed on
10693
the same day, so at any point in time, some patients have survived
10694
longer than others. If some patients have died, we know their
10695
survival times. For patients who are still alive, we don't know
10696
survival times, but we have a lower bound.
10697
\index{diagnosis}
10698
10699
If we wait until all patients are dead, we can compute the survival
10700
curve, but if we are evaluating the effectiveness of a new treatment,
10701
we can't wait that long! We need a way to estimate survival curves
10702
using incomplete information.
10703
\index{incomplete information}
10704
10705
As a more cheerful example, I will use NSFG data to quantify how
10706
long respondents ``survive'' until they get married for the
10707
first time. The range of respondents' ages is 14 to 44 years, so
10708
the dataset provides a snapshot of women at different stages in their
10709
lives.
10710
\index{marital status}
10711
10712
For women who have been married, the dataset includes the date
10713
of their first marriage and their age at the time.
10714
For women who have not been married, we know their age when interviewed,
10715
but have no way of knowing when or if they will get married.
10716
\index{age}
10717
10718
Since we know the age at first marriage for {\em some\/} women, it
10719
might be tempting to exclude the rest and compute the CDF of
10720
the known data. That is a bad idea. The result would
10721
be doubly misleading: (1) older women would be overrepresented,
10722
because they are more likely to be married when interviewed,
10723
and (2) married women would be overrepresented! In fact, this
10724
analysis would lead to the conclusion that all women get married,
10725
which is obviously incorrect.
10726
10727
10728
\section{Kaplan-Meier estimation}
10729
10730
In this example it is not only desirable but necessary to include
10731
observations of unmarried women, which brings us to one of the central
10732
algorithms in survival analysis, {\bf Kaplan-Meier estimation}.
10733
\index{Kaplan-Meier estimation}
10734
10735
The general idea is that we can use the data to estimate the hazard
10736
function, then convert the hazard function to a survival curve.
10737
To estimate the hazard function, we consider, for each age,
10738
(1) the number of women who got married at that age and (2) the number
10739
of women ``at risk'' of getting married, which includes all women
10740
who were not married at an earlier age.
10741
\index{hazard function}
10742
\index{at risk}
10743
10744
Here's the code:
10745
10746
\begin{verbatim}
def EstimateHazardFunction(complete, ongoing, label=''):

    hist_complete = Counter(complete)
    hist_ongoing = Counter(ongoing)

    ts = list(hist_complete | hist_ongoing)
    ts.sort()

    at_risk = len(complete) + len(ongoing)

    lams = pandas.Series(index=ts)
    for t in ts:
        ended = hist_complete[t]
        censored = hist_ongoing[t]

        lams[t] = ended / at_risk
        at_risk -= ended + censored

    return HazardFunction(lams, label=label)
\end{verbatim}
10767
10768
{\tt complete} is the set of complete observations; in this case,
10769
the ages when respondents got married. {\tt ongoing} is the set
10770
of incomplete observations; that is, the ages of unmarried women
10771
when they were interviewed.
10772
10773
First, we precompute \verb"hist_complete", which is a Counter
10774
that maps from each age to the number of women married at that
10775
age, and \verb"hist_ongoing", which maps from each age to the
10776
number of unmarried women interviewed at that age.
10777
10778
\index{Counter}
10779
\index{survival curve}
10780
10781
{\tt ts} is the union of ages when respondents got married
10782
and ages when unmarried women were interviewed, sorted in
10783
increasing order.
10784
10785
\verb"at_risk" keeps track of the number of respondents considered
10786
``at risk'' at each age; initially, it is the total number of
10787
respondents.
10788
10789
The result is stored in a pandas {\tt Series} that maps from
10790
each age to the estimated hazard function at that age.
10791
10792
Each time through the loop, we consider one age, {\tt t},
10793
and compute the number of events that end at {\tt t} (that is,
10794
the number of respondents married at that age) and the number
10795
of events censored at {\tt t} (that is, the number of women
10796
interviewed at {\tt t} whose future marriage dates are
10797
censored). In this context, ``censored'' means that the
10798
data are unavailable because of the data collection process.
10799
10800
The estimated hazard function is the fraction of the cases
10801
at risk that end at {\tt t}.
10802
10803
At the end of the loop, we subtract from \verb"at_risk" the
10804
number of cases that ended or were censored at {\tt t}.
10805
10806
Finally, we pass {\tt lams} to the {\tt HazardFunction}
10807
constructor and return the result.
10808
10809
\index{HazardFunction}
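
To see the algorithm on something small, here is a toy example
(made-up data, assuming float division as in the book's code): two
observations end at time 1, one ends at time 2, and one is censored at
time 2.

\begin{verbatim}
>>> complete = [1, 1, 2]
>>> ongoing = [2]
>>> hf = EstimateHazardFunction(complete, ongoing)
>>> hf[1], hf[2]
(0.5, 0.5)
\end{verbatim}

At time 1, all 4 cases are at risk and 2 end, so the hazard is 0.5.
At time 2, only 2 cases remain at risk and 1 ends, so the hazard is
again 0.5.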
10810
10811
10812
\section{The marriage curve}
10813
10814
To test this function, we have to do some data cleaning and
10815
transformation. The NSFG variables we need are:
10816
\index{marital status}
10817
10818
\begin{itemize}
10819
10820
\item {\tt cmbirth}: The respondent's date of birth, known for
10821
all respondents.
10822
\index{date of birth}
10823
10824
\item {\tt cmintvw}: The date the respondent was interviewed,
10825
known for all respondents.
10826
10827
\item {\tt cmmarrhx}: The date the respondent was first married,
10828
if applicable and known.
10829
10830
\item {\tt evrmarry}: 1 if the respondent had been
10831
married prior to the date of interview, 0 otherwise.
10832
10833
\end{itemize}
10834
10835
The first three variables are encoded in ``century-months''; that is, the
10836
integer number of months since December 1899. So century-month
10837
1 is January 1900.
10838
\index{century month}
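
In other words, the century-month for a given calendar date is
$12 \, (\mathrm{year} - 1900) + \mathrm{month}$. Going the other
direction, a quick sketch (not the book's code):

\begin{verbatim}
def CenturyMonthToDate(cm):
    """Converts a century-month to a (year, month) pair."""
    year = 1900 + (cm - 1) // 12
    month = (cm - 1) % 12 + 1
    return year, month

# for example, century-month 1143 is March 1995
\end{verbatim}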
10839
10840
First, we read the respondent file and replace invalid values of
10841
{\tt cmmarrhx}:
10842
10843
\begin{verbatim}
10844
resp = chap01soln.ReadFemResp()
10845
resp.cmmarrhx.replace([9997, 9998, 9999], np.nan, inplace=True)
10846
\end{verbatim}
10847
10848
Then we compute each respondent's age when married and age when
10849
interviewed:
10850
\index{NaN}
10851
10852
\begin{verbatim}
10853
resp['agemarry'] = (resp.cmmarrhx - resp.cmbirth) / 12.0
10854
resp['age'] = (resp.cmintvw - resp.cmbirth) / 12.0
10855
\end{verbatim}
10856
10857
Next we extract {\tt complete}, which is the age at marriage for
10858
women who have been married, and {\tt ongoing}, which is the
10859
age at interview for women who have not:
10860
\index{age}
10861
10862
\begin{verbatim}
10863
complete = resp[resp.evrmarry==1].agemarry
10864
ongoing = resp[resp.evrmarry==0].age
10865
\end{verbatim}
10866
10867
Finally, we compute the hazard function.
10869
\index{hazard function}
10870
10871
\begin{verbatim}
10872
hf = EstimateHazardFunction(complete, ongoing)
10873
\end{verbatim}
10874
10875
Figure~\ref{survival2} (top) shows the estimated hazard function;
10876
it is low in the teens,
10877
higher in the 20s, and declining in the 30s. It increases again in
10878
the 40s, but that is an artifact of the estimation process; as the
10879
number of respondents ``at risk'' decreases, a small number of
10880
women getting married yields a large estimated hazard. The survival
10881
curve will smooth out this noise.
10882
\index{noise}
10883
10884
10885
\section{Estimating the survival curve}
10886
10887
Once we have the hazard function, we can estimate the survival curve.
10888
The chance of surviving past time {\tt t} is the chance of surviving
10889
all times up through {\tt t}, which is the cumulative product of
10890
the complementary hazard function:
10891
%
10892
\[ [1-\lambda(0)] [1-\lambda(1)] \ldots [1-\lambda(t)] \]
10893
%
10894
The {\tt HazardFunction} class provides {\tt MakeSurvival}, which
10895
computes this product:
10896
\index{cumulative product}
10897
\index{SurvivalFunction}
10898
10899
\begin{verbatim}
# class HazardFunction:

    def MakeSurvival(self):
        ts = self.series.index
        ss = (1 - self.series).cumprod()
        cdf = thinkstats2.Cdf(ts, 1-ss)
        sf = SurvivalFunction(cdf)
        return sf
\end{verbatim}
10909
10910
{\tt ts} is the sequence of times where the hazard function is
10911
estimated. {\tt ss} is the cumulative product of the complementary
10912
hazard function, so it is the survival curve.
10913
10914
Because of the way {\tt SurvivalFunction} is implemented, we have
10915
to compute the complement of {\tt ss}, make a Cdf, and then instantiate
10916
a SurvivalFunction object.
10917
\index{Cdf}
10918
\index{complementary CDF}
10919
10920
10921
\begin{figure}
10922
% survival.py
10923
\centerline{\includegraphics[height=2.5in]{figs/survival2.pdf}}
10924
\caption{Hazard function for age at first marriage (top) and
10925
survival curve (bottom).}
10926
\label{survival2}
10927
\end{figure}
10928
10929
Figure~\ref{survival2} (bottom) shows the result. The survival
10930
curve is steepest between 25 and 35, when most women get married.
10931
Between 35 and 45,
10932
the curve is nearly flat, indicating that women who do not marry
10933
before age 35 are unlikely to get married.
10934
10935
A curve like this was the basis of a famous magazine article in 1986;
{\it Newsweek\/} reported that a 40-year-old unmarried woman was ``more
likely to be killed by a terrorist'' than get married. These
statistics were widely reported and became part of popular culture,
but they were wrong then (because they were based on faulty analysis)
and turned out to be even more wrong (because of cultural changes that
were already in progress and continued). In 2006, {\it Newsweek\/} ran
another article admitting that they were wrong.
10943
\index{Newsweek}
10944
10945
I encourage you to read more about this article, the statistics it was
10946
based on, and the reaction. It should remind you of the ethical
10947
obligation to perform statistical analysis with care, interpret the
10948
results with appropriate skepticism, and present them to the public
10949
accurately and honestly.
10950
\index{ethics}
10951
10952
10953
\section{Confidence intervals}
10954
10955
Kaplan-Meier analysis yields a single estimate of the survival curve,
10956
but it is also important to quantify the uncertainty of the estimate.
10957
As usual, there are three possible sources of error: measurement
10958
error, sampling error, and modeling error.
10959
\index{confidence interval}
10960
\index{modeling error}
10961
\index{sampling error}
10962
10963
In this example, measurement error is probably small. People
10964
generally know when they were born, whether they've been married, and
10965
when. And they can be expected to report this information accurately.
10966
\index{measurement error}
10967
10968
We can quantify sampling error by resampling. Here's the code:
10969
\index{resampling}
10970
10971
\begin{verbatim}
def ResampleSurvival(resp, iters=101):
    low, high = resp.agemarry.min(), resp.agemarry.max()
    ts = np.arange(low, high, 1/12.0)

    ss_seq = []
    for i in range(iters):
        sample = thinkstats2.ResampleRowsWeighted(resp)
        hf, sf = EstimateSurvival(sample)
        ss_seq.append(sf.Probs(ts))

    low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])
    thinkplot.FillBetween(ts, low, high)
\end{verbatim}
10985
10986
{\tt ResampleSurvival} takes {\tt resp}, a DataFrame of respondents,
10987
and {\tt iters}, the number of times to resample. It computes {\tt
10988
ts}, which is the sequence of ages where we will evaluate the survival
10989
curves.
10990
\index{DataFrame}
10991
10992
Inside the loop, {\tt ResampleSurvival}:
10993
10994
\begin{itemize}
10995
10996
\item Resamples the respondents using {\tt ResampleRowsWeighted},
10997
which we saw in Section~\ref{weighted}.
10998
\index{weighted resampling}
10999
11000
\item Calls {\tt EstimateSurvival}, which uses the process in the
11001
previous sections to estimate the hazard and survival curves, and
11002
11003
\item Evaluates the survival curve at each age in {\tt ts}.
11004
11005
\end{itemize}
11006
11007
\verb"ss_seq" is a sequence of evaluated survival curves.
11008
{\tt PercentileRows} takes this sequence and computes the 5th and 95th
11009
percentiles, returning a 90\% confidence interval for the survival
11010
curve.
11011
\index{FillBetween}
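
{\tt EstimateSurvival}, which is called inside the loop, is not shown
here; based on the previous sections it is presumably a short wrapper
like this (a sketch, not necessarily identical to the version in
{\tt survival.py}):

\begin{verbatim}
def EstimateSurvival(resp):
    """Estimates hazard and survival curves for age at first marriage."""
    complete = resp[resp.evrmarry == 1].agemarry
    ongoing = resp[resp.evrmarry == 0].age

    hf = EstimateHazardFunction(complete, ongoing)
    sf = hf.MakeSurvival()

    return hf, sf
\end{verbatim}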
11012
11013
\begin{figure}
11014
% survival.py
11015
\centerline{\includegraphics[height=2.5in]{figs/survival3.pdf}}
11016
\caption{Survival curve for age at first marriage (dark line) and a 90\%
11017
confidence interval based on weighted resampling (gray line).}
11018
\label{survival3}
11019
\end{figure}
11020
11021
Figure~\ref{survival3} shows the result along with the survival
11022
curve we estimated in the previous section. The confidence
11023
interval takes into account the sampling weights, unlike the estimated
11024
curve. The discrepancy between them indicates that the sampling
11025
weights have a substantial effect on the estimate---we will have
11026
to keep that in mind.
11027
\index{confidence interval}
11028
\index{sampling weight}
11029
11030
11031
\section{Cohort effects}
11032
11033
One of the challenges of survival analysis is that different parts
11034
of the estimated curve are based on different groups of respondents.
11035
The part of the curve at time {\tt t} is based on respondents
11036
whose age was at least {\tt t} when they were interviewed.
11037
So the leftmost part of the curve includes data from all respondents,
11038
but the rightmost part includes only the oldest respondents.
11039
11040
If the relevant characteristics of the respondents are not changing
11041
over time, that's fine, but in this case it seems likely that marriage
11042
patterns are different for women born in different generations.
11043
We can investigate this effect by grouping respondents according
11044
to their decade of birth. Groups like this, defined by date of
11045
birth or similar events, are called {\bf cohorts}, and differences
11046
between the groups are called {\bf cohort effects}.
11047
\index{cohort}
11048
\index{cohort effect}
11049
11050
To investigate cohort effects in the NSFG marriage data, I gathered
11051
the Cycle 6 data from 2002 used throughout this book;
11052
the Cycle 7 data from 2006--2010 used in Section~\ref{replication};
11053
and the Cycle 5 data from 1995. In total these datasets include
11054
30,769 respondents.
11055
11056
\begin{verbatim}
11057
resp5 = ReadFemResp1995()
11058
resp6 = ReadFemResp2002()
11059
resp7 = ReadFemResp2010()
11060
resps = [resp5, resp6, resp7]
11061
\end{verbatim}
11062
11063
For each DataFrame, {\tt resp}, I use {\tt cmbirth} to compute the
11064
decade of birth for each respondent:
11065
\index{pandas}
11066
\index{DataFrame}
11067
11068
\begin{verbatim}
month0 = pandas.to_datetime('1899-12-15')
dates = [month0 + pandas.DateOffset(months=cm)
         for cm in resp.cmbirth]
resp['decade'] = (pandas.DatetimeIndex(dates).year - 1900) // 10
\end{verbatim}
11074
11075
{\tt cmbirth} is encoded as the integer number of months since
11076
December 1899; {\tt month0} represents that date as a Timestamp
11077
object. For each birth date, we instantiate a {\tt DateOffset} that
11078
contains the century-month and add it to {\tt month0}; the result
11079
is a sequence of Timestamps, which is converted to a {\tt
11080
DateTimeIndex}. Finally, we extract {\tt year} and compute
11081
decades.
11082
\index{DateTimeIndex}
11083
\index{Index}
11084
\index{century month}
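
Since {\tt cmbirth} is just a count of months, the same decades can
also be computed with integer arithmetic, without building Timestamps;
this one-liner is a sketch that should be equivalent, but it is not
the book's code:

\begin{verbatim}
# century-month 1 is January 1900, so (cmbirth - 1) // 12 is
# the number of whole years since 1900
resp['decade'] = (resp.cmbirth - 1) // 12 // 10
\end{verbatim}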
11085
11086
To take into account the sampling weights, and also to show
11087
variability due to sampling error, I resample the data,
11088
group respondents by decade, and plot survival curves:
11089
\index{resampling}
11090
\index{sampling error}
11091
11092
\begin{verbatim}
for i in range(iters):
    samples = [thinkstats2.ResampleRowsWeighted(resp)
               for resp in resps]
    sample = pandas.concat(samples, ignore_index=True)
    groups = sample.groupby('decade')

    EstimateSurvivalByDecade(groups, alpha=0.2)
\end{verbatim}
11101
11102
Data from the three NSFG cycles use different sampling weights,
11103
so I resample them separately and then use {\tt concat}
11104
to merge them into a single DataFrame. The parameter \verb"ignore_index"
11105
tells {\tt concat} not to match up respondents by index; instead
11106
it creates a new index from 0 to 30768.
11107
\index{pandas}
11108
\index{DataFrame}
11109
\index{groupby}
11110
11111
{\tt EstimateSurvivalByDecade} plots survival curves for each cohort:
11112
11113
\begin{verbatim}
def EstimateSurvivalByDecade(groups, **options):
    for name, group in groups:
        hf, sf = EstimateSurvival(group)
        thinkplot.Plot(sf, **options)
\end{verbatim}
11119
11120
\begin{figure}
11121
% survival.py
11122
\centerline{\includegraphics[height=2.5in]{figs/survival4.pdf}}
11123
\caption{Survival curves for respondents born during different decades.}
11124
\label{survival4}
11125
\end{figure}
11126
11127
Figure~\ref{survival4} shows the results. Several patterns are
11128
visible:
11129
11130
\begin{itemize}
11131
11132
\item Women born in the 50s married earliest, with successive
11133
cohorts marrying later and later, at least until age 30 or so.
11134
11135
\item Women born in the 60s follow a surprising pattern. Prior
11136
to age 25, they were marrying at slower rates than their predecessors.
11137
After age 25, they were marrying faster. By age 32 they had overtaken
11138
the 50s cohort, and at age 44 they are substantially more likely to
11139
have married.
11140
\index{marital status}
11141
11142
Women born in the 60s turned 25 between 1985 and 1995. Remembering
11143
that the {\it Newsweek\/} article I mentioned was published in 1986, it
11144
is tempting to imagine that the article triggered a marriage boom.
11145
That explanation would be too pat, but it is possible that the article
11146
and the reaction to it were indicative of a mood that affected the
11147
behavior of this cohort.
11148
\index{Newsweek}
11149
11150
\item The pattern of the 70s cohort is similar. They are less
11151
likely than their predecessors to be married before age 25, but
11152
at age 35 they have caught up with both of the previous cohorts.
11153
11154
\item Women born in the 80s are even less likely to marry before
11155
age 25. What happens after that is not clear; for more data, we
11156
have to wait for the next cycle of the NSFG.
11157
11158
\end{itemize}
11159
11160
In the meantime we can make some predictions.
11161
\index{prediction}
11162
11163
11164
\section{Extrapolation}
11165
11166
The survival curve for the 70s cohort ends at about age 38;
11167
for the 80s cohort it ends at age 28, and for the 90s cohort
11168
we hardly have any data at all.
11169
\index{extrapolation}
11170
11171
We can extrapolate these curves by ``borrowing'' data from the
11172
previous cohort. HazardFunction provides a method, {\tt Extend}, that
11173
copies the tail from another longer HazardFunction:
11174
\index{HazardFunction}
11175
11176
\begin{verbatim}
# class HazardFunction

    def Extend(self, other):
        last = self.series.index[-1]
        more = other.series[other.series.index > last]
        self.series = pandas.concat([self.series, more])
\end{verbatim}
11184
11185
As we saw in Section~\ref{hazard}, the HazardFunction contains a Series
11186
that maps from $t$ to $\lambda(t)$. {\tt Extend} finds {\tt last},
11187
which is the last index in {\tt self.series}, selects values from
11188
{\tt other} that come later than {\tt last}, and appends them
11189
onto {\tt self.series}.
11190
\index{pandas}
11191
\index{Series}
11192
11193
Now we can extend the HazardFunction for each cohort, using values
11194
from the predecessor:
11195
11196
\begin{verbatim}
def PlotPredictionsByDecade(groups):
    hfs = []
    for name, group in groups:
        hf, sf = EstimateSurvival(group)
        hfs.append(hf)

    thinkplot.PrePlot(len(hfs))
    for i, hf in enumerate(hfs):
        if i > 0:
            hf.Extend(hfs[i-1])
        sf = hf.MakeSurvival()
        thinkplot.Plot(sf)
\end{verbatim}
11210
11211
{\tt groups} is a GroupBy object with respondents grouped by decade of
11212
birth. The first loop computes the HazardFunction for each group.
11213
\index{groupby}
11214
11215
The second loop extends each HazardFunction with values from
11216
its predecessor, which might contain values from the previous
11217
group, and so on. Then it converts each HazardFunction to
11218
a SurvivalFunction and plots it.
11219
11220
\begin{figure}
11221
% survival.py
11222
\centerline{\includegraphics[height=2.5in]{figs/survival5.pdf}}
11223
\caption{Survival curves for respondents born during different decades,
11224
with predictions for the later cohorts.}
11225
\label{survival5}
11226
\end{figure}
11227
11228
Figure~\ref{survival5} shows the results; I've removed the 50s cohort
11229
to make the predictions more visible. These results suggest that by
11230
age 40, the most recent cohorts will converge with the 60s cohort,
11231
with fewer than 20\% never married.
11232
\index{visualization}
11233
11234
11235
\section{Expected remaining lifetime}
11236
11237
Given a survival curve, we can compute the expected remaining
11238
lifetime as a function of current age. For example, given the
11239
survival curve of pregnancy length from Section~\ref{survival},
11240
we can compute the expected time until delivery.
11241
\index{pregnancy length}
11242
11243
The first step is to extract the PMF of lifetimes. {\tt SurvivalFunction}
11244
provides a method that does that:
11245
11246
\begin{verbatim}
# class SurvivalFunction

    def MakePmf(self, filler=None):
        pmf = thinkstats2.Pmf()
        for val, prob in self.cdf.Items():
            pmf.Set(val, prob)

        cutoff = self.cdf.ps[-1]
        if filler is not None:
            pmf[filler] = 1-cutoff

        return pmf
\end{verbatim}
11260
11261
Remember that the SurvivalFunction contains the Cdf of lifetimes.
11262
The loop copies the values and probabilities from the Cdf into
11263
a Pmf.
11264
\index{Pmf}
11265
\index{Cdf}
11266
11267
{\tt cutoff} is the highest probability in the Cdf, which is 1
11268
if the Cdf is complete, and otherwise less than 1.
11269
If the Cdf is incomplete, we plug in the provided value, {\tt filler},
11270
to cap it off.
11271
11272
The Cdf of pregnancy lengths is complete, so we don't have to worry
11273
about this detail yet.
11274
\index{pregnancy length}
11275
11276
The next step is to compute the expected remaining lifetime, where
11277
``expected'' means average. {\tt SurvivalFunction}
11278
provides a method that does that, too:
11279
\index{expected remaining lifetime}
11280
11281
\begin{verbatim}
# class SurvivalFunction

    def RemainingLifetime(self, filler=None, func=thinkstats2.Pmf.Mean):
        pmf = self.MakePmf(filler=filler)
        d = {}
        for t in sorted(pmf.Values())[:-1]:
            pmf[t] = 0
            pmf.Normalize()
            d[t] = func(pmf) - t

        return pandas.Series(d)
\end{verbatim}
11294
11295
{\tt RemainingLifetime} takes {\tt filler}, which is passed along
11296
to {\tt MakePmf}, and {\tt func} which is the function used to
11297
summarize the distribution of remaining lifetimes.
11298
11299
{\tt pmf} is the Pmf of lifetimes extracted from the SurvivalFunction.
11300
{\tt d} is a dictionary that contains the results, a map from
11301
current age, {\tt t}, to expected remaining lifetime.
11302
\index{Pmf}
11303
11304
The loop iterates through the values in the Pmf. For each value
11305
of {\tt t} it computes the conditional distribution of lifetimes,
11306
given that the lifetime exceeds {\tt t}. It does that by removing
11307
values from the Pmf one at a time and renormalizing the remaining
11308
values.
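
For example (a toy distribution, not the pregnancy data): if lifetimes
are 1, 2, and 3 with probabilities 0.2, 0.3, and 0.5, conditioning on
surviving past $t=1$ leaves probabilities 0.375 and 0.625 for
lifetimes 2 and 3; their mean is 2.625, so the expected remaining
lifetime at $t=1$ is 1.625. At $t=2$ only lifetime 3 remains, so the
expected remaining lifetime is 1.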
11309
11310
Then it uses {\tt func} to summarize the conditional distribution.
11311
In this example the result is the mean pregnancy length, given that
11312
the length exceeds {\tt t}. By subtracting {\tt t} we get the
11313
mean remaining pregnancy length.
11314
\index{pregnancy length}
11315
11316
\begin{figure}
11317
% survival.py
11318
\centerline{\includegraphics[height=2.5in]{figs/survival6.pdf}}
11319
\caption{Expected remaining pregnancy length (left) and
11320
years until first marriage (right).}
11321
\label{survival6}
11322
\end{figure}
11323
11324
Figure~\ref{survival6} (left) shows the expected remaining pregnancy
11325
length as a function of the current duration. For example, during
11326
Week 0, the expected remaining duration is about 34 weeks. That's
11327
less than full term (39 weeks) because terminations of pregnancy
11328
in the first trimester bring the average down.
11329
\index{pregnancy length}
11330
11331
The curve drops slowly during the first trimester. After 13 weeks,
11332
the expected remaining lifetime has dropped by only 9 weeks, to
11333
25. After that the curve drops faster, by about a week per week.
11334
11335
Between Week 37 and 42, the curve levels off between 1 and 2 weeks.
11336
At any time during this period, the expected remaining lifetime is the
11337
same; with each week that passes, the destination gets no closer.
11338
Processes with this property are called {\bf memoryless} because
11339
the past has no effect on the predictions.
11340
This behavior is the mathematical basis of the infuriating mantra
11341
of obstetrics nurses: ``any day now.''
11342
\index{memoryless}
11343
11344
Figure~\ref{survival6} (right) shows the median remaining time until
11345
first marriage, as a function of age. For an 11-year-old girl, the
11346
median time until first marriage is about 14 years. The curve decreases
11347
until age 22 when the median remaining time is about 7 years.
11348
After that it increases again: by age 30 it is back where it started,
11349
at 14 years.
11350
11351
Based on this data, young women have decreasing remaining
11352
``lifetimes''. Mechanical components with this property are called {\bf NBUE}
11353
for ``new better than used in expectation,'' meaning that a new part is
11354
expected to last longer.
11355
\index{NBUE}
11356
11357
Women older than 22 have increasing remaining time until first
11358
marriage. Components with this property are called {\bf UBNE} for
11359
``used better than new in expectation.'' That is, the older the part,
11360
the longer it is expected to last. Newborns and cancer patients are
11361
also UBNE; their life expectancy increases the longer they live.
11362
\index{UBNE}
11363
11364
For this example I computed median, rather than mean, because the
11365
Cdf is incomplete; the survival curve projects that about 20\%
11366
of respondents will not marry before age 44. The age of
11367
first marriage for these women is unknown, and might be non-existent,
11368
so we can't compute a mean.
11369
\index{Cdf}
11370
\index{median}
11371
11372
I deal with these unknown values by replacing them with {\tt np.inf},
11373
a special value that represents infinity. That makes the mean
11374
infinity for all ages, but the median is well-defined as long as
11375
more than 50\% of the remaining lifetimes are finite, which is true
11376
until age 30. After that it is hard to define a meaningful
11377
expected remaining lifetime.
11378
\index{inf}
11379
11380
Here's the code that computes and plots these functions:
11381
11382
\begin{verbatim}
11383
rem_life1 = sf1.RemainingLifetime()
11384
thinkplot.Plot(rem_life1)
11385
11386
func = lambda pmf: pmf.Percentile(50)
11387
rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)
11388
thinkplot.Plot(rem_life2)
11389
\end{verbatim}
11390
11391
{\tt sf1} is the survival curve for pregnancy length;
11392
in this case we can use the default values for {\tt RemainingLifetime}.
11393
\index{pregnancy length}
11394
11395
{\tt sf2} is the survival curve for age at first marriage;
11396
{\tt func} is a function that takes a Pmf and computes its
11397
median (50th percentile).
11398
\index{Pmf}
11399
11400
11401
\section{Exercises}
11402
11403
My solution to this exercise is in \verb"chap13soln.py".
11404
11405
\begin{exercise}
11406
In NSFG Cycles 6 and 7, the variable {\tt cmdivorcx} contains the
11407
date of divorce for the respondent's first marriage, if applicable,
11408
encoded in century-months.
11409
\index{divorce}
11410
\index{marital status}
11411
11412
Compute the duration of marriages that have ended in divorce, and
11413
the duration, so far, of marriages that are ongoing. Estimate the
11414
hazard and survival curve for the duration of marriage.
11415
11416
Use resampling to take into account sampling weights, and plot
11417
data from several resamples to visualize sampling error.
11418
\index{resampling}
11419
11420
Consider dividing the respondents into groups by decade of birth,
11421
and possibly by age at first marriage.
11422
\index{groupby}
11423
11424
\end{exercise}
11425
11426
11427
\section{Glossary}
11428
11429
\begin{itemize}
11430
11431
\item survival analysis: A set of methods for describing and
11432
predicting lifetimes, or more generally time until an event occurs.
11433
\index{survival analysis}
11434
11435
\item survival curve: A function that maps from a time, $t$, to the
11436
probability of surviving past $t$.
11437
\index{survival curve}
11438
11439
\item hazard function: A function that maps from $t$ to the fraction
11440
of people alive until $t$ who die at $t$.
11441
\index{hazard function}
11442
11443
\item Kaplan-Meier estimation: An algorithm for estimating hazard and
11444
survival functions.
11445
\index{Kaplan-Meier estimation}
11446
11447
\item cohort: a group of subjects defined by an event, like date of
11448
birth, in a particular interval of time.
11449
\index{cohort}
11450
11451
\item cohort effect: a difference between cohorts.
11452
\index{cohort effect}
11453
11454
\item NBUE: A property of expected remaining lifetime, ``New
11455
better than used in expectation.''
11456
\index{NBUE}
11457
11458
\item UBNE: A property of expected remaining lifetime, ``Used
11459
better than new in expectation.''
11460
\index{UBNE}
11461
11462
\end{itemize}
11463
11464
11465
\chapter{Analytic methods}
11466
\label{analysis}
11467
11468
This book has focused on computational methods like simulation and
11469
resampling, but some of the problems we solved have
11470
analytic solutions that can be much faster.
11471
\index{resampling}
11472
\index{analytic methods}
11473
\index{computational methods}
11474
11475
I present some of these methods in this chapter, and explain
11476
how they work. At the end of the chapter, I make suggestions
11477
for integrating computational and analytic methods for exploratory
11478
data analysis.
11479
11480
The code in this chapter is in {\tt normal.py}. For information
11481
about downloading and working with this code, see Section~\ref{code}.
11482
11483
11484
\section{Normal distributions}
11485
\label{why_normal}
11486
\index{normal distribution}
11487
\index{distribution!normal}
11488
\index{Gaussian distribution}
11489
\index{distribution!Gaussian}
11490
11491
As a motivating example, let's review the problem from
11492
Section~\ref{gorilla}:
11493
\index{gorilla}
11494
11495
\begin{quotation}
11496
\noindent Suppose you are a scientist studying gorillas in a wildlife
11497
preserve. Having weighed 9 gorillas, you find sample mean $\xbar=90$ kg and
11498
sample standard deviation, $S=7.5$ kg. If you use $\xbar$ to estimate
11499
the population mean, what is the standard error of the estimate?
11500
\end{quotation}
11501
11502
To answer that question, we need the sampling
11503
distribution of $\xbar$. In Section~\ref{gorilla} we approximated
11504
this distribution by simulating the experiment (weighing
11505
9 gorillas), computing $\xbar$ for each simulated experiment, and
11506
accumulating the distribution of estimates.
11507
\index{standard error}
11508
\index{standard deviation}
11509
11510
The result is an approximation of the sampling distribution. Then we
11511
use the sampling distribution to compute standard errors and
11512
confidence intervals:
11513
\index{confidence interval}
11514
\index{sampling distribution}
11515
11516
\begin{enumerate}
11517
11518
\item The standard deviation of the sampling distribution is the
11519
standard error of the estimate; in the example, it is about
11520
2.5 kg.
11521
11522
\item The interval between the 5th and 95th percentile of the sampling
11523
distribution is a 90\% confidence interval. If we run the
11524
experiment many times, we expect the estimate to fall in this
11525
interval 90\% of the time. In the example, the 90\% CI is
11526
$(86, 94)$ kg.
11527
11528
\end{enumerate}
11529
11530
Now we'll do the same calculation analytically. We
11531
take advantage of the fact that the weights of adult female gorillas
11532
are roughly normally distributed. Normal distributions have two
11533
properties that make them amenable to analysis: they are ``closed'' under
11534
linear transformation and addition. To explain what that means, I
11535
need some notation. \index{analysis}
11536
\index{linear transformation}
11537
\index{addition, closed under}
11538
11539
If the distribution of a quantity, $X$, is
11540
normal with parameters $\mu$ and $\sigma$, you can write
11541
%
11542
\[ X \sim \normal~(\mu, \sigma^{2})\]
11543
%
11544
where the symbol $\sim$ means ``is distributed'' and the script letter
11545
$\normal$ stands for ``normal.''
11546
11547
%The other analytic distributions in this chapter are sometimes
11548
%written $\mathrm{Exponential}(\lambda)$, $\mathrm{Pareto}(x_m,
11549
%\alpha)$ and, for lognormal, $\mathrm{Log}-\normal~(\mu,
11550
%\sigma^2)$.
11551
11552
A linear transformation of $X$ is something like $X' = a X + b$, where
11553
$a$ and $b$ are real numbers.\index{linear transformation}
11554
A family of distributions is closed under
11555
linear transformation if $X'$ is in the same family as $X$. The normal
11556
distribution has this property; if $X \sim \normal~(\mu,
11557
\sigma^2)$,
11558
%
11559
\[ X' \sim \normal~(a \mu + b, a^{2} \sigma^2) \tag*{(1)} \]
11560
%
11561
Normal distributions are also closed under addition.
11562
If $Z = X + Y$ and
11563
$X \sim \normal~(\mu_{X}, \sigma_{X}^{2})$ and
11564
$Y \sim \normal~(\mu_{Y}, \sigma_{Y}^{2})$ then
11565
%
11566
\[ Z \sim \normal~(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2) \tag*{(2)}\]
11567
%
11568
In the special case $Z = X + X$, we have
11569
%
11570
\[ Z \sim \normal~(2 \mu_X, 2 \sigma_X^2) \]
11571
%
11572
and in general if we draw $n$ values of $X$ and add them up, we have
11573
%
11574
\[ Z \sim \normal~(n \mu_X, n \sigma_X^2) \tag*{(3)}\]
11575
11576
11577
\section{Sampling distributions}
11578
11579
Now we have everything we need to compute the sampling distribution of
11580
$\xbar$. Remember that we compute $\xbar$ by weighing $n$ gorillas,
11581
adding up the total weight, and dividing by $n$.
11582
\index{sampling distribution}
11583
\index{gorilla}
11584
\index{weight}
11585
11586
Assume that the distribution of gorilla weights, $X$, is
11587
approximately normal:
11588
%
11589
\[ X \sim \normal~(\mu, \sigma^2)\]
11590
%
11591
If we weigh $n$ gorillas, the total weight, $Y$, is distributed
11592
%
11593
\[ Y \sim \normal~(n \mu, n \sigma^2) \]
11594
%
11595
using Equation 3. And if we divide by $n$, the sample mean,
11596
$Z$, is distributed
11597
%
11598
\[ Z \sim \normal~(\mu, \sigma^2/n) \]
11599
%
11600
using Equation 1 with $a = 1/n$.
11601
11602
The distribution of $Z$ is the sampling distribution of $\xbar$.
11603
The mean of $Z$ is $\mu$, which shows that $\xbar$ is an unbiased
11604
estimate of $\mu$. The variance of the sampling distribution
11605
is $\sigma^2 / n$.
11606
\index{biased estimator}
11607
\index{estimator!biased}
11608
11609
So the standard deviation of the sampling distribution, which is the
11610
standard error of the estimate, is $\sigma / \sqrt{n}$. In the
11611
example, $\sigma$ is 7.5 kg and $n$ is 9, so the standard error is 2.5
11612
kg. That result is consistent with what we estimated by simulation,
11613
but much faster to compute!
11614
\index{standard error}
11615
\index{standard deviation}
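
As a quick check of the arithmetic:

\begin{verbatim}
>>> import math
>>> 7.5 / math.sqrt(9)
2.5
\end{verbatim}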
11616
11617
We can also use the sampling distribution to compute confidence
11618
intervals. A 90\% confidence interval for $\xbar$ is the interval
11619
between the 5th and 95th percentiles of $Z$. Since $Z$ is normally
11620
distributed, we can compute percentiles by evaluating the inverse
11621
CDF.
11622
\index{inverse CDF}
11623
\index{CDF, inverse}
11624
\index{confidence interval}
11625
11626
There is no closed form for the CDF of the normal distribution
11627
or its inverse, but there are fast numerical methods and they
11628
are implemented in SciPy, as we saw in Section~\ref{normal}.
11629
{\tt thinkstats2} provides a wrapper function that makes the
11630
SciPy function a little easier to use:
11631
\index{SciPy}
11632
\index{normal distribution}
11633
\index{wrapper}
11634
\index{closed form}
11635
11636
\begin{verbatim}
def EvalNormalCdfInverse(p, mu=0, sigma=1):
    return scipy.stats.norm.ppf(p, loc=mu, scale=sigma)
\end{verbatim}
11640
11641
Given a probability, {\tt p}, it returns the corresponding
11642
percentile from a normal distribution with parameters {\tt mu}
11643
and {\tt sigma}. For the 90\% confidence interval of $\xbar$,
11644
we compute the 5th and 95th percentiles like this:
11645
\index{percentile}
11646
11647
\begin{verbatim}
11648
>>> thinkstats2.EvalNormalCdfInverse(0.05, mu=90, sigma=2.5)
11649
85.888
11650
11651
>>> thinkstats2.EvalNormalCdfInverse(0.95, mu=90, sigma=2.5)
11652
94.112
11653
\end{verbatim}
11654
11655
So if we run the experiment many times, we expect the
11656
estimate, $\xbar$, to fall in the range $(85.9, 94.1)$ about
11657
90\% of the time. Again, this is consistent with the result
11658
we got by simulation.
11659
\index{simulation}
11660
11661
11662
\section{Representing normal distributions}
11663
11664
To make these calculations easier, I have defined a class called
11665
{\tt Normal} that represents a normal distribution and encodes
11666
the equations in the previous sections. Here's what it looks
11667
like:
11668
\index{Normal}
11669
11670
\begin{verbatim}
class Normal(object):

    def __init__(self, mu, sigma2):
        self.mu = mu
        self.sigma2 = sigma2

    def __str__(self):
        return 'N(%g, %g)' % (self.mu, self.sigma2)
\end{verbatim}
11680
11681
So we can instantiate a Normal that represents the distribution
11682
of gorilla weights:
11683
\index{gorilla}
11684
11685
\begin{verbatim}
11686
>>> dist = Normal(90, 7.5**2)
11687
>>> dist
11688
N(90, 56.25)
11689
\end{verbatim}
11690
11691
{\tt Normal} provides {\tt Sum}, which takes a sample size, {\tt n},
11692
and returns the distribution of the sum of {\tt n} values, using
11693
Equation 3:
11694
11695
\begin{verbatim}
    def Sum(self, n):
        return Normal(n * self.mu, n * self.sigma2)
\end{verbatim}
11699
11700
Normal also knows how to multiply and divide using
11701
Equation 1:
11702
11703
\begin{verbatim}
    def __mul__(self, factor):
        return Normal(factor * self.mu, factor**2 * self.sigma2)

    def __div__(self, divisor):
        return 1 / divisor * self
\end{verbatim}
11710
11711
So we can compute the sampling distribution of the mean with sample
11712
size 9:
11713
\index{sampling distribution}
11714
\index{sample size}
11715
11716
\begin{verbatim}
11717
>>> dist_xbar = dist.Sum(9) / 9
11718
>>> dist_xbar.sigma
11719
2.5
11720
\end{verbatim}
11721
11722
The standard deviation of the sampling distribution is 2.5 kg, as we
11723
saw in the previous section. Finally, Normal provides {\tt
11724
Percentile}, which we can use to compute a confidence interval:
11725
\index{standard deviation}
11726
\index{confidence interval}
11727
11728
\begin{verbatim}
11729
>>> dist_xbar.Percentile(5), dist_xbar.Percentile(95)
11730
85.888 94.113
11731
\end{verbatim}
11732
11733
And that's the same answer we got before. We'll use the Normal
11734
class again later, but before we go on, we need one more bit of
11735
analysis.
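
One detail: two members used above, {\tt sigma} and {\tt Percentile},
do not appear in the listings; presumably they are defined roughly
like this (a sketch, not a verbatim copy of {\tt normal.py}):

\begin{verbatim}
# class Normal

    @property
    def sigma(self):
        """Standard deviation: the square root of sigma2."""
        return math.sqrt(self.sigma2)

    def Percentile(self, percentile):
        """Returns the given percentile (0-100) of this distribution."""
        return thinkstats2.EvalNormalCdfInverse(percentile / 100.0,
                                                self.mu, self.sigma)
\end{verbatim}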
11736
11737
11738
\section{Central limit theorem}
11739
\label{CLT}
11740
11741
As we saw in the previous sections, if we add values drawn from normal
11742
distributions, the distribution of the sum is normal.
11743
Most other distributions don't have this property;
11744
if we add values drawn from other distributions, the sum does not
11745
generally have an analytic distribution.
11746
\index{sum}
11747
\index{normal distribution} \index{distribution!normal}
11748
\index{Gaussian distribution} \index{distribution!Gaussian}
11749
11750
But if we add up {\tt n} values from
11751
almost any distribution, the distribution of the sum converges to
11752
normal as {\tt n} increases.
11753
11754
More specifically, if the distribution of the values has mean and
11755
standard deviation $\mu$ and $\sigma$, the distribution of the sum is
11756
approximately $\normal(n \mu, n \sigma^2)$.
11757
\index{standard deviation}
11758
11759
This result is the Central Limit Theorem (CLT). It is one of the
11760
most useful tools for statistical analysis, but it comes with
11761
caveats:
11762
\index{Central Limit Theorem}
11763
\index{CLT}
11764
11765
\begin{itemize}
11766
11767
\item The values have to be drawn independently. If they are
11768
correlated, the CLT doesn't apply (although this is seldom a problem
11769
in practice).
11770
\index{independent}
11771
11772
\item The values have to come from the same distribution (although
11773
this requirement can be relaxed).
11774
\index{identical}
11775
11776
\item The values have to be drawn
11777
from a distribution with finite mean and variance. So most Pareto
11778
distributions are out.
11779
\index{mean}
11780
\index{variance}
11781
\index{Pareto distribution}
11782
\index{distribution!Pareto}
11783
\index{exponential distribution}
11784
\index{distribution!exponential}
11785
11786
\item The rate of convergence depends
11787
on the skewness of the distribution. Sums from an exponential
11788
distribution converge for small {\tt n}. Sums from a
11789
lognormal distribution require larger sizes.
11790
\index{lognormal distribution}
11791
\index{distribution!lognormal}
11792
\index{skewness}
11793
11794
\end{itemize}
11795
11796
The Central Limit Theorem explains the prevalence
11797
of normal distributions in the natural world. Many characteristics of
11798
living things are affected by genetic
11799
and environmental factors whose effect is additive. The characteristics
11800
we measure are the sum of a large number of small effects, so their
11801
distribution tends to be normal.
11802
\index{normal distribution}
11803
\index{distribution!normal}
11804
\index{Gaussian distribution}
11805
\index{distribution!Gaussian}
11806
\index{Central Limit Theorem}
11807
\index{CLT}
11808
11809
11810
\section{Testing the CLT}
11811
11812
To see how the Central Limit Theorem works, and when it doesn't,
11813
let's try some experiments. First, we'll try
11814
an exponential distribution:
11815
11816
\begin{verbatim}
def MakeExpoSamples(beta=2.0, iters=1000):
    samples = []
    for n in [1, 10, 100]:
        sample = [np.sum(np.random.exponential(beta, n))
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples
\end{verbatim}
11825
11826
{\tt MakeExpoSamples} generates samples of sums of exponential values
11827
(I use ``exponential values'' as shorthand for ``values from an
11828
exponential distribution'').
11829
{\tt beta} is the parameter of the distribution; {\tt iters}
11830
is the number of sums to generate.
11831
11832
To explain this function, I'll start from the inside and work my way
11833
out. Each time we call {\tt np.random.exponential}, we get a sequence
11834
of {\tt n} exponential values and compute its sum. {\tt sample}
11835
is a list of these sums, with length {\tt iters}.
11836
\index{NumPy}
11837
11838
It is easy to get {\tt n} and {\tt iters} confused: {\tt n} is the
11839
number of terms in each sum; {\tt iters} is the number of sums we
11840
compute in order to characterize the distribution of sums.
11841
11842
The return value is a list of {\tt (n, sample)} pairs. For
11843
each pair, we make a normal probability plot:
11844
\index{thinkplot}
11845
\index{normal probability plot}
11846
11847
\begin{verbatim}
def NormalPlotSamples(samples, plot=1, ylabel=''):
    for n, sample in samples:
        thinkplot.SubPlot(plot)
        thinkstats2.NormalProbabilityPlot(sample)

        thinkplot.Config(title='n=%d' % n, ylabel=ylabel)
        plot += 1
\end{verbatim}
11856
11857
{\tt NormalPlotSamples} takes the list of pairs from {\tt
11858
MakeExpoSamples} and generates a row of normal probability plots.
11859
\index{normal probability plot}
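
Putting the two functions together, the top row of
Figure~\ref{normal1} can be generated with something like this
(the axis label here is a guess, not the book's exact call):

\begin{verbatim}
samples = MakeExpoSamples()
NormalPlotSamples(samples, plot=1, ylabel='sum of expo values')
\end{verbatim}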
11860
11861
\begin{figure}
11862
% normal.py
11863
\centerline{\includegraphics[height=3.5in]{figs/normal1.pdf}}
11864
\caption{Distributions of sums of exponential values (top row) and
11865
lognormal values (bottom row).}
11866
\label{normal1}
11867
\end{figure}
11868
11869
Figure~\ref{normal1} (top row) shows
11870
the results. With {\tt n=1}, the distribution of the sum is still
11871
exponential, so the normal probability plot is not a straight line.
11872
But with {\tt n=10} the distribution of the sum is approximately
11873
normal, and with {\tt n=100} it is all but indistinguishable from
11874
normal.
11875
11876
Figure~\ref{normal1} (bottom row) shows similar results for a
11877
lognormal distribution. Lognormal distributions are generally more
11878
skewed than exponential distributions, so the distribution of sums
11879
takes longer to converge. With {\tt n=10} the normal
11880
probability plot is nowhere near straight, but with {\tt n=100}
11881
it is approximately normal.
11882
\index{lognormal distribution}
11883
\index{distribution!lognormal}
11884
\index{skewness}
11885
11886
\begin{figure}
11887
% normal.py
11888
\centerline{\includegraphics[height=3.5in]{figs/normal2.pdf}}
11889
\caption{Distributions of sums of Pareto values (top row) and
11890
correlated exponential values (bottom row).}
11891
\label{normal2}
11892
\end{figure}
11893
11894
Pareto distributions are even more skewed than lognormal. Depending
11895
on the parameters, many Pareto distributions do not have finite mean
11896
and variance. As a result, the Central Limit Theorem does not apply.
11897
Figure~\ref{normal2} (top row) shows distributions of sums of
11898
Pareto values. Even with {\tt n=100} the normal probability plot
11899
is far from straight.
11900
\index{Pareto distribution}
11901
\index{distribution!Pareto}
11902
\index{Central Limit Theorem}
11903
\index{CLT}
11904
\index{normal probability plot}
11905
11906
I also mentioned that CLT does not apply if the values are correlated.
11907
To test that, I generate correlated values from an exponential
11908
distribution. The algorithm for generating correlated values is
11909
(1) generate correlated normal values, (2) use the normal CDF
11910
to transform the values to uniform, and (3) use the inverse
11911
exponential CDF to transform the uniform values to exponential.
11912
\index{inverse CDF}
11913
\index{CDF, inverse}
11914
\index{correlation}
11915
\index{random number}
11916
11917
{\tt GenerateCorrelated} returns an iterator of {\tt n} normal values
11918
with serial correlation {\tt rho}:
11919
\index{iterator}
11920
11921
\begin{verbatim}
def GenerateCorrelated(rho, n):
    x = random.gauss(0, 1)
    yield x

    sigma = math.sqrt(1 - rho**2)
    for _ in range(n-1):
        x = random.gauss(x*rho, sigma)
        yield x
\end{verbatim}
11931
11932
The first value is a standard normal value. Each subsequent value
11933
depends on its predecessor: if the previous value is {\tt x}, the mean of
11934
the next value is {\tt x*rho}, with variance {\tt 1-rho**2}. Note that {\tt
11935
random.gauss} takes the standard deviation as the second argument,
11936
not variance.
11937
\index{standard deviation}
11938
\index{standard normal distribution}
11939
11940
{\tt GenerateExpoCorrelated}
11941
takes the resulting sequence and transforms it to exponential:
11942
11943
\begin{verbatim}
def GenerateExpoCorrelated(rho, n):
    normal = list(GenerateCorrelated(rho, n))
    uniform = scipy.stats.norm.cdf(normal)
    expo = scipy.stats.expon.ppf(uniform)
    return expo
\end{verbatim}
11950
11951
{\tt normal} is a list of correlated normal values. {\tt uniform}
11952
is a sequence of uniform values between 0 and 1. {\tt expo} is
11953
a correlated sequence of exponential values.
11954
{\tt ppf} stands for ``percent point function,'' which is another
11955
name for the inverse CDF.
11956
\index{inverse CDF}
11957
\index{CDF, inverse}
11958
\index{percent point function}
11959
11960
Figure~\ref{normal2} (bottom row) shows distributions of sums of
11961
correlated exponential values with {\tt rho=0.9}. The correlation
11962
slows the rate of convergence; nevertheless, with {\tt n=100} the
11963
normal probability plot is nearly straight. So even though CLT
11964
does not strictly apply when the values are correlated, moderate
11965
correlations are seldom a problem in practice.
11966
\index{normal probability plot}
11967
\index{correlation}
11968
11969
These experiments are meant to show how the Central Limit Theorem
11970
works, and what happens when it doesn't. Now let's see how we can
11971
use it.
11972
11973
11974
\section{Applying the CLT}
11975
\label{usingCLT}
11976
11977
To see why the Central Limit Theorem is useful, let's get back
11978
to the example in Section~\ref{testdiff}: testing the apparent
11979
difference in mean pregnancy length for first babies and others.
11980
As we've seen, the apparent difference is about
11981
0.078 weeks:
11982
\index{pregnancy length}
11983
\index{Central Limit Theorem}
11984
\index{CLT}
11985
11986
\begin{verbatim}
11987
>>> live, firsts, others = first.MakeFrames()
11988
>>> delta = firsts.prglngth.mean() - others.prglngth.mean()
11989
0.078
11990
\end{verbatim}
11991
11992
Remember the logic of hypothesis testing: we compute a p-value, which
11993
is the probability of the observed difference under the null
11994
hypothesis; if it is small, we conclude that the observed difference
11995
is unlikely to be due to chance.
11996
\index{p-value}
11997
\index{null hypothesis}
11998
\index{hypothesis testing}
11999
12000
In this example, the null hypothesis is that the distribution of
12001
pregnancy lengths is the same for first babies and others.
12002
So we can compute the sampling distribution of the mean
12003
like this:
12004
\index{sampling distribution}
12005
12006
\begin{verbatim}
12007
dist1 = SamplingDistMean(live.prglngth, len(firsts))
12008
dist2 = SamplingDistMean(live.prglngth, len(others))
12009
\end{verbatim}
12010
12011
Both sampling distributions are based on the same population, which is
12012
the pool of all live births. {\tt SamplingDistMean} takes this
12013
sequence of values and the sample size, and returns a Normal object
12014
representing the sampling distribution:
12015
12016
\begin{verbatim}
def SamplingDistMean(data, n):
    mean, var = data.mean(), data.var()
    dist = Normal(mean, var)
    return dist.Sum(n) / n
\end{verbatim}
12022
12023
{\tt mean} and {\tt var} are the mean and variance of
12024
{\tt data}. We approximate the distribution of the data with
12025
a normal distribution, {\tt dist}.
12026
12027
In this example, the data are not normally distributed, so this
12028
approximation is not very good. But then we compute {\tt dist.Sum(n)
12029
/ n}, which is the sampling distribution of the mean of {\tt n}
12030
values. Even if the data are not normally distributed, the sampling
12031
distribution of the mean is, by the Central Limit Theorem.
12032
\index{Central Limit Theorem}
12033
\index{CLT}
12034
12035
Next, we compute the sampling distribution of the difference
12036
in the means. The {\tt Normal} class knows how to perform
12037
subtraction using Equation 2:
12038
\index{Normal}
12039
12040
\begin{verbatim}
    def __sub__(self, other):
        return Normal(self.mu - other.mu,
                      self.sigma2 + other.sigma2)
\end{verbatim}
12045
12046
So we can compute the sampling distribution of the difference like this:
12047
12048
\begin{verbatim}
12049
>>> dist = dist1 - dist2
12050
N(0, 0.0032)
12051
\end{verbatim}
12052
12053
The mean is 0, which makes sense because we expect two samples from
12054
the same distribution to have the same mean, on average. The variance
12055
of the sampling distribution is 0.0032.
12056
\index{sampling distribution}
12057
12058
{\tt Normal} provides {\tt Prob}, which evaluates the normal CDF.
12059
We can use {\tt Prob} to compute the probability of a
12060
difference as large as {\tt delta} under the null hypothesis:
12061
\index{null hypothesis}
12062
12063
\begin{verbatim}
12064
>>> 1 - dist.Prob(delta)
12065
0.084
12066
\end{verbatim}
12067
12068
Which means that the p-value for a one-sided test is 0.084. For
a two-sided test we would also compute
12070
\index{p-value}
12071
\index{one-sided test}
12072
\index{two-sided test}
12073
12074
\begin{verbatim}
12075
>>> dist.Prob(-delta)
12076
0.084
12077
\end{verbatim}
12078
12079
Which is the same because the normal distribution is symmetric.
12080
The sum of the tails is 0.168, which is consistent with the estimate
12081
in Section~\ref{testdiff}, which was 0.17.
12082
\index{symmetric}


\section{Correlation test}

In Section~\ref{corrtest} we used a permutation test for the correlation
between birth weight and mother's age, and found that it is
statistically significant, with p-value less than 0.001.
\index{p-value}
\index{birth weight}
\index{weight!birth}
\index{permutation}
\index{significant} \index{statistically significant}

Now we can do the same thing analytically. The method is based
on this mathematical result: given two variables that are normally distributed
and uncorrelated, if we generate a sample with size $n$,
compute Pearson's correlation, $r$, and then compute the transformed
correlation
%
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
%
the distribution of $t$ is Student's t-distribution with parameter
$n-2$. The t-distribution is an analytic distribution; the CDF can
be computed efficiently using gamma functions.
\index{Pearson coefficient of correlation}
\index{correlation}

We can use this result to compute the sampling distribution of
correlation under the null hypothesis; that is, if we generate
uncorrelated sequences of normal values, what is the distribution of
their correlation? {\tt StudentCdf} takes the sample size, {\tt n}, and
returns the sampling distribution of correlation:
\index{null hypothesis}
\index{sampling distribution}

\begin{verbatim}
def StudentCdf(n):
    ts = np.linspace(-3, 3, 101)
    ps = scipy.stats.t.cdf(ts, df=n-2)
    rs = ts / np.sqrt(n - 2 + ts**2)
    return thinkstats2.Cdf(rs, ps)
\end{verbatim}

{\tt ts} is a NumPy array of values for $t$, the transformed
correlation. {\tt ps} contains the corresponding probabilities,
computed using the CDF of the Student's t-distribution implemented in
SciPy. The parameter of the t-distribution, {\tt df}, stands for
``degrees of freedom.'' I won't explain that term, but you can read
about it at
\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.
\index{NumPy}
\index{SciPy}
\index{Student's t-distribution}
\index{distribution!Student's t}
\index{degrees of freedom}

\begin{figure}
% normal.py
\centerline{\includegraphics[height=2.5in]{figs/normal4.pdf}}
\caption{Sampling distribution of correlations for uncorrelated
normal variables.}
\label{normal4}
\end{figure}

To get from {\tt ts} to the correlation coefficients, {\tt rs},
we apply the inverse transform,
%
\[ r = t / \sqrt{n - 2 + t^2} \]
%
The result is the sampling distribution of $r$ under the null hypothesis.
Figure~\ref{normal4} shows this distribution along with the distribution
we generated in Section~\ref{corrtest} by resampling. They are nearly
identical. Although the underlying data are not normally distributed,
Pearson's coefficient of correlation is based on sample means
and variances. By the Central Limit Theorem, these moment-based
statistics are normally distributed even if the data are not.
\index{Central Limit Theorem}
\index{CLT}
\index{null hypothesis}
\index{resampling}
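
If you want to check this result directly, one way (a sketch, not the
code used to generate Figure~\ref{normal4}) is to generate many pairs
of uncorrelated normal samples and compute Pearson's correlation for
each pair:

\begin{verbatim}
import numpy as np

def SimulateNullCorrelations(n, iters=1000):
    """Generates correlations of uncorrelated normal samples of size n."""
    rs = []
    for _ in range(iters):
        xs = np.random.normal(size=n)
        ys = np.random.normal(size=n)
        rs.append(np.corrcoef(xs, ys)[0][1])
    return rs
\end{verbatim}

The CDF of these values should be close to the analytic curve
returned by {\tt StudentCdf}.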

From Figure~\ref{normal4}, we can see that the
observed correlation, 0.07, is unlikely to occur if the variables
are actually uncorrelated.
Using the analytic distribution, we can compute just how unlikely:
\index{analytic distribution}

\begin{verbatim}
t = r * math.sqrt((n-2) / (1-r**2))
p_value = 1 - scipy.stats.t.cdf(t, df=n-2)
\end{verbatim}

We compute the value of {\tt t} that corresponds to {\tt r=0.07}, and
then evaluate the complement of the Student's t CDF at {\tt t}. The
result is {\tt 2.9e-11}. This example demonstrates an advantage of
the analytic method: we can compute very small p-values. But in
practice it usually doesn't matter.
\index{SciPy}
\index{p-value}
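
As a cross-check, SciPy can compute Pearson's correlation and a
two-sided p-value in one call; with a symmetric null distribution the
two-sided value should be about twice the one-sided value above. In
this sketch, {\tt ages} and {\tt weights} stand for the cleaned
sequences used in Section~\ref{corrtest}:

\begin{verbatim}
import scipy.stats

r, p_two_sided = scipy.stats.pearsonr(ages, weights)
\end{verbatim}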


\section{Chi-squared test}

In Section~\ref{casino2} we used the chi-squared statistic to
test whether a die is crooked. The chi-squared statistic measures
the total normalized deviation from the expected values in a table:
%
\[ \goodchi^2 = \sum_i \frac{{(O_i - E_i)}^2}{E_i} \]
%
One reason the chi-squared statistic is widely used is that
its sampling distribution under the null hypothesis is analytic;
by a remarkable coincidence\footnote{Not really.}, it is called
the chi-squared distribution. Like the t-distribution, the
chi-squared CDF can be computed efficiently using gamma functions.
\index{deviation}
\index{null hypothesis}
\index{sampling distribution}
\index{chi-squared test}
\index{chi-squared distribution}
\index{distribution!chi-squared}
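
In code, the statistic is a one-liner. Here is a sketch that
evaluates it for a sequence of observed counts and the corresponding
expected counts (the counts shown are for illustration; with 60 rolls
of a fair die we expect 10 of each outcome, and for these counts the
statistic is about 11.6):

\begin{verbatim}
import numpy as np

def ChiSquared(observed, expected):
    """Computes the chi-squared statistic."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return sum((observed - expected)**2 / expected)

observed = [8, 9, 19, 5, 8, 11]
expected = np.ones(6) * 10
chi2 = ChiSquared(observed, expected)
\end{verbatim}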

\begin{figure}
% normal.py
\centerline{\includegraphics[height=2.5in]{figs/normal5.pdf}}
\caption{Sampling distribution of chi-squared statistics for
a fair six-sided die.}
\label{normal5}
\end{figure}

SciPy provides an implementation of the chi-squared distribution,
which we use to compute the sampling distribution of the
chi-squared statistic:
\index{SciPy}

\begin{verbatim}
def ChiSquaredCdf(n):
    xs = np.linspace(0, 25, 101)
    ps = scipy.stats.chi2.cdf(xs, df=n-1)
    return thinkstats2.Cdf(xs, ps)
\end{verbatim}

Figure~\ref{normal5} shows the analytic result along with the
distribution we got by resampling. They are very similar,
especially in the tail, which is the part we usually care most
about.
\index{resampling}
\index{tail}

We can use this distribution to compute the p-value of the
observed test statistic, {\tt chi2}:
\index{test statistic}
\index{p-value}

\begin{verbatim}
p_value = 1 - scipy.stats.chi2.cdf(chi2, df=n-1)
\end{verbatim}

The result is 0.041, which is consistent with the result
from Section~\ref{casino2}.

The parameter of the chi-squared distribution is ``degrees of
freedom'' again. In this case the correct parameter is {\tt n-1},
where {\tt n} is the size of the table, 6. Choosing this parameter
can be tricky; to be honest, I am never confident that I have it
right until I generate something like Figure~\ref{normal5} to compare
the analytic results to the resampling results.
\index{degrees of freedom}
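
One practical check: {\tt scipy.stats.chisquare} computes the
statistic and the p-value in a single call, using
{\tt len(observed)-1} degrees of freedom by default, so its result
should agree with the computation above:

\begin{verbatim}
import scipy.stats

stat, p_value = scipy.stats.chisquare(observed, expected)
\end{verbatim}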


\section{Discussion}

This book focuses on computational methods like resampling and
permutation. These methods have several advantages over analytic
methods:
\index{resampling}
\index{permutation}
\index{computational methods}

\begin{itemize}

\item They are easier to explain and understand. For example, one of
the most difficult topics in an introductory statistics class is
hypothesis testing. Many students don't really understand what
p-values are. I think the approach I presented in
Chapter~\ref{testing}---simulating the null hypothesis and
computing test statistics---makes the fundamental idea clearer.
\index{p-value}
\index{null hypothesis}

\item They are robust and versatile. Analytic methods are often based
on assumptions that might not hold in practice. Computational
methods require fewer assumptions, and can be adapted and extended
more easily.
\index{robust}

\item They are debuggable. Analytic methods are often like a black
box: you plug in numbers and they spit out results. But it's easy
to make subtle errors, hard to be confident that the results are
right, and hard to find the problem if they are not. Computational
methods lend themselves to incremental development and testing,
which fosters confidence in the results.
\index{debugging}

\end{itemize}

But there is one drawback: computational methods can be slow. Taking
into account these pros and cons, I recommend the following process:

\begin{enumerate}

\item Use computational methods during exploration. If you find a
satisfactory answer and the run time is acceptable, you can stop.
\index{exploration}

\item If run time is not acceptable, look for opportunities to
optimize. Using analytic methods is one of several ways to
optimize.

\item If replacing a computational method with an analytic method is
appropriate, use the computational method as a basis of comparison,
providing mutual validation between the computational and
analytic results.
\index{model}

\end{enumerate}

For the vast majority of problems I have worked on, I didn't have
to go past Step 1.


\section{Exercises}

A solution to these exercises is in \verb"chap14soln.py".

\begin{exercise}
\label{log_clt}
In Section~\ref{lognormal}, we saw that the distribution
of adult weights is approximately lognormal. One possible
explanation is that the weight a person
gains each year is proportional to their current weight.
In that case, adult weight is the product of a large number
of multiplicative factors:
%
\[ w = w_0 f_1 f_2 \ldots f_n \]
%
where $w$ is adult weight, $w_0$ is birth weight, and $f_i$
is the weight gain factor for year $i$.
\index{birth weight}
\index{weight!birth}
\index{lognormal distribution}
\index{distribution!lognormal}
\index{adult weight}

The log of a product is the sum of the logs of the
factors:
%
\[ \log w = \log w_0 + \log f_1 + \log f_2 + \cdots + \log f_n \]
%
So by the Central Limit Theorem, the distribution of $\log w$ is
approximately normal for large $n$, which implies that the
distribution of $w$ is lognormal.
\index{Central Limit Theorem}
\index{CLT}

To model this phenomenon, choose a distribution for $f$ that seems
reasonable, then generate a sample of adult weights by choosing a
random value from the distribution of birth weights, choosing a
sequence of factors from the distribution of $f$, and computing the
product. What value of $n$ is needed to converge to a lognormal
distribution?
\index{model}

\index{logarithm}
\index{product}
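
If you want a starting point (not a full solution; see
\verb"chap14soln.py"), here is one way to generate a single simulated
adult weight. The distribution of yearly factors is a placeholder
that you should replace with your own choice:

\begin{verbatim}
import numpy as np

def GenerateAdultWeight(birth_weights, n):
    """Generates a simulated adult weight after n years of growth.

    birth_weights: sequence of birth weights to sample from
    n: number of years of growth
    """
    bw = np.random.choice(birth_weights)
    # placeholder: yearly factors clustered a little above 1
    factors = np.random.normal(1.09, 0.03, n)
    return bw * np.prod(factors)
\end{verbatim}

Generating many weights, taking their logs, and checking a normal
probability plot is one way to see how quickly the distribution
converges.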

\end{exercise}


\begin{exercise}
In Section~\ref{usingCLT} we used the Central Limit Theorem to find
the sampling distribution of the difference in means, $\delta$, under
the null hypothesis that both samples are drawn from the same
population.
\index{null hypothesis}
\index{sampling distribution}

We can also use this distribution to find the standard error of the
estimate and confidence intervals, but that would only be
approximately correct. To be more precise, we should compute the
sampling distribution of $\delta$ under the alternate hypothesis that
the samples are drawn from different populations.
\index{standard error}
\index{standard deviation}
\index{confidence interval}

Compute this distribution and use it to calculate the standard error
and a 90\% confidence interval for the difference in means.
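
One way to start (a sketch, not the full solution in
\verb"chap14soln.py") is to estimate each group's sampling
distribution from its own sample, then subtract:

\begin{verbatim}
dist1 = SamplingDistMean(firsts.prglngth, len(firsts))
dist2 = SamplingDistMean(others.prglngth, len(others))
dist_diff = dist1 - dist2
\end{verbatim}

The standard error is the square root of {\tt dist\_diff.sigma2}.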

\end{exercise}


\begin{exercise}
In a recent paper\footnote{``Evidence for the persistent effects of an
intervention to mitigate gender-stereotypical task allocation within
student engineering teams,'' Proceedings of the IEEE Frontiers in Education
Conference, 2014.}, Stein et al.~investigate the
effects of an intervention intended to mitigate gender-stereotypical
task allocation within student engineering teams.

Before and after the intervention, students responded to a survey that
asked them to rate their contribution to each aspect of class projects on
a 7-point scale.

Before the intervention, male students reported higher scores for the
programming aspect of the project than female students; on average men
reported a score of 3.57 with standard error 0.28. Women reported
1.91, on average, with standard error 0.32.
\index{standard error}

Compute the sampling distribution of the gender gap (the difference in
means), and test whether it is statistically significant. Because you
are given standard errors for the estimated means, you don't need to
know the sample size to figure out the sampling distributions.
\index{significant} \index{statistically significant}
\index{sampling distribution}
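
As a hint, the sampling distribution of each estimated mean can be
represented as a normal distribution whose variance is the squared
standard error. For the scores before the intervention, that might
look like this (a sketch, not the full solution):

\begin{verbatim}
dist_male = Normal(3.57, 0.28**2)
dist_female = Normal(1.91, 0.32**2)
dist_gap = dist_male - dist_female
\end{verbatim}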

After the intervention, the gender gap was smaller: the average score
for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE
0.16). Again, compute the sampling distribution of the gender gap and
test it.
\index{gender gap}

Finally, estimate the change in gender gap; what is the sampling
distribution of this change, and is it statistically significant?
\index{significant} \index{statistically significant}
\end{exercise}

\cleardoublepage
\phantomsection
\addcontentsline{toc}{chapter}{\indexname}%
\printindex

\clearemptydoublepage
%\blankpage
%\blankpage
%\blankpage

\end{document}