% LaTeX source for ``Think Stats:
% Exploratory data analysis in Python''
% Copyright 2014  Allen B. Downey.

% License: Creative Commons
% Attribution-NonCommercial-ShareAlike 4.0 International
% http://creativecommons.org/licenses/by-nc-sa/4.0/
%

%\documentclass[10pt,b5paper]{book}
\documentclass[12pt]{book}

%\usepackage[width=5.5in,height=8.5in,
%  hmarginratio=3:2,vmarginratio=1:1]{geometry}

% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)

%\usepackage[T1]{fontenc}
%\usepackage{textcomp}
%\usepackage{mathpazo}
%\usepackage{pslatex}

\usepackage{url}
\usepackage{hyperref}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{amsmath}
\usepackage{amsthm}
%\usepackage{amssymb}
\usepackage{makeidx}
\usepackage{setspace}
\usepackage{hevea}
\usepackage{upquote}

\title{Think Stats}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Stats}
\newcommand{\thesubtitle}{Exploratory Data Analysis in Python}
\newcommand{\theversion}{2.0.38}

% these styles get translated in CSS for the HTML version
\newstyle{a:link}{color:black;}
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
\newstyle{img}{border:0px}

% change the arrows in the HTML version
\setlinkstext
  {\imgsrc[ALT="Previous"]{back.png}}
  {\imgsrc[ALT="Up"]{up.png}}
  {\imgsrc[ALT="Next"]{next.png}}

\makeindex

\newif\ifplastex
\plastexfalse

\begin{document}

\frontmatter

\newcommand{\Erdos}{Erd\H{o}s}
\newcommand{\nhat}{\hat{N}}
\newcommand{\eps}{\varepsilon}
\newcommand{\slope}{\mathrm{slope}}
\newcommand{\inter}{\mathrm{inter}}
\newcommand{\xs}{\mathrm{xs}}
\newcommand{\ys}{\mathrm{ys}}
\newcommand{\res}{\mathrm{res}}
\newcommand{\xbar}{\bar{x}}
\newcommand{\ybar}{\bar{y}}
\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}
\newcommand{\Prob}{\mathrm{P}}
\newcommand{\Corr}{\mathrm{Corr}}
\newcommand{\normal}{\mathcal{N}}
\newcommand{\given}{|}
%\newcommand{\goodchi}{\protect\raisebox{2pt}{$\chi$}}
\newcommand{\goodchi}{\chi}

\ifplastex
    \usepackage{localdef}
    \maketitle

\newcount\anchorcnt
\newcommand*{\Anchor}[1]{%
  \@bsphack%
    \Hy@GlobalStepCount\anchorcnt%
    \edef\@currentHref{anchor.\the\anchorcnt}%
    \Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}%
    \M@gettitle{}\label{#1}%
    \@esphack%
}


\else

%%% EXERCISE

\newtheoremstyle{exercise}% name of the style to be used
  {\topsep}% measure of space to leave above the theorem. E.g.: 3pt
  {\topsep}% measure of space to leave below the theorem. E.g.: 3pt
  {}% name of font to use in the body of the theorem
  {}% measure of space to indent
  {\bfseries}% name of head font
  {}% punctuation between head and body
  { }% space after theorem head; " " = normal interword space
  {}% Manually specify head

\theoremstyle{exercise}
\newtheorem{exercise}{Exercise}[chapter]

%\newcounter{exercise}[chapter]
%\newcommand{\nextexercise}{\refstepcounter{exercise}}

%\newenvironment{exercise}{\nextexercise \noindent \textbf{Exercise \thechapter.\theexercise} \begin{itshape} \noindent}{\end{itshape}}

\input{latexonly}

\begin{latexonly}

\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}

%\blankpage
%\blankpage

% TITLE PAGES FOR LATEX VERSION

%-half title--------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle}\\
{\Large \thesubtitle }
\end{spacing}

\vspace{0.25in}

Version \theversion

\vfill

\end{flushright}

%--verso------------------------------------------------------

\blankpage
\blankpage
%\clearemptydoublepage
%\pagebreak
%\thispagestyle{empty}
%\vspace*{6in}

%--title page--------------------------------------------------
\pagebreak
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle}\\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vspace{1in}


{\Large
Allen B. Downey\\
}


\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

%\includegraphics[width=1in]{figs/logo1.eps}
\vfill

\end{flushright}


%--copyright--------------------------------------------------
\pagebreak
\thispagestyle{empty}

{\small
Copyright \copyright ~2014 Allen B. Downey.


\vspace{0.2in}

\begin{flushleft}
Green Tea Press       \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License, which
is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

The original form of this book is \LaTeX\ source code.  Compiling this
code has the effect of generating a device-independent representation
of a textbook, which can be converted to other formats and printed.

The \LaTeX\ source for this book is available from
\url{http://thinkstats2.com}.

\vspace{0.2in}

} % end small

\end{latexonly}


% HTMLONLY

\begin{htmlonly}

% TITLE PAGE FOR HTML VERSION

{\Large \thetitle: \thesubtitle}

{\large Allen B. Downey}

Version \theversion

\vspace{0.25in}

Copyright 2014 Allen B. Downey

\vspace{0.25in}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License, which is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

\fi
% END OF THE PART WE SKIP FOR PLASTEX

\chapter{Preface}
\label{preface}

This book is an
introduction to the practical tools of exploratory data analysis.
The organization of the book follows the process I use
when I start working with a dataset:

\begin{itemize}

\item Importing and cleaning: Whatever format the data is in, it
  usually takes some time and effort to read the data, clean and
  transform it, and check that everything made it through the
  translation process intact.
\index{cleaning}

\item Single variable explorations: I usually start by examining one
  variable at a time, finding out what the variables mean, looking
  at distributions of the values, and choosing appropriate
  summary statistics.
\index{distribution}

\item Pair-wise explorations: To identify possible relationships
  between variables, I look at tables and scatter plots, and compute
  correlations and linear fits.
\index{correlation}
\index{linear fit}

\item Multivariate analysis: If there are apparent relationships
  between variables, I use multiple regression to add control variables
  and investigate more complex relationships.
\index{multiple regression}
\index{control variable}

\item Estimation and hypothesis testing: When reporting statistical
  results, it is important to answer three questions: How big is
  the effect?  How much variability should we expect if we run the same
  measurement again?  Is it possible that the apparent effect is
  due to chance?
\index{estimation}
\index{hypothesis testing}

\item Visualization: During exploration, visualization is an important
  tool for finding possible relationships and effects.  Then if an
  apparent effect holds up to scrutiny, visualization is an effective
  way to communicate results.
\index{visualization}

\end{itemize}

This book takes a computational approach, which has several
advantages over mathematical approaches:
\index{computational methods}

\begin{itemize}

\item I present most ideas using Python code, rather than
  mathematical notation.  In general, Python code is more readable;
  also, because it is executable, readers can download it, run it,
  and modify it.

\item Each chapter includes exercises readers can do to develop
  and solidify their learning.  When you write programs, you
  express your understanding in code; while you are debugging the
  program, you are also correcting your understanding.
\index{debugging}

\item Some exercises involve experiments to test statistical
  behavior.  For example, you can explore the Central Limit Theorem
  (CLT) by generating random samples and computing their sums.  The
  resulting visualizations demonstrate why the CLT works and when
  it doesn't.
\index{Central Limit Theorem}
\index{CLT}

\item Some ideas that are hard to grasp mathematically are easy to
  understand by simulation.  For example, we approximate p-values by
  running random simulations, which reinforces the meaning of the
  p-value.
\index{p-value}

\item Because the book is based on a general-purpose programming
  language (Python), readers can import data from almost any source.
  They are not limited to datasets that have been cleaned and
  formatted for a particular statistics tool.

\end{itemize}

The book lends itself to a project-based approach.  In my class,
students work on a semester-long project that requires them to pose a
statistical question, find a dataset that can address it, and apply
each of the techniques they learn to their own data.

To demonstrate my approach to statistical analysis, the book
presents a case study that runs through all of the chapters.  It uses
data from two sources:

\begin{itemize}

\item The National Survey of Family Growth (NSFG), conducted by the
  U.S. Centers for Disease Control and Prevention (CDC) to gather
  ``information on family life, marriage and divorce, pregnancy,
  infertility, use of contraception, and men's and women's health.''
  (See \url{http://cdc.gov/nchs/nsfg.htm}.)

\item The Behavioral Risk Factor Surveillance System (BRFSS),
  conducted by the National Center for Chronic Disease Prevention and
  Health Promotion to ``track health conditions and risk behaviors in
  the United States.''  (See \url{http://cdc.gov/BRFSS/}.)

\end{itemize}

Other examples use data from the IRS, the U.S. Census, and
the Boston Marathon.

This second edition of {\it Think Stats\/} includes the chapters from
the first edition, many of them substantially revised, and new
chapters on regression, time series analysis, survival analysis,
and analytic methods.  The previous edition did not use pandas,
SciPy, or StatsModels, so all of that material is new.


\section{How I wrote this book}

When people write a new textbook, they usually start by
reading a stack of old textbooks.  As a result, most books
contain the same material in pretty much the same order.

I did not do that.  In fact, I used almost no printed material while I
was writing this book, for several reasons:

\begin{itemize}

\item My goal was to explore a new approach to this material, so I didn't
want much exposure to existing approaches.

\item Since I am making this book available under a free license, I wanted
to make sure that no part of it was encumbered by copyright restrictions.

\item Many readers of my books don't have access to libraries of
printed material, so I tried to make references to resources that are
freely available on the Internet.

\item Some proponents of old media think that the exclusive
use of electronic resources is lazy and unreliable.  They might be right
about the first part, but I think they are wrong about the second, so
I wanted to test my theory.

% http://www.ala.org/ala/mgrps/rts/nmrt/news/footnotes/may2010/in_defense_of_wikipedia_bonnett.cfm

\end{itemize}

The resource I used more than any other is Wikipedia.  In general, the
articles I read on statistical topics were very good (although I made
a few small changes along the way).  I include references to Wikipedia
pages throughout the book and I encourage you to follow those links;
in many cases, the Wikipedia page picks up where my description leaves
off.  The vocabulary and notation in this book are generally
consistent with Wikipedia, unless I had a good reason to deviate.
Other resources I found useful were Wolfram MathWorld and
the Reddit statistics forum, \url{http://www.reddit.com/r/statistics}.


\section{Using the code}
\label{code}

The code and data used in this book are available from
\url{https://github.com/AllenDowney/ThinkStats2}.  Git is a version
control system that allows you to keep track of the files that
make up a project.  A collection of files under Git's control is
called a {\bf repository}.  GitHub is a hosting service that provides
storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to
work with the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button.  If you don't already
have a GitHub account, you'll need to create one.  After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book.  Then you can
clone the repo, which means that you make a copy of the files
on your computer.
\index{fork}

\item Or you could clone
my repository.  You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.

\end{itemize}

All of the code is written to work in both Python 2 and Python 3
with no translation.

I developed this book using Anaconda from
Continuum Analytics, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install.  By default it does a user-level
installation, not system-level, so you don't need administrative
privileges.  And it supports both Python 2 and Python 3.  You can
download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item pandas for representing and analyzing data,
  \url{http://pandas.pydata.org/};
\index{pandas}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation including statistics,
  \url{http://www.scipy.org/};
\index{SciPy}

\item StatsModels for regression and other statistical analysis,
\url{http://statsmodels.sourceforge.net/}; and
\index{StatsModels}

\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments.  If you have trouble installing them, I strongly
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}

After you clone the repository or unzip the zip file, you should have
a folder called {\tt ThinkStats2/code} with a file called {\tt nsfg.py}.
If you run {\tt nsfg.py}, it should read a data file, run some tests, and print a
message like, ``All tests passed.''  If you get import errors, it
probably means there are packages you need to install.

Most exercises use Python scripts, but some also use the IPython
notebook.  If you have not used IPython notebook before, I suggest
you start with the documentation at
\url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.
\index{IPython}

I wrote this book assuming that the reader is familiar with core Python,
including object-oriented features, but not with pandas,
NumPy, or SciPy.  If you are already familiar with these modules, you
can skip a few sections.

I assume that the reader knows basic mathematics, including,
for example, logarithms and summations.  I refer to calculus concepts
in a few places, but you don't have to do any calculus.

If you have never studied statistics, I think this book is a good place
to start.  And if you have taken
a traditional statistics class, I hope this book will help repair the
damage.



---

Allen B. Downey is a Professor of Computer Science at
the Franklin W. Olin College of Engineering in Needham, MA.



\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\tt downey@allendowney.com}.  If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search.  Page and
section numbers are fine, too, but not quite as easy to work with.
Thanks!

\small

\begin{itemize}

\item Lisa Downey and June Downey read an early draft and made many
corrections and suggestions.

\item Steven Zhang found several errors.

\item Andy Pethan and Molly Farison helped debug some of the solutions,
and Molly spotted several typos.

\item Dr. Nikolas Akerblom knows how big a Hyracotherium is.

\item Alex Morrow clarified one of the code examples.

\item Jonathan Street caught an error in the nick of time.

\item Many thanks to Kevin Smith and Tim Arnold for their work on
plasTeX, which I used to convert this book to DocBook.

\item George Caplan sent several suggestions for improving clarity.

\item Julian Ceipek found an error and a number of typos.

\item Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson
found errors in the first print edition.

\item J\"{o}rg Beyer found typos in the book and made many corrections
in the docstrings of the accompanying code.

\item Tommie Gannert sent a patch file with a number of corrections.

\item Christoph Lendenmann submitted several errata.

\item Michael Kearney sent me many excellent suggestions.

\item Alex Birch made a number of helpful suggestions.

\item Lindsey Vanderlyn, Griffin Tschurwald, and Ben Small read an
  early version of this book and found many errors.

\item John Roth, Carol Willing, and Carol Novitsky performed technical
reviews of the book.  They found many errors and made many
helpful suggestions.

\item David Palmer sent many helpful suggestions and corrections.

\item Erik Kulyk found many typos.

\item Nir Soffer sent several excellent pull requests for both the
  book and the supporting code.

\item GitHub user flothesof sent a number of corrections.

\item Toshiaki Kurokawa, who is working on the Japanese translation of
this book, has sent many corrections and helpful suggestions.

\item Benjamin White suggested more idiomatic pandas code.

\item Takashi Sato spotted a code error.

% ENDCONTRIB

\end{itemize}

Other people who found typos and similar errors are Andrew Heine,
G\'{a}bor Lipt\'{a}k,
Dan Kearney,
Alexander Gryzlov,
Martin Veillette,
Haitao Ma,
Jeff Pickhardt,
Rohit Deshpande,
Joanne Pratt,
Lucian Ursu,
Paul Glezen,
Ting-kuang Lin,
Scott Miller,
Luigi Patruno.



\normalsize

\clearemptydoublepage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\clearemptydoublepage

\end{latexonly}

% START THE BOOK
\mainmatter


\chapter{Exploratory data analysis}
\label{intro}

The thesis of this book is that data combined with practical
methods can answer questions and guide decisions under uncertainty.

As an example, I present a case study motivated by a question
I heard when my wife and I were expecting our first child: do first
babies tend to arrive late?
\index{first babies}

If you Google this question, you will find plenty of discussion.  Some
people claim it's true, others say it's a myth, and some people say
it's the other way around: first babies come early.

In many of these discussions, people provide data to support their
claims.  I found many examples like these:

\begin{quote}

``My two friends that have given birth recently to their first babies,
BOTH went almost 2 weeks overdue before going into labour or being
induced.''

``My first one came 2 weeks late and now I think the second one is
going to come out two weeks early!!''

``I don't think that can be true because my sister was my mother's
first and she was early, as with many of my cousins.''

\end{quote}

Reports like these are called {\bf anecdotal evidence} because they
are based on data that is unpublished and usually personal.  In casual
conversation, there is nothing wrong with anecdotes, so I don't mean
to pick on the people I quoted.
\index{anecdotal evidence}

But we might want evidence that is more persuasive and
an answer that is more reliable.  By those standards, anecdotal
evidence usually fails, because:

\begin{itemize}

\item Small number of observations: If pregnancy length is longer
  for first babies, the difference is probably small compared to
  natural variation.  In that case, we might have to compare a large
  number of pregnancies to be sure that a difference exists.
\index{pregnancy length}

\item Selection bias: People who join a discussion of this question
  might be interested because their first babies were late.  In that
  case the process of selecting data would bias the results.
\index{selection bias}
\index{bias!selection}

\item Confirmation bias:  People who believe the claim might be more
  likely to contribute examples that confirm it.  People who doubt the
  claim are more likely to cite counterexamples.
\index{confirmation bias}
\index{bias!confirmation}

\item Inaccuracy: Anecdotes are often personal stories, and are often
  misremembered, misrepresented, repeated
  inaccurately, etc.

\end{itemize}

So how can we do better?


\section{A statistical approach}

To address the limitations of anecdotes, we will use the tools
of statistics, which include:

\begin{itemize}

\item Data collection: We will use data from a large national survey
  that was designed explicitly with the goal of generating
  statistically valid inferences about the U.S. population.
\index{data collection}

\item Descriptive statistics: We will generate statistics that
  summarize the data concisely, and evaluate different ways to
  visualize data.
\index{descriptive statistics}

\item Exploratory data analysis: We will look for
  patterns, differences, and other features that address the questions
  we are interested in.  At the same time we will check for
  inconsistencies and identify limitations.
\index{exploratory data analysis}

\item Estimation: We will use data from a sample to estimate
  characteristics of the general population.
\index{estimation}

\item Hypothesis testing: Where we see apparent effects, like a
  difference between two groups, we will evaluate whether the effect
  might have happened by chance.
\index{hypothesis testing}

\end{itemize}

By performing these steps with care to avoid pitfalls, we can
reach conclusions that are more justifiable and more likely to be
correct.


\section{The National Survey of Family Growth}
\label{nsfg}

Since 1973 the U.S. Centers for Disease Control and Prevention (CDC)
have conducted the National Survey of Family Growth (NSFG),
which is intended to gather ``information on family life, marriage and
divorce, pregnancy, infertility, use of contraception, and men's and
women's health. The survey results are used \ldots to plan health services and
health education programs, and to do statistical studies of families,
fertility, and health.''  See
  \url{http://cdc.gov/nchs/nsfg.htm}.
\index{National Survey of Family Growth}
\index{NSFG}

We will use data collected by this survey to investigate whether first
babies tend to come late, and other questions.  In order to use this
data effectively, we have to understand the design of the study.

The NSFG is a {\bf cross-sectional} study, which means that it
captures a snapshot of a group at a point in time.  The most
common alternative is a {\bf longitudinal} study, which observes a
group repeatedly over a period of time.
\index{cross-sectional study}
\index{study!cross-sectional}
\index{longitudinal study}
\index{study!longitudinal}

The NSFG has been conducted seven times; each deployment is called a
{\bf cycle}.  We will use data from Cycle 6, which was conducted from
January 2002 to March 2003.  \index{cycle}

The goal of the survey is to draw conclusions about a {\bf
  population}; the target population of the NSFG is people in the
United States aged 15--44.  Ideally surveys would collect data from
every member of the population, but that's seldom possible.  Instead
we collect data from a subset of the population called a {\bf sample}.
The people who participate in a survey are called {\bf respondents}.
\index{population}

In general,
cross-sectional studies are meant to be {\bf representative}, which
means that every member of the target population has an equal chance
of participating.  That ideal is hard to achieve in
practice, but people who conduct surveys come as close as they can.
\index{respondent} \index{representative}

The NSFG is not representative; instead it is deliberately {\bf
  oversampled}.  The designers of the study recruited three
groups---Hispanics, African-Americans and teenagers---at rates higher
than their representation in the U.S. population, in order to
make sure that the number of respondents in each of
these groups is large enough to draw valid statistical inferences.
\index{oversampling}

Of course, the drawback of oversampling is that it is not as easy
to draw conclusions about the general population based on statistics
from the survey.  We will come back to this point later.
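
To see why oversampling complicates population-level estimates, here
is a minimal sketch.  The group sizes, sample sizes, and group means
are invented for illustration; they are not NSFG figures.  The sketch
assumes that each respondent is given a weight equal to the number of
people in the population that he or she represents, which is one
common way to correct for oversampling.

```python
# Hypothetical population with two groups; all numbers are made up.
# Each entry: (population size, number sampled, mean value in the group)
groups = {
    'A': (9000, 100, 10.0),
    'B': (1000, 100, 20.0),   # oversampled: 10% of people, 50% of sample
}

def population_mean(groups):
    """Mean over the whole population (usually unknown in practice)."""
    total = sum(size for size, _, _ in groups.values())
    return sum(size * mean for size, _, mean in groups.values()) / total

def unweighted_sample_mean(groups):
    """Mean over respondents, ignoring how they were sampled."""
    n = sum(sampled for _, sampled, _ in groups.values())
    return sum(sampled * mean for _, sampled, mean in groups.values()) / n

def weighted_sample_mean(groups):
    """Mean over respondents, each weighted by the people represented."""
    numerator = 0.0
    denominator = 0.0
    for size, sampled, mean in groups.values():
        weight = float(size) / sampled   # people per respondent
        numerator += weight * sampled * mean
        denominator += weight * sampled
    return numerator / denominator

print(population_mean(groups))
print(unweighted_sample_mean(groups))
print(weighted_sample_mean(groups))
```

In this made-up example the unweighted mean is pulled toward the
oversampled group, while the weighted mean recovers the population
mean.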

When working with this kind of data, it is important to be familiar
with the {\bf codebook}, which documents the design of the study, the
survey questions, and the encoding of the responses.  The codebook and
user's guide for the NSFG data are available from
\url{http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm}.


\section{Importing the data}

The code and data used in this book are available from
\url{https://github.com/AllenDowney/ThinkStats2}.  For information
about downloading and working with this code,
see Section~\ref{code}.

Once you download the code, you should have a file called {\tt
  ThinkStats2/code/nsfg.py}.  If you run it, it should read a data
file, run some tests, and print a message like, ``All tests passed.''

Let's see what it does.  Pregnancy data from Cycle 6 of the NSFG is in
a file called {\tt 2002FemPreg.dat.gz}; it
is a gzip-compressed data file in plain text (ASCII), with fixed width
columns.  Each line in the file is a {\bf record} that
contains data about one pregnancy.

The format of the file is documented in {\tt 2002FemPreg.dct}, which
is a Stata dictionary file.  Stata is a statistical software system;
a ``dictionary'' in this context is a list of variable names, types,
and indices that identify where in each line to find each variable.

For example, here are a few lines from {\tt 2002FemPreg.dct}:
%
\begin{verbatim}
infile dictionary {
  _column(1)  str12  caseid    %12s  "RESPONDENT ID NUMBER"
  _column(13) byte   pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
}
\end{verbatim}

This dictionary describes two variables: {\tt caseid} is a 12-character
string that represents the respondent ID; {\tt pregordr} is a
one-byte integer that indicates which pregnancy this record
describes for this respondent.
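
To make the role of the dictionary concrete, here is a minimal sketch
that parses the two dictionary lines above and uses them to slice a
record.  It is not the code in {\tt thinkstats2.py}: the names
{\tt parse\_dictionary} and {\tt parse\_record} are hypothetical, the
record is made up, and the sketch assumes that the width of each field
can be read from its format specifier ({\tt \%12s}, {\tt \%2f}).

```python
import re

# The two dictionary lines shown above, copied as text.
DCT_LINES = '''
  _column(1)  str12  caseid    %12s  "RESPONDENT ID NUMBER"
  _column(13) byte   pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
'''

# Groups: start column, type, name, width, description.
PATTERN = re.compile(
    r'_column\((\d+)\)\s+(\S+)\s+(\S+)\s+%(\d+)\S\s+"([^"]*)"')

def parse_dictionary(text):
    """Return (name, begin, end) tuples as 0-based, half-open slices."""
    variables = []
    for match in PATTERN.finditer(text):
        start, vtype, name, width, desc = match.groups()
        begin = int(start) - 1      # the dictionary counts columns from 1
        variables.append((name, begin, begin + int(width)))
    return variables

def parse_record(line, variables):
    """Slice one fixed-width line into a dict of stripped strings."""
    return {name: line[begin:end].strip()
            for name, begin, end in variables}

# A made-up record: a 12-character ID followed by a 2-character order.
record = '1'.rjust(12) + '2'.rjust(2)

variables = parse_dictionary(DCT_LINES)
print(variables)
print(parse_record(record, variables))
```

The values come back as strings; a fuller reader would also use the
type column ({\tt str12}, {\tt byte}) to convert them.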

The code you downloaded includes {\tt thinkstats2.py}, which is a Python
module
that contains many classes and functions used in this book,
including functions that read the Stata dictionary and
the NSFG data file.  Here's how they are used in {\tt nsfg.py}:

\begin{verbatim}
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df
\end{verbatim}

{\tt ReadStataDct} takes the name of the dictionary file
and returns {\tt dct}, a {\tt FixedWidthVariables} object that contains the
information from the dictionary file.  {\tt dct} provides {\tt
  ReadFixedWidth}, which reads the data file.
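
As a rough sketch of what reading a fixed-width file involves, the
following example uses {\tt pandas.read\_fwf}, a pandas function for
this kind of file, with column positions taken from the two dictionary
entries shown earlier.  It is not the implementation of {\tt
ReadFixedWidth}; the two records are made up, and the example assumes
Python 3.

```python
import io

import pandas as pd

# Two made-up records in the layout described by the dictionary:
# columns 1-12 hold caseid, columns 13-14 hold pregordr.
data = io.StringIO(
    '1'.rjust(12) + '1'.rjust(2) + '\n' +
    '1'.rjust(12) + '2'.rjust(2) + '\n'
)

colspecs = [(0, 12), (12, 14)]    # 0-based, half-open intervals
names = ['caseid', 'pregordr']

# header=None because the data has no header row.
df = pd.read_fwf(data, colspecs=colspecs, names=names, header=None)
print(df)
```

The result has one row per record and one column per variable.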


\section{DataFrames}
\label{dataframe}

The result of {\tt ReadFixedWidth} is a DataFrame, the
fundamental data structure provided by pandas, a Python
data and statistics package we'll use throughout this book.
A DataFrame contains a
row for each record, in this case one row per pregnancy, and a column
for each variable.
\index{pandas}
\index{DataFrame}

In addition to the data, a DataFrame also contains the variable
names and their types, and it provides methods for accessing and modifying
the data.
905

906
If you print {\tt df} you get a truncated view of the rows and
907
columns, and the shape of the DataFrame, which is 13593
908
rows/records and 244 columns/variables.
909

910
\begin{verbatim}
911
>>> import nsfg
912
>>> df = nsfg.ReadFemPreg()
913
>>> df
914
...
915
[13593 rows x 244 columns]
916
\end{verbatim}
917

918
The DataFrame is too big to display, so the output is truncated.  The
919
last line reports the number of rows and columns.
920

921
The attribute {\tt columns} returns a sequence of column
922
names as Unicode strings:
923

924
\begin{verbatim}
925
>>> df.columns
926
Index([u'caseid', u'pregordr', u'howpreg_n', u'howpreg_p', ... ])
927
\end{verbatim}
928

929
The result is an Index, which is another pandas data structure.  
930
We'll learn more about Index later, but for
931
now we'll treat it like a list:
932
\index{pandas}
933
\index{Index}
934

935
\begin{verbatim}
936
>>> df.columns[1]
937
'pregordr'
938
\end{verbatim}
939

940
To access a column from a DataFrame, you can use the column
941
name as a key:
942
\index{DataFrame}
943

944
\begin{verbatim}
945
>>> pregordr = df['pregordr']
946
>>> type(pregordr)
947
<class 'pandas.core.series.Series'>
948
\end{verbatim}
949

950
The result is a Series, yet another pandas data structure.
951
A Series is like a Python list with some additional features.
952
When you print a Series, you get the indices and the
953
corresponding values:
954
\index{Series}
955

956
\begin{verbatim}
957
>>> pregordr
958
0     1
959
1     2
960
2     1
961
3     2
962
...
963
13590    3
964
13591    4
965
13592    5
966
Name: pregordr, Length: 13593, dtype: int64
967
\end{verbatim}
968

969
In this example the indices are integers from 0 to 13592, but in
970
general they can be any sortable type.  The elements
971
are also integers, but they can be any type.
972

973
The last line includes the variable name, Series length, and data type;
974
{\tt int64} is one of the types provided by NumPy.  If you run
975
this example on a 32-bit machine you might see {\tt int32}.
976
\index{NumPy}
977

978
You can access the elements of a Series using integer indices
979
and slices:
980

981
\begin{verbatim}
982
>>> pregordr[0]
983
1
984
>>> pregordr[2:5]
985
2    1
986
3    2
987
4    3
988
Name: pregordr, dtype: int64
989
\end{verbatim}
990

991
The result of the index operator is an {\tt int64}; the
992
result of the slice is another Series.
993

994
You can also access the columns of a DataFrame using dot notation:
995
\index{DataFrame}
996

997
\begin{verbatim}
998
>>> pregordr = df.pregordr
999
\end{verbatim}
1000

1001
This notation only works if the column name is a valid Python
1002
identifier, so it has to begin with a letter, can't contain spaces, etc.
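
The access patterns in this section can be tried without the NSFG
files.  The following sketch uses a small made-up DataFrame as a
stand-in for the pregnancy data; the values are invented.  It uses
{\tt iloc}, the pandas indexer for selecting by position, to make
explicit that the element and the slice are chosen by row position.

```python
import pandas as pd

# Toy stand-in for the pregnancy DataFrame; the values are made up.
df = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "pregordr": [1, 2, 1, 2, 3],
})

by_key = df["pregordr"]    # bracket access works for any column name
by_dot = df.pregordr       # dot access works for identifier-like names

first = by_key.iloc[0]     # a single element (a NumPy integer)
middle = by_key.iloc[2:5]  # a slice is another Series

print(list(df.columns))
print(first, list(middle))
```

Both access styles return the same Series, so the choice between them
is a matter of convenience when reading a column.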


\section{Variables}

We have already seen two variables in the NSFG dataset, {\tt caseid}
and {\tt pregordr}, and we have seen that there are 244 variables in
total.  For the explorations in this book, I use the following
variables:

\begin{itemize}

\item {\tt caseid} is the integer ID of the respondent.

\item {\tt prglngth} is the integer duration of the pregnancy in weeks.
\index{pregnancy length}

\item {\tt outcome} is an integer code for the outcome of the
  pregnancy.  The code 1 indicates a live birth.

\item {\tt pregordr} is a pregnancy serial number; for example, the
  code for a respondent's first pregnancy is 1, for the second
  pregnancy is 2, and so on.

\item {\tt birthord} is a serial number for live
  births; the code for a respondent's first child is 1, and so on.
  For outcomes other than live birth, this field is blank.

\item \verb"birthwgt_lb" and \verb"birthwgt_oz" contain the pounds and
  ounces parts of the birth weight of the baby.
\index{birth weight}
\index{weight!birth}

\item {\tt agepreg} is the mother's age at the end of the pregnancy.

\item {\tt finalwgt} is the statistical weight associated with the
  respondent.  It is a floating-point value that indicates the number
  of people in the U.S. population this respondent represents.
  \index{weight!sample}

\end{itemize}

If you read the codebook carefully, you will see that many of the
variables are {\bf recodes}, which means that they are not part of the
{\bf raw data} collected by the survey; they are calculated using
the raw data.  \index{recode} \index{raw data}

For example, {\tt prglngth} for live births is equal to the raw
variable {\tt wksgest} (weeks of gestation) if it is available;
otherwise it is estimated using {\tt mosgest * 4.33} (months of
gestation times the average number of weeks in a month).
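
As a rough illustration of the rule just described, the following
sketch computes a pregnancy length from the two raw variables.  It is
not the NSFG's actual recode logic: the function name
\verb"estimate_prglngth" is hypothetical, {\tt None} stands in for a
missing raw value, and the result is left unrounded because the text
does not say how the survey rounds.

```python
WEEKS_PER_MONTH = 4.33  # average number of weeks in a month, as in the text


def estimate_prglngth(wksgest=None, mosgest=None):
    """Return pregnancy length in weeks, preferring the raw weeks value.

    Hypothetical helper: None means the raw value is missing.
    """
    if wksgest is not None:
        return wksgest
    if mosgest is not None:
        return mosgest * WEEKS_PER_MONTH
    return None


print(estimate_prglngth(wksgest=39, mosgest=9))  # weeks reported: use them
print(estimate_prglngth(mosgest=9))              # only months reported
```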

Recodes are often based on logic that checks the consistency and
accuracy of the data.  In general it is a good idea to use recodes
when they are available, unless there is a compelling reason to
process the raw data yourself.


\section{Transformation}
\label{cleaning}

When you import data like this, you often have to check for errors,
deal with special values, convert data into different formats, and
perform calculations.  These operations are called {\bf data cleaning}.

{\tt nsfg.py} includes {\tt CleanFemPreg}, a function that cleans
the variables I am planning to use.

\begin{verbatim}
def CleanFemPreg(df):
    df.agepreg /= 100.0

    na_vals = [97, 98, 99]
    df.birthwgt_lb.replace(na_vals, np.nan, inplace=True)
    df.birthwgt_oz.replace(na_vals, np.nan, inplace=True)

    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

{\tt agepreg} contains the mother's age at the end of the
pregnancy.  In the data file, {\tt agepreg} is encoded as an integer
number of centiyears.  So the first line divides each element
of {\tt agepreg} by 100, yielding a floating-point value in
years.

\verb"birthwgt_lb" and \verb"birthwgt_oz" contain the weight of the
baby, in pounds and ounces, for pregnancies that end in live birth.
In addition, these variables use several special codes:

\begin{verbatim}
97      NOT ASCERTAINED
98      REFUSED
99      DON'T KNOW
\end{verbatim}

Special values encoded as numbers are {\em dangerous\/} because if they
are not handled properly, they can generate bogus results, like
a 99-pound baby.  The {\tt replace} method replaces these values with
{\tt np.nan}, a special floating-point value that represents ``not a
number.''  The {\tt inplace} flag tells {\tt replace} to modify the
existing Series rather than create a new one.
\index{NaN}

As part of the IEEE floating-point standard, arithmetic
operations return {\tt nan} if either argument is {\tt nan}:

\begin{verbatim}
>>> import numpy as np
>>> np.nan / 100.0
nan
\end{verbatim}

So computations with {\tt nan} tend to do the right thing, and most
pandas functions handle {\tt nan} appropriately.  But dealing with
missing data will be a recurring issue.
\index{pandas}
\index{missing values}
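
To see why the special codes matter, the following sketch compares
three ways of averaging a handful of made-up pound values, one of which
is the code 99.  The data and variable names are invented for
illustration; {\tt np.mean} and {\tt np.nanmean} are NumPy functions.

```python
import numpy as np

NA_VALS = {97, 98, 99}  # special codes from the table above

# Made-up pound values; 99 stands for "DON'T KNOW".
raw = [7, 6, 99, 8]

# Replace the special codes with nan, mirroring what replace() does.
clean = np.array([np.nan if x in NA_VALS else x for x in raw], dtype=float)

naive_mean = np.mean(raw)     # treats 99 as a real weight
plain_mean = np.mean(clean)   # nan propagates through the arithmetic
nan_mean = np.nanmean(clean)  # skips nan explicitly

print(naive_mean, plain_mean, nan_mean)
```

The naive mean is inflated by the bogus 99-pound value, the plain mean
of the cleaned data is {\tt nan}, and only the version that skips
{\tt nan} reports the average of the three valid weights.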

The last line of {\tt CleanFemPreg} creates a new
column \verb"totalwgt_lb" that combines pounds and ounces into
a single quantity, in pounds.

One important note: when you add a new column to a DataFrame, you
must use dictionary syntax, like this:
\index{DataFrame}

\begin{verbatim}
    # CORRECT
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

not dot notation, like this:

\begin{verbatim}
    # WRONG!
    df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0
\end{verbatim}

The version with dot notation adds an attribute to the DataFrame
object, but that attribute is not treated as a new column.
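
The difference is easy to check on a small made-up DataFrame.  In the
sketch below the weights are invented and \verb"total_dot" is a
hypothetical attribute name; recent versions of pandas also issue a
warning on the dot-notation assignment, which the sketch silences.

```python
import warnings

import pandas as pd

# Made-up weights for three births.
df = pd.DataFrame({"birthwgt_lb": [7.0, 6.0, 8.0],
                   "birthwgt_oz": [8.0, 0.0, 4.0]})

# Dictionary syntax creates a real column.
df["totalwgt_lb"] = df.birthwgt_lb + df.birthwgt_oz / 16.0

# Dot notation only sets a Python attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.total_dot = df.birthwgt_lb + df.birthwgt_oz / 16.0

print(list(df.columns))
print("total_dot" in df.columns)
```

Only \verb"totalwgt_lb" shows up in the list of columns, so anything
that operates on the DataFrame as a table will not see the attribute.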


\section{Validation}

When data is exported from one software environment and imported into
another, errors might be introduced.  And when you are
getting familiar with a new dataset, you might interpret data
incorrectly or introduce other misunderstandings.  If you take
time to validate the data, you can save time later and avoid errors.

One way to validate data is to compute basic statistics and compare
them with published results.  For example, the NSFG codebook includes
tables that summarize each variable.  Here is the table for
{\tt outcome}, which encodes the outcome of each pregnancy:

\begin{verbatim}
value   label                  Total
1       LIVE BIRTH              9148
2       INDUCED ABORTION        1862
3       STILLBIRTH               120
4       MISCARRIAGE             1921
5       ECTOPIC PREGNANCY        190
6       CURRENT PREGNANCY        352
\end{verbatim}

The Series class provides a method, \verb"value_counts", that
counts the number of times each value appears.  If we select the {\tt
  outcome} Series from the DataFrame, we can use \verb"value_counts"
to compare with the published data:
\index{DataFrame}
\index{Series}

\begin{verbatim}
>>> df.outcome.value_counts().sort_index()
1    9148
2    1862
3     120
4    1921
5     190
6     352
\end{verbatim}

The result of \verb"value_counts" is a Series;
\verb"sort_index()" sorts the Series by index, so the values
appear in order.
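
The same comparison can be automated.  The sketch below runs
\verb"value_counts" on a short made-up Series of outcome codes and
checks the result against a dictionary that plays the role of the
published table; both the codes and the ``published'' totals are
invented for this toy sample.

```python
import pandas as pd

# Made-up outcome codes for ten pregnancies.
outcome = pd.Series([1, 4, 1, 2, 1, 4, 6, 1, 2, 1])

# Hypothetical "published" totals for this toy sample.
published = {1: 5, 2: 2, 4: 2, 6: 1}

counts = outcome.value_counts().sort_index()

# Convert to a plain dict so the comparison ignores NumPy dtypes.
observed = {int(value): int(count) for value, count in counts.items()}

print(observed)
print(observed == published)
```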

Comparing the results with the published table, it looks like the
values in {\tt outcome} are correct.  Similarly, here is the published
table for \verb"birthwgt_lb":

\begin{verbatim}
value   label                  Total
.       INAPPLICABLE            4449
0-5     UNDER 6 POUNDS          1125
6       6 POUNDS                2223
7       7 POUNDS                3049
8       8 POUNDS                1889
9-95    9 POUNDS OR MORE         799
\end{verbatim}

And here are the value counts:

\begin{verbatim}
>>> df.birthwgt_lb.value_counts(sort=False)
0        8
1       40
2       53
3       98
4      229
5      697
6     2223
7     3049
8     1889
9      623
10     132
11      26
12      10
13       3
14       3
15       1
51       1
\end{verbatim}

The counts for 6, 7, and 8 pounds check out, and if you add
up the counts for 0-5 and 9-95, they check out, too.  But
if you look more closely, you will notice one value that has to be
an error: a 51-pound baby!
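
The addition for the two grouped rows can be done in a few lines.  This
sketch copies the value counts shown above into a plain dictionary;
\verb"total_in_range" is a hypothetical helper written for this check.

```python
# Value counts for birthwgt_lb, copied from the output above.
counts = {0: 8, 1: 40, 2: 53, 3: 98, 4: 229, 5: 697,
          6: 2223, 7: 3049, 8: 1889, 9: 623, 10: 132,
          11: 26, 12: 10, 13: 3, 14: 3, 15: 1, 51: 1}


def total_in_range(counts, low, high):
    """Sum the counts of all values between low and high, inclusive."""
    return sum(n for value, n in counts.items() if low <= value <= high)


under_6 = total_in_range(counts, 0, 5)
nine_or_more = total_in_range(counts, 9, 95)
print(under_6, nine_or_more)
```

The totals are 1125 and 799, matching the published table.  Note that
the second total matches only because it includes the suspicious
51-pound record, so agreement with the codebook does not by itself
show that every value is plausible.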

To deal with this error, I added a line to {\tt CleanFemPreg}:

\begin{verbatim}
df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
\end{verbatim}

This statement replaces invalid values with {\tt np.nan}.
The attribute {\tt loc} provides several ways to select
rows and columns from a DataFrame.  In this example, the
first expression in brackets is the row indexer; the second
expression selects the column.
\index{loc indexer}
\index{indexer!loc}

The expression \verb"df.birthwgt_lb > 20" yields a Series of type
{\tt bool}, where True indicates that the condition is true.  When a
boolean Series is used as an index, it selects only the elements that
satisfy the condition.
\index{Series} \index{boolean} \index{NaN}
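
Here is the same statement applied to a small made-up DataFrame, with
the boolean Series pulled out into its own variable so that each step
is visible.  The weights are invented, and the column is created as
floating-point so that it can hold {\tt nan}.

```python
import numpy as np
import pandas as pd

# Made-up weights; the column is float so it can hold nan.
df = pd.DataFrame({"birthwgt_lb": [7.0, 6.0, 51.0, 8.0]})

# Boolean Series with one entry per row.
too_heavy = df.birthwgt_lb > 20

# Rows are chosen by the mask, the column by its name.
df.loc[too_heavy, "birthwgt_lb"] = np.nan

print(list(too_heavy))
print(df.birthwgt_lb.isna().sum())
```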



\section{Interpretation}

To work with data effectively, you have to think on two levels at the
same time: the level of statistics and the level of context.

As an example, let's look at the sequence of outcomes for a few
respondents.  Because of the way the data files are organized, we have
to do some processing to collect the pregnancy data for each respondent.
Here's a function that does that:

\begin{verbatim}
def MakePregMap(df):
    d = defaultdict(list)
    for index, caseid in df.caseid.iteritems():
        d[caseid].append(index)
    return d
\end{verbatim}

{\tt df} is the DataFrame with pregnancy data.  The {\tt iteritems}
method enumerates the index (row number)
and {\tt caseid} for each pregnancy.
(In pandas 2.0 and later, {\tt iteritems} has been removed;
{\tt items} does the same thing.)
\index{DataFrame}

{\tt d} is a dictionary that maps from each case ID to a list of
indices.  If you are not familiar with {\tt defaultdict}, it is in
the Python {\tt collections} module.
Using {\tt d}, we can look up a respondent and get the
indices of that respondent's pregnancies.
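
The grouping idea does not depend on pandas.  The following sketch
applies it to a plain list of made-up case IDs, using {\tt enumerate}
to supply the row numbers; \verb"make_index_map" is a hypothetical
stand-in for {\tt MakePregMap}, not the function in {\tt nsfg.py}.

```python
from collections import defaultdict

# Made-up case IDs, one per pregnancy record, in row order.
caseids = [101, 101, 102, 103, 103, 103]


def make_index_map(caseids):
    """Map each case ID to the list of row numbers where it appears."""
    d = defaultdict(list)
    for index, caseid in enumerate(caseids):
        # A key seen for the first time starts out as an empty list.
        d[caseid].append(index)
    return d


preg_map = make_index_map(caseids)
print(dict(preg_map))
```

Using {\tt defaultdict(list)} avoids a separate check for whether a
case ID has been seen before.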

This example looks up one respondent and prints a list of outcomes
for her pregnancies:

\begin{verbatim}
>>> caseid = 10229
>>> preg_map = nsfg.MakePregMap(df)
>>> indices = preg_map[caseid]
>>> df.outcome[indices].values
[4 4 4 4 4 4 1]
\end{verbatim}

{\tt indices} is the list of indices for pregnancies corresponding
to respondent {\tt 10229}.

Using this list as an index into {\tt df.outcome} selects the
indicated rows and yields a Series.  Instead of printing the
whole Series, I selected the {\tt values} attribute, which is
a NumPy array.
\index{NumPy}
\index{Series}

The outcome code {\tt 1} indicates a live birth. Code {\tt 4} indicates
a miscarriage; that is, a pregnancy that ended spontaneously, usually
with no known medical cause.

Statistically this respondent is not unusual.  Miscarriages are common
and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who
was pregnant six times, each time ending in miscarriage.  Her seventh
and most recent pregnancy ended in a live birth.  If we consider this
data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided
honest answers to many personal and difficult questions.  We can use
this data to answer statistical questions about family life,
reproduction, and health.  At the same time, we have an obligation
to consider the people represented by the data, and to afford them
respect and gratitude.
\index{ethics}


\section{Exercises}

\begin{exercise}
In the repository you downloaded, you should find a file named
\verb"chap01ex.ipynb", which is an IPython notebook.  You can
launch IPython notebook from the command line like this:
\index{IPython}

\begin{verbatim}
$ ipython notebook &
\end{verbatim}

If IPython is installed, it should launch a server that runs in the
background and open a browser to view the notebook.  If you are not
familiar with IPython, I suggest you start at
\url{http://ipython.org/ipython-doc/stable/notebook/notebook.html}.

If a new browser window does not open, the startup
message provides a URL you can load in a browser, usually
\url{http://localhost:8888}.  The browser window should list the notebooks
in the repository.

Open \verb"chap01ex.ipynb".  Some cells are already filled in, and
you should execute them.  Other cells give you instructions for
exercises you should try.

A solution to this exercise is in \verb"chap01soln.ipynb".
\end{exercise}


\begin{exercise}
In the repository you downloaded, you should find a file named
\verb"chap01ex.py"; using this file as a starting place, write a
function that reads the respondent file, {\tt 2002FemResp.dat.gz}.

The variable {\tt pregnum} is a recode that indicates how many
times each respondent has been pregnant.  Print the value counts
for this variable and compare them to the published results in
the NSFG codebook.

You can also cross-validate the respondent and pregnancy files by
comparing {\tt pregnum} for each respondent with the number of
records in the pregnancy file.

You can use {\tt nsfg.MakePregMap} to make a dictionary that maps
from each {\tt caseid} to a list of indices into the pregnancy
DataFrame.
\index{DataFrame}

A solution to this exercise is in \verb"chap01soln.py".
\end{exercise}


\begin{exercise}
The best way to learn about statistics is to work on a project you are
interested in.  Is there a question like ``Do first babies arrive
late?'' that you want to investigate?

Think about questions you find personally interesting, or items of
conventional wisdom, or controversial topics, or questions that have
political consequences, and see if you can formulate a question that
lends itself to statistical inquiry.

Look for data to help you address the question.  Governments are good
sources because data from public research is often freely
available.  Good places to start include \url{http://www.data.gov/}
and \url{http://www.science.gov/}, and in the United Kingdom,
\url{http://data.gov.uk/}.

Two of my favorite data sets are the General Social Survey at
\url{http://www3.norc.org/gss+website/}, and the European Social
Survey at \url{http://www.europeansocialsurvey.org/}.

If it seems like someone has already answered your question, look
closely to see whether the answer is justified.  There might be flaws
in the data or the analysis that make the conclusion unreliable.  In
that case you could perform a different analysis of the same data, or
look for a better source of data.

If you find a published paper that addresses your question, you
should be able to get the raw data.  Many authors make their data
available on the web, but for sensitive data you might have to
write to the authors, provide information about how you plan to use
the data, or agree to certain terms of use.  Be persistent!

\end{exercise}


\section{Glossary}

\begin{itemize}

\item {\bf anecdotal evidence}: Evidence, often personal, that is collected
  casually rather than by a well-designed study.
\index{anecdotal evidence}

\item {\bf population}: A group we are interested in studying.
  ``Population'' often refers to a
  group of people, but the term is used for other subjects,
  too.
\index{population}

\item {\bf cross-sectional study}: A study that collects data about a
population at a particular point in time.
\index{cross-sectional study}
\index{study!cross-sectional}

\item {\bf cycle}: In a repeated cross-sectional study, each repetition
of the study is called a cycle.

\item {\bf longitudinal study}: A study that follows a population over
time, collecting data from the same group repeatedly.
\index{longitudinal study}
\index{study!longitudinal}

\item {\bf record}: In a dataset, a collection of information about
a single person or other subject.
\index{record}

\item {\bf respondent}: A person who responds to a survey.
\index{respondent}

\item {\bf sample}: The subset of a population used to collect data.
\index{sample}

\item {\bf representative}: A sample is representative if every member
of the population has the same chance of being in the sample.
\index{representative}

\item {\bf oversampling}: The technique of increasing the representation
of a sub-population in order to avoid errors due to small sample
sizes.
\index{oversampling}

\item {\bf raw data}: Values collected and recorded with little or no
checking, calculation or interpretation.
\index{raw data}

\item {\bf recode}: A value that is generated by calculation and other
logic applied to raw data.
\index{recode}

\item {\bf data cleaning}: Processes that include validating data,
  identifying errors, translating between data types and
  representations, etc.

\end{itemize}



\chapter{Distributions}
\label{descriptive}


\section{Histograms}
\label{histograms}

One of the best ways to describe a variable is to report the values
that appear in the dataset and how many times each value appears.
This description is called the {\bf distribution} of the variable.
\index{distribution}

The most common representation of a distribution is a {\bf histogram},
which is a graph that shows the {\bf frequency} of each value.  In
this context, ``frequency'' means the number of times the value
appears.  \index{histogram} \index{frequency}
\index{dictionary}

In Python, an efficient way to compute frequencies is with a
dictionary.  Given a sequence of values, {\tt t}:
%
\begin{verbatim}
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
\end{verbatim}

The result is a dictionary that maps from values to frequencies.
Alternatively, you could use the {\tt Counter} class defined in the
{\tt collections} module:

\begin{verbatim}
from collections import Counter
counter = Counter(t)
\end{verbatim}

The result is a {\tt Counter} object, which is a subclass of
dictionary.
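
The two approaches can be run side by side.  The following sketch uses
a short list of made-up values for {\tt t} and checks that the
dictionary and the {\tt Counter} hold the same frequencies.

```python
from collections import Counter

t = [1, 2, 2, 3, 5]  # made-up values

# Dictionary version: get() supplies 0 the first time a value is seen.
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

counter = Counter(t)

print(hist)
print(dict(counter) == hist)
print(counter[4])  # a value that never appeared
```

One practical difference is the last line: looking up a missing value
in a {\tt Counter} returns 0, while {\tt hist[4]} on the plain
dictionary would raise a {\tt KeyError}.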

Another option is to use the pandas method \verb"value_counts", which
we saw in the previous chapter.  But for this book I created a class,
Hist, that represents histograms and provides methods
that operate on them.
\index{pandas}


\section{Representing histograms}
\index{histogram}
\index{Hist}

The Hist constructor can take a sequence, dictionary, pandas
Series, or another Hist.  You can instantiate a Hist object like this:
%
\begin{verbatim}
>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist
Hist({1: 1, 2: 2, 3: 1, 5: 1})
\end{verbatim}

Hist objects provide {\tt Freq}, which takes a value and
returns its frequency: \index{frequency}
%
\begin{verbatim}
>>> hist.Freq(2)
2
\end{verbatim}

The bracket operator does the same thing: \index{bracket operator}
%
\begin{verbatim}
>>> hist[2]
2
\end{verbatim}

If you look up a value that has never appeared, the frequency is 0.
%
\begin{verbatim}
>>> hist.Freq(4)
0
\end{verbatim}

{\tt Values} returns an unsorted list of the values in the Hist:
%
\begin{verbatim}
>>> hist.Values()
[1, 5, 3, 2]
\end{verbatim}

To loop through the values in order, you can use the built-in function
{\tt sorted}:
%
\begin{verbatim}
for val in sorted(hist.Values()):
    print(val, hist.Freq(val))
\end{verbatim}

Or you can use {\tt Items} to iterate through
value-frequency pairs: \index{frequency}
%
\begin{verbatim}
for val, freq in hist.Items():
    print(val, freq)
\end{verbatim}


\section{Plotting histograms}
\index{pyplot}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_lb_hist.pdf}}
\caption{Histogram of the pound part of birth weight.}
\label{first_wgt_lb_hist}
\end{figure}

For this book I wrote a module called {\tt thinkplot.py} that provides
functions for plotting Hists and other objects defined in {\tt
  thinkstats2.py}.  It is based on {\tt pyplot}, which is part of the
{\tt matplotlib} package.  See Section~\ref{code} for information
about installing {\tt matplotlib}.  \index{thinkplot}
\index{matplotlib}

To plot {\tt hist} with {\tt thinkplot}, try this:
\index{Hist}

\begin{verbatim}
>>> import thinkplot
>>> thinkplot.Hist(hist)
>>> thinkplot.Show(xlabel='value', ylabel='frequency')
\end{verbatim}

You can read the documentation for {\tt thinkplot} at
\url{http://greenteapress.com/thinkstats2/thinkplot.html}.


\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_wgt_oz_hist.pdf}}
\caption{Histogram of the ounce part of birth weight.}
\label{first_wgt_oz_hist}
\end{figure}


\section{NSFG variables}

Now let's get back to the data from the NSFG.  The code in this
chapter is in {\tt first.py}.
For information about downloading and
working with this code, see Section~\ref{code}.

When you start working with a new dataset, I suggest you explore
the variables you are planning to use one at a time, and a good
way to start is by looking at histograms.
\index{histogram}

In Section~\ref{cleaning} we transformed {\tt agepreg}
from centiyears to years, and combined \verb"birthwgt_lb" and
\verb"birthwgt_oz" into a single quantity, \verb"totalwgt_lb".
In this section I use these variables to demonstrate some
features of histograms.

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_agepreg_hist.pdf}}
\caption{Histogram of mother's age at end of pregnancy.}
\label{first_agepreg_hist}
\end{figure}

I'll start by reading the data and selecting records for live
births:

\begin{verbatim}
    preg = nsfg.ReadFemPreg()
    live = preg[preg.outcome == 1]
\end{verbatim}

The expression in brackets is a boolean Series; the bracket operator
uses it to select rows from the DataFrame and returns a new DataFrame.
Next I generate and plot the histogram of
\verb"birthwgt_lb" for live births.
\index{DataFrame}
\index{Series}
\index{Hist}
\index{bracket operator}
\index{boolean}

\begin{verbatim}
    hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
    thinkplot.Hist(hist)
    thinkplot.Show(xlabel='pounds', ylabel='frequency')
\end{verbatim}

When the argument passed to Hist is a pandas Series, any
{\tt nan} values are dropped.  {\tt label} is a string that appears
in the legend when the Hist is plotted.
\index{pandas}
\index{Series}
\index{thinkplot}
\index{NaN}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_prglngth_hist.pdf}}
\caption{Histogram of pregnancy length in weeks.}
\label{first_prglngth_hist}
\end{figure}

Figure~\ref{first_wgt_lb_hist} shows the result.  The most common
value, called the {\bf mode}, is 7 pounds.  The distribution is
approximately bell-shaped, which is the shape of the {\bf normal}
distribution, also called a {\bf Gaussian} distribution.  But unlike a
true normal distribution, this distribution is asymmetric; it has
a {\bf tail} that extends farther to the left than to the right.
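
The mode can also be read off the frequencies directly.  The sketch
below reuses the value counts for \verb"birthwgt_lb" from the previous
chapter, leaving out the 51-pound error.  Comparing the number of
records at least 3 pounds below and above the mode is just one rough,
ad hoc way to look at the asymmetry; the cutoff of 3 pounds is an
arbitrary choice made for this illustration.

```python
# Counts of the pound part of birth weight, taken from the value_counts
# output in the previous chapter (the 51-pound error is omitted).
counts = {0: 8, 1: 40, 2: 53, 3: 98, 4: 229, 5: 697,
          6: 2223, 7: 3049, 8: 1889, 9: 623, 10: 132,
          11: 26, 12: 10, 13: 3, 14: 3, 15: 1}

# The mode is the value with the highest frequency.
mode = max(counts, key=counts.get)

# Records at least 3 pounds below the mode versus 3 pounds above it.
left_tail = sum(n for value, n in counts.items() if value <= mode - 3)
right_tail = sum(n for value, n in counts.items() if value >= mode + 3)

print(mode, left_tail, right_tail)
```

With these counts the mode is 7 pounds, and there are 428 records in
the left group compared with 175 in the right group.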

Figure~\ref{first_wgt_oz_hist} shows the histogram of
\verb"birthwgt_oz", which is the ounces part of birth weight.  In
theory we expect this distribution to be {\bf uniform}; that is, all
values should have the same frequency.  In fact, 0 is more common than
the other values, and 1 and 15 are less common, probably because
respondents round off birth weights that are close to an integer
value.
\index{birth weight}
\index{weight!birth}

Figure~\ref{first_agepreg_hist} shows the histogram of \verb"agepreg",
the mother's age at the end of pregnancy.  The mode is 21 years.  The
distribution is very roughly bell-shaped, but in this case the tail
extends farther to the right than left; most mothers are in
their 20s, fewer in their 30s.

Figure~\ref{first_prglngth_hist} shows the histogram of
\verb"prglngth", the length of the pregnancy in weeks.  By far the
most common value is 39 weeks.  The left tail is longer than the
right; early babies are common, but pregnancies seldom go past 43
weeks, and doctors often intervene if they do.
\index{pregnancy length}


\section{Outliers}

Looking at histograms, it is easy to identify the most common
values and the shape of the distribution, but rare values are
not always visible.
\index{histogram}

Before going on, it is a good idea to check for {\bf
  outliers}, which are extreme values that might be errors in
measurement and recording, or might be accurate reports of rare
events.
\index{outlier}

Hist provides methods {\tt Largest} and {\tt Smallest}, which take
an integer {\tt n} and return the {\tt n} largest or smallest
values from the histogram, each paired with its frequency:
\index{Hist}

\begin{verbatim}
    for weeks, freq in hist.Smallest(10):
        print(weeks, freq)
\end{verbatim}

Applied to a Hist of pregnancy lengths for live births, this loop
shows that the 10 lowest values
are {\tt [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]}.  Values below 10 weeks
are certainly errors; the most likely explanation is that the outcome
was not coded correctly.  Values higher than 30 weeks are probably
legitimate.  Between 10 and 30 weeks, it is hard to be sure; some
values are probably errors, but some represent premature babies.
\index{pregnancy length}
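
If you are working with a plain dictionary rather than a Hist, the same
kind of query is a matter of sorting the value-frequency pairs.  The
functions below are a sketch of that idea and are not taken from
{\tt thinkstats2.py}; the names {\tt smallest} and {\tt largest} and
the frequencies are made up for illustration.

```python
def smallest(hist, n=10):
    """Return the n smallest values with their frequencies, in order."""
    return sorted(hist.items())[:n]


def largest(hist, n=10):
    """Return the n largest values with their frequencies, largest first."""
    return sorted(hist.items(), reverse=True)[:n]


# Made-up frequencies of pregnancy length in weeks, for illustration only.
hist = {39: 50, 40: 12, 38: 9, 0: 1, 4: 1, 50: 2, 43: 3}

for weeks, freq in smallest(hist, 3):
    print(weeks, freq)

for weeks, freq in largest(hist, 2):
    print(weeks, freq)
```

Sorting the pairs orders them by value first, so slicing the sorted
list picks out the extremes at either end of the range.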
1748

1749
On the other end of the range, the highest values are:
1750
%
1751
\begin{verbatim}
1752
weeks  count
1753
43     148
1754
44     46
1755
45     10
1756
46     1
1757
47     1
1758
48     7
1759
50     2
1760
\end{verbatim}
1761

1762
Most doctors recommend induced labor if a pregnancy exceeds 42 weeks,
1763
so some of the longer values are surprising.  In particular, 50 weeks
1764
seems medically unlikely.
1765

1766
The best way to handle outliers depends on ``domain knowledge'';
1767
that is, information about where the data come from and what they
1768
mean.  And it depends on what analysis you are planning to perform.
1769
\index{outlier}
1770

1771
In this example, the motivating question is whether first babies
1772
tend to be early (or late).  When people ask this question, they are
1773
usually interested in full-term pregnancies, so for this analysis
1774
I will focus on pregnancies longer than 27 weeks.


\section{First babies}

Now we can compare the distribution of pregnancy lengths for first
babies and others.  I divided the DataFrame of live births using
{\tt birthord}, and computed their histograms:
\index{DataFrame}
\index{Hist}
\index{pregnancy length}

\begin{verbatim}
    firsts = live[live.birthord == 1]
    others = live[live.birthord != 1]

    first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
    other_hist = thinkstats2.Hist(others.prglngth, label='other')
\end{verbatim}

Then I plotted their histograms on the same axis:

\begin{verbatim}
    width = 0.45
    thinkplot.PrePlot(2)
    thinkplot.Hist(first_hist, align='right', width=width)
    thinkplot.Hist(other_hist, align='left', width=width)
    thinkplot.Show(xlabel='weeks', ylabel='frequency',
                   xlim=[27, 46])
\end{verbatim}

{\tt thinkplot.PrePlot} takes the number of histograms
we are planning to plot; it uses this information to choose
an appropriate collection of colors.
\index{thinkplot}

\begin{figure}
% first.py
\centerline{\includegraphics[height=2.5in]{figs/first_nsfg_hist.pdf}}
\caption{Histogram of pregnancy lengths.}
\label{first_nsfg_hist}
\end{figure}

{\tt thinkplot.Hist} normally uses {\tt align='center'} so that
each bar is centered over its value.  For this figure, I use
{\tt align='right'} and {\tt align='left'} to place
corresponding bars on either side of the value.
\index{Hist}

With {\tt width=0.45}, the total width of the two bars is 0.9,
leaving some space between each pair.

Finally, I adjust the axis to show only data between 27 and 46 weeks.
Figure~\ref{first_nsfg_hist} shows the result.
\index{pregnancy length}
\index{length!pregnancy}

Histograms are useful because they make the most frequent values
immediately apparent.  But they are not the best choice for comparing
two distributions.  In this example, there are fewer ``first babies''
than ``others,'' so some of the apparent differences in the histograms
are due to sample sizes.  In the next chapter we address this problem
using probability mass functions.


\section{Summarizing distributions}
\label{mean}

A histogram is a complete description of the distribution of a sample;
that is, given a histogram, we could reconstruct the values in the
sample (although not their order).

If the details of the distribution are important, it might be
necessary to present a histogram.  But often we want to
summarize the distribution with a few descriptive statistics.

Some of the characteristics we might want to report are:

\begin{itemize}

\item central tendency: Do the values tend to cluster around
a particular point?
\index{central tendency}

\item modes: Is there more than one cluster?
\index{mode}

\item spread: How much variability is there in the values?
\index{spread}

\item tails: How quickly do the probabilities drop off as we
move away from the modes?
\index{tail}

\item outliers: Are there extreme values far from the modes?
\index{outlier}

\end{itemize}

Statistics designed to answer these questions are called {\bf summary
  statistics}.  By far the most common summary statistic is the {\bf
  mean}, which is meant to describe the central tendency of the
distribution.  \index{mean} \index{average} \index{summary statistic}

If you have a sample of {\tt n} values, $x_i$, the mean, $\xbar$, is
the sum of the values divided by the number of values; in other words
%
\[ \xbar = \frac{1}{n} \sum_i x_i \]
%
The words ``mean'' and ``average'' are sometimes used interchangeably,
but I make this distinction:

\begin{itemize}

\item The ``mean'' of a sample is the summary statistic computed with
  the previous formula.

\item An ``average'' is one of several summary statistics you might
  choose to describe a central tendency.
\index{central tendency}

\end{itemize}

Sometimes the mean is a good description of a set of values.  For
example, apples are all pretty much the same size (at least the ones
sold in supermarkets).  So if I buy 6 apples and the total weight is 3
pounds, it would be a reasonable summary to say they are about a half
pound each.
\index{weight!pumpkin}

But pumpkins are more diverse.  Suppose I grow several varieties in my
garden, and one day I harvest three decorative pumpkins that are 1
pound each, two pie pumpkins that are 3 pounds each, and one Atlantic
Giant\textregistered~pumpkin that weighs 591 pounds.  The mean of this
sample is 100 pounds, but if I told you ``The average pumpkin in my
garden is 100 pounds,'' that would be misleading.  In this example,
there is no meaningful average because there is no typical pumpkin.
\index{pumpkin}
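
The pumpkin arithmetic is easy to check.  Here is a minimal sketch in
plain Python; the variable names are mine and are not part of the code
that accompanies the book.

```python
# Pumpkin weights in pounds, from the example above:
# three decorative (1 lb), two pie (3 lb), one Atlantic Giant (591 lb).
weights = [1, 1, 1, 3, 3, 591]

# The mean is the sum of the values divided by the number of values.
mean = sum(weights) / len(weights)
print(mean)

# Count the pumpkins that weigh less than the mean.
below = sum(1 for w in weights if w < mean)
print(below, 'of', len(weights), 'pumpkins weigh less than the mean')
```

Five of the six pumpkins weigh far less than the mean, which is why
the mean is a poor description of a typical pumpkin in this sample.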



\section{Variance}
\index{variance}

If there is no single number that summarizes pumpkin weights,
we can do a little better with two numbers: mean and {\bf variance}.

Variance is a summary statistic intended to describe the variability
or spread of a distribution.  The variance of a set of values is
%
\[ S^2 = \frac{1}{n} \sum_i (x_i - \xbar)^2 \]
%
The term $x_i - \xbar$ is called the ``deviation from the mean,'' so
variance is the mean squared deviation.  The square root of variance,
$S$, is the {\bf standard deviation}.  \index{deviation}
\index{standard deviation}

If you have prior experience, you might have seen a formula for
variance with $n-1$ in the denominator, rather than {\tt n}.  This
statistic is used to estimate the variance in a population using a
sample.  We will come back to this in Chapter~\ref{estimation}.
\index{sample variance}

Pandas data structures provide methods to compute mean, variance and
standard deviation:
\index{pandas}

\begin{verbatim}
    mean = live.prglngth.mean()
    var = live.prglngth.var()
    std = live.prglngth.std()
\end{verbatim}

By default, the pandas methods {\tt var} and {\tt std} divide by
$n-1$; to match the formula above exactly, pass {\tt ddof=0}.

For all live births, the mean pregnancy length is 38.6 weeks and the
standard deviation is 2.7 weeks, which means we should expect
deviations of 2--3 weeks to be common.
\index{pregnancy length}

Variance of pregnancy length is 7.3, which is hard to interpret,
especially since the units are weeks$^2$, or ``square weeks.''
Variance is useful in some calculations, but it is not
a good summary statistic.
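
To make the definition concrete, the following sketch computes the
variance and standard deviation directly from the formula with $n$ in
the denominator, using the pumpkin weights from the previous section.
The function {\tt variance} is a hypothetical helper written for this
illustration; it is not part of {\tt thinkstats2}.

```python
import math

def variance(xs):
    """Mean squared deviation from the mean (divides by n, not n-1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

# Pumpkin weights in pounds, from the example in the previous section.
weights = [1, 1, 1, 3, 3, 591]

var = variance(weights)   # units: pounds squared
std = math.sqrt(var)      # units: pounds
print(var, std)
```

The standard deviation, about 220 pounds, is in the same units as the
data, which makes it easier to interpret than the variance.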


\section{Effect size}
\index{effect size}

An {\bf effect size} is a summary statistic intended to describe (wait
for it) the size of an effect.  For example, to describe the
difference between two groups, one obvious choice is the difference in
the means.  \index{effect size}

Mean pregnancy length for first babies is 38.601; for
other babies it is 38.523.  The difference is 0.078 weeks, which works
out to 13 hours.  As a fraction of the typical pregnancy length, this
difference is about 0.2\%.
\index{pregnancy length}
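
The arithmetic behind these figures, using the rounded means quoted
above, 168 hours per week, and 38.6 weeks as the typical pregnancy
length, is
%
\[ 38.601 - 38.523 = 0.078~\mathrm{weeks}, \qquad
   0.078 \times 168 \approx 13.1~\mathrm{hours}, \qquad
   \frac{0.078}{38.6} \approx 0.002 = 0.2\% \]
%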

If we assume this estimate is accurate, such a difference
would have no practical consequences.  In fact, without
observing a large number of pregnancies, it is unlikely that anyone
would notice this difference at all.
\index{effect size}

Another way to convey the size of the effect is to compare the
difference between groups to the variability within groups.
Cohen's $d$ is a statistic intended to do that; it is defined
%
\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s}  \]
%
where $\bar{x}_1$ and $\bar{x}_2$ are the means of the groups and
$s$ is the ``pooled standard deviation.''  Here's the Python
code that computes Cohen's $d$:
\index{standard deviation!pooled}

\begin{verbatim}
import math

def CohenEffectSize(group1, group2):
    # difference between the group means
    diff = group1.mean() - group2.mean()

    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)

    # pooled variance: the group variances weighted by group size
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d
\end{verbatim}

In this example, the difference in means is 0.029 standard deviations,
which is small.  To put that in perspective, the difference in
height between men and women is about 1.7 standard deviations (see
\url{https://en.wikipedia.org/wiki/Effect_size}).
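
To see how the pieces of Cohen's $d$ fit together, here is a
self-contained sketch that applies the same calculation to two small
samples.  It uses plain lists instead of pandas Series, and its
variance divides by $n$.  The helper names and the sample values are
hypothetical, chosen only so the arithmetic is easy to follow; they
are not NSFG data.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Mean squared deviation from the mean (divides by n)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cohen_effect_size(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    diff = mean(group1) - mean(group2)
    n1, n2 = len(group1), len(group2)
    # Pooled variance: the group variances weighted by group size.
    pooled_var = (n1 * variance(group1) + n2 * variance(group2)) / (n1 + n2)
    return diff / math.sqrt(pooled_var)

# Made-up samples, for illustration only.
group1 = [2, 4, 6]
group2 = [1, 3, 5]

d = cohen_effect_size(group1, group2)
print(round(d, 3))
```

Both groups have the same spread, and their means differ by 1, so $d$
is 1 divided by the common standard deviation.  Swapping the two
groups changes the sign of $d$ but not its magnitude.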


\section{Reporting results}

We have seen several ways to describe the difference in pregnancy
length (if there is one) between first babies and others.  How should
we report these results?
\index{pregnancy length}

The answer depends on who is asking the question.  A scientist might
be interested in any (real) effect, no matter how small.  A doctor
might only care about effects that are {\bf clinically significant};
that is, differences that affect treatment decisions.  A pregnant
woman might be interested in results that are relevant to her, like
the probability of delivering early or late.
\index{clinically significant} \index{significant}

How you report results also depends on your goals.  If you are trying
to demonstrate the importance of an effect, you might choose summary
statistics that emphasize differences.  If you are trying to reassure
a patient, you might choose statistics that put the differences in
context.

Of course your decisions should also be guided by professional ethics.
It's ok to be persuasive; you {\em should\/} design statistical reports
and visualizations that tell a story clearly.  But you should also do
your best to make your reports honest, and to acknowledge uncertainty
and limitations.
\index{ethics}


\section{Exercises}

\begin{exercise}
Based on the results in this chapter, suppose you were asked to
summarize what you learned about whether first babies arrive late.

Which summary statistics would you use if you wanted to get a story
on the evening news?  Which ones would you use if you wanted to
reassure an anxious patient?
\index{Adams, Cecil}
\index{Straight Dope, The}

Finally, imagine that you are Cecil Adams, author of {\it The Straight
  Dope\/} (\url{http://straightdope.com}), and your job is to answer the
question, ``Do first babies arrive late?''  Write a paragraph that
uses the results in this chapter to answer the question clearly,
precisely, and honestly.
\index{ethics}

\end{exercise}

\begin{exercise}
In the repository you downloaded, you should find a file named
\verb"chap02ex.ipynb"; open it.  Some cells are already filled in, and
you should execute them.  Other cells give you instructions for
exercises.  Follow the instructions and fill in the answers.

A solution to this exercise is in \verb"chap02soln.ipynb".
\end{exercise}

In the repository you downloaded, you should find a file named
\verb"chap02ex.py"; you can use this file as a starting place
for the following exercises.
My solution is in \verb"chap02soln.py".

\begin{exercise}
The mode of a distribution is the most frequent value; see
\url{http://wikipedia.org/wiki/Mode_(statistics)}.  Write a function
called {\tt Mode} that takes a Hist and returns the most
frequent value.\index{mode}
\index{Hist}

As a more challenging exercise, write a function called {\tt AllModes}
that returns a list of value-frequency pairs in descending order of
frequency.
\index{frequency}
\end{exercise}

\begin{exercise}
Using the variable \verb"totalwgt_lb", investigate whether first
babies are lighter or heavier than others.  Compute Cohen's $d$
to quantify the difference between the groups.  How does it
compare to the difference in pregnancy length?
\index{pregnancy length}
\end{exercise}


\section{Glossary}

\begin{itemize}

\item distribution: The values that appear in a sample
and the frequency of each.
\index{distribution}

\item histogram: A mapping from values to frequencies, or a graph
that shows this mapping.
\index{histogram}

\item frequency: The number of times a value appears in a sample.
\index{frequency}

\item mode: The most frequent value in a sample, or one of the
most frequent values.
\index{mode}

\item normal distribution: An idealization of a bell-shaped distribution;
also known as a Gaussian distribution.
\index{Gaussian distribution}
\index{normal distribution}

\item uniform distribution: A distribution in which all values have
the same frequency.
\index{uniform distribution}

\item tail: The part of a distribution at the high and low extremes.
\index{tail}

\item central tendency: A characteristic of a sample or population;
intuitively, it is an average or typical value.
\index{central tendency}

\item outlier: A value far from the central tendency.
\index{outlier}

\item spread: A measure of how spread out the values in a distribution
are.
\index{spread}

\item summary statistic: A statistic that quantifies some aspect
of a distribution, like central tendency or spread.
\index{summary statistic}

\item variance: A summary statistic often used to quantify spread.
\index{variance}

\item standard deviation: The square root of variance, also used
as a measure of spread.
\index{standard deviation}

\item effect size: A summary statistic intended to quantify the size
of an effect like a difference between groups.
\index{effect size}

\item clinically significant: A result, like a difference between groups,
that is relevant in practice.
\index{clinically significant}

\end{itemize}




\chapter{Probability mass functions}
\index{probability mass function}

The code for this chapter is in {\tt probability.py}.
For information about downloading and
working with this code, see Section~\ref{code}.


\section{Pmfs}
\index{Pmf}

Another way to represent a distribution is a {\bf probability mass
  function} (PMF), which maps from each value to its probability.  A
{\bf probability} is a frequency expressed as a fraction of the sample
size, {\tt n}.  To get from frequencies to probabilities, we divide
through by {\tt n}, which is called {\bf normalization}.
\index{frequency}
\index{probability}
\index{normalization}
\index{PMF}
\index{probability mass function}

Given a Hist, we can make a dictionary that maps from each
value to its probability: \index{Hist}
%
\begin{verbatim}
n = hist.Total()
d = {}
for x, freq in hist.Items():
    d[x] = freq / n
\end{verbatim}
%
Or we can use the Pmf class provided by {\tt thinkstats2}.
Like Hist, the Pmf constructor can take a list, pandas
Series, dictionary, Hist, or another Pmf object.  Here's an example
with a simple list:
%
\begin{verbatim}
>>> import thinkstats2
>>> pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
>>> pmf
Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})
\end{verbatim}

The Pmf is normalized so total probability is 1.
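
To show what normalization does, here is a minimal sketch that builds
the same mapping with {\tt collections.Counter} from the standard
library.  It is an illustration only, not how the Pmf class is
implemented, and the function name is hypothetical.

```python
from collections import Counter

def make_pmf(values):
    """Map each distinct value to its frequency divided by n."""
    counts = Counter(values)
    n = len(values)
    return {x: freq / n for x, freq in counts.items()}

pmf = make_pmf([1, 2, 2, 3, 5])
print(pmf)

# The probabilities should add up to 1, within floating-point error.
print(sum(pmf.values()))
```

The value 2 appears twice among the five values, so its probability is
twice that of each of the others.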

Pmf and Hist objects are similar in many ways; in fact, they inherit
many of their methods from a common parent class.  For example, the
methods {\tt Values} and {\tt Items} work the same way for both.  The
biggest difference is that a Hist maps from values to integer
counters; a Pmf maps from values to floating-point probabilities.
\index{Hist}

To look up the probability associated with a value, use {\tt Prob}:
%
\begin{verbatim}
>>> pmf.Prob(2)
0.4
\end{verbatim}

The bracket operator is equivalent:
\index{bracket operator}

\begin{verbatim}
>>> pmf[2]
0.4
\end{verbatim}

You can modify an existing Pmf by incrementing the probability
associated with a value:
%
\begin{verbatim}
>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6
\end{verbatim}

Or you can multiply a probability by a factor:
%
\begin{verbatim}
>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3
\end{verbatim}

If you modify a Pmf, the result may not be normalized; that is, the
probabilities may no longer add up to 1.  To check, you can call {\tt
  Total}, which returns the sum of the probabilities:
%
\begin{verbatim}
>>> pmf.Total()
0.9
\end{verbatim}

To renormalize, call {\tt Normalize}:
%
\begin{verbatim}
>>> pmf.Normalize()
>>> pmf.Total()
1.0
\end{verbatim}
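
The same sequence of operations can be sketched with a plain
dictionary, which makes it easy to see what happens to the total.
This assumes only that a PMF is a mapping from values to
probabilities; the function {\tt normalize} is a hypothetical helper,
not the {\tt thinkstats2} method.

```python
def normalize(pmf):
    """Divide every probability by the total so they add up to 1."""
    total = sum(pmf.values())
    return {x: p / total for x, p in pmf.items()}

pmf = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}

pmf[2] += 0.2    # the effect of Incr(2, 0.2)
pmf[2] *= 0.5    # the effect of Mult(2, 0.5)
total_before = sum(pmf.values())

pmf = normalize(pmf)
total_after = sum(pmf.values())
print(total_before, total_after, pmf[2])
```

After normalizing, the probability of 2 is 0.3 divided by 0.9, or
about 0.33, because every probability is scaled by the same factor.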

Pmf objects provide a {\tt Copy} method so you can make
and modify a copy without affecting the original.
\index{Pmf}

My notation in this section might seem inconsistent, but there is a
system: I use Pmf for the name of the class, {\tt pmf} for an instance
of the class, and PMF for the mathematical concept of a
probability mass function.


\section{Plotting PMFs}
\index{PMF}

{\tt thinkplot} provides two ways to plot Pmfs:
\index{thinkplot}

\begin{itemize}

\item To plot a Pmf as a bar graph, you can use
{\tt thinkplot.Hist}.  Bar graphs are most useful if the number
of values in the Pmf is small.
\index{bar plot}
\index{plot!bar}

\item To plot a Pmf as a step function, you can use
{\tt thinkplot.Pmf}.  This option is most useful if there are
a large number of values and the Pmf is smooth.  This function
also works with Hist objects.
\index{line plot}
\index{plot!line}
\index{Hist}
\index{Pmf}

\end{itemize}

In addition, {\tt pyplot} provides a function called {\tt hist} that
takes a sequence of values, computes a histogram, and plots it.
Since I use Hist objects, I usually don't use {\tt pyplot.hist}.
\index{pyplot}

\begin{figure}
% probability.py
\centerline{\includegraphics[height=3.0in]{figs/probability_nsfg_pmf.pdf}}
\caption{PMF of pregnancy lengths for first babies and others, using
  bar graphs and step functions.}
\label{probability_nsfg_pmf}
\end{figure}
\index{pregnancy length}
\index{length!pregnancy}

Figure~\ref{probability_nsfg_pmf} shows PMFs of pregnancy length for
first babies and others using bar graphs (left) and step functions
(right).
\index{pregnancy length}

By plotting the PMF instead of the histogram, we can compare the two
distributions without being misled by the difference in sample
size.  Based on this figure, first babies seem to be less likely than
others to arrive on time (week 39) and more likely to be late (weeks
41 and 42).

Here's the code that generates Figure~\ref{probability_nsfg_pmf}:

\begin{verbatim}
    thinkplot.PrePlot(2, cols=2)
    thinkplot.Hist(first_pmf, align='right', width=width)
    thinkplot.Hist(other_pmf, align='left', width=width)
    thinkplot.Config(xlabel='weeks',
                     ylabel='probability',
                     axis=[27, 46, 0, 0.6])

    thinkplot.PrePlot(2)
    thinkplot.SubPlot(2)
    thinkplot.Pmfs([first_pmf, other_pmf])
    thinkplot.Show(xlabel='weeks',
                   axis=[27, 46, 0, 0.6])
\end{verbatim}

{\tt PrePlot} takes optional parameters {\tt rows} and {\tt cols}
to make a grid of figures, in this case one row of two figures.
The first figure (on the left) displays the Pmfs using {\tt thinkplot.Hist},
as we have seen before.
\index{thinkplot}
\index{Hist}

The second call to {\tt PrePlot} resets the color generator.  Then
{\tt SubPlot} switches to the second figure (on the right) and
displays the Pmfs using {\tt thinkplot.Pmfs}.  I used the {\tt axis} option
to ensure that the two figures are on the same axes, which is
generally a good idea if you intend to compare two figures.


\section{Other visualizations}
\label{visualization}

Histograms and PMFs are useful while you are exploring data and
trying to identify patterns and relationships.
Once you have an idea what is going on, a good next step is to
design a visualization that makes the patterns you have identified
as clear as possible.
\index{exploratory data analysis}
\index{visualization}

In the NSFG data, the biggest differences in the distributions are
near the mode.  So it makes sense to zoom in on that part of the
graph, and to transform the data to emphasize differences:
\index{National Survey of Family Growth}
\index{NSFG}

\begin{verbatim}
    weeks = range(35, 46)
    diffs = []
    for week in weeks:
        p1 = first_pmf.Prob(week)
        p2 = other_pmf.Prob(week)
        diff = 100 * (p1 - p2)
        diffs.append(diff)

    thinkplot.Bar(weeks, diffs)
\end{verbatim}

In this code, {\tt weeks} is the range of weeks; {\tt diffs} is the
difference between the two PMFs in percentage points.
Figure~\ref{probability_nsfg_diffs} shows the result as a bar chart.
This figure makes the pattern clearer: first babies are less likely to
be born in week 39, and somewhat more likely to be born in weeks 41
and 42.
\index{thinkplot}

\begin{figure}
% probability.py
\centerline{\includegraphics[height=2.5in]{figs/probability_nsfg_diffs.pdf}}
\caption{Difference, in percentage points, by week.}
\label{probability_nsfg_diffs}
\end{figure}

For now we should hold this conclusion only tentatively.
We used the same dataset to identify an
apparent difference and then chose a visualization that makes the
difference apparent.  We can't be sure this effect is real;
it might be due to random variation.  We'll address this concern
later.


\section{The class size paradox}
\index{class size}

Before we go on, I want to demonstrate
one kind of computation you can do with Pmf objects; I call
this example the ``class size paradox.''
\index{Pmf}

At many American colleges and universities, the student-to-faculty
ratio is about 10:1.  But students are often surprised to discover
that their average class size is bigger than 10.  There
are two reasons for the discrepancy:

\begin{itemize}

\item Students typically take 4--5 classes per semester, but
professors often teach 1 or 2.

\item The number of students who enjoy a small class is small,
but the number of students in a large class is (ahem!) large.

\end{itemize}

The first effect is obvious, at least once it is pointed out;
the second is more subtle.  Let's look at an example.  Suppose
that a college offers 65 classes in a given semester, with the
following distribution of sizes:
%
\begin{verbatim}
 size      count
 5- 9          8
10-14          8
15-19         14
20-24          4
25-29          6
30-34         12
35-39          8
40-44          3
45-49          2
\end{verbatim}

If you ask the Dean for the average class size, he would
construct a PMF, compute the mean, and report that the
average class size is 23.7.  Here's the code:

\begin{verbatim}
    d = { 7: 8, 12: 8, 17: 14, 22: 4,
          27: 6, 32: 12, 37: 8, 42: 3, 47: 2 }

    pmf = thinkstats2.Pmf(d, label='actual')
    print('mean', pmf.Mean())
\end{verbatim}

But if you survey a group of students, ask them how many
students are in their classes, and compute the mean, you would
think the average class was bigger.  Let's see how
much bigger.

First, I compute the
distribution as observed by students, where the probability
associated with each class size is ``biased'' by the number
of students in the class.
\index{observer bias}
\index{bias!observer}

\begin{verbatim}
def BiasPmf(pmf, label):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
\end{verbatim}

For each class size, {\tt x}, we multiply the probability by
{\tt x}, the number of students who observe that class size.
The result is a new Pmf that represents the biased distribution.

Now we can plot the actual and observed distributions:
\index{thinkplot}

\begin{verbatim}
    biased_pmf = BiasPmf(pmf, label='observed')
    thinkplot.PrePlot(2)
    thinkplot.Pmfs([pmf, biased_pmf])
    thinkplot.Show(xlabel='class size', ylabel='PMF')
\end{verbatim}

\begin{figure}
% probability.py
\centerline{\includegraphics[height=3.0in]{figs/class_size1.pdf}}
\caption{Distribution of class sizes, actual and as observed by students.}
\label{class_size1}
\end{figure}

Figure~\ref{class_size1} shows the result.  In the biased distribution
there are fewer small classes and more large ones.
The mean of the biased distribution is 29.1, about 23\% higher
than the actual mean.
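
Both means can be checked with a short sketch that uses plain
dictionaries rather than Pmf objects.  The helper names are
hypothetical; the class counts are the ones in the table above, with
each range represented by its midpoint.

```python
# Number of classes of each size (midpoint of each range).
counts = {7: 8, 12: 8, 17: 14, 22: 4,
          27: 6, 32: 12, 37: 8, 42: 3, 47: 2}

def normalize(d):
    """Scale the values of d so they add up to 1."""
    total = sum(d.values())
    return {x: v / total for x, v in d.items()}

def pmf_mean(pmf):
    """Mean of a PMF: the sum of each value times its probability."""
    return sum(p * x for x, p in pmf.items())

def bias(pmf):
    """Weight each class size x by x, the number of students who see it."""
    return normalize({x: p * x for x, p in pmf.items()})

actual = normalize(counts)
observed = bias(actual)

print(round(pmf_mean(actual), 1))
print(round(pmf_mean(observed), 1))
```

The biased mean is the sum of $p_i x_i^2$ divided by the sum of
$p_i x_i$, so large classes count twice: once because they are large,
and once because more students are there to report them.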

It is also possible to invert this operation.  Suppose you want to
find the distribution of class sizes at a college, but you can't get
reliable data from the Dean.  An alternative is to choose a random
sample of students and ask how many students are in their
classes.  \index{bias!oversampling} \index{oversampling}

The result would be biased for the reasons we've just seen, but you
can use it to estimate the actual distribution.  Here's the function
that unbiases a Pmf:

\begin{verbatim}
def UnbiasPmf(pmf, label):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf.Mult(x, 1.0/x)

    new_pmf.Normalize()
    return new_pmf
\end{verbatim}

It's similar to {\tt BiasPmf}; the only difference is that it
divides each probability by {\tt x} instead of multiplying.
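
One way to convince yourself that the two operations are inverses is
to bias a distribution and then unbias it.  The sketch below does that
with plain dictionaries.  The helper names are hypothetical, and the
sketch assumes no class has size zero, since that would mean dividing
by zero.

```python
def normalize(d):
    """Scale the values of d so they add up to 1."""
    total = sum(d.values())
    return {x: v / total for x, v in d.items()}

def bias(pmf):
    """Multiply each probability by x, then renormalize."""
    return normalize({x: p * x for x, p in pmf.items()})

def unbias(pmf):
    """Divide each probability by x, then renormalize."""
    return normalize({x: p / x for x, p in pmf.items()})

# Class counts from the table earlier in this section.
counts = {7: 8, 12: 8, 17: 14, 22: 4,
          27: 6, 32: 12, 37: 8, 42: 3, 47: 2}
actual = normalize(counts)

recovered = unbias(bias(actual))

# Largest difference between the original and recovered probabilities.
max_error = max(abs(recovered[x] - actual[x]) for x in actual)
print(max_error < 1e-12)
```

The round trip recovers the original probabilities up to
floating-point error.  With real survey data the recovery is only an
estimate, because the sample of students is itself subject to random
variation.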


\section{DataFrame indexing}

In Section~\ref{dataframe} we read a pandas DataFrame and used it to
select and modify data columns.  Now let's look at row selection.
To start, I create a NumPy array of random numbers and use it
to initialize a DataFrame:
\index{NumPy}
\index{pandas}
\index{DataFrame}

\begin{verbatim}
>>> import numpy as np
>>> import pandas
>>> array = np.random.randn(4, 2)
>>> df = pandas.DataFrame(array)
>>> df
          0         1
0 -0.143510  0.616050
1 -1.489647  0.300774
2 -0.074350  0.039621
3 -1.369968  0.545897
\end{verbatim}

By default, the rows and columns are numbered starting at zero, but
you can provide column names:

\begin{verbatim}
>>> columns = ['A', 'B']
>>> df = pandas.DataFrame(array, columns=columns)
>>> df
          A         B
0 -0.143510  0.616050
1 -1.489647  0.300774
2 -0.074350  0.039621
3 -1.369968  0.545897
\end{verbatim}

You can also provide row names.  The set of row names is called the
{\bf index}; the row names themselves are called {\bf labels}.

\begin{verbatim}
>>> index = ['a', 'b', 'c', 'd']
>>> df = pandas.DataFrame(array, columns=columns, index=index)
>>> df
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
c -0.074350  0.039621
d -1.369968  0.545897
\end{verbatim}

As we saw in the previous chapter, simple indexing selects a
column, returning a Series:
\index{Series}

\begin{verbatim}
>>> df['A']
a   -0.143510
b   -1.489647
c   -0.074350
d   -1.369968
Name: A, dtype: float64
\end{verbatim}

To select a row by label, you can use the {\tt loc} attribute, which
returns a Series:

\begin{verbatim}
>>> df.loc['a']
A   -0.14351
B    0.61605
Name: a, dtype: float64
\end{verbatim}

If you know the integer position of a row, rather than its label, you
can use the {\tt iloc} attribute, which also returns a Series.

\begin{verbatim}
>>> df.iloc[0]
A   -0.14351
B    0.61605
Name: a, dtype: float64
\end{verbatim}

{\tt loc} can also take a list of labels; in that case,
the result is a DataFrame.

\begin{verbatim}
>>> indices = ['a', 'c']
>>> df.loc[indices]
         A         B
a -0.14351  0.616050
c -0.07435  0.039621
\end{verbatim}

Finally, you can use a slice to select a range of rows by label:

\begin{verbatim}
>>> df['a':'c']
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
c -0.074350  0.039621
\end{verbatim}

Or by integer position:

\begin{verbatim}
>>> df[0:2]
          A         B
a -0.143510  0.616050
b -1.489647  0.300774
\end{verbatim}

The result in either case is a DataFrame, but notice that the first
result includes the end of the slice; the second doesn't.
\index{DataFrame}

My advice: if your rows have labels that are not simple integers, use
the labels consistently and avoid using integer positions.
2654

2655

2656

2657
\section{Exercises}
2658

2659
Solutions to these exercises are in \verb"chap03soln.ipynb"
2660
and \verb"chap03soln.py"
2661

2662
\begin{exercise}
2663
Something like the class size paradox appears if you survey children
2664
and ask how many children are in their family.  Families with many
2665
children are more likely to appear in your sample, and
2666
families with no children have no chance to be in the sample.
2667
\index{observer bias}
2668
\index{bias!observer}
2669

2670
Use the NSFG respondent variable \verb"NUMKDHH" to construct the actual
2671
distribution for the number of children under 18 in the household.
2672

2673
Now compute the biased distribution we would see if we surveyed the
2674
children and asked them how many children under 18 (including themselves)
2675
are in their household.  
2676

2677
Plot the actual and biased distributions, and compute their means.
2678
As a starting place, you can use \verb"chap03ex.ipynb".
2679
\end{exercise}
2680

2681

\begin{exercise}
\index{mean}
\index{variance}
\index{PMF}

In Section~\ref{mean} we computed the mean of a sample by adding up
the elements and dividing by $n$.  If you are given a PMF, you can
still compute the mean, but the process is slightly different:
%
\[ \xbar = \sum_i p_i~x_i \]
%
where the $x_i$ are the unique values in the PMF and
$p_i=\mathrm{PMF}(x_i)$.
Similarly, you can compute variance like this:
%
\[ S^2 = \sum_i p_i~(x_i - \xbar)^2\]
%
Write functions called {\tt PmfMean} and {\tt PmfVar} that take a
Pmf object and compute the mean and variance.  To test these functions,
check that they are consistent with the methods {\tt Mean} and {\tt
  Var} provided by Pmf.
\index{Pmf}

\end{exercise}


\begin{exercise}
I started with the question, ``Are first babies more likely
to be late?''  To address it, I computed the difference in
means between groups of babies, but I ignored the possibility
that there might be a difference between first babies and
others {\em for the same woman}.

To address this version of the question, select respondents who
have at least two babies and compute pairwise differences.  Does
this formulation of the question yield a different result?

Hint: use {\tt nsfg.MakePregMap}.
\end{exercise}


\begin{exercise}
\label{relay}

In most foot races, everyone starts at the same time.  If you are a
fast runner, you usually pass a lot of people at the beginning of the
race, but after a few miles everyone around you is going at the same
speed.
\index{relay race}

When I ran a long-distance (209 miles) relay race for the first
time, I noticed an odd phenomenon: when I overtook another runner, I
was usually much faster, and when another runner overtook me, he was
usually much faster.

At first I thought that the distribution of speeds might be bimodal;
that is, there were many slow runners and many fast runners, but few
at my speed.

Then I realized that I was the victim of a bias similar to the
effect of class size.  The race
was unusual in two ways: it used a staggered start, so teams started
at different times; also, many teams included runners at different
levels of ability. \index{bias!selection} \index{selection bias}

As a result, runners were spread out along the course with little
relationship between speed and location.  When I joined the race, the
runners near me were (pretty much) a random sample of the runners in
the race.

So where does the bias come from?  During my time on the course, the
chance of overtaking a runner, or being overtaken, is proportional to
the difference in our speeds.  I am more likely to catch a slow
runner, and more likely to be caught by a fast runner.  But runners
at the same speed are unlikely to see each other.

Write a function called {\tt ObservedPmf} that takes a Pmf representing
the actual distribution of runners' speeds, and the speed of a running
observer, and returns a new Pmf representing the distribution of
runners' speeds as seen by the observer.
\index{observer bias}
\index{bias!observer}

To test your function, you can use {\tt relay.py}, which reads the
results from the James Joyce Ramble 10K in Dedham MA and converts the
pace of each runner to mph.

Compute the distribution of speeds you would observe if you ran a
relay race at 7.5 mph with this group of runners.  A solution to this
exercise is in \verb"relay_soln.py".
\end{exercise}


\section{Glossary}

\begin{itemize}

\item probability mass function (PMF): A representation of a distribution
as a function that maps from values to probabilities.
\index{PMF}
\index{probability mass function}

\item probability: A frequency expressed as a fraction of the sample
size.
\index{frequency}
\index{probability}

\item normalization: The process of dividing a frequency by a sample
size to get a probability.
\index{normalization}

\item index: In a pandas DataFrame, the index is a special column
that contains the row labels.
\index{pandas}
\index{DataFrame}

\end{itemize}


\chapter{Cumulative distribution functions}
\label{cumulative}

The code for this chapter is in {\tt cumulative.py}.
For information about downloading and
working with this code, see Section~\ref{code}.


\section{The limits of PMFs}
\index{PMF}

PMFs work well if the number of values is small.  But as the number of
values increases, the probability associated with each value gets
smaller and the effect of random noise increases.

For example, we might be interested in the distribution of birth
weights.  In the NSFG data, the variable \verb"totalwgt_lb" records
weight at birth in pounds.  Figure~\ref{nsfg_birthwgt_pmf} shows
the PMF of these values for first babies and others.
\index{National Survey of Family Growth} \index{NSFG} \index{birth weight}
\index{weight!birth}

\begin{figure}
% cumulative.py
\centerline{\includegraphics[height=2.5in]{figs/nsfg_birthwgt_pmf.pdf}}
\caption{PMF of birth weights.  This figure shows a limitation
of PMFs: they are hard to compare visually.}
\label{nsfg_birthwgt_pmf}
\end{figure}

Overall, these distributions resemble the bell shape of a normal
distribution, with many values near the mean and a few values much
higher and lower.

But parts of this figure are hard to interpret.  There are many spikes
and valleys, and some apparent differences between the distributions.
It is hard to tell which of these features are meaningful.  Also, it
is hard to see overall patterns; for example, which distribution do
you think has the higher mean?
\index{binning}

These problems can be mitigated by binning the data; that is, dividing
the range of values into non-overlapping intervals and counting the
number of values in each bin.  Binning can be useful, but it is tricky
to get the size of the bins right.  If they are big enough to smooth
out noise, they might also smooth out useful information.
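
As a rough illustration of that trade-off, the following sketch counts
values in fixed-width bins using only the Python standard library.  The
function name {\tt BinCounts}, the two bin widths, and the list of
weights are made up for this example; none of them come from {\tt
  thinkstats2} or the NSFG data.

\begin{verbatim}
import math
from collections import Counter

def BinCounts(values, width):
    """Maps the left edge of each bin to the number of values in it.

    Each bin covers the half-open interval [edge, edge + width).
    """
    counts = Counter()
    for x in values:
        left_edge = math.floor(x / width) * width
        counts[left_edge] += 1
    return dict(counts)

# hypothetical birth weights in pounds, for illustration only
weights = [6.1, 6.4, 6.9, 7.0, 7.2, 7.3, 7.8, 8.1, 8.4, 9.6]

narrow = BinCounts(weights, 0.5)   # six bins, still fairly noisy
wide = BinCounts(weights, 2.0)     # two bins, most detail is gone
\end{verbatim}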

An alternative that avoids these problems is the cumulative
distribution function (CDF), which is the subject of this chapter.
But before I can explain CDFs, I have to explain percentiles.
\index{CDF}


\section{Percentiles}
\index{percentile rank}

If you have taken a standardized test, you probably got your
results in the form of a raw score and a {\bf percentile rank}.
In this context, the percentile rank is the percentage of people who
scored the same as you or lower.  So if you are ``in the 90th
percentile,'' you did as well as or better than 90\% of the people who
took the exam.

Here's how you could compute the percentile rank of a value,
\verb"your_score", relative to the values in the sequence {\tt
  scores}:
%
\begin{verbatim}
def PercentileRank(scores, your_score):
    # count the scores that are less than or equal to your_score
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1

    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank
\end{verbatim}

As an example, if the
scores in the sequence were 55, 66, 77, 88 and 99, and you got the 88,
then your percentile rank would be {\tt 100 * 4 / 5} which is 80.

If you are given a value, it is easy to find its percentile rank; going
the other way is slightly harder.  If you are given a percentile rank
and you want to find the corresponding value, one option is to
sort the values and search for the one you want:
%
\begin{verbatim}
def Percentile(scores, percentile_rank):
    scores.sort()    # note: sorts the caller's list in place
    for score in scores:
        # return the smallest score whose rank is high enough
        if PercentileRank(scores, score) >= percentile_rank:
            return score
\end{verbatim}

The result of this calculation is a {\bf percentile}.  For example,
the 50th percentile is the value with percentile rank 50.  In the
distribution of exam scores, the 50th percentile is 77.
\index{percentile}

This implementation of {\tt Percentile} is not efficient.  A
better approach is to use the percentile rank to compute the index of
the corresponding percentile:

\begin{verbatim}
def Percentile2(scores, percentile_rank):
    scores.sort()
    # int() keeps the index usable when percentile_rank is a float
    index = int(percentile_rank * (len(scores)-1) // 100)
    return scores[index]
\end{verbatim}

The difference between ``percentile'' and ``percentile rank'' can
be confusing, and people do not always use the terms precisely.
To summarize, {\tt PercentileRank} takes a value and computes
its percentile rank in a set of values; {\tt Percentile} takes
a percentile rank and computes the corresponding value.
\index{percentile rank}
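
As a quick check of the two directions, the following sketch applies
both functions to the exam scores from this section.  It repeats the
definitions so that it runs on its own.  The one difference, which is
a choice made for this example, is that {\tt Percentile} loops over a
sorted copy instead of sorting the caller's list in place.

\begin{verbatim}
def PercentileRank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    return 100.0 * count / len(scores)

def Percentile(scores, percentile_rank):
    """Smallest score whose percentile rank is at least percentile_rank."""
    for score in sorted(scores):
        if PercentileRank(scores, score) >= percentile_rank:
            return score

scores = [55, 66, 77, 88, 99]
rank = PercentileRank(scores, 88)   # value -> percentile rank: 80.0
value = Percentile(scores, 50)      # percentile rank -> value: 77
\end{verbatim}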


\section{CDFs}
\index{CDF}

Now that we understand percentiles and percentile ranks,
we are ready to tackle the {\bf cumulative distribution function}
(CDF).  The CDF is the function that maps from a value to its
percentile rank.
\index{cumulative distribution function}
\index{percentile rank}

The CDF is a function of $x$, where $x$ is any value that might appear
in the distribution.  To evaluate $\CDF(x)$ for a particular value of
$x$, we compute the fraction of values in the distribution less
than or equal to $x$.

Here's what that looks like as a function that takes a sequence,
{\tt sample}, and a value, {\tt x}:
%
\begin{verbatim}
def EvalCdf(sample, x):
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1

    prob = count / len(sample)
    return prob
\end{verbatim}

This function is almost identical to {\tt PercentileRank}, except that
the result is a probability in the range 0--1 rather than a
percentile rank in the range 0--100.
\index{sample}

As an example, suppose we collect a sample with the values
{\tt [1, 2, 2, 3, 5]}.  Here are some values from its CDF:
%
\[ \CDF(0) = 0 \]
%
\[ \CDF(1) = 0.2\]
%
\[ \CDF(2) = 0.6\]
%
\[ \CDF(3) = 0.8\]
%
\[ \CDF(4) = 0.8\]
%
\[ \CDF(5) = 1\]
%
We can evaluate the CDF for any value of $x$, not just
values that appear in the sample.
If $x$ is less than the smallest value in the sample, $\CDF(x)$ is 0.
If $x$ is greater than the largest value, $\CDF(x)$ is 1.
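
These values can be reproduced with {\tt EvalCdf}.  The sketch below
repeats the function so that it is self-contained, evaluates it at the
integers 0 through 5, and then at two values that do not appear in the
sample.  The variable names are chosen for this example.

\begin{verbatim}
def EvalCdf(sample, x):
    """Fraction of values in sample that are less than or equal to x."""
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1
    return count / len(sample)

sample = [1, 2, 2, 3, 5]

# CDF at x = 0, 1, 2, 3, 4, 5
cdf_values = [EvalCdf(sample, x) for x in range(6)]
# cdf_values is [0.0, 0.2, 0.6, 0.8, 0.8, 1.0]

between = EvalCdf(sample, 2.5)    # same as CDF(2), because no value is in (2, 2.5]
above = EvalCdf(sample, 100)      # 1.0, because every value is <= 100
\end{verbatim}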

\begin{figure}
% cumulative.py
\centerline{\includegraphics[height=2.5in]{figs/cumulative_example_cdf.pdf}}
\caption{Example of a CDF.}
\label{example_cdf}
\end{figure}

Figure~\ref{example_cdf} is a graphical representation of this CDF.
The CDF of a sample is a step function.
\index{step function}


\section{Representing CDFs}
\index{Cdf}

{\tt thinkstats2} provides a class named Cdf that represents
CDFs.  The fundamental methods Cdf provides are:

\begin{itemize}

\item {\tt Prob(x)}: Given a value {\tt x}, computes the probability
  $p = \CDF(x)$.  The bracket operator is equivalent to {\tt Prob}.
\index{bracket operator}

\item {\tt Value(p)}: Given a probability {\tt p}, computes the
corresponding value, {\tt x}; that is, the {\bf inverse CDF} of {\tt p}.
\index{inverse CDF}
\index{CDF, inverse}

\end{itemize}
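
To make the two methods concrete, here is a minimal sketch of a class
with the same two operations.  It is not the {\tt thinkstats2}
implementation: the class name {\tt SimpleCdf}, the use of the
standard {\tt bisect} module, and the convention that {\tt Value}
returns the smallest value whose cumulative probability is at least
{\tt p} are all assumptions made for illustration.

\begin{verbatim}
import bisect

class SimpleCdf:
    """Illustrative empirical CDF; not the thinkstats2 Cdf class."""

    def __init__(self, values):
        self.xs = sorted(values)
        n = len(self.xs)
        # ps[i] is the fraction of values at sorted positions 0..i
        self.ps = [(i + 1) / n for i in range(n)]

    def Prob(self, x):
        """Fraction of values less than or equal to x."""
        return bisect.bisect_right(self.xs, x) / len(self.xs)

    def Value(self, p):
        """Smallest value x such that Prob(x) >= p, for 0 <= p <= 1."""
        if not 0 <= p <= 1:
            raise ValueError('p must be between 0 and 1')
        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]

cdf = SimpleCdf([1, 2, 2, 3, 5])
p = cdf.Prob(2)       # 0.6
x = cdf.Value(0.5)    # 2
\end{verbatim}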

\begin{figure}
% cumulative.py
\centerline{\includegraphics[height=2.5in]{figs/cumulative_prglngth_cdf.pdf}}
\caption{CDF of pregnancy length.}
\label{cumulative_prglngth_cdf}
\end{figure}

The Cdf constructor can take as an argument a list of values,
a pandas Series, a Hist, Pmf, or another Cdf.  The following
code makes a Cdf for the distribution of pregnancy lengths in
the NSFG:
\index{NSFG}
\index{pregnancy length}

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')
\end{verbatim}

{\tt thinkplot} provides a function named {\tt Cdf} that
plots Cdfs as lines:
\index{thinkplot}

\begin{verbatim}
    thinkplot.Cdf(cdf)
    thinkplot.Show(xlabel='weeks', ylabel='CDF')
\end{verbatim}

Figure~\ref{cumulative_prglngth_cdf} shows the result.  One way to
read a CDF is to look up percentiles.  For example, it looks like
about 10\% of pregnancies are shorter than 36 weeks, and about 90\%
are shorter than 41 weeks.  The CDF also provides a visual
representation of the shape of the distribution.  Common values appear
as steep or vertical sections of the CDF; in this example, the mode at
39 weeks is apparent.  There are few values below 30 weeks, so
the CDF in this range is flat.
\index{CDF, interpreting}

It takes some time to get used to CDFs, but once you
do, I think you will find that they show more information, more
clearly, than PMFs.


\section{Comparing CDFs}
\label{birth_weights}
\index{National Survey of Family Growth}
\index{NSFG}
\index{birth weight}
\index{weight!birth}

CDFs are especially useful for comparing distributions.  For
example, here is the code that plots the CDF of birth
weight for first babies and others.
\index{thinkplot}
\index{distributions, comparing}

\begin{verbatim}
    first_cdf = thinkstats2.Cdf(firsts.totalwgt_lb, label='first')
    other_cdf = thinkstats2.Cdf(others.totalwgt_lb, label='other')

    thinkplot.PrePlot(2)
    thinkplot.Cdfs([first_cdf, other_cdf])
    thinkplot.Show(xlabel='weight (pounds)', ylabel='CDF')
\end{verbatim}

\begin{figure}
% cumulative.py
\centerline{\includegraphics[height=2.5in]{figs/cumulative_birthwgt_cdf.pdf}}
\caption{CDF of birth weights for first babies and others.}
\label{cumulative_birthwgt_cdf}
\end{figure}

Figure~\ref{cumulative_birthwgt_cdf} shows the result.
Compared to Figure~\ref{nsfg_birthwgt_pmf},
this figure makes the shape of the distributions, and the differences
between them, much clearer.  We can see that first babies are slightly
lighter throughout the distribution, with a larger discrepancy above
the mean.
\index{shape}


\section{Percentile-based statistics}
\index{summary statistic}
\index{interquartile range}
\index{quartile}
\index{percentile}
\index{median}
\index{central tendency}
\index{spread}

Once you have computed a CDF, it is easy to compute percentiles
and percentile ranks.  The Cdf class provides these two methods:
\index{Cdf}
\index{percentile rank}

\begin{itemize}

\item {\tt PercentileRank(x)}: Given a value {\tt x}, computes its
  percentile rank, $100 \cdot \CDF(x)$.

\item {\tt Percentile(p)}: Given a percentile rank {\tt p},
  computes the corresponding value, {\tt x}.  Equivalent to {\tt
    Value(p/100)}.

\end{itemize}

{\tt Percentile} can be used to compute percentile-based summary
statistics.  For example, the 50th percentile is the value that
divides the distribution in half, also known as the {\bf median}.
Like the mean, the median is a measure of the central tendency
of a distribution.

Actually, there are several definitions of ``median,'' each with
different properties.  But {\tt Percentile(50)} is simple and
efficient to compute.

Another percentile-based statistic is the {\bf interquartile range} (IQR),
which is a measure of the spread of a distribution.  The IQR
is the difference between the 75th and 25th percentiles.
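
For instance, the median and IQR of a small sample can be computed
directly from its percentiles.  This sketch uses the index-based
approach of {\tt Percentile2} from earlier in the chapter rather than
the Cdf class; the helper names {\tt Median} and {\tt IQR} and the
sample are made up for this example, and other definitions of
percentile would give slightly different answers.

\begin{verbatim}
def Percentile2(scores, percentile_rank):
    """Value at the given percentile rank, using index arithmetic."""
    scores = sorted(scores)     # sorted copy; the input is not modified
    index = int(percentile_rank * (len(scores) - 1) // 100)
    return scores[index]

def Median(values):
    """The 50th percentile."""
    return Percentile2(values, 50)

def IQR(values):
    """Difference between the 75th and 25th percentiles."""
    return Percentile2(values, 75) - Percentile2(values, 25)

sample = [1, 3, 4, 6, 7, 8, 10, 13, 15]
median = Median(sample)    # 7
iqr = IQR(sample)          # 10 - 4 = 6
\end{verbatim}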

More generally, percentiles are often used to summarize the shape
of a distribution.  For example, the distribution of income is
often reported in ``quintiles''; that is, it is split at the
20th, 40th, 60th and 80th percentiles.  Other distributions
are divided into ten ``deciles''.  Statistics like these that represent
equally-spaced points in a CDF are called {\bf quantiles}.
For more, see \url{https://en.wikipedia.org/wiki/Quantile}.
\index{quantile}
\index{quintile}
\index{decile}


\section{Random numbers}
\label{random}
\index{random number}

Suppose we choose a random sample from the population of live
births and look up the percentile rank of their birth weights.
Now suppose we compute the CDF of the percentile ranks.  What do
you think the distribution will look like?
\index{percentile rank}
\index{birth weight}
\index{weight!birth}

Here's how we can compute it.  First, we make the Cdf of
birth weights:
\index{Cdf}

\begin{verbatim}
    weights = live.totalwgt_lb
    cdf = thinkstats2.Cdf(weights, label='totalwgt_lb')
\end{verbatim}

Then we generate a sample and compute the percentile rank of
each value in the sample.

\begin{verbatim}
    sample = np.random.choice(weights, 100, replace=True)
    ranks = [cdf.PercentileRank(x) for x in sample]
\end{verbatim}

{\tt sample}
is a random sample of 100 birth weights, chosen with {\bf replacement};
that is, the same value could be chosen more than once.  {\tt ranks}
is a list of percentile ranks.
\index{replacement}

Finally we make and plot the Cdf of the percentile ranks.
\index{thinkplot}

\begin{verbatim}
    rank_cdf = thinkstats2.Cdf(ranks)
    thinkplot.Cdf(rank_cdf)
    thinkplot.Show(xlabel='percentile rank', ylabel='CDF')
\end{verbatim}

\begin{figure}
% cumulative.py
\centerline{\includegraphics[height=2.5in]{figs/cumulative_random.pdf}}
\caption{CDF of percentile ranks for a random sample of birth weights.}
\label{cumulative_random}
\end{figure}

Figure~\ref{cumulative_random} shows the result.  The CDF is
approximately a straight line, which means that the distribution
is uniform.

That outcome might be non-obvious, but it is a consequence of
the way the CDF is defined.  What this figure shows is that 10\%
of the sample is below the 10th percentile, 20\% is below the
20th percentile, and so on, exactly as we should expect.

So, regardless of the shape of the CDF, the distribution of
percentile ranks is uniform.  This property is useful, because it
is the basis of a simple and efficient algorithm for generating
random numbers with a given CDF.  Here's how:
\index{inverse CDF algorithm}
\index{random number}

\begin{itemize}

\item Choose a percentile rank uniformly from the range 0--100.

\item Use {\tt Cdf.Percentile} to find the value in the distribution
that corresponds to the percentile rank you chose.
\index{Cdf}

\end{itemize}

Cdf provides an implementation of this algorithm, called
{\tt Random}:

\begin{verbatim}
# class Cdf:
    def Random(self):
        return self.Percentile(random.uniform(0, 100))
\end{verbatim}

Cdf also provides {\tt Sample}, which takes an integer,
{\tt n}, and returns a list of {\tt n} values chosen at random
from the Cdf.
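
The same algorithm can be sketched without {\tt thinkstats2}.  In the
following, {\tt InverseCdfSample} is a hypothetical helper, not a
library function: it sorts the observed values, draws a probability
uniformly from the range 0--1, and looks up the value at the
corresponding position in the sorted list.  The seed, the sample size,
and the values are arbitrary.

\begin{verbatim}
import random

def InverseCdfSample(values, n, rng=random):
    """Draws n values by evaluating the inverse empirical CDF
    at uniformly distributed random probabilities."""
    xs = sorted(values)
    sample = []
    for _ in range(n):
        p = rng.random()                      # uniform in [0, 1)
        index = min(int(p * len(xs)), len(xs) - 1)
        sample.append(xs[index])              # value at that quantile
    return sample

rng = random.Random(17)        # fixed seed so the run is repeatable
values = [1, 2, 2, 3, 5]
sample = InverseCdfSample(values, 1000, rng)

# 2 makes up 40% of the original values, so it should make up
# roughly 40% of the generated sample
fraction_of_twos = sample.count(2) / len(sample)
\end{verbatim}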


\section{Comparing percentile ranks}

Percentile ranks are useful for comparing measurements across
different groups.  For example, people who compete in foot races are
usually grouped by age and gender.  To compare people in different
age groups, you can convert race times to percentile ranks.
\index{percentile rank}

A few years ago I ran the James Joyce Ramble 10K in
Dedham MA; I finished in 42:44, which was 97th in a field of 1633.  I beat or
tied 1537 runners out of 1633, so my percentile rank in the field is
94\%.  \index{James Joyce Ramble} \index{race time}

More generally, given position and field size, we can compute
percentile rank:
\index{field size}

\begin{verbatim}
def PositionToPercentile(position, field_size):
    # number of runners beaten or tied, counting the runner at position
    beat = field_size - position + 1
    percentile = 100.0 * beat / field_size
    return percentile
\end{verbatim}

In my age group, denoted M4049 for ``male between 40 and 49 years of
age'', I came in 26th out of 256.  So my percentile rank in my age
group was 90\%.
\index{age group}

If I am still running in 10 years (and I hope I am), I will be in
the M5059 division.  Assuming that my percentile rank in my division
is the same, how much slower should I expect to be?

I can answer that question by converting my percentile rank in M4049
to a position in M5059.  Here's the code:

\begin{verbatim}
def PercentileToPosition(percentile, field_size):
    # inverse of PositionToPercentile; the result may be fractional
    beat = percentile * field_size / 100.0
    position = field_size - beat + 1
    return position
\end{verbatim}

There were 171 people in M5059, so I would have to come in between
17th and 18th place to have the same percentile rank.  The finishing
time of the 17th runner in M5059 was 46:05, so that's the time I will
have to beat to maintain my percentile rank.
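
Putting the two functions together reproduces the numbers in this
section.  The sketch repeats both definitions so that it runs on its
own; the variable names are chosen for this example.

\begin{verbatim}
def PositionToPercentile(position, field_size):
    """Percentile rank of a finisher, given position and field size."""
    beat = field_size - position + 1
    return 100.0 * beat / field_size

def PercentileToPosition(percentile, field_size):
    """Position (possibly fractional) that has the given percentile rank."""
    beat = percentile * field_size / 100.0
    return field_size - beat + 1

overall = PositionToPercentile(97, 1633)        # about 94.1
age_group = PositionToPercentile(26, 256)       # about 90.2

# same percentile rank, carried over to a field of 171 runners
target = PercentileToPosition(age_group, 171)   # about 17.7
\end{verbatim}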


\section{Exercises}

For the following exercises, you can start with \verb"chap04ex.ipynb".
My solution is in \verb"chap04soln.ipynb".

\begin{exercise}
How much did you weigh at birth?  If you don't know, call your mother
or someone else who knows.  Using the NSFG data (all live births),
compute the distribution of birth weights and use it to find your
percentile rank.  If you were a first baby, find your percentile rank
in the distribution for first babies.  Otherwise use the distribution
for others.  If you are in the 90th percentile or higher, call your
mother back and apologize.
\index{birth weight}
\index{weight!birth}

\end{exercise}

\begin{exercise}
The numbers generated by {\tt random.random} are supposed to be
uniform between 0 and 1; that is, every value in the range
should have the same probability.

Generate 1000 numbers from {\tt random.random} and plot their
PMF and CDF.  Is the distribution uniform?
\index{uniform distribution}
\index{distribution!uniform}
\index{random number}

\end{exercise}


\section{Glossary}

\begin{itemize}

\item percentile rank: The percentage of values in a distribution that are
less than or equal to a given value.
\index{percentile rank}

\item percentile: The value associated with a given percentile rank.
\index{percentile}

\item cumulative distribution function (CDF): A function that maps
  from values to their cumulative probabilities.  $\CDF(x)$ is the
  fraction of the sample less than or equal to $x$.  \index{CDF}
\index{cumulative probability}

\item inverse CDF: A function that maps from a cumulative probability,
  $p$, to the corresponding value.
\index{inverse CDF}
\index{CDF, inverse}

\item median: The 50th percentile, often used as a measure of central
  tendency.  \index{median}

\item interquartile range: The difference between
the 75th and 25th percentiles, used as a measure of spread.
\index{interquartile range}

\item quantile: A sequence of values that correspond to equally spaced
percentile ranks; for example, the quartiles of a distribution are
the 25th, 50th and 75th percentiles.
\index{quantile}

\item replacement: A property of a sampling process. ``With replacement''
means that the same value can be chosen more than once; ``without
replacement'' means that once a value is chosen, it is removed from
the population.
\index{replacement}

\end{itemize}


\chapter{Modeling distributions}
\label{modeling}

The distributions we have used so far are called {\bf empirical
  distributions} because they are based on empirical observations,
which are necessarily finite samples.
\index{analytic distribution}
\index{distribution!analytic}
\index{empirical distribution}
\index{distribution!empirical}

The alternative is an {\bf analytic distribution}, which is
characterized by a CDF that is a mathematical function.
Analytic distributions can be used to model empirical distributions.
In this context, a {\bf model} is a simplification that leaves out
unneeded details.  This chapter presents common analytic distributions
and uses them to model data from a variety of sources.
\index{model}

The code for this chapter is in {\tt analytic.py}.  For information
about downloading and working with this code, see Section~\ref{code}.



\section{The exponential distribution}
\label{exponential}
\index{exponential distribution}
\index{distribution!exponential}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_expo_cdf.pdf}}
\caption{CDFs of exponential distributions with various parameters.}
\label{analytic_expo_cdf}
\end{figure}

I'll start with the {\bf exponential distribution} because it is
relatively simple.  The CDF of the exponential distribution is
%
\[ \CDF(x) = 1 - e^{-\lambda x} \]
%
The parameter, $\lambda$, determines the shape of the distribution.
Figure~\ref{analytic_expo_cdf} shows what this CDF looks like with
$\lambda = $ 0.5, 1, and 2.
  \index{parameter}

In the real world, exponential distributions
come up when we look at a series of events and measure the
times between events, called {\bf interarrival times}.
If the events are equally likely to occur at any time, the distribution
of interarrival times tends to look like an exponential distribution.
\index{interarrival time}

As an example, we will look at the interarrival time of births.
On December 18, 1997, 44 babies were born in a hospital in Brisbane,
Australia.\footnote{This example is based on information and data from
  Dunn, ``A Simple Dataset for Demonstrating Common Distributions,''
  Journal of Statistics Education v.7, n.3 (1999).}  The time of
birth for all 44 babies was reported in the local paper; the
complete dataset is in a file called {\tt babyboom.dat}, in the
{\tt ThinkStats2} repository.
\index{birth time}
\index{Australia} \index{Brisbane}

\begin{verbatim}
    df = ReadBabyBoom()
    diffs = df.minutes.diff()
    cdf = thinkstats2.Cdf(diffs, label='actual')

    thinkplot.Cdf(cdf)
    thinkplot.Show(xlabel='minutes', ylabel='CDF')
\end{verbatim}

{\tt ReadBabyBoom} reads the data file and returns a DataFrame
with columns {\tt time}, {\tt sex}, \verb"weight_g", and {\tt minutes},
where {\tt minutes} is time of birth converted to minutes since
midnight.
\index{DataFrame}
\index{thinkplot}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals.pdf}}
\caption{CDF of interarrival times (left) and CCDF on a log-y scale (right).}
\label{analytic_interarrival_cdf}
\end{figure}

%\begin{figure}
% analytic.py
%\centerline{\includegraphics[height=2.5in]{figs/analytic_interarrivals_logy.pdf}}
%\caption{CCDF of interarrival times.}
%\label{analytic_interarrival_ccdf}
%\end{figure}

{\tt diffs} is the difference between consecutive birth times, and
{\tt cdf} is the distribution of these interarrival times.
Figure~\ref{analytic_interarrival_cdf} (left) shows the CDF.  It seems
to have the general shape of an exponential distribution, but how can
we tell?

One way is to plot the {\bf complementary CDF}, which is $1 - \CDF(x)$,
on a log-y scale.  For data from an exponential distribution, the
result is a straight line.  Let's see why that works.
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}

If you plot the complementary CDF (CCDF) of a dataset that you think is
exponential, you expect to see a function like:
%
\[ y \approx e^{-\lambda x} \]
%
Taking the log of both sides yields:
%
\[ \log y \approx -\lambda x\]
%
So on a log-y scale the CCDF is a straight line
with slope $-\lambda$.  Here's how we can generate a plot like that:
\index{logarithmic scale}
\index{complementary CDF}
\index{CDF!complementary}
\index{CCDF}


\begin{verbatim}
    thinkplot.Cdf(cdf, complement=True)
    thinkplot.Show(xlabel='minutes',
                   ylabel='CCDF',
                   yscale='log')
\end{verbatim}

With the argument {\tt complement=True}, {\tt thinkplot.Cdf} computes
the complementary CDF before plotting.  And with {\tt yscale='log'},
{\tt thinkplot.Show} sets the {\tt y} axis to a logarithmic scale.
\index{thinkplot}
\index{Cdf}
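
To see the same idea without plotting, the following sketch draws a
sample from an exponential distribution with a known rate, evaluates
the CCDF at a few points, and estimates the slope of $\log y$ versus
$x$ by least squares.  The function names {\tt Ccdf} and {\tt
  LogCcdfSlope} are hypothetical, and the rate, seed, sample size, and
evaluation points are arbitrary choices.  If the reasoning above
holds, the estimated slope should be close to $-\lambda$.

\begin{verbatim}
import math
import random

def Ccdf(sample, x):
    """Fraction of the sample strictly greater than x, i.e. 1 - CDF(x)."""
    return sum(1 for value in sample if value > x) / len(sample)

def LogCcdfSlope(sample, xs):
    """Least-squares slope of log(CCDF(x)) as a function of x."""
    ys = [math.log(Ccdf(sample, x)) for x in xs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

rng = random.Random(1)     # fixed seed so the run is repeatable
lam = 2.0                  # assumed rate for the simulated data
sample = [rng.expovariate(lam) for _ in range(10000)]

slope = LogCcdfSlope(sample, [0.1, 0.3, 0.5, 0.7, 0.9])
# slope should be roughly -lam
\end{verbatim}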

Figure~\ref{analytic_interarrival_cdf} (right) shows the result.  It is not
exactly straight, which indicates that the exponential distribution is
not a perfect model for this data.  Most likely the underlying
assumption---that a birth is equally likely at any time of day---is
not exactly true.  Nevertheless, it might be reasonable to model this
dataset with an exponential distribution.  With that simplification, we can
summarize the distribution with a single parameter.
\index{model}

The parameter, $\lambda$, can be interpreted as a rate; that is, the
number of events that occur, on average, in a unit of time.  In this
example, 44 babies are born in 24 hours, so the rate is $\lambda =
0.0306$ births per minute.  The mean of an exponential distribution is
$1/\lambda$, so the mean time between births is 32.7 minutes.
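
Spelled out, with time measured in minutes, the arithmetic is:
%
\[ \lambda = \frac{44}{24 \times 60} = \frac{44}{1440} \approx 0.0306 \]
%
\[ \frac{1}{\lambda} = \frac{1440}{44} \approx 32.7 \]
%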


\section{The normal distribution}
\label{normal}

The {\bf normal distribution}, also called Gaussian, is commonly
used because it describes many phenomena, at least approximately.
It turns out that there is a good reason for its ubiquity, which we
will get to in Section~\ref{CLT}.
\index{CDF}
\index{parameter}
\index{mean}
\index{standard deviation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

%
%\[ \CDF(z) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^z e^{-t^2/2} dt \]
%

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_gaussian_cdf.pdf}}
\caption{CDF of normal distributions with a range of parameters.}
\label{analytic_gaussian_cdf}
\end{figure}
3530

The normal distribution is characterized by two parameters: the mean,
$\mu$, and standard deviation $\sigma$.  The normal distribution with
$\mu=0$ and $\sigma=1$ is called the {\bf standard normal
  distribution}.  Its CDF is defined by an integral that does not have
a closed-form solution, but there are algorithms that evaluate it
efficiently.  One of them is provided by SciPy: {\tt scipy.stats.norm}
is an object that represents a normal distribution; it provides a
method, {\tt cdf}, that evaluates the standard normal CDF:
\index{SciPy}
\index{closed form}

\begin{verbatim}
>>> import scipy.stats
>>> scipy.stats.norm.cdf(0)
0.5
\end{verbatim}

This result is correct: the median of the standard normal distribution
is 0 (the same as the mean), and half of the values fall below the
median, so $\CDF(0)$ is 0.5.

{\tt norm.cdf} takes optional parameters: {\tt loc}, which
specifies the mean, and {\tt scale}, which specifies the
standard deviation.

{\tt thinkstats2} makes this function a little easier to use
by providing {\tt EvalNormalCdf}, which takes parameters {\tt mu}
and {\tt sigma} and evaluates the CDF at {\tt x}:
\index{normal distribution}

\begin{verbatim}
def EvalNormalCdf(x, mu=0, sigma=1):
    return scipy.stats.norm.cdf(x, loc=mu, scale=sigma)
\end{verbatim}
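
To make the role of {\tt loc} and {\tt scale} concrete, here is a
small sketch that assumes only that SciPy is installed.  The parameter
values and the test point are illustrative choices.  Standardizing
{\tt x} by hand and using the standard normal CDF should give the same
result as passing {\tt loc} and {\tt scale}.

```python
import scipy.stats

def EvalNormalCdf(x, mu=0, sigma=1):
    return scipy.stats.norm.cdf(x, loc=mu, scale=sigma)

mu, sigma = 7.28, 1.24   # illustrative parameters
x = 8.5                  # arbitrary test point

# Evaluate the CDF at x with the mean and standard deviation given.
p1 = EvalNormalCdf(x, mu, sigma)

# Standardize x by hand, then use the standard normal CDF.
p2 = scipy.stats.norm.cdf((x - mu) / sigma)

print(p1, p2)
```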

Figure~\ref{analytic_gaussian_cdf} shows CDFs for normal
distributions with a range of parameters.  The sigmoid shape of these
curves is a recognizable characteristic of a normal distribution.

In the previous chapter we looked at the distribution of birth
weights in the NSFG.  Figure~\ref{analytic_birthwgt_model} shows the
empirical CDF of weights for all live births and the CDF of
a normal distribution with the same mean and variance.
\index{National Survey of Family Growth}
\index{NSFG}
\index{birth weight}
\index{weight!birth}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_model.pdf}}
\caption{CDF of birth weights with a normal model.}
\label{analytic_birthwgt_model}
\end{figure}

The normal distribution is a good model for this dataset, so
if we summarize the distribution with the parameters
$\mu = 7.28$ and $\sigma = 1.24$, the resulting error
(difference between the model and the data) is small.
\index{model}
\index{percentile}

Below the 10th percentile there is a discrepancy between the data
and the model; there are more light babies than we would expect in
a normal distribution.  If we are specifically interested in preterm
babies, it would be important to get this part of the distribution
right, so it might not be appropriate to use the normal
model.

\section{Normal probability plot}

For the exponential distribution, and a few others, there are
simple transformations we can use to test whether an analytic
distribution is a good model for a dataset.
\index{exponential distribution}
\index{distribution!exponential}
\index{model}

For the normal distribution there is no such transformation, but there
is an alternative called a {\bf normal probability plot}.  There
are two ways to generate a normal probability plot: the hard way
and the easy way.  If you are interested in the hard way, you can
read about it at \url{https://en.wikipedia.org/wiki/Normal_probability_plot}.
Here's the easy way:
\index{normal probability plot}
\index{plot!normal probability}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

\begin{enumerate}

\item Sort the values in the sample.

\item From a standard normal distribution ($\mu=0$ and $\sigma=1$),
generate a random sample with the same size as the sample, and sort it.
\index{random number}

\item Plot the sorted values from the sample versus the random values.

\end{enumerate}

If the distribution of the sample is approximately normal, the result
is a straight line with intercept {\tt mu} and slope {\tt sigma}.
{\tt thinkstats2} provides {\tt NormalProbability}, which takes a
sample and returns two NumPy arrays:
\index{NumPy}

\begin{verbatim}
xs, ys = thinkstats2.NormalProbability(sample)
\end{verbatim}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_normal_prob_example.pdf}}
\caption{Normal probability plot for random samples from normal distributions.}
\label{analytic_normal_prob_example}
\end{figure}

{\tt ys} contains the sorted values from {\tt sample}; {\tt xs}
contains the sorted random values from the standard normal distribution.
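
The three steps listed above can be sketched with NumPy alone.  This
is a minimal illustration, not the {\tt thinkstats2} implementation:
the function name \verb"normal_probability", the seeds, and the
parameters of the fake data are all assumptions made for the example.
A least-squares line through the points gives a rough numerical check
of the intercept and slope.

```python
import numpy as np

def normal_probability(sample, seed=17):
    """Return sorted standard normal draws (xs) and sorted sample values (ys)."""
    rng = np.random.default_rng(seed)
    ys = np.sort(np.asarray(sample, dtype=float))    # step 1: sort the sample
    xs = np.sort(rng.standard_normal(len(ys)))       # step 2: sorted normal draws
    return xs, ys                                    # step 3: plot ys versus xs

# Fake data drawn from a normal distribution with known parameters.
data_rng = np.random.default_rng(1)
sample = data_rng.normal(loc=7.3, scale=1.2, size=5000)

xs, ys = normal_probability(sample)

# np.polyfit with degree 1 returns the slope first, then the intercept.
slope, inter = np.polyfit(xs, ys, 1)
print(inter, slope)
```

For data that really are normal, the intercept should be close to the
mean of the sample and the slope close to its standard deviation.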

To test {\tt NormalProbability} I generated some fake samples that
were actually drawn from normal distributions with various parameters.
Figure~\ref{analytic_normal_prob_example} shows the results.
The lines are approximately straight, with values in the tails
deviating more than values near the mean.

Now let's try it with real data.  Here's code to generate
a normal probability plot for the birth weight data from the
previous section.  It plots a gray line that represents the model
and a blue line that represents the data.
\index{birth weight}
\index{weight!birth}

\begin{verbatim}
def MakeNormalPlot(weights):
    mean = weights.mean()
    std = weights.std()

    xs = [-4, 4]
    fxs, fys = thinkstats2.FitLine(xs, inter=mean, slope=std)
    thinkplot.Plot(fxs, fys, color='gray', label='model')

    xs, ys = thinkstats2.NormalProbability(weights)
    thinkplot.Plot(xs, ys, label='birth weights')
\end{verbatim}

{\tt weights} is a pandas Series of birth weights;
{\tt mean} and {\tt std} are the mean and standard deviation.
\index{pandas}
\index{Series}
\index{thinkplot}
\index{standard deviation}

{\tt FitLine} takes a sequence of {\tt xs}, an intercept, and a
slope; it returns {\tt xs} and {\tt ys} that represent a line
with the given parameters, evaluated at the values in {\tt xs}.

{\tt NormalProbability} returns {\tt xs} and {\tt ys} that
contain values from the standard normal distribution and values
from {\tt weights}.  If the distribution of weights is normal,
the data should match the model.
\index{model}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_birthwgt_normal.pdf}}
\caption{Normal probability plot of birth weights.}
\label{analytic_birthwgt_normal}
\end{figure}

Figure~\ref{analytic_birthwgt_normal} shows the results for
all live births, and also for full term births (pregnancy length greater
than 36 weeks).  Both curves match the model near the mean and
deviate in the tails.  The heaviest babies are heavier than what
the model expects, and the lightest babies are lighter.
\index{pregnancy length}

When we select only full term births, we remove some of the lightest
weights, which reduces the discrepancy in the lower tail of the
distribution.

This plot suggests that the normal model describes the distribution
well within a few standard deviations from the mean, but not in the
tails.  Whether it is good enough for practical purposes depends
on the purposes.
\index{model}
\index{birth weight}
\index{weight!birth}
\index{standard deviation}

\section{The lognormal distribution}
\label{brfss}
\label{lognormal}

If the logarithms of a set of values have a normal distribution, the
values have a {\bf lognormal distribution}.  The CDF of the lognormal
distribution is the same as the CDF of the normal distribution,
with $\log x$ substituted for $x$.
%
\[ \CDF_{lognormal}(x) = \CDF_{normal}(\log x)\]
%
The parameters of the lognormal distribution are usually denoted
$\mu$ and $\sigma$.  But remember that these parameters are {\em not\/}
the mean and standard deviation; the mean of a lognormal distribution
is $\exp(\mu +\sigma^2/2)$ and the standard deviation is
ugly (see \url{http://wikipedia.org/wiki/Log-normal_distribution}).
\index{parameter} \index{weight!adult} \index{adult weight}
\index{lognormal distribution}
\index{distribution!lognormal}
\index{CDF}
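
The relationship between the two CDFs, and the formula for the mean,
can be checked numerically.  The sketch below assumes natural
logarithms and uses hypothetical values of {\tt mu} and {\tt sigma};
it compares the substitution $\log x$ with SciPy's own lognormal
distribution, {\tt scipy.stats.lognorm}, which takes the shape
parameter {\tt s} equal to $\sigma$ and {\tt scale} equal to
$\exp(\mu)$.

```python
import numpy as np
import scipy.stats

mu, sigma = 4.4, 0.2     # hypothetical parameters of log(x)
x = 90.0                 # arbitrary test point

# CDF_lognormal(x) = CDF_normal(log x), using the natural log.
p_via_normal = scipy.stats.norm.cdf(np.log(x), loc=mu, scale=sigma)

# SciPy's parameterization: shape s = sigma, scale = exp(mu).
dist = scipy.stats.lognorm(s=sigma, scale=np.exp(mu))
p_via_lognorm = dist.cdf(x)

# mu is not the mean of the distribution; exp(mu + sigma^2 / 2) is.
mean_formula = np.exp(mu + sigma**2 / 2)

print(p_via_normal, p_via_lognorm)
print(mean_formula, dist.mean())
```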

\begin{figure}
% brfss.py
\centerline{
\includegraphics[height=2.5in]{figs/brfss_weight.pdf}}
\caption{CDF of adult weights on a linear scale (left) and
log scale (right).}
\label{brfss_weight}
\end{figure}

If a sample is approximately lognormal and you plot its CDF on a
log-x scale, it will have the characteristic shape of a normal
distribution.  To test how well the sample fits a lognormal model, you
can make a normal probability plot using the log of the values
in the sample.
\index{normal probability plot}
\index{model}

As an example, let's look at the distribution of adult weights, which
is approximately lognormal.\footnote{I was tipped off to this
  possibility by a comment (without citation) at
  \url{http://mathworld.wolfram.com/LogNormalDistribution.html}.
  Subsequently I found a paper that proposes the log transform and
  suggests a cause: Penman and Johnson, ``The Changing Shape of the
  Body Mass Index Distribution Curve in the Population,'' Preventing
  Chronic Disease, 2006 July; 3(3): A74.  Online at
  \url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1636707}.}

The National Center for Chronic Disease
Prevention and Health Promotion conducts an annual survey as part of
the Behavioral Risk Factor Surveillance System
(BRFSS).\footnote{Centers for Disease Control and Prevention
  (CDC). Behavioral Risk Factor Surveillance System Survey
  Data. Atlanta, Georgia: U.S. Department of Health and Human
  Services, Centers for Disease Control and Prevention, 2008.}  In
2008, they interviewed 414,509 respondents and asked about their
demographics, health, and health risks.
Among the data they collected are the weights in kilograms of
398,484 respondents.
\index{Behavioral Risk Factor Surveillance System}
\index{BRFSS}

The repository for this book contains {\tt CDBRFS08.ASC.gz},
a fixed-width ASCII file that contains data from the BRFSS,
and {\tt brfss.py}, which reads the file and analyzes the data.

\begin{figure}
% brfss.py
\centerline{
\includegraphics[height=2.5in]{figs/brfss_weight_normal.pdf}}
\caption{Normal probability plots for adult weight on a linear scale
  (left) and log scale (right).}
\label{brfss_weight_normal}
\end{figure}

Figure~\ref{brfss_weight} (left) shows the distribution of adult
weights on a linear scale with a normal model.
Figure~\ref{brfss_weight} (right) shows the same distribution on a log
scale with a lognormal model.  The lognormal model is a better fit,
but this representation of the data does not make the difference
particularly dramatic.  \index{respondent} \index{model}

Figure~\ref{brfss_weight_normal} shows normal probability plots for
adult weights, $w$, and for their logarithms, $\log_{10} w$.  Now it
is apparent that the data deviate substantially from the normal model.
On the other hand, the lognormal model is a good match for the data.
\index{normal distribution} \index{distribution!normal}
\index{Gaussian distribution} \index{distribution!Gaussian}
\index{lognormal distribution} \index{distribution!lognormal}
\index{standard deviation} \index{adult weight} \index{weight!adult}
\index{model} \index{normal probability plot}

\section{The Pareto distribution}
\index{Pareto distribution}
\index{distribution!Pareto}
\index{Pareto, Vilfredo}

The {\bf Pareto distribution} is named after the economist Vilfredo Pareto,
who used it to describe the distribution of wealth (see
\url{http://wikipedia.org/wiki/Pareto_distribution}).  Since then, it
has been used to describe phenomena in the natural and social sciences
including sizes of cities and towns, sand particles and meteorites,
forest fires and earthquakes.  \index{CDF}

The CDF of the Pareto distribution is:
%
\[ \CDF(x) = 1 - \left( \frac{x}{x_m} \right) ^{-\alpha} \]
%
The parameters $x_{m}$ and $\alpha$ determine the location and shape
of the distribution. $x_{m}$ is the minimum possible value.
Figure~\ref{analytic_pareto_cdf} shows CDFs of Pareto
distributions with $x_{m} = 0.5$ and different values
of $\alpha$.
\index{parameter}

\begin{figure}
% analytic.py
\centerline{\includegraphics[height=2.5in]{figs/analytic_pareto_cdf.pdf}}
\caption{CDFs of Pareto distributions with different parameters.}
\label{analytic_pareto_cdf}
\end{figure}

There is a simple visual test that indicates whether an empirical
distribution fits a Pareto distribution: on a log-log scale, the CCDF
looks like a straight line.  Let's see why that works.

If you plot the CCDF of a sample from a Pareto distribution on a
linear scale, you expect to see a function like:
%
\[ y \approx \left( \frac{x}{x_m} \right) ^{-\alpha} \]
%
Taking the log of both sides yields:
%
\[ \log y \approx -\alpha (\log x - \log x_{m})\]
%
So if you plot $\log y$ versus $\log x$, it should look like a straight
line with slope $-\alpha$ and intercept
$\alpha \log x_{m}$.
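
The slope and intercept can be confirmed numerically.  The following
sketch evaluates the analytic CCDF on a grid rather than using a random
sample, so the fitted line is exact up to rounding; the parameter
values and the grid are illustrative assumptions.

```python
import numpy as np

xm, alpha = 0.5, 1.7            # illustrative parameters
xs = np.logspace(0, 3, 50)      # values from 1 to 1000, all greater than xm

# Analytic CCDF of the Pareto distribution: 1 - CDF(x).
ccdf = (xs / xm) ** (-alpha)

# Fit a line to log y versus log x; polyfit returns the slope first.
slope, inter = np.polyfit(np.log(xs), np.log(ccdf), 1)

print(slope, -alpha)
print(inter, alpha * np.log(xm))
```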

As an example, let's look at the sizes of cities and towns.
The U.S.~Census Bureau publishes the
population of every incorporated city and town in the United States.
\index{Pareto distribution} \index{distribution!Pareto}
\index{U.S.~Census Bureau} \index{population} \index{city size}

\begin{figure}
% populations.py
\centerline{\includegraphics[height=2.5in]{figs/populations_pareto.pdf}}
\caption{CCDFs of city and town populations, on a log-log scale.}
\label{populations_pareto}
\end{figure}

I downloaded their data from
\url{http://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012-3.html};
it is in the repository for this book in a file named
\verb"PEP_2012_PEPANNRES_with_ann.csv".  The repository also
contains {\tt populations.py}, which reads the file and plots
the distribution of populations.

Figure~\ref{populations_pareto} shows the CCDF of populations on a
log-log scale.  The largest 1\% of cities and towns, below $10^{-2}$,
fall along a straight line.  So we could
conclude, as some researchers have, that the tail of this distribution
fits a Pareto model.
\index{model}

On the other hand, a lognormal distribution also models the data well.
Figure~\ref{populations_normal} shows the CDF of populations and a
lognormal model (left), and a normal probability plot (right).  Both
plots show good agreement between the data and the model.
\index{normal probability plot}

Neither model is perfect.
The Pareto model only applies to the largest 1\% of cities, but it
is a better fit for that part of the distribution.  The lognormal
model is a better fit for the other 99\%.
Which model is appropriate depends on which part of the distribution
is relevant.

\begin{figure}
% populations.py
\centerline{\includegraphics[height=2.5in]{figs/populations_normal.pdf}}
\caption{CDF of city and town populations on a log-x scale (left), and
normal probability plot of log-transformed populations (right).}
\label{populations_normal}
\end{figure}

\section{Generating random numbers}
\index{exponential distribution}
\index{distribution!exponential}
\index{random number}
\index{CDF}
\index{inverse CDF algorithm}
\index{uniform distribution}
\index{distribution!uniform}

Analytic CDFs can be used to generate random numbers with a given
distribution function, $p = \CDF(x)$.  If there is an efficient way to
compute the inverse CDF, we can generate random values
with the appropriate distribution by choosing $p$ from a uniform
distribution between 0 and 1, then choosing
$x = ICDF(p)$.
\index{inverse CDF}
\index{CDF, inverse}

For example, the CDF of the exponential distribution is
%
\[ p = 1 - e^{-\lambda x} \]
%
Solving for $x$ yields:
%
\[ x = -\log (1 - p) / \lambda \]
%
So in Python we can write
%
\begin{verbatim}
import math
import random

def expovariate(lam):
    p = random.random()
    x = -math.log(1-p) / lam
    return x
\end{verbatim}

{\tt expovariate} takes {\tt lam} and returns a random value chosen
from the exponential distribution with parameter {\tt lam}.

Two notes about this implementation:
I called the parameter \verb"lam" because \verb"lambda" is a Python
keyword.  Also, since $\log 0$ is undefined, we have to
be a little careful.  The implementation of {\tt random.random}
can return 0 but not 1, so $1 - p$ can be 1 but not 0, so
{\tt log(1-p)} is always defined.  \index{random module}
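
One way to gain confidence in this function is to generate a large
sample and compare it with what the exponential distribution predicts.
The sketch below repeats {\tt expovariate} so that it runs on its own;
the seed, the sample size, and the choice of {\tt lam} (the birth rate
from earlier in this chapter) are assumptions made for the example.

```python
import math
import random

def expovariate(lam):
    p = random.random()
    x = -math.log(1-p) / lam
    return x

random.seed(17)                 # fixed seed so the run is repeatable
lam = 0.0306                    # births per minute
sample = [expovariate(lam) for _ in range(100000)]

# The sample mean should be close to 1 / lam.
sample_mean = sum(sample) / len(sample)
print(sample_mean, 1 / lam)

# About half of the values should fall below the median, log(2) / lam.
median = math.log(2) / lam
frac_below = sum(x < median for x in sample) / len(sample)
print(frac_below)
```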

\section{Why model?}
\index{model}

At the beginning of this chapter, I said that many real world phenomena
can be modeled with analytic distributions.  ``So,'' you might ask,
``what?''  \index{abstraction}

Like all models, analytic distributions are abstractions, which
means they leave out details that are considered irrelevant.
For example, an observed distribution might have measurement errors
or quirks that are specific to the sample; analytic models smooth
out these idiosyncrasies.
\index{smoothing}

Analytic models are also a form of data compression.  When a model
fits a dataset well, a small set of parameters can summarize a
large amount of data.
\index{parameter}
\index{compression}

It is sometimes surprising when data from a natural phenomenon fit an
analytic distribution, but these observations can provide insight
into physical systems.  Sometimes we can explain why an observed
distribution has a particular form.  For example, Pareto distributions
are often the result of generative processes with positive feedback
(so-called preferential attachment processes: see
\url{http://wikipedia.org/wiki/Preferential_attachment}).
\index{preferential attachment}
\index{generative process}
\index{Pareto distribution}
\index{distribution!Pareto}
\index{analysis}

Also, analytic distributions lend themselves to mathematical
analysis, as we will see in Chapter~\ref{analysis}.

But it is important to remember that all models are imperfect.
Data from the real world never fit an analytic distribution perfectly.
People sometimes talk as if data are generated by models; for example,
they might say that the distribution of human heights is normal,
or the distribution of income is lognormal.  Taken literally, these
claims cannot be true; there are always differences between the
real world and mathematical models.

Models are useful if they capture the relevant aspects of the
real world and leave out unneeded details.  But what is ``relevant''
or ``unneeded'' depends on what you are planning to use the model
for.

\section{Exercises}

For the following exercises, you can start with \verb"chap05ex.ipynb".
My solution is in \verb"chap05soln.ipynb".

\begin{exercise}
In the BRFSS (see Section~\ref{lognormal}), the distribution of
heights is roughly normal with parameters $\mu = 178$ cm and
$\sigma = 7.7$ cm for men, and $\mu = 163$ cm and $\sigma = 7.3$ cm for
women.
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{height}
\index{Blue Man Group}
\index{Group, Blue Man}

In order to join Blue Man Group, you have to be male between 5'10''
and 6'1'' (see \url{http://bluemancasting.com}).  What percentage of
the U.S. male population is in this range?  Hint: use {\tt
  scipy.stats.norm.cdf}.
\index{SciPy}

\end{exercise}


\begin{exercise}
To get a feel for the Pareto distribution, let's see how different
the world
would be if the distribution of human height were Pareto.
With the parameters $x_{m} = 1$ m and $\alpha = 1.7$, we
get a distribution with a reasonable minimum, 1 m,
and median, 1.5 m.
\index{height}
\index{Pareto distribution}
\index{distribution!Pareto}

Plot this distribution.  What is the mean human height in Pareto
world?  What fraction of the population is shorter than the mean?  If
there are 7 billion people in Pareto world, how many do we expect to
be taller than 1 km?  How tall do we expect the tallest person to be?
\index{Pareto World}

\end{exercise}

\begin{exercise}
\label{weibull}

The Weibull distribution is a generalization of the exponential
distribution that comes up in failure analysis
(see \url{http://wikipedia.org/wiki/Weibull_distribution}).  Its CDF is
%
\[ \CDF(x) = 1 - e^{-(x / \lambda)^k} \]
%
Can you find a transformation that makes a Weibull distribution look
like a straight line?  What do the slope and intercept of the
line indicate?
\index{Weibull distribution}
\index{distribution!Weibull}
\index{exponential distribution}
\index{distribution!exponential}
\index{random module}

Use {\tt random.weibullvariate} to generate a sample from a
Weibull distribution and use it to test your transformation.

\end{exercise}


\begin{exercise}
For small values of $n$, we don't expect an empirical distribution
to fit an analytic distribution exactly.  One way to evaluate
the quality of fit is to generate a sample from an analytic
distribution and see how well it matches the data.
\index{empirical distribution}
\index{distribution!empirical}
\index{random module}

For example, in Section~\ref{exponential} we plotted the distribution
of time between births and saw that it is approximately exponential.
But the distribution is based on only 44 data points.  To see whether
the data might have come from an exponential distribution, generate 44
values from an exponential distribution with the same mean as the
data, about 33 minutes between births.

Plot the distribution of the random values and compare it to the
actual distribution.  You can use {\tt random.expovariate}
to generate the values.

\end{exercise}

\begin{exercise}
In the repository for this book, you'll find a set of data files
called {\tt mystery0.dat}, {\tt mystery1.dat}, and so on.  Each
contains a sequence of random numbers generated from an analytic
distribution.
\index{random number}

You will also find \verb"test_models.py", a script that reads
data from a file and plots the CDF under a variety of transforms.
You can run it like this:

\begin{verbatim}
$ python test_models.py mystery0.dat
\end{verbatim}

Based on these plots, you should be able to infer what kind of
distribution generated each file.  If you are stumped, you can
look in {\tt mystery.py}, which contains the code that generated
the files.

\end{exercise}


\begin{exercise}
\label{income}

The distributions of wealth and income are sometimes modeled using
lognormal and Pareto distributions.  To see which is better, let's
look at some data.
\index{Pareto distribution}
\index{distribution!Pareto}
\index{lognormal distribution}
\index{distribution!lognormal}

The Current Population Survey (CPS) is a joint effort of the Bureau
of Labor Statistics and the Census Bureau to study income and related
variables.  Data collected in 2013 is available from
\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.
I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with
information about household income, and converted it to {\tt hinc06.csv},
a CSV file you will find in the repository for this book.  You
will also find {\tt hinc.py}, which reads this file.

Extract the distribution of incomes from this dataset.  Are any of the
analytic distributions in this chapter a good model of the data?  A
solution to this exercise is in \verb"hinc_soln.py".
\index{model}

\end{exercise}

\section{Glossary}

\begin{itemize}

\item empirical distribution: The distribution of values in a sample.
  \index{empirical distribution} \index{distribution!empirical}

\item analytic distribution: A distribution whose CDF is an analytic
function.
\index{analytic distribution}
\index{distribution!analytic}

\item model: A useful simplification.  Analytic distributions are
often good models of more complex empirical distributions.
\index{model}

\item interarrival time: The elapsed time between two events.
\index{interarrival time}

\item complementary CDF: A function that maps from a value, $x$,
to the fraction of values that exceed $x$, which is $1 - \CDF(x)$.
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}

\item standard normal distribution: The normal distribution with
mean 0 and standard deviation 1.
\index{standard normal distribution}

\item normal probability plot: A plot of the values in a sample versus
random values from a standard normal distribution.
\index{normal probability plot}
\index{plot!normal probability}

\end{itemize}

\chapter{Probability density functions}
\label{density}
\index{PDF}
\index{probability density function}
\index{exponential distribution}
\index{distribution!exponential}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{CDF}
\index{derivative}

The code for this chapter is in {\tt density.py}.  For information
about downloading and working with this code, see Section~\ref{code}.


\section{PDFs}

The derivative of a CDF is called a {\bf probability density function},
or PDF.  For example, the PDF of an exponential distribution is
%
\[ \PDF_{expo}(x) = \lambda e^{-\lambda x}   \]
%
The PDF of a normal distribution is
%
\[ \PDF_{normal}(x) = \frac{1}{\sigma \sqrt{2 \pi}} 
                 \exp \left[ -\frac{1}{2} 
                 \left( \frac{x - \mu}{\sigma} \right)^2 \right]  \]
%
Evaluating a PDF for a particular value of $x$ is usually not useful.
The result is not a probability; it is a probability {\em density}.
\index{density}
\index{mass}

In physics, density is mass per unit of
volume; in order to get a mass, you have to multiply by volume or,
if the density is not constant, you have to integrate over volume.

Similarly, {\bf probability density} measures probability per unit of $x$.
In order to get a probability mass, you have to integrate over $x$.
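
To see the difference between a density and a probability mass, the
following sketch integrates a normal density over an interval and
compares the result with the difference of two CDF values.  It assumes
SciPy is available; the parameters and the interval are illustrative.

```python
import scipy.integrate
import scipy.stats

mu, sigma = 163, 7.3            # illustrative: a normal model of heights in cm
low, high = 160, 170

# Integrate the density over [low, high] to get a probability mass.
# quad returns the integral and an estimate of the absolute error.
mass, abs_err = scipy.integrate.quad(
    scipy.stats.norm.pdf, low, high, args=(mu, sigma))

# The same probability from the CDF, whose derivative is the PDF.
mass_from_cdf = (scipy.stats.norm.cdf(high, mu, sigma) -
                 scipy.stats.norm.cdf(low, mu, sigma))

print(mass, mass_from_cdf)
```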

{\tt thinkstats2} provides a class called Pdf that represents
a probability density function.  Every Pdf object provides the
following methods:

\begin{itemize}

\item {\tt Density}, which takes a value, {\tt x}, and returns the
  density of the distribution at {\tt x}.

\item {\tt Render}, which evaluates the density at a discrete set of
  values and returns a pair of sequences: the sorted values, {\tt xs},
  and their probability densities, {\tt ds}.

\item {\tt MakePmf}, which evaluates {\tt Density}
  at a discrete set of values and returns a normalized Pmf that
  approximates the Pdf.
\index{Pmf}

\item {\tt GetLinspace}, which returns the default set of points used 
  by {\tt Render} and {\tt MakePmf}.

\end{itemize}  

Pdf is an abstract parent class, which means you should not
instantiate it; that is, you cannot create a Pdf object.  Instead, you
should define a child class that inherits from Pdf and provides
definitions of {\tt Density} and {\tt GetLinspace}.  Pdf provides
{\tt Render} and {\tt MakePmf}.

For example, {\tt thinkstats2} provides a class named {\tt
  NormalPdf} that evaluates the normal density function.

\begin{verbatim}
class NormalPdf(Pdf):

    def __init__(self, mu=0, sigma=1, label=''):
        self.mu = mu
        self.sigma = sigma
        self.label = label

    def Density(self, xs):
        return scipy.stats.norm.pdf(xs, self.mu, self.sigma)

    def GetLinspace(self):
        low, high = self.mu-3*self.sigma, self.mu+3*self.sigma
        return np.linspace(low, high, 101)
\end{verbatim}

The NormalPdf object contains the parameters {\tt mu} and
{\tt sigma}.  {\tt Density} uses
{\tt scipy.stats.norm}, which is an object that represents a normal
distribution and provides {\tt cdf} and {\tt pdf}, among other
methods (see Section~\ref{normal}).
\index{SciPy}

The following example creates a NormalPdf with the mean and variance
of adult female heights, in cm, from the BRFSS (see
Section~\ref{brfss}).  Then it computes the density of the
distribution at a location one standard deviation from the mean.
\index{standard deviation}

\begin{verbatim}
>>> import math
>>> mean, var = 163, 52.8
>>> std = math.sqrt(var)
>>> pdf = thinkstats2.NormalPdf(mean, std)
>>> pdf.Density(mean + std)
0.0333001
\end{verbatim}

The result is about 0.03, in units of probability mass per cm.
Again, a probability density doesn't mean much by itself.  But if
we plot the Pdf, we can see the shape of the distribution:

\begin{verbatim}
>>> thinkplot.Pdf(pdf, label='normal')
>>> thinkplot.Show()
\end{verbatim}

{\tt thinkplot.Pdf} plots the Pdf as a smooth function,
as contrasted with {\tt thinkplot.Pmf}, which renders a Pmf as a
step function.  Figure~\ref{pdf_example} shows the result, as well
as a PDF estimated from a sample, which we'll compute in the next
section.
\index{thinkplot}

You can use {\tt MakePmf} to approximate the Pdf:

\begin{verbatim}
>>> pmf = pdf.MakePmf()
\end{verbatim}

By default, the resulting Pmf contains 101 points equally spaced from
{\tt mu - 3*sigma} to {\tt mu + 3*sigma}.  Optionally, {\tt MakePmf}
and {\tt Render} can take keyword arguments {\tt low}, {\tt high},
and {\tt n}.
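
The idea behind this approximation can be sketched without
{\tt thinkstats2}.  The code below is not the library's
implementation; it only illustrates, with assumed parameters, how
densities evaluated on a grid become probabilities once they are
normalized.

```python
import numpy as np
import scipy.stats

mu, sigma = 163, 7.3            # illustrative parameters

# 101 equally spaced points from mu - 3*sigma to mu + 3*sigma.
xs = np.linspace(mu - 3*sigma, mu + 3*sigma, 101)
ds = scipy.stats.norm.pdf(xs, mu, sigma)     # densities, not probabilities

# Normalize so the values add up to 1, as the probabilities in a Pmf must.
ps = ds / ds.sum()

# The mean of the discrete approximation should be close to mu.
pmf_mean = np.sum(xs * ps)
print(ds.sum(), ps.sum(), pmf_mean)
```

The densities themselves do not add up to 1, which is one more
reminder that a density is not a probability.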

\begin{figure}
% pdf_example.py
\centerline{\includegraphics[height=2.2in]{figs/pdf_example.pdf}}
\caption{A normal PDF that models adult female height in the U.S.,
and the kernel density estimate of a sample with $n=500$.}
\label{pdf_example}
\end{figure}

4338

4339
\section{Kernel density estimation} 
4340

4341
{\bf Kernel density estimation} (KDE) is an algorithm that takes
4342
a sample and finds an appropriately smooth PDF that fits 
4343
the data.  You can read details at
4344
\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.
4345
\index{KDE}
4346
\index{kernel density estimation}
4347

4348
{\tt scipy} provides an implementation of KDE and {\tt thinkstats2}
4349
provides a class called {\tt EstimatedPdf} that uses it:
4350
\index{SciPy}
4351
\index{NumPy}
4352

4353
\begin{verbatim}
4354
class EstimatedPdf(Pdf):
4355

4356
    def __init__(self, sample):
4357
        self.kde = scipy.stats.gaussian_kde(sample)
4358

4359
    def Density(self, xs):
4360
        return self.kde.evaluate(xs)
4361
\end{verbatim}

\verb"__init__" takes a sample
and computes a kernel density estimate.  The result is a
\verb"gaussian_kde" object that provides an {\tt evaluate}
method.

{\tt Density} takes a value or sequence, calls
\verb"gaussian_kde.evaluate", and returns the resulting density.  The
word ``Gaussian'' appears in the name because it uses a kernel based
on a Gaussian distribution to smooth the estimate.  \index{density}
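
To make the idea concrete, here is a minimal sketch of Gaussian
kernel density estimation using only the standard library.  It
centers a small normal curve on each data point and averages them.
The function names and the bandwidth rule (Scott's rule of thumb) are
assumptions chosen for this illustration; it is not the SciPy
implementation and its results may differ from those of
\verb"gaussian_kde".

\begin{verbatim}
import math
import random
import statistics

def gaussian_kernel(u):
    """Density of the standard normal distribution at u."""
    return math.exp(-u**2 / 2) / math.sqrt(2 * math.pi)

def kde_density(sample, x, bandwidth=None):
    """Estimates the density at x by averaging kernels
    centered on the data points."""
    n = len(sample)
    if bandwidth is None:
        # Scott's rule of thumb (an assumption of this sketch)
        bandwidth = statistics.stdev(sample) * n ** (-1 / 5)
    total = sum(gaussian_kernel((x - xi) / bandwidth)
                for xi in sample)
    return total / (n * bandwidth)

random.seed(17)
sample = [random.gauss(163, 7.27) for _ in range(500)]
print(kde_density(sample, 163))
\end{verbatim}

A larger bandwidth yields a smoother estimate; a smaller one follows
the sample more closely.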

Here's an example that generates a sample from a normal
distribution and then makes an EstimatedPdf to fit it:
\index{NumPy}
\index{EstimatedPdf}

\begin{verbatim}
>>> import random
>>> sample = [random.gauss(mean, std) for i in range(500)]
>>> sample_pdf = thinkstats2.EstimatedPdf(sample)
>>> thinkplot.Pdf(sample_pdf, label='sample KDE')
\end{verbatim}

\verb"sample" is a list of 500 random heights.
\verb"sample_pdf" is a Pdf object that contains the kernel
density estimate of the sample.
\index{thinkplot}
\index{Pmf}

Figure~\ref{pdf_example} shows the normal density function and a KDE
based on a sample of 500 random heights.  The estimate is a good
match for the original distribution.

Estimating a density function with KDE is useful for several purposes:

\begin{itemize}

\item {\it Visualization:\/} During the exploration phase of a project, CDFs
  are usually the best visualization of a distribution.  After you
  look at a CDF, you can decide whether an estimated PDF is an
  appropriate model of the distribution.  If so, it can be a better
  choice for presenting the distribution to an audience that is
  unfamiliar with CDFs.
\index{visualization}
\index{model}

\item {\it Interpolation:\/} An estimated PDF is a way to get from a sample
  to a model of the population.  If you have reason to believe that
  the population distribution is smooth, you can use KDE to interpolate
  the density for values that don't appear in the sample.
\index{interpolation}

\item {\it Simulation:\/} Simulations are often based on the distribution
  of a sample.  If the sample size is small, it
  might be appropriate to smooth the sample distribution using KDE,
  which allows the simulation to explore more possible outcomes,
  rather than replicating the observed data.
\index{simulation}

\end{itemize}


\section{The distribution framework}
\index{distribution framework}

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/distribution_functions.pdf}}
\caption{A framework that relates representations of distribution
functions.}
\label{dist_framework}
\end{figure}

At this point we have seen PMFs, CDFs and PDFs; let's take a minute
to review.  Figure~\ref{dist_framework} shows how these functions relate
to each other.
\index{Pmf}
\index{Cdf}
\index{Pdf}

We started with PMFs, which represent the probabilities for a discrete
set of values.  To get from a PMF to a CDF, you add up the probability
masses to get cumulative probabilities.  
To get from a CDF back to a PMF, you compute differences in cumulative
probabilities.  We'll see the implementation of these operations
in the next few sections.
\index{cumulative probability}
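
Independent of any particular library, the round trip between a PMF
and a CDF can be sketched in a few lines.  The values and
probabilities below are invented for illustration, and the variable
names are not from {\tt thinkstats2}.

\begin{verbatim}
from itertools import accumulate

# a small PMF: values and their probability masses
xs = [1, 2, 3, 5]
pmf_ps = [0.2, 0.4, 0.3, 0.1]

# PMF -> CDF: running sum of the probability masses
cdf_ps = list(accumulate(pmf_ps))

# CDF -> PMF: differences between consecutive
# cumulative probabilities, starting from 0
shifted = [0.0] + cdf_ps[:-1]
recovered = [b - a for a, b in zip(shifted, cdf_ps)]

for x, p, c in zip(xs, recovered, cdf_ps):
    print(x, round(p, 3), round(c, 3))
\end{verbatim}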

A PDF is the derivative of a continuous CDF; or, equivalently,
a CDF is the integral of a PDF.  Remember that a PDF maps from
values to probability densities; to get a probability, you have to
integrate.
\index{discrete distribution}
\index{continuous distribution}
\index{smoothing}

To get from a discrete to a continuous distribution, you can perform
various kinds of smoothing.  One form of smoothing is to assume that
the data come from an analytic continuous distribution
(like exponential or normal) and to estimate the parameters of that
distribution.  Another option is kernel density estimation.
\index{exponential distribution}
\index{distribution!exponential}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

The opposite of smoothing is {\bf discretizing}, or quantizing.  If you
evaluate a PDF at discrete points, you can generate a PMF that is an
approximation of the PDF.  You can get a better approximation using
numerical integration.  \index{discretize}
\index{quantize}
\index{binning}

To distinguish between continuous and discrete CDFs, it might be
better for a discrete CDF to be called a ``cumulative mass function,''
but as far as I can tell no one uses that term.  \index{CDF}



\section{Hist implementation}

At this point you should know how to use the basic types provided
by {\tt thinkstats2}: Hist, Pmf, Cdf, and Pdf.  The next few sections
provide details about how they are implemented.  This material
might help you use these classes more effectively, but it is not
strictly necessary.
\index{Hist}

Hist and Pmf inherit from a parent class called \verb"_DictWrapper".
The leading underscore indicates that this class is ``internal;'' that
is, it should not be used by code in other modules.  The name
indicates what it is: a dictionary wrapper.  Its primary attribute is
{\tt d}, the dictionary that maps from values to their frequencies.
\index{DictWrapper}
\index{internal class}
\index{wrapper}

The values can be any hashable type.  The frequencies should be integers,
but can be any numeric type.
\index{hashable}

\verb"_DictWrapper" contains methods appropriate for both
Hist and Pmf, including \verb"__init__", {\tt Values},
{\tt Items} and {\tt Render}.  It also provides modifier
methods {\tt Set}, {\tt Incr}, {\tt Mult}, and {\tt Remove}.  These
methods are all implemented with dictionary operations.  For example:
\index{dictionary}

\begin{verbatim}
# class _DictWrapper

    def Incr(self, x, term=1):
        self.d[x] = self.d.get(x, 0) + term

    def Mult(self, x, factor):
        self.d[x] = self.d.get(x, 0) * factor

    def Remove(self, x):
        del self.d[x]
\end{verbatim}

Hist also provides {\tt Freq}, which looks up the frequency
of a given value.
\index{frequency}

Because Hist operators and methods are based on dictionaries,
these methods are constant time operations;
that is, their run time does not increase as the Hist gets bigger.
\index{Hist}


\section{Pmf implementation}

Pmf and Hist are almost the same thing, except that a Pmf
maps values to floating-point probabilities, rather than integer
frequencies.  If the sum of the probabilities is 1, the Pmf is normalized.
\index{Pmf}

Pmf provides {\tt Normalize}, which computes the sum of the
probabilities and divides through by a factor:

\begin{verbatim}
# class Pmf

    def Normalize(self, fraction=1.0):
        total = self.Total()
        if total == 0.0:
            raise ValueError('Total probability is zero.')

        factor = float(fraction) / total
        for x in self.d:
            self.d[x] *= factor

        return total
\end{verbatim}

{\tt fraction} determines the sum of the probabilities after
normalizing; the default value is 1.  If the total probability is 0,
the Pmf cannot be normalized, so {\tt Normalize} raises {\tt
  ValueError}.

Hist and Pmf have the same constructor.  It can take
as an argument a {\tt dict}, Hist, Pmf or Cdf, a pandas
Series, a list of (value, frequency) pairs, or a sequence of values.
\index{Hist}

If you instantiate a Pmf, the result is normalized.  If you
instantiate a Hist, it is not.  To construct an unnormalized Pmf,
you can create an empty Pmf and modify it.  The Pmf modifiers do
not renormalize the Pmf.


\section{Cdf implementation}

A CDF maps from values to cumulative probabilities, so I could have
implemented Cdf as a \verb"_DictWrapper".  But the values in a CDF are
ordered and the values in a \verb"_DictWrapper" are not.  Also, it is
often useful to compute the inverse CDF; that is, the map from
cumulative probability to value.  So the implementation I chose is two
sorted lists.  That way I can use binary search to do a forward or
inverse lookup in logarithmic time.
\index{Cdf}
\index{binary search}
\index{cumulative probability}
\index{DictWrapper}
\index{inverse CDF}
\index{CDF, inverse}

The Cdf constructor can take as a parameter a sequence of values
or a pandas Series, a dictionary that maps from values to
probabilities, a sequence of (value, probability) pairs, a Hist, Pmf,
or Cdf.  Or if it is given two parameters, it treats them as a sorted
sequence of values and the sequence of corresponding cumulative
probabilities.

Given a sequence, pandas Series, or dictionary, the constructor makes
a Hist.  Then it uses the Hist to initialize the attributes:

\begin{verbatim}
        self.xs, freqs = zip(*sorted(dw.Items()))
        self.ps = np.cumsum(freqs, dtype=float)
        self.ps /= self.ps[-1]
\end{verbatim}

{\tt xs} is the sorted list of values; {\tt freqs} is the list
of corresponding frequencies.  {\tt np.cumsum} computes
the cumulative sum of the frequencies.  Dividing through by the
total frequency yields cumulative probabilities.
For {\tt n} values, the time to construct the
Cdf is proportional to $n \log n$.
\index{frequency}

Here is the implementation of {\tt Prob}, which takes a value
and returns its cumulative probability: 

\begin{verbatim}
# class Cdf
    def Prob(self, x):
        if x < self.xs[0]:
            return 0.0
        index = bisect.bisect(self.xs, x)
        p = self.ps[index - 1]
        return p
\end{verbatim}

The {\tt bisect} module provides an implementation of binary search.
And here is the implementation of {\tt Value}, which takes a
cumulative probability and returns the corresponding value:

\begin{verbatim}
# class Cdf
    def Value(self, p):
        if p < 0 or p > 1:
            raise ValueError('p must be in range [0, 1]')

        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]
\end{verbatim}
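
To see the two lookups at work without the rest of the class, here is
a small standalone sketch that stores a CDF as two sorted lists and
applies the same binary searches.  The functions \verb"make_cdf",
{\tt prob}, and {\tt value} are hypothetical stand-ins written for
this example, not the {\tt thinkstats2} implementation.

\begin{verbatim}
import bisect
from collections import Counter
from itertools import accumulate

def make_cdf(values):
    """Returns sorted values and cumulative probabilities."""
    counts = Counter(values)
    xs = sorted(counts)
    cumulative = list(accumulate(counts[x] for x in xs))
    ps = [c / cumulative[-1] for c in cumulative]
    return xs, ps

def prob(xs, ps, x):
    """Forward lookup: fraction of values <= x."""
    if x < xs[0]:
        return 0.0
    index = bisect.bisect(xs, x)
    return ps[index - 1]

def value(xs, ps, p):
    """Inverse lookup: smallest value whose cumulative
    probability is at least p."""
    if p < 0 or p > 1:
        raise ValueError('p must be in range [0, 1]')
    index = bisect.bisect_left(ps, p)
    return xs[index]

xs, ps = make_cdf([1, 2, 2, 3, 5])
print(prob(xs, ps, 2))      # 0.6
print(prob(xs, ps, 4))      # 0.8
print(value(xs, ps, 0.5))   # 2
\end{verbatim}

Note that {\tt prob} works for a value like 4 that does not appear in
the sample; it returns the cumulative probability of the largest
value that does not exceed it.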

Given a Cdf, we can compute the Pmf by computing differences between
consecutive cumulative probabilities.  If you call the Pmf constructor
and pass a Cdf, it computes differences by calling {\tt Cdf.Items}:
\index{Pmf}
\index{Cdf}

\begin{verbatim}
# class Cdf
    def Items(self):
        a = self.ps
        b = np.roll(a, 1)
        b[0] = 0
        return zip(self.xs, a-b)
\end{verbatim}

{\tt np.roll} shifts the elements of {\tt a} to the right, and ``rolls''
the last one back to the beginning.  We replace the first element of
{\tt b} with 0 and then compute the difference {\tt a-b}.  The result
is a NumPy array of probabilities.
\index{NumPy}

Cdf provides {\tt Shift} and {\tt Scale}, which modify the
values in the Cdf, but the probabilities should be treated as
immutable.


\section{Moments}
\index{moment}

Any time you take a sample and reduce it to a single number, that
number is a statistic.  The statistics we have seen so far include
mean, variance, median, and interquartile range.

A {\bf raw moment} is a kind of statistic.  If you have a sample of
values, $x_i$, the $k$th raw moment is:
%
\[ m'_k = \frac{1}{n} \sum_i x_i^k \]
%
Or if you prefer Python notation:

\begin{verbatim}
def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)
\end{verbatim}

When $k=1$ the result is the sample mean, $\xbar$.  The other
raw moments don't mean much by themselves, but they are used
in some computations.

The {\bf central moments} are more useful.  The
$k$th central moment is:
%
\[ m_k = \frac{1}{n} \sum_i (x_i - \xbar)^k \]
%
Or in Python:

\begin{verbatim}
def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)
\end{verbatim}

When $k=2$ the result is the second central moment, which you might
recognize as variance.  The definition of variance gives a hint about
why these statistics are called moments.  If we attach a weight along a
ruler at each location, $x_i$, and then spin the ruler around
the mean, the moment of inertia of the spinning weights is the variance
of the values.  If you are not familiar with moment of inertia, see
\url{http://en.wikipedia.org/wiki/Moment_of_inertia}.  \index{moment
  of inertia}

When you report moment-based statistics, it is important to think
about the units.  For example, if the values $x_i$ are in cm, the
first raw moment is also in cm.  But the second moment is in
cm$^2$, the third moment is in cm$^3$, and so on.

Because of these units, moments are hard to interpret by themselves.
That's why, for the second moment, it is common to report standard
deviation, which is the square root of variance, so it is in the same
units as $x_i$.
\index{standard deviation}
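
As a small worked example of these units, the following sketch
applies the two functions above to five made-up heights in cm.  The
data are hypothetical, chosen so the arithmetic is easy to check by
hand.

\begin{verbatim}
import math

def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)

def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)

heights = [150, 160, 165, 170, 180]   # hypothetical, in cm

mean = RawMoment(heights, 1)          # in cm
var = CentralMoment(heights, 2)       # in cm**2
std = math.sqrt(var)                  # back in cm

print(mean, var, std)                 # 165.0 100.0 10.0
\end{verbatim}

The deviations from the mean are $-15$, $-5$, 0, 5, and 15 cm; their
squares average to 100 cm$^2$, so the standard deviation is 10 cm.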


\section{Skewness}
\index{skewness}

{\bf Skewness} is a property that describes the shape of a distribution.
If the distribution is symmetric around its central tendency, it is
unskewed.  If the values extend farther to the right, it is ``right
skewed'' and if the values extend left, it is ``left skewed.''
\index{central tendency}

This use of ``skewed'' does not have the usual connotation of
``biased.''  Skewness only describes the shape of the distribution;
it says nothing about whether the sampling process might have been
biased.
\index{bias}
\index{sample skewness}

Several statistics are commonly used to quantify the skewness of a
distribution.  Given a sequence of values, $x_i$, the {\bf sample
  skewness}, $g_1$, can be computed like this:

\begin{verbatim}
def StandardizedMoment(xs, k):
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    return CentralMoment(xs, k) / std**k

def Skewness(xs):
    return StandardizedMoment(xs, 3)
\end{verbatim}

$g_1$ is the third {\bf standardized moment}, which means that it has
been normalized so it has no units.
\index{standardized moment}

Negative skewness indicates that a distribution 
skews left; positive skewness indicates
that a distribution skews right.  The magnitude of $g_1$ indicates
the strength of the skewness, but by itself it is not easy to
interpret.

In practice, computing sample skewness is usually not
a good idea.  If there are any outliers, they
have a disproportionate effect on $g_1$.
\index{outlier}

Another way to evaluate the asymmetry of a distribution is to look
at the relationship between the mean and median.
Extreme values have more effect on the mean than the median, so
in a distribution that skews left, the mean is less than the median.
In a distribution that skews right, the mean is greater.
\index{symmetric}
\index{Pearson median skewness}

{\bf Pearson's median skewness coefficient} is a measure
of skewness based on the difference between the
sample mean and median:
%
\[ g_p = 3 (\xbar - m) / S \]
%
Where $\xbar$ is the sample mean, $m$ is the median, and
$S$ is the standard deviation.  Or in Python:
\index{standard deviation}

\begin{verbatim}
def Median(xs):
    cdf = thinkstats2.Cdf(xs)
    return cdf.Value(0.5)

def PearsonMedianSkewness(xs):
    median = Median(xs)
    mean = RawMoment(xs, 1)
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    gp = 3 * (mean - median) / std
    return gp
\end{verbatim}

This statistic is {\bf robust}, which means that it is less vulnerable
to the effect of outliers.
\index{robust}
\index{outlier}
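
To get a feel for the difference, the following sketch computes both
statistics for a small symmetric sample and then again after adding
a single outlier.  It is self-contained, so it uses
{\tt statistics.median} from the standard library in place of the
{\tt thinkstats2.Cdf} used above (an assumption of this example), and
the data are invented.

\begin{verbatim}
import math
import statistics

def central_moment(xs, k):
    mean = sum(xs) / len(xs)
    return sum((x - mean)**k for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment, g1."""
    std = math.sqrt(central_moment(xs, 2))
    return central_moment(xs, 3) / std**3

def pearson_median_skewness(xs):
    """3 * (mean - median) / std, gp."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(central_moment(xs, 2))
    return 3 * (mean - statistics.median(xs)) / std

symmetric = list(range(1, 10))        # 1, 2, ..., 9
with_outlier = symmetric + [100]      # one extreme value

for xs in [symmetric, with_outlier]:
    print(skewness(xs), pearson_median_skewness(xs))
\end{verbatim}

For the symmetric sample both statistics are 0.  With the outlier,
sample skewness jumps to about 2.6, while Pearson's median skewness
moves to about 0.9.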

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/density_totalwgt_kde.pdf}}
\caption{Estimated PDF of birthweight data from the NSFG.}
\label{density_totalwgt_kde}
\end{figure}

As an example, let's look at the skewness of birth weights in the
NSFG pregnancy data.  Here's the code to estimate and plot the PDF:
\index{thinkplot}

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    data = live.totalwgt_lb.dropna()
    pdf = thinkstats2.EstimatedPdf(data)
    thinkplot.Pdf(pdf, label='birth weight')
\end{verbatim}

Figure~\ref{density_totalwgt_kde} shows the result.  The left tail appears
longer than the right, so we suspect the distribution is skewed left.
The mean, 7.27 lbs, is a bit less than
the median, 7.38 lbs, so that is consistent with left skew.
And both skewness coefficients are negative:
sample skewness is -0.59;
Pearson's median skewness is -0.23.
\index{skewness}
\index{dropna}
\index{NaN}

\begin{figure}
\centerline{\includegraphics[height=2.2in]{figs/density_wtkg2_kde.pdf}}
\caption{Estimated PDF of adult weight data from the BRFSS.}
\label{density_wtkg2_kde}
\end{figure}

Now let's compare this distribution to the distribution of adult
weight in the BRFSS.  Again, here's the code:
\index{thinkplot}

\begin{verbatim}
    df = brfss.ReadBrfss(nrows=None)
    data = df.wtkg2.dropna()
    pdf = thinkstats2.EstimatedPdf(data)
    thinkplot.Pdf(pdf, label='adult weight')
\end{verbatim}

Figure~\ref{density_wtkg2_kde} shows the result.  The distribution
appears skewed to the right.  Sure enough, the mean, 79.0, is bigger
than the median, 77.3.  The sample skewness is 1.1 and Pearson's
median skewness is 0.26.
\index{dropna}
\index{NaN}

The sign of a skewness coefficient indicates whether the distribution
skews left or right, but other than that, the coefficients are hard
to interpret.
Sample skewness is less robust; that is, it is more
susceptible to outliers.  As a result it is less reliable
when applied to skewed distributions, exactly when it would be most
relevant.
\index{outlier}
\index{robust}

Pearson's median skewness is based on a computed mean and variance,
so it is also susceptible to outliers, but since it does not depend
on a third moment, it is somewhat more robust.
\index{Pearson median skewness}


\section{Exercises}

A solution to this exercise is in \verb"chap06soln.py".

\begin{exercise}

The distribution of income is famously skewed to the right.  In this
exercise, we'll measure how strong that skew is.
\index{skewness}
\index{income}

The Current Population Survey (CPS) is a joint effort of the Bureau
of Labor Statistics and the Census Bureau to study income and related
variables.  Data collected in 2013 is available from
\url{http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm}.
I downloaded {\tt hinc06.xls}, which is an Excel spreadsheet with
information about household income, and converted it to {\tt hinc06.csv},
a CSV file you will find in the repository for this book.  You
will also find {\tt hinc2.py}, which reads this file and transforms
the data.
\index{Current Population Survey}
\index{Bureau of Labor Statistics}
\index{Census Bureau}

The dataset is in the form of a series of income ranges and the number
of respondents who fell in each range.  The lowest range includes
respondents who reported annual household income ``Under \$5000.''
The highest range includes respondents who made ``\$250,000 or
more.''

To estimate mean and other statistics from these data, we have to
make some assumptions about the lower and upper bounds, and how
the values are distributed in each range.  {\tt hinc2.py} provides
{\tt InterpolateSample}, which shows one way to model
this data.  It takes a DataFrame with a column, {\tt income}, that
contains the upper bound of each range, and {\tt freq}, which contains
the number of respondents in each range.
\index{DataFrame}
\index{model}

It also takes \verb"log_upper", which is an assumed upper bound
on the highest range, expressed in {\tt log10} dollars.  
The default value, \verb"log_upper=6.0" represents the assumption
that the largest income among the respondents is
$10^6$, or one million dollars.

{\tt InterpolateSample} generates a pseudo-sample; that is, a sample
of household incomes that yields the same number of respondents
in each range as the actual data.  It assumes that incomes in
each range are equally spaced on a log10 scale.

Compute the median, mean, skewness and Pearson's skewness of the
resulting sample.  What fraction of households reports a taxable
income below the mean?  How do the results depend on the assumed
upper bound?
\end{exercise}


\section{Glossary}

\begin{itemize}

\item Probability density function (PDF): The derivative of a continuous CDF,
a function that maps a value to its probability density.
\index{PDF}
\index{probability density function}

\item Probability density: A quantity that can be integrated over a
  range of values to yield a probability.  If the values are in units
  of cm, for example, probability density is in units of probability
  per cm.
\index{probability density}

\item Kernel density estimation (KDE): An algorithm that estimates a PDF
based on a sample.
\index{kernel density estimation}
\index{KDE}

\item discretize: To approximate a continuous function or distribution
with a discrete function.  The opposite of smoothing.
\index{discretize}

\item raw moment: A statistic based on the sum of data raised to a power.
\index{raw moment}

\item central moment: A statistic based on deviation from the mean,
raised to a power.
\index{central moment}

\item standardized moment: A ratio of moments that has no units.
\index{standardized moment}

\item skewness: A measure of how asymmetric a distribution is.
\index{skewness}

\item sample skewness: A moment-based statistic intended to quantify
the skewness of a distribution.
\index{sample skewness}

\item Pearson's median skewness coefficient: A statistic intended to
  quantify the skewness of a distribution based on the median, mean,
  and standard deviation.
  \index{Pearson median skewness}

\item robust: A statistic is robust if it is relatively immune to the
  effect of outliers.
\index{robust}

\end{itemize}



\chapter{Relationships between variables}

So far we have only looked at one variable at a time.  In this
chapter we look at relationships between variables.  Two variables are
related if knowing one gives you information about the other.  For
example, height and weight are related; people who are taller tend to
be heavier.  Of course, it is not a perfect relationship: there
are short heavy people and tall light ones.  But if you are
trying to guess someone's weight, you will be more accurate if you
know their height than if you don't.
\index{adult weight}
\index{adult height}

The code for this chapter is in {\tt scatter.py}.
For information about downloading and
working with this code, see Section~\ref{code}.


\section{Scatter plots}
\index{scatter plot}
\index{plot!scatter}

The simplest way to check for a relationship between two variables
is a {\bf scatter plot}, but making a good scatter plot is not always easy.
As an example, I'll plot weight versus height for the respondents
in the BRFSS (see Section~\ref{lognormal}).
\index{BRFSS}

Here's the code that reads the data file and extracts height and
weight:

\begin{verbatim}
    df = brfss.ReadBrfss(nrows=None)
    sample = thinkstats2.SampleRows(df, 5000)
    heights, weights = sample.htm3, sample.wtkg2
\end{verbatim}

{\tt SampleRows} chooses a random subset of the data:
\index{SampleRows}

\begin{verbatim}
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample
\end{verbatim}

{\tt df} is the DataFrame, {\tt nrows} is the number of rows to choose,
and {\tt replace} is a boolean indicating whether sampling should be
done with replacement; in other words, whether the same row could be
chosen more than once.
\index{DataFrame}
\index{thinkplot}
\index{boolean}
\index{replacement}

{\tt thinkplot} provides {\tt Scatter}, which makes scatter plots:
%
\begin{verbatim}
    thinkplot.Scatter(heights, weights)
    thinkplot.Show(xlabel='Height (cm)',
                   ylabel='Weight (kg)',
                   axis=[140, 210, 20, 200])
\end{verbatim}

The result, in Figure~\ref{scatter1} (left), shows the shape of
the relationship.  As we expected, taller
people tend to be heavier.  

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=3.0in]{figs/scatter1.pdf}}
\caption{Scatter plots of weight versus height for the respondents
in the BRFSS, unjittered (left), jittered (right).}
\label{scatter1}
\end{figure}

But this is not the best representation of
the data, because the data are packed into columns.  The problem is
that the heights are rounded to the nearest inch, converted to
centimeters, and then rounded again.  Some information is lost in
translation.  \index{height} \index{weight} \index{jitter}

We can't get that information back, but we can minimize the effect on
the scatter plot by {\bf jittering} the data, which means adding random
noise to reverse the effect of rounding off.  Since these measurements
were rounded to the nearest inch, they might be off by up to 0.5 inches or
1.3 cm.  Similarly, the weights might be off by 0.5 kg.
\index{uniform distribution}
\index{distribution!uniform}
\index{noise}

%
\begin{verbatim}
    heights = thinkstats2.Jitter(heights, 1.3)
    weights = thinkstats2.Jitter(weights, 0.5)
\end{verbatim}

Here's the implementation of {\tt Jitter}:

\begin{verbatim}
def Jitter(values, jitter=0.5):
    n = len(values)
    return np.random.uniform(-jitter, +jitter, n) + values
\end{verbatim}

The values can be any sequence; the result is a NumPy array.
\index{NumPy}
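
For readers who want to experiment without NumPy, the same idea can
be sketched with the standard library.  The function below is a
hypothetical stand-in for {\tt Jitter} that returns a list rather
than an array, and the heights are made up to mimic rounded data.

\begin{verbatim}
import random

def jitter(values, amount=0.5):
    """Adds uniform noise in [-amount, +amount] to each value."""
    return [v + random.uniform(-amount, amount) for v in values]

random.seed(1)
heights = [170, 170, 173, 173, 175]   # cm, rounded
jittered = jitter(heights, 1.3)

# repeated values no longer coincide, but each one stays
# within 1.3 cm of where it started
for h, j in zip(heights, jittered):
    print(h, round(j, 2))
\end{verbatim}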

Figure~\ref{scatter1} (right) shows the result.  Jittering reduces the
visual effect of rounding and makes the shape of the relationship
clearer.  But in general you should only jitter data for purposes of
visualization and avoid using jittered data for analysis.

Even with jittering, this is not the best way to represent the data.
There are many overlapping points, which hides data
in the dense parts of the figure and gives disproportionate emphasis
to outliers.  This effect is called {\bf saturation}.
\index{outlier}
\index{saturation}

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=3.0in]{figs/scatter2.pdf}}
\caption{Scatter plot with jittering and transparency (left),
hexbin plot (right).}
\label{scatter2}
\end{figure}

We can solve this problem with the {\tt alpha} parameter, which makes
the points partly transparent:
%
\begin{verbatim}
    thinkplot.Scatter(heights, weights, alpha=0.2)
\end{verbatim}
%
Figure~\ref{scatter2} (left) shows the result.  Overlapping data
points look darker, so darkness is proportional to density.  In this
version of the plot we can see two details that were not apparent before:
vertical clusters at several heights and a horizontal line near 90 kg
or 200 pounds.  Since this data is based on self-reports in pounds,
the most likely explanation is that some respondents reported
rounded values.
\index{thinkplot}
\index{alpha}
\index{transparency}

Using transparency works well for moderate-sized datasets, but this
figure only shows a random sample of 5000 records from the BRFSS, out
of a total of 414 509.
\index{hexbin plot}
\index{plot!hexbin}
5137

5138
To handle larger datasets, another option is a hexbin plot, which
5139
divides the graph into hexagonal bins and colors each bin according to
5140
how many data points fall in it.  {\tt thinkplot} provides 
5141
{\tt HexBin}:
5142
%
5143
\begin{verbatim}
5144
    thinkplot.HexBin(heights, weights)
5145
\end{verbatim}
5146
%
5147
Figure~\ref{scatter2} (right) shows the result.  An advantage of a
5148
hexbin is that it shows the shape of the relationship well, and it is
5149
efficient for large datasets, both in time and in the size of the file
5150
it generates.  A drawback is that it makes the outliers invisible.
5151
\index{thinkplot}
5152
\index{outlier}
5153

5154
The point of this example is that it is
5155
not easy to make a scatter plot that shows relationships clearly
5156
without introducing misleading artifacts.
5157
\index{artifact}

\section{Characterizing relationships}
\label{characterizing}

Scatter plots provide a general impression of the relationship between
variables, but there are other visualizations that provide more
insight into the nature of the relationship.  One option is to bin one
variable and plot percentiles of the other.
\index{binning}

NumPy and pandas provide functions for binning data:
\index{NumPy}
\index{pandas}

\begin{verbatim}
    df = df.dropna(subset=['htm3', 'wtkg2'])
    bins = np.arange(135, 210, 5)
    indices = np.digitize(df.htm3, bins)
    groups = df.groupby(indices)
\end{verbatim}

{\tt dropna} drops rows with {\tt nan} in any of the listed columns.
{\tt arange} makes a NumPy array of bins from 135 to, but not including,
210, in increments of 5.
\index{dropna}
\index{digitize}
\index{NaN}

{\tt digitize} computes the index of the bin that contains each value
in {\tt df.htm3}.  The result is a NumPy array of integer indices.
Values that fall below the lowest bin are mapped to index 0.  Values
above the highest bin are mapped to {\tt len(bins)}.
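
To make the edge cases concrete, here is a small sketch that applies
{\tt digitize} to a few made-up heights.  The values are illustrative
assumptions chosen to hit the boundaries; they are not taken from the BRFSS.

```python
import numpy as np

# Same bin edges as in the text: 135, 140, ..., 205 (15 edges in total).
bins = np.arange(135, 210, 5)

# Made-up heights in cm, chosen to hit the edge cases.
heights = np.array([120.0, 135.0, 137.5, 172.0, 205.0, 230.0])

# With the default right=False, digitize returns i such that
# bins[i-1] <= x < bins[i].
indices = np.digitize(heights, bins)
print(indices)
```

The value 120 falls below the lowest edge and gets index 0; the values
205 and 230 are at or above the highest edge and get {\tt len(bins)},
which is 15.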

\begin{figure}
% scatter.py
\centerline{\includegraphics[height=2.5in]{figs/scatter3.pdf}}
\caption{Percentiles of weight for a range of height bins.}
\label{scatter3}
\end{figure}

{\tt groupby} is a DataFrame method that returns a GroupBy object;
used in a {\tt for} loop, {\tt groups} iterates over the names of the
groups and the DataFrames that represent them.  So, for example, we can
print the number of rows in each group like this:
\index{DataFrame}
\index{groupby}

\begin{verbatim}
for i, group in groups:
    print(i, len(group))
\end{verbatim}

Now for each group we can compute the mean height and the CDF
of weight:
\index{Cdf}

\begin{verbatim}
    heights = [group.htm3.mean() for i, group in groups]
    cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]
\end{verbatim}

Finally, we can
plot percentiles of weight versus height:
\index{percentile}

\begin{verbatim}
    for percent in [75, 50, 25]:
        weights = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(heights, weights, label=label)
\end{verbatim}

Figure~\ref{scatter3} shows the result.  Between 140 and 200 cm
the relationship between these variables is roughly linear.  This range
includes more than 99\% of the data, so we don't have to worry
too much about the extremes.
\index{thinkplot}
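
The same analysis can be sketched without {\tt thinkstats2}, using the
{\tt quantile} method that pandas provides for grouped data.  The data
here are synthetic: the slope, noise level, and random seed are
arbitrary assumptions, so the numbers will not match the BRFSS results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the BRFSS columns; all parameters are made up.
rng = np.random.default_rng(17)
height = rng.normal(168, 10, size=5000)
weight = 75 + 0.9 * (height - 168) + rng.normal(0, 12, size=5000)
df = pd.DataFrame({'htm3': height, 'wtkg2': weight})

# Bin by height, as in the text.
bins = np.arange(135, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)

# Mean height and the 25th, 50th, and 75th percentiles of weight per bin.
mean_heights = groups['htm3'].mean()
quartiles = groups['wtkg2'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles.head())
```

Each row of {\tt quartiles} corresponds to one height bin, and each
column to one percentile, so plotting a column against
{\tt mean\_heights} gives one line of a figure like
Figure~\ref{scatter3}.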

\section{Correlation}

A {\bf correlation} is a statistic intended to quantify the strength
of the relationship between two variables.
\index{correlation}

A challenge in measuring correlation is that the variables we want to
compare are often not expressed in the same units.  And even if they
are in the same units, they come from different distributions.
\index{units}

There are two common solutions to these problems:

\begin{enumerate}

\item Transform each value to a {\bf standard score}, which is the
number of standard deviations from the mean.
This transform leads to
the ``Pearson product-moment correlation coefficient.''
\index{standard score}
\index{standard deviation}
\index{Pearson coefficient of correlation}

\item Transform each value to its {\bf rank}, which is its index in
the sorted list of values.  This transform
leads to the ``Spearman rank correlation coefficient.''
\index{rank}
\index{percentile rank}
\index{Spearman coefficient of correlation}

\end{enumerate}

If $X$ is a series of $n$ values, $x_i$, we can convert to standard
scores by subtracting the mean and dividing by the standard deviation:
$z_i = (x_i - \mu) / \sigma$.
\index{mean}
\index{standard deviation}

The numerator is a deviation: the distance from the mean.  Dividing by
$\sigma$ {\bf standardizes} the deviation, so the values of $Z$ are
dimensionless (no units) and their distribution has mean 0 and
variance 1.
\index{standardize}
\index{deviation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

If $X$ is normally distributed, so is $Z$.  But if $X$ is skewed or has
outliers, so does $Z$; in those cases, it is more robust to use
percentile ranks.  If we compute a new variable, $R$, so that $r_i$ is
the rank of $x_i$, the distribution of $R$ is uniform
from 1 to $n$, regardless of the distribution of $X$.
\index{uniform distribution} \index{distribution!uniform}
\index{robust}
\index{skewness}
\index{outlier}
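
As a minimal sketch of the two transforms, the following code computes
standard scores and ranks for a small made-up sample that contains one
outlier.  The sample values and variable names are assumptions for
illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample; the last value is a deliberate outlier.
xs = np.array([1.0, 2.0, 5.0, 7.0, 100.0])

# Standard scores: subtract the mean, divide by the standard deviation.
# np.std divides by n by default, so zs has mean 0 and variance 1.
zs = (xs - xs.mean()) / xs.std()

# Ranks: 1 for the smallest value up to n for the largest.
ranks = pd.Series(xs).rank()

print(zs)
print(ranks.values)
```

The outlier still stands out among the standard scores, but among the
ranks it is only one step above its neighbor.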

\section{Covariance}
\index{covariance}
\index{deviation}

{\bf Covariance} is a measure of the tendency of two variables
to vary together.  If we have two series, $X$ and $Y$, their
deviations from the mean are
%
\[ dx_i = x_i - \xbar \]
\[ dy_i = y_i - \ybar \]
%
where $\xbar$ is the sample mean of $X$ and $\ybar$ is the sample mean
of $Y$.  If $X$ and $Y$ vary together, their deviations tend to have
the same sign.

If we multiply them together, the product is positive when the
deviations have the same sign and negative when they have the opposite
sign.  So adding up the products gives a measure of the tendency to
vary together.

Covariance is the mean of these products:
%
\[ Cov(X,Y) = \frac{1}{n} \sum dx_i~dy_i \]
%
where $n$ is the length of the two series (they have to be the same
length).

If you have studied linear algebra, you might recognize that
{\tt Cov} is the dot product of the deviations, divided
by their length.  So the covariance is maximized if the two vectors
are identical, 0 if they are orthogonal, and negative if they
point in opposite directions.  {\tt thinkstats2} uses {\tt np.dot} to
implement {\tt Cov} efficiently:
\index{linear algebra}
\index{dot product}
\index{orthogonal vector}

\begin{verbatim}
def Cov(xs, ys, meanx=None, meany=None):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    if meanx is None:
        meanx = np.mean(xs)
    if meany is None:
        meany = np.mean(ys)

    cov = np.dot(xs-meanx, ys-meany) / len(xs)
    return cov
\end{verbatim}

By default {\tt Cov} computes deviations from the sample means,
or you can provide known means.  If {\tt xs} and {\tt ys} are
Python sequences, {\tt np.asarray} converts them to NumPy arrays.
If they are already NumPy arrays, {\tt np.asarray} does nothing.
\index{NumPy}

This implementation of covariance is meant to be simple for purposes
of explanation.  NumPy and pandas also provide implementations of
covariance, but both of them apply a correction for small sample sizes
that we have not covered yet, and {\tt np.cov} returns a covariance
matrix, which is more than we need for now.
\index{pandas}
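
To see how the simple definition relates to {\tt np.cov}, here is a
sketch on made-up data.  The lowercase {\tt cov} below is a stand-in
written for this example, not the {\tt thinkstats2} function.

```python
import numpy as np

def cov(xs, ys):
    """Covariance as defined in the text: mean product of deviations."""
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    return np.dot(xs - xs.mean(), ys - ys.mean()) / len(xs)

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

simple = cov(xs, ys)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(X, Y).
# By default it divides by n-1; ddof=0 makes it divide by n instead.
corrected = np.cov(xs, ys)[0, 1]
uncorrected = np.cov(xs, ys, ddof=0)[0, 1]

print(simple, uncorrected, corrected)
```

With {\tt ddof=0} the off-diagonal element matches the simple
definition; the default result is larger by a factor of $n/(n-1)$.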

\section{Pearson's correlation}
\index{correlation}
\index{standard score}

Covariance is useful in some computations, but it is seldom reported
as a summary statistic because it is hard to interpret.  Among other
problems, its units are the product of the units of $X$ and $Y$.  For
example, the covariance of weight and height in the BRFSS dataset is
113 kilogram-centimeters, whatever that means.
\index{deviation}
\index{units}

One solution to this problem is to divide the deviations by the standard
deviation, which yields standard scores, and compute the product of
standard scores:
%
\[ p_i = \frac{(x_i - \xbar)}{S_X} \frac{(y_i - \ybar)}{S_Y} \]
%
where $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$.
The mean of these products is \index{standard deviation}
%
\[ \rho = \frac{1}{n} \sum p_i \]
%
Or we can rewrite $\rho$ by factoring out $S_X$ and
$S_Y$:
%
\[ \rho = \frac{Cov(X,Y)}{S_X S_Y} \]
%
This value is called {\bf Pearson's correlation} after Karl Pearson,
an influential early statistician.  It is easy to compute and easy to
interpret.  Because standard scores are dimensionless, so is $\rho$.
\index{Pearson, Karl}
\index{Pearson coefficient of correlation}

Here is the implementation in {\tt thinkstats2}:

\begin{verbatim}
def Corr(xs, ys):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    meanx, varx = MeanVar(xs)
    meany, vary = MeanVar(ys)

    corr = Cov(xs, ys, meanx, meany) / math.sqrt(varx * vary)
    return corr
\end{verbatim}

{\tt MeanVar} computes mean and variance slightly more efficiently
than separate calls to {\tt np.mean} and {\tt np.var}.
\index{MeanVar}

Pearson's correlation is always between $-1$ and $+1$ (including both).
If $\rho$ is positive, we say that the correlation is positive,
which means that when one variable is high, the other tends to be
high.  If $\rho$ is negative, the correlation is negative, so
when one variable is high, the other tends to be low.

The magnitude of $\rho$ indicates the strength of the correlation.  If
$\rho$ is 1 or $-1$, the variables are perfectly correlated, which means
that if you know one, you can make a perfect prediction about the
other.  \index{prediction}

Most correlation in the real world is not perfect, but it is still
useful.  The correlation of height and weight is 0.51, which is a
strong correlation compared to similar human-related variables.
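
The claim that $\rho$ is dimensionless can be checked with a short
sketch.  The lowercase {\tt corr} below is a stand-in written for this
example (it is not the {\tt thinkstats2} function), and the data are
made up.

```python
import numpy as np

def corr(xs, ys):
    """Pearson's correlation: covariance over both standard deviations."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    cov = np.mean((xs - xs.mean()) * (ys - ys.mean()))
    return cov / (xs.std() * ys.std())

# Made-up data for illustration.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 4.0, 5.0, 9.0])

rho = corr(xs, ys)
reference = np.corrcoef(xs, ys)[0, 1]

# Changing units (say, cm to m and kg to g) rescales the covariance
# but leaves the correlation unchanged.
rescaled = corr(xs / 100, ys * 1000)

print(rho, reference, rescaled)
```

All three values agree up to floating-point error, which is why $\rho$
is easier to report than a covariance in kilogram-centimeters.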

\section{Nonlinear relationships}

If Pearson's correlation is near 0, it is tempting to conclude
that there is no relationship between the variables, but that
conclusion is not valid.  Pearson's correlation only measures {\em
  linear\/} relationships.  If there's a nonlinear relationship, $\rho$
understates its strength.  \index{linear relationship}
\index{nonlinear}
\index{Pearson coefficient of correlation}

\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/Correlation_examples.png}}
\caption{Examples of datasets with a range of correlations.}
\label{corr_examples}
\end{figure}

Figure~\ref{corr_examples} is from
\url{http://wikipedia.org/wiki/Correlation_and_dependence}.  It shows
scatter plots and correlation coefficients for several
carefully constructed datasets.
\index{scatter plot}
\index{plot!scatter}

The top row shows linear relationships with a range of correlations;
you can use this row to get a sense of what different values of
$\rho$ look like.  The second row shows perfect correlations with a
range of slopes, which demonstrates that correlation is unrelated to
slope (we'll talk about estimating slope soon).  The third row shows
variables that are clearly related, but because the relationship is
nonlinear, the correlation coefficient is 0.
\index{nonlinear}

The moral of this story is that you should always look at a scatter
plot of your data before blindly computing a correlation coefficient.
\index{correlation}
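
A minimal example of the third-row situation, using made-up data in
which one variable is completely determined by the other:

```python
import numpy as np

# y is completely determined by x, but the relationship is not linear.
xs = np.linspace(-1, 1, 201)
ys = xs**2

rho = np.corrcoef(xs, ys)[0, 1]
print(rho)
```

Because the parabola is symmetric around $x=0$, the positive and
negative products of deviations cancel, and $\rho$ is 0 up to
floating-point error.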

\section{Spearman's rank correlation}

Pearson's correlation works well if the relationship between variables
is linear and if the variables are roughly normal.  But it is not
robust in the presence of outliers.
\index{Pearson coefficient of correlation}
\index{Spearman coefficient of correlation}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}
\index{robust}
Spearman's rank correlation is an alternative that mitigates the
effect of outliers and skewed distributions.  To compute Spearman's
correlation, we have to compute the {\bf rank} of each value, which is its
index in the sorted sample.  For example, in the sample {\tt [1, 2, 5, 7]}
the rank of the value 5 is 3, because it appears third in the sorted
list.  Then we compute Pearson's correlation for the ranks.
\index{skewness}
\index{outlier}
\index{rank}

{\tt thinkstats2} provides a function that computes Spearman's rank
correlation:

\begin{verbatim}
def SpearmanCorr(xs, ys):
    xranks = pandas.Series(xs).rank()
    yranks = pandas.Series(ys).rank()
    return Corr(xranks, yranks)
\end{verbatim}

I convert the arguments to pandas Series objects so I can use
{\tt rank}, which computes the rank for each value and returns
a Series.  Then I use {\tt Corr} to compute the correlation
of the ranks.
\index{pandas}
\index{Series}

I could also use {\tt Series.corr} directly and specify
Spearman's method:

\begin{verbatim}
def SpearmanCorr(xs, ys):
    xs = pandas.Series(xs)
    ys = pandas.Series(ys)
    return xs.corr(ys, method='spearman')
\end{verbatim}

The Spearman rank correlation for the BRFSS data is 0.54, a little
higher than the Pearson correlation, 0.51.  There are several possible
reasons for the difference, including:
\index{rank correlation}
\index{BRFSS}

\begin{itemize}

\item If the relationship is
nonlinear, Pearson's correlation tends to underestimate the strength
of the relationship, and
\index{nonlinear}

\item Pearson's correlation can be affected (in either direction)
if one of the distributions is skewed or contains outliers.  Spearman's
rank correlation is more robust.
\index{skewness}
\index{outlier}
\index{robust}

\end{itemize}

In the BRFSS example, we know that the distribution of weights is
roughly lognormal; under a log transform it approximates a normal
distribution, so it has no skew.
So another way to eliminate the effect of skewness is to
compute Pearson's
correlation with log-weight and height:
\index{lognormal distribution}
\index{distribution!lognormal}

\begin{verbatim}
    thinkstats2.Corr(df.htm3, np.log(df.wtkg2))
\end{verbatim}

The result is 0.53, close to the rank correlation, 0.54.  So that
suggests that skewness in the distribution of weight explains most of
the difference between Pearson's and Spearman's correlation.
\index{skewness}
\index{Spearman coefficient of correlation}
\index{Pearson coefficient of correlation}
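
The effect of nonlinearity alone can be isolated with made-up data in
which one variable increases exponentially with the other.  This sketch
computes Spearman's correlation as Pearson's correlation of the ranks,
using only pandas; the data and variable names are assumptions for
illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: ys increases with xs, but exponentially, not linearly.
xs = pd.Series(np.arange(1, 11, dtype=float))
ys = np.exp(xs)

pearson = xs.corr(ys)                   # Pearson on the raw values
spearman = xs.rank().corr(ys.rank())    # Pearson on the ranks

print(pearson, spearman)
```

Because the relationship is monotonic, the ranks of the two variables
are identical and the rank correlation is 1, while Pearson's
correlation on the raw values is noticeably lower.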

\section{Correlation and causation}
\index{correlation}
\index{causation}

If variables A and B are correlated, there are three possible
explanations: A causes B, or B causes A, or some other set of factors
causes both A and B.  These explanations are called ``causal
relationships.''
\index{causal relationship}

Correlation alone does not distinguish between these explanations,
so it does not tell you which ones are true.
This rule is often summarized with the phrase ``Correlation
does not imply causation,'' which is so pithy it has its own
Wikipedia page: \url{http://wikipedia.org/wiki/Correlation_does_not_imply_causation}.

So what can you do to provide evidence of causation?

\begin{enumerate}

\item Use time.  If A comes before B, then A can cause B but not the
  other way around (at least according to our common understanding of
  causation).  The order of events can help us infer the direction
  of causation, but it does not preclude the possibility that something
  else causes both A and B.

\item Use randomness.  If you divide a large sample into two
  groups at random and compute the means of almost any variable, you
  expect the difference to be small.
  If the groups are nearly identical in all variables but one, you
  can eliminate spurious relationships.
  \index{spurious relationship}

  This works even if you don't know what the relevant variables
  are, but it works even better if you do, because you can check that
  the groups are identical.

\end{enumerate}

These ideas are the motivation for the {\bf randomized controlled
trial}, in which subjects are assigned randomly to two (or more)
groups: a {\bf treatment group} that receives some kind of intervention,
like a new medicine, and a {\bf control group} that receives
no intervention, or another treatment whose effects are known.
\index{randomized controlled trial}
\index{controlled trial}
\index{treatment group}
\index{control group}
\index{medicine}

A randomized controlled trial is the most reliable way to demonstrate
a causal relationship, and the foundation of science-based medicine
(see \url{http://wikipedia.org/wiki/Randomized_controlled_trial}).

Unfortunately, controlled trials are only possible in the laboratory
sciences, medicine, and a few other disciplines.  In the social sciences,
controlled experiments are rare, usually because they are impossible
or unethical.
\index{ethics}

An alternative is to look for a {\bf natural experiment}, where
different ``treatments'' are applied to groups that are otherwise
similar.  One danger of natural experiments is that the groups might
differ in ways that are not apparent.  You can read more about this
topic at \url{http://wikipedia.org/wiki/Natural_experiment}.
\index{natural experiment}

In some cases it is possible to infer causal relationships using {\bf
  regression analysis}, which is the topic of Chapter~\ref{regression}.
\index{regression analysis}

\section{Exercises}

A solution to this exercise is in \verb"chap07soln.py".

\begin{exercise}
Using data from the NSFG, make a scatter plot of birth weight
versus mother's age.  Plot percentiles of birth weight
versus mother's age.  Compute Pearson's and Spearman's correlations.
How would you characterize the relationship
between these variables?
\index{birth weight}
\index{weight!birth}
\index{Pearson coefficient of correlation}
\index{Spearman coefficient of correlation}
\end{exercise}


\section{Glossary}

\begin{itemize}

\item scatter plot: A visualization of the relationship between
two variables, showing one point for each row of data.
\index{scatter plot}

\item jitter: Random noise added to data for purposes of
visualization.
\index{jitter}

\item saturation: Loss of information when multiple points are
plotted on top of each other.
\index{saturation}

\item correlation: A statistic that measures the strength of the
relationship between two variables.
\index{correlation}

\item standardize: To transform a set of values so that their mean is 0 and
their variance is 1.
\index{standardize}

\item standard score: A value that has been standardized so that it is
  expressed in standard deviations from the mean.
  \index{standard score}
\index{standard deviation}

\item covariance: A measure of the tendency of two variables
to vary together.
\index{covariance}

\item rank: The index where an element appears in a sorted list.
\index{rank}

\item randomized controlled trial: An experimental design in which subjects
are divided into groups at random, and different groups are given different
treatments.
\index{randomized controlled trial}

\item treatment group: A group in a controlled trial that receives
some kind of intervention.
\index{treatment group}

\item control group: A group in a controlled trial that receives no
treatment, or a treatment whose effect is known.
\index{control group}

\item natural experiment: An experimental design that takes advantage of
a natural division of subjects into groups in ways that are at least
approximately random.
\index{natural experiment}

\end{itemize}

\chapter{Estimation}
\label{estimation}
\index{estimation}

The code for this chapter is in {\tt estimation.py}.  For information
about downloading and working with this code, see Section~\ref{code}.


\section{The estimation game}

Let's play a game.  I think of a distribution, and you have to guess
what it is.  I'll give you two hints: it's a
normal distribution, and here's a random sample drawn from it:
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}

What do you think is the mean parameter, $\mu$, of this distribution?
\index{mean}
\index{parameter}

One choice is to use the sample mean, $\xbar$, as an estimate of $\mu$.
In this example, $\xbar$ is 0.155, so it would
be reasonable to guess $\mu = 0.155$.
This process is called {\bf estimation}, and the statistic we used
(the sample mean) is called an {\bf estimator}.
\index{estimator}

Using the sample mean to estimate $\mu$ is so obvious that it is hard
to imagine a reasonable alternative.  But suppose we change the game by
introducing outliers.
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\em I'm thinking of a distribution.\/}  It's a normal distribution, and
here's a sample that was collected by an unreliable surveyor who
occasionally puts the decimal point in the wrong place.
\index{measurement error}

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -213.8]}

Now what's your estimate of $\mu$?  If you use the sample mean, your
guess is $-35.12$.  Is that the best choice?  What are the alternatives?
\index{outlier}

One option is to identify and discard outliers, then compute the sample
mean of the rest.  Another option is to use the median as an estimator.
\index{median}
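
As a quick check of these numbers, the following sketch computes the
mean and median of both samples.  The variable names are chosen for
this example.

```python
import numpy as np

clean = [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]
corrupted = [-0.441, 1.774, -0.101, -1.138, 2.975, -213.8]

mean_clean, median_clean = np.mean(clean), np.median(clean)
mean_bad, median_bad = np.mean(corrupted), np.median(corrupted)

print('clean    ', mean_clean, median_clean)
print('corrupted', mean_bad, median_bad)
```

The misplaced decimal point moves the sample mean from 0.155 to
$-35.12$, but the median is $-0.271$ in both cases, because the
corrupted value was already the smallest in the sample.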

Which estimator is best depends on the circumstances (for example,
whether there are outliers) and on what the goal is.  Are you
trying to minimize errors, or maximize your chance of getting the
right answer?
\index{error}
\index{MSE}
\index{mean squared error}

If there are no outliers, the sample mean minimizes the {\bf mean squared
error} (MSE).  That is, if we play the game many times, and each time
compute the error $\xbar - \mu$, the sample mean minimizes
%
\[ MSE = \frac{1}{m} \sum (\xbar - \mu)^2 \]
%
where $m$ is the number of times you play the estimation game, not
to be confused with $n$, which is the size of the sample used to
compute $\xbar$.

Here is a function that simulates the estimation game and computes
the root mean squared error (RMSE), which is the square root of
MSE:
\index{mean squared error}
\index{MSE}
\index{RMSE}

\begin{verbatim}
def Estimate1(n=7, m=1000):
    mu = 0
    sigma = 1

    means = []
    medians = []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        xbar = np.mean(xs)
        median = np.median(xs)
        means.append(xbar)
        medians.append(median)

    print('rmse xbar', RMSE(means, mu))
    print('rmse median', RMSE(medians, mu))
\end{verbatim}

Again, {\tt n} is the size of the sample, and {\tt m} is the
number of times we play the game.  {\tt means} is the list of
estimates based on $\xbar$.  {\tt medians} is the list of medians.
\index{median}

Here's the function that computes RMSE:

\begin{verbatim}
def RMSE(estimates, actual):
    e2 = [(estimate-actual)**2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)
\end{verbatim}

{\tt estimates} is a list of estimates; {\tt actual} is the
actual value being estimated.  In practice, of course, we don't
know {\tt actual}; if we did, we wouldn't have to estimate it.
The purpose of this experiment is to compare the performance of
the two estimators.
\index{estimator}

When I ran this code, the RMSE of the sample mean was 0.41, which
means that if we use $\xbar$ to estimate the mean of this
distribution, based on a sample with $n=7$, we should expect to be off
by 0.41 on average.  Using the median to estimate the mean yields
RMSE 0.53, which confirms that $\xbar$ yields lower RMSE, at least
for this example.

Minimizing MSE is a nice property, but it's not always the best
strategy.  For example, suppose we are estimating the distribution of
wind speeds at a building site.  If the estimate is too high, we might
overbuild the structure, increasing its cost.  But if it's too
low, the building might collapse.  Because cost as a function of
error is not symmetric, minimizing MSE is not the best strategy.
\index{prediction}
\index{cost function}
\index{MSE}

As another example, suppose I roll three six-sided dice and ask you to
predict the total.  If you get it exactly right, you get a prize;
otherwise you get nothing.  In this case the value that minimizes MSE
is 10.5, but that would be a bad guess, because the total of three
dice is never 10.5.  For this game, you want an estimator that has the
highest chance of being right, which is a {\bf maximum likelihood
  estimator} (MLE).  If you pick 10 or 11, your chance of winning is 1
in 8, and that's the best you can do.  \index{MLE}
\index{maximum likelihood estimator}
\index{dice}
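
The numbers in the dice example can be verified by enumerating all
possible rolls.  This sketch uses only the Python standard library; the
variable names are chosen for this example.

```python
from collections import Counter
from itertools import product

# Count how often each total occurs among the 6**3 equally likely rolls.
totals = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))
n_outcomes = sum(totals.values())

# The expected total is the guess that minimizes MSE.
mean_total = sum(t * count for t, count in totals.items()) / n_outcomes

# The most likely totals maximize the chance of being exactly right.
best_count = max(totals.values())
modes = sorted(t for t, count in totals.items() if count == best_count)

print(mean_total, modes, best_count, n_outcomes)
```

The totals 10 and 11 each occur in 27 of the 216 possible rolls, which
is the 1 in 8 chance mentioned above, and the expected total is 10.5.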

\section{Guess the variance}
\index{variance}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

{\em I'm thinking of a distribution.\/}  It's a normal distribution, and
here's a (familiar) sample:

{\tt [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]}

What do you think is the variance, $\sigma^2$, of my distribution?
Again, the obvious choice is to use the sample variance, $S^2$, as an
estimator.
%
\[ S^2 = \frac{1}{n} \sum (x_i - \xbar)^2 \]
%
For large samples, $S^2$ is an adequate estimator, but for small
samples it tends to be too low.  Because of this unfortunate
property, it is called a {\bf biased} estimator.
An estimator is {\bf unbiased} if the expected total (or mean) error,
after many iterations of the estimation game, is 0.
\index{sample variance}
\index{biased estimator}
\index{estimator!biased}
\index{unbiased estimator}
\index{estimator!unbiased}

Fortunately, there is another simple statistic that is an unbiased
estimator of $\sigma^2$:
%
\[ S_{n-1}^2 = \frac{1}{n-1} \sum (x_i - \xbar)^2 \]
%
For an explanation of why $S^2$ is biased, and a proof that
$S_{n-1}^2$ is unbiased, see
\url{http://wikipedia.org/wiki/Bias_of_an_estimator}.

The biggest problem with this estimator is that its name and symbol
are used inconsistently.  The name ``sample variance'' can refer to
either $S^2$ or $S_{n-1}^2$, and the symbol $S^2$ is used
for either or both.

Here is a function that simulates the estimation game and tests
the performance of $S^2$ and $S_{n-1}^2$:

\begin{verbatim}
def Estimate2(n=7, m=1000):
    mu = 0
    sigma = 1

    estimates1 = []
    estimates2 = []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        biased = np.var(xs)
        unbiased = np.var(xs, ddof=1)
        estimates1.append(biased)
        estimates2.append(unbiased)

    print('mean error biased', MeanError(estimates1, sigma**2))
    print('mean error unbiased', MeanError(estimates2, sigma**2))
\end{verbatim}

Again, {\tt n} is the sample size and {\tt m} is the number of times
we play the game.  {\tt np.var} computes $S^2$ by default and
$S_{n-1}^2$ if you provide the argument {\tt ddof=1}, which stands for
``delta degrees of freedom.''  I won't explain that term, but you can read
about it at
\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.
\index{degrees of freedom}
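
To see the effect of {\tt ddof} on a single sample, here is a short
sketch that applies both versions of {\tt np.var} to the familiar
sample.  The variable names are chosen for this example.

```python
import numpy as np

sample = [-0.441, 1.774, -0.101, -1.138, 2.975, -2.138]
n = len(sample)

biased = np.var(sample)            # divides by n
unbiased = np.var(sample, ddof=1)  # divides by n - 1

print(biased, unbiased, unbiased / biased)
```

The two results differ by a factor of $n/(n-1)$, which is $6/5$ for
this sample, so the difference matters most when $n$ is small.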

{\tt MeanError} computes the mean difference between the estimates
and the actual value:

\begin{verbatim}
def MeanError(estimates, actual):
    errors = [estimate-actual for estimate in estimates]
    return np.mean(errors)
\end{verbatim}

When I ran this code, the mean error for $S^2$ was $-0.13$.  As
expected, this biased estimator tends to be too low.  For $S_{n-1}^2$,
the mean error was 0.014, about 10 times smaller.  As {\tt m}
increases, we expect the mean error for $S_{n-1}^2$ to approach 0.
\index{mean error}
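
The size of the bias agrees with a standard property of $S^2$, stated
here without proof (see the Wikipedia page cited above): its expected
value is smaller than $\sigma^2$ by a factor of $(n-1)/n$.  With the
parameters used in {\tt Estimate2}, $n=7$ and $\sigma=1$, the expected
error is
%
\[ E[S^2] - \sigma^2 = \frac{n-1}{n} \sigma^2 - \sigma^2
   = -\frac{\sigma^2}{n} = -\frac{1}{7} \approx -0.14 \]
%
which is close to the simulated value of $-0.13$.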

Properties like MSE and bias are long-term expectations based on
many iterations of the estimation game.  By running simulations like
the ones in this chapter, we can compare estimators and check whether
they have desired properties.
\index{biased estimator}
\index{estimator!biased}

But when you apply an estimator to real
data, you just get one estimate.  It would not be meaningful to say
that the estimate is unbiased; being unbiased is a property of the
estimator, not the estimate.

After you choose an estimator with appropriate properties, and use it to
generate an estimate, the next step is to characterize the
uncertainty of the estimate, which is the topic of the next
section.

5960
\section{Sampling distributions}
\label{gorilla}

Suppose you are a scientist studying gorillas in a wildlife
preserve.  You want to know the average weight of the adult
female gorillas in the preserve.  To weigh them, you have
to tranquilize them, which is dangerous, expensive, and possibly
harmful to the gorillas.  But if it is important to obtain this
information, it might be acceptable to weigh a sample of 9
gorillas.  Let's assume that the population of the preserve is
well known, so we can choose a representative sample of adult
females.  We could use the sample mean, $\xbar$, to estimate the
unknown population mean, $\mu$.
\index{gorilla}
\index{population}
\index{sample}

Having weighed 9 female gorillas, you might find $\xbar=90$ kg and
sample standard deviation, $S=7.5$ kg.  The sample mean
is an unbiased estimator of $\mu$, and in the long run it
minimizes MSE.  So if you report a single
estimate that summarizes the results, you would report 90 kg.
\index{MSE}
\index{sample mean}
\index{biased estimator}
\index{estimator!biased}
\index{standard deviation}

But how confident should you be in this estimate?  If you only weigh
$n=9$ gorillas out of a much larger population, you might be unlucky
and choose the 9 heaviest gorillas (or the 9 lightest ones) just by
chance.  Variation in the estimate caused by random selection is
called {\bf sampling error}.
\index{sampling error}

To quantify sampling error, we can simulate the
sampling process with hypothetical values of $\mu$ and $\sigma$, and
see how much $\xbar$ varies.

Since we don't know the actual values of
$\mu$ and $\sigma$ in the population, we'll use the estimates
$\xbar$ and $S$.
So the question we answer is:
``If the actual values of $\mu$ and $\sigma$ were 90 kg and 7.5 kg,
and we ran the same experiment many times, how much would the
estimated mean, $\xbar$, vary?''

The following function answers that question:

\begin{verbatim}
def SimulateSample(mu=90, sigma=7.5, n=9, m=1000):
    means = []
    for j in range(m):
        xs = np.random.normal(mu, sigma, n)
        xbar = np.mean(xs)
        means.append(xbar)

    cdf = thinkstats2.Cdf(means)
    ci = cdf.Percentile(5), cdf.Percentile(95)
    stderr = RMSE(means, mu)
    # return the summaries so the caller can use them
    return stderr, ci
\end{verbatim}

{\tt mu} and {\tt sigma} are the {\em hypothetical\/} values of
the parameters.  {\tt n} is the sample size, the number of
gorillas we measured.  {\tt m} is the number of times we run
the simulation.
\index{gorilla}
\index{sample size}
\index{simulation}

\begin{figure}
% estimation.py
\centerline{\includegraphics[height=2.5in]{figs/estimation1.pdf}}
\caption{Sampling distribution of $\xbar$, with confidence interval.}
\label{estimation1}
\end{figure}

In each iteration, we choose {\tt n} values from a normal
distribution with the given parameters, and compute the sample mean,
{\tt xbar}.  We run 1000 simulations and then compute the
distribution, {\tt cdf}, of the estimates.  The result is shown in
Figure~\ref{estimation1}.  This distribution is called the {\bf
  sampling distribution} of the estimator.  It shows how much the
estimates would vary if we ran the experiment over and over.
\index{sampling distribution}

The mean of the sampling distribution is pretty close
to the hypothetical value of $\mu$, which means that the experiment
yields the right answer, on average.  After 1000 tries, the lowest
result is 82 kg, and the highest is 98 kg.  This range suggests that
the estimate might be off by as much as 8 kg.

There are two common ways to summarize the sampling distribution:

\begin{itemize}

\item {\bf Standard error} (SE) is a measure of how far we expect the
  estimate to be off, on average.  For each simulated experiment, we
  compute the error, $\xbar - \mu$, and then compute the root mean
  squared error (RMSE).  In this example, it is roughly 2.5 kg.
\index{standard error}

\item A {\bf confidence interval} (CI) is a range that includes a
  given fraction of the sampling distribution.  For example, the 90\%
  confidence interval is the range from the 5th to the 95th
  percentile.  In this example, the 90\% CI is $(86, 94)$ kg.
\index{confidence interval}
\index{sampling distribution}

\end{itemize}
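
The two summaries above can be sketched in a few lines of NumPy.  This
is a minimal, self-contained illustration, not the code in
{\tt estimation.py}: the function name \verb"summarize_sampling_dist",
the fixed seed, and the use of {\tt np.percentile} in place of
{\tt thinkstats2.Cdf} are assumptions made for the example.  For a
normal population, the simulated standard error can be compared with
the analytic value $\sigma/\sqrt{n} = 7.5/3 = 2.5$ kg.

\begin{verbatim}
import numpy as np

def summarize_sampling_dist(mu=90, sigma=7.5, n=9, m=1000, seed=17):
    """Simulate m experiments; summarize the distribution of xbar."""
    # seeded only so the sketch is repeatable
    rng = np.random.default_rng(seed)
    # each row is one simulated experiment with n measurements
    means = rng.normal(mu, sigma, size=(m, n)).mean(axis=1)
    # standard error: RMSE of the estimates around the hypothetical mu
    stderr = np.sqrt(np.mean((means - mu)**2))
    # 90% confidence interval: 5th and 95th percentiles
    ci = np.percentile(means, [5, 95])
    return stderr, ci

stderr, ci = summarize_sampling_dist()
print('standard error', stderr)
print('90% CI', ci)
print('analytic SE', 7.5 / np.sqrt(9))
\end{verbatim}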

Standard errors and confidence intervals are the source of much confusion:

\begin{itemize}

\item People often confuse standard error and standard deviation.
  Remember that standard deviation describes variability in a measured
  quantity; in this example, the standard deviation of gorilla weight
  is 7.5 kg.  Standard error describes variability in an estimate.  In
  this example, the standard error of the mean, based on a sample of 9
  measurements, is 2.5 kg.
\index{gorilla}
\index{standard deviation}

  One way to remember the difference is that, as sample size
  increases, standard error gets smaller; standard deviation does not.

\item People often think that there is a 90\% probability that the
  actual parameter, $\mu$, falls in the 90\% confidence interval.
  Sadly, that is not true.  If you want to make a claim like that, you
  have to use Bayesian methods (see my book, {\it Think Bayes\/}).
\index{Bayesian statistics}

  The sampling distribution answers a different question: it gives you
  a sense of how reliable an estimate is by telling you how much it
  would vary if you ran the experiment again.
\index{sampling distribution}

\end{itemize}
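
The first point, that standard error shrinks as the sample grows while
standard deviation does not, can be checked by simulation.  The sketch
below is illustrative only: the helper name \verb"stderr_of_mean" and
the sample sizes are assumptions, and the population is taken to be
normal with the hypothetical parameters used above.  Each printed row
shows $n$, the simulated standard error, and $\sigma/\sqrt{n}$.

\begin{verbatim}
import numpy as np

def stderr_of_mean(n, mu=90, sigma=7.5, m=2000, seed=1):
    """RMSE of the sample mean over m simulated samples of size n."""
    rng = np.random.default_rng(seed)
    means = rng.normal(mu, sigma, size=(m, n)).mean(axis=1)
    return np.sqrt(np.mean((means - mu)**2))

sizes = [9, 36, 144]
errors = [stderr_of_mean(n) for n in sizes]
for n, se in zip(sizes, errors):
    # sigma stays 7.5 throughout; only the standard error changes
    print(n, se, 7.5 / np.sqrt(n))
\end{verbatim}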

It is important to remember that confidence intervals
and standard errors only quantify sampling error; that is,
error due to measuring only part of the population.
The sampling distribution does not account for other
sources of error, notably sampling bias and measurement error,
which are the topics of the next section.


\section{Sampling bias}

Suppose that instead of the weight of gorillas in a nature preserve,
you want to know the average weight of women in the city where you
live.  It is unlikely that you would be allowed
to choose a representative sample of women and
weigh them.
\index{gorilla}
\index{adult weight}
\index{sampling bias}
\index{bias!sampling}
\index{measurement error}

A simple alternative would be
``telephone sampling;'' that is,
you could choose random numbers from the phone book, call and ask to
speak to an adult woman, and ask how much she weighs.
\index{telephone sampling}
\index{random number}

Telephone sampling has obvious limitations.  For example, the sample
is limited to people whose telephone numbers are listed, so it
eliminates people without phones (who might be poorer than average)
and people with unlisted numbers (who might be richer).  Also, if you
call home telephones during the day, you are less likely to sample
people with jobs.  And if you only sample the person who answers the
phone, you are less likely to sample people who share a phone line.

If factors like income, employment, and household size are related
to weight---and it is plausible that they are---the results of your
survey would be affected one way or another.  This problem is
called {\bf sampling bias} because it is a property of the sampling
process.
\index{sampling bias}

This sampling process is also vulnerable to self-selection, which is a
kind of sampling bias.  Some people will refuse to answer the
question, and if the tendency to refuse is related to weight, that
would affect the results.
\index{self-selection}

Finally, if you ask people how much they weigh, rather than weighing
them, the results might not be accurate.  Even helpful respondents
might round up or down if they are uncomfortable with their actual
weight.  And not all respondents are helpful.  These inaccuracies are
examples of {\bf measurement error}.
\index{measurement error}

When you report an estimated quantity, it is useful to report
standard error, or a confidence interval, or both, in order to
quantify sampling error.  But it is also important to remember that
sampling error is only one source of error, and often it is not the
biggest.
\index{standard error}
\index{confidence interval}


\section{Exponential distributions}
\index{exponential distribution}
\index{distribution!exponential}

Let's play one more round of the estimation game.
{\em I'm thinking of a distribution.\/}  It's an exponential distribution, and
here's a sample:

{\tt [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]}

What do you think is the parameter, $\lambda$, of this distribution?
\index{parameter}
\index{mean}

\newcommand{\lamhat}{L}
\newcommand{\lamhatmed}{L_m}

In general, the mean of an exponential distribution is $1/\lambda$,
so working backwards, we might choose
%
\[ \lamhat = 1 / \xbar\]
%
$\lamhat$ is an
estimator of $\lambda$.  And not just any estimator; it is also the
maximum likelihood estimator (see
\url{http://wikipedia.org/wiki/Exponential_distribution#Maximum_likelihood}).
So if you want to maximize your chance of guessing $\lambda$ exactly,
$\lamhat$ is the way to go.
\index{MLE}
\index{maximum likelihood estimator}

But we know that $\xbar$ is not robust in the presence of outliers, so
we expect $\lamhat$ to have the same problem.
\index{robust}
\index{outlier}
\index{sample median}

We can choose an alternative based on the sample median.
The median of an exponential distribution is $\ln(2) / \lambda$,
so working backwards again, we can define an estimator
%
\[ \lamhatmed = \ln(2) / m \]
%
where $m$ is the sample median.
\index{median}

To test the performance of these estimators, we can simulate the
sampling process:

\begin{verbatim}
def Estimate3(n=7, m=1000):
    lam = 2

    means = []
    medians = []
    for _ in range(m):
        xs = np.random.exponential(1.0/lam, n)
        L = 1 / np.mean(xs)
        Lm = math.log(2) / thinkstats2.Median(xs)
        means.append(L)
        medians.append(Lm)

    print('rmse L', RMSE(means, lam))
    print('rmse Lm', RMSE(medians, lam))
    print('mean error L', MeanError(means, lam))
    print('mean error Lm', MeanError(medians, lam))
\end{verbatim}

When I run this experiment with $\lambda=2$, the RMSE of $L$ is
1.1.  For the median-based estimator $L_m$, RMSE is 1.8.  We can't
tell from this experiment whether $L$ minimizes MSE, but at least
it seems better than $L_m$.
\index{MSE}
\index{RMSE}

Sadly, it seems that both estimators are biased.  For $L$ the mean
error is 0.33; for $L_m$ it is 0.45.  And neither converges to 0
as {\tt m} increases.
\index{biased estimator}
\index{estimator!biased}

It turns out that $\xbar$ is an unbiased estimator of the mean
of the distribution, $1 / \lambda$, but $L$ is not an unbiased
estimator of $\lambda$.
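
To see where the bias of $L$ comes from, here is a short derivation.
It relies on a standard result that is not developed in this book:
the sum of $n$ independent exponential values with parameter
$\lambda$ follows a gamma distribution.  Writing $T$ for the sum of
the sample, so that $L = n / T$, and assuming $n > 1$,
%
\[ E[1/T] = \frac{\lambda}{n-1}
   \quad \mbox{and so} \quad
   E[L] = \frac{n \lambda}{n-1} \]
%
The bias is therefore
%
\[ E[L] - \lambda = \frac{\lambda}{n-1} \]
%
which depends on the sample size, $n$, but not on the number of
simulated experiments, {\tt m}.  With $\lambda=2$ and $n=7$, the bias
is $2/6 \approx 0.33$, consistent with the mean error reported above.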


\section{Exercises}

For the following exercises, you might want to start with a copy of
{\tt estimation.py}.  Solutions are in \verb"chap08soln.py".

\begin{exercise}

In this chapter we used $\xbar$ and median to estimate $\mu$, and
found that $\xbar$ yields lower MSE.
Also, we used $S^2$ and $S_{n-1}^2$ to estimate $\sigma^2$, and found that
$S^2$ is biased and $S_{n-1}^2$ unbiased.

Run similar experiments to see if $\xbar$ and median are biased estimators
of $\mu$.
Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.
\index{sample mean}
\index{sample median}
\index{estimator!biased}

\end{exercise}


\begin{exercise}

Suppose you draw a sample with size $n=10$ from
an exponential distribution with $\lambda=2$.  Simulate
this experiment 1000 times and plot the sampling distribution of
the estimate $\lamhat$.  Compute the standard error of the estimate
and the 90\% confidence interval.
\index{standard error}
\index{confidence interval}
\index{sampling distribution}

Repeat the experiment with a few different values of $n$ and make
a plot of standard error versus $n$.
\index{exponential distribution}
\index{distribution!exponential}


\end{exercise}


\begin{exercise}

In games like hockey and soccer, the time between goals is
roughly exponential.  So you could estimate a team's goal-scoring rate
by observing the number of goals they score in a game.  This
estimation process is a little different from sampling the time
between goals, so let's see how it works.
\index{hockey}
\index{soccer}

Write a function that takes a goal-scoring rate, {\tt lam}, in goals
per game, and simulates a game by generating the time between goals
until the total time exceeds 1 game, then returns the number of goals
scored.

Write another function that simulates many games, stores the
estimates of {\tt lam}, then computes their mean error and RMSE.

Is this way of making an estimate biased?  Plot the sampling
distribution of the estimates and the 90\% confidence interval.  What
is the standard error?  What happens to sampling error for increasing
values of {\tt lam}?
\index{estimator!biased}
\index{biased estimator}
\index{standard error}
\index{confidence interval}

\end{exercise}


\section{Glossary}

\begin{itemize}

\item estimation: The process of inferring the parameters of a distribution
from a sample.
\index{estimation}

\item estimator: A statistic used to estimate a parameter.
\index{estimator}

\item mean squared error (MSE): A measure of estimation error.
\index{mean squared error}
\index{MSE}

\item root mean squared error (RMSE): The square root of MSE,
a more meaningful representation of typical error magnitude.
\index{mean squared error}
\index{MSE}

\item maximum likelihood estimator (MLE): An estimator that computes the
point estimate most likely to be correct.
\index{MLE}
\index{maximum likelihood estimator}

\item bias (of an estimator): The tendency of an estimator to be above or
  below the actual value of the parameter, when averaged over repeated
  experiments.  \index{biased estimator}

\item sampling error: Error in an estimate due to the limited
  size of the sample and variation due to chance. \index{sampling error}

\item sampling bias: Error in an estimate due to a sampling process
  that is not representative of the population. \index{sampling bias}

\item measurement error: Error in an estimate due to inaccuracy in collecting
  or recording data. \index{measurement error}

\item sampling distribution: The distribution of a statistic if an
  experiment is repeated many times.  \index{sampling distribution}

\item standard error: The RMSE of an estimate,
which quantifies variability due to sampling error (but not
other sources of error).
\index{standard error}

\item confidence interval: An interval that represents the expected
  range of an estimator if an experiment is repeated many times.
  \index{confidence interval} \index{interval!confidence}

\end{itemize}


\chapter{Hypothesis testing}
\label{testing}

The code for this chapter is in {\tt hypothesis.py}.  For information
about downloading and working with this code, see Section~\ref{code}.

\section{Classical hypothesis testing}
\index{hypothesis testing}
\index{apparent effect}

Exploring the data from the NSFG, we saw several ``apparent effects,''
including differences between first babies and others.
So far we have taken these effects at face value; in this chapter,
we put them to the test.
\index{National Survey of Family Growth}
\index{NSFG}

The fundamental question we want to address is whether the effects
we see in a sample are likely to appear in the larger population.
For example, in the NSFG sample we see a difference in mean pregnancy
length for first babies and others.  We would like to know if
that effect reflects a real difference for women
in the U.S., or if it might appear in the sample by chance.
\index{pregnancy length} \index{length!pregnancy}

There are several ways we could formulate this question, including
Fisher null hypothesis testing, Neyman-Pearson decision theory, and
Bayesian inference\footnote{For more about Bayesian inference, see the
  sequel to this book, {\it Think Bayes}.}.  What I present here is a
subset of all three that makes up most of what people use in practice,
which I will call {\bf classical hypothesis testing}.
\index{Bayesian inference}
\index{null hypothesis}

The goal of classical hypothesis testing is to answer the question,
``Given a sample and an apparent effect, what is the probability of
seeing such an effect by chance?''  Here's how we answer that question:

\begin{itemize}

\item The first step is to quantify the size of the apparent effect by
  choosing a {\bf test statistic}.  In the NSFG example, the apparent
  effect is a difference in pregnancy length between first babies and
  others, so a natural choice for the test statistic is the difference
  in means between the two groups.
  \index{test statistic}

\item The second step is to define a {\bf null hypothesis}, which is a
  model of the system based on the assumption that the apparent effect
  is not real.  In the NSFG example the null hypothesis is that there
  is no difference between first babies and others; that is, that
  pregnancy lengths for both groups have the same distribution.
  \index{null hypothesis}
\index{pregnancy length}
\index{model}

\item The third step is to compute a {\bf p-value}, which is the
  probability of seeing the apparent effect if the null hypothesis is
  true.  In the NSFG example, we would compute the actual difference
  in means, then compute the probability of seeing a
  difference as big, or bigger, under the null hypothesis.
  \index{p-value}

\item The last step is to interpret the result.  If the p-value is
  low, the effect is said to be {\bf statistically significant}, which
  means that it is unlikely to have occurred by chance.  In that case
  we infer that the effect is more likely to appear in the larger
  population.  \index{statistically significant} \index{significant}

\end{itemize}
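
The four steps above can be sketched in plain NumPy.  This is only an
illustration of the procedure, not the implementation used in this
book, which is presented in the following sections; the function name
\verb"permutation_pvalue" and the synthetic data are assumptions made
for the example.

\begin{verbatim}
import numpy as np

def permutation_pvalue(group1, group2, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: test statistic, the absolute difference in means
    actual = abs(np.mean(group1) - np.mean(group2))
    # step 2: null hypothesis, both groups share one distribution,
    # modeled by pooling the values and shuffling them
    pool = np.concatenate([group1, group2])
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pool)
        stat = abs(pool[:n].mean() - pool[n:].mean())
        # step 3: count simulated effects as big as the observed one
        if stat >= actual:
            count += 1
    return count / iters

# synthetic data: the second group really is shifted by 1 unit
rng = np.random.default_rng(42)
a = rng.normal(10, 1, 100)
b = rng.normal(11, 1, 100)
pvalue = permutation_pvalue(a, b)
# step 4: interpret; a low p-value means the observed effect
# would rarely appear under the null hypothesis
print('p-value', pvalue)
\end{verbatim}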

The logic of this process is similar to a proof by
contradiction.  To prove a mathematical statement, A, you assume
temporarily that A is false.  If that assumption leads to a
contradiction, you conclude that A must actually be true.
\index{contradiction, proof by}
\index{proof by contradiction}

Similarly, to test a hypothesis like, ``This effect is real,'' we
assume, temporarily, that it is not.  That's the null hypothesis.
Based on that assumption, we compute the probability of the apparent
effect.  That's the p-value.  If the p-value is low, we
conclude that the null hypothesis is unlikely to be true.
\index{p-value}
\index{null hypothesis}


\section{HypothesisTest}
\label{hypotest}
\index{mean!difference in}

{\tt thinkstats2} provides {\tt HypothesisTest}, a
class that represents the structure of a classical hypothesis
test.  Here is the definition:
\index{HypothesisTest}

\begin{verbatim}
class HypothesisTest(object):

    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)

    def PValue(self, iters=1000):
        self.test_stats = [self.TestStatistic(self.RunModel())
                           for _ in range(iters)]

        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters

    def TestStatistic(self, data):
        raise UnimplementedMethodException()

    def MakeModel(self):
        pass

    def RunModel(self):
        raise UnimplementedMethodException()
\end{verbatim}

{\tt HypothesisTest} is an abstract parent class that provides
complete definitions for some methods and place-keepers for others.
Child classes based on {\tt HypothesisTest} inherit \verb"__init__"
and {\tt PValue} and provide {\tt TestStatistic},
{\tt RunModel}, and optionally {\tt MakeModel}.
\index{HypothesisTest}

\verb"__init__" takes the data in whatever form is appropriate.  It
calls {\tt MakeModel}, which builds a representation of the null
hypothesis, then passes the data to {\tt TestStatistic}, which
computes the size of the effect in the sample.
\index{test statistic}
\index{null hypothesis}

{\tt PValue} computes the probability of the apparent effect under
the null hypothesis.  It takes as a parameter {\tt iters}, which is
the number of simulations to run.  The first line generates simulated
data, computes test statistics, and stores them in
\verb"test_stats".
The result is
the fraction of elements in \verb"test_stats" that
exceed or equal the observed test statistic, {\tt self.actual}.
\index{simulation}

As a simple example\footnote{Adapted from MacKay, {\it Information
    Theory, Inference, and Learning Algorithms}, 2003.}, suppose we
toss a coin 250 times and see 140 heads and 110 tails.  Based on this
result, we might suspect that the coin is biased; that is, more likely
to land heads.  To test this hypothesis, we compute the
probability of seeing such a difference if the coin is actually
fair:
\index{biased coin}
\index{MacKay, David}

\begin{verbatim}
class CoinTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        heads, tails = data
        test_stat = abs(heads - tails)
        return test_stat

    def RunModel(self):
        heads, tails = self.data
        n = heads + tails
        sample = [random.choice('HT') for _ in range(n)]
        hist = thinkstats2.Hist(sample)
        data = hist['H'], hist['T']
        return data
\end{verbatim}

The parameter, {\tt data}, is a pair of
integers: the number of heads and tails.  The test statistic is
the absolute difference between them, so {\tt self.actual}
is 30.
\index{HypothesisTest}

{\tt RunModel} simulates coin tosses assuming that the coin is
actually fair.  It generates a sample of 250 tosses, uses Hist
to count the number of heads and tails, and returns a pair of
integers.
\index{Hist}
\index{model}

Now all we have to do is instantiate {\tt CoinTest} and call
{\tt PValue}:

\begin{verbatim}
    ct = CoinTest((140, 110))
    pvalue = ct.PValue()
\end{verbatim}

The result is about 0.07, which means that if the coin is
fair, we expect to see a difference as big as 30 about 7\% of the
time.
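
Because the null model here is a fair coin, the simulated p-value can
be cross-checked against an exact binomial calculation.  The sketch
below is an independent check and not part of {\tt thinkstats2}; the
function name \verb"exact_coin_pvalue" is an assumption, and
{\tt math.comb} requires Python 3.8 or later.  The result should be
close to the simulated value, but not identical, because the
simulation uses a finite number of iterations.

\begin{verbatim}
from math import comb

def exact_coin_pvalue(heads, tails):
    """Probability that |heads - tails| is at least as big as
    observed, for a fair coin tossed heads + tails times."""
    n = heads + tails
    observed = abs(heads - tails)
    # with k heads, the difference is |k - (n - k)| = |2k - n|
    favorable = sum(comb(n, k) for k in range(n + 1)
                    if abs(2*k - n) >= observed)
    # all 2**n sequences are equally likely under the null hypothesis
    return favorable / 2**n

pvalue = exact_coin_pvalue(140, 110)
print(pvalue)
\end{verbatim}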

How should we interpret this result?  By convention,
5\% is the threshold of statistical significance.  If the
p-value is less than 5\%, the effect is considered significant; otherwise
it is not.
\index{p-value}
\index{statistically significant} \index{significant}

But the choice of 5\% is arbitrary, and (as we will see later) the
p-value depends on the choice of the test statistic and
the model of the null hypothesis.  So p-values should not be considered
precise measurements.
\index{null hypothesis}

I recommend interpreting p-values according to their order of
magnitude: if the p-value is less than 1\%, the effect is unlikely to
be due to chance; if it is greater than 10\%, the effect can plausibly
be explained by chance.  P-values between 1\% and 10\% should be
considered borderline.  So in this example I conclude that the
data do not provide strong evidence either that the coin is biased
or that it is fair.


\section{Testing a difference in means}
\label{testdiff}
\index{mean!difference in}

One of the most common effects to test is a difference in mean
between two groups.  In the NSFG data, we saw that the mean pregnancy
length for first babies is slightly longer, and the mean birth weight
is slightly smaller.  Now we will see if those effects are
statistically significant.
\index{National Survey of Family Growth}
\index{NSFG}
\index{pregnancy length}
\index{length!pregnancy}

For these examples, the null hypothesis is that the distributions
for the two groups are the same.  One way to model the null
hypothesis is by {\bf permutation}; that is, we can take values
for first babies and others and shuffle them, treating
the two groups as one big group:
\index{null hypothesis}
\index{permutation}
\index{model}

\begin{verbatim}
class DiffMeansPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = abs(group1.mean() - group2.mean())
        return test_stat

    def MakeModel(self):
        group1, group2 = self.data
        self.n, self.m = len(group1), len(group2)
        self.pool = np.hstack((group1, group2))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
\end{verbatim}

{\tt data} is a pair of sequences, one for each
group.  The test statistic is the absolute difference in the means.
\index{HypothesisTest}

{\tt MakeModel} records the sizes of the groups, {\tt n} and
{\tt m}, and combines the groups into one NumPy
array, {\tt self.pool}.
\index{NumPy}

{\tt RunModel} simulates the null hypothesis by shuffling the
pooled values and splitting them into two groups with sizes {\tt n}
and {\tt m}.  As always, the return value from {\tt RunModel} has
the same format as the observed data.
\index{null hypothesis}
\index{model}

To test the difference in pregnancy length, we run:

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    data = firsts.prglngth.values, others.prglngth.values
    ht = DiffMeansPermute(data)
    pvalue = ht.PValue()
\end{verbatim}

{\tt MakeFrames} reads the NSFG data and returns DataFrames
representing all live births, first babies, and others.
We extract pregnancy lengths as NumPy arrays, pass them as
data to {\tt DiffMeansPermute}, and compute the p-value.  The
result is about 0.17, which means that we expect to see a difference
as big as the observed effect about 17\% of the time.  So
this effect is not statistically significant.
\index{DataFrame}
\index{p-value}
  \index{significant} \index{statistically significant}
\index{pregnancy length}

\begin{figure}
% hypothesis.py
\centerline{\includegraphics[height=2.5in]{figs/hypothesis1.pdf}}
\caption{CDF of difference in mean pregnancy length under the null
hypothesis.}
\label{hypothesis1}
\end{figure}

{\tt HypothesisTest} provides {\tt PlotCdf}, which plots the
distribution of the test statistic and a gray line indicating
the observed effect size:
\index{thinkplot}
\index{HypothesisTest}
\index{Cdf}
\index{effect size}

\begin{verbatim}
    ht.PlotCdf()
    thinkplot.Show(xlabel='test statistic',
                   ylabel='CDF')
\end{verbatim}

Figure~\ref{hypothesis1} shows the result.  The CDF intersects the
observed difference at 0.83, which is the complement of the p-value,
0.17.
\index{p-value}

If we run the same analysis with birth weight, the computed p-value
is 0; after 1000 attempts,
the simulation never yields an effect
as big as the observed difference, 0.12 lbs.  So we would
report $p < 0.001$, and
conclude that the difference in birth weight is statistically
significant.
\index{birth weight}
\index{weight!birth}
  \index{significant} \index{statistically significant}


\section{Other test statistics}

Choosing the best test statistic depends on what question you are
trying to address.  For example, if the relevant question is whether
pregnancy lengths are different for first
babies, then it makes sense to test the absolute difference in means,
as we did in the previous section.
\index{test statistic}
\index{pregnancy length}

If we had some reason to think that first babies are likely
to be late, then we would not take the absolute value of the difference;
instead we would use this test statistic:

\begin{verbatim}
class DiffMeansOneSided(DiffMeansPermute):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = group1.mean() - group2.mean()
        return test_stat
\end{verbatim}

{\tt DiffMeansOneSided} inherits {\tt MakeModel} and {\tt RunModel}
from {\tt DiffMeansPermute}; the only difference is that
{\tt TestStatistic} does not take the absolute value of the
difference.  This kind of test is called {\bf one-sided} because
it only counts one side of the distribution of differences.  The
previous test, using both sides, is {\bf two-sided}.
\index{one-sided test}
\index{two-sided test}

For this version of the test, the p-value is 0.09.  In general
the p-value for a one-sided test is about half the p-value for
a two-sided test, depending on the shape of the distribution.
\index{p-value}
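
The relationship between the two p-values can be seen by computing
both from the same set of shuffled differences.  The following sketch
uses synthetic data and plain NumPy rather than
{\tt DiffMeansPermute}; the function name \verb"both_pvalues", the
group names, and the generated values are assumptions made for
illustration, and they are not the NSFG data.  When the observed
difference is positive, every shuffled difference that counts toward
the one-sided p-value also counts toward the two-sided p-value, so
the one-sided value cannot be larger.

\begin{verbatim}
import numpy as np

def both_pvalues(group1, group2, iters=1000, seed=0):
    """Return (one-sided, two-sided) permutation p-values for
    the difference in means, group1 minus group2."""
    rng = np.random.default_rng(seed)
    actual = np.mean(group1) - np.mean(group2)
    pool = np.concatenate([group1, group2])
    n = len(group1)
    diffs = np.empty(iters)
    for i in range(iters):
        shuffled = rng.permutation(pool)
        diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()
    # one-sided: only differences in the hypothesized direction
    one_sided = np.mean(diffs >= actual)
    # two-sided: differences of either sign
    two_sided = np.mean(np.abs(diffs) >= abs(actual))
    return one_sided, two_sided

rng = np.random.default_rng(3)
late = rng.normal(39.0, 2.5, 400) + 0.3   # synthetic, shifted a bit
others = rng.normal(39.0, 2.5, 400)       # synthetic, no shift
one_sided, two_sided = both_pvalues(late, others)
print('one-sided', one_sided, 'two-sided', two_sided)
\end{verbatim}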

The one-sided hypothesis, that first babies are born late, is more
specific than the two-sided hypothesis, so the p-value is smaller.
But even for the stronger hypothesis, the difference is
not statistically significant.
  \index{significant} \index{statistically significant}

We can use the same framework to test for a difference in standard
deviation.  In Section~\ref{visualization}, we saw some evidence that
first babies are more likely to be early or late, and less likely to
be on time.  So we might hypothesize that the standard deviation is
higher.  Here's how we can test that:
\index{standard deviation}

\begin{verbatim}
class DiffStdPermute(DiffMeansPermute):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = group1.std() - group2.std()
        return test_stat
\end{verbatim}

This is a one-sided test because the hypothesis is that the standard
deviation for first babies is higher, not just different.  The p-value
is 0.09, which is not statistically significant.
6774
\index{p-value}
6775
\index{permutation}
6776
  \index{significant} \index{statistically significant}


\section{Testing a correlation}
\label{corrtest}

This framework can also test correlations.  For example, in the NSFG
data set, the correlation between birth weight and mother's age is
about 0.07.  It seems like older mothers have heavier babies.  But
could this effect be due to chance?
\index{correlation}
\index{test statistic}

For the test statistic, I use
Pearson's correlation, but Spearman's would work as well.
If we had reason to expect positive correlation, we would do a
one-sided test.  But since we have no such reason, I'll
do a two-sided test using the absolute value of correlation.
\index{Pearson coefficient of correlation}
\index{Spearman coefficient of correlation}

The null hypothesis is that there is no correlation between mother's
age and birth weight.  By shuffling the observed values, we can
simulate a world where the distributions of age and
birth weight are the same, but where the variables are unrelated:
\index{birth weight}
\index{weight!birth}
\index{null hypothesis}

\begin{verbatim}
class CorrelationPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        xs, ys = data
        test_stat = abs(thinkstats2.Corr(xs, ys))
        return test_stat

    def RunModel(self):
        xs, ys = self.data
        xs = np.random.permutation(xs)
        return xs, ys
\end{verbatim}

{\tt data} is a pair of sequences.  {\tt TestStatistic} computes the
absolute value of Pearson's correlation.  {\tt RunModel} shuffles the
{\tt xs} and returns simulated data.
\index{HypothesisTest}
\index{permutation}
\index{Pearson coefficient of correlation}

Here's the code that reads the data and runs the test:

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    data = live.agepreg.values, live.totalwgt_lb.values
    ht = CorrelationPermute(data)
    pvalue = ht.PValue()
\end{verbatim}

I use {\tt dropna} with the {\tt subset} argument to drop rows
that are missing either of the variables we need.
\index{dropna}
\index{NaN}
\index{missing values}

The actual correlation is 0.07.  The computed p-value is 0: after 1000
iterations the largest simulated correlation is 0.04, so none of the
simulated values is as big as the observed one.  A computed p-value
of 0 does not mean the effect is impossible under the null hypothesis,
only that the p-value is too small to estimate with 1000 iterations.
So although the observed correlation is small, it is statistically
significant.
\index{p-value}
  \index{significant} \index{statistically significant}

This example is a reminder that ``statistically significant'' does not
always mean that an effect is important, or significant in practice.
It only means that it is unlikely to have occurred by chance.
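
The same permutation test can be sketched without {\tt thinkstats2}
or the NSFG data.  In the following standalone version, the names
{\tt pearson\_corr} and {\tt corr\_permutation\_pvalue} are
hypothetical, the data are made up, and shuffling uses Python's
{\tt random} module in place of {\tt np.random.permutation}.

```python
import math
import random


def pearson_corr(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    meanx = sum(xs) / n
    meany = sum(ys) / n
    cov = sum((x - meanx) * (y - meany) for x, y in zip(xs, ys))
    varx = sum((x - meanx) ** 2 for x in xs)
    vary = sum((y - meany) ** 2 for y in ys)
    return cov / math.sqrt(varx * vary)


def corr_permutation_pvalue(xs, ys, iters=1000, seed=17):
    """Two-sided permutation p-value for the correlation of xs and ys."""
    rng = random.Random(seed)
    observed = abs(pearson_corr(xs, ys))
    shuffled = list(xs)
    count = 0
    for _ in range(iters):
        rng.shuffle(shuffled)  # break any relationship between xs and ys
        if abs(pearson_corr(shuffled, ys)) >= observed:
            count += 1
    return count / iters


# Made-up data with a strong linear relationship, for illustration only.
xs = list(range(20))
ys = [2 * x + (x % 3) for x in xs]
pvalue = corr_permutation_pvalue(xs, ys)
print(pearson_corr(xs, ys), pvalue)
```

With data this strongly correlated, shuffled samples essentially never
match the observed correlation, so the computed p-value is at or near 0.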


\section{Testing proportions}
\label{casino}
\index{chi-squared test}

Suppose you run a casino and you suspect that a customer is
using a crooked die; that
is, one that has been modified to make one of the faces more
likely than the others.  You apprehend the alleged
cheater and confiscate the die, but now you have to prove that it
is crooked.  You roll the die 60 times and get the following results:
\index{casino}
\index{dice}
\index{crooked die}

\begin{center}
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
Value     &  1  &  2  &  3  &  4  &  5  &  6  \\
\hline
Frequency &  8  &  9  &  19  &  5  &  8  &  11  \\
\hline
\end{tabular}
\end{center}

On average you expect each value to appear 10 times.  In this
dataset, the value 3 appears more often than expected, and the value 4
appears less often.  But are these differences statistically
significant?
\index{frequency}
  \index{significant} \index{statistically significant}

To test this hypothesis, we can compute the expected frequency for
each value, the difference between the expected and observed
frequencies, and the total absolute difference.  In this
example, we expect each side to come up 10 times out of 60; the
deviations from this expectation are -2, -1, 9, -5, -2, and 1; so the
total absolute difference is 20.  How often would we see such a
difference by chance?
\index{deviation}

Here's a version of {\tt HypothesisTest} that answers that question:
\index{HypothesisTest}

\begin{verbatim}
class DiceTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum(abs(observed - expected))
        return test_stat

    def RunModel(self):
        n = sum(self.data)
        values = [1, 2, 3, 4, 5, 6]
        rolls = np.random.choice(values, n, replace=True)
        hist = thinkstats2.Hist(rolls)
        freqs = hist.Freqs(values)
        return freqs
\end{verbatim}

The data are represented as a list of frequencies: the observed
values are {\tt [8, 9, 19, 5, 8, 11]}; the expected frequencies
are all 10.  The test statistic is the sum of the absolute differences.
\index{frequency}

The null hypothesis is that the die is fair, so we simulate that by
drawing random samples from {\tt values}.  {\tt RunModel} uses {\tt
  Hist} to compute and return the list of frequencies.
\index{Hist}
\index{null hypothesis}
\index{model}

The p-value for this data is 0.13, which means that if the die is
fair we expect to see the observed total deviation, or more, about
13\% of the time.  So the apparent effect is not statistically
significant.
\index{p-value}
\index{deviation}
  \index{significant} \index{statistically significant}


\section{Chi-squared tests}
\label{casino2}

In the previous section we used total deviation as the test statistic.
But for testing proportions it is more common to use the chi-squared
statistic:
%
\[ \goodchi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
%
%% TODO: Consider using upper case chi, which is more strictly correct,
%% but harder to distinguish from X.
%
where $O_i$ are the observed frequencies and $E_i$ are the expected
frequencies.  Here's the Python code:
\index{chi-squared test}
\index{chi-squared statistic}
\index{test statistic}

\begin{verbatim}
class DiceChiTest(DiceTest):

    def TestStatistic(self, data):
        observed = data
        n = sum(observed)
        expected = np.ones(6) * n / 6
        test_stat = sum((observed - expected)**2 / expected)
        return test_stat
\end{verbatim}

Squaring the deviations (rather than taking absolute values) gives
more weight to large deviations.  Dividing through by {\tt expected}
standardizes the deviations, although in this case it has no effect
because the expected frequencies are all equal.
\index{deviation}
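
To compare the two statistics side by side, here is a standalone
sketch that computes both for the frequencies in
Section~\ref{casino} and simulates a fair die with Python's
{\tt random} module.  The function names are hypothetical and the code
is independent of {\tt thinkstats2}, so the simulated p-values will
not match the ones reported in the text exactly.

```python
import random

observed = [8, 9, 19, 5, 8, 11]   # frequencies of the faces 1 through 6


def total_deviation(freqs):
    expected = sum(freqs) / len(freqs)
    return sum(abs(f - expected) for f in freqs)


def chi_squared(freqs):
    expected = sum(freqs) / len(freqs)
    return sum((f - expected) ** 2 / expected for f in freqs)


def simulate_pvalue(stat_func, freqs, iters=1000, seed=17):
    """Fraction of fair-die simulations with a statistic >= the observed one."""
    rng = random.Random(seed)
    n = sum(freqs)
    actual = stat_func(freqs)
    count = 0
    for _ in range(iters):
        rolls = [rng.randrange(6) for _ in range(n)]
        simulated = [rolls.count(face) for face in range(6)]
        if stat_func(simulated) >= actual:
            count += 1
    return count / iters


print(total_deviation(observed))   # 20.0
print(chi_squared(observed))       # approximately 11.6
print(simulate_pvalue(total_deviation, observed))
print(simulate_pvalue(chi_squared, observed))
```

The squared deviations are 4, 1, 81, 25, 4, and 1, so the single
large deviation for the value 3 contributes most of the chi-squared
statistic, $116 / 10 = 11.6$.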

The p-value using the chi-squared statistic is 0.04,
substantially smaller than what we got using total deviation, 0.13.
If we take the 5\% threshold seriously, we would consider this effect
statistically significant.  But considering the two tests together, I
would say that the results are borderline.  I would not rule out the
possibility that the die is crooked, but I would not convict the
accused cheater.
\index{p-value}
  \index{significant} \index{statistically significant}

This example demonstrates an important point: the p-value depends
on the choice of test statistic and the model of the null hypothesis,
and sometimes these choices determine whether an effect is
statistically significant or not.
\index{null hypothesis}
\index{model}


\section{First babies again}

Earlier in this chapter we looked at pregnancy lengths for first
babies and others, and concluded that the apparent differences in
mean and standard deviation are not statistically significant.  But in
Section~\ref{visualization}, we saw several apparent differences
in the distribution of pregnancy length, especially in the range from
35 to 43 weeks.  To see whether those differences are statistically
significant, we can use a test based on a chi-squared statistic.
\index{standard deviation}
\index{statistically significant} \index{significant}
\index{pregnancy length}

The code combines elements from previous examples:
\index{HypothesisTest}

\begin{verbatim}
class PregLengthTest(thinkstats2.HypothesisTest):

    def MakeModel(self):
        firsts, others = self.data
        self.n = len(firsts)
        self.pool = np.hstack((firsts, others))

        pmf = thinkstats2.Pmf(self.pool)
        self.values = range(35, 44)
        self.expected_probs = np.array(pmf.Probs(self.values))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
\end{verbatim}

The data are represented as two lists of pregnancy lengths.  The null
hypothesis is that both samples are drawn from the same distribution.
{\tt MakeModel} models that distribution by pooling the two
samples using {\tt hstack}.  Then {\tt RunModel} generates
simulated data by shuffling the pooled sample and splitting it
into two parts.
\index{null hypothesis}
\index{model}
\index{hstack}
\index{pregnancy length}

{\tt MakeModel} also defines {\tt values}, which is the
range of weeks we'll use, and \verb"expected_probs",
which is the probability of each value in the pooled distribution.

Here's the code that computes the test statistic:

\begin{verbatim}
# class PregLengthTest:

    def TestStatistic(self, data):
        firsts, others = data
        stat = self.ChiSquared(firsts) + self.ChiSquared(others)
        return stat

    def ChiSquared(self, lengths):
        hist = thinkstats2.Hist(lengths)
        observed = np.array(hist.Freqs(self.values))
        expected = self.expected_probs * len(lengths)
        stat = sum((observed - expected)**2 / expected)
        return stat
\end{verbatim}

{\tt TestStatistic} computes the chi-squared statistic for
first babies and others, and adds them.
\index{chi-squared statistic}

{\tt ChiSquared} takes a sequence of pregnancy lengths, computes
its histogram, and computes {\tt observed}, which is a list of
frequencies corresponding to {\tt self.values}.
To compute the list of expected frequencies, it multiplies the
pre-computed probabilities, \verb"expected_probs", by the sample
size.  It returns the chi-squared statistic, {\tt stat}.

For the NSFG data the total chi-squared statistic is 102, which
doesn't mean much by itself.  But after 1000 iterations, the largest
test statistic generated under the null hypothesis is 32.  We conclude
that the observed chi-squared statistic is unlikely under the null
hypothesis, so the apparent effect is statistically significant.
\index{null hypothesis}
\index{statistically significant} \index{significant}

This example demonstrates a limitation of chi-squared tests: they
indicate that there is a difference between the two groups,
but they don't say anything specific about what the difference is.


\section{Errors}
\index{error}

In classical hypothesis testing, an effect is considered statistically
significant if the p-value is below some threshold, commonly 5\%.
This procedure raises two questions:
\index{p-value}
\index{threshold}
\index{statistically significant} \index{significant}

\begin{itemize}

\item If the effect is actually due to chance, what is the probability
that we will wrongly consider it significant?  This
probability is the {\bf false positive rate}.
\index{false positive}

\item If the effect is real, what is the chance that the hypothesis
test will fail?  This probability is the {\bf false negative rate}.
\index{false negative}

\end{itemize}

The false positive rate is relatively easy to compute: if the
threshold is 5\%, the false positive rate is 5\%.  Here's why:

\begin{itemize}

\item If there is no real effect, the null hypothesis is true, so we
  can compute the distribution of the test statistic by simulating the
  null hypothesis.  Call this distribution $\CDF_T$.
\index{null hypothesis}
\index{CDF}

\item Each time we run an experiment, we get a test statistic, $t$,
  which is drawn from $\CDF_T$.  Then we compute a p-value, which is
  the probability that a random value from $\CDF_T$ exceeds $t$,
  so that's $1 - \CDF_T(t)$.

\item The p-value is less than 5\% if $\CDF_T(t)$ is greater
  than 95\%; that is, if $t$ exceeds the 95th percentile.
  And how often does a value chosen from $\CDF_T$ exceed
  the 95th percentile?  5\% of the time.

\end{itemize}

So if you perform one hypothesis test with a 5\% threshold, you expect
a false positive 1 time in 20.
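
The argument above can be checked by simulation.  The following
sketch is not from {\tt thinkstats2}; it assumes, purely for
illustration, that the test statistic under the null hypothesis is the
absolute value of a standard normal variate, and the function name and
parameters are hypothetical.

```python
import bisect
import random


def false_positive_rate(num_experiments=2000, null_size=5000,
                        threshold=0.05, seed=17):
    rng = random.Random(seed)

    # Approximate the distribution of the test statistic under the
    # null hypothesis by simulation, and sort it for fast lookups.
    null_stats = sorted(abs(rng.gauss(0, 1)) for _ in range(null_size))

    positives = 0
    for _ in range(num_experiments):
        t = abs(rng.gauss(0, 1))   # an experiment with no real effect
        # p-value: fraction of simulated null statistics that are >= t
        num_ge = null_size - bisect.bisect_left(null_stats, t)
        pvalue = num_ge / null_size
        if pvalue < threshold:
            positives += 1
    return positives / num_experiments


rate = false_positive_rate()
print(rate)
```

If the reasoning is right, the printed rate should be close to the
threshold, 0.05, up to random variation.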


\section{Power}
\label{power}

The false negative rate is harder to compute because it depends on
the actual effect size, and normally we don't know that.
One option is to compute a rate
conditioned on a hypothetical effect size.
\index{effect size}

For example, if we assume that the observed difference between groups
is accurate, we can use the observed samples as a model of the
population and run hypothesis tests with simulated data:
\index{model}

\begin{verbatim}
def FalseNegRate(data, num_runs=100):
    group1, group2 = data
    count = 0

    for i in range(num_runs):
        sample1 = thinkstats2.Resample(group1)
        sample2 = thinkstats2.Resample(group2)

        ht = DiffMeansPermute((sample1, sample2))
        pvalue = ht.PValue(iters=101)
        if pvalue > 0.05:
            count += 1

    return count / num_runs
\end{verbatim}

{\tt FalseNegRate} takes data in the form of two sequences, one for
each group.  Each time through the loop, it simulates an experiment by
drawing a random sample from each group and running a hypothesis test.
Then it checks the result and counts the number of false negatives.
\index{Resample}
\index{permutation}

{\tt Resample} takes a sequence and draws a sample with the same
length, with replacement:
\index{replacement}

\begin{verbatim}
def Resample(xs):
    return np.random.choice(xs, len(xs), replace=True)
\end{verbatim}

Here's the code that tests pregnancy lengths:

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    data = firsts.prglngth.values, others.prglngth.values
    neg_rate = FalseNegRate(data)
\end{verbatim}

The result is about 70\%, which means that if the actual difference in
mean pregnancy length is 0.078 weeks, we expect an experiment with this
sample size to yield a negative test 70\% of the time.
\index{pregnancy length}

This result is often presented the other way around: if the actual
difference is 0.078 weeks, we should expect a positive test only 30\%
of the time.  This ``correct positive rate'' is called the {\bf power}
of the test, or sometimes ``sensitivity''.  It reflects the ability of
the test to detect an effect of a given size.
\index{power}
\index{sensitivity}
\index{correct positive}

In this example, the test had only a 30\% chance of yielding a
positive result (again, assuming that the difference is 0.078 weeks).
As a rule of thumb, a power of 80\% is considered acceptable, so
we would say that this test was ``underpowered.''
\index{underpowered}

In general a negative hypothesis test does not imply that there is no
difference between the groups; instead it suggests that if there is a
difference, it is too small to detect with this sample size.


\section{Replication}
\label{replication}

The hypothesis testing process I demonstrated in this chapter is not,
strictly speaking, good practice.

First, I performed multiple tests.  If you run one hypothesis test,
the chance of a false positive is about 1 in 20, which might be
acceptable.  But if you run 20 tests, you should expect at least one
false positive, most of the time.
\index{multiple tests}
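
To see why, here is a short calculation.  It assumes that the 20 tests
are independent and that every null hypothesis is true; neither
assumption is stated above, so treat the result as a rough guide:
%
\[ P(\mbox{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64 \]
%
Under the same assumptions, the expected number of false positives is
$20 \times 0.05 = 1$.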

Second, I used the same dataset for exploration and testing.  If
you explore a large dataset, find a surprising effect, and then test
whether it is significant, you have a good chance of generating a
false positive.
\index{statistically significant} \index{significant}

To compensate for multiple tests, you can adjust the p-value
threshold (see
  \url{https://en.wikipedia.org/wiki/Holm-Bonferroni_method}).  Or you
can address both problems by partitioning the data, using one set for
exploration and the other for testing.
\index{p-value}
\index{Holm-Bonferroni method}
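
As an illustration of adjusting the threshold, here is a minimal
sketch of the Holm-Bonferroni step-down procedure.  The function
{\tt holm\_bonferroni} is hypothetical (it is not part of
{\tt thinkstats2}), and the p-values in the example are made up.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvalues)
    # Visit the p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1), which equals alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first failure; larger p-values are kept
    return reject


print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
# [True, False, False, True]
```

In this example the thresholds are $0.05/4$, $0.05/3$, and $0.05/2$;
the third smallest p-value, 0.03, exceeds $0.025$, so the procedure
stops there even though 0.03 and 0.04 are below the unadjusted 5\%
threshold.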

In some fields these practices are required or at least encouraged.
But it is also common to address these problems implicitly by
replicating published results.  Typically the first paper to report a
new result is considered exploratory.  Subsequent papers that
replicate the result with new data are considered confirmatory.
\index{confirmatory result}

As it happens, we have an opportunity to replicate the results in this
chapter.  The first edition of this book is based on Cycle 6 of the
NSFG, which was released in 2002.  In October 2011, the CDC released
additional data based on interviews conducted from 2006--2010.  {\tt
  nsfg2.py} contains code to read and clean this data.  In the new
dataset:
\index{NSFG}

\begin{itemize}

\item The difference in mean pregnancy length is
0.16 weeks and statistically significant with $p < 0.001$ (compared
to 0.078 weeks in the original dataset).
\index{statistically significant} \index{significant}
\index{pregnancy length}

\item The difference in birth weight is 0.17 pounds with $p < 0.001$
(compared to 0.12 lbs in the original dataset).
\index{birth weight}
\index{weight!birth}

\item The correlation between birth weight and mother's age is
0.08 with $p < 0.001$ (compared to 0.07).

\item The chi-squared test is statistically significant with
$p < 0.001$ (as it was in the original).

\end{itemize}

In summary, all of the effects that were statistically significant
in the original dataset were replicated in the new dataset, and the
difference in pregnancy length, which was not significant in the
original, is bigger in the new dataset and significant.


\section{Exercises}

A solution to these exercises is in \verb"chap09soln.py".

\begin{exercise}
As sample size increases, the power of a hypothesis test increases,
which means it is more likely to be positive if the effect is real.
Conversely, as sample size decreases, the test is less likely to
be positive even if the effect is real.
\index{sample size}

To investigate this behavior, run the tests in this chapter with
different subsets of the NSFG data.  You can use {\tt thinkstats2.SampleRows}
to select a random subset of the rows in a DataFrame.
\index{National Survey of Family Growth}
\index{NSFG}
\index{DataFrame}

What happens to the p-values of these tests as sample size decreases?
What is the smallest sample size that yields a positive test?
\index{p-value}
\end{exercise}


\begin{exercise}

In Section~\ref{testdiff}, we simulated the null hypothesis by
permutation; that is, we treated the observed values as if they
represented the entire population, and randomly assigned the
members of the population to the two groups.
\index{null hypothesis}
\index{permutation}

An alternative is to use the sample to estimate the distribution for
the population, then draw a random sample from that distribution.
This process is called {\bf resampling}.  There are several ways to
implement resampling, but one of the simplest is to draw a sample
with replacement from the observed values, as in Section~\ref{power}.
\index{resampling}
\index{replacement}

Write a class named {\tt DiffMeansResample} that inherits from
{\tt DiffMeansPermute} and overrides {\tt RunModel} to implement
resampling, rather than permutation.
\index{permutation}

Use this model to test the differences in pregnancy length and
birth weight.  How much does the model affect the results?
\index{model}
\index{birth weight}
\index{weight!birth}
\index{pregnancy length}

\end{exercise}


\section{Glossary}

\begin{itemize}

\item hypothesis testing: The process of determining whether an apparent
effect is statistically significant.
\index{hypothesis testing}

\item test statistic: A statistic used to quantify an effect size.
\index{test statistic}
\index{effect size}

\item null hypothesis: A model of a system based on the assumption that
an apparent effect is due to chance.
\index{null hypothesis}

\item p-value: The probability of seeing an effect as big as the
observed effect if the null hypothesis is true; that is, by chance.
\index{p-value}

\item statistically significant: An effect is statistically
  significant if it is unlikely to occur by chance.
  \index{significant} \index{statistically significant}

\item permutation test: A way to compute p-values by generating
  permutations of an observed dataset.
  \index{permutation test}

\item resampling test: A way to compute p-values by generating
  samples, with replacement, from an observed dataset.
  \index{resampling test}

\item two-sided test: A test that asks, ``What is the chance of an effect
as big as the observed effect, positive or negative?''

\item one-sided test: A test that asks, ``What is the chance of an effect
as big as the observed effect, and with the same sign?''
\index{one-sided test}
\index{two-sided test}
\index{test!one-sided}
\index{test!two-sided}

\item chi-squared test: A test that uses the chi-squared statistic as
the test statistic.
\index{chi-squared test}

\item false positive: The conclusion that an effect is real when it is not.
\index{false positive}

\item false negative: The conclusion that an effect is due to chance when it
is not.
\index{false negative}

\item power: The probability of a positive test if the null hypothesis
is false.
\index{power}
\index{null hypothesis}

\end{itemize}


\chapter{Linear least squares}
\label{linear}

The code for this chapter is in {\tt linear.py}.  For information
about downloading and working with this code, see Section~\ref{code}.


\section{Least squares fit}

Correlation coefficients measure the strength and sign of a
relationship, but not the slope.  There are several ways to estimate
the slope; the most common is a {\bf linear least squares fit}.  A
``linear fit'' is a line intended to model the relationship between
variables.  A ``least squares'' fit is one that minimizes the mean
squared error (MSE) between the line and the data.
\index{least squares fit}
\index{linear least squares}
\index{model}

Suppose we have a sequence of points, {\tt ys}, that we want to
express as a function of another sequence {\tt xs}.  If there is a
linear relationship between {\tt xs} and {\tt ys} with intercept {\tt
  inter} and slope {\tt slope}, we expect each {\tt y[i]} to be
{\tt inter + slope * x[i]}.  \index{residuals}

But unless the correlation is perfect, this prediction is only
approximate.  The vertical deviation from the line, or {\bf residual},
is
\index{deviation}

\begin{verbatim}
res = ys - (inter + slope * xs)
\end{verbatim}

The residuals might be due to random factors like measurement error,
or non-random factors that are unknown.  For example, if we are
trying to predict weight as a function of height, unknown factors
might include diet, exercise, and body type.
\index{slope}
\index{intercept}
\index{measurement error}

If we get the parameters {\tt inter} and {\tt slope} wrong, the residuals
get bigger, so it makes intuitive sense that the parameters we want
are the ones that minimize the residuals.
\index{parameter}

We might try to minimize the absolute value of the
residuals, or their squares, or their cubes; but the most common
choice is to minimize the sum of squared residuals,
{\tt sum(res**2)}.

Why?  There are three good reasons and one less important one:

\begin{itemize}

\item Squaring has the feature of treating positive and
negative residuals the same, which is usually what we want.

\item Squaring gives more weight to large residuals, but not
so much weight that the largest residual always dominates.

\item If the residuals are uncorrelated and normally distributed with
  mean 0 and constant (but unknown) variance, then the least squares
  fit is also the maximum likelihood estimator of {\tt inter} and {\tt
    slope}.  See
  \url{https://en.wikipedia.org/wiki/Linear_regression}.  \index{MLE}
  \index{maximum likelihood estimator}
\index{correlation}

\item The values of {\tt inter} and {\tt slope} that minimize
  the squared residuals can be computed efficiently.

\end{itemize}

The last reason made sense when computational efficiency was more
important than choosing the method most appropriate to the problem
at hand.  That's no longer the case, so it is worth considering
whether squared residuals are the right thing to minimize.
\index{computational methods}
\index{squared residuals}

For example, if you are using {\tt xs} to predict values of {\tt ys},
guessing too high might be better (or worse) than guessing too low.
In that case you might want to compute some cost function for each
residual, and minimize total cost, {\tt sum(cost(res))}.
However, computing a least squares fit is quick, easy and often good
enough.
\index{cost function}


\section{Implementation}

{\tt thinkstats2} provides simple functions that demonstrate
linear least squares:
\index{LeastSquares}

\begin{verbatim}
def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope
\end{verbatim}

{\tt LeastSquares} takes sequences
{\tt xs} and {\tt ys} and returns the estimated parameters {\tt inter}
and {\tt slope}.
For details on how it works, see
\url{http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares}.
\index{parameter}
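
In symbols, {\tt LeastSquares} computes
$\slope = \mathrm{Cov}(x, y) / \mathrm{Var}(x)$ and
$\inter = \bar{y} - \slope \, \bar{x}$.  The following standalone
sketch applies the same formulas in plain Python, without the helper
functions {\tt MeanVar}, {\tt Mean}, and {\tt Cov}; the name
{\tt least\_squares} and the data are made up for illustration.

```python
def least_squares(xs, ys):
    """Return (inter, slope) of the least squares line through the points."""
    n = len(xs)
    meanx = sum(xs) / n
    meany = sum(ys) / n
    cov = sum((x - meanx) * (y - meany) for x, y in zip(xs, ys)) / n
    varx = sum((x - meanx) ** 2 for x in xs) / n
    slope = cov / varx
    inter = meany - slope * meanx
    return inter, slope


# Points that lie exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
inter, slope = least_squares(xs, ys)
print(inter, slope)   # 1.0 2.0
```

Because the points lie exactly on a line, the fit recovers its
intercept and slope; with noisy data the result is the line that
minimizes the sum of squared residuals.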

{\tt thinkstats2} also provides {\tt FitLine}, which takes {\tt inter}
and {\tt slope} and returns the fitted line for a sequence
of {\tt xs}.
\index{FitLine}

\begin{verbatim}
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys
\end{verbatim}

We can use these functions to compute the least squares fit for
birth weight as a function of mother's age.
\index{birth weight}
\index{weight!birth}
\index{age}

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    ages = live.agepreg
    weights = live.totalwgt_lb

    inter, slope = thinkstats2.LeastSquares(ages, weights)
    fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)
\end{verbatim}

The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year.
These values are hard to interpret in this form: the intercept is
the expected weight of a baby whose mother is 0 years old, which
doesn't make sense in context, and the slope is too small to
grasp easily.
\index{slope}
\index{intercept}
\index{dropna}
\index{NaN}

Instead of presenting the intercept at $x=0$, it
is often helpful to present the intercept at the mean of $x$.  In
this case the mean age is about 25 years and the mean baby weight
for a 25 year old mother is 7.3 pounds.  The slope is 0.27 ounces
per year, or 0.17 pounds per decade.
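
The unit conversions are simple arithmetic, starting from the
rounded slope of 0.017 lbs per year:
%
\[ 0.017 \times 16 \approx 0.27 \mbox{ oz per year}, \qquad
   0.017 \times 10 = 0.17 \mbox{ lbs per decade} \]
%
The intercept at the mean of $x$ needs no new fit.
{\tt LeastSquares} computes {\tt inter} as
{\tt meany - slope * meanx}, so
%
\[ \inter + \slope \, \bar{x} = \bar{y} \]
%
that is, the fitted line passes through the point of means, and the
intercept at the mean age is just the mean birth weight.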

\begin{figure}
% linear.py
\centerline{\includegraphics[height=2.5in]{figs/linear1.pdf}}
\caption{Scatter plot of birth weight and mother's age with
a linear fit.}
\label{linear1}
\end{figure}

Figure~\ref{linear1} shows a scatter plot of birth weight and age
along with the fitted line.  It's a good idea to look at a figure like
this to assess whether the relationship is linear and whether the
fitted line seems like a good model of the relationship.
\index{birth weight}
\index{weight!birth}
\index{scatter plot}
\index{plot!scatter}
\index{model}


\section{Residuals}
\label{residuals}

Another useful test is to plot the residuals.
{\tt thinkstats2} provides a function that computes residuals:
\index{residuals}

\begin{verbatim}
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
\end{verbatim}

{\tt Residuals} takes sequences {\tt xs} and {\tt ys} and
estimated parameters {\tt inter} and {\tt slope}.  It returns
the differences between the actual values and the fitted line.
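
As a small check on what to expect from residuals, the following
standalone sketch fits a line to made-up data and computes the
residuals in plain Python.  The names {\tt least\_squares} and
{\tt residuals} are hypothetical stand-ins for the {\tt thinkstats2}
functions, not the library's own code.

```python
def least_squares(xs, ys):
    """Return (inter, slope) of the least squares line through the points."""
    n = len(xs)
    meanx = sum(xs) / n
    meany = sum(ys) / n
    cov = sum((x - meanx) * (y - meany) for x, y in zip(xs, ys)) / n
    varx = sum((x - meanx) ** 2 for x in xs) / n
    slope = cov / varx
    return meany - slope * meanx, slope


def residuals(xs, ys, inter, slope):
    """Vertical distance of each point from the line inter + slope * x."""
    return [y - (inter + slope * x) for x, y in zip(xs, ys)]


# Made-up points that do not lie exactly on a line.
xs = [1, 2, 3, 4]
ys = [3.5, 4.5, 7.5, 8.5]
inter, slope = least_squares(xs, ys)
res = residuals(xs, ys, inter, slope)
print(res)
print(sum(res))   # zero, up to floating-point error
```

The residuals of a least squares fit with an intercept always sum to
zero, which is one reason to expect them to be centered near zero
when plotted.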

\begin{figure}
% linear.py
\centerline{\includegraphics[height=2.5in]{figs/linear2.pdf}}
\caption{Residuals of the linear fit.}
\label{linear2}
\end{figure}

To visualize the residuals, I group respondents by age and compute
percentiles in each group, as we saw in Section~\ref{characterizing}.
Figure~\ref{linear2} shows the 25th, 50th and 75th percentiles of
the residuals for each age group.  The median is near zero, as
expected, and the interquartile range is about 2 pounds.  So if we
know the mother's age, we can guess the baby's weight within a pound,
about 50\% of the time.
\index{visualization}

Ideally these lines should be flat, indicating that the residuals are
random, and parallel, indicating that the variance of the residuals is
the same for all age groups.  In fact, the lines are close to
parallel, so that's good; but they have some curvature, indicating
that the relationship is nonlinear.  Nevertheless, the linear fit
is a simple model that is probably good enough for some purposes.
\index{model}
\index{nonlinear}
7616

7617

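The grouping described in this section can be sketched with NumPy
alone.  This is a minimal sketch, not the code in {\tt linear.py}: the
helper \verb"residual_percentiles", the bin edges, and the synthetic
ages and weights are all assumptions made for illustration.

```python
import numpy as np


def residual_percentiles(xs, res, bins, percents=(25, 50, 75)):
    """Group residuals by bins of xs and compute percentiles per group.

    Returns the mean x of each non-empty group and an array with one row
    per group and one column per requested percentile.
    """
    xs = np.asarray(xs, dtype=float)
    res = np.asarray(res, dtype=float)
    indices = np.digitize(xs, bins)      # bin index for each observation

    centers, rows = [], []
    for i in np.unique(indices):
        mask = indices == i
        centers.append(xs[mask].mean())
        rows.append(np.percentile(res[mask], percents))
    return np.array(centers), np.array(rows)


# Synthetic stand-in for the survey data (not the real NSFG values).
rng = np.random.default_rng(17)
ages = rng.uniform(15, 45, size=2000)
weights = 6.8 + 0.02 * ages + rng.normal(0, 1.4, size=2000)

slope, inter = np.polyfit(ages, weights, 1)   # least squares line
res = weights - (inter + slope * ages)

centers, pcts = residual_percentiles(ages, res, bins=np.arange(10, 48, 3))
```

Plotting each column of {\tt pcts} against {\tt centers} gives one line
per percentile; lines that are roughly flat and parallel suggest that a
linear model is adequate.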
\section{Estimation}
\label{regest}

The parameters {\tt slope} and {\tt inter} are estimates based on a
sample; like other estimates, they are vulnerable to sampling bias,
measurement error, and sampling error.  As discussed in
Chapter~\ref{estimation}, sampling bias is caused by non-representative
sampling, measurement error is caused by errors in collecting
and recording data, and sampling error is the result of measuring a
sample rather than the entire population.
\index{sampling bias}
\index{bias!sampling}
\index{measurement error}
\index{sampling error}
\index{estimation}

To assess sampling error, we ask, ``If we run this experiment again,
how much variability do we expect in the estimates?''  We can
answer this question by running simulated experiments and computing
sampling distributions of the estimates.
\index{sampling error}
\index{sampling distribution}

I simulate the experiments by resampling the data; that is, I treat
the observed pregnancies as if they were the entire population
and draw samples, with replacement, from the observed sample.
\index{simulation}
\index{replacement}

\begin{verbatim}
def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = thinkstats2.ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = thinkstats2.LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes
\end{verbatim}

{\tt SamplingDistributions} takes a DataFrame with one row per live
birth, and {\tt iters}, the number of experiments to simulate.  It
uses {\tt ResampleRows} to resample the observed pregnancies.  We've
already seen {\tt SampleRows}, which chooses random rows from a
DataFrame.  {\tt thinkstats2} also provides {\tt ResampleRows}, which
returns a sample the same size as the original:
\index{DataFrame}
\index{resampling}

\begin{verbatim}
def ResampleRows(df):
    return SampleRows(df, len(df), replace=True)
\end{verbatim}

After resampling, we use the simulated sample to estimate parameters.
The result is two sequences: the estimated intercepts and estimated
slopes.
\index{parameter}

I summarize the sampling distributions by printing the standard
error and confidence interval:
\index{sampling distribution}

\begin{verbatim}
def Summarize(estimates, actual=None):
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)
\end{verbatim}

{\tt Summarize} takes a sequence of estimates and the actual value.
It prints the mean of the estimates, the standard error and
a 90\% confidence interval.  The standard error is computed as the
root mean squared deviation of the estimates from {\tt actual}; if
{\tt actual} is {\tt None}, the deviations are taken from the mean
of the estimates.
\index{standard error}
\index{confidence interval}

For the intercept, the mean estimate is 6.83, with standard error
0.07 and 90\% confidence interval (6.71, 6.94).  The estimated slope, in
more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220).
There is almost a factor of two between the low and high ends of
this CI, so it should be considered a rough estimate.

%inter 6.83039697331 6.83174035366
%SE, CI 0.0699814482068 (6.7146843084406846, 6.9447797068631871)
%slope 0.0174538514718 0.0173840926936
%SE, CI 0.00276116142884 (0.012635074392201724, 0.021975282350381781)

To visualize the sampling error of the estimate, we could plot
all of the fitted lines, or for a less cluttered representation,
plot a 90\% confidence interval for each age.  Here's the code:

\begin{verbatim}
def PlotConfidenceIntervals(xs, inters, slopes,
                            percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = thinkstats2.FitLine(xs, inter, slope)
        fys_seq.append(fys)

    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)
\end{verbatim}

{\tt xs} is the sequence of mother's age.  {\tt inters} and {\tt slopes}
are the estimated parameters generated by {\tt SamplingDistributions}.
{\tt percent} indicates which confidence interval to plot.

{\tt PlotConfidenceIntervals} generates a fitted line for each pair
of {\tt inter} and {\tt slope} and stores the results in a sequence,
\verb"fys_seq".  Then it uses {\tt PercentileRows} to select the
upper and lower percentiles of {\tt y} for each value of {\tt x}.
For a 90\% confidence interval, it selects the 5th and 95th percentiles.
{\tt FillBetween} draws a polygon that fills the space between two
lines.
\index{thinkplot}
\index{FillBetween}

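The band itself can be computed without {\tt thinkstats2} or
{\tt thinkplot}.  The following sketch assumes NumPy and uses a
hypothetical helper, \verb"confidence_band"; the intercepts and slopes
are drawn from normal distributions loosely based on the estimates
above, only so that there is something to compute.

```python
import numpy as np


def confidence_band(xs, inters, slopes, percent=90):
    """Return sorted xs with the lower and upper percentile of the
    fitted lines at each x."""
    fxs = np.sort(np.asarray(xs, dtype=float))
    inters = np.asarray(inters, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    # one row per (inter, slope) pair, one column per x value
    fys_seq = inters[:, None] + slopes[:, None] * fxs[None, :]
    p = (100 - percent) / 2
    low, high = np.percentile(fys_seq, [p, 100 - p], axis=0)
    return fxs, low, high


# Hypothetical parameter estimates, standing in for resampling results.
rng = np.random.default_rng(1)
slopes = rng.normal(0.0175, 0.003, size=200)
inters = rng.normal(6.83, 0.07, size=200)

fxs, low, high = confidence_band([15, 25, 35, 45], inters, slopes)
```

The arrays {\tt low} and {\tt high} can then be passed to any function
that fills the region between two curves.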
\begin{figure}
% linear.py
\centerline{\includegraphics[height=2.5in]{figs/linear3.pdf}}
\caption{50\% and 90\% confidence intervals showing variability in the
  fitted line due to sampling error of {\tt inter} and {\tt slope}.}
\label{linear3}
\end{figure}

Figure~\ref{linear3} shows the 50\% and 90\% confidence
intervals for curves fitted to birth weight as a function of
mother's age.
The vertical width of the region represents the effect of
sampling error; the effect is smaller for values near the mean and
larger for the extremes.

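For readers who want to experiment without {\tt thinkstats2}, the
resampling procedure in this section can be approximated with NumPy.
This is a sketch under stated assumptions: the helpers
\verb"least_squares", \verb"sampling_distributions" and
{\tt summarize} are hypothetical names, the data are synthetic, and the
confidence interval is taken directly from percentiles of the
estimates rather than from a {\tt Cdf} object.

```python
import numpy as np


def least_squares(xs, ys):
    """Return (inter, slope) of the least squares line."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope = np.cov(xs, ys, ddof=0)[0, 1] / np.var(xs)
    inter = ys.mean() - slope * xs.mean()
    return inter, slope


def sampling_distributions(xs, ys, iters=101, seed=None):
    """Resample (x, y) pairs with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = len(xs)
    estimates = []
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)   # row indices, with replacement
        estimates.append(least_squares(xs[idx], ys[idx]))
    inters, slopes = zip(*estimates)
    return np.array(inters), np.array(slopes)


def summarize(estimates, percent=90):
    """Return mean, standard error, and a percentile confidence interval."""
    p = (100 - percent) / 2
    ci = tuple(np.percentile(estimates, [p, 100 - p]))
    return np.mean(estimates), np.std(estimates), ci


# Synthetic data, not the NSFG sample.
rng = np.random.default_rng(0)
ages = rng.uniform(15, 45, size=500)
weights = 6.8 + 0.02 * ages + rng.normal(0, 1.4, size=500)

inters, slopes = sampling_distributions(ages, weights, iters=101, seed=1)
mean, stderr, ci = summarize(slopes)
```

Here the standard error is the standard deviation of the resampled
estimates around their own mean, which corresponds to calling
{\tt Summarize} without an {\tt actual} value.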
\section{Goodness of fit}
\label{goodness}
\index{goodness of fit}

There are several ways to measure the quality of a linear model, or
{\bf goodness of fit}.  One of the simplest is the standard deviation
of the residuals.
\index{standard deviation}
\index{model}

If you use a linear model to make predictions, {\tt Std(res)}
is the root mean squared error (RMSE) of your predictions.  For
example, if you use mother's age to guess birth weight, the RMSE of
your guess would be 1.40 lbs.
\index{birth weight}
\index{weight!birth}

If you guess birth weight without knowing the mother's age, the RMSE
of your guess is {\tt Std(ys)}, which is 1.41 lbs.  So in this
example, knowing a mother's age does not improve the predictions
substantially.
\index{prediction}

Another way to measure goodness of fit is the {\bf
  coefficient of determination}, usually denoted $R^2$ and
called ``R-squared'':
\index{coefficient of determination}
\index{r-squared}

\begin{verbatim}
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
\end{verbatim}

{\tt Var(res)} is the MSE of your guesses using the model;
{\tt Var(ys)} is the MSE without it.  So their ratio is the fraction
of MSE that remains if you use the model, and $R^2$ is the fraction
of MSE the model eliminates.
\index{MSE}

For birth weight and mother's age, $R^2$ is 0.0047, which means
that mother's age predicts about half of 1\% of variance in
birth weight.

There is a simple relationship between the coefficient of
determination and Pearson's coefficient of correlation: $R^2 = \rho^2$.
For example, if $\rho$ is 0.8 or $-0.8$, $R^2 = 0.64$.
\index{Pearson coefficient of correlation}

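The identity $R^2 = \rho^2$ for a least squares line can be checked
numerically.  The following sketch assumes NumPy; the function
\verb"coef_determination" mirrors {\tt CoefDetermination} but is
written here only for illustration, and the data are synthetic.

```python
import numpy as np


def coef_determination(ys, res):
    """R^2 as the fraction of variance the model eliminates."""
    return 1 - np.var(res) / np.var(ys)


# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(3)
xs = rng.normal(0, 1, size=1000)
ys = 2.0 + 0.5 * xs + rng.normal(0, 1, size=1000)

slope, inter = np.polyfit(xs, ys, 1)     # least squares line
res = ys - (inter + slope * xs)

r2 = coef_determination(ys, res)
rho = np.corrcoef(xs, ys)[0, 1]          # Pearson correlation
print(r2, rho**2)
```

The two printed values should agree up to floating-point error.  The
identity holds for the least squares line; residuals from some other
line would generally give a smaller value of {\tt r2}.

%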
Although $\rho$ and $R^2$ are often used to quantify the strength of a
relationship, they are not easy to interpret in terms of predictive
power.  In my opinion, {\tt Std(res)} is the best representation
of the quality of prediction, especially if it is presented
in relation to {\tt Std(ys)}.
\index{coefficient of determination}
\index{r-squared}

For example, when people talk about the validity of the SAT
(a standardized test used for college admission in the U.S.) they
often talk about correlations between SAT scores and other measures of
intelligence.
\index{SAT}
\index{IQ}

According to one study, there is a Pearson correlation of
$\rho=0.72$ between total SAT scores and IQ scores, which sounds like
a strong correlation.  But $R^2 = \rho^2 = 0.52$, so SAT scores
account for only 52\% of variance in IQ.

IQ scores are normalized with {\tt Std(ys) = 15}, so

\begin{verbatim}
>>> import math
>>> var_ys = 15**2
>>> rho = 0.72
>>> r2 = rho**2
>>> var_res = (1 - r2) * var_ys
>>> std_res = math.sqrt(var_res)
>>> round(std_res, 4)
10.4096
\end{verbatim}

So using SAT score to predict IQ reduces RMSE from 15 points to 10.4
points.  A correlation of 0.72 yields a reduction in RMSE of only
31\%.

If you see a correlation that looks impressive, remember that $R^2$ is
a better indicator of reduction in MSE, and reduction in RMSE is a
better indicator of predictive power.
\index{coefficient of determination}
\index{r-squared}
\index{prediction}

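The SAT calculation in this section generalizes to any correlation.
Because {\tt Std(res)} equals {\tt Std(ys)} times $\sqrt{1 - \rho^2}$,
the fractional reduction in RMSE is $1 - \sqrt{1 - \rho^2}$.  The
short sketch below tabulates that quantity; the function name
\verb"rmse_reduction" is hypothetical.

```python
import math


def rmse_reduction(rho):
    """Fractional reduction in RMSE for a linear predictor whose
    correlation with the target is rho."""
    return 1 - math.sqrt(1 - rho**2)


for rho in [0.1, 0.3, 0.5, 0.72, 0.9]:
    print(rho, round(rmse_reduction(rho), 3))
```

For $\rho = 0.72$ this gives about 0.31, matching the 31\% reduction
computed for the SAT example.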
\section{Testing a linear model}

The effect of mother's age on birth weight is small, and has little
predictive power.  So is it possible that the apparent relationship
is due to chance?  There are several ways we might test the
results of a linear fit.
\index{birth weight}
\index{weight!birth}
\index{model}
\index{linear model}

One option is to test whether the apparent reduction in MSE is due to
chance.  In that case, the test statistic is $R^2$ and the null
hypothesis is that there is no relationship between the variables.  We
can simulate the null hypothesis by permutation, as in
Section~\ref{corrtest}, when we tested the correlation between
mother's age and birth weight.  In fact, because $R^2 = \rho^2$, a
one-sided test of $R^2$ is equivalent to a two-sided test of $\rho$.
We've already done that test, and found $p < 0.001$, so we conclude
that the apparent relationship between mother's age and birth weight
is statistically significant.
\index{null hypothesis}
\index{permutation}
\index{coefficient of determination}
\index{r-squared}
  \index{significant} \index{statistically significant}

Another approach is to test whether the apparent slope is due to chance.
The null hypothesis is that the slope is actually zero; in that case
we can model the birth weights as random variations around their mean.
Here's a HypothesisTest for this model:
\index{HypothesisTest}
\index{model}

\begin{verbatim}
class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights
\end{verbatim}

The data are represented as sequences of ages and weights.  The
test statistic is the slope estimated by {\tt LeastSquares}.
The model of the null hypothesis is represented by the mean weight
of all babies and the deviations from the mean.  To
generate simulated data, we permute the deviations and add them to
the mean.
\index{deviation}
\index{null hypothesis}
\index{permutation}

Here's the code that runs the hypothesis test:

\begin{verbatim}
    live, firsts, others = first.MakeFrames()
    live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    ht = SlopeTest((live.agepreg, live.totalwgt_lb))
    pvalue = ht.PValue()
\end{verbatim}

The p-value is less than $0.001$, so although the estimated
slope is small, it is unlikely to be due to chance.
\index{p-value}
\index{dropna}
\index{NaN}

Estimating the p-value by simulating the null hypothesis is strictly
correct, but there is a simpler alternative.  Remember that we already
computed the sampling distribution of the slope, in
Section~\ref{regest}.  To do that, we assumed that the observed slope
was correct and simulated experiments by resampling.
\index{null hypothesis}

Figure~\ref{linear4} shows the sampling distribution of the
slope, from Section~\ref{regest}, and the distribution of slopes
generated under the null hypothesis.  The sampling distribution
is centered about the estimated slope, 0.017 lbs/year, and the slopes
under the null hypothesis are centered around 0; but other than
that, the distributions are identical.  The distributions are
also symmetric, for reasons we will see in Section~\ref{CLT}.
\index{symmetric}
\index{sampling distribution}

\begin{figure}
% linear.py
\centerline{\includegraphics[height=2.5in]{figs/linear4.pdf}}
\caption{The sampling distribution of the estimated
slope and the distribution of slopes
generated under the null hypothesis.  The vertical lines are at 0
and the observed slope, 0.017 lbs/year.}
\label{linear4}
\end{figure}

So we could estimate the p-value two ways:
\index{p-value}

\begin{itemize}

\item Compute the probability that the slope under the null
hypothesis exceeds the observed slope.
\index{null hypothesis}

\item Compute the probability that the slope in the sampling
distribution falls below 0.  (If the estimated slope were negative,
we would compute the probability that the slope in the sampling
distribution exceeds 0.)

\end{itemize}

The second option is easier because we normally want to compute the
sampling distribution of the parameters anyway.  And it is a good
approximation unless the sample size is small {\em and\/} the
distribution of residuals is skewed.  Even then, it is usually good
enough, because p-values don't have to be precise.
\index{skewness}
\index{parameter}

Here's the code that estimates the p-value of the slope using the
sampling distribution:
\index{sampling distribution}

\begin{verbatim}
    inters, slopes = SamplingDistributions(live, iters=1001)
    slope_cdf = thinkstats2.Cdf(slopes)
    pvalue = slope_cdf[0]
\end{verbatim}

Again, we find $p < 0.001$.

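The two ways of estimating the p-value in this section can be compared
side by side in a self-contained sketch.  It assumes NumPy and
synthetic data with a positive slope; the helpers \verb"slope_of",
\verb"permutation_pvalue" and \verb"resampling_pvalue" are hypothetical
names and are not part of {\tt thinkstats2}.

```python
import numpy as np


def slope_of(xs, ys):
    """Least squares slope of ys on xs."""
    return np.cov(xs, ys, ddof=0)[0, 1] / np.var(xs)


def permutation_pvalue(xs, ys, iters=1000, seed=None):
    """Fraction of slopes simulated under the null hypothesis that are
    at least as large as the observed slope (assumed positive)."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    observed = slope_of(xs, ys)
    ybar = ys.mean()
    res = ys - ybar
    count = 0
    for _ in range(iters):
        fake = ybar + rng.permutation(res)   # break the link between x and y
        if slope_of(xs, fake) >= observed:
            count += 1
    return count / iters


def resampling_pvalue(xs, ys, iters=1000, seed=None):
    """Fraction of resampled slopes that fall at or below zero."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = len(xs)
    count = 0
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)     # rows drawn with replacement
        if slope_of(xs[idx], ys[idx]) <= 0:
            count += 1
    return count / iters


# Synthetic data with a real positive slope (not the NSFG sample).
rng = np.random.default_rng(2)
ages = rng.uniform(15, 45, size=500)
weights = 6.0 + 0.05 * ages + rng.normal(0, 1.0, size=500)

p_perm = permutation_pvalue(ages, weights, iters=200, seed=3)
p_boot = resampling_pvalue(ages, weights, iters=200, seed=4)
```

With an effect this strong both estimates should be close to zero.
When no simulated slope crosses the threshold, the most that can be
said is that the p-value is smaller than one over the number of
iterations.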
\section{Weighted resampling}
\label{weighted}

So far we have treated the NSFG data as if it were a representative
sample, but as I mentioned in Section~\ref{nsfg}, it is not.  The
survey deliberately oversamples several groups in order to
improve the chance of getting statistically significant results; that
is, in order to improve the power of tests involving these groups.
  \index{significant} \index{statistically significant}

This survey design is useful for many purposes, but it means that we
cannot use the sample to estimate values for the general
population without accounting for the sampling process.

For each respondent, the NSFG data includes a variable called {\tt
  finalwgt}, which is the number of people in the general population
the respondent represents.  This value is called a {\bf sampling
  weight}, or just ``weight.''
\index{sampling weight}
\index{weight}
\index{weighted resampling}
\index{resampling!weighted}

As an example, if you survey 100,000 people in a country of 300
million, each respondent represents 3,000 people.  If you oversample
one group by a factor of 2, each person in the oversampled
group would have a lower weight, about 1,500.

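The arithmetic behind these weights can be written out as a small
sketch.  The function \verb"sampling_weight" and the 30-million-person
subgroup are hypothetical, chosen only to reproduce the round numbers
in the example.

```python
def sampling_weight(group_population, group_respondents):
    """Number of people in the population that each respondent represents."""
    return group_population / group_respondents


# Uniform sampling: 100,000 respondents from 300 million people.
print(sampling_weight(300_000_000, 100_000))    # 3000.0

# Hypothetical subgroup of 30 million people.  Sampled at the base rate it
# would contribute 10,000 respondents; oversampled by a factor of 2 it
# contributes 20,000, so each respondent carries half the weight.
print(sampling_weight(30_000_000, 20_000))      # 1500.0
```

In a real survey the weights also reflect adjustments such as
nonresponse, so they are rarely this uniform.

%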
To correct for oversampling, we can use resampling; that is, we
can draw samples from the survey using probabilities proportional
to sampling weights.  Then, for any quantity we want to estimate, we can
generate sampling distributions, standard errors, and confidence
intervals.  As an example, I will estimate mean birth weight with
and without sampling weights.
\index{standard error}
\index{confidence interval}
\index{birth weight}
\index{weight!birth}
\index{sampling distribution}
\index{oversampling}

In Section~\ref{regest}, we saw {\tt ResampleRows}, which chooses
rows from a DataFrame, giving each row the same probability.
Now we need to do the same thing using probabilities
proportional to sampling weights.
{\tt ResampleRowsWeighted} takes a DataFrame, resamples rows according
to the weights in {\tt finalwgt}, and returns a DataFrame containing
the resampled rows:
\index{DataFrame}
\index{resampling}

\begin{verbatim}
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
\end{verbatim}

{\tt weights} is a Series; converting it to a dictionary makes
a map from the indices to the weights.  In {\tt cdf} the values
are indices and the probabilities are proportional to the
weights.

{\tt indices} is a sequence of row indices; {\tt sample} is a
DataFrame that contains the selected rows.  Since we sample with
replacement, the same row might appear more than once.  \index{Cdf}
\index{replacement}

Now we can compare the effect of resampling with and without
weights.  Without weights, we generate the sampling distribution
like this:
\index{sampling distribution}

\begin{verbatim}
    estimates = [ResampleRows(live).totalwgt_lb.mean()
                 for _ in range(iters)]
\end{verbatim}

With weights, it looks like this:

\begin{verbatim}
    estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
                 for _ in range(iters)]
\end{verbatim}

The following table summarizes the results:

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
                    &  mean birth   & standard     &  90\% CI  \\
                    &  weight (lbs) & error (lbs)  &  (lbs)    \\
\hline
Unweighted          &  7.27  &  0.014  &  (7.24, 7.29)  \\
Weighted            &  7.35  &  0.014  &  (7.32, 7.37)  \\
\hline
\end{tabular}
\end{center}

%mean 7.26580789518
%stderr 0.0141683527792
%ci (7.2428565501217079, 7.2890814917127074)
%mean 7.34778034718
%stderr 0.0142738972319
%ci (7.3232804012858885, 7.3704916897506925)

In this example, the effect of weighting is small but non-negligible.
The difference in estimated means, with and without weighting, is
about 0.08 pounds, or 1.3 ounces.  This difference is substantially
larger than the standard error of the estimate, 0.014 pounds, which
implies that the difference is not due to chance.
\index{standard error}
\index{confidence interval}

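Weighted resampling can also be sketched with NumPy alone, which may
be useful for checking results.  The following is an approximation
under stated assumptions, not the {\tt thinkstats2} implementation: the
helper \verb"resample_rows_weighted" is hypothetical, and the two
groups, their means, and their weights are synthetic.

```python
import numpy as np


def resample_rows_weighted(values, weights, seed=None):
    """Draw len(values) items with replacement, with probability
    proportional to weights."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(len(values), size=len(values), replace=True, p=probs)
    return values[idx]


# Synthetic example: an oversampled group (low weight) with a lower mean.
rng = np.random.default_rng(5)
values = np.concatenate([rng.normal(7.0, 1.0, 600),    # oversampled group
                         rng.normal(7.5, 1.0, 400)])
weights = np.concatenate([np.full(600, 1500.0), np.full(400, 6000.0)])

unweighted = [rng.choice(values, size=len(values)).mean()
              for _ in range(101)]
weighted = [resample_rows_weighted(values, weights, seed=i).mean()
            for i in range(101)]

print(np.mean(unweighted), np.mean(weighted))
```

Because the group with the higher mean carries more weight, the
weighted estimate should come out higher than the unweighted one, in
the same way that weighting shifts the estimate in the table above.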
\section{Exercises}

A solution to this exercise is in \verb"chap10soln.ipynb".

\begin{exercise}

Using the data from the BRFSS, compute the linear least squares
fit for log(weight) versus height.
How would you best present the estimated parameters for a model
like this where one of the variables is log-transformed?
If you were trying to guess
someone's weight, how much would it help to know their height?
\index{Behavioral Risk Factor Surveillance System}
\index{BRFSS}
\index{model}

Like the NSFG, the BRFSS oversamples some groups and provides
a sampling weight for each respondent.  In the BRFSS data, the variable
name for these weights is {\tt finalwt}.
Use resampling, with and without weights, to estimate the mean height
of respondents in the BRFSS, the standard error of the mean, and a
90\% confidence interval.  How much does correct weighting affect the
estimates?
\index{confidence interval}
\index{standard error}
\index{oversampling}
\index{sampling weight}
\end{exercise}

\section{Glossary}

\begin{itemize}

\item linear fit: A line intended to model the relationship between
variables.  \index{linear fit}

\item least squares fit: A model of a dataset that minimizes the
sum of squares of the residuals.
\index{least squares fit}

\item residual: The deviation of an actual value from a model.
\index{residuals}

\item goodness of fit: A measure of how well a model fits data.
\index{goodness of fit}

\item coefficient of determination: A statistic intended to
quantify goodness of fit.
\index{coefficient of determination}

\item sampling weight: A value associated with an observation in a
  sample that indicates what part of the population it represents.
\index{sampling weight}

\end{itemize}

\chapter{Regression}
\label{regression}

The linear least squares fit in the previous chapter is an example of
{\bf regression}, which is the more general problem of fitting any
kind of model to any kind of data.  This use of the term ``regression''
is a historical accident; it is only indirectly related to the
original meaning of the word.
\index{model}
\index{regression}

The goal of regression analysis is to describe the relationship
between one set of variables, called the {\bf dependent variables},
and another set of variables, called independent or {\bf
  explanatory variables}.
\index{explanatory variable}
\index{dependent variable}

In the previous chapter we used mother's age as an explanatory
variable to predict birth weight as a dependent variable.  When there
is only one dependent and one explanatory variable, that's {\bf
  simple regression}.  In this chapter, we move on to {\bf multiple
  regression}, with more than one explanatory variable.  If there is
more than one dependent variable, that's multivariate
regression.
\index{birth weight}
\index{weight!birth}
\index{simple regression}
\index{multiple regression}

If the relationship between the dependent and explanatory variables
is linear, that's {\bf linear regression}.  For example,
if the dependent variable is $y$ and the explanatory variables
are $x_1$ and $x_2$, we would write the following linear
regression model:
%
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
%
where $\beta_0$ is the intercept, $\beta_1$ is the parameter
associated with $x_1$, $\beta_2$ is the parameter associated with
$x_2$, and $\eps$ is the residual due to random variation or other
unknown factors.
\index{regression model}
\index{linear regression}

Given a sequence of values for $y$ and sequences for $x_1$ and $x_2$,
we can find the parameters, $\beta_0$, $\beta_1$, and $\beta_2$, that
minimize the sum of $\eps^2$.  This process is called
{\bf ordinary least squares}.  The computation is similar to {\tt
  thinkstats2.LeastSquares}, but generalized to deal with more than one
explanatory variable.  You can find the details at
\url{https://en.wikipedia.org/wiki/Ordinary_least_squares}.
\index{explanatory variable}
\index{ordinary least squares}
\index{parameter}

The code for this chapter is in {\tt regression.py}.  For information
about downloading and working with this code, see Section~\ref{code}.

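To make the ordinary least squares computation described above
concrete, here is a minimal sketch using NumPy's least squares solver.
It is not the StatsModels implementation used in the rest of the
chapter; the function \verb"ordinary_least_squares" is a hypothetical
helper and the data are synthetic, generated from known parameters so
the estimates can be checked.

```python
import numpy as np


def ordinary_least_squares(y, *xs):
    """Return [beta_0, beta_1, ...] minimizing the sum of squared residuals."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones_like(y)] + [np.asarray(x, dtype=float) for x in xs]
    X = np.column_stack(columns)             # design matrix with intercept
    betas, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return betas


# Synthetic data generated from y = 1 + 2 x1 - 3 x2 + noise.
rng = np.random.default_rng(7)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 0.1, size=1000)

betas = ordinary_least_squares(y, x1, x2)
print(betas)
```

The column of ones in the design matrix is what produces the
intercept $\beta_0$; each remaining column contributes one parameter.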
\section{StatsModels}
\label{statsmodels}

In the previous chapter I presented {\tt thinkstats2.LeastSquares}, an
implementation of simple linear regression intended to be easy to
read.  For multiple regression we'll switch to StatsModels, a Python
package that provides several forms of regression and other
analyses.  If you are using Anaconda, you already have StatsModels;
otherwise you might have to install it.
\index{Anaconda}

As an example, I'll run the model from the previous chapter with
StatsModels:
\index{StatsModels}
\index{model}

\begin{verbatim}
    import first
    import statsmodels.formula.api as smf

    live, firsts, others = first.MakeFrames()
    # dependent variable left of ~, explanatory variable on the right
    formula = 'totalwgt_lb ~ agepreg'
    model = smf.ols(formula, data=live)
    results = model.fit()
\end{verbatim}

{\tt statsmodels} provides two interfaces (APIs); the ``formula''
API uses strings to identify the dependent and explanatory variables.
The formula syntax is provided by a library called {\tt patsy}; in this
example, the \verb"~"
operator separates the dependent variable on the left from the
explanatory variables on the right.
\index{explanatory variable}
\index{dependent variable}
\index{Patsy}

{\tt smf.ols} takes the formula string and the DataFrame, {\tt live},
and returns an OLS object that represents the model.  The name {\tt ols}
stands for ``ordinary least squares.''
\index{DataFrame}
\index{model}
\index{ordinary least squares}

The {\tt fit} method fits the model to the data and returns a
RegressionResults object that contains the results.
\index{RegressionResults}

The results are available as attributes.  {\tt params}
is a Series that maps from variable names to their parameters, so we can
get the intercept and slope like this:
\index{Series}

\begin{verbatim}
    inter = results.params['Intercept']
    slope = results.params['agepreg']
\end{verbatim}

The estimated parameters are 6.83 and 0.0175, the same as
from {\tt LeastSquares}.
\index{parameter}

{\tt pvalues} is a Series that maps from variable names to the associated
p-values, so we can check whether the estimated slope is statistically
significant:
\index{p-value}
  \index{significant} \index{statistically significant}

\begin{verbatim}
    slope_pvalue = results.pvalues['agepreg']
\end{verbatim}

The p-value associated with {\tt agepreg} is {\tt 5.7e-11}, which
is less than $0.001$, as expected.
\index{age}

{\tt results.rsquared} contains $R^2$, which is $0.0047$.  {\tt
  results} also provides \verb"f_pvalue", which is the p-value
associated with the model as a whole, similar to testing whether $R^2$
is statistically significant.
\index{model}
\index{coefficient of determination}
\index{r-squared}

And {\tt results} provides {\tt resid}, a sequence of residuals, and
{\tt fittedvalues}, a sequence of fitted values corresponding to
{\tt agepreg}.
\index{residuals}

The results object provides {\tt summary()}, which
represents the results in a readable format.

\begin{verbatim}
    print(results.summary())
\end{verbatim}

But it prints a lot of information that is not relevant (yet), so
I use a simpler function called {\tt SummarizeResults}.  Here are
the results of this model:

\begin{verbatim}
Intercept       6.83    (0)
agepreg         0.0175  (5.72e-11)
R^2 0.004738
Std(ys) 1.408
Std(res) 1.405
\end{verbatim}

{\tt Std(ys)} is the standard deviation of the dependent variable,
which is the RMSE if you have to guess birth weights without the benefit of
any explanatory variables.  {\tt Std(res)} is the standard deviation
of the residuals, which is the RMSE if your guesses are informed
by the mother's age.  As we have already seen, knowing the mother's
age provides no substantial improvement to the predictions.
\index{standard deviation}
\index{birth weight}
\index{weight!birth}
\index{explanatory variable}
\index{dependent variable}
\index{RMSE}
\index{predictive power}

\section{Multiple regression}
\label{multiple}

In Section~\ref{birth_weights} we saw that first babies tend to be
lighter than others, and this effect is statistically significant.
But it is a strange result because there is no obvious mechanism that
would cause first babies to be lighter.  So we might wonder whether
this relationship is {\bf spurious}.
\index{multiple regression}
\index{spurious relationship}

In fact, there is a possible explanation for this effect.  We have
seen that birth weight depends on mother's age, and we might expect
that mothers of first babies are younger than others.
\index{weight}
\index{age}

With a few calculations we can check whether this explanation
is plausible.  Then we'll use multiple regression to investigate
more carefully.  First, let's see how big the difference in weight
is:

\begin{verbatim}
diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
\end{verbatim}

First babies are 0.125 lbs lighter, or 2 ounces.  And the difference
in ages:

\begin{verbatim}
diff_age = firsts.agepreg.mean() - others.agepreg.mean()
\end{verbatim}

The mothers of first babies are 3.59 years younger.  Running the
linear model again, we get the change in birth weight as a function
of age:
\index{birth weight}
\index{weight!birth}

\begin{verbatim}
results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
slope = results.params['agepreg']
\end{verbatim}

The slope is 0.0175 pounds per year.  If we multiply the slope by
the difference in ages, we get the expected difference in birth
weight for first babies and others, due to mother's age:

8395
\begin{verbatim}
8396
slope * diff_age
8397
\end{verbatim}
8398

8399
The result is 0.063, just about half of the observed difference.
8400
So we conclude, tentatively, that the observed difference in birth
8401
weight can be partly explained by the difference in mother's age. 
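
As a quick check of the arithmetic, here is a minimal sketch that uses
the rounded values quoted above rather than recomputing them from the
data.  The names {\tt expected} and {\tt fraction} are introduced here
only for illustration.

\begin{verbatim}
# Values quoted in the text, rounded; not recomputed from the NSFG data.
diff_weight = -0.125   # lbs, firsts minus others
diff_age = -3.59       # years, firsts minus others
slope = 0.0175         # lbs per year of mother's age

# Difference in weight we would expect from the age difference alone.
expected = slope * diff_age

# Fraction of the observed difference that age might account for.
fraction = expected / diff_weight

print(round(expected, 3), round(fraction, 2))
\end{verbatim}

Both differences are negative because they are computed as first
babies minus others, so their ratio is positive and close to one half.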
8402

8403
Using multiple regression, we can explore these relationships
8404
more systematically.
8405
\index{multiple regression}
8406

8407
\begin{verbatim}
8408
    live['isfirst'] = live.birthord == 1
8409
    formula = 'totalwgt_lb ~ isfirst'
8410
    results = smf.ols(formula, data=live).fit()
8411
\end{verbatim}
8412

8413
The first line creates a new column named {\tt isfirst} that is
8414
True for first babies and false otherwise.  Then we fit a model
8415
using {\tt isfirst} as an explanatory variable.
8416
\index{model}
8417
\index{explanatory variable}
8418

8419
Here are the results:
8420

8421
\begin{verbatim}
8422
Intercept         7.33   (0)
8423
isfirst[T.True]  -0.125  (2.55e-05)
8424
R^2 0.00196
8425
\end{verbatim}
8426

8427
Because {\tt isfirst} is a boolean, {\tt ols} treats it as a
8428
{\bf categorical variable}, which means that the values fall
8429
into categories, like True and False, and should not be treated
8430
as numbers.  The estimated parameter is the effect on birth
8431
weight when {\tt isfirst} is true, so the result,
8432
-0.125 lbs, is the difference in
8433
birth weight between first babies and others.  
8434
\index{birth weight}
8435
\index{weight!birth}
8436
\index{categorical variable}
8437
\index{boolean}
8438

8439
The slope and the intercept are statistically significant,
which means that they were unlikely to occur by chance, but
the $R^2$ value for this model is small, which means that
{\tt isfirst} doesn't account for a substantial part of the
variation in birth weight.
8444
\index{coefficient of determination}
8445
\index{r-squared}
8446

8447
The results are similar with {\tt agepreg}:
8448

8449
\begin{verbatim}
8450
Intercept       6.83    (0)
8451
agepreg         0.0175  (5.72e-11)
8452
R^2 0.004738
8453
\end{verbatim}
8454

8455
Again, the parameters are statistically significant, but
8456
$R^2$ is low.
8457
\index{coefficient of determination}
8458
\index{r-squared}
8459

8460
These models confirm results we have already seen.  But now we
8461
can fit a single model that includes both variables.  With the
8462
formula \verb"totalwgt_lb ~ isfirst + agepreg", we get:
8463

8464
\begin{verbatim}
8465
Intercept        6.91    (0)
8466
isfirst[T.True] -0.0698  (0.0253)
8467
agepreg          0.0154  (3.93e-08)
8468
R^2 0.005289
8469
\end{verbatim}
8470

8471
In the combined model, the parameter for {\tt isfirst} is smaller
8472
by about half, which means that part of the apparent effect of
8473
{\tt isfirst} is actually accounted for by {\tt agepreg}.  And
8474
the p-value for {\tt isfirst} is about 2.5\%, which is on the
8475
border of statistical significance.
8476
\index{p-value}
8477
\index{model}
8478

8479
$R^2$ for this model is a little higher, which indicates that the
8480
two variables together account for more variation in birth weight
8481
than either alone (but not by much).
8482
\index{birth weight}
8483
\index{weight!birth}
8484
\index{coefficient of determination}
8485
\index{r-squared}
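
To see what the combined model says, it can help to plug numbers into
it by hand.  The following is only a sketch: it copies the rounded
parameters from the output above, and the function
\verb"predict_weight" is a hypothetical helper, not part of
StatsModels.

\begin{verbatim}
# Rounded parameters of the combined model, copied from the output above.
params = {'Intercept': 6.91, 'isfirst': -0.0698, 'agepreg': 0.0154}

def predict_weight(isfirst, agepreg):
    """Predicted birth weight in lbs for one pregnancy."""
    return (params['Intercept']
            + params['isfirst'] * int(isfirst)
            + params['agepreg'] * agepreg)

first = predict_weight(True, 25)
other = predict_weight(False, 25)
print(round(first, 2), round(other, 2), round(first - other, 4))
\end{verbatim}

For two mothers of the same age, the predictions differ by exactly the
parameter of {\tt isfirst}, which is what it means to estimate the
effect of {\tt isfirst} while holding age fixed.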
8486

8487

8488
\section{Nonlinear relationships}
8489
\label{nonlinear}
8490

8491
Remembering that the contribution of {\tt agepreg} might be nonlinear,
8492
we might consider adding a variable to capture more of this
8493
relationship.  One option is to create a column, {\tt agepreg2},
8494
that contains the squares of the ages:
8495
\index{nonlinear}
8496

8497
\begin{verbatim}
8498
    live['agepreg2'] = live.agepreg**2
8499
    formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
8500
\end{verbatim}
8501

8502
Now by estimating parameters for {\tt agepreg} and {\tt agepreg2},
8503
we are effectively fitting a parabola:
8504

8505
\begin{verbatim}
8506
Intercept        5.69     (1.38e-86)
8507
isfirst[T.True] -0.0504   (0.109)
8508
agepreg          0.112    (3.23e-07)
8509
agepreg2        -0.00185  (8.8e-06)
8510
R^2 0.007462
8511
\end{verbatim}
8512

8513
The parameter of {\tt agepreg2} is negative, so the parabola
8514
curves downward, which is consistent with the shape of the lines
8515
in Figure~\ref{linear2}.
8516
\index{parabola}
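
One way to get a feel for the fitted parabola is to find where it
peaks.  This sketch assumes the rounded parameters shown above;
\verb"peak_age" and \verb"age_contribution" are names introduced here
for illustration.

\begin{verbatim}
# Rounded parameters of the quadratic model, copied from the output above.
b_age = 0.112       # coefficient of agepreg
b_age2 = -0.00185   # coefficient of agepreg2

# A parabola a*x**2 + b*x + c with a < 0 peaks at x = -b / (2*a).
peak_age = -b_age / (2 * b_age2)

def age_contribution(age):
    """Contribution of mother's age to predicted weight, in lbs."""
    return b_age * age + b_age2 * age**2

print(round(peak_age, 1))
\end{verbatim}

With these parameters the contribution of age is largest for mothers
around 30 and smaller for younger and older mothers.  Because the
parameters are rounded, the location of the peak is only approximate.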
8517

8518
The quadratic model of {\tt agepreg} accounts for more of the
8519
variability in birth weight; the parameter for {\tt isfirst}
8520
is smaller in this model, and no longer statistically significant.
8521
\index{birth weight}
8522
\index{weight!birth}
8523
\index{quadratic model}
8524
\index{model}
8525
  \index{significant} \index{statistically significant}
8526

8527
Using computed variables like {\tt agepreg2} is a common way to
8528
fit polynomials and other functions to data.  
8529
This process is still considered linear
8530
regression, because the dependent variable is a linear function of
8531
the explanatory variables, regardless of whether some variables
8532
are nonlinear functions of others.
8533
\index{explanatory variable}
8534
\index{dependent variable}
8535
\index{nonlinear}
8536

8537
The following table summarizes the results of these regressions:
8538

8539
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline & isfirst & agepreg & agepreg2 & $R^2$ \\ \hline
Model 1 & -0.125 * & -- & -- & 0.002 \\
Model 2 & -- & 0.0175 * & -- & 0.0047 \\
Model 3 & -0.0698 (0.025) & 0.0154 * & -- & 0.0053 \\
Model 4 & -0.0504 (0.11) & 0.112 * & -0.00185 * & 0.0075 \\
\hline
\end{tabular}
\end{center}
8549

8550
The columns in this table are the explanatory variables and
the coefficient of determination, $R^2$.  Each entry is an estimated
parameter and either a p-value in parentheses or an asterisk to
indicate a p-value less than 0.001.
8554
\index{p-value}
8555
\index{coefficient of determination}
8556
\index{r-squared}
8557
\index{explanatory variable}
8558

8559
We conclude that the apparent difference in birth weight
8560
is explained, at least in part, by the difference in mother's age.
8561
When we include mother's age in the model, the effect of
8562
{\tt isfirst} gets smaller, and the remaining effect might be
8563
due to chance.
8564
\index{age}
8565

8566
In this example, mother's age acts as a {\bf control variable};
8567
including {\tt agepreg} in the model ``controls for'' the
8568
difference in age between first-time mothers and others, making
8569
it possible to isolate the effect (if any) of {\tt isfirst}. 
8570
\index{control variable}
8571

8572

8573
\section{Data mining}
8574
\label{mining}
8575

8576
So far we have used regression models for explanation; for example,
in the previous section we discovered that an apparent difference
in birth weight is at least partly due to a difference in mother's age.
But the $R^2$ values of those models are very low, which means that
they have little predictive power.  In this section we'll try to
do better.
8582
\index{birth weight}
8583
\index{weight!birth}
8584
\index{regression model}
8585
\index{coefficient of determination}
8586
\index{r-squared}
8587

8588
Suppose one of your co-workers is expecting a baby and
8589
there is an office pool to guess the baby's birth weight (if you are
8590
not familiar with betting pools, see
8591
\url{https://en.wikipedia.org/wiki/Betting_pool}).
8592
\index{betting pool}
8593

8594
Now suppose that you {\em really\/} want to win the pool.  What could
8595
you do to improve your chances?  Well, 
8596
the NSFG dataset includes 244 variables about each pregnancy and another
8597
3087 variables about each respondent.  Maybe some of those variables
8598
have predictive power.  To find out which ones are most useful,
8599
why not try them all?
8600
\index{NSFG}
8601

8602
Testing the variables in the pregnancy table is easy, but in order to
use the variables in the respondent table, we have to match up each
pregnancy with a respondent.  In theory we could iterate through the
rows of the pregnancy table, use the {\tt caseid} to find the
corresponding respondent, and copy the values from the
respondent table into the pregnancy table.  But that would be slow.
8608
\index{join}
8609
\index{SQL}
8610

8611
A better option is to recognize this process as a {\bf join} operation
8612
as defined in SQL and other relational database languages (see
8613
\url{https://en.wikipedia.org/wiki/Join_(SQL)}).  Join is implemented
8614
as a DataFrame method, so we can perform the operation like this:
8615
\index{DataFrame}
8616

8617
\begin{verbatim}
8618
    live = live[live.prglngth>30]
8619
    resp = chap01soln.ReadFemResp()
8620
    resp.index = resp.caseid
8621
    join = live.join(resp, on='caseid', rsuffix='_r')
8622
\end{verbatim}
8623

8624
The first line selects records for pregnancies longer than 30 weeks,
8625
assuming that the office pool is formed several weeks before the
8626
due date.
8627
\index{betting pool}
8628

8629
The next line reads the respondent file.  The result is a DataFrame
8630
with integer indices; in order to look up respondents efficiently,
8631
I replace {\tt resp.index} with {\tt resp.caseid}. 
8632

8633
The {\tt join} method is invoked on {\tt live}, which is considered
8634
the ``left'' table, and passed {\tt resp}, which is the ``right'' table.
8635
The keyword argument {\tt on} indicates the variable used to match up
8636
rows from the two tables.
8637

8638
In this example some column names appear in both tables,
8639
so we have to provide {\tt rsuffix}, which is a string that will be
8640
appended to the names of overlapping columns from the right table.
8641
For example, both tables have a column named {\tt race} that encodes
8642
the race of the respondent.  The result of the join contains two
8643
columns named {\tt race} and \verb"race_r".
8644
\index{race}
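
To make the semantics of the join concrete, here is a sketch in plain
Python that does the same matching on two tiny, made-up tables.  It is
not how pandas implements {\tt join}; the function \verb"left_join"
and the sample rows are hypothetical.

\begin{verbatim}
# Hypothetical miniature tables; the real ones have many more columns.
pregnancies = [
    {'caseid': 1, 'race': 2, 'totalwgt_lb': 7.5},
    {'caseid': 1, 'race': 2, 'totalwgt_lb': 8.1},
    {'caseid': 2, 'race': 1, 'totalwgt_lb': 6.9},
]
# The respondent table, indexed by caseid for fast lookup.
respondents = {
    1: {'race': 2, 'totincr': 9},
    2: {'race': 1, 'totincr': 4},
}

def left_join(left, right, on, rsuffix):
    """Attach the matching right-hand row to each left-hand row."""
    joined = []
    for row in left:
        out = dict(row)
        match = right.get(row[on], {})
        for name, value in match.items():
            # Overlapping names from the right table get the suffix.
            key = name + rsuffix if name in row else name
            out[key] = value
        joined.append(out)
    return joined

join = left_join(pregnancies, respondents, on='caseid', rsuffix='_r')
print(join[0])
\end{verbatim}

Each pregnancy keeps its own columns and gains the columns of the
matching respondent, so a respondent with two pregnancies appears
twice in the result.  One difference from pandas: when there is no
match, this sketch adds nothing, whereas {\tt DataFrame.join} fills
the missing columns with {\tt nan}.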
8645

8646
The pandas implementation is fast.  Joining the NSFG tables takes
8647
less than a second on an ordinary desktop computer.
8648
Now we can start testing variables.
8649
\index{pandas}
8650
\index{join}
8651

8652
\begin{verbatim}
8653
    t = []
    for name in join.columns:
        try:
            # skip variables with almost no variability
            if join[name].var() < 1e-7:
                continue

            formula = 'totalwgt_lb ~ agepreg + ' + name
            model = smf.ols(formula, data=join)

            # skip variables that are missing for most records
            if model.nobs < len(join)/2:
                continue

            results = model.fit()
        except (ValueError, TypeError):
            # skip variables that can't be used in the model
            continue

        t.append((results.rsquared, name))
8669
\end{verbatim}
8670

8671
For each variable we construct a model, compute $R^2$, and append
8672
the results to a list.  The models all include {\tt agepreg}, since
8673
we already know that it has some predictive power.
8674
\index{model}
8675
\index{coefficient of determination}
8676
\index{r-squared}
8677

8678
I check that each explanatory variable has some variability; otherwise
8679
the results of the regression are unreliable.  I also check the number
8680
of observations for each model.  Variables that contain a large number
8681
of {\tt nan}s are not good candidates for prediction.
8682
\index{explanatory variable}
8683
\index{NaN}
8684

8685
For most of these variables, we haven't done any cleaning.  Some of them
8686
are encoded in ways that don't work very well for linear regression.
8687
As a result, we might overlook some variables that would be useful if
8688
they were cleaned properly.  But maybe we will find some good candidates.
8689
\index{cleaning}
8690

8691

8692
\section{Prediction}
8693

8694
The next step is to sort the results and select the variables that
8695
yield the highest values of $R^2$.
8696
\index{prediction}
8697

8698
\begin{verbatim}
8699
    t.sort(reverse=True)
    for r2, name in t[:30]:
        print(name, r2)
8702
\end{verbatim}
8703

8704
The first variable on the list is \verb"totalwgt_lb",
8705
followed by \verb"birthwgt_lb".  Obviously, we can't use birth
8706
weight to predict birth weight.
8707
\index{birth weight}
8708
\index{weight!birth}
8709

8710
Similarly {\tt prglngth} has useful predictive power, but for the
8711
office pool we assume pregnancy length (and the related variables)
8712
are not known yet.
8713
\index{predictive power}
8714
\index{pregnancy length}
8715

8716
The first useful predictive variable is {\tt babysex} which indicates
8717
whether the baby is male or female.  In the NSFG dataset, boys are
8718
about 0.3 lbs heavier.  So, assuming that the sex of the baby is
8719
known, we can use it for prediction.
8720
\index{sex}
8721

8722
Next is {\tt race}, which indicates whether the respondent is white,
8723
black, or other.  As an explanatory variable, race can be problematic.
8724
In datasets like the NSFG, race is correlated with many other
8725
variables, including income and other socioeconomic factors.  In a
8726
regression model, race acts as a {\bf proxy variable},
8727
so apparent correlations with race are often caused, at least in
8728
part, by other factors.
8729
\index{explanatory variable}
8730
\index{race}
8731

8732
The next variable on the list is {\tt nbrnaliv}, which indicates
8733
whether the pregnancy yielded multiple births.  Twins and triplets
8734
tend to be smaller than other babies, so if we know whether our
8735
hypothetical co-worker is expecting twins, that would help.
8736
\index{multiple birth}
8737

8738
Next on the list is {\tt paydu}, which indicates whether the
8739
respondent owns her home.  It is one of several income-related
8740
variables that turn out to be predictive.  In datasets like the NSFG,
8741
income and wealth are correlated with just about everything.  In this
8742
example, income is related to diet, health, health care, and other
8743
factors likely to affect birth weight.
8744
\index{birth weight}
8745
\index{weight!birth}
8746
\index{income}
8747
\index{wealth}
8748

8749
Some of the other variables on the list are things that would not
8750
be known until later, like {\tt bfeedwks}, the number of weeks
8751
the baby was breast fed.  We can't use these variables for prediction,
8752
but you might want to speculate on reasons
8753
{\tt bfeedwks} might be correlated with birth weight.
8754

8755
Sometimes you start with a theory and use data to test it.  Other
8756
times you start with data and go looking for possible theories.
8757
The second approach, which this section demonstrates, is
8758
called {\bf data mining}.  An advantage of data mining is that it
8759
can discover unexpected patterns.  A hazard is that many of the
8760
patterns it discovers are either random or spurious.
8761
\index{theory}
8762
\index{data mining}
8763

8764
Having identified potential explanatory variables, I tested a few
8765
models and settled on this one:
8766
\index{model}
8767
\index{explanatory variable}
8768

8769
\begin{verbatim}
8770
    formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '
8771
               'nbrnaliv>1 + paydu==1 + totincr')
8772
    results = smf.ols(formula, data=join).fit()
8773
\end{verbatim}
8774

8775
This formula uses some syntax we have not seen yet:
8776
{\tt C(race)} tells the formula parser (Patsy) to treat race as a
8777
categorical variable, even though it is encoded numerically.
8778
\index{Patsy}
8779
\index{categorical variable}
8780

8781
The encoding for {\tt babysex} is 1 for male, 2 for female; writing
8782
{\tt babysex==1} converts it to boolean, True for male and false for
8783
female.
8784
\index{boolean}
8785

8786
Similarly {\tt nbrnaliv>1} is True for multiple births and 
8787
{\tt paydu==1} is True for respondents who own their houses.
8788

8789
{\tt totincr} is encoded numerically from 1--14, with each increment
representing about \$5000 in annual income.  So we can treat these
values as numerical, expressed in units of \$5000.
8792
\index{income}
8793

8794
Here are the results of the model:
8795

8796
\begin{verbatim}
8797
Intercept               6.63    (0)
C(race)[T.2]            0.357   (5.43e-29)
C(race)[T.3]            0.266   (2.33e-07)
babysex == 1[T.True]    0.295   (5.39e-29)
nbrnaliv > 1[T.True]   -1.38    (5.1e-37)
paydu == 1[T.True]      0.12    (0.000114)
agepreg                 0.00741 (0.0035)
totincr                 0.0122  (0.00188)
8805
\end{verbatim}
8806

8807
The estimated parameters for race are larger than I expected,
8808
especially since we control for income.  The encoding
8809
is 1 for black, 2 for white, and 3 for other.  Babies of black
8810
mothers are lighter than babies of other races by 0.27--0.36 lbs.
8811
\index{control variable}
8812
\index{race}
8813

8814
As we've already seen, boys are heavier by about 0.3 lbs;
twins and other multiples are lighter by 1.4 lbs.
8816
\index{weight}
8817

8818
People who own their homes have heavier babies by about 0.12 lbs,
8819
even when we control for income.  The parameter for mother's
8820
age is smaller than what we saw in Section~\ref{multiple}, which
8821
suggests that some of the other variables are correlated with
8822
age, probably including {\tt paydu} and {\tt totincr}.
8823
\index{income}
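
For the office pool, what we want from this model is a single number.
As a sketch of how the parameters combine, the following computes a
guess by hand from the rounded values above.  The function
\verb"predict_weight", the short key names, and the example co-worker
are all hypothetical.

\begin{verbatim}
# Rounded parameters, copied from the results above.
params = {
    'Intercept': 6.63,
    'race2': 0.357,     # C(race)[T.2], white
    'race3': 0.266,     # C(race)[T.3], other
    'boy': 0.295,       # babysex == 1
    'multiple': -1.38,  # nbrnaliv > 1
    'owner': 0.12,      # paydu == 1
    'agepreg': 0.00741,
    'totincr': 0.0122,
}

def predict_weight(race, boy, multiple, owner, agepreg, totincr):
    """Predicted birth weight in lbs; race is coded 1, 2, or 3."""
    total = params['Intercept']
    if race == 2:
        total += params['race2']
    elif race == 3:
        total += params['race3']
    total += params['boy'] * boy
    total += params['multiple'] * multiple
    total += params['owner'] * owner
    total += params['agepreg'] * agepreg
    total += params['totincr'] * totincr
    return total

guess = predict_weight(race=2, boy=True, multiple=False, owner=True,
                       agepreg=30, totincr=10)
print(round(guess, 2))
\end{verbatim}

The categorical and boolean variables each add a fixed offset when
they apply, and the numerical variables contribute in proportion to
their values.  In practice {\tt results.predict} does this bookkeeping
for you.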
8824

8825
All of these variables are statistically significant, some with
8826
very low p-values, but 
8827
$R^2$ is only 0.06, still quite small.
8828
RMSE without using the model is 1.27 lbs; with the model it drops
8829
to 1.23.  So your chance of winning the pool is not substantially
8830
improved.  Sorry!
8831
\index{p-value}
8832
\index{model}
8833
\index{coefficient of determination}
8834
\index{r-squared}
8835
  \index{significant} \index{statistically significant}
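
The two RMSE values are consistent with the value of $R^2$.  For a
least squares fit, $R^2$ is one minus the ratio of the variance of the
residuals to the variance of the dependent variable, so a rough check
needs only the rounded numbers quoted above.  The variable names in
this sketch are introduced for illustration.

\begin{verbatim}
import math

# Values quoted in the text, rounded.
rmse_without = 1.27   # lbs, guessing the mean for everyone
r2 = 0.06

# For a least squares fit, R^2 = 1 - Var(res) / Var(y), so the
# RMSE of the residuals is the original RMSE times sqrt(1 - R^2).
rmse_with = rmse_without * math.sqrt(1 - r2)

print(round(rmse_with, 2))
\end{verbatim}

Because the reduction in RMSE goes with the square root of $1 - R^2$,
a model has to explain a large share of the variance before the
predictions get much tighter.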
8836

8837

8838

8839
\section{Logistic regression}
8840

8841
In the previous examples, some of the explanatory variables were
8842
numerical and some categorical (including boolean).  But the dependent
8843
variable was always numerical.
8844
\index{explanatory variable}
8845
\index{dependent variable}
8846
\index{categorical variable}
8847

8848
Linear regression can be generalized to handle other kinds of
8849
dependent variables.  If the dependent variable is boolean, the
8850
generalized model is called {\bf logistic regression}.  If the dependent
8851
variable is an integer count, it's called {\bf Poisson
8852
regression}.
8853
\index{model}
8854
\index{logistic regression}
8855
\index{Poisson regression}
8856
\index{boolean}
8857

8858
As an example of logistic regression, let's consider a variation
8859
on the office pool scenario.
8860
Suppose
8861
a friend of yours is pregnant and you want to predict whether the
8862
baby is a boy or a girl.  You could use data from the NSFG to find
8863
factors that affect the ``sex ratio'', which is conventionally
8864
defined to be the probability
8865
of having a boy.
8866
\index{betting pool}
8867
\index{sex}
8868

8869
If you encode the dependent variable numerically, for example 0 for a
8870
girl and 1 for a boy, you could apply ordinary least squares, but
8871
there would be problems.  The linear model might be something like
8872
this:
8873
%
8874
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
8875
%
8876
where $y$ is the dependent variable, and $x_1$ and $x_2$ are
explanatory variables.  Then we could find the parameters that
minimize the sum of squared residuals.
8879
\index{regression model}
8880
\index{explanatory variable}
8881
\index{dependent variable}
8882
\index{ordinary least squares}
8883

8884
The problem with this approach is that it produces predictions that
8885
are hard to interpret.  Given estimated parameters and values for
8886
$x_1$ and $x_2$, the model might predict $y=0.5$, but the only
8887
meaningful values of $y$ are 0 and 1.
8888
\index{parameter}
8889

8890
It is tempting to interpret a result like that as a probability; for
8891
example, we might say that a respondent with particular values of
8892
$x_1$ and $x_2$ has a 50\% chance of having a boy.  But it is also
8893
possible for this model to predict $y=1.1$ or $y=-0.1$, and those
8894
are not valid probabilities.
8895
\index{probability}
8896

8897
Logistic regression avoids this problem by expressing predictions in
8898
terms of {\bf odds} rather than probabilities.  If you are not
8899
familiar with odds, ``odds in favor'' of an event is the ratio of the
8900
probability it will occur to the probability that it will not.
8901
\index{odds}
8902

8903
So if I think my team has a 75\% chance of winning, I would
8904
say that the odds in their favor are three to one, because
8905
the chance of winning is three times the chance of losing.
8906

8907
Odds and probabilities are different representations of the same
8908
information.  Given a probability, you can compute the odds like this:
8909

8910
\begin{verbatim}
8911
    o = p / (1-p)
8912
\end{verbatim}
8913

8914
Given odds in favor, you can convert to
8915
probability like this:
8916

8917
\begin{verbatim}
8918
    p = o / (o+1)
8919
\end{verbatim}
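
As a small check that the two conversions are inverses, here is a
sketch that wraps them in functions and runs the example of the team
with a 75\% chance of winning.  The function names {\tt odds} and
{\tt probability} are chosen here for illustration.

\begin{verbatim}
def odds(p):
    """Odds in favor, given a probability strictly less than 1."""
    return p / (1 - p)

def probability(o):
    """Probability, given odds in favor."""
    return o / (o + 1)

o = odds(0.75)
p = probability(o)
print(o, p)
\end{verbatim}

A probability of 0.75 corresponds to odds of 3, and converting back
recovers 0.75.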
8920

8921
Logistic regression is based on the following model:
8922
%
8923
\[ \log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \eps \]
8924
%
8925
where $o$ is the odds in favor of a particular outcome; in the
example, $o$ would be the odds of having a boy.
8927
\index{regression model}
8928

8929
Suppose we have estimated the parameters $\beta_0$, $\beta_1$, and
8930
$\beta_2$ (I'll explain how in a minute).  And suppose we are given
8931
values for $x_1$ and $x_2$.  We can compute the predicted value of
8932
$\log o$, and then convert to a probability:
8933

8934
\begin{verbatim}
8935
    o = np.exp(log_o)
8936
    p = o / (o+1)
8937
\end{verbatim}
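
The reason this conversion avoids the problem with the linear model is
that every value of $\log o$, positive or negative, maps to a
probability between 0 and 1.  The following sketch uses only the
standard library; the function name and the sample values are made up
for illustration.

\begin{verbatim}
import math

def prob_from_log_odds(log_o):
    """Convert log odds to a probability."""
    o = math.exp(log_o)
    return o / (o + 1)

for log_o in [-5, -0.1, 0, 1.1, 5]:
    print(log_o, round(prob_from_log_odds(log_o), 3))
\end{verbatim}

Values like $-0.1$ and $1.1$, which are not valid probabilities, are
perfectly good log odds, and log odds of 0 correspond to a probability
of 0.5.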
8938

8939
So in the office pool scenario we could compute the predictive
8940
probability of having a boy.  But how do we estimate the parameters?
8941
\index{parameter}
8942

8943

8944
\section{Estimating parameters}
8945

8946
Unlike linear regression, logistic regression does not have a
8947
closed form solution, so it is solved by guessing an initial
8948
solution and improving it iteratively.
8949
\index{logistic regression}
8950
\index{closed form}
8951

8952
The usual goal is to find the maximum-likelihood estimate (MLE),
8953
which is the set of parameters that maximizes the likelihood of the
8954
data.  For example, suppose we have the following data:
8955
\index{MLE}
8956
\index{maximum likelihood estimator}
8957

8958
\begin{verbatim}
8959
>>> y = np.array([0, 1, 0, 1])
8960
>>> x1 = np.array([0, 0, 0, 1])
8961
>>> x2 = np.array([0, 1, 1, 1])
8962
\end{verbatim}
8963

8964
And we start with the initial guesses $\beta_0=-1.5$, $\beta_1=2.8$,
8965
and $\beta_2=1.1$:
8966

8967
\begin{verbatim}
8968
>>> beta = [-1.5, 2.8, 1.1]
8969
\end{verbatim}
8970

8971
Then for each row we can compute \verb"log_o":
8972

8973
\begin{verbatim}
8974
>>> log_o = beta[0] + beta[1] * x1 + beta[2] * x2 
8975
[-1.5 -0.4 -0.4  2.4]
8976
\end{verbatim}
8977

8978
And convert from log odds to probabilities:
8979
\index{log odds}
8980

8981
\begin{verbatim}
8982
>>> o = np.exp(log_o)
[  0.223   0.670   0.670  11.02  ]

>>> p = o / (o+1)
[ 0.182  0.401  0.401  0.916 ]
8987
\end{verbatim}
8988

8989
Notice that when \verb"log_o" is greater than 0, {\tt o}
8990
is greater than 1 and {\tt p} is greater than 0.5.
8991

8992
The likelihood of an outcome is {\tt p} when {\tt y==1} and {\tt 1-p}
8993
when {\tt y==0}.  For example, if we think the probability of a boy is
8994
0.8 and the outcome is a boy, the likelihood is 0.8; if
8995
the outcome is a girl, the likelihood is 0.2.  We can compute that
8996
like this:
8997
\index{likelihood}
8998

8999
\begin{verbatim}
9000
>>> likes = y * p + (1-y) * (1-p)
9001
[ 0.817  0.401  0.598  0.916 ]
9002
\end{verbatim}
9003

9004
The overall likelihood of the data is the product of {\tt likes}:
9005

9006
\begin{verbatim}
9007
>>> like = np.prod(likes)
9008
0.18
9009
\end{verbatim}
9010

9011
For these values of {\tt beta}, the likelihood of the data is 0.18.
9012
The goal of logistic regression is to find parameters that maximize
9013
this likelihood.  To do that, most statistics packages use an
9014
iterative solver like Newton's method (see
9015
\url{https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting}).
9016
\index{Newton's method}
9017
\index{iterative solver}
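
To give a sense of what ``improving iteratively'' means, here is a
sketch that starts from the same initial guess and takes a few small
steps uphill.  It uses gradient ascent on the log-likelihood, which is
simpler than Newton's method and is not what StatsModels does; the
function names and the step size are assumptions made for this
example.

\begin{verbatim}
import math

# The toy data from this section.
y = [0, 1, 0, 1]
x1 = [0, 0, 0, 1]
x2 = [0, 1, 1, 1]

def probabilities(beta):
    """Predicted probability that y is 1 for each row."""
    ps = []
    for a, b in zip(x1, x2):
        log_o = beta[0] + beta[1] * a + beta[2] * b
        o = math.exp(log_o)
        ps.append(o / (o + 1))
    return ps

def likelihood(beta):
    """Likelihood of the data under the parameters beta."""
    like = 1.0
    for yi, p in zip(y, probabilities(beta)):
        like *= p if yi == 1 else 1 - p
    return like

def step(beta, rate=0.1):
    """One step uphill along the gradient of the log-likelihood."""
    ps = probabilities(beta)
    grad = [0.0, 0.0, 0.0]
    for yi, p, a, b in zip(y, ps, x1, x2):
        err = yi - p
        grad[0] += err
        grad[1] += err * a
        grad[2] += err * b
    return [bi + rate * g for bi, g in zip(beta, grad)]

beta = [-1.5, 2.8, 1.1]
before = likelihood(beta)
for _ in range(10):
    beta = step(beta)
after = likelihood(beta)
print(round(before, 2), after > before)
\end{verbatim}

The likelihood starts at about 0.18 and increases with each step.
With only four rows, this toy dataset has no finite maximum (the last
row can always be fit a little better by increasing $\beta_1$), so the
point here is only that each step improves the likelihood, not that
the loop converges.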
9018

9019

9020
\section{Implementation}
9021
\label{implementation}
9022

9023
StatsModels provides an implementation of logistic regression
9024
called {\tt logit}, named for the function that converts from
9025
probability to log odds.  To demonstrate its use, I'll look for
9026
variables that affect the sex ratio.
9027
\index{StatsModels}
9028
\index{sex ratio}
9029
\index{logit function}
9030

9031
Again, I load the NSFG data and select pregnancies longer than
9032
30 weeks:
9033

9034
\begin{verbatim}
9035
    live, firsts, others = first.MakeFrames()
    # copy so that adding a column later does not modify a view of live
    df = live[live.prglngth>30].copy()
9037
\end{verbatim}
9038

9039
{\tt logit} requires the dependent variable to be binary (rather than
9040
boolean), so I create a new column named {\tt boy}, using {\tt
9041
  astype(int)} to convert to binary integers:
9042
\index{dependent variable}
9043
\index{boolean}
9044
\index{binary}
9045

9046
\begin{verbatim}
9047
    df['boy'] = (df.babysex==1).astype(int)
9048
\end{verbatim}
9049

9050
Factors that have been found to affect sex ratio include parents'
9051
age, birth order, race, and social status.  We can use logistic
9052
regression to see if these effects appear in the NSFG data.  I'll
9053
start with the mother's age:
9054
\index{age}
9055
\index{race}
9056

9057
\begin{verbatim}
9058
    import statsmodels.formula.api as smf
9059

9060
    model = smf.logit('boy ~ agepreg', data=df)
9061
    results = model.fit()
9062
    SummarizeResults(results)
9063
\end{verbatim}
9064

9065
{\tt logit} takes the same arguments as {\tt ols}, a formula
9066
in Patsy syntax and a DataFrame.  The result is a Logit object
9067
that represents the model.  It contains attributes called
9068
{\tt endog} and {\tt exog} that contain the {\bf endogenous
9069
variable}, another name for the dependent variable,
9070
and the {\bf exogenous variables}, another name for the
9071
explanatory variables.  Since they are NumPy arrays, it is
9072
sometimes convenient to convert them to DataFrames:
9073
\index{NumPy}
9074
\index{pandas}
9075
\index{DataFrame}
9076
\index{explanatory variable}
9077
\index{dependent variable}
9078
\index{exogenous variable}
9079
\index{endogenous variable}
9080
\index{Patsy}
9081

9082
\begin{verbatim}
9083
    endog = pandas.DataFrame(model.endog, columns=[model.endog_names])
9084
    exog = pandas.DataFrame(model.exog, columns=model.exog_names)
9085
\end{verbatim}
9086

9087
The result of {\tt model.fit} is a BinaryResults object, which is
9088
similar to the RegressionResults object we got from {\tt ols}.
9089
Here is a summary of the results:
9090

9091
\begin{verbatim}
9092
Intercept   0.00579   (0.953)
9093
agepreg     0.00105   (0.783)
9094
R^2 6.144e-06
9095
\end{verbatim}
9096

9097
The parameter of {\tt agepreg} is positive, which suggests that
9098
older mothers are more likely to have boys, but the p-value is
9099
0.783, which means that the apparent effect could easily be due
9100
to chance.
9101
\index{p-value}
9102
\index{age}
9103

9104
The coefficient of determination, $R^2$, does not apply to logistic
9105
regression, but there are several alternatives that are used
9106
as ``pseudo $R^2$ values.''  These values can be useful for comparing
9107
models.  For example, here's a model that includes several factors
9108
believed to be associated with sex ratio:
9109
\index{model}
9110
\index{coefficient of determination}
9111
\index{r-squared}
9112
\index{pseudo r-squared}
9113

9114
\begin{verbatim}
9115
    formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'
9116
    model = smf.logit(formula, data=df)
9117
    results = model.fit()
9118
\end{verbatim}
9119

9120
Along with mother's age, this model includes father's age at
9121
birth ({\tt hpagelb}), birth order ({\tt birthord}), and
9122
race as a categorical variable.  Here are the results:
9123
\index{categorical variable}
9124

9125
\begin{verbatim}
9126
Intercept      -0.0301     (0.772)
9127
C(race)[T.2]   -0.0224     (0.66)
9128
C(race)[T.3]   -0.000457   (0.996)
9129
agepreg        -0.00267    (0.629)
9130
hpagelb         0.0047     (0.266)
9131
birthord        0.00501    (0.821)
9132
R^2 0.000144
9133
\end{verbatim}
9134

9135
None of the estimated parameters are statistically significant.  The
9136
pseudo-$R^2$ value is a little higher, but that could be due to
9137
chance.
9138
\index{pseudo r-squared}
9139
  \index{significant} \index{statistically significant}
9140

9141

9142
\section{Accuracy}
9143
\label{accuracy}
9144

9145
In the office pool scenario,
9146
we are most interested in the accuracy of the model:
9147
the number of successful predictions, compared with what we would
9148
expect by chance.
9149
\index{model}
9150
\index{accuracy}
9151

9152
In the NSFG data, there are more boys than girls, so the baseline
9153
strategy is to guess ``boy'' every time.  The accuracy of this
9154
strategy is just the fraction of boys:
9155

9156
\begin{verbatim}
9157
    actual = endog['boy']
9158
    baseline = actual.mean()
9159
\end{verbatim}
9160

9161
Since {\tt actual} is encoded in binary integers, the mean is the
9162
fraction of boys, which is 0.507.
9163

9164
Here's how we compute the accuracy of the model:
9165

9166
\begin{verbatim}
9167
    predict = (results.predict() >= 0.5)
9168
    true_pos = predict * actual
9169
    true_neg = (1 - predict) * (1 - actual)
9170
\end{verbatim}
9171

9172
{\tt results.predict} returns a NumPy array of probabilities, which we
convert to boolean predictions by comparing with 0.5.  Multiplying by
{\tt actual} yields 1 if we predict a boy and get it right, 0 otherwise.
So, \verb"true_pos" indicates ``true positives''.
9176
\index{NumPy}
9177
\index{true positive}
9178
\index{true negative}
9179

9180
Similarly, \verb"true_neg" indicates the cases where we guess ``girl''
9181
and get it right.  Accuracy is the fraction of correct guesses:
9182

9183
\begin{verbatim}
9184
    acc = (sum(true_pos) + sum(true_neg)) / len(actual)
9185
\end{verbatim}
9186

9187
The result is 0.512, slightly better than the
baseline, 0.507.  But you should not take this result too seriously.
We used the same data to build and test the model, so the model
may not have predictive power on new data.
9191
\index{model}
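
Here is the same accuracy computation on a handful of made-up cases,
written in plain Python so each step is easy to check by hand.  The
outcomes and predicted probabilities are hypothetical; they are not
from the NSFG data.

\begin{verbatim}
# Hypothetical outcomes (1 for a boy) and predicted probabilities.
actual = [1, 0, 1, 1, 0, 1, 0, 1]
probs = [0.52, 0.49, 0.51, 0.48, 0.47, 0.53, 0.51, 0.50]

predict = [int(p >= 0.5) for p in probs]

true_pos = sum(p * a for p, a in zip(predict, actual))
true_neg = sum((1 - p) * (1 - a) for p, a in zip(predict, actual))

acc = (true_pos + true_neg) / len(actual)
baseline = sum(actual) / len(actual)
print(acc, baseline)
\end{verbatim}

In this example there are 4 true positives and 2 true negatives out of
8 cases, so the accuracy is 0.75, compared with a baseline of 0.625
for guessing ``boy'' every time.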
9192

9193
Nevertheless, let's use the model to make a prediction for the office
9194
pool.  Suppose your friend is 35 years old and white,
9195
her husband is 39, and they are expecting their third child:
9196

9197
\begin{verbatim}
9198
    columns = ['agepreg', 'hpagelb', 'birthord', 'race']
9199
    new = pandas.DataFrame([[35, 39, 3, 2]], columns=columns)
9200
    y = results.predict(new)
9201
\end{verbatim}
9202

9203
To invoke {\tt results.predict} for a new case, you have to construct
9204
a DataFrame with a column for each variable in the model.  The result
9205
in this case is 0.52, so you should guess ``boy.''  But if the model
9206
improves your chances of winning, the difference is very small.
9207
\index{DataFrame}
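
To see roughly what {\tt results.predict} does with a new case, we can
apply the model by hand: add up the log odds, then convert to a
probability.  This sketch uses the rounded parameters shown in
Section~\ref{implementation}, so its output may differ a little from
the value reported above; the short key names are introduced here for
illustration.

\begin{verbatim}
import math

# Rounded parameters, copied from the results above.
params = {
    'Intercept': -0.0301,
    'race2': -0.0224,      # C(race)[T.2]
    'race3': -0.000457,    # C(race)[T.3]
    'agepreg': -0.00267,
    'hpagelb': 0.0047,
    'birthord': 0.00501,
}

# The new case: mother 35, father 39, third child, race coded 2.
log_o = (params['Intercept'] + params['race2']
         + params['agepreg'] * 35
         + params['hpagelb'] * 39
         + params['birthord'] * 3)

o = math.exp(log_o)
p = o / (o + 1)
print(round(p, 3))
\end{verbatim}

The log odds are slightly positive, so the probability is slightly
more than 0.5 and the guess is ``boy.''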
9208

9209

9210

9211
\section{Exercises}
9212

9213
My solutions to these exercises are in \verb"chap11soln.ipynb".
9214

9215
\begin{exercise}
9216
Suppose one of your co-workers is expecting a baby and you are
9217
participating in an office pool to predict the date of birth.
9218
Assuming that bets are placed during the 30th week of pregnancy, what
9219
variables could you use to make the best prediction?  You should limit
9220
yourself to variables that are known before the birth, and likely to
9221
be available to the people in the pool.
9222
\index{betting pool}
9223
\index{date of birth}
9224

9225
\end{exercise}
9226

9227

9228
\begin{exercise}
9229
The Trivers-Willard hypothesis suggests that for many mammals the
9230
sex ratio depends on ``maternal condition''; that is,
9231
factors like the mother's age, size, health, and social status.
9232
See \url{https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis}
9233
\index{Trivers-Willard hypothesis}
9234
\index{sex ratio}
9235

9236
Some studies have shown this effect among humans, but results are
9237
mixed.  In this chapter we tested some variables related to these
9238
factors, but didn't find any with a statistically significant effect
9239
on sex ratio.
9240
  \index{significant} \index{statistically significant}
9241

9242
As an exercise, use a data mining approach to test the other variables
9243
in the pregnancy and respondent files.  Can you find any factors with
9244
a substantial effect?  
9245
\index{data mining}
9246

9247
\end{exercise}
9248

9249

9250
\begin{exercise}
9251
If the quantity you want to predict is a count, you can use Poisson
9252
regression, which is implemented in StatsModels with a function called
9253
{\tt poisson}.  It works the same way as {\tt ols} and {\tt logit}.
9254
As an exercise, let's use it to predict how many children a woman
9255
has born; in the NSFG dataset, this variable is called {\tt numbabes}.
9256
\index{StatsModels}
9257
\index{Poisson regression}
9258

9259
Suppose you meet a woman who is 35 years old, black, and a college
9260
graduate whose annual household income exceeds \$75,000.  How many
9261
children would you predict she has born?
9262
\end{exercise}
9263

9264

9265
\begin{exercise}
9266
If the quantity you want to predict is categorical, you can use
9267
multinomial logistic regression, which is implemented in StatsModels
9268
with a function called {\tt mnlogit}.  As an exercise, let's use it to
9269
guess whether a woman is married, cohabitating, widowed, divorced,
9270
separated, or never married; in the NSFG dataset, marital status is
9271
encoded in a variable called {\tt rmarital}.
9272
\index{categorical variable}
9273
\index{marital status}
9274

9275
Suppose you meet a woman who is 25 years old, white, and a high
9276
school graduate whose annual household income is about \$45,000.
9277
What is the probability that she is married, cohabitating, etc?
9278
\end{exercise}
9279

9280

9281

9282

\section{Glossary}

\begin{itemize}

\item regression: One of several related processes for estimating parameters
that fit a model to data.
\index{regression}

\item dependent variables: The variables in a regression model we would
like to predict.  Also known as endogenous variables.
\index{dependent variable}
\index{endogenous variable}

\item explanatory variables: The variables used to predict or explain
the dependent variables.  Also known as independent, or exogenous,
variables.
\index{explanatory variable}
\index{exogenous variable}

\item simple regression: A regression with only one dependent and
one explanatory variable.
\index{simple regression}

\item multiple regression: A regression with multiple explanatory
variables, but only one dependent variable.
\index{multiple regression}

\item linear regression: A regression based on a linear model.
\index{linear regression}

\item ordinary least squares: A linear regression that estimates
parameters by minimizing the sum of the squared residuals.
\index{ordinary least squares}

\item spurious relationship: A relationship between two variables that is
caused by a statistical artifact or a factor, not included in the
model, that is related to both variables.
\index{spurious relationship}

\item control variable: A variable included in a regression to
eliminate or ``control for'' a spurious relationship.
\index{control variable}

\item proxy variable: A variable that contributes information to
a regression model indirectly because of a relationship with another
factor, so it acts as a proxy for that factor.
\index{proxy variable}

\item categorical variable: A variable that can have one of a
discrete set of unordered values.
\index{categorical variable}

\item join: An operation that combines data from two DataFrames
using a key to match up rows in the two frames.
\index{join}
\index{DataFrame}

\item data mining: An approach to finding relationships between
variables by testing a large number of models.
\index{data mining}

\item logistic regression: A form of regression used when the
dependent variable is boolean.
\index{logistic regression}

\item Poisson regression: A form of regression used when the
dependent variable is a non-negative integer, usually a count.
\index{Poisson regression}

\item odds: An alternative way of representing a probability, $p$, as
  the ratio of the probability and its complement, $p / (1-p)$.
\index{odds}

\end{itemize}



\chapter{Time series analysis}

A {\bf time series} is a sequence of measurements from a system that
varies in time.  One famous example is the ``hockey stick graph'' that
shows global average temperature over time (see
\url{https://en.wikipedia.org/wiki/Hockey_stick_graph}).
\index{time series}
\index{hockey stick graph}

The example I work with in this chapter comes from Zachary M. Jones, a
researcher in political science who studies the black market for
cannabis in the U.S.  (\url{http://zmjones.com/marijuana}).  He
collected data from a web site called ``Price of Weed'' that
crowdsources market information by asking participants to report the
price, quantity, quality, and location of cannabis transactions
(\url{http://www.priceofweed.com/}).  The goal of his project is to
investigate the effect of policy decisions, like legalization, on
markets.  I find this project appealing because it is an example that
uses data to address important political questions, like drug policy.
\index{Price of Weed}
\index{cannabis}

I hope you will
find this chapter interesting, but I'll take this opportunity to
reiterate the importance of maintaining a professional attitude to
data analysis.  Whether and which drugs should be illegal are
important and difficult public policy questions; our decisions should
be informed by accurate data reported honestly.
\index{ethics}

The code for this chapter is in {\tt timeseries.py}.  For information
about downloading and working with this code, see Section~\ref{code}.


\section{Importing and cleaning}

The data I downloaded from
Mr. Jones's site is in the repository for this book.
The following code reads it into a
pandas DataFrame:
\index{pandas}
\index{DataFrame}

\begin{verbatim}
    transactions = pandas.read_csv('mj-clean.csv', parse_dates=[5])
\end{verbatim}

\verb"parse_dates" tells \verb"read_csv" to interpret values in column 5
(counting from 0) as dates and convert them to NumPy {\tt datetime64}
objects.
\index{NumPy}
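
As a quick check of how \verb"parse_dates" behaves, the following
sketch reads a tiny CSV from a string instead of a file.  The data and
column names are invented for illustration; only \verb"read_csv" and
\verb"parse_dates" come from the example above:

\begin{verbatim}
import io
import pandas

# A tiny, made-up CSV; the third column (position 2) holds dates.
text = """city,ppg,date
Boston,12.5,2014-01-05
Denver,9.0,2014-01-06
"""

df = pandas.read_csv(io.StringIO(text), parse_dates=[2])
print(df.dtypes)
\end{verbatim}

The {\tt date} column should come back with a {\tt datetime64} dtype,
while the other columns keep their usual types.
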

The DataFrame has a row for each reported transaction and
the following columns:

\begin{itemize}

\item city: string city name.

\item state: two-letter state abbreviation.

\item price: price paid in dollars.
\index{price}

\item amount: quantity purchased in grams.

\item quality: high, medium, or low quality, as reported by the purchaser.

\item date: date of report, presumed to be shortly after date of purchase.

\item ppg: price per gram, in dollars.

\item state.name: string state name.

\item lat: approximate latitude of the transaction, based on city name.

\item lon: approximate longitude of the transaction.

\end{itemize}

Each transaction is an event in time, so we could treat this dataset
as a time series.  But the events are not equally spaced in time; the
number of transactions reported each day varies from 0 to several
hundred.  Many methods used to analyze time series require the
measurements to be equally spaced, or at least things are simpler if
they are.
\index{transaction}
\index{equally spaced data}

In order to demonstrate these methods, I divide the dataset
into groups by reported quality, and then transform each group into
an equally spaced series by computing the mean daily price per gram.

\begin{verbatim}
def GroupByQualityAndDay(transactions):
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)

    return dailies
\end{verbatim}

{\tt groupby} is a DataFrame method that returns a GroupBy object,
{\tt groups}; used in a for loop, it iterates through the names of the
groups and the DataFrames that represent them.  Since the values of {\tt
  quality} are {\tt low}, {\tt medium}, and {\tt high}, we get three
groups with those names.  \index{DataFrame} \index{groupby}
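
Here is a minimal sketch of that iteration pattern, using a handful of
invented transactions rather than the real dataset.  The dictionary
{\tt sizes} is a hypothetical name introduced only for this example:

\begin{verbatim}
import pandas

transactions = pandas.DataFrame({
    'quality': ['high', 'low', 'high', 'medium', 'low'],
    'ppg': [13.0, 5.5, 12.0, 9.0, 6.5],
})

sizes = {}
for name, group in transactions.groupby('quality'):
    # name is the group key; group is a DataFrame of matching rows
    sizes[name] = len(group)

print(sizes)
\end{verbatim}

By default {\tt groupby} sorts the group keys, so the loop visits
{\tt high}, {\tt low}, and {\tt medium} in that order.
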

The loop iterates through the groups and calls {\tt GroupByDay},
which computes the daily average price and returns a new DataFrame:

\begin{verbatim}
def GroupByDay(transactions, func=np.mean):
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)

    daily['date'] = daily.index
    start = daily.date[0]
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year

    return daily
\end{verbatim}

The parameter, {\tt transactions}, is a DataFrame that contains
columns {\tt date} and {\tt ppg}.  We select these two
columns, then group by {\tt date}.
\index{groupby}

The result, {\tt grouped}, is a map from each date to a DataFrame that
contains prices reported on that date.  {\tt aggregate} is a
GroupBy method that iterates through the groups and applies a
function to each column of the group; in this case there is only one
column, {\tt ppg}.  So the result of {\tt aggregate} is a DataFrame
with one row for each date and one column, {\tt ppg}.
\index{aggregate}

Dates in these DataFrames are stored as NumPy {\tt datetime64}
objects, which are represented as 64-bit integers in nanoseconds.
For some of the analyses coming up, it will be convenient to
work with time in more human-friendly units, like years.  So
{\tt GroupByDay} adds a column named {\tt date} by copying
the {\tt index}, then adds {\tt years}, which contains the number
of years since the first transaction as a floating-point number.
\index{NumPy}
\index{datetime64}

The resulting DataFrame has columns {\tt ppg}, {\tt date}, and
{\tt years}.
\index{DataFrame}
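
Depending on the versions of NumPy and pandas you have installed,
dividing by a {\tt timedelta64} with units of years might raise an
error, because the length of a year is ambiguous.  If you run into
that, one workaround is to express the year as a fixed number of days.
The following sketch assumes a 365-day year and uses a few made-up
transactions; it is a variation on {\tt GroupByDay}, not the version
in {\tt timeseries.py}:

\begin{verbatim}
import numpy as np
import pandas

transactions = pandas.DataFrame({
    'date': pandas.to_datetime(
        ['2014-01-01', '2014-01-01', '2014-01-02', '2015-01-01']),
    'ppg': [10.0, 12.0, 9.0, 8.0],
})

# Mean price per gram for each date.
daily = transactions[['date', 'ppg']].groupby('date').mean()

daily['date'] = daily.index
start = daily['date'].iloc[0]

# Assumption: treat a year as exactly 365 days.
one_year = np.timedelta64(365, 'D')
daily['years'] = (daily['date'] - start) / one_year
print(daily)
\end{verbatim}

The two transactions on the first day are averaged into a single row,
and the last row, one calendar year later, gets {\tt years} equal
to 1.0 under the 365-day assumption.
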


\section{Plotting}

The result from {\tt GroupByQualityAndDay} is a map from each quality
to a DataFrame of daily prices.  Here's the code I use to plot
the three time series:
\index{DataFrame}
\index{visualization}

\begin{verbatim}
    thinkplot.PrePlot(rows=3)
    for i, (name, daily) in enumerate(dailies.items()):
        thinkplot.SubPlot(i+1)
        title = 'price per gram ($)' if i==0 else ''
        thinkplot.Config(ylim=[0, 20], title=title)
        thinkplot.Scatter(daily.index, daily.ppg, s=10, label=name)
        if i == 2:
            pyplot.xticks(rotation=30)
        else:
            thinkplot.Config(xticks=[])
\end{verbatim}

{\tt PrePlot} with {\tt rows=3} means that we are planning to
make three subplots laid out in three rows.  The loop iterates
through the DataFrames and creates a scatter plot for each.  It is
common to plot time series with line segments between the points,
but in this case there are many data points and prices are highly
variable, so adding lines would not help.
\index{thinkplot}

Since the labels on the x-axis are dates, I use {\tt pyplot.xticks}
to rotate the ``ticks'' 30 degrees, making them more readable.
\index{pyplot}
\index{ticks}
\index{xticks}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[width=3.5in]{figs/timeseries1.pdf}}
\caption{Time series of daily price per gram for high, medium, and low
quality cannabis.}
\label{timeseries1}
\end{figure}

Figure~\ref{timeseries1} shows the result.  One apparent feature in
these plots is a gap around November 2013.  It's possible that data
collection was not active during this time, or the data might not
be available.  We will consider ways to deal with this missing data
later.
\index{missing values}

Visually, it looks like the price of high quality cannabis is
declining during this period, and the price of medium quality is
increasing.  The price of low quality might also be increasing, but it
is harder to tell, since it seems to be more volatile.  Keep in mind
that quality data is reported by volunteers, so trends over time
might reflect changes in how participants apply these labels.
\index{price}


\section{Linear regression}
\label{timeregress}

Although there are methods specific to time series analysis, for many
problems a simple way to get started is by applying general-purpose
tools like linear regression.  The following function takes a
DataFrame of daily prices and computes a least squares fit, returning
the model and results objects from StatsModels:
\index{DataFrame}
\index{StatsModels}
\index{linear regression}

\begin{verbatim}
def RunLinearModel(daily):
    model = smf.ols('ppg ~ years', data=daily)
    results = model.fit()
    return model, results
\end{verbatim}

Then we can iterate through the qualities and fit a model to
each:

\begin{verbatim}
    for name, daily in dailies.items():
        model, results = RunLinearModel(daily)
        print(name)
        regression.SummarizeResults(results)
\end{verbatim}

Here are the results:

\begin{center}
\begin{tabular}{|l|l|l|c|} \hline
quality & intercept & slope & $R^2$ \\ \hline
high    & 13.450  & -0.708  & 0.444 \\
medium  &  8.879  & 0.283   & 0.050 \\
low     &  5.362  & 0.568   & 0.030 \\
\hline
\end{tabular}
\end{center}

The estimated slopes indicate that the price of high quality cannabis
dropped by about 71 cents per year during the observed interval; for
medium quality it increased by 28 cents per year, and for low quality
it increased by 57 cents per year.  These estimates are all
statistically significant with very small p-values.
\index{p-value}
  \index{significant} \index{statistically significant}

The $R^2$ value for high quality cannabis is 0.44, which means
that time as an explanatory variable accounts for 44\% of the observed
variability in price.  For the other qualities, the change in price
is smaller, and variability in prices is higher, so the values
of $R^2$ are smaller (but still statistically significant).
\index{explanatory variable}
  \index{significant} \index{statistically significant}
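
To see where a value of $R^2$ comes from, here is a sketch that fits a
line to synthetic prices and computes the fraction of variance the
line accounts for.  The parameters of the fake data are invented, and
the example uses {\tt np.polyfit} rather than StatsModels to keep it
short:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(17)

# Synthetic prices: a downward trend plus noise (made-up parameters).
years = np.linspace(0, 3.5, 200)
ppg = 13.4 - 0.7 * years + rng.normal(0, 0.8, len(years))

# Least squares line; polyfit returns the slope first.
slope, inter = np.polyfit(years, ppg, 1)
resid = ppg - (inter + slope * years)

# Fraction of the variance in ppg accounted for by the line.
r2 = 1 - resid.var() / ppg.var()
print(slope, r2)
\end{verbatim}

For a regression with one explanatory variable, this quantity equals
the square of the correlation between {\tt years} and {\tt ppg}.  A
steeper trend or less noise pushes it toward 1.
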

The following code plots the observed prices and the fitted values:

\begin{verbatim}
def PlotFittedValues(model, results, label=''):
    years = model.exog[:,1]
    values = model.endog
    thinkplot.Scatter(years, values, s=15, label=label)
    thinkplot.Plot(years, results.fittedvalues, label='model')
\end{verbatim}

As we saw in Section~\ref{implementation}, {\tt model} contains
{\tt exog} and {\tt endog}, NumPy arrays with the exogenous
(explanatory) and endogenous (dependent) variables.
\index{NumPy}
\index{explanatory variable}
\index{dependent variable}
\index{exogenous variable}
\index{endogenous variable}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries2.pdf}}
\caption{Time series of daily price per gram for high quality cannabis,
and a linear least squares fit.}
\label{timeseries2}
\end{figure}

{\tt PlotFittedValues} makes a scatter plot of the data points and a line
plot of the fitted values.  Figure~\ref{timeseries2} shows the results
for high quality cannabis.  The model seems like a good linear fit
for the data; nevertheless, linear regression is not the most
appropriate choice for this data:
\index{model}
\index{fitted values}

\begin{itemize}

\item First, there is no reason to expect the long-term trend to be a
  line or any other simple function.  In general, prices are
  determined by supply and demand, both of which vary over time in
  unpredictable ways.
\index{trend}

\item Second, the linear regression model gives equal weight to all
  data, recent and past.  For purposes of prediction, we should
  probably give more weight to recent data.
\index{weight}

\item Finally, one of the assumptions of linear regression is that the
  residuals are uncorrelated noise.  With time series data, this
  assumption is often false because successive values are correlated.
\index{residuals}

\end{itemize}

The next section presents an alternative that is more appropriate
for time series data.


\section{Moving averages}

Most time series analysis is based on the modeling assumption that the
observed series is the sum of three components:
\index{model}
\index{moving average}

\begin{itemize}

\item Trend: A smooth function that captures persistent changes.
\index{trend}

\item Seasonality: Periodic variation, possibly including daily,
weekly, monthly, or yearly cycles.
\index{seasonality}

\item Noise: Random variation around the long-term trend.
\index{noise}

\end{itemize}

Regression is one way to extract the trend from a series, as we
saw in the previous section.  But if the trend is not a simple
function, a good alternative is a {\bf moving average}.  A moving
average divides the series into overlapping regions, called {\bf windows},
and computes the average of the values in each window.
\index{window}

One of the simplest moving averages is the {\bf rolling mean}, which
computes the mean of the values in each window.  For example, if
the window size is 3, the rolling mean computes the mean of
values 0 through 2, 1 through 3, 2 through 4, etc.
\index{rolling mean}
\index{mean!rolling}

pandas provides \verb"rolling_mean", which takes a Series and a
window size and returns a new Series.
\index{pandas}
\index{Series}

\begin{verbatim}
>>> series = np.arange(10)
>>> series
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> pandas.rolling_mean(series, 3)
array([ nan,  nan,   1,   2,   3,   4,   5,   6,   7,   8])
\end{verbatim}

The first two values are {\tt nan}; the next value is the mean of
the first three elements, 0, 1, and 2.  The next value is the mean
of 1, 2, and 3.  And so on.
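
In later versions of pandas, the top-level function
\verb"rolling_mean" was removed in favor of a {\tt rolling} method on
Series and DataFrame.  Assuming you have one of those versions, the
same computation might look like this sketch:

\begin{verbatim}
import numpy as np
import pandas

series = pandas.Series(np.arange(10))

# rolling(3) defines the window; mean() aggregates each window.
roll = series.rolling(3).mean()
print(roll.values)
\end{verbatim}

As before, the first two values are {\tt nan} because the window is
not yet full, and the remaining values are the means of three
consecutive elements.
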

Before we can apply \verb"rolling_mean" to the cannabis data, we
have to deal with missing values.  There are a few days in the
observed interval with no reported transactions for one or more
quality categories, and a period in 2013 when data collection was
not active.
\index{missing values}

In the DataFrames we have used so far, these dates are absent;
the index skips days with no data.  For the analysis that follows,
we need to represent this missing data explicitly.  We can do
that by ``reindexing'' the DataFrame:
\index{DataFrame}
\index{reindex}

\begin{verbatim}
    dates = pandas.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)
\end{verbatim}

The first line computes a date range that includes every day from the
beginning to the end of the observed interval.  The second line
creates a new DataFrame with all of the data from {\tt daily}, but
including rows for all dates, filled with {\tt nan}.
\index{interval}
\index{date range}
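
The effect of reindexing is easy to see on a small example.  In this
sketch the three prices and their dates are made up; two days in the
middle of the interval have no data:

\begin{verbatim}
import pandas

index = pandas.to_datetime(['2014-01-01', '2014-01-02', '2014-01-05'])
daily = pandas.DataFrame({'ppg': [10.0, 11.0, 9.0]}, index=index)

dates = pandas.date_range(daily.index.min(), daily.index.max())
reindexed = daily.reindex(dates)
print(reindexed)
\end{verbatim}

The result has five rows, one for each day from January 1 to
January 5; the rows for January 3 and 4 contain {\tt nan}.
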

Now we can plot the rolling mean like this:

\begin{verbatim}
    roll_mean = pandas.rolling_mean(reindexed.ppg, 30)
    thinkplot.Plot(roll_mean.index, roll_mean)
\end{verbatim}

The window size is 30, so each value in \verb"roll_mean" is
the mean of 30 values from {\tt reindexed.ppg}.
\index{pandas}
\index{window}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries10.pdf}}
\caption{Daily price and a rolling mean (left) and exponentially-weighted
moving average (right).}
\label{timeseries10}
\end{figure}

Figure~\ref{timeseries10} (left)
shows the result.
The rolling mean seems to do a good job of smoothing out the noise and
extracting the trend.  The first 29 values are {\tt nan}, and wherever
there's a missing value, it's followed by another 29 {\tt nan}s.
There are ways to fill in these gaps, but they are a minor nuisance.
\index{missing values}
\index{noise}
\index{smoothing}

An alternative is the {\bf exponentially-weighted moving average} (EWMA),
which has two advantages.  First, as the name suggests, it computes
a weighted average where the most recent value has the highest weight
and the weights for previous values drop off exponentially.
Second, the pandas implementation of EWMA handles missing values
better.
\index{reindex}
\index{exponentially-weighted moving average}
\index{EWMA}

\begin{verbatim}
    ewma = pandas.ewma(reindexed.ppg, span=30)
    thinkplot.Plot(ewma.index, ewma)
\end{verbatim}

The {\bf span} parameter corresponds roughly to the window size of
a moving average; it controls how fast the weights drop off, so it
determines the number of points that make a non-negligible contribution
to each average.
\index{span}
\index{window}
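
In later versions of pandas, {\tt pandas.ewma} was replaced by an
{\tt ewm} method, and the pandas documentation defines the decay in
terms of $\alpha = 2 / (\mathrm{span} + 1)$.  The following sketch
assumes one of those versions and checks one value by hand; the data
and the name \verb"by_hand" are made up for the example:

\begin{verbatim}
import numpy as np
import pandas

xs = pandas.Series([1.0, 2.0, np.nan, 4.0, 5.0])
span = 3
ewma = xs.ewm(span=span).mean()

# With the default settings, the weight on a value k steps in the
# past is proportional to (1 - alpha)**k.
alpha = 2 / (span + 1)
weights = np.array([(1 - alpha)**1, (1 - alpha)**0])
by_hand = np.sum(weights * xs.iloc[:2].values) / np.sum(weights)

print(ewma.values)
print(by_hand)
\end{verbatim}

The second element of {\tt ewma} matches \verb"by_hand", and the
result has a value even at the position where the input is {\tt nan},
which is the behavior the rolling mean lacks.
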

Figure~\ref{timeseries10} (right) shows the EWMA for the same data.
It is similar to the rolling mean, where they are both defined,
but it has no missing values, which makes it easier to work with.  The
values are noisy at the beginning of the time series, because they are
based on fewer data points.
\index{missing values}


\section{Missing values}

Now that we have characterized the trend of the time series, the
next step is to investigate seasonality, which is periodic behavior.
Time series data based on human behavior often exhibits daily,
weekly, monthly, or yearly cycles.  In the next section I present
methods to test for seasonality, but they don't work well with
missing data, so we have to solve that problem first.
\index{missing values}
\index{seasonality}

A simple and common way to fill missing data is to use a moving
average.  The Series method {\tt fillna} does just what we want:
\index{Series}
\index{fillna}

\begin{verbatim}
    reindexed.ppg.fillna(ewma, inplace=True)
\end{verbatim}

Wherever {\tt reindexed.ppg} is {\tt nan}, {\tt fillna} replaces
it with the corresponding value from {\tt ewma}.  The {\tt inplace}
flag tells {\tt fillna} to modify the existing Series rather than
create a new one.

A drawback of this method is that it understates the noise in the
series.  We can solve that problem by adding in resampled
residuals:
\index{resampling}
\index{noise}

\begin{verbatim}
    resid = (reindexed.ppg - ewma).dropna()
    fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))
    reindexed.ppg.fillna(fake_data, inplace=True)
\end{verbatim}

% (One note on vocabulary: in this book I am using
%``resampling'' in the statistical sense, which is drawing a random
%sample from a population that is, itself, a sample.  In the context
%of time series analysis, it has another meaning: changing the
%time between measurements in a series.  I don't use the second
%meaning in this book, but you might encounter it.)

{\tt resid} contains the residual values, not including days
when {\tt ppg} is {\tt nan}.  \verb"fake_data" contains the
sum of the moving average and a random sample of residuals.
Finally, {\tt fillna} replaces {\tt nan} with values from
\verb"fake_data".
\index{dropna}
\index{fillna}
\index{NaN}
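
The same steps can be tried on synthetic data without the book's
modules.  In this sketch the prices are invented, {\tt
np.random.choice} with replacement is used as a stand-in for {\tt
thinkstats2.Resample}, and the EWMA is computed with the {\tt ewm}
method, which assumes a more recent version of pandas than the one
used above:

\begin{verbatim}
import numpy as np
import pandas

np.random.seed(17)

# Synthetic daily prices with some values knocked out (made-up data).
dates = pandas.date_range('2014-01-01', periods=200)
ppg = pandas.Series(10 + np.random.normal(0, 1, 200), index=dates)
ppg.iloc[50:60] = np.nan

ewma = ppg.ewm(span=30).mean()
resid = (ppg - ewma).dropna()

# Stand-in for thinkstats2.Resample: draw with replacement.
noise = np.random.choice(resid.values, len(ppg), replace=True)
fake_data = ewma + noise

# Assigning the result back avoids relying on inplace=True.
filled = ppg.fillna(fake_data)
print(filled.isna().sum())
\end{verbatim}

Only the ten missing days are replaced; the observed values are left
as they were.  Assigning the result of {\tt fillna} to a new name is a
design choice for this sketch, since modifying a column in place
through attribute access may not behave the same way in every version
of pandas.
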

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries8.pdf}}
\caption{Daily price with filled data.}
\label{timeseries8}
\end{figure}

Figure~\ref{timeseries8} shows the result.  The filled data is visually
similar to the actual values.  Since the resampled residuals are
random, the results are different every time; later we'll see how
to characterize the error created by missing values.
\index{resampling}
\index{missing values}


\section{Serial correlation}

As prices vary from day to day, you might expect to see patterns.
If the price is high on Monday,
you might expect it to be high for a few more days; and
if it's low, you might expect it to stay low.  A pattern
like this is called {\bf serial
correlation}, because each value is correlated with the next one
in the series.
\index{correlation!serial}
\index{serial correlation}

To compute serial correlation, we can shift the time series
by an interval called a {\bf lag}, and then compute the correlation
of the shifted series with the original:
\index{lag}

\begin{verbatim}
def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr
\end{verbatim}

After the shift, the first {\tt lag} values are {\tt nan}, so
I use a slice to remove them before computing {\tt Corr}.
\index{NaN}

%high 0.480121816154
%medium 0.164600078362
%low 0.103373620131

If we apply {\tt SerialCorr} to the raw price data with lag 1, we find
serial correlation 0.48 for the high quality category, 0.16 for
medium and 0.10 for low.  In any time series with a long-term trend,
we expect to see strong serial correlations; for example, if prices
are falling, we expect to see values above the mean in the first
half of the series and values below the mean in the second half.

It is more interesting to see if the correlation persists if you
subtract away the trend.  For example, we can compute the residual
of the EWMA and then compute its serial correlation:
\index{EWMA}

\begin{verbatim}
    ewma = pandas.ewma(reindexed.ppg, span=30)
    resid = reindexed.ppg - ewma
    corr = SerialCorr(resid, 1)
\end{verbatim}

With lag=1, the serial correlations for the de-trended data are
-0.022 for high quality, -0.015 for medium, and 0.036 for low.
These values are small, indicating that there is little or
no one-day serial correlation in this series.
\index{pandas}
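
The contrast between raw and de-trended serial correlation can be
reproduced with synthetic data.  In this sketch the series is a
made-up linear decline plus independent noise, \verb"serial_corr" is a
hypothetical variant of {\tt SerialCorr} that uses {\tt np.corrcoef}
in place of {\tt thinkstats2.Corr}, and the EWMA uses the {\tt ewm}
method from more recent versions of pandas:

\begin{verbatim}
import numpy as np
import pandas

def serial_corr(series, lag=1):
    """Correlation between a series and itself shifted by lag."""
    xs = series.iloc[lag:]
    ys = series.shift(lag).iloc[lag:]
    # np.corrcoef stands in for thinkstats2.Corr here
    return np.corrcoef(xs, ys)[0, 1]

rng = np.random.default_rng(17)
n = 1000

# Made-up series: a slow linear decline plus independent noise.
trend = np.linspace(12, 9, n)
prices = pandas.Series(trend + rng.normal(0, 0.5, n))

raw = serial_corr(prices, 1)

ewma = prices.ewm(span=30).mean()
resid = prices - ewma
detrended = serial_corr(resid, 1)

print(raw, detrended)
\end{verbatim}

Because the noise in this example is independent from day to day, the
serial correlation of the raw series comes entirely from the trend,
and it mostly disappears once the EWMA is subtracted.
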

To check for weekly, monthly, and yearly seasonality, I ran
the analysis again with different lags.  Here are the results:
\index{seasonality}

\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
lag & high & medium & low \\ \hline
1 & -0.029 & -0.014 & 0.034 \\
7 & 0.02 & -0.042 & -0.0097 \\
30 & 0.014 & -0.0064 & -0.013 \\
365 & 0.045 & 0.015 & 0.033 \\
\hline
\end{tabular}
\end{center}

In the next section we'll test whether these correlations are
statistically significant (they are not), but at this point we can
tentatively conclude that there are no substantial seasonal patterns
in these series, at least not with these lags.
  \index{significant} \index{statistically significant}


\section{Autocorrelation}

If you think a series might have some serial correlation, but you
don't know which lags to test, you can test them all!  The {\bf
  autocorrelation function} is a function that maps from lag to the
serial correlation with the given lag.  ``Autocorrelation'' is another
name for serial correlation, used more often when the lag is not 1.
\index{autocorrelation function}

StatsModels, which we used for linear regression in
Section~\ref{statsmodels}, also provides functions for time series
analysis, including {\tt acf}, which computes the autocorrelation
function:
\index{StatsModels}

\begin{verbatim}
    import statsmodels.tsa.stattools as smtsa
    acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
\end{verbatim}

{\tt acf} computes serial correlations with
lags from 0 through {\tt nlags}.  The {\tt unbiased} flag tells
{\tt acf} to correct the estimates for the sample size.  The result
is an array of correlations.  If we select daily prices for high
quality, and extract correlations for lags 1, 7, 30, and 365, we can
confirm that {\tt acf} and {\tt SerialCorr} yield approximately
the same results:
\index{acf}

\begin{verbatim}
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
\end{verbatim}

With {\tt lag=0}, {\tt acf} computes the correlation of the series
with itself, which is always 1.
\index{lag}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries9.pdf}}
\caption{Autocorrelation function for daily prices (left), and
daily prices with a simulated weekly seasonality (right).}
\label{timeseries9}
\end{figure}

Figure~\ref{timeseries9} (left) shows autocorrelation functions for
the three quality categories, with {\tt nlags=40}.  The gray region
shows the normal variability we would expect if there is no actual
autocorrelation; anything that falls outside this range is
statistically significant, with a p-value less than 5\%.  Since
the false positive rate is 5\%, and
we are computing 120 correlations (40 lags for each of 3 time series),
we expect to see about 6 points outside this region.  In fact, there
are 7.  We conclude that there are no autocorrelations
in these series that could not be explained by chance.
\index{p-value}
  \index{significant} \index{statistically significant}
\index{false positive}

I computed the gray regions by resampling the residuals.  You
can see my code in {\tt timeseries.py}; the function is called
{\tt SimulateAutocorrelation}.
\index{resampling}
10037
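
The idea behind such a region can be sketched in plain Python.  This
is not the code of {\tt SimulateAutocorrelation}; the names
{\tt autocorr} and \verb"simulate_acf_band", the default parameters,
and the simple rank-based percentiles are all assumptions made for
illustration.  Drawing residuals with replacement destroys any serial
structure, so the spread of the resulting correlations shows what
chance alone produces.

\begin{verbatim}
import random

def autocorr(xs, lag):
    """Sample autocorrelation of xs at one lag."""
    m = sum(xs) / len(xs)
    dev = [x - m for x in xs]
    denom = sum(d * d for d in dev)
    num = sum(a * b for a, b in zip(dev[lag:], dev[:-lag]))
    return num / denom

def simulate_acf_band(resid, nlags=40, iters=201,
                      percent=95, seed=None):
    """Percentile band per lag if there is no autocorrelation."""
    rng = random.Random(seed)
    rows = []
    for _ in range(iters):
        fake = rng.choices(resid, k=len(resid))
        rows.append([autocorr(fake, lag)
                     for lag in range(1, nlags + 1)])
    tail = (100 - percent) / 200       # fraction in each tail
    lo_i = round(tail * (iters - 1))
    hi_i = round((1 - tail) * (iters - 1))
    low, high = [], []
    for col in zip(*rows):             # one column per lag
        col = sorted(col)
        low.append(col[lo_i])
        high.append(col[hi_i])
    return low, high
\end{verbatim}

An observed correlation that falls outside {\tt [low, high]} at its
lag would be flagged as significant at roughly the 5\% level.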

To see what the autocorrelation function looks like when there is a
seasonal component, I generated simulated data by adding a weekly
cycle.  Assuming that demand for cannabis is higher on weekends, we
might expect the price to be higher on those days.  To simulate this
effect, I select dates that fall on Friday or Saturday and add a random
amount to the price, chosen from a uniform distribution from \$0 to \$2.
\index{simulation}
\index{uniform distribution}
\index{distribution!uniform}

\begin{verbatim}
def AddWeeklySeasonality(daily):
    frisat = (daily.index.dayofweek==4) | (daily.index.dayofweek==5)
    fake = daily.copy()
    # use .loc so the update is applied to fake itself
    fake.loc[frisat, 'ppg'] += np.random.uniform(0, 2, frisat.sum())
    return fake
\end{verbatim}

{\tt frisat} is a boolean Series, {\tt True} if the day of the
week is Friday or Saturday.  {\tt fake} is a new DataFrame, initially
a copy of {\tt daily}, which we modify by adding random values
to {\tt ppg}.  {\tt frisat.sum()} is the total number of Fridays
and Saturdays, which is the number of random values we have to
generate.
\index{DataFrame}
\index{Series}
\index{boolean}
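
The day-of-week codes are easy to get wrong, so here is a small sketch
of the same idea using only the standard library.  It assumes prices
are stored in a dictionary keyed by {\tt datetime.date}; the function
\verb"add_weekly_seasonality" and the toy data are hypothetical.
Like {\tt dayofweek} in pandas, {\tt date.weekday()} counts from
Monday as 0, so 4 and 5 are Friday and Saturday.

\begin{verbatim}
import random
from datetime import date, timedelta

def add_weekly_seasonality(prices, low=0.0, high=2.0, seed=None):
    """Return a copy of prices with a bump on Fridays and Saturdays."""
    rng = random.Random(seed)
    fake = dict(prices)
    for day in fake:
        if day.weekday() in (4, 5):    # Friday or Saturday
            fake[day] += rng.uniform(low, high)
    return fake

start = date(2014, 1, 6)               # a Monday
prices = {start + timedelta(days=i): 10.0 for i in range(14)}
fake = add_weekly_seasonality(prices, seed=1)
changed = [d for d in prices if fake[d] != prices[d]]
\end{verbatim}

In two weeks of data, {\tt changed} should contain two Fridays and two
Saturdays; all other prices are untouched.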

Figure~\ref{timeseries9} (right) shows autocorrelation functions for
prices with this simulated seasonality.  As expected, the
correlations are highest when the lag is a multiple of 7.  For
high and medium quality, the new correlations are statistically
significant.  For low quality they are not, because residuals in this
category are large; the effect would have to be bigger
to be visible through the noise.
  \index{significant} \index{statistically significant}
\index{residuals}
\index{lag}

\section{Prediction}

Time series analysis can be used to investigate, and sometimes
explain, the behavior of systems that vary in time.  It can also
make predictions.
\index{prediction}

The linear regressions we used in Section~\ref{timeregress} can be
used for prediction.  The RegressionResults class provides {\tt
  predict}, which takes a DataFrame containing the explanatory
variables and returns a sequence of predictions.  Here's the code:
\index{explanatory variable}
\index{linear regression}

\begin{verbatim}
def GenerateSimplePrediction(results, years):
    n = len(years)
    inter = np.ones(n)
    d = dict(Intercept=inter, years=years)
    predict_df = pandas.DataFrame(d)
    predict = results.predict(predict_df)
    return predict
\end{verbatim}

{\tt results} is a RegressionResults object; {\tt years} is the
sequence of time values we want predictions for.  The function
constructs a DataFrame, passes it to {\tt predict}, and
returns the result.
\index{pandas}
\index{DataFrame}

If all we want is a single, best-guess prediction, we're done.  But
for most purposes it is important to quantify error.  In other words,
we want to know how accurate the prediction is likely to be.

There are three sources of error we should take into account:

\begin{itemize}

\item Sampling error: The prediction is based on estimated
parameters, which depend on random variation
in the sample.  If we run the experiment again, we expect
the estimates to vary.
\index{sampling error}
\index{parameter}

\item Random variation:  Even if the estimated parameters are
perfect, the observed data varies randomly around the long-term
trend, and we expect this variation to continue in the future.
\index{noise}

\item Modeling error: We have already seen evidence that the long-term
trend is not linear, so predictions based on a linear model will
eventually fail.
\index{modeling error}

\end{itemize}

Another source of error to consider is unexpected future events.
Agricultural prices are affected by weather, and all prices are
affected by politics and law.  As I write this, cannabis is legal in
two states and legal for medical purposes in 20 more.  If more states
legalize it, the price is likely to go down.  But if
the federal government cracks down, the price might go up.

Modeling errors and unexpected future events are hard to quantify.
Sampling error and random variation are easier to deal with, so we'll
do that first.

To quantify sampling error, I use resampling, as we did in
Section~\ref{regest}.  As always, the goal is to use the actual
observations to simulate what would happen if we ran the experiment
again.  The simulations are based on the assumption that the estimated
parameters are correct, but the random residuals could have been
different.  Here is a function that runs the simulations:
\index{resampling}

\begin{verbatim}
def SimulateResults(daily, iters=101):
    model, results = RunLinearModel(daily)
    fake = daily.copy()

    result_seq = []
    for i in range(iters):
        fake.ppg = results.fittedvalues + thinkstats2.Resample(results.resid)
        _, fake_results = RunLinearModel(fake)
        result_seq.append(fake_results)

    return result_seq
\end{verbatim}

{\tt daily} is a DataFrame containing the observed prices;
{\tt iters} is the number of simulations to run.
\index{DataFrame}
\index{price}

{\tt SimulateResults} uses {\tt RunLinearModel}, from
Section~\ref{timeregress}, to estimate the slope and intercept
of the observed values.

Each time through the loop, it generates a ``fake'' dataset by
resampling the residuals and adding them to the fitted values.  Then
it runs a linear model on the fake data and stores the RegressionResults
object.
\index{model}
\index{residuals}
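
The same loop can be sketched without pandas or StatsModels.  In the
following, \verb"fit_line" and \verb"simulate_fits" are hypothetical
stand-ins for {\tt RunLinearModel} and {\tt SimulateResults}, and the
data are simulated; the sketch only shows the resampling logic, not
the author's implementation.

\begin{verbatim}
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate_fits(xs, ys, iters=101, seed=None):
    """Refit the line to fake data built from resampled residuals."""
    rng = random.Random(seed)
    inter, slope = fit_line(xs, ys)
    fitted = [inter + slope * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    fits = []
    for _ in range(iters):
        # draw residuals with replacement, add to fitted values
        drawn = rng.choices(resid, k=len(resid))
        fake_ys = [f + r for f, r in zip(fitted, drawn)]
        fits.append(fit_line(xs, fake_ys))
    return fits

rng = random.Random(3)
xs = [i / 50 for i in range(100)]
ys = [10 + 0.5 * x + rng.gauss(0, 1) for x in xs]
fits = simulate_fits(xs, ys, iters=101, seed=4)
slopes = sorted(slope for _, slope in fits)
\end{verbatim}

The spread of {\tt slopes} gives a sense of how much the estimated
slope would vary from one run of the experiment to the next.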

The next step is to use the simulated results to generate predictions:

\begin{verbatim}
def GeneratePredictions(result_seq, years, add_resid=False):
    n = len(years)
    d = dict(Intercept=np.ones(n), years=years, years2=years**2)
    predict_df = pandas.DataFrame(d)

    predict_seq = []
    for fake_results in result_seq:
        predict = fake_results.predict(predict_df)
        if add_resid:
            predict += thinkstats2.Resample(fake_results.resid, n)
        predict_seq.append(predict)

    return predict_seq
\end{verbatim}

{\tt GeneratePredictions} takes the sequence of results from the
previous step, as well as {\tt years}, which is a sequence of
floats that specifies the interval to generate predictions for,
and \verb"add_resid", which indicates whether it should add resampled
residuals to the straight-line prediction.
{\tt GeneratePredictions} iterates through the sequence of
RegressionResults and generates a sequence of predictions.
\index{resampling}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries4.pdf}}
\caption{Predictions based on linear fits, showing variation due
to sampling error and prediction error.}
\label{timeseries4}
\end{figure}

Finally, here's the code that plots a 90\% confidence interval for
the predictions:
\index{confidence interval}

\begin{verbatim}
def PlotPredictions(daily, years, iters=101, percent=90):
    result_seq = SimulateResults(daily, iters=iters)
    p = (100 - percent) / 2
    percents = p, 100-p

    predict_seq = GeneratePredictions(result_seq, years, True)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.3, color='gray')

    predict_seq = GeneratePredictions(result_seq, years, False)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.5, color='gray')
\end{verbatim}

{\tt PlotPredictions} calls {\tt GeneratePredictions} twice: once
with \verb"add_resid=True" and again with \verb"add_resid=False".
It uses {\tt PercentileRows} to select the 5th and 95th percentiles
for each year, then plots a gray region between these bounds.
\index{FillBetween}
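
To show what selecting percentiles ``for each year'' means, here is a
minimal sketch.  The function \verb"percentile_rows" is a hypothetical
stand-in, not the implementation in {\tt thinkstats2}; it assumes each
row is one simulated prediction sequence and each column is one year,
and it uses a simple rank without interpolation.

\begin{verbatim}
def percentile_rows(rows, percents):
    """For each column, pick the value at each percentile rank."""
    columns = [sorted(col) for col in zip(*rows)]
    n = len(rows)
    result = []
    for p in percents:
        index = int(p * (n - 1) // 100)    # rank within a column
        result.append([col[index] for col in columns])
    return result

# 101 made-up prediction sequences for 2 years, in reverse order
rows = [[float(i), 2.0 * i] for i in range(100, -1, -1)]
low, high = percentile_rows(rows, [5, 95])
\end{verbatim}

Here {\tt low} is {\tt [5.0, 10.0]} and {\tt high} is
{\tt [95.0, 190.0]}: the columns are sorted independently, so the
bounds for each year come from the simulations that are extreme in
that year.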

Figure~\ref{timeseries4} shows the result.
The dark gray region represents a 90\% confidence interval for
the sampling error; that is, uncertainty about the estimated
slope and intercept due to sampling.
\index{sampling error}

The lighter region shows
a 90\% confidence interval for prediction error, which is the
sum of sampling error and random variation.
\index{noise}

These regions quantify sampling error and random variation, but
not modeling error.  In general, modeling error is hard to quantify,
but in this case we can address at least one source of error,
unpredictable external events.
\index{modeling error}

The regression model is based on the assumption that the system
is {\bf stationary}; that is, that the parameters of the model
don't change over time.
Specifically, it assumes that the slope and
intercept are constant, as well as the distribution of residuals.
\index{stationary model}
\index{parameter}

But looking at the moving averages in Figure~\ref{timeseries10}, it
seems like the slope changes at least once during the observed
interval, and the variance of the residuals seems bigger in the first
half than the second.
\index{slope}

As a result, the parameters we get depend on the interval we
observe.  To see how much effect this has on the predictions,
we can extend {\tt SimulateResults} to use intervals of observation
with different start and end dates.  My implementation is in
{\tt timeseries.py}.
\index{simulation}

\begin{figure}
% timeseries.py
\centerline{\includegraphics[height=2.5in]{figs/timeseries5.pdf}}
\caption{Predictions based on linear fits, showing
variation due to the interval of observation.}
\label{timeseries5}
\end{figure}

Figure~\ref{timeseries5} shows the result for the medium quality
category.  The lightest gray area shows a confidence interval that
includes uncertainty due to sampling error, random variation, and
variation in the interval of observation.
\index{confidence interval}
\index{interval}

The model based on the entire interval has positive slope, indicating
that prices were increasing.  But the most recent interval shows signs
of decreasing prices, so models based on the most recent data have
negative slope.  As a result, the widest predictive interval includes
the possibility of decreasing prices over the next year.
\index{model}

\section{Further reading}

Time series analysis is a big topic; this chapter has only scratched
the surface.  An important tool for working with time series data
is autoregression, which I did not cover here, mostly because it turns
out not to be useful for the example data I worked with.
\index{time series}

But once you
have learned the material in this chapter, you are well prepared
to learn about autoregression.  One resource I recommend is
Philipp Janert's book, {\it Data Analysis with Open Source Tools},
O'Reilly Media, 2011.  His chapter on time series analysis picks up
where this one leaves off.
\index{Janert, Philipp}

\section{Exercises}

My solutions to these exercises are in \verb"chap12soln.py".

\begin{exercise}
The linear model I used in this chapter has the obvious drawback
that it is linear, and there is no reason to expect prices to
change linearly over time.
We can add flexibility to the model by adding a quadratic term,
as we did in Section~\ref{nonlinear}.
\index{nonlinear}
\index{linear model}
\index{quadratic model}

Use a quadratic model to fit the time series of daily prices,
and use the model to generate predictions.  You will have to
write a version of {\tt RunLinearModel} that runs a quadratic
model, but after that you should be able to reuse code in
{\tt timeseries.py} to generate predictions.
\index{prediction}

\end{exercise}

\begin{exercise}
Write a definition for a class named {\tt SerialCorrelationTest}
that extends {\tt HypothesisTest} from Section~\ref{hypotest}.
It should take a series and a lag as data, compute the serial
correlation of the series with the given lag, and then compute
the p-value of the observed correlation.
\index{HypothesisTest}
\index{p-value}
\index{lag}

Use this class to test whether the serial correlation in raw
price data is statistically significant.  Also test the residuals
of the linear model and (if you did the previous exercise)
the quadratic model.
\index{quadratic model}
  \index{significant} \index{statistically significant}

\end{exercise}

\begin{exercise}
There are several ways to extend the EWMA model to generate predictions.
One of the simplest is something like this:
\index{EWMA}

\begin{enumerate}

\item Compute the EWMA of the time series and use the last point
as an intercept, {\tt inter}.

\item Compute the EWMA of differences between successive elements in
the time series and use the last point as a slope, {\tt slope}.
\index{slope}

\item To predict values at future times, compute {\tt inter + slope * dt},
where {\tt dt} is the difference between the time of the prediction and
the time of the last observation.
\index{prediction}

\end{enumerate}

Use this method to generate predictions for a year after the last
observation.  A few hints:

\begin{itemize}

\item Use {\tt timeseries.FillMissing} to fill in missing values
before running this analysis.  That way the time between consecutive
elements is consistent.
\index{missing values}

\item Use {\tt Series.diff} to compute differences between successive
elements.
\index{Series}

\item Use {\tt reindex} to extend the DataFrame index into the future.
\index{reindex}

\item Use {\tt fillna} to put your predicted values into the DataFrame.
\index{fillna}

\end{itemize}

\end{exercise}

\section{Glossary}

\begin{itemize}

\item time series: A dataset where each value is associated with
a timestamp, often a series of measurements and the times they
were collected.
\index{time series}

\item window: A sequence of consecutive values in a time series,
often used to compute a moving average.
\index{window}

\item moving average: One of several statistics intended to estimate
the underlying trend in a time series by computing averages (of
some kind) for a series of overlapping windows.
\index{moving average}

\item rolling mean: A moving average based on the mean value in
each window.
\index{rolling mean}

\item exponentially-weighted moving average (EWMA): A moving
average based on a weighted mean that gives the highest weight
to the most recent values, and exponentially decreasing weights
to earlier values. \index{exponentially-weighted moving average} \index{EWMA}

\item span: A parameter of EWMA that determines how quickly the
weights decrease.
\index{span}

\item serial correlation: Correlation between a time series and
a shifted or lagged version of itself.
\index{serial correlation}

\item lag: The size of the shift in a serial correlation or
autocorrelation.
\index{lag}

\item autocorrelation: A more general term for a serial correlation
with any amount of lag.
\index{autocorrelation}

\item autocorrelation function: A function that maps from lag to
serial correlation.
\index{autocorrelation function}

\item stationary: A model is stationary if the parameters and the
distribution of residuals do not change over time.
\index{model}
\index{stationary model}

\end{itemize}

\chapter{Survival analysis}

{\bf Survival analysis} is a way to describe how long things last.
It is often used to study human lifetimes, but it
also applies to ``survival'' of mechanical and electronic components, or
more generally to intervals in time before an event.
\index{survival analysis}
\index{mechanical component}
\index{electrical component}

If someone you know has been diagnosed with a life-threatening
disease, you might have seen a ``5-year survival rate,'' which
is the probability of surviving five years after diagnosis.  That
estimate and related statistics are the result of survival analysis.
\index{survival rate}

The code in this chapter is in {\tt survival.py}.  For information
about downloading and working with this code, see Section~\ref{code}.

\section{Survival curves}
\label{survival}

The fundamental concept in survival analysis is the {\bf survival
  curve}, $S(t)$, which is a function that maps from a duration, $t$, to the
probability of surviving longer than $t$.  If you know the distribution
of durations, or ``lifetimes'', finding the survival curve is easy;
it's just the complement of the CDF: \index{survival curve}
%
\[ S(t) = 1 - \CDF(t) \]
%
where $\CDF(t)$ is the probability of a lifetime less than or equal
to $t$.
\index{complementary CDF} \index{CDF!complementary} \index{CCDF}
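
As a small illustration of this definition, the following sketch
computes a survival curve directly from a list of lifetimes, without
going through a Cdf object.  The function \verb"survival_curve" and
the five made-up lifetimes are assumptions for the example.

\begin{verbatim}
def survival_curve(lifetimes):
    """Map each observed duration t to the fraction lasting longer."""
    n = len(lifetimes)
    ts = sorted(set(lifetimes))
    return {t: sum(1 for x in lifetimes if x > t) / n for t in ts}

lifetimes = [1, 2, 2, 3, 5]
sf = survival_curve(lifetimes)
\end{verbatim}

Four of the five lifetimes are longer than 1, so {\tt sf[1]} is 0.8;
none is longer than 5, so {\tt sf[5]} is 0.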

For example, in the NSFG dataset, we know the duration of 11189
complete pregnancies.  We can read this data and compute the CDF:
\index{pregnancy length}

\begin{verbatim}
    preg = nsfg.ReadFemPreg()
    complete = preg.query('outcome in [1, 3, 4]').prglngth
    cdf = thinkstats2.Cdf(complete, label='cdf')
\end{verbatim}

The outcome codes {\tt 1, 3, 4} indicate live birth, stillbirth,
and miscarriage.  For this analysis I am excluding induced abortions,
ectopic pregnancies, and pregnancies that were in progress when
the respondent was interviewed.

The DataFrame method {\tt query} takes a boolean expression and
evaluates it for each row, selecting the rows that yield True.
\index{DataFrame}
\index{boolean}
\index{query}

\begin{figure}
% survival.py
\centerline{\includegraphics[height=3.0in]{figs/survival1.pdf}}
\caption{Cdf and survival curve for pregnancy length (top),
hazard curve (bottom).}
\label{survival1}
\end{figure}

Figure~\ref{survival1} (top) shows the CDF of pregnancy length
and its complement, the survival curve.  To represent the
survival curve, I define an object that wraps a Cdf and
adapts the interface:
\index{Cdf}
\index{pregnancy length}
\index{SurvivalFunction}

\begin{verbatim}
class SurvivalFunction(object):
    def __init__(self, cdf, label=''):
        self.cdf = cdf
        self.label = label or cdf.label

    @property
    def ts(self):
        return self.cdf.xs

    @property
    def ss(self):
        return 1 - self.cdf.ps
\end{verbatim}

{\tt SurvivalFunction} provides two properties: {\tt ts}, which
is the sequence of lifetimes, and {\tt ss}, which is the survival
curve.  In Python, a ``property'' is a method that can be
invoked as if it were a variable.
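
If properties are new to you, this stripped-down sketch may help.
{\tt TinySurvival} is a hypothetical class, not part of
{\tt survival.py}; it stores plain lists rather than a Cdf.

\begin{verbatim}
class TinySurvival:
    """A stripped-down stand-in for SurvivalFunction."""

    def __init__(self, ts, ps):
        self.ts = ts        # durations
        self.ps = ps        # cumulative probabilities, CDF(t)

    @property
    def ss(self):
        # recomputed on each access, but read like an attribute
        return [1 - p for p in self.ps]

sf = TinySurvival([1, 2, 3], [0.25, 0.5, 1.0])
ss = sf.ss                  # no parentheses
\end{verbatim}

The caller writes {\tt sf.ss}, not {\tt sf.ss()}, so the class can
compute the values on demand without changing its interface.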

We can instantiate a {\tt SurvivalFunction} by passing
the CDF of lifetimes:
\index{property}

\begin{verbatim}
    sf = SurvivalFunction(cdf)
\end{verbatim}

{\tt SurvivalFunction} also provides \verb"__getitem__" and
{\tt Prob}, which evaluate the survival curve:

\begin{verbatim}
# class SurvivalFunction

    def __getitem__(self, t):
        return self.Prob(t)

    def Prob(self, t):
        return 1 - self.cdf.Prob(t)
\end{verbatim}

For example, {\tt sf[13]} is the fraction of pregnancies that
proceed past the first trimester:
\index{trimester}

\begin{verbatim}
>>> sf[13]
0.86022
>>> cdf[13]
0.13978
\end{verbatim}

About 86\% of pregnancies proceed past the first trimester;
about 14\% do not.

{\tt SurvivalFunction} provides {\tt Render}, so we can
plot {\tt sf} using the functions in {\tt thinkplot}:
\index{thinkplot}

\begin{verbatim}
    thinkplot.Plot(sf)
\end{verbatim}

Figure~\ref{survival1} (top) shows the result.  The curve is nearly
flat between 13 and 26 weeks, which shows that few pregnancies
end in the second trimester.  And the curve is steepest around 39
weeks, which is the most common pregnancy length.
\index{pregnancy length}

\section{Hazard function}
\label{hazard}

From the survival curve we can derive the {\bf hazard function};
for pregnancy lengths, the hazard function maps from a time, $t$, to
the fraction of pregnancies that continue until $t$ and then end at
$t$.  To be more precise:
%
\[ \lambda(t) = \frac{S(t) - S(t+1)}{S(t)} \]
%
The numerator is the fraction of lifetimes that end at $t$, which
is also $\PMF(t)$.
\index{hazard function}
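
Here is the formula applied to a made-up survival curve on the integer
times 0 through 3.  The function {\tt hazard} and the values in
{\tt ss} are assumptions for illustration; {\tt Fraction} is used only
to keep the arithmetic exact.

\begin{verbatim}
from fractions import Fraction

def hazard(ss):
    """ss[t] is S(t) for t = 0, 1, 2, ...; returns lambda(t)
    for every t except the last."""
    return [(ss[t] - ss[t + 1]) / ss[t]
            for t in range(len(ss) - 1)]

ss = [Fraction(1), Fraction(4, 5), Fraction(2, 5), Fraction(0)]
lams = hazard(ss)
\end{verbatim}

The result is $1/5$, $1/2$, and $1$: each value is the share of the
cases still ongoing at $t$ that end before the next time step, so the
hazard can rise even while the survival curve itself is falling.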

{\tt SurvivalFunction} provides {\tt MakeHazard}, which calculates
the hazard function:

\begin{verbatim}
# class SurvivalFunction

    def MakeHazard(self, label=''):
        ss = self.ss
        lams = {}
        for i, t in enumerate(self.ts[:-1]):
            hazard = (ss[i] - ss[i+1]) / ss[i]
            lams[t] = hazard

        return HazardFunction(lams, label=label)
\end{verbatim}

The {\tt HazardFunction} object is a wrapper for a pandas
Series:
\index{pandas}
\index{Series}
\index{wrapper}

\begin{verbatim}
class HazardFunction(object):

    def __init__(self, d, label=''):
        self.series = pandas.Series(d)
        self.label = label
\end{verbatim}

{\tt d} can be a dictionary or any other type that can initialize
a Series, including another Series.  {\tt label} is a string used
to identify the HazardFunction when plotted.
\index{HazardFunction}

{\tt HazardFunction} provides \verb"__getitem__", so we can evaluate
it like this:

\begin{verbatim}
>>> hf = sf.MakeHazard()
>>> hf[39]
0.49689
\end{verbatim}

So of all pregnancies that proceed until week 39, about
50\% end in week 39.

Figure~\ref{survival1} (bottom) shows the hazard function for
pregnancy lengths.  For times after week 42, the hazard function
is erratic because it is based on a small number of cases.
Other than that, the shape of the curve is as expected: it is
highest around 39 weeks, and a little higher in the first
trimester than in the second.
\index{pregnancy length}

The hazard function is useful in its own right, but it is also an
important tool for estimating survival curves, as we'll see in the
next section.

\section{Inferring survival curves}

If someone gives you the CDF of lifetimes, it is easy to compute the
survival and hazard functions.  But in many real-world
scenarios, we can't measure the distribution of lifetimes directly.
We have to infer it.
\index{survival curve}
\index{CDF}

For example, suppose you are following a group of patients to see how
long they survive after diagnosis.  Not all patients are diagnosed on
the same day, so at any point in time, some patients have survived
longer than others.  If some patients have died, we know their
survival times.  For patients who are still alive, we don't know
survival times, but we have a lower bound.
\index{diagnosis}

If we wait until all patients are dead, we can compute the survival
curve, but if we are evaluating the effectiveness of a new treatment,
we can't wait that long!  We need a way to estimate survival curves
using incomplete information.
\index{incomplete information}

As a more cheerful example, I will use NSFG data to quantify how
long respondents ``survive'' until they get married for the
first time.  The range of respondents' ages is 14 to 44 years, so
the dataset provides a snapshot of women at different stages in their
lives.
\index{marital status}

For women who have been married, the dataset includes the date
of their first marriage and their age at the time.
For women who have not been married, we know their age when interviewed,
but have no way of knowing when or if they will get married.
\index{age}

Since we know the age at first marriage for {\em some\/} women, it
might be tempting to exclude the rest and compute the CDF of
the known data.  That is a bad idea.  The result would
be doubly misleading: (1) older women would be overrepresented,
because they are more likely to be married when interviewed,
and (2) married women would be overrepresented!  In fact, this
analysis would lead to the conclusion that all women get married,
which is obviously incorrect.

\section{Kaplan-Meier estimation}

In this example it is not only desirable but necessary to include
observations of unmarried women, which brings us to one of the central
algorithms in survival analysis, {\bf Kaplan-Meier estimation}.
\index{Kaplan-Meier estimation}

The general idea is that we can use the data to estimate the hazard
function, then convert the hazard function to a survival curve.
To estimate the hazard function, we consider, for each age,
(1) the number of women who got married at that age and (2) the number
of women ``at risk'' of getting married, which includes all women
who were not married at an earlier age.
\index{hazard function}
\index{at risk}

Here's the code:

\begin{verbatim}
def EstimateHazardFunction(complete, ongoing, label=''):

    hist_complete = Counter(complete)
    hist_ongoing = Counter(ongoing)

    ts = list(hist_complete | hist_ongoing)
    ts.sort()

    at_risk = len(complete) + len(ongoing)

    # dtype=float keeps the empty Series numeric in recent pandas
    lams = pandas.Series(index=ts, dtype=float)
    for t in ts:
        ended = hist_complete[t]
        censored = hist_ongoing[t]

        lams[t] = ended / at_risk
        at_risk -= ended + censored

    return HazardFunction(lams, label=label)
\end{verbatim}

{\tt complete} is the set of complete observations; in this case,
the ages when respondents got married.  {\tt ongoing} is the set
of incomplete observations; that is, the ages of unmarried women
when they were interviewed.

First, we precompute \verb"hist_complete", which is a Counter
that maps from each age to the number of women married at that
age, and \verb"hist_ongoing", which maps from each age to the
number of unmarried women interviewed at that age.

\index{Counter}
\index{survival curve}

{\tt ts} is the union of ages when respondents got married
and ages when unmarried women were interviewed, sorted in
increasing order.

\verb"at_risk" keeps track of the number of respondents considered
``at risk'' at each age; initially, it is the total number of
respondents.

The result is stored in a pandas {\tt Series} that maps from
each age to the estimated hazard function at that age.

Each time through the loop, we consider one age, {\tt t},
and compute the number of events that end at {\tt t} (that is,
the number of respondents married at that age) and the number
of events censored at {\tt t} (that is, the number of women
interviewed at {\tt t} whose future marriage dates are
censored).  In this context, ``censored'' means that the
data are unavailable because of the data collection process.

The estimated hazard function is the fraction of the cases
at risk that end at {\tt t}.

At the end of the loop, we subtract from \verb"at_risk" the
number of cases that ended or were censored at {\tt t}.

Finally, we pass {\tt lams} to the {\tt HazardFunction}
constructor and return the result.

\index{HazardFunction}
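
To see the bookkeeping on a small case, here is a version of the same
loop that uses only the standard library and returns a dictionary
instead of a {\tt HazardFunction}.  The function
\verb"estimate_hazard" and the eight made-up observations are
assumptions for illustration.

\begin{verbatim}
from collections import Counter

def estimate_hazard(complete, ongoing):
    """Map each age to ended / at_risk, as in the loop above."""
    hist_complete = Counter(complete)
    hist_ongoing = Counter(ongoing)
    ts = sorted(set(hist_complete) | set(hist_ongoing))

    at_risk = len(complete) + len(ongoing)
    lams = {}
    for t in ts:
        ended = hist_complete[t]
        censored = hist_ongoing[t]
        lams[t] = ended / at_risk
        at_risk -= ended + censored
    return lams

complete = [20, 22, 22, 25]    # ages at first marriage
ongoing = [20, 23, 25, 30]     # ages at interview, never married
lams = estimate_hazard(complete, ongoing)
\end{verbatim}

At age 20, one of 8 respondents at risk marries, so the hazard is
$1/8$; one more is censored, leaving 6 at risk.  At age 22, two of
those 6 marry, so the hazard is $1/3$.  Ages where someone was only
interviewed, like 23 and 30, get a hazard of 0 but still reduce the
number at risk.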

\section{The marriage curve}

To test this function, we have to do some data cleaning and
transformation.  The NSFG variables we need are:
\index{marital status}

\begin{itemize}

\item {\tt cmbirth}: The respondent's date of birth, known for
all respondents.
\index{date of birth}

\item {\tt cmintvw}: The date the respondent was interviewed,
known for all respondents.

\item {\tt cmmarrhx}: The date the respondent was first married,
if applicable and known.

\item {\tt evrmarry}: 1 if the respondent had been
married prior to the date of interview, 0 otherwise.

\end{itemize}

The first three variables are encoded in ``century-months''; that is, the
integer number of months since December 1899.  So century-month
1 is January 1900.
\index{century month}
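
The encoding is easy to check with a short sketch.  The helper names
\verb"century_month_to_year_month" and \verb"age_in_years" are
hypothetical; they only restate the definition above.

\begin{verbatim}
def century_month_to_year_month(cm):
    """Convert a century-month (1 = January 1900) to (year, month)."""
    year = 1900 + (cm - 1) // 12
    month = (cm - 1) % 12 + 1
    return year, month

def age_in_years(cm_event, cm_birth):
    """Age at an event, given both dates in century-months."""
    return (cm_event - cm_birth) / 12.0

first = century_month_to_year_month(1)
later = century_month_to_year_month(1201)
age = age_in_years(1201, 901)
\end{verbatim}

Century-month 1201 is 1200 months after January 1900, which is
January 2000, and a respondent born in century-month 901 would be
25 years old at that point.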

First, we read the respondent file and replace invalid values of
{\tt cmmarrhx}:

\begin{verbatim}
    resp = chap01soln.ReadFemResp()
    resp['cmmarrhx'] = resp.cmmarrhx.replace([9997, 9998, 9999], np.nan)
\end{verbatim}

Then we compute each respondent's age when married and age when
interviewed:
\index{NaN}

\begin{verbatim}
    resp['agemarry'] = (resp.cmmarrhx - resp.cmbirth) / 12.0
    resp['age'] = (resp.cmintvw - resp.cmbirth) / 12.0
\end{verbatim}

Next we extract {\tt complete}, which is the age at marriage for
women who have been married, and {\tt ongoing}, which is the
age at interview for women who have not:
\index{age}

\begin{verbatim}
    complete = resp[resp.evrmarry==1].agemarry
    ongoing = resp[resp.evrmarry==0].age
\end{verbatim}

Finally we compute the
hazard function.
\index{hazard function}

\begin{verbatim}
    hf = EstimateHazardFunction(complete, ongoing)
\end{verbatim}

Figure~\ref{survival2} (top) shows the estimated hazard function;
it is low in the teens,
higher in the 20s, and declining in the 30s.  It increases again in
the 40s, but that is an artifact of the estimation process; as the
number of respondents ``at risk'' decreases, a small number of
women getting married yields a large estimated hazard.  The survival
curve will smooth out this noise.
\index{noise}

\section{Estimating the survival curve}

Once we have the hazard function, we can estimate the survival curve.
The chance of surviving past time {\tt t} is the chance of surviving
all times up through {\tt t}, which is the cumulative product of
the complementary hazard function:
%
\[ [1-\lambda(0)] [1-\lambda(1)] \ldots [1-\lambda(t)] \]
%
The {\tt HazardFunction} class provides {\tt MakeSurvival}, which
computes this product:
\index{cumulative product}
\index{SurvivalFunction}

\begin{verbatim}
# class HazardFunction:

    def MakeSurvival(self):
        ts = self.series.index
        ss = (1 - self.series).cumprod()
        cdf = thinkstats2.Cdf(ts, 1-ss)
        sf = SurvivalFunction(cdf)
        return sf
\end{verbatim}

{\tt ts} is the sequence of times where the hazard function is
estimated.  {\tt ss} is the cumulative product of the complementary
hazard function, so it is the survival curve.

Because of the way {\tt SurvivalFunction} is implemented, we have
to compute the complement of {\tt ss}, make a Cdf, and then instantiate
a SurvivalFunction object.
\index{Cdf}
\index{complementary CDF}
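As a quick numerical illustration of the cumulative product, the sketch
below applies it to a made-up hazard sequence using only the standard
library.  The hazard values are hypothetical, and the code stands in
for the pandas {\tt cumprod} call rather than reproducing
{\tt MakeSurvival}.

```python
from itertools import accumulate
from operator import mul

# made-up hazards at four consecutive times
lams = [1/6, 0.4, 0.5, 0.0]

# survival curve: running product of the complementary hazards
ss = list(accumulate((1 - lam for lam in lams), mul))

# the complement of the survival curve gives the CDF probabilities
cdf_ps = [1 - s for s in ss]
```

The survival curve drops whenever the hazard is nonzero and stays flat
where the hazard is 0.  In this example the last value of \verb"cdf_ps"
is less than 1, which is what an incomplete Cdf looks like.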

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival2.pdf}}
\caption{Hazard function for age at first marriage (top) and
survival curve (bottom).}
\label{survival2}
\end{figure}

Figure~\ref{survival2} (bottom) shows the result.  The survival
curve is steepest between 25 and 35, when most women get married.
Between 35 and 45,
the curve is nearly flat, indicating that women who do not marry
before age 35 are unlikely to get married.

A curve like this was the basis of a famous magazine article in 1986;
{\it Newsweek\/} reported that a 40-year-old unmarried woman was ``more
likely to be killed by a terrorist'' than get married.  These
statistics were widely reported and became part of popular culture,
but they were wrong then (because they were based on faulty analysis)
and turned out to be even more wrong (because of cultural changes that
were already in progress and continued).  In 2006, {\it Newsweek\/} ran
another article admitting that they were wrong.
\index{Newsweek}

I encourage you to read more about this article, the statistics it was
based on, and the reaction.  It should remind you of the ethical
obligation to perform statistical analysis with care, interpret the
results with appropriate skepticism, and present them to the public
accurately and honestly.
\index{ethics}

\section{Confidence intervals}

Kaplan-Meier analysis yields a single estimate of the survival curve,
but it is also important to quantify the uncertainty of the estimate.
As usual, there are three possible sources of error: measurement
error, sampling error, and modeling error.
\index{confidence interval}
\index{modeling error}
\index{sampling error}

In this example, measurement error is probably small.  People
generally know when they were born, whether they've been married, and
when.  And they can be expected to report this information accurately.
\index{measurement error}

We can quantify sampling error by resampling.  Here's the code:
\index{resampling}

\begin{verbatim}
def ResampleSurvival(resp, iters=101):
    low, high = resp.agemarry.min(), resp.agemarry.max()
    ts = np.arange(low, high, 1/12.0)

    ss_seq = []
    for i in range(iters):
        sample = thinkstats2.ResampleRowsWeighted(resp)
        hf, sf = EstimateSurvival(sample)
        ss_seq.append(sf.Probs(ts))

    low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])
    thinkplot.FillBetween(ts, low, high)
\end{verbatim}

{\tt ResampleSurvival} takes {\tt resp}, a DataFrame of respondents,
and {\tt iters}, the number of times to resample.  It computes {\tt
  ts}, which is the sequence of ages where we will evaluate the survival
curves.
\index{DataFrame}

Inside the loop, {\tt ResampleSurvival}:

\begin{itemize}

\item Resamples the respondents using {\tt ResampleRowsWeighted},
which we saw in Section~\ref{weighted}.
\index{weighted resampling}

\item Calls {\tt EstimateSurvival}, which uses the process in the
previous sections to estimate the hazard and survival curves, and

\item Evaluates the survival curve at each age in {\tt ts}.

\end{itemize}

\verb"ss_seq" is a sequence of evaluated survival curves.
{\tt PercentileRows} takes this sequence and computes the 5th and 95th
percentiles, returning a 90\% confidence interval for the survival
curve.
\index{FillBetween}
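The idea of taking percentiles across a sequence of curves can be
sketched in a few lines.  The function below is a simplified stand-in
written for this illustration, not the implementation of
{\tt thinkstats2.PercentileRows}; the name \verb"percentile_rows", the
index rule, and the toy data are all assumptions.

```python
def percentile_rows(rows, percents):
    """For each column of rows, selects the values at the given percentiles.

    rows: list of equal-length sequences (one evaluated curve per row)
    percents: list of percentile ranks between 0 and 100

    returns: one list per percentile rank
    """
    n = len(rows)
    # sort the values observed at each evaluation point
    columns = [sorted(col) for col in zip(*rows)]
    result = []
    for p in percents:
        index = min(int(n * p / 100), n - 1)
        result.append([col[index] for col in columns])
    return result


# 101 toy "curves", each evaluated at two points
rows = [[i, 2 * i] for i in range(101)]
low, high = percentile_rows(rows, [5, 95])
```

Each column is treated independently, so the resulting band is a
pointwise interval: at every evaluation point, about 90\% of the
resampled curves fall between {\tt low} and {\tt high}.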

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival3.pdf}}
\caption{Survival curve for age at first marriage (dark line) and a 90\%
confidence interval based on weighted resampling (gray line).}
\label{survival3}
\end{figure}

Figure~\ref{survival3} shows the result along with the survival
curve we estimated in the previous section.  The confidence
interval takes into account the sampling weights, unlike the estimated
curve.  The discrepancy between them indicates that the sampling
weights have a substantial effect on the estimate---we will have
to keep that in mind.
\index{confidence interval}
\index{sampling weight}

\section{Cohort effects}

One of the challenges of survival analysis is that different parts
of the estimated curve are based on different groups of respondents.
The part of the curve at time {\tt t} is based on respondents
whose age was at least {\tt t} when they were interviewed.
So the leftmost part of the curve includes data from all respondents,
but the rightmost part includes only the oldest respondents.

If the relevant characteristics of the respondents are not changing
over time, that's fine, but in this case it seems likely that marriage
patterns are different for women born in different generations.
We can investigate this effect by grouping respondents according
to their decade of birth.  Groups like this, defined by date of
birth or similar events, are called {\bf cohorts}, and differences
between the groups are called {\bf cohort effects}.
\index{cohort}
\index{cohort effect}

To investigate cohort effects in the NSFG marriage data, I gathered
the Cycle 6 data from 2002 used throughout this book;
the Cycle 7 data from 2006--2010 used in Section~\ref{replication};
and the Cycle 5 data from 1995.  In total these datasets include
30,769 respondents.

\begin{verbatim}
    resp5 = ReadFemResp1995()
    resp6 = ReadFemResp2002()
    resp7 = ReadFemResp2010()
    resps = [resp5, resp6, resp7]
\end{verbatim}

For each DataFrame, {\tt resp}, I use {\tt cmbirth} to compute the
decade of birth for each respondent:
\index{pandas}
\index{DataFrame}

\begin{verbatim}
    month0 = pandas.to_datetime('1899-12-15')
    dates = [month0 + pandas.DateOffset(months=cm)
             for cm in resp.cmbirth]
    resp['decade'] = (pandas.DatetimeIndex(dates).year - 1900) // 10
\end{verbatim}

{\tt cmbirth} is encoded as the integer number of months since
December 1899; {\tt month0} represents that date as a Timestamp
object.  For each birth date, we instantiate a {\tt DateOffset} that
contains the century-month and add it to {\tt month0}; the result
is a sequence of Timestamps, which is converted to a {\tt
  DatetimeIndex}.  Finally, we extract {\tt year} and compute
decades.
\index{DateTimeIndex}
\index{Index}
\index{century month}

To take into account the sampling weights, and also to show
variability due to sampling error, I resample the data,
group respondents by decade, and plot survival curves:
\index{resampling}
\index{sampling error}

\begin{verbatim}
    for i in range(iters):
        samples = [thinkstats2.ResampleRowsWeighted(resp)
                   for resp in resps]
        sample = pandas.concat(samples, ignore_index=True)
        groups = sample.groupby('decade')

        EstimateSurvivalByDecade(groups, alpha=0.2)
\end{verbatim}

Data from the three NSFG cycles use different sampling weights,
so I resample them separately and then use {\tt concat}
to merge them into a single DataFrame.  The parameter \verb"ignore_index"
tells {\tt concat} not to match up respondents by index; instead
it creates a new index from 0 to 30768.
\index{pandas}
\index{DataFrame}
\index{groupby}

{\tt EstimateSurvivalByDecade} plots survival curves for each cohort:

\begin{verbatim}
def EstimateSurvivalByDecade(groups, **options):
    for name, group in groups:
        hf, sf = EstimateSurvival(group)
        thinkplot.Plot(sf, **options)
\end{verbatim}

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival4.pdf}}
\caption{Survival curves for respondents born during different decades.}
\label{survival4}
\end{figure}

Figure~\ref{survival4} shows the results.  Several patterns are
visible:

\begin{itemize}

\item Women born in the 50s married earliest, with successive
  cohorts marrying later and later, at least until age 30 or so.

\item Women born in the 60s follow a surprising pattern.  Prior
to age 25, they were marrying at slower rates than their predecessors.
After age 25, they were marrying faster.  By age 32 they had overtaken
the 50s cohort, and at age 44 they are substantially more likely to
have married.
\index{marital status}

Women born in the 60s turned 25 between 1985 and 1995.  Remembering
that the {\it Newsweek\/} article I mentioned was published in 1986, it
is tempting to imagine that the article triggered a marriage boom.
That explanation would be too pat, but it is possible that the article
and the reaction to it were indicative of a mood that affected the
behavior of this cohort.
\index{Newsweek}

\item The pattern of the 70s cohort is similar.  They are less
likely than their predecessors to be married before age 25, but
at age 35 they have caught up with both of the previous cohorts.

\item Women born in the 80s are even less likely to marry before
age 25.  What happens after that is not clear; for more data, we
have to wait for the next cycle of the NSFG.

\end{itemize}

In the meantime we can make some predictions.
\index{prediction}

\section{Extrapolation}

The survival curve for the 70s cohort ends at about age 38;
for the 80s cohort it ends at age 28, and for the 90s cohort
we hardly have any data at all.
\index{extrapolation}

We can extrapolate these curves by ``borrowing'' data from the
previous cohort.  HazardFunction provides a method, {\tt Extend}, that
copies the tail from another longer HazardFunction:
\index{HazardFunction}

\begin{verbatim}
# class HazardFunction

    def Extend(self, other):
        last = self.series.index[-1]
        more = other.series[other.series.index > last]
        self.series = pandas.concat([self.series, more])
\end{verbatim}

As we saw in Section~\ref{hazard}, the HazardFunction contains a Series
that maps from $t$ to $\lambda(t)$.  {\tt Extend} finds {\tt last},
which is the last index in {\tt self.series}, selects values from
{\tt other} that come later than {\tt last}, and appends them
onto {\tt self.series}.
\index{pandas}
\index{Series}
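The effect of {\tt Extend} is easy to see on a toy example.  The sketch
below mimics the same logic with plain dictionaries instead of a pandas
Series; the function name {\tt extend} and the hazard values are
invented for illustration and do not come from the NSFG data.

```python
def extend(series, other):
    """Returns a copy of series extended with the tail of other.

    series, other: dicts that map from t to the hazard at t
    """
    last = max(series)
    # keep only the entries of other that come later than last
    more = {t: lam for t, lam in other.items() if t > last}
    extended = dict(series)
    extended.update(more)
    return extended


# invented hazards: the younger cohort has no data after age 28
hazard_80s = {20: 0.02, 25: 0.06, 28: 0.07}
hazard_70s = {20: 0.03, 25: 0.08, 28: 0.08, 30: 0.07, 38: 0.03}

predicted = extend(hazard_80s, hazard_70s)
```

The values up to age 28 come from the younger cohort and are left
untouched; only the ages beyond its last observation are borrowed from
the older cohort.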

Now we can extend the HazardFunction for each cohort, using values
from the predecessor:

\begin{verbatim}
def PlotPredictionsByDecade(groups):
    hfs = []
    for name, group in groups:
        hf, sf = EstimateSurvival(group)
        hfs.append(hf)

    thinkplot.PrePlot(len(hfs))
    for i, hf in enumerate(hfs):
        if i > 0:
            hf.Extend(hfs[i-1])
        sf = hf.MakeSurvival()
        thinkplot.Plot(sf)
\end{verbatim}

{\tt groups} is a GroupBy object with respondents grouped by decade of
birth.  The first loop computes the HazardFunction for each group.
\index{groupby}

The second loop extends each HazardFunction with values from
its predecessor, which might contain values from the previous
group, and so on.  Then it converts each HazardFunction to
a SurvivalFunction and plots it.

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival5.pdf}}
\caption{Survival curves for respondents born during different decades,
with predictions for the later cohorts.}
\label{survival5}
\end{figure}

Figure~\ref{survival5} shows the results; I've removed the 50s cohort
to make the predictions more visible.  These results suggest that by
age 40, the most recent cohorts will converge with the 60s cohort,
with fewer than 20\% never married.
\index{visualization}

\section{Expected remaining lifetime}

Given a survival curve, we can compute the expected remaining
lifetime as a function of current age.  For example, given the
survival curve of pregnancy length from Section~\ref{survival},
we can compute the expected time until delivery.
\index{pregnancy length}

The first step is to extract the PMF of lifetimes.  {\tt SurvivalFunction}
provides a method that does that:

\begin{verbatim}
# class SurvivalFunction

    def MakePmf(self, filler=None):
        pmf = thinkstats2.Pmf()
        for val, prob in self.cdf.Items():
            pmf.Set(val, prob)

        cutoff = self.cdf.ps[-1]
        if filler is not None:
            pmf[filler] = 1-cutoff

        return pmf
\end{verbatim}

Remember that the SurvivalFunction contains the Cdf of lifetimes.
The loop copies the values and probabilities from the Cdf into
a Pmf.
\index{Pmf}
\index{Cdf}

{\tt cutoff} is the highest probability in the Cdf, which is 1
if the Cdf is complete, and otherwise less than 1.
If the Cdf is incomplete, we plug in the provided value, {\tt filler},
to cap it off.

The Cdf of pregnancy lengths is complete, so we don't have to worry
about this detail yet.
\index{pregnancy length}

The next step is to compute the expected remaining lifetime, where
``expected'' means average.  {\tt SurvivalFunction}
provides a method that does that, too:
\index{expected remaining lifetime}

\begin{verbatim}
# class SurvivalFunction

    def RemainingLifetime(self, filler=None, func=thinkstats2.Pmf.Mean):
        pmf = self.MakePmf(filler=filler)
        d = {}
        for t in sorted(pmf.Values())[:-1]:
            pmf[t] = 0
            pmf.Normalize()
            d[t] = func(pmf) - t

        return pandas.Series(d)
\end{verbatim}

{\tt RemainingLifetime} takes {\tt filler}, which is passed along
to {\tt MakePmf}, and {\tt func} which is the function used to
summarize the distribution of remaining lifetimes.

{\tt pmf} is the Pmf of lifetimes extracted from the SurvivalFunction.
{\tt d} is a dictionary that contains the results, a map from
current age, {\tt t}, to expected remaining lifetime.
\index{Pmf}

The loop iterates through the values in the Pmf.  For each value
of {\tt t} it computes the conditional distribution of lifetimes,
given that the lifetime exceeds {\tt t}.  It does that by removing
values from the Pmf one at a time and renormalizing the remaining
values.

Then it uses {\tt func} to summarize the conditional distribution.
In this example the result is the mean pregnancy length, given that
the length exceeds {\tt t}.  By subtracting {\tt t} we get the
mean remaining pregnancy length.
\index{pregnancy length}
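The conditioning step can be worked through by hand on a tiny
distribution.  The following sketch uses a plain dictionary as the PMF
and recomputes the conditional mean from scratch for each {\tt t},
rather than zeroing entries and renormalizing in place; the name
\verb"remaining_lifetime" and the three-point distribution are
hypothetical.

```python
def remaining_lifetime(pmf):
    """Computes mean remaining lifetime as a function of current age.

    pmf: dict that maps from lifetime to probability

    returns: dict that maps from t to the mean of (lifetime - t),
             given that the lifetime exceeds t
    """
    result = {}
    for t in sorted(pmf)[:-1]:
        # conditional distribution: keep only lifetimes longer than t
        later = {x: p for x, p in pmf.items() if x > t}
        total = sum(later.values())
        mean = sum(x * p for x, p in later.items()) / total
        result[t] = mean - t
    return result


rem = remaining_lifetime({1: 0.25, 2: 0.25, 3: 0.5})
```

Given that the lifetime exceeds 1, the remaining probability mass is
0.25 at 2 and 0.5 at 3, so the conditional mean is $2/0.75 = 8/3$ and
the mean remaining lifetime is $5/3$.  Given that it exceeds 2, the
only possibility left is 3, so the remaining lifetime is 1.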

\begin{figure}
% survival.py
\centerline{\includegraphics[height=2.5in]{figs/survival6.pdf}}
\caption{Expected remaining pregnancy length (left) and
years until first marriage (right).}
\label{survival6}
\end{figure}

Figure~\ref{survival6} (left) shows the expected remaining pregnancy
length as a function of the current duration.  For example, during
Week 0, the expected remaining duration is about 34 weeks.  That's
less than full term (39 weeks) because terminations of pregnancy
in the first trimester bring the average down.
\index{pregnancy length}

The curve drops slowly during the first trimester.  After 13 weeks,
the expected remaining lifetime has dropped by only 9 weeks, to
25.  After that the curve drops faster, by about a week per week.

Between Week 37 and 42, the curve levels off between 1 and 2 weeks.
At any time during this period, the expected remaining lifetime is the
same; with each week that passes, the destination gets no closer.
Processes with this property are called {\bf memoryless} because
the past has no effect on the predictions.
This behavior is the mathematical basis of the infuriating mantra
of obstetrics nurses: ``any day now.''
\index{memoryless}

Figure~\ref{survival6} (right) shows the median remaining time until
first marriage, as a function of age.  For an 11-year-old girl, the
median time until first marriage is about 14 years.  The curve decreases
until age 22 when the median remaining time is about 7 years.
After that it increases again: by age 30 it is back where it started,
at 14 years.

Based on these data, young women have decreasing remaining
``lifetimes''.  Mechanical components with this property are called {\bf NBUE}
for ``new better than used in expectation,'' meaning that a new part is
expected to last longer.
\index{NBUE}

Women older than 22 have increasing remaining time until first
marriage.  Components with this property are called {\bf UBNE} for
``used better than new in expectation.''  That is, the older the part,
the longer it is expected to last.  Newborns and cancer patients are
also UBNE; their life expectancy increases the longer they live.
\index{UBNE}

For this example I computed median, rather than mean, because the
Cdf is incomplete; the survival curve projects that about 20\%
of respondents will not marry before age 44.  The age of
first marriage for these women is unknown, and might be non-existent,
so we can't compute a mean.
\index{Cdf}
\index{median}

I deal with these unknown values by replacing them with {\tt np.inf},
a special value that represents infinity.  That makes the mean
infinity for all ages, but the median is well-defined as long as
more than 50\% of the remaining lifetimes are finite, which is true
until age 30.  After that it is hard to define a meaningful
expected remaining lifetime.
\index{inf}
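A small example shows why the median survives an infinite filler value
while the mean does not.  This is a sketch with a dictionary-based PMF
and a simple median rule (the smallest value whose cumulative
probability reaches 0.5); the {\tt median} helper and the probabilities
are assumptions for illustration and may differ in detail from
{\tt Pmf.Percentile}.

```python
import math


def median(pmf):
    """Returns the smallest value whose cumulative probability reaches 0.5."""
    total = sum(pmf.values())
    cumulative = 0.0
    for x in sorted(pmf):
        cumulative += pmf[x] / total
        if cumulative >= 0.5:
            return x


# 40% of lifetimes are unknown, represented by infinity
mostly_finite = {5: 0.3, 10: 0.3, math.inf: 0.4}
# 60% of lifetimes are unknown
mostly_unknown = {5: 0.2, 10: 0.2, math.inf: 0.6}

mean_finite = sum(x * p for x, p in mostly_finite.items())
median_finite = median(mostly_finite)
median_unknown = median(mostly_unknown)
```

When more than half of the probability mass is finite, the median is a
finite value even though the mean is infinite.  Once more than half of
the mass sits on the filler value, the median is infinite too.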

Here's the code that computes and plots these functions:

\begin{verbatim}
    rem_life1 = sf1.RemainingLifetime()
    thinkplot.Plot(rem_life1)

    func = lambda pmf: pmf.Percentile(50)
    rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)
    thinkplot.Plot(rem_life2)
\end{verbatim}

{\tt sf1} is the survival curve for pregnancy length;
in this case we can use the default values for {\tt RemainingLifetime}.
\index{pregnancy length}

{\tt sf2} is the survival curve for age at first marriage;
{\tt func} is a function that takes a Pmf and computes its
median (50th percentile).
\index{Pmf}

\section{Exercises}

My solution to this exercise is in \verb"chap13soln.py".

\begin{exercise}
In NSFG Cycles 6 and 7, the variable {\tt cmdivorcx} contains the
date of divorce for the respondent's first marriage, if applicable,
encoded in century-months.
\index{divorce}
\index{marital status}

Compute the duration of marriages that have ended in divorce, and
the duration, so far, of marriages that are ongoing.  Estimate the
hazard and survival curve for the duration of marriage.

Use resampling to take into account sampling weights, and plot
data from several resamples to visualize sampling error.
\index{resampling}

Consider dividing the respondents into groups by decade of birth,
and possibly by age at first marriage.
\index{groupby}

\end{exercise}

\section{Glossary}

\begin{itemize}

\item survival analysis: A set of methods for describing and
  predicting lifetimes, or more generally time until an event occurs.
\index{survival analysis}

\item survival curve: A function that maps from a time, $t$, to the
  probability of surviving past $t$.
\index{survival curve}

\item hazard function: A function that maps from $t$ to the fraction
of people alive until $t$ who die at $t$.
\index{hazard function}

\item Kaplan-Meier estimation: An algorithm for estimating hazard and
survival functions.
\index{Kaplan-Meier estimation}

\item cohort: a group of subjects defined by an event, like date of
birth, in a particular interval of time.
\index{cohort}

\item cohort effect: a difference between cohorts.
\index{cohort effect}

\item NBUE: A property of expected remaining lifetime, ``New
better than used in expectation.''
\index{NBUE}

\item UBNE: A property of expected remaining lifetime, ``Used
better than new in expectation.''
\index{UBNE}

\end{itemize}

\chapter{Analytic methods}
\label{analysis}

This book has focused on computational methods like simulation and
resampling, but some of the problems we solved have
analytic solutions that can be much faster.
\index{resampling}
\index{analytic methods}
\index{computational methods}

I present some of these methods in this chapter, and explain
how they work.  At the end of the chapter, I make suggestions
for integrating computational and analytic methods for exploratory
data analysis.

The code in this chapter is in {\tt normal.py}.  For information
about downloading and working with this code, see Section~\ref{code}.

\section{Normal distributions}
\label{why_normal}
\index{normal distribution}
\index{distribution!normal}
\index{Gaussian distribution}
\index{distribution!Gaussian}

As a motivating example, let's review the problem from
Section~\ref{gorilla}:
\index{gorilla}

\begin{quotation}
\noindent Suppose you are a scientist studying gorillas in a wildlife
preserve.  Having weighed 9 gorillas, you find sample mean $\xbar=90$ kg and
sample standard deviation, $S=7.5$ kg.  If you use $\xbar$ to estimate
the population mean, what is the standard error of the estimate?
\end{quotation}

To answer that question, we need the sampling
distribution of $\xbar$.  In Section~\ref{gorilla} we approximated
this distribution by simulating the experiment (weighing
9 gorillas), computing $\xbar$ for each simulated experiment, and
accumulating the distribution of estimates.
\index{standard error}
\index{standard deviation}

The result is an approximation of the sampling distribution.  Then we
use the sampling distribution to compute standard errors and
confidence intervals:
\index{confidence interval}
\index{sampling distribution}

\begin{enumerate}

\item The standard deviation of the sampling distribution is the
  standard error of the estimate; in the example, it is about
  2.5 kg.

\item The interval between the 5th and 95th percentile of the sampling
  distribution is a 90\% confidence interval.  If we run the
  experiment many times, we expect the estimate to fall in this
  interval 90\% of the time.  In the example, the 90\% CI is
  $(86, 94)$ kg.

\end{enumerate}

Now we'll do the same calculation analytically.  We
take advantage of the fact that the weights of adult female gorillas
are roughly normally distributed.  Normal distributions have two
properties that make them amenable to analysis: they are ``closed'' under
linear transformation and addition.  To explain what that means, I
need some notation.  \index{analysis}
\index{linear transformation}
\index{addition, closed under}

If the distribution of a quantity, $X$, is
normal with parameters $\mu$ and $\sigma$, you can write
%
\[ X \sim \normal~(\mu, \sigma^{2})\]
%
where the symbol $\sim$ means ``is distributed'' and the script letter
$\normal$ stands for ``normal.''

%The other analytic distributions in this chapter are sometimes
%written $\mathrm{Exponential}(\lambda)$, $\mathrm{Pareto}(x_m,
%\alpha)$ and, for lognormal, $\mathrm{Log}-\normal~(\mu,
%\sigma^2)$.

A linear transformation of $X$ is something like $X' = a X + b$, where
$a$ and $b$ are real numbers.\index{linear transformation}
A family of distributions is closed under
linear transformation if $X'$ is in the same family as $X$.  The normal
distribution has this property; if $X \sim \normal~(\mu,
\sigma^2)$,
%
\[ X' \sim \normal~(a \mu + b, a^{2} \sigma^2) \tag*{(1)} \]
%
Normal distributions are also closed under addition.
If $Z = X + Y$, where $X$ and $Y$ are independent,
$X \sim \normal~(\mu_{X}, \sigma_{X}^{2})$ and
$Y \sim \normal~(\mu_{Y}, \sigma_{Y}^{2})$, then
%
\[ Z \sim \normal~(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)  \tag*{(2)}\]
%
In the special case where $Z$ is the sum of two independent values
drawn from the distribution of $X$, we have
%
\[ Z \sim \normal~(2 \mu_X, 2 \sigma_X^2) \]
%
and in general if we draw $n$ values of $X$ and add them up, we have
%
\[ Z \sim \normal~(n \mu_X, n \sigma_X^2)  \tag*{(3)}\]
%
Note that this is not the same as multiplying a single value by $n$:
by Equation 1, $n X$ has variance $n^2 \sigma_X^2$.
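A quick simulation can check Equation 3 and contrast it with
Equation 1.  The sketch below uses only the standard library; the
parameter values are borrowed from the gorilla example, while the seed
and the number of iterations are arbitrary choices made for this
illustration.

```python
import random
import statistics

random.seed(17)
mu, sigma, n = 90, 7.5, 9
iters = 20000

# Sum of n independent draws: Equation 3 predicts
# mean n * mu and variance n * sigma**2.
sums = [sum(random.gauss(mu, sigma) for _ in range(n))
        for _ in range(iters)]

# n times a single draw: Equation 1 (with a = n, b = 0) predicts
# mean n * mu and variance n**2 * sigma**2.
scaled = [n * random.gauss(mu, sigma) for _ in range(iters)]

mean_sums = statistics.mean(sums)
var_sums = statistics.variance(sums)
var_scaled = statistics.variance(scaled)
```

With these parameters the variance of the sums should come out near
$9 \times 7.5^2 = 506.25$, while the variance of the scaled values
should be near $81 \times 7.5^2 = 4556.25$.  The two cases have the
same mean but very different spread.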

\section{Sampling distributions}

Now we have everything we need to compute the sampling distribution of
$\xbar$.  Remember that we compute $\xbar$ by weighing $n$ gorillas,
adding up the total weight, and dividing by $n$.
\index{sampling distribution}
\index{gorilla}
\index{weight}

Assume that the distribution of gorilla weights, $X$, is
approximately normal:
%
\[ X \sim \normal~(\mu, \sigma^2)\]
%
If we weigh $n$ gorillas, the total weight, $Y$, is distributed
%
\[ Y \sim \normal~(n \mu, n \sigma^2) \]
%
using Equation 3.  And if we divide by $n$, the sample mean,
$Z$, is distributed
%
\[ Z \sim \normal~(\mu, \sigma^2/n) \]
%
using Equation 1 with $a = 1/n$.

The distribution of $Z$ is the sampling distribution of $\xbar$.
The mean of $Z$ is $\mu$, which shows that $\xbar$ is an unbiased
estimate of $\mu$.  The variance of the sampling distribution
is $\sigma^2 / n$.
\index{biased estimator}
\index{estimator!biased}

So the standard deviation of the sampling distribution, which is the
standard error of the estimate, is $\sigma / \sqrt{n}$.  In the
example, $\sigma$ is 7.5 kg and $n$ is 9, so the standard error is 2.5
kg.  That result is consistent with what we estimated by simulation,
but much faster to compute!
\index{standard error}
\index{standard deviation}

11617
We can also use the sampling distribution to compute confidence
11618
intervals.  A 90\% confidence interval for $\xbar$ is the interval
11619
between the 5th and 95th percentiles of $Z$.  Since $Z$ is normally
11620
distributed, we can compute percentiles by evaluating the inverse
11621
CDF.
11622
\index{inverse CDF}
11623
\index{CDF, inverse}
11624
\index{confidence interval}
11625

11626
There is no closed form for the CDF of the normal distribution
11627
or its inverse, but there are fast numerical methods and they
11628
are implemented in SciPy, as we saw in Section~\ref{normal}.
11629
{\tt thinkstats2} provides a wrapper function that makes the
11630
SciPy function a little easier to use:
11631
\index{SciPy}
11632
\index{normal distribution}
11633
\index{wrapper}
11634
\index{closed form}
11635

11636
\begin{verbatim}
11637
def EvalNormalCdfInverse(p, mu=0, sigma=1):
11638
    return scipy.stats.norm.ppf(p, loc=mu, scale=sigma)
11639
\end{verbatim}
11640

11641
Given a probability, {\tt p}, it returns the corresponding
11642
percentile from a normal distribution with parameters {\tt mu}
11643
and {\tt sigma}.  For the 90\% confidence interval of $\xbar$,
11644
we compute the 5th and 95th percentiles like this:
11645
\index{percentile}
11646

11647
\begin{verbatim}
>>> thinkstats2.EvalNormalCdfInverse(0.05, mu=90, sigma=2.5)
85.888

>>> thinkstats2.EvalNormalCdfInverse(0.95, mu=90, sigma=2.5)
94.112
\end{verbatim}
11654

11655
So if we run the experiment many times, we expect the
11656
estimate, $\xbar$, to fall in the range $(85.9, 94.1)$ about
11657
90\% of the time.  Again, this is consistent with the result
11658
we got by simulation.
11659
\index{simulation}
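
The same interval can be worked out by hand, which may make the
percentile calculation more concrete.  For a normal distribution, the
5th and 95th percentiles lie about 1.645 standard deviations below and
above the mean, so with mean 90 kg and standard error 2.5 kg we get
%
\[ 90 \pm 1.645 \times 2.5 \approx 90 \pm 4.11 \]
%
which is approximately $(85.9, 94.1)$.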
11660

11661

11662
\section{Representing normal distributions}
11663

11664
To make these calculations easier, I have defined a class called
11665
{\tt Normal} that represents a normal distribution and encodes
11666
the equations in the previous sections.  Here's what it looks
11667
like:
11668
\index{Normal}
11669

11670
\begin{verbatim}
class Normal(object):

    def __init__(self, mu, sigma2):
        self.mu = mu
        self.sigma2 = sigma2

    def __str__(self):
        return 'N(%g, %g)' % (self.mu, self.sigma2)
\end{verbatim}
11680

11681
So we can instantiate a Normal that represents the distribution
11682
of gorilla weights:
11683
\index{gorilla}
11684

11685
\begin{verbatim}
>>> dist = Normal(90, 7.5**2)
>>> print(dist)
N(90, 56.25)
\end{verbatim}
11690

11691
{\tt Normal} provides {\tt Sum}, which takes a sample size, {\tt n},
11692
and returns the distribution of the sum of {\tt n} values, using
11693
Equation 3:
11694

11695
\begin{verbatim}
    def Sum(self, n):
        return Normal(n * self.mu, n * self.sigma2)
\end{verbatim}
11699

11700
Normal also knows how to multiply and divide using
11701
Equation 1:
11702

11703
\begin{verbatim}
    def __mul__(self, factor):
        return Normal(factor * self.mu, factor**2 * self.sigma2)

    __rmul__ = __mul__    # needed for float * Normal below

    def __truediv__(self, divisor):
        return 1 / divisor * self

    __div__ = __truediv__    # Python 2 name for the same operator
\end{verbatim}
11710

11711
So we can compute the sampling distribution of the mean with sample
11712
size 9:
11713
\index{sampling distribution}
11714
\index{sample size}
11715

11716
\begin{verbatim}
>>> dist_xbar = dist.Sum(9) / 9
>>> dist_xbar.sigma
2.5
\end{verbatim}
11721

11722
The standard deviation of the sampling distribution is 2.5 kg, as we
11723
saw in the previous section.  Finally, Normal provides {\tt
11724
  Percentile}, which we can use to compute a confidence interval:
11725
\index{standard deviation}
11726
\index{confidence interval}
11727

11728
\begin{verbatim}
>>> dist_xbar.Percentile(5), dist_xbar.Percentile(95)
85.888 94.112
\end{verbatim}
11732

11733
And that's the same answer we got before.  We'll use the Normal
11734
class again later, but before we go on, we need one more bit of
11735
analysis.
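
For readers who want to run the examples in this section, here is a
self-contained sketch that collects the pieces in one place.  The
excerpts above do not show {\tt sigma} or {\tt Percentile}; the versions
below are assumptions based on how those names are used in the
examples, and may differ from the code that accompanies the book.

\begin{verbatim}
import math
import scipy.stats

class Normal:
    """Normal distribution stored as mean and variance (a sketch)."""

    def __init__(self, mu, sigma2):
        self.mu = mu
        self.sigma2 = sigma2

    def __repr__(self):
        return 'N(%g, %g)' % (self.mu, self.sigma2)

    @property
    def sigma(self):
        # assumed helper: standard deviation from the stored variance
        return math.sqrt(self.sigma2)

    def Sum(self, n):
        return Normal(n * self.mu, n * self.sigma2)

    def __mul__(self, factor):
        return Normal(factor * self.mu, factor**2 * self.sigma2)

    __rmul__ = __mul__

    def __truediv__(self, divisor):
        return 1 / divisor * self

    def Percentile(self, p):
        # assumed helper: p is a percentile rank between 0 and 100
        return scipy.stats.norm.ppf(p / 100, loc=self.mu,
                                    scale=self.sigma)

dist_xbar = Normal(90, 7.5**2).Sum(9) / 9
low, high = dist_xbar.Percentile(5), dist_xbar.Percentile(95)
print(dist_xbar, dist_xbar.sigma, low, high)
\end{verbatim}

Storing the variance rather than the standard deviation keeps {\tt Sum}
and the arithmetic operators simple, because variances add and standard
deviations do not.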
11736

11737

11738
\section{Central limit theorem}
11739
\label{CLT}
11740

11741
As we saw in the previous sections, if we add values drawn from normal
11742
distributions, the distribution of the sum is normal.
11743
Most other distributions don't have this property;
11744
if we add values drawn from other distributions, the sum does not
11745
generally have an analytic distribution.
11746
  \index{sum}
11747
\index{normal distribution} \index{distribution!normal}
11748
\index{Gaussian distribution} \index{distribution!Gaussian}
11749

11750
But if we add up {\tt n} values from
11751
almost any distribution, the distribution of the sum converges to
11752
normal as {\tt n} increases.
11753

11754
More specifically, if the distribution of the values has mean and
11755
standard deviation $\mu$ and $\sigma$, the distribution of the sum is
11756
approximately $\normal(n \mu, n \sigma^2)$.
11757
\index{standard deviation}
11758

11759
This result is the Central Limit Theorem (CLT).  It is one of the
11760
most useful tools for statistical analysis, but it comes with
11761
caveats:
11762
\index{Central Limit Theorem}
11763
\index{CLT}
11764

11765
\begin{itemize}
11766

11767
\item The values have to be drawn independently.  If they are
11768
correlated, the CLT doesn't apply (although this is seldom a problem
11769
in practice).
11770
\index{independent}
11771

11772
\item The values have to come from the same distribution (although
11773
  this requirement can be relaxed).
11774
\index{identical}
11775

11776
\item The values have to be drawn
11777
  from a distribution with finite mean and variance.  So most Pareto
11778
  distributions are out.
11779
\index{mean}
11780
\index{variance}
11781
\index{Pareto distribution}
11782
\index{distribution!Pareto}
11783
\index{exponential distribution}
11784
\index{distribution!exponential}
11785

11786
\item The rate of convergence depends
11787
  on the skewness of the distribution.  Sums from an exponential
11788
  distribution converge for small {\tt n}.  Sums from a
11789
  lognormal distribution require larger sizes.
11790
\index{lognormal distribution}
11791
\index{distribution!lognormal}
11792
\index{skewness}
11793

11794
\end{itemize}
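
The statement about sums translates directly into one about means.
As a short worked step, if the sum of $n$ values is approximately
$\normal(n \mu, n \sigma^2)$, then dividing by $n$ (Equation 1 with
$a = 1/n$) gives, approximately,
%
\[ \xbar \sim \normal(\mu, \sigma^2 / n) \]
%
This is the same sampling distribution we derived for normal data; under
the caveats above, it also holds approximately for other distributions
when $n$ is large enough.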
11795

11796
The Central Limit Theorem explains the prevalence
11797
of normal distributions in the natural world.  Many characteristics of
11798
living things are affected by genetic
11799
and environmental factors whose effect is additive.  The characteristics
11800
we measure are the sum of a large number of small effects, so their
11801
distribution tends to be normal.
11802
\index{normal distribution}
11803
\index{distribution!normal}
11804
\index{Gaussian distribution}
11805
\index{distribution!Gaussian}
11806
\index{Central Limit Theorem}
11807
\index{CLT}
11808

11809

11810
\section{Testing the CLT}
11811

11812
To see how the Central Limit Theorem works, and when it doesn't,
11813
let's try some experiments.  First, we'll try
11814
an exponential distribution:
11815

11816
\begin{verbatim}
def MakeExpoSamples(beta=2.0, iters=1000):
    samples = []
    for n in [1, 10, 100]:
        sample = [np.sum(np.random.exponential(beta, n))
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples
\end{verbatim}
11825

11826
{\tt MakeExpoSamples} generates samples of sums of exponential values
11827
(I use ``exponential values'' as shorthand for ``values from an
11828
exponential distribution'').
11829
{\tt beta} is the parameter of the distribution; {\tt iters}
11830
is the number of sums to generate.
11831

11832
To explain this function, I'll start from the inside and work my way
11833
out.  Each time we call {\tt np.random.exponential}, we get a sequence
11834
of {\tt n} exponential values and compute its sum.  {\tt sample}
11835
is a list of these sums, with length {\tt iters}.
11836
\index{NumPy}
11837

11838
It is easy to get {\tt n} and {\tt iters} confused:  {\tt n} is the
11839
number of terms in each sum;  {\tt iters} is the number of sums we
11840
compute in order to characterize the distribution of sums.
11841

11842
The return value is a list of {\tt (n, sample)} pairs.  For
11843
each pair, we make a normal probability plot:
11844
\index{thinkplot}
11845
\index{normal probability plot}
11846

11847
\begin{verbatim}
def NormalPlotSamples(samples, plot=1, ylabel=''):
    for n, sample in samples:
        thinkplot.SubPlot(plot)
        thinkstats2.NormalProbabilityPlot(sample)

        thinkplot.Config(title='n=%d' % n, ylabel=ylabel)
        plot += 1
\end{verbatim}
11856

11857
{\tt NormalPlotSamples} takes the list of pairs from {\tt
11858
  MakeExpoSamples} and generates a row of normal probability plots.
11859
\index{normal probability plot}
11860

11861
\begin{figure}
11862
% normal.py
11863
\centerline{\includegraphics[height=3.5in]{figs/normal1.pdf}}
11864
\caption{Distributions of sums of exponential values (top row) and
11865
lognormal values (bottom row).}
11866
\label{normal1}
11867
\end{figure}
11868

11869
Figure~\ref{normal1} (top row) shows
11870
the results.  With {\tt n=1}, the distribution of the sum is still
11871
exponential, so the normal probability plot is not a straight line.
11872
But with {\tt n=10} the distribution of the sum is approximately
11873
normal, and with {\tt n=100} it is all but indistinguishable from
11874
normal.
11875

11876
Figure~\ref{normal1} (bottom row) shows similar results for a
11877
lognormal distribution.  Lognormal distributions are generally more
11878
skewed than exponential distributions, so the distribution of sums
11879
takes longer to converge.  With {\tt n=10} the normal
11880
probability plot is nowhere near straight, but with {\tt n=100}
11881
it is approximately normal.
11882
\index{lognormal distribution}
11883
\index{distribution!lognormal}
11884
\index{skewness}
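
A numerical companion to the plots is to track the sample skewness of
the sums as {\tt n} grows.  The following is a minimal sketch, not the
code that generated the figures; the helper names, the lognormal
parameters, and the seed are assumptions chosen for illustration.  For
exponential values, the skewness of the sum is $2 / \sqrt{n}$, so it
should fall from about 2 toward 0.2 as {\tt n} goes from 1 to 100.

\begin{verbatim}
import numpy as np
import scipy.stats

rng = np.random.default_rng(17)

def skew_of_sums(draw, n, iters=10000):
    """Sample skewness of iters sums, each of n values from draw."""
    sums = draw((iters, n)).sum(axis=1)
    return scipy.stats.skew(sums)

def draw_expo(size):
    return rng.exponential(2.0, size)

def draw_lognormal(size):
    return rng.lognormal(0.0, 1.0, size)

skews = {}
for n in [1, 10, 100]:
    # pair of (exponential, lognormal) skewness for this n
    skews[n] = (skew_of_sums(draw_expo, n),
                skew_of_sums(draw_lognormal, n))
    print(n, skews[n])
\end{verbatim}

If the lognormal sums remain more skewed than the exponential sums at
the same {\tt n}, that agrees with the slower convergence seen in the
bottom row of the figure.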
11885

11886
\begin{figure}
11887
% normal.py
11888
\centerline{\includegraphics[height=3.5in]{figs/normal2.pdf}}
11889
\caption{Distributions of sums of Pareto values (top row) and
11890
correlated exponential values (bottom row).}
11891
\label{normal2}
11892
\end{figure}
11893

11894
Pareto distributions are even more skewed than lognormal.  Depending
11895
on the parameters, many Pareto distributions do not have finite mean
11896
and variance.  As a result, the Central Limit Theorem does not apply.
11897
Figure~\ref{normal2} (top row) shows distributions of sums of
11898
Pareto values.  Even with {\tt n=100} the normal probability plot
11899
is far from straight.
11900
\index{Pareto distribution}
11901
\index{distribution!Pareto}
11902
\index{Central Limit Theorem}
11903
\index{CLT}
11904
\index{normal probability plot}
11905

11906
I also mentioned that the CLT does not apply if the values are correlated.
11907
To test that, I generate correlated values from an exponential
11908
distribution.  The algorithm for generating correlated values is
11909
(1) generate correlated normal values, (2) use the normal CDF
11910
to transform the values to uniform, and (3) use the inverse
11911
exponential CDF to transform the uniform values to exponential.
11912
\index{inverse CDF}
11913
\index{CDF, inverse}
11914
\index{correlation}
11915
\index{random number}
11916

11917
{\tt GenerateCorrelated} returns an iterator of {\tt n} normal values
11918
with serial correlation {\tt rho}:
11919
\index{iterator}
11920

11921
\begin{verbatim}
def GenerateCorrelated(rho, n):
    x = random.gauss(0, 1)
    yield x

    sigma = math.sqrt(1 - rho**2)
    for _ in range(n-1):
        x = random.gauss(x*rho, sigma)
        yield x
\end{verbatim}
11931

11932
The first value is a standard normal value.  Each subsequent value
11933
depends on its predecessor: if the previous value is {\tt x}, the mean of
11934
the next value is {\tt x*rho}, with variance {\tt 1-rho**2}.  Note that {\tt
11935
  random.gauss} takes the standard deviation as the second argument,
11936
not variance.
11937
\index{standard deviation}
11938
\index{standard normal distribution}
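
To see why the variance {\tt 1-rho**2} is the right choice, here is a
short derivation.  The notation is introduced only for this step: write
the update as $x_{k+1} = \rho x_k + \eps_k$, where $\eps_k$ is
independent of $x_k$ and has mean 0 and variance $1 - \rho^2$, and
suppose $x_k$ has mean 0 and variance 1.  Then
%
\[ \mathrm{Var}(x_{k+1}) = \rho^2 \, \mathrm{Var}(x_k) + \mathrm{Var}(\eps_k)
   = \rho^2 + (1 - \rho^2) = 1 \]
%
and
%
\[ \mathrm{Cov}(x_k, x_{k+1}) = \rho \, \mathrm{Var}(x_k) = \rho \]
%
So every value in the sequence is standard normal, and the correlation
between successive values is $\rho$.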
11939

11940
{\tt GenerateExpoCorrelated}
11941
takes the resulting sequence and transforms it to exponential:
11942

11943
\begin{verbatim}
def GenerateExpoCorrelated(rho, n):
    normal = list(GenerateCorrelated(rho, n))
    uniform = scipy.stats.norm.cdf(normal)
    expo = scipy.stats.expon.ppf(uniform)
    return expo
\end{verbatim}
11950

11951
{\tt normal} is a list of correlated normal values.  {\tt uniform}
11952
is a sequence of uniform values between 0 and 1.  {\tt expo} is
11953
a correlated sequence of exponential values.
11954
{\tt ppf} stands for ``percent point function,'' which is another
11955
name for the inverse CDF.
11956
\index{inverse CDF}
11957
\index{CDF, inverse}
11958
\index{percent point function}
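
One way to gain confidence in this transformation is to check that the
output looks exponential and is still serially correlated.  The sketch
below repeats the two generators so that it runs on its own; the helper
\verb"serial_corr", the sample size, and the seed are assumptions added
for this check.

\begin{verbatim}
import math
import random
import numpy as np
import scipy.stats

def GenerateCorrelated(rho, n):
    x = random.gauss(0, 1)
    yield x
    sigma = math.sqrt(1 - rho**2)
    for _ in range(n - 1):
        x = random.gauss(x * rho, sigma)
        yield x

def GenerateExpoCorrelated(rho, n):
    normal = list(GenerateCorrelated(rho, n))
    uniform = scipy.stats.norm.cdf(normal)
    return scipy.stats.expon.ppf(uniform)

def serial_corr(xs):
    """Pearson correlation between each value and its successor."""
    return np.corrcoef(xs[:-1], xs[1:])[0, 1]

random.seed(17)
expo = GenerateExpoCorrelated(rho=0.9, n=100000)
print(expo.mean(), serial_corr(expo))
\end{verbatim}

With the default parameters of {\tt scipy.stats.expon}, the mean of the
values should be close to 1.  The serial correlation of the exponential
values need not equal {\tt rho} exactly, because the transformation is
nonlinear, but it should remain strongly positive.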
11959

11960
Figure~\ref{normal2} (bottom row) shows distributions of sums of
11961
correlated exponential values with {\tt rho=0.9}.  The correlation
11962
slows the rate of convergence; nevertheless, with {\tt n=100} the
11963
normal probability plot is nearly straight.  So even though the CLT
11964
does not strictly apply when the values are correlated, moderate
11965
correlations are seldom a problem in practice.
11966
\index{normal probability plot}
11967
\index{correlation}
11968

11969
These experiments are meant to show how the Central Limit Theorem
11970
works, and what happens when it doesn't.  Now let's see how we can
11971
use it.
11972

11973

11974
\section{Applying the CLT}
11975
\label{usingCLT}
11976

11977
To see why the Central Limit Theorem is useful, let's get back
11978
to the example in Section~\ref{testdiff}: testing the apparent
11979
difference in mean pregnancy length for first babies and others.
11980
As we've seen, the apparent difference is about
11981
0.078 weeks:
11982
\index{pregnancy length}
11983
\index{Central Limit Theorem}
11984
\index{CLT}
11985

11986
\begin{verbatim}
>>> live, firsts, others = first.MakeFrames()
>>> delta = firsts.prglngth.mean() - others.prglngth.mean()
>>> delta
0.078
\end{verbatim}
11991

11992
Remember the logic of hypothesis testing: we compute a p-value, which
is the probability of seeing a difference at least as large as the
observed one under the null
hypothesis; if it is small, we conclude that the observed difference
is unlikely to be due to chance.
11996
\index{p-value}
11997
\index{null hypothesis}
11998
\index{hypothesis testing}
11999

12000
In this example, the null hypothesis is that the distribution of
12001
pregnancy lengths is the same for first babies and others.  
12002
So we can compute the sampling distribution of the mean
12003
like this:
12004
\index{sampling distribution}
12005

12006
\begin{verbatim}
    dist1 = SamplingDistMean(live.prglngth, len(firsts))
    dist2 = SamplingDistMean(live.prglngth, len(others))
\end{verbatim}
12010

12011
Both sampling distributions are based on the same population, which is
12012
the pool of all live births.  {\tt SamplingDistMean} takes this
12013
sequence of values and the sample size, and returns a Normal object
12014
representing the sampling distribution:
12015

12016
\begin{verbatim}
def SamplingDistMean(data, n):
    mean, var = data.mean(), data.var()
    dist = Normal(mean, var)
    return dist.Sum(n) / n
\end{verbatim}
12022

12023
{\tt mean} and {\tt var} are the mean and variance of
12024
{\tt data}.  We approximate the distribution of the data with
12025
a normal distribution, {\tt dist}.  
12026

12027
In this example, the data are not normally distributed, so this
12028
approximation is not very good.  But then we compute {\tt dist.Sum(n)
12029
  / n}, which is the sampling distribution of the mean of {\tt n}
12030
values.  Even if the data are not normally distributed, the sampling
12031
distribution of the mean is approximately normal, by the Central Limit Theorem.
12032
\index{Central Limit Theorem}
12033
\index{CLT}
12034

12035
Next, we compute the sampling distribution of the difference
12036
in the means.  The {\tt Normal} class knows how to perform
12037
subtraction using Equation 2:
12038
\index{Normal}
12039

12040
\begin{verbatim}
    def __sub__(self, other):
        return Normal(self.mu - other.mu,
                      self.sigma2 + other.sigma2)
\end{verbatim}
12045

12046
So we can compute the sampling distribution of the difference like this:
12047

12048
\begin{verbatim}
>>> dist = dist1 - dist2
>>> print(dist)
N(0, 0.0032)
\end{verbatim}
12052

12053
The mean is 0, which makes sense because we expect two samples from
12054
the same distribution to have the same mean, on average.  The variance
12055
of the sampling distribution is 0.0032.
12056
\index{sampling distribution}
12057

12058
{\tt Normal} provides {\tt Prob}, which evaluates the normal CDF.
12059
We can use {\tt Prob} to compute the probability of a
12060
difference as large as {\tt delta} under the null hypothesis:
12061
\index{null hypothesis}
12062

12063
\begin{verbatim}
>>> 1 - dist.Prob(delta)
0.084
\end{verbatim}
12067

12068
Which means that the p-value for a one-sided test is 0.084.  For
12069
a two-sided test we would also compute
12070
\index{p-value}
12071
\index{one-sided test}
12072
\index{two-sided test}
12073

12074
\begin{verbatim}
>>> dist.Prob(-delta)
0.084
\end{verbatim}
12078

12079
Which is the same because the normal distribution is symmetric.
12080
The sum of the tails is 0.168, which is consistent with the estimate
12081
in Section~\ref{testdiff}, which was 0.17.
12082
\index{symmetric}
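
The same tail probabilities can be reproduced directly with SciPy,
without the {\tt Normal} class.  This is a minimal sketch that plugs in
the rounded values quoted above (a difference of 0.078 weeks and a
variance of 0.0032), so the results are only approximate; the variable
names are assumptions made for this example.

\begin{verbatim}
import math
import scipy.stats

delta = 0.078                # observed difference in means, in weeks
sigma = math.sqrt(0.0032)    # standard deviation of the difference

# upper tail: probability of a difference of delta or more
p_upper = scipy.stats.norm.sf(delta, loc=0, scale=sigma)

# lower tail: probability of a difference of -delta or less
p_lower = scipy.stats.norm.cdf(-delta, loc=0, scale=sigma)

p_two_sided = p_upper + p_lower
print(p_upper, p_lower, p_two_sided)
\end{verbatim}

Here {\tt sf} is the survival function, which is one minus the CDF.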
12083

12084

12085

12086
\section{Correlation test}
12087

12088
In Section~\ref{corrtest} we used a permutation test for the correlation
12089
between birth weight and mother's age, and found that it is
12090
statistically significant, with p-value less than 0.001.
12091
\index{p-value}
12092
\index{birth weight}
12093
\index{weight!birth}
12094
\index{permutation}
12095
  \index{significant} \index{statistically significant}
12096

12097
Now we can do the same thing analytically.  The method is based
12098
on this mathematical result: given two variables that are normally distributed
12099
and uncorrelated, if we generate a sample with size $n$,
12100
compute Pearson's correlation, $r$, and then compute the transformed
12101
correlation
12102
%
12103
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
12104
%
12105
the distribution of $t$ is Student's t-distribution with parameter
12106
$n-2$.  The t-distribution is an analytic distribution; the CDF can
12107
be computed efficiently using gamma functions.
12108
\index{Pearson coefficient of correlation}
12109
\index{correlation}
12110

12111
We can use this result to compute the sampling distribution of
12112
correlation under the null hypothesis; that is, if we generate
12113
uncorrelated sequences of normal values, what is the distribution of
12114
their correlation?  {\tt StudentCdf} takes the sample size, {\tt n}, and
12115
returns the sampling distribution of correlation:
12116
\index{null hypothesis}
12117
\index{sampling distribution}
12118

12119
\begin{verbatim}
def StudentCdf(n):
    ts = np.linspace(-3, 3, 101)
    ps = scipy.stats.t.cdf(ts, df=n-2)
    rs = ts / np.sqrt(n - 2 + ts**2)
    return thinkstats2.Cdf(rs, ps)
\end{verbatim}
12126

12127
{\tt ts} is a NumPy array of values for $t$, the transformed
12128
correlation.  {\tt ps} contains the corresponding probabilities,
12129
computed using the CDF of the Student's t-distribution implemented in
12130
SciPy.  The parameter of the t-distribution, {\tt df}, stands for
12131
``degrees of freedom.''  I won't explain that term, but you can read
12132
about it at
12133
\url{http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)}.
12134
\index{NumPy}
12135
\index{SciPy}
12136
\index{Student's t-distribution}
12137
\index{distribution!Student's t}
12138
\index{degrees of freedom}
12139

12140
\begin{figure}
12141
% normal.py
12142
\centerline{\includegraphics[height=2.5in]{figs/normal4.pdf}}
12143
\caption{Sampling distribution of correlations for uncorrelated
12144
normal variables.}
12145
\label{normal4}
12146
\end{figure}
12147

12148
To get from {\tt ts} to the correlation coefficients, {\tt rs},
12149
we apply the inverse transform,
12150
%
12151
\[ r = t / \sqrt{n - 2 + t^2} \]
12152
%
12153
The result is the sampling distribution of $r$ under the null hypothesis.
12154
Figure~\ref{normal4} shows this distribution along with the distribution
12155
we generated in Section~\ref{corrtest} by resampling.  They are nearly
12156
identical.  Although the actual distributions are not normal, 
12157
Pearson's coefficient of correlation is based on sample means
12158
and variances.  By the Central Limit Theorem, these moment-based
12159
statistics are approximately normally distributed even if the data are not.
12160
\index{Central Limit Theorem}
12161
\index{CLT}
12162
\index{null hypothesis}
12163
\index{resampling}
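
The inverse transform used in {\tt StudentCdf} follows from the
definition of $t$ by a few lines of algebra.  Squaring both sides of
the definition gives
%
\[ t^2 = \frac{r^2 (n-2)}{1-r^2} \]
%
Multiplying through by $1 - r^2$ and collecting the terms in $r^2$,
%
\[ t^2 = r^2 (n - 2 + t^2) \quad \Rightarrow \quad
   r^2 = \frac{t^2}{n - 2 + t^2} \]
%
Taking the square root, and noting that $r$ and $t$ have the same sign,
yields $r = t / \sqrt{n - 2 + t^2}$.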
12164

12165
From Figure~\ref{normal4}, we can see that the
12166
observed correlation, 0.07, is unlikely to occur if the variables
12167
are actually uncorrelated.
12168
Using the analytic distribution, we can compute just how unlikely:
12169
\index{analytic distribution}
12170

12171
\begin{verbatim}
    t = r * math.sqrt((n-2) / (1-r**2))
    p_value = 1 - scipy.stats.t.cdf(t, df=n-2)
\end{verbatim}
12175

12176
We compute the value of {\tt t} that corresponds to {\tt r=0.07}, and
12177
then evaluate the t-distribution at {\tt t}.  The result is {\tt
12178
  2.9e-11}.  This example demonstrates an advantage of the analytic
12179
method: we can compute very small p-values.  But in practice it
12180
usually doesn't matter.
12181
\index{SciPy}
12182
\index{p-value}
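
When p-values get very small, subtracting the CDF from 1 can lose
precision in floating point.  As a hedged alternative, the sketch below
wraps the same calculation in a function and uses SciPy's survival
function, {\tt sf}, which computes the upper tail directly.  The
function name \verb"correlation_p_value" and the example inputs are
assumptions made for illustration; they are not the values from the
birth weight data.

\begin{verbatim}
import math
import scipy.stats

def correlation_p_value(r, n):
    """One-sided p-value for correlation r with sample size n,
    assuming uncorrelated normal variables under the null."""
    t = r * math.sqrt((n - 2) / (1 - r**2))
    return scipy.stats.t.sf(t, df=n - 2)

# made-up inputs, only to show the call
print(correlation_p_value(0.3, 1000))
\end{verbatim}

For p-values that are not extremely small, the two forms should give
nearly the same answer.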
12183

12184

12185
\section{Chi-squared test}
12186

12187
In Section~\ref{casino2} we used the chi-squared statistic to
12188
test whether a die is crooked.  The chi-squared statistic measures
12189
the total normalized deviation from the expected values in a table:
12190
%
12191
\[ \goodchi^2 = \sum_i \frac{{(O_i - E_i)}^2}{E_i} \]
12192
%
12193
One reason the chi-squared statistic is widely used is that
12194
its sampling distribution under the null hypothesis is analytic;
12195
by a remarkable coincidence\footnote{Not really.}, it is called
12196
the chi-squared distribution.  Like the t-distribution, the
12197
chi-squared CDF can be computed efficiently using gamma functions.
12198
\index{deviation}
12199
\index{null hypothesis}
12200
\index{sampling distribution}
12201
\index{chi-squared test}
12202
\index{chi-squared distribution}
12203
\index{distribution!chi-squared}
12204

12205
\begin{figure}
12206
% normal.py
12207
\centerline{\includegraphics[height=2.5in]{figs/normal5.pdf}}
12208
\caption{Sampling distribution of chi-squared statistics for
12209
a fair six-sided die.}
12210
\label{normal5}
12211
\end{figure}
12212

12213
SciPy provides an implementation of the chi-squared distribution,
12214
which we use to compute the sampling distribution of the
12215
chi-squared statistic:
12216
\index{SciPy}
12217

12218
\begin{verbatim}
def ChiSquaredCdf(n):
    xs = np.linspace(0, 25, 101)
    ps = scipy.stats.chi2.cdf(xs, df=n-1)
    return thinkstats2.Cdf(xs, ps)
\end{verbatim}
12224

12225
Figure~\ref{normal5} shows the analytic result along with the
12226
distribution we got by resampling.  They are very similar,
12227
especially in the tail, which is the part we usually care most
12228
about.
12229
\index{resampling}
12230
\index{tail}
12231

12232
We can use this distribution to compute the p-value of the
12233
observed test statistic, {\tt chi2}:
12234
\index{test statistic}
12235
\index{p-value}
12236

12237
\begin{verbatim}
    p_value = 1 - scipy.stats.chi2.cdf(chi2, df=n-1)
\end{verbatim}
12240

12241
The result is 0.041, which is consistent with the result
12242
from Section~\ref{casino2}.
12243

12244
The parameter of the chi-squared distribution is ``degrees of
12245
freedom'' again.  In this case the correct parameter is {\tt n-1},
12246
where {\tt n} is the size of the table, 6.  Choosing this parameter
12247
can be tricky; to be honest, I am never confident that I have it
12248
right until I generate something like Figure~\ref{normal5} to compare
12249
the analytic results to the resampling results.
12250
\index{degrees of freedom}
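
To make the calculation concrete, here is a minimal end-to-end sketch
for a six-sided die.  The observed counts below are illustrative
values assumed for this example (60 rolls in total); the rest follows
the formula for the chi-squared statistic given above, with {\tt n-1}
degrees of freedom.

\begin{verbatim}
import numpy as np
import scipy.stats

# assumed counts for faces 1 through 6
observed = np.array([8, 9, 19, 5, 8, 11])

# a fair die gives each face the same expected count
expected = np.full(len(observed), observed.sum() / len(observed))

chi2 = np.sum((observed - expected)**2 / expected)
p_value = scipy.stats.chi2.sf(chi2, df=len(observed) - 1)
print(chi2, p_value)
\end{verbatim}

Comparing a result like this with a resampled p-value is one way to
check that the degrees of freedom are right.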
12251

12252

12253
\section{Discussion}
12254

12255
This book focuses on computational methods like resampling and
12256
permutation.  These methods have several advantages over analysis:
12257
\index{resampling}
12258
\index{permutation}
12259
\index{computational methods}
12260

12261
\begin{itemize}
12262

12263
\item They are easier to explain and understand.  For example, one of
12264
  the most difficult topics in an introductory statistics class is
12265
  hypothesis testing.  Many students don't really understand what
12266
  p-values are.  I think the approach I presented in
12267
  Chapter~\ref{testing}---simulating the null hypothesis and
12268
  computing test statistics---makes the fundamental idea clearer.
12269
\index{p-value}
12270
\index{null hypothesis}
12271

12272
\item They are robust and versatile.  Analytic methods are often based
12273
  on assumptions that might not hold in practice.  Computational
12274
  methods require fewer assumptions, and can be adapted and extended
12275
  more easily.
12276
\index{robust}
12277

12278
\item They are debuggable.  Analytic methods are often like a black
12279
  box: you plug in numbers and they spit out results.  But it's easy
12280
  to make subtle errors, hard to be confident that the results are
12281
  right, and hard to find the problem if they are not.  Computational
12282
  methods lend themselves to incremental development and testing,
12283
  which fosters confidence in the results.
12284
\index{debugging}
12285

12286
\end{itemize}
12287

12288
But there is one drawback: computational methods can be slow.  Taking
12289
into account these pros and cons, I recommend the following process:
12290

12291
\begin{enumerate}
12292

12293
\item Use computational methods during exploration.  If you find a
12294
  satisfactory answer and the run time is acceptable, you can stop.
12295
\index{exploration}
12296

12297
\item If run time is not acceptable, look for opportunities to
12298
  optimize.  Using analytic methods is one of several methods of
12299
  optimization.
12300

12301
\item If replacing a computational method with an analytic method is
12302
  appropriate, use the computational method as a basis of comparison, 
12303
  providing mutual validation between the computational and
12304
  analytic results.
12305
\index{model}
12306

12307
\end{enumerate}
12308

12309
For the vast majority of problems I have worked on, I didn't have
12310
to go past Step 1.
12311

12312

12313
\section{Exercises}
12314

12315
A solution to these exercises is in \verb"chap14soln.py".
12316

12317
\begin{exercise}
12318
\label{log_clt}
12319
In Section~\ref{lognormal}, we saw that the distribution
12320
of adult weights is approximately lognormal.  One possible
12321
explanation is that the weight a person
12322
gains each year is proportional to their current weight.
12323
In that case, adult weight is the product of a large number
12324
of multiplicative factors:
12325
%
12326
\[ w = w_0 f_1 f_2 \ldots f_n  \]
12327
%
12328
where $w$ is adult weight, $w_0$ is birth weight, and $f_i$
12329
is the weight gain factor for year $i$.
12330
\index{birth weight}
12331
\index{weight!birth}
12332
\index{lognormal distribution}
12333
\index{distribution!lognormal}
12334
\index{adult weight}
12335

12336
The log of a product is the sum of the logs of the
12337
factors:
12338
%
12339
\[ \log w = \log w_0 + \log f_1 + \log f_2 + \cdots + \log f_n \]
12340
%
12341
So by the Central Limit Theorem, the distribution of $\log w$ is
12342
approximately normal for large $n$, which implies that the
12343
distribution of $w$ is lognormal.
12344
\index{Central Limit Theorem}
12345
\index{CLT}
12346

12347
To model this phenomenon, choose a distribution for $f$ that seems
12348
reasonable, then generate a sample of adult weights by choosing a
12349
random value from the distribution of birth weights, choosing a
12350
sequence of factors from the distribution of $f$, and computing the
12351
product.  What value of $n$ is needed to converge to a lognormal
12352
distribution?
12353
\index{model}
12354

12355
\index{logarithm}
12356
\index{product}
12357

12358
\end{exercise}
12359

12360

12361

12362
\begin{exercise}
12363
In Section~\ref{usingCLT} we used the Central Limit Theorem to find
12364
the sampling distribution of the difference in means, $\delta$, under
12365
the null hypothesis that both samples are drawn from the same
12366
population.
12367
\index{null hypothesis}
12368
\index{sampling distribution}
12369

12370
We can also use this distribution to find the standard error of the
12371
estimate and confidence intervals, but that would only be
12372
approximately correct.  To be more precise, we should compute the
12373
sampling distribution of $\delta$ under the alternate hypothesis that
12374
the samples are drawn from different populations.
12375
\index{standard error}
12376
\index{standard deviation}
12377
\index{confidence interval}
12378

12379
Compute this distribution and use it to calculate the standard error
12380
and a 90\% confidence interval for the difference in means.
12381
\end{exercise}
12382

12383

12384
\begin{exercise}
12385
In a recent paper\footnote{``Evidence for the persistent effects of an
12386
  intervention to mitigate gender-stereotypical task allocation within
12387
  student engineering teams,'' Proceedings of the IEEE Frontiers in Education
12388
Conference, 2014.}, Stein et al.~investigate the
12389
effects of an intervention intended to mitigate gender-stereotypical
12390
task allocation within student engineering teams.
12391

12392
Before and after the intervention, students responded to a survey that
12393
asked them to rate their contribution to each aspect of class projects on
12394
a 7-point scale.
12395

12396
Before the intervention, male students reported higher scores for the
12397
programming aspect of the project than female students; on average men
12398
reported a score of 3.57 with standard error 0.28.  Women reported
12399
1.91, on average, with standard error 0.32.
12400
\index{standard error}
12401

12402
Compute the sampling distribution of the gender gap (the difference in
12403
means), and test whether it is statistically significant.  Because you
12404
are given standard errors for the estimated means, you don't need to
12405
know the sample size to figure out the sampling distributions.
12406
  \index{significant} \index{statistically significant}
12407
\index{sampling distribution}
12408

12409
After the intervention, the gender gap was smaller: the average score
12410
for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE
12411
0.16).  Again, compute the sampling distribution of the gender gap and
12412
test it.
12413
\index{gender gap}
12414

12415
Finally, estimate the change in gender gap; what is the sampling
12416
distribution of this change, and is it statistically significant?
12417
  \index{significant} \index{statistically significant}
12418
\end{exercise}
12419

12420
\cleardoublepage
12421
\phantomsection
12422
\addcontentsline{toc}{chapter}{\indexname}%
12423
\printindex
12424

12425
\clearemptydoublepage
12426
%\blankpage
12427
%\blankpage
12428
%\blankpage
12429

12430

12431
\end{document}