% LaTeX source for ``Think Bayes: Bayesian Statistics Made Simple''
% Copyright 2012 Allen B. Downey.

% License: Creative Commons Attribution-NonCommercial 3.0 Unported License.
% http://creativecommons.org/licenses/by-nc/3.0/
%

\documentclass[12pt]{book}
\usepackage[width=5.5in,height=8.5in,
hmarginratio=3:2,vmarginratio=1:1]{geometry}

% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)

\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{mathpazo}
\usepackage{url}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{makeidx}
\usepackage{setspace}
\usepackage{hevea}
\usepackage{upquote}
\usepackage{fancyhdr}
\usepackage[bookmarks]{hyperref}

\title{Think Bayes}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Bayes: Bayesian Statistics Made Simple}
\newcommand{\theversion}{1.0.8}

% these styles get translated in CSS for the HTML version
\newstyle{a:link}{color:black;}
\newstyle{p+p}{margin-top:1em;margin-bottom:1em}
\newstyle{img}{border:0px}

% change the arrows in the HTML version
\setlinkstext
{\imgsrc[ALT="Previous"]{back.png}}
{\imgsrc[ALT="Up"]{up.png}}
{\imgsrc[ALT="Next"]{next.png}}

\makeindex

\newif\ifplastex
\plastexfalse

\begin{document}

\frontmatter

\ifplastex

\else
\fi

\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}

\ifplastex
\usepackage{localdef}
\maketitle

\else

\newtheorem{exercise}{Exercise}[chapter]

\input{latexonly}

\begin{latexonly}

\newtheoremstyle{exercise}% name of the style to be used
{\topsep}% measure of space to leave above the theorem. E.g.: 3pt
{\topsep}% measure of space to leave below the theorem. E.g.: 3pt
{}% name of font to use in the body of the theorem
{0pt}% measure of space to indent
{\bfseries}% name of head font
{}% punctuation between head and body
{ }% space after theorem head; " " = normal interword space
{}% Manually specify head

\theoremstyle{exercise}

\renewcommand{\blankpage}{\thispagestyle{empty} \quad \newpage}

% TITLE PAGES FOR LATEX VERSION

%-half title--------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vfill

\end{flushright}

%--verso------------------------------------------------------

\blankpage
\blankpage

%--title page--------------------------------------------------
\pagebreak
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge Think Bayes}\\
{\Large Bayesian Statistics Made Simple}
\end{spacing}

\vspace{0.25in}

Version \theversion

\vspace{1in}


{\Large
Allen B. Downey\\
}


\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

\vfill

\end{flushright}


%--copyright--------------------------------------------------
\pagebreak
\thispagestyle{empty}

Copyright \copyright ~2012 Allen B. Downey.


\vspace{0.2in}

\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported
License, which is available at \url{http://creativecommons.org/licenses/by-nc/3.0/}.

\vspace{0.2in}

\end{latexonly}


% HTMLONLY

\begin{htmlonly}

% TITLE PAGE FOR HTML VERSION

{\Large \thetitle}

{\large Allen B. Downey}

Version \theversion

\vspace{0.25in}

Copyright 2012 Allen B. Downey

\vspace{0.25in}

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 3.0
Unported License, which is available at
\url{http://creativecommons.org/licenses/by-nc/3.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

\fi
% END OF THE PART WE SKIP FOR PLASTEX

\chapter{Preface}
\label{preface}

\section{My theory, which is mine}

The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.

Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.

I think this presentation is easier to understand, at least for people with
programming skills. It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.

Also, it provides a smooth development path from simple examples to
real-world problems. Chapter~\ref{estimation} is a good example. It
starts with a simple example involving dice, one of the staples of
basic probability. From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's {\it
Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.


\section{Modeling and approximation}

Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}

For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}

In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.

I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).

Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors. But for
real-world problems, numerical errors are almost always
smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.

On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.

So I recommend a general process with these steps:

\begin{enumerate}

\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.

\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.

\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider. You can review your code
and look for optimizations; for example, if you cache previously
computed results, you might be able to avoid redundant computation
(see the sketch after this list). Or you can look for analytic
methods that yield computational shortcuts.

\end{enumerate}
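
To make the caching idea concrete, here is a minimal, hypothetical
sketch; the function names are mine, chosen only for illustration,
and the slow function is a stand-in for whatever computation is
expensive in your model:

\begin{verbatim}
cache = {}

def SlowLikelihood(data, hypo):
    # stand-in for an expensive computation
    return 1.0 / hypo

def CachedLikelihood(data, hypo):
    # reuse a previously computed result when we can
    key = data, hypo
    if key not in cache:
        cache[key] = SlowLikelihood(data, hypo)
    return cache[key]
\end{verbatim}

If the same combination of data and hypothesis comes up many times,
the expensive computation runs only once.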

One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.

Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}


\section{Working with the code}
\label{download}

The code examples used in this book are available from
\url{https://github.com/AllenDowney/ThinkBayes}. Git is a version
control system that allows you to keep track of the files that
make up a project. A collection of files under Git's control is
called a ``repository''. GitHub is a hosting service that provides
storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to
work with the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}

\item Or you could clone
my repository. You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.

\end{itemize}

The code for the first edition of the book works with Python 2.
If you are using Python 3, you might want to use the updated code
in \url{https://github.com/AllenDowney/ThinkBayes2} instead.

I developed this book using Anaconda from
Continuum Analytics, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it does a user-level
installation, not system-level, so you don't need administrative
privileges. You can
download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation,
\url{http://www.scipy.org/};
\index{SciPy}

\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}

Many of the examples in this book use classes and functions defined in
{\tt thinkbayes.py}. Some of them also use {\tt thinkplot.py}, which
provides wrappers for some of the functions in {\tt pyplot}, which is
part of {\tt matplotlib}.


\section{Code style}

Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}

Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".

I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.

Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.

And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.


\section{Prerequisites}

There are several excellent modules for doing Bayesian statistics in
Python, including {\tt pymc} and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.

Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code. Chapter~\ref{compstat} introduces {\tt Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces {\tt Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.

In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).


\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!

\small

\begin{itemize}

\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods. With his
permission, I use several problems from
his book as examples.

\item This book also benefited from my interactions with Sanjoy
Mahajan, especially in fall 2012, when I audited his class on
Bayesian Inference at Olin College.

\item I wrote parts of this book during project nights with the Boston
Python User Group, so I would like to thank them for their
company and pizza.

\item Jonathan Edwards sent in the first typo.

\item George Purkins found a markup error.

\item Olivier Yiptong sent several helpful suggestions.

\item Yuriy Pasichnyk found several errors.

\item Kristopher Overholt sent a long list of corrections and suggestions.

\item Robert Marcus found a misplaced {\it i}.

\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.

\item Markus Dobler pointed out that drawing cookies from a bowl
with replacement is an unrealistic scenario.

\item Tom Pollard and Paul A. Giannaros spotted a version problem with
some of the numbers in the train example.

\item Ram Limbu found a typo and suggested a clarification.

\item In spring 2013, students in my class, Computational Bayesian
Statistics, made many helpful corrections and suggestions: Kai
Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun
Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,
Brendan Ritter, and Evan Simpson.

\item Greg Marra and Matt Aasted helped me clarify the discussion of
{\it The Price is Right} problem.

\item Marcus Ogren pointed out that the original statement of the
locomotive problem was ambiguous.

\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the
book and found many opportunities for improvement.

\item James Lawry spotted a math error.

\item Ben Kahle found a reference to the wrong figure.

\item Jeffrey Law found an inconsistency between the text and the code.

\item Linda Pescatore found a typo and made some helpful suggestions.

\item Tomasz Mi\k{a}sko sent many excellent corrections and suggestions.

% ENDCONTRIB

\end{itemize}

\normalsize

\clearemptydoublepage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\clearemptydoublepage

\end{latexonly}

% START THE BOOK
\mainmatter

\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}

\chapter{Bayes's Theorem}
\label{intro}

\section{Conditional probability}

The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}

A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction. The value
1 represents certainty that a fact is true, or that a prediction
will come true. The value 0 represents certainty
that the fact is false.
\index{degree of belief}

Intermediate values represent degrees of certainty. The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not. For example, the probability that a tossed
coin lands face up is very close to 50\%.
\index{coin toss}

A conditional probability is a probability based on some background
information. For example, I want to know the probability
that I will have a heart attack in the next year. According to the
CDC, ``Every year about 785,000 Americans have a first coronary attack''
(\url{http://www.cdc.gov/heartdisease/facts.htm}).
\index{heart attack}

The U.S. population is about 311 million, so the probability that a
randomly chosen American will have a heart attack in the next year is
roughly 0.3\%.

But I am not a randomly chosen American. Epidemiologists have
identified many factors that affect the risk of heart attacks;
depending on those factors, my risk might be higher or lower than
average.

I am male, 45 years old, and I have borderline high cholesterol.
Those factors increase my chances. However, I have low blood pressure
and I don't smoke, and those factors decrease my chances.

Plugging everything into the online calculator at
\url{http://cvdrisk.nhlbi.nih.gov/calculator.asp}, I find that my
risk of a heart attack in the next year is about 0.2\%, less than the
national average. That value is a conditional probability, because it
is based on a number of factors that make up my ``condition.''

The usual notation for conditional probability is \p{A|B}, which
is the probability of $A$ given that $B$ is true. In this
example, $A$ represents the prediction that I will have a heart
attack in the next year, and $B$ is the set of conditions I listed.


\section{Conjoint probability}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true. I write \p{A \AND B} to mean the
probability that $A$ and $B$ are both true.
\index{conjoint probability}
\index{probability!conjoint}

If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.

But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the outcome of the first event does
not change the probability of the second. Or, more formally,
\p{B|A} = \p{B}.
\index{independence}
\index{dependence}

Here is a different example where the events are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow. If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.

In general, the probability of a conjunction is
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
for any $A$ and $B$. So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.
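
To make that concrete, suppose the conditional probability of rain
tomorrow given rain today were 0.7 (a number chosen only for
illustration). Then the chance of two consecutive rainy days would be
%
\[ \p{A \AND B} = \p{A}~\p{B|A} = (0.5)(0.7) = 0.35 \]
%
which is indeed a bit higher than the 0.25 we would get by wrongly
assuming independence.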

\section{The cookie problem}

We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.} Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.
\index{Bayes's theorem}
\index{cookie problem}

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla. What is
the probability that it came from Bowl 1?

This is a conditional probability; we want \p{\T{Bowl 1} |
\T{vanilla}}, but it is not obvious how to compute it. If I asked a
different question---the probability of a vanilla cookie given Bowl
1---it would be easy:
%
\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]
%
Sadly, \p{A|B} is {\em not} the same as \p{B|A}, but there
is a way to get from one to the other: Bayes's theorem.


\section{Bayes's theorem}

At this point we have everything we need to derive Bayes's theorem.
We'll start with the observation that conjunction is commutative; that is
%
\[ \p{A \AND B} = \p{B \AND A} \]
%
for any events $A$ and $B$.
\index{Bayes's theorem!derivation}
\index{conjunction}

Next, we write the probability of a conjunction:
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
Since we have not said anything about what $A$ and $B$ mean, they
are interchangeable. Interchanging them yields
%
\[ \p{B \AND A} = \p{B}~\p{A|B} \]
%
That's all we need. Pulling those pieces together, we get
%
\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]
%
This means there are two ways to compute the conjunction.
If you have \p{A}, you multiply by the conditional
probability \p{B|A}. Or you can do it the other way around; if you
know \p{B}, you multiply by \p{A|B}. Either way you should get
the same thing.

Finally we can divide through by \p{B}:
%
\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]
%
And that's Bayes's theorem! It might not look like much, but
it turns out to be surprisingly powerful.

For example, we can use it to solve the cookie problem. I'll write
$B_1$ for the hypothesis that the cookie came from Bowl 1
and $V$ for the vanilla cookie. Plugging in Bayes's theorem
we get
%
\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]
%
The term on the left is what we want: the probability of Bowl 1, given
that we chose a vanilla cookie. The terms on the right are:

\begin{itemize}

\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned
by what kind of cookie we got. Since the problem says we chose a
bowl at random, we can assume $\p{B_1} = 1/2$.

\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.

\item \p{V}: This is the probability of drawing a vanilla cookie from
either bowl. Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie. Between the two bowls there are
50 vanilla and 30
chocolate cookies, so \p{V} = 5/8.

\end{itemize}

Putting it together, we have
%
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
%
which reduces to 3/5. So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
\index{evidence}

This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from \p{B|A} to \p{A|B}. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.


\section{The diachronic interpretation}

There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.
\index{diachronic interpretation}

This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}. ``Diachronic'' means that something
is happening over time; in this case
the probability of the hypotheses changes, over time, as
we see new data.

Rewriting Bayes's theorem with $H$ and $D$ yields:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
In this interpretation, each term has a name:
\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}

\begin{itemize}

\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.

\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.

\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.

\item \p{D} is the probability of the data under any hypothesis,
called the {\bf normalizing constant}.

\end{itemize}

Sometimes we can compute the prior based on background
information. For example, the cookie problem specifies that we choose
a bowl at random with equal probability.

In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.
\index{subjective prior}

The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.

The normalizing constant can be tricky. It is supposed to be the
probability of seeing the data under any hypothesis at all, but in the
most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses
that are
\index{mutually exclusive}
\index{collectively exhaustive}

\begin{description}

\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and

\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.

\end{description}

I use the word {\bf suite} for a set of hypotheses that has these
properties.
\index{suite}

In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.

In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
%
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier by mentally combining the two
bowls.
\index{total probability}


\newcommand{\MM}{M\&M}

\section{The \MM~problem}

\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}

In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.

Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag. One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?

This problem is similar to the cookie problem, with the twist that I
draw one sample from each bowl/bag. This problem also gives me a
chance to demonstrate the table method, which is useful for solving
problems like this on paper. In the next chapter we will
solve them computationally.
\index{table method}

The first step is to enumerate the hypotheses. The bag the yellow
\MM~came from I'll call Bag 1; I'll call the other Bag 2. So
the hypotheses are:

\begin{itemize}

\item A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.

\item B: Bag 1 is from 1996 and Bag 2 from 1994.

\end{itemize}

Now we construct a table with a row for each hypothesis and a
column for each term in Bayes's theorem:

\begin{tabular}{|c|c|c|c|c|}
\hline
 & Prior & Likelihood & & Posterior \\
 & \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/2 & (20)(20) & 200 & 20/27 \\
B & 1/2 & (14)(10) & 70 & 7/27 \\
\hline
\end{tabular}

The first column has the priors.
Based on the statement of the problem,
it is reasonable to choose $\p{A} = \p{B} = 1/2$.

The second column has the likelihoods, which follow from the
information in the problem. For example, if $A$ is true, the yellow
\MM~came from the 1994 bag with probability 20\%, and the green came
from the 1996 bag with probability 20\%. If $B$ is true, the yellow
\MM~came from the 1996 bag with probability 14\%, and the green came
from the 1994 bag with probability 10\%.
Because the selections are
independent, we get the conjoint probability by multiplying.
\index{independence}

The third column is just the product of the previous two.
The sum of this column, 270, is the normalizing constant.
To get the last column, which contains the posteriors, we divide
the third column by the normalizing constant.
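
For hypothesis A, for example, the arithmetic is
%
\[ \p{A|D} = \frac{200}{200 + 70} = \frac{200}{270} = \frac{20}{27} \]
%
and similarly $\p{B|D} = 70/270 = 7/27$.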

That's it. Simple, right?

Well, you might be bothered by one detail. I write \p{D|H}
in terms of percentages, not probabilities, which means it
is off by a factor of 10,000. But that
cancels out when we divide through by the normalizing constant, so
it doesn't affect the result.
\index{normalizing constant}

When the set of hypotheses is mutually exclusive and collectively
exhaustive, you can multiply the likelihoods by any factor, if it is
convenient, as long as you apply the same factor to the entire column.


\section{The Monty Hall problem}

The Monty Hall problem might be the most contentious question in
the history of probability. The scenario is simple, but the correct
answer is so counterintuitive that many people just can't accept
it, and many smart people have embarrassed themselves not just by
getting it wrong but by arguing the wrong side, aggressively,
in public.
\index{Monty Hall problem}

Monty Hall was the original host of the game show {\em Let's Make a
Deal}. The Monty Hall problem is based on one of the regular
games on the show. If you are on the show, here's what happens:

\begin{itemize}

\item Monty shows you three closed doors and tells you that there is a
prize behind each door: one prize is a car, the other two are less
valuable prizes like peanut butter and fake finger nails. The
prizes are arranged at random.

\item The object of the game is to guess which door has the car. If
you guess right, you get to keep the car.

\item You pick a door, which we will call Door A. We'll call the
other doors B and C.

\item Before opening the door you chose, Monty increases the
suspense by opening either Door B or C, whichever does not
have the car. (If the car is actually behind Door A, Monty can
safely open B or C, so he chooses one at random.)

\item Then Monty offers you the option to stick with your original
choice or switch to the one remaining unopened door.

\end{itemize}

The question is, should you ``stick'' or ``switch'' or does it
make no difference?
\index{stick}
\index{switch}
\index{intuition}

Most people have the strong intuition that it makes no difference.
There are two doors left, they reason, so the chance that the car
is behind Door A is 50\%.

But that is wrong. In fact, the chance of winning if you stick
with Door A is only 1/3; if you switch, your chances are 2/3.

By applying Bayes's theorem, we can break this problem into simple
pieces, and maybe convince ourselves that the correct answer is,
in fact, correct.

To start, we should make a careful statement of the data. In
this case $D$ consists of two parts: Monty chooses Door B
{\em and} there is no car there.

Next we define three hypotheses: $A$, $B$, and $C$ represent the
hypothesis that the car is behind Door A, Door B, or Door C.
Again, let's apply the table method:

\begin{tabular}{|c|c|c|c|c|}
\hline
 & Prior & Likelihood & & Posterior \\
 & \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/3 & 1/2 & 1/6 & 1/3 \\
B & 1/3 & 0 & 0 & 0 \\
C & 1/3 & 1 & 1/3 & 2/3 \\
\hline
\end{tabular}

Filling in the priors is easy because we are told that the prizes
are arranged at random, which suggests that the car is equally
likely to be behind any door.

Figuring out the likelihoods takes some thought, but with reasonable
care we can be confident that we have it right:

\begin{itemize}

\item If the car is actually behind A, Monty could safely open Doors B
or C. So the probability that he chooses B is 1/2. And since the
car is actually behind A, the probability that the car is not behind
B is 1.

\item If the car is actually behind B, Monty has to open door C, so
the probability that he opens door B is 0.

\item Finally, if the car is behind Door C, Monty opens B with
probability 1 and finds no car there with probability 1.

\end{itemize}

Now the hard part is over; the rest is just arithmetic. The
sum of the third column is 1/2. Dividing through yields
$\p{A|D} = 1/3$ and $\p{C|D} = 2/3$. So you are better off switching.

There are many variations of the Monty Hall problem. One of the
strengths of the Bayesian approach is that it generalizes to handle
these variations.

For example, suppose that Monty always chooses B if he can, and
only chooses C if he has to (because the car is behind B). In
that case the revised table is:

\begin{tabular}{|c|c|c|c|c|}
\hline
 & Prior & Likelihood & & Posterior \\
 & \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/3 & 1 & 1/3 & 1/2 \\
B & 1/3 & 0 & 0 & 0 \\
C & 1/3 & 1 & 1/3 & 1/2 \\
\hline
\end{tabular}

The only change is \p{D|A}. If the car is behind $A$, Monty can
choose to open B or C. But in this variation he always chooses
B, so $\p{D|A} = 1$.

As a result, the likelihoods are the same for $A$ and $C$, and the
posteriors are the same: $\p{A|D} = \p{C|D} = 1/2$. In this case, the
fact that Monty chose B reveals no information about the location of
the car, so it doesn't matter whether the contestant sticks or
switches.

On the other hand, if he had opened $C$, we would know $\p{B|D} = 1$.

I included the Monty Hall problem in this chapter because I think it
is fun, and because Bayes's theorem makes the complexity of the
problem a little more manageable. But it is not a typical use of
Bayes's theorem, so if you found it confusing, don't worry!

\section{Discussion}

For many problems involving conditional probability, Bayes's theorem
provides a divide-and-conquer strategy. If \p{A|B} is hard to
compute, or hard to measure experimentally, check whether it might be
easier to compute the other terms in Bayes's theorem, \p{B|A}, \p{A}
and \p{B}.
\index{divide-and-conquer}

If the Monty Hall problem is your idea of fun, I have collected a
number of similar problems in an article called ``All your Bayes are
belong to us,'' which you can read at
\url{http://allendowney.blogspot.com/2011/10/all-your-bayes-are-belong-to-us.html}.

\chapter{Computational Statistics}
\label{compstat}

\section{Distributions}

In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}

For example, if you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}

As another example, you might be interested in how many times each
word appears in common English usage. You could build a distribution
that includes each word and how many times it appears.
\index{word frequency}

To represent a distribution in Python, you could use a dictionary that
maps from each value to its probability. I have written a class
called {\tt Pmf} that uses a Python dictionary in exactly that way,
and provides a number of useful methods.
I called the class Pmf in reference to
a {\bf probability mass function}, which is a way to
represent a distribution mathematically.
\index{probability mass function}
\index{Pmf class}

{\tt Pmf} is defined in a Python module I wrote to accompany this
book, {\tt thinkbayes.py}. You can download it from
\url{http://thinkbayes.com/thinkbayes.py}. For more information
see Section~\ref{download}.

To use {\tt Pmf} you can import it like this:

\begin{verbatim}
from thinkbayes import Pmf
\end{verbatim}

The following code builds a Pmf to represent the distribution
of outcomes for a six-sided die:

\begin{verbatim}
pmf = Pmf()
for x in [1,2,3,4,5,6]:
    pmf.Set(x, 1/6.0)
\end{verbatim}

\verb"Pmf" creates an empty Pmf with no values. The
\verb"Set" method sets the probability associated with each
value to $1/6$.

Here's another example that counts the number of times each word
appears in a sequence:

\begin{verbatim}
pmf = Pmf()
for word in word_list:
    pmf.Incr(word, 1)
\end{verbatim}

\verb"Incr" increases the ``probability'' associated with each
word by 1. If a word is not already in the Pmf, it is added.

I put ``probability'' in quotes because in this example, the
probabilities are not normalized; that is, they do not add up to 1.
So they are not true probabilities.

But in this example the word counts are proportional to the
probabilities. So after we count all the words, we can compute
probabilities by dividing through by the total number of words. {\tt
Pmf} provides a method, \verb"Normalize", that does exactly that:
\index{Pmf methods}

\begin{verbatim}
pmf.Normalize()
\end{verbatim}

Once you have a Pmf object, you can ask for the probability
associated with any value:
\index{Prob}

\begin{verbatim}
print pmf.Prob('the')
\end{verbatim}

And that would print the frequency of the word ``the'' as a fraction
of the words in the list.

Pmf uses a Python dictionary to store the values and their
probabilities, so the values in the Pmf can be any hashable type.
The probabilities can be any numerical type, but they are usually
floating-point numbers (type \verb"float").


\section{The cookie problem}

In the context of Bayes's theorem, it is natural to use a Pmf
to map from each hypothesis to its probability. In the cookie
problem, the hypotheses are $B_1$ and $B_2$. In Python, I
represent them with strings:
\index{cookie problem}

\begin{verbatim}
pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)
\end{verbatim}

This distribution, which contains the priors for each hypothesis,
is called (wait for it) the {\bf prior distribution}.
\index{prior distribution}

To update the distribution based on new data (the vanilla cookie),
we multiply each prior by the corresponding likelihood. The likelihood
of drawing a vanilla cookie from Bowl 1 is 3/4. The likelihood
for Bowl 2 is 1/2.
\index{Mult}

\begin{verbatim}
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)
\end{verbatim}

\verb"Mult" does what you would expect. It gets the probability
for the given hypothesis and multiplies by the given likelihood.

After this update, the distribution is no longer normalized, but
because these hypotheses are mutually exclusive and collectively
exhaustive, we can {\bf renormalize}:
\index{renormalize}

\begin{verbatim}
pmf.Normalize()
\end{verbatim}
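
Under the hood, both operations are simple manipulations of the
underlying dictionary. Here is a minimal sketch, not the actual
implementation in {\tt thinkbayes.py}, assuming the dictionary is
stored in an attribute named {\tt d}:

\begin{verbatim}
class Pmf(object):

    def __init__(self):
        # maps from each value to its probability
        self.d = {}

    def Set(self, x, p):
        self.d[x] = p

    def Mult(self, x, factor):
        # treat a missing value as probability 0
        self.d[x] = self.d.get(x, 0) * factor

    def Normalize(self):
        # divide through so the probabilities add up to 1
        total = float(sum(self.d.values()))
        for x in self.d:
            self.d[x] /= total
\end{verbatim}

The real class handles more edge cases and provides many more
methods, but the arithmetic of an update is no more complicated
than this.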

The result is a distribution that contains the posterior probability
for each hypothesis, which is called (wait now) the
{\bf posterior distribution}.
\index{posterior distribution}

Finally, we can get the posterior probability for Bowl 1:

\begin{verbatim}
print pmf.Prob('Bowl 1')
\end{verbatim}

And the answer is 0.6. You can download this example
from \url{http://thinkbayes.com/cookie.py}. For more information
see Section~\ref{download}.
\index{cookie.py}


\section{The Bayesian framework}
\label{framework}

\index{Bayesian framework}
Before we go on to other problems, I want to rewrite the code
from the previous section to make it more general. First I'll
define a class to encapsulate the code related to this problem:

\begin{verbatim}
class Cookie(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

A Cookie object is a Pmf that maps from hypotheses to their
probabilities. The \verb"__init__" method gives each hypothesis
the same prior probability. As in the previous section, there are
two hypotheses:

\begin{verbatim}
hypos = ['Bowl 1', 'Bowl 2']
pmf = Cookie(hypos)
\end{verbatim}

\verb"Cookie" provides an \verb"Update" method that takes
data as a parameter and updates the probabilities:
\index{Update}

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

\verb"Update" loops through each hypothesis in the suite
and multiplies its probability by the likelihood of the
data under the hypothesis, which is computed by \verb"Likelihood":
\index{Likelihood}

\begin{verbatim}
    mixes = {
        'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
        'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
        }

    def Likelihood(self, data, hypo):
        mix = self.mixes[hypo]
        like = mix[data]
        return like
\end{verbatim}

\verb"Likelihood" uses \verb"mixes", which is a dictionary
that maps from the name of a bowl to the mix of cookies in
the bowl.

Here's what the update looks like:

\begin{verbatim}
pmf.Update('vanilla')
\end{verbatim}

And then we can print the posterior probability of each hypothesis:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

The result is

\begin{verbatim}
Bowl 1 0.6
Bowl 2 0.4
\end{verbatim}

which is the same as what we got before. This code is more complicated
than what we saw in the previous section. One advantage is that it
generalizes to the case where we draw more than one cookie from the
same bowl (with replacement):

\begin{verbatim}
dataset = ['vanilla', 'chocolate', 'vanilla']
for data in dataset:
    pmf.Update(data)
\end{verbatim}

The other advantage is that it provides a framework for solving many
similar problems. In the next section we'll solve the Monty Hall
problem computationally and then see what parts of the framework are
the same.

The code in this section is available from
\url{http://thinkbayes.com/cookie2.py}.
For more information
see Section~\ref{download}.

\section{The Monty Hall problem}

To solve the Monty Hall problem, I'll define a new class:
\index{Monty Hall problem}

\begin{verbatim}
class Monty(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

So far \verb"Monty" and \verb"Cookie" are exactly the same.
And the code that creates the Pmf is the same, too, except for
the names of the hypotheses:

\begin{verbatim}
hypos = 'ABC'
pmf = Monty(hypos)
\end{verbatim}

Calling \verb"Update" is pretty much the same:

\begin{verbatim}
data = 'B'
pmf.Update(data)
\end{verbatim}

And the implementation of \verb"Update" is exactly the same:

\begin{verbatim}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
\end{verbatim}

The only part that requires some work is \verb"Likelihood":

\begin{verbatim}
    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

Finally, printing the results is the same:

\begin{verbatim}
for hypo, prob in pmf.Items():
    print hypo, prob
\end{verbatim}

And the answer is

\begin{verbatim}
A 0.333333333333
B 0.0
C 0.666666666667
\end{verbatim}

In this example, writing \verb"Likelihood" is a little complicated,
but the framework of the Bayesian update is simple. The code in
this section is available from \url{http://thinkbayes.com/monty.py}.
For more information
see Section~\ref{download}.

\section{Encapsulating the framework}

\index{Suite class}
Now that we see what elements of the framework are the same, we
can encapsulate them in an object---a \verb"Suite" is a \verb"Pmf"
that provides \verb"__init__", \verb"Update", and \verb"Print":

\begin{verbatim}
class Suite(Pmf):
    """Represents a suite of hypotheses and their probabilities."""

    def __init__(self, hypo=tuple()):
        """Initializes the distribution."""

    def Update(self, data):
        """Updates each hypothesis based on the data."""

    def Print(self):
        """Prints the hypotheses and their probabilities."""
\end{verbatim}

The implementation of \verb"Suite" is in \verb"thinkbayes.py". To use
\verb"Suite", you should write a class that inherits from it and
provides \verb"Likelihood". For example, here is the solution to the
Monty Hall problem rewritten to use \verb"Suite":

\begin{verbatim}
from thinkbayes import Suite

class Monty(Suite):

    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
\end{verbatim}

And here's the code that uses this class:

\begin{verbatim}
suite = Monty('ABC')
suite.Update('B')
suite.Print()
\end{verbatim}

You can download this example from
\url{http://thinkbayes.com/monty2.py}.
For more information
see Section~\ref{download}.


\section{The \MM~problem}

\index{M and M problem}
We can use the \verb"Suite" framework to solve the \MM~problem.
Writing the \verb"Likelihood" function is tricky, but everything
else is straightforward.

First I need to encode the color mixes from before and
after 1995:

\begin{verbatim}
mix94 = dict(brown=30,
             yellow=20,
             red=20,
             green=10,
             orange=10,
             tan=10)

mix96 = dict(blue=24,
             green=20,
             orange=16,
             yellow=14,
             red=13,
             brown=13)
\end{verbatim}

Then I have to encode the hypotheses:

\begin{verbatim}
hypoA = dict(bag1=mix94, bag2=mix96)
hypoB = dict(bag1=mix96, bag2=mix94)
\end{verbatim}

\verb"hypoA" represents the hypothesis that Bag 1 is from
1994 and Bag 2 from 1996. \verb"hypoB" is the other way
around.

Next I map from the name of the hypothesis to the representation:

\begin{verbatim}
hypotheses = dict(A=hypoA, B=hypoB)
\end{verbatim}

And finally I can write \verb"Likelihood". In this case
the hypothesis, \verb"hypo", is a string, either \verb"A" or \verb"B".
The data is a tuple that specifies a bag
and a color.

\begin{verbatim}
    def Likelihood(self, data, hypo):
        bag, color = data
        mix = self.hypotheses[hypo][bag]
        like = mix[color]
        return like
\end{verbatim}
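
Putting the pieces together, the whole class might look like this (a
sketch; the definitive version is in the file listed at the end of
this section):

\begin{verbatim}
class M_and_M(Suite):

    hypotheses = dict(A=hypoA, B=hypoB)

    def Likelihood(self, data, hypo):
        bag, color = data
        mix = self.hypotheses[hypo][bag]
        like = mix[color]
        return like
\end{verbatim}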

Here's the code that creates the suite and updates it:

\begin{verbatim}
suite = M_and_M('AB')

suite.Update(('bag1', 'yellow'))
suite.Update(('bag2', 'green'))

suite.Print()
\end{verbatim}

And here's the result:

\begin{verbatim}
A 0.740740740741
B 0.259259259259
\end{verbatim}

The posterior probability of A is approximately $20/27$, which is what
we got before.

The code in this section is available from
\url{http://thinkbayes.com/m_and_m.py}. For more information see
Section~\ref{download}.

\section{Discussion}

This chapter presents the Suite class, which encapsulates the
Bayesian update framework.

{\tt Suite} is an {\bf abstract type}, which means that it defines the
interface a Suite is supposed to have, but does not provide a complete
implementation. The {\tt Suite} interface includes {\tt Update} and
{\tt Likelihood}, but the {\tt Suite} class only provides an
implementation of {\tt Update}, not {\tt Likelihood}.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

A {\bf concrete type} is a class that extends an abstract parent
class and provides an implementation of the missing methods.
For example, {\tt Monty} extends {\tt Suite}, so it inherits
{\tt Update} and provides {\tt Likelihood}.

If you are familiar with
design patterns, you might recognize this as an example of the
template method pattern.
You can read about this pattern at
\url{http://en.wikipedia.org/wiki/Template_method_pattern}.
\index{template method pattern}

Most of the examples in the following chapters follow the same
pattern; for each problem we define a new class that extends {\tt
Suite}, inherits {\tt Update}, and provides {\tt Likelihood}. In a
few cases we override {\tt Update}, usually to improve performance.
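
Seen as code, the pattern is compact. Here is a sketch (not the
actual implementation in {\tt thinkbayes.py}): the parent class fixes
the update loop and declares {\tt Likelihood}, and each child class
fills in {\tt Likelihood} for its problem.

\begin{verbatim}
class Suite(Pmf):

    def Update(self, data):
        # the template method: this loop is fixed here...
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()

    def Likelihood(self, data, hypo):
        # ...and this step is supplied by a child class
        raise NotImplementedError
\end{verbatim}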

\section{Exercises}

\begin{exercise}

In Section~\ref{framework} I said that the solution to the cookie
problem generalizes to the case where we draw multiple cookies
with replacement.

But in the more likely scenario where we eat the cookies we draw,
the likelihood of each draw depends on the previous draws.

Modify the solution in this chapter to handle selection without
replacement. Hint: add instance variables to {\tt Cookie} to
represent the hypothetical state of the bowls, and modify
{\tt Likelihood} accordingly. You might want to define a
{\tt Bowl} object.

\end{exercise}
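
As a starting point, and without giving the whole solution away, the
hypothetical state might be set up like this (a sketch only; the
attribute name and layout are my choices, not part of the book's
code):

\begin{verbatim}
class Cookie(Pmf):

    def __init__(self, hypos):
        Pmf.__init__(self)
        # under each hypothesis, all draws come from that bowl,
        # so each hypothesis gets its own mutable state
        self.bowls = {
            'Bowl 1': dict(vanilla=30, chocolate=10),
            'Bowl 2': dict(vanilla=20, chocolate=20),
        }
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
\end{verbatim}

The remaining work is to make {\tt Likelihood} compute the
probability of the drawn cookie from the current counts, and then
decrement the count of that cookie.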
1605
1606
1607
1608
1609
\chapter{Estimation}
1610
\label{estimation}
1611
1612
\section{The dice problem}
1613
1614
\index{Dice problem}
1615
Suppose I have a box of dice that contains a 4-sided die, a 6-sided
1616
die, an 8-sided die, a 12-sided die, and a 20-sided die. If you
1617
have ever played {\it Dungeons~\&~Dragons}, you know what I am talking about.
1618
\index{Dungeons and Dragons}
1619
1620
Suppose I select a die from the box at random, roll it, and get a 6.
1621
What is the probability that I rolled each die?
1622
\index{dice}
1623
1624
Let me suggest a three-step strategy for approaching a problem like this.

\begin{enumerate}

\item Choose a representation for the hypotheses.

\item Choose a representation for the data.

\item Write the likelihood function.

\end{enumerate}

In previous examples I used strings to represent hypotheses and
data, but for the dice problem I'll use numbers. Specifically,
I'll use the integers 4, 6, 8, 12, and 20 to represent hypotheses:

\begin{verbatim}
suite = Dice([4, 6, 8, 12, 20])
\end{verbatim}

And integers from 1 to 20 for the data.
These representations make it easy to
write the likelihood function:

\begin{verbatim}
class Dice(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

Here's how \verb"Likelihood" works. If \verb"hypo<data", that
means the roll is greater than the number of sides on the die.
That can't happen, so the likelihood is 0.

Otherwise the question is, ``Given that there are {\tt hypo}
sides, what is the chance of rolling {\tt data}?'' The
answer is \verb"1/hypo", regardless of {\tt data}.

Here is the statement that does the update (if I roll a 6):

\begin{verbatim}
suite.Update(6)
\end{verbatim}

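To display the result we can use {\tt Print} again:

\begin{verbatim}
suite.Print()
\end{verbatim}
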
And here is the posterior distribution:

\begin{verbatim}
4 0.0
6 0.392156862745
8 0.294117647059
12 0.196078431373
20 0.117647058824
\end{verbatim}

After we roll a 6, the probability for the 4-sided die is 0. The
most likely alternative is the 6-sided die, but there is still
almost a 12\% chance for the 20-sided die.

What if we roll a few more times and get 6, 8, 7, 7, 5, and 4?

\begin{verbatim}
for roll in [6, 8, 7, 7, 5, 4]:
    suite.Update(roll)
\end{verbatim}

With this data the 6-sided die is eliminated, and the 8-sided
die seems quite likely. Here are the results:

\begin{verbatim}
4 0.0
6 0.0
8 0.943248453672
12 0.0552061280613
20 0.0015454182665
\end{verbatim}

Now the probability is 94\% that we are rolling the 8-sided die,
and less than 1\% for the 20-sided die.

The dice problem is based on an example I saw in Sanjoy Mahajan's
class on Bayesian inference. You can download the code in this
section from \url{http://thinkbayes.com/dice.py}.
For more information see Section~\ref{download}.

\section{The locomotive problem}

\index{locomotive problem}
\index{Mosteller, Frederick}
\index{German tank problem}
I found the locomotive problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
Probability with Solutions} (Dover, 1987):

\begin{quote}
``A railroad numbers its locomotives in order 1..N. One day you see a
locomotive with the number 60. Estimate how many locomotives the
railroad has.''
\end{quote}

Based on this observation, we know the railroad has 60 or more
locomotives. But how many more? To apply Bayesian reasoning, we
can break this problem into two steps:

\begin{enumerate}

\item What did we know about $N$ before we saw the data?

\item For any given value of $N$, what is the likelihood of
seeing the data (a locomotive with number 60)?

\end{enumerate}

The answer to the first question is the prior. The answer to the
second is the likelihood.

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
\caption{Posterior distribution for the locomotive problem, based
on a uniform prior.}
\label{fig.train1}
\end{figure}

We don't have much basis to choose a prior, but we can start with
something simple and then consider alternatives. Let's assume that
$N$ is equally likely to be any value from 1 to 1000.

\begin{verbatim}
hypos = xrange(1, 1001)
\end{verbatim}

Now all we need is a likelihood function. In a hypothetical fleet of
$N$ locomotives, what is the probability that we would see number 60?
If we assume that there is only one train-operating company (or only
one we care about) and that we are equally likely to see any of its
locomotives, then the chance of seeing any particular locomotive is
$1/N$.

Here's the likelihood function:
\index{likelihood function}

\begin{verbatim}
class Train(Suite):
    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo
\end{verbatim}

This might look familiar; the likelihood functions for the locomotive
problem and the dice problem are identical.
\index{dice problem}

Here's the update:

\begin{verbatim}
suite = Train(hypos)
suite.Update(60)
\end{verbatim}

There are too many hypotheses to print, so I plotted the
results in Figure~\ref{fig.train1}. Not surprisingly, all
values of $N$ below 60 have been eliminated.

The most likely value, if you had to guess, is 60. That might not
seem like a very good guess; after all, what are the chances that you
just happened to see the train with the highest number?
Nevertheless, if you want to maximize the chance of getting
the answer exactly right, you should guess 60.

But maybe that's not the right goal. An alternative is to compute
the mean of the posterior distribution:

\begin{verbatim}
def Mean(suite):
    total = 0
    for hypo, prob in suite.Items():
        total += hypo * prob
    return total

print Mean(suite)
\end{verbatim}

Or you could use the very similar method provided by {\tt Pmf}:

\begin{verbatim}
print suite.Mean()
\end{verbatim}

The mean of the posterior is 333, so that might be a
good guess if you wanted to minimize error. If you played this
guessing game over and over, using the mean of the posterior as your
estimate would minimize the mean squared error over the long run (see
\url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
\index{mean squared error}

You can download this example from \url{http://thinkbayes.com/train.py}.
For more information see Section~\ref{download}.

\section{What about that prior?}

To make any progress on the locomotive problem we had to make
assumptions, and some of them were pretty arbitrary. In
particular, we chose a uniform prior from 1 to 1000, without
much justification for choosing 1000, or for choosing a uniform
distribution.
\index{prior distribution}

It is not crazy to believe that a railroad company might operate 1000
locomotives, but a reasonable person might guess more or fewer. So we
might wonder whether the posterior distribution is sensitive to these
assumptions. With so little data---only one observation---it probably
is.

Recall that with a uniform prior from 1 to 1000, the mean of
the posterior is 333. With an upper bound of 500, we get a
posterior mean of 207, and with an upper bound of 2000,
the posterior mean is 552.

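Here is one way to run that experiment (a sketch; the helper
function is mine, not part of {\tt thinkbayes}):

\begin{verbatim}
def PosteriorMean(upper_bound, dataset):
    # build a uniform prior up to upper_bound and update it
    suite = Train(xrange(1, upper_bound+1))
    for data in dataset:
        suite.Update(data)
    return suite.Mean()

for upper_bound in [500, 1000, 2000]:
    print upper_bound, PosteriorMean(upper_bound, [60])
\end{verbatim}
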
So that's bad. There are two ways to proceed:

\begin{itemize}

\item Get more data.

\item Get more background information.

\end{itemize}

With more data, posterior distributions based on different
priors tend to converge. For example, suppose that in addition
to train 60 we also see trains 30 and 90. We can update the
distribution like this:

\begin{verbatim}
for data in [60, 30, 90]:
    suite.Update(data)
\end{verbatim}

With these data, the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500 & 152 \\
1000 & 164 \\
2000 & 171 \\
\hline
\end{tabular}

So the differences are smaller.


\section{An alternative prior}

\begin{figure}
% train.py
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
\caption{Posterior distribution based on a power law prior,
compared to a uniform prior.}
\label{fig.train4}
\end{figure}

If more data are not available, another option is to improve the
priors by gathering more background information. It is probably
not reasonable to assume that a train-operating company with 1000
locomotives is just as likely as a company with only 1.

With some effort, we could probably find a list of companies that
operate locomotives in the area of observation. Or we could
interview an expert in rail shipping to gather information about
the typical size of companies.

But even without getting into the specifics of railroad economics, we
can make some educated guesses. In most fields, there are many small
companies, fewer medium-sized companies, and only one or two very
large companies. In fact, the distribution of company sizes tends to
follow a power law, as Robert Axtell reports in {\it Science} (see
\url{http://www.sciencemag.org/content/293/5536/1818.full.pdf}).
\index{power law}
\index{Axtell, Robert}

This law suggests that if there are 1000 companies with fewer than
10 locomotives, there might be 100 companies with 100 locomotives,
10 companies with 1000, and possibly one company with 10,000 locomotives.

Mathematically, a power law means that the number of companies
with a given size is inversely proportional to size, or
%
\[ \PMF(x) \propto \left( \frac{1}{x} \right)^{\alpha} \]
%
where $\PMF(x)$ is the probability mass function of $x$ and $\alpha$ is
a parameter that is often near 1.

We can construct a power law prior like this:

\begin{verbatim}
class Train(Dice):

    def __init__(self, hypos, alpha=1.0):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, hypo**(-alpha))
        self.Normalize()
\end{verbatim}

And here's the code that constructs the prior:

\begin{verbatim}
hypos = range(1, 1001)
suite = Train(hypos)
\end{verbatim}

Again, the upper bound is arbitrary, but with a power law
prior, the posterior is less sensitive to this choice.

Figure~\ref{fig.train4} shows the new posterior based on
the power law, compared to the posterior based on the
uniform prior. Using the background information
represented in the power law prior, we can all but eliminate
values of $N$ greater than 700.

If we start with this prior and observe trains 30, 60, and 90,
the means of the posteriors are

\begin{tabular}{|l|l|}
\hline
Upper & Posterior \\
Bound & Mean \\
\hline
500 & 131 \\
1000 & 133 \\
2000 & 134 \\
\hline
\end{tabular}

Now the differences are much smaller. In fact,
with an arbitrarily large upper bound, the mean converges on 134.

So the power law prior is more realistic, because it is based on
general information about the size of companies, and it
behaves better in practice.

You can download the examples in this section from
\url{http://thinkbayes.com/train3.py}.
For more information see Section~\ref{download}.

\section{Credible intervals}
\label{credible}

Once you have computed a posterior distribution, it is often useful
to summarize the results with a single point estimate or an interval.
For point estimates it is common to use the mean, median, or the
value with maximum likelihood.
\index{credible interval}
\index{maximum likelihood}

For intervals we usually report two values computed so that there is
a 90\% chance (or any other probability) that the unknown value falls
between them. These values define a {\bf credible interval}.

A simple way to compute a credible interval is to add up the
probabilities in the posterior distribution and record the values
that correspond to probabilities 5\% and 95\%; in other words, the
5th and 95th percentiles.
\index{percentile}

\verb"thinkbayes" provides a function that computes percentiles:

\begin{verbatim}
def Percentile(pmf, percentage):
    p = percentage / 100.0
    total = 0
    for val, prob in pmf.Items():
        total += prob
        if total >= p:
            return val
\end{verbatim}

And here's the code that uses it:

\begin{verbatim}
interval = Percentile(suite, 5), Percentile(suite, 95)
print interval
\end{verbatim}

For the previous example---the locomotive problem with a power law prior
and three trains---the 90\% credible interval is $(91, 243)$. The
width of this range suggests, correctly, that we are still quite
uncertain about how many locomotives there are.


\section{Cumulative distribution functions}

In the previous section we computed percentiles by iterating through
the values and probabilities in a Pmf. If we need to compute more
than a few percentiles, it is more efficient to use a cumulative
distribution function, or Cdf.
\index{cumulative distribution function}
\index{Cdf}

Cdfs and Pmfs are equivalent in the sense that they contain the
same information about the distribution, and you can always convert
from one to the other. The advantage of the Cdf is that you can
compute percentiles more efficiently.

{\tt thinkbayes} provides a {\tt Cdf} class that represents a
cumulative distribution function. {\tt Pmf} provides a method
that makes the corresponding Cdf:

\begin{verbatim}
cdf = suite.MakeCdf()
\end{verbatim}

And {\tt Cdf} provides a method named \verb"Percentile":

\begin{verbatim}
interval = cdf.Percentile(5), cdf.Percentile(95)
\end{verbatim}

Converting from a Pmf to a Cdf takes time proportional to the number
of values, {\tt len(pmf)}. The Cdf stores the values and
probabilities in sorted lists, so looking up a probability to get the
corresponding value takes ``log time'': that is, time proportional to
the logarithm of the number of values. Looking up a value to get the
corresponding probability is also logarithmic, so Cdfs are efficient
for many calculations.

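To see why the lookup is logarithmic, here is a minimal sketch of how
a Cdf can implement {\tt Percentile} with binary search (an
illustration, not the actual code in {\tt thinkbayes.py}):

\begin{verbatim}
import bisect

class SimpleCdf(object):
    """Sketch of a Cdf: sorted values with cumulative probabilities."""

    def __init__(self, xs, ps):
        self.xs = xs     # values, sorted
        self.ps = ps     # cumulative probabilities, nondecreasing

    def Percentile(self, percentage):
        # find the first cumulative probability >= p in log time
        p = percentage / 100.0
        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]
\end{verbatim}
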
The examples in this section are in \url{http://thinkbayes.com/train3.py}.
For more information see Section~\ref{download}.


\section{The German tank problem}

During World War II, the Economic Warfare Division of the American
Embassy in London used statistical analysis to estimate German
production of tanks and other equipment.\footnote{Ruggles and Brodie,
``An Empirical Approach to Economic Intelligence in World War II,''
{\em Journal of the American Statistical Association}, Vol.~42,
No.~237 (March 1947).}

The Western Allies had captured log books, inventories, and repair
records that included chassis and engine serial numbers for individual
tanks.

Analysis of these records indicated that serial numbers were allocated
by manufacturer and tank type in blocks of 100 numbers, that numbers
in each block were used sequentially, and that not all numbers in each
block were used. So the problem of estimating German tank production
could be reduced, within each block of 100 numbers, to a form of the
locomotive problem.

Based on this insight, American and British analysts produced
estimates substantially lower than estimates from other forms
of intelligence. And after the war, records indicated that they were
substantially more accurate.

They performed similar analyses for tires, trucks, rockets, and other
equipment, yielding accurate and actionable economic intelligence.

The German tank problem is historically interesting; it is also a nice
example of real-world application of statistical estimation. So far
many of the examples in this book have been toy problems, but it will
not be long before we start solving real problems. I think it is an
advantage of Bayesian analysis, especially with the computational
approach we are taking, that it provides such a short path from a
basic introduction to the research frontier.


\section{Discussion}

Among Bayesians, there are two approaches to choosing prior
distributions. Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be {\bf informative}. The problem with using an informative
prior is that people might use different background information (or
interpret it differently). So informative priors often seem subjective.
\index{informative prior}

The alternative is a so-called {\bf uninformative prior}, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves. In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.
\index{uninformative prior}

Uninformative priors are appealing because they seem more
objective. But I am generally in favor of using informative priors.
Why? First, Bayesian analysis is always based on
modeling decisions. Choosing the prior is one of those decisions, but
it is not the only one, and it might not even be the most subjective.
So even if an uninformative prior is more objective, the entire analysis
is still subjective.
\index{modeling}
\index{subjectivity}
\index{objectivity}

Also, for most practical problems, you are likely to be in one of two
regimes: either you have a lot of data or not very much. If you have
a lot of data, the choice of the prior doesn't matter very much;
informative and uninformative priors yield almost the same results.
We'll see an example like this in the next chapter.

But if, as in the locomotive problem, you don't have much data,
using relevant background information (like the power law distribution)
makes a big difference.
\index{locomotive problem}

And if, as in the German tank problem, you have to make life-and-death
decisions based on your results, you should probably use all of the
information at your disposal, rather than maintaining the illusion of
objectivity by pretending to know less than you do.
\index{German tank problem}


\section{Exercises}

\begin{exercise}
To write a likelihood function for the locomotive problem, we had
to answer this question: ``If the railroad has $N$ locomotives, what
is the probability that we see number 60?''

The answer depends on what sampling process we use when we observe the
locomotive. In this chapter, I resolved the ambiguity by specifying
that there is only one train-operating company (or only one that we
care about).

But suppose instead that there are many companies with different
numbers of trains. And suppose that you are equally likely to see any
train operated by any company.
In that case, the likelihood function is different because you
are more likely to see a train operated by a large company.

As an exercise, implement the likelihood function for this variation
of the locomotive problem, and compare the results.

\end{exercise}


\chapter{More Estimation}
\label{more}

\section{The Euro problem}
\label{euro}

\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay
poses this problem:

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

To answer that question, we'll proceed in two steps. The first
is to estimate the probability that the coin lands face up. The second
is to evaluate whether the data support the hypothesis that the
coin is biased.

You can download the code in this section from
\url{http://thinkbayes.com/euro.py}.
For more information see Section~\ref{download}.

Any given coin has some probability, $x$, of landing heads up when spun
on edge. It seems reasonable to believe that the value of $x$ depends
on some physical characteristics of the coin, primarily the distribution
of weight.

If a coin is perfectly balanced, we expect $x$ to be close to 50\%, but
for a lopsided coin, $x$ might be substantially different. We can use
Bayes's theorem and the observed data to estimate $x$.

Let's define 101 hypotheses, where $H_x$ is the hypothesis that the
probability of heads is $x$\%, for values from 0 to 100. I'll
start with a uniform prior where the probability of $H_x$ is the same
for all $x$. We'll come back later to consider other priors.
\index{uniform distribution}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro1.pdf}}
\caption{Posterior distribution for the Euro problem
on a uniform prior.}
\label{fig.euro1}
\end{figure}

The likelihood function is relatively easy: if $H_x$ is true, the
probability of heads is $x/100$ and the probability of tails is
$1 - x/100$.

\begin{verbatim}
class Euro(Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        if data == 'H':
            return x/100.0
        else:
            return 1 - x/100.0
\end{verbatim}

Here's the code that makes the suite and updates it:

\begin{verbatim}
suite = Euro(xrange(0, 101))
dataset = 'H' * 140 + 'T' * 110

for data in dataset:
    suite.Update(data)
\end{verbatim}

The result is in Figure~\ref{fig.euro1}.


\section{Summarizing the posterior}

Again, there are several ways to summarize the posterior distribution.
One option is to find the most likely value in the posterior
distribution. \verb"thinkbayes" provides a function that does
that:
\index{posterior distribution}
\index{maximum likelihood}

\begin{verbatim}
def MaximumLikelihood(pmf):
    """Returns the value with the highest probability."""
    prob, val = max((prob, val) for val, prob in pmf.Items())
    return val
\end{verbatim}

In this case the result is 56, which is also the observed percentage of
heads, $140/250 = 56\%$. So that suggests (correctly) that the
observed percentage is the maximum likelihood estimator
for the population.

We might also summarize the posterior by computing the mean
and median:
\index{median}

\begin{verbatim}
print 'Mean', suite.Mean()
print 'Median', thinkbayes.Percentile(suite, 50)
\end{verbatim}

The mean is 55.95; the median is 56. Finally, we can compute a
credible interval:

\begin{verbatim}
print 'CI', thinkbayes.CredibleInterval(suite, 90)
\end{verbatim}

The result is $(51, 61)$.

Now, getting back to the original question, we would like to know
whether the coin is fair. We observe that the posterior credible
interval does not include 50\%, which suggests that the coin is not
fair.

But that is not exactly the question we started with. MacKay asked,
``Do these data give evidence that the coin is biased rather than
fair?'' To answer that question, we will have to be more precise
about what it means to say that data constitute evidence for
a hypothesis. And that is the subject of the next chapter.
\index{evidence}

But before we go on, I want to address one possible source of confusion.
Since we want to know whether the coin is fair, it might be tempting
to ask for the probability that {\tt x} is 50\%:

\begin{verbatim}
print suite.Prob(50)
\end{verbatim}

The result is 0.021, but that value is almost meaningless. The
decision to evaluate 101 hypotheses was arbitrary; we could have
divided the range into more or fewer pieces, and if we had, the
probability for any given hypothesis would be greater or less.


\section{Swamping the priors}
\label{triangle}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro2.pdf}}
\caption{Uniform and triangular priors for the
Euro problem.}
\label{fig.euro2}
\end{figure}

\begin{figure}
% euro.py
\centerline{\includegraphics[height=2.5in]{figs/euro3.pdf}}
\caption{Posterior distributions for the Euro problem.}
\label{fig.euro3}
\end{figure}

We started with a uniform prior, but that might not be a good
choice. I can believe that if a coin is lopsided, $x$ might deviate
substantially from 50\%, but it seems unlikely that the Belgian Euro
coin is so imbalanced that $x$ is 10\% or 90\%.

It might be more reasonable to choose a prior that gives
higher probability to values of $x$ near 50\% and lower probability
to extreme values.

As an example, I constructed a triangular prior, shown in
Figure~\ref{fig.euro2}. Here's the code that constructs the prior:

\begin{verbatim}
def TrianglePrior():
    suite = Euro()
    for x in range(0, 51):
        suite.Set(x, x)
    for x in range(51, 101):
        suite.Set(x, 100-x)
    suite.Normalize()
    return suite
\end{verbatim}

Figure~\ref{fig.euro2} shows the result (and the uniform prior for
comparison).
Updating this prior with the same dataset yields the posterior
distribution shown in Figure~\ref{fig.euro3}. Even with substantially
different priors, the posterior distributions are very similar. The
medians and the credible intervals are identical; the means differ by
less than 0.5\%.
\index{triangle distribution}

This is an example of {\bf swamping the priors}: with enough
data, people who start with different priors will tend to
converge on the same posterior.
\index{swamping the priors}
\index{convergence}


\section{Optimization}

The code I have shown so far is meant to be easy to read, but it
is not very efficient. In general, I like to develop code that
is demonstrably correct, then check whether it is fast enough for
my purposes. If so, there is no need to optimize.
For this example, if we care about run time,
there are several ways we can speed it up.
\index{optimization}

The first opportunity is to reduce the number of times we
normalize the suite.
In the original code, we call \verb"Update" once for each spin.

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails

for data in dataset:
    suite.Update(data)
\end{verbatim}

And here's what \verb"Update" looks like:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

Each update iterates through the hypotheses, then calls \verb"Normalize",
which iterates through the hypotheses again. We can save some
time by doing all of the updates before normalizing.

\verb"Suite" provides a method called \verb"UpdateSet" that does
exactly that. Here it is:

\begin{verbatim}
def UpdateSet(self, dataset):
    for data in dataset:
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's how we can invoke it:

\begin{verbatim}
dataset = 'H' * heads + 'T' * tails
suite.UpdateSet(dataset)
\end{verbatim}

This optimization speeds things up, but the run time is still
proportional to the amount of data. We can speed things up
even more by rewriting \verb"Likelihood" to process the entire
dataset, rather than one spin at a time.

In the original version, \verb"data" is a string that encodes
either heads or tails:

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    if data == 'H':
        return x
    else:
        return 1-x
\end{verbatim}

As an alternative, we could encode the dataset as a tuple of
two integers: the number of heads and tails.
In that case \verb"Likelihood" looks like this:
\index{tuple}

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

And then we can call \verb"Update" like this:

\begin{verbatim}
heads, tails = 140, 110
suite.Update((heads, tails))
\end{verbatim}

Since we have replaced repeated multiplication with exponentiation,
this version takes the same time for any number of spins.


\section{The beta distribution}
\label{beta}

\index{beta distribution}
There is one more optimization that solves this problem
even faster.

So far we have used a Pmf object to represent a discrete set of
values for {\tt x}. Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}

The beta distribution is defined on the interval from 0 to 1
(including both), so it is a natural choice for describing
proportions and probabilities. But wait, it gets better.

%TODO: explain the binomial distribution in the previous section

It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}. That means that if the prior
distribution for {\tt x} is a beta distribution, the posterior is also
a beta distribution. But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}

The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or {\tt alpha} and {\tt beta}. If the prior
is a beta distribution with parameters {\tt alpha} and {\tt beta}, and
we see data with {\tt h} heads and {\tt t} tails, the posterior is a
beta distribution with parameters {\tt alpha+h} and {\tt beta+t}. In
other words, we can do an update with two additions.
\index{parameter}

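For example, if the prior is a beta distribution with {\tt alpha=1}
and {\tt beta=1}, and we observe 140 heads and 110 tails, the
posterior is a beta distribution with {\tt alpha=141} and
{\tt beta=111}.
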
So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior. Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match. The
beta distribution with {\tt alpha=1} and {\tt beta=1} is uniform from
0 to 1.

Let's see how we can take advantage of all this.
{\tt thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}

\begin{verbatim}
class Beta(object):

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{verbatim}

By default \verb"__init__" makes a uniform distribution.
{\tt Update} performs a Bayesian update:

\begin{verbatim}
def Update(self, data):
    heads, tails = data
    self.alpha += heads
    self.beta += tails
\end{verbatim}

{\tt data} is a pair of integers representing the number of
heads and tails.

So we have yet another way to solve the Euro problem:

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((140, 110))
print beta.Mean()
\end{verbatim}

{\tt Beta} provides {\tt Mean}, which
computes a simple function of {\tt alpha} and {\tt beta}:

\begin{verbatim}
def Mean(self):
    return float(self.alpha) / (self.alpha + self.beta)
\end{verbatim}

For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.

{\tt Beta} also provides {\tt EvalPdf}, which evaluates the
probability density function (PDF) of the beta distribution:
\index{probability density function}
\index{PDF}

\begin{verbatim}
def EvalPdf(self, x):
    return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{verbatim}

Finally, {\tt Beta} provides {\tt MakePmf}, which
uses {\tt EvalPdf} to generate a discrete approximation
of the beta distribution.

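{\tt MakePmf} is not shown here, but based on the description it
might look something like this (a sketch; the actual implementation
may handle the endpoints and the number of points differently):

\begin{verbatim}
def MakePmf(self, steps=101):
    # evaluate the PDF at equally spaced points and normalize
    pmf = Pmf()
    for i in range(steps):
        x = float(i) / (steps - 1)
        pmf.Set(x, self.EvalPdf(x))
    pmf.Normalize()
    return pmf
\end{verbatim}
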
%This expression might look familiar. Here's {\tt
% thinkbayes.EvalBinomialPmf}

%\begin{verbatim}
%def EvalBinomialPmf(x, yes, no):
%    return x**yes * (1-x)**no
%\end{verbatim}

%It's the same function, but in {\tt EvalPdf}, we think of {\tt x} as a
%random variable and {\tt alpha} and {\tt beta} as parameters; in {\tt
% EvalBinomialPmf}, {\tt x} is the parameter, and {\tt yes} and {\tt
% no} are random variables. Distributions like these that share the
%same PDF are called {\bf conjugate distributions}.
%\index{conjugate distribution}


\section{Discussion}

In this chapter we solved the same problem with two different
priors and found that with a large dataset, the priors get
swamped. If two people start with different
prior beliefs, they generally find, as they see more data, that
their posterior distributions converge. At some point the
difference between their distributions is small enough that it has
no practical effect.
\index{swamping the priors}
\index{convergence}

When this happens, it relieves some of the worry about objectivity
that I discussed in the previous chapter. And for many real-world
problems even stark prior beliefs can eventually be reconciled
by data.

But that is not always the case. First, remember that all Bayesian
analysis is based on modeling decisions. If you and I do not
choose the same model, we might interpret data differently. So
even with the same data, we would compute different likelihoods,
and our posterior beliefs might not converge.
\index{modeling}

Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$. In the Euro problem,
if you are convinced that $x$ is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}

This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}

Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.'' For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}


\section{Exercises}

\begin{exercise}

Suppose that instead of observing coin tosses directly, you measure
the outcome using an instrument that is not always correct. Specifically,
suppose there is a probability {\tt y} that an actual heads is reported
as tails, or an actual tails is reported as heads.

Write a class that estimates the bias of a coin given a series of
outcomes and the value of {\tt y}.

How does the spread of the posterior distribution depend on {\tt y}?

\end{exercise}

\begin{exercise}

\index{Reddit}
This exercise is inspired by a question posted by a
``redditor'' named dominosci on Reddit's statistics ``subreddit'' at
\url{http://reddit.com/r/statistics}.

Reddit is an online forum with many interest groups called
subreddits. Users, called redditors, post links to online
content and other web pages. Other redditors vote on the links,
giving an ``upvote'' to high-quality links and a ``downvote'' to
links that are bad or irrelevant.

A problem, identified by dominosci, is that some redditors
are more reliable than others, and Reddit does not take
this into account.

The challenge is to devise a system so that when a redditor
casts a vote, the estimated quality of the link is updated
in accordance with the reliability of the redditor, and the
estimated reliability of the redditor is updated in accordance
with the quality of the link.

One approach is to model the quality of the link as the
probability of garnering an upvote, and to model the reliability
of the redditor as the probability of correctly giving an upvote
to a high-quality item.

Write class definitions for redditors and links and an update function
that updates both objects whenever a redditor casts a vote.

\end{exercise}


\chapter{Odds and Addends}

\section{Odds}

One way to represent a probability is with a number between
0 and 1, but that's not the only way. If you have ever bet
on a football game or a horse race, you have probably encountered
another representation of probability, called {\bf odds}.
\index{odds}

You might have heard expressions like ``the odds are
three to one,'' but you might not know what that means.
The {\bf odds in favor} of an event are the ratio of the probability
it will occur to the probability that it will not.

So if I think my team has a 75\% chance of winning, I would
say that the odds in their favor are three to one, because
the chance of winning is three times the chance of losing.

You can write odds in decimal form, but it is most common to
write them as a ratio of integers. So ``three to one'' is
written $3:1$.

When probabilities are low, it is more common to report the
{\bf odds against} rather than the odds in favor. For
example, if I think my horse has a 10\% chance of winning,
I would say that the odds against are $9:1$.

Probabilities and odds are different representations of the
same information. Given a probability, you can compute the
odds like this:

\begin{verbatim}
def Odds(p):
    return p / (1-p)
\end{verbatim}

Given the odds in favor, in decimal form, you can convert to
probability like this:

\begin{verbatim}
def Probability(o):
    return o / (o+1)
\end{verbatim}

If you represent odds with a numerator and denominator, you
can convert to probability like this (the {\tt float} call avoids
integer division in Python 2):

\begin{verbatim}
def Probability2(yes, no):
    return float(yes) / (yes + no)
\end{verbatim}

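For example, a quick check of these functions:

\begin{verbatim}
>>> Odds(0.75)
3.0
>>> Probability(3.0)
0.75
>>> Probability2(3, 1)
0.75
\end{verbatim}
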
When I work with odds in my head, I find it helpful to picture
people at the track. If 20\% of them think my horse will win,
then 80\% of them don't, so the odds in favor are $20:80$ or
$1:4$.

If the odds are $5:1$ against my horse, then five out of six
people think she will lose, so the probability of winning
is $1/6$.
\index{horse racing}


\section{The odds form of Bayes's theorem}

\index{Bayes's theorem!odds form}
In Chapter~\ref{intro} I wrote Bayes's theorem in the {\bf probability
form}:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
If we have two hypotheses, $A$ and $B$,
we can write the ratio of posterior probabilities like this:
%
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}
                                  {\p{B}~\p{D|B}} \]
%
Notice that the normalizing constant, \p{D}, drops out of
this equation.
\index{normalizing constant}

If $A$ and $B$ are mutually exclusive and collectively exhaustive,
that means $\p{B} = 1 - \p{A}$, so we can rewrite the ratio of
the priors, and the ratio of the posteriors, as odds.

Writing \odds{A} for odds in favor of $A$, we get:
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
In words, this says that the posterior odds are the prior odds times
the likelihood ratio. This is the {\bf odds form} of Bayes's theorem.

This form is most convenient for computing a Bayesian update on
paper or in your head. For example, let's go back to the
cookie problem:
\index{cookie problem}

\begin{quote}
Suppose there are two bowls of cookies. Bowl 1 contains
30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of
each.

Now suppose you choose one of the bowls at random and, without looking,
select a cookie at random. The cookie is vanilla. What is the probability
that it came from Bowl 1?
\end{quote}

The prior probability is 50\%, so the prior odds are $1:1$, or just
1. The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
So the posterior odds are $3:2$, which corresponds to probability
$3/5$.

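The same computation in code, using the conversion functions from
the previous section:

\begin{verbatim}
prior_odds = Odds(0.5)                      # 1.0
likelihood_ratio = (3.0/4) / (1.0/2)        # 1.5
post_odds = prior_odds * likelihood_ratio   # 1.5, or 3:2
print Probability(post_odds)                # 0.6, or 3/5
\end{verbatim}
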
\section{Oliver's blood}
\label{oliver}

\index{Oliver's blood problem}
\index{MacKay, David}
Here is another problem from MacKay's {\it Information Theory,
Inference, and Learning Algorithms}:

\begin{quote}
Two people have left traces of their own blood at the scene of
a crime. A suspect, Oliver, is tested and found to have type
`O' blood. The blood groups of the two traces are found to
be of type `O' (a common type in the local population, having frequency
60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data [the traces found at the scene] give evidence
in favor of the proposition that Oliver was one of the people
[who left blood at the scene]?
\end{quote}

To answer this question, we need to think about what it means
for data to give evidence in favor of (or against) a hypothesis.
Intuitively, we might say that data favor a hypothesis if the
hypothesis is more likely in light of the data than it was before.
\index{evidence}

In the cookie problem, the prior odds are $1:1$, or probability 50\%.
The posterior odds are $3:2$, or probability 60\%. So we could say
that the vanilla cookie is evidence in favor of Bowl 1.

The odds form of Bayes's theorem provides a way to make this
intuition more precise. Again,
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
or, dividing through by \odds{A}:
%
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
%
The term on the left is the ratio of the posterior and prior odds.
The term on the right is the likelihood ratio, also called the
{\bf Bayes factor}.
\index{likelihood ratio}
\index{Bayes factor}

If the Bayes factor is greater than 1, that means that the
data were more likely under $A$ than under $B$. And since the
odds ratio is also greater than 1, that means that the odds are
greater, in light of the data, than they were before.

If the Bayes factor is less than 1, that means the data were
less likely under $A$ than under $B$, so the odds in
favor of $A$ go down.

Finally, if the Bayes factor is exactly 1, the data are equally
likely under either hypothesis, so the odds do not change.

Now we can get back to the Oliver's blood problem. If Oliver is
one of the people who left blood at the crime scene, then he
accounts for the `O' sample, so the probability of the data
is just the probability that a random member of the population
has type `AB' blood, which is 1\%.

If Oliver did not leave blood at the scene, then we have two
samples to account for. If we choose two random people from
the population, what is the chance of finding one with type `O'
and one with type `AB'? Well, there are two ways it might happen:
the first person we choose might have type `O' and the second
`AB', or the other way around. So the total probability is
$2 (0.6) (0.01) = 1.2\%$.

The likelihood of the data is slightly higher if Oliver is
{\it not} one of the people who left blood at the scene, so
the blood data is actually evidence against Oliver's guilt.
\index{evidence}

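In terms of the Bayes factor: if $A$ is the hypothesis that Oliver
left blood at the scene and $B$ is the hypothesis that he did not,
the likelihood ratio is
%
\[ \frac{\p{D|A}}{\p{D|B}} = \frac{0.01}{0.012} \approx 0.83 \]
%
so whatever the prior odds are, the data reduce them by a factor of
about 0.83.
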
This example is a little contrived, but it is an example of
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.

If this result is so counterintuitive that it bothers you,
this way of thinking might help: the data consist of a common
event, type `O' blood, and a rare event, type `AB' blood.
If Oliver accounts for the common event, that leaves the rare
event still unexplained. If Oliver doesn't account for the
`O' blood, then we have two chances to find someone in the
population with `AB' blood. And that factor of two makes
the difference.


\section{Addends}
\label{addends}

The fundamental operation of Bayesian statistics is
{\tt Update}, which takes a prior distribution and a set
of data, and produces a posterior distribution. But solving
real problems usually involves a number of other operations,
including scaling, addition and other arithmetic operations,
max and min, and mixtures.
\index{distribution!operations}

This chapter presents addition and max; I will present
other operations as we need them.

The first example is based on
{\it Dungeons~\&~Dragons}, a role-playing game where the results
of players' decisions are usually determined by rolling dice.
In fact, before game play starts, players generate each
attribute of their characters---strength, intelligence, wisdom,
dexterity, constitution, and charisma---by rolling three
6-sided dice and adding them up.
\index{Dungeons and Dragons}

So you might be curious to know the distribution of this sum.
There are two ways you might compute it:
\index{simulation}
\index{enumeration}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single die, you can draw random samples, add them up,
and accumulate the distribution of simulated sums.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the sums.

\end{description}

\verb"thinkbayes" provides functions for both. Here's an example
of the first approach. First, I'll define a class to represent
a single die as a Pmf:

\begin{verbatim}
class Die(thinkbayes.Pmf):

    def __init__(self, sides):
        thinkbayes.Pmf.__init__(self)
        for x in xrange(1, sides+1):
            self.Set(x, 1)
        self.Normalize()
\end{verbatim}

Now I can create a 6-sided die:

\begin{verbatim}
d6 = Die(6)
\end{verbatim}

And use \verb"thinkbayes.SampleSum" to generate a sample of 1000 rolls:

\begin{verbatim}
dice = [d6] * 3
three = thinkbayes.SampleSum(dice, 1000)
\end{verbatim}

\verb"SampleSum" takes a list of distributions (either Pmf or Cdf
objects) and the sample size, {\tt n}. It generates {\tt n} random
sums and returns their distribution as a Pmf object.

\begin{verbatim}
def SampleSum(dists, n):
    pmf = MakePmfFromList(RandomSum(dists) for i in xrange(n))
    return pmf
\end{verbatim}

\verb"SampleSum" uses \verb"RandomSum", also in \verb"thinkbayes.py":

\begin{verbatim}
def RandomSum(dists):
    total = sum(dist.Random() for dist in dists)
    return total
\end{verbatim}

{\tt RandomSum} invokes {\tt Random} on each distribution and
adds up the results.

The drawback of simulation is that the result
is only approximately correct. As \verb"n" gets larger, it gets
more accurate, but of course the run time increases as well.

The other approach is to enumerate all pairs of values and
compute the sum and probability of each pair. This is implemented
in \verb"Pmf.__add__":

\begin{verbatim}
# class Pmf

    def __add__(self, other):
        pmf = Pmf()
        for v1, p1 in self.Items():
            for v2, p2 in other.Items():
                pmf.Incr(v1+v2, p1*p2)
        return pmf
\end{verbatim}

{\tt self} is a Pmf, of course; {\tt other} can be a Pmf or anything
else that provides {\tt Items}. The result is a new Pmf. The time to
run \verb"__add__" depends on the number of items in {\tt self} and
{\tt other}; it is proportional to {\tt len(self) * len(other)}.

And here's how it's used:

\begin{verbatim}
three_exact = d6 + d6 + d6
\end{verbatim}

When you apply the {\tt +} operator to a Pmf, Python invokes
\verb"__add__". In this example, \verb"__add__" is invoked twice.

Figure~\ref{fig.dungeons1} shows an approximate result generated
by simulation and the exact result computed by enumeration.

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons1.pdf}}
\caption{Approximate and exact distributions for the sum of
three 6-sided dice.}
\label{fig.dungeons1}
\end{figure}

\verb"Pmf.__add__" is based on the assumption that the random
selections from each Pmf are independent. In the example of rolling
several dice, this assumption is pretty good. In other cases, we
would have to extend this method to use conditional probabilities.
\index{independence}

The code from this section is available from
\url{http://thinkbayes.com/dungeons.py}.
For more information see Section~\ref{download}.

\section{Maxima}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons2.pdf}}
\caption{Distribution of the maximum of six rolls of three dice.}
\label{fig.dungeons2}
\end{figure}

When you generate a {\it Dungeons~\&~Dragons} character, you are
particularly interested in the character's best attributes, so
you might like to know the distribution of the maximum attribute.

There are three ways to compute the distribution of a maximum:
\index{maximum}
\index{simulation}
\index{enumeration}
\index{exponentiation}

\begin{description}

\item[Simulation:] Given a Pmf that represents the distribution
for a single selection, you can generate random samples, find the maximum,
and accumulate the distribution of simulated maxima.

\item[Enumeration:] Given two Pmfs, you can enumerate all possible
pairs of values and compute the distribution of the maximum.

\item[Exponentiation:] If we convert a Pmf to a Cdf, there is a simple
and efficient algorithm for finding the Cdf of the maximum.

\end{description}

The code to simulate maxima is almost identical to the code for
simulating sums:

\begin{verbatim}
def RandomMax(dists):
    total = max(dist.Random() for dist in dists)
    return total

def SampleMax(dists, n):
    pmf = MakePmfFromList(RandomMax(dists) for i in xrange(n))
    return pmf
\end{verbatim}

All I did was replace ``sum'' with ``max''. And the code
for enumeration is almost identical, too:

\begin{verbatim}
def PmfMax(pmf1, pmf2):
    res = thinkbayes.Pmf()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            res.Incr(max(v1, v2), p1*p2)
    return res
\end{verbatim}

In fact, you could generalize this function by taking the
appropriate operator as a parameter, as in the sketch below.

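Here is one way to write that generalization (my sketch, not a
function in {\tt thinkbayes}):

\begin{verbatim}
def PmfBinary(pmf1, pmf2, op):
    """Enumerates pairs and combines values with op (e.g. max or min)."""
    res = thinkbayes.Pmf()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            res.Incr(op(v1, v2), p1*p2)
    return res

pmf_max = PmfBinary(d6, d6, max)
\end{verbatim}
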
The only problem with this algorithm is that if each Pmf
has $m$ values, the run time is proportional to $m^2$.
And if we want the maximum of {\tt k} selections, it takes
time proportional to $k m^2$.

If we convert the Pmfs to Cdfs, we can do the same calculation
much faster! The key is to remember the definition of the
cumulative distribution function:
%
\[ \CDF(x) = \p{X \le x} \]
%
where $X$ is a random variable that means ``a value chosen
randomly from this distribution.'' So, for example, $\CDF(5)$
is the probability that a value from this distribution is less
than or equal to 5.

If I draw $X$ from $\CDF_1$ and $Y$ from $\CDF_2$, and compute
the maximum $Z = \max(X, Y)$, what is the chance that $Z$ is
less than or equal to 5? Well, in that case both $X$ and $Y$
must be less than or equal to 5.

\index{independence}
If the selections of $X$ and $Y$ are independent,
%
\[ \CDF_3(5) = \CDF_1(5)~\CDF_2(5) \]
%
where $\CDF_3$ is the distribution of $Z$. I chose the value
5 because I think it makes the formulas easy to read, but we
can generalize for any value of $z$:
%
\[ \CDF_3(z) = \CDF_1(z)~\CDF_2(z) \]
%
In the special case where we draw $k$ values from the same
distribution,
%
\[ \CDF_k(z) = \CDF_1(z)^k \]
%
So to find the distribution of the maximum of $k$ values,
we can enumerate the probabilities in the given Cdf
and raise them to the $k$th power.
\verb"Cdf" provides a method that does just that:

\begin{verbatim}
# class Cdf

    def Max(self, k):
        cdf = self.Copy()
        cdf.ps = [p**k for p in cdf.ps]
        return cdf
\end{verbatim}

\verb"Max" takes the number of selections, {\tt k}, and returns a new
Cdf that represents the distribution of the maximum of {\tt k}
selections. The run time for this method is proportional to
$m$, the number of items in the Cdf.

\verb"Pmf.Max" does the same thing for Pmfs. It has to do a little
more work to convert the Pmf to a Cdf, so the run time is proportional
to $m \log m$, but that's still better than quadratic.

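A plausible sketch of \verb"Pmf.Max", assuming the conversion
method we just saw (the actual implementation may differ):

\begin{verbatim}
# class Pmf

    def Max(self, k):
        cdf = self.MakeCdf()   # sorting makes this step m log m
        return cdf.Max(k)
\end{verbatim}
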
Finally, here's an example that computes the distribution of
a character's best attribute:

\begin{verbatim}
best_attr_cdf = three_exact.Max(6)
best_attr_pmf = best_attr_cdf.MakePmf()
\end{verbatim}

Here \verb"three_exact" is defined in the previous section.
If we print the results, we see that the chance of generating
a character with at least one attribute of 18 is about 3\%.
Figure~\ref{fig.dungeons2} shows the distribution.


\section{Mixtures}
\label{mixture}

\begin{figure}
% dungeons.py
\centerline{\includegraphics[height=2.5in]{figs/dungeons3.pdf}}
\caption{Distribution of the outcome for a random die chosen
from the box.}
\label{fig.dungeons3}
\end{figure}

Let's do one more example from {\it Dungeons~\&~Dragons}. Suppose
3190
I have a box of dice with the following inventory:

\begin{verbatim}
5   4-sided dice
4   6-sided dice
3   8-sided dice
2  12-sided dice
1  20-sided die
\end{verbatim}

I choose a die from the box and roll it.  What is the distribution
of the outcome?

If you know which die it is, the answer is easy.  A die with {\tt n}
sides yields a uniform distribution from 1 to {\tt n}, including both.
\index{uniform distribution}

But if we don't know which die it is, the resulting distribution is
a {\bf mixture} of uniform distributions with different bounds.
In general, this kind of mixture does not fit any simple mathematical
model, but it is straightforward to compute the distribution in
the form of a PMF.
\index{mixture}

As always, one option is to simulate the scenario, generate a random
sample, and compute the PMF of the sample.  This approach is simple
and it generates an approximate solution quickly.  But if we want an
exact solution, we need a different approach.
\index{simulation}

Let's start with a simple version of the problem where there are
only two dice, one with 6 sides and one with 8.  We can make a Pmf to
represent each die:

\begin{verbatim}
d6 = Die(6)
d8 = Die(8)
\end{verbatim}
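
As a reminder, {\tt Die} is the {\tt Pmf} subclass from
Section~\ref{addends}; it looks something like this:

\begin{verbatim}
class Die(thinkbayes.Pmf):
    """Represents the outcome of an n-sided die."""

    def __init__(self, sides):
        thinkbayes.Pmf.__init__(self)
        for x in xrange(1, sides+1):
            self.Set(x, 1)
        self.Normalize()
\end{verbatim}
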

Then we create a Pmf to represent the mixture:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die in [d6, d8]:
    for outcome, prob in die.Items():
        mix.Incr(outcome, prob)
mix.Normalize()
\end{verbatim}

The first loop enumerates the dice; the second enumerates the
outcomes and their probabilities.  Inside the loop,
{\tt Pmf.Incr} adds up the contributions from the two distributions.

This code assumes that the two dice are equally likely.  More
generally, we need to know the probability of each die so we can
weight the outcomes accordingly.

First we create a Pmf that maps from each die to the probability it is
selected:

\begin{verbatim}
pmf_dice = thinkbayes.Pmf()
pmf_dice.Set(Die(4), 5)
pmf_dice.Set(Die(6), 4)
pmf_dice.Set(Die(8), 3)
pmf_dice.Set(Die(12), 2)
pmf_dice.Set(Die(20), 1)
pmf_dice.Normalize()
\end{verbatim}

Next we need a more general version of the mixture algorithm:

\begin{verbatim}
mix = thinkbayes.Pmf()
for die, weight in pmf_dice.Items():
    for outcome, prob in die.Items():
        mix.Incr(outcome, weight*prob)
\end{verbatim}

Now each die has a weight associated with it (which makes it a
weighted die, I suppose).  When we add each outcome to the mixture,
its probability is multiplied by {\tt weight}.

Figure~\ref{fig.dungeons3} shows the result.  As expected, values 1
through 4 are the most likely because any die can produce them.
Values above 12 are unlikely because there is only one die in the box
that can produce them (and it does so less than half the time).

{\tt thinkbayes} provides a function named {\tt MakeMixture}
that encapsulates this algorithm, so we could have written:

\begin{verbatim}
mix = thinkbayes.MakeMixture(pmf_dice)
\end{verbatim}
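
{\tt MakeMixture} is not shown in the text; a version consistent with
the loop above might look like this (the {\tt name} parameter is an
assumption on my part):

\begin{verbatim}
def MakeMixture(metapmf, name='mix'):
    """metapmf maps from Pmf objects to their weights."""
    mix = thinkbayes.Pmf(name=name)
    for pmf, weight in metapmf.Items():
        for x, prob in pmf.Items():
            mix.Incr(x, weight * prob)
    return mix
\end{verbatim}
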

We'll use {\tt MakeMixture} again in Chapters~\ref{prediction}
and~\ref{observer}.


\section{Discussion}

Other than the odds form of Bayes's theorem, this chapter is not
specifically Bayesian.  But Bayesian analysis is all about
distributions, so it is important to understand the concept of a
distribution well.  From a computational point of view, a distribution
is any data structure that represents a set of values (possible
outcomes of a random process) and their probabilities.
\index{distribution}

We have seen two representations of distributions: Pmfs and Cdfs.
These representations are equivalent in the sense that they contain
the same information, so you can convert from one to the other.  The
primary difference between them is performance: some operations are
faster and easier with a Pmf; others are faster with a Cdf.
\index{Pmf} \index{Cdf}

The other goal of this chapter is to introduce operations that act on
distributions, like \verb"Pmf.__add__", {\tt Cdf.Max}, and {\tt
thinkbayes.MakeMixture}.  We will use these operations later, but I
introduce them now to encourage you to think of a distribution as a
fundamental unit of computation, not just a container for values and
probabilities.


\chapter{Decision Analysis}
\label{decisionanalysis}

\section{The {\it Price is Right} problem}

On November 1, 2007, contestants named Letia and Nathaniel appeared
on {\it The Price is Right}, an American game show.  They competed in
a game called {\it The Showcase}, where the objective is to guess the price
of a showcase of prizes.  The contestant who comes closest to the
actual price of the showcase, without going over, wins the prizes.
\index{Price is Right}
\index{Showcase}

Nathaniel went first.  His showcase included a dishwasher, a wine
cabinet, a laptop computer, and a car.  He bid \$26,000.

Letia's showcase included a pinball machine, a video arcade game, a
pool table, and a cruise of the Bahamas.  She bid \$21,500.

The actual price of Nathaniel's showcase was \$25,347.  His bid
was too high, so he lost.

The actual price of Letia's showcase was \$21,578.  She was only
off by \$78, so she won her showcase and, because
her bid was off by less than \$250, she also won Nathaniel's
showcase.

For a Bayesian thinker, this scenario suggests several questions:

\begin{enumerate}

\item Before seeing the prizes, what prior beliefs should the
  contestant have about the price of the showcase?

\item After seeing the prizes, how should the contestant update
  those beliefs?

\item Based on the posterior distribution, what should the
  contestant bid?

\end{enumerate}

The third question demonstrates a common use of Bayesian analysis:
decision analysis.  Given a posterior distribution, we can choose
the bid that maximizes the contestant's expected return.
\index{decision analysis}

This problem is inspired by an example in Cameron Davidson-Pilon's
book, {\it Bayesian Methods for Hackers}.  The code I wrote for this
chapter is available from \url{http://thinkbayes.com/price.py}; it
reads data files you can download from
\url{http://thinkbayes.com/showcases.2011.csv} and
\url{http://thinkbayes.com/showcases.2012.csv}.  For more information
see Section~\ref{download}.
\index{Davidson-Pilon, Cameron}


\section{The prior}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price1.pdf}}
\caption{Distribution of prices for showcases on
{\it The Price is Right}, 2011-12.}
\label{fig.price1}
\end{figure}

To choose a prior distribution of prices, we can take advantage
of data from previous episodes.  Fortunately, fans of the show
keep detailed records.  When I corresponded with Mr.~Davidson-Pilon
about his book, he sent me data collected by Steve Gee at
\url{http://tpirsummaries.8m.com}.  It includes the price of
each showcase from the 2011 and 2012 seasons and the bids
offered by the contestants.
\index{Gee, Steve}

Figure~\ref{fig.price1} shows the distribution of prices for these
showcases.  The most common value for both showcases is around
\$28,000, but the first showcase has a second mode near \$50,000,
and the second showcase is occasionally worth more than \$70,000.

These distributions are based on actual data, but they
have been smoothed by Gaussian kernel density estimation (KDE).
Before we go on, I want to take a detour to talk about
probability density functions and KDE.
\index{kernel density estimation}
\index{KDE}

\section{Probability density functions}

So far we have been working with probability mass functions, or PMFs.
A PMF is a map from each possible value to its probability.  In my
implementation, a Pmf object provides a method named {\tt Prob} that
takes a value and returns a probability, also known as a {\bf probability
mass}.
\index{probability density function}
\index{Pdf}
\index{Pmf}

A {\bf probability density function}, or PDF, is the continuous version of a
PMF, where the possible values make up a continuous range rather than
a discrete set.

\index{Gaussian distribution}
In mathematical notation, PDFs are usually written as functions; for
example, here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1:
%
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
%
For a given value of $x$, this function computes a probability
density.
A density is similar
to a probability mass in the sense that a higher density indicates
that a value is more likely.
\index{density}
\index{probability density}
\index{probability}

But a density is not a probability.  A density can be 0 or any positive
value; it is not bounded, like a probability, between 0 and 1.

If you integrate a density
over a continuous range, the result is a probability.  But
for the applications in this book we seldom have to do that.

Instead we primarily use probability densities as part
of a likelihood function.  We will see an example soon.

\section{Representing PDFs}

\index{Pdf}
To represent PDFs in Python,
{\tt thinkbayes.py} provides a class named {\tt Pdf}.
{\tt Pdf} is an {\bf abstract type}, which means that it defines
the interface a Pdf is supposed to have, but does not provide
a complete implementation.  The {\tt Pdf} interface includes
two methods, {\tt Density} and {\tt MakePmf}:

\begin{verbatim}
class Pdf(object):

    def Density(self, x):
        raise UnimplementedMethodException()

    def MakePmf(self, xs):
        pmf = Pmf()
        for x in xs:
            pmf.Set(x, self.Density(x))
        pmf.Normalize()
        return pmf
\end{verbatim}

{\tt Density} takes a value, {\tt x}, and returns the corresponding
density.  {\tt MakePmf} makes a discrete approximation to the PDF.

{\tt Pdf} provides an implementation of {\tt MakePmf}, but not {\tt
Density}, which has to be provided by a child class.
\index{abstract type} \index{concrete type} \index{interface}
\index{implementation}

\index{Gaussian distribution}
A {\bf concrete type} is a child class that extends an abstract type
and provides an implementation of the missing methods.
For example, {\tt GaussianPdf} extends {\tt Pdf} and provides
{\tt Density}:

\begin{verbatim}
class GaussianPdf(Pdf):

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        return scipy.stats.norm.pdf(x, self.mu, self.sigma)
\end{verbatim}

\verb"__init__" takes {\tt mu} and {\tt sigma}, which are
the mean and standard deviation of the distribution, and stores
them as attributes.

{\tt Density} uses a function from {\tt scipy.stats} to evaluate the
Gaussian PDF.  The function is called {\tt norm.pdf} because the
Gaussian distribution is also called the ``normal'' distribution.
\index{scipy}
\index{normal distribution}

The Gaussian PDF is defined by a simple mathematical function,
so it is easy to evaluate.  And it is useful because many
quantities in the real world have distributions that are
approximately Gaussian.
\index{Gaussian distribution}
\index{Gaussian PDF}
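
As a quick check (my example, not from {\tt price.py}), the standard
Gaussian has its maximum density at 0, which is
$1 / \sqrt{2 \pi} \approx 0.4$:

\begin{verbatim}
pdf = GaussianPdf(0, 1)
print pdf.Density(0)    # about 0.3989
\end{verbatim}
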
3511
3512
But with real data, there is no guarantee that the distribution
3513
is Gaussian or any other simple mathematical function. In
3514
that case we can use a sample to estimate the PDF of
3515
the whole population.
3516
3517
For example, in {\it The Price Is Right} data, we have
3518
313 prices for the first showcase. We can think of these
3519
values as a sample from the population of all possible showcase
3520
prices.
3521
3522
This sample includes the following values (in order):
3523
%
3524
\[ 28800, 28868, 28941, 28957, 28958 \]
3525
%
3526
In the sample, no values appear between 28801 and 28867, but
3527
there is no reason to think that these values are impossible.
3528
Based on our background information, we expect all
3529
values in this range to be equally likely. In other words,
3530
we expect the PDF to be fairly smooth.
3531
3532
Kernel density estimation (KDE) is an algorithm that takes
3533
a sample and finds an appropriately smooth PDF that fits
3534
the data. You can read details at
3535
\url{http://en.wikipedia.org/wiki/Kernel_density_estimation}.
3536
\index{KDE}
3537
\index{kernel density estimation}
3538
3539
{\tt scipy} provides an implementation of KDE and {\tt thinkbayes}
3540
provides a class called {\tt EstimatedPdf} that
3541
uses it:
3542
\index{scipy}
3543
\index{numpy}
3544
3545

\begin{verbatim}
class EstimatedPdf(Pdf):

    def __init__(self, sample):
        self.kde = scipy.stats.gaussian_kde(sample)

    def Density(self, x):
        return self.kde.evaluate(x)
\end{verbatim}

\verb"__init__" takes a sample
and computes a kernel density estimate.  The result is a
\verb"gaussian_kde" object that provides an {\tt evaluate}
method.

{\tt Density} takes a value, calls \verb"gaussian_kde.evaluate",
and returns the resulting density.
\index{density}

Finally, here's an outline of the code I used to generate
Figure~\ref{fig.price1}:
\index{numpy}

\begin{verbatim}
prices = ReadData()
pdf = thinkbayes.EstimatedPdf(prices)

low, high = 0, 75000
n = 101
xs = numpy.linspace(low, high, n)
pmf = pdf.MakePmf(xs)
\end{verbatim}

{\tt pdf} is a {\tt Pdf} object, estimated by KDE.  {\tt pmf}
is a Pmf object that approximates the Pdf by evaluating the density
at a sequence of equally spaced values.

{\tt linspace} stands for
``linear space.''  It takes a range, {\tt low} and {\tt high}, and
the number of points, {\tt n}, and returns a new {\tt numpy}
array with {\tt n} elements equally spaced between {\tt low} and
{\tt high}, including both.

And now back to {\it The Price is Right}.

\section{Modeling the contestants}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price2.pdf}}
\caption{Cumulative distribution (CDF) of the difference between the
contestant's bid and the actual price.}
\label{fig.price2}
\end{figure}

The PDFs in Figure~\ref{fig.price1} estimate the distribution of
possible prices.  If you were a contestant on the
show, you could use this distribution to quantify your prior belief
about the price of each showcase (before you see the prizes).

To update these priors, we have to answer these questions:

\begin{enumerate}

\item What data should we consider and how should we quantify it?

\item Can we compute a likelihood function; that is,
  for each hypothetical value of {\tt price}, can we compute
  the conditional likelihood of the data?

\end{enumerate}

To answer these questions, I am going to model the contestant
as a price-guessing instrument with known error characteristics.
In other words, when the contestant sees the prizes, he or she
guesses the price of each prize---ideally without taking into
consideration the fact that the prize is part of a showcase---and
adds up the prices.  Let's call this total {\tt guess}.
\index{error}

Under this model, the question we have to answer is, ``If the
actual price is {\tt price}, what is the likelihood that the
contestant's estimate would be {\tt guess}?''
\index{likelihood}

Or if we define
%
\begin{verbatim}
error = price - guess
\end{verbatim}
%
then we could ask, ``What is the likelihood
that the contestant's estimate is off by {\tt error}?''

To answer this question, we can use the historical data again.
Figure~\ref{fig.price2} shows the cumulative distribution of {\tt diff},
the difference between the contestant's bid and the actual price
of the showcase.
\index{Cdf}

The definition of {\tt diff} is
%
\begin{verbatim}
diff = price - bid
\end{verbatim}
%
When {\tt diff} is negative, the bid is too high.  As an
aside, we can use this distribution to compute the probability that the
contestants overbid: the first contestant overbids 25\% of the
time; the second contestant overbids 29\% of the time.

We can also see that the bids are biased;
that is, they are more likely to be too low than too high.  And
that makes sense, given the rules of the game.

Finally, we can use this distribution to estimate the reliability of
the contestants' guesses.  This step is a little tricky because
we don't actually know the contestant's guesses; we only know
what they bid.

So we'll have to make some assumptions.  Specifically, I
assume that the distribution of {\tt error} is Gaussian with mean 0
and the same variance as {\tt diff}.
\index{Gaussian distribution}

The {\tt Player} class implements this model:
\index{numpy}

\begin{verbatim}
class Player(object):

    def __init__(self, prices, bids, diffs):
        self.pdf_price = thinkbayes.EstimatedPdf(prices)
        self.cdf_diff = thinkbayes.MakeCdfFromList(diffs)

        mu = 0
        sigma = numpy.std(diffs)
        self.pdf_error = thinkbayes.GaussianPdf(mu, sigma)
\end{verbatim}

{\tt prices} is a sequence of showcase prices, {\tt bids} is a
sequence of bids, and {\tt diffs} is a sequence of diffs, where
again {\tt diff = price - bid}.

\verb"pdf_price" is the smoothed PDF of prices, estimated by KDE.
\verb"cdf_diff" is the cumulative distribution of {\tt diff},
which we saw in Figure~\ref{fig.price2}.  And \verb"pdf_error"
is the PDF that characterizes the distribution of errors, where
{\tt error = price - guess}.

Again, we use the variance of {\tt diff} to estimate the variance of
{\tt error}.  This estimate is not perfect because contestants' bids
are sometimes strategic; for example, if Player 2 thinks that Player 1
has overbid, Player 2 might make a very low bid.  In that case {\tt
diff} does not reflect {\tt error}.  If this happens a lot, the
observed variance in {\tt diff} might overestimate the variance in
{\tt error}.  Nevertheless, I think it is a reasonable modeling
decision.

As an alternative, someone preparing to appear on the show could
estimate their own distribution of {\tt error} by watching previous shows
and recording their guesses and the actual prices.


\section{Likelihood}

Now we are ready to write the likelihood function.  As usual,
I define a new class that extends {\tt thinkbayes.Suite}:
\index{likelihood}

\begin{verbatim}
class Price(thinkbayes.Suite):

    def __init__(self, pmf, player):
        thinkbayes.Suite.__init__(self, pmf)
        self.player = player
\end{verbatim}

{\tt pmf} represents the prior distribution and
{\tt player} is a Player object as described in the previous
section.  Here's {\tt Likelihood}:

\begin{verbatim}
    def Likelihood(self, data, hypo):
        price = hypo
        guess = data

        error = price - guess
        like = self.player.ErrorDensity(error)

        return like
\end{verbatim}

{\tt hypo} is the hypothetical price of the showcase.  {\tt data}
is the contestant's best guess at the price.  {\tt error} is
the difference, and {\tt like} is the likelihood of the data,
given the hypothesis.

{\tt ErrorDensity} is defined in {\tt Player}:

\begin{verbatim}
# class Player:

    def ErrorDensity(self, error):
        return self.pdf_error.Density(error)
\end{verbatim}

{\tt ErrorDensity} works by evaluating \verb"pdf_error" at
the given value of {\tt error}.
The result is a probability density, so it is not really a probability.
But remember that {\tt Likelihood} doesn't
need to compute a probability; it only has to compute something {\em
proportional} to a probability.  As long as the constant of
proportionality is the same for all likelihoods, it gets canceled out
when we normalize the posterior distribution.
\index{density}
\index{likelihood}

And therefore, a probability density is a perfectly good likelihood.


\section{Update}

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price3.pdf}}
\caption{Prior and posterior distributions for Player 1, based on
a best guess of \$20,000.}
\label{fig.price3}
\end{figure}

{\tt Player} provides a method that takes the contestant's
guess and computes the posterior distribution:

\begin{verbatim}
# class Player

    def MakeBeliefs(self, guess):
        pmf = self.PmfPrice()
        self.prior = Price(pmf, self)
        self.posterior = self.prior.Copy()
        self.posterior.Update(guess)
\end{verbatim}

{\tt PmfPrice} generates a discrete approximation
to the PDF of price, which we use to construct the prior.

{\tt PmfPrice} uses {\tt MakePmf}, which
evaluates \verb"pdf_price" at a sequence of values:

\begin{verbatim}
# class Player

    n = 101
    price_xs = numpy.linspace(0, 75000, n)

    def PmfPrice(self):
        return self.pdf_price.MakePmf(self.price_xs)
\end{verbatim}

To construct the posterior, we make a copy of the
prior and then invoke {\tt Update}, which invokes {\tt Likelihood}
for each hypothesis, multiplies the priors by the likelihoods,
and renormalizes.
\index{normalize}
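
For reference, here is a sketch of what {\tt Suite.Update} does
(the real version is in {\tt thinkbayes.py}):

\begin{verbatim}
# class Suite (a sketch)

    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        return self.Normalize()
\end{verbatim}
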

So let's get back to the original scenario.  Suppose you are
Player 1 and when you see your showcase, your best guess is
that the total price of the prizes is \$20,000.

Figure~\ref{fig.price3} shows prior and
posterior beliefs about the actual price.
The posterior is shifted
to the left because your guess
is on the low end of the prior range.

On one level, this result makes sense.  The most likely value
in the prior is \$27,750, your best guess is \$20,000, and
the mean of the posterior is somewhere in between: \$25,096.

On another level, you might find this result bizarre, because it
suggests that if you {\em think} the price is \$20,000, then you
should {\em believe} the price is \$24,000.

To resolve this apparent paradox, remember that you are combining two
sources of information, historical data about past showcases and
guesses about the prizes you see.

We are treating the historical data as the prior and updating it
based on your guesses, but we could equivalently use your guess
as a prior and update it based on historical data.

If you think of it that way, maybe it is less surprising that the
most likely value in the posterior is not your original guess.


\section{Optimal bidding}

Now that we have a posterior distribution, we can use it to
compute the optimal bid, which I define as the bid that maximizes
expected return (see \url{http://en.wikipedia.org/wiki/Expected_return}).
\index{decision analysis}

I'm going to present the methods in this section top-down, which
means I will show you how they are used before I show you how they
work.  If you see an unfamiliar method, don't worry; the definition
will be along shortly.

To compute optimal bids, I wrote a class called {\tt GainCalculator}:

\begin{verbatim}
class GainCalculator(object):

    def __init__(self, player, opponent):
        self.player = player
        self.opponent = opponent
\end{verbatim}

{\tt player} and {\tt opponent} are {\tt Player} objects.

{\tt GainCalculator} provides {\tt ExpectedGains}, which
computes a sequence of bids and the expected gain for each
bid:
\index{numpy}

\begin{verbatim}
    def ExpectedGains(self, low=0, high=75000, n=101):
        bids = numpy.linspace(low, high, n)

        gains = [self.ExpectedGain(bid) for bid in bids]

        return bids, gains
\end{verbatim}

{\tt low} and {\tt high} specify the range of possible bids;
{\tt n} is the number of bids to try.

{\tt ExpectedGains} calls {\tt ExpectedGain}, which
computes expected gain for a given bid:

\begin{verbatim}
    def ExpectedGain(self, bid):
        suite = self.player.posterior
        total = 0
        for price, prob in sorted(suite.Items()):
            gain = self.Gain(bid, price)
            total += prob * gain
        return total
\end{verbatim}

{\tt ExpectedGain} loops through the values in the posterior
and computes the gain for the given bid under each hypothetical
price of the showcase.  It weights each gain with the corresponding
probability and returns the total.

\begin{figure}
% price.py
\centerline{\includegraphics[height=2.5in]{figs/price5.pdf}}
\caption{Expected gain versus bid in a scenario where Player 1's best
guess is \$20,000 and Player 2's best guess is \$40,000.}
\label{fig.price5}
\end{figure}

{\tt ExpectedGain} invokes {\tt Gain}, which takes a bid and an actual
price and returns the expected gain:

\begin{verbatim}
    def Gain(self, bid, price):
        if bid > price:
            return 0

        diff = price - bid
        prob = self.ProbWin(diff)

        if diff <= 250:
            return 2 * price * prob
        else:
            return price * prob
\end{verbatim}

If you overbid, you get nothing.  Otherwise we compute
the difference between your bid and the price, which determines
your probability of winning.

If {\tt diff} is less than \$250, you win both showcases.  For
simplicity, I assume that both showcases have the same price.  Since
this outcome is rare, it doesn't make much difference.

Finally, we have to compute the probability of winning based
on {\tt diff}:

\begin{verbatim}
    def ProbWin(self, diff):
        prob = (self.opponent.ProbOverbid() +
                self.opponent.ProbWorseThan(diff))
        return prob
\end{verbatim}

If your opponent overbids, you win.  Otherwise, you have to hope
that your opponent is off by more than {\tt diff}.  {\tt Player}
provides methods to compute both probabilities:

\begin{verbatim}
# class Player:

    def ProbOverbid(self):
        return self.cdf_diff.Prob(-1)

    def ProbWorseThan(self, diff):
        return 1 - self.cdf_diff.Prob(diff)
\end{verbatim}

This code might be confusing because the computation is now from
the point of view of the opponent, who is computing, ``What is
the probability that I overbid?'' and ``What is the probability
that my bid is off by more than {\tt diff}?''

Both answers are based on the CDF of {\tt diff}.  If the opponent's
{\tt diff} is less than or equal to -1, you win.  If the opponent's
{\tt diff} is worse than yours, you win.  Otherwise you lose.

Finally, here's the code that computes optimal bids:

\begin{verbatim}
# class Player:

    def OptimalBid(self, guess, opponent):
        self.MakeBeliefs(guess)
        calc = GainCalculator(self, opponent)
        bids, gains = calc.ExpectedGains()
        gain, bid = max(zip(gains, bids))
        return bid, gain
\end{verbatim}

Given a guess and an opponent, {\tt OptimalBid} computes
the posterior distribution, instantiates a {\tt GainCalculator},
computes expected gains for a range of bids and returns
the optimal bid and expected gain.  Whew!
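
Putting it together, the call site looks something like this, where
\verb"MakePlayers" is a hypothetical helper that builds the two
{\tt Player} objects from the historical data:

\begin{verbatim}
player1, player2 = MakePlayers()    # hypothetical helper
bid, gain = player1.OptimalBid(20000, player2)
\end{verbatim}
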

Figure~\ref{fig.price5} shows the results for both players,
based on a scenario where Player 1's best guess is \$20,000
and Player 2's best guess is \$40,000.

For Player 1 the optimal bid is \$21,000, yielding an expected
return of almost \$16,700.  This is a case (which turns out
to be unusual) where the optimal bid is actually higher than
the contestant's best guess.

For Player 2 the optimal bid is \$31,500, yielding an expected
return of almost \$19,400.  This is the more typical case where
the optimal bid is less than the best guess.


\section{Discussion}

One of the features of Bayesian estimation is that the
result comes in the form of a posterior distribution.  Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.
\index{distribution}

In this example, we use the posterior distribution
to compute an optimal bid.  The return on a given bid is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically.  But it is relatively simple to do
computationally.
\index{decision analysis}

Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate.  These summaries can be useful, but if that's
all you need, then you probably don't need Bayesian methods in the
first place.
\index{maximum likelihood}
\index{summary statistic}

Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the analysis to perform some
kind of decision analysis, as we did in this chapter, or some kind of
prediction, as we see in the next chapter.


\chapter{Prediction}
\label{prediction}

\section{The Boston Bruins problem}

In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
Bruins played a best-of-seven championship series against the despised
Vancouver Canucks.  Boston lost the first two games 0-1 and 2-3, then
won the next two games 8-1 and 4-0.  At this point in the series, what
is the probability that Boston will win the next game, and what is
their probability of winning the championship?
\index{hockey}
\index{Boston Bruins}
\index{Vancouver Canucks}

As always, to answer a question like this, we need to make some
assumptions.  First, it is reasonable to believe that goal scoring in
hockey is at least approximately a Poisson process, which means that
it is equally likely for a goal to be scored at any time during a
game.  Second, we can assume that against a particular opponent, each team
has some long-term average goals per game, denoted $\lambda$.
\index{Poisson process}

Given these assumptions, my strategy for answering this question is:

\begin{enumerate}

\item Use statistics from previous games to choose a prior
  distribution for $\lambda$.

\item Use the score from the first four games to estimate $\lambda$
  for each team.

\item Use the posterior distributions of $\lambda$ to compute the
  distribution of goals for each team, the distribution of the
  goal differential, and the probability that each team wins
  the next game.

\item Compute the probability that each team wins the series.

\end{enumerate}

To choose a prior distribution, I got some statistics from
\url{http://www.nhl.com}, specifically the average goals per game
for each team in the 2010-11 season.  The distribution is roughly
Gaussian with mean 2.8 and standard deviation 0.3.
\index{National Hockey League}
\index{NHL}

The Gaussian distribution is continuous, but we'll approximate it with
a discrete Pmf.  \verb"thinkbayes" provides \verb"MakeGaussianPmf" to
do exactly that:
\index{numpy}
\index{Gaussian distribution}
\begin{verbatim}
def MakeGaussianPmf(mu, sigma, num_sigmas, n=101):
    pmf = Pmf()
    low = mu - num_sigmas*sigma
    high = mu + num_sigmas*sigma

    for x in numpy.linspace(low, high, n):
        p = scipy.stats.norm.pdf(x, mu, sigma)
        pmf.Set(x, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt mu} and {\tt sigma} are the mean and standard deviation of the
Gaussian distribution.  \verb"num_sigmas" is the number of standard
deviations above and below the mean that the Pmf will span, and {\tt
n} is the number of values in the Pmf.

Again we use {\tt numpy.linspace} to make an array of {\tt n}
equally spaced values between {\tt low} and {\tt high}, including
both.

\verb"norm.pdf" evaluates the Gaussian probability density function (PDF).
\index{PDF}
\index{probability density function}

Getting back to the hockey problem, here's the definition for a suite
of hypotheses about the value of $\lambda$:

\begin{verbatim}
class Hockey(thinkbayes.Suite):

    def __init__(self, name=''):
        pmf = thinkbayes.MakeGaussianPmf(2.7, 0.3, 4)
        thinkbayes.Suite.__init__(self, pmf, name=name)
\end{verbatim}

So the prior distribution is Gaussian with mean 2.7, standard deviation
0.3, and it spans 4 sigmas above and below the mean.

As always, we have to decide how to represent each hypothesis; in
this case I represent the hypothesis that $\lambda=x$ with the
floating-point value {\tt x}.


\section{Poisson processes}

In mathematical statistics, a {\bf process} is a stochastic model of a
physical system (``stochastic'' means that the model has some kind of
randomness in it).  For example, a Bernoulli process is a model of a
sequence of events, called trials, in which each trial has two
possible outcomes, like success and failure.  So a Bernoulli process
is a natural model for a series of coin flips, or a series of shots on
goal.  \index{process} \index{Poisson process}

A Poisson process is the continuous version of a Bernoulli process,
where an event can occur at any point in time with equal probability.
Poisson processes can be used to model customers arriving in a store,
buses arriving at a bus stop, or goals scored in a hockey game.
\index{Bernoulli process}

In many real systems the probability of an event changes over time.
Customers are more likely to go to a store at certain times of day,
buses are supposed to arrive at fixed intervals, and goals are more
or less likely at different times during a game.

But all models are based on simplifications, and in this case modeling
a hockey game with a Poisson process is a reasonable choice.  Heuer,
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
and come to the same conclusion; see
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}.
\index{Heuer, Andreas}

The benefit of using this model is that we can compute the distribution
of goals per game efficiently, as well as the distribution of time
between goals.  Specifically, if the average number of goals
in a game is {\tt lam}, the distribution of goals per game is
given by the Poisson PMF:
\index{Poisson distribution}

\begin{verbatim}
def EvalPoissonPmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)
\end{verbatim}

And the distribution of time between goals is given by the
exponential PDF:
\index{exponential distribution}

\begin{verbatim}
def EvalExponentialPdf(x, lam):
    return lam * math.exp(-lam * x)
\end{verbatim}

I use the variable
{\tt lam} because {\tt lambda} is a reserved keyword in Python.
Both of these functions are in \verb"thinkbayes.py".
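
As a quick sanity check (my numbers, not from the book's code): with
{\tt lam = 2.7}, the probability of scoring exactly 3 goals in a game
is

\begin{verbatim}
print thinkbayes.EvalPoissonPmf(3, 2.7)    # about 0.22
\end{verbatim}
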


\section{The posteriors}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey1.pdf}}
\caption{Posterior distribution of the number of
goals per game.}
\label{fig.hockey1}
\end{figure}

Now we can compute the likelihood that a team with a hypothetical
value of {\tt lam} scores {\tt k} goals in a game:

\begin{verbatim}
# class Hockey

    def Likelihood(self, data, hypo):
        lam = hypo
        k = data
        like = thinkbayes.EvalPoissonPmf(k, lam)
        return like
\end{verbatim}

Each hypothesis is a possible value of $\lambda$; {\tt
data} is the observed number of goals, {\tt k}.

With the likelihood function in place, we can make a suite for each
team and update them with the scores from the first four games:

\begin{verbatim}
suite1 = Hockey('bruins')
suite1.UpdateSet([0, 2, 8, 4])

suite2 = Hockey('canucks')
suite2.UpdateSet([1, 3, 1, 0])
\end{verbatim}

Figure~\ref{fig.hockey1} shows the resulting posterior distributions
for {\tt lam}.  Based on the first four games, the most likely
values for {\tt lam} are 2.6 for the Canucks and 2.9 for the Bruins.


\section{The distribution of goals}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey2.pdf}}
\caption{Distribution of goals in a single game.}
\label{fig.hockey2}
\end{figure}

To compute the probability that each team wins the next game,
we need to compute the distribution of goals for each team.

If we knew the value of {\tt lam} exactly, we could use the
Poisson distribution again.  \verb"thinkbayes" provides a
method that computes a truncated approximation of a Poisson
distribution:
\index{Poisson distribution}

\begin{verbatim}
def MakePoissonPmf(lam, high):
    pmf = Pmf()
    for k in xrange(0, high+1):
        p = EvalPoissonPmf(k, lam)
        pmf.Set(k, p)
    pmf.Normalize()
    return pmf
\end{verbatim}

The range of values in the computed Pmf is from 0 to {\tt high}.
So if the value of {\tt lam} were exactly 3.4, we would compute:

\begin{verbatim}
lam = 3.4
goal_dist = thinkbayes.MakePoissonPmf(lam, 10)
\end{verbatim}

I chose the upper bound, 10, because the probability of scoring
more than 10 goals in a game is quite low.

That's simple enough so far; the problem is that we don't know
the value of {\tt lam} exactly.  Instead, we have a distribution
of possible values for {\tt lam}.

For each value of {\tt lam}, the distribution of goals is Poisson.
So the overall distribution of goals is a mixture of these
Poisson distributions, weighted according to the probabilities
in the distribution of {\tt lam}.
\index{mixture}
\index{Poisson distribution}

Given the posterior distribution of {\tt lam}, here's the code
that makes the distribution of goals:

\begin{verbatim}
def MakeGoalPmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakePoissonPmf(lam, 10)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

For each value of {\tt lam} we make a Poisson Pmf and add it to the
meta-Pmf.  I call it a meta-Pmf because it is a Pmf that contains
Pmfs as its values.
\index{meta-Pmf}

Then we use \verb"MakeMixture" to compute the mixture
(we saw {\tt MakeMixture} in Section~\ref{mixture}).
\index{mixture}
\index{MakeMixture}

Figure~\ref{fig.hockey2} shows the resulting distribution of goals for
the Bruins and Canucks.  The Bruins are less likely to
score 3 goals or fewer in the next game, and more likely to score 4 or
more.


\section{The probability of winning}

\begin{figure}
% hockey.py
\centerline{\includegraphics[height=2.5in]{figs/hockey3.pdf}}
\caption{Distribution of time between goals.}
\label{fig.hockey3}
\end{figure}

To get the probability of winning, first we compute the
distribution of the goal differential:

\begin{verbatim}
goal_dist1 = MakeGoalPmf(suite1)
goal_dist2 = MakeGoalPmf(suite2)
diff = goal_dist1 - goal_dist2
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
pairs of values and computes the difference.  Subtracting two
distributions is almost the same as adding, which we saw in
Section~\ref{addends}.
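
A sketch of \verb"Pmf.__sub__", by analogy with {\tt PmfMax}, might
look like this:

\begin{verbatim}
# class Pmf (a sketch)

    def __sub__(self, other):
        pmf = Pmf()
        for v1, p1 in self.Items():
            for v2, p2 in other.Items():
                pmf.Incr(v1 - v2, p1 * p2)
        return pmf
\end{verbatim}
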

If the goal differential is positive, the Bruins win; if negative, the
Canucks win; if 0, it's a tie:

\begin{verbatim}
p_win = diff.ProbGreater(0)
p_loss = diff.ProbLess(0)
p_tie = diff.Prob(0)
\end{verbatim}

With the distributions from the previous section, \verb"p_win"
is 46\%, \verb"p_loss" is 37\%, and \verb"p_tie" is 17\%.

In the event of a tie at the end of ``regulation play,'' the teams play
overtime periods until one team scores.  Since the game ends
immediately when the first goal is scored, this overtime format
is known as ``sudden death.''
\index{overtime}
\index{sudden death}


\section{Sudden death}

To compute the probability of winning in a sudden death overtime,
the important statistic is not goals per game, but time until the
first goal.  The assumption that goal-scoring is a Poisson process
implies that the time between goals
is exponentially distributed.
\index{Poisson process}
\index{exponential distribution}

Given {\tt lam}, we can compute the distribution of the time between
goals like this:

\begin{verbatim}
lam = 3.4
time_dist = thinkbayes.MakeExponentialPmf(lam, high=2, n=101)
\end{verbatim}

{\tt high} is the upper bound of the distribution.  In this case
I chose 2, because the probability of going more than two games
without scoring is small.  {\tt n} is the number of values in
the Pmf.
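
{\tt MakeExponentialPmf} is not shown in the text; a discrete
approximation consistent with this usage might look like this:

\begin{verbatim}
def MakeExponentialPmf(lam, high, n=200):
    """Evaluates the exponential PDF at n points from 0 to high."""
    pmf = Pmf()
    for x in numpy.linspace(0, high, n):
        p = EvalExponentialPdf(x, lam)
        pmf.Set(x, p)
    pmf.Normalize()
    return pmf
\end{verbatim}
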

If we know {\tt lam} exactly, that's all there is to it.
But we don't; instead we have a posterior
distribution of possible values.  So as we did with the distribution
of goals, we make a meta-Pmf and compute a mixture of
Pmfs.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}

\begin{verbatim}
def MakeGoalTimePmf(suite):
    metapmf = thinkbayes.Pmf()

    for lam, prob in suite.Items():
        pmf = thinkbayes.MakeExponentialPmf(lam, high=2, n=2001)
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return mix
\end{verbatim}

Figure~\ref{fig.hockey3} shows the resulting distributions.  For
time values less than one period (one third of a game), the Bruins
are more likely to score.  The time until the Canucks score is
more likely to be longer.

I set the number of values, {\tt n}, fairly high in order to minimize
the number of ties, since it is not possible for both teams
to score simultaneously.

Now we compute the probability that the Bruins score first:

\begin{verbatim}
time_dist1 = MakeGoalTimePmf(suite1)
time_dist2 = MakeGoalTimePmf(suite2)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)
\end{verbatim}

For the Bruins, the probability of winning in overtime is 52\%.

Finally, the total probability of winning is the chance of
winning at the end of regulation play plus the chance of a tie
times the probability of winning in overtime.

\begin{verbatim}
p_tie = diff.Prob(0)
p_overtime = thinkbayes.PmfProbLess(time_dist1, time_dist2)

p_win = diff.ProbGreater(0) + p_tie * p_overtime
\end{verbatim}

For the Bruins, the overall chance of winning the next game is 55\%.

To win the series, the Bruins can either win the next two games
or split the next two and win the third.  Again, we can compute
the total probability:

\begin{verbatim}
# win the next two
p_series = p_win**2

# split the next two, win the third
p_series += 2 * p_win * (1-p_win) * p_win
\end{verbatim}
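
Plugging in \verb"p_win" = 0.55 checks the arithmetic:
$0.55^2 + 2 (0.55)(0.45)(0.55) \approx 0.57$.
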

The Bruins' chance of winning the series is 57\%.  And in 2011,
they did.


\section{Discussion}

As always, the analysis in this chapter is based on modeling decisions,
and modeling is almost always an iterative process.  In general,
you want to start with something simple that yields an approximate
answer, identify likely sources of error, and look for opportunities
for improvement.
\index{modeling}
\index{iterative modeling}

In this example, I would consider these options:

\begin{itemize}

\item I chose a prior based on the average goals per game for each
  team.  But this statistic is averaged across all opponents.  Against
  a particular opponent, we might expect more variability.  For
  example, if the team with the best offense plays the team with the
  worst defense, the expected goals per game might be several standard
  deviations above the mean.

\item For data I used only the first four games of the championship
  series.  If the same teams played each other during the
  regular season, I could use the results from those games as well.
  One complication is that the composition of teams changes during
  the season due to trades and injuries.  So it might be best to
  give more weight to recent games.

\item To take advantage of all available information, we could
  use results from all regular season games to estimate each team's
  goal scoring rate, possibly adjusted by estimating
  an additional factor for each pairwise match-up.  This approach
  would be more complicated, but it is still feasible.

\end{itemize}

For the first option, we could use the results from the regular season
to estimate the variability across all pairwise match-ups.  Thanks to
Dirk Hoag at \url{http://forechecker.blogspot.com/}, I was able to get
the number of goals scored during regulation play (not overtime) for
each game in the regular season.
\index{Hoag, Dirk}

Teams in different conferences only play each other one or two
times in the regular season, so I focused on pairs that played
each other 4--6 times.  For each pair, I computed the average
goals per game, which is an estimate of $\lambda$, then plotted
the distribution of these estimates.

The mean of these estimates is 2.8, again, but the standard
deviation is 0.85, substantially higher than what we got computing
one estimate for each team.

If we run the analysis again with the higher-variance prior, the
probability that the Bruins win the series is 80\%, substantially
higher than the result with the low-variance prior, 57\%.

So it turns out that the results are sensitive to the prior, which
makes sense considering how little data we have to work with.  Based
on the difference between the low-variance model and the high-variance
model, it seems worthwhile to put some effort into getting the prior
right.

The code and data for this chapter are available from
\url{http://thinkbayes.com/hockey.py} and
\url{http://thinkbayes.com/hockey_data.csv}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

If buses arrive at a bus stop every 20 minutes, and you
arrive at the bus stop at a random time, your wait time until
the bus arrives is uniformly distributed from 0 to 20 minutes.
\index{bus stop problem}

But in reality, there is variability in the time between
buses.  Suppose you are waiting for a bus, and you know the historical
distribution of time between buses.  Compute your distribution
of wait times.

Hint: Suppose that the time between buses is either
5 or 10 minutes with equal probability.  What is the probability
that you arrive during one of the 10 minute intervals?

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that passengers arriving at the bus stop are well-modeled
by a Poisson process with parameter $\lambda$.  If you arrive at the
stop and find 3 people waiting, what is your posterior distribution
for the time since the last bus arrived?
\index{Poisson process}
\index{bus stop problem}

I solve a version of this problem in the next chapter.

\end{exercise}


\begin{exercise}

Suppose that you are an ecologist sampling the insect population in
a new environment.  You deploy 100 traps in a test area and come back
the next day to check on them.  You find that 37 traps have been
triggered, trapping an insect inside.  Once a trap triggers, it
cannot trap another insect until it has been reset.
\index{insect sampling problem}

If you reset the traps and come back in two days, how many traps
do you expect to find triggered?  Compute a posterior predictive
distribution for the number of traps.
\index{predictive distribution}

\end{exercise}


\begin{exercise}

Suppose you are the manager of an apartment building with
100 light bulbs in common areas.  It is your responsibility
to replace light bulbs when they break.
\index{light bulb problem}

On January 1, all 100 bulbs are working.  When you inspect
them on February 1, you find 3 light bulbs out.  If you
come back on April 1, how many light bulbs do you expect to
find broken?

In the previous exercise, you could reasonably assume that an event is
equally likely at any time.  For light bulbs, the likelihood of
failure depends on the age of the bulb.  Specifically, old bulbs
have an increasing failure rate due to evaporation of the filament.

This problem is more open-ended than some; you will have to make
modeling decisions.  You might want to read about the Weibull
distribution
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
Or you might want to look around for information about
light bulb survival curves.
\index{Weibull distribution}

\end{exercise}


\chapter{Observer Bias}
\label{observer}

\section{The Red Line problem}

In Massachusetts, the Red Line is a subway that connects
Cambridge and Boston.  When I was working in Cambridge I took the Red
Line from Kendall Square to South Station and caught the commuter rail
to Needham.  During rush hour Red Line trains run every 7--8
minutes, on average.
\index{Red Line problem}
\index{Boston}

When I arrived at the station, I could estimate the time until
the next train based on the number of passengers on the platform.
If there were only a few people, I inferred that I just missed
a train and expected to wait about 7 minutes.  If there were
more passengers, I expected the train to arrive sooner.  But if
there were a large number of passengers, I suspected that
trains were not running on schedule, so I would go back to the
street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian
estimation could help predict my wait time and decide when I
should give up and take a taxi.  This chapter presents the
analysis I came up with.

This chapter is based on a project by Brendan Ritter and
Kai Austin, who took a class with me at Olin College.
The code in this chapter is available from
\url{http://thinkbayes.com/redline.py}.  The code I used
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
For more information
see Section~\ref{download}.
\index{Olin College}

\section{The model}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
\caption{PMF of gaps between trains, based on collected data,
smoothed by KDE.  {\tt z} is the actual distribution; {\tt zb}
is the biased distribution seen by passengers.}
\label{fig.redline0}
\end{figure}

Before we get to the analysis, we have to make some
modeling decisions.  First, I will treat passenger arrivals as
a Poisson process, which means I assume that passengers are equally
likely to arrive at any time, and that they arrive at an unknown
rate, $\lambda$, measured in passengers per minute.  Since I
observe passengers during a short period of time, and at the same
time every day, I assume that $\lambda$ is constant.
\index{Poisson process}

On the other hand, the arrival process for trains is not Poisson.
Trains to Boston are supposed to leave from the end of the line
(Alewife station) every 7--8 minutes during peak times, but by the time
they get to Kendall Square, the time between trains varies between 3
and 12 minutes.

To gather data on the time between trains, I wrote a script that
downloads real-time data from
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
trains arriving at Kendall Square, and records their arrival times
in a database.  I ran the script from 4pm to 6pm every weekday
for 5 days, and recorded about 15 arrivals per day.  Then
I computed the time between consecutive arrivals; the distribution
of these gaps is shown in Figure~\ref{fig.redline0}, labeled {\tt z}.

If you stood on the platform from 4pm to 6pm and recorded the time
between trains, this is the distribution you would see.  But if you
arrive at some random time (without regard to the train schedule) you
would see a different distribution.  The average time
between trains, as seen by a random passenger, is substantially
higher than the true average.

Why?  Because a passenger is more likely to arrive during a
large interval than a small one.  Consider a simple example:
suppose that the time between trains is either 5 minutes
or 10 minutes with equal probability.  In that case
the average time between
trains is 7.5 minutes.

But a passenger is more likely to arrive during a 10 minute gap
than a 5 minute gap; in fact, twice as likely.  If we surveyed
arriving passengers, we would find that 2/3 of them arrived during
a 10 minute gap, and only 1/3 during a 5 minute gap.  So the
average time between trains, as seen by an arriving passenger,
is $(2/3)(10) + (1/3)(5) = 8.33$ minutes.
4687
4688
This kind of {\bf observer bias} appears in many contexts. Students
4689
think that classes are bigger than they are because more of them are
4690
in the big classes. Airline passengers think that planes are fuller
4691
than they are because more of them are on full flights.
4692
\index{observer bias}
4693
4694
In each case, values from the actual distribution are
4695
oversampled in proportion to their value. In the Red Line example,
4696
a gap that is twice as big is twice as likely to be observed.
4697
4698
So given the actual distribution of gaps, we can compute the
4699
distribution of gaps as seen by passengers. {\tt BiasPmf}
4700
does this computation:
4701
4702
\begin{verbatim}
4703
def BiasPmf(pmf):
4704
new_pmf = pmf.Copy()
4705
4706
for x, p in pmf.Items():
4707
new_pmf.Mult(x, x)
4708
4709
new_pmf.Normalize()
4710
return new_pmf
4711
\end{verbatim}
4712
4713
{\tt pmf} is the actual distribution; \verb"new_pmf" is the
4714
biased distribution. Inside the loop, we multiply the
4715
probability of each value, {\tt x}, by the likelihood it will
4716
be observed, which is proportional to {\tt x}. Then we
4717
normalize the result.
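
As a quick check, we can apply {\tt BiasPmf} to the simple example
above (this snippet assumes {\tt thinkbayes} is imported; the expected
means follow from the arithmetic in the previous paragraphs):

\begin{verbatim}
pmf = thinkbayes.Pmf()
pmf.Set(5, 0.5)     # 5-minute gap
pmf.Set(10, 0.5)    # 10-minute gap

print pmf.Mean()            # 7.5
print BiasPmf(pmf).Mean()   # 8.33
\end{verbatim}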

Figure~\ref{fig.redline0} shows the actual distribution of gaps,
labeled {\tt z}, and the distribution of gaps seen by passengers,
labeled {\tt zb} for ``z biased''.


\section{Wait times}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
\caption{CDF of {\tt z}, {\tt zb}, and the wait time seen
by passengers, {\tt y}. }
\label{fig.redline2}
\end{figure}

Wait time, which I call {\tt y}, is the time between the arrival
of a passenger and the next arrival of a train. Elapsed time, which I
call {\tt x}, is the time between the arrival of the previous
train and the arrival of a passenger. I chose these definitions
so that {\tt zb = x + y}.

Given the distribution of {\tt zb}, we can compute the distribution of
{\tt y}. I'll start with a simple case and then generalize.
Suppose, as in the previous example, that {\tt zb} is either 5 minutes
with probability 1/3, or 10 minutes with probability 2/3.

If we arrive at a random time during a 5-minute gap,
{\tt y} is uniform from 0 to 5 minutes. If we arrive during a
10-minute gap, {\tt y} is uniform from 0 to 10. So the overall
distribution is a mixture of uniform distributions weighted
according to the probability of each gap.
\index{uniform distribution}

The following function takes the distribution of {\tt zb} and
computes the distribution of {\tt y}:

\begin{verbatim}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)

    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{verbatim}

{\tt PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
distribution to its probability. Then it uses {\tt MakeMixture},
which we saw in Section~\ref{mixture}, to compute the mixture.
\index{mixture}
\index{MakeMixture}
\index{meta-Pmf}

{\tt PmfOfWaitTime} also uses {\tt MakeUniformPmf}, defined here:

\begin{verbatim}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt low} and {\tt high} are the endpoints of the uniform
distribution (both ends included). Finally, {\tt MakeUniformPmf}
uses {\tt MakeRange}, defined here:

\begin{verbatim}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{verbatim}

{\tt MakeRange} defines a set of possible values for wait time
(expressed in seconds). By default it divides the range into
10-second intervals.
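
As a check, we can run the two-gap example through these functions,
with gaps expressed in seconds. The expected mean follows from the
mixture: $1/3 \cdot 2.5 + 2/3 \cdot 5 = 4.17$ minutes.

\begin{verbatim}
pmf_zb = thinkbayes.Pmf()
pmf_zb.Set(300, 1.0/3)    # 5-minute gap, in seconds
pmf_zb.Set(600, 2.0/3)    # 10-minute gap

pmf_y = PmfOfWaitTime(pmf_zb)
print pmf_y.Mean() / 60   # about 4.2 minutes
\end{verbatim}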

To encapsulate the process of computing these distributions, I
created a class called {\tt WaitTimeCalculator}:

\begin{verbatim}
class WaitTimeCalculator(object):

    def __init__(self, pmf_z):
        self.pmf_z = pmf_z
        self.pmf_zb = BiasPmf(pmf_z)

        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{verbatim}

The parameter, \verb"pmf_z", is the unbiased distribution of {\tt z}.
\verb"pmf_zb" is the biased distribution of gap time, as seen by
passengers.

\verb"pmf_y" is the distribution of wait time. \verb"pmf_x" is the
distribution of elapsed time, which is the same as the distribution of
wait time. To see why, remember that for a particular value of
{\tt zb}, the distribution of {\tt y} is uniform from 0 to {\tt zb}.
Also
%
\begin{verbatim}
x = zb - y
\end{verbatim}
%
So the distribution of {\tt x} is also uniform from 0 to {\tt zb}.

Figure~\ref{fig.redline2} shows the distribution of {\tt z}, {\tt zb},
and {\tt y} based on the data I collected from the Red Line web site.

To present these distributions, I am switching from Pmfs to Cdfs.
Most people are more familiar with Pmfs, but I think Cdfs are easier
to interpret, once you get used to them. And if you want to plot
several distributions on the same axes, Cdfs are the way to go.
\index{Cdf}
\index{cumulative distribution function}

The mean of {\tt z} is 7.8 minutes. The mean of {\tt zb} is 8.8
minutes, about 13\% higher. The mean of {\tt y} is 4.4 minutes, half
the mean of {\tt zb}.

As an aside, the Red Line schedule reports that trains run every
9 minutes during peak times. This is close to the average of
{\tt zb}, but higher than the average of {\tt z}. I exchanged email
with a representative of the MBTA, who confirmed that the reported
time between trains is deliberately conservative in order to
account for variability.


\section{Predicting wait times}
\label{elapsed}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
\caption{Prior and posterior of {\tt x} and predicted {\tt y}. }
\label{fig.redline3}
\end{figure}

Let's get back to the motivating question: suppose that when
I arrive at the platform I see 10 people waiting.
How long should I expect to wait until the next train arrives?

As always, let's start with the easiest version of the problem
and work our way up. Suppose we are given the actual distribution of
{\tt z}, and we know that the passenger arrival rate,
$\lambda$, is 2 passengers per minute.

In that case we can:

\begin{enumerate}

\item Use the distribution of {\tt z} to compute
the prior distribution of {\tt zb}, the time between trains
as seen by a passenger.

\item Then use the number of passengers to estimate the distribution
of {\tt x}, the elapsed time since the last train.

\item Finally, use the relation {\tt y = zb - x} to get the
distribution of {\tt y}.

\end{enumerate}

The first step is to create a {\tt WaitTimeCalculator} that
encapsulates the distributions of {\tt zb}, {\tt x},
and {\tt y}, prior to taking into account the number of
passengers.

\begin{verbatim}
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}

\verb"pmf_z" is the given distribution of gap times.

The next step is to make an {\tt ElapsedTimeEstimator} (defined
below), which encapsulates the posterior distribution of {\tt x} and
the predictive distribution of {\tt y}.
\index{predictive distribution}

\begin{verbatim}
ete = ElapsedTimeEstimator(wtc,
                           lam=2.0/60,
                           num_passengers=15)
\end{verbatim}

The parameters are the {\tt WaitTimeCalculator}, the passenger
arrival rate, {\tt lam} (expressed in passengers per second),
and the observed number of passengers, let's say 15.

Here is the definition of {\tt ElapsedTimeEstimator}:

\begin{verbatim}
class ElapsedTimeEstimator(object):

    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)

        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))

        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{verbatim}

\verb"prior_x" and \verb"post_x" are the prior and
posterior distributions of elapsed time. \verb"pmf_y" is
the predictive distribution of wait time.

{\tt ElapsedTimeEstimator} uses {\tt Elapsed} and {\tt PredictWaitTime},
defined below.

{\tt Elapsed} is a Suite that represents the hypothetical
distribution of {\tt x}. The prior distribution of {\tt x}
comes straight from the {\tt WaitTimeCalculator}. Then we
use the data, which consists of the arrival rate, {\tt lam},
and the number of passengers on the platform, to compute
the posterior distribution.

Here's the definition of {\tt Elapsed}:

\begin{verbatim}
class Elapsed(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{verbatim}

As always, {\tt Likelihood} takes a hypothesis and data, and
computes the likelihood of the data under the hypothesis.
In this case {\tt hypo} is the elapsed time since the last train
and {\tt data} is a tuple of {\tt lam} and the number of
passengers.
\index{likelihood}

The likelihood of the data is the probability of getting
{\tt k} arrivals in {\tt x} time, given arrival rate
{\tt lam}. We compute that using the PMF of the Poisson
distribution.
\index{Poisson distribution}
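
For example, if 5 minutes (300 seconds) have passed since the last
train and {\tt lam} is 2 passengers per minute, the expected number of
arrivals is 10, so seeing 15 passengers is plausible but not the most
likely outcome. These particular numbers are only for illustration:

\begin{verbatim}
lam = 2.0 / 60   # passengers per second
x = 300          # hypothetical elapsed time, in seconds
k = 15           # passengers waiting on the platform

print thinkbayes.EvalPoissonPmf(k, lam * x)   # about 0.035
\end{verbatim}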

Finally, here's the definition of {\tt PredictWaitTime}:

\begin{verbatim}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{verbatim}

\verb"pmf_zb" is the distribution of gaps between trains;
\verb"pmf_x" is the distribution of elapsed time, based on
the observed number of passengers. Since {\tt y = zb - x},
we can compute

\begin{verbatim}
pmf_y = pmf_zb - pmf_x
\end{verbatim}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
all pairs of {\tt zb} and {\tt x}, computes the differences, and adds
the results to \verb"pmf_y".

The resulting Pmf includes some negative values, which we know are
impossible. For example, if you arrive during a gap of 5 minutes, you
can't wait more than 5 minutes. {\tt RemoveNegatives} removes the
impossible values from the distribution and renormalizes.

\begin{verbatim}
def RemoveNegatives(pmf):
    for val in pmf.Values():
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{verbatim}

Figure~\ref{fig.redline3} shows the results. The prior distribution
of {\tt x} is the same as the distribution of {\tt y} in
Figure~\ref{fig.redline2}. The posterior distribution of {\tt x}
shows that, after seeing 15 passengers on the platform, we believe
that the time since the last train is probably 5--10 minutes. The
predictive distribution of {\tt y} indicates that we expect the next
train in less than 5 minutes, with about 80\% confidence.
\index{predictive distribution}

\section{Estimating the arrival rate}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
\caption{Prior and posterior distributions of {\tt lam} based
on five days of passenger data. }
\label{fig.redline1}
\end{figure}

The analysis so far has been based on the assumption that we know (1)
the distribution of gaps and (2) the passenger arrival rate. Now we
are ready to relax the second assumption.

Suppose that you just moved to Boston, so you don't know much about
the passenger arrival rate on the Red Line. After a few days of
commuting, you could make a guess, at least qualitatively. With
a little more effort, you could estimate $\lambda$ quantitatively.
\index{arrival rate}

Each day when you arrive at the platform, you should note the
time and the number of passengers waiting (if the platform is too
big, you could choose a sample area). Then you should record your
wait time and the
number of new arrivals while you are waiting.

After five days, you might have data like this:
%
\begin{verbatim}
k1    y    k2
--   ---   --
17   4.6    9
22   1.0    0
23   1.4    4
18   5.4   12
 4   5.8   11
\end{verbatim}
%
where {\tt k1} is the number of passengers waiting when you arrive,
{\tt y} is your wait time in minutes, and {\tt k2} is the number of
passengers who arrive while you are waiting.

Over the course of one week, you waited 18 minutes and saw 36
passengers arrive, so you would estimate that the arrival rate is
2 passengers per minute. For practical purposes that estimate is
good enough, but for the sake of completeness I
will compute a posterior distribution for $\lambda$ and show how
to use that distribution in the rest of the analysis.

{\tt ArrivalRate} is a {\tt Suite} that represents hypotheses about
$\lambda$. As always, {\tt Likelihood} takes a hypothesis and data,
and computes the likelihood of the data under the hypothesis.

In this case the hypothesis is a value of $\lambda$. The data is a
pair, {\tt y, k}, where {\tt y} is a wait time and {\tt k} is the
number of passengers who arrived.

\begin{verbatim}
class ArrivalRate(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        lam = hypo
        y, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * y)
        return like
\end{verbatim}

This {\tt Likelihood} might look familiar; it
is almost identical to {\tt Elapsed.Likelihood} in
Section~\ref{elapsed}. The difference is that in {\tt
Elapsed.Likelihood} the hypothesis is {\tt x}, the elapsed time; in
{\tt ArrivalRate.Likelihood} the hypothesis is {\tt lam}, the arrival
rate. But in both cases the likelihood is the probability of seeing
{\tt k} arrivals in some period of time, given {\tt lam}.

{\tt ArrivalRateEstimator} encapsulates the process of estimating
$\lambda$. The parameter, \verb"passenger_data", is a list
of {\tt k1, y, k2} tuples, as in the table above.
\index{numpy}

\begin{verbatim}
class ArrivalRateEstimator(object):

    def __init__(self, passenger_data):
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60

        self.prior_lam = ArrivalRate(hypos)

        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{verbatim}

\verb"__init__" builds
{\tt hypos}, which is a sequence of hypothetical values for {\tt lam},
ranging from 0 to 5 passengers per minute and converted to passengers
per second, then builds the prior distribution, \verb"prior_lam".
The {\tt for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".

Figure~\ref{fig.redline1} shows
the prior and posterior distributions. As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute. But the spread of the posterior distribution captures our
uncertainty about $\lambda$ based on a small sample.


\section{Incorporating uncertainty}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of {\tt y} for possible values
of {\tt lam}. }
\label{fig.redline4}
\end{figure}

Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}

\begin{enumerate}

\item Implement the analysis based on a deterministic value of the
uncertain parameter (in this case $\lambda$).

\item Compute the distribution of the uncertain parameter.

\item Run the analysis for each value of the parameter, and generate a
set of predictive distributions.
\index{predictive distribution}

\item Compute a mixture of the predictive distributions, using the
weights from the distribution of the parameter.
\index{mixture}

\end{enumerate}

We have already done steps (1) and (2). I wrote a class
called {\tt WaitMixtureEstimator} to handle steps (3) and (4).

\begin{verbatim}
class WaitMixtureEstimator(object):

    def __init__(self, wtc, are, num_passengers=15):
        self.metapmf = thinkbayes.Pmf()

        for lam, prob in sorted(are.post_lam.Items()):
            ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
            self.metapmf.Set(ete.pmf_y, prob)

        self.mixture = thinkbayes.MakeMixture(self.metapmf)
\end{verbatim}

{\tt wtc} is the {\tt WaitTimeCalculator} that contains the
distribution of {\tt zb}. {\tt are} is the {\tt ArrivalRateEstimator}
that contains the distribution of {\tt lam}.

The first line makes a meta-Pmf that maps from each possible
distribution of {\tt y} to its probability. For each value
of {\tt lam}, we use {\tt ElapsedTimeEstimator} to
compute the corresponding distribution of
{\tt y} and store it in the meta-Pmf. Then
we use {\tt MakeMixture} to compute the mixture.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}
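
Putting the pieces together, assuming {\tt wtc} and {\tt are} from the
previous sections:

\begin{verbatim}
wme = WaitMixtureEstimator(wtc, are, num_passengers=15)
print wme.mixture.Mean() / 60   # expected wait, in minutes
\end{verbatim}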

%For purposes of comparison, I also compute the distribution of
%{\tt y} based on a single point estimate of {\tt lam}, which is
%the mean of the posterior distribution.

Figure~\ref{fig.redline4} shows the results. The shaded lines
in the background are the distributions of {\tt y} for each value
of {\tt lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.

In this case we could get a very similar result using a single point
estimate of {\tt lam}. So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.

In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output. In this case, posterior variability
in {\tt lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}

\section{Decision analysis}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform. }
\label{fig.redline5}
\end{figure}

At this point we can use the number of passengers on the platform
to predict the distribution of wait times. Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}

Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail. Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.

In that case I would like to know the probability that {\tt y} exceeds
15 minutes as a function of \verb"num_passengers". It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".

But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.

I only have data from one week,
and the longest delay I observed was 15 minutes. So I can't
estimate the frequency of longer delays accurately.

However, I can use previous observations to make at least a coarse
estimate. When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop. So I estimate that there are about
3 major delays per year.

But remember that my observations are biased. I am more likely
to observe long delays because they affect a large number
of passengers. So we should treat my observations as a sample
of {\tt zb} rather than {\tt z}. Here's how we can do that.
\index{observer bias}

During my year of commuting, I took the Red Line home about 220
times. So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:

\begin{verbatim}
n = 220
cdf_z = thinkbayes.MakeCdfFromList(gap_times)
sample_z = cdf_z.Sample(n)
pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{verbatim}

Next I bias \verb"pmf_z" to get the distribution of
{\tt zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):

\begin{verbatim}
cdf_zb = BiasPmf(pmf_z).MakeCdf()
sample_zb = cdf_zb.Sample(n) + [1800, 2400, 3000]
\end{verbatim}

{\tt Cdf.Sample} is more efficient than {\tt Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.

Next I use the sample of {\tt zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:

\begin{verbatim}
pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
xs = MakeRange(low=60, high=3600)
pmf_zb = pdf_zb.MakePmf(xs)
\end{verbatim}

Since {\tt MakeRange} has no default for {\tt high}, I pass an upper
bound explicitly; it just has to be big enough to cover the longest
gaps, including the added delays.

Finally I unbias the distribution of {\tt zb} to get the
distribution of {\tt z}, which I use to create the
{\tt WaitTimeCalculator}:

\begin{verbatim}
pmf_z = UnbiasPmf(pmf_zb)
wtc = WaitTimeCalculator(pmf_z)
\end{verbatim}
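
{\tt UnbiasPmf} does not appear elsewhere in this chapter, so here is
a minimal sketch: it inverts {\tt BiasPmf} by dividing each
probability by its value instead of multiplying.

\begin{verbatim}
def UnbiasPmf(pmf):
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf.Mult(x, 1.0/x)

    new_pmf.Normalize()
    return new_pmf
\end{verbatim}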

This process is complicated, but
all of the steps are operations we have seen before.
Now we are ready to compute the probability of a long wait.

\begin{verbatim}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{verbatim}

Given the number of passengers on the platform, {\tt ProbLongWait}
makes an {\tt ElapsedTimeEstimator}, extracts the distribution of
wait time, and computes the probability that wait time exceeds
{\tt minutes}.
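
To generate a curve like the one in Figure~\ref{fig.redline5}, we can
evaluate {\tt ProbLongWait} over a range of values (this range is
illustrative):

\begin{verbatim}
for num_passengers in range(5, 36, 5):
    print num_passengers, ProbLongWait(num_passengers, 15)
\end{verbatim}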

Figure~\ref{fig.redline5} shows the result. When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.

If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.

Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.

\section{Discussion}

The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day. For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions. For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of {\tt lam} would be too low, so the
estimates of {\tt x} and {\tt y} would be too high.

If special events are as common as major delays, it would
be important to include them in the model. We could do that by
extending the distribution of {\tt lam} to include occasional
large values.

We started with the assumption that we know the
distribution of {\tt z}.
As an alternative, a passenger could estimate {\tt z}, but it would
not be easy.
As a passenger, you observe
only your own wait time, {\tt y}. Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, {\tt z}.

However, we could make some inferences about {\tt zb}. If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, {\tt x}. Then we observe
{\tt y}. If we add the posterior distribution of {\tt x} to
the observed {\tt y}, we get a distribution that represents
our posterior belief about the observed value of {\tt zb}.

We can use this distribution to update our beliefs about the
distribution of {\tt zb}. Finally, we can compute the
inverse of {\tt BiasPmf} to get from the distribution of {\tt zb}
to the distribution of {\tt z}.

I leave this analysis as an exercise for the
reader. One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lambda$. Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm. $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm. What is the posterior distribution of $\lambda$?

\end{quote}

You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.

\end{exercise}

\chapter{Two Dimensions}
\label{paintball}

\section{Paintball}

Paintball is a sport in which competing teams try to shoot each other
with guns that fire paint-filled pellets that break on impact, leaving
a colorful mark on the target. It is usually played in an
arena decorated with barriers and other objects that can be
used as cover.
\index{Paintball problem}

Suppose you are playing paintball in an indoor arena 30 feet
wide and 50 feet long. You are standing near one of the 30 foot
walls, and you suspect that one of your opponents has taken cover
nearby. Along the wall, you see several paint spatters, all the same
color, that you think your opponent fired recently.

The spatters are at 15, 16, 18, and 21 feet, measured from the
lower-left corner of the room. Based on these data, where do you
think your opponent is hiding?

Figure~\ref{fig.paintball} shows a diagram of the arena. Using the
lower-left corner of the room as the origin, I denote the unknown
location of the shooter with coordinates $\alpha$ and $\beta$, or {\tt
alpha} and {\tt beta}. The location of a spatter is labeled
{\tt x}. The angle the opponent shoots at is $\theta$ or {\tt theta}.

The Paintball problem is a modified version
of the Lighthouse problem, a common example of Bayesian analysis. My
notation follows the presentation of the problem in D.S.~Sivia's {\it Data
Analysis: A Bayesian Tutorial, Second Edition} (Oxford, 2006).
\index{Sivia, D.S.}

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{The suite}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball.pdf}}
\caption{Diagram of the layout for the paintball problem.}
\label{fig.paintball}
\end{figure}

To get started, we need a Suite that represents a set of hypotheses
about the location of the opponent. Each hypothesis is a
pair of coordinates: {\tt (alpha, beta)}.

Here is the definition of the Paintball suite:

\begin{verbatim}
class Paintball(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, alphas, betas, locations):
        self.locations = locations
        pairs = [(alpha, beta)
                 for alpha in alphas
                 for beta in betas]
        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt Paintball} inherits from {\tt Suite}, which we have seen before,
and {\tt Joint}, which I will explain soon.
\index{Joint pmf}

{\tt alphas} is the list of possible values for {\tt alpha}; {\tt
betas} is the list of values for {\tt beta}. {\tt pairs} is a list
of all {\tt (alpha, beta)} pairs.

{\tt locations} is a list of possible locations along
the wall; it is stored for use in {\tt Likelihood}.

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball2.pdf}}
\caption{Posterior CDFs for {\tt alpha} and {\tt beta}, given the data.}
\label{fig.paintball2}
\end{figure}

The room is 30 feet wide and 50 feet long, so here's the code that
creates the suite:

\begin{verbatim}
alphas = range(0, 31)
betas = range(1, 51)
locations = range(0, 31)

suite = Paintball(alphas, betas, locations)
\end{verbatim}

This prior distribution assumes that all locations in the room are
equally likely. Given a map of the room, we might choose a more
detailed prior, but we'll start simple.

\section{Trigonometry}

Now we need a likelihood function, which means we have to figure
out the likelihood of hitting any spot along the wall, given
the location of the opponent.
\index{likelihood}

As a simple model, imagine that the opponent is like a rotating
turret, equally likely to shoot in any direction.
In that case, he is most likely to hit
the wall at location {\tt alpha}, and less likely to hit the wall far
away from {\tt alpha}.
\index{trigonometry}

With a little trigonometry, we can compute the probability of hitting
any spot along the wall. Imagine that the shooter fires a shot at
angle $\theta$; the pellet would hit the wall at location $x$, where
%
\[ x - \alpha = \beta \tan \theta \]
%
Solving this equation for $\theta$ yields
%
\[ \theta = \tan^{-1} \left( \frac{x - \alpha}{\beta} \right) \]
%
So given a location on the wall, we can find $\theta$.

Taking the derivative of the first equation with respect to
$\theta$ yields
%
\[ \frac{dx}{d\theta} = \frac{\beta}{\cos^2 \theta} \]
%
This derivative is what I'll call the ``strafing speed'',
which is the speed of the target location along the wall as $\theta$
increases. The probability of hitting a given point on the wall is
inversely related to strafing speed.
\index{strafing speed}

If we know the coordinates of the shooter and a location
along the wall, we can compute strafing speed:

\begin{verbatim}
def StrafingSpeed(alpha, beta, x):
    theta = math.atan2(x - alpha, beta)
    speed = beta / math.cos(theta)**2
    return speed
\end{verbatim}

{\tt alpha} and {\tt beta} are the coordinates of the shooter;
{\tt x} is the location of a spatter. The result is
the derivative of {\tt x} with respect to {\tt theta}.
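
As a spot check, if the shooter is at {\tt (10, 10)}, a shot that
hits directly opposite, at {\tt x=10}, moves along the wall at the
lowest speed; a shot 45 degrees off center hits at {\tt x=20} with
twice the speed:

\begin{verbatim}
print StrafingSpeed(10, 10, 10)   # 10.0, directly opposite
print StrafingSpeed(10, 10, 20)   # 20.0, 45 degrees off center
\end{verbatim}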

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball1.pdf}}
\caption{PMF of location given {\tt alpha=10}, for several values of
{\tt beta}.}
\label{fig.paintball1}
\end{figure}

Now we can compute a Pmf that represents the probability of hitting
any location on the wall. {\tt MakeLocationPmf} takes {\tt alpha} and
{\tt beta}, the coordinates of the shooter, and {\tt locations}, a
list of possible values of {\tt x}.

\begin{verbatim}
def MakeLocationPmf(alpha, beta, locations):
    pmf = thinkbayes.Pmf()
    for x in locations:
        prob = 1.0 / StrafingSpeed(alpha, beta, x)
        pmf.Set(x, prob)
    pmf.Normalize()
    return pmf
\end{verbatim}

{\tt MakeLocationPmf} computes the probability of hitting
each location, which is inversely related to
strafing speed. The result is a Pmf of locations and their
probabilities.

Figure~\ref{fig.paintball1} shows the Pmf of location with {\tt alpha
= 10} and a range of values for {\tt beta}. For all values of
{\tt beta} the most likely spatter location is {\tt x = 10}; as
{\tt beta} increases, so does the spread of the Pmf.


\section{Likelihood}

Now all we need is a likelihood function.
We can use {\tt MakeLocationPmf} to compute the likelihood
of any value of {\tt x}, given the coordinates of the opponent.
\index{likelihood}

\begin{verbatim}
def Likelihood(self, data, hypo):
    alpha, beta = hypo
    x = data
    pmf = MakeLocationPmf(alpha, beta, self.locations)
    like = pmf.Prob(x)
    return like
\end{verbatim}

Again, {\tt alpha} and {\tt beta} are the hypothetical coordinates of
the shooter, and {\tt x} is the location of an observed spatter.

{\tt pmf} contains the probability of each location, given the
coordinates of the shooter. From this Pmf, we select the probability
of the observed location.

And we're done. To update the suite, we can use {\tt UpdateSet},
which is inherited from {\tt Suite}.

\begin{verbatim}
suite.UpdateSet([15, 16, 18, 21])
\end{verbatim}

The result is a distribution that maps each {\tt (alpha, beta)} pair
to a posterior probability.
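
As a quick summary of the posterior, we can print the most likely
pair, using {\tt MaximumLikelihood}, which appears again in the next
chapter:

\begin{verbatim}
print suite.MaximumLikelihood()   # the most likely (alpha, beta) pair
\end{verbatim}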

\section{Joint distributions}

When each value in a distribution is a tuple of variables, it is
called a {\bf joint distribution} because it represents the
distributions of the variables together, that is ``jointly''.
A joint distribution contains the distributions of the variables,
as well as information about the relationships among them.
\index{joint distribution}

Given a joint distribution, we can compute the distributions
of each variable independently, which are called the {\bf marginal
distributions}.
\index{marginal distribution}
\index{Joint}

{\tt thinkbayes.Joint} provides a method that computes marginal
distributions:

\begin{verbatim}
# class Joint:

    def Marginal(self, i):
        pmf = Pmf()
        for vs, prob in self.Items():
            pmf.Incr(vs[i], prob)
        return pmf
\end{verbatim}

{\tt i} is the index of the variable we want; in this example
{\tt i=0} indicates the distribution of {\tt alpha}, and
{\tt i=1} indicates the distribution of {\tt beta}.

Here's the code that extracts the marginal distributions:

\begin{verbatim}
marginal_alpha = suite.Marginal(0)
marginal_beta = suite.Marginal(1)
\end{verbatim}

Figure~\ref{fig.paintball2} shows the results (converted to CDFs).
The median value for {\tt alpha} is 18, near the center of mass of
the observed spatters. For {\tt beta}, the most likely values are
close to the wall, but beyond 10 feet the distribution is almost
uniform, which indicates that the data do not distinguish strongly
between these possible locations.

Given the posterior marginals, we can compute credible intervals
for each coordinate independently:
\index{credible interval}

\begin{verbatim}
print 'alpha CI', marginal_alpha.CredibleInterval(50)
print 'beta CI', marginal_beta.CredibleInterval(50)
\end{verbatim}

The 50\% credible intervals are {\tt (14, 21)} for {\tt alpha} and
{\tt (5, 31)} for {\tt beta}. So the data provide evidence that the
shooter is on the near side of the room. But it is not strong
evidence. The 90\% credible intervals cover most of the room!
\index{evidence}

\section{Conditional distributions}
\label{conditional}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball3.pdf}}
\caption{Posterior distributions for {\tt alpha} conditioned on several values
of {\tt beta}.}
\label{fig.paintball3}
\end{figure}

The marginal distributions contain information about the variables
independently, but they do not capture the dependence between
variables, if any.
\index{independence}
\index{dependence}

One way to visualize dependence is by computing {\bf conditional
distributions}. {\tt thinkbayes.Joint} provides a method that
does that:
\index{conditional distribution}
\index{Joint}

\begin{verbatim}
    def Conditional(self, i, j, val):
        pmf = Pmf()
        for vs, prob in self.Items():
            if vs[j] != val: continue
            pmf.Incr(vs[i], prob)

        pmf.Normalize()
        return pmf
\end{verbatim}

Again, {\tt i} is the index of the variable we want; {\tt j}
is the index of the conditioning variable, and {\tt val} is the
conditional value.

The result is the distribution of the $i$th variable under the
condition that the $j$th variable is {\tt val}.

For example, the following code computes the conditional distributions
of {\tt alpha} for a range of values of {\tt beta}:

\begin{verbatim}
betas = [10, 20, 40]

for beta in betas:
    cond = suite.Conditional(0, 1, beta)
\end{verbatim}

Figure~\ref{fig.paintball3} shows the results, which we could
fully describe as ``posterior conditional marginal distributions.''
Whew!

If the variables were independent, the conditional distributions would
all be the same. Since they are all different, we can tell the
variables are dependent. For example, if we know (somehow) that {\tt
beta = 10}, the conditional distribution of {\tt alpha} is fairly
narrow. For larger values of {\tt beta}, the distribution of
{\tt alpha} is wider.
\index{dependence}
\index{independence}

\section{Credible intervals}

\begin{figure}
% paintball.py
\centerline{\includegraphics[height=2.5in]{figs/paintball5.pdf}}
\caption{Credible intervals for the coordinates of the opponent.}
\label{fig.paintball5}
\end{figure}

Another way to visualize the posterior joint distribution is to
compute credible intervals. When we looked at credible intervals
in Section~\ref{credible},
I skipped over a subtle point: for a given distribution, there
are many intervals with the same level of credibility. For example,
if you want a 50\% credible interval, you could choose any set of
values whose probability adds up to 50\%.

When the values are one-dimensional, it is most common to choose
the {\bf central credible interval}; for example, the central 50\%
credible interval contains all values between the 25th and 75th
percentiles.
\index{central credible interval}

In multiple dimensions it is less obvious what the right credible
interval should be. The best choice might depend on context, but
one common choice is the maximum likelihood credible interval, which
contains the most likely values that add up to 50\% (or some other
percentage).
\index{maximum likelihood}

{\tt thinkbayes.Joint} provides a method that computes maximum
likelihood credible intervals.
\index{Joint}

\begin{verbatim}
# class Joint:

    def MaxLikeInterval(self, percentage=90):
        interval = []
        total = 0

        t = [(prob, val) for val, prob in self.Items()]
        t.sort(reverse=True)

        for prob, val in t:
            interval.append(val)
            total += prob
            if total >= percentage/100.0:
                break

        return interval
\end{verbatim}

The first step is to make a list of the values in the suite,
sorted in descending order by probability. Next we traverse the
list, adding each value to the interval, until the total
probability exceeds {\tt percentage}. The result is a list
of values from the suite. Notice that this set of values
is not necessarily contiguous.

To visualize the intervals, I wrote a function that ``colors''
each value according to how many intervals it appears in:

\begin{verbatim}
def MakeCrediblePlot(suite):
    d = dict((pair, 0) for pair in suite.Values())

    percentages = [75, 50, 25]
    for p in percentages:
        interval = suite.MaxLikeInterval(p)
        for pair in interval:
            d[pair] += 1

    return d
\end{verbatim}

{\tt d} is a dictionary that maps from each value in the suite
to the number of intervals it appears in. The loop computes intervals
for several percentages and modifies {\tt d}.

Figure~\ref{fig.paintball5} shows the result. The 25\% credible
interval is the darkest region near the bottom wall. For higher
percentages, the credible interval is bigger, of course, and skewed
toward the right side of the room.

\section{Discussion}

This chapter shows that the Bayesian framework from the previous
chapters can be extended to handle a two-dimensional parameter space.
The only difference is that each hypothesis is represented by
a tuple of parameters.

I also presented {\tt Joint}, which is a parent class that provides
methods that apply to joint distributions:
{\tt Marginal}, {\tt Conditional}, and {\tt MaxLikeInterval}.
In object-oriented terms,
{\tt Joint} is a mixin (see \url{http://en.wikipedia.org/wiki/Mixin}).
\index{Joint}

There is a lot of new vocabulary in this chapter, so let's review:

\begin{description}

\item[Joint distribution:] A distribution that represents all possible
values in a multidimensional space and their probabilities. The
example in this chapter is a two-dimensional space made up of the
coordinates {\tt alpha} and {\tt beta}. The joint distribution
represents the probability of each ({\tt alpha}, {\tt beta}) pair.

\item[Marginal distribution:] The distribution of one parameter in a
joint distribution, treating the other parameters as unknown. For
example, Figure~\ref{fig.paintball2} shows the distributions of {\tt
alpha} and {\tt beta} independently.

\item[Conditional distribution:] The distribution of one parameter in
a joint distribution, conditioned on one or more of the other
parameters. Figure~\ref{fig.paintball3} shows several distributions for
{\tt alpha}, conditioned on different values of {\tt beta}.

\end{description}

Given the joint distribution, you can compute marginal and conditional
distributions. With enough conditional distributions, you could
re-create the joint distribution, at least approximately. But given
the marginal distributions you cannot re-create the joint distribution
because you have lost information about the dependence between
variables.
\index{joint distribution}
\index{conditional distribution}
\index{marginal distribution}

If there are $n$ possible values for each of two parameters, most
operations on the joint distribution take time proportional to $n^2$.
If there are $d$ parameters, run time is proportional to $n^d$,
which quickly becomes impractical as the number of dimensions increases.

If you can process a million hypotheses in a reasonable amount of time,
you could handle two dimensions with 1000 values for each parameter,
or three dimensions with 100 values each, or six dimensions with 10
values each.

If you need more dimensions, or more values per dimension, there are
optimizations you can try. I present an example
in Chapter~\ref{species}.

You can download the code in this chapter from
\url{http://thinkbayes.com/paintball.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
In our simple model, the opponent is equally likely to shoot in any
direction. As an exercise, let's consider improvements to this model.

The analysis in this chapter suggests that a shooter is most likely to
hit the closest wall. But in reality, if the opponent is close to a
wall, he is unlikely to shoot at the wall because he is unlikely to
see a target between himself and the wall.

Design an improved model that takes this behavior
into account. Try to find a model that is more realistic, but not
too complicated.
\end{exercise}

\chapter{Approximate Bayesian Computation}

\section{The Variability Hypothesis}

I have a soft spot for crank science. Recently I visited Norumbega
Tower, which is an enduring monument to the crackpot theories of Eben
Norton Horsford, inventor of double-acting baking powder and fake
history. But that's not what this chapter is about.
\index{crank science}
\index{Horsford, Eben Norton}

This chapter is about the Variability Hypothesis, which
\index{Variability Hypothesis}
\index{Meckel, Johann}

\begin{quote}
``originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men. Because he
considered males to be the `superior animal,' Meckel concluded that
females' lack of variation was a sign of inferiority.''

From \url{http://en.wikipedia.org/wiki/Variability_hypothesis}.
\end{quote}

I particularly like that last part, because I suspect that if it turns
out that women are actually more variable, Meckel would take that as a
sign of inferiority, too. Anyway, you will not be surprised to hear
that the evidence for the Variability Hypothesis is weak.
\index{evidence}

Nevertheless, it came up in my class recently when we looked at data
from the CDC's Behavioral Risk Factor Surveillance System (BRFSS),
specifically the self-reported heights of adult American men and women.
The dataset includes responses from 154,407 men and 254,722 women.
Here's what we found:
\index{Centers for Disease Control}
\index{CDC}
\index{BRFSS}
\index{Behavioral Risk Factor Surveillance System}

\begin{itemize}

\item The average height for men is 178 cm; the average height for
women is 163 cm. So men are taller, on average. No surprise there.

\item For men the standard deviation is 7.7 cm; for women it is 7.3
cm. So in absolute terms, men's heights are more variable.

\item But to compare variability between groups, it is more meaningful
to use the coefficient of variation (CV), which is the standard
deviation divided by the mean. It is a dimensionless measure of
variability relative to scale. For men CV is 0.0433; for women it
is 0.0444.
\index{coefficient of variation}

\end{itemize}

That's very close, so we could conclude that this dataset provides
weak evidence against the Variability Hypothesis. But we can use
Bayesian methods to make that conclusion more precise. And answering
this question gives me a chance to demonstrate some techniques
for working with large datasets.
\index{height}

I will proceed in a few steps:

\begin{enumerate}

\item We'll start with the simplest implementation, but it only works
for datasets smaller than 1000 values.

\item By computing probabilities under a log transform, we can scale
up to the full size of the dataset, but the computation gets slow.

\item Finally, we speed things up substantially with Approximate
Bayesian Computation, also known as ABC.

\end{enumerate}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Mean and standard deviation}

In Chapter~\ref{paintball} we estimated two parameters simultaneously
using a joint distribution. In this chapter we use the same
method to estimate the parameters of a Gaussian distribution:
the mean, {\tt mu}, and the standard deviation, {\tt sigma}.
\index{Gaussian distribution}

For this problem, I define a Suite called {\tt Height} that
represents a map from each {\tt mu, sigma} pair to its probability:

\begin{verbatim}
class Height(thinkbayes.Suite, thinkbayes.Joint):

    def __init__(self, mus, sigmas):
        pairs = [(mu, sigma)
                 for mu in mus
                 for sigma in sigmas]

        thinkbayes.Suite.__init__(self, pairs)
\end{verbatim}

{\tt mus} is a sequence of possible values for {\tt mu}; {\tt sigmas}
is a sequence of values for {\tt sigma}. The prior distribution
is uniform over all {\tt mu, sigma} pairs.
\index{Joint}
\index{joint distribution}

The likelihood function is easy. Given hypothetical values
of {\tt mu} and {\tt sigma}, we compute the likelihood
of a particular value, {\tt x}. That's what {\tt EvalGaussianPdf}
does, so all we have to do is use it:
\index{likelihood}

\begin{verbatim}
# class Height

    def Likelihood(self, data, hypo):
        x = data
        mu, sigma = hypo
        like = thinkbayes.EvalGaussianPdf(x, mu, sigma)
        return like
\end{verbatim}

If you have studied statistics from a mathematical perspective,
you know that when you evaluate a PDF, you get a probability
density. In order to get a probability, you have to integrate
probability densities over some range.
\index{density}

But for our purposes, we don't need a probability; we just
need something proportional to the probability we want.
A probability density does that job nicely.

The hardest part of this problem turns
out to be choosing appropriate ranges for {\tt mus} and
{\tt sigmas}. If the range is too small, we omit some
possibilities with non-negligible probability and get the
wrong answer. If the range is too big, we get the right answer,
but waste computational power.

So this is an opportunity to use classical estimation to
make Bayesian techniques more efficient. Specifically, we can use
classical estimators to find a likely location for {\tt mu} and {\tt
sigma}, and use the standard errors of those estimates to choose a
likely spread.
\index{classical estimation}

If the true parameters of the distribution are $\mu$ and $\sigma$, and
we take a sample of $n$ values, an estimator of $\mu$ is the sample
mean, {\tt m}.

And an estimator of $\sigma$ is the sample standard
deviation, {\tt s}.

The standard error of the estimated $\mu$ is $s / \sqrt{n}$
and the standard error of the estimated $\sigma$ is
$s / \sqrt{2 (n-1)}$.

Here's the code to compute all that:

\begin{verbatim}
def FindPriorRanges(xs, num_points, num_stderrs=3.0):

    # compute m and s
    n = len(xs)
    m = numpy.mean(xs)
    s = numpy.std(xs)

    # compute ranges for m and s
    stderr_m = s / math.sqrt(n)
    mus = MakeRange(m, stderr_m, num_stderrs, num_points)

    stderr_s = s / math.sqrt(2 * (n-1))
    sigmas = MakeRange(s, stderr_s, num_stderrs, num_points)

    return mus, sigmas
\end{verbatim}

{\tt xs} is the dataset. \verb"num_points" is the desired number of
values in the range. \verb"num_stderrs" is the width of the range on
each side of the estimate, in number of standard errors.

The return
value is a pair of sequences, {\tt mus} and {\tt sigmas}.

Here's {\tt MakeRange}, which also needs \verb"num_points", so I pass
it along:
\index{numpy}

\begin{verbatim}
def MakeRange(estimate, stderr, num_stderrs, num_points):
    spread = stderr * num_stderrs
    array = numpy.linspace(estimate-spread,
                           estimate+spread,
                           num_points)
    return array
\end{verbatim}

{\tt numpy.linspace} makes an array of equally spaced elements between
{\tt estimate-spread} and {\tt estimate+spread}, including both.
\index{linspace}

\section{Update}

Finally, here's the code to make and update the suite:

\begin{verbatim}
mus, sigmas = FindPriorRanges(xs, num_points)
suite = Height(mus, sigmas)
suite.UpdateSet(xs)
print suite.MaximumLikelihood()
\end{verbatim}

This process might seem bogus, because we use the data to choose the
range of the prior distribution, and then use the data again to do the
update. In general, using the same data twice is, in fact, bogus.
\index{bogus}
\index{maximum likelihood}

But in this case it is ok. Really. We use the data to choose the
range for the prior, but only to avoid computing a lot of
probabilities that would have been very small anyway. With
\verb"num_stderrs=4", the range is big enough to cover all values with
non-negligible likelihood. After that, making it bigger has no effect
on the results.

In effect, the prior is uniform over all values
of {\tt mu} and {\tt sigma}, but for computational efficiency
we ignore all the values that don't matter.

\section{The posterior distribution of CV}

Once we have the posterior joint distribution of {\tt mu} and {\tt
sigma}, we can compute the distribution of CV for men and women, and
then the probability that one exceeds the other.

To compute the distribution of CV, we enumerate pairs of
{\tt mu} and {\tt sigma}:

\begin{verbatim}
def CoefVariation(suite):
    pmf = thinkbayes.Pmf()
    for (mu, sigma), p in suite.Items():
        pmf.Incr(sigma/mu, p)
    return pmf
\end{verbatim}

Then we use \verb"thinkbayes.PmfProbGreater" to compute the
probability that men are more variable.
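
For example, given posterior suites for the two groups (the variable
names here are hypothetical):

\begin{verbatim}
pmf_male = CoefVariation(suite_male)
pmf_female = CoefVariation(suite_female)

print thinkbayes.PmfProbGreater(pmf_male, pmf_female)
\end{verbatim}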

The analysis itself is simple, but there are two more issues we
have to deal with:

\begin{enumerate}

\item As the size of the dataset increases, we run into a series of
computational problems due to the limitations of floating-point
arithmetic.

\item The dataset contains a number of extreme values that are almost
certainly errors. We will need to make the estimation process
robust in the presence of these outliers.

\end{enumerate}

The following sections explain these problems and their solutions.


\section{Underflow}
\label{underflow}

If we select the first 100 values from the BRFSS dataset and run the
analysis I just described, it runs without errors and we get posterior
distributions that look reasonable.

If we select the first 1000 values and run the program again, we get
an error in \verb"Pmf.Normalize":

\begin{verbatim}
ValueError: total probability is zero.
\end{verbatim}

The problem is that we are using probability densities to compute
likelihoods, and densities from continuous distributions tend to be
small. And if you take 1000 small values and multiply
them together, the result is very small. In this case it is so small
it can't be represented by a floating-point number, so it gets rounded
down to zero, which is called {\bf underflow}. And if all
probabilities in the distribution are 0, it's not a distribution any
more.
\index{underflow}

A possible solution is to renormalize the Pmf after each update,
or after each batch of 100. That would work, but it would be slow.

A better alternative is to compute likelihoods under a log
transform. That way, instead of multiplying small values, we can add
up log likelihoods. {\tt Pmf} provides methods {\tt Log}, {\tt
LogUpdateSet} and {\tt Exp} to make this process easy.
\index{logarithm}
\index{log transform}

{\tt Log} computes the log of the probabilities in a Pmf:

\begin{verbatim}
# class Pmf

    def Log(self):
        m = self.MaxLike()
        for x, p in self.d.iteritems():
            if p:
                self.Set(x, math.log(p/m))
            else:
                self.Remove(x)
\end{verbatim}

Before applying the log transform, {\tt Log} uses {\tt MaxLike} to find
{\tt m}, the highest probability in the Pmf. It divides all
probabilities by {\tt m}, so the highest probability gets normalized
to 1, which yields a log of 0. The other log probabilities are all
negative. If there are any values in the Pmf with probability 0, they
are removed.
6233
6234
While the Pmf is under a log transform, we can't use {\tt Update},
6235
{\tt UpdateSet}, or {\tt Normalize}. The result would be nonsensical;
6236
if you try, Pmf raises an exception.
6237
Instead, we have to use {\tt LogUpdate}
6238
and {\tt LogUpdateSet}.
6239
\index{exception}
6240
6241
Here's the implementation of {\tt LogUpdateSet}:
6242
6243
\begin{verbatim}
6244
# class Suite
6245
6246
def LogUpdateSet(self, dataset):
6247
for data in dataset:
6248
self.LogUpdate(data)
6249
\end{verbatim}
6250
6251
{\tt LogUpdateSet} loops through the data and calls {\tt LogUpdate}:
6252
6253
\begin{verbatim}
6254
# class Suite
6255
6256
def LogUpdate(self, data):
6257
for hypo in self.Values():
6258
like = self.LogLikelihood(data, hypo)
6259
self.Incr(hypo, like)
6260
\end{verbatim}
6261
6262
{\tt LogUpdate} is just like {\tt Update} except that it calls
6263
{\tt LogLikelihood} instead of {\tt Likelihood}, and {\tt Incr}
6264
instead of {\tt Mult}.
6265
6266
Using log-likelihoods avoids the problem with underflow, but while
6267
the Pmf is under the log transform, there's not much we can do with
6268
it. We have to use {\tt Exp} to invert the transform:
6269
6270
\begin{verbatim}
6271
# class Pmf
6272
6273
def Exp(self):
6274
m = self.MaxLike()
6275
for x, p in self.d.iteritems():
6276
self.Set(x, math.exp(p-m))
6277
\end{verbatim}
6278
6279
If the log-likelihoods are large negative numbers, the resulting
6280
likelihoods might underflow. So {\tt Exp} finds the maximum
6281
log-likelihood, {\tt m}, and shifts all the likelihoods up by {\tt m}.
6282
The resulting distribution has a maximum likelihood of 1. This
6283
process inverts the log transform with minimal loss of precision.
6284
\index{maximum likelihood}
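
To see why the shift matters, here is a small standalone example
(not part of {\tt thinkbayes}); without subtracting {\tt m}, all
three values would underflow to 0:

\begin{verbatim}
import math

loglikes = [-1000, -1001, -1005]

m = max(loglikes)
print [math.exp(ll - m) for ll in loglikes]
# [1.0, 0.367..., 0.0067...]; the ratios are preserved
\end{verbatim}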

\section{Log-likelihood}

Now all we need is {\tt LogLikelihood}.

\begin{verbatim}
# class Height

def LogLikelihood(self, data, hypo):
    x = data
    mu, sigma = hypo
    loglike = scipy.stats.norm.logpdf(x, mu, sigma)
    return loglike
\end{verbatim}

{\tt norm.logpdf} computes the log-likelihood of the
Gaussian PDF.
\index{scipy}
\index{log-likelihood}

Here's what the whole update process looks like:

\begin{verbatim}
suite.Log()
suite.LogUpdateSet(xs)
suite.Exp()
suite.Normalize()
\end{verbatim}

To review, {\tt Log} puts the suite under a log transform.
{\tt LogUpdateSet} calls {\tt LogUpdate}, which calls
{\tt LogLikelihood}. {\tt LogUpdate} uses {\tt Pmf.Incr},
because adding a log-likelihood is the same as multiplying
by a likelihood.

After the update, the log-likelihoods are large negative
numbers, so {\tt Exp} shifts them up before inverting the
transform, which is how we avoid underflow.

Once the suite is transformed back, the probabilities
are ``linear'' again, which means ``not logarithmic'',
so we can use {\tt Normalize} again.

Using this algorithm, we can process the entire dataset without
underflow, but it is still slow. On my computer it might
take an hour. We can do better.

\section{A little optimization}

This section uses math and computational optimization
to speed things up by a factor of 100. But the following section
presents an algorithm that is even faster. So if you want to
get right to the good stuff, feel free to skip this section.
\index{optimization}

{\tt Suite.LogUpdateSet} calls {\tt LogUpdate} once for each data
point. We can speed it up by computing the log-likelihood of the entire
dataset at once.

We'll start with the Gaussian PDF:
%
\[ \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \]
%
and compute the log (dropping the constant term):
%
\[ -\log \sigma -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \]
%
Given a sequence of values, $x_i$, the total log-likelihood is
%
\[ \sum_i -\log \sigma - \frac{1}{2} \left( \frac{x_i-\mu}{\sigma} \right)^2 \]
%
Pulling out the terms that don't depend on $i$, we get
%
\[ -n \log \sigma - \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2 \]
%
which we can translate into Python:

\begin{verbatim}
# class Height

def LogUpdateSetFast(self, data):
    xs = tuple(data)
    n = len(xs)

    for hypo in self.Values():
        mu, sigma = hypo
        total = Summation(xs, mu)
        loglike = -n * math.log(sigma) - total / 2 / sigma**2
        self.Incr(hypo, loglike)
\end{verbatim}

By itself, this would be a small improvement, but it
creates an opportunity for a bigger one. Notice that the
summation only depends on {\tt mu}, not {\tt sigma}, so we only
have to compute it once for each value of {\tt mu}.
\index{optimization}

To avoid recomputing, I factor out a function that computes the
summation, and {\bf memoize} it so it stores previously computed
results in a dictionary (see
\url{http://en.wikipedia.org/wiki/Memoization}): \index{memoization}

\begin{verbatim}
def Summation(xs, mu, cache={}):
    try:
        return cache[xs, mu]
    except KeyError:
        ds = [(x-mu)**2 for x in xs]
        total = sum(ds)
        cache[xs, mu] = total
        return total
\end{verbatim}

{\tt cache} stores previously computed sums. The {\tt try} statement
returns a result from the cache if possible; otherwise it computes
the summation, then caches and returns the result.
\index{cache}

The only catch is that we can't use a list as a key in the cache, because
it is not a hashable type. That's why {\tt LogUpdateSetFast} converts
the dataset to a tuple.
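
As a quick check of the caching behavior (the values here are
made up):

\begin{verbatim}
xs = (1.0, 2.0, 3.0)
print Summation(xs, 2.0)   # computes 1 + 0 + 1 = 2.0 and caches it
print Summation(xs, 2.0)   # same key, so this hits the cache
\end{verbatim}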

This optimization speeds up the computation by about a
factor of 100, processing the entire dataset (154~407 men and 254~722
women) in less than a minute on my not-very-fast computer.

\section{ABC}

But maybe you don't have that kind of time. In that case, Approximate
Bayesian Computation (ABC) might be the way to go. The motivation
behind ABC is that the likelihood of any particular dataset is:
\index{ABC}
\index{Approximate Bayesian Computation}

\begin{enumerate}

\item Very small, especially for large datasets, which is why we had
to use the log transform,

\item Expensive to compute, which is why we had to do so much
optimization, and

\item Not really what we want anyway.

\end{enumerate}

We don't really care about the likelihood of seeing the exact dataset
we saw. Especially for continuous variables, we care about the
likelihood of seeing any dataset like the one we saw.

For example, in the Euro problem, we don't care about the order of
the coin flips, only the total number of heads and tails. And in
the locomotive problem, we don't care about which particular trains were
seen, only the number of trains and the maximum of the serial numbers.
\index{locomotive problem}
\index{Euro problem}

Similarly, in the BRFSS sample, we don't really want to know the
probability of seeing one particular set of values (especially since
there are hundreds of thousands of them). It is more
relevant to ask, ``If we sample 100,000 people from a population
with hypothetical values of $\mu$ and $\sigma$, what would be
the chance of collecting a sample with the observed mean and
variance?''
\index{BRFSS}

For samples from a Gaussian distribution, we can answer this question
efficiently because we can find the distribution of the sample
statistics analytically. In fact, we already did it when we computed
the range of the prior.
\index{Gaussian distribution}

If you draw $n$ values from a Gaussian distribution with parameters
$\mu$ and $\sigma$, and compute the sample mean, $m$, the
distribution of $m$ is Gaussian
with parameters $\mu$ and $\sigma / \sqrt{n}$.

Similarly, the distribution of the sample standard deviation, $s$, is
Gaussian with parameters $\sigma$ and $\sigma / \sqrt{2 (n-1)}$.
\index{sample statistics}
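
If you don't want to take these formulas on faith, they are easy to
check by simulation. Here is a sketch using {\tt numpy}; the
parameters are arbitrary:

\begin{verbatim}
import numpy

mu, sigma, n = 178, 7.7, 1000

ms = [numpy.mean(numpy.random.normal(mu, sigma, n))
      for _ in range(1000)]
ss = [numpy.std(numpy.random.normal(mu, sigma, n))
      for _ in range(1000)]

# each pair should agree closely
print numpy.std(ms), sigma / numpy.sqrt(n)
print numpy.std(ss), sigma / numpy.sqrt(2 * (n-1))
\end{verbatim}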

We can use these sample distributions to compute the likelihood of the
sample statistics, $m$ and $s$, given hypothetical values
for $\mu$ and $\sigma$. Here's a new version of \verb"LogUpdateSet"
that does it:

\begin{verbatim}
def LogUpdateSetABC(self, data):
    xs = data
    n = len(xs)

    # compute sample statistics
    m = numpy.mean(xs)
    s = numpy.std(xs)

    for hypo in sorted(self.Values()):
        mu, sigma = hypo

        # compute log likelihood of m, given hypo
        stderr_m = sigma / math.sqrt(n)
        loglike = EvalGaussianLogPdf(m, mu, stderr_m)

        # compute log likelihood of s, given hypo
        stderr_s = sigma / math.sqrt(2 * (n-1))
        loglike += EvalGaussianLogPdf(s, sigma, stderr_s)

        self.Incr(hypo, loglike)
\end{verbatim}

On my computer this function processes the entire dataset in about a
second, and the result agrees with the exact result with about 5
digits of precision.

\section{Robust estimation}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_male.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for men in the U.S.}
\label{fig.variability1}
\end{figure}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_posterior_female.pdf}}
\caption{Contour plot of the posterior joint distribution of
mean and standard deviation of height for women in the U.S.}
\label{fig.variability2}
\end{figure}

We are almost ready to look at results, but we have one more
problem to deal with. There are a number of outliers in this
dataset that are almost certainly errors. For example, there
are three adults with reported height of 61 cm, which would
place them among the shortest living adults in the world.
At the other end, there are four women with reported height
of 229 cm, just short of the tallest women in the world.

It is not impossible that these values are correct, but it is
unlikely, which makes it hard to know how to deal with them.
And we have to get
it right, because these extreme values have a disproportionate
effect on the estimated variability.

Because ABC is based on summary statistics, rather than the entire
dataset, we can make it more robust by choosing summary statistics
that are robust in the presence of outliers. For example, rather
than use the sample mean and standard deviation, we could use the median
and inter-quartile range
(IQR), which is the difference between the 25th and 75th percentiles.
\index{summary statistic}
\index{robust estimation}
\index{inter-quartile range}
\index{IQR}

More generally, we could compute an inter-percentile range (IPR) that
spans any given fraction of the distribution, {\tt p}:

\begin{verbatim}
def MedianIPR(xs, p):
    cdf = thinkbayes.MakeCdfFromList(xs)
    median = cdf.Percentile(50)

    alpha = (1-p) / 2
    ipr = cdf.Value(1-alpha) - cdf.Value(alpha)
    return median, ipr
\end{verbatim}

{\tt xs} is a sequence of values. {\tt p} is the desired range;
for example, {\tt p=0.5} yields the inter-quartile range.

{\tt MedianIPR} works by computing the CDF of {\tt xs},
then extracting the median and the difference between two
percentiles.

We can convert from {\tt ipr} to an estimate of {\tt sigma} using the
Gaussian CDF to compute the fraction of the distribution covered by a
given number of standard deviations. For example, it is a well-known
rule of thumb that 68\% of a Gaussian distribution falls within one
standard deviation of the mean, which leaves 16\% in each tail. If we
compute the range between the 16th and 84th percentiles, we expect the
result to be {\tt 2 * sigma}. So we can estimate {\tt sigma} by
computing the 68\% IPR and dividing by 2.
\index{Gaussian distribution}

More generally we could use any number of {\tt sigmas}.
{\tt MedianS} performs the more general version of this
computation:

\begin{verbatim}
def MedianS(xs, num_sigmas):
    half_p = thinkbayes.StandardGaussianCdf(num_sigmas) - 0.5

    median, ipr = MedianIPR(xs, half_p * 2)
    s = ipr / 2 / num_sigmas

    return median, s
\end{verbatim}

Again, {\tt xs} is the sequence of values; \verb"num_sigmas" is the
number of standard deviations the results should be based on. The
result is {\tt median}, which estimates $\mu$, and {\tt s}, which
estimates $\sigma$.
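
As a sanity check, we can apply {\tt MedianS} to simulated Gaussian
data with no outliers, assuming the {\tt thinkbayes} helpers above
are available; the estimates should be close to the parameters used
to generate the data:

\begin{verbatim}
import random

xs = [random.gauss(178, 7.7) for _ in range(10000)]
median, s = MedianS(xs, num_sigmas=1)
print median, s    # should be close to 178 and 7.7
\end{verbatim}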

Finally, in {\tt LogUpdateSetABC} we can replace the sample mean and
standard deviation with {\tt median} and {\tt s}. And that pretty
much does it.

It might seem odd that we are using observed percentiles to
estimate $\mu$ and $\sigma$, but it is an example of the
flexibility of the Bayesian approach. In effect we are asking,
``Given hypothetical values for $\mu$ and $\sigma$, and
a sampling process that has some chance of introducing errors,
what is the likelihood of generating a given set of sample
statistics?''

We are free to choose any sample statistics we like, up to a point:
$\mu$ and $\sigma$ determine the location and spread of
a distribution, so we need to choose statistics that capture those
characteristics. For example, if we chose the 49th and 51st percentiles,
we would get very little information about spread, so it
would leave the estimate of $\sigma$ relatively unconstrained
by the data. All values of {\tt sigma} would have nearly the
same likelihood of producing the observed values, so the posterior
distribution of {\tt sigma} would look a lot like the
prior.

\section{Who is more variable?}

\begin{figure}
% variability.py
\centerline{\includegraphics[height=2.5in]{figs/variability_cv.pdf}}
\caption{Posterior distributions of CV for men and women, based on
robust estimators.}
\label{fig.variability3}
\end{figure}

Finally we are ready to answer the question we started with: is the
coefficient of variation greater for men than for women?

Using ABC based on the median and IPR with \verb"num_sigmas=1", I
computed posterior joint distributions for {\tt mu} and {\tt
sigma}. Figures~\ref{fig.variability1} and~\ref{fig.variability2}
show the results as a contour plot with {\tt mu} on the x-axis, {\tt
sigma} on the y-axis, and probability on the z-axis.

For each joint distribution, I computed the posterior distribution of
CV. Figure~\ref{fig.variability3} shows these distributions for men
and women. The mean for men is 0.0410; for women it is 0.0429.
Since there is no overlap between the distributions, we conclude with
near certainty that
women are more variable in height than men.

So is that the end of the Variability Hypothesis? Sadly, no. It turns
out that this
result depends on the choice of the
inter-percentile range. With \verb"num_sigmas=1", we conclude that
women are more variable, but with \verb"num_sigmas=2" we conclude
with equal confidence that men are more variable.

The reason for the difference is that there
are more men of short stature, and their distance from the mean is
greater.

So our evaluation of the Variability Hypothesis depends on the
interpretation of ``variability.'' With \verb"num_sigmas=1" we
focus on people near the mean. As we increase
\verb"num_sigmas", we give more weight to the extremes.

To decide which
emphasis is appropriate, we would need a more precise statement
of the hypothesis. As it is, the Variability Hypothesis may be
too vague to evaluate.

Nevertheless, it helped
me demonstrate several new ideas and, I hope you agree,
it makes an interesting example.

\section{Discussion}

There are two ways you might think of ABC. One interpretation
is that it is, as the name suggests, an approximation that is
faster to compute than the exact value.

But remember that Bayesian analysis is always
based on modeling decisions, which implies that there is no
``exact'' solution. For any interesting
physical system there are many possible models, and each model
yields different results. To interpret the results, we have to
evaluate the models.
\index{modeling}

So another interpretation of ABC is that it represents an alternative
model of the likelihood. When we compute \p{D|H}, we are asking
``What is the likelihood of the data under a given hypothesis?''
\index{likelihood}

For large datasets, the likelihood of the data is very small, which
is a hint that we might not be asking the right question. What
we really want to know is the likelihood of any outcome
like the data, where the definition of ``like'' is yet another
modeling decision.

The underlying idea of ABC is that two datasets are alike if they yield
the same summary statistics. But in some cases, like the example in
this chapter, it is not obvious which summary statistics to choose.
\index{summary statistic}

You can download the code in this chapter from
\url{http://thinkbayes.com/variability.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}

An ``effect size'' is a statistic intended to measure the difference
between two groups (see
\url{http://en.wikipedia.org/wiki/Effect_size}).

For example, we could use data from the BRFSS to estimate the
difference in height between men and women. By sampling values
from the posterior distributions of $\mu$ and
$\sigma$, we could generate the posterior distribution of this
difference.

But it might be better to use a dimensionless measure of effect
size, rather than a difference measured in cm. One option is
to divide the difference by the standard deviation (similar to what
we did with the coefficient of variation).

If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
parameters for Group 2 are $(\mu_2, \sigma_2)$, the dimensionless
effect size is
%
\[ \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
%
Write a function that takes joint distributions of
{\tt mu} and {\tt sigma} for two groups and returns
the posterior distribution of effect size.

Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.

\end{exercise}

\chapter{Hypothesis Testing}

\section{Back to the Euro problem}

In Section~\ref{euro} I presented a problem from MacKay's {\it Information
Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}

In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}

In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}

If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}. In fact, we already wrote the function
that does it.

\begin{verbatim}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{verbatim}

To use it we can
create a {\tt Euro} suite and invoke
{\tt Likelihood}:

\begin{verbatim}
suite = Euro()
likelihood = suite.Likelihood(data, 50)
\end{verbatim}

\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.

It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.

One possibility is to cheat and look at the data before we define
the hypothesis. In that case we would say that ``biased'' means that
the probability of heads is 140/250.

\begin{verbatim}
actual_percent = 100.0 * 140 / 250
likelihood = suite.Likelihood(data, actual_percent)
\end{verbatim}

This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1. So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}

But using the data to formulate the hypothesis
is obviously bogus. By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}

\section{Making a fair comparison}
\label{suitelike}

To make a legitimate comparison, we have to define $B$ without looking
at the data. So let's try a different definition. If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side. You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely. So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''

We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses. We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.

\begin{verbatim}
like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
likelihood = 0.5 * like40 + 0.5 * like60
\end{verbatim}

The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}
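
Putting the pieces together, the Bayes factors in this section come
from a computation like this sketch, with {\tt data = 140, 110}:

\begin{verbatim}
data = 140, 110
suite = Euro()

like_f = suite.Likelihood(data, 50)

like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
like_two = 0.5 * like40 + 0.5 * like60

print like_two / like_f    # about 1.3
\end{verbatim}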

More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$. In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.

\begin{verbatim}
b_uniform = Euro(xrange(0, 101))
b_uniform.Remove(50)
b_uniform.Normalize()
\end{verbatim}

I initialize \verb"b_uniform" with values from 0 to 100.
I removed the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair, but it has almost no
effect on the result whether you remove it or not.

To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.

\begin{verbatim}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{verbatim}

The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}

If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update. To refresh your memory, here's the {\tt Update} function:

\begin{verbatim}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{verbatim}

And here's {\tt Normalize}:

\begin{verbatim}
def Normalize(self):
    total = self.Total()

    factor = 1.0 / total
    for x in self.d:
        self.d[x] *= factor

    return total
\end{verbatim}

The return value from {\tt Normalize} is the total of the
probabilities in the Suite, which is the average of the likelihoods
for the sub-hypotheses, weighted by the prior probabilities. And {\tt
Update} passes this value along, so instead of using {\tt
SuiteLikelihood}, we could compute the likelihood of
\verb"b_uniform" like this:

\begin{verbatim}
likelihood = b_uniform.Update(data)
\end{verbatim}

\section{The triangle prior}

In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%. If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}

\begin{verbatim}
b_triangle = TrianglePrior()
likelihood = b_triangle.Update(data)
\end{verbatim}

The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}

The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}

\begin{tabular}{|l|r|r|}
\hline
Hypothesis & Likelihood & Bayes \\
 & $\times 10^{-76}$ & Factor \\
\hline
$F$ & 5.5 & -- \\
\verb"B_cheat" & 34 & 6.1 \\
\verb"B_two" & 7.4 & 1.3 \\
\verb"B_uniform" & 2.6 & 0.47 \\
\verb"B_triangle" & 4.6 & 0.84 \\
\hline
\end{tabular}

Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.

In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means. This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.

My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$. In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}

Part of the answer is historical. Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:

\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}

In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.'' Other authors have suggested variations on the
wording. To avoid arguing about adjectives, we could think about odds
instead.

If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1. In terms of probability,
the data changed your degree of belief from 50\% to 66\%. For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.

On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, which corresponds
to a probability of more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.
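
The conversion from odds to probability used in these examples is a
one-liner; here is a standalone sketch:

\begin{verbatim}
def Probability(o):
    return o / (o + 1.0)

posterior_odds = 1.0 * 2    # prior odds times Bayes factor
print Probability(posterior_odds)    # 0.666...
\end{verbatim}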

\section{Exercises}

\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}

What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not? Or are you
more skeptical about it? Write down your prior odds.

Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?
\end{exercise}

\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.

Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP. Would that change your mind?

If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's article, ``Of
Miracles,'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}

\end{exercise}

\chapter{Evidence}
\label{evidence}

\section{Interpreting SAT scores}

Suppose you are the Dean of Admission at a small engineering
college in Massachusetts, and you are considering two candidates,
Alice and Bob, whose qualifications are similar in many ways,
with the exception that Alice got a higher score on the Math
portion of the SAT, a standardized test intended to measure
preparation for college-level work in mathematics.
\index{SAT}
\index{standardized test}

If Alice got a 780 and Bob got a 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}

Now in reality, both scores are very good, and both
candidates are probably well prepared for college math. So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students. But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''

To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model. I pretend, temporarily, that
all SAT questions are equally difficult. Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}

But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly. This
simplification makes it easy to compute the likelihood of a given
score.

\section{The scale}

In order to understand SAT scores, we have to understand the scoring
and scaling process. Each test-taker gets a raw score based on the
number of correct and incorrect questions. The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}

In 2009, there were 54 questions on the math SAT. The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.

The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores. I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}

You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.

\section{The prior}

The College Board also publishes the distribution of scaled scores
for all test-takers. If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of \verb"p_correct".
So we can use the distribution of raw scores to model the
prior distribution of \verb"p_correct".

Here is the code that reads and processes the data:

\begin{verbatim}
class Exam(object):

    def __init__(self):
        self.scale = ReadScale()
        scores = ReadRanks()
        score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
        self.raw = self.ReverseScale(score_pmf)
        self.max_score = max(self.raw.Values())
        self.prior = DivideValues(self.raw, self.max_score)
\end{verbatim}

{\tt Exam} encapsulates the information we have about the exam.
{\tt ReadScale} and {\tt ReadRanks} read files and return
objects that contain the data:
{\tt self.scale} is the {\tt Interpolator} that converts
from raw to scaled scores and back; {\tt scores} is a list
of (score, frequency) pairs.

\verb"score_pmf" is the Pmf of
scaled scores. {\tt self.raw} is the Pmf of raw scores, and
{\tt self.prior} is the Pmf of \verb"p_correct".

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of {\tt p\_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}

Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct". This distribution is approximately Gaussian, but it
is compressed at the extremes. By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}

For each test-taker, I define a Suite called {\tt Sat} that
represents the distribution of \verb"p_correct". Here's the definition:

\begin{verbatim}
class Sat(thinkbayes.Suite):

    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)

        self.exam = exam
        self.score = score

        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"__init__" takes an Exam object and a scaled score. It makes a
copy of the prior distribution and then updates itself based on the
exam score.

As usual, we inherit {\tt Update} from {\tt Suite} and provide
{\tt Likelihood}:

\begin{verbatim}
def Likelihood(self, data, hypo):
    p_correct = hypo
    score = data

    k = self.exam.Reverse(score)
    n = self.exam.max_score
    like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
    return like
\end{verbatim}

{\tt hypo} is a hypothetical
value of \verb"p_correct", and {\tt data} is a scaled score.

To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers. With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}
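
{\tt EvalBinomialPmf} can be as simple as a thin wrapper around
{\tt scipy}; here is a sketch (the version in {\tt thinkbayes} may
differ in detail):

\begin{verbatim}
import scipy.stats

def EvalBinomialPmf(k, n, p):
    # probability of k successes in n trials,
    # each with probability p
    return scipy.stats.binom.pmf(k, n, p)
\end{verbatim}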

\section{Posterior}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of {\tt p\_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}

Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.

Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?'' We can use the
posterior distributions of \verb"p_correct" to answer this question.

To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:

\begin{itemize}

\item $A$: \verb"p_correct" is higher for Alice than for Bob.

\item $B$: \verb"p_correct" is higher for Bob than for Alice.

\end{itemize}

To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.

So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:

\begin{verbatim}
class TopLevel(thinkbayes.Suite):

    def Update(self, data):
        a_sat, b_sat = data

        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)

        a_like += c_like / 2
        b_like += c_like / 2

        self.Mult('A', a_like)
        self.Mult('B', b_like)

        self.Normalize()
\end{verbatim}

Usually when we define a new Suite, we inherit {\tt Update}
and provide {\tt Likelihood}. In this case I override {\tt Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.

The data passed to {\tt Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".

\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.

\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values. If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero. So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".

Here is the code that creates {\tt TopLevel} and updates it:

\begin{verbatim}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)

top = TopLevel('AB')
top.Update((a_sat, b_sat))
top.Print()
\end{verbatim}

The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21. The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions. If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}

\section{A better model}

Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.

But how big is the modeling error? If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough. If it's large,
we need a better model.
\index{modeling error}

In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small. So if
you are satisfied with the simple model, you can skip to the next
chapter. If you want to see how the more realistic model works,
read on...

\begin{itemize}

\item Assume that each test-taker has some
degree of {\tt efficacy}, which measures their
ability to answer SAT questions.
\index{efficacy}

\item Assume that each question has some level of
{\tt difficulty}.

\item Finally, assume that the chance that a test-taker answers a
question correctly is related to {\tt efficacy} and {\tt difficulty}
according to this function:

\begin{verbatim}
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{verbatim}

\end{itemize}

This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}. {\tt
efficacy} and {\tt difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}

When {\tt efficacy} and {\tt difficulty} are equal, the
probability of getting the question right is 50\%. As
{\tt efficacy} increases, this probability approaches 100\%.
As it decreases (or as {\tt difficulty} increases), the
probability approaches 0\%.
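
For example, using the function above:

\begin{verbatim}
print ProbCorrect(0, 0)     # 0.5
print ProbCorrect(3, 0)     # about 0.95
print ProbCorrect(-3, 0)    # about 0.05
\end{verbatim}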

Given the distribution of {\tt efficacy} across test-takers
and the distribution of {\tt difficulty} across questions, we
can compute the expected distribution of raw scores. We'll do that
in two steps. First, for a person with given {\tt efficacy},
we'll compute the distribution of raw scores.

\begin{verbatim}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])

    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{verbatim}

{\tt difficulties} is a list of difficulties, one for each question.
{\tt ps} is a list of probabilities, and {\tt pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:

\begin{verbatim}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{verbatim}

{\tt dist} is the sum of these Pmfs. Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums. In order to use Python's {\tt sum} to add up Pmfs,
we have to provide {\tt pmf0}, which is the identity for Pmfs,
so {\tt pmf + pmf0} is always {\tt pmf}.

If we know a person's efficacy, we can compute their distribution
of raw scores. For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture. Here's the code
that computes the mixture:

\begin{verbatim}
# class Exam:

def MakeRawScoreDist(self, efficacies):
    pmfs = thinkbayes.Pmf()
    for efficacy, prob in efficacies.Items():
        scores = PmfCorrect(efficacy, self.difficulties)
        pmfs.Set(scores, prob)

    mix = thinkbayes.MakeMixture(pmfs)
    return mix
\end{verbatim}

{\tt MakeRawScoreDist} takes {\tt efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers. I assume
it is Gaussian with mean 0 and standard deviation 1.5. This
choice is mostly arbitrary. The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly. \index{Gaussian distribution}

{\tt pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy, and maps each one to the fraction of test-takers at that level. {\tt
MakeMixture} takes the meta-pmf and computes the distribution of the
mixture (see Section~\ref{mixture}). \index{meta-Pmf}
\index{MakeMixture}

\section{Calibration}

If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}

I assume that the distribution of difficulty is uniform with
parameters {\tt center} and {\tt width}. {\tt MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}

\begin{verbatim}
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{verbatim}

By trying out a few combinations, I found that
{\tt center=-0.05} and {\tt width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}

So, assuming that the distribution of difficulty is uniform,
its range is approximately
{\tt -1.85} to {\tt 1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}

The following table shows the range of {\tt ProbCorrect} for
test-takers at different levels of efficacy:

\begin{tabular}{|r|r|r|r|}
\hline
 & \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy & -1.85 & -0.05 & 1.75 \\
\hline
3.00 & 0.99 & 0.95 & 0.78 \\
1.50 & 0.97 & 0.82 & 0.44 \\
0.00 & 0.86 & 0.51 & 0.15 \\
-1.50 & 0.59 & 0.19 & 0.04 \\
-3.00 & 0.24 & 0.05 & 0.01 \\
\hline
\end{tabular}

Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of answering the easiest questions on
the exam, and a 78\% chance of answering the hardest. On the other
end of the range, someone two standard deviations below the mean
has only a 24\% chance of answering the easiest questions.

\section{Posterior distribution of efficacy}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}

Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob. Here is a version of the
Sat class that uses the new model:

\begin{verbatim}
class Sat2(thinkbayes.Suite):

    def __init__(self, exam, score):
        self.exam = exam
        self.score = score

        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)

        # update based on an exam score
        self.Update(score)
\end{verbatim}

\verb"Update" invokes
\verb"Likelihood", which computes the likelihood of a given test score
for a hypothetical level of efficacy.

\begin{verbatim}
def Likelihood(self, data, hypo):
    efficacy = hypo
    score = data
    raw = self.exam.Reverse(score)

    pmf = self.exam.PmfCorrect(efficacy)
    like = pmf.Prob(raw)
    return like
\end{verbatim}

{\tt pmf} is the distribution of raw scores for a test-taker
with the given efficacy; {\tt like} is the probability of
the observed score.

Figure~\ref{fig.satposterior2} shows the posterior distributions
of efficacy for Alice and Bob. As expected, the location
of Alice's distribution is farther to the right, but again there
is some overlap.

Using {\tt TopLevel} again, we compare $A$, the
hypothesis that Alice's efficacy is higher, and $B$, the
hypothesis that Bob's is higher. The likelihood ratio is
3.4, a bit smaller than what we got from the simple model (3.8).
So this model indicates that the data are evidence in favor
of $A$, but a little weaker than the previous estimate.

If our prior belief is that $A$ and $B$ are equally likely,
then in light of this evidence we would give $A$ a posterior
probability of 77\%, leaving a 23\% chance that Bob's efficacy
is higher.

\section{Predictive distribution}

The analysis we have done so far generates estimates for
Alice and Bob's efficacy, but since efficacy is not directly
observable, it is hard to validate the results.
\index{predictive distribution}

To give the model predictive power, we can use it to answer
a related question: ``If Alice and Bob take the math SAT
again, what is the chance that Alice will do better again?''

We'll answer this question in two steps:

\begin{itemize}

\item We'll use the posterior distribution of efficacy to
generate a predictive distribution of raw score for each test-taker.

\item We'll compare the two predictive distributions to compute
the probability that Alice gets a higher score again.

\end{itemize}

We already have most of the code we need. To compute
the predictive distributions, we can use \verb"MakeRawScoreDist" again:

\begin{verbatim}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)

a_pred = exam.MakeRawScoreDist(a_sat)
b_pred = exam.MakeRawScoreDist(b_sat)
\end{verbatim}

Then we can find the likelihood that Alice does better on the second
test, Bob does better, or they tie:

\begin{verbatim}
a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{verbatim}

The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.

Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test. The posterior odds are 3:1 that
Alice's efficacy is higher, but only 2:1 that Alice will do better on
the next exam.

\section{Discussion}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of {\tt p\_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}

We started this chapter with the question,
``How strong is the evidence that Alice is better prepared
than Bob?'' On the face of it, that sounds like we want to
test two hypotheses: either Alice is more prepared or Bob is.

But in order to compute likelihoods for these hypotheses, we
have to solve an estimation problem. For each test-taker
we have to find the posterior distribution of either
\verb"p_correct" or \verb"efficacy".

Values like this are called {\bf nuisance parameters} because
we don't care what they are, but we have
to estimate them to answer the question we care about.
\index{nuisance parameter}

One way to visualize the analysis we did in this chapter is
to plot the space of these parameters. \verb"thinkbayes.MakeJoint"
takes two Pmfs, computes their joint distribution, and returns
a joint pmf of each possible pair of values and its probability.

\begin{verbatim}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{verbatim}

This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}

Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob. The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob. To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.

In {\tt TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line. For the
cells that fall on the line, we add up the total mass and split it
between $A$ and $B$.

The process we used in this chapter---estimating nuisance
parameters in order to evaluate the likelihood of competing
hypotheses---is a common Bayesian approach to problems like this.
\chapter{Simulation}

In this chapter I describe my solution to a problem posed
by a patient with a kidney tumor.  I think the problem is
important and relevant to patients with these tumors
and doctors treating them.

And I think the solution is interesting because, although it
is a Bayesian approach to the problem, the use of Bayes's theorem
is implicit.  I present the solution and my code; at the end
of the chapter I will explain the Bayesian part.

If you want more technical detail than I present here, you can
read my paper on this work at \url{http://arxiv.org/abs/1203.6890}.


\section{The Kidney Tumor problem}

\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader and occasional contributor to the online statistics
forum at \url{http://reddit.com/r/statistics}.  In November 2011, I read
the following message:

\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
cancer formed before I retired from the military. ... Given the
dates of retirement and detection is it possible to determine when
there was a 50/50 chance that I developed the disease? Is it
possible to determine the probability on the retirement date? My
tumor was 15.5 cm x 15 cm at detection. Grade II.''
\end{quote}

I contacted the author of the message and got more information; I learned
that veterans get different benefits if it is ``more likely than not''
that a tumor formed while they were in military service (among other
considerations).

Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated.  As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times.  Several papers have reported these
growth rates.

I collected data from a paper by Zhang et al\footnote{Zhang et al,
Distribution of Renal Tumor Growth Rates Determined by Using Serial
Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
137--144.}.  I contacted the authors to see if I could get raw data,
but they refused on grounds of medical privacy.  Nevertheless, I was
able to extract the data I needed by printing one of their graphs and
measuring it with a ruler.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}

They report growth rates in reciprocal doubling time (RDT),
which is in units of doublings per year.  So a tumor with $RDT=1$
doubles in volume each year; with $RDT=2$ it quadruples in the same
time, and with $RDT=-1$, it halves.  In general, over an interval of
$t$ years, the volume is multiplied by $2^{RDT \cdot t}$.
Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}

The squares are the data points from the paper; the line is a model I
fit to the data.  The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}


\section{A simple model}

It is usually a good idea to start with a simple model before
trying something more challenging.  Sometimes the simple model is
sufficient for the problem at hand, and if not, you can use it
to validate the more complex model.
\index{modeling}

For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied by
eight.

I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median
rate, how big would it have been at the date of discharge?''

The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times longer.

\begin{verbatim}
# time between discharge and diagnosis, in days
interval = 3291.0

# doubling time in linear measure is doubling time in volume * 3
dt = 811.0 * 3

# number of doublings since discharge
doublings = interval / dt

# how big was the tumor at time of discharge (diameter in cm)
d1 = 15.5
d0 = d1 / 2.0 ** doublings
\end{verbatim}

You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}.  For more information
see Section~\ref{download}.

The result, {\tt d0}, is about 6 cm.  So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate.  Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.

In addition, I computed the growth rate that would be implied
if this tumor had formed after the date of discharge.  If we
assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:

\begin{verbatim}
# assume an initial linear measure of 0.1 cm
d0 = 0.1
d1 = 15.5

# how many doublings would it take to get from d0 to d1
doublings = log2(d1 / d0)

# what linear doubling time does that imply?
dt = interval / doublings

# compute the volumetric doubling time and RDT
vdt = dt / 3
rdt = 365 / vdt
\end{verbatim}
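
({\tt log2} is a small helper; Python 2's {\tt math} module does not
provide one, so kidney.py needs something along these lines:

\begin{verbatim}
import math

def log2(x):
    # base-2 logarithm
    return math.log(x) / math.log(2)
\end{verbatim}
)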

{\tt dt} is linear doubling time, so {\tt vdt} is volumetric
doubling time, and {\tt rdt} is reciprocal doubling
time.

The number of doublings, in linear measure, is 7.3, which implies
an RDT of 2.4.  In the data from Zhang et al, only 20\% of tumors
grew this fast during a period of observation.  So again,
I concluded that it is ``more likely than not'' that the tumor
formed prior to the date of discharge.

These calculations are sufficient to answer the question as
posed, and on behalf of my correspondent, I wrote a letter explaining
my conclusions to the Veterans' Benefit Administration.
\index{Veterans' Benefit Administration}

Later I told a friend, who is an oncologist, about my results.  He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors.  He suggested that the
results might be interesting to researchers and doctors.

But in order to make them useful, I wanted a more general model
of the relationship between age and size.


\section{A more general model}

Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before
any given date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}

To find it, I run simulations of tumor growth to get the
distribution of size conditioned on age.  Then we can use
a Bayesian approach to get the
distribution of age conditioned on size.
\index{conditional distribution}

The simulation starts with a small tumor and runs these steps:

\begin{enumerate}

\item Choose a growth rate from the distribution of RDT.

\item Compute the size of the tumor at the end of an interval.

\item Record the size of the tumor at each interval.

\item Repeat until the tumor exceeds the maximum relevant size.

\end{enumerate}

For the initial size I chose 0.3 cm, because carcinomas smaller than
that are less likely to be invasive and less likely to have the blood
supply needed for rapid growth (see
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
\index{carcinoma}

I chose an interval of 245 days (about 8 months) because that is the
median time between measurements in the data source.

For the maximum size I chose 20 cm.  In the data source, the range of
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond
the observed range at each end, but not by far, and not in a way
likely to have a strong effect on the results.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
\caption{Simulations of tumor growth, size vs. time.}
\label{fig.kidney4}
\end{figure}

The simulation is based on one big simplification:
the growth rate is chosen independently during each interval,
so it does not depend on age, size, or growth rate during
previous intervals.
\index{independence}

In Section~\ref{serial} I review these assumptions and
consider more detailed models.  But first let's look at some
examples.

Figure~\ref{fig.kidney4} shows
the size of simulated tumors as a function of
age.  The dashed line at 10 cm shows the range of ages for tumors at
that size: the fastest-growing tumor gets there in 8 years; the
slowest takes more than 35.

I am presenting results in terms of linear measurements, but the
calculations are in terms of volume.  To convert from one to the
other, again, I use the volume of a sphere with the given
diameter.
\index{volume}
\index{sphere}

\section{Implementation}

Here is the kernel of the simulation:
\index{simulation}

\begin{verbatim}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0

    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break

    return seq
\end{verbatim}

\verb"rdt_seq" is an iterator that yields
random values from the CDF of growth rate.
{\tt v0} is the initial volume in mL.  {\tt interval} is the time step
in years.  {\tt vmax} is the final volume corresponding to a linear
measurement of 20 cm.
\index{iterator}

{\tt Volume} converts from linear measurement in cm to volume
in mL, based on the simplification that the tumor is a sphere:

\begin{verbatim}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{verbatim}
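
For example, the default {\tt vmax} corresponds to a sphere 20 cm in
diameter; as a quick check:

\begin{verbatim}
>>> Volume(20.0)    # 4/3 * pi * 10**3
4188.79...
\end{verbatim}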

{\tt ExtendSequence} computes the volume of the tumor at the
end of the interval.

\begin{verbatim}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)

    return final, new_seq
\end{verbatim}

{\tt age} is the age of the tumor at the end of the interval.
{\tt seq} is a tuple that contains the volumes so far.  {\tt rdt} is
the growth rate during the interval, in doublings per year.
{\tt interval} is the size of the time step in years.

The return values are {\tt final}, the volume of the
tumor at the end of the interval, and \verb"new_seq", a new
tuple containing the volumes in {\tt seq} plus the new volume
{\tt final}.

{\tt Cache.Add} records the age and size of each tumor at the end
of each interval, as explained in the next section.
\index{cache}


\section{Caching the joint distribution}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
\caption{Joint distribution of age and tumor size.}
\label{fig.kidney8}
\end{figure}

Here's how the cache works.

\begin{verbatim}
class Cache(object):

    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{verbatim}

{\tt joint} is a joint Pmf that records the
frequency of each age-size pair, so it approximates the
joint distribution of age and size.
\index{joint distribution}

At the end of each simulated interval, {\tt ExtendSequence} calls
{\tt Add}:

\begin{verbatim}
# class Cache

    def Add(self, age, seq, rdt):
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{verbatim}

Again, {\tt age} is the age of the tumor, and {\tt seq} is the
sequence of volumes so far; {\tt rdt} is passed along but not used
in this simplified version.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
\caption{Distributions of age, conditioned on size.}
\label{fig.kidney6}
\end{figure}

Before adding the new data to the joint distribution, we use {\tt
Diameter} to convert from volume to diameter in centimeters:

\begin{verbatim}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{verbatim}

And
{\tt CmToBucket} to convert from centimeters to a discrete bucket
number:

\begin{verbatim}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{verbatim}

The buckets are equally spaced on a log scale.  Using {\tt factor=10}
yields a reasonable number of buckets; for example,
1 cm maps to bucket 0 and 10 cm maps to bucket 23.
\index{log scale}
\index{bucket}
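
Going the other way is handy for labeling plots; here is a minimal
sketch of the inverse (kidney.py defines a helper along these lines):

\begin{verbatim}
def BucketToCm(y, factor=10):
    # inverse of CmToBucket
    return math.exp(y / factor)
\end{verbatim}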

After running the simulations, we can plot the joint distribution
as a pseudocolor plot, where each cell represents the number of
tumors observed at a given size-age pair.
Figure~\ref{fig.kidney8} shows the joint distribution after 1000
simulations.
\index{pseudocolor plot}


\section{Conditional distributions}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
\caption{Percentiles of tumor age as a function of size.}
\label{fig.kidney7}
\end{figure}

By taking a vertical slice from the joint distribution, we can get the
distribution of sizes for any given age.  By taking a horizontal
slice, we can get the distribution of ages conditioned on size.
\index{conditional distribution}

Here's the code that reads the joint distribution and builds
the conditional distribution for a given size.
\index{joint distribution}

\begin{verbatim}
# class Cache

    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{verbatim}

\verb"bucket" is the integer bucket number corresponding to
tumor size.  {\tt Joint.Conditional} computes the
PMF of age conditioned on {\tt bucket}.
The result is the CDF of age conditioned on {\tt bucket}.

Figure~\ref{fig.kidney6} shows several of these CDFs, for
a range of sizes.  To summarize these distributions, we can
compute percentiles as a function of size.
\index{percentile}

\begin{verbatim}
percentiles = [95, 75, 50, 25, 5]

for bucket in cache.GetBuckets():
    cdf = cache.ConditionalCdf(bucket)
    ps = [cdf.Percentile(p) for p in percentiles]
\end{verbatim}

Figure~\ref{fig.kidney7} shows these percentiles for each
size bucket.  The data points are computed from the estimated
joint distribution.  In the model, size and time are discrete,
which contributes numerical errors, so I also show a least
squares fit for each sequence of percentiles.
\index{least squares fit}


\section{Serial Correlation}
\label{serial}

The results so far are based on a number of modeling decisions;
let's review them and consider which ones are the most
likely sources of error:
\index{modeling error}

\begin{itemize}

\item To convert from linear measure to volume, we assume that
tumors are approximately spherical.  This assumption is probably
fine for tumors up to a few centimeters, but not for very
large tumors.
\index{sphere}

\item The distribution of growth rates in the simulations is based on
a continuous model we chose to fit the data reported by Zhang et al,
which is based on 53 patients.  The fit is only approximate and, more
importantly, a larger sample would yield a
different distribution.
\index{growth rate}

\item The growth model does not take into account tumor subtype or
grade; this assumption is consistent with the conclusion of Zhang et al:
``Growth rates in renal tumors of different sizes, subtypes and
grades represent a wide range and overlap substantially.''
But with a larger sample, a difference might become apparent.
\index{tumor type}

\item The distribution of growth rate does not depend on the size of
the tumor.  This assumption would not be realistic for very
small and very large tumors, whose growth is limited by blood supply.

But tumors observed by Zhang et al ranged from 1 to 12 cm, and they
found no statistically significant relationship between
size and growth rate.  So if there is a relationship, it is
likely to be weak, at least in this size range.

\item In the simulations, growth rate during each interval is
independent of previous growth rates.  In reality it is plausible
that tumors that have grown quickly in the past are more likely
to grow quickly.  In other words, there is probably
a serial correlation in growth rate.
\index{serial correlation}

\end{itemize}

Of these, the first and last seem the most problematic.  I'll
investigate serial correlation first, then come back to
spherical geometry.

To simulate correlated growth, I wrote a generator\footnote{If you are
not familiar with Python generators, see
\url{http://wiki.python.org/moin/Generators}.} that yields a
correlated series from a given Cdf.  Here's how the algorithm works:
\index{generator}

\begin{enumerate}

\item Generate correlated values from a Gaussian distribution.
This is easy to do because we can compute the distribution
of the next value conditioned on the previous value.
\index{Gaussian distribution}

\item Transform each value to its cumulative probability using
the Gaussian CDF.
\index{cumulative probability}

\item Transform each cumulative probability to the corresponding value
using the given Cdf.

\end{enumerate}

Here's what that looks like in code:

\begin{verbatim}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{verbatim}

{\tt cdf} is the desired Cdf; {\tt rho} is the desired correlation.
The values of {\tt x} are Gaussian; {\tt Transform} converts them
to the desired distribution.

The first value of {\tt x} is Gaussian with mean 0 and standard
deviation 1.  For subsequent values, the mean and standard deviation
depend on the previous value.  Given the previous {\tt x}, the mean of the
next value is {\tt x * rho}, and the variance is {\tt 1 - rho**2}.
\index{correlated random value}

{\tt Transform} maps from each
Gaussian value, {\tt x}, to a value from the given Cdf, {\tt y}.

\begin{verbatim}
def Transform(x):
    p = thinkbayes.GaussianCdf(x)
    y = cdf.Value(p)
    return y
\end{verbatim}

{\tt GaussianCdf} computes the CDF of the standard Gaussian
distribution at {\tt x}, returning a cumulative probability.
{\tt Cdf.Value} maps from a cumulative probability to the
corresponding value in {\tt cdf}.

Depending on the shape of {\tt cdf}, information can
be lost in transformation, so the actual correlation might be
lower than {\tt rho}.  For example, when I generate
10000 values from the distribution of growth rates with
{\tt rho=0.4}, the actual correlation is 0.37.
But since we are guessing at the right correlation anyway,
that's close enough.
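
To check, you can measure the serial correlation of the output; this
is a quick sketch, assuming {\tt numpy} is available and {\tt cdf} is
the Cdf of growth rates:

\begin{verbatim}
import itertools
import numpy

gen = CorrelatedGenerator(cdf, rho=0.4)
xs = list(itertools.islice(gen, 10000))
print numpy.corrcoef(xs[:-1], xs[1:])[0][1]
\end{verbatim}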

Remember that {\tt MakeSequence} takes an iterator as an argument.
That interface allows it to work with different generators:
\index{generator}

\begin{verbatim}
iterator = UncorrelatedGenerator(cdf)
seq1 = MakeSequence(iterator)

iterator = CorrelatedGenerator(cdf, rho)
seq2 = MakeSequence(iterator)
\end{verbatim}

In this example, {\tt seq1} and {\tt seq2} are
drawn from the same distribution, but the values in {\tt seq1}
are uncorrelated and the values in {\tt seq2} are correlated
with a coefficient of approximately {\tt rho}.
\index{serial correlation}
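
{\tt UncorrelatedGenerator} is the trivial case; a minimal sketch
looks like this:

\begin{verbatim}
def UncorrelatedGenerator(cdf):
    # yield independent draws from the given Cdf
    while True:
        yield cdf.Random()
\end{verbatim}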

Now we can see what effect serial correlation has on the results;
the following table shows percentiles of age for a 6 cm tumor,
using the uncorrelated generator and a correlated generator
with target $\rho = 0.4$.
\index{percentile}

\begin{table}
\input{kidney_table2}
\caption{Percentiles of tumor age conditioned on size.}
\end{table}

Correlation makes the fastest growing tumors faster and the slowest
slower, so the range of ages is wider.  The difference is modest for
low percentiles, but for the 95th percentile it is more than 6 years.
To compute these percentiles precisely, we would need a better
estimate of the actual serial correlation.

However, this model is sufficient to answer the question
we started with: given a tumor with a linear dimension of
15.5 cm, what is the probability that it formed more than
8 years ago?

Here's the code:

\begin{verbatim}
# class Cache

    def ProbOlder(self, cm, age):
        bucket = CmToBucket(cm)
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{verbatim}

{\tt cm} is the size of the tumor; {\tt age} is the age threshold
in years.  {\tt ProbOlder} converts size to a bucket number,
gets the Cdf of age conditioned on bucket, and computes the
probability that age exceeds the given value.
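
For example, assuming {\tt cache} has been filled by a run of
simulations, the call that produces the numbers below looks like this:

\begin{verbatim}
p = cache.ProbOlder(15.5, 8.0)
\end{verbatim}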

With no serial correlation, the probability that a
15.5 cm tumor is older than 8 years is 0.999, or almost certain.
With correlation 0.4, faster-growing tumors are more likely, but
the probability is still 0.995.  Even with correlation 0.8, the
probability is 0.978.

Another likely source of error is the assumption that tumors are
approximately spherical.  For a tumor with linear dimensions 15.5 x 15
cm, this assumption is probably not valid.  If, as seems likely, a
tumor this size
is relatively flat, it might have the same volume as a 6 cm sphere.
With this smaller volume and correlation 0.8, the probability of age
greater than 8 is still 95\%.

So even taking into account modeling errors, it is unlikely that such
a large tumor could have formed less than 8 years prior to the date of
diagnosis.
\index{modeling error}


\section{Discussion}

Well, we got through a whole chapter without using Bayes's theorem or
the {\tt Suite} class that encapsulates Bayesian updates.  What
happened?

One way to think about Bayes's theorem is as an algorithm for
inverting conditional probabilities.  Given \p{B|A}, we can compute
\p{A|B}, provided we know \p{A} and \p{B}.  Of course this algorithm
is only useful if, for some reason, it is easier to compute \p{B|A}
than \p{A|B}.

In this example, it is.  By running simulations, we can estimate the
distribution of size conditioned on age, or \p{size|age}.  But it is
harder to get the distribution of age conditioned on size, or
\p{age|size}.  So this seems like a perfect opportunity to use Bayes's
theorem.

The reason I didn't is computational efficiency.  To estimate
\p{size|age} for any given size, you have to run a lot of simulations.
Along the way, you end up computing \p{size|age} for a lot of sizes.
In fact, you end up computing the entire joint distribution of size
and age, \p{size, age}.
\index{joint distribution}

And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in {\tt ConditionalCdf}.
\index{conditional distribution}

So we side-stepped Bayes, but he was with us in spirit.


\chapter{A Hierarchical Model}
\label{hierarchical}


\section{The Geiger counter problem}

I got the idea for the following problem from Tom Campbell-Ricketts,
author of the Maximum Entropy blog at
\url{http://maximum-entropy-blog.blogspot.com}.  And he got the idea
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
Logic of Science}:
\index{Jaynes, E.~T.}
\index{Campbell-Ricketts, Tom}
\index{Geiger counter problem}

\begin{quote}
Suppose that a radioactive source emits particles toward
a Geiger counter at an average rate of $r$ particles per second,
but the counter only registers a fraction, $f$, of the particles
that hit it.  If $f$ is 10\% and
the counter registers 15 particles in a one-second
interval, what is the posterior distribution of $n$, the actual
number of particles that hit the counter, and $r$, the average
rate particles are emitted?
\end{quote}

To get started on a problem like this, think about the chain of
causation that starts with the parameters of the system and ends
with the observed data:
\index{causation}

\begin{enumerate}

\item The source emits particles at an average rate, $r$.

\item During any given second, the source emits $n$ particles
toward the counter.

\item Out of those $n$ particles, some number, $k$, get counted.

\end{enumerate}

The probability that an atom decays is the same at any point in time,
so radioactive decay is well modeled by a Poisson process.  Given $r$,
the distribution of $n$ is a Poisson distribution with parameter $r$.
\index{radioactive decay}
\index{Poisson process}

And if we assume that the probability of detection for each particle
is independent of the others, the distribution of $k$ is the binomial
distribution with parameters $n$ and $f$.
\index{binomial distribution}
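
Written out, these two distributions are
%
\[ P(n|r) = \frac{r^n e^{-r}}{n!}  \quad \mbox{and} \quad
   P(k|n, f) = \binom{n}{k} f^k (1-f)^{n-k} \]
%
and they are exactly the likelihoods the code below evaluates.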

Given the parameters of the system, we can find the distribution of
the data.  So we can solve what is called the {\bf forward problem}.
\index{forward problem}

Now we want to go the other way: given the data, we
want the distribution of the parameters.  This is called
the {\bf inverse problem}.  And if you can solve the forward
problem, you can use Bayesian methods to solve the inverse problem.
\index{inverse problem}


\section{Start simple}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
\caption{Posterior distribution of $n$ for three values of $r$.}
\label{fig.jaynes1}
\end{figure}

Let's start with a simple version of the problem where we know
the value of $r$.  We are given the value of $f$, so all we
have to do is estimate $n$.

I define a Suite called {\tt Detector} that models the behavior
of the detector and estimates $n$.

\begin{verbatim}
class Detector(thinkbayes.Suite):

    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{verbatim}

If the average emission rate is $r$ particles per second, the
distribution of $n$ is Poisson with parameter $r$.
{\tt high} and {\tt step} determine the upper bound for $n$
and the step size between hypothetical values.
\index{Poisson distribution}

Now we need a likelihood function:
\index{likelihood}

\begin{verbatim}
# class Detector

    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f

        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{verbatim}

{\tt data} is the number of particles detected, and {\tt hypo} is
the hypothetical number of particles emitted, $n$.

If there are actually $n$ particles, and the probability of detecting
any one of them is $f$, the probability of detecting $k$ particles is
given by the binomial distribution.
\index{binomial distribution}
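
A minimal version of {\tt EvalBinomialPmf}, assuming {\tt scipy} is
available, looks like this:

\begin{verbatim}
import scipy.stats

def EvalBinomialPmf(k, n, p):
    # probability of k successes in n trials with probability p
    return scipy.stats.binom.pmf(k, n, p)
\end{verbatim}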

That's it for the Detector.  We can try it out for a range
of values of $r$:

\begin{verbatim}
f = 0.1
k = 15

for r in [100, 250, 400]:
    suite = Detector(r, f, step=1)
    suite.Update(k)
    print suite.MaximumLikelihood()
\end{verbatim}

Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
several given values of $r$.


\section{Make it hierarchical}

In the previous section, we assume $r$ is known.  Now let's
relax that assumption.  I define another Suite, called {\tt Emitter},
that models the behavior of the emitter and estimates $r$:

\begin{verbatim}
class Emitter(thinkbayes.Suite):

    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{verbatim}

{\tt rs} is a sequence of hypothetical values for $r$.  {\tt detectors}
is a sequence of Detector objects, one for each value of $r$.  The
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
that is, a Suite that contains other Suites as values.
\index{meta-Suite}

To update the Emitter, we have to compute the likelihood of the data
under each hypothetical value of $r$.  But each value of $r$ is
represented by a Detector that contains a range of values for $n$.

To compute the likelihood of the data for a given Detector, we loop
through the values of $n$ and add up the total probability of $k$.
That's what {\tt SuiteLikelihood} does:

\begin{verbatim}
# class Detector

    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{verbatim}

Now we can write the Likelihood function for the Emitter:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{verbatim}

Each {\tt hypo} is a Detector, so we can invoke
{\tt SuiteLikelihood} to get the likelihood of the data under
the hypothesis.

After we update the Emitter, we have to update each of the
Detectors, too.

\begin{verbatim}
# class Emitter

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)

        for detector in self.Values():
            detector.Update(data)
\end{verbatim}

A model like this, with multiple levels of Suites, is called {\bf
hierarchical}.  \index{hierarchical model}


\section{A little optimization}

You might recognize {\tt SuiteLikelihood}; we saw it
in Section~\ref{suitelike}.  At the time, I pointed out that
we didn't really need it, because the total probability
computed by {\tt SuiteLikelihood} is exactly the normalizing
constant computed and returned by {\tt Update}.
\index{normalizing constant}

So instead of updating the Emitter and then updating the
Detectors, we can do both steps at the same time, using
the result from {\tt Detector.Update} as the likelihood
of Emitter.

Here's the streamlined version of {\tt Emitter.Likelihood}:

\begin{verbatim}
# class Emitter

    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{verbatim}

And with this version of {\tt Likelihood} we can use the
default version of {\tt Update}.  So this version has fewer
lines of code, and it runs faster because it does not compute
the normalizing constant twice.
\index{optimization}


\section{Extracting the posteriors}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
\caption{Posterior distributions of $n$ and $r$.}
\label{fig.jaynes2}
\end{figure}

After we update the Emitter, we can get the posterior distribution
of $r$ by looping through the Detectors and their probabilities:

\begin{verbatim}
# class Emitter

    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{verbatim}

{\tt items} is a list of values of $r$ and their probabilities.
The result is the Pmf of $r$.

To get the posterior distribution of $n$, we have to compute
the mixture of the Detectors.  We can use
{\tt thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
from each distribution to its probability.  And that's exactly
what the Emitter is:

\begin{verbatim}
# class Emitter

    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{verbatim}
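
Putting the pieces together, a driver might look like this (the grid
of values for {\tt rs} is my choice, not from jaynes.py):

\begin{verbatim}
k = 15

emitter = Emitter(range(50, 500, 5), f=0.1)
emitter.Update(k)

posterior_r = emitter.DistOfR()
posterior_n = emitter.DistOfN()
\end{verbatim}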

Figure~\ref{fig.jaynes2} shows the results.  Not surprisingly, the
most likely value for $n$ is 150.  Given $f$ and $n$, the expected
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
$k / f$, which is 150.

And if 150 particles are emitted in one second, the most likely value
of $r$ is 150 particles per second.  So the posterior distribution of
$r$ is also centered on 150.

The posterior distributions of $r$ and $n$ are similar;
the only difference is that we are slightly less certain about $n$.
In general, we can be more certain about the long-range emission rate,
$r$, than about the number of particles emitted in any particular second,
$n$.

You can download the code in this chapter from
\url{http://thinkbayes.com/jaynes.py}.  For more information see
Section~\ref{download}.


\section{Discussion}

The Geiger counter problem demonstrates the connection between
causation and hierarchical modeling.  In the example, the
emission rate $r$ has a causal effect on the number of particles,
$n$, which has a causal effect on the particle count, $k$.
\index{Geiger counter problem}
\index{causation}

The hierarchical model reflects the structure of the
system, with causes at the top and effects at the bottom.
\index{hierarchical model}

\begin{enumerate}

\item At the top level, we start with a range of hypothetical
values for $r$.

\item For each value of $r$, we have a range of values for $n$,
and the prior distribution of $n$ depends on $r$.

\item When we update the model, we go bottom-up.  We compute
a posterior distribution of $n$ for each value of $r$, then
compute the posterior distribution of $r$.

\end{enumerate}

So causal information flows down the hierarchy, and inference flows
up.


\section{Exercises}

\begin{exercise}
This exercise is also inspired by an example in Jaynes, {\em
Probability Theory}.

Suppose you buy a mosquito trap that is supposed to reduce the
population of mosquitoes near your house.  Each
week, you empty the trap and count the number of mosquitoes
captured.  After the first week, you count 30 mosquitoes.
After the second week, you count 20 mosquitoes.  Estimate the
percentage change in the number of mosquitoes in your yard.

To answer this question, you have to make some modeling
decisions.  Here are some suggestions:

\begin{itemize}

\item Suppose that each week a large number of mosquitoes, $N$, is bred
in a wetland near your home.

\item During the week, some fraction of
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
are caught in the trap.

\item Your solution should take into account your prior belief
about how much $N$ is likely to change from one week to the next.
You can do that by adding a level to the hierarchy to
model the percent change in $N$.

\end{itemize}

\end{exercise}


\chapter{Dealing with Dimensions}
\label{species}

\section{Belly button bacteria}

Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and other parts of the body.
\index{biodiversity}
\index{belly button}
\index{bacteria}
\index{microbiome}

In their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from.  Each identified fragment is called a ``read.''
\index{navel}
\index{rDNA}
\index{pyrosequencing}

We can use these data to answer several related questions:

\begin{itemize}

\item Based on the number of species observed, can we estimate
the total number of species in the environment?
\index{species}

\item Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?
\index{prevalence}

\item If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

These questions make up what is called the {\bf Unseen Species problem}.
\index{Unseen Species problem}


\section{Lions and tigers and bears}

I'll start with a simplified version of the problem where we know that
there are exactly three species.  Let's call them lions, tigers and
bears.  Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.
\index{lions and tigers and bears}

If we have an equal chance of observing any animal in the preserve,
the number of each species we see is governed by the multinomial
distribution.  If the prevalence of lions and tigers and bears is
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to
\index{multinomial distribution}

\begin{verbatim}
p_lion**3 * p_tiger**2 * p_bear**1
\end{verbatim}

An approach that is tempting, but not correct, is to use beta
distributions, as in Section~\ref{beta}, to describe the prevalence of
each species separately.  For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
distribution of \verb"p_lion" is:
\index{beta distribution}

\begin{verbatim}
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
\end{verbatim}

The maximum likelihood estimate for \verb"p_lion" is the observed
rate, 50\%.  Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
are 33\% and 17\%.
\index{maximum likelihood}

But there are two problems:

\begin{enumerate}

\item We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct.  The right prior should have a mean of 1/3,
and there should be zero likelihood that any species has a
prevalence of 100\%.

\item The distributions for each species are not independent, because
the prevalences have to add up to 1.  To capture this dependence, we
need a joint distribution for the three prevalences.
\index{independence}
\index{joint distribution}

\end{enumerate}

We can use a Dirichlet distribution to solve both of these problems
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}).  In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of \verb"p_lion",
\verb"p_tiger" and \verb"p_bear".
\index{beta distribution}
\index{Dirichlet distribution}

The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution.  Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.

If there are {\tt n} outcomes, the Dirichlet distribution is
described by {\tt n} parameters, written $\alpha_1$ through $\alpha_n$.

Here's the definition, from {\tt thinkbayes.py}, of a class that
represents a Dirichlet distribution:
\index{numpy}

\begin{verbatim}
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
\end{verbatim}

{\tt n} is the number of dimensions; initially the parameters
are all 1.  I use a {\tt numpy} array to store the parameters
so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:

\begin{verbatim}
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
\end{verbatim}

{\tt i} is the index of the marginal distribution we want.
{\tt alpha0} is the sum of the parameters; {\tt alpha} is the
parameter for the given species.
\index{marginal distribution}

In the example, the prior marginal distribution for each species
is {\tt Beta(1, 2)}.  We can compute the prior means like
this:

\begin{verbatim}
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
\end{verbatim}

As expected, the prior mean prevalence for each species is 1/3.
8880
To update the Dirichlet distribution, we add the
8881
observations to the parameters like this:
8882
8883
\begin{verbatim}
8884
def Update(self, data):
8885
m = len(data)
8886
self.params[:m] += data
8887
\end{verbatim}
8888
8889
Here {\tt data} is a sequence of counts in the same order as {\tt
8890
params}, so in this example, it should be the number of lions,
8891
tigers and bears.
8892
8893
{\tt data} can be shorter than {\tt params}; in that
8894
case there are some species that have not been
8895
observed.
8896
8897
Here's code that updates {\tt dirichlet} with the observed data and
8898
computes the posterior marginal distributions.
8899
8900
\begin{verbatim}
8901
data = [3, 2, 1]
8902
dirichlet.Update(data)
8903
8904
for i in range(3):
8905
beta = dirichlet.MarginalBeta(i)
8906
pmf = beta.MakePmf()
8907
print i, pmf.Mean()
8908
\end{verbatim}
8909
8910
\begin{figure}
8911
% species.py
8912
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
8913
\caption{Distribution of prevalences for three species.}
8914
\label{fig.species1}
8915
\end{figure}
8916
8917
Figure~\ref{fig.species1} shows the results. The posterior
8918
mean prevalences are 44\%, 33\%, and 22\%.


\section{The hierarchical version}

We have solved a simplified version of the problem: if we
know how many species there are, we can estimate the prevalence
of each.
\index{prevalence}

Now let's get back to the original problem, estimating the total
number of species.  To solve this problem I'll define a meta-Suite,
which is a Suite that contains other Suites as hypotheses.  In this
case, the top-level Suite contains hypotheses about the number of
species; the bottom level contains hypotheses about prevalences.
\index{hierarchical model}
\index{meta-Suite}

Here's the class definition:

\begin{verbatim}
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
\end{verbatim}

\verb"__init__" takes a list of possible values for {\tt n} and
makes a list of Dirichlet objects.

Here's the code that creates the top-level suite:

\begin{verbatim}
ns = range(3, 30)
suite = Species(ns)
\end{verbatim}

{\tt ns} is the list of possible values for {\tt n}.  We have seen 3
species, so there have to be at least that many.  I chose an upper
bound that seems reasonable, but we will check later that the
probability of exceeding this bound is low.  And at least initially
we assume that any value in this range is equally likely.

To update a hierarchical model, you have to update all levels.
Usually you have to update the bottom
level first and work up, but in this case we can
update the top level first:

\begin{verbatim}
# class Species

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for hypo in self.Values():
            hypo.Update(data)
\end{verbatim}

{\tt Species.Update} invokes {\tt Update} in the parent class,
then loops through the sub-hypotheses and updates them.

Now all we need is a likelihood function:

\begin{verbatim}
# class Species

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(1000):
            like += dirichlet.Likelihood(data)

        return like
\end{verbatim}

{\tt data} is a sequence of
observed counts; {\tt hypo} is a Dirichlet object.
{\tt Species.Likelihood} calls
{\tt Dirichlet.Likelihood} 1000 times and returns the total.

Why call it 1000 times?  Because {\tt
Dirichlet.Likelihood} doesn't actually compute the likelihood of the
data under the whole Dirichlet distribution.  Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences.

Here's what it looks like:

\begin{verbatim}
# class Dirichlet

    def Likelihood(self, data):
        m = len(data)
        if self.n < m:
            return 0

        x = data
        p = self.Random()
        q = p[:m]**x
        return q.prod()
\end{verbatim}

The length of {\tt data} is the number of species observed.  If
we see more species than we thought existed, the likelihood is 0.

\index{multinomial distribution}
Otherwise we select a random set of prevalences, {\tt p}, and
compute the multinomial PMF, which is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
$p_i$ is the prevalence of the $i$th species, and $x_i$ is the
observed number.  The first term, $c_x$, is the multinomial
coefficient; I leave it out of the computation because it is
a multiplicative factor that depends only
on the data, not the hypothesis, so it gets normalized away
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
\index{multinomial coefficient}

{\tt m} is the number of observed species.
We only need the first {\tt m} elements of {\tt p};
for the others, $x_i$ is 0, so
$p_i^{x_i}$ is 1, and we can leave them out of the product.


\section{Random sampling}
\label{randomdir}

There are two ways to generate a random sample from a Dirichlet
distribution.  One is to use the marginal beta distributions, but in
that case you have to select one at a time and scale the rest so they
add up to 1 (see
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
\index{random sample}

A less obvious, but faster, way is to select values from {\tt n} gamma
distributions, then normalize by dividing through by the total.
Here's the code:
\index{numpy}
\index{gamma distribution}

\begin{verbatim}
# class Dirichlet

    def Random(self):
        p = numpy.random.gamma(self.params)
        return p / p.sum()
\end{verbatim}

Now we're ready to look at some results.  Here is the code that extracts
the posterior distribution of {\tt n}:

\begin{verbatim}
# class Species

    def DistOfN(self):
        pmf = thinkbayes.Pmf()
        for hypo, prob in self.Items():
            pmf.Set(hypo.n, prob)
        return pmf
\end{verbatim}

{\tt DistOfN} iterates
through the top-level hypotheses and accumulates the probability
of each {\tt n}.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
\caption{Posterior distribution of {\tt n}.}
\label{fig.species2}
\end{figure}

Figure~\ref{fig.species2} shows the result.  The most likely value is 4.
Values from 3 to 7 are reasonably likely; after that the probabilities
drop off quickly.  The probability that there are 29 species is
low enough to be negligible; if we chose a higher bound,
we would get nearly the same result.

Remember that this result is based on a uniform prior for {\tt n}.  If
we have background information about the number of species in the
environment, we might choose a different prior.
\index{uniform distribution}
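
Putting it together, the whole computation for this example is a few
lines (a sketch, assuming the definitions above):

\begin{verbatim}
data = [3, 2, 1]

ns = range(3, 30)
suite = Species(ns)
suite.Update(data)

posterior_n = suite.DistOfN()
print posterior_n.MaximumLikelihood()    # most likely value: 4
\end{verbatim}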


\section{Optimization}

I have to admit that I am proud of this example.  The Unseen Species
problem is not easy, and I think this solution is simple and clear,
and takes surprisingly few lines of code (about 50 so far).

The only problem is that it is slow.  It's good enough for the example
with only 3 observed species, but not good enough for the belly button
data, with more than 100 species in some samples.

The next few sections present a series of optimizations we need to
make this solution scale.  Before we get into the details, here's
a road map.
\index{optimization}

\begin{itemize}

\item The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first {\tt m} parameters are
the same for all of them.  The only difference is the number of
hypothetical unseen species.  So we don't really need {\tt n}
Dirichlet objects; we can store the parameters in the top level of
the hierarchy.  {\tt Species2} implements this optimization.

\item {\tt Species2} also uses the same set of random values for all
of the hypotheses.  This saves time generating random values, but it
has a second benefit that turns out to be more important: by giving
all hypotheses the same selection from the sample space, we make
the comparison between the hypotheses more fair, so it takes
fewer iterations to converge.

\item Even with these changes there is a major performance problem.
As the number of observed species increases, the array of random
prevalences gets bigger, and the chance of choosing one that is
approximately right becomes small.  So the vast majority of
iterations yield small likelihoods that don't contribute much to the
total, and don't discriminate between hypotheses.

The solution is to do the updates one species at a time.  {\tt
Species4} is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.

\item Finally, {\tt Species5} combines the sub-hypotheses into the top
level and uses {\tt numpy} array operations to speed things up.
\index{numpy}

\end{itemize}

If you are not interested in the details, feel free to skip to
Section~\ref{belly} where we look at results from the belly
button data.


\section{Collapsing the hierarchy}
\label{collapsing}

All of the bottom-level Dirichlet distributions are updated
with the same data, so the first {\tt m} parameters are the same for
all of them.
We can eliminate them and merge the parameters into
the top-level suite.  {\tt Species2} implements this optimization:
\index{numpy}

\begin{verbatim}
class Species2(object):

    def __init__(self, ns):
        self.ns = ns
        self.high = max(ns)
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        self.params = numpy.ones(self.high, dtype=numpy.int)
\end{verbatim}

{\tt ns} is the list of hypothetical values for {\tt n};
{\tt probs} is the list of corresponding probabilities.  And
{\tt params} is the sequence of Dirichlet parameters, initially
all 1; there are {\tt high} of them, enough for the largest
hypothetical number of species.

{\tt Species2.Update} updates both levels of
the hierarchy: first the probability for each value of {\tt n},
then the Dirichlet parameters:
\index{numpy}

\begin{verbatim}
# class Species2

    def Update(self, data):
        like = numpy.zeros(len(self.ns), dtype=numpy.double)
        for i in range(1000):
            like += self.SampleLikelihood(data)

        self.probs *= like
        self.probs /= self.probs.sum()

        m = len(data)
        self.params[:m] += data
\end{verbatim}

{\tt SampleLikelihood} returns an array of likelihoods, one for each
value of {\tt n}.  {\tt like} accumulates the total likelihood for
1000 samples.  {\tt self.probs} is multiplied by the total likelihood,
then normalized.  The last two lines, which update the parameters,
are the same as in {\tt Dirichlet.Update}.

Now let's look at {\tt SampleLikelihood}.  There are two
opportunities for optimization here:

\begin{itemize}

\item When the hypothetical number of species, {\tt n},
exceeds the observed number, {\tt m}, we only need the first {\tt m}
terms of the multinomial PMF; the rest are 1.

\item If the number of species is large, the likelihood of the data
might be too small for floating-point (see Section~\ref{underflow}).
So it is safer to compute log-likelihoods.
\index{log-likelihood} \index{underflow}

\end{itemize}

\index{multinomial distribution}
Again, the multinomial PMF is
%
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
%
So the log-likelihood is
%
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
%
which is fast and easy to compute.  Again, $c_x$
is the same for all hypotheses, so we can drop it.
Here's the code:
\index{numpy}
9233
9234
\begin{verbatim}
9235
# class Species2
9236
9237
def SampleLikelihood(self, data):
9238
gammas = numpy.random.gamma(self.params)
9239
9240
m = len(data)
9241
row = gammas[:m]
9242
col = numpy.cumsum(gammas)
9243
9244
log_likes = []
9245
for n in self.ns:
9246
ps = row / col[n-1]
9247
terms = data * numpy.log(ps)
9248
log_like = terms.sum()
9249
log_likes.append(log_like)
9250
9251
log_likes -= numpy.max(log_likes)
9252
likes = numpy.exp(log_likes)
9253
9254
coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
9255
likes *= coefs
9256
9257
return likes
9258
\end{verbatim}

{\tt gammas} is an array of values from a gamma distribution; its
length is the largest hypothetical value of {\tt n}. {\tt row} is
just the first {\tt m} elements of {\tt gammas}; since these are the
only elements that depend on the data, they are the only ones we need.
\index{gamma distribution}

For each value of {\tt n} we need to divide {\tt row} by the
total of the first {\tt n} values from {\tt gammas}. {\tt cumsum}
computes these cumulative sums and stores them in {\tt col}.
\index{cumulative sum}

The loop iterates through the values of {\tt n} and accumulates
a list of log-likelihoods.
\index{log-likelihood}

Inside the loop, {\tt ps} contains the row of probabilities, normalized
with the appropriate cumulative sum. {\tt terms} contains the
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
their sum.

After the loop, we want to convert the log-likelihoods to linear
likelihoods, but first it's a good idea to shift them so the largest
log-likelihood is 0; that way the linear likelihoods are not too
small (see Section~\ref{underflow}).
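
Here is a minimal sketch of the shift, with made-up log-likelihoods
small enough to underflow:

\begin{verbatim}
import numpy

log_likes = numpy.array([-850.0, -852.0, -860.0])

# exponentiating directly would underflow to 0.0;
# subtracting the max multiplies all likelihoods by the
# same constant, which disappears when we normalize
log_likes -= numpy.max(log_likes)
likes = numpy.exp(log_likes)   # [1.0, 0.135, 4.5e-05]
\end{verbatim}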

Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these {\tt m}
species, if the total number of species is {\tt n}.
{\tt thinkbayes.BinomialCoef} computes ``n choose m'', which is written
$\binom{n}{m}$.
\index{binomial coefficient}

As often happens, the optimized version is less readable and more
error-prone than the original. But that's one reason I think it is
a good idea to start with the simple version; we can use it for
regression testing. I plotted results from both versions and confirmed
that they are approximately equal, and that they converge as the
number of iterations increases.
\index{regression testing}

\section{One more problem}

There's more we could do to optimize this code, but there's another
problem we need to fix first. As the number of observed
species increases, this version gets noisier and takes more
iterations to converge on a good answer.

The problem is that if the prevalences we choose from the Dirichlet
distribution, the {\tt ps}, are not at least approximately right,
the likelihood of the observed data is close to zero and almost
equally bad for all values of {\tt n}. So most iterations don't
provide any useful contribution to the total likelihood. And as the
number of observed species, {\tt m}, gets large, the probability of
choosing {\tt ps} with non-negligible likelihood gets small. Really
small.

Fortunately, there is a solution. Remember that if you observe
a set of data, you can update the prior distribution with the
entire dataset, or you can break it up into a series of updates
with subsets of the data, and the result is the same either way.
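
Here is a minimal sketch that demonstrates this property with a
beta distribution (the counts are made up):

\begin{verbatim}
import thinkbayes

# one update with all the data
batch = thinkbayes.Beta()
batch.Update((140, 110))

# a series of updates with subsets of the data
serial = thinkbayes.Beta()
for subset in [(100, 50), (40, 60)]:
    serial.Update(subset)

# both posteriors are Beta(141, 111)
print batch.alpha, batch.beta
print serial.alpha, serial.beta
\end{verbatim}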

For this example, the key is to perform the updates one species at
a time. That way when we generate a random set of {\tt ps}, only
one of them affects the computed likelihood, so the chance of choosing
a good one is much better.

Here's a new version that updates one species at a time:
\index{numpy}

\begin{verbatim}
class Species4(Species):

    def Update(self, data):
        m = len(data)

        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species}, so it
represents the hypotheses as a list of Dirichlet objects (unlike
{\tt Species2}).

{\tt Update} loops through the observed species and makes an
array, {\tt one}, with all zeros and one species count. Then
it calls {\tt Update} in the parent class, which computes
the likelihoods and updates the sub-hypotheses.

So in the running example, we do three updates. The first
is something like ``I have seen three lions.'' The second is
``I have seen two tigers and no additional lions.'' And the third
is ``I have seen one bear and no more lions and tigers.''
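
To make that concrete, here is what the {\tt one} arrays look like
for the data {\tt [3, 2, 1]}:

\begin{verbatim}
import numpy

data = [3, 2, 1]    # lions, tigers, bears

for i in range(len(data)):
    one = numpy.zeros(i+1)
    one[i] = data[i]
    print one

# [ 3.]            three lions
# [ 0.  2.]        two tigers, no additional lions
# [ 0.  0.  1.]    one bear, no more lions or tigers
\end{verbatim}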

Here's the new version of {\tt Likelihood}:

\begin{verbatim}
# class Species4

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(self.iterations):
            like += dirichlet.Likelihood(data)

        # correct for the number of unseen species the new one
        # could have been
        m = len(data)
        num_unseen = dirichlet.n - m + 1
        like *= num_unseen

        return like
\end{verbatim}

This is almost the same as {\tt Species.Likelihood}. The difference
is the factor, \verb"num_unseen". This correction is necessary
because each time we see a species for the first time, we have to
consider that there were some number of other unseen species that
we might have seen. For larger values of {\tt n} there are more
unseen species that we could have seen, which increases the likelihood
of the data.

This is a subtle point and I have to admit that I did not get it right
the first time. But again I was able to validate this version
by comparing it to the previous versions.
\index{regression testing}

\section{We're not done yet}

\newcommand{\BigO}[1]{\mathcal{O}(#1)}

Performing the updates one species at a time solves one problem, but
it creates another. Each update takes time proportional to $k m$,
where $k$ is the number of hypotheses and $m$ is the number of observed
species. So if we do $m$ updates, the total run time is
proportional to $k m^2$.

But we can speed things up using the same trick we used in
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
collapse the two levels of the hierarchy into a single object. So
here's yet another version of {\tt Species}:

\begin{verbatim}
class Species5(Species2):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
\end{verbatim}

This version inherits \verb"__init__" from {\tt Species2}, so
it uses {\tt ns} and {\tt probs} to represent the distribution
of {\tt n}, and {\tt params} to represent the parameters of
the Dirichlet distribution.

{\tt Update} is similar to what we saw in the previous section.
It loops through the observed species and calls {\tt UpdateOne}:
\index{numpy}

\begin{verbatim}
# class Species5

    def UpdateOne(self, i, count):
        likes = numpy.zeros(len(self.ns), dtype=numpy.double)
        for _ in range(self.iterations):
            likes += self.SampleLikelihood(i, count)

        unseen_species = [n-i+1 for n in self.ns]
        likes *= unseen_species

        self.probs *= likes
        self.probs /= self.probs.sum()
\end{verbatim}

This function is similar to {\tt Species2.Update}, with two changes:

\begin{itemize}

\item The interface is different. Instead of the whole dataset, we
get {\tt i}, the index of the observed species, and {\tt count},
how many of that species we've seen.

\item We have to apply a correction factor for the number of unseen
species, as in {\tt Species4.Likelihood}. The difference here is
that we update all of the likelihoods at once with array
multiplication.

\end{itemize}

Finally, here's {\tt SampleLikelihood}:
\index{numpy}

\begin{verbatim}
# class Species5

    def SampleLikelihood(self, i, count):
        gammas = numpy.random.gamma(self.params)

        sums = numpy.cumsum(gammas)[self.ns[0]-1:]

        ps = gammas[i-1] / sums
        log_likes = numpy.log(ps) * count

        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)

        return likes
\end{verbatim}

This is similar to {\tt Species2.SampleLikelihood}; the
difference is that each update only includes a single species,
so we don't need a loop.

The runtime of this function is proportional to the number
of hypotheses, $k$. It runs $m$ times, so the run time of
the update is proportional to $k m$.
And the number of iterations we
need to get an accurate result is usually small.

\section{The belly button data}
\label{belly}

That's enough about lions and tigers and bears.
Let's get back to belly buttons. To get a sense of what the
data look like, consider subject B1242,
whose sample of 400 reads yielded 61 species with the following
counts:

\begin{verbatim}
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
\end{verbatim}

There are a few dominant species that make up a large
fraction of the whole, but many species that yielded only
a single read. The number of these ``singletons'' suggests
that there are likely to be at least a few unseen species.
\index{species}

In the example with lions and tigers, we assume that each
animal in the preserve is equally likely to be observed.
Similarly, for the belly button data, we assume that each
bacterium is equally likely to yield a read.

In reality, each step in the data-collection
process might introduce biases. Some species might
be more likely to be picked up by a swab, or to yield identifiable
amplicons. So when we talk about the prevalence of each species,
we should remember this source of error.
\index{sample bias}

I should also acknowledge that I am using the term ``species''
loosely. First, bacterial species are not well defined. Second,
some reads identify a particular species, others only identify
a genus. To be more precise, I should say ``operational
taxonomic unit'', or OTU.
\index{operational taxonomic unit}
\index{OTU}

Now let's process some of the belly button data. I define
a class called {\tt Subject} to represent information about
each subject in the study:

\begin{verbatim}
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
\end{verbatim}

Each subject has a string code, like ``B1242'', and a list of
(count, species name) pairs, sorted in increasing order by count.
{\tt Subject} provides several methods to make it
easy to access these counts and species names. You can see the details
in \url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.
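
For example, the accessors might look something like this sketch
(the actual methods in species.py differ in detail):

\begin{verbatim}
# class Subject

    def Add(self, species, count):
        self.species.append((count, species))

    def GetNames(self):
        return [name for count, name in self.species]

    def GetCounts(self):
        return [count for count, name in self.species]
\end{verbatim}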

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
\caption{Distribution of {\tt n} for subject B1242.}
\label{species-ndist}
\end{figure}

{\tt Subject} provides a method named {\tt Process} that creates and
updates a {\tt Species5} suite,
which represents the distributions of {\tt n} and the prevalences.
\index{prevalence}

And {\tt Species2} provides {\tt DistOfN}, which returns the posterior
distribution of {\tt n}.

\begin{verbatim}
# class Species2

    def DistOfN(self):
        items = zip(self.ns, self.probs)
        pmf = thinkbayes.MakePmfFromItems(items)
        return pmf
\end{verbatim}
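
As a sketch of how to summarize this posterior (assuming {\tt suite}
is a {\tt Species5} that has already been updated):

\begin{verbatim}
pmf = suite.DistOfN()
print pmf.Mean()
print thinkbayes.CredibleInterval(pmf, 90)
\end{verbatim}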

Figure~\ref{species-ndist} shows the distribution of {\tt n} for
subject B1242. The probability that there are exactly 61 species, and
no unseen species, is nearly zero. The most likely value is 72, with
90\% credible interval 66 to 79. At the high end, it is unlikely that
there are as many as 87 species.

Next we compute the posterior distribution of prevalence for
each species. {\tt Species2} provides {\tt DistOfPrevalence}:

\begin{verbatim}
# class Species2

    def DistOfPrevalence(self, index):
        metapmf = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            metapmf.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(metapmf)
        return metapmf, mix
\end{verbatim}

{\tt index} indicates which species we want. For each
{\tt n}, we have a different posterior distribution
of prevalence.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
\caption{Distribution of prevalences for subject B1242.}
\label{species-prev}
\end{figure}

The loop iterates through the possible values of {\tt n}
and their probabilities. For each value of {\tt n} it gets
a Beta object representing the marginal distribution for the
indicated species. Remember that Beta objects contain the
parameters {\tt alpha} and {\tt beta}; they don't have
values and probabilities like a Pmf, but they provide {\tt MakePmf},
which generates a discrete approximation to the continuous
beta distribution.
\index{Beta object}

{\tt metapmf} is a meta-Pmf that contains the distributions
of prevalence, conditioned on {\tt n}. {\tt MakeMixture}
combines the conditional distributions into {\tt mix},
a single distribution of prevalence.
\index{meta-Pmf}
\index{mixture}
\index{MakeMixture}
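
For example, a sketch of how to summarize the prevalence of the
species at index 0 (again assuming an updated {\tt suite}):

\begin{verbatim}
metapmf, mix = suite.DistOfPrevalence(0)
print thinkbayes.CredibleInterval(mix, 90)
\end{verbatim}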

Figure~\ref{species-prev} shows results for the five
species with the most reads. The most prevalent species accounts for
23\% of the 400 reads, but since there are almost certainly unseen
species, the most likely estimate for its prevalence is 20\%,
with 90\% credible interval between 17\% and 23\%.

\section{Predictive distributions}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
\caption{Simulated rarefaction curves for subject B1242.}
\label{species-rare}
\end{figure}

I introduced the unseen species problem in the form of four related
questions. We have answered the first two by computing the posterior
distribution for {\tt n} and the prevalence of each species.
\index{predictive distribution}

The other two questions are:

\begin{itemize}

\item If we are planning to collect additional reads, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

To answer predictive questions like this we can use the posterior
distributions to simulate possible future events and compute
predictive distributions for the number of species, and fraction of
the total, we are likely to see.

The kernel of these simulations looks like this:
\index{simulation}

\begin{enumerate}

\item Choose {\tt n} from its posterior distribution.

\item Choose a prevalence for each species, including possible unseen
species, using the Dirichlet distribution.
\index{Dirichlet distribution}

\item Generate a random sequence of future observations.

\item Compute the number of new species, \verb"num_new", as a function
of the number of additional reads, {\tt k}.

\item Repeat the previous steps and accumulate the joint distribution
of \verb"num_new" and {\tt k}.
\index{joint distribution}

\end{enumerate}

And here's the code. {\tt RunSimulation} runs a single simulation:

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
\end{verbatim}

\verb"num_reads" is the number of additional reads to simulate.
{\tt m} is the number of seen species, and {\tt seen} is a set of
strings with a unique name for each species.
{\tt n} is a random value from the posterior distribution, and
{\tt observations} is a random sequence of species names.

Each time through the loop, we add the new observation to
{\tt seen} and record the number of reads and the number of
new species so far.

The result of {\tt RunSimulation} is a {\bf rarefaction curve},
represented as a list of pairs with the number of reads and
the number of new species.
\index{rarefaction curve}
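
To generate a batch of curves like the ones in
Figure~\ref{species-rare}, we might run something like this sketch
(assuming {\tt subject} is a {\tt Subject} that has been processed):

\begin{verbatim}
curves = [subject.RunSimulation(400) for _ in range(100)]
\end{verbatim}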

Before we see the results, let's look at {\tt GetSeenSpecies} and
{\tt GenerateObservations}.

\begin{verbatim}
# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
\end{verbatim}

{\tt GetNames} returns the list of species names that appear in
the data files, but for many subjects these names are not unique.
So I use {\tt SpeciesGenerator} to extend each name with a serial
number:
\index{generator}

\begin{verbatim}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{verbatim}

Given a name like {\tt Corynebacterium}, {\tt SpeciesGenerator} yields
{\tt Corynebacterium-0}. When the list of names is exhausted, it
yields names like {\tt unseen-62}.
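
For example, with two names and {\tt num=4}:

\begin{verbatim}
names = ['Corynebacterium', 'Bacillus']
print list(SpeciesGenerator(names, 4))
# ['Corynebacterium-0', 'Bacillus-1', 'unseen-2', 'unseen-3']
\end{verbatim}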

Here is {\tt GenerateObservations}:

\begin{verbatim}
# class Subject

    def GenerateObservations(self, num_reads):
        n, prevalences = self.suite.SamplePosterior()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_reads)

        return n, observations
\end{verbatim}

Again, \verb"num_reads" is the number of additional reads
to generate. {\tt n} and {\tt prevalences} are samples from
the posterior distribution.

{\tt cdf} is a Cdf object that maps species names, including the
unseen, to cumulative probabilities. Using a Cdf makes it efficient
to generate a random sequence of species names.
\index{Cdf}
\index{cumulative probability}

Finally, here is {\tt Species2.SamplePosterior}:

\begin{verbatim}
# class Species2

    def SamplePosterior(self):
        pmf = self.DistOfN()
        n = pmf.Random()
        prevalences = self.SamplePrevalences(n)
        return n, prevalences
\end{verbatim}

And {\tt SamplePrevalences}, which generates a sample of
prevalences conditioned on {\tt n}:
\index{numpy}
\index{random sample}

\begin{verbatim}
# class Species2

    def SamplePrevalences(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
\end{verbatim}

We saw this algorithm for generating random values from a Dirichlet
distribution in Section~\ref{randomdir}.

Figure~\ref{species-rare} shows 100 simulated rarefaction curves
for subject B1242. The curves are ``jittered'';
that is, I shifted each curve by a random offset so they
would not all overlap. By inspection we can estimate that after
400 more reads we are likely to find 2--6 new species.

\section{Joint posterior}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
\caption{Distributions of the number of new species conditioned on
the number of additional reads.}
\label{species-cond}
\end{figure}

We can use these simulations to estimate the
joint distribution of \verb"num_new" and {\tt k}, and from that
we can get the distribution of \verb"num_new" conditioned on any
value of {\tt k}.
\index{joint distribution}

\begin{verbatim}
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
\end{verbatim}

{\tt MakeJointPredictive} makes a Joint object, which is a
Pmf whose values are tuples.
\index{Joint object}

{\tt curves} is a list of rarefaction curves created by
{\tt RunSimulation}. Each curve contains a list of pairs of
{\tt k} and \verb"num_new".
\index{rarefaction curve}

The resulting joint distribution is a map from each pair to
its probability of occurring. Given the joint distribution, we
can use {\tt Joint.Conditional} to
get the distribution of \verb"num_new" conditioned on {\tt k}
(see Section~\ref{conditional}).
\index{conditional distribution}

{\tt MakeConditionals} takes a list of curves and a list of {\tt ks},
and computes the conditional distribution of \verb"num_new"
for each {\tt k}. The result is a list of Cdf objects.

\begin{verbatim}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
\end{verbatim}
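
Here is a sketch of how these functions fit together (the {\tt ks}
are arbitrary examples):

\begin{verbatim}
ks = [100, 200, 400, 800]
cdfs = MakeConditionals(curves, ks)
for cdf in cdfs:
    # 90% credible interval: 5th to 95th percentile
    print cdf.name, cdf.Percentile(5), cdf.Percentile(95)
\end{verbatim}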

Figure~\ref{species-cond} shows the results. After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5. After 800 reads, we expect to see 3 to 12 new
species.

\section{Coverage}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}

The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}

To answer this question, we need a version of {\tt RunSimulation}
that computes the fraction of observed species rather than the
number of new species.

\begin{verbatim}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
\end{verbatim}

Next we loop through each curve and make a dictionary, {\tt d},
that maps from the number of additional reads, {\tt k}, to
a list of {\tt fracs}; that is, a list of values for the
coverage achieved after {\tt k} reads.

\begin{verbatim}
# class Subject

    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.iteritems():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
\end{verbatim}

Then for each value of {\tt k} we make a Cdf of {\tt fracs}; this Cdf
represents the distribution of coverage after {\tt k} reads.

Remember that the CDF tells you the probability of falling below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it. Figure~\ref{species-frac} shows
complementary CDFs for a range of values of {\tt k}.
\index{complementary CDF}
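
Each point on one of these curves can be computed from the Cdfs;
here is a sketch (using {\tt Cdf.Prob}, which returns the probability
of a value less than or equal to its argument):

\begin{verbatim}
cdfs = subject.MakeFracCdfs(curves)

# probability of exceeding 90% coverage after k reads
for k, cdf in sorted(cdfs.iteritems()):
    print k, 1 - cdf.Prob(0.9)
\end{verbatim}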

To read this figure, select the level of coverage you want to achieve
along the $x$-axis. As an example, choose 90\%.
\index{coverage}

Now you can read up the chart to find the probability of achieving
90\% coverage after {\tt k} reads. For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage. With 1000 reads, you
have a 90\% chance of getting 90\% coverage.

With that, we have answered the four questions that make up the unseen
species problem. To validate the algorithms in this chapter with
real data, I had to deal with a few more details. But
this chapter is already too long, so I won't discuss them here.

You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.

You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution. So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier. I'm very happy about that.

My goal for this book is to present three related ideas:

\begin{itemize}

\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
the idea of using probability distributions to represent uncertain
beliefs, using data to update those distributions, and using the
results to make predictions and inform decisions.

\item {\bf A computational approach}: The premise of this book is that
it is easier to understand Bayesian analysis using computation
rather than math, and easier to implement Bayesian methods with
reusable building blocks that can be rearranged to solve real-world
problems quickly.

\item {\bf Iterative modeling}: Most real-world problems involve
modeling decisions and trade-offs between realism and complexity.
It is often impossible to know ahead of time what factors should be
included in the model and which can be abstracted away. The best
approach is to iterate, starting with simple models and adding
complexity gradually, using each model to validate the others.

\end{itemize}

These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.

If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work. I hope you find
them useful; let me know how it goes!

%\chapter{Future chapters}

%Bayesian regression (hybrid version with resampling?)
%\url{http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/}

%Change point detection:

%Deconvolution: Estimating round trip times

%Bayesian search

%Extension of the Euro problem: evaluating reddit items and redditors
%\url{http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/}

%Charles Darwin problem (capture-tag-recapture)
%\url{http://maximum-entropy-blog.blogspot.com/2012/04/capture-recapture-and-charles-darwin.html}

% http://camdp.com/blogs/how-solve-price-rights-showdown

% https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

% http://blog.yhathq.com/posts/estimating-user-lifetimes-with-pymc.html

\printindex

\end{document}