Think Stats by Allen B. Downey Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
License: GPL3
Let's read data from the BRFSS:
age | sex | wtyrago | finalwt | wtkg2 | htm3 | |
---|---|---|---|---|---|---|
count | 410856.000000 | 414509.000000 | 390399.000000 | 414509.000000 | 398484.000000 | 409129.000000 |
mean | 54.862180 | 1.624368 | 79.721319 | 561.774700 | 78.992453 | 168.825190 |
std | 16.737702 | 0.484286 | 20.565164 | 1076.538764 | 19.546157 | 10.352653 |
min | 18.000000 | 1.000000 | 22.727273 | 1.695143 | 20.000000 | 61.000000 |
25% | 43.000000 | 1.000000 | 64.545455 | 97.006804 | 64.550000 | 160.000000 |
50% | 55.000000 | 2.000000 | 77.272727 | 234.010543 | 77.270000 | 168.000000 |
75% | 67.000000 | 2.000000 | 90.909091 | 590.775576 | 90.910000 | 175.000000 |
max | 99.000000 | 2.000000 | 342.272727 | 60995.111700 | 309.090000 | 236.000000 |
If we group by sex, we get a DataFrameGroupBy
object.
If we select a particular column from the GroupBy
, we get a SeriesGroupBy
object.
If you invoke a reduce method on a DataFrameGroupBy, you get a DataFrame:
age | wtyrago | finalwt | wtkg2 | htm3 | |
---|---|---|---|---|---|
sex | |||||
1 | 54.114237 | 89.880639 | 727.426119 | 89.016966 | 178.066221 |
2 | 55.314401 | 73.278344 | 462.115407 | 72.684712 | 163.223475 |
If you invoke a reduce method on a SeriesGroupBy, you get a Series:
You can use aggregate
to apply a collection of reduce methods:
age | wtyrago | finalwt | wtkg2 | htm3 | ||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | mean | std | |
sex | ||||||||||
1 | 54.114237 | 16.383712 | 89.880639 | 19.260805 | 727.426119 | 1370.544678 | 89.016966 | 18.240528 | 178.066221 | 7.723563 |
2 | 55.314401 | 16.932158 | 73.278344 | 18.678964 | 462.115407 | 836.457063 | 72.684712 | 17.609006 | 163.223475 | 7.269156 |
mean | std | |
---|---|---|
sex | ||
1 | 178.066221 | 7.723563 |
2 | 163.223475 | 7.269156 |
If the reduce method you want is not available, you can make your own:
Here's how it works when we apply it directly:
And we can use apply
to apply it to each group:
The digitize-groupby combo
Let's say we want to group people into deciles (bottom 10%, next 10%, and so on).
We can start by defining the cumulative probabilities that mark the borders between deciles.
And then use deciles
to find the values that correspond to those cumulative probabilities.
digitize
takes a series and a sequence of bin boundaries, and computes the bin index for each element in the series.
Exercise: Collect the code snippets from the previous cells to write a function called digitize
that takes a Series and a number of bins and return the results from np.digitize
.
Now, if your digitize
function is working, we can assign the results to a new column in the DataFrame:
And then group by height_decile
Now we can compute means for each variable in each group:
age | sex | wtyrago | finalwt | wtkg2 | htm3 | |
---|---|---|---|---|---|---|
height_decile | ||||||
1 | 59.343555 | 1.978576 | 65.732170 | 482.633403 | 65.071502 | 152.147807 |
2 | 57.708848 | 1.976869 | 68.925145 | 460.942597 | 68.415169 | 157.001891 |
3 | 56.693958 | 1.964062 | 71.014838 | 461.123901 | 70.420280 | 160.002595 |
4 | 55.411945 | 1.943195 | 72.962619 | 481.562164 | 72.417512 | 163.000728 |
5 | 55.010687 | 1.890204 | 74.834351 | 498.016131 | 74.251253 | 165.002383 |
6 | 54.469226 | 1.781678 | 77.115205 | 534.779185 | 76.563233 | 168.000509 |
7 | 53.708454 | 1.573876 | 80.716256 | 570.787161 | 80.072338 | 171.405279 |
8 | 53.354186 | 1.306171 | 85.399642 | 643.930094 | 84.663829 | 175.001213 |
9 | 53.838675 | 1.134095 | 89.774340 | 651.359825 | 88.932201 | 178.888785 |
10 | 51.531986 | 1.034134 | 98.338009 | 714.941775 | 97.387332 | 186.214057 |
11 | 55.246926 | 1.758967 | 71.630761 | 792.074824 | 71.236649 | 236.000000 |
It looks like:
The shortest people are older than the tallest people, on average.
The shortest people are much more likely to be female (no surprise there).
The shortest people are lighter than the tallest people (
wtkg2
), and they were lighter last year, too (wtyrago
).Shorter people are more oversampled, so they have lower final weights. This is at least partly, and maybe entirely, due to the relationship with sex.
The fact that all of these variables are associates with height suggests that it will be important to control for age and sex for almost any analysis we want to do with this data.
Nevertheless, we'll start with a simple analysis looking at weights within each height group.
If we apply quantile
to a SeriesGroupBy
, we get back a Series
with a MultiIndex
.
If you unstack a MultiIndex, the inner level of the MultIndex gets broken out into columns.
0.25 | 0.5 | 0.75 | |
---|---|---|---|
height_decile | |||
1 | 54.55 | 62.73 | 72.73 |
2 | 57.73 | 65.91 | 76.36 |
3 | 59.09 | 68.18 | 78.64 |
4 | 61.36 | 68.18 | 81.36 |
5 | 63.18 | 71.36 | 81.82 |
6 | 64.09 | 72.73 | 84.09 |
7 | 68.18 | 77.27 | 88.64 |
8 | 72.73 | 81.82 | 90.91 |
9 | 77.27 | 86.36 | 97.27 |
10 | 84.09 | 95.45 | 106.82 |
11 | 59.09 | 69.09 | 80.00 |
Which makes it convenient to plot each of the columns as a line.
The other view of this data we might like is the CDF of weight within each height group.
We can use apply
with the Cdf
constructor from thinkstats2
. The results is a Series of Cdf objects.
And now we can plot the CDFs
Exercise: Plot CDFs of weight for men and women separately, broken out by decile of height.
age | wtyrago | finalwt | wtkg2 | htm3 | ||
---|---|---|---|---|---|---|
sex | height_decile | |||||
1 | 1 | 53.844743 | 71.977922 | 1191.234678 | 71.514913 | 150.233293 |
2 | 56.595658 | 71.746929 | 1003.079353 | 71.003320 | 157.025469 | |
3 | 54.869454 | 72.944564 | 964.772919 | 72.326525 | 160.028037 | |
4 | 56.042489 | 74.553472 | 879.420399 | 74.016273 | 163.005126 | |
5 | 55.165862 | 76.385110 | 917.796857 | 75.760722 | 165.013377 | |
6 | 56.300059 | 79.337985 | 814.286624 | 78.624502 | 168.001398 | |
7 | 56.170706 | 82.956049 | 705.872453 | 82.235605 | 171.753016 | |
8 | 55.213608 | 86.572882 | 706.591295 | 85.843598 | 175.001515 | |
9 | 54.734034 | 90.392173 | 676.266016 | 89.530755 | 178.925710 | |
10 | 51.677645 | 98.621484 | 721.615568 | 97.668112 | 186.240144 | |
11 | 50.650927 | 79.295724 | 1195.072324 | 78.971384 | 236.000000 | |
2 | 1 | 59.464339 | 65.595065 | 467.120026 | 64.928121 | 152.189721 |
2 | 57.735134 | 68.857631 | 448.105460 | 68.351735 | 157.001333 | |
3 | 56.762345 | 70.942543 | 442.349122 | 70.347362 | 160.001647 | |
4 | 55.373809 | 72.864145 | 457.600597 | 72.317857 | 163.000463 | |
5 | 54.991480 | 74.636768 | 446.241040 | 74.057330 | 165.001027 | |
6 | 53.955434 | 76.467033 | 456.712813 | 75.963038 | 168.000260 | |
7 | 51.872079 | 78.972655 | 470.481305 | 78.390760 | 171.147071 | |
8 | 49.126449 | 82.582291 | 501.930594 | 81.844168 | 175.000528 | |
9 | 48.030870 | 85.524509 | 490.531017 | 84.845692 | 178.650349 | |
10 | 47.388489 | 89.769177 | 526.098271 | 89.063530 | 185.475904 | |
11 | 56.722523 | 69.371896 | 664.090555 | 68.941935 | NaN |
height_decile | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
sex | |||||||||||
1 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
2 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |