Think Stats by Allen B. Downey
Think Stats is an introduction to Probability and Statistics for Python programmers.
This is the accompanying code for this book.
License: GPL3
Examples and Exercises from Think Stats, 2nd Edition
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
Examples from Chapter 1
Read NSFG data into a Pandas DataFrame.
| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 8.8125 |
1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 7.8750 |
2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 9.1250 |
3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 7.0000 |
4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 6.1875 |
5 rows × 244 columns
Print the column names.
Select a single column name.
Select a column and check what type it is.
Print a column.
Select a single element from a column.
Select a slice from a column.
Select a column using dot notation.
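The selection steps above can be sketched on a small stand-in for the pregnancy DataFrame. This is a hedged illustration, not the notebook's own cells: the variable name preg and the three-column frame are assumptions, with values copied from the five rows displayed above (the real table has 244 columns and is read from the NSFG files).

```python
import pandas as pd

# Stand-in for the pregnancy DataFrame (assumption): three of the 244
# columns, with the values shown in the five displayed rows.
preg = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "pregordr": [1, 2, 1, 2, 3],
    "totalwgt_lb": [8.8125, 7.8750, 9.1250, 7.0000, 6.1875],
})

# Print the column names, then select a single column name.
print(preg.columns)
second_name = preg.columns[1]

# Select a column with bracket syntax and check its type (a Series).
pregordr = preg["pregordr"]
print(type(pregordr))

# Print the column.
print(pregordr)

# Select a single element (by index label) and a slice (by position).
first = pregordr[0]
middle = pregordr[2:5]

# Dot notation selects the same column.
same = preg.pregordr
```

Dot notation only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute; bracket syntax always works.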
Count the number of times each value occurs.
Check the values of another variable.
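A minimal sketch of counting values, assuming only the five displayed rows. The columns pregordr and pregend1 are chosen here because their values are visible in the table above; the notebook may check different variables.

```python
import pandas as pd

# Values copied from the five displayed rows (illustration only).
pregordr = pd.Series([1, 2, 1, 2, 3], name="pregordr")
pregend1 = pd.Series([6.0, 6.0, 5.0, 6.0, 6.0], name="pregend1")

# value_counts tallies each distinct value; sort_index orders the
# result by value rather than by frequency.
pregordr_counts = pregordr.value_counts().sort_index()
print(pregordr_counts)

# The same check on another variable.
pregend1_counts = pregend1.value_counts().sort_index()
print(pregend1_counts)
```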
Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.
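One way to build such a dictionary is sketched below, under stated assumptions: the names preg and preg_map are hypothetical, the frame holds only the five displayed rows, and pregordr stands in for the outcome column, whose values are not shown in the table above.

```python
from collections import defaultdict

import pandas as pd

# Stand-in for the pregnancy DataFrame (assumption): displayed rows only.
preg = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "pregordr": [1, 2, 1, 2, 3],
})

# Map each caseid to the list of row indices where it appears.
preg_map = defaultdict(list)
for index, caseid in preg["caseid"].items():
    preg_map[caseid].append(index)

# Use the map to pull one respondent's rows out of a column.
indices = preg_map[2]
selected = preg["pregordr"].loc[indices].values
print(selected)
```

Building the map once makes repeated lookups cheap, since each lookup avoids rescanning the whole caseid column.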
Exercises
Select the birthord column, print the value counts, and compare to results published in the codebook.
We can also use isnull to count the number of NaNs.
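As a hedged sketch of the isnull idiom, the snippet below uses the multbrth values from the five rows displayed earlier, since the birthord values are not shown in this document.

```python
import numpy as np
import pandas as pd

# multbrth values from the five displayed rows (illustration only).
multbrth = pd.Series([np.nan, np.nan, 5.0, np.nan, np.nan], name="multbrth")

# isnull returns a boolean Series; summing it counts the True entries.
n_missing = multbrth.isnull().sum()
print(n_missing)

# value_counts leaves NaN out by default, so the two counts together
# account for every row.
n_present = multbrth.value_counts().sum()
```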
Select the prglngth column, print the value counts, and compare to results published in the codebook.
To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:
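A minimal sketch of the call, assuming only the five totalwgt_lb values displayed in the first table. The result is the mean of those five rows, not the mean over the full dataset.

```python
import pandas as pd

# totalwgt_lb values from the five displayed rows (illustration only).
totalwgt_lb = pd.Series(
    [8.8125, 7.8750, 9.1250, 7.0000, 6.1875], name="totalwgt_lb"
)

# mean skips NaN values by default.
mean_lb = totalwgt_lb.mean()
print(mean_lb)
```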
Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.
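One possible approach is sketched below on the five displayed rows. The constant name LB_TO_KG is hypothetical; its value is the standard definition of the pound in kilograms.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # kilograms per pound, exact by definition

# Stand-in frame (assumption): totalwgt_lb from the five displayed rows.
preg = pd.DataFrame({"totalwgt_lb": [8.8125, 7.8750, 9.1250, 7.0000, 6.1875]})

# Dictionary syntax creates the new column.
preg["totalwgt_kg"] = preg["totalwgt_lb"] * LB_TO_KG

# Once the column exists, it can be read with either syntax.
mean_kg = preg.totalwgt_kg.mean()
print(mean_kg)
```

Assigning with dot notation to a name that is not yet a column sets a plain attribute on the DataFrame object instead of adding a column, which is why dictionary syntax is needed here.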
nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:
DataFrame provides a method head that displays the first five rows:
| | caseid | rscrinf | rdormres | rostscrn | rscreenhisp | rscreenrace | age_a | age_r | cmbirth | agescrn | ... | pubassis_i | basewgt | adj_mod_basewgt | finalwgt | secu_r | sest | cmintvw | cmlstyr | screentime | intvlngth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2298 | 1 | 5 | 5 | 1 | 5.0 | 27 | 27 | 902 | 27 | ... | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | 1234 | 1222 | 18:26:36 | 110.492667 |
1 | 5012 | 1 | 5 | 1 | 5 | 5.0 | 42 | 42 | 718 | 42 | ... | 0 | 2335.279149 | 2846.799490 | 4744.191350 | 2 | 18 | 1233 | 1221 | 16:30:59 | 64.294000 |
2 | 11586 | 1 | 5 | 1 | 5 | 5.0 | 43 | 43 | 708 | 43 | ... | 0 | 2335.279149 | 2846.799490 | 4744.191350 | 2 | 18 | 1234 | 1222 | 18:19:09 | 75.149167 |
3 | 6794 | 5 | 5 | 4 | 1 | 5.0 | 15 | 15 | 1042 | 15 | ... | 0 | 3783.152221 | 5071.464231 | 5923.977368 | 2 | 18 | 1234 | 1222 | 15:54:43 | 28.642833 |
4 | 616 | 1 | 5 | 4 | 1 | 5.0 | 20 | 20 | 991 | 20 | ... | 0 | 5341.329968 | 6437.335772 | 7229.128072 | 2 | 18 | 1233 | 1221 | 14:19:44 | 69.502667 |
5 rows × 3087 columns
Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?
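The mechanics can be sketched on the five respondents displayed above. This only illustrates the calls; the youngest and oldest ages in the full file have to be read from the real data.

```python
import pandas as pd

# Stand-in for resp (assumption): caseid and age_r from the displayed rows.
resp = pd.DataFrame({
    "caseid": [2298, 5012, 11586, 6794, 616],
    "age_r": [27, 42, 43, 15, 20],
})

# Sorting the counts by index lists the ages in ascending order, so the
# youngest appears first and the oldest last.
age_counts = resp["age_r"].value_counts().sort_index()
print(age_counts)

youngest = age_counts.index.min()
oldest = age_counts.index.max()
```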
We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:
| | caseid | rscrinf | rdormres | rostscrn | rscreenhisp | rscreenrace | age_a | age_r | cmbirth | agescrn | ... | pubassis_i | basewgt | adj_mod_basewgt | finalwgt | secu_r | sest | cmintvw | cmlstyr | screentime | intvlngth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2298 | 1 | 5 | 5 | 1 | 5.0 | 27 | 27 | 902 | 27 | ... | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | 1234 | 1222 | 18:26:36 | 110.492667 |
1 rows × 3087 columns
And we can get the corresponding rows from preg like this:
| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2610 | 2298 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 6.8750 |
2611 | 2298 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 5.5000 |
2612 | 2298 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 4.1875 |
2613 | 2298 | 4 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 6.8750 |
4 rows × 244 columns
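Both selections can be written as boolean masks on the caseid column. The sketch below is an assumption about how the lookups might be done, using small stand-ins for resp and preg built from the rows displayed in this document.

```python
import pandas as pd

# Stand-in for resp (assumption): two columns from the displayed rows.
resp = pd.DataFrame({
    "caseid": [2298, 5012, 11586, 6794, 616],
    "age_r": [27, 42, 43, 15, 20],
})

# Stand-in for preg (assumption): displayed rows for caseid 1 and 2298,
# keeping the original row labels.
preg = pd.DataFrame(
    {
        "caseid": [1, 1, 2298, 2298, 2298, 2298],
        "pregordr": [1, 2, 1, 2, 3, 4],
        "totalwgt_lb": [8.8125, 7.8750, 6.8750, 5.5000, 4.1875, 6.8750],
    },
    index=[0, 1, 2610, 2611, 2612, 2613],
)

# A comparison yields a boolean Series; indexing with it keeps the
# rows where the mask is True.
resp_row = resp[resp.caseid == 2298]
preg_rows = preg[preg.caseid == 2298]

print(resp_row)
print(preg_rows)
```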
How old is the respondent with caseid 1?
What are the pregnancy lengths for the respondent with caseid 2298?
What was the birthweight of the first baby born to the respondent with caseid 5012?