
Think Stats by Allen B. Downey

Think Stats is an introduction to probability and statistics for Python programmers.

This is the accompanying code for this book.

Website: http://greenteapress.com/wp/think-stats-2e/

Project: Support and Testing

Path: think-stats-code / chap01ex.ipynb

License: GPL3

Kernel: Python 3

Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

MIT License: https://opensource.org/licenses/MIT

In [1]:

from __future__ import print_function, division

import nsfg

Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [2]:

preg = nsfg.ReadFemPreg()
preg.head()

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
0	1	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	8.8125
1	1	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	7.8750
2	2	1	NaN	NaN	NaN	NaN	5.0	NaN	3.0	5.0	...	7226.301740	8567.549110	12999.542264	2	12	NaN	9.1250
3	2	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	7.0000
4	2	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	6.1875

5 rows × 244 columns

Print the column names.

In [3]:

preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.

In [4]:

preg.columns[1]

'pregordr'

Select a column and check what type it is.

In [5]:

pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

Print a column.

In [6]:

pregordr

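0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5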
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [7]:

pregordr[0]

1

Select a slice from a column.

In [8]:

pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64
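The bracket lookups above rely on preg having the default integer index, so that label 0 and position 0 coincide. As a minimal sketch on a made-up Series (the values and index labels are hypothetical, not NSFG data), .iloc selects by position and .loc selects by label:

```python
import pandas as pd

# Hypothetical Series whose index labels differ from row positions.
s = pd.Series([10, 20, 30, 40, 50], index=[100, 101, 102, 103, 104])

first_by_position = s.iloc[0]             # element at position 0
first_by_label = s.loc[100]               # element with label 100

slice_by_position = s.iloc[2:5].tolist()  # positions 2, 3, 4 (end excluded)
slice_by_label = s.loc[102:104].tolist()  # labels 102 through 104 (end included)
```

Spelling out .iloc or .loc avoids ambiguity when a Series does not use the default index.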

Select a column using dot notation.

In [9]:

pregordr = preg.pregordr

Count the number of times each value occurs.

In [10]:

preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
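To see why sort_index is chained after value_counts, here is a small sketch on made-up outcome codes (not NSFG data). By default value_counts orders the result by descending count; sort_index reorders it by the value being counted, which is easier to compare against a codebook table.

```python
import pandas as pd

# Made-up outcome codes, for illustration only.
outcomes = pd.Series([1, 4, 1, 2, 1, 4])

by_frequency = outcomes.value_counts()         # most common value first
by_code = outcomes.value_counts().sort_index() # ordered by code: 1, 2, 4
```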

Check the values of another variable.

In [11]:

preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.

In [12]:

caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1])
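One way such a map could be built is with a defaultdict of lists. The sketch below is an illustration under that assumption, not the code in nsfg.py; make_index_map is a hypothetical name, and the input is the caseid column from the five rows shown by preg.head() above.

```python
from collections import defaultdict


def make_index_map(caseids):
    """Map each caseid to the list of row positions where it appears."""
    index_map = defaultdict(list)
    for row, caseid in enumerate(caseids):
        index_map[caseid].append(row)
    return index_map


# caseid values from the first five rows of preg.
caseids = [1, 1, 2, 2, 2]
index_map = make_index_map(caseids)
```

Here index_map[2] holds the positions of the three pregnancies reported by respondent 2, which can then be used to index into other columns.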

Exercises

Select the birthord column, print the value counts, and compare to results published in the codebook.

In [13]:

# Solution goes here

We can also use isnull to count the number of NaNs.

In [14]:

preg.birthord.isnull().sum()

4445
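The NaN count matters because value_counts leaves missing values out by default. A minimal sketch on made-up values (not NSFG data) shows that the counted values plus the NaN count add up to the length of the Series:

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries.
values = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

counts = values.value_counts().sort_index()        # NaN excluded by default
n_missing = values.isnull().sum()                  # number of NaN entries
counts_with_nan = values.value_counts(dropna=False)  # NaN counted as its own row
```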

Select the prglngth column, print the value counts, and compare to results published in the codebook.

In [15]:

# Solution goes here

To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:

In [16]:

preg.totalwgt_lb.mean()

7.265628457623368

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [17]:

# Solution goes here
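As a sketch of the dictionary-syntax rule on a toy DataFrame (the column names weight_lb and weight_kg are hypothetical; the three weights are the first totalwgt_lb values shown by preg.head() above), a new column is created by assigning with brackets. Once it exists, it can be read with either brackets or dot notation.

```python
import pandas as pd

# Toy DataFrame; column names are hypothetical.
df = pd.DataFrame({'weight_lb': [8.8125, 7.875, 9.125]})

KG_PER_LB = 0.45359237  # one pound in kilograms, exact by definition

# Bracket (dictionary) syntax creates the new column.
df['weight_kg'] = df['weight_lb'] * KG_PER_LB

# Reading an existing column works with dot notation too.
mean_kg = df.weight_kg.mean()
```

Assigning through dot notation instead would set a plain Python attribute on the object rather than add a column.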

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:

In [18]:

resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:

In [19]:

resp.head()

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667
1	5012	1	5	1	5	5.0	42	42	718	42	...	2335.279149	2846.799490	4744.191350	2	18	1233	1221	16:30:59	64.294000
2	11586	1	5	1	5	5.0	43	43	708	43	...	2335.279149	2846.799490	4744.191350	2	18	1234	1222	18:19:09	75.149167
3	6794	5	5	4	1	5.0	15	15	1042	15	...	3783.152221	5071.464231	5923.977368	2	18	1234	1222	15:54:43	28.642833
4	616	1	5	4	1	5.0	20	20	991	20	...	5341.329968	6437.335772	7229.128072	2	18	1233	1221	14:19:44	69.502667

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?

In [20]:

# Solution goes here

We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:

In [21]:

resp[resp.caseid==2298]

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	pubassis_i	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	0	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667

1 rows × 3087 columns

And we can get the corresponding rows from preg like this:

In [22]:

preg[preg.caseid==2298]

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
2610	2298	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750
2611	2298	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	5.5000
2612	2298	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	4.1875
2613	2298	4	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750

4 rows × 244 columns
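Both selections use a boolean mask: the comparison produces a Series of True/False values, one per row, and indexing the DataFrame with it keeps only the True rows. A small sketch on a toy DataFrame (preg_toy is a hypothetical name; the caseid and pregordr values are taken from the outputs above):

```python
import pandas as pd

# Toy stand-in for preg with two respondents.
preg_toy = pd.DataFrame({
    'caseid': [1, 1, 2298, 2298, 2298, 2298],
    'pregordr': [1, 2, 1, 2, 3, 4],
})

mask = preg_toy.caseid == 2298   # boolean Series, one entry per row
matches = preg_toy[mask]         # rows where the mask is True
```

The selected rows keep their original index labels, which is why the preg output above starts at 2610 rather than 0.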

How old is the respondent with caseid 1?

In [23]:

# Solution goes here

What are the pregnancy lengths for the respondent with caseid 2298?

In [24]:

# Solution goes here

What was the birthweight of the first baby born to the respondent with caseid 5012?

In [25]:

# Solution goes here
