Kernel: Python 3

Examples and Exercises from Think Stats, 2nd Edition

MIT License: https://opensource.org/licenses/MIT

In [2]:

from __future__ import print_function, division

import nsfg

Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [4]:

preg = nsfg.ReadFemPreg()
preg.head()

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
0	1	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	8.8125
1	1	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	7.8750
2	2	1	NaN	NaN	NaN	NaN	5.0	NaN	3.0	5.0	...	7226.301740	8567.549110	12999.542264	2	12	NaN	9.1250
3	2	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	7.0000
4	2	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	6.1875

5 rows × 244 columns

Print the column names.

In [5]:

preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.

In [6]:

preg.columns[1]

'pregordr'

Select a column and check what type it is.

In [7]:

pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

Print a column.

In [8]:

pregordr

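0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5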
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [9]:

pregordr[0]

1

Select a slice from a column.

In [10]:

pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64
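
The two lookups above work because preg has the default integer index, so position and label coincide. As a minimal sketch on a hypothetical five-element Series (a stand-in, not the NSFG data), .iloc and .loc make the distinction explicit: .iloc slices by position and excludes the end, while .loc slices by label and includes it.

```python
import pandas as pd

# Hypothetical stand-in for the first five values of pregordr.
s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

first = s.loc[0]            # lookup by index label 0
by_position = s.iloc[2:5]   # positions 2, 3, 4 (end excluded)
by_label = s.loc[2:4]       # labels 2 through 4 (end included)

print(first)
print(by_position.tolist())
print(by_label.tolist())
```

On a DataFrame that has been filtered or re-indexed, the two forms can return different rows, so spelling out the intent is usually safer.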

Select a column using dot notation.

In [11]:

pregordr = preg.pregordr

Count the number of times each value occurs.

In [12]:

preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
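
By default value_counts orders its result by count, largest first; sort_index then reorders the result by value. The following is a plain-Python sketch of the same idea on a short, illustrative list of outcome codes (not the full column), using collections.Counter in place of pandas.

```python
from collections import Counter

# Illustrative outcome codes (a small sample, not the full column).
outcomes = [4, 4, 4, 4, 4, 4, 1]

# Tally each distinct value, then order the tallies by value,
# which mirrors what sort_index() does to the value_counts() result.
counts = sorted(Counter(outcomes).items())

for value, count in counts:
    print(value, count)
```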

Check the values of another variable.

In [13]:

preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.

In [14]:

caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1])
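
The source of nsfg.MakePregMap is not shown in this notebook, so the following is only a sketch of one way such a map could be built. The name make_preg_map is hypothetical, and the input is a plain list of caseid values rather than a DataFrame.

```python
from collections import defaultdict

def make_preg_map(caseids):
    """Map each caseid to the list of row positions where it appears."""
    preg_map = defaultdict(list)
    for index, caseid in enumerate(caseids):
        preg_map[caseid].append(index)
    return preg_map

# caseid values from the first five rows shown by preg.head().
caseids = [1, 1, 2, 2, 2]
preg_map = make_preg_map(caseids)

print(preg_map[1])
print(preg_map[2])
```

Building the map once costs a single pass over the rows, after which each respondent's pregnancies can be looked up without scanning the whole table again.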

Exercises

Select the birthord column, print the value counts, and compare to results published in the codebook.

In [15]:

# Solution

preg.birthord.value_counts().sort_index()

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64

We can also use isnull to count the number of NaN values.

In [16]:

preg.birthord.isnull().sum()

4445
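
A quick consistency check, sketched in plain Python with the numbers copied by hand from the outputs above: the non-null birthord counts plus the NaN count should add up to the number of rows in preg.

```python
# Counts of non-null birthord values, copied from the output above.
birthord_counts = [4413, 2874, 1234, 421, 126, 50, 20, 7, 2, 1]
null_count = 4445     # from preg.birthord.isnull().sum()
total_rows = 13593    # Length reported when pregordr was printed

non_null = sum(birthord_counts)

print(non_null)
print(non_null + null_count == total_rows)
```

The non-null total also equals the largest count in the outcome tally, which is consistent with birth order being recorded only for some outcomes; the codebook is the authority on which ones.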

Select the prglngth column, print the value counts, and compare to results published in the codebook.

In [17]:

# Solution

preg.prglngth.value_counts().sort_index()

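0       15
1        9
2       78
3      151
4      412
5      181
6      543
7      175
8      409
9      594
10     137
11     202
12     170
13     446
14      29
15      39
16      44
17     253
18      17
19      34
20      18
21      37
22     147
23      12
24      31
25      15
26     117
27       8
28      38
29      23
30     198
31      29
32     122
33      50
34      60
35     357
36     329
37     457
38     609
39    4744
40    1120
41     591
42     328
43     148
44      46
45      10
46       1
47       1
48       7
50       2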
Name: prglngth, dtype: int64
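
A codebook may report pregnancy length in ranges rather than week by week, so the comparison can require summing counts over each range. The sketch below assumes ranges of 0-13, 14-26, and 27-50 weeks (an assumption to check against the codebook) and uses counts re-entered by hand from the output above; count_between is a hypothetical helper.

```python
# prglngth value counts keyed by weeks, re-entered by hand for illustration.
counts = {
    0: 15, 1: 9, 2: 78, 3: 151, 4: 412, 5: 181, 6: 543, 7: 175,
    8: 409, 9: 594, 10: 137, 11: 202, 12: 170, 13: 446, 14: 29,
    15: 39, 16: 44, 17: 253, 18: 17, 19: 34, 20: 18, 21: 37,
    22: 147, 23: 12, 24: 31, 25: 15, 26: 117, 27: 8, 28: 38,
    29: 23, 30: 198, 31: 29, 32: 122, 33: 50, 34: 60, 35: 357,
    36: 329, 37: 457, 38: 609, 39: 4744, 40: 1120, 41: 591,
    42: 328, 43: 148, 44: 46, 45: 10, 46: 1, 47: 1, 48: 7, 50: 2,
}

def count_between(counts, low, high):
    """Sum the counts for all values in the inclusive range [low, high]."""
    return sum(n for weeks, n in counts.items() if low <= weeks <= high)

# Assumed codebook ranges, in weeks.
ranges = [(0, 13), (14, 26), (27, 50)]
totals = [count_between(counts, low, high) for low, high in ranges]

for (low, high), total in zip(ranges, totals):
    print(low, high, total)
```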

To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:

In [18]:

preg.totalwgt_lb.mean()

7.265628457623368

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [19]:

# Solution

preg['totalwgt_kg'] = preg.totalwgt_lb / 2.2
preg.totalwgt_kg.mean()

3.302558389828807
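
To see why dictionary syntax is needed, here is a minimal sketch on a hypothetical five-row frame holding the totalwgt_lb values from preg.head(). Bracket assignment adds a column, while dot assignment to a new name only sets a Python attribute on the object. The name not_a_column is made up for the demonstration.

```python
import warnings
import pandas as pd

# Hypothetical frame with the five totalwgt_lb values from preg.head().
df = pd.DataFrame({"totalwgt_lb": [8.8125, 7.875, 9.125, 7.0, 6.1875]})

# Bracket (dictionary-style) assignment adds a real column.
df["totalwgt_kg"] = df.totalwgt_lb / 2.2

# Dot assignment to a new name only sets an attribute; pandas warns
# about this, so the warning is silenced for the demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.not_a_column = df.totalwgt_lb / 2.2

print(list(df.columns))
print(df.totalwgt_kg.mean())
```

The divisor 2.2 is a rounded factor. One pound is defined as exactly 0.45359237 kg, so multiplying by that constant instead gives a slightly smaller mean.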

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:

In [20]:

resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:

In [21]:

resp.head()

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667
1	5012	1	5	1	5	5.0	42	42	718	42	...	2335.279149	2846.799490	4744.191350	2	18	1233	1221	16:30:59	64.294000
2	11586	1	5	1	5	5.0	43	43	708	43	...	2335.279149	2846.799490	4744.191350	2	18	1234	1222	18:19:09	75.149167
3	6794	5	5	4	1	5.0	15	15	1042	15	...	3783.152221	5071.464231	5923.977368	2	18	1234	1222	15:54:43	28.642833
4	616	1	5	4	1	5.0	20	20	991	20	...	5341.329968	6437.335772	7229.128072	2	18	1233	1221	14:19:44	69.502667

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?

In [22]:

# Solution

resp.age_r.value_counts().sort_index()

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:

In [23]:

resp[resp.caseid==2298]

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	pubassis_i	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	0	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667

1 rows × 3087 columns

And we can get the corresponding rows from preg like this:

In [24]:

preg[preg.caseid==2298]

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb	totalwgt_kg
2610	2298	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750	3.125000
2611	2298	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	5.5000	2.500000
2612	2298	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	4.1875	1.903409
2613	2298	4	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750	3.125000

4 rows × 245 columns
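
The selections above use a boolean mask: comparing a column to a value yields one True or False per row, and indexing with that mask keeps the True rows. Below is a sketch on hypothetical miniature versions of resp and preg (toy values, not the NSFG files), followed by merge as one possible way to match every pregnancy to its respondent in a single step.

```python
import pandas as pd

# Hypothetical miniature tables; only caseid links them.
resp = pd.DataFrame({"caseid": [2298, 5012], "age_r": [27, 42]})
preg = pd.DataFrame({
    "caseid": [2298, 2298, 5012],
    "prglngth": [40, 36, 39],   # toy values for illustration
})

mask = preg.caseid == 2298    # boolean Series, one entry per row
rows = preg[mask]             # keeps only the rows where mask is True

# Join each pregnancy row to the matching respondent row on caseid.
merged = preg.merge(resp, on="caseid")

print(mask.tolist())
print(rows.prglngth.tolist())
print(merged.age_r.tolist())
```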

How old is the respondent with caseid 1?

In [25]:

# Solution

resp[resp.caseid==1].age_r

1069    44
Name: age_r, dtype: int64

What are the pregnancy lengths for the respondent with caseid 2298?

In [26]:

# Solution

preg[preg.caseid==2298].prglngth

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with caseid 5012?

In [27]:

# Solution

preg[preg.caseid==5012].birthwgt_lb

5515    6.0
Name: birthwgt_lb, dtype: float64
