GroupBy examples

Allen Downey

In [1]:

%matplotlib inline

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

from thinkstats2 import Pmf, Cdf

import thinkstats2
import thinkplot

decorate = thinkplot.config

Let's load the GSS dataset.

In [2]:

%time gss = pd.read_hdf('../homeworks/gss.hdf5', 'gss')
gss.head()

CPU times: user 164 ms, sys: 40.2 ms, total: 204 ms
Wall time: 203 ms

	year	id_	agewed	divorce	sibs	childs	age	educ	paeduc	maeduc	...	realinc	cohort	marcohrt	wtssall	adults	wtssnr
0	1972	167	0	0	2	0	26.0	18.0	12	12	...	13537.0	1946.0	0	0.8893	2.0	1.0
1	1972	1256	30	2	0	1	38.0	12.0	97	99	...	18951.0	1934.0	1964	0.4446	1.0	1.0
2	1972	415	0	0	7	0	57.0	12.0	7	7	...	30458.0	1915.0	0	1.3339	3.0	1.0
3	1972	234	18	1	6	3	61.0	14.0	8	5	...	37226.0	1911.0	1929	0.8893	2.0	1.0
4	1972	554	22	2	3	3	59.0	12.0	6	11	...	30458.0	1913.0	1935	0.8893	2.0	1.0

5 rows × 101 columns

In [3]:

def counts(series):
    return series.value_counts(sort=False).sort_index()
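
As a minimal sketch of what `counts` returns, here is the same helper applied to a small made-up Series. The values in `toy` are illustrative and do not come from the GSS.

```python
import pandas as pd

def counts(series):
    # Tally each distinct value, then order the tallies by value
    return series.value_counts(sort=False).sort_index()

# Hypothetical toy data, not taken from the GSS
toy = pd.Series([1974, 1972, 1974, 1973, 1972, 1974])
result = counts(toy)
print(result)
```

The index of the result holds the distinct values in ascending order, and the values are the number of times each one appears.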

The GSS interviews a few thousand respondents each year.

In [4]:

counts(gss['year'])

  1613
  1504
  1484
  1490
  1499
  1530
  1532
  1468
  1860
  1599
  1473
  1534
  1470
  1819
  1481
  1537
  1372
  1517
  1606
  2992
  2904
  2832
  2817
  2765
  2812
  4510
  2023
  2044
  1974
  2538
  2867
Name: year, dtype: int64

One of the questions they ask is "Do you think the use of marijuana should be made legal or not?"

The answer codes are:

1	Legal
2	Not legal
8	Don't know
9	No answer
0	Not applicable

Here is the distribution of responses for all years.

In [5]:

counts(gss['grass'])

0    24398
1    11027
2    25195
8     1733
9      113
Name: grass, dtype: int64

I'll replace "Don't know", "No answer", and "Not applicable" with NaN.

In [6]:

gss['grass'] = gss['grass'].replace([0, 8, 9], np.nan)

And replace 2, which represents "Not legal", with 0. That way the column contains only 0s and 1s, and we can use mean to compute the fraction in favor.

In [7]:

gss['grass'] = gss['grass'].replace(2, 0)

Here are the value counts after replacement.

In [8]:

counts(gss['grass'])

0.0    25195
1.0    11027
Name: grass, dtype: int64

And here's the mean.

In [9]:

gss['grass'].mean()

0.3044282480260615

So about 30% of the respondents who gave a definite answer thought marijuana should be legal, at the time they were interviewed.
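
To see why the mean of the recoded column is the fraction in favor, here is a small sketch using made-up responses with the same answer codes. The data in `toy` are hypothetical; only the recoding steps mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical responses using the answer codes listed above
toy = pd.Series([1, 2, 2, 8, 1, 0, 2, 9])

# Treat "Don't know", "No answer", and "Not applicable" as missing
recoded = toy.replace([0, 8, 9], np.nan)
# Recode "Not legal" (2) as 0 so the column is 0/1
recoded = recoded.replace(2, 0)

# mean() skips NaN, so this is 2 in favor out of 5 valid answers
fraction = recoded.mean()
print(fraction)
```

Because `mean` ignores NaN by default, the denominator is the number of valid answers, not the total number of respondents.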

Now we can see how that fraction depends on age, cohort (year of birth), and period (year of interview).

Group by year

First we'll group respondents by year.

In [10]:

grouped = gss.groupby('year')
grouped

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f49bdd8ac18>

The result is a DataFrameGroupBy object we can iterate through:

In [11]:

for name, group in grouped:
    print(name, len(group))
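
Here is a self-contained sketch of the same loop on a tiny, made-up DataFrame, so you can see what `name` and `group` contain. The column names match the GSS, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the GSS
toy = pd.DataFrame({
    'year':  [1973, 1973, 1975, 1975, 1975],
    'grass': [1.0, 0.0, 0.0, 1.0, 1.0],
})

sizes = {}
for name, group in toy.groupby('year'):
    # name is the group key; group is a DataFrame with the matching rows
    sizes[name] = len(group)

print(sizes)
```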

And we can compute summary statistics for each group.

In [12]:

for name, group in grouped:
    print(name, group['grass'].mean())

nan
0.20136518771331058
nan
0.22569198012775019
0.29395604395604397
nan
0.3056501021102791
0.2585844428871759
nan
0.23943661971830985
0.2147887323943662
nan
0.17466945024356298
0.15506508205998867
0.1705170517051705
0.17492416582406473
0.15940366972477063
0.17775467775467776
0.24342745861733203
0.24722075172048702
0.28713910761154854
0.2966589861751152
0.3395810363836825
0.3403755868544601
0.35785536159601
0.3435155412647374
0.4001597444089457
0.4773988897700238
0.4835341365461847
0.5790147152911068
0.5911039657020365

Using a for loop can be useful for debugging, but it is more concise, more idiomatic, and faster to apply operations directly to the DataFrameGroupBy object.

For example, if you select a column from a DataFrameGroupBy, the result is a SeriesGroupBy that represents one Series for each group.

In [13]:

grouped['grass']

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f49bdd874e0>

You can loop through the SeriesGroupBy, but you normally don't.

In [14]:

for name, series in grouped['grass']:
    print(name, series.mean())

nan
0.20136518771331058
nan
0.22569198012775019
0.29395604395604397
nan
0.3056501021102791
0.2585844428871759
nan
0.23943661971830985
0.2147887323943662
nan
0.17466945024356298
0.15506508205998867
0.1705170517051705
0.17492416582406473
0.15940366972477063
0.17775467775467776
0.24342745861733203
0.24722075172048702
0.28713910761154854
0.2966589861751152
0.3395810363836825
0.3403755868544601
0.35785536159601
0.3435155412647374
0.4001597444089457
0.4773988897700238
0.4835341365461847
0.5790147152911068
0.5911039657020365

Instead, you can apply a function to the SeriesGroupBy. The result is a new Series that maps from group names to the results of the function; in this case, it's the fraction in favor for each interview year.

In [15]:

series = grouped['grass'].mean()
series

year
       NaN
  0.201365
       NaN
  0.225692
  0.293956
       NaN
  0.305650
  0.258584
       NaN
  0.239437
  0.214789
       NaN
  0.174669
  0.155065
  0.170517
  0.174924
  0.159404
  0.177755
  0.243427
  0.247221
  0.287139
  0.296659
  0.339581
  0.340376
  0.357855
  0.343516
  0.400160
  0.477399
  0.483534
  0.579015
  0.591104
Name: grass, dtype: float64
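
As a quick check that the loop and the direct aggregation agree, here is a sketch on made-up data. The rows in `toy` are hypothetical; one year has no valid responses, to show where a NaN like the ones above can come from.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every 1972 response is missing
toy = pd.DataFrame({
    'year':  [1972, 1972, 1973, 1973, 1973],
    'grass': [np.nan, np.nan, 1.0, 0.0, 0.0],
})
grouped = toy.groupby('year')

# Explicit loop, as in the earlier cells
looped = {name: group['grass'].mean() for name, group in grouped}

# Direct aggregation on the SeriesGroupBy
direct = grouped['grass'].mean()

# A year in which every response is missing yields NaN either way
print(direct)
```

Both versions compute the same numbers; the direct version returns them as a Series indexed by group name, which is ready to plot.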

Overall support for legalization has been increasing since 1990.

In [16]:

series.plot(color='C0')
decorate(xlabel='Year of interview', 
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')

Group by cohort

The variable cohort contains respondents' year of birth.

In [17]:

counts(gss['cohort'])

0      2
0      3
0      2
0      4
0     10
0      5
0     14
0     19
0     25
0     20
0     29
0     47
0     43
0     41
0     52
0     62
0     93
0    120
0    104
0    111
0    128
0    138
0    169
0    171
0    236
0    191
0    257
0    236
0    255
0    326
         ... 
0    813
0    809
0    691
0    649
0    671
0    641
0    570
0    542
0    536
0    483
0    590
0    476
0    467
0    413
0    342
0    335
0    371
0    278
0    306
0    205
0    227
0    185
0    188
0    107
0    116
0    116
0     89
0     50
0     53
0      6
Name: cohort, Length: 116, dtype: int64

Pulling together the code from the previous section, we can plot support for legalization by year of birth.

In [18]:

grouped = gss.groupby('cohort')
series = grouped['grass'].mean()
series.plot(color='C1')
decorate(xlabel='Year of birth', 
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')

Later generations are more likely to support legalization than earlier generations.
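
The earliest and latest cohorts in the counts above contain only a handful of respondents, so their means can be noisy. One way to keep an eye on that, sketched here on made-up data, is to compute the group size alongside the mean with `agg`. The rows in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the GSS
toy = pd.DataFrame({
    'cohort': [1890, 1890, 1950, 1950, 1950, 1950],
    'grass':  [0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
})

# One row per cohort, with the fraction in favor and the number of valid answers
summary = toy.groupby('cohort')['grass'].agg(['mean', 'count'])
print(summary)
```

Note that `count` excludes NaN, so it is the number of valid answers in each group, which is the denominator of the mean.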

Group by age

Finally, let's see how support varies with age at time of interview.

In [19]:

grouped = gss.groupby('age')
series = grouped['grass'].mean()
series.plot(color='C2')
decorate(xlabel='Age at interview', 
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')

Younger people are more likely to support legalization than older people.

In general, it is not easy to separate period, cohort, and age effects, but there are ways. We'll come back to this example to see how.
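
As one possible first step, and not necessarily the approach used when we return to this example, you can group by two variables at once and look at the result as a table. This is only a sketch on made-up data; the rows in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the GSS
toy = pd.DataFrame({
    'year':   [1990, 1990, 1990, 2010, 2010, 2010],
    'cohort': [1940, 1940, 1970, 1940, 1970, 1970],
    'grass':  [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

# Mean support for each (cohort, year) pair, reshaped so rows are cohorts
# and columns are interview years
table = toy.groupby(['cohort', 'year'])['grass'].mean().unstack()
print(table)
```

Reading across a row follows one cohort over time; reading down a column compares cohorts in the same interview year.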
