Let's load the GSS dataset.
| | year | id_ | agewed | divorce | sibs | childs | age | educ | paeduc | maeduc | ... | memchurh | realinc | cohort | marcohrt | ballot | wtssall | adults | compuse | databank | wtssnr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1972 | 167 | 0 | 0 | 2 | 0 | 26.0 | 18.0 | 12 | 12 | ... | 0 | 13537.0 | 1946.0 | 0 | 0 | 0.8893 | 2.0 | 0 | 0 | 1.0 |
1 | 1972 | 1256 | 30 | 2 | 0 | 1 | 38.0 | 12.0 | 97 | 99 | ... | 0 | 18951.0 | 1934.0 | 1964 | 0 | 0.4446 | 1.0 | 0 | 0 | 1.0 |
2 | 1972 | 415 | 0 | 0 | 7 | 0 | 57.0 | 12.0 | 7 | 7 | ... | 0 | 30458.0 | 1915.0 | 0 | 0 | 1.3339 | 3.0 | 0 | 0 | 1.0 |
3 | 1972 | 234 | 18 | 1 | 6 | 3 | 61.0 | 14.0 | 8 | 5 | ... | 0 | 37226.0 | 1911.0 | 1929 | 0 | 0.8893 | 2.0 | 0 | 0 | 1.0 |
4 | 1972 | 554 | 22 | 2 | 3 | 3 | 59.0 | 12.0 | 6 | 11 | ... | 0 | 30458.0 | 1913.0 | 1935 | 0 | 0.8893 | 2.0 | 0 | 0 | 1.0 |
5 rows × 101 columns
The GSS interviews a few thousand respondents each year.
One of the questions they ask is "Do you think the use of marijuana should be made legal or not?"
The answer codes are:
Here is the distribution of responses for all years.
I'll replace "Don't know", "No answer", and "Not applicable" with NaN.
And replace 2, which represents "No", with 0. That way "Yes" is coded 1 and "No" is coded 0, so we can use mean to compute the fraction in favor.
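The recoding described above can be sketched on a small made-up Series. The numeric codes for the non-substantive answers (0, 8, 9) and the variable name grass are assumptions for illustration; check them against the GSS codebook before using them on the real data.

```python
import numpy as np
import pandas as pd

# Made-up responses for illustration only (not real GSS data).
# Assumed codes: 1 = legal, 2 = not legal,
# 0 / 8 / 9 = not applicable / don't know / no answer.
grass = pd.Series([1, 2, 2, 8, 1, 0, 2, 9, 2, 2])

# Treat the non-substantive codes as missing.
cleaned = grass.replace([0, 8, 9], np.nan)

# Recode "No" from 2 to 0 so that the mean is the fraction of 1s.
cleaned = cleaned.replace(2, 0)

# mean() skips NaN by default, so only substantive answers are counted.
fraction_in_favor = cleaned.mean()
print(cleaned.value_counts())
print(fraction_in_favor)
```

Because missing answers become NaN rather than 0, they drop out of both the numerator and the denominator of the mean.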
Here are the value counts after replacement.
And here's the mean.
So 30% of respondents thought marijuana should be legal, at the time they were interviewed.
Now we can see how that fraction depends on age, cohort (year of birth), and period (year of interview).
Group by year
First we'll group respondents by year.
The result is a DataFrameGroupBy object we can iterate through:
And we can compute summary statistics for each group.
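As a rough sketch of iterating through the groups and computing a statistic for each one, the example below uses a tiny made-up DataFrame. The column name grass and the values are assumptions for illustration, not the real GSS data.

```python
import pandas as pd

# Made-up data for illustration; "grass" is a hypothetical column
# holding 1 (in favor) or 0 (opposed) after recoding.
df = pd.DataFrame({
    "year":  [1990, 1990, 1990, 2000, 2000, 2010, 2010, 2010],
    "grass": [0, 0, 1, 0, 1, 1, 1, 0],
})

grouped = df.groupby("year")

# Each iteration yields the group name (here, the year) and the
# sub-DataFrame containing only the rows for that group.
means_by_loop = {}
for name, group in grouped:
    means_by_loop[name] = group["grass"].mean()
    print(name, len(group), means_by_loop[name])
```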
Using a for loop can be useful for debugging, but it is more concise, more idiomatic, and faster to apply operations directly to the DataFrameGroupBy object.
For example, if you select a column from a DataFrameGroupBy, the result is a SeriesGroupBy that represents one Series for each group.
You can loop through the SeriesGroupBy, but you normally don't.
Instead, you can apply a function to the SeriesGroupBy; the result is a new Series that maps from group names to the results from the function; in this case, it's the fraction of support for each interview year.
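A minimal sketch of that pattern, again on made-up data with a hypothetical grass column: selecting the column from the grouped object and calling mean produces one value per year without an explicit loop.

```python
import pandas as pd

# Made-up data for illustration; "grass" is a hypothetical column
# holding 1 (in favor) or 0 (opposed) after recoding.
df = pd.DataFrame({
    "year":  [1990, 1990, 1990, 2000, 2000, 2010, 2010, 2010],
    "grass": [0, 0, 1, 0, 1, 1, 1, 0],
})

# Selecting a column from the DataFrameGroupBy gives a SeriesGroupBy.
series_groupby = df.groupby("year")["grass"]

# Applying mean returns a Series indexed by group name (the year).
support_by_year = series_groupby.mean()
print(support_by_year)
```

The index of the resulting Series is the set of group names, so it can be passed directly to a plotting call or used for lookups by year.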
Overall support for legalization has been increasing since 1990.
Group by cohort
The variable cohort contains respondents' year of birth.
Pulling together the code from the previous section, we can plot support for legalization by year of birth.
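One way to pull the steps together is a small helper that groups by any column and returns the fraction in favor. The function name support_by, the column name grass, and the data below are all hypothetical and only meant as a sketch.

```python
import numpy as np
import pandas as pd

def support_by(df, column, response="grass"):
    """Return the fraction in favor for each value of `column`.

    Assumes `response` is coded 1 (in favor), 0 (opposed), NaN (missing);
    missing responses are skipped by mean().
    """
    return df.groupby(column)[response].mean()

# Made-up data for illustration only (not real GSS data).
df = pd.DataFrame({
    "cohort": [1930, 1930, 1950, 1950, 1950, 1970, 1970],
    "age":    [60, 70, 40, 50, 40, 20, 30],
    "grass":  [0, 0, 0, 1, np.nan, 1, 1],
})

by_cohort = support_by(df, "cohort")
by_age = support_by(df, "age")
print(by_cohort)
print(by_age)
```

Calling plot on either result (for example, by_cohort.plot()) draws the series against its index, assuming matplotlib is installed.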
Later generations are more likely to support legalization than earlier generations.
Group by age
Finally, let's see how support varies with age at time of interview.
Younger people are more likely to support legalization than older people.
In general, it is not easy to separate period, cohort, and age effects, but there are ways. We'll come back to this example to see how.