Simple regression
An important thing to remember about regression is that it is not symmetric; that is, the regression of A onto B is not the same as the regression of B onto A.
To demonstrate, I'll load data from the BRFSS.
A few people report many vegetable servings per day. To simplify the visualization, I'm going to replace values greater than 8 with 8.
We can use SciPy to compute servings of vegetables as a function of income class.
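The code cell is not shown, but a minimal sketch of this step might use scipy.stats.linregress on synthetic data standing in for the BRFSS (the variable names and the generating slope of 0.07 are assumptions for illustration):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in for the BRFSS: income class (1-8) and
# vegetable servings per day; the generating slope 0.07 is an assumption
rng = np.random.default_rng(0)
income = rng.integers(1, 9, size=1000)
vegesu = 0.07 * income + rng.normal(2, 1, size=1000)

# Regress vegetable servings onto income class
res = linregress(income, vegesu)
print(res.slope, res.intercept, res.rvalue)
```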
Increasing income class by 1 is associated with an increase of 0.07 vegetables per day.
So if we hypothesize that people with higher incomes eat more vegetables, this result would not get us too excited.
We can see what the regression looks like by plotting the line of best fit on top of the scatter plot.
Now let's do it the other way around, regressing income as a function of vegetable servings.
Again, we can plot the line of best fit on top of the scatter plot.
The slope looks more impressive now. Each additional serving corresponds to 0.24 income codes, and each income code is several thousand dollars. So a result that seemed unimpressive in one direction seems more intriguing in the other direction.

But the primary point here is that regression is not symmetric. To see it more clearly, I'll plot both regression lines on top of the scatter plot.
The green line is income as a function of vegetables; the orange line is vegetables as a function of income.
And here's the same thing the other way around.
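We can also confirm the asymmetry numerically. The sketch below uses synthetic data; the key identity is that the slope of y on x is r·sy/sx and the slope of x on y is r·sx/sy, so the two slopes multiply to r² and coincide only when the variables have equal variance:

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data standing in for the BRFSS columns
rng = np.random.default_rng(1)
income = rng.integers(1, 9, size=1000).astype(float)
vegesu = 0.07 * income + rng.normal(2, 1, size=1000)

a = linregress(income, vegesu)  # vegetables as a function of income
b = linregress(vegesu, income)  # income as a function of vegetables

# The product of the two slopes is r**2
print(a.slope, b.slope, a.rvalue**2)
```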
StatsModels
So far we have used scipy.stats.linregress to run simple regression. Sadly, that function doesn't do multiple regression, so we have to switch to a new library, StatsModels.
Here's the same example from the previous section, using StatsModels.
The result is an OLS object, which we have to fit:
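As a sketch of these two steps on synthetic stand-in data (the column names income and vegesu are hypothetical), creating and fitting the OLS object looks like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the BRFSS; column names are hypothetical
rng = np.random.default_rng(2)
df = pd.DataFrame({'income': rng.integers(1, 9, size=1000).astype(float)})
df['vegesu'] = 0.07 * df['income'] + rng.normal(2, 1, size=1000)

model = smf.ols('vegesu ~ income', data=df)  # an OLS object
results = model.fit()                        # fitting produces the results
print(results.params)
```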
results contains a lot of information about the regression, which we can view using summary.
One of the parts we're interested in is params, which is a Pandas Series containing the estimated parameters.
And rsquared contains the coefficient of determination, R², which is pretty small in this case.
We can confirm that R² = r², the square of the coefficient of correlation:
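A sketch of that check on synthetic data: in simple regression, the rsquared attribute of the results should equal the squared correlation between the two variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with hypothetical column names
rng = np.random.default_rng(3)
df = pd.DataFrame({'income': rng.integers(1, 9, size=500).astype(float)})
df['vegesu'] = 0.07 * df['income'] + rng.normal(2, 1, size=500)

results = smf.ols('vegesu ~ income', data=df).fit()
r = df['income'].corr(df['vegesu'])

# In simple regression, R-squared equals the squared correlation
print(results.rsquared, r**2)
```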
Exercise: Run this regression in the other direction and confirm that you get the same estimated slope we got from linregress. Also confirm that R² is the same in either direction (which we know because correlation is the same in either direction).
Multiple regression
For experiments with multiple regression, let's load the GSS data again.
Let's explore the relationship between income and education, starting with simple regression:
It looks like people with more education have higher incomes, about $3586 per additional year of education.
Now that we are using StatsModels, it is easy to add explanatory variables. For example, we can add age to the model like this.
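A sketch of this step on synthetic stand-in data (the generating coefficients are assumptions, chosen only so the example runs):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the GSS; generating coefficients are assumptions
rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame({'educ': rng.integers(8, 21, size=n).astype(float),
                   'age': rng.integers(18, 90, size=n).astype(float)})
df['realinc'] = 3586 * df['educ'] + 80 * df['age'] + rng.normal(0, 20000, size=n)

# Adding an explanatory variable is just another term in the formula
results = smf.ols('realinc ~ educ + age', data=df).fit()
print(results.params)
```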
It looks like the effect of age is small, and adding it to the model has only a small effect on the estimated parameter for education.
But it's possible we are getting fooled by a nonlinear relationship. To see what the age effect looks like, I'll group by age and plot the mean income in each age group.
Yeah, that looks like a nonlinear effect.
We can model it by adding a quadratic term to the model.
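One way to add the quadratic term, sketched here on synthetic data with a built-in peak in middle age, is to precompute a squared column and include it in the formula:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic income with a peak in middle age, to mimic the curvature;
# the generating coefficients are assumptions
rng = np.random.default_rng(5)
n = 1000
df = pd.DataFrame({'age': rng.integers(18, 90, size=n).astype(float),
                   'educ': rng.integers(8, 21, size=n).astype(float)})
df['realinc'] = (3000 * df['educ'] + 2000 * df['age'] - 20 * df['age']**2
                 + rng.normal(0, 15000, size=n))

# Precompute the squared term and add it to the formula
df['age2'] = df['age']**2
results = smf.ols('realinc ~ educ + age + age2', data=df).fit()
print(results.params)
```

StatsModels formulas also accept a term like I(age**2) directly, which avoids the extra column.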
Now the coefficient associated with age is substantially larger. And the coefficient of the quadratic term is negative, which is consistent with the observation that the relationship has downward curvature.
Exercise: To see what the relationship between income and education looks like, group the dataset by educ and plot mean income at each education level.
Exercise: Maybe the relationship with education is nonlinear, too. Add a quadratic term for educ to the model and summarize the results.
Making predictions
The parameters of a non-linear model can be hard to interpret, but maybe we don't have to. Sometimes it is easier to judge a model by its predictions rather than its parameters.
The results object provides a predict method that takes a DataFrame and uses the model to generate a prediction for each row. Here's how we can create the DataFrame:
age contains equally-spaced points from 18 to 85, and age2 contains those values squared.
Now we can set educ to 12 years of education and generate predictions:
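A sketch of the prediction step, fitting the quadratic model on synthetic stand-in data first so the example is self-contained:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fit the quadratic model on synthetic stand-in data
# (generating coefficients are assumptions)
rng = np.random.default_rng(6)
n = 1000
data = pd.DataFrame({'age': rng.integers(18, 86, size=n).astype(float),
                     'educ': rng.integers(8, 21, size=n).astype(float)})
data['age2'] = data['age']**2
data['realinc'] = (3000 * data['educ'] + 2000 * data['age']
                   - 20 * data['age2'] + rng.normal(0, 15000, size=n))
results = smf.ols('realinc ~ educ + age + age2', data=data).fit()

# Build the DataFrame of hypothetical respondents
df = pd.DataFrame()
df['age'] = np.linspace(18, 85)   # 50 equally spaced ages
df['age2'] = df['age']**2
df['educ'] = 12                   # hold education constant

pred = results.predict(df)        # one prediction per row
```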
This plot shows the structure of the model, which is a parabola. We also plot the data as an average in each age group.
Exercise: Generate the same plot, but show predictions for three levels of education: 12, 14, and 16 years.
Adding categorical variables
In a formula string, we can use C() to indicate that a variable should be treated as categorical. For example, the following model contains sex as a categorical variable.
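A sketch on synthetic stand-in data, with the roughly $4150 gap built into the generating process as an assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the GSS; the income gap is built in as an assumption
rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({'age': rng.integers(18, 90, size=n).astype(float),
                   'educ': rng.integers(8, 21, size=n).astype(float),
                   'sex': rng.integers(1, 3, size=n)})   # 1=male, 2=female
df['age2'] = df['age']**2
df['realinc'] = (3000 * df['educ'] + 2000 * df['age'] - 20 * df['age2']
                 - 4150 * (df['sex'] == 2) + rng.normal(0, 15000, size=n))

# C(sex) tells StatsModels to treat sex as categorical
results = smf.ols('realinc ~ educ + age + age2 + C(sex)', data=df).fit()
print(results.params['C(sex)[T.2]'])
```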
The estimated parameter indicates that sex=2, which indicates women, is associated with about $4150 lower income, after controlling for age and education.
Exercise: Use groupby to group respondents by educ, then plot mean realinc for each education level.
Exercise: Make a DataFrame with a range of values for educ and constant age=30. Compute age2 and educ2 accordingly.
Use this DataFrame to generate predictions for each level of education, holding age constant. Generate and plot separate predictions for men and women.
Also plot the data for comparison.
Logistic regression
Let's use logistic regression to see what factors are associated with support for gun control. The variable we'll use is gunlaw, which represents the response to this question: "Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?"
Here are the values.
1 means yes, 2 means no, 0 means the question wasn't asked; 8 and 9 mean the respondent doesn't know or refused to answer.
First I'll replace 0, 8, and 9 with NaN.
In order to put gunlaw on the left side of a regression, we have to recode it so 0 means no and 1 means yes.
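A sketch of the cleaning and recoding steps, using a small hypothetical series in place of the GSS column:

```python
import numpy as np
import pandas as pd

# Small hypothetical series standing in for the GSS gunlaw column
gunlaw = pd.Series([1, 2, 0, 1, 8, 2, 1, 9, 1, 2])

# Replace "not asked", "don't know", and "refused" with NaN
gunlaw = gunlaw.replace([0, 8, 9], np.nan)

# Recode so 1 means yes and 0 means no
gunlaw = gunlaw.replace(2, 0)
print(gunlaw.value_counts())
```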
Here's what it looks like after recoding.
Now we can run a logistic regression model.
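A sketch with smf.logit on synthetic stand-in data (the generating coefficients, including the higher support among women, are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the generating coefficients are assumptions
rng = np.random.default_rng(8)
n = 2000
df = pd.DataFrame({'age': rng.integers(18, 90, size=n).astype(float),
                   'educ': rng.integers(8, 21, size=n).astype(float),
                   'sex': rng.integers(1, 3, size=n)})
df['age2'] = df['age']**2
# Women (sex=2) are generated as more likely to support gun control
logodds = -0.5 + 0.6 * (df['sex'] == 2) + 0.02 * (df['educ'] - 13)
p = 1 / (1 + np.exp(-logodds))
df['gunlaw'] = (rng.random(n) < p).astype(int)

# smf.logit requires the variable on the left side to be 0/1
results = smf.logit('gunlaw ~ age + age2 + educ + C(sex)', data=df).fit(disp=False)
print(results.params)
```

The predictions from a logistic model are probabilities, so results.predict always returns values between 0 and 1.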
Here are the results.
Here are the parameters. The coefficient of sex=2 is positive, which indicates that women are more likely to support gun control, at least for this question.
The other parameters are not easy to interpret, but again we can use the regression results to generate predictions, which makes it possible to visualize the model.
I'll make a DataFrame with a range of ages and a fixed level of education, and generate predictions for men and women.
Over the range of ages, women are more likely to support gun control than men, by about 15 percentage points.
Exercise: Generate a similar plot as a function of education, with constant age=40.
Exercise: Use the variable grass to explore support for legalizing marijuana. This variable records the response to this question: "Do you think the use of marijuana should be made legal or not?"
Recode grass for use with logistic regression. Run a regression model with age, education, and sex as explanatory variables.
Use the model to generate predictions for a range of ages, with education held constant, and plot the predictions for men and women. Also plot the mean level of support in each age group.
Use the model to generate predictions for a range of education levels, with age held constant, and plot the predictions for men and women. Also plot the mean level of support at each education level.
Note: This last graph might not look like a parabola. Why not?