Lab 9: Regression
In the previous lab, you learned to make scatterplots and compute correlation coefficients. This supplement will show you how to fit lines to data and estimate confidence intervals for a regression.
In this lab, as in Lab 8 on Correlation, you will study the correlation between head size (cubic centimeters) and brain weight (g) for a group of women.
As usual, import pandas, NumPy, and Seaborn. For regression, we'll also import sub-libraries to plot lines and to calculate regression lines and correlation coefficients.
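A typical set of imports for this lab might look like the following (the aliases are conventions, not requirements):

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt   # for plotting lines on top of scatterplots
from scipy import stats           # for linregress and correlation coefficients
```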
Using pandas, import the file `brainhead.csv` and view the resulting data frame.
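Assuming the file is in your working directory, one way to do this is:

```python
df = pd.read_csv("brainhead.csv")
df   # displays the data frame in a Jupyter cell
```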
| | Head Size | Brain Weight |
|---|---|---|
| 0 | 2857 | 1027 |
| 1 | 3436 | 1235 |
| 2 | 3791 | 1260 |
| 3 | 3302 | 1165 |
| 4 | 3104 | 1080 |
| ... | ... | ... |
| 98 | 3214 | 1110 |
| 99 | 3394 | 1215 |
| 100 | 3233 | 1104 |
| 101 | 3352 | 1170 |
| 102 | 3391 | 1120 |

103 rows × 2 columns
Make a scatterplot of your data. Don't plot a regression line. Note: the general syntax for making a scatterplot from a pandas data frame `df` without fitting a line to it (or showing histograms of the variables) is `sns.lmplot(x="xvar", y="yvar", data=df, fit_reg=False)`.
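For example, with the column names shown above, a sketch for this dataset might be:

```python
# scatterplot only; fit_reg=False suppresses the regression line
sns.lmplot(x="Head Size", y="Brain Weight", data=df, fit_reg=False)
```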
Compute the appropriate correlation coefficient and explain why you chose to use that correlation coefficient. In regression, we use the term r² for the fraction of the variance in y that is explained by the variance in x; it is the square of the Pearson correlation coefficient r. Interpret your r² value. NOTE: Because this time we imported the whole `scipy.stats` library, your syntax will have to be slightly different from last time.
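A minimal sketch with the full library imported, assuming the column names above:

```python
# pearsonr returns the correlation coefficient and a p-value
r, p = stats.pearsonr(df["Head Size"], df["Brain Weight"])
print(r, r**2)   # r, and the fraction of variance explained (r squared)
```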
There are several Python functions that can perform linear regression. We will use one of the simplest, `linregress` from the `scipy.stats` library. The basic syntax is `reg = stats.linregress(xcolumn, ycolumn)`. You can then use `reg.slope` and `reg.intercept` to get the slope and intercept of your regression line.
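A minimal sketch, assuming the data frame and column names from above:

```python
reg = stats.linregress(df["Head Size"], df["Brain Weight"])
print(reg.slope, reg.intercept)
```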
Run a linear regression on your data and obtain the slope and y-intercept.
Write the equation for your line and briefly explain what it means.
We would now like to plot this regression line. We can do this by picking a range of x values, computing the corresponding y values, and plotting the result.
The NumPy command `linspace` generates an array with a specified number of evenly spaced values between your min and max values. (The rather odd name comes from the widely used program MATLAB, which NumPy and matplotlib often emulate.) Because you can do arithmetic directly on NumPy arrays, it is then simple to compute your y values. For example, the following code computes points on the line y = 5x + 10 for 100 x values between 0 and 7.
X_plot = np.linspace(0, 7, 100)
Y_plot = 5*X_plot + 10
The matplotlib command `plot` from the sublibrary `pyplot` (which we've called `plt`) can then plot `X_plot` and `Y_plot` on top of your scatterplot. Just run it in the same cell as the scatterplot code.
Using appropriate max and min values, compute 100 points on your line. View the result.
Copy your scatterplot code. Then, overlay a plot of your regression line on top of it.
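A sketch of what these two steps might look like, assuming the `reg` object from the earlier regression:

```python
sns.lmplot(x="Head Size", y="Brain Weight", data=df, fit_reg=False)
# span the observed range of head sizes with 100 points
X_plot = np.linspace(df["Head Size"].min(), df["Head Size"].max(), 100)
Y_plot = reg.slope * X_plot + reg.intercept
plt.plot(X_plot, Y_plot, color="red")   # overlay the regression line
```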
Finding a Confidence Interval
As before, we would like to know how the regression line might be different if we had a somewhat different dataset. The resampling method to find the relevant confidence intervals is almost the same as the one for correlation. The only significant difference is that you will need to keep track of the slopes of regression lines this time.
Note: it is not appropriate to calculate confidence intervals for the slope and y-intercept separately, since they are calculated together in each simulation. Instead, you could graph the regression lines from all of the simulations on one plot to get a visualization of the confidence interval for the regression line.
Find the 99% confidence interval for the slope of your regression line.
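A minimal resampling sketch, assuming `df` and the column names above (the number of simulations is a choice, not a requirement):

```python
slopes = []
for i in range(10000):
    boot = df.sample(frac=1, replace=True)   # resample rows with replacement
    b = stats.linregress(boot["Head Size"], boot["Brain Weight"])
    slopes.append(b.slope)
# middle 99% of the bootstrapped slopes
print(np.percentile(slopes, [0.5, 99.5]))
```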
Orthogonal Regression
So far, we have performed ordinary least squares regression, which minimizes the vertical distance from each point to the predicted value. This assumes that we know the x-value with much greater precision than the y-value, but often this isn't the case. In such situations, it's better to use orthogonal regression, which minimizes the perpendicular distance from each point to the fitted line. This can be done using the Scipy ODR (orthogonal distance regression) library.
Import the `scipy.odr` library as `odr`.
The functions we will now use work best with NumPy arrays, so let's make a NumPy array version of the brainhead dataset.
Make such an array.
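One way to do this (the variable name is arbitrary):

```python
arr = df.to_numpy()   # column 0 is head size, column 1 is brain weight
```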
In order to run the ODR fitting function, we need to specify the type of function we want to fit and the data to which we will fit it. There are a couple of different ways to specify the function, but the simplest uses the scipy.odr built-in `polynomial` function. The syntax to create a polynomial of degree n is just `odr.polynomial(n)`. (Remember that the degree of a polynomial is the highest power in it.)
Use the `polynomial` function to create a linear function and assign it to a variable. HINT: A line is a polynomial of degree 1.
We also need to create a data object using the `odr.Data` function. In its simplest form, which we will use here, this object can be made from just an array of x values and an array of y values. The syntax is `odr.Data(xvals, yvals)`. (Yes, capitalization matters.)
Using indexing, make such an object and assign it to a variable.
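Assuming the array `arr` from above, with head size in column 0 and brain weight in column 1, this might look like:

```python
mydata = odr.Data(arr[:, 0], arr[:, 1])   # x values, then y values
```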
We can now fit the model function we made earlier to our data. This takes two lines:
myODR = odr.ODR(data, model) #Capitalization matters
ODRfit = myODR.run()
We can now use the `pprint` function (not a typo!) to view the y-intercept and slope of the regression line. The syntax is:
ODRfit.pprint()
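The `pprint` output lists the fitted coefficients under `Beta`. As a note on reading them, assuming the degree-1 polynomial model above:

```python
# For odr.polynomial(1), the coefficient array is ordered from the constant
# term upward, so beta[0] is the y-intercept and beta[1] is the slope.
intercept, slope = ODRfit.beta
print(intercept, slope)
```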
Fit a line to the brain-head data as described. Find the slope and y-intercept.
As before, use the slope and intercept you found to plot this regression line on top of the data.
Examine the lines created by least squares regression and orthogonal regression. (You may want to plot both on the same graph.) Does one appear to fit the data better than the other?
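A sketch of such a comparison plot, assuming the `reg` and `ODRfit` objects from earlier:

```python
sns.lmplot(x="Head Size", y="Brain Weight", data=df, fit_reg=False)
X_plot = np.linspace(df["Head Size"].min(), df["Head Size"].max(), 100)
plt.plot(X_plot, reg.slope * X_plot + reg.intercept, label="least squares")
plt.plot(X_plot, ODRfit.beta[1] * X_plot + ODRfit.beta[0], label="orthogonal")
plt.legend()
```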