Lab 9 - Regression Solutions (LS 40, 21W)


In the previous lab, you learned to make scatterplots and compute correlation coefficients. This supplement will show you how to fit lines to data and estimate confidence intervals for a regression.

In this lab, as in Lab 8 on Correlation, you will study the correlation between head size (cubic centimeters) and brain weight (g) for a group of women.


- As usual, import pandas, Numpy and Seaborn. For regression, we'll also import sub-libraries to plot lines and calculate regression lines and correlation coefficients.


In [2]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # Adds sub-library to plot lines
import scipy.stats as stats      # Adds sub-library to calculate regression lines and correlation coefficients


- Using pandas, import the file
`brainhead.csv`

and view the resulting data frame.


In [6]:

data = pd.read_csv('brainhead.csv')
data


|  | Head Size | Brain Weight |
| --- | --- | --- |
| 0 | 2857 | 1027 |
| 1 | 3436 | 1235 |
| 2 | 3791 | 1260 |
| 3 | 3302 | 1165 |
| 4 | 3104 | 1080 |
| ... | ... | ... |
| 98 | 3214 | 1110 |
| 99 | 3394 | 1215 |
| 100 | 3233 | 1104 |
| 101 | 3352 | 1170 |
| 102 | 3391 | 1120 |

103 rows × 2 columns

- Make a scatterplot of your data. Don't plot a regression line. Note: The general syntax for making a scatterplot from a pandas dataframe `df` without fitting a line to it (or showing histograms of variables) is `sns.lmplot("xvar","yvar",data=df,fit_reg=False)`.


In [18]:

p = sns.lmplot('Head Size', 'Brain Weight', data=data, fit_reg=False)
p.set(title='Scatter Plot of Brain Weight vs Head Size in 103 Women',
      xlabel='Head Size (cc)', ylabel='Brain Weight (g)');


- Compute the appropriate correlation coefficient and explain why you chose to use that correlation coefficient. In regression, we use the term $R^2$ for the fraction of the variance in $y$ that is explained by variance in $x$, and $R^2$ is the square of the Pearson correlation coefficient. Interpret your $R^2$ value. NOTE: Because this time we imported the whole `scipy.stats` library, your syntax will have to be slightly different from last time.


In [20]:

# Look back at our plot from Lab 8, which showed the distributions of X and Y
sns.jointplot('Head Size', 'Brain Weight', data=data)



In [ ]:

# Because the points roughly lie on a line, X and Y both have symmetric
# distributions, and the data are homoskedastic in X and Y, it is
# appropriate to use Pearson's correlation coefficient.


In [22]:

# We imported scipy.stats and called it stats, so to use scipy.stats.pearsonr
# we say stats.pearsonr
rho = stats.pearsonr(data['Head Size'], data['Brain Weight'])[0]
rho


0.7751367523575622
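Since $R^2$ is the square of Pearson's $r$, it can be computed directly from the value above. A minimal check, using the printed $r$ value as a literal so it runs on its own:

```python
# R^2 is the square of the Pearson correlation coefficient r
r = 0.7751367523575622  # value of rho printed above
r_squared = r ** 2
print(r_squared)  # about 0.60: roughly 60% of the variance in brain
                  # weight is explained by variance in head size
```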

There are several Python functions that can perform linear regression. We will use one of the simplest, `linregress` from the `scipy.stats` library. The basic syntax is `reg = stats.linregress(xcolumn, ycolumn)`. You can then use `reg.slope` and `reg.intercept` to get the slope and intercept of your regression line.


- Run a linear regression on your data and obtain the slope and y-intercept.


In [36]:

regr = stats.linregress(data['Head Size'], data['Brain Weight'])
slopeobs = regr.slope
regr


LinregressResult(slope=0.27280169736173243, intercept=286.087017845948, rvalue=0.775136752357563, pvalue=7.3224702392419185e-22, stderr=0.022124995929903527)

- Write the equation for your line and briefly explain what it means


In [ ]:

# The line is y = 0.273x + 286.1. This means that, as a general rule, we can
# expect to see about a 0.27 g increase in brain weight for every 1 cc
# increase in head size. The intercept doesn't make biological sense because
# it represents the brain weight we would expect for a head size of 0, but it
# helps us understand the numerical relationship between the variables.


We would now like to plot this regression line. We can do this by picking a bunch of $x$ values, computing the corresponding $y$ values and plotting the result.

The Numpy command `linspace` generates an array with a specified number of values between your min and max values. (The rather odd name comes from the widely used program Matlab, which Numpy and matplotlib often emulate.) Because you can do arithmetic directly on Numpy arrays, it is then simple to compute your y values. For example, the following code computes points on the line $Y = 5X + 10$ for 100 $X$ values between 0 and 7.

`X_plot = np.linspace(0, 7, 100)`

`Y_plot = 5*X_plot + 10`

The matplotlib command `plot` from the sublibrary `pyplot` (which we've called `plt`) can then plot `X_plot` and `Y_plot` on top of your scatterplot. Just run it in the same cell as the scatterplot code.
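The $Y = 5X + 10$ example above can be run as-is to confirm that the arithmetic applies element-wise across the whole array:

```python
import numpy as np

# Compute points on the line Y = 5X + 10 for 100 X values between 0 and 7
X_plot = np.linspace(0, 7, 100)
Y_plot = 5 * X_plot + 10      # arithmetic broadcasts over the Numpy array

print(len(X_plot))            # 100 evenly spaced values
print(Y_plot[0], Y_plot[-1])  # 10.0 at X=0, 45.0 at X=7
```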


- Using appropriate max and min values, compute 100 points on your line. View the result.


In [25]:

# Pick min and max by finding the min and max x values in the dataset
datamin = np.min(data['Head Size'])
datamax = np.max(data['Head Size'])
print(datamin, datamax)


2720 4204

In [29]:

X_plot = np.linspace(datamin, datamax, 100)
Y_plot = regr.slope*X_plot + regr.intercept  # the slope and intercept come from the regression result


- Copy your scatterplot code. Then, overlay a plot of your regression line on top of it.


In [33]:

p = sns.lmplot('Head Size', 'Brain Weight', data=data, fit_reg=False)
p.set(title='Scatter Plot of Brain Weight vs Head Size in 103 Women with Regression Line',
      xlabel='Head Size (cc)', ylabel='Brain Weight (g)')
plt.plot(X_plot, Y_plot);  # collections of X and Y values plotted as (X, Y) pairs


As before, we would like to know how the regression line might be different if we had a somewhat different dataset. The resampling method to find the relevant confidence intervals is almost the same as the one for correlation. The only significant difference is that you will need to keep track of the slopes of regression lines this time.

Note: it is not appropriate to calculate confidence intervals for the slope and y-intercept separately, since they are calculated together in each simulation. Instead, you could graph the regression lines from all of the simulations on one plot to visualize the confidence interval of the regression line.
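As a sketch of that visualization idea: the cell below overplots many bootstrap regression lines so the spread of the lines shows the uncertainty. The data here are synthetic stand-ins (hypothetical slope, intercept, and noise values chosen to resemble the brainhead data) so the cell runs on its own; with the real dataset you would resample `data` instead.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import scipy.stats as stats

# Synthetic stand-in for brainhead.csv (hypothetical values)
rng = np.random.default_rng(0)
x = rng.uniform(2700, 4200, 103)              # head sizes (cc)
y = 0.27 * x + 286 + rng.normal(0, 70, 103)   # brain weights (g)

X_plot = np.linspace(x.min(), x.max(), 100)
n_sims = 200
sim_slopes = np.zeros(n_sims)
for i in range(n_sims):
    idx = rng.integers(0, len(x), len(x))     # resample with replacement
    reg = stats.linregress(x[idx], y[idx])
    sim_slopes[i] = reg.slope
    plt.plot(X_plot, reg.slope * X_plot + reg.intercept,
             color='gray', alpha=0.05)        # one faint line per simulation

obs = stats.linregress(x, y)
plt.plot(X_plot, obs.slope * X_plot + obs.intercept, color='red')  # observed line
plt.xlabel('Head Size (cc)')
plt.ylabel('Brain Weight (g)');
```

The gray band traced out by the simulated lines is wider at the ends of the x range than in the middle, which is the usual shape of a regression confidence band.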


- Find the 99% confidence intervals for the slope of your regression line.


In [35]:

simslopes = np.zeros(10000)
for i in range(10000):
    sim = data.sample(len(data), replace=True)  # resample the rows with replacement
    simslopes[i] = stats.linregress(sim['Head Size'], sim['Brain Weight']).slope


In [40]:

simslopes.sort()
CIlower = 2*slopeobs - simslopes[9949]  # Lower limit of 99% CI, using the index at the top of the middle 99% of the data (M_upper)
CIupper = 2*slopeobs - simslopes[49]    # Upper limit of 99% CI, using the index at the bottom of the middle 99% of the data (M_lower)
p = sns.distplot(simslopes, kde=False)
p.set(xlabel='Regression slope', ylabel='Count', title='Simulations for Confidence Interval')
p.axvline(CIlower, color='green')  # Lower CI limit
p.axvline(CIupper, color='green')  # Upper CI limit
p.axvline(slopeobs, color='red');  # Observed value


So far, we have performed ordinary least squares regression, which minimizes the vertical distance from each point to the predicted value. This assumes that we know the x-value with much greater precision than the y-value, but often this isn't the case. In such situations, it's better to use orthogonal regression, which minimizes the perpendicular (orthogonal) distance from each observed point to the fitted line. This can be done using the Scipy ODR (orthogonal distance regression) library.


- Import the `scipy.odr` library as `odr`.


In [41]:

import scipy.odr as odr


The functions we will now use work best with Numpy arrays, so let's make a Numpy array version of the brainhead dataset.

- Make such an array.


In [42]:

# Two ways to do this:
# 1) Convert the whole dataframe at once
dataarr = np.array(data)
# 2) Convert each column separately
hs = np.array(data['Head Size'])
bw = np.array(data['Brain Weight'])


In order to run the ODR fitting function, we need to specify the type of model function we want to fit and the data to which we will fit it. There are a couple of different ways to specify the function, but the simplest uses the scipy.odr built-in `polynomial` function. The syntax to create a polynomial of degree $n$ is just `odr.polynomial(n)`. (Remember that the degree of a polynomial is the highest power in it.)


- Use the `polynomial` function to create a linear function and assign it to a variable. HINT: A line is a polynomial of degree 1.


In [43]:

model = odr.polynomial(1)


We also need to create a data object using the `odr.Data` function. In its simplest form, which we will use here, this object can be made from just an array of $x$ values and an array of $y$ values. The syntax is `odr.Data(xvals, yvals)`. (Yes, capitalization matters.)


- Using indexing, make such an object and assign it to a variable.


In [45]:

odrdata = odr.Data(dataarr[:,0], dataarr[:,1])
# or: odrdata = odr.Data(hs, bw)


We can now fit the model function we made earlier to our data. This takes two lines:

`myODR = odr.ODR(data, model) #Capitalization matters`

`ODRfit = myODR.run()`

We can now use the `pprint` function (not a typo!) to view the y-intercept and slope of the regression line. The syntax is `ODRfit.pprint()`.


- Fit a line to the brain-head data as described. Find the slope and y-intercept.


In [46]:

myODR = odr.ODR(odrdata, model)
ODRfit = myODR.run()
ODRfit.pprint()


Beta: [241.24058642 0.28591361]
Beta Std Error: [7.62152398e+01 2.22019640e-02]
Beta Covariance: [[ 1.44087854e+00 -4.18204199e-04]
[-4.18204199e-04 1.22271860e-07]]
Residual Variance: 4031.403487638505
Inverse Condition #: 2.4863646051725723e-05
Reason(s) for Halting:
Sum of squares convergence

In [52]:

print(ODRfit.beta[0], 'intercept')
print(ODRfit.beta[1], 'slope')


241.24058641945388 intercept
0.28591360956520495 slope

- Like before, use the slope and intercept you found to plot the regression line on top of the data.


In [55]:

Y_plot_odr = ODRfit.beta[1]*X_plot + ODRfit.beta[0]  # X data is the same spaced-out collection of X values as before
p_odr = sns.lmplot('Head Size', 'Brain Weight', data=data, fit_reg=False)
p_odr.set(title='Scatter Plot of Brain Weight vs Head Size in 103 Women with Orthogonal Regression Line',
          xlabel='Head Size (cc)', ylabel='Brain Weight (g)')
plt.plot(X_plot, Y_plot_odr);


- Examine the lines created by least squares regression and orthogonal regression. (You may want to plot both on the same graph.) Does one appear to fit the data better than the other?


In [57]:

p = sns.lmplot('Head Size', 'Brain Weight', data=data, fit_reg=False)
p.set(title='Scatter Plot of Brain Weight vs Head Size in 103 Women with Regression Lines',
      xlabel='Head Size (cc)', ylabel='Brain Weight (g)')
plt.plot(X_plot, Y_plot, color='red')        # least squares line
plt.plot(X_plot, Y_plot_odr, color='green');  # orthogonal regression line


In [ ]:

# It looks like the ODR line (green) incorporates the extreme values better.
# This makes sense because X and Y are both values that could have
# measurement error, unlike an X variable such as year, which would not have
# any noise.
