CoCalc -- linear_regression_in

Project: statistics_class_fall2017

Views: ⁴¹⁹

Scatter plots and linear regression

This worksheet is an interactive, guided module for learning the basics of scatter plotting, and of fitting linear models to data. It also shows how to read datafiles and produce a dataframe containing your data.

Let us begin by learning how to read datasets into R.

Example: The comma-separated data file named "winter.csv" contains the following 4 variables: (1) name of a U.S. city; (2) the mean January temperatures in that city (in degrees F); (3) the latitude of the city; and (4) the January temperature in degrees c.
We will read the data, print it out and see what it looks like, and plot the mean January temperature vs. latitude.

The commands below show how to do this.
Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign.

%r
# Read the csv file
winterdat = read.csv(file="./winter.csv", header=TRUE, sep=",")
# Some points to note:
#    header=TRUE says my file has a header line at the top
#    TRUE must be in all upper-case
#    sep="," says to use comma as the separator

winterdat   # just to print out and see the data

City	Mean_Jan_Temp_F	Latitude	Jan_degreesC
Akron, OH	27	41.05	-2.78
Albany-Schenectady-Troy, NY	23	42.40	-5.00
Allentown, Bethlehem, PA-NJ	29	40.35	-1.67
Atlanta, GA	45	33.45	7.22
Baltimore, MD	35	39.20	1.67
Birmingham, AL	45	33.31	7.22
Boston, MA	30	42.15	-1.11
Bridgeport-Milford, CT	30	41.12	-1.11
Buffalo, NY	24	42.54	-4.44
Canton, OH	27	40.50	-2.78
Chattanooga, TN-GA	42	35.01	5.56
Chicago, IL	26	41.49	-3.33
Cincinnati, OH-KY-IN	34	39.08	1.11
Cleveland, OH	28	41.30	-2.22
Columbus, OH	31	40.00	-0.56
Dallas, TX	46	32.45	7.78
Dayton-Springfield, OH	30	39.54	-1.11
Denver, CO	30	39.44	-1.11
Detroit, MI	27	42.06	-2.78
Flint, MI	24	43.00	-4.44
Grand Rapids, MI	24	43.00	-4.44
Greensboro-Winston-Salem-High Point, NC	40	36.04	4.44
Hartford, CT	27	41.45	-2.78
Houston, TX	55	29.46	12.78
Indianapolis, IN	29	39.45	-1.67
Kansas City, MO	31	39.05	-0.56
Lancaster, PA	32	40.05	0.00
Los Angeles, Long Beach, CA	53	34.00	11.67
Louisville, KY-IN	35	38.15	1.67
Memphis, TN-AR-MS	42	35.07	5.56
Miami-Hialeah, FL	67	25.45	19.44
Milwaukee, WI	20	43.03	-6.67
Minneapolis-St. Paul, MN-WI	12	44.58	-11.11
Nashville, TN	40	36.10	4.44
New Haven-Meriden, CT	30	41.20	-1.11
New Orleans, LA	54	30.00	12.22
New York, NY	33	40.40	0.56
Philadelphia, PA-NJ	32	40.00	0.00
Pittsburgh, PA	29	40.26	-1.67
Portland, OR	38	45.31	3.33
Providence, RI	29	41.50	-1.67
Reading, PA	33	40.20	0.56
Richmond-Petersburg, VA	39	37.35	3.89
Rochester, NY	25	43.15	-3.89
St. Louis, MO-IL	32	38.39	0.00
San Diego, CA	55	32.43	12.78
San Francisco, CA	48	37.45	8.89
San Jose, CA	49	37.20	9.44
Seattle, WA	40	47.36	4.44
Springfield, MA	28	42.05	-2.22
Syracuse, NY	24	43.05	-4.44
Toledo, OH	26	41.40	-3.33
Utica-Rome, NY	23	43.05	-5.00
Washington, DC-MD-VA	37	38.50	2.78
Wichita, KS	32	37.42	0.00
Wilmington, DE-NJ-MD	33	39.45	0.56
Worcester, MA	24	42.16	-4.44
York, PA	33	40.00	0.56
Youngstown-Warren, OH	28	41.05	-2.22

The above "read.csv" command produces a dataframe named "winterdat". We can now compute summary stats, plot histograms, boxplots, etc. for the above variables.

However, given the focus of the present tutorial, let's scatter plot the mean January temperature vs. latitude

%r
# Let's first create shorter names for the explanatory & response variables:
xvar = winterdat$Latitude    # be careful and spell it exacly as shown in above printout
yvar = winterdat$Mean_Jan_Temp_F

# Next, make a scatter plot:
plot(xvar, yvar, xlab="Latitude", ylab="Mean January Temp (F)")

# We can add the line of best-fit to the scatter plot, without
# actually finding its equation (uncommet the next line to see it):
# abline(lm(yvar ~ xvar))      # draw regression line on scatter plot

# It is easy to find the correlation
r = cor(xvar, yvar)
cat("correlation=", r)  # This is one one way to print text and variables together

correlation= -0.8573135

Nest, we will standardize the x, y variables and convert them to z-scores.

Here is how:

%r
# Use the "scale" function to standardize as shown below.
# Note that we're using the names "zx", "zy" to store the
# z-scores after standardizing:
zx = scale ( xvar, center = TRUE, scale = TRUE )  # the "center" option subtracts the mean
zy = scale ( yvar, center = TRUE, scale = TRUE )  # the "scale"  option divides by the SD

# Now, let's scatterplot the standardized values and see what
# they look like
plot(zx, zy, xlab="Latitude - standardized", ylab="Mean January Temp - standardized")  # make scatter plot

# Let's verify that the correlation has remained the same
# even after standardizing:
r_after = cor(zx, zy)
cat("correlation after standardizing=", r_after)

correlation after standardizing= -0.8573135

Next, we will let R compute a linear regression model for us.
Note that there are several, slightly different, variations on how to do this. Some examples are shown below.

%r
# Method 1: Use the variables "xvar", "yvar" which have been
# extracted from original dataframe
lm(yvar ~ xvar)

# Method 2: Directly use the variables from original dataframe
lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)

# Method 3: Store the results of "lm" in a new variable
# and query that variable for information about the model
lmresults = lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)
summary ( lmresults )

# Notice that Method 3 gives more information about
# the results, including the R-squared value.
myres = residuals (lmresults )  # prints out values of the residuals
plot(xvar, myres, xlab="Latitude", ylab="Residuals (F)")
abline( 0, 0 )   # add horizontal reference line on x-axis

Call:
lm(formula = yvar ~ xvar)

Coefficients:
(Intercept)         xvar  
     118.14        -2.15  

Call:
lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)

Coefficients:
(Intercept)     Latitude  
     118.14        -2.15  

Call:
lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2978  -2.6353  -0.8719   0.3965  23.6789 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  118.139      6.743   17.52   <2e-16 ***
Latitude      -2.150      0.171  -12.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.272 on 57 degrees of freedom
Multiple R-squared:  0.735,	Adjusted R-squared:  0.7303 
F-statistic: 158.1 on 1 and 57 DF,  p-value: < 2.2e-16

︠90cc99b2-2cc4-4b3c-9ae0-e0ca726cc83bsi︠
%hide
%html
<FONT COLOR="#030886"><B>Exercise:</B>

The U.S. Center for Disease Control and Prevention (CDC)
publishes state by state data on mortality rates by different
causes, including deaths by firearms.  Using this in
conjunction with gun ownership data in each state, we can
explore the association, if any, between firearm deaths
and gun ownership.  The file "firearms2013.csv" contains
these data for the year 2013.  Carry out the following tasks:
<OL TYPE="a">
  <LI> Read the file into R
  <LI> Make a sctterplot of "deaths_per_100k" vs
"gun_ownership_rate".  Be sure to label your axes.
  <LI> Compute the correlation between those two variables
  <LI> Plot the same two variables in standardized form
  <LI> Construct a linear regression model to predict
firearm deaths from gun ownership rate.  Plot the
model together with the original data.
  <LI> Plot the residuals.
  <LI> </LI>Find the $R^2$ value.
  <LI> Write a short paragraph discussing the
quality and appropriateness of the linear model,
based on the scatter plot, correlation, $R^2$, etc.
Are there any conclusions you can draw from the model?
</OL>
<P></P>
</FONT>
<P></P>
Turn in a printed copy of a PDF
<P></P>

Exercise: The U.S. Center for Disease Control and Prevention (CDC) publishes state by state data on mortality rates by different causes, including deaths by firearms. Using this in conjunction with gun ownership data in each state, we can explore the association, if any, between firearm deaths and gun ownership. The file "firearms2013.csv" contains these data for the year 2013. Carry out the following tasks:

Read the file into R
Make a sctterplot of "deaths_per_100k" vs "gun_ownership_rate". Be sure to label your axes.
Compute the correlation between those two variables
Plot the same two variables in standardized form
Construct a linear regression model to predict firearm deaths from gun ownership rate. Plot the model together with the original data.
Plot the residuals.

Find the

R^2

value.

Write a short paragraph discussing the quality and appropriateness of the linear model, based on the scatter plot, correlation, $R^2$ , etc. Are there any conclusions you can draw from the model?

Turn in a printed copy of a PDF