Sharedlinear_regression_in_r.sagewsOpen in CoCalc

Scatter plots and linear regression

This worksheet is an interactive, guided module for learning the basics of scatter plotting, and of fitting linear models to data. It also shows how to read datafiles and produce a dataframe containing your data.

Let us begin by learning how to read datasets into R.

Example: The comma-separated data file named "winter.csv" contains the following 4 variables: (1) name of a U.S. city; (2) the mean January temperatures in that city (in degrees F); (3) the latitude of the city; and (4) the January temperature in degrees c.
We will read the data, print it out and see what it looks like, and plot the mean January temperature vs. latitude.

The commands below show how to do this.
Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign.
%r
# Read the csv file
winterdat = read.csv(file="./winter.csv", header=TRUE, sep=",")
# Some points to note:
#    header=TRUE says my file has a header line at the top
#    TRUE must be in all upper-case
#    sep="," says to use comma as the separator

winterdat   # just to print out and see the data
CityMean_Jan_Temp_FLatitudeJan_degreesC
Akron, OH 27 41.05 -2.78
Albany-Schenectady-Troy, NY 23 42.40 -5.00
Allentown, Bethlehem, PA-NJ 29 40.35 -1.67
Atlanta, GA 45 33.45 7.22
Baltimore, MD 35 39.20 1.67
Birmingham, AL 45 33.31 7.22
Boston, MA 30 42.15 -1.11
Bridgeport-Milford, CT 30 41.12 -1.11
Buffalo, NY 24 42.54 -4.44
Canton, OH 27 40.50 -2.78
Chattanooga, TN-GA 42 35.01 5.56
Chicago, IL 26 41.49 -3.33
Cincinnati, OH-KY-IN 34 39.08 1.11
Cleveland, OH 28 41.30 -2.22
Columbus, OH 31 40.00 -0.56
Dallas, TX 46 32.45 7.78
Dayton-Springfield, OH 30 39.54 -1.11
Denver, CO 30 39.44 -1.11
Detroit, MI 27 42.06 -2.78
Flint, MI 24 43.00 -4.44
Grand Rapids, MI 24 43.00 -4.44
Greensboro-Winston-Salem-High Point, NC40 36.04 4.44
Hartford, CT 27 41.45 -2.78
Houston, TX 55 29.46 12.78
Indianapolis, IN 29 39.45 -1.67
Kansas City, MO 31 39.05 -0.56
Lancaster, PA 32 40.05 0.00
Los Angeles, Long Beach, CA 53 34.00 11.67
Louisville, KY-IN 35 38.15 1.67
Memphis, TN-AR-MS 42 35.07 5.56
Miami-Hialeah, FL 67 25.45 19.44
Milwaukee, WI 20 43.03 -6.67
Minneapolis-St. Paul, MN-WI 12 44.58 -11.11
Nashville, TN 40 36.10 4.44
New Haven-Meriden, CT 30 41.20 -1.11
New Orleans, LA 54 30.00 12.22
New York, NY 33 40.40 0.56
Philadelphia, PA-NJ 32 40.00 0.00
Pittsburgh, PA 29 40.26 -1.67
Portland, OR 38 45.31 3.33
Providence, RI 29 41.50 -1.67
Reading, PA 33 40.20 0.56
Richmond-Petersburg, VA 39 37.35 3.89
Rochester, NY 25 43.15 -3.89
St. Louis, MO-IL 32 38.39 0.00
San Diego, CA 55 32.43 12.78
San Francisco, CA 48 37.45 8.89
San Jose, CA 49 37.20 9.44
Seattle, WA 40 47.36 4.44
Springfield, MA 28 42.05 -2.22
Syracuse, NY 24 43.05 -4.44
Toledo, OH 26 41.40 -3.33
Utica-Rome, NY 23 43.05 -5.00
Washington, DC-MD-VA 37 38.50 2.78
Wichita, KS 32 37.42 0.00
Wilmington, DE-NJ-MD 33 39.45 0.56
Worcester, MA 24 42.16 -4.44
York, PA 33 40.00 0.56
Youngstown-Warren, OH 28 41.05 -2.22
The above "read.csv" command produces a dataframe named "winterdat". We can now compute summary stats, plot histograms, boxplots, etc. for the above variables.

However, given the focus of the present tutorial, let's scatter plot the mean January temperature vs. latitude
%r
# Let's first create shorter names for the explanatory & response variables:
xvar = winterdat$Latitude    # be careful and spell it exacly as shown in above printout
yvar = winterdat$Mean_Jan_Temp_F

# Next, make a scatter plot:
plot(xvar, yvar, xlab="Latitude", ylab="Mean January Temp (F)")

# We can add the line of best-fit to the scatter plot, without
# actually finding its equation (uncommet the next line to see it):
# abline(lm(yvar ~ xvar))      # draw regression line on scatter plot

# It is easy to find the correlation
r = cor(xvar, yvar)
cat("correlation=", r)  # This is one one way to print text and variables together
correlation= -0.8573135
Nest, we will standardize the x, y variables and convert them to z-scores.

Here is how:
%r
# Use the "scale" function to standardize as shown below.
# Note that we're using the names "zx", "zy" to store the
# z-scores after standardizing:
zx = scale ( xvar, center = TRUE, scale = TRUE )  # the "center" option subtracts the mean
zy = scale ( yvar, center = TRUE, scale = TRUE )  # the "scale"  option divides by the SD

# Now, let's scatterplot the standardized values and see what
# they look like
plot(zx, zy, xlab="Latitude - standardized", ylab="Mean January Temp - standardized")  # make scatter plot

# Let's verify that the correlation has remained the same
# even after standardizing:
r_after = cor(zx, zy)
cat("correlation after standardizing=", r_after)

correlation after standardizing= -0.8573135
Next, we will let R compute a linear regression model for us.
Note that there are several, slightly different, variations on how to do this. Some examples are shown below.
%r
# Method 1: Use the variables "xvar", "yvar" which have been
# extracted from original dataframe
lm(yvar ~ xvar)

# Method 2: Directly use the variables from original dataframe
lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)

# Method 3: Store the results of "lm" in a new variable
# and query that variable for information about the model
lmresults = lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)
summary ( lmresults )

# Notice that Method 3 gives more information about
# the results, including the R-squared value.
myres = residuals (lmresults )  # prints out values of the residuals
plot(xvar, myres, xlab="Latitude", ylab="Residuals (F)")
abline( 0, 0 )   # add horizontal reference line on x-axis
Call: lm(formula = yvar ~ xvar) Coefficients: (Intercept) xvar 118.14 -2.15 Call: lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat) Coefficients: (Intercept) Latitude 118.14 -2.15 Call: lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat) Residuals: Min 1Q Median 3Q Max -10.2978 -2.6353 -0.8719 0.3965 23.6789 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 118.139 6.743 17.52 <2e-16 *** Latitude -2.150 0.171 -12.57 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.272 on 57 degrees of freedom Multiple R-squared: 0.735, Adjusted R-squared: 0.7303 F-statistic: 158.1 on 1 and 57 DF, p-value: < 2.2e-16
90cc99b2-2cc4-4b3c-9ae0-e0ca726cc83bsi
%hide
%html
<FONT COLOR="#030886"><B>Exercise:</B>

The U.S. Center for Disease Control and Prevention (CDC)
publishes state by state data on mortality rates by different
causes, including deaths by firearms.  Using this in
conjunction with gun ownership data in each state, we can
explolre the association, if any, between firearm deaths
and gun ownership.  The file "firearms2013.csv" contains
these data for the year 2013.  Carry out the following tasks:
<OL TYPE="a">
  <LI> Read the file into R
  <LI> Make a sctterplot of "deaths_per_100k" vs
"gun_ownership_rate".  Be sure to label your axes.
  <LI> Compute the correlation between those two variables
  <LI> Plot the same two variables in standardized form
  <LI> Construct a linear regression model to predict
firearm deaths from gun ownership rate.  Plot the
model together with the original data.
  <LI> Plot the residuals.
  <LI> </LI>Find the $R^2$ value.
  <LI> Write a short paragraph discussing the
quality and appropriateness of the linear model,
based on the scatter plot, correlation, $R^2$, etc.
Are there any conclusions you can draw from the model?
</OL>
<P></P>
</FONT>
<P></P>
Turn in a printed copy of a PDF
<P></P>

Exercise: The U.S. Center for Disease Control and Prevention (CDC) publishes state by state data on mortality rates by different causes, including deaths by firearms. Using this in conjunction with gun ownership data in each state, we can explolre the association, if any, between firearm deaths and gun ownership. The file "firearms2013.csv" contains these data for the year 2013. Carry out the following tasks:
  1. Read the file into R
  2. Make a sctterplot of "deaths_per_100k" vs "gun_ownership_rate". Be sure to label your axes.
  3. Compute the correlation between those two variables
  4. Plot the same two variables in standardized form
  5. Construct a linear regression model to predict firearm deaths from gun ownership rate. Plot the model together with the original data.
  6. Plot the residuals.
  7. Find the R2R^2 value.
  8. Write a short paragraph discussing the quality and appropriateness of the linear model, based on the scatter plot, correlation, R2R^2, etc. Are there any conclusions you can draw from the model?

Turn in a printed copy of a PDF