Scatter plots and linear regression
This worksheet is an interactive, guided module for learning the basics of scatter plotting, and of fitting linear models to data. It also shows how to read datafiles and produce a dataframe containing your data. Let us begin by learning how to read datasets into R. Example: The comma-separated data file named "winter.csv" contains the following 4 variables: (1) name of a U.S. city; (2) the mean January temperatures in that city (in degrees F); (3) the latitude of the city; and (4) the January temperature in degrees c.We will read the data, print it out and see what it looks like, and plot the mean January temperature vs. latitude. The commands below show how to do this.
Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign.
City | Mean_Jan_Temp_F | Latitude | Jan_degreesC |
---|---|---|---|
Akron, OH | 27 | 41.05 | -2.78 |
Albany-Schenectady-Troy, NY | 23 | 42.40 | -5.00 |
Allentown, Bethlehem, PA-NJ | 29 | 40.35 | -1.67 |
Atlanta, GA | 45 | 33.45 | 7.22 |
Baltimore, MD | 35 | 39.20 | 1.67 |
Birmingham, AL | 45 | 33.31 | 7.22 |
Boston, MA | 30 | 42.15 | -1.11 |
Bridgeport-Milford, CT | 30 | 41.12 | -1.11 |
Buffalo, NY | 24 | 42.54 | -4.44 |
Canton, OH | 27 | 40.50 | -2.78 |
Chattanooga, TN-GA | 42 | 35.01 | 5.56 |
Chicago, IL | 26 | 41.49 | -3.33 |
Cincinnati, OH-KY-IN | 34 | 39.08 | 1.11 |
Cleveland, OH | 28 | 41.30 | -2.22 |
Columbus, OH | 31 | 40.00 | -0.56 |
Dallas, TX | 46 | 32.45 | 7.78 |
Dayton-Springfield, OH | 30 | 39.54 | -1.11 |
Denver, CO | 30 | 39.44 | -1.11 |
Detroit, MI | 27 | 42.06 | -2.78 |
Flint, MI | 24 | 43.00 | -4.44 |
Grand Rapids, MI | 24 | 43.00 | -4.44 |
Greensboro-Winston-Salem-High Point, NC | 40 | 36.04 | 4.44 |
Hartford, CT | 27 | 41.45 | -2.78 |
Houston, TX | 55 | 29.46 | 12.78 |
Indianapolis, IN | 29 | 39.45 | -1.67 |
Kansas City, MO | 31 | 39.05 | -0.56 |
Lancaster, PA | 32 | 40.05 | 0.00 |
Los Angeles, Long Beach, CA | 53 | 34.00 | 11.67 |
Louisville, KY-IN | 35 | 38.15 | 1.67 |
Memphis, TN-AR-MS | 42 | 35.07 | 5.56 |
Miami-Hialeah, FL | 67 | 25.45 | 19.44 |
Milwaukee, WI | 20 | 43.03 | -6.67 |
Minneapolis-St. Paul, MN-WI | 12 | 44.58 | -11.11 |
Nashville, TN | 40 | 36.10 | 4.44 |
New Haven-Meriden, CT | 30 | 41.20 | -1.11 |
New Orleans, LA | 54 | 30.00 | 12.22 |
New York, NY | 33 | 40.40 | 0.56 |
Philadelphia, PA-NJ | 32 | 40.00 | 0.00 |
Pittsburgh, PA | 29 | 40.26 | -1.67 |
Portland, OR | 38 | 45.31 | 3.33 |
Providence, RI | 29 | 41.50 | -1.67 |
Reading, PA | 33 | 40.20 | 0.56 |
Richmond-Petersburg, VA | 39 | 37.35 | 3.89 |
Rochester, NY | 25 | 43.15 | -3.89 |
St. Louis, MO-IL | 32 | 38.39 | 0.00 |
San Diego, CA | 55 | 32.43 | 12.78 |
San Francisco, CA | 48 | 37.45 | 8.89 |
San Jose, CA | 49 | 37.20 | 9.44 |
Seattle, WA | 40 | 47.36 | 4.44 |
Springfield, MA | 28 | 42.05 | -2.22 |
Syracuse, NY | 24 | 43.05 | -4.44 |
Toledo, OH | 26 | 41.40 | -3.33 |
Utica-Rome, NY | 23 | 43.05 | -5.00 |
Washington, DC-MD-VA | 37 | 38.50 | 2.78 |
Wichita, KS | 32 | 37.42 | 0.00 |
Wilmington, DE-NJ-MD | 33 | 39.45 | 0.56 |
Worcester, MA | 24 | 42.16 | -4.44 |
York, PA | 33 | 40.00 | 0.56 |
Youngstown-Warren, OH | 28 | 41.05 | -2.22 |
The above "read.csv" command produces a dataframe named
"winterdat". We can now compute summary stats, plot histograms,
boxplots, etc. for the above variables.
However, given the focus of the present tutorial,
let's scatter plot the mean January temperature vs. latitude
correlation= -0.8573135
Nest, we will standardize the x, y variables and convert
them to z-scores.
Here is how:
correlation after standardizing= -0.8573135
Next, we will let R compute a linear regression model for us.
Note that there are several, slightly different, variations on how to do this. Some examples are shown below.
Note that there are several, slightly different, variations on how to do this. Some examples are shown below.
Call:
lm(formula = yvar ~ xvar)
Coefficients:
(Intercept) xvar
118.14 -2.15
Call:
lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)
Coefficients:
(Intercept) Latitude
118.14 -2.15
Call:
lm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)
Residuals:
Min 1Q Median 3Q Max
-10.2978 -2.6353 -0.8719 0.3965 23.6789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 118.139 6.743 17.52 <2e-16 ***
Latitude -2.150 0.171 -12.57 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.272 on 57 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared: 0.7303
F-statistic: 158.1 on 1 and 57 DF, p-value: < 2.2e-16
Exercise:
The U.S. Center for Disease Control and Prevention (CDC)
publishes state by state data on mortality rates by different
causes, including deaths by firearms. Using this in
conjunction with gun ownership data in each state, we can
explore the association, if any, between firearm deaths
and gun ownership. The file "firearms2013.csv" contains
these data for the year 2013. Carry out the following tasks:
- Read the file into R
- Make a sctterplot of "deaths_per_100k" vs "gun_ownership_rate". Be sure to label your axes.
- Compute the correlation between those two variables
- Plot the same two variables in standardized form
- Construct a linear regression model to predict firearm deaths from gun ownership rate. Plot the model together with the original data.
- Plot the residuals.
- Find the value.
- Write a short paragraph discussing the quality and appropriateness of the linear model, based on the scatter plot, correlation, , etc. Are there any conclusions you can draw from the model?