︠0cc9d58b-71c9-43ff-87d4-c9b5d6045dbdsi︠ %hide %html

Scatter plots and linear regression

This worksheet is an interactive, guided module for learning the basics of scatter plotting, and of fitting linear models to data. It also shows how to read datafiles and produce a dataframe containing your data.

Let us begin by learning how to read datasets into R.

Example: The comma-separated data file named "winter.csv" contains the following 4 variables: (1) the name of a U.S. city; (2) the mean January temperature in that city (in degrees F); (3) the latitude of the city; and (4) the mean January temperature in degrees C.
We will read the data, print it out and see what it looks like, and plot the mean January temperature vs. latitude.

The commands below show how to do this.
Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign. ︡118a61e2-a1aa-499f-9684-8192235bc487︡{"hide":"input"}︡{"html":"

Scatter plots and linear regression

\n\nThis worksheet is an interactive, guided module for learning \nthe basics of scatter plotting, and of fitting linear models \nto data. It also shows how to read datafiles and produce \na dataframe containing your data.\n

\n\nLet us begin by learning how to read datasets into R.\n

\n\n\nExample: The comma-separated data file named \n\"winter.csv\" contains the following 4 variables: (1) the name \nof a U.S. city; (2) the mean January temperature in that \ncity (in degrees F); (3) the latitude of the city; and \n(4) the mean January temperature in degrees C.
\nWe will read the data, print it out and see what it \nlooks like, and plot the mean January temperature vs. \nlatitude.\n
\n

\nThe commands below show how to do this.
Note that all the \ninformation following any \"#\" sign is just to explain \nwhat is going on. R ignores anything that follows a \"#\" sign.\n"}︡{"done":true}︡
︠ded389c5-1ac6-4e68-ad56-747d35cd67bb︠ %r
# Read the csv file
winterdat = read.csv(file="./winter.csv", header=TRUE, sep=",")

# Some points to note:
#   header=TRUE says my file has a header line at the top
#   TRUE must be in all upper-case
#   sep="," says to use comma as the separator

winterdat   # just to print out and see the data
︡a1721ee3-78e5-4e04-9c89-67562d94b68a︡{"html":"\n
City Mean_Jan_Temp_F Latitude Jan_degreesC
Akron, OH 27 41.05 -2.78
Albany-Schenectady-Troy, NY 23 42.40 -5.00
Allentown, Bethlehem, PA-NJ 29 40.35 -1.67
Atlanta, GA 45 33.45 7.22
Baltimore, MD 35 39.20 1.67
Birmingham, AL 45 33.31 7.22
Boston, MA 30 42.15 -1.11
Bridgeport-Milford, CT 30 41.12 -1.11
Buffalo, NY 24 42.54 -4.44
Canton, OH 27 40.50 -2.78
Chattanooga, TN-GA 42 35.01 5.56
Chicago, IL 26 41.49 -3.33
Cincinnati, OH-KY-IN 34 39.08 1.11
Cleveland, OH 28 41.30 -2.22
Columbus, OH 31 40.00 -0.56
Dallas, TX 46 32.45 7.78
Dayton-Springfield, OH 30 39.54 -1.11
Denver, CO 30 39.44 -1.11
Detroit, MI 27 42.06 -2.78
Flint, MI 24 43.00 -4.44
Grand Rapids, MI 24 43.00 -4.44
Greensboro-Winston-Salem-High Point, NC 40 36.04 4.44
Hartford, CT 27 41.45 -2.78
Houston, TX 55 29.46 12.78
Indianapolis, IN 29 39.45 -1.67
Kansas City, MO 31 39.05 -0.56
Lancaster, PA 32 40.05 0.00
Los Angeles, Long Beach, CA 53 34.00 11.67
Louisville, KY-IN 35 38.15 1.67
Memphis, TN-AR-MS 42 35.07 5.56
Miami-Hialeah, FL 67 25.45 19.44
Milwaukee, WI 20 43.03 -6.67
Minneapolis-St. Paul, MN-WI 12 44.58 -11.11
Nashville, TN 40 36.10 4.44
New Haven-Meriden, CT 30 41.20 -1.11
New Orleans, LA 54 30.00 12.22
New York, NY 33 40.40 0.56
Philadelphia, PA-NJ 32 40.00 0.00
Pittsburgh, PA 29 40.26 -1.67
Portland, OR 38 45.31 3.33
Providence, RI 29 41.50 -1.67
Reading, PA 33 40.20 0.56
Richmond-Petersburg, VA 39 37.35 3.89
Rochester, NY 25 43.15 -3.89
St. Louis, MO-IL 32 38.39 0.00
San Diego, CA 55 32.43 12.78
San Francisco, CA 48 37.45 8.89
San Jose, CA 49 37.20 9.44
Seattle, WA 40 47.36 4.44
Springfield, MA 28 42.05 -2.22
Syracuse, NY 24 43.05 -4.44
Toledo, OH 26 41.40 -3.33
Utica-Rome, NY 23 43.05 -5.00
Washington, DC-MD-VA 37 38.50 2.78
Wichita, KS 32 37.42 0.00
Wilmington, DE-NJ-MD 33 39.45 0.56
Worcester, MA 24 42.16 -4.44
York, PA 33 40.00 0.56
Youngstown-Warren, OH 28 41.05 -2.22
\n"}︡{"done":true}︡ ︠9a8780c9-f54f-44e5-8576-a5a8b4842187si︠ %hide %html The above "read.csv" command produces a dataframe named "winterdat". We can now compute summary stats, plot histograms, boxplots, etc. for the above variables.

However, given the focus of the present tutorial, let's make a scatter plot of the mean January temperature vs. latitude.
︡689be61d-0b00-4b9f-b4c5-5dd37989122b︡{"hide":"input"}︡{"html":"\nThe above \"read.csv\" command produces a dataframe named \n\"winterdat\". We can now compute summary stats, plot histograms, \nboxplots, etc. for the above variables.\n\n

\nHowever, given the focus of the present tutorial, \nlet's make a scatter plot of the mean January temperature vs. latitude.\n
\n\n"}︡{"done":true}︡
︠3da00961-b70b-46b6-a143-dcd453721b2f︠ %r
# Let's first create shorter names for the explanatory & response variables:
xvar = winterdat$Latitude          # be careful to spell it exactly as shown in the above printout
yvar = winterdat$Mean_Jan_Temp_F

# Next, make a scatter plot:
plot(xvar, yvar, xlab="Latitude", ylab="Mean January Temp (F)")

# We can add the line of best fit to the scatter plot, without
# actually finding its equation (uncomment the next line to see it):
# abline(lm(yvar ~ xvar))   # draw regression line on scatter plot

# It is easy to find the correlation
r = cor(xvar, yvar)
cat("correlation=", r)      # This is one way to print text and variables together
︡7d9e259f-f8df-4ca1-8b8a-c93f26f0dc44︡{"stdout":"correlation= -0.8573135"}︡{"file":{"filename":"/tmp/tmpm8qFdY.png","show":false,"text":null,"uuid":"10cf989d-af1e-4f73-86d7-7bbb0374a633"},"once":false}︡{"html":""}︡{"done":true}︡
︠360f7817-c5e6-4012-99f5-04732cf17052si︠ %hide %html
Next, we will standardize the x, y variables and convert them to z-scores.

Here is how:
︡4a784192-f949-4dde-93dc-678ef983d8a6︡{"hide":"input"}︡{"html":"\nNext, we will standardize the x, y variables and convert \nthem to z-scores.\n\n

\nHere is how:\n
"}︡{"done":true}︡
︠994c11bf-1c31-4d8f-b310-107a58f15bcfs︠ %r
# Use the "scale" function to standardize as shown below.
# Note that we're using the names "zx", "zy" to store the
# z-scores after standardizing:
zx = scale(xvar, center = TRUE, scale = TRUE)   # the "center" option subtracts the mean
zy = scale(yvar, center = TRUE, scale = TRUE)   # the "scale" option divides by the SD

# Now, let's scatterplot the standardized values and see what
# they look like
plot(zx, zy, xlab="Latitude - standardized", ylab="Mean January Temp - standardized")   # make scatter plot

# Let's verify that the correlation has remained the same
# even after standardizing:
r_after = cor(zx, zy)
cat("correlation after standardizing=", r_after)
︡80618760-f302-400d-98a9-b4c90ad03093︡{"stdout":"correlation after standardizing= -0.8573135"}︡{"file":{"filename":"/tmp/tmp3DWbFS.png","show":false,"text":null,"uuid":"fc42cd94-534f-4ac6-a3b6-cb8c1b036ab7"},"once":false}︡{"html":""}︡{"done":true}︡
︠5ed9268d-129b-40a9-9c2f-c01b45784ad5si︠ %hide %html
Next, we will let R compute a linear regression model for us.
Note that there are several, slightly different, variations on how to do this. Some examples are shown below.
︡19764368-ca3a-4081-b562-f2127e8438b5︡{"hide":"input"}︡{"html":"\nNext, we will let R compute a linear regression model for us.
\nNote that there are several, slightly different, variations on \nhow to do this. Some examples are shown below.\n
"}︡{"done":true}︡
︠9fa485f4-8767-44fe-8ac5-c300a4619cf2s︠ %r
# Method 1: Use the variables "xvar", "yvar" which have been
# extracted from original dataframe
lm(yvar ~ xvar)

# Method 2: Directly use the variables from original dataframe
lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)

# Method 3: Store the results of "lm" in a new variable
# and query that variable for information about the model
lmresults = lm(Mean_Jan_Temp_F ~ Latitude, data = winterdat)
summary(lmresults)
# Notice that Method 3 gives more information about
# the results, including the R-squared value.

myres = residuals(lmresults)   # store the residuals (type "myres" on its own line to print them)
plot(xvar, myres, xlab="Latitude", ylab="Residuals (F)")
abline(0, 0)                   # add a horizontal reference line at residual = 0
︡58c035bb-775b-48f9-a878-bd27c54374b9︡{"stdout":"\nCall:\nlm(formula = yvar ~ xvar)\n\nCoefficients:\n(Intercept) xvar \n 118.14 -2.15 \n"}︡{"stdout":"\nCall:\nlm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)\n\nCoefficients:\n(Intercept) Latitude \n 118.14 -2.15 \n"}︡{"stdout":"\nCall:\nlm(formula = Mean_Jan_Temp_F ~ Latitude, data = winterdat)\n\nResiduals:\n Min 1Q Median 3Q Max \n-10.2978 -2.6353 -0.8719 0.3965 23.6789 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 118.139 6.743 17.52 <2e-16 ***\nLatitude -2.150 0.171 -12.57 <2e-16 ***\n---\nSignif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n\nResidual standard error: 5.272 on 57 degrees of freedom\nMultiple R-squared: 0.735,\tAdjusted R-squared: 0.7303 \nF-statistic: 158.1 on 1 and 57 DF, p-value: < 2.2e-16\n"}︡{"file":{"filename":"/tmp/tmp9Ug1XQ.png","show":false,"text":null,"uuid":"9979b698-f9f7-439e-89f5-9d3eab64d08e"},"once":false}︡{"html":""}︡{"done":true}︡
︠1476dc9f-3db6-40d6-8ea4-d8634ad2aa70︠
︠90cc99b2-2cc4-4b3c-9ae0-e0ca726cc83bsi︠ %hide %html
Exercise: The U.S. Centers for Disease Control and Prevention (CDC) publishes state-by-state data on mortality rates by different causes, including deaths by firearms.
Using this in conjunction with gun ownership data in each state, we can explore the association, if any, between firearm deaths and gun ownership. The file "firearms2013.csv" contains these data for the year 2013. Carry out the following tasks:
  1. Read the file into R
  2. Make a scatter plot of "deaths_per_100k" vs "gun_ownership_rate". Be sure to label your axes.
  3. Compute the correlation between those two variables
  4. Plot the same two variables in standardized form
  5. Construct a linear regression model to predict firearm deaths from gun ownership rate. Plot the model together with the original data.
  6. Plot the residuals.
  7. Find the $R^2$ value.
  8. Write a short paragraph discussing the quality and appropriateness of the linear model, based on the scatter plot, correlation, $R^2$, etc. Are there any conclusions you can draw from the model?

Turn in a printed copy of a PDF

︡35dfb75c-1a44-4418-8a54-76d74f609e82︡{"hide":"input"}︡{"html":"Exercise:\n\nThe U.S. Centers for Disease Control and Prevention (CDC)\npublishes state-by-state data on mortality rates by different\ncauses, including deaths by firearms. Using this in\nconjunction with gun ownership data in each state, we can\nexplore the association, if any, between firearm deaths\nand gun ownership. The file \"firearms2013.csv\" contains\nthese data for the year 2013. Carry out the following tasks:\n
    \n
  1. Read the file into R\n
  2. Make a scatter plot of \"deaths_per_100k\" vs\n\"gun_ownership_rate\". Be sure to label your axes.\n
  3. Compute the correlation between those two variables\n
  4. Plot the same two variables in standardized form\n
  5. Construct a linear regression model to predict\nfirearm deaths from gun ownership rate. Plot the\nmodel together with the original data.\n
  6. Plot the residuals.\n
  7. Find the $R^2$ value.\n
  8. Write a short paragraph discussing the\nquality and appropriateness of the linear model,\nbased on the scatter plot, correlation, $R^2$, etc.\nAre there any conclusions you can draw from the model?\n
\n

\n
\n

\nTurn in a printed copy of a PDF\n

"}︡{"done":true} ︠ab8e3986-0d61-4dc4-87be-258df703d0a0︠