The code below loads the necessary packages for data analysis and reads in the datasets 'starbucks.csv' and 'simdata.RData'.
# Set the location of our personally installed R packages
.libPaths(new = "~/Rlibs")
# Load the Tidyverse and modelr packages
library(tidyverse)
library(modelr)
# Provide notification if NA values are dropped when modeling
options(na.action = na.warn)
# Load the datasets for Homework 4
starbucks <- read_csv("starbucks.csv")
load("simdata.RData")
## Uncomment the code below to change the plot window size
# options(repr.plot.width = 6, repr.plot.height = 4)
## Uncomment the code below to change the tibble printing format.
# options(tibble.print_max = 30, tibble.print_min = 20,
# tibble.width = Inf)
Based on my knowledge, the amount of carbohydrates a food item contains is positively correlated with how many calories that food item provides. In this case, the number of calories will be the explanatory variable and the response variable will be the number of carbohydrates. This model is probably not true for all amounts of calories. Some foods have high amounts of fat and sugar but low amounts of carbohydrates and still provide a high number of calories.
The code below makes use of a couple of functions to model the linear relationship between carbohydrate content and calorie content. The code uses the lm() function to model the 'carb' and 'calories' columns from the 'starbucks' dataset to create a visual representation to describe the relationship between the two variables. The data_grid() function creates a tibble that has a column of each unique value of the specified column, in this case it's 'calories'. The add_predictions() function adds the predicted values of the response variable to the tibble containing the unique values of the explanatory variable. ggplot() and geom_point() plot the actual values of the 'calories' and 'carbs' columns while the geom_line() function takes the predicted response variable values and plots a line to illustrate the trend of the relationship between calories and carbohydrates.
sim1_model <- lm(carb~calories, starbucks)
sim1_model
grid <- data_grid(starbucks, calories)
print(grid)
grid <- add_predictions(grid, sim1_model)
print(grid)
ggplot(starbucks) +
geom_point(aes(calories, carb)) +
geom_line(aes(calories, pred), data=grid, color="red", size=1)
The add_residuals() function adds the residuals to the tibble, which are the differences between the predicted response variable values and the actual response variable values for each unique explanatory variable value. The geom_ref_line() function creates the thick white line as the reference line that symbolizes no difference between predicted and actual values. ggplot() and geom_point() maps out the residual points for each unique value of calories.
sim1_resid <- add_residuals(starbucks, sim1_model)
ggplot(sim1_resid) +
geom_ref_line(h=0) +
geom_point(aes(calories, y=resid))
It seems that food items with a low amount of calories have less variation in the discrepancy between actual and predicted values of carbohydrates while those food items with higher numbers of calories have greater variation.
The code below creates a frequency polygon plot of the residuals using ggplot() and geom_freqpoly() functions. It's like a line graph version of a histogram.
ggplot(sim1_resid) +
geom_freqpoly(aes(resid), binwidth=0.5)
The distribution of residuals appears to be relatively normal, with most of the residuals being around 0 (indicating the linear model is fairly accurate).
This data seems to meet the required conditions for a least squares regression line as the data is relatively linear (1), has a nearly normal distribution of residuals (2), and the residuals of a linear regression have a constant variance because they are by definition random (3).
The code below separates the simdata dataset to three different tibbles for each simulated dataset by using the filter() function and specifying the label of the simulated dataset.
sim_a_simdata<-filter(simdata, label=='sim_a')
sim_b_simdata<-filter(simdata, label=='sim_b')
sim_c_simdata<-filter(simdata, label=='sim_c')
The code below fits a linear model to the sim_a dataset, plots the linear model and the dataset to demonstrate how well the lienar model represents the dataset, and then plots the residuals in a scatterplot and a frequency polygon plot to show if the variability of the residuals is constant which is one of the conditions necessary to satisfy to justify the use of a linear model.
sim_a_model <- lm(y~x, sim_a_simdata)
sim_a_model
grid <- data_grid(sim_a_simdata, x)
print(grid)
sim_a_resid <- add_residuals(sim_a_simdata, sim_a_model)
print(sim_a_resid)
grid <- add_predictions(grid, sim_a_model)
print(grid)
sima_plot<-ggplot(sim_a_simdata) +
geom_point(aes(x, y)) +
geom_line(aes(x, y=pred), data=grid, color="red", size=1) +
geom_abline(intercept=6,slope=1.5)
sima_plot
sima_resid_scatterplot<-ggplot(sim_a_resid) +
geom_ref_line(h=0) +
geom_point(aes(x, y=resid))
sima_resid_scatterplot
sima_freqpoly<-ggplot(sim_a_resid) +
geom_freqpoly(aes(resid), binwidth=0.5)
sima_freqpoly
This dataset seems to conform pretty well to the linear model and has a fairly normal distribution of residuals and constant variance.
The code below fits a linear model to the sim_b dataset, plots the linear model and the dataset to demonstrate how well the lienar model represents the dataset, and then plots the residuals in a scatterplot and a frequency polygon plot to show if the variability of the residuals is constant which is one of the conditions necessary to satisfy to justify the use of a linear model.
sim_b_model <- lm(y~x, sim_b_simdata)
sim_b_model
grid <- data_grid(sim_b_simdata, x)
print(grid)
sim_b_resid <- add_residuals(sim_b_simdata, sim_b_model)
print(sim_b_resid)
grid <- add_predictions(grid, sim_b_model)
print(grid)
simb_plot<-ggplot(sim_b_simdata) +
geom_point(aes(x, y)) +
geom_line(aes(x, y=pred), data=grid, color="red", size=1) +
geom_abline(intercept=6,slope=1.5)
simb_plot
simb_resid_scatterplot<-ggplot(sim_b_resid) +
geom_ref_line(h=0) +
geom_point(aes(x, y=resid))
simb_resid_scatterplot
simb_freqpoly<-ggplot(sim_b_resid) +
geom_freqpoly(aes(resid), binwidth=0.5)
simb_freqpoly
This linear model seems to have a relatively normal distribution of residuals and constant variance however the scatterplot of residuals reveals that the higher values of x seem to conform more closely to the linear model than the lower values of x indicating that there may be another type of correlation that more accurately represents the relationship between x and y than a linear correlation.
The code below fits a linear model to the sim_c dataset, plots the linear model and the dataset to demonstrate how well the lienar model represents the dataset, and then plots the residuals in a scatterplot and a frequency polygon plot to show if the variability of the residuals is constant which is one of the conditions necessary to satisfy to justify the use of a linear model.
sim_c_model <- lm(y~x, sim_c_simdata)
sim_c_model
grid <- data_grid(sim_c_simdata, x)
print(grid)
sim_c_resid <- add_residuals(sim_c_simdata, sim_c_model)
print(sim_c_resid)
grid <- add_predictions(grid, sim_c_model)
print(grid)
simc_plot<-ggplot(sim_c_simdata) +
geom_point(aes(x, y)) +
geom_line(aes(x, y=pred), data=grid, color="red", size=1) +
geom_abline(intercept=6,slope=1.5)
simc_plot
simc_resid_scatterplot<-ggplot(sim_c_resid) +
geom_ref_line(h=0) +
geom_point(aes(x, y=resid))
simb_resid_scatterplot
simc_freqpoly<-ggplot(sim_c_resid) +
geom_freqpoly(aes(resid), binwidth=0.5)
simc_freqpoly
The following code shows an alternative to the least-squares criterion, the mean-absolute distance criterion, which involves averaging over the absolute value of residuals instead of squaring them. This is implemented by using the function optim() in combination with the custom function 'make_prediction' shown below.
The code uses the optim() function to fit the "sim_a", "sim_b", and "sim_c" datasets from simdata using the mean absolute distance. The plot for the sim_a dataset shows both the mean-absolute distance and least-squares lines.
make_prediction <- function(intercept_slope, data)
{
intercept_slope[1] + data$x * intercept_slope[2]
}
measure_distance <- function(intercept_slope_parameters, dataset)
{
diff <- data$y - make_prediction(intercept_slope_parameters,
dataset)
mean(abs(diff))
}
measure_distance(c(6.153,1.482), sim_a_simdata)
sima_model_parameters <- optim(c(16, -5), measure_distance, data=sim_a_simdata)
print(sima_model_parameters$par)
ggplot(sim_a_simdata) +
geom_point(aes(x,y),size=2, color="grey30") +
geom_line(aes(x, pred), data=grid, color="red", size=1) +
geom_abline(intercept=sima_model_parameters$par[1],slope=sima_model_parameters$par[2])
measure_distance(c(2,0.7), sim_b_simdata)
simb_model_parameters <- optim(c(16, -5), measure_distance, data=sim_b_simdata)
print(simb_model_parameters$par)
ggplot(sim_b_simdata, aes(x, y)) +
geom_point(size=2, color="grey30") +
geom_abline(intercept=simb_model_parameters$par[1],
slope=simb_model_parameters$par[2])
Another alternative to lm() is to fit a smooth curve using the loess() function. Follow the procedure in Chapter 23.3 of R for Data Science of model fitting, grid generation, predictions, and visualization on the sim1 dataset (loaded as part of the modelr library). The plot should also include the least-squares line for comparison. Calculate the residuals and frequency polygon. How does this compare with the least-squares line? How does it compare with the default method of geom_smooth()?
sim1_mod <- loess(y ~ x, data = sim1)
sim1_mod
grid <- sim1 %>%
data_grid(x)
grid <- grid %>%
add_predictions(sim1_mod)
ggplot(sim1, aes(x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
sim1 <- sim1 %>%
add_residuals(sim1_mod)
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
ggplot(sim1, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
ggplot(sim1) +
geom_point(aes(x=x, y = y)) +
geom_smooth(mapping=aes(x=x,y=y),method='lm',se=FALSE)