The code below loads the necessary packages for data analysis and reads in the datasets 'starbucks.csv' and 'simdata.RData'.
# Set the location of our personally installed R packages
.libPaths(new = "~/Rlibs")

# Load the Tidyverse and modelr packages
library(tidyverse)
library(modelr)

# Provide notification if NA values are dropped when modeling
options(na.action = na.warn)

# Load the datasets for Homework 4
starbucks <- read_csv("starbucks.csv")
load("simdata.RData")

## Uncomment the code below to change the plot window size
# options(repr.plot.width = 6, repr.plot.height = 4)

## Uncomment the code below to change the tibble printing format.
# options(tibble.print_max = 30, tibble.print_min = 20,
#         tibble.width = Inf)
Based on my knowledge, the amount of carbohydrates a food item contains is positively correlated with the number of calories it provides. In this case, the number of calories will be the explanatory variable and the number of carbohydrates will be the response variable. This relationship probably does not hold across the whole range of calories: some foods are high in fat or protein but low in carbohydrates and still provide a high number of calories.
The code below models the linear relationship between carbohydrate content and calorie content. The lm() function fits a linear model with the 'carb' column of the 'starbucks' dataset as the response and the 'calories' column as the explanatory variable. The data_grid() function creates a tibble with one row for each unique value of the specified column, in this case 'calories'. The add_predictions() function adds the model's predicted values of the response variable to that tibble. ggplot() and geom_point() plot the actual values of the 'calories' and 'carb' columns, while geom_line() draws a line through the predicted values to illustrate the trend in the relationship between calories and carbohydrates.
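The fit, grid, and prediction steps described above can be sketched as follows. This is a minimal illustration rather than the original notebook cell: `starbucks_demo` is a hypothetical, synthetic stand-in for the real 'starbucks' data, so the plot it produces will not match the homework output.

```r
library(tidyverse)
library(modelr)

# Hypothetical stand-in for starbucks.csv: synthetic values, not the real data
set.seed(1)
starbucks_demo <- tibble(
  calories = rep(seq(100, 500, by = 50), each = 5),
  carb     = 10 + 0.1 * calories + rnorm(length(calories), sd = 8)
)

# Fit carb as a linear function of calories
carb_mod <- lm(carb ~ calories, data = starbucks_demo)

# One row per unique calories value, plus the model's prediction in `pred`
grid <- starbucks_demo %>%
  data_grid(calories) %>%
  add_predictions(carb_mod)

# Observed points with the fitted line drawn over them
p <- ggplot(starbucks_demo, aes(x = calories, y = carb)) +
  geom_point() +
  geom_line(data = grid, aes(y = pred), colour = "red")
p
```

Because the grid holds only the unique 'calories' values, geom_line() needs its own `data` argument; the points still come from the full dataset.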
The add_residuals() function adds a column of residuals to the data; each residual is the actual value of the response variable minus the value the model predicts for that observation. The geom_ref_line() function draws the thick white reference line at zero, which represents no difference between the predicted and actual values. ggplot() and geom_point() plot the residuals against calories.
It seems that food items with few calories show less variation in the residuals (the discrepancy between actual and predicted carbohydrates), while food items with more calories show greater variation. In other words, the spread of the residuals grows as calories increase.
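The residual plot described above can be sketched as follows. The data here are synthetic and hypothetical (`demo` is not the 'starbucks' dataset); the noise is deliberately made larger at higher calorie values so that the fan shape is visible.

```r
library(tidyverse)
library(modelr)

# Hypothetical synthetic data whose noise grows with calories
set.seed(2)
demo <- tibble(
  calories = runif(60, 80, 500),
  carb     = 10 + 0.1 * calories + rnorm(60, sd = 0.03 * calories)
)
carb_mod <- lm(carb ~ calories, data = demo)

# add_residuals() appends `resid` = observed carb - predicted carb, one per row
demo_resid <- demo %>% add_residuals(carb_mod)

# Residuals against the explanatory variable, with a reference line at zero
resid_plot <- ggplot(demo_resid, aes(x = calories, y = resid)) +
  geom_ref_line(h = 0) +
  geom_point()
resid_plot
```

Note that add_residuals() is applied to the observations themselves rather than to the grid of unique values, since a residual needs an actual response value to compare against.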
The code below creates a frequency polygon of the residuals using the ggplot() and geom_freqpoly() functions. A frequency polygon is like a histogram drawn with lines instead of bars.
The distribution of residuals appears to be approximately normal, with most of the residuals close to 0 (indicating that the linear model's predictions are usually close to the actual values).
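A frequency polygon of residuals can be sketched as follows, again on hypothetical synthetic data rather than the real 'starbucks' values. The `binwidth` of 4 is an arbitrary choice for this example and would need tuning for other data.

```r
library(tidyverse)
library(modelr)

# Hypothetical synthetic data (not the real starbucks values)
set.seed(3)
demo <- tibble(
  calories = runif(80, 80, 500),
  carb     = 10 + 0.1 * calories + rnorm(80, sd = 8)
)
carb_mod   <- lm(carb ~ calories, data = demo)
demo_resid <- demo %>% add_residuals(carb_mod)

# Frequency polygon: bin the residuals, then join the bin counts with a line
freq_plot <- ggplot(demo_resid, aes(x = resid)) +
  geom_freqpoly(binwidth = 4)
freq_plot
```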
This data seems to meet most of the conditions required for a least squares regression line: the pattern in the data is relatively linear, the observations are independent of one another, and the residuals are nearly normally distributed. However, the data points do not remain tightly clustered around the least squares regression line, and the spread of the residuals in the scatterplot is not constant.
In addition, the model could be misleading: foods with a high amount of protein or fat but few carbohydrates can still have a high number of calories, so calories alone may not predict carbohydrates well.
The code below splits the simdata dataset into three separate tibbles, one for each simulated dataset, by using the filter() function and specifying the label of the simulated dataset.
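The split can be sketched as follows. The structure of simdata is not shown in this write-up, so the column name `label` and the toy values below are assumptions made purely for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for simdata; the column name `label` is an assumption
simdata_demo <- tibble(
  label = rep(c("sim_a", "sim_b", "sim_c"), each = 4),
  x     = rep(1:4, times = 3),
  y     = c(2.1, 3.9, 6.2, 8.0, 1.0, 4.2, 8.8, 16.1, 2.5, 3.0, 7.5, 6.0)
)

# One tibble per simulated dataset
sim_a <- simdata_demo %>% filter(label == "sim_a")
sim_b <- simdata_demo %>% filter(label == "sim_b")
sim_c <- simdata_demo %>% filter(label == "sim_c")
```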
The code below fits a linear model to the sim_a dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify the use of a linear model.
This dataset seems to conform well to the linear model, with a nearly normal distribution of residuals and roughly constant variance.
The code below fits a linear model to the sim_b dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify the use of a linear model.
This linear model seems to have a nearly normal distribution of residuals and roughly constant variance overall. However, the scatterplot of residuals reveals that the higher values of x conform more closely to the linear model than the lower values of x, indicating that a nonlinear relationship may describe x and y more accurately than a linear one.
The code below fits a linear model to the sim_c dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify the use of a linear model.
The data points in the sim_c dataset seem to follow a linear trend. However, the residuals do not seem to have constant variance, nor are they normally distributed: their distribution seems slightly skewed to the right.
The following code shows an alternative to the least-squares criterion: the mean-absolute-distance criterion, which
averages the absolute values of the residuals instead of their squares. This is implemented
by using the optim() function in combination with the custom function 'make_prediction' shown below.
The code uses the optim() function to fit the "sim_a", "sim_b", and "sim_c" datasets from simdata by minimizing the
mean-absolute distance. The plot for the sim_a dataset shows both the mean-absolute-distance line and the least-squares line.
For this dataset, the line fitted with the mean-absolute-distance criterion differs markedly from the least-squares regression line.
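The mean-absolute-distance fit described above can be sketched as follows. The original definition of 'make_prediction' is not reproduced in this write-up, so the signature used here (a parameter vector and a data frame) is an assumption, `measure_mad` is a hypothetical helper name, and `sim_demo` is synthetic data with one deliberate outlier rather than any of the real simulated datasets.

```r
library(ggplot2)

# Hypothetical synthetic data with one large outlier (not the real simdata)
set.seed(4)
sim_demo <- data.frame(x = rep(1:10, each = 3))
sim_demo$y <- 4 + 2 * sim_demo$x + rnorm(nrow(sim_demo), sd = 2)
sim_demo$y[30] <- 60

# Assumed signature: params[1] is the intercept, params[2] is the slope
make_prediction <- function(params, data) {
  params[1] + params[2] * data$x
}

# Mean absolute distance between observed and predicted y (hypothetical helper)
measure_mad <- function(params, data) {
  mean(abs(data$y - make_prediction(params, data)))
}

# Minimise the criterion from a starting guess of c(0, 0);
# optim() passes `data` through to measure_mad()
mad_fit  <- optim(c(0, 0), measure_mad, data = sim_demo)
mad_coef <- mad_fit$par

# Least-squares coefficients for comparison
ls_coef <- coef(lm(y ~ x, data = sim_demo))

# Both fitted lines over the data
mad_plot <- ggplot(sim_demo, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = mad_coef[1], slope = mad_coef[2], colour = "red") +
  geom_abline(intercept = ls_coef[1], slope = ls_coef[2], colour = "blue")
mad_plot
```

Squaring gives large residuals disproportionate weight, so a single outlier tends to pull the least-squares line further than the mean-absolute-distance line. That is one plausible reason the two lines can differ noticeably on the same dataset.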
Another alternative to lm() is to fit a smooth curve using the loess() function. Follow the
procedure in Chapter 23.3 of R for Data Science of model fitting, grid generation, predictions, and
visualization on the sim1 dataset (loaded as part of the modelr library). The plot should also include
the least-squares line for comparison. Calculate the residuals and frequency polygon. How does this
compare with the least-squares line? How does it compare with the default method of geom_smooth()?
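One way the procedure in this prompt might be sketched is shown below, using the sim1 dataset that ships with modelr. The variable names, colours, and the `binwidth` of 1 are choices made for this example, not part of the assignment.

```r
library(tidyverse)
library(modelr)

# sim1 ships with modelr: 30 rows with columns x and y
loess_mod <- loess(y ~ x, data = sim1)
lm_mod    <- lm(y ~ x, data = sim1)

# Grid of unique x values with predictions from both models
grid <- sim1 %>%
  data_grid(x) %>%
  add_predictions(loess_mod, var = "pred_loess") %>%
  add_predictions(lm_mod, var = "pred_lm")

# Data with the least-squares line (blue) and the loess curve (red)
fit_plot <- ggplot(sim1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = grid, aes(y = pred_lm), colour = "blue") +
  geom_line(data = grid, aes(y = pred_loess), colour = "red")
fit_plot

# Residuals from both models, one column each
sim1_resid <- sim1 %>%
  add_residuals(loess_mod, var = "resid_loess") %>%
  add_residuals(lm_mod, var = "resid_lm")

# Overlaid frequency polygons of the two sets of residuals
resid_plot <- sim1_resid %>%
  pivot_longer(c(resid_loess, resid_lm),
               names_to = "model", values_to = "resid") %>%
  ggplot(aes(x = resid, colour = model)) +
  geom_freqpoly(binwidth = 1)
resid_plot
```

For datasets with fewer than 1,000 observations, geom_smooth() uses loess by default, so adding `geom_smooth(se = FALSE)` to `fit_plot` should trace essentially the same curve as the red line.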