Homework 4
CDS-101 (Spring 2017)
Name: Helena Gray
The code below loads the necessary packages for data analysis and reads in the datasets 'starbucks.csv' and 'simdata.RData'.
Question 1.
Based on my knowledge, the amount of carbohydrates a food item contains is positively correlated with the number of calories it provides. In this case, the number of calories is the explanatory variable and the amount of carbohydrates is the response variable. This relationship probably does not hold across the full range of calories: some foods are high in fat or protein but low in carbohydrates and still provide a large number of calories.
Question 2.
The code below uses several functions to model the linear relationship between carbohydrate content and calorie content. The lm() function fits a linear model to the 'carb' and 'calories' columns of the 'starbucks' dataset, with 'carb' as the response and 'calories' as the explanatory variable. The data_grid() function creates a tibble with one row for each unique value of the specified column, in this case 'calories'. The add_predictions() function then adds the model's predicted value of the response variable for each of those rows. ggplot() and geom_point() plot the observed values of the 'calories' and 'carb' columns, while geom_line() draws a line through the predicted values to show the trend in the relationship between calories and carbohydrates.
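As a minimal sketch of the steps described above, the following uses a small synthetic data frame in place of 'starbucks.csv'. The name starbucks_demo and its values are assumptions made for illustration; only the column names 'calories' and 'carb' come from the text.

```r
library(modelr)
library(ggplot2)

# Hypothetical stand-in for the 'starbucks' data: only the two columns used here.
set.seed(1)
starbucks_demo <- data.frame(calories = rep(seq(100, 500, by = 50), each = 3))
starbucks_demo$carb <- 8 + 0.1 * starbucks_demo$calories +
  rnorm(nrow(starbucks_demo), sd = 8)

# Fit carb (response) as a linear function of calories (explanatory).
carb_model <- lm(carb ~ calories, data = starbucks_demo)

# One row per unique calories value, plus a 'pred' column of model predictions.
grid <- data_grid(starbucks_demo, calories)
grid <- add_predictions(grid, carb_model)

# Observed points with the fitted line drawn through the predictions.
fit_plot <- ggplot(starbucks_demo, aes(x = calories, y = carb)) +
  geom_point() +
  geom_line(aes(y = pred), data = grid, colour = "red")
print(fit_plot)
```

Because the predictions come from a straight-line model, the grid only needs the unique 'calories' values to draw the fitted line.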
The add_residuals() function adds the residuals to the tibble. Each residual is the observed value of the response variable minus the value the model predicts for that observation. The geom_ref_line() function draws the thick white reference line at zero, which marks no difference between predicted and actual values. ggplot() and geom_point() plot the residual of each observation against its calorie value.
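The residual plot described above might look like the following sketch. The data frame starbucks_demo is again a synthetic stand-in and not the real 'starbucks' data.

```r
library(modelr)
library(ggplot2)

# Hypothetical stand-in data with the same two column names as in the text.
set.seed(1)
starbucks_demo <- data.frame(calories = rep(seq(100, 500, by = 50), each = 3))
starbucks_demo$carb <- 8 + 0.1 * starbucks_demo$calories +
  rnorm(nrow(starbucks_demo), sd = 8)
carb_model <- lm(carb ~ calories, data = starbucks_demo)

# add_residuals() needs the observed response, so it is applied to the data
# itself; it adds a 'resid' column equal to observed minus predicted.
with_resid <- add_residuals(starbucks_demo, carb_model)

# Reference line at zero, then one point per observation.
resid_plot <- ggplot(with_resid, aes(x = calories, y = resid)) +
  geom_ref_line(h = 0) +
  geom_point()
print(resid_plot)
```

Points above the reference line are items with more carbohydrates than the model predicts, and points below it are items with fewer.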
It seems that food items with few calories show less spread in their residuals, while food items with more calories show greater spread, so the variability of the residuals is not constant across the range of calories.
The code below creates a frequency polygon plot of the residuals using ggplot() and geom_freqpoly() functions. It's like a line graph version of a histogram.
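A frequency polygon of residuals can be sketched as follows. The data are synthetic and the binwidth of 5 is an arbitrary choice for this example, not a value taken from the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in data and model.
set.seed(1)
demo <- data.frame(calories = rep(seq(100, 500, by = 50), each = 3))
demo$carb <- 8 + 0.1 * demo$calories + rnorm(nrow(demo), sd = 8)
carb_model <- lm(carb ~ calories, data = demo)

# Collect the residuals in a data frame so ggplot() can use them.
resid_df <- data.frame(resid = residuals(carb_model))

# geom_freqpoly() bins the residuals like a histogram but joins the bin
# counts with a line instead of drawing bars.
freq_plot <- ggplot(resid_df, aes(x = resid)) +
  geom_freqpoly(binwidth = 5)
print(freq_plot)
```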
The distribution of residuals appears to be approximately normal, with most of the residuals close to 0, which indicates that the linear model's predictions are fairly accurate for most items.
Question 3.
This data seems to meet most of the conditions required for a least squares regression line: the pattern in the data is roughly linear, the observations are independent of one another, and the residuals are nearly normally distributed. However, the data points do not remain tightly clustered around the least squares regression line, and the residuals are not evenly spread in the scatterplot, so the constant variability condition is questionable.
In addition, the model could be misleading for some items: foods that are high in protein or fat but low in carbohydrates can still have a high number of calories, so calories alone may predict their carbohydrate content poorly.
Question 4.
The code below separates the simdata dataset into three tibbles, one for each simulated dataset, by using the filter() function and specifying the label of the simulated dataset.
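The split could look roughly like the sketch below. The structure of 'simdata.RData' is not shown in this report, so the label column name 'sim', the label values, and the data in simdata_demo are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical layout: the real simdata may use different column names or labels.
simdata_demo <- data.frame(
  sim = rep(c("sim_a", "sim_b", "sim_c"), each = 4),
  x = rep(1:4, times = 3),
  y = c(2.1, 3.9, 6.2, 8.0, 1.0, 4.2, 8.8, 16.1, 2.0, 4.1, 5.9, 15.0)
)

# Keep only the rows that carry each dataset's label.
sim_a <- filter(simdata_demo, sim == "sim_a")
sim_b <- filter(simdata_demo, sim == "sim_b")
sim_c <- filter(simdata_demo, sim == "sim_c")
```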
The code below fits a linear model to the sim_a dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify using a linear model.
This dataset seems to conform pretty well to the linear model and has a fairly normal distribution of residuals and constant variance.
The code below fits a linear model to the sim_b dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify using a linear model.
The residuals of this linear model seem to have a relatively normal distribution and constant variance. However, the scatterplot of residuals shows that higher values of x conform more closely to the linear model than lower values of x, which suggests that a nonlinear relationship may describe x and y more accurately than a linear one.
The code below fits a linear model to the sim_c dataset and plots the fitted line over the data to show how well the linear model represents the dataset. It then plots the residuals in a scatterplot and a frequency polygon to check whether the variability of the residuals is constant, which is one of the conditions that must be satisfied to justify using a linear model.
The data points in the sim_c_simdata dataset seem to follow a linear model; however, the residuals do not seem to have equal variance, nor are they normally distributed. The distribution of residuals seems slightly skewed to the right.
Question 5.
The following code shows an alternative to the least-squares criterion: the mean-absolute distance criterion, which averages the absolute values of the residuals instead of their squares. This is implemented by using the optim() function in combination with the custom function 'make_prediction' shown below.
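A minimal sketch of this approach, using only base R, is given here. The data frame sim_demo, the helper measure_distance, the body of make_prediction, and the starting values c(0, 0) are assumptions for illustration; only the name 'make_prediction' and the use of optim() come from the text.

```r
# Synthetic data: a known line plus one large outlier (illustrative only).
set.seed(42)
sim_demo <- data.frame(x = 1:20)
sim_demo$y <- 3 + 2 * sim_demo$x + rnorm(20, sd = 1)
sim_demo$y[20] <- sim_demo$y[20] + 40

# Predicted y for intercept a[1] and slope a[2].
make_prediction <- function(a, data) {
  a[1] + a[2] * data$x
}

# Mean absolute distance between the observations and the predictions.
measure_distance <- function(a, data) {
  mean(abs(data$y - make_prediction(a, data)))
}

# optim() searches for the intercept and slope that minimise the distance.
mad_fit <- optim(c(0, 0), measure_distance, data = sim_demo)

# Least-squares fit for comparison.
ls_fit <- lm(y ~ x, data = sim_demo)

mad_fit$par   # intercept and slope under the mean-absolute criterion
coef(ls_fit)  # intercept and slope under least squares
```

The two sets of coefficients can then be plotted as two lines over the same scatterplot, for example with geom_abline(), to compare the criteria.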
The code uses the optim() function to fit the "sim_a", "sim_b", and "sim_c" datasets from simdata using the mean absolute distance. The plot for the sim_a dataset shows both the mean-absolute distance and least-squares lines.
For this dataset, the larger residuals do not seem to have affected the fit: the least-squares regression line and the mean-absolute distance line are nearly identical.
For sim_b_simdata, the mean-absolute distance line also matches the least-squares regression line fairly well.
For the sim_c dataset, the mean-absolute distance line and the least-squares regression line differ substantially. Least squares squares each residual, so a few large residuals pull its line more strongly than they pull the mean-absolute distance line.
Question 6.
Another alternative to lm() is to fit a smooth curve using the loess() function. Follow the procedure in Chapter 23.3 of R for Data Science of model fitting, grid generation, predictions, and visualization on the sim1 dataset (loaded as part of the modelr library). The plot should also include the least-squares line for comparison. Calculate the residuals and frequency polygon. How does this compare with the least-squares line? How does it compare with the default method of geom_smooth()?
The loess() curve seems to fit fairly close to the least-squares regression line and is just a little more curved at the lower and upper values of x. A visual inspection suggests that the residuals from both methods have roughly constant variance, but neither set of residuals appears normally distributed, and the shape of the residual distribution differs between the two methods.
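The loess() function does not produce as linear a fit as geom_smooth() for the sim1 dataset; the loess() fit seems to be more sensitive to the pattern in the data than either geom_smooth() or the least-squares regression model. Note that for fewer than 1,000 observations geom_smooth() defaults to method = "loess", so any visible difference between the two curves likely comes from how each curve is evaluated and drawn rather than from a different fitting method.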
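The procedure for this question can be sketched as follows on the sim1 dataset from modelr. The column names pred_loess and pred_ls, the colours, and the binwidth are arbitrary choices for this example.

```r
library(modelr)
library(ggplot2)

# sim1 ships with modelr and has columns x and y.
loess_model <- loess(y ~ x, data = sim1)
ls_model <- lm(y ~ x, data = sim1)

# Grid of unique x values with a prediction column from each model.
grid <- data_grid(sim1, x)
grid <- add_predictions(grid, loess_model, var = "pred_loess")
grid <- add_predictions(grid, ls_model, var = "pred_ls")

# Data, loess curve (red), least-squares line (blue), and the default
# geom_smooth() curve (dashed black) on one plot.
fit_plot <- ggplot(sim1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = pred_loess), data = grid, colour = "red") +
  geom_line(aes(y = pred_ls), data = grid, colour = "blue") +
  geom_smooth(se = FALSE, colour = "black", linetype = "dashed")
print(fit_plot)

# Residuals from the loess fit, shown as a frequency polygon.
sim1_resid <- add_residuals(sim1, loess_model)
resid_plot <- ggplot(sim1_resid, aes(x = resid)) +
  geom_freqpoly(binwidth = 0.5)
print(resid_plot)
```

When the plot is printed, geom_smooth() reports which method it selected, which helps when comparing its curve with the explicit loess() fit.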