 CoCalc Shared FilesCDS-102 / Lab Week 13 - Temperature data revisited, part 2: the sine function / CDS-102 Lab Week 13 Workbook.ipynb
Author: Helena Gray
Views : 4
Description: Jupyter notebook CDS-102/Lab Week 13 - Temperature data revisited, part 2: the sine function/CDS-102 Lab Week 13 Workbook.ipynb

# CDS-102: Lab 13 Workbook

## Name: Helena Gray

### April 26, 2017

In :
# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
library(tidyverse)
library(modelr)
# the model.sin function

In :
# To change the size of any plots, copy the code snippet
# below, uncomment it, and set the size of the width
# and height.
# Note: All subsequent figures will use the same size,
# unless you change the options() snippet and run it
# again.

# options(repr.plot.width=6, repr.plot.height=4)


The code below creates the number_days column and remove any missing values from the dataset.

In :
ndays <- nrow(t.data)
t.data.filtered <- filter(t.data, t.avg > -99)



The code below displays a plot of t.avg as a function of number_days.

In :
sine.function<-ggplot(t.data.filtered) + geom_point(aes(x=number_days, y=t.avg), size=0.5)
ggsave("sine_function.png", plot = sine.function, device="png", scale=1, width=5, height=4)
sine.function The code below filters the dataset to only include temperatures from the years 1995 and 1996, assigns this to a variable named t.data.y9596, and then plots t.avg as a function of number_days.

In :
 t.data.y9596<-filter(t.data.filtered, c(year==1995|year==1996))

sine.function9596<-ggplot(t.data.y9596) + geom_point(aes(x=number_days, y=t.avg), size=0.5)
ggsave("sine_function9596.png", plot = sine.function9596, device="png", scale=1, width=5, height=4)
sine.function9596 In the northern hemisphere we tend to be hottest in July and coldest in January. The difference between January 1st and July 1st is roughly 180 days. The time it takes the Earth to make one rotation around the sun is roughly 365 days or a year. Thus we will assign this number of days to a variable named model.T. In the context of this dataset, T has a specific meaning: it is the number of days that will pass before the pattern for the daily average temperatures begins to repeat itself. This is what the variable model.T will represent.

In :
model.T<-365


The code below creates models for n=1, n=2, n=3, n=4, n=5, which are various values of the phase shift and assign them to variables with names like mod_n_1, mod_n_2 using a custom function model.sin() which takes three inputs: the period T (given in units of days), the phase shift n (given in units of months), and x is the explanatory variable, in this case the number of days since January 1, 1995.

In :
mod_n_0 <- lm(t.avg~model.sin(T=model.T, n=0, x=number_days),
data=t.data.filtered)

mod_n_1 <- lm(t.avg~model.sin(T=model.T, n=1, x=number_days),
data=t.data.filtered)

mod_n_2 <- lm(t.avg~model.sin(T=model.T, n=2, x=number_days),
data=t.data.filtered)

mod_n_3 <- lm(t.avg~model.sin(T=model.T, n=3, x=number_days),
data=t.data.filtered)

mod_n_4 <- lm(t.avg~model.sin(T=model.T, n=4, x=number_days),
data=t.data.filtered)

mod_n_5 <- lm(t.avg~model.sin(T=model.T, n=5, x=number_days),
data=t.data.filtered)



The following code plots the models overlaid on top of the temperature data for years 1995 and 1996. It creates a tibble of model predictions using the data_grid() and gather_predictions() functions and then plots the values in this tibble by model.

In :
grid <- data_grid(data=t.data.y9596,
number_days=seq_range(number_days, n=1000,
expand=0.05))
grid <- gather_predictions(grid, mod_n_0, mod_n_1, mod_n_2,
mod_n_3, mod_n_4,
mod_n_5, .pred="t.avg")


In :
diff.months.sine.function<-ggplot(t.data.y9596) +
geom_point(aes(number_days, t.avg)) +
geom_line(data=grid, aes(number_days, t.avg),
color="red", size=1) +
facet_wrap(~model)

ggsave("sineFunctionMonths.png", plot = diff.months.sine.function, device="png", scale=1, width=5, height=4)
diff.months.sine.function The code below quantifies the quality of different models using the R2 parameter, which is extracted directly from the model variables.

In :
summary(mod_n_0)$adj.r.squared sd(summary(mod_n_0)$residuals)

summary(mod_n_1)$adj.r.squared sd(summary(mod_n_1)$residuals)

summary(mod_n_2)$adj.r.squared sd(summary(mod_n_2)$residuals)

summary(mod_n_3)$adj.r.squared sd(summary(mod_n_3)$residuals)

summary(mod_n_4)$adj.r.squared sd(summary(mod_n_4)$residuals)

summary(mod_n_5)$adj.r.squared sd(summary(mod_n_5)$residuals)

0.118288503785666
16.0387712780537
0.0138997638112263
16.9616607978321
0.302097593503599
14.2693828523976
0.694422195979437
9.44210325198461
0.797717251700635
7.68222928794176
0.50978190146657
11.9592105773343

The code below improves the choice of n by repeating the same procedure as in Task 5 and 6 for an n value that is between the two “best” values from the previous task which would be 3.5. The code follows the model fitting steps again, calculates R2 and the standard deviation of the residuals, and assigns this parameter to a variable named model.n.

In :

mod_n_3.5 <- lm(t.avg~model.sin(T=model.T, n=3.5, x=number_days),
data=t.data.filtered)

grid <- data_grid(data=t.data.y9596,
number_days=seq_range(number_days, n=1000,
expand=0.05))

print(grid)

perfectSine<-ggplot(t.data.y9596) +
geom_point(aes(number_days, t.avg)) +
geom_line(data=grid, aes(number_days, pred),
color="red", size=1)
ggsave("perfectSine.png", plot = perfectSine, device="png", scale=1, width=5, height=4)
perfectSine

summary(mod_n_3.5)$adj.r.squared sd(summary(mod_n_3.5)$residuals)

model.n<-mod_n_3.5

# A tibble: 1,000 × 2 number_days pred <dbl> <dbl> 1 -17.25000 37.88593 2 -16.48273 37.73637 3 -15.71547 37.59002 4 -14.94820 37.44693 5 -14.18093 37.30710 6 -13.41366 37.17056 7 -12.64640 37.03735 8 -11.87913 36.90748 9 -11.11186 36.78097 10 -10.34459 36.65784 # ... with 990 more rows
0.79871315879651
7.66329483397451