Project: Helena Gray - Introduction to Computational and Data Sciences (Spring 2017)

CDS-102: Lab 9 Report

"Statistical distribution of temperatures in Washington, DC"

Helena Gray

April 5, 2017

Link to workbook: Lab 9 Workbook

Procedure

Load the "tidyverse" package, import the dataset using read.csv(), and display the first rows of the imported data.

# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
library(tidyverse)
# The dataset is in the file "MDWASHDC_JAN1995_DEC2016.csv"
dc.temps<-read.csv("MDWASHDC_JAN1995_DEC2016.csv")
head(dc.temps)

Lab Task 1

The code below reports summary statistics (mean(), median(), min(), max(), and sd()) for the average daily temperature, grouped by month. group_by() groups the rows by month, summarise() computes the summary statistics for each month, and the resulting table is assigned to the variable "temps.table".

by_month<-group_by(dc.temps,month)

temps.table<-summarise(by_month,
 mean=mean(t.avg), max=max(t.avg),min=min(t.avg),med=median(t.avg), sd=sd(t.avg))
temps.table
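
The same grouped summary can also be written as a single pipeline. The sketch below is a minimal illustration on a small made-up data frame (toy.temps and its values are hypothetical, not taken from the DC dataset); it assumes the dplyr package, which is loaded as part of the tidyverse.

```r
library(dplyr)

# Hypothetical toy data: two months, three daily averages each
toy.temps <- data.frame(
  month = c(1, 1, 1, 6, 6, 6),
  t.avg = c(30, 35, 40, 70, 75, 80)
)

# Group by month, then compute one row of summary statistics per group
toy.table <- toy.temps %>%
  group_by(month) %>%
  summarise(
    mean = mean(t.avg),
    med  = median(t.avg),
    min  = min(t.avg),
    max  = max(t.avg),
    sd   = sd(t.avg)
  )

toy.table
```

Each month collapses to a single row, so the full dataset would yield a 12-row table.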

Lab Task 2

The code below plots the probability mass function (PMF) histogram of the average daily temperatures for each month of the year, pooling all years in the dataset. It uses facet_wrap() to arrange the result as a 12-panel plot.

# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
all.months.temps <- ggplot(dc.temps) +
  geom_histogram(mapping = aes(x = t.avg, y = after_stat(density)), binwidth = 1, fill = "cyan3", color = "cyan4") +
  facet_wrap(~month)

ggsave("all.months.temps.png", plot = all.months.temps, device="png", scale=1, width=5, height=4)
all.months.temps
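
With a bin width of 1, the density of each bar equals the proportion of observations in that bin, which is why the density-scaled histogram can be read as a PMF. The sketch below checks this with base R's hist() on simulated values; sim.temps is a hypothetical stand-in, not the DC data.

```r
set.seed(1)
# Hypothetical stand-in for one month of daily average temperatures
sim.temps <- round(rnorm(500, mean = 74, sd = 6), 1)

# Bins of width 1 degree spanning the data
breaks <- seq(floor(min(sim.temps)), ceiling(max(sim.temps)), by = 1)
h <- hist(sim.temps, breaks = breaks, plot = FALSE)

# density = count / (n * bin width); with width 1 this is just the proportion
proportions <- h$counts / length(sim.temps)
all.equal(h$density, proportions)

# The bar heights therefore sum to 1
sum(h$density)
```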

Lab Task 3

The code below builds the normal distribution model for June (all years) from the summary statistics computed in Task 1 by evaluating the probability density function (PDF) at each observed June temperature. It then stores the temperatures and the corresponding model values in a new two-column tibble named jun.model.

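# Keep only the June observations (month 6)
dc.temps.june <- filter(dc.temps, month == 6)
# Evaluate the normal PDF at each observed temperature, using the June mean and sd from Task 1
jun.pdf <- dnorm(x = dc.temps.june$t.avg, mean = 74.23167, sd = 14.716578)
# Store the temperatures and the model values side by side
jun.model <- tibble(temps = dc.temps.june$t.avg, PDF = jun.pdf)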
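
Evaluating the model at the observed temperatures gives one PDF value per observation. For drawing a smooth curve, it can be convenient to evaluate the same PDF on an evenly spaced grid instead. The sketch below reuses the mean and standard deviation from the code above; the grid range, the step of 0.5, and the names temp.grid, grid.pdf, and grid.model are assumptions made for illustration.

```r
june.mean <- 74.23167
june.sd   <- 14.716578
step      <- 0.5

# Evenly spaced grid covering about four standard deviations either side of the mean
temp.grid <- seq(june.mean - 4 * june.sd, june.mean + 4 * june.sd, by = step)
grid.pdf <- dnorm(temp.grid, mean = june.mean, sd = june.sd)
grid.model <- data.frame(temps = temp.grid, PDF = grid.pdf)

# Riemann-sum check: the area under the curve should be close to 1
sum(grid.pdf) * step

# The largest PDF value sits at the grid point nearest the mean
temp.grid[which.max(grid.pdf)]
```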

Lab Task 4

The code below creates a new plot containing the average daily temperature PMF histogram and the normal distribution model for June (all years). The distribution appears approximately normal; however, the model does not fit the data perfectly, as the observed temperatures are noticeably higher than the normal model would suggest. The PMF also appears slightly narrower than the normal model.

jun.ggplot<-ggplot(data=dc.temps.june)  + geom_histogram(binwidth = .5, mapping = aes(x=t.avg), alpha=.5)

data.ggplot.full <- ggplot_build(jun.ggplot)
data.ggplot.table <- data.ggplot.full$data[[1]]
histogram.table <- tibble(x = data.ggplot.table$x, density = data.ggplot.table$density, frequency = data.ggplot.table$count)

mean.june<-mean(dc.temps.june$t.avg)
sd.june<-sd(dc.temps.june$t.avg)


options(repr.plot.width = 6, repr.plot.height = 4)
data.ggplot.june<-ggplot(data=histogram.table)  + geom_col(mapping = aes(x=x, y=density), alpha=.5) + stat_function(fun=dnorm, args=list(mean=mean.june,sd=sd.june), color= "red")


ggsave("data.ggplot.june.png", plot = data.ggplot.june, device="png", scale=1, width=5, height=4)
data.ggplot.june

Lab Task 5

The code below creates a qqplot for the average temperature distribution in June (all years). It finds the 1st and 3rd quartiles, computes the slope and intercept of the line through them, and draws that theoretical line on the plot for comparison. The temperature data conform fairly well to this theoretical normal distribution, save for a few outliers.

# Find the 1st and 3rd quartiles (0.25 and 0.75 percentiles)
qq_y <- quantile(dc.temps.june$t.avg, c(0.25, 0.75))
# Find the matching normal values on the x-axis
qq_x <- qnorm(c(0.25, 0.75))
# Compute line slope
qq_slope <- diff(qq_y) / diff(qq_x)
# Compute line intercept
qq_int <- qq_y[1] - qq_slope * qq_x[1]


ggplot(dc.temps.june) +
geom_qq(aes(sample = t.avg), color = "cyan3") +
geom_abline(intercept = qq_int, slope = qq_slope, color = "black")
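
For data that really are normal, the line through the first and third quartiles has a slope close to the standard deviation and an intercept close to the mean, which is what makes it a useful reference. The sketch below illustrates this on a simulated sample; sim and its parameters (mean 70, sd 5) are hypothetical and unrelated to the DC data.

```r
set.seed(42)
# Hypothetical sample drawn from a normal distribution (not the DC data)
sim <- rnorm(5000, mean = 70, sd = 5)

# Same construction as above: line through the quartile pairs
qq_y <- quantile(sim, c(0.25, 0.75))   # sample quartiles (y-axis)
qq_x <- qnorm(c(0.25, 0.75))           # standard normal quartiles (x-axis)
qq_slope <- diff(qq_y) / diff(qq_x)
qq_int <- qq_y[1] - qq_slope * qq_x[1]

# For normal data the slope estimates sd and the intercept estimates the mean
unname(qq_slope)
unname(qq_int)
```

Points that bend away from this line at either end indicate tails that are heavier or lighter than the normal model predicts.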

Lab Task 6

The code below creates a 12-panel series of qqplots (without theoretical lines), one for each month (all years), using facet_wrap(). The trend line for June appears to fit the data relatively well for May through September; the other months appear to have slightly lower temperatures than the normal distribution for June, with a few outliers.

all.months.qqplot<-ggplot(dc.temps) + geom_qq(aes(sample = t.avg), color = "cyan3") + facet_wrap(~month)
ggsave("all.months.qqplot.png",plot = all.months.qqplot, device="png", scale=1, width=5, height=4)
all.months.qqplot

Lab Task 7

The normal distribution model is used to compute the temperature at the 10th percentile (the 0.10 quantile) for June (all years) using the qnorm() function.

# The top 90% of temperatures are the temperatures in the 10th percentile or higher

june.mean <- 74.23167
june.sd <- 14.716578
june.p10 <- qnorm(p = 0.10, mean = june.mean, sd = june.sd)
june.p10
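
As a cross-check, the same value can be obtained by hand: find the z-score that leaves 10% of a standard normal distribution below it, then rescale by the June mean and standard deviation. This is a minimal sketch using the values above; the names z10, by.hand, and direct are introduced here for illustration.

```r
june.mean <- 74.23167
june.sd   <- 14.716578

# z-score that leaves 10% of a standard normal below it (about -1.28)
z10 <- qnorm(0.10)

# Rescale to the June model: x = mean + z * sd
by.hand <- june.mean + z10 * june.sd
direct  <- qnorm(0.10, mean = june.mean, sd = june.sd)

c(by.hand = by.hand, direct = direct)  # both about 55.37
```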

Lab Task 8

The normal distribution model is used to compute the percentile of the temperature 83°F for the month of June (all years) using the pnorm() function.

pnorm(q = 83, mean = june.mean, sd = june.sd)
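
pnorm() returns the lower tail by default, that is, the probability of a value at or below 83°F. The probability of a day at or above 83°F is the complementary upper tail, which can be requested directly with lower.tail = FALSE. The sketch below uses the June values above; the names p.below and p.above are introduced here for illustration.

```r
june.mean <- 74.23167
june.sd   <- 14.716578

# Probability of a day at or below 83 degrees F under the model
p.below <- pnorm(83, mean = june.mean, sd = june.sd)

# Probability of a day at or above 83 degrees F: the complementary upper tail
p.above <- pnorm(83, mean = june.mean, sd = june.sd, lower.tail = FALSE)

c(below = p.below, above = p.above)
```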

Summary of Results

The first result from the Washington, DC temperature data is the set of 12 plots created in Lab Task 2, showing the probability mass function for each of the 12 months with all years pooled. Each plot shows how likely a given temperature is to occur. The higher the bar on the histogram, the more likely that temperature is to occur during that month.

PMF All Months DC Temps

For Lab Task 4, the PMF of June temperatures in DC for all years was plotted together with the normal distribution model for comparison. The central peak is higher than the normal model and the PMF is a bit narrower on both sides compared to the normal distribution model.

PMF June with Normal Model DC Temps

For Lab Task 5, a qqplot was created for the average temperature distribution in June for all years. A theoretical line representing the normal distribution is included. The data seem to conform to the normal distribution fairly well aside from a few outliers.

QQPlot June All Years DC Temps

For Lab Task 6, a qqplot was created for the average temperature distribution for each month (all years). Separate plots were created for each month. All of the temperatures appear to conform to a normal distribution save for a few outliers.

QQPlot All Months DC Temps

For Lab Task 7, the normal model puts the 10th percentile of June temperatures in DC at 55.4 degrees Fahrenheit. This means that, under the model, about 10% of June daily average temperatures across all recorded years are expected to fall below this value.

For Lab Task 8, a temperature of 83 degrees Fahrenheit falls at the 72nd percentile for June of all years. This means that, under the normal model, about 72% of June daily average temperatures across all recorded years are expected to fall below 83 degrees Fahrenheit.

Key Questions

Are the daily temperatures for each month nearly normal, such that the normal distribution is a good model? Why or why not?

The daily temperatures appear to be nearly normal from November to April, except for a few outliers in each month. However, from May to September the temperatures seem to be higher than the normal distribution would suggest (based on the histograms and qqplots for each month).

For the month of June, what is the probability that any given day will have a temperature of 83°F or higher? How cold are the coldest 10% of days?

Under the normal model, the probability of observing a temperature of 83 degrees Fahrenheit or higher on any given day in June is about 27.6% (one minus the lower-tail probability from Lab Task 8). The coldest 10% of days in June in DC are at or below about 55.37 degrees Fahrenheit.

In a normal distribution, 68% of the data falls between one standard deviation above and below the mean (µ ± σ) and 95% falls between two standard deviations above and below the mean (µ ± 2σ). Report the mean for the month of March with a 68% and a 95% confidence interval and compare this with the available daily temperatures from March 2017 in Table 1 below. How many temperatures lie outside the 68% confidence interval? Are there any outside the 95% confidence interval? After comparing with confidence intervals, would you conclude that this month’s sequence of temperatures is somewhat unusual, or well within the realm of the normal distribution model?

Only 3 days in March 2017 had temperatures within the 68% confidence interval; the rest fell outside it. Twelve days fell outside the 95% confidence interval. This month's sequence of temperatures is therefore unusual, lying mostly outside what the normal distribution model would predict.
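
A comparison of this kind can be sketched as a small helper that counts observations outside µ ± kσ. The function name count_outside and all numbers below (mu, sigma, obs) are hypothetical placeholders chosen for illustration; they are not the March statistics or the Table 1 temperatures.

```r
# Count how many observations fall outside mean +/- k standard deviations
count_outside <- function(x, mu, sigma, k) {
  sum(x < mu - k * sigma | x > mu + k * sigma)
}

# Hypothetical model parameters and observations (placeholders, not lab data)
mu <- 50
sigma <- 10
obs <- c(38, 45, 52, 61, 75, 29, 50)

n.out.68 <- count_outside(obs, mu, sigma, k = 1)  # outside mu +/- sigma
n.out.95 <- count_outside(obs, mu, sigma, k = 2)  # outside mu +/- 2 * sigma
c(outside.68 = n.out.68, outside.95 = n.out.95)

# Share of a normal distribution within 1 and 2 standard deviations of the mean
pnorm(1) - pnorm(-1)
pnorm(2) - pnorm(-2)
```

If a month followed the model, roughly a third of its days would be expected outside the one-sigma interval and roughly one day in twenty outside the two-sigma interval.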