CoCalc Shared FilesCDS-102 / Lab Week 09 - Statistical distribution of temperatures in Washington, DC / CDS-102 Lab Week 09 Workbook.ipynb
Author: Helena Gray
Views : 6
Description: Jupyter notebook CDS-102/Lab Week 09 - Statistical distribution of temperatures in Washington, DC/CDS-102 Lab Week 09 Workbook.ipynb

# CDS-102: Lab 9 Workbook

## Helena Gray

### March 30, 2017

In [2]:
# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
library(tidyverse)
# The dataset is in the file "MDWASHDC_JAN1995_DEC2016.csv"

monthdayyeart.avg
1 1 199540.6
1 2 199539.8
1 3 199529.3
1 4 199533.0
1 5 199520.9
1 6 199527.0

The code below generates a summary statistics (mean(), median(), min(), max(), and sd()) report of the average temperature grouped by month using the summarise() function.

In [4]:

by_month<-group_by(dc.temps,month)

temps.table<-summarise(by_month,
mean=mean(t.avg), max=max(t.avg),min=min(t.avg),med=median(t.avg), sd=sd(t.avg))
temps.table

monthmeanmaxminmedsd
1 36.19589 63.1 -99.0 36.10 11.903005
2 38.54678 62.8 11.9 38.55 8.666396
3 46.83534 76.4 -99.0 46.60 10.634093
4 56.95727 79.8 -99.0 56.90 10.009018
5 66.27434 85.5 50.2 65.95 7.144027
6 74.23167 89.6 -99.0 75.55 14.716578
7 79.70176 92.8 65.5 79.80 4.813713
8 77.95806 91.0 -99.0 78.20 10.525838
9 71.11848 86.1 -99.0 72.10 11.103055
10 59.88578 79.5 40.4 59.95 7.446588
11 49.30621 70.7 28.2 49.30 7.769721
12 40.26364 65.9 -99.0 40.75 13.539830

The code below plots the Probability Mass Function (PMF) histogram of the average daily temperatures in the full dataset for each month of the year using the geom_histogram() and ggplot() functions. It uses the facet_wrap() function to create this as a 12 panel plot.

In [10]:
all.months.temps<-ggplot(dc.temps) +
geom_histogram(mapping = aes(x =t.avg, y = ..density..),
binwidth = 1, fill = "cyan3", color = "cyan4") + facet_wrap(~month)

ggsave("all.months.temps.png", plot = all.months.temps, device="png", scale=1, width=5, height=4)
all.months.temps


{"output_type":"display_data"}

The code below creates the normal distribution model for the month of June (all years) using the summary statistics computed in task 1 by generating the Probability Density Function (PDF). It then stores the computed values of the model in a new two-column tibble named jun.model.

In [15]:
dc.temps.june<-filter(dc.temps,month==6)

jun.pdf<-dnorm(x = dc.temps.june$t.avg, mean =74.23167, sd = 14.716578) jun.model<-tibble(temps=dc.temps.june$t.avg,PDF=jun.pdf)


74.2316666666667
14.7165782544198

The code below creates a new plot containing the average daily temperature PMF histogram and the normal distribution model for June (all years). Note whether or not the model visually agrees with the histogram.

In [23]:
jun.ggplot<-ggplot(data=dc.temps.june)  + geom_histogram(binwidth = .5, mapping = aes(x=t.avg), alpha=.5)

data.ggplot.full <- ggplot_build(jun.ggplot)
data.ggplot.table <- data.ggplot.full$data[[1]] histogram.table <- tibble(x = data.ggplot.table$x, density = data.ggplot.table$density, frequency = data.ggplot.table$count)

mean.june<-mean(dc.temps.june$t.avg) sd.june<-sd(dc.temps.june$t.avg)

options(repr.plot.width = 6, repr.plot.height = 4)
data.ggplot.june<-ggplot(data=histogram.table)  + geom_col(mapping = aes(x=x, y=density), alpha=.5) + stat_function(fun=dnorm, args=list(mean=mean.june,sd=sd.june), color= "red")

ggsave("data.ggplot.june.png", plot = data.ggplot.june, device="png", scale=1, width=5, height=4)
data.ggplot.june

{"output_type":"display_data"}

The code below creates a qqplot for the average temperature distribution in June. A theoretical line is computed and included for comparison.

In [21]:
# Find the 1st and 3rd quartiles (0.25 and 0.75 percentiles)
qq_y <- quantile(dc.temps.june\$t.avg, c(0.25, 0.75))
# Find the matching normal values on the x-axis
qq_x <- qnorm(c(0.25, 0.75))
# Compute line slope
qq_slope <- diff(qq_y) / diff(qq_x)
# Compute line intercept
qq_int <- qq_y[1] - qq_slope * qq_x[1]

qqplot.june<-ggplot(dc.temps.june) +
geom_qq(aes(sample = t.avg), color = "cyan3") +
geom_abline(intercept = qq_int, slope = qq_slope, color = "black")

ggsave("qqplot.june.png", plot = qqplot.june, device="png", scale=1, width=5, height=4)
qqplot.june

{"output_type":"display_data"}

The code below creates a 12 panel series of qqplots (without theoretical lines) for each month (all years) using facet_wrap(). Note whether the trend for June applies to the other months.

In [22]:
all.months.qqplot<-ggplot(dc.temps) + geom_qq(aes(sample = t.avg), color = "cyan3") + facet_wrap(~month)
ggsave("all.months.qqplot.png",plot = all.months.qqplot, device="png", scale=1, width=5, height=4)
all.months.qqplot

{"output_type":"display_data"}

The normal distribution model is used to compute the temperature of the 0.10 percentile for the month of June (all years) using the qnorm() function.

In [18]:
# The top 90% of temperatures are the temperatures in the 10th percentile or higher

june.mean =74.23167
june.sd = 14.716578
june.p10 <- qnorm(p = 0.10, mean = june.mean, sd = june.sd)
june.p10

55.3716164246408

The normal distribution model is used to compute the percentile of the temperature 83◦F for the month of June (all years) using the pnorm() function.

In [24]:
pnorm(q = 83, mean = june.mean, sd = june.sd)

0.72434995525201

## Key Questions##

For the month of June, what is the probability that any given day will have a temperature of 83◦F or higher? The code below uses the pnorm() function to find this probability.

How cold are the coldest 10% of days? The qnorm() function is used to find this average temperature of the 10% coldest days.

In [26]:
pnorm(83, mean=june.mean, sd=june.sd, lower.tail=FALSE)
qnorm(0.1, mean=june.mean, sd=june.sd)


0.27565004474799
55.3716164246408

Report the mean for the month of March with a 68% and a 95% confidence interval.

In [27]:
ci.95<- 2* june.sd

cat("The 95% confidence interval for the unfiltered dataset is ", june.mean, "+-",ci.95,"\n")
cat("The 68% confidence interval for the unfiltered dataset is ", june.mean, "+-",june.sd)

june.mean+ci.95
june.mean-ci.95

june.mean+june.sd
june.mean-june.sd

The 95% confidence interval for the unfiltered dataset is 74.23167 +- 29.43316 The 68% confidence interval for the unfiltered dataset is 74.23167 +- 14.71658
103.664826
44.798514
88.948248
59.515092
In [ ]: