CDS-102 Lab Week 09 Report: Statistical distribution of temperatures in Washington, DC

Link to workbook: Lab 9 Workbook

Load the "tidyverse" package, import the dataset using read_csv, and display the top of the imported data.

```
# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
library(tidyverse)
# The dataset is in the file "MDWASHDC_JAN1995_DEC2016.csv"
dc.temps <- read_csv("MDWASHDC_JAN1995_DEC2016.csv")
head(dc.temps)
```

The code below generates a summary statistics report (mean(), median(), min(), max(), and sd()) of the average temperature for each month, using the group_by() and summarise() functions. group_by() groups the rows by month, summarise() computes the summary statistics within each group, and the resulting table is assigned to the variable "temps.table".

```
by_month<-group_by(dc.temps,month)
temps.table<-summarise(by_month,
mean=mean(t.avg), max=max(t.avg),min=min(t.avg),med=median(t.avg), sd=sd(t.avg))
temps.table
```
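The same grouping and summarising steps can also be written as a single pipeline. The block below is a minimal sketch, not the workbook's code: `toy.temps` and its values are made up for illustration and only stand in for `dc.temps`.

```r
library(dplyr)

# Hypothetical toy data standing in for dc.temps (values are made up)
toy.temps <- data.frame(
  month = c(1, 1, 1, 6, 6, 6),
  t.avg = c(30, 35, 40, 70, 75, 80)
)

# Group the rows by month, then compute one row of statistics per group
toy.table <- toy.temps %>%
  group_by(month) %>%
  summarise(
    mean = mean(t.avg),
    med  = median(t.avg),
    min  = min(t.avg),
    max  = max(t.avg),
    sd   = sd(t.avg)
  )
toy.table
```

The result has one row per month, which is the same shape as temps.table above.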

The code below plots the Probability Mass Function (PMF) histogram of the average daily temperatures in the full dataset for each month of the year for all years. It uses the facet_wrap() function to create this as a 12 panel plot.

```
# after_stat(density) replaces the deprecated ..density.. notation
all.months.temps <- ggplot(dc.temps) +
  geom_histogram(mapping = aes(x = t.avg, y = after_stat(density)), binwidth = 1, fill = "cyan3", color = "cyan4") +
  facet_wrap(~month)
ggsave("all.months.temps.png", plot = all.months.temps, device="png", scale=1, width=5, height=4)
all.months.temps
```
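A density-scaled histogram with a bin width of 1 can be read as a PMF because each bar's height then equals the proportion of observations in that bin. The block below is a small check of that relationship in base R, assuming simulated temperatures (the vector `sim.temps` is made up and is not the DC data).

```r
set.seed(1)
# Simulated temperatures, for illustration only
sim.temps <- rnorm(500, mean = 74, sd = 6)

# Integer-wide bins that cover the whole range of the data
breaks <- seq(floor(min(sim.temps)), ceiling(max(sim.temps)), by = 1)
h <- hist(sim.temps, breaks = breaks, plot = FALSE)

bin.width <- diff(h$breaks)[1]
proportions <- h$counts / length(sim.temps)

# With bin width 1, density and proportion coincide
all.equal(h$density, proportions)
# The bar areas (density times width) add up to 1
sum(h$density * bin.width)
```

With a different bin width the bar heights would no longer be proportions, although the bar areas would still sum to 1.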

The code below creates the normal distribution model for the month of June (all years). It evaluates the Probability Density Function (PDF) with dnorm() at each observed June temperature, using the June mean and standard deviation from the summary statistics computed in task 1, and stores the computed values in a new two-column tibble named jun.model.

```
dc.temps.june<-filter(dc.temps,month==6)
jun.pdf<-dnorm(x = dc.temps.june$t.avg, mean =74.23167, sd = 14.716578)
jun.model<-tibble(temps=dc.temps.june$t.avg,PDF=jun.pdf)
```

The code below creates a new plot containing the average daily temperature PMF histogram and the normal distribution model for June (all years). The histogram is roughly bell-shaped, but the normal model does not fit it perfectly: the observed central peak is noticeably higher than the model predicts, and the PMF is slightly narrower than the normal model.

```
jun.ggplot<-ggplot(data=dc.temps.june) + geom_histogram(binwidth = .5, mapping = aes(x=t.avg), alpha=.5)
data.ggplot.full <- ggplot_build(jun.ggplot)
data.ggplot.table <- data.ggplot.full$data[[1]]
histogram.table <- tibble(x = data.ggplot.table$x, density = data.ggplot.table$density, frequency = data.ggplot.table$count)
mean.june<-mean(dc.temps.june$t.avg)
sd.june<-sd(dc.temps.june$t.avg)
options(repr.plot.width = 6, repr.plot.height = 4)
data.ggplot.june<-ggplot(data=histogram.table) + geom_col(mapping = aes(x=x, y=density), alpha=.5) + stat_function(fun=dnorm, args=list(mean=mean.june,sd=sd.june), color= "red")
ggsave("data.ggplot.june.png", plot = data.ggplot.june, device="png", scale=1, width=5, height=4)
data.ggplot.june
```

The code below creates a Q-Q plot for the average temperature distribution in June (all years). It finds the 1st and 3rd quartiles of the data, computes the slope and intercept of the line through the matching normal quantiles, and draws that theoretical line on the plot for comparison. The temperature data conform fairly well to this theoretical normal distribution, save for a few outliers.

```
# Find the 1st and 3rd quartiles (0.25 and 0.75 percentiles)
qq_y <- quantile(dc.temps.june$t.avg, c(0.25, 0.75))
# Find the matching normal values on the x-axis
qq_x <- qnorm(c(0.25, 0.75))
# Compute line slope
qq_slope <- diff(qq_y) / diff(qq_x)
# Compute line intercept
qq_int <- qq_y[1] - qq_slope * qq_x[1]
ggplot(dc.temps.june) +
geom_qq(aes(sample = t.avg), color = "cyan3") +
geom_abline(intercept = qq_int, slope = qq_slope, color = "black")
```
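To make explicit what the points in a normal Q-Q plot are, the sketch below pairs the sorted sample with standard normal quantiles at evenly spaced probabilities. It uses simulated data (`sample.temps` is made up, not the June temperatures) and compares the result against base R's qqnorm().

```r
set.seed(42)
# Simulated temperatures, for illustration only
sample.temps <- rnorm(200, mean = 74, sd = 6)
n <- length(sample.temps)

# x-axis: standard normal quantiles at n evenly spaced probabilities
theoretical <- qnorm(ppoints(n))
# y-axis: the observed values in increasing order
observed <- sort(sample.temps)

qq.points <- data.frame(theoretical = theoretical, observed = observed)
head(qq.points)

# Base R's qqnorm() uses the same theoretical quantiles
base.qq <- qqnorm(sample.temps, plot.it = FALSE)
all.equal(sort(base.qq$x), theoretical)
```

If the sample is close to normal, these points fall near a straight line whose slope and intercept approximate the sample's standard deviation and mean.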

The code below creates a 12-panel series of Q-Q plots (without theoretical lines) for each month (all years) using facet_wrap(). Compared against the trend line for June, the data for May through September fit relatively well, while the other months show slightly lower temperatures than the June normal distribution, along with a few outliers.

```
all.months.qqplot<-ggplot(dc.temps) + geom_qq(aes(sample = t.avg), color = "cyan3") + facet_wrap(~month)
ggsave("all.months.qqplot.png",plot = all.months.qqplot, device="png", scale=1, width=5, height=4)
all.months.qqplot
```
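If a separate reference line for each month is wanted, ggplot2 provides geom_qq_line(), which by default draws the line through the 1st and 3rd quartiles of each panel's data. The block below is a hedged sketch on simulated data: `toy.temps` and its two months are made up for illustration and this is not part of the original lab code.

```r
library(ggplot2)

set.seed(7)
# Hypothetical two-month data set (simulated values, not the DC data)
toy.temps <- data.frame(
  month = rep(c(1, 6), each = 100),
  t.avg = c(rnorm(100, mean = 36, sd = 9), rnorm(100, mean = 74, sd = 6))
)

# One Q-Q panel per month, each with its own quartile-based reference line
toy.qq <- ggplot(toy.temps, aes(sample = t.avg)) +
  geom_qq(color = "cyan3") +
  geom_qq_line(color = "black") +
  facet_wrap(~month)
toy.qq
```

Because the line is computed per panel, each month is compared against its own normal model rather than against the June line.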

The normal distribution model is used to compute the temperature at the 10th percentile (the 0.10 quantile) for the month of June (all years) using the qnorm() function.

```
# The top 90% of temperatures are the temperatures in the 10th percentile or higher
june.mean =74.23167
june.sd = 14.716578
june.p10 <- qnorm(p = 0.10, mean = june.mean, sd = june.sd)
june.p10
```
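The qnorm() call above is equivalent to shifting and scaling a standard normal quantile, x = mean + z * sd. The following check uses the same June mean and standard deviation as the report; the variable names `z.10`, `by.hand`, and `by.call` are introduced here for illustration.

```r
june.mean <- 74.23167
june.sd <- 14.716578

# z-score of the 10th percentile on the standard normal scale
z.10 <- qnorm(0.10)

# Same quantile computed by hand and by qnorm() with mean and sd
by.hand <- june.mean + z.10 * june.sd
by.call <- qnorm(0.10, mean = june.mean, sd = june.sd)
all.equal(by.hand, by.call)

# Round trip: pnorm() maps the temperature back to a probability of 0.10
pnorm(by.call, mean = june.mean, sd = june.sd)
```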

The normal distribution model is used to compute the percentile of the temperature 83 °F for the month of June (all years) using the pnorm() function.

```
pnorm(q = 83, mean = june.mean, sd = june.sd)
```
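pnorm() returns the probability of a value at or below 83 °F. The probability of a value at or above 83 °F is the complement, which can be requested directly with lower.tail = FALSE. A short sketch using the same June mean and standard deviation (the names `p.below` and `p.above` are introduced here for illustration):

```r
june.mean <- 74.23167
june.sd <- 14.716578

# Probability of a daily average at or below 83 under the normal model
p.below <- pnorm(q = 83, mean = june.mean, sd = june.sd)
# Probability of a daily average above 83 (upper tail)
p.above <- pnorm(q = 83, mean = june.mean, sd = june.sd, lower.tail = FALSE)

p.below
p.above
# The two tails add up to 1
p.below + p.above
```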

As a final result from the Washington DC temperature data, the first figure, created in Lab Task 2, is a set of 12 panels showing the probability mass function for each month, pooled over all years. Each panel shows how likely a given temperature is to occur in that month: the higher the bar on the histogram, the more likely that temperature is.

For Lab Task 4, two plots were created to compare the PMF of June temperatures in DC for all years to the normal distribution model. The central peak is higher than the normal model and the PMF is a bit narrower on both sides compared to the normal distribution model.

For Lab Task 5, a qqplot was created for the average temperature distribution in June for all years. A theoretical line is included which represents the normal distribution. It seems to conform to the normal distribution fairly well aside from a few outliers.

For Lab Task 6, a qqplot was created for the average temperature distribution for each month (all years). Separate plots were created for each month. All of the temperatures appear to conform to a normal distribution save for a few outliers.

For Lab Task 7, the 10th percentile temperature for June in DC is about 55.4 degrees Fahrenheit. Under the normal model, about 10% of the daily average temperatures in June (all years) fall below this value.

For Lab Task 8, a temperature of 83 degrees Fahrenheit in June (all years) falls at about the 72nd percentile. Under the normal model, about 72% of the daily average temperatures in June are below 83 degrees Fahrenheit.

The daily temperatures appear to be nearly normal from November to April, except for a few outliers in each month. However, from May to September the temperatures seem to be higher than the normal distribution would suggest (based on the histogram and the Q-Q plot for each month).

Under the normal model, the probability of observing a temperature of 83 degrees Fahrenheit or higher on any given day in June is about 27.6%. The coldest 10% of days in June in DC have average temperatures at or below about 55.4 degrees Fahrenheit.

Only 3 days in March 2017 had temperatures inside the 68% interval of the normal model (roughly within one standard deviation of the mean); the rest were outside it. 12 days were outside the 95% interval (roughly two standard deviations from the mean). Most of this month's temperatures therefore fall outside the range the normal distribution model treats as typical.
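A count like this can be made by building the central interval from the model's mean and standard deviation and tallying the days inside it. The block below is only a sketch: the March mean, the standard deviation, and the temperature vector are all hypothetical placeholders (the report does not list them), and `count_inside()` is a helper name introduced here.

```r
# HYPOTHETICAL model parameters and temperatures, for illustration only
march.mean <- 47
march.sd <- 10
march.2017 <- c(30, 42, 55, 61, 70, 48, 25, 36, 58, 66)

# Count the values inside the central interval of a normal model
count_inside <- function(temps, mean, sd, level) {
  # Half-width in standard deviations (about 0.99 for 68%, 1.96 for 95%)
  k <- qnorm(1 - (1 - level) / 2)
  sum(temps >= mean - k * sd & temps <= mean + k * sd)
}

inside.68 <- count_inside(march.2017, march.mean, march.sd, level = 0.68)
outside.95 <- length(march.2017) -
  count_inside(march.2017, march.mean, march.sd, level = 0.95)

inside.68
outside.95
```

If the data followed the model, about 68% of days would be expected inside the first interval and about 5% outside the second, so counts far from those shares indicate a poor fit.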