# Set the location for R packages
.libPaths(new = "~/Rlibs")
# Load the Tidyverse packages
library(tidyverse)
# Load the nycflights dataset
flights <- read_csv("nycflights13.csv")
**Vector of numbers between 1 and 100 that are indivisible by 2, 3 and 7: (Spring 2017)**
For this task I created three variables: "v1", "indx", and "others".
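The code below creates variable "v1", which is a vector of the integers from 1 to 100. Then it creates variable "indx" and assigns it the value of (v1 %% 3) AND (v1 %% 2) AND (v1 %% 7). Each non-zero remainder counts as TRUE, so "indx" is TRUE only where the number is divisible by none of 2, 3, and 7. The code then creates variable "others" and assigns it the values of "v1" for which "indx" is TRUE (the double negation "!!indx" leaves the logical vector unchanged).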
v1 <- 1:100
indx <- v1 %% 3 & v1 %% 2 & v1 %% 7
(others <- v1[!!indx])
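As a sketch, the same selection can also be written with explicit comparisons, which may be easier to read than relying on non-zero remainders being treated as TRUE. The name "others_explicit" is introduced here only for illustration.

```r
v1 <- 1:100

# Keep values whose remainder is non-zero for each of 2, 3 and 7
others_explicit <- v1[v1 %% 2 != 0 & v1 %% 3 != 0 & v1 %% 7 != 0]

# The original form, for comparison
indx <- v1 %% 3 & v1 %% 2 & v1 %% 7
others <- v1[!!indx]

# Both forms select the same numbers
identical(others_explicit, others)
```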
Find flights that:
I'm creating a variable "Houston_Flights" and assigning to it the subset of flights whose destination is one of the two Houston airports (IAH or HOU).
Houston_Flights <- filter(flights, dest == "IAH" | dest == "HOU")
On the next line, I print the variable to confirm the result.
print(Houston_Flights)
I'm creating a variable "UAD" and assigning to it the subset of flights operated by United, American, and Delta.
The code below creates variable "UAD" and assigns it a filtered version of the dataset with all the records that have "AA" OR "UA" OR "DL" in the carrier column.
UAD <- filter(flights, carrier == "AA" | carrier == "UA" | carrier == "DL")
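A chain of OR conditions on the same column can be shortened with %in%. The sketch below uses a small made-up data frame ("toy") instead of the real dataset, so the rows and the names "houston" and "big3" are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for the flights data (made-up rows)
toy <- data.frame(
  carrier = c("AA", "UA", "DL", "B6", "WN"),
  dest    = c("IAH", "ORD", "HOU", "IAH", "LAX")
)

# Equivalent to dest == "IAH" | dest == "HOU"
houston <- filter(toy, dest %in% c("IAH", "HOU"))

# Equivalent to carrier == "AA" | carrier == "UA" | carrier == "DL"
big3 <- filter(toy, carrier %in% c("AA", "UA", "DL"))
```

The %in% form stays readable as the list of values grows, and it avoids repeating the column name.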
I'm creating a variable "Del_but_made" and assigning to it the subset of flights that departed at least an hour late but made up over 30 minutes in flight.
The code below creates variable "Del_but_made" and assigns it a filtered version of the dataset that contains flights with a dep_delay value of at least 60 minutes AND a difference between dep_delay and arr_delay of more than 30 minutes (the time made up in the air).
Del_but_made <- filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
To answer this question, I'm creating a variable "mid_6" with flights that departed between midnight and 6am (inclusive).
The code below creates variable "mid_6" and assigns it the subset of flights that have a departure time of 600 or below (flights that departed between midnight and 6:00 in the morning). A departure exactly at midnight may be recorded as 2400 rather than 0, so that value is included as well.
mid_6 <- filter(flights, dep_time <= 600 | dep_time == 2400)
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to the more convenient representation of number of minutes since midnight.
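The code below first creates variable "Times_Parsed", which adds two columns split out of "dep_time": "hour" (dep_time %/% 100) and "minute" (dep_time %% 100). It then creates variable "Dep_time_in_minutes" and assigns it a "transmuted" dataset with a column "dep_time" that contains the value of "hour" multiplied by 60 and added to the value of "minute". "DTIM" does the same conversion in a single step.
# dep_time is a clock value (HHMM), so it cannot simply be multiplied by 60
DTIM <- mutate(flights, min_dep_time = (dep_time %/% 100) * 60 + dep_time %% 100)
Times_Parsed <- mutate(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
# Read from Times_Parsed so that the hour and minute computed above are used
Dep_time_in_minutes <- transmute(Times_Parsed, dep_time = hour*60 + minute)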
The code below creates variable "Times_Parsed", which contains the dataset with two new columns: "sched_hour" and "sched_minute".
Times_Parsed <- mutate(flights,
sched_dep_time,
sched_hour = sched_dep_time %/% 100,
sched_minute = sched_dep_time %% 100
)
The code below creates variable "sched_dep_time_in_minutes" and assigns it the dataset with a new column "sched_dep_time_minutes", which contains the value of "sched_hour" multiplied by 60 and added to the value of "sched_minute". This gives the scheduled departure time as the total number of minutes since midnight.
sched_dep_time_in_minutes <- mutate(Times_Parsed, sched_dep_time_minutes = sched_hour*60 + sched_minute)
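Since the same conversion is applied to both "dep_time" and "sched_dep_time", it can be wrapped in a small helper. This is only a sketch: the function name "hhmm_to_minutes" is hypothetical, and the final %% 1440 assumes that midnight may be recorded as 2400 and should map to 0 minutes.

```r
# Convert a clock value in HHMM form (e.g. 517 for 5:17) to minutes since midnight
hhmm_to_minutes <- function(t) {
  # %/% 100 extracts the hours, %% 100 extracts the minutes;
  # %% 1440 wraps 2400 (24 * 60 = 1440 minutes) back to 0
  ((t %/% 100) * 60 + t %% 100) %% 1440
}

# Example clock values (made up for illustration)
converted <- hhmm_to_minutes(c(517, 1259, 2400, 1))
converted
```

With such a helper, both columns could be converted in a single mutate call, for example mutate(flights, dep_min = hhmm_to_minutes(dep_time), sched_dep_min = hhmm_to_minutes(sched_dep_time)).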
Compare air_time with the difference arr_time − dep_time. What did you expect to see? What do you need to do to fix it? Implement the fix.
After examining the data, 3 issues were identified with the arr_time - dep_time difference when compared with air_time:
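One way to set up the comparison is sketched below on a few made-up rows. It assumes that "dep_time" and "arr_time" are clock values in HHMM form and that an arrival clock time earlier than the departure means the flight landed on the next day. The data frame "toy", the helper "hhmm_to_min", and the new column names are introduced here for illustration and are not part of the dataset.

```r
library(dplyr)

# Made-up rows: a morning flight and a flight that lands after midnight
toy <- data.frame(
  dep_time = c(517, 2330),
  arr_time = c(830, 45),
  air_time = c(180, 60)
)

# Clock value (HHMM) to minutes since midnight
hhmm_to_min <- function(t) (t %/% 100) * 60 + t %% 100

compared <- mutate(toy,
  dep_min = hhmm_to_min(dep_time),
  arr_min = hhmm_to_min(arr_time),
  # %% 1440 adds a day when the arrival clock time is before the departure
  elapsed = (arr_min - dep_min) %% 1440,
  # Remaining gap between gate-to-gate clock time and time in the air
  diff = elapsed - air_time
)
compared
```

Any gap that remains in "diff" after this conversion has to come from something other than the clock format, so it is a useful starting point for looking at the remaining discrepancies.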
Consider the number of canceled flights per day in the dataset.
After reviewing the data, the following criteria would best fit cancelled flights: flights with NA for departure time OR flights with NA for arrival time OR flights with NA for air time. The dataset is filtered below.
To answer this question, I calculated the mean departure delay for cancelled flights and the mean departure delay for all the flights in the dataset (calculations below). The calculation indicates that the cancelled flights' mean delay was 36 minutes, while the entire dataset had a mean of 12.6 minutes, less than half that of the cancelled flights, which suggests that flight cancellation is related to the average delay.
The code below creates variable "cancellations" and assigns it a filtered dataset that includes only records with NA as their departure time OR NA for arrival time OR NA for air time.
cancellations <- filter(flights, is.na(dep_time) | is.na(arr_time) | is.na(air_time))
The code below calculates and displays the mean of the dep_delay column (excluding NA values) within the "cancellations" variable, which contains only cancelled flights.
summarise(cancellations, delay = mean(dep_delay, na.rm = TRUE))
The code below calculates and displays the mean of the dep_delay column (excluding NA values) within the entire dataset.
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
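Because the question asks about cancellations per day, the same cancellation criteria can also be summarised by day. The sketch below runs on a small made-up data frame ("toy"), and the column names "cancelled", "n_cancelled", "prop_cancelled", and "avg_delay" are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up rows covering two days
toy <- data.frame(
  year = 2013, month = 1, day = c(1, 1, 1, 1, 2, 2),
  dep_time  = c(517, 533, NA, 600, 700, NA),
  arr_time  = c(830, 850, NA, 900, NA, NA),
  air_time  = c(180, 190, NA, 170, NA, NA),
  dep_delay = c(2, 4, NA, 0, 30, NA)
)

# Flag cancelled flights with the same criteria used for "cancellations"
flagged <- mutate(toy,
  cancelled = is.na(dep_time) | is.na(arr_time) | is.na(air_time)
)

# One row per day: count and share of cancelled flights, and mean departure delay
per_day <- summarise(group_by(flagged, year, month, day),
  n_cancelled    = sum(cancelled),
  prop_cancelled = mean(cancelled),
  avg_delay      = mean(dep_delay, na.rm = TRUE)
)
per_day
```

Plotting or correlating "prop_cancelled" against "avg_delay" across days would then show whether days with longer delays also tend to have more cancellations.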
What time of day should you fly if you want to avoid delays as much as possible?
To answer this question, I created a variable "by_time" which contains the dataset grouped by scheduled departure time. Then I summarized the data within the new variable by the mean of its departure delay. From the results, the best time to schedule a flight is 548, where the mean delay is closest to 0, at 0.07692308.
The code below creates variable "by_time" and assigns it the dataset grouped by "sched_dep_time". The summarise call then produces a table with all the unique values of "sched_dep_time" and the mean "dep_delay" for each of them.
by_time <- group_by(flights, sched_dep_time)
summarise(by_time, delay = mean(dep_delay, na.rm = TRUE))
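To read the answer off the summary more easily, the table can be sorted by mean delay, and a count per group helps to spot times that are backed by only a handful of flights. This is a sketch on made-up rows; "toy", "delays", "ranked", and the column "n" are names introduced here for illustration.

```r
library(dplyr)

# Made-up rows: three scheduled departure times
toy <- data.frame(
  sched_dep_time = c(600, 600, 1700, 1700, 2100),
  dep_delay      = c(-2, 4, 20, NA, 35)
)

by_time <- group_by(toy, sched_dep_time)

# Mean delay and number of flights for each scheduled departure time
delays <- summarise(by_time,
  delay = mean(dep_delay, na.rm = TRUE),
  n     = n()
)

# Smallest mean delay first
ranked <- arrange(delays, delay)
ranked
```

A mean based on very few flights can be noisy, so it may be worth checking "n" before settling on a single best time.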