CDS-101 / Homework / HW2 / 2017Spring_cds101_hw2.html

In [1]:

```
# Set the location for R packages
.libPaths(new = "~/Rlibs")
# Load the Tidyverse packages
library(tidyverse)
```

In [2]:

```
# Load the nycflights dataset
flights <- read_csv("nycflights13.csv")
```

**Vector of numbers between 1 and 100 that are not divisible by 2, 3, or 7 (Spring 2017)**

For this task I created three variables:

- The first variable, "v1", contains a vector of all the numbers between 1 and 100.
- The second variable, "indx", is a logical vector that is TRUE for every number in v1 that is not divisible by 2, by 3, or by 7 (the remainder is nonzero for all three).
- The third variable, "others", is v1 subset by indx, leaving all the numbers between 1 and 100 that are not divisible by 2, 3, or 7.

The code is as follows:


In [6]:

```
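# All integers from 1 to 100
v1 <- 1:100
# TRUE where the remainder is nonzero for 2, 3, and 7
indx <- v1 %% 2 != 0 & v1 %% 3 != 0 & v1 %% 7 != 0
# Keep only those numbers; the outer parentheses print the result
(others <- v1[indx])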
```
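One way to sanity-check the result is to count the expected answers independently. The sketch below is not part of the original assignment: it recomputes the vector and compares its length against an inclusion-exclusion count of the numbers from 1 to 100 that are divisible by 2, 3, or 7. The names `divisible` and `expected_count` are introduced here only for illustration.

```r
# Recompute the vector so this check runs on its own
v1 <- 1:100
others <- v1[v1 %% 2 != 0 & v1 %% 3 != 0 & v1 %% 7 != 0]

# Inclusion-exclusion: count numbers in 1..100 divisible by 2, 3, or 7
divisible <- 100 %/% 2 + 100 %/% 3 + 100 %/% 7 -   # single divisors
  100 %/% 6 - 100 %/% 14 - 100 %/% 21 +            # pairwise overlaps
  100 %/% 42                                       # divisible by all three

# Everything that is left should match the length of the filtered vector
expected_count <- 100 - divisible
length(others) == expected_count
```

If the two counts disagree, the logical test used to build the index is the first place to look.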

Find flights that:

- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta.
- Were delayed by at least an hour, but made up over 30 minutes in flight.
- Departed between midnight and 6am (inclusive).

I'm creating a variable "Houston_Flights" and assigning to it the subset of flights whose destination is IAH or HOU.

In [7]:

```
Houston_Flights <- filter(flights, dest %in% c("IAH", "HOU"))
```

**In the next line, I print the variable to confirm the result.**

In [8]:

```
print(Houston_Flights)
```

I'm creating a variable "UAD" and assigning to it the subset of flights operated by United (UA), American (AA), or Delta (DL).

In [29]:

```
UAD <- filter(flights, carrier == "AA" | carrier == "UA" | carrier == "DL")
```

I'm creating a variable "Del_but_made" and assigning to it the subset of flights that departed at least an hour late but made up over 30 minutes in flight, that is, whose arrival delay is more than 30 minutes smaller than their departure delay.

In [33]:

```
# Departed at least 60 minutes late, and arrived with over 30 minutes less delay
Del_but_made <- filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
```

To answer this question, I'm creating a variable "mid_6" with flights that departed between midnight and 6am (inclusive).

In [51]:

```
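# Midnight is assumed to be recorded as 2400 in this dataset, so include it explicitly
mid_6 <- filter(flights, dep_time <= 600 | dep_time == 2400)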
```

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to the more convenient representation of number of minutes since midnight.

In [64]:

```
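# dep_time is stored as HHMM, so split it into hours and minutes before converting
DTIM <- mutate(flights, min_dep_time = (dep_time %/% 100) * 60 + dep_time %% 100)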
```

In [73]:

```
Times_Parsed <- mutate(flights,
  hour = dep_time %/% 100,   # hours part of HHMM
  minute = dep_time %% 100   # minutes part of HHMM
)
```

In [78]:

```
# Use Times_Parsed so that hour and minute are the values parsed from dep_time
Dep_time_in_minutes <- transmute(Times_Parsed, dep_time = hour * 60 + minute)
```

In [21]:

```
Times_Parsed <- mutate(flights,
  sched_hour = sched_dep_time %/% 100,   # hours part of HHMM
  sched_minute = sched_dep_time %% 100   # minutes part of HHMM
)
```

In [24]:

```
sched_dep_time_in_minutes <- mutate(Times_Parsed, sched_dep_time_minutes = sched_hour*60 + sched_minute)
```

Compare air_time with the difference arr_time − dep_time. What did you expect to see? What do you need to do to fix it? Implement the fix.

After examining the data, I identified three issues that explain the discrepancy between arr_time - dep_time and air_time:

- When subtracting dep_time from arr_time, R treats these values as plain numbers, not times. To solve this issue, the columns must be formatted as times, or converted to minutes, before they are subtracted.
- The departure time is recorded in the local time of the departure city, while the arrival time is recorded in the local time of the arrival city, which in many cases lies in a different time zone. The solution is to convert both times to a standard time, such as UTC (Zulu time), based on their respective time zones.
- The arrival and departure times may include time on the ground, as long as the plane is moving under its own power in order to take off or after landing. Air time counts only the time in the air, from when the wheels leave the runway until they touch the runway again on landing. This can only be solved by using different data points.
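The first issue can be illustrated with a small sketch. It assumes times are stored as HHMM numbers, as in the conversion exercise above. The helper `hhmm_to_minutes` and the example times are made up for illustration, and the sketch does not attempt the time zone correction described in the second point.

```r
# Convert a clock time stored as HHMM (e.g. 1745) to minutes since midnight.
# hhmm_to_minutes is a hypothetical helper, not part of any package.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Made-up flight: departs 09:45, arrives 11:05, same time zone
dep_time <- 945
arr_time <- 1105

# Subtracting the raw HHMM values treats them as ordinary numbers
naive_diff <- arr_time - dep_time

# Converting to minutes first gives the elapsed time in minutes
minutes_diff <- hhmm_to_minutes(arr_time) - hhmm_to_minutes(dep_time)

# Made-up overnight flight: departs 23:30, arrives 00:45 the next day.
# Taking the difference modulo 1440 (minutes per day) wraps past midnight.
overnight <- (hhmm_to_minutes(45) - hhmm_to_minutes(2330)) %% 1440

naive_diff     # 160, which is not a number of minutes
minutes_diff   # 80
overnight      # 75
```

Even after this conversion, the difference would still be expected to exceed air_time for most flights because of the time zone and ground-time issues listed above.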

Consider the number of canceled flights per day in the dataset.

- Review the dataset and determine what would be a reasonable definition of a flight cancellation. Filter the dataset so that only the canceled flights remain. Note that the Boolean test (is.na(dep_delay) | is.na(arr_delay)) is not the best possible definition.
- Calculate the number of canceled flights per day using the filtered dataset. Is there a pattern? Is the proportion of canceled flights related to the average delay?

After reviewing the data, the following criteria would best fit cancelled flights: flights with NA for dep_time, OR flights with NA for arr_time, OR flights with NA for air_time. The dataset is filtered below.

To answer this question, I calculated the mean departure delay for cancelled flights and the mean departure delay for all the flights in the dataset (calculations below). The mean delay for cancelled flights was 36 minutes, while the entire dataset had a mean of 12.6 minutes, less than half that of cancelled flights. This suggests that flight cancellation is related to the average delay. Because na.rm = TRUE drops rows with no recorded dep_delay, the mean for the cancelled subset is based only on flights that have a departure delay on record.

In [26]:

```
cancellations <- filter(flights, is.na(dep_time) | is.na(arr_time) | is.na(air_time))
```

In [27]:

```
summarise(cancellations, delay = mean(dep_delay, na.rm = TRUE))
```

In [24]:

```
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
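The question also asks for the number of cancelled flights per day. The sketch below shows one possible way to compute that, along with the daily proportion of cancellations and the daily mean departure delay, so the two can be compared. It uses a small made-up table (`toy_flights`) with the same column names as the flights data, and it assumes a dplyr version that supports the `.groups` argument. The numbers are illustrative and are not results from the real dataset.

```r
library(dplyr)

# Made-up rows standing in for the flights table (same column names)
toy_flights <- data.frame(
  year      = 2013,
  month     = 1,
  day       = c(1, 1, 1, 2, 2, 2),
  dep_time  = c(600, NA, 915, NA, NA, 1230),
  arr_time  = c(800, NA, 1100, NA, NA, 1400),
  air_time  = c(100, NA, 95, NA, NA, 80),
  dep_delay = c(5, NA, 15, NA, NA, 60)
)

per_day <- toy_flights %>%
  # Same cancellation definition as used above
  mutate(cancelled = is.na(dep_time) | is.na(arr_time) | is.na(air_time)) %>%
  group_by(year, month, day) %>%
  summarise(
    n_cancelled    = sum(cancelled),                # cancelled flights that day
    prop_cancelled = mean(cancelled),               # share of flights cancelled
    mean_dep_delay = mean(dep_delay, na.rm = TRUE), # average delay that day
    .groups = "drop"
  )

per_day
```

Applied to the full flights table, plotting prop_cancelled against mean_dep_delay would be one way to look for the pattern the question asks about.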

What time of day should you fly if you want to avoid delays as much as possible?

To answer this question, I created a variable (by_time) which contains the dataset grouped by scheduled departure time. Then I summarized the data within the new variable by the mean of its departure delay. From the results, the best time to schedule a flight is 548 (5:48 am), when the mean delay is closest to 0, as low as 0.07692308.

In [25]:

```
by_time <- group_by(flights, sched_dep_time)
summarise(by_time, delay = mean(dep_delay, na.rm = TRUE))
```
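Grouping by the exact scheduled minute can leave some groups with only a handful of flights, so a single on-time flight may dominate the mean. As a possible refinement, not the approach used above, the sketch below groups by scheduled hour and reports the number of flights next to each mean. The table `toy_flights` is made up for illustration and only mimics the relevant columns.

```r
library(dplyr)

# Made-up rows with the two columns needed for this summary
toy_flights <- data.frame(
  sched_dep_time = c(548, 600, 615, 1700, 1730, 1745),
  dep_delay      = c(0, -2, 4, 30, 45, NA)
)

by_hour <- toy_flights %>%
  mutate(sched_hour = sched_dep_time %/% 100) %>%  # hour part of HHMM
  group_by(sched_hour) %>%
  summarise(
    n_flights = n(),                                # group size, to judge reliability
    delay     = mean(dep_delay, na.rm = TRUE),      # mean departure delay
    .groups = "drop"
  ) %>%
  arrange(delay)                                    # smallest mean delay first

by_hour
```

Sorting by the mean puts the hours with the smallest average delay at the top, and the n_flights column shows how much data supports each value.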
