Authors: James Glasbrenner, Muad Kholi
Description: Jupyter html version of CDS-101/Homework/HW2/2017Spring_cds101_hw2.ipynb

Homework 2

**CDS-101 (Spring 2017)**

**Name:** Muaad Kholi

In [1]:
# Set the location for R packages
.libPaths(new = "~/Rlibs")
# Load the Tidyverse packages
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
In [2]:
# Load the nycflights dataset
flights <- read_csv("nycflights13.csv")
Parsed with column specification:
cols(
  year = col_integer(),
  month = col_integer(),
  day = col_integer(),
  dep_time = col_integer(),
  sched_dep_time = col_integer(),
  dep_delay = col_double(),
  arr_time = col_integer(),
  sched_arr_time = col_integer(),
  arr_delay = col_double(),
  carrier = col_character(),
  flight = col_integer(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  air_time = col_double(),
  distance = col_double(),
  hour = col_double(),
  minute = col_double(),
  time_hour = col_datetime(format = "")
)

Question 1:

**Vector of numbers between 1 and 100 that are not divisible by 2, 3, or 7:**

For this task I created three variables:

  1. Variable v1 that contained a vector of all the numbers between 1 and 100.
  2. The second variable, "indx", is a logical vector that is TRUE for every number between 1 and 100 that is not divisible by 2, not divisible by 3, and not divisible by 7.
  3. The third variable, "others", holds the elements of v1 where indx is TRUE, leaving us with all the numbers between 1 and 100 that are not divisible by 2, 3, or 7.

Code looks as follows:

v1 <- 1:100
indx <- v1 %% 3 & v1 %% 2 & v1 %% 7
(others <- v1[!!indx])

The code below creates variable "v1", which is a vector from 1 to 100. Then it creates variable "indx" and assigns it the logical AND of v1 %% 3, v1 %% 2, and v1 %% 7. Because R treats a nonzero remainder as TRUE and a zero remainder as FALSE, indx is TRUE only where the number is divisible by none of 2, 3, and 7. The code then creates variable "others" and assigns it the values of v1 where indx is TRUE (the double negation !!indx leaves indx unchanged).

In [6]:
v1 <- 1:100
indx <- v1 %% 3 & v1 %% 2 & v1 %% 7
(others <- v1[!!indx])
  1. 1
  2. 5
  3. 11
  4. 13
  5. 17
  6. 19
  7. 23
  8. 25
  9. 29
  10. 31
  11. 37
  12. 41
  13. 43
  14. 47
  15. 53
  16. 55
  17. 59
  18. 61
  19. 65
  20. 67
  21. 71
  22. 73
  23. 79
  24. 83
  25. 85
  26. 89
  27. 95
  28. 97
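As a cross-check on the result above, here is a minimal sketch that states the divisibility test explicitly instead of relying on nonzero remainders being treated as TRUE. The names "divisible" and "others_check" are introduced only for this illustration.

```r
v1 <- 1:100

# TRUE where the number is divisible by at least one of 2, 3, or 7
divisible <- v1 %% 2 == 0 | v1 %% 3 == 0 | v1 %% 7 == 0

# Keep only the numbers that are divisible by none of them
others_check <- v1[!divisible]

# Number of values that remain
length(others_check)
```

Both approaches should select the same values; the explicit comparison with 0 is only a matter of readability.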

Question 2:

Find flights that:

  1. Flew to Houston (IAH or HOU)
  2. Were operated by United, American, or Delta.
  3. Were delayed by at least an hour, but made up over 30 minutes in flight.
  4. Departed between midnight and 6am (inclusive).

Answer to Question 2.1.

I'm creating a variable "Houston_Flights" and assigning to it the subset of flights whose destination is IAH or HOU.

In [7]:
Houston_Flights <- filter(flights, dest == "IAH" | dest == "HOU")

In the next cell, I print the variable to confirm the result.

In [8]:
print(Houston_Flights)
# A tibble: 9,313 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      623            627        -4      933            932
4   2013     1     1      728            732        -4     1041           1038
5   2013     1     1      739            739         0     1104           1038
6   2013     1     1      908            908         0     1228           1219
7   2013     1     1     1028           1026         2     1350           1339
8   2013     1     1     1044           1045        -1     1352           1351
9   2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Answer to Question 2.2.

I'm creating a variable "UAD" and assigning to it the subset of flights operated by United, American, or Delta.

The code below creates variable "UAD" and assigns it a filtered version of the dataset with all the records that have "AA" OR "UA" OR "DL" in the carrier column.

In [29]:
UAD <- filter(flights, carrier == "AA" | carrier == "UA" | carrier == "DL")
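A chain of == comparisons joined with | can also be written with %in%, which may be easier to read as the list of carriers grows. The sketch below compares the two forms on a small made-up vector of carrier codes; the vector and the names "with_or" and "with_in" are illustrative only.

```r
# Made-up carrier codes for illustration only
carrier <- c("AA", "B6", "UA", "DL", "WN")

# Test written as a chain of comparisons
with_or <- carrier == "AA" | carrier == "UA" | carrier == "DL"

# The same test written with %in%
with_in <- carrier %in% c("AA", "UA", "DL")

# Both forms pick the same elements when there are no missing values
identical(with_or, with_in)
```

One difference worth keeping in mind: for a missing carrier value, == returns NA while %in% returns FALSE. Inside filter() both cases drop the row.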

Answer to Question 2.3.

I'm creating a variable "Del_but_made" and assigning to it the subset of flights that were delayed by at least an hour but made up over 30 minutes in flight.

The code below creates variable "Del_but_made" and assigns it a filtered version of the dataset that contains flights with a dep_delay value of at least 60 minutes AND an arrival delay that is more than 30 minutes smaller than the departure delay (dep_delay - arr_delay > 30), which means the flight made up over 30 minutes in the air.

In [33]:
Del_but_made <- filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
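To illustrate what "made up over 30 minutes in flight" means, here is a minimal sketch on four made-up rows (not taken from the dataset). It assumes the time made up is the departure delay minus the arrival delay; the names "toy" and "made_up" are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  dep_delay = c(90, 75, 60, 30),
  arr_delay = c(50, 70, 29, -10)
)

# Delayed by at least an hour AND recovered more than 30 minutes in the air
made_up <- filter(toy, dep_delay >= 60, dep_delay - arr_delay > 30)
made_up
```

The first and third rows qualify (40 and 31 minutes recovered). The second row recovers only 5 minutes, and the fourth was not delayed by an hour to begin with.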

Answer to Question 2.4

To answer this question, I'm creating a variable "mid_6" with flights that departed between midnight and 6am (inclusive).

The code below creates variable "mid_6" and assigns it a subset of the data that is filtered to only flights that have a departure time of 600 or below (flights that departed between midnight and 6:00 in the morning). Because a departure at exactly midnight is recorded as 2400 in this dataset, that value is included as well.

In [51]:
mid_6 <- filter(flights, dep_time <= 600 | dep_time == 2400)
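The boundary cases of this filter can be checked on a handful of made-up departure times. This is only a sketch: it assumes, as in nycflights13, that dep_time is stored as HHMM and that a departure at exactly midnight appears as 2400. The names "toy" and "early" are hypothetical.

```r
library(dplyr)

# Made-up departure times in HHMM form, including the edge cases
toy <- tibble(dep_time = c(1L, 559L, 600L, 601L, 2359L, 2400L, NA))

# Midnight through 6am inclusive; rows with a missing dep_time are dropped by filter()
early <- filter(toy, dep_time <= 600 | dep_time == 2400)
early$dep_time
```

Only 1, 559, 600, and 2400 are kept; 601 and 2359 fall outside the window, and the missing value is removed because the condition evaluates to NA.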

Question 3:

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to the more convenient representation of number of minutes since midnight.

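The first cell below creates variable "DTIM", which adds a column "min_dep_time" holding the departure time in minutes since midnight, computed in one step from "dep_time" (stored as HHMM). The next two cells do the same in two steps: "Times_Parsed" splits "dep_time" into "hour" and "minute" columns, and "Dep_time_in_minutes" is a "transmuted" dataset with a column "dep_time" that contains the value of "hour" multiplied by 60 and added to the value of "minute".

In [64]:
# dep_time is stored as HHMM, so split it into hours and minutes instead of multiplying it by 60
DTIM <- mutate(flights, min_dep_time = (dep_time %/% 100)*60 + dep_time %% 100)
In [73]:
# Overwrites the dataset's own hour and minute columns, which describe the scheduled departure
Times_Parsed <- mutate(flights,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
In [78]:
# Use Times_Parsed so that hour and minute come from dep_time
Dep_time_in_minutes <- transmute(Times_Parsed, dep_time = hour*60 + minute)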

The code below creates variable "Times_Parsed", which contains the dataset with two new columns: sched_hour and sched_minute.

In [21]:
Times_Parsed <- mutate(flights,
  sched_hour = sched_dep_time %/% 100,
  sched_minute = sched_dep_time %% 100
)

The code below creates variable "sched_dep_time_in_minutes" and assigns it a dataset with a new column "sched_dep_time_minutes", which contains the value of "sched_hour" multiplied by 60 and added to the value of "sched_minute". This gives the scheduled departure time in total minutes since midnight.

In [24]:
sched_dep_time_in_minutes <- mutate(Times_Parsed, sched_dep_time_minutes = sched_hour*60 + sched_minute)
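The same conversion can be wrapped in a small helper so that it applies to both dep_time and sched_dep_time. The function name "hhmm_to_minutes" is hypothetical, and the final %% 1440 is an added assumption that a value of 2400 (midnight) should map to 0 minutes rather than 1440.

```r
# Convert clock times stored as HHMM integers to minutes since midnight.
# The final %% 1440 maps 2400 (midnight) to 0 instead of 1440.
hhmm_to_minutes <- function(hhmm) {
  ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440
}

# Example on a few made-up clock times
hhmm_to_minutes(c(517, 2400, 1, 2359))
```

With the tidyverse loaded, it could then be used as mutate(flights, dep_minutes = hhmm_to_minutes(dep_time), sched_dep_minutes = hhmm_to_minutes(sched_dep_time)), where the new column names are again only illustrative.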

Question #4:

Compare air_time with the difference arr_time − dep_time. What did you expect to see? What do you need to do to fix it? Implement the fix.

Answer to question 4:

After examining the data, three issues were identified that explain why arr_time - dep_time differs from air_time:

  1. When subtracting dep_time from arr_time, R treats these values as plain numbers, not as times. To solve this issue, the columns must be formatted as times, or converted to minutes, before they are subtracted.
  2. The departure time is clocked at the city of departure, while the arrival time is clocked at the city of arrival, which in many cases is in a different time zone. The solution for this issue is to convert both times to a standard time, such as Zulu time (UTC), based on their respective time zones.
  3. The third issue is that the departure and arrival times may include time on the ground, such as taxiing before takeoff and after landing. Air time counts only the time in the air, from when the wheels leave the runway until they touch the runway again on landing. This can only be solved by using different data points.
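The first issue in the list above can be sketched in code as follows. The rows are made up for illustration, the helper "hhmm_to_minutes" and the column names "elapsed" and "gap" are hypothetical, and the %% 1440 on the difference is an added assumption that a flight arriving "earlier" than it departed landed after midnight. The sketch does not address the time zone or ground time issues.

```r
# Hypothetical helper: HHMM clock time to minutes since midnight
hhmm_to_minutes <- function(hhmm) ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440

# Made-up rows for illustration only
toy <- data.frame(
  dep_time = c(517, 2330),
  arr_time = c(830, 45),
  air_time = c(227, 60)
)

# Elapsed clock time in minutes; %% 1440 wraps flights that land after midnight
toy$elapsed <- (hhmm_to_minutes(toy$arr_time) - hhmm_to_minutes(toy$dep_time)) %% 1440

# What remains unexplained after the conversion
toy$gap <- toy$elapsed - toy$air_time
toy
```

A negative gap, as in the first made-up row, is what a westbound flight across a time zone would produce, while a positive gap is consistent with time spent on the ground.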

Question 5:

Consider the number of canceled flights per day in the dataset.

  1. Review the dataset and determine what would be a reasonable definition of a flight cancellation. Filter the dataset so that only the canceled flights remain. Note that the Boolean test (is.na(dep_delay) | is.na(arr_delay)) is not the best possible definition.
  2. Calculate the number of canceled flights per day using the filtered dataset. Is there a pattern? Is the proportion of canceled flights related to the average delay?

Answer for question 5.1:

After reviewing the data, the following criteria would best fit cancelled flights: flights with NA for departure time OR flights with NA for arrival time OR flights with NA for air time. The dataset is filtered below.

Answer for question 5.2:

To answer this question, I calculated the mean departure delay for cancelled flights and the mean departure delay for all the flights in the dataset (calculations below). The calculation indicates that the mean delay for cancelled flights was 36 minutes, while the entire dataset had a mean of 12.6 minutes, less than half that of cancelled flights, which suggests that flight cancellation is related to the average delay.

The code below creates variable "cancellations" and assigns it a filtered dataset that includes only records with NA for their departure time OR NA for arrival time OR NA for air time.

In [26]:
cancellations <- filter(flights, is.na(dep_time) | is.na(arr_time) | is.na(air_time))

The code below calculates and displays the mean of the dep_delay column (excluding NA values) within the "cancellations" variable, which contains only cancelled flights.

In [27]:
summarise(cancellations, delay = mean(dep_delay, na.rm = TRUE))
delay
36.01702

The code below calculates and displays the mean of the dep_delay column (excluding NA values) within the entire dataset.

In [24]:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
delay
12.63907
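The question also asks for the number of cancelled flights per day. A minimal sketch of that calculation is given here on made-up rows, treating a flight as cancelled when its dep_time is missing, which is a simplifying assumption for this illustration. The names "toy", "by_day", and "per_day" are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only: two days with three flights each
toy <- tibble(
  year = 2013L,
  month = 1L,
  day = c(1L, 1L, 1L, 2L, 2L, 2L),
  dep_time = c(517L, NA, 600L, NA, NA, 700L),
  dep_delay = c(2, NA, 10, NA, NA, 45)
)

# One row per day: cancelled count, cancelled share, and mean delay of the flights that departed
by_day <- group_by(toy, year, month, day)
per_day <- summarise(by_day,
  n_cancelled = sum(is.na(dep_time)),
  prop_cancelled = mean(is.na(dep_time)),
  avg_dep_delay = mean(dep_delay, na.rm = TRUE)
)
per_day
```

Applied to the full dataset, plotting prop_cancelled against avg_dep_delay per day would be one way to look for the relationship the question asks about.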

Question #6:

What time of day should you fly if you want to avoid delays as much as possible?

Answer to question #6:

To answer this question, I created a variable (by_time) which contained the dataset grouped by scheduled departure time. Then I summarized the data within the new variable by the mean of its departure delay. From the results, the best time to schedule a flight is at 548, when the mean delay value is closest to 0, at 0.07692308.

The code below creates variable "by_time", which is the dataset grouped by "sched_dep_time", and then displays a table of all the unique values of "sched_dep_time" with the mean value of their respective "dep_delay".

In [25]:
by_time <- group_by(flights, sched_dep_time)
summarise(by_time, delay = mean(dep_delay, na.rm = TRUE))
sched_dep_time delay
106 NaN
500 -3.13823529
501 -3.00000000
505 2.00000000
510 1.00000000
515 1.53140097
516 -5.00000000
517 5.11111111
520 -2.57142857
525 -0.70270270
527 10.00000000
528 2.00000000
529 4.42857143
530 2.08823529
534 2.80000000
535 -2.00000000
536 0.89655172
537 6.00000000
538 -1.25000000
539 -2.50000000
540 1.24324324
545 -0.34269663
548 0.07692308
549 10.30000000
550 1.73504274
551 9.10526316
555 -2.00000000
557 -1.00000000
558 50.00000000
559 9.38372093
2207 110.666667
2208 -3.000000
2210 21.000000
2215 21.782609
2219 18.846154
2220 20.602410
2225 28.492754
2227 44.000000
2229 10.080000
2230 15.363636
2231 38.229508
2241 17.500000
2245 22.251741
2246 4.608696
2249 19.947368
2250 16.846966
2251 17.244444
2253 11.684211
2255 9.370277
2258 2.058824
2300 27.285714
2305 33.622951
2315 30.000000
2330 18.714286
2339 80.000000
2345 17.000000
2352 -0.187500
2355 9.849315
2358 13.613636
2359 12.698529
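Some of the minute-level groups above may contain only a handful of flights, which can make their means noisy. A complementary view, sketched here on made-up rows, groups by scheduled hour and reports the number of flights behind each mean. The names "toy", "sched_hour", "by_hour", and "hourly" are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  sched_dep_time = c(500L, 545L, 548L, 1730L, 1745L, 2245L),
  dep_delay = c(-3, 0, 1, 25, 35, 20)
)

# Scheduled hour from the HHMM value
toy <- mutate(toy, sched_hour = sched_dep_time %/% 100)

# Mean delay and flight count per hour, smallest mean delay first
by_hour <- group_by(toy, sched_hour)
hourly <- arrange(
  summarise(by_hour, delay = mean(dep_delay, na.rm = TRUE), n = n()),
  delay
)
hourly
```

Sorting by the mean delay puts the hours with the smallest average delay at the top, and the n column shows how many flights each mean is based on.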