Author: Mihai Ninov
Description: Jupyter notebook CDS-101/Homework/HW2/2017Spring_cds101_hw2.ipynb

# Homework 2

CDS-101 (Spring 2017)

Name: Mike Ninov

In [1]:
# Set the location for R packages
.libPaths(new = "~/Rlibs")
# Load the Tidyverse packages
library(tidyverse)

In [2]:
# Load the nycflights dataset

Parsed with column specification:
cols(
  year = col_integer(),
  month = col_integer(),
  day = col_integer(),
  dep_time = col_integer(),
  sched_dep_time = col_integer(),
  dep_delay = col_double(),
  arr_time = col_integer(),
  sched_arr_time = col_integer(),
  arr_delay = col_double(),
  carrier = col_character(),
  flight = col_integer(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  air_time = col_double(),
  distance = col_double(),
  hour = col_double(),
  minute = col_double(),
  time_hour = col_datetime(format = "")
)

Question 1 To produce a vector containing all integers from 1 to 100, we can simply use seq(1,100,1). To keep only the integers that are not divisible by 2, 3, or 7, we combine the ! (logical NOT) and %% (remainder) operators.

In [3]:
v <- seq(1, 100, 1)
# !v %% k is TRUE when k divides v; the outer ! keeps values with no such divisor
v[!((!v %% 2) + (!v %% 3) + (!v %% 7))]

1. 1
2. 5
3. 11
4. 13
5. 17
6. 19
7. 23
8. 25
9. 29
10. 31
11. 37
12. 41
13. 43
14. 47
15. 53
16. 55
17. 59
18. 61
19. 65
20. 67
21. 71
22. 73
23. 79
24. 83
25. 85
26. 89
27. 95
28. 97
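
The Question 1 filter can also be written with explicit comparisons, which avoids relying on the fact that `!` binds more loosely than `+` in R. The sketch below uses only base R; the names `by_precedence` and `explicit` are introduced here for illustration.

```r
v <- seq(1, 100, 1)

# Compact form: `!` applies to the whole sum of "is divisible" flags
by_precedence <- v[!(!v %% 2) + (!v %% 3) + (!v %% 7)]

# Explicit form: keep v only when every remainder is nonzero
explicit <- v[v %% 2 != 0 & v %% 3 != 0 & v %% 7 != 0]

identical(by_precedence, explicit)  # TRUE
length(explicit)                    # 28
```

Both forms return the 28 values listed above; the explicit form is longer but easier to check by eye.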

Question 2a Find all flights that flew to IAH or HOU

In [4]:
houston_dest <- filter(flights, dest == 'IAH' | dest == 'HOU')
print(houston_dest)

# A tibble: 9,313 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Question 2b Find all flights operated by United, American, or Delta.

In [5]:
carrier_triad <- filter(flights, carrier == 'UA' | carrier == 'AA' | carrier == 'DL')
print(carrier_triad)

# A tibble: 139,504 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            600        -6      812            837
 5  2013     1     1      554            558        -4      740            728
 6  2013     1     1      558            600        -2      753            745
 7  2013     1     1      558            600        -2      924            917
 8  2013     1     1      558            600        -2      923            937
 9  2013     1     1      559            600        -1      941            910
10  2013     1     1      559            600        -1      854            902
# ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
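
When a column is compared against several values, a chain of `==` joined by `|` can usually be replaced by `%in%`. The sketch below checks the equivalence on a small, hypothetical vector of carrier codes (not taken from the flights data); `chained` and `with_in` are names made up for this example.

```r
# Hypothetical carrier codes standing in for flights$carrier
carrier <- c("UA", "B6", "AA", "DL", "EV", "UA")

chained <- carrier == "UA" | carrier == "AA" | carrier == "DL"
with_in <- carrier %in% c("UA", "AA", "DL")

identical(chained, with_in)  # TRUE
```

Inside filter(), the condition would then read carrier %in% c('UA', 'AA', 'DL'). The two forms differ only for missing values: `==` yields NA where `%in%` yields FALSE, and filter() drops the row in either case.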

Question 2c Find all flights that departed at least 1 hour late but made up more than 30 minutes in flight.

In [6]:
delay_hour_makeup <- filter(flights, dep_delay >= 60, dep_delay-arr_delay > 30)
print(delay_hour_makeup)

# A tibble: 1,844 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     2205           1720       285       46           2040
 2  2013     1     1     2326           2130       116      131             18
 3  2013     1     3     1503           1221       162     1803           1555
 4  2013     1     3     1839           1700        99     2056           1950
 5  2013     1     3     1850           1745        65     2148           2120
 6  2013     1     3     1941           1759       102     2246           2139
 7  2013     1     3     1950           1845        65     2228           2227
 8  2013     1     3     2015           1915        60     2135           2111
 9  2013     1     3     2257           2000       177       45           2224
10  2013     1     4     1917           1700       137     2135           1950
# ... with 1,834 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
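
The quantity dep_delay - arr_delay is the number of minutes a flight recovered between leaving and arriving. A minimal base R sketch of the same two conditions, on invented delay values (the data frame `toy` and the vectors `made_up` and `keep` are hypothetical):

```r
# Hypothetical delays in minutes (not taken from the flights data)
toy <- data.frame(
  dep_delay = c(65, 90, 45, 120, NA),
  arr_delay = c(28, 75,  5,  80, NA)
)

made_up <- toy$dep_delay - toy$arr_delay     # minutes recovered in the air
keep <- toy$dep_delay >= 60 & made_up > 30   # same two conditions as the filter

toy[which(keep), ]   # which() drops NA conditions, as filter() does
```

Only rows 1 and 4 are kept: row 2 left late but recovered just 15 minutes, row 3 left less than an hour late, and row 5 has no recorded delays.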

Question 2d Find all flights between midnight and 6 AM inclusive.

In [7]:
redeye <- filter(flights, dep_time == 2400 | dep_time <= 600)
print(redeye)

# A tibble: 9,373 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
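
Because midnight is stored as 2400 rather than 0, the Question 2d filter needs a separate condition for it. One possible shortcut, sketched here on hypothetical HHMM values, is to take the time modulo 2400 so that midnight wraps to 0; the vector names are made up for this example.

```r
# Hypothetical HHMM departure times, including midnight (2400) and a missing value
dep_time <- c(517, 600, 601, 2400, 2359, NA)

two_conditions <- dep_time == 2400 | dep_time <= 600
one_condition  <- dep_time %% 2400 <= 600   # 2400 %% 2400 is 0

identical(two_conditions, one_condition)  # TRUE
```

Both versions leave the missing departure time as NA, which filter() would drop.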

Question 3 Convert dep_time and sched_dep_time to a more convenient representation: the number of minutes since midnight.

In [8]:
time_x <- mutate(flights,
                 dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
                 sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))
print(time_x)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <dbl>          <dbl>     <dbl>    <int>          <int>
 1  2013     1     1      317            315         2      830            819
 2  2013     1     1      333            329         4      850            830
 3  2013     1     1      342            340         2      923            850
 4  2013     1     1      344            345        -1     1004           1022
 5  2013     1     1      354            360        -6      812            837
 6  2013     1     1      354            358        -4      740            728
 7  2013     1     1      355            360        -5      913            854
 8  2013     1     1      357            360        -3      709            723
 9  2013     1     1      357            360        -3      838            846
10  2013     1     1      358            360        -2      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
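
The conversion in Question 3 splits an HHMM value into hours (integer division by 100) and minutes (remainder), then recombines them. It can be wrapped in a small helper so it is not repeated for every column. This is a sketch: `hhmm_to_minutes` is a hypothetical name, and the final `%% 1440` is an addition that maps midnight (2400) to 0, whereas the formula above maps it to 1440.

```r
# Hypothetical helper (not part of the original notebook):
# convert HHMM clock times to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  minutes <- (hhmm %/% 100) * 60 + (hhmm %% 100)
  minutes %% 1440   # wrap 2400 (midnight) around to 0
}

hhmm_to_minutes(c(517, 600, 2359, 2400))  # 317 360 1439 0
```

With this helper, the mutate() call could be written as mutate(flights, dep_time = hhmm_to_minutes(dep_time), sched_dep_time = hhmm_to_minutes(sched_dep_time)).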

Question 4 Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

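I expect arr_time - dep_time to roughly match air_time, but the raw subtraction does not. The type difference is not the cause: air_time is a double and arr_time and dep_time are integers, and R coerces between the two silently. The cause is that arr_time and dep_time are clock times in HHMM format, so their difference is not a number of minutes.

To fix this, convert arr_time and dep_time to minutes since midnight before subtracting, and take the result modulo 24 hours so that flights landing after midnight do not produce negative durations. The differences that remain in the output below are likely because both times are local (the destination may be in another time zone) and because arr_time - dep_time also includes time spent on the ground.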

In [18]:
mutate(flights,
       dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
       sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
       arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
       sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
  transmute((arr_time - dep_time) %% (60 * 24) - air_time)

(arr_time - dep_time)%%(60 * 24) - ai...
-34
-30
61
77
22
-44
40
19
21
-23
22
17
-139
-156
-35
19
-162
19
23
16
-40
34
20
23
14
-20
-152
17
82
18
15
35
-93
16
18
-148
21
18
24
19
21
55
35
-47
-32
16
-159
21
25
13
15
17
19
20
NA
NA
NA
NA
NA
NA
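
The `%% (60*24)` in the Question 4 code handles flights that land after midnight. As a worked example, take a departure at 2326 and an arrival at 0131, the clock times of the second row of the Question 2c output. The variable names `dep_min` and `arr_min` are introduced here for illustration.

```r
dep_min <- 23 * 60 + 26   # 2326 -> 1406 minutes since midnight
arr_min <-  1 * 60 + 31   # 0131 ->   91 minutes since midnight

arr_min - dep_min                  # -1315: negative, the flight landed the next day
(arr_min - dep_min) %% (60 * 24)   # 125: elapsed minutes after wrapping
```

In R, `%%` with a positive modulus always returns a non-negative result, which is what makes the wrap-around work.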

Question 5a Consider the number of cancelled flights. Determine the definition of a flight cancellation. The table below shows that no flight arrived without departing, so a flight can be treated as cancelled when is.na(dep_delay) is TRUE.

In [19]:
group_by(flights, departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>%
summarise(n=n())

departed arrived      n
FALSE    FALSE     8255
TRUE     FALSE     1175
TRUE     TRUE    327346
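
The same cross-tabulation can be sketched in base R with table(). The delay values below are invented for illustration, and `status` and `cancelled` are hypothetical names.

```r
# Hypothetical delays; NA marks a flight with no recorded departure or arrival
dep_delay <- c( 2, NA, 15, -3, NA, 40)
arr_delay <- c(10, NA, NA, -8, NA, 35)

# Rows: did the flight depart? Columns: did it arrive?
status <- table(departed = !is.na(dep_delay), arrived = !is.na(arr_delay))
status

# Under the definition used here, a cancelled flight has no departure delay
cancelled <- sum(is.na(dep_delay))
cancelled  # 2
```

In this toy data, as in the real table, the "did not depart but arrived" cell is empty, so counting missing dep_delay values is enough.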

Question 5b Find the pattern of cancelled flights in relation to average delay. Plotting the daily fraction of cancelled flights (canx/n) against the mean departure delay (blue) and the mean arrival delay (red) shows a strong positive relationship: days with many cancellations also tend to have long average delays.

In [20]:
mutate(flights, dep_date = lubridate::make_datetime(year, month, day)) %>%
  group_by(dep_date) %>%
  summarise(canx = sum(is.na(dep_delay)), n = n(),
            mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = canx / n)) +
  geom_point(aes(y = mean_dep_delay), color = 'blue', alpha = 0.2) +
  geom_point(aes(y = mean_arr_delay), color = 'red', alpha = 0.2) +
  ylab('delay (minutes)')
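
One way to put a number on the relationship in Question 5b is to compute the correlation between the daily cancellation fraction and the daily mean delay. The sketch below does this in base R on invented records for three days; the data frames `toy` and `daily` are hypothetical, and the result says nothing about the real flights data, where the value would have to be computed from the actual daily summary.

```r
# Hypothetical per-flight records for three days (values invented for illustration)
toy <- data.frame(
  day       = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  dep_delay = c(0, 5, -2, 1, 30, NA, 45, 20, NA, NA, 90, 60)
)

# One row per day: fraction cancelled and mean delay of the flights that left
daily <- data.frame(
  day            = sort(unique(toy$day)),
  canx_frac      = as.vector(tapply(is.na(toy$dep_delay), toy$day, mean)),
  mean_dep_delay = as.vector(tapply(toy$dep_delay, toy$day, mean, na.rm = TRUE))
)
daily

# Correlation between cancellation fraction and mean delay across days
cor(daily$canx_frac, daily$mean_dep_delay)
```

A value close to 1 would support the "strong correlation" reading of the scatter plot; a value near 0 would not.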


Question 6 What time of day should you fly if you want to avoid delays? Avoid flying late in the day: delays accumulate over the course of the day, so evening and night departures are the most likely to be delayed, which favors flying early in the morning.

In [22]:
ggplot(flights, aes(x = factor(hour), fill = arr_delay > 5 | is.na(arr_delay))) + geom_bar()
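
Stacked counts make it hard to compare hours that have very different numbers of flights, so it can help to look at the share of late or missing arrivals per hour instead. A minimal base R sketch, using invented hours and delays (the vectors below and the name `share_late` are hypothetical):

```r
# Hypothetical scheduled hours and arrival delays (minutes); NA = never arrived
hour      <- c(6, 6, 6, 12, 12, 12, 20, 20, 20, 20)
arr_delay <- c(-5, 0, 12, 3, 20, -1, 45, NA, 8, 2)

late <- arr_delay > 5 | is.na(arr_delay)   # same flag as the fill aesthetic
share_late <- tapply(late, hour, mean)     # fraction of flagged flights per hour
share_late
```

In ggplot2, passing position = "fill" to geom_bar() should show the same per-hour proportions directly as bar heights.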
