Introduction to Data Analysis

Lecture 1

The 'nycflights13' dataset: data on flights departing NYC in 2013

In [1]:
library('nycflights13')
library('tidyverse')
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
In [2]:
head(flights, 5)
Out[2]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00

Given these data, what are some questions we can ask (and answer)?

Which airlines are most/least on-time?

In [3]:
# Mean departure and arrival delay per carrier, using only flights with both delays recorded
flight_delays <- flights %>% 
    filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
    group_by(carrier) %>% 
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay))
In [4]:
flight_delays %>% arrange(meanArrDelay)
Out[4]:
carrier meanDepDelay meanArrDelay
AS 5.830748 -9.9308886
HA 4.900585 -6.9152047
AA 8.569130 0.3642909
DL 9.223950 1.6443409
VX 12.756646 1.7644644
US 3.744693 2.1295951
UA 12.016908 3.5580111
9E 16.439574 7.3796692
B6 12.967548 9.4579733
WN 17.661657 9.6491199
MQ 10.445381 10.7747334
OO 12.586207 11.9310345
YV 18.898897 15.5569853
EV 19.838929 15.7964311
FL 18.605984 20.1159055
F9 20.201175 21.9207048
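
The filter above drops a flight from both means whenever either delay is missing. A minimal sketch of how that differs from removing missing values column by column with na.rm = TRUE; the toy data frame and the names toy, complete_rows and per_column are made up for illustration and are not part of the lecture code:

```r
library(dplyr)

# Toy data (invented for illustration, not taken from nycflights13)
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, NA, 4, 6),
  arr_delay = c(20, NA, NA, 2)
)

# Approach used in the lecture: keep only rows where both delays are known
complete_rows <- toy %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay))

# Alternative: drop missing values separately for each column
per_column <- toy %>%
  group_by(carrier) %>%
  summarize(meanDepDelay = mean(dep_delay, na.rm = TRUE),
            meanArrDelay = mean(arr_delay, na.rm = TRUE))
```

For UA the two approaches give different mean departure delays, because the row with a known departure delay but no arrival delay is counted only in the second one. Which is preferable depends on the question being asked.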
In [5]:
table <- merge(airlines, flight_delays, by = "carrier") %>% arrange(-meanArrDelay)
table
Out[5]:
carrier name meanDepDelay meanArrDelay
F9 Frontier Airlines Inc. 20.201175 21.9207048
FL AirTran Airways Corporation 18.605984 20.1159055
EV ExpressJet Airlines Inc. 19.838929 15.7964311
YV Mesa Airlines Inc. 18.898897 15.5569853
OO SkyWest Airlines Inc. 12.586207 11.9310345
MQ Envoy Air 10.445381 10.7747334
WN Southwest Airlines Co. 17.661657 9.6491199
B6 JetBlue Airways 12.967548 9.4579733
9E Endeavor Air Inc. 16.439574 7.3796692
UA United Air Lines Inc. 12.016908 3.5580111
US US Airways Inc. 3.744693 2.1295951
VX Virgin America 12.756646 1.7644644
DL Delta Air Lines Inc. 9.223950 1.6443409
AA American Airlines Inc. 8.569130 0.3642909
HA Hawaiian Airlines Inc. 4.900585 -6.9152047
AS Alaska Airlines Inc. 5.830748 -9.9308886
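
The lookup of airline names uses base R's merge(), which performs an inner join and sorts by the key by default. As an alternative sketch, dplyr's left_join() does the same lookup while keeping every row of the left-hand table. The toy tables below, including the carrier "ZZ", are hypothetical:

```r
library(dplyr)

# Hypothetical lookup table and summary table (values invented for illustration)
toy_airlines <- tibble(
  carrier = c("AA", "UA", "ZZ"),
  name    = c("American", "United", "Hypothetical Air")
)
toy_delays <- tibble(
  carrier      = c("AA", "UA"),
  meanArrDelay = c(0.4, 3.6)
)

# Attach the airline name to each summary row, then sort by arrival delay
joined <- toy_delays %>%
  left_join(toy_airlines, by = "carrier") %>%
  arrange(desc(meanArrDelay))
```

Carriers that appear only in the lookup table ("ZZ" here) do not show up in the result, since the join starts from the summary table.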
In [6]:
plot(table$meanDepDelay, table$meanArrDelay)
Out[6]:
[Scatter plot: meanArrDelay (y-axis) against meanDepDelay (x-axis), one point per carrier]

Which NYC airports are most/least on-time?

In [7]:
# Mean delays per origin airport, smallest departure delay first
flights %>% 
    filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
    group_by(origin) %>% 
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay)) %>% 
    arrange(meanDepDelay)
Out[7]:
origin meanDepDelay meanArrDelay
LGA 10.28658 5.783488
JFK 12.02361 5.551481
EWR 15.00911 9.107055

Is there a connection between amount of delay and destination airports?

In [8]:
# Mean delays and flight counts per destination, largest departure delay first
summTable <- flights %>% 
    filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
    group_by(dest) %>% 
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay), numFlights = n()) %>% 
    arrange(-meanDepDelay)
In [9]:
names(summTable)[1] <- 'faa'
In [10]:
airportnames <- airports[c('faa', 'name')]
In [11]:
merge(airportnames, summTable) %>% arrange(-meanDepDelay)
Out[11]:
faa name meanDepDelay meanArrDelay numFlights
TUL Tulsa Intl 34.88776 33.659864 294
CAE Columbia Metropolitan 33.81132 41.764151 106
OKC Will Rogers World 29.18095 30.619048 315
BHM Birmingham Intl 29.01487 16.877323 269
TYS Mc Ghee Tyson 28.38235 24.069204 578
JAC Jackson Hole Airport 27.47619 28.095238 21
DSM Des Moines Intl 26.13958 19.005736 523
RIC Richmond Intl 23.62020 20.111253 2346
MSN Dane Co Rgnl Truax Fld 23.51259 20.196043 556
ALB Albany Intl 23.44737 14.397129 418
PVD Theodore Francis Green State 21.76536 16.234637 358
TVC Cherry Capital Airport 21.53684 12.968421 95
CHO Charlottesville-Albemarle 21.39130 9.500000 46
SBN South Bend Rgnl 21.10000 6.500000 10
MHT Manchester Regional Airport 21.02468 14.787554 932
CAK Akron Canton Regional Airport 20.84561 19.698337 842
SAT San Antonio Intl 20.37785 6.945372 659
MCI Kansas City Intl 20.23395 14.514058 1885
OMA Eppley Afld 20.19094 14.698898 817
CVG Cincinnati Northern Kentucky Intl 19.40188 15.364564 3725
GRR Gerald R Ford Intl 19.38462 18.189560 728
ILM Wilmington Intl 19.16822 4.635514 107
BGR Bangor Intl 19.15642 8.027933 358
GSP Greenville-Spartanburg International 19.10127 15.935443 790
GSO Piedmont Triad 18.93097 14.112601 1492
MKE General Mitchell Intl 18.77372 14.167220 2709
SMF Sacramento Intl 18.69149 12.109929 282
MDW Chicago Midway Intl 18.64969 12.364224 4025
SAV Savannah Hilton Head Intl 18.08411 15.129506 749
BDL Bradley Intl 17.72087 7.048544 412
TPA Tampa Intl 12.074154 7.40852503 7390
PHL Philadelphia Intl 11.781311 10.12719014 1541
DTW Detroit Metro Wayne Co 11.717529 5.42996346 9031
BZN Gallatin Field 11.457143 7.60000000 35
MCO Orlando Intl 11.265340 5.45464309 13967
LGB Long Beach 11.155825 -0.06202723 661
SAN San Diego Intl 11.117386 3.13916574 2709
IAH George Bush Intercontinental 10.771630 4.24079040 7085
SEA Seattle Tacoma Intl 10.600000 -1.09909910 3885
PHX Phoenix Sky Harbor Intl 10.360617 2.09704733 4606
DCA Ronald Reagan Washington Natl 10.138624 9.06695204 9111
SJC Norman Y Mineta San Jose Intl 10.103659 3.44817073 328
LAS Mc Carran Intl 9.375000 0.25772849 5952
LAX Los Angeles Intl 9.351242 0.54711094 16026
HNL Honolulu Intl 9.315264 -1.36519258 701
CLT Charlotte Douglas Intl 9.196431 7.36031885 13674
SLC Salt Lake City Intl 9.026928 0.17625459 2451
MIA Miami Intl 8.868800 0.29905978 11593
BOS General Edward Lawrence Logan Intl 8.662029 2.91439222 15022
DFW Dallas Fort Worth Intl 8.603839 0.32212685 8388
RSW Southwest Florida Intl 8.230154 3.23814963 3502
AVL Asheville Regional Airport 8.149425 8.00383142 261
SRQ Sarasota Bradenton Intl 7.273106 3.08243131 1201
MVY Martha's Vineyard 6.890476 -0.28571429 210
SNA John Wayne Arpt Orange Co 6.780788 -7.86822660 812
ACK Nantucket Mem 6.446970 4.85227273 264
XNA NW Arkansas Regional 5.800403 7.46572581 992
EYW Key West Intl 3.647059 6.35294118 17
PSP Palm Springs Intl -2.944444 -12.72222222 18
LEX Blue Grass -9.000000 -22.00000000 1
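
Some of the extreme means in this table rest on very few flights (LEX has a single flight, PSP has 18), so they say little about the destination itself. One possible way to guard against this is to require a minimum number of flights before ranking. The sketch below uses a few rows rounded from the table above; the threshold of 100 and the names toy_summ, min_flights and reliable are arbitrary choices for illustration:

```r
library(dplyr)

# A few rows rounded from the output above
toy_summ <- tibble(
  faa          = c("TUL", "LEX", "PSP", "LAX"),
  meanDepDelay = c(34.9, -9.0, -2.9, 9.4),
  numFlights   = c(294, 1, 18, 16026)
)

min_flights <- 100  # arbitrary cut-off, chosen only for this example

# Keep destinations with enough flights for the mean to be meaningful
reliable <- toy_summ %>%
  filter(numFlights >= min_flights) %>%
  arrange(desc(meanDepDelay))
```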

Does the amount of delay vary depending on the day of the week?

In [12]:
# Mean delays and number of flights per day of the week
flight_delays <- flights %>% 
    mutate(date = paste(year, month, day, sep="-")) %>% 
    mutate(weekday = weekdays(as.Date(date))) %>% 
    filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
    group_by(weekday) %>% 
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay), volume = n())
In [13]:
flight_delays %>% arrange(meanDepDelay)
Out[13]:
weekday meanDepDelay meanArrDelay volume
Saturday 7.594407 -1.448828 37794
Tuesday 10.588355 5.388526 49137
Sunday 11.477475 4.820024 45506
Wednesday 11.643321 7.051119 48632
Friday 14.653974 9.070120 48531
Monday 14.718728 9.653739 49301
Thursday 16.043451 11.740819 48445
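
Note that weekdays() returns day names in the language of the current locale, and the resulting character column sorts alphabetically rather than in calendar order. A small base-R sketch of one way around both issues, using the ISO weekday number from format(..., "%u") and an ordered factor; the variable names are illustrative only:

```r
# Three example dates from January 2013
dates <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06"))

# ISO weekday number: 1 = Monday, ..., 7 = Sunday (independent of locale)
iso_day <- as.integer(format(dates, "%u"))

# Map the numbers to fixed English names, ordered Monday through Sunday
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday <- factor(day_names[iso_day], levels = day_names, ordered = TRUE)
```

With weekday stored as an ordered factor, arrange(weekday) and plot axes follow the calendar order instead of the alphabet.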

Does the amount of delay vary depending on day of the week and the airport of origin?

In [14]:
flight_delays <- flights %>% 
    mutate(date = paste(year, month, day, sep="-")) %>% 
    mutate(weekday = weekdays(as.Date(date))) %>% 
    filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
    group_by(origin, weekday) %>% 
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay), volume = n())

names(flight_delays)
Out[14]:
  1. 'origin'
  2. 'weekday'
  3. 'meanDepDelay'
  4. 'meanArrDelay'
  5. 'volume'
In [15]:
flight_delays %>% 
    ggplot( aes(meanDepDelay, meanArrDelay)) + geom_point()
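Out[15]:
[Scatter plot: meanArrDelay (y-axis) against meanDepDelay (x-axis), one point per origin-weekday pair]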
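
In the plot above, every point looks the same, so it cannot show which airport or weekday a point belongs to. A possible refinement, sketched here on invented numbers (toy_delays is not the lecture's flight_delays), maps origin to colour and labels each point with its weekday:

```r
library(ggplot2)

# Invented values with the same columns as the grouped summary
toy_delays <- data.frame(
  origin       = rep(c("EWR", "JFK", "LGA"), each = 2),
  weekday      = rep(c("Monday", "Saturday"), times = 3),
  meanDepDelay = c(18, 9, 15, 8, 13, 6),
  meanArrDelay = c(12, 1, 9, -1, 8, -2)
)

# Colour encodes the airport; the text label shows the weekday
p <- ggplot(toy_delays,
            aes(meanDepDelay, meanArrDelay, colour = origin, label = weekday)) +
  geom_point() +
  geom_text(vjust = -0.8, show.legend = FALSE)
```

Printing p draws the plot. The same aesthetics should apply unchanged to the real summary table, since it has the same column names.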

Number of flights leaving each airport on a particular day (January 1, 2013), by time of day

In [16]:
flights_1day <- flights %>%  
    mutate(date = paste(year, month, day, sep="-")) %>% 
    filter(date=="2013-1-1") %>% 
    transform(group = cut(dep_time, breaks=c(0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400), labels=c('0-259', '300-559', '600-859', '900-1159', '1200-1459', '1500-1759', '1800-2059', '2100-2359') ) ) %>% 
    group_by(origin, group) %>% 
    summarize(volume = n())
In [17]:
flights_1day %>% filter(!is.na(group))
Out[17]:
origin group volume
EWR 300-559 5
EWR 600-859 47
EWR 900-1159 49
EWR 1200-1459 58
EWR 1500-1759 70
EWR 1800-2059 58
EWR 2100-2359 17
JFK 300-559 7
JFK 600-859 54
JFK 900-1159 35
JFK 1200-1459 39
JFK 1500-1759 72
JFK 1800-2059 64
JFK 2100-2359 25
LGA 300-559 7
LGA 600-859 50
LGA 900-1159 52
LGA 1200-1459 43
LGA 1500-1759 48
LGA 1800-2059 35
LGA 2100-2359 3
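
One detail of the binning is worth checking. By default, cut() builds intervals that are closed on the right, so a departure at exactly 600 is counted in the bin labelled '300-559'. The sketch below compares the default with right = FALSE on a handful of made-up departure times; the breaks and labels are the ones used above:

```r
# Made-up departure times sitting on or next to bin boundaries
dep_time <- c(259, 300, 559, 600, 2359)

breaks <- c(0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400)
labels <- c('0-259', '300-559', '600-859', '900-1159',
            '1200-1459', '1500-1759', '1800-2059', '2100-2359')

# Default: intervals of the form (a, b], so 300 and 600 fall in the earlier bin
default_bins <- cut(dep_time, breaks = breaks, labels = labels)

# right = FALSE: intervals of the form [a, b), which matches the labels
left_closed <- cut(dep_time, breaks = breaks, labels = labels, right = FALSE)
```

With right = FALSE, a value equal to the last break (2400) would fall outside every interval unless include.lowest = TRUE is also set, so it is worth checking whether such values occur before switching.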

Number of flights (total over the year) leaving each airport, by time of day

In [18]:
# Same binning as above, applied to all of 2013 (no date filter, so no date column is needed)
flights_year <- flights %>%  
    transform(group = cut(dep_time, breaks=c(0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400), labels=c('0-259', '300-559', '600-859', '900-1159', '1200-1459', '1500-1759', '1800-2059', '2100-2359') ) ) %>% 
    group_by(origin, group) %>% 
    summarize(volume = n())