Introduction to Data Analysis

Lecture 3: Working with Data Frames and Built-In Datasets

Today:

  1. Working with data frames
  2. Using dplyr

Part 1: Working with data frames

1.1. Arithmetic with lists and columns

Recall the data frame studentdata which we created last class:

In [1]:
heightList <- c(72, 65, 59)
weightList <- c(150, 180, 110)
studentdata <- as.data.frame(cbind(weightList, heightList) )
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata
Out[1]:
weightListheightListmajorshaveTakenTextAndIdeas
150 72 Music TRUE
180 65 Psychology TRUE
110 59 LinguisticsFALSE
In [2]:
names(studentdata)[1] <- 'weight_lb'
In [3]:
names(studentdata)[2] <- 'height_in'

Here, weight is in pounds and height is in inches. Suppose that you would like to create a new column where the height data is in cm.

First, let's consider how we would do that if we only have one number to work with.

In [4]:
height_Alex_in <- 72

Fact One inch is equal to 2.54 cm.

So, if Alex's height is 72 inches, then Alex's height in cm is $72 \times 2.54$.

In [5]:
height_Alex_cm <- height_Alex_in * 2.54
print(height_Alex_in)
print(height_Alex_cm)
[1] 72
[1] 182.88

The list height contains the heights of the three students in our dataset. Let's create a second list, called height_cm, which will contain the height of the same three students in cm. We can do the same arithmetic as above, and the arithmetic operation will be applied to each number on that list.

In [6]:
print(heightList)
[1] 72 65 59
In [7]:
heightList_cm <- heightList * 2.54
print(heightList_cm)
[1] 182.88 165.10 149.86

We can then add a new column called height_cm in the studentdata dataset that contains these values:

In [8]:
studentdata$height_cm <- heightList_cm
studentdata
Out[8]:
weight_lbheight_inmajorshaveTakenTextAndIdeasheight_cm
150 72 Music TRUE 182.88
180 65 Psychology TRUE 165.10
110 59 LinguisticsFALSE 149.86

Note: we could have done the above steps in one line. For example, let's do something similar with the weight column. Suppose that instead of working with lbs, we prefer to work with kgs.

Since 1 pound is equal to 0.453592 kg, we multiply each number in the weight column by 0.453592, then store the result in a new column called weight_kg.

In [9]:
studentdata$weight_kg <- studentdata$weight_lb * 0.453592
In [10]:
studentdata
Out[10]:
weight_lbheight_inmajorshaveTakenTextAndIdeasheight_cmweight_kg
150 72 Music TRUE 182.88 68.03880
180 65 Psychology TRUE 165.10 81.64656
110 59 LinguisticsFALSE 149.86 49.89512

1.2. Finding sums and averages of a column

Suppose, for example, that our three students are about to enter an elevator. Inside the elevator, a sign states that the total weight of the passangers of the elevator car cannot exceed 425 pounds. Is there a way to quickly add the numbers in the weight column?

Recall that the function sum() takes the sum of the numbers that are entered within its parentheses. For example:

In [11]:
sum(4, 100, 34.7, -100)
Out[11]:
38.7

We can also use sum() to take the sum of all numbers in a list or in a column of a data frame.

In [12]:
sum(studentdata$weight_lb)
Out[12]:
440
In [13]:
newlist <- 1:100
sum(newlist)
Out[13]:
5050
In [14]:
sum(-100:100)
Out[14]:
0

Of course, we can only take the sum of numeric data. If we try to take the sum of a column that stores text data, we will get an error message.

In [16]:
sum(studentdata$majors)
Error in sum(studentdata$majors): invalid 'type' (character) of argument
Traceback:

Another example: Perhaps we are interested in finding the average of the three students' height. Recall that to compute an average, we simply add all numbers and divide by the number of people. So, to compute the average height of the three students (in inches), we can first use sum(), then divide the result by 3:

In [17]:
sum(studentdata$height_in)/3
Out[17]:
65.3333333333333

Exercise In the cell below, find the average weight of the students, in kilograms.

A New Function While we can take averages quite easily by first taking the sum and then dividing by the number of individuals, R has kindly provided a function that allows us to compute averages a little more easily. This is the mean() function ("the mean" is synonymous to "the average"). We will get the same result using the two methods.

In [18]:
mean(studentdata$height_in)
Out[18]:
65.3333333333333

Exercise In the cell below, find the average weight of the students, in kilograms, using the mean() function

1.3. Finding the largest and the smallest number in a column

Our studentdata dataset is very small; therefore, finding the largest and the smallest height, for example, is very easy to do.

However, if we are working with a dataset that has hundreds of thousdands of rows (for example, the nycflights13 dataset), it is useful to have a function that will do this quickly for us.

New Functions

  • max(): to find the largest number in a list or in a column
  • min(): to find the smallest number in a list or in a column
In [19]:
max(studentdata$height_in)
Out[19]:
72
In [20]:
min(studentdata$height_in)
Out[20]:
59

What happens if we apply max() or min() to a list containing text data? Let's try:

In [21]:
max(studentdata$majors)
Out[21]:
'Psychology'
In [22]:
min(studentdata$majors)
Out[22]:
'Linguistics'

Any guess what happenned here?

In [23]:
studentdata
Out[23]:
weight_lbheight_inmajorshaveTakenTextAndIdeasheight_cmweight_kg
150 72 Music TRUE 182.88 68.03880
180 65 Psychology TRUE 165.10 81.64656
110 59 LinguisticsFALSE 149.86 49.89512

1.4. Working with missing data

Suppose that we have a fourth student, David, whose data we would like to add as a fourth row in our data frame.

In [24]:
studentdata <- rbind(studentdata, data.frame(weight_lb = 200, height_in = NA, majors = "Political Science", haveTakenTextAndIdeas = FALSE, height_cm = NA, weight_kg = 200 * 0.453592))
In [25]:
studentdata
Out[25]:
weight_lbheight_inmajorshaveTakenTextAndIdeasheight_cmweight_kg
150 72 Music TRUE 182.88 68.03880
180 65 Psychology TRUE 165.10 81.64656
110 59 Linguistics FALSE 149.86 49.89512
200 NA Political ScienceFALSE NA 90.71840
In [26]:
class(studentdata$height_in)
Out[26]:
'numeric'
In [27]:
mean(studentdata$height_in)
Out[27]:
<NA>

Note that the mean of heights is now NA because David's height data is missing. You will see that real-world data are often incomplete, and that it is not uncommon for datasets to have many NAs.

So, what do we do if we are still interested in finding the average of the three remaining students' heights? We can still use the mean() function, telling R to ignore any NAs:

In [28]:
mean(studentdata$height_in, na.rm=TRUE) #na.rm=TRUE means: "remove any NA's before computing the mean of the height_in column"
Out[28]:
65.3333333333333

We can also add na.rm=TRUE in other functions where we want to compute a summary of numerical data:

In [29]:
sum(studentdata$height_in, na.rm=TRUE) / 3
Out[29]:
65.3333333333333

Part 2: Using the dplyr package

As we saw in our last lecture, sometimes we need to modify data frames before we are able to glean certain information from it. Today, we will learn about various R functions that will help us in modifying a given data frame. This process is often called "Data Wrangling" and is an important skill to have if we want to be proficient and comfortable with data analysis.

A side note about R and packages

R is an amazing tool, because a lot of people contributes to the development of R. By downloading "packages" (these are codes for various functions, written by others), we can make a difficult process easier.

The main R packages that we will work with in our course are dplyr and ggplot2. The package tidyverse contains both dplyr and ggplot2; so we begin by installing and loading tidyverse:

In [30]:
install.packages('tidyverse')
Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘rematch’, ‘cellranger’, ‘clipr’, ‘cli’, ‘httr’, ‘lubridate’, ‘purrr’, ‘readxl’, ‘reprex’, ‘rlang’, ‘rstudioapi’, ‘tibble’, ‘tidyr’

In [31]:
library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Let's revisit the nycflights13 dataset.

In [32]:
library('nycflights13')
In [33]:
head(flights)
Out[33]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00

A few questions that we brainstormed last class are:

  1. Which airline is the "worst" in terms of the amount of delays?
  2. How many flights departed from each of the three airports?
  3. Which NYC airport is the worst in terms of the amount of delays?
  4. How many flights departed in the morning? afternoon? evening?

(and many more questions!)

Question 1: Which airline is the worst in terms of the amount of delays?

First, we need to think, what do we mean by "worst", exactly. Are we asking for simply the number of flights that had any amount of delays, or are we asking for the total amount of delays accumulated by each airlines (or maybe something else)?

Furthermore, we notice that there is a dep_delay column and an arr_delay column. Which type of delay are we interested in?

When we are answering these types of questions, it is usually not clear how we should quantify "worst" (or "best"), and we can argue about this for a long time. So, what we would do is simply pick one and go for it, for now.

Let's start with the following goal

For each airline, let's add up the number of minutes that the flights are delayed. Then, we will rank the airlines based on the sum of their number of minutes of delays. Let's also suppose that we care only about the arrival delay (that is, if a flight was delayed in its departure but still arrived as scheduled, this did not really upset the passangers.)

To do do this, we need to group the flights by the carrier, and then we will sum up their arr_delay column.

We do this in two steps, using two new functions:

  1. group_by(): this function lets us group the rows of a data frame according to the value of one of the columns.
  2. summarize(): this function lets us compute a summary number for each of the group.
In [34]:
flights_groupedby_carrier <- group_by(flights, carrier)
In [35]:
arrdelay_bycarrier <- summarize(flights_groupedby_carrier, total_arrdelay = sum(arr_delay, na.rm=TRUE))
In [36]:
arrdelay_bycarrier <- merge(arrdelay_bycarrier, airlines, by="carrier")
In [37]:
arrange(arrdelay_bycarrier, desc(total_arrdelay))
Out[37]:
carriertotal_arrdelayname
EV 807324 ExpressJet Airlines Inc.
B6 511194 JetBlue Airways
MQ 269767 Envoy Air
UA 205589 United Air Lines Inc.
9E 127624 Endeavor Air Inc.
WN 116214 Southwest Airlines Co.
DL 78366 Delta Air Lines Inc.
FL 63868 AirTran Airways Corporation
US 42232 US Airways Inc.
F9 14928 Frontier Airlines Inc.
AA 11638 American Airlines Inc.
VX 9027 Virgin America
YV 8463 Mesa Airlines Inc.
OO 346 SkyWest Airlines Inc.
HA -2365 Hawaiian Airlines Inc.
AS -7041 Alaska Airlines Inc.

Conclusion of the above exploratory analysis: Based on the sum of the arrival delays (in minutes), Alaska Airlines had the best record in 2013, while ExpressJet Airlines Inc. had the worst record.

While this is revealing, what else should we think about before giving the award of "The most delayed airline in 2013" to ExpressJet Airlines Inc.?

  • What about all those missing data (NAs) that we discarded?
  • It is possible that perhaps the sum of arrival delays is not the right quantity to use to rank the airlines. For example, airlines that are more majors will have a lot more flights than smaller airlines, and therefore might accumulate more delay minutes even if each delay is only, let's say, one minute long.
  • It seems like if an airline has a lot of flights with small delays, they are not worse than an airline with less volume that is consistently very delayed.
  • How about consistency of delays? It is possible that ExpressJet is usualy on-time, except for a handful of flights that are massively delayed due to a rare weather-related event.
In [38]:
ave_arr_delay_byCarrier <- summarize(flights_groupedby_carrier, ave_arr_delay = mean(arr_delay, na.rm=TRUE) )
In [39]:
ave_arr_delay_byCarrier <- merge(ave_arr_delay_byCarrier, airlines, by="carrier")
In [40]:
arrange(ave_arr_delay_byCarrier, desc(ave_arr_delay) )
Out[40]:
carrierave_arr_delayname
F9 21.9207048 Frontier Airlines Inc.
FL 20.1159055 AirTran Airways Corporation
EV 15.7964311 ExpressJet Airlines Inc.
YV 15.5569853 Mesa Airlines Inc.
OO 11.9310345 SkyWest Airlines Inc.
MQ 10.7747334 Envoy Air
WN 9.6491199 Southwest Airlines Co.
B6 9.4579733 JetBlue Airways
9E 7.3796692 Endeavor Air Inc.
UA 3.5580111 United Air Lines Inc.
US 2.1295951 US Airways Inc.
VX 1.7644644 Virgin America
DL 1.6443409 Delta Air Lines Inc.
AA 0.3642909 American Airlines Inc.
HA -6.9152047 Hawaiian Airlines Inc.
AS -9.9308886 Alaska Airlines Inc.
In [41]:
flights %>%   # don't worry about the code in this cell for now!  We will learn about them shortly.
  filter(!is.na(arr_delay)) %>% 
  filter(arr_delay > 0) %>%
  group_by(carrier) %>% 
  summarize(total_arr_delay = sum(arr_delay), count = n()) %>% 
  mutate(ave_arr_delay = total_arr_delay/count) %>%
  merge(airlines, by="carrier") %>%
  arrange(desc(ave_arr_delay))
Out[41]:
carriertotal_arr_delaycountave_arr_delayname
OO 606 10 60.60000 SkyWest Airlines Inc.
YV 13179 258 51.08140 Mesa Airlines Inc.
9E 327023 6637 49.27271 Endeavor Air Inc.
EV 1181808 24484 48.26858 ExpressJet Airlines Inc.
F9 18651 392 47.57908 Frontier Airlines Inc.
VX 76557 1746 43.84708 Virgin America
FL 77874 1895 41.09446 AirTran Airways Corporation
WN 216125 5304 40.74755 Southwest Airlines Co.
B6 944574 23609 40.00906 JetBlue Airways
AA 409671 10706 38.26555 American Airlines Inc.
MQ 442604 11693 37.85205 Envoy Air
DL 619485 16413 37.74356 Delta Air Lines Inc.
UA 814458 22222 36.65098 United Air Lines Inc.
HA 3398 97 35.03093 Hawaiian Airlines Inc.
AS 6495 189 34.36508 Alaska Airlines Inc.
US 213206 7349 29.01157 US Airways Inc.