Introduction to Data Analysis¶

Lecture 3: Working with Data Frames and Built-In Datasets¶

Today:

Working with data frames
Using dplyr

Part 1: Working with data frames¶

1.1. Arithmetic with lists and columns¶

Recall the data frame studentdata which we created last class:

heightList <- c(72, 65, 59)
weightList <- c(150, 180, 110)
studentdata <- as.data.frame(cbind(weightList, heightList) )
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata

names(studentdata)[1] <- 'weight_lb'

names(studentdata)[2] <- 'height_in'

Here, weight is in pounds and height is in inches. Suppose that you would like to create a new column where the height data is in cm.

First, let's consider how we would do that if we only have one number to work with.

height_Alex_in <- 72

Fact One inch is equal to 2.54 cm.

So, if Alex's height is 72 inches, then Alex's height in cm is $72 \times 2.54$.

height_Alex_cm <- height_Alex_in * 2.54
print(height_Alex_in)
print(height_Alex_cm)

[1] 72
[1] 182.88

The list height contains the heights of the three students in our dataset. Let's create a second list, called height_cm, which will contain the height of the same three students in cm. We can do the same arithmetic as above, and the arithmetic operation will be applied to each number on that list.

print(heightList)

[1] 72 65 59

heightList_cm <- heightList * 2.54
print(heightList_cm)

[1] 182.88 165.10 149.86

We can then add a new column called height_cm in the studentdata dataset that contains these values:

studentdata$height_cm <- heightList_cm
studentdata

Note: we could have done the above steps in one line. For example, let's do something similar with the weight column. Suppose that instead of working with lbs, we prefer to work with kgs.

Since 1 pound is equal to 0.453592 kg, we multiply each number in the weight column by 0.453592, then store the result in a new column called weight_kg.

studentdata$weight_kg <- studentdata$weight_lb * 0.453592

studentdata

1.2. Finding sums and averages of a column¶

Suppose, for example, that our three students are about to enter an elevator. Inside the elevator, a sign states that the total weight of the passangers of the elevator car cannot exceed 425 pounds. Is there a way to quickly add the numbers in the weight column?

Recall that the function sum() takes the sum of the numbers that are entered within its parentheses. For example:

sum(4, 100, 34.7, -100)

We can also use sum() to take the sum of all numbers in a list or in a column of a data frame.

sum(studentdata$weight_lb)

newlist <- 1:100
sum(newlist)

sum(-100:100)

Of course, we can only take the sum of numeric data. If we try to take the sum of a column that stores text data, we will get an error message.

sum(studentdata$majors)

Error in sum(studentdata$majors): invalid 'type' (character) of argument
Traceback:

Another example: Perhaps we are interested in finding the average of the three students' height. Recall that to compute an average, we simply add all numbers and divide by the number of people. So, to compute the average height of the three students (in inches), we can first use sum(), then divide the result by 3:

sum(studentdata$height_in)/3

Exercise In the cell below, find the average weight of the students, in kilograms.

A New Function While we can take averages quite easily by first taking the sum and then dividing by the number of individuals, R has kindly provided a function that allows us to compute averages a little more easily. This is the mean() function ("the mean" is synonymous to "the average"). We will get the same result using the two methods.

mean(studentdata$height_in)

Exercise In the cell below, find the average weight of the students, in kilograms, using the mean() function

1.3. Finding the largest and the smallest number in a column¶

Our studentdata dataset is very small; therefore, finding the largest and the smallest height, for example, is very easy to do.

However, if we are working with a dataset that has hundreds of thousdands of rows (for example, the nycflights13 dataset), it is useful to have a function that will do this quickly for us.

New Functions

max(): to find the largest number in a list or in a column
min(): to find the smallest number in a list or in a column

max(studentdata$height_in)

min(studentdata$height_in)

What happens if we apply max() or min() to a list containing text data? Let's try:

max(studentdata$majors)

min(studentdata$majors)

Any guess what happenned here?

studentdata

1.4. Working with missing data¶

Suppose that we have a fourth student, David, whose data we would like to add as a fourth row in our data frame.

studentdata <- rbind(studentdata, data.frame(weight_lb = 200, height_in = NA, majors = "Political Science", haveTakenTextAndIdeas = FALSE, height_cm = NA, weight_kg = 200 * 0.453592))

studentdata

class(studentdata$height_in)

mean(studentdata$height_in)

Note that the mean of heights is now NA because David's height data is missing. You will see that real-world data are often incomplete, and that it is not uncommon for datasets to have many NAs.

So, what do we do if we are still interested in finding the average of the three remaining students' heights? We can still use the mean() function, telling R to ignore any NAs:

mean(studentdata$height_in, na.rm=TRUE) #na.rm=TRUE means: "remove any NA's before computing the mean of the height_in column"

We can also add na.rm=TRUE in other functions where we want to compute a summary of numerical data:

sum(studentdata$height_in, na.rm=TRUE) / 3

Part 2: Using the `dplyr` package¶

As we saw in our last lecture, sometimes we need to modify data frames before we are able to glean certain information from it. Today, we will learn about various R functions that will help us in modifying a given data frame. This process is often called "Data Wrangling" and is an important skill to have if we want to be proficient and comfortable with data analysis.

A side note about R and packages¶

R is an amazing tool, because a lot of people contributes to the development of R. By downloading "packages" (these are codes for various functions, written by others), we can make a difficult process easier.

The main R packages that we will work with in our course are dplyr and ggplot2. The package tidyverse contains both dplyr and ggplot2; so we begin by installing and loading tidyverse:

install.packages('tidyverse')

Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘rematch’, ‘cellranger’, ‘clipr’, ‘cli’, ‘httr’, ‘lubridate’, ‘purrr’, ‘readxl’, ‘reprex’, ‘rlang’, ‘rstudioapi’, ‘tibble’, ‘tidyr’

library('tidyverse')

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Let's revisit the nycflights13 dataset.

library('nycflights13')

head(flights)

A few questions that we brainstormed last class are:

Which airline is the "worst" in terms of the amount of delays?
How many flights departed from each of the three airports?
Which NYC airport is the worst in terms of the amount of delays?
How many flights departed in the morning? afternoon? evening?

(and many more questions!)

Question 1: Which airline is the worst in terms of the amount of delays?¶

First, we need to think, what do we mean by "worst", exactly. Are we asking for simply the number of flights that had any amount of delays, or are we asking for the total amount of delays accumulated by each airlines (or maybe something else)?

Furthermore, we notice that there is a dep_delay column and an arr_delay column. Which type of delay are we interested in?

When we are answering these types of questions, it is usually not clear how we should quantify "worst" (or "best"), and we can argue about this for a long time. So, what we would do is simply pick one and go for it, for now.

Let's start with the following goal¶

For each airline, let's add up the number of minutes that the flights are delayed. Then, we will rank the airlines based on the sum of their number of minutes of delays. Let's also suppose that we care only about the arrival delay (that is, if a flight was delayed in its departure but still arrived as scheduled, this did not really upset the passangers.)

To do do this, we need to group the flights by the carrier, and then we will sum up their arr_delay column.

We do this in two steps, using two new functions:

group_by(): this function lets us group the rows of a data frame according to the value of one of the columns.
summarize(): this function lets us compute a summary number for each of the group.

flights_groupedby_carrier <- group_by(flights, carrier)

arrdelay_bycarrier <- summarize(flights_groupedby_carrier, total_arrdelay = sum(arr_delay, na.rm=TRUE))

arrdelay_bycarrier <- merge(arrdelay_bycarrier, airlines, by="carrier")

arrange(arrdelay_bycarrier, desc(total_arrdelay))

Conclusion of the above exploratory analysis: Based on the sum of the arrival delays (in minutes), Alaska Airlines had the best record in 2013, while ExpressJet Airlines Inc. had the worst record.

While this is revealing, what else should we think about before giving the award of "The most delayed airline in 2013" to ExpressJet Airlines Inc.?

What about all those missing data (NAs) that we discarded?
It is possible that perhaps the sum of arrival delays is not the right quantity to use to rank the airlines. For example, airlines that are more majors will have a lot more flights than smaller airlines, and therefore might accumulate more delay minutes even if each delay is only, let's say, one minute long.
It seems like if an airline has a lot of flights with small delays, they are not worse than an airline with less volume that is consistently very delayed.
How about consistency of delays? It is possible that ExpressJet is usualy on-time, except for a handful of flights that are massively delayed due to a rare weather-related event.

ave_arr_delay_byCarrier <- summarize(flights_groupedby_carrier, ave_arr_delay = mean(arr_delay, na.rm=TRUE) )

ave_arr_delay_byCarrier <- merge(ave_arr_delay_byCarrier, airlines, by="carrier")

arrange(ave_arr_delay_byCarrier, desc(ave_arr_delay) )

flights %>%   # don't worry about the code in this cell for now!  We will learn about them shortly.
  filter(!is.na(arr_delay)) %>% 
  filter(arr_delay > 0) %>%
  group_by(carrier) %>% 
  summarize(total_arr_delay = sum(arr_delay), count = n()) %>% 
  mutate(ave_arr_delay = total_arr_delay/count) %>%
  merge(airlines, by="carrier") %>%
  arrange(desc(ave_arr_delay))

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

carrier	total_arrdelay	name
EV	807324	ExpressJet Airlines Inc.
B6	511194	JetBlue Airways
MQ	269767	Envoy Air
UA	205589	United Air Lines Inc.
9E	127624	Endeavor Air Inc.
WN	116214	Southwest Airlines Co.
DL	78366	Delta Air Lines Inc.
FL	63868	AirTran Airways Corporation
US	42232	US Airways Inc.
F9	14928	Frontier Airlines Inc.
AA	11638	American Airlines Inc.
VX	9027	Virgin America
YV	8463	Mesa Airlines Inc.
OO	346	SkyWest Airlines Inc.
HA	-2365	Hawaiian Airlines Inc.
AS	-7041	Alaska Airlines Inc.

carrier	ave_arr_delay	name
F9	21.9207048	Frontier Airlines Inc.
FL	20.1159055	AirTran Airways Corporation
EV	15.7964311	ExpressJet Airlines Inc.
YV	15.5569853	Mesa Airlines Inc.
OO	11.9310345	SkyWest Airlines Inc.
MQ	10.7747334	Envoy Air
WN	9.6491199	Southwest Airlines Co.
B6	9.4579733	JetBlue Airways
9E	7.3796692	Endeavor Air Inc.
UA	3.5580111	United Air Lines Inc.
US	2.1295951	US Airways Inc.
VX	1.7644644	Virgin America
DL	1.6443409	Delta Air Lines Inc.
AA	0.3642909	American Airlines Inc.
HA	-6.9152047	Hawaiian Airlines Inc.
AS	-9.9308886	Alaska Airlines Inc.

carrier	total_arr_delay	count	ave_arr_delay	name
OO	606	10	60.60000	SkyWest Airlines Inc.
YV	13179	258	51.08140	Mesa Airlines Inc.
9E	327023	6637	49.27271	Endeavor Air Inc.
EV	1181808	24484	48.26858	ExpressJet Airlines Inc.
F9	18651	392	47.57908	Frontier Airlines Inc.
VX	76557	1746	43.84708	Virgin America
FL	77874	1895	41.09446	AirTran Airways Corporation
WN	216125	5304	40.74755	Southwest Airlines Co.
B6	944574	23609	40.00906	JetBlue Airways
AA	409671	10706	38.26555	American Airlines Inc.
MQ	442604	11693	37.85205	Envoy Air
DL	619485	16413	37.74356	Delta Air Lines Inc.
UA	814458	22222	36.65098	United Air Lines Inc.
HA	3398	97	35.03093	Hawaiian Airlines Inc.
AS	6495	189	34.36508	Alaska Airlines Inc.
US	213206	7349	29.01157	US Airways Inc.

weight_lb	height_in	majors	haveTakenTextAndIdeas	height_cm
150	72	Music	TRUE	182.88
180	65	Psychology	TRUE	165.10
110	59	Linguistics	FALSE	149.86