heightList <- c(72, 65, 59)
weightList <- c(150, 180, 110)
studentdata <- as.data.frame(cbind(weightList, heightList) )
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata
names(studentdata)[1] <- 'weight_lb'
names(studentdata)[2] <- 'height_in'
Here, weight is in pounds and height is in inches. Suppose that you would like to create a new column where the height data is in cm.
First, let's consider how we would do that if we only have one number to work with.
height_Alex_in <- 72
Fact One inch is equal to 2.54 cm.
So, if Alex's height is 72 inches, then Alex's height in cm is $72 \times 2.54$.
height_Alex_cm <- height_Alex_in * 2.54
print(height_Alex_in)
print(height_Alex_cm)
The list height
contains the heights of the three students in our dataset. Let's create a second list, called height_cm
, which will contain the height of the same three students in cm. We can do the same arithmetic as above, and the arithmetic operation will be applied to each number on that list.
print(heightList)
heightList_cm <- heightList * 2.54
print(heightList_cm)
We can then add a new column called height_cm
in the studentdata dataset that contains these values:
studentdata$height_cm <- heightList_cm
studentdata
Note: we could have done the above steps in one line. For example, let's do something similar with the weight column. Suppose that instead of working with lbs, we prefer to work with kgs.
Since 1 pound is equal to 0.453592 kg, we multiply each number in the weight
column by 0.453592, then store the result in a new column called weight_kg
.
studentdata$weight_kg <- studentdata$weight_lb * 0.453592
studentdata
Suppose, for example, that our three students are about to enter an elevator. Inside the elevator, a sign states that the total weight of the passangers of the elevator car cannot exceed 425 pounds. Is there a way to quickly add the numbers in the weight
column?
Recall that the function sum()
takes the sum of the numbers that are entered within its parentheses. For example:
sum(4, 100, 34.7, -100)
We can also use sum()
to take the sum of all numbers in a list or in a column of a data frame.
sum(studentdata$weight_lb)
newlist <- 1:100
sum(newlist)
sum(-100:100)
Of course, we can only take the sum of numeric data. If we try to take the sum of a column that stores text data, we will get an error message.
sum(studentdata$majors)
Another example: Perhaps we are interested in finding the average of the three students' height. Recall that to compute an average, we simply add all numbers and divide by the number of people. So, to compute the average height of the three students (in inches), we can first use sum()
, then divide the result by 3:
sum(studentdata$height_in)/3
Exercise In the cell below, find the average weight of the students, in kilograms.
A New Function While we can take averages quite easily by first taking the sum and then dividing by the number of individuals, R has kindly provided a function that allows us to compute averages a little more easily. This is the mean()
function ("the mean" is synonymous to "the average"). We will get the same result using the two methods.
mean(studentdata$height_in)
Exercise In the cell below, find the average weight of the students, in kilograms, using the mean()
function
Our studentdata
dataset is very small; therefore, finding the largest and the smallest height, for example, is very easy to do.
However, if we are working with a dataset that has hundreds of thousdands of rows (for example, the nycflights13
dataset), it is useful to have a function that will do this quickly for us.
New Functions
max()
: to find the largest number in a list or in a columnmin()
: to find the smallest number in a list or in a columnmax(studentdata$height_in)
min(studentdata$height_in)
What happens if we apply max()
or min()
to a list containing text data? Let's try:
max(studentdata$majors)
min(studentdata$majors)
Any guess what happenned here?
studentdata
Suppose that we have a fourth student, David, whose data we would like to add as a fourth row in our data frame.
studentdata <- rbind(studentdata, data.frame(weight_lb = 200, height_in = NA, majors = "Political Science", haveTakenTextAndIdeas = FALSE, height_cm = NA, weight_kg = 200 * 0.453592))
studentdata
class(studentdata$height_in)
mean(studentdata$height_in)
Note that the mean of heights is now NA
because David's height data is missing. You will see that real-world data are often incomplete, and that it is not uncommon for datasets to have many NA
s.
So, what do we do if we are still interested in finding the average of the three remaining students' heights? We can still use the mean()
function, telling R to ignore any NA
s:
mean(studentdata$height_in, na.rm=TRUE) #na.rm=TRUE means: "remove any NA's before computing the mean of the height_in column"
We can also add na.rm=TRUE
in other functions where we want to compute a summary of numerical data:
sum(studentdata$height_in, na.rm=TRUE) / 3
dplyr
package¶As we saw in our last lecture, sometimes we need to modify data frames before we are able to glean certain information from it. Today, we will learn about various R functions that will help us in modifying a given data frame. This process is often called "Data Wrangling" and is an important skill to have if we want to be proficient and comfortable with data analysis.
R is an amazing tool, because a lot of people contributes to the development of R. By downloading "packages" (these are codes for various functions, written by others), we can make a difficult process easier.
The main R packages that we will work with in our course are dplyr
and ggplot2
. The package tidyverse
contains both dplyr
and ggplot2
; so we begin by installing and loading tidyverse
:
install.packages('tidyverse')
library('tidyverse')
Let's revisit the nycflights13
dataset.
library('nycflights13')
head(flights)
A few questions that we brainstormed last class are:
(and many more questions!)
First, we need to think, what do we mean by "worst", exactly. Are we asking for simply the number of flights that had any amount of delays, or are we asking for the total amount of delays accumulated by each airlines (or maybe something else)?
Furthermore, we notice that there is a dep_delay
column and an arr_delay
column. Which type of delay are we interested in?
When we are answering these types of questions, it is usually not clear how we should quantify "worst" (or "best"), and we can argue about this for a long time. So, what we would do is simply pick one and go for it, for now.
For each airline, let's add up the number of minutes that the flights are delayed. Then, we will rank the airlines based on the sum of their number of minutes of delays. Let's also suppose that we care only about the arrival delay (that is, if a flight was delayed in its departure but still arrived as scheduled, this did not really upset the passangers.)
To do do this, we need to group the flights by the carrier, and then we will sum up their arr_delay
column.
We do this in two steps, using two new functions:
group_by()
: this function lets us group the rows of a data frame according to the value of one of the columns.summarize()
: this function lets us compute a summary number for each of the group.flights_groupedby_carrier <- group_by(flights, carrier)
arrdelay_bycarrier <- summarize(flights_groupedby_carrier, total_arrdelay = sum(arr_delay, na.rm=TRUE))
arrdelay_bycarrier <- merge(arrdelay_bycarrier, airlines, by="carrier")
arrange(arrdelay_bycarrier, desc(total_arrdelay))
Conclusion of the above exploratory analysis: Based on the sum of the arrival delays (in minutes), Alaska Airlines had the best record in 2013, while ExpressJet Airlines Inc. had the worst record.
While this is revealing, what else should we think about before giving the award of "The most delayed airline in 2013" to ExpressJet Airlines Inc.?
NA
s) that we discarded?ave_arr_delay_byCarrier <- summarize(flights_groupedby_carrier, ave_arr_delay = mean(arr_delay, na.rm=TRUE) )
ave_arr_delay_byCarrier <- merge(ave_arr_delay_byCarrier, airlines, by="carrier")
arrange(ave_arr_delay_byCarrier, desc(ave_arr_delay) )
flights %>% # don't worry about the code in this cell for now! We will learn about them shortly.
filter(!is.na(arr_delay)) %>%
filter(arr_delay > 0) %>%
group_by(carrier) %>%
summarize(total_arr_delay = sum(arr_delay), count = n()) %>%
mutate(ave_arr_delay = total_arr_delay/count) %>%
merge(airlines, by="carrier") %>%
arrange(desc(ave_arr_delay))