heightList <- c(72, 65, 59) weightList <- c(150, 180, 110) studentdata <- as.data.frame(cbind(weightList, heightList) ) studentdata$majors <- c('Music', 'Psychology', 'Linguistics') studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE) studentdata
names(studentdata) <- 'weight_lb'
names(studentdata) <- 'height_in'
Here, weight is in pounds and height is in inches. Suppose that you would like to create a new column where the height data is in cm.
First, let's consider how we would do that if we only have one number to work with.
height_Alex_in <- 72
Fact One inch is equal to 2.54 cm.
So, if Alex's height is 72 inches, then Alex's height in cm is .
height_Alex_cm <- height_Alex_in * 2.54 print(height_Alex_in) print(height_Alex_cm)
 72  182.88
height contains the heights of the three students in our dataset. Let's create a second list, called
height_cm, which will contain the height of the same three students in cm. We can do the same arithmetic as above, and the arithmetic operation will be applied to each number on that list.
 72 65 59
heightList_cm <- heightList * 2.54 print(heightList_cm)
 182.88 165.10 149.86
We can then add a new column called
height_cm in the studentdata dataset that contains these values:
studentdata$height_cm <- heightList_cm studentdata
Note: we could have done the above steps in one line. For example, let's do something similar with the weight column. Suppose that instead of working with lbs, we prefer to work with kgs.
Since 1 pound is equal to 0.453592 kg, we multiply each number in the
weight column by 0.453592, then store the result in a new column called
studentdata$weight_kg <- studentdata$weight_lb * 0.453592
Suppose, for example, that our three students are about to enter an elevator. Inside the elevator, a sign states that the total weight of the passangers of the elevator car cannot exceed 425 pounds. Is there a way to quickly add the numbers in the
Recall that the function
sum() takes the sum of the numbers that are entered within its parentheses. For example:
sum(4, 100, 34.7, -100)
We can also use
sum() to take the sum of all numbers in a list or in a column of a data frame.
newlist <- 1:100 sum(newlist)
Of course, we can only take the sum of numeric data. If we try to take the sum of a column that stores text data, we will get an error message.
Error in sum(studentdata$majors): invalid 'type' (character) of argument Traceback:
Another example: Perhaps we are interested in finding the average of the three students' height. Recall that to compute an average, we simply add all numbers and divide by the number of people. So, to compute the average height of the three students (in inches), we can first use
sum(), then divide the result by 3:
Exercise In the cell below, find the average weight of the students, in kilograms.
A New Function While we can take averages quite easily by first taking the sum and then dividing by the number of individuals, R has kindly provided a function that allows us to compute averages a little more easily. This is the
mean() function ("the mean" is synonymous to "the average"). We will get the same result using the two methods.
Exercise In the cell below, find the average weight of the students, in kilograms, using the
studentdata dataset is very small; therefore, finding the largest and the smallest height, for example, is very easy to do.
However, if we are working with a dataset that has hundreds of thousdands of rows (for example, the
nycflights13 dataset), it is useful to have a function that will do this quickly for us.
max(): to find the largest number in a list or in a column
min(): to find the smallest number in a list or in a column
What happens if we apply
min() to a list containing text data? Let's try:
Any guess what happenned here?
Suppose that we have a fourth student, David, whose data we would like to add as a fourth row in our data frame.
studentdata <- rbind(studentdata, data.frame(weight_lb = 200, height_in = NA, majors = "Political Science", haveTakenTextAndIdeas = FALSE, height_cm = NA, weight_kg = 200 * 0.453592))
Note that the mean of heights is now
NA because David's height data is missing. You will see that real-world data are often incomplete, and that it is not uncommon for datasets to have many
So, what do we do if we are still interested in finding the average of the three remaining students' heights? We can still use the
mean() function, telling R to ignore any
mean(studentdata$height_in, na.rm=TRUE) #na.rm=TRUE means: "remove any NA's before computing the mean of the height_in column"
We can also add
na.rm=TRUE in other functions where we want to compute a summary of numerical data:
sum(studentdata$height_in, na.rm=TRUE) / 3
As we saw in our last lecture, sometimes we need to modify data frames before we are able to glean certain information from it. Today, we will learn about various R functions that will help us in modifying a given data frame. This process is often called "Data Wrangling" and is an important skill to have if we want to be proficient and comfortable with data analysis.
R is an amazing tool, because a lot of people contributes to the development of R. By downloading "packages" (these are codes for various functions, written by others), we can make a difficult process easier.
The main R packages that we will work with in our course are
ggplot2. The package
tidyverse contains both
ggplot2; so we begin by installing and loading
Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.4’ (as ‘lib’ is unspecified) also installing the dependencies ‘rematch’, ‘cellranger’, ‘clipr’, ‘cli’, ‘httr’, ‘lubridate’, ‘purrr’, ‘readxl’, ‘reprex’, ‘rlang’, ‘rstudioapi’, ‘tibble’, ‘tidyr’
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ── ✔ ggplot2 2.2.1 ✔ purrr 0.2.4 ✔ tibble 1.3.4 ✔ dplyr 0.7.4 ✔ tidyr 0.7.2 ✔ stringr 1.2.0 ✔ readr 1.1.1 ✔ forcats 0.2.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
Let's revisit the
A few questions that we brainstormed last class are:
(and many more questions!)
First, we need to think, what do we mean by "worst", exactly. Are we asking for simply the number of flights that had any amount of delays, or are we asking for the total amount of delays accumulated by each airlines (or maybe something else)?
Furthermore, we notice that there is a
dep_delay column and an
arr_delay column. Which type of delay are we interested in?
When we are answering these types of questions, it is usually not clear how we should quantify "worst" (or "best"), and we can argue about this for a long time. So, what we would do is simply pick one and go for it, for now.
For each airline, let's add up the number of minutes that the flights are delayed. Then, we will rank the airlines based on the sum of their number of minutes of delays. Let's also suppose that we care only about the arrival delay (that is, if a flight was delayed in its departure but still arrived as scheduled, this did not really upset the passangers.)
To do do this, we need to group the flights by the carrier, and then we will sum up their
We do this in two steps, using two new functions:
group_by(): this function lets us group the rows of a data frame according to the value of one of the columns.
summarize(): this function lets us compute a summary number for each of the group.
flights_groupedby_carrier <- group_by(flights, carrier)
arrdelay_bycarrier <- summarize(flights_groupedby_carrier, total_arrdelay = sum(arr_delay, na.rm=TRUE))
arrdelay_bycarrier <- merge(arrdelay_bycarrier, airlines, by="carrier")
|EV||807324||ExpressJet Airlines Inc.|
|UA||205589||United Air Lines Inc.|
|9E||127624||Endeavor Air Inc.|
|WN||116214||Southwest Airlines Co.|
|DL||78366||Delta Air Lines Inc.|
|FL||63868||AirTran Airways Corporation|
|US||42232||US Airways Inc.|
|F9||14928||Frontier Airlines Inc.|
|AA||11638||American Airlines Inc.|
|YV||8463||Mesa Airlines Inc.|
|OO||346||SkyWest Airlines Inc.|
|HA||-2365||Hawaiian Airlines Inc.|
|AS||-7041||Alaska Airlines Inc.|
Conclusion of the above exploratory analysis: Based on the sum of the arrival delays (in minutes), Alaska Airlines had the best record in 2013, while ExpressJet Airlines Inc. had the worst record.
While this is revealing, what else should we think about before giving the award of "The most delayed airline in 2013" to ExpressJet Airlines Inc.?
NAs) that we discarded?
ave_arr_delay_byCarrier <- summarize(flights_groupedby_carrier, ave_arr_delay = mean(arr_delay, na.rm=TRUE) )
ave_arr_delay_byCarrier <- merge(ave_arr_delay_byCarrier, airlines, by="carrier")
arrange(ave_arr_delay_byCarrier, desc(ave_arr_delay) )
|F9||21.9207048||Frontier Airlines Inc.|
|FL||20.1159055||AirTran Airways Corporation|
|EV||15.7964311||ExpressJet Airlines Inc.|
|YV||15.5569853||Mesa Airlines Inc.|
|OO||11.9310345||SkyWest Airlines Inc.|
|WN||9.6491199||Southwest Airlines Co.|
|9E||7.3796692||Endeavor Air Inc.|
|UA||3.5580111||United Air Lines Inc.|
|US||2.1295951||US Airways Inc.|
|DL||1.6443409||Delta Air Lines Inc.|
|AA||0.3642909||American Airlines Inc.|
|HA||-6.9152047||Hawaiian Airlines Inc.|
|AS||-9.9308886||Alaska Airlines Inc.|
flights %>% # don't worry about the code in this cell for now! We will learn about them shortly. filter(!is.na(arr_delay)) %>% filter(arr_delay > 0) %>% group_by(carrier) %>% summarize(total_arr_delay = sum(arr_delay), count = n()) %>% mutate(ave_arr_delay = total_arr_delay/count) %>% merge(airlines, by="carrier") %>% arrange(desc(ave_arr_delay))
|OO||606||10||60.60000||SkyWest Airlines Inc.|
|YV||13179||258||51.08140||Mesa Airlines Inc.|
|9E||327023||6637||49.27271||Endeavor Air Inc.|
|EV||1181808||24484||48.26858||ExpressJet Airlines Inc.|
|F9||18651||392||47.57908||Frontier Airlines Inc.|
|FL||77874||1895||41.09446||AirTran Airways Corporation|
|WN||216125||5304||40.74755||Southwest Airlines Co.|
|AA||409671||10706||38.26555||American Airlines Inc.|
|DL||619485||16413||37.74356||Delta Air Lines Inc.|
|UA||814458||22222||36.65098||United Air Lines Inc.|
|HA||3398||97||35.03093||Hawaiian Airlines Inc.|
|AS||6495||189||34.36508||Alaska Airlines Inc.|
|US||213206||7349||29.01157||US Airways Inc.|