Today:
- dplyr functions
  - group_by() and summarize()
  - arrange(): to sort by a particular column
  - filter(): to remove rows
  - (other dplyr functions, such as mutate(), will be introduced later)
library('tidyverse')
dplyr functions
group_by() and summarize()
Recall that the function group_by() groups the rows of a data frame based on the values in one of its columns. However, when you use group_by() on its own, the resulting data set looks essentially the same as before. The function group_by() becomes useful when it is paired with another function, such as summarize().
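To make the point about group_by() "on its own" concrete, here is a minimal sketch. The data frame toy is a made-up stand-in (it is not the lecture's data set), and group_vars() and n_groups() are dplyr helpers that we use only to peek at the grouping.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major     = c("Music", "Psychology", "Music", "Linguistics"),
  height_in = c(72, 65, 63, 59)
)

toy_by_major <- group_by(toy, major)

# The rows and columns are unchanged by grouping ...
dim(toy)
dim(toy_by_major)

# ... but the grouped version remembers how it was grouped
group_vars(toy_by_major)  # name(s) of the grouping column(s)
n_groups(toy_by_major)    # number of distinct groups
```

The grouping is extra information attached to the data frame; it only changes results once a function such as summarize() makes use of it.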
To see more concretely what group_by() and summarize() do, consider the following example.
# do not modify this cell!
# we are creating an example dataset called studentdata2
height_in <- c(72, 65, 59, 63, 75, 60, 66, 70, 61)
weight_lb <- c(150, 180, 100, 135, 190, 120, 110, 170, 140 )
studentdata2 <- as.data.frame(cbind(height_in, weight_lb))
studentdata2$major <- c('Music', 'Psychology', 'Linguistics', 'Music', 'Music', 'Linguistics', 'Psychology', 'Linguistics', 'Music')
studentdata2$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
studentdata2
Suppose that we are interested in (1) finding the average height of the nine students in this dataset and (2) finding out the average height of students in each major.
(Maybe you have a conjecture that Music majors tend to be taller than Linguistics majors.)
mean(studentdata2$height_in)
Taking the average of all nine students' heights is straightforward.
Since this is a small data set, we can also compute the average height of the Music majors, the Psychology majors, and the Linguistics majors by hand.
Method 1: Manually
ave_height_musicmajors <- mean(c(72, 63, 75, 61))
ave_height_psychologymajors <- mean(c(65, 66))
ave_height_linguisticsmajors <- mean(c(59, 60, 70))
print(ave_height_musicmajors)
print(ave_height_psychologymajors)
print(ave_height_linguisticsmajors)
Easy enough! However, if this dataset contained thousands of rows and hundreds of majors (for example, the list of all undergraduate students at NYU), we would not want to use Method 1: it would take a long time, and typing the values by hand is error-prone. The second method below computes the same averages, but with the help of R functions.
Method 2: Using group_by() and summarize()
studentdata2_byMajor <- group_by(studentdata2, major)
summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
summary_df
Okay, let's interpret what happened in the above cell.
In line 1, we created a new data frame consisting of the same data as studentdata2, but grouped by the students' majors. We did this using the command:
studentdata2_byMajor <- group_by(studentdata2, major)
Note that group_by() takes two inputs here: (1) the name of the data frame and (2) the name of the column that we group the rows by. The grouped result is stored in a new data frame called studentdata2_byMajor.
In line 2, we used summarize() to compute the average height of the students in each group using the command:
summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
Let's parse what's going on here.
summarize() takes two inputs here: (1) the name of the (grouped) data frame and (2) an expression that computes one summary value per group (in our case, the mean() function "summarizes" the heights of the students in each group). By writing ave_height = ..., we are giving a name to the new summary column; this name should describe what the summary value is.
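summarize() is not limited to a single summary column. The sketch below computes several named summaries in one call. It uses a made-up data frame toy (not studentdata2), and n() is the dplyr helper that counts the rows in each group.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major     = c("Music", "Psychology", "Music", "Linguistics", "Psychology"),
  height_in = c(72, 65, 63, 59, 66),
  weight_lb = c(150, 180, 135, 100, 110)
)

toy_by_major <- group_by(toy, major)

# Each "name = expression" pair becomes one column of the summary,
# with one row per major
toy_summary <- summarize(
  toy_by_major,
  ave_height = mean(height_in),
  ave_weight = mean(weight_lb),
  n_students = n()
)
toy_summary
```

Including a count such as n_students alongside an average is a handy habit: an average over one or two students deserves less trust than an average over hundreds.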
Use group_by() and summarize() to find
arrange()
The function arrange() is quite straightforward: it sorts the rows of a data frame based on the values in one (or more) column(s), in either increasing or decreasing order.
For example, suppose that we want to display the studentdata2 data frame so that the students are listed by their height, in increasing order.
arrange(studentdata2, height_in)
We can also display them in decreasing order by height, by wrapping the column name in the function desc() (which stands for "descending"):
arrange(studentdata2, desc(height_in))
Suppose that we want to sort the rows by major AND by height (that is, sort by major first, and then within each major, sort by height, in increasing order):
arrange(studentdata2, major, height_in)
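The sort direction can be chosen separately for each column. As a small sketch on a made-up data frame toy (not studentdata2), the call below sorts by major in increasing (alphabetical) order and then, within each major, by height in decreasing order.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major     = c("Music", "Psychology", "Music", "Linguistics", "Psychology"),
  height_in = c(72, 65, 63, 59, 66)
)

# major: increasing; height_in: decreasing within each major
toy_sorted <- arrange(toy, major, desc(height_in))
toy_sorted
```

Only the column wrapped in desc() is reversed; the other sorting columns keep the default increasing order.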
filter()
The function filter() keeps only the rows of a dataset that satisfy a given condition; all other rows are removed.
For example, suppose that we want to focus only on students who have taken "Texts and Ideas". Then we want to keep only the rows for which the haveTakenTextAndIdeas column has the value TRUE.
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE)
studentdata2_TextAndIdeas
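filter() can also take more than one condition; a row is kept only if all of the conditions hold. The sketch below uses a made-up data frame toy (not studentdata2). It also shows that, for a column that already contains TRUE/FALSE values, writing the column name alone gives the same result as comparing it with TRUE.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major                 = c("Music", "Psychology", "Music", "Linguistics", "Music"),
  height_in             = c(72, 65, 63, 59, 61),
  haveTakenTextAndIdeas = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Keep rows where BOTH conditions are true
tall_music <- filter(toy, major == "Music", height_in > 62)
tall_music

# For a logical column, these two calls keep the same rows
taken_a <- filter(toy, haveTakenTextAndIdeas == TRUE)
taken_b <- filter(toy, haveTakenTextAndIdeas)
all.equal(taken_a, taken_b)
```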
We can combine filter() with group_by() and summarize().
For example, suppose that we want to find the average height of the students who have taken "Texts and Ideas", grouped by major:
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
Let's recall what we did to compute the average height of students who have taken "Texts and Ideas", grouped by major:
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
Note that in the three steps above, we created three new data frames: (1) studentdata2_TextAndIdeas, (2) studentdata2_TnI_byMajor, and (3) height_summary.
The data frame that we really care about is height_summary; the first two data frames are created only because they are needed as inputs to the group_by() and summarize() functions in the second and third steps.
Creating intermediate data frames that we are probably not going to use again is a waste of space (especially if our data sets are large). In fact, we can avoid creating these intermediate data frames by "nesting" the functions together, as follows:
height_summary_v2 <- summarize(group_by(filter(studentdata2, haveTakenTextAndIdeas == TRUE), major), ave_height = mean(height_in))
height_summary_v2
Although we managed to avoid creating intermediate data frames, the resulting code is a bit hard to read and to interpret.
Our solution is a method called "piping". Let's first see what this looks like:
height_summary_v3 <- studentdata2 %>%
  filter( haveTakenTextAndIdeas == TRUE ) %>%
  group_by( major ) %>%
  summarize( ave_height = mean(height_in) )
height_summary_v3
All right, let's parse what happened.
The expression
studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE)
is equivalent to
filter(studentdata2, haveTakenTextAndIdeas == TRUE)
That is, the symbol %>% takes the data frame to its left and "pipes" this data frame into the function to its right, as the first input.
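The same rule applies to any function whose first input is a data frame. As a quick check, the sketch below (on a made-up data frame toy, not studentdata2) compares a piped call to arrange() with the ordinary call.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major     = c("Music", "Psychology", "Music", "Linguistics"),
  height_in = c(72, 65, 63, 59)
)

# Piped version: toy becomes the first input of arrange()
piped  <- toy %>% arrange(desc(height_in))

# Ordinary version: toy is written as the first input
direct <- arrange(toy, desc(height_in))

# Check that the two results agree
all.equal(piped, direct)
```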
Next, let's look at the first three lines of the pipeline together:
studentdata2 %>%
  filter( haveTakenTextAndIdeas == TRUE ) %>%
  group_by( major )
Here, we use piping twice. The data frame that is the output of
studentdata2 %>%
  filter( haveTakenTextAndIdeas == TRUE )
is then used as the first input of the group_by() function.
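A pipeline can keep growing one step at a time. As a sketch, the code below runs the filter, group, and summarize steps on a made-up data frame toy (not studentdata2) and then pipes the summary into arrange() so that the majors are listed from tallest to shortest average height.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  major                 = c("Music", "Psychology", "Linguistics",
                            "Psychology", "Music", "Music"),
  height_in             = c(72, 65, 60, 66, 61, 75),
  haveTakenTextAndIdeas = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

toy_summary <- toy %>%
  filter(haveTakenTextAndIdeas == TRUE) %>%   # keep students who took the course
  group_by(major) %>%                         # group the remaining rows by major
  summarize(ave_height = mean(height_in)) %>% # one average per major
  arrange(desc(ave_height))                   # tallest average first
toy_summary
```

Reading the pipeline from top to bottom gives the steps in the order in which they happen, which is the main readability advantage over the nested version.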