Introduction to Data Analysis

Lecture 4: More Data Manipulation with dplyr

Today:

  1. More practice with dplyr functions
    • group_by() and summarize()
    • arrange(): to sort by a particular column
    • filter(): to remove rows
    • (Other dplyr functions, such as mutate(), etc. will be introduced later)
  2. Using "piping": chaining together several dplyr functions
In [1]:
library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Part 1: More practice with dplyr functions

1.1. group_by() and summarize()

Recall that the function group_by() groups the rows based on the values in one of the columns. However, after you use group_by() on its own, you don't really notice any difference in the resulting data set. Thus, the function group_by() is useful when it's paired with another function, such as the summarize() function.

To see more concretely what group_by() and summarize() do, consider the following example.

In [2]:
# do not modify this cell!
# we are creating an example dataset called studentdata2
height_in <- c(72, 65, 59, 63, 75, 60, 66, 70, 61)
weight_lb <- c(150, 180, 100, 135, 190, 120, 110, 170, 140 )
studentdata2 <- as.data.frame(cbind(height_in, weight_lb))
studentdata2$major <- c('Music', 'Psychology', 'Linguistics', 'Music', 'Music', 'Linguistics', 'Psychology', 'Linguistics', 'Music')
studentdata2$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
studentdata2
Out[2]:
height_inweight_lbmajorhaveTakenTextAndIdeas
72 150 Music TRUE
65 180 Psychology TRUE
59 100 LinguisticsFALSE
63 135 Music FALSE
75 190 Music FALSE
60 120 Linguistics TRUE
66 110 Psychology TRUE
70 170 LinguisticsFALSE
61 140 Music TRUE

Suppose that we are interested in (1) finding the average height of the nine students in this dataset and (2) finding out the average height of students in each major.

(Maybe you had a conjecture that music majors seem to be taller in general compared to Linguistics majors.)

In [3]:
mean(studentdata2$height_in)
Out[3]:
65.6666666666667

Taking the average of all nine students' heights is straightforward.

Since this is a small data set, we can take the average of the Music majors', the Psychology majors, and the Linguistics majors' heights manually.

Method 1: Manually

In [4]:
ave_height_musicmajors <- mean(c(72, 63, 75, 61))
ave_height_psychologymajors <- mean(c(65, 66))
ave_height_linguisticsmajors <- mean(c(59, 60, 70))
print(ave_height_musicmajors)
print(ave_height_psychologymajors)
print(ave_height_linguisticsmajors)
[1] 67.75
[1] 65.5
[1] 63

Easy enough! However, if this dataset contains thousands of rows and hundreds of majors (for example, the list of all undergraduate students at NYU), we do not want to use Method 1; it will take a long time and there is a high risk that mistakes might be made. The second method below does the same thing, but aided by R functions.

Method 2: Using group_by() and summarize()

In [5]:
studentdata2_byMajor <- group_by(studentdata2, major)
summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
summary_df
Out[5]:
majorave_height
Linguistics63.00
Music 67.75
Psychology 65.50

Okay, let's interpret what happened in the above cell.

  1. In line 1, we created a new data frame consisting of the same data as studentdata2, but grouped by the students' majors. We did this using the command:

     studentdata2_byMajor <- group_by(studentdata2, major)

    Note that group_by() takes two inputs: (1) the name of the data frame and (2) the name of the column that we group the rows by. The grouped result is stored in a new data frame called studentdata2_byMajor.

  2. In line 2, we used summarize() to compute the average height of each of the students in each group using the command:

     summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))

    Let's parse what's going on here.

    • The function summarize() takes two inputs: (1) the name of the (grouped) data frame and (2) a function that computes one summary value (in our case, the mean() function "summarizes" the height of the students in each group).
    • Note that when we say ave_height = ..., we are giving a name to the new summary column; this is a name that describes what the summary value is.

Exercises

(Note: maybe change one of these questions to be a clicker question)

Use group_by() and summarize to find

  1. average weight, by major
  2. average height, by whether students have taken "Text and Ideas" or not
  3. the number of students in each major
  4. the sum of height of students in each major

1.2. arrange()

The function arrange() is quite straightforward. This is used to arrange a data frame based on the values in one (or more) column(s), either in increasing or decreasing order.

For example, suppose that we want to display the studentdata2 data frame so that the students are listed by their height, in increasing order.

In [6]:
arrange(studentdata2, height_in)
Out[6]:
height_inweight_lbmajorhaveTakenTextAndIdeas
59 100 LinguisticsFALSE
60 120 Linguistics TRUE
61 140 Music TRUE
63 135 Music FALSE
65 180 Psychology TRUE
66 110 Psychology TRUE
70 170 LinguisticsFALSE
72 150 Music TRUE
75 190 Music FALSE

We can also display them in decreasing order by height, by surrounding the column name with the function desc() (which stands for descending):

In [7]:
arrange(studentdata2, desc(height_in))
Out[7]:
height_inweight_lbmajorhaveTakenTextAndIdeas
75 190 Music FALSE
72 150 Music TRUE
70 170 LinguisticsFALSE
66 110 Psychology TRUE
65 180 Psychology TRUE
63 135 Music FALSE
61 140 Music TRUE
60 120 Linguistics TRUE
59 100 LinguisticsFALSE

Suppose that we want to sort the rows by major AND by height (that is, sort by major first, and then within each major, sort by height, in increasing order):

In [8]:
arrange(studentdata2, major, height_in)
Out[8]:
height_inweight_lbmajorhaveTakenTextAndIdeas
59 100 LinguisticsFALSE
60 120 Linguistics TRUE
70 170 LinguisticsFALSE
61 140 Music TRUE
63 135 Music FALSE
72 150 Music TRUE
75 190 Music FALSE
65 180 Psychology TRUE
66 110 Psychology TRUE

1.3. filter()

The function filter() removes rows from the dataset.

For example, suppose that we want to focus on students who have taken "Texts and Ideas" only. Then, we want to only keep the rows for which the haveTakenTextAndIdeas column have value TRUE.

In [9]:
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE)
studentdata2_TextAndIdeas
Out[9]:
height_inweight_lbmajorhaveTakenTextAndIdeas
72 150 Music TRUE
65 180 Psychology TRUE
60 120 LinguisticsTRUE
66 110 Psychology TRUE
61 140 Music TRUE

We can combine filter() with group_by() and summarize().

For example, suppose that we want to find the average height of the students who have taken "Texts and Ideas", grouped by major:

In [10]:
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
Out[10]:
majorave_height
Linguistics60.0
Music 66.5
Psychology 65.5

Exercises

  1. adf
  2. adf

Part 2: "Piping"

Let's recall what we did to compute the average height of students who have taken "Texts and Ideas", grouped by major:

In [11]:
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
Out[11]:
majorave_height
Linguistics60.0
Music 66.5
Psychology 65.5

Note that in the three steps above, we created three new data frames: (1) studentdata2_TextandIdeas, (2) studentdata2_TnI_byMajor, and (3) height_summary.

The data frame that we really care about is height_summary; the first two data frames are created because they are needed as inputs to the group_by() and summarize() functions in the second and third steps.

Creating intermediate data frames that we are probably not going to use again is a waste of space (especially if our data sets are large). In fact, we can avoid creating these intermediate data frames by "nesting" the functions together, as follows:

In [12]:
height_summary_v2 <- summarize(group_by(filter(studentdata2, haveTakenTextAndIdeas == TRUE), major), ave_height = mean(height_in))
height_summary_v2
Out[12]:
majorave_height
Linguistics60.0
Music 66.5
Psychology 65.5

Although we managed to avoid creating intermediate data frames, the resulting code is a bit hard to read and to interpret.

Our solution is a method called "piping". Let's first see what this looks like:

In [13]:
height_summary_v3 <- studentdata2 %>% 
    filter( haveTakenTextAndIdeas == TRUE ) %>%
    group_by( major ) %>%
    summarize( ave_height = mean(height_in) )

height_summary_v3
Out[13]:
majorave_height
Linguistics60.0
Music 66.5
Psychology 65.5

All right, let's parse what happened.

  1. The lines

     studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE)

    is equivalent to

     filter(studentdata2, haveTakenTextAndIdeas == TRUE)

    That is, the symbol %>% takes the data frame to its left and "pipes" this data frame into the function to its right, as the first input.

  2. Next, let's also look at the third line:

     studentdata2 %>% 
         filter( haveTakenTextAndIdeas == TRUE ) %>%
         group_by( major )

    Here, we use piping twice. The data frame that is the output of

     studentdata2 %>% 
         filter( haveTakenTextAndIdeas == TRUE)

    is then used as the input of the group_by() function.

Exercises

  1. (to be added)
  2. (to be added)