Introduction to Data Analysis¶

Lecture 4: More Data Manipulation with dplyr¶

Today:

More practice with dplyr functions
- group_by() and summarize()
- arrange(): to sort by a particular column
- filter(): to remove rows
- (Other dplyr functions, such as mutate(), etc. will be introduced later)
Using "piping": chaining together several dplyr functions

library('tidyverse')

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Part 1: More practice with `dplyr` functions¶

1.1. `group_by()` and `summarize()`¶

Recall that the function group_by() groups the rows based on the values in one of the columns. However, after you use group_by() on its own, you don't really notice any difference in the resulting data set. Thus, the function group_by() is useful when it's paired with another function, such as the summarize() function.

To see more concretely what group_by() and summarize() do, consider the following example.

# do not modify this cell!
# we are creating an example dataset called studentdata2
height_in <- c(72, 65, 59, 63, 75, 60, 66, 70, 61)
weight_lb <- c(150, 180, 100, 135, 190, 120, 110, 170, 140 )
studentdata2 <- as.data.frame(cbind(height_in, weight_lb))
studentdata2$major <- c('Music', 'Psychology', 'Linguistics', 'Music', 'Music', 'Linguistics', 'Psychology', 'Linguistics', 'Music')
studentdata2$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
studentdata2

Suppose that we are interested in (1) finding the average height of the nine students in this dataset and (2) finding out the average height of students in each major.

(Maybe you had a conjecture that music majors seem to be taller in general compared to Linguistics majors.)

mean(studentdata2$height_in)

Taking the average of all nine students' heights is straightforward.

Since this is a small data set, we can take the average of the Music majors', the Psychology majors, and the Linguistics majors' heights manually.

Method 1: Manually

ave_height_musicmajors <- mean(c(72, 63, 75, 61))
ave_height_psychologymajors <- mean(c(65, 66))
ave_height_linguisticsmajors <- mean(c(59, 60, 70))
print(ave_height_musicmajors)
print(ave_height_psychologymajors)
print(ave_height_linguisticsmajors)

[1] 67.75
[1] 65.5
[1] 63

Easy enough! However, if this dataset contains thousands of rows and hundreds of majors (for example, the list of all undergraduate students at NYU), we do not want to use Method 1; it will take a long time and there is a high risk that mistakes might be made. The second method below does the same thing, but aided by R functions.

Method 2: Using group_by() and summarize()

studentdata2_byMajor <- group_by(studentdata2, major)
summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
summary_df

Okay, let's interpret what happened in the above cell.

In line 1, we created a new data frame consisting of the same data as studentdata2, but grouped by the students' majors. We did this using the command:
```
 studentdata2_byMajor <- group_by(studentdata2, major)
```
Note that group_by() takes two inputs: (1) the name of the data frame and (2) the name of the column that we group the rows by. The grouped result is stored in a new data frame called studentdata2_byMajor.
In line 2, we used summarize() to compute the average height of each of the students in each group using the command:
```
 summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
```
Let's parse what's going on here.
- The function summarize() takes two inputs: (1) the name of the (grouped) data frame and (2) a function that computes one summary value (in our case, the mean() function "summarizes" the height of the students in each group).
- Note that when we say ave_height = ..., we are giving a name to the new summary column; this is a name that describes what the summary value is.

Exercises¶

(Note: maybe change one of these questions to be a clicker question)

Use group_by() and summarize to find

average weight, by major
average height, by whether students have taken "Text and Ideas" or not
the number of students in each major
the sum of height of students in each major

1.2. `arrange()`¶

The function arrange() is quite straightforward. This is used to arrange a data frame based on the values in one (or more) column(s), either in increasing or decreasing order.

For example, suppose that we want to display the studentdata2 data frame so that the students are listed by their height, in increasing order.

arrange(studentdata2, height_in)

We can also display them in decreasing order by height, by surrounding the column name with the function desc() (which stands for descending):

arrange(studentdata2, desc(height_in))

Suppose that we want to sort the rows by major AND by height (that is, sort by major first, and then within each major, sort by height, in increasing order):

arrange(studentdata2, major, height_in)

1.3. `filter()`¶

The function filter() removes rows from the dataset.

For example, suppose that we want to focus on students who have taken "Texts and Ideas" only. Then, we want to only keep the rows for which the haveTakenTextAndIdeas column have value TRUE.

studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE)
studentdata2_TextAndIdeas

We can combine filter() with group_by() and summarize().

For example, suppose that we want to find the average height of the students who have taken "Texts and Ideas", grouped by major:

studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary

Exercises¶

adf
adf

Part 2: "Piping"¶

Let's recall what we did to compute the average height of students who have taken "Texts and Ideas", grouped by major:

studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary

Note that in the three steps above, we created three new data frames: (1) studentdata2_TextandIdeas, (2) studentdata2_TnI_byMajor, and (3) height_summary.

The data frame that we really care about is height_summary; the first two data frames are created because they are needed as inputs to the group_by() and summarize() functions in the second and third steps.

Creating intermediate data frames that we are probably not going to use again is a waste of space (especially if our data sets are large). In fact, we can avoid creating these intermediate data frames by "nesting" the functions together, as follows:

height_summary_v2 <- summarize(group_by(filter(studentdata2, haveTakenTextAndIdeas == TRUE), major), ave_height = mean(height_in))
height_summary_v2

Although we managed to avoid creating intermediate data frames, the resulting code is a bit hard to read and to interpret.

Our solution is a method called "piping". Let's first see what this looks like:

height_summary_v3 <- studentdata2 %>% 
    filter( haveTakenTextAndIdeas == TRUE ) %>%
    group_by( major ) %>%
    summarize( ave_height = mean(height_in) )

height_summary_v3

All right, let's parse what happened.

The lines
```
 studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE)
```
is equivalent to
```
 filter(studentdata2, haveTakenTextAndIdeas == TRUE)
```
That is, the symbol %>% takes the data frame to its left and "pipes" this data frame into the function to its right, as the first input.
Next, let's also look at the third line:
```
 studentdata2 %>% 
     filter( haveTakenTextAndIdeas == TRUE ) %>%
     group_by( major )
```
Here, we use piping twice. The data frame that is the output of
```
 studentdata2 %>% 
     filter( haveTakenTextAndIdeas == TRUE)
```
is then used as the input of the group_by() function.

Exercises¶

(to be added)
(to be added)

height_in	weight_lb	major	haveTakenTextAndIdeas
72	150	Music	TRUE
65	180	Psychology	TRUE
59	100	Linguistics	FALSE
63	135	Music	FALSE
75	190	Music	FALSE
60	120	Linguistics	TRUE
66	110	Psychology	TRUE
70	170	Linguistics	FALSE
61	140	Music	TRUE

Introduction to Data Analysis¶

Lecture 4: More Data Manipulation with dplyr¶

Part 1: More practice with dplyr functions¶

1.1. group_by() and summarize()¶

Exercises¶

1.2. arrange()¶

1.3. filter()¶

Exercises¶

Part 2: "Piping"¶

Exercises¶

Part 1: More practice with `dplyr` functions¶

1.1. `group_by()` and `summarize()`¶

1.2. `arrange()`¶

1.3. `filter()`¶