Today:

- More practice with `dplyr` functions
  - `group_by()` and `summarize()`
  - `arrange()`: to sort by a particular column
  - `filter()`: to remove rows
  - (Other `dplyr` functions, such as `mutate()`, etc. will be introduced later)
- Using "piping": chaining together several `dplyr` functions

In [1]:

```
library('tidyverse')
```

`dplyr` functions

`group_by()` and `summarize()`

Recall that the function `group_by()` groups the rows based on the values in one of the columns. However, after you use `group_by()` on its own, you don't really notice any difference in the resulting data set. Thus, the function `group_by()` is useful when it's paired with another function, such as the `summarize()` function.
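To make the "no visible difference" point concrete, here is a minimal sketch on a small made-up data frame (the name `toy` and its values are hypothetical, not part of the class data). It assumes the `dplyr` helpers `is_grouped_df()` and `group_vars()` are available, as they are in recent versions of the package.

```r
library(dplyr)

# Hypothetical toy data, separate from the class data set
toy <- data.frame(
  major = c("Music", "Psychology", "Music"),
  height_in = c(72, 65, 63)
)

toy_grouped <- group_by(toy, major)

# The rows and columns are unchanged by grouping
nrow(toy_grouped) == nrow(toy)

# The grouped version only remembers which column defines the groups
is_grouped_df(toy)
is_grouped_df(toy_grouped)
group_vars(toy_grouped)
```

The grouping is stored as extra information attached to the data frame, which is why printing the data looks almost the same before and after.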

To see more concretely what `group_by()` and `summarize()` do, consider the following example.

In [2]:

```
# do not modify this cell!
# we are creating an example dataset called studentdata2
height_in <- c(72, 65, 59, 63, 75, 60, 66, 70, 61)
weight_lb <- c(150, 180, 100, 135, 190, 120, 110, 170, 140 )
studentdata2 <- as.data.frame(cbind(height_in, weight_lb))
studentdata2$major <- c('Music', 'Psychology', 'Linguistics', 'Music', 'Music', 'Linguistics', 'Psychology', 'Linguistics', 'Music')
studentdata2$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
studentdata2
```

Out[2]:

Suppose that we are interested in (1) finding the average height of the nine students in this dataset and (2) finding out the average height of students in each major.

(Maybe you had a conjecture that Music majors seem to be taller in general than Linguistics majors.)

In [3]:

```
mean(studentdata2$height_in)
```

Out[3]:

Taking the average of all nine students' heights is straightforward.

Since this is a small data set, we can take the average of the Music majors', the Psychology majors', and the Linguistics majors' heights manually.

**Method 1: Manually**

In [4]:

```
ave_height_musicmajors <- mean(c(72, 63, 75, 61))
ave_height_psychologymajors <- mean(c(65, 66))
ave_height_linguisticsmajors <- mean(c(59, 60, 70))
print(ave_height_musicmajors)
print(ave_height_psychologymajors)
print(ave_height_linguisticsmajors)
```

Easy enough! However, if this dataset contained thousands of rows and hundreds of majors (for example, the list of all undergraduate students at NYU), we would not want to use Method 1; it would take a long time and carry a high risk of mistakes. The second method below does the same thing, but with the help of R functions.

**Method 2: Using group_by() and summarize()**

In [5]:

```
studentdata2_byMajor <- group_by(studentdata2, major)
summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))
summary_df
```

Out[5]:

Okay, let's interpret what happened in the above cell.

In line 1, we created a new data frame consisting of the same data as `studentdata2`, but grouped by the students' majors. We did this using the command:

`studentdata2_byMajor <- group_by(studentdata2, major)`

Note that `group_by()` takes two inputs: (1) the name of the data frame and (2) the name of the column that we group the rows by. The grouped result is stored in a new data frame called `studentdata2_byMajor`.

In line 2, we used `summarize()` to compute the average height of the students in each group using the command:

`summary_df <- summarize(studentdata2_byMajor, ave_height = mean(height_in))`

Let's parse what's going on here.

- The function `summarize()` takes two inputs: (1) the name of the (grouped) data frame and (2) a function that computes one summary value (in our case, the `mean()` function "summarizes" the height of the students in each group).
- Note that when we say `ave_height = ...`, we are giving a name to the new summary column; this is a name that describes what the summary value is.
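As a further illustration of naming summary columns, the sketch below asks `summarize()` for two named summaries at once. The data frame `students` is a stand-in rebuilt here with the same heights and majors as the example dataset, so that the cell runs on its own; the column name `tallest` is just an illustrative choice.

```r
library(dplyr)

# Same heights and majors as the example dataset, rebuilt so this runs on its own
students <- data.frame(
  height_in = c(72, 65, 59, 63, 75, 60, 66, 70, 61),
  major = c("Music", "Psychology", "Linguistics", "Music", "Music",
            "Linguistics", "Psychology", "Linguistics", "Music")
)

students_byMajor <- group_by(students, major)

# Two named summary columns, each computed once per major
height_stats <- summarize(students_byMajor,
                          ave_height = mean(height_in),
                          tallest = max(height_in))
height_stats
```

Each `name = function(column)` pair becomes one column in the result, and the result has one row per group.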


(Note: maybe change one of these questions to be a clicker question)

Use `group_by()` and `summarize()` to find:

- average weight, by major
- average height, by whether students have taken "Texts and Ideas" or not
- the number of students in each major
- the sum of height of students in each major

`arrange()`

The function `arrange()` is quite straightforward. This is used to arrange a data frame based on the values in one (or more) column(s), either in increasing or decreasing order.

For example, suppose that we want to display the `studentdata2` data frame so that the students are listed by their height, in increasing order.

In [6]:

```
arrange(studentdata2, height_in)
```

Out[6]:

We can also display them in decreasing order by height, by surrounding the column name with the function `desc()` (which stands for descending):

In [7]:

```
arrange(studentdata2, desc(height_in))
```

Out[7]:

Suppose that we want to sort the rows by major AND by height (that is, sort by major first, and then within each major, sort by height, in increasing order):

In [8]:

```
arrange(studentdata2, major, height_in)
```

Out[8]:
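The two ideas above can also be mixed: one column sorted in increasing order and another in decreasing order. The following is a small sketch of that, using a stand-in data frame `students` rebuilt with the same heights and majors as the example dataset so that it runs on its own.

```r
library(dplyr)

# Same heights and majors as the example dataset, rebuilt so this runs on its own
students <- data.frame(
  height_in = c(72, 65, 59, 63, 75, 60, 66, 70, 61),
  major = c("Music", "Psychology", "Linguistics", "Music", "Music",
            "Linguistics", "Psychology", "Linguistics", "Music")
)

# Sort by major (A to Z), then within each major from tallest to shortest
sorted <- arrange(students, major, desc(height_in))
sorted
```

The order of the arguments matters: the first column listed is the primary sort key, and later columns only break ties within it.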

`filter()`

The function `filter()` keeps only the rows that satisfy a condition, removing all other rows from the dataset.

For example, suppose that we want to focus only on students who have taken "Texts and Ideas". Then, we want to keep only the rows for which the `haveTakenTextAndIdeas` column has the value `TRUE`.

In [9]:

```
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE)
studentdata2_TextAndIdeas
```

Out[9]:
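Conditions in `filter()` are not limited to `TRUE`/`FALSE` columns. As a rough sketch, the cell below filters on a text column and a numeric column at the same time, and also shows that a logical column can be used as the condition directly. The data frame `students` is a stand-in rebuilt with the same values as the example dataset so that the cell runs on its own.

```r
library(dplyr)

# Same values as the example dataset, rebuilt so this runs on its own
students <- data.frame(
  height_in = c(72, 65, 59, 63, 75, 60, 66, 70, 61),
  major = c("Music", "Psychology", "Linguistics", "Music", "Music",
            "Linguistics", "Psychology", "Linguistics", "Music"),
  haveTakenTextAndIdeas = c(TRUE, TRUE, FALSE, FALSE, FALSE,
                            TRUE, TRUE, FALSE, TRUE)
)

# Several conditions separated by commas must ALL hold for a row to be kept
tall_music <- filter(students, major == "Music", height_in >= 65)
tall_music

# A logical column can serve as the condition by itself;
# this keeps the same rows as haveTakenTextAndIdeas == TRUE
taken <- filter(students, haveTakenTextAndIdeas)
taken
```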

We can combine `filter()` with `group_by()` and `summarize()`.

For example, suppose that we want to find the average height of the students who have taken "Texts and Ideas", grouped by major:

In [10]:

```
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
```

Out[10]:


Let's recall what we did to compute the average height of students who have taken "Texts and Ideas", grouped by major:

In [11]:

```
studentdata2_TextAndIdeas <- filter(studentdata2, haveTakenTextAndIdeas == TRUE) # first, only keep the students who have taken "Texts and Ideas"
studentdata2_TnI_byMajor <- group_by(studentdata2_TextAndIdeas, major) #next, group them by major
height_summary <- summarize(studentdata2_TnI_byMajor, ave_height = mean(height_in)) # finally, find the average height of the students in each group
height_summary
```

Out[11]:

Note that in the three steps above, we created three new data frames: (1) `studentdata2_TextAndIdeas`, (2) `studentdata2_TnI_byMajor`, and (3) `height_summary`.

The data frame that we really care about is `height_summary`; the first two data frames are created because they are needed as inputs to the `group_by()` and `summarize()` functions in the second and third steps.

Creating intermediate data frames that we are probably not going to use again is a waste of space (especially if our data sets are large). In fact, we can avoid creating these intermediate data frames by "nesting" the functions together, as follows:

In [12]:

```
height_summary_v2 <- summarize(group_by(filter(studentdata2, haveTakenTextAndIdeas == TRUE), major), ave_height = mean(height_in))
height_summary_v2
```

Out[12]:

Although we managed to avoid creating intermediate data frames, the resulting code is a bit hard to read and to interpret.

Our solution is a method called "piping". Let's first see what this looks like:

In [13]:

```
height_summary_v3 <- studentdata2 %>%
    filter( haveTakenTextAndIdeas == TRUE ) %>%
    group_by( major ) %>%
    summarize( ave_height = mean(height_in) )
height_summary_v3
```

Out[13]:

All right, let's parse what happened.

The code

`studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE )`

is equivalent to

`filter(studentdata2, haveTakenTextAndIdeas == TRUE)`

That is, the symbol `%>%` takes the data frame to its left and "pipes" this data frame into the function to its right, as the first input.

Next, let's look at the code up through the third line:

`studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE ) %>% group_by( major )`

Here, we use piping twice. The data frame that is the output of `studentdata2 %>% filter( haveTakenTextAndIdeas == TRUE )` is then used as the input of the `group_by()` function.
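The equivalence described above can be checked directly. The sketch below compares the piped and un-piped versions of the same `filter()` call, and then extends the pipeline with one extra `arrange()` step as an illustration. The data frame `students` is a stand-in rebuilt with the same values as the example dataset so that the cell runs on its own.

```r
library(dplyr)

# Same values as the example dataset, rebuilt so this runs on its own
students <- data.frame(
  height_in = c(72, 65, 59, 63, 75, 60, 66, 70, 61),
  major = c("Music", "Psychology", "Linguistics", "Music", "Music",
            "Linguistics", "Psychology", "Linguistics", "Music"),
  haveTakenTextAndIdeas = c(TRUE, TRUE, FALSE, FALSE, FALSE,
                            TRUE, TRUE, FALSE, TRUE)
)

# The same filter, written without and with the pipe
unpiped <- filter(students, haveTakenTextAndIdeas == TRUE)
piped   <- students %>% filter(haveTakenTextAndIdeas == TRUE)
all.equal(unpiped, piped)

# A longer pipeline: each step receives the previous result as its first input
ranked <- students %>%
    filter(haveTakenTextAndIdeas == TRUE) %>%
    group_by(major) %>%
    summarize(ave_height = mean(height_in)) %>%
    arrange(desc(ave_height))
ranked
```

Reading the pipeline from top to bottom gives the steps in the order they happen, which is the main readability gain over the nested version.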

- (to be added)
- (to be added)