CoCalc Shared FilesClasses / Lecture03-WorkingWithDataFrames_BuiltInDatasets.htmlOpen in CoCalc with one click!

In [1]:

```
heightList <- c(72, 65, 59)
weightList <- c(150, 180, 110)
studentdata <- as.data.frame(cbind(weightList, heightList) )
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata
```

Out[1]:

In [2]:

```
names(studentdata)[1] <- 'weight_lb'
```

In [3]:

```
names(studentdata)[2] <- 'height_in'
```

Here, weight is in pounds and height is in inches. Suppose that you would like to create a new column where the height data is in cm.

First, let's consider how we would do that if we only have one number to work with.

In [4]:

```
height_Alex_in <- 72
```

** Fact ** One inch is equal to 2.54 cm.

So, if Alex's height is 72 inches, then Alex's height in cm is $72 \times 2.54$.

In [5]:

```
height_Alex_cm <- height_Alex_in * 2.54
print(height_Alex_in)
print(height_Alex_cm)
```

The list `height`

contains the heights of the three students in our dataset. Let's create a second list, called `height_cm`

, which will contain the height of the same three students in cm. We can do the same arithmetic as above, and the arithmetic operation will be applied to each number on that list.

In [6]:

```
print(heightList)
```

In [7]:

```
heightList_cm <- heightList * 2.54
print(heightList_cm)
```

We can then add a new column called `height_cm`

in the studentdata dataset that contains these values:

In [8]:

```
studentdata$height_cm <- heightList_cm
studentdata
```

Out[8]:

Note: we could have done the above steps in one line. For example, let's do something similar with the weight column. Suppose that instead of working with lbs, we prefer to work with kgs.

Since 1 pound is equal to 0.453592 kg, we multiply each number in the `weight`

column by 0.453592, then store the result in a new column called `weight_kg`

.

In [9]:

```
studentdata$weight_kg <- studentdata$weight_lb * 0.453592
```

In [10]:

```
studentdata
```

Out[10]:

Suppose, for example, that our three students are about to enter an elevator. Inside the elevator, a sign states that the total weight of the passangers of the elevator car cannot exceed 425 pounds. Is there a way to quickly add the numbers in the `weight`

column?

Recall that the function `sum()`

takes the sum of the numbers that are entered within its parentheses. For example:

In [11]:

```
sum(4, 100, 34.7, -100)
```

Out[11]:

We can also use `sum()`

to take the sum of all numbers in a list or in a column of a data frame.

In [12]:

```
sum(studentdata$weight_lb)
```

Out[12]:

In [13]:

```
newlist <- 1:100
sum(newlist)
```

Out[13]:

In [14]:

```
sum(-100:100)
```

Out[14]:

Of course, we can only take the sum of numeric data. If we try to take the sum of a column that stores text data, we will get an error message.

In [16]:

```
sum(studentdata$majors)
```

Another example: Perhaps we are interested in finding the average of the three students' height. Recall that to compute an average, we simply add all numbers and divide by the number of people. So, to compute the average height of the three students (in inches), we can first use `sum()`

, then divide the result by 3:

In [17]:

```
sum(studentdata$height_in)/3
```

Out[17]:

** Exercise ** In the cell below, find the average weight of the students, in kilograms.

** A New Function ** While we can take averages quite easily by first taking the sum and then dividing by the number of individuals, R has kindly provided a function that allows us to compute averages a little more easily. This is the `mean()`

function ("the mean" is synonymous to "the average"). We will get the same result using the two methods.

In [18]:

```
mean(studentdata$height_in)
```

Out[18]:

** Exercise ** In the cell below, find the average weight of the students, in kilograms, using the `mean()`

function

Our `studentdata`

dataset is very small; therefore, finding the largest and the smallest height, for example, is very easy to do.

However, if we are working with a dataset that has hundreds of thousdands of rows (for example, the `nycflights13`

dataset), it is useful to have a function that will do this quickly for us.

** New Functions **

`max()`

: to find the largest number in a list or in a column`min()`

: to find the smallest number in a list or in a column

In [19]:

```
max(studentdata$height_in)
```

Out[19]:

In [20]:

```
min(studentdata$height_in)
```

Out[20]:

What happens if we apply `max()`

or `min()`

to a list containing text data? Let's try:

In [21]:

```
max(studentdata$majors)
```

Out[21]:

In [22]:

```
min(studentdata$majors)
```

Out[22]:

Any guess what happenned here?

In [23]:

```
studentdata
```

Out[23]:

Suppose that we have a fourth student, David, whose data we would like to add as a fourth row in our data frame.

In [24]:

```
studentdata <- rbind(studentdata, data.frame(weight_lb = 200, height_in = NA, majors = "Political Science", haveTakenTextAndIdeas = FALSE, height_cm = NA, weight_kg = 200 * 0.453592))
```

In [25]:

```
studentdata
```

Out[25]:

In [26]:

```
class(studentdata$height_in)
```

Out[26]:

In [27]:

```
mean(studentdata$height_in)
```

Out[27]:

Note that the mean of heights is now `NA`

because David's height data is missing. You will see that real-world data are often incomplete, and that it is not uncommon for datasets to have many `NA`

s.

So, what do we do if we are still interested in finding the average of the three remaining students' heights? We can still use the `mean()`

function, telling R to ignore any `NA`

s:

In [28]:

```
mean(studentdata$height_in, na.rm=TRUE) #na.rm=TRUE means: "remove any NA's before computing the mean of the height_in column"
```

Out[28]:

We can also add `na.rm=TRUE`

in other functions where we want to compute a summary of numerical data:

In [29]:

```
sum(studentdata$height_in, na.rm=TRUE) / 3
```

Out[29]:

`dplyr`

package¶As we saw in our last lecture, sometimes we need to modify data frames before we are able to glean certain information from it. Today, we will learn about various R functions that will help us in modifying a given data frame. This process is often called "Data Wrangling" and is an important skill to have if we want to be proficient and comfortable with data analysis.

R is an amazing tool, because a lot of people contributes to the development of R. By downloading "packages" (these are codes for various functions, written by others), we can make a difficult process easier.

The main R packages that we will work with in our course are `dplyr`

and `ggplot2`

. The package `tidyverse`

contains both `dplyr`

and `ggplot2`

; so we begin by installing and loading `tidyverse`

:

In [30]:

```
install.packages('tidyverse')
```

In [31]:

```
library('tidyverse')
```

Let's revisit the `nycflights13`

dataset.

In [32]:

```
library('nycflights13')
```

In [33]:

```
head(flights)
```

Out[33]:

A few questions that we brainstormed last class are:

- Which airline is the "worst" in terms of the amount of delays?
- How many flights departed from each of the three airports?
- Which NYC airport is the worst in terms of the amount of delays?
- How many flights departed in the morning? afternoon? evening?

(and many more questions!)

First, we need to think, what do we mean by "worst", exactly. Are we asking for simply the number of flights that had any amount of delays, or are we asking for the total amount of delays accumulated by each airlines (or maybe something else)?

Furthermore, we notice that there is a `dep_delay`

column and an `arr_delay`

column. Which type of delay are we interested in?

When we are answering these types of questions, it is usually not clear how we should quantify "worst" (or "best"), and we can argue about this for a long time. So, what we would do is simply pick one and go for it, for now.

For each airline, let's add up the number of minutes that the flights are delayed. Then, we will rank the airlines based on the sum of their number of minutes of delays. Let's also suppose that we care only about the arrival delay (that is, if a flight was delayed in its departure but still arrived as scheduled, this did not really upset the passangers.)

To do do this, we need to **group** the flights by the carrier, and then we will sum up their `arr_delay`

column.

We do this in two steps, using two new functions:

`group_by()`

: this function lets us group the rows of a data frame according to the value of one of the columns.`summarize()`

: this function lets us compute a summary number for each of the group.

In [34]:

```
flights_groupedby_carrier <- group_by(flights, carrier)
```

In [35]:

```
arrdelay_bycarrier <- summarize(flights_groupedby_carrier, total_arrdelay = sum(arr_delay, na.rm=TRUE))
```

In [36]:

```
arrdelay_bycarrier <- merge(arrdelay_bycarrier, airlines, by="carrier")
```

In [37]:

```
arrange(arrdelay_bycarrier, desc(total_arrdelay))
```

Out[37]:

Conclusion of the above exploratory analysis: Based on the sum of the arrival delays (in minutes), Alaska Airlines had the best record in 2013, while ExpressJet Airlines Inc. had the worst record.

While this is revealing, what else should we think about before giving the award of "The most delayed airline in 2013" to ExpressJet Airlines Inc.?

- What about all those missing data (
`NA`

s) that we discarded? - It is possible that perhaps the sum of arrival delays is not the right quantity to use to rank the airlines. For example, airlines that are more majors will have a lot more flights than smaller airlines, and therefore might accumulate more delay minutes even if each delay is only, let's say, one minute long.
- It seems like if an airline has a lot of flights with small delays, they are not worse than an airline with less volume that is consistently very delayed.
- How about consistency of delays? It is possible that ExpressJet is usualy on-time, except for a handful of flights that are massively delayed due to a rare weather-related event.

In [38]:

```
ave_arr_delay_byCarrier <- summarize(flights_groupedby_carrier, ave_arr_delay = mean(arr_delay, na.rm=TRUE) )
```

In [39]:

```
ave_arr_delay_byCarrier <- merge(ave_arr_delay_byCarrier, airlines, by="carrier")
```

In [40]:

```
arrange(ave_arr_delay_byCarrier, desc(ave_arr_delay) )
```

Out[40]:

In [41]:

```
flights %>% # don't worry about the code in this cell for now! We will learn about them shortly.
filter(!is.na(arr_delay)) %>%
filter(arr_delay > 0) %>%
group_by(carrier) %>%
summarize(total_arr_delay = sum(arr_delay), count = n()) %>%
mutate(ave_arr_delay = total_arr_delay/count) %>%
merge(airlines, by="carrier") %>%
arrange(desc(ave_arr_delay))
```

Out[41]: