12 + 3
12 * 3
12 * 3 - 20 / 4
(12 * 3 - 20) / 4
3**2
Sometimes, we would like to give names to the quantities that we are working with:
price_of_pie <- 20
number_of_people <- 4
cost_per_person <- price_of_pie / number_of_people
To display the value stored in a name, we simply type the name:
cost_per_person
That is, names are "labels" or "placeholders" or "storage units". We can store not just numbers, but also text. Make sure to surround the text to be stored with single quotation marks:
student1 <- 'Alex Smith'
student2 <- 'Bob Singh'
student3 <- 'Chen Zhang'
student1
R allows us to do a lot of things using "functions". We can think of functions in R as "verbs" which we can use to tell R to do a particular task. Just as some verbs in English must be followed by a noun ("transitive verbs") and some don't, some functions in R must take a particular object or input (often called an "argument").
Let's start with a simple function: the print() function. Its use is to print the content of a name. For example:
print(cost_per_person)
print(price_of_pie)
print(number_of_people)
Contrast the output above with the output of the cell below, where print() was not used:
cost_per_person
price_of_pie
number_of_people
Notice that the "noun"/object that the function acts upon is placed inside the pair of parentheses that comes directly after the function name (without a space between the function name and the opening parenthesis).
New Functions. Here are a couple of other R functions that help us do arithmetic:
sqrt(): takes the square root of a number
abs(): takes the absolute value of a number
As your R code becomes more and more involved, it is important to make sure that you and others understand exactly what the code does. To do this, we add explanations (in English) that we want R to ignore computationally. These explanations can be added as "comments" in R. For example:
price_of_pie <- 20
number_of_people <- 4
# To compute cost per person, divide the price of pie by the number of people:
cost_per_person <- price_of_pie / number_of_people
In the above cell, any text to the right of the # sign is ignored by R. Any text that is preceded by # is a comment.
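As a small sketch of comments in use, the cell below applies sqrt() and abs() to two made-up quantities. The names area_of_square, side_length, temperature_change, and size_of_change are invented for illustration only.

```r
# Area of a square garden, in square feet (a made-up value)
area_of_square <- 144

# The side length of a square is the square root of its area
side_length <- sqrt(area_of_square)
print(side_length)

# A temperature change in degrees (negative means it got colder)
temperature_change <- -7

# abs() drops the sign, leaving only the size of the change
size_of_change <- abs(temperature_change)
print(size_of_change)
```

Each comment sits directly above the line it explains, so a reader can follow the reasoning without running the code.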
height_Alex <- 72
height_Bob <- 65
height_Chen <- 59
A New Function. We use the function c() to concatenate (i.e., to chain together) several different values into one object. See the example below, where we store the heights of the three students in one list (what R formally calls a "vector"), which we name height:
height <- c(height_Alex, height_Bob, height_Chen)
print(height)
Exercise. Create a list of all integers from 1 to 10 and name this list integers10. Then, print the contents of this list.
integers10 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print(integers10)
A New R Command. Here is a second way to create a list containing consecutive integers: firstInteger:lastInteger.
For example, instead of using c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) to create a list of all integers from 1 to 10, we could have created the same list using the more concise command 1:10. Try it below and name this list integers10_version2:
integers10_version2 <- ...
This is particularly useful if you want to create a very long list. For example, if we want to create a list of all integers from -100 to 100:
my_list <- -100:100
print(my_list)
my_list2 <- c(my_list, height)
print(my_list2)
We can also store text data in a list.
student_names <- c(student1, student2, student3)
print(student_names)
Data frames are basically tables, or spreadsheets, of data. Each column of a data frame corresponds to a "variable"; each row of a data frame corresponds to one observation (one individual).
weight_Alex <- 150
weight_Bob <- 180
weight_Chen <- 110
weight <- c(weight_Alex, weight_Bob, weight_Chen)
print(weight)
studentdata <- as.data.frame(cbind(weight, height))
print(studentdata)
Note that the names of the two lists (weight and height) are now the names of the two columns in the data frame.
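As a hedged aside, the base R function data.frame() can build the same kind of table directly from the lists. The sketch below is an alternative to the as.data.frame(cbind()) approach used in these notes, not a replacement for it; the names studentdata_alt and mixed are made up here. It also illustrates one reason to be careful with cbind(): when numbers and text are bound together, every entry is turned into text.

```r
weight <- c(150, 180, 110)
height <- c(72, 65, 59)

# data.frame() uses the names of the lists as the column names
studentdata_alt <- data.frame(weight, height)
print(studentdata_alt)

# cbind() on a numeric list and a text list converts everything to text
student_names <- c('Alex Smith', 'Bob Singh', 'Chen Zhang')
mixed <- cbind(weight, student_names)
print(is.character(mixed))
```

With two numeric lists, as in the notes, both approaches keep the columns numeric.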
names(studentdata) # the function names() displays the names of the columns of a data frame
row.names(studentdata) # the function row.names() displays the names of the rows of a data frame
row.names(studentdata) <- student_names
studentdata
Each column of a data frame is simply a list! Given a data frame, it is easy to obtain a list containing just one of its columns. We do this using the $ symbol followed by the name of the column.
studentdata$height
Last lecture, we had an example of a student data set that contains weight, height, major, and whether students have taken "Text and Ideas". Let's add the majors and "have taken text and ideas" columns into this data frame.
To create a new column, simply type the data frame name, followed by the $ symbol and the new column name; then, store the values of the new column there.
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
studentdata
studentdata$haveTakenTextAndIdeas <- c('Yes', 'Yes', 'No')
studentdata
Note that the first two columns of the studentdata data frame contain numerical data whereas the last two columns contain text data.
We will talk about different data types in more detail in a bit. However, this is a good chance to introduce a new function:
A New Function. The class() function tells us the type of data that a particular name represents.
For example, using class(), we will find that
studentdata is a data frame
studentdata$weight is a list containing numbers, so this is numerical data
studentdata$majors is a list containing text. In R, text data is called "character" (because text consists of characters)
class(studentdata)
class(studentdata$weight)
class(studentdata$majors)
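class() also works on values that are not part of a data frame. The following is a minimal sketch using values that appeared earlier in these notes; the names class_of_number, class_of_text, and class_of_heights are made up for illustration.

```r
# A single number
class_of_number <- class(20)
print(class_of_number)

# A piece of text
class_of_text <- class('Alex Smith')
print(class_of_text)

# A list of numbers created with c()
class_of_heights <- class(c(72, 65, 59))
print(class_of_heights)
```

A list of numbers reports the same class as a single number, which matches the idea that a column of numbers is numerical data.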
R comes with some datasets that are ready for us to explore. One such built-in dataset is the women dataset.
women
head(women, 5)
dim(women)
row.names(women)
We saw that we can put together lists of the same length into a data frame. We can also (1) extract a column of a data frame to get a list, and (2) extract just one entry of that column to get a number:
print(women$height)
women$height
women$height[3]
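The square-bracket notation works on any list, not only on columns of women. Below is a small sketch, assuming a made-up data frame called small_data built from the student heights and weights used earlier; positions are counted starting from 1.

```r
heights <- c(72, 65, 59)

# The third entry of a plain list
third_height <- heights[3]
print(third_height)

# The same idea applied to a column of a data frame
small_data <- data.frame(height = heights, weight = c(150, 180, 110))
second_weight <- small_data$weight[2]
print(second_weight)
```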
New Functions. Here is a summary of new functions that are useful for examining and working with data frames:
as.data.frame(cbind()): to "bind" lists together to form the columns of a new data frame
names(): to find out the names of the columns of a data frame. It returns a list containing the names of the columns
row.names(): to find out the names of the rows of a data frame
dim(): to find the number of rows and columns of a data frame (that is, to find the "dimension" of the data frame)
head(): to display the first few rows of a data frame. It takes two arguments: the name of the data frame and the number of rows to be displayed
head(women, 5)
class(women)
class(women$height)
Text data are data that consist only of text. For example, in the studentdata data set above, the students' majors are text data:
studentdata
class(studentdata$majors)
Some data are "categorical". For example, in the studentdata dataset above, majors contains text data. However, we could think of it as a category as well: each student falls into one of a number of possible categories. Sometimes, it is a good idea to tell R explicitly that a given set of text data actually represents categories instead of simply strings of letters.
A New Function. We can tell R explicitly that a column's text data is actually categorical using factor(), as follows:
factor(studentdata$majors)
class(factor(studentdata$majors))
Note that while studentdata$majors is text data, factor(studentdata$majors) treats the different texts/words as categories.
Suppose that it is useful to think of majors as categories as opposed to simply strings of letters. We can replace studentdata$majors with factor(studentdata$majors):
# We replace the text data stored in the `majors` column with its categorical (factor) version
studentdata$majors <- factor(studentdata$majors)
class(studentdata$majors)
Another example: in the chickwts dataset below, we record the weight as well as the type of feed given to each chicken. The weight column contains numbers but the feed column contains the type (i.e., the category) of feed. In this particular dataset, one category of feed is horsebean.
head(chickwts, 5)
We might wonder: how many different categories of feed are there in this data set? That is, can we quickly find out what other types of feed were given to the chickens in this data set?
We could do this using the function levels(dataframe$columnname), as follows:
levels(chickwts$feed)
As you can see above, there are six categories of feed.
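The same idea can be tried on a small made-up list of majors. This is only a sketch: the list majors below is invented, and table() is a base R function not otherwise covered in these notes; it counts how many entries fall into each category.

```r
# Four made-up students, two of whom share a major
majors <- c('Music', 'Psychology', 'Linguistics', 'Music')

# Tell R to treat the text as categories
majors_as_categories <- factor(majors)

# The distinct categories, in alphabetical order
major_levels <- levels(majors_as_categories)
print(major_levels)

# How many students fall into each category
major_counts <- table(majors_as_categories)
print(major_counts)
```

Although the list has four entries, there are only three categories, because 'Music' appears twice.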
While it might not yet be obvious why we care about the distinction between text data and categorical data, keep in mind that this distinction is important. The reason will become clearer as we work with more examples and datasets.
Logical data are data whose values are either TRUE or FALSE.
For example, in our studentdata data set, the column on whether each student has taken "Text and Ideas" contains true/false information ('Yes' or 'No').
studentdata$haveTakenTextAndIdeas
class(studentdata$haveTakenTextAndIdeas)
Currently, the 'Yes' and 'No' values are treated as plain text data. In order to tell R to treat them as logical data, let's replace each 'Yes' with TRUE and each 'No' with FALSE:
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata
class(studentdata$haveTakenTextAndIdeas)
(Again, it might not be so obvious why it is useful or important to replace the 'Yes' and 'No' values with TRUE and FALSE values, but the distinction between text and logical data is important. By storing these as logical data, we can do more than if they were simply text data.)
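Two things that logical data allow, sketched below under the assumption of the same three students: counting the TRUE values with sum() (R counts each TRUE as 1 and each FALSE as 0), and using the logical list inside square brackets to keep only the matching entries. The names have_taken, number_who_took, and students_who_took are made up for illustration.

```r
students <- c('Alex Smith', 'Bob Singh', 'Chen Zhang')
have_taken <- c(TRUE, TRUE, FALSE)

# Count how many students have taken the course
number_who_took <- sum(have_taken)
print(number_who_took)

# Keep only the names where the matching value is TRUE
students_who_took <- students[have_taken]
print(students_who_took)
```

Neither step would work if the values were stored as the text 'Yes' and 'No'.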
Now that we are better acquainted with how R works and how various types of data can be stored in R, let's look at an example of a real (and large) data set.
nycflights13 package and dataset
This dataset contains data on all flights that departed from one of the three NYC-area airports (JFK, LaGuardia, and Newark) in the year 2013. We will do a bit of exploration of this dataset using the tools that we learned today.
We first need to install the package and load it so that R can access it and work with it.
install.packages('nycflights13')
library('nycflights13')
The package nycflights13 contains a data frame called flights:
flights
Hmm, this data frame looks huge. Let's see how many rows and columns it has using dim().
dim(flights)
It looks like, among the nineteen columns, some contain numerical data and some contain text data. Let's check the data types of the various columns using class():
class(flights$arr_time)
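Checking nineteen columns one at a time is tedious. One possible shortcut, sketched here on the small built-in women dataset so that it runs without installing anything, is the base R function sapply(), which applies a function to every column of a data frame. sapply() is not otherwise covered in these notes, and the name column_types is made up.

```r
# Apply class() to every column of the women data frame
column_types <- sapply(women, class)
print(column_types)
```

The same call can be tried with flights in place of women. Some columns, such as date-time columns, report more than one class, in which case the result is displayed a little differently.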
tail(flights, 5)
airlines
df <- merge(flights, airlines, by="carrier")
head(df)
names(df)[20] <- 'airline'
head(df)
df[order(df$arr_delay), ]
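The last few cells move quickly, so here is a hedged sketch of the same three steps on two tiny made-up tables: merge() matches rows that share a value in the carrier column, names() renames a column, and order() sorts the rows. The tables flights_small and airlines_small and their values are invented for illustration and are not taken from nycflights13.

```r
# A tiny stand-in for flights: one row per flight
flights_small <- data.frame(
  carrier = c('UA', 'AA', 'UA', 'DL'),
  arr_delay = c(11, -18, 20, -25)
)

# A tiny stand-in for airlines: one row per carrier code
airlines_small <- data.frame(
  carrier = c('AA', 'DL', 'UA'),
  name = c('American Airlines Inc.', 'Delta Air Lines Inc.', 'United Air Lines Inc.')
)

# Attach the full airline name to each flight by matching on carrier
df_small <- merge(flights_small, airlines_small, by = 'carrier')

# The third column is called 'name'; rename it to 'airline'
names(df_small)[3] <- 'airline'

# Sort the rows from the smallest arrival delay to the largest
sorted_small <- df_small[order(df_small$arr_delay), ]
print(sorted_small)
```

In the real data the renamed column is number 20 rather than 3, because flights has nineteen columns and the merge adds one more.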