Introduction to Data Analysis with Computing

Lecture 2: Introduction to R; Types of Data

Today's Goals and Topics:

  1. Basic computing and arithmetic in R
  2. Lists and Data Frames
  3. Types of Data

Part 1: Basic computing in R

1.1. Arithmetic in R

We can do basic computations in R, such as adding, subtracting, multiplying, and dividing numbers.

In [1]:
12 + 3
Out[1]:
15
In [2]:
12 * 3
Out[2]:
36
In [3]:
12 * 3 - 20 / 4
Out[3]:
31

Order of Operations

In [4]:
(12 * 3 - 20) / 4
Out[4]:
4
In [5]:
3**2
Out[5]:
9
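
As a brief sketch of the same ideas with a few more operators: R's conventional exponent operator is `^` (the `**` form used above is accepted as a synonym), and `%%` and `%/%` give the remainder and the integer quotient. The names on the left of each `<-` are made up for this illustration.

```r
# Exponentiation: ^ is the usual spelling; ** is accepted as a synonym
squared <- 3^2

# Multiplication and division happen before addition and subtraction
without_parens <- 12 * 3 - 20 / 4     # 36 - 5
with_parens    <- (12 * 3 - 20) / 4   # 16 / 4

# Remainder and integer division
remainder <- 17 %% 5
quotient  <- 17 %/% 5

print(squared)
print(without_parens)
print(with_parens)
print(remainder)
print(quotient)
```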

1.2. Names

Sometimes, we would like to give names to the quantities that we are working with.

In [6]:
price_of_pie <- 20
number_of_people <- 4
cost_per_person <- price_of_pie / number_of_people

To display the value stored under a name, we simply type the name:

In [7]:
cost_per_person
Out[7]:
5

That is, names are "labels" or "placeholders" or "storage units". We can store not just numbers but also text. Make sure to surround the text with a pair of matching quotation marks, either single ('...') or double ("..."):

In [8]:
student1 <- 'Alex Smith'
student2 <- 'Bob Singh'
student3 <- 'Chen Zhang'
In [9]:
student1
Out[9]:
'Alex Smith'

1.3. Functions

R allows us to do a lot of things using "functions". We can think of functions in R as "verbs" which we can use to tell R to do a particular task. Just as some verbs in English must be followed by a noun ("transitive verbs") and some don't, some functions in R must take a particular object or input (often called an "argument").

Let's start with a simple function: print(). Its use is to print the value stored under a name. For example:

In [10]:
print(cost_per_person)
print(price_of_pie)
print(number_of_people)
[1] 5
[1] 20
[1] 4

Contrast the output above with the output of the cell below, where print() was not used:

In [11]:
cost_per_person
price_of_pie
number_of_people
Out[11]:
5
Out[11]:
20
Out[11]:
4

Notice that the "noun" (the object that the function acts upon) is placed inside the pair of parentheses that comes directly after the function name, with no space between the function name and the opening parenthesis.

New Functions. Here are a couple of other R functions that help us do arithmetic:

  • sqrt(): takes the square root of a number
  • abs(): takes the absolute value of a number
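
A minimal sketch of these two functions in use. The names `side_length`, `distance`, and `nested` are made up for this illustration.

```r
# sqrt() returns the square root of its argument
side_length <- sqrt(49)

# abs() drops the sign of its argument
distance <- abs(3 - 10)

# Functions can be nested: the inner call is evaluated first
nested <- sqrt(abs(-16))

print(side_length)
print(distance)
print(nested)
```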

1.4. Comments

As your R code becomes more and more involved, it is important to make sure that you and others understand exactly what the code does. To do this, we add explanations (in English) that R ignores when it runs the code. These explanations are called "comments". For example:

In [12]:
price_of_pie <- 20
number_of_people <- 4
# To compute cost per person, divide the price of pie by the number of people:
cost_per_person <- price_of_pie / number_of_people

In the above cell, any text to the right of the # sign is ignored by R. Any text that is preceded by # is a comment.

Part 2: Grouping values together: Lists and Data Frames

2.1. Lists of Values

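Sometimes, we need to work not just with one number but with a collection of numbers. This lecture calls such a collection a list; R's own name for it is a vector (R also has a separate, more general type named list, which we do not use here).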

In [13]:
height_Alex <- 72
height_Bob <- 65
height_Chen <- 59

A New Function. We use the function c() to concatenate (i.e., to chain together) several different values into one object. See the example below, where we store the heights of the three students into one list, which we name height:

In [14]:
height <- c(height_Alex, height_Bob, height_Chen)
print(height)
[1] 72 65 59

Exercise Create a list of all integers from 1 to 10 and name this list integers10. Then, print the contents of this list.

In [15]:
integers10 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print(integers10)
 [1]  1  2  3  4  5  6  7  8  9 10

A New R Command. Here is a second way to create a list containing consecutive integers: firstInteger:lastInteger.

For example, instead of using c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) to create a list of all integers from 1 to 10, we could have created the same list using the following command: 1:10, which is more concise. Try it below and name this list integers10_version2:

In [16]:
integers10_version2 <- ...
Error in eval(expr, envir, enclos): '...' used in an incorrect context
Traceback:
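
A sketch of one way to complete the cell above (the placeholder `...` is what triggers the error shown). The name `integers10_check` is introduced only for this illustration.

```r
# The colon operator builds consecutive integers from the first value to the last
integers10_version2 <- 1:10
print(integers10_version2)

# Compare element by element with the longer c(...) version
integers10_check <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print(all(integers10_version2 == integers10_check))

# length() reports how many values a collection holds
print(length(-100:100))
```

The comparison uses `==` on every element rather than a stricter identity check, because `1:10` stores whole numbers as integers while `c(1, 2, ...)` stores them as general numbers.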

This is particularly useful if you want to create a very long list. For example, if we want to create a list of all integers from -100 to 100:

In [17]:
my_list <- -100:100
print(my_list)
  [1] -100  -99  -98  -97  -96  -95  -94  -93  -92  -91  -90  -89  -88  -87  -86
 [16]  -85  -84  -83  -82  -81  -80  -79  -78  -77  -76  -75  -74  -73  -72  -71
 [31]  -70  -69  -68  -67  -66  -65  -64  -63  -62  -61  -60  -59  -58  -57  -56
 [46]  -55  -54  -53  -52  -51  -50  -49  -48  -47  -46  -45  -44  -43  -42  -41
 [61]  -40  -39  -38  -37  -36  -35  -34  -33  -32  -31  -30  -29  -28  -27  -26
 [76]  -25  -24  -23  -22  -21  -20  -19  -18  -17  -16  -15  -14  -13  -12  -11
 [91]  -10   -9   -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4
[106]    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
[121]   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34
[136]   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49
[151]   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64
[166]   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
[181]   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94
[196]   95   96   97   98   99  100
In [18]:
my_list2 <- c(my_list, height)
print(my_list2)
  [1] -100  -99  -98  -97  -96  -95  -94  -93  -92  -91  -90  -89  -88  -87  -86
 [16]  -85  -84  -83  -82  -81  -80  -79  -78  -77  -76  -75  -74  -73  -72  -71
 [31]  -70  -69  -68  -67  -66  -65  -64  -63  -62  -61  -60  -59  -58  -57  -56
 [46]  -55  -54  -53  -52  -51  -50  -49  -48  -47  -46  -45  -44  -43  -42  -41
 [61]  -40  -39  -38  -37  -36  -35  -34  -33  -32  -31  -30  -29  -28  -27  -26
 [76]  -25  -24  -23  -22  -21  -20  -19  -18  -17  -16  -15  -14  -13  -12  -11
 [91]  -10   -9   -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4
[106]    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
[121]   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34
[136]   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49
[151]   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64
[166]   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
[181]   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94
[196]   95   96   97   98   99  100   72   65   59

We can also store text data in a list.

In [19]:
student_names <- c(student1, student2, student3)
print(student_names)
[1] "Alex Smith" "Bob Singh"  "Chen Zhang"
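
Once values are grouped together, a few base R tools work on the whole collection or on single entries. The sketch below repeats the values from above so that it runs on its own; the names `second_height`, `first_name`, `number_of_students`, and `average_height` are made up for this illustration.

```r
height <- c(72, 65, 59)
student_names <- c('Alex Smith', 'Bob Singh', 'Chen Zhang')

# Square brackets pick out one entry by position (positions start at 1)
second_height <- height[2]
first_name <- student_names[1]

# length() counts the entries; mean() averages them
number_of_students <- length(height)
average_height <- mean(height)

print(second_height)
print(first_name)
print(number_of_students)
print(average_height)
```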

2.2. Data Frames

Data frames are basically tables, or spreadsheets, of data. Each column of a data frame corresponds to a "variable"; each row corresponds to one observation (one individual).

In [20]:
weight_Alex <- 150
weight_Bob <- 180
weight_Chen <- 110
weight <- c(weight_Alex, weight_Bob, weight_Chen)
print(weight)
[1] 150 180 110
In [21]:
studentdata <- as.data.frame(cbind(weight, height) )
print(studentdata)
  weight height
1    150     72
2    180     65
3    110     59

Note that the names of the two lists (weight and height) are now the names of the two columns in the data frame.
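
An alternative sketch for building the same table: `data.frame()` accepts the columns directly. This matters once text columns are involved, because `cbind()` first builds a matrix, and a matrix holds a single type, so numbers bound together with text are no longer stored as numbers (depending on the R version, they end up as text or as a factor). The names `studentdata_alt` and `mixed` are made up for this illustration.

```r
weight <- c(150, 180, 110)
height <- c(72, 65, 59)
student_names <- c('Alex Smith', 'Bob Singh', 'Chen Zhang')

# data.frame() keeps each column's own type and can set row names directly
studentdata_alt <- data.frame(weight, height, row.names = student_names)
print(studentdata_alt)
print(class(studentdata_alt$weight))

# cbind() with a text column forces every column into one common type
mixed <- as.data.frame(cbind(weight, student_names))
print(class(mixed$weight))
```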

In [22]:
names(studentdata) # the function names() displays the names of the columns of a data frame
Out[22]:
  1. 'weight'
  2. 'height'
In [23]:
row.names(studentdata) # the function row.names() displays the names of the rows of a data frame
Out[23]:
  1. '1'
  2. '2'
  3. '3'
In [24]:
row.names(studentdata) <- student_names
studentdata
Out[24]:
           weight height
Alex Smith    150     72
Bob Singh     180     65
Chen Zhang    110     59

Accessing a column of a data frame

Each column of a data frame is simply a list! To obtain a list containing just one column of a data frame, we use the $ symbol followed by the name of the column.

In [25]:
studentdata$height
Out[25]:
  1. 72
  2. 65
  3. 59

2.2.1. Adding New Columns to a data frame

Last lecture, we had an example of a student data set that contains weight, height, major, and whether students have taken "Text and Ideas". Let's add the majors and "have taken text and ideas" columns into this data frame.

To create a new column, simply type the data frame name, followed by the $ symbol and the new column name; then, store the values of the new column there.

In [26]:
studentdata$majors <- c('Music', 'Psychology', 'Linguistics')
In [27]:
studentdata
Out[27]:
           weight height      majors
Alex Smith    150     72       Music
Bob Singh     180     65  Psychology
Chen Zhang    110     59 Linguistics
In [28]:
studentdata$haveTakenTextAndIdeas <- c('Yes', 'Yes', 'No')
studentdata
Out[28]:
           weight height      majors haveTakenTextAndIdeas
Alex Smith    150     72       Music                   Yes
Bob Singh     180     65  Psychology                   Yes
Chen Zhang    110     59 Linguistics                    No

Note that the first two columns of the studentdata data frame contain numerical data, whereas the last two columns contain text data.

We will talk about different data types in more detail in a bit. However, this is a good chance to introduce a new function:

A New Function The class() function tells us the type of data that a particular name represents.

For example, using class(), we will find that

  • studentdata is a data frame
  • studentdata$weight is a list containing numbers, so it is numerical data
  • studentdata$majors is a list containing text. In R, text data is called "character" (because text consists of characters)
In [29]:
class(studentdata)
Out[29]:
'data.frame'
In [30]:
class(studentdata$weight)
Out[30]:
'numeric'
In [31]:
class(studentdata$majors)
Out[31]:
'character'
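
As a small sketch, class() can also be applied to single values and to freshly built collections, not only to columns of a data frame. The values below are chosen only for illustration.

```r
print(class(20))            # a plain number
print(class('Alex Smith'))  # text
print(class(TRUE))          # a logical value
print(class(1:10))          # consecutive integers made with the colon operator
```

Whole numbers made with the colon operator report 'integer' rather than 'numeric'; both are numerical data.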

2.2.2. Built-In Datasets

R comes with some datasets that are ready for us to explore. One such built-in dataset is the women dataset.

In [32]:
women
Out[32]:
height weight
58 115
59 117
60 120
61 123
62 126
63 129
64 132
65 135
66 139
67 142
68 146
69 150
70 154
71 159
72 164
In [33]:
head(women, 5)
Out[33]:
height weight
58 115
59 117
60 120
61 123
62 126
In [34]:
dim(women)
row.names(women)
Out[34]:
  1. 15
  2. 2
Out[34]:
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '6'
  7. '7'
  8. '8'
  9. '9'
  10. '10'
  11. '11'
  12. '12'
  13. '13'
  14. '14'
  15. '15'

We saw that we can put together lists of the same length into a data frame. We can also (1) extract a column of a data frame to get a list, and (2) extract a single entry of that column to get a number:

In [35]:
print(women$height)
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
In [36]:
women$height
women$height[3]
Out[36]:
  1. 58
  2. 59
  3. 60
  4. 61
  5. 62
  6. 63
  7. 64
  8. 65
  9. 66
  10. 67
  11. 68
  12. 69
  13. 70
  14. 71
  15. 72
Out[36]:
60
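
A sketch of a second way to reach the same entry, using square brackets with a row and a column on the data frame itself. This bracket form is standard R but is not introduced elsewhere in this lecture; the names on the left of each `<-` are made up for this illustration.

```r
# Column first, then position within the column
third_height <- women$height[3]

# Row and column together: data_frame[row, column]
third_height_again <- women[3, 'height']

# Leaving the column blank returns the whole row
third_row <- women[3, ]

print(third_height)
print(third_height_again)
print(third_row)
```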

New Functions Here is a summary of new functions that are useful for examining and working with data frames:

  • as.data.frame(cbind()): to "bind" lists together to form the columns of a new data frame
  • names(): to find out the names of the columns of a data frame. It returns a list containing the names of the columns
  • row.names(): to find out the names of the rows of a data frame
  • dim(): to find the number of rows and columns of a data frame (that is, to find the "dimension" of the data frame)
  • head(): to display the first few rows of a data frame. It takes two arguments: the name of the data frame and the number of rows to be displayed

Part 3: Types of Data

3.1. Numerical Data

We saw an example of numerical data in the women dataset, which contains the weight and height of 15 women in the US. Weight and height are both numbers.

Some numerical data are integers, some are decimals or fractions.

In [37]:
head(women, 5)
Out[37]:
height weight
58 115
59 117
60 120
61 123
62 126
In [38]:
class(women)
Out[38]:
'data.frame'
In [39]:
class(women$height)
Out[39]:
'numeric'

3.2. Character/Text Data

Character (text) data are data that consist of text. For example, in the studentdata data set above, the students' majors are text data:

In [40]:
studentdata
class(studentdata$majors)
Out[40]:
           weight height      majors haveTakenTextAndIdeas
Alex Smith    150     72       Music                   Yes
Bob Singh     180     65  Psychology                   Yes
Chen Zhang    110     59 Linguistics                    No
Out[40]:
'character'

3.3. Categorical Data

Some data are "categorical". For example, in the studentdata data set above, majors contains text data. However, we could think of it as a category as well: each student falls into one of a number of possible categories. Sometimes, it is a good idea to tell R explicitly that a given set of text data actually represents categories instead of arbitrary strings of letters.

A New Function We can tell R explicitly that a column's text data is actually categorical using factor(), as follows:

In [41]:
factor(studentdata$majors)
class(factor(studentdata$majors))
Out[41]:
  1. Music
  2. Psychology
  3. Linguistics
Out[41]:
'factor'

Note that while studentdata$majors is text data, factor(studentdata$majors) treats the different texts/words as categories.

Suppose that it is useful to think of majors as categories as opposed to arbitrary strings of letters. We can replace studentdata$majors with factor(studentdata$majors):

In [42]:
# Replace the text data stored in the `majors` column with its categorical (factor) version
studentdata$majors <- factor(studentdata$majors)
class(studentdata$majors)
Out[42]:
'factor'

Another example: In the chickwts dataset below, we record the weight as well as the type of feed given to each chicken. The weight column contains numbers, but the feed column contains the type (i.e., the category) of feed. In this particular dataset, one category of feed is horsebean.

In [43]:
head(chickwts, 5)
Out[43]:
weight feed
179 horsebean
160 horsebean
136 horsebean
227 horsebean
217 horsebean

We might wonder, how many different categories of feeds are there in this data set? That is, can we quickly find out what are the other possible types of feed given to the chickens in this data set?

We could do this using the function levels(dataframe$columnname), as follows:

In [44]:
levels(chickwts$feed)
Out[44]:
  1. 'casein'
  2. 'horsebean'
  3. 'linseed'
  4. 'meatmeal'
  5. 'soybean'
  6. 'sunflower'

As you can see above, there are six categories of feed.
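
One thing categorical data makes easy is counting. The sketch below uses two further base R functions that are not introduced elsewhere in this lecture, nlevels() and table(); the names `number_of_feeds` and `feed_counts` are made up for this illustration.

```r
# chickwts ships with R; its feed column is already stored as a factor
print(class(chickwts$feed))

# nlevels() counts the categories
number_of_feeds <- nlevels(chickwts$feed)
print(number_of_feeds)

# table() counts how many chickens received each feed
feed_counts <- table(chickwts$feed)
print(feed_counts)
```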

While it might not be so obvious why we care about the distinction between text data vs. categorical data, keep in mind that this distinction is important. It will make more sense why as we work with more and more examples and datasets.

3.4. Logical Data

Logical data are data whose values are either TRUE or FALSE.

For example, in our studentdata data set, the column on whether each student has taken "Text and Ideas" contains yes/no information, which is naturally TRUE/FALSE.

In [45]:
studentdata$haveTakenTextAndIdeas
class(studentdata$haveTakenTextAndIdeas)
Out[45]:
  1. 'Yes'
  2. 'Yes'
  3. 'No'
Out[45]:
'character'

Currently, the 'Yes' and 'No' values are treated as plain text data. To tell R to treat them as logical data, let's replace each 'Yes' with TRUE and each 'No' with FALSE:

In [46]:
studentdata$haveTakenTextAndIdeas <- c(TRUE, TRUE, FALSE)
studentdata
Out[46]:
           weight height      majors haveTakenTextAndIdeas
Alex Smith    150     72       Music                  TRUE
Bob Singh     180     65  Psychology                  TRUE
Chen Zhang    110     59 Linguistics                 FALSE
In [47]:
class(studentdata$haveTakenTextAndIdeas)
Out[47]:
'logical'

(Again, it might not be obvious why it is useful to replace the 'Yes' and 'No' values with TRUE and FALSE, but the distinction between text and logical data is important. Logical data support operations that plain text does not, such as counting or selecting the entries where the value is TRUE.)
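
A minimal sketch of what logical data allows. Instead of typing TRUE and FALSE by hand, a comparison with `==` produces them from the text values; the names `answers` and `have_taken` are made up for this illustration.

```r
answers <- c('Yes', 'Yes', 'No')   # the same text values as above

# A comparison returns TRUE or FALSE for each entry
have_taken <- answers == 'Yes'
print(have_taken)
print(class(have_taken))

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
print(sum(have_taken))

# Logical values can also select entries from another collection
student_names <- c('Alex Smith', 'Bob Singh', 'Chen Zhang')
print(student_names[have_taken])
```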

Examples

Now that we are more acquainted with how R works and how various types of data can be stored in R, let's look at an example of a real (and large) data set.

The nycflights13 package and dataset

This dataset contains data on all flights that departed from the three NYC-area airports (JFK, LaGuardia, and Newark) in 2013. We will do a bit of exploration of this dataset using the tools that we learned today.

We first need to install the package and load it so that R can access it and work with it.

In [48]:
install.packages('nycflights13')
Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
In [49]:
library('nycflights13')

The package nycflights13 contains a data frame called flights

In [50]:
flights
Out[50]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
2013 1 1 558 600 -2 849 851 -2 B6 49 N793JB JFK PBI 149 1028 6 0 2013-01-01 06:00:00
2013 1 1 558 600 -2 853 856 -3 B6 71 N657JB JFK TPA 158 1005 6 0 2013-01-01 06:00:00
2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK LAX 345 2475 6 0 2013-01-01 06:00:00
2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR SFO 361 2565 6 0 2013-01-01 06:00:00
2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA DFW 257 1389 6 0 2013-01-01 06:00:00
2013 1 1 559 559 0 702 706 -4 B6 1806 N708JB JFK BOS 44 187 5 59 2013-01-01 05:00:00
2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR LAS 337 2227 6 0 2013-01-01 06:00:00
2013 1 1 600 600 0 851 858 -7 B6 371 N595JB LGA FLL 152 1076 6 0 2013-01-01 06:00:00
2013 1 1 600 600 0 837 825 12 MQ 4650 N542MQ LGA ATL 134 762 6 0 2013-01-01 06:00:00
2013 1 1 601 600 1 844 850 -6 B6 343 N644JB EWR PBI 147 1023 6 0 2013-01-01 06:00:00
2013 1 1 602 610 -8 812 820 -8 DL 1919 N971DL LGA MSP 170 1020 6 10 2013-01-01 06:00:00
2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6 5 2013-01-01 06:00:00
2013 1 1 606 610 -4 858 910 -12 AA 1895 N633AA EWR MIA 152 1085 6 10 2013-01-01 06:00:00
2013 1 1 606 610 -4 837 845 -8 DL 1743 N3739P JFK ATL 128 760 6 10 2013-01-01 06:00:00
2013 1 1 607 607 0 858 915 -17 UA 1077 N53442 EWR MIA 157 1085 6 7 2013-01-01 06:00:00
2013 1 1 608 600 8 807 735 32 MQ 3768 N9EAMQ EWR ORD 139 719 6 0 2013-01-01 06:00:00
2013 1 1 611 600 11 945 931 14 UA 303 N532UA JFK SFO 366 2586 6 0 2013-01-01 06:00:00
2013 1 1 613 610 3 925 921 4 B6 135 N635JB JFK RSW 175 1074 6 10 2013-01-01 06:00:00
2013 1 1 615 615 0 1039 1100 -21 B6 709 N794JB JFK SJU 182 1598 6 15 2013-01-01 06:00:00
2013 1 1 615 615 0 833 842 -9 DL 575 N326NB EWR ATL 120 746 6 15 2013-01-01 06:00:00
⋮
2013 9 30 2123 2125 -2 2223 2247 -24 EV 5489 N712EV LGA CHO 45 305 21 25 2013-09-30 21:00:00
2013 9 30 2127 2129 -2 2314 2323 -9 EV 3833 N16546 EWR CLT 72 529 21 29 2013-09-30 21:00:00
2013 9 30 2128 2130 -2 2328 2359 -31 B6 97 N807JB JFK DEN 213 1626 21 30 2013-09-30 21:00:00
2013 9 30 2129 2059 30 2230 2232 -2 EV 5048 N751EV LGA RIC 45 292 20 59 2013-09-30 20:00:00
2013 9 30 2131 2140 -9 2225 2255 -30 MQ 3621 N807MQ JFK DCA 36 213 21 40 2013-09-30 21:00:00
2013 9 30 2140 2140 0 10 40 -30 AA 185 N335AA JFK LAX 298 2475 21 40 2013-09-30 21:00:00
2013 9 30 2142 2129 13 2250 2239 11 EV 4509 N12957 EWR PWM 47 284 21 29 2013-09-30 21:00:00
2013 9 30 2145 2145 0 115 140 -25 B6 1103 N633JB JFK SJU 192 1598 21 45 2013-09-30 21:00:00
2013 9 30 2147 2137 10 30 27 3 B6 1371 N627JB LGA FLL 139 1076 21 37 2013-09-30 21:00:00
2013 9 30 2149 2156 -7 2245 2308 -23 UA 523 N813UA EWR BOS 37 200 21 56 2013-09-30 21:00:00
2013 9 30 2150 2159 -9 2250 2306 -16 EV 3842 N10575 EWR MHT 39 209 21 59 2013-09-30 21:00:00
2013 9 30 2159 1845 194 2344 2030 194 9E 3320 N906XJ JFK BUF 50 301 18 45 2013-09-30 18:00:00
2013 9 30 2203 2205 -2 2339 2331 8 EV 5311 N722EV LGA BGR 61 378 22 5 2013-09-30 22:00:00
2013 9 30 2207 2140 27 2257 2250 7 MQ 3660 N532MQ LGA BNA 97 764 21 40 2013-09-30 21:00:00
2013 9 30 2211 2059 72 2339 2242 57 EV 4672 N12145 EWR STL 120 872 20 59 2013-09-30 20:00:00
2013 9 30 2231 2245 -14 2335 2356 -21 B6 108 N193JB JFK PWM 48 273 22 45 2013-09-30 22:00:00
2013 9 30 2233 2113 80 112 30 42 UA 471 N578UA EWR SFO 318 2565 21 13 2013-09-30 21:00:00
2013 9 30 2235 2001 154 59 2249 130 B6 1083 N804JB JFK MCO 123 944 20 1 2013-09-30 20:00:00
2013 9 30 2237 2245 -8 2345 2353 -8 B6 234 N318JB JFK BTV 43 266 22 45 2013-09-30 22:00:00
2013 9 30 2240 2245 -5 2334 2351 -17 B6 1816 N354JB JFK SYR 41 209 22 45 2013-09-30 22:00:00
2013 9 30 2240 2250 -10 2347 7 -20 B6 2002 N281JB JFK BUF 52 301 22 50 2013-09-30 22:00:00
2013 9 30 2241 2246 -5 2345 1 -16 B6 486 N346JB JFK ROC 47 264 22 46 2013-09-30 22:00:00
2013 9 30 2307 2255 12 2359 2358 1 B6 718 N565JB JFK BOS 33 187 22 55 2013-09-30 22:00:00
2013 9 30 2349 2359 -10 325 350 -25 B6 745 N516JB JFK PSE 196 1617 23 59 2013-09-30 23:00:00
2013 9 30 NA 1842 NA NA 2019 NA EV 5274 N740EV LGA BNA NA 764 18 42 2013-09-30 18:00:00
2013 9 30 NA 1455 NA NA 1634 NA 9E 3393 NA JFK DCA NA 213 14 55 2013-09-30 14:00:00
2013 9 30 NA 2200 NA NA 2312 NA 9E 3525 NA LGA SYR NA 198 22 0 2013-09-30 22:00:00
2013 9 30 NA 1210 NA NA 1330 NA MQ 3461 N535MQ LGA BNA NA 764 12 10 2013-09-30 12:00:00
2013 9 30 NA 1159 NA NA 1344 NA MQ 3572 N511MQ LGA CLE NA 419 11 59 2013-09-30 11:00:00
2013 9 30 NA 840 NA NA 1020 NA MQ 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00

Hmm, this data frame looks huge. Let's see how many rows and columns it has using dim().

In [51]:
dim(flights)
Out[51]:
  1. 336776
  2. 19

It looks like, among the nineteen columns, some contain numerical data and some contain text data. Let's check the data type of one of the columns using class():

In [52]:
class(flights$arr_time)
Out[52]:
'integer'
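
Checking columns one at a time gets tedious for a wide table. As a sketch, the base R function sapply() (not introduced elsewhere in this lecture) applies class() to every column at once. The example uses the built-in chickwts dataset so that it runs without installing a package; the same call should also work on flights, although there the result may print as a list, since a date-time column can report more than one class.

```r
# sapply() applies a function to every column of a data frame
column_types <- sapply(chickwts, class)
print(column_types)
```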
In [53]:
tail(flights, 5)
Out[53]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 9 30 NA 1455 NA NA 1634 NA 9E 3393 NA JFK DCA NA 213 14 55 2013-09-30 14:00:00
2013 9 30 NA 2200 NA NA 2312 NA 9E 3525 NA LGA SYR NA 198 22 0 2013-09-30 22:00:00
2013 9 30 NA 1210 NA NA 1330 NA MQ 3461 N535MQ LGA BNA NA 764 12 10 2013-09-30 12:00:00
2013 9 30 NA 1159 NA NA 1344 NA MQ 3572 N511MQ LGA CLE NA 419 11 59 2013-09-30 11:00:00
2013 9 30 NA 840 NA NA 1020 NA MQ 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00
In [54]:
airlines
Out[54]:
carrier name
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.
EV ExpressJet Airlines Inc.
F9 Frontier Airlines Inc.
FL AirTran Airways Corporation
HA Hawaiian Airlines Inc.
MQ Envoy Air
OO SkyWest Airlines Inc.
UA United Air Lines Inc.
US US Airways Inc.
VX Virgin America
WN Southwest Airlines Co.
YV Mesa Airlines Inc.
In [55]:
df <- merge(flights, airlines, by="carrier")
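
To see what merge() does, here is a sketch on two tiny made-up tables (`toy_flights` and `toy_airlines` are hypothetical and are not part of nycflights13). Rows are matched on the shared carrier column, the matching name is attached to each flight, and by default the result is sorted by that shared column, which is placed first.

```r
# Two small made-up tables that share a 'carrier' column
toy_flights <- data.frame(
  carrier = c('UA', 'AA', 'UA', 'B6'),
  dep_delay = c(2, 4, -1, 0),
  stringsAsFactors = FALSE
)
toy_airlines <- data.frame(
  carrier = c('AA', 'B6', 'UA'),
  name = c('American Airlines Inc.', 'JetBlue Airways', 'United Air Lines Inc.'),
  stringsAsFactors = FALSE
)

# merge() matches rows on the shared column and appends the other table's columns
toy_merged <- merge(toy_flights, toy_airlines, by = 'carrier')
print(toy_merged)
```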
In [56]:
head(df)
Out[56]:
carrier year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight tailnum origin dest air_time distance hour minute time_hour name
9E 2013 2 5 827 830 -3 1032 1023 9 4220 N8698A JFK RDU 78 427 8 30 2013-02-05 08:00:00 Endeavor Air Inc.
9E 2013 8 23 1901 1905 -4 2051 2103 -12 3360 N926XJ JFK PIT 61 340 19 5 2013-08-23 19:00:00 Endeavor Air Inc.
9E 2013 6 2 805 810 -5 949 1027 -38 3538 N925XJ JFK MSP 145 1029 8 10 2013-06-02 08:00:00 Endeavor Air Inc.
9E 2013 10 26 2139 1935 124 2358 2145 133 3470 N928XJ JFK CVG 102 589 19 35 2013-10-26 19:00:00 Endeavor Air Inc.
9E 2013 7 7 NA 2030 NA NA 2156 NA 4218 NA JFK PHL NA 94 20 30 2013-07-07 20:00:00 Endeavor Air Inc.
9E 2013 2 18 1459 1505 -6 1621 1637 -16 3393 N910XJ JFK DCA 46 213 15 5 2013-02-18 15:00:00 Endeavor Air Inc.
In [57]:
names(df)[20] <- 'airline'
In [58]:
head(df)
Out[58]:
carrier year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight tailnum origin dest air_time distance hour minute time_hour airline
9E 2013 2 5 827 830 -3 1032 1023 9 4220 N8698A JFK RDU 78 427 8 30 2013-02-05 08:00:00 Endeavor Air Inc.
9E 2013 8 23 1901 1905 -4 2051 2103 -12 3360 N926XJ JFK PIT 61 340 19 5 2013-08-23 19:00:00 Endeavor Air Inc.
9E 2013 6 2 805 810 -5 949 1027 -38 3538 N925XJ JFK MSP 145 1029 8 10 2013-06-02 08:00:00 Endeavor Air Inc.
9E 2013 10 26 2139 1935 124 2358 2145 133 3470 N928XJ JFK CVG 102 589 19 35 2013-10-26 19:00:00 Endeavor Air Inc.
9E 2013 7 7 NA 2030 NA NA 2156 NA 4218 NA JFK PHL NA 94 20 30 2013-07-07 20:00:00 Endeavor Air Inc.
9E 2013 2 18 1459 1505 -6 1621 1637 -16 3393 N910XJ JFK DCA 46 213 15 5 2013-02-18 15:00:00 Endeavor Air Inc.
In [59]:
df[order(df$arr_delay), ]
Out[59]:
carrier year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight tailnum origin dest air_time distance hour minute time_hour airline
322787 VX 2013 5 7 1715 1729 -14 1944 2110 -86 193 N843VA EWR SFO 315 2565 17 29 2013-05-07 17:00:00 Virgin America
320759 VX 2013 5 20 719 735 -16 951 1110 -79 11 N840VA JFK SFO 316 2586 7 35 2013-05-20 07:00:00 Virgin America
40461 AA 2013 5 6 1826 1830 -4 2045 2200 -75 269 N3KCAA JFK SEA 289 2422 18 30 2013-05-06 18:00:00 American Airlines Inc.
257628 UA 2013 5 2 1947 1949 -2 2209 2324 -75 612 N851UA EWR LAX 300 2454 19 49 2013-05-02 19:00:00 United Air Lines Inc.
51525 AS 2013 5 4 1816 1820 -4 2017 2131 -74 7 N551AS EWR SEA 281 2402 18 20 2013-05-04 18:00:00 Alaska Airlines Inc.
287137 UA 2013 5 2 1926 1929 -3 2157 2310 -73 1628 N24212 EWR SFO 314 2565 19 29 2013-05-02 19:00:00 United Air Lines Inc.
63392 B6 2013 5 13 657 700 -3 908 1019 -71 671 N805JB JFK LAX 290 2475 7 0 2013-05-13 07:00:00 JetBlue Airways
132997 DL 2013 5 6 1753 1755 -2 2004 2115 -71 1394 N3760C JFK PDX 283 2454 17 55 2013-05-06 17:00:00 Delta Air Lines Inc.
291801 UA 2013 5 7 2054 2055 -1 2317 28 -71 622 N806UA EWR SFO 309 2565 20 55 2013-05-07 20:00:00 United Air Lines Inc.
73875 B6 2013 5 13 1801 1805 -4 2018 2128 -70 217 N663JB JFK LGB 295 2465 18 5 2013-05-13 18:00:00 JetBlue Airways
213026 HA 2013 2 11 857 900 -3 1430 1540 -70 51 N389HA JFK HNL 601 4983 9 0 2013-02-11 09:00:00 Hawaiian Airlines Inc.
265352 UA 2013 2 26 1721 1725 -4 1936 2046 -70 385 N855UA EWR PDX 294 2434 17 25 2013-02-26 17:00:00 United Air Lines Inc.
283072 UA 2013 2 28 702 705 -3 924 1034 -70 963 N831UA EWR SNA 306 2434 7 5 2013-02-28 07:00:00 United Air Lines Inc.
284804 UA 2013 2 26 1335 1335 0 1819 1929 -70 15 N76065 EWR HNL 566 4963 13 35 2013-02-26 13:00:00 United Air Lines Inc.
286098 UA 2013 5 13 1624 1629 -5 1831 1941 -70 789 N855UA EWR LAX 290 2454 16 29 2013-05-13 16:00:00 United Air Lines Inc.
313668 US 2013 5 3 616 630 -14 803 913 -70 195 N507AY JFK PHX 266 2153 6 30 2013-05-03 06:00:00 US Airways Inc.
322941 VX 2013 1 4 1026 1030 -4 1305 1415 -70 23 N855VA JFK SFO 324 2586 10 30 2013-01-04 10:00:00 Virgin America
43629 AA 2013 2 26 1827 1830 -3 2056 2205 -69 269 N3EAAA JFK SEA 308 2422 18 30 2013-02-26 18:00:00 American Airlines Inc.
46050 AA 2013 5 13 855 900 -5 1116 1225 -69 1 N328AA JFK LAX 299 2475 9 0 2013-05-13 09:00:00 American Airlines Inc.
108686 DL 2013 2 27 1858 1900 -2 2152 2301 -69 1967 N704X JFK SFO 329 2586 19 0 2013-02-27 19:00:00 Delta Air Lines Inc.
116263 DL 2013 2 28 1855 1900 -5 2152 2301 -69 1967 N705TW JFK SFO 331 2586 19 0 2013-02-28 19:00:00 Delta Air Lines Inc.
285618 UA 2013 5 4 1914 1915 -1 2107 2216 -69 1557 N36447 EWR LAS 276 2227 19 15 2013-05-04 19:00:00 United Air Lines Inc.
322427 VX 2013 2 26 1022 1030 -8 1306 1415 -69 23 N846VA JFK SFO 327 2586 10 30 2013-02-26 10:00:00 Virgin America
323553 VX 2013 5 12 721 730 -9 956 1105 -69 183 N852VA EWR SFO 318 2565 7 30 2013-05-12 07:00:00 Virgin America
2781 9E 2013 5 6 1846 1859 -13 2026 2134 -68 3403 N922XJ JFK MCI 138 1113 18 59 2013-05-06 18:00:00 Endeavor Air Inc.
11513 9E 2013 8 20 1555 1559 -4 1720 1828 -68 3540 N905XJ JFK MSP 133 1029 15 59 2013-08-20 15:00:00 Endeavor Air Inc.
47876 AA 2013 9 7 1550 1600 -10 1757 1905 -68 1156 N3EHAA LGA DFW 171 1389 16 0 2013-09-07 16:00:00 American Airlines Inc.
111907 DL 2013 4 30 1440 1445 -5 1711 1819 -68 963 N713TW JFK LAX 308 2475 14 45 2013-04-30 14:00:00 Delta Air Lines Inc.
139164 DL 2013 3 1 2014 2020 -6 2220 2328 -68 1729 N694DL JFK LAS 283 2248 20 20 2013-03-01 20:00:00 Delta Air Lines Inc.
140618 DL 2013 2 26 1918 1925 -7 2155 2303 -68 6 N3768 JFK SLC 246 1990 19 25 2013-02-26 19:00:00 Delta Air Lines Inc.
⋮
336511 YV 2013 5 23 NA 1735 NA NA 1937 NA 2751 N912FJ LGA CLT NA 544 17 35 2013-05-23 17:00:00 Mesa Airlines Inc.
336514 YV 2013 6 25 NA 1735 NA NA 1937 NA 2751 N935LR LGA CLT NA 544 17 35 2013-06-25 17:00:00 Mesa Airlines Inc.
336523 YV 2013 12 17 NA 1637 NA NA 1800 NA 3771 N503MJ LGA IAD NA 229 16 37 2013-12-17 16:00:00 Mesa Airlines Inc.
336530 YV 2013 2 8 NA 1435 NA NA 1559 NA 3750 N516LR LGA IAD NA 229 14 35 2013-02-08 14:00:00 Mesa Airlines Inc.
336531 YV 2013 2 8 NA 1602 NA NA 1722 NA 3771 N519LR LGA IAD NA 229 16 2 2013-02-08 16:00:00 Mesa Airlines Inc.
336536 YV 2013 6 28 NA 1735 NA NA 1937 NA 2751 N924FJ LGA CLT NA 544 17 35 2013-06-28 17:00:00 Mesa Airlines Inc.
336537 YV 2013 6 28 NA 1617 NA NA 1744 NA 3771 N509MJ LGA IAD NA 229 16 17 2013-06-28 16:00:00 Mesa Airlines Inc.
336550 YV 2013 12 10 NA 1637 NA NA 1800 NA 3771 N514MJ LGA IAD NA 229 16 37 2013-12-10 16:00:00 Mesa Airlines Inc.
336575 YV 2013 12 9 1749 1637 72 NA 1800 NA 3771 N510MJ LGA IAD NA 229 16 37 2013-12-09 16:00:00 Mesa Airlines Inc.
336582 YV 2013 10 21 NA 1735 NA NA 1946 NA 2751 N918FJ LGA CLT NA 544 17 35 2013-10-21 17:00:00 Mesa Airlines Inc.
336597 YV 2013 6 24 NA 1735 NA NA 1937 NA 2751 N935LR LGA CLT NA 544 17 35 2013-06-24 17:00:00 Mesa Airlines Inc.
336618 YV 2013 12 5 NA 1150 NA NA 1406 NA 2885 N942LR LGA CLT NA 544 11 50 2013-12-05 11:00:00 Mesa Airlines Inc.
336621 YV 2013 12 5 NA 1637 NA NA 1800 NA 3771 N519LR LGA IAD NA 229 16 37 2013-12-05 16:00:00 Mesa Airlines Inc.
336623 YV 2013 8 1 NA 1735 NA NA 1937 NA 2751 N920FJ LGA CLT NA 544 17 35 2013-08-01 17:00:00 Mesa Airlines Inc.
336624 YV 2013 8 1 NA 1605 NA NA 1732 NA 3771 N507MJ LGA IAD NA 229 16 5 2013-08-01 16:00:00 Mesa Airlines Inc.
336626 YV 2013 7 22 NA 1735 NA NA 1937 NA 2751 N909FJ LGA CLT NA 544 17 35 2013-07-22 17:00:00 Mesa Airlines Inc.
336627 YV 2013 7 22 NA 1605 NA NA 1732 NA 3771 N508MJ LGA IAD NA 229 16 5 2013-07-22 16:00:00 Mesa Airlines Inc.
336639 YV 2013 1 30 NA 1602 NA NA 1722 NA 3771 N503MJ LGA IAD NA 229 16 2 2013-01-30 16:00:00 Mesa Airlines Inc.
336644 YV 2013 12 8 NA 1637 NA NA 1800 NA 3771 N508MJ LGA IAD NA 229 16 37 2013-12-08 16:00:00 Mesa Airlines Inc.
336663 YV 2013 7 23 NA 1136 NA NA 1338 NA 2651 N916FJ LGA CLT NA 544 11 36 2013-07-23 11:00:00 Mesa Airlines Inc.
336664 YV 2013 7 23 NA 1605 NA NA 1732 NA 3771 N513MJ LGA IAD NA 229 16 5 2013-07-23 16:00:00 Mesa Airlines Inc.
336672 YV 2013 12 10 NA 1150 NA NA 1406 NA 2885 N930LR LGA CLT NA 544 11 50 2013-12-10 11:00:00 Mesa Airlines Inc.
336680 YV 2013 4 19 NA 1603 NA NA 1730 NA 3790 N519LR LGA IAD NA 229 16 3 2013-04-19 16:00:00 Mesa Airlines Inc.
336708 YV 2013 7 1 NA 1735 NA NA 1937 NA 2751 N922FJ LGA CLT NA 544 17 35 2013-07-01 17:00:00 Mesa Airlines Inc.
336742 YV 2013 1 11 NA 1435 NA NA 1559 NA 3750 N518LR LGA IAD NA 229 14 35 2013-01-11 14:00:00 Mesa Airlines Inc.
336748 YV 2013 10 7 NA 1735 NA NA 1946 NA 2751 N926LR LGA CLT NA 544 17 35 2013-10-07 17:00:00 Mesa Airlines Inc.
336749 YV 2013 10 7 NA 1629 NA NA 1750 NA 3771 N510MJ LGA IAD NA 229 16 29 2013-10-07 16:00:00 Mesa Airlines Inc.
336758 YV 2013 1 13 NA 1605 NA NA 1729 NA 3771 N502MJ LGA IAD NA 229 16 5 2013-01-13 16:00:00 Mesa Airlines Inc.
336765 YV 2013 6 13 NA 1617 NA NA 1744 NA 3771 N509MJ LGA IAD NA 229 16 17 2013-06-13 16:00:00 Mesa Airlines Inc.
336775 YV 2013 8 8 NA 1605 NA NA 1732 NA 3771 N503MJ LGA IAD NA 229 16 5 2013-08-08 16:00:00 Mesa Airlines Inc.
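
The sorting step above relies on order(), which the lecture uses without explanation. As a sketch on a small made-up collection (the values in `arr_delay` below are invented, not taken from the flights data): order() returns the positions that would sort the values, and indexing with those positions does the sorting. Missing values (NA) are placed last by default, which is consistent with the NA rows appearing at the bottom of the output above.

```r
arr_delay <- c(11, -18, NA, 33, -25)   # made-up values; NA marks a missing value

# order() returns the positions that would sort the values, smallest first
positions <- order(arr_delay)
print(positions)

# Indexing with those positions gives the sorted values; NA goes last by default
print(arr_delay[positions])

# decreasing = TRUE puts the largest value first instead
print(arr_delay[order(arr_delay, decreasing = TRUE)])
```

For a data frame, the same positions go in the row slot of the square brackets, as in df[order(df$arr_delay), ].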