Hands-on introduction to R
R is a powerful, comprehensive, open-source software framework for doing statistics. You can download and install the software on your own computer, or use it through a web interface without downloading anything. We will use R through a web interface in the form of Sage worksheets. In fact, what you are reading here is a Sage worksheet that will guide you through the first steps of getting familiar with R. Let us begin by learning how to input (small) datasets into R.
How to input simple datasets by hand
Prelude: To use R through Sage worksheets (and Sage cells), the first line in each new cell must always be "%r" (without the quotes). Alternatively, choose "R" from the modes menu at the top of the worksheet.
R lets you set up your data through keyboard input or by reading it from an input file. Knowing how to enter simple, small datasets from the keyboard is extremely useful.
Example: Find the mean, standard deviation and 5-number summary for the set of values: 1, 2, 3, 4, 8
The commands below show how to do this. Note that everything following a "#" sign is a comment explaining what is going on; R ignores anything that follows a "#" sign.
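A minimal sketch of commands that produce the results shown next. The variable name `a` is an assumption made for illustration; `c`, `mean`, `sd`, `var` and `summary` are standard base R functions.

```r
# In a Sage worksheet cell, the first line would be %r (or select the R mode).
a <- c(1, 2, 3, 4, 8)   # c() combines the values into one vector; "a" is an illustrative name
mean(a)                 # mean of the values
sd(a)                   # sample standard deviation
var(a)                  # sample variance (the square of the standard deviation)
summary(a)              # 5-number summary, plus the mean
```

Each call prints its result directly below the cell, in the order the commands appear.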
3.6
2.70185121722126
7.3
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 2.0 3.0 3.6 4.0 8.0
Now, let's define a 2nd variable that is categorical.
For example, suppose it contains the 5 values:
Yes, No, Yes, Yes, Maybe
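A short sketch of how such a variable might be entered and tallied. The name `acat` is taken from the output that follows; using `table` to count the categories is an assumption about how that output was produced.

```r
# Categorical values are entered as quoted strings.
acat <- c("Yes", "No", "Yes", "Yes", "Maybe")
table(acat)   # counts how many times each category occurs
```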
acat
Maybe No Yes
1 1 3
Exercise 1:
Create an R variable for each of the following datasets:
?var
?sd
- blue, pink, blue, green, green, blue, pink, blue
How to input a spreadsheet of data
A spreadsheet or table of raw data can be created manually via keyboard input, or by reading in data files written in various standard formats. The structure used in R to represent such tables is called a "dataframe." Consider, for example, the following dataset:
Age | Sex | Class year | SAT score | Financial aid? |
---|---|---|---|---|
18 | F | 1 | 1014 | N |
20 | F | 3 | 1222 | Y |
17 | M | 1 | 1141 | Y |
17 | F | 1 | 1082 | N |
19 | M | 2 | 1261 | Y |
18 | F | 2 | 1288 | N |
20 | F | 1 | 1002 | N |
21 | M | 3 | 1078 | N |
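A table like this can be entered by hand with `data.frame`, giving one vector per column. The sketch below assumes the name `students` for the dataframe (hypothetical); the lowercase column names and codes are taken from the printed dataframe that follows.

```r
# Each argument becomes one column; all vectors must have the same length.
students <- data.frame(
  age       = c(18, 20, 17, 17, 19, 18, 20, 21),
  sex       = c("f", "f", "m", "f", "m", "f", "f", "m"),
  year      = c(1, 3, 1, 1, 2, 2, 1, 3),
  sat_score = c(1014, 1222, 1141, 1082, 1261, 1288, 1002, 1078),
  f_aid     = c("n", "y", "y", "n", "y", "n", "n", "n")
)
students   # typing the name prints the whole dataframe
```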
age | sex | year | sat_score | f_aid |
---|---|---|---|---|
18 | f | 1 | 1014 | n |
20 | f | 3 | 1222 | y |
17 | m | 1 | 1141 | y |
17 | f | 1 | 1082 | n |
19 | m | 2 | 1261 | y |
18 | f | 2 | 1288 | n |
20 | f | 1 | 1002 | n |
21 | m | 3 | 1078 | n |
Once a dataframe is created, it is easy to make various
displays, and to compute summary statistics for variables
in the dataframe. The following examples show how to do
this for variables in the dataframe created above.
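As a sketch of how such summaries might be computed, assuming the dataframe is stored under the hypothetical name `students` (only the two columns used here are re-entered so the cell runs on its own):

```r
# Re-enter the two columns used below; "students" is an illustrative name.
students <- data.frame(
  sat_score = c(1014, 1222, 1141, 1082, 1261, 1288, 1002, 1078),
  f_aid     = c("n", "y", "y", "n", "y", "n", "n", "n")
)
table(students$f_aid)            # frequency table for a categorical variable
summary(students["sat_score"])   # summary statistics for a numeric column
```

The `$` operator extracts a single column as a vector. For displays, `barplot(table(students$f_aid))` and `hist(students$sat_score)` are the usual base R starting points.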
n y
5 3
sat_score
Min. :1002
1st Qu.:1062
Median :1112
Mean :1136
3rd Qu.:1232
Max. :1288
Exercise 2.1:
The following table contains data on the employment status
of a sample of college students.
Create a dataframe via keyboard input to represent these data.
Print your dataframe and verify that it is correct.
Age | Major | Employment | Work hours |
---|---|---|---|
19 | Business | Part time | 35 |
19 | English | Part time | 30 |
34 | Business | Unemployed | 0 |
20 | Psychology | Part time | 19 |
20 | Psychology | Part time | 32 |
21 | History | Unemployed | 0 |
21 | Business | Part time | 20 |
21 | History | Part time | 15 |
23 | Psychology | Full time | 36 |
41 | Business | Full time | 50 |
30 | Physics | Unemployed | 0 |
Exercise 2.2:
For each variable in the dataset above, make a display (or two!)
and compute summary statistics.
Use the built-in help feature (for example, ?table) to discover at least
a couple of new ways to customize your displays and/or computations.
For example, try to figure out how to make a 2-way table
for your two categorical variables.
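To illustrate the idea on a different dataset, here is a sketch of a 2-way table for the two categorical variables of the earlier student table (sex and financial aid). The names `students` and `two_way` are assumptions; passing two vectors to `table` cross-tabulates them.

```r
# Two categorical columns from the earlier example; names are illustrative.
students <- data.frame(
  sex   = c("f", "f", "m", "f", "m", "f", "f", "m"),
  f_aid = c("n", "y", "y", "n", "y", "n", "n", "n")
)
# Rows are the levels of the first argument, columns those of the second.
two_way <- table(students$sex, students$f_aid)
two_way
```

The same pattern should carry over to the exercise data once your own dataframe is defined.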