Random numbers and random sampling

This worksheet is an interactive, guided introduction to generating several types of random numbers and to doing random sampling. It assumes you already know how to read a data file into a dataframe.

Let us begin by learning how to generate random numbers. Two of the most useful types of random numbers are those that are: (1) uniformly distributed in some specified interval; and (2) normally distributed, with specified mean and SD.

Example: Shown below are several variants of commands to generate both types of random numbers. The basic command for uniformly distributed numbers is runif(). Likewise, the command for normally distributed numbers is rnorm().

Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign.

%r
n = 3
# To generate n uniform random numbers between min and max:
runif (n, min=0, max=5)

# Note that the same command can be run by the following (less clear) short-hand:
runif (3, 0, 5)

# If the min/max are also left out, it defaults to the range of 0 to 1:
runif (3)

2.26215060451068
0.216705296188593
0.640162804629654

2.86498950328678
2.8263787983451
0.851491509238258

0.871844316134229
0.36544156447053
0.701308553805575
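
Because these numbers are random, the values shown above will differ every time the cell is run. The sketch below is an addition to the worksheet, not one of its original examples. It uses base R's set.seed() to make a sequence repeatable and checks that every draw stays inside the requested interval. The seed value 123 and the variable names are arbitrary choices.

```r
# Fix the random number generator's starting point (123 is an arbitrary choice)
set.seed(123)
first_draw = runif(5, min=0, max=5)

# Resetting the same seed reproduces exactly the same numbers
set.seed(123)
second_draw = runif(5, min=0, max=5)
identical(first_draw, second_draw)

# Check that every value lies inside the requested interval
all(first_draw >= 0 & first_draw <= 5)
```

Leaving out set.seed() gives a fresh sequence on each run, which is usually what you want outside of testing or teaching.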

%r
n = 3
# To generate n normally distributed random numbers, use "rnorm":
rnorm (n, mean=12, sd=3.5)

# Again, can run the equivalent (less clear) short-hand:
rnorm (n, 12, 3.5)

# If mean/sd are left out, defaults to standard normal dist. with mean=0, sd=1:
rnorm (n)

11.0674621077608
13.007689296916
12.9938812848226

11.9787983835207
14.1956856764494
16.7709320458723

0.300196825364775
-0.655895508767473
-0.399298374172015

%r
# To verify whether it is actually doing what we think, let's 
# plot a histogram of 500 normally distributed random numbers:
x = rnorm (500, 12, 3.5)
hist(x, xlab="random numbers", main="Check whether normal")
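
The histogram gives a visual check. As a complementary numerical check, one could compare the sample mean and SD with the values passed to rnorm(). This is only a sketch: the sample size of 5000 and the seed are arbitrary assumptions, and the results should come out close to, but not exactly equal to, 12 and 3.5.

```r
# Arbitrary seed so the check is repeatable
set.seed(1)

# Draw a larger sample so the estimates settle near the true values
x = rnorm(5000, mean=12, sd=3.5)

# Sample mean and sample standard deviation of the draws
sample_mean = mean(x)
sample_sd = sd(x)
sample_mean
sample_sd
```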

How about discrete random numbers?

Notice that all the above examples produce numbers that are continuously distributed across their range. Such numbers usually contain decimals, and rarely turn out to be nice, round numbers.

What if we wanted random integers, say, 12 of them, lying between -6 and 23? The command for doing that is sample, as shown in the following examples.

%r
# One way to randomly pick integers from a specified
# range is using the "sample" command.
#
# Example: Pick 12 numbers at random that lie between -6 and 23.
sample ( -6:23,  12, replace=TRUE)

19
-3
2
1
20
-5
10
21
18
14
1
7

%r
# Note that "replace=TRUE" allows picking the same number
# more than once.  If you want all the numbers to
# be different, leave that option out (the default is replace=FALSE):
sample ( -6:23,  12)

13
20
-4
16
0
-3
14
17
19
18
10
7
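
The difference between the two modes can be checked directly. The sketch below is an illustration added to the worksheet, with arbitrary variable names and seed. It confirms that sampling without replacement never repeats a value, and that asking for more values than the range contains (here, more than 30) is an error unless replace=TRUE is given.

```r
set.seed(42)          # arbitrary seed, only to make the run repeatable
pool = -6:23          # the 30 integers from -6 to 23, endpoints included

with_rep = sample(pool, 12, replace=TRUE)   # repeats are allowed
without_rep = sample(pool, 12)              # all 12 values are distinct

# anyDuplicated() returns 0 when there are no repeated values
anyDuplicated(without_rep)

# Without replacement, the sample cannot be larger than the pool;
# tryCatch() turns the resulting error into the string "error"
too_many = tryCatch(sample(pool, 31), error = function(e) "error")
too_many
```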

The sample command can also be used to randomly pick from categorical variables. Some examples follow.

%r
# Toss a coin 3 times and record the sequence of outcomes
sample( c("H", "T"), 3, replace=TRUE )

# Here is one way to do 10 trials of tossing a coin 3 times.
# There is, likely, a better way to do this. But this is what 
# I know at the moment!
for ( i in 1:10 ){
    print( sample( c("H", "T"), 3, replace=TRUE ) )
}

[1] "T" "H" "T"
[1] "H" "H" "H"
[1] "H" "T" "T"
[1] "H" "H" "H"
[1] "T" "T" "T"
[1] "T" "H" "H"
[1] "T" "H" "T"
[1] "T" "H" "H"
[1] "T" "H" "H"
[1] "T" "H" "T"
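
The loop above prints each trial as it is generated. One possible alternative, sketched here as a suggestion and not as the worksheet's own method, is base R's replicate(), which repeats an expression and collects the results. The seed and the name "tosses" are arbitrary.

```r
set.seed(7)   # arbitrary seed, only to make the run repeatable

# Repeat the 3-toss experiment 10 times; the result is a character
# matrix with 3 rows (tosses) and 10 columns (trials)
tosses = replicate(10, sample(c("H", "T"), 3, replace=TRUE))

# Transpose so that each row is one trial of 3 tosses
t(tosses)
```

Keeping the outcomes in a matrix, instead of only printing them, makes it possible to work with the results afterwards.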

How to pick random samples from datafiles

Another very important use of random sampling is to pick random samples from data sets. R, together with the dplyr package, provides relatively straightforward ways to pick simple random samples (SRS) and stratified random samples from dataframes.

Example: The file "test_csv_file.csv" contains data on the employment status, work hours, age, etc., of a group of university students. The following examples show how to get an SRS and stratified random sample from this dataset.

%r
# Read data file and create a dataframe called "empdata"
empdata = read.csv(file="./test_csv_file.csv", header=TRUE, sep=",")

# If we want to see the original data, uncomment next line
# empdata

# Pick a simple random sample of size 7 from this data.
# (In dplyr 1.0.0 and later, slice_sample(n = 7) is the
# recommended replacement for sample_n.)
library(dplyr)
empdata %>%
    sample_n(7)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

	Response_id	Age	Gender	Employment.Status	Work.Hours
5	164457	34	Female	Unemployed	0.0
16	165440	21	Female	Part Time	15.0
15	165417	21	Male	Part Time	20.0
27	166391	18	Female	Part Time	25.0
28	166397	33	Female	Full Time	37.5
23	166105	41	Male	Full Time	50.0
22	165932	21	Female	Unemployed	0.0
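
The same idea can be expressed without dplyr by sampling row numbers with the sample command. The sketch below is an assumption-laden illustration: "demo" is a small made-up dataframe standing in for empdata, and its columns and values are invented for this example only.

```r
set.seed(3)   # arbitrary seed, only to make the run repeatable

# Hypothetical stand-in for empdata (invented values, 20 rows)
demo = data.frame(
    id = 1:20,
    status = rep(c("Full Time", "Part Time", "Unemployed", "Part Time"), times=5)
)

# Pick 7 distinct row numbers at random, then keep those rows
rows = sample(nrow(demo), 7)
srs = demo[rows, ]
srs
```

Because the whole row is kept, every column of the selected individuals is displayed, just as with sample_n.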

%r
# Now let's try a stratified random sample based on employment status.
# Let us pick 3 people from each stratum
empdata %>%
    group_by(Employment.Status) %>%
    sample_n(3)

Response_id	Age	Gender	Employment.Status	Work.Hours
165793	56	Male	Full Time	40
164573	23	Female	Full Time	36
166415	37	Male	Full Time	40
164417	21	Male	Part Time	10
166389	38	Female	Part Time	20
165417	21	Male	Part Time	20
165345	21	Female	Unemployed	0
165638	32	Female	Unemployed	0
165056	30	Female	Unemployed	0
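
A stratified sample can also be sketched in base R: split the dataframe by stratum, sample rows within each piece, and stack the pieces back together. As before, "demo" is a made-up dataframe used only for illustration, and the approach assumes every stratum has at least 3 rows.

```r
set.seed(5)   # arbitrary seed, only to make the run repeatable

# Hypothetical stand-in for empdata (invented values, 20 rows)
demo = data.frame(
    id = 1:20,
    status = rep(c("Full Time", "Part Time", "Unemployed", "Part Time"), times=5)
)

# One dataframe per employment status
strata = split(demo, demo$status)

# Pick 3 rows at random from each stratum
picked = lapply(strata, function(g) g[sample(nrow(g), 3), ])

# Stack the per-stratum samples into a single dataframe
stratified = do.call(rbind, picked)
table(stratified$status)
```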

Exercise 1:

Generate 9 uniformly distributed random numbers in the range [2, 5].
Show that the "runif" function does, in fact, produce a uniform distribution of random numbers by plotting a histogram of 500 numbers in the range [2, 5].
Generate 50 normally distributed random numbers with mean=4.56 and SD=2. Plot a histogram showing your results.
Toss a fair coin 50 times (using R) and count the number of heads.

Exercise 2:

The file "grades.csv" contains data on midterm scores and class year for a group of college students. Read the file and create a dataframe.

Pick a simple random sample of size 12 from this dataset.
Next, pick a stratified random sample containing 3 students from each class year.

For each type of sample, be sure to display the full data in the file for each of the selected individuals.