Author: A Pardhanani

Random numbers and random sampling

This worksheet is an interactive, guided module for learning the basics of generating several types of random numbers and of doing random sampling. It assumes you know how to read data files and produce a dataframe containing your data.

Let us begin by learning how to generate random numbers. Two of the most useful types of random numbers are those that are: (1) uniformly distributed in some specified interval; and (2) normally distributed, with specified mean and SD.

Example: Shown below are several variants of commands to generate both types of random numbers. The basic command for uniformly distributed numbers is runif(). Likewise, the command for normally distributed numbers is rnorm().

Note that all the information following any "#" sign is just to explain what is going on. R ignores anything that follows a "#" sign.
%r
n = 3
# To generate n uniform random numbers between min and max:
runif (n, min=0, max=5)
# Note that the same command can be run by the following (less clear) short-hand:
runif (3, 0, 5)
# If the min/max are also left out, it defaults to the range of 0 to 1:
runif (3)
  1. 2.26215060451068
  2. 0.216705296188593
  3. 0.640162804629654
  1. 2.86498950328678
  2. 2.8263787983451
  3. 0.851491509238258
  1. 0.871844316134229
  2. 0.36544156447053
  3. 0.701308553805575
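Because these numbers are random, the outputs shown above will differ every time the cell is run. If you need repeatable results, one option is base R's set.seed(), which fixes the starting state of the random number generator. The following is a minimal sketch; the seed value 42 and the names a and b are arbitrary choices made for illustration.

```r
# Fix the random number generator's state so results can be reproduced.
# The seed value 42 is an arbitrary choice for illustration.
set.seed(42)
a <- runif(3, min = 0, max = 5)

# Resetting the same seed and repeating the call gives the same draw
set.seed(42)
b <- runif(3, min = 0, max = 5)

# TRUE: the two draws are identical
identical(a, b)

# TRUE: every value lies inside the requested interval
all(a >= 0 & a <= 5)
```

Without the second set.seed() call, b would generally differ from a, because the generator keeps advancing with each draw.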
%r
n = 3
# To generate n normally distributed random numbers, use "rnorm":
rnorm (n, mean=12, sd=3.5)
# Again, can run the equivalent (less clear) short-hand:
rnorm (n, 12, 3.5)
# If mean/sd are left out, defaults to standard normal dist. with mean=0, sd=1:
rnorm (n)
  1. 11.0674621077608
  2. 13.007689296916
  3. 12.9938812848226
  1. 11.9787983835207
  2. 14.1956856764494
  3. 16.7709320458723
  1. 0.300196825364775
  2. -0.655895508767473
  3. -0.399298374172015
%r
# To verify whether it is actually doing what we think, let's
# plot a histogram of 500 normally distributed random numbers:
x = rnorm (500, 12, 3.5)
hist(x, xlab="random numbers", main="Check whether normal")
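A histogram gives a visual check. A numerical check is also possible: the sample mean and SD of a large draw should land near the values passed to rnorm(). The sketch below is one way to do this; the cutoffs 5 and 19 are simply the mean plus or minus 2 SDs (12 ± 2 × 3.5), chosen here for illustration.

```r
# Draw 500 values from a normal distribution with mean 12 and SD 3.5
x <- rnorm(500, mean = 12, sd = 3.5)

# The sample mean and SD should be close to (not exactly) 12 and 3.5
mean(x)
sd(x)

# For a normal distribution, roughly 95% of values fall within
# 2 SDs of the mean, i.e. between 5 and 19 here
mean(x > 5 & x < 19)
```

The last line works because x > 5 & x < 19 is a vector of TRUE/FALSE values, and the mean of such a vector is the fraction of TRUEs.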

How about discrete random numbers?

Notice that all the above examples produce numbers that are continuously distributed across their range. Such numbers usually contain decimals, and rarely turn out to be nice, round numbers.

What if we wanted random integers, say 12 of them, lying between -6 and 23? The command for doing that is sample, as shown in the following examples.
%r
# One way to randomly pick integer numbers from a specified
# range is using the "sample" command.
#
# Example: Pick 12 numbers at random that lie between -6 and 23.
sample ( -6:23, 12, replace=TRUE)
  1. 19
  2. -3
  3. 2
  4. 1
  5. 20
  6. -5
  7. 10
  8. 21
  9. 18
  10. 14
  11. 1
  12. 7
%r
# Note that "replace=TRUE" allows picking the same number
# more than once. If you want all the numbers to
# be different, just leave out that option, like this
sample ( -6:23, 12)
  1. 13
  2. 20
  3. -4
  4. 16
  5. 0
  6. -3
  7. 14
  8. 17
  9. 19
  10. 18
  11. 10
  12. 7
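The difference between the two modes can be checked directly. The sketch below confirms that sampling without replacement never repeats a value, and shows what happens when more values are requested than the range contains. The names picks and result are illustrative only.

```r
# The range -6:23 contains 30 integers
length(-6:23)

# Without replacement (the default), no value can repeat
picks <- sample(-6:23, 12)
anyDuplicated(picks) == 0   # TRUE: all 12 values are distinct

# Asking for more values than the range holds fails without replacement.
# try() catches the error so the rest of the cell still runs.
result <- try(sample(-6:23, 31), silent = TRUE)
inherits(result, "try-error")   # TRUE: the call raised an error
```

With replace=TRUE the same request for 31 values would succeed, since values may then be picked more than once.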
The sample command can also be used to randomly pick from categorical variables. Some examples follow.
%r
# Toss a coin 3 times and record the sequence of outcomes
sample( c("H", "T"), 3, replace=TRUE )
# Here is one way to do 10 trials of tossing a coin 3 times.
# There is, likely, a better way to do this. But this is what
# I know at the moment!
for ( i in 1:10 ){
    # print( sample( c("H", "T"), 3, replace=TRUE ) )
    show( sample( c("H", "T"), 3, replace=TRUE ) )
}
  1. 'H'
  2. 'H'
  3. 'T'
[1] "T" "H" "T"
[1] "H" "H" "H"
[1] "H" "T" "T"
[1] "H" "H" "H"
[1] "T" "T" "T"
[1] "T" "H" "H"
[1] "T" "H" "T"
[1] "T" "H" "H"
[1] "T" "H" "H"
[1] "T" "H" "T"
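One possible alternative to the for loop is base R's replicate(), which runs an expression a given number of times and collects the results. This is only a sketch of one approach, not necessarily the best one; the name tosses is illustrative.

```r
# Repeat "toss a coin 3 times" for 10 trials.
# replicate() returns a 3 x 10 character matrix: one column per trial.
tosses <- replicate(10, sample(c("H", "T"), 3, replace = TRUE))

# Transpose so that each row is one trial of 3 tosses
t(tosses)

# Count the heads in each trial (one count per column)
colSums(tosses == "H")
```

Keeping the results in a matrix, rather than only printing them, makes it easy to summarize the trials afterwards.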

How to pick random samples from datafiles

Another very important use of the sample command is to pick random samples from data sets. R provides relatively straightforward ways to pick simple random samples and stratified random samples from dataframes.

Example: The file "test_csv_file.csv" contains data on the employment status, work hours, age, etc., of a group of university students. The following examples show how to get an SRS and stratified random sample from this dataset.

%r
# Read data file and create a dataframe called "empdata"
empdata = read.csv(file="./test_csv_file.csv", header=TRUE, sep=",")
# If we want to see the original data, uncomment next line
# empdata
# Pick a simple random sample of size 7 from this data
library(dplyr)
empdata %>% sample_n(7)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
Row  Response_id  Age  Gender  Employment.Status  Work.Hours
5    164457       34   Female  Unemployed         0.0
16   165440       21   Female  Part Time          15.0
15   165417       21   Male    Part Time          20.0
27   166391       18   Female  Part Time          25.0
28   166397       33   Female  Full Time          37.5
23   166105       41   Male    Full Time          50.0
22   165932       21   Female  Unemployed         0.0
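The same idea can be expressed without dplyr: pick row numbers at random with sample(), then use them to index the dataframe. The sketch below uses a small made-up dataframe called demo as a hypothetical stand-in for empdata, so that it runs without the CSV file; its values are invented for illustration.

```r
# Hypothetical stand-in for empdata (made-up values, not the real file)
demo <- data.frame(
  Response_id       = 1:10,
  Age               = c(21, 34, 19, 45, 23, 30, 27, 22, 38, 20),
  Employment.Status = rep(c("Full Time", "Part Time"), each = 5)
)

# Choose 4 distinct row numbers at random, then keep those rows
rows <- sample(nrow(demo), 4)
srs  <- demo[rows, ]
srs
```

Because sample() draws without replacement by default, no row can appear twice in the result.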
%r
# Now let's try a stratified random sample based on employment status.
# Let us pick 3 people from each stratum
empdata %>% group_by(Employment.Status) %>% sample_n(3)
Response_id  Age  Gender  Employment.Status  Work.Hours
165793       56   Male    Full Time          40
164573       23   Female  Full Time          36
166415       37   Male    Full Time          40
164417       21   Male    Part Time          10
166389       38   Female  Part Time          20
165417       21   Male    Part Time          20
165345       21   Female  Unemployed         0
165638       32   Female  Unemployed         0
165056       30   Female  Unemployed         0
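To see what the grouped sampling does step by step, here is a rough base R equivalent: split the data by stratum, sample within each piece, and stack the pieces back together. The dataframe demo is a hypothetical stand-in for empdata with made-up values, and the names strata, picked and strat_sample are illustrative.

```r
# Hypothetical stand-in for empdata (made-up values, not the real file)
demo <- data.frame(
  Response_id       = 1:12,
  Employment.Status = rep(c("Full Time", "Part Time", "Unemployed"), each = 4)
)

# Split the rows into one dataframe per stratum
strata <- split(demo, demo$Employment.Status)

# Draw 3 rows at random (without replacement) from each stratum
picked <- lapply(strata, function(s) s[sample(nrow(s), 3), ])

# Stack the per-stratum samples back into one dataframe
strat_sample <- do.call(rbind, picked)
strat_sample

# Each stratum contributes exactly 3 rows
table(strat_sample$Employment.Status)
```

This sketch assumes every stratum has at least 3 rows; a smaller stratum would cause sample() to stop with an error. Note also that recent dplyr versions document sample_n() as superseded by slice_sample(), although sample_n() continues to work.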
Exercise 1:
  1. Generate 9 uniformly distributed random numbers in the range [2, 5].
  2. Show that the "runif" function does, in fact, produce a uniform distribution of random numbers by plotting a histogram of 500 numbers in the range [2, 5].
  3. Generate 50 normally distributed random numbers with mean=4.56 and SD=2. Plot a histogram showing your results.
  4. Toss a fair coin 50 times (using R) and count the number of heads.

Exercise 2:

The file "grades.csv" contains data on midterm scores and class year for a group of college students. Read the file and create a dataframe.
  1. Pick a simple random sample of size 12 from this dataset.
  2. Next, pick a stratified random sample containing 3 students from each class year.

For each type of sample, be sure to display the full data in the file for each of the selected individuals.