Kernel: R (R-Project)

Lecture 17


  1. Review of hypothesis testing

  2. Application: A/B Testing

    • Example

  3. Causality

1. Review of hypothesis testing

A possible rule for rejecting the null hypothesis:

  • Establish a cutoff for the p-value

  • For example, with a 5% cutoff: if the observed p-value is 5% or less, reject the null hypothesis. Otherwise, do not reject it
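
The cutoff rule above can be written as a small helper. This is only a minimal sketch: the function name `decide` and its default `cutoff = 0.05` are illustrative choices, not part of the lecture code.

```r
# Hypothetical helper: apply a p-value cutoff to decide about the null.
decide <- function(p_value, cutoff = 0.05) {
  if (p_value <= cutoff) "reject the null" else "do not reject the null"
}

decide(0.03)   # at or below the 5% cutoff
decide(0.234)  # above the 5% cutoff
```

Note that a p-value exactly equal to the cutoff counts as a rejection under this rule ("5% or less").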

2. A/B Testing: Comparing Two Samples

  • Compare the values of sampled individuals in group A with the values of sampled individuals in group B

  • Example: a random sample of visitors to Etsy, comparing A) the click rate using design A vs. B) the click rate using design B

Example: mothers' smoking behavior and its influence on babies' birth weights

  • Comparing A) birth weights of babies whose mothers smoked during pregnancy vs. B) birth weights of babies whose mothers didn't smoke. Question: could the difference be due to chance alone?


  • Null: in the population, the distributions of birth weights in the two groups are the same

  • Alternative: babies of mothers who smoked weigh less, on average, than babies of non-smokers

  • To test this, we compute a test statistic (one number) that compares group A and group B. The test statistic is the average birth weight of group B minus the average birth weight of group A

    • If the null hypothesis is true, the statistic should be close to 0; under the alternative, it should be positive
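
As a quick illustration of the statistic, here is a sketch with invented numbers. The vectors `weights_A` and `weights_B` are hypothetical and are not taken from `babyweight.csv`.

```r
# Hypothetical birth weights; these values are invented for illustration.
weights_A <- c(2.9, 3.1, 3.0, 2.8)  # group A: mothers who smoked
weights_B <- c(3.4, 3.2, 3.5, 3.3)  # group B: mothers who did not smoke

# Test statistic: mean of group B minus mean of group A.
observed_diff <- mean(weights_B) - mean(weights_A)
observed_diff
```

A clearly positive value points toward the alternative; whether it is large enough to rule out chance is what the shuffling procedure is meant to answer.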


  • If the null is true, all rearrangements of the birth weights among the two groups are equally likely.

  • Plan:

    • shuffle birth weights

    • assign some to "group a" and the rest to "group b," maintaining sample sizes

    • find the difference between the averages of the two shuffled groups

    • repeat
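
To see what "all rearrangements are equally likely" means, a very small sample can be enumerated exactly instead of shuffled at random. The sketch below assumes a made-up sample of four weights split two and two; the vector `weights` is hypothetical.

```r
# Hypothetical tiny sample: 4 weights, 2 per group (values invented).
weights <- c(2.8, 3.0, 3.3, 3.5)

# Each column of combn() is one choice of positions for group A;
# the remaining positions form group B.
splits <- combn(length(weights), 2)

# Difference of means (B - A) for every possible rearrangement.
all_diffs <- apply(splits, 2, function(idx_A) {
  mean(weights[-idx_A]) - mean(weights[idx_A])
})

length(all_diffs)  # number of equally likely rearrangements
sort(all_diffs)    # the exact distribution of the statistic under the null
mean(all_diffs)    # the differences balance out around 0
```

With realistic sample sizes the number of rearrangements is far too large to enumerate, which is why the plan above draws random shuffles instead.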

library('dplyr')
library('ggplot2')
babyweight <- read.csv("babyweight.csv")
# simulate
num_simulations <- 1000

# Set up a data frame with 1000 rows, each row being one simulation.
# Columns: average weight of group A, average weight of group B, and the
# test statistic (mean weight of group B - mean weight of group A).
simulated_data <- data.frame(
  ave_weight_A = double(num_simulations),
  ave_weight_B = double(num_simulations),
  statistic = double(num_simulations)
)

count <- 1
while (count <= num_simulations) {
  # shuffle the birth weights (this draws 32 values without replacement,
  # so it assumes babyweight has 32 rows), then split into two groups of 16
  shuffled_babies <- sample(babyweight$Wgt, 32, replace = FALSE)
  group_A <- shuffled_babies[1:16]
  group_B <- shuffled_babies[17:32]

  # find the mean weight in each group, store it in the data frame,
  # and then find the difference
  simulated_data$ave_weight_A[count] <- mean(group_A)
  simulated_data$ave_weight_B[count] <- mean(group_B)
  simulated_data$statistic[count] <-
    simulated_data$ave_weight_B[count] - simulated_data$ave_weight_A[count]

  count <- count + 1
}
ggplot(simulated_data, aes( x = statistic)) + geom_histogram( bins = 10 )
[Figure: histogram of the simulated test statistic]
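# observed_diff must hold the observed statistic (mean weight of group B -
# mean weight of group A in the actual data); it is not computed in the
# code shown here

# find the percentile of the observed statistic
mean(simulated_data$statistic <= observed_diff)
# area to the left: the observed statistic is at the 76.6th percentile

# p-value: area to the right of the observed statistic
1 - mean(simulated_data$statistic <= observed_diff)
# area to the right is 23.4% (p-value = 0.234); this is above the 5% cutoff,
# so we do not reject the null hypothesis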
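
The whole procedure can also be wrapped in one function so it runs without `babyweight.csv`. This is a sketch under stated assumptions: `perm_test` is a hypothetical helper, the input weights are invented, and it counts simulated statistics at or above the observed one (`>=`), which treats ties slightly differently from the `1 - mean(... <= ...)` computation above.

```r
# Hypothetical helper: one-sided permutation test for mean(B) - mean(A).
perm_test <- function(values_A, values_B, num_simulations = 1000) {
  observed <- mean(values_B) - mean(values_A)
  pooled <- c(values_A, values_B)
  n_A <- length(values_A)

  simulated <- replicate(num_simulations, {
    shuffled <- sample(pooled)  # random rearrangement, without replacement
    # first n_A values play the role of group A, the rest group B
    mean(shuffled[-seq_len(n_A)]) - mean(shuffled[seq_len(n_A)])
  })

  list(observed = observed,
       simulated = simulated,
       p_value = mean(simulated >= observed))  # right-tail area
}

set.seed(17)  # arbitrary seed, only for reproducibility
result <- perm_test(c(2.9, 3.1, 3.0, 2.8), c(3.4, 3.2, 3.5, 3.3))
result$observed
result$p_value
```

Because the group sizes are taken from the inputs rather than hard-coded, the same function works for unequal group sizes.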