{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# Lecture 17\n",
"\n",
"Today:\n",
"1. Review of hypothesis testing\n",
"2. Application: A/B Testing\n",
" + Example\n",
"3. Causality"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 1. Review of hypothesis testing\n",
"\n",
"A possible rule for rejecting the null hypothesis:\n",
"\n",
"- Establish a cutoff for the p-value.\n",
"\n",
"- For example, with a 5% cutoff: if the observed p-value is 5% or less, reject the null hypothesis; otherwise, do not reject it."
]
},
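{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"As a minimal sketch, the cutoff rule above can be written as a small R function. The function name `decide` and the p-values passed to it are hypothetical, chosen only for illustration; they do not come from any data in this lecture."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"# Hypothetical helper: apply a p-value cutoff rule (name and default cutoff are assumptions)\n",
"decide <- function(p_value, cutoff = 0.05) {\n",
" if (p_value <= cutoff) \"reject the null\" else \"do not reject the null\"\n",
"}\n",
"\n",
"decide(0.03) # made-up p-value at or below the 5% cutoff\n",
"decide(0.20) # made-up p-value above the 5% cutoff"
]
},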
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 2. A/B Testing: Comparing Two Samples\n",
"\n",
"- Compare the values of sampled individuals in group A with the values of sampled individuals in group B.\n",
"- Example: take a random sample of visitors to Etsy and compare A) the click rate using design A vs. B) the click rate using design B."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Example: mothers' smoking behavior and its influence on babies' birth weights\n",
"\n",
"- Compare A) birth weights of babies whose mothers smoked during pregnancy vs. B) birth weights of babies whose mothers didn't smoke.\n",
"- Question: could the difference be due to chance alone?\n",
"\n",
"HYPOTHESES\n",
"- Null: In the population, the distributions of birth weights in the two groups are the same.\n",
"- Alternative: Babies of mothers who smoked weighed less, on average, than babies of non-smokers.\n",
"- To test this, we compute a test statistic (a single number) comparing group A and group B: the average weight of group B minus the average weight of group A.\n",
" - If the null hypothesis is true, this statistic should be close to 0.\n",
"\n",
"SIMULATION\n",
"- If the null is true, all rearrangements of the birth weights between the two groups are equally likely.\n",
"- Plan:\n",
" - shuffle the birth weights\n",
" - assign some to \"group A\" and the rest to \"group B\", maintaining the sample sizes\n",
" - find the difference between the averages of the two shuffled groups\n",
" - repeat"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"Attaching package: ‘dplyr’\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following objects are masked from ‘package:stats’:\n",
"\n",
" filter, lag\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following objects are masked from ‘package:base’:\n",
"\n",
" intersect, setdiff, setequal, union\n",
"\n"
]
}
],
"source": [
"library('dplyr')\n",
"library('ggplot2')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"babyweight <- read.csv(\"babyweight.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
]
},
"execution_count": 16,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"# Selecting elements from a vector\n",
"\n",
"shuffled_babies <- sample( babyweight$Wgt, 32, replace = FALSE ) # draw 32 weights in random order, without replacement\n",
"shuffled_babies\n",
"shuffled_babies[c(1, 3)] # select the first and third babies\n",
"shuffled_babies[1:16] # first sixteen values; the index range tells R which positions to select\n",
"shuffled_babies[17:32] # last sixteen values"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"# simulate\n",
"\n",
"num_simulations <- 1000\n",
"\n",
"# Set up a data frame with one row per simulation (num_simulations rows).\n",
"# Columns: the average weight of group A, the average weight of group B,\n",
"# and the test statistic (mean weight of group B - mean weight of group A).\n",
"simulated_data <- data.frame(ave_weight_A = double(num_simulations), \n",
" ave_weight_B = double(num_simulations),\n",
" statistic = double(num_simulations) )\n",
"\n",
"\n",
"count <- 1\n",
"while( count <= num_simulations ) {\n",
"\n",
" shuffled_babies <- sample( babyweight$Wgt, 32, replace = FALSE )\n",
" group_A <- shuffled_babies[1:16]\n",
" group_B <- shuffled_babies[17:32]\n",
" \n",
" # compute the mean weight of each group, store both in the data frame, then store their difference (B - A)\n",
" simulated_data$ave_weight_A[count] <- mean(group_A)\n",
" simulated_data$ave_weight_B[count] <- mean(group_B)\n",
" simulated_data$statistic[count] <- simulated_data$ave_weight_B[count] - simulated_data$ave_weight_A[count]\n",
"\n",
"\n",
" count <- count + 1\n",
"}"
]
},
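{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"One way to turn the simulated statistics into a p-value, sketched under assumptions: the statistic is group B minus group A, and the alternative says the babies of smokers (group A) weigh less, so large positive values favor the alternative. The empirical p-value is then the share of simulated statistics at least as large as the observed one. The helper name `empirical_p_value` and the value assigned to `observed_statistic` are hypothetical placeholders; the real observed difference has to be computed from the original, unshuffled groups."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"# Hypothetical helper (not from the lecture): share of simulated statistics\n",
"# that are at least as large as the observed statistic\n",
"empirical_p_value <- function(simulated, observed) {\n",
" mean(simulated >= observed)\n",
"}\n",
"\n",
"# Placeholder value, NOT computed from babyweight.csv; replace it with the real\n",
"# difference (mean of group B - mean of group A) from the unshuffled data\n",
"observed_statistic <- 0.5\n",
"\n",
"# uses simulated_data from the simulation cell above\n",
"empirical_p_value(simulated_data$statistic, observed_statistic)"
]
},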
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"