This is an individual assignment. You are free to use your notes, homeworks, books, online materials, etc. You may not discuss the questions with anyone else. The midterm is due on Canvas at midnight on Friday (the midnight before Saturday). Submit the HTML file this Rmd produces, as you do for the homework. You are free to schedule one 15-minute Zoom session with me this upcoming week to ask a question or two. I reserve the right not to answer the question.
I have loaded the libraries and data for you. Don't touch the code block below. You will have to create your own code block(s) to answer the questions if needed.
Read the questions carefully and slowly. Take your time. Write your answers in this document where indicated.
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.width=6, fig.align="center") # No warnings
library(dplyr) # For pipe and other data commands
library(janitor) # For tabyl
library(ggplot2) # For plotting using ggplot() function
library(knitr) # For making tables using kable()
load("~/Data/output/ACS_clean.RData")
ls()
Here is a histogram for the variable HINCP split by the variable new_FS, which indicates whether a family was on food stamps or not.
mydata_clean %>%
  distinct(SERIALNO, new_FS, HINCP) %>%
  ggplot(aes(x = HINCP)) +
  geom_histogram(binwidth = 25000, color = "white") +
  facet_wrap(~ new_FS) +
  scale_x_continuous(labels = scales::comma, limits = c(0, 1000000)) # limits set here: a separate xlim() would replace this scale and drop the comma labels
Taking the graph into account, write your answers below.
HINCP? Explain how you know.
new_FS? Explain how you know.
Here is a summary table for the variable HINCP split by new_FS.
mydata_clean %>%
  distinct(SERIALNO, new_FS, HINCP) %>%
  group_by(new_FS) %>%
  summarize(n = n(),
            min = min(HINCP, na.rm=TRUE),
            median = median(HINCP, na.rm=TRUE),
            mean = mean(HINCP, na.rm=TRUE),
            max = max(HINCP, na.rm=TRUE)) %>%
  kable()
Taking the table and histogram into account, answer the questions.
HINCP for people on food stamps?
HINCP for people on food stamps means.
How are the distributions of HINCP for those on food stamps and not on food stamps different and similar?
What do you find surprising for either distribution?
mydata_clean %>%
  distinct(SERIALNO, new_FS, HINCP) %>%
  ggplot(aes(x = new_FS, y = HINCP)) +
  geom_boxplot(outlier.shape = NA) +
  coord_flip() +
  ylim(0, 400000)
HINCP is the same or different for people who are on or not on food stamps? Cite as much evidence as possible from the graphs and tables above.
Here is a table for the variable JWTR_new, which indicates how someone got to work.
mydata_clean %>%
  distinct(SERIALNO, JWTR_new) %>%
  tabyl(JWTR_new) %>%
  adorn_pct_formatting(digits=0) %>%
  kable()
JWTR_new? Explain how you know.
Bus or trolley bus mean?
Read the document Cell-Phone-Student.md in the Affective-Domain\Cell-Phones directory. Answer the questions below.
What is "sampling bias"? Explain using proper terminology and craft your own example to explain how it can affect the outcome of a statistical study. This should be several paragraphs long.
Sampling bias occurs when the observations collected to estimate a variable's distribution are selected in a way that does not represent the population truthfully. A sample is biased when its members are chosen SELECTIVELY instead of RANDOMLY. In an unbiased sample, which members of the population end up in the sample should be determined only by chance.
For example, suppose we want to predict the outcome of an election, so we poll 1000 voters and ask them who they plan to vote for. To get an accurate picture, the sample needs to reflect the views of the whole electorate, including elderly people, young voters, middle-aged voters, ethnic minorities, rich people, etc. True representation means that no group is systematically left out of the sample.
The most effective way to avoid sampling bias is to take a random sample. A simple random sample gives every member of the population the same chance of being chosen as a participant in the study.
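The election example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the population size, the three age groups, and the support rates for a hypothetical "candidate A" are invented, and names such as `population` and `supports_A` do not come from the ACS data. It compares a simple random sample of 1000 voters with a selective sample that polls only young voters.

```r
# Hypothetical electorate of 100,000 voters; every number here is made up.
set.seed(42)
n_pop <- 100000
age_group <- sample(c("young", "middle", "elderly"), n_pop,
                    replace = TRUE, prob = c(0.3, 0.4, 0.3))

# Assumed probability of supporting candidate A within each age group.
p_support <- c(young = 0.65, middle = 0.50, elderly = 0.35)
supports_A <- rbinom(n_pop, size = 1, prob = p_support[age_group])
population <- data.frame(age_group, supports_A)

# Share of the whole population that supports candidate A.
true_share <- mean(population$supports_A)

# Simple random sample: every voter has the same chance of being polled.
random_rows <- sample(n_pop, 1000)
random_share <- mean(population$supports_A[random_rows])

# Selective (biased) sample: only young voters can be polled.
young_rows <- which(population$age_group == "young")
biased_rows <- sample(young_rows, 1000)
biased_share <- mean(population$supports_A[biased_rows])

round(c(population = true_share, random = random_share, biased = biased_share), 3)
```

Under these assumptions the random sample's share should land close to the population share, while the young-only sample should overstate support for candidate A, because a group with different views makes up the entire sample.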