Project: Brandon Friedel - Courses/Emmanuel-Garcia/MATH-159_Fall-2020/MATH-159-11-081548_Fall-2020

Path: Homework / hw03 / hw03_one-mean.rmd

---

title: HW03
subtitle: The mean of one numerical variable
author: MATH--159
date: Fall 2020
output:
  html_document:
    highlight: tango
    theme: spacelab
    toc: yes

---

knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE,
                      fig.height = 4,
                      fig.align = 'center') # Chunk default settings

library(dplyr)
library(ggplot2)
library(janitor)
library(knitr)

load("~/Data/output/ACS_clean.RData")

set.seed(12102020)

mydata_clean %>%
  filter(AGEP >= 18, AGEP <= 25, NP == 1, WKHP <= 40, RAC1P_cat == "White alone") %>%
  sample_n(50) ->
  young_people

mydata_clean %>%
  filter(AGEP >= 40, AGEP <= 50, NP == 1, WKHP <= 40, RAC1P_cat == "White alone") %>%
  sample_n(50) ->
  old_people

Assignment Overview

The purpose of this assignment is to bring together what we've learned about normal and sampling distributions. Together they create a theory of inference for one mean. The idea is to take the data from our sample and use it to infer the true mean of the population. For instance, we might have weight data on thousands of people. Using this weight data, we will construct an estimate (a confidence interval) for the true mean weight of the population.

Instructions

Completely describe 2 continuous numerical variables using
- A table of summary statistics,
- An appropriate plot with titles and axes labels,
- A short paragraph description in complete English sentences.
Manually calculate a confidence interval for the population mean.
Turn in the assignment on Canvas. Don't worry about the due date. I'll allow late submissions.
- See me in office hours or class if you're having trouble.

Guidance for numerical variables

What is the trend in the data? What exactly does the graph show? (Use the graph title to help you answer this question)
Describe the shape:
- Symmetry/Skewness - Is it symmetric, skewed right, or skewed left?
- Modality - Is it uniform, unimodal, or bimodal?
Describe the spread:
- Variability - What is the approximate range of the data (x-axis)?
- Does the variable have a lot of variability?
Describe the center: What is the mean/median/midpoint of the data? (Pick one or two).
Describe the outliers (note: there may not be any for every graph):
- Are there any outliers for the variable?
- If yes, are these true outliers or false (due to data management or input error) outliers?
Reread your explanation for context, grammar, spelling, and common sense.
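
The checklist above can be turned into a few quick calculations. The following is a minimal sketch, assuming a small made-up vector: the values in `wages` are hypothetical and are not taken from the ACS sample. The 1.5 × IQR cutoff is a common rule of thumb for flagging possible outliers, not a requirement of this assignment.

```r
# Hypothetical wage values for illustration only; they are not from the ACS sample.
wages <- c(0, 0, 5000, 12000, 15000, 18000, 22000, 30000, 150000)

# Center: a mean well above the median hints at a right-skewed distribution.
center <- c(mean = mean(wages), median = median(wages))

# Spread: the overall range and the interquartile range (IQR).
spread <- c(min = min(wages), max = max(wages), iqr = IQR(wages))

# Outliers: flag values more than 1.5 * IQR beyond the quartiles (rule of thumb).
q <- quantile(wages, c(0.25, 0.75))
outliers <- wages[wages < q[1] - 1.5 * IQR(wages) | wages > q[2] + 1.5 * IQR(wages)]

center
spread
outliers
```

A flagged value still needs a judgment call: it may be a genuine high earner or a data entry error, which is the distinction the guidance asks you to make.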

Assignment

There are two dataframes that you will use for this assignment:

young_people, a random sample of 50 people who: a. are 18 to 25 and live in the Bay Area, b. live alone, c. work 40 hours or less per week, d. are white.
old_people, a random sample of 50 people who: a. are 40 to 50 and live in the Bay Area, b. live alone, c. work 40 hours or less per week, d. are white.

Question 1

What are the mean and standard deviation of wage (WAGP variable) for young_people? (3 points)

young_people %>%
  select(WAGP) %>%
  summarize(average_wage = mean(WAGP, na.rm = TRUE),
            standard_dev = sd(WAGP, na.rm = TRUE))

What are the mean and standard deviation of wage (WAGP variable) for old_people? (3 points)

old_people %>%
  select(WAGP) %>%
  summarize(average_wage = mean(WAGP, na.rm = TRUE),
            standard_dev = sd(WAGP, na.rm = TRUE))

Describe the distribution of the variable WAGP for young_people. (4 points)

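young_people %>%
  ggplot(aes(WAGP)) +
  # `color` (not `colors`) sets the bar outline; bins are left-closed and start at 0
  geom_histogram(binwidth = 50000, boundary = 0, closed = 'left', color = 'white') +
  labs(title = "Wages of young_people", x = "Wage (WAGP)", y = "Count")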

The graph is unimodal because there is one peak.

Describe the distribution of the variable WAGP for old_people. (4 points)

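old_people %>%
  ggplot(aes(WAGP)) +
  # `color` (not `colors`) sets the bar outline; bins are left-closed and start at 0
  geom_histogram(binwidth = 50000, boundary = 0, closed = 'left', color = 'white') +
  labs(title = "Wages of old_people", x = "Wage (WAGP)", y = "Count")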

The graph is unimodal because there is one peak.

Question 2

Given the sample of 50 young_people, what is your estimate for the mean wage for the population of young_people if you wish to be 95% confident in your estimate? (3 points)

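xbar <- 12689              # sample mean of WAGP for young_people
se <- 24895 / sqrt(50)     # standard error: sample sd divided by sqrt(n)
xbar + 2 * se              # upper bound of the approximate 95% interval

xbar - 2 * se              # lower bound of the approximate 95% interval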

Given the sample of 50 old_people, what is your estimate for the mean wage for the population of old_people if you wish to be 95% confident in your estimate? (3 points)

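xbar <- 108956             # sample mean of WAGP for old_people
se <- 92571 / sqrt(50)     # standard error: sample sd divided by sqrt(n)

xbar + 2 * se              # upper bound of the approximate 95% interval

xbar - 2 * se              # lower bound of the approximate 95% interval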
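
The multiplier of 2 used in Question 2 is a rounded critical value. As a minimal sketch, the same interval can be computed with the exact normal and t critical values for comparison. The summary statistics are the ones used in Question 2; `ci_mean()` is a hypothetical helper written for this illustration and is not part of any package.

```r
# Summary statistics used in Question 2.
n <- 50
young <- list(xbar = 12689, s = 24895)

# ci_mean() is a hypothetical helper: mean plus or minus multiplier * standard error.
ci_mean <- function(xbar, s, n, multiplier = 2) {
  se <- s / sqrt(n)                      # standard error of the sample mean
  c(lower = xbar - multiplier * se,
    upper = xbar + multiplier * se)
}

ci_mean(young$xbar, young$s, n)                         # rule-of-thumb multiplier of 2
ci_mean(young$xbar, young$s, n, qnorm(0.975))           # normal critical value, about 1.96
ci_mean(young$xbar, young$s, n, qt(0.975, df = n - 1))  # t critical value, about 2.01
```

With n = 50 the three multipliers are close, so the intervals differ only slightly. The t version is the usual choice when the population standard deviation is estimated from the sample.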

Question 3

Given the two estimates you calculated in Question 2, do you think it is possible that the two groups actually have the same mean wage if you wish to be 95% confident? (3 points)

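No. The two 95% confidence intervals from Question 2 do not overlap: the interval for young_people is about 5648 to 19730, and the interval for old_people is about 82773 to 135139. No value is plausible for both population means, so it is unlikely that the two groups have the same mean wage.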
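
One way to check this is to compare the two intervals from Question 2 directly. The following is a minimal sketch using the summary statistics from Question 2 and the same multiplier of 2. Checking for overlap is an informal comparison, not a formal two-sample test.

```r
# Approximate 95% intervals (mean plus or minus 2 standard errors), n = 50 each.
se_young <- 24895 / sqrt(50)
se_old <- 92571 / sqrt(50)
ci_young <- c(12689 - 2 * se_young, 12689 + 2 * se_young)
ci_old <- c(108956 - 2 * se_old, 108956 + 2 * se_old)

# Two intervals overlap only if each one starts before the other one ends.
intervals_overlap <- ci_young[1] <= ci_old[2] && ci_old[1] <= ci_young[2]
intervals_overlap   # FALSE: the intervals share no values
```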

Question 4

If you had to guess with 95% confidence the wage of a person from the population of young_people, what would your guess be? Explain how you reached these numbers. (3 points)

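My prediction would be between 0 and 62479. By the Empirical Rule, about 95% of values fall within 2 standard deviations of the mean: 12689 - 2 × 24895 = -37101 and 12689 + 2 × 24895 = 62479. Because wages cannot be negative, the lower bound is 0.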
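
The Empirical Rule range for a single person can be sketched as follows, next to the interval for the population mean for contrast. The summary statistics are the ones used in Question 2. Truncating the lower bound at 0 is an assumption based on wages not being negative, and the Empirical Rule itself assumes a roughly normal distribution.

```r
xbar <- 12689   # sample mean of WAGP for young_people
s <- 24895      # sample standard deviation of WAGP for young_people
n <- 50

# Range for one person: mean plus or minus 2 standard deviations (Empirical Rule).
individual <- c(lower = xbar - 2 * s, upper = xbar + 2 * s)
individual <- pmax(individual, 0)   # assumption: wages cannot be negative

# Range for the population mean: uses the standard error s / sqrt(n), so it is narrower.
mean_ci <- c(lower = xbar - 2 * s / sqrt(n), upper = xbar + 2 * s / sqrt(n))

individual
mean_ci
```

The individual range is much wider than the interval for the mean because a single wage varies by s, while a sample mean of 50 wages varies only by s / sqrt(50).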