Questions

Question 1

Here is a histogram for the variable HINCP split by the variable new_FS, which indicates whether a family was on food stamps or not.

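mydata_clean %>%
  # drop duplicate rows so each SERIALNO/new_FS/HINCP combination is counted once
  distinct(SERIALNO, new_FS, HINCP) %>%
  ggplot(aes(x = HINCP)) +
    geom_histogram(binwidth = 25000,
                   color = "white"
                  ) +
    facet_wrap( ~ new_FS) + 
    # set the limits inside the scale: a separate xlim() would replace
    # this scale and drop the comma labels
    scale_x_continuous(labels = scales::comma,
                       limits = c(0, 1000000))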

Taking the graph into account, write your answers below.

What sort of variable is HINCP? Explain how you know.

HINCP is a numerical variable, because it records each household's income as a dollar amount, and it makes sense to compute a mean or median of those amounts. The histogram groups the incomes into ranges of $25,000 only for display.

What sort of variable is new_FS? Explain how you know.

new_FS is a categorical variable, because it sorts each household into one of two groups: on food stamps or not on food stamps. Its values are labels, not quantities; the counts of households in each group come from tallying those labels.
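
As a rough check in R, a column's type can be inspected directly. The sketch below uses a small made-up data frame called toy (the values are invented; only the column names HINCP and new_FS come from the exam) and assumes income is stored as numbers and food stamp status as a factor.

```r
# Toy data frame: column names mirror the exam, values are made up.
toy <- data.frame(
  HINCP  = c(38400, 124000, 52000, 0, 210000),
  new_FS = factor(c("Food stamps", "No food stamps", "Food stamps",
                    "Food stamps", "No food stamps"))
)

# A numerical variable supports arithmetic such as a mean.
is.numeric(toy$HINCP)
mean(toy$HINCP)

# A categorical variable has a fixed set of labels and is summarized by counts.
is.factor(toy$new_FS)
levels(toy$new_FS)
table(toy$new_FS)
```

Which axis a variable is drawn on does not decide its type; what matters is whether arithmetic on its values is meaningful.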

How would you describe the shape of the histogram for people who are on food stamps?

The shape of the histogram is skewed right for people who are on food stamps.

Here is a summary table for the variable HINCP split by new_FS.

mydata_clean %>%
  distinct(SERIALNO, new_FS, HINCP) %>%
  group_by(new_FS) %>%
  summarize(n = n(),
            min = min(HINCP, na.rm=TRUE),
            median = median(HINCP, na.rm=TRUE),
            mean = mean(HINCP, na.rm=TRUE),
            max = max(HINCP, na.rm=TRUE)) %>%
  kable()

new_FS	n	min	median	mean	max
Food stamps	641	-4600	38400	64901.88	958700
No food stamps	13471	-4800	124000	164488.83	2580000

Taking the table and histogram into account, answer the questions.

What is the appropriate measure of spread for the distribution of HINCP for people on food stamps?

The appropriate measure of spread for the distribution of HINCP for people on food stamps is the IQR, paired with the median as the measure of center, because the distribution is right skewed.

Please explain in context what the measure of center of HINCP for people on food stamps means.

The measure of center is the median, which is $38,400 for households on food stamps. If the incomes of those households are put in order from lowest to highest, the median is the middle value: about half of the households have an income below $38,400 and about half have an income above it.
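
To illustrate why the median and IQR suit a right-skewed distribution, here is a minimal sketch with invented incomes (the vector incomes is hypothetical, not taken from mydata_clean). One extreme value pulls the mean and standard deviation far more than the median and IQR.

```r
# Made-up incomes with one extreme value, mimicking right skew.
incomes <- c(12000, 20000, 30000, 38000, 45000, 60000, 900000)

median(incomes)  # middle value of the sorted incomes
mean(incomes)    # pulled upward by the 900000 household
IQR(incomes)     # spread of the middle half of the data
sd(incomes)      # inflated by the extreme value

# Dropping the extreme value barely moves the median but moves the mean a lot.
without_extreme <- incomes[incomes < 900000]
median(without_extreme)
mean(without_extreme)
```

The same pattern appears in the summary table above, where the mean for each group is well above its median.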

How are the distributions of HINCP for those on food stamps and not on food stamps different and similar?

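The distributions of HINCP for those on food stamps and those not on food stamps are similar in that both are right skewed. They differ in center: the median income is $38,400 for households on food stamps and $124,000 for households not on food stamps.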

What do you find surprising for either distribution?

I find it very surprising that over 1,000 households have a low income and do not use food stamps, especially compared to the histogram of the households that are on food stamps. Far more of the low-income households in the data are not on food stamps.

mydata_clean %>%
  distinct(SERIALNO, new_FS, HINCP) %>%
  ggplot(aes(x = new_FS, y = HINCP)) +
    geom_boxplot(outlier.shape = NA) +
    coord_flip() +
    ylim(0,400000)

Do you think the distribution of HINCP is the same or different for people who are on or not on food stamps? Cite as much evidence as possible from the graphs and tables above.

I think the shape of the distribution of HINCP is about the same for both groups: both are right skewed, so the median is the appropriate measure of center for each.

Question 2

Here is a table for the variable JWTR_new, which indicates how someone got to work.

mydata_clean %>%
  distinct(SERIALNO, JWTR_new) %>%
  tabyl(JWTR_new) %>%
  adorn_pct_formatting(digits=0) %>%
  kable()

JWTR_new	n	percent	valid_percent
Car, truck, or van	8524	38%	71%
Bus or trolley bus	558	3%	5%
Streetcar	24	0%	0%
Subway	649	3%	5%
Railroad	224	1%	2%
Ferryboat	38	0%	0%
Taxicab	60	0%	0%
Motorcycle	60	0%	0%
Bicycle	271	1%	2%
Walked	475	2%	4%
Worked at home	987	4%	8%
Other method	149	1%	1%
NA	10122	46%	-

What sort of variable is JWTR_new? Explain how you know.

JWTR_new is a categorical variable because it sorts people into groups by the method they used to get to work, and its values are labels rather than numbers.

What does the 46% of row NA mean?

The 46% is the percent of people for whom the transportation method is inapplicable.

What is the difference between the percent and valid_percent columns?

The percent column divides each count by the total of all rows, including the NA row. The valid_percent column divides each count by only the rows with a recorded method of getting to work, excluding the people for whom the question was inapplicable.

What does the 5% of row Bus or trolley bus mean?

The 5% in the Bus or trolley bus row is the valid percent: of the people with a recorded method of getting to work (excluding the NA row), 5% took a bus or trolley bus.
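
The two percent columns can be reproduced by hand from the counts in the table above. This is a sketch in base R rather than the tabyl() call the exam uses; the names valid_counts, n_missing, n_valid, and n_total are chosen here for illustration.

```r
# Counts copied from the JWTR_new table above (NA row kept separately).
valid_counts <- c(
  "Car, truck, or van" = 8524, "Bus or trolley bus" = 558,
  "Streetcar" = 24, "Subway" = 649, "Railroad" = 224, "Ferryboat" = 38,
  "Taxicab" = 60, "Motorcycle" = 60, "Bicycle" = 271, "Walked" = 475,
  "Worked at home" = 987, "Other method" = 149
)
n_missing <- 10122

n_valid <- sum(valid_counts)    # rows with a recorded method
n_total <- n_valid + n_missing  # all rows, including NA

# percent uses all rows; valid_percent uses only the non-NA rows.
percent       <- valid_counts / n_total
valid_percent <- valid_counts / n_valid

round(100 * percent[["Bus or trolley bus"]])        # 3
round(100 * valid_percent[["Bus or trolley bus"]])  # 5
round(100 * n_missing / n_total)                    # 46
```

Because the NA row makes up almost half of the rows, every valid_percent value is close to double its percent value.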

Question 3

Read the document Cell-Phone-Student.md in the Affective-Domain\Cell-Phones directory. Answer the questions below.

Do you think your cell phone habits are hindering your attention and focus? Explain. This should be a paragraph or two long.

I do think my cell phone habits are hindering my attention and focus. For me, I feel like my cell phone and I are connected by a string and I can never go without it. I use my phone for just about everything, because with our technological advances improving day by day, my phone makes everything much easier. I can search up any question that comes to mind, I can contact my friends via text, call, Facetime, email, etc., I can play games when I’m bored, and I can check my social media.

However, though my cell phone is very useful and can kill my boredom, it can be very distracting when I’m trying to focus. I can be doing my homework and a notification of a text message will immediately draw my attention away, and I will not only reply to the text, but also check my email, check my social media, and so forth. I check my phone so often during the day that I don’t even need a notification to take my attention away from anything. I think our cell phones just provide us with so much that we will easily drop whatever we are doing to go on them.

Do you intend to make any changes to how you use your cellphone? If so, what changes? This should be a paragraph or two long.

To be honest, I would want to make changes to my phone usage, but realistically, I don’t think I would be able to. I work late at night and it is really hard for me to fall asleep, so being on my phone for who knows how long helps my eyes get tired. Though it is an acquired bad habit, I don’t think I’d be able to break that routine for myself.

However, I have recently made a slight change to my extreme phone usage. I used to check my social media every second of the day, but for the past few weeks, I haven’t opened my social media until the end of the day when I’m home from school and work. I think this slight change has been positive progress from my previous excessive usage.

I would say, if I or someone else were to try to change their habits, I would advise putting the phone in a different room when you’re with your family or friends at home, so that your complete, undivided attention is on them. I would also say to put your phone on silent instead of on sound, so it lessens the times you get interrupted to check your notifications while doing something. I would also say to possibly delete your social media apps. I have seen other people “take a break from social media” by deleting the apps for about a month or so, and I think it can really help someone see how unimportant social media is and lessen some of their screen time.

Question 4

What is “sampling bias”? Explain using proper terminology and craft your example to explain how it can affect the outcome of a statistical study. This should be several paragraphs long.

“Sampling bias” is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher chance of being selected than others, so the sample does not represent the population.
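
A small simulation can show how unequal selection chances distort a result. Everything in this sketch is hypothetical: the simulated population of incomes, the sample size of 500, the seed, and the assumption that a household's chance of being selected grows with its income. It is not based on mydata_clean.

```r
set.seed(159)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 10,000 right-skewed household incomes.
population <- rlnorm(10000, meanlog = 11, sdlog = 0.8)

# Simple random sample: every household has the same chance of selection.
srs <- sample(population, size = 500)

# Biased sample: the chance of selection is proportional to income,
# so low-income households are under-represented.
biased <- sample(population, size = 500, prob = population)

mean(population)  # the value the study is trying to estimate
mean(srs)         # should land close to the population mean
mean(biased)      # should overestimate the population mean
```

In this setup the biased sample overstates the average income because high-income households are more likely to be chosen, while the simple random sample has no such systematic tilt.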

Midterm 1

MATH-159 Spring 2020

Kaylanna Wong

3/15-3/19
