︠2501cbec-0463-4c4c-8ee0-c81484b6b62esi︠
%hide
%html

Confidence intervals and hypothesis tests

R provides an extensive range of tools for doing inferential statistics. The most basic of these include confidence intervals and hypothesis tests for one sample proportion.

Before getting into details of these methods, it is useful to learn how to work with normal distributions in R. The following examples show how to find normal probabilities (i.e., area under the normal curve) for any specified range of z-values.
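As a quick check on the phrase "area under the normal curve", the sketch below compares R's pnorm with a direct numerical integration of the normal density dnorm. The cutoff values are arbitrary illustrations and are not taken from the examples that follow; the call with mean and sd assumes a hypothetical normal distribution with mean 100 and standard deviation 15.

```r
# Area under the standard normal curve to the left of z = 1, computed two ways.
area_pnorm     <- pnorm(1)
area_integrate <- integrate(dnorm, lower = -Inf, upper = 1)$value

# The area between two z-values is a difference of two left-tail areas.
area_between <- pnorm(1) - pnorm(-1)

# pnorm also accepts a mean and sd, so standardizing by hand is optional.
# (mean = 100 and sd = 15 are hypothetical values chosen for illustration.)
area_raw <- pnorm(115, mean = 100, sd = 15)

print(c(area_pnorm, area_integrate, area_between, area_raw))
```

The first, second, and fourth values should agree closely, since 115 is exactly one standard deviation above the assumed mean of 100.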

Example 1: Find the following probabilities in the standard normal distribution:

    (a) \(P\;(z < -1.6789)\).

    (b) $P\;(z > -1.6789)$.

    (c) \( P\;(|z| < 0.78) \).

    (d) \( P\;(-0.98 < z < 1.279) \)

We will use a function named pnorm. For more information, type ?pnorm in a %r cell to see the full set of options it offers. ︡5124eeb5-d239-4362-b0e8-6cf7f0ded6db︡{"hide":"input"}︡{"html":"

Confidence intervals and hypothesis tests

\n\nR provides an extensive range of tools for doing \ninferential statistics. The most basic of these include \nconfidence intervals and hypothesis tests for one sample \nproportion.\n

\n\nBefore getting into details of these methods, it is useful \nto learn how to work with normal distributions in R. \nThe following examples show how to find normal probabilities\n(i.e., area under the normal curve) for any specified \nrange of z-values.\n

\n\n\nExample 1: Find the following probabilities in the \n standard normal distribution:

\n    (a) \\(P\\;(z < -1.6789)\\).

\n    (b) $P\\;(z > -1.6789)$.

\n    (c) \\( P\\;(|z| < 0.78) \\).

\n    (d) \\( P\\;(-0.98 < z < 1.279) \\)

\n
\n

\n\nWe will use a function named pnorm. \nFor more information, type ?pnorm \nin a %r cell to see the full set of options it offers. "}︡{"done":true}︡
︠56f0867b-1624-4396-bbac-202a436ed672︠
%r
# Example 1: (a) z < -1.6789
pnorm(-1.6789, lower.tail=TRUE)

# Example 1: (b) z > -1.6789
pnorm(-1.6789, lower.tail=FALSE)

# Example 1: (c) |z| < 0.78
pnorm(0.78) - pnorm(-0.78)   # NOTE that lower.tail=TRUE is the default. So we can
                             # leave it out for brevity.

# Example 1: (d) -0.98 < z < 1.279
pnorm(1.279) - pnorm(-0.98)
︡58b04050-75c0-4caa-8e50-5148f45a4250︡{"html":"0.0465857672770096"}︡{"html":"0.95341423272299"}︡{"html":"0.564609124828534"}︡{"html":"0.736008412726515"}︡{"done":true}︡
︠c47d9b13-53aa-492b-862b-98375637efc0si︠
%hide
%html
Next, let us look at how to do reverse lookups.

Example 2: Find the z-score corresponding to the following areas in the standard normal distribution:

    (a) Area in the upper tail is 0.0845.

    (b) Area in the lower tail is 0.404.

    (c) The z-value that bounds the central 48% of the area.
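Reverse lookups of this kind rely on R's quantile function qnorm, which is the inverse of pnorm. The following is a minimal sketch of that relationship; the areas 0.90, 0.10, and 0.50 are arbitrary values chosen for illustration and are not the ones in Example 2.

```r
p <- 0.90                  # arbitrary left-tail area, chosen for illustration
z <- qnorm(p)              # z-value with area p to its left
round_trip <- pnorm(z)     # converting back recovers p

# An upper-tail area can be passed directly with lower.tail = FALSE,
# which is equivalent to qnorm(1 - p_upper).
p_upper   <- 0.10
z_upper_a <- qnorm(1 - p_upper)
z_upper_b <- qnorm(p_upper, lower.tail = FALSE)

# For a central area, the leftover area is split evenly between the two
# tails, so the two cutoffs are symmetric about zero.
central_area <- 0.50
z_central <- qnorm(c(0.5 - central_area/2, 0.5 + central_area/2))

print(c(z, round_trip, z_upper_a, z_upper_b, z_central))
```

Either form of the upper-tail lookup works; the lower.tail = FALSE version simply avoids doing the subtraction by hand.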

We will use a function named qnorm, which returns the z-value based on accumulating areas from the left-end (i.e., in the lower tail). Thus, for each situation we must figure out the right input to give qnorm so that it gives us what we want. ︡e657b546-cd6b-4b30-9902-edeaabed91b7︡{"hide":"input"}︡{"html":"Next, let us look at how to do reverse lookups.\n

\n\n\nExample 2: Find the z-score corresponding to \nthe following areas in the \nstandard normal distribution:

\n    (a) Area in the upper tail is 0.0845.

\n    (b) Area in the lower tail is 0.404.

\n    (c) The z-value that bounds the central 48% of the area.

\n
\n\n

\nWe will use a function named qnorm, \nwhich returns the z-value based on accumulating areas from \nthe left-end (i.e., in the lower tail). Thus, for each \nsituation we must figure out the right input to give \nqnorm so that it \ngives us what we want."}︡{"done":true}︡
︠19eb765e-ce73-4592-9536-1b9e05e4e768s︠
%r
# Example 2: (a) upper tail contains 0.0845 of the area
qnorm(1-0.0845)   # since it is in the upper tail, must do 1-0.0845

# Example 2: (b) lower tail contains 0.404 of the area
qnorm(0.404)

# Example 2: (c) Want z corresponding to the central 48% area
qnorm(0.48 + (1-0.48)/2)   # central 48% = (48 + 52/2)% area to the left
︡94e3ca93-e6f9-479c-ad37-9cdea0b82ab5︡{"html":"1.37542410526545"}︡{"html":"-0.243006967409982"}︡{"html":"0.643345405392917"}︡{"done":true}︡
︠ddf04f26-0b73-4486-a7bc-5a20fc7b6663si︠
%hide
%html
Confidence intervals for one sample proportion
Suppose we have data on a categorical variable that we can treat as having only two outcomes (e.g., yes or no; success or failure). Let n = sample size and x = number of successes in the sample. Then the function named prop.test can be used to compute a confidence interval for the proportion of successes in the population. The simplest usage of the function has the form:   prop.test(x, n, conf.level=k)
where $k$ is the level of confidence we want, written as a decimal (e.g., 0.9 for 90%).
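To make the pieces of the result concrete, here is a small sketch that pulls the estimate and the interval out of the object returned by prop.test and sets them beside the textbook large-sample (Wald) interval. The counts x = 120 and n = 200 are hypothetical and do not come from the text. The Wald formula is included only for comparison; prop.test itself uses a score-based interval with a continuity correction by default, so the two intervals should be close but not identical.

```r
# Hypothetical data: 120 successes in a sample of 200 (not from the text).
x <- 120
n <- 200

res <- prop.test(x, n, conf.level = 0.95)

p_hat <- unname(res$estimate)   # sample proportion x/n
ci    <- res$conf.int           # lower and upper ends of the interval

# Textbook large-sample (Wald) interval, for comparison only.
z_star <- qnorm(1 - (1 - 0.95)/2)
wald   <- p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)

print(ci)
print(wald)
```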

Example 3: Pew Research polled a random sample of 900 U.S. teens about their Internet use. Sixty percent of those teens admitted they had misrepresented their age online to access websites and online services. Compute and interpret a 90% confidence interval for the proportion of U.S. teens who have misrepresented their age online. ︡527f1247-a578-43e0-a1fb-aefa100f9c71︡{"hide":"input"}︡{"html":"
Confidence intervals for one sample proportion
\n\nSuppose we have data on a categorical variable that we can \ntreat as having only two outcomes (e.g., yes or no; success or \nfailure). Let n = sample size and x = number of \nsuccesses in the sample. Then the function named \nprop.test can be \nused to compute a confidence interval for the proportion \nof successes in the population. The simplest usage of the function \nhas the form:   \nprop.test(x, n, conf.level=k)
\nwhere $k$ is the level of confidence we want, written as a decimal (e.g., 0.9 for 90%).\n

\n \n\nExample 3: Pew Research polled a random sample of 900 \nU.S. teens about their Internet use. Sixty percent of \nthose teens admitted they had misrepresented their age \nonline to access websites and online services. Compute \nand interpret a 90% confidence interval for the proportion of U.S. teens \nwho have misrepresented their age online.\n"}︡{"done":true}︡
︠3f6c14c9-37dc-473d-86db-956c8df7dd89s︠
%r
# Example 3: Here we have n=900, x=0.6*900
# The shortest (but less clear) way to do this is:
prop.test(0.6*900, 900, conf.level=0.9)
︡c6b3d870-ed07-4959-8462-3b26e3ca3517︡{"stdout":"\n\t1-sample proportions test with continuity correction\n\ndata: 0.6 * 900 out of 900, null probability 0.5\nX-squared = 35.601, df = 1, p-value = 2.421e-09\nalternative hypothesis: true p is not equal to 0.5\n90 percent confidence interval:\n 0.5723185 0.6270697\nsample estimates:\n p \n0.6 \n"}︡{"done":true}︡
︠7c6a31a2-3410-4177-ab1b-11685bd4f43cs︠
%r
# For more clarity, use the following form:
n = 900       # sample size
x = 0.6*n     # number of successes
myout = prop.test(x, n, conf.level=0.9)
cat("The confidence interval = [", myout$conf.int, "]")   # print left/right ends of CI
#diff(myout$conf.int)   # uncomment this to get width of CI
︡929538f9-541f-4e86-bef2-d2a15b0ce227︡{"stdout":"The confidence interval = [ 0.5723185 0.6270697 ]"}︡{"done":true}︡
︠d403cf3a-05f2-4cdb-b40f-7c9fcf9b39bdsi︠
%hide
%html
To see a more general description of prop.test use R's built-in help utility by typing

?prop.test

and "run" it. ︡db3d9f29-2d15-42ca-84e7-5c482d09788a︡{"hide":"input"}︡{"html":"\nTo see a more general description of prop.test \nuse R's built-in help utility by typing

\n?prop.test

\nand \"run\" it.\n\n"}︡{"done":true}︡ ︠890ef581-ccd1-4022-8699-7d3c558b3710si︠ %hide %html
Hypothesis tests with one sample proportion
The same prop.test function used for confidence intervals can also be used for hypothesis testing, as shown in the following example.
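As a rough illustration of what prop.test computes in a one-sided test, the sketch below works out the one-proportion z statistic and its upper-tail P-value by hand and compares them with prop.test. The counts and the null value are hypothetical, chosen only to keep the arithmetic simple. The comparison uses correct=FALSE to switch off the continuity correction, which prop.test applies by default; with the correction left on, its P-value would differ slightly from the hand calculation.

```r
# Hypothetical data and null value, chosen only for illustration.
x  <- 120    # number of successes
n  <- 200    # sample size
p0 <- 0.5    # null hypothesis value of the proportion

p_hat <- x / n
z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # one-proportion z statistic
p_by_hand <- pnorm(z, lower.tail = FALSE)          # upper-tail P-value

# Without the continuity correction, the X-squared statistic reported by
# prop.test equals z^2 and the one-sided P-value matches the hand result.
res <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)

cat("z =", z, " z^2 =", z^2, " X-squared =", unname(res$statistic), "\n")
cat("P-value by hand =", p_by_hand, " prop.test P-value =", res$p.value, "\n")
```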

Example 4: In Example 3 above we saw that Pew Research found 60% of a random sample of 900 teens admitted they had misrepresented their age online to access websites and online services. Extending that scenario, suppose we want to test the hypothesis that more than 55% of all teens misrepresent their age online. Carry out the test and find the P-value. ︡485e1db5-2cb9-4350-b45b-bb8dc7468ab3︡{"hide":"input"}︡{"html":"
Hypothesis tests with one sample proportion
\n\nThe same prop.test \nfunction used for confidence intervals can also be used for \nhypothesis testing, as shown in the following example.\n\n

\n\nExample 4: In Example 3\nabove we saw that Pew Research found 60% of a random \nsample of 900 teens admitted they had misrepresented \ntheir age online to access websites and online services. \nExtending that scenario, suppose we want to test the \nhypothesis that more than 55% of all teens misrepresent \ntheir age online. Carry out the test and find the P-value.\n"}︡{"done":true}︡
︠f8464944-14f9-45cd-b887-335ce9efd1b1s︠
%r
n = 900     # sample size
x = 0.6*n   # number of successes
p0 = 0.55   # null hypothesis value of the proportion
prop.test(x, n, p0, alternative="greater")
#
# OR, we can do it this way
#
myout = prop.test(x, n, p0, alternative="greater")
cat("The P-value=", myout$p.value)
︡3c1059a5-ac78-4edd-9f5c-0f5da0194c83︡{"stdout":"\n\t1-sample proportions test with continuity correction\n\ndata: x out of n, null probability p0\nX-squared = 8.89, df = 1, p-value = 0.001434\nalternative hypothesis: true p is greater than 0.55\n95 percent confidence interval:\n 0.5723185 1.0000000\nsample estimates:\n p \n0.6 \n"}︡{"stdout":"The P-value= 0.001433675"}︡{"done":true}︡
︠c83cb4b0-fcd7-42b5-b862-aab425f3bdb2si︠
%hide
%html
Student t distribution lookups
It is fairly straightforward to look up values of a Student's t distribution with any specified degrees of freedom.

Example 5: Compute each of the following values for the indicated t distribution:

    (a) \(P\;(t < 1.967)\) with 5 df.

    (b) \(P\;(t > 1.967)\) with 5 df.

    (c) The $t$-value where 97.5% area is to the left, with 5 df.

        (This is the same as the $t_5^*$ value for a 95% confidence interval.)

    (d) The $t$-value where 97.5% area is to the left, with 23 df.
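The remark under part (c) links a 97.5% left-tail lookup to the critical value for a 95% confidence interval. A minimal sketch of that link follows; the helper name t_star is hypothetical and introduced only for this illustration. It relies on R's pt and qt, the t-distribution counterparts of pnorm and qnorm.

```r
# Hypothetical helper: critical t-value for a two-sided confidence interval.
# A confidence level of 0.95 leaves 0.025 in each tail, so the lookup
# uses the 0.975 quantile.
t_star <- function(conf_level, df) {
  qt(1 - (1 - conf_level)/2, df = df)
}

t_star(0.95, df = 5)    # same lookup as qt(0.975, df=5)
t_star(0.95, df = 23)   # same lookup as qt(0.975, df=23)

# pt and qt are inverses of each other for a fixed df.
pt(qt(0.975, df = 5), df = 5)

# As df grows, the critical value approaches the normal one, qnorm(0.975).
sapply(c(5, 23, 100, 1000), function(d) t_star(0.95, d))
qnorm(0.975)
```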
︡6016ef42-30ee-4bbf-894e-5f8e7b4a5207︡{"hide":"input"}︡{"html":"
Student t distribution lookups
\n\nIt is fairly straightforward to look up values of a \nStudent's t distribution with any specified degrees of \nfreedom.\n\n

\n\nExample 5: Compute each of the following values \nfor the indicated t distribution:

\n    (a) \\(P\\;(t < 1.967)\\) with 5 df.

\n    (b) \\(P\\;(t > 1.967)\\) with 5 df.

\n    (c) The $t$-value where 97.5% area is to the left, with 5 df.

\n        (This is the same as the $t_5^*$ \nvalue for a 95% confidence interval.)

\n    (d) The $t$-value where 97.5% area is to the left, with 23 df.\n
"}︡{"done":true}︡
︠b26260c3-9e31-4f45-894f-5492566b865bs︠
%r
pt(1.967, df=5)                     # Find area under t-curve with 5 degrees of freedom for t < 1.967
pt(1.967, df=5, lower.tail=FALSE)   # Find area under same t-curve for t > 1.967
qt(0.975, df=5)                     # Inverse lookup: find t-value at 97.5 percentile point with 5 df
qt(0.975, df=23)                    # Inverse lookup: find t-value at 97.5 percentile point with 23 df
︡d359eb9d-124e-4c77-a56c-2751731b1025︡{"html":"0.946834355069976"}︡{"html":"0.0531656449300238"}︡{"html":"2.57058183563631"}︡{"html":"2.06865761041905"}︡{"done":true}︡
︠76735784-b33e-4001-8f61-5594db1b53b3si︠
%hide
%html
Inferences with one sample mean
It is easiest to do confidence intervals and hypothesis tests with sample mean values if you first create a dataframe or variable containing your raw data. Once you have this, the function named t.test can be used for computing confidence intervals and hypothesis tests.
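To show what t.test does with raw data, the sketch below computes a 90% confidence interval and a one-sided P-value by hand and checks them against t.test. The data vector and the null value of 38 are hypothetical, made up for this illustration; they are not read from any file used in the labs.

```r
# Hypothetical raw data, stored in a plain numeric vector.
x <- c(41.2, 38.5, 36.9, 40.1, 43.7, 35.4, 39.8, 42.0, 37.3, 40.6)

n     <- length(x)
x_bar <- mean(x)
se    <- sd(x) / sqrt(n)          # standard error of the mean

# 90% confidence interval by hand: x_bar +/- t* times se, with n-1 df
t_crit     <- qt(0.95, df = n - 1)
ci_by_hand <- x_bar + c(-1, 1) * t_crit * se

# One-sided test of H0: mu <= 38 against Ha: mu > 38, by hand
t_stat    <- (x_bar - 38) / se
p_by_hand <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# The same quantities from t.test
ci_res   <- t.test(x, conf.level = 0.9)
test_res <- t.test(x, mu = 38, alternative = "greater")

print(rbind(by_hand = ci_by_hand, from_t_test = as.numeric(ci_res$conf.int)))
cat("P-value by hand =", p_by_hand, " t.test P-value =", test_res$p.value, "\n")
```

Note that a 90% two-sided interval uses the 0.95 quantile, because the remaining 10% is split between the two tails.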

Example 6a: In one of our previous labs we used the file named "winter.csv", which contains the following 4 variables: (1) name of a U.S. city; (2) the mean January temperature in that city (in degrees F); (3) the latitude of the city; and (4) the January temperature in degrees C.

Suppose we treat those data as a random sample of cities drawn from a population that consists of all cities in the U.S. Compute a 90% confidence interval for the true mean latitude of U.S. cities.
︡ab07c636-4b53-46ac-9d01-95e23bdfac2a︡{"hide":"input"}︡{"html":"
Inferences with one sample mean
\n\nIt is easiest to do confidence intervals and hypothesis \ntests with sample mean values if you first create a dataframe \nor variable containing your raw data. Once you have this, \nthe function named \nt.test can be \nused for computing confidence intervals and hypothesis tests. \n

\n \n\nExample 6a: In one of our previous labs we used the file \nnamed \"winter.csv\", which contains the following 4 variables: \n(1) name of a U.S. city; (2) the mean January temperature in \nthat city (in degrees F); (3) the latitude of the city; and \n(4) the January temperature in degrees C.

\n\nSuppose we treat those data as a random sample of cities \ndrawn from a population that consists of all cities in the \nU.S. Compute a 90% confidence interval for the true mean \nlatitude of U.S. cities.\n
"}︡{"done":true}︡
︠50542104-c8b3-4f0c-a04a-bd16cf587a21si︠
%r
# Read the csv file
winterdat = read.csv(file="./winter.csv", header=TRUE, sep=",")

# You can see names of the variables in that file
# by uncommenting the next line
#winterdat

# Next, define the variable you want to do inference with
x = winterdat$Latitude
t.test(x, conf.level=0.9)   # Compute a 90% CI using "x" as input data

# If you want to see a cleaner output, use:
myout = t.test(x, conf.level=0.9)
cat("The confidence interval = [", myout$conf.int, "]")   # print left/right ends of CI
︡cb36660a-ec6f-48ea-8e3f-da40de17e1d5︡{"stdout":"\n\tOne Sample t-test\n\ndata: x\nt = 74.436, df = 58, p-value < 2.2e-16\nalternative hypothesis: true mean is not equal to 0\n90 percent confidence interval:\n 38.35037 40.11235\nsample estimates:\nmean of x \n 39.23136 \n"}︡{"stdout":"The confidence interval = [ 38.35037 40.11235 ]"}︡{"done":true}︡
︠8a707f0b-ca8a-4f3a-8637-883865439cd6si︠
%hide
%html
Example 6b: Continuing with data from the previous example, let us test the hypothesis that the true mean latitude of U.S. cities is no more than 38 degrees. Find the P-value and draw an inference.
︡423d6624-a035-4194-8d86-d322f4b43a58︡{"hide":"input"}︡{"html":" \n\nExample 6b: Continuing with data from the previous \nexample, let us test the hypothesis that the true mean latitude \nof U.S. cities is no more than 38 degrees. Find the P-value \nand draw an inference.\n"}︡{"done":true}︡
︠563d6177-96fd-4bf7-8ee5-db6340a6f1d0s︠
%r
x = winterdat$Latitude
myout = t.test(x, mu=38, alternative="greater")
cat("The P-value=", myout$p.value)
︡7a2ef5a2-756a-4087-afff-3e1360650d35︡{"stdout":"The P-value= 0.01147574"}︡{"done":true}︡
︠ba76e952-b072-4049-a00f-9157b5780989si︠
%hide
%html
Example 6b continued: The P-value is about 1.1%. If we assume a significance level of 5% (which for a 1-tailed hypothesis test is consistent with a 90% confidence level), we would reject the null hypothesis and conclude that the true mean latitude of U.S. cities is greater than 38 degrees.
︡64db967d-7bb1-4ddb-b77c-2dbf3914bf87︡{"hide":"input"}︡{"html":" \n\nExample 6b continued: The P-value is about 1.1%. \nIf we assume a significance level of 5% (which for a 1-tailed \nhypothesis test is consistent with a 90% confidence level), we would \nreject the null hypothesis and conclude that the true mean \nlatitude of U.S. cities is greater than 38 degrees.\n"}︡{"done":true}︡
︠80af29be-7678-44b7-b962-6920816c3980si︠
%hide
%html
To see a more general description of t.test use R's built-in help utility by typing

?t.test

and "run" it. ︡ba8e8469-529e-4070-9deb-f18d64018646︡{"hide":"input"}︡{"html":"\nTo see a more general description of t.test \nuse R's built-in help utility by typing

\n?t.test

\nand \"run\" it.\n"}︡{"done":true}︡