12 Inference II
A confidence interval (CI) provides a range of plausible values for a population parameter, along with a confidence level indicating how often intervals constructed this way would contain the true parameter under repeated sampling.
For example, a 95% CI means that if we repeated the sampling process 100 times, about 95 of the intervals would contain the true population parameter.
- Confidence level (e.g., 95%) = 1 - significance level \((\alpha = 0.05)\).
- The remaining \((\alpha)\) represents the risk of error (Type I error in hypothesis testing context).
CIs account for sampling variability and are wider for higher confidence levels or smaller samples.
12.1 Constructing Confidence Intervals for Means
For a population mean, assuming normality (via CLT), the CI is:
\[\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]
or, if \((\sigma)\) is unknown (common), use the t-distribution:
\[\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\]
- \((\bar{x})\): Sample mean
- \((z_{\alpha/2})\) or \((t_{\alpha/2})\): Critical value from standard normal or t-distribution (e.g., 1.96 for 95% CI with large n)
- \((s)\): Sample standard deviation
- \((n)\): Sample size
The term after \((\pm)\) is the margin of error.
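The two formulas above can be sketched as a single R function. This is only an illustration: the name mean_ci, its arguments, and the input values are assumptions made here, not part of the chapter. The t-based branch uses \(n - 1\) degrees of freedom.

```r
# Hypothetical helper: confidence interval for a mean from summary statistics.
# sigma_known = TRUE uses the z critical value; otherwise the t critical
# value with n - 1 degrees of freedom is used.
mean_ci <- function(xbar, sd, n, conf_level = 0.95, sigma_known = FALSE) {
  alpha <- 1 - conf_level
  crit <- if (sigma_known) {
    qnorm(1 - alpha / 2)
  } else {
    qt(1 - alpha / 2, df = n - 1)
  }
  margin <- crit * sd / sqrt(n)  # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Illustrative (made-up) inputs: mean 100, standard deviation 15, n = 36.
ci_z <- mean_ci(100, 15, 36, conf_level = 0.95, sigma_known = TRUE)
ci_t <- mean_ci(100, 15, 36, conf_level = 0.95)
ci_z
ci_t
```

With the same inputs, the t-based interval comes out slightly wider than the z-based one, because the t critical value is larger for any finite sample size.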
12.1.1 Example 1: Life Expectancy
Suppose a sample of size \(n = 50\) has a mean life expectancy of \(\bar{x} = 78.1\) years, and the population standard deviation is \(\sigma = 4.5\). For a 90% CI, \(z_{0.05} \approx 1.645\):
\[\frac{\sigma}{\sqrt{n}} = \frac{4.5}{\sqrt{50}} \approx 0.636\]
\[78.1 \pm 1.645 \times 0.636 \approx 78.1 \pm 1.05\]
Lower limit (LL): 77.05
Upper limit (UL): 79.15
We are 90% confident that the true population mean life expectancy is between 77.05 and 79.15 years.
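As a quick check, the life expectancy example can be reproduced in R. The variable names below are chosen here for illustration; qnorm(0.95) supplies the critical value that leaves 5% in the upper tail.

```r
xbar  <- 78.1  # sample mean
sigma <- 4.5   # population standard deviation (treated as known)
n     <- 50    # sample size

z  <- qnorm(0.95)      # critical value for a 90% interval, about 1.645
se <- sigma / sqrt(n)  # standard error of the mean, about 0.636

ci_life <- c(lower = xbar - z * se, upper = xbar + z * se)
round(ci_life, 2)
```

Rounded to two decimals, the limits agree with the hand calculation (77.05 and 79.15).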
12.2 Constructing Confidence Intervals for Proportions
For a population proportion \((p)\), the CI is:
\[\bar{p} \pm z_{\alpha/2} \cdot \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\]
- \(\bar{p}\): Sample proportion (successes / n)
- Assumes normality via CLT for proportions.
12.2.1 Example 2: Success Rate
A random sample of 100 yields 40 successes, so \(\bar{p} = 0.4\). For a 90% CI, \(z_{0.05} \approx 1.645\):
\[SE = \sqrt{\frac{0.4 \times 0.6}{100}} = \sqrt{0.0024} \approx 0.049\]
\[0.4 \pm 1.645 \times 0.049 \approx 0.4 \pm 0.081\]
LL: 0.319
UL: 0.481
We are 90% confident that the true population proportion is between 31.9% and 48.1%.
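The proportion formula can be wrapped in a small function and applied to this example. The function name prop_ci and its arguments are hypothetical, introduced only for this sketch. It implements the normal-approximation interval shown above and nothing more refined.

```r
# Hypothetical helper: normal-approximation CI for a proportion.
prop_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n                  # sample proportion
  z  <- qnorm(1 - (1 - conf_level) / 2)   # critical value z_{alpha/2}
  se <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of the proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci_prop <- prop_ci(40, 100, conf_level = 0.90)
round(ci_prop, 3)
```

The rounded limits match the hand calculation (0.319 and 0.481).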
12.3 Summary
Statistical inference bridges the gap between samples and populations through sampling distributions and the CLT. Confidence intervals offer a practical way to express uncertainty in estimates, essential for business decisions like market analysis or quality control. Remember, higher confidence levels widen intervals, and larger samples narrow them, improving precision.
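A short sketch can make the last point concrete. It computes the width of a z-based interval, \(2 \cdot z_{\alpha/2} \cdot \sigma / \sqrt{n}\), for a few confidence levels and sample sizes. The function name ci_width and the value \(\sigma = 10\) are arbitrary choices for illustration.

```r
# Width of a z-based interval for given confidence level and sample size.
# sigma = 10 is an arbitrary illustrative value.
ci_width <- function(conf_level, n, sigma = 10) {
  2 * qnorm(1 - (1 - conf_level) / 2) * sigma / sqrt(n)
}

# Same sample size, rising confidence level: the interval widens.
widths_by_level <- sapply(c(0.90, 0.95, 0.99), ci_width, n = 50)

# Same confidence level, rising sample size: the interval narrows.
widths_by_n <- sapply(c(25, 100, 400), function(n) ci_width(0.95, n))

round(widths_by_level, 2)
round(widths_by_n, 2)
```

Because the width is proportional to \(1/\sqrt{n}\), quadrupling the sample size halves the interval width.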
12.4 Useful R Functions
The qnorm() and qt() functions calculate quantiles for the normal and \(t\) distributions, respectively.
The if() function creates a conditional statement in R.
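A minimal illustration of these functions follows. The choice of 23 degrees of freedom and the variable names are arbitrary.

```r
z_crit <- qnorm(0.975)        # 95% critical value from the standard normal
t_crit <- qt(0.975, df = 23)  # same tail area, t-distribution with 23 df

# if() runs the first branch only when its condition is TRUE.
if (t_crit > z_crit) {
  message("t critical value exceeds z: ",
          round(t_crit, 3), " > ", round(z_crit, 3))
} else {
  message("z critical value is at least as large as t")
}
```

Note that both functions take a cumulative probability, so the upper critical value for a 95% interval is requested with 0.975 rather than 0.95.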
12.5 Exercises
The following exercises will help you test your knowledge on Statistical Inference. In particular, the exercises work on:
- Simulating confidence intervals.
- Estimating confidence intervals in R.
- Estimating confidence intervals for proportions.
Try not to peek at the answers until you have formulated your own answer and double-checked your work for any mistakes.
Exercise 1
In this exercise you will be simulating confidence intervals.
Set the seed to \(9\). Create a random sample of 1000 data points and store it in an object called Population. Use the exponential distribution with rate of \(0.02\) to generate the data. Calculate the mean and standard deviation of Population and call them PopMean and PopSD respectively. What are the mean and standard deviation of Population?
Answer
The mean of Population is 48.61. The standard deviation is 47.94.
Start by generating values from the exponential distribution. You can use the rexp() function in R to do this. Setting the seed to 9 yields:
set.seed(9)
Population <- rexp(1000, 0.02)
The population mean is:
(PopMean <- mean(Population))
[1] 48.61053
The standard deviation is:
(PopSD <- sd(Population))
[1] 47.94411
Create a for loop (with 10,000 iterations) that takes a sample of 50 points from Population, calculates the mean, and then stores the result in a vector called SampleMeans. What are the mean and standard deviation of SampleMeans?
Answer
The mean of SampleMeans is 48.70, very close to the population mean of 48.61. The standard deviation is 6.83.
In R you can use a for loop to create the vector of sample means.
nrep <- 10000
SampleMeans <- c()
for (i in 1:nrep) {
  # draw 50 points with replacement and store the sample mean
  x <- sample(Population, 50, replace = TRUE)
  SampleMeans <- c(SampleMeans, mean(x))
}
The mean of SampleMeans is:
(xbar <- mean(SampleMeans))
[1] 48.7005
The standard deviation is:
(Standard <- sd(SampleMeans))
[1] 6.827595
Create a \(90\)% confidence interval using the first data point in the SampleMeans vector. Does the confidence interval include PopMean?
Answer
The confidence interval is [47.71, 70.17]. Since the population mean is equal to 48.61, the confidence interval does include the population mean.
Let’s construct the upper and lower limits of the interval in R.
(ll <- SampleMeans[1] + qnorm(0.05) * Standard)
[1] 47.71385
(ul <- SampleMeans[1] - qnorm(0.05) * Standard)
[1] 70.17464
Now take the minimum of the SampleMeans vector. Create a new \(90\)% confidence interval. Does the interval include PopMean? Out of the \(10,000\) intervals that you could construct with the vector SampleMeans, how many would you expect to include PopMean?
Answer
The confidence interval is [14.86, 37.32]. This interval does not include the population mean of 48.61. Out of the 10,000 confidence intervals, one would expect about 9,000 to include the population mean.
Let’s find the confidence interval limits using R.
(Minll <- min(SampleMeans) + qnorm(0.05) * Standard)
[1] 14.85631
(Minul <- min(SampleMeans) - qnorm(0.05) * Standard)
[1] 37.31709
We can confirm in R that about 9,000 of the intervals include PopMean. Once more, let’s use a for loop to construct confidence intervals for each element in SampleMeans and check whether the PopMean is included. The count variable keeps track of how many intervals include the population mean.
count <- 0
for (i in SampleMeans) {
  ll <- i + qnorm(0.05) * Standard
  ul <- i - qnorm(0.05) * Standard
  # count the interval if it contains the population mean
  if (PopMean <= ul && PopMean >= ll) {
    count <- count + 1
  }
}
count
[1] 8978
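The same coverage check can also be written without an explicit loop. The sketch below is one possible vectorised version, not the chapter's own solution. It regenerates the data so that it runs on its own, uses replicate() in place of the for loop, and introduces the names margin, covered and coverage for illustration.

```r
set.seed(9)
Population <- rexp(1000, 0.02)
PopMean <- mean(Population)

# 10,000 sample means, each from 50 points drawn with replacement
SampleMeans <- replicate(10000, mean(sample(Population, 50, replace = TRUE)))
Standard <- sd(SampleMeans)

# An interval centred on a sample mean contains PopMean exactly when
# the sample mean lies within one margin of error of PopMean.
margin <- qnorm(0.95) * Standard
covered <- abs(SampleMeans - PopMean) <= margin

coverage <- mean(covered)  # share of intervals that contain PopMean
coverage
```

The resulting share should come out close to 0.90, in line with the 90% confidence level.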
Exercise 2
A random sample of \(24\) observations is used to estimate the population mean. The sample mean is \(104.6\) and the standard deviation is \(28.8\). The population is normally distributed. Construct a \(90\)% and \(95\)% confidence interval for the population mean. How does the confidence level affect the size of the interval?
Answer
The 90% confidence interval is [94.52, 114.68] and the 95% confidence interval is [92.44, 116.76]. The larger the confidence level, the wider the interval.
Let’s construct the intervals using R. Since the population standard deviation is unknown we will use the t-distribution with \(n - 1 = 23\) degrees of freedom. The interval is constructed as \(\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\).
(ul90 <- 104.6 - qt(0.05, 23) * 28.8 / sqrt(24))
[1] 114.6755
(ll90 <- 104.6 + qt(0.05, 23) * 28.8 / sqrt(24))
[1] 94.52453
For the 95% confidence interval we adjust the significance level accordingly.
(ul95 <- 104.6 - qt(0.025, 23) * 28.8 / sqrt(24))
[1] 116.7612
(ll95 <- 104.6 + qt(0.025, 23) * 28.8 / sqrt(24))
[1] 92.43883
A random sample from a normally distributed population yields a mean of \(48.68\) and a standard deviation of \(33.64\). Compute a \(95\)% confidence interval assuming a) that the sample size is \(16\) and b) the sample size is \(25\). What happens to the confidence interval as the sample size increases?
Answer
The confidence interval for a sample size of 16 is [30.75, 66.61]. The confidence interval when the sample size is 25 is [34.79, 62.57]. As the sample size gets larger, the confidence interval gets narrower and more precise.
Let’s use R again to calculate the confidence interval. For a sample size of 16 the interval is:
(ul16 <- 48.68 - qt(0.025, 15) * 33.64 / sqrt(16))
[1] 66.60549
(ll16 <- 48.68 + qt(0.025, 15) * 33.64 / sqrt(16))
[1] 30.75451
Increasing the sample size to 25 yields:
(ul25 <- 48.68 - qt(0.025, 24) * 33.64 / sqrt(25))
[1] 62.56591
(ll25 <- 48.68 + qt(0.025, 24) * 33.64 / sqrt(25))
[1] 34.79409
Exercise 3
You will need the sleep data set for this problem. The data set is built into R and records the effect of two sleep-inducing drugs on patients, measured as the increase in hours of sleep. Calculate a \(95\)% confidence interval for the mean increase in group 1 and in group 2. Which drug would you expect to be more effective at increasing sleeping times?
Answer
The 95% confidence interval for group 1 is [-0.53, 2.03]. Let’s first calculate the standard error for group 1.
(se1 <- sd(sleep$extra[sleep$group == 1]) / sqrt(length(sleep$extra[sleep$group == 1])))
[1] 0.5657345
We can now use the standard error to estimate the lower and upper limits of the confidence interval.
(ll1 <- mean(sleep$extra[sleep$group == 1]) + qt(0.025, 9) * se1)
[1] -0.5297804
(ul1 <- mean(sleep$extra[sleep$group == 1]) - qt(0.025, 9) * se1)
[1] 2.02978
The 95% confidence interval for group 2 is [0.90, 3.76]. Let’s repeat the procedure for group 2. Start by finding the standard error.
(se2 <- sd(sleep$extra[sleep$group == 2]) / sqrt(length(sleep$extra[sleep$group == 2])))
[1] 0.6331666
Using the standard error we can complete the confidence interval.
(ll2 <- mean(sleep$extra[sleep$group == 2]) + qt(0.025, 9) * se2)
[1] 0.8976775
(ul2 <- mean(sleep$extra[sleep$group == 2]) - qt(0.025, 9) * se2)
[1] 3.762322
Drug 2 appears more effective. Its interval lies entirely to the right of zero, so it is unlikely that drug 2 has no effect on sleeping time, whereas the interval for drug 1 includes zero. Additionally, the mean increase in sleeping hours is 2.33 for drug 2 vs. 0.75 for drug 1.
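As a cross-check on the manual calculation, base R's t.test() reports the same t-based interval for a single sample in its conf.int component. The object names below are chosen for illustration.

```r
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# One-sample t-based 95% confidence intervals for each group mean
ci_g1 <- t.test(g1, conf.level = 0.95)$conf.int
ci_g2 <- t.test(g2, conf.level = 0.95)$conf.int

round(ci_g1, 2)
round(ci_g2, 2)
```

The limits agree with the values computed step by step with qt() and the standard error.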
Exercise 4
A random sample of \(100\) observations results in \(40\) successes. Construct a \(90\)% and \(95\)% confidence interval for the population proportion. Can we conclude at either confidence level that the population proportion differs from \(0.5\)?
Answer
The 90% and 95% confidence intervals are [0.319, 0.481], and [0.304, 0.496] respectively. Since they do not include 0.5, we can conclude that the population proportion is significantly different from 0.5.
We can create objects that store the sample proportion and the sample size in R:
(p <- 0.4)
[1] 0.4
(n <- 100)
[1] 100
The 90% confidence interval is given by:
(Ex1ll90 <- p + qnorm(0.05) * sqrt(p * (1 - p) / n))
[1] 0.319419
(Ex1ul90 <- p - qnorm(0.05) * sqrt(p * (1 - p) / n))
[1] 0.480581
The 95% confidence interval is:
(Ex1ll95 <- p + qnorm(0.025) * sqrt(p * (1 - p) / n))
[1] 0.3039818
(Ex1ul95 <- p - qnorm(0.025) * sqrt(p * (1 - p) / n))
[1] 0.4960182
You will need the HairEyeColor data set for this problem. The data is built into R, and displays the distribution of hair and eye color for \(592\) statistics students. Construct a \(95\)% confidence interval for the proportion of students with hazel eyes.
Answer
The 95% confidence interval is [0.128, 0.186].
The data can easily be viewed by calling HairEyeColor in R.
HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

Note that there are three dimensions to this table (Hair, Eye, Sex). We can calculate the proportion of Hazel eye colored students with the following command that makes use of indexing:
(p <- (sum(HairEyeColor[, 3, 1]) + sum(HairEyeColor[, 3, 2])) / sum(HairEyeColor))
[1] 0.1570946
Now we can use this proportion to construct the intervals. Recall that for proportions the interval is calculated by \(\bar{p} \pm z_{\alpha/2} \cdot \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\). The 95% confidence interval is given by:
(Ex2ll95 <- p + qnorm(0.025) * sqrt(p * (1 - p) / 592))
[1] 0.1277818
(Ex2ul95 <- p - qnorm(0.025) * sqrt(p * (1 - p) / 592))
[1] 0.1864074
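An alternative sketch indexes the table by dimension name rather than by position, which avoids having to remember that Hazel is the third eye color. The object names below are illustrative only.

```r
# Sum the Hazel column over all hair colors and both sexes, by name.
hazel      <- sum(HairEyeColor[, "Hazel", ])
n_students <- sum(HairEyeColor)

p_hazel  <- hazel / n_students
se_hazel <- sqrt(p_hazel * (1 - p_hazel) / n_students)

# Lower and upper 95% limits in one step
ci_hazel <- p_hazel + c(-1, 1) * qnorm(0.975) * se_hazel
round(ci_hazel, 3)
```

This yields the same interval as the position-based indexing, [0.128, 0.186].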