In this chapter, you will begin mastering the concepts of correlation significance, difference of means tests, and regression inference, empowering you with the tools to uncover meaningful relationships and differences in data while assessing their statistical significance. These skills will enable you to make informed business decisions, such as evaluating the strength of market trends, comparing performance across groups, or determining the impact of variables in predictive models, ensuring your conclusions are robust and actionable.
To determine the statistical significance of the correlation coefficient, we test one of the following pairs of hypotheses:
- \(H_o: \rho \geq 0\); \(H_a: \rho < 0\) (left tail)
- \(H_o: \rho \leq 0\); \(H_a: \rho > 0\) (right tail)
- \(H_o: \rho = 0\); \(H_a: \rho \neq 0\) (two tails)
The test statistic for the correlation is given by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\), where \(df=n-2\) and \(r_{xy}\) is the sample correlation coefficient.
Run the `cor.test()` function to perform the test on two vectors. Here is a list of arguments to use (a short example follows the list):
- `alternative`: a choice between "two.sided", "less", and "greater".
- `conf.level`: sets the confidence level. Enter it as a decimal, not a percentage.
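For instance, here is a minimal sketch on two made-up vectors (`x` and `y` are illustrative, not from any data set used in this chapter):

```r
# Hypothetical data: two small numeric vectors.
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
y <- c(2.0, 1.8, 3.9, 4.1, 6.2, 5.9, 8.4, 7.6)

# Two-sided test of H_o: rho = 0 at the 95% confidence level.
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)
```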
We now turn to tests for inference about the difference between two population means.
The test for unpaired mean differences (unequal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {\frac {s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\), where \(d_o\) is the hypothesized difference.
The test for unpaired mean differences (equal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {s_p^2 (\frac {1}{n_1} + \frac {1}{n_2})}}\), where \(s_p^2\) is the pooled sample variance.
The test for paired mean differences is given by \(t_{df}= \frac {\bar d- d_o}{\frac {s_d}{\sqrt{n}}}\), where \(\bar d\) and \(s_d\) are the mean and standard deviation of the differences.
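As a quick illustration, the first formula translates directly into R. The summary statistics below are hypothetical, chosen only to show the computation:

```r
# Hypothetical summary statistics for two independent samples.
xbar1 <- 52.3; s1 <- 6.1; n1 <- 40
xbar2 <- 48.9; s2 <- 7.4; n2 <- 35
d0 <- 0   # hypothesized difference under the null

# Unpaired t-stat with unequal variances, per the first formula above.
(tstat <- ((xbar1 - xbar2) - d0) / sqrt(s1^2 / n1 + s2^2 / n2))
```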
Run these tests in R by using the `t.test()` function. Here is a list of arguments to use (a short example follows the list):
- `paired`: use `TRUE` for paired samples, `FALSE` for independent samples. The default is `FALSE`.
- `var.equal`: use `TRUE` for equal variances, `FALSE` for unequal. The default is `FALSE`.
- `mu`: the hypothesized value of the mean or mean difference.
- `alternative`: a choice between "two.sided", "less", and "greater".
- `conf.level`: sets the confidence level. Enter it as a decimal, not a percentage.
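Putting these arguments together, a minimal sketch on simulated data (the means and standard deviations below are made up) might look like this:

```r
set.seed(1)   # for reproducibility
group1 <- rnorm(30, mean = 10, sd = 2)
group2 <- rnorm(30, mean = 11, sd = 2)

# Unpaired, two-sided test assuming equal population variances.
t.test(group1, group2, paired = FALSE, var.equal = TRUE,
       mu = 0, alternative = "two.sided", conf.level = 0.95)
```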
When running a regression, a couple of tests can be performed on the coefficients to determine significance:
The first test's competing hypotheses are \(H_o: \beta_j = 0\); \(H_a: \beta_j \ne 0\). The test statistic for the intercept (slope) coefficient is given by \(t_{df}= \frac {b_j}{se(b_j)}\), where \(df = n-k-1\).
The second test's competing hypotheses are \(H_o: \beta_1=\beta_2=\dots=\beta_k=0\); \(H_a:\) at least one \(\beta_j \neq 0\). The joint test of significance is given by \(F_{df_1,df_2} = \frac {SSR/k}{SSE/(n-k-1)} = \frac {MSR}{MSE}\), with \(df_1 = k\) and \(df_2 = n-k-1\). The ANOVA table below shows more detail on this test.
ANOVA | df | SS | MS | F | Significance |
---|---|---|---|---|---|
Regression | \(k\) | \(SSR\) | \(MSR=\frac{SSR}{k}\) | \(F_{df_1,df_2} = \frac {MSR}{MSE}\) | \(P\left(F \geq \frac{MSR}{MSE}\right)\) |
Residual | \(n-k-1\) | \(SSE\) | \(MSE=\frac {SSE}{n-k-1}\) | | |
Total | \(n-1\) | \(SST\) | | | |
To conduct these tests, save the `lm()` model into an object. The `summary()` function can then be used to retrieve the results of the tests on the model's parameters. Use the `anova()` function to obtain the ANOVA table, as sketched below.
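As a quick sketch with the built-in mtcars data set (the model here is purely illustrative):

```r
# Regress fuel efficiency on weight and horsepower.
fit_mtcars <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit_mtcars)   # t-tests on each coefficient, joint F-test, R-squared
anova(fit_mtcars)     # df, sums of squares, mean squares, and F statistics
```

Note that `anova()` on a fitted `lm` object reports a sequential breakdown by term rather than the single combined regression row shown in the table above; the residual row, however, matches.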
The following exercises will help you test your knowledge on Regression and Inference. In particular, the exercises cover testing the significance of a correlation coefficient, comparing means across groups, and assessing the significance of regression coefficients and predictions.
Try not to peek at the answers until you have formulated your own answer and double-checked your work for any mistakes.
Consider the following competing hypotheses: \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\). A sample of \(25\) observations reveals that the correlation coefficient between two variables is \(0.15\). At the \(5\)% significance level, can we reject the null hypothesis?
Answer
At the 5% significance level, we cannot reject the null since the p-value is 0.47 > 0.05.
Recall that the t-stat is calculated by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\). We can use R as a calculator to compute this value:
```r
rxy <- 0.15
n <- 25
(tstat <- (rxy * sqrt(n - 2)) / (sqrt(1 - rxy^2)))
#> [1] 0.7276069
```
Now, we can estimate the p-value using the `pt()` function:
```r
2 * pt(tstat, n - 2, lower.tail = FALSE)
#> [1] 0.4741966
```
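Equivalently, because the t distribution is symmetric, the same two-sided p-value can be computed from the lower tail:

```r
2 * pt(-abs(tstat), n - 2)   # identical two-sided p-value via symmetry
```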
Install the `ISLR2` package in R. Use the Hitters data set to look at the relationship between Hits and Salary. Specifically, calculate the correlation coefficient and test the competing hypotheses \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\) at the \(1\)% significance level.
Answer
The estimated correlation is 0.44 and the t-value is 7.89. Since the p-value is approximately 0, we reject the null hypothesis \(H_{o}: \rho = 0\).
Once the `ISLR2` package is downloaded, it can be loaded into R using the `library()` function. The `cor.test()` function conducts the appropriate test of significance. Note that `conf.level` only affects the reported confidence interval; the decision at the \(1\)% significance level is based on the p-value.
```r
library(ISLR2)
cor.test(Hitters$Salary, Hitters$Hits, conf.level = 0.95)
#>  Pearson's product-moment correlation
#> 
#> data:  Hitters$Salary and Hitters$Hits
#> t = 7.8863, df = 261, p-value = 8.531e-14
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.3355210 0.5314332
#> sample estimates:
#>       cor 
#> 0.4386747
```
Install the `ISLR2` package in R. Use the Hitters data set to investigate whether the average hits were significantly different between the two leagues (American and National). Use the NewLeague and Hits variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Answer
There is no reason to believe that the population variances are different. Players are recruited from what seems to be a common pool. At a 5% significance level, the difference of the two means is not significantly different from zero. We can’t reject the null hypothesis.
We will use the `t.test()` function in R to test the hypothesis. We note that the test is not paired, two-sided, and assumes equal variances in the population.
```r
t.test(Hitters$Hits[Hitters$NewLeague == "A"],
       Hitters$Hits[Hitters$NewLeague == "N"], paired = FALSE,
       alternative = "two.sided", mu = 0, var.equal = TRUE,
       conf.level = 0.95)
#>  Two Sample t-test
#> 
#> data:  Hitters$Hits[Hitters$NewLeague == "A"] and Hitters$Hits[Hitters$NewLeague == "N"]
#> t = 1.0862, df = 320, p-value = 0.2782
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -4.581286 15.875028
#> sample estimates:
#> mean of x mean of y 
#> 103.58523  97.93836
```
Use the `ISLR2` package for this question. In particular, use the BrainCancer data set to test whether males have a higher average survival time than females. Use the sex and time variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Answer
There might be reason to believe that the population variances are different, since women and men are known to have medical differences. At the 5% significance level, the average survival time of men does not appear to be larger than that of women; we can't reject the null hypothesis \(H_{o}: \mu_1 - \mu_2 \leq 0\).
Once more, use the `t.test()` function in R to test the hypothesis. Note that the test is not paired, right-tailed, and assumes different variances in the population.
```r
t.test(BrainCancer$time[BrainCancer$sex == "Male"],
       BrainCancer$time[BrainCancer$sex == "Female"], paired = FALSE,
       alternative = "greater", mu = 0, var.equal = FALSE,
       conf.level = 0.95)
#>  Welch Two Sample t-test
#> 
#> data:  BrainCancer$time[BrainCancer$sex == "Male"] and BrainCancer$time[BrainCancer$sex == "Female"]
#> t = -0.30524, df = 84.867, p-value = 0.6195
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#>  -8.504999       Inf
#> sample estimates:
#> mean of x mean of y 
#>  26.78302  28.10200
```
Use the sleep data set included in R. At the \(1\)% significance level, is there an effect of the drug on the \(10\) patients? Assume that the group variable denotes before (\(1\)) the drug is administered and after (\(2\)) the drug is administered.
Answer
The drug seems to have an effect, as we can reject the null hypothesis \(H_{o}: \mu_d = 0\). The mean difference appears to be statistically different from zero.
Use the `t.test()` function once more in R. Make sure to note that the test is paired and two-tailed.
```r
t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2], paired = TRUE,
       alternative = "two.sided", mu = 0, conf.level = 0.99)
#>  Paired t-test
#> 
#> data:  sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
#> t = -4.0621, df = 9, p-value = 0.002833
#> alternative hypothesis: true mean difference is not equal to 0
#> 99 percent confidence interval:
#>  -2.8440519 -0.3159481
#> sample estimates:
#> mean difference 
#>           -1.58
```
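As a sanity check, a paired test is equivalent to a one-sample test on the within-patient differences, which connects back to the formula \(t_{df}= \frac{\bar d - d_o}{s_d/\sqrt{n}}\):

```r
# One-sample test on the differences; results match the paired test above.
d <- sleep$extra[sleep$group == 1] - sleep$extra[sleep$group == 2]
t.test(d, mu = 0, alternative = "two.sided", conf.level = 0.99)
```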
Install the `ISLR2` package in R. Use the Hitters data set to investigate the effect of HmRun, RBI, and Years on a player's Salary. Which coefficients are statistically different from zero? Are the variables jointly significant? Does the \(R^2\) suggest a good fit of the data to the model?
Answer
Both RBI and Years are statistically significant, and the salary of a player increases as they gain more experience and have more RBIs. Home runs do not seem to have an impact on the salary of a player according to the data. The F-statistic reveals that the coefficients are jointly significant since the p-value is approximately zero. Both the Multiple and Adjusted \(R^2\) suggest that the model only accounts for roughly 32% of the variation in Salary. We might have to include more variables in our model to better explain the salary of a player.
We can run a linear regression in R by using the `lm()` function. We'll use the `summary()` function to get more details on the model's performance.
```r
fit <- lm(Salary ~ HmRun + RBI + Years, data = Hitters)
summary(fit)
#> 
#> Call:
#> lm(formula = Salary ~ HmRun + RBI + Years, data = Hitters)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -752.31 -197.27  -66.80   97.73 2151.78 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -90.086     61.142  -1.473    0.142    
#> HmRun         -7.346      4.972  -1.478    0.141    
#> RBI            9.156      1.685   5.432 1.28e-07 ***
#> Years         32.818      4.838   6.783 7.97e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 372.2 on 259 degrees of freedom
#>   (59 observations deleted due to missingness)
#> Multiple R-squared:  0.3269, Adjusted R-squared:  0.3191 
#> F-statistic: 41.93 on 3 and 259 DF,  p-value: < 2.2e-16
```
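To see the ANOVA table for this model, pass the fitted object to `anova()` (output omitted here; as noted earlier, R reports a sequential breakdown by term):

```r
anova(fit)   # df, sums of squares, mean squares, and F-tests for the model
```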
José Altuve had \(28\) home runs and \(57\) RBIs, and has been in the league for \(12\) years. What is the model's predicted salary for him? What is the \(95\)% prediction interval? Note: the model predicts his salary as if he had played in \(1987\).
Answer
The predicted salary is 619.93 (Salary is measured in thousands of dollars), and the 95% prediction interval is [-129.89, 1369.74].
```r
new <- data.frame(HmRun = 28, RBI = 57, Years = 12)
predict(fit, newdata = new, level = 0.95, interval = "prediction")
#>        fit       lwr      upr
#> 1 619.9268 -129.8905 1369.744
```
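For comparison, setting `interval = "confidence"` would instead give the narrower interval for the *average* salary of all players with this profile, rather than for one individual player's salary:

```r
# Confidence interval for the mean response at the same predictor values.
predict(fit, newdata = new, level = 0.95, interval = "confidence")
```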