In this chapter, you will begin mastering the concepts of correlation significance, difference of means tests, and regression inference, empowering you with the tools to uncover meaningful relationships and differences in data while assessing their statistical significance. These skills will enable you to make informed business decisions, such as evaluating the strength of market trends, comparing performance across groups, or determining the impact of variables in predictive models, ensuring your conclusions are robust and actionable.
To determine the statistical significance of the correlation coefficient, we test one of the following pairs of hypotheses:
- \(H_o: \rho \geq 0\); \(H_a: \rho < 0\) (left tail)
- \(H_o: \rho \leq 0\); \(H_a: \rho > 0\) (right tail)
- \(H_o: \rho = 0\); \(H_a: \rho \neq 0\) (two tails)
The test statistic for the correlation is given by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\), where \(df=n-2\) and \(r_{xy}\) is the sample correlation coefficient.
Run the `cor.test()` function to perform the test on two vectors. Here is a list of arguments to use (a short example follows the list):
- `alternative`: a choice between "two.sided", "less", and "greater".
- `conf.level`: sets the confidence level. Enter it as a decimal, not a percentage.
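For instance, here is a minimal sketch on two made-up vectors (`x` and `y` are illustrative, not from any data set used in this chapter):

```r
# Hypothetical data: two small numeric vectors.
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
y <- c(2.0, 1.8, 3.9, 4.1, 6.2, 5.9, 8.4, 7.6)

# Two-sided test of H_o: rho = 0 at the 95% confidence level.
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)
```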
We now turn to tests for inference about the difference between two population means.
The test for unpaired mean differences (unequal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {\frac {s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\), where \(d_o\) is the hypothesized difference.
The test for unpaired mean differences (equal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {s_p^2 (\frac {1}{n_1} + \frac {1}{n_2})}}\), where \(s_p^2\) is the pooled sample variance.
The test for paired mean differences is given by \(t_{df}= \frac {\bar d- d_o}{\frac {s_d}{\sqrt{n}}}\), where \(\bar d\) and \(s_d\) are the mean and standard deviation of the differences.
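As a quick illustration, the first formula translates directly into R. The summary statistics below are hypothetical, chosen only to show the computation:

```r
# Hypothetical summary statistics for two independent samples.
xbar1 <- 52.3; s1 <- 6.1; n1 <- 40
xbar2 <- 48.9; s2 <- 7.4; n2 <- 35
d0 <- 0   # hypothesized difference under the null

# Unpaired t-stat with unequal variances, per the first formula above.
(tstat <- ((xbar1 - xbar2) - d0) / sqrt(s1^2 / n1 + s2^2 / n2))
```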
Run these tests in R by using the `t.test()` function. Here is a list of arguments to use (a short example follows the list):
- `paired`: use `TRUE` for paired samples, `FALSE` for independent samples. The default is `FALSE`.
- `var.equal`: use `TRUE` for equal variances, `FALSE` for unequal. The default is `FALSE`.
- `mu`: the hypothesized value of the mean or mean difference.
- `alternative`: a choice between "two.sided", "less", and "greater".
- `conf.level`: sets the confidence level. Enter it as a decimal, not a percentage.
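Putting these arguments together, a minimal sketch on simulated data (the means and standard deviations below are made up) might look like this:

```r
set.seed(1)   # for reproducibility
group1 <- rnorm(30, mean = 10, sd = 2)
group2 <- rnorm(30, mean = 11, sd = 2)

# Unpaired, two-sided test assuming equal population variances.
t.test(group1, group2, paired = FALSE, var.equal = TRUE,
       mu = 0, alternative = "two.sided", conf.level = 0.95)
```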
When running a regression, a couple of tests can be performed on the coefficients to determine significance:
The first test's competing hypotheses are \(H_o: \beta_j = 0\); \(H_a: \beta_j \ne 0\). The test statistic for the intercept (slope) coefficient is given by \(t_{df}= \frac {b_j}{se(b_j)}\), where \(df = n-k-1\).
The second test's competing hypotheses are \(H_o: \beta_1=\beta_2=\dots=\beta_k=0\); \(H_a:\) at least one \(\beta_j \neq 0\). The joint test of significance is given by \(F_{df_1,df_2} = \frac {SSR/k}{SSE/(n-k-1)} = \frac {MSR}{MSE}\), with \(df_1 = k\) and \(df_2 = n-k-1\). The ANOVA table below shows more detail on this test.
ANOVA | df | SS | MS | F | Significance |
---|---|---|---|---|---|
Regression | \(k\) | \(SSR\) | \(MSR=\frac{SSR}{k}\) | \(F_{df_1,df_2} = \frac {MSR}{MSE}\) | \(P\left(F \geq \frac{MSR}{MSE}\right)\) |
Residual | \(n-k-1\) | \(SSE\) | \(MSE=\frac {SSE}{n-k-1}\) | | |
Total | \(n-1\) | \(SST\) | | | |
To conduct these tests, save the `lm()` model into an object. The `summary()` function can then be used to retrieve the results of the tests on the model's parameters. Use the `anova()` function to obtain the ANOVA table, as sketched below.
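As a quick sketch with the built-in mtcars data set (the model here is purely illustrative):

```r
# Regress fuel efficiency on weight and horsepower.
fit_mtcars <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit_mtcars)   # t-tests on each coefficient, joint F-test, R-squared
anova(fit_mtcars)     # df, sums of squares, mean squares, and F statistics
```

Note that `anova()` on a fitted `lm` object reports a sequential breakdown by term rather than the single combined regression row shown in the table above; the residual row, however, matches.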
The following exercises will help you test your knowledge on Regression and Inference. In particular, the exercises cover testing the significance of a correlation coefficient, comparing means across groups, and assessing the significance of regression coefficients and predictions.
Try not to peek at the answers until you have formulated your own answer and double-checked your work for any mistakes.
Consider the following competing hypotheses: \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\). A sample of \(25\) observations reveals that the correlation coefficient between two variables is \(0.15\). At the \(5\)% significance level, can we reject the null hypothesis?
Answer
At the 5% significance level, we cannot reject the null since the p-value is 0.47 > 0.05.
Recall that the t-stat is calculated by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\). We can use R as a calculator to compute this value:
```r
rxy <- 0.15
n <- 25
(tstat <- (rxy * sqrt(n - 2)) / (sqrt(1 - rxy^2)))
#> [1] 0.7276069
```
Now, we can estimate the p-value using the `pt()` function:
```r
2 * pt(tstat, n - 2, lower.tail = FALSE)
#> [1] 0.4741966
```
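Equivalently, because the t distribution is symmetric, the same two-sided p-value can be computed from the lower tail:

```r
2 * pt(-abs(tstat), n - 2)   # identical two-sided p-value via symmetry
```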
Install the `ISLR2` package in R. Use the Hitters data set to look at the relationship between Hits and Salary. Specifically, calculate the correlation coefficient and test the competing hypotheses \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\) at the \(1\)% significance level.
Answer
The estimated correlation is 0.44 and the t-value is 7.89. Since the p-value is approximately 0, we reject the null hypothesis \(H_{o}: \rho = 0\).
Once the `ISLR2` package is downloaded, it can be loaded into R using the `library()` function. The `cor.test()` function conducts the appropriate test of significance. Note that `conf.level` only affects the reported confidence interval; the decision at the \(1\)% significance level is based on the p-value.
```r
library(ISLR2)
cor.test(Hitters$Salary, Hitters$Hits, conf.level = 0.95)
#>  Pearson's product-moment correlation
#> 
#> data:  Hitters$Salary and Hitters$Hits
#> t = 7.8863, df = 261, p-value = 8.531e-14
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.3355210 0.5314332
#> sample estimates:
#>       cor 
#> 0.4386747
```
Install the `ISLR2` package in R. Use the Hitters data set to investigate whether the average hits were significantly different between the two leagues (American and National). Use the NewLeague and Hits variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Answer
There is no reason to believe that the population variances are different. Players are recruited from what seems to be a common pool. At a 5% significance level, the difference of the two means is not significantly different from zero. We can’t reject the null hypothesis.
We will use the `t.test()` function in R to test the hypothesis. We note that the test is not paired, two-sided, and assumes equal variances in the population.
```r
t.test(Hitters$Hits[Hitters$NewLeague == "A"],
       Hitters$Hits[Hitters$NewLeague == "N"], paired = FALSE,
       alternative = "two.sided", mu = 0, var.equal = TRUE,
       conf.level = 0.95)
#>  Two Sample t-test
#> 
#> data:  Hitters$Hits[Hitters$NewLeague == "A"] and Hitters$Hits[Hitters$NewLeague == "N"]
#> t = 1.0862, df = 320, p-value = 0.2782
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -4.581286 15.875028
#> sample estimates:
#> mean of x mean of y 
#> 103.58523  97.93836
```
Use the `ISLR2` package for this question. In particular, use the BrainCancer data set to test whether males have a higher average survival time than females. Use the sex and time variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Answer
There might be reason to believe that the population variances are different, since women and men are known to have medical differences. At the 5% significance level, the average survival time of men does not appear to be larger than that of women; we can't reject the null hypothesis \(H_{o}: \mu_1 - \mu_2 \leq 0\).
Once more, use the `t.test()` function in R to test the hypothesis. Note that the test is not paired, right-tailed, and assumes different variances in the population.
```r
t.test(BrainCancer$time[BrainCancer$sex == "Male"],
       BrainCancer$time[BrainCancer$sex == "Female"], paired = FALSE,
       alternative = "greater", mu = 0, var.equal = FALSE,
       conf.level = 0.95)
#>  Welch Two Sample t-test
#> 
#> data:  BrainCancer$time[BrainCancer$sex == "Male"] and BrainCancer$time[BrainCancer$sex == "Female"]
#> t = -0.30524, df = 84.867, p-value = 0.6195
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#>  -8.504999       Inf
#> sample estimates:
#> mean of x mean of y 
#>  26.78302  28.10200
```
Use the sleep data set included in R. At the \(1\)% significance level, is there an effect of the drug on the \(10\) patients? Assume that the group variable denotes before (\(1\)) the drug is administered and after (\(2\)) the drug is administered.
Answer
The drug seems to have an effect, as we can reject the null hypothesis \(H_{o}: \mu_d = 0\). The mean difference appears to be statistically different from zero.
Use the `t.test()` function once more in R. Make sure to note that the test is paired and two-tailed.
```r
t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2], paired = TRUE,
       alternative = "two.sided", mu = 0, conf.level = 0.99)
#>  Paired t-test
#> 
#> data:  sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
#> t = -4.0621, df = 9, p-value = 0.002833
#> alternative hypothesis: true mean difference is not equal to 0
#> 99 percent confidence interval:
#>  -2.8440519 -0.3159481
#> sample estimates:
#> mean difference 
#>           -1.58
```
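As a sanity check, a paired test is equivalent to a one-sample test on the within-patient differences, which connects back to the formula \(t_{df}= \frac{\bar d - d_o}{s_d/\sqrt{n}}\):

```r
# One-sample test on the differences; results match the paired test above.
d <- sleep$extra[sleep$group == 1] - sleep$extra[sleep$group == 2]
t.test(d, mu = 0, alternative = "two.sided", conf.level = 0.99)
```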
Install the `ISLR2` package in R. Use the Hitters data set to investigate the effect of HmRun, RBI, and Years on a player's Salary. Which coefficients are statistically different from zero? Are the variables jointly significant? Does the \(R^2\) suggest a good fit of the data to the model?
Answer
Both RBI and Years are statistically significant, and the salary of a player increases as they gain more experience and have more RBIs. Home runs do not seem to have an impact on the salary of a player according to the data. The F-statistic reveals that the coefficients are jointly significant since the p-value is approximately zero. Both the Multiple and Adjusted \(R^2\) suggest that the model only accounts for roughly 32% of the variation in Salary. We might have to include more variables in our model to better explain the salary of a player.
We can run a linear regression in R by using the `lm()` function. We'll use the `summary()` function to get more details on the model's performance.
```r
fit <- lm(Salary ~ HmRun + RBI + Years, data = Hitters)
summary(fit)
#> 
#> Call:
#> lm(formula = Salary ~ HmRun + RBI + Years, data = Hitters)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -752.31 -197.27  -66.80   97.73 2151.78 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -90.086     61.142  -1.473    0.142    
#> HmRun         -7.346      4.972  -1.478    0.141    
#> RBI            9.156      1.685   5.432 1.28e-07 ***
#> Years         32.818      4.838   6.783 7.97e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 372.2 on 259 degrees of freedom
#>   (59 observations deleted due to missingness)
#> Multiple R-squared:  0.3269, Adjusted R-squared:  0.3191 
#> F-statistic: 41.93 on 3 and 259 DF,  p-value: < 2.2e-16
```
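To see the ANOVA table for this model, pass the fitted object to `anova()` (output omitted here; as noted earlier, R reports a sequential breakdown by term):

```r
anova(fit)   # df, sums of squares, mean squares, and F-tests for the model
```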
José Altuve had \(28\) home runs and \(57\) RBIs, and has been in the league for \(12\) years. What is the model's predicted salary for him? What is the \(95\)% prediction interval? Note: the model predicts his salary as if he had played in \(1987\).
Answer
The predicted salary is 619.93 (Salary is measured in thousands of dollars), and the 95% prediction interval is [-129.89, 1369.74].
```r
new <- data.frame(HmRun = 28, RBI = 57, Years = 12)
predict(fit, newdata = new, level = 0.95, interval = "prediction")
#>        fit       lwr      upr
#> 1 619.9268 -129.8905 1369.744
```
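For comparison, setting `interval = "confidence"` would instead give the narrower interval for the *average* salary of all players with this profile, rather than for one individual player's salary:

```r
# Confidence interval for the mean response at the same predictor values.
predict(fit, newdata = new, level = 0.95, interval = "confidence")
```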