14  Regression and Inference

14.1 Concepts

Correlation Significance

To determine the statistical significance of the correlation coefficient, we test one of the following:

  • \(H_o: \rho \geq 0\); \(H_a: \rho < 0\) (left-tailed)

  • \(H_o: \rho \leq 0\); \(H_a: \rho > 0\) (right-tailed)

  • \(H_o: \rho = 0\); \(H_a: \rho \neq 0\) (two-tailed)

The test statistic for the correlation is given by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\), where \(df=n-2\) and \(r_{xy}\) is the sample correlation coefficient.

Run the cor.test() function to perform the test on two vectors. Here is a list of arguments to use (a short sketch follows the list):

  • alternative: a choice between “two.sided”, “less”, and “greater”.

  • conf.level: sets the confidence level. Enter it as a decimal, not a percentage.
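As an illustration, here is a minimal sketch that computes the test statistic by hand and then runs the equivalent two-tailed cor.test(). The correlation rxy, the sample size n, and the vectors x and y are all made up for this example:

# Hypothetical sample correlation and sample size
rxy <- 0.30
n <- 30
# t statistic with n-2 degrees of freedom
(tstat <- (rxy * sqrt(n - 2)) / sqrt(1 - rxy^2))

# The same test via cor.test() on two simulated vectors
set.seed(1)
x <- rnorm(n)
y <- x + rnorm(n)
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)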

Difference of Means Tests

Tests for inference about the difference of two population means.

  • The test for unpaired mean differences (unequal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {\frac {s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\).

  • The test for unpaired mean differences (equal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {s_p^2 (\frac {1}{n_1} + \frac {1}{n_2})}}\), where \(s_p^2\) is the pooled sample variance.

  • The test for paired mean differences is given by \(t_{df}= \frac {\bar d - d_o}{s_d/\sqrt{n}}\), where \(\bar d\) and \(s_d\) are the mean and standard deviation of the paired differences.

Run these tests in R by using the t.test() function. Here is a list of arguments to use (see the sketch after this list):

  • paired: use TRUE for paired, FALSE for independent. The default is FALSE.

  • var.equal: use TRUE for equal variances, FALSE for unequal. The default is FALSE.

  • mu: a value that indicates the hypothesized value of the mean or mean difference.

  • alternative: a choice between “two.sided”, “less”, and “greater”.

  • conf.level: sets the confidence level. Enter it as a decimal, not a percentage.
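For instance, here is a minimal sketch on two made-up samples x1 and x2 (the means and sample sizes are purely illustrative):

# Simulated samples, for illustration only
set.seed(2)
x1 <- rnorm(20, mean = 5)
x2 <- rnorm(20, mean = 5.5)

# Unpaired, two-sided test assuming unequal variances (Welch)
t.test(x1, x2, paired = FALSE, var.equal = FALSE,
       mu = 0, alternative = "two.sided", conf.level = 0.95)

# Paired version of the same comparison (equal-length vectors required)
t.test(x1, x2, paired = TRUE, mu = 0, alternative = "two.sided",
       conf.level = 0.95)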

Regression Inference

When running a regression, a couple of tests can be performed on the coefficients to determine significance:

  • The first test’s competing hypotheses are \(H_o: \beta_j = 0\); \(H_a: \beta_j \ne 0\). The test statistic for the intercept (slope) coefficient is given by \(t_{df}= \frac {b_j}{se(b_j)}\), where \(df=n-k-1\).

  • The second test’s competing hypotheses are \(H_o: \beta_1=\beta_2=\dots=\beta_k=0\); \(H_a:\) at least one \(\beta_j \neq 0\). The joint test of significance is given by \(F_{df_1,df_2} = \frac {SSR/k}{SSE/(n-k-1)} = \frac {MSR}{MSE}\), where \(df_1=k\) and \(df_2=n-k-1\). The ANOVA table below shows more detail on this test.

| ANOVA      | df        | SS      | MS                         | F                                    | Significance                   |
|------------|-----------|---------|----------------------------|--------------------------------------|--------------------------------|
| Regression | \(k\)     | \(SSR\) | \(MSR=\frac{SSR}{k}\)      | \(F_{df_1,df_2} = \frac{MSR}{MSE}\)  | \(P(F \geq \frac{MSR}{MSE})\)  |
| Residual   | \(n-k-1\) | \(SSE\) | \(MSE=\frac{SSE}{n-k-1}\)  |                                      |                                |
| Total      | \(n-1\)   | \(SST\) |                            |                                      |                                |

To conduct these tests, save the lm() model into an object. The summary() function can then be used to retrieve the results of the tests on the model’s parameters. Use the anova() function to obtain the ANOVA table.
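As a sketch, using R’s built-in mtcars data set (any data frame with a numeric response would do):

# Save the fitted model into an object
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Individual t-tests on each coefficient, R-squared, and the joint F-test
summary(fit)

# ANOVA table for the fitted model
anova(fit)

Note that anova() applied to a single model reports a sequential decomposition with one row per regressor plus a residual row; the joint F-statistic and its p-value appear in the last line of the summary() output.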

14.2 Exercises

The following exercises will help you test your knowledge of Regression and Inference. In particular, the exercises work on:

  • Determining the significance of correlations.

  • Conducting paired and unpaired tests of means.

  • Determining the significance of the slope and intercept estimates both individually and jointly.

  • Developing prediction intervals.

Answers are provided below. Try not to peek until you have formulated your own answer and double-checked your work for any mistakes.

Exercise 1

  1. Consider the following competing hypotheses: \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\). A sample of \(25\) observations reveals that the correlation coefficient between two variables is \(0.15\). At a \(5\)% significance level, can we reject the null hypothesis?

  2. Install the ISLR2 package in R. Use the Hitters data set to look at the relationship between Hits and Salary. Specifically, calculate the correlation coefficient and test the competing hypothesis \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\) at the \(1\)% significance level.

Exercise 2

  1. Install the ISLR2 package in R. Use the Hitters data set to investigate whether average hits differ significantly between the two leagues (American and National). Use the NewLeague and Hits variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?

  2. Use the ISLR2 package for this question. Particularly, use the BrainCancer data set to test whether males have a higher average survival time than females. Use the sex and time variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?

Exercise 3

  1. Use the sleep data set included in R. At the \(1\)% significance level, is there an effect of the drug on the \(10\) patients? Assume that the group variable denotes before (\(1\)) the drug is administered and after (\(2\)) the drug is administered.

Exercise 4

  1. Install the ISLR2 package in R. Use the Hitters data set to investigate the effect of HmRun, RBI, and Years on a player’s Salary. Which variables are statistically different from zero? Are the variables jointly significant? Does the \(R^2\) suggest a good fit of the data to the model?

  2. José Altuve had \(28\) home runs, \(57\) RBIs, and has been in the league for \(12\) years. What is the model’s predicted salary for him? What is the \(95\)% prediction interval? Note: the model predicts his salary as if he had played in \(1987\).

14.3 Answers

Exercise 1

  1. At the \(5\)% significance level, we cannot reject the null since the p-value is \(0.47>0.05\).

Recall that the t-stat is calculated by \(\frac {r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\). We can use R as a calculator to compute this value:

rxy<-0.15   # sample correlation coefficient
n<-25       # sample size
(tstat<-(rxy*sqrt(n-2))/(sqrt(1-rxy^2)))   # t statistic with n-2 degrees of freedom
[1] 0.7276069

Now, we can estimate the \(p\)-value using the pt() function:

2*pt(tstat,n-2,lower.tail = F)   # two-tailed p-value
[1] 0.4741966

  2. The estimated correlation is \(0.44\) and the t-value is \(7.89\). Since the \(p\)-value is approximately \(0\), we reject the null hypothesis \(H_{o}: \rho=0\).

Once the ISLR2 package is installed, it can be loaded into R using the library() function. The cor.test() function conducts the appropriate test of significance.

library(ISLR2)
cor.test(Hitters$Salary,Hitters$Hits, conf.level = 0.95)

    Pearson's product-moment correlation

data:  Hitters$Salary and Hitters$Hits
t = 7.8863, df = 261, p-value = 8.531e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3355210 0.5314332
sample estimates:
      cor 
0.4386747 

Exercise 2

  1. There is no reason to believe that the population variances are different. Players are recruited from what seems to be a common pool. At a \(5\)% significance level, the difference of the two means is not significantly different from zero. We can’t reject the null hypothesis.

We will use the t.test() function in R to test the hypothesis. We note that the test is unpaired, two-sided, and assumes equal variances in the population.

t.test(Hitters$Hits[Hitters$NewLeague=="A"],
       Hitters$Hits[Hitters$NewLeague=="N"],paired = F, 
       alternative = "two.sided",mu = 0,var.equal = T,
       conf.level = 0.95 )

    Two Sample t-test

data:  Hitters$Hits[Hitters$NewLeague == "A"] and Hitters$Hits[Hitters$NewLeague == "N"]
t = 1.0862, df = 320, p-value = 0.2782
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.581286 15.875028
sample estimates:
mean of x mean of y 
103.58523  97.93836 

  2. There might be reason to believe that the population variances are different. Women and men are known to have medical differences. At a \(5\)% significance level, the average survival time of men does not appear to be larger than that of women. We can’t reject the null hypothesis \(H_{o}: \mu_{1}-\mu_{2} \leq 0\).

Once more, use the t.test() function in R to test the hypothesis. Note that the test is unpaired, right-tailed, and assumes different variances in the population.

t.test(BrainCancer$time[BrainCancer$sex=="Male"],
       BrainCancer$time[BrainCancer$sex=="Female"],paired = F, 
       alternative = "greater",mu = 0, var.equal = F,
       conf.level = 0.95 )

    Welch Two Sample t-test

data:  BrainCancer$time[BrainCancer$sex == "Male"] and BrainCancer$time[BrainCancer$sex == "Female"]
t = -0.30524, df = 84.867, p-value = 0.6195
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -8.504999       Inf
sample estimates:
mean of x mean of y 
 26.78302  28.10200 

Exercise 3

  1. The drug seems to have an effect, as we can reject the null hypothesis \(H_{o}: \mu_{d} = 0\). The mean difference is statistically different from zero.

Use the t.test() function once more in R. Note that the test is paired and two-tailed, with a \(99\)% confidence level.

t.test(sleep$extra[sleep$group==1],
       sleep$extra[sleep$group==2], paired=T,
       alternative = "two.sided", mu=0, conf.level = 0.99)

    Paired t-test

data:  sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
 -2.8440519 -0.3159481
sample estimates:
mean difference 
          -1.58 

Exercise 4

  1. Both RBI and Years are statistically significant, and the salary of a player increases with more experience and more RBIs. Home runs do not seem to have an impact on the salary of a player according to the data. The F-statistic reveals that the coefficients are jointly significant since the p-value is approximately zero. Both the multiple and adjusted \(R^2\) suggest that the model only accounts for about \(32\)% of the variation in Salary. We might have to include more variables in our model to better explain the salary of a player.

We can run a linear regression in R by using the lm() function. We’ll use the summary() function to get more details on the model’s performance.

fit<-lm(Salary~HmRun+RBI+Years,data=Hitters)
summary(fit)

Call:
lm(formula = Salary ~ HmRun + RBI + Years, data = Hitters)

Residuals:
    Min      1Q  Median      3Q     Max 
-752.31 -197.27  -66.80   97.73 2151.78 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -90.086     61.142  -1.473    0.142    
HmRun         -7.346      4.972  -1.478    0.141    
RBI            9.156      1.685   5.432 1.28e-07 ***
Years         32.818      4.838   6.783 7.97e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 372.2 on 259 degrees of freedom
  (59 observations deleted due to missingness)
Multiple R-squared:  0.3269,    Adjusted R-squared:  0.3191 
F-statistic: 41.93 on 3 and 259 DF,  p-value: < 2.2e-16

  2. The predicted salary is \(619.93\) (Salary is measured in thousands of dollars) and the \(95\)% prediction interval is [\(-129.89\), \(1369.74\)].

new<-data.frame(HmRun=28,RBI=57,Years=12)   # José Altuve's statistics
predict(fit,newdata=new,level=0.95,interval="prediction")   # point prediction and interval
       fit       lwr      upr
1 619.9268 -129.8905 1369.744
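
For comparison, a prediction interval for an individual player is wider than a confidence interval for the mean response. Setting interval = "confidence" in the same call would return the narrower interval for the average salary of all players with these statistics:

# Confidence interval for the mean response (narrower than the prediction interval)
predict(fit,newdata=new,level=0.95,interval="confidence")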