rxy<-0.15
n<-25
(tstat<-(rxy*sqrt(n-2))/(sqrt(1-rxy^2)))
[1] 0.7276069
To determine the statistical significance of the correlation coefficient we test:
\(H_o: \rho \geq 0\); \(H_a: \rho <0\) left tail
\(H_o: \rho \leq 0\); \(H_a: \rho >0\) right tail
\(H_o: \rho = 0\); \(H_a: \rho \neq 0\) two tails
The test statistic for the correlation is given by \(t_{df}= \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\), where \(df=n-2\) and \(r_{xy}\) is the sample correlation coefficient.
Run the cor.test()
function to perform the test on two vectors. Here is a list of arguments to use:
alternative: a choice among “two.sided”, “less”, and “greater”.
conf.level: sets the confidence level. Enter as a decimal, not a percentage.
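As a sketch, the test statistic from the formula above matches the one reported by cor.test(). The example below uses R's built-in mtcars data, which is purely illustrative and not part of the exercises:

```r
# Sketch: verify the correlation t statistic by hand against cor.test(),
# using R's built-in mtcars data (illustrative only)
x <- mtcars$mpg
y <- mtcars$wt
n <- length(x)
rxy <- cor(x, y)
tstat <- (rxy*sqrt(n-2))/sqrt(1-rxy^2)
out <- cor.test(x, y, alternative = "two.sided", conf.level = 0.95)
all.equal(unname(out$statistic), tstat)   # the two values agree
```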
Tests for inference about the difference of two population means.
The test for unpaired mean differences (unequal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {\frac {s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\).
The test for unpaired mean differences (equal variances) is given by \(t_{df}= \frac {(\bar x_1 - \bar x_2)- d_o}{\sqrt {s_p^2 (\frac {1}{n_1} + \frac {1}{n_2})}}\), where \(s_p^2\) is the pooled sample variance.
The test for paired mean difference is given by \(t_{df}= \frac {\bar d- d_o}{\frac {s}{\sqrt{n}}}\).
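As a sketch, the equal-variance statistic can be assembled by hand and checked against R's t.test() function. The simulated data below are purely illustrative:

```r
# Sketch: unpaired equal-variance t statistic built from the pooled variance
# (simulated data, hypothesized difference d_o = 0)
set.seed(1)
x1 <- rnorm(20, mean = 10, sd = 2)
x2 <- rnorm(25, mean = 11, sd = 2)
n1 <- length(x1); n2 <- length(x2)
sp2 <- ((n1-1)*var(x1) + (n2-1)*var(x2)) / (n1+n2-2)   # pooled variance
tstat <- (mean(x1) - mean(x2) - 0) / sqrt(sp2*(1/n1 + 1/n2))
out <- t.test(x1, x2, var.equal = TRUE)
all.equal(unname(out$statistic), tstat)   # the two values agree
```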
Run these tests in R using the t.test()
function. Here is a list of arguments to use:
paired: use TRUE for paired, FALSE for independent samples. The default is FALSE.
var.equal: use TRUE for equal variances, FALSE for unequal. The default is FALSE.
mu: a value that indicates the hypothesized value of the mean or mean difference.
alternative: a choice among “two.sided”, “less”, and “greater”.
conf.level: sets the confidence level. Enter as a decimal, not a percentage.
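As a sketch, the paired statistic from the formula above can be checked against t.test() using R's built-in sleep data (illustrative only):

```r
# Sketch: paired mean-difference test, by hand and via t.test(),
# using R's built-in sleep data
d <- sleep$extra[sleep$group==1] - sleep$extra[sleep$group==2]
tstat <- (mean(d) - 0) / (sd(d)/sqrt(length(d)))   # hypothesized d_o = 0
out <- t.test(sleep$extra[sleep$group==1],
              sleep$extra[sleep$group==2],
              paired = TRUE, alternative = "two.sided",
              mu = 0, conf.level = 0.95)
all.equal(unname(out$statistic), tstat)   # the two values agree
```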
When running a regression, a couple of tests can be performed on the coefficients to determine significance:
The first test's competing hypotheses are \(H_o: \beta_j = 0\); \(H_a: \beta_j \ne 0\). The test statistic for the intercept (slope) coefficient is given by \(t_{df}= \frac {b_j}{se(b_j)}\), where \(df=n-k-1\).
The second test's competing hypotheses are \(H_o: \beta_1=\beta_2=...=\beta_k=0\); \(H_a:\) at least one \(\beta_j \neq 0\). The joint test of significance is given by \(F_{df_1,df_2} = \frac {SSR/k}{SSE/(n-k-1)} = \frac {MSR}{MSE}\), where \(df_1=k\) and \(df_2=n-k-1\). The Anova table below shows more detail on this test.
Anova | df | SS | MS | F | Significance |
---|---|---|---|---|---|
Regression | \(k\) | \(SSR\) | \(MSR=\frac{SSR}{k}\) | \(F_{df_1,df_2} = \frac {MSR}{MSE}\) | \(P(F \geq \frac{MSR}{MSE})\) |
Residual | \(n-k-1\) | \(SSE\) | \(MSE=\frac {SSE}{n-k-1}\) | | |
Total | \(n-1\) | \(SST\) | | | |
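As a sketch, every piece of this table can be computed directly from a fitted model. The mtcars data and the model below are purely illustrative:

```r
# Sketch: build the joint F statistic from SSR, SSE, and the degrees of freedom
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model with k = 2 regressors
n <- nrow(mtcars)
k <- 2
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSE <- sum(resid(fit)^2)
SSR <- SST - SSE
Fstat <- (SSR/k) / (SSE/(n-k-1))          # MSR/MSE
all.equal(unname(summary(fit)$fstatistic["value"]), Fstat)   # agrees with summary()
```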
To conduct these tests, save the lm()
model into an object. The summary()
function can then be used to retrieve the results of the tests on the model’s parameters. Use the anova()
function to obtain the Anova table.
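A minimal sketch of that workflow, with an illustrative model on R's built-in mtcars data:

```r
# Sketch: save the model, then retrieve the coefficient t tests and the Anova table
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
summary(fit)$coefficients                 # estimates, std. errors, t values, p-values
anova(fit)                                # sequential Anova table for the fitted model
```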
The following exercises will help you test your knowledge on Regression and Inference. In particular, the exercises work on:
Determining the significance of correlations.
Conducting paired and unpaired tests of means and proportions.
Determining the significance of the slope and intercept estimates both individually and jointly.
Developing prediction intervals.
Answers are provided below. Try not to peek until you have formulated your own answer and double-checked your work for any mistakes.
Consider the following competing hypotheses: \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\). A sample of \(25\) observations reveals that the correlation coefficient between two variables is \(0.15\). At the \(5\)% significance level, can we reject the null hypothesis?
Install the ISLR2
package in R. Use the Hitters data set to look at the relationship between Hits and Salary. Specifically, calculate the correlation coefficient and test the competing hypothesis \(H_{o}: \rho=0\), \(H_{a}: \rho \neq 0\) at the \(1\)% significance level.
Install the ISLR2
package in R. Use the Hitters data set to investigate if the average hits were significantly different between the two divisions (American and National). Use the NewLeague and Hits variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Use the ISLR2
package for this question. Specifically, use the BrainCancer data set to test whether males have a higher average survival time than females. Use the sex and time variables to test the hypothesis at the \(5\)% significance level. Is there reason to believe that the population variances are different?
Install the ISLR2
package in R. Use the Hitters data set to investigate the effect of HmRun, RBI, and Years on a player's Salary. Which variables are statistically different from zero? Are the variables jointly significant? Does the \(R^2\) suggest a good fit of the data to the model?
José Altuve had \(28\) home runs, \(57\) RBIs, and has been in the league for \(12\) years. What is the model's predicted salary for him? What is the \(95\)% prediction interval? Note: The model predicts his salary as if he played in \(1987\).
Recall that the t-stat is calculated by \(\frac {r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}\). We can use R as a calculator to calculate this value:
rxy<-0.15
n<-25
(tstat<-(rxy*sqrt(n-2))/(sqrt(1-rxy^2)))
[1] 0.7276069
Now, we can estimate the \(p\)-value using the pt()
function:
2*pt(tstat,n-2,lower.tail = F)
[1] 0.4741966
Once the ISLR2
package is downloaded, it can be loaded into R using the library()
function. The cor.test()
function conducts the appropriate test of significance.
library(ISLR2)
cor.test(Hitters$Salary,Hitters$Hits, conf.level = 0.95)
Pearson's product-moment correlation
data: Hitters$Salary and Hitters$Hits
t = 7.8863, df = 261, p-value = 8.531e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3355210 0.5314332
sample estimates:
cor
0.4386747
We will use the t.test()
function in R to test the hypothesis. Note that the test is unpaired, two-sided, and assumes equal population variances.
t.test(Hitters$Hits[Hitters$NewLeague=="A"],
       Hitters$Hits[Hitters$NewLeague=="N"],paired = F,
       alternative = "two.sided",mu = 0,var.equal = T,
       conf.level = 0.95 )
Two Sample t-test
data: Hitters$Hits[Hitters$NewLeague == "A"] and Hitters$Hits[Hitters$NewLeague == "N"]
t = 1.0862, df = 320, p-value = 0.2782
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.581286 15.875028
sample estimates:
mean of x mean of y
103.58523 97.93836
Once more use the t.test()
function in R to test the hypothesis. Note that the test is unpaired, right-tailed, and assumes unequal population variances.
t.test(BrainCancer$time[BrainCancer$sex=="Male"],
       BrainCancer$time[BrainCancer$sex=="Female"],paired = F,
       alternative = "greater",mu = 0, var.equal = F,
       conf.level = 0.95 )
Welch Two Sample t-test
data: BrainCancer$time[BrainCancer$sex == "Male"] and BrainCancer$time[BrainCancer$sex == "Female"]
t = -0.30524, df = 84.867, p-value = 0.6195
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-8.504999 Inf
sample estimates:
mean of x mean of y
26.78302 28.10200
Use the t.test()
function once more in R. Note that the test is paired and two-tailed.
t.test(sleep$extra[sleep$group==1],
       sleep$extra[sleep$group==2], paired=T,
       alternative = "two.sided", mu=0, conf.level = 0.99)
Paired t-test
data: sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
-2.8440519 -0.3159481
sample estimates:
mean difference
-1.58
We can run a linear regression in R by using the lm()
function. We’ll use the summary()
function to get more details on the model’s performance.
fit<-lm(Salary~HmRun+RBI+Years,data=Hitters)
summary(fit)
Call:
lm(formula = Salary ~ HmRun + RBI + Years, data = Hitters)
Residuals:
Min 1Q Median 3Q Max
-752.31 -197.27 -66.80 97.73 2151.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -90.086 61.142 -1.473 0.142
HmRun -7.346 4.972 -1.478 0.141
RBI 9.156 1.685 5.432 1.28e-07 ***
Years 32.818 4.838 6.783 7.97e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 372.2 on 259 degrees of freedom
(59 observations deleted due to missingness)
Multiple R-squared: 0.3269, Adjusted R-squared: 0.3191
F-statistic: 41.93 on 3 and 259 DF, p-value: < 2.2e-16
new<-data.frame(HmRun=28,RBI=57,Years=12)
predict(fit,newdata=new,level=0.95,interval="prediction")
fit lwr upr
1 619.9268 -129.8905 1369.744