6 Regression I

Measures of association are essential tools in business statistics for analyzing relationships between variables. They help determine whether changes in one variable are linked to changes in another, providing valuable insights into patterns and dependencies. Understanding these relationships is critical for making informed decisions. By quantifying the strength and direction of associations, businesses can better interpret data, optimize strategies, and drive meaningful outcomes. Below we study important measures of association.

6.1 The Covariance

The covariance is a measure that determines the direction of the relationship between two variables. It is calculated by:

THE COVARIANCE \[s_{xy}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1}\] where \(x_i\) and \(y_i\) are observations, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(n\) is the sample size.

\(s_{xy} > 0\)	\(s_{xy} = 0\)	\(s_{xy} < 0\)
Direct relationship	No relationship	Inverse relationship

Example: Let’s consider the following data that captures the price of stocks (SPY) and bonds (BND):

Period	SPY	BND
1	87	316
2	86	380
3	84	416
4	85	430
5	83	440

The idea behind calculating a covariance is to determine whether there is a relationship between the two variables. We start by finding the deviations from the mean for both SPY and BND:

Period	SPY	BND	\(SPY - \bar{SPY}\)	\(BND - \bar{BND}\)	Product
1	87	316	\(2\)	\(-80\)	\(-160\)
2	86	380	\(1\)	\(-16\)	\(-16\)
3	84	416	\(-1\)	\(20\)	\(-20\)
4	85	430	\(0\)	\(34\)	\(0\)
5	83	440	\(-2\)	\(44\)	\(-88\)

The covariance is the average of the products column using \(n-1\):

\[s_{xy} = \frac{-160 + (-16) + (-20) + 0 + (-88)}{5-1} = \frac{-284}{4} = -71\]

Since \(s_{xy} < 0\) there is an inverse relationship between SPY and BND. Intuitively, the covariance checks whether on average the products of the deviations from the mean is positive or negative. Whenever \(x\) lies below its mean and \(y\) lies above its mean, the product is negative — and vice versa — reflecting the inverse relationship between the two variables.

6.2 The Correlation

The correlation measures the strength of the linear relationship between two variables. It is calculated by:

THE CORRELATION COEFFICIENT \[r = \frac{s_{xy}}{s_x \cdot s_y}\] where \(s_{xy}\) is the covariance, \(s_x\) is the standard deviation of \(x\), and \(s_y\) is the standard deviation of \(y\).

The correlation coefficient lies between \([-1, 1]\):

\(r = -1\)	\(-1 < r < 0\)	\(r = 0\)	\(0 < r < 1\)	\(r = 1\)
Perfect inverse	Inverse	No relationship	Direct	Perfect direct

The strength of the relationship is determined by the absolute value of \(r\). Values close to \(1\) indicate a strong relationship, values close to \(0\) indicate a weak relationship.

Example: Consider the SPY and BND data. The standard deviation of SPY is \(s_{SPY} = 1.58\) and that of BND is \(s_{BND} = 50.37\). Substituting into the correlation formula:

\[r = \frac{-71}{1.58 \times 50.37} = -0.89\]

This indicates that the relationship between SPY and BND is inverse and strong.

6.3 The Coefficient of Determination (\(R^2\))

The coefficient of determination or \(R^2\) measures the percent of variation in \(y\) explained by variations in \(x\). It is calculated by:

THE COEFFICIENT OF DETERMINATION \[R^2 = r^2\] where \(r\) is the correlation coefficient. \(R^2\) ranges from \(0\) (no relationship) to \(1\) (perfect relationship).

Example: Consider once more the SPY and BND example. The correlation coefficient is \(r = -0.89\). Hence:

\[R^2 = (-0.89)^2 = 0.79\]

This indicates that about \(79\%\) of the variation in BND can be explained by changes in SPY. The two variables are closely related.

6.4 Measures of Association in Excel

Excel provides straightforward functions to calculate all three measures of association. Let’s use the SPY and BND data from the example above.

Download Data

Covariance: To calculate the sample covariance use =COVARIANCE.S(). Assuming SPY is in column A and BND is in column B:

=COVARIANCE.S(A2:A6, B2:B6)

Correlation: To calculate the correlation coefficient use =CORREL():

=CORREL(A2:A6, B2:B6)

Coefficient of Determination: There is no direct Excel function for \(R^2\) between two variables. Simply square the correlation coefficient:

=CORREL(A2:A6, B2:B6)^2

The results for the SPY and BND data are:

Covariance	Correlation	R2
−71.0000	−0.8915	0.7949

Scatter Plot: To create a scatter plot in Excel, select both columns and go to Insert → Charts → Scatter. This will display the relationship between SPY and BND visually. To add a trend line, right-click any point in the scatter plot and select Add Trendline → Linear.

6.5 Excel Function Summary

Below is a list of the Excel functions used in this section:

=COVARIANCE.S(array1, array2) calculates the sample covariance between two variables. Use =COVARIANCE.P() for the population covariance.
=CORREL(array1, array2) calculates the correlation coefficient between two variables. The result is always between \(-1\) and \(1\).
=CORREL(array1, array2)^2 calculates the coefficient of determination \(R^2\) by squaring the correlation coefficient. Excel does not have a standalone \(R^2\) function for two variables.

6.6 Exercises

The following exercises will help you understand statistical measures that establish the relationship between two variables. In particular, the exercises work on:

Calculating covariance and correlation.
Creating scatter diagrams in Excel.
Calculating the coefficient of determination.

Answers are provided below. Try not to peek until you have formulated your own answer and double checked your work for any mistakes.

Exercise 1

For the following exercises, make your calculations by hand and verify results using Excel functions when possible.

Consider the data below. Calculate the covariance and correlation coefficient by finding deviations from the mean. Verify your result using Excel. Is there a direct or inverse relationship between the two variables? How strong is the relationship?

x	20	21	15	18	25
y	17	19	12	13	22

Answer

The covariance is \(14.9\) and the correlation is \(0.96\). The results indicate that there is a strong direct relationship between the two variables.

Start by finding the mean of each variable: \(\bar{x} = 19.8\) and \(\bar{y} = 16.6\). Then build the deviations table:

\(x_i\)	\(y_i\)	\(x_i-\bar{x}\)	\(y_i-\bar{y}\)	Product
20	17	\(0.2\)	\(0.4\)	\(0.08\)
21	19	\(1.2\)	\(2.4\)	\(2.88\)
15	12	\(-4.8\)	\(-4.6\)	\(22.08\)
18	13	\(-1.8\)	\(-3.6\)	\(6.48\)
25	22	\(5.2\)	\(5.4\)	\(28.08\)

\[s_{xy} = \frac{0.08+2.88+22.08+6.48+28.08}{4} = \frac{59.6}{4} = 14.9\]

\[r = \frac{14.9}{sd(x) \times sd(y)} = \frac{14.9}{3.49 \times 4.34} = 0.96\]

To verify in Excel, enter x in column A and y in column B and use:

=COVARIANCE.S(A2:A6, B2:B6) → \(14.9\)
=CORREL(A2:A6, B2:B6) → \(0.96\)

Consider the data below. Calculate the covariance and correlation coefficient by finding deviations from the mean. Verify your result using Excel. Is there a direct or inverse relationship between the two variables? How strong is the relationship?

w	19	16	14	11	18
z	17	20	20	16	18

Answer

The covariance is \(0.85\) and the correlation is \(0.148\). The results indicate that there is a very weak direct relationship between the two variables. They might be unrelated.

To verify in Excel, enter w in column A and z in column B and use:

=COVARIANCE.S(A2:A6, B2:B6) → \(0.85\)
=CORREL(A2:A6, B2:B6) → \(0.148\)

Exercise 3

You will need the College data set to answer this question. You can download it here:

Download Data

Create a scatter diagram between GRAD_DEBT_MDN (Median Debt) on the y-axis and MD_EARN_WNE_P10 (Median Earnings) on the x-axis. Which of the following best describes the relationship?

Direct relationship — higher earnings are associated with higher debt
Inverse relationship — higher earnings are associated with lower debt
No relationship between the two variables
U-shaped relationship

Calculate the correlation coefficient and coefficient of determination. Which of the following correctly describes the results?

\(r = 0.46\), \(R^2 = 0.46\) — moderate direct relationship
\(r = 0.46\), \(R^2 = 0.21\) — moderate direct relationship, debt explains 21% of variation in earnings
\(r = -0.46\), \(R^2 = 0.21\) — moderate inverse relationship
\(r = 0.90\), \(R^2 = 0.81\) — strong direct relationship

Answer

The correlation coefficient is \(r = 0.46\), indicating a moderate direct relationship between earnings and debt. The coefficient of determination \(R^2 = 0.21\) indicates that only \(21\%\) of the variation in earnings can be explained by debt. Other factors such as field of study, institution type, and location likely explain the remaining variation.

In Excel, assuming GRAD_DEBT_MDN is in column A and MD_EARN_WNE_P10 is in column B:

=CORREL(A2:A1000, B2:B1000) → \(0.46\)
=CORREL(A2:A1000, B2:B1000)^2 → \(0.21\)

Note: use =CORREL() rather than =COVARIANCE.S() when there are missing values, as CORREL handles them more gracefully in Excel.

6.1 The Covariance

6.2 The Correlation

6.3 The Coefficient of Determination (\(R^2\))

6.4 Measures of Association in Excel

6.5 Excel Function Summary

6.6 Exercises

Exercise 1

Exercise 2

Exercise 3