Understanding where the “center” of a dataset lies is a fundamental step when interpreting and analyzing data. Measures of central location provide different ways to identify the “typical” value in a dataset, each with its unique strengths and limitations.
3.1 The Mean
The mean is the average value for a numerical variable. It is a widely understood and straightforward measure to calculate. It incorporates all data points, providing a comprehensive representation of the data set. However, its reliance on every value also makes it sensitive to outliers or skewed distributions, which may cause it to not accurately reflect the true center of the data.
The sample statistic is estimated by:
\[
\bar{x}=\frac{\sum x_{i}}{n}
\]
where \(x_i\) is observation \(i\), and \(n\) is the number of observations in the sample. The formula instructs you to add up all of the values in a variable and then divide by the sample size. If you wish to estimate the population parameter you can use:
\[
\mu=\frac{\sum x_{i}}{N}
\]
where \(N\) is the population size.
Ex 1: Consider the following sample of numbers\(x=\{1,4,2,1\}\). The mean would be equal to\(\bar{x}=\frac{1+4+2+1}{4}\)or\(\bar{x}=2\).
Ex 2: Consider the following sample of numbers\(x=\{1,4,2,1,100\}\). The mean is \(\bar{x}=21.6\). Although most of the numbers are in the range \((1,4)\), the \(100\) biases the mean to \(21.6\).
3.2 The Median
The median is the value in the middle when data is organized in ascending order. When \(n\) is even, the median is the average between the two middle values. The median is resistant to outliers, making it an ideal measure for skewed data or data sets with irregular distributions. It is particularly useful for ordinal data, where precise ranking is important. However, the median does not utilize all data points, which can lead to less precise comparisons compared to other measures like the mean.
Ex 1: Consider the following numbers\(x=\{1,4,2,1\}\). We first sort the data to obtain \(x_{sorted}=\{1,1,2,4\}\). Since the number of observations is even (\(n=4\)), the median is the average of the two middle numbers \((1,2)\). \(median_x=\frac{1+2}{2}\) or \(1.5\).
Ex 2: Consider the following numbers\(x=\{1,4,2,1,100\}\). We once again sort the number obtaining \(x_{sorted}=\{1,1,2,4,100\}\). Since \(n=5\) in this case we can identify the median as the third value in \(x_{sorted}\). \(median_x=2\). Note that the inclusion of \(100\) in the data did not change much the measure of central location.
3.3 The Mode
The mode is the value with highest frequency from a set of observations. This measure is particularly useful for categorical data as it helps determine popularity of values. It can be applied to both numerical and non-numerical data sets. The mode has its limitations; it may not exist in cases where all values occur with equal frequency, and there may be multiple modes, which can complicate interpretation. Additionally, since the mode focuses only on the most frequent value, it does not account for other data points, limiting its overall utility as a comprehensive measure.
Ex 1: Consider the following numbers\(x=\{1,4,2,1\}\). Since \(1\) is repeated twice and all other numbers just repeated once, \(x_{mode}=1\).
Ex 2: Consider the following numbers\(x=\{1,4,2,1,4\}\). Now \(4\) is also repeated twice. The variable has two modes \(x_{mode}=\{1,2\}\). \(x\) is said to be bimodal.
3.4 The Weighted Mean
The weighted mean is useful in scenarios where some data points are more significant than others, such as in financial portfolios, grade point averages, or survey results, as it accounts for variability in importance across observations. This measure requires additional information in the form of weights (\(w_i\)), which may not always be available or accurate. It is calculated the sum product of values (\(x_i\)) and weights (\(w_i\)) and then dividing by the sum of weights. Mathematically, the weighted average is:
Ex: Consider three different stocks\(S=\{T, C, X\}\) with stock returns of \(R=\{2,4,10\}\). Each stock has a weight in the portfolio of \(W=\{0.3,0.2,0.5\}\). The average return of the portfolio is \(\bar{x}_{weighted}=\frac{0.6+0.8+5}{1}\) or \(\bar{x}_{weighted}=6.4\).
3.5 The Geometric Mean
The geometric mean is a multiplicative average that is less sensitive to outliers relative to the arithmetic mean. It is useful when averaging growth rates or rates of return. It is calculated by:
where \(\sqrt[n]{}\) is the \(n_{th}\) root, and \(r_i\) are the returns or growth rates. When working with growth rates or rates of return, you add 1 to each rate because these metrics represent changes relative to a base value.
When dealing with proportions, these already reflect a standalone quantity, not a relative change. As a consequence, there is no need to add 1. In this case the geometric mean simplifies to:
Ex: Consider the variable\(x=\{0.2,0.3,0.1,0.1\}\). The geometric mean is equal to \(\bar{x}_g=\sqrt[4]{(1.2 \times 1.3 \times 1.1 \times 1.1)}-1\) or \(\bar{x_g}=0.17\) if x represents growth rates. When x represents ratios from a whole, the geometric mean is \(\bar{x}_g=\sqrt[4]{0.2 \times 0.3 \times 0.1 \times 0.1)}\) or \(0.156\).
3.6 Measures of Central Location in R
Base R has a collection of functions that calculate measures of central location. Let’s consider the following data on approval ratings:
Since the approval rating is a percentage of the total people polled there is no need to add one to these numbers.
The summary() calculates a collection of summary statistics for a vector or data frame. Below we apply it to the entire data set:
summary(poll)
date people approval
Length:4 Min. : 30.0 Min. :0.2500
Class :character 1st Qu.: 45.0 1st Qu.:0.2500
Mode :character Median : 75.0 Median :0.4750
Mean :107.5 Mean :0.5125
3rd Qu.:137.5 3rd Qu.:0.7375
Max. :250.0 Max. :0.8500
Below a list of the functions used:
The mean() function calculates the average for a vector of numbers.
The median() function calculates the median for a vector of numbers.
The table() function generates the frequency distribution so that we can identify the mode or modes.
The weighted.mean() function calculates the weighted mean for a vector of numbers, and corresponding vector of weights.
The lenght() function calculates the length of a vector.
The summary() function provides measures of location for a vector of numbers.
3.7 Exercises
The following exercises will help you practice the measures of central location. In particular, the exercises work on:
Calculating the mean, median, and the mode.
Calculating the weighted average.
Applying the geometric mean for growth rates and returns.
Answers are provided below. Try not to peak until you have a formulated your own answer and double checked your work for any mistakes.
Exercise 1
For the following exercises, make your calculations by hand and verify results using R functions when possible.
Use the following observations to calculate the mean, the median, and the mode.
8
10
9
12
12
Answer
To find the mean we will use the following formula\(( \frac{1}{n} \sum_{i=i}^{n} x_{i})\). The summation of the values is \(51\) and the number of observations is \(5\). The mean is \(51/5=10.2\).
The median is found by locating the middle value when data is sorted in ascending order. The median in this example is\(10\).
The mode is the value with the highest frequency. In this example the mode is\(12\) since it is repeated twice and all other numbers appear only once.
The mean can be easily verified in R by using the mean() function:
mean(c(8,10,9,12,12))
[1] 10.2
Similarly, the median is easily verified by using the median() function:
median(c(8,10,9,12,12))
[1] 10
We can use the table() function to calculate frequencies and easily identify the mode.
table(c(8,10,9,12,12))
8 9 10 12
1 1 1 2
Use following observations to calculate the mean, the median, and the mode.
-4
0
-6
1
-3
-4
Answer
The mean is\(-2.67\), the median is \(-3.5\), the mode is \(-4\).
The mean is verified in R:
mean(c(-4,0,-6,1,-3,-4))
[1] -2.666667
The median in R:
median(c(-4,0,-6,1,-3,-4))
[1] -3.5
Finally, the mode in R:
table(c(-4,0,-6,1,-3,-4))
-6 -4 -3 0 1
1 2 1 1 1
Use the following observations, calculate the mean, the median, and the mode.
20
15
25
20
10
15
25
20
15
Answer
The mean is\(18.33\), the median is \(20\), the data is bimodal with both \(15\) and \(20\) being modes.
The mean is verified in R:
mean(c(20,15,25,20,10,15,25,20,15))
[1] 18.33333
The median in R:
median(c(20,15,25,20,10,15,25,20,15))
[1] 20
The frequency distribution identifies the modes:
table(c(20,15,25,20,10,15,25,20,15))
10 15 20 25
1 3 3 2
Exercise 2
Download the ISLR2 package. You will need the OJ data set to answer this question.
Find the mean price for Country Hill (PriceCH) and Minute Maid (PriceMM).
Answer
The mean price for Country Hill is\(1.87\). The mean price for Minute Maid is \(2.09\).
The means can be easily found with the mean() function:
library(ISLR2)OJ=OJmean(OJ$PriceCH)
[1] 1.867421
mean(OJ$PriceMM)
[1] 2.085411
Find the mean price of Country Hill (PriceCH) at each store (StoreID). Which store provides the better price?
Answer
The mean price at store 1 for Country Hill is\(1.80\). The juice is cheaper at store 1.
The means for each store can be found by using group_by() and summarise().The mean price at each store is:
Over the past year an investor bought TSLA. She made these purchases on three occasions at the prices shown in the table below. Calculate the average price per share.
Date
Price Per Share
Number of Shares
February
250.34
80
April
234.59
120
Aug
270.45
50
Answer
The average price of sale is found by using the weighted average formula.\(\frac{\sum w_{i}x_{i}}{\sum w_{i}}\) The weights (\(w_{i}\)) are given by the number of shares bought and the values (\(x_{i}\)) are the prices. The weighted average is \(246.802\).
In R you can create two vectors. One holds the share price and the other one the number of shares bought.
What would have been the average price per share if the investor would have bought equal amounts of shares each month?
Answer
The average if equal shares were bought would be\(251.7933\).
In R you can use the mean() function on the PricePerShare vector.
(Average<-mean(PricePerShare))
[1] 251.7933
Exercise 4
Consider the following observations for the consumer price index (CPI). Calculate the inflation rate (Growth Rate of the CPI) for each period.
1.0
1.3
1.6
1.8
2.1
Answer
The inflation rate is the percentage change in the CPI. The inflation rate for each period is shown in the table below:
30%
23.08%
12.5%
16.67%
In R create an object to store the values of the CPI:
CPI<-c(1,1.3,1.6,1.8,2.1)
Next use the diff() function to find the difference between the end value and start value. Divide the result by a vector of starting value and multiply times 100.
(Inflation<-100*diff(CPI)/CPI[1:4])
[1] 30.00000 23.07692 12.50000 16.66667
What is the average growth rate for the inflation rate?
Answer
The average growth rate is \(31.61%\)
We can use the geometric mean formula with compounding. In R:
(Growth<-prod(1.3,1.2308,1.125,1.1667)^(1/4)-1)
[1] 0.2038175
Suppose that you want to invest $1000 dollars in a stock that is predicted to yield the following returns in the next four years. Calculate both the arithmetic mean and the geometric mean. Use the geometric mean to estimate how much money you would have by the end of year 4.
Year
Annual Return
1
17.3
2
19.6
3
6.8
4
8.2
Answer
At the end of 4 years it is predicted that you would have\(1621.17\) dollars. Each year you would have gained \(12.84\)% on average.