4  Time Series Tools

In this module, we will learn the tools that allow us to analyze time series data. Before any forecasts are estimated, we should carefully inspect the data, address any mistakes that might be present, and take note of any general patterns. Following this procedure ensures that the data is ready for analysis.

We will first learn how to transform and manipulate our data using the dplyr package and the handy “piping” operator. We will also familiarize ourselves with the lubridate package, which helps us parse dates. Next, we will use the tsibble data structure to organize time series data. Finally, we will plot the time series to identify trends, seasonality, or cyclical behavior using the ggplot2 and fpp3 packages. This last package includes the fable package, which is used to model time series data.

4.1 Chilango’s Kitchen Wants You to Forecast Avocado Prices

Chilango’s Kitchen specializes in tacos and burritos made to order in front of the customer. Guacamole is the perfect pairing to their delicious food and one of the restaurant’s best sellers. Their guac uses just six ingredients: avocados, lime juice, cilantro, red onion, jalapeño, and kosher salt. Because of its popularity, each restaurant goes through approximately five cases of avocados daily, amounting to more than \(44,000\) pounds annually. Chilango’s Kitchen wants you to provide insights on the price of avocados in California.

4.2 The Avocado Data Set

The data set is provided by the Hass Avocado Board and contains weekly retail scan data of U.S. retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. The data reflects an expanded, multi-outlet report that aggregates the following channels: grocery, mass, club, drug, dollar, and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. Other avocados (e.g., greenskins) are not included in this data. You can find the data here.

Note: In general, the data is recorded weekly. However, there is an entry on 01/01/2018, one day after 12/31/2017; this single observation breaks the weekly pattern. There are also missing dates between 12/02/2018 and 12/31/2018.

4.3 Loading tidyverse and Inspecting the Data

tidyverse is a collection of R packages that allow us to manipulate, explore, and visualize data. Two of its packages (dplyr and tidyr) will do most of the work of transforming our data and getting it ready for analysis. dplyr handles the bulk of our data manipulation: creating new variables, renaming variables, filtering values, sorting, grouping, and summarizing, among other tasks. tidyr lets us pivot data sets, unite or separate columns, and deal with missing values. Although it is always possible to complete these tasks in base R, tidyverse lets us perform them efficiently using data manipulation verbs that are very intuitive. Below we load the library.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

As you can see, several packages were attached (loaded) when we called library(tidyverse). As mentioned, both tidyr and dplyr are part of this collection. Now that the packages are loaded, we can import our data using the read_csv() function from the readr package.

avocado<-read_csv("https://jagelves.github.io/Data/avocado2020.csv")
Rows: 33045 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): date, type, geography
dbl (10): average_price, total_volume, 4046, 4225, 4770, total_bags, small_b...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The code above imports the data as a tibble (a data structure similar to a data frame) and saves it in an object named avocado. The output informs us that three variables are classified as characters, while the rest are doubles. You can inspect the column specification with the spec() command or preview the data with glimpse().

spec(avocado)
cols(
  date = col_character(),
  average_price = col_double(),
  total_volume = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  total_bags = col_double(),
  small_bags = col_double(),
  large_bags = col_double(),
  xlarge_bags = col_double(),
  type = col_character(),
  year = col_double(),
  geography = col_character()
)
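
Alternatively, glimpse() displays each column on its own line, along with its type and first few values. It is a handy complement to spec(); the output is omitted here.

glimpse(avocado)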

You will notice that the date variable is of type character. We can use the lubridate package to coerce this variable to a date. Specifically, since the date variable is formatted as month/day/year, we will use the mdy() function. You can learn more about this package in Wickham (2017) Chapter 18.

library(lubridate)
avocado$date<-mdy(avocado$date)
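
As a quick check that the coercion worked, you can parse a sample string and confirm the column’s new class. The string below is just an illustrative value in the same month/day/year format, not an entry taken from the file.

mdy("1/4/2015")      # parses to the Date 2015-01-04
class(avocado$date)  # should now return "Date"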

4.4 Piping and dplyr

dplyr is commonly used with “piping”. Generally speaking, “piping” allows us to chain functions. “Piping” (%>% or |>) passes the object on the left of the pipe as the first argument in the function to the right of the pipe. We can illustrate this using the select() and arrange() functions.

avocado %>% select(c(average_price,geography)) %>%
  arrange(desc(average_price)) %>% head(5)
# A tibble: 5 × 2
  average_price geography           
          <dbl> <chr>               
1          3.25 San Francisco       
2          3.17 Tampa               
3          3.12 San Francisco       
4          3.05 Miami/Ft. Lauderdale
5          3.04 Raleigh/Greensboro  

There is a lot to explain in this chunk of code. Let’s start with the functions used. The select() and arrange() functions are part of the dplyr package. As the name indicates, the select() function selects variables from a tibble or data frame. The arrange() function sorts the data in ascending order, and the desc() function is nested inside to sort in descending order.

Let’s now read the entire chain from left to right. avocado is the tibble that contains all of the data. Since it is to the left of the pipe (%>%), it passes as the first argument of the select() function. That is why you don’t see avocado listed as the first argument of the select() function. The new tibble (i.e., the one with only the average price and the geography) then passes as the first argument of the arrange() function that follows the second pipe. The data is sorted in descending order so that the highest average avocado price is displayed first. Finally, the head() function retrieves the top five entries.
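
To see exactly what the pipe does, here is the same query written with nested function calls. It produces an identical result, but it reads inside-out rather than left to right.

head(arrange(select(avocado, c(average_price, geography)),
             desc(average_price)), 5)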

As noted in Section 4.2, there is an additional date in the data set that is between weeks (2018-01-01). We can remove this observation by using the filter() function.

avocado %>% filter(date!=ymd("2018-01-01")) -> avocado

You should notice that whereas the select() function chooses particular variables (i.e., columns), the filter() function chooses the rows of the tibble that meet the conditions listed. Note also that the filtered data set is assigned (->) to avocado, overwriting the older object.

The examples above highlight the use of dplyr functions to transform your data. There are plenty of other functions, but covering them all is outside the scope of this book. To find out more, I recommend reading Wickham (2017) Chapter 4. Let’s use the filter() and select() functions once more to retrieve California’s average price of organic avocados for 2015-2018.

avocado %>% 
  filter(geography=="California", type=="organic", year<=2018) %>% 
  select(date, average_price) -> cali

When exploring time series data, it is important to standardize the time interval between observations. The code below calculates the number of days between observations.

cali %>% pull(date) %>% diff()
Time differences in days
  [1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 [38] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[112] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[149] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[186] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

The pull() function extracts the date variable as a vector, making it possible to apply the diff() function. The result confirms that all observations are seven days apart and that no missing observations or duplicates exist.
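
The same check can be written in a single line: the expression below returns TRUE only when every gap between consecutive dates is exactly seven days.

all(diff(pull(cali, date)) == 7)  # TRUE for a strictly weekly series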

4.5 Visualizing the Data

To visualize the data, we will be using ggplot2. The ggplot() function sets up the plotting canvas, while the aes() function determines the mapping of variables to the plot’s aesthetics. The geom_line() function specifies the type of plot and is added to the canvas with the plus sign +. In time series, we will use the line plot regularly. Labels for the graph are easily set with the labs() function, and there are plenty of themes available to customize your visualization; here we use theme_classic(). To learn more about the ggplot2 package, you can refer to Wickham (2017) Chapter 2. Below is the code to create a line plot of California’s average avocado price.

ggplot(data=cali) + 
  geom_line(mapping=aes(x=date,y=average_price),
            color="black") +
  theme_classic() + 
  labs(x="",
       y="Average Price", 
       title="Organic Avocado Price in California",
       subtitle="2015-2018",
       caption="https://hassavocadoboard.com")  

The average price of avocados in California increased during the period considered. It reached a maximum of about \(2.60\) in \(2017\) and a minimum of \(1.10\) in \(2015\). There is also a seasonal pattern with low prices at the beginning of the year and peaks mid-year. In upcoming chapters, we will extrapolate these patterns and use them to forecast time series.

4.6 tsibbles

When analyzing time series, time plays a central role. Consequently, we will use a data structure designed to handle time series called a tsibble (time series tibble). tsibbles are defined by a time index (i.e., the date), which has a standard interval (e.g., days, weeks, months, or years), and keys (i.e., dimensions that do not vary with time). In the avocado data set, we are mainly interested in the average price of the avocados. You will note that the prices are classified by location (geography) and type (organic and conventional). You can learn more about tsibbles here.

tsibbles, as well as a variety of packages that help us analyze time series, are part of the fpp3 package. Below we load the package and coerce our avocado tibble to a tsibble.

library(fpp3)
avocado %>%
  as_tsibble(key=c(type, geography),
           index=date, regular=TRUE) %>%
  filter_index("2015-01-04"~"2018-12-02")-> avocadots

In the code above, the as_tsibble() function is called with the regular parameter set to TRUE, indicating that the index has no gaps and observations occur at a fixed interval, here every week (the greatest common divisor of the differences in the index column). Additionally, the type of avocado and the geography do not vary with time, so they are classified as keys. The filter_index() function determines the window for analysis. As noted in Section 4.2, there are some missing dates in December 2018, so we limit the analysis for now to 2015-2018.
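
As a sanity check, tsibble (attached through fpp3) provides helpers to inspect the interval and search for implicit gaps. The calls below are a suggested verification; has_gaps() returns one row per key with a logical .gaps column, which should be FALSE throughout.

interval(avocadots)       # should report a one-week interval
avocadots %>% has_gaps()  # .gaps should be FALSE for every key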

Recall that we are interested in the average price of organic avocados in California. We can specify the tsibble for analysis by using dplyr.

avocadots %>% filter(geography=="California", type=="organic") %>%
  select(date,average_price) -> calits

4.7 Time Series Decomposition

As highlighted in Section 4.5, the average price of avocados in California has a trend and a seasonal pattern. There are methods available to tease out these components and make them more apparent. STL (Seasonal and Trend decomposition using LOESS) decomposes the series into a trend, a seasonal, and an error (unexplained) component. It is easy to run this method in R using the fable package.

In practice, the decomposition is constructed in several steps. First, a moving average of the series is calculated to track the trend. The trend is then subtracted from the series to obtain a de-trended series. Next, the seasonal component is calculated by averaging the de-trended values based on the window provided (52 weeks, or yearly). The error is the fluctuation that remains after removing the trend and the seasonal component (\(Series - Trend - Seasonal = Error\)). In the end, each component can be graphed and displayed, as illustrated below.
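
To make these steps concrete, here is a minimal hand-rolled sketch of the first two steps. It only illustrates the idea; actual STL uses LOESS smoothing rather than a plain moving average, and the slider package and the 53-week centered window are assumptions made for this example.

library(slider)
calits %>%
  mutate(trend = slide_dbl(average_price, mean, .before = 26,
                           .after = 26, .complete = TRUE),
         detrended = average_price - trend) %>%
  filter(!is.na(trend)) %>% head(3)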

calits %>%
  model(STL(average_price~trend(window=199)+
              season(window=51), robust=TRUE)) %>%
  components() %>% autoplot()+ theme_classic()

fable allows us to construct this model easily by using the model() function. This function lets us estimate a variety of time series models, so we will be using it regularly. The particular model we are running is STL, hence the use of the STL() function within the model() function. We define the model by specifying the dependent variable (i.e., average_price) followed by a tilde (~) and the components. As you can see, the window argument in the trend() function is set to a relatively large value, and the season() function specifies \(51\) weeks to capture the yearly seasonality. By doing this, the estimated trend reflects the general direction of the series and avoids the small fluctuations in the data. The robust parameter is set to TRUE to make the fitting process less sensitive to outliers or influential data points. You can learn more about this function in Chapter 3 of Hyndman (2021).
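
If you want to verify the decomposition identity numerically, you can extract the components and check that they add back up to the original series. The column names trend, season_year, and remainder are assumptions based on feasts’ usual naming for weekly data with a yearly season.

calits %>%
  model(STL(average_price~trend(window=199)+
              season(window=51), robust=TRUE)) %>%
  components() %>% as_tibble() %>%
  summarise(max_diff=max(abs(trend+season_year+remainder-average_price)))
# max_diff should be numerically zero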

As shown above, the trend is increasing, and the seasonal component confirms low levels at the beginning of the year and high levels in the summer. These are two general patterns that determine the price of avocados in California and provide valuable insight to share with Chilango’s Kitchen.

4.8 Readings

The readings get you familiar with the techniques used in time series analysis. Hyndman (2021) is the main reading here. It provides a good introduction to forecasting, plotting time series, and decomposition. The book makes use of fable to conduct STL decomposition and of tsibbles to store the data. Most time series analysis is made easier in R by tidyverse. Wickham (2017) provides an excellent introduction to some of the packages included in tidyverse, such as dplyr, ggplot2, and lubridate.

4.9 Lessons Learned in This Chapter

In this module you have been introduced to data wrangling, plotting, tsibbles, and time series decomposition. You have learned how to:

  • Manipulate dates with lubridate.

  • Select and filter variables using dplyr.

  • Plot time series using ggplot2.

  • Apply tsibbles in time series analysis.

  • Decompose a series using the model() and STL() functions in fable.