Despite its near-ubiquitous use in business and the social sciences, time series analysis, and by extension time series forecasting, is one of the least understood machine learning methods that new data scientists and machine learning engineers take on.
The purpose of this blog is to provide an overview of this lesser-known but incredibly important machine learning technique.
What is time series data?
To answer this question, let’s take a step back to discuss the types of data that we use for typical regression and classification tasks. When we make a prediction about a new observation, that model is built from hundreds or thousands of previous observations that are either all captured at a single point in time, or from data points in which time does not matter. This is known as cross-sectional data.
Time series data is different because it is recorded at regular time intervals, and the order of these data points is important. Therefore, any predictive model based on time series data will have time as an independent variable. The output of a model would be the predicted value or classification at a specific time.
What is a time series problem?
Here are a few examples of how different industries use time series forecasting:
- Energy – Prices; demand; production schedules
- Retail – Sales; consumer demand for certain products
- State government – Sales tax receipts
- Transportation – Demand for future travel
- Finance – Stocks; market potential
Time series analysis vs time series forecasting
This blog is focused on time series forecasting, but let’s clear up some possible confusion about the term time series analysis. While time series forecasting is a form of predictive modeling, time series analysis is a form of descriptive modeling. This means that someone conducting time series analysis is looking at a dataset to identify trends and seasonal patterns and associate them with external circumstances. Many social scientists and policy makers use this form of descriptive modeling to develop programs and recommendations.
The goal of time series forecasting, however, is to predict a future value or classification at a particular point in time.
The four components of a time series
The first step in analyzing a time series in order to develop a predictive model is to identify and understand the underlying pattern of the data over time. These underlying patterns are usually classified as the following four components:
- Trend – The long-term, gradual change in the series. This is the simplest pattern, as it shows long-term growth or decline.
- Seasonality – Predictable, short-term patterns that occur within a single unit of time and repeat indefinitely.
- Cyclical component – Long-term swings in the data that may take years or decades to play out. These swings do not happen in a predictable manner and are often the result of external economic conditions.
- Noise (error) – Random variation due to uncontrolled circumstances.
What are time series forecasting methods?
The following are machine learning forecasting methods to use with time series data. When deciding on a method to use, keep the following in mind:
- Underlying assumptions about the data (e.g., does the error follow a normal distribution?)
- The external factors that may influence the trend
- Whether the problem you’re trying to solve calls for a simple or a complicated solution
Simple time series forecasting methods
It’s possible that the most accurate machine learning time series forecasting model is the simplest. In the same way that data scientists often begin their modeling of cross-sectional data with simple linear regression, there are time series equivalents. Here are a few examples:
- Naive forecast – In a naive forecast the predicted value is simply the value of the most recent observation. This very basic method is often used as a benchmark to evaluate the performance of more sophisticated forecasts. In other words, if your complicated model is less accurate than the naive forecast, then you are likely doing something wrong.
- Average – In the average method, all forecasts are equal to the mean of all of the historical data.
- Seasonal naive method – This is similar to the naive forecast except that the predicted value is the last observed value from the same season of the time period. For example, on a monthly scale using this method, a November forecast would be equal to the last observed value in November.
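The three simple methods above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `season_length` and `steps_ahead` parameters, and the quarterly `sales` figures are all made up for the example.

```python
from statistics import mean


def naive_forecast(series):
    """Forecast every future value as the most recent observation."""
    return series[-1]


def average_forecast(series):
    """Forecast every future value as the mean of all historical data."""
    return mean(series)


def seasonal_naive_forecast(series, season_length, steps_ahead=1):
    """Forecast using the last observed value from the same season."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    # Number of whole seasons to step back so the index lands in the history.
    seasons_back = (steps_ahead - 1) // season_length + 1
    index = len(series) - 1 + steps_ahead - season_length * seasons_back
    return series[index]


# Hypothetical quarterly sales for two years (Q1..Q4, Q1..Q4).
sales = [10, 20, 30, 40, 12, 22, 32, 42]

next_naive = naive_forecast(sales)                 # last value, 42
next_average = average_forecast(sales)             # mean of all values, 26
next_seasonal = seasonal_naive_forecast(sales, 4)  # last Q1 value, 12
```

With the seasonal naive method, the forecast for the next quarter (a Q1) ignores the most recent Q4 value entirely and reuses the last Q1 value instead.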
Regression-based time series forecasting
You can develop linear, polynomial, and exponential regression time series forecasting models by creating a time index variable that runs from the first observation (t=1) to the most recent (t=n). The result is a model of trend, but not seasonality. This is a useful method if your underlying assumption is that this trend is appropriate and relevant for your chosen time period.
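As a rough sketch of the linear case, the snippet below fits a straight line to a series by ordinary least squares, using the time index t = 1..n as the only predictor. The function names and the toy series are assumptions for illustration; in practice you would likely use a statistics library rather than hand-rolled formulas.

```python
def fit_linear_trend(series):
    """Fit y = intercept + slope * t by least squares, with t = 1..n."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(series) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, series))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast_linear_trend(intercept, slope, n_obs, steps_ahead=1):
    """Extrapolate the fitted line to time index n_obs + steps_ahead."""
    return intercept + slope * (n_obs + steps_ahead)


# Toy series that follows y = 1 + 2t exactly.
series = [3, 5, 7, 9, 11]
intercept, slope = fit_linear_trend(series)
next_value = forecast_linear_trend(intercept, slope, len(series))
```

Because the toy series is perfectly linear, the fitted line recovers an intercept of 1 and a slope of 2, and the one-step-ahead forecast continues the line to 13.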
If your model does need to take seasonality into account, that can also be done with linear regression. This is done by creating a categorical variable that indicates seasons.
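One way to picture the seasonal categorical variable is as a set of 0/1 indicator columns placed next to the time index. The sketch below only builds those predictor rows; the function name and the column layout are assumptions for illustration, and the rows would still need to be passed to a least-squares routine of your choice to fit the model.

```python
def seasonal_design_matrix(n_obs, season_length):
    """Build predictor rows: [intercept, time index, season indicators...].

    The first season is the baseline, so it gets no indicator column.
    Dropping one season avoids duplicating the intercept column.
    """
    rows = []
    for t in range(1, n_obs + 1):
        season = (t - 1) % season_length  # 0 for the first season, 1 for the second, ...
        indicators = [1 if season == s else 0 for s in range(1, season_length)]
        rows.append([1, t] + indicators)
    return rows


# Five quarterly observations: columns are [1, t, is_Q2, is_Q3, is_Q4].
design = seasonal_design_matrix(5, 4)
```

Each fitted indicator coefficient can then be read as how far that season sits above or below the baseline season, after accounting for the trend.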
Unlike regression models that are based on assumptions about trend or noise structure, time series smoothing methods are designed to adapt to changes in the data over time. Smoothing reduces noise by taking averages of observations over multiple periods. We discuss the two most common smoothing methods, moving average and exponential smoothing, below:
Moving average – The moving average method generates a series of averages by taking the mean of values in the time series within designated periods. It is presumed that observations that are close in time are probably also similar in value, so taking an average reduces the noise. Moving averages are usually taken of the most recent data points.
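A trailing moving average, which averages only the most recent points, might look like the sketch below. The function name and the `window` parameter are illustrative assumptions; using the last average as the next forecast is one common convention, not the only one.

```python
def trailing_moving_average(series, window):
    """Return the mean of each run of `window` consecutive observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


series = [2, 4, 6, 8, 10]
averages = trailing_moving_average(series, 3)  # [4.0, 6.0, 8.0]

# The most recent average can serve as the forecast for the next period.
next_value = averages[-1]
```

A larger window smooths more aggressively but reacts more slowly to genuine changes in the series.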
Exponential smoothing – Exponential smoothing takes a weighted average over all past values, giving more weight to the most recent observations. The purpose is to acknowledge older information, while prioritizing the most recent data.
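The simplest form of this idea can be sketched as follows. The smoothing parameter `alpha` (between 0 and 1) and the choice to start the level at the first observation are assumptions made for this illustration; libraries differ in how they initialize and tune both.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after the last observation.

    Each update blends the new observation with the previous level, so the
    weight on older observations shrinks geometrically.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


series = [10, 20, 30]
next_value = simple_exponential_smoothing(series, alpha=0.5)  # 22.5
```

An `alpha` close to 1 makes the forecast track the latest observation (at exactly 1 it reduces to the naive forecast), while a small `alpha` gives older data more influence.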
Special considerations for time series data
As you may know, developing a model requires dividing data into training and validation sets. With cross-sectional data, you would randomly divide the data into these groups. However, you can’t randomly divide data that has a sequential time element. For time series modeling, earlier data is used as the training set, while newer data is used as the validation set.
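A time-ordered split can be as simple as slicing the series at a cutoff point. The function name and the `n_validation` parameter below are illustrative assumptions.

```python
def chronological_split(series, n_validation):
    """Hold out the last `n_validation` observations for validation."""
    if not 0 < n_validation < len(series):
        raise ValueError("n_validation must be between 1 and len(series) - 1")
    cutoff = len(series) - n_validation
    return series[:cutoff], series[cutoff:]


# Ten observations in time order; the three newest are held out.
series = list(range(10))
train, validation = chronological_split(series, 3)
```

No shuffling happens at any point, so every training observation comes before every validation observation.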
In addition, once a model has been chosen, the final forecasts of new values come from a model refit on the entire data set, not just the training partition. This is because the data most relevant to a forecast are the observations that happened most recently.
Developing your own time series model
The process of developing your own time series machine learning model is similar to how you would develop a model using cross-sectional data. We’ve detailed this process here, but in short, here are the steps:
- Find a problem to solve or a question to answer
- Gather the data necessary to be able to answer the question
- Perform exploratory analysis
- Choose and test out models
- Evaluate the accuracy of your chosen model
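For the evaluation step, one option is to compare a model's error on the validation set against a naive benchmark. The sketch below assumes mean absolute error as the metric, which is a choice made here for illustration; all of the numbers, including the model forecasts, are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical validation data and forecasts from some candidate model.
validation = [12, 14, 13, 15]
model_forecasts = [11, 14, 14, 15]

# Naive benchmark: repeat the last training observation (assumed to be 10).
last_training_value = 10
naive_forecasts = [last_training_value] * len(validation)

model_mae = mean_absolute_error(validation, model_forecasts)  # 0.5
naive_mae = mean_absolute_error(validation, naive_forecasts)  # 3.5
model_beats_naive = model_mae < naive_mae
```

If the candidate model does not beat the naive benchmark, that is a signal to revisit the model before trusting its forecasts.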
If you’re unsure where to start, check out this introductory post as well as the time series datasets found on Kaggle and the UCI Machine Learning Repository. Both have dozens of datasets specific to this machine learning technique.