# An introduction to time series forecasting

Despite its near-ubiquitous use in business and the social sciences, time series analysis, and by extension time series forecasting, is one of the least understood machine learning methods that new data scientists and machine learning engineers take on.

The purpose of this blog is to provide an overview of this lesser-known but incredibly important machine learning technique.

## What is time series data?

To answer this question, let’s take a step back to discuss the types of data that we use for typical regression and classification tasks. When we make a prediction about a new observation, the model we use is built from hundreds or thousands of previous observations that are either all captured at a single point in time, or come from data points in which time does not matter. This is known as cross-sectional data.

Time series data is different because it is recorded at regular time intervals, and the order of these data points is important. Therefore, any predictive model based on time series data will have time as an independent variable. The output of a model would be the predicted value or classification at a specific time.

## What is a time series problem?

Here are a few examples of how different industries use time series forecasting:

• Energy – Prices; demand; production schedules
• Retail – Sales; consumer demand for certain products
• State government – Sales tax receipts
• Transportation – Demand for future travel
• Finance – Stocks; market potential

## Time series analysis vs time series forecasting

This blog is focused on time series forecasting, but let’s clear up some possible confusion about the term time series analysis. While time series forecasting is a form of predictive modeling, time series analysis is a form of descriptive modeling. This means that someone conducting time series analysis is looking at a dataset to identify trends and seasonal patterns and associate them with external circumstances. Many social scientists and policy makers use this form of descriptive modeling to develop programs and recommendations.

The goal of time series forecasting, however, is to predict a future value or classification at a particular point in time.

## The four components of a time series

The first step in analyzing a time series in order to develop a predictive model is to identify and understand the underlying pattern of the data over time. These underlying patterns are usually classified as the following four components:

• Trend – The long-term gradual change in the series. This is the simplest pattern, as it shows long-term growth or decline.
• Seasonality – Predictable, short-term patterns that occur within a single unit of time and repeat indefinitely.
• Cyclical component – Long-term swings in the data that may take years or decades to play out. These swings do not happen in a predictable manner and are often the result of external economic conditions.
• Noise (error) – Random variation due to uncontrolled circumstances.
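
To make the components concrete, here is a minimal sketch that builds a synthetic monthly series from a trend, a seasonal pattern, and noise. The additive combination, the 12-month season, and every number in it are illustrative assumptions, and the cyclical component is left out for brevity.

```python
import numpy as np

# Hypothetical monthly series covering 4 years (48 observations).
rng = np.random.default_rng(0)
t = np.arange(48)

trend = 100 + 0.5 * t                          # long-term gradual growth
seasonality = 10 * np.sin(2 * np.pi * t / 12)  # pattern that repeats every 12 months
noise = rng.normal(0, 1, size=t.size)          # random variation

# Additive composition (an assumption; some series are better modeled multiplicatively).
series = trend + seasonality + noise
```

Plotting `series` next to `trend` and `seasonality` is a quick way to see how the pieces combine.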

## What are time series forecasting methods?

The following are machine learning forecasting methods to use with time series data. When deciding on a method to use, keep the following in mind:

• Underlying assumptions about the data (i.e., does the error follow a normal distribution?)
• The external factors that may influence the trend
• Whether the problem you’re trying to solve calls for a simple or a complicated solution

### Simple time series forecasting methods

It’s possible that the most accurate machine learning time series forecasting model is the simplest. In the same way that data scientists often begin their modeling of cross-sectional data with simple linear regression, there are time series equivalents. Here are a few examples:

• Naive forecast – In a naive forecast the predicted value is simply the value of the most recent observation. This very basic method is often used as a benchmark to evaluate the performance of more sophisticated forecasts. In other words, if your complicated model is less accurate than the naive forecast, then you are likely doing something wrong.
• Average – In the average method, all forecasts are equal to the mean of all of the historical data.
• Seasonal naive method – This is similar to the naive forecast except that the predicted value is the last observed value from the same season of the time period. For example, on a monthly scale using this method, a November forecast would be equal to the last observed value in November.
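
The three baselines above can be sketched in a few lines of Python. The function names and the quarterly sales figures are hypothetical, chosen only for illustration.

```python
from statistics import mean

def naive_forecast(history):
    """Forecast the next value as the most recent observation."""
    return history[-1]

def average_forecast(history):
    """Forecast the next value as the mean of all historical observations."""
    return mean(history)

def seasonal_naive_forecast(history, season_length):
    """Forecast the next value as the value observed one full season earlier."""
    return history[-season_length]

# Hypothetical quarterly sales for two years (season_length = 4).
sales = [10, 20, 30, 40, 12, 22, 32, 42]

next_naive = naive_forecast(sales)                 # last observation
next_average = average_forecast(sales)             # mean of all observations
next_seasonal = seasonal_naive_forecast(sales, 4)  # same quarter, previous year
```

Any of these can serve as the benchmark that a more sophisticated model has to beat.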

### Regression-based time series forecasting

You can develop linear, polynomial, and exponential regression time series forecasting models by creating a time index variable starting with the first observation (t=1) to the most recent (t=n). The result is a model of trend, but not seasonality. This is a useful method if your underlying assumption is that this trend is appropriate and relevant for your chosen time period.

If your model does need to take seasonality into account, that can also be done with linear regression. This is done by creating a categorical variable that indicates seasons.
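
As a rough sketch of this idea, the code below fits a linear trend on a time index together with one dummy variable per season, using NumPy least squares. The function names are hypothetical, and the first season is treated as the baseline category, which is one common encoding choice rather than the only one.

```python
import numpy as np

def fit_trend_seasonal(y, season_length):
    """Least-squares fit of y_t = b0 + b1 * t + seasonal offsets."""
    n = len(y)
    t = np.arange(1, n + 1)
    season = (t - 1) % season_length
    # One dummy column per season except the first, which serves as the baseline.
    dummies = (season[:, None] == np.arange(1, season_length)).astype(float)
    X = np.column_stack([np.ones(n), t, dummies])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def forecast(coef, t_new, season_length):
    """Predict the value at time index t_new (1-based) from fitted coefficients."""
    season = (t_new - 1) % season_length
    dummies = (season == np.arange(1, season_length)).astype(float)
    x = np.concatenate([[1.0, float(t_new)], dummies])
    return float(x @ coef)

# Hypothetical quarterly data: intercept 10, slope 2, seasonal offsets 0, 5, -3, 1.
offsets = [0, 5, -3, 1]
y = [10 + 2 * t + offsets[(t - 1) % 4] for t in range(1, 9)]
coef = fit_trend_seasonal(y, season_length=4)
next_value = forecast(coef, t_new=9, season_length=4)
```

Dropping the dummy columns leaves the trend-only model described in the first paragraph of this section.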

### Smoothing methods

Unlike regression models that are based on assumptions about trend or noise structure, time series smoothing methods are designed to adapt to changes in the data over time. Smoothing reduces noise by taking averages of observations over multiple periods. We discuss the two most common smoothing methods, moving average and exponential smoothing, below:

• Moving average – The moving average method generates a series of averages by taking the mean of values in the time series within designated periods. It is presumed that observations that are close in time are probably also similar in value, so taking an average smooths out much of the noise. Moving averages are usually taken of the most recent data points.

• Exponential smoothing – Exponential smoothing takes a weighted average over all past values, giving more weight to the most recent observations. The purpose is to acknowledge older information, while prioritizing the most recent data.
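
Both methods fit in a few lines. This is a minimal sketch with hypothetical function names: a trailing moving average, and simple exponential smoothing initialized with the first observation (one of several common initializations).

```python
def moving_average(values, window):
    """Trailing moving average: mean of each run of `window` consecutive values."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

def exponential_smoothing(values, alpha):
    """Simple exponential smoothing with smoothing factor 0 < alpha <= 1."""
    smoothed = [values[0]]  # initialize with the first observation
    for value in values[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

averages = moving_average([1, 2, 3, 4, 5], window=3)
smoothed = exponential_smoothing([10, 20, 30], alpha=0.5)
```

In either case, the last smoothed value can serve as the forecast for the next period. A larger `alpha` reacts faster to recent changes, and a smaller one smooths more heavily.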

## Special considerations for time series data

As you may know, developing a model requires dividing data into training and validation sets. With cross-sectional data, you would randomly divide the data into these groups. However, you can’t randomly divide data that has a sequential time element. For time series modeling, earlier data is used as the training set, while newer data is used as the validation set.

In addition, when it is time to forecast new values, the final model is refit on the entire data set, not just the training partition. This is because the data most relevant to a forecast are the observations that happened most recently.
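
Here is a small sketch of a chronological split, evaluated with a naive forecast. The helper names, the monthly values, and the choice of mean absolute error as the metric are all illustrative assumptions.

```python
def chronological_split(series, validation_size):
    """Hold out the most recent `validation_size` observations for validation."""
    if not 0 < validation_size < len(series):
        raise ValueError("validation_size must be between 1 and len(series) - 1")
    return series[:-validation_size], series[-validation_size:]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

series = [112, 118, 132, 129, 121, 135, 148, 148]  # hypothetical monthly values
train, valid = chronological_split(series, validation_size=3)

# Evaluate a naive forecast (repeat the last training value) on the validation set.
naive_predictions = [train[-1]] * len(valid)
error = mean_absolute_error(valid, naive_predictions)
```

No shuffling happens at any point: every validation observation comes after every training observation.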

## Developing your own time series model

The process of developing your own time series machine learning model is similar to how you would develop a model using cross-sectional data. We’ve detailed this process here, but in short, here are the steps:

1. Find a problem to solve or a question to answer
2. Gather the data necessary to be able to answer the question
3. Perform exploratory analysis
4. Choose and test out models
5. Evaluate the accuracy of your chosen model

If you’re unsure where to start, check out this introductory post as well as the time series datasets found on Kaggle and the UCI Machine Learning Repository. Both have dozens of datasets specific to this machine learning technique.

# Introducing GitHub Source Code Management for Algorithmia

As a data scientist or machine learning engineer, your specialty is building out robust machine learning models. Your purpose is to make a positive impact on your business by offering data insights, reducing costs, increasing revenue, or even generating customer delight.

As your collection of models gets larger, it quickly becomes difficult to manage your code and collaborate with other members of your team unless you are implementing best practices such as version control and using a code repository.

## Model management through centralized repositories

A centralized repository increases the visibility of your models so there is less duplication of work and also provides other teams the opportunity to use those models to solve business problems quickly by not reinventing the wheel.

Algorithmia already provides a centralized repository for your algorithms in production that are backed by Git and served via our REST API. Our platform offers flexibility in where you store your source code and how you interact with your algorithm in your development process.

You can host your source code on Algorithmia, either on the Public version or on your private Algorithmia Enterprise instance. You can then work in our Web IDE, or use our CLI tool and APIs to stay in your local environment of choice.

Algorithmia is happy to announce that we have expanded our source code management offerings, adding to the benefits of having a centralized repository for increased model management. And there is more to come.

## GitHub-hosted source code for model management

When multiple users contribute to an algorithm’s development, there can be many points of friction. Problems can arise in the code base, such as an inability to track what’s changed and who made that change. To organize the development process, enterprises need a centralized source code repository and a set of controls over what code gets implemented and how to track changes.

When you need to collaborate with other team members on the same algorithm, you can now take advantage of model management by tracking who has contributed to your code base, who has updated the code and when, and other important auditing features. You can also use GitHub features like GitHub Actions.

By connecting your Algorithmia and GitHub accounts, you can store your source code on GitHub and deploy it to an algorithm in production on Algorithmia. This way, multiple users can easily contribute to the same algorithm, collaborate on a centralized code base, and ensure code quality with best practices like code reviews through pull requests and issue tracking.

You can also take advantage of GitHub’s governance applications around dependency management to ensure that your models aren’t at risk of utilizing package versions with deprecated features. These governance features enhance Algorithmia’s current model management workflow for reproducibility of machine learning models in production.

### Getting started

This guide will show you how easy it is to start using GitHub with Algorithmia.

First, click on “Create a New Algorithm” in the Algorithmia Web IDE. If you’ve never created an algorithm before, learn how in the Developer Center.

You’ll see a form (pictured below) to fill out, and if you have created algorithms before on Algorithmia, you’ll notice that there are now two options for Repository Host, which determines where your source code is stored. You’ll want to choose the GitHub option and then authorize your account:

Once your accounts are linked, Algorithmia will be able to create repositories linked to that account. Note that you can create an algorithm under any organization you belong to in GitHub, or under the GitHub user account you connected to:

After going through the configuration steps to create your algorithm, you’ll get a notification telling you that you’ve successfully created the repository and it will show you your new algorithm in the Web IDE:

Be aware that when you click on “Source Code” for your new algorithm, you will be redirected to the GitHub repository for that algorithm so you can work in the environment you are most familiar with.

Now you can set up GitHub Actions, Azure Pipelines or numerous other integrations by using Algorithmia linked with GitHub repositories.

For a full step-by-step tutorial on how to get started with hosting your source code on GitHub, check out our guide for Source Code Management with GitHub.

## Source code management and what’s next

With GitHub integrations or hosting your source code on Algorithmia, you can easily take advantage of ML model management best practices.

We are constantly working on more integrations, including other version control systems and continuous integration pipelines that will enable our users to manage their codebases and deployments seamlessly with Algorithmia.

Stay tuned for these and other new features that enhance your organization’s ability to connect, deploy, scale, and manage your machine learning pipelines.

# Time series data analysis advances DevOps

Time series data consists of data points that each carry a timestamp, which allows them to be indexed in time order. Workloads on this data are in most cases INSERT-intensive and often call for specialized time series databases rather than traditional relational (SQL) databases.

Prior to advancements in machine learning, much of the time series data analysis completed by DevOps engineers was limited to simple averages of key metrics with associated timestamps. By setting thresholds on those metrics in conjunction with timestamps, simple alert systems were born. Now, DevOps engineers are using time series data in ways that benefit from enhancements in the field of artificial intelligence.

## DevOps strives for 100% uptime using historical time series data

While standard alerts are useful for determining if a service or system is close to failure, DevOps now has the ability to see valuable trends in time series data. Rather than being reactive, engineers are adding methods to their tool belts to prevent system outages and prepare for events based on historical data.

This proactive approach is one of the key tenets of today’s ML DevOps methodology. Rather than focusing on thresholds, DevOps can act on anomalies that machine learning models find in time series data.
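
To illustrate the difference between a fixed threshold and an anomaly-based check, here is a minimal sketch. It uses a rolling z-score, a simple statistical stand-in rather than a trained machine learning model. The function name, window size, z-score cutoff, and metric values are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(values, window, z_threshold=3.0):
    """Flag indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        # Skip flat windows, where the z-score is undefined.
        if sigma > 0 and abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical response times in milliseconds, with one spike at index 6.
latencies = [10, 11, 10, 12, 11, 10, 50, 11, 10]
spikes = find_anomalies(latencies, window=5)
```

A fixed threshold has to be chosen in advance. A check like this one adapts to whatever the metric has recently been doing.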

### Time series in action

Let’s look at an example of some time series data that DevOps engineers are already familiar with.

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Above, we see an HTTP log entry with a number of data points. Information like IP address, user information, request result, and above all, timestamp data are all collected.
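
As a sketch of how such an entry becomes usable time series data, the code below parses the example line with a regular expression and converts the timestamp into a timezone-aware `datetime`. `LOG_PATTERN` and `parse_log_line` are hypothetical names. The request field is assumed to be delimited by straight double quotes, as in raw server logs, and the month abbreviation is assumed to be parsed under an English locale.

```python
import re
from datetime import datetime

# Pattern for the fields in the example line: host, identity, user,
# timestamp, request, status code, and response size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
entry = parse_log_line(line)
```

Once every line is parsed this way, the entries can be sorted or grouped by `entry["time"]` and analyzed like any other time series.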

DevOps can use this information to identify application failures or broken links to various assets. Engineers can act on these insights and resolve issues they encounter.

### DevOps analytics and time series data

Alternatively, engineers can identify trends in this time series data that allow for proactive capacity planning. If the data shows an increase in the number of requests during a holiday or other event, DevOps engineers can use scaling techniques on the production resources to ensure a good user experience. Using this type of data for capacity planning minimizes outages due to lack of resources, which saves time and prevents the revenue loss of a preventable outage.
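
One simple way to surface such trends is to bucket request timestamps by hour and compare the counts against what the current resources can handle. The sketch below assumes the timestamps have already been extracted from the logs. The function names, the sample timestamps, and the capacity figure are hypothetical.

```python
from collections import Counter
from datetime import datetime

def requests_per_hour(timestamps):
    """Count requests in each hourly bucket, ordered by time."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(sorted(buckets.items()))

def hours_over_capacity(hourly_counts, capacity):
    """Return the hours in which the request count exceeded `capacity`."""
    return [hour for hour, count in hourly_counts.items() if count > capacity]

# Hypothetical request timestamps from a busy shopping day.
stamps = [
    datetime(2020, 11, 27, 9, 5),
    datetime(2020, 11, 27, 9, 40),
    datetime(2020, 11, 27, 9, 59),
    datetime(2020, 11, 27, 10, 15),
]
hourly = requests_per_hour(stamps)
busy_hours = hours_over_capacity(hourly, capacity=2)
```

Comparing these hourly counts across past holidays or events gives a starting point for deciding when to scale up ahead of time.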

## Artificial intelligence identifies trends in economic time series data

Just as DevOps engineers are able to take advantage of advancements in machine learning, economics is also benefiting from the new technology. A great example of this is to see how time series data identifies trends in the stock market. AI is creating new ways to conduct risk analysis so that investors have a clearer picture of historical trends for individual companies as well as the market as a whole.

Time series data can also provide more in-depth cost-benefit analysis, including forecasting based on data used during the training of various ML models. Ultimately, this gives insight into additional scenarios with feedback to support it when presenting to stakeholders. This type of data was rarely available prior to the introduction of artificial intelligence, making it quite valuable.

What makes today’s big data challenges more complicated, however, is the need for data scientists to have access to large datasets alongside the models they use. When working with data that involves transactions, it is critical to have an appropriate layer of security as well.

Algorithmia allows a team’s DevOps engineers to implement a solution based on proven DevOps processes. At the same time, Algorithmia allows data scientists to branch out and innovate with their own machine learning models, or those already deployed in the Algorithmia platform.

### Time series data benefits from specialized database formats

Due to the nature of time series datasets, the database chosen must be scalable and highly available. Typical databases do not provide the throughput or storage capacity needed for the large amounts of data surrounding ecommerce.

Specialized database formats are available that take advantage of advancements in software engineering, making them perfect for the types of intense analysis needed to make sense of large amounts of data. By hosting time series data in appropriate formats, data scientists and DevOps engineers also benefit from a usability standpoint.

Many functions for data retention, aggregation based on time elements, and common query tasks are built in, eliminating the need for additional DevOps processes around maintenance. The result of having the right data in the right place is an increase in efficiency across the board.

## Algorithmia provides a full solution for time series data analysis

Algorithmia recognizes the need for storing specialized datasets. Additionally, time series data is often stored with major cloud providers as dictated by business needs. You can seamlessly connect many major cloud-platform storage accounts for use in ML models, all while providing a single point of integration that handles all aspects of security and scalability.

Algorithmia’s Public instance includes ML models that fully utilize the way today’s time series data is stored. Using these models in combination with yours and those of the Algorithmia community fosters the innovation needed to advance AI for today’s big data needs. Data scientists can focus on their jobs and DevOps can ensure capacity and uptime remain at appropriate service levels.

# The best AI programming languages to use

Implementing any type of AI system involves writing code, and a variety of programming languages lend themselves to specific AI or machine learning tasks. Let’s look at which programming languages will be the most beneficial for your specific use cases.

We have composed a simple list showing which five programming languages are best to learn if you want to be successful in the artificial intelligence industry. Each has its own particular strengths and weaknesses for a given project, so consider your end goals before selecting a language.

These programming languages include:

• Python
• R
• Java
• Scala
• Rust

## Python

Python is by far the most popular programming language used in artificial intelligence today because it has easy-to-learn syntax, massive libraries and frameworks, and dynamic applicability to a plethora of AI algorithms, and because it is relatively simple to write.

Python supports multiple programming styles, including functional, object-oriented, and procedural. In addition, its massive community helps to keep this language at the forefront of the computer science industry.

The disadvantages of Python include its lack of speed compared to some of the other languages, its less than optimal mobile coding capabilities, and the difficulty it has with memory-intensive tasks.

## R

R is another machine learning programming language that is relatively easy to understand. The most common uses of R are for data analysis, big data modeling, and data visualization. R’s abundance of package sets and variety of materials make it easy to work with on data-centric tasks.

The disadvantages of R include its excess use of memory, lack of basic security (unable to embed into web applications), and the fact that it is rooted in an older programming language, S.

## Java

Java is object-oriented, and its strengths include working well with search algorithms, a simplified framework that supports large-scale projects efficiently, and ease of debugging. In addition, it is supported by a well-established community and has a myriad of open-source libraries.

The disadvantages of Java include its lack of performance speed compared to other languages and the inefficient use of memory that comes with running on top of the Java Virtual Machine. These two shortcomings generally result in a third: the increased cost of hardware.

## Scala

Scala is a highly scalable programming language that can handle large amounts of data. Being multi-paradigm, Scala supports both object-oriented and functional styles of programming. Due to its concise code, Scala can be more readable and easier to write than similar languages such as Java. Its speed and efficiency are what make this language stand out for machine learning and AI models, with relatively error-free coding that is easy to debug when necessary.

The disadvantages of Scala include side effects that come with fulfilling both object-oriented and functional styles. Since this language is a combination of both programming styles, it can make understanding type-information more difficult. In addition, the option to switch back to an object-oriented style can be seen as a downside, as you won’t be forced to think functionally while you code.

## Rust

Rust is a systems-level programming language. It was created with the intention of writing “safe” code, meaning that objects are managed in the program itself. This relieves the programmer of doing pointer arithmetic or having to independently manage memory. The inability to use excess memory often results in cleaner code, potentially making it easier to program.

The disadvantages of Rust include a slower compiler than other languages, no garbage collection, and code that typically cannot be developed as quickly as in other programming languages, such as Python.

## With Algorithmia, you can use multiple languages within one AI software

Algorithmia provides a machine learning architecture that invites programmers to pipeline models together, even if they’re written in different languages. This removes any need to translate algorithms into a certain language to be compatible with the rest of the algorithms in a monolithic architecture.

You can also reuse pieces of the software by calling them into the application whenever they’re needed, without copying and pasting them. In this way, Algorithmia helps organizations create better software, faster.

Watch our video demo to learn how Algorithmia can help your organization increase efficiency in the last-mile machine learning process: deploying, serving, and managing models.


# Linear regression for machine learning

Linear regression in machine learning is a supervised learning technique that comes from classical statistics. However, with the rapid rise of machine learning and deep learning, its use has surged as well, because neural networks with linear (multilayer perceptron) layers perform regression.

This regression is typically linear, but when non-linear activation functions are incorporated into these networks, they become capable of performing non-linear regression.

Non-linear regression models the relationship between input and output using some form of non-linear function, for example a polynomial or an exponential. Non-linear regressors can be used to model common relationships in science and economics, such as the exponential decay of a radioactive material or the trend in stock market performance in accordance with the overall global economy.

## How does linear regression work?

Stepping back from the neural network view, we can specify linear regression models as a simple mathematical relationship. Succinctly put, linear regression models a linear dependency between an input and an output variable. Depending upon what context you’re working in, these inputs and outputs are referred to by different terms.

Most commonly, we have a training dataset with $k$ examples, each having $n$ input components, $x_1, \ldots, x_n$, called the regressors, covariates, or exogenous variables. The output vector $\mathbf{y}$ is called the response variable, output variable, or the dependent variable. In multivariate linear regression, there can be multiple such output variables. The parameters of the model, $w_0, w_1, \ldots, w_n$, are called the regression coefficients, or in the deep learning context, the weights. The model has the form

$$y = w_0 + w_1x_1 + \cdots + w_nx_n$$

for a single training example $\mathbf{x} = [x_1, \ldots, x_n]$ with output $y$. We can also make this notation compact by compressing the training data into a matrix $X \in \mathbb{R}^{k \times (n+1)}$, whose $i$-th row holds the components $x_{i1}, \ldots, x_{in}$ of the $i$-th training example,

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1n} \\ 1 & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{k1} & \cdots & x_{kn} \end{pmatrix},$$

and the weights into a vector, $\mathbf{w} = [w_0, w_1, \ldots, w_n]^{\top}$. The weights form the core of the model. They encode the linear relationship between input and output, placing more emphasis on data features which are important and down-weighting those that are not. Note that we add a “hidden component” to each of the rows of $X$ that has value 1. This allows us to compute a dot product with $\mathbf{w}$, which has a bias term, $w_0$. The bias term allows the model to shift the linear hyperplane it computes off of the origin, permitting it to model relationships in data that are not zero-centered. The simplified model can then be expressed as

$$y = X\mathbf{w}$$
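
In code, the compact form amounts to prepending a column of ones to the feature matrix and taking a matrix-vector product. This is a minimal NumPy sketch. The helper names, feature values, and weights are made up for illustration.

```python
import numpy as np

def add_bias_column(features):
    """Prepend a column of ones so that w[0] acts as the bias term."""
    features = np.asarray(features, dtype=float)
    return np.column_stack([np.ones(len(features)), features])

def predict(X, w):
    """Compute predictions X w for every row of X."""
    return X @ w

# Three hypothetical examples with two input components each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X = add_bias_column(features)    # shape (3, 3): ones column plus two features
w = np.array([0.5, 2.0, -1.0])   # bias w_0 followed by w_1 and w_2
y_hat = predict(X, w)
```

Each entry of `y_hat` equals `w[0] + w[1] * x_1 + w[2] * x_2` for the corresponding row.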

This is the basic model that underlies most implementations of linear regression; however, there are many variations that can exist on top of this basic structure, each conferring its own drawbacks and benefits. For example, there’s a version of linear regression called Bayesian linear regression, which introduces a Bayesian perspective by placing prior distributions on the weights of the model. This makes it easier to reason about what the model’s doing and subsequently makes its results more interpretable.

## Training a linear regression model

So how do we train a linear regression model? Well, the process is similar to what’s used with most machine learning models. We have a training set $\mathcal{D} = \{(x^{(1)}, y^{(1)}),\ldots, (x^{(k)}, y^{(k)})\}$ and our task is to model this relationship as closely as possible without affecting the model’s ability to predict on new examples. To that end, we define a loss, or objective, function $J_{\mathbf{w}}(\hat{y}, y)$ which takes in the true output $y$ and the predicted output $\hat{y}$ and measures “how well” the model is doing at predicting $y$ given $\mathbf{x}$. We use the subscript $\mathbf{w}$ to indicate that the output of $J$ is dependent on and parameterized by the model’s weights, $\mathbf{w}$, via the prediction $\hat{y}$, even though those weight values don’t explicitly show up in the function’s calculation. For linear regression, we typically use the mean-squared error (MSE) loss function. Up to a constant factor, which does not change the optimal weights, it is defined as

$$J_\mathbf{w}(\hat{y}, y) = \frac{1}{2} \sum_{i=1}^k (\hat{y}^{(i)} - y^{(i)})^2$$

We can then optimize this loss function using one of a variety of techniques. We could use something like gradient descent, the de facto standard for training neural networks, but this is actually not necessary for linear regression. This is because we can actually solve the optimization problem directly in order to find the optimum value for the weights, $\mathbf{w}^*$.

Since we want to optimize this for $\mathbf{w}$, we take the gradient with respect to $\mathbf{w}$, set the result to 0, then solve for $\mathbf{w}^*$, the optimal setting of $\mathbf{w}$. We have

\begin{align*} \nabla_{\mathbf{w}} J_\mathbf{w}(\hat{y}, y) &= \nabla_{\mathbf{w}} \frac{1}{2} (y - X\mathbf{w})^\top(y - X\mathbf{w}) \\ &= \frac{1}{2} \nabla_\mathbf{w} \left(y^\top y - y^\top X \mathbf{w} - \mathbf{w}^\top X^\top y + \mathbf{w}^\top X^\top X \mathbf{w}\right) \\ &= -y^\top X + \mathbf{w}^\top X^\top X \\ \end{align*}

Now we set the gradient equal to 0, solve for $\mathbf{w}$, and transpose both sides to express the result as a column vector

\begin{align*} 0 &= -y^\top X + \mathbf{w}^\top X^\top X \\ y^\top X &= \mathbf{w}^\top X^\top X \\ X^\top y &= X^\top X \mathbf{w} \\ \mathbf{w}^* &= (X^\top X)^{-1} X^\top y \end{align*}

This is the optimal setting of $\mathbf{w}$ that will give the model with the best results. As you can see, it’s computed solely using products of $X$ and $y$. However, it requires a matrix inversion of $X^\top X$ which can be computationally difficult when $X$ is very large or poorly conditioned. In these cases, you could use an inexact optimization method like gradient descent or techniques designed to approximate the matrix inverse without actually computing it.
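
The sketch below compares the direct solution with gradient descent on a small synthetic problem. It solves the linear system $X^\top X \mathbf{w} = X^\top y$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more stable choice. The function names, data, learning rate, and step count are illustrative assumptions. The gradient descent version averages the gradient over the examples, which rescales the loss but leaves the minimizer unchanged.

```python
import numpy as np

def squared_error_loss(X, y, w):
    """Half the sum of squared residuals between predictions X w and targets y."""
    residual = X @ w - y
    return 0.5 * float(residual @ residual)

def fit_normal_equations(X, y):
    """Solve (X^T X) w = X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, learning_rate=0.1, steps=2000):
    """Iteratively minimize the squared error, averaged over the examples."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        gradient = X.T @ (X @ w - y) / len(y)
        w -= learning_rate * gradient
    return w

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.column_stack([np.ones(50), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w

w_star = fit_normal_equations(X, y)
w_gd = fit_gradient_descent(X, y)
```

On a problem this small the two approaches agree closely. The iterative version becomes attractive only when $X$ is too large or too poorly conditioned for a direct solve.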

## Regularization

Probably the most commonly used variants of linear regression are those models which involve added regularization. Regularization refers to the process of penalizing model weights which are large in absolute value. Usually this is done by computing some norm of the weights as a penalty term added onto the cost function. The purpose of regularization is usually to mitigate overfitting, the tendency of a model to too closely replicate the noise and idiosyncrasies of its training data, which prevents it from generalizing well to unseen examples. There are two basic types of regularization for linear regression models: L1 and L2.

L1 regularization adds the L1 norm of the weight vector $\mathbf{w}$ as a penalty term to the objective function. The L1 norm is defined as

$$\|\mathbf{w}\|_1 = |w_0| + |w_1| + \cdots + |w_n|$$

Regression models which employ L1 regularization are said to perform lasso regression.

In contrast, L2 regularization adds the squared L2 norm of the weight vector $\mathbf{w}$ as a penalty term to the objective function. The squared L2 norm is defined as

$$\|\mathbf{w}\|_2^2 = w_0^2 + w_1^2 + \cdots + w_n^2$$

Regression models regularized using L2 regularization are said to perform ridge regression.
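
Ridge regression keeps a closed-form solution: a penalty proportional to the sum of squared weights adds a multiple of the identity matrix to $X^\top X$. The sketch below assumes the objective $\frac{1}{2}\|y - X\mathbf{w}\|^2 + \frac{\lambda}{2}\sum_j w_j^2$. The function name, the data, and the choice to penalize the bias term along with the other weights are illustrative simplifications.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

# Hypothetical noisy data generated from y = 1 + 2*x1 - 3*x2.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=30)

w_unregularized = fit_ridge(X, y, lam=0.0)   # reduces to ordinary least squares
w_ridge = fit_ridge(X, y, lam=10.0)          # weights shrink towards zero
```

Increasing `lam` trades a closer fit to the training data for smaller weights.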

So how do these regularization penalties qualitatively affect the model’s results (outputs)? Well, it turns out that L2 regularization produces weight coefficients which are small but diffuse. That’s to say, it tends to produce models where each of the coefficients $w_0, \ldots, w_n$ is relatively small and relatively similar in magnitude.

In contrast, L1 regularization tends to be more specific about the way in which it penalizes coefficients. Certain of these coefficients tend to be penalized heavily and driven towards values of 0, whereas some remain relatively unchanged. The weights that L1 regularization produces are often said to be sparse.

In that vein, some also contend that L1 regularization actually performs a sort of soft feature selection, i.e. the selection of features (components in the data) which are the most important for producing the desired result. By driving certain weights to 0, the model is indicating that these variables are actually not particularly helpful or explanatory in its action.
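
The contrast is easiest to see in the special case where the columns of $X$ are orthonormal ($X^\top X = I$). There, both penalized solutions can be written directly in terms of the unregularized weights: ridge rescales every weight, while lasso applies soft-thresholding. The sketch below assumes the objectives $\frac{1}{2}\|y - X\mathbf{w}\|^2 + \frac{\lambda}{2}\sum_j w_j^2$ for ridge and $\frac{1}{2}\|y - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$ for lasso. The function names and weight values are hypothetical.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    """Ridge solution when X^T X = I: every coefficient is scaled by 1 / (1 + lam)."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """Lasso solution when X^T X = I: soft-thresholding of each coefficient."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

# Hypothetical unregularized weights: two large, two small.
w_ols = np.array([3.0, -0.2, 0.1, -1.5])
w_ridge = ridge_shrink(w_ols, lam=0.5)   # all entries shrink, none become zero
w_lasso = lasso_shrink(w_ols, lam=0.5)   # small entries become exactly zero
```

With general, non-orthonormal data the lasso has no closed form and is fit iteratively, but the same thresholding behavior is what produces sparse weights.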

## Uses of linear regression

Linear regression can be used just about anywhere a suspected linear relationship in data exists. For businesses, this could come in the form of sales data. For example, a business might introduce a new product into the market but be unsure of the price point at which to sell it.

By testing customer response in the form of gross sales at a few selected price points, a business can extrapolate the relationship between price and sales using linear regression in order to determine the optimal point at which to sell their product.
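
As a rough illustration of the pricing example, the sketch below fits a line to units sold at a few test prices and then finds the price that maximizes revenue under that fitted line. The data are invented, and treating "optimal" as revenue-maximizing is an assumption made here for the sake of the example.

```python
import numpy as np

# Hypothetical test-market results: price points and units sold at each.
prices = np.array([8.0, 10.0, 12.0, 14.0])
units = np.array([120.0, 100.0, 80.0, 60.0])

# Fit units = intercept + slope * price by least squares.
slope, intercept = np.polyfit(prices, units, deg=1)

# Revenue = price * (intercept + slope * price) is a downward-opening parabola
# when slope < 0, so it peaks at price = -intercept / (2 * slope).
best_price = -intercept / (2 * slope)
```

With real data the relationship is rarely this clean. It is worth checking the fit, and being cautious about any price far outside the range that was actually tested.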

Similarly, linear regression can be employed at many stages in a product’s sourcing and production pipeline. A farmer, for example, might want to model how changes in certain environmental conditions such as rainfall and humidity affect overall crop yield. This can help her determine an optimized system for growing and rotating crops in order to maximize profits.

Ultimately, linear regression is an invaluable tool for modeling simple relationships in data. While it’s not as fancy or as complex as more modern machine learning methods, it’s often the right tool for many real-world datasets in which a straightforward relationship exists. Not to mention, the ease of setting up regression models and the quickness with which they can be trained make them the tool of choice for businesses that want to prototype quickly and efficiently.