# Linear regression for machine learning

Linear regression is a supervised learning technique that comes from classical statistics. With the rapid rise of machine learning and deep learning, its use has surged as well, because neural networks built from linear (multilayer perceptron) layers perform regression.

This regression is typically linear, but when non-linear activation functions are incorporated into these networks, they become capable of performing non-linear regression.

Non-linear regression models the relationship between input and output using some form of non-linear function, for example a polynomial or an exponential. Non-linear regressors can be used to model common relationships in science and economics, such as the exponential decay of a radioactive isotope or the trend in stock market performance relative to the overall global economy.

## How does linear regression work?

Stepping back from the neural network view, we can specify linear regression models as a simple mathematical relationship. Succinctly put, linear regression models a linear dependency between an input and an output variable. Depending upon what context you’re working in, these inputs and outputs are referred to by different terms.

Most commonly, we have a training dataset with $k$ examples, each having $n$ input components, $x_1, \ldots, x_n$, called the *regressors*, *covariates*, or *exogenous variables*. The output $y$ is called the *response variable*, *output variable*, or the *dependent variable*. In multivariate linear regression, there can be multiple such output variables. The parameters of the model, $w_0, w_1, \ldots, w_n$, are called the *regression coefficients*, or in the deep learning context, the *weights*. The model has the form

$$y = w_0 + w_1x_1 + \cdots + w_nx_n$$

for a single training example $\mathbf{x} = [x_1, \ldots, x_n]$. We can also make this notation compact by compressing the training data into a matrix $X \in \mathbb{R}^{k \times (n+1)}$,

$$X = \begin{pmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_k^\top \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1n} \\ 1 & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{k1} & \cdots & x_{kn} \end{pmatrix}$$

and the weights into a vector, $\mathbf{w} = [w_0, w_1, \ldots, w_n]^{\top}$. The weights form the core of the model. They encode the linear relationship between input and output, placing more emphasis on the data features that are important and down-weighting those that are not. Note that we prepend a constant component with value 1 to each row of $X$. This lets the dot product with $\mathbf{w}$ pick up the bias term, $w_0$. The bias term allows the model to shift the hyperplane it computes away from the origin, permitting it to model relationships in data that are not zero-centered. Stacking the $k$ outputs into a vector $y$, the model can then be expressed compactly as

$$y = X\mathbf{w}$$
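To make the matrix form concrete, here is a minimal NumPy sketch. The helper names `add_bias_column` and `predict`, and the toy numbers, are made up for illustration and are not part of any particular library.

```python
import numpy as np

def add_bias_column(features: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so that w[0] acts as the bias term w_0."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def predict(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute the model output X w for every example (row) at once."""
    return X @ w

# Toy data (invented for this example): k = 3 examples, n = 2 features.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
w = np.array([0.5, 2.0, -1.0])  # [w_0, w_1, w_2]

X = add_bias_column(features)   # shape (3, 3), i.e. k x (n + 1)
y_hat = predict(X, w)           # one prediction per example
print(y_hat)
```

Each entry of `y_hat` is the same sum $w_0 + w_1x_1 + \cdots + w_nx_n$ from the single-example formula, computed for all rows in one matrix product.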

This is the basic model that underlies most implementations of linear regression; however, there are many variations built on top of this basic structure, each with its own drawbacks and benefits. For example, there’s a version of linear regression called Bayesian linear regression, which introduces a Bayesian perspective by placing prior distributions on the weights of the model. The result is a posterior distribution over the weights rather than a single point estimate, which makes it easier to reason about how certain the model is and can make its results more interpretable.

## Training a linear regression model

So how do we train a linear regression model? The process is similar to what’s used with most machine learning models. We have a training set $\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}),\ldots, (\mathbf{x}^{(k)}, y^{(k)})\}$ and our task is to model this relationship as closely as possible without harming the model’s ability to predict on new examples. To that end, we define a loss, or objective, function $J_{\mathbf{w}}(\hat{y}, y)$, which takes in the true output $y$ and the predicted output $\hat{y}$ and measures how well the model predicts $y$ given $\mathbf{x}$. We use the subscript $\mathbf{w}$ to indicate that the output of $J$ depends on the model’s weights, $\mathbf{w}$, via the prediction $\hat{y}$, even though those weight values don’t explicitly show up in the function’s calculation. For linear regression, we typically use a squared-error loss closely related to the mean-squared error (MSE). Here we use the sum of squared errors scaled by $\frac{1}{2}$, which differs from the MSE only by a constant factor and therefore has the same minimizer. It is defined as

$$J_\mathbf{w}(\hat{y}, y) = \frac{1}{2} \sum_{i=1}^k (\hat{y}^{(i)} - y^{(i)})^2$$
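As a quick illustration, the loss can be computed in a couple of lines of NumPy. The function names and the three sample values below are invented for this sketch; the second function is the plain mean-squared error, shown only to make the constant-factor relationship visible.

```python
import numpy as np

def half_sum_squared_error(y_hat: np.ndarray, y: np.ndarray) -> float:
    """The loss J: half the sum of squared differences between prediction and target."""
    residuals = y_hat - y
    return 0.5 * float(np.sum(residuals ** 2))

def mean_squared_error(y_hat: np.ndarray, y: np.ndarray) -> float:
    """The average squared difference, i.e. J multiplied by 2 / (number of examples)."""
    return float(np.mean((y_hat - y) ** 2))

# Made-up predictions and targets for three examples.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

half_sse = half_sum_squared_error(y_hat, y)
mse = mean_squared_error(y_hat, y)
print(half_sse, mse)
```

Because the two quantities differ only by a positive constant, the weights that minimize one also minimize the other.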

We can then optimize this loss function using one of a variety of techniques. We could use something like *gradient descent*, the de facto standard for training neural networks, but this is not necessary for linear regression, because we can solve the optimization problem directly to find the optimal value of the weights, $\mathbf{w}^*$.

Since we want to optimize this for $\mathbf{w}$, we substitute $\hat{y} = X\mathbf{w}$, write the loss in matrix form, take the gradient with respect to $\mathbf{w}$, set the result to 0, then solve for $\mathbf{w}^*$, the optimal setting of $\mathbf{w}$. Writing the gradient as a row vector, we have

$$\begin{align*}
\nabla_{\mathbf{w}} J_\mathbf{w}(\hat{y}, y) &= \nabla_{\mathbf{w}} \tfrac{1}{2} (y - X\mathbf{w})^\top(y - X\mathbf{w}) \\
&= \tfrac{1}{2} \nabla_\mathbf{w} \left(y^\top y - y^\top X \mathbf{w} - \mathbf{w}^\top X^\top y + \mathbf{w}^\top X^\top X \mathbf{w}\right) \\
&= -y^\top X + \mathbf{w}^\top X^\top X
\end{align*}$$

Now we set the gradient equal to 0, transpose both sides, and solve for $\mathbf{w}$, assuming $X^\top X$ is invertible:

$$\begin{align*}
0 &= -y^\top X + \mathbf{w}^\top X^\top X \\
\mathbf{w}^\top X^\top X &= y^\top X \\
X^\top X \mathbf{w} &= X^\top y \\
\mathbf{w}^* &= (X^\top X)^{-1} X^\top y
\end{align*}$$

This is the setting of $\mathbf{w}$ that minimizes the loss on the training data. As you can see, it’s computed solely using products of $X$ and $y$. However, it requires inverting $X^\top X$, which can be computationally expensive when $X$ has many columns and numerically unstable when $X^\top X$ is poorly conditioned. In these cases, you could use an iterative optimization method like gradient descent, or techniques that solve the underlying linear system without explicitly computing the matrix inverse.
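The sketch below compares three ways of finding the weights on a small synthetic dataset: solving the normal equations $X^\top X \mathbf{w} = X^\top y$ as a linear system, calling NumPy’s least-squares solver, and running plain gradient descent. The function names, the learning rate, the step count, and the data are all assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve (X^T X) w = X^T y as a linear system, without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_lstsq(X, y):
    """Least-squares solve that also copes with a poorly conditioned X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fit_gradient_descent(X, y, learning_rate=0.1, n_steps=1000):
    """Iteratively minimize the squared-error loss, averaged over the k examples."""
    k = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Gradient of (1 / (2k)) * ||Xw - y||^2 with respect to w.
        gradient = X.T @ (X @ w - y) / k
        w -= learning_rate * gradient
    return w

# Synthetic data (invented): 50 examples, 2 features, plus a bias column of ones.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_solve = fit_normal_equations(X, y)
w_lstsq = fit_lstsq(X, y)
w_gd = fit_gradient_descent(X, y)
print(w_solve, w_lstsq, w_gd)
```

On a well-conditioned problem like this one, all three approaches should land on essentially the same weights. Dividing the gradient by $k$ does not change the minimizer; it only keeps a sensible learning rate from depending on the dataset size.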

## Regularization

Probably the most commonly used variants of linear regression are those that add *regularization*. Regularization refers to the process of penalizing model weights that are large in absolute value. Usually this is done by adding some norm of the weights to the cost function as a penalty term. The purpose of regularization is usually to mitigate *overfitting*, the tendency of a model to fit the noise in its training data too closely, which prevents it from generalizing well to unseen examples. There are two basic types of regularization for linear regression models: L1 and L2.

L1 regularization adds the L1 norm of the weight vector $\mathbf{w}$ as a penalty term to the objective function. The L1 norm is defined as

$$\|\mathbf{w}\|_1 = |w_0| + |w_1| + \cdots + |w_n|$$

Regression models which employ L1 regularization are said to perform *lasso regression*.

In contrast, L2 regularization adds the squared L2 norm of the weight vector $\mathbf{w}$ as a penalty term to the objective function. The squared L2 norm is defined as

$$\|\mathbf{w}\|_2^2 = w_0^2 + w_1^2 + \cdots + w_n^2$$

Regression models that employ L2 regularization are said to perform *ridge regression*.
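Ridge regression keeps a closed-form solution. Assuming the objective is written as $\frac{1}{2}\|y - X\mathbf{w}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$, where $\lambda \geq 0$ is a penalty strength introduced here for illustration, the minimizer is $\mathbf{w}^* = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below implements that formula; the function name `fit_ridge` and the data are made up. For simplicity it penalizes every weight, whereas in practice the bias term is often left unpenalized.

```python
import numpy as np

def fit_ridge(X, y, penalty_strength):
    """Solve (X^T X + lambda * I) w = X^T y for the ridge-regularized weights."""
    n_params = X.shape[1]
    A = X.T @ X + penalty_strength * np.eye(n_params)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (invented): 30 examples, 3 features, no bias column for brevity.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

w_unregularized = fit_ridge(X, y, penalty_strength=0.0)
w_ridge = fit_ridge(X, y, penalty_strength=10.0)

print("no penalty:", w_unregularized)
print("ridge     :", w_ridge)
```

With the penalty set to 0 this reduces to ordinary least squares, and increasing it shrinks the overall size of the weight vector.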

So how do these regularization penalties qualitatively affect the weights the model learns? It turns out that L2 regularization produces weight coefficients that are small but diffuse. That is to say, it tends to produce models where each of the coefficients $w_0, \ldots, w_n$ is relatively small and relatively similar in magnitude.

In contrast, L1 regularization tends to be more selective about the way in which it penalizes coefficients. Some coefficients tend to be penalized heavily and driven to values of 0, whereas others remain relatively unchanged. The weights that L1 regularization produces are often said to be sparse.

In that vein, some also contend that L1 regularization performs a sort of soft *feature selection*, i.e. the selection of the features (components in the data) that are most important for producing the desired result. By driving certain weights to 0, the model is indicating that these variables are not particularly helpful or explanatory for its predictions.
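To see the difference in behavior, the sketch below fits both penalties to synthetic data in which only two of five features actually matter. The lasso fit uses proximal gradient descent (often called ISTA) with soft-thresholding, which is one of several possible solvers and is chosen here only because it is short. All function names, the penalty strength of 0.1, and the data are assumptions for illustration, and the bias term is omitted for brevity.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each entry of z toward 0 by `threshold`, clipping at exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def fit_lasso_ista(X, y, penalty_strength, n_steps=2000):
    """Minimize (1 / (2k)) * ||y - Xw||^2 + penalty_strength * ||w||_1."""
    k = X.shape[0]
    # Step size 1 / L, where L = sigma_max(X)^2 / k bounds the gradient's Lipschitz constant.
    step = k / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        gradient = X.T @ (X @ w - y) / k
        w = soft_threshold(w - step * gradient, step * penalty_strength)
    return w

def fit_ridge(X, y, penalty_strength):
    """Minimize (1 / (2k)) * ||y - Xw||^2 + (penalty_strength / 2) * ||w||_2^2."""
    k, n_params = X.shape
    A = X.T @ X / k + penalty_strength * np.eye(n_params)
    return np.linalg.solve(A, X.T @ y / k)

# Synthetic data (invented): only features 0 and 3 influence the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w_lasso = fit_lasso_ista(X, y, penalty_strength=0.1)
w_ridge = fit_ridge(X, y, penalty_strength=0.1)

print("lasso:", np.round(w_lasso, 3))
print("ridge:", np.round(w_ridge, 3))
```

On data like this, the lasso weights for the irrelevant features should come out as exact zeros, while the ridge weights for the same features are small but non-zero.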

## Uses of linear regression

Linear regression can be used just about anywhere a suspected linear relationship in data exists. For businesses, this could come in the form of sales data. For example, a business might introduce a new product into the market but be unsure of the price point at which to sell it.

By testing customer response in the form of gross sales at a few selected price points, a business can estimate the relationship between price and sales using linear regression in order to determine the optimal point at which to sell its product.

Similarly, linear regression can be employed at many stages in a product’s sourcing and production pipeline. A farmer, for example, might want to model how changes in certain environmental conditions such as rainfall and humidity affect overall crop yield. This can help her determine an optimized system for growing and rotating crops in order to maximize profits.

Ultimately, linear regression is an invaluable tool for modeling simple relationships in data. While it’s not as fancy or as complex as more modern machine learning methods, it’s often the right tool for real-world datasets in which a straightforward relationship exists. Moreover, the ease of setting up regression models and the speed with which they can be trained make them the tool of choice for businesses that want to prototype quickly and efficiently.