You’ve probably heard it said in machine learning that when it comes to getting great results, the data is even more important than the model you use. This is true, and it’s not just the native data that’s so important but also how we choose to transform it. This is where feature selection comes in. Feature selection in machine learning refers to the process of choosing the most relevant features in our data to give to our model. By limiting the number of features we use (rather than just feeding the model the unmodified data), we can often speed up training and improve accuracy, or both.

So how does feature selection work? There are a variety of methods for accomplishing the task, ranging from the simple to the absurdly complex, and some feature selection algorithms probably qualify as machine learning models in their own right! We’ll run through a few of the most prominent methods here.

The categories of feature selection methods

Feature selection methods are often divided up into three different categories as follows:

  1. Wrapper methods
  2. Filter methods
  3. Embedded methods

Wrapper methods

Wrapper methods are probably the best approach to feature selection (in terms of accuracy), but they also require the most computational resources. With wrapper methods, we train a model multiple times using different feature sets and then compare the resulting models via their cross validation accuracy. Because we evaluate the models using the same metric we’re interested in optimizing (the validation accuracy), wrapper methods tend to give the best results.

However, every time we add or eliminate a feature or set of features from the data we are required to retrain the model and evaluate on the validation set. This can quickly become very time-intensive and costly.

Recursive feature elimination

There are a few heuristics we can use to cut down on the evaluation time while still achieving pretty good results. One such method is called recursive feature elimination. With RFE, we first train a model with the entire, unmodified data set and then evaluate its performance. From there, we assign an importance score or weighting to each of the features (predictors) in the model. This can be something as simple as the magnitude of the weights assigned to that feature, or it could be some sort of calculation on the feature values themselves.

From there, we eliminate the feature with the lowest importance from the dataset and retrain the model. So long as the cross-validation accuracy continues to improve, we continue to iterate this process. Because the algorithm eliminates the least valuable feature at each step in the process, it can be described as a greedy algorithm.

Forward feature selection

The converse of RFE goes by a variety of names, one of which is forward feature selection. In this algorithm, we first evaluate the performance of the model with respect to each of the features in the dataset. This can be done, as before, by training the algorithm on the unmodified data and applying an importance score to each feature. From there, one can start with only the most important feature and train the model against that, evaluating the accuracy. Additional features are added to the data based on the initial importance ranking until the cross validation accuracy begins to decrease.

Filter methods

With filter methods the features are evaluated against a proxy rather than cross validation accuracy. This is not as accurate because the proxy can at best only be correlated with the cross validation accuracy. However, such proxies are often much quicker to evaluate and don’t require retraining the entire model at each step in the iteration. Therefore, they can often work quite well.

Mutual information

One proxy that can be used is the mutual information between features and the model outputs. Mutual information in this case provides a measure of how much useful information is conveyed to the model by including a given feature. Therefore, including features with a high mutual information can be helpful for boosting model accuracy.

Formally, the mutual information $I(X; Y)$ between two random variables $X$ and $Y$ is defined to be

$$\begin{align*} I(X; Y) &= H(Y) – H(Y|X) \\ &= \mathbb{E}_X[D_{KL}(p(Y|X), P(Y))] \\ &= \int \int p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\ dx\ dy \end{align*}$$

The above formula gives a few different ways of thinking about mutual information. First, it can be thought of as the decrease in entropy $H(Y) – H(Y|X)$ one gets from first observing $Y$ and then observing $Y$ given $X$. Since entropy can be thought of as a measure of surprise, what this is really calculating is the decrease in surprise you get from observing $Y$ if you already know $X$. This is a way of conveying how much information $X$ provides in that context.

The next line shows that the MI can also be thought of as the expected KL divergence between the two probability distributions governing $Y|X$ and $Y$.

Pointwise mutual information

Closely related to the MI is something called the pointwise mutual information. The PMI is basically the same thing as the MI except it applies to individual events/variables rather than probability distributions over those events. The PMI is frequently used in NLP for detecting word collocations (words that frequently occur together), because if two words often arise in the neighborhood of each other then they probably have a high PMI. The PMI is defined analytically as

$$\text{PMI}(x, y) = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)}$$

Pearson correlation

The Pearson correlation is a measure of how much two random variables $X$ and $Y$ are linearly correlated. It is given as a number between -1 and 1. We can use the Pearson correlation between input features and model outputs as a way to select features. This works particularly well for linear regression problems. The Pearson coefficient is calculated as

$$\rho(X, Y) = \frac{\mathbb{E}[(X – \mu_X)(Y – \mu_Y)]}{\sigma_X \sigma_Y}$$


Finally, we can use the chi-squared test ($\chi^2$) to assess which features are most informative to the model. To do so, we calculate the $\chi^2$ value between each feature and target. Those with the highest scores are the best features. The formula is defined as

$$\chi^2 = \sum_{i=1}^k \frac{(x_i – y_i)^2}{y_i}$$

Embedded methods

The term embedded method is just a catch-all for feature selection in machine learning methods that don’t fall into the previous two buckets. $L1$ regularization is the prime example of an embedded method. It acts in a manner akin to feature selection because the L1 norm penalty encourages sparsity on the model’s weights. Because most of the weight’s coefficients are driven towards zero, they are effectively “selected out” by the regularizer.

Applications of feature selection

So why should you use feature selection in machine learning models? Well, without feature selection, your models stand less of a chance of being interpretable. For example, if you were going to create a model to predict a diabetic person’s blood sugar level at different times during the day, you might start by collecting all the possible physiological data you could on that person, such as blood pressure, food eaten, liquids consumed, sleep duration, etc.

Some of these variables will be highly impactful to the model’s output, yet others will turn out to be just noise. If you want to provide that person with actionable advice as to how to keep their insulin levels stable throughout the day, you’ll want to eliminate all the unnecessary features and focus only on the ones which are critical to the result. That’s where feature selection comes in.

Similarly, the same logic can be applied to more commercial business problems. For example, you might want to create models to optimize a supply chain network or improve product delivery times or distribute server load across a network. In any instance like this where business stakeholders are involved and concrete actions must take place, it’s vital to use feature selection to ensure that your models are providing the most actionable information possible.