Learn why model validation is important and how to approach it.

Model validation

In machine learning, model validation is the process of verifying that a model produces satisfactory outputs for its input data, in line with both qualitative and quantitative objectives. While it partly consists of tried-and-true processes, model validation is a heterogeneous activity that cannot easily be pinned down, characterized in general, and applied uniformly to all models, which creates opportunities for creativity and ingenuity.

Model validation is a component of machine learning governance, the overall process for how an organization controls access, implements policy, and tracks activity for models.

Why is model validation important?

Model validation and verification ensure the effectiveness and accuracy of a trained model before it is put to use. Without validation, a model may perform poorly, and the time spent training it will be wasted. A model that is not properly validated may not be robust enough to adapt to new stress scenarios, or may be so overfitted that it cannot handle new inputs correctly. Unlike model monitoring, model validation takes place before the model is deployed on the full dataset; monitoring occurs continuously alongside a running model.

How to validate a model

There are two straightforward ways to statistically validate a model: evaluate it on the data it was trained on, or evaluate it on an external test set. The first method introduces the problem of overfitting: any dataset can be fit arbitrarily well at the cost of creating a model that is brittle on unseen data. A model tuned perfectly to one dataset may fail to produce correct outputs on new data and thus fail validation. Consider the case of fitting a curve through (x, y) pairs when given 100 of them as a training set.

A high-degree polynomial could fit the training set exactly while being very brittle to data outside it. It is therefore common practice to validate models using a test set. Given a dataset, one can construct the test set by randomly extracting 10 to 20 percent of the data. In the case of the 100 (x, y) pairs, separating out a test set and evaluating the model against it would quickly reveal that the high-degree polynomial was overfitting. One might then choose a simpler model, such as a linear regression, which has a better chance of passing model validation.
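The curve-fitting example above can be sketched with NumPy. This is a minimal illustration, not a prescribed validation procedure: the degree, split, and noise level are assumptions chosen to make the overfitting visible.

```python
import numpy as np

# 100 (x, y) pairs from a noisy line, split 80/20 into train and test sets.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.2, 100)

x_train, y_train = x[:80], y[:80]
x_test, y_test = x[80:], y[80:]

# A degree-12 polynomial can chase the noise in the training set,
# while a straight line captures the underlying trend.
poly = np.polyfit(x_train, y_train, 12)
line = np.polyfit(x_train, y_train, 1)

def mse(coef, xs, ys):
    """Mean squared error of a polynomial fit on the given points."""
    return float(np.mean((np.polyval(coef, xs) - ys) ** 2))

train_poly, test_poly = mse(poly, x_train, y_train), mse(poly, x_test, y_test)
train_line, test_line = mse(line, x_train, y_train), mse(line, x_test, y_test)
```

Because the degree-12 model contains the linear model as a special case, its training error is always at least as low, which is precisely why training error alone cannot validate a model; the held-out test error is what exposes the difference.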

Many statistical evaluation metrics can be used for model validation, including mean absolute error (MAE), mean squared error (MSE), and the area under the ROC curve.
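These metrics are simple enough to compute by hand. The following is a sketch with illustrative numbers; in practice, libraries such as scikit-learn provide production implementations.

```python
import numpy as np

# Regression metrics on a toy example.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = float(np.mean(np.abs(y_true - y_pred)))   # mean absolute error -> 0.5
mse = float(np.mean((y_true - y_pred) ** 2))    # mean squared error  -> 0.375

def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the probability that a random positive example
    is scored higher than a random negative example."""
    wins = sum(p > n for p in pos_scores for n in neg_scores)
    ties = sum(p == n for p in pos_scores for n in neg_scores)
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.8], [0.3, 0.7])           # perfect separation -> 1.0
```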

Model validation pitfalls

It is a mistake to believe that model validation is a purely quantitative or statistical process. For instance, a key part of model validation is ensuring that you have picked the right high-level statistical model. Consider the example of training a system to predict the price of an item given an image of it. One could obtain reasonably good results by simply applying a logistic regression to the set of images. But this would ignore the much better results that could potentially be obtained by applying a multilayer convolutional neural network to the images.

It is thus important to perform thorough research of the machine learning literature as a part of model validation. The results of endless hours of work on a model that is a poor or mediocre choice for a given dataset can be surpassed by a simple glance at the right areas of the arXiv. On the other hand, a model that is not exactly the right choice for a given data set, but still close to the optimum, can still be considered to pass model validation.

It is generally mistaken to take the perspective, prevalent in Kaggle competitions, that the goal is to squeeze every last drop of performance out of your model. Redoing a machine learning problem with a different model is expensive, time-consuming, and error-prone. Often there is either one model that is "right" for the dataset, as with large image datasets and neural networks, or there is no single "right" model and several will be close to the optimum, as in most non-image-based Kaggle competitions.

What is data validation?

Data validation is another key component of model validation. Data values can be corrupted or contain errors in ways that impair the results of model training. The integrity of data values can be verified by manually delving into sections of the data, programmatically searching through it, or by creating graphs. The integrity can also be checked qualitatively by ensuring that the data was drawn from a reliable, trustworthy, well-maintained and up-to-date source.
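Programmatic integrity checks like those described above can be sketched in a few lines. The column meaning ("age") and its valid range are assumptions for illustration only.

```python
import numpy as np

# Hypothetical feature matrix with columns (age, income).
data = np.array([[25.0,   50_000.0],
                 [np.nan,  62_000.0],
                 [130.0,  48_000.0]])

# Count missing cells, which often indicate corrupted records.
n_missing = int(np.isnan(data).sum())

# Count values outside a plausible range; mask NaNs first, since
# comparisons against NaN are always False and would hide them.
ages = data[:, 0]
valid = ~np.isnan(ages)
n_out_of_range = int(np.sum((ages[valid] < 0) | (ages[valid] > 120)))
```

Checks like these complement, rather than replace, the qualitative verification that the data comes from a reliable, well-maintained source.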

Data for the training and test sets should be drawn from the same probability distribution, or as close to it as possible, to achieve adequate results. In addition, models may be vulnerable to errors on specific input values that are poorly represented in the training set. If such a class of errors is possible, it is important to verify that the training set adequately covers the inputs on which the model will be evaluated, or the model may fail validation. There are many methods of checking that the training set is adequate, including manually searching through the data or creating visual plots of it.
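One quantitative way to check whether two samples plausibly come from the same distribution is the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical CDFs. The hand-rolled version below is a sketch; `scipy.stats.ks_2samp` provides a full implementation with p-values.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples
    (the two-sample Kolmogorov-Smirnov statistic)."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 1000)
test_same = rng.normal(0, 1, 200)      # drawn from the same distribution
test_shifted = rng.normal(2, 1, 200)   # simulated distribution shift
```

A markedly larger statistic for the shifted sample signals that the training set no longer resembles the data the model will face.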

It is in general critical to have made correct assumptions about the similarities between the training set and the data the model will ultimately be evaluated on, which is again a qualitative process. 

Performance and model validation

Setting the standards for how performant a model must be is a further component of model evaluation and machine learning validation. Models can never be completely accurate, so tradeoffs must be made between the risk of errors and the training time and training-set size. Ultimately the decision must be made qualitatively, perhaps by evaluating multiple models on the dataset to settle on standards. Sometimes no model is sufficiently performant and the project must be scrapped altogether. Consider the case of self-driving cars, a notoriously difficult problem for which no practical solution has yet been found.

Self-driving cars

Given that self-driving cars have already been shown to work to a limited extent in experimental settings, correctly setting the standard between success and failure is key to deciding whether self-driving cars will be the iconic technological breakthrough of the 21st century or a massive source of malinvestment. Self-driving cars further illustrate a case in which the dividing line between success and failure for a model cannot be reduced to a single number. Accidents per mile, fatalities per mile, errors per mile, the number and type of traffic violations per mile, and human interventions per mile can all be relevant metrics in model validation, and sorting through them to decide between success and failure can be a complex process with both qualitative and quantitative components.

K-fold cross validation

There are many methods of comparing models, which is an important part of model validation. K-fold cross validation is one such method. In k-fold cross validation, the training set is randomly split into k groups. Then, iterating k times, one holds out each group in turn, trains the model on the remaining data, and evaluates it on the held-out group. Performance is averaged over the k iterations of the cross validation.
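The procedure above can be sketched directly. This minimal version hard-codes a polynomial fit as the model; the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

def k_fold_mse(x, y, k, degree, seed=0):
    """Average held-out MSE over k random folds for a polynomial fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # random split into k groups
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        held = folds[i]                    # hold out one group
        rest = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[rest], y[rest], degree)   # train on the rest
        scores.append(np.mean((np.polyval(coef, x[held]) - y[held]) ** 2))
    return float(np.mean(scores))          # average over the k iterations

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = 2 * x + 1 + rng.normal(0, 0.1, 60)
cv_mse = k_fold_mse(x, y, k=5, degree=1)
```

Running the same routine with different `degree` values (or different models) lets one compare candidates without ever touching the final test set.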

One can use k-fold cross validation to compare and select models prior to evaluation on the full test set. The Akaike information criterion (AIC) is another method of comparing models. The AIC estimates, on a relative basis, the amount of information lost by a given model. It is given by $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximum value of the likelihood function for the model. Because logarithms are monotonically increasing and the $\ln(\hat{L})$ term is subtracted, the preferred models are those with the minimum AIC value. AIC may be useful, for example, in model validation when you want to compare a less performant (but faster to train) model to a more performant (but slower to train) model. If the less performant model is sufficiently close in AIC to the more performant model, it may pass model validation.
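For a least-squares fit with Gaussian errors, the maximized log-likelihood has a closed form, so the AIC can be computed from the residual sum of squares. The sketch below uses that assumption; parameter-counting conventions vary (some also count the noise variance), but the comparison between models is unaffected as long as the convention is consistent.

```python
import numpy as np

def aic_gaussian(y, y_pred, n_params):
    """AIC = 2k - 2 ln(L_hat) for a least-squares fit, assuming Gaussian
    errors with variance set to its maximum-likelihood estimate rss/n."""
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    log_l = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * n_params - 2 * log_l

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

# Constant model (1 coefficient) vs. linear model (2 coefficients):
# the linear model should be preferred (lower AIC) on clearly linear data.
aic_const = aic_gaussian(y, np.full_like(y, y.mean()), n_params=1)
aic_line = aic_gaussian(y, np.polyval(np.polyfit(x, y, 1), x), n_params=2)
```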

No one-size-fits-all model validation

Ultimately, model validation in machine learning can be a highly custom and diverse process depending on the model and dataset at play. There is no universal system, process, or model validation technique that is optimal for validating any given model on any given dataset. For an example of how diverse model validation methods can be, consider the case of a multilayer neural network image classification system. Recent research has shown that an adversarial neural network can be trained to produce inputs for which the image classifier yields large statistical errors.

Neural networks are considered intrinsically vulnerable to such adversarial attacks, so if either the errors produced or the breadth of data on which they can be produced is considered too large, one could switch to a model that does not incorporate neural networks in order to pass model validation. A practical application of this example is a self-driving system, in which a neural network may decide the speed and direction of driving based on image inputs. It would be quite important for such a system to be resilient to adversarial attacks, as criminals could place images on or near the roadway that tend to cause the self-driving car to make bad decisions. In general, training adversarial models against your model and examining the results is a promising method of model validation.
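The mechanics of an adversarial perturbation can be shown on a toy linear classifier. This is a deliberately simplified, fast-gradient-sign-style sketch with fixed, illustrative weights, not a trained network; real attacks compute gradients through the full model.

```python
import numpy as np

# Toy logistic "classifier" with fixed, illustrative weights.
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def predict_proba(x):
    """Probability of the positive class under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm(x, eps):
    """Fast-gradient-sign-style perturbation: step each feature in the
    direction that pushes the score toward the opposite class."""
    step = eps * np.sign(w)   # for a linear model, the input gradient is w
    return x - step if predict_proba(x) > 0.5 else x + step

x = np.array([2.0, 0.0, 1.0])   # confidently classified as positive
x_adv = fgsm(x, eps=1.5)        # bounded per-feature change flips the label
```

Validating against such perturbed inputs (at various `eps` budgets) is one way to quantify how brittle a model is before deciding whether it passes.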

More from the AI/ML governance blog series