A photo of a red fox

Natural language processing has been one of the most prominent and visible uses of machine learning in recent years. From early recurrent neural network architectures that were able to detect the first named entity pairings, to today’s transformers that can look at an entire paragraph or book simultaneously using parallel processing on GPUs, we’ve clearly seen some tremendous improvements.

However, nothing has been quite as dramatic for the field as a new architecture: Bidirectional Encoder Representations from Transformers, or BERT.

In this post, we’ll walk through what BERT is and provide an easy way to use it on Algorithmia.

How recurrent neural networks work

Before we talk about BERT itself, let’s start with one of the cornerstone building blocks of many machine learning architectures—Recurrent Neural Networks (RNNs).

RNN model depiction. Source: https://medium.com/towards-artificial-intelligence/whirlwind-tour-of-rnns-a11effb7808f

Many datasets containing signals we’d like to extract are sequential, meaning that for each element x(t) in an input sequence (in the graphic above), the output y(t) depends not only on x(t) but also on earlier elements, from x(t-1) back to x(t-n).

A great example of this is language. Imagine this sentence: The quick brown fox jumps over the lazy ___. You may have an idea what that last word is supposed to be. This is due to how language is constructed: each word adds context and shapes the meaning of the sentence. Considering any of the words individually, without context, would make it difficult to predict that the last word is dog.

Using context for more accurate predictions 

Recurrent neural networks are an architecture that allows the state of previous operations to be preserved inside the model. This means that we could design an RNN model that, given each word individually (the, quick, brown, fox, …), could be trained to successfully predict dog and many other things. That is a simplistic description of how RNNs work. Let’s take a look at some of the downsides of recurrent architectures.
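As a rough illustration of how state is carried from one step to the next, here is a minimal sketch of a single-layer recurrent cell in plain Python. The weights, the two-dimensional inputs, and the function names (rnn_step, run_rnn) are made up for this example; a real model would learn its weights from data and would use a numerical library.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh * x_t + W_hh * h_prev + b_h)."""
    h_new = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, b_h):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Toy weights, chosen by hand for illustration only (not learned).
W_XH = [[0.5, -0.3], [0.1, 0.8]]
W_HH = [[0.2, 0.4], [-0.6, 0.1]]
B_H = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = run_rnn(sequence, W_XH, W_HH, B_H)
for t, h in enumerate(states):
    print(t, h)
```

The last state depends on every earlier input, not just the final one: feeding only the last element of the sequence gives a different hidden state than feeding the whole sequence.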

Challenges in recurrent architectures

Vanishing gradient problem

One drawback is called the vanishing gradient problem, which stems from how information is stored in RNNs. As mentioned, information from x(t-n) is stored in the network to help predict y(t). However, when n gets very large, that information eventually starts to leak out: during training, the gradient signal linking y(t) back to x(t-n) shrinks at every step it passes through, until it is too small to learn from.

There have been improvements to reduce this impact, such as Long Short-Term Memory (LSTM) layers or Gated Recurrent Units (GRUs). However, the problem persists when information has to be shared over very long ranges.
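To make the effect concrete, here is a small sketch that tracks how much the final state of a one-dimensional RNN still depends on its starting state. The scalar model h(t) = tanh(w * h(t-1) + x(t)), the weight 0.9, and the constant input 0.5 are assumptions picked for illustration, not values from any real network.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh(...)^2) = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


# Compare how much of the starting state survives after 1, 10 and 50 steps.
grads = {n: gradient_through_time(0.9, [0.5] * n) for n in (1, 10, 50)}
for n, g in grads.items():
    print(f"steps={n:2d}  gradient={g:.3e}")
```

Each step multiplies the gradient by a factor smaller than one, so the product shrinks roughly exponentially with the number of steps.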

Information processing

The second problem stems from how information is processed. As mentioned, information from x(t-n) is used to help predict y(t). This means we need to calculate the value of y(t-n) before we can even start work on y(t), which can make parallelizing the training/inference processes quite difficult if not impossible for many tasks. 

This isn’t always a problem, especially for smaller architectures. But if you intend to use deep learning models at scale, you will very quickly hit a wall in how fast you can train the model.

This is one of the reasons why researchers have historically preferred to focus on other ML projects like image processing: many RNN models could not take advantage of the scale that makes deep learning powerful.

Transfer learning

The third problem is a difficulty with transfer learning. Transfer learning is the process of taking an ML model pre-trained on some generic dataset and re-training it on a specialized dataset for your specific project or problem.

This kind of process is very common in the image processing world but has proven to be quite challenging for even relatively standard sequential tasks, such as Natural Language Processing. This is because any model you are planning to use for transfer learning must have been trained with the same type of objective as the one you plan on tackling. 

Transfer learning requires a shared set of necessary transformations between model objectives, which is where we see benefits in training time and model accuracy.

In the field of image classification, we’re almost always looking for objects in an image, generally a natural photograph (like family vacation pictures from the Bahamas). However, if you attempted to reuse a general classification model to classify artifacts in x-ray stereographs, your model would struggle to provide any value.

This kind of scenario has plagued NLP since its inception, as many NLP tasks are disparate and have objectives (such as Named Entity Recognition or Text Prediction) that make it very difficult to transfer what a model learned on one task to another.

This is where BERT comes in, and why it’s so special. BERT uses a multi-headed objective system that takes the most common NLP objectives and trains a model that’s capable of being successful in all of them. We’ll look at BERT models more in-depth below.

Other types of RNNs 

Attention networks

A new architecture was created by Google researchers a couple of years ago that approaches sequential problems in a different way.

A depiction of a recurrent neural network with an attention layer

With attention networks, we’re processing every variable in our sequence (x(0) all the way to x(t)) at once, rather than one at a time. 

We’re able to do this because the attention layer can view all the data at once, using its limited number of weights to focus only on the parts of the input that matter for the next prediction. This means we can parallelize the training of our model and take advantage of GPUs.
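One common way to implement this idea is scaled dot-product attention, sketched below in plain Python. This is an assumption about the formulation, not necessarily the exact layer depicted above. To keep it short, the same toy vectors are used as queries, keys, and values; real models first pass the inputs through learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # each query is independent, so this loop could run in parallel
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequence of three 2-dimensional vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how strongly one position attends to every other position. No row needs the result of another row, which is what makes the computation easy to parallelize.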

Transformer networks

As a progression on attention networks, transformers have multiple “sets” of weights per attention layer, each able to focus on different parts of an attention vector. These are called transformer heads (also known as attention heads).

Other than that, the big difference between attention and transformer networks is the concept of stacking attention/linear layers on top of each other (while taking some concepts from residual network architectures), in a similar way to convolutional neural networks. This stacking brings in the paradigm of deep learning, and the residual connections help us avoid the vanishing gradient problem by ensuring that information from previous layers always bubbles up to the last layer of the network.
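The residual idea borrowed here can be sketched in a few lines: each block returns its input plus whatever the layer computed, so the input always has a direct path to the output. The block below is a simplified stand-in (a linear layer followed by ReLU); real transformer blocks also contain attention and normalization, and the weights and names here are hypothetical.

```python
def linear(x, weights):
    """Multiply vector x by a square weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights):
    """Return x + f(x), where f is a linear layer followed by ReLU."""
    fx = relu(linear(x, weights))
    return [xi + fi for xi, fi in zip(x, fx)]


def run_stack(x, layers):
    """Apply a stack of residual blocks in order."""
    for weights in layers:
        x = residual_block(x, weights)
    return x


signal = [1.0, -2.0]

# Twelve layers that contribute nothing: the input still reaches the top intact.
zero = [[0.0, 0.0], [0.0, 0.0]]
passed_through = run_stack(signal, [zero] * 12)

# Twelve layers that each add a small adjustment on top of the input.
small = [[0.1, 0.0], [0.0, 0.1]]
adjusted = run_stack(signal, [small] * 12)

print(passed_through, adjusted)
```

Because every block adds to its input rather than replacing it, stacking more layers does not erase what earlier layers produced.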

These networks have become the state of the art for natural language processing, in part because they can be trained effectively on GPUs and TPUs, which allows researchers to make them even deeper.

A depiction of a transformer network model

Bidirectional Encoder Representations from Transformers (BERT)

Attention architectures allow us to solve two of the biggest problems of working with RNNs. The parallelization that attention models provide lets us train much faster. And with the introduction of transformers, using residual connections and multiple transformer heads, we can avoid the vanishing gradient problem, allowing us to construct deeper models and take advantage of the deep learning paradigm.

But we’re still missing something; we haven’t addressed a third problem—NLP models are terrible for transfer learning.

This is where BERT comes in. It’s trained on two different objectives so that its parameters become more general-purpose. Like many NLP architectures, the model is trained to predict missing (masked) words from their surrounding context; it is also trained to predict whether one sentence follows another in the source text.
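The missing-word objective can be pictured as a data preparation step: hide some tokens and ask the model to recover them. The sketch below is a simplified assumption of that step. It always substitutes a [MASK] token at a 15% rate, whereas the published BERT recipe also sometimes swaps in a random token or leaves the token unchanged. The function name mask_tokens and the whitespace tokenization are illustrative only.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of the tokens and keep the originals as training labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels are only defined where a token was hidden.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During training, the model sees the masked list and is scored on how well it predicts the labels at the hidden positions.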

Unlike typical training systems, however, BERT reads a block of text in both directions, right-to-left and left-to-right, rather than in just one. Hence it’s a bidirectional encoder.

This phrase-embedding encoder is also much deeper and contains significantly more parameters than earlier encoding systems such as word2vec or GloVe.

Besides that, the encoding of a word is not independent of its context, which allows BERT to have a very deep and rich understanding of the vocabulary used in the training corpus.

Diagrams of BERT models in semi- and supervised learning environments

Once the internal word-encoder model is trained, a classifier is stacked on top of it, which can be trained for a variety of tasks. In the pre-trained examples, a simple Spam/Not Spam binary classifier is constructed, but this could be used for other systems as well, such as Named Entity Recognition or sentiment analysis, to name a few.
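To show what stacking a classifier on top can look like, here is a minimal sketch that trains a logistic-regression head on fixed encodings. The two-dimensional encodings are hand-written stand-ins for what an encoder would produce, and the labels, learning rate, and function names are all assumptions for illustration. Only the head’s weights are updated; the encodings stay frozen.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(encoding, weights, bias):
    """Probability that the encoded text is spam."""
    return sigmoid(sum(w * e for w, e in zip(weights, encoding)) + bias)


def train_head(encodings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head with batch gradient descent."""
    weights = [0.0] * len(encodings[0])
    bias = 0.0
    n = len(encodings)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for enc, y in zip(encodings, labels):
            err = predict(enc, weights, bias) - y
            grad_w = [g + err * e for g, e in zip(grad_w, enc)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical frozen encodings (not real BERT outputs) and their labels.
encodings = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

weights, bias = train_head(encodings, labels)
scores = [predict(e, weights, bias) for e in encodings]
print([round(s, 3) for s in scores])
```

Swapping in a different head, or different labels, is all it takes to point the same encodings at another task.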

BERT and Algorithmia

A big benefit of BERT is that it generates very rich encodings of word representations that can be used for tasks involving large documents with many sentences. This is helpful because one model can be used to construct many downstream applications of varying complexity, such as document classification or semi-supervised document topic clustering.

Algorithmia has deployed two example BERT models, one in TensorFlow and the other in PyTorch. The source code is also available through the new GitHub for Algorithmia integration, which makes it easier to use the code you’d like.

Both of these models are able to provide rich representations of a sentence, and can be used as a first stage for many NLP downstream tasks that are specialized for your business case.

James Sutton