The field of natural language processing (NLP) is concerned with the creation of machine learning methods for understanding written and verbal data. And as in any subfield of machine learning, it’s necessary to devise a technique for creating numerical representations of that data so it can be acted on by mathematical and statistical models.
A word embedding or word vector is one such type of representation. The term “embedding” here refers to the placing of the word as a vector into some sort of vector space. That way, the words can be manipulated mathematically as any vector would while retaining the contextual and semantic information inherent in the text from which they were drawn.
The state of the art for these methods has advanced rapidly over the past five years, and techniques vary based on their complexity, the time needed to train, the semantic level at which they consider information (character, word, phrase, sentence, paragraph, document), and the specific algorithm they use for computing the embedding from surrounding context.
Ins and outs of word embeddings
When considering the architectures of embedding models, it’s helpful to think about how difficult the task at hand actually is. When training these models, the user seeks to condense the thousands of uses of a given word into a single n-dimensional vector (where n is usually in the range 100-300). Thus, the model needs to make incredibly intelligent use of reams of data to produce a representation that’s meaningful.
The complexity of the task is compounded by the many nuances of human language. Many words, such as house or bark, have several different meanings, and can even act as different parts of speech, depending on how they’re used. One downside of the embedding approach is that these multiple meanings are collapsed into a single vector that captures none of them perfectly.
However, some models attempt to deal with this complexity by creating different vectors for different senses of a word. In this case, the challenge usually comes in algorithmically identifying which sense is being used at any given point in the training set.
Heuristic approaches to tackling this can work, but one often ends up with a chicken-or-the-egg problem in which a sense-based embedding would be quite helpful for labeling training data, but in order to get such a representation one needs accurately labeled training data. Given all this, the fact that very accurate embedding models exist is a testament to the ingenuity of engineers and machine learning scientists.
One of the most popular algorithms in the word embedding space has been Word2Vec. It was the first widely disseminated word embedding method and was developed by Tomas Mikolov and colleagues at Google. It proposes two different architectures: the Skip-gram model and the Continuous-Bag-Of-Words model (CBOW).
Both learn semantic embeddings of words by assessing the context (i.e., the surrounding words) in which those words are placed. In both methods, the user sets a window size that defines how many surrounding context words to consider in the modeling. In general, a larger window gives the model more context for each prediction but also places more computational stress on the model.
The two models differ in how they handle context. In the Skip-gram method, the model takes the target word for the embedding being learned and seeks to predict the surrounding context words from it. In the CBOW model, the model takes the context surrounding the target word and seeks to predict the target word from it.
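The difference between the two models is easiest to see in the training examples each one builds from the same sentence. The following is a minimal sketch rather than the reference Word2Vec implementation; the function names skipgram_pairs and cbow_pairs and the toy sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context_word) pairs: the target predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """Return (context_words, target) pairs: the neighbors predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


# Hypothetical toy sentence; a real corpus would contain millions of tokens.
sentence = "the dog barked at the mailman".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
print(sg[:3])  # one training example per (target, neighbor) combination
print(cb[:2])  # one training example per target position
```

With the same window size, Skip-gram produces several examples per position while CBOW produces one, which is consistent with CBOW being the cheaper of the two to train.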
In practice, the Skip-gram model is often favored because it can be trained accurately with smaller datasets and represents uncommon or rare words more accurately. The main advantage of the CBOW model is that it is less expensive to train, which can be an important consideration depending on your particular set of constraints.
In either model, the prediction is computed using a neural network type architecture with input, hidden, and output layers. In both cases, the model’s value derives not from its predictions but from its weights. Remember, the models are constructed to predict context and target words, but not to produce the embeddings. Instead, the embeddings are extracted as the rows of the model’s weight matrices.
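To make the point about weights concrete, here is a small sketch using NumPy. It assumes a five-word vocabulary, three-dimensional embeddings, random untrained weights, and a plain softmax output; production Word2Vec implementations use cheaper approximations of the output layer, so treat this only as an illustration of where the embeddings live.

```python
import numpy as np

# Hypothetical toy vocabulary.
vocab = ["the", "dog", "barked", "at", "mailman"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 3  # real models typically use 100-300 dimensions

rng = np.random.default_rng(0)
# Input-to-hidden weights: one row per vocabulary word (untrained here).
W_in = rng.normal(size=(len(vocab), embedding_dim))
# Hidden-to-output weights, used only to score predictions during training.
W_out = rng.normal(size=(embedding_dim, len(vocab)))


def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v


x = one_hot(word_to_id["dog"], len(vocab))
hidden = x @ W_in                       # multiplying by a one-hot vector selects a row
scores = hidden @ W_out                 # one score per candidate context word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # softmax over the vocabulary

# The embedding for "dog" is simply the corresponding row of W_in.
dog_vector = W_in[word_to_id["dog"]]
print(np.allclose(hidden, dog_vector))
```

Once training is finished, the prediction layers can be discarded and only the rows of the input weight matrix are kept as the word vectors.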
The embeddings that are produced are unique in the way that they capture semantic information. In fact, one can perform mathematical operations, such as addition, in the vector space that are meaningful in the semantic space. As an example, one could take the vectors for rooster, man, and woman and perform the operation rooster – man + woman. This would likely yield a vector representation lying remarkably close to the vector for hen.
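The arithmetic can be sketched with a handful of hand-made vectors. These are not trained embeddings: the two dimensions and every value below are invented so that the example is easy to follow, and a trained model would only approximate this behavior.

```python
import numpy as np

# Hand-made 2-D vectors, NOT trained embeddings.
# Axis 0 loosely encodes "bird vs. human", axis 1 "male vs. female".
vectors = {
    "man":     np.array([0.0,  1.0]),
    "woman":   np.array([0.0, -1.0]),
    "rooster": np.array([1.0,  1.0]),
    "hen":     np.array([1.0, -1.0]),
    "egg":     np.array([1.0,  0.0]),
    "girl":    np.array([0.0, -0.8]),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = vectors["rooster"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by cosine similarity to the query vector,
# excluding the three words that were used to build it.
inputs = {"rooster", "man", "woman"}
candidates = {w: v for w, v in vectors.items() if w not in inputs}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```

Excluding the input words from the candidates is a common convention in analogy evaluations, since the query vector often stays close to one of them.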
The GloVe algorithm (Global Vectors for Word Representation) produces word embeddings that are very similar to those produced by Word2Vec but does so in a somewhat dissimilar manner. While GloVe still considers contextual information, it does not train on one context window at a time and instead aggregates statistics for each pair of words across the whole corpus. That is, GloVe builds a word-word co-occurrence matrix from which it derives the probability P(a | b) of seeing the word a in a k-word slice around word b.
According to the GloVe paper, the ultimate objective is to obtain word vectors such that, for any two words, the dot product of their vectors equals the log probability of their co-occurrence. The authors spent a good deal of time testing their embeddings on word analogy tests (i.e., those which ask “cat is to lion as fish is to ___”). They demonstrated impressive results on this task, suggesting that semantic knowledge is captured by their model.
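A rough sketch of how such co-occurrence statistics could be gathered is shown below. It uses raw, unweighted counts on a two-sentence toy corpus; the function names and the corpus are assumptions for illustration, and this is not the GloVe authors' code.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, k):
    """Count how often each word appears within k words of word b."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, b in enumerate(tokens):
            lo = max(0, i - k)
            hi = min(len(tokens), i + k + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[b][tokens[j]] += 1
    return counts


def conditional_probability(counts, a, b):
    """Estimate P(a | b) as count(a near b) / count(any word near b)."""
    total = sum(counts[b].values())
    return counts[b][a] / total if total else 0.0


# Hypothetical toy corpus.
corpus = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
]
counts = cooccurrence_counts(corpus, k=2)
print(conditional_probability(counts, "cold", "ice"))
print(conditional_probability(counts, "hot", "ice"))
```

Because the counts are collected once over the whole corpus, the model can then be fit to this table rather than revisiting every window during training.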
One newer player in the word embedding game is a model called FastText, developed by researchers at Facebook. The model is very similar to the Skip-gram version of Word2Vec. However, the authors use some clever tricks that improve performance.
First, they model words as a bag of character n-grams. The key here is that they model not only the whole word as a single sequence but also the subwords that make it up. The FastText paper wraps each word in the boundary markers < and > and extracts n-grams of 3 to 6 characters, so with n = 3 a word such as pinch yields the subwords <pi, pin, inc, nch, and ch>. The whole word can then be represented as the sum of its subword vectors. This provides a number of benefits.
First, a model trained in this way obtains a better morphological understanding of the language on which it’s being trained. For example, it can learn to more accurately model suffixes such as -ing, -ed, -ied, etc. This can allow it to create better representations based on word tense. Secondly, this form of modeling allows the algorithm to handle out-of-vocabulary words.
Prior models such as Word2Vec and GloVe were unable to produce embeddings for words which were not in the training corpus. However, FastText mitigates this problem. Because it is trained to model character n-grams, it can produce representations for rare words that were not seen during training. Finally, FastText was implemented in heavily optimized C++ code, meaning it trains blazingly fast and can be used on devices such as phones with fewer computational resources.
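The subword idea can be sketched in a few lines. This is a simplified illustration and not the FastText implementation: the helper names are hypothetical, the subword vectors are made-up 2-D values, and the real library stores n-grams in a hashed table rather than a plain dictionary.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n=3):
    """Sum the vectors of the word's known n-grams (all zeros if none are known)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("pinch", n_min=3, n_max=3))

# Toy 2-D subword vectors, invented for illustration only.
ngram_vectors = {
    "<pi": [1.0, 0.0],
    "pin": [0.5, 0.5],
    "nch": [0.0, 1.0],
    "ch>": [0.0, 2.0],
}

# "pinched" was never seen as a whole word, but it shares subwords with "pinch",
# so a vector can still be assembled for it.
print(word_vector("pinched", ngram_vectors, dim=2))
```

The unseen word still receives a non-zero vector because three of its trigrams were learned from other words, which is the mechanism behind the out-of-vocabulary handling described above.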
Word embedding use cases
One of the primary uses for word embeddings is for determining similarity, either in meaning or in usage. This is usually done by computing a metric such as the cosine similarity (i.e., a normalized dot product) between two vectors. If the embedding model was trained properly, then similar words such as storm and gale will show a high cosine similarity, close to 1 on a scale that runs from -1 to 1.
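As a minimal sketch, cosine similarity can be computed as follows. The three vectors are toy values chosen by hand, not the output of a trained model.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-D vectors; real embeddings would have 100-300 dimensions.
storm = [0.9, 0.8, 0.1]
gale = [0.8, 0.9, 0.2]
spoon = [0.1, 0.0, 0.9]

print(cosine_similarity(storm, gale))   # related words: high similarity
print(cosine_similarity(storm, spoon))  # unrelated words: much lower
```

Because the dot product is divided by both vector lengths, the score depends only on the direction of the vectors and not on their magnitude.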
Challenges for word embedding models
However, care needs to be taken here. Embeddings trained using the contextual similarity model popularized by Word2Vec can often exhibit peculiarities that are surprising at first glance but make complete sense upon further investigation into the method by which the model was trained.
For example, when assessed using a cosine similarity metric, words such as love and hate will likely show a high similarity, despite having semantically opposite meanings. This is because the contexts in which “love” is used are highly correlated with those in which “hate” is used. As such, the model groups them into the same category, despite the opposite meanings.
While there are ways to mitigate this by paying careful attention to the training procedure, it can also be viewed as a useful feature of the model, depending on how one plans to use the outputs.
Nearest neighbors and clustering
Another great use case for word embeddings is in nearest neighbors and clustering methods. In many use cases, one might like to create topic models which cluster words associated with a given topic. In others, one might wish to find synonyms or substitutions for a given word by finding the nearest neighbors measured by cosine similarity. In either instance, the semantic meaning encoded in the word representations allows one to use classic clustering and nearest neighbors methods such as k-means or k-nearest neighbors (k-NN) without much modification in order to produce a useful result.
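A nearest-neighbor lookup over an embedding table might look like the sketch below. The word list, the 3-D vectors, and the function name nearest_neighbors are all assumptions for illustration; in practice the table would come from a trained model and hold many thousands of rows.

```python
import numpy as np

words = ["storm", "gale", "breeze", "spoon", "fork"]
# Toy 3-D vectors invented for illustration; one row per word.
E = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.7, 0.6, 0.3],
    [0.1, 0.0, 0.9],
    [0.0, 0.1, 0.8],
])


def nearest_neighbors(query_word, k=2):
    """Return the k words whose vectors have the highest cosine similarity."""
    # Normalizing every row to unit length turns a dot product into cosine similarity.
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = unit[words.index(query_word)]
    sims = unit @ q
    order = np.argsort(-sims)  # indices sorted from most to least similar
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query_word]
    return ranked[:k]


print(nearest_neighbors("storm"))
print(nearest_neighbors("spoon", k=1))
```

The same normalized matrix could be passed to an off-the-shelf clustering routine such as k-means to group the weather words apart from the cutlery words.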
Word embedding in the enterprise
There are many use cases for word embedding models within a business context. For example, a marketing company might have certain copy that they find works well with a given customer demographic, and they may want to expand the way in which they talk about their products.
They could use similarity metrics on the word embeddings associated with their existing copy to generate new and effective ways of talking about their product offering.
As another example, a company might want to assess consumer sentiment around one of their products. They could pull customer reviews and mentions off social media platforms and train a sentiment classifier to predict how customers are talking about their products. By initializing the classifier with pretrained word embeddings, they will often improve the model’s downstream accuracy, particularly when labeled training data is limited.