Quick, what languages are these two sentences written in:

“Hey bana bir sorununuz olur mu?”

“Halló ég er með vandamál getur þú hjálpað mér?”

Not easy, right?

Figuring out a document’s source language is an essential first step for many cross-language tools, which is why we’ve implemented a Language Identification algorithm.

Detecting languages falls in the category of natural language processing, the field concerned with how computers can extract meaning and value from human language.

## What Is Language Identification?

Our implementation of Language Identification is a microservice that uses the Apache Tika framework and its LanguageIdentifier module. It uses pre-trained language models in an ensemble to determine which language, or languages, an unknown input document is written in.

Here’s how Apache Tika’s language identification works:

First, a text corpus is generated for each language by combining data from books, papers, reports, and other sources, producing a dataset as large and diverse as the language itself.

Tika then uses a tri-gram model structure, which trains a simple character-level model to predict the next character given the previous three characters in the document. After scanning the whole corpus, you end up with a trained, language-specific character prediction model.
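To make the idea concrete, here’s a simplified, count-based sketch of training a per-language tri-gram model (Tika’s actual training code differs; this just illustrates the "predict the next character from the previous three" structure):

```python
from collections import defaultdict

def train_trigram_model(corpus: str) -> dict:
    """Map each 3-character context to the probability of each
    character that follows it in the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - 3):
        context = corpus[i:i + 3]
        counts[context][corpus[i + 3]] += 1
    # Normalize raw counts into per-context probabilities.
    return {
        context: {ch: n / sum(followers.values()) for ch, n in followers.items()}
        for context, followers in counts.items()
    }

# Toy "corpus"; a real model is trained on a large, diverse dataset.
model = train_trigram_model("the cat sat on the mat. the cat ran.")
print(model["the"])  # every "the" in this corpus is followed by a space
```

Train one such model per language, and you have the ensemble of language-specific predictors described above.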

Now that you have a bunch of trained language models, here’s how Tika predicts languages:

Tika iteratively scans the unknown input document, passing the current tri-gram sample to each language model.

Each language model then tries to predict the next character in the sequence. Each model’s errors are calculated and averaged across the tri-gram samples, yielding a per-language accuracy score.

Finally, after the whole document has been scanned, the language accuracies are compared and the results are returned.
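The prediction loop above can be sketched end to end like this. The helper names are hypothetical, and this averages predicted probabilities rather than using Tika’s exact error metric, but the shape of the computation is the same: scan the document, score every tri-gram sample against every language model, and pick the best-fitting language.

```python
from collections import defaultdict

def train(corpus: str) -> dict:
    """Tri-gram model: 3-char context -> {next character: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - 3):
        counts[corpus[i:i + 3]][corpus[i + 3]] += 1
    return {c: {ch: n / sum(f.values()) for ch, n in f.items()}
            for c, f in counts.items()}

def score(text: str, model: dict) -> float:
    """Average probability the model assigns to each actual next character."""
    samples = range(len(text) - 3)
    if not samples:
        return 0.0
    total = sum(model.get(text[i:i + 3], {}).get(text[i + 3], 0.0)
                for i in samples)
    return total / len(samples)

def identify(text: str, models: dict) -> str:
    """Return the language whose model best predicts the text."""
    return max(models, key=lambda lang: score(text, models[lang]))

# Tiny toy corpora; real language profiles come from large corpora.
models = {
    "en": train("the quick brown fox jumps over the lazy dog. the end."),
    "de": train("der schnelle braune fuchs springt ueber den faulen hund."),
}
print(identify("the fox and the dog", models))  # → en
```

The English model recognizes tri-grams like "the" and " fo" from its corpus, so it scores the input far higher than the German model does.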

## Why You Need Language Identification

Natural language models are generally specific to a particular language. If you aren’t sure what the incoming document’s language is, it’ll be incredibly difficult to provide your users with a good experience.

Imagine a technical support chatbot that could determine the language a user is writing in and reply with documentation in that same language. Or consider a sentiment analysis tool that could automatically detect both the language and the sentiment of any input text.

## How Do I Use Language Identification?

Using the Language Identification algorithm is easy. All you need to do is provide the text you want to identify as a string, and the algorithm takes it from there.

Sample Input

```python
import Algorithmia

# The text whose language we want to identify.
input = {
    "sentence": "Hi, ben Turkce konusuyorum. Hangi dilde konustugumu anlayabiliyor musun?"
}
client = Algorithmia.client('your api key here')
algo = client.algo('nlp/LanguageIdentification/1.0.0')
print(algo.pipe(input))
```

Sample Output

```json
[{"confidence": "0.9999969", "language": "tr"}]
```

That’s it! The algorithm is over 99.99% confident that the input text is in Turkish (see the language key for the algorithm here). Now you have a tool you can use to determine the language of any kind of text or document with confidence!
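Since the algorithm returns a list of candidates and the confidence values come back as strings, you might pull out the top-ranked language like this (a small sketch built on the sample output above):

```python
# Sample output from the algorithm, as shown above.
results = [{"confidence": "0.9999969", "language": "tr"}]

# Pick the candidate with the highest confidence; the confidence
# values are strings, so convert them before comparing.
top = max(results, key=lambda r: float(r["confidence"]))
print(top["language"])  # → tr
```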

Want to extract even more information from text? Take a look at our implementations of the Named Entity Recognition, Parsey McParseface, and Text Summarization algorithms.