specrom / NaturalLanguageDetection / 0.1.2

Overview

This algorithm detects the language of each input text and returns its ISO 639-1 code with a normalized probability score (0-1). The pretrained model covers 97 languages.

It's often useful to detect the language of a text before applying further text processing APIs. For example, if you load thousands of tweets for further processing such as named entity recognition, then it's important to select only tweets in languages your model supports (English, Spanish, etc.). Our NaturalLanguageDetection API can help you do that preprocessing quickly.

The complete list of languages supported by the pretrained model is: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu.
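
Before sending text to the API, it can help to check that the languages your downstream model needs are covered at all. The following is a minimal sketch: `SUPPORTED_LANGUAGES` and `is_supported` are hypothetical names introduced here for illustration and are not part of the API. The codes are copied from the list above.

```python
# Hypothetical helper, not part of the API: the codes are copied
# from the supported-language list in this README.
SUPPORTED_LANGUAGES = frozenset("""
af am an ar as az be bg bn br bs ca cs cy da de dz el en eo es et eu
fa fi fo fr ga gl gu he hi hr ht hu hy id is it ja jv ka kk km kn ko
ku ky la lb lo lt lv mg mk ml mn mr ms mt nb ne nl nn no oc or pa pl
ps pt qu ro ru rw se si sk sl sq sr sv sw ta te th tl tr ug uk ur vi
vo wa xh zh zu
""".split())


def is_supported(code: str) -> bool:
    """Return True if `code` is one of the pretrained ISO 639-1 codes."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


# Check which of a downstream model's languages the detector covers.
wanted = ["en", "es", "xx"]
covered = [code for code in wanted if is_supported(code)]
print(covered)
```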

Usage

Input

{
  "documents": [
    { "id": string, "text": string }
  ]
}
  • id - (required) a unique number or string identifying the document.
  • text - (required) a text document of arbitrary length.
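
As an illustration of the input structure, the sketch below wraps a list of raw strings into a `documents` payload. The function name `build_payload` and the choice of 1-based positions as ids are assumptions made for this example; any unique number or string works as an id.

```python
import json


def build_payload(texts):
    """Wrap raw strings in the input structure described above.

    Assumption for this sketch: ids are the 1-based positions of the
    texts, stored as strings.
    """
    documents = []
    for position, text in enumerate(texts, start=1):
        # Both fields are required, so reject missing or blank text early.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"document {position} has no text")
        documents.append({"id": str(position), "text": text})
    return {"documents": documents}


payload = build_payload([
    "This is a document written in English.",
    "这是一个用中文写的文件",
])
# ensure_ascii=False keeps non-Latin text readable in the printed JSON.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```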

Output

{
  "documents": [
    {
      "Detected_language": [{
        "ISO631-1_language_code": string,
        "normalized_probability": float
      }],
      "id": string
    }
  ]
}
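  • ISO631-1_language_code is the two-letter ISO 639-1 code of the detected language (note that the key itself is spelled "ISO631-1" in the output).
  • normalized_probability is a float (0-1) giving the confidence of the predicted language.
  • id is the number or string supplied for that document in the input.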
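
The output structure can be reduced to a simple lookup from document id to detected language. This is a minimal sketch: `top_languages` is a hypothetical helper, the sample response uses illustrative values, and the code assumes `Detected_language` may hold more than one candidate, which this README does not state either way.

```python
def top_languages(response):
    """Map each document id to (language_code, probability).

    Assumption: if several candidates are listed for a document,
    the one with the highest probability is the one we want.
    """
    result = {}
    for doc in response["documents"]:
        candidates = doc.get("Detected_language", [])
        if not candidates:
            result[doc["id"]] = None  # nothing detected for this document
            continue
        best = max(candidates, key=lambda c: c["normalized_probability"])
        # The key is spelled "ISO631-1" in the API output.
        result[doc["id"]] = (
            best["ISO631-1_language_code"],
            best["normalized_probability"],
        )
    return result


# Sample response with illustrative values.
sample_response = {
    "documents": [
        {
            "Detected_language": [
                {"ISO631-1_language_code": "en", "normalized_probability": 0.99}
            ],
            "id": "1",
        }
    ]
}
print(top_languages(sample_response))
```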

Examples

Input

{
  "documents": [
    { "id": "1", "text": "This is a document written in English." },
    { "id": "2", "text": "Este es un document escrito en Español." },
    { "id": "3", "text": "这是一个用中文写的文件" }
  ]
}

Output

{
  "documents": [
    {
      "Detected_language": [
        {
          "ISO631-1_language_code": "en",
          "normalized_probability": 0.9999999998851724
        }
      ],
      "id": "1"
    },
    {
      "Detected_language": [
        {
          "ISO631-1_language_code": "es",
          "normalized_probability": 0.9999992791970168
        }
      ],
      "id": "2"
    },
    {
      "Detected_language": [
        {
          "ISO631-1_language_code": "zh",
          "normalized_probability": 1
        }
      ],
      "id": "3"
    }
  ]
}
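
The preprocessing use case from the overview (keeping only documents in languages your model supports) can be sketched on top of this example. The function `filter_by_language` and the `min_probability` threshold of 0.9 are assumptions made for illustration; the API call itself is omitted, and the request and response are the example input and output above.

```python
def filter_by_language(request, response, allowed, min_probability=0.9):
    """Return input documents detected as an allowed language.

    `min_probability` is a hypothetical confidence cut-off chosen for
    this sketch, not a value recommended by the API.
    """
    accepted_ids = set()
    for doc in response["documents"]:
        for candidate in doc["Detected_language"]:
            if (candidate["ISO631-1_language_code"] in allowed
                    and candidate["normalized_probability"] >= min_probability):
                accepted_ids.add(doc["id"])
                break
    # Preserve the original input order.
    return [doc for doc in request["documents"] if doc["id"] in accepted_ids]


request = {"documents": [
    {"id": "1", "text": "This is a document written in English."},
    {"id": "2", "text": "Este es un document escrito en Español."},
    {"id": "3", "text": "这是一个用中文写的文件"},
]}
response = {"documents": [
    {"Detected_language": [{"ISO631-1_language_code": "en",
                            "normalized_probability": 0.9999999998851724}],
     "id": "1"},
    {"Detected_language": [{"ISO631-1_language_code": "es",
                            "normalized_probability": 0.9999992791970168}],
     "id": "2"},
    {"Detected_language": [{"ISO631-1_language_code": "zh",
                            "normalized_probability": 1}],
     "id": "3"},
]}

kept = filter_by_language(request, response, allowed={"en", "es"})
print([doc["id"] for doc in kept])  # ['1', '2']
```

Matching on `id` rather than on list position keeps the filter correct even if the response documents come back in a different order than they were sent.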