nlp / LanguageIdentification / 1.0.0

Introduction

This is a language identifier based on Apache Tika.

The supported languages are:

  1. af Afrikaans
  2. an Aragonese
  3. ar Arabic
  4. ast Asturian
  5. be Belarusian
  6. br Breton
  7. ca Catalan
  8. bg Bulgarian
  9. bn Bengali
  10. cs Czech
  11. cy Welsh
  12. da Danish
  13. de German
  14. el Greek
  15. en English
  16. es Spanish
  17. et Estonian
  18. eu Basque
  19. fa Persian
  20. fi Finnish
  21. fr French
  22. ga Irish
  23. gl Galician
  24. gu Gujarati
  25. he Hebrew
  26. hi Hindi
  27. hr Croatian
  28. ht Haitian
  29. hu Hungarian
  30. id Indonesian
  31. is Icelandic
  32. it Italian
  33. ja Japanese
  34. km Khmer
  35. kn Kannada
  36. ko Korean
  37. lt Lithuanian
  38. lv Latvian
  39. mk Macedonian
  40. ml Malayalam
  41. mr Marathi
  42. ms Malay
  43. mt Maltese
  44. ne Nepali
  45. nl Dutch
  46. no Norwegian
  47. oc Occitan
  48. pa Punjabi
  49. pl Polish
  50. pt Portuguese
  51. ro Romanian
  52. ru Russian
  53. sk Slovak
  54. sl Slovene
  55. so Somali
  56. sq Albanian
  57. sr Serbian
  58. sv Swedish
  59. sw Swahili
  60. ta Tamil
  61. te Telugu
  62. th Thai
  63. tl Tagalog
  64. tr Turkish
  65. uk Ukrainian
  66. ur Urdu
  67. vi Vietnamese
  68. wa Walloon
  69. yi Yiddish
  70. zh-cn Simplified Chinese
  71. zh-tw Traditional Chinese

Input(s):

  • (Required): The sentence to identify. (key="sentence")
  • (Optional): Languages to detect, as a comma-separated string of language codes, for example "tr,en". (key="languages")
  • (Optional): Whether the sentence is a short text. (key="shortText")
  • (Optional): Whether the sentence contains mixed languages. (key="mixedLanguages")
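
The input keys above can be assembled into a JSON object before the algorithm is called. The following is a minimal Python sketch, not part of this algorithm: the helper name build_input and its arguments are hypothetical, and the flags are written as the string "true" only because that is how the examples in this README pass them.

```python
import json


def build_input(sentence, languages=None, short_text=False, mixed_languages=False):
    """Build the JSON input object from the keys listed in this README.

    Hypothetical helper for illustration; only the key names come from the README.
    """
    if not sentence:
        raise ValueError("'sentence' is required")
    payload = {"sentence": sentence}
    if languages:
        # "languages" is a single comma-separated string, not a JSON array.
        payload["languages"] = ",".join(languages)
    if short_text:
        # Assumption: the flag is sent as the string "true", as in Example 3.
        payload["shortText"] = "true"
    if mixed_languages:
        payload["mixedLanguages"] = "true"
    return payload


payload = build_input(
    "Hi, ben Turkce konusuyorum.",
    languages=["ru", "it", "fr", "tr", "en", "uk"],
)
print(json.dumps(payload, indent=2))
```

Optional keys are left out entirely when they are not set, which matches Example 1, where only "sentence" is sent.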

Output:

  • A list of detected languages, each with its language code and a confidence value. (Languages with very low confidence values are not shown.)
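
In the examples in this README, the confidence values are returned as strings, so a caller will probably want to convert them to numbers before comparing them. The following is a minimal Python sketch under that assumption; the function name parse_result is hypothetical.

```python
import json


def parse_result(raw_json):
    """Parse the output list into (language, confidence) pairs, best match first."""
    results = json.loads(raw_json)
    # Confidence values are strings in the README examples; convert to float.
    pairs = [(item["language"], float(item["confidence"])) for item in results]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs


raw = '[{"confidence": "0.9999959", "language": "en"}]'
ranked = parse_result(raw)
# The list may be empty if every language fell below the confidence cutoff.
best = ranked[0] if ranked else None
print(best)
```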

Example(s)

Example 1.

  • Parameter 1: Example sentence.
{
  "sentence": "Hello, I'm speaking English. Can you guess which language I'm talking in?"
}

Output:

[
  {
    "confidence": "0.9999959",
    "language": "en"
  }
]

Example 2.

  • Parameter 1: Example sentence.
  • Parameter 2: A list of languages to detect.
{
  "sentence": "Hi, ben Turkce konusuyorum. Hangi dilde konustugumu anlayabiliyor musun?",
  "languages": "ru,it,fr,tr,en,uk"
}

Output:

[
  {
    "confidence": "0.99999857",
    "language": "tr"
  }
]

Example 3.

  • Parameter 1: Example sentence.
  • Parameter 2: A list of languages to detect.
  • Parameter 3: Letting the algorithm know it's a short text.
  • Parameter 4: Letting the algorithm know it's a mixed language sentence.
{
  "sentence": "Hi, ben Turkce konusuyorum. And I'm speaking in English. Hangi dilde konustugumu anlayabiliyor musun? Can you tell which language I'm speaking in?",
  "languages": "ru,it,fr,tr,en,uk",
  "shortText": "true",
  "mixedLanguages": "true"
}

Output:

[
  {
    "confidence": "0.7142835",
    "language": "en"
  },
  {
    "confidence": "0.28571582",
    "language": "tr"
  }
]
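
When a mixed-language sentence yields several entries, as in Example 3, a caller may want to keep only the languages above some cutoff. The following is a minimal Python sketch: the function name languages_above and the default threshold of 0.2 are assumptions chosen for illustration, not values defined by this algorithm.

```python
def languages_above(results, min_confidence=0.2):
    """Return (language, confidence) pairs at or above min_confidence, best first.

    min_confidence is a hypothetical, caller-chosen cutoff.
    """
    kept = []
    for item in results:
        score = float(item["confidence"])  # confidence is a string in the output
        if score >= min_confidence:
            kept.append((item["language"], score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Output of Example 3, copied from above.
example_3_output = [
    {"confidence": "0.7142835", "language": "en"},
    {"confidence": "0.28571582", "language": "tr"},
]

print(languages_above(example_3_output))
print(languages_above(example_3_output, min_confidence=0.5))
```

With the default cutoff both English and Turkish are kept; raising the cutoff to 0.5 keeps only English.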