Categorize Documents by Language


The Language Identification microservice from Algorithmia is a straightforward API that accepts a piece of text and attempts to identify the natural language in which it is written.

This simple Python script examines all the .txt and .docx files in a directory, identifies the language of each file, and moves each file into a subdirectory named after its ISO 639 language code (‘en’, ‘fr’, etc.).
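
The script itself is not reproduced on this page, so the following is only a minimal sketch of the workflow described above, not the original code. The function names (read_text, make_algorithmia_detector, sort_by_language) are hypothetical. The algorithm path "nlp/LanguageIdentification/1.0.0", the {"sentence": ...} input, and the list of {"language", "confidence"} results are assumptions that should be checked against the algorithm's documentation.

```python
import shutil
from pathlib import Path

SUPPORTED = {".txt", ".docx"}


def read_text(path):
    """Return the text content of a .txt or .docx file."""
    if path.suffix.lower() == ".docx":
        # Imported lazily so .txt-only use does not require python-docx
        from docx import Document
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return path.read_text(encoding="utf-8", errors="ignore")


def make_algorithmia_detector(api_key):
    """Build a detect(text) -> language code function backed by Algorithmia.

    Assumption: the algorithm path and the request/response shapes used
    here are not confirmed by this page; verify them before relying on it.
    """
    import Algorithmia
    algo = Algorithmia.client(api_key).algo("nlp/LanguageIdentification/1.0.0")

    def detect(text):
        results = algo.pipe({"sentence": text}).result
        # Keep the candidate the service is most confident about
        best = max(results, key=lambda r: r["confidence"])
        return best["language"]

    return detect


def sort_by_language(directory, detect):
    """Move each .txt/.docx file into a subdirectory named by its language."""
    directory = Path(directory)
    moved = {}
    # sorted() snapshots the listing, so moving files mid-loop is safe
    for path in sorted(directory.iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        text = read_text(path)
        if not text.strip():
            continue  # nothing to identify in an empty file
        lang = detect(text)
        target = directory / lang
        target.mkdir(exist_ok=True)
        shutil.move(str(path), str(target / path.name))
        moved[path.name] = lang
    return moved
```

With a real key and directory, this sketch would be called as sort_by_language("/some/file/path/", make_algorithmia_detector("your_api_key")). Passing the detector in as a function keeps the file-moving logic testable without network access.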

For the full blog post related to this recipe, see Build Your Own Language Detection Microservice.

Getting Started

Create a free Algorithmia account, and install the Algorithmia Python client and the python-docx package:

pip install algorithmia
pip install python-docx

Detailed instructions can be found in the blog post.

How To Run the Script

First, edit the script and replace your_api_key with your Algorithmia API key.

Also replace /some/file/path/ with the local directory you want to examine.

From the command line, navigate to the folder containing your Python file and run:


