tesseractocr

tesseractocr / OCR / 0.1.0


Recognize text in your images with this algorithm. It uses Tesseract, "probably the most accurate open source OCR engine available". For more information on the development of Tesseract, refer to: https://code.google.com/p/tesseract-ocr/

Input: 

Option 1: Binary data only

Option 2: JsonObject with the following properties:

  • "src" (Required): A string that holds a link to the image. This can be a direct link to a file in the Data API, a direct link to a file accessible over the internet (an http link), or a base64-encoded image.
  • "hocr" (Optional): An object whose properties are the options that Tesseract accepts as parameters. The most useful ones are explained below.
  • "confidenceCutoff" (Optional): An integer that lets you control the quality of the output based on the confidence values reported by the algorithm. Default is 0.
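A minimal sketch of an Option 2 input, built in Python and serialized to JSON. The helper name, the image URL, and the cutoff value of 60 are placeholders chosen for illustration; how the request is sent to the algorithm is not shown here.

```python
import json


def build_ocr_input(src, hocr=None, confidence_cutoff=None):
    """Build the JSON input object; only "src" is required."""
    payload = {"src": src}
    if hocr is not None:
        payload["hocr"] = hocr
    if confidence_cutoff is not None:
        payload["confidenceCutoff"] = int(confidence_cutoff)
    return payload


# Hypothetical image URL, used only as an example.
request = build_ocr_input(
    "https://example.com/business-card.png", confidence_cutoff=60
)
print(json.dumps(request, indent=2))
```

Leaving out the optional arguments yields an object with only "src", which is the smallest valid Option 2 input described above.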


Output:

The output is a JsonObject with the properties "result" and "compound". "result" holds the full recognized text; "compound" optionally holds a confidence value for each word recognized.
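The per-word confidence values can be post-processed on the client side. The sketch below uses an abbreviated copy of the values from the sample output in this README and keeps only the words whose confidence meets a threshold. The function name and the threshold of 80 are illustrative and not part of the algorithm.

```python
# Abbreviated from the sample output in this README.
sample_output = {
    "result": "AALGORITHMIA \nDIEGO \nOPPENHEIMER \nCEO \nQ \n",
    "compound": {
        "AALGORITHMIA": 53,
        "DIEGO": 88,
        "OPPENHEIMER": 88,
        "CEO": 88,
        "Q": 55,
    },
}


def confident_words(output, min_confidence):
    """Return the words in "compound" whose confidence is >= min_confidence."""
    compound = output.get("compound", {})
    # Skip the empty-string key that can appear for whitespace-only matches.
    return [word for word, conf in compound.items() if word and conf >= min_confidence]


print(confident_words(sample_output, 80))  # ['DIEGO', 'OPPENHEIMER', 'CEO']
```

Because "compound" is keyed by word, a word that occurs more than once in the image appears only once in this map.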


Parameters:

Prepare your JsonObject with the "hocr" property to make use of different parameters of Tesseract. Some very useful ones are:

tessedit_create_hocr: This outputs an XML file that includes the locations and confidence values of each word recognized. We parse this file and return the relevant information to you in the output object.

tessedit_char_whitelist: A string containing the characters that you want recognized in your image. Recognition is restricted to these characters.

tessedit_char_blacklist: A string containing the characters that you do not want recognized in your image.
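As a sketch, the parameters above could be placed in a "hocr" object like this, here restricting recognition to characters that occur in phone numbers. The value format is an assumption: values are written as strings, mirroring how Tesseract config variables are usually set, and this README does not confirm the exact types expected. The image URL is a placeholder.

```python
import json
import string

# Assumed value format: strings, as in Tesseract config files.
hocr_options = {
    # Ask for word locations and confidence values.
    "tessedit_create_hocr": "1",
    # Only digits and common phone-number punctuation should be recognized.
    "tessedit_char_whitelist": string.digits + ".-()",
}

ocr_input = {
    "src": "https://example.com/phone-number.png",  # hypothetical image URL
    "hocr": hocr_options,
}
print(json.dumps(ocr_input))
```

A blacklist works the same way: replace the "tessedit_char_whitelist" entry with "tessedit_char_blacklist" and list the characters to exclude.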

Basic Mode: Pipe in binary data (an image with the text you would like recognized), get back a JsonObject that includes the result text.

Advanced Mode: We support all the parameters that Tesseract supports. For a full list of possible parameters, refer to: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version.

Sample input:



Sample output:

{ "result": " \n \n \n \nAALGORITHMIA \nDIEGO \nOPPENHEIMER \nCEO \ndiego@algorithmia.com \no \n@doppenhe \n206.552.9054 \nQ \ndoppenheimer \n \n \n",

 "compound": { "": 95, "AALGORITHMIA": 53, "DIEGO": 88, "OPPENHEIMER": 88, "CEO": 88, "diego@algorithmia.com": 83, "o": 77, "@doppenhe": 85, "206.552.9054": 84, "Q": 55, "doppenheimer": 79 } }