character_recognition

character_recognition / tesseract / 0.3.0


This algorithm deciphers and extracts text from a variety of image sources! As its name suggests, it uses an updated version of the open source Tesseract OCR tool. Images are automatically binarized and preprocessed so that Tesseract has an easier time deciphering them. Not only can it extract English text, but Tesseract supports over 100 other languages as well! Give it a try, and don't forget to leave a comment if you like it or have a suggestion.

Note: This algorithm works best when the image contains only text and the text is on a single line. It's recommended to crop everything else out of the image first, using an algorithm like text detection.


I/O

{  
   "image": String,
   "language": String,
   "mode": Integer
}
  • image - (required) - a hosted image file; may be a web URL (http, https) or a data connector URI (data://, s3://, etc.).
  • language - (optional) - the expected language of the text you wish to extract, given as a Tesseract language shortcode (for example 'eng' or 'jpn'); see the Tesseract documentation for the full list of compatible languages and their shortcodes. Defaults to 'eng'.
  • mode - (optional) - the Tesseract OCR engine mode (OEM). Defaults to 2. The modes are:
      • 0 - original Tesseract functionality only
      • 1 - neural net LSTM only
      • 2 - Tesseract + LSTM
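
As a rough illustration of the input format above, the following sketch builds and validates a request payload in Python. The helper name `build_request` and its validation rules are assumptions made for this example and are not part of the algorithm itself; only the three field names and their defaults come from the description above.

```python
import json

# OEM values listed in the I/O section: 0, 1 or 2.
VALID_MODES = {0, 1, 2}


def build_request(image, language="eng", mode=2):
    """Build the JSON input for the algorithm (hypothetical helper).

    The defaults mirror the documented ones: language 'eng', mode 2.
    """
    if not isinstance(image, str) or not image:
        raise ValueError("image must be a non-empty URL or data connector URI")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {"image": image, "language": language, "mode": mode}


# Same image and language as Example 1 in this README.
payload = build_request("http://i.imgur.com/uq9edCO.jpg", language="jpn")
print(json.dumps(payload))
```

If the algorithm is called through the Algorithmia Python client (an assumption about your setup), the resulting dictionary would typically be passed to `client.algo("character_recognition/tesseract/0.3.0").pipe(payload)`.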

Example

Example 1 - Japanese Kanji

Input

{  
   "image":"http://i.imgur.com/uq9edCO.jpg",
   "language":"jpn"
}

Output

{  
   "prediction":"歩 行 者 優 先"
}

Example 2 - English license plate

Input

{  
   "image":"http://i.imgur.com/mdXrRVD.jpg",
   "language":"eng"
}

Output

{  
   "prediction":"ACG8095"
}
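
Both example outputs share the same shape, a JSON object with a single "prediction" string. A minimal sketch for reading that field is shown below. The helper `extract_prediction` is hypothetical, and the optional whitespace removal is only an assumption about what a caller might want for output like Example 1, where the characters come back separated by spaces.

```python
import json


def extract_prediction(response_text, strip_spaces=False):
    """Return the 'prediction' string from a JSON response (hypothetical helper)."""
    result = json.loads(response_text)
    text = result["prediction"]
    if strip_spaces:
        # Drop all whitespace, e.g. the gaps between recognized characters.
        text = "".join(text.split())
    return text


print(extract_prediction('{"prediction":"ACG8095"}'))  # prints: ACG8095
print(extract_prediction('{"prediction":"歩 行 者 優 先"}', strip_spaces=True))  # prints: 歩行者優先
```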

FAQ

Question: Is this a universal OCR tool?

No. Tesseract was designed to be great at document-style OCR, and it's trained on black-and-white images containing only text. Our binarization algorithm gets us closer to general-purpose OCR, but it's not perfect. If the background is too noisy, Tesseract might still fail on the binarized result; this is particularly true for natural scene images.
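
To give a feel for what binarization does, here is a small pure-Python sketch. This README does not say which binarization method the algorithm uses, so the example swaps in Otsu's method, a common global thresholding technique, purely for illustration. The function names and the list-of-rows grayscale image are assumptions made for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    sum_bg = 0      # sum of gray levels at or below the candidate threshold
    weight_bg = 0   # number of pixels at or below the candidate threshold
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map each pixel to 0 or 255 using the Otsu threshold of the whole image."""
    t = otsu_threshold([p for row in image for p in row])
    return [[255 if p > t else 0 for p in row] for row in image]


# Tiny grayscale image: dark pixels near 10, bright pixels near 200.
image = [[10, 12, 200],
         [11, 210, 205]]
print(binarize(image))  # prints: [[0, 0, 255], [0, 255, 255]]
```

A single global threshold like this works well when text and background form two clear brightness groups, which is one way to see why noisy natural scenes are harder.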

Question: If I have a natural scene image, what can I do?

First, try passing it to our NaturalTextNet algorithm, which is designed to extract text from natural scene images. Alternatively, crop out as much of the image that doesn't contain text as you can.

For example, if you have this stop sign:

Tesseract will have a hard time separating out the background and the edges around the stop sign.

However, if you use the text detection algorithm, you can automatically generate a bounding box around just the text:

Finally, now that we have the bounding box, we can crop out just the text for Tesseract:

And the final result:

{"prediction": "STOP"}
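
The crop step in the walkthrough above can be sketched as follows. The bounding box format used here (a dictionary with x, y, width and height in pixels) is an assumption for illustration; the actual output format of the text detection algorithm is not described in this README, and `crop` is a hypothetical helper operating on a plain list-of-rows pixel grid.

```python
def crop(image, box):
    """Crop a 2D pixel grid to a bounding box (hypothetical box format)."""
    x, y, w, h = box["x"], box["y"], box["width"], box["height"]
    if x < 0 or y < 0 or w <= 0 or h <= 0:
        raise ValueError("bounding box must have non-negative origin and positive size")
    # Keep rows y .. y+h-1 and, within each, columns x .. x+w-1.
    return [row[x:x + w] for row in image[y:y + h]]


# 3x4 grid standing in for an image; the box selects the 2x2 block at (1, 1).
grid = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11]]
print(crop(grid, {"x": 1, "y": 1, "width": 2, "height": 2}))  # prints: [[5, 6], [9, 10]]
```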

For a ready-made solution, take a look at TextExtraction.

Question: I have a modified traineddata file for Tesseract and I'd like to use it to improve the service. What can I do?

Mention it in the comments and I'll take a look at adding it in the next release :)

Credits

This algorithm was created using Tesseract. All images are sourced from the Wikimedia Foundation under a Creative Commons license.