ocr / SmartOCR / 0.2.6

An OCR algorithm that applies image cleaning techniques before recognition to achieve higher OCR accuracy. It runs on top of the main Tesseract algorithm (see https://algorithmia.com/algorithms/tesseractocr/OCR).

Input

The algorithm accepts either binary data or a direct Data API URL pointing to a JPG file.

Output

A JsonObject with fields:

  • phones: Any string in the raw output that matches a phone number regex
  • emails: Any string in the raw output that matches an email regex
  • raw: Raw OCR output
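As a rough illustration of how a caller might consume this object, the sketch below parses a made-up response and reads the three fields. It assumes that `phones` and `emails` arrive as lists of strings, which the description above does not state explicitly; the sample values and the `summarize` helper are hypothetical.

```python
import json

# Hypothetical response body. The values are invented for illustration only.
sample_response = json.dumps({
    "phones": ["555-010-4477"],
    "emails": ["jane@example.com"],
    "raw": "Jane Doe\n555-010-4477\njane@example.com",
})


def summarize(response_text):
    """Parse the JSON result and return (phones, emails, raw) with safe defaults."""
    result = json.loads(response_text)
    phones = result.get("phones", [])  # assumed to be a list of strings
    emails = result.get("emails", [])  # assumed to be a list of strings
    raw = result.get("raw", "")
    return phones, emails, raw


phones, emails, raw = summarize(sample_response)
print("phones:", phones)
print("emails:", emails)
print("lines of raw text:", len(raw.splitlines()))
```

Using `.get` with defaults keeps the caller working even when a field is missing from the result.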

Features

Perspective Transform:

For photos taken at an angle, the algorithm corrects the viewing angle so that the letters stand upright, perpendicular to the horizontal axis, instead of appearing slanted.
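The idea behind a perspective correction can be sketched in plain Python. The code below is not the algorithm's actual implementation: it assumes the four corners of the text region are already known (the coordinates are hypothetical) and only computes the 3x3 homography that maps them onto an upright rectangle. A full warp would additionally resample every pixel through this mapping.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 9 homography coefficients mapping 4 src points to 4 dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs) + [1.0]  # last coefficient fixed to 1


def apply_homography(h, x, y):
    """Map a single point through the homography."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners of a page photographed at an angle (TL, TR, BR, BL).
src = [(30, 20), (210, 45), (220, 160), (15, 130)]
# Target: an upright 200 x 100 rectangle.
dst = [(0, 0), (200, 0), (200, 100), (0, 100)]

h = homography(src, dst)
mapped = [apply_homography(h, x, y) for x, y in src]
print(mapped)
```

Each source corner should land on its target corner up to floating-point error.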

Sharpening:

The algorithm applies a deblurring mask to the image so that the letters are sharper.
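The exact mask used here is not documented, so the following is only a minimal sketch of the general technique. It assumes a common 3x3 Laplacian-style sharpening kernel and a grayscale image stored as a list of rows of integers in the range 0 to 255; both the kernel and the `sharpen` helper are illustrative assumptions.

```python
# Assumed kernel: a widely used 3x3 sharpening mask (not confirmed by the README).
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def sharpen(image):
    """Convolve a grayscale image with the kernel; border pixels are copied as-is."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += SHARPEN_KERNEL[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = max(0, min(255, acc))  # clamp to the valid range
    return out


# A soft vertical edge: the values ramp from dark to bright over several pixels.
blurry = [
    [100, 100, 120, 180, 200, 200],
    [100, 100, 120, 180, 200, 200],
    [100, 100, 120, 180, 200, 200],
]

sharp = sharpen(blurry)
print(sharp[1])
```

In the middle row the step between the third and fourth pixel grows from 60 to 140, which is the steeper edge that makes letter strokes look crisper.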

Image Binarization:

The image is binarized to black and white so that the OCR algorithm achieves higher accuracy. More information can be obtained from:
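As a rough illustration of binarization, the sketch below picks a global threshold with Otsu's method and maps every pixel to black or white. The README does not say which thresholding method the algorithm uses, so Otsu's method is an assumption here, as are the helper names and the sample pixel values.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image):
    """Map each grayscale pixel to 0 (black) or 255 (white)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


# Hypothetical grayscale patch: dark ink on a bright background.
gray = [
    [30, 40, 200],
    [35, 210, 220],
]

print(binarize(gray))
```

A global threshold like this works best on evenly lit images; photos with shadows usually call for a locally adaptive threshold instead.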

Phone and Email Regex Matching:

The raw OCR output is matched against generic phone and email regexes for easier data extraction.
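The patterns the algorithm actually uses are not listed, so the sketch below uses two illustrative regexes of its own. `PHONE_RE`, `EMAIL_RE`, and `extract_contacts` are hypothetical names, and the phone pattern only targets North American style numbers.

```python
import re

# Assumed patterns for illustration; the algorithm's real regexes may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[\s.-]?)?"   # optional country code
    r"\(?\d{3}\)?[\s.-]?"       # area code, with or without parentheses
    r"\d{3}[\s.-]?\d{4}"        # local number
)


def extract_contacts(raw):
    """Return a dict shaped like the output described in this README."""
    return {
        "phones": PHONE_RE.findall(raw),
        "emails": EMAIL_RE.findall(raw),
        "raw": raw,
    }


text = "Call (555) 010-4477 or mail jane.doe@example.com today"
result = extract_contacts(text)
print(result["phones"])
print(result["emails"])
```

Generic patterns like these trade precision for coverage, so OCR noise such as a misread digit can cause both missed and spurious matches.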

Sample image:


Transformed:


Sharpened:


Binarized: