Document OCR with OpenCV and Tesseract

I recently revisited a project I had been working on to familiarise myself with OpenCV and tidied it up for release. In essence the program is a command line utility which uses OpenCV to extract text from photographs. It then reorients the text to enable Tesseract to more easily conduct Optical Character Recognition (OCR). It can then save the resulting tidied page image and OCR text.

Tesseract

Tesseract is a well established open source OCR solution. It is used by calling an executable version of tesseract using a Python Tesseract interface library. It appears to work best when presented with a high contrast image of text which does not suffer from angling or key stoning issues. Therefore the key role of OpenCV in the project is to reorient the image to present the text correctly. It also increases the contrast of the image prior to OCR.

OpenCV

OpenCV turns out to be quite a useful way to rapidly build out an image processing pipeline. It lacks the flexibility of full machine learning pipeline in terms of functions that can be applied. On the other hand it has a wide range of well established image processing techniques built in which you can rapidly assembled into practical vision solutions. I found some useful tutorials at https://www.computervision.zone/ which let me get up to speed quickly.

Initial location and rectification of the text in the image
Initial location and rectification of the text in the image

The basic strategy adopted is to find the largest quadrilateral in a grey scale version of the image. Next IĀ  warp this quadrilateral to a rectangle to present the text in the correct orientation. Then I resize the image, blur it to ensure small imperfections in letters and background noise are reduced and then threshold the image into a two tone image. I then pass this two tone image to Tesseract to convert to text.

Resize, blur and erode steps
Resize, blur and erode steps

Argparse and Pytest

Other points of interest you can find in the code are some use of arparse to allow the project to be run from the command line. I have also given also examples of usingĀ  pytest facilities for mocking with @patch and parameterisation with @pytest.mark.parametrize.

The Code

You can find the project at https://github.com/JustinMatters/DocumentReader instructions for use can be found in the read me file. As a first experiment with OpenCV, I am reasonably happy with it, though it can prove to be a little brittle depending on the quality of image provided.

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.