Using Python for Text Extraction | Digital Scholarship in Arts (DiSA)

Working with a Python library (a collection of prewritten code that makes certain tasks easier) can be an easy and free way to extract text from images and PDFs. Most of these libraries can be set up quickly and run either from the command line or within a Jupyter notebook. Much like their paid alternatives, these libraries make use of recent advances in AI and machine-learning models. If your document images are primarily of modern printed or typed texts, then pyTesseract or EasyOCR might be what you want.

pyTesseract

PyTesseract is an open-source Python library used to extract text from image files. Built as a wrapper for Google’s Tesseract-OCR Engine, it relies on traditional image processing techniques and pattern matching for character recognition. It can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. It also has Unicode (UTF-8) support and can recognize over 100 languages. PyTesseract performs well on high-resolution, clean images with clear text and simple layouts. Images that fall short of this usually require preprocessing (image cleanup, noise reduction, deskewing) before recognition. Support for handwriting transcription is limited.
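As a rough illustration of the preprocess-then-recognize workflow described above, here is a minimal sketch. It assumes that Pillow and pytesseract are installed and that the Tesseract engine itself is installed separately on the system. The function name ocr_image and the particular cleanup steps (grayscale, contrast stretch, median filter) are illustrative choices rather than a recommended recipe, and deskewing is left out.

```python
from pathlib import Path


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognizes in a single image file.

    ocr_image is a hypothetical helper name used for this sketch.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such image: {path}")

    # Imported here so the file check above runs even on a machine
    # where Pillow and pytesseract have not been installed yet.
    from PIL import Image, ImageFilter, ImageOps
    import pytesseract

    with Image.open(path) as img:
        # Light cleanup (an assumption about what helps; tune per collection).
        gray = ImageOps.grayscale(img)            # drop colour information
        contrasted = ImageOps.autocontrast(gray)  # stretch the contrast range
        denoised = contrasted.filter(ImageFilter.MedianFilter(size=3))  # reduce speckle noise

        # lang takes a Tesseract language code; "eng" is English.
        return pytesseract.image_to_string(denoised, lang=lang)
```

Calling ocr_image on a scanned page returns the recognized text as one string. How much the cleanup steps help depends on the source images, so it is worth comparing the output with and without them.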

EasyOCR

EasyOCR is an open-source Python library (like PyTesseract) used to extract text from image files. It uses deep learning models for both text detection and recognition, and it excels at recognizing text in challenging conditions: noisy images, varying fonts, complex layouts, and distorted text. Because of this deep-learning approach, less image preprocessing is needed. It can also take advantage of GPU acceleration for faster processing. EasyOCR supports 80+ languages and popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic. Handwriting transcription is currently weak, but support for it is next in the pipeline.
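The following minimal sketch shows what reading an image with EasyOCR might look like, assuming the easyocr package is installed. The function name read_with_easyocr and the min_confidence filter are illustrative additions, not part of the library. Note that creating a Reader downloads model weights the first time it runs.

```python
from pathlib import Path


def read_with_easyocr(image_path, languages=("en",), use_gpu=False, min_confidence=0.0):
    """Return the text snippets EasyOCR detects in one image.

    read_with_easyocr and min_confidence are hypothetical names for this sketch.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such image: {path}")

    # Imported here so the file check above runs even on a machine
    # where easyocr has not been installed yet.
    import easyocr

    # Reader loads the detection and recognition models for the given
    # language codes; gpu=True uses a GPU if one is available.
    reader = easyocr.Reader(list(languages), gpu=use_gpu)

    # readtext returns one (bounding_box, text, confidence) tuple
    # per detected text region.
    results = reader.readtext(str(path))

    # Keep only the text of regions at or above the confidence threshold.
    return [text for _box, text, confidence in results if confidence >= min_confidence]
```

The function returns a list of strings, one per detected text region. Raising min_confidence drops uncertain readings, which can be useful on noisy scans, though the right threshold is something to find by experiment.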
