What is Text Extraction?
Text extraction is the process of turning images of text into an editable format (usually returned as a .txt file or JSON). Text extraction tools and code libraries rely on Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) approaches to interpret images of writing. OCR excels at reading images of printed text, whereas HTR can interpret handwritten text. Text extraction tools range from full-featured platforms like Transkribus to Python libraries like pytesseract and EasyOCR. Many modern tools are aided by AI techniques like machine learning and natural language processing.
We use text extraction tools to quickly and accurately convert text from a format that cannot be easily read or analyzed by machines (images) into a format that can be analyzed through computational and digital methods and tools. While manual transcription by a skilled human reader tends to be the most accurate means of extracting text, the volume of scanned images is sometimes too large for human transcribers to complete in a reasonable time.
Automating this process with an OCR/HTR tool can help researchers access the text data sets they need in a more timely and efficient manner. Although OCR has been around since at least the 1960s, the tools and approaches have continued to evolve. Many popular text extraction tools incorporate newer technologies like AI and machine learning to increase transcription accuracy and to handle particularly complex situations and layouts.
What Can OCR/HTR Do?
OCR and HTR allow us to convert images of printed and handwritten text into text documents that can be more easily read by both humans and machines. Once converted, the text files can then be analyzed and searched for patterns, recurrences, anomalies, and other statistical and textual features. Usually, text extraction with OCR/HTR is the first step in the process of making a historical text more accessible to researchers, so that further analysis and preservation steps can be taken.
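As a rough illustration of the kind of analysis that becomes possible once text has been extracted, the sketch below counts word frequencies and finds a search term in context using only the Python standard library. The sample text and the function names (`word_frequencies`, `find_in_context`) are hypothetical and stand in for the contents of a real OCR output file.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the extracted text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def find_in_context(text: str, term: str, width: int = 30) -> list[str]:
    """Return each occurrence of `term` with up to `width` characters on each side."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


# Hypothetical OCR output; in practice this would be read from a .txt file.
sample_text = "The ship left port in May. The ship returned to port in June."

frequencies = word_frequencies(sample_text)
print(frequencies.most_common(3))
print(find_in_context(sample_text, "port", width=10))
```

The same functions could be applied to a whole folder of extracted text files to compare vocabulary across documents.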
Case 1: Printed or typeset documents
This is the easiest case for most contemporary OCR/HTR tools.
For modern and contemporary texts (1800-present), fonts tend to be legible and consistent, letter shapes are conventional, and page layouts are fairly uniform. Earlier (pre-1800) typeset documents often present challenges and introduce errors due to unusual fonts, inconsistent spelling, and unfamiliar or antiquated letter shapes. The presence of other marks (stamps, marginalia, etc.) and/or damage (tears, blotches, hole punches, or gaps) can produce errors and mistranscriptions (e.g., mistaking a hole punch for an “O” or a “0”). Typical examples include literary texts, court transcripts, legal documents, academic articles, financial documents, correspondence, newspapers, and posters.
Case 2: Manuscripts and handwritten texts
Historical manuscripts, letters, documents, and other texts written by hand can pose many unique challenges to digital transcription. Some texts might feature multiple scribes with different writing styles. Others feature barely legible handwriting, inconsistent or idiosyncratic spelling, intermixed illustrations or marginalia, or the faded presence of previous texts (palimpsests and recycled paper). Handwritten documents can also have irregular lines of text that are skewed, curved, or otherwise distorted. In these cases, working with a tool or library that is capable of HTR is vital.
Case 3: Illustrations or images containing text
Text can also be extracted from scanned artwork, photography, posters, diagrams, charts, tables, or maps. While most OCR/HTR tools can extract any text located within an image as a single undifferentiated block of text, you will need a more advanced tool or approach if you also wish to preserve the arrangement and location of individual text elements. This is typically done by identifying different blocks or regions, then recording both the text and its associated location information in a JSON file.
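To make the idea of recording text together with its location more concrete, here is a minimal sketch of one possible JSON layout. The field names (`text`, `x`, `y`, `width`, `height`), the `TextRegion` class, and the sample regions are assumptions chosen for illustration; real tools each define their own output schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TextRegion:
    """One block of recognized text and its bounding box, in pixels."""
    text: str
    x: int        # left edge of the region
    y: int        # top edge of the region
    width: int
    height: int


def regions_to_json(regions: list[TextRegion]) -> str:
    """Serialize a list of regions to a JSON string."""
    return json.dumps({"regions": [asdict(r) for r in regions]}, indent=2)


# Hypothetical regions, as if detected on a scanned map.
regions = [
    TextRegion("Lake Superior", x=120, y=80, width=210, height=32),
    TextRegion("Scale 1:50,000", x=40, y=910, width=180, height=24),
]

output = regions_to_json(regions)
print(output)
```

Keeping the coordinates alongside the text means a later step can, for example, draw the regions back onto the image or reconstruct the reading order.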
Popular Text Extraction Tools
Depending on your needs, budget, and the type of texts you are working with, there are a wide variety of OCR/HTR tools available, ranging from free open-source Python libraries like pytesseract and EasyOCR to paid proprietary solutions like Transkribus or Microsoft Azure’s Document Intelligence. Free options require some level of Python programming, but tutorials and guides are readily available. You can find our Jupyter notebooks for using pytesseract and EasyOCR on the DiSA GitHub.
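For a sense of how little code the free libraries need, the sketch below shows one way a basic extraction might look with pytesseract and EasyOCR. It is not taken from the DiSA notebooks. It assumes both libraries (plus Pillow and the Tesseract engine itself) are installed, and the wrapper function names and the file path in the usage comment are hypothetical.

```python
from pathlib import Path


def extract_with_pytesseract(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract OCR on one image and return the recognized text."""
    # Imported inside the function so this file still loads
    # when the optional libraries are not installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def extract_with_easyocr(image_path: str, languages=("en",)) -> str:
    """Run EasyOCR on one image and return the recognized lines joined by newlines."""
    import easyocr

    reader = easyocr.Reader(list(languages))
    # detail=0 asks EasyOCR for the text only, without boxes or confidence scores.
    lines = reader.readtext(image_path, detail=0)
    return "\n".join(lines)


def text_path_for(image_path: str) -> Path:
    """Return the path of the .txt file that sits next to the given image."""
    return Path(image_path).with_suffix(".txt")


def save_text(text: str, image_path: str) -> Path:
    """Write the extracted text next to the image and return the new file's path."""
    out_path = text_path_for(image_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Example usage (hypothetical file path):
# text = extract_with_pytesseract("scans/page_001.png")
# save_text(text, "scans/page_001.png")
```

Results from the two libraries can differ on the same image, so it may be worth trying both on a small sample before processing a full collection.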
Further Reading