Document Intelligence

If you want to use computer-aided transcription to turn scanned documents that have no text layer into searchable, machine-readable text, there are many options.

DiSA has had success with Document Intelligence, a cloud-based Microsoft Azure AI service for building intelligent document processing solutions. In Digital Humanities and Digital Scholarship work, Document Intelligence can be used specifically to extract printed and handwritten text from unsearchable PDFs. Researchers often encounter older scanned documents that are not machine-readable, so their text cannot simply be copied and pasted. If you are working with a large document or a large collection of scanned documents that require transcription, the UBC DiSA team can help you set up Document Intelligence so that you can extract the data you need from these files quickly and efficiently.


Setting Up Document Intelligence using Microsoft Azure

To use Document Intelligence, you will need to:

  1. Create a Microsoft Azure account
  2. Set up a subscription and choose a level (Free is fine for testing or small amounts of text)
  3. Create a Document Intelligence resource
  4. Note the resource Endpoint and Key
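
Once you have noted the Endpoint and Key from step 4, it is generally safer to keep them out of your code. The following is a minimal sketch, not DiSA's own script, that reads both values from environment variables. The variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY and the function load_credentials are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical variable names (an assumption); use whatever names you prefer.
ENDPOINT_VAR = "DOCINTEL_ENDPOINT"
KEY_VAR = "DOCINTEL_KEY"


def load_credentials(env=None):
    """Return (endpoint, key) read from environment variables.

    Raises ValueError if either value is missing or the endpoint
    is not an https:// URL.
    """
    env = os.environ if env is None else env
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()

    # Report every missing variable at once rather than one at a time.
    missing = [
        name
        for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key))
        if not value
    ]
    if missing:
        raise ValueError("Missing environment variable(s): " + ", ".join(missing))
    if not endpoint.startswith("https://"):
        raise ValueError(ENDPOINT_VAR + " should be an https:// URL")
    return endpoint, key
```

Set the two variables in your shell or notebook environment before running, then call load_credentials() once at the top of your script. Keeping the Key in the environment means it does not end up in a shared notebook or a public repository.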

READ: DiSA GitHub – How to Set up Document Intelligence


Using Document Intelligence in a Jupyter or Google Colab Notebook

For smaller batches of files, you can run Document Intelligence through a Jupyter or Google Colab notebook.

READ: DiSA GitHub – Document Intelligence in a Web Notebook


Using Document Intelligence Locally

If you wish to run the Document Intelligence transcription on your own machine (as opposed to running it from a Jupyter or Google Colab notebook), you will need to have a recent version of Python installed. We also recommend having Microsoft VS Code or a similar coding environment installed to make editing and running the program easier.
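
To give a sense of what a local run can look like, here is a minimal sketch. It is an illustration under stated assumptions, not the DiSA program linked below. It assumes the azure-ai-formrecognizer Python package (installed with pip) and Azure's prebuilt "read" model; newer SDK packages exist and the DiSA guide may use a different one. The function names transcribe_pdf and transcribe_folder are hypothetical.

```python
from pathlib import Path


def transcribe_pdf(pdf_path, endpoint, key):
    """Send one PDF to the prebuilt "read" model and return its text.

    Assumes the azure-ai-formrecognizer package is installed.
    """
    # Imported here so the rest of the script loads without the SDK.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as pdf_file:
        poller = client.begin_analyze_document("prebuilt-read", document=pdf_file)
        result = poller.result()  # waits until the service has finished
    return result.content


def transcribe_folder(folder, transcribe, out_suffix=".txt"):
    """Transcribe every .pdf in folder, writing a text file beside each one.

    transcribe is any function that takes a PDF path and returns a string.
    PDFs that already have an output file are skipped, so an interrupted
    run can be resumed without sending the same file twice.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        out_path = pdf_path.with_suffix(out_suffix)
        if out_path.exists():
            continue
        out_path.write_text(transcribe(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

With your resource's Endpoint and Key stored in variables named endpoint and key, a call such as transcribe_folder("scans", lambda p: transcribe_pdf(p, endpoint, key)) would process a folder named "scans" (a placeholder name). Passing the transcription function in as an argument keeps the folder loop independent of Azure, so it can be tried out with a stand-in function before any files are sent to the service.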

READ: DiSA GitHub – Running Document Intelligence on Your Own Computer


Other Resources