Document Intelligence

Document Intelligence is a cloud-based Azure AI service for building intelligent document processing solutions. In Digital Humanities and Digital Scholarship work, it can be used to extract printed and handwritten text from unsearchable PDFs. Researchers often encounter older scanned documents that are not machine-readable, so their text cannot simply be copied and pasted. If you are working with a large document or a large collection of scanned documents that require transcription, the UBC DiSA team can help you set up Document Intelligence so that you can extract the data you need from these files quickly and efficiently.

Setting Up Document Intelligence

To use Document Intelligence, you will need to:

  1. Create a Microsoft Azure account
  2. Set up a subscription and choose a pricing tier (Free is fine for testing or small amounts of text)
  3. Create a Document Intelligence resource
  4. Note the resource Endpoint and Key
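
Once you have noted the Endpoint and Key, it is usually safer to keep them out of your scripts and notebooks and read them from environment variables instead. The following is a minimal sketch of that idea, not the DiSA guide's own method. The variable names DOCUMENTINTELLIGENCE_ENDPOINT and DOCUMENTINTELLIGENCE_KEY and the function load_credentials are hypothetical names chosen for illustration.

```python
import os

# Hypothetical variable names (an assumption); use whatever names suit your setup.
ENDPOINT_VAR = "DOCUMENTINTELLIGENCE_ENDPOINT"
KEY_VAR = "DOCUMENTINTELLIGENCE_KEY"


def load_credentials(env=None):
    """Return (endpoint, key) read from environment variables.

    `env` defaults to the real process environment; passing a plain dict
    makes the function easy to try out without touching your settings.
    """
    env = os.environ if env is None else env
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()

    # Report every missing value at once so setup problems are easy to fix.
    missing = [name for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key)) if not value]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))

    # The endpoint copied from the Azure portal is an https:// URL.
    if not endpoint.startswith("https://"):
        raise ValueError("The endpoint should be an https:// URL copied from the Azure portal")
    return endpoint, key
```

Calling load_credentials() with no arguments reads from your real environment. Treat the Key like a password: do not paste it into shared notebooks or commit it to a repository.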

READ: DiSA GitHub – How to Set up Document Intelligence

Using Document Intelligence in a Jupyter or Google Colab Notebook

For smaller batches of files, you can run Document Intelligence through Jupyter or Google Colab.

READ: DiSA GitHub – Document Intelligence in a Web Notebook

Using Document Intelligence Locally

If you wish to run the Document Intelligence transcription on your own computer (as opposed to running it from a Jupyter or Google Colab Notebook), you will need to have a recent version of Python installed. We also recommend having Microsoft VS Code or a similar coding environment installed to make editing and running the program easier.
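
To give a sense of what a local script might look like, here is a rough sketch that sends each PDF in a folder to the service and saves the extracted text beside it. It is an illustration under stated assumptions, not the program from the DiSA guide. It assumes the azure-ai-formrecognizer package (installed with pip install azure-ai-formrecognizer) and its prebuilt "prebuilt-read" model; Microsoft also publishes a newer azure-ai-documentintelligence package whose client names differ. The function names find_pdfs, extract_text, and transcribe_folder are hypothetical.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_text(pdf_path, endpoint, key):
    """Send one PDF to the prebuilt read model and return the text it finds."""
    # Imported here so the rest of the script loads even if the package
    # (an assumption: azure-ai-formrecognizer) is not installed yet.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    with open(pdf_path, "rb") as f:
        # The call starts a long-running operation; result() waits for it to finish.
        result = client.begin_analyze_document("prebuilt-read", document=f).result()
    return result.content


def transcribe_folder(folder, endpoint, key, extract=extract_text):
    """Write a .txt file next to each PDF and return the paths written."""
    written = []
    for pdf in find_pdfs(folder):
        out = pdf.with_suffix(".txt")
        out.write_text(extract(pdf, endpoint, key), encoding="utf-8")
        written.append(out)
    return written
```

A call such as transcribe_folder("scans", endpoint, key) would then process every PDF in a folder named scans, using the Endpoint and Key from your Document Intelligence resource. The extract parameter exists only so the loop can be tried with a stand-in function before any documents are sent to Azure.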

READ: DiSA GitHub – Running Document Intelligence on Your Own Computer

Other Resources