Document Intelligence is a cloud-based Azure AI service for building intelligent document processing solutions. In Digital Humanities and Digital Scholarship work, it is especially useful for extracting printed and handwritten text from unsearchable PDFs. Researchers often encounter older scanned documents that are not machine-readable, so their text cannot simply be copied and pasted. If you have a large document or a large collection of scanned documents that requires transcription, the UBC DiSA team can help you set up Document Intelligence so that you can extract the data you need from these files quickly and efficiently.
Setting Up Document Intelligence
To use Document Intelligence, you will need to:
- Create a Microsoft Azure account
- Set up a subscription and choose a pricing tier (Free is fine for testing or small amounts of text)
- Create a Document Intelligence resource
- Note the resource Endpoint and Key
READ: DiSA GitHub – How to Set up Document Intelligence
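Once you have noted the Endpoint and Key, it is safer to keep them out of your code. The sketch below is one possible approach, not the DiSA guide's own method: it reads both values from environment variables and does a basic sanity check. The variable names `DOCUMENTINTELLIGENCE_ENDPOINT` and `DOCUMENTINTELLIGENCE_KEY` and the function name `load_azure_settings` are hypothetical choices made for illustration.

```python
import os
from urllib.parse import urlparse


def load_azure_settings(env=os.environ):
    """Return (endpoint, key) read from environment variables.

    The variable names below are hypothetical; use any names you like,
    as long as they match what you set on your machine or notebook.
    """
    endpoint = env.get("DOCUMENTINTELLIGENCE_ENDPOINT", "").strip()
    key = env.get("DOCUMENTINTELLIGENCE_KEY", "").strip()

    if not endpoint or not key:
        raise ValueError(
            "Set both DOCUMENTINTELLIGENCE_ENDPOINT and DOCUMENTINTELLIGENCE_KEY."
        )

    # Assumption: the Endpoint copied from the Azure portal is an https:// URL.
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("The endpoint should be an https:// URL from the Azure portal.")

    return endpoint, key
```

Calling `endpoint, key = load_azure_settings()` at the top of a script then fails early, with a readable message, if either value is missing.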
Using Document Intelligence in a Jupyter or Google Colab Notebook
For smaller batches of files, you can run Document Intelligence through a Jupyter or Google Colab notebook.
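As a rough illustration of what a notebook cell might look like, the sketch below sends one PDF to the service and returns the extracted text. It assumes the `azure-ai-formrecognizer` Python package (installed in a notebook with `%pip install azure-ai-formrecognizer`) and the prebuilt "read" model; the DiSA notebook may use a different package or model. The function names `make_client` and `transcribe_pdf` are hypothetical.

```python
def make_client(endpoint, key):
    """Build a client using the azure-ai-formrecognizer package (an assumption)."""
    # Imported here so the cell still loads if the package is not installed yet.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def transcribe_pdf(client, pdf_path):
    """Send one PDF to the prebuilt "read" model and return its full text."""
    with open(pdf_path, "rb") as f:
        # begin_analyze_document starts a long-running job; result() waits for it.
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        result = poller.result()
    return result.content
```

With your own Endpoint and Key in hand, a cell such as `print(transcribe_pdf(make_client(endpoint, key), "scan.pdf"))` would print the transcription, where `scan.pdf` stands in for a file you have uploaded to the notebook.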
Using Document Intelligence Locally
If you wish to run the Document Intelligence transcription on your own machine (as opposed to running it from a Jupyter or Google Colab notebook), you will need to have a recent version of Python installed. We also recommend installing Microsoft VS Code or a similar coding environment to make editing and running the program easier.
READ: DiSA GitHub – Running Document Intelligence on Your Own Computer
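When working locally with a whole folder of scans, a small loop can save a text file beside each PDF. The sketch below is only one way to organise this and is not taken from the DiSA program: `transcribe_folder` is a hypothetical helper, and `transcribe` stands for any function you supply that takes a PDF path and returns its text, for example one that calls Document Intelligence.

```python
from pathlib import Path


def transcribe_folder(folder, transcribe, overwrite=False):
    """Write a .txt transcription next to every .pdf file in `folder`.

    `transcribe` is any function that takes a PDF path and returns text.
    Returns the list of text files written on this run.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        if txt_path.exists() and not overwrite:
            continue  # already transcribed on an earlier run
        txt_path.write_text(transcribe(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Skipping files that already have a transcription means an interrupted run can be restarted without sending the same pages to the service again, which matters on the Free tier. Pass `overwrite=True` to redo everything.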