Berger Court Transcripts Digital Processing

Professor Tina Loo, History

For her study of a historical pipeline inquiry involving First Nations/Indigenous witnesses, Dr. Tina Loo needed to extract and clean text from 459 scanned PDF documents, representing over 200,000 pages of court transcripts that were not machine-readable. DiSA developed a custom Python-based workflow that used Microsoft Azure Document Intelligence to batch-process the documents, then applied natural language processing and regular expression pattern matching to remove transcription artifacts (line numbers, hole punch marks, hand stamps, and administrative metadata). The data cleaning functions were tailored to the particular challenges of digitizing historical court documents. The workflow also preserved confidence-level reports and stored the cleaned transcripts separately from the source files. With extraction and cleaning complete, the researcher is able to proceed with further analysis in the project’s next phase.
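
As an illustration of the regular expression cleaning step described above, the following is a minimal sketch rather than the project's actual code. The function name clean_page and all three patterns are assumptions made for this example: the source does not give the real rules, the wording of the hand stamps, or how hole punch marks appeared in the extracted text.

```python
import re

# All patterns below are hypothetical; the project's real rules are not published here.

# Leading transcript line number (1-2 digits) followed by text on the same line.
# Caveat: this would also strip a genuine number that happens to start a line.
LINE_NUMBER = re.compile(r"^\s*\d{1,2}\s+(?=\S)")

# Assumed hand-stamp wording; a real stamp list would come from inspecting the scans.
STAMP = re.compile(r"^\s*(RECEIVED|FILED|COPY)\b.*$", re.IGNORECASE)

# A line holding only one circle-like glyph, a plausible OCR reading of a punch hole.
HOLE_PUNCH = re.compile(r"^\s*[oO0•●○]\s*$")


def clean_page(text: str) -> str:
    """Drop stamp and hole-punch lines, then strip leading line numbers."""
    cleaned = []
    for line in text.splitlines():
        if STAMP.match(line) or HOLE_PUNCH.match(line):
            continue  # discard the whole artifact line
        line = LINE_NUMBER.sub("", line)
        cleaned.append(line.rstrip())
    return "\n".join(cleaned)
```

Applied to extracted text one page at a time, a function like this leaves the spoken testimony in place and removes only lines or prefixes that match an artifact pattern. Writing its output to a separate directory would keep the cleaned transcripts apart from the source files, as the workflow does.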