Associate Professor Siobhán McElduff, Culture, Ancient Mediterranean and Near Eastern Studies

Working with historical texts featuring archaic typographic conventions (long s, ligatures, period-specific letterforms), McElduff extracted text using Tesseract OCR, but the resulting transcriptions were riddled with systematic errors compounded by poor scan quality, low resolution, misalignment, and clipped pages. McElduff had begun addressing these errors with Python and regular expressions; DiSA then helped develop a multi-pass data cleaning pipeline that progressed from correcting common typographic transcription errors, to pattern-matching known word forms, and finally to targeting recurring vocabulary specific to the corpus. Using object-oriented Python, DiSA supported the creation of functions to identify discrete catalog entries, preserve their entry numbers, and deduce corrupted numbers based on sequential logic. This work gave McElduff both the refined data cleaning code and comparative transcription results for future decision-making.
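To give a sense of what such a pipeline can look like, here is a minimal Python sketch. It is not the project's actual code: the rule lists, the sample words, and the function names `clean` and `repair_numbers` are all hypothetical, chosen only to illustrate the three cleaning passes and the sequential-number idea described above.

```python
import re

# Pass 1: typographic conventions (long s, ligatures). Illustrative rules only.
TYPOGRAPHIC_FIXES = [
    (re.compile("ſ"), "s"),
    (re.compile("ﬁ"), "fi"),
    (re.compile("ﬂ"), "fl"),
]

# Pass 2: known word forms where OCR read a long s as "f" (hypothetical list).
WORD_FORM_FIXES = [
    (re.compile(r"\bfaid\b"), "said"),
    (re.compile(r"\bfuch\b"), "such"),
]

# Pass 3: recurring corpus-specific vocabulary (hypothetical example).
CORPUS_FIXES = [
    (re.compile(r"\bCatalogne\b"), "Catalogue"),
]


def clean(text, passes=(TYPOGRAPHIC_FIXES, WORD_FORM_FIXES, CORPUS_FIXES)):
    """Apply each pass in order, from general fixes to corpus-specific ones."""
    for rules in passes:
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
    return text


def repair_numbers(tokens):
    """Deduce corrupted entry numbers from their neighbours.

    A token that is not plain ASCII digits is treated as corrupted. It is
    filled in only when the surrounding sequence makes the value unambiguous;
    otherwise it is left as None for manual review.
    """
    numbers = [int(t) if re.fullmatch(r"[0-9]+", t) else None for t in tokens]
    for i, value in enumerate(numbers):
        if value is not None:
            continue
        prev = numbers[i - 1] if i > 0 else None
        nxt = numbers[i + 1] if i + 1 < len(numbers) else None
        if prev is not None and (nxt is None or nxt == prev + 2):
            numbers[i] = prev + 1
        elif prev is None and nxt is not None:
            numbers[i] = nxt - 1
    return numbers


print(clean("The ſaid Catalogne was fuch"))
print(repair_numbers(["12", "l3", "14", "I5", "16"]))
```

Ordering the passes from general to specific keeps the early rules reusable across corpora, while leaving ambiguous entry numbers as `None` keeps the script from guessing where a human should decide.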
Through in-kind RA and GA time, DiSA helped edit Dr. McElduff’s transcriptions and encoding, including locating missing metadata for the ballads, reorganizing People, Places, and Organizations data, ensuring all encoding tags are consistent and valid, and adding new tags, people, and places where relevant.
- Read a feature about Dr. McElduff’s process here.