Aims: To evaluate the effects of Optical Character Recognition (OCR) accuracy on automatic cancer classification of pathology reports.
Methods: A total of 228 scanned images of pathology reports obtained from an Australian state-based Cancer Registry were converted to free text using a commercial OCR system. A state-of-the-art rule-based cancer classification system, Medical Text Extraction (Medtex) [1], was employed to classify the OCR-ed reports. A post-processing pipeline [2] was implemented to recover part of the formatting lost during the OCR process. Ground truth judgements for cancer classification were provided by two clinical coders. Classification effectiveness on OCR-ed reports was compared with that obtained on human-corrected reports.
Results: The OCR tool recognised free text with high word accuracy (98.8%); however, some words that may affect the correct classification of reports in terms of histology (e.g. keratinising) and site (e.g. anterior, posterior) were misrecognised. Pathology reports were classified as notifiable or not notifiable with an F-measure of 91.6%, regardless of OCR errors. Lower effectiveness was observed when extracting finer-grained synoptic factors, e.g. primary site and histological type. OCR errors lowered the performance of the system relative to that obtained on reports in which the OCR errors had been rectified.
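The two headline figures above (word accuracy and F-measure) can be illustrated with a minimal sketch. The study does not state how either metric was computed, so everything here is an assumption: word accuracy is taken as one minus the word-level edit distance divided by the reference length, and the F-measure is taken as the balanced F1 score over binary notifiable labels. The function names `word_accuracy` and `f_measure` and the sample strings are hypothetical and are not part of Medtex or the OCR tool.

```python
from typing import Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance computed over word tokens rather than characters."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from OCR output
                            curr[j - 1] + 1,      # spurious word in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (word edit distance / reference word count)."""
    ref = reference.split()
    hyp = ocr_output.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


def f_measure(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Balanced F1 for a binary label (True = notifiable); an assumed variant."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example: one OCR substitution in a six-word phrase.
    corrected = "keratinising squamous cell carcinoma anterior tongue"
    ocr_text = "keratinlsing squamous cell carcinoma anterior tongue"
    print(f"word accuracy: {word_accuracy(corrected, ocr_text):.3f}")

    # Invented labels for four reports.
    gold = [True, True, False, True]
    predicted = [True, False, False, True]
    print(f"F-measure: {f_measure(gold, predicted):.3f}")
```

The example also shows why a high word accuracy can coexist with classification errors: a single misrecognised term such as "keratinising" barely moves the accuracy figure, yet it may be exactly the word a rule-based classifier relies on.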
Conclusions: A commercial OCR tool appeared well suited to converting image-based pathology reports into electronic free text with high accuracy. However, words key to the identification of affected body structures and histological types were misrecognised by the OCR tool. Medtex was used to assess the impact of OCR errors on the automatic coding of pathology reports. Experimental results suggest that the effect of OCR errors is negligible when classifying notifiable cancer reports. However, OCR errors do affect the classification of synoptic reporting items, for which the automatic system achieved lower performance. Future work will consider incorporating OCR error correction within Medtex to improve synoptic reporting.