Tenasol, as a regular part of our operations, uses OCR technology to process and extract data from medical records before our NLP processes run. As we previously discussed, healthcare data sources in the payer space consist largely of PDFs and images, as opposed to purely structured HL7 ADT, HL7 C-CDA, or HL7 FHIR data, due to gaps in interoperability.

This process is particularly difficult because the image quality of many records is very low, pages can be shuffled, and a record can run as long as 60,000 pages.

If you skip the rest of this article, know the following:

OCR systems take in image or PDF files and output data representing what text is where.
Google Document AI is currently the top performing OCR engine for high volume generalized processing if cost is not a factor.
Tesseract OCR is currently the best open source and free OCR engine, but is incapable of reading handwriting and has a high error rate.
“OCR-less” LLM transformer solutions are in early development to directly interpret images without OCR (e.g. Donut) but fall short on performance. For the tech savvy, these pair an image encoder with a text decoder.
No system today can perfectly read documents and their structure.
There is a balance between processing speed and accuracy in OCR which must be kept in mind when choosing an OCR solution.

Comparison of OCR engines — *Error of major OCR engines [1]*

OCR has been around for decades, believe it or not, and the approaches used have remained relatively consistent, with only small variations to improve performance. Remarkably, the largest and earliest user of OCR technology is the US Postal Service, which adopted it in the 1960s for automated mail sorting and still uses it today.

How OCR Works

Step 1: Image Preprocessing

In an initial step, images are preprocessed. This might mean converting them to a single format, deskewing them (correcting rotation or tilt), or adjusting image parameters such as contrast so that later steps go more smoothly.
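As a minimal sketch of the contrast adjustment mentioned above, the example below stretches a washed-out grayscale image to the full 0-255 range and then binarizes it into ink and background. The function names, the fixed threshold of 128, and the tiny sample image are assumptions for illustration only; production engines use more robust methods and operate on real image files.

```python
def stretch_contrast(gray):
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in gray]


def binarize(gray, threshold=128):
    """Map each pixel to 1 (ink) if darker than the threshold, else 0 (paper)."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# A hypothetical low-contrast scan: "ink" is 100 and "paper" is 160.
faded = [
    [160, 100, 100, 160],
    [160, 100, 160, 160],
    [160, 100, 100, 160],
]

stretched = stretch_contrast(faded)  # ink becomes 0, paper becomes 255
binary = binarize(stretched)         # 1 where there is ink

for row in binary:
    print("".join("#" if p else "." for p in row))
```

Stretching before thresholding means a single fixed threshold can work across scans that differ in overall brightness.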

Step 2: Locate Characters

In this step, the position of each character is located; however, the characters are not yet identified! Knowing where they are is essential for several later steps. Deep learning is very useful here, as techniques such as convolutional neural networks (CNNs) are able to pick out objects in images very efficiently.
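To illustrate what "located but not identified" means, here is a small sketch that finds bounding boxes of ink blobs in a binarized image. Note that it uses classical connected-component labeling rather than a CNN, purely because it fits in a few lines; the function name and the sample page are hypothetical.

```python
from collections import deque


def find_boxes(binary):
    """Return (left, top, right, bottom) boxes of 4-connected ink blobs."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill this blob while tracking how far it extends.
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


# Two separate blobs: a "C"-like shape on the left and a bar on the right.
page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
boxes = find_boxes(page)
print(boxes)
```

The output is only a list of positions. Nothing in it says which letter each blob is; that happens in the recognition step.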

Step 3: Segmentation

Characters (letters) are grouped in multiple ways. This grouping is valuable to the later OCR processing steps, and may also be useful to a user after OCR is complete. Characters, words (also called tokens), lines, and paragraphs all carry information in their grouped locations and individual sizes.
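One simple way to picture this grouping is to merge character boxes into words whenever the horizontal gap between neighbors is small. The sketch below assumes all boxes sit on one line and uses a hypothetical `max_gap` of 2 pixels; real engines derive spacing thresholds from font size and other cues.

```python
def group_into_words(char_boxes, max_gap=2):
    """Group (left, top, right, bottom) character boxes into words by gap size."""
    words = []
    for box in sorted(char_boxes):
        # Start a new word when the gap to the previous character is too wide.
        if words and box[0] - words[-1][-1][2] <= max_gap:
            words[-1].append(box)
        else:
            words.append([box])
    return words


def bounding_box(boxes):
    """Smallest box that encloses every box in the group."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Five characters: three close together, a wide gap, then two more.
line = [(0, 0, 3, 5), (5, 0, 8, 5), (10, 0, 13, 5), (20, 0, 23, 5), (25, 0, 28, 5)]
words = group_into_words(line)
word_boxes = [bounding_box(w) for w in words]
print(word_boxes)
```

The same idea applies one level up: word boxes can be merged into lines by vertical overlap, and lines into paragraphs by line spacing.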

Step 4: Recognition

At this point, each character is classified. There are a few ways to do this. Generally, a character set is specified first (e.g. ASCII, UTF-8, etc.). This is usually not something that a user can select.

Then each character image is compared against every entry in the character set to find the closest match. This is usually done by one of the following methods:

Lowering the resolution of the sub-image and then matching it against a library that contains all possible permutations of a grid of the same size (faster, less accurate).
Using a machine learning model such as k-nearest neighbors or a deep convolutional neural network (slower, more accurate, works for handwriting).

Intelligent systems will sometimes use both approaches for speed and compute optimization.
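The first method can be sketched as follows: shrink the character image to a small grid, then pick the closest entry from a template library. This is a simplified variant that compares against three hypothetical 3x3 templates by Hamming distance, rather than a full table of every possible grid; the templates and the sample glyph are made up for illustration.

```python
# Hypothetical 3x3 templates; a real library would cover the full character set.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def downsample(glyph, size=3):
    """Shrink a square binary glyph to size x size by majority vote per block."""
    block = len(glyph) // size
    grid = []
    for by in range(size):
        row = ""
        for bx in range(size):
            ink = sum(
                glyph[by * block + dy][bx * block + dx] == "1"
                for dy in range(block)
                for dx in range(block)
            )
            row += "1" if ink * 2 >= block * block else "0"
        grid.append(row)
    return grid


def hamming(a, b):
    """Count the grid cells where two equally sized grids differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognize(glyph):
    """Return the template character whose grid is closest to the glyph."""
    small = downsample(glyph)
    return min(TEMPLATES, key=lambda ch: hamming(small, TEMPLATES[ch]))


# A 6x6 "L" with one pixel of noise in the bottom row.
noisy_l = [
    "110000",
    "110000",
    "110000",
    "110000",
    "111111",
    "111101",
]
print(recognize(noisy_l))
```

Lowering the resolution absorbs the single noisy pixel here, which is part of why this approach is fast. The same loss of detail is also why it is less accurate on similar-looking characters.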

Step 5: Sequencing

Once all characters are recognized, they are sequenced. Information about where the characters are and what they are can be used to help make this decision. Characters, words (tokens), lines, paragraphs, and pages are usually sequenced using straightforward logic or simple machine learning models.
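A minimal sketch of the "straightforward logic" case: bucket words into lines by their vertical position, then read each line left to right. The `(text, x, y)` tuple layout and the `line_tolerance` value are assumptions for this example.

```python
def reading_order(words, line_tolerance=5):
    """Order (text, x, y) words top to bottom, then left to right per line."""
    lines = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current line if its y is close to the line's first word.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# Words arrive in arbitrary order, with slightly uneven baselines.
words = [("pressure", 60, 11), ("Blood", 10, 10), ("120/80", 10, 31), ("mmHg", 70, 30)]
ordered_lines = reading_order(words)
print(ordered_lines)
```

The tolerance matters because scanned text is rarely perfectly level; words on the same line can differ by a pixel or two in height.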

Step 6: Language Detection

Once characters have been sequenced, an attempt can be made to determine what language the text is in. The language can be specified before running, but modern systems are easily capable of auto-detection using simple machine learning models based on how common each language is and on the character sequences seen, which depend on the preceding sequencing step.
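As a rough illustration of auto-detection, the sketch below scores text by how many very common words of each language it contains. This is a deliberately simple rule-based stand-in for the machine learning models described above, and the tiny word lists are hypothetical.

```python
# Hypothetical, deliberately tiny lists of common words per language.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in"},
    "spanish": {"el", "la", "de", "y", "en", "es"},
    "french": {"le", "la", "de", "et", "est", "les"},
}


def detect_language(text):
    """Return the language whose common words appear most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in common for token in tokens)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_language("The patient is in the clinic"))
print(detect_language("El paciente es de la clinica"))
```

This only works once words are in the right order and correctly separated, which is why detection comes after sequencing.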

Step 7: OCR Error Correction

Once the language is determined, characters may be changed using machine learning models, based on their frequency in the given language and their proximity to other characters in the word, line, or paragraph. This is an extremely important step that is more common in modern systems. Think, for example, about a “0” and an “O”: a zero is far more likely to be next to other numbers, and a letter O is far more likely to be next to other letters.

More recently, systems have taken advantage of increasingly complex Long Short-Term Memory (LSTM) neural networks to improve the performance of this step.
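The zero-versus-O example can be written as a tiny rule: look at the other characters in the token and make the ambiguous ones agree with the majority. This is a hand-written heuristic for illustration, not the LSTM-based correction that modern engines use, and the function name is hypothetical.

```python
def fix_zero_oh(token):
    """Make '0' and 'O' agree with the majority character type in the token."""
    others = [c for c in token if c not in "0O"]
    digits = sum(c.isdigit() for c in others)
    letters = sum(c.isalpha() for c in others)
    if digits > letters:
        return token.replace("O", "0")  # mostly numeric: treat O as zero
    if letters > digits:
        return token.replace("0", "O")  # mostly alphabetic: treat 0 as letter O
    return token  # no clear majority: leave the token unchanged


print(fix_zero_oh("1O5"))
print(fix_zero_oh("D0CTOR"))
print(fix_zero_oh("2O24-O1"))
```

A rule like this fails on genuinely mixed tokens such as identifiers or codes, which is one reason learned models that consider wider context perform better.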

Step 8: OCR Output Creation

Finally, the output is created. This is most commonly in JSON format and contains all of the previously mentioned information.
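The sketch below shows what such a JSON output might look like. The field names (`language`, `text`, `words`, `box`, `confidence`) are assumptions for illustration; every OCR engine defines its own schema, so consult the documentation of the engine you use.

```python
import json


def build_output(words, language):
    """Assemble a hypothetical OCR result: full text plus per-word details."""
    return {
        "language": language,
        "text": " ".join(w["text"] for w in words),
        "words": words,
    }


# Hypothetical recognized words with pixel boxes as [left, top, right, bottom].
words = [
    {"text": "Blood", "box": [10, 10, 50, 22], "confidence": 0.98},
    {"text": "pressure", "box": [60, 10, 130, 22], "confidence": 0.95},
]

document = build_output(words, "en")
serialized = json.dumps(document, indent=2)
print(serialized)
```

Keeping the boxes alongside the text is what lets downstream NLP link an extracted value back to its location on the page.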

Where OCR Fails

Errors do occur. The comparison below shows Tesseract, an early OCR system that is now free and open source. Many still use it today for its speed, but it falls short on accuracy. The comparison shows what happens when a weaker OCR engine like Tesseract is used on noisy or blurred text. More modern systems evaluate context and recognize characters better, reaching near-perfect accuracy on a task like this.

errors in ocr processing — *LEFT: Ground truth RIGHT: Tesseract with errors highlighted* *[1]. Tenasol does not utilize Tesseract OCR due to its error rate.*

But errors may stem from more than poor handwriting or poor-quality images. Often lines may be drawn through text, or the text may be extracted in a nonsensical sequence. Convolutional evaluation of a page might cause an OCR system to read a block of text as a single paragraph, for example, when it is really two paragraphs side by side. Additionally, color, font type, font size, and bold/italic information, which may lend value in the context of a page as a whole, is often lost by these OCR systems.
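The side-by-side paragraph problem is easy to reproduce. In the sketch below, reading strictly row by row interleaves two columns, while splitting the page at a column boundary first keeps each column together. The `split_x` boundary is assumed to be known here; in practice an OCR engine has to infer it, and that inference is where things go wrong.

```python
def naive_order(words):
    """Read strictly row by row: sort (text, x, y) words by y, then x."""
    return " ".join(w[0] for w in sorted(words, key=lambda w: (w[2], w[1])))


def column_aware_order(words, split_x):
    """Read the left column top to bottom, then the right column."""
    by_position = lambda w: (w[2], w[1])
    left = sorted((w for w in words if w[1] < split_x), key=by_position)
    right = sorted((w for w in words if w[1] >= split_x), key=by_position)
    return " ".join(w[0] for w in left + right)


# Two hypothetical columns side by side on a page.
words = [
    ("Allergies:", 10, 10), ("Medications:", 110, 10),
    ("penicillin", 10, 20), ("metformin", 110, 20),
]

print(naive_order(words))
print(column_aware_order(words, split_x=100))
```

In the naive reading, "penicillin" lands next to "Medications:", which is exactly the kind of nonsensical sequence that can mislead downstream extraction.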

Sentence Structure

Documents with poor resolution, artifacts, or handwriting often result in occasional character inaccuracies. Take for example:

Let's eat, Grandma! [ACTUAL]
vs

Let's eat Grandma! [DETECTED]

Notice how the comma is missing from the second sentence, changing its meaning. It is not rare for OCR to miss a comma when image quality is poor. A downstream process that relies on part-of-speech tagging would interpret the sentence differently, resulting in inaccuracy.
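To put a number on this example, the sketch below computes a character error rate: the edit distance between the actual and detected text, divided by the length of the actual text. This is one common way to score OCR output and is shown here only as an illustration; it is not necessarily the metric used in the benchmark cited in this article.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth, detected):
    """Edit distance normalized by the length of the ground-truth text."""
    return edit_distance(truth, detected) / len(truth)


truth = "Let's eat, Grandma!"
detected = "Let's eat Grandma!"
cer = character_error_rate(truth, detected)
print(f"{cer:.1%}")
```

One dropped character out of 19 is an error rate of roughly 5%, yet the meaning of the sentence changes completely. A low error rate alone does not guarantee that downstream NLP will read the text correctly.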

Tables & Forms

Some cloud providers offer premium OCR services, at a significantly higher fee, that can detect and partially interpret tables and forms. Form processing variants may autonomously detect form fields and their values, or may take a pre-specified template for greater accuracy.

One downside of table extraction is that it is up to the user to determine what the headers are, if any are present. Non-standard tables, or tables without visible lines, may not be parsed properly.

Major Players in OCR

Google: Google initially utilized Tesseract, but its cloud platform now offers both a Document AI product and a Vision AI product (standard and advanced).

Microsoft: Nuance, previously a global leader in OCR, was purchased by Microsoft, and the service was wrapped into Microsoft's cloud platform, now known as Microsoft Azure, and has received significant upgrades since then.

Amazon: Amazon, like Google, built an in-house OCR solution, called Textract, that offers the best value but the lowest performance among paid options.

Tesseract (open source): Originally built by HP, Tesseract was at one time the top OCR engine. HP later released it as open source, after which Google sponsored its development. Tesseract is great for low-cost, low-margin processing projects or proof-of-concept scenarios.

How OCR data is used for Medical Record Processing

General NLP Services:

Generalized OCR services for searching and parsing
Automatic page rotation / reorientation
Duplicate page removal
Duplicate document detection
Annotation of medical records
Conversion between medical record formats (e.g. C-CDA, FHIR, PDF, HL7 ADT, image, RTF)

Extraction NLP Services:

Demographics
Patient information
Patient SDoH
Codes (ICD-10-CM, SNOMED, LOINC, CPT, RxNorm, etc.) with linked date of service, practitioner, etc.
Labs and Vitals
Practitioner information
Facility information
Disease research

Single-Record NLP AI Services:

Past / present / future patient event detection / prediction
Readability / handwriting detection
HEDIS evidence extraction
Risk evidence extraction
Language translation
Sentiment analysis
Summarization

Multi-Record NLP AI Services:

Population health trends and visualizations
Fraud, waste, and abuse detection
Risk adjustment prioritization of medical records
HEDIS / quality prioritization of medical records
Duplicate document detection
Duplicate patient detection
Adverse event detection / pharmacovigilance

Conclusion

Optical Character Recognition (OCR) technology has evolved significantly but remains imperfect. OCR involves complex steps, including preprocessing, recognition, sequencing, and error correction, leveraging advanced machine learning techniques to enhance performance. Challenges persist in processing low-resolution images, interpreting handwriting, and handling complex formats like tables. Choosing an OCR solution requires balancing speed, accuracy, and cost. If you would like to see how Tenasol can support OCR processing as a part of a larger pipeline for healthcare data source extraction, please reach out to our sales team.

References

[1] Hegghammer, Thomas. "OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment." Journal of Computational Social Science 5.1 (2022): 861-882.

What is the Best Optical Character Recognition (OCR) Engine