The Best Medical Record OCR

Medical Record OCR

As a regular part of our operations, Tenasol uses multiple OCR technologies to process and extract data from medical records ahead of our NLP processes. As we previously discussed, healthcare data sources in the payer space consist largely of PDFs and images rather than purely structured HL7 C-CDA or HL7 FHIR data.

This process is particularly difficult because the image quality of many records is very low, pages can be out of order, and records can run as long as 60,000 pages. This blog covers known OCR vendors, their relative performance, and the limitations of these OCR systems.

If you want to skip this blog, here are the key takeaways:

  • OCR systems take in image or PDF files and output data representing what text is where.

  • Google Document AI is currently the top-performing OCR engine for high-volume generalized processing if cost is not a factor.

  • Tesseract OCR is currently the best free, open-source OCR engine, but it is incapable of reading handwriting and has a high error rate.

  • No system today is able to perfectly read documents and their structure.

  • There is a balance between processing speed and accuracy in OCR which must be kept in mind when choosing an OCR solution.

  • New solutions that interpret images directly without OCR, such as Donut, are on the rise, though often with lower performance.

Comparison of OCR engines

Error of major OCR engines [1]

Believe it or not, OCR has been around for decades, and the approaches used have remained relatively consistent, with only small variations to improve performance. Remarkably, one of the largest and earliest adopters of OCR technology was the US Postal Service, which began using it in the 1960s for automated mail sorting and still uses it today.

Medical OCR performance options

How OCR Works

Step 1: Image Preprocessing

In an initial step, images are preprocessed. This might mean converting them to a single format, deskewing them (correcting rotation) as shown below, or adjusting image parameters such as contrast so later steps go more smoothly.
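As a rough illustration of the contrast adjustment mentioned above, the following minimal sketch stretches the contrast of a tiny grayscale image and then binarizes it. The function names, the sample pixel values, and the fixed threshold of 128 are assumptions made for this example and are not taken from any particular OCR engine.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def binarize(image: Image, threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels as ink (1) and light pixels as background (0)."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


# A faded scan: the "ink" (140) is only slightly darker than the paper (200).
faded_scan = [
    [200, 200, 200, 200],
    [200, 140, 140, 200],
    [200, 140, 200, 200],
]

ink_without_stretch = binarize(faded_scan)                # ink is missed entirely
ink_with_stretch = binarize(stretch_contrast(faded_scan))  # ink is recovered
```

In this toy case the faded ink sits above the threshold, so binarizing the raw scan finds nothing, while stretching the contrast first recovers the three ink pixels.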

Skew Correction in OCR

Step 2: Locate Characters

In this step, characters are located on the page, but they are not yet identified. Knowing where they are is essential for several later steps. Deep learning is very useful here, as techniques such as convolutional neural networks (CNNs) pick out objects in images very efficiently.
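To make "located but not identified" concrete, here is a minimal sketch that finds bounding boxes for blobs of ink in a binary image. It uses a classical connected-component flood fill rather than a CNN, purely because it fits in a few lines; the function name and the toy page are assumptions for illustration.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixels


def locate_glyphs(ink: List[List[int]]) -> List[Box]:
    """Return one bounding box per 4-connected blob of ink pixels."""
    height, width = len(ink), len(ink[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if ink[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            left, top, right, bottom = x, y, x, y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    in_bounds = 0 <= nx < width and 0 <= ny < height
                    if in_bounds and ink[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
glyph_boxes = locate_glyphs(page)  # two boxes, with no idea yet which letters they are
```

A simple approach like this splits characters made of separate strokes (the dot on an "i", for example) and merges touching characters, which is part of why learned detectors are preferred in practice.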

Medical OCR letter detection

Step 3: Segmentation

Characters (letters) are grouped in multiple ways. This is valuable to the OCR processing steps, and may also be useful to a user after OCR is complete. Characters, words (also called tokens), lines, and paragraphs all carry information in their grouped locations and individual sizes.
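One simple way to picture this grouping is to merge character boxes on a line into words whenever the horizontal gap between them is small. The sketch below does exactly that; the `max_gap` threshold of 3 pixels and the sample coordinates are assumptions for illustration, and real engines typically adapt such thresholds to font size.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def group_into_words(char_boxes: List[Box], max_gap: int = 3) -> List[List[Box]]:
    """Group character boxes on one text line into words.

    A new word starts when the horizontal gap to the previous character
    exceeds max_gap pixels (an assumed, illustrative threshold).
    """
    words: List[List[Box]] = []
    for box in sorted(char_boxes, key=lambda b: b[0]):
        if words and box[0] - words[-1][-1][2] <= max_gap:
            words[-1].append(box)
        else:
            words.append([box])
    return words


def bounding_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the group."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Five characters: three close together, a wide gap, then two more.
line = [(0, 0, 4, 9), (6, 0, 10, 9), (12, 0, 16, 9), (25, 0, 29, 9), (31, 0, 35, 9)]
words = group_into_words(line)
word_boxes = [bounding_box(w) for w in words]
```

The same idea can be repeated one level up, grouping words into lines by vertical position and lines into paragraphs by spacing.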

Medical OCR sequencing

Step 4: Recognition

At this point, each character is classified. There are a few ways to do this. Generally, a character set is specified first (e.g., ASCII or Unicode). This is usually not something a user can select.

Then each located character is compared against every character in the set to find the closest match. This is usually done by one of the following methods:

  • lowering the resolution of the sub-image and then matching it against a library containing all possible permutations of a same-size grid (faster, less accurate).

  • using a machine learning model such as k-nearest neighbors or a deep learning convolutional neural network (slower, more accurate, works for handwriting).

Intelligent systems will sometimes use both approaches for speed and compute optimization.
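The first method above can be sketched as follows: shrink the glyph to a small grid, then pick the template with the fewest differing cells (a 1-nearest-neighbor match by Hamming distance). The three-letter template library, the 3x3 grid size, and the function names are all assumptions for illustration; the sketch also assumes a square glyph whose side is a multiple of the grid size.

```python
from typing import Dict, List

Grid = List[List[int]]

# Hypothetical 3x3 template library for a three-letter character set.
TEMPLATES: Dict[str, Grid] = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def downsample(glyph: Grid, size: int = 3) -> Grid:
    """Shrink a square binary glyph; a cell is ink if at least half its pixels are."""
    block = len(glyph) // size
    grid = []
    for gy in range(size):
        row = []
        for gx in range(size):
            ink = sum(
                glyph[y][x]
                for y in range(gy * block, (gy + 1) * block)
                for x in range(gx * block, (gx + 1) * block)
            )
            row.append(1 if 2 * ink >= block * block else 0)
        grid.append(row)
    return grid


def hamming(a: Grid, b: Grid) -> int:
    """Count the cells where two equally sized grids disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph: Grid) -> str:
    """Return the template character closest to the downsampled glyph."""
    small = downsample(glyph)
    return min(TEMPLATES, key=lambda ch: hamming(small, TEMPLATES[ch]))


# A 6x6 "L" with one stray ink pixel and one missing pixel.
noisy_l = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
]
best_match = recognize(noisy_l)
```

Lowering the resolution absorbs small amounts of noise, which is why the stray and missing pixels do not change the result here. It also discards detail, which is the source of the "less accurate" trade-off.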

Medical OCR character detection

Step 5: Sequencing

Once all characters are recognized, they are sequenced. Both the location of each character and its identity can inform this decision. Characters, words (tokens), lines, paragraphs, and pages are usually sequenced using straightforward logic or simple machine learning models.
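The "straightforward logic" can be as simple as sorting by position. The sketch below groups recognized words into lines when their top coordinates are close, then reads each line left to right. The `line_tolerance` of 5 pixels, the tuple layout, and the sample words are assumptions for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]  # (text, left, top) in pixels


def reading_order(words: List[Word], line_tolerance: int = 5) -> List[str]:
    """Group words into lines by similar top coordinate, then read left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Compare against the first word of the current line.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# Words arrive in arbitrary order from the recognition step.
scattered = [
    ("pressure", 60, 12),
    ("Blood", 10, 10),
    ("mmHg", 70, 41),
    ("120/80", 10, 40),
]
ordered_lines = reading_order(scattered)
```

Logic this simple assumes a single column of text. Multi-column pages, tables, and rotated text need layout analysis before a top-to-bottom, left-to-right sort gives a sensible result.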

Medical OCR character sequencing

Step 6: Language Detection

Once characters have been sequenced, an attempt can be made to determine what language the text is in. The language can be specified before running, but modern systems can easily auto-detect it using simple machine learning models based on how common each language is and the character sequences observed, which in turn depend on the preceding sequencing step.
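As a loose illustration of auto-detection, the sketch below scores a text against small lists of very common words per language and returns the best-scoring language. This is a simplified stand-in for the statistical models described above; the word lists, language codes, and function name are assumptions for the example, and production systems rely on far richer character-sequence statistics.

```python
from typing import Dict, Set

# Tiny, illustrative stopword lists; a real system would use far larger statistics.
STOPWORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "of", "with", "was", "for", "is"},
    "es": {"el", "la", "de", "con", "fue", "para", "es", "y"},
    "fr": {"le", "la", "de", "avec", "et", "pour", "est", "les"},
}


def detect_language(text: str) -> str:
    """Return the language code whose stopwords appear most often in the text."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)


english = detect_language("The patient was seen for follow-up of hypertension")
spanish = detect_language("El paciente fue evaluado para control de la diabetes")
```

Because the tokens come from the sequencing step, words that were split or merged incorrectly there will also weaken the language scores here.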

Step 7: OCR Error Correction

Once language is determined, characters may be changed based on their frequency in the given language and their proximity to other characters in the word, line, or paragraph, using machine learning models. This is an extremely important step that is more common in modern systems. Think, for example, about a “0” and an “O”: a zero is far more likely to be next to other numbers, and a letter O is far more likely to be next to other letters.

More recently, systems have taken advantage of increasingly complex long short-term memory (LSTM) neural networks to improve the performance of this step.
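The zero-versus-O intuition can be written down as a naive rule: within a token, swap look-alike characters to match whichever class (digits or letters) is in the majority. The sketch below is a rule-based illustration only, not the LSTM approach mentioned above; the look-alike tables and function names are assumptions for the example.

```python
# Assumed look-alike pairs; real systems learn confusions from data.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}
LETTER_LOOKALIKES = {"0": "O", "1": "l"}


def correct_token(token: str) -> str:
    """Swap look-alike characters to match the token's majority character class."""
    digits = sum(c.isdigit() for c in token)
    letters = sum(c.isalpha() for c in token)
    if digits > letters:
        table = DIGIT_LOOKALIKES
    elif letters > digits:
        table = LETTER_LOOKALIKES
    else:
        return token  # no clear majority: leave the token alone
    return "".join(table.get(c, c) for c in token)


def correct_line(line: str) -> str:
    """Apply the token-level correction to each whitespace-separated token."""
    return " ".join(correct_token(t) for t in line.split())


fixed_reading = correct_line("BP 12O/8O")
fixed_word = correct_token("D0CTOR")
```

A majority rule like this breaks down on mixed tokens such as dosages written without a space, which is one reason modern systems use learned sequence models that weigh much wider context.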

Step 8: OCR Output Creation

Finally, the output is created. This is most commonly a JSON document that contains all of the previously mentioned information.
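To give a feel for what such output might look like, here is a small sketch that builds a result, serializes it to JSON, and rebuilds plain text from it. The schema (field names such as `pages`, `lines`, `bbox`, and `confidence`) is hypothetical; each vendor defines its own structure.

```python
import json

# Hypothetical schema for illustration only; each vendor defines its own.
ocr_result = {
    "pages": [
        {
            "page_number": 1,
            "language": "en",
            "lines": [
                {
                    "text": "Blood pressure",
                    "bbox": [10, 10, 130, 24],  # left, top, right, bottom
                    "words": [
                        {"text": "Blood", "bbox": [10, 10, 52, 24], "confidence": 0.99},
                        {"text": "pressure", "bbox": [60, 12, 130, 24], "confidence": 0.97},
                    ],
                }
            ],
        }
    ]
}

serialized = json.dumps(ocr_result, indent=2)


def page_text(result: dict, page_index: int = 0) -> str:
    """Rebuild plain text for one page from the structured output."""
    return "\n".join(line["text"] for line in result["pages"][page_index]["lines"])


restored = json.loads(serialized)
first_page_text = page_text(restored)
```

Keeping positions and confidence scores alongside the text is what lets downstream steps highlight evidence on the page or flag low-confidence regions for review.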

Where OCR Fails

Errors do occur. Tesseract, an early OCR system that is now free and open source, is still used by many today for its speed, but it falls short on accuracy. The comparison below shows what happens when a weaker OCR engine like Tesseract is used on noisy or blurred text. More modern systems evaluate context and recognize characters better, reaching near-perfect accuracy on a task like this.

errors in ocr processing

LEFT: Ground truth. RIGHT: Tesseract with errors highlighted [1]. Tenasol does not use Tesseract OCR due to its error rate.

But errors may stem from more than poor handwriting or poor-quality images. Often lines may be drawn through text, or the text may be extracted in a nonsensical sequence. Convolutional evaluation of a page might cause an OCR system to read a block of text as a single paragraph, for example, when it is really two paragraphs side by side. Additionally, color, font type, font size, and bold/italic information, which may add value in the context of a page as a whole, is often lost by these OCR systems.

Sentence Structure

Documents with poor resolution, artifacts, or handwriting often result in occasional character errors. Take for example:

Let's eat, Grandma! [ACTUAL]
vs

Let's eat Grandma! [DETECTED]

Notice how the comma is removed in the second sentence, changing its meaning. It is not rare for OCR to miss a comma when image quality is poor. Downstream part-of-speech processing would then interpret the sentence differently, producing inaccurate results.
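The example also shows why character-level accuracy can be misleading. The sketch below computes the character error rate (edit distance divided by the length of the ground truth) for the two sentences; the function names are assumptions for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth: str, detected: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    return edit_distance(truth, detected) / len(truth)


truth = "Let's eat, Grandma!"
detected = "Let's eat Grandma!"

errors = edit_distance(truth, detected)      # 1 missing character
cer = character_error_rate(truth, detected)  # 1 / 19, roughly 5%
```

One wrong character out of 19 is roughly 95% character accuracy, yet the meaning of the sentence has changed completely.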

Tables & Forms

Some cloud services offer premium OCR services for a significantly higher fee that are capable of detecting and partially interpreting tables and forms. Form processing software variants may autonomously detect form fields and their values, or may take a pre-specified template for further accuracy.

One downside of table extraction is that it is up to the user to determine what the headers are, if present. Non-standard tables, or those without visible lines, may also not be parsed properly.

Major Players in OCR

Google: Google initially used Tesseract, but its cloud platform now offers both a Document AI product and a Vision AI product (standard and advanced).

Microsoft: Nuance, previously a global leader in OCR, was purchased by Microsoft, and the service was folded into Microsoft's cloud platform, Microsoft Azure, which has incorporated significant upgrades since then.

Amazon: Amazon, like Google, built an in-house OCR solution, called Textract, that offers the best value but the lowest performance among paid options.

Tesseract (open source): Built by HP, Tesseract was at one time the top OCR engine. HP released it as open source in 2005, and Google later sponsored its development. Tesseract is great for low-cost processing projects or proof-of-concept scenarios.

How OCR data can be used for Medical Record Processing

General NLP Services:

  • Generalized OCR services for searching and parsing

  • Automatic page rotation / reorientation

  • Duplicate page removal

  • Duplicate document detection

  • Annotation of medical records

  • Conversion between medical record formats (e.g., C-CDA, FHIR, PDF, HL7 ADT, image, RTF)

Extraction NLP Services:

  • Demographics

  • Patient information

  • Patient SDoH (social determinants of health)

  • Codes (ICD-10-CM, SNOMED, LOINC, CPT, RxNorm, etc.) with linked date of service, practitioner, etc.

  • Labs and Vitals

  • Practitioner information

  • Facility information

  • Disease research

Single-Record NLP AI services:

  • Past / present / future patient event detection / prediction

  • Readability / handwriting detection

  • HEDIS evidence extraction

  • Risk evidence extraction

  • Language translation

  • Sentiment analysis

  • Summarization

Multi-record NLP AI Services:

  • Population health trends and visualizations

  • Fraud, waste, and abuse detection

  • Risk adjustment prioritization of medical records

  • HEDIS / quality prioritization of medical records

  • Duplicate document detection

  • Duplicate patient detection

  • Adverse event detection / pharmacovigilance

Conclusion

Optical Character Recognition (OCR) technology has evolved significantly but remains imperfect. OCR involves complex steps, including preprocessing, recognition, sequencing, and error correction, leveraging advanced machine learning techniques to enhance performance. Challenges persist in processing low-resolution images, interpreting handwriting, and handling complex formats like tables. Choosing an OCR solution requires balancing speed, accuracy, and cost. If you would like to see how Tenasol can support OCR processing as a part of a larger pipeline for healthcare data source extraction, please reach out to our sales team.

References

[1] Hegghammer, Thomas. "OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment." Journal of Computational Social Science 5.1 (2022): 861-882.
