What is the Best Optical Character Recognition (OCR) Engine?
Tenasol, as a regular part of our operations, uses multiple OCR technologies to process and extract data from medical records. As we covered in a previous blog, healthcare data sources in the payer space consist largely of PDFs and images. This blog covers the known vendors, their relative performance, and the limitations of these OCR systems.
If you want to skip this blog you should take away the following:
OCR systems take in image or PDF files and output data representing what text is where.
Google Document AI is currently the top-performing OCR engine for high-volume generalized processing if cost is not a factor.
Tesseract OCR is currently the best open-source and free OCR engine, but it is incapable of reading handwriting and has a high error rate.
No system today is able to perfectly read documents and their structure.
There is a balance between processing speed and accuracy in OCR which must be kept in mind when choosing an OCR solution.
Believe it or not, OCR has been around for decades, and the approaches used have remained relatively consistent, with only small variations to improve performance. Remarkably, one of the largest and earliest adopters of OCR technology was the US Postal Service, which began using it in the 1960s for automated mail sorting and still uses it today.
How It Works
Step 1: Image Preprocessing
In an initial step, images are preprocessed. This might mean converting them to a single format, deskewing them (straightening a tilted scan), or adjusting image parameters such as contrast so that later steps go more smoothly.
Step 2: Locate Characters
In this step, the position of each character is located; the characters are not yet identified. Knowing where they are is essential for several later steps. Deep learning is very useful here, as techniques such as convolutional neural networks (CNNs) can pick out objects in images very efficiently.
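To make "locating without identifying" concrete, here is a minimal sketch that finds bounding boxes of ink blobs in a binarized image. It swaps in classical connected-component labeling rather than a CNN, purely because it fits in a few lines; the function name and the toy image are assumptions for illustration.

```python
from collections import deque


def find_boxes(binary):
    """Return (left, top, right, bottom) boxes of 4-connected ink blobs."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


# Two blobs: a "C"-like shape on the left and a vertical bar on the right.
page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
boxes = find_boxes(page)
print(boxes)
```

The result says where two shapes are, not what they are. A real system also has to merge multi-part characters (the dot and stem of an "i") and split touching ones, which is where learned detectors do better.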
Step 3: Segmentation
Characters (letters) are grouped in multiple ways. This grouping is valuable to the later OCR processing steps, and may also be useful to a user after OCR is complete. Characters, words (also called tokens), lines, and paragraphs all carry information in their grouped locations and individual sizes.
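One simple way to picture the grouping is by horizontal gaps: characters that sit close together belong to the same word. The sketch below assumes character boxes in (left, top, right, bottom) form on a single line and a hypothetical `max_gap` of 3 pixels; real engines derive such thresholds from font size and spacing statistics.

```python
def group_into_words(char_boxes, max_gap=3):
    """Group character boxes on one line into words using horizontal gaps."""
    words = []
    for box in sorted(char_boxes):  # left to right by left edge
        previous_right = words[-1][-1][2] if words else None
        if previous_right is not None and box[0] - previous_right <= max_gap:
            words[-1].append(box)  # close to the previous character: same word
        else:
            words.append([box])  # large gap (or first character): new word
    return words


def bounding_box(boxes):
    """Smallest box that encloses every box in the group."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


chars = [(0, 0, 4, 8), (6, 0, 10, 8), (20, 0, 24, 8), (26, 0, 30, 8), (32, 0, 36, 8)]
words = group_into_words(chars)
word_boxes = [bounding_box(w) for w in words]
print(word_boxes)
```

The same idea, applied with vertical gaps, groups words into lines and lines into paragraphs.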
Step 4: Recognition
At this point, each character is classified. There are a few ways to do this. Generally, a character set (e.g. ASCII or Unicode) is specified first. This is usually not something a user can select.
Then each located character is compared against every entry in the character set to find the closest match. This is usually done by one of the following methods:
Lowering the resolution of the sub-image and then matching it against a library containing all possible patterns for a grid of the same size (faster, less accurate).
Using a machine learning model such as k-nearest neighbors or a deep learning convolutional neural network (slower, more accurate, works for handwriting).
Intelligent systems will sometimes use both approaches for speed and compute optimization.
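The low-resolution matching approach can be sketched as follows. This is a simplified illustration, not any vendor's implementation: the three-letter `TEMPLATES` library, the 3x3 grid, and the use of Hamming distance to pick the nearest template (effectively a 1-nearest-neighbor match) are all assumptions for the example.

```python
# Hypothetical 3x3 template library; a real one covers the whole character set.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "O": ["111", "101", "111"],
}


def downsample(glyph, size=3):
    """Shrink a square binary glyph to size x size; a cell is ink if at least half its pixels are."""
    step = len(glyph) // size
    grid = []
    for gy in range(size):
        row = ""
        for gx in range(size):
            ink = sum(
                glyph[y][x]
                for y in range(gy * step, (gy + 1) * step)
                for x in range(gx * step, (gx + 1) * step)
            )
            row += "1" if ink * 2 >= step * step else "0"
        grid.append(row)
    return grid


def hamming(a, b):
    """Count the grid cells where two equally sized grids differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognize(glyph):
    """Return the template character whose grid is closest to the glyph."""
    grid = downsample(glyph)
    return min(TEMPLATES, key=lambda ch: hamming(grid, TEMPLATES[ch]))


# A 6x6 "L" with one stray noise pixel at row 3, column 4.
noisy_l = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
print(recognize(noisy_l))
```

Downsampling absorbs the stray pixel, which is part of why this approach is fast and tolerant of small noise, but it also throws away the fine detail needed to separate similar shapes or read handwriting.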
Step 5: Sequencing
Once all characters are recognized, they are sequenced. Information about where the characters are and what they are can be used to help make this decision. Characters, words (tokens), lines, paragraphs, and pages are usually sequenced using straightforward logic or simple machine learning models.
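The "straightforward logic" for sequencing can be as simple as sorting by position. The sketch below assumes words arrive as (text, left, top) tuples in arbitrary order and that words whose top edges are within a hypothetical `line_tolerance` of 5 pixels share a line. It handles single-column text only; multi-column layouts need more than this.

```python
def reading_order(words, line_tolerance=5):
    """Order (text, left, top) words top-to-bottom, then left-to-right within each line."""
    lines = []
    for word in sorted(words, key=lambda w: w[2]):  # scan by top edge
        # Compare against the first word placed on the current line.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


detected = [("date", 60, 42), ("Patient", 10, 10), ("Visit", 10, 40), ("name", 80, 12)]
lines = reading_order(detected)
print(lines)
```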
Step 6: Language Detection
Once characters have been sequenced, an attempt can be made to determine what language the text is in. The language can be specified before running, but modern systems are easily capable of auto-detecting it using simple machine learning models based on language commonality and the character sequences seen, which depend on the preceding sequencing step.
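As a toy stand-in for the simple models described above, the sketch below scores text against short lists of very common words per language. The word lists and the function name are assumptions for illustration; practical detectors usually rely on character n-gram statistics over far more languages.

```python
# Tiny, hypothetical stopword lists; real detectors use much richer statistics.
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "spanish": {"el", "la", "de", "y", "es", "en"},
    "french": {"le", "la", "de", "et", "est", "les"},
}


def detect_language(text):
    """Return the language whose common words appear most often, or None if none appear."""
    tokens = text.lower().split()
    scores = {
        language: sum(token in common for token in tokens)
        for language, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the language listed first
    return best if scores[best] > 0 else None


print(detect_language("the patient is in the clinic"))
print(detect_language("el paciente es de la ciudad"))
```

Because the scoring works on whole tokens, it only makes sense after sequencing has assembled characters into words, which is the dependency the paragraph above describes.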
Step 7: Error Correction
Once the language is determined, characters may be changed, using machine learning models, based on their frequency in the given language and their proximity to other characters in the word, line, or paragraph. This is an extremely important step that is more common in modern systems. Think, for example, about a “0” and an “O”: a zero is far more likely to be next to other numbers, and a letter O is far more likely to be next to other letters.
More recently, systems have taken advantage of increasingly complex Long Short-Term Memory (LSTM) neural networks to improve the performance of this step.
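The zero-versus-O example can be written as a small rule, shown below as a hedged sketch. It is a hand-written heuristic for uppercase tokens, not an LSTM or any engine's actual correction model, and the function name is an assumption.

```python
def fix_zero_and_o(token):
    """Make '0' and 'O' agree with the majority of the other characters in the token."""
    others = [c for c in token if c not in "0O"]
    digits = sum(c.isdigit() for c in others)
    letters = sum(c.isalpha() for c in others)
    if digits > letters:
        return token.replace("O", "0")  # mostly numeric: treat every O as a zero
    if letters > digits:
        return token.replace("0", "O")  # mostly alphabetic: treat every zero as an O
    return token  # no clear majority: leave the token unchanged


for raw in ["2O24", "D0CTOR", "O0"]:
    print(raw, "->", fix_zero_and_o(raw))
```

A learned model generalizes the same idea: instead of one hard-coded rule, it weighs the surrounding context to decide which reading of an ambiguous character is most likely.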
Step 8: Output Creation
Finally, the output is created. This is most commonly a JSON document that contains all of the previously mentioned information.
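A minimal sketch of what such output might look like is below. The field names (`page`, `language`, `text`, `words`, `bbox`) are invented for this example and do not follow any particular vendor's schema, each of which differs in structure and detail.

```python
import json


def build_output(page_number, language, words):
    """Assemble recognized words and their boxes into a JSON-serializable dict."""
    return {
        "page": page_number,
        "language": language,
        "text": " ".join(w["text"] for w in words),
        "words": [
            {
                "text": w["text"],
                "bbox": {
                    "left": w["box"][0],
                    "top": w["box"][1],
                    "right": w["box"][2],
                    "bottom": w["box"][3],
                },
            }
            for w in words
        ],
    }


recognized = [
    {"text": "Patient", "box": (10, 10, 70, 24)},
    {"text": "name", "box": (80, 12, 120, 24)},
]
document = build_output(1, "en", recognized)
serialized = json.dumps(document, indent=2)
print(serialized)
```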
Where OCR Fails
Sentence Structure
Documents with poor resolution, artifacts, or handwriting often result in occasional character errors. Take for example:
Let's eat, Grandma! [ACTUAL]
vs
Let's eat Grandma! [DETECTED]
Notice how the comma is missing from the second sentence, changing its meaning. It is not rare for OCR to miss a comma when image quality is poor. If part-of-speech language processing is applied downstream, the interpreted meaning would differ, resulting in inaccuracy.
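To put a number on the example above, the sketch below computes a character error rate using Levenshtein edit distance, a common way to score OCR output against a known-correct transcription. The function names are assumptions for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if the characters match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(actual, detected):
    """Edit distance divided by the length of the correct text."""
    return edit_distance(actual, detected) / len(actual)


actual = "Let's eat, Grandma!"
detected = "Let's eat Grandma!"
print(edit_distance(actual, detected))
print(round(character_error_rate(actual, detected), 3))
```

One missing character out of 19 is an error rate of roughly 5%, yet the sentence now means something else. Character-level accuracy figures can therefore understate the downstream impact of an OCR mistake.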
Tables & Forms
For a significantly higher fee, some cloud services offer premium OCR services capable of detecting and partially interpreting tables and forms. Form processing variants may automatically detect form fields and their values, or may take a pre-specified template for greater accuracy.
One downside of table extraction is that it is up to the user to determine what the headers are, if present. Non-standard tables, or those without visible lines, may not be parsed properly.
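The header problem shows up as soon as extracted cells are turned into records. The sketch below assumes a table has already been extracted as a list of rows of cell text (a hypothetical shape, not a specific service's output) and leaves the header decision to the caller through a `has_header` flag.

```python
def rows_to_records(rows, has_header=True):
    """Turn extracted table rows into dicts; the caller decides whether row 0 is a header."""
    if has_header:
        header, body = rows[0], rows[1:]
    else:
        # No header row: fall back to generic column names.
        header = [f"column_{i + 1}" for i in range(len(rows[0]))]
        body = rows
    return [dict(zip(header, row)) for row in body]


extracted = [
    ["Code", "Description"],
    ["E11.9", "Type 2 diabetes"],
    ["I10", "Hypertension"],
]
print(rows_to_records(extracted))
print(rows_to_records(extracted, has_header=False))
```

If the flag is set wrongly, the header row is silently treated as data (or the first data row as a header), which is the kind of mistake the user has to guard against.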
Major Players
Google: Google initially used Tesseract, but its cloud platform now offers both a Document AI product and a Vision AI product (standard and advanced).
Microsoft: Nuance, previously a global leader in OCR, was purchased by Microsoft, and the service was wrapped into Microsoft's cloud platform, now known as Microsoft Azure, which has incorporated significant upgrades since then.
Amazon: Amazon, like Google, built an in-house OCR solution, called Textract, which offers the best value but the lowest performance among paid options.
Tesseract (open source): Originally built by HP, Tesseract was at one time the top OCR engine. HP released it as open source in 2005, and Google subsequently sponsored its development. Tesseract is great for low-cost, low-margin processing projects or proof-of-concept scenarios.
Conclusion
Optical Character Recognition (OCR) technology has evolved significantly but remains imperfect. OCR involves complex steps, including preprocessing, recognition, sequencing, and error correction, leveraging advanced machine learning techniques to enhance performance. Challenges persist in processing low-resolution images, interpreting handwriting, and handling complex formats like tables. Choosing an OCR solution requires balancing speed, accuracy, and cost. If you would like to see how Tenasol can support OCR processing as a part of a larger pipeline for healthcare data source extraction, please reach out to our sales team.
References
[1] Hegghammer, Thomas. "OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment." Journal of Computational Social Science 5.1 (2022): 861-882.