9 Reasons Parsing PDF Data is Difficult

medical record PDF parsing

One of Tenasol’s primary operations and areas of expertise is the extraction of data from a wide variety of healthcare data sources. Medical records make up the bulk of these and come in many different flavors (HL7 v2 ADT, HL7 v3 C-CDA, HL7 FHIR, image/PDF/TIFF, and more).

Digital representations are often hard on human eyes. JSON and XML files can look like jumbled words and brackets, and while automated software can make some decisions from them, they are most often rendered as HTML or as a PDF file for a human to view and make medical decisions on. In previous blogs (ADT, CCDA, FHIR) we cover the digital formats.

The majority of records exchanged outside of the point of care are in unstructured PDF or multipage TIFF format, since that is what insurers typically receive when requesting information from hospitals. Only a minority of patient information is transmitted to these entities in digital format. Because of this, Tenasol’s mission often depends on these difficult-to-parse formats, and few other vendors are able to abstract information from them with the precision that Tenasol is capable of.

This blog post is meant to describe some of the drawbacks and hurdles of rendered and paper medical records.

PDF Medical Record

A Deidentified Medical Record PDF

Records contain vitals, diagnoses, history, allergies, and patient-identifying information, among other things recorded from one or more visits.

Usually, but not always, they describe a single patient in both the present and past.

Importance of Unstructured Data

The vast majority of data in healthcare is unstructured.

Massive amounts of data are contained in a record, regardless of whether it is unstructured or not. Codified data, forms, images, and free text are each just one component. This data may also be timestamped to build a timeline of a patient's location and overall health.

Modern machine learning techniques further allow us to forecast patient risks and how imminent they are, and even to detect information that may have accidentally gone undocumented. Going further, the metadata of the file itself offers more information.

Tenasol is capable of extracting this data (such as diagnosis codes, lab codes, and drug codes) even when it is not explicitly stated, and of turning it into structured data for use cases such as generalized data extraction, risk adjustment, HEDIS quality measures, prior authorization, and more.

medical record data extraction

Unstructured Data Hurdles

1) Medical record variation is massive

Almost no assumptions may be made about a medical record. Far too often I have met development teams that tried to detect sections of a medical record in order to make assumptions about the data found there, but section definitions are fluid. There are no rules about structure. A medical record may contain no visits and simply describe patient information, or it may describe one or many encounters with practitioners.

2) There is no assurance that pages are in sequence

Furthermore, pages may easily be shuffled during handling, faxing, or scanning. This means that information that appears to continue from one page to the next may be unrelated or unreliable.

3) A medical record may be about one or many patients

Depending on how it was rendered and transported, a rendered medical record may cover more than one patient. This may be the case if a hospital is sending medical data on numerous patients to an insurance agency at the same time in response to a payer chase. It may be up to the payer to split those back apart for review purposes.
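One way a payer might begin splitting a combined file is to group pages by a patient identifier found on each page. The sketch below is only an illustration under strong assumptions: OCR text is already available per page, and each page carries a hypothetical `MRN: <digits>` label. Real records are rarely this tidy, so pages with no identifier are set aside for review rather than guessed at.

```python
import re
from collections import defaultdict

# Hypothetical identifier pattern; real records label patients in many ways.
MRN_PATTERN = re.compile(r"\bMRN[:\s]+(\d{6,10})\b")

def group_pages_by_patient(page_texts):
    """Map each detected MRN to the page indexes that mention it.

    Pages with no identifier are collected under None for manual review.
    """
    groups = defaultdict(list)
    for index, text in enumerate(page_texts):
        match = MRN_PATTERN.search(text)
        key = match.group(1) if match else None
        groups[key].append(index)
    return dict(groups)

# Made-up page text for illustration only.
pages = [
    "MRN: 00123456 Visit summary",
    "Lab results (no header)",
    "MRN: 00987654 Discharge note",
    "MRN: 00123456 Medication list",
]
grouped = group_pages_by_patient(pages)
```

Grouping by index instead of by position means the approach does not depend on pages arriving in order, which matters given that pages may be shuffled.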

Medical record sizes

4) Page count variation is high

For older patients, records skew longer, similar to what is shown to the right. Records for these patients have a median of around 12 pages and a mean of 80, and can run as high as tens of thousands of pages in length.

It is worth noting that a large number of machine learning and natural language models fall short when faced with this much variability in data size.
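A common workaround for models that accept only a fixed amount of input is to process a record in overlapping windows of pages. The following is a minimal sketch of that idea; the `window` and `overlap` values are illustrative assumptions, not settings taken from any particular model.

```python
def page_windows(page_count, window=10, overlap=2):
    """Yield (start, end) page ranges covering a record of any length.

    Consecutive windows share `overlap` pages so that content spanning a
    window boundary appears intact in at least one window.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    start = 0
    while start < page_count:
        end = min(start + window, page_count)
        yield (start, end)
        if end == page_count:
            break
        start += step

short_record = list(page_windows(12))      # a record near the median length
long_record = list(page_windows(25000))    # a record at the extreme end
```

The same function handles a 12-page record and a 25,000-page record; only the number of windows changes.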

medical record handwriting

6) Poor legibility may appear

Around 10% of medical records contain at least some handwriting. These are usually patient forms or practitioner notes. Legibility varies widely, but most modern OCR systems are fairly proficient at determining what is written. Some handwriting, like the writing on the left, will go unrecognized.

page rotation in medical records

7) Pages may be rotated

Approximately 5-10% of medical records contain pages that have been rotated incorrectly. This is not uncommon given how practitioners handle paper records during transmission, and Tenasol offers services for repairing it.

image data in medical records

8) Images and tables are common

While images cannot be rendered in raw digital formats, they are often present in rendered documents. This is an example of a deidentified ECG printout.

Tables in particular may be very hard to extract data from, given how much their formatting varies.

9) Data Duplication

Electronic: A deceptively common feature of all types of medical records is duplicated data. On electronic records, this occurs either because a software interface fails to identify a unique record or because a practitioner re-attaches all past visits to a record that already has them.

Pages: For a paper medical record, pages can be duplicated if a practitioner accidentally transmits the same patient chart multiple times during the same transmission.

Data: Within the record, duplication can happen if information like vitals is represented repeatedly in multiple sections, for example once in the encounter and once in a vitals summary table.

Duplicate Transmissions: Furthermore, it is possible for points of care to send the exact same medical record multiple times. Tenasol offers deduplication features on a sliding scale, so documents that are not 100% identical because of features such as fax headers are still treated as duplicates. During human review for verification, Tenasol has discovered clients whose review queues contained up to 40% duplicate medical records, resulting in substantial resource costs that are no longer necessary with this technology.
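The idea of treating near-identical documents as duplicates can be illustrated with an exact hash plus a similarity score over word shingles. This is a generic sketch and not a description of Tenasol's method; the `FAX_HEADER` pattern, the shingle size, and the 0.9 threshold are all assumptions chosen for the example, and the sample text is made up.

```python
import hashlib
import re

# Hypothetical fax header shape, e.g. "FAX 01/05/2024 09:30 PAGE 1/4".
FAX_HEADER = re.compile(r"^FAX\b.*$", re.MULTILINE)

def normalize(text):
    """Drop fax header lines, lowercase, and collapse whitespace."""
    without_headers = FAX_HEADER.sub("", text)
    return " ".join(without_headers.lower().split())

def exact_fingerprint(text):
    """SHA-256 of the normalized text; equal values mean identical content."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text, size=3):
    """Set of overlapping word n-grams from the normalized text."""
    words = normalize(text).split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a, b):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(a, b, threshold=0.9):
    """Sliding-scale check: lower thresholds tolerate more differences."""
    return similarity(a, b) >= threshold

first = "FAX 01/05/2024 09:30 PAGE 1/4\nPatient seen for follow up. Blood pressure stable."
second = "FAX 01/06/2024 14:12 PAGE 1/4\nPatient seen for follow up.  Blood pressure stable."
third = "Patient seen for follow up. Blood pressure elevated today."

same_content = exact_fingerprint(first) == exact_fingerprint(second)
resend_detected = is_duplicate(first, second)
different_visit = is_duplicate(first, third)
```

Here `first` and `second` differ only in their fax headers and spacing, so they normalize to the same text. The `third` sample shares most of its wording but scores below the threshold, so it is kept as a distinct document.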

medical record deduplication

Methods of hashing used by Tenasol for medical record deduplication

Let Us Know How Tenasol Can Help!

Tenasol excels in interoperability and transforming unstructured medical records into usable, structured data through advanced AI and machine learning technologies. By addressing challenges such as variable formatting, poor legibility, page rotation, and data duplication, Tenasol ensures precise extraction and organization of vital patient information. These capabilities empower healthcare providers and insurers to efficiently manage vast amounts of data, reducing resource costs and improving decision-making. By leveraging its expertise in handling diverse formats and developing robust solutions for complex records, Tenasol continues to set a benchmark for innovation in healthcare data management, driving better outcomes for patients and stakeholders alike.
