Unstructured Health Data Extraction is the use of Healthcare NLP (Natural Language Processing) to extract codified data that may be used for peripheral healthcare processes.

This article explores the importance of healthcare NLP, the capabilities it offers for health data extraction, and how Tenasol uses it for its clients.

Why Healthcare Data Extraction?

As shown above in an analysis of electronic medical record (EMR) data performed by Tenasol, roughly 58% (0.4*0.7 + 0.3) of EMR data is completely unstructured. This, combined with standalone image and PDF medical records, implies that the world of healthcare data is far from reaching a structured state.
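
The arithmetic behind the 58% figure can be spelled out in a few lines of Python. The sketch assumes one reading of the factors in parentheses, since the text does not define them: 70% of record content sits inside structured containers, 40% of that content is free text, and the other 30% of content has no structure at all. The variable names are illustrative only.

```python
# Assumed reading of the factors quoted in the text (not confirmed by the source).
container_share = 0.7          # share of EMR content inside structured containers
free_text_in_container = 0.4   # share of that content that is free text
fully_unstructured = 0.3       # share of EMR content with no structure at all

unstructured = free_text_in_container * container_share + fully_unstructured
structured = (1 - free_text_in_container) * container_share

print(f"unstructured: {unstructured:.0%}")
print(f"structured:   {structured:.0%}")
```

Under this reading the two shares sum to 100%, which is consistent with the 0.6*0.7 structured share quoted in the text.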

Presently, most systems that process these records use only the remaining 42% (0.6*0.7) that is structured, extracting codified content from the purely structured fields of the file's XML or JSON. These codes might be SNOMED, ICD-10-CM, ICD-10-PCS, LOINC, RxNorm, NDC, CPT, or any of a number of other code systems that indicate patient conditions, actions, drugs, demographics, results, and more.
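
To make the structured-only approach concrete, here is a minimal Python sketch that pulls coded entries out of an XML fragment and ignores the free text. The fragment is a simplified, invented example loosely modeled on a C-CDA entry, not a real record, and the function name `extract_codes` is hypothetical. The two OIDs are the standard identifiers for SNOMED CT and ICD-10-CM.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment loosely modeled on a C-CDA entry.
SAMPLE = """
<section>
  <text>Patient reports poorly controlled blood sugar; A1c was 9.1 last month.</text>
  <entry>
    <value code="44054006" codeSystem="2.16.840.1.113883.6.96"
           displayName="Diabetes mellitus type 2"/>
  </entry>
  <entry>
    <value code="E11.9" codeSystem="2.16.840.1.113883.6.90"
           displayName="Type 2 diabetes mellitus without complications"/>
  </entry>
</section>
"""

CODE_SYSTEMS = {
    "2.16.840.1.113883.6.96": "SNOMED CT",
    "2.16.840.1.113883.6.90": "ICD-10-CM",
}

def extract_codes(xml_text):
    """Return (system, code, display) for every element carrying a code attribute."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for element in root.iter():
        code = element.get("code")
        if code is None:
            continue  # free-text elements such as <text> are skipped entirely
        system = CODE_SYSTEMS.get(element.get("codeSystem"), "unknown")
        found.append((system, code, element.get("displayName")))
    return found

for row in extract_codes(SAMPLE):
    print(row)
```

The A1c value in the `<text>` element never reaches the output, which is the gap that unstructured extraction is meant to close.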

Unstructured data, by contrast, comes in the form of unformatted text fields, images, and, with the advent of HL7 FHIR, literally any file type. Stated another way, HL7 FHIR means we are entering a world of more unstructured data, not less.

What Can Be Done with Extracted Health Data?

Without exception, every type of data can be processed and fed into an artificial intelligence (AI) model, and when constructed correctly, such models yield outputs with predictive power. However, it is the use cases with the most predictive power and the most value that will gain the most interest. Below is an itemized list, in four categories, of the ways NLP systems are being used, or are proposed to be used, with unstructured medical record data (from a single record or multiple records) as input. Tenasol offers the majority of these as solutions:

General NLP Services:

Generalized OCR services for searching and parsing
Automatic page rotation / reorientation
Duplicate page removal
Annotation of medical records
Conversion between medical record formats (e.g., CCDA, FHIR, PDF, HL7 ADT, image, RTF)

Extraction NLP Services:

Demographics
Patient information
Patient SDoH
Codes (ICD-10-CM, SNOMED, LOINC, CPT, etc.) with linked date of service, practitioner, etc.
Labs and Vitals
Practitioner information
Facility information
Disease research

Single-Record NLP AI Services:

Past / present / future patient event detection / prediction
Readability / handwriting detection
HEDIS evidence extraction
Risk evidence extraction
Prior authorization evidence extraction
Language translation
Medical record deidentification
Sentiment analysis
Summarization

Multi-Record NLP AI Services:

Population health trends and visualizations
Fraud, waste, and abuse detection
Risk adjustment prioritization of medical records
HEDIS / quality prioritization of medical records
Duplicate document detection
Duplicate patient detection
Adverse event detection / pharmacovigilance

What is Unstructured Health Data Extraction Used for Today?

In short, very little outside of clinical settings. Most data services only process the structured content shown below.

Outside of hospitals, content that is not already a PDF is most often rendered as HTML, converted to PDF, and then read by a human reviewer. This transformation loses information (mostly structural) that the human reviewer can no longer see.

An inherent downside of this approach is that large volumes of medical records cannot be easily compared without data-centric NLP approaches. A health plan interested in assessing its population in a target zip code may find it far easier to have a computer system parse a high volume of records than to abstract the information manually via human review.

What Data Can Tenasol Extract?

Here is a depiction of many of the codified data elements that Tenasol can extract from an unstructured record:

Figure: Health data extraction capabilities. Some of the elements that Tenasol can parse with health data extraction.

Tenasol also performs a wide range of machine learning-backed language detection. Examples include our HEDIS, risk adjustment, and prior authorization solutions. These solutions can not only highlight relevant text but also indicate confidence and sort medical records after health data extraction.

What Else Can Tenasol Do with Health Data Extraction?

Taken a step further, Tenasol systems can use this approach to compare populations in zip codes across the country using medical records as the sole input. Drawing on both structured and unstructured content, they show where conditions such as diabetes are more or less prevalent, and do so in significantly less time than manual review. A team of people reviewing manually would find this task monumental, and a system using only structured data would produce an incomplete picture.

Tenasol makes use of this approach in our GeoHealth® product, shown below. This service aggregates claims information from structured and unstructured data in medical records and uses Geographic Information Systems (GIS) to form a heatmap of a specified disease or group of diseases. It can also animate the heatmap over a time period.

Figure: Mapping unstructured medical record data.
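
The aggregation step behind a zip code comparison like this can be sketched in a few lines of Python. This is an illustration only, not the GeoHealth implementation: the record layout, the function name `prevalence_by_zip`, and the sample values are all hypothetical, and each record is assumed to already carry the conditions found by structured and unstructured extraction.

```python
from collections import defaultdict

# Hypothetical extraction output: one dict per medical record.
records = [
    {"patient_id": "p1", "zip": "30301", "conditions": {"diabetes"}},
    {"patient_id": "p2", "zip": "30301", "conditions": set()},
    {"patient_id": "p3", "zip": "60601", "conditions": {"diabetes", "hypertension"}},
    {"patient_id": "p4", "zip": "60601", "conditions": {"diabetes"}},
]

def prevalence_by_zip(records, condition):
    """Return {zip: share of distinct patients with the condition}."""
    patients = defaultdict(set)   # zip -> all patient ids seen
    affected = defaultdict(set)   # zip -> patient ids with the condition
    for rec in records:
        patients[rec["zip"]].add(rec["patient_id"])
        if condition in rec["conditions"]:
            affected[rec["zip"]].add(rec["patient_id"])
    return {z: len(affected[z]) / len(ids) for z, ids in patients.items()}

print(prevalence_by_zip(records, "diabetes"))
```

Counting distinct patient IDs rather than records keeps duplicate documents from inflating the rate. The resulting share per zip code is the kind of value a GIS layer could color.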

Methods of NLP Health Data Extraction

While Generative AI and Large Language Models dominate headlines, they are just one tool in a massive toolbox. As with optical character recognition (OCR) technologies, many approaches exist, and they sit along a sliding scale of compute required and speed of operation.

Figure: Medical NLP performance range. Regular expressions, bag-of-words, word2vec, and transformers (LLMs) are all methods that can be used to derive information from unstructured healthcare data sources.

It is worth noting that some strategies blend these methods, but these four are the most prominent overall. An unstructured data pipeline should not rely on any single strategy, but should instead combine several, each handling the cases it suits best.
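
As an illustration of the cheaper end of that scale, the sketch below scores a note against a small keyword list using a bag-of-words count. The vocabulary, the function name `bag_of_words_score`, and the sample note are invented for the example; a production vocabulary would be far larger and curated.

```python
import re
from collections import Counter

# Hypothetical vocabulary for flagging diabetes-related notes.
DIABETES_TERMS = {"diabetes", "insulin", "metformin", "a1c", "hyperglycemia"}

def bag_of_words_score(text, vocabulary):
    """Return the fraction of tokens in the text that belong to the vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[term] for term in vocabulary)
    return hits / len(tokens)

note = "Patient on metformin. A1c elevated. Discussed insulin therapy."
print(bag_of_words_score(note, DIABETES_TERMS))
```

Word order and context are discarded, which is why this tier is fast and also why it cannot tell "no history of diabetes" from a positive mention.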

For example, extracting all dates from a medical record is a relatively straightforward task that does not require the high cost or compute of a transformer system. Conversely, while summarization is possible without transformers, transformers typically offer the cleanest result with the most control over the output relative to other toolsets. Aspects of these differing toolsets will be covered in subsequent blog posts.
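
A minimal sketch of the date case, using only Python's standard library, might look like the following. The two patterns (numeric `MM/DD/YYYY` and ISO `YYYY-MM-DD`) are assumptions chosen for illustration; real records contain many more formats, and the function name `extract_dates` is hypothetical.

```python
import re
from datetime import date

# Two illustrative formats only: MM/DD/YYYY and YYYY-MM-DD.
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text):
    """Return valid calendar dates found in the text, in order of appearance."""
    candidates = []
    for m in US_DATE.finditer(text):
        month, day, year = (int(g) for g in m.groups())
        candidates.append((m.start(), year, month, day))
    for m in ISO_DATE.finditer(text):
        year, month, day = (int(g) for g in m.groups())
        candidates.append((m.start(), year, month, day))
    dates = []
    for _, year, month, day in sorted(candidates):
        try:
            dates.append(date(year, month, day))
        except ValueError:
            # Matches the pattern but is not a real calendar date.
            continue
    return dates

note = "Seen 03/14/2023 for follow-up; labs drawn 2023-03-10. Next visit 13/45/2023."
print(extract_dates(note))
```

Constructing a `date` object doubles as a first validation pass: the impossible `13/45/2023` matches the pattern but is dropped.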

Importance of Health Data Extraction Validation

With extraction out of the way, it can be easy to forget that medical records are created by humans and sometimes faulty machines. This implies that data represented in a medical record, especially the unstructured sections, may be erroneous.

Examples of erroneous information include incorrect units on vitals or labs, numbers that are outside the realm of possibility, and non-existent codes recorded by accident through “fat thumbing” mistakes. For most tasks, Tenasol employs multiple stages of validation that either confirm information before it is taken as truth or flag it as erroneous.

This may be performed with regular expressions, pre-existing and maintained libraries, or coded knowledge on units and ranges built into a screening system. While not common, there are scenarios where these errors may be corrected or reformatted before they reach a final user.
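
The kinds of checks described above can be sketched as follows. The plausibility ranges and unit lists are placeholder values for illustration, not clinical reference ranges, and the function names are hypothetical. The ICD-10-CM pattern is a loose check on the shape of a code only; confirming that a code actually exists requires a lookup against a maintained code set.

```python
import re

# Placeholder plausibility limits; illustrative values, not clinical guidance.
VITAL_RULES = {
    "heart_rate": {"units": {"bpm", "/min"}, "min": 20, "max": 300},
    "body_temperature": {"units": {"degF"}, "min": 80, "max": 115},
}

# Loose shape check: letter, digit, alphanumeric, optional dot plus 1-4 characters.
ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

def validate_vital(name, value, unit):
    """Return a list of problems found; an empty list means the reading passed."""
    rule = VITAL_RULES.get(name)
    if rule is None:
        return [f"no rule for {name}"]
    problems = []
    if unit not in rule["units"]:
        problems.append(f"unexpected unit {unit!r}")
    if not rule["min"] <= value <= rule["max"]:
        problems.append(f"value {value} outside {rule['min']}-{rule['max']}")
    return problems

def looks_like_icd10cm(code):
    """True when the string has the general shape of an ICD-10-CM code."""
    return bool(ICD10CM_SHAPE.match(code.strip().upper()))

print(validate_vital("heart_rate", 72, "bpm"))
print(validate_vital("heart_rate", 720, "bpm"))
print(looks_like_icd10cm("E11.9"))
print(looks_like_icd10cm("EE1.9"))
```

Returning a list of problems rather than a single pass/fail flag lets a later stage decide whether a value should be corrected, reformatted, or only flagged.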

Unstructured Data to FHIR

Tenasol uniquely offers, as a service, the transformation of unstructured data into FHIR resources, at present in either USCDI or R4 FHIR representations. Because the Tenasol pipeline can already transform unstructured text into structured data, we added a layer on top that takes this one step further to FHIR data, as shown below.

Figure: PDF to FHIR (or image to FHIR) implementation by Tenasol.
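
As a rough illustration of that last layer, the sketch below wraps one extracted vital sign in a FHIR R4 Observation, expressed as a Python dictionary and serialized to JSON. It is not Tenasol's implementation: the input dictionary, the function name `to_fhir_observation`, and the patient identifier are hypothetical. The LOINC code 8867-4 (heart rate) and the UCUM unit `/min` are standard values.

```python
import json

# Hypothetical output of an upstream extraction step.
extracted = {
    "patient_id": "example-patient-1",
    "loinc_code": "8867-4",
    "display": "Heart rate",
    "value": 72,
    "ucum_unit": "/min",
    "observed_on": "2023-03-14",
}

def to_fhir_observation(item):
    """Build a minimal FHIR R4 Observation resource from one extracted value."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": item["loinc_code"],
                "display": item["display"],
            }]
        },
        "subject": {"reference": f"Patient/{item['patient_id']}"},
        "effectiveDateTime": item["observed_on"],
        "valueQuantity": {
            "value": item["value"],
            "unit": item["ucum_unit"],
            "system": "http://unitsofmeasure.org",
            "code": item["ucum_unit"],
        },
    }

print(json.dumps(to_fhir_observation(extracted), indent=2))
```

Profiles such as US Core add further required elements (for example a `category`), so a real mapping would need to be validated against the target profile.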

Conclusion

Natural Language Processing (NLP) has emerged as a transformative technology in deciphering medical records, offering unparalleled opportunities to enhance patient care, streamline healthcare operations, and improve diagnoses. Through the analysis of both structured and unstructured data, NLP systems can extract valuable insights, predict patient outcomes, and facilitate population health management. While the evolution of NLP methodologies, from rule-based systems to advanced transformer models, continues to drive innovation in healthcare, it's essential to employ diverse approaches tailored to specific use cases. With robust validation processes in place, NLP holds immense potential to revolutionize healthcare delivery and decision-making processes.

To extract data from unstructured healthcare data sources with Tenasol, reach out to our sales team.
