Unstructured Health Data Extraction with Tenasol
The role of Healthcare NLP (Natural Language Processing) in health data extraction has been transformative, offering opportunities for enhancing patient care, improving diagnoses, and streamlining healthcare operations.
This article explores the importance of healthcare NLP, the capabilities it offers for health data extraction, and how Tenasol uses it for its clients.
Why Healthcare Data Extraction?
As shown above in an analysis of electronic medical record data performed by Tenasol, roughly 58% (0.4*0.7 + 0.3) of electronic medical record data is completely unstructured. This, combined with standalone image and PDF medical records, implies that healthcare data is far from reaching a structured state.
Presently, most systems used for processing these records make use of only the remaining 42% (0.6*0.7) that is structured, extracting codified content from the purely structured fields of the XML or JSON in the file. These codes might be SNOMED, ICD-10-CM, ICD-10-PCS, LOINC, RxNorm, NDC, CPT, or any of a number of other code systems that indicate patient conditions, actions, drugs, demographics, results, and more.
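The two percentages above can be reproduced with a few lines of arithmetic. The variable names in this sketch are an assumed reading of the factors in the formulas (70% of data sitting in records that have some structure, 40% of which is free text, plus 30% with no structure at all); the article itself only gives the numbers.

```python
from fractions import Fraction

# Assumed labels for the factors in the formulas above (hypothetical names).
partly_structured_share = Fraction(7, 10)   # data in records that have structured fields
fully_unstructured_share = Fraction(3, 10)  # data with no structured fields at all
free_text_fraction = Fraction(4, 10)        # free-text portion of the partly structured data

# 0.4 * 0.7 + 0.3
unstructured = free_text_fraction * partly_structured_share + fully_unstructured_share
# 0.6 * 0.7
structured = (1 - free_text_fraction) * partly_structured_share

print(f"unstructured: {float(unstructured):.0%}, structured: {float(structured):.0%}")
```

Exact fractions are used so that the two shares sum to exactly 1 without floating-point rounding.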
The unstructured data, by contrast, comes in the form of unformatted text fields, images, and, with the advent of HL7 FHIR, literally any file type. Stated another way, HL7 FHIR means we are entering a world of more unstructured data, not less.
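To make the structured/unstructured split concrete, here is a minimal sketch that walks a FHIR-style bundle and separates coded entries from attachments. The bundle is hand-written for illustration and the `split_bundle` helper is hypothetical; it is not Tenasol's implementation, and it only looks at `code.coding` and `content[].attachment`, which is a small subset of where codes and files can appear in real FHIR resources.

```python
import json

# Hypothetical, hand-written FHIR-style bundle for illustration only.
bundle_json = """
{
  "resourceType": "Bundle",
  "entry": [
    {"resource": {
      "resourceType": "Condition",
      "code": {"coding": [
        {"system": "http://snomed.info/sct", "code": "44054006",
         "display": "Diabetes mellitus type 2"}
      ]}
    }},
    {"resource": {
      "resourceType": "DocumentReference",
      "content": [{"attachment": {"contentType": "application/pdf",
                                  "title": "Discharge summary"}}]
    }}
  ]
}
"""


def split_bundle(bundle):
    """Return (coded, unstructured) lists found in a FHIR-style bundle dict."""
    coded, unstructured = [], []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        # Codified content: (system, code) pairs under code.coding.
        for coding in resource.get("code", {}).get("coding", []):
            coded.append((coding.get("system"), coding.get("code")))
        # Unstructured content: attachments, which may be of any content type.
        for content in resource.get("content", []):
            attachment = content.get("attachment", {})
            unstructured.append(attachment.get("contentType"))
    return coded, unstructured


coded, unstructured = split_bundle(json.loads(bundle_json))
print("coded:", coded)
print("needs OCR/NLP:", unstructured)
```

A pipeline that reads only the first list never sees the PDF, which is the gap described above.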
What Can Be Done with Extracted Health Data?
In principle, every type of data can be processed and fed into an artificial intelligence (AI) model, and when such models are constructed correctly, they yield outputs with predictive power. However, it is the use cases with the most predictive power and the most value that will attract the most interest. Below is an itemized list, in four categories, of the ways NLP systems are being used, or are proposed to be used, with unstructured medical record data (from a single record or multiple records) as input. Tenasol offers the majority of these as solutions.
General NLP Services:
Generalized OCR services for searching and parsing
Automatic page rotation / reorientation
Annotation of medical records
Conversion between medical record formats (e.g., CCDA, FHIR, PDF, HL7 ADT, image, RTF)
Extraction NLP Services:
Demographics
Patient information
Patient SDoH
Codes (ICD-10-CM, SNOMED, LOINC, CPT, etc.) with linked date of service, practitioner, etc.
Labs and Vitals
Practitioner information
Facility information
Disease research
Single-Record NLP AI Services:
Past / present / future patient event detection / prediction
Readability / handwriting detection
Language translation
Sentiment analysis
Summarization
Multi-Record NLP AI Services:
Population health trends and visualizations
Fraud, waste, and abuse detection
HEDIS / quality prioritization of medical records
What is Unstructured Health Data Extraction Used for Today?
In short, very little outside of clinical settings. Most data services only process the structured content shown below.
Outside of hospitals, content that is not already a PDF is most often rendered as HTML, converted to PDF, and then read by a human reviewer. This transformation results in information loss (mostly of structural information), and the lost information is not viewable by the human reviewer.
An inherent downside of this approach is that large volumes of medical records cannot be easily compared without data-centric NLP approaches. A plan interested in assessing its population in a target zip code may find it far easier to have a computer system parse a high volume of records to determine this information than to abstract it manually via human review.
What Data Can Tenasol Extract?
Here is a depiction of much of the codified data that Tenasol can extract from an unstructured record:
Tenasol is also capable of a wide range of machine learning-backed language detection. Examples of this are our HEDIS, risk adjustment, and prior authorization solutions. These solutions are capable not only of highlighting text, but also of indicating confidence and sorting medical records after health data extraction.
What Else Can Tenasol Do with Health Data Extraction?
Taken a step further, Tenasol systems are able to use this approach to compare populations in zip codes across the country using medical records as the sole input, in a significantly shorter period of time than manual review, showing where conditions such as diabetes are more or less prevalent by using both structured and unstructured content. A team of people manually reviewing records would find this task monumental, and a system using only structured data would produce an incomplete picture.
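As a rough sketch of the kind of aggregation this involves, the example below computes the share of records per zip code that carry a diabetes code. The record layout, the sample data, and the `prevalence_by_zip` helper are all hypothetical; the sketch also assumes one record per patient and that codes from structured and unstructured content have already been merged into one set per record.

```python
from collections import defaultdict

# Hypothetical extraction output: one row per patient record.
records = [
    {"zip": "10001", "codes": {"E11.9", "I10"}},
    {"zip": "10001", "codes": {"I10"}},
    {"zip": "60601", "codes": {"E11.65"}},
    {"zip": "60601", "codes": {"E11.9"}},
    {"zip": "60601", "codes": {"J45.909"}},
]


def prevalence_by_zip(records, code_prefix):
    """Fraction of records in each zip code with any code starting with code_prefix."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["zip"]] += 1
        if any(code.startswith(code_prefix) for code in record["codes"]):
            hits[record["zip"]] += 1
    return {zip_code: hits[zip_code] / totals[zip_code] for zip_code in totals}


# "E11" is the ICD-10-CM category for type 2 diabetes mellitus.
diabetes = prevalence_by_zip(records, "E11")
for zip_code, share in sorted(diabetes.items()):
    print(zip_code, f"{share:.0%}")
```

A production system would also need to deduplicate patients across records and handle small denominators before any values are mapped.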
Tenasol makes use of this approach in our GeoHealth® product shown below. This service aggregates claims information from structured and unstructured data in medical records, and uses Geographic Information Systems (GIS) to form a heatmap of a specified disease or group of diseases. It is also able to animate the heatmap over a time period.
Methods of NLP Health Data Extraction
While generative AI and large language models are big in the headlines, they are just one tool in a massive toolbox. As with optical character recognition (OCR) technologies, a plurality of approaches exists, operating along a sliding scale of required compute and operating speed.
It is worth noting that there are strategies that blend these approaches, but these are overall the most prominent. No single strategy should be employed for an unstructured data pipeline; rather, multiple strategies should be combined to handle specific cases.
For example, extracting all dates from a medical record is a relatively straightforward task that does not require the high cost or compute of a transformer system. Conversely, while summarization tasks are possible without transformers, transformers typically offer the cleanest result with the most control over the output relative to other toolsets. Aspects of these differing toolsets will be covered in subsequent blog posts.
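The date example can be sketched with the standard library alone. This is a minimal illustration, not a complete extractor: it assumes only two date shapes (`MM/DD/YYYY` with US month-first ordering, and `YYYY-MM-DD`), and the sample note and the `extract_dates` helper are hypothetical.

```python
import re
from datetime import datetime

# Two common date shapes only; real records contain many more variants.
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def extract_dates(text):
    """Return every substring that matches a known shape and parses as a real date."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        raw = match.group(1)
        for fmt in DATE_FORMATS:
            try:
                found.append(datetime.strptime(raw, fmt).date())
                break
            except ValueError:
                # Wrong format for this candidate, or an impossible date.
                continue
    return found


note = "Seen on 03/14/2023 for follow-up. Next visit 2023-06-01. Ref 13/45/2023."
dates = extract_dates(note)
print(dates)
```

The regular expression finds candidates cheaply, and `strptime` discards strings such as `13/45/2023` that look like dates but are not.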
Importance of Health Data Extraction Validation
With extraction out of the way, it can be easy to forget that medical records are created by humans and by machines that are sometimes faulty. This implies that data represented in a medical record, especially in the unstructured sections, may be erroneous.
Examples of erroneous information include incorrect units on vitals or labs, numbers that are outside the realm of possibility, and non-existent codes recorded by accident via “fat thumbing” mistakes. For most tasks, Tenasol employs multiple stages of validation to confirm information before it is taken as truth, or to flag it as erroneous.
This may be performed with regular expressions, pre-existing and maintained libraries, or coded knowledge on units and ranges built into a screening system. While not common, there are scenarios where these errors may be corrected or reformatted before they reach a final user.
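A screening step of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the bounds are loose plausibility limits rather than clinical reference ranges, the regular expression checks only the general shape of an ICD-10-CM code (it cannot tell whether a code actually exists), and the function names are hypothetical.

```python
import re

# Illustrative plausibility bounds keyed by (measurement, unit).
# These are NOT clinical reference ranges.
PLAUSIBLE_RANGES = {
    ("heart_rate", "bpm"): (20, 300),
    ("body_temperature", "C"): (25.0, 45.0),
    ("body_temperature", "F"): (77.0, 113.0),
}

# Shape check only: letter, digit, digit or A/B, then optional dot plus 1-4 characters.
ICD10CM_SHAPE = re.compile(r"^[A-Z]\d[0-9AB](\.[0-9A-Z]{1,4})?$")


def validate_vital(name, value, unit):
    """Classify a vital sign as 'ok', 'out_of_range', or 'unknown_unit'."""
    bounds = PLAUSIBLE_RANGES.get((name, unit))
    if bounds is None:
        return "unknown_unit"
    low, high = bounds
    return "ok" if low <= value <= high else "out_of_range"


def validate_icd10cm_shape(code):
    """Return True if the string has the general shape of an ICD-10-CM code."""
    return bool(ICD10CM_SHAPE.match(code))


print(validate_vital("body_temperature", 98.6, "F"))  # plausible in Fahrenheit
print(validate_vital("body_temperature", 98.6, "C"))  # same number, wrong unit
print(validate_icd10cm_shape("E11.9"))
print(validate_icd10cm_shape("E1l.9"))  # lowercase "l" typed instead of "1"
```

Confirming that a well-formed code actually exists still requires a lookup against a maintained code set, which is where the pre-existing libraries mentioned above come in.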
Conclusion
Natural Language Processing (NLP) has emerged as a transformative technology in deciphering medical records, offering unparalleled opportunities to enhance patient care, streamline healthcare operations, and improve diagnoses. Through the analysis of both structured and unstructured data, NLP systems can extract valuable insights, predict patient outcomes, and facilitate population health management. While the evolution of NLP methodologies, from rule-based systems to advanced transformer models, continues to drive innovation in healthcare, it's essential to employ diverse approaches tailored to specific use cases. With robust validation processes in place, NLP holds immense potential to revolutionize healthcare delivery and decision-making processes.
To extract data from unstructured healthcare data sources with Tenasol, reach out to our sales team.