NLP and Medical Records
The role of Natural Language Processing (NLP) in deciphering medical records has been transformative, offering opportunities for enhancing patient care, improving diagnoses, and streamlining healthcare operations. NLP, a field of artificial intelligence (AI), involves the use of algorithms to understand and interpret human language. This capability is particularly valuable in healthcare, where medical records abound with complex, unstructured data. This blog article explores the evolution of NLP in the medical field, highlighting significant milestones and discussing its profound impact on healthcare.
Where’s all the data?
As shown above in an analysis of electronic medical record data performed by Tenasol, roughly 58% (0.4*0.7 + 0.3) of electronic medical record data is completely unstructured. Combined with standalone image and PDF medical records, this implies that the world of healthcare data is far from reaching a structured state.
Presently, most systems used for processing these records make use of only the remaining 42% (0.6*0.7) that is structured, extracting codified content from the XML or JSON structure of the file. These codes might be SNOMED, ICD-10-CM, ICD-10-PCS, LOINC, RxNorm, NDC, CPT, or any of a number of other code systems that indicate patient conditions, procedures, drugs, demographics, results, and more.
The unstructured data, by contrast, comes primarily in the form of text and images, though it extends to the nearly unlimited range of data types that can now be housed within a FHIR R4 data transmission. By and large, the majority of it is text.
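To make this concrete, below is a minimal Python sketch of how coded and unstructured content might be separated out of a FHIR R4 Bundle. The resource types and fields used (Condition, DocumentReference, code.coding, content.attachment) are standard FHIR R4, but the traversal itself and the bundle file are illustrative assumptions, not any particular production pipeline.

```python
import base64
import json

def split_bundle(bundle_path):
    """Separate coded entries from unstructured attachments in a FHIR R4 Bundle."""
    with open(bundle_path) as f:
        bundle = json.load(f)

    coded, unstructured = [], []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        rtype = resource.get("resourceType")

        # Coded content: e.g. Condition resources carrying SNOMED or ICD-10-CM codes.
        if rtype == "Condition":
            for coding in resource.get("code", {}).get("coding", []):
                coded.append((coding.get("system"), coding.get("code"), coding.get("display")))

        # Unstructured content: base64-encoded notes, PDFs, or images in DocumentReference.
        elif rtype == "DocumentReference":
            for content in resource.get("content", []):
                attachment = content.get("attachment", {})
                data = attachment.get("data")
                if data is not None:
                    unstructured.append((attachment.get("contentType"), base64.b64decode(data)))

    return coded, unstructured
```

The coded list can feed directly into downstream analytics, while the unstructured attachments are what the NLP techniques discussed below are needed for.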
What can you do with it?
Without exception, every type of data can be processed and fed into an artificial intelligence (AI) model, and when constructed correctly, these models will yield outputs with predictive power. However, it is the use cases with the most predictive power and the most value that will gain the most interest. Here is an itemized list, in four categories, of the ways NLP systems are either being used or proposed to be used with unstructured medical record data (from either a single record or multiple records) as an input, the majority of which Tenasol offers as solutions:
What is done with unstructured data today?
In short, very little outside of clinical settings. Most data services only process the structured content shown below.
Content used outside of hospital settings, if it is not already a PDF, is most often rendered as HTML, converted to PDF, and then read by a human reviewer. This transformation results in information loss (mostly of structural information) that the human reviewer never sees.
An inherent downside of this approach is that large volumes of medical records cannot be easily compared without data-centric NLP. For example, a plan interested in assessing its population in a target zip code will find it harder to have 20 human reviewers compare notes than to use a single AI system that operates far more quickly and interprets records in one uniform fashion rather than in 20 different ways.
Taken a step further, Tenasol systems can use this approach to compare populations across zip codes nationwide using medical records as the sole input, in a significantly shorter period of time, showing where conditions such as diabetes are more or less prevalent based on both structured and unstructured content. A team of people reviewing manually would find this task monumental, and a system using only structured data would see an incomplete picture.
Tenasol makes use of this approach in our GeoHealth® product, shown below. This service aggregates claims information from structured and unstructured data in medical records and uses Geographic Information Systems (GIS) to form a heatmap of a specified disease or group of diseases. It can also animate the heatmap over a time period.
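As a rough illustration of the underlying idea (not GeoHealth's actual implementation), the sketch below aggregates hypothetical per-patient NLP output into zip-level prevalence and renders it as a heatmap with the folium library. The patient rows and zip centroid coordinates are made-up placeholders.

```python
import pandas as pd
import folium
from folium.plugins import HeatMap

# Hypothetical NLP output: one row per patient, with a diabetes flag derived
# from structured codes and/or unstructured text, plus zip centroid coordinates.
patients = pd.DataFrame({
    "zip":      ["30301", "30301", "73301", "73301", "73301"],
    "diabetes": [1, 0, 1, 1, 0],
    "lat":      [33.75, 33.75, 30.27, 30.27, 30.27],
    "lon":      [-84.39, -84.39, -97.74, -97.74, -97.74],
})

# Aggregate to zip-level prevalence.
prevalence = patients.groupby(["zip", "lat", "lon"], as_index=False)["diabetes"].mean()

# Weighted heatmap: each zip contributes its centroid, weighted by prevalence.
m = folium.Map(location=[37.0, -95.0], zoom_start=4)
HeatMap(prevalence[["lat", "lon", "diabetes"]].values.tolist()).add_to(m)
m.save("diabetes_prevalence.html")
```

Swapping the prevalence column for one computed per time window is what makes animation over a period possible.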
Methods of NLP processing of unstructured data
While Generative AI and Large Language Models dominate headlines, they are just one tool in a massive toolbox. As with optical character recognition (OCR) technologies, a variety of approaches exist, operating along a sliding scale of compute required versus speed of operation.
It is worth noting that there are strategies that blend between these, but the ones above are the most prominent. No single strategy should be employed for an unstructured data pipeline; rather, multiple strategies should be combined to handle specific cases.
For example, extracting all dates from a medical record is a relatively straightforward task that does not require the high cost or compute drawn by a transformer system. Conversely, while summarization tasks are possible without transformers, transformers typically offer the cleanest result with the most control over the output relative to other toolsets. Aspects of these differing toolsets will be covered in subsequent blog posts.
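As an illustration of how lightweight the date-extraction case can be, here is a small regex-based sketch; the patterns are illustrative and far from exhaustive.

```python
import re

# Illustrative patterns only; real records use many more date formats.
DATE_PATTERNS = [
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",    # 03/14/2021
    r"\b\d{4}-\d{2}-\d{2}\b",          # 2021-03-14
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2},? \d{4}\b",  # March 14, 2021
]

def extract_dates(text: str) -> list[str]:
    """Return every date-like string found in a chunk of record text."""
    matches = []
    for pattern in DATE_PATTERNS:
        matches.extend(re.findall(pattern, text))
    return matches

print(extract_dates("Seen on 03/14/2021; follow-up scheduled for April 2, 2022."))
```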
The Importance of Validation
With extraction out of the way, it can be easy to forget that medical records are created by humans and sometimes faulty machines. This implies that data represented in a medical record, especially the unstructured sections, may be erroneous.
Examples of erroneous information include incorrect units on vitals or labs, numbers that are outside the realm of possibility, and non-existent codes recorded by accident through “fat thumbing” mistakes. For most tasks, Tenasol employs multiple stages of validation to confirm information before it is taken as truth, or to flag it as erroneous.
This may be performed with regular expressions, pre-existing and maintained libraries, or knowledge of units and ranges coded into a screening system. While not common, there are scenarios in which these errors may be corrected or reformatted before they reach the end user.
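The sketch below shows what such a screening step might look like in Python; the reference ranges and the ICD-10-CM shape check are illustrative placeholders, not Tenasol's actual validation rules.

```python
import re

# Illustrative plausibility ranges; production systems maintain far larger libraries.
VITAL_RANGES = {
    "heart_rate_bpm": (20, 300),
    "temp_f":         (90.0, 110.0),
    "systolic_mmhg":  (50, 260),
}

# Approximate shape check only, not a lookup against a maintained code library.
ICD10_PATTERN = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")

def screen_vital(name: str, value: float) -> bool:
    """Flag values outside the plausible range for a known vital."""
    low, high = VITAL_RANGES[name]
    return low <= value <= high

def screen_icd10(code: str) -> bool:
    """Reject strings that cannot be valid ICD-10-CM codes, e.g. fat-thumbed entries."""
    return bool(ICD10_PATTERN.match(code.strip().upper()))

print(screen_vital("heart_rate_bpm", 650))  # False: outside the realm of possibility
print(screen_icd10("E11.9"))                # True: plausible diabetes code
print(screen_icd10("E1$.9"))                # False: likely a typo
```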
Conclusion
Natural Language Processing (NLP) has emerged as a transformative technology in deciphering medical records, offering unparalleled opportunities to enhance patient care, streamline healthcare operations, and improve diagnoses. Through the analysis of both structured and unstructured data, NLP systems can extract valuable insights, predict patient outcomes, and facilitate population health management. While the evolution of NLP methodologies, from rule-based systems to advanced transformer models, continues to drive innovation in healthcare, it's essential to employ diverse approaches tailored to specific use cases. With robust validation processes in place, NLP holds immense potential to revolutionize healthcare delivery and decision-making processes.