Medical Data Deduplication

Deduplication of medical records is a critical, yet often underappreciated, aspect of healthcare data management. It involves identifying and removing duplicate information from a dataset so that each patient's information is unique and accurately represented.

There are four entities to deduplicate:

Pages or page sets within a paper medical record [paper/pdf/image]
Entire paper records [paper/pdf/image]
Structured medical record data [C-CDA (based on HL7 v3)/FHIR]
Patients

This article seeks to describe methods for each.

Scale of the Challenge

The extent of duplication in medical records is not trivial. Data from Tenasol clients shows duplication rates ranging from 10% to as high as 50%. Recent research suggests the rate may be as high as 58% [1]. Reasons for the duplicative nature of medical data include, but are not limited to:

Interoperability bottlenecks that convert digital records to image/PDF
Duplicative faxing of a patient's medical records
Re-entry of medical data by a practitioner (attaching previous data to a new visit)
Healthcare IT software

Duplicate information may also manifest in different forms. For example, two identical faxed records may have different headers due to their time of fax or source hospital. A patient may have multiple names associated with them. A digital record may have been converted to a PDF record and then attached to a C-CDA file that also contains the same data digitally.

Reviewing these records during Medical Record Review (MRR) is both time-consuming and resource-intensive. Each duplicate record must be compared with its counterparts to verify and consolidate information, a task that requires meticulous attention to detail and a deep understanding of medical terminology and patient history.

In healthcare, the accuracy and completeness of a patient’s medical record are paramount. A single patient’s information might be spread across multiple databases or entered into the same system multiple times due to various factors such as misspellings, name changes, or data entry errors. These duplicates can lead to incomplete or fragmented care, where critical information might be overlooked because it is stored under a separate, duplicate record. Moreover, in the context of MRR, which often supports legal processes, research, and quality assurance activities, the integrity of data is non-negotiable. Duplication not only compromises the quality of data but also significantly increases the workload and complexity of reviews, potentially leading to erroneous conclusions or oversight of crucial details.

Page Deduplication

Page deduplication in medical records can be risky, because two pages may have identical content yet serve different roles. Consider, for example, a form containing a page with no fields on it, where that page appears in several places. Removing one instance of that page would strip context from the form it belongs to, even though the rest of that form differs from the other copies. Groups of duplicated pages are less risky: if we require a minimum page sequence length before flagging a duplicate, this issue is overcome.
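The minimum-sequence idea can be sketched as follows. This is an illustrative sketch, not a description of any production implementation: it assumes each page has already been reduced to a comparable fingerprint (for instance a hash of its text), and the function name `find_duplicate_runs` and the `min_run` parameter are hypothetical.

```python
def find_duplicate_runs(fingerprints, min_run=2):
    """Find runs of consecutive pages that repeat an earlier run.

    Returns a list of (start, original_start, length) tuples, keeping only
    runs of at least `min_run` pages. `fingerprints` is any list of hashable
    per-page values; how they are produced is outside this sketch.
    """
    positions = {}  # fingerprint -> indices of pages already scanned
    runs = []
    n = len(fingerprints)
    i = 0
    while i < n:
        best_len, best_start = 0, -1
        for j in positions.get(fingerprints[i], []):
            length = 0
            # Extend the match while pages agree and the earlier run
            # stays entirely before the current position.
            while (i + length < n and j + length < i
                   and fingerprints[j + length] == fingerprints[i + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, j
        if best_len >= min_run:
            runs.append((i, best_start, best_len))
            step = best_len
        else:
            step = 1  # a short match is left alone
        for k in range(i, i + step):
            positions.setdefault(fingerprints[k], []).append(k)
        i += step
    return runs


pages = ["cover", "blank", "labs", "cover", "blank", "labs", "blank"]
print(find_duplicate_runs(pages, min_run=2))
```

With `min_run=2`, the repeated three-page run is reported while the lone repeated "blank" page at the end is ignored, which mirrors the form example above.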

In practice, hashes are used for this. A hash is a sort of “fingerprint” that, for practical purposes, uniquely describes a piece of data: if the same piece of data is fed to a hashing utility, it produces an identical output hash. Building these hashes from either the full optical character recognition (OCR) output or solely the text derived from it is effective for performing this efficiently at scale. This does not, however, solve the case of two documents with slight differences, such as those with different fax headers. Loosely matching individual pages is considered risky.
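As a minimal sketch of exact-match page hashing, assuming the page text has already been extracted by OCR: the whitespace and case normalization, the choice of SHA-256, and the function names here are illustrative assumptions rather than a description of any specific product.

```python
import hashlib
import re


def page_fingerprint(page_text: str) -> str:
    """Return a SHA-256 hex digest of the page's normalized text."""
    # Assumption: collapsing whitespace and lowercasing is enough to absorb
    # trivial OCR spacing differences. Real pipelines may need more or less.
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(pages: list[str]) -> dict[int, int]:
    """Map the index of each duplicate page to its first occurrence."""
    first_seen: dict[str, int] = {}
    duplicates: dict[int, int] = {}
    for index, text in enumerate(pages):
        digest = page_fingerprint(text)
        if digest in first_seen:
            duplicates[index] = first_seen[digest]
        else:
            first_seen[digest] = index
    return duplicates


pages = [
    "Lab results\nHb 13.2",
    "Consent form",
    "lab  results Hb 13.2",           # same text, different spacing and case
    "FAX 01/02 Lab results Hb 13.2",  # same content behind a fax header
]
print(find_exact_duplicates(pages))
```

Only the third page is flagged as a duplicate of the first. The fourth page, which differs by a fax header, produces a different hash, illustrating the limitation noted above.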

Document Deduplication

At the document level, the value of identifying similar but not identical documents is significantly greater. Advanced similarity models, such as the Siamese neural networks used in facial recognition, are applicable, but they lack transparency in operation and make errors difficult for a user to understand.

Hashing methods are similarly viable, are transparent in nature, and allow for a sliding scale of similarity when comparing documents. The chart below describes the methods Tenasol uses for hashing documents for deduplication.
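One well-known way to obtain a sliding scale of similarity from hashes is to hash overlapping word sequences ("shingles") and compare the resulting sets with Jaccard similarity. The sketch below illustrates that general technique only; it is not claimed to be one of the methods Tenasol uses, and the shingle size and function names are assumptions.

```python
import hashlib
import re


def shingle_hashes(text: str, size: int = 3) -> set[str]:
    """Hash every run of `size` consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < size:
        windows = [words]  # text shorter than one shingle: hash it whole
    else:
        windows = [words[i:i + size] for i in range(len(words) - size + 1)]
    return {
        hashlib.sha256(" ".join(window).encode("utf-8")).hexdigest()
        for window in windows
    }


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    hashes_a, hashes_b = shingle_hashes(a, size), shingle_hashes(b, size)
    if not hashes_a and not hashes_b:
        return 1.0  # two empty documents are treated as identical
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


note = "patient presents with chest pain and shortness of breath"
faxed = "fax 0412 page 1 " + note
other = "annual eye exam normal vision no follow up needed"
print(similarity(note, note), similarity(note, faxed), similarity(note, other))
```

Because the score is the fraction of shared shingles, the result stays inspectable: a reviewer can list exactly which shingles matched, and the threshold for calling two documents duplicates can be tuned to the risk tolerance of the review.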

Structured Digital Record Data

While structured medical records are common inside hospital systems, they are startlingly scarce outside of them. Interoperability problems and the fight for EMR dominance have led to issues in transferring structured data between entities.

Structured records still face duplication problems, both as duplicated data within the primary structured locations and as appended unstructured data, which often comes in the form of binary files, attached PDF files, image files, or rich text files.

If a person is to review this data, it should be deduplicated before being rendered in a human-readable format. Tenasol offers toolsets capable of these operations, which use a combination of hashing approaches and advanced searching methods to remove duplications in the structured data. Taken a step further, these approaches also permit the deduplication of data that appears in both structured and unstructured locations.
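To illustrate how hashing can apply to structured entries, the sketch below hashes a canonical JSON form of each entry and keeps the first occurrence. It is a simplified assumption-laden example, not Tenasol's toolset: the entries are plain dictionaries loosely modeled on FHIR-style resources, and the set of fields treated as volatile (`id`, `meta`) is hypothetical.

```python
import hashlib
import json

# Assumption: these top-level fields vary between copies of the same
# clinical fact and should not affect the comparison.
VOLATILE_FIELDS = {"id", "meta"}


def entry_hash(entry: dict) -> str:
    """Hash an entry's content, ignoring volatile fields and key order."""
    stable = {k: v for k, v in entry.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_entries(entries: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct entry, in order."""
    seen: set[str] = set()
    unique = []
    for entry in entries:
        digest = entry_hash(entry)
        if digest not in seen:
            seen.add(digest)
            unique.append(entry)
    return unique


entries = [
    {"id": "a1", "resourceType": "Condition", "code": "I10", "onset": "2021-03-04"},
    {"onset": "2021-03-04", "code": "I10", "resourceType": "Condition", "id": "b7"},
    {"id": "c3", "resourceType": "Condition", "code": "E11", "onset": "2019-11-20"},
]
print([entry["id"] for entry in dedupe_entries(entries)])
```

Sorting keys before serialization means two entries with the same content but different field order hash identically. Matching the same fact across structured and unstructured locations would need additional text extraction that this sketch does not attempt.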

Patients

This topic is particularly complicated; please refer to the complete blog post we have written on it.

Conclusion

Deduplication in the management of medical records is a complex but essential task that directly impacts the quality of patient care, the efficiency of healthcare operations, and the integrity of medical research and legal processes. The challenge of addressing duplicate records in MRR is significant, with substantial portions of records requiring review and consolidation. Through a combination of technological solutions, improved data management practices, and the adoption of emerging technologies, healthcare organizations can effectively reduce the prevalence of duplicate records, enhancing the reliability and usefulness of medical data. As the healthcare industry continues to evolve, the strategies for managing deduplication will also need to adapt, ensuring that patient records are accurate, complete, and uniquely representative of each individual's health history.

Sources

[1] Steinkamp, Jackson, Jacob J. Kantrowitz, and Subha Airan-Javia. "Prevalence and sources of duplicate information in the electronic medical record." JAMA Network Open 5.9 (2022): e2233348.