Medical Record Deidentification NLP
Medical record deidentification NLP is the use of NLP in the of removal of protected health information (PHI) from a data asset.
As a regular part of its business, Tenasol performs data extraction and automated medical record deidentification NLP upon request. Some initial points about deidentification:
There is a narrow definition of what is properly deidentified according to HIPAA.
Structured data can be automatically deidentified with NLP. There are a large number of tools (and rules) on how to do this properly, and it may require an expert. When we say ‘structured‘ we mean that the fields are in a database and have typing controls.
Unstructured data cannot reliably be automatically deidentified perfectly (with one exception). While unstructured data (OCR’d images, unconstrained strings of text, etc.) may be manually deidentified by hand, doing it automatically is a categorical impossibility as any model used for removing PHI will do so with less than a 100% accuracy, guaranteeing the presence of PHI. Though you can make data that is structured from unstructured data, then deidentify that.
Penalties for not properly deidentifying data are very high. Therefore medical record deidentification should be taken extremely seriously. HIPAA violations start at 100$ and range to the tens of millions of dollars with possible jail time depending on scale and intent.
HIPAA Deidentification Laws
HIPAA defined two ways to deidentify a data asset:
Narrow, ‘Safe Harbor’ Approach. This states there are 18 identifiers that must be removed. These include patient location, contact information, web URLs, and just about any number/code tied to the patient. Also,
Zips may be retained as the first 3 digits so long as the 3-digit zip has >20,000 residents. HIPAA itemizes these zip codes for those interested.
Dates may only be retained to the year.
Patients older than 90 must be indicated as “90 or older“, and information identifying them as such must be obfuscated, as it over-specifies a small population.
Wider ‘Expert Determination‘ Approach. In this method a person who is an expert (this is undefined) is tasked with determining the correct approach to make sure the data is removed in a specified way from the data, meeting the principles of the safe harbor rules and documenting their approach.
Note: Both categories also prohibit any data that may be joined with public data to deidentify a patient with anything more than low risk. HIPAA is loose about what amounts of ‘low risk’ of identification are.
Tenasol & Deidentification NLP
Rather than performing automated unstructured deidentification, Tenasol uses our data extraction pipeline to pull information from a record, then deidentify the now-structured data, applying the Safe Harbor principle outlined by HIPAA.
Tenasol collects information on data found and classifies it as structured and unstructured. It also determines metadata about entire medical records as a whole. In the following example, we show how a medical finding has elements related to PHI, as well as ones that are no longer of use (like page and coordinates) removed to make a new deidentified finding.
Deidentification NLP & Hashing
In many instances, such as the ‘filename’ shown in the above example, it is important to retain a consistent data element but to make sure it is modified in all instances. We want filename to be unique to all found data elements, but we do not want it to be able to be known as it may contain PHI in the form of a patients name.
Hashing is an appropriate way to do this, but it is important to make sure that hashing is done properly. For example if we are SHA256 hashing a persons name and it is ‘John Doe‘ we would get a hash of:
ae6e4d1209f17b460503904fad297b31e9cf6362
If someone were to guess John Doe is the patient, they could simply check what the hash is for John Doe, see that they are correct, and the patient is no longer deidentified. To perform a good hash for this scenario:
Use a modern hashing method. SHA1 offers far less encryption than the more common and more secure SHA256.
Cut encryption string down in length. Because a SHA256 hash is so long (64 characters) cutting it down reduces the amount of data that needs to be stored and further prevents decryption.
Add a random value or timestamp when hashing. This prevents the name alone from being hashed to fully encrypt the string.
Taking our example before, the hash would now become a far more secure, more concise string of:
ca4d4009c3b1167367a4ee3586ef7e28
Deidentification NLP Use Cases
Medical record deidentification NLP may be used for a myriad of use cases including:
Clinical Research: Population health research is common with deidentified data sets that describe who is getting what codified treatments, drugs, procedures, or who has what SDoH conditions and what their age and healthcare status is.
Population Health Visualizations: Data that is aggregated and restricted to regions with larger populations may be considered deidentified and publicly shared for use in improving population health decisions and health policy.
Example Data: Example data may be generated for public demonstrations once it has been deidentified using deidentification.
Training of AI models: Models may be trained from this data (with some caveats) to permit deployment on real data that may be used for a variety of use cases that Tenasol also performs.
Other Deidentification NLP Rules
Note that this is just a summary of a few of the approaches that Tenasol takes and is not the complete picture. Please review HIPAA deidentification rules in detail if you are interested in performing your own medical record deidentification NLP.
Conclusion
Medical record deidentification NLP is crucial for safeguarding patient privacy and maintaining compliance with HIPAA regulations. Tenasol employs a robust approach that integrates structured data extraction and deidentification while adhering to the Safe Harbor principles. By leveraging advanced tools, rules, and metadata analysis, Tenasol ensures accurate and consistent data handling. For sensitive fields like filenames, hashing techniques, such as SHA256 with added random values, provide further protection.
While structured data can be reliably deidentified, unstructured data presents inherent challenges, reinforcing the importance of meticulous extraction and transformation. The penalties for improper deidentification underscore the need for precision and expertise. Tenasol’s pipeline exemplifies a secure, compliant method of deidentification, balancing privacy with functionality. As deidentification is a complex process, organizations must carefully design workflows, consult HIPAA regulations, and, where necessary, seek expert input to achieve the desired low-risk standard of privacy protection.