Healthcare data sources are the many places from which healthcare-related data can be drawn for use in informatics, modeling, and prediction.

Too often it is taken as a given that healthcare data sources consist entirely of medical records. Nothing could be further from the truth. We will explore the full scope of healthcare data sources available across the care continuum for analyzing who a patient is, as well as their past, present, and future risks. Let’s start with medical records.

What Are the Common Healthcare Record Formats?

In order of oldest to newest:

Paper, scanned medical images, or PDF files. Even today this is by far the most common format in terms of total information volume.
X12 EDI transactions, used for medical claims or prior authorization by health plans and hospitals
HL7 v2 messages such as ADT (admit, discharge, transfer), used for transmitting patient status information between hospitals
HL7 v3 / CDA, the third generation of the standard, usually XML and used for medical record summaries
HL7 FHIR (sometimes informally called HL7 v4), the most recent format, designed for API-based exchange of all transaction types and inherently graph structured
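To make the HL7 v2 entry above more concrete, here is a minimal sketch of reading a few fields from a pipe-delimited ADT message using only string splitting. The sample message is synthetic, the function name `parse_adt` is our own, and real integrations normally use a dedicated HL7 library that handles escaping, repetitions, and custom delimiters.

```python
# Illustrative only: a synthetic ADT^A01 (admit) message, not real patient data.
SAMPLE_ADT = "\r".join([
    r"MSH|^~\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20240101120000||ADT^A01|MSG00001|P|2.5",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F",
    "PV1|1|I|WARD1^101^A",
])


def parse_adt(message: str) -> dict:
    """Pull a few common fields out of a pipe-delimited HL7 v2 message."""
    segments = {}
    for line in message.split("\r"):  # segments are separated by carriage returns
        if line:
            fields = line.split("|")
            segments.setdefault(fields[0], fields)  # keep the first segment of each type

    msh, pid = segments["MSH"], segments["PID"]
    # MSH-1 is the field separator itself, so MSH-9 sits at index 8 after splitting.
    message_type, trigger_event = msh[8].split("^")[:2]
    family_name, given_name = pid[5].split("^")[:2]  # PID-5: patient name
    return {
        "message_type": message_type,
        "trigger_event": trigger_event,
        "patient_id": pid[3].split("^")[0],  # PID-3: patient identifier list
        "family_name": family_name,
        "given_name": given_name,
        "birth_date": pid[7],  # PID-7: date of birth
        "sex": pid[8],         # PID-8: administrative sex
    }


parsed = parse_adt(SAMPLE_ADT)
print(parsed)
```

The sketch assumes the default delimiters (`|` for fields, `^` for components) and ignores repeated segments, which is enough to show how compact and position-dependent the format is.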

What Healthcare Data Sources are in Records?

Regardless of format, medical records (whether or not they live in an EMR system) usually contain:

Who the patient is: name, gender, date of birth, race, ethnicity, address, phone, email
Unstructured notes: notes recorded by the practitioner on status, goals, and plans, as well as patient forms, etc.
Codes: codes for just about everything (ICD-10-CM/PCS, CPT, LOINC, RxNorm, NDC, billing; the list goes on)
Provider information: the practitioner for each data subset, their location and contact info, clinics, and the dates and times associated with each entry, to varying degrees
Raw data: EKG plots or radiology results are sometimes placed directly into records as raw data or as images for record keeping

Figure: Healthcare data sources in medical records. Modified diagram from Kohane et al. showing only medical record data [1].

When this data is exchanged outside of a hospital, it is often not exported cleanly. For example, codes are frequently not attributed to an individual practitioner or hospital system, and dates of service for the patient may be missing.
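As a rough illustration of the export problem just described, the sketch below flags coded entries that arrive without attribution or a date of service. The `CodedEntry` shape and its field names are hypothetical, chosen only for this example; real exports vary by format and vendor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodedEntry:
    """One coded item from an exported record (hypothetical shape)."""
    system: str                          # e.g. "ICD-10-CM"
    code: str
    practitioner: Optional[str] = None
    organization: Optional[str] = None
    service_date: Optional[str] = None   # ISO date string when present


def missing_context(entry: CodedEntry) -> list:
    """Return the names of the context fields this entry lacks."""
    gaps = []
    if not (entry.practitioner or entry.organization):
        gaps.append("attribution")
    if not entry.service_date:
        gaps.append("service_date")
    return gaps


# Made-up entries covering the complete, fully missing, and partial cases.
entries = [
    CodedEntry("ICD-10-CM", "E11.9", practitioner="Dr. A", service_date="2023-04-02"),
    CodedEntry("CPT", "99213"),
    CodedEntry("LOINC", "4548-4", organization="Example Hospital"),
]

report = {entry.code: missing_context(entry) for entry in entries}
for code, gaps in report.items():
    print(code, gaps or "complete")
```

A check like this does not repair anything, but it makes the extent of the gaps visible before the data is used downstream.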

Important information is frequently stated only in unstructured (as opposed to codified) form. The quality of these unstructured sections often varies by Electronic Medical Record (EMR) system. Ultimately, the unstructured sections of a medical record can be the most telling about a patient while also being the most difficult for a computer to make decisions on.

What Healthcare Data Sources CAN be in Records?

Previously, it was just text, PDF, audio files, and maybe a few other types. Today, recent formats can carry essentially any file type. HL7, the organization that develops these standards, has purposefully designed its modern record transfer standard (FHIR) to accommodate this in a standardized manner. As a not-so-extreme example, all of the following are possible:

3D computer models, for example of organs
audio files of encounters
video recordings of encounters
genomic profiles
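One way FHIR can carry arbitrary files such as those listed above is through an attachment on a DocumentReference resource, where the payload is base64-encoded and labeled with a MIME type. The sketch below builds a minimal FHIR R4 style resource as a plain dictionary. The helper name, patient ID, and payload are invented for illustration, and a production resource would typically carry more metadata (type, date, author) and be validated against the specification.

```python
import base64
import json


def make_document_reference(payload: bytes, content_type: str,
                            title: str, patient_id: str) -> dict:
    """Wrap arbitrary bytes in a minimal FHIR R4 DocumentReference resource."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {
                "attachment": {
                    "contentType": content_type,  # MIME type of the payload
                    "title": title,
                    # FHIR attachments carry inline data as base64 text
                    "data": base64.b64encode(payload).decode("ascii"),
                }
            }
        ],
    }


# Stand-in bytes for a 3D model file; not a real organ model.
model_bytes = b"solid heart\nendsolid heart\n"
doc = make_document_reference(model_bytes, "model/stl", "Heart model", "example")
print(json.dumps(doc, indent=2))
```

For large files, an attachment can reference the content by URL instead of embedding it inline, which keeps the resource itself small.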

Healthcare Data Sources NOT in Medical Records

Simply stated, a lot.

60% of the data types shown in the full data diagram are not commonly found in records, but the reality is much worse than that. Even among the data types associated with records, only a fraction are actually present in most cases. Beyond this, only in extreme cases does a medical record cover the patient's full history. It typically covers only the period during which the patient has been with the hospital that created it, excluding visits to other clinics.

A medical record will also be missing anything the patient does not state about themselves before or during a clinical encounter, including social determinants of health (SDoH) such as:

preferences
hobbies
job / employment
habits
financials
community
travel information
extended family history
police records
genetic information
home air quality and living conditions
physical activity, unless expressly documented
mental health, unless expressly documented
sleeping habits, unless expressly documented

Over-the-counter medications are also frequently missing, because the patient or their practitioner has to add them manually.

Implied Healthcare Data Sources

There is a new form of medical record data extraction: deriving insights from content that is not explicitly stated or codified. Tenasol does this in a few ways, via extraction of:

Custom Concepts and their confidence from unstructured text using machine learning
Codes, and then calculation of patient risks using those extracted codes fed into machine learning models to form second-stage outputs
Social Determinants of Health from unstructured data in multiple methods as similarly described in [2][3]
HEDIS, Risk, or disability gaps

All of these have utility as extractions in their own right, and secondary utility when fed into a subsequent machine learning stage.

Take, for example, a patient record that does not codify physical activity but describes a patient who is highly active in tennis. A language model picks up this information and scores the patient's physical activity between 0 and 1. That score is then fed into a system assessing the patient's risk of vascular disease, which lowers its output risk because of the high level of physical activity detected. This is not an uncommon example, and we will delve into it further in a later blog post.
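The second stage of this example can be sketched as a toy logistic model in which a higher activity score lowers the predicted risk. Everything below is an assumption made for illustration: the function name, the input features, and especially the coefficients, which are invented and have no clinical validity. It is not Tenasol's model.

```python
import math


def vascular_risk(age: float, smoker: bool, activity_score: float) -> float:
    """Toy logistic risk model; the coefficients are made up for illustration."""
    if not 0.0 <= activity_score <= 1.0:
        raise ValueError("activity_score must be between 0 and 1")
    # A negative weight on activity_score means more activity lowers the risk.
    logit = -5.0 + 0.06 * age + 0.8 * float(smoker) - 1.2 * activity_score
    return 1.0 / (1.0 + math.exp(-logit))


# activity_score would come from the first-stage language model (0 = none, 1 = very active).
sedentary = vascular_risk(age=60, smoker=False, activity_score=0.1)
active = vascular_risk(age=60, smoker=False, activity_score=0.9)
print(f"sedentary: {sedentary:.3f}, active: {active:.3f}")
```

The point is the data flow rather than the numbers: an extracted, uncodified signal becomes one more input to a downstream model.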

External Healthcare Data Sources

Healthcare data science firms rarely pull from external sources, relying instead only on structured data within a medical record. Yet many modern machine learning models tolerate missing variables, so when an external value is unavailable for a given patient, a prediction can still be made. Tenasol plugs directly into external data sources to supplement decision making in many of our systems, including:

Census data
Environmental Protection Agency (EPA) air quality data API
NOAA API
Weather API
Health trends available from other Tenasol Health data
Data derived from partnerships
…and many more

These APIs hold a wealth of information that can be tied to a specific patient. Air quality in a given zip code can have a strong impact on lung health. Highly humid air can affect long-term cardiovascular health. These factors must be considered within the patient profile.

With so many supplemental data sources now available over the internet, linking to them and incorporating them into a machine learning model's input vector lets the model learn coefficients that depend on these factors across all sorts of decision making. This holistic view facilitates a more nuanced risk assessment and personalized care planning, acknowledging the intricate interplay between genetics, lifestyle, environmental exposures, and socio-economic factors.
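A minimal sketch of assembling such an input vector follows. It joins patient-level values with zip-code-level external values and marks anything unavailable as NaN, which some model families (for example, certain tree-based learners) can accept directly. The lookup table, feature names, and numbers are all hypothetical stand-ins for real API responses.

```python
import math

# Hypothetical zip-level lookup standing in for external API responses
# (the values are invented, not real measurements).
EXTERNAL_BY_ZIP = {
    "37201": {"air_quality_index": 42.0, "mean_humidity_pct": 68.0},
    "85001": {"air_quality_index": 75.0},  # humidity not available here
}

# Fixed feature order so every patient maps to a vector of the same shape.
FEATURES = ["age", "activity_score", "air_quality_index", "mean_humidity_pct"]


def build_feature_vector(patient: dict) -> list:
    """Merge patient and external values; missing features become NaN."""
    external = EXTERNAL_BY_ZIP.get(patient.get("zip", ""), {})
    merged = {**external, **patient}  # patient-level values win on conflict
    return [float(merged.get(name, math.nan)) for name in FEATURES]


vector = build_feature_vector({"age": 60, "zip": "85001"})
print(dict(zip(FEATURES, vector)))
```

Keeping the missing values explicit, rather than silently filling them with zeros, lets the downstream model or an imputation step decide how to treat them.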

Conclusion

The landscape of sources of healthcare data is vast and multifaceted, extending well beyond the confines of traditional medical records to encompass a wide array of external data sources. By harnessing the power of both structured and unstructured data, from the nuances of personal health records to the broad strokes of environmental and social determinants, healthcare providers and data scientists can achieve a more holistic understanding of patient health. This comprehensive approach not only enhances individual patient care but also advances public health outcomes through predictive analytics and personalized interventions. As the field of healthcare data science evolves, the integration of these diverse data streams will undoubtedly play a pivotal role in shaping the future of healthcare delivery and policy.

If you would like to learn more about how to extract data from all types of healthcare data sources, reach out to our sales team.

Sources

[1] Weber, Griffin M., Kenneth D. Mandl, and Isaac S. Kohane. "Finding the missing link for big biomedical data." JAMA 311.24 (2014): 2479-2480.

[2] Han, Sifei, et al. "Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing." Journal of Biomedical Informatics 127 (2022): 103984.

[3] Patra, Braja G., et al. "Extracting social determinants of health from electronic health records using natural language processing: a systematic review." Journal of the American Medical Informatics Association 28.12 (2021): 2716-2727.

A Comprehensive Guide to Healthcare Data Sources