A Comprehensive Guide to Healthcare Data Sources


Too often in healthcare, it is taken as a given that healthcare data sources are mainly just medical records. This could not be further from the truth. We will explore the full scope of healthcare data sources available across the care continuum for analyzing who a patient is, as well as their past, present, and future risks. Let’s start with medical records.


Diagram from Kohane et al. describing the known universe of digital information available on a patient [1]

What Healthcare Data Sources are in Records?

Regardless of format, medical records usually contain:

  • Who the patient is: name, gender, date of birth, race, ethnicity, address, phone, email

  • Unstructured notes: recorded notes from the practitioner on status, goals, plans, patient forms, etc.

  • Codes: codes for just about everything (ICD-10-CM/PCS, CPT, LOINC, RxNorm, NDC, billing; the list goes on)

  • Provider information: the practitioner for each data subset, their location and contact info, clinics, and, to varying degrees, the dates and times associated with each entry

  • Raw Data: Sometimes EKG plots or radiology results will be directly placed into records as raw data or as images for record keeping.


Modified diagram from Kohane et al. of just medical record data [1]

When this data is exchanged outside of a hospital, it is often not exported cleanly. For example, codes are frequently not attributed to an individual practitioner or hospital system, and dates of service for the patient may be missing.
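The export gaps described above can be checked for mechanically. The following is a minimal sketch, assuming a hypothetical `CodedEntry` structure and an `export_gaps` helper invented for illustration; real exports vary by format and vendor, and the sample codes are placeholders only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CodedEntry:
    """One coded item from an exported record (hypothetical structure)."""
    system: str                          # e.g. "ICD-10-CM", "CPT", "LOINC"
    code: str
    practitioner: Optional[str] = None   # who recorded the code, if exported
    organization: Optional[str] = None   # hospital system, if exported
    service_date: Optional[date] = None  # date of service, if exported


def export_gaps(entry: CodedEntry) -> List[str]:
    """Return the names of the attribution fields missing from an entry."""
    gaps = []
    # A code is unattributed when neither a practitioner nor an organization is present
    if not entry.practitioner and not entry.organization:
        gaps.append("attribution")
    if entry.service_date is None:
        gaps.append("service_date")
    return gaps


# Illustrative entries only; the names and dates are made up
entries = [
    CodedEntry("ICD-10-CM", "E11.9", practitioner="Dr. Example", service_date=date(2023, 5, 1)),
    CodedEntry("CPT", "99213"),
]
report = {entry.code: export_gaps(entry) for entry in entries}
print(report)
```

A report like this makes it easier to decide, per entry, whether a code can be used downstream or needs to be traced back to its source first.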

Important information is frequently stated only in unstructured (as opposed to codified) form. The quality of these unstructured sections often varies by Electronic Medical Record (EMR) system. Ultimately, the unstructured sections of a medical record can be the most telling about a patient, while simultaneously being the most difficult for a computer to make decisions on.

What Healthcare Data Sources CAN be in Records?

Previously, records carried just text, PDF, audio files, and maybe a few other types. Today, recent formats can carry essentially any file type. HL7, the organization that sets the standards, has purposefully designed modern medical record transfer protocols to accommodate this in a standardized manner (HL7 FHIR R4). In a not-so-extreme example, this means that the following are all possible:

  • 3D computer models, for example of organs

  • audio files of encounters

  • video encounters

  • genomic profile
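To illustrate how one of the items above might travel inside a record, here is a minimal sketch that wraps arbitrary bytes in a dictionary shaped like a FHIR R4 DocumentReference, whose attachment carries a MIME type alongside base64-encoded data. The helper name `make_document_reference` and the placeholder payload are assumptions for illustration. The sketch fills in only a small subset of fields and is not validated against the specification.

```python
import base64
import json


def make_document_reference(payload: bytes, content_type: str, title: str) -> dict:
    """Wrap arbitrary bytes in a FHIR R4 DocumentReference-shaped dict.

    Hypothetical helper: only a minimal subset of fields is populated.
    """
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "content": [
            {
                "attachment": {
                    "contentType": content_type,  # MIME type of the payload
                    "data": base64.b64encode(payload).decode("ascii"),
                    "title": title,
                }
            }
        ],
    }


# Placeholder bytes stand in for a real recording of a video encounter
doc = make_document_reference(b"placeholder bytes", "video/mp4", "Video encounter (example)")
print(json.dumps(doc, indent=2))
```

Because the attachment is described by its MIME type rather than by a fixed list of formats, the same wrapper shape applies whether the payload is audio, video, a 3D model, or a genomic file.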

Healthcare Data Sources NOT in Medical Records

Simply stated, a lot.


Modified diagram from Kohane et al. of non-medical record data [1] to show data not commonly in medical records.

60% of the data types shown in the full diagram are not commonly in records, but it is much worse than that. Among the data types that are associated with being in a record, only a fraction actually are in most cases. Beyond this, only in extreme cases does a medical record cover the patient's full history. Rather, it typically covers only the period of time that the patient has been with the hospital that created it, excluding visits to other clinics.

A medical record will also be missing anything the patient does not state about themselves prior to or during a clinical encounter. This can include their:

  • preferences

  • hobbies

  • job / employment

  • habits

  • financials

  • community

  • travel information

  • extended family history

  • police records

  • genetic information

  • home air quality and living conditions

  • physical activity, unless expressly documented

  • mental health, unless expressly documented

  • sleeping habits, unless expressly documented

  • over-the-counter medications, which are frequently not included because the patient has to add them themselves or have their practitioner do so

Implied Healthcare Data Sources

There is a new form of medical record data extraction: deriving insights from content that is not explicitly stated or codified. Tenasol does this in a few ways, via extraction of:

  • Custom Concepts and their confidence from unstructured text using machine learning

  • Codes, and then calculation of patient risks using those extracted codes fed into machine learning models to form second-stage outputs

  • Social Determinants of Health from unstructured data, using multiple methods similar to those described in [2][3]

All three of these have utility in extraction alone, but also have secondary utility in being fed into a subsequent machine learning stage.

Take, for example, a patient record that does not codify physical activity but describes a patient who is highly active in tennis. This information is picked up by a language model that scores personal physical activity between 0 and 1. This score is subsequently fed into a system assessing the patient's risk for vascular disease, whose output risk drops because of the high level of physical activity detected. This is not an uncommon example, and we will delve into it more in a later blog post.
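The two-stage idea in this example can be sketched in a few lines. This is a toy illustration under stated assumptions, not Tenasol's actual model: the logistic form, the function name `vascular_risk`, the baseline, and the weight are all hypothetical, and the activity score is simply passed in as if a language model had already produced it.

```python
import math


def vascular_risk(base_logit: float, activity_score: float, activity_weight: float = -1.5) -> float:
    """Toy second-stage risk model: a higher activity score (0 to 1) lowers the risk.

    All numbers are hypothetical; a real model would learn its weights from data.
    """
    if not 0.0 <= activity_score <= 1.0:
        raise ValueError("activity_score must be between 0 and 1")
    # A negative weight means detected activity pulls the logit, and thus the risk, down
    logit = base_logit + activity_weight * activity_score
    return 1.0 / (1.0 + math.exp(-logit))


# First-stage output is assumed: 0.9 for the tennis player, 0.0 for a sedentary patient
risk_sedentary = vascular_risk(base_logit=-1.0, activity_score=0.0)
risk_tennis = vascular_risk(base_logit=-1.0, activity_score=0.9)
print(f"sedentary: {risk_sedentary:.3f}, tennis: {risk_tennis:.3f}")
```

The point of the sketch is only the direction of the effect: with everything else held equal, the patient whose notes mention regular tennis ends up with the lower risk estimate.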

External Healthcare Data Sources

Healthcare data science firms rarely pull from external sources, instead relying only on structured data within a medical record. Yet many modern machine learning models permit missing variables, so if an external value is not available for a single patient, the model can still produce an output. Tenasol plugs directly into external data source APIs to supplement decision making in many of our systems.

These APIs hold a wealth of information that can be tied to a specific patient. Air quality in a specified zip code can have strong impacts on lung health. Highly humid air can have impacts on long-term cardiovascular health. These items must be considered within the patient profile.

With the internet providing supplemental data sources, linking to them and incorporating them into a machine learning model's input vector allows the model to learn coefficients that depend on these factors for all sorts of decision making. This holistic view facilitates a more nuanced risk assessment and personalized care planning, acknowledging the intricate interplay between genetics, lifestyle, environmental exposures, and socio-economic factors.
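As a rough sketch of what incorporating external sources into an input vector could look like, the snippet below merges record-derived features with zip-code-level values and leaves anything unavailable as NaN. Everything here is an assumption for illustration: the feature names, the `EXTERNAL_BY_ZIP` dictionary standing in for an external API, and the sample values. No real service is called.

```python
import math
from typing import Dict, List, Optional

# Fixed feature order so every patient maps to a vector of the same shape (hypothetical names)
FEATURE_ORDER = ["age", "activity_score", "air_quality_index", "humidity_pct"]

# Hypothetical stand-in for an external API lookup keyed by zip code
EXTERNAL_BY_ZIP: Dict[str, Dict[str, float]] = {
    "00000": {"air_quality_index": 42.0, "humidity_pct": 55.0},
}


def build_feature_vector(record: Dict[str, float], zip_code: Optional[str]) -> List[float]:
    """Merge record features with external features; missing values become NaN."""
    external = EXTERNAL_BY_ZIP.get(zip_code or "", {})
    merged = {**record, **external}
    return [merged.get(name, math.nan) for name in FEATURE_ORDER]


known = build_feature_vector({"age": 61.0, "activity_score": 0.9}, "00000")
unknown = build_feature_vector({"age": 61.0}, None)
print(known)
print(unknown)
```

Leaving gaps as NaN rather than inventing a value suits model families that accept missing inputs directly, such as the gradient-boosted tree implementations in common libraries; other model types would need an explicit imputation step first.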

Conclusion

The landscape of sources of healthcare data is vast and multifaceted, extending well beyond the confines of traditional medical records to encompass a wide array of external data sources. By harnessing the power of both structured and unstructured data, from the nuances of personal health records to the broad strokes of environmental and social determinants, healthcare providers and data scientists can achieve a more holistic understanding of patient health. This comprehensive approach not only enhances individual patient care but also advances public health outcomes through predictive analytics and personalized interventions. As the field of healthcare data science evolves, the integration of these diverse data streams will undoubtedly play a pivotal role in shaping the future of healthcare delivery and policy.

If you would like to learn more about how to extract data from all types of healthcare data sources, reach out to our sales team.

Sources

[1] Weber, Griffin M., Kenneth D. Mandl, and Isaac S. Kohane. "Finding the missing link for big biomedical data." JAMA 311.24 (2014): 2479-2480.

[2] Han, Sifei, et al. "Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing." Journal of Biomedical Informatics 127 (2022): 103984.

[3] Patra, Braja G., et al. "Extracting social determinants of health from electronic health records using natural language processing: a systematic review." Journal of the American Medical Informatics Association 28.12 (2021): 2716-2727.
