Beyond the Algorithm: The Tightrope Walk in Health Data Science

Apr 25, 2025 - 15:08

The application of data science methodologies holds immense promise for revolutionizing healthcare. Personalized medicine, more accurate and timely diagnostics, and vastly improved efficiency within healthcare systems are all within reach, fueled by advances in machine learning and artificial intelligence.

However, a critical challenge lies at the heart of this transformative endeavor: the limitations inherent in the data upon which these powerful algorithms rely. The central question remains: can we truly achieve this transformative vision with the data we currently collect and can readily access?

Beyond the Hospital Walls - The Missing Pieces

Hospitals and clinics are often assumed to be data-rich environments. In reality, the datasets collected in these environments are frequently narrow, incomplete, and inconsistent. Vital signs, lab results, diagnosis codes, medications, and procedures are certainly recorded, but they are often siloed across systems, riddled with missing values, or lacking the temporal resolution necessary for dynamic modeling. A significant limitation of relying solely on Electronic Health Records (EHRs) is the absence of crucial contextual information that profoundly influences both the models we build and their outcomes.

Comprehensive data on lifestyle factors such as diet, exercise habits, smoking and alcohol consumption, and mental well-being are often lacking or inconsistently recorded. Furthermore, EHRs typically do not capture detailed information about environmental exposures, socioeconomic determinants of health (like income, education, and housing), or patient-reported outcomes such as quality of life and functional status. Without this broader context, our understanding of why diseases develop and how to prevent them is inherently limited.

The increasing recognition of the importance of social determinants of health has led to efforts to incorporate this information into EHRs. However, the practicalities of consistently and accurately collecting such data within the already demanding clinical workflow present significant challenges. Clinicians have limited time, and their primary focus naturally remains on immediate patient care needs. Gathering detailed information on social circumstances might be perceived as intrusive or time-consuming, leading to incomplete or inconsistent data capture.

This lack of a holistic view can introduce biases and lead to incomplete research findings. For instance, research relying solely on hospital data might incorrectly attribute disease causes, or fail to identify effective preventative measures rooted in lifestyle changes or in reducing environmental risk factors.

Locked Away: The Accessibility Obstacle

Unfortunately, valuable health data that could significantly improve healthcare outcomes and advance medical knowledge frequently remains inaccessible. It often resides outside established hospital systems, scattered across various platforms and databases. Or it sits within hospitals but is buried in lengthy, complicated reports, making it difficult to extract and use. Retrieving this information for research can feel like navigating a frustrating obstacle course.

Privacy laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the General Data Protection Regulation (GDPR) in the European Union are undeniably crucial and well-intentioned. They serve a vital purpose: safeguarding sensitive personal health information from unauthorized access and misuse, and protecting the confidentiality and security of individuals' medical records. However, while these regulations are essential for protecting privacy, they also make accessing and sharing health data for research and other legitimate purposes significantly more complex.

Synthetic Data: A Silver Bullet or a Faustian Bargain?

In response to the challenges of accessing real-world health data, synthetic health data has emerged as a potential alternative. This artificially generated data is designed to mimic the statistical properties of real data without containing identifiable patient information. The benefits are numerous. It can ease privacy concerns, enable broader data sharing among researchers, and facilitate the development and testing of algorithms without the need for complex data access agreements. This can significantly accelerate the pace of research in privacy-sensitive areas.

However, for all its usefulness, synthetic data is no replacement for reality. It lacks the messiness, the edge cases, the human quirks that make health data uniquely complex. Models that perform well on synthetic data often falter when exposed to real-world clinical environments.

But Why?

Synthetic Data Is Too 'Clean'

Real-world data is messy — full of missing values, inconsistent formats, typos, outliers, and contradictions. Synthetic data, by contrast, is usually generated using rules or models that follow distributions too neatly. So, models trained on it don’t learn how to handle chaos.
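To make the contrast concrete, here is a minimal sketch in Python. The weight values are invented purely for illustration (they do not come from any real or synthetic dataset), and `parses_cleanly` is a hypothetical helper. It only shows how a pipeline step that works on tidy values can quietly break on the kinds of entries that show up in real records.

```python
def parses_cleanly(value: str) -> bool:
    """Return True if a naive float() conversion accepts the raw value."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def clean_rate(values: list[str]) -> float:
    """Share of raw values that survive the naive conversion."""
    return sum(parses_cleanly(v) for v in values) / len(values)


# Invented example values: a tidy synthetic column vs. a messy real-world one.
synthetic_weights = ["70.5", "82.0", "64.3", "91.2"]
real_world_weights = ["70.5", "82,0", "", "154 lbs", "N/A", " 64.3 "]

synthetic_rate = clean_rate(synthetic_weights)
real_world_rate = clean_rate(real_world_weights)

print(f"synthetic: {synthetic_rate:.0%} parse cleanly")
print(f"real-world: {real_world_rate:.0%} parse cleanly")
```

Anything built and tested only against the first list never meets the decimal commas, units, blanks, and placeholder strings in the second.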

Lack of Rare Cases (Edge Cases)

In healthcare, those 1-in-1000 scenarios really matter — like a rare adverse drug reaction or an unusual combination of symptoms. Synthetic data often fails to include these rare but critical edge cases, making the model blind to them.
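A quick back-of-the-envelope calculation shows why rare cases go missing. This is a sketch under a simplifying assumption that records are independent and that each one has the same chance of containing the rare event. The function names and the 95% confidence level are my own choices for illustration.

```python
import math


def prob_no_rare_cases(rate: float, n_records: int) -> float:
    """Probability that n independent records contain zero rare events."""
    return (1.0 - rate) ** n_records


def records_needed(rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one rare event) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate = 1 / 1000  # the "1-in-1000" scenario from the text

p_missed = prob_no_rare_cases(rate, 1000)
n_for_95 = records_needed(rate, confidence=0.95)

print(f"Chance a 1,000-record dataset has no such case: {p_missed:.1%}")
print(f"Records needed to see at least one with 95% confidence: {n_for_95}")
```

Under these assumptions, a 1,000-record dataset has roughly a one-in-three chance of containing no example at all, and a model needs far more than one example to learn a pattern.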

Missing Human Behavior and Judgment

Real medical records reflect human quirks: how doctors phrase things, how patients describe pain, or even how data entry staff input information. Synthetic data can’t replicate those subtle human factors, even though they have a real influence on outcomes.

Bias in the Synthetic Generator

Synthetic data is only as good as the model that generates it. If that model was itself trained on biased or limited real data, the synthetic output will reflect, and possibly amplify, those same biases.
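The simplest possible generator makes the point. The sketch below is a toy, not how production synthetic-data tools work: it assumes a "generator" that just learns category frequencies from a sample and draws from them. The group labels and the 90/10 split are invented for illustration.

```python
import random
from collections import Counter


def fit_frequencies(values: list[str]) -> dict[str, float]:
    """Learn the share of each category in the training sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def sample_synthetic(frequencies: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n synthetic values from the learned frequencies."""
    rng = random.Random(seed)
    categories = list(frequencies)
    weights = [frequencies[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Invented, skewed "real" sample: group_b is heavily under-represented.
real_sample = ["group_a"] * 90 + ["group_b"] * 10

learned = fit_frequencies(real_sample)
synthetic = sample_synthetic(learned, n=10_000)
synthetic_share_b = synthetic.count("group_b") / len(synthetic)

print(f"group_b share in real sample: {learned['group_b']:.0%}")
print(f"group_b share in synthetic data: {synthetic_share_b:.1%}")
```

Generating more synthetic rows does not fix the skew. The generator faithfully reproduces whatever imbalance it was shown.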

The Real-World Challenges: An Example in Female Sexual Health Research

The challenges described above aren’t just theoretical. They are real, tangible roadblocks that emerged when I was working on data science models in sensitive areas like female sexual health, particularly uterine fibroids. This field is critical for millions of women, yet the data barriers can feel insurmountable at times.

The Struggle to Access Reliable Data

One of my biggest frustrations has been securing comprehensive datasets. When I reached out to sources like the WHO or healthdata.gov, I often encountered denials or bureaucratic delays. Some data exists, such as the Texas Department of State Health Services’ records on fibroid diagnoses or the Global Burden of Disease Study’s prevalence statistics, but it’s scattered across different sources. Each one requires separate permissions, and the process is slow and exhausting. I’ve spent weeks just trying to get access to what should be readily available for research that could improve women’s health outcomes.

The Problem of Poorly Curated Variables

Even when I finally get my hands on a dataset, I’ve found that many aren’t well-structured for my research. One time, I opened a dataset supposedly focused on women with fibroids, only to find a redundant "gender" column. If the data is exclusively about women, why include a gender field? It’s a sign that the dataset wasn’t designed with this research in mind. I need variables like age, race, family history, and menarche age, which are known risk factors, but instead I waste time cleaning irrelevant data. Worse, once the cleaning is done, the columns that remain are often not meaningful for the analysis.
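A quick audit before any modeling can surface these problems early. The sketch below is one way to do it in plain Python. The column names in `REQUIRED_COLUMNS` are my own hypothetical spellings of the risk factors mentioned above, and the three rows are made-up toy records, not real patient data.

```python
from typing import Any

# Hypothetical column names for the risk factors named in the text.
REQUIRED_COLUMNS = {"age", "race", "family_history", "menarche_age"}


def audit_columns(rows: list[dict[str, Any]], required: set[str]) -> dict[str, Any]:
    """Flag constant columns, absent required columns, and missing-value rates."""
    if not rows:
        raise ValueError("Cannot audit an empty dataset")

    columns: set[str] = set()
    for row in rows:
        columns.update(row.keys())

    # A column with a single distinct value carries no information.
    constant = sorted(
        col for col in columns if len({row.get(col) for row in rows}) == 1
    )
    # Treat None and empty strings as missing.
    missing_rate = {
        col: sum(row.get(col) in (None, "") for row in rows) / len(rows)
        for col in sorted(columns)
    }
    return {
        "constant_columns": constant,
        "missing_required": sorted(required - columns),
        "missing_rate": missing_rate,
    }


# Toy records invented for illustration only.
rows = [
    {"patient_id": 1, "gender": "F", "age": 34, "race": "Black", "fibroid_count": 2},
    {"patient_id": 2, "gender": "F", "age": 41, "race": None, "fibroid_count": 1},
    {"patient_id": 3, "gender": "F", "age": 29, "race": "White", "fibroid_count": 3},
]

report = audit_columns(rows, REQUIRED_COLUMNS)
print("Constant columns:", report["constant_columns"])
print("Missing required columns:", report["missing_required"])
print("Missing-value rate for race:", round(report["missing_rate"]["race"], 2))
```

On the toy rows, the audit flags "gender" as constant and reports that family history and menarche age are absent entirely, which is the kind of finding that is better to have before the cleaning starts.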

Fallback, or Perhaps Fall Back?

When real-world data is too hard to obtain, I’ve had to rely on synthetic datasets. But this isn’t a perfect solution. If the synthetic data isn’t carefully constructed—with deep knowledge of fibroid biology and clinical realities—the models I train on it perform well in theory but fail in practice. I’ve seen models that look promising in simulations but collapse when applied to actual patient cases. It’s disheartening, knowing that these limitations could delay meaningful advancements in women’s health.

Conclusion: Care Before Code

Unlocking the full potential of health data science starts with better data: not just more of it, but more meaningful, diverse, and representative variables. To get there, we need open, privacy-conscious data sharing and stronger collaboration between hospitals, researchers, policymakers, and technologists. Only through this collective effort can we move beyond the algorithm and build solutions that truly serve everyone.

I also want to acknowledge Elizabeth Waithera, a data scientist, who helped a great deal in shaping these ideas, especially in highlighting the need for collective responsibility and shared access in the health data ecosystem.
Have questions? Ping me. I don’t bite (unless you're a fraudulent transaction).