Generating Synthetic Healthcare Data

Introduction

Privacy, security, and usage issues can impede access to clinical health data, which are critical to innovators, researchers, and product developers. Researchers, health information technology (health IT) developers, and informaticians often depend on anonymized data to test theories, data models, algorithms, or prototype innovations, but the risk of re-identification is high. Our team develops synthetic health data using Synthea™, an open-source, publicly available patient data generation engine that produces realistic, but not real, health data for fictitious patients. Synthea-generated synthetic health data are free to use and do not carry the privacy concerns, security restrictions, or usage issues associated with anonymized health data sets.

In 2019, we were awarded a contract from the Office of the National Coordinator for Health Information Technology (ONC) to extend the capabilities of Synthea and to increase the number and diversity of Synthea-generated synthetic health records available for patient-centered outcomes research (PCOR) use cases.

Project Results and Achievements

We developed five new Synthea modules to generate data representing patients with complex care needs, patients with opioid use, and pediatric populations. We chose these use cases because they are associated with a higher likelihood of re-identification, with real health data that are typically more difficult to access, and with additional privacy considerations that may not affect real clinical health data from other use cases. The resulting modules and companion guides enable synthetic health data generation for the following scenarios: Prescribing Opioids for Chronic Pain and Treatment of Opioid Use Disorder, Cerebral Palsy, Sepsis, Spina Bifida, and Acute Myeloid Leukemia.
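
For readers unfamiliar with how a module drives data generation, the sketch below walks one fictitious patient through a toy state machine. It is only loosely shaped like a Synthea module file (named states connected by direct or probability-weighted transitions). The module contents, state names, probabilities, and the helper functions `next_state` and `walk_module` are all hypothetical and are not taken from any of the five project modules. Real module states carry many more fields, such as clinical codes.

```python
import random

# Hypothetical toy module, loosely shaped like a Synthea module JSON file.
# State names and probabilities are illustrative assumptions only.
TOY_MODULE = {
    "name": "Toy Example",
    "states": {
        "Initial": {"type": "Initial", "direct_transition": "Onset"},
        "Onset": {
            "type": "ConditionOnset",
            "distributed_transition": [
                {"distribution": 0.7, "transition": "Treatment"},
                {"distribution": 0.3, "transition": "Terminal"},
            ],
        },
        "Treatment": {"type": "MedicationOrder", "direct_transition": "Terminal"},
        "Terminal": {"type": "Terminal"},
    },
}


def next_state(state, rng):
    """Pick the next state name from a direct or distributed transition."""
    if "direct_transition" in state:
        return state["direct_transition"]
    roll = rng.random()
    cumulative = 0.0
    for option in state["distributed_transition"]:
        cumulative += option["distribution"]
        if roll < cumulative:
            return option["transition"]
    # Guard against floating-point shortfall in the cumulative sum.
    return state["distributed_transition"][-1]["transition"]


def walk_module(module, seed):
    """Return the ordered state names visited by one fictitious patient."""
    rng = random.Random(seed)  # seeded so a given patient is reproducible
    states = module["states"]
    current = "Initial"
    path = [current]
    while states[current]["type"] != "Terminal":
        current = next_state(states[current], rng)
        path.append(current)
    return path


if __name__ == "__main__":
    print(walk_module(TOY_MODULE, seed=42))
```

Seeding the random number generator per patient keeps each generated path reproducible, which makes module behavior easier to review and test.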

The project also included a demonstration study to assess the utility and validity of synthetic health data for research and hypothesis testing. The Acute Myeloid Leukemia module was designed to replicate a simulation comparing levofloxacin prophylaxis to usual care for leukemia patients undergoing chemotherapy. The study compared data generation with Synthea to platforms designed for data simulation. It demonstrated that the Synthea data generation platform can be used, with modifications, for some types of simulation studies.

Major Activities

  • Convened a multidisciplinary panel of nationally recognized experts;
  • Generated synthetic data sets for testing and validation of Synthea modules and data output;
  • Developed five new modules and companion guides with technical information for developers and implementers;
  • Conducted three national webinars to educate the community about Synthea and its potential uses;
  • Conducted a challenge competition to engage a broad community of researchers, developers, and innovators to validate the realism and demonstrate potential uses of Synthea-generated synthetic health data;
  • Conducted a demonstration study to evaluate the scope and utility of Synthea for simulation research and summarized results in a scientific paper which is pending publication; and
  • Documented project findings and considerations for advancing Synthea for ONC publication.
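
As one illustration of what a basic check on generated output might look like, the sketch below counts the resource types in a single patient record and reports any expected types that are absent. It assumes the record is a FHIR-style JSON Bundle, which is an assumption here because this summary does not state the output format used in the project. The sample bundle is hand-written, and the helper names `count_resource_types` and `missing_resource_types` are hypothetical.

```python
import json
from collections import Counter


def count_resource_types(bundle):
    """Count resources in a FHIR-style Bundle dict by their resourceType."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


def missing_resource_types(bundle, required):
    """Return the required resource types that do not appear in the bundle."""
    counts = count_resource_types(bundle)
    return sorted(set(required) - set(counts))


# Hypothetical, hand-written stand-in for one exported patient record.
SAMPLE_BUNDLE = json.loads("""
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "example-1"}},
    {"resource": {"resourceType": "Encounter", "id": "example-2"}},
    {"resource": {"resourceType": "Condition", "id": "example-3"}},
    {"resource": {"resourceType": "Encounter", "id": "example-4"}}
  ]
}
""")

if __name__ == "__main__":
    print(count_resource_types(SAMPLE_BUNDLE))
    print(missing_resource_types(SAMPLE_BUNDLE, ["Patient", "MedicationRequest"]))
```

A structural check like this only confirms that expected record types are present. Judging clinical realism still requires expert review and comparison against reference statistics.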

Tools and Methods Used

  • Synthea™ Patient Generator
  • GitHub
  • Data analysis
  • Clinical data modeling
  • Testing and validation
  • Methodology development

Presentations and Publications

  • Garcia, S., Thompson, C., Hellewell, J., Nguyen, V., Kannampallil, T. (2021, March 30). A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research (PCOR) [Conference Presentation]. 2021 ONC Annual Meeting, Washington, DC.
  • Garcia, S., Thompson, C., Hellewell, J., Nguyen, V., Kannampallil, T. (2020, November 15). Generating Synthetic Health Data to Accelerate Patient-Centered Outcomes Research (PCOR) and Health Information Technology [Conference Presentation]. AMIA 2020 Virtual Annual Symposium.
  • Garcia, S., Thompson, C., Meeker, D. (2022, March 22). Generating Synthetic Health Data to Accelerate Patient-Centered Outcomes Research (PCOR) and the Evaluation of an Open-Source Synthetic Data Platform for Simulation Studies [Conference Presentation]. AMIA 2022 Informatics Summit, Chicago, IL.
  • Meeker, D., Kallem, C., Heras, Y., Garcia, S., Thompson, C. (2022). Case Report: Evaluation of an Open-Source Synthetic Data Platform for Simulation Studies. [Manuscript submitted for publication].
  • Office of the National Coordinator for Health Information Technology. (2022). Synthetic Health Data Generation to Accelerate Patient-Centered Outcomes Research (PCOR) Final Report. Washington, DC. Available online at: https://www.healthit.gov/topic/scientific-initiatives/pcor/synthetic-health-data-generation-accelerate-patient-centered-outcomes