2023 - A Coruña - Spain

PAGE 2023: Methodology – AI/Machine Learning
Servane Lunven

Generative Models for Synthetic Data Generation: Application to PKPD data

Yulun Jiang (1), Alberto Garcia Duran (2), Pascal Girard (2), Servane Lunven (3), Federico Amato (4), Idris Bachali Losada (2), Nadia Terranova (2)*

(1) School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; (2) Merck Institute for Pharmacometrics, Ares Trading S.A. (an affiliate of Merck KGaA, Darmstadt, Germany), Lausanne, Switzerland; (3) School of Basic Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; (4) Swiss Data Science Center, EPFL Lausanne and ETH Zurich, Switzerland

Objectives: 

The generation of synthetic (artificial) patient data that possess the statistical properties of the original data plays a fundamental role in today’s world for its potential to:

  • Democratize access to data for statistical and/or research purposes.
  • Enlarge the available data. Most importantly, the synthetic data may enrich the real data in low-density regions (i.e., patients with under-represented characteristics).

Generative methods constitute a family of solutions to generate synthetic data. The objectives of this work are to (i) benchmark a number of state-of-the-art generative methods from the deep learning (DL) literature on a wide variety of clinical datasets composed of patient covariates and several PK (pharmacokinetics) and PD (pharmacodynamics) endpoints; and (ii) quantify the quality of the synthesized data not only with metrics from the Machine Learning (ML) community, but also with univariate statistical tests and metrics based on standard pharmacometrics (PMX) analysis tools.

Methods: 

The core idea of generative methods is to learn a model that approximates the distribution of a dataset. The learned model can then be used to generate synthetic data. The selected methods have to be appropriate for the specificities of clinical data:

  • Longitudinal data are collected over time.
  • Heterogeneous information: presence of constant and time-dependent measures (covariates, endpoints…).
  • Multiple treatment arms: subjects may be grouped by treatment arms, each characterized by a drug, a treatment regimen, and sampling time points.
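As a rough illustration of these three characteristics, the sketch below holds one subject's constant covariates, treatment arm and time-dependent observations, then flattens them into long-format rows. All names (`Subject`, `Observation`, `to_long_rows`, `weight_kg`) and all values are hypothetical and are not taken from the datasets used in this work.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One time-dependent measurement (hypothetical layout)."""
    time_h: float   # sampling time, in hours
    endpoint: str   # e.g. "PK" or "PD"
    value: float


@dataclass
class Subject:
    """One subject: constant covariates plus longitudinal observations."""
    subject_id: int
    arm: str                      # treatment arm label
    covariates: dict              # constant measures, e.g. body weight
    observations: list = field(default_factory=list)


def to_long_rows(subject):
    """Flatten a subject into one row per observation (long format)."""
    return [
        {
            "id": subject.subject_id,
            "arm": subject.arm,
            **subject.covariates,
            "time_h": obs.time_h,
            "endpoint": obs.endpoint,
            "value": obs.value,
        }
        for obs in subject.observations
    ]


# Illustrative, made-up values only.
example = Subject(subject_id=1, arm="10 mg", covariates={"weight_kg": 70.0})
example.observations.append(Observation(time_h=0.5, endpoint="PK", value=1.2))
example.observations.append(Observation(time_h=2.0, endpoint="PD", value=0.8))
rows = to_long_rows(example)
```

Each row repeats the constant covariates and the arm, so a generative method has to keep them consistent across all time points of the same subject.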

After a thorough literature review, we selected three DL methods that are able to deal with the above-mentioned characteristics of the data: the Probabilistic AutoRegressive (PAR) model [1], TimeGAN [2], and Conditional Generative Adversarial Networks (cGAN) [3]. PAR and TimeGAN are built upon Recurrent Neural Networks (RNNs). Unlike PAR and TimeGAN, cGAN enables the conditional generation of data based on the treatment arm, which greatly reduces the complexity of the learning task.
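Conditioning on the treatment arm can be pictured with a minimal sketch of the generic cGAN input construction described in [3], where the condition is appended to the noise vector fed to the generator. This is not the architecture used in this work, which the abstract does not detail; the function names, arm labels and noise dimension are assumptions made for illustration.

```python
import random


def one_hot(arm, arms):
    """Encode a treatment arm as a one-hot vector over the known arms."""
    if arm not in arms:
        raise ValueError(f"unknown arm: {arm!r}")
    return [1.0 if a == arm else 0.0 for a in arms]


def generator_input(arm, arms, noise_dim, rng):
    """Build a conditional generator input: Gaussian noise followed by the arm code."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(arm, arms)


# Hypothetical arm labels, for illustration only.
ARMS = ["placebo", "10 mg", "50 mg"]
z = generator_input("10 mg", ARMS, noise_dim=4, rng=random.Random(0))
```

Because the arm code is an explicit input, a single generator can be asked for samples from any arm at inference time instead of having to learn the arm structure implicitly.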

The selected methods are evaluated on a wide variety of datasets. We leverage readily available models in Simulx to generate 12 “real” simulated datasets, including covariates, PK, PD and response profiles. These datasets have different degrees of complexity, measured in terms of the number of arms, the number of covariates and the number of PK/PD responses. The properties of the synthetic data are quantified through several metrics: i) well-known proxy ML measures of fidelity (discriminative score [2]) and usefulness (predictive score [2]); ii) univariate statistical tests (namely, the Kolmogorov-Smirnov and Chi-Square two-sample tests); and iii) PMX-related evaluation metrics. The latter compare individual and population PK/PD parameters estimated on real and synthetic data. We also compare the running time (i.e., training and inference time) of the methods.
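The two univariate statistics can be sketched in plain Python as follows. This is a minimal illustration of the test statistics only (no p-values), not the implementation used in this work; the function names are assumptions, and in practice a library routine such as SciPy's `scipy.stats.ks_2samp` would typically be preferred.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of real values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of synthetic values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def chi_square_statistic(real, synthetic):
    """Chi-square statistic of the 2 x k table of category counts (real vs synthetic)."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    counts = [Counter(real), Counter(synthetic)]
    row_totals = [len(real), len(synthetic)]
    total = sum(row_totals)
    stat = 0.0
    for category in set(real) | set(synthetic):
        col_total = counts[0][category] + counts[1][category]
        for row in range(2):
            expected = row_totals[row] * col_total / total
            stat += (counts[row][category] - expected) ** 2 / expected
    return stat
```

A value of 0 for either statistic means the real and synthetic samples have identical empirical distributions for that variable; larger values indicate a larger mismatch.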

Results: 

The experiments show that, overall, cGAN performs best in terms of univariate statistical tests and PMX-related metrics:

  • The Kolmogorov-Smirnov two-sample test applied to the PK/PD response profiles reveals cGAN as the best-performing technique in 11 out of the 12 datasets.
  • The PMX-related metrics show that cGAN is the best technique in 7 out of the 12 datasets.

These findings match our visual assessment of the synthetically generated data. cGAN is also the most efficient method in terms of running time (~1.2x and ~3x speed-up w.r.t. PAR and TimeGAN, respectively), taking less than 1 hour on average. On the other hand, the discriminative and predictive scores do not show a clear winner: no single technique outperforms the other two in more than 50% of the datasets. We hypothesize that this may be because these scores are positively biased towards RNN-based models (i.e., PAR and TimeGAN).

Conclusions: 

We evaluate three state-of-the-art deep generative techniques on simulated datasets from 12 PK and PK/PD models. Performance is measured with metrics that focus on different statistical aspects of the data. We contribute to the evaluation with metrics that rely on PMX analysis tools. The results show that cGAN exhibits the best overall performance for most of the metrics. We expect this work to lay the foundation for the adoption of these techniques to facilitate the sharing of PKPD data across institutions and publications.



References:
[1] Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The Synthetic Data Vault. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 399-410.
[2] Yoon, J., Jarrett, D., & van der Schaar, M. (2019). Time-series Generative Adversarial Networks. Neural Information Processing Systems.
[3] Mirza, M., & Osindero, S. (2014). Conditional Generative Adversarial Nets. ArXiv, abs/1411.1784.


Reference: PAGE 31 (2023) Abstr 10365 [www.page-meeting.org/?abstract=10365]
Poster: Methodology – AI/Machine Learning