Opportunities of Covariate Data Imputation with Machine Learning for Pharmacometricians in R
Dominic Stefan Bräm (1), Uri Nahum (1), Andrew Atkinson (1), Gilbert Koch (1), and Marc Pfister (1)
(1) Pediatric Pharmacology and Pharmacometrics, University Children's Hospital Basel (UKBB), University of Basel, Basel, Switzerland
Introduction: Missing covariate information is a common challenge in clinical data [1]. Because missing data can reduce statistical power and bias results, correct handling of missing covariate data is critical in pharmacometric analyses. To this end, various statistical methods have been widely adopted [2,3] and fully or partially implemented in pharmacometrics [4]. Simple methods such as listwise deletion or mean imputation can easily be implemented by pharmacometricians themselves. More advanced approaches such as Norm or Predictive Mean Matching (PMM) imputation are provided through various R packages, for example the mice package [5]. Here, we introduce two Machine Learning (ML) methods available in R, Random Forest (RF) [6] and Artificial Neural Networks (ANN) [7], capable of imputing missing covariate data in a pharmacometric setting.
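The two simple methods named above can be sketched in a few lines. The study itself was carried out in R; the plain-Python sketch below only illustrates the logic, and the records, values and function names are hypothetical, not taken from the study.

```python
from statistics import mean

# Hypothetical records; None marks a missing birth weight (BW, kg).
records = [
    {"id": 1, "BW": 3.2, "BL": 50.0},
    {"id": 2, "BW": None, "BL": 47.0},
    {"id": 3, "BW": 2.8, "BL": 48.0},
    {"id": 4, "BW": None, "BL": 52.0},
    {"id": 5, "BW": 3.6, "BL": 51.0},
]


def listwise_deletion(rows, key):
    """Drop every record whose value for `key` is missing."""
    return [dict(r) for r in rows if r[key] is not None]


def mean_imputation(rows, key):
    """Replace each missing value of `key` by the mean of the observed values."""
    observed_mean = mean(r[key] for r in rows if r[key] is not None)
    return [
        dict(r, **{key: observed_mean if r[key] is None else r[key]})
        for r in rows
    ]


complete_cases = listwise_deletion(records, "BW")  # keeps subjects 1, 3 and 5
mean_imputed = mean_imputation(records, "BW")      # keeps all five subjects
```

Listwise deletion shrinks the data set, while mean imputation keeps every subject but assigns the same BW to all imputed subjects, regardless of their other covariates.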
Objectives: In this evaluation study, we compare RF and ANN with four standard statistical methods: listwise deletion, mean imputation, Norm imputation and PMM imputation.
Methods:
We performed an evaluation study in R to compare six methods for handling missing covariate data in the context of population PK analyses. To this end, a reference covariate PK data set was simulated, comprising four covariates (birth weight (BW), body length (BL), gestational age (GA) and sex) and PK concentration-time data for a generic drug. The covariates were generated so that the relationship between BW and BL was either linear or non-linear. PK concentration-time data were simulated with an IV bolus one-compartment model. Matching the relationship between BW and BL, a linear or a non-linear covariate model relating BW to the volume of distribution V was applied. To evaluate and compare the performance of the imputation methods, an IV bolus one-compartment model was fitted to the generated reference covariate PK data set without missing values using the R interface to Monolix [8]. The estimated model parameters (population volume of distribution Vpop, population clearance Clpop and the covariate effect) served as reference parameters for comparison with the parameters estimated from the population PK data sets after covariate imputation. Missingness was introduced into the covariate PK data set by removing 20% of the values of the baseline variable BW under the missing at random (MAR) assumption.
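The simulation and missingness steps described above can be illustrated with a minimal sketch. The study used R and Monolix; this plain-Python version is only an illustration under stated assumptions. All numeric values (dose, typical BW, Vpop, clearance, variability), the linear BW-BL and BW-V relationships as written here, and the choice of BL as the observed variable driving missingness are hypothetical.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def conc_iv_bolus(dose, v, cl, t):
    """One-compartment IV bolus: C(t) = dose / V * exp(-(CL / V) * t)."""
    return dose / v * math.exp(-cl / v * t)


def simulate_subject():
    """Draw covariates and an individual V (all values are assumptions)."""
    bl = rng.gauss(50.0, 2.5)                              # body length, cm
    bw = 3.3 + 0.15 * (bl - 50.0) + rng.gauss(0.0, 0.25)   # linear in BL, kg
    v = 1.0 * (bw / 3.3)                                   # linear BW effect on V, L
    return {"BL": bl, "BW": bw, "V": v, "CL": 0.2}


def make_mar(subjects, fraction=0.2, slope=0.5):
    """Set BW to None for `fraction` of the subjects.

    The selection weight depends only on the fully observed BL, never on BW
    itself, which is what makes the mechanism missing at random (MAR).
    """
    mean_bl = sum(s["BL"] for s in subjects) / len(subjects)
    clocks = []
    for i, s in enumerate(subjects):
        weight = math.exp(slope * (s["BL"] - mean_bl))
        # Exponential waiting times: the smallest ones give a weighted
        # sample without replacement.
        clocks.append((rng.expovariate(weight), i))
    n_missing = round(fraction * len(subjects))
    drop = {i for _, i in sorted(clocks)[:n_missing]}
    return [
        dict(s, BW=None) if i in drop else dict(s)
        for i, s in enumerate(subjects)
    ]


subjects = [simulate_subject() for _ in range(100)]
with_missing = make_mar(subjects)

# Concentration-time profile of the first subject after a hypothetical dose.
times = [0.5, 1.0, 2.0, 4.0, 8.0]
profile = [
    conc_iv_bolus(dose=10.0, v=subjects[0]["V"], cl=subjects[0]["CL"], t=t)
    for t in times
]
```

Because the weights increase with BL, longer subjects are more likely to lose their BW value, so the observed BW values are no longer a random sample of all BW values.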
The six methods were used to handle the missing covariate information and the PK model parameters were re-estimated for each method. This procedure was repeated 50 times with different reference covariate PK data sets to distinguish between stochastic and systematic error in the subsequent results. To evaluate the performance of a method, the coverage rate (CR) describing the proportion of confidence intervals that contain the reference parameters was calculated and compared to the desired proportion of at least 95%. Additionally, the biases between the reference PK model parameter estimates and the estimates after covariate imputation were calculated and it was investigated whether they were distributed around zero.
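The two performance measures can be sketched as follows. This is a plain-Python illustration, not the study's R code. The replicate values are invented, the bias is assumed to be the simple difference between the estimate and the reference, and the confidence intervals are taken as given inputs because the abstract does not state how they were constructed.

```python
from statistics import mean


def coverage_rate(references, intervals):
    """Proportion of (lower, upper) intervals that contain their reference value."""
    hits = sum(lo <= ref <= hi for ref, (lo, hi) in zip(references, intervals))
    return hits / len(intervals)


def biases(references, estimates):
    """Estimate minus reference, one value per replicate."""
    return [est - ref for ref, est in zip(references, estimates)]


# Hypothetical Vpop results for four replicates (the study used 50).
reference_vpop = [1.00, 1.02, 0.98, 1.01]   # fitted without missing values
imputed_vpop = [1.03, 1.00, 0.99, 1.10]     # fitted after covariate imputation
imputed_ci = [(0.95, 1.10), (0.94, 1.06), (0.93, 1.05), (1.03, 1.17)]

cr = coverage_rate(reference_vpop, imputed_ci)          # 3 of 4 intervals cover
mean_bias = mean(biases(reference_vpop, imputed_vpop))
```

Each replicate is compared with its own reference estimate, since every repetition starts from a different reference data set.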
Results:
For the data in which there is a linear relationship between BL and BW and between BW and V under the MAR assumption for missingness in BW, the population estimates for the volume of distribution Vpop were biased with listwise deletion and mean imputation. Coverage rates for these two methods decreased to 82% and 24%, respectively. The other established imputation methods Norm and PMM imputation and the ML methods RF and ANN provided unbiased results with coverage rates above 95%.
For the data in which there is a non-linear relationship between BL and BW and between BW and V with MAR conditions, mean imputation provided strongly biased results for Vpop with a coverage rate of 2%. The multiple imputation method Norm showed biased results with coverage rates of 42% and 84% for and Vpop, respectively. All other methods were slightly biased with coverage rates of 90% or higher.
Conclusions:
This evaluation study demonstrates that the ML-based methods RF and ANN available in R appropriately impute missing covariate data, leading to results broadly similar to those of established statistical methods. ML-based approaches have potential for enhanced performance through optimized parameter tuning, and offer increased flexibility when encountering more complex non-linear relationships.
References:
[1] Ibrahim, J. G., Chu, H. & Chen, M. H. Missing data in clinical studies: Issues and methods. J. Clin. Oncol. (2012). doi:10.1200/JCO.2011.38.7589
[2] Little, R. J. A. & Rubin, D. B. Statistical analysis with missing data. Stat. Anal. with Missing Data (2019). doi:10.1002/9781119482260
[3] Ette, E. I., Chu, H. M. & Ahmad, A. Data Imputation. In Pharmacometrics Sci. Quant. Pharmacol. (2006). doi:10.1002/9780470087978.ch9
[4] Johansson, Å. M. & Karlsson, M. O. Comparison of methods for handling missing covariate data. AAPS J. (2013). doi:10.1208/s12248-013-9526-y
[5] Buuren, S. van & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. (2011). doi:10.18637/jss.v045.i03
[6] Stekhoven, D. J. & Buehlmann, P. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
[7] Falbel, D. & Luraschi, J. torch: Tensors and Neural Networks with ‘GPU’ Acceleration. (2021). Available at https://cran.r-project.org/package=torch
[8] LIXOFT. lixoftConnectors: R connectors for Lixoft Suite (@Lixoft). (2019).