mlcov: New Machine Learning Based R package for Covariate Selection
Ibtissem Rebai (1), Vincent Duval (1), Ayman Akil (1), James Craig (1), Mike Talley (1), Anna Largajolli* (1), Floris Fauchet* (1)
(1) Certara, Princeton, NJ, USA; * Contributed equally.
Introduction
In recent years, several alternatives, including machine learning (ML) algorithms [2] [3], have been proposed to address the weaknesses of stepwise covariate modeling (SCM) [1], the most widely used approach to covariate selection. We previously evaluated [4] the performance of the Boruta algorithm [5], run with XGBoost, in combination with Lasso regularized regression [6] as a new framework for covariate selection. This evaluation led to a workflow that is now implemented in the mlcov R package (https://github.com/certara/mlcov), which is available to the pharmacometrics community.
Methods
The methodology implemented in the mlcov R package consists of four key steps. First, the dataset, comprising empirical Bayes estimates (EBEs) of the individual parameters and the covariate sets, is randomly split into five folds (step 1, data splitting). Second, covariate selection (step 2) is performed by applying the Lasso algorithm to remove irrelevant covariates and covariates made redundant by correlation, followed by the Boruta algorithm, which iteratively identifies relevant covariates based on their importance scores. Third, a voting mechanism across folds (step 3) determines the final set of selected covariates based on how robustly each one is selected. These first three steps are performed by a single call to one mlcov function (MLCovSearch). Finally, residual plots (step 4) are used to evaluate the covariate-parameter relationships. After covariate selection, an XGBoost model is trained on the selected covariates, and any remaining trends between the residuals (the differences between the actual target values and the model's predicted values) and the unselected covariates are examined. The primary goal is to ensure that the ML method did not overlook any significant trend or relationship that an additional covariate could capture. This step is implemented in a separate function (generate_residualplots).
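The fold splitting, voting, and residual check described above can be sketched in a few lines. The sketch below is written in Python with the standard library only and is not the mlcov implementation: the function names, the vote threshold of three out of five folds, the per-fold selections, and the use of a Pearson correlation as a numerical stand-in for visual inspection of residual plots are all assumptions made for illustration. The Lasso and Boruta steps are not reproduced; their per-fold output is simply taken as given.

```python
import random
from collections import Counter


def split_into_folds(subject_ids, n_folds=5, seed=123):
    """Randomly partition subject IDs into n_folds roughly equal folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def vote(selected_per_fold, min_votes):
    """Keep covariates selected in at least min_votes folds (assumed rule)."""
    counts = Counter(cov for fold in selected_per_fold for cov in set(fold))
    return sorted(cov for cov, n in counts.items() if n >= min_votes)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residual_trends(observed, predicted, unselected):
    """Correlate model residuals with each unselected covariate."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return {name: pearson(values, residuals)
            for name, values in unselected.items()}


# Step 1: split 100 hypothetical subjects into five folds.
folds = split_into_folds(range(1, 101))

# Step 3: invented per-fold selections (not results from the abstract).
selected_per_fold = [
    {"WGT", "ALB", "CRCL"},
    {"WGT", "ALB", "CRCL", "RACE"},
    {"WGT", "CRCL", "RACE"},
    {"WGT", "ALB", "RACE", "SEX"},
    {"WGT", "ALB", "CRCL"},
]
final = vote(selected_per_fold, min_votes=3)
print(final)  # ['ALB', 'CRCL', 'RACE', 'WGT']

# Step 4: invented observed/predicted EBEs and one unselected covariate.
trends = residual_trends(
    observed=[10, 12, 9, 15, 11],
    predicted=[9, 11, 10, 13, 12],
    unselected={"AGE": [30, 40, 50, 60, 70]},
)
print(trends)
```

A correlation close to zero for an unselected covariate would correspond to a residual plot with no visible trend. A single number cannot detect nonlinear patterns, which is one reason plots are the more informative diagnostic.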
This framework was evaluated on several real-world data examples alongside the traditional SCM approach. Multivariate forest plots were used to show the clinical implications of the covariates selected by each technique, providing a visual representation of their impact and significance within the population PK model. In one example, covariate effects were tested on clearance (CL/F), volume of distribution (V/F), and absorption rate constant (Ka). The same candidate covariates were tested with SCM and mlcov: (1) weight (WGT), albumin (ALB), creatinine clearance (CRCL), sex, race and ethnicity for CL/F; (2) WGT, ALB, sex, race and ethnicity for V/F; and (3) age, formulation (FORM) and device for Ka. SCM started from the base population PK model, whereas mlcov used the EBEs.
Results
The following covariates were selected in the covariate model: (1) WGT, ALB, CRCL, sex, race, and ethnicity on CL/F using SCM, and WGT, ALB, CRCL and race using mlcov; (2) WGT and ALB on V/F using both SCM and mlcov; and (3) a device effect on Ka using SCM, and no covariate using mlcov. SCM took around 13 hours to run on average, compared with less than 5 minutes for mlcov.
In this real-world study, SCM identified ethnicity and sex for CL/F while mlcov did not, likely because of their correlations with race and body weight, respectively. For Ka, SCM identified device while mlcov identified no covariate, and the residual plots showed no trends. Notably, device did not have a significant impact on the extent of absorption, as shown in the bioequivalence study.
Conclusion
These results show that covariate selection can be made efficient and user-friendly by using the ML framework implemented in the mlcov package. With calls to just two functions, the user obtains a full set of results. In conclusion, the mlcov package represents a significant advance in covariate selection methodology, combining efficiency and accuracy with substantial time savings in pharmacometrics research.
References:
[1] Jonsson E, Karlsson M (1998) Automated covariate model building with NONMEM. Pharm Res 15(9):1463–1468
[2] Emeric Sibieude et al. “Fast screening of covariates in population models empowered by machine learning”. In: Journal of Pharmacokinetics and Pharmacodynamics 48.4 (2021), pp. 597–609.
[3] Chiara Nicolo et al. “Machine learning and mechanistic modeling for prediction of metastatic relapse in early-stage breast cancer”. In: JCO clinical cancer informatics 4 (2020), pp. 259–274.
[4] Rebai, I., Duval, V., Akil, A., Teusher, N., Largajolli, A. and Fauchet, F. Evaluation of the Boruta Machine Learning Algorithm for Covariate Selection. 31st PAGE meeting 2023, A Coruna, Spain, June 2023.
[5] Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1–13.
[6] Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33(1):1-22