Comparison of methods for handling missing covariate data
Åsa M. Johansson, Mats O. Karlsson
Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.
Background: Inclusion of important covariates in non-linear mixed effects modelling will reduce the unexplained inter-individual variability and improve the predictability of the model. Missing covariate data is a common problem, and the method chosen for handling missing data can be crucial for the outcome of the study. Missing data can be divided into three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [1]. For MCAR, the missingness does not depend on any observed or unobserved data; for MAR, the missingness depends on observed data but not on unobserved data; and for MNAR, the missingness depends on the unobserved missing data itself. The underlying mechanism of the missingness is usually unknown, but wrong assumptions about it will affect the predictability of the model.
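One common way to write the three mechanisms formally is sketched below. The symbols are introduced here for illustration and do not appear in the original text: R is the missingness indicator, X_obs the observed data and X_mis the missing data.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R \mid X_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) \text{ depends on } X_{\text{mis}}
\end{align*}
```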
Objective: The aim of this study was to implement and compare different methods for handling missing categorical covariate data under different mechanisms of missingness.
Methods: A simulation model was set up to generate data for 200 individuals, of whom 60% were assigned as male and 40% as female. Weights (WT) were simulated according to truncated lognormal distributions where the sex-specific means and variances had been estimated from a large dataset with 1022 males and 423 females. The PK model was a constant infusion model with a 100% difference in CL between males and females (estimated as two fixed effects), an inter-individual variability of 30% and a residual variability of 20%. Three different types of missingness were simulated: MCAR, MAR and MNAR. For each type of missingness, 50% of the individuals were assumed to lack the covariate sex. For MCAR, all individuals had the same probability of missing sex information; for MAR, the underlying mechanism gave a higher missing probability with increasing weight; and for MNAR, the underlying mechanism gave a three times higher missing probability for males than for females. Different methods for handling missing covariates were compared: multiple imputation (MI), modelling with $MIX based on observed WT (MOD) [2] and modelling with $MIX based on observed WT with estimation of an additional fixed effect (EST). For comparison purposes, estimation with all data (ALL) was carried out, as was estimation with simpler imputation algorithms, but the relatively poor performance of the latter is not reported below. Implementation of MI, MOD and EST required additional estimation steps, and MI also required additional simulations. The function p(male|WT) was estimated prior to estimation with MOD and EST, either as a logit-transformed linear regression equation (an incorrect regression function) or as a more appropriate probability density function based on estimated lognormal WT distributions for males and females, respectively. MI was a modified version of the method described by Wu and Wu [3].
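The simulation set-up described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The lognormal weight parameters, the truncation limits and the logistic form of the MAR mechanism are hypothetical placeholders, since the abstract does not report them. The MNAR probabilities follow from the stated constraints (males three times more likely to be missing, 50% missing overall). Writing p(male|WT) via Bayes' rule from the two lognormal densities is one reading of the "more appropriate" function, and it ignores the truncation for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P_MALE = 200, 0.6

# HYPOTHETICAL log-scale (mean, sd) of weight per sex (1 = male, 0 = female)
# and HYPOTHETICAL truncation limits in kg; the abstract does not report these.
WT_PARAMS = {1: (np.log(80.0), 0.15), 0: (np.log(65.0), 0.18)}
WT_BOUNDS = (40.0, 150.0)

def simulate_population():
    """Assign sex (exactly 60% male) and draw truncated lognormal weights."""
    n_male = int(round(P_MALE * N))
    sex = rng.permutation(np.repeat([1, 0], [n_male, N - n_male]))
    wt = np.empty(N)
    for i, s in enumerate(sex):
        mu, sd = WT_PARAMS[int(s)]
        w = rng.lognormal(mu, sd)
        while not WT_BOUNDS[0] <= w <= WT_BOUNDS[1]:  # resample outside bounds
            w = rng.lognormal(mu, sd)
        wt[i] = w
    return sex, wt

def missing_probability(mechanism, sex, wt):
    """Per-individual probability that sex is missing (about 50% overall)."""
    if mechanism == "MCAR":
        return np.full(sex.size, 0.5)
    if mechanism == "MAR":
        # ASSUMED logistic increase with weight, centred on the median weight.
        z = (wt - np.median(wt)) / wt.std()
        return 1.0 / (1.0 + np.exp(-2.0 * z))
    if mechanism == "MNAR":
        # Solve 0.6 * 3p + 0.4 * p = 0.5 for the female probability p.
        p_female = 0.5 / (3.0 * P_MALE + (1.0 - P_MALE))
        return np.where(sex == 1, 3.0 * p_female, p_female)
    raise ValueError(f"unknown mechanism: {mechanism}")

def lognormal_pdf(x, mu, sd):
    """Density of a lognormal distribution with log-scale mean mu and sd."""
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sd**2)) / (x * sd * np.sqrt(2 * np.pi))

def p_male_given_wt(wt):
    """Bayes' rule on the two (untruncated) lognormal weight densities."""
    f_male = P_MALE * lognormal_pdf(wt, *WT_PARAMS[1])
    f_female = (1.0 - P_MALE) * lognormal_pdf(wt, *WT_PARAMS[0])
    return f_male / (f_male + f_female)

sex, wt = simulate_population()
# One boolean mask per mechanism: True where the sex covariate is missing.
masks = {m: rng.random(N) < missing_probability(m, sex, wt)
         for m in ("MCAR", "MAR", "MNAR")}
```

In this sketch, each mask in `masks` marks the individuals whose sex would be treated as unknown in the subsequent estimation step.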
The covariate imputation was preceded by an estimation of the PK model without inclusion of any covariates, followed by an estimation of the probability curve for p(male|WT, EBE), where EBE is the Empirical Bayes Estimate obtained from the base model; the two WT functions described above were explored in parallel. Prior to the estimations with MI, simulations were carried out to impute the missing sex values based on the probability functions and the individual WT and EBE values. The imputation step followed by estimation was repeated six times, after which the mean value of each parameter was calculated. A Stochastic Simulation and Estimation (SSE) analysis was used to compare the methods: 200 datasets were simulated and the methods were compared with respect to bias and precision of the parameter estimates. The OFVs obtained with MOD and EST were compared, and a significant drop in OFV was taken as an indication of data being MNAR and/or a probability equation with poor predictability. In those cases where a significantly lower OFV was obtained with EST, the parameter estimates obtained with this method were used for calculation of bias and precision; otherwise, the estimates obtained with MOD were used. The root mean squared error (RMSE) of the θ estimates was evaluated for each parameter and each method and was expressed as a percentage of the RMSE obtained with ALL (rRMSE).
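The summary statistics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The function names are hypothetical, rRMSE is read here as 100 times the ratio of a method's RMSE to the RMSE obtained with ALL, and the OFV threshold of 3.84 (chi-square, 1 degree of freedom, p = 0.05) is an assumption, since the abstract does not state the significance level.

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean squared error of estimates around the true parameter value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))

def rrmse(method_estimates, all_estimates, true_value):
    """RMSE of a method as a percentage of the RMSE obtained with ALL."""
    return 100.0 * rmse(method_estimates, true_value) / rmse(all_estimates, true_value)

def pool_imputations(estimates_per_imputation):
    """Mean of each parameter across imputations (rows = imputations)."""
    return np.mean(np.asarray(estimates_per_imputation, dtype=float), axis=0)

def choose_mod_or_est(ofv_mod, ofv_est, mod_params, est_params, threshold=3.84):
    """Use EST only when its OFV is lower than MOD's by more than the threshold.

    The default threshold is an ASSUMED significance criterion
    (chi-square, 1 df, p = 0.05) for the one extra fixed effect in EST.
    """
    return est_params if (ofv_mod - ofv_est) > threshold else mod_params
```

For each simulated dataset, `choose_mod_or_est` would be applied first, and the selected estimates from all datasets would then be passed to `rrmse` together with the corresponding ALL estimates.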
Results:
Discussion: With an increase in the use of combined analysis, appropriate handling of missing covariates is likely to be of increasing importance. Simplistic strategies such as data omission, imputation of the mode or other single imputation methods were, as expected, found to be suboptimal compared to MI and MOD/EST (not shown). Estimation of a regression curve for the missing covariate based on observed covariates is a commonly used method, but this study shows that erroneous assumptions about the probability curve may have a substantial effect on the parameter estimates for all underlying mechanisms of missingness. The problem is less pronounced for MI, where the EBEs stabilize the regression curve. The MI method used assumes low EBE shrinkage; higher EBE shrinkage will lead to greater bias and lower precision for MI.
Conclusions: When appropriate handling of covariate missingness is important, MI and/or MOD/EST may be appropriate, but the methods differ in their robustness to misspecification of the relation to known covariates, to the missingness mechanism and to data richness. This work outlines the relative merits of these methods.
References:
[1] Little and Rubin. Statistical analysis with missing data, 2002
[2] Karlsson et al. Journal of Pharmacokinetics and Biopharmaceutics, 1998
[3] Wu and Wu. Statistics in Medicine, 2001