2023 - A Coruña - Spain

PAGE 2023: Methodology - Covariate/Variability Models
Niklas Hartung

An information-theoretic evaluation of vine copula models for high-dimensional covariate distributions

Niklas Hartung, Aleksandra Khatova

Institute of Mathematics, University of Potsdam, Germany

Objectives: Statistical modelling of covariate distributions makes it possible to create virtual populations or to impute missing values in a covariate dataset. However, covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, which simple descriptions such as multivariate Gaussian distributions fail to represent. Copula models allow the marginal distributions and the correlation structure to be modelled separately. Within this framework, vine copula models provide a flexible approach to distribution fitting that scales well to higher dimensions [1]. While vine copula models have previously been proposed for covariate modelling [2], their goodness-of-fit to the theoretical underlying population distribution has not yet been investigated. Kullback-Leibler (KL) divergence provides a scale-invariant, information-theoretic goodness-of-fit criterion for distributions with a clear interpretation: it is the information lost when the true population distribution is approximated by a surrogate model [3]. In this work, we developed covariate models based on vine copulas and evaluated their goodness-of-fit through KL divergence. We compared healthy and diseased populations, and we compared copula-based approximations with different Gaussian approximations.
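For orientation, the two ingredients named above can be written out in standard textbook form. The notation is an assumption introduced here and does not appear in the abstract: f is the joint covariate density with marginal densities f_i and marginal distribution functions F_i, c is the copula density, and p and q are the densities of the population distribution P and the surrogate model Q.

```latex
% Sklar's theorem in density form: margins and dependence separate.
f(x_1,\dots,x_d) = c\bigl(F_1(x_1),\dots,F_d(x_d)\bigr)\,\prod_{i=1}^{d} f_i(x_i)

% KL divergence of the surrogate Q from the population distribution P.
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x
```

The first identity is why margins and dependence can be fitted in separate steps. The second quantity is unchanged when the same invertible, differentiable transformation (for example a change of units) is applied to both distributions, which is the sense in which it is scale-invariant.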

Methods: Physiological data from the US National Health and Nutrition Examination Survey (NHANES), as provided in the R package NHANES, were used for the healthy population [4], and intensive care unit (ICU) records from the MIMIC database were used for the diseased population [5]. The R package rvinecopulib was used to estimate vine copula models and to simulate from them [6]. Within the copula framework, both a general parametric class of pair copulas (best vine copula) and a restricted class using only Gaussian pair copulas (Gaussian vine copula) were investigated and compared to multivariate Gaussian copulas. The three copula-based approaches were also evaluated against multivariate Gaussian distributions. For sample-based estimation of KL divergence in high dimensions, where kernel-based methods fail, we implemented a generalized nearest-neighbour-based algorithm that compensates for finite-sample-size bias [7, algorithm NN-epsilon-1]. This KL divergence estimator was evaluated successfully on test cases and showed no sample-size-dependent trends. Uncertainty of the KL divergence approximation was estimated via repeated sampling.
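To illustrate the idea behind nearest-neighbour KL divergence estimation, the following Python sketch implements the basic fixed-k estimator from the same family as described in [7]. It is not the bias-compensating NN-epsilon-1 variant used in this work, and it is not the authors' implementation. The function names are hypothetical, distances are Euclidean, and the data are assumed to be continuous without ties (a zero nearest-neighbour distance would make the logarithm undefined).

```python
import heapq
import math
import random


def kth_nn_distance(point, sample, k, skip_index=None):
    """Euclidean distance from `point` to its k-th nearest neighbour in `sample`.

    `skip_index` excludes one element of `sample`; it is used when `point`
    is itself a member of `sample`.
    """
    dists = (math.dist(point, s) for j, s in enumerate(sample) if j != skip_index)
    return heapq.nsmallest(k, dists)[-1]


def knn_kl_divergence(x, y, k=1):
    """Fixed-k nearest-neighbour estimate of D_KL(P || Q).

    x: list of n points drawn from P, y: list of m points drawn from Q,
    each point a sequence of d floats. Assumes no tied points.
    """
    n, m, d = len(x), len(y), len(x[0])
    total = 0.0
    for i, xi in enumerate(x):
        rho = kth_nn_distance(xi, x, k, skip_index=i)  # neighbour within x
        nu = kth_nn_distance(xi, y, k)                 # neighbour within y
        total += math.log(nu / rho)
    return d * total / n + math.log(m / (n - 1))


if __name__ == "__main__":
    # Toy check: two bivariate unit-variance Gaussians whose means differ
    # by 1 in the first coordinate, so the true KL divergence is 0.5.
    rng = random.Random(0)
    p_sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    q_sample = [(rng.gauss(1, 1), rng.gauss(0, 1)) for _ in range(500)]
    print(knn_kl_divergence(p_sample, q_sample, k=5))
```

In a setting like the one described here, x would hold the observed covariate records and y a sample simulated from the fitted model. Repeating the call with fresh simulated samples gives a spread of estimates, which is one simple way to gauge uncertainty through repeated sampling.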

Results: 2612 records on 11 health-related covariates were extracted from the NHANES dataset and 10440 records on 30 covariates from the MIMIC dataset. In terms of KL divergence, the best vine copula model outperformed not only the multivariate Gaussian copula but also the Gaussian vine copula model on both the healthy and the diseased dataset, demonstrating the suitability of this approach for modelling correlation structures. All copula-based models were in turn considerably better than multivariate Gaussian distributions, which shows the importance of modelling marginal distributions correctly. When a common subset of 6 variables from NHANES and MIMIC was compared, KL divergence estimates were lower for MIMIC than for NHANES. This might be because greater between-individual variability in the diseased population leads to smoother covariate distributions that are easier to fit.

Conclusions: We successfully implemented a KL divergence-based evaluation of approximate covariate distribution models in high dimensions. Vine copula models performed favourably for covariate modelling in both healthy and diseased populations.



References:
[1] Czado C. Analyzing Dependent Data with Vine Copulas. Lect. Notes Stat. 222 (Springer, 2019)
[2] Zwep, L. et al. PAGE 30 (2022) Abstr 10099 [www.page-meeting.org/?abstract=10099]
[3] Kullback, S., Leibler, R.A., Ann Math Stat (1951), 22 (1): 79-86
[4] https://cran.r-project.org/package=NHANES
[5] Johnson, A. et al., MIMIC-IV (version 2.2). PhysioNet (2023). https://doi.org/10.13026/6mm1-ek67
[6] https://cran.r-project.org/package=rvinecopulib
[7] Wang, Q. et al., IEEE Trans Inf Theory (2009) 55(5): 2392-2405


Reference: PAGE 31 (2023) Abstr 10454 [www.page-meeting.org/?abstract=10454]
Poster: Methodology - Covariate/Variability Models