Multivariate semi-parametric empirical best predictions using latent variable models
Abstract
Traditional surveys are usually designed to produce reliable estimates for predefined domains planned at the design stage. Consequently, for specific population subgroups, direct survey estimates are often unreliable and official statistics are therefore not released. To address these limitations, Small Area Estimation (SAE) methods are required. In this framework, we propose a multivariate unit-level SAE model for binary data to estimate the indicators of interest at a disaggregated level, with greater efficiency than direct estimates, which rely only on sample data from the subpopulation of interest. In detail, we start from the semi-parametric empirical best prediction (sp-EBP) approach and extend it to a multivariate framework. The standard sp-EBP model relies on a finite mixture of logistic linear models, offering a flexible approach that does not require strong or untestable parametric assumptions on the distribution of domain-specific random effects, while maintaining computational efficiency. Here, the domain-specific random effects account for sources of unobserved heterogeneity that are not captured by the covariates and describe the correlation between units within the same small area. The sp-EBP approach also allows for the analytical estimation of the mean squared error, thus avoiding computationally intensive procedures such as the bootstrap or complex numerical approximations. We extend this approach to a multivariate setting by assuming the existence of a multidimensional, continuous latent variable (trait) associated with each unit in a given area. Unlike a univariate model, a multivariate model can exploit the correlations among response variables to provide more efficient estimates of small area proportions.
Moreover, in contexts with limited auxiliary information at the population level, the inclusion of a multidimensional continuous latent variable allows the model to account for the effects of unobserved (latent) characteristics on the multivariate binary outcomes. However, the computation of the empirical best predictor and of the analytic approximation to its mean squared error requires the solution of multiple integrals that do not have a closed form. To address this issue, we propose a semi-parametric multivariate empirical best predictor (sp-mEBP), which leaves the distribution of the area-specific random effects unspecified and estimates it directly from the observed data, as in the univariate sp-EBP approach. This is known to lead to a discrete mixing distribution, which helps avoid (i) unverifiable parametric assumptions and (ii) computationally heavy integral approximations. Furthermore, to retain the standard Gaussian assumption for the latent trait(s), we adopt a numerical approach based on Gaussian quadrature to derive the empirical best predictor (EBP) for the small area parameters of interest.
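As a rough illustration of how the two ingredients above can combine, the following minimal Python sketch approximates a success probability under a logistic link by (i) summing over a discrete mixing distribution for the area-specific effect and (ii) integrating out a Gaussian latent trait with Gauss-Hermite quadrature. It is not the estimator proposed here: it assumes a one-dimensional latent trait and a single response, and the names `expected_logistic`, `mixture_probability`, `locations`, `masses`, and `sigma` are hypothetical choices made for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def expected_logistic(eta, sigma, n_nodes=20):
    """Approximate E[expit(eta + sigma * Z)] for Z ~ N(0, 1).

    Gauss-Hermite nodes and weights target the weight function exp(-x^2),
    so the nodes are rescaled by sqrt(2) and the weights by 1 / sqrt(pi).
    """
    nodes, weights = hermgauss(n_nodes)
    z = np.sqrt(2.0) * nodes
    return float(np.sum(weights * expit(eta + sigma * z)) / np.sqrt(np.pi))


def mixture_probability(x_beta, locations, masses, sigma, n_nodes=20):
    """Average the quadrature result over a discrete mixing distribution.

    locations: support points of the area-specific effect (assumed given)
    masses:    their probabilities, assumed to sum to one
    """
    return float(
        sum(
            p * expected_logistic(x_beta + xi, sigma, n_nodes)
            for xi, p in zip(locations, masses)
        )
    )


# Illustrative values only, not estimates from any data set.
prob = mixture_probability(
    x_beta=0.3, locations=[-1.0, 0.5], masses=[0.4, 0.6], sigma=1.2
)
```

In this sketch the sum over `locations` replaces an integral over the random-effect distribution, while the quadrature handles the remaining Gaussian integral; a multidimensional trait would need a multivariate rule, for example a product of one-dimensional rules, which is not shown.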