Regional Statistics Conference 2026

Machine Learning for Mass Imputation in Structural Business Statistics Using Administrative Data

Format: CPS Abstract - Malta 2026

Keywords: administrative data integration, machine learning, official statistics, respondent burden

Session: CPS 12 Survey Issues

Thursday 4 June 11 a.m. - noon (Europe/Malta)

Abstract

Administrative data is a valuable source for Structural Business Statistics (SBS). Ideally, we would mass-impute all SBS variables for the full population from administrative data. However, this source usually does not cover the full population and lacks some variables. Moreover, it is not a single source but a set of sources that occasionally disagree and often cover different subpopulations.
Currently, a substantial part of the sample is already imputed, so the imputed units no longer need to be surveyed. We would like to go further and use administrative data to reduce the sample size even more while maintaining accuracy, with the dual objective of reducing costs and response burden.
We cannot directly substitute administrative data for survey data, for several reasons. First, some variables collected in the survey have no direct analogue in the administrative data. Second, for variables that do have a direct analogue, the administrative and statistical definitions may differ slightly. Finally, for some units, little or no information is available from administrative sources.
However, for a substantial number of variables, there is a strong relationship between administrative data and survey data. To exploit this relationship, we propose a machine-learning-based approach. For each target variable in the SBS, we perform feature selection and then train, tune, and test several models (such as Random Forests, linear models with ElasticNet regularization, and XGBoost), using administrative variables as regressors. Each model is then used to predict the survey variable for all units in the population. Models are compared on their cross-validated mean squared error, and the best-performing model is selected.
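The model comparison step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the hyperparameters are arbitrary, feature selection and tuning are omitted, and scikit-learn's GradientBoostingRegressor stands in for XGBoost so that the example depends on a single library. It is not the implementation used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the population: administrative regressors X_pop
# and a survey target y_pop that is observed only for the sampled units.
n_pop, n_sample, n_features = 2000, 400, 5
X_pop = rng.normal(size=(n_pop, n_features))
y_pop = 3.0 * X_pop[:, 0] - 2.0 * X_pop[:, 1] + rng.normal(scale=0.5, size=n_pop)
sample_idx = rng.choice(n_pop, size=n_sample, replace=False)
X_s, y_s = X_pop[sample_idx], y_pop[sample_idx]

# Candidate models (hyperparameters are arbitrary; tuning is omitted).
# GradientBoostingRegressor is a stand-in for XGBoost in this sketch.
candidates = {
    "elastic_net": make_pipeline(
        StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5)
    ),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Cross-validated mean squared error of each candidate on the sample.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {
    name: -cross_val_score(
        model, X_s, y_s, cv=cv, scoring="neg_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

# Select the candidate with the lowest cross-validated MSE, refit it on
# the whole sample, and predict the target for every population unit.
best_name = min(cv_mse, key=cv_mse.get)
best_model = candidates[best_name].fit(X_s, y_s)
y_pred_pop = best_model.predict(X_pop)
```

In practice this loop would run once per SBS target variable, with the tuning and feature selection steps nested inside the cross-validation so that the reported error is not optimistic.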
After model selection, the mean squared error of the model-based estimators is estimated via predictive inference in order to obtain a sample-based quality indicator.
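As a rough illustration of what such a quality indicator could look like, the sketch below computes a prediction-type estimate of a total (observed values for sampled units plus model predictions for the rest) and a crude plug-in estimate of its mean squared error. The abstract does not specify the estimator, so this is an assumption: the function names are hypothetical, prediction errors are treated as independent with a common mean and variance, those moments are estimated from out-of-fold residuals on the sample, and the uncertainty from fitting the model is ignored.

```python
import numpy as np


def predictive_total(y_sample, y_pred_nonsample):
    """Observed values for sampled units plus predictions for the rest."""
    return float(np.sum(y_sample) + np.sum(y_pred_nonsample))


def plugin_mse_of_total(oof_residuals, n_nonsample):
    """Crude plug-in MSE of the predicted part of the total.

    Assumes (for illustration only) that prediction errors of non-sampled
    units are independent with common mean b and variance s2, so that
    E[(sum of errors)^2] = m * s2 + (m * b)^2 for m non-sampled units.
    b and s2 are estimated from out-of-fold residuals (y - y_hat) on the
    sample; model-fitting uncertainty is ignored.
    """
    res = np.asarray(oof_residuals, dtype=float)
    bias = res.mean()
    variance = res.var(ddof=1)
    return float(n_nonsample * variance + (n_nonsample * bias) ** 2)


# Toy usage with made-up numbers.
total = predictive_total([10.0, 12.0, 8.0], [9.0, 11.0])
mse = plugin_mse_of_total([1.0, -1.0, 1.0, -1.0], n_nonsample=2)
```

Because the sampled units contribute their observed values, only the non-sampled units add prediction error, which is why the indicator depends on the sample both through the residuals and through the number of units left to predict.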
Note that this approach does not completely eliminate the sample, which remains important for several reasons. Large units are still exhaustively sampled, as they account for most of the error, and self-employed units are sampled using traditional methods due to the limited availability of auxiliary information. In addition, sampled units must be included in model retraining to prevent model drift. Finally, the sample is used to improve estimation precision and to assess error.