Regional Statistics Conference 2026

Automating Household Expenditure Coding in Official Statistics: A Hybrid and Production-Oriented Approach under Data Scarcity and Label Heterogeneity

Conference: Regional Statistics Conference 2026

Format: CPS Abstract - Malta 2026

Keywords: artificial intelligence, data classification, household-survey-data, official statistics

Session: CPS 26 Synthetic Data

Wednesday 3 June 4:30 p.m. - 5:30 p.m. (Europe/Malta)

Abstract

The codification of statistical units using internationally recognised classifications is a cornerstone of official statistics, ensuring comparability, coherence, and long-term usability of economic indicators across countries. As statistical systems increasingly rely on large volumes of detailed textual data, traditional manual coding processes are becoming unsustainable. This challenge affects many statistical domains where fine-grained classifications coexist with heterogeneous, weakly structured textual inputs.
This paper addresses the problem of automatically coding household expenditure descriptions into the COICOP nomenclature within the framework of the European Household Budget Survey (HBS). In the most recent French wave of the survey, nearly 70,000 distinct expenditure labels were manually coded across almost 900 COICOP categories. The reduction in human resources allocated to annotation and the obsolescence of previously used rule-based systems make it essential to automate coding for the next wave of the HBS in 2026.
The core difficulty lies in the combination of four constraints: (i) a highly granular target classification, (ii) heterogeneous textual inputs generated by multiple collection modes (receipts, digital diaries, paper diaries), (iii) overly generic or nonsensical textual inputs, and (iv) limited and unevenly distributed labeled data. In addition, part of the data collection relies on a predefined product suggester, whose coverage is inherently biased toward frequent expenditures observed in previous survey waves.
To address this problem, we propose a hybrid, incremental approach combining deterministic methods, supervised text classification models, synthetic data generation, and large language models. We first establish strong baselines using string-based similarity measures, providing robustness and interpretability in production contexts. We then evaluate embedding-based classifiers trained under different supervision regimes, including training on suggester-derived data and testing on manually annotated survey labels. To mitigate data scarcity, we also explore the controlled use of synthetic data derived from the COICOP manual, with a specific focus on assessing and limiting semantic biases introduced by the classification structure itself. Finally, we investigate the use of open-source large language models within a retrieval-augmented generation framework, achieving competitive performance at higher aggregation levels of the nomenclature.
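As a rough illustration of the string-similarity baseline mentioned above, the following minimal sketch matches a free-text expenditure label to its closest reference label and returns the attached code, or nothing when the best match is too weak. It is not the authors' implementation: the reference labels, the codes (placeholders in a COICOP-like dotted format, not checked against the COICOP manual), the 0.6 threshold, and the function names are all assumptions made for the example. The `truncate_code` helper shows one way predictions could be compared at higher aggregation levels of a hierarchical nomenclature.

```python
from difflib import SequenceMatcher

# Hypothetical reference labels with illustrative COICOP-style codes.
# These are placeholders, not verified against the official COICOP manual.
REFERENCE = {
    "white bread": "01.1.1.3",
    "whole milk": "01.1.4.1",
    "bus ticket": "07.3.2.1",
}


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def code_label(label: str, reference=REFERENCE, threshold: float = 0.6):
    """Return (code, score) for the closest reference label.

    The code is None when the best score falls below the threshold, so that
    generic or nonsensical inputs can be routed to manual review.
    """
    best_ref = max(reference, key=lambda ref: similarity(label, ref))
    score = similarity(label, best_ref)
    if score < threshold:
        return None, score
    return reference[best_ref], score


def truncate_code(code: str, level: int) -> str:
    """Keep the first `level` dot-separated components of a hierarchical code."""
    return ".".join(code.split(".")[:level])


if __name__ == "__main__":
    for raw in ["White  Bread", "whte bread", "xyzzy"]:
        code, score = code_label(raw)
        print(raw, "->", code, round(score, 2))
```

A baseline of this kind stays interpretable, since every prediction can be traced to a specific reference label and score, and the threshold gives an explicit handle for sending uncertain cases to human review.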
Beyond model performance, a key contribution of this work lies in its production-oriented perspective. We describe the deployment of these models within a statistical production environment, relying on experiment tracking and model monitoring tools, and supported by an interactive annotation and inspection interface enabling human-in-the-loop validation and iterative improvement.
The expected outcomes are twofold. Methodologically, this work provides a comparative and reproducible framework for automated coding under realistic constraints commonly faced by national statistical institutes. Operationally, it examines how hybrid AI systems can achieve the statistical quality, transparency, and control of manual coding. More broadly, the approach is designed to be transferable to other surveys and classification tasks in official statistics, contributing to the modernization and long-term sustainability of statistical production systems.