2026 IAOS Conference

Modernising the Household Budget Survey in Slovenia: Integrating Machine Learning and Digital Data Collection

Format: CPS Abstract - IAOS 2026

Keywords: household expenditures, innovative data collection, machine learning, official statistics

Session: Household survey developments in official statistics

Tuesday 12 May 2:30 p.m. - 4 p.m. (Europe/Vilnius)

Abstract

The Household Budget Survey (HBS) is a key source of data on household consumption expenditure and living conditions, but it places a substantial burden on respondents and requires complex data processing, particularly for the expenditure diary, where detailed item-level data and receipts are collected. This paper presents the experience of the Statistical Office of the Republic of Slovenia with optical character recognition (OCR) and machine learning in HBS 2022 and describes how these approaches are being developed further for HBS 2026. In the 2022 wave, respondents submitted paper receipts, which were scanned and processed with an OCR solution. After initial testing of an existing Dutch OCR and machine learning solution, the extracted text was classified into COICOP (Classification of Individual Consumption According to Purpose) categories using a supervised machine learning model. A human-in-the-loop approach was applied to ensure data quality, allowing experts to review and correct classifications where needed.

For HBS 2026, OCR-based receipt processing is being continued and developed further. In addition to paper receipts, a new web-based application has been introduced for uploading digital receipts from online retailers. The application was piloted in autumn 2025 and internally tested prior to implementation. All receipts, regardless of submission channel, are processed using OCR and stored in a central database. A new machine learning model is currently under development to classify the extracted data into COICOP categories and to provide a confidence score for each classification. Items whose confidence score falls below a predefined threshold are planned to be reviewed in the SDMC application before final storage in the database. A human-in-the-loop approach is foreseen to support data quality and validation of the results. The Slovenian experience illustrates how OCR and machine learning can be progressively integrated into the expenditure diary process of the Household Budget Survey while maintaining data quality and methodological control.
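The confidence-based review step planned for HBS 2026 can be illustrated with a minimal Python sketch. It is not the Statistical Office's implementation: the `ClassifiedItem` record, the `route_items` function, the threshold value of 0.80, and the example items, codes and scores are all hypothetical and serve only to show how classified receipt lines could be split into those stored directly and those sent for expert review.

```python
from dataclasses import dataclass


@dataclass
class ClassifiedItem:
    """One receipt line after OCR and classification (hypothetical record)."""
    text: str           # item text extracted from the receipt by OCR
    coicop_code: str    # COICOP category predicted by the model
    confidence: float   # model confidence for that prediction, in [0, 1]


# Assumed value for illustration; the abstract only says "predefined threshold".
REVIEW_THRESHOLD = 0.80


def route_items(items, threshold=REVIEW_THRESHOLD):
    """Split items into (accepted, needs_review) by confidence score.

    Items below the threshold go to manual review; all others are accepted.
    """
    accepted, needs_review = [], []
    for item in items:
        if item.confidence < threshold:
            needs_review.append(item)
        else:
            accepted.append(item)
    return accepted, needs_review


# Made-up example data: texts, codes and scores are illustrative only.
items = [
    ClassifiedItem("KRUH BELI 500G", "01.1.1", 0.97),
    ClassifiedItem("MLEKO 1L", "01.1.4", 0.91),
    ClassifiedItem("ART 40213 X2", "05.6.1", 0.42),
]

accepted, needs_review = route_items(items)

for item in needs_review:
    print(f"review: {item.text!r} -> {item.coicop_code} ({item.confidence:.2f})")
```

In this sketch the threshold is a single global value. Whether one threshold suits all COICOP categories, or whether it should vary by category, is a design question the abstract leaves open.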