65th ISI World Statistics Congress

A detailed look at COICOP classification of scanned receipt texts

Conference

65th ISI World Statistics Congress

Format: IPS Abstract - WSC 2025

Keywords: classification, machine learning

Abstract

The household budget survey app developed at Statistics Netherlands lets respondents scan store receipts instead of entering expenditures manually. Scanned receipts are processed with OCR software, after which the resulting receipt texts have to be classified into COICOP level 5 labels. This paper discusses the classification of scanned receipts in more detail. In the first part of the paper, we discuss the challenges of classifying scanned receipt texts. Specifically, receipt texts and product inventory vary from store to store, even if the stores are part of the same branch. For instance, supermarkets use receipt texts of different lengths and different abbreviations for the same product, and their product inventories differ as well. On top of that, store inventory is not static: it changes from month to month, and in some cases product turnover is very high. In addition, data may not be available for all supermarkets, which puts extra demands on how well a COICOP classifier needs to generalize.
A COICOP classifier has to deal with the above challenges. On the one hand, it has to reach sufficient performance on the data available at the time of training. On the other hand, it needs to handle new and unseen data during the survey period. To this end, we investigate two approaches: (1) a string-matching approach that uses fuzzy matching against the CPI data, and (2) a machine learning approach that uses state-of-the-art NLP techniques to classify the receipt texts. Both methods are evaluated and compared on their classification performance in a best-case scenario as well as a simulated real-world production scenario. We also discuss how these scenarios can best be compared.
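
To make the string-matching approach concrete, the following is a minimal sketch of fuzzy matching a receipt text against a reference list of labelled product descriptions. It is an illustration only, not the method used in the paper: the reference entries, the placeholder labels (which are not real COICOP codes), the `normalize` and `classify` helpers, and the 0.6 similarity threshold are all assumptions, and Python's standard `difflib.SequenceMatcher` stands in for whichever fuzzy matcher is actually applied to the CPI data.

```python
from difflib import SequenceMatcher

# Hypothetical reference list standing in for labelled CPI product descriptions.
# The labels are placeholders for illustration, not real COICOP level 5 codes.
REFERENCE = {
    "semi-skimmed milk 1l": "LABEL_MILK",
    "whole wheat bread": "LABEL_BREAD",
    "ground coffee 500g": "LABEL_COFFEE",
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def classify(receipt_text, reference=REFERENCE, threshold=0.6):
    """Return (label, score) for the most similar reference description.

    The label is None when the best similarity falls below the threshold,
    which flags the receipt text as unmatched (for example a new product).
    """
    query = normalize(receipt_text)
    best_label, best_score = None, 0.0
    for description, label in reference.items():
        # ratio() gives a similarity in [0, 1] based on matching blocks.
        score = SequenceMatcher(None, query, normalize(description)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    for text in ["SEMI SKIMMED MILK 1L", "whole wheat bread", "xyz"]:
        print(text, "->", classify(text))
```

In a sketch like this, the threshold governs the trade-off between leaving unseen products unclassified and assigning them a wrong label, which is where the generalization demands described above become visible.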