Large language models for COICOP classification in Norway's household budget survey
Conference
Regional Statistics Conference 2026
Format: IPS Abstract - Malta 2026
Session: IPS 1249 - Using Artificial Intelligence for Official Statistics
Wednesday 3 June 11:20 a.m. - 1 p.m. (Europe/Malta)
Abstract
We present work at Statistics Norway exploring the use of large language models (LLMs) for COICOP classification. The presentation builds on the work described in [1] and adds updated results from ongoing efforts. The goal is to classify the text of purchased items into one of about 300 COICOP codes with as little manual labelling as possible in the production pipeline. Our initial experiments with commercial LLMs found very encouraging performance: non-sensitive data sent to the latest free model of ChatGPT (as of 2025) achieved accuracy similar to that of a human coder. For national statistical institutes (NSIs), the challenge is to achieve this performance in a data-protected environment.
We report results from three approaches: 1) a retrieval-augmented generation (RAG) framework, which was our primary focus; 2) directly prompting an LLM to classify; and 3) a classifier built on a BERT embedding model, to test performance without an LLM. In the first approach, the RAG procedure first retrieves a set of candidate codes and then prompts an LLM to select the desired code. We used self-hosted LLMs, and performance was limited with the smaller models that our computing infrastructure at the time could support. In more recent experiments, we use the SSPCloud (Onyxia) data science infrastructure to test the performance of larger models on open data. We found significant improvements in RAG when fine-tuned embedding models were used for the retrieval step and larger LLMs for the classification step. Next, we describe experiments using larger LLMs on the Onyxia platform to classify directly, without RAG, by including the entire classification index in the prompt. We explore how large and costly LLMs need to be to yield human-level classification performance and to support features such as outputting confidence measures. We also compare the performance of LLMs to a classifier that uses a fine-tuned Norwegian BERT embedding.
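The retrieve-then-prompt procedure described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the bag-of-words embedding is a toy stand-in for a fine-tuned embedding model, the three-code index stands in for the ~300-code COICOP index, and all function names are hypothetical:

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words term counts.
# In practice a fine-tuned sentence-embedding model would be used here.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative mini index; the real COICOP index has ~300 codes.
COICOP_INDEX = {
    "01.1.1": "bread and cereals",
    "01.1.4": "milk cheese and eggs",
    "03.1.2": "garments clothing",
}

def retrieve_candidates(item_text, k=2):
    """Retrieval step: rank codes by similarity to the item text."""
    q = embed(item_text)
    scored = sorted(
        COICOP_INDEX.items(),
        key=lambda kv: cosine(q, embed(kv[1])),
        reverse=True,
    )
    return [code for code, _ in scored[:k]]

def build_prompt(item_text, candidates):
    """Generation step: ask an LLM to pick one of the candidate codes."""
    lines = [f"{code}: {COICOP_INDEX[code]}" for code in candidates]
    return (
        f"Classify the purchased item '{item_text}' into one of these "
        "COICOP codes, answering with the code only:\n" + "\n".join(lines)
    )

candidates = retrieve_candidates("whole milk 1l")
prompt = build_prompt("whole milk 1l", candidates)
```

The prompt string would then be sent to the (self-hosted or commercial) LLM, which returns a single code; restricting the choice to the retrieved candidates is what lets a smaller model handle a large code list.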
Finally, we outline strategies being explored at Statistics Norway for implementing LLM-based pipelines, describing how commercial LLMs can be integrated with our closed data science platform and how LLMs can be used in a human-in-the-loop workflow.