65th ISI World Statistics Congress

Build your own open-source text analysis tools: Istat's experience

Format: IPS Abstract - WSC 2025

Keywords: data-science, natural-language-processing, open source

Session: IPS 748 - Embracing Open Source: The Future of Statistics

Wednesday 8 October, 2 p.m. - 3:40 p.m. (Europe/Amsterdam)

Abstract

Text analysis is crucial for extracting insights from large volumes of unstructured textual data. Semantic methods for text classification, in particular, are among the most promising applications of recently developed transformer-based models. In this presentation, we demonstrate how to create and deploy semantic applications, including semantic search and retrieval-augmented generation (RAG) systems, using open-weight and open-source models. These methods are proving especially effective in National Statistical Offices (NSOs) for data dissemination, automatic coding, and improving the overall user experience, with several systems already in production. For computationally intensive tasks, non-commercial cloud providers, such as SSPCloud hosted by Insee, are becoming increasingly attractive due to the growing availability of GPUs. In this context, we also present our recent experience with an international text classification tutorial hosted entirely on SSPCloud.
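
As a rough illustration of the retrieval step behind semantic search and automatic coding, the sketch below ranks a small set of classification labels by cosine similarity to a free-text query. It is not the system presented in the talk: the label set, the `embed` function, and the `search` helper are all hypothetical, and `embed` is a toy word-count vector standing in for a transformer sentence encoder, so it only matches shared words rather than meaning.

```python
import math
import re
from collections import Counter

# Hypothetical label set, invented for illustration only.
LABELS = {
    "A": "growing crops and raising farm animals",
    "B": "teaching pupils in primary schools",
    "C": "developing software and computer programs",
}


def embed(text):
    """Toy stand-in for a sentence encoder: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, labels=LABELS, top_k=1):
    """Return the codes of the top_k labels most similar to the query."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(desc)), code) for code, desc in labels.items()),
        reverse=True,
    )
    return [code for _, code in scored[:top_k]]


print(search("I write computer software"))
```

In a real application, `embed` would be replaced by an open-weight embedding model and the label vectors would be computed once and stored in an index; the ranking logic stays the same. In a RAG system, the retrieved texts would then be passed to a generative model as context.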