Build your own open-source text analysis tools: Istat's experience
Conference
65th ISI World Statistics Congress
Format: IPS Abstract - WSC 2025
Keywords: data-science, natural-language-processing, open source
Session: IPS 748 - Embracing Open Source: The Future of Statistics
Wednesday 8 October 2 p.m. - 3:40 p.m. (Europe/Amsterdam)
Abstract
Text analysis is crucial for extracting insights from large volumes of unstructured textual data. Semantic methods for text classification, in particular, are among the most promising applications of recently developed transformer-based models. In this presentation, we demonstrate how to create and deploy semantic applications, including semantic search and retrieval-augmented generation (RAG) systems, using open-weight and open-source models. These methods are proving especially effective in National Statistical Offices (NSOs) for data dissemination, automatic coding, and an improved overall user experience, with several systems already in production. For computationally intensive tasks, non-commercial cloud providers, such as SSPCloud hosted by Insee, are becoming increasingly attractive due to the growing availability of GPUs. In this context, we also present our recent experience with an international text classification tutorial hosted entirely on SSPCloud.
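As a rough illustration of the retrieval step shared by semantic search and RAG, the sketch below ranks a few documents by cosine similarity to a query. It is not the system described in the presentation: the `embed` function is a hypothetical stand-in that uses plain word counts, so it only matches shared words, whereas a real application would replace it with vectors from an open-weight transformer encoder. The function names and the sample texts are assumptions made for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a transformer encoder: lowercase word counts.

    Assumption for illustration only; a real system would return a dense
    vector produced by an open-weight embedding model.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def search(query: str, documents: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k documents ranked by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical category descriptions, as might be used for automatic coding.
documents = [
    "retail trade of bread and pastry",
    "software development and consulting",
    "growing of cereals and other crops",
]
results = search("bread retail shop", documents, top_k=1)
print(results[0][1])
```

In a RAG setting, the passages returned by `search` would then be placed in the prompt of a generative model; swapping `embed` for a transformer-based encoder is what makes the matching semantic rather than lexical.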