Regional Statistics Conference 2026

From Text to Code: How Insee has integrated AI and ML for Text Classification and Information Retrieval

Organiser

Julien PRAMIL

Participants

  • Ms Mélina Hillion (Presenter/Speaker)
    Combining BERTopic and LLMs for automatic coding of free-text survey responses

  • Mr Julien PRAMIL (Presenter/Speaker)
    Same text, new labels: leverage the power of LLMs to (re)label data

  • Mr Meilame Tayebjee (Presenter/Speaker)
    torchTextClassifiers: a unified framework for text classification with PyTorch, from an MLOps perspective

  • Mr Lino Galiana (Presenter/Speaker)
    MLOps in practice: maintaining a high-quality model with monitoring and training strategies

Proposal Description

    Automatic coding, the process of assigning standardized codes to free-text survey responses, is a major use case in modern statistical production. At Insee, we have gradually shifted from rule-based methods to robust, scalable AI-driven systems for this task.

    This talk will present a typical pipeline for automatic coding, highlight our MLOps practices for model deployment and monitoring, and introduce torchTextClassifiers, an internally developed (and open-sourced) PyTorch package for text classification.
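    As a toy illustration only (not Insee's production pipeline, which relies on trained neural classifiers such as those in torchTextClassifiers), the core coding step can be sketched as a lookup from a free-text response to the code of its most similar labeled example. All example texts and codes below are hypothetical:

```python
from collections import Counter

# Hypothetical labeled examples: free-text job descriptions mapped to
# (made-up) activity codes. A real system would hold many thousands.
LABELED_EXAMPLES = {
    "boulanger dans une boulangerie artisanale": "7512Z",
    "professeur de mathematiques au lycee": "8531Z",
    "developpeur logiciel en entreprise": "6201Z",
}

def tokens(text: str) -> Counter:
    """Lowercase bag-of-words representation of a response."""
    return Counter(text.lower().split())

def overlap(a: Counter, b: Counter) -> int:
    """Count of word occurrences shared by two responses."""
    return sum((a & b).values())

def code(response: str) -> str:
    """Assign the code of the most lexically similar labeled example."""
    query = tokens(response)
    best = max(LABELED_EXAMPLES, key=lambda ex: overlap(query, tokens(ex)))
    return LABELED_EXAMPLES[best]
```

    For instance, `code("professeur de francais au lycee")` returns `"8531Z"` because that response shares the most words with the teaching example. The talk's actual pipeline replaces this lexical matching with learned text classifiers.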

    Beyond the technical details, we will also address the operational challenges of deploying these models in production:

    1 - Training from annotated data: building and maintaining high-quality labeled datasets remains an important challenge. We will present how annotation campaigns are organized, how annotators are supported by AI-assisted interfaces, and how this data feeds iterative model improvement.

    2 - Containerization and deployment: to ensure reproducibility and scalability, models are containerized (e.g., with Docker) and integrated into production workflows. This enables smooth deployment, versioning, and monitoring across multiple environments.

    3 - Continuous evaluation and feedback loops: production models are continuously evaluated through the collection of new annotations and user feedback. This allows us to detect drifts, monitor performance over time, and retrain models when necessary.
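    The feedback loop in point 3 can be sketched as a rolling comparison between model predictions and incoming human annotations; the window size and alert threshold below are hypothetical, and the real monitoring stack is more elaborate:

```python
from collections import deque

class DriftMonitor:
    """Track agreement between model predictions and fresh annotations;
    flag drift when rolling accuracy falls below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.80):
        # 1 = model agreed with the annotator, 0 = disagreement
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted_code: str, annotated_code: str) -> None:
        """Store the outcome of one newly annotated production case."""
        self.results.append(int(predicted_code == annotated_code))

    def rolling_accuracy(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def needs_retraining(self) -> bool:
        """True once a full window of results sits below the threshold."""
        return (len(self.results) == self.results.maxlen
                and self.rolling_accuracy() < self.threshold)
```

    In production such a monitor would sit behind the annotation-collection service, so each new human label both evaluates the live model and enriches the next training set.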

    We will also showcase recent innovations, including the use of retrieval-augmented and context-aware generation (RAG/CAG) to recode datasets under new nomenclatures, and the use of BERTopic to cluster free text at scale and support its classification in an unsupervised way.
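    The retrieval step behind such RAG-style recoding can be sketched as shortlisting candidate labels from the new nomenclature by lexical similarity; in a full pipeline the shortlist and the original text would then be passed to an LLM that picks the final code. The nomenclature entries and codes below are hypothetical:

```python
# Hypothetical excerpt of a "new" nomenclature: code -> label.
NEW_NOMENCLATURE = {
    "11.01": "culture de cereales",
    "23.42": "fabrication de pain et patisserie",
    "56.10": "enseignement secondaire general",
}

def retrieve(text: str, k: int = 2) -> list[str]:
    """Return the k codes whose labels share the most words with `text`,
    i.e. the retrieval stage of a RAG recoding pipeline."""
    words = set(text.lower().split())
    scored = sorted(
        NEW_NOMENCLATURE.items(),
        key=lambda item: len(words & set(item[1].split())),
        reverse=True,
    )
    return [code for code, _ in scored[:k]]
```

    For example, `retrieve("fabrication artisanale de pain", k=1)` shortlists `"23.42"`. A production system would use embedding-based retrieval rather than word overlap, but the structure (retrieve candidates, then let an LLM decide) is the same.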