From Text to Code: How Insee has integrated AI and ML for Text Classification and Information Retrieval
Conference
Proposal Description
Automatic coding, the process of assigning standardized codes to free-text survey responses, is a major use case in modern statistical production. At Insee, we have gradually shifted from rule-based methods to robust, scalable AI-driven systems for this task.
This talk will present a typical pipeline for automatic coding, highlight our MLOps practices for model deployment and monitoring, and introduce torchTextClassifiers, an internally developed (and open-sourced) PyTorch package for text classification.
Beyond the technical details, we will also address the operational challenges of deploying these models in production:
1 - Training from annotated data: building and maintaining high-quality labeled datasets remains an important challenge. We will present how annotation campaigns are organized, how annotators are supported by AI-assisted interfaces, and how this data feeds iterative model improvement.
2 - Containerization and deployment: to ensure reproducibility and scalability, models are containerized (e.g., with Docker) and integrated into production workflows. This enables smooth deployment, versioning, and monitoring across multiple environments.
3 - Continuous evaluation and feedback loops: production models are continuously evaluated through the collection of new annotations and user feedback. This allows us to detect drifts, monitor performance over time, and retrain models when necessary.
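As an illustration of the feedback loop in point 3, here is a minimal sketch of a rolling accuracy monitor. It is not Insee's production code: the class name `RollingAccuracyMonitor`, the window size, the threshold, and the placeholder codes are all assumptions chosen for the example. The idea is simply to compare each prediction with the annotation collected afterwards and to flag the model for retraining when accuracy over recent annotations falls below a chosen threshold.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Hypothetical helper: tracks accuracy over the most recent annotations."""

    def __init__(self, window=500, threshold=0.85, min_samples=50):
        # Only the last `window` outcomes are kept; older ones are dropped.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold      # assumed acceptable accuracy level
        self.min_samples = min_samples  # avoid alerts on too little evidence

    def record(self, predicted_code, annotated_code):
        """Store whether the model's code matched the annotator's code."""
        self.outcomes.append(predicted_code == annotated_code)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True when enough annotations exist and accuracy is below threshold."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.threshold


# Usage with placeholder codes "A" and "B" (not real nomenclature codes).
monitor = RollingAccuracyMonitor(window=4, threshold=0.75, min_samples=4)
for predicted, annotated in [("A", "A"), ("A", "B"), ("B", "B"), ("B", "A")]:
    monitor.record(predicted, annotated)
print(monitor.accuracy(), monitor.needs_retraining())
```

A fixed threshold on rolling accuracy is only one possible drift signal; in practice it could be complemented by per-class metrics or by monitoring the distribution of prediction confidence.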
We will also showcase recent innovations, including the use of retrieval-augmented and context-aware generation (RAG/CAG) to recode datasets under new nomenclatures, and of BERTopic to cluster free text and help classify it at scale in an unsupervised way.