A human-in-the-loop method for classifying short survey free-text responses in official statistics
Conference: Regional Statistics Conference 2026
Format: IPS Abstract - Malta 2026
Thursday 4 June 11:30 a.m. - 1:10 p.m. (Europe/Malta)
Abstract
Statistical surveys frequently include short free-text responses that contain valuable information but remain difficult to exploit because they are brief, heterogeneous, ambiguous, and costly to code manually. This talk presents a semi-automated method designed to support their classification in a way that remains interpretable, evaluable, and compatible with the quality requirements of official statistics.
The approach is illustrated on a use case drawn from French national victimization surveys, focusing on a free-text field describing the presumed perpetrator of an incident.
The method is organized into three stages, sketched below. First, the corpus is explored through semantic clustering in order to reveal recurrent themes in the responses. Second, these clusters are turned into a structured, documented, expert-validated coding manual, which serves as an intermediate reference framework for coding. Third, responses are classified with large language models using this explicit reference framework. A central feature of the method is that automation does not replace expert judgment: clustering is treated as an exploratory step, and the candidate categories it suggests are reviewed, corrected, merged, or reorganized by a domain expert before the manual is used for classification. In this sense, the contribution is not fully automated coding, but a human-in-the-loop procedure that combines computational assistance with substantive expertise.
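For concreteness, the sketch below shows how the three stages could be wired together with generic open-source tools. It is an illustration under stated assumptions, not the implementation used in the study: the text representation (a simple TF-IDF plus k-means stand-in for semantic clustering), the number of clusters, the example responses, the category codes, and the call_llm wrapper are all placeholders for the actual components.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy free-text responses standing in for the survey field on the presumed perpetrator.
responses = [
    "a neighbour I had argued with before",
    "unknown man in the street",
    "my former partner",
    "someone from my building",
    "a stranger at the bus stop",
    "my ex-husband",
]

# Stage 1: exploratory clustering to surface recurrent themes in the corpus.
vectors = TfidfVectorizer().fit_transform(responses)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Stage 2: the clusters are only a starting point; a domain expert reviews them
# and writes an explicit coding manual (code + definition), merging, correcting,
# or reorganizing categories as needed. The codes below are illustrative.
coding_manual = {
    "acquaintance": "Perpetrator personally known to the respondent (neighbour, colleague, ...)",
    "stranger": "Perpetrator unknown to the respondent",
    "partner_or_ex": "Current or former intimate partner",
}

# Stage 3: each response is classified by an LLM prompted with the expert-validated manual.
def call_llm(prompt: str) -> str:
    # Placeholder: connect to whichever model is actually used.
    raise NotImplementedError

def classify(response: str, manual: dict[str, str]) -> str:
    rules = "\n".join(f"- {code}: {definition}" for code, definition in manual.items())
    prompt = (
        "Assign exactly one code from the coding manual to the response.\n"
        f"Coding manual:\n{rules}\n\nResponse: {response}\nCode:"
    )
    return call_llm(prompt)

In this layout, the coding_manual dictionary is the artefact the expert actually edits: categories can be renamed, merged, or split before any response is sent to the model.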
The evaluation protocol is also a core contribution of the work. Performance is assessed against a manually annotated reference corpus produced through independent double human annotation. To quantify reliability, we first measure inter-annotator agreement using raw agreement and Cohen’s kappa, which corrects for agreement expected by chance. We then compare model predictions to the human codings using the same metrics, treating the model as an additional annotator. This makes it possible to determine whether automated classification approaches human-level performance, and to estimate the specific contribution of the expert-validated coding manual to coding quality.
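To make the protocol concrete, the minimal sketch below, using scikit-learn and invented toy labels rather than study data, computes raw agreement and Cohen’s kappa (kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e chance agreement), first between the two human annotators and then between the model and each annotator.

from sklearn.metrics import cohen_kappa_score

# Toy labels for illustration only; the real evaluation uses the doubly annotated corpus.
human_a = ["stranger", "acquaintance", "partner_or_ex", "stranger", "acquaintance"]
human_b = ["stranger", "acquaintance", "stranger", "stranger", "acquaintance"]
model   = ["stranger", "acquaintance", "partner_or_ex", "stranger", "stranger"]

def raw_agreement(x, y):
    # Proportion of items on which the two codings coincide.
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Reliability of the human reference: agreement between the two independent annotators.
print("human A vs B:", raw_agreement(human_a, human_b), cohen_kappa_score(human_a, human_b))

# The model is scored with the same metrics, treated as an additional annotator.
print("model vs A  :", raw_agreement(model, human_a), cohen_kappa_score(model, human_a))
print("model vs B  :", raw_agreement(model, human_b), cohen_kappa_score(model, human_b))

The human-to-human figures serve as the reference ceiling against which the model’s agreement is read.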
Results show that direct classification into the original survey categories remains difficult, even for human annotators. By contrast, once the expert-validated coding manual has been constructed from the clustering outputs and reviewed by a domain expert, agreement improves substantially for both humans and models. In the case study, the best-performing LLM reaches a level of concordance close to the human reference. The study therefore shows that the decisive step is not the choice of model alone, but the explicit formalization of categories in an expert-validated coding manual. More broadly, this work argues for a pragmatic framework in which LLMs assist coding while humans remain central to taxonomy design, validation, and evaluation.