Classifying respondent comments from the 2021 Canadian Census of Population using machine learning methods
Conference
65th ISI World Statistics Congress
Format: SIPS Abstract - WSC 2025
Keywords: deep learning, machine learning, text-classification
Session: SIPS 1164 - IAOS Young Statisticians Prize 2023, 2024, 2025
Monday 6 October 9:20 a.m. - 10:30 a.m. (Europe/Amsterdam)
Abstract
To improve the analysis of respondent comments from the Canadian Census of Population, data scientists at Statistics Canada compared and evaluated traditional machine learning, deep learning and transformer-based techniques. XLM-R (Cross-lingual Language Model-Robustly Optimized Bidirectional Encoder Representations from Transformers), a cross-lingual language model, was fine-tuned on census respondent comments and yielded the best result, an overall F1 score of 89.91%, despite language and class imbalances. Following the evaluation, the fine-tuned model was successfully implemented to categorize comments from the 2021 Census of Population objectively and with high accuracy. As a result, feedback from respondents was directed to the appropriate subject matter analysts for post-collection analysis.
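The abstract reports an overall F1 score under class imbalance but does not say how per-class scores were aggregated. The sketch below is a minimal illustration, not the authors' evaluation code. It shows how per-class F1 is computed and how macro and support-weighted averages can diverge when one class is rare. The class names, the toy labels, and the function names (`per_class_f1`, `macro_f1`, `weighted_f1`) are all hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN) per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's share of true labels."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in support) / len(y_true)


# Hypothetical, imbalanced toy data (not census data): 6 / 3 / 1 comments.
y_true = ["content"] * 6 + ["technical"] * 3 + ["other"] * 1
y_pred = ["content"] * 5 + ["technical"] + ["technical"] * 3 + ["content"]

scores = per_class_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(scores)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```

In this toy case the rare class is never predicted correctly, so the macro average drops well below the weighted average. This is why the choice of averaging matters when a single overall F1 figure is read against imbalanced classes.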