Expanding Thematic Statistics through the Analytical Reuse of Census Data: Evidence from Religious Establishments in Brazil

Author

Bruno Perez

Conference

2026 IAOS Conference

Format: CPS Abstract - IAOS 2026

Keywords: census, machine learning

Session: Selected topics in official statistics

Wednesday 13 May 4:30 p.m. - 6 p.m. (Europe/Vilnius)

Abstract

This paper explores the analytical reuse of census data originally produced for general addressing purposes in order to identify and characterize religious establishments. The identification of this specific type of establishment required the development of tailored methodological procedures, grounded in the systematic analysis of textual descriptions contained in the records. Automatic, semi-automatic, and manual validation techniques were combined, enabling the identification of a national universe of approximately 765 thousand religious establishments. This approach highlights both the potential and the inherent limitations of the secondary use of census data for the production of new thematic statistics.
Once the universe of religious establishments was identified, the establishments were subclassified into major religious matrices. This classification was designed to ensure conceptual coherence and statistical feasibility while acknowledging the constraints imposed by the textual, heterogeneous, and sometimes generic nature of the available information.
The manual classification method consisted of assisted coding carried out by a trained team of coders. Classification was initially based on the direct reading of establishment descriptions and, in ambiguous or generic cases, was complemented by external searches. This approach achieved high classification accuracy, particularly when textual information alone was insufficient to determine the religious matrix of an establishment.
In parallel, an automated classification method was developed using machine learning models, combining supervised and unsupervised approaches applied exclusively to textual descriptions. Supervised models were trained using samples previously labeled by a subject-matter specialist, while unsupervised models supported the identification of recurring textual patterns.
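As an illustration of the supervised step only, the following is a minimal sketch of a text classifier trained on labeled descriptions. The abstract does not state which models were used; multinomial naive Bayes is swapped in here purely for brevity, and the sample descriptions, the label names (`catholic`, `evangelical`), and the function names are hypothetical toy values, not data or categories from the study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(samples):
    """Count labels and per-label word frequencies from (description, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under naive Bayes."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical specialist-labeled sample (toy data, not from the census).
labeled = [
    ("igreja catolica sao jose", "catholic"),
    ("paroquia nossa senhora", "catholic"),
    ("igreja evangelica assembleia", "evangelical"),
    ("templo evangelico batista", "evangelical"),
]
model = train(labeled)
```

With this toy sample, `predict(model, "paroquia sao pedro")` selects the label whose training descriptions share the most vocabulary with the new description. A production pipeline would need far more labeled data and proper text normalization.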
The comparison between the two methods revealed a high degree of agreement, especially for the major religious matrices, with agreement levels exceeding 90% for well-defined categories. Validation against specialist-labeled samples confirmed high accuracy rates for both methods. Overall, the results demonstrate that machine learning–based methods are technically viable and statistically robust for the production of experimental statistics, particularly when combined with human validation.
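The comparison of the two methods can be sketched as a per-category agreement rate. This is a minimal sketch under stated assumptions: the abstract does not define how agreement was computed, so the rate here is assumed to be the share of records in each manually assigned category that received the same label from the automated method. The function name and the category labels `A`, `B`, `C` are hypothetical.

```python
from collections import defaultdict


def agreement_by_category(manual, automated):
    """Return (overall agreement, {category: agreement}) for two label lists.

    Categories are taken from the manual labels (an assumption of this sketch).
    """
    if len(manual) != len(automated):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("label lists must not be empty")

    totals = defaultdict(int)
    matches = defaultdict(int)
    for manual_label, automated_label in zip(manual, automated):
        totals[manual_label] += 1
        if manual_label == automated_label:
            matches[manual_label] += 1

    overall = sum(matches.values()) / len(manual)
    per_category = {cat: matches[cat] / totals[cat] for cat in totals}
    return overall, per_category


# Hypothetical labels for five establishments (toy data).
manual_labels = ["A", "A", "B", "B", "C"]
automated_labels = ["A", "A", "B", "C", "C"]
overall, per_category = agreement_by_category(manual_labels, automated_labels)
```

Raw agreement does not correct for matches expected by chance, so a chance-adjusted statistic such as Cohen's kappa may be a useful complement when category sizes are very unequal.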
Author: Bruno Mandelli Perez
Co-authors: Daléa Soares Antunes and Rafael Kessler Fernandez