2026 IAOS Conference

Data Quality Assurance and Anomaly Detection in the Brazilian Agricultural Census: A Machine Learning Approach for High-Dimensional and Sparse Data

Format: CPS Abstract - IAOS 2026

Keywords: agricultural census, anomaly detection, data, data-driven, machine learning

Session: Agricultural statistics innovation

Tuesday 12 May 11 a.m. - 12:30 p.m. (Europe/Vilnius)

Abstract

Data quality assurance is a central challenge in large-scale census operations, especially in complex surveys such as the Brazilian Agricultural Census, which is marked by high productive, regional, and structural heterogeneity. The challenge is intensified by data sparsity: the questionnaire comprises over 500 items, of which only a specific subset is answered in each interview. In this context, automated anomaly detection methods based on advanced statistical techniques and machine learning have gained prominence as strategic tools to support field supervision, data editing, and the improvement of validation processes for the collected information.

This paper presents an investigation into the use of machine learning approaches for anomaly detection in Agricultural Census questionnaires. The proposal is based on the understanding that anomalies are not limited to obvious errors but also include rare combinations of features, internal inconsistencies, and extreme behaviors that may signal either operational issues or legitimate productive phenomena.

The adopted approach is exploratory and comparative, encompassing different anomaly detection paradigms applied to large-volume, high-dimensional tabular databases. This work discusses perspectives for consolidating benchmarks and for using machine learning methods in official statistics, highlighting the potential of these approaches to expand the capacity to monitor and diagnose census data quality during and after the operation. The methodology can therefore support all phases of the statistical operation and contribute to improving the final quality of the data produced.
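As an illustration only, the sketch below shows one simple way sparse questionnaire records could be scored for anomalies. It is not the method evaluated in this work: it is a basic robust-statistics baseline (median and median absolute deviation per item), chosen because it needs no imputation of unanswered items. The item names (`cattle_heads`, `soy_area_ha`, `coffee_area_ha`), the toy records, and the function names are all hypothetical.

```python
from statistics import median

def robust_item_stats(records):
    """Median and MAD per item, using only the records that answered it."""
    values = {}
    for rec in records:
        for item, x in rec.items():
            values.setdefault(item, []).append(x)
    stats = {}
    for item, xs in values.items():
        med = median(xs)
        mad = median([abs(x - med) for x in xs])
        stats[item] = (med, mad)
    return stats

def anomaly_score(rec, stats, eps=1e-9):
    """Largest robust z-score among the items this record answered.

    Unanswered items are skipped, so sparsity does not inflate the score.
    The factor 1.4826 rescales the MAD to be comparable to a standard
    deviation under normality; eps avoids division by zero.
    """
    scores = []
    for item, x in rec.items():
        med, mad = stats[item]
        scores.append(abs(x - med) / (1.4826 * mad + eps))
    return max(scores, default=0.0)

# Hypothetical toy data: each record holds only the items it answered.
records = [
    {"cattle_heads": 40, "soy_area_ha": 120},
    {"cattle_heads": 55},
    {"soy_area_ha": 150, "coffee_area_ha": 8},
    {"cattle_heads": 48, "coffee_area_ha": 10},
    {"cattle_heads": 52, "soy_area_ha": 135},
    {"cattle_heads": 5000, "coffee_area_ha": 9},  # implausibly large herd
]

stats = robust_item_stats(records)
scores = [anomaly_score(rec, stats) for rec in records]
most_suspicious = max(range(len(records)), key=lambda i: scores[i])
print(most_suspicious, round(scores[most_suspicious], 1))
```

A per-item score like this only flags extreme values. Rare combinations of features and internal inconsistencies, which the abstract also treats as anomalies, would require multivariate methods, and a flagged record may still reflect a legitimate productive phenomenon rather than an error.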