Detecting and Correcting Measurement Anomalies in Agricultural Plot Data: Longitudinal Evidence from Nigeria
Conference
10th International Conference on Agricultural Statistics
Format: CPS Paper - ICAS 2026
Keywords: agriculture, outliers, productivity
Abstract
Accurate measurement of agricultural land is essential for empirical research and policy design, particularly in development economics, where land area data inform productivity estimates and decisions on poverty reduction, land reform, and food security. While GPS-based measurement has become the preferred method due to its objectivity and practicality, it is not immune to errors arising from environmental, plot, and surveyor factors. This study leverages longitudinal data from the Nigeria General Household Survey (GHS-Panel) Wave 5, which includes repeated GPS measurements for selected plots, to systematically detect and correct measurement anomalies using machine learning techniques. The GHS-Panel Wave 5 is a nationally representative survey covering approximately 5,000 households across Nigeria’s six geopolitical zones, with two rounds (post-planting and post-harvest) in which enumerators record plot boundaries using GPS perimeter walks. In the post-harvest round, 766 plots previously flagged as outliers were re-measured, providing a rare opportunity to validate and improve measurement accuracy. Environmental data from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5-Land product were matched to each survey using plot location and interview timing, capturing humidity, temperature, wind speed, surface pressure, precipitation, vegetation index, and cloud cover, all factors known to affect GPS signal quality.
The methodological approach combines supervised and unsupervised machine learning methods to identify outlier plots in both survey rounds. Supervised detection uses a linear probability model with LASSO regularization, selecting key predictors from household, plot, and geometric features, while unsupervised methods such as Isolation Forest, One-Class support vector machine (OC-SVM), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) capture different aspects of anomalous data patterns. Any plot flagged by at least one method is classified as an outlier, ensuring comprehensive anomaly detection. For plots transitioning from outlier to non-outlier status between rounds, a regression model identifies environmental, topographical, and surveyor-related factors associated with improved measurement. Subsequently, plot areas for outliers are imputed using household and plot characteristics, employing both LASSO regression and Random Forest models. The study then assesses how measurement choices affect the estimated relationship between land size and productivity.
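The "flagged by at least one method" rule described above amounts to taking the union of the per-method outlier sets. The following is a minimal sketch of that rule only, not the study's implementation: the plot IDs, the method labels, and the helper `combine_flags` are hypothetical, and the per-method flags are assumed to have been produced already by the respective detectors.

```python
from typing import Dict, Set


def combine_flags(flags_by_method: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each flagged plot ID to the set of methods that flagged it."""
    flagged: Dict[str, Set[str]] = {}
    for method, plot_ids in flags_by_method.items():
        for plot_id in plot_ids:
            flagged.setdefault(plot_id, set()).add(method)
    return flagged


# Hypothetical per-method flags (illustrative IDs, not survey data).
flags_by_method = {
    "lasso_lpm": {"p01", "p02"},
    "isolation_forest": {"p02", "p03"},
    "oc_svm": {"p04"},
    "dbscan": {"p02", "p05"},
}

flagged = combine_flags(flags_by_method)

# Union rule: a plot is an outlier if any method flagged it.
outliers = sorted(flagged)

# Plots flagged by more than one method show where the detectors overlap.
multi_flagged = sorted(pid for pid, methods in flagged.items() if len(methods) > 1)

print("outliers:", outliers)
print("flagged by several methods:", multi_flagged)
```

Keeping the set of methods per plot, rather than a single yes/no flag, makes it easy to tabulate how much the detectors agree and to experiment with stricter rules (for example, requiring two or more methods) without rerunning them.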
Results show that, across methods, 21% of plots were identified as outliers in the post-planting round. The LASSO model indicated that larger, more compact plots with less boundary irregularity were less likely to be outliers, while household size and food insecurity were positively correlated with extreme plot areas. Isolation Forest, OC-SVM, and DBSCAN each detected distinct subsets of outliers, with some overlap. Re-measurement in the post-harvest round led to a substantial reduction in outlier classification, with over 60% of previously flagged plots reclassified as non-outliers. The distribution of plot sizes became more concentrated at smaller sizes, and re-measured plots aligned more closely with non-outlier features, suggesting improved accuracy. Regression analysis revealed that lower humidity and reduced cloud cover during the post-harvest period were associated with successful reclassification of outlier plots; surveyor experience and time of day also played a role. Imputation models using household and plot characteristics provided reliable estimates for outlier plots, with Random Forest and LASSO both identifying number of crops, household size, and plot compactness as important predictors. Comparing productivity estimates based on original GPS, imputed, and re-measured plot areas showed that correcting measurement anomalies can materially affect the observed land size–productivity relationship, with implications for the widely debated inverse relationship in development economics.
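One reason area measurement can matter for the land size–productivity relationship is that measured area enters both the regressor and the denominator of yield. The sketch below illustrates that mechanism on purely synthetic data; it does not reproduce the study's estimates. It assumes a constant true yield, a multiplicative log-normal error in measured area, and a simple regression of log yield on log area; the helpers `ols_slope` and `yield_area_slope` and all parameter values are hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def yield_area_slope(area, output):
    """Regress log(output / area) on log(area) and return the slope."""
    log_area = [math.log(a) for a in area]
    log_yield = [math.log(q / a) for q, a in zip(output, area)]
    return ols_slope(log_area, log_yield)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_plots = 500

# Synthetic "true" plot areas (assumption: log-normal).
true_area = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n_plots)]

# Assumption: every plot has the same true yield, so the true slope is zero.
output = [2.0 * a for a in true_area]

# Assumption: measured area equals true area times a log-normal error.
measured_area = [a * math.exp(rng.gauss(0.0, 0.5)) for a in true_area]

slope_true = yield_area_slope(true_area, output)
slope_measured = yield_area_slope(measured_area, output)

print("slope using true area:    ", slope_true)
print("slope using measured area:", slope_measured)
```

Under these assumptions the slope based on mismeasured area is negative even though true yield does not vary with plot size, which is why comparing estimates across original, imputed, and re-measured areas is informative.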
This study advances the literature by introducing a robust, data-driven framework for detecting and correcting GPS measurement errors in agricultural surveys, demonstrating the utility of repeated measurements and machine learning for improving data quality, and providing evidence that environmental and surveyor factors significantly influence GPS accuracy. It also shows that correcting measurement anomalies can alter key empirical relationships, such as the inverse land size–productivity relationship. The findings underscore the importance of high-quality measurement for credible agricultural statistics and policy analysis. By combining machine learning-based outlier detection, predictive imputation, and repeated measurement validation, the paper offers a practical toolkit that can be adapted to other survey domains and integrated into field operations to enhance data reliability. Future research should extend these methods to other contexts and explore their integration into routine survey practice.