Integrated Tool for AI-assisted Exploratory Data Analysis and Quality Control of Heterogeneous Statistical Data
Conference
65th ISI World Statistics Congress
Format: IPS Abstract - WSC 2025
Session: IPS 959 - Sharing and Accessing Granular Administrative Data
Wednesday 8 October 2 p.m. - 3:40 p.m. (Europe/Amsterdam)
Abstract
The Data Science team at the BELab Data Laboratory of Banco de España has recently developed an integrated tool to support diverse user groups in performing Exploratory Data Analysis (EDA) and Data Quality Management (DQM) on the microdatasets hosted by BELab. Despite significant variations in the size, nature, and confidentiality levels of these datasets, they share a common structure that enables a high degree of standardization and generalization in EDA and DQM processes. As a result, the same processing can be applied across a wide variety of datasets, and the tool is suitable for a broad spectrum of users, including data producers, lab technicians, data analysts, and researchers.
For EDA, the tool supports the standardized exploration of highly heterogeneous tabular datasets. Users can examine data at various levels of detail, including both aggregated information and the underlying microdata, depending on their access rights and specific interests. The tool also enables the automatic creation of interactive exploratory dashboards for large collections of tabular datasets and can serve as a data catalog to showcase available data. Additionally, it incorporates an AI-powered interface that allows users to ask questions about both aggregated and micro-level data.
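As a rough illustration of what a standardized, aggregated view of a heterogeneous table might look like, the sketch below profiles each column without knowing its type in advance. It is not the BELab tool's implementation: the function names `profile_column` and `profile_table`, the row-as-dictionary layout, and the sample records are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def profile_column(values):
    """Summarize one column without assuming its type in advance."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Treat a column as numeric only if every non-missing value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        summary.update(kind="numeric", min=min(numeric),
                       max=max(numeric), mean=mean(numeric))
    else:
        # Otherwise report the most frequent values as a categorical summary.
        summary.update(kind="categorical",
                       top_values=Counter(present).most_common(3))
    return summary


def profile_table(rows):
    """Profile every column that appears in a list of row dictionaries."""
    columns = sorted({c for r in rows for c in r})
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}


# Hypothetical sample records, invented for illustration only.
rows = [
    {"sector": "retail", "amount": 100},
    {"sector": "retail", "amount": 250},
    {"sector": "energy", "amount": None},
]
profile = profile_table(rows)
```

Because the summary contains only aggregates, a view of this kind could in principle be shown to users who are not allowed to see the underlying microdata, though how the tool enforces access rights is not described here.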
In terms of data quality, the tool automatically evaluates several dimensions of the DAMA framework for DQM, namely completeness, uniqueness, validity, and consistency. It checks whether the data align with the associated metadata, identifies structural and formatting issues, and detects duplicate records and invalid entries. Its EDA capabilities also support the identification and analysis of inconsistencies. Furthermore, the tool includes machine learning–based multivariate analysis and anomaly detection functionalities.
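The four dimensions named above can be made concrete with a minimal sketch that checks a small table against its metadata. This is an assumption-laden illustration, not the tool's actual code: the `METADATA` layout (column name mapped to expected type and nullability), the `quality_report` function, the key columns, and the sample records are all hypothetical.

```python
# Hypothetical metadata: column name -> (expected Python type, nullable?)
METADATA = {
    "firm_id": (str, False),
    "year": (int, False),
    "turnover": (float, True),
}


def quality_report(rows, metadata, key):
    """Return simple completeness, uniqueness, validity and consistency indicators."""
    n = len(rows)
    report = {"completeness": {}, "validity": {}, "consistency": {}}

    # Consistency with metadata: columns missing from, or unexpected in, the data.
    seen = set().union(*(r.keys() for r in rows)) if rows else set()
    report["consistency"]["missing_columns"] = sorted(set(metadata) - seen)
    report["consistency"]["unexpected_columns"] = sorted(seen - set(metadata))

    for col, (expected_type, nullable) in metadata.items():
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        # Completeness: share of non-missing values in the column.
        report["completeness"][col] = len(non_null) / n if n else 1.0
        # Validity: values of the wrong type, plus nulls where none are allowed.
        bad_type = sum(not isinstance(v, expected_type) for v in non_null)
        bad_null = 0 if nullable else n - len(non_null)
        report["validity"][col] = bad_type + bad_null

    # Uniqueness: number of records that repeat an already-seen key.
    keys = [tuple(r.get(k) for k in key) for r in rows]
    report["uniqueness"] = {"duplicate_records": len(keys) - len(set(keys))}
    return report


# Hypothetical sample records, invented for illustration only.
rows = [
    {"firm_id": "A1", "year": 2020, "turnover": 10.5},
    {"firm_id": "A1", "year": 2020, "turnover": 10.5},    # duplicate key
    {"firm_id": "B2", "year": "2021", "turnover": None},  # year stored as text
    {"firm_id": None, "year": 2022, "turnover": 3.0, "note": "x"},
]
report = quality_report(rows, METADATA, key=("firm_id", "year"))
```

On these sample records the report flags one duplicate record, one invalid `year`, one disallowed missing `firm_id`, and one column (`note`) that the metadata does not declare. The machine learning–based anomaly detection mentioned above is outside the scope of this sketch.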
The tool was designed and implemented with modularity and ease of sharing as core principles, facilitating the integration of new functionalities and supporting its adoption across a wide range of contexts.