Measuring the quality of official statistics based on web data
Conference
65th ISI World Statistics Congress
Format: IPS Abstract - WSC 2025
Session: IPS 776 - Web Data for Official Statistics – Methodology, Quality, Production and Community
Wednesday 8 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
The ESSnet “Trusted Smart Statistics – Web Intelligence Network (WIN)” is a project within the European Statistical System (ESS) that engages 17 organisations from 14 European countries. It aims to develop a web intelligence system at ESS level, thereby creating the conditions for integrating web data into official statistics.
One work package (WP2) of this ESSnet covers already well-established use cases, such as online job advertisements (OJA) and online-based enterprise characteristics (OBEC), with the ambition of moving them into statistical production soon. Another work package (WP3) focuses on new types of web data sources, such as web data on the real estate market, construction activities, online prices or hotel prices. For these use cases, the aim of the ESSnet is to produce experimental statistics.
Building on the experience gained in the different use cases, a dedicated work package (WP4) consolidates the knowledge generated in the WIN on methodology, architecture and quality in collecting, processing and analysing web data.
In this paper we present specific quality indicators that become relevant when incorporating web data into official statistics. These indicators go beyond the well-known quality frameworks, since new methods of acquiring and processing data lead to new quality challenges that traditional quality frameworks do not yet cover.
New processes, such as extracting information from scraped text and deriving classification variables (such as NUTS or NACE), also shift the necessary quality assessment procedures. Classification algorithms may process huge amounts of information automatically, but experience shows that human resources are needed to train and validate them. This is done with the help of manual annotations of samples of automatically classified records.
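To illustrate the validation step, the following minimal sketch (not part of the WIN codebase; all names are hypothetical) estimates the accuracy of an automatic classifier from a simple random sample of manually annotated records, together with a normal-approximation confidence interval:

import math

def accuracy_with_ci(machine_labels, manual_labels, z=1.96):
    """Estimate classifier accuracy from a manually annotated sample.

    machine_labels: codes assigned by the classification algorithm
    manual_labels:  codes assigned by human annotators (treated as gold standard)
    Returns the point estimate and a normal-approximation confidence interval.
    """
    n = len(manual_labels)
    correct = sum(m == g for m, g in zip(machine_labels, manual_labels))
    p = correct / n                  # estimated accuracy
    se = math.sqrt(p * (1 - p) / n)  # standard error under simple random sampling
    return p, (p - z * se, p + z * se)

The same annotated sample can of course feed richer indicators (per-class precision and recall, confusion matrices); the point is that the quality of such estimates depends directly on how the sample is drawn.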
One obstacle in creating a data set that contains several manually labelled variables per record is the high cost, in terms of time and resources, of manually annotating a large number of records. To mitigate these costs, one can reduce the volume by creating a sampling design in which certain margins are well represented in the sampled data. Drawing such a sample can yield higher quality for a given number of cases to be annotated, or fewer cases for a given quality of the accuracy estimates, as the sketch below indicates.
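As a sketch of how such a sample could be drawn (an illustration under simplifying assumptions; the stratum definition, annotation budget and field names are hypothetical), the following allocates a fixed annotation budget proportionally across strata, with a floor per stratum so that small margins remain represented:

import random
from collections import defaultdict

def stratified_annotation_sample(records, stratum_of, budget, floor=30, seed=1):
    """Draw an annotation sample with proportional allocation and a per-stratum floor.

    Note: the realised sample size may exceed the budget when many small
    strata are lifted to the floor.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    n_total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        n_h = max(floor, round(budget * len(members) / n_total))
        n_h = min(n_h, len(members))  # cannot draw more than the stratum holds
        sample.extend(rng.sample(members, n_h))
    return sample

# e.g. stratum_of = lambda rec: rec["nace_section"]  # hypothetical field name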
In addition to quality considerations about the processes involved along the statistical production chain, we also present lessons learned from a centralised scraping platform.