Exploring NLP and Web Data as Complementary Sources in Tourism Statistics
Conference
Format: IPS Abstract - IAOS 2026
Session: Tourism and New Data Sources
Wednesday 13 May 2:30 p.m. - 4 p.m. (Europe/Vilnius)
Abstract
Contemporary tourism is becoming increasingly complex, while the demand for timely and multidimensional statistical information continues to grow. This creates significant challenges for official statistics. Traditional survey-based approaches, despite their importance, are limited by cost, respondent burden, and a weak ability to capture dynamic phenomena, particularly short-term travel such as same-day visits.
Addressing these challenges requires the integration of multiple data sources. Alongside traditional statistical data, alternative sources are gaining importance, such as administrative registers, data collected from online platforms (including collaborative economy platforms), and transaction data. In this context, methods based on Natural Language Processing (NLP) and web scraping offer new opportunities to identify unregistered or “hidden” segments of the tourism market and may also reduce respondent burden.
However, the use of such data introduces significant methodological challenges related to the “uncertainty of input”, stemming from data incompleteness, noise, and ambiguity in textual and web-based sources. The presentation will discuss strategies for managing this uncertainty, including data validation, triangulation across sources, and robustness checks in model-based inference. Particular attention will be given to exploring how these approaches may complement traditional survey data, with careful consideration of their implications for statistical quality.
At the same time, these approaches may support the gradual modernisation of tourism statistics by contributing more timely and complementary information, although they raise challenges of data harmonisation, quality assessment, and privacy protection that must be addressed.