From Panel Data to Token Sequences: NLP-inspired Machine Learning for Longitudinal Categorical Data

Author

Meilame Tayebjee

Conference

2026 IAOS Conference

Format: CPS Abstract - IAOS 2026

Keywords: ai and machine learning in statistics, deep learning, nlp, sequential, transformers_modeling

Session: Large Language Models & Machine Learning in official statistics

Tuesday 12 May 4:30 p.m. - 6 p.m. (Europe/Vilnius)

Abstract

Longitudinal data in official statistics and administrative sources often take the form of individual categorical trajectories or pathways, such as sequences of educational states or healthcare events observed over time. In econometrics, such data are traditionally analyzed using panel models, event history analysis, or multi-state transition models, which typically rely on parametric or Markovian assumptions. While well suited to inference on specific transitions or outcomes, these approaches are not primarily designed for high-dimensional representation learning and can become difficult to scale when the number of possible categorical states (i.e., the effective vocabulary) is large.

In this paper, we propose a unified framework that reinterprets longitudinal categorical panel data as symbolic sequences, enabling the use of Natural Language Processing (NLP), Machine Learning, and Deep Learning methods originally developed for text. Individual pathways are tokenized into ordered sequences of categorical states or events, which allows us to apply a wide range of models, from n-gram approaches to embedding-based methods (e.g., Word2Vec) and Transformer architectures, with increasing computational requirements. In parallel, we compare these approaches with more classical machine learning techniques, such as gradient boosting applied to lagged and engineered features, which remain widely used in applied policy settings.
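The tokenization step and the simplest model in that range (a bigram, i.e. first-order, next-state predictor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the record layout `(individual_id, period, state)`, the state labels, and the function names `tokenize`, `fit_bigram`, and `predict_next` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical long-format panel records: (individual_id, period, state).
# The state labels are invented for illustration only.
records = [
    ("a", 2, "BSC_2"), ("a", 1, "BSC_1"), ("a", 3, "BSC_3"),
    ("b", 1, "BSC_1"), ("b", 2, "BSC_1"), ("b", 3, "DROPOUT"),
    ("c", 1, "BSC_1"), ("c", 2, "BSC_2"), ("c", 3, "MSC_1"),
]


def tokenize(records):
    """Group records by individual and order the states by period."""
    by_id = defaultdict(list)
    for ind, period, state in records:
        by_id[ind].append((period, state))
    return {ind: [s for _, s in sorted(rows)] for ind, rows in by_id.items()}


def fit_bigram(sequences):
    """Count state -> next-state transitions across all sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, state):
    """Return the most frequent successor of `state`, or None if unseen."""
    if state not in counts:
        return None
    return counts[state].most_common(1)[0][0]


sequences = tokenize(records)
model = fit_bigram(sequences.values())
print(sequences["a"])
print(predict_next(model, "BSC_1"))
```

Once pathways are stored as ordered token lists, the same `sequences` object could in principle be passed to an embedding or Transformer model in place of the bigram counter; only the model changes, not the data representation.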

We illustrate the framework on two large-scale applications drawn from French administrative data: student higher-education pathways, covering approximately 10 million individuals over seven years, and medical care pathways derived from one of the most comprehensive healthcare databases in the world, encompassing around 70 million patients over five years and nearly 10 billion medical events.

Across these experiments, we show how sequence-based models can serve both predictive objectives (e.g., forecasting future states or outcomes) and representational objectives (e.g., learning low-dimensional embeddings of trajectories). These embeddings enable meaningful comparisons between individuals, the identification of typical pathways, and improved interpretability of complex longitudinal patterns.
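To make the idea of comparing individuals through trajectory vectors concrete, the sketch below uses cosine similarity over simple bag-of-states count vectors. This is a deliberately simplified stand-in: the abstract refers to learned low-dimensional embeddings, whereas count vectors are neither learned nor low-dimensional. The trajectories, event labels, and the helper names `cosine` and `nearest` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical care pathways; the event labels are invented for illustration.
trajectories = {
    "a": ["GP", "GP", "PHARMACY"],
    "b": ["GP", "PHARMACY", "PHARMACY"],
    "c": ["HOSPITAL", "SURGERY", "HOSPITAL"],
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(target_id, trajectories):
    """Return the id of the trajectory most similar to `target_id`."""
    vectors = {ind: Counter(seq) for ind, seq in trajectories.items()}
    target = vectors[target_id]
    others = [ind for ind in vectors if ind != target_id]
    return max(others, key=lambda ind: cosine(target, vectors[ind]))


print(nearest("a", trajectories))
```

With learned embeddings, the `Counter` vectors would be replaced by dense vectors produced by the model, while the similarity and nearest-neighbour logic would stay essentially the same.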

By bridging econometric panel data analysis and modern sequence modeling, this work provides a flexible and scalable toolkit for the analysis of longitudinal categorical data in official statistics. The proposed approach opens new perspectives for descriptive analysis, prediction, and policy evaluation in domains where individual trajectories are central.