Regional Statistics Conference 2026

Toward Robust Tree-Based Ensemble Learning for Genomic Prediction

Format: IPS Abstract - Malta 2026

Keywords: ensemble learning, genomic prediction, machine learning, robust modelling, SNPs

Session: IPS 1218 - Showcasing Technical Research by Women in Statistical Science in Portugal

Friday 5 June, 2:00 p.m. - 3:40 p.m. (Europe/Malta)

Abstract

The analysis of real-world data is often sensitive to violations of model assumptions, a vulnerability that can be particularly pronounced in the presence of data contamination, ranging from recording errors to extreme outliers. For example, in linear regression, even a single outlier can violate the assumption of normally distributed errors, leading to biased parameter estimates and compromised inferential results. Unlike linear regression, Random Forests (RF) and Stochastic Gradient Boosting (SGB), which are tree-based ensemble methods, do not rely on parametric assumptions (e.g., linearity, normality of errors, or homoscedasticity). Nevertheless, their performance can still be affected by data contamination, making robust approaches important even for these flexible machine learning methods.

While data contamination can occur at both the response (output) and covariate (feature) levels, this work primarily focuses on the former. To address this issue, we evaluate the performance of the classical RF and SGB methods through simulations and investigate robust techniques to enhance their resilience to contaminated data. Specifically, we employ a synthetic animal breeding dataset from the literature and introduce several plausible contamination scenarios. This study sheds light on the implications of data contamination in genomic prediction and selection for breeding programs, while providing insight into robust adaptations of RF and SGB that help mitigate the challenges posed by certain types of contamination. The results highlight several robust strategies that improve the resilience of RF and SGB under response contamination. Notably, the best-performing approaches are not tied to a specific learning algorithm, suggesting potential applicability across a broader range of machine learning models.
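To make the idea of response contamination concrete, the following minimal sketch shifts a fraction of simulated responses and compares how a mean and a median react, as they might when aggregating the observations in a single tree leaf. It is an illustration under stated assumptions, not the contamination scenarios or robust adaptations evaluated in this work: the helper `contaminate_response`, the 10% contamination rate, the shift of 25 units, and the Gaussian "phenotypes" are all hypothetical choices.

```python
import random
import statistics


def contaminate_response(y, fraction, shift, seed=0):
    """Return a copy of y in which a given fraction of entries is shifted.

    Hypothetical helper: it mimics one simple kind of response
    contamination (a constant shift applied to randomly chosen records).
    """
    rng = random.Random(seed)
    n_bad = int(round(fraction * len(y)))
    bad_idx = set(rng.sample(range(len(y)), n_bad))
    contaminated = [v + shift if i in bad_idx else v for i, v in enumerate(y)]
    return contaminated, sorted(bad_idx)


# Assumed clean responses, e.g. phenotypes that end up in one tree leaf.
rng = random.Random(42)
clean = [rng.gauss(10.0, 1.0) for _ in range(200)]

# Contaminate 10% of the responses with a large upward shift.
contaminated, bad_idx = contaminate_response(clean, fraction=0.10, shift=25.0)

# Compare how far each summary moves away from its clean-data value.
mean_shift = abs(statistics.mean(contaminated) - statistics.mean(clean))
median_shift = abs(statistics.median(contaminated) - statistics.median(clean))

print(f"contaminated records: {len(bad_idx)}")
print(f"change in mean:   {mean_shift:.3f}")
print(f"change in median: {median_shift:.3f}")
```

In this toy setting the mean moves by the contamination fraction times the shift, while the median moves far less. Replacing a mean-based aggregation or a squared-error loss with a median-based or otherwise outlier-resistant counterpart is one generic way to robustify a tree-based learner, which is in keeping with the remark that such strategies are not tied to a specific algorithm.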