64th ISI World Statistics Congress - Ottawa, Canada

Onifade Femi



Format: CPS Abstract


The aim of quantitative structure-activity relationship (QSAR) studies is to identify novel drug-like molecules that can be suggested as lead compounds. This article discusses two steps toward that aim: first, identifying appropriate molecular descriptors with a single feature-selection algorithm, and second, predicting the biological activities of designed compounds. Recent studies have shown increased interest in using deep learning models to predict the activities of very large numbers of molecules, a setting often described as Big Data. Despite these efforts, critical challenges in QSAR modeling, such as over-fitting and heavy computational cost, remain major shortcomings of deep learning models. Hence, finding the most effective molecular descriptors in the shortest possible time is an ongoing task. One successful way to speed up the extraction of the best features from big datasets is the least absolute shrinkage and selection operator (LASSO). LASSO is a penalized regression method that selects a subset of molecular descriptors, improving prediction accuracy and interpretability by removing irrelevant features. In this study, we combine the least absolute deviation-least absolute shrinkage and selection operator (LAD-LASSO) with the random forest (RF) and the ridge estimator to form two new estimators, LAD-LASSO-RF and LAD-LASSO-Ridge, respectively. First, the 2540 computed DRAGON descriptors were reduced to a smaller number using preprocessing methods. The descriptors most relevant to biological activity were then chosen using the LAD-LASSO variable selection method. To implement and test the proposed model, a random forest or ridge regression was built on the selected descriptors to predict the molecular activities. Finally, the prediction results and computation time of the proposed model were compared with those of other well-known algorithms.
The results revealed that improving output correlation through LAD-LASSO-RF or LAD-LASSO-Ridge appreciably reduces implementation time and model complexity while maintaining prediction accuracy.
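
The two-stage idea described above (LAD-LASSO selection followed by a second estimator fitted on the surviving descriptors) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names `weighted_median`, `lad_lasso` and `ridge_fit` are hypothetical, the toy data are invented, the LAD-LASSO step uses a simple coordinate-descent heuristic that is not guaranteed to reach the global optimum, and only the ridge variant of the second stage is shown (the random forest variant is omitted).

```python
def weighted_median(points, weights):
    """Return a minimizer of sum_k weights[k] * |points[k] - b| over b."""
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in sorted(zip(points, weights)):
        running += weight
        if running >= half:
            return value
    return max(points)


def lad_lasso(X, y, lam, sweeps=50):
    """Heuristic coordinate descent for
    sum_i |y_i - x_i . b| + lam * sum_j |b_j|   (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        previous = list(beta)
        for j in range(p):
            # Residuals of the current fit with descriptor j left out.
            partial = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                       for i in range(n)]
            rows = [i for i in range(n) if X[i][j] != 0.0]
            # The L1 penalty behaves like one extra observation at 0
            # carrying weight lam, so each update is a weighted median.
            points = [partial[i] / X[i][j] for i in rows] + [0.0]
            weights = [abs(X[i][j]) for i in rows] + [lam]
            beta[j] = weighted_median(points, weights)
        if beta == previous:
            break
    return beta


def ridge_fit(X, y, alpha, sweeps=100):
    """Coordinate descent for sum_i (y_i - x_i . b)^2 + alpha * sum_j b_j^2."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            partial = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                       for i in range(n)]
            numerator = sum(X[i][j] * partial[i] for i in range(n))
            denominator = sum(X[i][j] ** 2 for i in range(n)) + alpha
            beta[j] = numerator / denominator
    return beta


# Invented toy data: three descriptors, activity driven by descriptor 0 only
# (y = 2 * x0), with the last activity perturbed to act as an outlier.
X = [
    [1.0, 0.5, -1.0],
    [2.0, -1.0, 0.5],
    [3.0, 1.5, 1.0],
    [4.0, -0.5, -2.0],
    [5.0, 1.0, 1.5],
    [6.0, -1.5, -0.5],
]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 15.0]

# Stage 1: LAD-LASSO keeps only descriptors with a nonzero coefficient.
beta = lad_lasso(X, y, lam=1.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Stage 2: refit a ridge estimator on the selected descriptors only.
X_selected = [[row[j] for j in selected] for row in X]
ridge_beta = ridge_fit(X_selected, y, alpha=1.0)

print("LAD-LASSO coefficients:", beta)
print("selected descriptors:", selected)
print("ridge coefficients on selected descriptors:", ridge_beta)
```

On this toy data the selection stage keeps only descriptor 0 and is not pulled by the perturbed observation, while the ridge refit uses squared error and is therefore affected by it. With thousands of descriptors, a dedicated solver (for example an L1-penalized median regression from a statistics library) would be a more realistic choice than this hand-written loop.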