Data-Driven Facies Prediction: A Comparative Study of Random Forest, XGBoost, SVM, CatBoost, and K-Means
Abstract
Facies classification is central to characterizing subsurface heterogeneity and supporting effective reservoir development. Traditional methods, which rely on core interpretation and manual log analysis, are limited by subjective interpretation and sparse data coverage. This study aims to improve facies prediction by comparing five machine learning models: Random Forest, XGBoost, Support Vector Machine (SVM), CatBoost, and K-Means clustering. The dataset is derived from sandstone formations on Labuan Island, Malaysia, and is augmented with synthetic data generated through Latin Hypercube Sampling to address data scarcity. Feature selection is performed using three independent techniques to identify the most informative variables, and Principal Component Analysis is used to investigate feature relationships. Models are evaluated using classification accuracy, precision-recall metrics, receiver operating characteristic curves, and confusion matrices. Among the models tested, CatBoost achieved the highest cross-validation accuracy at 95.4%, followed by XGBoost at 93.7%. Random Forest achieved a test accuracy of 89.5%, while SVM performed less reliably, with a test accuracy of 85.6%. The unsupervised K-Means clustering approach yielded an overall accuracy of 49.7% when its predicted clusters were aligned with the true facies labels. The results indicate that ensemble methods are effective for facies classification and support the use of augmented data to enhance model performance. This approach provides a practical framework for applying machine learning in geological settings, with potential benefits for reservoir modeling and development planning.
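The abstract does not say how Latin Hypercube Sampling was configured, so the following is only a minimal sketch of the general technique, not the study's procedure. Each feature range is split into as many equal-width strata as there are samples, one value is drawn inside each stratum, and the strata are shuffled independently per feature. The function name `latin_hypercube` and the example feature bounds are hypothetical, and the sketch samples features independently, without the facies labels or inter-feature correlations that a real augmentation step would likely need to respect.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points; bounds is a list of (low, high), one per feature."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of n_samples equal-width strata of [0, 1).
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across features.
        rng.shuffle(strata)
        # Rescale from the unit interval to the feature's range.
        columns.append([low + u * (high - low) for u in strata])
    # Transpose the per-feature columns into one row per synthetic sample.
    return [list(row) for row in zip(*columns)]


# Hypothetical feature ranges, for illustration only (not taken from the study):
# gamma ray in API units and porosity as a fraction.
bounds = [(20.0, 150.0), (0.05, 0.35)]
samples = latin_hypercube(5, bounds, seed=0)
for row in samples:
    print(row)
```

Compared with plain random sampling, this stratification guarantees that every portion of each feature's range is represented exactly once, which is why the method is attractive when only a few synthetic points are generated.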