Risk prediction of oral premalignant lesions (oral cancer) using explainable machine learning through a community-based cross-sectional study in rural India

Sundar M
Siva M
Poornima B. Khot
Maliakel Steffi Francis S

Published on Mar 2, 2026


Abstract

Background Oral cancer is a major public health problem in India, particularly in rural settings, where people are exposed to high-risk factors associated with tobacco use and have limited access to structured screening services. Although visual oral examination has been shown to lower oral cancer mortality, population-based screening is resource intensive and hard to sustain in the primary health care system. Screening programs could be made more efficient by risk stratification methods that identify people at greater risk of oral premalignant lesions. Recent developments in machine learning offer an opportunity for data-driven risk prediction, although concerns about model transparency and applicability to specific populations remain.

Methods A community-based cross-sectional study was conducted among 3,700 adults in 100 rural clusters of Hassan District, Karnataka. Structured interviews were used to collect data on sociodemographic characteristics, tobacco and alcohol use, and oral hygiene practices, followed by a clinical oral examination based on World Health Organization guidelines. Supervised machine learning models, comprising support vector machine, random forest, and extreme gradient boosting (XGBoost), were developed and evaluated using cross-validation. SHapley Additive exPlanations (SHAP) were used for model interpretation, and population-level risk stratification was performed using unsupervised K-means clustering.

Results XGBoost showed the best predictive performance (area under the receiver operating characteristic curve = 0.91, accuracy = 85.7%). Older age, tobacco exposure, and poor oral health were the most consistent predictors across models. Clustering analysis identified a high-risk subgroup with a significantly greater burden of oral premalignant lesions, supported by regression-based estimates (odds ratio = 3.46, 95% confidence interval: 2.58–4.72).

Conclusion Explainable machine learning-based risk prediction can support population stratification and targeted screening strategies for detecting oral premalignant lesions in rural areas, enabling more effective use of scarce public health resources in primary healthcare systems.
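To make the reported odds ratio and confidence interval easier to read, the sketch below shows one common way such an estimate can be computed for a single binary grouping (for example, high-risk cluster versus all others). It is an illustration only: the abstract does not state the regression specification, the function name `odds_ratio_wald_ci` is hypothetical, and the counts are invented for demonstration and are not the study's data. For one binary predictor, an unadjusted logistic regression gives the same odds ratio as the 2x2 cross-product, with a Wald interval computed on the log scale.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: lesion present, high-risk group
    b: lesion absent,  high-risk group
    c: lesion present, comparison group
    d: lesion absent,  comparison group
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) (Woolf's method)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only; NOT the study's data.
estimate, lower, upper = odds_ratio_wald_ci(a=40, b=160, c=20, d=280)
print(f"OR = {estimate:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
```

An interval whose lower bound lies above 1, as in the reported 2.58–4.72, indicates that the higher odds in the high-risk group are unlikely to be explained by sampling variation alone. An adjusted model with additional covariates would generally give a different estimate from this unadjusted calculation.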
