Integrated Logistic Regression–XGBoost and Bayesian Network Models for Disease Prediction
DOI: 10.25236/iwmecs.2025.020
Author(s)
Yihao Zhu, Ruixin Zhang, Junfeng Li
Corresponding Author
Yihao Zhu
Abstract
Cardiovascular disease, stroke, and cirrhosis pose major health threats worldwide. This study performs data-driven disease risk prediction and association analysis on three disease datasets. In the first step, the stroke, heart disease, and cirrhosis datasets were systematically preprocessed: outliers were handled on the basis of the K-S test, missing values were filled by spline interpolation, and the data were standardized and analyzed visually. Correlation analysis and chi-square tests identified key influencing factors such as age, ST-segment depression, and albumin. In the second step, an integrated logistic regression–XGBoost model was constructed to predict the probability of each of the three diseases, and model performance was evaluated with accuracy, AUC-ROC, and other metrics. Logistic regression reached an accuracy of 68% for heart disease prediction, while XGBoost's performance on the multi-class liver cirrhosis task left room for improvement. The results showed that the probability of heart disease–cirrhosis comorbidity reached 83.33% in the high-risk group and 25.75% in the high-risk group, revealing a strong correlation between the diseases. This study provides data support and a methodological reference for multi-disease risk prediction and coordinated prevention and control.
Keywords
K-S test; Spline function; Logistic regression–XGBoost integration; Bayesian network; Disease prediction; Comorbidity analysis
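
Illustrative sketch
The abstract does not say how the logistic regression and XGBoost outputs are combined, so the following is only a minimal sketch under one common assumption: a weighted average ("soft vote") of the two models' positive-class probabilities, scored with accuracy and AUC-ROC as named above. The function names (soft_vote, accuracy, auc_roc), the weight w_a, and the probability vectors are hypothetical toy values for illustration, not the authors' code or data.

```python
from typing import List, Sequence


def soft_vote(p_a: Sequence[float], p_b: Sequence[float], w_a: float = 0.5) -> List[float]:
    """Weighted average of two models' positive-class probabilities.

    Assumption: the paper's "integration" is treated here as simple
    probability averaging; the source does not confirm this.
    """
    if len(p_a) != len(p_b):
        raise ValueError("probability vectors must have equal length")
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(p_a, p_b)]


def accuracy(y_true: Sequence[int], p: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if score >= threshold else 0 for score in p]
    return sum(int(a == b) for a, b in zip(preds, y_true)) / len(y_true)


def auc_roc(y_true: Sequence[int], p: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half (O(n^2), fine for a sketch)."""
    pos = [s for s, y in zip(p, y_true) if y == 1]
    neg = [s for s, y in zip(p, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels and model outputs (not taken from the paper).
y_true = [0, 0, 1, 1, 0, 1]
p_lr = [0.2, 0.6, 0.7, 0.4, 0.1, 0.9]   # stand-in for logistic regression
p_xgb = [0.3, 0.2, 0.8, 0.7, 0.4, 0.6]  # stand-in for XGBoost
p_ens = soft_vote(p_lr, p_xgb)

for name, probs in [("LR", p_lr), ("XGB", p_xgb), ("ensemble", p_ens)]:
    print(name, round(accuracy(y_true, probs), 3), round(auc_roc(y_true, probs), 3))
```

In practice the two probability vectors would come from fitted models evaluated on held-out data, and the weight w_a would be tuned on a validation split rather than fixed at 0.5.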