Current Biotechnology ›› 2023, Vol. 13 ›› Issue (5): 798-806. DOI: 10.19586/j.2095-2341.2023.0063

• Research Article •

Research on Feature Selection of Gut Microbiota and Disease Prediction Model Based on Weighted Average

Haitao CAO1, Jing ZHU1, Haibo ZENG2, Yanchen LIU1

  1. Computer and Information Engineering College, Xinjiang Agricultural University, Urumqi 830052, China
  2. Friendship Hospital of Urumqi, Urumqi 830049, China
  • Received: 2023-05-05 Accepted: 2023-07-05 Online: 2023-09-25 Published: 2023-10-10
  • Corresponding author: Jing ZHU
  • About the author: Haitao CAO, E-mail: 2232060551@qq.com
  • Funding:
    National Natural Science Foundation of China (31860649)

Abstract:

Using metagenomic analysis to predict human disease and health status and to discover biomarkers is a current research focus. In this study, raw metagenomic data were processed with the bioinformatics tools KneadData and MetaPhlAn2 for quality control and removal of host contamination, yielding clean sequences. Dimensionality reduction methods and a random forest model were then used to select microbial taxa highly correlated with disease occurrence, and these taxa replaced the original data features as input to the disease prediction model. A fusion disease prediction model was built with multilayer perceptron (MLP), support vector machine (SVM), and extreme gradient boosting (XGBoost) as sub-models. After feature selection, cross-validation on three datasets (liver cirrhosis, type 2 diabetes, and obesity) gave AUC values of 0.9286, 0.6521, and 0.5747, respectively. According to the area under the ROC curve, the model built on the selected taxa can efficiently and accurately screen for and diagnose disease and can effectively distinguish healthy individuals from patients, providing a useful reference for establishing a new non-invasive, quantifiable auxiliary diagnostic method.

Key words: disease prediction, intestinal microbiota, feature screening, fusion model, metagenomics
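
The abstract names the sub-models (MLP, SVM, XGBoost) and the evaluation metric (AUC) but does not state how the sub-model outputs are combined or how the weights are chosen. The sketch below is only one plausible reading of "weighted average" fusion: each sub-model's predicted disease probability is averaged with a fixed weight, and the fused score is evaluated by AUC. The function names, the toy probabilities, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def weighted_average_fusion(
    probs_per_model: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Fuse per-model disease probabilities with a normalized weighted average."""
    if len(probs_per_model) != len(weights):
        raise ValueError("need exactly one weight per sub-model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probs_per_model[0])
    if any(len(p) != n_samples for p in probs_per_model):
        raise ValueError("all sub-models must score the same samples")
    return [
        sum(w * p[i] for w, p in zip(weights, probs_per_model)) / total
        for i in range(n_samples)
    ]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (patient, healthy) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = patient, 0 = healthy control.
labels = [1, 1, 0, 0]
mlp_probs = [0.9, 0.4, 0.6, 0.2]  # stand-in for MLP outputs
svm_probs = [0.8, 0.6, 0.3, 0.1]  # stand-in for SVM outputs
xgb_probs = [0.7, 0.5, 0.4, 0.3]  # stand-in for XGBoost outputs
weights = [0.5, 0.3, 0.2]         # assumed weights, not from the paper

fused = weighted_average_fusion([mlp_probs, svm_probs, xgb_probs], weights)
auc_mlp = roc_auc(labels, mlp_probs)
auc_fused = roc_auc(labels, fused)
```

In this toy case the MLP alone misranks one patient/control pair, while the fused score ranks every pair correctly. In practice the weights would be tuned on held-out data, for example in proportion to each sub-model's validation AUC, which is again an assumption rather than the authors' stated procedure.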

CLC Number: