生物技术进展 (Current Biotechnology) ›› 2024, Vol. 14 ›› Issue (2): 323-330. DOI: 10.19586/j.2095-2341.2023.0145

• Research Article •

A Gut Microbiota Signature for Disease Type Prediction Based on the Abundance of Ruminococcus Microbiota

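徐婷 (Ting XU)1, 沈佳豪 (Jiahao SHEN)2, 赵康 (Kang ZHAO)1, 黄鹭 (Lu HUANG)1, 董恩惠 (Enhui DONG)1, 曾可心 (Kexin ZENG)3, 卞新为 (Xinwei BIAN)3, 季明辉 (Minghui JI)1, 许勤 (Qin XU)1

  1. School of Nursing, Nanjing Medical University (南京医科大学护理学院), Nanjing 211166
  2. School of Chinese Medicine · School of Integrated Chinese and Western Medicine, Nanjing University of Chinese Medicine (南京中医药大学中医学院·中西医结合学院), Nanjing 210023
  3. The First Clinical Medical College of Nanjing Medical University (南京医科大学第一临床医学院), Nanjing 211166
  • Received: 2023-11-11 Accepted: 2023-12-21 Online: 2024-03-25 Published: 2024-04-17
  • Corresponding authors: 季明辉 (Minghui JI), 许勤 (Qin XU)
  • About the author: 徐婷 (Ting XU), E-mail: tingxu1229@stu.njmu.edu.cn
  • Funding:
    National Natural Science Foundation of China (82073407); Priority Academic Program Development of Jiangsu Higher Education Institutions, "Nursing" (苏政办发〔2018〕87号); Jiangsu Province Key Discipline Project of the 13th Five-Year Plan, "Nursing" (苏教研〔2016〕9号)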

Bacterial Signature for Prediction of Disease Type Based on Abundance of Ruminococcus

Ting XU1, Jiahao SHEN2, Kang ZHAO1, Lu HUANG1, Enhui DONG1, Kexin ZENG3, Xinwei BIAN3, Minghui JI1, Qin XU1

  1. School of Nursing, Nanjing Medical University, Nanjing 211166, China
  2. School of Integrated Chinese and Western Medicine, Nanjing University of Chinese Medicine, Nanjing 210023, China
  3. The First Clinical Medical College of Nanjing Medical University, Nanjing 211166, China
  • Received: 2023-11-11 Accepted: 2023-12-21 Online: 2024-03-25 Published: 2024-04-17
  • Contact: Minghui JI, Qin XU

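摘要 (Abstract):

To explore the value of the gut microbiota in predicting disease type, a non-invasive disease evaluation model was constructed with machine learning based on Ruminococcus abundance. Data were taken from the ExperimentHub R package repository: Ruminococcus abundance in human fecal samples from different studies was downloaded together with information on study protocol, disease status, age, sex, antibiotic use, region, smoking status, and other variables. Machine learning models including random forest, decision tree, and AdaBoost were used to build evaluation models for disease screening, parameters were tuned with GridSearchCV (grid search), and the external validation results were evaluated with a confusion matrix. After data processing, 12 types of Ruminococcus and 7 diseases were extracted and given standardized names, and 25 variables were converted to dummy variables. Three evaluation models were built from the abundance of multiple Ruminococcus microbes and general sample information such as sex and age. The random forest model had the highest accuracy (0.884), and with n_estimators set to 220 its score was 0.892, making it the best model. External validation also showed that the classification algorithm made relatively few prediction errors and that model performance was good. Based on metagenomic data from fecal samples, the random forest algorithm can effectively predict disease type from Ruminococcus abundance.

关键词 (Keywords): predictive modeling, gut microbiota, Ruminococcus, machine learning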

Abstract:

To explore the value of the intestinal flora in predicting disease type, this study used machine learning to construct a non-invasive disease evaluation model based on the abundance of Ruminococcus. Data from different studies were downloaded from the ExperimentHub repository in R. Ruminococcus abundance in human fecal samples, together with study condition, disease state, age, sex, antibiotic use, region, smoking status, and other information, was selected, and evaluation models for disease screening were established using machine learning classifiers such as random forest, decision tree, and AdaBoost. Parameters were tuned with GridSearchCV, and the external validation results were evaluated with a confusion matrix. Three evaluation models were established based on the abundance of Ruminococcus and general sample information such as sex and age. The random forest model had the highest accuracy (0.884); with n_estimators set to 220, its score was 0.892, making it the best model. External validation also showed that the classifier made relatively few prediction errors and that the model performed well. Based on metagenomic data from fecal samples, the random forest algorithm can effectively predict disease type from the abundance of Ruminococcus.

Key words: modeling prediction, intestinal flora, Ruminococcus, machine learning
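
The workflow summarized in the abstract (dummy-variable encoding, three classifiers, a grid search over n_estimators, and a confusion matrix) might look roughly like the following scikit-learn sketch. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, the column names and three hypothetical disease classes are invented for the example, and a held-out split stands in for the external validation set described in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (assumption): NOT the ExperimentHub data used in the study.
df = pd.DataFrame({
    "ruminococcus_a": rng.random(n),   # hypothetical relative abundances
    "ruminococcus_b": rng.random(n),
    "age": rng.integers(20, 80, n),
    "sex": rng.choice(["female", "male"], n),
    "smoker": rng.choice(["yes", "no"], n),
})

# Hypothetical 3-class disease label, loosely tied to one abundance plus noise.
score = df["ruminococcus_a"] + 0.1 * rng.standard_normal(n)
y = np.digitize(score, [0.33, 0.66])  # classes 0, 1, 2

# Dummy-variable transformation of the categorical columns.
X = pd.get_dummies(df, columns=["sex", "smoker"], dtype=float)

# Held-out split used here in place of a true external validation cohort.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the three model families named in the abstract and compare accuracy.
models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Grid search over n_estimators; the grid values are illustrative
# (220 is included because the abstract reports it as the best setting).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 220]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Confusion matrix of the tuned model on the held-out samples.
cm = confusion_matrix(y_test, search.predict(X_test), labels=[0, 1, 2])

print(accuracies)
print(search.best_params_, round(search.best_score_, 3))
print(cm)
```

Off-diagonal entries of the confusion matrix count misclassified samples, which is how "relatively few prediction errors" would be read from it. The numbers printed here reflect only the synthetic data and say nothing about the study's reported accuracies.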

中图分类号 (CLC Number):