Current Biotechnology ›› 2023, Vol. 13 ›› Issue (4): 645-653. DOI: 10.19586/j.2095-2341.2022.0007
• Techniques and Methods •
Yunpeng MA, Jing ZHU, Xinghua CUI
About the author: Yunpeng MA, E-mail: 1020264950@qq.com
Received: 2022-01-17
Accepted: 2022-03-09
Online: 2023-07-25
Published: 2023-08-03
Contact: Jing ZHU
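Abstract:
Microbial communities strongly influence the macroscopic properties of their environment, but microbial data are high-dimensional, complex, and sparse, which poses new challenges for understanding the relationship between microorganisms and the ecological environment. Advances in machine learning and the widespread adoption of second-generation DNA sequencing offer a new way to address this problem. Using soil microbiome and dissolved organic carbon (DOC) data from a 44-day plant litter decomposition experiment comprising 308 samples, 12 commonly used machine learning models were built with 1,709 bacterial operational taxonomic units (OTUs) as features. Feature selection was carried out with an embedded method, a wrapper method, and a fused embedded-wrapper method, and the gradient boosting decision tree (GBDT) was chosen as the best model for parameter optimization. Root mean square error (RMSE), mean absolute error (MAE), and goodness of linear fit (R²) served as evaluation metrics. The results show that feature selection reduced the data dimensionality and improved model accuracy; in the simulation experiments, the fused embedded-wrapper method performed best among the applied models. The fused embedded-wrapper method was combined with GBDT to build a DOC prediction model, and experiments verified the model's effectiveness. These results offer a new approach to estimating dissolved organic carbon from bacterial microbial data with machine learning.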
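The three evaluation metrics named in the abstract can be written out in a few lines. The sketch below is a minimal illustration, not the authors' code: it assumes R² is the usual coefficient of determination (1 minus residual sum of squares over total sum of squares), which this page does not spell out, and the `observed` and `predicted` values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical DOC values (mg/g) and model predictions, for illustration only.
observed = [10.0, 7.0, 5.0, 8.0, 10.0]
predicted = [9.0, 8.0, 5.0, 8.0, 9.0]

scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r2(observed, predicted),
}
```

Lower RMSE and MAE and higher R² indicate a better fit, which is how the result tables on this page should be read.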
Yunpeng MA, Jing ZHU, Xinghua CUI. Content Estimating of Microbial Dissolved Organic Carbon Based on Machine Learning[J]. Current Biotechnology, 2023, 13(4): 645-653.
Sample | OTU_401 | OTU_12 | OTU_960 | OTU_20 | OTU_11 | OTU_6 | OTU_25 | DOC content/(mg·g⁻¹) |
---|---|---|---|---|---|---|---|---|
Sample 1 | 0 | 38 | 1 | 17 | 0 | 27 | 0 | 10.46 |
Sample 2 | 140 | 28 | 0 | 0 | 0 | 3 | 0 | 7.00 |
Sample 3 | 262 | 8 | 109 | 11 | 0 | 67 | 18 | 5.00 |
Sample 4 | 102 | 9 | 0 | 1 | 0 | 2 | 0 | 8.42 |
Sample 5 | 2 | 26 | 6 | 73 | 0 | 0 | 0 | 9.46 |
Table 1 Partial sample of the OTU table (OTU columns are OTU IDs)
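The sparsity the abstract mentions is already visible in this small excerpt. As a rough illustration, the sketch below copies the five Table 1 rows into plain Python lists and measures the share of zero counts; the variable names are illustrative, and five samples say nothing about the sparsity of the full 308-sample table.

```python
# OTU counts copied from the five sample rows of Table 1.
otu_names = ["OTU_401", "OTU_12", "OTU_960", "OTU_20", "OTU_11", "OTU_6", "OTU_25"]
counts = [
    [0, 38, 1, 17, 0, 27, 0],      # Sample 1
    [140, 28, 0, 0, 0, 3, 0],      # Sample 2
    [262, 8, 109, 11, 0, 67, 18],  # Sample 3
    [102, 9, 0, 1, 0, 2, 0],       # Sample 4
    [2, 26, 6, 73, 0, 0, 0],       # Sample 5
]
doc_mg_per_g = [10.46, 7.00, 5.00, 8.42, 9.46]

# Fraction of cells that are exactly zero.
total_cells = sum(len(row) for row in counts)
zero_cells = sum(value == 0 for row in counts for value in row)
sparsity = zero_cells / total_cells

# OTUs that are absent from every sample in this excerpt.
all_zero_otus = [
    name
    for column, name in enumerate(otu_names)
    if all(row[column] == 0 for row in counts)
]
```

Columns that are zero across all available samples carry no signal for a regressor, which is one reason feature selection can help on tables like this.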
Algorithm | RMSE | MAE | R² |
---|---|---|---|
Lasso regression | 2.4606 | 1.9791 | 0.3052 |
Elastic net regression | 2.3058 | 1.8044 | 0.3905 |
Support vector machine | 2.3893 | 1.8942 | 0.3421 |
Decision tree | 2.7792 | 2.1314 | 0.0650 |
K-nearest neighbors | 2.7486 | 2.0970 | 0.1467 |
Multilayer perceptron | 2.5490 | 2.1165 | 0.2622 |
Extremely randomized trees (ET) | 1.9955 | 1.5518 | 0.5399 |
Extreme gradient boosting (XGBoost) | 2.0491 | 1.5760 | 0.5081 |
Random forest (RF) | 1.9791 | 1.5154 | 0.5445 |
Adaptive boosting (AdaBoost) | 2.0734 | 1.6205 | 0.5031 |
Bootstrap aggregating (Bagging) | 2.1035 | 1.6229 | 0.4772 |
Gradient boosting decision tree (GBDT) | 1.9554 | 1.4724 | 0.5585 |
Table 2 Prediction results of the machine learning models
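Ranking the Table 2 rows makes the model choice easier to see. The sketch below copies the table values into a dictionary (short English labels are used as keys) and sorts by each metric; it is only a reading aid for the table, not part of the study's pipeline.

```python
# (RMSE, MAE, R2) per algorithm, copied from Table 2.
table2 = {
    "Lasso": (2.4606, 1.9791, 0.3052),
    "ElasticNet": (2.3058, 1.8044, 0.3905),
    "SVM": (2.3893, 1.8942, 0.3421),
    "DecisionTree": (2.7792, 2.1314, 0.0650),
    "KNN": (2.7486, 2.0970, 0.1467),
    "MLP": (2.5490, 2.1165, 0.2622),
    "ExtraTrees": (1.9955, 1.5518, 0.5399),
    "XGBoost": (2.0491, 1.5760, 0.5081),
    "RandomForest": (1.9791, 1.5154, 0.5445),
    "AdaBoost": (2.0734, 1.6205, 0.5031),
    "Bagging": (2.1035, 1.6229, 0.4772),
    "GBDT": (1.9554, 1.4724, 0.5585),
}

# Lower is better for RMSE and MAE; higher is better for R2.
by_rmse = sorted(table2, key=lambda name: table2[name][0])
by_mae = sorted(table2, key=lambda name: table2[name][1])
by_r2 = sorted(table2, key=lambda name: table2[name][2], reverse=True)

top3_by_rmse = by_rmse[:3]
```

All three metrics agree on the same top three (GBDT, random forest, extremely randomized trees), which are the three algorithms that reappear in Table 4.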
算法 | RMSE | MAE | R2 |
---|---|---|---|
套索回归 | 2.460 6 | 1.979 1 | 0.305 2 |
弹性网回归 | 2.305 8 | 1.804 4 | 0.390 5 |
支持向量机 | 2.389 3 | 1.894 2 | 0.342 1 |
决策树 | 2.779 2 | 2.131 4 | 0.065 0 |
K近邻 | 2.748 6 | 2.097 0 | 0.146 7 |
多层感知机 | 2.549 0 | 2.116 5 | 0.262 2 |
极限树 | 1.995 5 | 1.551 8 | 0.539 9 |
极限梯度提升决策树 | 2.049 1 | 1.576 0 | 0.508 1 |
随机森林 | 1.979 1 | 1.515 4 | 0.544 5 |
自适应增强算法 | 2.073 4 | 1.620 5 | 0.503 1 |
引导聚集算法 | 2.103 5 | 1.622 9 | 0.477 2 |
梯度提升决策树 | 1.955 4 | 1.472 4 | 0.558 5 |
OTU ID | |||||||||
---|---|---|---|---|---|---|---|---|---|
OTU_401 | OTU_12 | OTU_960 | OTU_20 | OTU_11 | OTU_6 | OTU_40 | OTU_53 | OTU_150 | OTU_55 |
OTU_57 | OTU_202 | OTU_160 | OTU_574 | OTU_249 | OTU_95 | OTU_188 | OTU_221 | OTU_1469 | OTU_3824 |
OTU_273 | OTU_389 | OTU_292 | OTU_636 | OTU_23 | OTU_101 | OTU_4022 | OTU_539 | OTU_61 | OTU_146 |
OTU_181 | OTU_100 | OTU_120 | OTU_81 | OTU_262 | OTU_1259 | OTU_616 | OTU_5 | OTU_1019 | OTU_1032 |
OTU_138 | OTU_16 | OTU_167 | OTU_170 | OTU_1914 | OTU_21 | OTU_22 | OTU_220 | OTU_227 | OTU_24 |
OTU_267 | OTU_29 | OTU_309 | OTU_313 | OTU_329 | OTU_3858 | OTU_473 | OTU_474 | OTU_534 | OTU_597 |
OTU_70 | OTU_82 | OTU_98 | OTU_1 | OTU_15 | OTU_8 | OTU_7 | OTU_13 | OTU_18 | OTU_26 |
OTU_10 | OTU_9 | OTU_45 | OTU_1033 | OTU_44 | OTU_193 | OTU_35 | OTU_32 | OTU_27 | OTU_131 |
OTU_28 | OTU_51 | OTU_84 | OTU_5179 | OTU_56 | OTU_54 | OTU_77 | OTU_75 | OTU_94 | OTU_1297 |
OTU_1111 | OTU_1200 | OTU_1974 | OTU_103 | OTU_2139 | OTU_950 | OTU_106 | OTU_235 | OTU_251 | OTU_431 |
OTU_358 | OTU_2512 | OTU_713 | OTU_3826 | OTU_179 | OTU_211 | OTU_1119 | OTU_1569 | OTU_201 | OTU_2586 |
OTU_3002 | OTU_320 | OTU_953 | OTU_1509 | OTU_226 | OTU_347 | OTU_169 | OTU_470 | OTU_293 | OTU_5841 |
OTU_363 | OTU_357 | OTU_407 | OTU_458 | OTU_372 | OTU_1052 | OTU_581 | OTU_652 | OTU_5988 | OTU_1550 |
OTU_545 | OTU_698 | OTU_1348 | OTU_5531 | OTU_4794 | OTU_2669 | OTU_516 | OTU_994 | OTU_4277 | OTU_5059 |
Table 3 OTUs selected by RFE-FIS (GBDT) feature selection
Feature selection method | Algorithm | RMSE | MAE | R² |
---|---|---|---|---|
RFE(RF) | Gradient boosting decision tree | 1.9407 | 1.4786 | 0.5792 |
RFE(RF) | Extremely randomized trees | 1.9631 | 1.5010 | 0.5669 |
RFE(RF) | Random forest | 1.9585 | 1.4860 | 0.5669 |
RFE(GBDT) | Gradient boosting decision tree | 1.8212 | 1.3775 | 0.6183 |
RFE(GBDT) | Extremely randomized trees | 1.8551 | 1.4205 | 0.6015 |
RFE(GBDT) | Random forest | 1.9058 | 1.4537 | 0.5818 |
RFE(ET) | Gradient boosting decision tree | 1.9543 | 1.4876 | 0.5563 |
RFE(ET) | Extremely randomized trees | 1.8740 | 1.4259 | 0.5976 |
RFE(ET) | Random forest | 1.9365 | 1.4742 | 0.5661 |
FIS(GBDT) | Gradient boosting decision tree | 1.8644 | 1.4121 | 0.6013 |
FIS(GBDT) | Extremely randomized trees | 1.9371 | 1.4993 | 0.5667 |
FIS(GBDT) | Random forest | 1.9564 | 1.4938 | 0.5558 |
RFE-FIS(GBDT) | Gradient boosting decision tree | 1.8188 | 1.3868 | 0.6203 |
RFE-FIS(GBDT) | Extremely randomized trees | 1.9146 | 1.4663 | 0.5778 |
RFE-FIS(GBDT) | Random forest | 1.9247 | 1.4593 | 0.5702 |
Table 4 Model prediction results
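This page does not describe how the wrapper step (RFE) and the embedded step (FIS) are fused, so the sketch below shows only one plausible arrangement and should not be read as the authors' procedure. It assumes scikit-learn is available, reads FIS as importance-based selection, applies the importance filter first and RFE second (the order is an assumption), and runs on synthetic data from `make_regression` rather than the study's OTU table. The helper `make_gbdt` and the target of 5 features are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE, SelectFromModel

# Synthetic stand-in for an OTU table: 80 samples x 40 features,
# of which 5 drive the target. This is NOT the study's data.
X, y = make_regression(
    n_samples=80, n_features=40, n_informative=5, noise=1.0, random_state=0
)


def make_gbdt():
    """Small GBDT used as the base estimator in both steps (hypothetical settings)."""
    return GradientBoostingRegressor(n_estimators=30, random_state=0)


# Embedded step: keep features whose GBDT importance is at least the mean importance.
embedded = SelectFromModel(make_gbdt(), threshold="mean").fit(X, y)
embedded_idx = np.flatnonzero(embedded.get_support())

# Wrapper step: recursive feature elimination on the surviving features,
# dropping one feature per round until n_keep remain.
n_keep = min(5, len(embedded_idx))
wrapper = RFE(make_gbdt(), n_features_to_select=n_keep, step=1)
wrapper.fit(X[:, embedded_idx], y)

# Map the RFE mask back to column indices of the original matrix.
selected_idx = embedded_idx[wrapper.support_]
```

Chaining the two steps keeps RFE cheap, because the repeated model refits run on a reduced feature set; other fusions (for example, taking the union or intersection of the two selections) are equally consistent with what this page states.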
Parameter | Search range | Search step |
---|---|---|
learning_rate | 0.01-0.2 | 0.01 |
n_estimators | 100-1000 | 1 |
max_depth | 1-10 | 1 |
Table 5 Grid search ranges for the model parameters
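To get a feel for the size of this search space, the grid in Table 5 can be enumerated directly. The sketch below takes the ranges and steps at face value and assumes both endpoints are included; whether the authors searched the full Cartesian product or a coarser subset is not stated on this page. The parameter names match those of scikit-learn's `GradientBoostingRegressor`, though the library actually used is likewise not stated here.

```python
from itertools import islice, product

# Build the candidate values from integers to avoid floating-point drift.
learning_rates = [i / 100 for i in range(1, 21)]  # 0.01, 0.02, ..., 0.20
n_estimators = range(100, 1001)                   # 100, 101, ..., 1000
max_depths = range(1, 11)                         # 1, 2, ..., 10

# Size of the full Cartesian grid.
n_combinations = len(learning_rates) * len(n_estimators) * len(max_depths)

# Peek at the first few candidate settings without materializing the whole grid.
first_three = list(islice(product(learning_rates, n_estimators, max_depths), 3))
```

Under these assumptions the full grid has 180,200 candidate settings, each of which requires at least one model fit, so an exhaustive search at a step of 1 for `n_estimators` is expensive; a coarser step or a staged search would cut the cost considerably.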
Model | RMSE | MAE | R² |
---|---|---|---|
RFE-FIS(GBDT) | 1.8188 | 1.3868 | 0.6203 |
GS-RFE-FIS(GBDT) | 1.7220 | 1.2934 | 0.6599 |
Table 6 Accuracy comparison after parameter optimization
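The size of the gain in Table 6 is easier to judge as a relative change. The sketch below computes it from the two table rows; the function name `relative_change` is illustrative.

```python
# Metric values copied from Table 6.
before_tuning = {"RMSE": 1.8188, "MAE": 1.3868, "R2": 0.6203}  # RFE-FIS(GBDT)
after_tuning = {"RMSE": 1.7220, "MAE": 1.2934, "R2": 0.6599}   # GS-RFE-FIS(GBDT)


def relative_change(before, after):
    """Signed change relative to the starting value (negative means a decrease)."""
    return (after - before) / before


changes = {
    metric: relative_change(before_tuning[metric], after_tuning[metric])
    for metric in before_tuning
}
```

Parameter optimization lowers RMSE by roughly 5% and MAE by roughly 7%, and raises R² by roughly 6% relative to the untuned RFE-FIS(GBDT) model.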