Volume 20, Issue 7
  • ISSN: 1574-8936
  • E-ISSN: 2212-392X

Abstract

Background

Genome assembly tools are used to reconstruct genomic sequences from raw sequencing data, which are then used for identifying the organisms present in a metagenomic sample.

Methodology

More recently, machine learning approaches have been applied to a variety of bioinformatics problems, and in this paper we explore their use for organism identification. We evaluate several commonly used metagenomic assembly tools (PhyloFlash, MEGAHIT, MetaSPAdes, Kraken2, Mothur, UniCycler, and PathRacer) and compare them against state-of-the-art deep learning classification approaches, represented by DNABERT and DeLUCS, on two synthetic mock community datasets.

Results

Our analysis focuses on determining whether ensembling metagenome assembly tools with machine learning tools has the potential to improve identification performance relative to using the tools individually.

Conclusion

We find that this is indeed the case, and we analyze how effective tool ensembling is for organisms with different characteristics, such as repetitiveness, genome size, and GC content.
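
The abstract does not say how the tools' outputs are combined, so the following is only a minimal sketch of what ensembling identification results could look like. It assumes each tool yields a set of organism labels; it is not the authors' method. The tool names, organism labels, `ensemble_union`, `ensemble_vote`, and `precision_recall` are all hypothetical and chosen for illustration.

```python
from collections import Counter


def ensemble_union(calls_by_tool):
    """Return every organism reported by at least one tool."""
    return set().union(*calls_by_tool.values())


def ensemble_vote(calls_by_tool, min_votes=2):
    """Return organisms reported by at least `min_votes` tools."""
    votes = Counter(
        org for calls in calls_by_tool.values() for org in set(calls)
    )
    return {org for org, n in votes.items() if n >= min_votes}


def precision_recall(predicted, truth):
    """Score a predicted organism set against the known mock community."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical mock community and per-tool calls (illustrative data only).
truth = {"org_A", "org_B", "org_C", "org_D"}
calls_by_tool = {
    "assembler": {"org_A", "org_B"},
    "dl_classifier": {"org_B", "org_C", "org_X"},
    "profiler": {"org_A", "org_C"},
}

for name, calls in calls_by_tool.items():
    print(name, precision_recall(calls, truth))
print("union", precision_recall(ensemble_union(calls_by_tool), truth))
print("vote>=2", precision_recall(ensemble_vote(calls_by_tool), truth))
```

In this toy data the union recovers more of the community than any single tool but also keeps the false call, while the two-vote rule drops it. Whether such a trade-off holds on real data depends on the tools and the organisms involved.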

/content/journals/cbio/10.2174/0115748936299440240709070105
2024-07-25
2025-11-05
Supplements

Supplementary material is available on the publisher’s website along with the published article. Data, figure images, and code notebooks are available under the Supplementary Information and on Zenodo: https://zenodo.org/record/7953871#.ZGp8fKXMKh8


  • Article Type:
    Research Article
Keyword(s): kraken2; MEGAHIT; metaSPAdes; mothur; pathRacer; PhyloFlash; uniCycler