Volume 21, Issue 1
  • ISSN: 1574-8936
  • E-ISSN: 2212-392X

Abstract

Background

Genetic information about organisms' traits is encoded in deoxyribonucleic acid (DNA) sequences. How this information is stored within genomes has long been of interest to geneticists and biophysicists.

Objective

The objective of this study was to investigate the distribution of coding sequence (CDS) lengths in species genomes across different kingdoms.

Methods

In this study, we applied the maximum entropy principle and a gamma distribution model to a comprehensive dataset covering viruses, archaea, bacteria, and eukaryotes.
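One standard way the maximum entropy principle leads to a gamma distribution is sketched below. The abstract does not state which constraints were used, so the two constraints here (fixed mean length and fixed mean log-length) are an assumption, and the symbols $p$, $\mu$, $\nu$, $\lambda_i$, $k$, and $\theta$ are introduced only for illustration.

```latex
% Maximize the entropy of the CDS length density p(x), x > 0
\max_{p}\; S[p] = -\int_0^\infty p(x)\,\ln p(x)\,dx
\quad\text{subject to}\quad
\int_0^\infty p(x)\,dx = 1,\qquad
\int_0^\infty x\,p(x)\,dx = \mu,\qquad
\int_0^\infty \ln x\;p(x)\,dx = \nu .

% Stationarity of the Lagrangian gives an exponential-family form
p(x) = \exp\!\left(-\lambda_0 - \lambda_1 x - \lambda_2 \ln x\right)
     = C\,x^{-\lambda_2}\,e^{-\lambda_1 x}.

% Writing k = 1 - \lambda_2 (shape) and \theta = 1/\lambda_1 (scale)
p(x) = \frac{x^{k-1}\,e^{-x/\theta}}{\Gamma(k)\,\theta^{k}},
\qquad
\mathbb{E}[x] = k\theta,\quad
\operatorname{Var}[x] = k\theta^{2},\quad
\text{skewness} = \frac{2}{\sqrt{k}} .
```

Under these assumptions the mean length grows linearly with the scale parameter $\theta$, and the skewness is positive for every shape $k$, so a fitted gamma is always right-skewed.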

Results

CDS length distributions showed kingdom-specific patterns: all were right-skewed, but the preferred length ranges varied among kingdoms. Eukaryotes displayed bimodal distributions, and their CDSs were longer than those of prokaryotes. Fitting the gamma distribution model revealed differences in shape and scale parameters among kingdoms; eukaryotes had larger scale parameters, indicating longer CDSs. Additionally, analysis of the distribution moments highlighted the greater complexity of eukaryotic genomes relative to prokaryotic genomes.
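To make the role of the shape and scale parameters concrete, the following is a minimal sketch of fitting a gamma distribution to a list of CDS lengths. It uses a method-of-moments estimate, which is not necessarily the fitting procedure used in the study, and it runs on synthetic lengths because the study's dataset is not reproduced here. The function names and the parameter values (shape 2.0, scales 450 and 800) are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_gamma_moments(lengths):
    """Method-of-moments estimates of the gamma shape k and scale theta.

    Uses mean = k * theta and variance = k * theta**2, so
    k = mean**2 / variance and theta = variance / mean.
    """
    mean = statistics.fmean(lengths)
    var = statistics.pvariance(lengths, mu=mean)
    if mean <= 0 or var <= 0:
        raise ValueError("lengths must be positive and not all identical")
    return mean * mean / var, var / mean


def sample_skewness(lengths):
    """Third standardized moment; positive values mean a right-skewed sample."""
    mean = statistics.fmean(lengths)
    sd = statistics.pstdev(lengths, mu=mean)
    return sum(((x - mean) / sd) ** 3 for x in lengths) / len(lengths)


def synthetic_cds_lengths(shape, scale, n, seed):
    """Draw n synthetic lengths from a gamma distribution (illustration only)."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) for _ in range(n)]


# Hypothetical parameters, NOT values reported in the study.
prok = synthetic_cds_lengths(shape=2.0, scale=450.0, n=20000, seed=1)
euk = synthetic_cds_lengths(shape=2.0, scale=800.0, n=20000, seed=2)

prok_shape, prok_scale = fit_gamma_moments(prok)
euk_shape, euk_scale = fit_gamma_moments(euk)

print(f"prokaryote-like: shape={prok_shape:.2f}, scale={prok_scale:.1f}")
print(f"eukaryote-like:  shape={euk_shape:.2f}, scale={euk_scale:.1f}")
print(f"skewness (prokaryote-like sample): {sample_skewness(prok):.2f}")
```

With equal shapes, the sample with the larger scale has the larger mean length, which is the sense in which a larger scale parameter indicates longer CDSs. A single gamma cannot capture a bimodal distribution, so real eukaryotic data would likely need a mixture or a separate fit per mode.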

Conclusion

These results deepen our understanding of genome evolution and provide valuable insights for biological research.

/content/journals/cbio/10.2174/0115748936355149250108083111
2025-01-30
2026-02-04
