Exploring Coding Sequence Length Distributions Across Taxonomic Kingdoms Based on Maximum Information Principle

Sebu Aboma Temesgen; Bakanina Kissanga Grace-Mercure; Basharat Ahmad; Yan-Ting Jin; Li Liu; Hao Lin

doi:10.2174/0115748936355149250108083111

ISSN: 1574-8936
E-ISSN: 2212-392X

Authors: Sebu Aboma Temesgen¹, Bakanina Kissanga Grace-Mercure¹, Basharat Ahmad¹, Yan-Ting Jin¹, Li Liu² and Hao Lin¹

¹ School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China ; ² Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
Source: Current Bioinformatics, Volume 21, Issue 1, Jan 2026, p. 82 - 90
DOI: https://doi.org/10.2174/0115748936355149250108083111
- Received: 03 Sep 2024
- Accepted: 03 Dec 2024
- Available online: 30 Jan 2025

Abstract

Background

Genetic information about an organism's traits is encoded in deoxyribonucleic acid (DNA) sequences. How this information is stored within genomes has long interested geneticists and biophysicists.

Objective

The objective of this study was to investigate the distribution of coding sequence (CDS) lengths in species genomes across different kingdoms.

Methods

We applied the maximum entropy principle and a gamma distribution model to a comprehensive dataset covering viruses, archaea, bacteria, and eukaryotes.
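
The abstract does not state how the maximum entropy principle and the gamma model are connected. One standard link, given here as an assumption rather than as the authors' derivation, is that the gamma density is the maximum entropy distribution on positive lengths when the mean length and the mean log-length are fixed. The symbols p, x, \mu, \nu, \lambda_i, k, and \theta are introduced only for this sketch.

```latex
% Maximize the entropy of the length density p(x), x > 0
S[p] = -\int_0^\infty p(x)\,\ln p(x)\,dx
% subject to normalization, a fixed mean, and a fixed mean log-length
\int_0^\infty p(x)\,dx = 1, \qquad
\int_0^\infty x\,p(x)\,dx = \mu, \qquad
\int_0^\infty \ln x\;p(x)\,dx = \nu .
% Stationarity of the Lagrangian with multipliers \lambda_0, \lambda_1, \lambda_2:
-\ln p(x) - 1 - \lambda_0 - \lambda_1 x - \lambda_2 \ln x = 0
\;\Longrightarrow\;
p(x) = e^{-1-\lambda_0}\, x^{-\lambda_2}\, e^{-\lambda_1 x}.
% Writing k = 1 - \lambda_2 and \theta = 1/\lambda_1 gives the gamma density
p(x) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^{k}},
\qquad \mathbb{E}[x] = k\theta,
\quad \operatorname{Var}[x] = k\theta^{2},
\quad \text{skewness} = \frac{2}{\sqrt{k}} .
```

Under this reading, a larger scale \theta at similar shape k corresponds to a longer mean length, and the positive skewness 2/\sqrt{k} corresponds to a right-skewed distribution.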

Results

CDS length distributions differed among kingdoms. CDS lengths followed a right-skewed distribution, with length preferences varying among kingdoms. Eukaryotes displayed bimodal distributions, with CDSs longer than those of prokaryotes. Fitting the gamma distribution model revealed differences in shape and scale parameters among kingdoms; eukaryotes had larger scale parameters, indicating longer CDSs. Additionally, analysis of the distribution moments highlighted the complexity of eukaryotic genomes relative to prokaryotic ones.
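
As a minimal sketch of what estimating gamma shape and scale parameters from a set of CDS lengths can look like, the Python snippet below uses the method of moments on synthetic data. It is not the authors' pipeline: the abstract does not specify the fitting procedure, and the function names (`gamma_moment_fit`, `sample_skewness`) and the synthetic lengths are hypothetical and chosen for illustration only.

```python
import random
import statistics


def gamma_moment_fit(lengths):
    """Method-of-moments estimate of gamma shape k and scale theta.

    Uses mean = k * theta and variance = k * theta**2.
    """
    mean = statistics.fmean(lengths)
    var = statistics.pvariance(lengths, mu=mean)
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


def sample_skewness(lengths):
    """Third standardized moment; positive values mean a right-skewed sample."""
    mean = statistics.fmean(lengths)
    sd = statistics.pstdev(lengths, mu=mean)
    return sum(((x - mean) / sd) ** 3 for x in lengths) / len(lengths)


# Synthetic stand-in for CDS lengths (assumption: not real genome data).
rng = random.Random(0)
synthetic_lengths = [rng.gammavariate(2.0, 500.0) for _ in range(20000)]

shape, scale = gamma_moment_fit(synthetic_lengths)
skew = sample_skewness(synthetic_lengths)
print(f"shape={shape:.2f} scale={scale:.1f} skewness={skew:.2f}")
```

The method of moments is used here because it needs only the standard library. A maximum-likelihood fit would be an equally reasonable choice and may give somewhat different estimates on real, bimodal data.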

Conclusion

These results deepen our understanding of genome evolution and provide valuable insights for biological research.




Article Type: Research Article

Keyword(s): bimodal distributions; DNA; gamma distribution; information storage; length of coding sequence; maximum information
