Volume 20, Issue 3

Current Bioinformatics - Volume 20, Issue 3, 2025

Volume 20, Issue 3, 2025

- Comparative Analysis of Deep Generative Model for Industrial Enzyme Design
  
  Authors: Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang and Fei Guo
  
  https://doi.org/10.2174/0115748936303223240404043202
  More Less
  
  Although enzymes have the advantage of efficient catalysis, natural enzymes lack stability in industrial environments and do not even meet the required catalytic reactions. This prompted us to urgently de novo design new enzymes. As a powerful strategy, computational method can not only explore sequence space rapidly and efficiently, but also promote the design of new enzymes suitable for specific conditions and requirements, so it is very beneficial to design new industrial enzymes. Currently, there exists only one tool for enzyme generation, which exhibits suboptimal performance. We have selected several general protein sequence design tools and systematically evaluated their effectiveness when applied to specific industrial enzymes. We summarized the computational methods used for protein sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To effectively evaluate the ability of the six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. Then we assessed the quality of enzyme sequences generated by these methods on this dataset, including amino acid distribution, EC number validation, etc. We also assessed sequences generated by structure-based methods on existing public datasets using sequence recovery rates and root-mean-square deviation (RMSD) from a sequence and structure perspective. In the functionality dataset, Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences with amino acid distributions and functionalities closely matching those of naturally occurring luciferase enzymes, suggesting their effectiveness in preserving essential enzymatic characteristics. Across both benchmark datasets, ABACUS-R and ProteinMPNN, have also exhibited the highest sequence recovery rates, indicating their superior ability to generate sequences closely resembling the original enzyme structures. Our study provides a crucial reference for researchers selecting appropriate enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, offering high accuracy in sequence recovery and RMSD and maintaining the functional integrity of enzymes through accurate amino acid distribution. Meanwhile, the performance of protein general tools for migration to specific industrial enzymes was fairly evaluated on our specific industrial enzyme benchmark.
  
  Add to my favourites
  
  Email this

- Improved Hybrid Approach for Enhancing Protein-Coding Regions Identification in DNA Sequences
  
  Authors: Emad S. Hassan, Ahmed M. Dessouky, Hesham Fathi, Gerges M. Salama, Ahmed S. Oshaba, Atef El-Emary and Fathi E. Abd El-Samie
  
  https://doi.org/10.2174/0115748936287244240117065325
  More Less
  
  Introduction
  Identifying and predicting protein-coding regions within DNA sequences play a pivotal role in genomic research. This paper introduces an approach for identifying protein-coding regions in DNA sequences by employing a hybrid methodology that combines digital bandpass filtering with wavelet transform and various spectral estimation techniques to enhance exon prediction. Specifically, the Haar and Daubechies wavelet transforms are applied to improve the accuracy of protein-coding region (exon) prediction, enabling the extraction of intricate details that may be obscured in the original DNA sequences.
  Methods
  This research work showcases the utility of Haar and Daubechies wavelet transforms, both non-parametric and parametric spectral estimation techniques, and the deployment of a digital bandpass filter for detecting peaks in exon regions. Additionally, the application of the Electron-Ion Interaction Potential (EIIP) method for converting symbolic DNA sequences into numerical values and the utilization of Sum-of-Sinusoids (SoS) mathematical model with optimized parameters further enrich the toolbox for DNA sequence analysis, ensuring the success of the proposed approach in modeling DNA sequences, optimally, and accurately identifying genes.
  Results
  The outcomes of this approach showcase a substantial enhancement in identification accuracy for protein-coding regions. In terms of peak location detection, the application of Haar and Daubechies wavelet transforms enhances the accuracy of peak localization by approximately (0.01, 3-5 dB). When employing non-parametric and parametric spectral estimation techniques, there is an improvement in peak localization by approximately (0.01, 4 dB) compared to the original signal. The proposed approach also achieves higher accuracy, when compared with existing ones.
  Conclusion
  These findings not only bridge gaps in DNA sequence analysis but also offer a promising pathway for advancing exonic region prediction and gene identification in genomics research. The hybrid methodology presented stands as a robust contribution to the evolving landscape of genomic analysis techniques.
  
  Add to my favourites
  
  Email this

- An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction
  
  Authors: Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde and Jelili Oyelade
  
  https://doi.org/10.2174/0115748936286848240108074303
  More Less
  
  Background
  The use of machine learning models in sequence-based Protein-Protein Interaction prediction typically requires the conversion of amino acid sequences into feature vectors. From the literature, two approaches have been used to achieve this transformation. These are referred to as the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. As observed, studies have predominantly adopted the IPF approach, while others preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding.
  Objective
  This presents the challenge of determining which approach should be adopted for improved HPPPI prediction. Therefore, this work introduces the Extended Protein Feature (EPF) method.
  Methods
  The proposed method combines the predictive capabilities of IPF and MPF, extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested using bacteria, parasite, virus, and plant HPPPI datasets and were deployed to machine learning models, including Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF).
  Results
  The results indicated that MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models, such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF.
  Conclusion
  In conclusion, the EPF approach developed in this study exhibits substantial improvements in four out of the six models evaluated. This suggests that EPF offers competitiveness with IPF and is particularly well-suited for traditional machine learning models.
  
  Add to my favourites
  
  Email this

- Integrated Somatic Mutation Network Diffusion Model for Stratification of Breast Cancer into Different Metabolic Mutation Subtypes
  
  Authors: Dongqing Su, Honghao Li, Tao Wang, Min Zou, Haodong Wei, Yuqiang Xiong, Hongmei Sun, Shiyuan Wang, Qilemuge Xi, Yongchun Zuo and Lei Yang
  
  https://doi.org/10.2174/0115748936298012240322091111
  More Less
  
  Background
  Mutations in metabolism-related genes in somatic cells potentially lead to disruption of metabolic pathways, which results in patients exhibiting different molecular and pathological features.
  Objective
  In this study, we focused on somatic mutation data to investigate the significance of metabolic mutation typing in guiding the prognosis and treatment of breast cancer patients.
  Methods
  The somatic mutation profile of breast cancer patients was analyzed and smoothed by utilizing a network diffusion model within the protein-protein interaction network to construct a comprehensive somatic mutation network diffusion profile. Subsequently, a deep clustering approach was employed to explore metabolic mutation typing in breast cancer based on integrated metabolic pathway information and the somatic mutation network diffusion profile. In addition, we employed deep neural networks and machine learning prediction models to assess the feasibility of predicting drug responses through somatic mutation network diffusion profiles.
  Results
  Significant differences in prognosis and metabolic heterogeneity were observed among the different metabolic mutation subtypes, characterized by distinct alterations in metabolic pathways and genetic mutations, and these mutational features offered potential targets for subtype-specific therapies. Furthermore, there was a strong consistency between the results of the drug response prediction model constructed on the somatic mutation network diffusion profile and the actual observed drug responses.
  Conclusion
  Metabolic mutation typing of cancer assists in guiding patient prognosis and treatment.
  
  Add to my favourites
  
  Email this

- Validating the Distinctiveness of the Omicron Lineage within the SARS-CoV-2 based on Protein Language Models
  
  Authors: Ke Dong and Jingyang Gao
  
  https://doi.org/10.2174/0115748936291075240409080924
  More Less
  
  Introduction
  Variants of concern were identified in severe acute respiratory syndrome coronavirus 2, namely Alpha, Beta, Gamma, Delta, and Omicron. This study explores the mutations of the Omicron lineage and its differences from other lineages through a protein language model.
  Methods
  By inputting the severe acute respiratory syndrome coronavirus 2 wild-type sequence into the protein language model evolving pre-trained models-1v, this study obtained the score for each position mutating to other amino acids and calculated the overall trend of a new variant of concern mutation scores.
  Results
  It is found that when the proportion of unobserved mutations to observed mutations is 4:15, Omicron still generates a large number of newly emerging mutations. It was found that the overall score for the Omicron family is low, and the overall ranking for the Omicron family is low.
  Conclusion
  Mutations in the Omicron lineage are different from amino acid mutations in other lineages. The findings of this paper deepen the understanding of the spatial distribution of spike protein amino acid mutations and overall trends of newly emerging mutations corresponding to different variants of concern. This also provides insights into simulating the evolution of the Omicron lineage.
  
  Add to my favourites
  
  Email this

- YADA - Reference Free Deconvolution of RNA Sequencing Data
  
  Authors: Dani Livne, Tom Snir and Sol Efroni
  
  https://doi.org/10.2174/0115748936304034240405034414
  More Less
  
  Introduction
  We present YADA, a cellular content deconvolution algorithm for estimating cell type proportions in heterogeneous cell mixtures based on gene expression data. YADA utilizes curated gene signatures of cell type-specific marker genes, either obtained intrinsically from pure cell type expression matrices or provided by the user.
  Methods
  YADA implements an accessible and extensible deconvolution framework uniquely capable of handling marker genes alone as inputs. Adoption barriers are lowered significantly by relying solely on literature-supported cell type-specific signatures rather than full transcriptomic profiles from purified isolates. However, flexible inputs do not necessitate sacrificing rigor - predictions match metrics of current methodologies through an integrated optimization scheme balancing multiple inference algorithms. Efficiency optimizations via compiled runtimes enable rapid execution. Packaging as an importable Python toolkit promotes community enhancement while retaining codebase extensibility.
  Results
  Validation studies demonstrate that YADA matches or exceeds the performance of current deconvolution methods on benchmark datasets. To demonstrate the utility and enable immediate usage, we provide an online Jupyter Notebook implementation coupled with tutorials.
  Conclusion
  YADA provides an accurate, efficient, and extensible Python-based toolkit for cellular deconvolution analysis of heterogeneous gene expression data.
  
  Add to my favourites
  
  Email this

- Enhancing Drug Peptide Sequence Prediction Using Multi-view Feature Fusion Learning
  
  Authors: Junyu Zhang, Ronglin Lu, Hongmei Zhou and Xinbo Jiang
  
  https://doi.org/10.2174/0115748936294345240510112941
  More Less
  
  Background
  Currently, various types of peptides have broad implications for human health and disease. Some drug peptides play significant roles in sensory science, drug research, and cancer biology. The prediction and classification of peptide sequences are of significant importance to various industries. However, predicting peptide sequences through biological experiments is a time-consuming and expensive process. Moreover, the task of protein sequence classification and prediction faces challenges due to the high dimensionality, nonlinearity, and irregularity of protein sequence data, along with the presence of numerous unknown or unlabeled protein sequences. Therefore, an accurate and efficient method for predicting peptide category is necessary.
  Methods
  In our work, we used two pre-trained models to extract sequence features, TextCNN (Convolutional Neural Networks for Text Classification) and Transformer. We extracted the overall semantic information of the sequences using Transformer Encoder and extracted the local semantic information between sequences using TextCNN and concatenated them into a new feature. Finally, we used the concatenated feature for classification prediction. To validate this approach, we conducted experiments on the BP dataset, THP dataset and DPP-IV dataset and compared them with some pre-trained models.
  Results
  Since TextCNN and Transformer Encoder extract features from different perspectives, the concatenated feature contains multi-view information, which improves the accuracy of the peptide predictor.
  Conclusion
  Ultimately, our model demonstrated superior metrics, highlighting its efficacy in peptide sequence prediction and classification.
  
  Add to my favourites
  
  Email this

Most Cited Most Cited RSS feed

- A Review of Ensemble Methods in Bioinformatics
  
  Authors: Pengyi Yang, Yee Hwa Yang, Bing B. Zhou and Albert Y. Zomaya
- Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis
  
  Authors: Masahiro Sugimoto, Masato Kawakami, Martin Robert, Tomoyoshi Soga and Masaru Tomita
- Distance-based Support Vector Machine to Predict DNA N6- methyladenine Modification
  
  Authors: Haoyu Zhang, Quan Zou, Ying Ju, Chenggang Song and Dong Chen
- A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods
  
  Authors: Jun Zhang and Bin Liu
- Molecular Genetic Markers: Discovery, Applications, Data Storage and Visualisation
  
  Authors: Chris Duran, Nikki Appleby, David Edwards and Jacqueline Batley
- A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization
  
  Authors: Wuritu Yang, Xiao-Juan Zhu, Jian Huang, Hui Ding and Hao Lin
- Cancer Diagnosis Through IsomiR Expression with Machine Learning Method
  
  Authors: Zhijun Liao, Dapeng Li, Xinrui Wang, Lisheng Li and Quan Zou
- Relevance of Molecular Docking Studies in Drug Designing
  
  Authors: Ritu Jakhar, Mehak Dangi, Alka Khichi and Anil K. Chhillar
- The Advances and Challenges of Deep Learning Application in Biological Big Data Processing
  
  Authors: Li Peng, Manman Peng, Bo Liao, Guohua Huang, Weibiao Li and Dingfeng Xie
- Gene Expression Profile Classification: A Review
  
  Authors: Musa H. Asyali, Dilek Colak, Omer Demirkaya and Mehmet S. Inan
More Less

Current Bioinformatics - Volume 20, Issue 3, 2025

Volume 20, Issue 3, 2025

Volumes & issues

Most Read This Month

Most Cited Most Cited RSS feed