Current Bioinformatics - Volume 20, Issue 3, 2025
-
-
Comparative Analysis of Deep Generative Model for Industrial Enzyme Design
Authors: Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang and Fei Guo
Although enzymes catalyze reactions efficiently, natural enzymes often lack stability in industrial environments and may not catalyze the required reactions at all. This motivates the de novo design of new enzymes. Computational methods can explore sequence space rapidly and efficiently and can support the design of enzymes suited to specific conditions and requirements, which makes them valuable for designing new industrial enzymes. Currently, only one tool exists specifically for enzyme generation, and its performance is suboptimal. We therefore selected several general protein sequence design tools and systematically evaluated their effectiveness on specific industrial enzymes. We grouped the computational methods used for protein sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To evaluate the ability of six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. We then assessed the quality of the enzyme sequences these methods generated on this dataset, using amino acid distribution, EC number validation, and other measures. We also assessed sequences generated by structure-based methods on existing public datasets using sequence recovery rate and root-mean-square deviation (RMSD), covering both the sequence and the structure perspective. On the functionality dataset Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences with amino acid distributions and functionalities closely matching those of naturally occurring luciferase enzymes, suggesting that they preserve essential enzymatic characteristics.
Across both benchmark datasets, ABACUS-R and ProteinMPNN also exhibited the highest sequence recovery rates, indicating a superior ability to generate sequences that closely match the native sequences of the input enzyme structures. Our study provides a reference for researchers selecting enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, offering high accuracy in sequence recovery and RMSD and maintaining the functional integrity of enzymes through accurate amino acid distribution. Our industrial enzyme benchmark also provides a fair evaluation of how well general protein design tools transfer to specific industrial enzymes.
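The two structure-based metrics named in this abstract, sequence recovery rate and RMSD, can be illustrated with a minimal sketch. It assumes the designed and native sequences are already aligned position by position and that the two coordinate sets are already superimposed; the function names are hypothetical and are not taken from the paper or from any of the evaluated tools.

```python
import math


def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are pre-aligned and have the same length.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return matches / len(native)


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

In practice the RMSD would be computed after an optimal superposition of the predicted and reference backbones, which this sketch deliberately leaves out.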
-
-
-
Improved Hybrid Approach for Enhancing Protein-Coding Regions Identification in DNA Sequences
Introduction: Identifying and predicting protein-coding regions within DNA sequences plays a pivotal role in genomic research. This paper introduces an approach for identifying protein-coding regions in DNA sequences using a hybrid methodology that combines digital bandpass filtering with wavelet transforms and various spectral estimation techniques to enhance exon prediction. Specifically, the Haar and Daubechies wavelet transforms are applied to improve the accuracy of protein-coding region (exon) prediction, enabling the extraction of fine details that may be obscured in the original DNA sequences.
Methods: This work applies Haar and Daubechies wavelet transforms, both non-parametric and parametric spectral estimation techniques, and a digital bandpass filter for detecting peaks in exon regions. In addition, the Electron-Ion Interaction Potential (EIIP) method is used to convert symbolic DNA sequences into numerical values, and a Sum-of-Sinusoids (SoS) mathematical model with optimized parameters is used to model the DNA sequences, supporting accurate gene identification.
Results: The approach yields a substantial improvement in identification accuracy for protein-coding regions. In terms of peak location detection, applying Haar and Daubechies wavelet transforms improves the accuracy of peak localization by approximately (0.01, 3-5 dB). With non-parametric and parametric spectral estimation techniques, peak localization improves by approximately (0.01, 4 dB) compared to the original signal. The proposed approach also achieves higher accuracy than existing approaches.
Conclusion: These findings not only bridge gaps in DNA sequence analysis but also offer a promising pathway for advancing exonic region prediction and gene identification in genomics research. The hybrid methodology presented is a robust contribution to the evolving landscape of genomic analysis techniques.
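The EIIP step described in the Methods can be sketched as follows. The EIIP values are the standard published ones for the four nucleotides. The spectral part is an assumption: the abstract does not say which frequency band is filtered, so this sketch simply measures the discrete Fourier component at one cycle per three bases, the period-3 signal commonly associated with coding regions, in place of the authors' bandpass filter, wavelet, and spectral estimation pipeline. Function names and the window size are hypothetical.

```python
import cmath

# Standard EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(dna: str) -> list:
    """Map a symbolic DNA string to its numerical EIIP sequence."""
    return [EIIP[base] for base in dna.upper()]


def period3_power(signal: list) -> float:
    """Power of the mean-removed signal at a frequency of 1/3 cycles per base."""
    n = len(signal)
    mean = sum(signal) / n
    component = sum(
        (x - mean) * cmath.exp(-2j * cmath.pi * i / 3) for i, x in enumerate(signal)
    )
    return abs(component) ** 2 / n


def sliding_period3(dna: str, window: int = 351, step: int = 3) -> list:
    """Return (start, power) pairs for each window along the sequence."""
    signal = eiip_encode(dna)
    return [
        (start, period3_power(signal[start:start + window]))
        for start in range(0, len(signal) - window + 1, step)
    ]
```

Peaks in the resulting power profile mark candidate coding regions; the wavelet and filtering stages reported in the abstract would act on this kind of numerical signal to sharpen those peaks.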
-
-
-
An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction
Authors: Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde and Jelili Oyelade
Background: Machine learning models for sequence-based protein-protein interaction prediction typically require amino acid sequences to be converted into feature vectors. Two approaches to this transformation appear in the literature: the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. Most studies have adopted the IPF approach, while others have preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding.
Objective: This raises the question of which approach should be adopted for improved host-pathogen protein-protein interaction (HPPPI) prediction. This work therefore introduces the Extended Protein Feature (EPF) method.
Methods: The proposed method combines the predictive capabilities of IPF and MPF, extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested on bacteria, parasite, virus, and plant HPPPI datasets with several machine learning models: Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF).
Results: MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF.
Conclusion: The EPF approach developed in this study exhibits substantial improvements in four of the six models evaluated. This suggests that EPF is competitive with IPF and is particularly well suited to traditional machine learning models.
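The difference between the two baseline encodings can be made concrete with a small sketch. It assumes amino acid composition as the feature encoder, which the abstract does not specify, and it assumes that IPF means encoding each protein separately and joining the two vectors. The `epf_candidates` function is only a hypothetical starting point that stacks both vectors; the multicollinearity handling and zero-importance filtering described in the Methods are not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> list:
    """Amino acid composition: relative frequency of each of the 20 residues."""
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


def ipf(host: str, pathogen: str) -> list:
    """Independent encoding: encode each protein, then join the vectors."""
    return aac(host) + aac(pathogen)


def mpf(host: str, pathogen: str) -> list:
    """Merged encoding: concatenate the sequences, then encode once."""
    return aac(host + pathogen)


def epf_candidates(host: str, pathogen: str) -> list:
    """Hypothetical EPF input: IPF and MPF features side by side, before any
    feature selection is applied."""
    return ipf(host, pathogen) + mpf(host, pathogen)
```

With this encoder IPF gives 40 features per pair and MPF gives 20, and MPF loses the information about which protein a residue came from, which is one plausible reason for its weaker results.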
-
-
-
Integrated Somatic Mutation Network Diffusion Model for Stratification of Breast Cancer into Different Metabolic Mutation Subtypes
Authors: Dongqing Su, Honghao Li, Tao Wang, Min Zou, Haodong Wei, Yuqiang Xiong, Hongmei Sun, Shiyuan Wang, Qilemuge Xi, Yongchun Zuo and Lei Yang
Background: Somatic mutations in metabolism-related genes can disrupt metabolic pathways, which results in patients exhibiting different molecular and pathological features.
Objective: In this study, we used somatic mutation data to investigate the significance of metabolic mutation typing in guiding the prognosis and treatment of breast cancer patients.
Methods: The somatic mutation profile of breast cancer patients was analyzed and smoothed using a network diffusion model within the protein-protein interaction network to construct a comprehensive somatic mutation network diffusion profile. A deep clustering approach was then employed to explore metabolic mutation typing in breast cancer based on integrated metabolic pathway information and the somatic mutation network diffusion profile. In addition, we employed deep neural networks and machine learning prediction models to assess the feasibility of predicting drug responses from somatic mutation network diffusion profiles.
Results: Significant differences in prognosis and metabolic heterogeneity were observed among the metabolic mutation subtypes, which were characterized by distinct alterations in metabolic pathways and genetic mutations. These mutational features offer potential targets for subtype-specific therapies. Furthermore, the drug responses predicted by the model built on the somatic mutation network diffusion profile were strongly consistent with the observed drug responses.
Conclusion: Metabolic mutation typing of cancer assists in guiding patient prognosis and treatment.
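The smoothing step in the Methods can be sketched with a generic random-walk-with-restart diffusion over an interaction network. This is an assumption: the abstract does not give the exact diffusion formula, so the update rule, the restart weight `alpha`, and the function name below are illustrative rather than the authors' model.

```python
def diffuse(adjacency, initial, alpha=0.7, tol=1e-8, max_iter=1000):
    """Smooth a binary mutation vector over a network.

    adjacency: symmetric 0/1 matrix (list of lists) for the interaction network.
    initial:   per-gene mutation profile for one patient (1 = mutated).
    alpha:     fraction of signal passed to neighbours at each step; the rest
               is reset to the initial profile.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    current = list(initial)
    for _ in range(max_iter):
        updated = []
        for j in range(n):
            # Signal arriving at gene j from each neighbour i, split evenly
            # across the neighbours of i.
            spread = sum(
                current[i] * adjacency[i][j] / degree[i]
                for i in range(n)
                if degree[i] > 0
            )
            updated.append(alpha * spread + (1 - alpha) * initial[j])
        if max(abs(a - b) for a, b in zip(updated, current)) < tol:
            return updated
        current = updated
    return current
```

After diffusion, genes that interact with a mutated gene carry a non-zero score, so sparse mutation profiles from different patients become comparable before clustering.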
-
-
-
Validating the Distinctiveness of the Omicron Lineage within the SARS-CoV-2 based on Protein Language Models
Authors: Ke Dong and Jingyang Gao
Introduction: Variants of concern have been identified in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), namely Alpha, Beta, Gamma, Delta, and Omicron. This study explores the mutations of the Omicron lineage and its differences from other lineages through a protein language model.
Methods: By inputting the SARS-CoV-2 wild-type sequence into the protein language model ESM-1v (Evolutionary Scale Modeling-1v), this study obtained a score for each position mutating to every other amino acid and calculated the overall trend of mutation scores for each new variant of concern.
Results: When the proportion of unobserved mutations to observed mutations is 4:15, Omicron still generates a large number of newly emerging mutations. Both the overall score and the overall ranking for the Omicron family are low.
Conclusion: Mutations in the Omicron lineage differ from amino acid mutations in other lineages. The findings of this paper deepen the understanding of the spatial distribution of spike protein amino acid mutations and the overall trends of newly emerging mutations corresponding to different variants of concern. They also provide insights into simulating the evolution of the Omicron lineage.
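The per-position scoring described in the Methods can be illustrated without running a language model. The sketch below assumes the model has already produced, for each position, a probability for every amino acid, and it scores a substitution as the log-probability of the mutant residue minus that of the wild-type residue. This log-odds form is widely used with masked protein language models, but the abstract does not confirm the exact formula, and the function names and probabilities used here are hypothetical.

```python
import math


def mutation_scores(wild_type: str, position_probs: list) -> dict:
    """Score every substitution as log p(mutant) - log p(wild type).

    position_probs[i] is a dict mapping amino acid -> model probability at
    position i (hypothetical values supplied by the caller).
    """
    scores = {}
    for pos, (wt, probs) in enumerate(zip(wild_type, position_probs), start=1):
        for residue, prob in probs.items():
            if residue != wt:
                scores[f"{wt}{pos}{residue}"] = math.log(prob) - math.log(probs[wt])
    return scores


def mean_score(scores: dict, mutations: list) -> float:
    """Average score over the mutations that define one variant."""
    return sum(scores[m] for m in mutations) / len(mutations)
```

A negative score means the model considers the substitution less likely than the wild-type residue, so a low average for a lineage indicates mutations the model did not favour.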
-
-
-
YADA - Reference Free Deconvolution of RNA Sequencing Data
Authors: Dani Livne, Tom Snir and Sol Efroni
Introduction: We present YADA, a cellular content deconvolution algorithm for estimating cell type proportions in heterogeneous cell mixtures based on gene expression data. YADA uses curated gene signatures of cell type-specific marker genes, either obtained intrinsically from pure cell type expression matrices or provided by the user.
Methods: YADA implements an accessible and extensible deconvolution framework that can handle marker genes alone as inputs. Adoption barriers are lowered significantly by relying solely on literature-supported cell type-specific signatures rather than full transcriptomic profiles from purified isolates. Flexible inputs do not come at the cost of rigor: predictions match the metrics of current methodologies through an integrated optimization scheme that balances multiple inference algorithms. Compiled runtimes enable rapid execution, and packaging as an importable Python toolkit supports community contributions while keeping the codebase extensible.
Results: Validation studies demonstrate that YADA matches or exceeds the performance of current deconvolution methods on benchmark datasets. To demonstrate its utility and enable immediate use, we provide an online Jupyter Notebook implementation with tutorials.
Conclusion: YADA provides an accurate, efficient, and extensible Python-based toolkit for cellular deconvolution analysis of heterogeneous gene expression data.
-
-
-
Enhancing Drug Peptide Sequence Prediction Using Multi-view Feature Fusion Learning
Authors: Junyu Zhang, Ronglin Lu, Hongmei Zhou and Xinbo Jiang
Background: Various types of peptides have broad implications for human health and disease. Some drug peptides play significant roles in sensory science, drug research, and cancer biology, and the prediction and classification of peptide sequences matter to various industries. However, predicting peptide sequences through biological experiments is time-consuming and expensive. Moreover, protein sequence classification and prediction are challenging because protein sequence data are high-dimensional, nonlinear, and irregular, and because many protein sequences are unknown or unlabeled. An accurate and efficient method for predicting peptide category is therefore necessary.
Methods: We used two pre-trained models to extract sequence features: TextCNN (Convolutional Neural Networks for Text Classification) and Transformer. We extracted the overall semantic information of the sequences using the Transformer Encoder, extracted the local semantic information between sequences using TextCNN, and concatenated the two into a new feature. We then used the concatenated feature for classification prediction. To validate this approach, we conducted experiments on the BP, THP, and DPP-IV datasets and compared the results with some pre-trained models.
Results: Because TextCNN and the Transformer Encoder extract features from different perspectives, the concatenated feature contains multi-view information, which improves the accuracy of the peptide predictor.
Conclusion: Our model demonstrated superior metrics, highlighting its efficacy in peptide sequence prediction and classification.
-
