Current Bioinformatics - Volume 18, Issue 8, 2023
An Overview of Protein Function Prediction Methods: A Deep Learning Perspective
Authors: Emilio Ispano, Federico Bianca, Enrico Lavezzo and Stefano Toppo
Predicting the function of proteins is a major challenge in the scientific community, particularly in the post-genomic era. Traditional methods of determining protein functions, such as experiments, are accurate but can be resource-intensive and time-consuming. The development of Next Generation Sequencing (NGS) techniques has led to the production of a large number of new protein sequences, which has increased the gap between available raw sequences and verified annotated sequences. To address this gap, automated protein function prediction (AFP) techniques have been developed as a faster and more cost-effective alternative, aiming to maintain the same accuracy level. Several automatic computational methods for protein function prediction have recently been developed and proposed. This paper reviews the best-performing AFP methods presented in the last decade and analyzes their improvements over time to identify the most promising strategies for future methods. Identifying the most effective method for predicting protein function is still a challenge. The Critical Assessment of Functional Annotation (CAFA) has established an international standard for evaluating and comparing the performance of various protein function prediction methods. In this study, we analyze the best-performing methods identified in recent editions of CAFA. These methods are divided into five categories based on their principles of operation: sequence-based, structure-based, combined-based, ML-based and embeddings-based. After conducting a comprehensive analysis of the various protein function prediction methods, we observe that there has been a steady improvement in the accuracy of predictions over time, mainly due to the implementation of machine learning techniques. The present trend suggests that all the best-performing methods will use machine learning to improve their accuracy in the future.
We highlight the positive impact that the use of machine learning (ML) has had on protein function prediction. Most recent methods developed in this area use ML, demonstrating its importance in analyzing biological information and making predictions. Despite these improvements in accuracy, there is still a significant gap compared with experimental evidence. The use of new approaches based on Deep Learning (DL) techniques will probably be necessary to close this gap, and while significant progress has been made in this area, there is still more work to be done to fully realize the potential of DL.
-
-
-
A Comparison of Mutual Information, Linear Models and Deep Learning Networks for Protein Secondary Structure Prediction
Background: Over the last several decades, predicting protein structures from amino acid sequences has been a core task in bioinformatics. Nowadays, the most successful methods employ multiple sequence alignments and can predict the structure with excellent performance. These predictions take advantage of all the amino acids at a given position and their frequencies. However, the effect of single amino acid substitutions in a specific protein tends to be hidden by the alignment profile. For this reason, single-sequence-based predictions attract interest even after accurate multiple-alignment methods have become available: the use of single sequences ensures that the effects of substitution are not confounded by homologous sequences. Objective: This work aims at understanding how the single-sequence secondary structure prediction of a residue is influenced by the surrounding ones. We aim at understanding how different prediction methods use single-sequence information to predict the structure. Methods: We compare mutual information, the coefficients of two linear models, and three deep learning networks. For the deep learning algorithms, we use the DeepLIFT analysis to assess the effect of each residue at each position in the prediction. Results: Mutual information and linear models quantify direct effects, whereas DeepLIFT applied on deep learning networks quantifies both direct and indirect effects. Conclusion: Our analysis shows how different network architectures use the information of single protein sequences and highlights their differences with respect to linear models. In particular, the deep learning implementations take into account context and single position information differently, with the best results obtained using the BERT architecture.
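As an illustration of the mutual-information part of the comparison above, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting in which each sequence has a per-residue secondary-structure string of the same length, and it estimates the mutual information (in bits) between the residue at position i+k and the structure label at position i. The function names `mutual_information` and `mi_by_offset` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of mutual information (bits) from (x, y) observations."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mi_by_offset(sequences, structures, max_offset):
    """MI between the residue at position i+k and the structure label at i.

    Assumption: sequences[n] and structures[n] are equal-length strings.
    Returns a dict mapping each offset k in [-max_offset, max_offset] to MI.
    """
    result = {}
    for k in range(-max_offset, max_offset + 1):
        pairs = []
        for seq, ss in zip(sequences, structures):
            for i in range(len(seq)):
                j = i + k
                if 0 <= j < len(seq):
                    pairs.append((seq[j], ss[i]))
        result[k] = mutual_information(pairs)
    return result
```

On real data, the offset profile would be computed over a large set of annotated chains; a plug-in estimate like this one is biased upward for small samples, so the toy usage is only meant to show the mechanics.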
-
-
-
Screening and Identification of Key Genes for Cervical Cancer, Ovarian Cancer and Endometrial Cancer by Combinational Bioinformatic Analysis
Introduction: Cervical cancer, ovarian cancer and endometrial cancer are the top three cancers in women. With the rapid development of gene chip and high-throughput sequencing technologies, these tools have been widely used to study functional genomics data and identify markers for disease diagnosis and treatment. At the same time, more and more public databases containing genetic data have appeared. Bioinformatic analysis of these data can provide new diagnostic perspectives on cell origin and differences. Methods: In this paper, three GEO datasets on cervical cancer, ovarian cancer and endometrial cancer were used to identify the DEGs (differentially expressed genes) common to the three cancers. The common DEGs comprise 400 up-regulated genes and 157 down-regulated genes. Results: The results of GO (gene ontology) functional enrichment analysis show that the BP (biological process) changes of DEGs are mainly in cell division, mitotic nuclear division, sister chromatid cohesion, and DNA replication. The CC (cell component) enrichments of DEGs were mainly in the nucleoplasm, nucleus, condensed chromosome kinetochore, and chromosome, centromeric region. The MF (molecular function) enrichments of DEGs were mainly in protein binding. The results of the KEGG pathway analysis showed that the up-regulated DEGs were mainly enriched in retinoblastoma gene in the cell cycle, cellular senescence, oocyte meiosis, and pathways in cancer, while the down-regulated DEGs were enriched in thiamine metabolism and protein processing in endoplasmic reticulum. Similarly, the function of the most significant module was enriched in cell division, condensed chromosome kinetochore, and microtubule motor activity. Conclusion: Four of the top 10 hub genes (CCNA2, CCNB1, CDC6 and CDK1) may help guide future biomedical experimental research.
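The step of finding DEGs shared by several datasets can be sketched as a set intersection. This is a minimal illustration under stated assumptions, not the authors' pipeline: each dataset is assumed to be a mapping from gene name to a (log2 fold change, adjusted p-value) pair, and the cutoffs (|log2FC| >= 1, adjusted p < 0.05) are placeholder values that the abstract does not specify. The function names are hypothetical.

```python
def classify_degs(stats, lfc_cutoff=1.0, p_cutoff=0.05):
    """Split one dataset into up- and down-regulated gene sets.

    stats: dict mapping gene -> (log2_fold_change, adjusted_p_value).
    The cutoffs are illustrative assumptions, not values from the abstract.
    """
    up, down = set(), set()
    for gene, (log2fc, adj_p) in stats.items():
        if adj_p >= p_cutoff:
            continue  # not significant in this dataset
        if log2fc >= lfc_cutoff:
            up.add(gene)
        elif log2fc <= -lfc_cutoff:
            down.add(gene)
    return up, down


def common_degs(datasets, lfc_cutoff=1.0, p_cutoff=0.05):
    """Genes that are up (or down) in every dataset, with consistent direction."""
    if not datasets:
        return set(), set()
    ups, downs = [], []
    for stats in datasets:
        up, down = classify_degs(stats, lfc_cutoff, p_cutoff)
        ups.append(up)
        downs.append(down)
    return set.intersection(*ups), set.intersection(*downs)
```

Requiring the same direction in every dataset is one possible design choice; a gene that is up in one cancer and down in another is excluded from both sets here.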
-
-
-
Non-small Cell Lung Cancer Survival Estimation Through Multi-omic Two-layer SVM: A Multi-omics and Multi-Sources Integrative Model
Background: The new paradigm of precision medicine brought an increasing interest in survival prediction based on the integration of multi-omics and multi-sources data. Several models have been developed to address this task, but their performances are widely variable depending on the specific disease and are often poor on noisy datasets, such as in the case of non-small cell lung cancer (NSCLC). Objective: The aim of this work is to introduce a novel computational approach, named multi-omic two-layer SVM (mtSVM), and to exploit it to get a survival-based risk stratification of NSCLC patients from an ongoing observational prospective cohort clinical study named PROMOLE. Methods: The model implements a model-based integration by means of a two-layer feed-forward network of FastSurvivalSVMs, and it can be used to get individual survival estimates or survival-based risk stratification. Despite being designed for NSCLC, its range of applicability can potentially cover the full spectrum of survival analysis problems where integration of different data sources is needed, independently of the pathology considered. Results: The model is here applied to the case of NSCLC, and compared with other state-of-the-art methods, proving excellent performance. Notably, the model, trained on data from The Cancer Genome Atlas (TCGA), has been validated on an independent cohort (from the PROMOLE study), and the results were consistent. Gene-set enrichment analysis of the risk groups, as well as exome analysis, revealed well-defined molecular profiles, such as a prognostic mutational gene signature with potential implications in clinical practice.
-
-
-
A Pan-cancer Analysis Reveals the Tissue Specificity and Prognostic Impact of Angiogenesis-associated Genes in Human Cancers
Authors: Zhenshen Bao, Minzhen Liao, Wanqi Dong, Yanhao Huo, Xianbin Li, Peng Xu and Wenbin Liu
Introduction: Angiogenesis is one of the hallmarks of cancer and can impact the processes of cancer initiation, progression, and response to therapy. Background: Anti-angiogenic therapy is thus an encouraging therapeutic option to treat cancers, but the detailed angiogenic mechanisms and the association between angiogenesis and clinical outcome remain unknown in different cancers. Methods: Here, we systematically assess the impacts of 82 angiogenesis-associated genes (AAGs) in tumor tissue specificity and prognosis across 16 cancer types. Results: The expression patterns of the 82 AAGs reflect tumor tissue specificity, and high expression of up-regulated AAGs is significantly associated with poor prognosis of cancer. We further define a prognostic score for predicting overall survival (OS) based on the expressions of up-regulated AAGs and confirm its reliable predictive ability. A low prognostic score is associated with superior OS and vice versa. Conclusion: The results of this study will contribute to the understanding of different tumor angiogenesis mechanisms in various tissues and cancer-personalized anti-angiogenic treatment. The code of our analysis can be accessed at https://github.com/ZhenshenBao/AAGs_analysis.git.
-
-
-
Multi-channel Partial Graph Integration Learning of Partial Multi-omics Data for Cancer Subtyping
Authors: Qing-Qing Cao, Jian-Ping Zhao and Chun-Hou Zheng
Background: The appearance of cancer subtypes with different clinical significance fully reflects the high heterogeneity of cancer. At present, multi-omics integration methods have become increasingly mature. However, in practical applications, some omics are missing for some samples. Objective: The purpose of this study is to establish a deep model that can effectively integrate and express partial multi-omics data to accurately identify cancer subtypes. Methods: We proposed a novel partial multi-omics learning model for cancer subtypes, MPGIL (Multi-channel Partial Graph Integration Learning). MPGIL has two main components. Firstly, it obtains more lateral adjacency information between samples within the omics through the multi-channel graph autoencoders based on high-order proximity. To reduce the negative impact of missing samples, the weighted fusion layer is introduced to replace the concatenate layer to learn the consensus representation across multi-omics. Secondly, a classifier is introduced to ensure that the consensus representation is representative of clustering. Finally, subtypes were identified by K-means. Results: This study compared MPGIL with other multi-omics integration methods on 16 datasets. The clinical and survival results show that MPGIL can effectively identify subtypes. Three ablation experiments are designed to highlight the importance of each component in MPGIL. A case study of AML was conducted. The differentially expressed gene profiles among its subtypes fully reveal the high heterogeneity of cancer. Conclusion: MPGIL can effectively learn the consistent expression of partial multi-omics datasets and discover subtypes, and shows more significant performance than the state-of-the-art methods.
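To illustrate why a weighted fusion can tolerate missing omics where concatenation cannot, here is a minimal sketch under assumptions that are not taken from the paper: each omics channel yields an embedding of the same dimension for a sample, a boolean flag marks whether that omics was measured, and the consensus is a weighted average over the available channels only. The actual MPGIL fusion layer may differ; `weighted_fusion` and its parameters are hypothetical.

```python
def weighted_fusion(embeddings, present, weights):
    """Consensus embedding for one sample from the omics that are available.

    embeddings: list (one entry per omics) of equal-length lists of floats.
    present:    list of bools, False where that omics is missing for the sample.
    weights:    list of non-negative channel weights (assumed given or learned).
    Missing channels are skipped and the remaining weights are renormalised,
    so placeholder values in a missing channel never reach the output.
    """
    total = sum(w for w, p in zip(weights, present) if p)
    if total <= 0:
        raise ValueError("sample has no available omics with positive weight")
    dim = len(embeddings[0])
    fused = [0.0] * dim
    for z, p, w in zip(embeddings, present, weights):
        if not p:
            continue
        for d in range(dim):
            fused[d] += (w / total) * z[d]
    return fused
```

With concatenation, a missing channel would leave a block of the input undefined; here the output has a fixed dimension regardless of which channels are present.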
