Current Bioinformatics - Volume 16, Issue 8, 2021
Ensemble Adaptive Total Variation Graph Regularized NMF for Single-cell RNA-seq Data Analysis
Authors: Ya-Li Zhu, Ying-Lian Gao, Jin-Xing Liu, Rong Zhu and Xiang-Zhen Kong
Background: Single-cell RNA sequencing techniques have emerged as effective approaches for finding the heterogeneity between cells and discovering the differentiation stage. Adaptive total variation graph regularized nonnegative matrix factorization (ATV-NMF) has been proposed to capture the inner geometric structure and determine whether to retain feature details or denoise, which makes it suitable for analyzing single-cell data. However, the rank of the matrix factorization greatly affects clustering performance, and it is still challenging to determine the optimal rank.
Objective: To solve this problem, we propose an ensemble clustering method, ANMFCE, that integrates several base clustering results corresponding to different rank values.
Methods: First, we use the ATV-NMF algorithm to obtain clustering results with different dimension reduction ranks. Second, a consensus function based on connected-triple-based similarity is applied to obtain the similarity matrix. Finally, spectral clustering is used to find the final optimal partition.
Results: Clustering results on six single-cell sequencing datasets show that our method performs better than the individual ATV-NMF method and the other comparison methods, which illustrates that it is effective in finding the heterogeneity in single-cell datasets. Moreover, the identification of gene markers also achieves accurate results.
Conclusion: In summary, our method is effective for analyzing single-cell RNA sequencing datasets.
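The ensemble step described in this abstract can be illustrated with a minimal sketch. Note the assumptions: it builds a plain co-association matrix (the fraction of base clusterings in which two cells share a cluster), which is a simpler stand-in for the connected-triple-based similarity the authors use, and the base partitions are made-up labels rather than ATV-NMF output. The function name `co_association` is hypothetical.

```python
def co_association(partitions):
    """Return an n x n matrix whose (i, j) entry is the fraction of base
    clusterings in which cells i and j receive the same cluster label."""
    n = len(partitions[0])
    m = len(partitions)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for labels in partitions if labels[i] == labels[j])
            sim[i][j] = shared / m
    return sim


# Hypothetical base clusterings of five cells, standing in for results
# obtained with three different factorization ranks.
base_partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 1, 2, 2, 3],
]
similarity = co_association(base_partitions)
```

A matrix like `similarity` could then be passed to a spectral clustering routine as a precomputed affinity to obtain a consensus partition.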
Effect of Various Sequence Descriptors in Predicting Human Protein-protein Interactions Using ANN-based Prediction Models
Authors: Pankaj S. Dholaniya and Samreen Rizvi
Aims: Many sequence-based descriptors for proteins have been proposed. This study aims to evaluate the performance of these descriptors in predicting protein-protein interactions on a benchmark dataset.
Background: The behavior of a protein inside or outside the cell is defined by its interaction with the elements present in the surrounding environment, which range from small metabolites to macromolecules such as RNA, DNA, or proteins. Of these, understanding protein-protein interactions (PPIs) is one of the important aspects of investigating the biological role of a protein. The interactions of a protein are determined by how it folds in three-dimensional space, and this folding largely depends on the linear sequence of amino acids. This makes it possible to exploit protein sequences to computationally determine the possible interactions among them.
Objective: This study examines the efficacy of various sequence-based descriptors in predicting protein-protein interactions.
Methods: We used the benchmark dataset of interacting and non-interacting protein pairs provided by Pan et al. to build PPI prediction models using artificial neural networks. We compared the efficacy of different descriptors on two types of datasets, one with all the protein pairs and the second with proteins having less than 25% identity.
Results: The conjoint-triad descriptors performed better than the other descriptors in predicting PPIs. Feature selection was performed on the conjoint triad, and the prediction model with reduced features was compared with the model using the full feature set.
Conclusion: The classification model with conjoint-triad descriptors obtained the highest accuracy. Feature ranking for the conjoint-triad descriptor was used to compare model performance with all features and with selected features. The model with reduced features shows less overfitting.
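To make the conjoint-triad descriptor concrete, here is a minimal sketch under stated assumptions: it uses the seven-class amino acid grouping commonly associated with this descriptor, returns raw triad counts without the normalization that is usually applied, and simply skips residues outside the 20 standard amino acids. Whether the study used exactly this grouping and preprocessing is not stated in the abstract. The function name `conjoint_triad` is hypothetical.

```python
from itertools import product

# Seven amino acid classes commonly used for the conjoint triad
# (an assumption here; the abstract does not list them).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# All 7 * 7 * 7 = 343 ordered triads of classes, in a fixed order.
TRIADS = list(product(range(7), repeat=3))


def conjoint_triad(sequence):
    """Count each class triad over consecutive residues of a protein sequence.

    Residues that are not standard amino acids are dropped before counting,
    so the residues on either side of a dropped one become neighbors.
    """
    classes = [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]
    counts = dict.fromkeys(TRIADS, 0)
    for i in range(len(classes) - 2):
        counts[tuple(classes[i:i + 3])] += 1
    return [counts[triad] for triad in TRIADS]


features = conjoint_triad("AGVILFPYMTS")
```

For a protein pair, the two 343-dimensional vectors are typically concatenated before being fed to a classifier.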
Identification of Risk Molecular Subtype of Colon Cancer with Lymphovascular Invasion
Authors: Qing Jin, Binhua Liang, Xiujie Chen and Huiwen Liu
Background: Although surgical resection generally yields excellent outcomes, a number of patients with colon cancer still have relapse or metastasis after surgery. Adjuvant chemotherapy in tumor stage III has been demonstrated to eradicate micrometastasis and improve survival, whereas the benefits of adjuvant chemotherapy in tumor stage II remain controversial. The leading cause is the lack of understanding of the molecular basis of the underlying metastatic mechanisms.
Objective: This study aimed to identify molecular subtype(s) of colon cancer with a high risk of metastasis and provide potential biomarkers for prognostic prediction in tumor stage II.
Methods: Based on the assumption that colon cancer evolves because of the stepwise accumulation of a series of genetic mutations, we performed a systematic investigation of the molecular basis of colon cancer by applying random walk with restart on the PPI network. To compare the functional similarity of patients, we extracted the mutation-propagating modules of each patient and calculated their enrichment scores in 50 hallmark gene sets. According to the functional similarity matrix, we classified colon cancers with positive lymphovascular invasion and analyzed the prognosis of the molecular subtypes. We determined the molecular characteristics of the subtypes by enrichment analysis of subtype-specific genetic mutations. Additionally, we identified potential biomarkers for predicting patients with a high risk of metastasis in stage II through differential analysis of the miRNA expression profiles of the subtypes. We then used two independent data sets to construct a random forest classifier and performed 10-fold cross-validation of the miRNA biomarkers.
Results: First, we identified two molecular subtypes of colon cancer with positive lymphovascular invasion, as well as their associated biological characteristics: LVI1, the canonical subtype (110, 85%), and LVI2, the metastatic subtype (20, 15%). Second, we identified 11 miRNA biomarkers for predicting patients with a high risk of metastasis in tumor stage II.
Conclusion: Our findings put forward a detailed classification for colon cancer and provide risk biomarkers that can help stage II patients decide whether to take adjuvant chemotherapy after surgery.
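The network propagation step named in this abstract, random walk with restart, can be sketched as follows. This is an illustration on a made-up four-node path graph, not the authors' pipeline: the restart probability, the convergence tolerance, and the function name `random_walk_with_restart` are all assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence,
    where W is the column-normalized adjacency matrix and p0 is uniform
    over the seed nodes (for example, a patient's mutated genes)."""
    n = len(adjacency)
    col_sum = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            flow = sum(
                adjacency[i][j] * p[j] / col_sum[j]
                for j in range(n)
                if col_sum[j] > 0
            )
            nxt.append((1 - restart) * flow + restart * p0[i])
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy undirected network: a path 0 - 1 - 2 - 3, with node 0 as the only seed.
toy_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(toy_network, seeds={0})
```

Nodes closer to the seed receive higher scores, which is the property that lets mutation signals spread to nearby genes in a PPI network.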
PREDAIP: Computational Prediction and Analysis for Anti-inflammatory Peptide via a Hybrid Feature Selection Technique
Authors: Dan Lin, Jialin Yu, Ju Zhang, Huan He, Xinyun Guo and Shaoping Shi
Background: Anti-Inflammatory Peptides (AIPs) are potent therapeutic agents for inflammatory and autoimmune disorders due to their high specificity and minimal toxicity under normal conditions. Identifying AIPs is therefore valuable for discovering novel and efficient AIP-based therapeutics. Recently, three computational approaches that can effectively identify potential AIPs have been developed based on machine learning algorithms. However, these three predictors face several challenges.
Objective: A novel machine learning approach is needed to improve AIP prediction accuracy.
Methods: This study attempts to improve the recognition of AIPs by employing multiple primary sequence-based feature descriptors and an efficient feature selection strategy. By sorting features with four enhanced minimal redundancy maximal relevance (emRMR) methods and then applying wrapper methods with seven different classifiers based on the sequential forward selection (SFS) algorithm, we propose a hybrid feature selection technique, emRMR-SFS, to optimize the feature vectors. Furthermore, by evaluating seven classifiers trained with the optimal feature subset, we developed an Extremely Randomized Tree (ERT) based predictor named PREDAIP for identifying AIPs.
Results: We systematically compared the performance of PREDAIP with the existing tools on an independent test dataset. The comparison demonstrates the effectiveness and power of PREDAIP.
Conclusion: The correlation criteria used in emRMR affect the selection of the optimal feature subset at the SFS-wrapper stage, which justifies considering different correlation criteria in emRMR.
Informative Gene Selection Based on Cost-Sensitive Fast Correlation-Based Filter Feature Selection
Authors: Yueling Xiong, Qingqing Li, Peipei Wang and Mingquan Ye
Background: Informative gene selection is an essential step in performing tumor classification. However, it is difficult to select informative genes related to tumors from large-scale gene expression profiles because of their characteristics, such as high dimensionality, relatively small sample sizes, and class imbalance, and because some genes are superfluous and irrelevant.
Objective: Many researchers analyze and process gene expression data to obtain classified gene subsets by using machine learning methods. However, the gene expression profiles of tumors often pose massive computational challenges. In addition, traditional feature selection algorithms often ignore cost estimation while improving feature importance and classification accuracy, which makes tumor classification more difficult.
Methods: In this study, a novel informative gene selection method based on cost-sensitive fast correlation-based filter (CS-FCBF) feature selection is proposed.
Results: First, the symmetric uncertainty index is used to evaluate the correlation between informative genes and class labels, and a large number of irrelevant and redundant genes are then quickly filtered out according to importance, generating a candidate gene subset. Second, cost-sensitive learning, which introduces the misclassification cost matrix and support vector machine attribute evaluation, is used to obtain the top-ranked gene subset with minimum misclassification loss. Finally, the candidate gene subset is optimized.
Conclusion: The method was verified on eight independent tumor datasets. By comparing CS-FCBF with three other hybrids of typical gene selection algorithms combined with cost-sensitive learning, we found that the proposed method achieves better classification performance with fewer selected genes, which might provide guidance in tumor diagnosis and research.
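The symmetric uncertainty index mentioned in this abstract has a standard definition, SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), which the following sketch computes for discrete values. It assumes expression values have already been discretized (the abstract does not say how), and the toy vectors and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


# Hypothetical discretized expression levels of two genes and a class label.
labels = [0, 0, 1, 1]
informative_gene = [0, 0, 1, 1]
irrelevant_gene = [0, 1, 0, 1]

su_informative = symmetric_uncertainty(informative_gene, labels)
su_irrelevant = symmetric_uncertainty(irrelevant_gene, labels)
```

In an FCBF-style filter, genes whose SU with the class label falls below a threshold would be discarded before any redundancy analysis.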
PSO-ELM with Modified Acceleration Coefficients for Classifying the Active Compound
Authors: Dian E. Ratnawati, Marjono Marjono and Nashi Widodo
Background: Classifying active compounds by function using machine learning is essential for quickly predicting the function of new active compounds. These classification results help accelerate the work of laboratory assistants in identifying the function of active compounds. In this study, an active compound is represented by its Simplified Molecular-Input Line-Entry System (SMILES) code.
Objective: This paper proposes a modified acceleration coefficient to improve the performance of PSO-ELM in predicting the function of the SMILES code.
Methods: The research uses a machine learning algorithm that combines Particle Swarm Optimization and the Extreme Learning Machine (PSO-ELM). ELM is used to classify the SMILES code, while PSO is used to optimize the ELM parameters, i.e., weight, bias, and the number of hidden neurons. The acceleration coefficients are important parameters that significantly influence PSO performance. A modified Sigmoid-Based Acceleration Coefficient (SBAC) is introduced and compared with seven other acceleration coefficients.
Results: The experimental results show that the proposed acceleration coefficients outperform all other acceleration coefficients in sensitivity, specificity, accuracy, and Area Under the Curve (AUC). The accuracy gain of the proposed method reaches up to 2.64%, 5.84%, 7.93%, 8.44%, and 16.29% compared with the Support Vector Machine (SVM), decision tree, AdaBoost, MLP Classifier, and Gaussian Naïve Bayes algorithms, respectively.
Conclusion: The acceleration coefficients affect the prediction accuracy of the SMILES code classification. The proposed acceleration coefficients improve the performance of PSO-ELM in predicting the function of the SMILES code.
Highly Accurate Gene Essentiality Prediction with W-Nucleotide Z Curve Features and Feature Selection Technique in Saccharomyces cerevisiae
Authors: Wen-Xin Zheng, Shu-Xuan Wang and Hong Liu
Background: Many studies have been conducted on essentiality prediction in the Saccharomyces cerevisiae genome, but the accuracy is not as high as that in bacterial or human genomes. The most frequently used features are Protein-Protein Interaction (PPI) networks combined with some other features, such as evolutionary conservation, expression level, and protein domain information. Sequence composition features are the least used.
Objective: To improve the accuracy of essentiality prediction in the Saccharomyces cerevisiae genome, we propose a highly accurate gene essentiality prediction algorithm.
Methods: We propose an algorithm based on a linear Support Vector Machine (SVM) using sequence features only. The variables are derived from sequence data based on the w-nucleotide Z curve format, without any other information.
Results: After feature selection, the best area under the receiver operating characteristic curve (AUC) was 0.944 for 5-fold cross-validation. From 1- to 6-nucleotide Z curve variables, feature extraction can increase the AUC in all cases.
Conclusion: Prediction based on sequence composition alone is promising, particularly when a feature filtering method is used, and may be a good complement to algorithms based on other features.
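As a rough illustration of Z curve variables, the sketch below computes the three standard mononucleotide (w = 1) components from base frequencies. It is an assumption-laden simplification: the w-nucleotide variables used in the paper extend this idea to longer oligonucleotides, and any codon-phase handling is omitted here. The function name `z_curve_features` is hypothetical.

```python
def z_curve_features(sequence):
    """Return the (x, y, z) mononucleotide Z curve components of a DNA sequence.

    x contrasts purines with pyrimidines, y contrasts amino with keto bases,
    and z contrasts weak with strong hydrogen-bonding bases. Characters other
    than A, C, G and T are ignored.
    """
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    a, c, g, t = (counts[base] / total for base in "ACGT")
    x = (a + g) - (c + t)
    y = (a + c) - (g + t)
    z = (a + t) - (g + c)
    return x, y, z


example = z_curve_features("ATGGCGTAA")
```

Each component lies between -1 and 1, so the values can be used directly as features for a linear classifier.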
DeepFusion-RBP: Using Deep Learning to Fuse Multiple Features to Identify RNA-binding Protein Sequences
Authors: Xu Wang, Shunfang Wang, Haoyi Fu, Xiaoli Ruan and Xianjun Tang
Background: RNA-binding proteins play an important role in regulating splicing, RNA transport, and other post-transcriptional processes, identifying special RNA-binding domains, and interacting with RNA.
Objective: This paper proposes a deep learning framework, DeepFusion-RBP, composed of three sub-models. A sliding window is used to obtain sub-sequences, local features are extracted, and a model is then customized for each feature.
Methods: The main advantage of this research is the use of the sliding window method to cut the original sequence. While expanding the data set, this method avoids padding with too much meaningless data. A model is then customized for each feature to accurately classify RNA-binding proteins, using methods such as LSTM, Conv1D, and amino acid embedding.
Results: To test whether the customized models can improve the final prediction, we used different combinations of sub-models and test sets of different lengths. The prediction ACC, F1-score, and MCC of DeepFusion-RBP are 92.62%, 91.29%, and 84.96%, respectively, with cross-validation. DeepFusion-RBP also showed excellent performance on three independent verification sets.
Conclusion: The results of 10-fold cross-validation and the independent verification set tests both suggest that customizing models for different features and extracting sub-sequences improve the prediction performance of the model. The data supporting the findings of the article are available at https://github.com/mmwangxu/DeepFusion-RBP-tool.
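The sliding window idea in this abstract can be sketched in a few lines. The window length, the stride, and the function name `sliding_windows` are assumptions chosen for illustration; the abstract does not give the values used in DeepFusion-RBP.

```python
def sliding_windows(sequence, window=5, stride=2):
    """Cut a sequence into fixed-length, overlapping sub-sequences.

    Sequences no longer than the window are returned whole, which avoids
    padding them with meaningless filler. A trailing fragment shorter than
    the window is dropped when the stride does not line up with the end.
    """
    if len(sequence) <= window:
        return [sequence]
    return [
        sequence[start:start + window]
        for start in range(0, len(sequence) - window + 1, stride)
    ]


# Hypothetical protein fragment, split into overlapping sub-sequences.
pieces = sliding_windows("ABCDEFGHI", window=5, stride=2)
```

Each sub-sequence inherits the label of its parent protein, which is how this kind of cutting enlarges the training set.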
Identification of Cancer Trait Genes and Association Analysis Under Pan-Cancer
Authors: Shudong Wang, Yuanyuan Zhang, Hongting Mu and Shanchen Pang
Background: Different cancers have different sites of origin, cell types, and forms of genetic mutation, which manifest as different forms of cancer. Therefore, identifying genes associated with cancer traits and analyzing their functions in different cancers is important for understanding the mechanisms of cancer.
Objective: The purpose of this paper is to make up for the shortcomings of single-tumor analysis by discovering and identifying genes related to each cancer trait at the level of multiple tumors and analyzing their associations.
Methods: We use a structural equation model to quantitatively identify genes associated with cancer traits for five cancers. We verify the correctness and effectiveness of the method through correlation analysis. We then analyze the functions of the genes and the biological processes involved through GO and KEGG pathways. Finally, we further analyze and verify the experimental results through a protein interaction network and survival analysis.
Results: Using data from five types of cancer, we identify 44 genes related to cancer traits. We verify the combined effects of these genes and the biological processes they participate in. Moreover, we find key gene pathways and two significant gene functional modules.
Conclusion: The results show that the structural equation model has unique advantages in quantifying the combined effects of genes. Many of the genes we have identified are tumor metastasis genes and are related to many cancers. There are strong potential commonalities among cancers. Four cancer genes are related not only to protein metabolism but also to the regulation of nucleobase, nucleoside, nucleotide, and nucleic acid metabolism, which is of great significance for cancer treatment.