Current Proteomics - Volume 15, Issue 2, 2018
An Overview on Protein Fold Classification via Machine Learning Approach
Authors: Xiaoyu Tian, Daozheng Chen and Jun Gao
Protein fold classification plays a key role in protein functional analysis, molecular biology, cell biology, biomedicine and drug design. Methods for classifying protein folds can be roughly divided into two categories: taxonomy-based methods and template-based methods. Machine learning algorithms, owing to their excellent performance, have been widely applied to taxonomy-based methods. In this review, we mainly discuss the most popular and representative taxonomy-based methods that use machine learning, covering three important aspects: dataset, feature extraction method, and classification algorithm. We compare the overall accuracies of methods that use the same classifiers with different feature vectors, and summarize development trends and potential research directions. This review is intended to assist researchers in choosing appropriate materials and developing new classification methods in this area.
-
-
-
A Hybrid Discrete Imperialist Competition Algorithm for Gene Selection for Microarray Data
Authors: Aorigele, Zheng Tang, Yuki Todo and Shangce Gao
Objective and Background: This paper presents a hybrid imperialist competition algorithm (ICA) for feature selection from microarray gene expression data. As is well known, ICA performs global search well through parallel searching. However, population evolution depends only on the assimilation mechanism, and the algorithm converges slowly. Therefore, a learning mechanism among imperialists is used to speed up the evolution of the population and accelerate the convergence of the algorithm. Method: ICA is a random search method. To select as many informative genes as possible, this paper presents a hybrid ICA combined with information entropy, called ICAIE. In the proposed algorithm, we use information entropy to locate genes and the roulette wheel selection mechanism to avoid selecting informative genes excessively. The proposed algorithm was tested on 10 standard gene expression datasets. Results and Conclusion: The experimental results show that the presented algorithm outperforms several PSO-related (particle swarm optimization) and ICA-based algorithms in terms of classification accuracy and the number of targeted informative genes. Therefore, ICAIE is an effective method for feature selection.
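The roulette wheel selection step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' ICAIE implementation: the function name `roulette_select`, the gene scores, and the choice to use the scores directly as selection weights are all assumptions made for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(weights, rng):
    """Return an index drawn with probability proportional to its weight."""
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Spin the wheel: pick a point in [0, total) and find the slot it lands in.
    point = rng.random() * total
    index = bisect_right(cumulative, point)
    # Guard against floating-point rounding pushing the point onto the end.
    return min(index, len(weights) - 1)


# Hypothetical entropy-derived scores for four genes (illustrative only).
gene_scores = [0.1, 0.4, 0.0, 0.5]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [roulette_select(gene_scores, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(gene_scores))]
```

A gene with zero weight is never drawn, while higher-scoring genes are drawn more often without being chosen every time, which is the property that keeps a single gene from dominating the selection.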
-
-
-
Identifying the Characteristics of the Hypusination Sites Using SMOTE and SVM Algorithm with Feature Selection
Authors: XiJun Sun, JiaRui Li, Lei Gu, ShaoPeng Wang, YuHang Zhang, Tao Huang and Yu-Dong Cai
Background: Hypusination is a unique modification of lysine residues in eukaryotic translation initiation factor 5A (eIF5A), which is essential and highly conserved across eukaryotes. However, the mechanism by which this particular hypusination site is recognized remains unclear. In this study, we made a first attempt to uncover the characteristics of hypusination sites using computational methods. Method: Hypusination sites that were validated by experiments or predicted through sequence similarity were retrieved from the UniProt database for investigation. Each site was transformed into a peptide segment containing the modification site and the residues around it. Four types of features were extracted from the peptide segments. Because hypusination sites are far fewer than non-hypusination sites, the synthetic minority over-sampling technique (SMOTE) was applied to balance the dataset. Then, feature selection methods, including maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS), were used to analyze the four types of features and build an optimal classifier that used a support vector machine (SVM) as the prediction engine. Results: The optimal SVM classifier, which used four amino acid features, yielded a perfect Matthews correlation coefficient (MCC) of 1.000 on both training and testing sets, indicating that these four features are hypusination-specific characteristics. Conclusions: As a pioneering work, our analysis provides insight that improves the understanding of hypusination mechanisms.
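For reference, the Matthews correlation coefficient reported in this abstract is computed from the four confusion-matrix counts. The sketch below uses made-up counts, not the study's data, and returning 0.0 when the denominator is zero is a common convention assumed here rather than something stated in the abstract.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative counts only: a classifier with no errors, and one that is always wrong.
perfect = mcc(tp=10, tn=90, fp=0, fn=0)
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
```

An MCC of 1.000 therefore means that the classifier made no false positive or false negative predictions on the evaluated set.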
-
-
-
Complex Detection in PPI Network Using Genes Expression Information
Authors: Zehua Zhang, Jijun Tang and Fei Guo
Background: Identifying protein complexes from PPI networks has become a key problem in elucidating protein functions and identifying signaling and biological processes in a cell. Objective: Accurate determination of complexes in PPI networks is crucial for understanding the principles of cellular organization. Method: We propose a novel method to identify protein complexes in PPI networks. First, we use the Markov Cluster Algorithm with an edge-weighting scheme to calculate complexes in PPI networks. Second, we design a new co-expression analysis method to measure each protein complex, based on differential co-expression information. Results: To evaluate our method, we experiment on two yeast PPI networks. On the DIP network, our method achieves Precision and F-Measure values of 0.5014 and 0.5219, improving on 0.2896 and 0.3211 for COACH and 0.4252 and 0.3675 for ClusterONE. On the MIPS network, our method achieves an F-Measure of 0.3597, improving on 0.2497 for COACH and 0.3326 for ClusterONE. Conclusion: Our method achieves better results than some state-of-the-art methods for identifying protein complexes in dynamic PPI networks.
-
-
-
A Method for Lymph Node Segmentation with Scaling Features in a Random Forest Model
Authors: Wenjing Zhao and Feng Shi
Background: Accurate identification of lymph nodes in multi-slice CT images enables prompt diagnosis and correct treatment of cancers, and subsequent measurement of the treatment effect. Computer-aided detection (CAD) systems are a necessary choice for reducing the workload of radiologists and achieving higher accuracy than manual recognition. Detecting lymph nodes is non-trivial because they vary in shape and show no significant contrast with their surrounding regions, which makes classifiers based on boundary or shape features of the lymph nodes perform unsatisfactorily. Recently, features extracted from inside the lymph nodes have received more attention than those extracted from their borders and shapes. Method: In this paper, the lymph node was segmented by a Random Forest model. For each voxel of the lymph node, 500 random contextual features were extracted. To improve performance, we proposed using scaled features in the Random Forest classifier, which adds no extra complexity. Result: We tested our method on 10 mediastinal lymph nodes from the TCIA (The Cancer Imaging Archive) database. The scaled features improved the performance of the Random Forest model. After we adjusted the model parameters and chose features with high information gain, our Random Forest classifier reached better performance. Conclusion: We sought a simpler, faster and more efficient method to enable practicable computer-aided diagnosis and computer-aided detection in lymph node segmentation. Because scaling ensures equal treatment of features with different absolute values in the classifier, the precision and recall of our Random Forest classifier increased with the scaled features.
-
-
-
A Homology and Pseudo Amino Acid Composition-based Multi-label Model for Predicting Human Membrane Protein Types
Authors: Yanjun Huang and Guohua Huang
Background: Membrane proteins are embedded in biological membranes and interact with them, playing a wide range of roles in cellular processes, from transporting materials to catalyzing interactions. The functions of membrane proteins are closely associated with the types they belong to. Membrane proteins can belong to more than one type simultaneously, but most computational predictors can deal with only one type. Objective and Method: To bridge this gap, we proposed a multi-label method based on sequence homology and pseudo amino acid composition for predicting human membrane protein types. The method is a two-step decision. The uncharacterized membrane protein was first aligned against a database of membrane proteins with known types, and the types of the most homologous membrane protein were transferred to it. If it had no homologous membrane protein, the pseudo amino acid composition-based method was used to predict its types. Results: The predictive accuracies of the leave-one-out cross-validation test on three benchmark datasets are 0.8817, 0.8206 and 0.7276, respectively, better than our previous algorithm. We collected 5752 manually reviewed human membrane proteins with annotated types as the training set, and developed a program, MemPred, for predicting multi-label types of membrane proteins. Conclusion: We have proposed a multi-label computational method for predicting membrane protein types and achieved better performance. The advantage of the proposed method is that it can predict more than one type simultaneously.
-
-
-
Pathway Crosstalk Analysis based on Signaling Pathway Impact Analysis in Alzheimer's Disease
Authors: Jin Deng, Wei Kong, Xiaoyang Mou and Shuaiqun Wang
Background: Identifying dysregulated pathways from significantly differentially expressed genes (DEGs) to infer underlying biological insights plays an important role in discovering the pathogenesis of diseases. However, current pathway-based methods focus only on single pathways in isolation, whereas analyzing the crosstalk between pathways that contain DEGs could improve our understanding of alterations in biological processes. Objective: To explore the underlying dysregulated pathways of Alzheimer's disease (AD) efficiently through crosstalk analysis of both the significant DEGs and the pathways with high contributions. Method: A novel signaling pathway impact analysis method is used to calculate and rank the signaling pathways of AD. A distance correlation model based on the pathways ranked by contribution is applied to calculate the crosstalk between pathways of AD. Results: The method not only confirms the presence of known pathways associated with AD, such as the Parkinson's disease and VEGF signaling pathways, but also predicts previously unknown pathways, such as the Basal cell carcinoma and Olfactory transduction pathways, that are significantly associated with the onset and deterioration of AD. Conclusion: The results provide a useful supplement and basis for biological experiments on AD pathogenesis.
-
-
-
Prediction of Sphingomonas Protein Coding Regions Based on 3-Base Periodicity Analysis Method
Authors: Zhongwei Li, Shengyu Xia, Xin Liu, Qinghua Lu, Weishan Zhang, Huazhou A. Li and Hu Zhu
Background: Sphingomonas is a microbial resource used for the biodegradation of aromatic compounds. In computational biology, identifying protein coding domains in the Sphingomonas genome is known to be a challenging problem. Objective: To address this challenge, we propose a novel method to predict protein coding regions in the Sphingomonas genome using 3-base periodicity. Method: In our method, DNA sequences are first transformed into wavelets by a so-called 3-base characteristics strategy. After that, sliding windows with fixed lengths are used to identify protein coding regions, with the initial window size and the threshold values set using experimentally verified protein data in the NCBI library. Results: An experimentally verified protein coding domain from congeneric families of Sphingomonas is identified in the Sphingomonas genome. Conclusion: This region is highly likely to encode proteins with similar functions. In addition, some potential protein coding regions are marked by narrowing the forecast areas, and an extensible sliding window strategy is then used to improve predictive accuracy.
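The abstract does not specify the wavelet transform or the thresholds used, so the following Python sketch swaps in the classical Fourier-based measure of 3-base periodicity instead: it sums, over the four base indicator sequences, the spectral power at period 3 and scans the sequence with a fixed-length sliding window. The function names, the window length of 30, the step of 3, and the toy sequence are all assumptions for illustration, not the authors' method or data.

```python
import cmath


def period3_power(seq):
    """Spectral power at period 3, summed over the A, C, G, T indicator sequences."""
    if not seq:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier component at frequency 1/3 of the 0/1 indicator sequence for this base.
        component = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(component) ** 2
    return total / len(seq)


def sliding_scores(seq, window=30, step=3):
    """Return (start, period-3 power) for each fixed-length window."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: a strongly 3-periodic stretch followed by a homopolymer run.
toy_sequence = "ATG" * 10 + "A" * 30
scores = sliding_scores(toy_sequence)
```

Windows whose score exceeds a chosen threshold would be flagged as candidate coding regions; in the abstract, that threshold and the initial window size are calibrated against experimentally verified protein data.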
-
-
-
Human Disease-Protein Network
Authors: Yangmei Cheng, Hao Zhang, Hui Zheng, Jun Zhang, Yang Hu and Liang Cheng
Background: Using systems biology data to investigate diseases is a growing trend. Given that proteins are the functional units of the human body at the molecular level, a direct way to view the relationships among diseases is from the perspective of human proteins. However, the lack of disease annotations for human proteins limits this approach. Objective: Our objective is first to present a framework for extracting associations between diseases and proteins, and then to construct a human disease network (HDN) based on disease-related proteins. Method: The protein-disease associations were extracted from UniProt, which includes disease descriptions of human proteins. Each description contains an Online Mendelian Inheritance in Man (OMIM) id or a text. OMIM ids in the descriptions were mapped to the Comparative Toxicogenomics Database (CTD) 'merged disease vocabulary' (MEDIC), and disease terms in the texts were annotated to MEDIC using MGREP. Relativity scores of disease pairs were calculated based on the Jaccard Index to establish the HDN, in which a node represents a disease and an edge between two diseases indicates a relativity score greater than zero. Results: In total, 4,466 associations between 2,933 diseases and 2,625 proteins were obtained. The degree distribution of the diseases in the HDN followed a power law with R² = 0.9762, which shows that the network displays scale-free characteristics like many other biological networks. Conclusion: Here, we constructed an HDN from our protein-disease annotations. As expected, hub nodes of the network are always disease classes or complex diseases. In comparison, the most similar diseases are always specific diseases.
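The network construction described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the dictionary layout, and the toy disease and protein identifiers are hypothetical, and the relativity score is taken to be the plain Jaccard Index of the two diseases' protein sets.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard Index of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hdn(disease_to_proteins):
    """Return {(disease_1, disease_2): score} for every pair with score > 0."""
    edges = {}
    for d1, d2 in combinations(sorted(disease_to_proteins), 2):
        score = jaccard(disease_to_proteins[d1], disease_to_proteins[d2])
        if score > 0:
            edges[(d1, d2)] = score
    return edges


# Toy annotations (hypothetical identifiers, not taken from UniProt).
toy_annotations = {
    "disease_A": {"P1", "P2", "P3"},
    "disease_B": {"P2", "P3", "P4"},
    "disease_C": {"P5"},
}
hdn_edges = build_hdn(toy_annotations)
```

In this toy input, disease_A and disease_B share two of four distinct proteins and are therefore linked, while disease_C shares no protein with either and stays isolated.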
-
-
-
Incorporating Link Information in Feature Selection for Identifying Tumor Biomarkers by Using miRNA-mRNA Paired Expression Data
Authors: Kaiwen Liu and Yang Yang
Background: Feature selection methods have been commonly used in differential expression analysis. The selected genes can serve as potential biomarkers and play important roles in disease diagnosis and prognosis. Recently, many studies have shown that an efficient way to enhance the performance of feature selection is to incorporate data properties, such as the correlation between instances or attributes in heterogeneous data. Gene expression data is a typical kind of linked data, in which genes are related by co-regulation and samples are grouped by similar disease status. However, most analysis approaches for gene expression data are designed for generic data and do not consider these data characteristics. Objective: In this paper, we aim to identify miRNA biomarkers by using feature selection methods. With abundant mRNA-miRNA parallel expression data available, mining the linked data can provide valuable information for feature selection and biomarker identification. Method: Using mRNA-miRNA paired data, we infer connections between data samples from mRNA expression levels, and incorporate the link information into a graph regularization method to achieve feature selection for miRNAs. Results: The experiments were conducted on three public miRNA-mRNA microarray data sets. The new method greatly reduces feature dimensionality and achieves high classification accuracy. Experimental comparisons show that it outperforms the classic regularization methods and state-of-the-art feature selection methods. Conclusion: Taking data properties into consideration has been demonstrated to be an effective way to improve the performance of feature selection. Specifically, link information in gene expression data provides useful hints for designing structured regularization and assists biomarker identification.
