Volume 13, Issue 2

Current Proteomics - Volume 13, Issue 2, 2016

- Editorial (Thematic Issue: Machine Learning Techniques for Protein Structure, Genomics Function Analysis and Disease Prediction)
  
  By Quan Zou
  
  https://doi.org/10.2174/157016461302160513235846

- Protein Folds Prediction with Hierarchical Structured SVM
  
  Authors: Dapeng Li, Ying Ju and Quan Zou
  
  https://doi.org/10.2174/157016461302160514000940
  
  Protein fold prediction is a basic and essential problem in protein structure and function research. In our view, it faces three general problems. The first is overfitting caused by the lack of training samples. The second is the missing information in hierarchical labels. The third is the small size of the current benchmarks. In this paper, we propose a structured SVM to overcome the first and second problems. We also contribute three comparatively large datasets as benchmarks for protein fold prediction. Experiments on different datasets demonstrate the performance and robustness of our structured SVM.
  
- Protein Remote Homology Detection by Combining Pseudo Dimer Composition with an Ensemble Learning Method
  
  Authors: Bin Liu, Junjie Chen and Shanyi Wang
  
  https://doi.org/10.2174/157016461302160514002939
  
  Background: With the development of next-generation sequencing techniques, the volume of protein sequence data is growing exponentially, while protein structure data grows slowly, so the gap between them keeps widening. Protein remote homology detection has therefore become an important and intensively studied research problem. Objective: Although several methods have been reported to tackle this problem, their performance is still too low for real-world application. It is therefore necessary and urgent to characterize protein sequences from a new perspective so as to improve the predictive performance of protein remote homology detection. Method: In this study, we propose a new protein feature called Pseudo Dimer Composition (PDC). A new computational method for protein remote homology detection, called PDC-Ensemble, was constructed by combining PDC via an ensemble learning approach. Results: Experimental results on a public benchmark dataset show that PDC-Ensemble outperforms other sequence-based methods and is highly comparable with some state-of-the-art predictors in the field of protein remote homology detection. Conclusion: PDC can extract more dipeptide information. PDC-Ensemble is a useful tool for studies of protein remote homology detection.
  
- Latent Semantic Analysis- and Hierarchical Clustering-Based Method for Detecting Remote Protein Homology
  
  Authors: Tianjiao Zhang, Yue Jiang, Liang Cheng, Yang Hu and Yadong Wang
  
  https://doi.org/10.2174/157016461302160514003220
  
  Background: The detection of remote homology between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective approaches. Objective: Many SVM-based methods focus on finding useful representations of protein sequences using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature sets are usually very large and may contain noise. In addition, the dataset for remote homology detection is imbalanced, as the number of negative samples is far greater than the number of positive samples. Method: Based on these observations, we propose a new method for reconstructing the feature space based on latent semantic analysis (LSA) and hierarchical clustering. In addition, for detecting remote homology, we adopt the precision-recall (PR) curve and score as an alternative evaluation method to the receiver operating characteristic (ROC). Results: Compared to existing methods, the performance increased by 14% on the 3-gram features and 7% on the LA features. Conclusion: Analysis of the comparative experimental results confirms that our method is effective and performs better than other existing methods.
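
The preference for PR over ROC evaluation on imbalanced data can be illustrated with a minimal sketch. The function names `pr_points` and `average_precision` and the toy scores are assumptions made for illustration; this is not the authors' evaluation code, and ties between scores are not treated specially.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs as the decision threshold is lowered."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / rank))
    return points


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example (hypothetical data): 2 positives among 5 samples.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
toy_ap = average_precision(toy_scores, toy_labels)
```

Unlike the ROC curve, precision is computed only over the samples ranked above the threshold, so a large pool of easy negatives at the bottom of the ranking does not inflate the score.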
  
- Predicting Human Enzyme Family Classes by Using Pseudo Amino Acid Composition
  
  Authors: Yun Wu, Hua Tang, Wei Chen and Hao Lin
  
  https://doi.org/10.2174/157016461302160514003437
  
  Background: Enzymes are biological macromolecules that act as catalysts for complex biochemical reactions. They increase the rate of a reaction by reducing its activation energy. Different enzymes catalyze different chemical reactions. Objective: With the appearance of vast amounts of human protein data, correctly identifying human enzyme classes is extremely important for understanding their functions. However, no computational method has been developed to predict enzyme functional classes in humans. We aimed to develop a computational method to discriminate human enzymes from non-enzymes and to further predict the classes of human enzymes. Method: In this paper, pseudo amino acid composition (PseAAC) was used to formulate proteins by incorporating the rigidity, flexibility and irreplaceability of amino acids. A feature selection technique was used to optimize the feature set. A support vector machine (SVM) was used to perform the prediction. Results: The results of a five-fold cross-validation test show that the overall accuracies are 72.6% and 46.1%, respectively, for discriminating human enzymes from non-enzymes and for predicting six classes of human enzymes. Conclusion: The work in this study provides an efficient method for this problem. In particular, three new kinds of characteristics were incorporated into PseAAC. The results indicate that these three characteristics of amino acids can be used in human enzyme prediction.
  
- A Brief Review on Software Implementations and Algorithm Enhancements of Chou’s Pseudo-Amino Acid Compositions
  
  By Pu-Feng Du
  
  https://doi.org/10.2174/157016461302160514003628
  
  Background: In the last decade, the computational prediction of structural and functional protein attributes has been applied in nearly every field of protein science. The fundamental challenge of computationally predicting protein attributes is to represent the protein sequence as a fixed-length numerical vector. For this purpose, Chou proposed a pseudo-amino acid composition method that has been widely applied in almost every branch of computational protein science. Conclusions: In this review, we first introduce the background and history of pseudo-amino acid composition and then focus on the software implementations of this widely used algorithm and the enhancements that have been developed since its creation.
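
As a rough illustration of how a sequence of any length becomes a fixed-length vector, the following is a minimal sketch of the classic type-1 pseudo-amino acid composition: 20 composition terms plus `lam` sequence-order correlation terms. The function names, the default weight `w=0.05`, and the toy property table are assumptions for illustration; real use would supply measured physicochemical property tables, and none of the reviewed software packages is reproduced here. The sketch assumes sequences contain only the 20 standard residue letters.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(prop):
    """Rescale a property table to zero mean and unit population std."""
    values = [prop[a] for a in AMINO_ACIDS]
    mean = sum(values) / 20
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / 20)
    return {a: (prop[a] - mean) / std for a in AMINO_ACIDS}


def pseaac(seq, props, lam=2, w=0.05):
    """Return a vector of length 20 + lam for a sequence of any length > lam."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    tables = [standardize(p) for p in props]
    # Conventional amino acid composition (normalized frequencies).
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    # Sequence-order correlation factors for gaps k = 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(len(seq) - k):
            total += sum((t[seq[i + k]] - t[seq[i]]) ** 2 for t in tables) / len(tables)
        thetas.append(total / (len(seq) - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * th / denom for th in thetas]


# Hypothetical placeholder property (NOT a real physicochemical scale).
TOY_PROPERTY = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
vector = pseaac("ACDEFGHIKLMNPQRSTVWYAC", [TOY_PROPERTY], lam=3)
```

The output length depends only on `lam`, not on the sequence length, which is what makes the representation usable as input to standard classifiers.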
  
- Predicting Protein Ligand Binding Sites with Structure Alignment Method on Hadoop
  
  Authors: Guangzhong Liu, Min Liu, Daozheng Chen, Lei Chen, Jiali Zhu, Bo Zhou and Jun Gao
  
  https://doi.org/10.2174/157016461302160514003915
  
  Background: Identifying protein-ligand binding sites is an important step in characterizing molecular function. Although many ligand-binding site prediction methods have been developed, there is still great demand for improving prediction accuracy and reducing the amount of calculation. Objective: In this paper, we introduce a structure alignment-based binding site prediction method that involves a large, well-refined template database, homologous indexed alignment, the use of conservation in binding site ranking, and Hadoop-based alignment acceleration. Method: We first build a large template database with strict quality control. A homologous index is used to refine the templates for a given query chain during structure alignment. Moreover, Hadoop is used for structure alignment, which improves prediction efficiency. A clustering method is used to analyze the sites. Finally, the sites are ranked according to the conservation scores of all residues in each site. Results: On the 210 bound test dataset, our method achieved an accuracy (ACC) of up to 0.93 and a Matthews correlation coefficient (MCC) of 0.80. On the 48 unbound/bound test dataset, our method achieved an ACC of up to 0.97 for bound proteins (MCC 0.87) and 0.95 for unbound proteins (MCC 0.66). Structure alignment is also accelerated on a Hadoop cluster, as illustrated for chain 1qif.A. Conclusion: Compared with other binding site prediction methods using the same test datasets, our method can reduce computation time and improve prediction accuracy.
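
The two reported metrics can be computed from a confusion matrix as in the sketch below. The function names and the toy counts are illustrative assumptions, not taken from the paper; returning 0.0 when the MCC denominator vanishes is a common convention rather than part of the definition.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: a predictor that labels everything negative
# on a dataset with 5 positives and 95 negatives.
acc_all_negative = accuracy(tp=0, tn=95, fp=0, fn=5)
mcc_all_negative = mcc(tp=0, tn=95, fp=0, fn=5)
```

In this toy case the accuracy is high while the MCC is zero, which shows how the two metrics can diverge when classes are imbalanced and why reporting both is informative.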
  
- Identification of Residue-Residue Contacts Using a Novel Coevolution-Based Method
  
  Authors: Yijie Ding, Jijun Tang and Fei Guo
  
  https://doi.org/10.2174/157016461302160514004105
  
  Background: Residue-residue interactions play important roles in the functional and spatial relationships of proteins. These interactions are usually related to the sequence but display close proximity within the three-dimensional structure. In the past few years, identifying residue-residue contacts in proteins has been an important prediction problem. Objective: Many methods extract contact information from multiple sequence alignments (MSAs). Existing MSA-based methods are derived from homologous protein sequences. However, they need a large number of homologous protein sequences, on average several thousand, for residue-residue contact prediction. Method: In this article, we use both phylogenetic information and amino acid frequency to predict residue-residue contacts from small MSAs. To better reflect evolutionary information, we combine the evolutionary distance matrix and the similarity matrix and produce a novel score, based on amino acid frequency, to filter out some noise. We use this information to estimate the correlation coefficient between each pair of sites in a target protein family, and extract binding sites with high final correlation scores. Results: First, we present a statistical analysis of the correlative relationship in residue-residue contacts. Second, we evaluate our method on 150 benchmark proteins for residue-residue contact prediction. Third, we identify protein-protein interactions in bacterial signal transduction. Experiments show that our method is very effective in real applications. Conclusion: When fewer protein sequences are available, experimental results confirm that our method performs better than some currently popular methods. Because we reduce the number of homologous proteins, the computing time needed to construct phylogenetic trees decreases significantly. On 150 benchmark proteins, our method achieves overall precisions of 68%, 64%, 54% and 45% for the top L/10, L/5, L/2 and L ranked predictions, respectively. The performance of our method is better than that of normalized Mutual Information scoring with sequence weighting and the Bayesian approach of Burger & van Nimwegen (B).
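
The top L/10, L/5, L/2 and L precision figures follow a standard evaluation scheme for contact prediction, where L is the sequence length. The sketch below shows one way to compute it; the function name `top_precision`, the input layout, and the toy data are assumptions for illustration and do not reproduce the authors' benchmark.

```python
def top_precision(scored_pairs, true_contacts, seq_len, divisor):
    """Precision among the top seq_len // divisor highest-scoring residue pairs.

    scored_pairs:  list of (i, j, score) predictions
    true_contacts: set of (i, j) tuples with i < j
    """
    if not scored_pairs:
        raise ValueError("no predictions to evaluate")
    n = max(1, seq_len // divisor)
    # Keep the n predictions with the highest scores.
    ranked = sorted(scored_pairs, key=lambda p: -p[2])[:n]
    hits = sum(1 for i, j, _ in ranked if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(ranked)


# Toy data (hypothetical): a 20-residue protein with four scored pairs.
toy_pairs = [(1, 8, 0.9), (2, 15, 0.8), (4, 10, 0.7), (3, 12, 0.6)]
toy_true = {(1, 8), (2, 15), (3, 12)}
p_l10 = top_precision(toy_pairs, toy_true, seq_len=20, divisor=10)
p_l5 = top_precision(toy_pairs, toy_true, seq_len=20, divisor=5)
```

Precision typically falls as more predictions are included, which matches the decreasing trend from L/10 to L in the reported figures.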
  
- Protein Function Prediction by Random Walks on a Hybrid Graph
  
  Authors: Jie Liu, Jun Wang and Guoxian Yu
  
  https://doi.org/10.2174/157016461302160514004307
  
  Background: Proteins participate in various essential processes of life, and hence accurately annotating the functional roles of proteins can advance the understanding of life and disease. Objective: Various network-based models have been proposed to predict protein functions from protein-protein interaction (PPI) networks, but most of them do not make use of function correlations in functional inference. Furthermore, these models suffer from false positive interactions. Our aim is to solve these problems with advanced machine learning techniques. Method: In this paper, we introduce an approach called protein function prediction by random walks on a hybrid graph (ProHG). ProHG takes into account not only function correlations and direct interactions, but also indirect interactions between proteins, using functional similarity weight (FS-weight) to alleviate noisy interactions. Results: Experiments on three publicly accessible PPI networks show that ProHG can take advantage of function correlations and indirect interactions between proteins for function prediction, and that it achieves better performance than other related approaches. Conclusion: The extensive empirical study demonstrates that ProHG is superior to other related methods for function prediction in most cases, and that using indirect interactions can boost the performance of network-based function prediction.
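
To give a sense of the underlying technique, here is a minimal sketch of a generic random walk with restart on a weighted graph. It is not ProHG: the hybrid protein-function graph and the FS-weight computation are not reproduced, and the function name, parameters and toy graph are assumptions for illustration. The sketch assumes every neighbor also appears as a key in the adjacency dictionary.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of each node.

    adj:   dict mapping node -> {neighbor: edge weight}
    seeds: set of nodes the walker jumps back to with probability `restart`
    """
    nodes = sorted(adj)
    # Total edge weight leaving each node, used to normalize transitions.
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                # Isolated node: the walking mass stays where it is.
                nxt[u] += (1 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph A - B - C - D (hypothetical), with A as the only seed.
toy_graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = random_walk_with_restart(toy_graph, {"A"})
```

Nodes closer to the seed receive higher scores, and nodes reachable only through intermediate neighbors still receive non-zero scores, which is the sense in which a random walk can exploit indirect interactions.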
  
- EnPC: An Ensemble Clustering Framework for Detecting Protein Complexes in Protein-Protein Interaction Network
  
  Authors: Qiguo Dai, Xiaodong Duan, Maozu Guo and Yingjie Guo
  
  https://doi.org/10.2174/157016461302160514005420
  
  Background: Proteins interact with each other to form complexes, which play key roles in the cell. Many methods have been proposed to predict complexes by clustering protein-protein interaction (PPI) networks. However, it remains a challenge to identify protein complexes accurately. Objective: Although each previous method has its advantages in predicting complexes, no single method is always superior to the others. Therefore, the goal of this work is to propose an ensemble method that integrates the results of multiple previous methods to obtain better performance than any single one of them. Method: We present an ensemble framework, named EnPC, to combine the results from several existing methods. A cluster-wise voting mechanism is employed to extract the consensus information embedded in the results of different methods. Furthermore, we employ a least squares-based optimization to predict complexes from the matrix. Results: We test the proposed framework on several widely used yeast PPI networks. The experimental results show that the EnPC framework achieves better performance in detecting protein complexes than the tested base methods. Conclusion: We conclude that the proposed EnPC framework is suitable for integrating the results of the tested base methods to detect protein complexes from PPI networks.
  
- Prediction of MicroRNA–disease Associations by Matrix Completion
  
  Authors: Xiangxiang Zeng, Ningxiang Ding, Alfonso Rodríguez-Patón, Ziyu Lin and Ying Ju
  
  https://doi.org/10.2174/157016461302160514005711
  
  Background: MicroRNAs (miRNAs) play important roles in the progression of various diseases. Therefore, predicting novel miRNA-disease associations is of vital importance for understanding disease mechanisms. Objective: In our view, miRNA-disease association prediction faces three general problems. The first is the lack of similarity among miRNAs. The second is that only a few relationships between miRNAs and diseases have been defined. The third is the insufficient number of negative samples available for studies on miRNA-disease associations. We aimed to solve these three problems with the inductive matrix completion method. Method: In this paper, the inductive matrix completion method is exploited to overcome the three problems. We also contribute multiple feature sets to address problems related to insufficient miRNA-disease association data. The method can be applied to predict unknown miRNA-disease associations and new pathogenic miRNAs for well-characterized diseases. Results: Experiments demonstrate the performance of our inductive matrix completion method. The method is compared with several current methods through cross-validation, and the results show that it is superior to the other approaches. Conclusion: We conclude that the inductive matrix completion method is more suitable than the transductive one for the prediction of miRNA-disease associations.
  
- A Prioritization Method for Identifying Disease-Causative Gene Based on Hyper Graph Network
  
  Authors: Haoyue Fu, Lianping Yang and Xiangde Zhang
  
  https://doi.org/10.2174/157016461302160514005851
  
  Background: Identifying transcription factor binding sites (TFBSs) is difficult because the motif signals, ten to several tens of bp in length, are short compared with the hundreds or thousands of bp of background noise sequence; moreover, the motif instances of a transcription factor are likely to be partially mutated. TFBS identification has always been a challenging task. Results: This paper systematically introduces and reviews the experimental methods widely used in the study of transcription regulation, the databases that collect information on TFBSs, the models that represent TFBSs, and the TFBS identification algorithms. Conclusion: The regulatory mechanism of TFBSs in the regulation network remains to be further discovered. We believe that progress in experimental technology and insight into the regulatory mechanism will bring new life to the bioinformatics of TFBSs. Since deep learning methods have shown excellent performance in TFBS identification, there are good reasons to believe that, integrated with more up-to-date biological data, deep learning will become the dominant way to study transcription regulation.
  