Current Protein and Peptide Science - Volume 12, Issue 6, 2011
Editorial [Hot Topic: Machine Learning Models in Protein Bioinformatics (Guest Editors: Lukasz Kurgan & Yaoqi Zhou)]
Authors: Lukasz Kurgan and Yaoqi Zhou
Bioinformatics is a relatively new field concerned with the computational analysis and prediction of properties of biomolecules, in particular DNA, RNA, and proteins, on a genomic/proteomic scale. Machine learning models play increasingly important roles in the development of novel methodologies, summarization, and high-throughput analysis in the bioinformatics field. Advances in related areas, including protein structure and function prediction [1, 2], structural bioinformatics [3], and peptide analysis [4], were recently summarized, and several works that overview specific sub-areas of protein bioinformatics, such as prediction of secondary structure [5, 6], helical transmembrane proteins [7], localization and targeting [8], binding sites [9, 10], and RNA-binding [11], were published in the last couple of years. This issue provides a comprehensive overview of current efforts related to the analysis of protein data, from sequences to structures to functions. It consists of two parts: the first contains five reviews and the second contains seven original methodology papers.
The first review by Xin and Radivojac summarizes approaches for the computational identification of functional residues in protein structures and discusses their applications in functional proteomics, including prediction of catalytic residues, post-translational modifications, and nucleic acid-binding sites. The second manuscript by Kurgan and Disfani provides a comprehensive review of ten one-dimensional structural descriptors of proteins and comparatively summarizes over eighty computational models that are used to predict these descriptors from protein sequences, primarily focusing on the prediction of secondary structure, relative solvent accessibility, and disorder. The review by Gromiha and Huang discusses machine learning-based and statistical methods for the computational prediction of protein folding rates and stability. The fourth paper by Zhou and coworkers overviews and compares current techniques for the prediction of small open reading frames and emphasizes the need for further research in this area. The last review introduces cellular automata and concentrates on their applications in protein bioinformatics.
The first original research paper by Kihara and coworkers describes the three-dimensional Zernike descriptor, which is used to describe molecular surfaces, and overviews several applications of this descriptor. In the next paper, Qin and Zhou introduce their DISPLAR method, which aims at accurate protein structure-based prediction of DNA binding sites. The manuscript by Xu and coworkers describes and evaluates a new sampling-based machine learning method to rank protein structural models by integrating multiple scores and features. The next two original contributions describe new methodologies for protein model quality assessment. The work by Martin, Mirabello, and Pollastri concerns an efficient knowledge-based approach that utilizes neural network pairwise interaction fields. The paper by Meller and coworkers introduces a method based on the prediction of relative solvent accessibility using support vector regression, which is applied to soluble and alpha-helical membrane proteins. The next contribution by Hwang et al. investigates the relation between contact numbers and catalytic residues to build a simple and effective predictor of catalytic residues. We close the issue with the paper by Yin, Fan, and Shen, which proposes and evaluates an accurate nearest neighbor-based method for the prediction of the conotoxin superfamily.
We are excited to deliver this comprehensive issue that tackles a diverse set of developments in the area of protein bioinformatics. We hope that it will constitute an indispensable resource for bioinformaticians, computer scientists, computational biologists, biophysicists, and biochemists.
Last but not least, we would like to thank all the authors who made this issue possible. We are greatly indebted to the 26 anonymous reviewers from around the world who delivered timely and useful comments to the authors. The guest editors also express their gratitude to the Editor-in-Chief Prof. Ben M. Dunn for his invitation and support that resulted in the successful completion of this special issue.
-
-
-
Computational Methods for Identification of Functional Residues in Protein Structures
Authors: Fuxiao Xin and Predrag Radivojac
The recent accumulation of experimentally determined protein 3D structures combined with our ability to computationally model structure from amino acid sequence has resulted in an increased importance of structure-based methods for protein function prediction. Two types of methods for function prediction have been proposed: those that can accurately predict overall biochemical or biological roles of a protein and those that predict its functional residues. Here, we review approaches used for the computational identification of functional residues in protein structures and summarize their applications to a wide variety of problems in functional proteomics, such as the prediction of catalytic residues, posttranslational modifications, or nucleic acid-binding sites. We examine four different problems in order to perform a comparison between several recently proposed methods and, finally, conclude by identifying limitations and future challenges in this field.
-
-
-
Structural Protein Descriptors in 1-Dimension and their Sequence-Based Predictions
Authors: Lukasz Kurgan and Fatemeh Miri Disfani
The last few decades have seen increasing interest in the development and application of 1-dimensional (1D) descriptors of protein structure. These descriptors project 3D structural features onto 1D strings of residue-wise structural assignments. They cover a wide range of structural aspects including conformation of the backbone, burying depth/solvent exposure and flexibility of residues, and inter-chain residue-residue contacts. We perform a first-of-its-kind comprehensive comparative review of the existing 1D structural descriptors. We define, review and categorize ten structural descriptors and we also describe, summarize and contrast over eighty computational models that are used to predict these descriptors from protein sequences. We show that the majority of the recent sequence-based predictors utilize machine learning models, with the most popular being neural networks, support vector machines, hidden Markov models, and support vector and linear regressions. These methods provide high-throughput predictions and most of them are accessible to a non-expert user via web servers and/or stand-alone software packages. We empirically evaluate several recent sequence-based predictors of secondary structure, disorder, and solvent accessibility descriptors using a benchmark set based on CASP8 targets. Our analysis shows that the secondary structure can be predicted with over 80% accuracy and segment overlap (SOV), disorder with over 0.9 AUC, 0.6 Matthews Correlation Coefficient (MCC), and 75% SOV, and relative solvent accessibility with PCC of 0.7 and MCC of 0.6 (0.86 when homology is used). We demonstrate that the secondary structure predicted from sequence without the use of homology modeling is as good as the structure extracted from the 3D folds predicted by top-performing template-based methods.
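The abstract above reports the Matthews Correlation Coefficient (MCC) for binary per-residue predictions such as disorder. As a minimal sketch of how that measure is computed from a confusion matrix, the following Python function may help; the function name and the 0/1 label encoding are assumptions made for illustration and are not taken from the paper.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels encoded as 0 (negative) and 1 (positive).

    The 0/1 encoding is an illustrative assumption, e.g. 1 = disordered residue.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (total disagreement) to 1 (perfect prediction), with 0 corresponding to a random-level prediction, which makes it more informative than raw accuracy when the two classes are unbalanced.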
-
-
-
Machine Learning Algorithms for Predicting Protein Folding Rates and Stability of Mutant Proteins: Comparison with Statistical Methods
Authors: M. Michael Gromiha and Liang-Tsung Huang
Machine learning algorithms have a wide range of applications in bioinformatics and computational biology such as prediction of protein secondary structures, solvent accessibility, binding site residues in protein complexes, protein folding rates, stability of mutant proteins, and discrimination of proteins based on their structure and function. In this work, we focus on two aspects of predictions: (i) protein folding rates and (ii) stability of proteins upon mutations. We briefly introduce the concepts of protein folding rates and stability along with available databases, features for prediction methods and measures for prediction performance. Subsequently, the development of structure based parameters and their relationship with protein folding rates will be outlined. The structure based parameters are helpful to understand the physical basis for protein folding and stability. Further, basic principles of major machine learning techniques will be mentioned and their applications for predicting protein folding rates and stability of mutant proteins will be illustrated. The machine learning techniques could achieve the highest accuracy in predicting protein folding rates and stability. In essence, statistical methods and machine learning algorithms complement each other for understanding and predicting protein folding rates and the stability of protein mutants. The available online resources on protein folding rates and stability will be listed.
-
-
-
Small Open Reading Frames: Current Prediction Techniques and Future Prospect
Authors: Haoyu Cheng, Wai Soon Chan, Zhixiu Li, Dan Wang, Song Liu and Yaoqi Zhou
Evidence is accumulating that small open reading frames (sORFs, <100 codons) play key roles in many important biological processes. Yet, they are generally ignored in gene annotation even though they are far more abundant than genes with more than 100 codons. Here, we demonstrate that popular homolog search and codon-index techniques perform poorly for small genes relative to larger genes, while a method dedicated to sORF discovery has a similar level of accuracy as homology search. The result is largely due to the small dataset of experimentally verified sORFs available for homology search and for training ab initio techniques. It highlights the urgent need for both experimental and computational studies in order to further advance the accuracy of sORF prediction.
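To make the sORF definition above concrete (fewer than 100 codons), here is a minimal Python sketch that enumerates candidate small ORFs on the forward strand of a DNA sequence. It is only a naive candidate scan, not any of the prediction techniques compared in the paper. The function name, the choice of ATG as the only start codon, and the convention of counting codons without the stop codon are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_small_orfs(seq, max_codons=100):
    """Return (start, end, n_codons) for forward-strand ORFs with < max_codons codons.

    Assumptions: ORFs start at ATG, end at the first in-frame stop codon,
    and n_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                j = i
                # Walk codon by codon until an in-frame stop or the sequence end.
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(seq):  # an in-frame stop codon was found
                    n_codons = (j - i) // 3
                    if n_codons < max_codons:
                        orfs.append((i, j + 3, n_codons))
                    i = j + 3  # resume scanning after the stop codon
                    continue
            i += 3
    return orfs
```

A scan like this typically returns a very large number of candidates on real genomes, most of them spurious, which is consistent with the abstract's point that distinguishing genuine sORFs requires dedicated methods and more verified training data.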
-
-
-
Cellular Automata and Its Applications in Protein Bioinformatics
Authors: Xuan Xiao, Pu Wang and Kuo-Chen Chou
With the explosion of protein sequences generated in the postgenomic era, it is highly desirable to develop high-throughput tools for rapidly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. The knowledge thus obtained can help us utilize these newly found protein sequences in a timely manner for both basic research and drug discovery. Many bioinformatics tools have been developed by means of machine learning methods. This review is focused on the applications of a new kind of science (cellular automata) in protein bioinformatics. A cellular automaton (CA) is an open, flexible and discrete dynamic model that holds enormous potential in modeling complex systems, in spite of the simplicity of the model itself. Researchers, scientists and practitioners from different fields have utilized cellular automata for visualizing protein sequences, investigating their evolution processes, and predicting their various attributes. Owing to its impressive power, intuitiveness and relative simplicity, the CA approach has great potential for use as a tool for bioinformatics.
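For readers unfamiliar with the model, a one-dimensional, two-state cellular automaton can be sketched in a few lines of Python. This is a generic elementary CA with periodic boundaries, shown only to illustrate the "simple rule, discrete dynamics" idea; the rule numbers, the boundary handling, and the function names are illustrative assumptions, and nothing here reproduces the specific encodings reviewed in the paper.

```python
def ca_step(cells, rule):
    """Advance an elementary cellular automaton by one time step.

    cells: list of 0/1 states; rule: Wolfram rule number (0-255).
    Periodic (wrap-around) boundaries are an illustrative assumption.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The neighborhood (left, center, right) selects one bit of the rule number.
        idx = (left << 2) | (center << 1) | right
        nxt.append((rule >> idx) & 1)
    return nxt

def ca_evolve(cells, rule, steps):
    """Return the list of successive states, starting with the initial one."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(ca_step(history[-1], rule))
    return history
```

Stacking the rows returned by `ca_evolve` gives a two-dimensional 0/1 image, which is the kind of pattern that can be visualized or turned into features once a sequence has been encoded as the initial row.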
-
-
-
Molecular Surface Representation Using 3D Zernike Descriptors for Protein Shape Comparison and Docking
Authors: Daisuke Kihara, Lee Sael, Rayan Chikhi and Juan Esquivel-Rodriguez
The tertiary structures of proteins have been solved at an increasing pace in recent years. To capitalize on the enormous effort invested in accumulating the structure data, efficient and effective computational methods need to be developed for comparing, searching, and investigating interactions of protein structures. We introduce the 3D Zernike descriptor (3DZD), an emerging technique to describe molecular surfaces. The 3DZD is a series expansion of a mathematical three-dimensional function, and thus a tertiary structure is represented compactly by a vector of coefficients of terms in the series. A strong advantage of the 3DZD is that it is invariant to rotation of the target object to be represented. These two characteristics of the 3DZD allow rapid comparison of surface shapes, which is sufficient for real-time structure database screening. In this article, we review various applications of the 3DZD, which have been recently proposed.
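Because each structure is reduced to a fixed-length, rotation-invariant vector, database screening becomes a nearest-vector search with no structural superposition step. The Python sketch below illustrates that idea only; it does not compute Zernike moments, the descriptor vectors are toy values, and the use of Euclidean distance and the names `descriptor_distance` and `screen_database` are assumptions made for illustration.

```python
import math

def descriptor_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def screen_database(query, database, top_k=3):
    """Rank database entries by distance to the query descriptor.

    database: dict mapping an entry name to its precomputed descriptor vector.
    Returns the top_k (name, distance) pairs, closest first.
    """
    scored = [(name, descriptor_distance(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]
```

For example, `screen_database([1.0, 0.0], {"a": [1.0, 0.1], "b": [0.0, 1.0]}, top_k=1)` returns the entry `"a"` as the closest shape. Since descriptors are computed once per structure, each query costs only one pass over short vectors, which is what makes real-time screening plausible.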
-
-
-
Structural Models of Protein-DNA Complexes Based on Interface Prediction and Docking
Authors: Sanbo Qin and Huan-Xiang Zhou
Protein-DNA interactions are the physical basis of gene expression and DNA modification. Structural models that reveal these interactions are essential for their understanding. As only a limited number of structures for protein-DNA complexes have been determined by experimental methods, computational methods provide a potential way to fill the need. We have developed the DISPLAR method to predict DNA binding sites on proteins. Predicted binding sites have been used to assist the building of structural models by docking, either by guiding the docking or by selecting near-native candidates from the docked poses. Here we applied the DISPLAR method to predict the DNA binding sites for 20 DNA-binding proteins, which have had their DNA binding sites characterized by NMR chemical shift perturbation. For two of these proteins, the structures of their complexes with DNA have also been determined. With the help of the DISPLAR predictions, we built structural models for these two complexes. Evaluations of both the DNA binding sites for 20 proteins and the structural models of the two protein-DNA complexes against experimental results demonstrate the significant promise of our model-building approach.
-
-
-
A Sampling-Based Method for Ranking Protein Structural Models by Integrating Multiple Scores and Features
Authors: Xiaohu Shi, Jingfen Zhang, Zhiquan He, Yi Shang and Dong Xu
One of the major challenges in protein tertiary structure prediction is structure quality assessment. In many cases, protein structure prediction tools generate good structural models, but fail to select the best models from a huge number of candidates as the final output. In this study, we developed a sampling-based machine-learning method to rank protein structural models by integrating multiple scores and features. First, features such as predicted secondary structure, solvent accessibility and residue-residue contact information are integrated by two Radial Basis Function (RBF) models trained from different datasets. Then, the two RBF scores and five selected scoring functions developed by others, i.e., Opus-CA, Opus-PSP, DFIRE, RAPDF, and Cheng Score, are synthesized by a sampling method. Finally, another integrated RBF model ranks the structural models according to the features of the sampling distribution. We tested the proposed method by using two different datasets, including the CASP server prediction models of all CASP8 targets and a set of models generated by our in-house software MUFOLD. The test result shows that our method outperforms any individual scoring function on both best model selection and overall correlation between the predicted ranking and the actual ranking of structural quality.
-
-
-
Neural Network Pairwise Interaction Fields for Protein Model Quality Assessment and Ab Initio Protein Folding
Authors: Alberto J.M. Martin, Claudio Mirabello and Gianluca Pollastri
In order to use a predicted protein structure one needs to know how good it is, as the utility of a model depends on its quality. To this aim, many Model Quality Assessment Programs (MQAP) have been developed over the last decade, with MQAP also being assessed at the CASP competition. We present a new knowledge-based MQAP which evaluates single protein structure models. We use a tree representation of the Cα trace to train a novel Neural Network Pairwise Interaction Field (NN-PIF) to predict the global quality of a model. NN-PIF allows fast evaluation of multiple structure models for a single sequence. In our tests on a large set of structures, our networks outperform most other methods based on different and more complex protein structure representations in global model quality prediction. Moreover, given that NN-PIF can evaluate protein conformations very fast, we train a separate version of the model to gauge its ability to fold protein structures ab initio. We show that the resulting system, which relies only on basic information about the sequence and the Cα trace of a conformation, generally improves the quality of the structures it is presented with and may yield promising predictions in the absence of structural templates, although more research is required to harness the full potential of the model.
-
-
-
Solvent and Lipid Accessibility Prediction as a Basis for Model Quality Assessment in Soluble and Membrane Proteins
Authors: Mukta Phatak, Rafal Adamczak, Baoqiang Cao, Michael Wagner and Jaroslaw Meller
On-going efforts to improve protein structure prediction stimulate the development of scoring functions and methods for model quality assessment (MQA) that can be used to rank and select the best protein models for further refinement. In this work, sequence-based prediction of relative solvent accessibility (RSA) is employed as a basis for a simple MQA method for soluble proteins, and subsequently extended to the much less explored case of (alpha-helical) membrane proteins. In analogy to soluble proteins, the level of exposure to the lipid of amino acid residues in transmembrane (TM) domains is captured in terms of the relative lipid accessibility (RLA), which is predicted from sequence using low-complexity Support Vector Regression models. On an independent set of 23 TM proteins, the new SVR-based predictor yields a correlation coefficient (CC) of 0.56 between the predicted and observed RLA profiles, as opposed to a CC of 0.13 for a baseline predictor that utilizes the TMLIP2H empirical lipophilicity scale (with standard deviations of about 0.15). A simple MQA approach is then defined by ranking models of membrane proteins in terms of consistency between predicted and observed RLA profiles, as a measure of similarity to the native structure. The new method does not require a set of decoy models to optimize parameters, circumventing current limitations in this regard. Several different sets of models, including those generated by fragment based folding simulations, and decoys obtained by swapping TM helices to mimic errors in template based assignment, are used to assess the new approach. Predicted RLA profiles can be used to successfully discriminate near native models from non-native decoys in most cases, significantly improving the separation of correct and incorrectly folded models compared to a simple baseline approach that utilizes TMLIP2H. As suggested by the robust performance of a simple MQA method for soluble proteins that utilizes more accurate RSA predictions, further significant improvements are likely to be achieved. The steady growth in the number of resolved membrane protein structures is expected to yield enhanced RLA predictions, facilitating further efforts to improve de novo and template based prediction of membrane protein structure.
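The ranking idea described above, scoring each candidate model by how well its observed accessibility profile agrees with the profile predicted from sequence, can be sketched as follows. This is a minimal illustration, not the authors' implementation: using the Pearson correlation coefficient as the consistency measure is an assumption, and the function names and toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as uninformative.
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def rank_models(predicted_profile, model_profiles):
    """Rank structural models by agreement with the predicted profile.

    model_profiles: dict mapping a model name to the per-residue accessibility
    computed from that model's 3D coordinates (toy values in this sketch).
    Returns (name, score) pairs, best-scoring model first.
    """
    scores = {name: pearson(predicted_profile, prof)
              for name, prof in model_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the score depends only on the sequence-based prediction and the model's own geometry, no decoy set is needed to fit parameters, which matches the advantage stated in the abstract.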
-
-
-
On the Relationship Between Catalytic Residues and their Protein Contact Number
Authors: Shao-Wei Huang, Sung-Huan Yu, Chien-Hua Shih, Huei-Wen Guan, Tsun-Tsao Huang and Jenn-Kang Hwang
Due to advances in structural biology, an increasing number of protein structures of unknown function have been deposited in the Protein Data Bank (PDB). These proteins are usually characterized by novel structures and sequences. Conventional comparative methodology (such as sequence alignment, structure comparison, or template search) is unable to determine their function. Thus, it is important to identify a protein's function directly from its structure, but this is not an easy task. One of the strategies used is to analyze whether there are distinctive structure-derived features associated with functional residues. If so, one may be able to identify the functional residues directly from a single structure. Recently, we have shown that protein weighted contact number is related to atomic thermal fluctuations and can be used to derive motional correlations in proteins. In this report, we analyze the weighted contact-number profiles of both catalytic residues and non-catalytic residues for a dataset of 760 structures. We found that catalytic residues have distinct distributions of weighted contact numbers from those of non-catalytic residues. Using this feature, we are able to effectively differentiate catalytic residues from other residues with a single optimized threshold value. Our method is simple to implement and compares favourably with other more sophisticated methods. In addition, we discuss the physics behind the relationship between catalytic residues and their contact numbers as well as other features (such as residue centrality or B-factors) associated with catalytic residues.
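A single-threshold predictor of the kind described above can be sketched in Python. The abstract does not state the formula, so the inverse-square distance weighting used here (each residue sums 1/r² over all other residues, using one representative atom such as Cα) is an assumption, as are the function names, the direction of the cutoff (higher values flagged), and the threshold value itself, which in practice would have to be optimized on labeled data.

```python
def weighted_contact_numbers(coords):
    """Weighted contact number per residue from one (x, y, z) point per residue.

    Assumption: inverse-square weighting, WCN_i = sum over j != i of 1 / r_ij**2.
    Coordinates are assumed to be distinct (no zero distances).
    """
    wcn = []
    for i, (xi, yi, zi) in enumerate(coords):
        total = 0.0
        for j, (xj, yj, zj) in enumerate(coords):
            if i == j:
                continue
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            total += 1.0 / d2
        wcn.append(total)
    return wcn

def flag_residues(coords, threshold):
    """Flag residues whose weighted contact number is at or above the threshold.

    Both the threshold and the choice to flag high values are hypothetical here.
    """
    return [w >= threshold for w in weighted_contact_numbers(coords)]
```

On three collinear points spaced one unit apart, the middle point gets the largest value (2.0 versus 1.25 at the ends), illustrating that the measure is highest for densely surrounded residues.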
-
-
-
Conotoxin Superfamily Prediction Using Diffusion Maps Dimensionality Reduction and Subspace Classifier
Authors: Jiang-Bo Yin, Yong-Xian Fan and Hong-Bin Shen
Conotoxins are small, disulfide-rich, channel-targeted peptides that target neuronal receptors and have been demonstrated to be potent pharmaceuticals in the treatment of Alzheimer's disease, Parkinson's disease, and epilepsy. Accurate prediction of the conotoxin superfamily would have many important applications towards the understanding of its biological and pharmacological functions. In this study, a novel method, named dHKNN, is developed to predict the conotoxin superfamily. Firstly, we extract the protein's sequential features composed of physicochemical properties, evolutionary information, predicted secondary structures and amino acid composition. Secondly, we use diffusion maps for dimensionality reduction, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain an efficient representation of data geometric descriptions. Finally, an improved K-local hyperplane distance nearest neighbor subspace classifier method called dHKNN is proposed for predicting conotoxin superfamilies by considering the local density information in the diffusion space. An overall accuracy of 91.90% is obtained through the jackknife cross-validation test on a benchmark dataset, indicating that the proposed dHKNN is promising.
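The jackknife (leave-one-out) test mentioned above holds out each sample in turn, trains on the rest, and reports the fraction of held-out samples classified correctly. The Python sketch below shows that evaluation loop with a plain 1-nearest-neighbor classifier swapped in for simplicity; it is not the dHKNN classifier, it involves no diffusion-map step, and the function names and toy data are hypothetical.

```python
import math

def nearest_neighbor_label(query, samples):
    """Label of the training sample closest to the query (plain 1-NN)."""
    features, label = min(samples, key=lambda s: math.dist(query, s[0]))
    return label

def jackknife_accuracy(dataset):
    """Leave-one-out accuracy over a list of (feature_vector, label) pairs."""
    correct = 0
    for i, (features, label) in enumerate(dataset):
        # Hold out sample i and classify it using all remaining samples.
        rest = dataset[:i] + dataset[i + 1:]
        if nearest_neighbor_label(features, rest) == label:
            correct += 1
    return correct / len(dataset)
```

The jackknife test is deterministic for a given dataset, unlike random k-fold splits, so the same benchmark always yields the same accuracy, which makes results easier to compare across methods.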
