Current Bioinformatics - Volume 11, Issue 2, 2016
Improved Prediction of DNA-Binding Proteins Using Chaos Game Representation and Random Forest
Authors: Xiaohui Niu and Xuehai Hu
DNA-binding proteins (DNA-BPs) play an important role in many biological processes. Next-generation sequencing technologies are now widely used to obtain the genomes of many organisms. Consequently, accurate and rapid identification of DNA-BPs will significantly aid genome annotation. Chaos game representation (CGR) can reveal information hidden in protein sequences. Furthermore, fractal dimension is a vital index for measuring the compactness of complex and irregular geometric objects. In this research, to extract the intrinsic correlation between protein sequence and the DNA-binding property, the CGR algorithm and fractal dimension, together with amino acid composition, are applied to formulate the protein samples. We employ random forest as the classifier to predict DNA-BPs based on sequence-derived features comprising amino acid composition and fractal dimension. The resulting predictor is compared with three important existing methods, DNA-Prot, iDNA-Prot and DNAbinder, on the same datasets. On two benchmark datasets from DNA-Prot and iDNA-Prot, the average accuracies (ACC) reach 82.07% and 84.91%, respectively, and the average Matthews correlation coefficients (MCC) reach 0.6085 and 0.6981, respectively. The point-to-point comparisons demonstrate that our fractal approach shows some improvements.
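
The abstract above reports ACC and MCC. As a reminder of how these two metrics are usually computed from a binary confusion matrix, here is a minimal sketch. The function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here), because the ratio is otherwise undefined.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC stays informative when the two classes have very different sizes, which is one reason both values are often reported together.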
Analysis of Differential Gene Expression Based on Bayesian Estimation of Variance
Authors: Jiyuan An, John Lai, Lingzao Zeng and Colleen C. Nelson
Gene expression is arguably the most important indicator of biological function. Thus, identifying differentially expressed genes is one of the main aims of high-throughput studies that use microarray and RNA-seq platforms to study deregulated cellular pathways. There are many tools for analysing differential gene expression from transcriptomic datasets. The major challenge is estimating gene expression variance, owing to the high level of 'background noise' generated by biological equipment and the lack of biological replicates. Bayesian inference has been widely used in the bioinformatics field. In this work, we show that the prior knowledge employed in the Bayesian framework also helps to improve the accuracy of differential gene expression analysis when only a small number of replicates is available. We have developed a differential analysis tool that uses Bayesian estimation of the variance of gene expression for use with small numbers of biological replicates. Our method is more consistent than the widely used Cyber-T tool, which successfully introduced the Bayesian framework to differential analysis. We also provide a user-friendly web-based graphical user interface for biologists to use with microarray and RNA-seq data. By using pseudo replicates as prior knowledge, Bayesian inference can compensate for the instability of the variance estimate that arises with a small number of biological replicates. We also show that our new strategy for selecting pseudo replicates improves the performance of the analysis.
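
To make the idea of stabilising a variance estimate with prior knowledge concrete, the sketch below uses a generic conjugate-prior form: a prior variance `prior_var` weighted by `prior_dof` pseudo-observations is combined with the sample variance of the few real replicates. This is a textbook-style illustration under those assumptions, not the tool described above, and all names and numbers are hypothetical.

```python
from statistics import variance

def regularized_variance(values, prior_var, prior_dof):
    """Posterior-mean style variance estimate under a conjugate prior.

    values    : expression values of one gene across n replicates (n >= 2)
    prior_var : background variance, e.g. from genes with similar expression
    prior_dof : weight of the prior, in pseudo-observations (assumed > 0)
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    sample_var = variance(values)  # unbiased sample variance
    return (prior_dof * prior_var + (n - 1) * sample_var) / (prior_dof + n - 2)

# Three hypothetical replicates whose sample variance is 1.0.
estimate = regularized_variance([1.0, 2.0, 3.0], prior_var=4.0, prior_dof=10)
```

With few replicates the estimate is pulled towards `prior_var`; as the number of replicates grows, the sample variance dominates.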
Enhanced Prediction of Small Non-coding RNA in Bacterial Genomes Based on Improved Inter-Nucleotide Distances of Genomes
Authors: Li-Qian Zhou, Rui Li and Liu Hu
Small non-coding RNA genes have become an important research focus in the life sciences in recent years. They play important regulatory roles in cellular processes. However, predicting non-coding RNA genes is a great challenge, because non-coding RNAs are small, are not translated into proteins and show variable stability. In this paper, we propose an improved inter-nucleotide distance model for sequence characterization and combine it with support vector machines (SVM) to predict small non-coding RNAs in bacterial genomes. The prediction result for the mixed bacterial ncRNA is 95.38%, which shows that our method can effectively predict bacterial ncRNAs.
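
For readers unfamiliar with inter-nucleotide distances, the sketch below computes the basic form: for each nucleotide, the gaps between consecutive occurrences of that same nucleotide. The abstract does not specify how the improved model differs, so this shows only the basic idea, and the function name is an assumption.

```python
def inter_nucleotide_distances(seq):
    """Return, per nucleotide, the distances between consecutive occurrences.

    Characters other than A, C, G and T are skipped but still advance
    the position counter.
    """
    last_seen = {}
    distances = {base: [] for base in "ACGT"}
    for position, base in enumerate(seq.upper()):
        if base not in distances:
            continue
        if base in last_seen:
            distances[base].append(position - last_seen[base])
        last_seen[base] = position
    return distances

# A occurs at positions 0, 4 and 5, so its distances are [4, 1].
example = inter_nucleotide_distances("ACGTAACG")
```

Histograms of these distance lists could then serve as fixed-length feature vectors for a classifier such as an SVM.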
Protein Folding Kinetic Order Prediction from Amino Acid Sequence Based on Horizontal Visibility Network
Authors: Zhi-Qin Zhao, Zu-Guo Yu, Vo Anh, Jing-Yang Wu and Guo-Sheng Han
Protein folding is one of the most important problems in molecular biology. The kinetic order of protein folding is one of the main aspects of the folding process. Previous methods for predicting protein folding kinetic order require information on the tertiary or predicted secondary structure of a protein. In this paper, based on the physicochemical properties of amino acids, we propose an approach to predict the protein folding kinetic order from the primary structure of a protein using a support vector machine combined with principal component analysis. The horizontal visibility network, Hilbert-Huang transform, global descriptor, and Lempel-Ziv complexity are used to extract features in our approach. To evaluate our approach, the leave-one-out cross-validation test is employed on two widely used data sets ("IvankovData" and "ZhengData") consisting of two-state and multi-state proteins. The overall prediction accuracy reaches 83.87% for the "IvankovData" data set and 85% for the "ZhengData" data set. Comparisons with existing methods show that the present approach performs better on the "IvankovData" data set. These results indicate that the present approach is effective and valuable for predicting protein folding kinetic order. Based on factor analysis, we find that the length of the protein sequence and the hydrophobicity and hydrophilicity of amino acids are important features in our approach.
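
A horizontal visibility network turns a numeric series into a graph: two points are linked when every value strictly between them is lower than both. The sketch below builds the edge list for a plain list of numbers. How the authors map amino acids to numbers is not stated in the abstract, so the input series here is a hypothetical placeholder.

```python
def horizontal_visibility_edges(series):
    """Return edges (i, j), i < j, of the horizontal visibility graph.

    Points i and j are connected when all values strictly between them
    are smaller than min(series[i], series[j]).
    """
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            # Once a value at least as high as series[i] is met,
            # nothing further to the right is visible from i.
            if highest_between >= series[i]:
                break
    return edges

# Hypothetical series, e.g. a physicochemical property value per residue.
edges = horizontal_visibility_edges([3, 1, 2, 4])
```

Graph statistics such as the degree distribution of the resulting network can then be used as sequence features.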
Global Propagation Method for Predicting Protein Function by Integrating Multiple Data Sources
Authors: Jun Meng, Xin Zhang and Yushi Luan
Protein function prediction is one of the most important tasks in bioinformatics. High-throughput experiments have generated large-scale genomics and proteomics data. To annotate proteins accurately, it is necessary to integrate these heterogeneous data sources. In this paper, a multi-source protein global propagation (MS-PGP) algorithm is proposed, which integrates multiple data sources and combines them with the protein global propagation with label correlation (PGP) algorithm to predict functions for unannotated proteins. Specifically, we use three data sources to predict protein functions: sequence data, microarray gene expression data and protein-protein interaction data. A naïve Bayesian method is adopted to fuse the three data sources into a combined network. Gene Ontology biological process annotation is used to calculate the association scores between unannotated proteins and functions. The experimental results on yeast show that the proposed method has a higher accuracy than other multiple-network methods and is efficient in predicting the function of unannotated proteins.
Prioritizing Disease Genes by Using Search Engine Algorithm
Authors: Min Li, Ruiqing Zheng, Qi Li, Jianxin Wang, Fang-Xiang Wu and Zhuohua Zhang
Identifying disease genes from a large number of candidates for a specific disease is a fundamental challenge. As biological experiment-based methods are generally time-consuming and laborious, identifying disease candidates by computational approaches has become a new strategy. In this paper, we propose an algorithm based on the search engine ranking method, named PDGTR, to prioritize disease candidates. First, we constructed a weighted human disease network by calculating the topological similarity and phenotype similarity of each pair of diseases. Then, we calculated the similarities of all the genes by using the protein-protein interaction network and the edge clustering coefficient. For a specific disease, a logistic regression model was used to generate the prior knowledge of each gene. Finally, the search engine ranking based algorithm PDGTR was applied to prioritize the disease candidates. PDGTR was tested on five diseases (breast cancer, colorectal cancer, hepatocellular carcinoma, gastric cancer and osteoporosis) and compared with four state-of-the-art algorithms: RWR, DADA, PRINCE and PRP. The experimental results based on leave-one-out cross-validation, precision, ROC curve, and enrichment show that PDGTR outperforms RWR, DADA, PRINCE and PRP. Moreover, some potential disease genes predicted by PDGTR have already been reported in the literature.
Network Propagation Reveals Novel Features Predicting Drug Response of Cancer Cell Lines
Authors: Jiguang Wang, Judith Kribelbauer and Raul Rabadan
Translating data derived from cancer genomes into personalized cancer therapy is a holy grail of computational biology. An important, yet challenging, question in this undertaking is how to relate features of tumor cells to clinical outcomes of anticancer drugs. Recent progress in large pharmacogenomic studies has provided a wealth of data about cancer cell lines, indicating that many genetic and gene expression candidates might predict the drug response of cancer cells. Unfortunately, most of the predicted features are inconsistent with current clinical knowledge and lack mutual dependencies that could explain their molecular mode of action. To address this question, we have developed a new method, named dNetFS, to prioritize genetic and gene expression features of cancer cell lines that predict drug response, by integrating genomic/pharmaceutical data, the protein-protein interaction network, and prior knowledge of drug-target interactions using network propagation techniques. Compared with previous methods, dNetFS is more accurate in cross-validation analysis, and it is able to reveal the key pathways involved in drug response. It therefore provides a basis for identifying the underlying molecular mechanism of a given compound in different genomic backgrounds.
Applications of Random Walk Model on Biological Networks
Authors: Wei Peng, Jianxin Wang, Zhen Zhang and Fang-Xiang Wu
Biological networks play a significant role in addressing biological problems. The random walk model is a highly efficient way to study networks and has been widely used to solve network-based biological problems. In this work, those biological problems are classified into four categories: ranking nodes in biological networks, measuring similarity or distance between nodes in biological networks, detecting models from biological networks and finding interrelationships between nodes from different biological networks. We then survey recent advances in applying random walk models to these types of problems on the basis of biological networks.
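
One widely used random walk variant for ranking nodes is the random walk with restart, in which the walker returns to a set of seed nodes with a fixed probability at every step. The sketch below is a minimal pure-Python version for an undirected network stored as a dictionary of neighbour sets. The restart probability of 0.3, the tolerance and the toy network are illustrative assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adjacency : dict mapping each node to a set of neighbours
                (every node is assumed to have at least one neighbour)
    seeds     : nodes where the walk starts and restarts
    """
    nodes = sorted(adjacency)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            # Each node spreads its probability evenly over its neighbours.
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        change = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Toy path network a - b - c with a as the only seed.
network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = random_walk_with_restart(network, seeds={"a"})
```

Nodes closer to the seed receive higher steady-state scores, which is what makes the method usable for ranking candidates in a network.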
A Markov Clustering Based Link Clustering Method to Identify Overlapping Modules in Protein-Protein Interaction Networks
Authors: Yan Wang, Guishen Wang, Di Meng, Lan Huang, Enrico Blanzieri and Juan Cui
Previous studies indicated that many overlapping structures exist among the modular structures in protein-protein interaction (PPI) networks, which may reflect common functional components shared by different biological processes. In this paper, a Markov clustering based Link Clustering (MLC) method for identifying overlapping modular structures in PPI networks is proposed. First, the MLC method calculates the extended link similarity and derives a similarity matrix to represent the relevance among the protein interactions. Then it employs Markov clustering to partition the link similarity matrix and obtains overlapping network modules with significantly fewer parameters and threshold constraints than most current methodologies. Experiments on two networks with known reference classes and on two biological PPI networks, of Escherichia coli and Saccharomyces cerevisiae, show that MLC outperforms the original Link Clustering and the classical Clique Percolation Method in accurately identifying the core modules in each test network. Therefore, we consider the MLC method highly promising for identifying important pathways by studying the interplay between functional processes in different organisms.
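
The Markov clustering step mentioned above alternates two matrix operations: expansion (squaring the matrix) and inflation (raising entries to a power and renormalising columns). The sketch below applies plain Markov clustering to a small node adjacency matrix. It does not reproduce the MLC method, which clusters a link similarity matrix, and the parameter values are assumptions.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_clustering(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adjacency)
    # Self-loops are added so that every column has a non-zero sum.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[x ** inflation for x in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Each non-empty row of the final matrix lists the members of a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            clusters.add(members)
    return clusters

# Toy graph: a triangle (nodes 0, 1, 2) and a separate edge (nodes 3, 4).
toy = [[0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0]]
modules = markov_clustering(toy)
```

A higher inflation value generally yields more, smaller clusters, which is why it is typically the main parameter to tune.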
Detecting Non-Trivial Protein Structure Relationships
Automated methods for protein three-dimensional structure comparison play an important role in understanding protein function, evolution and biochemical reaction mechanisms. Since the tertiary structure of proteins is more conserved than their amino-acid sequences, accurately aligning three-dimensional structures allows homology to be detected between proteins in the "twilight zone", those sharing less than ~25% sequence identity. Unfortunately, existing methods for protein structure comparison are often unable to properly compare and align proteins related by complex structural modifications, such as circular permutations, large conformational changes and large residue insertions and deletions. In this paper, we present an algorithm capable of computing biologically meaningful alignments from structurally homologous but spatially distant fragments. Accurate alignments of proteins that have undergone large conformational variations are derived from multiple spatial superpositions. For mild to moderate conformational variations, approximate rigid body superpositions are recursively relaxed to allow matching of spatially distant regions. The algorithm incorporates an exact procedure for computing alignments of proteins related by circular permutations. We used two benchmarking datasets to demonstrate that our algorithm compares favorably to some of the most accurate methods available today. In the most difficult RIPC test set, the median accuracy of our method is 100%. The algorithm is freely available as a Web service at http://bioinfo.cs.uni.edu.
Reconstruction, Topological and Gene Ontology Enrichment Analysis of Cancerous Gene Regulatory Network Modules
By Khalid Raza
The availability of large sets of high-throughput biological data calls for algorithms that automatically reconstruct gene regulatory networks from these datasets. Cancerous regulatory network modules, when analyzed critically, may reveal the underlying mechanism of cancer, which may help in better diagnosis. Identification of cancerous genes and their regulation is an important research area in cancer systems biology. In this paper, we introduce an algorithm to infer cancerous gene regulatory network modules from gene expression profiles. The proposed algorithm has been applied to a gene expression dataset of colon cancer patients, and several network modules have been identified. We performed topological analysis of the inferred network modules in terms of network density, degree distribution, clustering coefficient, average path length, network heterogeneity, and centrality measures. Further, GO-based enrichment analysis of the inferred network has been performed. To validate the proposed algorithm, it has been tested on a benchmark dataset taken from the DREAM3 challenge project.
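
Two of the topological measures listed above, network density and the clustering coefficient, have short standard definitions for undirected graphs. The sketch below computes both for a network stored as a dictionary of neighbour sets. Regulatory networks are often directed, so treating the network as undirected is a simplifying assumption, and the toy network is hypothetical.

```python
from itertools import combinations

def network_density(adjacency):
    """Fraction of possible undirected edges that are present."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    return 2 * edges / (n * (n - 1))

def clustering_coefficient(adjacency, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    nbrs = adjacency[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adjacency[u])
    return 2 * links / (k * (k - 1))

# Toy module: a triangle a-b-c with one extra gene d attached to c.
module = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
density = network_density(module)
coefficient_c = clustering_coefficient(module, "c")
```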
ORFpred: A Machine Learning Program to Identify Translatable Small Open Reading Frames in Intergenic Regions of the Plasmodium falciparum Genome
Authors: Vivek Srinivas, Mayank Kumar, Santosh Noronha and Swati Patankar
Motivation: Small Open Reading Frames (smORFs) are involved in a variety of cellular processes ranging from metabolism to gene regulation, and eukaryotic genomes have been predicted to contain a large number of smORFs. Only 174 smORFs have been annotated in the genome of the human malaria parasite Plasmodium falciparum. Although millions of smORFs can be extracted from the parasite genome, identifying translatable smORFs in the P. falciparum genome is challenging because existing smORF predictors have low accuracy when applied to an AT-biased genome. Result: We developed ORFpred, a machine learning algorithm that calculates the probability of translation initiation and elongation of ORFs in the P. falciparum genome. ORFpred identified 2204 translatable smORFs and showed higher accuracy than available predictors. We believe that ORFpred will help in identifying probable protein-coding smORFs in other eukaryotic genomes. Availability and Implementation: The database used for training and testing the algorithm and the source code are freely available at http://www.bio.iitb.ac.in/~patankar/software/ORFpred.
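
The first step in any smORF analysis is extracting candidate open reading frames from the sequence. The sketch below scans the three forward reading frames for ATG ... stop stretches and keeps short ones. It is not ORFpred: it does no scoring, ignores the reverse strand, and the length limits (2 to 100 codons) are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2, max_codons=100):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon. Length is counted in codons, excluding the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs

# Hypothetical fragment containing one short ORF in the third frame.
candidates = find_orfs("CCATGAAATAGGG")
```

A predictor would then score each candidate, for example by estimating how likely translation is to initiate at its start codon.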
Ubipredictor: A New Tool for Species-Specific Prediction of Ubiquitination Sites Using Linear Discriminant Analysis
Authors: Muhammad Saeed, Wajya Ajmal, Anum Masood, M. Rizwan Riaz and Malik Nadeem Akhtar
Ubiquitination is involved in various cellular processes such as protein degradation and stability, cell cycle progression, transcriptional regulation, antigen processing, DNA repair, inflammation and regulation of apoptosis. In silico prediction of candidate lysine (K) residues for ubiquitination will not only save time and money but will also generate valuable data for further scientific research. We developed the Ubipredictor tool (http://chemdp.com/ubipredictor.php) for predicting potentially ubiquitinated lysines in protein sequences from human, mouse and yeast datasets using linear discriminant analysis (LDA). The statistically significant features selected through LDA were amino acid dimers, position-specific scoring matrix (PSSM) and physicochemical properties of amino acids such as electrostatic charge, heat capacity, codon diversity and secondary structure. Testing on three different model organism datasets (human, mouse, yeast) showed that the predictive performance of Ubipredictor was better than that of two existing tools. On human and mouse datasets, Ubipredictor was found to be more sensitive than Ubipred and Ubpred. Unlike previously designed tools, Ubipredictor was trained specifically on an experimentally verified ubiquitination dataset for each of the human, mouse and yeast species.
Understanding Effects of Psychological Stress on Physiology and Disease Through Human Stressome - An Integral Algorithm
Authors: Sushri Priyadarshini and Palok Aich
Psychological stress perturbs normal physiological function, or homeostasis. Restoring normalcy demands a greater supply of energy. A physiological mechanism acting via the activated stress response system aims to provide quick energy to deal with such emergency situations. If the stress response system remains activated for a longer period, maintaining physiological homeostasis becomes difficult because of the higher demand for energy, which eventually leads to increased susceptibility to infection or disease. Although there are reports associating psychological stress with physiological functions and diseases, a clear understanding of the mechanism of stress manifestation is yet to be established. To facilitate extensive exploration and prediction of possible mechanisms, integration of molecular (gene-level) data pertaining to psychological stress, physiological processes and stress-associated diseases is needed. We report the power of text mining in combination with our data-integration methods and mathematical formulation to develop integrated gene-association networks. These networks can be analyzed to gain holistic insights into the relationship between psychological stress-associated genes (the stressome) and related physiological functions and diseases. We built the human psychostressome networks to understand and predict the pathways and candidate genes responsible for perturbing the balance among various physiological functions and for disease manifestation. Using the current methodology, we were able to predict the involvement of serotonin receptors and uridine 5'-diphospho-glucuronosyltransferases in mediating the effects of psychological stress.
Suitability of Sequence-Based Feature Vector for Classification Algorithm Improves Accuracy of Human Protein-Protein Interaction Prediction: A Red Blood Cell Case Study
Authors: Afsaneh Maali, Mahmood A. Mahdavi and Reza Gheshlaghi
To classify human protein-protein interaction information and consolidate existing data, supervised learning algorithms are implemented. These algorithms require a feature vector to generate a prediction model, and feature vectors can be constructed from various input data. A feature vector that suits the classification algorithm yields a more predictive model and more accurate predictions from low-dimension vectors. To investigate the proper combination of feature sets and algorithms, three feature vectors, AA Frequency, AA Graphical Parameter, and AA Triplex, were constructed solely from the primary structure of human red blood cell proteins and then applied to five different classification methods. The results indicated that the support vector machine (SVM) algorithm produced the highest accuracy of 84.65% with the AA Graphical Parameter feature set, while it reached an accuracy of 80.65% with the AA Triplex feature set. Random forest (RF) achieved a high average accuracy of 83.69% across all three feature sets. The TAN Bayesian classifier performed better than NB with all three feature sets. The artificial neural network (ANN) classifier demonstrated the lowest average accuracy of 76%; however, its performance was comparable with that of TAN where the AA Triplex feature set was used, with an accuracy of 77.90%. These figures demonstrate that selecting an appropriate feature set for a classification task results in higher accuracy, with the advantage of using low-dimension feature vectors constructed from simpler data.