Current Bioinformatics - Volume 13, Issue 5, 2018
Group-sparse Modeling Drug-kinase Networks for Predicting Combinatorial Drug Sensitivity in Cancer Cells
Authors: Hui Liu, Libo Luo, Zhanzhan Cheng, Jianjiang Sun, Jihong Guan, Jie Zheng and Shuigeng Zhou
Background: Due to the intrinsic compensatory mechanisms and cross-talk among cellular signaling pathways, single-target drugs often fail to inhibit the survival pathways in cancer cells. Some multi-target combination drugs have demonstrated high sensitivity and low side effects in cancer therapy, and have thus drawn intensive attention from researchers and pharmaceutical enterprises. Method: Although a few computational methods have been developed to infer combination drug sensitivities based on drug-kinase interactions, they either depend on the binarization of drug-kinase binding affinities, which loses the weak drug-target inhibitions known to significantly affect anticancer effects, or disregard the functional group structure among the kinases involved in cancer signaling pathways. In this paper, we employed a sparse linear model, uncertain group sparse representation (UGSR), to infer the essential kinases governing cellular responses to drug treatments in cancer cells, based on massively collected drug-kinase interactions and drug sensitivity datasets over hundreds of cancer cell lines. The inferred essential kinases can subsequently be used to calculate cancer cell sensitivities to combination drugs. Results: Leave-one-out cross validation and two real cases show that our method achieves high performance in predicting the sensitivities of combination drugs. Moreover, a user-friendly web interface with an interactive network viewer, a tabular viewer and other graphical visualization plugins has been implemented to facilitate data access and interpretation.
-
-
-
TagDict: Prediction of Theoretical Spectra of Peptides Based on A Tag Dictionary
Authors: Yaojun Wang, Jingwei Zhang, Dongbo Bu and Shiwei Sun
Background: Tandem mass spectrometry (MS/MS) peptide identification is an important research topic in molecular biology; the comparison between an experimental spectrum and a theoretically predicted spectrum is a crucial step in many identification methods. Consequently, accurate prediction of the theoretical spectrum from a peptide sequence can potentially improve the performance of peptide identification and is a significant problem for mass spectrometry-based proteomics. Objective: We studied the mechanism of peptide fragmentation in the mass spectrometer and proposed a strategy for theoretical spectrum simulation, implemented in a new theoretical spectrum prediction model called TagDict. Method: TagDict builds a “tag dictionary” from an existing spectrum library and uses it for theoretical spectrum prediction. The dictionary collects a large number of records, each comprising a peptide segment and the fragment ion intensity at its middle adjacent position. Results: The full theoretical spectrum can be derived from the adjacent ion intensity ratios obtained by querying the “tag dictionary”. Compared with MassAnalyzer, the theoretical spectrum simulated by TagDict is more similar to the real spectrum. Conclusion: Compared with the existing spectrum prediction tool MassAnalyzer, the new approach not only simplifies theoretical spectrum simulation but also improves the accuracy of spectrum library searching when it is used to extend the spectrum library.
-
-
-
Large-scale Investigation of Long Noncoding RNA Secondary Structures in Human and Mouse
Authors: Xingli Guo, Lin Gao, Yu Wang, David K.Y. Chiu, Bingbo Wang, Yue Deng and Xiao Wen
Background: It is very likely that RNA secondary structures, more so than the sequence itself, are closely related to RNA function, especially for mRNAs and even short noncoding RNAs. However, the secondary structure of most lncRNAs (long noncoding RNAs) remains poorly understood. Method: Here, we perform a large-scale investigation of lncRNA secondary structures, especially the hairpin structural motif, in human and mouse based on computational prediction using the RNAfold software. Results: The main results show some differences between lncRNAs and mRNAs in various kinds of local secondary structures. However, there are many hairpins in lncRNAs, even in those with short sequence length, suggesting that lncRNAs are highly structured RNA molecules. Furthermore, in both the human and mouse genomes, more lncRNAs than mRNAs contain long-stem and big-loop hairpins. Notably, these hairpins in lncRNAs tend to pack together and form a junction-like structural motif, which we call a hairpin junction. Tetraloops are also analyzed to uncover probable associations with lncRNA functional stability. Conclusion: Taken together, we find that the secondary structure of lncRNAs has many characteristics, most of which are similar to those of mRNAs. We also provide evidence of various lncRNA secondary structural components, which can be exploited in lncRNA identification, the classification of different lncRNA types and the inference of functional annotation.
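As an illustration of the kind of hairpin analysis described in this abstract, the following minimal Python sketch locates hairpin loops in a dot-bracket structure string, the notation RNAfold uses for predicted secondary structures. The function names and the example string are hypothetical and are not taken from the article; the sketch assumes the structure string has already been separated from the sequence and energy fields of the RNAfold output.

```python
import re

# A hairpin loop in dot-bracket notation is an opening bracket, a run of
# unpaired positions (dots), and then a closing bracket: "(" "...." ")".
HAIRPIN = re.compile(r"\((\.+)\)")


def hairpin_loops(structure):
    """Return (loop_start, loop_length) for each hairpin loop (0-based start)."""
    return [(m.start(1), len(m.group(1))) for m in HAIRPIN.finditer(structure)]


def tetraloops(structure):
    """Return the start positions of hairpin loops with exactly four unpaired bases."""
    return [start for start, length in hairpin_loops(structure) if length == 4]


if __name__ == "__main__":
    # Hypothetical structure with two hairpins: loop sizes 4 and 3.
    example = "((((....))))..((...))"
    print(hairpin_loops(example))
    print(tetraloops(example))
```

Stem lengths and the "hairpin junction" motif would need a full base-pair table rather than a regular expression; this sketch covers only the loop positions and sizes.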
-
-
-
Prediction of Protein S-sulfenylation Sites Using a Deep Belief Network
Authors: Lulu Nie, Lei Deng, Chao Fan, Weihua Zhan and Yongjun Tang
Background: Protein S-sulfenylation, the reversible oxidative modification of cysteine thiol groups to cysteine S-sulfenic acids, is a post-translational modification (PTM) that plays a critical role in regulating protein function and signal transduction. The identification of specific protein S-sulfenylation sites is crucial to understanding the underlying molecular mechanisms. Objective: We sought to develop a computational method that can effectively predict S-sulfenylation sites by using optimally extracted properties. Method: We propose DBN-Sulf, which uses a Deep Belief Network (DBN) with Restricted Boltzmann Machines (RBMs) to reduce the feature dimensions from a combination of heterogeneous information, including amino acid related features, evolutionary features, and structure-based features. A support vector machine (SVM) based predictor is then built with the optimal features. Results: We evaluate the DBN-Sulf classifier using a training dataset of 1007 positive sites and 7837 negative sites with 5-fold cross validation, and obtain an AUC score of 0.80, an ACC of 0.85 and an MCC of 0.53, which are significantly better than those of the existing methods. We further validate our method on the independent test set and obtain promising results. Conclusion: The superior performance over existing S-sulfenylation site prediction approaches indicates the importance of the deep belief network-based feature extraction procedure.
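The ACC and MCC figures quoted above are both computed from a confusion matrix, and on an imbalanced dataset such as this one they can diverge considerably. The sketch below uses the standard definitions of accuracy and the Matthews correlation coefficient; the function names and the toy counts are illustrative assumptions and do not come from the article.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy, made-up counts: 100 positives and 900 negatives.
    # A classifier that predicts "negative" for everything is 90% accurate
    # but has an MCC of 0, which is why MCC is reported for imbalanced data.
    print(accuracy(tp=0, tn=900, fp=0, fn=100))
    print(mcc(tp=0, tn=900, fp=0, fn=100))
```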
-
-
-
Features Identification for Phenotypic Classification Based on Genes and Gene Pairs
Authors: Yansen Su, Yanxin Li, Zheng Zhang and Linqiang Pan
Background: The classification of phenotypes on microarray data has drawn much attention in the last few years. Known methods mainly focus on the selection or construction of features based on either genes or gene pairs on continuous-value gene expression data. However, few studies have sought to identify useful features based on both genes and gene pairs on binary-value gene expression data. Objective: In this work, we proposed a new algorithm, called FSGGP, to select both feature genes and feature gene pairs on binary-value gene expression data to improve two-phenotype classification. Method: We calculated the uncertainty coefficient, which represents how well a phenotype is described by a gene or gene pair under some possible relationship, and the exact relationship between the gene or gene pair and the phenotype was identified by the value of the uncertainty coefficient. Furthermore, the closeness between genes or gene pairs and phenotypes was calculated, and the genes or gene pairs closely related to phenotypes were selected. The redundancy of genes and gene pairs as features was calculated by cross entropy on the binary data, and the redundant feature genes or gene pairs were eliminated. The optimal feature sets were obtained by wrapper-based forward feature selection for three classical classifiers. Results: The algorithm was experimentally assessed on four public datasets. The results showed that FSGGP performed better than four known feature selection algorithms based on either genes or gene pairs in terms of average classification error rates. Conclusion: We developed an algorithm to select both feature genes and feature gene pairs on binary-value gene expression data, where the selection of feature gene pairs was implemented by identifying the higher logical relationship between gene pairs and phenotypes.
The comparison with four known feature selection algorithms suggests that feature selection algorithms based on both genes and gene pairs can achieve better performance than those based on either genes or gene pairs alone, and that the identification of higher logical relationships is an effective approach for the selection of feature gene pairs.
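To make the role of the uncertainty coefficient concrete, the following Python sketch computes the standard (Theil's) uncertainty coefficient U(Y|X) = (H(Y) - H(Y|X)) / H(Y) on binary data. It is an illustration under stated assumptions, not the FSGGP implementation: the function names, the toy data, and the convention of returning 0.0 for a constant phenotype are all hypothetical choices. The example shows a phenotype that neither gene explains alone but that the gene pair explains fully, which is the situation in which gene-pair features become useful.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each group of equal x, weighted by group size."""
    n = len(y)
    total = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        total += (count / n) * entropy(subset)
    return total


def uncertainty_coefficient(y, x):
    """U(Y|X) in [0, 1]: the fraction of the uncertainty in y explained by x."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0  # convention chosen for this sketch: constant phenotype
    return (h_y - conditional_entropy(y, x)) / h_y


if __name__ == "__main__":
    # Toy binary expression data (made up): phenotype = gene1 XOR gene2.
    gene1 = [0, 0, 1, 1]
    gene2 = [0, 1, 0, 1]
    phenotype = [0, 1, 1, 0]
    print(uncertainty_coefficient(phenotype, gene1))
    print(uncertainty_coefficient(phenotype, list(zip(gene1, gene2))))
```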
-
-
-
MyPhi: Efficient Levenshtein Distance Computation on Xeon Phi Based Architectures
Authors: Yuandong Chan, Kai Xu, Haidong Lan, Bertil Schmidt, Shaoliang Peng and Weiguo Liu
Background: Approximate string matching algorithms are widely used in bioinformatics, among which the bit-parallel Myers algorithm is a popular approach to computing the Levenshtein distance between two genome sequences. The bit-vector encoding of the Myers algorithm makes it feasible to extend to modern parallel architectures with wider-than-ever vector registers and many cores. Objective: The Myers algorithm has already been integrated into some NGS all mappers such as RazerS and GEM for the verification stage. Due to the huge number of NGS reads to be processed, there is demand to accelerate the bit-parallel Myers algorithm for higher throughput of NGS all mappers. In this paper, we aim to design an ultra-fast implementation of the Myers algorithm on Intel Xeon Phi based architectures, including KNL-based processors and KNC-based co-processors. Method: We designed a two-level framework to fully exploit the computing power of Xeon Phi based many-core architectures. At the coarse-grained thread level, we used multi-threading to utilize many cores. At the fine-grained VPU level, we proposed a novel vectorized computing method for the Myers algorithm. An in-depth analysis of memory access leads to a more cache-friendly searching strategy. Results: Performance evaluation revealed that MyPhi achieved a peak performance of 1.03 and 1.62 TCUPS (Trillion Cell Updates per Second) on the KNC-based Xeon Phi 7110 co-processor and the KNL-based Xeon Phi 7210 processor, respectively, which outperformed a multi-threaded scalar implementation on dual six-core CPUs by an average speedup of 8.95 and 14.08. Conclusion: We presented MyPhi to compute the Levenshtein distance between two strings efficiently on Xeon Phi based architectures. Performance evaluation has shown good speedups over other CPU-based and accelerator-enabled works as well as good scalability.
MyPhi can further be used as a building block for short read aligners, clustering algorithms and potentially other sequence alignment tools.
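For readers unfamiliar with the bit-parallel idea, the sketch below is a plain scalar Python version of the Myers algorithm for Levenshtein distance, in the formulation commonly attributed to Hyyrö. It is not the MyPhi implementation: MyPhi vectorizes this recurrence across wide registers and many cores, whereas this sketch relies on Python's arbitrary-precision integers as a single long bit vector. The function name is hypothetical.

```python
def myers_levenshtein(pattern, text):
    """Levenshtein distance via the bit-parallel Myers recurrence (scalar sketch)."""
    m = len(pattern)
    if m == 0:
        return len(text)

    mask = (1 << m) - 1      # keeps every vector to m bits
    top = 1 << (m - 1)       # bit for the last pattern position

    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    vp, vn = mask, 0         # vertical +1 / -1 deltas of the current DP column
    dist = m                 # value of the bottom cell of the current column

    for ch in text:
        pm = peq.get(ch, 0)
        d0 = ((((pm & vp) + vp) ^ vp) | pm | vn) & mask  # diagonal delta is 0
        hp = (vn | ~(d0 | vp)) & mask                    # horizontal delta +1
        hn = d0 & vp                                     # horizontal delta -1
        if hp & top:
            dist += 1
        if hn & top:
            dist -= 1
        # Shifting in a 1 models the first DP row (0, 1, 2, ...), which gives
        # the global edit distance rather than approximate substring matching.
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0

    return dist


if __name__ == "__main__":
    print(myers_levenshtein("kitten", "sitting"))
    print(myers_levenshtein("GATTACA", "GCATGCU"))
```

Each loop iteration updates a whole column of the dynamic-programming matrix with a constant number of bitwise operations, which is the property that makes the algorithm attractive for wide vector units.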
-
-
-
A Metric on the Space of Rooted Phylogenetic Trees
Background: The purpose of phylogenetic analysis is not only to show the evolutionary history of taxa, but also to comprehend the origin of life. Rooted phylogenetic trees are employed to express the result of phylogenetic analysis. Objective: Computing the dissimilarity of rooted phylogenetic trees has been instrumental in our understanding of the evolutionary relationships of species and in the analysis of methods for reconstructing phylogenetic trees. For example, in order to evaluate a method for constructing phylogenetic trees, we need to measure the differences among phylogenetic trees computed from different genes, or the differences between the constructed trees and the simulated trees or the true trees. Method: This paper proposes a new metric on the space of rooted phylogenetic trees that can be calculated in polynomial time in the size of the compared trees. The metric is based on the equivalence property of nodes. Results: Experimental results demonstrate that, among the compared distances, the Triple distance is least correlated with our distance and the Cluster distance is most correlated with it. Conclusion: The metric proposed in this paper is very effective. It is defined for rooted phylogenetic trees, but can be carried over to unrooted phylogenetic trees by appending an outgroup species to the tree.
-
-
-
Classification of Small GTPases with Hybrid Protein Features and Advanced Machine Learning Techniques
Authors: Zhijun Liao, Shixiang Wan, Yan He and Quan Zou
Objective: The small GTPase is an important molecular switch that plays an important role in numerous signal transduction pathways; the aim is to explore its binary classification features with machine learning algorithms. Methods: The sequences of small GTPases and of non-small GTPases were each clustered to remove similar entries. They were then divided into 10 datasets, each containing equal numbers of small GTPases and non-small GTPases. Three feature vectors were extracted from these datasets: 188-dimensional (188D), 400D, and motif-based (608D) features. The next step was classification based on the easy-classify.py software in scikit-learn, which integrated 12 classifiers; finally, conserved motifs were discovered with the MEME suite. Results: The three best-performing classifiers were logistic regression (LR), gradient boosting decision tree (GBDT), and bagging for 400D features; LibSVM, GBDT, and bagging for 188D features; and GBDT, bagging, and AdaBoost for 608D features. The top four classifiers overall were GBDT, bagging, LR, and AdaBoost according to commonly evaluated indices. GBDT obtained the highest area under the curve (AUC) value at 88.61%. The 400D features performed better than the 188D and 608D ones. Five conserved G-box motifs were discovered in the sequences of human small GTPases. Conclusion: This study provides the first report that the GBDT algorithm performs best for small GTPase classification.
-
-
-
Identification of Attention Deficit/Hyperactivity Disorder in Children Using Multiple ERP Features
Authors: Wenjie Li, Tiantong Zhou, Ling Zou, Jieru Lu, Hui Liu and Suhong Wang
Background and Objective: Attention deficit hyperactivity disorder (ADHD) is a typical neurodevelopmental disorder that occurs at early school age in children and often results in serious executive dysfunction. Recent ADHD studies highlight the great potential of the non-invasive event-related potential (ERP) technique. It is thus worth combining multiple features to form sensitive and robust biomarkers to distinguish children with ADHD from typically developing children. Methods: In this paper, we collected the EEG signals of sixty-eight children with ADHD and seventy-three age-matched typically developing children during a classic Simon-spatial Stroop task. A channel optimization method was used to select the feature channel. Time-domain features and frequency-domain features were extracted from the EEG data. Three classifiers were used to distinguish children with ADHD from typically developing children using multiple features as well as each single feature. Results: Children with ADHD showed weaker N2 and P2 signals than typically developing children. Behavioral response results showed that children with ADHD exhibited lower correct response rates, longer average response times and higher data variance. In the classification experiment, the performance of the three classifiers trained on multiple features was much better than that on a single feature. Multiple-feature classification achieved the highest accuracy of 96.6%, while single time-domain and frequency-domain features achieved highest accuracies of only 88.10% and 92.85%, respectively. All the highest accuracies were achieved on the feature channel in the inferior parietal cortex. Conclusion: The feature channel generally performed better than the empirical channel. The multiple ERP features classification method has good recognition accuracy and is worth further research as an aid to ADHD diagnosis.
-
-
-
Low Rank Representation and Its Application in Bioinformatics
Authors: Yuan You, Hongmin Cai and Jiazhou Chen
Background: Sparse representation has achieved tremendous success recently, and low-rank representation is one of the successful methods. It aims to capture the underlying low-dimensional structures of high-dimensional data and has attracted much attention in pattern recognition and signal processing. These successful applications are mainly due to its effectiveness in exploring low-dimensional manifolds embedded in data, which can be naturally characterized by the low rank of the data matrix. Objective: In this paper, we review the theoretical and numerical models based on low-rank representation and hope the review can attract more research in bioinformatics. Method: Low-rank representation is particularly well suited to big data analysis in bioinformatics. The first reason is that the objects of interest, such as copy number variations, are naturally sparse. The second reason is that there exist strong correlations among various modalities for the same object, such as DNA, RNA and methylation. Results and Conclusion: Its applications in bioinformatics, including mining of key gene subsets, finding common patterns across various modalities, and biomedical image analysis, are summarized by category.
-
-
-
RGDtrip: A Database for the Investigation of Proteins Containing the RGD Tripeptide
Background: The sequence Arginine-Glycine-Aspartic acid (RGD tripeptide) has been identified in most proteins implicated in cell adhesion and signal transduction. Moreover, the RGD paradigm extends to the plant and microbial kingdoms. Investigating this field can be facilitated by combining data from multiple databases into a single one. The RGD tripeptide database is a comprehensive resource with records including general annotation, ontology, database cross-references, sequence and structure data. Objective: In this work, we present the integration of a novel visualization tool within the RGDtrip 1.0 data collection and retrieval environment for proteins containing the RGD tripeptide. This approach allows state-of-the-art data querying combined with an advanced, user-friendly visualization environment. Method: The overall system architecture is based on a three-tier client-server model, comprising three main components: the client application, the application server and the database server. The underlying structure of RGDtrip is a relational database developed with Microsoft SQL Server. All the data compiled in RGDtrip, originally scattered across other databases such as UniProt, PDBdb, etc., have been incorporated into a visualization tool based on Microsoft's PivotViewer software. The tool enables users to see the data from many different perspectives and thus to gain a better view and understanding of them. Results: The RGDtrip database may be used for the investigation of proteins containing the RGD tripeptide and for drawing meaningful conclusions regarding, among other things, evolution, phylogenesis and pharmacological interactions with disease-implicated entities and possible loci of side effects.
The RGDtrip database offers the following main advantages: (i) a collection of about 32,000 proteins containing the RGD tripeptide in just one database and through a single user interface; (ii) the use of state-of-the-art technologies to deliver new data querying and visualization tools for scientists, thus allowing Visual Data Mining for both basic and applied research on the above-mentioned proteins. Conclusion: This paper describes the integration of existing information with advanced visualization and querying tools in a dedicated database to implement Visual Data Mining for basic and applied research on RGD-containing proteins.
-
-
-
Early Stage Identification of Alzheimer's Disease Using a Two-stage Ensemble Classifier
Authors: Bing Wang, Kun Lu, Xiao Zheng, Benyue Su, Yuming Zhou, Peng Chen and Jun Zhang
Background: Alzheimer's disease (AD) has attracted more and more attention in recent years. Accurate diagnosis of AD is important, especially at its prodromal stage, i.e., mild cognitive impairment (MCI), because timely therapy may help delay disease progression. Some existing studies indicated that different biomarkers provide complementary information to discriminate MCI patients from healthy normal controls (NCs), but the high complexity of these algorithms brought high computational cost. Objective: To identify Alzheimer's disease in its early stage with low computational complexity while still exploiting the complementarity of different biomarkers. Method: In this work, we employ ensemble learning to construct a two-stage classifier that combines the classification capacity of three biomarkers, i.e., magnetic resonance imaging (MRI), positron emission tomography (PET), and quantification of specific proteins measured in cerebrospinal fluid (CSF), to distinguish MCI patients from healthy controls based on the support vector machine (SVM) algorithm. In the first stage, two SVM classifiers, based on MRI and CSF respectively, are used for the identification of MCI. Samples on which the two classifiers agree in the first stage are treated as training data, and those with inconsistent results are passed to the second stage as test data, where PET features are used for classification. Results: An original dataset downloaded from the ADNI database, including 99 MCI patients and 52 healthy controls, was adopted for the validation of our proposed method. The experimental results demonstrated the effectiveness of the two-stage ensemble classifier, with a classification accuracy of 75.5%, a sensitivity of 78.4% and a specificity of 70.0%.
Conclusion: This study proposed a computational framework to identify the early stage of AD by a two-stage ensemble strategy. The performance of this work shows that the combination of different biomarkers can improve the identification of MCI at a relatively low computational cost, which is meaningful for the diagnosis of AD and for delaying its progression.
-
-
-
Genome-wide Characterization of Major Intrinsic Protein (MIP) Gene Family in Brachypodium distachyon
Authors: Ankush A. Saddhe, Shweta, Kareem A. Mosa, Kundan Kumar, Manoj Prasad and Om Parkash Dhankher
Background: Major intrinsic proteins (MIPs) are membrane channel proteins that maintain water homeostasis and are permeable to small molecules across the membrane. Objective: Genome analysis of the Brachypodium MIP (BdMIP) gene family and in silico studies were carried out with available bioinformatic tools. Further comparison and evolutionary study of MIP members were performed within the grass family. Method: MIP sequences were retrieved from the Gramene database and aligned, and a weblogo was generated. Physicochemical analysis was performed and a phylogenetic tree was constructed by neighbor-joining. The in silico expression profile of BdMIP genes was searched and image maps were generated by the CIMMiner web-based server. Result: Genome-wide analysis of B. distachyon identified 33 MIP genes, which were classified into four major groups. Analysis of motifs and transmembrane domains strongly supported their identity as members of the MIP superfamily. Duplication analysis revealed that 4 genes were tandemly duplicated and that no segmental duplication events occurred in BdMIPs. Prediction of cis-elements in the BdMIP promoter regions gave more insight into the regulation mechanism under hormonal and stress conditions. The in silico expression profile across developmental stages provided insight into the expression pattern of BdMIP genes. Conclusion: A total of 33 MIPs were predicted in the Brachypodium genome. Tandem duplication was the dominant phenomenon over segmental duplication in BdMIPs. Orthology analysis revealed that Brachypodium MIP members were closer to grass family MIP members than to those of Arabidopsis. This work will contribute significantly to the understanding of the evolutionary and biological importance of MIP genes in the grass family and thus provides a basis for functional genomics studies in Brachypodium.
-
-
-
Multiscale Products in B-spline Wavelet Domain: A New Method for Short Exon Detection
Authors: Xiaolei Zhang, Guishan Zhang, Yangjiang Yu, Guocheng Pan, Haitao Deng, Xinbao Shi, Yue Jiao, Renhua Wu and Yaowen Chen
Background: In sequencing human and model organism DNA, the development of efficient computational techniques for the rapid prediction of short exons in eukaryotes is a major challenge. Objective: This paper presents a multiscale products-based method in the B-spline wavelet domain for short exon detection. In our analysis, we find that the wavelet coefficients associated with introns are less correlated between consecutive scales than the coefficients related to exons. We explain this observation, obtained on the HMR195 dataset, by calculating the histogram distributions of the exon and intron coefficients. We employ these inter-scale correlation features to enhance exon structures and weaken background noise. Method: Our method is developed in two stages: (i) A new B-spline wavelet transform is designed to extract the exon features in the multiscale domain; this avoids setting the window length parameter, which affects the results, and the wavelet has a higher degree of freedom for curve design. (ii) Based on the significant difference in correlation features between the exon and intron coefficients, we present a multiscale products-based method to discriminate significant exon features from introns. Results: The BG570 and HMR195 datasets were used in the evaluation of the considered methods. In comparison with eight other existing techniques, the proposed method shows improvements of at least 26.8%, 9.5%, 8.2%, 3.5%, 10.2%, 4.5%, 7.8% and 6.4% for exon lengths of 0-24, 25-49, 50-74, 100-124, 125-149, 150-174, 175-199 and 200-299, respectively. Conclusion: Experimental results demonstrate that our approach leads to better performance for short exon detection.
