Current Bioinformatics - Volume 13, Issue 6, 2018
-
-
A Pipeline Architecture for Inferring and Visualizing Gene Networks from cDNA Microarray Expression Data in Crop Plants
Background: In many important crops, genomic studies are generating large amounts of data from cDNA sequencing and RNA expression experiments. Genomic data complement efforts to produce new plant varieties that resist major worldwide biotic problems, cope with climate change, and offer better quality. After the initial exploratory phase of genome sequencing and functional characterization of genes of interest, a post-genomics phase is moving towards understanding the function of the organism as a whole through Systems Biology. Objective: To develop a software architecture that facilitates gene network inference from high-throughput gene expression data collected from microarray experiments. Method: A pipeline architecture for data mining was designed, built, and validated using known pathways for starch and sucrose metabolism in plants. Results: The pipeline supports functional annotation of both putative homologs and new genes, and allows the identification of novel co-expressed gene clusters related to metabolically important traits. Conclusion: Our approach can be transferred between organisms, taking advantage of the open and adaptable R platform, and its visualizations of gene expression networks can easily be incorporated for web access.
-
-
-
Extracting Diagnostic Knowledge from MedLine Plus: A Comparison between MetaMap and cTAKES Approaches
Background: The development of diagnostic decision support systems (DDSS) requires a reliable and consistent knowledge base of diseases and their symptoms, signs, and diagnostic tests. Physicians are typically the source of this knowledge, but it is not always possible to obtain all the desired information from them. Other valuable sources are medical books and articles describing the diagnosis of diseases, but extracting this information is a hard and time-consuming task. Objective: We compare two well-known tools for natural language processing (NLP) in the medical domain, using them to perform Named Entity Recognition to extract diagnostic terms from the texts of MedLine Plus articles. Method: We used Web scraping, NLP techniques, a variety of publicly available sources of diagnostic knowledge, and two widely known medical concept identifiers, MetaMap and cTAKES, to extract diagnostic criteria for infectious diseases from MedLine Plus articles. Results: A performance comparison of MetaMap and cTAKES is presented. Although the differences between the two systems are not really significant, there are some palpable differences in the results they provide. Conclusion: The extraction of diagnostic terms is a very important task for building databases of this information. NLP systems capable of extracting those terms from texts are very valuable tools that need to be implemented and evaluated in order to obtain the maximum accuracy in this process.
-
-
-
Nextpresso: Next Generation Sequencing Expression Analysis Pipeline
Authors: O. Graña, M. Rubio-Camarillo, F. Fdez-Riverola, D.G. Pisano and D. Glez-Peña
Background: Many bioinformatics pipelines are now available to analyze transcriptomics data produced by high-throughput RNA sequencing. They implement different workflows that address several analysis tasks, supported by third-party programs. Nevertheless, a proper workflow definition for RNA-seq data analysis is still lacking. Objective: To properly define what a comprehensive RNA-seq data analysis workflow should be, to compare all available pipelines and, if no such solution is available, to implement a new pipeline. Method: We developed a new pipeline integrating state-of-the-art programs for the different parts of the RNA-seq analysis. We also used RUbioSeq libraries to achieve a scalable solution. Results: We defined a comprehensive RNA-seq data analysis workflow covering the most common needs of biologists and implemented it in a new pipeline, nextpresso. We also validate it in two case studies presented here. Conclusion: Nextpresso is a new, freely available pipeline covering the most common needs of RNA-seq data analysis. It is easy to configure, generates user-friendly results, and scales well for larger studies comprising a high number of samples.
-
-
-
Determining the Influence of Class Imbalance for the Triage of Biomedical Documents
Background: Unbalanced data is a well-known and common problem in many practical applications of machine learning, with remarkable effects on the performance of standard classifiers. Given the enormous growth of biomedical literature publicly available over the Internet, one relevant task for the biomedical community is the automatic classification of relevant documents for further research. Objective: The objective of this work is two-fold: to evaluate alternative strategies for alleviating the class imbalance problem, including a novel approach denoted LITl (Limited Iterative Tomek links), and to analyse the true impact of unbalanced data on the accurate triage of biomedical documents. Method: Different strategies are applied and evaluated over a standard corpus of Medline documents in which each entry is represented by a set of MeSH terms. Results: The experiments demonstrate the real effect of class imbalance on popular classifiers such as kNN, Naive Bayes, SVM and C4.5, and show how their performance can be improved by using appropriate balancing strategies. Conclusion: The classifier that suffers least from an imbalanced scenario comprising Medline documents is Naive Bayes. Moreover, we demonstrated that the performance of a given balancing strategy largely depends on the selected classifier. In this sense, the classifiers best suited to work with our LITl approach are kNN and C4.5.
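The abstract does not describe how LITl works, so the following is only a minimal sketch of the standard Tomek link definition that the name builds on: two samples of opposite classes form a Tomek link when each is the other's nearest neighbour. The function name `tomek_links`, the Euclidean distance, and the toy data are assumptions made for illustration; this is not the authors' LITl algorithm.

```python
from math import dist  # Euclidean distance, Python 3.8+


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Standard definition only (not LITl): i and j have different labels
    and are each other's nearest neighbour.
    """
    def nearest(i):
        # Index of the closest other point (ties go to the lowest index).
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Hypothetical toy data: label 1 plays the role of the minority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 1]
links = tomek_links(points, labels)
print(links)  # [(0, 1)]
```

In balancing strategies based on Tomek links, the majority-class member of each link is typically removed; how LITl limits or iterates that step is not stated in the abstract.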
-
-
-
Pharmacophore Mapping of Ligand Based Virtual Screening, Molecular Docking and Molecular Dynamic Simulation Studies for Finding Potent NS2B/NS3 Protease Inhibitors as Potential Anti-dengue Drug Compounds
Authors: A. J. Fathima, G. Murugaboopathi and P. Selvam
Background: Dengue virus (DENV) has become a crucial health concern. The NS2B/NS3 protease is a major drug target for DENV in rational drug design. At present, effective treatment of DENV is not possible because no specific anti-viral drugs are available. Drug repurposing studies found bromocriptine to be a potent anti-DENV drug-like compound; it is also an approved drug for the treatment of other diseases. Materials and Methods: Taking bromocriptine as a lead compound, pharmacophore feature-based virtual screening was performed to find effective target-specific protease inhibitors. Results: Out of 40,000 bromocriptine-like compounds screened against the NS2B/NS3 protease drug target, the compound ZINC92615064 was found to be highly potent compared with bromocriptine, based on its binding energies and ADMET properties. To further validate the results, molecular dynamics simulations of the NS2B/NS3 protease in complex with bromocriptine and in complex with ZINC92615064 were run for 20 nanoseconds to understand the plausible mode of action. Conclusion: The present study revealed several potent dengue NS2B/NS3 protease inhibitors that are worth considering for further clinical studies.
-
-
-
Codon Usage Pattern of Metallothionein Genes in the Poplar Genome
Authors: Junkai Zhi, Jian Zhang, Jian Li, Hao Zhang and Jichen Xu
Background: Metallothioneins (MTs) are important proteins for phytoremediation and are widely involved in heavy metal uptake, transport, and enrichment. Objective: The aim of this research was to identify all the MT genes in Populus trichocarpa and to determine their sequences, chromosome distribution, and codon-use preferences. The results further our understanding of the diversity of poplar MT genes and are useful for clarifying the function of poplar MTs in remediating soils contaminated by heavy metals. Method: The sequence similarity of the poplar MT genes was analyzed with the software DNAMAN, and the codon usage bias was analyzed with CodonW and CHIPS. Further statistical analysis of high-frequency codons was conducted for each type of MT. Results: Based on MT characteristics, 10 MT genes were identified in the Populus trichocarpa genome and categorized into four types. Their open reading frames ranged from 201 bp to 288 bp in length. The theoretical isoelectric point of poplar MTs ranged from 4.4 to 7.99, with an average value of 5.28. The amino acid similarity among poplar MT sequences ranged from 10.14% to 98.67%. A codon usage frequency analysis revealed a codon usage bias in the 10 poplar MT genes, with 21 high-frequency codons in total. Conclusion: These results reveal the characteristics of poplar MTs and provide the basis for further studies on their functional mechanisms and applications in phytoremediation.
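As a rough illustration of what a codon usage frequency analysis counts, the sketch below tallies the codons of an open reading frame and expresses each as a frequency per thousand codons. It is not a reimplementation of CodonW or CHIPS, and the function names and the toy sequence are hypothetical; the sequence is not a poplar MT gene.

```python
from collections import Counter


def codon_counts(orf):
    """Count codons in an open reading frame, read in frame from position 0."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return Counter(orf[i:i + 3] for i in range(0, len(orf), 3))


def per_thousand(counts):
    """Convert raw codon counts to frequencies per 1000 codons."""
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Hypothetical toy sequence, chosen only to make the counts easy to check.
toy_orf = "ATGTGTTGCTGTGGTTGA"
counts = codon_counts(toy_orf)
freq = per_thousand(counts)

for codon, n in counts.most_common():
    print(codon, n, round(freq[codon], 1))
```

A bias analysis would then compare such frequencies across synonymous codons for the same amino acid; the criterion the authors used to call a codon "high-frequency" is not given in the abstract.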
-
-
-
Identification and Analysis of Cancer Diagnosis Using Probabilistic Classification Vector Machines with Feature Selection
Authors: Xiuquan Du, Xinrui Li, Wen Li, Yuanting Yan and Yanping Zhang
Background: The accurate classification of tumor types is important for the treatment of cancer. With the progress of microarray expression profiling, many methods have been proposed to deal with these data. However, because the feature dimension of tumor gene expression profiles is very high, many machine learning algorithms fail. Objective & Methods: In this paper, a novel method, probabilistic classification vector machines (PCVM) with feature selection, is proposed for tumor type detection using gene expression data. PCVM adopts a signed and truncated Gaussian prior to solve the problem of unstable solutions, and the complexity of the model can be controlled by the truncated Gaussian prior. The performance of PCVM is evaluated on two datasets using four metrics. Results: The method achieves 84.21% accuracy on the leukemia dataset and 95.24% accuracy on the prostate dataset. PCVM performs much better than Support Vector Machines (SVM), Naïve Bayes (NB), RBF Neural Networks (RBF), K-nearest Neighbor (KNN), and Random Forest (RF), with the exception of SVM on the prostate dataset. To reduce computational time, we adopt a feature selection method (DX) to rank the features and search for the optimal feature combination based on PCVM. PCVM with the DX method (PCVM-DX) achieves 94.74% accuracy, 100% sensitivity, 85.71% specificity and 92.31% precision on the leukemia dataset. PCVM-DX obtained the same result as PCVM on the prostate dataset. We also compare DX with other feature selection methods; the results show that PCVM-DX is efficient for tumor classification in terms of performance. Conclusion: PCVM-DX is observed to be better than the other methods on two datasets. The novelty of this approach lies in applying PCVM to tackle the problem that using the same prior for different classes may lead to unstable solutions in RVMs, and in exploring the important feature subset in the microarray expression profile with feature selection.
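The four PCVM-DX figures reported for the leukemia dataset can be related to one another through a confusion matrix. The sketch below uses the standard definitions of accuracy, sensitivity, specificity and precision. The counts (12 true positives, 0 false negatives, 6 true negatives, 1 false positive) are an assumption: they are one confusion matrix consistent with the reported percentages, not values stated in the abstract, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }


# Assumed counts: chosen to reproduce the reported figures, not taken
# from the paper.
m = classification_metrics(tp=12, fn=0, tn=6, fp=1)
percent = {name: round(100 * value, 2) for name, value in m.items()}
print(percent)
# {'accuracy': 94.74, 'sensitivity': 100.0, 'specificity': 85.71, 'precision': 92.31}
```

Under these assumed counts the test set has 19 samples and a single misclassification, which shows how far one error moves each metric on a small microarray dataset.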
-
-
-
Comparative Study of FMN Riboswitch in Representative Species of Different Phyla
Authors: Sunita Yadav, D. Swati and Mayank Rashmi
Background: The flavin mononucleotide (FMN) specific riboswitch, also known as the RFN element, is frequently found in prokaryotes and also in some eukaryotes. The FMN riboswitch is directly involved in the expression of the Rib DEAHT genes in the biosynthesis and transport of riboflavin (vitamin B2). In bacteria, all the genes of the Rib operon are involved and the RFN element is found in a number of species, whereas in eukaryotes only one or two genes of the Rib operon are involved. Objective: In this study we searched for the FMN riboswitch in different phyla and compared the instances at the structural level. Method: The riboswitch finders RibEx and Infernal are used to predict the FMN riboswitch from sequences aligned to known sequences from two bacteria. The putative sequences are checked against the Rfam database, and the LocARNA web server and the RNAstructure tool are used to find the secondary structures. ModeRNA is used to find the tertiary structure, which is compared with a known structure from PDB using the ARTS server. AutoDock Vina is used to probe the binding of the ligand FMN to the aptamer of the riboswitch. Results and Conclusions: Amongst eukaryotes, very few instances of the FMN riboswitch have been identified in plants, and only one putative FMN riboswitch has been found among fungi. In archaea, the FMN riboswitch is found in only two species of Euryarchaeota. The crystallographic structure of the FMN riboswitch, with its six-stem junction, is well established in Fusobacterium nucleatum. In this study we find that the aptamer sequence of the FMN riboswitch forms a six-stem-loop structure (P1, P2, P3, P4, P5, P6) and confirm that the FMN ligand is bound in this six-stem-loop junction. The aptameric region is conserved, in both sequence and structure, in all the bacterial, archaeal, fungal and plant species studied. The tertiary structure of the FMN riboswitch is predicted in Bacillus thuringiensis, Methanobrevibacter smithii, Cannabis sativa and Epichloe glyceriae and is compared with the well-known structure from Fusobacterium nucleatum.
-
-
-
Imputation of Ignorable and Non-ignorable Missing Values in Large Datasets Using ACO with Local Search
Authors: R. D. Priya and R. Sivaraj
Background: Missing values in databases pose a serious threat to knowledge extraction. Especially in large databases integrated from multiple sources, the number of missing values may be high, which in turn may lead to biased inferences. Researchers have proposed many methods for handling ignorable missingness (Missing At Random and Missing Completely At Random) and non-ignorable missingness (Not Missing At Random). Still, gaps remain in (i) handling heterogeneous missing attributes, (ii) imputing missing values in large databases, and (iii) dealing with both ignorable and non-ignorable missingness. Objective: This paper addresses all three issues by proposing a single algorithm, Repopulated Bayesian Ant Colony Optimization (RPBACO), which hybridizes Bayesian and Ant Colony Optimization (ACO) techniques. Methodology: ACO chooses the covariate values required for optimal imputation of missing values based on the probability and pheromone update values of the ants. Bayesian principles are used to evaluate the fitness of solutions in the ACO process, which involves local beam search for repopulating in successive generations. RPBACO is applied to large real datasets to impute heterogeneous (discrete and continuous) missing values with both ignorable and non-ignorable patterns. Results: The experimental results are encouraging when compared with other existing standard techniques in terms of both imputation accuracy and computational time, calculated at missing rates from 5% to 50%. The statistical tests conducted to validate the experimental results also support the superiority of RPBACO on all the datasets considered. Conclusion: RPBACO can be successfully used for handling both ignorable and non-ignorable missing values in heterogeneous attributes in large datasets, with better imputation accuracy.
-
-
-
Predicting Enhancers from Multiple Cell Lines and Tissues across Different Developmental Stages Based On SVM Method
Authors: Hongda Bu, Jiaqi Hao, Jihong Guan and Shuigeng Zhou
Background: Enhancers are short DNA regions that improve transcription efficiency by recruiting transcription factors. Identifying enhancer regions is important for understanding the process of gene expression. Because enhancers act independently of their distance and orientation relative to their target genes, it is difficult to locate them accurately. Recently, with the development of high-throughput ChIP-seq (Chromatin Immunoprecipitation sequencing) technologies, several computational methods have been developed to predict enhancers. However, most of these methods rely on p300 binding sites and/or DNase I hypersensitive sites (DHSs) for selecting positive training samples, which is imprecise and leads to unsatisfactory prediction performance. Moreover, no published work predicts enhancers from tissues across different developmental stages. Methods: In this paper, we propose a method based on support vector machines (SVMs) to investigate enhancer prediction on cell lines and tissues from EnhancerAtlas. Specifically, we focus on predicting enhancers at different developmental stages of heart and lung tissues. Results and Conclusion: Our results show that 1) the proposed method achieves good performance on most cell lines and tissues, and in particular outperforms several state-of-the-art methods on heart and lung; and 2) it is easier to predict enhancers from adult-stage tissues than from fetal-stage tissues, as shown on both heart and lung tissues.