Current Bioinformatics - Volume 12, Issue 2, 2017
-
Evaluation of Selected DNA Spectral Analysis-Based Gene Prediction Techniques
Authors: Sajid A. Marhon and Stefan C. Kremer
This article analyzes and evaluates several DNA spectral analysis-based gene prediction techniques. Such an empirical review helps researchers assess the state of the art in this class of techniques impartially. The techniques are applied to five benchmark datasets, and receiver operating characteristic (ROC) curves are plotted to compare their performance. We also study issues that are general obstacles for this class of techniques, such as the tuning of the window length parameter and signal thresholding, as well as issues that are specific to certain techniques. The study reveals that the window length parameter, signal thresholding, and noise are the main challenges that put the performance of these techniques behind that of non-DSP-based techniques; however, many researchers underestimate these issues when designing their techniques. Furthermore, the analysis shows that the performance of the techniques depends on the choice of the analyzed parameters, which differ from method to method. The choice of optimal values for these parameters is still an open research question.
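As an illustration of why the window length matters, here is a minimal sketch of one common spectral formulation: the period-3 power of binary base-indicator sequences computed in a sliding window. It is not taken from the article; the function names, the window length of 351 and the step of 3 are assumptions chosen for illustration only.

```python
import cmath

def period3_power(window):
    """Period-3 spectral power of one window.

    Sums, over the four bases, the squared magnitude of the DFT of the
    binary indicator sequence evaluated at frequency k = N/3.
    """
    total = 0.0
    for base in "ACGT":
        # DFT at k = N/3 reduces to a phase of exp(-2*pi*i*n/3) at position n.
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, ch in enumerate(window) if ch == base)
        total += abs(coeff) ** 2
    return total

def sliding_period3(seq, window=351, step=3):
    """Period-3 power for each window position (window/step are assumed values)."""
    seq = seq.upper()
    return [period3_power(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

A short window gives a noisy signal, while a long one blurs region boundaries; any threshold applied to the resulting curve is a further tunable choice.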
-
-
-
Identifying Extreme Observations, Outliers and Noise in Clinical and Genetic Data
Authors: Concepcion Arenas, Claudio Toma, Bru Cormand and Itziar Irigoien
Background: Currently, a major challenge is the treatment and interpretation of actual data. Data sets are often high-dimensional, have a small number of observations and are noisy. Furthermore, in recent years, many approaches have been suggested for integrating continuous with categorical/ordinal data, in order to capture the information which is lost in independent studies. Objective: The aim of this paper is to develop a statistical tool for the detection of outliers adapted to any kind of features and to high-dimensional data. Method: The data form an n × p matrix (n << p) where the rows correspond to observations and the columns correspond to any kind of features. The new procedure is based on the distances between all the observations and offers a ranking by assigning each observation a value reflecting its degree of outlyingness. It was evaluated by simulation and by using actual data from clinical and genetic studies. Results: The simulation studies showed that the procedure correctly identified the outliers, was robust against the masking effect and was useful in the detection of noise. With simulated two-sample microarray data sets, it correctly detected outliers, especially when many genes showed increased expression only for a small number of samples. The method was applied to adult lymphoid malignancies, human liver cancer and autism multiplex families’ data sets, obtaining good and valuable results. Conclusion: The actual and simulation studies show the efficiency of the procedure, offering a useful tool in those applications where the detection of outliers or noise is relevant.
-
-
-
A Bicluster-Based Sequential Interpolation Imputation Method for Estimation of Missing Values in Microarray Gene Expression Data
Background: The gene expression matrix produced by DNA microarray technology inevitably contains multiple missing entries due to experimental problems. Prediction of missing values in the gene expression matrix is essential, as algorithms analyzing gene expression typically need a matrix without missing values. Objective: The objective of this paper is to present a novel bicluster-based sequential interpolation imputation method to predict missing values in gene expression data. Method: For each missing entry, this method first generates a bicluster by selecting a number of correlated genes and samples for that missing position and then applies an interpolation-based approximation technique on that bicluster. The method starts imputation from the gene with the minimum number of missing values and continues imputation by reusing the already imputed values. Results: The proposed method is compared with seven well-known existing estimation techniques over nine different data sets. The metrics used to compare performance are normalized root mean square error (NRMSE) and average distance between partition errors (ADBPE). Conclusion: The performance of the proposed method is observed to be better than that of the well-known methods on a variety of data sets. The novelty of this approach lies in applying an interpolation technique to the identified local structure (bicluster) for predicting missing values.
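For readers unfamiliar with the NRMSE metric, the sketch below uses one common definition from the imputation literature: the root of the mean squared error over the artificially removed entries, divided by the variance of their true values. The abstract does not state which normalization the authors use, so the formula and the function name `nrmse` are assumptions.

```python
import math
import statistics

def nrmse(true_values, imputed_values):
    """Normalized RMSE over entries that were hidden and then imputed.

    Assumed definition: sqrt(mean((imputed - true)^2) / variance(true)).
    """
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    mse = sum((g - t) ** 2
              for t, g in zip(true_values, imputed_values)) / len(true_values)
    # Population variance of the true values serves as the normalizer.
    return math.sqrt(mse / statistics.pvariance(true_values))
```

Under this definition a perfect imputation scores 0, and filling every entry with the mean of the true values scores 1.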
-
-
-
Functional Prediction: Gene Filtering Based on Multivariate Techniques
Authors: Liliana Lopez-Kleine, Rosa Montano and Francisco Torres-Aviles
Background: Gene expression data for several organisms of interest are available in public databases, and knowledge of gene functions can be extracted from gene expression profiles by comparing patterns between genes with multivariate techniques. Nevertheless, gene expression data are very noisy, and those patterns are often difficult to detect with classical multivariate analysis. Objective: This work proposes using classical multivariate methods to detect and/or predict a subset of potential genes that could belong to a functional class of interest. Method: In order to achieve confident results (low error in the classification of genes with known function), strong filtering of the original data set is proposed here. The methodology is illustrated on three time course microarray data sets that compare healthy and pathogen-inoculated plants. Results: Results focusing on the prediction of unknown immunity genes show that the proposed methodology is suitable for functional gene prediction. Conclusion: Moreover, the methodology is suitable for other organisms and microarray data sets from which gene expression profiles can be extracted.
-
-
-
An Insight into Species from Same Descendent Aspect and the Application into Clostridia
Authors: Guojun Li, Qin Ma, Bingqiang Liu, Zheng Chang, Chuan Zhou and Zhenjia Wang
Background: A phylogenetic tree, which describes the evolutionary relationships among various species descended from a common ancestor, is a fundamental concept in evolutionary biology. In recent years, a number of models of this tree structure have been proposed, mainly based on constructing a similar hierarchical structure or grouping together descendants with common ancestors. Objective: We use vertices in an acyclic graph to represent different organisms, and seek the relatively important vertices with almost the same descendants by tracing back the generations of a set of vertices. Method: We propose a grouping algorithm based on previous theoretical analysis, which can alleviate the negative effects of incorrect initial central vertices and noisy frontier vertices through two strategies used in the group-merging step. Results: The computational results in Clostridia show that our algorithm illustrates the data features better than the traditional hierarchical clustering method.
-
-
-
ExomeHMM: A Hidden Markov Model for Detecting Copy Number Variation Using Whole-Exome Sequencing Data
Authors: Ao Li, Minghui Wang, Zhenhua Yu and Cheng Guo
Background: Copy number variations (CNVs), including amplification and deletion, are alterations of DNA copy number compared to a reference genome. CNVs play a crucial role in tumourigenesis and progression, including amplification of oncogenes and deletion of tumor suppressor genes that may significantly increase the risk of cancer. CNVs are also reported to be closely related to non-cancer diseases, such as Down syndrome, Parkinson disease, and Alzheimer disease. Objective: Whole-exome sequencing (WES) has been successfully applied to the discovery of gene mutations as well as clinical diagnosis. However, it is quite challenging to evaluate the copy number using WES data due to read depth bias, the distribution pattern of exons and normal cell contamination. Our aim is to develop an efficient method to overcome these challenges and detect CNVs using WES data. Method: In this study, we present ExomeHMM, a hidden Markov model (HMM) based CNV detection algorithm. ExomeHMM exploits relative read depth, a ratio-based signal, to mitigate read depth distortion and employs an exponentially attenuated transition matrix to handle sparsely and non-uniformly distributed exons. The expectation–maximization algorithm is used to optimize the parameters of the proposed model. Finally, we use the standard Viterbi algorithm to infer the copy number of exons. Results: Using previously identified CNVs in 1000 Genomes Project data as the gold standard, ExomeHMM achieves the highest F-score among the four methods compared in this study. When applied to triple-negative breast cancer data, ExomeHMM is able to find abnormal genes that are significantly associated with breast cancer. Conclusion: ExomeHMM is a suitable tool for CNV detection in both healthy samples and clinical tumor samples using whole-exome sequencing data.
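The final decoding step mentioned above can be sketched with a generic log-space Viterbi implementation. This is not ExomeHMM's code: the function signature is hypothetical, the emission log-likelihoods are supplied by the caller, and the transition matrix is fixed here, whereas the abstract describes one that is attenuated according to the distance between exons.

```python
def viterbi(obs_loglik, log_start, log_trans):
    """Most likely hidden-state path of an HMM, computed in log space.

    obs_loglik[t][s] : log P(observation t | state s)
    log_start[s]     : log P(first state is s)
    log_trans[p][s]  : log P(next state is s | current state is p)
    """
    n_states = len(log_start)
    score = [log_start[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(obs_loglik)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + obs_loglik[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

In a CNV setting the states would correspond to copy-number classes (for example deletion, normal, amplification) and each observation to the relative read depth of one exon.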
-
-
-
Overcoming the Limitation of GWAS Platforms Using Systems Biology Approach
Authors: Sapna Negi, Santosh K. Behera and Budheswar Dehury
Background: Type 2 diabetes (T2D) is one of the major multi-factorial disorders resulting in various health problems. Despite the enormous amount of data generated through GWAS, T2D heritability is yet to be fully explained. Regulatory genes and nonsynonymous SNPs (nsSNPs) are also underrepresented on most GWAS platforms and may therefore remain undetected. Objective: The present study is an attempt to delineate additional key players of T2D by applying a systems biology approach to available GWAS results. Method: Genes belonging to significant biological processes (BP) were identified as key genes. Key genes were then used for building gene set enrichment of T2D-associated genes using gene-gene and protein-protein networks. The key genes and their connected genes were further used to explore networks involving miRNAs and transcription factors (TFs); for this, computational feed-forward loops (FFLs) were used. Thereafter, connected genes were also screened for nsSNPs and their effect on the structure and function of proteins, as elucidated by in-silico analysis. Results: Carbohydrate and glucose homeostasis (p=1.33E-05) were the significant BPs identified, involving the HNF1A, PCK1, IRS1, WFS1, PARG and TCF7L2 genes. Computational feed-forward loops showed involvement of regulatory genes such as hsa-miR17, hsa-miR141 (miRNAs) and CTCF (TF) in T2D. Two nsSNPs in the associated genes, PCK1 (I267V) and PEPD (R388H), were found to be deleterious and damaging to protein structure. Conclusion: The present study provides a new approach towards underpinning the plausible genetic heritability of T2D. Experimental validation of these regulatory genes and nsSNPs may provide added insights into the pathophysiology of T2D, and holds promise for personalized medication for T2D.
-
-
-
Three-Dimensional Ideal Gas Reference State Based Energy Function
Authors: Md Tamjidul Hoque and Avdesh Mishra
Background: Energy functions of proteins are developed to quantitatively capture the desirable features of the physical interactions that determine the protein folding and structure prediction processes. Objective: It is vital to develop an accurate energy function to discriminate native-like proteins from decoys. Along the same line, we develop an accurate energy function, which involves careful modelling of the reference state. Method: Here we propose a novel three-dimensional ideal gas reference state based energy function, which is based on three distinct hydrophobic-hydrophilic interactions of amino acids. The three distinct groups of interactions, namely hydrophobic versus hydrophilic, hydrophobic versus hydrophobic and hydrophilic versus hydrophilic, are controlled via three-dimensional optimized values of alpha. Using a Genetic Algorithm, we optimized the contributions of each of the three groups along with the z-score to discriminate the native from the decoys. Results: The approach allows us to segregate the statistics, which in turn enables us to model the interactions more accurately without grossly averaging the impact, as is done in the well-known ideal gas reference state based approach. To compute the energy scores we use a database of 4332 known protein structures obtained from the Protein Data Bank. Conclusion: Our energy function is found to be very competitive compared to the state-of-the-art approaches, and outperforms the nearest competitor by 40.9% for the most challenging Rosetta decoy-set.
-
-
-
CpGIScan: An Ultrafast Tool for CpG Islands Identification from Genome Sequence
Authors: Zhenxin Fan, Bisong Yue, Xiuyue Zhang, Lianming Du and Zuoyi Jian
Background: CpG islands (CGIs) are clusters of CpGs in CG-rich regions, which play a critical role in the regulation of transcription. Although multiple programs have been developed for searching CGIs, all of them have drawbacks, such as low accuracy or long running time. Objective: The aim of this study was to develop a new CGI search tool, namely CpGIScan (CpG Islands Scan), which improves upon previous programs. Method: In this work, a CpG island is defined by three types of parameters: the window length, the guanine and cytosine (G + C) frequency, and the ratio of the observed over the expected CpGs (CpG o/e). The algorithm in CpGIScan is based on the sliding window method. To reduce the time required to identify CGIs, multithread technology is employed in our program. CpGIScan was compared to existing widely used tools to benchmark its performance. Results: Evaluations on a set of test sequences show that CpGIScan has high sensitivity and specificity. In addition, CpGIScan is at least 4 times faster than existing tools. It has a large performance advantage over previous tools when searching for CpG islands in bulk genomes. CpGIScan is written in C++ and provided under the GNU CPL license. It is freely available at https://github.com/jianzuoyi/CpGIScan. Conclusion: CpGIScan was specifically developed for ultrafast identification of CGIs in large sequence sets. It builds on the advantages of previous tools and significantly improves computational efficiency. CpGIScan will be of value to researchers for generating an initial genome-wide map of CpG islands.
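The three defining parameters can be illustrated with a minimal single-threaded sliding-window sketch. It is not CpGIScan's implementation: the function names are hypothetical, and the default thresholds (200 bp window, G + C fraction of at least 0.5, CpG o/e of at least 0.6) are the widely cited classical criteria, assumed here rather than taken from the abstract.

```python
def cpg_stats(window):
    """Return (G+C fraction, CpG observed/expected ratio) for one window."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected CpG ratio: (#CpG * N) / (#C * #G).
    oe = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, oe

def scan_cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Start positions of windows that satisfy both thresholds (assumed defaults)."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, oe = cpg_stats(seq[start:start + window])
        if gc >= min_gc and oe >= min_oe:
            hits.append(start)
    return hits
```

A full tool would merge overlapping hits into islands and, as the abstract notes, distribute the windows across threads; this sketch recomputes every window from scratch for clarity.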
-
-
-
Identifying Key Regulator Genes for Tuberculosis by Differential Co-Expression Analysis of Gene Expression Profiling
Authors: Chuanyou Li, Mengqiu Gao, Lijun Bi, Joy Fleming, Wei Wang and Jingming Liu
Introduction: Tuberculosis (TB) is a major global public health problem. Its pathogenesis, however, is not fully understood. The purpose of this study was to identify key genes for TB by a bioinformatics analysis of gene expression profiles. Methods: We downloaded the gene expression profiles of TB from the Gene Expression Omnibus and identified differentially-expressed genes (DEGs) and highly-enriched pathways between TB patients and healthy controls. We then identified differentially co-expressed genes (DCGs), differentially co-expressed links (DCLs), differentially-regulated genes (DRGs) and differentially-regulated links (DRLs) using Differential Co-Expression Analysis (DCEA) and Differential Regulation Analysis (DRA). In addition, we constructed a TF-bridged DCL-centered network by mapping the DCGs to known regulatory data between transcription factors (TFs) and target genes. We then calculated the TED, TDD and regulatory impact factor (RIF) of each TF. Results: A total of 5540 DEGs, 61 DCGs, 3915 DCLs, 59 DRGs and 1139 DRLs were identified between TB patients and healthy controls. KEGG pathway enrichment analysis identified the lysosome as the most significantly-enriched pathway. Based on their TED, TDD and RIF scores, the REL, TAL1, RELA, NFKB1, NF-kappaB2, Cart-1, TCF3, MZF1, POU2F2 and EPAS1 transcription factors may play key roles in tuberculosis. Of these genes, REL and TAL1 were the only two among the top 20 genes of all three algorithms and may therefore play more significant roles in tuberculosis. Conclusions: REL and TAL1 may play more significant roles in tuberculosis. However, more laboratory work is needed to validate our results.
