Current Bioinformatics - Volume 17, Issue 4, 2022
-
-
A Machine Learning Perspective on DNA and RNA G-quadruplexes
Authors: Fabiana Rossi and Alessandro Paiardini
G-quadruplexes (G4s) are particular structures found in guanine-rich DNA and RNA sequences that exhibit a wide diversity of three-dimensional conformations and exert key functions in the control of gene expression. G4s are able to interact with numerous small molecules and endogenous proteins, and their dysregulation can lead to a variety of disorders and diseases. Characterization and prediction of G4-forming sequences could elucidate their mechanism of action and could thus represent an important step in the discovery of potential therapeutic drugs. In this perspective, we propose an overview of G4s, discussing the state of the art of methodologies and tools developed to characterize and predict the presence of these structures in genomic sequences. In particular, we report on machine learning (ML) approaches and artificial neural networks (ANNs) that could open new avenues for the accurate analysis of quadruplexes, given their potential to derive informative features by learning from large, high-density datasets.
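As a point of reference for the sequence-based prediction mentioned above, the following is a minimal sketch of a pattern-matching baseline, not a method taken from the article. It assumes the commonly used heuristic motif of four runs of at least three guanines separated by loops of 1 to 7 nucleotides; the function name `find_g4_candidates` and the loop limits are illustrative choices, and only the given strand is scanned.

```python
import re

# Assumed heuristic motif: four G-runs (>= 3 guanines each) separated by
# loops of 1-7 nucleotides. U is allowed so RNA sequences can be scanned too.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def find_g4_candidates(sequence):
    """Return (start, end, motif) for non-overlapping motif matches.

    Coordinates are 0-based, end-exclusive, on the strand as given.
    """
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Telomere-like repeat flanked by two bases on each side.
    print(find_g4_candidates("ATGGGTTAGGGTTAGGGTTAGGGAT"))
```

A pattern match only flags a candidate; ML models of the kind the abstract discusses would typically score or re-rank such candidates using additional features.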
-
-
-
Construction of Network Biomarkers Using Inter-Feature Correlation Coefficients (FeCO3) and their Application in Detecting High-Order Breast Cancer Biomarkers
Authors: Shenggeng Lin, Yuqi Lin, Kexin Wu, Yueying Wang, Zixuan Feng, Meiyu Duan, Shuai Liu, Yusi Fan, Lan Huang and Fengfeng Zhou
Aims: This study aims to formulate inter-feature correlations as engineered features. Background: Modern biotechnologies tend to generate a huge number of characteristics for each sample, while an OMIC dataset usually has only a few dozen or a few hundred samples due to the high cost of generating OMIC data. Therefore, many bio-OMIC studies have assumed inter-feature independence and selected features with a high phenotype association. Objective: Many features are closely associated with each other due to their physical or functional interactions, which may be utilized as a new view of features. Methods: This study proposed a feature engineering algorithm based on correlation coefficients (FeCO3) that utilizes the correlations between a given sample and a few reference samples. A comprehensive evaluation of the proposed FeCO3 network features was carried out using 24 bio-OMIC datasets. Results: The experimental data suggested that the newly calculated FeCO3 network features tended to achieve better classification performance than the original features when the same popular feature selection and classification algorithms were used. The FeCO3 network features were also consistently supported by the literature. FeCO3 was utilized to investigate the high-order engineered biomarkers of breast cancer and detected the PBX2 gene (Pre-B-Cell Leukemia Transcription Factor 2) as one of the candidate breast cancer biomarkers. Although the two methylated residues cg14851325 (P-value = 8.06e-2) and cg16602460 (P-value = 1.19e-1) within PBX2 did not have a statistically significant association with breast cancer, the high-order inter-feature correlations showed a significant association with breast cancer. Conclusion: The proposed FeCO3 network features capture high-order inter-feature correlations as novel features and may facilitate the investigation of complex diseases from this new perspective. The source code is available on FigShare at 10.6084/m9.figshare.13550051 or at the web site http://www.healthinformaticslab.org/supp/.
-
-
-
A Network-Based Method for the Detection of Cancer Driver Genes in Transcriptional Regulatory Networks Using the Structural Analysis of Weighted Regulatory Interactions
Authors: Mostafa Akhavan-Safar, Babak Teimourpour and Abbas Nowzari-Dalini
Background: Identifying genes that instigate cell anomalies and cause cancer in humans is an important field in oncology research. Abnormalities in these genes are transferred to other genes in the cell, disrupting its normal functionality. Such genes are known as cancer driver genes (CDGs). Various methods have been proposed for predicting CDGs, mostly based on genomic data and computational methods. Some novel bioinformatic approaches have been developed. Objective: In this article, we propose a network-based algorithm, SalsaDriver (Stochastic approach for link-structure analysis for driver detection), which can calculate each gene's receiving and influencing power using the stochastic analysis of regulatory interaction structures in gene regulatory networks. Methods: First, regulatory networks related to breast, colon, and lung cancers are constructed using gene expression data and a list of regulatory interactions, the weights of which are then calculated using biological and topological features of the network. After that, the weighted regulatory interactions are used in the structural analysis of interactions, with two separate Markov chains on the bipartite graph taken from the main graph of the gene network and the implementation of the stochastic approach for link-structure analysis. The proposed algorithm categorizes higher-ranked genes as driver genes. Results: The proposed algorithm was compared with 24 other computational and network tools based on the F-measure value and the number of detected CDGs. The results were validated using four databases. The findings of this study show that SalsaDriver outperforms the other methods and identifies substantially more driver genes. Conclusion: The SalsaDriver network-based approach is suitable for predicting CDGs and can be used as a complementary method along with other computational tools.
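The two Markov chains described in the Methods can be illustrated with a generic weighted SALSA iteration. This is a sketch under stated assumptions, not the SalsaDriver implementation: the abstract's weighting scheme and ranking threshold are not reproduced, the function name `salsa_scores` and the toy gene labels are hypothetical, and hub and authority scores here only loosely stand in for a gene's "influencing" and "receiving" power.

```python
from collections import defaultdict


def salsa_scores(edges, iterations=100):
    """Weighted SALSA on a directed graph.

    edges: iterable of (regulator, target, weight) with weight > 0.
    Returns (hub, authority) dicts, each summing to 1.
    """
    out_w = defaultdict(dict)  # regulator -> {target: weight}
    in_w = defaultdict(dict)   # target -> {regulator: weight}
    for s, t, w in edges:
        out_w[s][t] = out_w[s].get(t, 0.0) + w
        in_w[t][s] = in_w[t].get(s, 0.0) + w
    out_sum = {s: sum(d.values()) for s, d in out_w.items()}
    in_sum = {t: sum(d.values()) for t, d in in_w.items()}

    # Start both chains from uniform distributions.
    auth = {t: 1.0 / len(in_w) for t in in_w}
    hub = {s: 1.0 / len(out_w) for s in out_w}

    for _ in range(iterations):
        # Authority chain: step backward along an in-link, then forward.
        via_hub = defaultdict(float)
        for t, p in auth.items():
            for s, w in in_w[t].items():
                via_hub[s] += p * w / in_sum[t]
        new_auth = defaultdict(float)
        for s, p in via_hub.items():
            for t, w in out_w[s].items():
                new_auth[t] += p * w / out_sum[s]

        # Hub chain: step forward along an out-link, then backward.
        via_auth = defaultdict(float)
        for s, p in hub.items():
            for t, w in out_w[s].items():
                via_auth[t] += p * w / out_sum[s]
        new_hub = defaultdict(float)
        for t, p in via_auth.items():
            for s, w in in_w[t].items():
                new_hub[s] += p * w / in_sum[t]

        auth, hub = dict(new_auth), dict(new_hub)
    return hub, auth


if __name__ == "__main__":
    # Hypothetical toy network: regulators A and B, targets X and Y.
    toy = [("A", "X", 1.0), ("A", "Y", 1.0), ("B", "Y", 1.0)]
    hub, auth = salsa_scores(toy)
    print(sorted(hub.items()), sorted(auth.items()))
```

On a connected graph, the stationary scores are proportional to weighted out-degree (hubs) and weighted in-degree (authorities), so the edge weights largely determine the ranking.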
-
-
-
A Combined Feature Screening Approach of Random Forest and Filter-based Methods for Ultra-high Dimensional Data
Authors: Lifeng Zhou and Hong Wang
Background: Various feature (variable) screening approaches have been proposed in the past decade to mitigate the impact of ultra-high dimensionality in classification and regression problems, including filter-based methods such as sure independence screening and wrapper-based methods such as random forest. However, the former rely heavily on strong modelling assumptions, while the latter require an adequate sample size to make the data speak for themselves. These requirements can seldom be met in biochemical studies, where we only have access to ultra-high dimensional data with a complex structure and a small number of observations. Objective: In this research, we investigate the possibility of combining filter-based screening methods and random forest-based screening methods in the regression context. Methods: We have combined four state-of-the-art filter approaches from the statistical community, namely sure independence screening (SIS), robust rank correlation based screening (RRCS), high dimensional ordinary least squares projection (HOLP), and a model-free sure independence screening procedure based on the distance correlation (DCSIS), with a random forest-based Boruta screening method from the machine learning community for regression problems. Results: Among all the combined methods, RF-DCSIS performs better than the other methods in terms of screening accuracy and prediction capability on the simulated scenarios and real benchmark datasets. Conclusion: Through an empirical study using both extensive simulation and real data, we have shown that filter-based screening and random forest-based screening each have their pros and cons, while a combination of both may lead to a better feature screening result and prediction capability.
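To make the filter step concrete, here is a minimal sketch of sure independence screening in its basic form: rank features by the absolute marginal Pearson correlation with the response and keep the top d. It covers only SIS, not RRCS, HOLP, DCSIS, or the Boruta combination described in the abstract; the names `pearson` and `sis_screen` and the toy data are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(X, y, d):
    """Return indices of the d features with the largest |corr(feature, y)|.

    X is a list of rows (samples); each row is a list of feature values.
    """
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:d]


if __name__ == "__main__":
    # Toy data: feature 0 is perfectly linear in y.
    X = [[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1]]
    y = [2, 4, 6, 8]
    print(sis_screen(X, y, 2))
```

In a combined scheme, the retained indices would then be passed to a second-stage selector such as a random forest-based method.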
-
-
-
Analyzing Association Between Expression Quantitative Trait and CNV for Breast Cancer Based on Gene Interaction Network Clustering and Group Sparse Learning
Authors: Xia Chen, Yexiong Lin, Qiang Qu, Bin Ning, Haowen Chen, Bo Liao and Xiong Li
Aim: The occurrence and development of tumors are accompanied by changes in pathogenic gene expression. Tumor cells avoid damage from immune cells by regulating the expression of immune-related genes. Background: Tracing the causes of gene expression variation helps in understanding tumor evolution and metastasis. Objective: Current methods for explaining gene expression variation face several main challenges, including low explanatory power, insufficient prediction accuracy, and lack of biological meaning. Methods: In this study, we propose a novel method to analyze the mRNA expression variations of breast cancer risk genes. Firstly, we collected high-confidence risk genes related to breast cancer and designed a rank-based method to preprocess the breast cancer copy number variation (CNV) and mRNA data. Secondly, to improve biological meaning and narrow down the combinatorial space, we introduced a prior gene interaction network and applied a network clustering algorithm to generate high-density subnetworks. Lastly, to describe the interlinked structure within and between subnetworks and the mRNA expression of target genes, we proposed a group sparse learning model to identify the CNVs underlying expression variations of pathogenic genes. Results: The performance of the proposed method is evaluated by both significantly improved prediction accuracy and the biological meaning revealed by pathway enrichment analysis. Conclusion: The experimental results show that our method has practical significance.
-
-
-
A Novel Method for Predicting Essential Proteins by Integrating Multidimensional Biological Attribute Information and Topological Properties
Authors: Hanyu Lu, Chen Shang, Sai Zou, Lihong Cheng, Shikong Yang and Lei Wang
Background: Essential proteins are indispensable to the maintenance of life activities and play essential roles in the areas of synthetic biology. Identification of essential proteins by computational methods has become a hot topic in recent years because of its efficiency. Objective: Identification of essential proteins is of great significance and practical use in the areas of synthetic biology, drug targets, and human disease genes. Methods: In this paper, a method called EOP (Edge clustering coefficient-Orthologous-Protein) is proposed to infer potential essential proteins by combining multidimensional biological attribute information of proteins with topological properties of the protein-protein interaction network. Results: The simulation results on the yeast protein interaction network show that this method identifies more essential proteins than the other 12 methods (DC, IC, EC, SC, BC, CC, NC, LAC, PEC, CoEWC, POEM, DWE). Compared with DC (Degree Centrality) in particular, the sensitivity (SN) is 9% higher; with a candidate-protein fraction of 1%, the recognition rate is 34% higher, and with fractions of 5%, 10%, 15%, 20%, and 25%, the recognition rate is 36%, 22%, 15%, 11%, and 8% higher, respectively. Conclusion: Experimental results show that our method can achieve satisfactory prediction results, which may provide references for future research.
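The topological ingredient named in the method's title can be sketched as follows. This assumes the commonly used definition of the edge clustering coefficient, ECC(u, v) = z / min(d_u - 1, d_v - 1), where z is the number of triangles containing the edge and d is node degree; whether EOP uses exactly this form, and how it is combined with orthology information, is not stated in the abstract. The function name and the toy network are hypothetical.

```python
def edge_clustering_coefficients(edges):
    """Compute ECC for every edge of an undirected, unweighted network.

    edges: iterable of (u, v) pairs; self-loops and duplicates are ignored.
    Returns {(u, v): ecc} with each key stored as a sorted tuple.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    ecc = {}
    for u, neighbours in adj.items():
        for v in neighbours:
            key = tuple(sorted((u, v)))
            if key in ecc:
                continue
            # Triangles through the edge = shared neighbours of u and v.
            triangles = len(adj[u] & adj[v])
            # Maximum possible triangles given the two degrees.
            denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
            ecc[key] = triangles / denom if denom > 0 else 0.0
    return ecc


if __name__ == "__main__":
    # Hypothetical toy network: triangle A-B-C with a pendant node D.
    toy = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
    print(edge_clustering_coefficients(toy))
```

Summing the ECC values of the edges incident to a protein gives a node-level score that a method like this could weight with biological attributes.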
-
-
-
Analysis of Novel Variants Associated with Three Human Ovarian Cancer Cell Lines
Authors: Venugopala R. Mekala, Jan-Gowth Chang and Ka-Lok Ng
Background: Identification of mutations is of great significance in cancer research, as it can contribute to the development of therapeutic strategies and prevention of cancer formation. Ovarian cancer is one of the leading cancer-related causes of death in Taiwan. Furthermore, it has been observed that the accumulation of genetic mutations can lead to cancer. Objective: We utilized whole-exome sequencing to explore cancer-associated missense variants in three human ovarian cancer cell lines derived from Taiwanese patients. Methods: We utilized cell line whole-exome sequencing data, 188 patients’ whole-exome sequencing data, and in vitro experiments to verify predicted variant results. We established an effective analysis workflow for the discovery of novel ovarian cancer variants, comprising three steps: (i) use of public databases and in-house hospital data to select novel variants, (ii) investigation of protein structural stability caused by genetic mutations, and (iii) use of in vitro experiments to verify predictions. Results: Our study enumerated 296 novel variants by imposing specific criteria and using sophisticated bioinformatics tools for further analysis. Eleven and 54 missense novel variants associated with cancerous and non-cancerous genes, respectively, were identified. A total of 13 missense mutations were found to affect the stability of protein 3D structure, while 11 disease-causing novel variants were confirmed by PCR sequencing. Among these, ten variants were predicted to be pathogenic, while the pathogenicity of one variant was uncertain. Conclusion: It was confirmed that novel variant genes play a crucial role in ovarian cancer patients, with 11 novel variants that may promote the progression and development of ovarian cancer.
