Current Bioinformatics - Volume 17, Issue 4, 2022

- A Machine Learning Perspective on DNA and RNA G-quadruplexes
  
  Authors: Fabiana Rossi and Alessandro Paiardini
  
  https://doi.org/10.2174/1574893617666220224105702
  
  G-quadruplexes (G4s) are particular structures found in guanine-rich DNA and RNA sequences that exhibit a wide diversity of three-dimensional conformations and exert key functions in the control of gene expression. G4s are able to interact with numerous small molecules and endogenous proteins, and their dysregulation can lead to a variety of disorders and diseases. Characterization and prediction of G4-forming sequences could elucidate their mechanism of action and could thus represent an important step in the discovery of potential therapeutic drugs. In this perspective, we propose an overview of G4s, discussing the state of the art of methodologies and tools developed to characterize and predict the presence of these structures in genomic sequences. In particular, we report on machine learning (ML) approaches and artificial neural networks (ANNs) that could open new avenues for the accurate analysis of quadruplexes, given their potential to derive informative features by learning from large, high-density datasets.
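
As a point of reference for the sequence-based prediction tools the abstract mentions, the sketch below shows the simplest kind of G4 predictor: a regular-expression search for four guanine runs separated by short loops. The motif (runs of at least three G, loops of 1 to 7 bases) is a conventional definition assumed here for illustration; it is not taken from the abstract, and the function name `find_putative_g4s` is hypothetical.

```python
import re

# Assumed conventional motif: four runs of >= 3 guanines separated by loops
# of 1-7 nucleotides. U is allowed so RNA sequences can be scanned as well.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)

def find_putative_g4s(sequence):
    """Return (start, end, motif) for each non-overlapping motif match."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]

if __name__ == "__main__":
    # Human telomeric repeat flanked by a few extra bases.
    print(find_putative_g4s("ATGGGTTAGGGTTAGGGTTAGGGCA"))
```

A pattern search of this kind only scans one strand and ignores sequence context, which is one reason learned models are of interest for this task.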
  

- Construction of Network Biomarkers Using Inter-Feature Correlation Coefficients (FeCO3) and their Application in Detecting High-Order Breast Cancer Biomarkers
  
  Authors: Shenggeng Lin, Yuqi Lin, Kexin Wu, Yueying Wang, Zixuan Feng, Meiyu Duan, Shuai Liu, Yusi Fan, Lan Huang and Fengfeng Zhou
  
  https://doi.org/10.2174/1574893617666220124123303
  
  Aims: This study aims to formulate the inter-feature correlation as the engineered features. Background: Modern biotechnologies tend to generate a huge number of characteristics of a sample, while an OMIC dataset usually has a few dozen or a few hundred samples due to the high costs of generating the OMIC data. Therefore, many bio-OMIC studies assumed inter-feature independence and selected a feature with a high phenotype association. Objective: Many features are closely associated with each other due to their physical or functional interactions, which may be utilized as a new view of features. Methods: This study proposed a feature engineering algorithm based on the correlation coefficients (FeCO3) by utilizing the correlations between a given sample and a few reference samples. A comprehensive evaluation was carried out for the proposed FeCO3 network features using 24 bio-OMIC datasets. Results: The experimental data suggested that the newly calculated FeCO3 network features tended to achieve better classification performances than the original features, using the same popular feature selection and classification algorithms. The FeCO3 network features were also consistently supported by the literature. FeCO3 was utilized to investigate the high-order engineered biomarkers of breast cancer and detected the PBX2 gene (Pre-B-Cell Leukemia Transcription Factor 2) as one of the candidate breast cancer biomarkers. Although the two methylated residues cg14851325 (P-value = 8.06e-2) and cg16602460 (P-value = 1.19e-1) within PBX2 did not have a statistically significant association with breast cancer, the high-order inter-feature correlations showed a significant association with breast cancer. Conclusion: The proposed FeCO3 network features calculated the high-order inter-feature correlations as novel features and may facilitate the investigations of complex diseases from this new perspective. The source code is available on FigShare at 10.6084/m9.figshare.13550051 or the web site http://www.healthinformaticslab.org/supp/.
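
The Methods sentence describes engineered features built from correlations between a given sample and a few reference samples. The sketch below is one possible reading of that idea, offered only as an illustration: each new feature is the Pearson correlation between a sample's feature vector and one reference sample's feature vector. It is not the authors' FeCO3 code, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / sqrt(var_x * var_y)

def correlation_features(sample, references):
    """Assumed reading: one engineered feature per reference sample."""
    return [pearson(sample, ref) for ref in references]

if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]
    references = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
    print(correlation_features(sample, references))
```

Under this reading, the engineered vector has as many entries as there are reference samples, whatever the number of original features.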
  

- A Network-Based Method for the Detection of Cancer Driver Genes in Transcriptional Regulatory Networks Using the Structural Analysis of Weighted Regulatory Interactions
  
  Authors: Mostafa Akhavan-Safar, Babak Teimourpour and Abbas Nowzari-Dalini
  
  https://doi.org/10.2174/1574893617666220127094224
  
  Background: Identifying genes that instigate cell anomalies and cause cancer in humans is an important field in oncology research. Abnormalities in these genes are transferred to other genes in the cell, disrupting its normal functionality. Such genes are known as cancer driver genes (CDGs). Various methods have been proposed for predicting CDGs, mostly based on genomic data and computational methods. Some novel bioinformatic approaches have been developed. Objective: In this article, we propose a network-based algorithm, SalsaDriver (Stochastic approach for link-structure analysis for driver detection), which can calculate each gene's receiving and influencing power using the stochastic analysis of regulatory interaction structures in gene regulatory networks. Methods: First, regulatory networks related to breast, colon, and lung cancers are constructed using gene expression data and a list of regulatory interactions, the weights of which are then calculated using biological and topological features of the network. After that, the weighted regulatory interactions are used in the structural analysis of interactions, with two separate Markov chains on the bipartite graph taken from the main graph of the gene network and the implementation of the stochastic approach for link-structure analysis. The proposed algorithm categorizes higher-ranked genes as driver genes. Results: The proposed algorithm was compared with 24 other computational and network tools based on the F-measure value and the number of detected CDGs. The results were validated using four databases. The findings of this study show that SalsaDriver outperforms other methods and can identify substantially more driver genes than other methods. Conclusion: The SalsaDriver network-based approach is suitable for predicting CDGs and can be used as a complementary method along with other computational tools.
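
The two Markov chains on a bipartite graph mentioned in the Methods can be illustrated with a generic weighted SALSA iteration. In the sketch below, regulators act as hubs ("influencing power") and targets as authorities ("receiving power"), and each chain alternates one backward and one forward step along weighted edges. The edge weights, gene names, and the function `salsa_scores` are assumptions made for illustration; how SalsaDriver actually derives its weights and ranks genes is not stated in the abstract.

```python
def salsa_scores(edges, iterations=100):
    """Generic weighted SALSA by power iteration.

    edges: dict mapping (regulator, target) -> positive weight.
    Returns (hub_scores, authority_scores), each summing to 1.
    """
    hubs = sorted({u for u, _ in edges})
    auths = sorted({v for _, v in edges})
    w_out = {h: 0.0 for h in hubs}   # total outgoing weight per regulator
    w_in = {a: 0.0 for a in auths}   # total incoming weight per target
    for (u, v), w in edges.items():
        w_out[u] += w
        w_in[v] += w

    hub = {h: 1.0 / len(hubs) for h in hubs}
    auth = {a: 1.0 / len(auths) for a in auths}
    for _ in range(iterations):
        # Authority chain: step back along an in-edge, then forward along an out-edge.
        on_hub = {h: 0.0 for h in hubs}
        for (u, v), w in edges.items():
            on_hub[u] += auth[v] * w / w_in[v]
        new_auth = {a: 0.0 for a in auths}
        for (u, v), w in edges.items():
            new_auth[v] += on_hub[u] * w / w_out[u]
        # Hub chain: step forward along an out-edge, then back along an in-edge.
        on_auth = {a: 0.0 for a in auths}
        for (u, v), w in edges.items():
            on_auth[v] += hub[u] * w / w_out[u]
        new_hub = {h: 0.0 for h in hubs}
        for (u, v), w in edges.items():
            new_hub[u] += on_auth[v] * w / w_in[v]
        hub, auth = new_hub, new_auth
    return hub, auth

if __name__ == "__main__":
    # Hypothetical toy network: two regulators, two targets.
    toy = {("TF1", "A"): 3.0, ("TF1", "B"): 1.0, ("TF2", "B"): 1.0}
    print(salsa_scores(toy))
```

On a connected network, these scores converge to values proportional to the weighted out-degree (hubs) and weighted in-degree (authorities), so the informative part of such a method lies largely in how the edge weights are assigned.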
  

- A Combined Feature Screening Approach of Random Forest and Filter-based Methods for Ultra-high Dimensional Data
  
  Authors: Lifeng Zhou and Hong Wang
  
  https://doi.org/10.2174/1574893617666220221120618
  
  Background: Various feature (variable) screening approaches have been proposed in the past decade to mitigate the impact of ultra-high dimensionality in classification and regression problems, including filter-based methods such as sure independence screening and wrapper-based methods such as random forest. However, the former rely heavily on strong modelling assumptions, while the latter require an adequate sample size to let the data speak for themselves. These requirements can seldom be met in biochemical studies, where we often only have access to ultra-high dimensional data with a complex structure and a small number of observations. Objective: In this research, we investigate the possibility of combining filter-based screening methods and random forest-based screening methods in the regression context. Methods: We have combined four state-of-the-art filter approaches from the statistical community, namely sure independence screening (SIS), robust rank correlation based screening (RRCS), high dimensional ordinary least squares projection (HOLP), and a model-free sure independence screening procedure based on the distance correlation (DCSIS), with a random forest-based Boruta screening method from the machine learning community for regression problems. Results: Among all the combined methods, RF-DCSIS performs better than the other methods in terms of screening accuracy and prediction capability on the simulated scenarios and real benchmark datasets. Conclusion: Through an empirical study based on both extensive simulation and real data, we have shown that filter-based screening and random forest-based screening each have their pros and cons, while a combination of both may lead to a better feature screening result and prediction capability.
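
To make the filter side of the comparison concrete, the sketch below shows the core of sure independence screening (SIS): rank features by the absolute marginal correlation with the response and keep the top few. It covers only this filter step; the abstract does not say how the filter and the Boruta step are combined, so no combination is attempted here. The function name `sis_screen` and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant feature carries no signal
    return cov / sqrt(var_x * var_y)

def sis_screen(rows, response, keep):
    """Return indices of the `keep` features with the largest |correlation|.

    rows: list of samples, each a list of feature values.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, response)), j))
    # Sort by score descending; ties are broken by the lower feature index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:keep]]

if __name__ == "__main__":
    rows = [[3, 1, 5], [1, 2, 3], [4, 3, 4], [1, 4, 1], [5, 5, 2]]
    response = [2, 4, 6, 8, 10]
    print(sis_screen(rows, response, keep=2))
```

Because each feature is scored in isolation, this step is cheap even for very large feature counts, but it can miss features that matter only jointly, which is the kind of gap a wrapper method is meant to cover.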
  

- Analyzing Association Between Expression Quantitative Trait and CNV for Breast Cancer Based on Gene Interaction Network Clustering and Group Sparse Learning
  
  Authors: Xia Chen, Yexiong Lin, Qiang Qu, Bin Ning, Haowen Chen, Bo Liao and Xiong Li
  
  https://doi.org/10.2174/1574893617666220207095117
  
  Aim: The occurrence and development of tumors are accompanied by changes in pathogenic gene expression. Tumor cells avoid damage from immune cells by regulating the expression of immune-related genes. Background: Tracing the causes of gene expression variation helps in understanding tumor evolution and metastasis. Objective: Current methods for explaining gene expression variation face several main challenges, including low explanatory power, insufficient prediction accuracy, and lack of biological meaning. Methods: In this study, we propose a novel method to analyze the mRNA expression variations of breast cancer risk genes. Firstly, we collected some high-confidence risk genes related to breast cancer and then designed a rank-based method to preprocess the breast cancer copy number variation (CNV) and mRNA data. Secondly, to improve the biological meaning and narrow down the combinatorial space, we introduced a prior gene interaction network and applied a network clustering algorithm to generate high-density subnetworks. Lastly, to describe the interlinked structure within and between subnetworks and the target genes' mRNA expression, we proposed a group sparse learning model to identify CNVs for pathogenic gene expression variations. Results: The performance of the proposed method is evaluated by both significantly improved prediction accuracy and the biological meaning of pathway enrichment analysis. Conclusion: The experimental results show that our method has practical significance.
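
Group sparse models of the kind named in the Methods typically select or discard whole groups of predictors at once. The sketch below shows the standard building block for a group-lasso penalty, block soft-thresholding, applied to groups that could stand for CNVs from the same subnetwork. It is a generic illustration under that assumption; the abstract does not give the authors' model, penalty, or solver, and the function names are hypothetical.

```python
from math import sqrt

def block_soft_threshold(group, lam):
    """Shrink a coefficient group toward zero; zero it out if its norm <= lam."""
    norm = sqrt(sum(x * x for x in group))
    if norm <= lam:
        return [0.0 for _ in group]
    scale = 1.0 - lam / norm
    return [scale * x for x in group]

def prox_group_lasso(weights, groups, lam):
    """Apply block soft-thresholding to each group of coefficient indices.

    groups: list of index lists, assumed not to overlap.
    """
    out = list(weights)
    for indices in groups:
        shrunk = block_soft_threshold([weights[i] for i in indices], lam)
        for i, value in zip(indices, shrunk):
            out[i] = value
    return out

if __name__ == "__main__":
    # Hypothetical coefficients: a strong group and a weak group.
    print(prox_group_lasso([3.0, 4.0, 0.1, 0.2], [[0, 1], [2, 3]], lam=1.0))
```

The weak group is removed entirely while the strong group is only shrunk, which is the behaviour that makes group-level selection interpretable at the subnetwork level.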
  

- A Novel Method for Predicting Essential Proteins by Integrating Multidimensional Biological Attribute Information and Topological Properties
  
  Authors: Hanyu Lu, Chen Shang, Sai Zou, Lihong Cheng, Shikong Yang and Lei Wang
  
  https://doi.org/10.2174/1574893617666220304201507
  
  Background: Essential proteins are indispensable to the maintenance of life activities and play essential roles in the areas of synthetic biology. Identification of essential proteins by computational methods has become a hot topic in recent years because of its efficiency. Objective: Identification of essential proteins is of great significance and practical use in the areas of synthetic biology, drug targets, and human disease genes. Methods: In this paper, a method called EOP (Edge clustering coefficient-Orthologous-Protein) is proposed to infer potential essential proteins by combining multidimensional biological attribute information of proteins with topological properties of the protein-protein interaction network. Results: The simulation results on the yeast protein interaction network show that this method identifies more essential proteins than the other 12 methods (DC, IC, EC, SC, BC, CC, NC, LAC, PEC, CoEWC, POEM, DWE). In particular, compared with DC (Degree Centrality), the SN (sensitivity) is 9% higher. When 1% of proteins are taken as candidates, the recognition rate is 34% higher; when 5%, 10%, 15%, 20%, and 25% are taken as candidates, the recognition rate is 36%, 22%, 15%, 11%, and 8% higher, respectively. Conclusion: Experimental results show that our method can achieve satisfactory prediction results, which may provide references for future research.
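
The topological ingredient named in EOP, the edge clustering coefficient, can be sketched as follows. The definition used here (shared neighbours of an edge divided by the smaller of the two endpoint degrees minus one, summed over a protein's edges) is the form commonly used in essential-protein prediction and is assumed for illustration. How EOP combines it with orthology information is not given in the abstract, so only the topological part is shown, and the function names are hypothetical.

```python
def edge_clustering_coefficient(adj, u, v):
    """Triangles through edge (u, v) over the maximum possible number.

    adj: dict mapping each protein to the set of its interaction partners.
    """
    shared = len(adj[u] & adj[v])
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return shared / denom if denom > 0 else 0.0

def neighborhood_score(adj, u):
    """Sum of edge clustering coefficients over all edges of protein u."""
    return sum(edge_clustering_coefficient(adj, u, v) for v in adj[u])

if __name__ == "__main__":
    # Hypothetical toy network: a triangle A-B-C with D attached to C.
    adj = {
        "A": {"B", "C"},
        "B": {"A", "C"},
        "C": {"A", "B", "D"},
        "D": {"C"},
    }
    ranking = sorted(adj, key=lambda p: (-neighborhood_score(adj, p), p))
    print(ranking)
```

Proteins embedded in densely connected neighbourhoods score higher than those attached by a single edge, which is the intuition such topological scores bring to essentiality ranking.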
  

- Analysis of Novel Variants Associated with Three Human Ovarian Cancer Cell Lines
  
  Authors: Venugopala R. Mekala, Jan-Gowth Chang and Ka-Lok Ng
  
  https://doi.org/10.2174/1574893617666220224105106
  
  Background: Identification of mutations is of great significance in cancer research, as it can contribute to the development of therapeutic strategies and prevention of cancer formation. Ovarian cancer is one of the leading cancer-related causes of death in Taiwan. Furthermore, it has been observed that the accumulation of genetic mutations can lead to cancer. Objective: We utilized whole-exome sequencing to explore cancer-associated missense variants in three human ovarian cancer cell lines derived from Taiwanese patients. Methods: We utilized cell line whole-exome sequencing data, 188 patients’ whole-exome sequencing data, and in vitro experiments to verify predicted variant results. We established an effective analysis workflow for the discovery of novel ovarian cancer variants, comprising three steps: (i) use of public databases and in-house hospital data to select novel variants, (ii) investigation of protein structural stability caused by genetic mutations, and (iii) use of in vitro experiments to verify predictions. Results: Our study enumerated 296 novel variants by imposing specific criteria and using sophisticated bioinformatics tools for further analysis. Eleven and 54 missense novel variants associated with cancerous and non-cancerous genes, respectively, were identified. A total of 13 missense mutations were found to affect the stability of protein 3D structure, while 11 disease-causing novel variants were confirmed by PCR sequencing. Among these, ten variants were predicted to be pathogenic, while the pathogenicity of one variant was uncertain. Conclusion: It was confirmed that novel variant genes play a crucial role in ovarian cancer patients, with 11 novel variants that may promote the progression and development of ovarian cancer.
  

Most Cited

- A Review of Ensemble Methods in Bioinformatics
  
  Authors: Pengyi Yang, Yee Hwa Yang, Bing B. Zhou and Albert Y. Zomaya
- Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis
  
  Authors: Masahiro Sugimoto, Masato Kawakami, Martin Robert, Tomoyoshi Soga and Masaru Tomita
- Distance-based Support Vector Machine to Predict DNA N6-methyladenine Modification
  
  Authors: Haoyu Zhang, Quan Zou, Ying Ju, Chenggang Song and Dong Chen
- A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods
  
  Authors: Jun Zhang and Bin Liu
- Molecular Genetic Markers: Discovery, Applications, Data Storage and Visualisation
  
  Authors: Chris Duran, Nikki Appleby, David Edwards and Jacqueline Batley
- A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization
  
  Authors: Wuritu Yang, Xiao-Juan Zhu, Jian Huang, Hui Ding and Hao Lin
- Cancer Diagnosis Through IsomiR Expression with Machine Learning Method
  
  Authors: Zhijun Liao, Dapeng Li, Xinrui Wang, Lisheng Li and Quan Zou
- Relevance of Molecular Docking Studies in Drug Designing
  
  Authors: Ritu Jakhar, Mehak Dangi, Alka Khichi and Anil K. Chhillar
- The Advances and Challenges of Deep Learning Application in Biological Big Data Processing
  
  Authors: Li Peng, Manman Peng, Bo Liao, Guohua Huang, Weibiao Li and Dingfeng Xie
- Gene Expression Profile Classification: A Review
  
  Authors: Musa H. Asyali, Dilek Colak, Omer Demirkaya and Mehmet S. Inan