Current Bioinformatics - Volume 15, Issue 9, 2020
-
-
Cancer Diagnosis and Disease Gene Identification via Statistical Machine Learning
Authors: Liuyuan Chen, Juntao Li and Mingming Chang
Diagnosing cancer and identifying disease genes from DNA microarray gene expression data are hot topics in current bioinformatics. This paper is devoted to the latest developments in cancer diagnosis and gene selection via statistical machine learning. The support vector machine is first introduced for binary cancer diagnosis. Then the 1-norm support vector machine, doubly regularized support vector machine, adaptive huberized support vector machine and other extensions are presented to improve the performance of gene selection. Lasso, elastic net, partly adaptive elastic net, group lasso, sparse group lasso, adaptive sparse group lasso and other sparse regression methods are also introduced for performing simultaneous binary cancer classification and gene selection. In addition to three strategies for reducing multiclass problems to binary ones, methods that directly consider all classes of data in a single learning model (multi-class support vector machine, sparse multinomial regression, adaptive multinomial regression and so on) are presented for multiple cancer diagnosis. Limitations and promising directions are also discussed.
-
-
-
Stochastic Neighbor Embedding Algorithm and its Application in Molecular Biological Data
Authors: Pan Wang, Guiyang Zhang, You Li, Ammar Oad and Guohua Huang
With the advent of the era of big data, both the volume and the dimensionality of data keep growing. It is critical to reduce dimensions or visualize data in order to uncover hidden patterns or the mechanisms underlying the data. Stochastic Neighbor Embedding (SNE) has been developed for data visualization over the last ten years. Owing to its effectiveness in visualizing data, SNE has been applied in a wide range of fields. We briefly reviewed the SNE algorithm and its variants, summarizing their application in visualizing single-cell sequencing data, single nucleotide polymorphisms, and mass spectrometry imaging data. We also discussed the strengths and weaknesses of SNE, with a special emphasis on how to set parameters to improve the quality of visualization, and finally indicated potential future developments of SNE.
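As a rough illustration of the parameter setting the abstract refers to, the sketch below computes the Gaussian conditional neighbor probabilities used in standard SNE and the perplexity they imply for one point. The function names, the toy points and the choice of sigma are illustrative assumptions and are not taken from the review.

```python
import math

def conditional_probs(points, i, sigma):
    """Standard SNE similarities p(j|i): a Gaussian centred on points[i].

    `sigma` is the bandwidth; in practice it is tuned per point so that
    the perplexity below matches a user-chosen target.
    """
    xi = points[i]
    weights = []
    for j, xj in enumerate(points):
        if j == i:
            weights.append(0.0)  # a point is never its own neighbor
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)  # assumed non-zero (sigma not vanishingly small)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits) of a distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

# Toy data (hypothetical): two neighbors at equal distance from point 0.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
p = conditional_probs(toy_points, 0, sigma=1.0)
print(p, perplexity(p))
```

Perplexity can be read as an effective number of neighbors: a larger sigma spreads probability over more points and raises it, which is why it is the usual knob for tuning the visualization.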
-
-
-
Model with the GBDT for Colorectal Adenoma Risk Diagnosis
Authors: Junbo Gao, Lifeng Zhang, Gaiqing Yu, Guoqiang Qu, Yanfeng Li and Xuebing Yang
Background and Objective: Colorectal cancer (CRC) is a common malignant tumor of the digestive system and is associated with high morbidity and mortality. Early prediction of colorectal adenoma (CRA), a precancerous disease in most CRC patients, provides an opportunity to plan an appropriate strategy for prevention, early diagnosis and treatment. This study aimed to develop a machine learning model to predict CRA that could assist physicians in classifying high-risk patients, making informed choices and preventing CRC. Methods: Patients who had undergone a colonoscopy at the Sixth People Hospital of Shanghai, China, from July 2018 to November 2018 were instructed to fill out a questionnaire. A classification model with the gradient boosting decision tree (GBDT) was developed to predict CRA. This model was compared with three other models, namely, random forest (RF), support vector machine (SVM), and logistic regression (LR). The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the models. Results: Among the 245 included patients, 65 patients had CRA. The AUCs of GBDT, RF, SVM, and LR with 10-fold cross-validation were 0.8131, 0.74, 0.769 and 0.763, respectively. An online prediction service, the CRA Inference System, was also built to put the proposed solution into practice for patients with CRA. Conclusion: Four classification models for CRA prediction were developed and compared, and the GBDT model showed the highest performance. Implementing a GBDT model for screening can reduce the cost in time and money and help physicians identify high-risk groups for primary prevention.
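The evaluation metric described above can be sketched in a few lines. The following is a minimal, dependency-free illustration of how an AUC is computed from predicted scores and how 245 samples might be divided into 10 folds. The helper names, the interleaved fold assignment and the toy scores are assumptions for illustration only; they are not the authors' code or data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k):
    """Assign sample indices to k folds in round-robin order (an assumption;
    the study does not state how its folds were formed)."""
    return [list(range(start, n_samples, k)) for start in range(k)]

# Toy scores (hypothetical), not data from the study.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# 245 patients split into 10 folds; each fold serves once as the test set.
folds = k_fold_indices(245, 10)
print([len(f) for f in folds])
```

In a cross-validated comparison, each model would be trained on nine folds and scored on the tenth, and the per-fold AUCs averaged.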
-
-
-
Co-expression Network Analysis Revealing the Potential Regulatory Roles of LncRNAs in Atrial Fibrillation
Authors: Lishui Shen, Guilin Shen, Xiaoli Lu, Guomin Ding and Xiaofeng Hu
Background: Atrial fibrillation (AF) is one of the most common cardiac arrhythmias worldwide. However, the mechanism underlying AF is still unclear. Methods: In this study, we applied a series of bioinformatics methods to explore the mechanisms of long non-coding RNAs (lncRNAs) underlying AF pathogenesis. We analyzed public datasets (GSE2240 and GSE115574) to identify differentially expressed lncRNAs and mRNAs in the progression of AF. Results: In total, 71 differentially expressed lncRNAs and 390 differentially expressed genes (DEGs) were identified in AF. Next, we performed bioinformatics analyses to explore the functions of lncRNAs in AF. Gene Ontology (GO) analysis indicated that the differentially expressed lncRNAs were involved in regulating multiple key biological processes, such as cell cycle and signal transduction. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis demonstrated that these lncRNAs were associated with the regulation of the MAPK and Wnt signaling pathways. Eight lncRNAs (RP5-1154L15.2, RP11-339B21.15, RP11-448A19.1, RP11-676J12.4, LOC101930415, MALAT1, NEAT1, and PWAR6) were identified as key lncRNAs that were widely co-expressed with a series of DEGs. Conclusion: Although further validation is still needed, our study may help to elucidate the mechanisms of lncRNAs underlying AF pathogenesis and provide further insight into identifying novel biomarkers for AF.
-
-
-
Aggregation Prone Regions in Antibody Sequences Raised Against Vibrio cholerae: A Bioinformatic Approach
Authors: Zakia Akter, Anamul Haque, Md. S. Hossain, Firoz Ahmed and Md Asiful Islam
Background: Cholera, a diarrheal illness, causes millions of deaths worldwide due to large outbreaks. The monoclonal antibody used for therapeutic purposes against cholera is prone to instability due to various factors, including self-aggregation. Objectives: In this bioinformatic analysis, we identified the aggregation prone regions (APRs) of antibody sequences of different immunogens (i.e., CTB, ZnM-CTB, ZnP-CTB, TcpA-CT-CTB, ZnM-TcpA-CT-CTB, ZnP-TcpA-CT-CTB, ZnM-TcpA, ZnP-TcpA, TcpA-CT-TcpA, ZnM-TcpA-CT-TcpA, ZnP-TcpA-CT-TcpA, Ogawa, Inaba and ZnM-Inaba) raised against Vibrio cholerae. Methods: To determine APRs in antibody sequences that were generated after immunizing Mus musculus with Vibrio cholerae immunogens, a total of 94 sequences were downloaded in FASTA format from a protein database, and algorithms such as Tango, Waltz, PASTA 2.0, and AGGRESCAN were used to analyze probable APRs in all of the sequences. Results: A remarkably high number of regions in the monoclonal antibodies were identified as APRs, which could explain one cause of the instability and short-term protection of the anti-cholera vaccine. Conclusion: To increase stability, it would be interesting to eliminate the APR residues from the therapeutic antibodies in such a way that the antigen-binding sites or the complementarity determining region loops involved in antigen recognition are not disrupted.
-
-
-
An Analysis Model of Protein Mass Spectrometry Data and its Application
Authors: Pingan He, Longao Hou, Hong Tao, Qi Dai and Yuhua Yao
Background: The impact of cancer on society has created a need for new and faster theoretical models for the early diagnosis of cancer. Methods: In this work, a mass spectrometry (MS) data analysis method based on the star-like graph of proteins and the support vector machine (SVM) was proposed and applied to the early classification of ovarian cancer in an MS data set. Firstly, the MS data are reduced and transformed into the corresponding protein sequence. Then, the topological indexes of the star-like graph are calculated to describe the MS data of each cancer sample. Finally, an SVM model is used to classify the MS data. Results: Over 10 independent training and testing experiments used to evaluate the ovarian cancer detection models, the average prediction accuracy, sensitivity, and specificity of the model were 96.45%, 96.88%, and 95.67%, respectively, for [0,1]-normalized data, and 94.43%, 96.25%, and 91.11% for [-1,1]-normalized data. Conclusion: The model combined with SELDI-TOF-MS technology shows promise for the early clinical detection and diagnosis of ovarian cancer.
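The two normalization ranges and the three reported metrics can be illustrated with a short sketch. It assumes ordinary min-max scaling and the usual confusion-matrix definitions; the abstract does not give the authors' exact preprocessing, and the function names and toy values below are hypothetical.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]  # constant feature: map everything to lo
    span = vmax - vmin
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Toy feature values (hypothetical), scaled to both ranges in the abstract.
print(min_max_scale([2.0, 4.0, 6.0]))             # [0.0, 0.5, 1.0]
print(min_max_scale([2.0, 4.0, 6.0], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])) # (0.75, 0.5, 1.0)
```

Scaling parameters would normally be fitted on the training split only and then reused on the test split, so that test data do not leak into preprocessing.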
-
-
-
Identification of Carcinogenic Chemicals with Network Embedding and Deep Learning Methods
Authors: Xuefei Peng, Lei Chen and Jian-Peng Zhou
Background: Cancer is the second leading cause of human death in the world. To date, many factors have been confirmed to cause cancer. Among them, carcinogenic chemicals are widely accepted as important ones. Traditional methods for detecting carcinogenic chemicals have low efficiency and high cost. Objective: The aim of this study was to design an efficient computational method for the identification of carcinogenic chemicals. Methods: A new computational model was proposed for detecting carcinogenic chemicals. Because the model is data-driven, carcinogenic and non-carcinogenic chemicals were obtained from the Carcinogenic Potency Database (CPDB). These chemicals were represented by features extracted from five chemical networks, representing five types of chemical associations, via a network embedding method, Mashup. The obtained features were fed into a deep learning method, a recurrent neural network, to build the model. Results: A jackknife test of the model yielded an F-measure of 0.971 and an AUROC of 0.971. Conclusion: The proposed model was quite effective and was superior to models built with traditional machine learning algorithms, classic chemical encoding schemes or direct usage of chemical associations.
-
-
-
A Mini-review of Computational Approaches to Predict Functions and Findings of Novel Micro Peptides
Authors: Mohsin A. Nasir, Samia Nawaz and Jian Huang
New techniques in bioinformatics and large-scale transcriptome studies have revealed that a larger part of the genome is translated than previously thought, giving rise to a diverse population of RNAs with protein-coding and noncoding potential. According to emerging evidence, many RNA molecules have been regarded as noncoding for various reasons. For example, many small open reading frames (sORFs) that encode functional micro peptides have been neglected because of their small size. Recent studies reveal major biological functions of these sORFs and their encoded micro peptides in a wide range of species. Progress in identifying these sORFs and micro peptides is due to advances in bioinformatics and high-throughput sequencing methods. The field has attracted more attention as a large number of additional sORFs and micro peptides have been detected. COVID-19 currently commands the attention of science because of its sudden outbreak, and the sORFs of COVID-19 should be revealed to find new ways of understanding this virus. This review discusses ongoing progress in the systems for identifying sORFs and micro peptides.
-
-
-
Exploiting XG Boost for Predicting Enhancer-promoter Interactions
Authors: Xiaojuan Yu, Jianguo Zhou, Mingming Zhao, Chao Yi, Qing Duan, Wei Zhou and Jin Li
Background: Gene expression and disease control are regulated by the interaction between distal enhancers and proximal promoters, and the study of enhancer-promoter interactions (EPIs) provides insight into the genetic basis of diseases. Objective: Although the recent emergence of high-throughput sequencing methods has deepened our understanding of EPIs, accurate prediction of EPIs still has limitations. Methods: We implemented an XGBoost-based approach and introduced two sets of features (epigenomic and sequence) to predict the interactions between enhancers and promoters in different cell lines. Results: Extensive experimental results show that XGBoost effectively predicts EPIs across three cell lines, especially when using both epigenomic and sequence features. Conclusion: XGBoost outperforms other methods, such as random forest, AdaBoost, GBDT, and TargetFinder.
-
-
-
Sequence-based Identification of Allergen Proteins Developed by Integration of PseAAC and Statistical Moments via 5-Step Rule
Authors: Yaser D. Khan, Ebraheem Alzahrani, Wajdi Alghamdi and Malik Zaka Ullah
Background: Allergens are antigens that can stimulate an atopic type I human hypersensitivity reaction through an immunoglobulin E (IgE) reaction. Some proteins are naturally more allergenic than others. The challenge for toxicologists is to identify the properties that allow proteins to cause allergic sensitization and allergic diseases. The identification of allergen proteins is therefore a critical task. Because the experimental identification of protein functions is laborious and costly, computer scientists have proposed various methods in computational biology and bioinformatics using data science approaches. Objectives: Herein, we report a novel predictor for the identification of allergen proteins. Methods: For feature extraction, statistical moments and various position-based features were incorporated into Chou’s pseudo amino acid composition (PseAAC) and used to train a neural network. Results: The predictor was validated through 10-fold cross-validation and jackknife testing, which gave accuracies of 99.43% and 99.87%, respectively. Conclusion: The proposed predictor can help to predict allergen proteins efficiently and accurately and can provide baseline data for the discovery of new drugs and biomarkers.
-
-
-
GASPIDs Versus Non-GASPIDs - Differentiation Based on Machine Learning Approach
Authors: Fawad Ahmad, Saima Ikram, Jamshaid Ahmad, Waseem Ullah, Fahad Hassan, Saeed U. Khattak and Irshad Ur Rehman
Background: Peptidases are a group of enzymes that catalyze the cleavage of peptide bonds. Around 2-3% of the whole genome codes for proteases, and about one-third of all known proteases are serine proteases, which are divided into 13 clans and 40 families. They are involved in diverse physiological roles such as digestion, coagulation of blood, fibrinolysis, processing of proteins and prohormones, signaling pathways, and complement fixation, and have a vital role in the immune defense system. Based on their functions, they can broadly be divided into two classes: GASPIDs (Granule Associated Serine Peptidases involved in Immune Defense System) and Non-GASPIDs. GASPIDs in particular are involved in immune-associated functions, i.e. initiating apoptosis to kill virally infected and cancerous cells, cytokine modulation for the generation of inflammatory responses, and direct killing of pathogens through phagosomes. Methods: In this study, sequence-based characterization of these two types of serine proteases is performed. We first identified sequences by analyzing multiple online databases as well as whole genomes of different orthologous and non-orthologous species. Sequences were identified by devising a distinct criterion to differentiate GASPIDs from Non-GASPIDs. The translated versions of these sequences were then subjected to feature extraction. Using these distinctive features, we differentiated GASPIDs from Non-GASPIDs by applying multiple supervised machine learning models. Results and Conclusion: Our results show that, among the three classifiers used in this study, the SVM classifier coupled with the tripeptide feature method showed the best accuracy in classifying sequences as GASPIDs or Non-GASPIDs.
-
-
-
An Algorithm to Improve the Speed of Semi and Non-specific Enzyme Searches in Proteomics
Authors: Zach Rolfs, Robert J. Millikin and Lloyd M. Smith
Background: The identification of non-specifically cleaved peptides in proteomics and peptidomics poses a significant computational challenge. Current strategies for the identification of such peptides are typically time-consuming and hinder routine data analysis. Objective: We aimed to design an algorithm that would improve the speed of semi- and non-specific enzyme searches and could be applied to existing search programs. Methods: We developed a novel search algorithm that leverages fragment-ion redundancy to search multiple non-specifically cleaved peptides simultaneously. Briefly, a theoretical peptide tandem mass spectrum is generated using only the fragment-ion series from a single terminus. This spectrum serves as a proxy for several shorter theoretical peptides sharing the same terminus. After database searching, amino acids are removed from the opposing terminus until the observed and theoretical precursor masses match within a given mass tolerance. Results: The algorithm was implemented in the search program MetaMorpheus and was found to perform an order of magnitude faster than the traditional MetaMorpheus search and to produce superior results. Conclusion: We report a fast non-specific enzyme search algorithm that is open-source and enables search programs to utilize fragment-ion redundancy to achieve a notable increase in search speed.
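The final step described in the Methods, trimming residues from the opposing terminus until the precursor mass matches, can be sketched as follows. This is a simplified illustration under stated assumptions (unmodified peptides, neutral monoisotopic masses, an absolute tolerance in daltons); the function names and the example peptide are hypothetical, and this is not the MetaMorpheus implementation.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def trim_to_precursor(proxy, observed_mass, tol=0.01, fixed_terminus="N"):
    """Shorten `proxy` from the terminus opposite `fixed_terminus` until its
    mass matches `observed_mass` within `tol` Da. Returns the matching
    peptide, or None if no truncation matches."""
    candidate = proxy
    while candidate:
        mass = peptide_mass(candidate)
        if abs(mass - observed_mass) <= tol:
            return candidate
        if mass < observed_mass - tol:
            return None  # trimming only lowers the mass, so stop early
        # Keep the shared terminus, drop one residue from the other end.
        candidate = candidate[:-1] if fixed_terminus == "N" else candidate[1:]
    return None

# Hypothetical example: the proxy PEPTIDEK stands in for its N-terminal
# truncations; the observed precursor corresponds to PEPTI.
print(trim_to_precursor("PEPTIDEK", peptide_mass("PEPTI")))  # PEPTI
```

Because one proxy spectrum covers every truncation that shares its terminus, the expensive fragment matching is done once per proxy rather than once per candidate peptide, which is where the speed-up described in the abstract would come from.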
-
-
-
Identifying Breast Cancer-induced Gene Perturbations and its Application in Guiding Drug Repurposing
Authors: Jujuan Zhuang, Shuang Dai, Lijun Zhang, Pan Gao, Yingmin Han, Geng Tian, Na Yan, Min Tang and Ling Kui
Background: Breast cancer is a complex disease with high prevalence in women, the molecular mechanisms of which are still unclear. Most transcriptomic studies on breast cancer focus on the differential expression of each gene between tumor and adjacent normal tissues, while other perturbations induced by breast cancer, including gene regulation variations and changes in gene modules and pathways, which might be critical to the diagnosis, treatment and prognosis of breast cancer, are more or less ignored. Objective: We presented a complete process to study breast cancer from multiple perspectives, including differential expression analysis, construction of gene co-expression networks, modular differential connectivity analysis, differential gene connectivity analysis, gene function enrichment analysis and key driver analysis. In addition, we prioritized related anti-cancer drugs based on enrichment analysis between differentially expressed genes and drug perturbation signatures. Methods: The RNA expression profiles of 1109 breast cancer tissues and 113 non-tumor tissues were downloaded from The Cancer Genome Atlas (TCGA) database. Differential expression of RNAs was identified using the “DESeq2” Bioconductor package in R, and gene co-expression networks were constructed using weighted gene co-expression network analysis (WGCNA). To compare the module changes and gene co-expression variations between tumor and adjacent normal tissues, modular differential connectivity (MDC) analysis and differential gene connectivity analysis (DGCA) were performed. Results: Top differential genes such as MMP11 and COL10A1 were known to be associated with breast cancer. We found that 23 modules in the tumor network had significantly different co-expression patterns. The top differential modules were enriched in GO terms related to breast cancer, such as MHC protein complex, leukocyte activation and regulation of defense response. In addition, key genes such as UBE2T driving the top differential modules were significantly correlated with patient survival. Finally, we predicted some potential breast cancer drugs, such as Eribulin, Taxane, Cisplatin and Oxaliplatin. Conclusion: This framework might be useful in understanding the molecular pathogenesis of diseases like breast cancer and in inferring useful drugs for personalized medication.
