Current Bioinformatics - Volume 18, Issue 1, 2023
-
Comparison of Gene Selection Methods for Clustering Single-cell RNA-seq Data
Authors: Xiaoshu Zhu, Jianxin Wang, Rongruan Li and Xiaoqing Peng
Background: In single-cell RNA-seq data, clustering methods are employed to identify cell types in order to understand cell differentiation and development. Because clustering methods are sensitive to the high dimensionality of single-cell RNA-seq data, one effective solution is to select a subset of genes to reduce the dimensionality. Numerous methods, with different underlying assumptions, have been proposed for choosing a subset of genes to be used for clustering. Objective: To guide users in choosing a suitable gene selection method, we give an overview of different gene selection methods and compare them in terms of the differences between the selected gene sets, clustering performance, running time, and stability. Results: We first review the data preprocessing strategies and gene selection methods used in analyzing single-cell RNA-seq data. Then, the overlaps among the gene sets selected by different methods are analyzed, and the clustering performance based on different feature gene sets is compared. The analysis reveals that the gene sets selected by the methods based on highly variable genes and high mean genes are the most similar, and that highly variable genes play an important role in clustering. Additionally, selecting too few genes compromises clustering performance; for example, SCMarker selected fewer genes than other methods, leading to poorer clustering performance than M3Drop. Conclusion: Different gene selection methods perform differently in different scenarios. HVG works well on full-transcript sequencing datasets, NBDrop and HMG perform better on 3' end sequencing datasets, M3Drop and HMG are more suitable for large datasets, and SCMarker is the most consistent across different preprocessing methods.
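To make the idea of selecting highly variable genes concrete, the following is a minimal sketch, not one of the benchmarked implementations. It assumes a toy count matrix, hypothetical gene names, and a ranking by Fano factor (variance divided by mean), which is only one of several possible variability scores.

```python
from statistics import mean, pvariance


def select_highly_variable_genes(counts, gene_names, top_k):
    """Rank genes by Fano factor (variance / mean) and keep the top_k.

    counts is a list of per-cell lists, indexed as counts[cell][gene].
    The Fano factor is an illustrative choice, not the paper's method.
    """
    scores = []
    for g, name in enumerate(gene_names):
        values = [cell[g] for cell in counts]
        mu = mean(values)
        # Genes with zero mean carry no signal, so they get score 0.
        score = pvariance(values) / mu if mu > 0 else 0.0
        scores.append((score, name))
    # Sort by score descending; ties are broken by gene name.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scores[:top_k]]


# Toy data: 4 cells x 3 genes (hypothetical values and names).
toy_counts = [
    [0, 5, 3],
    [10, 5, 2],
    [0, 5, 3],
    [10, 5, 4],
]
toy_genes = ["geneA", "geneB", "geneC"]
selected = select_highly_variable_genes(toy_counts, toy_genes, top_k=2)
print(selected)
```

Here geneB is constant across cells and is dropped, which is the kind of dimensionality reduction the abstract describes before clustering.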
-
i4mC-CPXG: A Computational Model for Identifying DNA N4-methylcytosine Sites in Rosaceae Genome Using Novel Encoding Strategy
Authors: Lichao Zhang, Ying Liang, Kang Xiao and Liang Kong
Background: N4-methylcytosine (4mC) is one of the most widespread DNA methylation modifications and plays an important role in DNA replication and repair, epigenetic inheritance, gene expression levels, and regulation of transcription. Although biological experiments can identify potential 4mC modification sites, they are limited by the experimental environment and are labor-intensive. Therefore, it is crucial to construct a computational model to identify 4mC sites. Objective: Although some computational methods have been proposed to identify 4mC sites, several problems should not be ignored: (1) a large number of unknown nucleotides exist in biological sequences; (2) previous encoding techniques produce a large number of zeros; (3) sequence distribution information is important for identifying 4mC sites. Considering these aspects, we propose a computational model based on a novel encoding strategy with position-specific information to identify 4mC sites. Methods: We constructed an accurate computational model, i4mC-CPXG, based on extreme gradient boosting. Two kinds of feature vectors are extracted, one from nucleotide information and one from position-specific information. For nucleotide information, we used prior information to identify the base type of unknown nucleotides and to reduce the influence of invalid information caused by the many zeros. For position-specific information, the vector was carefully designed to express base distribution and arrangement. The feature vector obtained by fusing nucleotide information and position-specific information was then input into extreme gradient boosting to construct the model. Results: The accuracy of i4mC-CPXG is 82.49% on the independent dataset. The result was better than that of i4mC-w2vec, which was the best model on the imbalanced dataset with a ratio of 1:15. Meanwhile, our model achieved good performance on other species. These results validated the effectiveness of i4mC-CPXG. Conclusion: Our method is effective in identifying potential 4mC modification sites owing to the proposed encoding strategy, which fuses position-specific information. The satisfactory prediction results on balanced datasets, imbalanced datasets, and datasets from other species indicate that i4mC-CPXG can provide a valuable supplement for biological research.
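The abstract does not give the encoding itself, so the following is only a generic sketch of what a position-specific encoding that tolerates unknown nucleotides can look like. The frequency-table approach, the function names, and the rule for scoring an unknown base are all assumptions made for illustration, not the i4mC-CPXG encoding.

```python
from collections import Counter

BASES = "ACGT"


def position_frequencies(sequences):
    """Per-position base frequencies, ignoring unknown bases such as 'N'."""
    length = len(sequences[0])
    table = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences if seq[pos] in BASES)
        total = sum(counts.values())
        # Fall back to a uniform distribution if no known base was seen.
        table.append({b: (counts[b] / total if total else 0.25) for b in BASES})
    return table


def encode(sequence, table):
    """One feature per position: the frequency of the observed base there.

    An unknown base is scored with the largest frequency at that position,
    i.e. it is treated as the most likely base (an illustrative choice).
    """
    features = []
    for pos, base in enumerate(sequence):
        freqs = table[pos]
        features.append(freqs[base] if base in BASES else max(freqs.values()))
    return features


# Toy training sequences (hypothetical); 'N' marks an unknown nucleotide.
train = ["ACGT", "ACGA", "TCGN"]
table = position_frequencies(train)
vector = encode("ANGT", table)
print(vector)
```

A vector like this avoids the all-zero columns that one-hot encoding produces for unknown bases, and it could be passed to any gradient boosting classifier as input features.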
-
M-CAMP™: A Cloud-based Web Platform with a Novel Approach for Species-level Classification of 16S rRNA Microbiome Sequences
Background: The M-CAMP™ (Microbiome Computational Analysis for Multi-omic Profiling) Cloud Platform was designed to provide users with an easy-to-use web interface for accessing best-in-class microbiome analysis tools. This interface allows bench scientists to conduct bioinformatic analysis on their samples and then download publication-ready graphics and reports. Objective: In this study, we aim to describe the M-CAMP™ platform and demonstrate that its taxonomic classification is more accurate than previously described methods on a wide range of microbiome samples. Methods: The core pipeline of the platform is the 16S-seq taxonomic classification algorithm, which provides species-level classification of Illumina 16S sequencing data. This algorithm uses a novel approach that combines alignment-based and k-mer-based taxonomic classification methodologies to produce a highly accurate and comprehensive profile. Additionally, a comprehensive proprietary database combining reference sequences from multiple sources was curated; it contains 18056 unique V3-V4 sequences covering 11527 species. Results and Discussion: The M-CAMP™ 16S taxonomic classification algorithm was evaluated on 52 sequencing samples from both public and in-house standard sample mixtures with known fractions. The same evaluation was also performed on 5 well-known 16S taxonomic classification algorithms (Qiime2, Kraken2, Mapseq, Idtaxa and Spingo) using the same dataset. Results are discussed in terms of evaluation metrics and classified taxonomic levels. Conclusion: Compared with currently popular publicly accessible classification algorithms, the M-CAMP™ 16S taxonomic classification algorithm provides the most accurate species-level classification of 16S rRNA sequencing data.
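As a rough illustration of the k-mer half of such an approach, the sketch below assigns a read to the reference that shares the largest fraction of its k-mers. It is not the platform's algorithm; the reference sequences, species labels, and the choice of k are hypothetical, and the alignment step is omitted.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references, k=4):
    """Pick the reference sharing the largest fraction of the read's k-mers.

    Returns (label, score); label is None if no k-mer is shared at all.
    """
    read_kmers = kmers(read, k)
    best_label, best_score = None, 0.0
    # Iterate in sorted order so ties resolve deterministically.
    for label, ref in sorted(references.items()):
        shared = len(read_kmers & kmers(ref, k))
        score = shared / len(read_kmers) if read_kmers else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical reference sequences keyed by placeholder species labels.
references = {
    "species_a": "ACGTACGTGGCCTTAA",
    "species_b": "TTGGCCAATTGGCCAA",
}
label, score = classify("ACGTACGTGG", references)
print(label, score)
```

In a combined scheme, a k-mer score of this kind would typically be used alongside an alignment score, for example to shortlist candidates before aligning.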
-
Comprehensive Pan-cancer Gene Signature Assessment through the Implementation of a Cascade Machine Learning System
Background: Despite the medical advances introduced for personalized patient treatment and the research devoted to finding the genetic patterns behind its different manifestations in humans, the unequivocal and effective treatment of cancer unfortunately remains an unresolved scientific challenge. Until a universal solution for its control is achieved, early detection mechanisms for preventive diagnosis increasingly help avoid treatments with unreliable effectiveness. The discovery of unequivocal gene patterns that allow us to discern between multiple pathological states could help shed light on patients suspected of having an oncological disease but with uncertain histological and immunohistochemical results. Methods: This study presents an approach for pan-cancer diagnosis based on gene expression analysis that determines a reduced set of 12 genes, making it possible to distinguish between the 14 main cancer diseases. Results: Our cascade machine learning process has been robustly designed, obtaining a mean F1 score of 92% and a mean AUC of 99.37% on the test set. Our study showed heterogeneous over- or underexpression of the analyzed genes, which can act as oncogenes or tumor suppressor genes. Upregulation of LPAR5 and PAX8 was demonstrated in thyroid cancer samples. KLF5 was highly expressed in the majority of cancer types. Conclusion: Our model constitutes a useful tool for pan-cancer gene expression evaluation. In addition to providing biological clues about a hypothetical common origin of cancer, the scalability of this study promises to be very useful for future studies to reinforce, confirm, and extend the biological observations presented here. The code and datasets are available in the following GitHub repository to support research reproducibility: https://github.com/CasedUgr/PanCancerClassification.
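For readers unfamiliar with how a mean F1 score over several classes is obtained, here is a minimal sketch. It assumes that "mean" refers to an unweighted (macro) average of per-class F1 scores, which the abstract does not state, and the class labels and predictions are toy values rather than results from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for three classes (illustrative only).
y_true = ["thyroid", "thyroid", "lung", "lung", "breast", "breast"]
y_pred = ["thyroid", "thyroid", "lung", "breast", "breast", "breast"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Macro averaging weights every cancer type equally, so a rare class that is classified poorly lowers the mean as much as a common one would.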
-
Hypertension Risk Prediction Based on SNPs by Machine Learning Models
Authors: S. A. Lajevardi, Mehrdad Kargari, Maryam S. Daneshpour and Mahdi Akbarzadeh
Background: Hypertension is one of the most significant underlying causes of cardiovascular disease; hence, methods that can accurately reveal the risk of hypertension at an early age are essential. One of the most critical personal health objectives is to improve disease prediction accuracy by examining genetic variants. Objective: Various clinical and genetically based methods are used to predict the disease; however, the critical issue with these methods is the high number of input variables (genetic markers) relative to the small number of samples. One approach that can be used to address this problem is machine learning. Methods: This study was conducted on the genetic markers of participants in the 20-year Tehran Cardiometabolic Genetic Study (TCGS). Various machine learning methods were used, including linear regression, neural network, random forest, decision tree, and support vector machine. The top ten genetic markers were identified using importance-based ranking methods, including information gain, gain ratio, Gini index, χ², relief, and FCBF. Results: A neural network model with an AUC of 89% was presented. This model has an accuracy and an F-measure of 0.89, indicating good predictive quality. The final results indicate the success of the machine learning approach. Conclusion: The study shows that the machine learning approach helps predict the risk of hypertension at a young age and identifies significant SNPs that affect hypertension (HTN).
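Of the ranking methods listed, information gain is the simplest to show. The sketch below ranks two markers by how much knowing the genotype reduces the entropy of the outcome. The SNP names, genotype codes (0, 1, 2), and outcome labels are hypothetical toy data, not values from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of labels minus the expected entropy after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy genotypes per marker for six individuals (hypothetical).
genotypes = {
    "snp_1": [0, 0, 1, 1, 2, 2],
    "snp_2": [0, 1, 0, 1, 0, 1],
}
hypertension = [0, 0, 0, 1, 1, 1]

ranking = sorted(
    genotypes,
    key=lambda snp: information_gain(genotypes[snp], hypertension),
    reverse=True,
)
print(ranking)
```

Keeping only the top-ranked markers is one way to address the problem of many input variables and few samples that the abstract raises.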
-
TMMGdb - Tumor Metastasis Mechanism-associated Gene Database
Authors: Hsueh-Chuan Liu, Ka-Lok Ng, Venugopala R. Mekala and Chien-Hung Huang
Background: At present, all or most published databases report metastasis genes organized by cancer type or by hallmarks of cancer/metastasis. Since tumor metastasis is a dynamic process involving many cellular and molecular processes, those databases cannot provide information on the sequential relations and the cellular and molecular mechanisms among different metastasis stages. Objective: We use the concept of tumor metastasis mechanisms to construct a tumor metastasis mechanism-associated gene (TMMG) database. Methods: We used the text mining tool BioBERT to mine the titles and abstracts of papers and identify TMMGs. Results: This tumor metastasis mechanism-associated gene database (TMMGdb) contains a wealth of annotations. To check the reliability of TMMGdb, we compared the proportions of housekeeping genes (HKGs) in TMMGdb, HCMDB, and CMgene. The results showed that around 20% of the TMMGs are HKGs, and that the proportions are highly consistent among the three databases. Compared with the HCMDB and CMgene databases, TMMGdb covers a more recent (2017 or later) collection of publications and TMMGs. We provide six case studies to illustrate the uniqueness of TMMGdb. Conclusion: TMMGdb is a comprehensive resource for the biomedical community to understand the dynamic process, molecular features, and cellular processes involved in tumor metastasis. TMMGdb provides four interfaces, ‘Browse’, ‘Search’, ‘DEG Search’ and ‘Download’, for users to investigate the causal effects among different metastasis stages. The database is freely accessible at http://hmg.asia.edu.tw/TMMGdb.
-
Characterization, Potential Prognostic Value, and Immune Heterogeneity of Cathepsin C in Diffuse Glioma
Authors: Quanwei Zhou, Shasha Li, Xuejun Yan, Hecheng Zhu, Weidong Liu, Youwei Guo, Hongjuan Xu, Wen Yin, Xuewen Li, Qian Yang, Hui Liu, Xingjun Jiang and Caiping Ren
Background: Diffuse glioma is the most frequent intracranial tumor and remains universally lethal. Prognostic biomarkers have remained a focus in diffuse glioma research over the last decades, and more reliable predictors that adequately characterize the prognosis of diffuse glioma are essential. Cathepsin C (CTSC), a lysosomal cysteine protease, is an essential component of the lysosomal hydrolase family, but its potential role in diffuse glioma remains to be characterized. Objective: We aimed to investigate the performance of CTSC in predicting prognosis and therapeutic targets in diffuse glioma. Methods: The expression profile of CTSC in multiple tumors and more than 2000 glioma samples with corresponding clinical data were collected from authoritative public databases. The expression level of CTSC was evaluated by qPCR and IHC. The prognostic value of CTSC was assessed using univariate and multivariate Cox regression analysis. The ESTIMATE R package was used to evaluate the immune and stromal scores based on the gene expression profile. CIBERSORT was applied to evaluate the relative levels of 22 immune cell subtypes, using the R package 'CIBERSORT' to define the cell composition of tumor tissues. In addition, MCP-counter was used to assess the absolute abundance of neutrophils. Results/Discussion: CTSC was aberrantly expressed and significantly correlated with clinical outcomes in multiple tumors. CTSC was heterogeneously expressed across histologic types and tumor grades of diffuse glioma and highly enriched in IDH or IDH1-wildtype glioma. CTSC was positively associated with immune and stromal scores and with infiltrating levels of M2 macrophages and neutrophils, and negatively associated with infiltrating levels of NK cells. Additionally, CTSC was closely correlated with some immune checkpoint molecules, including CD276, CD80, CD86 and PD-L2. Conclusion: CTSC was involved in shaping the immunosuppressive microenvironment and acted as an independent indicator of poor prognosis in diffuse glioma. Targeting CTSC in glioma therapies might offer promising prospects.
-
DHOSGR: lncRNA-disease Association Prediction Based on Decay High-order Similarity and Graph-regularized Matrix Completion
Authors: Guobo Xie, Zelin Jiang, Zhiyi Lin, Guosheng Gu, Yuping Sun, Qing Su, Ji Cui and Huizhe Zhang
Background: Numerous recent studies have shown that long non-coding RNAs (lncRNAs) play a vital role in the regulation of various biological processes and serve as a basis for understanding the causes of human illnesses. Thus, many researchers have developed matrix completion approaches to infer lncRNA-disease connections and enhance prediction performance by using similarity information. Objective: Most matrix completion approaches are based solely on the first-order or second-order similarity between nodes, and higher-order similarity is rarely considered. In view of this, we developed a computational method (DHOSGR) that incorporates higher-order similarity information into the similarity network with different weights, using a decay function designed from a random walk with restart. Methods: First, considering that information decays as distance increases during network propagation, we defined a novel decay high-order similarity by combining the similarity matrix and its high-order similarity information through a decay function to construct a similarity network. Then, we applied the similarity network to the objective function as a graph regularization term. Finally, a proximal splitting algorithm was used to perform matrix completion to infer relationships between diseases and lncRNAs. Results: In the experiments, DHOSGR achieves superior performance in leave-one-out cross validation (LOOCV) and in 100 repetitions of 5-fold cross validation (5-fold-CV), with AUC values of 0.9459 and 0.9334 ± 0.0016, respectively, which are better than those of five previous models. Moreover, case studies of three diseases (leukemia, lymphoma, and squamous cell carcinoma) demonstrated that DHOSGR can reliably predict associated lncRNAs. Conclusion: DHOSGR can serve as a highly efficient computational model for predicting lncRNA-disease associations.
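The idea of combining a similarity matrix with its higher-order powers under decaying weights can be sketched as follows. The abstract does not give the decay function, so the geometric weights, the row normalization, and the parameter names `max_order` and `decay` are assumptions chosen only to illustrate the principle; they are not the DHOSGR formulation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_normalize(m):
    """Scale each row to sum to 1 (rows that sum to 0 stay all zero)."""
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


def decayed_similarity(sim, max_order=3, decay=0.5):
    """Weighted sum of powers of the normalized similarity matrix.

    Order k receives weight decay**(k - 1), and the weights are normalized
    to sum to 1, so more distant (higher-order) neighbors contribute less.
    """
    p = row_normalize(sim)
    weights = [decay ** (k - 1) for k in range(1, max_order + 1)]
    total_w = sum(weights)
    n = len(sim)
    result = [[0.0] * n for _ in range(n)]
    power = p  # holds P^k for the current order k
    for w in weights:
        for i in range(n):
            for j in range(n):
                result[i][j] += (w / total_w) * power[i][j]
        power = matmul(power, p)
    return result


# Toy similarity matrix: nodes 0 and 2 are linked only through node 1.
toy_sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
high_order = decayed_similarity(toy_sim)
print(high_order[0][2])
```

In the toy matrix, nodes 0 and 2 have zero first-order similarity, yet the combined matrix gives them a small positive value through their shared neighbor. That is the kind of indirect information a graph regularization term can then exploit.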
