Current Bioinformatics - Online First
-
-
PredART: Uncertainty-quantified Machine Learning Prediction of Androgen Receptor Agonists Overcoming Imbalanced Dataset
Authors: Jidon Jang, Dokyun Na and Kwang-Seok Oh
Available online: 02 January 2025
Aim: This study aims to develop and validate a machine learning-based model for the accurate prediction of androgen receptor (AR) agonistic toxicity, addressing the challenges posed by data imbalance in existing predictive models.
Background: Anomalous agonistic activity of the androgen receptor is a known major indicator of reproductive toxicity, which can lead to prostate cancer. Machine learning-based models have been developed for the rapid prediction of such agonists. However, the existing models have exhibited biased learning outcomes and low sensitivity due to the imbalance in the available training data. In the early screening process of drug discovery, low sensitivity caused by data imbalance can hinder the detection of potentially toxic compounds.
Objective: The objective of this study is to develop a machine learning prediction model that classifies whether a drug candidate is an androgen receptor agonist or not, with more balanced performance than existing models.
Methods: PredART is a bootstrap aggregated k-nearest neighbor model for the balanced prediction of androgen receptor agonistic toxicity, built on 381 active and 8,089 inactive data points together with their structural features.
Result: In this work, we propose an advanced model that combines the bootstrap aggregating algorithm with machine learning binary classifiers to identify androgen receptor-based reproductive toxicity while avoiding biased prediction results. The optimal model using k-nearest neighbor classifiers achieved an accuracy of 0.831, a positive predictive value (PPV) of 0.882, a sensitivity of 0.625, a specificity of 0.951, and a Matthews correlation coefficient (MCC) of 0.633 on external test data, demonstrating a significant improvement in sensitivity compared to the previous study and achieving balanced learning. Furthermore, by calculating the standard deviation among the outputs of the classifiers and employing this prediction uncertainty as a screening metric to select reliable predictions, the model's performance could be further enhanced.
Conclusion: Based on the bootstrap aggregating algorithm, our prediction model effectively addressed data imbalance while evaluating the performance of various machine learning and deep learning classifiers for a benchmark. Additionally, by quantifying uncertainty, our model provided an intuitive assessment of prediction reliability during large-scale screening processes.
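The bagging-plus-uncertainty idea in the PredART abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the balanced resampling scheme (equal numbers of actives and inactives per bag), the Euclidean distance, and all parameter values are assumptions made for illustration only.

```python
import random
import statistics

def knn_predict(train, query, k=3):
    """Majority vote of the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def bagged_knn(actives, inactives, query, n_models=25, k=3, seed=0):
    """Return (mean vote, standard deviation of votes) over bootstrap kNN models.

    Assumption: each bag resamples the minority (active) class with replacement
    and draws the same number of inactives, so every base model sees balanced data.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        bag = [(x, 1) for x in rng.choices(actives, k=len(actives))]
        bag += [(x, 0) for x in rng.choices(inactives, k=len(actives))]
        votes.append(knn_predict(bag, query, k))
    # The spread of the base-model votes serves as a simple uncertainty estimate
    return statistics.mean(votes), statistics.pstdev(votes)
```

In a screening setting, predictions whose standard deviation exceeds a chosen cutoff could be set aside as unreliable, which mirrors the uncertainty-based filtering the abstract describes.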
-
-
-
A Method of Enhancing Heterogeneous Graph Representation for Predicting the Associations between lncRNAs and Diseases
Authors: Dengju Yao, Yuehu Wu and Xiaojuan Zhan
Available online: 06 November 2024
Background: Long non-coding RNAs (lncRNAs) are a class of longer RNA transcripts that lack protein-coding ability. Although they are not involved in the translation of proteins, studies have shown that they play essential regulatory roles in cells, regulating gene expression and cellular processes. However, it is both costly and inefficient to determine the associations between lncRNAs and diseases through biological experiments. Therefore, there is an urgent need to develop convenient and fast computational methods to predict lncRNA-disease associations (LDAs) more efficiently.
Objective: Predicting disease-associated lncRNAs can help explore the mechanisms of action of lncRNAs in diseases, which is crucial for early intervention and treatment of diseases.
Methods: In this paper, we propose an enhanced heterogeneous graph representation method for predicting LDAs, named GCGALDA. GCGALDA first obtains the topological structure features of nodes by a biased random walk. Based on this, the neighboring nodes of a node are weighted using the attention mechanism to further mine the semantic association relationships between nodes in the graph data. Then, a graph convolution network (GCN) is used to transfer the neighborhood features of the node to the central node and combine them with the node's own features, so that the final node representation contains not only structural information but also semantic association information. Finally, the association score between lncRNA and disease is obtained by a multilayer perceptron (MLP).
Results: As evidenced by the experimental findings, GCGALDA outperforms other advanced models in terms of prediction accuracy on openly accessible databases. In addition, case studies on several human diseases further confirm the predictive ability of GCGALDA.
Conclusion: The proposed GCGALDA model extracts multi-perspective features, such as topology, semantic association, and node attributes, obtains high-quality heterogeneous graph node representations, and effectively improves the performance of the LDA prediction model.
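The GCN step in the GCGALDA abstract, which transfers neighborhood features to the central node and combines them with the node's own features, can be sketched with the widely used symmetric-normalization propagation rule. Whether GCGALDA uses exactly this rule is not stated in the abstract; the function name and the dense list-of-lists representation are assumptions, and the random walk, attention weighting, and MLP stages are omitted.

```python
import math

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node's own features are mixed with its neighbours'
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbourhood features for every node
    n_feat = len(feats[0])
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_feat)]
           for i in range(n)]
    # Apply the linear map followed by ReLU
    n_out = len(weight[0])
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_feat)))
             for o in range(n_out)] for i in range(n)]
```

For two connected nodes with scalar features 1 and 3 and an identity weight, both nodes end up with the averaged value 2, which shows how neighbor information flows into each node's representation.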
-
-
-
Identification and Analysis of Plant miRNAs: Evolution of In-silico Resources and Future Challenges
Authors: Abhishek Kushwaha, Hausila Prasad Singh and Noopur Singh
Available online: 04 November 2024
Endogenous small RNAs (miRNAs) are key regulators in numerous eukaryotic lineages and play an important role in a broad range of plant developmental processes. Computational analysis of miRNAs facilitates the understanding of miRNA-based regulation in plants. The discovery of small non-coding RNAs has led to a greater understanding of gene regulation, and the development of bioinformatic tools has enabled the identification of microRNAs (miRNAs) and their targets. The need for comprehensive miRNA analysis is being met by the development of advanced computational tools, algorithms, and databases. Each resource has its own specificity and limitations for the analysis. This review provides a comprehensive overview of various algorithms used by computational tools, software, and databases for plant miRNA analysis. However, over a period of about two decades, a lot of knowledge has been added to our understanding of the biogenesis and functioning of miRNAs in other plants. Several parameters have already been integrated, and others need to be incorporated in order to give more accurate and efficient results. A reassessment of computational resources (based on old algorithms) is required in light of new miRNA research and development. Generally, computational methods, including ab-initio and homology search-based methods, are used for miRNA identification and target prediction. This review presents the new challenges faced by existing computational methods, highlights the limitations of existing computational tools and methods, and emphasizes the need for new tools, advanced algorithms, and a comprehensive platform for miRNA gene exploration.
-
-
-
GVNNVAE: A Novel Microbe-Drug Association Prediction Model based on an Improved Graph Neural Network and the Variational Auto-Encoder
Authors: Yiming Chen, Zhen Zhang, Xin Liu, Bin Zeng and Lei Wang
Available online: 31 October 2024
Microorganisms play a crucial role in human health and disease. Identifying potential microbe-drug associations is essential for drug discovery and clinical treatment. In this manuscript, we proposed a novel prediction model named GVNNVAE by combining an Improved Graph Neural Network (GNN) and the Variational Auto-Encoder (VAE) to infer potential microbe-drug associations. In GVNNVAE, we first established a heterogeneous microbe-drug network N by integrating multiple similarity metrics of microbes, drugs, and diseases. Subsequently, we introduced an improved GNN and the VAE to extract topological and attribute representations, respectively, for nodes in N. Finally, by incorporating various original attributes of microbes and drugs with the above two kinds of newly obtained topological and attribute representations, predicted scores of potential microbe-drug associations were calculated. Furthermore, to evaluate the prediction performance of GVNNVAE, intensive experiments were conducted, and comparative results showed that GVNNVAE could achieve a satisfactory AUC value of 0.9688, which outperformed existing competitive state-of-the-art methods. Moreover, case studies of known microbes and drugs confirmed the effectiveness of GVNNVAE as well, which highlighted its potential for predicting latent microbe-drug associations.
-
-
-
Graph-Root: Prediction of Root-Associated Proteins in Maize, Sorghum, And Soybean Based on Graph Convolutional Network and Network Embedding Method
Authors: Bo Zhou, Siyang Liu, Lei Chen and Qi Dai
Available online: 29 October 2024
Background: The root system plays an irreplaceable role in plant growth. Its improvement can increase crop productivity. However, this system is still poorly understood, and its underlying mechanism has not been fully uncovered. Investigating proteins related to the root system is an important means to this end. Previously, the lack of known root-related proteins made it impossible to adopt machine learning methods for designing efficient models for the discovery of novel root-related proteins. Recently, a public database on root-related proteins was set up, so machine learning methods can now be applied in this field.
Objective: The purpose of this study was to design an efficient computational method to predict root-associated proteins in three plants: maize, sorghum, and soybean.
Method: In this study, we proposed a machine learning-based model, named Graph-Root, for the identification of root-related proteins in maize, sorghum, and soybean. Features derived from protein sequences, functional domains, and one network were extracted, where the first type of features was processed by a graph convolutional neural network and multi-head attention, the second type of features reflected the essential functions of proteins, and the third type of features abstracted the linkage between proteins. These features were fed into a fully connected layer to make predictions.
Results: The 5-fold cross-validation and independent tests suggested acceptable performance. Graph-Root also outperformed the only previous model, SVM-Root. Furthermore, the importance of each feature type and component in the proposed model was investigated.
Conclusion: Graph-Root performed well and can be a useful tool to identify novel root-related proteins. BLOSUM62 features were found to be important in determining root-related proteins.
-
-
-
Robust Somatic Copy Number Estimation using Coarse-to-fine Segmentation
Available online: 28 October 2024
Introduction: Cancers routinely exhibit chromosomal instability that results in copy number variants (CNVs), namely changes in the abundance of genomic material. Unfortunately, the detection of these variants in cancer genomes is difficult.
Methods: We present Ploidetect, a software package that effectively identifies CNVs within whole-genome sequenced tumors. Ploidetect utilizes a coarse-to-fine segmentation approach which yields highly contiguous segments while allowing for focal CNVs to be detected with high sensitivity.
Results: We benchmark Ploidetect against popular CNV tools using synthetic data, cell line data, and real-world metastatic tumor data and demonstrate strong performance in all tests. We show that high quality CNVs from Ploidetect enable the identification of recurrent homozygous deletions and genes associated with chromosomal instability in a multi-cancer cohort of 687 patients. Using highly contiguous CNV calls afforded by Ploidetect, we also demonstrate the use of segment N50 as a novel metric for the measurement of chromosomal instability within tumor biopsies.
Conclusion: We propose that the increasingly accurate determination of CNVs is critical for their productive study in cancer, and our work demonstrates advances made possible by progress in this regard.
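The segment N50 metric mentioned in the Ploidetect abstract can be illustrated with the standard N50 definition borrowed from genome assembly, applied to copy number segment lengths. The abstract does not give the exact formula used by Ploidetect, so the function below (a hypothetical name) is only a sketch of the conventional definition.

```python
def segment_n50(lengths):
    """Length L such that segments of length >= L cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    # Walk from the longest segment down until half of the total length is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

Under this definition, a genome broken into many short segments yields a low N50, whereas a few long contiguous segments yield a high N50, which is why the metric can act as a proxy for chromosomal instability.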
-
-
-
PredPVP: A Stacking Model for Predicting Phage Virion Proteins Based on Feature Selection Methods
Authors: Qian Cao, Xufeng Xiao, Yannan Bin, Jianping Zhao and Chunhou Zheng
Available online: 28 October 2024
Background: Phage therapy has broad application prospects as a novel therapeutic method, and Phage Virion Proteins (PVPs) can recognize the host and bind to surface receptors, which is of great significance for the development of antimicrobial drugs for the treatment of infectious diseases caused by bacteria. In recent years, several PVP predictors based on machine learning have been developed, which usually use a single feature to train the learner. In contrast, higher dimensional feature representations tend to contain more potential sequence information.
Methods: In this work, we construct a stacking model, PredPVP, for PVP prediction by combining multiple features and using feature selection methods. Specifically, the sequence is first encoded using seven features. For this high-dimensional feature representation, three feature selection methods were utilized to remove redundant features, and the results were then integrated with eight machine learning algorithms. Finally, probability features and class features (PCFs) generated by the 24 base models were fed into logistic regression (LR) to train the model.
Results: The results on the independent test set indicate that PredPVP has higher performance compared to other existing predictors, with an AUC of 93.4%.
Conclusion: We expect PredPVP to be used as a tool for large-scale PVP recognition, providing a new way for the development of novel antimicrobials and accelerating its application in actual treatment. The datasets and source codes used in this study are available at https://github.com/caoqian23/PredPVP.
-
-
-
A Low Transformed Tubal Rank Tensor Model Using a Spatial-Tubal Constraint for Sample Clustering with Cancer Multi-omics Data
Authors: Sheng-Nan Zhang, Ying-Lian Gao, Yu-Lin Zhang, Junliang Shang, Chun-Hou Zheng and Jin-Xing Liu
Available online: 21 October 2024
Background: Since each dimension of a tensor can store different types of genomics data, utilizing tensor structure can, compared to matrix methods, provide a deeper understanding of multi-dimensional data while also facilitating the discovery of more useful information related to cancer. However, in practice, there are issues such as insufficient utilization of prior knowledge in multi-omics data and limitations in the recovery of low-tubal-rank tensors. Therefore, the method proposed in this article was developed.
Objective: In this paper, we proposed a low transformed tubal rank tensor model (LTTRT) using a spatial-tubal constraint to accurately partition different types of cancer samples and provide reliable theoretical support for the identification, diagnosis, and treatment of cancer.
Method: In the LTTRT method, the transformed tensor nuclear norm based on the transformed tensor singular value decomposition is used to characterize the low-rank tensor, which can explore the global low-rank property of the tensor, resolving the challenge of the tensor nuclear norm-based method not achieving the lowest tubal rank. Additionally, the introduction of weighted total variation regularization is conducive to extracting more information from sequencing data in both spatial and tubal dimensions, exploring cross-correlation features of multiple genomic data, and addressing the problem of overlooking prior knowledge from various perspectives. In addition, the L1-norm is used to improve sparsity. A symmetric Gauss-Seidel-based alternating direction method of multipliers (sGS-ADMM) is used to update the LTTRT model iteratively.
Results: Sample clustering experiments on multiple integrated cancer multi-omics datasets show that the proposed LTTRT method performs better than existing methods. Experimental results validate the effectiveness of LTTRT in accurately partitioning different types of cancer samples.
Conclusion: The LTTRT method achieves precise segmentation of different types of cancer samples.
-
-
-
Predicting Molecular Subtypes of Breast Cancer Using Gene Expression Profiling and Random Forest Classifier
Available online: 14 October 2024
Background: One of the main causes of cancer-related mortality in women is breast cancer [BC]. There are four molecular subtypes of this malignancy, and adjuvant therapy efficacy differs based on these subtypes. Gene expression profiles provide valuable information that is helpful for patients whose prognosis is not clear from clinical markers and immunohistochemistry.
Objective: In this study, we aim to predict molecular types of BC using a gene expression dataset of patients with BC and normal samples using six well-known ensemble machine-learning techniques.
Methods: Two microarray datasets, [GSE45827] and [GSE140494], were downloaded from the Gene Expression Omnibus [GEO] database. These datasets comprise 21 samples of normal tissues that were part of a cohort analysis of primary invasive breast cancer [57 basal, 36 HER2, 56 Luminal A, and 66 Luminal B]. Namely, we used AdaBoost, Random Forest [RF], Artificial Neural Network [ANN], Naïve Bayes [NB], Classification and Regression Tree [CART], and Linear Discriminant Analysis [LDA] classifiers.
Result: The results of the data analysis show that the RF and NB classifiers outperform the other models in the prediction of the BC subtype. The RF shows superior performance with an accuracy range between 0.89 and 1.0, in contrast to its competitor NB, which has an average accuracy of 0.91. Our approach perfectly discriminates unaffected cases [normal] from the carcinoma. In this case, the RF provides perfect prediction with zero errors. Additionally, we used PCA, DHWT low-frequency, and DHWT high-frequency to perform dimensionality reduction on the numerous gene expression values. Consequently, the LDA achieves up to 95% improvement in performance through data reduction. Moreover, feature selection allowed for the best performance, which is recorded by the RF with a classification accuracy of 98%.
Conclusion: Overall, we provide a successful framework that leads to shorter computation times and smaller ML models, especially where memory and time restrictions are crucial.
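The "DHWT low-frequency" and "DHWT high-frequency" reductions named in the abstract above can be pictured with a one-level Haar wavelet transform. The abstract does not expand the acronym, so reading DHWT as the discrete Haar wavelet transform is an assumption, as are the function name and the padding rule for odd-length input.

```python
import math

def haar_dwt(values):
    """One-level Haar transform: returns (low-frequency, high-frequency) coefficients."""
    values = list(values)
    if len(values) % 2:
        # Assumption: pad odd-length input by repeating the last value
        values.append(values[-1])
    s = math.sqrt(2.0)
    # Scaled pairwise sums keep the smooth trend; scaled differences keep the detail
    low = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return low, high
```

Keeping only the low-frequency (or only the high-frequency) half reduces an expression vector to half its length, which is one way such a transform can shrink the feature space before classification.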
-
-
-
NEXT-GEN Medicine: Designing Drugs to Fit Patient Profiles
Authors: Raj Kamal, Diksha, Priyanka Paul, Ankit Awasthi and Amandeep Singh
Available online: 14 October 2024
Background: Personalized medicine, with its focus on tailoring drug formulations to individual patient profiles, has made significant strides in healthcare. The integration of genomics, biomarkers, nanotechnology, 3D printing, and real-time monitoring provides a comprehensive approach to optimizing drug therapies on an individual basis. This review aims to highlight the recent advancements in personalized medicine and its applications in various diseases, such as cancer, cardiovascular diseases, diabetes mellitus, and neurodegenerative diseases. The review explores the integration of multiple technologies in the field of personalized medicine, including genomics, biomarkers, nanotechnology, 3D printing, and real-time monitoring. As these technologies continue to evolve, we are entering an era of truly personalized medicine that promises improved treatment outcomes, reduced adverse effects, and a more patient-centric approach to healthcare. The advancements in personalized medicine hold great promise for improving patient outcomes and reducing adverse effects, heralding a new era in patient-centric healthcare.
-
-
-
Artificial Intelligence in Diabetes Mellitus Prediction: Advancements and Challenges - A Review
Authors: Rohit Awasthi, Anjali Mahavar, Shraddha Shah, Darshana Patel, Mukti Patel, Drashti Shah and Ashish Patel
Available online: 11 October 2024
Poor dietary habits and a lack of understanding are contributing to the rapid global increase in the number of diabetic people. Therefore, a framework that can accurately forecast a large number of patients based on clinical details is needed. Artificial intelligence (AI) is a rapidly evolving field, and its applications to diabetes, a worldwide pandemic, have the potential to revolutionize the strategy of diagnosing and forecasting this chronic condition. Algorithms based on artificial intelligence fundamentals have been developed to support predictive models for the risk of developing diabetes or its complications. In this review, we discuss AI-based diabetes prediction. Thus far, AI-based new-onset diabetes prediction has not outperformed traditional, statistically based risk stratification models. Despite this, it is anticipated that in the near future, a vast quantity of well-organized data and an abundance of processing power will optimize AI's predictive capabilities, greatly enhancing the accuracy of diabetic illness prediction models.
-
-
-
scADCA: An Anomaly Detection-Based scRNA-seq Dataset Cell Type Annotation Method for Identifying Novel Cells
Authors: Yongle Shi, Yibing Ma, Xiang Chen and Jie Gao
Available online: 10 October 2024
Background: With the rapid evolution of single-cell RNA sequencing technology, the study of cellular heterogeneity in complex tissues has reached an unprecedented resolution. One critical task of the technology is cell-type annotation. However, challenges persist, particularly in annotating novel cell types.
Objective: Current methods rely heavily on well-annotated reference data, using correlation comparisons to determine cell types. However, identifying novel cells remains unstable due to the inherent complexity and heterogeneity of scRNA-seq data and cell types. To address this problem, we propose scADCA, a method based on anomaly detection, for identifying novel cell types and annotating the entire dataset.
Methods: Convolutional modules and fully connected networks are integrated into an autoencoder, which is trained on the reference dataset to obtain the reconstruction errors. A threshold based on these errors can distinguish between novel and known cells in the query dataset. After novel cells are identified, a multinomial logistic regression model fully annotates the dataset.
Results: Using a simulation dataset, three real scRNA-seq pancreatic datasets, and a real scRNA-seq lung cancer cell line dataset, we compare scADCA with six other cell-type annotation methods, demonstrating competitive performance in terms of distinguished accuracy, full accuracy, F1-score, and confusion matrix.
Conclusion: The scADCA method can be further improved and expanded to achieve better performance and application effects in cell type annotation, which helps to improve the accuracy and reliability of cytology research and promotes the development of single-cell omics.
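The thresholding step in the scADCA abstract, where reconstruction errors from the reference data separate known cells from novel ones, can be sketched as follows. The abstract does not say how the threshold is derived, so the "mean plus a multiple of the standard deviation" rule, the function names, and the default multiplier are assumptions; the autoencoder that would produce the errors is omitted.

```python
import statistics

def error_threshold(reference_errors, n_sigma=2.0):
    """Assumed rule: mean of the reference reconstruction errors plus n_sigma standard deviations."""
    mean = statistics.mean(reference_errors)
    spread = statistics.pstdev(reference_errors)
    return mean + n_sigma * spread

def flag_novel(query_errors, threshold):
    """Mark a query cell as novel when its reconstruction error exceeds the threshold."""
    return [err > threshold for err in query_errors]
```

Cells flagged as novel would be held out, and the remaining query cells would then be passed to a classifier trained on the reference labels, in line with the two-stage procedure the abstract outlines.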
-