Protein Classification Using Machine Learning and Statistical Techniques

Chhote L. P. Gupta; Anand Bihari; Sudhakar Tripathi

doi:10.2174/2666255813666190925163758

Abstract

Background: Predicting the enzyme class of an unknown protein is one of the challenging tasks in bioinformatics. The number of known proteins grows daily, which makes clinical verification and classification difficult; computational prediction of enzyme class therefore offers a new opportunity for bioinformatics researchers. Machine learning classification techniques help in protein classification and prediction, but it is important to know which technique is best suited to the task. This study used human protein data extracted from the UniProtKB databank. A total of 4368 protein records with 45 identified features were used for experimental analysis.
Objective: The primary objective of this article is to find an appropriate classification technique for classifying reviewed as well as unreviewed human enzyme protein data, and to determine the significance of different features in protein classification and prediction.
Methods: Ten classification techniques (CRT, QUEST, CHAID, C5.0, ANN, SVM, Bayesian, Random Forest, XGBoost, and CatBoost) were used to classify the data and discover the importance of features. To validate the results of the different techniques, accuracy, precision, recall, F-measure, sensitivity, specificity, MCC, ROC, and AUROC were used. All experiments were carried out with SPSS Clementine and Python.
Results: The classification techniques give different results, and the data were found to be imbalanced for classes C4, C5, and C6. As a result, all of the classification techniques give acceptable accuracy, above 60%, for these classes, but their precision is very low or negligible. The experimental results highlight that Random Forest gives the highest accuracy and AUROC of all the techniques, 96.84% and 0.945 respectively, and also has high precision and recall.
Conclusion: The experiments conducted and analyzed in this article indicate that the Random Forest classification technique can be used for the classification and prediction of human enzyme proteins.
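
The Results note that, for the imbalanced classes, accuracy stays acceptable while precision becomes very low. The following minimal sketch illustrates how that can happen when the validation metrics listed in the Methods are computed one-vs-rest from confusion-matrix counts. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration; they are not taken from the study's data or code.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute one-vs-rest metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives for a single class treated as "positive".
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # same as sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts (NOT from the paper): 1000 samples, of which only 20
# belong to a rare class. The classifier finds 2 of them and raises 8 false
# alarms, so almost all of its "correct" answers are easy negatives.
m = binary_metrics(tp=2, fp=8, fn=18, tn=972)

for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With counts like these, accuracy is dominated by the many true negatives, while precision, recall, and MCC expose how poorly the rare class is actually recognized. This is why the abstract reports several metrics rather than accuracy alone.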
