Because proteins are heterogeneous across cell types and states, studying protein thermostability at the single-cell level enables a deeper understanding of cellular function and of the mechanisms underlying disease progression.
In this study, we constructed classification and regression models to predict the thermostability difference of homologous protein pairs by integrating implicit features extracted from protein sequences using eight language models, including ProtBERT, AminoBERT, and ProtT5-XL, with explicit sequence features that are manually computed.
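The fusion step described above can be illustrated with a minimal sketch. It is not the study's pipeline: amino acid composition stands in for the explicit feature set, the embedding vectors are placeholders for language-model output, and representing a pair by the element-wise difference of its two feature vectors is an assumption made here for illustration. The names `explicit_features` and `fuse_pair` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def explicit_features(seq):
    """Amino acid composition: fraction of each standard residue.

    Assumption: composition is only a stand-in for the handcrafted
    explicit features; the study's actual feature set is not given here.
    """
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]


def fuse_pair(seq_a, seq_b, emb_a, emb_b):
    """Build one fused feature vector for a protein pair.

    emb_a and emb_b are placeholder embeddings (implicit features), for
    example a pooled vector from a protein language model. The pair is
    represented by explicit and implicit differences, concatenated.
    """
    exp_a = explicit_features(seq_a)
    exp_b = explicit_features(seq_b)
    exp_diff = [x - y for x, y in zip(exp_a, exp_b)]
    imp_diff = [x - y for x, y in zip(emb_a, emb_b)]
    return exp_diff + imp_diff


# Toy pair with made-up two-dimensional embeddings.
fused = fuse_pair("AAAA", "CCCC", [1.0, 2.0], [0.5, 0.5])
print(len(fused))
```

The fused vector could then be passed to any classifier or regressor; which model the study used with which feature set is reported in the results.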
Our results demonstrate that the fusion of explicit and implicit features significantly enhances prediction performance. In classification tasks, the combination of implicit features extracted by AminoBERT and the optimal explicit feature set achieves an accuracy of 87.1%. In regression tasks, the combination of implicit features extracted by Word2vec and the optimal explicit feature set yields a Pearson correlation coefficient (PCC) of 0.864 and an R² of 0.742, exceeding previously reported results.
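For readers unfamiliar with the two regression metrics, the following sketch shows how a PCC and an R² are commonly computed. It assumes R² is the coefficient of determination, 1 - SS_res/SS_tot; the abstract does not state which definition was used. Under that definition R² can fall below the squared PCC, which would be consistent with the reported values (0.864² ≈ 0.746, against an R² of 0.742). The function names and toy data are illustrative only.

```python
import math


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: predictions are perfectly correlated with the targets but
# shifted by a constant, so the PCC stays at 1 while R² drops.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
print(pearson(y_true, y_pred), r_squared(y_true, y_pred))
```

The toy case shows why both metrics are worth reporting: the PCC measures linear association only, whereas R² also penalizes systematic offsets in the predicted values.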
This study reveals the complementary strengths of language models and handcrafted features in predicting protein thermostability. Combining both types of features significantly improves the performance of classification and regression models and helps identify key factors affecting protein stability. However, the study is limited by its reliance on existing datasets, which may reduce its ability to generalize to novel or rare protein families.
The integration of implicit and explicit sequence features enables a more comprehensive representation of protein sequences and facilitates the identification of factors influencing the thermostability of orthologous proteins.