Prediction of Homologous Protein Thermostability at the Single-Cell Level by Incorporating Explicit and Implicit Sequence Features

Abstract

Introduction

Considering the heterogeneity of proteins across diverse cell types and states, studying protein thermostability at the single-cell level enables a more profound comprehension of cellular function and the mechanisms underlying disease progression.

Methods

In this study, we constructed classification and regression models to predict the thermostability difference between homologous protein pairs. The models integrate implicit features, extracted from protein sequences by eight language models (including ProtBERT, AminoBERT, and ProtT5-XL), with manually computed explicit sequence features.
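As a rough illustration of what fusing the two feature types for a protein pair could look like, the sketch below concatenates an embedding vector with a simple explicit descriptor and takes the element-wise difference between the two proteins. The abstract does not specify the fusion or pairing scheme, so concatenation, differencing, the use of amino acid composition as the explicit feature, and the function names `amino_acid_composition`, `fuse`, and `pair_features` are all assumptions made for illustration. The embeddings are passed in as plain lists and stand in for the output of a language model.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Explicit feature (assumed example): fraction of each standard residue."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def fuse(implicit, explicit):
    """Fuse implicit and explicit features by concatenation (an assumption)."""
    return list(implicit) + list(explicit)


def pair_features(seq_a, emb_a, seq_b, emb_b):
    """Represent a homologous pair as the difference of the fused vectors."""
    fused_a = fuse(emb_a, amino_acid_composition(seq_a))
    fused_b = fuse(emb_b, amino_acid_composition(seq_b))
    return [x - y for x, y in zip(fused_a, fused_b)]


# Toy usage: two-dimensional placeholder "embeddings" for two short sequences.
features = pair_features("AAAA", [1.0, 2.0], "CCCC", [0.5, 0.5])
print(len(features))
```

The resulting vector could then be passed to any classifier (which protein of the pair is more stable) or regressor (how large the difference is).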

Results

Our results demonstrate that fusing explicit and implicit features significantly enhances prediction performance. In classification tasks, combining the implicit features extracted by AminoBERT with the optimal explicit feature set achieves an accuracy of 87.1%. In regression tasks, combining the implicit features extracted by Word2vec with the optimal explicit feature set yields a Pearson correlation coefficient (PCC) of 0.864 and an R² of 0.742, outperforming previously reported results.
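For readers less familiar with the two regression metrics, the following minimal sketch computes PCC and R² from paired true and predicted values. It assumes R² means the coefficient of determination (1 minus the ratio of residual to total sum of squares); the abstract does not state which definition was used. The function names `pearson_cc` and `r_squared` are illustrative.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy usage: predictions that are perfectly correlated but badly scaled.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 4.0, 6.0]
print(pearson_cc(y_true, y_pred), r_squared(y_true, y_pred))
```

Under this definition the two metrics capture different things: PCC rewards any linear relationship, while R² also penalizes offset and scale errors. This is why a reported R² need not equal the square of the reported PCC.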

Discussion

This study reveals the complementary strengths of language models and handcrafted features in predicting protein thermostability. Combining both types of features significantly improves the performance of classification and regression models and helps identify key factors affecting protein stability. However, the study is limited by its reliance on existing datasets, which may reduce its ability to generalize to novel or rare protein families.

Conclusion

The integration of implicit and explicit sequence features enables a more comprehensive representation of protein sequences and facilitates the identification of factors influencing the thermostability of orthologous proteins.

DOI: 10.2174/0115748936394443250911224648
2025-10-28
2026-02-04

Supplements

Supplementary material is available on the publisher's website along with the published article.
