Volume 21, Issue 8
  • ISSN: 1573-4099
  • E-ISSN: 1875-6697

Abstract

Introduction

Computational methods are crucial for efficient and cost-effective drug toxicity prediction. Unfortunately, the data used for prediction is often imbalanced, resulting in biased models that favor the majority class. This paper proposes an approach to apply a hybrid class balancing technique and evaluate its performance on computational models for toxicity prediction in Tox21 datasets.

Methods

The process begins by converting chemical compound structures (SMILES strings) from various bioassay datasets into molecular descriptors that can be processed by algorithms. Subsequently, Undersampling and Oversampling techniques are applied to the training data in two different schemes. In the first scheme (Individual), only one balancing technique (Oversampling or Undersampling) is used. In the second scheme (Hybrid), the training data is divided according to a ratio (e.g., 90-10), and a different balancing technique is applied to each portion. We considered eight resampling techniques (four Oversampling and four Undersampling), six molecular descriptors (based on MACCS, ECFP, and Mordred), and five classification models (KNN, MLP, RF, XGB, and SVM) over 10 bioassay datasets to determine the configurations that yield the best performance.
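The Hybrid scheme can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names (`hybrid_balance`, `random_oversample`, `random_undersample`) are hypothetical, the split is assumed to be stratified by class (the abstract does not say how the data is divided), and plain random oversampling stands in for SMOTE to keep the example short.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, rng):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    groups = _group_by_class(X, y)
    if not groups:
        return [], []
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, rng):
    """Randomly duplicate rows until every class has as many rows as the largest class.

    Assumption: this is a simple stand-in for SMOTE, which would instead
    synthesize new minority samples by interpolating between neighbours.
    """
    groups = _group_by_class(X, y)
    if not groups:
        return [], []
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def hybrid_balance(X, y, over_fraction=0.10, seed=0):
    """Oversample one portion of the training data and undersample the rest.

    Assumption: the portions are drawn per class (stratified), so both
    portions keep the original class ratio before balancing.
    """
    rng = random.Random(seed)
    over_idx, under_idx = [], []
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut = round(over_fraction * len(idx))
        over_idx.extend(idx[:cut])
        under_idx.extend(idx[cut:])
    X_o, y_o = random_oversample([X[i] for i in over_idx],
                                 [y[i] for i in over_idx], rng)
    X_u, y_u = random_undersample([X[i] for i in under_idx],
                                  [y[i] for i in under_idx], rng)
    return X_o + X_u, y_o + y_u
```

Calling `hybrid_balance(X_train, y_train, over_fraction=0.10)` corresponds to the 10% Oversampling / 90% Undersampling split mentioned in the Results; only the training data should be passed in, so that the test set keeps its original class distribution.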

Results

We defined three testing scenarios: without balancing techniques (baseline), Individual, and Hybrid. We found that using the ENN technique in the MACCS-MLP combination resulted in a 10.01% improvement in performance. The increase for ECFP6-2048 was 16.47% after incorporating a combination of the SMOTE (10%) and RUS (90%) techniques. Meanwhile, using the same combination of techniques, MORDRED-XGB showed the most significant increase in performance, achieving a 22.62% improvement.

Conclusion

Integrating any of the class balancing schemes resulted in a minimum of 10.01% improvement in prediction performance compared to the best baseline configuration. In this study, Undersampling techniques were more appropriate due to the significant overlap among samples. By eliminating specific samples from the predominant class that are close to the minority class, this overlap is greatly reduced.
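The idea of removing majority-class samples that lie close to the minority class can be sketched as an Edited Nearest Neighbours (ENN) style filter. This is an illustrative sketch only, assuming Euclidean distance, `k = 3` neighbours, and cleaning restricted to the majority class; the function name `enn_undersample` is hypothetical and the authors' actual settings are not given in the abstract.

```python
import math
from collections import Counter


def enn_undersample(X, y, majority_label, k=3):
    """Drop majority-class rows whose k nearest neighbours mostly belong to another class.

    Rows of other classes are always kept. A majority-class row is kept
    when at least half of its k nearest neighbours share its label.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)
            continue
        # Indices of all other rows, ordered by Euclidean distance to row i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(xi, X[j]))
        neighbours = others[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes[yi] * 2 >= len(neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

A majority-class sample surrounded by minority-class samples is discarded, while majority-class samples in their own dense region are untouched, which is how this kind of filter reduces class overlap rather than simply equalizing class counts.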

/content/journals/cad/10.2174/0115734099315538240909101737
2024-09-24
2025-12-25