Volume 20, Issue 8
  • ISSN: 1574-8936
  • E-ISSN: 2212-392X

Abstract

Aim

This study aims to develop and validate a machine learning-based model for the accurate prediction of Androgen Receptor (AR) agonistic toxicity, addressing the challenges posed by data imbalance in existing predictive models.

Background

Anomalous agonistic activity of the androgen receptor is a known major indicator of reproductive toxicity, which can lead to prostate cancer. Machine learning-based models have been developed for the rapid prediction of such agonists. However, the existing models have exhibited biased learning outcomes and low sensitivity due to the imbalance in the available training data. In the early screening process of drug discovery, low sensitivity caused by data imbalance can hinder the detection of potentially toxic compounds.

Objective

The objective of this study is to develop a machine learning prediction model that classifies drug candidates as androgen receptor agonists or non-agonists with more balanced performance than existing models.

Methods

PredART is a bootstrap-aggregated k-nearest neighbor model for the balanced prediction of androgen receptor agonistic toxicity. It was built from 381 active and 8,089 inactive compounds, each represented by its structural features.
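The idea of bootstrap aggregating k-nearest neighbor classifiers on imbalanced data can be sketched as follows. This is a minimal illustration, not the PredART implementation: the abstract does not say how each bootstrap sample is drawn, so the sketch assumes that every bag pairs a bootstrap of the actives with an equally sized bootstrap of the inactives. The fingerprint representation (sets of "on" bits), the Tanimoto similarity, the toy data, and all function names are likewise assumptions made for the example.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def knn_predict(train, query, k):
    """Majority label (1 = active, 0 = inactive) among the k most similar items."""
    nearest = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def balanced_bags(actives, inactives, n_bags, rng):
    """Assumed scheme: each bag holds as many resampled inactives as actives."""
    bags = []
    for _ in range(n_bags):
        pos = [rng.choice(actives) for _ in actives]
        neg = [rng.choice(inactives) for _ in actives]
        bags.append([(fp, 1) for fp in pos] + [(fp, 0) for fp in neg])
    return bags

def bagged_votes(bags, query, k=3):
    """One 0/1 vote per bag; their mean acts as an ensemble score."""
    return [knn_predict(bag, query, k) for bag in bags]

# Toy, hypothetical fingerprints: actives share bits {1, 2, 3}, inactives {10, 11, 12}.
actives = [frozenset({1, 2, 3, i}) for i in range(20, 25)]
inactives = [frozenset({10, 11, 12, i}) for i in range(30, 70)]
bags = balanced_bags(actives, inactives, n_bags=25, rng=random.Random(0))

votes_active_like = bagged_votes(bags, frozenset({1, 2, 3, 99}))
votes_inactive_like = bagged_votes(bags, frozenset({10, 11, 12, 98}))
print(sum(votes_active_like) / len(votes_active_like))
print(sum(votes_inactive_like) / len(votes_inactive_like))
```

Because each bag sees the two classes in equal numbers, no single classifier is dominated by the inactive majority, while the ensemble as a whole still draws on many different inactive examples.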

Results

In this work, we propose an advanced model that combines the bootstrap aggregating algorithm with machine learning binary classifiers to identify androgen receptor-based reproductive toxicity while avoiding biased predictions. On external test data, the optimal model, which uses k-nearest neighbor classifiers, achieved an accuracy of 0.831, a Positive Predictive Value (PPV) of 0.882, a sensitivity of 0.625, a specificity of 0.951, and a Matthews Correlation Coefficient (MCC) of 0.633. These results demonstrate a significant improvement in sensitivity over the previous study and indicate balanced learning. Furthermore, the model's performance could be enhanced further by calculating the standard deviation among the classifiers' outputs and using this prediction uncertainty as a screening metric to select reliable predictions.
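The reported metrics all derive from the four cells of a confusion matrix, and a short sketch makes their relationship concrete. The function name and the example counts below are hypothetical: the abstract does not give the composition of the external test set, so the counts were chosen only because they reproduce the reported values after rounding.

```python
from math import sqrt

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "ppv": safe_div(tp, tp + fp),          # positive predictive value
        "sensitivity": safe_div(tp, tp + fn),  # recall on actives
        "specificity": safe_div(tn, tn + fp),  # recall on inactives
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts, NOT taken from the paper.
example = binary_metrics(tp=15, fp=2, tn=39, fn=9)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the function prints accuracy 0.831, ppv 0.882, sensitivity 0.625, specificity 0.951, and mcc 0.633. Sensitivity depends only on the actives, which is why it is the metric that suffers most when a model trained on imbalanced data leans toward the inactive class.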

Conclusion

Built on the bootstrap aggregating algorithm, our prediction model effectively addressed data imbalance, and we benchmarked it with a range of machine learning and deep learning classifiers. Additionally, by quantifying uncertainty, the model provides an intuitive assessment of prediction reliability during large-scale screening.
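Uncertainty-based screening of this kind can be sketched in a few lines: for each compound, take the standard deviation of the individual classifiers' outputs and keep only predictions whose spread falls below a cutoff. The cutoff value, the compound names, the per-classifier outputs, and the function name are all hypothetical; the abstract does not state how the threshold was chosen.

```python
from statistics import fmean, pstdev

def screen_predictions(outputs_by_compound, max_std):
    """Keep compounds whose classifier outputs agree closely enough.

    outputs_by_compound maps a compound id to the list of per-classifier
    outputs (0/1 votes or probabilities). Returns id -> (mean, std) for
    the compounds whose standard deviation is at most max_std.
    """
    kept = {}
    for compound, outputs in outputs_by_compound.items():
        spread = pstdev(outputs)  # prediction uncertainty
        if spread <= max_std:
            kept[compound] = (fmean(outputs), spread)
    return kept

# Hypothetical per-classifier votes for three compounds.
ensemble_outputs = {
    "cmpd_A": [1, 1, 1, 1],  # unanimous: std 0.0
    "cmpd_B": [1, 0, 1, 0],  # split vote: std 0.5
    "cmpd_C": [0, 0, 0, 0],  # unanimous: std 0.0
}
reliable = screen_predictions(ensemble_outputs, max_std=0.3)
print(sorted(reliable))
```

Here only the split-vote compound is filtered out. Tightening the cutoff trades coverage for reliability, so in a large-scale screen it determines how many predictions are set aside for closer inspection.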

This is an open access article published under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/legalcode
/content/journals/cbio/10.2174/0115748936355551241220190451
2025-01-02
2025-12-18
