RF-SCGFS: A Feature Selection Method Based on Secuer and Random Forest Model for Single-cell RNA-Seq Data

Abstract

Introduction

Single-cell RNA sequencing (scRNA-seq) is crucial for unraveling gene expression complexity. However, existing feature selection methods often overlook the biological significance of co-expressed gene regions, leading to the omission of potential biomarkers.

Methods

We propose RF-SCGFS, a random-forest-based method that jointly selects co-expressed gene regions and individual genes. The method first identifies co-expressed gene regions within homologous cell populations. It then trains a random forest model on cell type labels generated by the Scalable and Efficient speCtral clUstERing algorithm (Secuer), and uses the resulting feature importance scores to select key co-expressed gene regions and genes.
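The gene-level part of this idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn is available, uses KMeans as a stand-in for Secuer to produce the cluster labels, and skips the co-expressed gene region step, which the abstract does not specify. The function name `rank_genes_by_forest` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier


def rank_genes_by_forest(X, n_clusters, top_k, seed=0):
    """Return indices of the top_k genes and the pseudo-labels used to rank them."""
    # Assumption: KMeans stands in for Secuer to produce cluster pseudo-labels.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    # Fit a random forest to predict the pseudo-labels from expression values.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, labels)
    # Sort genes by impurity-based importance, highest first.
    order = np.argsort(forest.feature_importances_)[::-1]
    return order[:top_k], labels


# Toy matrix: 60 cells x 30 genes; only genes 0-2 separate the two groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
X[:30, :3] += 6.0
top_genes, labels = rank_genes_by_forest(X, n_clusters=2, top_k=3)
print(sorted(top_genes.tolist()))
```

Because the labels come from clustering rather than annotation, the ranking is only as good as the clustering that guides it.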

Results

Experiments on 13 public scRNA-seq datasets demonstrate that RF-SCGFS outperforms traditional methods, with average improvements of 0.15 in normalized mutual information (NMI) and 0.19 in adjusted Rand index (ARI). When combined with mainstream unsupervised algorithms, RF-SCGFS achieves excellent performance (NMI > 0.91 on the Yan and Biase datasets). On the PBMC-ctrl dataset, the method successfully identifies genes associated with immune system processes (GO:0006955; enrichment significance 2.02E-37).
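For readers unfamiliar with the two metrics, the sketch below shows how NMI and ARI can be computed with scikit-learn. The label vectors are made-up toy values, not data from the study. Both scores compare partitions rather than label names, so a clustering that recovers the true groups under different names still scores perfectly.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth cell types for six cells.
true_labels = [0, 0, 0, 1, 1, 1]

# Same partition as the truth, but with the cluster names swapped.
renamed = [1, 1, 1, 0, 0, 0]
nmi_renamed = normalized_mutual_info_score(true_labels, renamed)
ari_renamed = adjusted_rand_score(true_labels, renamed)

# A partition that mixes the two cell types.
mixed = [0, 1, 0, 1, 0, 1]
nmi_mixed = normalized_mutual_info_score(true_labels, mixed)
ari_mixed = adjusted_rand_score(true_labels, mixed)

print(nmi_renamed, ari_renamed)
print(nmi_mixed, ari_mixed)
```

ARI is corrected for chance and can fall below zero for partitions that agree less than random assignment would, whereas NMI stays between 0 and 1.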

Discussion

RF-SCGFS addresses key challenges in single-cell analysis by reducing computational burden through efficient feature selection while maintaining biological relevance through unsupervised clustering-guided selection.

Conclusion

RF-SCGFS provides an interpretable framework for feature selection in single-cell data, successfully identifying relevant disease genes and revealing the potential value of co-expressed gene regions in analyzing cellular heterogeneity.

DOI: 10.2174/0115748936401797251030055501
2026-01-07
2026-02-21
Supplements

Supplementary material is available on the publisher’s website along with the published article.
