CLsquared: A Cleaning and Clustering Tool for Viral Genomic Data

Abstract

Introduction

During the COVID-19 pandemic, millions of viral genomic sequences were produced and deposited in public databanks. Data at this unprecedented volume inevitably contain sequencing and annotation errors, which must be managed to ensure reliable scientific outcomes. Nevertheless, no bioinformatics tool has been developed specifically for the comprehensive filtering of viral genomic datasets.

Methods

To address this need, we developed CLsquared, a tool suite implemented in Python 3 and Bash for the selection of high-quality viral sequences. CLsquared flags sequences that exhibit unverified mutation patterns or metadata. Its filtering parameters are fully customizable, and it can be applied to both public and private datasets. The tool supports multiprocessing, which significantly reduces runtime on multi-core systems.
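As an illustration of the kind of customizable, parallel filtering described above, the following is a minimal sketch and not CLsquared's actual code. The function names, the record layout (identifier, sequence), and the 5% ambiguity threshold are all assumptions made for the example; only Python's standard `multiprocessing` module is used.

```python
from functools import partial
from multiprocessing import Pool


def ambiguous_fraction(sequence: str) -> float:
    """Return the fraction of characters that are not A, C, G, or T."""
    if not sequence:
        return 1.0  # treat an empty sequence as fully ambiguous
    upper = sequence.upper()
    unambiguous = sum(upper.count(base) for base in "ACGT")
    return (len(upper) - unambiguous) / len(upper)


def passes_filter(record, max_ambiguous):
    """Return (sequence id, True if the sequence is within the threshold)."""
    seq_id, sequence = record
    return seq_id, ambiguous_fraction(sequence) <= max_ambiguous


def filter_records(records, max_ambiguous=0.05, processes=1):
    """Return the ids of records that pass the (hypothetical) ambiguity filter.

    With processes > 1 the records are distributed over a process pool;
    otherwise they are checked serially.
    """
    check = partial(passes_filter, max_ambiguous=max_ambiguous)
    if processes > 1:
        with Pool(processes=processes) as pool:
            results = pool.map(check, records)
    else:
        results = [check(record) for record in records]
    return [seq_id for seq_id, ok in results if ok]


if __name__ == "__main__":
    # Toy records, invented for the example.
    demo = [("seq1", "ACGTACGTAC"), ("seq2", "ACGTNNNNNN")]
    print(filter_records(demo, max_ambiguous=0.05, processes=2))
```

Exposing the threshold and the number of processes as parameters mirrors the customizable, multi-core design the abstract describes, without implying that the tool uses these particular criteria.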

Results

CLsquared detects ambiguous, biologically implausible, and underrepresented mutation sets. Its modular architecture ensures efficient processing of large-scale datasets, optimizing both speed and memory usage.
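One simple way to think about "underrepresented mutation sets" is to count how many sequences share exactly the same set of mutations and flag those below a minimum count. The sketch below illustrates that idea only; the function name, the input layout, and the fixed `min_count` threshold are hypothetical and may differ from how CLsquared defines or selects its cutoff.

```python
from collections import Counter


def flag_underrepresented(mutations_by_sequence, min_count=2):
    """Return sorted ids of sequences whose exact mutation set is rare.

    mutations_by_sequence maps a sequence id to an iterable of mutation
    labels; a set is considered rare if fewer than min_count sequences
    carry exactly that set (an assumed, illustrative criterion).
    """
    # frozenset makes each mutation set hashable and order-independent.
    keys = {
        seq_id: frozenset(mutations)
        for seq_id, mutations in mutations_by_sequence.items()
    }
    counts = Counter(keys.values())
    return sorted(seq_id for seq_id, key in keys.items() if counts[key] < min_count)


if __name__ == "__main__":
    # Toy data, invented for the example.
    demo = {
        "seq1": ["C241T", "A23403G"],
        "seq2": ["A23403G", "C241T"],
        "seq3": ["G28881A"],
    }
    print(flag_underrepresented(demo, min_count=2))
```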

Discussion

By systematically addressing sequencing and annotation errors, CLsquared fills a critical gap in current viral bioinformatics workflows. Its flexible and scalable design supports diverse research applications, improving data quality and reproducibility.

Conclusion

CLsquared is a robust resource for researchers working with large volumes of viral sequence data. It is freely available on GitHub (https://github.com/giorgia-m-95/CLsquared-multiprocessing and https://github.com/giorgia-m-95/CLsquared-base) and Docker Hub (giorgiam95/clsquared_parallel and giorgiam95/clsquared_base).

This is an open access article published under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/legalcode
/content/journals/cbio/10.2174/0115748936416627250905170048
2025-09-18
2025-12-08


Supplements

Supplementary material is available on the publisher’s website along with the published article.


Article Type: Research Article
Keywords: clustering; big data; multiprocessing; public databases; viral sequences; filtering; cleaning