CLsquared: A Cleaning and Clustering Tool for Viral Genomic Data

Giorgia Mazzotti; Martina Bado; Enrico Lavezzo; Stefano Toppo

doi:10.2174/0115748936416627250905170048

Affiliation (all authors): Department of Molecular Medicine, University of Padua, Padova, Italy
Source: Current Bioinformatics
- Received: 03 Jun 2025
- Accepted: 03 Jul 2025
- Available online: 18 Sep 2025

Abstract

Introduction

During the COVID-19 pandemic, millions of viral genomic sequences were produced and deposited in public databanks. This unprecedented volume of data introduced inaccuracies and errors that require effective management to ensure reliable scientific outcomes. Despite this, no bioinformatics tool has been developed specifically for comprehensive filtering of viral genomic datasets.

Methods

To address this need, we developed CLsquared, a tool suite implemented in Python 3 and Bash for the selection of high-quality viral sequences. CLsquared flags sequences exhibiting unverified mutation patterns or metadata. It offers fully customizable filtering parameters and is adaptable to both public and private datasets. The tool supports multiprocessing, significantly reducing runtime on multi-core systems.
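
As a rough illustration of how a customizable, multiprocessing-enabled filter of this kind can be structured, the following minimal sketch keeps sequences whose fraction of ambiguous bases stays below a user-set threshold. It is not CLsquared's code: the function names, the `max_ambiguous` parameter, and the choice of ambiguity as the filtering criterion are all assumptions made for the example.

```python
from functools import partial
from multiprocessing import Pool

# Assumption for this sketch: only A, C, G and T count as unambiguous;
# N, IUPAC ambiguity codes and gap characters are all counted as ambiguous.
UNAMBIGUOUS = frozenset("ACGT")


def ambiguous_fraction(sequence: str) -> float:
    """Return the fraction of characters that are not A, C, G or T."""
    if not sequence:
        return 1.0  # treat an empty sequence as fully ambiguous
    ambiguous = sum(1 for base in sequence.upper() if base not in UNAMBIGUOUS)
    return ambiguous / len(sequence)


def check_record(record, max_ambiguous):
    """Return (sequence_id, keep) for one (sequence_id, sequence) pair."""
    seq_id, sequence = record
    return seq_id, ambiguous_fraction(sequence) <= max_ambiguous


def filter_records(records, max_ambiguous=0.05, processes=1):
    """Return the IDs of records that pass the ambiguity threshold.

    With processes > 1 the per-record checks are spread over a process pool;
    with processes == 1 they run serially in the calling process.
    """
    worker = partial(check_record, max_ambiguous=max_ambiguous)
    if processes > 1:
        with Pool(processes=processes) as pool:
            results = pool.map(worker, records, chunksize=1000)
    else:
        results = list(map(worker, records))
    return [seq_id for seq_id, keep in results if keep]
```

Calling `filter_records(records, max_ambiguous=0.1, processes=4)` on a list of `(id, sequence)` pairs would distribute the checks over four worker processes. Because each record is checked independently, the work parallelizes without shared state, which is the property that makes per-sequence filtering a natural fit for multi-core systems.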

Results

CLsquared detects ambiguous, biologically implausible, and underrepresented mutation sets. Its modular architecture ensures efficient processing of large-scale datasets, optimizing both speed and memory usage.
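
To make the idea of an underrepresented mutation set concrete, the sketch below counts how many sequences share exactly the same set of mutations and flags those whose set occurs fewer than `min_count` times. The fixed threshold, the function name, and the input layout are assumptions chosen for illustration; the abstract does not state how CLsquared defines or detects underrepresentation.

```python
from collections import Counter


def flag_underrepresented(mutations_by_id, min_count=2):
    """Return sorted IDs of sequences whose exact mutation set is rare.

    mutations_by_id maps a sequence ID to an iterable of mutation labels.
    A mutation set is treated as underrepresented (an assumption of this
    sketch) when fewer than min_count sequences carry exactly that set.
    """
    # frozenset makes each mutation set hashable and order-independent.
    as_sets = {seq_id: frozenset(muts) for seq_id, muts in mutations_by_id.items()}
    counts = Counter(as_sets.values())
    return sorted(seq_id for seq_id, muts in as_sets.items() if counts[muts] < min_count)


# Illustrative input only, not data from the article.
example = {
    "seq1": ["C241T", "A23403G"],
    "seq2": ["A23403G", "C241T"],  # same set as seq1, different order
    "seq3": ["G28881A"],           # set seen only once
}
flagged = flag_underrepresented(example, min_count=2)
```

Here `seq3` is the only sequence flagged, because its mutation set is observed once while the set shared by `seq1` and `seq2` is observed twice.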

Discussion

By systematically addressing sequencing and annotation errors, CLsquared fills a critical gap in current viral bioinformatics workflows. Its flexible and scalable design supports diverse research applications, improving data quality and reproducibility.

Conclusion

CLsquared is a robust resource for researchers working with large volumes of viral sequence data. It is freely available on GitHub (https://github.com/giorgia-m-95/CLsquared-multiprocessing and https://github.com/giorgia-m-95/CLsquared-base) and Docker Hub (giorgiam95/clsquared_parallel and giorgiam95/clsquared_base).

This is an open access article published under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/legalcode

Bentham Science Publishers ; https://doi.org/10.2174/0115748936416627250905170048

Supplements

Supplementary material is available on the publisher’s website along with the published article.

Article Type: Research Article

Keywords: clustering; big data; multiprocessing; public databases; viral sequences; filtering; cleaning
