-
oa CLsquared: A Cleaning and Clustering Tool for Viral Genomic Data
-
-
- 03 Jun 2025
- 03 Jul 2025
- 18 Sep 2025
Abstract
During the COVID-19 pandemic, millions of viral genomic sequences were produced and deposited in public databanks. This unprecedented volume of data introduced inaccuracies and errors requiring effective management to ensure reliable scientific outcomes. Despite this, no bioinformatics tools have been developed specifically to comprehensively filter viral genomic datasets.
To address this need, we developed CLsquared, a tool suite implemented in Python3 and Bash for the selection of high-quality viral sequences. CLsquared flags sequences exhibiting unverified mutation patterns or metadata. It offers fully customizable filtering parameters and is adaptable to both public and private datasets. The tool supports multiprocessing, significantly reducing runtime on multi-core systems.
CLsquared detects ambiguous, biologically implausible, and underrepresented mutation sets. Its modular architecture ensures efficient processing of large-scale datasets, optimizing both speed and memory usage.
By systematically addressing sequencing and annotation errors, CLsquared fills a critical gap in current viral bioinformatics workflows. Its flexible and scalable design supports diverse research applications, improving data quality and reproducibility.
CLsquared is a robust resource for researchers working with large volumes of viral sequence data. It is freely available on GitHub (https://github.com/giorgia-m-95/CLsquared-multiprocessing and https://github.com/giorgia-m-95/CLsquared-base) and Docker Hub (giorgiam95/clsquared_parallel and giorgiam95/clsquared_base).