BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale
Identifiers
ISSN: 2047-217X
E-ISSN: 2047-217X
Publisher
Oxford University Press
Abstract
Background
High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. The literature offers several tools to process and manipulate these types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, likely on the order of terabytes in the coming years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node.
Results
Our approach, BigSeqKit, takes advantage of a high-performance computing–Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line.
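To illustrate the general idea of data-parallel processing of sequence files, the following is a minimal sketch. It is not BigSeqKit's implementation or API. It assumes a small in-memory FASTA string, and the helper names (`parse_fasta`, `partition`, `partial_stats`, `fasta_stats`) are hypothetical. Records are split into partitions, each partition is summarized independently, and the partial results are merged. A thread pool stands in for the distributed workers that a Big Data framework would provide across cores or nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records, n_parts):
    """Split a list of records into n_parts round-robin partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_stats(part):
    """Map step: summarize one partition as (count, total length, G+C count)."""
    count = len(part)
    total_len = sum(len(seq) for _, seq in part)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for _, seq in part)
    return count, total_len, gc


def fasta_stats(text, n_parts=4):
    """Compute simple statistics by combining per-partition summaries."""
    records = list(parse_fasta(text))
    parts = partition(records, n_parts)
    # Hypothetical stand-in for distributed workers: one thread per partition.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, parts))
    # Reduce step: the summaries are plain sums, so merging is associative.
    return {
        "num_seqs": sum(p[0] for p in partials),
        "sum_len": sum(p[1] for p in partials),
        "gc_count": sum(p[2] for p in partials),
    }


# Made-up example data, for illustration only.
EXAMPLE = ">seq1\nACGT\nGG\n>seq2\nTTAA\n>seq3\nGCGC\n"

if __name__ == "__main__":
    print(fasta_stats(EXAMPLE))
```

Because each partition is summarized independently and the merge is a plain sum, the result does not depend on how many partitions are used. This property is what allows this kind of computation to scale out across workers.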
Conclusions
BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
Bibliographic citation
GigaScience, Volume 12, 2023, giad062
Publisher version
https://doi.org/10.1093/gigascience/giad062
Sponsors
This work was supported by MICINN, Spain [PLEC2021-007662]; Xunta de Galicia, Spain [ED431G/08, ED431G-2019/04, ED431C 2018/19, and ED431F 2020/08]; European Commission RIA—H2020 [HPC-EUROPA3—INFRAIA-2016-1-730897]; and European Regional Development Fund (ERDF).
Rights
© The Author(s) 2023. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.








