Fast FASTA/FASTQ Random Subsampling

By: onestop_data

Bioinformatics

This short tutorial shows how to subsample a paired FASTQ, single FASTQ, paired FASTA, or single FASTA file to a specific number of reads.

This can be done quickly with seqtk, which can be installed through Bioconda.

Randomly Subsample Paired FASTQ or FASTA

Using seqtk, we can quickly downsample a paired set of FASTQ files. It is important to use the same seed (-s 123) for both files of a pair: with the same seed, seqtk makes the same random selection in each file, so the subsampled R1 and R2 files stay in sync.

In the example below, we subsample 100k reads from each file of a FASTQ pair.

# FASTQ R1
$ seqtk sample -s 123 read1.fq 100000 > sub_read1.fq

# FASTQ R2
$ seqtk sample -s 123 read2.fq 100000 > sub_read2.fq
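
Because the pairing depends on both runs using the same seed, it can be worth checking that the two subsampled files still list the same reads in the same order. The sketch below is one possible check, not part of seqtk. The helper names fastq_ids and pairs_in_sync are hypothetical, and the code assumes unwrapped four-line FASTQ records whose mate IDs differ only by an optional trailing /1 or /2.

```shell
# Hypothetical helper: print the read IDs of a FASTQ file, one per line.
# Assumes four-line records; strips the leading '@' and a trailing /1 or /2.
fastq_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Hypothetical helper: succeeds (exit status 0) only when both files
# contain the same read IDs in the same order.
pairs_in_sync() {
    [ "$(fastq_ids "$1")" = "$(fastq_ids "$2")" ]
}

# Example usage with the file names from above (skipped if they do not exist)
if [ -f sub_read1.fq ] && [ -f sub_read2.fq ]; then
    if pairs_in_sync sub_read1.fq sub_read2.fq; then
        echo "pairs in sync"
    else
        echo "pairs out of sync"
    fi
fi
```

If the check reports a mismatch, the most likely cause is that the two seqtk runs used different seeds or different read counts.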

The same seqtk commands can be applied to paired FASTA files. seqtk also reads gzip-compressed input (for example, read1.fq.gz) directly; note that the output is written uncompressed.
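
For gzipped input, a small wrapper can pipe the output of seqtk through gzip so the subsample is stored compressed as well. This is only a sketch: subsample_gz is a hypothetical helper name, not a seqtk feature, and the file names in the usage comments are placeholders.

```shell
# Hypothetical wrapper (not part of seqtk): subsample a gzipped FASTQ/FASTA
# file and recompress the result.
# Usage: subsample_gz SEED INPUT.gz N_READS OUTPUT.gz
subsample_gz() {
    seqtk sample -s "$1" "$2" "$3" | gzip > "$4"
}

# Example calls, reusing the same seed for both mates of a pair:
#   subsample_gz 123 read1.fq.gz 100000 sub_read1.fq.gz
#   subsample_gz 123 read2.fq.gz 100000 sub_read2.fq.gz
```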

Randomly Subsample FASTQ or FASTA

Similar to the previous section, here we subsample 100k reads from a single FASTQ or FASTA file.

# single FASTA
$ seqtk sample sample.fasta 100000 > sub_sample.fasta
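
To confirm that the output holds the expected number of reads, the records can be counted afterwards. The helpers below are a minimal sketch with hypothetical names. The FASTA count relies on every record starting with a '>' header line, and the FASTQ count assumes unwrapped four-line records.

```shell
# Hypothetical helper: count FASTA records by counting header lines ('>')
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: count FASTQ records, assuming four lines per record
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Example usage with the output file from above (skipped if it does not exist)
if [ -f sub_sample.fasta ]; then
    count_fasta_records sub_sample.fasta
fi
```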

More Resources