Get a Read from a FASTA File in One Line

onestop_databy:

Bioinformatics

Pulling a single read out of a FASTA file can be challenging when the file is large, especially if you try to open it in a text editor.

Have you ever generated an alignment file and been frustrated because the query sequence was not in the output? This tutorial shows how to retrieve that sequence from the FASTA file with a single awk line.

Single Line to Extract a Sequence from FASTA

First and foremost, awk can extract a sequence from a FASTA file, provided you know the sequence ID of the target. The ID can usually be taken from the output of BLAST, DIAMOND, BWA, etc.

$ awk -v seq="TARGETED_ID" -v RS='>' '$1 == seq {print RS $0}' YOUR_FASTA
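To see why this works, it can help to run the command on a tiny file. The sketch below assumes a made-up FASTA with two records (the IDs seq1 and seq2 and their sequences are hypothetical). Setting RS='>' makes awk treat each FASTA record as one awk record, so $1 is the sequence ID and $0 is the header plus the sequence lines.

```shell
# Build a tiny example FASTA (hypothetical IDs and sequences, for illustration only).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first record
ACGTACGT
GGCC
>seq2 second record
TTTTAAAA
EOF

# RS='>' splits the file at every '>', so each record starts with the ID.
# $1 == seq keeps only the record whose first field is the wanted ID;
# print RS $0 puts the '>' back in front of the header.
result=$(awk -v seq="seq2" -v RS='>' '$1 == seq {print RS $0}' "$fasta")

echo "$result"
rm -f "$fasta"
```

Because each record already ends with a newline, print adds one extra blank line after it when the command is run directly. It is harmless, and the command substitution used here strips it.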

I hope this is useful for you; it has been for me over the years.

Extracting More than One Sequence

When more than one sequence is needed, I recommend using seqtk with the following command line, which requires a file listing the IDs of the sequences that should be pulled out.

  $ seqtk subseq {FASTA/FASTQ/FASTQ_GZ} {LIST_IDS} > {OUTPUT_FILE}
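The list file for seqtk subseq is plain text with one sequence ID per line, without the leading '>'. The sketch below builds such a list for a made-up FASTA (all IDs and sequences are hypothetical) and, for machines where seqtk is not installed, shows an awk fallback that follows the same idea as the one-liner above. The awk part is only an illustration and not how seqtk works internally.

```shell
# Hypothetical example data: a small FASTA file and a list of wanted IDs.
fasta=$(mktemp)
ids=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first record
ACGTACGT
GGCC
>seq2 second record
TTTTAAAA
>seq3 third record
CCCC
EOF

# One sequence ID per line, without the leading '>'.
printf 'seq1\nseq3\n' > "$ids"

# With seqtk installed, the command from this post would be:
#   seqtk subseq "$fasta" "$ids" > subset.fasta

# awk fallback: while reading the first file (the ID list), remember each ID;
# while reading the FASTA, switch printing on or off at every header line.
subset=$(awk 'NR == FNR { want[$1] = 1; next }
              /^>/      { keep = (substr($1, 2) in want) }
              keep' "$ids" "$fasta")

echo "$subset"
rm -f "$fasta" "$ids"
```

Like the one-liner, the fallback matches on the first word of the header, so descriptions after the ID are ignored when comparing.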

More Resources

Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.
