Pulling a single read out of a FASTA file can be challenging when the file is large and you try to open it in a text editor.
Have you generated an alignment file and been frustrated because the query sequence was not in the output? This tutorial shows how to retrieve a sequence from a FASTA file using a single awk line.
Single Line to Extract a Sequence from FASTA
First and foremost, awk can be used to pull a sequence out of a FASTA file, assuming that the ID of the target sequence is known. The ID can easily be obtained from the output of BLAST, DIAMOND, BWA, etc.
$ awk -v seq="TARGETED_ID" -v RS='>' '$1 == seq {print RS $0}' YOUR_FASTA
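To see why this works: setting RS to '>' makes awk treat every FASTA entry as one record, so $1 is the sequence ID and $0 is the header plus all sequence lines. The following is a minimal sketch of the same one-liner run against a tiny FASTA file; the IDs, the sequences, and the extract_seq helper name are made up for illustration.

```shell
# Build a tiny example FASTA (hypothetical IDs and sequences, for illustration only)
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first example record
ACGTACGT
ACGT
>seq2 second example record
GGGGCCCC
>seq10 third example record
TTTTAAAA
EOF

# Hypothetical helper wrapping the one-liner: $1 = sequence ID, $2 = FASTA file.
# RS='>' splits the file into one record per entry; the first whitespace-delimited
# field of each record is the ID, so the comparison is an exact match on the ID.
extract_seq() {
    awk -v seq="$1" -v RS='>' '$1 == seq {print RS $0}' "$2"
}

# Prints the seq1 header and both of its sequence lines; seq10 is not matched
extract_seq seq1 "$fasta"
```

Because each record already ends with a newline, print adds one blank line after the match; replacing print with printf "%s", avoids that if it matters downstream.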
I hope this is useful for you; it has been for me over the years.
Extracting More than One Sequence
If more than one sequence is needed, I would recommend using seqtk with the following command line, which requires a file listing the IDs of the sequences that should be pulled out, one ID per line.
$ seqtk subseq {FASTA/FASTQ/FASTQ_GZ} {LIST_IDS} > {OUTPUT_FILE}
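For illustration, here is a sketch of what the ID list looks like, together with an awk-only fallback for machines where seqtk is not installed. The fallback is not what seqtk does internally; it is a rough equivalent under the assumption that the list holds one ID per line and that the IDs match the first word of each header. The file contents and the extract_many helper name are hypothetical.

```shell
# Example inputs (hypothetical IDs and sequences, for illustration only)
fasta=$(mktemp)
ids=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first example record
ACGTACGT
ACGT
>seq2 second example record
GGGGCCCC
>seq10 third example record
TTTTAAAA
EOF
printf 'seq1\nseq10\n' > "$ids"   # one sequence ID per line

# With seqtk installed, the call from the text would be:
#   seqtk subseq "$fasta" "$ids" > subset.fasta

# awk-only fallback (an assumption, not seqtk's implementation):
# $1 = ID list, $2 = FASTA file.
extract_many() {
    awk 'NR == FNR { want[$1]; next }                  # first file: remember each ID
         /^>/      { keep = (substr($1, 2) in want) }  # header: is its ID wanted?
         keep' "$1" "$2"                               # print lines of wanted records
}

extract_many "$ids" "$fasta"
```

The fallback reads the FASTA line by line, so it copes with multi-line sequences, but it only handles plain FASTA; for FASTQ or gzipped input, seqtk remains the better choice.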
More Resources
Here are three of my favorite Python bioinformatics books in case you want to learn more about the topic.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model