This tutorial teaches a fast approach to how to read sequences from large FASTA files in Python using Pysam.
1. Reading a large FASTA File Very Quickly
Most of the time that I work with a FASTA file in Python, I load all the sequences into a dictionary where the sequences id is the key and the sequence is the value, and I use the dictionary to run my analysis.
However, in many cases the loading all the sequences into a dictionary is not the most efficient approach to work with large FASTA files; for example, when fetching sequences from the NR/NT database.
Pysam can be used to create a FASTA object and to fetch the whole sequence given the sequence ID or a slice of it.
Please assume the following FASTA file with 3 sequences:
# file name "test.fasta" >sequence_1 ATGCTGCTGCTCGTCGTC >sequence_2 TTTATATATAAGTGTAGTAGA >sequence_3 GGGCGCGCGAGATAATATATTAT
Using the following pysam calls, it creates a FASTA object for your input file
from pysam import FastaFile fasta = "test.fasta" # read FASTA file sequences_object = FastaFile(fasta)
2. Fetching Regions of a Sequence from a FASTA file
Once the FASTA was indexed, it guarantees the agile FASTA reading and fetching. As demonstrated below, we can read the whole sequence or part of it.
# fetch whole sequence tmp_sequence = sequences_object.fetch("sequence_1") print(tmp_sequence) >>> ATGCTGCTGCTCGTCGTC # fetch part of the sequence "sequence 1" from position 0 to 3 tmp_sequence = sequences_object.fetch("sequence_1", 0, 3) print(tmp_sequence) >>> ATG