The Fastest Way to Read a FASTA in Python

by:

Bioinformatics

This tutorial teaches a fast approach on how to read sequences from large FASTA files in Python using Pysam.

1. Reading a large FASTA File Very Quickly

Most of the time that I work with a FASTA file in Python, I load all the sequences into a dictionary where the sequences id is the key and the sequence is the value, and I use the dictionary to run my analysis.

However, in many cases the loading all the sequences into a dictionary is not the most efficient approach to work with large FASTA files; for example, when fetching sequences from the NR/NT database.

Pysam can be used to create a FASTA object and to fetch the whole sequence given the sequence ID or a slice of it.

Please assume the following FASTA file with 3 sequences:

# file name "test.fasta"
>sequence_1
ATGCTGCTGCTCGTCGTC
>sequence_2
TTTATATATAAGTGTAGTAGA
>sequence_3
GGGCGCGCGAGATAATATATTAT

Using the follow pysam calls, it creates a FASTA object for your input file

from pysam import FastaFile

fasta = "test.fasta"
# read FASTA file
sequences_object = FastaFile(fasta)

When “FastaFile” is called, pysam calls for you “sammtools faidx” which indexes your FASTA file. In case you already have the input file index (extension .fai), it does not create it again.

Once the FASTA was indexed, it guarrantes the agile FASTA reading and fetcing. As demonstraded below, we can read the whole sequence or part of it.

# fetch whole sequence
tmp_sequence = sequences_object.fetch("sequence_1")
print(tmp_sequence)

>>> ATGCTGCTGCTCGTCGTC

# fetch part of the sequence "sequence 1" from position 0 to 3
tmp_sequence = sequences_object.fetch("sequence_1", 0, 3)
print(tmp_sequence)

>>> ATG

Last but not least, here we used a small FASTA file as a test. Nonetheless, when using a large FASTA file (500 Gbp) is when Pysam fetching method highlights the most.

Related Posts