This tutorial teaches a fast approach to how to read sequences from large FASTA files in Python using Pysam. I know how reading large FASTA files can be painful, so I hope this tutorial is helpful for you.
1. Why is important it important to read a FASTQ file fast?
Reading a FASTA file quickly is important for several reasons:
- Large data sets: In many applications, FASTA files can be quite large, containing hundreds of thousands or even millions of sequences. Reading such a large file can take a significant amount of time, so it’s important to be able to read the file as quickly as possible to minimize the wait time.
- Efficient data processing: In many bioinformatics applications, FASTA files are used as input for various data processing and analysis pipelines. The faster the file can be read, the faster the pipeline can be executed, leading to more efficient use of computational resources and faster results.
- Time-sensitive analysis: In some cases, such as in real-time analysis or emergency situations, time is of the essence, and every second counts. In these cases, it’s critical to be able to read FASTA files as quickly as possible to ensure that the analysis can be performed in a timely manner.
- Memory optimization: In some cases, reading a large FASTA file into memory can consume a significant amount of memory, causing the system to slow down or even crash. Reading the file more quickly can help reduce the amount of memory used and prevent performance issues.
Overall, the ability to read FASTA files quickly is important because it allows for efficient and timely data processing, optimizes memory usage, and helps ensure that data analysis can be performed quickly and effectively.
2. Reading a large FASTA File Very Quickly
When working with a FASTA file in Python, I load all the sequences into a dictionary where the sequences id is the key, the sequence is the value, and I use the dictionary to run my analysis.
However, in many cases, loading all the sequences into a dictionary is not the most efficient approach to work with large FASTA files, for example, when fetching sequences from the NR/NT database.
Now, Pysam can create a FASTA object and fetch the whole sequence given the sequence ID or a slice of it.
Please assume the following FASTA file with three sequences:
# file name "test.fasta"
>sequence_1
ATGCTGCTGCTCGTCGTC
>sequence_2
TTTATATATAAGTGTAGTAGA
>sequence_3
GGGCGCGCGAGATAATATATTAT
Using the subsequent pysam calls, it creates a FASTA object for your input file
from pysam import FastaFile
fasta = "test.fasta"
# read FASTA file
sequences_object = FastaFile(fasta)
When calling “FastaFile,” Pysam calls “samtools faidx” which indexes your FASTA file if not present.
3. Fetching Regions of a Sequence from a FASTA file
Once the FASTA is indexed, it guarantees the agile FASTA reading and fetching. We can read the whole sequence or part of it as demonstrated below.
# fetch whole sequence
tmp_sequence = sequences_object.fetch("sequence_1")
print(tmp_sequence)
>>> ATGCTGCTGCTCGTCGTC
# fetch part of the sequence "sequence 1" from position 0 to 3
tmp_sequence = sequences_object.fetch("sequence_1", 0, 3)
print(tmp_sequence)
>>> ATG
Last but not least, here, we used a small FASTA file as a test. Nonetheless, when using a large FASTA file (500 Gbp) is, when Pysam fetching method highlights the most.
I hope that this tutorial was helpful for you. Reading large FASTA files can be painful, and I’m sure the approach presented here will help you read it quickly and make the task less painful; enjoy it!