Everything About the NCBI BLAST Aligner and Faster Alternatives

onestop_databy:


This blog post covers what you need to know about NCBI BLAST (Basic Local Alignment Search Tool). Here you will learn how to run it (blastx, blastn, blastp, tblastx, tblastn, and megablast), what the e-value is, and more.

Moreover, it presents some faster alternative aligners (DIAMOND, Rapsearch2, BWA, and Minimap2) for aligning your sequences against a reference database, depending on how much memory you have available.

What is BLAST?

First and foremost, BLAST stands for Basic Local Alignment Search Tool and was created by Stephen Altschul and colleagues. It compares a set of query sequences against a set of reference (subject) sequences, at either the DNA or the protein level.

Nowadays, there are many tools for this task, but BLAST is still the gold standard for alignments. It is so famous that some people use the verb “to blast” even when they are using another aligner.

Also, in the video below, Dr. Robert Edwards explains in detail how the BLAST algorithm works, what the e-value is, and more.

Which BLAST should I use: blastn, blastx, blastp, etc

The type of BLAST subcommand required depends on your query and your reference database. Below I detail which one should be used:

Command name | Details
blastn | Query: nucleotide. Reference: nucleotide. The default value of the -task parameter is megablast, which is faster than -task blastn but less sensitive.
blastp | Query: protein. Reference: protein.
blastx | Query: nucleotide (translated in all six reading frames). Reference: protein.
tblastx | Query: nucleotide (translated in all six reading frames). Reference: nucleotide database dynamically translated in all six reading frames.
tblastn | Query: protein. Reference: nucleotide database dynamically translated in all six reading frames.
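The table above can be read as a simple lookup. Below is a minimal sketch of that lookup as a shell function; the function name pick_blast and the labels nucl, prot, and translated are my own choices for illustration and are not part of BLAST.

```shell
# Hypothetical helper: print the BLAST program that matches a
# query type and a reference type (nucl or prot).
# Usage: pick_blast QUERY_TYPE REFERENCE_TYPE [translated]
pick_blast() {
  case "$1:$2" in
    nucl:nucl)
      # Nucleotide vs nucleotide: tblastx only if a translated search is wanted
      if [ "${3:-}" = "translated" ]; then echo tblastx; else echo blastn; fi ;;
    prot:prot) echo blastp ;;
    nucl:prot) echo blastx ;;
    prot:nucl) echo tblastn ;;
    *) echo "unknown combination: $1 vs $2" >&2; return 1 ;;
  esac
}

pick_blast nucl prot   # prints blastx
```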

Installing and Running it

Now that you understand the different types of sub-blasts, let’s learn how to install it. You can easily install the tool using bioconda.

$ conda install -c conda-forge -c bioconda blast

One way to run BLAST is against a formatted database, which you can create with the command makeblastdb.

# format the database, assuming the reference is DNA
$ makeblastdb -in {REFERENCE.FASTA} -dbtype nucl

# format the database, assuming the reference is protein
$ makeblastdb -in {REFERENCE.FASTA} -dbtype prot

The formatted files will be saved in the same directory as the FASTA file.

Now that you know how to format your database, you can learn how to run the tool.

For example, the command below runs blastn with a query FASTA (-query) against the database we formatted (-db), saves the output to a file (-out) in tabular format (-outfmt 6), limits the output to alignments with an e-value of at most 0.00001 (-evalue), uses the most sensitive mode (-task blastn), and runs on 10 threads (-num_threads 10).

$ blastn -query {QUERY.FASTA} -db {REFERENCE.FASTA} -out {OUTPUT_FILE} -outfmt 6 -evalue 0.00001 -task blastn -num_threads 10

The same command line can easily be modified to run the other BLAST subcommands. Please remember to remove “-task blastn” when doing so, since that task only applies to blastn.
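To make that substitution concrete, here is a small sketch that only prints the command line for a given subcommand, so it can be inspected before running it. The function name blast_cmd and the file names in the example call are placeholders I made up; the options are the ones used in the blastn command above.

```shell
# Hypothetical helper: print (do not run) a BLAST command line.
# Usage: blast_cmd PROGRAM QUERY_FASTA DATABASE OUTPUT_FILE
blast_cmd() {
  program=$1 query=$2 db=$3 out=$4
  cmd="$program -query $query -db $db -out $out -outfmt 6 -evalue 0.00001 -num_threads 10"
  # "-task blastn" is only added for blastn itself
  if [ "$program" = "blastn" ]; then
    cmd="$cmd -task blastn"
  fi
  printf '%s\n' "$cmd"
}

# Example with placeholder file names: a blastx search against a protein database
blast_cmd blastx reads.fasta proteins.fasta hits.tsv
```

Because the function only prints text, it is safe to try without BLAST installed; paths containing spaces would need extra quoting before the printed line is reused.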

Tabular Output

One of the most common BLAST output formats is the tabular output, which can be selected with -outfmt 6. This output can be used as input by MEGAN, for example.

The default tabular output contains the following columns.

Column Index | Column Name | Column Definition
1 | qseqid | query (e.g., gene) sequence id
2 | sseqid | subject (e.g., reference genome) sequence id
3 | pident | percentage of identical matches
4 | length | alignment length
5 | mismatch | number of mismatches
6 | gapopen | number of gap openings
7 | qstart | start of alignment in query
8 | qend | end of alignment in query
9 | sstart | start of alignment in subject
10 | send | end of alignment in subject
11 | evalue | expect value
12 | bitscore | bit score
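Since the tabular output is plain tab-separated text, it can be filtered with standard tools. The sketch below builds a tiny made-up hits.tsv (the read names, values, and thresholds are invented for illustration) and keeps only the hits that pass identity, length, and e-value cutoffs.

```shell
# Made-up hits in the default 12-column layout (tab-separated).
{
  printf 'read1\tref1\t98.5\t150\t2\t0\t1\t150\t1001\t1150\t1e-70\t270\n'
  printf 'read2\tref1\t80.0\t120\t24\t0\t1\t120\t5001\t5120\t1e-20\t95\n'
  printf 'read3\tref2\t99.0\t40\t0\t0\t1\t40\t300\t339\t0.002\t60\n'
} > hits.tsv

# Keep hits with pident >= 95 (column 3), alignment length >= 100 (column 4)
# and e-value <= 1e-5 (column 11). Adding 0 forces a numeric comparison.
awk -F'\t' '($3 + 0) >= 95 && ($4 + 0) >= 100 && ($11 + 0) <= 1e-5' hits.tsv > filtered.tsv

cat filtered.tsv
```

With this sample data only the read1 line passes all three cutoffs.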

Custom tabular output can be created by setting -outfmt "6 column_name_1 column_name_2 … column_name_n".

For example, -outfmt "6 qseqid sseqid qcovs evalue" will output the query name, subject name, query coverage per subject, and e-value.

Column Name | Column Definition
qseqid | Query Seq-id
qgi | Query GI
qacc | Query accession
qaccver | Query accession.version
qlen | Query sequence length
sseqid | Subject Seq-id
sallseqid | All subject Seq-id(s), separated by a ‘;’
sgi | Subject GI
sallgi | All subject GIs
sacc | Subject accession
saccver | Subject accession.version
sallacc | All subject accessions
slen | Subject sequence length
qstart | Start of alignment in query
qend | End of alignment in query
sstart | Start of alignment in subject
send | End of alignment in subject
qseq | Aligned part of the query sequence
sseq | Aligned part of the subject sequence
evalue | Expect value
bitscore | Bit score
score | Raw score
length | Alignment length
pident | Percentage of identical matches
nident | Number of identical matches
mismatch | Number of mismatches
positive | Number of positive-scoring matches
gapopen | Number of gap openings
gaps | Total number of gaps
ppos | Percentage of positive-scoring matches
frames | Query and subject frames separated by a ‘/’
qframe | Query frame
sframe | Subject frame
btop | Blast traceback operations (BTOP)
staxids | Subject Taxonomy ID(s), separated by a ‘;’
sscinames | Subject Scientific Name(s), separated by a ‘;’
scomnames | Subject Common Name(s), separated by a ‘;’
sblastnames | Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)
sskingdoms | Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)
stitle | Subject Title
salltitles | All Subject Title(s), separated by a ‘<>’
sstrand | Subject Strand
qcovs | Query Coverage Per Subject
qcovhsp | Query Coverage Per HSP
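As an illustration of working with a custom layout, the sketch below assumes output produced with the four columns qseqid, sseqid, qcovs, and evalue, and keeps the hit with the lowest e-value per query among those with at least 80% query coverage. The gene names, values, and the 80% cutoff are made up for the example.

```shell
# Made-up hits with columns: qseqid, sseqid, qcovs, evalue (tab-separated).
{
  printf 'geneA\tref1\t95\t1e-50\n'
  printf 'geneA\tref2\t90\t1e-80\n'
  printf 'geneA\tref3\t40\t1e-100\n'
  printf 'geneB\tref1\t85\t1e-10\n'
} > custom_hits.tsv

# For each query, remember the line with the smallest e-value (column 4),
# ignoring hits whose query coverage (column 3) is below 80.
awk -F'\t' '
  ($3 + 0) >= 80 {
    e = $4 + 0
    if (!($1 in best) || e < best[$1]) { best[$1] = e; line[$1] = $0 }
  }
  END { for (q in line) print line[q] }
' custom_hits.tsv | sort > best_hits.tsv

cat best_hits.tsv
```

Here geneA keeps its ref2 hit: the ref3 hit has a lower e-value but is dropped by the coverage cutoff.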

Why Is BLAST Still Used?

BLAST is still used in many applications because it is one of the most sensitive aligners available. For example, it is sensitive enough that blastx results have been used as the “true” set against which other results are compared. Moreover, when generating recruitment plots, I normally use blastn for the alignment because I want to make sure I don’t miss any region.

However, as sequencing became much cheaper over the years, query and reference databases became much larger. Thus, faster alternatives to BLAST were needed.

Unfortunately, they are less sensitive than BLAST, but they are still worth using in metagenomics applications. Some of these tools are described below.

Faster Alternatives to BLAST

For some applications, such as analyzing large metagenomic datasets, running BLAST may not be the best option because of the time it requires.

Below are some options for replacing blastn, blastp, and blastx with major gains in speed.

DIAMOND: Up to 20 thousand times faster than blastx

DIAMOND is a great replacement for blastx and blastp with little loss of sensitivity. It is fast because the algorithm uses a reduced protein alphabet and improves seeding and seed indexing, which are the bottleneck when aligning a large number of sequences against a large database.

Besides the speed advantage, DIAMOND can also generate the same tabular output as BLAST.

The tool can be easily installed using Bioconda.

$ conda install -c conda-forge -c bioconda diamond

Please read the tool README on its GitHub page for details on how to run it.
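As a starting point, here is a minimal sketch of a blastx-style DIAMOND run wrapped in a shell function. The function name diamond_blastx, the database naming scheme, and the e-value and thread settings (copied from the blastn example earlier in this post) are my assumptions; please check the README for the options that fit your data.

```shell
# Hypothetical wrapper: build a DIAMOND database from a protein FASTA and
# search nucleotide queries against it, writing BLAST-style tabular output.
# Usage: diamond_blastx REFERENCE_PROTEINS.FASTA QUERY.FASTA OUTPUT_FILE
diamond_blastx() {
  ref=$1 query=$2 out=$3
  if ! command -v diamond >/dev/null 2>&1; then
    echo "diamond is not installed" >&2
    return 127
  fi
  db=${ref%.*}   # database prefix, e.g. proteins.fasta -> proteins
  diamond makedb --in "$ref" -d "$db" &&
    diamond blastx -d "$db" -q "$query" -o "$out" \
      --outfmt 6 --evalue 0.00001 --threads 10
}
```

It could then be called as diamond_blastx proteins.fasta reads.fasta diamond_hits.tsv, where the three file names are placeholders.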

Rapsearch2: 100 times faster than blastx

You may be asking why I’m showing Rapsearch2 here if it is only 100 times faster than blastx when we know that DIAMOND is up to 20 thousand times faster, right? Well, DIAMOND normally uses a large amount of memory, and not everybody has access to that much memory.

The tool uses a reduced protein alphabet and that is one of the reasons it is much faster than blastx and blastp.

Moreover, you can easily install it with Bioconda.

$ conda install -c conda-forge -c bioconda rapsearch

Please read the tool README on its GitHub page for details on how to run it.

BWA: blastn for short reads

BWA is a great replacement for blastn if the query is composed of short reads, which are very common in metagenomics. It is also fast and memory efficient.

BWA is based on the Burrows-Wheeler Transform, an algorithm that is also used in data compression. The video below explains the Burrows-Wheeler Transform algorithm.

Furthermore, you can easily install it with Bioconda.

$ conda install -c conda-forge -c bioconda bwa

Please read the tool README on its GitHub page for details on how to run it.
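For orientation, a typical BWA run has two steps: index the reference, then map the reads. The sketch below wraps both in a function; the name bwa_map and the thread count are my choices. Note that BWA writes alignments in SAM format rather than the BLAST tabular format described above.

```shell
# Hypothetical wrapper: index a nucleotide reference and map short reads to it.
# Usage: bwa_map REFERENCE.FASTA READS.FASTQ OUTPUT.SAM
bwa_map() {
  ref=$1 reads=$2 out=$3
  if ! command -v bwa >/dev/null 2>&1; then
    echo "bwa is not installed" >&2
    return 127
  fi
  # The index files are written next to the reference FASTA
  bwa index "$ref" &&
    bwa mem -t 10 "$ref" "$reads" > "$out"
}
```

A call would look like bwa_map reference.fasta reads.fastq alignments.sam, with placeholder file names.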

Furthermore, BWA-MEM2 promises to produce the same output as BWA, but ~80% faster.

Minimap2: blastn for long reads

Minimap2 is a great replacement for blastn when the query sequences are long, for example PacBio or Oxford Nanopore reads.

Moreover, you can easily install it with Bioconda.

$ conda install -c conda-forge -c bioconda minimap2

Please read the tool README on its GitHub page for details on how to run it.
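As a rough sketch, a Minimap2 run needs a preset that matches the sequencing technology, such as map-ont for Oxford Nanopore reads or map-pb for PacBio reads. The function name minimap2_map and the thread count below are my own choices, and the output is SAM rather than BLAST tabular format.

```shell
# Hypothetical wrapper: map long reads to a nucleotide reference.
# Usage: minimap2_map PRESET REFERENCE.FASTA READS.FASTQ OUTPUT.SAM
#   PRESET is a minimap2 preset, e.g. map-ont or map-pb
minimap2_map() {
  preset=$1 ref=$2 reads=$3 out=$4
  if ! command -v minimap2 >/dev/null 2>&1; then
    echo "minimap2 is not installed" >&2
    return 127
  fi
  # -a asks for SAM output, -x selects the preset, -t sets the thread count
  minimap2 -a -x "$preset" -t 10 "$ref" "$reads" > "$out"
}
```

For Nanopore reads, a call would look like minimap2_map map-ont reference.fasta reads.fastq alignments.sam, with placeholder file names.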


Conclusion

In summary, this article covered what you need to know about NCBI BLAST.

However, it also showed that BLAST may not always be the best choice because of its speed limitations. Luckily, we presented a few alternatives that can replace blastn, blastp, and blastx, depending on your query size and how much memory you have available.
