This blog post covers what you need to know about NCBI BLAST (Basic Local Alignment Search Tool). Here you will learn how to run it (blastn, blastp, blastx, tblastn, tblastx, and megablast), what an e-value is, and more.
Moreover, it presents some faster alternative aligners (DIAMOND, RAPSearch2, BWA, and minimap2) for aligning your sequences against a reference database, depending on your type of query and how much memory you have available.
What is BLAST?
First and foremost, BLAST stands for Basic Local Alignment Search Tool and was created by Stephen Altschul and colleagues. It compares a set of query sequences against a set of reference (subject) sequences, either at the DNA or at the protein level.
Nowadays, there are many tools to do this task, but BLAST is still the gold standard for alignments. It is so famous that some people use the verb “to blast” when referring to using another aligner.
Also, in the video below, Dr. Robert Edwards explains in detail how the BLAST algorithm works, what an e-value is, and more.
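As a rough numerical illustration of the e-value: in Karlin-Altschul statistics, the number of hits expected by chance with a bit score of at least S' is approximately E = m × n × 2^(−S'), where m and n are the query and database lengths. The sketch below only applies that formula. The function name and the numbers are made up for illustration, and BLAST itself uses corrected ("effective") lengths, so the values it reports will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same alignment (same bit score) searched against a small and a large database.
# All numbers here are hypothetical.
e_small_db = expected_hits(bit_score=50.0, query_len=300, db_len=1_000_000)
e_large_db = expected_hits(bit_score=50.0, query_len=300, db_len=1_000_000_000)

print(f"small database: {e_small_db:.2e}")
print(f"large database: {e_large_db:.2e}")
```

The takeaway is that an e-value depends on the database size: the same alignment becomes less significant as the database grows, which is why e-values from searches against different databases are not directly comparable.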
Which BLAST should I use: blastn, blastx, blastp, etc
The type of BLAST subcommand required depends on your query and your reference database. Below I detail which one should be used:
Command name | Details |
blastn | Query: Nucleotide and Reference: Nucleotide. The default -task is megablast, which is faster than -task blastn but less sensitive. |
blastp | Query: Protein and Reference: Protein |
blastx | Query: Nucleotide (six-frame translations) and Reference: Protein. |
tblastx | Query: Nucleotide (six-frame translations) and Reference: Nucleotide database dynamically translated in all six reading frames. |
tblastn | Query: Protein and Reference: Nucleotide database dynamically translated in all six reading frames. |
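The table above can be read as a lookup from (query type, reference type) to program name. The sketch below encodes it that way; the dictionary and function names are hypothetical helpers, not part of BLAST.

```python
from typing import Dict, List, Tuple

# (query type, reference type) -> candidate BLAST programs, mirroring the table.
BLAST_PROGRAMS: Dict[Tuple[str, str], List[str]] = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def blast_programs_for(query_type: str, reference_type: str) -> List[str]:
    """Return the BLAST programs that fit a query/reference combination."""
    key = (query_type.lower(), reference_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return BLAST_PROGRAMS[key]


print(blast_programs_for("nucleotide", "protein"))
print(blast_programs_for("nucleotide", "nucleotide"))
```

For nucleotide against nucleotide there are two candidates: blastn compares the DNA directly, while tblastx compares the translated sequences.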
Installing and Running it
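Now that you understand the different BLAST programs, let’s learn how to install the tool. You can easily install it with Bioconda (the install commands in this post assume that the bioconda and conda-forge channels are already configured).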
$ conda install blast
One way to run BLAST is against a formatted database, which is created with the makeblastdb command.
# format the database, assuming the reference is DNA
$ makeblastdb -in {REFERENCE.FASTA} -dbtype nucl
# format the database, assuming the reference is protein
$ makeblastdb -in {REFERENCE.FASTA} -dbtype prot
The formatted files will be saved in the same directory as the FASTA file.
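If you are scripting this step, the choice between -dbtype nucl and -dbtype prot can be guessed from the FASTA content. The sketch below is a simple heuristic, not something BLAST provides: the helper names are hypothetical, and the guess fails for DNA with IUPAC ambiguity codes other than N, or for a protein file that happens to contain only A, C, G, T, and N.

```python
from typing import List

NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_dbtype(fasta_text: str) -> str:
    """Guess the makeblastdb -dbtype value ('nucl' or 'prot') from FASTA text."""
    residues = set()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        residues.update(line.upper())
    if not residues:
        raise ValueError("no sequence data found")
    return "nucl" if residues <= NUCLEOTIDE_LETTERS else "prot"


def makeblastdb_command(fasta_path: str, dbtype: str) -> List[str]:
    """Build the argument list for the makeblastdb call shown above."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype]


dna = ">contig1\nACGTTGCANNACGT\n"
protein = ">prot1\nMKVLAAGIVGLLLAQ\n"
print(makeblastdb_command("reference.fasta", guess_dbtype(dna)))
print(makeblastdb_command("proteins.fasta", guess_dbtype(protein)))
```

The resulting list can be passed to subprocess.run(..., check=True) on a machine where BLAST is installed.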
Now that you know how to format your database, you can learn how to run the tool.
For example, the command below runs blastn with a query FASTA file (-query) against the database we formatted (-db), saving the output to a file (-out) in tabular format (-outfmt 6), keeping only alignments with an e-value of at most 0.00001 (-evalue), in its most sensitive mode (-task blastn), and using 10 threads (-num_threads).
$ blastn -query {QUERY.FASTA} -db {REFERENCE.FASTA} -out {OUTPUT_FILE} -outfmt 6 -evalue 0.00001 -task blastn -num_threads 10
The same command line can be adapted to run the other BLAST programs. Please remember to remove the “-task blastn” parameter when doing so, because that task value is only valid for blastn.
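When the search is launched from a script, the same command can be assembled as an argument list. This is a minimal sketch: blast_command is a hypothetical helper, the file names are placeholders, and only the options already used in this post appear in it.

```python
import shlex
from typing import List, Optional


def blast_command(program: str, query: str, db: str, out: str,
                  evalue: str = "1e-5", threads: int = 1,
                  task: Optional[str] = None) -> List[str]:
    """Build a BLAST+ command line producing tabular output (-outfmt 6)."""
    cmd = [program, "-query", query, "-db", db, "-out", out,
           "-outfmt", "6", "-evalue", evalue, "-num_threads", str(threads)]
    if task is not None:
        # Only add -task when asked; "-task blastn" is specific to blastn.
        cmd += ["-task", task]
    return cmd


blastn_cmd = blast_command("blastn", "query.fasta", "reference.fasta",
                           "hits.tsv", threads=10, task="blastn")
blastx_cmd = blast_command("blastx", "query.fasta", "proteins.fasta",
                           "hits.tsv", threads=10)
print(shlex.join(blastn_cmd))
print(shlex.join(blastx_cmd))
```

Passing either list to subprocess.run(cmd, check=True) would run the search, provided BLAST is installed and the database has been formatted.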
Tabular Output
One of the most common BLAST output formats is the tabular output, which is selected with -outfmt 6. It can be used as input by MEGAN, for example.
The default columns of the tabular output are listed below.
Column Index | Column Name | Column Definition |
1 | qseqid | query (e.g., gene) sequence id |
2 | sseqid | subject (e.g., reference genome) sequence id |
3 | pident | percentage of identical matches |
4 | length | alignment length |
5 | mismatch | number of mismatches |
6 | gapopen | number of gap openings |
7 | qstart | start of alignment in query |
8 | qend | end of alignment in query |
9 | sstart | start of alignment in subject |
10 | send | end of alignment in subject |
11 | evalue | expect value |
12 | bitscore | bit score |
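The twelve default columns make the file easy to parse with the standard library. The sketch below reads tab-separated hits into dictionaries and keeps those with an e-value of at most 1e-5. The two example hits are invented, and parse_blast_tabular is a hypothetical helper name.

```python
import csv
import io

DEFAULT_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                   "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_blast_tabular(handle, columns=DEFAULT_COLUMNS):
    """Yield one dict per hit from a -outfmt 6 file-like object."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(columns, row))
        for name, value in record.items():
            if name in INT_COLUMNS:
                record[name] = int(value)
            elif name in FLOAT_COLUMNS:
                record[name] = float(value)
        yield record


# Invented example hits, in the default column order.
EXAMPLE = (
    "read1\tcontig7\t98.50\t200\t3\t0\t1\t200\t1500\t1699\t2e-100\t360\n"
    "read2\tcontig3\t75.00\t80\t20\t0\t5\t84\t10\t89\t0.5\t30.1\n"
)

hits = list(parse_blast_tabular(io.StringIO(EXAMPLE)))
significant = [hit for hit in hits if hit["evalue"] <= 1e-5]
for hit in significant:
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

For a real file, replace io.StringIO(EXAMPLE) with open("hits.tsv", newline="").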
A customized tabular output can be created by setting -outfmt "6 column_name_1 column_name_2 … column_name_n".
For example, -outfmt "6 qseqid sseqid qcovs evalue" will output the query name, subject name, query coverage per subject, and e-value. The available column names are listed below.
Column Name | Column Definition |
qseqid | Query Seq-id |
qgi | Query GI |
qacc | Query accession |
qaccver | Query accession.version |
qlen | Query sequence length |
sseqid | Subject Seq-id |
sallseqid | All subject Seq-id(s), separated by a ‘;’ |
sgi | Subject GI |
sallgi | All subject GIs |
sacc | Subject accession |
saccver | Subject accession.version |
sallacc | All subject accessions |
slen | Subject sequence length |
qstart | Start of alignment in query |
qend | End of alignment in query |
sstart | Start of alignment in subject |
send | End of alignment in subject |
qseq | Aligned part of the query sequence |
sseq | Aligned part of the subject sequence |
evalue | Expect value |
bitscore | Bit score |
score | Raw score |
length | Alignment length |
pident | Percentage of identical matches |
nident | Number of identical matches |
mismatch | Number of mismatches |
positive | Number of positive-scoring matches |
gapopen | Number of gap openings |
gaps | Total number of gaps |
ppos | Percentage of positive-scoring matches |
frames | Query and subject frames separated by a ‘/’ |
qframe | Query frame |
sframe | Subject frame |
btop | Blast traceback operations (BTOP) |
staxids | Subject Taxonomy ID(s), separated by a ‘;’ |
sscinames | Subject Scientific Name(s), separated by a ‘;’ |
scomnames | Subject Common Name(s), separated by a ‘;’ |
sblastnames | Subject Blast Name(s), separated by a ‘;’ (in alphabetical order) |
sskingdoms | Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order) |
stitle | Subject Title |
salltitles | All Subject Title(s), separated by a ‘<>’ |
sstrand | Subject Strand |
qcovs | Query Coverage Per Subject |
qcovhsp | Query Coverage Per HSP |
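With a custom column list, it helps to keep the list in one place and derive both the -outfmt argument and the parser from it. A minimal sketch, in which the helper names and the example line are made up:

```python
from typing import Dict, List

CUSTOM_COLUMNS = ["qseqid", "sseqid", "qcovs", "evalue"]


def outfmt_argument(columns: List[str]) -> str:
    """Build the value passed to -outfmt for a custom tabular output."""
    return "6 " + " ".join(columns)


def parse_custom_line(line: str, columns: List[str]) -> Dict[str, str]:
    """Split one tab-separated output line into a column-name -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


print(outfmt_argument(CUSTOM_COLUMNS))

# Invented output line matching CUSTOM_COLUMNS.
hit = parse_custom_line("read1\tcontig7\t95\t2e-100\n", CUSTOM_COLUMNS)
print(hit["qseqid"], hit["qcovs"])
```

Values are left as strings here; convert the ones you need, for example float(hit["evalue"]).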
Why Is BLAST Still Used?
BLAST is still used in many applications because it is one of the most sensitive aligners available. For example, it is sensitive enough that blastx results have been used as the “true” set against which other aligners were compared. Moreover, when generating recruitment plots, I normally use blastn for the alignment, because I want to make sure I don’t miss any region.
However, as sequencing became much cheaper over the years, query and reference databases became much larger. Thus, faster alternatives to BLAST were needed.
Unfortunately, they are less sensitive than BLAST, but they are still worth using in metagenomics applications. Some of these tools are presented below.
Faster Alternatives to BLAST
For some applications, such as analyzing large metagenomic datasets, running BLAST may not be the best option because of the time it requires.
Below are some options for replacing blastn, blastp, and blastx with major gains in speed.
DIAMOND: Up to 20 thousand times faster than blastx
DIAMOND is a great replacement for blastx and blastp with little loss of sensitivity. It is fast because the algorithm uses a reduced protein alphabet and improves the seeding and seed-indexing steps, which are the bottleneck when aligning a large number of sequences against a large database.
Besides the speed advantage, DIAMOND can also generate the same tabular output as BLAST.
The tool can be easily installed using Bioconda.
$ conda install diamond
Please read the tool README on its GitHub page for details on how to run it.
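As a starting point, a typical DIAMOND run has two steps: build a database from a protein FASTA file, then search it. The sketch below only assembles the two command lines. The helper name and file names are placeholders, and the thresholds are examples rather than recommendations.

```python
import shlex
from typing import List, Tuple


def diamond_commands(reference_faa: str, db_prefix: str, query_fasta: str,
                     out_tsv: str, threads: int = 4) -> Tuple[List[str], List[str]]:
    """Build the 'diamond makedb' and 'diamond blastx' command lines."""
    makedb = ["diamond", "makedb", "--in", reference_faa, "--db", db_prefix]
    search = ["diamond", "blastx", "--db", db_prefix, "--query", query_fasta,
              "--out", out_tsv, "--outfmt", "6", "--evalue", "1e-5",
              "--threads", str(threads)]
    return makedb, search


makedb_cmd, search_cmd = diamond_commands("proteins.faa", "proteins_db",
                                          "reads.fasta", "hits.tsv", threads=10)
print(shlex.join(makedb_cmd))
print(shlex.join(search_cmd))
```

Use "blastp" instead of "blastx" in the second command when the queries are proteins.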
RAPSearch2: 100 times faster than blastx
You may be asking why I’m showing RAPSearch2 here if it is only 100 times faster than blastx when DIAMOND is up to 20 thousand times faster, right? Well, DIAMOND normally uses a large amount of memory, and not everybody has access to a machine with that much.
RAPSearch2 uses a reduced protein alphabet, which is one of the reasons it is much faster than blastx and blastp.
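To give an intuition for what a reduced protein alphabet does, the sketch below collapses amino acids with similar properties into one letter, so that similar peptides become identical strings and exact seed matching can find them. The grouping is only illustrative and is an assumption; it is not necessarily the alphabet used by RAPSearch2 or DIAMOND.

```python
# Illustrative grouping of the 20 amino acids into 10 classes (an assumption,
# not the alphabet of any specific tool). Each class is represented by its
# first letter.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCED = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(protein: str) -> str:
    """Rewrite a protein sequence in the reduced alphabet (unknown -> 'X')."""
    return "".join(REDUCED.get(aa, "X") for aa in protein.upper())


peptide_a = "MKVLE"
peptide_b = "IRILD"  # differs from peptide_a at four of five positions
print(reduce_sequence(peptide_a))
print(reduce_sequence(peptide_b))
print(reduce_sequence(peptide_a) == reduce_sequence(peptide_b))
```

Both peptides map to the same reduced string even though they share only one identical residue, which is how a reduced alphabet lets fast exact-match seeding tolerate conservative substitutions.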
Moreover, you can easily install it with Bioconda.
$ conda install rapsearch
Please read the tool README on its GitHub page for details on how to run it.
BWA: blastn for short reads
BWA is a great replacement for blastn if the query is composed of short reads, which is very common in metagenomics. It is also fast and memory efficient.
BWA is based on the Burrows-Wheeler Transform, an algorithm that is also used in data compression. Please see below a video explaining the Burrows-Wheeler Transform.
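For readers who prefer code, here is a deliberately naive version of the transform and its inverse. It sorts all rotations of the text, which is fine for a toy string but far too slow for genomes; BWA relies on a much more efficient index built on the same idea, so treat this as a teaching sketch only.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler Transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

Notice how the transform groups identical letters together ("nn", "aa"), which is what makes it useful for compression, while remaining fully reversible.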
Furthermore, you can easily install it with Bioconda.
$ conda install bwa
Please read the tool README on its GitHub page for details on how to run it.
Furthermore, bwa-mem2 promises to produce the same output as BWA, but ~80% faster.
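A basic BWA run indexes the reference once and then aligns the reads with bwa mem. The sketch below only builds the two command lines; the helper name and file names are placeholders. Keep in mind that BWA writes SAM to standard output rather than a BLAST-style table.

```python
import shlex
from typing import List, Tuple


def bwa_commands(reference_fasta: str, reads_fastq: str,
                 threads: int = 4) -> Tuple[List[str], List[str]]:
    """Build the 'bwa index' and 'bwa mem' command lines."""
    index = ["bwa", "index", reference_fasta]
    align = ["bwa", "mem", "-t", str(threads), reference_fasta, reads_fastq]
    return index, align


index_cmd, align_cmd = bwa_commands("reference.fasta", "reads.fastq", threads=10)
print(shlex.join(index_cmd))
# bwa mem prints SAM to stdout, so redirect it to a file when running it:
print(shlex.join(align_cmd) + " > alignments.sam")
```

From Python, the redirection can be done with subprocess.run(align_cmd, stdout=sam_file, check=True), where sam_file is a file opened for writing.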
Minimap2: blastn for long reads
Minimap2 is a great replacement for blastn when the queries are long, for example PacBio or Oxford Nanopore reads.
Moreover, you can easily install it with Bioconda.
$ conda install minimap2
Please read the tool README on its GitHub page for details on how to run it.
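As a starting point, minimap2 is usually run with a preset that matches the sequencing technology. The sketch below maps a technology name to a preset and builds the command line. The helper and the dictionary keys are hypothetical, while map-pb and map-ont are minimap2's presets for PacBio and Oxford Nanopore reads.

```python
import shlex
from typing import List

# Hypothetical technology names mapped to minimap2 presets (-x).
PRESETS = {"pacbio": "map-pb", "nanopore": "map-ont"}


def minimap2_command(reference_fasta: str, reads_fastq: str,
                     technology: str, threads: int = 4) -> List[str]:
    """Build a minimap2 command line that outputs SAM (-a)."""
    try:
        preset = PRESETS[technology.lower()]
    except KeyError:
        raise ValueError(f"unknown technology: {technology!r}") from None
    return ["minimap2", "-a", "-x", preset, "-t", str(threads),
            reference_fasta, reads_fastq]


ont_cmd = minimap2_command("reference.fasta", "reads.fastq", "nanopore", threads=10)
# Like BWA, minimap2 writes its alignments to stdout:
print(shlex.join(ont_cmd) + " > alignments.sam")
```

Without -a, minimap2 writes the lighter PAF format instead of SAM.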
More Resources
Here are three of my favorite Python bioinformatics books in case you want to learn more about the topic.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model
Conclusion
In summary, this article covered what you need to know about NCBI BLAST.
However, it also showed that BLAST is sometimes not the best choice because of its speed limitations. Luckily, there are alternatives that can replace blastn, blastp, and blastx, depending on your query type and how much memory you have available.