Multiple Sequence Alignment – Theory and Practice – Step-by-Step



This blog post describes Multiple Sequence Alignment (MSA), covering both the theory and the practice, step by step, using MAFFT and Muscle.

1. What is Multiple Sequence Alignment?

Multiple sequence alignment, also referred to as MSA, is an essential technique in molecular biology, bioinformatics, and computational biology, and it remains an area of active research. A MSA is an alignment of three or more biological sequences, such as protein, DNA, or RNA sequences, that are generally of similar length.

2. Importance of Multiple Sequence Alignment

Multiple sequence alignments are an essential tool for sequence analysis: they are used to predict protein structure and function, to infer phylogeny, and for other common sequence analysis tasks. From the resulting alignment, homology and evolutionary relationships between the studied sequences can be inferred.

Although protein alignment has been studied for several decades, recent studies have demonstrated significant progress in the scalability and accuracy of multiple sequence alignment tools, extending the range of tasks that a MSA program can handle. The state of the art has advanced in accuracy, in scalability to thousands of proteins, and in the flexibility to compare proteins with varied domain structures. Even so, MSA remains a computationally hard problem: finding an optimal alignment under the commonly used sum-of-pairs score is NP-hard.

Multiple sequence alignment deals with three or more biological sequences. Because such sequences rarely have exactly the same length, and because aligning them by hand is very time-consuming, computational algorithms are used to create and analyze the alignments. Many bioinformatics techniques and procedures depend on the accuracy of the MSA, so it is of critical importance.

3. Multiple Sequence Alignment Algorithms

Unfortunately, producing accurate multiple sequence alignments is both biologically complex and computationally intensive. To date, none of the current MSA tools generates biologically perfect results, although they are more accurate than other sequence alignment techniques.

This is why the area remains highly active: the goal is an algorithm or method that can align thousands of lengthy sequences and produce high-quality alignments within a reasonable time. As the number of sequences to be aligned grows, both the alignment time and the computational complexity increase.
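To get a feel for why exact alignment does not scale, here is a back-of-the-envelope sketch. It assumes the textbook dynamic-programming approach generalized to n sequences, where the table has roughly len^n cells and each cell is computed from 2^n - 1 neighbouring cells. The function name dp_cost is made up for this illustration, and the numbers say nothing about how MAFFT or Muscle work internally, since both rely on heuristics precisely to avoid this cost.

```shell
# Rough number of cell updates for exact n-dimensional dynamic programming.
# Arguments: $1 = number of sequences (n), $2 = length of each sequence (len).
# Assumption: the table has about len^n cells, each looking at 2^n - 1 neighbours.
dp_cost() {
    awk -v n="$1" -v len="$2" 'BEGIN { printf "%.3g\n", (len ^ n) * (2 ^ n - 1) }'
}

# Compare a few sequence counts for sequences of length 100.
for n in 2 3 5 10; do
    echo "$n sequences of length 100: about $(dp_cost "$n" 100) cell updates"
done
```

Going from two sequences to ten already makes the exact approach impractical, which is why real tools trade guaranteed optimality for speed.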

Recent studies have introduced new MSA benchmark databases, including SABMARK, IRMBASE, OXBENCH, and PREFAB. New MSA algorithms are appearing at a rate of one to two per month. CLUSTALW is still the most popular MSA tool to date, although many methods, such as T-Coffee, MAFFT, MUSCLE, and Kalign, offer better alignment quality and lower computational cost.

The following sections explain how you can run MAFFT and MUSCLE.

4. Creating a MSA with MAFFT

Luckily, MAFFT is available on bioconda, so you can install it using the command line below

$ conda install -c conda-forge -c bioconda mafft

Now that MAFFT is installed, you can create a MSA with the following command line

# Run MAFFT with 16 threads on an input in FASTA format and write the
# MSA to $OUTPUT.msa
$ mafft --thread 16 $INPUT.fasta > $OUTPUT.msa

# Run MAFFT in fast mode (a single guide-tree pass) with 16 threads on an
# input in FASTA format and write the MSA to $OUTPUT.msa
$ mafft --thread 16 --retree 1 $INPUT.fasta > $OUTPUT.msa
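Before running either command on a real data set, it can be worth looking at what is in the input file. The sketch below prints the identifier and length of each record in a FASTA file, which shows how many sequences there are and how much their lengths differ. The function name fasta_lengths and the toy sequences are invented for this example; it only needs awk, not MAFFT.

```shell
# Print "<id> <length>" for every record in a FASTA file.
# Sequence data may be wrapped over several lines, so line lengths are summed.
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) print id, len       # flush the previous record
            seen = 1
            id = substr($1, 2)            # first word after ">"
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (seen) print id, len }
    ' "$1"
}

# Toy input (hypothetical sequences, for demonstration only).
toy=$(mktemp)
cat > "$toy" <<'EOF'
>seq1 example
MKTAYIAKQR
QISFVK
>seq2
MKTAYIAK
>seq3
MKAYIAKQRQISF
EOF

fasta_lengths "$toy"
rm -f "$toy"
```

On your own data you would call it as fasta_lengths $INPUT.fasta. A file with fewer than three records, or with one sequence far longer than the rest, is worth a second look before aligning.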

5. Running Muscle to Create a MSA

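Muscle is also available on bioconda, so you can install it using the command line below. Note that the commands in this section use the Muscle 3.8 options (-in, -out, -maxiters). Muscle 5 changed its command-line interface (for example, it uses -align and -output instead), so check which version was installed with muscle -version.

$ conda install -c conda-forge -c bioconda muscle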

It is also easy to run Muscle

# Run Muscle on an input in FASTA format and write the MSA to
# $OUTPUT.msa
$ muscle -in $INPUT.fasta -out $OUTPUT.msa

# Run Muscle in fast mode on a nucleotide input in FASTA format and write
# the MSA to $OUTPUT.msa
$ muscle -in $INPUT.fasta -out $OUTPUT.msa -maxiters 1 -diags

# Run Muscle in fast mode on an amino acid input in FASTA format and write
# the MSA to $OUTPUT.msa
$ muscle -in $INPUT.fasta -out $OUTPUT.msa -maxiters 1 -diags -sv -distance1 kbit20_3
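Whichever tool produced the output, a quick sanity check is that every row of the alignment has the same length once gaps are counted. The sketch below assumes the output is in aligned FASTA format. The function name check_msa and the toy alignment are made up for this example, and the check says nothing about the biological quality of the alignment.

```shell
# Return 0 if every record in an aligned FASTA file has the same length
# (gap characters included), and non-zero otherwise or if the file is empty.
check_msa() {
    awk '
        function finish_record() {
            if (!seen) return
            nrec++
            if (nrec == 1) width = len    # first record sets the expected width
            else if (len != width) bad = 1
        }
        /^>/ { finish_record(); seen = 1; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { finish_record(); exit (bad || !seen) }
    ' "$1"
}

# Toy alignment (hypothetical sequences, for demonstration only).
aln=$(mktemp)
cat > "$aln" <<'EOF'
>seq1
MKTAYIAKQR
>seq2
MKTAYIAK--
>seq3
MK-AYIAKQR
EOF

if check_msa "$aln"; then
    echo "all rows have the same length"
else
    echo "rows differ in length; this is not a valid alignment"
fi
rm -f "$aln"
```

On real output you would run check_msa $OUTPUT.msa. A failure usually means the file was truncated or is not actually an alignment.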

6. Visualizing a MSA

Last but not least, now that the MSA has been generated, you need a way to visualize it. Recently, I was introduced to MSAViewer.

Wikipedia has a page that lists many tools for MSA visualization.
