This blog post shows an easy way to create a multiple sequence alignment (MSA) that is aware of the forward and reverse complement directions of the input sequences.
1. Phylogeny and Distance Matrices
Phylogeny, from the ancient Greek phylon (tribe, family, clan) and genesis (creation), is the study of kinship links (phylogenetic or phyletic relationships) between living organisms and those that have gone extinct.
Phylogenesis makes it possible to reconstruct the evolution of living organisms.
In phylogenetics, relatives are commonly represented by a phylogenetic tree. The number of nodes between two branches, each node standing for a common ancestor, indicates the degree of kinship between groups. The more nodes and intermediate ancestors there are between two living beings, the older their common ancestor and the more distant their current kinship.
Advances in molecular biotechnology have led to a vast accumulation of new biological data, mainly nucleic acid sequences (genes or ORFs, molecular markers, etc.) and proteins (enzymes of energy metabolism, structural proteins, etc.). Acquiring and processing these data requires adequate methods and tools.
A phylogeny is built by comparing specific characters across a set of individuals. These characters are, in general, homologous and belong to contemporary organisms. The data used in phylogenetics fall into two distinct groups:
- Data related to phenotypic characteristics.
- Molecular data such as DNA or protein sequences.
These data cover morphology, physiology, genetics, and genomics. Phenotypic data include observable traits (morphological, biochemical, and physiological states) and binary patterns (presence or absence of a given character). In the case of bacteria, for example, the characters can be:
- Electrophoretic profiles of enzymatic systems.
Systematics, the study of biological diversity with a view to its classification, now focuses, in light of recent findings, on a phylogenetic classification that is replacing the classical one.
2. Why Is It Important to Make the MSA Aware of the Forward and Reverse Directions of the Sequences?
Most MSA tools are not aware of the directionality of their input sequences. Therefore, if the SAME sequence is given to an MSA tool in two different directions, a tool like MAFFT, run without direction adjustment, aligns the sequences as given and outputs an MSA suggesting that the two sequences are far from each other, as we saw above.
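To make the problem concrete, here is a minimal Python sketch. It is not part of MAFFT: the helper names and the toy sequence are my own illustrative choices. It reverse complements a short DNA sequence and compares it with the original position by position, which is roughly what an aligner sees when it is not told about direction.

```python
# Illustrative only: toy sequence and hypothetical helper names.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

forward = "ATGGCGTACGTTAGCCTGAA"
reverse = reverse_complement(forward)

print(identity(forward, forward))                      # same direction
print(identity(forward, reverse))                      # opposite direction
print(identity(forward, reverse_complement(reverse)))  # direction restored
```

The first and third comparisons are perfect matches, while the second is not, even though all three involve the same molecule.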
3. Creating an MSA Aware of the Reverse and Forward Directions
MAFFT has two options that make the MSA aware of the forward and reverse directions.
The first option relies on 6-mer counting and is the faster of the two modes:
$ mafft --adjustdirection INPUT > OUTPUT
The second mode is slower but more sensitive:
$ mafft --adjustdirectionaccurately INPUT > OUTPUT
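To give an intuition for how 6-mer counting can reveal direction, here is a small Python sketch. It is an assumption-laden illustration of the general idea, not MAFFT's actual implementation: the function names, the scoring rule, and the toy sequence are all hypothetical. It counts the 6-mers shared between a reference and a query in both orientations and picks the orientation that shares more.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a, b):
    """Number of k-mer occurrences the two profiles have in common."""
    return sum((a & b).values())

def guess_direction(reference, query, k=6):
    """Return 'forward' or 'reverse', whichever orientation of query
    shares more k-mers with reference (ties count as 'forward')."""
    ref = kmer_counts(reference, k)
    fwd = shared_kmers(ref, kmer_counts(query, k))
    rev = shared_kmers(ref, kmer_counts(reverse_complement(query), k))
    return "reverse" if rev > fwd else "forward"

reference = "ATGGCGTACGTTAGCCTGAACCGTTAGGCA"  # toy sequence
print(guess_direction(reference, reference))
print(guess_direction(reference, reverse_complement(reference)))
```

Counting k-mers needs no alignment at all, which is why a mode built on this kind of heuristic can be fast, while a mode that actually aligns both orientations can be more sensitive at a higher cost.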
As you can see below (very different from the image above), the two sequences now match exactly.
It is important to note two things:
- In the output MSA, any sequence that was reverse complemented has “_R_” prepended to its header.
- This option does not apply to protein input, since protein sequences have no reverse complement.
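If you want to know which sequences were flipped, one option is to scan the output headers for the “_R_” prefix. The Python sketch below does that on a made-up FASTA string. The function name and the example alignment are hypothetical; only the “_R_” prefix comes from the behaviour described above.

```python
def reversed_sequences(fasta_text):
    """Return the original names of sequences whose header starts with '_R_'."""
    prefix = ">_R_"
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(prefix):
            # Drop the '>_R_' marker to recover the original header text.
            names.append(line[len(prefix):].strip())
    return names

# Made-up alignment output, for illustration only.
example_output = """>seq1
ATGGCGTACG
>_R_seq2
ATGGCGTACG
"""

print(reversed_sequences(example_output))  # ['seq2']
```

In practice you would read the text from the OUTPUT file written by the commands above instead of a hard-coded string.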
This blog post taught you how to create an MSA that is aware of the forward and reverse directions of the sequences. Congratulations! Now you won’t make this common mistake when running MAFFT or another alignment tool. Clustal probably has a parameter that does the same thing. I will post about it in the future if I learn how, and if you already know how to do it, please share it in the comments section.