Have you noticed that most assemblers use an odd k-mer length? Do you know why? Don’t you think it is odd?
This blog post explains why an odd k-mer length should be the choice.
Why should you choose an odd k-mer length when assembling a genome?
Have you noticed that kallisto requires an odd k-mer length when quantifying RNA-seq data, as do assemblers such as Velvet and SOAPdenovo2? That the default set of k-mer lengths in SPAdes is also odd? And that KmerGenie normally (if not always) predicts an odd k-mer length?
This is not by chance. A k-mer-based bioinformatics tool does not know which DNA strand a read comes from, so it has to take into account the relationship between a k-mer and its reverse complement.
And that is where the problem with an even k-mer length arises!
An even k-mer length can generate DNA palindromes, and an odd k-mer length never does. In DNA terms, a palindrome is a sequence that is its own reverse complement. Such k-mers create ambiguity in the de Bruijn graph built by an assembler such as SPAdes and make assembling reads more problematic.
For example, “AGCGCT” is a 6-mer (even length) and its reverse complement is also “AGCGCT”. This can never happen with an odd k-mer, because the middle base would have to be its own complement and no base is: “AGCGCTA” is a 7-mer and its reverse complement is “TAGCGCT”.
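To make the example above concrete, here is a minimal Python sketch that computes the reverse complement of both k-mers. The helper names `reverse_complement` and `is_dna_palindrome` are my own, not taken from any assembler, and the sketch assumes uppercase A/C/G/T input only.

```python
# Map each base to its complement (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """A DNA palindrome is a sequence equal to its own reverse complement."""
    return seq == reverse_complement(seq)


for kmer in ("AGCGCT", "AGCGCTA"):
    print(kmer, reverse_complement(kmer), is_dna_palindrome(kmer))
# AGCGCT AGCGCT True
# AGCGCTA TAGCGCT False
```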
In short, an even-length k-mer can be its own reverse complement, which creates ambiguity in the de Bruijn graph of an assembler such as SPAdes and makes assembling reads more challenging.
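If you would rather check the claim than take it on trust, a brute-force count over every possible k-mer is feasible for small k. This is only an illustrative sketch (the function name `count_palindromes` is hypothetical, and the loop visits all 4^k sequences, so keep k small).

```python
from itertools import product

# Map each base to its complement (uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_palindromes(k: int) -> int:
    """Count the k-mers that are equal to their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == kmer.translate(COMPLEMENT)[::-1]:
            count += 1
    return count


for k in range(1, 9):
    print(k, count_palindromes(k))
# Odd k always gives 0; even k gives 4 ** (k // 2): 4, 16, 64, 256.
```

For even k the first half of a palindrome fixes the second half, which is why the count is 4^(k/2); for odd k the middle base would have to pair with itself, so the count is zero.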
Here are three of my favorite Python bioinformatics books, in case you want to learn more.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model