This tutorial shows how you can estimate the genome size and the best k-mer length for genome assembly using KmerGenie. Moreover, the tutorial shows how to install the tool.
Why Estimate Genome Size and Best K-mer Length for assembly?
Many genome assemblers use a k-mer approach to assemble the reads into contigs. However, most of them ( e.g Velvet, SOAPdenovo 2, ABySS, or Minia) ask for the k-mer length rather than trying to estimate it. Thus, in these cases, finding the optimal k-mer size is very important. Furthermore, estimating the genome size from reads is important because often scientists either like or are asked to report the genome size when reporting a draft genome.
Here we use KmerGenie to estimate these two metrics by using the k-mer abundance histogram for many k-mer sizes. Next, for each value of k, it estimates the number of distinct k-mers in the sample and returns the k-mer size which maximizes this number.
It is easy – see the steps below:
# create kmergenie environment which includes it and all its dependencies $ conda create -n kmergenie_env -c bioconda kmergenie # use this command every time you want to run kmergenie. It activates the conda environment with the tool $ source activate kmergenie_env
Running KmerGenie – Step-by-Step
Now that we have the input data and KmerGenie is installed, we can carry on running the tool.
If you need to estimate the best k-mer size for one file, you can use the following syntax:
# syntax $ kmergenie <read_file> [options] # real case # -l minimum k-mer size for search # -k maximum k-mer size for search # -s 2 intervals between k-mers # -o output prefix # -t number of threads $ kmergenie hiseq_reads_ecoli_R1.fastq -l 21 -k 121 -s 2 -o out_file.txt -t 16 # please add the flag "--diploid" in case your genome is diploid
However, in most cases, multiple files are given. In this case, you need to create a one-column file with the paths to the sequence files. See below:
# syntax $ kmergenie <file_with_paths_for_sequences> [options] # real case # -l minimum k-mer size for search # -k maximum k-mer size for search # -s 2 intervals between k-mers # -o output prefix # -t number of threads $ kmergenie genomes.txt -l 21 -k 121 -s 2 -o out_file.txt -t 16 # please add the flag "--diploid" in case your genome is diploid
One it is done running, it tells you what is the best k-mer length for the dataset – in this case, it was 61.
For the Escherichia coli 042 case, KmerGenie estimated the genome size to be 5,174,062 and the real genome size is 5,241,977 – not bad at all !! Also, the best k-mer length was 61, and this should be the k-mer size when running an assembler such as Velvet, SOAPdenovo 2, ABySS, or Minia.
KmerGenie outputs some great plots which show why k=61 was the choice. Please check the HTML output – see part of it below.
Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists Paperback by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model
In summary, this tutorial shows how to estimate the genome size and how to estimate the best k-mer length for when assembling a genome. Moreover, it is important to point out that assemblers such as SPAdes or IDBA generate perform better with their k-mer size estimation. However, this is still very useful for tools such as Velvet, SOAPdenovo 2, ABySS, or Minia.