Estimate Genome Size and Best k-mer Size for Assembly – Step-by-Step

by:

Bioinformatics

This tutorial shows how you can estimate the genome size and the best k-mer length for genome assembly using KmerGenie. Moreover, the tutorial shows how to install the tool.

Why Estimate Genome Size and Best K-mer Length for assembly?

Many genome assemblers use a k-mer approach to assemble the reads into contigs. However, most of them ( e.g Velvet, SOAPdenovo 2, ABySS, or Minia) ask for the k-mer length rather than trying to estimate it. Thus, in these cases, finding the optimal k-mer size is very important. Furthermore, estimating the genome size from reads is important because often scientists either like or are asked to report the genome size when reporting a draft genome.

Here we use KmerGenie to estimate these two metrics by using the k-mer abundance histogram for many k-mer sizes. Next, for each value of k, it estimates the number of distinct k-mers in the sample and returns the k-mer size which maximizes this number.

Installing KmerGenie

Luckily KmerGenie is on bioconda. However, the tool requires Python 2.7, so I would recommend you creating a conda environment just for KmerGenie, so you don’t need to mess up your os Python.

It is easy – see the steps below:

# create kmergenie environment which includes it and all its dependencies
$ conda create -n kmergenie_env -c bioconda kmergenie

# use this command every time you want to run kmergenie. It activates the conda environment with the tool
$ source activate kmergenie_env

Running KmerGenie – Step-by-Step

For this tutorial, 1,000,000 Illumina HiSeq reads pairs of Escherichia coli 042 were simulated. They are used to exemplify the estimation of best k-mer length and genome size estimation.

Now that we have the input data and KmerGenie is installed, we can carry on running the tool.

If you need to estimate the best k-mer size for one file, you can use the following syntax:

# syntax
$ kmergenie <read_file> [options]

# real case
# -l minimum k-mer size for search
# -k maximum k-mer size for search
# -s 2 intervals between k-mers
# -o output prefix
# -t number of threads
$ kmergenie hiseq_reads_ecoli_R1.fastq -l 21 -k 121 -s 2 -o out_file.txt -t 16

# please add the flag "--diploid" in case your genome is diploid

However, in most cases, multiple files are given. In this case, you need to create a one-column file with the paths to the sequence files. See below:

# syntax
$ kmergenie <file_with_paths_for_sequences> [options]

# real case
# -l minimum k-mer size for search
# -k maximum k-mer size for search
# -s 2 intervals between k-mers
# -o output prefix
# -t number of threads
$ kmergenie genomes.txt -l 21 -k 121 -s 2 -o out_file.txt -t 16

# please add the flag "--diploid" in case your genome is diploid

One it is done running, it tells you what is the best k-mer length for the dataset – in this case, it was 61.

Results

For the Escherichia coli 042 case, KmerGenie estimated the genome size to be 5,174,062 and the real genome size is 5,241,977 – not bad at all !! Also, the best k-mer length was 61, and this should be the k-mer size when running an assembler such as Velvet, SOAPdenovo 2, ABySS, or Minia.

KmerGenie outputs some great plots which show why k=61 was the choice. Please check the HTML output – see part of it below.

More Resources

Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.

Conclusion

In summary, this tutorial shows how to estimate the genome size and how to estimate the best k-mer length for when assembling a genome. Moreover, it is important to point out that assemblers such as SPAdes or IDBA generate perform better with their k-mer size estimation. However, this is still very useful for tools such as Velvet, SOAPdenovo 2, ABySS, or Minia.

Related Posts