Scaffolding Genome: Increase Draft Genome N50 Length – Step-by-Step

This tutorial demonstrates how to increase the N50 of a draft genome with Scaffold_builder, which takes a pre-assembled genome (a set of contigs) and a closely related reference genome as input.

What is Scaffolding a Genome?

First and foremost, scaffolding in bioinformatics means connecting a set of contigs (assembled reads) into one or more scaffolds. In general, the contigs in a scaffold are either separated by gaps whose length is estimated from a homologous reference, or joined where they overlap with a certain level of homology.
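To make the idea of "contigs separated by gaps" concrete, here is a minimal sketch of the joining step. It assumes the contigs are already ordered and oriented and that the gap lengths are already known. The function name join_contigs and the toy sequences are hypothetical, and this is not how Scaffold_builder itself is implemented.

```python
def join_contigs(contigs, gap_lengths, gap_char="N"):
    """Join ordered contigs into one scaffold, padding each gap with gap_char.

    contigs:     list of sequence strings, already ordered and oriented
    gap_lengths: estimated gap size between each consecutive pair of contigs
    """
    if not contigs:
        return ""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")

    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append(gap_char * gap)  # unknown bases are written as N
        parts.append(contig)
    return "".join(parts)


# Toy example: three short contigs with estimated gaps of 3 and 2 bases
scaffold = join_contigs(["ACGT", "GGCC", "TTAA"], [3, 2])
print(scaffold)
```

The runs of N keep the contigs at roughly the right distance from each other even though the bases in between are unknown, which is also why a scaffold can be much longer than the contigs it contains.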

Below is the workflow of Scaffold_builder and how it creates the scaffolds. For more details on the tool, please read its paper.

Figure 1: Scaffold_builder workflow. This image was taken from the tool paper.

Scaffolding Genome: Case Study

Next, we use a set of assembled Escherichia coli contigs as the query and the complete genome of Escherichia coli 042 (from NCBI) as the closely related reference.

Installing Scaffold_builder

Scaffold_builder is available on Bioconda, which means conda can install it and take care of its dependencies.

Furthermore, Scaffold_builder runs on Python 2.x, so I strongly advise you to create a dedicated conda environment. The environment gets its own Python 2.x and the other dependencies, and nothing you already have installed is downgraded.

# create conda environment with scaffold_builder and dependencies
$ conda create -n scaffold_builder_env -c bioconda scaffold_builder
# activate it (on conda versions older than 4.4, use "source activate" instead)
$ conda activate scaffold_builder_env

Running it

Scaffolding a genome with Scaffold_builder is very simple: pass the contigs as the query and the related genome as the reference.

# syntax example
$ scaffold_builder.py -q {QUERY_FASTA} -r {REFERENCE_FASTA}

# real case
$ scaffold_builder.py -q ecoli_Contigs.fna -r ecoli_reference.fasta

Results

For the Escherichia coli case, scaffolding improved the draft genome substantially: the number of sequences dropped about 31-fold (2,466 to 79) and the N50 increased more than 154-fold (1,427 to 220,774).

                          Contigs    Scaffolds
Number of sequences         2,466           79
Average Sequence Length     1,243       64,992
Longest Sequence            6,126    1,168,409
N50                         1,427      220,774
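If you want to check numbers like these on your own files, the sketch below shows one common way to compute N50 from a list of sequence lengths, plus the fold changes quoted above. The helper names (n50, fasta_lengths) and the toy FASTA records are my own assumptions, not part of Scaffold_builder. N50 is taken here as the length L such that sequences of length L or longer cover at least half of the total length.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total length
            return length
    return 0


def fasta_lengths(lines):
    """Yield the length of each record from an iterable of FASTA lines."""
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                yield current
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        yield current


# Toy FASTA content (hypothetical records, not the E. coli data)
toy_fasta = [">contig_1", "ACGTAC", ">contig_2", "GG", "TT", ">contig_3", "A"]
toy_lengths = list(fasta_lengths(toy_fasta))
print("lengths:", toy_lengths, "N50:", n50(toy_lengths))

# Fold changes for the numbers reported in the table
seq_fold = 2466 / 79       # reduction in number of sequences
n50_fold = 220774 / 1427   # increase in N50
print("sequences: %.1fx fewer, N50: %.1fx larger" % (seq_fold, n50_fold))
```

To use it on real files, open the FASTA file and pass the file object to fasta_lengths, for example list(fasta_lengths(open("ecoli_Contigs.fna"))).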
