Scaffolding Genome: Increase Draft Genome N50 Length – Step-by-Step



This tutorial demonstrates how to increase the N50 of a draft genome with Scaffold_builder, a tool that takes a pre-assembled set of contigs and a closely related reference genome as input.

What is Scaffolding a Genome?

First and foremost, scaffolding in bioinformatics means connecting a set of contigs (assembled reads) into one or more scaffolds. In general, neighbouring contigs in a scaffold are either separated by gaps of known length, estimated from a homologous reference, or joined where they overlap with a certain level of homology.
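
To make the idea of a gap of known length concrete, here is a minimal shell sketch that joins two contigs with a run of N characters. The helper name join_with_gap and the toy sequences are made up for illustration, and using N as the gap character is a common convention assumed here; this is not Scaffold_builder's own code.

```shell
# join_with_gap LEFT GAP_LENGTH RIGHT (hypothetical helper)
# Prints LEFT and RIGHT joined by GAP_LENGTH N characters.
join_with_gap() {
    gap=$(printf '%*s' "$2" '' | tr ' ' 'N')   # build a run of N characters
    printf '%s%s%s\n' "$1" "$gap" "$3"
}

# Two toy contigs separated by an estimated 5-base gap
join_with_gap ACGTAC 5 TTGACA   # prints ACGTACNNNNNTTGACA
```

A real scaffolder also has to decide the order and orientation of the contigs before joining them, which this sketch leaves out.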

Below is the workflow of Scaffold_builder and how it creates the scaffolds. For more details on the tool, please read its paper.

Figure 1: Scaffold_builder workflow. This image was taken from the tool paper.

Scaffolding Genome: Case Study

Next, we use an assembled set of Escherichia coli contigs as the query and the complete genome of Escherichia coli 042 (NCBI) as the closely related reference.

Installing Scaffold_builder

Scaffold_builder is now available in Bioconda, which means you can install it with conda and Bioconda will take care of its dependencies.

Furthermore, Scaffold_builder uses Python 2.x, so I strongly advise you to create a dedicated conda environment for it. The environment gets its own Python 2.x and the other dependencies, so nothing you already have installed is downgraded.

# create conda environment with scaffold_builder and dependencies
$ conda create -n scaffold_builder_env -c bioconda scaffold_builder
$ source activate scaffold_builder_env

Running it

Scaffolding genomes with Scaffold_builder is very simple, as shown below.

#syntax example

# real case
$ -q ecoli_Contigs.fna -r ecoli_reference.fasta


For the Escherichia coli draft genome, the improvements after scaffolding were amazing: the number of sequences dropped by 31x and the N50 increased by 154x.

Metric                          Before scaffolding    After scaffolding
Number of sequences             2,466                 79
Average sequence length (bp)    1,243                 64,992
Longest sequence (bp)           6,126                 1,168,409
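
If you want to compute this kind of summary for your own assemblies, the following is a rough sketch using only awk and sort. The function names fasta_lengths and assembly_stats are hypothetical helpers written for this example, not part of Scaffold_builder, and N50 is taken in its usual sense: the length of the sequence at which the running total of the lengths, sorted from longest to shortest, first reaches half of the total assembly size.

```shell
# fasta_lengths FILE (hypothetical helper)
# Prints the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '
        /^>/ { if (seen) print n; n = 0; seen = 1; next }   # header: flush previous record
        { gsub(/[ \t\r]/, ""); n += length($0) }             # sequence line: count bases
        END { if (seen) print n }                            # flush the last record
    ' "$1"
}

# assembly_stats FILE (hypothetical helper)
# Reports sequence count, average length, longest sequence and N50.
assembly_stats() {
    fasta_lengths "$1" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no sequences found"; exit 1 }
            # walk the lengths from longest to shortest until half
            # of the total assembly size has been covered
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { n50 = len[i]; break }
            }
            printf "sequences\t%d\naverage_length\t%d\nlongest\t%d\nN50\t%d\n", NR, total / NR, len[1], n50
        }'
}
```

Running assembly_stats on the contig file before scaffolding (for example ecoli_Contigs.fna) and again on the scaffolded FASTA gives the two columns to compare. The average is truncated to a whole number of bases.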
