This tutorial demonstrates how to increase the N50 of a draft genome with Scaffold_builder, which takes a pre-assembled genome (a set of contigs) and a closely related reference genome as input.
What is Scaffolding a Genome?
In bioinformatics, scaffolding means connecting a set of contigs (assembled reads) into one or more scaffolds. In general, the contigs in a scaffold are either separated by gaps of known length, estimated from a homologous reference, or joined where they overlap with a certain level of homology.
Below is the Scaffold_builder workflow, showing how the tool creates the scaffolds. For more details, please read its paper.

Figure 1: Scaffold_builder workflow. This image was taken from the tool's paper.
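To make the idea of joining contigs across gaps concrete, here is a minimal Python sketch. It is not Scaffold_builder's actual implementation: the function name `build_scaffold`, the `(ref_start, sequence)` input format, and the `min_gap` parameter are all hypothetical, and overlapping contigs are not merged here but simply separated by a short placeholder gap.

```python
def build_scaffold(placements, min_gap=1):
    """Join contigs into one scaffold, filling gaps with 'N'.

    placements: list of (ref_start, sequence) tuples, where ref_start is the
    0-based reference position at which the contig aligns (assumed to be
    known already, e.g. from an alignment step).
    min_gap: hypothetical placeholder gap used when contigs touch or overlap.
    """
    pieces = []
    cursor = None  # reference position just past the last placed base
    for ref_start, seq in sorted(placements):
        if cursor is not None:
            # Gap length is estimated from the reference coordinates.
            gap = ref_start - cursor
            pieces.append("N" * max(gap, min_gap))
        pieces.append(seq)
        end = ref_start + len(seq)
        cursor = end if cursor is None else max(cursor, end)
    return "".join(pieces)


# Two contigs that map 6 bases apart on the reference.
scaffold = build_scaffold([(10, "ACGT"), (0, "TTGA")])
```

The contigs are sorted by reference position first, so the input order does not matter; the gap between them becomes a run of `N` characters of the estimated length.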
Scaffolding a Genome: Case Study
Next, we use a set of assembled Escherichia coli contigs as the query and the complete genome of Escherichia coli 042 (from NCBI) as the closely related reference.
Installing Scaffold_builder
Scaffold_builder is available on Bioconda, so conda can install it together with its dependencies.
Scaffold_builder requires Python 2.x, so I strongly advise creating a dedicated conda environment. The environment gets its own Python 2.x and the other dependencies, so none of the packages you already have installed elsewhere are downgraded.
# create conda environment with scaffold_builder and dependencies
$ conda create -n scaffold_builder_env -c bioconda scaffold_builder
$ conda activate scaffold_builder_env
Running it
Scaffolding a genome with Scaffold_builder is simple: pass the contigs as the query and the related genome as the reference.
# syntax example
$ scaffold_builder.py -q {QUERY_FASTA} -r {REFERENCE_FASTA}
# real case
$ scaffold_builder.py -q ecoli_Contigs.fna -r ecoli_reference.fasta
Results
For the Escherichia coli case, scaffolding improved the draft genome substantially: the number of sequences dropped by about 31x (2,466 to 79) and the N50 increased by more than 154x (1,427 to 220,774 bp).
Metric | Contigs | Scaffolds |
Number of sequences | 2,466 | 79 |
Average sequence length (bp) | 1,243 | 64,992 |
Longest sequence (bp) | 6,126 | 1,168,409 |
N50 (bp) | 1,427 | 220,774 |
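The metrics in this table can be computed from any FASTA file. The sketch below is one way to do it in plain Python; the function names (`fasta_lengths`, `n50`, `summarize`) are hypothetical, and it uses the common definition of N50 as the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total bases. The original numbers may have been produced with a different tool.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect the four metrics reported in the results table."""
    return {
        "sequences": len(lengths),
        "average": sum(lengths) / len(lengths) if lengths else 0,
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Passing an open file handle for the contigs or the scaffolds to `fasta_lengths` and the result to `summarize` gives one column of the table. The fold changes follow directly from the table values: 2,466 / 79 ≈ 31.2 and 220,774 / 1,427 ≈ 154.7.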
More Resources
Here are two of my favorite Python data visualization books, in case you want to learn more about that topic.
- Mastering Python Data Visualization by Kirthi Raman
- Python Data Visualization: An Easy Introduction to Data Visualization in Python with Matplotlip, Pandas, and Seaborn by Samuel Burns