Genome Assembly for Omics – Step-by-Step



This tutorial shows, step by step, how to assemble single-cell microbial genomes and metagenomic datasets.

We will use SPAdes to assemble the datasets and walk through the most important parameters the tool accepts.

Microbial Genome Assembly

In simple words, assembling a genome means putting together the sequenced random fragments (known as reads) into longer sequences (called contigs).

When the order and orientation of the contigs are known, they can be joined into even longer sequences called scaffolds. This tutorial will focus on how to run SPAdes to assemble microbial genomes and metagenomes.

Interested in learning more about genome assembly? Please see the lecture by Dr. Robert Edwards, in which he explains how SPAdes assembles genomes using k-mers and De Bruijn graphs.
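
To make the k-mer and De Bruijn graph idea a bit more concrete, here is a minimal Python sketch. It is not how SPAdes is implemented: it ignores reverse complements, coverage, and sequencing errors, and the two toy reads and the function names are made up for illustration. It only shows how reads are cut into k-mers and how each k-mer becomes an edge between its (k-1)-mer prefix and suffix.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its first k-1 bases to its last k-1 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads that overlap by four bases (hypothetical data).
reads = ["ATGGCGT", "GCGTGCA"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from ATG follows the overlap between the two reads, which is the intuition behind turning reads into contigs.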

Installing SPAdes

You can install SPAdes from Bioconda:

$ conda install -c bioconda spades=3.13.0
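
After installing, it can be handy to confirm that the SPAdes executable is actually on your PATH. The sketch below is one way to do that from Python; the helper name spades_version is hypothetical, and it assumes the conda package exposes spades.py and that it accepts --version.

```python
import shutil
import subprocess


def spades_version(executable="spades.py"):
    """Return the version string reported by SPAdes, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some tools print their version to stderr, so fall back to it if stdout is empty.
    return (result.stdout or result.stderr).strip()


print(spades_version() or "spades.py not found on PATH")
```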

Single-Cell Microbial Genome Assembly

The command line below runs SPAdes on a single-cell microbial genome with mismatch correction enabled (--careful), which tries to reduce the number of mismatches and short indels in the assembly.

Moreover, it builds the De Bruijn graph using k-mer sizes 21, 33, 55, and 77 for the reads in pair 1 (-1) and pair 2 (-2).

$ spades.py -k 21,33,55,77 --careful -1 ecoli_miseq_reads_R1.fastq -2 ecoli_miseq_reads_R2.fastq -o spades_output_ecoli/

If unpaired (single) reads are also available for the library, pass that FASTQ file with the -s parameter.

The assembly output will be stored at spades_output_ecoli/, and scaffolds.fasta should contain the best assembly generated by SPAdes. Please use this file for further analysis.
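
Before moving on to further analysis, a quick sanity check of scaffolds.fasta can be useful. The following is a small, dependency-free sketch that reports the number of scaffolds, the total length, the longest scaffold, and the N50. The function names are hypothetical, and dedicated assembly evaluation tools report far more detail.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(path):
    """Collect a few basic assembly statistics for one FASTA file."""
    lengths = read_fasta_lengths(path)
    return {
        "scaffolds": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

For the run above, calling summarize("spades_output_ecoli/scaffolds.fasta") would return the statistics as a dictionary.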

Metagenomic Assembly (metaspades tutorial)

The command line below runs metaSPAdes (SPAdes for metagenomes) on a metagenome dataset.

As before, it builds the De Bruijn graph using k-mer sizes 21, 33, 55, and 77 for the reads in pair 1 (-1) and pair 2 (-2).

$ metaspades.py -k 21,33,55,77 -1 sample_R1.fastq -2 sample_R2.fastq -o spades_metagenome/

It is important to point out that metaSPAdes does not support the mismatch correction option (--careful).

The output will be written to spades_metagenome/. As before, use the scaffolds.fasta file for further analysis.
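
Metagenome assemblies usually contain many very short scaffolds, so a common first step is to look only at the longer ones. The sketch below assumes the usual SPAdes header layout, NODE_<id>_length_<bp>_cov_<coverage>, and reads the length and coverage straight from the header. The helper names and the 1000 bp cutoff are hypothetical choices for illustration, not recommendations.

```python
import re

# Assumed SPAdes header layout: NODE_<id>_length_<bp>_cov_<coverage>
HEADER = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(header):
    """Extract node id, length, and coverage from a SPAdes-style FASTA header."""
    match = HEADER.match(header.lstrip(">"))
    if match is None:
        return None
    return {
        "node": int(match.group(1)),
        "length": int(match.group(2)),
        "coverage": float(match.group(3)),
    }


def long_scaffolds(headers, min_length=1000):
    """Keep the parsed headers of scaffolds that are at least min_length bases long."""
    kept = []
    for header in headers:
        info = parse_header(header)
        if info is not None and info["length"] >= min_length:
            kept.append(info)
    return kept


# Hypothetical headers, as they might appear in spades_metagenome/scaffolds.fasta.
headers = [">NODE_1_length_5000_cov_12.5", ">NODE_2_length_300_cov_3.1"]
print(long_scaffolds(headers))
```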

Important Parameters

Last but not least, if a large input dataset needs more memory, set -m to the memory limit in Gb (default: 250 Gb).

Also, if you have more threads available, set -t to the number of threads to use during assembly (default: 16).

Moreover, if you have Sanger data, pass the file with --sanger; PacBio data goes under --pacbio and Nanopore data under --nanopore.
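
If you run SPAdes from a script, the parameters above can be assembled into a command programmatically. This is a minimal sketch: build_spades_command is a hypothetical helper, the 64 Gb and 8 thread values are arbitrary examples, and it only prints the command rather than running it.

```python
import shlex


def build_spades_command(r1, r2, outdir, kmers=(21, 33, 55, 77), careful=False,
                         pacbio=None, nanopore=None, sanger=None,
                         memory_gb=None, threads=None):
    """Build a spades.py argument list from the options covered in this tutorial."""
    cmd = ["spades.py", "-k", ",".join(str(k) for k in kmers)]
    if careful:
        cmd.append("--careful")
    cmd += ["-1", r1, "-2", r2]
    # Optional long-read or Sanger libraries.
    if pacbio is not None:
        cmd += ["--pacbio", pacbio]
    if nanopore is not None:
        cmd += ["--nanopore", nanopore]
    if sanger is not None:
        cmd += ["--sanger", sanger]
    # Resource limits; when omitted, SPAdes falls back to its own defaults.
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd += ["-o", outdir]
    return cmd


cmd = build_spades_command(
    "ecoli_miseq_reads_R1.fastq",
    "ecoli_miseq_reads_R2.fastq",
    "spades_output_ecoli/",
    careful=True,
    memory_gb=64,
    threads=8,
)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) once you are happy with the printed command.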

