Genome Assembly for Omics – Step-by-Step

by:

Bioinformatics

This tutorial shows, step by step, how to assemble single-cell microbial genomes and metagenomic datasets.

We will use SPAdes to assemble the datasets and go over the main parameters the tool accepts.

Why is it important to assemble a genome?

Assembling a genome is an important process in genomics that involves putting together the fragmented pieces of DNA sequence into a complete representation of an organism’s genetic material. This information can then be used to understand an organism’s biological processes, evolution, and disease risk, among other things.

Having a complete and accurate genome assembly provides a foundation for many downstream studies, such as gene discovery, functional annotation, and comparative genomics. This can lead to new insights into the biology of organisms, including the identification of disease-causing genes, the discovery of new drug targets, and the improvement of crop yields.

In addition, genome assembly also provides an important resource for conservation biology, as it allows scientists to understand the genetic diversity of endangered species and inform conservation efforts.

In summary, the assembly of a genome is a critical step in unlocking the secrets of an organism’s biology and has far-reaching implications for many areas of science and technology.

Microbial Genome Assembly

In simple words, assembling a genome means putting together the sequenced random fragments (known as reads) into longer sequences (called contigs).

When the order of the contigs is known, they can be linked into longer sequences called scaffolds. This tutorial focuses on how to run SPAdes to assemble microbial genomes and metagenomes.

Interested in learning more about genome assembly? Please see the lecture by Dr. Robert Edwards, where he explains how SPAdes assembles genomes using k-mers and De Bruijn graphs.
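To make the idea of a k-mer concrete, here is a minimal shell sketch that lists every overlapping k-mer in one read. It is an illustration only, not how SPAdes works internally; the function name `kmers` and the toy read are made up for this example.

```shell
# List every overlapping k-mer of length k in a sequence (illustration only;
# the helper name and the toy read below are assumptions for this sketch).
kmers() {
    read_seq=$1
    k=$2
    len=${#read_seq}
    i=1
    # A sequence of length L contains L - k + 1 k-mers.
    while [ $((i + k - 1)) -le "$len" ]; do
        printf '%s\n' "$read_seq" | cut -c "$i-$((i + k - 1))"
        i=$((i + 1))
    done
}

# Split a toy 6-base read into 3-mers
kmers ACGTAC 3
```

Each k-mer overlaps the next one by k-1 bases, and those overlaps are what the De Bruijn graph uses to connect them.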

Installing SPAdes

You can install SPAdes from Bioconda:

$ conda install -c conda-forge -c bioconda spades=3.13.0
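Before running an assembly, it can help to confirm that the install worked. The sketch below assumes `spades.py` should be on your PATH after activating the conda environment; the helper name `check_spades` is made up for this example.

```shell
# Report the installed SPAdes version, or fail with a message if it is missing.
# (check_spades is a hypothetical helper, not part of SPAdes.)
check_spades() {
    if command -v spades.py >/dev/null 2>&1; then
        spades.py --version
    else
        echo "spades.py not found on PATH; is the conda environment active?" >&2
        return 1
    fi
}

check_spades || echo "Install SPAdes before continuing."
```

SPAdes also ships with a self-test, `spades.py --test`, which assembles a small bundled dataset.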

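Single-Cell Microbial Genome Assembly

The command line below runs SPAdes on a microbial genome with the --careful option, which tries to reduce the number of mismatches and short indels in the final assembly. For true single-cell (MDA) data, also add the --sc flag.

It builds the De Bruijn graph using k-mer sizes 21, 33, 55, and 77 from the forward reads (-1) and reverse reads (-2) of the paired-end library.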

$ spades.py -k 21,33,55,77 --careful -1 ecoli_miseq_reads_R1.fastq -2 ecoli_miseq_reads_R2.fastq -o spades_output_ecoli/

If the library also contains unpaired (single) reads, pass that FASTQ file with the -s parameter.
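As a rough sketch of how the -s parameter fits into the command above, the wrapper below adds it only when an unpaired-reads file is given. The function name `assemble` and the unpaired file name are hypothetical; the other options are the ones used in this tutorial.

```shell
# Hypothetical wrapper: run the paired-end assembly and add -s only when an
# unpaired-reads file is supplied. Usage: assemble R1 R2 OUTDIR [UNPAIRED]
assemble() {
    r1=$1
    r2=$2
    outdir=$3
    unpaired=${4:-}
    # Rebuild the argument list that will be passed to spades.py
    set -- -k 21,33,55,77 --careful -1 "$r1" -2 "$r2"
    if [ -n "$unpaired" ]; then
        set -- "$@" -s "$unpaired"
    fi
    spades.py "$@" -o "$outdir"
}

# Example call (the unpaired file name is a placeholder):
# assemble ecoli_miseq_reads_R1.fastq ecoli_miseq_reads_R2.fastq spades_output_ecoli/ ecoli_miseq_unpaired.fastq
```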

The assembly output will be stored at spades_output_ecoli/, and scaffolds.fasta should contain the best assembly generated by SPAdes. Please use this file for further analysis.
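A quick sanity check on the result is to count the scaffolds and sum their lengths. The snippet below is a minimal sketch using awk; the helper name `fasta_summary` is made up for this example, and it assumes a plain, uncompressed FASTA file.

```shell
# Print the number of sequences and the total assembled length of a FASTA file.
# (fasta_summary is a hypothetical helper, not part of SPAdes.)
fasta_summary() {
    awk '
        /^>/ { n++; next }            # header line: one more sequence
        { total += length($0) }       # sequence line: add its bases
        END { printf "sequences=%d total_bp=%d\n", n, total }
    ' "$1"
}

# Run it on the SPAdes output if the file exists
if [ -f spades_output_ecoli/scaffolds.fasta ]; then
    fasta_summary spades_output_ecoli/scaffolds.fasta
fi
```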

Metagenomic Assembly (metaspades tutorial)

The command line below runs metaSPAdes (SPAdes for metagenomes) on a metagenomic dataset.

As before, it builds the De Bruijn graph using k-mer sizes 21, 33, 55, and 77 from the forward reads (-1) and reverse reads (-2).

$ metaspades.py -k 21,33,55,77 -1 sample_R1.fastq -2 sample_R2.fastq -o spades_metagenome/

Note that metaSPAdes does not support the --careful option.

The output will be written to spades_metagenome/. As before, use the scaffolds.fasta file for further analysis.

Important Parameters

Large input datasets may need more memory. Set -m to the memory limit in Gb (default: 250 Gb); SPAdes stops if it reaches this limit.

If more CPU threads are available, set -t to the number of threads to use during assembly (default: 16).

If you also have Sanger, PacBio, or Nanopore data, pass the files with --sanger, --pacbio, or --nanopore, respectively.
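The options above can be combined in one call. The wrapper below is only a sketch: the function name `assemble_hybrid`, the file names, and the example values of 8 threads and 64 Gb are assumptions for illustration.

```shell
# Hypothetical wrapper around the -t, -m, and --nanopore options.
# Usage: assemble_hybrid R1 R2 NANOPORE_READS OUTDIR [THREADS] [MEM_GB]
assemble_hybrid() {
    threads=${5:-16}     # falls back to the SPAdes default of 16 threads
    mem_gb=${6:-250}     # falls back to the SPAdes default limit of 250 Gb
    spades.py -t "$threads" -m "$mem_gb" \
        -1 "$1" -2 "$2" --nanopore "$3" -o "$4"
}

# Example call (placeholder file names; 8 threads, 64 Gb memory limit):
# assemble_hybrid sample_R1.fastq sample_R2.fastq sample_nanopore.fastq spades_hybrid/ 8 64
```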
