Reference-Free Metagenomic Datasets Comparison – Step-by-Step



Reference-free metagenomic methods are very useful for comparing datasets with high levels of unknown sequences. This tutorial walks through the whole process, from the metagenomic reads of multiple samples to the dataset comparison.

Reference-Free Comparative Metagenomics

In metagenomics, it is very common to annotate the taxa and functions for a given set of samples and use the results to compare these samples.

However, this approach is not as effective when comparing samples with high levels of unknown sequences. Here, we present a tutorial on crAss, which uses cross-assembly of sequences from different samples to estimate the similarity between metagenomes.

Docker Image with crAss and other tools

CrAss requires many dependencies. Generally, this is not a problem because most bioinformatics tools and their dependencies live in the Bioconda repository. However, that is not the case for crAss and most of its dependencies.

Don’t freak out! I've got you covered. I have created a Docker image with crAss and all its dependencies, plus the SPAdes assembler and the BWA aligner.

Please use the command below to pull the image onto your computer.

$ docker pull onestopdataanalysis/crass:latest

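Run the Docker image and mount the directory with your metagenomic data into it.

# syntax example (replace /path/to/your/data/ with the host directory holding your data)
$ docker run -i -v /path/to/your/data/:/data/ -t onestopdataanalysis/crass:latest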

# real case
$ docker run -i -v /Users/onestop_data/Desktop/crass_files/:/data/ -t onestopdataanalysis/crass:latest
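
Once inside the container, it can help to confirm that the tools used in the rest of the tutorial are reachable before starting a long run. The helper below is a minimal sketch of such a check; the function name check_tools is hypothetical, and the executable names you pass to it (for example spades.py, bwa and crAss.pl) are assumptions about how the tools are installed in the image.

```shell
# Report whether each named command can be found on the PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools spades.py bwa crAss.pl lists which of the three are on the PATH. A tool reported as missing may still be present in the image under a different name or location, such as under /src/.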

Step-by-Step: Reference-Free Comparative Metagenomics

Running crAss is simple once you have all the files needed. This is not the case for most people, so all the steps are detailed below.

Here we use SPAdes to generate the cross-assembly of the samples. We will be using the crAss test set (several viral metagenomes from humans and water), which should be in the Docker image under /src/. Below is the SPAdes call to assemble the FASTA files of this test set.

# change directory to mounted volume
$ cd ../data/

# create work directory
$ mkdir workdir
$ cd workdir

# create directory for data and copy FASTA
$ mkdir data/
$ cp /src/15D_example/*.fasta data/

# merge all FASTA files for cross-assembly
$ cat data/*.fasta > all.fasta

# run spades to create cross-assembly
# (assumes the spades.py launcher is on the PATH inside the image)
$ spades.py -s all.fasta -o spades_out --only-assembler -t 16
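
Before or after the assembly, a quick sanity check on the merged file can catch a bad merge early. The sketch below counts FASTA records per sample and compares their sum with the count in the merged file; the function names count_reads and check_merge are hypothetical helpers, not part of crAss or SPAdes.

```shell
# Count FASTA records (lines starting with '>') in one file.
count_reads() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Compare the record count of the merged file with the sum over the
# per-sample files. Returns non-zero when the two numbers differ.
# Usage: check_merge MERGED SAMPLE1 [SAMPLE2 ...]
check_merge() {
    merged=$1
    shift
    total=0
    for f in "$@"; do
        n=$(count_reads "$f")
        echo "$f: $n reads"
        total=$((total + n))
    done
    merged_n=$(count_reads "$merged")
    echo "$merged: $merged_n reads (expected $total)"
    [ "$merged_n" -eq "$total" ]
}
```

In the work directory this would be called as check_merge all.fasta data/*.fasta. A mismatch usually means one of the input files does not end with a newline, so cat glued its last sequence line to the next header.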

Now that we have the cross-assembly of all the reads, the second step is to map the reads back to the cross assembly.

# build cross-assembly bwa index
$ bwa index spades_out/scaffolds.fasta

# align reads to cross assembly and output in SAM
# crAss requires SAM rather than BAM
# output SAM into data/
$ bwa mem -t 16 spades_out/scaffolds.fasta all.fasta > data/map.sam
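
It is worth checking how many reads actually mapped back to the cross-assembly, since reads that did not map cannot contribute to the comparison. The following is a rough sketch that reads the FLAG column of the SAM file with awk: it skips header lines and secondary or supplementary records (FLAG bits 0x100 and 0x800) so each read is counted once, and treats bit 0x4 as unmapped. The function name sam_mapping_summary is hypothetical.

```shell
# Summarise a SAM file: prints "total mapped unmapped" for primary records.
# Usage: sam_mapping_summary FILE.sam
sam_mapping_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        {
            flag = $2
            if (int(flag / 256) % 2 == 1) next     # 0x100: secondary alignment
            if (int(flag / 2048) % 2 == 1) next    # 0x800: supplementary alignment
            total++
            if (int(flag / 4) % 2 == 1) unmapped++ # 0x4: read is unmapped
        }
        END { printf "%d %d %d\n", total, total - unmapped, unmapped }
    ' "$1"
}
```

For the file produced above, the call would be sam_mapping_summary data/map.sam. The bit tests use integer division rather than bitwise functions so the snippet does not depend on a particular awk implementation.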

Finally, we have everything crAss needs: a SAM file with the reads mapped against the cross-assembly, and the reads themselves.

Please see below how to run crAss:

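# run crAss on directory with FASTA files and SAM file
# (assumes the crAss.pl script is on the PATH inside the image; otherwise call it by its full path)
$ crAss.pl data/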

Check the folder data/ for the PNG files. They should show the metagenomic comparison of all your samples, with each figure using a different distance measure.

In summary, this tutorial shows how to compare a set of metagenomic datasets where most reads are unknown. Hopefully, you will make great use of it and of the Docker image, which handles all the crAss dependencies for you.
