Painless Metagenomic Contigs Binning – Step-by-Step

This blog post walks step by step through binning metagenomic contigs with two unsupervised methods, CONCOCT and Metabat2. It also uses a simulated metagenome to evaluate both methods.

Contig Binning

Binning metagenomic sequences is a challenging task, especially when no reference dataset is available.

Unsupervised methods that leverage contig coverage have been developed for contig binning; here we focus on CONCOCT and Metabat2.

Curious to learn more about binning? Dr. Robert Edwards talks more about binning here.

Docker: Easy Setup for Dependencies

CONCOCT and Metabat2 have many dependencies, so this tutorial uses bioconda and Docker to make the tools easy to install and the results reproducible. Besides these tools, we are also going to use BWA and samtools.

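# Base image with conda preinstalled
FROM continuumio/miniconda3:4.7.12

# Create the conda environment and put it first on the PATH
RUN conda create -y -n env python=3.7
ENV PATH /opt/conda/envs/env/bin:$PATH

# Configure channels (the channel added last gets the highest priority)
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge

# Install the tools into that environment (-y skips the confirmation prompt)
RUN conda install -y -n env concoct==1.1.0 metabat2==2.15 bwa==0.7.17 samtools

# Entrypoint
CMD /bin/bash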

Case Study

Here, we use an assembled simulated metagenome that contains 10 species. The metagenomic dataset was assembled using SPAdes.

Both tools take as input the assembled contigs and a BAM file with the reads mapped to those contigs.

The commands below create that BAM file with BWA and sort it with samtools (Metabat2 requires the BAM file to be sorted).

# Create index for contigs FASTA
$ bwa index original_contigs.fa

# Align reads against contigs and sort BAM ("-" makes samtools read from the pipe)
$ bwa mem -t 16 original_contigs.fa simShort_single.fasta | samtools sort -o alignment.bam -

# Index BAM file (needed by Metabat2)
$ samtools index alignment.bam
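
Since both binners rely on coverage, it can help to check how many reads actually mapped before moving on. The function below is a minimal sketch and not part of the original workflow: mapping_rate is a hypothetical helper name, and it assumes samtools is on the PATH. It counts primary records with samtools view -c and reports the percentage of them that are mapped.

```shell
# Hypothetical helper: percentage of primary alignments that are mapped.
# Assumes samtools is installed and the BAM file exists.
mapping_rate() {
    bam="$1"
    # -F 0x900 drops secondary (0x100) and supplementary (0x800) records
    total=$(samtools view -c -F 0x900 "$bam")
    # -F 0x904 additionally drops unmapped reads (0x4)
    mapped=$(samtools view -c -F 0x904 "$bam")
    awk -v m="$mapped" -v t="$total" \
        'BEGIN { printf "%.2f\n", (t > 0) ? 100 * m / t : 0 }'
}
```

It would be called as mapping_rate alignment.bam. A very low value suggests the coverage signal used by the binners will be weak.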

CONCOCT

Running CONCOCT takes several steps, which can feel overwhelming, but don’t worry! Below is an easy step-by-step guide to binning your contigs.

It needs the assembled contigs (original_contigs.fa) and the BAM file with the reads mapped against those contigs (alignment.bam).

# adapted from https://github.com/BinPro/CONCOCT/tree/1.1.0

# Slice contigs into smaller sequences
$ cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa

# Generate coverage depth 
$ concoct_coverage_table.py contigs_10K.bed alignment.bam > coverage_table.tsv

# Execute CONCOCT
$ concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/

# Merge sub-contig clustering into original contig clustering
$ merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv

# Create output folder for bins
$ mkdir concoct_output/fasta_bins

# Parse bins into different files
$ extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins

CONCOCT generated 16 bins, whereas the simulated dataset had 10 species. Later in this tutorial, we will examine how accurate they are.
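
To see how the contigs are spread across the bins, the merged clustering file can be summarized directly. This is a minimal sketch: contigs_per_bin is a hypothetical helper name, and it assumes the CSV has one header line followed by two comma-separated columns (contig ID, then cluster ID).

```shell
# Hypothetical helper: number of contigs assigned to each bin.
# Assumes a header line and two columns: contig_id,cluster_id.
contigs_per_bin() {
    awk -F',' 'NR > 1 { n[$2]++ }
               END { for (b in n) print b "\t" n[b] }' "$1" | sort -n
}
```

It would be called as contigs_per_bin concoct_output/clustering_merged.csv and prints one line per bin: the bin ID, a tab, and the contig count.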

Metabat2

Metabat2 is much simpler to run. All you need is the assembled contigs (original_contigs.fa) and a BAM file with the reads mapped against the contigs (alignment.bam).

$ runMetaBat.sh original_contigs.fa alignment.bam

# output bins at original_contigs.fa.metabat-bin*

Metabat2 generated 2 bins, whereas the simulated dataset had 10 species.
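
Each bin is written as a FASTA file, so counting header lines gives the number of contigs per bin. The sketch below works for the bins of either tool; count_fasta_records is a hypothetical helper name, and the .fa extension used in the usage note is an assumption about how the bin files are named.

```shell
# Hypothetical helper: count FASTA records (header lines) in each file.
count_fasta_records() {
    for f in "$@"; do
        # grep -c prints 0 when a file has no header lines
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}
```

For example, count_fasta_records concoct_output/fasta_bins/*.fa would list every CONCOCT bin with its contig count, and the same call on the Metabat2 output directory would do the same for its bins.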

Results

Last but not least, it is time to see how both tools performed at binning the contigs of the 10 species.

CONCOCT overestimated the number of genomes with 16 bins, and Metabat2 underestimated it with 2 bins.

Out of the 19,404 contigs, CONCOCT binned 1,366 contigs into 16 bins and Metabat2 binned 317 contigs into 2 bins.
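
These counts are easier to compare as a share of all contigs. The sketch below only does the arithmetic on the numbers quoted above; percent_binned is a hypothetical helper name.

```shell
# Hypothetical helper: percentage of contigs that ended up in a bin.
# $1 = number of binned contigs, $2 = total number of contigs
percent_binned() {
    awk -v binned="$1" -v total="$2" \
        'BEGIN { printf "%.2f\n", 100 * binned / total }'
}

percent_binned 1366 19404   # CONCOCT  -> 7.04
percent_binned 317 19404    # Metabat2 -> 1.63
```

In other words, both tools left the large majority of the contigs unbinned in this example.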

Next, we can use CheckM to investigate the contamination and completeness of each bin.
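
The post does not show the CheckM command that was used, so the following is only a sketch of one common way to run it, using the lineage_wf workflow. It assumes CheckM and its reference data are installed separately (they are not part of the Dockerfile above) and that the bin files end in .fa; run_checkm is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around CheckM's lineage workflow.
# $1 = directory with the bin FASTA files, $2 = CheckM output directory
run_checkm() {
    bins_dir="$1"
    out_dir="$2"
    # -x: extension of the bin files; --tab_table with -f writes a
    # tab-separated summary to a file instead of printing it to stdout
    checkm lineage_wf -x fa --tab_table -f "${out_dir}.tsv" "$bins_dir" "$out_dir"
}
```

For the CONCOCT bins this would be called as run_checkm concoct_output/fasta_bins checkm_concoct, with the output directory name chosen freely.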

CONCOCT

Bin name    Completeness (%)    Contamination (%)
1           97.81               55.30
9           44.28               8.01
3           16.59               0.00
11          13.89               0.00
10          4.17                0.00
12          4.15                0.19
4           1.22                0.00
8           0.00                0.00
7           0.00                0.00
6           0.00                0.00
5           0.00                0.00
2           0.00                0.00
15          0.00                0.00
14          0.00                0.00
13          0.00                0.00
0           0.00                0.00

Metabat2

Bin name    Completeness (%)    Contamination (%)
Bin 1       81.52               1.91
Bin 2       34.84               0.56

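As you can see above, CONCOCT performed better than Metabat2 at binning the contigs: its best bin reached 97.81% completeness, against 81.52% for the best Metabat2 bin. However, it overestimated the number of bins, and that most complete bin was also very contaminated (55.30%).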

It is also important to point out that some of the organisms are in low abundance in the sample, which explains why some of the CONCOCT bins had a completeness of 0 (which I would read as close to 0).
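
When there are many bins, it can be handy to filter them by completeness and contamination instead of reading the table by eye. The sketch below assumes the results were saved as a tab-separated file with a header line and three columns (bin name, completeness, contamination), which is a simplification of what CheckM actually writes. The filter_bins name and the cutoffs in the usage note are illustrative assumptions, not values taken from this post.

```shell
# Hypothetical helper: list bins that pass both quality cutoffs.
# $1 = TSV file (header + bin, completeness, contamination)
# $2 = minimum completeness (%), $3 = maximum contamination (%)
filter_bins() {
    awk -F'\t' -v minc="$2" -v maxc="$3" \
        'NR > 1 && $2 + 0 >= minc + 0 && $3 + 0 <= maxc + 0 { print $1 }' "$1"
}
```

With the two Metabat2 rows saved to a file, filter_bins metabat2_bins.tsv 50 10 would print only Bin 1, since Bin 2 falls below 50% completeness.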

More Resources

Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.

Conclusion

In summary, this tutorial shows step by step how to bin metagenomic contigs using CONCOCT and Metabat2.

In the example presented here, CONCOCT performed better than Metabat2. However, I would recommend running both tools, since every case is different.
