This blog post is a step-by-step tutorial on binning metagenomic contigs with two unsupervised methods: CONCOCT and Metabat2. It also uses a simulated metagenome to evaluate both methods.
Contig Binning
Binning metagenomic sequences is a challenging task, especially when no reference dataset is available for comparison.
Unsupervised methods that leverage contig coverage have been developed for contig binning; here we focus on two of them, CONCOCT and Metabat2.
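To make the coverage idea concrete, here is a minimal Python sketch with made-up numbers. It assumes mean coverage can be approximated as the total read bases aligned to a contig divided by the contig length. The contig names and values are hypothetical, and real binners such as CONCOCT and Metabat2 combine this signal with sequence composition rather than relying on it alone.

```python
def mean_coverage(mapped_bases, contig_length):
    """Approximate average number of reads covering each base of a contig."""
    return mapped_bases / contig_length


# Hypothetical contigs: (name, contig length in bp, total read bases aligned to it)
contigs = [
    ("contig_a", 10_000, 500_000),
    ("contig_b", 20_000, 1_020_000),
    ("contig_c", 8_000, 40_000),
    ("contig_d", 15_000, 78_000),
]

coverage = {name: mean_coverage(bases, length) for name, length, bases in contigs}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}x")
# contig_a: 50.0x
# contig_b: 51.0x
# contig_c: 5.0x
# contig_d: 5.2x
```

In this toy example, contig_a and contig_b have a similar depth (about 50x), so they plausibly come from the same abundant genome, while contig_c and contig_d (about 5x) look like pieces of a rarer one.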
Curious to learn more about binning? Dr. Robert Edwards talks more about binning here.
Docker: Easy Setup for Dependencies
CONCOCT and Metabat2 have many dependencies, so this tutorial uses bioconda and Docker to make the tools easy to install and the results reproducible. Besides these tools, we are also going to use BWA and samtools.
# Base image with conda
FROM continuumio/miniconda3:4.7.12
# Create a conda environment and put it first on the PATH
RUN conda create -y -n env python=3.7
ENV PATH /opt/conda/envs/env/bin:$PATH
# Add channels and install the tools (-y skips the confirmation prompt during the build)
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge
RUN conda install -y concoct==1.1.0 metabat2==2.15 bwa==0.7.17 samtools
# Default command
CMD /bin/bash
Case Study
Here, we use an assembled simulated metagenome that contains 10 species. The metagenomic dataset was assembled using SPAdes.
Both tools take two inputs: the assembled contigs and a BAM file with the reads mapped to those contigs.
The commands below create that BAM file with BWA and sort it with samtools (Metabat2 requires the BAM file to be sorted).
# Create index for contigs FASTA
$ bwa index original_contigs.fa
# Align reads against contigs and sort BAM
$ bwa mem -t 16 original_contigs.fa simShort_single.fasta | samtools sort -o alignment.bam
# index BAM file (needed by Metabat2)
$ samtools index alignment.bam
CONCOCT Tutorial
CONCOCT requires several steps to run, which can feel overwhelming, but don’t worry! Below is an easy step-by-step guide to binning your contigs.
It needs the assembled contigs (original_contigs.fa) and the BAM file that maps the reads used in the assembly back to those contigs (alignment.bam).
# adapted from https://github.com/BinPro/CONCOCT/tree/1.1.0
# Slice contigs into smaller sequences
$ cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa
# Generate coverage depth
$ concoct_coverage_table.py contigs_10K.bed alignment.bam > coverage_table.tsv
# Execute CONCOCT
$ concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/
# Merge sub-contig clustering into original contig clustering
$ merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv
# Create output folder for bins
$ mkdir concoct_output/fasta_bins
# Parse bins into different files
$ extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins
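The slicing and merging steps above are easier to follow with a toy model. The Python sketch below is a simplified illustration, not CONCOCT's actual code. `cut_up` mimics what the options `-c 10000 -o 0 --merge_last` suggest: fixed-size pieces with no overlap, with the leftover tail appended to the last piece. `merge_by_majority` assumes that each original contig ends up in the cluster that most of its pieces landed in. The function names, the `.concoct_part_N` naming of the pieces, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cut_up(length, chunk=10_000):
    """Return (start, end) pieces of a contig; the tail is merged into the last piece."""
    n_pieces = max(1, length // chunk)
    pieces = [(i * chunk, (i + 1) * chunk) for i in range(n_pieces)]
    # Stretch (or, for short contigs, shrink) the last piece to the contig end.
    pieces[-1] = (pieces[-1][0], length)
    return pieces


def merge_by_majority(sub_contig_clusters):
    """Assign each original contig to the most common cluster among its pieces."""
    votes = defaultdict(Counter)
    for sub_id, cluster in sub_contig_clusters.items():
        # Assumed naming scheme: "<contig>.concoct_part_<n>"
        original_id = sub_id.rsplit(".concoct_part_", 1)[0]
        votes[original_id][cluster] += 1
    return {contig: counts.most_common(1)[0][0] for contig, counts in votes.items()}


pieces = cut_up(25_000)
print(pieces)  # [(0, 10000), (10000, 25000)]

# Hypothetical clustering of sub-contigs (piece name -> cluster id)
toy_clusters = {
    "NODE_1.concoct_part_0": 3,
    "NODE_1.concoct_part_1": 3,
    "NODE_1.concoct_part_2": 7,
    "NODE_2.concoct_part_0": 5,
}
merged = merge_by_majority(toy_clusters)
print(merged)  # {'NODE_1': 3, 'NODE_2': 5}
```

Cutting long contigs into pieces of similar size gives each piece a comparable weight during clustering, and the merge step then maps the result back onto the contigs you started with.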
CONCOCT generated 16 bins, whereas the simulated dataset had 10 species. Later in this tutorial, we will examine how accurate these bins are.
Metabat2 Tutorial
Metabat2 is much simpler to run. All you need is the assembled contigs (original_contigs.fa) and a BAM file with the reads mapped against the contigs (alignment.bam).
$ runMetaBat.sh original_contigs.fa alignment.bam
# output bins at original_contigs.fa.metabat-bin*
Metabat2 generated 2 bins, whereas the simulated dataset had 10 species.
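Once either tool has written its bins, a quick sanity check is to count how many contigs ended up in each bin file. Here is a minimal Python sketch that counts FASTA header lines. It assumes the bins are FASTA files ending in `.fa`; the helper names and the glob pattern are illustrative, so adjust the pattern to wherever your bins were written.

```python
import glob


def count_contigs(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def contigs_per_bin(pattern):
    """Map each bin file matching the glob pattern to its number of contigs."""
    return {path: count_contigs(path) for path in sorted(glob.glob(pattern))}


# Assumed location: the CONCOCT output folder created earlier in this tutorial.
counts = contigs_per_bin("concoct_output/fasta_bins/*.fa")
for path, n_contigs in counts.items():
    print(path, n_contigs)
```

The same function works for the Metabat2 bins if you pass a pattern that matches its output files.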
Results
Finally, it is time to see how both tools performed at binning the contigs of the 10 species.
CONCOCT overestimated the number of genomes with 16 bins, and Metabat2 underestimated it with 2 bins.
Out of the 19,404 contigs, CONCOCT placed 1,366 contigs into its 16 bins, and Metabat2 placed 317 contigs into its 2 bins.
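To put those counts in perspective, the short Python snippet below turns them into the fraction of contigs that each tool binned. It only uses the numbers reported above; the variable names are illustrative.

```python
total_contigs = 19_404
binned = {"CONCOCT": 1_366, "Metabat2": 317}

for tool, n_binned in binned.items():
    print(f"{tool}: {n_binned / total_contigs:.2%} of contigs binned")
# CONCOCT: 7.04% of contigs binned
# Metabat2: 1.63% of contigs binned
```

Keep in mind that this is a fraction of contigs, not of sequence length. A likely contributor to the low percentages is that short contigs are left out of the clustering (the CONCOCT output file name clustering_gt1000.csv hints at a length cutoff), so the binned share of total bases may be considerably higher.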
Next, we can use CheckM to investigate the contamination and completeness of the bins.
CONCOCT
Bin name | Completeness (%) | Contamination (%) |
1 | 97.81 | 55.30 |
9 | 44.28 | 8.01 |
3 | 16.59 | 0.00 |
11 | 13.89 | 0.00 |
10 | 4.17 | 0.00 |
12 | 4.15 | 0.19 |
4 | 1.22 | 0.00 |
8 | 0.00 | 0.00 |
7 | 0.00 | 0.00 |
6 | 0.00 | 0.00 |
5 | 0.00 | 0.00 |
2 | 0.00 | 0.00 |
15 | 0.00 | 0.00 |
14 | 0.00 | 0.00 |
13 | 0.00 | 0.00 |
0 | 0.00 | 0.00 |
Metabat2
Bin name | Completeness (%) | Contamination (%) |
Bin 1 | 81.52 | 1.91 |
Bin 2 | 34.84 | 0.56 |
As you can see above, CONCOCT performed better than Metabat2 at binning the contigs. However, it overestimated the number of bins, and one of its bins was highly contaminated (55.30%).
It is also important to point out that some of the organisms have low abundance in the sample, which explains why some of the CONCOCT bins had a completeness of 0 (I would read this as close to 0).
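If you want to filter bins like these programmatically, here is a minimal Python sketch that parses rows in the same pipe-separated layout as the tables above. The function names are illustrative, and the cutoffs in the last call (at least 40% completeness, at most 10% contamination) are arbitrary values picked for this example, not a recommendation.

```python
def parse_table(text):
    """Parse 'name | completeness | contamination |' rows into a dict."""
    bins = {}
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split("|")]
        bins[fields[0]] = (float(fields[1]), float(fields[2]))
    return bins


def select_bins(bins, min_completeness, max_contamination):
    """Return the names of bins that pass both cutoffs."""
    return [
        name
        for name, (completeness, contamination) in bins.items()
        if completeness >= min_completeness and contamination <= max_contamination
    ]


# CONCOCT rows copied from the table above
concoct_rows = """
1 | 97.81 | 55.30 |
9 | 44.28 | 8.01 |
3 | 16.59 | 0.00 |
11 | 13.89 | 0.00 |
10 | 4.17 | 0.00 |
12 | 4.15 | 0.19 |
4 | 1.22 | 0.00 |
8 | 0.00 | 0.00 |
7 | 0.00 | 0.00 |
6 | 0.00 | 0.00 |
5 | 0.00 | 0.00 |
2 | 0.00 | 0.00 |
15 | 0.00 | 0.00 |
14 | 0.00 | 0.00 |
13 | 0.00 | 0.00 |
0 | 0.00 | 0.00 |
"""

concoct_bins = parse_table(concoct_rows)
nonzero = [name for name, (completeness, _) in concoct_bins.items() if completeness > 0]
print(len(nonzero), "of", len(concoct_bins), "bins have non-zero completeness")
# 7 of 16 bins have non-zero completeness

print(select_bins(concoct_bins, min_completeness=40, max_contamination=10))
# ['9']
```

Note how bin 1, despite its high completeness, is dropped by the contamination cutoff.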
More Resources
Here are three of my favorite Python bioinformatics books in case you want to learn more.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model
Conclusion
In summary, this tutorial shows step by step how to bin metagenomic contigs using CONCOCT and Metabat2.
In the example presented here, CONCOCT performed better than Metabat2. However, I would recommend running both tools, since every case is different.