Step by Step: Completeness and Contamination of MAGs

This article demonstrates, step by step, how to estimate the completeness and contamination of Metagenome-Assembled Genomes (MAGs) using CheckM.

1. How does CheckM work?

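My Ph.D. advisor, Robert Edwards, explains below how CheckM uses hidden Markov models to estimate the completeness and contamination of bins (MAGs). In short, CheckM searches each bin for lineage-specific single-copy marker genes: the fraction of expected markers that are found gives the completeness estimate, and markers found in more than one copy point to contamination.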
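
The arithmetic behind marker-based estimates can be illustrated with a deliberately simplified sketch. This is not CheckM's actual implementation (CheckM groups markers into lineage-specific collocated sets), and the two-column input format, marker name plus number of copies found, is a hypothetical one chosen only for this example.

```shell
# Simplified illustration only; CheckM itself works with collocated marker sets.
# Input (hypothetical format): one line per expected marker, "<marker> <copies found>".
estimate_quality() {
  awk '
    NF < 2  { next }                 # skip blank or malformed lines
            { expected++ }
    $2 >= 1 { present++ }            # marker seen at least once
    $2 > 1  { extra += $2 - 1 }      # additional copies hint at contamination
    END {
      if (expected == 0) { print "no markers"; exit 1 }
      printf "completeness=%.1f contamination=%.1f\n",
             100 * present / expected, 100 * extra / expected
    }'
}
```

For instance, piping four markers of which one is missing and one is duplicated (printf 'm1 1\nm2 0\nm3 2\nm4 1\n' | estimate_quality) reports 75% completeness and 25% contamination under this toy model.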

2. Step by Step: Running CheckM

First and foremost, because CheckM has many dependencies, I will use a Docker image that bundles all of them. All you need to do is make sure Docker is installed on your machine and pull the CheckM Bioconda Docker image with the command below:

$ docker pull quay.io/biocontainers/checkm-genome:1.1.2--py_0

Make sure Docker is set up to use at least 16 GB of RAM. If you don’t know how to do that, please learn how to do it here.
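
One way to check the memory available to Docker is to compare the byte count that docker info reports against a threshold. The helper below is a minimal sketch; the function name has_enough_ram and its default of 16 GiB are my own choices for illustration, not part of Docker or CheckM.

```shell
# Succeeds when a byte count is at least the requested number of GiB (default 16).
# has_enough_ram is a hypothetical helper name used only in this sketch.
has_enough_ram() {
  bytes=$1
  min_gib=${2:-16}
  awk -v b="$bytes" -v g="$min_gib" 'BEGIN { exit !(b >= g * 1024 * 1024 * 1024) }'
}

# Example (requires a running Docker daemon):
#   has_enough_ram "$(docker info --format '{{.MemTotal}}')" \
#     && echo "Docker has enough memory" \
#     || echo "Increase the memory available to Docker"
```

Docker often reports slightly less memory than the amount you configured, so a somewhat lower threshold (for example 15) may be more practical.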

The next step is to download the CheckM database:

# download database
$ wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
# create folder and uncompress db into database folder
$ mkdir -p db/
$ tar xzf checkm_data_2015_01_16.tar.gz -C db/

Attention: In this tutorial, please make sure the database folder lives in the same folder as your bin files, because that folder is the only one shared with the Docker container.

Now that you have pulled the image to your machine, given Docker enough memory, and downloaded the CheckM database, you can run it (yay!).

First, however, identify your image ID using docker images:

$ docker images
REPOSITORY                            TAG                 IMAGE ID            CREATED             SIZE
quay.io/biocontainers/checkm-genome   1.1.2--py_0         04fa265258d1        3 weeks ago         1.1GB

In my case, it was 04fa265258d1 as you can see above.

Next, we can run the Docker image by calling docker run as shown below, where {WORK_DIR} is the directory where the bins live and {IMAGE_ID} is the image ID found above:

# run docker image with CheckM and its dependencies
$ docker run -i -t -v {WORK_DIR}:/checkm_docker --entrypoint /bin/bash {IMAGE_ID}
# inside the container, change to the mounted working directory
$ cd /checkm_docker/

The next step is to tell CheckM where the database is located so it can find it:

$ checkm data setRoot db/

Finally, we are all set and can run the tool by using the command below.

$ checkm lineage_wf {BIN_DIRECTORY} {OUTPUT_DIRECTORY} --reduced_tree -t {NUMBER_THREADS}

If your computer has far more than 16 GB of RAM, you should remove the --reduced_tree flag. This flag is meant for computers with about 16 GB of RAM and produces sub-optimal results.

Attention: Make sure all the files in {BIN_DIRECTORY} have the .fna extension. If they all have the .fasta extension instead, please add the flag --extension fasta.
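
Before launching CheckM, it can help to confirm which extension your bins actually use. The snippet below is a small sketch; count_bins is a hypothetical helper name, and it simply counts the files in a directory that end with a given extension.

```shell
# Print how many files directly inside a directory end with the given
# extension (default: fna). count_bins is a name made up for this sketch.
count_bins() {
  dir=$1
  ext=${2:-fna}
  find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d ' '
}
```

Running count_bins {BIN_DIRECTORY} fna and count_bins {BIN_DIRECTORY} fasta shows how many bins carry each extension, so you know whether the extension flag is needed.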

3. Case Study

To test CheckM, I ran it on some microbial genomes from NCBI:

Furthermore, I combined both genomes to simulate a mixed bin so I could check for contamination, and removed some contigs from the Lactobacillus vini DSM 20605 assembly so I could check for completeness.
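
Test bins like these can be simulated with standard tools. The sketch below is one possible way to do it, not necessarily the exact procedure I followed: the function names are made up, and keeping only the first N contigs is an arbitrary choice of which contigs to drop.

```shell
# Concatenate two genome FASTA files into one simulated "contaminated" bin.
# Assumes each input file ends with a newline.
make_mixed_bin() {
  cat "$1" "$2" > "$3"
}

# Keep only the first N contigs of a FASTA file to simulate an incomplete bin.
# Usage: keep_first_contigs <input.fna> <N> <output.fna>
keep_first_contigs() {
  awk -v n="$2" '/^>/ { seen++ } seen <= n' "$1" > "$3"
}
```

The output files can then be placed in the bin directory next to the original genomes and evaluated together.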

Last but not least, as you can see below, CheckM was able to classify the bins correctly. We see some small contamination in the NCBI genomes; this could be classification noise.
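
If you have many bins, you may prefer to filter the results automatically. The sketch below assumes you saved CheckM's results as a tab-separated file (for example by adding the --tab_table and -f options to the command) and that the header contains columns named Completeness and Contamination. The filter_bins name and the default thresholds of 90 and 5 are my assumptions, borrowed from commonly cited MAG quality cut-offs; adjust them to your study.

```shell
# Print "bin<TAB>completeness<TAB>contamination" for bins passing both thresholds.
# Usage: filter_bins <table.tsv> [min_completeness] [max_contamination]
# Columns are located by header name, so their position does not matter.
filter_bins() {
  awk -F '\t' -v min_comp="${2:-90}" -v max_cont="${3:-5}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "Completeness")  comp = i
        if ($i == "Contamination") cont = i
      }
      if (!comp || !cont) {
        print "missing Completeness/Contamination column" > "/dev/stderr"
        exit 1
      }
      next
    }
    $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 {
      print $1 "\t" $comp "\t" $cont
    }
  ' "$1"
}
```

Because the columns are looked up by name, the same function works whether or not the table contains the extra marker-count columns.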

Conclusion

In summary, I hope you now understand that it is important to check the completeness and contamination of MAGs. Moreover, I hope this step-by-step tutorial is helpful to you.
