Painless Prokaryote Pan Genome – Step-by-Step

By onestop_data

Bioinformatics

This tutorial shows how to annotate genomes in FASTA format and how to generate the pan-genome and core genome using the annotation. For annotating the genes, we use Prokka, and Roary is used for the pan-genome generation.

Pan Genomes: Core genome, Unique Genes, etc

First and foremost, a pan-genome is the complete set of genes present in a set of target genomes. Thus, it contains both the genes present in all targets (also known as the core genome) and the genes present in only some of the targets.

For example, with 3 genomes the genes fall into three groups: genes that are core (present in all 3 genomes), genes that are shared by two of the genomes, and genes that are unique to a single genome.
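To make these groups concrete, here is a minimal Python sketch using plain sets. The genome names and gene names are invented for illustration only; they are not taken from the E. coli data used later in this tutorial.

```python
# Toy gene sets for three hypothetical genomes (all names are made up).
genomes = {
    "genome_A": {"gene1", "gene2", "gene3", "gene4"},
    "genome_B": {"gene1", "gene2", "gene5"},
    "genome_C": {"gene1", "gene3", "gene6"},
}

# Pan-genome: every gene seen in at least one genome.
pan_genome = set.union(*genomes.values())

# Core genome: genes present in every genome.
core_genome = set.intersection(*genomes.values())

# Count how many genomes carry each gene.
counts = {gene: sum(gene in genes for genes in genomes.values())
          for gene in pan_genome}

# Unique genes occur in exactly one genome; shared genes occur in
# more than one genome but not in all of them.
unique_genes = {gene for gene, n in counts.items() if n == 1}
shared_genes = {gene for gene, n in counts.items() if 1 < n < len(genomes)}

print("core:  ", sorted(core_genome))
print("shared:", sorted(shared_genes))
print("unique:", sorted(unique_genes))
```

With real data the hard part is deciding which genes from different genomes count as "the same" gene, which is the clustering work that Roary does later in this tutorial.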

Now, follow the steps below, and you will learn how to annotate each genome and find the pan-genome.

Defining a Dockerfile with Prokka, Roary, and other Python libraries

Now, here is the Dockerfile which you can use to easily create a Docker image for this tutorial. Please save it into a file named “Dockerfile”.

# base image and conda environment
FROM continuumio/miniconda3:4.7.12
RUN conda create -y -n env python=3.7
ENV PATH=/opt/conda/envs/env/bin:$PATH

# conda channel setup
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge

# install Prokka, Roary, and the Python libraries (-y skips the confirmation prompt)
RUN conda install -y prokka==1.14.5 roary==3.12.0 tbl2asn-forever==25.7.1f biopython==1.76 pandas==1.0.1 seaborn==0.10.0

# default command: start a shell
CMD /bin/bash

Build the Docker Image

Next, once you have the Dockerfile, all you need to do is run the command below, from the folder that contains it, to build the image.

docker build -t pangenome:0.1 .

The build should end with a message similar to this:

Step 8/8 : CMD /bin/bash
 ---> Using cache
 ---> 0501ed472c06
Successfully built 0501ed472c06
Successfully tagged pangenome:0.1

Case Study

In this tutorial, we annotate and create the pan-genome for these 3 genomes: Escherichia coli strain NCTC9702, Escherichia coli 042 (the file named ecoli_FN554766.fasta in the commands below), and Escherichia coli strain RM9088.

Run the Docker Image

Now that you have the FASTA files with contigs/complete genomes, you can use the command below to run your Docker image. Notice that PATH_LOCAL_MACHINE is the path on your machine that contains your FASTA file(s), and FOLDER_NAME_VM is the folder inside the container where Docker will mount that path. In my case, I set PATH_LOCAL_MACHINE to /Users/onestop_data/Desktop/pangenome/ and FOLDER_NAME_VM to genomes/.

# syntax example
$ docker run -i -v {PATH_LOCAL_MACHINE}:/{FOLDER_NAME_VM} -t {IMAGE_NAME}:{TAG}

# real case
$ docker run -i -v /Users/onestop_data/Desktop/pangenome/:/genomes/ -t pangenome:0.1

Prokka: Prokaryotic Genome Annotation

Next, Prokka is used for prokaryotic genome annotation. It is probably one of the most used tools for this purpose in the microbial bioinformatics community. If you want to know more about the tool, please read its paper.

Now that your Docker container is running (see the command in the previous section), change your directory to “genomes/” and run Prokka with the commands below.

# enter folder with input files
$ cd genomes/

# syntax - example
$ prokka --outdir {OUTPUT_DIR} --prefix {OUTPUT_PREFIX} {FASTA_TO_BE_ANNOTATED}

# real case (annotate genome 1)
$ prokka --outdir ecoli_NCTC9702_annotation/ --prefix ecoli_NCTC9702 ecoli_NCTC9702.fasta

# real case (annotate genome 2)
$ prokka --outdir ecoli_FN554766_annotation/ --prefix ecoli_FN554766 ecoli_FN554766.fasta

# real case (annotate genome 3)
$ prokka --outdir ecoli_RM9088_annotation/ --prefix ecoli_RM9088 ecoli_RM9088.fasta
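With more than a handful of genomes, typing one Prokka command per file gets tedious. The following is a minimal sketch, not part of the original workflow, that builds the same command for every *.fasta file in the current folder. The helper names prokka_command and annotate_all are hypothetical; only the --outdir and --prefix options shown above are used.

```python
from pathlib import Path
import subprocess


def prokka_command(fasta):
    """Build the Prokka command for one FASTA file, e.g. ecoli_RM9088.fasta."""
    prefix = Path(fasta).stem  # file name without the .fasta extension
    return ["prokka", "--outdir", f"{prefix}_annotation/",
            "--prefix", prefix, str(fasta)]


def annotate_all(dry_run=True):
    """Print (and optionally run) Prokka for every *.fasta file in the current folder."""
    commands = [prokka_command(path) for path in sorted(Path(".").glob("*.fasta"))]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first genome that fails to annotate.
            subprocess.run(cmd, check=True)
    return commands
```

Calling annotate_all() only prints the commands so you can inspect them first; annotate_all(dry_run=False) runs them one after the other and must be executed inside the container, where Prokka is installed.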

Creating a Pan-genome with Roary

Roary is already installed in the Docker image, and it is very easy to run. All it requires are the annotated genomes in GFF format, which Prokka writes into each annotation output directory.

# syntax
$ roary -f {OUTPUT_DIR} -p {NUMBER_CPUs} {ANNOTATED_GENOME_IN_GFF}

# real case
$ roary -f ecoli_output/ -p 16 ecoli_*_annotation/*.gff 
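If one of the Prokka runs failed, the shell wildcard above silently matches fewer GFF files than expected. As a hedged sketch, the Python helper below (find_gff_files and roary_command are hypothetical names) lists the matched files and refuses to build the Roary command when nothing matched. It uses only the -f and -p options from the command above.

```python
import glob


def find_gff_files(pattern="ecoli_*_annotation/*.gff"):
    """Return the GFF files matching the same wildcard used on the command line."""
    return sorted(glob.glob(pattern))


def roary_command(gff_files, output_dir="ecoli_output/", cpus=16):
    """Build the Roary command for a list of annotated genomes in GFF format."""
    if not gff_files:
        raise ValueError("no GFF files found; did the Prokka step finish?")
    return ["roary", "-f", output_dir, "-p", str(cpus), *gff_files]
```

Printing find_gff_files() before running Roary is a quick way to confirm that all 3 annotated genomes are present.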

All the output files should be in the ecoli_output/ directory. One of the most important files is gene_presence_absence.csv, which contains some statistics per gene and shows which of the annotated genomes contain each gene.

Use this file to identify which genes are part of the core genome (shared across the 3 genomes), which are shared across 2 genomes, and which are unique to one genome.

This file can also be opened in Microsoft Excel if you prefer.
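The core/shared/unique split can also be done in a few lines of Python. The sketch below uses only the standard library and assumes the Roary CSV has a "Gene" column and a "No. isolates" column (the number of genomes in which the gene was found). Please check the header of your own file, since this is my reading of the output format rather than something shown in this tutorial. The function names are hypothetical.

```python
import csv
from collections import Counter


def classify_rows(rows, n_genomes=3):
    """Map each gene name to 'core', 'shared', or 'unique'.

    rows is an iterable of dicts with the (assumed) Roary columns
    "Gene" and "No. isolates".
    """
    categories = {}
    for row in rows:
        n_isolates = int(row["No. isolates"])
        if n_isolates == n_genomes:
            categories[row["Gene"]] = "core"      # present in every genome
        elif n_isolates == 1:
            categories[row["Gene"]] = "unique"    # present in one genome only
        else:
            categories[row["Gene"]] = "shared"    # present in some, not all
    return categories


def summarize(csv_path, n_genomes=3):
    """Count how many genes of a Roary CSV fall into each category."""
    with open(csv_path, newline="") as handle:
        categories = classify_rows(csv.DictReader(handle), n_genomes)
    return Counter(categories.values())
```

For this case study, summarize("ecoli_output/gene_presence_absence.csv") would return the number of core, shared, and unique genes across the 3 E. coli genomes.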

Plotting Roary output

Last but not least, we can use the roary_plots.py script to plot the data for the pan-genome. Please make sure you run it inside the Docker container. All you need to pass as input is the Newick file generated by Roary and the gene_presence_absence.csv file.

$ python roary_plots.py ecoli_output/accessory_binary_genes.fa.newick ecoli_output/gene_presence_absence.csv 

The script produces three plots: the number of genes in the core and the other pan-genome categories, the gene frequency per genome, and a clustered matrix of genes.

Conclusion

In summary, this tutorial shows how to annotate prokaryote genomes and how to use these annotations to generate a pan-genome and identify the core genome, shared genes, and unique genes.
