Painless Prokaryote Pan Genome – Step-by-Step



This tutorial shows how to annotate genomes in FASTA format and how to generate the pan-genome and core genome from those annotations. We use Prokka to annotate the genes and Roary to generate the pan-genome.

Pan Genomes: Core genome, Unique Genes, etc

First and foremost, a pan-genome is the complete set of genes present in a group of genomes. It therefore contains the genes present in all target genomes (known as the core genome) as well as the genes present in only some of them.

The figure below illustrates this for three genomes: the genes that are core (present in all three genomes), the genes shared by two of the genomes, and the genes that are unique to a single genome.
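The same partition can be sketched with plain Python sets. The gene names and genome labels below are made up for illustration and are not taken from the E. coli genomes used later; the function name is likewise hypothetical.

```python
# Hypothetical gene sets for three genomes (illustrative names only).
genomes = {
    "genome_A": {"gene1", "gene2", "gene3", "gene4"},
    "genome_B": {"gene1", "gene2", "gene5"},
    "genome_C": {"gene1", "gene3", "gene6"},
}


def partition_pan_genome(genomes):
    """Split the pan-genome into core, shared, and unique genes."""
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)          # every gene seen in any genome
    core = set.intersection(*gene_sets)    # genes present in all genomes
    # Count in how many genomes each gene occurs.
    counts = {gene: sum(gene in s for s in gene_sets) for gene in pan}
    unique = {gene for gene, n in counts.items() if n == 1}
    shared = pan - core - unique           # in more than one genome, but not all
    return pan, core, shared, unique


pan, core, shared, unique = partition_pan_genome(genomes)
print("pan-genome:", sorted(pan))
print("core:", sorted(core))
print("shared:", sorted(shared))
print("unique:", sorted(unique))
```

Roary does considerably more than this (it clusters similar protein sequences before deciding which genes are "the same"), so treat the sketch only as a picture of the definitions.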

Now, follow the steps below, and you will learn how to annotate each genome and find the pan-genome.

Defining a Dockerfile with Prokka, Roary, and other Python libraries

Here is the Dockerfile you can use to create a Docker image for this tutorial. Save it to a file named “Dockerfile”.

# base image with conda preinstalled
FROM continuumio/miniconda3:4.7.12

# creating conda environment (-y skips the confirmation prompt)
RUN conda create -y -n env python=3.7
ENV PATH /opt/conda/envs/env/bin:$PATH

# conda channel setup
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge

# install Prokka, Roary, and the Python libraries
RUN conda install -y prokka==1.14.5 roary==3.12.0 tbl2asn-forever==25.7.1f biopython==1.76 pandas==1.0.1 seaborn==0.10.0

# default command: open a shell
CMD /bin/bash

Build the Docker Image

Next, once you have the Dockerfile, run the command below from the directory that contains it to build the image.

docker build -t pangenome:0.1 .

The end of the build output should look similar to this:

Step 8/8 : CMD /bin/bash
 ---> Using cache
 ---> 0501ed472c06
Successfully built 0501ed472c06
Successfully tagged pangenome:0.1

Case Study

In this tutorial, we annotate and create the pan-genome for these three genomes: Escherichia coli strain NCTC9702, Escherichia coli 042 (its files are named ecoli_FN554766 in the commands below), and Escherichia coli strain RM9088.

Run the Docker Image

Now that you have the FASTA files with contigs or complete genomes, you can use the command below to run your Docker image. PATH_LOCAL_MACHINE is the path on your machine that holds your FASTA file(s), and FOLDER_NAME_VM is the name of the folder inside the container where Docker mounts that path. In my case, I set PATH_LOCAL_MACHINE to /Users/onestop_data/Desktop/pangenome/ and FOLDER_NAME_VM to genomes/.

# syntax example
$ docker run -i -v {PATH_LOCAL_MACHINE}:/{FOLDER_NAME_VM} -t pangenome:0.1

# real case
$ docker run -i -v /Users/onestop_data/Desktop/pangenome/:/genomes/ -t pangenome:0.1

Prokka: Prokaryotic Genome Annotation

Next, Prokka is used for prokaryotic genome annotation. It is probably one of the most widely used tools for this purpose in the microbial bioinformatics community. If you want to know more about the tool, please read its paper.

Now that your Docker container is running (see the command in the previous section), change your directory to “genomes/” and run Prokka with the commands below.

# enter folder with input files
$ cd genomes/

# syntax - example
$ prokka --outdir {OUTPUT_DIR} --prefix {OUTPUT_PREFIX} {FASTA_TO_BE_ANNOTATED}

# real case (annotate genome 1)
$ prokka --outdir ecoli_NCTC9702_annotation/ --prefix ecoli_NCTC9702 ecoli_NCTC9702.fasta

# real case (annotate genome 2)
$ prokka --outdir ecoli_FN554766_annotation/ --prefix ecoli_FN554766 ecoli_FN554766.fasta

# real case (annotate genome 3)
$ prokka --outdir ecoli_RM9088_annotation/ --prefix ecoli_RM9088 ecoli_RM9088.fasta
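With more than a handful of genomes, typing one command per file gets tedious. The sketch below builds the same three commands from a list of FASTA file names, following the naming pattern used above (prefix taken from the file name, output directory named {prefix}_annotation/). The helper name prokka_command is hypothetical, and the call that would actually launch Prokka is left commented out.

```python
import subprocess
from pathlib import Path


def prokka_command(fasta_path):
    """Build the Prokka command for one FASTA file, using the naming pattern above."""
    prefix = Path(fasta_path).stem  # ecoli_NCTC9702.fasta -> ecoli_NCTC9702
    return [
        "prokka",
        "--outdir", f"{prefix}_annotation/",
        "--prefix", prefix,
        str(fasta_path),
    ]


fasta_files = ["ecoli_NCTC9702.fasta", "ecoli_FN554766.fasta", "ecoli_RM9088.fasta"]
commands = [prokka_command(fasta) for fasta in fasta_files]

for cmd in commands:
    print(" ".join(cmd))
    # Inside the container, uncomment the next line to run the annotation:
    # subprocess.run(cmd, check=True)
```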

Creating a Pan-genome with Roary

Roary is already installed in the Docker image and is straightforward to run. All it requires are the annotated genome GFF files that Prokka wrote to each annotation directory.

# syntax
$ roary -f {OUTPUT_DIR} -p {NUM_THREADS} {GFF_FILES}

# real case
$ roary -f ecoli_output/ -p 16 ecoli_*_annotation/*.gff
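Before starting Roary, it can help to confirm that the shell pattern really matches one GFF file per genome. The sketch below collects the files with the same pattern and assembles the command shown above. The helper name roary_command is hypothetical, and the minimum of two GFF files is an assumption made here because a pan-genome of a single genome is not meaningful, not a documented Roary rule.

```python
import glob


def roary_command(gff_files, output_dir="ecoli_output/", threads=16):
    """Assemble the Roary command line for a list of GFF files."""
    if len(gff_files) < 2:
        # Assumption for this sketch: comparing fewer than two genomes is pointless.
        raise ValueError("need at least two GFF files to build a pan-genome")
    return ["roary", "-f", output_dir, "-p", str(threads)] + list(gff_files)


# Same pattern as in the shell command above.
gff_files = sorted(glob.glob("ecoli_*_annotation/*.gff"))
if len(gff_files) >= 2:
    print(" ".join(roary_command(gff_files)))
else:
    print("Found fewer than two GFF files; run Prokka first.")
```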

All the output files should be in the ecoli_output/ directory. One of the most important files is gene_presence_absence.csv, which contains some statistics per gene and shows which of the annotated genomes contain each gene.

Using this file, identify which genes are part of the core genome (shared across the three genomes), shared across two genomes, and unique genes.
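One way to do this classification is to count, for each gene, how many genomes it appears in. The sketch below assumes the CSV has a "Gene" column and a "No. isolates" column holding that count. The inline sample rows are invented so the example runs on its own, and a real Roary file contains many more columns.

```python
import csv
import io
from collections import Counter


def classify_genes(csv_file, n_genomes):
    """Label each gene as core, shared, or unique from its isolate count."""
    categories = {}
    for row in csv.DictReader(csv_file):
        n_isolates = int(row["No. isolates"])  # assumed column name
        if n_isolates == n_genomes:
            label = "core"      # present in every genome
        elif n_isolates == 1:
            label = "unique"    # present in exactly one genome
        else:
            label = "shared"    # present in some, but not all, genomes
        categories[row["Gene"]] = label
    return categories


# Invented sample rows standing in for gene_presence_absence.csv.
sample = io.StringIO(
    '"Gene","Annotation","No. isolates"\n'
    '"geneA","hypothetical protein","3"\n'
    '"geneB","hypothetical protein","2"\n'
    '"geneC","hypothetical protein","1"\n'
    '"geneD","hypothetical protein","3"\n'
)
categories = classify_genes(sample, n_genomes=3)
print(Counter(categories.values()))
```

For the real data, open the file with open("ecoli_output/gene_presence_absence.csv", newline="") and pass the file object to classify_genes with n_genomes=3.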

This file can also be opened in Microsoft Excel. A sample of our pan-genome is shown below.

Plotting Roary output

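Last but not least, we can use this script to plot the pan-genome data; please make sure you run it inside the Docker container. It takes two inputs: the Newick file generated by Roary and the gene_presence_absence.csv file. In the command below, {PLOT_SCRIPT} stands for the path to the plotting script (for example, roary_plots.py from the Roary repository, which takes a tree and a gene presence/absence spreadsheet).

$ python {PLOT_SCRIPT} ecoli_output/accessory_binary_genes.fa.newick ecoli_output/gene_presence_absence.csv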

Number of Genes in the core, etc

Genes Frequency per Genome

Clustered Matrix of Genes


In summary, this tutorial shows how to annotate prokaryote genomes and how to use these annotations to generate a pan-genome and identify the core genome, shared genes, and unique genes.
