Painless Prokaryote Pan Genome – Step-by-Step

This tutorial shows how to annotate genomes in FASTA format and then use those annotations to build the pan-genome and core genome. Prokka handles the gene annotation, and Roary generates the pan-genome.

Pan Genomes: Core genome, Unique Genes, etc

A pan genome is a comprehensive catalog of the genetic variation found in a group of related organisms. It is a term used in genomics to describe the set of all genes that are present across a group of closely related organisms, as well as the set of genes that are unique to each individual organism.

The concept of a pan genome is used in the study of bacterial genomes, where it is recognized that bacterial species often have a large pool of genes that are shared among the members of the species, as well as a smaller set of genes that are unique to each individual strain.

Pan genomes are used in several applications, including:

  1. Evolutionary analysis: Pan genomes can be used to study the evolution of a group of related organisms and to understand the mechanisms that drive the acquisition and loss of genes in these organisms.
  2. Antimicrobial resistance: Pan genomes can be used to identify the genetic basis of antimicrobial resistance in bacteria and to understand the spread of resistant strains.
  3. Bacterial classification: Pan genomes can be used to classify bacterial species and to determine their evolutionary relationships.
  4. Medical and public health: Pan genomes can be used to study the genetic basis of diseases caused by bacteria and to develop strategies for disease control and prevention.
  5. Biotechnology: Pan genomes can be used to identify new genes and pathways that are involved in important biological processes, such as metabolism and biodegradation.

Overall, the concept of a pan genome is a valuable tool for the study of bacterial genomes and the understanding of bacterial evolution, classification, and pathogenesis.

For three genomes, as illustrated below, the pan-genome divides into three groups: core genes (present in all three genomes), genes shared by two of the three genomes, and genes unique to a single genome.
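
The three groups can be made concrete with plain set operations. The sketch below is only an illustration, assuming each genome is represented as a set of gene names; the genome labels and gene names are made up and do not come from the E. coli data used later.

```python
def partition_pan_genome(genomes):
    """Split gene sets into pan, core, shared, and per-genome unique genes.

    genomes: dict mapping a genome label to a set of gene names.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)           # every gene seen in any genome
    core = set.intersection(*gene_sets)     # genes present in all genomes
    unique = {}
    for label, genes in genomes.items():
        # Genes found in every genome except the current one
        others = set().union(*(g for other, g in genomes.items() if other != label))
        unique[label] = genes - others
    all_unique = set().union(*unique.values())
    shared = pan - core - all_unique        # in more than one genome, but not all
    return pan, core, shared, unique


# Hypothetical toy input: three genomes with made-up gene names
toy_genomes = {
    "genome_A": {"g1", "g2", "g3", "g5"},
    "genome_B": {"g1", "g2", "g4"},
    "genome_C": {"g1", "g3", "g6"},
}
pan, core, shared, unique = partition_pan_genome(toy_genomes)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("shared:", sorted(shared))
print("unique:", {label: sorted(genes) for label, genes in unique.items()})
```

Roary does considerably more than this, since it first has to cluster similar protein sequences to decide which genes count as "the same" across genomes; the sketch only shows how the categories relate once that decision has been made.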

Now, follow the steps below, and you will learn how to annotate each genome and find the pan-genome.

Defining a Dockerfile with Prokka, Roary, and other Python libraries

Here is the Dockerfile, which you can use to easily create a Docker image for this tutorial. Save it in a file named “Dockerfile”.

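# Base image with conda; create a Python 3.7 environment named "env"
FROM continuumio/miniconda3:4.7.12
RUN conda create -y -n env python=3.7
ENV PATH /opt/conda/envs/env/bin:$PATH

# Add the channels that provide Prokka, Roary, and the Python libraries
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge

# Install pinned versions into "env" so they sit on the PATH set above
# (-y skips the confirmation prompt, which cannot be answered during a build)
RUN conda install -y -n env prokka==1.14.5 roary==3.12.0 tbl2asn-forever==25.7.1f biopython==1.76 pandas==1.0.1 seaborn==0.10.0

# Default command: open a shell
CMD /bin/bash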

Build the Docker Image

Next, once you have saved the Dockerfile, run the command below from the same directory to build the image.

docker build -t pangenome:0.1 .

The build output should end with a message similar to this:

Step 8/8 : CMD /bin/bash
 ---> Using cache
 ---> 0501ed472c06
Successfully built 0501ed472c06
Successfully tagged pangenome:0.1

Case Study

In this tutorial, we annotate and create the pan-genome for these three genomes: Escherichia coli strain NCTC9702, Escherichia coli 042 (accession FN554766, which is used in the file names below), and Escherichia coli strain RM9088.

Run the Docker Image

Once you have the FASTA files with contigs or complete genomes, use the command below to start a container from your Docker image. PATH_LOCAL_MACHINE is the path on your machine that holds your FASTA file(s), and FOLDER_NAME_VM is the folder inside the container where Docker mounts that path. In my case, I set PATH_LOCAL_MACHINE to /Users/onestop_data/Desktop/pangenome/ and FOLDER_NAME_VM to genomes/.

# syntax example
$ docker run -i -v {PATH_LOCAL_MACHINE}:/{FOLDER_NAME_VM} -t {IMAGE_NAME}:{TAG}

# real case
$ docker run -i -v /Users/onestop_data/Desktop/pangenome/:/genomes/ -t pangenome:0.1

Prokka: Prokaryotic Genome Annotation

Prokka is used for prokaryotic genome annotation. It is probably one of the most widely used tools for this purpose in the microbial bioinformatics community. For more detail on the tool, please read its paper.

With the container running (see the command in the previous section), change into the “genomes/” directory and run Prokka once per genome with the commands below.

# enter folder with input files
$ cd genomes/

# syntax - example
$ prokka --outdir {OUTPUT_DIR} --prefix {OUTPUT_PREFIX} {FASTA_TO_BE_ANNOTATED}

# real case (annotate genome 1)
$ prokka --outdir ecoli_NCTC9702_annotation/ --prefix ecoli_NCTC9702 ecoli_NCTC9702.fasta

# real case (annotate genome 2)
$ prokka --outdir ecoli_FN554766_annotation/ --prefix ecoli_FN554766 ecoli_FN554766.fasta

# real case (annotate genome 3)
$ prokka --outdir ecoli_RM9088_annotation/ --prefix ecoli_RM9088 ecoli_RM9088.fasta
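
The three commands above follow one pattern: the FASTA file name without its extension becomes the prefix, and the output directory is the prefix plus "_annotation/". With more genomes, that pattern can be generated rather than typed. The helper below is a hypothetical sketch (prokka_command is not part of Prokka); it only builds the argument lists, using the same --outdir and --prefix options shown above.

```python
from pathlib import Path


def prokka_command(fasta_path):
    """Build the Prokka argument list for one FASTA file.

    Hypothetical helper following the naming convention of this tutorial:
    <name>.fasta -> prefix <name>, output directory <name>_annotation/.
    """
    fasta = Path(fasta_path)
    prefix = fasta.stem  # file name without the .fasta extension
    return ["prokka", "--outdir", f"{prefix}_annotation/", "--prefix", prefix, str(fasta)]


fasta_files = ["ecoli_NCTC9702.fasta", "ecoli_FN554766.fasta", "ecoli_RM9088.fasta"]
commands = [prokka_command(name) for name in fasta_files]
for command in commands:
    print(" ".join(command))
```

To actually run the annotations from Python inside the container, each list could be passed to subprocess.run(command, check=True); the sketch stops at printing so it can be inspected first.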

Creating a Pan-genome with Roary

Roary is already installed in the Docker image and is straightforward to run. All it requires are the annotated genomes as GFF files, which Prokka wrote to each output directory.

# syntax
$ roary -f {OUTPUT_DIR} -p {NUMBER_CPUs} {ANNOTATED_GENOME_IN_GFF}

# real case
$ roary -f ecoli_output/ -p 16 ecoli_*_annotation/*.gff 

All the output files should be in the ecoli_output/ directory. One of the most important files is gene_presence_absence.csv, which contains some statistics per gene and shows which of the annotated genomes contain each gene.

Using this file, you can identify which genes are part of the core genome (shared across the three genomes), which are shared across two genomes, and which are unique to one genome.

This file can be opened in Microsoft Excel if desired. See a sample of our pan-genome below.
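
This classification can also be scripted. The sketch below is a minimal example, assuming the file has the "Gene" and "No. isolates" columns that Roary writes to gene_presence_absence.csv; classify_genes is a hypothetical helper, and the toy rows are made up so the example runs without the real output.

```python
import csv
import io
from collections import defaultdict


def classify_genes(csv_file, n_genomes):
    """Group gene names by the number of genomes that contain them."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        n_isolates = int(row["No. isolates"])
        if n_isolates == n_genomes:
            groups["core"].append(row["Gene"])     # present in every genome
        elif n_isolates == 1:
            groups["unique"].append(row["Gene"])   # present in one genome only
        else:
            groups["shared"].append(row["Gene"])   # present in some, not all
    return dict(groups)


# Toy stand-in for gene_presence_absence.csv (made-up rows, reduced columns)
toy_csv = io.StringIO(
    "Gene,Annotation,No. isolates,ecoli_NCTC9702,ecoli_FN554766,ecoli_RM9088\n"
    "geneA,hypothetical protein,3,a_001,b_001,c_001\n"
    "geneB,hypothetical protein,2,a_002,,c_002\n"
    "geneC,hypothetical protein,1,,b_003,\n"
)
groups = classify_genes(toy_csv, n_genomes=3)
for category, genes in groups.items():
    print(category, len(genes), genes)
```

For the real data, replace toy_csv with a file opened via open("ecoli_output/gene_presence_absence.csv", newline="") and keep n_genomes=3.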

Plotting Roary output

Last but not least, we can use the roary_plots.py script to plot the pan-genome data. Please make sure you run it inside the Docker container. It takes two inputs: the Newick file generated by Roary and the gene_presence_absence.csv file.

$ python roary_plots.py ecoli_output/accessory_binary_genes.fa.newick ecoli_output/gene_presence_absence.csv 

Number of Genes in the core, etc

Genes Frequency per Genome

Clustered Matrix of Genes

Conclusion

In summary, this tutorial shows how to annotate prokaryote genomes and how to use these annotations to generate a pan-genome and identify the core genome, shared genes, and unique genes.
