This tutorial walks step by step through finding prophages (phages integrated into a bacterial genome), going from a FASTA file (not GenBank format) to another FASTA file containing the prophages, if any are found.
This lesson starts from a FASTA file because most biologists don’t have their contigs annotated in GenBank format, so the tutorial shows how to annotate the genome and then find the phages within it.
To accomplish this, Prokka is used to annotate the genome, and its output is passed to PhiSpy to identify prophage regions within the bacterial genome.
Introduction
First and foremost, I assume you are already familiar with what a prophage is: the genome of a virus that infects bacteria (a phage), integrated into the genome of its host. For that reason, we don’t cover the biology here.
If you want to learn more about prophages, please check the YouTube video below.
Instead, we focus on the step-by-step process of finding prophages in a bacterial genome. Finding phages can be challenging in general because many of them lack homology to known sequences, so in this tutorial we use PhiSpy, which combines homology and machine learning to classify which regions of the genome belong to the bacterium and which to a phage.
Besides PhiSpy, we need Prokka for genome annotation. Prokka has many dependencies, and one of them, tbl2asn, can be very problematic to install. This tutorial therefore shares a Docker image built from Bioconda packages, which includes Prokka and PhiSpy and handles all the dependencies for you.
Defining Dockerfile with Prokka and Phispy
Here is the Dockerfile you can use to create a Docker image for this tutorial. Please save it in a file named “Dockerfile”.
# Base image with conda preinstalled
FROM continuumio/miniconda3:4.7.12
# Create a Python 3.7 environment and put it first on PATH
RUN conda create -y -n env python=3.7
ENV PATH=/opt/conda/envs/env/bin:$PATH
# Add channels (the last one added gets the highest priority)
RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge
# Install Prokka, PhiSpy and tbl2asn
RUN conda install -y prokka==1.14.5 phispy==3.7.8 tbl2asn-forever==25.7.1f
# Default command: open a shell
CMD /bin/bash
Build the Docker Image
Next, once you have the Dockerfile, all you need to do is run the command below, from the folder that contains it, to build the image.
$ docker build -t phage_image:0.1 .
which should print a tail message similar to this:
Removing intermediate container eb77cdcfb923
---> b96efc9d452a
Step 9/9 : CMD /bin/bash
---> Running in 86b86ea2bb3d
Removing intermediate container 86b86ea2bb3d
---> 1f795d691c1d
Successfully built 1f795d691c1d
Successfully tagged phage_image:0.1
Case study
In this tutorial, we will annotate and find phages in Escherichia coli str. K-12 substr. MG1655, complete genome. Please download the genome from NCBI in FASTA format, or use your own set of contigs or complete genome in FASTA format.
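If you prefer to script the download, here is a minimal sketch that fetches the record through NCBI's E-utilities efetch endpoint using only the Python standard library. The function names build_fasta_url and download_fasta are hypothetical helpers of mine, and the sketch assumes the accession U00096.2 (the one used in the commands of this tutorial) is still served by NCBI and that you have network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession):
    """Return the efetch URL that serves a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return "{}?{}".format(EFETCH_URL, urlencode(params))


def download_fasta(accession, output_path):
    """Download one record and write it to output_path (needs network access)."""
    with urlopen(build_fasta_url(accession)) as response:
        data = response.read().decode("utf-8")
    # A FASTA record always starts with a ">" header line
    if not data.startswith(">"):
        raise ValueError("Response for {} does not look like FASTA".format(accession))
    with open(output_path, "w") as handle:
        handle.write(data)


# Example call (commented out so that nothing is downloaded by accident):
# download_fasta("U00096.2", "U00096.2.fasta")
```

Downloading the file by hand from the NCBI website works just as well; the only thing that matters for the next steps is that the FASTA file ends up in the folder you will share with Docker.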
Run the Docker Image
Now that you have the FASTA file with your contigs or complete genome, you can use the command below to run your Docker image. Note that PATH_LOCAL_MACHINE is the path on your machine that contains your FASTA file(s), and FOLDER_NAME_VM is the name of the folder inside the container where Docker will mount it. In my case, I set PATH_LOCAL_MACHINE to /Users/onestop_data/Desktop/phage/ and FOLDER_NAME_VM to phage_files/.
# syntax example
$ docker run -i -v {PATH_LOCAL_MACHINE}:/{FOLDER_NAME_VM} -t {IMAGE_NAME}:{TAG}
# real case
$ docker run -i -v /Users/onestop_data/Desktop/phage/:/phage_files/ -t phage_image:0.1
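If you start the container from a Python script, the same command can be assembled programmatically. The sketch below is only an illustration: docker_run_command is a hypothetical helper, and it resolves the local folder to an absolute path because Docker treats a non-absolute name before the colon as a named volume rather than a folder on your machine.

```python
from pathlib import Path


def docker_run_command(local_dir, container_dir, image, tag):
    """Build the argument list for an interactive `docker run` with a mounted folder."""
    # -v needs an absolute host path, so expand "~" and resolve relative paths
    host_path = Path(local_dir).expanduser().resolve()
    mount = "{}:/{}".format(host_path, container_dir.strip("/"))
    return ["docker", "run", "-i", "-t", "-v", mount, "{}:{}".format(image, tag)]
```

Passing the returned list to subprocess.run from a terminal session should behave like typing the command yourself.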
Prokka: Prokaryotic Genome Annotation
Prokka is used for prokaryotic genome annotation. It is probably one of the most widely used tools for this purpose in the microbial bioinformatics community. If you are interested in learning more about the tool, please read its paper.
Now that your Docker container is running (see the command in the previous section), change your directory to “phage_files/” and run Prokka with the command below.
# enter folder with input files
$ cd phage_files/
# syntax - example
$ prokka --outdir {OUTPUT_DIR} --prefix {OUTPUT_PREFIX} {FASTA_TO_BE_ANNOTATED}
# real case
$ prokka --outdir U00096.2_annoation/ --prefix ecoli_U00096.2 U00096.2.fasta
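Before moving on, it can be worth a quick look at what Prokka produced. The sketch below is my own addition, not part of Prokka: a hypothetical helper, summarize_genbank, that counts records and feature types in a GenBank file with plain text parsing. It assumes the standard GenBank flat-file layout, in which feature keys such as CDS start in column 6 and qualifier lines are indented further.

```python
from collections import Counter


def summarize_genbank(lines):
    """Count records and feature types in an iterable of GenBank flat-file lines."""
    records = 0
    features = Counter()
    in_features = False
    for line in lines:
        if line.startswith("LOCUS"):
            records += 1
            in_features = False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN"):
            in_features = False
        elif in_features and line[:5] == "     " and line[5:6].strip():
            # Feature keys start in column 6; qualifier lines have more indentation
            features[line.split()[0]] += 1
    return records, features
```

For the real case above, you could open U00096.2_annoation/ecoli_U00096.2.gbk and pass the file handle to summarize_genbank; a record count of zero or a missing CDS entry would suggest the annotation did not finish properly.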
Phispy: Finding Prophages
Attention: If you already have a GenBank file, you can skip the Prokka annotation and jump straight to this section.
PhiSpy was initially implemented by Sajia Akhter in Dr. Robert Edwards’s group. The tool identifies prophage regions within a bacterial genome. This is accomplished by training a Random Forest on only a few features, including Shannon entropy. Please read the paper for more information on the method.
# syntax example
$ PhiSpy.py {PROKKA_GBK_OUTPUT} -o {OUTPUT_FOLDER}
# real case
$ PhiSpy.py U00096.2_annoation/ecoli_U00096.2.gbk -o U00096.2_annoation_phages
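To give a feeling for the Shannon entropy feature mentioned above, here is a minimal sketch of the generic formula, H = -sum(p * log2(p)), applied to the k-mer frequencies of a sequence. This is only an illustration under my own assumptions: shannon_entropy is a hypothetical function, and the exact way PhiSpy computes its entropy feature (alphabet, window, k-mer size) may differ, so please check the paper for the real definition.

```python
import math
from collections import Counter


def shannon_entropy(sequence, k=1):
    """Shannon entropy, in bits, of the k-mer frequency distribution of a sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    # Each k-mer contributes -p * log2(p), where p is its relative frequency
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())
```

A sequence made of a single repeated letter has the lowest possible entropy, while a sequence that uses all four bases equally often reaches the maximum for k=1.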
Creating FASTA with Prophage Genomes
Last but not least, you now have the PhiSpy output, which contains the prophage details (and other information).
Please refer to the README file in PhiSpy’s GitHub repository for details about the output. Here, we focus on one of the output files and share a Python script that extracts the full prophage genomes from your initial FASTA file.
Among the outputs, prophage_tbl.tsv contains the details for every ORF annotated by Prokka, such as function ID, function annotation, contig start, contig stop, etc. Its 10th column contains the ORF status: if the value is 0 the ORF is bacterial; otherwise, it is viral. The script below, however, reads the more compact prophage.tbl, in which each row is expected to hold a prophage identifier followed by its location, written as contig_start_end.
Using the script below with prophage.tbl as input (here located in U00096.2_annoation_phages), you can generate a FASTA file with the prophages from your bacterial genome.
from pysam import FastxFile


def fasta_to_hash(fasta_file):
    """Parse FASTA file into a dict.

    Args:
        fasta_file (str): Path to FASTA file.

    Returns:
        dict: key is the contig id and value is the sequence.
    """
    contigs = {}
    with FastxFile(fasta_file) as fh:
        for entry in fh:
            contigs[entry.name] = entry.sequence
    return contigs


def extract_phage_sequences(prophage_tbl, input_fasta, output_fasta):
    """Extract phage sequences from the bacterial genome.

    Args:
        prophage_tbl (str): Path to PhiSpy's prophage.tbl output.
        input_fasta (str): Path to input file in FASTA.
        output_fasta (str): Path to output file in FASTA.
    """
    contigs = fasta_to_hash(input_fasta)
    with open(prophage_tbl) as my_file, open(output_fasta, "w") as output:
        for row in my_file:
            if not row.strip():
                continue  # skip blank lines
            sequence_id, coords = row.strip().split()
            # coords is "contig_start_end"; split from the right so contig ids containing "_" survive
            contig_id, prophage_start, prophage_end = coords.rsplit("_", 2)
            # coordinates are treated as 1-based and inclusive, hence the -1 on the start
            prophage_genome = contigs[contig_id][int(prophage_start) - 1:int(prophage_end)]
            output.write(">{}_{}_start_{}_end_{}\n{}\n".format(
                sequence_id, contig_id, prophage_start, prophage_end, prophage_genome))
    print("Check for: {}\nDone :)".format(output_fasta))


extract_phage_sequences("U00096.2_annoation_phages/prophage.tbl", "U00096.2.fasta", "U00096.2_prophages_genomes.fasta")
Running it prints:
Check for: U00096.2_prophages_genomes.fasta
Done :)
“U00096.2_prophages_genomes.fasta” should contain your prophage genomes.
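As a final sanity check, you can compare the length of each extracted sequence with the coordinates stored in its header. The sketch below is a hypothetical helper of mine, not part of PhiSpy; it assumes the header layout written by the script above (id, contig, start and end joined by underscores) and 1-based inclusive coordinates. A mismatch would suggest that the coordinates fall outside the contig.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_prophage_lengths(lines):
    """Return (header, length, matches) for every record in the FASTA lines."""
    report = []
    for header, sequence in read_fasta(lines):
        # Header layout: {id}_{contig}_start_{start}_end_{end}
        parts = header.rsplit("_", 3)
        start, end = int(parts[1]), int(parts[3])
        expected = end - start + 1  # 1-based, inclusive coordinates
        report.append((header, len(sequence), len(sequence) == expected))
    return report
```

To use it on the real output, open U00096.2_prophages_genomes.fasta and pass the file handle to check_prophage_lengths, then look for any row whose last value is False.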
More Resources
Here are three of my favorite Python bioinformatics books in case you want to learn more.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model
Conclusion
In summary, I hope this tutorial was useful for your task of finding prophages. Hopefully, you have seen how powerful Prokka and PhiSpy are for phage hunting.