Pain-free Genome Depth Plotting

by:

BioinformaticsData visualization

A genome depth plot is a great way to visualize how well a target genome is covered across the genome. This can be used to eyeball if a certain level of coverage was met across the genome or even to identify contamination when looking into metagenomic datasets.

Here I share a python script to generate the depth plot. Enjoy it!

What is a Genome Depth Plot?

A genome depth plot, also known as a read depth plot, is a graphical representation of the coverage of a genome by sequencing reads. It shows the distribution of the number of reads covering each base in the genome. The x-axis represents the positions along the genome, and the y-axis represents the number of reads covering each position.

Genome depth plots are commonly used in next-generation sequencing (NGS) to visualize the coverage of a genome by sequencing reads and to identify regions with low coverage or uneven coverage. They are useful for assessing the quality of the sequencing data and for detecting potential biases or problems in the sequencing process.

In a typical genome depth plot, the coverage should be relatively uniform across the genome, indicating that the sequencing reads are distributed evenly across the genome. However, if the coverage is uneven, it may indicate a problem with the sequencing process, such as a bias in the sequencing library or a failure to sequence certain regions of the genome.

Genome depth plots are also useful for identifying regions with low coverage, which may indicate regions that are difficult to sequence or that are missing in the sequencing data. This information can be used to improve the sequencing process and to ensure that the sequencing data is of high quality.

Overall, genome depth plots are a useful tool for visualizing the coverage of a genome by sequencing reads and for assessing the quality of the sequencing data. They can provide valuable information for improving the sequencing process and for ensuring that the sequencing data is of high quality.

Dependencies

First and foremost, the script presented here relies on a few dependencies:

Numpy and Matplotlib are also dependencies, but they are installed when installing Seaborn.

Generating Genome Depth file

Secondly, we are targeting a 1 single contig genome as a reference and mapping some reads to use using BWA. Therefore, you can use the command lines below to format the reference genome and align the reads to the genome.

$ bwa index {REFERENCE_GENOME.FASTA}
$ bwa mem  -t 16 {REFERENCE_GENOME.FASTA} {READS_INPUT} | samtools sort -o {ALIGNMENT}.bam

Now that we have the alignments from our reads to target genomes, we can use samtools to generate the depth report

$ samtools depth {ALIGNMENT}.bam > {GENOME}.depth

Ps: Please keep in mind that samtools depth contig start count from position 1 rather than 0.

Python Script

Last but not least, now that you have the samtool’s generated depth file, we can generate the visual of the depth.

# !/usr/bin/env python3
# -*- coding: utf-8 -*-

import seaborn as sns
import numpy as np

from matplotlib import (pyplot as plt,
                        lines)


def parse_depth(depth_input, genome_size):
    """Parse depth file.

    Args:
        depth_input (str): Path to depth file.
        genome_size (int): Genome size.

    Returns:
        list: List with depth.

    """
    depth = [0] * genome_size
    references = set()

    with open(depth_input) as depth_object:
        for row in depth_object:
            genome_id, position, depth_count = row.split()

            references.add(genome_id)

            if len(references) > 1:
                raise Exception(' This script only handles one genome - contig.')

            depth[int(position)] = int(depth_count)

    return depth


def plot_depth(depth_report, output_name, plot_title, genome_size, normalize=False, depth_cut_off=20):
    """Plot genome Depth across genome.

    Args:
        depth_report (str): Path to samtool's depth file.
        output_name (str): Path to output PNG image.
        plot_title (str): Plot title.
        genome_size (int): Genome size.
        normalize (bool): If `True`, normalizes the depth by the largest depth (default = `False`).
        depth_cut_off (int): Plot a line to represent a targeted depth (default = 20).

    """
    data = parse_depth(depth_report, genome_size)

    y_label = "Normalized Depth" if normalize else "Depth"
    data = [xx / max(data) for xx in data] if normalize else data

    sns.set(color_codes=True)
    plt.title(plot_title)
    ax = plt.subplot(111)

    sns_plot = sns.lineplot(x=range(len(data)), y=data)
    sns_plot.set(xlabel='Genome Position (bp)', ylabel=y_label)

    if not normalize:
        ax.add_line(lines.Line2D([0, genome_size + 1], [depth_cut_off], color="r"))

    plt.savefig(output_name, bbox_inches='tight', dpi=400)
    plt.close()

    print("Done :)")

plot_depth is the function that should be called, and it takes as input the following parameters:

  • depth_report (str): Path to samtool’s depth file.
  • output_name (str): Path to an output PNG image.
  • plot_title (str): Plot title.
  • genome_size (int): Genome size.
  • normalize (bool): If True, normalizes the depth by the largest depth (default = False).
  • depth_cut_off (int): Plot a line to represent a targeted depth (default = 20).

Now that we know what we need to input into the script and luckily we have what we need, we can call the function. Below it calls the function to plot the normalized and raw depth with depth marker at 40x.

>>> import plot_depth

# raw depth
>>> plot_depth("chlamydia_phage.depth", "depth_normalized.png", "Guinea pig Chlamydia phage", 4529, False, 40)

# normalize depth
>>> plot_depth("chlamydia_phage.depth", "depth_raw.png", "Guinea pig Chlamydia phage", 4529, True, 40)

Here is how the normalized plot looks like

And here is the raw depth plot

More Resources

Here are two of my favorite Data Visualization Python Books in case you want to learn more about it.

Related Posts