Simple Script for Shannon Entropy

This tutorial presents a Python implementation of Shannon entropy and shows how to compute it for a DNA or protein sequence.

1. Shannon Entropy – An Intuitive View of Information Theory

Entropy, or information entropy, is information theory's basic quantity: the expected value of the level of self-information. Entropy was introduced by Claude Shannon and is hence named after him.

Self-information, also introduced by Shannon, quantifies how much information or surprise is associated with one particular outcome, called an event, of a random variable. Shannon entropy quantifies how "informative" or "surprising" the random variable is as a whole, averaged over all of its possible outcomes. Information entropy is generally measured in bits, also known as shannons, or alternatively in nats when natural logarithms are used.

2. Shannon Entropy Equation

Consider X as a discrete random variable that takes finitely many values x, and let P(x) be its probability distribution. We define the self-information of the event X = x as:

I(x) = -log(P(x))

In the above equation, I(x) is expressed in bits when the logarithm is taken in base 2, and in nats when the natural logarithm is used. One nat is the quantity of information gained while observing an event of probability 1/e.
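
As a quick sanity check, here is a minimal Python snippet, using only the standard library, that prints the self-information of an event in both units; the probability value 0.25 is purely illustrative:

import math

p = 0.25  # arbitrary example probability
print(-math.log2(p))          # self-information in bits: 2.0
print(-math.log(p))           # self-information in nats: ~1.386
print(-math.log(1 / math.e))  # an event of probability 1/e carries exactly 1.0 nat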

Now, we can quantify the level of uncertainty in a whole probability distribution using the Shannon entropy equation, where the sum runs over all possible values x of X:

H(X) = -Σ P(x) log2(P(x))

It measures the average uncertainty of X as a number of bits.
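
To make the equation concrete, here is a small sketch that evaluates H(X) directly for a fair coin and a biased coin; the probabilities are illustrative:

import math

def shannon_entropy(probabilities):
    # H(X) = -sum of P(x) * log2(P(x)) over all outcomes x
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))  # biased coin: ~0.469 bits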

3. Use of Entropy in Genomics

Shannon entropy is used in many fields, including bioinformatics.

To illustrate, PhiSpy, a bioinformatics tool to find phages in bacterial genomes, uses entropy as a feature in a Random Forest classifier.
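
For intuition only, here is a minimal sketch (not PhiSpy's actual implementation) of how entropy can serve as a sliding-window feature along a sequence. It reuses the estimate_shannon_entropy function defined in Section 4 below; the window size of 6 is an arbitrary choice:

def window_entropy(sequence, window_size=6):
    # entropy of each overlapping window along the sequence;
    # low-entropy windows flag repetitive or biased regions
    return [estimate_shannon_entropy(sequence[i:i + window_size])
            for i in range(len(sequence) - window_size + 1)]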

4. Code to Compute the Entropy

Here is the Python code to compute the entropy of a given DNA or protein sequence:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import collections
import math


def estimate_shannon_entropy(dna_sequence):
    m = len(dna_sequence)
    # count the occurrences of each residue in the sequence
    bases = collections.Counter(dna_sequence)

    shannon_entropy_value = 0
    for base in bases:
        # n_i: number of residues of type i
        n_i = bases[base]
        # p_i = n_i / M (frequency of residue type i)
        p_i = n_i / m
        shannon_entropy_value += p_i * math.log2(p_i)

    return -shannon_entropy_value

Finally, you can execute the function presented above. Assuming it is saved in a file named estimate_shannon_entropy.py, you can import and call it:

>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785

5. Entropy Calculation with SciPy

Last but not least, if you have SciPy installed, it is probably the easiest way to compute entropy in Python. See below:

import collections

from scipy.stats import entropy


def estimate_shannon_entropy(dna_sequence):
    # count the occurrences of each residue in the sequence
    bases = collections.Counter(dna_sequence)

    # turn the counts into a probability distribution
    dist = [x / sum(bases.values()) for x in bases.values()]

    # use scipy to calculate entropy in bits (base 2)
    entropy_value = entropy(dist, base=2)

    return entropy_value

Now you can test it in the same way:

>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
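
As a side note, scipy.stats.entropy normalizes its input internally, so you can pass the raw residue counts directly instead of building dist yourself. For example, the counts for "ATCGTAGTGAC" are A=3, T=3, C=2, G=3:

>>> from scipy.stats import entropy
>>> entropy([3, 3, 2, 3], base=2)
1.9808259362290785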

More Resources

Here are three of my favorite Python bioinformatics books, in case you want to learn more.
