Simple Script for Shannon Entropy


This tutorial presents a Python implementation of Shannon entropy for computing the entropy of a DNA or protein sequence.

1. Shannon Entropy – An Intuition from Information Theory

Entropy, or information entropy, is the basic quantity of information theory: the expected value of the self-information of a random variable. It was introduced by Claude Shannon and is therefore commonly called Shannon entropy.

Self-information quantifies how much information, or surprise, is associated with one particular outcome (an event) of a random variable: rare outcomes are more surprising than common ones. Shannon entropy averages this self-information over all possible outcomes, so it measures how "informative" or "surprising" the random variable is as a whole. When the base-2 logarithm is used, entropy is measured in bits (also called shannons); with the natural logarithm, the unit is the nat.
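
As a minimal sketch of this idea (using the base-2 self-information -log2(p), which the next section builds on; the helper name is purely illustrative), a rare outcome carries more surprise than a common one:

import math

def self_information(p):
    # surprise, in bits, of a single outcome with probability p
    return -math.log2(p)

print(self_information(0.5))   # a common outcome: 1.0 bit
print(self_information(0.01))  # a rare outcome: ~6.64 bits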

2. Shannon Entropy Equation

The Shannon entropy is a measure of the uncertainty or randomness in a set of outcomes. It is defined mathematically as follows:

H = -∑ p_i log_2(p_i)

Where H is the entropy, p_i is the probability of the i-th outcome, and the summation is taken over all possible outcomes. The log_2 function is used because entropy is usually expressed in units of bits.

The entropy is a non-negative number, with larger values indicating greater uncertainty. If all outcomes are equally likely, the entropy is at its maximum, and if only one outcome is possible, the entropy is zero. The entropy is an important concept in information theory and has applications in many fields, including cryptography, data compression, and coding theory.
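
To make those two extremes concrete, here is a minimal sketch (the helper entropy_from_probs is illustrative, not part of the tutorial's script) that evaluates H for a uniform distribution over the four DNA bases and for a single certain outcome:

import math

def entropy_from_probs(probs):
    # H = sum over outcomes of -p_i * log2(p_i), skipping zero-probability outcomes
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_from_probs([0.25, 0.25, 0.25, 0.25]))  # four equally likely bases -> 2.0 bits (maximum)
print(entropy_from_probs([1.0]))                     # a single certain outcome -> 0.0 bits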

3. Use of Entropy in Genomics

Shannon entropy is applicable in many fields, including bioinformatics.

For example, PhiSpy, a bioinformatics tool that finds prophages in bacterial genomes, uses entropy as a feature in a Random Forest classifier.

4. Code to Compute the Entropy

First, here is the plain Python code for computing the entropy of a given DNA/protein sequence:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import collections
import math


def estimate_shannon_entropy(dna_sequence):
    m = len(dna_sequence)
    # count how many times each residue (base or amino acid) occurs
    bases = collections.Counter(dna_sequence)

    shannon_entropy_value = 0
    for base in bases:
        # n_i: number of residues of type i
        n_i = bases[base]
        # p_i = n_i (# residues of type i) / M (# residues in the sequence)
        p_i = n_i / m
        shannon_entropy_value += p_i * math.log(p_i, 2)

    return -shannon_entropy_value

You can then execute the function presented above (assuming it is saved in a module named estimate_shannon_entropy.py):

>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
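
As a quick sanity check, the two limiting cases from Section 2 can be verified with the same function:

print(estimate_shannon_entropy("AAAA"))  # a single residue type: 0 bits (may display as -0.0)
print(estimate_shannon_entropy("ACGT"))  # four equiprobable bases: 2.0 bits, the maximum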

5. Entropy Calculation with Scipy

Last but not least, if you have SciPy installed, it is probably the easiest way to compute entropy in Python. See below:

import collections

from scipy.stats import entropy


def estimate_shannon_entropy(dna_sequence):
    # count how many times each residue occurs
    bases = collections.Counter(dna_sequence)

    # define the probability distribution over residues
    total = sum(bases.values())
    dist = [count / total for count in bases.values()]

    # use scipy to calculate the entropy in bits (base 2)
    entropy_value = entropy(dist, base=2)

    return entropy_value

Now you can test it the same way:

>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
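
One small convenience: scipy.stats.entropy normalizes its input to sum to 1, so a sketch like the one below, which passes the raw residue counts directly, should give the same result:

import collections
from scipy.stats import entropy

# entropy() normalizes its input, so raw residue counts work as well
counts = collections.Counter("ATCGTAGTGAC")
print(entropy(list(counts.values()), base=2))  # ~1.98 bits, as above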
