This tutorial presents a Python implementation of the Shannon entropy algorithm to compute the entropy of a DNA or protein sequence.
1. The Shannon Entropy – An Intuition from Information Theory
Entropy, or information entropy, is the basic quantity of information theory: the expected value of a random variable's self-information. It was introduced by Claude Shannon and is named after him.
Self-information quantifies how much information, or surprise, is associated with one particular outcome (an event) of a random variable. Shannon entropy quantifies how "informative" or "surprising" the random variable is as a whole, by averaging over all of its possible outcomes. Information entropy is usually measured in bits (also called shannons), though other units such as nats are sometimes used.
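To make self-information concrete, here is a quick sketch (my own illustration, not part of Shannon's original presentation) computing -log_2(p) for a few outcome probabilities:

import math

# Self-information of a single outcome: I(x) = -log2(p(x))
print(-math.log2(0.5))   # a fair coin flip carries 1 bit of surprise
print(-math.log2(0.25))  # a one-in-four outcome carries 2 bits
print(-math.log2(1.0))   # a certain outcome carries no surprise: 0.0

The rarer the outcome, the larger its self-information; entropy, defined next, is the probability-weighted average of these values.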
2. Shannon Entropy Equation
The Shannon entropy is a measure of the uncertainty or randomness in a set of outcomes. It is defined mathematically as follows:
H = -∑_i p_i log_2(p_i)
Where H is the entropy, p_i is the probability of the i-th outcome, and the summation is taken over all possible outcomes. The log_2 function is used because entropy is usually expressed in units of bits.
Entropy is a non-negative number, with larger values indicating greater uncertainty. If all n outcomes are equally likely, the entropy reaches its maximum of log_2(n); if only one outcome is possible, the entropy is zero. Entropy is an important concept in information theory and has applications in many fields, including cryptography, data compression, and coding theory.
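As a quick sanity check of these properties, the following snippet (an illustration I am adding here, not part of the original formula) evaluates the definition for a few distributions over four outcomes:

import math

def shannon_entropy(probabilities):
    # H = -sum(p_i * log2(p_i)); zero-probability outcomes are skipped by convention
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: maximum, log2(4) = 2.0
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # biased: lower, ~1.357
print(shannon_entropy([1.0]))                     # certain outcome: 0.0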
3. Use of Entropy in Genomics
Shannon Entropy is applicable in many fields including bioinformatics.
To illustrate, PhiSpy, a bioinformatics tool for finding phages in bacterial genomes, uses entropy as a feature in a random forest classifier.
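PhiSpy's actual feature pipeline is more involved, but the idea is easy to sketch. The toy example below (my own illustration; the window size and the toy genome are arbitrary choices, not PhiSpy's) computes base-composition entropy over sliding windows, the kind of signal that can expose regions, such as inserted phage DNA, whose composition differs from the rest of the genome:

import collections
import math

def window_entropies(genome, window=400, step=400):
    # Shannon entropy (in bits) of the base composition of each window
    values = []
    for start in range(0, len(genome) - window + 1, step):
        counts = collections.Counter(genome[start:start + window])
        total = sum(counts.values())
        values.append(-sum((n / total) * math.log2(n / total)
                           for n in counts.values()))
    return values

# Toy genome: a uniform region followed by an A/T-rich region
genome = "ATCG" * 500 + "AAAT" * 500
print(window_entropies(genome))  # entropy drops from 2.0 to ~0.81 at the shift

Each window's entropy can then be fed to a classifier as one feature among many.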
4. Code to Compute the Entropy
Here is the Python code for computing the entropy of a given DNA or protein sequence:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import collections
import math


def estimate_shannon_entropy(dna_sequence):
    m = len(dna_sequence)
    bases = collections.Counter(dna_sequence)

    shannon_entropy_value = 0
    for base in bases:
        # n_i: number of residues of type i
        n_i = bases[base]
        # p_i = n_i (# residues of type i) / m (total # residues)
        p_i = n_i / float(m)
        shannon_entropy_value += p_i * math.log(p_i, 2)

    return shannon_entropy_value * -1
Finally, you can call the function defined above:
>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
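To see where that number comes from, you can redo the calculation directly from the base counts of "ATCGTAGTGAC" (three A's, three T's, three G's, two C's):

import math

counts = {"A": 3, "T": 3, "G": 3, "C": 2}  # base counts in "ATCGTAGTGAC"
total = sum(counts.values())
print(-sum((n / total) * math.log2(n / total) for n in counts.values()))
# 1.9808259362290785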
5. Entropy Calculation with SciPy
Last but not least, if you have SciPy installed, it is probably the easiest way to compute entropy in Python. See below:
import collections

from scipy.stats import entropy


def estimate_shannon_entropy(dna_sequence):
    bases = collections.Counter(dna_sequence)
    # build the probability distribution from the base counts
    total = sum(bases.values())
    dist = [count / total for count in bases.values()]

    # scipy.stats.entropy returns the Shannon entropy; base=2 gives bits
    return entropy(dist, base=2)
Now you can test it:
>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
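As a side note, scipy.stats.entropy normalizes its input to sum to 1, so you can skip building the distribution by hand and pass the raw counts directly:

>>> import collections
>>> from scipy.stats import entropy
>>> bases = collections.Counter("ATCGTAGTGAC")
>>> entropy(list(bases.values()), base=2)
1.9808259362290785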
More Resources
Here are three of my favorite books on Python for bioinformatics, in case you want to learn more.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster and Gordon Webster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model