This tutorial presents a Python implementation of Shannon entropy and uses it to compute the entropy of a DNA or protein sequence.

## 1. The Shannon Entropy – An Intuitive Information Theory

Entropy, or information entropy, is the basic quantity of information theory: it is the expected value of the self-information of a random variable. It was introduced by Claude Shannon and is named after him.

Self-information quantifies how much information, or surprise, is associated with one particular outcome of a random variable. Such an outcome is referred to as an event. Shannon entropy averages this quantity over all possible outcomes, so it measures how informative or surprising the random variable is as a whole. Entropy is usually measured in bits (also called shannons) when the logarithm is taken to base 2, or in nats when the natural logarithm is used.
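To make the idea of surprise concrete, here is a minimal sketch that computes the self-information, in bits, of events with different probabilities. The function name `self_information` is only illustrative and does not come from any library.

```python
import math


def self_information(p):
    """Return the self-information, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1 / p) is the same quantity as -log2(p)
    return math.log2(1 / p)


# The rarer the event, the more bits of surprise it carries.
for p in (1.0, 0.5, 0.25, 0.01):
    print(f"p = {p}: {self_information(p):.3f} bits")
```

An event with probability 0.5 carries 1 bit, and one with probability 0.25 carries 2 bits. A certain event (probability 1) carries no information at all.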

## 2. Shannon Entropy Equation

Consider $X$ as a random variable taking finitely many values $x_1, \dots, x_n$, and consider $p$ as its probability distribution. We define the self-information of the event $X = x_i$ as below:

$$I(x_i) = -\log p(x_i)$$

In the above equation, $I(x_i)$ is expressed in bits when the logarithm is taken to base 2, and in nats when the natural logarithm is used. One nat is the quantity of information gained by observing an event of probability $1/e$.

Now, we can quantify the level of uncertainty in a whole probability distribution using the equation of Shannon entropy as below:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$

With a base-2 logarithm, it quantifies the average uncertainty of $X$ as a number of bits.
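As a quick worked example, the sketch below evaluates the entropy formula (the negative sum of each probability times its base-2 logarithm) for three toy distributions over four symbols. The function name `shannon_entropy` and the example distributions are illustrative choices, not part of the original code.

```python
import math


def shannon_entropy(probabilities):
    """Entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # four equally likely symbols
skewed = [0.7, 0.1, 0.1, 0.1]       # one symbol dominates
certain = [1.0, 0.0, 0.0, 0.0]      # only one symbol ever occurs

for name, dist in (("uniform", uniform), ("skewed", skewed), ("certain", certain)):
    print(name, round(shannon_entropy(dist), 3))
```

The uniform distribution reaches the maximum of 2 bits for four symbols, the skewed one gives about 1.36 bits, and the certain one gives 0 bits.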

## 3. Use of Entropy in Genomics

Shannon Entropy is applicable in many fields including bioinformatics.

To illustrate, PhiSpy, a bioinformatics tool for finding phages in bacterial genomes, uses entropy as a feature in a random forest.
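One common way to turn entropy into a per-region feature is to compute it over sliding windows along a sequence. The sketch below only illustrates that idea. It is not PhiSpy's actual implementation, and the function names, window size, step, and toy sequence are all assumptions made for the example.

```python
import collections
import math


def window_entropy(sequence):
    """Shannon entropy, in bits, of the symbol frequencies in a sequence."""
    counts = collections.Counter(sequence)
    total = len(sequence)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def sliding_entropy(sequence, window=8, step=4):
    """Yield (start, entropy) for each full window along the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, window_entropy(sequence[start:start + window])


# Hypothetical toy sequence: a low-complexity run followed by a mixed region.
toy = "A" * 12 + "ACGT" * 3
for start, h in sliding_entropy(toy):
    print(start, round(h, 3))
```

Windows that fall entirely inside the run of `A` have zero entropy, while windows over the mixed `ACGT` region reach 2 bits.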

## 4. Code to Compute the Entropy

Here is the Python code for computing the entropy of a given DNA/protein sequence:

```
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import collections
import math


def estimate_shannon_entropy(dna_sequence):
    """Return the Shannon entropy (in bits) of a DNA/protein sequence."""
    m = len(dna_sequence)
    # count how many times each residue occurs
    bases = collections.Counter(dna_sequence)

    shannon_entropy_value = 0
    for base in bases:
        # number of residues of type i
        n_i = bases[base]
        # n_i (# residues type i) / M (# residues in sequence)
        p_i = n_i / float(m)
        entropy_i = p_i * (math.log(p_i, 2))
        shannon_entropy_value += entropy_i

    return shannon_entropy_value * -1
```

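Finally, you can execute the function presented above. The session below assumes the code is saved as `estimate_shannon_entropy.py` in the current directory:

```
>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
```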
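A few sanity checks with known answers help confirm the calculation. The sketch below repeats the same computation in compact form so that it runs on its own. The test sequences are illustrative: a homopolymer should give 0 bits, four equally frequent bases should give 2 bits, and twenty equally frequent amino acids should give log2(20), about 4.32 bits.

```python
import collections
import math


def estimate_shannon_entropy(sequence):
    """Compact version of the same calculation, repeated for self-containment."""
    m = len(sequence)
    counts = collections.Counter(sequence)
    return -sum((n / m) * math.log2(n / m) for n in counts.values())


# A single repeated residue carries no uncertainty.
assert estimate_shannon_entropy("AAAAAAAA") == 0

# Four equally frequent bases reach the DNA maximum of log2(4) = 2 bits.
assert estimate_shannon_entropy("ACGTACGT") == 2.0

# Twenty equally frequent amino acids reach log2(20) bits.
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
assert math.isclose(estimate_shannon_entropy(amino_acids), math.log2(20))

print("all checks passed")
```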

## 5. Entropy Calculation with Scipy

Last but not least, if you have SciPy installed on your computer, `scipy.stats.entropy` is probably the easiest way to compute entropy in Python. See below:

```
import collections

from scipy.stats import entropy


def estimate_shannon_entropy(dna_sequence):
    bases = collections.Counter(dna_sequence)
    # define distribution
    total = sum(bases.values())
    dist = [x / total for x in bases.values()]
    # use scipy to calculate entropy
    entropy_value = entropy(dist, base=2)
    return entropy_value
```

Now you can test it (with the function saved as `estimate_shannon_entropy.py`):

```
>>> from estimate_shannon_entropy import estimate_shannon_entropy
>>> print(estimate_shannon_entropy("ATCGTAGTGAC"))
1.9808259362290785
```

## More Resources

Here are three of my favorite Python bioinformatics books in case you want to learn more:

- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model