Bloom Filter Made Simple: Theory and Code

onestop_databy:

Python

This tutorial simplifies Bloom Filter in Python by teaching what is a bloom filter, talks about its false positive and false negative rate, introduces some graphics a video, and how it is used in genomics and other applications coding.

You know, I’m a huge fan of hash tables (Python dictionary) because of its time complexity (O(1)) and for being able to query a value associated with a key – love it. However, in cases where memory usage is limited a bloom filter may be a better solution.

1. What is a Bloom Filter (BF)?

First and foremost, Bloom filter is one of those data structures that not many people have to know about, but it is quite often present in interviews with Google and Facebook.

For example, I nailed an interview in the past because I gave an innovative solution to the problem using BF, and the interviewers did not know about it. On the other hand, I have seen some of my friends blowing interviews because they did not know about BFs.

BF is a probabilistic data structure that is great to be queried and asks the question: “Is this key in my BF?” Here are possible answers:

  • Yes! In this case, it is more a Maybe. As the bloom filter is a probabilistic data structure, it can’t guarantee with 100% of certainty that the key is in it
  • No! If it says it says NO it is really NO. The structure does not return False Negatives.

You may be asking. Why would I want to use this if I can’t guarantee that my key is in the structure? Well…. there are ways to set the bloom filter to have a really low False Positive rate. See below.

2. False Positive rate

The bloom filter data structure works around a bit array and many hash functions (could be one – not recommended) which will set values into the bit array. As the same hash function result can collide within a bit array, it allows the data structure to use much less memory usage than a hash table. However, for the same reason, it may return False positives (it says the key is in the BF when it is not).

One of the things that can be done to reduce the false positive rate is to increase the bit array size and which translates in including more hash functions. Nonetheless, this will require more memory from you.

3. False Negative rate

As I said before when the BF is queried if the answer is that the key is not present in the BF, it is 100% not there. It is true because it means that at least one hash function value was not set in the bit array.

4. BF on Graphics

Robert Edwards discusses Bloom Filters below and gives an visual example of checking if a k-mer is present in the BF or not.

5. Python code

BF is not a native data structurer in Python such as list, set, dict (hash), etc. However, you can simply use pip to install it by calling pip install bloom-filter (shoutout to python-bloom-filter). This module is a wrapper that takes care of all the details described above.

The module bloom_filter does all the job for your (setting hash value in the bit array, etc). The only thing you need to pass is the expected number of elements that you will have in your bloom filter and the error rate. As you probably know, a lower error rate implies more memory usage due to probably having more hash functions and a later bit array.

On the example below, I know I will be only adding the 4 DNA bases and will be checking for a 5 base (“N”).

# import BF library
from bloom_filter import BloomFilter

# define BF
dna_bf = BloomFilter(max_elements=5, error_rate=0.01)

# Adds to BF bases
bases = ["A", "T", "G", "C"]
for dna in bases:
    assert (dna in dna_bf) == False
    # Mark the key as seen
    dna_bf.add(dna)

# queries BF all the bases that were added + `N`
for base in bases + ["N"]:
    if base in dna_bf:
        print("Base {} is very likely in the BF (0.01 error rate)".format(base))
    else:
        print("Base {} is for sure not in the BF".format(base))

which prints

Base A is very likely to be in the BF (0.01 error rate)
Base T is very likely to be in the BF (0.01 error rate)
Base G is very likely to be in the BF (0.01 error rate)
Base C is very likely to be in the BF (0.01 error rate)
Base N is for sure not in the BF

6. Real Applications

Obviously, the bloom filter example above was very silly just to demonstrate the module. But a real life application for this on checking for malicious URLs before allowing the user to enter it. Have you wondered how many malicious URLs were identified by Google? Now imagine you trying to store it into a Hash – no good!

Here, a bloom filter is handy here because all you want to do is to check if the URL is malicious. As you learned earlier: If we don’t find the URL int he BF, it is 100% sure that it is not malicious. However, if it finds in the BF, it could be malicious or not (the false positive rate can be controlled with the size of the bit array and number of hash functions).

Furthermore, khmer and bfcounter use BF to support the k-mer counting in (meta)genomics.

More Resources

Here are three of my favorite Python Books in case you want to learn more about it.

Conclusion

In summary, I hope you learned today how useful Bloom can be useful. Your browser probably ran a BF check for a malicious page on this blog – I’m glad it did not come out as a False Positive 😉

Related Posts