This tutorial teaches you how to use Locality Sensitive Hashing (LSH) to detect near-duplicate sentences.
Identifying similar sentences is something that happens every time you run a Google or YouTube search. As you probably know, when you search for something on those engines, you don't get only exact string matches. Instead, the engine retrieves near-duplicate strings, and that is what we are going to learn how to do here today.
1. What is the LSH Algorithm?
LSH (Locality-Sensitive Hashing) is a technique used in computer science for efficient similarity search in high-dimensional spaces. It is a hashing-based algorithm that maps high-dimensional data points to lower-dimensional hash codes in such a way that similar data points are more likely to be mapped to the same hash code.
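To make that idea concrete, here is a tiny, self-contained sketch of the bucketing step (plain Python, not SnaPy; the names and numbers are made up for illustration): items whose signatures agree on any band of positions fall into the same bucket and become candidate pairs, while everything else is never compared.
from collections import defaultdict
# Toy 4-value signatures; in real LSH these would be MinHash signatures.
signatures = {
    "sentence A": (12, 7, 93, 41),
    "sentence B": (12, 7, 55, 41),   # similar to A: agrees on several positions
    "sentence C": (88, 3, 21, 60),   # unrelated
}
band_size = 2
buckets = defaultdict(list)
for label, signature in signatures.items():
    for start in range(0, len(signature), band_size):
        band = (start, signature[start:start + band_size])
        buckets[band].append(label)
# Only items that share a bucket are compared in full.
candidate_pairs = {tuple(sorted(group)) for group in buckets.values() if len(group) > 1}
print(candidate_pairs)  # {('sentence A', 'sentence B')}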
The main benefit of LSH is that it performs similarity search in high-dimensional spaces more efficiently than traditional techniques. This is because LSH reduces the dimensionality of the data, which in turn reduces the number of comparisons that need to be made.
Another benefit of LSH is that it can be used to handle noisy data, as it is robust to small variations in the data. This makes LSH a useful tool for applications such as image and text retrieval, where the data can be noisy or have small variations.
LSH is also a probabilistic algorithm, which means it trades some accuracy for efficiency: it may not always find the exact nearest neighbors, but it provides an approximation that is close enough for many applications.
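As a rough illustration of that trade-off, the sketch below (plain Python, independent of SnaPy; the function names are just placeholders) breaks two sentences into character shingles and compares the exact Jaccard similarity with a MinHash estimate, which should land close to the true value without ever intersecting the full shingle sets.
import random

def shingles(text, n=3):
    # Overlapping character n-grams (shingles) of the text.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_perm=200, seed=3):
    # One minimum hash value per "permutation" (roughly simulated here with XOR masks).
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash(s) ^ mask for s in shingle_set) for mask in masks]

a = shingles("the pigs were insulted that they were named hamburgers")
b = shingles("the cat were insulted that they were named burritos")
exact = len(a & b) / len(a | b)
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print("exact Jaccard: {:.2f}, MinHash estimate: {:.2f}".format(exact, estimate))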
Overall, the LSH algorithm is a useful tool for performing similarity search in high-dimensional spaces. Its ability to reduce dimensionality and handle noisy data, combined with its efficiency and probabilistic nature, makes it a popular choice for many applications in computer science and data science.
2. Script
The script below was inspired by the SnaPy usage examples, and SnaPy is the script's only third-party dependency.
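If you want to run it yourself, SnaPy is available on PyPI; to the best of my knowledge it relies on the mmh3 hashing package, so installing both explicitly is the safest bet:
pip install snapy mmh3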
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
from snapy import MinHash, LSH
SEED = 3

def load_content(sentence_file):
    """Load the input file with sentences used to build the LSH model.

    Args:
        sentence_file (str): Path to a txt file with one sentence per line.

    Returns:
        dict: Maps the lower-cased, comma-free version of each sentence to the original string.
    """
    sentences = {}
    with open(sentence_file) as content:
        for line in content:
            line = line.strip()
            line_clean = line.replace(",", "")
            line_clean = line_clean.lower()
            sentences[line_clean] = line
    return sentences

def create_lsh(content, no_of_bands, n_permutations, n_gram):
    """Create MinHash signatures and a Locality Sensitive Hashing (LSH) model to detect near-duplicate texts.

    Args:
        content (list): List of strings to build the LSH model from.
        no_of_bands (int): Number of bands to break the minhash signature into before hashing into buckets.
        n_permutations (int): Number of permutations used to create the minhash signatures.
        n_gram (int): Size of each overlapping text shingle to break the text into prior to hashing.

    Returns:
        snapy.lsh.LSH: SnaPy LSH object.
    """
    labels = range(len(content))
    # Create MinHash object.
    minhash = MinHash(content, n_gram=n_gram, permutations=n_permutations, hash_bits=64, seed=SEED)
    # Create LSH model.
    lsh = LSH(minhash, labels, no_of_bands=no_of_bands)
    return lsh

def find_near_duplicate(query_sentences, sentences, min_jaccard_value, no_of_bands, n_permutations, n_gram):
    """Use an LSH object to find the near-duplicate strings for each query sentence.

    Args:
        query_sentences (dict): Dict mapping the query strings (lower case, without commas) to the originals.
        sentences (dict): Dict mapping the target strings (lower case, without commas) to the originals.
        min_jaccard_value (float): Minimum Jaccard similarity for a pair of texts to be returned as near duplicates.
        no_of_bands (int): Number of bands to break the minhash signature into before hashing into buckets.
        n_permutations (int): Number of permutations used to create the minhash signatures.
        n_gram (int): Size of each overlapping text shingle to break the text into prior to hashing.
    """
    # Build one LSH model over the query strings followed by the target strings.
    content = list(query_sentences.keys()) + list(sentences.keys())
    lsh = create_lsh(content, no_of_bands, n_permutations, n_gram)
    # Query the model once per query sentence and print its near duplicates.
    for index_query, search_string in enumerate(query_sentences):
        closest_results = lsh.query(index_query, min_jaccard=min_jaccard_value)
        print("{} QUERY: {}".format(index_query + 1, query_sentences[search_string]))
        for content_index in closest_results:
            result = content[content_index]
            # Only print matches that come from the target file.
            if result in sentences:
                print(sentences[result])
        print()

def parse_args():
    """Parse the arguments entered by the user.

    Returns:
        argparse.Namespace: Parsed arguments.
    """
    parser = argparse.ArgumentParser(
        description="Detect near duplicate texts using Minhash and Locality Sensitive Hashing.",
        epilog="example > python3 find_near_duplicate.py -q INPUT -t TARGETS")
    parser.add_argument("-q", "--query", help="Path to file with sentences to query", required=True)
    parser.add_argument("-t", "--targets", help="Path to file with sentences to be matched against", required=True)
    parser.add_argument("-g", "--n_gram", help="Size of each overlapping text shingle to break text into "
                                               "prior to hashing", default=9)
    parser.add_argument("-p", "--n_permutations", help="Number of permutations used to create minhash signatures used "
                                                       "in LSH model.", default=100)
    parser.add_argument("-j", "--min_jaccard", help="Jaccard similarity threshold texts have to exceed to be "
                                                    "returned as similar.", default=0.25)
    parser.add_argument("-b", "--no_of_bands", help="Number of bands to break minhash signature into "
                                                    "before hashing into buckets.", default=50)
    return parser.parse_args()

def main():
    args = parse_args()
    query = args.query
    targets = args.targets
    min_jaccard_value = float(args.min_jaccard)
    n_gram = int(args.n_gram)
    n_permutations = int(args.n_permutations)
    no_of_bands = int(args.no_of_bands)
    # Load sentences from the input files.
    query_sentences = load_content(query)
    targets_sentences = load_content(targets)
    # Find near-duplicate target sentences for each query sentence.
    find_near_duplicate(query_sentences, targets_sentences, min_jaccard_value, no_of_bands, n_permutations, n_gram)

if __name__ == "__main__":
    main()
3. Script Usage
For reference, here is the script's help output, which describes each parameter:
usage: find_near_duplicate.py [-h] -q QUERY -t TARGETS [-g N_GRAM]
                              [-p N_PERMUTATIONS] [-j MIN_JACCARD]
                              [-b NO_OF_BANDS]

Detect near duplicate texts using Minhash and Locality Sensitive Hashing.

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Path to file with sentences to query
  -t TARGETS, --targets TARGETS
                        Path to file with sentences to be matched against
  -g N_GRAM, --n_gram N_GRAM
                        Size of each overlapping text shingle to break text
                        into prior to hashing
  -p N_PERMUTATIONS, --n_permutations N_PERMUTATIONS
                        Number of permutations used to create minhash
                        signatures used in LSH model.
  -j MIN_JACCARD, --min_jaccard MIN_JACCARD
                        Jaccard similarity threshold texts have to exceed to
                        be returned as similar.
  -b NO_OF_BANDS, --no_of_bands NO_OF_BANDS
                        Number of bands to break minhash signature into
                        before hashing into buckets.

example > python3 find_near_duplicate.py -q INPUT -t TARGETS
4. Case Study
To validate the script, I generated five random sentences to be used as the queries (the -q parameter in the script).
Three generations with six decades of life experience.
He looked behind the door and didn't like what he saw.
The pigs were insulted that they were named hamburgers.
Not all people who wander are lost.
It is grossly unfair to suggest that the school was responsible for the accident.
Then I slightly modified the sentences, making three variations of each, to be the target sentences (the -t parameter).
Two generations with six days of life experience.
She looked in front the door and knew shat she saw.
The cat were insulted that they were named burritos.
all people who wander are lost.
It is grossly unfair to tell that the school was responsible for the kids lunch.
3 generations with 6 decades of life experience.
He looked behind the car and didn't like what he wanted.
The horses were insulted that they were named Maple.
Not all humans who wander are lost.
It is grossly fair to suggest that the school was responsible for kids success.
10000 generations with 15 decades of life experience.
They looked behind the door and didn't like what they saw.
The man was insulted that he was named hamburgers.
Not all animals who wander are found.
It is cool to suggest that the hospital was responsible for the accident.
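Assuming the five query sentences are saved in a file called queries.txt and the fifteen modified sentences in a file called targets.txt (both names are just placeholders), the script can then be run like this, optionally adding -j to adjust the similarity threshold:
python3 find_near_duplicate.py -q queries.txt -t targets.txt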
These were the results found. As you can see, LSH was able to find the near-duplicate sentences.
1 QUERY: Three generations with six decades of life experience.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.
2 QUERY: He looked behind the door and didn't like what he saw.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.
3 QUERY: The pigs were insulted that they were named hamburgers.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.
4 QUERY: Not all people who wander are lost.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.
5 QUERY: It is grossly unfair to suggest that the school was responsible for the accident.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.
Last but not least, the other parameters are important to tune depending on the data used for the query and target files.
Please check the SnaPy documentation for more examples and guidance on tuning the parameters. In this study, I only needed to tune the Jaccard similarity threshold (-j).
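As a rough guide for that tuning, the standard approximation for MinHash LSH says that with b bands and r = permutations / b rows per band, pairs start becoming candidates at a Jaccard similarity of roughly (1/b)^(1/r). The small snippet below (plain Python, not part of the script above) computes that value for the script's defaults of 100 permutations and 50 bands:
def approximate_lsh_threshold(n_permutations, no_of_bands):
    # Approximate similarity at which a pair becomes an LSH candidate.
    rows_per_band = n_permutations / no_of_bands
    return (1 / no_of_bands) ** (1 / rows_per_band)

print(approximate_lsh_threshold(100, 50))  # ~0.14, below the default -j of 0.25
With the default settings the banding step therefore surfaces candidates well below the -j threshold of 0.25, so in practice the -j value does most of the filtering.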
More Resources
Here are two of my favorite machine learning with Python books, in case you want to learn more about the topic:
- Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Müller, Sarah Guido
- Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn by Sebastian Raschka, Vahid Mirjalili
Conclusion
In summary, I hope this tutorial is useful to you when you need to find near-duplicate sentences. LSH is a powerful algorithm; enjoy it.
Please comment below on what kind of application you intend to use this algorithm for.