Simple Near-duplicate String Detection with LSH

by:

Machine Learning

This tutorial teaches you how to use Locality Sensitive Hashing (LSH) to detect near-duplicate sentences.

Moreover, the task of identifying similar sentences is a common task every time you run a google or youtube search. As you probably know, when you search for something on those engines, you don’t get an exact string match. Instead, it retrieves a near-duplicate String, and what is what we are going to learn here today.

1. What is the LSH Algorithm?

LSH (Locality-Sensitive Hashing) is a technique used in computer science for efficient similarity search in high-dimensional spaces. It is a hashing-based algorithm that maps high-dimensional data points to lower-dimensional hash codes in such a way that similar data points are more likely to be mapped to the same hash code.

The main benefit of LSH is that it provides a way to perform similarity search in high-dimensional spaces in a more efficient way compared to traditional techniques. This is because LSH reduces the dimensionality of the data, which leads to a reduction in the number of comparisons that need to be made.

Another benefit of LSH is that it can be used to handle noisy data, as it is robust to small variations in the data. This makes LSH a useful tool for applications such as image and text retrieval, where the data can be noisy or have small variations.

LSH is also a probabilistic algorithm, which means that it provides a trade-off between accuracy and efficiency. This means that LSH may not always find the exact nearest neighbors, but it can still provide an approximation that is close enough for many applications.

Overall, the LSH algorithm is a useful tool for performing similarity search in high-dimensional spaces. Its ability to reduce dimensionality and handle noisy data, combined with its efficiency and probabilistic nature, make it a popular choice for many applications in computer science and data science.

The video below explains more in details the algorithm:

2. Script

Now, the script below was inspired by how to use SnaPy which has the library as one of the script’s dependency.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse

from snapy import MinHash, LSH

SEED = 3


def load_content(sentence_file):
    """Load input file with sentences to build LSH.

    Args:
        sentence_file (str): Path to input with txt file with sentences to Build LSH.

    Returns:
        dict: Dict with strings and version of string in lower case and without comma.

    """
    sentences = {}
    with open(sentence_file) as content:
        for line in content:
            line = line.strip()
            line_clean = line.replace(",", "")
            line_clean = line_clean.lower()
            sentences[line_clean] = line

    return sentences


def create_lsh(content, no_of_bands, n_permutations, n_gram):
    """Create Minhash and Locality Sensitive Hashing (LSH) to detect near duplicate texts.

    Args:
        content (list): List with string to build LSH.
        no_of_bands (int): Number of bands to break minhash signature into before hashing into buckets.
        n_permutations (int): Number of permutations used to create minhash signatures used in LSH model.
        n_gram (int): Size of each overlapping text shingle to break text into prior to hashing.
        no_of_bands(int): Number of bands to break minhash signature into before hashing into buckets.

    Returns:
        class 'snapy.lsh.LSH':  Snapy LSH object.

    """
    labels = range(len(content))

    # Create MinHash object.
    minhash = MinHash(content, n_gram=n_gram, permutations=n_permutations, hash_bits=64, seed=SEED)

    # Create LSH model.
    lsh = LSH(minhash, labels, no_of_bands=no_of_bands)

    return lsh


def find_near_duplicate(query_sentences, sentences, min_jaccard_value, no_of_bands, n_permutations, n_gram):
    """Using LSH object finds the near duplicate strings.

    Args:
        query_sentences (dict): Dict with query strings and version of string in lower case and without comma.
        sentences (dict): Dict with target strings and version of string in lower case and without comma.
        min_jaccard_value (float): Minimum value for the Jaccard Distance.
        no_of_bands (int): Number of bands to break minhash signature into before hashing into buckets.
        n_permutations (int): Number of permutations used to create minhash signatures used in LSH model.
        n_gram (int): Size of each overlapping text shingle to break text into prior to hashing.

    """
    content = list(query_sentences.keys()) + list(sentences.keys())
    lsh = create_lsh(content, no_of_bands, n_permutations, n_gram)

    # Query to find near duplicates the string in `search`
    closest_results = lsh.query(0, min_jaccard=min_jaccard_value)

    for index_query, search_string in enumerate(query_sentences):
        print("{} QUERY: {}".format(index_query + 1, query_sentences[search_string]))
        for content_index in closest_results:
            result = content[content_index]
            print(sentences[result])
        print()


def parse_args():
    """Parse args entered by the user.

    Returns:
        argparse.Namespace: Parsed arguments.

    """
    parser = argparse.ArgumentParser(
        description="Detect near duplicate texts using Minhash and Locality Sensitive Hashing.",
        epilog="example > python3 find_near_duplicate.py  -q INPUT -t TARGERS")
    parser.add_argument("-q", "--query", help="Path to file with sentences to query", required=True)
    parser.add_argument("-t", "--targets", help="Path to file with sentences be matched against", required=True)
    parser.add_argument("-g", "--n_gram", help="Size of each overlapping text shingle to break text into "
                                               "prior to hashing", default=9)
    parser.add_argument("-p", "--n_permutations", help="Number of permutations used to create minhash signatures used "
                                                       "in LSH model.", default=100)
    parser.add_argument("-j", "--min_jaccard", help="Jaccard similarity threshold texts have to exceed to be "
                                                          "returned as similar.", default=0.25)
    parser.add_argument("-b", "--no_of_bands", help="Number of bands to break minhash signature into "
                                                    "before hashing into buckets..", default=50)
    return parser.parse_args()


def main():
    args = parse_args()

    query = args.query
    targets = args.targets
    min_jaccard_value = float(args.min_jaccard)
    n_gram = int(args.n_gram)
    n_permutations = int(args.n_permutations)
    no_of_bands = int(args.no_of_bands)

    # load sentences from file
    query_sentences = load_content(query)
    targets_sentences = load_content(targets)

    # find near duplicate sequences to `search_string`
    find_near_duplicate(query_sentences, targets_sentences, min_jaccard_value, no_of_bands, n_permutations, n_gram)


if __name__ == "__main__":
    main()

3. Script Usage

Additionally, please see below the script usage and every detail on how to use the script presented above.

usage: find_near_duplicate.py [-h] -q QUERY -t TARGETS [-g N_GRAM]
                              [-p N_PERMUTATIONS] [-j MIN_JACCARD]
                              [-b NO_OF_BANDS]

Detect near duplicate texts using Minhash and Locality Sensitive Hashing.

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Path to file with sentences to query
  -t TARGETS, --targets TARGETS
                        Path to file with sentences be matched against
  -g N_GRAM, --n_gram N_GRAM
                        Size of each overlapping text shingle to break text
                        into prior to hashing
  -p N_PERMUTATIONS, --n_permutations N_PERMUTATIONS
                        Number of permutations used to create minhash
                        signatures used in LSH model.
  -j MIN_JACCARD, --min_jaccard MIN_JACCARD
                        Jaccard similarity threshold texts have to exceed to
                        be returned as similar.
  -b NO_OF_BANDS, --no_of_bands NO_OF_BANDS
                        Number of bands to break minhash signature into before
                        hashing into buckets.

example > python3 find_near_duplicate.py -q INPUT -t TARGERS

4. Study Case

Furthermore, I generated 5 random sentences to be the search with the LSH algorithm (-q parameter in the script) to validate the script.

Three generations with six decades of life experience.
He looked behind the door and didn't like what he saw.
The pigs were insulted that they were named hamburgers.
Not all people who wander are lost.
It is grossly unfair to suggest that the school was responsible for the accident.

And slightly modified the sentences making 3 copies of each sentence to be the target sequences (-t parameter).

Two generations with six days of life experience.
She looked in front the door and knew shat she saw.
The cat were insulted that they were named burritos.
all people who wander are lost.
It is grossly unfair to tell that the school was responsible for the kids lunch.
3 generations with 6 decades of life experience.
He looked behind the car and didn't like what he wanted.
The horses were insulted that they were named Maple.
Not all humans who wander are lost.
It is grossly fair to suggest that the school was responsible for kids success.
10000 generations with 15 decades of life experience.
They looked behind the door and didn't like what they saw.
The man was insulted that he was named hamburgers.
Not all animals who wander are found.
It is cool to suggest that the hospital was responsible for the accident.

and these were the found results. As you can see the LHS was able to find all the near-duplicate sentences.

1 QUERY: Three generations with six decades of life experience.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.

2 QUERY: He looked behind the door and didn't like what he saw.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.

3 QUERY: The pigs were insulted that they were named hamburgers.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.

4 QUERY: Not all people who wander are lost.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.

5 QUERY: It is grossly unfair to suggest that the school was responsible for the accident.
Two generations with six days of life experience.
3 generations with 6 decades of life experience.
10000 generations with 15 decades of life experience.

Last but not least, the other parameters are very important to be tuned in depending on the data that is used for query and target.

Please check the snapy documentation for more examples and how to tune in the parameters. In this study, I only needed to tune the Jaccard Distance value (-j).

More Resources

Here are two of my favorite Machine Learning with Python Books in case you want to learn more about it

Conclusion

In summary, I hope this tutorial is useful for you when trying to find near-duplicate sequences. LSH is a powerful algorithm – please enjoy it.

Moreover, please comment below on what kind of application you intend to use this algorithm.

Related Posts