Approximate String Matching using LSH - python

I would like to approximately match strings using locality-sensitive hashing. I have many strings (>10M) that may contain typos. For every string I would like to compare it with all the other strings and select those within some edit-distance threshold.
That is, the naive solution requires O(n^2) comparisons. To avoid that, I was thinking of using Locality Sensitive Hashing: near-similar strings would then hash to the same buckets and I would only need to search inside each bucket, which is O(n*C) where C is the bucket size.
However, I do not understand how to represent the strings. If it were text I would represent it in vector space. My main question is whether this is tractable using LSH, and if so, what an appropriate vector representation of a string would be.
Can I use an already implemented library for this task, or does it depend on my problem so that I must implement it myself? Is there any Python package that does this?

The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.
Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.
There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example that shows you can catch fuzzy string similarity:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
        'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
        'weights controls the relative importance between minizing false positive',
        'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
        ]

# Create a MinHashLSH index optimized for Jaccard threshold 0.4,
# that accepts MinHash objects with 128 permutation functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)

# Create MinHash objects from character 3-grams of each string
minhashes = {}
for c, i in enumerate(data):
    minhash = MinHash(num_perm=128)
    for d in ngrams(i, 3):
        minhash.update("".join(d).encode('utf-8'))
    lsh.insert(c, minhash)
    minhashes[c] = minhash

for i in range(len(minhashes)):
    result = lsh.query(minhashes[i])
    print("Candidates with Jaccard similarity > 0.4 for input", i, ":", result)
This returns:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]
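Since the original question asks for an edit-distance threshold rather than Jaccard similarity, the LSH buckets are best treated as candidate sets to be verified exactly. A minimal sketch of that verification step, assuming the python-Levenshtein package is installed and reusing the lsh, minhashes and data objects from the example above (the threshold value is just illustrative):
import Levenshtein  # pip install python-Levenshtein (assumed available)

max_edits = 10  # illustrative edit-distance threshold, not from the original question
for i in range(len(data)):
    # LSH narrows the search to a few candidates; verify each with an exact edit distance
    for j in lsh.query(minhashes[i]):
        if j != i and Levenshtein.distance(data[i], data[j]) <= max_edits:
            print(i, j, "are within", max_edits, "edits of each other")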

Related

Gensim: word mover distance with string as input instead of list of string

I'm trying to find out how similar 2 sentences are.
To do this I'm using gensim's Word Mover's Distance, and since what I'm looking for is a similarity, I do the following:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one you can find on the web: word2vec-google-news-300
I load it with this code:
wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation here, it seems that wmdistance takes a list of strings as input, not a single string like I'm passing!
def preprocess(sentence):
    # stop_words is assumed to be defined, e.g. nltk.corpus.stopwords.words('english')
    return [w for w in sentence.lower().split() if w not in stop_words]
sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get really different results:
wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113
I'm really confused. Why does it work with a plain string as input, and why does it seem to work better than when I give what the documentation asks for?
The function needs a list of string tokens to give proper results: if your results when passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists of single characters, the value there is going to be a function of how different the letters in the two texts are, and the fact that all English sentences of about the same length have very similar letter distributions means most texts will rate as very similar under that error.
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not as two different results from different processes (one of which is essentially random). You shouldn't consider absolute values that exceed some set threshold, or that are close to 1.0, as definitively good. You should instead look at relative differences between two similarity/distance values, to conclude that one pair is more similar/distant than another pair.
Finally, converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most similar to -1.0 or 0.0 for least similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
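For example, one common conversion that stays within (0, 1] no matter how large the distance gets is 1 / (1 + distance). A minimal sketch, assuming wv is the already-loaded KeyedVectors model and the two sentences have been preprocessed into token lists as above:
distance = wv.wmdistance(sentence_obama, sentence_president)  # token lists, as the docs require
similarity = 1.0 / (1.0 + distance)  # maps distances in [0, inf) into (0, 1]
print(similarity)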

Why the fuzzywuzzy Ratio() uses a slightly different implementation of Levenshtein Distance while calculating the ratio between two strings?

I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm counts the minimum number of edits between the two strings. An edit can be the addition, deletion, or substitution of a character in the string, and each of these operations is counted as a single edit when calculating the score.
Here are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now fuzzywuzzy does return the same score for Example 1, but not for Example 2; the score for Example 2 is 80%. Investigating how it calculates the distance under the hood, I found that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.
I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically, what is the reason for counting a substitution as 2 operations rather than 1 as defined by Levenshtein Distance, while still calling it Levenshtein Distance? Does it have something to do with gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
I would love it if somebody could give me some insight. Also, is there a better way to convert LD to a similarity score, or in general to measure the similarity between two strings? I am trying to measure the similarity between audio-file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.
Thank you!
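As an illustration of the arithmetic above, here is a short sketch comparing a plain Levenshtein distance (substitution counted as one edit) with difflib's ratio, which the question notes fuzzywuzzy relies on; the levenshtein helper below is written purely for this example:
from difflib import SequenceMatcher

def levenshtein(a, b):
    # classic dynamic-programming edit distance; a substitution costs 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

s1, s2 = 'hello', 'hella'
d = levenshtein(s1, s2)                               # 1
print((len(s1) + len(s2) - d) / (len(s1) + len(s2)))  # 0.9
print(SequenceMatcher(None, s1, s2).ratio())          # 0.8: the substitution counts as 2 operations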

similarity score between phrases

Levenshtein distance is an approach for measuring the difference between words, but not so for phrases.
Is there a good distance metric for measuring differences between phrases?
For example, say phrase 1 is made of n words x1 x2 ... xn, and phrase 2 is made of m words y1 y2 ... ym. I'd think they should be fuzzily aligned by word, the aligned words should get a score for how similar they are, and some kind of gap penalty should be applied for non-aligned words. These positive and negative scores should then be aggregated in some way. There seem to be some heuristics involved.
Is there an existing solution for measuring the similarity between phrases? Python is preferred but other solutions are also fine. Thanks.
Take a look at FuzzyWuzzy:
>>> from fuzzywuzzy import fuzz
>>> s1 = "this is a sentence used for testing"
>>> s2 = "while this is another sentence also used for testing"
>>> s3 = "I am a completely unrelated string"
>>> fuzz.partial_ratio(s1, s2)
80
>>> fuzz.partial_ratio(s1, s3)
52
>>> fuzz.partial_ratio(s2, s3)
43
It also includes other modes of comparison that account for out-of-order tokens, etc.
You can also measure the similarity between two phrases using Levenshtein distance, treating each word as a single element. When you have strings of unequal sizes you can use the Smith-Waterman or the Needleman-Wunsch algorithm. These algorithms are widely used in bioinformatics and implementations can be found in the biopython package.
You can also tokenize the words in the phrases and measure the frequency of each token in each phrase, which results in a frequency vector for each phrase. From those vectors you can measure the pairwise similarity using any vector distance, such as Euclidean distance or cosine similarity. The tokenization can be done with the nltk package, and the distances can be measured with scipy.
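For instance, here is a minimal sketch of that token-frequency approach, using plain .split() instead of nltk's tokenizer to keep it self-contained (the two phrases are just examples):
from collections import Counter
from scipy.spatial.distance import cosine

def phrase_similarity(p1, p2):
    # build term-frequency vectors over the union of both vocabularies
    c1, c2 = Counter(p1.lower().split()), Counter(p2.lower().split())
    vocab = sorted(set(c1) | set(c2))
    v1 = [c1[w] for w in vocab]
    v2 = [c2[w] for w in vocab]
    return 1.0 - cosine(v1, v2)  # scipy's cosine() is a distance, so flip it

print(phrase_similarity("this is a sentence used for testing",
                        "while this is another sentence also used for testing"))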
Hope it helps.

Fast way find strings within hamming distance x of each other in a large array of random fixed length strings

I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of each string, store them in a dictionary, and then check whether a variation was found more than once. This worked pretty fast (20 min) for a hamming distance of 1, but it is very memory intensive and would not be viable for a hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []  # the millions of 24-character DNA strings
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs; it returns all the pairs with a high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, with N << 5^24, and a maximum hamming distance d, return all pairs which have a hamming distance of at most d.
The way you'll solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^{24}, where K is a parameter you'll have to experiment with. For every data point v, you define the hash of v as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0, 1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the datapoints in the sequences, you only check data points in the bin of "close" hashes.
Note that this is very heuristic-based and you will need to experiment with K and with how "close" to each hash you want to search. As K increases, the number of bins grows exponentially with it, but points that do share a bin are more likely to be genuinely similar.
Judging by what you said, it looks like you have a gigantic dataset so I thought I would throw this for you to consider.
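A rough sketch of this bucketing idea (the letter-to-number encoding, the choice of K, and the use of numpy are assumptions made for illustration, not part of the answer itself):
import numpy as np
from collections import defaultdict

BASE_TO_NUM = {'A': 0.0, 'T': 1.0, 'G': 2.0, 'C': 3.0, 'N': 4.0}  # assumed encoding

def to_vector(seq):
    return np.array([BASE_TO_NUM[ch] for ch in seq])  # 24-character string -> vector in R^24

def build_buckets(sequences, K=16, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(K, 24))  # K random hyperplanes through the origin
    buckets = defaultdict(list)
    for seq in sequences:
        signature = tuple(((planes @ to_vector(seq)) > 0).tolist())  # side of each plane
        buckets[signature].append(seq)
    return buckets

# Only sequences that land in the same bucket are then compared with an exact Hamming check.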
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1GB of ram. I modified the demo.c file to work for my use case.

Algorithm to sort rows and cols by similarity

I came across a spreadsheet that explains a method to sort both the rows and the columns of a matrix containing binary data so that the number of changes between consecutive rows and columns is minimized.
For example, starting with:
After 15 manual steps described in the tabs of the spreadsheet, the following table is obtained:
I would like to know:
what is the common name of this algorithm or method?
how to apply it to larger tables (where 2^n would overflow...)?
how to generalize it to non-binary data, for example using Levenshtein distance?
if there is any link to code (Excel VBA, Python, ...) already implementing this (otherwise I'll write it...)
Thanks!
You can represent each row by a vector L = [1, 1, 0, ... 1], and then define the distance between two lines d(L0, L1) by the number of elements at corresponding positions which are different between L0 and L1. This is known as the binary Hamming distance. If you had non-binary data, you would just extend your definition of distance and yes, Levenshtein distance would be an option.
Once you have distance well-defined, the rest of your problem is minimizing the distance between consecutive rows. This is exactly the Traveling Salesman Problem, which is known to be NP-hard (http://www.diku.dk/hjemmesider/ansatte/jyrki/Paper/EKP85.pdf).
The direct solution (visiting all permutations) is O(n!), but you can do better by using dynamic programming, for example the Held–Karp algorithm. There are also approximate algorithms, such as the nearest-neighbour algorithm, which quickly computes a non-optimal solution.
Finally, for implementations you can easily google "traveling salesman excel/python" and find many tutorials and examples.
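As an illustration, here is a minimal sketch of the nearest-neighbour heuristic applied to the rows of a binary matrix (greedy and non-optimal, as noted above; the example matrix is made up):
def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

def greedy_row_order(rows):
    # nearest-neighbour heuristic: start at row 0, always move to the closest unvisited row
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(rows[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

rows = [[1, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]]
print(greedy_row_order(rows))  # [0, 1, 3, 2]: consecutive rows differ as little as possible (greedily)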
