similarity score between phrases - python

Levenshtein distance is an approach for measuring the difference between words, but it doesn't work as well for phrases.
Is there a good distance metric for measuring differences between phrases?
For example, if phrase 1 is made of n words x1 x2 ... xn, and phrase 2 is made of m words y1 y2 ... ym, I'd think they should be fuzzily aligned by words, then the aligned words should get a score for how similar they are, and some kind of gap penalty should be applied for non-aligned words. These positive and negative scores should then be aggregated in some way. There seem to be some heuristics involved.
Is there an existing solution for measuring the similarity between phrases? Python is preferred but other solutions are also fine. Thanks.

Take a look at FuzzyWuzzy:
>>> from fuzzywuzzy import fuzz
>>> s1 = "this is a sentence used for testing"
>>> s2 = "while this is another sentence also used for testing"
>>> s3 = "I am a completely unrelated string"
>>> fuzz.partial_ratio(s1, s2)
80
>>> fuzz.partial_ratio(s1, s3)
52
>>> fuzz.partial_ratio(s2, s3)
43
It also includes other modes of comparison that account for out-of-order tokens, etc.
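To illustrate those other modes, here is a minimal sketch using the same fuzzywuzzy import (exact scores will depend on the fuzzywuzzy version, so they are not shown):

from fuzzywuzzy import fuzz

a = "this is a sentence used for testing"
b = "used for testing, this is a sentence"   # same words, different order

# token_sort_ratio sorts the tokens alphabetically before comparing,
# so strings with the same words in a different order score very high.
print(fuzz.token_sort_ratio(a, b))

# token_set_ratio also ignores duplicated tokens and focuses on the
# common token set, which helps when one string is a superset of the other.
print(fuzz.token_set_ratio(a, b))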

You can also measure the similarity between two phrases using Levenshtein distance, treating each word as a single element. When you have strings of unequal sizes you can use the Smith-Waterman or the Needleman-Wunsch algorithm. Those algorithms are widely used in bioinformatics, and implementations can be found in the biopython package.
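If you want to try the word-level idea without pulling in biopython, here is a rough sketch of Levenshtein distance computed over word tokens (a toy implementation of my own, not a library function):

def word_levenshtein(phrase1, phrase2):
    """Levenshtein distance where the edit unit is a whole word."""
    w1, w2 = phrase1.split(), phrase2.split()
    # dp[i][j] = edits needed to turn the first i words of w1 into the first j words of w2
    dp = [[0] * (len(w2) + 1) for _ in range(len(w1) + 1)]
    for i in range(len(w1) + 1):
        dp[i][0] = i
    for j in range(len(w2) + 1):
        dp[0][j] = j
    for i in range(1, len(w1) + 1):
        for j in range(1, len(w2) + 1):
            cost = 0 if w1[i - 1] == w2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a word
                           dp[i][j - 1] + 1,         # insert a word
                           dp[i - 1][j - 1] + cost)  # substitute a word
    return dp[-1][-1]

print(word_levenshtein("this is a sentence used for testing",
                       "while this is another sentence also used for testing"))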
You can also tokenize the words in the phrases and measure the frequency of each token in each phrase; that will result in an array of frequencies for each phrase. From those arrays you can measure the pairwise similarity using any vector distance, such as Euclidean distance or cosine similarity. The tokenization of the phrases can be done with the nltk package, and the distances can be measured with scipy.
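Here is a rough sketch of that second idea (token counts plus cosine similarity), assuming nltk and scipy are installed and the NLTK 'punkt' tokenizer data has been downloaded:

from collections import Counter

from nltk import word_tokenize              # requires the nltk 'punkt' tokenizer data
from scipy.spatial.distance import cosine

def phrase_vectors(p1, p2):
    # Build aligned frequency vectors over the union of the two vocabularies.
    c1, c2 = Counter(word_tokenize(p1.lower())), Counter(word_tokenize(p2.lower()))
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]

v1, v2 = phrase_vectors("this is a sentence used for testing",
                        "while this is another sentence also used for testing")
print(1 - cosine(v1, v2))   # cosine similarity; 1.0 means identical token distributions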
Hope it helps.

Related

Text similarity using Word2Vec

I would like to use Word2Vec to check similarity of texts.
I am currently using a different approach:
from fuzzywuzzy import fuzz
def sim(name, dataset):
    matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 0.5, axis=1)
    return matches
(name is my column).
For applying this function I do the following:
df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)
Could you please tell me how to replace fuzzy.ratio with Word2Vec in order to compare texts in a dataset?
Example of dataset:
Text
Hello, this is Peter, what would you need me to help you with today?
I need you
Good Morning, John here, are you calling regarding your cell phone bill?
Hi, this is John. What can I do for you?
...
The first text and the last one are quite similar, although they use different words to express a similar concept.
I would like to create a new column where, for each row, I put the texts that are similar to it.
I hope you can help me.
TL;DR: skip to the last section (part 4) for the code implementation
1. Fuzzy vs Word embeddings
Unlike a fuzzy match, which is basically edit distance or Levenshtein distance matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed up to create a representation of the sentence's embedding. Sentences containing words with similar semantics will have similar vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.
Let's say I have a sentence with 2 words. Word2Vec will represent each word here as a vector in some Euclidean space. Summing them up, just like standard vector addition, will result in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.
NOTE: There are other methods of combining word embeddings such as a weighted sum with tf-idf weights OR just directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
2. Similarity between word vectors / sentence vectors
“You shall know a word by the company it keeps”
Words that occur with the same surrounding words (context) are usually similar in semantics/meaning. The great thing about word2vec is that word vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do stuff like clustering or just simple distance calculations.
A good way to find how similar two word vectors are is cosine similarity. Read more here.
3. Pre-trained word2vec models (and others)
The awesome thing about word2vec and similar models is that you don't need to train them on your data for most cases. You can use pre-trained word embeddings that have been trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.
You can check similarity between these sentence embeddings using cosine_similarity
4. Sample code implementation
I use a GloVe model (similar to word2vec) which is already trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used from here - https://github.com/RaRe-Technologies/gensim-data
from scipy import spatial
import gensim.downloader as api
import numpy as np

model = api.load("glove-wiki-gigaword-50")  # choose from multiple models: https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    return [i.lower() for i in s.split()]

def get_vector(s):
    # Sum the word vectors of all words in the sentence to get a sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)
print('s0 vs s1 ->',1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
If you want to compare sentences you should not use Word2Vec or GloVe embeddings. They translate every word in a sentence into a vector, and it is quite cumbersome to work out how similar two sentences are from the two sets of such vectors. You should use something that is tailored to convert a whole sentence into a single vector. Then you just need to compare how similar two vectors are. Universal Sentence Encoder is one of the best encoders considering the computational cost and accuracy trade-off (the DAN variant). See an example of usage in this post. I believe it describes a use case which is quite close to yours.
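A minimal sketch of that approach, assuming tensorflow and tensorflow_hub are installed and the public DAN-variant module URL is still available:

import numpy as np
import tensorflow_hub as hub

# Load the DAN variant of the Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "Hello, this is Peter, what would you need me to help you with today?",
    "Hi, this is John. What can I do for you?",
]
vectors = np.array(embed(sentences))   # one fixed-size embedding per sentence

# Cosine similarity between the two sentence embeddings.
a, b = vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))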

Equate strings based on meaning

Is there a way to equate strings in python based on their meaning despite not being similar.
For example,
temp. Max
maximum ambient temperature
I've tried using fuzzywuzzy and difflib and although they are generally good for this using token matching, they also provide false positives when I threshold the outputs over a large number of strings.
Is there some other method using NLP or tokenization that I'm missing here?
Edit:
The answer provided by A CO does solve the problem mentioned above, but is there any way to match specific substrings to a key using word2vec?
e.g. Key = max temp
Sent = the maximum ambient temperature expected tomorrow in California is 34 degrees.
So here I'd like to get the substring "maximum ambient temperature". Any tips on that?
As you say, packages like fuzzywuzzy or difflib will be limited because they compute similarities based on the spelling of the strings, not on their meaning.
You could use word embeddings. Word embeddings are vector representations of the words, computed in a way that allows them to represent their meaning, to a certain extent.
There are different methods for generating word embeddings, but the most common one is to train a neural network on one - or a set - of word-level NLP tasks, and use the penultimate layer as a representation of the word. This way, the final representation of the word is supposed to have accumulated enough information to complete the task, and this information can be interpreted as an approximation of the meaning of the word. I recommend that you read a bit about Word2vec, which is the method that made word embeddings popular, as it is simple to understand but representative of what word embeddings are. Here is a good introductory article. The similarity between two words can then be computed, usually using the cosine distance between their vector representations.
Of course, you don't need to train word embeddings yourself, as plenty of pretrained vectors are available (GloVe, word2vec, fastText, spacy...). The choice of which embedding to use depends on the observed performance and your understanding of how fit they are for the task you want to perform. Here is an example with spacy's word vectors, where the sentence vector is computed by averaging the word vectors:
# Importing spacy and fuzzy wuzzy
import spacy
from fuzzywuzzy import fuzz
# Loading spacy's large english model
nlp_model = spacy.load('en_core_web_lg')
s1 = "temp. Max"
s2 = "maximum ambient temperature"
s3 = "the blue cat"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors (The document or sentence vector is the average of the word vectors it contains)
print("Document vectors similarity between '{}' and '{}' is: {:.4f} ".format(s1, s2, doc1.similarity(doc2)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, doc1.similarity(doc3)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, doc2.similarity(doc3)))
# Fuzzy logic
print("Character ratio similarity between '{}' and '{}' is: {:.4f} ".format(s1, s2, fuzz.ratio(doc1, doc2)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, fuzz.ratio(doc1, doc3)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, fuzz.ratio(doc2, doc3)))
This will print:
>>> Document vectors similarity between 'temp. Max' and 'maximum ambient temperature' is: 0.6432
>>> Document vectors similarity between 'temp. Max' and 'the blue cat' is: 0.3810
>>> Document vectors similarity between 'maximum ambient temperature' and 'the blue cat' is: 0.3117
>>> Character ratio similarity between 'temp. Max' and 'maximum ambient temperature' is: 28.0000
>>> Character ratio similarity between 'temp. Max' and 'the blue cat' is: 38.0000
>>> Character ratio similarity between 'maximum ambient temperature' and 'the blue cat' is: 21.0000
As you can see, the similarity with word vectors better reflects the similarity in the meaning of the documents.
However, this is really just a starting point, as there can be plenty of caveats. Here is a list of some of the things you should watch out for:
Word (and document) vectors do not represent the meaning of the word (or document) per se, they are a way to approximate it. That implies that they will hit a limitation at some point and you cannot take for granted that they will allow you to differentiate all nuances of the language.
What we expect to be the "similarity in meaning" between two words/sentences varies according to the task we have. As an example, what would be the "ideal" similarity between "maximum temperature" and "minimum temperature"? High because they refer to an extreme state of the same concept, or low because they refer to opposite states of the same concept? With word embeddings, you will usually get a high similarity for these sentences, because as "maximum" and "minimum" often appear in the same contexts the two words will have similar vectors.
In the example given, 0.6432 is still not a very high similarity. This probably comes from the use of abbreviated words in the example. Depending on how the word embeddings have been generated, they might not handle abbreviations well. In general, it is better to have syntactically and grammatically correct inputs to NLP algorithms. Depending on what your dataset looks like and your knowledge of it, doing some cleaning beforehand can be very helpful. Here is an example with grammatically correct sentences that highlights the similarity in meaning better:
s1 = "The president has given a good speech"
s2 = "Our representative has made a nice presentation"
s3 = "The president ate macaronis with cheese"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors
print(doc1.similarity(doc2))
>>> 0.8779
print(doc1.similarity(doc3))
>>> 0.6131
print(doc2.similarity(doc3))
>>> 0.5771
Anyway, word embeddings are probably what you are looking for, but you need to take the time to learn about them. I would recommend that you read about word (and sentence, and document) embeddings and that you play around a bit with different pretrained vectors to get a better understanding of how they can be used for the task you have.
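Regarding the edit about matching a substring to a key such as "max temp": one rough approach, sketched here with the same spacy model loaded above (best_matching_span is a hypothetical helper of my own, not a spacy function), is to score every short n-gram window of the sentence against the key and keep the best-scoring one:

def best_matching_span(key, sentence, nlp, min_len=1, max_len=5):
    # Return the n-gram of the sentence whose vector is most similar to the key.
    key_doc = nlp(key)
    sent_doc = nlp(sentence)
    best_span, best_score = None, -1.0
    for size in range(min_len, max_len + 1):
        for start in range(len(sent_doc) - size + 1):
            span = sent_doc[start:start + size]
            if not span.vector_norm:          # skip spans without usable vectors
                continue
            score = key_doc.similarity(span)
            if score > best_score:
                best_span, best_score = span, score
    return best_span.text, best_score

print(best_matching_span(
    "max temp",
    "the maximum ambient temperature expected tomorrow in California is 34 degrees.",
    nlp_model))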

Why the fuzzywuzzy Ratio() uses a slightly different implementation of Levenshtein Distance while calculating the ratio between two strings?

I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein distance algorithm looks for the minimum number of edits between the two strings. An edit can be the addition, deletion, or substitution of a character in the string, and each of these operations is counted as a single edit when calculating the score.
Here are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now fuzzywuzzy does return the same score for Example 1, but not for Example 2. The score for Example 2 is 80%. On investigating how it calculates the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein distance when it actually is not.
I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically, what is the reason for counting a substitution as 2 operations rather than 1 as defined by Levenshtein distance, while still calling it Levenshtein distance? Does it have something to do with gaps in sentences? Is this a standard way of converting Levenshtein distance to a normalized similarity score?
I would love it if somebody could give me some insight. Also, is there a better way to convert Levenshtein distance to a similarity score, or in general to measure the similarity between two strings? I am trying to measure the similarity between some audio file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.
Thank you!
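For what it's worth, here is a small sketch that reproduces the discrepancy described above, comparing the classic Levenshtein normalization with difflib's matching-blocks ratio (assuming fuzzywuzzy is installed; difflib is in the standard library):

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

s1, s2 = "hello", "hella"

# Classic Levenshtein: one substitution counts as a single edit.
classic = (len(s1) + len(s2) - 1) / (len(s1) + len(s2))   # 9/10 = 0.9

# difflib's ratio is 2*M/T, where M is the number of matching characters
# and T the total length. A substitution removes a character from the
# match on both sides, which is why it behaves like two edit operations.
print(SequenceMatcher(None, s1, s2).ratio())              # 2*4/10 = 0.8
print(fuzz.ratio(s1, s2))                                 # 80, as described in the question
print(classic)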

Finding percentage of shared tokens (percent similarity) between multiple strings

I want to find unique strings in a list of strings based on a specific similarity percentage (in Python). However, these strings should be significantly different; if there is only a small difference between two strings then it's not interesting for me.
I can loop through the strings to find their similarity percentage, but I was wondering if there is a better way to do it?
For example,
String A: He is going to school.
String B: He is going to school tomorrow.
Let's say that these two strings are 80% similar.
Similarity: strings with the exact same words in the same order are the most similar. A string is 100% similar to itself.
It's a bit of a vague definition, but it works for my use case.
If you want to check how similar two sentences are, and you want to know when they have the exact same word ordering, then you can use a single-sentence BLEU score.
I would use the sentence_bleu found here: http://www.nltk.org/_modules/nltk/translate/bleu_score.html
You will need to make sure that you do something with your weights for short sentences. An example from something I have done in the past is
from nltk.translate.bleu_score import sentence_bleu
from nltk import word_tokenize

sentence1 = "He is a dog"
sentence2 = "She is a dog"

reference = word_tokenize(sentence1.lower())
hypothesis = word_tokenize(sentence2.lower())

# BLEU defaults to 4-gram weights; shorter sentences need adjusted weights
if min(len(hypothesis), len(reference)) < 4:
    weighting = 1.0 / min(len(hypothesis), len(reference))
    weights = tuple([weighting] * min(len(hypothesis), len(reference)))
else:
    weights = (0.25, 0.25, 0.25, 0.25)

bleu_score = sentence_bleu([reference], hypothesis, weights=weights)
Note that single sentence BLEU is quite bad at detecting similar sentences with different word orderings. So if that's what you're interested in then be careful. Other methods you could try are document similarity, Jaccard similarity, and cosine similarity.
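Of those alternatives, Jaccard similarity over token sets is probably the simplest to try; here is a minimal sketch using only the standard library (the regex tokenization and lowercasing are my own choices):

import re

def jaccard_similarity(a, b):
    # Share of unique word tokens that the two sentences have in common.
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 5 shared tokens out of 6 unique tokens -> roughly 0.83
print(jaccard_similarity("He is going to school.",
                         "He is going to school tomorrow."))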

Approximate String Matching using LSH

I would like to approximately match strings using locality sensitive hashing. I have many strings (>10M) that may contain typos. For every string I would like to compare it with all the other strings and select those with an edit distance within some threshold.
That is, the naive solution requires O(n^2) comparisons. In order to avoid that issue I was thinking of using Locality Sensitive Hashing. Then nearly similar strings would end up in the same buckets, and I would only need to search inside each bucket, so it is O(n*C) where C is the bucket size.
However, I do not understand how to represent the strings. If it were text I would represent it in vector space. My main question is whether this is tractable using LSH, and what an appropriate vector representation of the strings would be.
Can I use an already implemented library for this task, or does it depend on my problem so that I must implement it myself? Is there any Python package that does this?
The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.
Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.
There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example that shows how you can catch fuzzy string similarity:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
        'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
        'weights controls the relative importance between minizing false positive',
        'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
        ]

# Create a MinHashLSH index optimized for Jaccard threshold 0.4,
# that accepts MinHash objects with 128 permutation functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)

# Create a MinHash object for each string from its character 3-grams
minhashes = {}
for c, i in enumerate(data):
    minhash = MinHash(num_perm=128)
    for d in ngrams(i, 3):
        minhash.update("".join(d).encode('utf-8'))
    lsh.insert(c, minhash)
    minhashes[c] = minhash

for i in range(len(minhashes.keys())):
    result = lsh.query(minhashes[i])
    print("Candidates with Jaccard similarity > 0.4 for input", i, ":", result)
This returns:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]
