How do I calculate the semantic similarity between two n-grams? - python

I'm trying to calculate the semantic similarity between two bi-grams and I need to use fasttext's pre-trained word vectors to accomplish this task.
For example, the bi-grams are Python lists of two elements:
[his, name] and [I, am]
I need to calculate the similarity between these two word pairs by any means necessary.
I'm hoping there's a score which can give me a good approximation of similarity.
For example, a method which can tell me that [his, name] is more similar to [I, am] than to [an, apple].
Right now I've only made use of cosine similarity, which doesn't capture any semantic similarity.

Cosine similarity might be useful if you average the two word vectors in each bi-gram first. Take the vectors for 'his' and 'name' and average them into one vector, then take the vectors for 'I' and 'am' and average them into another. Finally, calculate the cosine similarity between the two resulting vectors; it should give you a rough semantic similarity.
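A minimal sketch of that approach, assuming the fastText vectors have already been loaded into a gensim KeyedVectors-like object called ft (the commented-out loading step and file path are placeholders):
import numpy as np
# Hypothetical loading step; adjust the path to your fastText .bin file:
# from gensim.models.fasttext import load_facebook_vectors
# ft = load_facebook_vectors("cc.en.300.bin")

def bigram_vector(bigram, ft):
    # Average the two word vectors of a bi-gram into a single vector.
    return np.mean([ft[w.lower()] for w in bigram], axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = bigram_vector(["his", "name"], ft)
v2 = bigram_vector(["I", "am"], ft)
print(cosine(v1, v2))  # rough semantic similarity between the two bi-grams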

Related

replace nested 'for loop' with lambda in python

I am working on one task where I need to check the cosine similarity between two dataframe columns.
I am using two for loops to iterate over the two columns of data1 and data2 respectively.
for i in range(0, len(input_df)):
    for j in range(0, len(data1)):
        # check similarity ratio
        similarity_score = cosine_sim(input_df['Summary'].iloc[i], data1['Summary'].iloc[j])
        print(similarity_score)
# cosine_sim() is my function that gives the similarity score.
How can I do this using a lambda instead of the for loops, as the nested loop is taking too much time?
There are other operations as well which I am doing after checking the cosine similarity.
To compute the cosine similarity between two vectors (your two columns), you could make use of NumPy:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_similarity(input_df['Summary'], data1['Summary'])
However, based on your code example, it seems that you want to compute the cosine similarity between each element of the columns. So I'm not entirely sure if the above is what you are looking for.
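If the real goal is to avoid the slow nested Python loops rather than to replace them with a lambda, a rough alternative (assuming both 'Summary' columns contain raw text and scikit-learn is available) is to vectorize all the texts once and compute the whole similarity matrix in a single call:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit one vocabulary over both columns so their vectors are comparable.
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([input_df['Summary'], data1['Summary']]))

A = vectorizer.transform(input_df['Summary'])  # shape: (len(input_df), n_features)
B = vectorizer.transform(data1['Summary'])     # shape: (len(data1), n_features)

# scores[i, j] is the cosine similarity between row i of input_df and row j of data1.
scores = cosine_similarity(A, B)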

How to cluster large number of strings based on similarity matrix?

I need to cluster 500K+ strings based on their similarity.
I have calculated their pair-wise Levenshtein Distances and made a sparse similarity matrix. This matrix contains binary similarities: values for small distances are set to 1.0 and others are 0.0.
I don't know what kind of clustering is good for me. I don't know the number of clusters in advance, but it may be considerably large because the similarity matrix is very sparse (about 0.1% of the values are non-zero).
Have you considered doing something like Soundex (https://en.wikipedia.org/wiki/Soundex)? The advantage of such algorithms is that similar words share the same canonical form. For example, both "Robert" and "Rupert" map to the same string "R163". Then your clustering boils down to a map like:
clusters = { canonical_form: [list of similar words] }
Naturally, you can tweak the Soundex rules according to your domain.
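A minimal sketch of that idea, assuming the jellyfish package is used for the Soundex encoding (any Soundex implementation would do):
from collections import defaultdict
import jellyfish  # one of several packages that provide a Soundex implementation

def cluster_by_soundex(words):
    # Group words by their Soundex canonical form.
    clusters = defaultdict(list)
    for w in words:
        clusters[jellyfish.soundex(w)].append(w)
    return dict(clusters)

print(cluster_by_soundex(["Robert", "Rupert", "Apple"]))
# e.g. {'R163': ['Robert', 'Rupert'], 'A140': ['Apple']}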

similarity score between phrases

Levenshtein distance is an approach for measuring the difference between words, but not so for phrases.
Is there a good distance metric for measuring differences between phrases?
For example, if phrase 1 is made of n words x1 x2 ... xn, and phrase 2 is made of m words y1 y2 ... ym, I'd think they should be fuzzily aligned by words, the aligned words should get a score for how similar they are, and some kind of gap penalty should be applied for non-aligned words. These positive and negative scores should then be aggregated in some way. There seem to be some heuristics involved.
Is there an existing solution for measuring the similarity between phrases? Python is preferred, but other solutions are also fine. Thanks.
Take a look at FuzzyWuzzy:
>>> from fuzzywuzzy import fuzz
>>> s1 = "this is a sentence used for testing"
>>> s2 = "while this is another sentence also used for testing"
>>> s3 = "I am a completely unrelated string"
>>> fuzz.partial_ratio(s1, s2)
80
>>> fuzz.partial_ratio(s1, s3)
52
>>> fuzz.partial_ratio(s2, s3)
43
It also includes other modes of comparison that account for out-of-order tokens, etc.
You can also measure the similarity between two phrases using Levenshtein distance, treating each word as a single element. When you have strings of unequal sizes you can use the Smith-Waterman or the Needleman-Wunsch algorithm. Those algorithms are widely used in bioinformatics, and implementations can be found in the biopython package.
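To make the word-level idea concrete, here is a small sketch of Levenshtein distance computed over word tokens rather than characters (plain Python, no external dependencies):
def word_levenshtein(phrase1, phrase2):
    # Edit distance where each token (word) counts as a single element.
    a, b = phrase1.split(), phrase2.split()
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost) # substitute a word
    return dp[len(a)][len(b)]

print(word_levenshtein("this is a sentence", "this is another sentence"))  # 1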
You can also tokenize the words in the phrases and measure the frequency of each token in each phrase, which will result in an array of frequencies for each phrase. From those arrays you can measure the pairwise similarity using any vector distance, such as Euclidean distance or cosine similarity. The tokenization of the phrases can be done with the nltk package, and the distances can be measured with scipy.
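A rough sketch of that frequency-vector approach (assuming NLTK's tokenizer data has been downloaded):
from nltk.tokenize import word_tokenize
from scipy.spatial.distance import cosine

def token_frequency_similarity(p1, p2):
    # Cosine similarity between token-frequency vectors of the two phrases.
    t1, t2 = word_tokenize(p1.lower()), word_tokenize(p2.lower())
    vocab = sorted(set(t1) | set(t2))
    v1 = [t1.count(w) for w in vocab]
    v2 = [t2.count(w) for w in vocab]
    return 1 - cosine(v1, v2)  # scipy's cosine() is a distance, so invert it

print(token_frequency_similarity("this is a sentence used for testing",
                                 "while this is another sentence also used for testing"))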
Hope it helps.

Semantically weighted mean of word embeddings

Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
Everything I found is about either finding the plain mean vector (which is quite trivial, of course) that represents the average meaning of the list, or some kind of weighted average of words for document representation; however, that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!
Averaging of word vectors can actually be done in two ways:
Mean of word vectors without tf-idf weights.
Mean of word vectors multiplied by tf-idf weights.
The second approach will solve your problem of word importance.
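A minimal sketch of the weighted mean, assuming wv maps each word to its embedding vector (e.g. a gensim KeyedVectors object) and that the per-word weights come from tf-idf or are hand-chosen:
import numpy as np

# Hypothetical inputs: wv[word] returns the embedding vector for a word.
words = ['sunglasses', 'jeans', 'hats']
weights = np.array([2.0, 1.0, 1.0])  # e.g. tf-idf weights; here hand-set so
                                     # 'sunglasses' dominates the average
vectors = np.array([wv[w] for w in words])

# Weighted mean: each vector contributes in proportion to its weight.
weighted_embedding = np.average(vectors, axis=0, weights=weights)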

How to find the closest word to a vector using word2vec

I have just started using Word2vec and I was wondering how we can find the closest word to a given vector.
I have this vector which is the average vector for a set of vectors:
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
Is there a straightforward way to find the most similar word in my training data to this vector?
Or the only solution is to calculate the cosine similarity between this vector and the vectors of each word in my training data, then select the closest one?
Thanks.
For the gensim implementation of word2vec there is a most_similar() function that lets you find the words semantically closest to a given word:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
or to its vector representation:
>>> your_word_vector = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
>>> model.most_similar(positive=[your_word_vector], topn=1)
where topn defines the desired number of returned results.
However, my gut feeling is that this function does exactly what you proposed, i.e. it calculates the cosine similarity between the given vector and every other vector in the vocabulary (which is quite inefficient...).
Don't forget to pass an empty list for the negative words to the most_similar function:
import numpy as np

model_word_vector = np.array(my_vector, dtype='f')
topn = 20
most_similar_words = model.most_similar([model_word_vector], [], topn)
Alternatively, model.wv.similar_by_vector(vector, topn=10, restrict_vocab=None) is also available in the gensim package.
Find the top-N most similar words by vector.
Parameters:
vector (numpy.array) – Vector from which similarities are to be computed.
topn ({int, False}, optional) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you've sorted the vocabulary by descending frequency.)
Returns: Sequence of (word, similarity).
Return type: list of (str, float)
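For completeness, a minimal usage sketch, assuming model is a trained gensim Word2Vec model and the averaged vector has the same dimensionality as its embeddings (the example words are placeholders):
import numpy as np

# model is assumed to be a trained gensim Word2Vec model; the averaged vector
# must have the same dimensionality as the model's embeddings.
avg_vector = np.mean([model.wv['his'], model.wv['name']], axis=0)

# Returns a list of (word, cosine similarity) pairs, most similar first.
print(model.wv.similar_by_vector(avg_vector, topn=10))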
