I would like to use Word2Vec to check similarity of texts.
I am currently using another logic:
from fuzzywuzzy import fuzz
def sim(name, dataset):
    matches = dataset.apply(lambda row: (fuzz.ratio(row['Text'], name) / 100) >= 0.5, axis=1)
    return matches
(name is my column).
For applying this function I do the following:
df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)
Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset?
Example of dataset:
Text
Hello, this is Peter, what would you need me to help you with today?
I need you
Good Morning, John here, are you calling regarding your cell phone bill?
Hi, this is John. What can I do for you?
...
The first text and the last one are quite similar, although they use different words to express a similar concept.
I would like to create a new column where, for each row, I put the texts that are similar to it.
I hope you can help me.
TLDR; skip to the last section (part 4.) for code implementation
1. Fuzzy vs Word embeddings
Unlike a fuzzy match, which is basically edit distance (Levenshtein distance) matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed to create a representation of a sentence's embedding. Sentences containing words with similar semantics will have similar word vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.
Let's say I have a sentence with 2 words. Word2Vec will represent each word here as a vector in some euclidean space. Summing them up, just like standard vector addition will result in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.
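As a minimal sketch of that summation, with made-up 3-dimensional vectors (real models use 50-300+ dimensions; the words and values here are purely illustrative):

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
word_vectors = {
    "good":    np.array([0.8, 0.1, 0.3]),
    "morning": np.array([0.2, 0.9, 0.4]),
}

# A simple sentence embedding: the element-wise sum of its word vectors
sentence_vector = word_vectors["good"] + word_vectors["morning"]
print(sentence_vector)  # [1.  1.  0.7]
```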
NOTE: There are other methods of combining word embeddings such as a weighted sum with tf-idf weights OR just directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
2. Similarity between word vectors / sentence vectors
“You shall know a word by the company it keeps” — J.R. Firth
Words that occur in similar contexts (i.e., alongside the same neighboring words) usually have similar semantics/meaning. The great thing about word2vec is that word vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do things like clustering or just simple distance calculations.
A good way to measure how similar two word vectors are is cosine similarity. Read more here.
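For illustration, cosine similarity can be computed directly from the dot product and the vector norms (the vectors below are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); ranges from -1 to 1,
    # where 1 means the vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_dog   = np.array([0.9, 0.8, 0.1])  # hypothetical word vectors
v_puppy = np.array([0.8, 0.9, 0.2])
v_car   = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(v_dog, v_puppy))  # close to 1 -> semantically similar
print(cosine_similarity(v_dog, v_car))    # much lower -> dissimilar
```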
3. Pre-trained word2vec models (and others)
The awesome thing about word2vec and similar models is that you don't need to train them on your own data in most cases. You can use pre-trained word embeddings that have been trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.
You can check the similarity between these sentence embeddings using cosine similarity.
4. Sample code implementation
I use a GloVe model (similar to word2vec) already trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used here - https://github.com/RaRe-Technologies/gensim-data
from scipy import spatial
import numpy as np
import gensim.downloader as api

# Choose from multiple models: https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")
s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'
def preprocess(s):
    # Lowercase and tokenize on whitespace
    return [i.lower() for i in s.split()]

def get_vector(s):
    # Sum the word vectors of the sentence's tokens into one sentence vector
    # (words missing from the model's vocabulary will raise a KeyError)
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)
print('s0 vs s1 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
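To tie this back to the original question: compute the sentence vector for every row once, then compare each row against all the others with cosine similarity and keep the texts above a threshold. Below is a sketch of that loop using a tiny made-up vocabulary instead of the downloaded GloVe model (the words, vectors, and the 0.9 threshold are illustrative assumptions; in practice you would look up `model[w]` instead of `word_vectors[w]` and assign the resulting list to your `Sim` column):

```python
import numpy as np

# Toy stand-in for the GloVe model: word -> vector
word_vectors = {
    "hello": np.array([0.9, 0.1]), "hi": np.array([0.85, 0.15]),
    "peter": np.array([0.4, 0.6]), "john": np.array([0.45, 0.55]),
    "bill":  np.array([0.1, 0.9]),
}

def get_vector(text):
    # Sum of word vectors, skipping out-of-vocabulary words
    return np.sum([word_vectors[w] for w in text.lower().split() if w in word_vectors], axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

texts = ["hello peter", "hi john", "bill"]
vectors = [get_vector(t) for t in texts]

# For each row, collect the OTHER texts whose similarity exceeds the threshold
sim_column = []
for i, v in enumerate(vectors):
    similar = [texts[j] for j, w in enumerate(vectors) if j != i and cosine(v, w) > 0.9]
    sim_column.append(similar)
print(sim_column)  # [['hi john'], ['hello peter'], []]
```

Note that this is O(n²) in the number of rows, so for large datasets you would vectorize the pairwise comparison (e.g. stack the sentence vectors into a matrix and normalize once).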
If you want to compare sentences, you should not use Word2Vec or GloVe embeddings directly. They translate every word in a sentence into a vector, and it is quite cumbersome to judge how similar two sentences are from the two sets of such vectors. You should use something that is tailored to convert a whole sentence into a single vector; then you just need to compare how similar two vectors are. The Universal Sentence Encoder is one of the best encoders considering the computational cost / accuracy trade-off (the DAN variant). See an example of usage in this post. I believe it describes a use case which is quite close to yours.
Related
Is there a way to equate strings in Python based on their meaning, despite their spellings not being similar?
For example,
temp. Max
maximum ambient temperature
I've tried using fuzzywuzzy and difflib, and although they are generally good at this via token matching, they also produce false positives when I threshold the outputs over a large number of strings.
Is there some other method using NLP or tokenization that I'm missing here?
Edit:
The answer provided by A CO does solve the problem mentioned above but is there any way to match specific substrings using word2vec from a key?
e.g. Key = max temp
Sent = the maximum ambient temperature expected tomorrow in California is 34 degrees.
So here I'd like to get the substring "maximum ambient temperature". Any tips on that?
As you say, packages like fuzzywuzzy or difflib will be limited because they compute similarities based on the spelling of the strings, not on their meaning.
You could use word embeddings. Word embeddings are vector representations of words, computed in a way that allows them to represent their meaning, to a certain extent.
There are different methods for generating word embeddings, but the most common one is to train a neural network on one - or a set - of word-level NLP tasks, and use the penultimate layer as a representation of the word. This way, the final representation of the word is supposed to have accumulated enough information to complete the task, and this information can be interpreted as an approximation of the meaning of the word. I recommend that you read a bit about Word2vec, which is the method that made word embeddings popular; it is simple to understand but representative of what word embeddings are. Here is a good introductory article. The similarity between two words can then be computed, usually using the cosine distance between their vector representations.
Of course, you don't need to train word embeddings yourself, as there exist plenty of pretrained vectors (GloVe, word2vec, fastText, spaCy...). The choice of which embeddings to use depends on the observed performance and your understanding of how fit they are for the task you want to perform. Here is an example with spaCy's word vectors, where the sentence vector is computed by averaging the word vectors:
# Importing spacy and fuzzy wuzzy
import spacy
from fuzzywuzzy import fuzz
# Loading spacy's large english model
nlp_model = spacy.load('en_core_web_lg')
s1 = "temp. Max"
s2 = "maximum ambient temperature"
s3 = "the blue cat"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors (The document or sentence vector is the average of the word vectors it contains)
print("Document vectors similarity between '{}' and '{}' is: {:.4f} ".format(s1, s2, doc1.similarity(doc2)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, doc1.similarity(doc3)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, doc2.similarity(doc3)))
# Fuzzy logic
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s2, fuzz.ratio(s1, s2)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, fuzz.ratio(s1, s3)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, fuzz.ratio(s2, s3)))
This will print:
>>> Document vectors similarity between 'temp. Max' and 'maximum ambient temperature' is: 0.6432
>>> Document vectors similarity between 'temp. Max' and 'the blue cat' is: 0.3810
>>> Document vectors similarity between 'maximum ambient temperature' and 'the blue cat' is: 0.3117
>>> Character ratio similarity between 'temp. Max' and 'maximum ambient temperature' is: 28.0000
>>> Character ratio similarity between 'temp. Max' and 'the blue cat' is: 38.0000
>>> Character ratio similarity between 'maximum ambient temperature' and 'the blue cat' is: 21.0000
As you can see, the similarity with word vectors reflects better the similarity in the meaning of the documents.
However this is really just a starting point as there can be plenty of caveats. Here is a list of some of the things you should watch out for:
Word (and document) vectors do not represent the meaning of the word (or document) per se, they are a way to approximate it. That implies that they will hit a limitation at some point and you cannot take for granted that they will allow you to differentiate all nuances of the language.
What we expect to be the "similarity in meaning" between two words/sentences varies according to the task we have. As an example, what would be the "ideal" similarity between "maximum temperature" and "minimum temperature"? High because they refer to an extreme state of the same concept, or low because they refer to opposite states of the same concept? With word embeddings, you will usually get a high similarity for these sentences, because as "maximum" and "minimum" often appear in the same contexts the two words will have similar vectors.
In the example given, 0.6432 is still not a very high similarity. This probably comes from the use of abbreviated words in the example. Depending on how the word embeddings were generated, they might not handle abbreviations well. In general, it is better to feed syntactically and grammatically correct inputs to NLP algorithms. Depending on what your dataset looks like and your knowledge of it, doing some cleaning beforehand can be very helpful. Here is an example with grammatically correct sentences that highlights the similarity in meaning better:
s1 = "The president has given a good speech"
s2 = "Our representative has made a nice presentation"
s3 = "The president ate macaronis with cheese"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors
print(doc1.similarity(doc2))
>>> 0.8779
print(doc1.similarity(doc3))
>>> 0.6131
print(doc2.similarity(doc3))
>>> 0.5771
Anyway, word embeddings are probably what you are looking for but you need to take the time to learn about them. I would recommend that you read about word (and sentence, and document) embeddings and that you play a bit around with different pretrained vectors to get a better understanding of how they can be used for the task you have.
I am working on text embedding in Python, where I find the similarity between two documents with the Doc2Vec model. The code is as follows:
for doc_id in range(len(train_corpus)):
    # Infer a vector from the document's words
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Compare it against all trained document vectors, most similar first
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Now, from these two embedded documents, how can I extract a set of semantically similar words from those particular documents?
Please help me out.
Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0 with dbow_words=1 (DBOW doc-vectors with added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.
A call to model.wv.similarity(word1, word2) method will give you the pairwise similarity for any 2 words.
So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
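A rough sketch of that loop, assuming a trained model with word-vectors is available (here a tiny made-up word_vectors dict stands in for model.wv; with gensim you would call model.wv.similarity(w1, w2) instead of the helper below):

```python
import numpy as np

# Stand-in for model.wv: hypothetical word vectors, for illustration only
word_vectors = {
    "drug":   np.array([0.9, 0.2]), "medicine": np.array([0.85, 0.3]),
    "dosage": np.array([0.3, 0.9]), "amount":   np.array([0.35, 0.85]),
}

def similarity(w1, w2):
    # Cosine similarity; with gensim, use model.wv.similarity(w1, w2) instead
    a, b = word_vectors[w1], word_vectors[w2]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = ["drug", "dosage"]
doc2 = ["medicine", "amount"]

# For each word in doc1, report its single highest-similarity match in doc2
for w1 in doc1:
    best = max(doc2, key=lambda w2: similarity(w1, w2))
    print(w1, "->", best, round(similarity(w1, best), 3))
```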
I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Are they the most frequent words that occur in the cluster? One blog post I read said that the outputted words at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore with this generation of top terms for each cluster, they are the words that, on average, are most significant in the documents for that cluster.
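To make that concrete, here is a small numpy sketch (with made-up tf-idf rows) showing that a cluster centroid is just the mean of its members' tf-idf vectors, so argsorting it in descending order surfaces the highest-scoring vocabulary terms:

```python
import numpy as np

vocab = ["cat", "google", "kitten", "app"]

# Hypothetical tf-idf rows for three documents assigned to one cluster
cluster_docs = np.array([
    [0.8, 0.0, 0.6, 0.0],
    [0.7, 0.1, 0.7, 0.0],
    [0.9, 0.0, 0.5, 0.1],
])

centroid = cluster_docs.mean(axis=0)        # same idea as one row of cluster_centers_
top_indices = centroid.argsort()[::-1]      # highest coordinates first
print([vocab[i] for i in top_indices[:2]])  # ['cat', 'kitten']
```

The centroid's coordinate for a term is the average tf-idf score of that term over the cluster's documents, which is why the highest coordinates correspond to the most representative words.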
The way it has been done here works and is efficient, but I find it difficult to understand myself and I don't think it is particularly intuitive: it is hard to comprehend why, if cluster_centers_ are the coordinates of the centroids, the features with the highest coordinate values are the top words. I kind of get it but not quite (if anyone wants to explain how this works, that would be great!).
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    # Keep only the rows (documents) assigned to the target cluster
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # Sum each term's scores across the remaining documents
    term_scores = np.sum(doc_term_matrix, axis=0)
    # Indices of terms, ordered from highest total score to lowest
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]

term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I assign target=1; the function then filters the X array, keeping only the rows assigned to cluster 1. Next it sums all the scores across the documents row-wise, producing a single row with a column for each word. It then uses argsort to order that row from the highest value to the lowest, which yields the original index positions of the words. Finally, it uses a list comprehension to take the indices of the top n_top_words scores and builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about the closeness of words as how similar they are. For instance, you could use cosine similarity to determine how close two words are to each other: the vectors of "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words. A word pair like (the, cat) can be close if the two words frequently co-occur in sentences. That is how word2vec and similar algorithms work to create word vectors.
I am trying to build a relation extraction system for drug-drug interactions using a CNN and need to make embeddings for the words in my sentences. The plan is to represent each word in the sentences as a combination of 3 embeddings: (w2v,dist1,dist2) where w2v is a pretrained word2vec embedding and dist1 and dist2 are the relative distances between each word in the sentence and the two drugs that are possibly related.
I am confused about how I should approach the issue of padding so that every sentence is of equal length. Should I pad the tokenised sentences with some string (which string?) to equalise their lengths before any embedding?
You can compute the maximal separation between entity mentions linked by a relation and choose an input width greater than this distance. This will ensure that every input (relation mention) has the same length, by trimming longer sentences and padding shorter sentences with a special token.
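A minimal sketch of that idea (the <PAD> token name and the width of 6 are assumptions; any reserved token that gets its own embedding works):

```python
MAX_LEN = 6  # chosen to exceed the longest relation mention in the corpus
PAD = "<PAD>"

def pad_or_trim(tokens, max_len=MAX_LEN, pad_token=PAD):
    # Trim sentences longer than max_len; pad shorter ones with the special token
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))

print(pad_or_trim(["aspirin", "interacts", "with", "warfarin"]))
# ['aspirin', 'interacts', 'with', 'warfarin', '<PAD>', '<PAD>']
```

The pad token is then assigned its own embedding (often the zero vector), so the distance features dist1 and dist2 can be computed over a fixed-width input.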
Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
All the material I found is about either finding the mean vector (which is quite trivial, of course), representing the average meaning of the list, or some kind of weighted average of words for document representation; however, that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!
Actually, averaging of word vectors can be done in two ways:
Mean of word vectors, without tf-idf weights.
Mean of word vectors, each multiplied by its tf-idf weight.
The second will solve your problem of word importance.
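A sketch of the weighted version with made-up vectors and weights (in practice the weights would come from a fitted tf-idf model, or any importance score you choose):

```python
import numpy as np

# Hypothetical 2-d word vectors for ['sunglasses', 'jeans', 'hats']
vectors = np.array([
    [0.9, 0.1],   # sunglasses
    [0.2, 0.8],   # jeans
    [0.3, 0.7],   # hats
])
weights = np.array([3.0, 1.0, 1.0])  # give 'sunglasses' the biggest semantic impact

# Weighted mean: sum(w_i * v_i) / sum(w_i)
weighted_mean = (weights[:, None] * vectors).sum(axis=0) / weights.sum()
print(weighted_mean)
```

Compared with the plain mean, the result is pulled toward the 'sunglasses' vector, so a word like 'glasses' would score a higher cosine similarity against it than 'pants' would.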