I'm using BERT to compare text similarity, with the following code:
from bert_embedding import BertEmbedding
import numpy as np
from scipy.spatial.distance import cosine as cosine_similarity
bert_embedding = BertEmbedding()
TEXT1 = "As expected from MIT-level of course: it's interesting, challenging, engaging, and for me personally quite enlightening. This course is second part of 5 courses in micromasters program. I was interested in learning about supply chain (purely personal interest, my work touch this topic but not directly) and stumbled upon this course, took it, and man-oh-man...I just couldn't stop learning. Now I'm planning to take the rest of the courses. Average time/effort per week should be around 8-10 hours, but I tried to squeeze everything into just 5 hours since I have very limited free time. You will need 2-3 hours per week for the lecture videos, 2 hours for practice problems, and another 2 hours for the weekly homework. This course offers several topics around demand forecasting and inventory. Basic knowledge of probability and statistics is needed. It will help if you take the prerequisite course: supply chain analytics. But if you've already familiar with basic concept of statistics, you can pick yourself along the way. The lectures are very interesting and engaging, it gives you a lot of knowledge but also throw in some business perspective, so it's very relatable and applicable! The practice problems can help strengthen the understanding of the given knowledge and the homework are very challenging compared to other online-courses I have taken. This course is the best quality I have taken so far, and I have taken several (3-4 MOOCs) from other provider."
TEXT1 = TEXT1.split('.')
sentence2 = ["CHALLENGING COURSE "]
From there I want to find the best match for sentence2 among the sentences of TEXT1, using cosine distance:
best_match = {'sentence': '', 'score': ''}
best = 0
for sentence in TEXT1:
    # sentence = sentence.replace('SUPPLY CHAIN','')
    if len(sentence) < 5:
        continue
    avg_vec1 = calculate_avg_vec([sentence])
    avg_vec2 = calculate_avg_vec(sentence2)
    score = cosine_similarity(avg_vec1, avg_vec2)
    if score > best:
        best_match['sentence'] = sentence
        best_match['score'] = score
        best = score
best_match
The code works, but since I want to compare sentence2 not only with TEXT1 but with N texts, I need to improve the speed. Is it possible to vectorize this loop, or is there any other way to speed it up?
cosine_similarity is defined as a dot product of two normalized vectors.
This is essentially a matrix multiplication, followed by an argmax to get the best index.
I'll be using numpy here, although, as mentioned in the comments, you could probably plug this into the BERT model itself with PyTorch or TensorFlow.
First, we define a normalized average vector:
def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence)  # TODO: use BERT embedding
    vm = vs.mean(axis=0)
    return vm / np.linalg.norm(vm)
Then, we build a matrix of all sentences and their vectors
X = np.vstack([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec(target_sentence)
Finally, we'll need to multiply the target vector with the X matrix, and take the argmax
index_of_sentence = np.dot(X, target).argmax()
You might want to make sure that the axes and indexing fit your data, but this is the overall scheme.
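Putting the pieces together against your TEXT1 and sentence2, here is a minimal end-to-end sketch. The embed_tokens helper is a hypothetical stand-in for whatever token-vector call your BERT wrapper exposes (it is not part of bert_embedding's API), so swap it for the real embedding step:
import numpy as np

def embed_tokens(sentence):
    # Hypothetical stand-in: returns an (n_tokens, dim) array of token vectors.
    # Replace this with the output of your BERT embedding call.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=(max(len(sentence.split()), 1), 768))

def calculate_avg_norm_vec(sentence):
    # Average the token vectors, then L2-normalize the result.
    vm = embed_tokens(sentence).mean(axis=0)
    return vm / np.linalg.norm(vm)

sentences = [s for s in TEXT1 if len(s) >= 5]
X = np.vstack([calculate_avg_norm_vec(s) for s in sentences])   # (n_sentences, dim)
target = calculate_avg_norm_vec(sentence2[0])                    # (dim,)

scores = X @ target            # cosine similarities, since every row is unit length
best_index = int(scores.argmax())
best_match = {'sentence': sentences[best_index], 'score': float(scores[best_index])}
This way the N-text case reduces to stacking more rows into X (or one matrix per text) and doing a single matrix-vector product per query.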
I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings.
To do this I use the following code:
import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector
Where nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided and (3) averages the word vectors within a sentence to provide a single output vector. That's fast and cool.
However, in this process, out-of-vocabulary (OOV) terms are mapped to n-dimensional 0 vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when performing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.
I have been able to do this with a post-processing step (see code below), but this becomes too slow for my purposes, so I was wondering if there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that when calling nlp(sentence).vector it does not include OOV-term vectors when computing the mean.
import numpy as np
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)
Approaches tried
In both cases, documents is a list of 200 strings with ≈400 words each.
Without dealing with OOV terms:
import spacy
import time
nlp = spacy.load('myspacy.bioword2vec.model')
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in list(nlp.pipe(documents))]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s
Ignoring OOV terms in the output vector. Note that in this case we need to add an extra 'if' statement for those cases in which all words are OOV (if this happens the output vector is r_vec):
r_vec = np.random.rand(200)  # Random vector for empty text

# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s
In this example the mean time difference for vectorizing 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents made up entirely of OOV terms, which might slightly increase the computational time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.
I could always look for extra computational resources to apply the second piece of code, but I was wondering if there is any way of applying nlp.pipe(documents) while ignoring the OOV terms in the output. Any suggestions will be very welcome.
See this post by the author of spaCy, which says:
The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.
Try this for example:
import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np
sentence = "I love Stack Overflow butitsalsodistractive"
print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])
np.array_equal(tokens.vector, tokensClean.vector)
#False
If you want to speed things up, disable the spaCy pipeline components that you don't use (such as NER, the dependency parser, etc.).
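For example, a rough sketch combining both ideas for your documents list: load the model with the unused components disabled, stream with nlp.pipe, and drop OOV tokens before averaging. The component names to disable and the zero-vector fallback are assumptions to adapt to your model:
import numpy as np
import spacy

# Component names vary by model; disable whatever you don't need.
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser', 'ner'])

fallback = np.zeros(nlp.vocab.vectors_length)   # used when every token is OOV

def avg_in_vocab_vector(doc):
    # Average only the tokens that have a vector, i.e. skip OOV terms.
    vectors = [token.vector for token in doc if token.has_vector]
    return np.mean(vectors, axis=0) if vectors else fallback

documents_vec = [avg_in_vocab_vector(doc) for doc in nlp.pipe(documents, batch_size=256)]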
I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Is it the most frequent words that occur in the cluster? One blog post I read said that the outputted words at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i),
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind]),
print
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore with this generation of top terms for each cluster, they are the words that, on average, are most significant in the documents for that cluster.
The way it has been done here works and is efficient, but I find it difficult to understand myself, and I don't think it is particularly intuitive: if cluster_centers_ are the coordinates of the centroids, it isn't obvious why the features with the highest coordinate values are the top words. I kind of get it, but not quite (if anyone wants to explain how this works, that would be great!).
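One way to convince yourself: for (Euclidean) k-means, each centroid is just the mean of the TF-IDF rows assigned to that cluster, so the features with the highest centroid coordinates are the terms with the highest average TF-IDF weight in the cluster. A small sketch of that check, reusing X, model and terms from the question's code:
import numpy as np

cluster_id = 0
members = X[model.labels_ == cluster_id].toarray()   # TF-IDF rows assigned to this cluster
mean_weights = members.mean(axis=0)                  # average TF-IDF weight per term

# Should print True once k-means has converged: the centroid coordinates
# are (numerically) these per-term averages.
print(np.allclose(mean_weights, model.cluster_centers_[cluster_id]))

top = np.argsort(mean_weights)[::-1][:10]
print([terms[i] for i in top])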
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # Sum the scores over documents; ravel() flattens the (1, n_terms) matrix a sparse sum returns
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]

term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want the top words for cluster 1, so I pass target=1; the function then filters the X array, keeping only the rows assigned to cluster 1. It sums the scores over those documents (column by column), leaving a single row with one value per word, uses argsort to order that row from highest to lowest value (which yields the original index positions of the words), and finally uses a list comprehension to take the first n_top_words indices and look them up in feature_name_list to build the list of words.
When words are converted into vectors, we talk about the closeness of words in terms of how similar they are. For instance, you could use cosine similarity to determine how close two words are to each other: the vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words: a word pair such as (the, cat) can be close if the two words frequently appear in the same contexts. That is how word2vec and similar algorithms work to create word vectors.
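As a concrete illustration with a spaCy model that ships word vectors (en_core_web_md is assumed here), this closeness is just cosine similarity between the vectors:
import spacy

nlp = spacy.load('en_core_web_md')
dog, puppy, car = nlp('dog'), nlp('puppy'), nlp('car')

# .similarity() is the cosine similarity of the (averaged) vectors.
print(dog.similarity(puppy))   # typically high: the words occur in similar contexts
print(dog.similarity(car))     # typically lower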
I am using gensim wmdistance for calculating similarity between a reference sentence and 1000 other sentences.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)

reference_sentence = "it is a reference sentence"
other_sentences = [...]  # 1000 sentences

distance = {}
index = 0
for sentence in other_sentences:
    distance[index] = model.wmdistance(reference_sentence, sentence)
    index = index + 1
According to the gensim source code, model.wmdistance returns the following:
emd(d1, d2, distance_matrix)
where
d1 = # Compute nBOW representation of reference_sentence.
d2 = # Compute nBOW representation of other_sentence (one by one).
distance_matrix = # see the source code, as it's a bit too much to paste here.
This code is inefficient in two ways for my use case.
1) For the reference sentence, it is repeatedly calculating d1 (1000 times) for the distance function emd(d1, d2, distance_matrix).
2) This distance function is called by multiple users from different points, which repeats this whole process of model.wmdistance(doc1, doc2) for the same other_sentences, and it is computationally expensive. For these 1000 comparisons, it takes around 7-8 seconds.
Therefore, I would like to separate the two tasks: the final calculation of the distance, emd(d1, d2, distance_matrix), and the preparation of its inputs, d1, d2, and distance_matrix. Since the distance matrix depends on both documents, at least its input preparation should be isolated from the final matrix calculation.
My initial plan is to create three customized functions:
d1 = prepared1(reference_sentence)
d2 = prepared2(other_sentence)
distance_matrix inputs = prepare inputs
Is it possible to do this with this gensim function, or should I just go with my own customized version? Any ideas or solutions for dealing with this problem in a better way?
You are right to observe that this code could be refactored & optimized to avoid doing repetitive operations, especially in the common case where one reference/query doc is evaluated against a larger set of documents. (Any such improvements would also be a welcome contribution back to gensim.)
Simply preparing single documents outside the calculation might not offer a big savings; in each case, all word-to-word distances between the two docs must be calculated. It might make sense to precalculate a larger distance_matrix (to the extent that the relevant vocabulary & system memory allows) that includes all words needed for many pairwise WMD calculations.
(As tempting as it might be to precalculate all word-to-word distances, with a vocabulary of 3 million words like the GoogleNews vector-set, and mere 4-byte float distances, storing them all would take at least 18TB. So calculating distances for relevant words, on manageable batches of documents, may make more sense.)
A possible way to start would be to create a variant of wmdistance() that explicitly works on one document versus a set-of-documents, and can thus combine the creation of histograms/distance-matrixes for many comparisons at once.
For the common case of not needing all WMD values, but just wanting the top-N nearest results, there's an optimization described in the original WMD paper where another faster calculation (called there 'RWMD') can be used to deduce when there's no chance a document could be in the top-N results, and thus skip the full WMD calculation entirely for those docs.
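For that one-query-versus-many-documents pattern, gensim also ships a WmdSimilarity index class; internally it still calls wmdistance per document, but it handles the iteration and top-N filtering and would be a natural place to hook such optimizations. A rough sketch, reusing your other_sentences (constructor arguments and import paths have shifted between gensim versions, so check the docs for your version):
from gensim.models import KeyedVectors
from gensim.similarities import WmdSimilarity

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Pre-tokenized versions of the 1000 sentences being compared against.
wmd_corpus = [sentence.lower().split() for sentence in other_sentences]

# num_best keeps only the N nearest documents instead of all 1000 distances.
index = WmdSimilarity(wmd_corpus, model, num_best=10)
query = "it is a reference sentence".lower().split()
top_matches = index[query]   # list of (corpus position, similarity) pairs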
I am doing an exercise where I am determining the most likely corpus from a number of corpora when given a test sentence. I am trying to test an add-one (Laplace) smoothing model for this exercise. I generally think I have the algorithm down, but my results are very skewed. I am aware that add-one is not optimal (to say the least), but I just want to be certain my results are from the add-one methodology itself and not my attempt.
Now, the add-one/Laplace smoothing technique seeks to avoid 0 probabilities by, essentially, taking from the rich and giving to the poor. Therefore, a bigram that is found to have a zero probability becomes:
1/V, where V = the number of types
This means that the probability of every other bigram becomes:
P(B|A) = Count(W[i-1], W[i]) / (Count(W[i-1]) + V)
You would then take a test sentence, break it into bigrams, score each bigram against these probabilities (using the rule above for zero-probability bigrams), and multiply them all together to get the final probability of the sentence occurring.
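For concreteness, here is a minimal standalone sketch of the scheme described above on a toy corpus (illustrative only; my actual multi-corpus code follows below):
from collections import Counter

def sentence_probability(tokens, bigram_counts, unigram_counts):
    # Probability of a token sequence under the smoothing scheme described above.
    V = len(unigram_counts)                          # number of types
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        count = bigram_counts.get((prev, cur), 0)
        if count == 0:
            p *= 1 / V                               # unseen bigram
        else:
            p *= count / (unigram_counts[prev] + V)  # seen bigram
    return p

# toy corpus
corpus_tokens = "the cat sat on the mat".split()
unigrams = Counter(corpus_tokens)
bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
print(sentence_probability("the cat sat".split(), bigrams, unigrams))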
I am implementing this in Python. My code looks like this; all function calls are verified to work:
# return is a counter of tuples containing ngrams: {('A','B'): C}
# this means ('A','B') means (B|A) in probabilistic terms
bigrams[0] = getBigrams(corpus[0])
...
bigrams[n] = getBigrams(corpus[n])

# return is a dictionary of the form P['A'] = C
unigrams[0] = getUnigrams(corpus[0])
...
unigrams[n] = getUnigrams(corpus[n])

# generate bigram probabilities, return is P('A','B') = p, add-one is done here
prob[0] = getAddOneProb(unigrams[0], bigrams[0])
...
prob[n] = getAddOneProb(unigrams[n], bigrams[n])

for sentence in test:
    bi = getBigrams(sentence)
    uni = getUnigrams(sentence)
    P[0] = ... = P[n] = 1  # set to 1
    for b in bi:
        tup = b
        try:
            P[0] *= prob[tup]
        except KeyError:
            P[0] = (1/len(unigrams[0]))
        # do this for all corpora
At the end I would compare all corpora, P[0] through P[n], and find the one with the highest probability.
My results aren't that great, but I am trying to understand whether this is a function of poor coding, an incorrect implementation, or an inherent problem with add-one smoothing.
I am using gensim on a corpus of 50000 documents along with a dictionary of around 4000 features. I also have an LSI model already prepared for it.
Now I want to find the highest matching features for each of the added documents. To find the best features in a particular document, I am running gensim's similarity module for each of the features against all the documents. This gives us a score for each feature that we want to use later on. But as you can imagine, this is a costly process: we have to iterate over 50000 indices and run 4000 similarity queries against each.
I need a better way of doing this, as I run out of the 8 GB of memory on my system at around 1000 iterations. There's actually no reason for the memory to keep rising, as I am only reallocating it during the iterations. Surprisingly, the memory starts rising only after around 200 iterations.
Why the memory issue? How can it be solved?
Is there a better way of finding the highest scored features in a particular document (not topics)?
Here's a snippet of the code that runs out of memory:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('features-dict.dict')
corpus = corpora.MmCorpus('corpus.mm')
lsi = models.LsiModel.load('model.lsi')

corpus_lsi = lsi[corpus]
index = similarities.MatrixSimilarity(list(corpus_lsi))

newDict = dict()
for feature in dictionary.token2id.keys():
    vec_bow = dictionary.doc2bow([feature])
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    li = sorted(enumerate(sims * 100), key=lambda item: -item[1])
    for data in li:
        newDict[data[0]] = (feature, data[1])  # Store feature and score for each document

# Do something with the dict created above
EDIT:
The memory issue was resolved using a memory profiler. There was something else in that loop that caused it to rise drastically.
Let me explain the purpose in detail. Imagine we are dealing with various recipes (each recipe is a document) and each item in our dictionary is an ingredient. Find six such recipes below.
corpus = [['Olive Oil', 'Tomato', 'Broccoli', 'Oregano'],
          ['Garlic', 'Olive Oil', 'Bread', 'Cheese', 'Oregano'],
          ['Avocado', 'Beans', 'Cheese', 'Lime'],
          ['Jalapeno', 'Lime', 'Tomato', 'Tortilla', 'Sour Cream'],
          ['Chili Sauce', 'Vinegar', 'Mushrooms', 'Rice'],
          ['Soy Sauce', 'Noodles', 'Broccoli', 'Ginger', 'Vinegar']]
There are thousands of such recipes. What I am trying to achieve is to assign a weight between 0 and 100 to each ingredient (where a higher-weighted ingredient is more important or more unique). What would be the best way to achieve this?
Let's break this down:
unless I misunderstood your purpose, you can simply use the left singular vectors from lsi.projection.u to get your weights:
# create #features x #corpus 2D matrix of weights
doc_feature_matrix = numpy.dot(lsi.projection.u, index.index.T)
Rows of this matrix should be the "document weights" you're looking for, one row per feature.
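For example, to pull the ten highest-weighted ingredients for a single recipe, you could do something like this (a sketch assuming the shapes above: rows ordered by the dictionary's term ids, one column per document):
import numpy as np

doc_idx = 0   # which recipe/document to inspect
weights = np.asarray(doc_feature_matrix[:, doc_idx]).ravel()
top_ids = np.argsort(weights)[::-1][:10]
for feature_id in top_ids:
    print(dictionary[int(feature_id)], weights[feature_id])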
the call to list() in your list(corpus_lsi) makes your code very inefficient. It basically serializes the entire doc-topic matrix into RAM. Drop the list() and use the streamed version directly; it's much more memory-efficient: index = MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics).
LSI usually works better over regularized input. Consider transforming the plain bag-of-words vectors (= integer counts) via e.g. TF-IDF or a log-entropy transformation before passing them to LSI.
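Putting the last two points together, a rough sketch of that setup (num_topics=200 is a placeholder; file names are reused from the question):
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('features-dict.dict')
corpus = corpora.MmCorpus('corpus.mm')

tfidf = models.TfidfModel(corpus)                                # regularize the raw counts
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=200)

# Streamed: no list(), so the full doc-topic matrix never sits in a Python list.
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_features=lsi.num_topics)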