I have some documents and I'd like to find the k documents most similar to a selected document. For the sake of a reproducible example, let's say k is 1 and my documents are these
documents = ['Two roads diverged in a yellow wood,',
             'And sorry I could not travel both',
             'And be one traveler, long I stood',
             'And looked down one as far as I could',
             'To where it bent in the undergrowth']
Then I think what I want to do is the below. (I'm using CountVectorizer for transparency and simplicity, even though maybe later I'd want to use Tf-Idf and a hashing vectorizer.)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc = documents[1]
one_doc_code = vectorizer.transform([one_doc])
doc_match = np.matrix(ft) * np.matrix(one_doc_code.transpose())
and now doc_match is a column vector whose entries indicate closeness of match (higher score = closer match). But in order to do the multiplication, I (in desperation, in the face of element-wise multiplication) converted to a numpy matrix, so now I have a matrix that doesn't have a todense() member (so I can't just look, not that that would scale beyond my tiny example).
What I think I want now (but haven't been able to figure out so far) is how to say "what are the indices of the top k elements of doc_match?" (even if k is not 1).
If all you want are the indices in doc_match that have the highest score, you can do:
sorted_indices = np.argsort(doc_match)
doc_match_vals_sorted = doc_match[sorted_indices]
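Note that np.argsort sorts in ascending order, so the best matches end up at the end. Below is a minimal sketch of my own for pulling out the indices of the top k documents while keeping the matrices sparse until the end (ft, one_doc_code and k are from the question; the shapes may need checking against your data):

import numpy as np

k = 1
scores = ft.dot(one_doc_code.T)                 # sparse column vector of match scores
scores = np.asarray(scores.todense()).ravel()   # flatten to a 1-D array
top_k_indices = np.argsort(scores)[::-1][:k]    # indices of the k best-matching documents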
I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That means I want a vector for a term where the documents are the features. That's simply the transpose of the TF-IDF matrix created by the TfidfVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The max_features flag of the CountVectorizer would do exactly what I'm looking for: I can specify a dimension and the CountVectorizer tries to fit all the information into that dimension. Unfortunately, this option applies to the document vectors rather than the term vectors, so it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? Like, perform a transpose on the TfidfVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I have been looking for such an approach for a while now, but every guide and code example I find talks about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose a way it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to the CountVectorizer user guide (the same applies to the TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using inverse_transform and performing the following steps:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)
# We use inverse_transform, which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)
# Each line in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents, we transform the list
# of feature names into the number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))
# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))
# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]
# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)
# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
# (converted to a list so that scipy treats it as row indices)
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable for you, then with some more code it could also take the count or weight of the terms into account, as sketched below.
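For example, a minimal sketch of my own (not tested on your data) that ranks the documents by their total TF-IDF weight instead of by the number of distinct terms they contain:

import numpy as np

# Total TF-IDF weight per document (row sums of the document-term matrix)
doc_weights = np.asarray(model.sum(axis=1)).ravel()
# Keep the max_features documents with the largest total weight
top_documents = np.argsort(doc_weights)[::-1][:max_features]
reduced_model = model[top_documents]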
I hope this helps!
I'm using BERT to compare text similarity, with the following code:
from bert_embedding import BertEmbedding
import numpy as np
from scipy.spatial.distance import cosine as cosine_similarity
bert_embedding = BertEmbedding()
TEXT1 = "As expected from MIT-level of course: it's interesting, challenging, engaging, and for me personally quite enlightening. This course is second part of 5 courses in micromasters program. I was interested in learning about supply chain (purely personal interest, my work touch this topic but not directly) and stumbled upon this course, took it, and man-oh-man...I just couldn't stop learning. Now I'm planning to take the rest of the courses. Average time/effort per week should be around 8-10 hours, but I tried to squeeze everything into just 5 hours since I have very limited free time. You will need 2-3 hours per week for the lecture videos, 2 hours for practice problems, and another 2 hours for the weekly homework. This course offers several topics around demand forecasting and inventory. Basic knowledge of probability and statistics is needed. It will help if you take the prerequisite course: supply chain analytics. But if you've already familiar with basic concept of statistics, you can pick yourself along the way. The lectures are very interesting and engaging, it gives you a lot of knowledge but also throw in some business perspective, so it's very relatable and applicable! The practice problems can help strengthen the understanding of the given knowledge and the homework are very challenging compared to other online-courses I have taken. This course is the best quality I have taken so far, and I have taken several (3-4 MOOCs) from other provider."
TEXT1 = TEXT1.split('.')
sentence2 = ["CHALLENGING COURSE "]
From there I want to find the best match for sentence2 among the sentences of TEXT1, using cosine distance:
best_match = {'sentence': '', 'score': ''}
best = 0
for sentence in TEXT1:
    #sentence = sentence.replace('SUPPLY CHAIN','')
    if len(sentence) < 5:
        continue
    avg_vec1 = calculate_avg_vec([sentence])
    avg_vec2 = calculate_avg_vec(sentence2)
    score = cosine_similarity(avg_vec1, avg_vec2)
    if score > best:
        best_match['sentence'] = sentence
        best_match['score'] = score
        best = score

best_match
The code is working, but since I want to compare sentence2 not only with TEXT1 but with N texts, I need to improve the speed. Is it possible to vectorize this loop, or is there any other way to speed it up?
Cosine similarity is defined as the dot product of two L2-normalized vectors.
This is essentially a matrix multiplication, followed by an argmax to get the best index.
I'll be using numpy, even though - as mentioned in the comments - you could probably plug it into the BERT model with pytorch or tensorflow.
First, we define a normalized average vector:
def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence)  # TODO: use BERT embedding
    vm = vs.mean(axis=0)
    return vm / np.linalg.norm(vm)
Then we build a matrix with one normalized vector per sentence (a plain list comprehension is the simplest way to stack them):
X = np.array([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec(target_sentence)
Finally, we multiply the X matrix by the target vector and take the argmax:
index_of_sentence = X.dot(target).argmax()
You might want to make sure that the axes and indexing fit your data, but this is the overall scheme.
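For what it's worth, here is a rough, self-contained illustration of the scheme. The sentence2vectors function below is a fake stand-in for the BERT embedding (random vectors), so the "best match" it prints is only illustrative of the mechanics, not of real similarity:

import numpy as np

def sentence2vectors(sentence):
    # Fake stand-in: one random 768-dim vector per token, seeded by the sentence.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=(len(sentence.split()), 768))

def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence)
    vm = vs.mean(axis=0)
    return vm / np.linalg.norm(vm)

all_sentences = ["this course is challenging and engaging",
                 "the homework takes a lot of time",
                 "the lectures add a business perspective"]
X = np.array([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec("CHALLENGING COURSE")

best_index = X.dot(target).argmax()
print(all_sentences[best_index])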
I wanted to understand the way fastText vectors for sentences are created. According to this issue 309, the vectors for sentences are obtained by averaging the vectors for words.
In order to confirm this, I wrote the following script:
import numpy as np
import fastText as ft
# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')
# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)
# Printing the result.
print(is_equal)
# Result: FALSE
But it seems the two approaches do not yield the same vector.
Why aren't both values the same? Would it be related to the way I am averaging the vectors? Or, maybe there is something I am missing?
First, you missed the part where get_sentence_vector is not just a simple average. Before fastText sums the word vectors, each vector is divided by its L2 norm, and the averaging process only involves vectors that have a positive L2 norm.
Second, a sentence always ends with an EOS token. So if you try to calculate it manually, you need to include the EOS vector before you compute the average.
Try this (I assume the L2 norm of each word vector is positive):
def l2_norm(x):
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
eos = model.get_word_vector('\n')

# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
You can see the source code here or you can see the discussion here.
You might be hitting an issue with floating point math - e.g. if one addition was done on a CPU and one on a GPU they could differ.
The best way to check if it's doing what you want is to make sure the vectors are almost exactly the same.
You might want to print out the two vectors and manually inspect them, or take the dot product of one_two minus one_two_avg with itself (i.e. the squared length of the difference between the two).
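A minimal sketch of that check (the tolerance is my guess and may need adjusting):

import numpy as np

diff = one_two - one_two_avg
print(np.dot(diff, diff))                            # squared length of the difference
print(np.allclose(one_two, one_two_avg, atol=1e-5))  # near-equality within a tolerance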
I have the following code that I found from https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Is it the most frequent words that occur in the cluster? One blog post I read said that the outputted words at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore the top terms generated for each cluster are the words that, on average, are most significant in the documents of that cluster.
The way it has been done here works and is efficient, but I find it difficult to understand myself, and I don't think it is particularly intuitive: it is hard to comprehend why, if cluster_centers_ are the coordinates of the centroids, the features with the highest coordinate values should be the top words. I kind of get it but not quite (if anyone wants to explain how this works, that would be great!).
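One rough way to see it (a sketch of my own, not from the post above): a k-means centroid is essentially the mean of the TF-IDF rows assigned to it, so argsorting a centroid ranks the terms by their average weight within that cluster. Something like:

import numpy as np

cluster_id = 0
members = X[model.labels_ == cluster_id].toarray()  # TF-IDF rows of this cluster's documents
mean_weights = members.mean(axis=0)                 # average weight of each term in the cluster
top_terms = [terms[i] for i in np.argsort(mean_weights)[::-1][:10]]
print(top_terms)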
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # Sum the scores over the documents and flatten to a 1-D array of per-term totals
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]
term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I pass target=1; the function then filters the X array, keeping only the rows assigned to cluster 1. Next it sums all the scores across those documents row-wise, so it has a single row with one column per word. It then uses argsort to get the term indices ordered from highest score to lowest. Finally, a list comprehension takes the first n_top_words indices and builds a list of words by looking them up in feature_name_list.
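For what it's worth, a small usage sketch that loops over every cluster instead of just cluster 1 (true_k, X, terms and model are from the code above):

for cluster_id in range(true_k):
    print(cluster_id, term_scorer(X, terms, labels=model.labels_, target=cluster_id))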
When words are converted into vectors, we talk about closeness of words as how similar they are. So, for instance, you could use cosine similarity to determine how close two words are to each other. The vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words: the word pair (the, cat) can be close if the two words appear in similar contexts across sentences. That is how word2vec and similar algorithms create word vectors.
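A minimal sketch of that cosine-similarity check (the vectors here are made up for illustration; real ones would come from word2vec or a similar model):

import numpy as np

dog = np.array([0.2, 0.8, 0.1])
puppy = np.array([0.25, 0.75, 0.05])
cos_sim = dog.dot(puppy) / (np.linalg.norm(dog) * np.linalg.norm(puppy))
print(cos_sim)  # a value close to 1.0 means the two words are "close"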
I'm analyzing song lyrics where repetition doesn't necessarily mean higher importance, so I'd like to cap the word count per document. For example, if a word appears n times in a song, where n > threshold, then I would replace n with threshold.
I've checked the CountVectorizer docs, and there are min_df and max_df options, but these only disregard words based on how many documents they appear in, not based on how many times they appear within a single document.
I was thinking of changing the elements of the sparse matrix (say, find all elements > threshold, then replace), but I couldn't find a way to do that either. Thanks in advance!
I don't know of any prebuilt feature in scikit-learn for this, but you can definitely edit your doc-term matrix directly, with numpy.where for example:
x = numpy.where(x < threshold, x, threshold)
where x is your doc-term matrix and threshold is, well, your threshold.
EDIT:
I hadn't realized numpy.where doesn't work on scipy sparse matrices. You can use the find function from scipy.sparse, which returns the indices (and values) of all nonzero entries in a sparse matrix, to access and modify those values directly:
from scipy.sparse import find

results = find(x > threshold)
for i in range(len(results[0])):
    x[results[0][i], results[1][i]] = threshold
It's significantly less elegant but it works.
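If the loop gets slow on a large matrix, a vectorized variant of my own (assuming x is a CSR or CSC matrix) is to cap the stored nonzero values directly:

import numpy as np

# The .data attribute holds the nonzero entries of the sparse matrix,
# so capping them in place avoids a Python-level loop.
x.data = np.minimum(x.data, threshold)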