NLP - speed up word similarity matching - python

I am trying to find the maximum similarity between two words in a pandas DataFrame. Here is my routine:

import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    ret_val = max([(wordnet.wup_similarity(syn_1, syn_2) or 0) for
                   syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))])
    return ret_val

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
It works fine, but it is too slow. I am looking for a way to speed it up; wordnet takes the majority of the time. Any suggestions? Cython? I am open to using other packages such as spacy.

Since you said you are open to using spacy as your NLP library, let's consider a simple benchmark. We will use the Brown news corpus to create somewhat arbitrary word pairs by splitting it in half.
from nltk.corpus import brown
brown_corpus = list(brown.words(categories='news'))
brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],
    'word_2': brown_corpus[len(brown_corpus)//2:]
})
len(brown_df)
50277
The cosine similarity of two tokens/documents can be computed with the Doc.similarity method.
import spacy
nlp = spacy.load('en')
def spacy_max_similarity(row):
    word_1 = nlp(row['word_1'])
    word_2 = nlp(row['word_2'])
    return word_1.similarity(word_2)
Finally, apply both methods to the data frame and time them (nltk_max_similarity below is the max_similarity routine from the question):
nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
1 loop, best of 3: 59 s per loop
spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
1 loop, best of 3: 8.88 s per loop
Please note that NLTK and spacy use different techniques for measuring similarity: spacy compares pretrained word vectors (see the quoted docs below), while NLTK's wup_similarity is based on the WordNet taxonomy. From the docs:
Using word vectors and semantic similarities
[...]
The default English model installs vectors for one million vocabulary
entries, using the 300-dimensional vectors trained on the Common Crawl
corpus using the GloVe algorithm. The GloVe common crawl vectors have
become a de facto standard for practical NLP.
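If the per-row nlp() calls are still the bottleneck, a further option is to push each column through the pipeline once with nlp.pipe and compute the cosine similarity in NumPy. This is a minimal sketch, not part of the benchmark above; spacy_batched_similarity is an illustrative name, and it assumes the loaded model ships word vectors:

import numpy as np
import spacy

nlp = spacy.load('en')  # as above; assumes the installed model comes with word vectors

def spacy_batched_similarity(df):
    # Run each column through the pipeline once instead of calling nlp() per row.
    vecs_1 = np.array([doc.vector for doc in nlp.pipe(df['word_1'].astype(str))])
    vecs_2 = np.array([doc.vector for doc in nlp.pipe(df['word_2'].astype(str))])
    # Row-wise cosine similarity; the small epsilon guards against all-zero vectors
    # produced for out-of-vocabulary words.
    norms = np.linalg.norm(vecs_1, axis=1) * np.linalg.norm(vecs_2, axis=1) + 1e-9
    return (vecs_1 * vecs_2).sum(axis=1) / norms

brown_df['result'] = spacy_batched_similarity(brown_df)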

One way to make it faster is to cache word-pair similarities, so that repeated pairs do not trigger the expensive WordNet lookup again.
import pandas as pd
from nltk.corpus import wordnet
import itertools
df = pd.DataFrame({'word_1':['desk', 'lamp', 'read'], 'word_2':['call','game','cook']})
word_similarities = dict()
def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    key = tuple(sorted([word_1, word_2]))  # symmetric measure :)
    if key not in word_similarities:
        word_similarities[key] = max([
            (wordnet.wup_similarity(syn_1, syn_2) or 0)
            for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))
        ])
    return word_similarities[key]

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
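The same memoization can also be handed off to functools.lru_cache from the standard library. A minimal sketch of that variant (the default=0 guard for words without any synsets is an addition, not in the code above):

from functools import lru_cache
import itertools

import pandas as pd
from nltk.corpus import wordnet

@lru_cache(maxsize=None)
def cached_similarity(word_1, word_2):
    # default=0 guards against words that have no synsets at all (addition for robustness)
    return max(
        ((wordnet.wup_similarity(syn_1, syn_2) or 0)
         for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))),
        default=0,
    )

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})
# Sort the pair so the cache also hits for (b, a) once (a, b) has been computed.
df['result'] = df.apply(lambda row: cached_similarity(*sorted([row['word_1'], row['word_2']])), axis=1)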

Related

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

search_terms = "my query"
documents = ["my", "list", "of", "docs"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers re-vectorization of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file", so that when I run a query, the only thing left to do is to vectorize the few words of the query and then calculate the similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports this kind of search is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
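To make the inverted-index suggestion a bit more concrete, here is a toy sketch of the idea (the sample documents and helper names are made up for illustration; a real index such as Lucene adds scoring, compression, and skip lists on top of this):

from collections import defaultdict

documents = ["my first document", "another document about search", "unrelated text"]

# Map each term to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for term in doc.lower().split():
        inverted_index[term].add(doc_id)

def candidate_docs(query):
    # Only documents sharing at least one query term need to be scored at all.
    ids = set()
    for term in query.lower().split():
        ids |= inverted_index.get(term, set())
    return ids

print(candidate_docs("document search"))  # {0, 1}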
You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents and only transforming your search terms. Here is your code, modified accordingly and using the twenty_newsgroups documents (11k of them) in place of your documents list. You can run it as a script and interactively verify that you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

news = fetch_20newsgroups()

search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data

vectorizer = TfidfVectorizer()

# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)

# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()

if __name__ == "__main__":
    while True:
        query_str = input("\n\n\n\nquery string (return to quit): ")
        if not query_str:
            print("bye!")
            break
        search_term_vector = vectorizer.transform([query_str])
        cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
        best_idx = np.argmax(cosine_similarities)
        best_score = cosine_similarities[best_idx]
        best_doc = documents[best_idx]
        if best_score < 0.1:
            print("no good matches")
        else:
            print(
                f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...",
            )
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm finds the best match(es) in O(n_documents) time at best, compared to Lucene (which powers Elasticsearch) and its skip lists, which can search in O(log(n_documents)). Production search engines also have quite a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
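For the "index file" part of the question, the fitted vectorizer and the precomputed document matrix can be persisted once, offline, and only the query is transformed at search time. A minimal sketch, assuming joblib is available (it is installed alongside scikit-learn); the file name is illustrative and the snippet continues from the script above:

from joblib import dump, load

# Offline, once: persist the fitted vectorizer together with the document matrix.
dump((vectorizer, doc_vectors), "tfidf_index.joblib")

# Online, per query: reload the "index" and only transform the incoming query.
vectorizer, doc_vectors = load("tfidf_index.joblib")
query_vector = vectorizer.transform(["protocol"])
scores = linear_kernel(doc_vectors, query_vector).flatten()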

Using known python packages for implementing N-Gram, TF-IDF and Cosine similarity

I'm trying to implement a similarity function using
N-Grams
TF-IDF
Cosine Similarity
Example
Concept:
words = [...]
word = '...'
similarity = predict(words,word)
def predict(words, word):
    words_ngrams = create_ngrams(words, range=(2, 4))
    word_ngrams = create_ngrams(word, range=(2, 4))
    words_tokenizer = tfidf_tokenizer(words_ngrams)
    word_vec = words_tokenizer.transform(word)
    return cosine_similarity(word_vec, words_tokenizer)
I searched the web for a simple and safe implementation, but I couldn't find one that uses well-known Python packages such as sklearn, nltk, scipy, etc.; most of them use "self-made" calculations.
I'm trying to avoid coding every step by hand, and I'm guessing there is an easy fix for that whole pipeline.
Any help (and code) would be appreciated. Thanks :)
Eventually I figured it out...
For whoever needs a solution to this question, here's a function I wrote that takes care of it...
'''
### N-Gram & TF-IDF & Cosine Similarity
Using n-grams on the 'from column' with TF-IDF to predict the 'to column'.
Adds a 'cosine_similarity' column to the df with the numeric result.
'''
def add_prediction_by_ngram_tfidf_cosine(from_column_name, ngram_range=(2, 4)):
    global df
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    vectorizer.fit(df.FromColumn)

    w = from_column_name
    vec_word = vectorizer.transform([w])
    df['vec'] = df.FromColumn.apply(lambda x: vectorizer.transform([x]))
    df['cosine_similarity'] = df.vec.apply(lambda x: cosine_similarity(x, vec_word)[0][0])
    df = df.drop(['vec'], axis=1)
Note: it's not production ready
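A hypothetical usage example for the function above; the FromColumn data here is made up to match the global df the function expects:

import pandas as pd

df = pd.DataFrame({'FromColumn': ['apple inc', 'banana corp', 'apple incorporated']})
add_prediction_by_ngram_tfidf_cosine('apple inc.')
# Each row now has a cosine_similarity score against the query string.
print(df[['FromColumn', 'cosine_similarity']])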

sentence similarity using word embedding

I am a PhD researcher and have started using word2vec for my research. I just want to use it to calculate sentence similarity. I searched and found a few links, but I couldn't run them. I was looking at the following:
import numpy as np
from scipy import spatial
index2word_set = set(model.wv.index2word)
def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
Unfortunately I couldn't run this, as I don't know how to obtain "index2word_set". In addition, what should I assign to model? Are there any easy commands or instructions to implement this?
Assign model to the model that you trained yourself, or to any pretrained word2vec model that you want to use.
As for index2word_set, build it as in your snippet: index2word_set = set(model.wv.index2word).
It should work then.
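A minimal sketch of that suggestion, assuming gensim 3.x (to match the model.wv.index2word usage in the question) and its downloader API; the pretrained model name is just one of several available:

import gensim.downloader as api

# Any pretrained word2vec/GloVe model with 300-dimensional vectors works here.
model = api.load('word2vec-google-news-300')  # returns a KeyedVectors object

# On KeyedVectors, index2word lives directly on the object (no .wv needed).
index2word_set = set(model.index2word)

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300,
                            index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300,
                            index2word_set=index2word_set)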

gensim: custom similarity measure

Using gensim, I want to calculate the similarity within a list of documents. This library is excellent at handling the amount of data that I have. The documents are all reduced to timestamps and I have a function time_similarity to compare them. gensim, however, uses cosine similarity.
I am wondering if anyone has attempted this before or has a different solution.
It is possible to do this by inheriting from the interface SimilarityABC. I did not find any documentation for this but it looks like it has been done before to define Word Mover Distance similarity. Here is a generic way to do this. You can likely make it more efficient by specializing to the similarity measure you care about.
import numpy
from gensim import interfaces
class CustomSimilarity(interfaces.SimilarityABC):
    def __init__(self, corpus, custom_similarity, num_best=None, chunksize=256):
        self.corpus = corpus
        self.custom_similarity = custom_similarity
        self.num_best = num_best
        self.chunksize = chunksize
        self.normalize = False

    def get_similarities(self, query):
        """
        **Do not use this function directly; use the self[query] syntax instead.**
        """
        if isinstance(query, numpy.ndarray):
            # Convert document indexes to actual documents.
            query = [self.corpus[i] for i in query]
        if not isinstance(query[0], list):
            query = [query]
        n_queries = len(query)
        result = []
        for qidx in range(n_queries):
            qresult = [self.custom_similarity(document, query[qidx])
                       for document in self.corpus]
            qresult = numpy.array(qresult)
            result.append(qresult)
        if len(result) == 1:
            # Only one query.
            result = result[0]
        else:
            result = numpy.array(result)
        return result
To implement a custom similarity:
def overlap_sim(doc1, doc2):
    # similarity defined by the number of common words
    return len(set(doc1) & set(doc2))
corpus = [['cat', 'dog'], ['cat', 'bird'], ['dog']]
cs = CustomSimilarity(corpus, overlap_sim, num_best=2)
print(cs[['bird', 'cat', 'frog']])
This outputs [(1, 2.0), (0, 1.0)].

Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

I am working on a keyword extraction problem. Consider the very general case:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.
"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."
"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"
Our best blessings are often the least appreciated."""
tfs = tfidf.fit_transform(t.split(" "))
str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
and this gives me
(0, 28) 0.443509712811
(0, 27) 0.517461475101
(0, 8) 0.517461475101
(0, 6) 0.517461475101
tree - 0.443509712811
travellers - 0.517461475101
jupiter - 0.517461475101
fruit - 0.517461475101
which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?
You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:
feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
n = 3
top_n = feature_array[tfidf_sorting][:n]
This gives me:
array([u'fruit', u'travellers', u'jupiter'],
dtype='<U13')
The argsort call is really the useful one; here are the docs for it. We have to do [::-1] because argsort only supports sorting from small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using \n\n instead means that we are actually looking at four documents (one per paragraph), which makes more sense when you think about tfidf.
Solution using sparse matrix itself (without .toarray())!
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]

X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())

new_doc = ['can key words in this new document be identified?',
           'idf is the inverse document frequency caculcated for each of the words']
responses = tfidf.transform(new_doc)

def get_top_tf_idf_words(response, top_n=2):
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return feature_names[response.indices[sorted_nzs]]

print([get_top_tf_idf_words(response, 2) for response in responses])

# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]
Here is some quick code for that (documents is a list of strings):
def get_tfidf_top_features(documents, n_top=10):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
    importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
    tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names())
    return tfidf_feature_names[importance[:n_top]]
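A hypothetical usage example, reusing the small corpus and the imports from the previous answer:

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]
# Note: with min_df=2, only terms appearing in at least two documents survive,
# so a tiny corpus like this yields only a handful of features.
print(get_tfidf_top_features(corpus, n_top=5))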
