I am using gensim wmdistance for calculating similarity between a reference sentence and 1000 other sentences.
model = gensim.models.KeyedVectors.load_word2vec_format(
'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
reference_sentence = "it is a reference sentence"
other_sentences = [1000 sentences]  # placeholder for my list of 1000 sentences
distances = [0.0] * len(other_sentences)
for index, sentence in enumerate(other_sentences):
    distances[index] = model.wmdistance(reference_sentence, sentence)
According to gensim source code, model.wmdistance returns the following:
emd(d1, d2, distance_matrix)
where
d1 = # Compute nBOW representation of reference_sentence.
d2 = # Compute nBOW representation of other_sentence (one by one).
distance_matrix = # the pairwise word-to-word distances; see the source code, as it's a bit too long to paste here.
This code is inefficient in two ways for my use case.
1) For the reference sentence, it is repeatedly calculating d1 (1000 times) for the distance function emd(d1, d2, distance_matrix).
2) This distance function is called by multiple users from different entry points, each repeating the whole model.wmdistance(doc1, doc2) process for the same other_sentences, which is computationally expensive. These 1000 comparisons take around 7-8 seconds.
Therefore, I would like to separate the two tasks: the final distance calculation, emd(d1, d2, distance_matrix), and the preparation of its inputs, d1, d2, and the distance matrix. Since the distance matrix depends on both documents, at least the preparation of its inputs should be separated from the final matrix calculation.
My initial plan is to create three customized functions:
d1 = prepared1(reference_sentence)
d2 = prepared2(other_sentence)
distance_matrix_inputs = prepare_inputs(reference_sentence, other_sentence)
Is it possible to do this with this gensim function, or should I just write my own customized version? Any ideas or solutions for dealing with this problem in a better way?
You are right to observe that this code could be refactored & optimized to avoid doing repetitive operations, especially in the common case where one reference/query doc is evaluated against a larger set of documents. (Any such improvements would also be a welcome contribution back to gensim.)
Simply preparing single documents outside the calculation might not offer big savings; in each case, all word-to-word distances between the two docs must still be calculated. It might make sense to precalculate a larger distance_matrix (to the extent that the relevant vocabulary & system memory allows) that includes all words needed for many pairwise WMD calculations.
(As tempting as it might be to precalculate all word-to-word distances, with a vocabulary of 3 million words like the GoogleNews vector-set, and mere 4-byte float distances, storing them all would take at least 18TB. So calculating distances for relevant words, on manageable batches of documents, may make more sense.)
A possible way to start would be to create a variant of wmdistance() that explicitly works on one document versus a set of documents, and can thus combine the creation of histograms/distance matrices for many comparisons at once.
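For illustration only, here is a rough sketch of what such a one-vs-many helper could look like, built directly on pyemd's emd() and a KeyedVectors model. The function name, the use of scipy's cdist, and the key_to_index attribute (recent gensim versions) are assumptions, not gensim's existing API:

from collections import Counter
import numpy as np
from pyemd import emd
from scipy.spatial.distance import cdist

def wmd_one_vs_many(model, reference_tokens, other_docs):
    # Keep only in-vocabulary words, then build one joint vocabulary for the whole batch,
    # so the word-to-word distance matrix and the reference histogram are computed once.
    docs = [[w for w in doc if w in model.key_to_index] for doc in other_docs]
    reference_tokens = [w for w in reference_tokens if w in model.key_to_index]
    vocab = sorted(set(reference_tokens).union(*[set(d) for d in docs]))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.array([model[w] for w in vocab], dtype=np.float64)
    distance_matrix = cdist(vectors, vectors).astype(np.float64)  # Euclidean word-to-word distances

    def nbow(tokens):
        hist = np.zeros(len(vocab), dtype=np.float64)
        for word, count in Counter(tokens).items():
            hist[index[word]] = count
        return hist / max(hist.sum(), 1.0)

    d1 = nbow(reference_tokens)  # computed once, reused for every comparison
    return [emd(d1, nbow(doc), distance_matrix) for doc in docs]

One trade-off to keep in mind: each emd() call then runs over the batch's joint vocabulary rather than just the pair's, so batches should be kept small enough that the bigger histograms don't dominate the savings.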
For the common case of not needing all WMD values, but just wanting the top-N nearest results, there's an optimization described in the original WMD paper where another faster calculation (called there 'RWMD') can be used to deduce when there's no chance a document could be in the top-N results, and thus skip the full WMD calculation entirely for those docs.
I'm trying to find out how similar 2 sentences are.
To do it I'm using gensim word mover distance, and since what I'm trying to find is a similarity, I do it as follows:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one you can find on the web: word2vec-google-news-300
I load it with this code:
import gensim.downloader as api
wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation, it seems wmdistance takes a list of strings as input, and not a plain string like I pass!
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # stop_words was undefined in the original snippet; NLTK's English list is a common choice

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get really different results:
wmd using a string as input: 0.5562025871542842
wmd using a list of strings as input: -0.0174646259300113
I'm really confused. Why does it work with a string as input, and why does it work better than when I give it what the documentation asks for?
The function needs a list of string tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists of single characters, the value there is going to be a function of how different the letters in the two texts are. The fact that all English sentences of about the same length have very similar letter distributions means most texts will rate as very similar under that erroneous evaluation.
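For instance, iterating over a plain Python string yields single characters (spaces included), so that is the "token list" the function actually receives:

list('Obama speaks')
# -> ['O', 'b', 'a', 'm', 'a', ' ', 's', 'p', 'e', 'a', 'k', 's']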
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not as two different results from different processes (where one of them is essentially random). You shouldn't consider absolute values that exceed some set threshold, or are close to 1.0, as definitively good. You should instead use relative differences between two similarity/distance values to conclude that one pair is more similar/distant than another.
Finally: converting a distance (which runs from 0.0 for closest to infinity for furthest) to a similarity (which typically runs from 1.0 for most similar down to -1.0 or 0.0 for least similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
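As one illustration (an example conversion, not necessarily the one recommended in that linked answer), a mapping such as 1 / (1 + distance) stays in (0, 1] no matter how large the distance gets:

def similarity_from_distance(d):
    # a distance of 0.0 maps to similarity 1.0; larger distances approach 0.0, never negative
    return 1.0 / (1.0 + d)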
Assume that we have 1000 words (A1, A2, ..., A1000) in a dictionary. As far as I understand, the word embedding or word2vec method aims to represent each word in the dictionary by a vector where each element represents the similarity of that word to the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, i.e. that the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of the "size" parameter for Word2Vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1, x2, ..., x100), what do x1, x2, ..., x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, say 100, the Word2Vec algorithm gradually trains 100-dimensional word-vectors to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity; furthermore, the various "directions" in this 100-dimensional space often tend to match human-perceivable semantic categories. So the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
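As a hedged illustration (this just uses gensim's built-in most_similar(), with the same pretrained GoogleNews vectors as in the earlier question):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears at or near the top of this list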
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs and averaged all their differences, you might find a 100-dimensional vector direction that is suggestive of the gender direction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.
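For example, a minimal sketch with a toy corpus (assuming a recent gensim release, where the parameter is called vector_size rather than size):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"], ["a", "cat", "sleeps"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

vec = model.wv["king"]   # one learned embedding
print(vec.shape)         # (100,) -- 100 dimensions, regardless of vocabulary size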
I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A, T, G, C, N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of the strings, store them in a dictionary, and then check whether a variation was found more than once. This worked pretty fast (20 min) for a hamming distance of 1, but it is very memory intensive and would not be viable for a hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []  # the millions of 24-character sequences
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        # replace position i with a wildcard; sequences within hamming distance 1
        # of each other share at least one such mask
        mask = sequence[:i] + '?' + sequence[i + 1:]
        try:
            masks[mask].append(sequence[i])
        except KeyError:
            masks[mask] = [sequence[i]]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs, it returns all the pairs with a high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, with N << 5^24, and a maximum hamming distance d, return the pairs whose hamming distance is at most d.
The way you'll solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^{24}, where K is a parameter you'll have to experiment with. For every data point v, you'll define the hash of v as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0,1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the datapoints in the sequences, you only check data points in the bin of "close" hashes.
Note that this is heuristic, and you will need to experiment with K and with how "close" a hash has to be before you search it. As K increases, the number of bins grows exponentially, but points that land in the same bin are more likely to be genuinely similar.
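Purely as an illustration of this scheme (the numeric character encoding, the choice of K, and the helper names are assumptions made to keep the sketch concrete):

from collections import defaultdict
import numpy as np

CODE = {'A': 0.0, 'T': 1.0, 'G': 2.0, 'C': 3.0, 'N': 4.0}

def to_vector(seq):
    # embed a 24-character sequence as a point in R^24
    return np.array([CODE[c] for c in seq])

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def lsh_candidate_pairs(sequences, k=8, max_dist=2, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(k, 24))                    # K random hyperplanes through the origin
    buckets = defaultdict(list)
    for seq in sequences:
        signature = tuple(planes @ to_vector(seq) > 0)   # which side of each plane
        buckets[signature].append(seq)
    pairs = set()
    for bucket in buckets.values():                      # only compare sequences sharing a bucket
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if hamming(bucket[i], bucket[j]) <= max_dist:
                    pairs.add((bucket[i], bucket[j]))
    return pairs

Hashing each sequence under several independent sets of planes and unioning the candidate pairs raises the chance of catching every close pair, at the cost of more comparisons.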
Judging by what you said, it looks like you have a gigantic dataset, so I thought I would throw this out for you to consider.
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1GB of ram. I modified the demo.c file to work for my use case.
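For readers who prefer Python, here is a rough sketch of the same idea using a plain character trie with a mismatch-tolerant search (an assumption-laden simplification, not a port of the linked demo.c):

def build_trie(sequences):
    root = {}
    for seq in sequences:
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node['$'] = seq              # mark the end of a complete sequence
    return root

def search(root, query, max_mismatches):
    # return every stored sequence within max_mismatches substitutions of query
    results = []

    def walk(node, depth, budget):
        if depth == len(query):
            if '$' in node:
                results.append(node['$'])
            return
        for ch, child in node.items():
            if ch == '$':
                continue
            cost = 0 if ch == query[depth] else 1
            if cost <= budget:
                walk(child, depth + 1, budget - cost)

    walk(root, 0, max_mismatches)
    return results

Because every sequence has the same length, the search only ever spends its mismatch budget on substitutions, which makes the traversal exactly a Hamming-distance search.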
I came across a spreadsheet that explains a method to sort both the rows and the columns of a matrix containing binary data so that the number of changes between consecutive rows and columns is minimized.
For example, starting with an initial binary table (shown as an image in the original post), after 15 manual steps described in the tabs of the spreadsheet, a reordered table (also shown there) is obtained.
I would like to know:
what is the common name of this algorithm or method?
how to apply it to a larger table (where 2^n would overflow...)
how to generalize it to non-binary data, for example using Levenshtein distance?
whether there is any link to code (Excel VBA, Python, ...) already implementing this (otherwise I'll write it...)
Thanks!
You can represent each row by a vector L = [1, 1, 0, ... 1], and then define the distance between two lines d(L0, L1) by the number of elements at corresponding positions which are different between L0 and L1. This is known as the binary Hamming distance. If you had non-binary data, you would just extend your definition of distance and yes, Levenshtein distance would be an option.
Once you have distance well-defined, the rest of your problem is minimizing the distance between consecutive rows. This is exactly the Traveling salesman problem, which is known to be NP-hard (http://www.diku.dk/hjemmesider/ansatte/jyrki/Paper/EKP85.pdf).
The direct solution (visiting all permutations) is O(n!), but you can easily do better using dynamic programming, for example the Held–Karp algorithm. There are also approximate algorithms, such as the nearest-neighbour algorithm, which quickly computes a non-optimal solution.
Finally, for implementations you can easily google "traveling salesman excel/python" and find many tutorials and examples.
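As a small illustration (a sketch of the nearest-neighbour heuristic only, on made-up data, not the spreadsheet's exact 15 steps):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbour_order(rows):
    # greedy ordering: start from the first row, then repeatedly append the closest remaining row
    remaining = list(range(len(rows)))
    order = [remaining.pop(0)]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(rows[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return [rows[i] for i in order]

matrix = [[1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 1, 0, 0]]
ordered_rows = nearest_neighbour_order(matrix)
# columns can be reordered the same way: transpose, reorder, transpose back
ordered = [list(row) for row in zip(*nearest_neighbour_order([list(c) for c in zip(*ordered_rows)]))]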
I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a "nearest neighbour" or "nearest match" type algorithm.
I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.
datarows = [(10,2.0,3.4,100),
(11,2.0,5.4,120),
(17,12.9,42,123)]
filter_record = (9,1.9,2.9,99) # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1,1,1,1) # weights to approportion to each field in the filter
def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate a 'distance metric' (e.g. simple differencing) and multiply by the relevant weight
        # determine the row which was either an exact match or was 'least dissimilar'
        # return the match (or nearest match)
        pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print result
For the snippet above, the output should be:
(10,2.0,3.4,100)
since it is the 'nearest' to the sample data passed to the function get_nearest_neighbour().
My question then is: what is the best way to implement get_nearest_neighbour()? For the purpose of brevity etc., assume that we are only dealing with numeric values, and that the 'distance metric' we use is simply an arithmetic subtraction of the input data from the current row.
Simple out-of-the-box solution:
import math

def distance(row_a, row_b, weights):
    # weighted sum of absolute per-field differences
    diffs = [math.fabs(a - b) for a, b in zip(row_a, row_b)]
    return sum(v * w for v, w in zip(diffs, weights))

def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    return min(data, key=sort_func)
If you need to work with huge datasets, you should consider switching to NumPy arrays and using SciPy's KDTree (scipy.spatial.cKDTree) to find nearest neighbours. The advantage is not only the more advanced algorithm, but also that NumPy is implemented on top of highly optimized linear-algebra routines (LAPACK, the Linear Algebra PACKage).
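A hedged sketch of that approach, reusing the sample data from the question (scaling each column by its weight and querying with the Manhattan metric, p=1, are assumptions made to mirror the weighted absolute-difference distance above):

import numpy as np
from scipy.spatial import cKDTree

points = np.array(datarows, dtype=float) * np.array(weights, dtype=float)
tree = cKDTree(points)

query = np.array(filter_record, dtype=float) * np.array(weights, dtype=float)
dist, idx = tree.query(query, p=1)   # p=1 -> sum of absolute differences
print(datarows[idx])                 # -> (10, 2.0, 3.4, 100)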
About naive-NN:
Many of these other answers propose "naive nearest-neighbor", which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it's O(N)-per-query).
While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it if your problem size is no bigger than, for example, any of:
10 queries and 100000 points
100 queries and 10000 points
1000 queries and 1000 points
10000 queries and 100 points
100000 queries and 10 points
Doing better than naive-NN:
Otherwise you will want to use one of the techniques (especially a nearest-neighbor data structure) listed in:
http://en.wikipedia.org/wiki/Nearest_neighbor_search (most likely linked off from that page), some examples linked:
http://en.wikipedia.org/wiki/K-d_tree
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
http://en.wikipedia.org/wiki/Cover_tree
especially if you plan to run your program more than once. There are most likely libraries available. Otherwise, not using a NN data structure would take too much time if the product #queries * #points is large. As user 'dsign' points out in the comments, you can probably squeeze out a large additional constant factor of speed by using the numpy library.
However if you can get away with using the simple-to-implement naive-NN though, you should use it.
Use heapq.nsmallest on a generator calculating the weighted distance for each record (nsmallest rather than nlargest, since the nearest match is the one with the smallest distance).
Something like:
heapq.nsmallest(N, ((row, dist_function(row, criteria, weights)) for row in data), key=operator.itemgetter(1))
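A runnable version of that idea, reusing the weighted absolute-difference distance and the sample data from earlier in this question (the helper name and N=1 are just for illustration):

import heapq
import operator

def weighted_distance(row, criteria, weights):
    return sum(abs(a - b) * w for a, b, w in zip(row, criteria, weights))

scored = ((row, weighted_distance(row, filter_record, weights)) for row in datarows)
closest = heapq.nsmallest(1, scored, key=operator.itemgetter(1))
print(closest[0][0])   # -> (10, 2.0, 3.4, 100)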