Gensim find vectors/words in ball of radius r - python

I would like to take a word, say "book", get its vector representation, call it v_1, and find all words whose vector representation is within a ball of radius r of v_1, i.e. ||v_1 - v_i|| <= r for some real number r.
I know gensim has a most_similar() function, which lets you specify the number of top vectors to return, but that is not quite what I need. I could of course use a brute-force search and get the answer, but it would be too slow.

If you call most_similar() with topn=0, it returns the raw, unsorted cosine similarities to all other words known to the model. (These similarities are not paired with the words in tuples; they are simply in the same order as the words in the index2entity property.)
You can then filter those similarities for the ones above your preferred threshold and return just those indexes/words, using a function like numpy's argwhere.
For example:
import numpy as np

target_word = 'apple'
threshold = 0.9
# Raw cosine similarities to every word in the vocabulary, in index order.
all_sims = wv.most_similar(target_word, topn=0)
# Indexes of all words whose similarity exceeds the threshold.
satisfactory_indexes = np.argwhere(all_sims > threshold).flatten()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]
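If you literally need the Euclidean ball of radius r from the question, rather than a cosine-similarity threshold, a minimal sketch of that variant (assuming the raw vector matrix is exposed as wv.vectors, as in gensim 3.x KeyedVectors) could look like:
import numpy as np

r = 10.0  # example radius; pick a value that suits your vectors' scale
target_vec = wv[target_word]
# Euclidean distance from the target vector to every vector in the model.
distances = np.linalg.norm(wv.vectors - target_vec, axis=1)
indexes_in_ball = np.argwhere(distances <= r).flatten()
words_in_ball = [wv.index2entity[i] for i in indexes_in_ball]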

Related

Gensim: word mover distance with string as input instead of list of string

I'm trying to find out how similar 2 sentences are.
To do this I'm using gensim's word mover distance, and since what I'm trying to find is a similarity, I do the following:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What I give as input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model I'm using is the one you can find on the web: word2vec-google-news-300
I load it with this code:
wv = api.load("word2vec-google-news-300")
It gives me reasonable results.
Here is where the problem starts.
From what I can read in the documentation here, it seems wmdistance takes a list of strings as input, not a plain string like I am passing!
from nltk.corpus import stopwords  # stop_words needs defining; NLTK's English list is one common choice
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When I follow the documentation I get really different results:
wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113
I'm really confused. Why does it work with strings as input, and why does it work better than when I pass what the documentation asks for?
The function needs a list of string tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists of single characters, the value there is going to be a function of how different the letters in the two texts are - and since all English sentences of about the same length have very similar letter distributions, most texts will rate as very similar under that error.
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not to results from different processes (where one of them is essentially random). You shouldn't treat absolute values that exceed some set threshold, or are close to 1.0, as definitively good. You should instead consider relative differences between two similarity/distance values, meaning one pair is more similar/distant than another pair.
Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most similar to -1.0 or 0.0 for least similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad, because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
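For illustration, one common bounded conversion is sim = 1 / (1 + distance); a minimal sketch applied to the tokenized sentences above (the helper name is just for this example):
def wmd_similarity(wv, tokens_a, tokens_b):
    # Distance 0.0 maps to similarity 1.0, and the value decays toward 0.0 as
    # the distance grows, so it can never go negative.
    return 1.0 / (1.0 + wv.wmdistance(tokens_a, tokens_b))

sim = wmd_similarity(wv, sentence_obama, sentence_president)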

Use Word2vec to determine which two words in a group of words is most similar

I am trying to use the Python wrapper around Word2vec. I have a word embedding and a group of words, which can be seen below, and from them I am trying to determine which two words are most similar to each other.
How can I do this?
['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
#rylan-feldspar's answer is generally the correct approach and will work, but you could do this a bit more compactly using standard Python libraries/idioms, especially itertools, a list-comprehension, and sorting functions.
For example, first use combinations() from itertools to generate all pairs of your candidate words:
from itertools import combinations
candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
all_pairs = combinations(candidate_words, 2)
Then, decorate the pairs with their pairwise similarity:
scored_pairs = [(w2v_model.wv.similarity(p[0], p[1]), p)
                for p in all_pairs]
Finally, sort to put the most-similar pair first, and report that score & pair:
sorted_pairs = sorted(scored_pairs, reverse=True)
print(sorted_pairs[0]) # first item is most-similar pair
If you wanted to be compact but a bit less readable, it could be a (long) "1-liner":
print(sorted([(w2v_model.wv.similarity(p[0], p[1]), p)
              for p in combinations(candidate_words, 2)
              ], reverse=True)[0])
Update:
Integrating #ryan-feldspar's suggestion about max(), and going for minimality, this should also work to report the best pair (but not its score):
print(max(combinations(candidate_words, 2),
          key=lambda p: w2v_model.wv.similarity(p[0], p[1])))
Given you're using gensim's word2vec, according to your comment:
Load up or train the model for your embeddings and then, on your model, you can call:
min_distance = float('inf')
min_pair = None
word2vec_model_wv = model.wv  # Unsure if this can be done in the loop, but just to be safe efficiency-wise
for candidate_word1 in words:
    for candidate_word2 in words:
        if candidate_word1 == candidate_word2:
            continue  # ignore when the two words are the same
        distance = word2vec_model_wv.distance(candidate_word1, candidate_word2)
        if distance < min_distance:
            min_pair = (candidate_word1, candidate_word2)
            min_distance = distance
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.distance
Could also be similarity (I'm not entirely sure if there's a difference). https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity
If similarity gets bigger with closer words, as I'd expect, then you'll want to maximize not minimize and just replace the distance function calls with similarity calls. Basically this is just the simple min/max function over the pairs.
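For illustration, a minimal sketch of that similarity-maximizing variant (assuming words and model.wv are defined as above):
max_similarity = float('-inf')
max_pair = None
word2vec_model_wv = model.wv
for candidate_word1 in words:
    for candidate_word2 in words:
        if candidate_word1 == candidate_word2:
            continue  # skip identical words
        similarity = word2vec_model_wv.similarity(candidate_word1, candidate_word2)
        if similarity > max_similarity:
            max_pair = (candidate_word1, candidate_word2)
            max_similarity = similarity
print(max_pair, max_similarity)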

fastText embeddings sentence vectors?

I wanted to understand the way fastText vectors for sentences are created. According to this issue 309, the vectors for sentences are obtained by averaging the vectors for words.
In order to confirm this, I wrote the following script:
import numpy as np
import fastText as ft
# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')
# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)
# Printing the result.
print(is_equal)
# Result: FALSE
But it seems that the obtained vectors are not the same.
Why aren't both values the same? Could it be related to the way I am averaging the vectors? Or maybe there is something I am missing?
First, you missed that get_sentence_vector is not just a simple "average". Before fastText sums the word vectors, each vector is divided by its norm (L2 norm), and the averaging only involves vectors that have a positive L2 norm.
Second, a sentence always ends with an EOS token. So if you try to calculate it manually, you need to include the EOS vector before you compute the average.
try this (I assume the L2 norm of each word is positive):
def l2_norm(x):
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
eos = model.get_word_vector('\n')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
You can see the source code here or you can see the discussion here.
You might be hitting an issue with floating point math - e.g. if one addition was done on a CPU and one on a GPU they could differ.
The best way to check whether it's doing what you want is to make sure the vectors are almost exactly the same.
You might want to print out the two vectors and manually inspect them, or take the dot product of one_two minus one_two_avg with itself (i.e. the squared length of the difference between the two vectors).
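For example, a minimal check along those lines, reusing the arrays from the script above:
diff = one_two - one_two_avg
print(np.dot(diff, diff))  # squared length of the difference
print(np.allclose(one_two, one_two_avg, atol=1e-5))  # True if they differ only by float noise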

Fast way find strings within hamming distance x of each other in a large array of random fixed length strings

I have a large array with millions of DNA sequences, all 24 characters long. The DNA sequences should be random and can only contain A, T, G, C, N. I am trying to find strings that are within a certain Hamming distance of each other.
My first approach was calculating the Hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of each string, store them in a dictionary, and then check whether a variation was seen more than once. This worked pretty fast (20 min) for a Hamming distance of 1, but it is very memory-intensive and would not be viable for a Hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across tries, k-d trees, n-grams and indexing, but I am lost as to which would be the best approach to this problem.
One approach is Locality-Sensitive Hashing (LSH).
First, you should note that this method does not necessarily return all the pairs; it returns all the pairs with high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^24, with N << 5^24, and a maximum Hamming distance d, return all pairs with a Hamming distance of at most d.
The way you'd solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^24, where K is a parameter you'll have to experiment with. For every data point v, you define the hash of v as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0,1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the Hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the data points in sequences, you only check data points in the bucket of "close" hashes.
Note that this is very heuristic, and you will need to experiment with K and with how "close" you want to search from each hash. As K increases, the number of buckets increases exponentially with it, but so does the likelihood that sequences sharing a bucket are genuinely similar.
Judging by what you said, it looks like you have a gigantic dataset, so I thought I would throw this out for you to consider.
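For illustration, here is a rough, self-contained sketch of the bucketing idea described above; the numeric encoding of the letters and the parameter values are assumptions made for the example, not a tuned solution:
import random
from collections import defaultdict

LETTER_CODES = {'A': 0, 'T': 1, 'G': 2, 'C': 3, 'N': 4}  # illustrative encoding
SEQ_LEN = 24
K = 16  # number of random planes; experiment with this

# Normal vectors of K random planes through the origin in R^24.
planes = [[random.gauss(0, 1) for _ in range(SEQ_LEN)] for _ in range(K)]

def encode(seq):
    return [LETTER_CODES[c] for c in seq]

def lsh_hash(vec):
    # One bit per plane: is the point above or below it?
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) > 0)
                 for plane in planes)

def hamming(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def candidate_pairs(sequences, max_distance):
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[lsh_hash(encode(seq))].append(seq)
    # Only compare sequences that landed in the same bucket.
    for bucket in buckets.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if hamming(bucket[i], bucket[j]) <= max_distance:
                    yield bucket[i], bucket[j]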
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1 GB of RAM. I modified the demo.c file to work for my use case.

Matching 2 short descriptions and returning a confidence level

I have some data that I get from the Banks using Yodlee and the corresponding transaction messages on the mobile. Both have some description in them - short descriptions.
For example -
string1 = "tatasky_TPSL MUMBA IND"
string2 = "tatasky_TPSL"
They can be matched if one is completely inside the other. However, some strings like
string1 = "T.G.I Friday's"
string2 = "TGI Friday's MUMBA MAH"
still need to be matched. Is there any algorithm which gives a confidence level for matching 2 descriptions?
You might want to use normalized edit distance, also called Levenshtein distance (see the Levenshtein distance Wikipedia article). After getting the Levenshtein distance between the two strings, you can normalize it by dividing by the length of the longest string (or the average length of the two strings). This normalized score can act as a confidence. You can find some 4-5 Python packages for calculating Levenshtein distance, and you can also try it online with an edit distance calculator.
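For illustration, a minimal self-contained sketch of that normalized-confidence idea (the lowercasing and the longest-string normalization are choices made for this example):
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_confidence(s1, s2):
    # 1.0 means identical, 0.0 means entirely different.
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1.lower(), s2.lower()) / max(len(s1), len(s2))

print(match_confidence("tatasky_TPSL MUMBA IND", "tatasky_TPSL"))
print(match_confidence("T.G.I Friday's", "TGI Friday's MUMBA MAH"))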
Alternatively, one simple solution is the longest common subsequence algorithm, which can also be used here.
