Using gensim most_similar function on a subset of total vocab - python

I am trying to use the gensim word2vec most_similar function in the following way:
wv_from_bin.most_similar(positive=["word_a", "word_b"])
So basically, I have multiple query words and I want to return the most similar outputs, but from a finite set. I.e. if the vocab is 2000 words, then I want to return the most similar from a set of, say, 100 words, and not all 2000.
e.g.
Vocab:
word_a, word_b, word_c, word_d, word_e ... words_z
Finite set:
word_d, word_e, word_f
most_similar on whole vocab
wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]
desired output
finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set
output = ['word_d', 'word_f']

Depending on your specific patterns of use, you have a few options.
If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.
Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)
Using the optional parameter restrict_vocab, with an integer value N, will limit the results to just the first N words in the KeyedVectors (which by usual conventions are those that were most-frequent in the training data). So for example, adding restrict_vocab=10000 to a call against a set-of-vectors with 50000 words will only return the most-similar words from the first 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results - while nearby words from the longer tail of low-frequency words are more likely to seem a little out of place.
Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.
Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.
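For example, a minimal sketch of both parameters, assuming wv_from_bin is your loaded KeyedVectors and a recent gensim (4.x) that supports clip_start/clip_end:
# only consider the 10,000 most-frequent words as candidate results
results = wv_from_bin.most_similar(positive=["word_a", "word_b"], restrict_vocab=10000)
# only consider words in positions 100..999 of the frequency-sorted vocabulary
results = wv_from_bin.most_similar(positive=["word_a", "word_b"], clip_start=100, clip_end=1000)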
If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.
Two options you could consider include:
Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done. (A minimal sketch of this approach appears a little further below.)
Requesting a larger set of results, then filtering your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:
finite_set = set(['word_d', 'word_e', 'word_f'])  # set for efficient 'in'
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                          topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option - but then you'd still have to subset those down to your words-of-interest, then sort yourself. But you'd still be paying the cost of all the vector-similarity calculations for all words, which in typical large-vocabularies is more of the runtime than the sort.
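As a rough sketch of that unsorted-similarities route, reusing the same wv_from_bin & finite_set as above (key_to_index/index_to_key are the gensim 4.x attribute names):
all_sims = wv_from_bin.most_similar(positive=["word_a", "word_b"], topn=None)  # one similarity per vocab word, unsorted
idxs = [wv_from_bin.key_to_index[w] for w in finite_set if w in wv_from_bin.key_to_index]
subset_ranked = sorted(((wv_from_bin.index_to_key[i], all_sims[i]) for i in idxs),
                       key=lambda pair: pair[1], reverse=True)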
If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.
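And going back to the first option above - building a separate KeyedVectors holding only your subset - here is a minimal sketch, assuming gensim 4.x and that the query words' vectors still come from the original wv_from_bin:
from gensim.models import KeyedVectors

finite_set = ['word_d', 'word_e', 'word_f']
sub_kv = KeyedVectors(vector_size=wv_from_bin.vector_size)
sub_kv.add_vectors(finite_set, [wv_from_bin[w] for w in finite_set])

# query with raw vectors, since 'word_a'/'word_b' aren't keys in the subset
results = sub_kv.most_similar(positive=[wv_from_bin['word_a'], wv_from_bin['word_b']])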
Finally, as an aside: if your vocabulary is truly only ~2000 words, you're far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shine. You may be disappointed in results unless you get a lot more data. (And in the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more. Using a smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)
On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.
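As a purely illustrative sketch of such an experiment (every value is a placeholder to sweep, not a recommendation; shrink_windows needs gensim 4.1+, and corpus_of_token_lists is a hypothetical iterable of token lists):
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_of_token_lists,   # e.g. baskets of category-tags or product-names
    vector_size=64,          # smaller sizes can suit smaller vocabs & less data
    window=1000,             # an effectively "giant" window when token order carries no meaning
    shrink_windows=False,    # always use the full window rather than sampled smaller ones
    ns_exponent=0.0,         # far from the 0.75 default; worth sweeping for recommendation-like data
)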

Inconsistent result output with gensim index_to_key

Good afternoon,
first of all thanks to all who take the time to read this.
My problem is this: I would like to have word2vec output the most common words.
I do this with the following command:
# how many words to print out (sorted by frequency)
x = list(model.wv.index_to_key[:2500])
Basically it works, but sometimes I get only 1948 or 2290 words printed out. I can't find any connection with the size of the original corpus (tokens, lines etc.) or deviation from the target value (if I increase the output value to e.g. 3500 it outputs 3207 words).
I would like to understand why this is the case; unfortunately I can't find anything on Google and therefore don't know how to solve the problem. Maybe by increasing the value and later deleting all rows after 2501 using pandas?
If any Python list ranged-access, like my_list[:n], returns fewer than n items, then the original list my_list had fewer than n items in it.
So, if model.wv.index_to_key[:2500] is only returning a list of length 1948, then I'm pretty sure that if you check len(model.wv.index_to_key), you'll see the source list is only 1948 items long. And of course, you can't take the first 2500 items from a list that's only 1948 items long!
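A quick diagnostic, assuming model is your trained Word2Vec instance:
print(len(model.wv.index_to_key))   # the real surviving-vocabulary size
x = model.wv.index_to_key[:2500]    # can never be longer than the list above
print(len(x))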
Why might your model have fewer unique words than you expect, or even that you counted via other methods?
Something might be amiss in your preprocessing/tokenization, but most likely is that you're not considering the effect of the default min_count=5 parameter. That default causes all words that appear fewer than 5 times to be ignored during training, as if they weren't even in the source texts.
You may be tempted to use min_count=1, to keep all words, but that's almost always a bad idea in word2vec training. Word2vec needs many subtly-contrasting alternate uses of a word to train a good word-vector.
Keeping words which only have one, or a few, usage examples winds up not only failing to get good generalizable vectors for those rare words, but also interfering with the full learning of vectors for other nearby, more-frequent words – their training now has to fight the noise & extra model-cycles from the insufficiently-represented rare words.
Instead of lowering min_count, it's better to get more data, or live with a smaller final vocabulary.

How to run Fasttext get_nearest_neighbors() faster?

I'm trying to extract morphs/similar words in Sinhala language using Fasttext.
But FastText takes about 1 second for every 2.64 words. How can I increase the speed without changing the model size?
My code looks like this:
import fasttext
import fasttext.util

fasttext.util.download_model('si', if_exists='ignore')  # Sinhala
ft = fasttext.load_model('cc.si.300.bin')

words_file = open(r'/Datasets/si_words_filtered.txt')
words = words_file.readlines()
words = words[0:300]

synon_dict = dict()

from tqdm import tqdm_notebook
for i in tqdm_notebook(range(len(words))):
    word = words[i].strip()
    synon = ft.get_nearest_neighbors(word)[0][1]  ### takes a lot of time
    if is_strictly_sinhala_word(synon):
        synon_dict[word] = synon

import json
with open("out.json", "w", encoding='utf8') as f:
    json.dump(synon_dict, f, ensure_ascii=False)
To do a fully accurate get_nearest_neighbors()-type of calculation is inherently fairly expensive, requiring a lookup & calculation against every word in the set, for each new word.
As it looks like that set of vectors is near or beyond 2GB in size, when just the word-vectors are loaded, that means a scan of 2GB of addressable memory may be the dominant factor in the runtime.
Some things to try that might help:
Ensure that you have plenty of RAM - if there's any use of 'swap'/virtual-memory, that will make things far slower.
Avoid all unnecessary comparisons - for example, perform your is_strictly_sinhala_word() check before the expensive step, so you can skip the costly step when you're not interested in the results. (A sketch of that reordering appears after this list.) Also, you could consider shrinking the full set of word-vectors to eliminate those that you are unlikely to want as responses. This might involve throwing out words you know are not of the language-of-interest, or all lower-frequency words. (If you can throw out half the words as possible nearest-neighbors before even trying get_nearest_neighbors(), it will go roughly twice as fast.) More on these options below.
Try other word-vector libraries, to see if they offer any improvement. For example, the Python Gensim project can load either plain sets of full-word vectors (eg, the cc.si.300.vec words-only file) or FastText models (the .bin file), and offers a .most_similar() function that has some extra options & might, in some cases, offer different performance. (Though, the official Facebook Fasttext .get_nearest_neighbors() is probably pretty good.)
Use an "approximate nearest neighbors" library to pre-build an index of the word-vector space that can then offer extra-fast nearest-neighbor lookups - although at some risk of not finding the exact right top-N neighbors. There are many such libraries – see this benchmarking project that compares over 20 of them. But, adding this step complicates things & the tradeoff of that complexity & the imperfect result may not be worth the effort & time-savings. So, just remember that it's a possibility if your need s large enough & nothing else helps.
With regard to slimming the set of vectors searched:
The Gensim KeyedVectors.load_word2vec_format() function, which can load the .vec words-only file, has an option limit that will only read the specified number of words from the file. It looks like the .vec file for your dataset has over 800k words - but if you chose to load only 400k, your .most_similar() calculations would go about twice as fast. (And, since such files typically front-load the files with the most-common words, the loss of the far-rarer words may not be a concern.)
Similarly, even if you load all the vectors, the Gensim .most_similar() function has a restrict_vocab option that can limit searches to just that number of leading (most-frequent) words, which could also speed things up or helpfully drop obscure words that may be of less interest. (A short Gensim sketch follows below.)
The .vec file may be easier to work with if you wanted to pre-filter the words to, for example, eliminate non-Sinhala words. (Note: the usual .load_word2vec_format() text format needs a 1st line that declares the count of words & word-dimensionality, but you may leave that off, then load using the no_header=True option, which instead uses 2 full passes over the file to get the count.)
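For example, a minimal Gensim-based sketch of those ideas (the file name, counts & query word are just placeholders):
from gensim.models import KeyedVectors

# load only the 400k most-frequent of the 800k+ vectors in the words-only file
kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', limit=400000)

# and/or restrict each search to the first 100k words
neighbors = kv.most_similar('some_sinhala_word', topn=10, restrict_vocab=100000)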

Python3 - Doc2Vec: Get document by vector/ID

I've already built my Doc2Vec model, using around 20,000 files. I'm looking for a way to find the string representation of a given vector/ID, which might be similar to Word2Vec's index2entity. I'm able to get the vector itself, using model['n'], but now I'm wondering whether there's a way to get some sort of string representation of it as well.
If you want to look up your actual training text, for a given text+tag that was part of training, you should retain that mapping outside the Doc2Vec model. (The model doesn't store training texts – it only looks at them, repeatedly, during training.)
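For example, a minimal sketch of keeping such a mapping yourself while preparing the training corpus (my_tokenized_docs is a hypothetical list of token-lists; model.dv is the gensim 4.x name for the doc-vectors):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts_by_tag = {}
corpus = []
for i, tokens in enumerate(my_tokenized_docs):
    tag = str(i)
    texts_by_tag[tag] = tokens                  # the lookup the model itself won't provide
    corpus.append(TaggedDocument(words=tokens, tags=[tag]))

model = Doc2Vec(corpus, vector_size=100, epochs=20)
# later: model.dv[tag] gives the doc-vector, texts_by_tag[tag] gives the original text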
If you want to generate a text from a Doc2Vec doc-vector, that's not an existing feature, nor do I know any published work describing a reliable technique for doing so.
There's a speculative/experimental bit of work-in-progress for gensim Doc2Vec that will forward-propagate a doc-vector through the model's neural-network, and report back the most-highly-predicted target words. (This is somewhat the opposite of the way infer_vector() works.)
That might, plausibly, give a sort-of summary text. For more details see this open issue & the attached PR-in-progress:
https://github.com/RaRe-Technologies/gensim/issues/2459
Whether this is truly useful or likely to become part of gensim is still unclear.
However, note that such a set-of-words wouldn't be grammatical. (It'll just be the ranked-list of most-predicted words. Perhaps some other subsystem could try to string those words together in a natural, grammatical way.)
Also, the subtleties of whether a concept has many potential associates words, or just one, could greatly affect the "top N" results of such a process. Contriving a possible example: there are many words for describing a 'cold' environment. As a result, a doc-vector for a text about something cold might have lots of near-synonyms for 'cold' in the 11th-20th ranked positions – such that the "total likelihood" of at least one cold-ish word is very high, maybe higher than any one other word. But just looking at the top-10 most-predicted words might instead list other "purer" words whose likelihood isn't so divided, and miss the (more-important-overall) sense of "coldness". So, this experimental pseudo-summarization method might benefit from a second-pass that somehow "coalesces" groups-of-related-words into their most-representative words, until some overall proportion (rather than fixed top-N) of the doc-vector's predicted-words are communicated. (This process might be vaguely like finding a set of M words whose "Word Mover's Distance" to the full set of predicted-words is minimized – though that could be a very expensive search.)

How to dynamically assign the right "size" for Word2Vec?

The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
{kfhahf}
Lfhslnf;
.
.
.
Row 2
(stdgff ksshu, hsihf)
asgasf;
.
.
.
Etc.
Based on this post (Python: What is the "size" parameter in Gensim Word2vec model class), the size parameter should be less than (or equal to?) the vocabulary size. So, I am trying to dynamically assign the size as follows:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# I do Word2Vec for each row
for item in dataset:
    Tokenized = word_tokenize(item)
    model = Word2Vec([Tokenized], min_count=1)
I get the vocabulary size here. So I create a second model:
model1 = Word2Vec([Tokenized], min_count=1, size=len(model.wv.vocab))
This sets the size value to the current vocab value of the current row, as I intended. But is it the right way to do? What is the right size for a small vocabulary text?
There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
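A bare-bones sketch of such a search, where score_vectors() stands in for whatever task-specific scoring you devise (both it and tokenized_corpus are hypothetical, and vector_size is the gensim 4.x name for what older versions called size):
from gensim.models import Word2Vec

best = None
for candidate_size in (25, 50, 100, 200, 400):
    model = Word2Vec(tokenized_corpus, vector_size=candidate_size, min_count=5)
    score = score_vectors(model.wv)    # your own repeatable evaluation
    if best is None or score > best[1]:
        best = (candidate_size, score)
print(best)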
Regarding your code-as-shown:
there's unlikely to be any point in computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items (as sketched after this list).
it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
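Putting those two points together, a minimal sketch of a single combined-corpus model, reusing dataset from the question (exact parameter values are placeholders):
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

tokenized_corpus = [word_tokenize(item) for item in dataset]
unique_tokens = set(tok for doc in tokenized_corpus for tok in doc)
print(len(unique_tokens))   # corpus-wide unique-token count, if you want it

model = Word2Vec(tokenized_corpus, vector_size=100, min_count=5)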

Filtering Word Embeddings from word2vec

I have downloaded Google's pretrained word embeddings as a binary file here (GoogleNews-vectors-negative300.bin.gz). I want to be able to filter the embedding based on some vocabulary.
I first tried loading the bin file as a KeyedVector object, and then creating a dictionary that uses its vocabulary along with another vocabulary as a filter. However, it takes a long time.
from gensim.models import KeyedVectors

# X is the vocabulary we are interested in
embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
embeddings_filtered = dict((k, embeddings[k]) for k in X if k in list(embeddings.wv.vocab.keys()))
It takes a very long time to run. I am not sure if this is the most efficient solution. Should I filter it out in the load_word2vec_format step first?
Your dict won't have all the features of a KeyedVectors object, and it won't be stored as compactly. The KeyedVectors stores all vectors in a large contiguous native 2D array, with a dict indicating the row for each word's vector. Your second dict, with a separate vector for each word, will involve more overhead. (And further, the vectors you get back from embeddings[k] will be "views" into the full array – so your subset may actually indirectly retain the larger array, even after you try to discard the KeyedVectors.)
Since it's likely that a reason you only want a subset of the original vectors is that the original set was too large, having a dict that takes as much or more memory probably isn't ideal.
You should consider two options:
load_word2vec_format() includes an optional limit parameter that only loads the first N words from the supplied file. As such files are typically sorted from most-frequent to least-frequent words, and the less-frequent words are both far less useful and of lower vector quality, it is often practical to just use the first 1 million, or 500,000, or 100,000, etc. entries for a large memory & speed savings. (A short sketch appears below.)
You could try filtering on load. You'd need to adapt the loading code to do this. Fortunately, you can review the full source code for load_word2vec_format() (it's just a few dozen lines) inside your local gensim installation, or online at the project's source-code hosting at:
https://github.com/RaRe-Technologies/gensim/blob/9c5215afe3bc4edba7dde565b6f2db982bba5113/gensim/models/utils_any2vec.py#L123
You'd write your own version of this routine that skips words not of interest. (It might have to do two passes over the file, one to count the words of interest, then a second to actually allocate the right-sized in-memory arrays and do the real reading.)
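A minimal sketch of the first option, keeping the names from the question (the limit count is just an example; the .copy() remark relates to the "views" point above):
from gensim.models import KeyedVectors

# keep only the 500k most-frequent of the ~3M GoogleNews vectors
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)

# if you still build a plain dict for some subset X, copying each vector avoids
# retaining views into the full array
embeddings_filtered = {k: embeddings[k].copy() for k in X if k in embeddings}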
