Gensim word2vec - start vocabulary from index different than 0 - python

I am using gensim to create word vectors based on my corpus like the following:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
I was wondering if it is possible to start (or somehow avoid having) words at index 0 and 1? I would like my vocabulary to start at index 2, because I need to do other operations and if I keep 0 and 1 as indexes it gets a little confusing.
Thanks for the help!

It's not a native feature of Word2Vec.
This is probably not a good idea, but you could crudely fake it by creating two dummy words with very high frequency, and adding examples containing them to your training data in a way that has minimal impact on the other word-vectors.
For example, if the most-common word in your corpus occurs 5,000 times, create a fake text with just the words 'dummy000000000' and 'dummy000000001' in it, repeated 1,000 times each. Add this fake text to your corpus 6 times. Then, 'dummy000000000' and 'dummy000000001' will be the two most-frequent words in the corpus, and thus get indexes 0 and 1 (in the usual case). Their training will waste time, and the model will waste a little bit of its potential state giving those words crude vectors, but they should have a minimal effect on other words (since they never co-occur with real words). Voila, you've got 0 and 1 indexes you can ignore (or treat as errors) later!
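If you really wanted to try it, a rough sketch of the hack just described (assuming sentences is your real corpus as a list of token-lists, and the most-common real word appears about 5,000 times, as in the example):

# Crude sketch of the dummy-word hack described above -- not recommended.
fake_text = ['dummy000000000', 'dummy000000001'] * 1000   # 1,000 occurrences of each
padded_corpus = sentences + [fake_text] * 6               # ~6,000 occurrences of each

model = Word2Vec(padded_corpus, size=100, window=5, min_count=5, workers=4)
# 'dummy000000000' & 'dummy000000001' should now be the most-frequent words,
# taking indexes 0 and 1 and leaving all real words at index 2 or higher.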
But having written it out, it's pretty definitely a bad idea. It'll slow and worsen the model slightly. Various progress/tally statistics from the model will be subtly misleading.
And, having such indexes start at 0 is very typical professional programming practice. If you find it confusing, in general or for your specific project, that may be a habit/understanding barrier that it's better to work-through than try to patch-around with non-standard practice.

Related

Why Word2Vec function returns me a lot of 0.99 values

I'm trying to apply a word2vec model on a review dataset. First of all I apply the preprocessing to the dataset:
df=df.text.apply(gensim.utils.simple_preprocess)
and this is the dataset that I get:
0 [understand, location, low, score, look, mcdon...
3 [listen, it, morning, tired, maybe, hangry, ma...
6 [super, cool, bathroom, door, open, foot, nugg...
19 [cant, find, better, mcdonalds, know, getting,...
27 [night, went, mcdonalds, best, mcdonalds, expe...
...
1677 [mcdonalds, app, order, arrived, line, drive, ...
1693 [correct, order, filled, promptly, expecting, ...
1694 [wow, fantastic, eatery, high, quality, ive, e...
1704 [let, tell, eat, lot, mcchickens, best, ive, m...
1716 [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object
Now I create the Word2Vec model and train it:
model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))
What I don't understand is why the function most_similar() returns so many similarity values around 0.99.
[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]
What am I doing wrong?
You're right, that's not normal.
It is unlikely that your df is the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.
Try displaying next(iter(df)) to see the first item in df as Word2Vec would see it when iterating. Does it look like a good piece of training data?
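For example, a quick check along those lines (a sketch, assuming df is the object you pass as sentences):

first_item = next(iter(df))
print(type(first_item), first_item[:10])   # should be a plain list of string tokens

# If it isn't, a plain list of token-lists is a safe format to pass instead:
corpus = df.tolist() if hasattr(df, 'tolist') else list(df)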
Separately regarding your code:
min_count=1 is always a bad idea with Word2Vec - rare words can't get good vectors, but do, in aggregate, act a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure doing so helps your results, because you've actually compared its effects against lower values. And if it seems like too much of your vocabulary disappears because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm.
Only 283 texts are unlikely to be enough training data unless each text has tens of thousands of tokens. (And even if it were possible to squeeze some results from this far-smaller-than-ideal corpus, you might need to shrink the vector_size and/or increase the epochs to get the most out of minimal data.)
If you supply a corpus to sentences in the Word2Vec() construction, you don't need to call .train(). It will have already automatically used that corpus fully as part of the constructor. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction-time.)
I highly recommend you enable logging to at least the INFO level for the relevant classes (either all Gensim or just Word2Vec). Then you'll see useful logging/progress info which, if you read it over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
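Putting those points together, a minimal sketch of the revised flow (assuming corpus is a plain list of token-lists, e.g. df.tolist()):

import logging
import gensim

# INFO-level logging makes gensim report vocab-building & training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

model = gensim.models.Word2Vec(
    sentences=corpus,
    vector_size=100,   # smaller than 200, given the tiny corpus
    window=10,
    min_count=5,       # the default; discards words with fewer than 5 uses
    workers=6,
)
# No separate .train() call -- the constructor already trained for model.epochs epochs.
print(model.wv.most_similar("service", topn=10))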
According to the official doc:
Find the top-N most similar words. ...
This method computes cosine similarity between a simple mean of the projection weight
vectors of the given words and the vectors for each word in the model. The method
corresponds to the word-analogy and distance scripts in the original word2vec
implementation. ...
Since you pass this df as the sentences parameter, gensim just computes the analogy/distance relationships among the words across the different sentences (dataframe rows). I'm not sure whether your dataframe contains "service"; if it does, the result words are simply the words whose vectors end up closest to "service" across those sentences.

Using gensim most_similar function on a subset of total vocab

I am trying to use the gensim word2vec most_similar function in the following way:
wv_from_bin.most_similar(positive=["word_a", "word_b"])
So basically, I have multiple query words and I want to return the most similar outputs, but from a finite set. I.e. if the vocab is 2000 words, then I want to return the most similar from a set of, say, 100 words, and not all 2000.
e.g.
Vocab:
word_a, word_b, word_c, word_d, word_e ... words_z
Finite set:
word_d, word_e, word_f
most_similar on whole vocab
wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]
desired output
finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set
output = ['word_d', 'word_f']
Depending on your specific patterns of use, you have a few options.
If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.
Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)
Using the optional parameter restrict_vocab, with an integer value N, will limit the results to just the first N words in the KeyedVectors (which by usual conventions are those that were most-frequent in the training data). So, for example, adding restrict_vocab=10000 to a call against a set-of-vectors with 50000 words will only return the most-similar words from the first 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results - while nearby words from the longer tail of low-frequency words are more likely to seem a little out of place.
Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.
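For example (a sketch, assuming wv_from_bin is a loaded KeyedVectors with its usual frequency-ordered vocabulary):

# only consider the 10,000 most-frequent words
top_results = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                        restrict_vocab=10000)

# only consider words ranked 100..999 by frequency
mid_results = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                        clip_start=100, clip_end=1000)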
Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.
If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.
Two options you could consider include:
Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done. (A rough sketch appears further below, after the other option's example.)
Requesting a larger set of results, then filtering down to your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:
finite_set = set(['word_d', 'word_e', 'word_f'])  # set for efficient 'in' checks
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                           topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option - but then you'd still have to subset those down to your words-of-interest, then sort yourself. But you'd still be paying the cost of all the vector-similarity calculations for all words, which in typical large-vocabularies is more of the runtime than the sort.
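A sketch of that topn=None variant (same wv_from_bin & finite_set assumptions as above):

# unsorted array of similarities, one per word, in index_to_key order
all_sims = wv_from_bin.most_similar(positive=["word_a", "word_b"], topn=None)

subset_sims = [(word, all_sims[wv_from_bin.key_to_index[word]])
               for word in finite_set]
subset_sims.sort(key=lambda pair: pair[1], reverse=True)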
If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.
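Circling back to the first option (a separate KeyedVectors holding only your subset), a rough sketch, assuming a gensim-4.x KeyedVectors:

from gensim.models import KeyedVectors

subset_kv = KeyedVectors(vector_size=wv_from_bin.vector_size)
subset_kv.add_vectors(list(finite_set), [wv_from_bin[word] for word in finite_set])

# query with raw vectors from the full set, since the query words themselves
# may not be part of the small subset
results = subset_kv.most_similar(positive=[wv_from_bin["word_a"], wv_from_bin["word_b"]])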
Finally, as an aside: if your vocabulary is truly only ~2000 words, you're far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shines. You may be disappointed in results unless you get a lot more data. And in the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more. (Using a smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)
On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.

Use word2vec to expand a glossary in order to classify texts

I have a database containing about 3 million texts (tweets). I put clean texts (removing stop words, tags...) in a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I have also a list of words (belonging to the same topic, in this case: economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of total tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like that:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?
There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works there.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
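For example, a naive sketch of that approach (assuming sentences is the list of token-lists, terms the seed glossary, and results the ranked (word, similarity) pairs from most_similar()):

target = 0.14                      # e.g. aim to cover 14% of all texts
glossary = set(terms)

def coverage(glossary, sentences):
    # fraction of texts containing at least one glossary word
    return sum(1 for toks in sentences if glossary & set(toks)) / len(sentences)

for word, _sim in results:
    if coverage(glossary, sentences) >= target:
        break
    glossary.add(word)
# Note: re-counting over 3 million texts per added word is slow; a real version
# would pre-index which texts each candidate word appears in.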
Note that when you retrieve model.most_similar(terms), you are asking the model to first average all words in terms together, then return words close to that one average point. To the extent your seed set of terms is tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close synonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
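A sketch of that per-term approach, using model.wv.most_similar (where the method lives in current gensim) and the same model & terms as in the question:

candidates = {}
for term in terms:
    if term not in model.wv:                       # skip seed words the model dropped
        continue
    for word, sim in model.wv.most_similar(term, topn=3):
        # keep the best similarity seen for each candidate word
        if sim > candidates.get(word, 0.0):
            candidates[word] = sim

ranked = sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)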
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
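A small sketch of that review step (assuming sentences, the current terms glossary, and a hypothetical candidate word candidate_word; all names here are just illustrative):

current = set(terms)
newly_matched = [toks for toks in sentences
                 if candidate_word in toks and not (current & set(toks))]

print(len(newly_matched), "texts would be newly pulled in")
for toks in newly_matched[:20]:    # hand-review a sample
    print(' '.join(toks))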
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a text binary classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extraction, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process which might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
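As one illustration only (nothing in the question implies any particular library), a bare-bones version of such a classifier could use scikit-learn, assuming texts is a list of raw tweet strings and labels a parallel list of 0/1 'economics' labels, however bootstrapped:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

clf = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))     # rough accuracy on held-out texts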
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
Separately: min_count=1 is almost always a mistake in word2vec & related algorithms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.

Inconsistent result output with gensim index_to_key

Good afternoon,
first of all thanks to all who take the time to read this.
My problem is this: I would like to have word2vec output the most common words.
I do this with the following command:
#how many words to print out ( sort at frequency)
x = list(model.wv.index_to_key[:2500])
Basically it works, but sometimes I get only 1948 or 2290 words printed out. I can't find any connection with the size of the original corpus (tokens, lines etc.) or deviation from the target value (if I increase the output value to e.g. 3500 it outputs 3207 words).
I would like to understand why this is the case; unfortunately I can't find anything on Google, so I don't know how to solve the problem. Maybe by increasing the value and later deleting all rows after 2501 using pandas?
If any Python list ranged-access, like my_list[:n], returns fewer than n items, then the original list my_list had fewer than n items in it.
So, if model.wv.index_to_key[:2500] is only returning a list of length 1948, then I'm pretty sure if you check len(model.wv.index_to_key), you'll see the source list is only 1948 items long. And of course, you can't take the 1st 2500 items from a list that's only 1948 items long!
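For example (assuming model is your trained Word2Vec model):

print(len(model.wv.index_to_key))    # total number of words the model actually kept
x = model.wv.index_to_key[:2500]     # can never be longer than that total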
Why might your model have fewer unique words than you expect, or even that you counted via other methods?
Something might be amiss in your preprocessing/tokenization, but most likely is that you're not considering the effect of the default min_count=5 parameter. That default causes all words that appear fewer than 5 times to be ignored during training, as if they weren't even in the source texts.
You may be tempted to use min_count=1, to keep all words, but that's almost always a bad idea in word2vec training. Word2vec needs many subtly-contrasting alternate uses of a word to train a good word-vector.
Keeping words which only have one, or a few, usage examples winds up not only failing to get good generalizable vectors for those rare words, but also interfering with the full learning of vectors for other, more-frequent nearby words – since their training now has to fight the noise & extra model cycles spent on the insufficiently-represented rare words.
Instead of lowering min_count, it's better to get more data, or live with a smaller final vocabulary.

How to dynamically assign the right "size" for Word2Vec?

The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
{kfhahf}
Lfhslnf;
.
.
.
Row 2
(stdgff ksshu, hsihf)
asgasf;
.
.
.
Etc.
Based on this post (Python: What is the "size" parameter in Gensim Word2vec model class), the size parameter should be less than (or equal to?) the vocabulary size. So, I am trying to dynamically assign the size as follows:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# I do Word2Vec for each row
for item in dataset:
    Tokenized = word_tokenize(item)
    model = Word2Vec([Tokenized], min_count=1)
I get the vocabulary size here. So I create a second model:
model1 = Word2Vec([Tokenized], min_count=1, size=len(model.wv.vocab))
This sets the size value to the vocabulary size of the current row, as I intended. But is this the right way to do it? What is the right size for a small-vocabulary text?
There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
Regarding your code-as-shown:
there's unlikely to be any point in computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items (see the sketch after these notes).
it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
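A sketch combining those two notes (keeping the size parameter name from the question's gensim version; variable names are just illustrative):

from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

tokenized_items = [word_tokenize(item) for item in dataset]

for tokens in tokenized_items:
    print(len(set(tokens)))              # unique-token count for this one item

# one model trained over the whole combined corpus, keeping the default min_count
model = Word2Vec(tokenized_items, size=100, min_count=5)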
