Good afternoon,
First of all, thanks to everyone who takes the time to read this.
My problem is this: I would like a word2vec model to output its most common words.
I do this with the following command:
#how many words to print out ( sort at frequency)
x = list(model.wv.index_to_key[:2500])
Basically it works, but sometimes I get only 1948 or 2290 words printed out. I can't find any connection with the size of the original corpus (tokens, lines etc.) or deviation from the target value (if I increase the output value to e.g. 3500 it outputs 3207 words).
I would like to understand why this is the case. Unfortunately I can't find anything on Google, so I don't know how to solve the problem. Maybe I could increase the value and later delete all rows after 2501 using pandas?
If any Python list ranged-access, like my_list[:n], returns less than n items, then the original list my_list had less than n items in it.
So, if model.wv.index_to_key[:2500] is only returning a list of length 1948, then I'm pretty sure if you check len(model.wv.index_to_key), you'll see the source list is only 1948 items long. And of course, you can't take the 1st 2500 items from a list that's only 1948 items long!
Why might your model have fewer unique words than you expect, or even that you counted via other methods?
Something might be amiss in your preprocessing/tokenization, but most likely is that you're not considering the effect of the default min_count=5 parameter. That default causes all words that appear fewer than 5 times to be ignored during training, as if they weren't even in the source texts.
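To make both points concrete, here is a minimal sketch in plain Python. It does not use gensim; the toy corpus and the helper name surviving_vocab are made up for illustration, and the helper only mimics the effect of min_count described above.

```python
from collections import Counter

# Toy corpus: a list of tokenized sentences (illustrative data only).
sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
    ["a", "bird", "sang"],
]

def surviving_vocab(sentences, min_count):
    """Words seen at least min_count times, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [word for word, n in counts.most_common() if n >= min_count]

vocab = surviving_vocab(sentences, min_count=2)

# A slice never pads its result: it returns at most len(vocab) items.
top_words = vocab[:2500]

print(len(vocab), len(top_words))
```

Eight distinct words appear in the toy corpus, but only three of them occur at least twice, so asking for the first 2500 returns just those three.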
You may be tempted to use min_count=1, to keep all words, but that's almost always a bad idea in word2vec training. Word2vec needs many subtly-contrasting alternate uses of a word to train a good word-vector.
Keeping words which have only one, or a few, usage examples not only fails to get good generalizable vectors for those rare words, but also interferes with the full learning of vectors for other nearby more-frequent words – now that their training has to fight the noise & extra model-cycles from the insufficiently-represented rare words.
Instead of lowering min_count, it's better to get more data, or live with a smaller final vocabulary.
I am trying to use the gensim word2vec most_similar function in the following way:
wv_from_bin.most_similar(positive=["word_a", "word_b"])
So basically, I have multiple query words and I want to return the most similar outputs, but from a finite set. I.e., if the vocab is 2000 words, then I want to return the most similar from a set of, say, 100 words, and not all 2000.
e.g.
Vocab:
word_a, word_b, word_c, word_d, word_e ... words_z
Finite set:
word_d, word_e, word_f
most_similar on whole vocab
wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]
desired output
finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set
output = ['word_d', 'word_f']
Depending on your specific patterns of use, you have a few options.
If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.
Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)
Using the optional parameter restrict_vocab, with an integer value N, will limit the results to just the first N words in the KeyedVectors (which by usual conventions are those that were most-frequent in the training data). So for example, adding restrict_vocab=10000 to a call against a set-of-vectors with 50000 words will only return the most-similar words from the 1st 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results - while nearby words from the longer-tail of low-frequency words are more likely to seem a little out of place.
Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.
Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.
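As a rough illustration of which words each option leaves in play, the following plain-Python sketch models only the slicing described above. It is not gensim's implementation: index_to_key here is a made-up stand-in list, candidate_range is a hypothetical helper, and letting restrict_vocab take precedence over the clip values is an assumption.

```python
# Hypothetical stand-in for a model's index_to_key list (most frequent first).
index_to_key = [f"word_{i}" for i in range(50000)]

def candidate_range(keys, restrict_vocab=None, clip_start=0, clip_end=None):
    """Return the contiguous slice of keys that results may be drawn from."""
    if restrict_vocab is not None:
        # Assumption: restrict_vocab=N behaves like clip_start=0, clip_end=N.
        clip_start, clip_end = 0, restrict_vocab
    return keys[clip_start:clip_end]

first_10k = candidate_range(index_to_key, restrict_vocab=10000)
mid_range = candidate_range(index_to_key, clip_start=100, clip_end=1000)

print(len(first_10k), len(mid_range), mid_range[0])
```

The second call skips the 100 most frequent entries and keeps the 900 that follow, matching the clip_start=100, clip_end=1000 example.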
If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.
Two options you could consider include:
Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done.
Requesting a larger set of results, then filtering your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:
finite_set = set(['word_d', 'word_e', 'word_f']) # set for efficient 'in'
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                          topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option - but then you'd still have to subset those down to your words-of-interest, then sort yourself. But you'd still be paying the cost of all the vector-similarity calculations for all words, which in typical large-vocabularies is more of the runtime than the sort.
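A sketch of that subset-then-sort step, in plain Python with made-up numbers. It assumes the unsorted result holds one similarity per word, in the same order as index_to_key; both lists below are illustrative stand-ins rather than real model output.

```python
# Illustrative stand-ins: keys in model order, and one similarity per key
# in that same order (the shape assumed for the topn=None result).
index_to_key = ["word_c", "word_d", "word_e", "word_f", "word_g"]
all_sims = [0.41, 0.83, 0.12, 0.77, 0.69]

finite_set = {"word_d", "word_e", "word_f"}  # set for efficient 'in'

# Keep only the words of interest, then sort them by similarity yourself.
subset = [(word, sim) for word, sim in zip(index_to_key, all_sims)
          if word in finite_set]
subset.sort(key=lambda pair: pair[1], reverse=True)

print(subset)
```

Note that an unsorted result of this kind would also include the query words themselves, so you may want to drop those before ranking.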
If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.
Finally, as an aside: if your vocabulary is truly only ~2000 words, you're far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shine. You may be disappointed in results unless you get a lot more data. (And in the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more. Using a smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)
On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.
I have a database containing about 3 million texts (tweets). I put clean texts (removing stop words, tags...) in a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I have also a list of words (belonging to the same topic, in this case: economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of total tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like that:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?
There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works there.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
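One way to sketch that idea in plain Python: walk down the ranked candidates, adding words until the share of matching texts reaches the target. The helper name expand_to_target and the toy texts are made up for illustration, and on 3 million tweets you would want something more efficient than recounting after every word.

```python
def expand_to_target(texts, seed_terms, ranked_candidates, target_share):
    """Add candidates in rank order until target_share of texts match."""
    glossary = set(seed_terms)
    token_sets = [set(text) for text in texts]

    def share():
        hits = sum(1 for tokens in token_sets if tokens & glossary)
        return hits / len(token_sets)

    for word in ranked_candidates:
        if share() >= target_share:
            break
        glossary.add(word)
    return glossary, share()

# Toy data (illustrative only): 1 of 5 texts matches the seed term.
texts = [
    ["inflation", "rises"],
    ["bank", "cuts", "rates"],
    ["football", "final"],
    ["rates", "and", "markets"],
    ["weather", "today"],
]
ranked = ["rates", "tax", "weather"]  # pretend most_similar output, best first

glossary, reached = expand_to_target(texts, ["inflation"], ranked, 0.6)
print(sorted(glossary), reached)
```

Here adding "rates" alone lifts coverage from 1 of 5 texts to 3 of 5, so the loop stops before considering "tax" or "weather".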
Note that when you retrieve model.most_similar(terms), you are asking the model to 1st average all words in terms together, then return words close to that one average point. To the extent your seed set of terms is tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close synonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
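The per-term idea can be sketched without gensim, using tiny made-up 2-D vectors and a hand-written cosine similarity. With a real model you would presumably make one most_similar() call per term instead; the vectors and the helper neighbors_per_term below are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbors_per_term(vectors, terms, k):
    """Union of the k nearest neighbors of each seed term, taken separately."""
    found = set()
    for term in terms:
        ranked = sorted(
            (word for word in vectors if word not in terms),
            key=lambda word: cosine(vectors[term], vectors[word]),
            reverse=True,
        )
        found.update(ranked[:k])
    return found

# Made-up vectors: two separate 'lumps' of economics vocabulary plus noise.
vectors = {
    "inflation": (1.0, 0.1),
    "prices": (0.9, 0.2),
    "stocks": (0.1, 1.0),
    "shares": (0.2, 0.9),
    "weather": (-1.0, -0.2),
}

expanded = neighbors_per_term(vectors, ["inflation", "stocks"], k=1)
print(sorted(expanded))
```

Each seed term contributes its own nearest neighbor, so both lumps are represented even though they point in quite different directions.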
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
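A rough sketch of that check, in plain Python. The is_relevant function stands in for your own manual judgement of a full text, and the helper names, toy texts and thresholds are all invented for illustration.

```python
def newly_matched(word, texts, glossary):
    """Texts that contain `word` but no term already in the glossary."""
    return [text for text in texts
            if word in text and not glossary.intersection(text)]

def worth_adding(word, texts, glossary, is_relevant, threshold):
    """True if enough of the texts that `word` newly brings in are relevant."""
    new_texts = newly_matched(word, texts, glossary)
    if not new_texts:
        return False
    relevant = sum(1 for text in new_texts if is_relevant(text))
    return relevant / len(new_texts) >= threshold

# Toy data: 'rates' brings in three new texts, one of them off-topic.
glossary = {"inflation"}
texts = [
    ["inflation", "rates"],
    ["bank", "rates"],
    ["exchange", "rates"],
    ["heart", "rates"],
]

def is_relevant(text):
    # Stand-in for a human reviewer's verdict on the whole text.
    return "heart" not in text

print(worth_adding("rates", texts, glossary, is_relevant, threshold=0.6))
print(worth_adding("rates", texts, glossary, is_relevant, threshold=0.8))
```

Two of the three newly matched texts are on topic, so "rates" passes a threshold of 0.6 but not one of 0.8.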
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a text binary classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extraction, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process which might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
Separately: min_count=1 is almost always a mistake in word2vec & related algorithms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.
I'd need some help with a downsampling issue. I have to make a larger corpus (6 654 940 sentences, 19 592 258 tokens) comparable to a smaller one (15 607 sentences, 927 711 tokens), to implement them on 2 comparable word2vec models.
Each corpus is a list of lists in which each list is a tokenized sentence:
e.g. [['the', 'boy', 'eats'], ['the', 'teacher', 'explains'], ...]
I want to downsample the larger one to have the same number of tokens as the smaller one (keeping the original data structure: sampling sentences until I reach the desired number of tokens). I am very much a beginner at programming and I thought of two possible ways of proceeding, but I am not sure how I can implement them:
- downsampling the list of lists
- downsampling the trained word2vec model (I saw in the forum that there is the parameter "sample" to downsample the most frequent words, but I want to get random sentences)
Can you help me out?
Thank you very much!! :)
Let's label a few of the things you've mentioned explicitly:
corpus-A 6 654 940 sentences, 19 592 258 tokens (2.9 tokens per sentence)
corpus-B 15 607 sentences, 927 711 tokens (60 tokens per sentence)
I'll observe right away that the tiny average size of corpus-A sentences suggests they might not be the kind of natural-language-like runs-of-words against which word2vec is typically run. And, such clipped sentences may not give rise to the kinds of window-sized contexts that are most-typical for this kind of training. (The windows will be atypically small, no matter your choice of window. And note further that no training can happen from a sentence with a single token – it's a no-op.)
So, any scaling/sampling of corpus-A (with its sentences of around 3 tokens) is not, at the end of the process, going to be that much like corpus-B (with its more typical sentences of dozens to possibly hundreds of tokens). They won't really be alike, except in some singular measurement you choose to target.
If in fact you have enough memory to operate on corpus-A completely in RAM, then choosing a random subset of 15607 sentences – to match the sentence count of corpus-B, is very simple using standard Python functions:
import random
corpus_a_subset = random.sample(corpus_a, len(corpus_b))
Of course, this particular corpus_a_subset will only match the count of sentences in corpus_b, but in fact be much smaller in raw words – around 47k tokens long – given the much-shorter average size of corpus-A sentences.
If you were instead aiming for a roughly 927k-token-long subset to match the corpus-B token count, you'd need about (927k / 3 =) 309000 sentences:
corpus_a_subset = random.sample(corpus_a, 309000)
Still, while this should make corpus_a_subset closely match the raw word count of corpus_b, it's still likely a very-different corpus in terms of unique tokens, tokens' relative frequencies, and even the total number of training contexts – as the contexts with the shorter sentences will far more often be limited by sentence-end, than full window-length. (Despite the similarity in bottom-line token-count, the training times might be noticeably different, especially if your window is large.)
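If you'd rather not estimate the sentence count from an average, a small sketch like the following picks whole sentences at random until a token budget is reached. The function name sample_to_token_budget and the toy corpus are made up for illustration, and the approach assumes corpus-A fits in RAM.

```python
import random

def sample_to_token_budget(corpus, target_tokens, seed=None):
    """Pick whole sentences at random until their token total reaches target_tokens."""
    rng = random.Random(seed)
    order = list(range(len(corpus)))
    rng.shuffle(order)  # visit sentences in random order, each at most once

    subset, total = [], 0
    for i in order:
        if total >= target_tokens:
            break
        subset.append(corpus[i])
        total += len(corpus[i])
    return subset, total

# Toy stand-in for corpus-A: 1000 sentences of 1 to 5 tokens each.
corpus_a = [["tok"] * (1 + i % 5) for i in range(1000)]

corpus_a_subset, n_tokens = sample_to_token_budget(corpus_a, 927, seed=0)
print(len(corpus_a_subset), n_tokens)
```

Because sentences are kept whole, the result can overshoot the target by up to one sentence's length, which should be negligible at a scale of 927k tokens.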
If your main interest were simply being able to train on corpus-A subsets as quickly as a smaller corpus, there are other ways besides discarding many of its sentences to slim it:
the sample parameter increases the rate at which occurrences of highly-frequent words are randomly skipped. In typical Zipfian word-frequencies, common words appear so many times, in all their possible varied usages, that it's safe to ignore many of them as redundant. And further, discarding many of those excessive examples, by allowing relatively more attention on rarer words, often improves the overall usefulness of the final word-vectors. Especially in very-large corpuses, picking a more aggressive (smaller) sample value can throw out lots of the corpus, speeding training, but still result in better vectors.
raising the min_count parameter discards ever-more of the less-frequent words. As opposed to any intuition that "more data is always better", this often improves the usefulness of the surviving words' vectors. That's because words with just a few usage examples tend not to get great vectors – those few examples won't show the variety & representativeness that's needed – yet the prevalence of so many such rare-but-insufficiently-demonstrated words still interferes with the training of other words.
As long as there are still enough examples of the more-frequent and important words, aggressive settings for sample and min_count, against a large corpus, may decrease the effective size by 90% or more – and still create high-quality vectors for the remaining words.
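To get a feel for how strongly sample thins out frequent words, here is a sketch of the keep-probability formula as I understand it from the original word2vec code. Treat the exact formula as an assumption rather than a statement about any particular gensim version; the function name and the counts are made up.

```python
import math

def keep_probability(word_count, total_tokens, sample):
    """Chance that one occurrence of a word is kept (1.0 = never skipped)."""
    if sample <= 0:
        return 1.0  # downsampling disabled
    threshold = sample * total_tokens
    # Assumed formula: (sqrt(count / threshold) + 1) * (threshold / count)
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)

total = 20_000_000  # roughly the size of corpus-A

# A very common word (5% of all tokens) under two sample settings.
p_common_default = keep_probability(1_000_000, total, sample=1e-3)
p_common_aggressive = keep_probability(1_000_000, total, sample=1e-5)

# A rare word is never skipped at these settings.
p_rare = keep_probability(200, total, sample=1e-3)

print(round(p_common_default, 3), round(p_common_aggressive, 3), p_rare)
```

Under this formula, a smaller sample value skips far more occurrences of the common word, while the rare word is left alone.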
But also note: neither of your corpuses are quite as large as is best for word2vec training. It benefits a lot from large, varied corpuses. Your corpus-B, especially, is tiny compared to lots of word2vec work – and while you can somewhat 'stretch' a corpus's impact with more training epochs, and using smaller vectors or a smaller surviving vocabulary, you still may be below the corpus size where word2vec works best. So if at all possible, I'd be looking at ways to grow corpus-B, moreso than shrink corpus-A.
I've already built my Doc2Vec model, using around 20.000 files. I'm looking for a way to find the string representation of a given vector/ID, which might be similar to Word2Vec's index2entity. I'm able to get the vector itself, using model['n'], but now I'm wondering whether there's a way to get some sort of string representation of it as well.
If you want to look up your actual training text, for a given text+tag that was part of training, you should retain that mapping outside the Doc2Vec model. (The model doesn't store training texts – only looking at them, repeatedly, during training.)
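For instance, a plain dict keyed by the same tags you give the model is usually enough. The tags and texts below are invented; with gensim, the same tag would presumably also be the one you put on each TaggedDocument.

```python
# Hypothetical training data: (tag, raw_text) pairs; tags and texts are made up.
documents = [
    ("doc_0", "the boy eats an apple"),
    ("doc_1", "the teacher explains the lesson"),
]

# Keep your own tag -> text lookup alongside the model.
text_by_tag = {tag: text for tag, text in documents}

# Later, given a tag (e.g. one returned by a similarity query), recover the text.
print(text_by_tag["doc_1"])
```

If you save the model to disk, remember to save this mapping too (as JSON, a database table, or similar), since it is not part of the model file.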
If you want to generate a text from a Doc2Vec doc-vector, that's not an existing feature, nor do I know any published work describing a reliable technique for doing so.
There's a speculative/experimental bit of work-in-progress for gensim Doc2Vec that will forward-propagate a doc-vector through the model's neural-network, and report back the most-highly-predicted target words. (This is somewhat the opposite of the way infer_vector() works.)
That might, plausibly, give a sort-of summary text. For more details see this open issue & the attached PR-in-progress:
https://github.com/RaRe-Technologies/gensim/issues/2459
Whether this is truly useful or likely to become part of gensim is still unclear.
However, note that such a set-of-words wouldn't be grammatical. (It'll just be the ranked-list of most-predicted words. Perhaps some other subsystem could try to string those words together in a natural, grammatical way.)
Also, the subtleties of whether a concept has many potential associated words, or just one, could greatly affect the "top N" results of such a process. Contriving a possible example: there are many words for describing a 'cold' environment. As a result, a doc-vector for a text about something cold might have lots of near-synonyms for 'cold' in the 11th-20th ranked positions – such that the "total likelihood" of at least one cold-ish word is very high, maybe higher than any one other word. But just looking at the top-10 most-predicted words might instead list other "purer" words whose likelihood isn't so divided, and miss the (more-important-overall) sense of "coldness". So, this experimental pseudo-summarization method might benefit from a second-pass that somehow "coalesces" groups-of-related-words into their most-representative words, until some overall proportion (rather than fixed top-N) of the doc-vector's predicted-words are communicated. (This process might be vaguely like finding a set of M words whose "Word Mover's Distance" to the full set of predicted-words is minimized – though that could be a very expensive search.)
I am using gensim to create word vectors based on my corpus like the following:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
I was wondering if it is possible to start (or somehow avoid having) words at index 0 and 1? I would like my vocabulary to start at index 2, because I need to do other operations and if I keep 0 and 1 as indexes it gets a little confusing.
Thanks for the help!
It's not a native feature of Word2Vec.
This is probably not a good idea, but you could crudely fake it by creating two dummy words with very high-frequency, and add examples containing them to your training data in a way to have a minimal impact on other vectors.
For example, if the most-common word in your corpus occurs 5,000 times, create a fake text with just the words 'dummy000000000' and 'dummy000000001' in it, repeated 1,000 times each. Add this fake text to your corpus 6 times. Then, 'dummy000000000' and 'dummy000000001' will be the two most-frequent words in the corpus, and thus get indexes 0 and 1 (in the usual case). Their training will waste time, and the model will waste a little bit of its potential state giving those words crude vectors, but they should have a minimal effect on other words (since they never co-occur with real words). Voila, you've got 0 and 1 indexes you can ignore (or treat as errors) later!
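The frequency arithmetic in that example can be checked with a plain word count. This sketch only counts tokens and trains nothing; the tiny stand-in corpus is made up so that its most common real word occurs exactly 5,000 times.

```python
from collections import Counter

# Stand-in corpus: 'the' occurs 5,000 times, 'cat' 3,000 times.
corpus = [["the", "cat"]] * 3000 + [["the"]] * 2000

# One fake text: each dummy word repeated 1,000 times.
fake_text = ["dummy000000000", "dummy000000001"] * 1000

# Add the fake text 6 times, giving 6,000 occurrences per dummy word.
corpus_with_dummies = corpus + [fake_text] * 6

counts = Counter(token for text in corpus_with_dummies for token in text)
top_two = [word for word, _ in counts.most_common(2)]

print(counts["dummy000000000"], counts["the"], top_two)
```

Since 6,000 beats 5,000, the two dummy words come out as the most frequent tokens, which is what would push them to the front of a frequency-sorted vocabulary.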
But having written it out, it's pretty definitely a bad idea. It'll slow and worsen the model slightly. Various progress/tally statistics from the model will be subtly misleading.
And, having such indexes start at 0 is very typical professional programming practice. If you find it confusing, in general or for your specific project, that may be a habit/understanding barrier that it's better to work-through than try to patch-around with non-standard practice.