Python: downsampling tokens or downsampling word2vec model

I need some help with a downsampling issue. I have to make a larger corpus (6 654 940 sentences, 19 592 258 tokens) comparable to a smaller one (15 607 sentences, 927 711 tokens), so that I can train two comparable word2vec models on them.
Each corpus is a list of lists in which each list is a tokenized sentence:
e.g. [['the', 'boy', 'eats'], ['the', 'teacher', 'explains'], ...]
I want to downsample the larger one to have the same number of tokens as the smaller one (keeping the original data structure: downsampling sentences until I get the desired number of tokens). I am a complete beginner at programming and I thought of two possible ways of proceeding, but I am not sure how to implement them:
- downsampling the list of lists
- downsampling the trained word2vec model (I saw in the forum that there is the parameter "sample" to downsample the most frequent words, but I want to drop random sentences)
Can you help me out?
Thank you very much!! :)

Let's label a few of the things you've mentioned explicitly:
corpus-A: 6 654 940 sentences, 19 592 258 tokens (~2.9 tokens per sentence)
corpus-B: 15 607 sentences, 927 711 tokens (~60 tokens per sentence)
I'll observe right away that the tiny average size of corpus-A sentences suggests they might not be the kind of natural-language-like runs-of-words against which word2vec is typically run. And, such clipped sentences may not give rise to the kinds of window-sized contexts that are most typical for this kind of training. (The windows will be atypically small, no matter your choice of window. And note further that no training can happen from a sentence with a single token – it's a no-op.)
So, any scaling/sampling of corpus-A (with its sentences of around 3 tokens) is not, at the end of the process, going to be that much like corpus-B (with its more typical sentences of dozens to possibly hundreds of tokens). They won't really be alike, except in some singular measurement you choose to target.
If in fact you have enough memory to operate on corpus-A completely in RAM, then choosing a random subset of 15607 sentences (to match the sentence count of corpus-B) is very simple using standard Python functions:
import random
corpus_a_subset = random.sample(corpus_a, len(corpus_b))
Of course, this particular corpus_a_subset will only match the count of sentences in corpus_b, but in fact be much smaller in raw words – around 47k tokens long – given the much-shorter average size of corpus-A sentences.
If you were instead aiming for a roughly 927k-token-long subset to match the corpus-B token count, you'd need about (927k / 3 =) 309000 sentences:
corpus_a_subset = random.sample(corpus_a, 309000)
Still, while this should make corpus_a_subset closely match the raw word count of corpus_b, it's still likely a very-different corpus in terms of unique tokens, tokens' relative frequencies, and even the total number of training contexts – as the contexts with the shorter sentences will far more often be limited by sentence-end rather than full window-length. (Despite the similarity in bottom-line token-count, the training times might be noticeably different, especially if your window is large.)
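If you'd rather hit the token target exactly than estimate a sentence count from the average sentence length, a small sketch like the following would also work – the helper name and seed are just illustrative, not anything standard:

import random

def sample_to_token_count(corpus, target_tokens, seed=42):
    # Shuffle a copy of the corpus, then keep whole sentences
    # until the token target is reached.
    rng = random.Random(seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    subset, total = [], 0
    for sentence in shuffled:
        if total >= target_tokens:
            break
        subset.append(sentence)
        total += len(sentence)
    return subset

corpus_a_subset = sample_to_token_count(corpus_a, 927711)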
If your main interest is simply being able to train on corpus-A as quickly as on a smaller corpus, there are other ways to slim it besides discarding many of its sentences:
- the sample parameter increases the rate at which occurrences of highly-frequent words are randomly skipped. In typical Zipfian word-frequencies, common words appear so many times, in all their possible varied usages, that it's safe to ignore many of them as redundant. And further, discarding many of those excessive examples, by allowing relatively more attention on rarer words, often improves the overall usefulness of the final word-vectors. Especially in very-large corpuses, picking a more aggressive (smaller) sample value can throw out lots of the corpus, speeding training, but still result in better vectors.
- raising the min_count parameter discards ever-more of the less-frequent words. As opposed to any intuition that "more data is always better", this often improves the usefulness of the surviving words' vectors. That's because words with just a few usage examples tend not to get great vectors – those few examples won't show the variety & representativeness that's needed – yet the prevalence of so many such rare-but-insufficiently-demonstrated words still interferes with the training of other words.
As long as there are still enough examples of the more-frequent and important words, aggressive settings for sample and min_count, against a large corpus, may decrease the effective size by 90% or more – and still create high-quality vectors for the remaining words.
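As a rough illustration only – the exact numbers here are placeholders to tune against your own checks, not recommendations – more aggressive settings might look like:

from gensim.models import Word2Vec

# Illustrative values only: a smaller `sample` randomly drops more occurrences
# of very-frequent words; a larger `min_count` discards rarer words entirely.
model_a = Word2Vec(
    sentences=corpus_a,   # the full 6.6M-sentence list-of-lists
    vector_size=100,
    window=5,
    sample=1e-05,         # more aggressive than the default 0.001
    min_count=10,         # more aggressive than the default 5
    workers=4,
)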
But also note: neither of your corpuses is quite as large as is best for word2vec training. It benefits a lot from large, varied corpuses. Your corpus-B, especially, is tiny compared to lots of word2vec work – and while you can somewhat 'stretch' a corpus's impact with more training epochs, and by using smaller vectors or a smaller surviving vocabulary, you still may be below the corpus size where word2vec works best. So if at all possible, I'd be looking at ways to grow corpus-B, more so than shrink corpus-A.

Related

Why Word2Vec function returns me a lot of 0.99 values

I'm trying to apply a word2vec model on a review dataset. First of all I apply the preprocessing to the dataset:
df=df.text.apply(gensim.utils.simple_preprocess)
and this is the dataset that I get:
0 [understand, location, low, score, look, mcdon...
3 [listen, it, morning, tired, maybe, hangry, ma...
6 [super, cool, bathroom, door, open, foot, nugg...
19 [cant, find, better, mcdonalds, know, getting,...
27 [night, went, mcdonalds, best, mcdonalds, expe...
...
1677 [mcdonalds, app, order, arrived, line, drive, ...
1693 [correct, order, filled, promptly, expecting, ...
1694 [wow, fantastic, eatery, high, quality, ive, e...
1704 [let, tell, eat, lot, mcchickens, best, ive, m...
1716 [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object
Now I create the Word2Vec model and train it:
model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))
What I don't understand is why the function most_similar() returns a lot of 0.99 similarity values.
[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]
What am I doing wrong?
You're right, that's not normal.
It is unlikely that your df is the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.
Try displaying next(iter(df)), to see the 1st item in df, if iterated over as Word2Vec does. Does it look like a good piece of training data?
Separately regarding your code:
min_count=1 is always a bad idea with Word2Vec - rare words can't get good vectors, but do, in aggregate, serve a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure it will help your results, which you can only know by comparing its effects against lower values. And if it seems like too much of your vocabulary disappears because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm.
Only 283 texts are unlikely to be enough training data unless each text has tens of thousands of tokens. (And even if it were possible to squeeze some results from this far-smaller-than-ideal corpus, you might need to shrink the vector_size and/or increase the epochs to get the most out of minimal data.)
If you supply a corpus to sentences in the Word2Vec() construction, you don't need to call .train(). It will have already automatically used that corpus fully as part of the constructor. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction-time.)
I highly recommend you enable logging to at least the INFO level for the relevant classes (either all Gensim or just Word2Vec). Then you'll see useful logging/progress info which, if you read over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
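Putting those notes together, a minimal sketch of the suggested setup might look like the following – it assumes df is the Series of token-lists shown above, and the key changes are the INFO-level logging, the plain list-of-lists corpus, min_count left at its default, and the dropped redundant .train() call:

import logging
import gensim

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# Make sure the corpus is a re-iterable sequence of lists of string tokens.
sentences = df.tolist()
print(sentences[0])       # sanity-check the first training text

model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=100,      # smaller than 200, given the tiny corpus
    window=10,
    min_count=5,          # the default; very-rare words are dropped
    workers=6,
)
# No extra model.train() needed: supplying `sentences` already trained the model.
print(model.wv.most_similar("service", topn=10))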
According to the official doc:
Find the top-N most similar words. ...
This method computes cosine similarity between a simple mean of the projection weight
vectors of the given words and the vectors for each word in the model. The method
corresponds to the word-analogy and distance scripts in the original word2vec
implementation. ...
Since you passed this df as your sentence base in the parameter, gensim just calculates the analogies and distances of words across the different sentences (dataframe rows). I'm not sure whether your dataframe contains "service"; if it does, the result words are simply the words whose values are closest to "service" in their sentences.

Using gensim most_similar function on a subset of total vocab

I am trying to use the gensim word2vec most_similar function in the following way:
wv_from_bin.most_similar(positive=["word_a", "word_b"])
So basically, I have multiple query words and I want to return the most similar outputs, but from a finite set. I.e. if the vocab is 2000 words, then I want to return the most similar from a set of, say, 100 words, and not all 2000.
e.g.
Vocab:
word_a, word_b, word_c, word_d, word_e ... words_z
Finite set:
word_d, word_e, word_f
most_similar on whole vocab
wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]
desired output
finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set
output = ['word_d', 'word_f']
Depending on your specific patterns of use, you have a few options.
If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.
Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)
Using the optional parameter restrict_vocab, with an integer value N, will limit the results to just the first N words in the KeyedVectors (which by usual conventions are those that were most-frequent in the training data). So for example, adding restrict_vocab=10000 to a call against a set-of-vectors with 50000 words will only return the most-similar words from the 1st 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results - while nearby words from the longer tail of low-frequency words are more likely to seem a little out of place.
Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.
Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.
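For example (the cutoffs here are arbitrary placeholders, matching the numbers used above):

# Only rank the 10,000 most-frequent words as candidate results.
top_hits = wv_from_bin.most_similar(positive=["word_a", "word_b"], restrict_vocab=10000)

# Only rank words in frequency positions 100..999 as candidate results.
mid_hits = wv_from_bin.most_similar(positive=["word_a", "word_b"], clip_start=100, clip_end=1000)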
If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.
Two options you could consider include:
Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done; there's a rough sketch further below, after the filtering example.
Requesting a larger set of results, then filtering your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:
finite_set = set(['word_d', 'word_e', 'word_f'])  # set for efficient 'in' checks
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                          topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option - but then you'd still have to subset those down to your words-of-interest, then sort yourself. But you'd still be paying the cost of all the vector-similarity calculations for all words, which in typical large-vocabularies is more of the runtime than the sort.
If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.
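And here's a rough sketch of the first option mentioned above – a separate KeyedVectors holding just the subset. It relies on the constructor & add_vectors() noted earlier, but treat it as a sketch to double-check against your gensim version rather than a definitive recipe:

from gensim.models import KeyedVectors

finite_set = ['word_d', 'word_e', 'word_f']

# A new, empty KeyedVectors with the same dimensionality as the full set,
# filled with just the subset's vectors copied over.
subset_kv = KeyedVectors(vector_size=wv_from_bin.vector_size)
subset_kv.add_vectors(finite_set, [wv_from_bin[w] for w in finite_set])

# Every search against subset_kv now only ranks words from the finite set;
# query vectors can still come from the full wv_from_bin.
print(subset_kv.most_similar(positive=[wv_from_bin['word_a']], topn=2))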
Finally, as an aside: if your vocabulary is truly only ~2000 words, you're far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shine. You may be disappointed in results unless you get a lot more data. (And in the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more; using a smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)
On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.

Inconsistent result output with gensim index_to_key

Good afternoon,
first of all thanks to all who take the time to read this.
My problem is this: I would like to have a word2vec model output its most common words.
I do this with the following command:
# how many words to print out (sorted by frequency)
x = list(model.wv.index_to_key[:2500])
Basically it works, but sometimes I get only 1948 or 2290 words printed out. I can't find any connection with the size of the original corpus (tokens, lines etc.) or deviation from the target value (if I increase the output value to e.g. 3500 it outputs 3207 words).
I would like to understand why this is the case; unfortunately I can't find anything on Google, and therefore I don't know how to solve the problem. Maybe by increasing the value and later deleting all rows after 2501 using pandas?
If any Python list slice, like my_list[:n], returns fewer than n items, then the original list my_list had fewer than n items in it.
So, if model.wv.index_to_key[:2500] is only returning a list of length 1948, then I'm pretty sure if you check len(model.wv.index_to_key), you'll see the source list is only 1948 items long. And of course, you can't take the 1st 2500 items from a list that's only 1948 items long!
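A quick check you could run to confirm that's what's happening – a sketch, assuming model is your trained Word2Vec:

requested = 2500
available = len(model.wv.index_to_key)
print(f"requested {requested} words, but the model only knows {available}")

# Slicing never errors - it just returns at most `available` items.
x = list(model.wv.index_to_key[:requested])
print(len(x))   # == min(requested, available)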
Why might your model have fewer unique words than you expect, or even that you counted via other methods?
Something might be amiss in your preprocessing/tokenization, but most likely is that you're not considering the effect of the default min_count=5 parameter. That default causes all words that appear fewer than 5 times to be ignored during training, as if they weren't even in the source texts.
You may be tempted to use min_count=1, to keep all words, but that's almost always a bad idea in word2vec training. Word2vec needs many subtly-contrasting alternate uses of a word to train a good word-vector.
Keeping words which only have one, or a few, usage examples winds up not only failing to get good generalizable vectors for those rare words, but also interfering with the full learning of vectors for other nearby, more-frequent words – now that their training has to fight the noise & extra model-cycles from the insufficiently-represented rare words.
Instead of lowering min_count, it's better to get more data, or live with a smaller final vocabulary.

Word2vec on documents each one containing one sentence

I have some unsupervised data (100,000 files) and each file has a paragraph containing one sentence. The preprocessing went wrong and deleted all stop points (.).
I used word2vec on a small sample (2000 files) and it treated each document as one sentence.
Should I continue the process on all the remaining files? Or would this result in a bad model?
Thank you
Did you try it, and get bad results?
I'm not sure what you mean by "deleted all stop points". But, Gensim's Word2Vec is oblivious to what your tokens are, and doesn't really have any idea of 'sentences'.
All that matters is the lists-of-tokens you provide. (Sometimes people include punctuation like '.' as tokens, and sometimes it's stripped - it doesn't make a very big difference either way, and to the extent it does, whether it's good or bad may depend on your data & goals.)
Any lists-of-tokens that include neighboring related tokens, for the sort of context-window training that's central to the word2vec algorithm, should work well.
For example, it can't learn anything from one-word texts, where there are no neighboring words. But running together sentences, paragraphs, and even full documents into long texts works fine.
Even concatenating wholly-unrelated texts doesn't hurt much: the bit of random noise from unrelated words now in-each-others' windows is outweighed, with enough training, by the meaningful relationships in the much-longer runs of truly-related text.
The main limit to consider is that each training text (list of tokens) shouldn't be more than 10,000 tokens long, as internal implementation limits up through Gensim 4.0 mean tokens past the 10,000th position will be ignored. (This limit might eventually be fixed - but until then, just splitting overlong texts into 10,000-token chunks is a fine workaround, with negligible effect from the lost contexts at the break points.)
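A minimal sketch of that chunking workaround – the helper name and the corpus variable are just illustrative:

MAX_TOKENS = 10000   # gensim's per-text limit (through 4.0)

def chunk_text(tokens, max_len=MAX_TOKENS):
    # Split one over-long list of tokens into consecutive max_len-sized pieces.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Apply to every document before training:
chunked_corpus = [chunk for doc in corpus for chunk in chunk_text(doc)]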

Sample size for Named Entity Recognition gold standard corpus

I have a corpus of 170 Dutch literary novels on which I will apply Named Entity Recognition. For an evaluation of existing NER taggers for Dutch I want to manually annotate Named Entities in a random sample of this corpus – I use brat for this purpose. The manually annotated random sample will function as the 'gold standard' in my evaluation of the NER taggers. I wrote a Python script that outputs a random sample of my corpus on the sentence level.
My question is: what is the ideal size of the random sample in terms of the amount of sentences per novel? For now, I used a random 100 sentences per novel, but this leads to a pretty big random sample containing almost 21626 lines (which is a lot to manually annotate, and which leads to a slow working environment in brat).
NB, before the actual answer: The biggest issue I see is that you can only evaluate the tools with respect to those 170 books. So at best, it will tell you how well the NER tools you evaluate work on those books or similar texts. But I guess that is obvious...
As to sample sizes, I would guesstimate that you need no more than a dozen random sentences per book. Here's a simple way to check whether your sample size is already big enough: Randomly choose only half of the sentences (stratified per book!) you annotated and evaluate all the tools on that subset. Do that a few times and see if the results for the same tool vary widely between runs (say, more than +/- 0.1 if you use F-score, for example - mostly depending on how "precise" you have to be to detect significant differences between the tools). If the variances are very large, continue to annotate more random sentences. If the numbers start to stabilize, you're good and can stop annotating.
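A rough sketch of that stability check – everything here is hypothetical scaffolding: annotated is assumed to map each book to its list of annotated sentences, and evaluate_f1 stands in for however you score one tool against your gold annotations:

import random
import statistics

def half_sample_scores(annotated, evaluate_f1, runs=5, seed=0):
    # Evaluate one tool on several random half-samples, stratified per book,
    # and report the spread of its scores across runs.
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        half = []
        for sentences in annotated.values():   # stratify: half from each book
            half.extend(rng.sample(sentences, len(sentences) // 2))
        scores.append(evaluate_f1(half))
    return min(scores), max(scores), statistics.stdev(scores)

# If max - min stays well under your tolerance (e.g. 0.1 F-score),
# the annotated sample is probably big enough.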
Indeed, the "ideal" size would be... the whole corpus :)
Results will be correlated with the degree of detail of the typology: just PERS, LOC, ORG would require a minimal size, but what about a fine-grained typology or even full disambiguation (linking)? I suspect good performance wouldn't need much data (just enough to validate), whilst low performance would require more data to get a more detailed view of the errors.
As an indicator, cross-validation is considered a standard methodology; it often uses 10% of the corpus to evaluate (but the evaluation is done 10 times).
Besides, if working with ancient novels, you will probably face a lexical-coverage problem: many old proper names will not be included in available software's lexical resources, and this is a severe drawback for NER accuracy. Thus it could be a nice idea to split the corpus according to decades / centuries and conduct multiple evaluations, so as to measure the impact of this issue on performance.
