gensim - Doc2Vec: MemoryError when training on english Wikipedia

gensim - Doc2Vec: MemoryError when training on english Wikipedia - python

I extracted 145,185,965 sentences (14GB) out of the english wikipedia dump and I want to train a Doc2Vec model based on these sentences. Unfortunately I have 'only' 32GB of RAM and get a MemoryError when trying to train. Even if I set the min_count to 50, gensim tells me that it would need over 150GB of RAM. I don't think that further increasing the min_count would be a good idea, because the resulting model would be not very good (just a guess). But anyways, I will try it with 500 to see if memory is sufficient then.
Are there any possibilities to train such a large model with limited RAM?
Here is my current code:
corpus = TaggedLineDocument(preprocessed_text_file)
model = Doc2Vec(vector_size=300,
window=15,
min_count=50, #1
workers=16,
dm=0,
alpha=0.75,
min_alpha=0.001,
sample=0.00001,
negative=5)
model.build_vocab(corpus)
model.train(corpus,
epochs=400,
total_examples=model.corpus_count,
start_alpha=0.025,
end_alpha=0.0001)
Are there maybe some obvious mistakes I am doing? Using it completely wrong?
I could also try reducing the vector size, but I think this will result in much worse results as most papers use 300D vectors.

The required model size in addressable memory is largely a function of the number of weights required, by the number of unique words and unique doc-tags.
With 145,000,000 unique doc-tags, no matter how many words you limit yourself to, just the raw doc-vectors in-training alone will require:
145,000,000 * 300 dimensions * 4 bytes/dimension = 174GB
You could try a smaller data set. You could reduce the vector size. You could get more memory.
I would try one or more of those first, just to verify you're able to get things working and some initial results.
There is one trick, best considered experimental, that may work to allow training larger sets of doc-vectors, at some cost of extra complexity and lower performance: the docvecs_mapfile parameter of Doc2Vec.
Normally, you don't want a Word2Vec/Doc2Vec-style training session to use any virtual memory, because any recourse to slower disk IO makes training extremely slow. However, for a large doc-set, which is only ever iterated over in one order, the performance hit may be survivable after making the doc-vectors array to be backed by a memory-mapped file. Essentially, each training pass sweeps through the file from font-to-back, reading each section in once and paging it out once.
If you supply a docvecs_mapfile argument, Doc2Vec will allocate the doc-vectors array to be backed by that on-disk file. So you'll have a hundreds-of-GB file on disk (ideally SSD) whose ranges are paged in/out of RAM as necessary.
If you try this, be sure to experiment with this option on small runs first, to familiarize yourself with its operation, especially around saving/loading models.
Note also that if you then ever do a default most_similar() on doc-vectors, another 174GB array of unit-normalized vectors must be created from the raw array. (You can force that to be done in-place, clobbering the existing raw values, by explicitly calling the init_sims(replace=True) call before any other method requiring the unit-normed vectors is called.)

Related

Different results infer_vector() of Doc2Vec after saving to disk and load

I am using the Doc2Vec model from gensim (4.1.2) python library.
I trained model on my corpus of documents and used infer_vector(). Than I saved model and try to use infer_vector on same text, but I get totally different vector. What is wrong?
Here is example of code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load saved model
fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....

First, there's a natural amount of variance from one run of infer_vector() to another, that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both the training & inference are processes that work in proportion to the number of words in a text. So a 100-word text, that goes through inference to find a new vector, will get 50x more 'adjustment nudges' than a mere 2-word text - and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of vagaries of the random initialization, and random smpling during incremental adjustment, than a longer text.)
Finally, often other problems in the model – like insufficient training data, overfitting (expecially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training can make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at then end of training, how often a re-inferred vector for training texts is the top, or one of the few top, neighbors of the same text's vector from bulk training.)
One possible errors could be too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands of words long. With less data (and possibly too many vector_size dimensions for tiny training data), models will be 'looser' or more arbitrary when modeling new data.
Another very common error is to follow some of the bad tutorials online which include calling .train() many times in your own training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?

Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?

I am curious to know if there are any implications of using a different source while calling the build_vocab and train of Gensim FastText model. Will this impact the contextual representation of the word embedding?
My intention for doing this is that there is a specific set of words I am interested to get the vector representation for and when calling model.wv.most_similar. I only want words defined in this vocab list to get returned rather than all possible words in the training corpus. I would use the result of this to decide if I want to group those words to be relevant to each other based on similarity threshold.
Following is the code snippet that I am using, appreciate your thoughts if there are any concerns or implication with this approach.
vocab.txt contains a list of unique words of interest
corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow up question to this is what values should I set for total_examples & total_words during training in this case?
from gensim.models.fasttext import FastText
model = FastText(min_count=1, vector_size=300,)
corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'
# Unsure if below counts should be based on the training corpus or vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)
# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
# train the model
model.train(corpus_file=corpus.corpus_path, epochs=100,
total_examples=corpus_count, total_words=total_words,
)
# save the model
model.save(f'models/gensim-fastext-model-{client}')

Incase someone has similar question, I'll paste the reply I got when asking this question in the Gensim Disussion Group for reference:
You can try it, but I wouldn't expect it to work well for most
purposes.
The build_vocab() call establishes the known vocabulary of the
model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words
– then:
You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training-corpus.
Every word in the training corpus that's not in the know vocabulary is ignored completely, as if it wasn't even there. So you might as
well filter your corpus down to just the words-of-interest first, then
use that same filtered corpus for both steps. Will the example texts
still make sense? Will that be enough data to train meaningful,
generalizable word-vectors for just the words-of-interest, alongside
other words-of-interest, without the full texts? (You could look at
your pref-filtered corpus to get a sense of that.) I'm not sure - it
could depend on how severely trimming to just the words-of-interest
changed the corpus. In particular, to train high-dimensional dense
vectors – as with vector_size=300 – you need a lot of varied data.
Such pre-trimming might thin the corpus so much as to make the
word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your
words-of-interest, or with the full original corpus – and see which
works better on downstream evaluations.
More generally, if the concern is training time with the full corpus,
there are likely other ways to get an adequate model in an acceptable
amount of time.
If using corpus_file mode, you can increase workers to equal the
local CPU core count for a nearly-linear speedup from number of cores.
(In traditional corpus_iterable mode, max throughput is usually
somewhere in the 6-12 workers threads, as long as you ahve that many
cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to
train faster, in less memory, leaving better vectors for the remaining
words when you discard the lowest-frequency words, as the default
min_count=5 does. (It's possible FastText can eke a little bit of
benefit out of lower-frequency words via their contribution to
character-n-gram-training, but I'd only ever lower the default
min_count if I could confirm it was actually improving relevant
results.
If your corpus is so large that training time is a concern, often a
more-aggressive (smaller) sample parameter value not only speeds
training (by dropping many redundant high-frequency words), but ofthen
improves final word-vector quality for downstream purposes as well (by
letting the rarer words have relatively more influence on the model in
the absense of the downsampled words).
And again if the corpus is so large that training time is a concern,
than epochs=100 is likely overkill. I believe the GoogleNews
vectors were trained using only 3 passes – over a gigantic corpus. A
sufficiently large & varied corpus, with plenty of examples of all
words all throughout, could potentially train in 1 pass – because each
word-vector can then get more total training-updates than many epochs
with a small corpus. (In general larger epochs values are more often
used when the corpus is thin, to eke out something – not on a corpus
so large you're considering non-standard shortcuts to speed the
steps.)
-- Gordon

Different results of Gensim Word2Vec Model in two editors for the same source code in same environment and platform?

I am trying to apply the word2vec model implemented in the library gensim 3.6 in python 3.7, Windows 10 machine. I have a list of sentences (each sentences is a list of words) as an input to the model after performing preprocessing.
I have computed the results (obtaining 10 most similar words of a given input word using model.wv.most_similar) in Anaconda's Spyder followed by Sublime Text editor.
But, I am getting different results for the same source code executed in two editors.
Which result should I need to choose and Why?
I am specifying the screenshot of the results obtained by running the same code in both spyder and sublime text. The input word for which I need to obtain 10 most similar word is #universe#
I am really confused how to choose the results, on what basis? Also, I have started learning Word2Vec recently.
Any suggestion is appreciated.
Results Obtained in Spyder:
Results Obtained using Sublime Text:

The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. These mean that two runs, even in the exact same environment, can have different results.
If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual words will be in different places. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)
How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)
Are you trying to make very-large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncracies of the data are memorized rather than generalized patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run-to-run, even though each model is doing about-as-well as the other on its internal word-prediction tasks.)
Have you stuck with the default epochs=5 even with a small dataset? (A large, varied dataset requires fewer training passes - because all words appear many times, all throughout the dataset, anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)

Inaccurate similarities results by doc2vec using gensim library

I am working with Gensim library to train some data files using doc2vec, while trying to test the similarity of one of the files using the method model.docvecs.most_similar("file") , I always get all the results above 91% with almost no difference between them (which is not logic), because the files do not have similarities between them. so the results are inaccurate.
Here is the code for training the model
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.00025,dm=1)
model.build_vocab(it)
for epoch in range(100):
model.train(it,epochs=model.iter, total_examples=model.corpus_count)
model.alpha -= 0.0002
model.min_alpha = model.alpha
model.save('doc2vecs.model')
model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print sim
**this is the output result**
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
what am I doing wrong ? how could I improve the accuracy of results ?

What is your it data, and how is it prepared? (For example, what does print(iter(it).next()) do, especially if you call it twice in a row?)
By calling train() 100 times, and also retaining the default model.iter of 5, you're actually making 500 passes over the data. And the first 5 passes will use train()s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lower each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, and thus keeping them during training serves as noise making the model less strong.

Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too small for the model to learn the differences.
also, check the contents of the documents vs the corpus you are using. Make sure that the vocab is relevant for your files?

Optimizing RAM usage when training a learning model

I have been working on creating and training a Deep Learning model for the first time. I did not have any knowledge about the subject prior to the project and therefor my knowledge is limited even now.
I used to run the model on my own laptop but after implementing a well working OHE and SMOTE I simply couldnt run it on my own device anymore due to MemoryError (8GB of RAM). Therefor I am currently running the model on a 30GB RAM RDP which allows me to do so much more, I thought.
My code seems to have some horribly inefficiencies of which I wonder if they can be solved. One example is that by using pandas.concat my model's RAM usages skyrockets from 3GB to 11GB which seems very extreme, afterwards I drop a few columns making the RAm spike to 19GB but actually returning back to 11GB after the computation is completed (unlike the concat). I also forced myself to stop using the SMOTE for now just because the RAM usage would just go up way too much.
At the end of the code, where the training happens the model breaths its final breath while trying to fit the model. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for exmaple preprocessing and training) but to do so I would need to store massive datasets in a pickle which can only reach 4GB (correct me if I'm wrong). I have also given thought about using pre-trained models but I truely did not understand how this process goes to work and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!

Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence there is between data points is that it needs to be clear what categories are there overall. So the OHE can be broken into two steps, both of which do not require that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
Step 1.3: feature selection
It may be worthwile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all data. While you can probably also throw something together which streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Step 1.1 and Step 1.2
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.

Slightly orthogonal to your actual question, if your high RAM usage is caused by having entire dataset in memory for the training, you could eliminate such memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read next batch and so on.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.