Incremental Word2Vec Model Training in gensim - python

I have tried to incrementally train a word2vec model produced by gensim, but I found that the vocabulary size doesn't increase; only the word2vec model weights are updated. I need to update both the vocabulary and the model.
from gensim.models import Word2Vec

# Load data
sentences = []
....................

# Training
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")

# Incremental training
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")

By default, gensim Word2Vec only does vocabulary-discovery once. It will happen when you supply a corpus like your sentences to the initial constructor (which does an automatic vocabulary-scan and train), or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.
There is support (that I would consider experimental) for calling build_vocab() with new text examples, and an update=True parameter, to expand the vocabulary. While this would let further train() calls train both old-and-new words, there are many caveats:
- such sequential training may not lead to models as good, or as self-consistent, as providing all examples interleaved. (For example, the continued training may drift words learned from later batches arbitrarily far from words/word-senses in earlier batches that are not re-presented.)
- such calls to train() should use one of the optional parameters to give an accurate estimate of the new batch size (in words or examples), so that learning-rate decay and progress-logging are done properly
- the core algorithm and its underlying theory aren't based on this kind of batching, with multiple restarts of the learning-rate from high to low, so the interpretation of the results – and the relative strength/balance of the resulting vectors – isn't as well-grounded
If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery then training.
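To make the update=True path above concrete, here is a minimal sketch, assuming gensim 4.x parameter names (vector_size, epochs) and small made-up corpora (min_count=1 only so every toy word survives):

from gensim.models import Word2Vec

# Toy corpora, for illustration only
old_sentences = [["hello", "world"], ["gensim", "word2vec"]]
new_sentences = [["incremental", "training"], ["brand", "new", "vocabulary"]]

# Initial training: automatic vocabulary scan plus training
model = Word2Vec(old_sentences, vector_size=100, min_count=1)

# Experimental vocabulary expansion, then continued training;
# total_examples/epochs give train() an accurate estimate of the new batch
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

print(len(model.wv))  # the vocabulary now includes the new words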

Related

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=69, stratify=y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO NESTED LISTS, AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
import nltk
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count=1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word: model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to a ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels), are then used for enhanced feature-modeling of the texts, as input to later supervised, label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows and text is always plentiful, even if labels are expensive to acquire, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other, less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session – which used randomization in both initialization and training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, if you do withhold your test-set texts from the initial word2vec training, you should never train a separate word2vec model on just the test texts; rather, reuse the frozen word2vec model built from the training data.
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
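For example, one common way to reuse the frozen, train-only model for both splits is to average each document's in-vocabulary word-vectors into a fixed-length feature vector – a sketch assuming the model/train_x3/test_x3 names from the question (the doc_vector helper is hypothetical):

import numpy as np

def doc_vector(tokens, w2v_model):
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary tokens are skipped.
    vecs = [w2v_model.wv[tok] for tok in tokens if tok in w2v_model.wv]
    if not vecs:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vecs, axis=0)

# One model, fit on the train split only, reused (frozen) for both splits.
X_train = np.vstack([doc_vector(doc, model) for doc in train_x3])
X_test = np.vstack([doc_vector(doc, model) for doc in test_x3])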

Checking model overfit of doc2vec with infer_vector()

My aim is to create document embeddings from the column df["text"] as a first step, and then as a second step plug them, along with other variables, into an XGBoost Regressor model in order to make predictions. This works very well for the train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then making predictions with them. However, the results are very bad: I get a very large error (RMSE).
I assume this means that Doc2Vec is massively overfitting?
I am actually not sure if this is the correct way to evaluate my doc2vec model (by infer_vector)?
What to do to prevent doc2vec from overfitting?
Please find my code below for inferring vectors from a model:
import pandas as pd

vectors_test = []
for i in range(0, len(test_df)):
    vecs = model.infer_vector(tokenize(test_df["text"][i]))
    vectors_test.append(vecs)
vectors_test = pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
import numpy as np
from sklearn.metrics import mean_squared_error

np.random.seed(0)
test_df = test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)
Please see also the training of my doc2vec model:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
from tqdm import tqdm

doc_tag = train_df.apply(lambda row: TaggedDocument(words=tokenize(row["text"]), tags=[row.Tag]), axis=1)
# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers=cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])
# train model for 5 epochs
for epoch in range(5):
    model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]), total_examples=len(doc_tag.values), epochs=1)
Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate why one particular set of inputs are performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
- You generally shouldn't be calling train() multiple times in your own loop. See "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for discussion of common problems here. (Yours isn't quite as stark, but the learning-rate isn't being handled properly across your 5 separate train() calls – indeed there should even be warnings in your log output.)
- Similarly: it's often a bad idea to use a min_count as small as 1 in these kinds of models: such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise which dilutes the influence of other, surrounding tokens that are meaningful.
- Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, even more epochs often help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
- It's unclear what the size of your training texts and their unique vocabulary is, but Doc2Vec overfitting will be most likely if the model is relatively large – in terms of vector_size or total surviving vocabulary – compared to the training data. Then the model has lots of opportunity to essentially 'memorize' idiosyncrasies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it's preserving many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training-target results in ways unlikely to help on other examples.) If your training data is "small", you likely need a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
- infer_vector essentially ignores any words not in its vocabulary – so you should take a look at some of the specific texts in the poorly-performing set, and check whether most of their words are present or not. But note also: as Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then it is more likely to have some word data, at or above the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it can still receive its features from an unsupervised step that used all text data.)
- A crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on, and check the model's set of bulk-trained vectors (model.docvecs) for the nearest neighbors of these re-inferred vectors – as sketched below. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector – or if, more generally, re-inferring the same text multiple times doesn't yield vectors that are close to each other – then something about the model training or inference is deficient: overfitting, undertraining, insufficient data, or unwise parameters.
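A rough sketch of that re-inference check, assuming gensim 4.x (where the bulk-trained doc-vectors live under model.dv, the newer name for model.docvecs) and the doc_tag TaggedDocument series from the question:

import random

# Sample a handful of training documents and re-infer their vectors.
sample = random.sample(list(doc_tag.values), 10)

hits = 0
for doc in sample:
    inferred = model.infer_vector(doc.words)
    # Nearest bulk-trained doc-vector to the re-inferred vector
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    if top_tag == doc.tags[0]:
        hits += 1

print(f"{hits}/{len(sample)} sampled docs are their own nearest neighbor")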

Is it appropriate to train W2V model on entire corpus?

I have a corpus of free-text medical narratives, which I am going to use for a classification task; right now it's about 4200 records.
To begin, I wish to create word embeddings using w2v, but I have a question about a train-test split for this task.
When I train the w2v model, is it appropriate to use all of the data for the model creation? Or should I only use the train data for creating the model?
Really, my question sort of comes down to: do I take the whole dataset, create the w2v model, transform the narratives with the model, and then split, or should I split, create w2v, and then transform the two sets independently?
Thanks!
EDIT
I found an internal project at my place of work which was built by a vendor; they create the split, and create the w2v model on ONLY the train data, then transform the two sets independently in different jobs; so it's the latter of the two options that I specified above. This is what I thought would be the case, as I wouldn't want to contaminate the w2v model with any of the test data.
The answer to most questions like these in NLP is "try both" :-)
Contamination of test vs train data is not relevant or a problem in generating word vectors. That is a relevant issue in the model you use the vectors with. I found performance to be better with whole corpus vectors in my use cases.
Word vectors improve in quality with more data. If you don't use the test corpus, you will need a method for initializing out-of-vocabulary vectors and for understanding the impact they may have on your model performance.
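If you do go the train-only route, a quick way to gauge that out-of-vocabulary impact – a sketch where model is the train-only Word2Vec model and test_docs is a hypothetical list of tokenized test documents:

# Count how many test-split tokens the train-only model has never seen.
vocab = set(model.wv.key_to_index)

test_tokens = [tok for doc in test_docs for tok in doc]
oov_tokens = [tok for tok in test_tokens if tok not in vocab]

print(f"OOV rate on the test split: {len(oov_tokens) / max(1, len(test_tokens)):.1%}")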

Creating train,test data for Word2Vec model

I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data after I am done creating a W2V model with my train data?
Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process – such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
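For instance, a minimal sketch of that analogy evaluation, assuming a trained model and the questions-words.txt file that ships with gensim's test data:

from gensim.test.utils import datapath

# Score the word-vectors on the classic Google analogy test set.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print(f"Analogy accuracy: {score:.2%}")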
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)

Doc2vec and word2vec with negative sampling

My current doc2vec code is as follows.
from gensim.models import doc2vec

# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4, iter=20)
I also have a word2vec code as below.
from gensim.models import word2vec

# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample=1e-3, sg=1, iter=20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused about whether to use hierarchical softmax or negative sampling. Please let me know what the differences between these two methods are.
Also, I am interested in knowing which parameters need to be changed to use hierarchical softmax and/or negative sampling with respect to DM, DBOW, Skip-gram and CBOW.
P.s. my application is a recommendation system :)
Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
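To make those mode choices concrete, a hedged sketch using current gensim parameter names (vector_size rather than the older size); sentences (a list of token lists) and docs (a list of TaggedDocument objects) are assumed to exist, as in the question:

from gensim.models import Word2Vec, Doc2Vec

# Word2Vec: skip-gram vs CBOW (mutually exclusive)
sg_model = Word2Vec(sentences, vector_size=100, sg=1)    # skip-gram
cbow_model = Word2Vec(sentences, vector_size=100, sg=0)  # CBOW (the default)

# Doc2Vec: DBOW vs DM (mutually exclusive); DBOW can add skip-gram word-training
dbow_model = Doc2Vec(docs, vector_size=100, dm=0, dbow_words=1)
dm_model = Doc2Vec(docs, vector_size=100, dm=1)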
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word's output-node activation a little stronger, and the N 'wrong' words' output-node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive with large vocabularies; instead it just calculates N+1 nodes and ignores the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative specified, is negative=5, following the default in the original Google word2vec.c code.
With hierarchical-softmax, instead of every predictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is a matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words have longer encodings (involving more nodes). Again, this serves to save calculation time: to check whether an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked and nudged, instead of the whole set.
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpora tend to prefer negative-sampling.
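The corresponding parameter choices, again as a sketch with current gensim parameter names and the same assumed sentences corpus:

from gensim.models import Word2Vec

# Negative sampling (the default): hs=0, negative=5
ns_model = Word2Vec(sentences, vector_size=100, hs=0, negative=5)

# Hierarchical softmax only: enable hs, disable negative sampling
hs_model = Word2Vec(sentences, vector_size=100, hs=1, negative=0)

# Possible (as in the original word2vec.c) but generally not recommended:
# both_model = Word2Vec(sentences, vector_size=100, hs=1, negative=5)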
