I'm testing feeding gensim's Word2Vec different sentences with the same overall vocabulary to see if some sentences carry "better" information than others. My method to train Word2Vec looks like this:
import torch
from gensim.models import Word2Vec

def encode_sentences(w2v_params, sentences):
    model = Word2Vec(sentences, **w2v_params)
    idx_order = torch.tensor([int(i) for i in model.wv.index2entity], dtype=torch.long)
    X = torch.zeros((idx_order.max() + 1, w2v_params['size']), dtype=torch.float)

    # Put embeddings back in order
    X[idx_order] = torch.tensor(model.wv.vectors)

    return X
What I'm hoping for here is that each time w2v runs, it starts with a fresh model and trains from scratch. However, I'm testing 3 kinds of sentences, so my test code looks like this:
def test(sentence):
    w2v = {'size': 128, 'sg': 1}
    X = encode_sentences(w2v, sentence)
    evaluate(X)  # Basic cluster analysis stuff here

# s1, s2 and s3 are the 3 sets of sentences with the same vocabulary in different order/frequency
[print(test(s)) for s in [s1, s2, s3]]
However, I noticed if I remove one of the test sets, and only test s1 and s2 (or any combination of 2 sets of the three), the overall quality of the clusterings decreases. If I go back into encode_sentences and add del model before the return call, the overall cluster quality also goes down but remains consistent no matter how many datasets are tested.
What gives? Is the constructor not actually building a fresh model each time with new weights? The docs and source code give no indication of this. I'm quite sure it isn't my evaluation method, as everything was fixed after the del model was added. I'm at a loss here... Are these runs actually independent, or is each call to Word2Vec(foo, ...) equivalent to retraining the previous model with foo as new data?
And before you ask: no, model is not referenced anywhere outside the scope of encode_sentences; that's the only place that variable name is used in the whole program. Very odd.
Edit with more details
If it's important, I'm using Word2Vec to build node embeddings on a graph the way Node2Vec does, with different walk strategies. These embeddings are then fed to a logistic regression model (evaluate(X)), which calculates the area under the ROC curve.
Here is some sample output of the model before adding the del model call to the encode_sentences method, averaged over 5 trials:
Random walks: 0.9153 (+/-) 0.002
Policy walks: 0.9125 (+/-) 0.005
E-greedy walks: 0.8489 (+/-) 0.011
Here is the same output with the only difference being del model in the encoding method:
Random walks: 0.8627 (+/-) 0.005
Policy walks: 0.8527 (+/-) 0.002
E-greedy walks: 0.8385 (+/-) 0.009
As you can see, in each case, the variance is very low (the +/- value is the standard error) but the difference between the two runs is almost a whole standard deviation. It seems odd that if each call to Word2Vec was truly independent that manually freeing the data structure would have such a large effect.
Each call to the Word2Vec() constructor creates an all-new model.
However, runs are not completely deterministic under normal conditions, for a variety of reasons, so results quality for downstream evaluations (like your unshown clustering) will jitter from run-to-run.
If the variance in repeated runs with the same data is very large, there are probably other problems, such as an oversized model prone to overfitting. (Stability from run-to-run can be one indicator that your process is sufficiently specified that the data and model choices are driving results, not the randomness used by the algorithm.)
If this explanation isn't satisfying, try adding more info to your question - such as the actual magnitude of your evaluation scores, in repeated runs, both with and without the changes that you conjecture are affecting results. (I suspect the variations from the steps you think are having effect will be no larger than variations from re-runs or different seed values.)
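For example, one hedged way to separate real effects from ordinary run-to-run jitter is to repeat the same encode/evaluate pipeline several times with different seeds and compare the spread of scores. This sketch reuses the question's encode_sentences, evaluate and s1 names as placeholders, and assumes evaluate() returns a score; seed and workers are standard Word2Vec constructor parameters that can be passed through the params dict:

scores = []
for seed in range(5):
    # Same data each time; only the seed changes. workers=1 trades speed for repeatability.
    w2v = {'size': 128, 'sg': 1, 'seed': seed, 'workers': 1}
    X = encode_sentences(w2v, s1)
    scores.append(evaluate(X))

print(min(scores), max(scores))  # this spread approximates ordinary run-to-run jitter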
(More generally, Word2Vec is generally hungry for as much varied training data as possible; only if texts are non-representative of the relevant domain are they likely to result in a worse model. So I generally wouldn't expect being choosier about which subset of sentences is best to be an important technique, unless some of the sentences are total junk/noise, but of course there's always a chance you'll find some effects in your particular data/goals.)
Related
Running this code gives me back loss values that cycle without really decreasing. Could you explain why?
from gensim.test.utils import datapath
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''
    def __init__(self):
        self.epoch = 0
        self.previous_loss = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss - self.previous_loss))
        self.epoch += 1
        self.previous_loss = loss

model = Word2Vec(datapath('lee_background.cor'), epochs=10000,
                 compute_loss=True, callbacks=[callback()])
Loss can't decrease forever, unless:
the model can perfectly memorize the training set; and
every input – in the case of word2vec, the skip-gram or CBOW context word(s) – always generates the exact same outputs
The latter definitely isn't the case in natural-language: neither one skip-gram word X, nor a window of CBOW words X1, X2, ... Xn, will always exactly predict a target word. Hence, there will always be loss-against-training-examples.
All that you're doing with training (stochastic-gradient-descent optimization) is driving loss to the smallest that's practical given the mechanism/size of the chosen model.
At some point, at a still non-zero loss, changing the model to be better on some training-examples necessarily worsens it on others.
At this point, often called 'convergence', further training can only cause measured loss to jitter up-and-down around some range-of-approximately-best-value. Which seems to be what you're describing.
Related: a model with lower loss is better at the training task – mechanistically predicting words among the texts of the training set. But it won't necessarily be better at all the other downstream things you want to use word-vectors for.
At a certain point, being superficially better at the training-set – memorizing every detail, even the idiosyncratic non-generalizable things – can make things worse for other out-of-training-set tasks. That's 'overfitting'.
Especially with small training-sets, you can see this for yourself by expanding the vector_size. Some size will do best for specific other tasks – creating word-vectors that reflect what you want about the word's relationships – but an ever-larger size will do worse.
(It'll also, at some point, make the model larger than the training-data – an imbalance that practically ensures overfitting, because all 'learning' typically needs to have some aspect of compression: boiling a larger amount of suggestive/noisy data down into a smaller number of useful, compact lessons.)
That's why assessing model fitness for your project requires evaluations other than just looking-at-loss. Ideally these evaluations are even project-specific, though often more generic ones – like say the analogy-solving often applied to word2vec models – may point in the roughly right direction to match human-salient word senses. Still: on any project-specific goal, like classification or info-retrieval, the word-vectors 'best' at analogy-solving might not be best for the project's purposes.
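If you want a generic evaluation of that sort, gensim ships a copy of the Google analogy test set. A rough sketch, assuming a trained Word2Vec model named model as in the code above:

from gensim.test.utils import datapath

# Score the model on the bundled analogy questions; higher is better, but this is
# only a generic proxy, not your project-specific goal.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print('analogy accuracy:', score)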
I am using the Doc2Vec model from the gensim (4.1.2) Python library.
I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried to use infer_vector on the same text, but I get a totally different vector. What is wrong?
Here is an example of the code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load the saved model:
fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....
First, there's a natural amount of variance from one run of infer_vector() to another, that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both the training & inference are processes that work in proportion to the number of words in a text. So a 100-word text, that goes through inference to find a new vector, will get 50x more 'adjustment nudges' than a mere 2-word text - and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of vagaries of the random initialization, and random sampling during incremental adjustment, than a longer text.)
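A small sketch of that: epochs here is the optional per-call override of the model's inference epochs, and 50 is just an illustrative value, not a recommendation:

import numpy as np

# Re-infer the same tiny text twice with more inference epochs; the two vectors
# should land closer together than with the default epochs, though never identical.
v1 = doc2vec_model.infer_vector(["system", "response"], epochs=50)
v2 = doc2vec_model.infer_vector(["system", "response"], epochs=50)
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print('cosine similarity between re-inferences:', cos)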
Finally, other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can often make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the few top, neighbors of the same text's vector from bulk training.)
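A rough sketch of that sanity check, assuming gensim 4.x and that train_corpus is a hypothetical name for your list of TaggedDocument objects used in training:

hits = 0
for doc in train_corpus[:100]:
    inferred = model.infer_vector(doc.words, epochs=50)
    # Does the bulk-trained vector for this doc rank among the top-3 neighbors
    # of its own re-inferred vector?
    top_tags = [tag for tag, _ in model.dv.most_similar([inferred], topn=3)]
    if doc.tags[0] in top_tags:
        hits += 1
print('self-similarity@3: {}/100'.format(hits))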
One possible error is too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long. With less data – and possibly too many vector_size dimensions for tiny training data – models will be 'looser' or more arbitrary when modeling new data.)
Another very common error is to follow some of the bad tutorials online that call .train() many times in your own training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
In the Deep Learning course, Prof. Ng mentioned how to implement Dropout Regularisation.
Implementation by Prof. Ng:
The first image shows the implementation of dropout and the generation of the d3 matrix, i.e. the dropout matrix for the third layer, with shape (#nodes, #training examples).
As per my understanding, D3 would look like this for a single iteration and keeps on changing with every iteration (here, I've taken 5 nodes and 10 training examples)
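If it helps, here is a minimal numpy sketch of the inverted-dropout step as I understand the course implementation, using the same illustrative 5 nodes and 10 examples (keep_prob is a hypothetical value):

import numpy as np

keep_prob = 0.8
a3 = np.random.randn(5, 10)              # layer-3 activations, one column per training example
d3 = np.random.rand(5, 10) < keep_prob   # fresh Boolean mask: each column drops its own node subset
a3 = (a3 * d3) / keep_prob               # zero the dropped nodes, rescale the survivors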
Query: One thing I didn't get is why we need to drop different nodes for each training example. Instead, why can't we keep the dropped nodes the same for all examples and drop randomly again in the next iteration? For example, in the second image the 2nd training example passes through 4 nodes while the first one passes through all nodes. Why not the same nodes for all examples?
We choose different nodes within the same batch for the simple reason that it gives slightly better results. This is the rationale for virtually any choice in a neural network. As to why it gives better results ...
If you have many batches per epoch, you will notice very little difference, if any. The reason we have any dropout is that models sometimes exhibit superstitious learning (in the statistical sense): if, in an early batch of training examples, several of them just happen to have a particularly strong correlation of some sort, then the model will learn that correlation early on (primacy effect), and will take a long time to un-learn it. For instance, students doing the canonical dogs-vs-cats exercise will often use their own data set. Some students find that the model learns to identify anything on a soft chair as a cat, and anything on a lawn as a dog -- because those are common places to photograph each type of pet.
Now, imagine that your shuffling algorithm brings up several such photos in the first three batches. Your model will learn this correlation. It will take quite a few counter-examples (cat in the yard, dog on the furniture) to counteract the original assumption. Dropout disables one or another of the "learned" nodes, allowing others to be more strongly trained.
The broad concept is that a valid correlation will be re-learned easily; an invalid one is likely to disappear. This is the same concept as in repeating other experiments. For instance, if a particular experiment shows significance at p < .05 (a typical standard of scientific acceptance), then there is no more than a 5% chance that the correlation could have happened by chance, rather than the functional connection we hope to find. This is considered to be confident enough to move forward.
However, it's not certain enough to make large, sweeping changes in thousands of lives -- say, with a new drug. In those cases, we repeat the experiment enough times to achieve the social confidence desired. If we achieve the same result in a second, independent experiment, then instead of 1/20 chance of being mistaken, we have a 1/400 chance.
The same idea applies to training a model: dropout reduces the chance that we've learned something that isn't really true.
Back to the original question: why drop out based on each example, rather than on each iteration? Statistically, we do a little better if we drop out more frequently, but for shorter spans. If we drop the same nodes for several iterations, that slightly increases the chance that the active nodes will learn something mistaken while those few nodes are inactive.
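A small sketch contrasting the two alternatives being compared, with hypothetical shapes (5 nodes, 10 examples):

import numpy as np

keep_prob = 0.8
n_nodes, m = 5, 10

# Per-example mask (what the course does): each column drops a different node subset.
d_per_example = np.random.rand(n_nodes, m) < keep_prob

# Per-iteration alternative from the question: one mask column, broadcast to every example.
d_per_iteration = np.random.rand(n_nodes, 1) < keep_prob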
Do note that this is mostly ex post facto rationale: this wasn't predicted in advance; instead, the explanation comes only after the false-training effect was discovered, a solution found, and the mechanics of the solution were studied.
I am working with the Gensim library to train some data files using doc2vec. While trying to test the similarity of one of the files using the method model.docvecs.most_similar("file"), I always get all the results above 91% with almost no difference between them (which is not logical), because the files do not have similarities between them. So the results are inaccurate.
Here is the code for training the model:
import gensim

model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025,
                              min_alpha=0.00025, dm=1)
model.build_vocab(it)

for epoch in range(100):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

model.save('doc2vecs.model')

model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print sim
This is the output result:
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
What am I doing wrong? How could I improve the accuracy of the results?
What is your it data, and how is it prepared? (For example, what does print(iter(it).next()) do, especially if you call it twice in a row?)
By calling train() 100 times, and also retaining the default model.iter of 5, you're actually making 500 passes over the data. And the first 5 passes will use train()'s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lower each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
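For example, one way to restructure that training along those lines (a sketch only, using gensim 4.x parameter names, and assuming it is an iterable of TaggedDocument objects, which the question above about your it data would confirm):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=300, dm=1, min_count=2, epochs=20)  # illustrative values
model.build_vocab(it)
# A single train() call; the alpha schedule is handled internally
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vecs.model')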
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, so keeping them during training serves as noise, making the model less strong.
Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too few for the model to learn the differences.
Also, check the contents of the documents against the corpus you are using, and make sure the vocabulary is relevant to your files.
My current doc2vec code is as follows.
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4, iter=20)
I also have a word2vec code as below.
# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample=1e-3, sg=1, iter=20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused either to use hierarchical softmax or negative sampling. Please let me know what are the differences in these two methods.
Also, I am interested in knowing what are the parameters that need to be changed to use hierarchical softmax AND/OR negative sampling with respect to dm, DBOW, Skip-gram and CBOW?
P.s. my application is a recommendation system :)
Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
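In gensim terms, the four combinations look roughly like this (sentences and tagged_docs are placeholder corpora; all other parameters are left at their defaults):

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec

w2v_skipgram = Word2Vec(sentences, sg=1)                  # skip-gram
w2v_cbow = Word2Vec(sentences, sg=0)                      # CBOW (the default)
d2v_dm = Doc2Vec(tagged_docs, dm=1)                       # PV-DM (the default)
d2v_dbow = Doc2Vec(tagged_docs, dm=0, dbow_words=1)       # PV-DBOW plus skip-gram word training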
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word's output node activation a little stronger, and the N 'wrong' words' output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculating N+1 nodes and ignoring the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative is specified, is negative=5, following the default in the original Google word2vec.c code.
With hierarchical-softmax, instead of every predictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is a matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check whether an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpuses tend to trend towards preferring negative-sampling.
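Putting the above together, a hedged sketch of the two usual configurations (sentences is a placeholder corpus; other parameters are left at their defaults):

from gensim.models import Word2Vec

# Negative sampling (the default): 5 'wrong' words checked per training example
model_ns = Word2Vec(sentences, sg=1, negative=5, hs=0)

# Hierarchical softmax instead: enable hs and disable negative sampling
model_hs = Word2Vec(sentences, sg=1, hs=1, negative=0)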