I trained a doc2vec model using train(..) with default settings. That worked, but now I'm wondering how infer_vector() combines across input words – is it just the average of the individual word vectors?
model.random.seed(0)
model.infer_vector(['cat', 'hat'])
model.random.seed(0)
model.infer_vector(['cat'])
model.infer_vector(['hat']) #doesn't average up to the ['cat', 'hat'] vector
model.random.seed(0)
model.infer_vector(['hat'])
model.infer_vector(['cat']) #doesn't average up to the ['cat', 'hat'] vector
Those don't add up, so I'm wondering what I'm misunderstanding.
infer_vector() doesn't combine the vectors for your given tokens – and in some modes doesn't consider those tokens' vectors at all.
Rather, it considers the entire Doc2Vec model as being frozen against internal changes, and then assumes the tokens you've provided are an example text, with a previously untrained tag. Let's call this implied but unnamed tag X.
Using a training-like process, it tries to find a good vector for X. That is, it starts with a random vector (as it did for all tags in original training), then sees how well that vector as model-input predicts the text's words (by checking the model neural-network's predictions for input X). Then via incremental gradient descent it makes that candidate vector for X better and better at predicting the text's words.
After enough such inference-training, the vector will be about as good (given the rest of the frozen model) as it possibly can be at predicting the text's words. So even though you're providing that text as an "input" to the method, inside the model, what you've provided is used to pick target "outputs" of the algorithm for optimization.
Note that:
tiny examples (like one or a few words) aren't likely to give very meaningful results – they are sharp-edged corner cases, and the essential value of these sorts of dense embedded representations usually arises from the marginal balancing of many word-influences
it will probably help to do far more training-inference cycles than the infer_vector() default steps=5 – some have reported tens or hundreds of steps work best for them, and it may be especially valuable to use more steps with short texts
it may also help to use a starting alpha for inference more like that used in bulk training (alpha=0.025), rather than the infer_vector() default (alpha=0.1) – see the sketch just after this list
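For example, a minimal sketch of inference with those adjustments, assuming model is an already-trained Doc2Vec model (in older gensim the parameter is steps; in gensim 4.x it is epochs):

# a minimal sketch: more inference passes and a bulk-training-like starting alpha
# (`model` is assumed to be an already-trained Doc2Vec model)
tokens = ['cat', 'hat']
vec = model.infer_vector(tokens, alpha=0.025, steps=100)  # use epochs=100 in gensim 4.x+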
Related
I am using the Doc2Vec model from the gensim (4.1.2) Python library.
I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried to use infer_vector() on the same text, but I get a totally different vector. What is wrong?
Here is an example of the code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load the saved model:
fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....
First, there's a natural amount of variance from one run of infer_vector() to another, that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both training & inference are processes that work in proportion to the number of words in a text. So a 100-word text that goes through inference to find a new vector will get 50x more 'adjustment nudges' than a mere 2-word text – and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of the vagaries of the random initialization, and random sampling during incremental adjustment, than a longer text.)
Finally, often other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the few top, neighbors of the same text's vector from bulk training.)
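For instance, a rough sketch of that re-inference self-check, assuming model is a trained Doc2Vec model and train_corpus is the list of TaggedDocument objects it was trained on (both names are placeholders):

import collections

ranks = []
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    # rank of this doc's own bulk-trained vector among neighbors of the re-inferred vector
    sims = model.dv.most_similar([inferred], topn=len(model.dv))  # model.docvecs in pre-4.0 gensim
    rank = [tag for tag, sim in sims].index(doc.tags[0])
    ranks.append(rank)

# a healthy model usually reports most training docs as their own nearest neighbor (rank 0)
print(collections.Counter(ranks).most_common(5))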
One possible error is too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long. With less data – and possibly too many vector_size dimensions for tiny training data – models will be 'looser' or more arbitrary when modeling new data.)
Another very common error is to follow some of the bad tutorials online which include calling .train() many times in your own training loop while (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
I have around 20k documents with 60-150 words each. Out of these 20k documents, there are 400 documents for which similar documents are known. These 400 documents serve as my test data.
At present I am removing those 400 documents and using the remaining 19,600 documents for training the doc2vec model. Then I extract the vectors of the train and test data. Now for each test document, I find its cosine distance to all 19,600 train documents and select the top 5 with the least cosine distance. If the marked similar document is present in this top 5, I take the result to be accurate. Accuracy % = No. of accurate records / Total number of records.
The other way I find similar documents is by using Doc2Vec's most_similar() method, then calculating accuracy using the above formula.
The two accuracies above don't match: with each epoch, one increases while the other decreases.
For training the Doc2Vec model, I am using the code given here: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.
I would like to know how to tune the hyperparameters so that I can maximize accuracy by the above-mentioned formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar() function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.
Instead, don't change the default min_alpha; call train() just once with the desired epochs, and let the method smoothly manage the alpha decay itself.
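A minimal sketch of that recommended pattern, where corpus is a placeholder for your iterable of TaggedDocument objects (parameter names follow gensim 4.x; older versions use size/iter):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(corpus)
# one train() call; gensim decays alpha from its default start to min_alpha internally
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)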
Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior ideas of what good results look like, then try a wide range of model parameters and pick the one that scores best.
When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query-point against all known doc-vectors, and returns those with the greatest cosine-similarity. Those should be identical to those that you've calculated to have the least cosine-distance. If you added your exact code to your question – how you're calculating cosine-distances, and how you're calling most_similar() – then it would probably become clear what subtle differences or errors are causing the discrepancy. (There shouldn't be any essential difference, but given that, you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array-library operations that are probably faster than whatever loop you've authored.)
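As an illustration of why the two rankings should agree, here is a sketch comparing a manual cosine-similarity ranking against most_similar(), assuming model is a trained Doc2Vec model and query_vec is one of your test-document vectors (both names are placeholders; attribute names follow gensim 4.x):

import numpy as np

# manual ranking: cosine similarity of the query against every bulk-trained doc-vector
doc_vecs = model.dv.vectors
sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
manual_top5 = [model.dv.index_to_key[i] for i in np.argsort(-sims)[:5]]

# gensim's own ranking – the two top-5 lists should be the same
gensim_top5 = [tag for tag, sim in model.dv.most_similar([query_vec], topn=5)]
print(manual_top5, gensim_top5)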
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
Checking if desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, that was used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is for each such pair, also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than the 3rd randomly-chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results-page, or same category, were checked to see if they were 'closer' inside a model to each other than other random docs.
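A rough sketch of that pair-vs-random-third-doc scoring, assuming model was trained on all docs and known_pairs is a list of (tag_a, tag_b) tuples of known-similar document tags (placeholder names; gensim 4.x attribute naming):

import random

all_tags = list(model.dv.index_to_key)
correct = 0
for tag_a, tag_b in known_pairs:
    random_tag = random.choice(all_tags)
    while random_tag in (tag_a, tag_b):
        random_tag = random.choice(all_tags)
    # count as accurate if the known-similar pair is closer than the random third doc
    if model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, random_tag):
        correct += 1

print(correct / len(known_pairs))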
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.
I've got a problem/question with Word2Vec
As I understand it: let's train a model on a corpus of text (in my case it's a corpus of ~2 GB).
Let's take one line from this text and calculate a vector for this line (the line's vector = the sum of its word vectors). It will be something like this:
coords = np.zeros(model.vector_size)  # assuming numpy has been imported as np
for w in words:
    coords += model[w]
Then let's calculate the length of this vector, with the standard library, as:
import numpy as np
vectorLen = np.linalg.norm(coords)
Why do we need Word2Vec? Yes, for converting words to vectors AND for contextual proximity (words that occur near each other, and words that are close in meaning, get similar coordinates)!
And what I want (what I expect) is this: if I take some line of the text and add some word from the dictionary which is not typical for this line, and then again calculate the length of this vector, I should get a quite different value than if I calculate only the vector of this line, without adding uncharacteristic words from the dictionary.
But in fact, the lengths of these vectors (before and after adding the word(s)) are quite similar! Moreover, they are practically the same! Why am I getting this result?
If I understand correctly, the coordinates of the line's own words will be quite similar (contextual proximity), but the new words will have rather different coordinates, and that should affect the result (the vector length of the line with the new words)!
For example, these are my W2V model settings:
#Word2Vec model
model = gensim.models.Word2Vec(
sg=0,
size=300,
window=3,
min_count=1,
hs=0,
negative=5,
workers=10,
alpha=0.025,
min_alpha=0.025,
sample=1e-3,
iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
#train model
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)
OR this:
#Word2Vec model
model = gensim.models.Word2Vec(
sg=1,
size=100,
window=10,
min_count=1,
hs=0,
negative=5,
workers=10,
alpha=0.025,
min_alpha=0.025,
seed=7,
sample=1e-3,
hashfxn=hash,
iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
What's the problem? And how can I get necessary result?
Why do you need the "vector length" to noticeably change, as a "desired result"?
The length of word-vectors (or sums of same) isn't usually of major interest. In fact, it's common to normalize the word-vectors to unit-length before doing comparisons. (And sometimes, when doing sums/averages as a simple way to create vectors for runs-of-multiple-words, the vectors might be unit-normalized before or after such an operation.)
Instead, it's usually the direction (angle) that's of most interest.
Further, what do you mean when describing the length values as "quite similar"? Without showing the actual lengths you've seen in your tests, it's unclear if your intuitions about what the change "should" be are correct.
Note that in multi-dimensional spaces – and especially high-dimensional spaces - our intuitions are quite often wrong.
For example, try adding a bunch of pairs of random unit vectors in 2d space, and looking at the norm length of the sum. As you might expect, you'll likely see varied results that range from nearly 0.0 to nearly 2.0 – representing moving closer or further to the origin.
Try instead adding a bunch of pairs of random unit vectors in 500d space. Now, the norm length of the sum is going to almost always be close to 1.4. Essentially, with 500 directions to go, most sums won't significantly move closer or further to the origin, even though they still move 1.0 away from either vector individually.
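If you want to see this for yourself, here's a quick numpy experiment along those lines:

import numpy as np

def random_unit_vectors(n, dim):
    v = np.random.randn(n, dim)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for dim in (2, 500):
    sums = random_unit_vectors(10000, dim) + random_unit_vectors(10000, dim)
    norms = np.linalg.norm(sums, axis=1)
    # in 2d the norms range widely between ~0.0 and ~2.0; in 500d they cluster near 1.414
    print(dim, round(norms.min(), 3), round(norms.mean(), 3), round(norms.max(), 3))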
You're likely observing the same thing with your word-vectors. They're fine, but the measure you've chosen to take – the norm of a vector sum – just doesn't change the way you'd expect, in a high-dimensional space.
Separately, unrelated to your main issue, but about your displayed word2vec parameters:
You might think using a non-default min_count=1, by retaining more words/information, results in better vectors. However, it usually hurts word-vector quality to retain such rare words. Word-vector quality requires many varied examples of word usage. Words with just 1, or a few, examples don't get good vectors from those few idiosyncratic usage examples, but do serve as training noise/interference in the improvement of other word-vectors with more examples.
Usual stochastic-gradient-descent optimization relies on the alpha learning-rate decaying to a negligible value over the course of training. Setting the ending min_alpha to the same value as the starting alpha thwarts this. (In general, most users shouldn't change either of the alpha parameters, and if they need to tinker at all, changing the starting value makes more sense.)
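For reference, a sketch of more conventional parameter choices – leaving the alpha schedule at its defaults and discarding the rarest words – where sentences is a placeholder for your tokenized corpus (parameter names follow gensim 4.x; older versions use size and iter):

import gensim

model = gensim.models.Word2Vec(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,    # drop words with too few usage examples to get good vectors
    epochs=20,
    workers=10,
)   # passing the corpus to the constructor builds the vocab and trains in one step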
My current doc2vec code is as follows.
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20)
I also have a word2vec code as below.
# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample = 1e-3, sg=1, iter = 20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused about whether to use hierarchical softmax or negative sampling. Please let me know what the differences between these two methods are.
Also, I am interested in knowing which parameters need to be changed to use hierarchical softmax and/or negative sampling with respect to DM, DBOW, skip-gram and CBOW.
P.s. my application is a recommendation system :)
Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
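In gensim terms, the mode switches look roughly like this (other parameters omitted; sentences and docs are placeholders for your word2vec and doc2vec corpora):

from gensim.models import Word2Vec, Doc2Vec

w2v_sg   = Word2Vec(sentences, sg=1)          # skip-gram
w2v_cbow = Word2Vec(sentences, sg=0)          # CBOW (the default)

d2v_dm   = Doc2Vec(docs, dm=1)                # PV-DM (the default)
d2v_dbow = Doc2Vec(docs, dm=0)                # PV-DBOW
d2v_mix  = Doc2Vec(docs, dm=0, dbow_words=1)  # PV-DBOW plus skip-gram word-training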
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word's output node activation a little stronger, and the N 'wrong' words' output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculating N+1 nodes and ignoring the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative specified, is negative=5, following the default in the original Google word2vec.c code.
With hierarchical-softmax, instead of every predictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is a matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check if an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpora tend to prefer negative-sampling.
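As a sketch, the output-layer choice comes down to just these two parameters, for either model class (docs is a placeholder corpus of TaggedDocument objects):

from gensim.models import Doc2Vec

d2v_ns = Doc2Vec(docs, negative=5, hs=0)  # negative sampling (the default, negative=5)
d2v_hs = Doc2Vec(docs, hs=1, negative=0)  # hierarchical softmax, negative sampling off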
I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to assess the semantic similarity of questions and identify duplicates.
The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.
But the original paper, Doc2Vec-Paper, claims that the method can be used to infer fixed-length vectors of paragraphs/documents.
Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?
Since a question can sometimes span multiple sentences, I thought that during training I would give sentences arising from the same question the same tags – but then how would I do this with infer_vector() on unseen questions?
And this notebook : Doc2Vec-Notebook
seems to be training vectors on TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?
Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.
The words are an ordered sequence of string tokens of the text – they might be a single sentence worth, or a paragraph, or a long document, it's up to you.
The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.
You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
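Putting it together, a minimal sketch of the whole flow – one TaggedDocument per question, then inference on an unseen question – where question_texts is a placeholder for your already-tokenized questions (gensim 4.x naming; older versions use size/steps):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(question_texts)]

model = Doc2Vec(corpus, vector_size=100, min_count=2, epochs=20)

# inference on an unseen question: supply only the tokens, no tag
new_vec = model.infer_vector(['how', 'do', 'i', 'infer', 'paragraph', 'vectors'])
similar_questions = model.dv.most_similar([new_vec], topn=5)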
Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).
The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available at training time to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known labels are not used when training the classifier which uses the doc-vectors.)
(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices – in particular, the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)