I am working with the Gensim library to train some data files using doc2vec. While trying to test the similarity of one of the files using the method model.docvecs.most_similar("file"), I always get all results above 91% with almost no difference between them (which is not logical), because the files have no similarities between them. So the results are inaccurate.
Here is the code for training the model
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.00025, dm=1)
model.build_vocab(it)

for epoch in range(100):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

model.save('doc2vecs.model')

model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print(sim)
**This is the output:**
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
What am I doing wrong? How could I improve the accuracy of the results?
What is your it data, and how is it prepared? (For example, what does print(next(iter(it))) do, especially if you call it twice in a row?)
By calling train() 100 times, and also retaining the default model.iter of 5, you're actually making 500 passes over the data. And the first 5 passes will use train()'s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lowered with each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
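For reference, here is a minimal sketch of that single-train() pattern. It assumes gensim 4.x (where the epochs parameter replaced iter) and that it is a restartable iterable of TaggedDocument objects; neither detail is shown in the question.

```python
import gensim

# Let Doc2Vec manage the learning-rate decay from alpha down to min_alpha itself.
model = gensim.models.Doc2Vec(vector_size=300, min_count=2, dm=1, epochs=20)
model.build_vocab(it)   # 'it' must be re-iterable, not a one-shot generator
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vecs.model')
```

Even with this cleaned-up training, a handful of short files won't yield meaningful similarities, for the dataset-size reasons below.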
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, each at least dozens, and preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes and smaller vectors. Also, the simpler PV-DBOW mode (dm=0) may help with smaller corpora/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
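As a side illustration (a sketch, not part of the original answer), the number most_similar() reports is just the cosine of the angle between two stored doc-vectors, which you could recompute yourself:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths:
    # ranges from -1.0 (opposite directions) to 1.0 (same direction).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g. with a gensim-4.x model, where doc-vectors live in model.dv:
# cosine_similarity(model.dv['file1.txt'], model.dv['file2.txt'])
```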
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that can't get strong vectors without many more examples, so keeping them during training acts as noise that makes the model weaker.
Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too small for the model to learn the differences.
Also, check the contents of the documents against the corpus you are using, and make sure the vocabulary is relevant to your files.
I am using the Doc2Vec model from gensim (4.1.2) python library.
I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried infer_vector() on the same text, but I get a totally different vector. What is wrong?
Here is an example of the code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load the saved model:
fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....
First, there's a natural amount of variance from one run of infer_vector() to another; that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load in between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both training & inference are processes that work in proportion to the number of words in a text. So a 100-word text that goes through inference to find a new vector will get 50x more 'adjustment nudges' than a mere 2-word text, and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of the vagaries of random initialization, and random sampling during incremental adjustment, than a longer text.)
Finally, other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close when parameters are good & training is sufficient. (In fact, one indirect way to test whether a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the few top, neighbors of that same text's vector from bulk training.)
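A minimal sketch of that self-consistency check, assuming a gensim-4.x Doc2Vec model and a list train_corpus of TaggedDocument objects (both names are illustrative, not from the question):

```python
# For each training document, re-infer a vector from its words and check whether
# the bulk-trained vector for the same tag is among its nearest neighbors.
hits = 0
for doc in train_corpus:
    inferred = model.infer_vector(doc.words, epochs=model.epochs)
    top = model.dv.most_similar([inferred], topn=3)   # list of (tag, cosine-similarity) pairs
    if doc.tags[0] in [tag for tag, _ in top]:
        hits += 1
print(f"{hits} of {len(train_corpus)} docs are near-neighbors of their own re-inferred vectors")
```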
One possible error is too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long. With less data – and possibly too many vector_size dimensions for tiny training data – models will be 'looser' or more arbitrary when modeling new data.)
Another very common error is to follow one of the bad online tutorials that call .train() many times in an explicit training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
I am curious to know if there are any implications of using a different source while calling the build_vocab and train of Gensim FastText model. Will this impact the contextual representation of the word embedding?
My intention for doing this is that there is a specific set of words I am interested in getting vector representations for, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result of this to decide whether to group those words as relevant to each other based on a similarity threshold.
Following is the code snippet that I am using; I'd appreciate your thoughts on any concerns or implications with this approach.
vocab.txt contains a list of unique words of interest
corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow up question to this is what values should I set for total_examples & total_words during training in this case?
from gensim.models.fasttext import FastText

model = FastText(min_count=1, vector_size=300)

corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'

# Unsure if the counts below should be based on the training corpus or the vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)

# build the vocabulary
model.build_vocab(corpus_file=vocab_path)

# train the model
model.train(corpus_file=corpus_path, epochs=100,
            total_examples=corpus_count, total_words=total_words)

# save the model
model.save(f'models/gensim-fastext-model-{client}')
In case someone has a similar question, I'll paste the reply I got when asking this question in the Gensim Discussion Group, for reference:
You can try it, but I wouldn't expect it to work well for most purposes.

The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus.

If you then supply another corpus – & especially one with more words – then:
You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training-corpus.
Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure – it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.

You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.
More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.

If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly-linear speedup from the number of cores. (In traditional corpus_iterable mode, max throughput is usually reached somewhere around 6-12 worker threads, as long as you have that many cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words, when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)
If your corpus is so large that training time is a concern, often a more-aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).
And again, if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus. (In general, larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)
-- Gordon
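To make the reply's first points concrete, here is a hedged sketch (not from the discussion thread) of building the vocabulary and training from the same corpus, with counts that actually describe that corpus. It reuses the asker's get_lines_count/get_words_count helpers and client variable, and assumes the corpus file has already been filtered (or left unfiltered) as decided above:

```python
import os
from gensim.models.fasttext import FastText

corpus_path = f'data/{client}-corpus.txt'

# Both counts must describe the file actually passed to train(),
# not the separate vocab.txt word list.
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)

# Default min_count=5 usually beats min_count=1; corpus_file mode can use one worker per core.
model = FastText(vector_size=300, workers=os.cpu_count())
model.build_vocab(corpus_file=corpus_path)
model.train(corpus_file=corpus_path, epochs=5,   # far fewer than 100 usually suffices on a large corpus
            total_examples=corpus_count, total_words=total_words)
model.save(f'models/gensim-fastext-model-{client}')
```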
I'm testing feeding gensim's Word2Vec different sentences with the same overall vocabulary to see if some sentences carry "better" information than others. My method to train Word2Vec looks like this
def encode_sentences(self, w2v_params, sentences):
    model = Word2Vec(sentences, **w2v_params)
    idx_order = torch.tensor([int(i) for i in model.wv.index2entity], dtype=torch.long)
    X = torch.zeros((idx_order.max() + 1, w2v_params['size']), dtype=torch.float)

    # Put embeddings back in order
    X[idx_order] = torch.tensor(model.wv.vectors)

    return X
What I'm hoping for here, is each time w2v runs, it starts with a fresh model and trains from scratch. However, I'm testing 3 kinds of sentences, so my test code looks like this:
def test(sentence):
    w2v = {'size': 128, 'sg': 1}
    X = encode_sentences(w2v, sentence)
    evaluate(X)  # Basic cluster analysis stuff here

# s1, s2 and s3 are the 3 sets of sentences with the same vocabulary in different order/frequency
[print(test(s)) for s in [s1, s2, s3]]
However, I noticed if I remove one of the test sets, and only test s1 and s2 (or any combination of 2 sets of the three), the overall quality of the clusterings decreases. If I go back into encode_sentences and add del model before the return call, the overall cluster quality also goes down but remains consistent no matter how many datasets are tested.
What gives? Is the constructor not actually building a fresh model each time with new weights? The docs and source code give no indication of this. I'm quite sure it isn't my evaluation method, as everything was fixed after the del model was added. I'm at a loss here... Are these runs actually independent, or is each call to Word2Vec(foo, ...) equivalent to retraining the previous model with foo as new data?
And before you ask: no, model does not appear anywhere outside the scope of encode_sentences; that's the only place the variable name is used in the whole program. Very odd.
Edit with more details
If it's important, I'm using Word2Vec to build node embeddings on a graph the way Node2Vec does, with different walk strategies. These embeddings are then fed to a Logistic Regression model (evaluate(X)), which calculates the area under the ROC curve.
Here is some sample output of the model before adding the del model call to the encode_sentences method averaged over 5 trials:
Random walks: 0.9153 (+/-) 0.002
Policy walks: 0.9125 (+/-) 0.005
E-greedy walks: 0.8489 (+/-) 0.011
Here is the same output with the only difference being del model in the encoding method:
Random walks: 0.8627 (+/-) 0.005
Policy walks: 0.8527 (+/-) 0.002
E-greedy walks: 0.8385 (+/-) 0.009
As you can see, in each case, the variance is very low (the +/- value is the standard error) but the difference between the two runs is almost a whole standard deviation. It seems odd that if each call to Word2Vec was truly independent that manually freeing the data structure would have such a large effect.
Each call to the Word2Vec() constructor creates an all-new model.
However, runs are not completely deterministic under normal conditions, for a variety of reasons, so results quality for downstream evaluations (like your unshown clustering) will jitter from run-to-run.
If the variance in repeated runs with the same data is very large, there are probably other problems, such as an oversized model prone to overfitting. (Stability from run-to-run can be one indicator that your process is sufficiently specified that the data and model choices are driving results, not the randomness used by the algorithm.)
If this explanation isn't satisfying, try adding more info to your question - such as the actual magnitude of your evaluation scores, in repeated runs, both with and without the changes that you conjecture are affecting results. (I suspect the variations from the steps you think are having effect will be no larger than variations from re-runs or different seed values.)
(More generally, Word2Vec is generally hungry for as much varied training data as possible; only if texts are non-representative of the relevant domain are they likely to result in a worse model. So I generally wouldn't expect being choosier about which subset of sentences is best to be an important technique, unless some of the sentences are total junk/noise, but of course there's always a chance you'll find some effects in your particular data/goals.)
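One way to test this empirically is to repeat the whole train-and-evaluate cycle several times on the same sentences and compare the spread of scores to the gap you're attributing to del model. A sketch, with evaluate_embeddings standing in for your own downstream scoring (an illustrative name, not from the question) and gensim 4.x parameter names (vector_size rather than the older size):

```python
import statistics
from gensim.models import Word2Vec

def run_once(sentences, seed):
    # A brand-new Word2Vec model is built on every call; only the seed differs.
    model = Word2Vec(sentences, vector_size=128, sg=1, seed=seed)
    return evaluate_embeddings(model)   # stand-in for your clustering/AUC evaluation

scores = [run_once(s1, seed) for seed in range(5)]
print(statistics.mean(scores), statistics.stdev(scores))
```

If that seed-to-seed spread is comparable to the difference you saw with and without del model, the randomness, not the deletion, is the likelier explanation.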
I have around 20k documents with 60-150 words each. Out of these 20K documents, there are 400 documents for which similar documents are known. These 400 documents serve as my test data.
At present I am removing those 400 documents and using the remaining 19600 documents for training the doc2vec model. Then I extract the vectors of the train and test data. Now, for each test document, I find its cosine distance to all 19600 train documents and select the top 5 with the least cosine distance. If the marked similar document is present in this top 5, I count the record as accurate. Accuracy% = No. of accurate records / Total number of records.
The other way I find similar documents is by using doc2vec's most_similar method, then calculating accuracy using the above formula.
The two accuracies don't match: with each epoch, one increases while the other decreases.
I am using the code given here for training the Doc2Vec model: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.
I would like to know how to tune the hyperparameters so that I can get maximum accuracy by the above-mentioned formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.
Instead, don't change the default min_alpha, and call train() just once with the desired epochs, and let the method smoothly manage the alpha itself.
Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior idea of what good results look like, then try a wide range of model parameters and pick the one that scores best.
When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query-point against all known doc-vectors, and returns those with the greatest cosine-similarity. Those should be identical to the ones you've calculated to have the least cosine-distance. If you added your exact code to your question – how you're calculating cosine-distances, and how you're calling most_similar() – it would probably become clear what subtle difference or error is causing the discrepancy. (There shouldn't be any essential difference, but given the discrepancy, you'll likely want to use the most_similar() results, because they're known non-buggy and use efficient bulk array library operations that are probably faster than whatever loop you've authored.)
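As a sketch of the most_similar()-based path for a held-out document (assuming gensim 4.x, where trained doc-vectors live in model.dv; test_words is an illustrative name for the tokenized test document):

```python
# Infer a vector for the unseen document, then let gensim rank all trained
# doc-vectors by cosine similarity in one efficient bulk operation.
inferred = model.infer_vector(test_words)
top5 = model.dv.most_similar([inferred], topn=5)   # [(doc_tag, cosine_similarity), ...]
```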
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
Checking whether desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is, for each such pair, to also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than the third, randomly-chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results page, or the same category, were checked to see if they were 'closer' inside a model to each other than to other random docs.
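A sketch of that pair-versus-random scoring, assuming gensim 4.x and a list similar_pairs of (doc_a, doc_b) tags known to be related (names are illustrative, not from the question):

```python
import random

correct = 0
for doc_a, doc_b in similar_pairs:
    # Pick any other trained doc-tag as the distractor; a fuller version would
    # exclude doc_a and doc_b themselves from the choice.
    random_doc = random.choice(model.dv.index_to_key)
    # Accurate when the known pair is closer together than doc_a is to the random doc.
    if model.dv.similarity(doc_a, doc_b) > model.dv.similarity(doc_a, random_doc):
        correct += 1
print(f"pairwise accuracy: {correct / len(similar_pairs):.3f}")
```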
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.
I am trying to apply the Word2Vec model implemented in gensim 3.6, using Python 3.7 on a Windows 10 machine. I have a list of sentences (each sentence is a list of words) as input to the model, after performing preprocessing.
I have computed the results (obtaining the 10 most similar words for a given input word using model.wv.most_similar), first in Anaconda's Spyder and then in the Sublime Text editor. But I am getting different results for the same source code executed in the two editors.
Which result should I choose, and why?
Below are screenshots of the results obtained by running the same code in both Spyder and Sublime Text. The input word for which I need to obtain the 10 most similar words is #universe#. I am really confused about how to choose between the results, and on what basis. Also, I have only recently started learning Word2Vec.
Any suggestion is appreciated.
Results Obtained in Spyder:
Results Obtained using Sublime Text:
The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. These mean that two runs, even in the exact same environment, can have different results.
If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual coordinates of the words will differ. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)
How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)
Are you trying to make very-large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncrasies of the data are memorized rather than generalized patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run-to-run, even though each model is doing about-as-well as the others on its internal word-prediction tasks.)
Have you stuck with the default epochs=5 even with a small dataset? (A large, varied dataset requires fewer training passes - because all words appear many times, all throughout the dataset, anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)
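If you want to confirm that run-to-run randomness, rather than the editor, explains the difference, you can make a run reproducible at the cost of training speed. A sketch using gensim 3.6-era parameter names (size, iter), with 'universe' standing in for your query token; note that PYTHONHASHSEED must be set in the environment before the Python interpreter starts:

```python
from gensim.models import Word2Vec

# A single worker thread removes thread-scheduling jitter; a fixed seed fixes the
# random initialization. Full determinism across launches also needs PYTHONHASHSEED.
model = Word2Vec(sentences, size=100, min_count=5, workers=1, seed=42, iter=5)
print(model.wv.most_similar('universe', topn=10))
```

Run this twice in Spyder and twice in Sublime Text: identical output confirms the earlier differences were just the expected randomization, not anything either editor is doing.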