The gensim.models.Word2Vec class has a predict_output_word() method. I'm using a pre-trained model, but it was saved as a gensim.models.KeyedVectors. Does that class have an analogous method? Or how can I get an instance of gensim.models.Word2Vec from a gensim.models.KeyedVectors?
I know about most_similar(), but that's something different.
A KeyedVectors instance holds only the words and vectors themselves, not the full model with the internal weights that were important for training (and for the internal predictions made during training).
So a KeyedVectors object lacks the state required to make predictions, and thus also lacks the method. (Note also that the method is relatively expensive to run, only works with negative-sampling models, and doesn't give results weighted quite the same as the 'sparse' semi-predictions made internally during training. The point of word2vec isn't really accurate neighbor prediction, but using attempts at such predictions to bootstrap an arrangement of vectors that has other useful properties.)
If you're training the word-vectors yourself, save the full model if you need full-model functionality later.
There's no way to turn a KeyedVectors object, or a mere set of word-vectors, into a full Word2Vec model.
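For concreteness, here's a minimal sketch (toy corpus; gensim 4.x parameter names assumed) contrasting the two save/load paths, and why only the full model supports predict_output_word():

```python
from gensim.models import Word2Vec, KeyedVectors

sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, negative=5, epochs=20)

# Full model: keeps the training state, so predict_output_word() is available.
model.save("w2v.model")
reloaded = Word2Vec.load("w2v.model")
print(reloaded.predict_output_word(["quick", "brown"], topn=5))

# KeyedVectors only: just the word-vectors, so lookups/similarities work...
model.wv.save("w2v.kv")
kv = KeyedVectors.load("w2v.kv")
print(kv.most_similar("fox"))
# ...but there's no kv.predict_output_word() - that state was discarded.
```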
Related
I have a couple of issues regarding Gensim's Word2Vec model.
The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done, so they will be random every time, correct?
The second concerns the wv object; the doc page says:
This object essentially contains the mapping between words and embeddings.
After training, it can be used directly to query those embeddings in various ways.
See the module level docstring for examples.
But that is not clear to me. Allow me to explain: I have my own pre-created word vectors, which I substitute in with
word2vecObject.wv['word'] = my_own
Then I call the train method with those replacement word vectors. But I would like to know which part I am replacing: is it the input-to-hidden weight layer or the hidden-to-output? This is to check whether it can be called pre-training or not. Any help? Thank you.
I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)
However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)
The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)
So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg (or .syn1 for HS mode) property.
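A rough sketch of that usual approach (gensim 4.x API assumed; my_vectors is a hypothetical dict of your pre-made vectors, keyed by word):

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["some", "tokenized", "text"], ["more", "tokenized", "text"]]
my_vectors = {"tokenized": np.random.rand(100).astype(np.float32)}  # stand-in for your own vectors

model = Word2Vec(vector_size=100, min_count=1, negative=5)  # no corpus supplied yet
model.build_vocab(sentences)  # step 1: allocate vocabulary & weight arrays

# Step 1.5: tamper before training. Assigning into .wv replaces only the
# "input projection" vectors - the usual "word vectors".
for word, vec in my_vectors.items():
    if word in model.wv:
        model.wv[word] = vec

# The hidden-to-output weights live separately (negative-sampling mode):
print(model.syn1neg.shape)

# Step 2: train as usual.
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
```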
I am working with a steadily growing corpus. I train my document vectors with Doc2Vec, which is implemented in Python.
Is it possible to update a Document Vector?
I want to use the Document Vector for Document recommendations.
Individual vectors can be updated, but the gensim Doc2Vec model class doesn't have much support for adding more doc-vectors to itself.
It can, however, return individual vectors for new texts that are compatible (comparable) with the existing vectors, via the .infer_vector(words) method. You can retain these vectors in your own data structures for lookup.
When enough new documents have arrived that you think your core model would be better, if trained on all documents, you can re-train the model with all available data, using it as the new base for .infer_vector(). (Note that vectors from the retrained model won't usually be compatible/comparable with those from the prior model: each training session bootstraps a different self-consistent coordinate space.)
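A minimal sketch of that pattern (toy corpus; gensim 4.x API assumed):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["some", "old", "document"], tags=["doc0"]),
          TaggedDocument(words=["another", "old", "document"], tags=["doc1"])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# A new document arrives: infer a compatible vector and keep it in your own lookup.
new_vecs = {"doc2": model.infer_vector(["a", "brand", "new", "document"])}

# Comparable against the trained doc-vectors, e.g. for recommendations:
print(model.dv.most_similar([new_vecs["doc2"]], topn=2))
```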
I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data after I am done creating a W2V model with my train data?
Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process, such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)
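A hedged sketch of that analogy evaluation in gensim ("my_word2vec.model" is a hypothetical path to a model trained on your real corpus; questions-words.txt is the Google analogy test set bundled with gensim's test data):

```python
from gensim.models import Word2Vec
from gensim.test.utils import datapath

model = Word2Vec.load("my_word2vec.model")  # hypothetical previously-trained model
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))

print("overall analogy accuracy:", score)
for section in sections:
    print(section["section"], len(section["correct"]), len(section["incorrect"]))
```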
I want to get the accuracy of the Doc2Vec model implemented in Python.
I saw in the official documentation that there is a method to get the accuracy, which takes a file as a parameter. What should the content of that input file be?
I tried to put in 4-tuples as the documentation says, but all the patterns come back misclassified.
There's no simple measurement of a Doc2Vec model's accuracy – you'd need an evaluation method that's custom to your corpus and project goals.
The accuracy() method on Word2Vec, also inherited by Doc2Vec, does a very narrow kind of analogy-testing, using word-vectors only, because the same method was used in the original word2vec paper and original Google word2vec.c toolkit. You can see the test-files they used, questions-words.txt and questions-phrases.txt, in a Github mirror of the Google word2vec-toolkit.
Since some Doc2Vec modes generate word-vectors, you could do this sort of analogy test on those Doc2Vec models – but it doesn't check the document-vectors at all, and a model that does well on those word-analogies might not be best for whatever your downstream document task is.
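If you do want that check, a hedged sketch (in current gensim the method is named evaluate_word_analogies(), with accuracy() as the older name; "my_doc2vec.model" is a hypothetical path to your trained model):

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.test.utils import datapath

# Only meaningful for modes that train word-vectors (PV-DM, or PV-DBOW with dbow_words=1);
# the doc-vectors themselves are never exercised by this test.
d2v = Doc2Vec.load("my_doc2vec.model")
score, _ = d2v.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print("word-analogy accuracy (word-vectors only):", score)
```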
I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this in scikit-learn? If not, what is a good free machine-learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore, as opposed to user-provided constructor parameters (a.k.a. hyperparameters), which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have arrays of support vectors, dual coefficients and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
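A short sketch of the trailing-underscore convention on a fitted SVC (toy data; the attribute names are those documented in the SVC docstring linked above):

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

clf = SVC(kernel="linear", C=1.0)  # C and kernel: user-set hyperparameters, no underscore
clf.fit(X, y)

print(clf.support_vectors_)  # learned: the support vectors
print(clf.dual_coef_)        # learned: dual coefficients
print(clf.intercept_)        # learned: intercept(s)
```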
For tree-based models you also have a helper function to generate a Graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
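For instance, a hedged sketch using sklearn.tree.export_graphviz (on the iris toy dataset) to dump a fitted decision tree in Graphviz .dot format:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

with open("iris_tree.dot", "w") as f:
    export_graphviz(tree, out_file=f, feature_names=iris.feature_names)
# Render with Graphviz: dot -Tpng iris_tree.dot -o iris_tree.png
```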
To find the importance of features in forest models you should also have a look at the compute_importances parameter; see the following examples, for instance:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
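A hedged sketch of reading those importances off a fitted forest (note: in current scikit-learn releases the compute_importances constructor flag is gone, and the importances are always exposed via the fitted feature_importances_ attribute):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```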