Is it possible to update a Doc2Vec Vector? - python

I am working with a steadily growing corpus. I train my document vectors with Doc2Vec, which is implemented in Python.
Is it possible to update a document vector?
I want to use the document vectors for document recommendations.

Individual vectors can be updated, but the gensim Doc2Vec model class doesn't have much support for adding more doc-vectors to itself.
It can, however, return individual vectors for new texts that are compatible (comparable) with the existing vectors, via the .infer_vector(words) method. You can retain these vectors in your own data structures for lookup.
When enough new documents have arrived that you think your core model would be better if trained on all of them, you can re-train the model on all available data and use it as the new basis for .infer_vector(). (Note that vectors from the retrained model won't usually be compatible/comparable with those from the prior model: each training session bootstraps a different self-consistent coordinate space.)
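For example, a minimal sketch of that workflow, assuming an already-trained gensim Doc2Vec model on disk (the file name and the tokenized documents below are placeholders):

    # Infer vectors for new documents and keep them in your own lookup
    # structure, separate from the trained model.
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("my_doc2vec.model")  # placeholder path

    new_docs = {
        "doc_123": ["new", "article", "about", "gardening"],
        "doc_124": ["another", "incoming", "document"],
    }

    # Vectors comparable with the doc-vectors learned during training.
    extra_vectors = {doc_id: model.infer_vector(words)
                     for doc_id, words in new_docs.items()}

    # Later, once enough new documents have accumulated, re-train from
    # scratch on the full corpus and re-infer: vectors from the old and the
    # new model live in different coordinate spaces and can't be mixed.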

Related

Is it appropriate to train W2V model on entire corpus?

I have a corpus of free-text medical narratives which I am going to use for a classification task; right now I have about 4200 records.
To begin, I wish to create word embeddings using w2v, but I have a question about a train-test split for this task.
When I train the w2v model, is it appropriate to use all of the data for the model creation? Or should I only use the train data for creating the model?
Really, my question sort of comes down to: do I take the whole dataset, create the w2v model, transform the narratives with the model, and then split, or should I split, create w2v, and then transform the two sets independently?
Thanks!
EDIT
I found an internal project at my place of work which was built by a vendor; they create the split, and create the w2v model on ONLY the train data, then transform the two sets independently in different jobs; so it's the latter of the two options I specified above. This is what I thought would be the case, as I wouldn't want to contaminate the w2v model with any of the test data.
The answer to most questions like these in NLP is "try both" :-)
Contamination of test vs. train data is not a problem when generating word vectors; it only becomes an issue in the downstream model you use the vectors with. In my use cases I found performance to be better with whole-corpus vectors.
Word vectors improve in quality with more data. If you don't use the test corpus, you will need a method for initializing out-of-vocabulary vectors and for understanding the impact they may have on your model's performance.
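For illustration, a rough sketch of the "train-only" option with a zero-vector fallback for out-of-vocabulary words; the tiny corpus, the fallback policy, and the averaging step are assumptions added here, not something from the question:

    import numpy as np
    from gensim.models import Word2Vec

    # Tokenized splits (stand-ins for the real medical narratives).
    train_texts = [["chest", "pain", "on", "exertion"],
                   ["no", "acute", "distress"]]
    test_texts = [["acute", "chest", "tightness"]]

    # 'size' in gensim < 4.0; 'vector_size' in 4.x.
    w2v = Word2Vec(train_texts, size=100, min_count=1)

    def doc_vector(tokens, model):
        """Average the vectors of in-vocabulary tokens; zeros if none are known."""
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    # Test documents are transformed with the train-only model; "tightness"
    # is out of vocabulary here and simply gets skipped.
    X_test = np.vstack([doc_vector(toks, w2v) for toks in test_texts])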

Using pretrained Word2Vec model for sentiment analysis

I am using a pretrained Word2Vec model for tweets (https://www.fredericgodin.com/software/) to create vectors for each word. I will then compute the average of these and use a classifier to determine sentiment.
My training data is very large, and the pretrained Word2Vec model has been trained on millions of tweets with dimensionality = 400. My problem is that it is taking too long to look up vectors for the words in my training data. Is there a way to reduce the time taken to build the word vectors?
Cheers.
It's unclear what you mean by "too long".
Looking up individual word-vectors from a pre-existing model should be very fast: it's a simple in-memory lookup of the word to the array index (from a dict), then an access of that array-index.
If it's slow for you, perhaps you've loaded a model larger than your available RAM? In that case, operations might be relying on much slower virtual memory (paging working memory to and from disk). With these kinds of models, where access is very random across locations, you never ever want to do this. If it's happening, you should get more RAM or use a smaller model.
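As a point of comparison, here is a rough sketch of per-tweet averaging against a pretrained model loaded as KeyedVectors; the file name is a placeholder and the load arguments depend on the model's format:

    import numpy as np
    from gensim.models import KeyedVectors

    # Loading the big binary model is the slow part; do it exactly once.
    wv = KeyedVectors.load_word2vec_format("pretrained_tweets.bin", binary=True)

    def tweet_vector(tokens):
        """Average the 400-d vectors of known tokens; zeros if none are known."""
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    # Each lookup after that is just a dict hit plus an array access.
    X = np.vstack([tweet_vector(t.split()) for t in ["great day", "so tired"]])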

Gensim save_word2vec_format() vs. model.save()

I am using gensim version 0.12.4 and have trained two separate word embeddings using the same text and same parameters. After training I am calculating the Pearson correlation between word occurrence-frequency and vector length. One model I saved using save_word2vec_format(fname, binary=True) and then loaded using load_word2vec_format(); the other I saved using model.save(fname) and then loaded using Word2Vec.load(). I understand that the word2vec algorithm is non-deterministic, so the results will vary; however, the difference in the correlation between the two models is quite drastic. Which method should I be using in this instance?
EDIT: this was intended as a comment; I don't know how to change it now, sorry.
"correlation between the word occurrence-frequency and vector-length" - I don't quite follow: aren't all your vectors the same length? Or are you not referring to the embedding vectors?
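If "vector length" is meant as the L2 norm of each word's embedding (rather than its dimensionality), the correlation in question could be computed roughly like this; the attribute paths below are for newer gensim than the 0.12.4 in the question, where vocab and index2word live directly on the model:

    import numpy as np
    from scipy.stats import pearsonr
    from gensim.models import Word2Vec

    model = Word2Vec.load("model_a")  # placeholder path

    words = model.wv.index2word
    freqs = [model.wv.vocab[w].count for w in words]       # occurrence frequency
    norms = [np.linalg.norm(model.wv[w]) for w in words]   # L2 norm of each vector

    print(pearsonr(freqs, norms))  # (correlation coefficient, p-value)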

Get weight matrices from gensim word2Vec

I am using the gensim word2vec package in Python.
I would like to retrieve the W and W' weight matrices that have been learned during skip-gram training.
It seems to me that model.syn0 gives me the first one, but I am not sure how I can get the other one. Any idea?
I would actually love to find exhaustive documentation of the model's accessible attributes, because the official documentation does not seem to be precise (for instance, syn0 is not described as an attribute).
model.wv.syn0 contains the input embedding matrix. The output embedding is stored in model.syn1 when the model is trained with hierarchical softmax (hs=1), or in model.syn1neg when it uses negative sampling (negative > 0). That's it! When neither hierarchical softmax nor negative sampling is enabled, Word2Vec uses the single weight matrix model.wv.syn0 for training.
See also a related discussion here.
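A minimal sketch of pulling both matrices out of a trained model; the attribute names follow the answer above and older gensim releases, and may live elsewhere in other versions (e.g. model.wv.vectors for the input matrix):

    from gensim.models import Word2Vec

    model = Word2Vec.load("skipgram.model")  # placeholder path

    W_in = model.wv.syn0  # input (word) embeddings, shape (vocab_size, dim)

    W_out = None
    if getattr(model, "syn1neg", None) is not None:   # trained with negative sampling
        W_out = model.syn1neg
    elif getattr(model, "syn1", None) is not None:    # trained with hierarchical softmax
        W_out = model.syn1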

Scikit-learn model parameters unavailable? If so what ML workbench alternative?

I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this? If not, what is a good free machine-learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore as opposed to user-provided constructor parameters (a.k.a. hyperparameters) which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have arrays of support vectors, dual coefficients and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
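A small illustration of that trailing-underscore convention with an SVC on made-up toy data:

    from sklearn.svm import SVC

    X = [[0, 0], [0, 1], [2, 2], [2, 3]]
    y = [0, 0, 1, 1]

    clf = SVC(kernel="linear").fit(X, y)

    # Fitted parameters all end with an underscore.
    print(clf.support_vectors_)
    print(clf.dual_coef_)
    print(clf.intercept_)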
For tree-based models you also have a helper function to generate a Graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
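A rough sketch of that export, assuming a scikit-learn version where the helper is sklearn.tree.export_graphviz:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    iris = load_iris()
    tree = DecisionTreeClassifier().fit(iris.data, iris.target)

    # Writes a .dot file you can render with: dot -Tpng tree.dot -o tree.png
    export_graphviz(tree, out_file="tree.dot", feature_names=iris.feature_names)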
To find the importance of features in forest models, you should also have a look at the compute_importances parameter; see the following examples:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
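Note that in more recent scikit-learn releases the compute_importances flag was removed and importances are always exposed on the fitted forest as feature_importances_; a small sketch:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    forest = RandomForestClassifier(n_estimators=100).fit(iris.data, iris.target)

    # One importance score per input feature, summing to 1.
    for name, importance in zip(iris.feature_names, forest.feature_importances_):
        print(name, importance)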
