I am using gensim version 0.12.4 and have trained two separate word embeddings on the same text with the same parameters. After training, I calculate the Pearson correlation between word occurrence frequency and vector length (norm). One model I saved using save_word2vec_format(fname, binary=True) and then loaded using load_word2vec_format; the other I saved using model.save(fname) and then loaded using Word2Vec.load(). I understand that the word2vec algorithm is non-deterministic, so the results will vary, but the difference in the correlation between the two models is quite drastic. Which method should I be using in this instance?
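A minimal sketch of the comparison being described, using the gensim 0.12.x API named above (the file names are placeholders, and whether true corpus counts survive the word2vec-format round trip depends on the gensim version, so treat the frequency source as an assumption):

```python
import numpy as np
from scipy.stats import pearsonr
from gensim.models import Word2Vec

def freq_vs_norm_correlation(model):
    """Pearson correlation between word occurrence frequency and vector norm."""
    words = list(model.vocab.keys())                 # gensim 0.12.x vocabulary
    freqs = [model.vocab[w].count for w in words]    # may be approximate after load_word2vec_format
    norms = [np.linalg.norm(model[w]) for w in words]
    return pearsonr(freqs, norms)[0]

# Path 1: word2vec text/binary format (drops training-internal state)
m1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

# Path 2: full gensim save/load (keeps vocabulary counts and training state)
m2 = Word2Vec.load('model.gensim')

print(freq_vs_norm_correlation(m1))
print(freq_vs_norm_correlation(m2))
```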
EDIT: this was intended as a comment. Don't know how to change it now, sorry
"correlation between the word occurrence-frequency and vector-length": I don't quite follow. Aren't all your vectors the same length? Or are you not referring to the embedding vectors?
I am working with a steadily growing corpus. I train my document vectors with Doc2Vec, as implemented in Python (gensim).
Is it possible to update a document vector?
I want to use the document vectors for document recommendations.
Individual vectors can be updated, but the gensim Doc2Vec model class doesn't have much support for adding more doc-vectors to itself.
It can, however, return individual vectors for new texts that are compatible (comparable) with the existing vectors, via the .infer_vector(words) method. You can retain these vectors in your own data structures for lookup.
When enough new documents have arrived that you think your core model would be better, if trained on all documents, you can re-train the model with all available data, using it as the new base for .infer_vector(). (Note that vectors from the retrained model won't usually be compatible/comparable with those from the prior model: each training session bootstraps a different self-consistent coordinate space.)
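A minimal sketch of that pattern, keeping the inferred vectors in an ordinary dict (the model path, the example texts, and the tokenizer are assumptions; in gensim 4.x `model.docvecs` is `model.dv`):

```python
from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load('doc2vec.model')   # previously trained model (assumed path)

# Infer vectors for newly arrived documents and keep them in our own lookup
new_docs = {'doc_101': 'New incoming document text ...',
            'doc_102': 'Another newly arrived document ...'}
inferred = {doc_id: model.infer_vector(simple_preprocess(text))
            for doc_id, text in new_docs.items()}

# Compare a new document against the doc-vectors already inside the model
print(model.docvecs.most_similar([inferred['doc_101']], topn=5))
```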
I have a corpus of free-text medical narratives which I am going to use for a classification task; right now it is about 4,200 records.
To begin, I wish to create word embeddings using w2v, but I have a question about a train-test split for this task.
When I train the w2v model, is it appropriate to use all of the data for the model creation? Or should I only use the train data for creating the model?
Really, my question sort of comes down to: do I take the whole dataset, create the w2v model, transform the narratives with the model, and then split, or should I split, create w2v, and then transform the two sets independently?
Thanks!
EDIT
I found an internal project at my place of work, built by a vendor; they create the split first, build the w2v model on ONLY the train data, and then transform the two sets independently in separate jobs; so it's the latter of the two options I described above. This is what I expected, as I wouldn't want to contaminate the w2v model with any of the test data.
The answer to most questions like these in NLP is "try both" :-)
Contamination of test vs. train data is not relevant or a problem in generating word vectors. That is a relevant issue in the model you use the vectors with. I found performance to be better with whole-corpus vectors in my use cases.
Word vectors improve in quality with more data. If you don't use the test corpus, you will need a method for initializing out-of-vocabulary vectors and an understanding of the impact they may have on your model's performance.
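A minimal sketch of the two pipelines, using a toy corpus and the older gensim API (size=, model[word]; in gensim 4.x use vector_size= and model.wv); the main practical difference is how out-of-vocabulary test words are handled:

```python
import numpy as np
from gensim.models import Word2Vec

DIM = 50

# Toy tokenized narratives standing in for the real records
train_docs = [['patient', 'reports', 'chest', 'pain'],
              ['no', 'acute', 'distress', 'noted']]
test_docs  = [['patient', 'denies', 'chest', 'pain']]

# Option A: train word2vec on the training split only
w2v_train_only = Word2Vec(train_docs, size=DIM, min_count=1, seed=1)

# Option B: train word2vec on the whole corpus (embeddings only, no labels involved)
w2v_whole = Word2Vec(train_docs + test_docs, size=DIM, min_count=1, seed=1)

def doc_vector(model, tokens):
    """Average the vectors of in-vocabulary tokens; zeros if everything is OOV."""
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# With option A, 'denies' is out of vocabulary and is simply skipped
print(doc_vector(w2v_train_only, test_docs[0]))
print(doc_vector(w2v_whole, test_docs[0]))
```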
Background: I have been evaluating a variety of text classification methods on my dataset, including feature vectors derived from word counts and TF-IDF, run through various classifiers. My dataset is very small (about 2,300 sentences and about 5 classes), and since the above approaches treat different words as completely separate features, I would like to use a word vector approach to classification. I have used pretrained word vectors with a shallow NN with little success.
Problem: I am looking for an alternative method of using word vectors to classify my sentences and have thought of taking the word vectors for a sentence, combining them into a single vector, then taking the centroid of each class of sentence vectors - classification would then happen via a distance measure between a new sentence and the centroid.
How can I combine the word vectors into a "sentence vector" given my small dataset?
A great feature of word2vec vectors is that you can perform simple arithmetic on them. One common way of getting from words to sentences is to simply take the average of the word vectors for all words in your sentence.
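A minimal sketch of that averaging plus the centroid idea from the question (the word-vector model, tokenized sentences, and labels are placeholders; `model` can be anything that supports `word in model` and `model[word]`):

```python
import numpy as np

def sentence_vector(model, tokens, dim=100):
    """Average word vectors for in-vocabulary tokens (zeros if none are found)."""
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def class_centroids(model, sentences, labels, dim=100):
    """Mean sentence vector per class."""
    centroids = {}
    for label in set(labels):
        vecs = [sentence_vector(model, s, dim)
                for s, l in zip(sentences, labels) if l == label]
        centroids[label] = np.mean(vecs, axis=0)
    return centroids

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def classify(model, tokens, centroids, dim=100):
    """Assign the class whose centroid is closest by cosine similarity."""
    v = sentence_vector(model, tokens, dim)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))
```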
Since your sample data is small, I'd use a pretrained embedding from gensim-data, fine-tune it on your own sample, and at the end use a simpler classifier like logistic regression.
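A hedged sketch of that route, omitting the fine-tuning step for brevity (assumes a recent gensim with the downloader module; 'glove-wiki-gigaword-100' is one of the models available from gensim-data, and the toy sentences and labels are placeholders):

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

wv = api.load('glove-wiki-gigaword-100')      # pretrained 100-d word vectors

sentences = [['the', 'movie', 'was', 'great'], ['terrible', 'waste', 'of', 'time']]
labels = [1, 0]

def avg_vec(tokens):
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.vstack([avg_vec(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([avg_vec(['great', 'movie'])]))
```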
To Nathan's point, if you want to classify documents, Doc2Vec is a great extension of Word2Vec that reduces a lot of the steps. With a few iterations, you can actually achieve really good results. Here is a great implementation of Doc2Vec.
Basically you need to know where to split your sentences first, then you can use a doc2vec model for those sentences.
https://radimrehurek.com/gensim/models/doc2vec.html
1. Determine where your sentence boundaries are
2. Model sentence splitting
3. Train Doc2Vec model on sentences
4. Input sentence vectors to NN model
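A rough sketch of those steps (NLTK's sentence tokenizer and the tiny example texts are assumptions, not part of the answer above):

```python
from nltk.tokenize import sent_tokenize                # requires nltk.download('punkt')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

texts = ['First document. It has two sentences.',
         'Second document. Also two sentences here.']

# Steps 1-2: determine sentence boundaries and split
sentences = [s for text in texts for s in sent_tokenize(text)]

# Step 3: train a Doc2Vec model on the tagged sentences
tagged = [TaggedDocument(simple_preprocess(s), [i]) for i, s in enumerate(sentences)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Step 4: sentence vectors that can be fed into a downstream NN/classifier
X = [model.infer_vector(simple_preprocess(s)) for s in sentences]
```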
I've done this with limited success. Your corpus is small, but you can always try it out and then test/validate/evaluate!
Good luck
I would use gensim's implementation of Paragraph Vector, Doc2Vec, for this. I've just written an article describing how to use it to classify movie reviews, which might help you!
How can we use an ANN to find similar documents? I know it's a silly question, but I am new to the NLP field.
I have made a model using kNN and a bag-of-words approach to solve my problem. Using that, I can get n documents (along with their closeness) that are somewhat similar to the input, but now I want to implement the same using an ANN and I have no idea how to go about it.
Thanks in advance for any help or suggestions.
You can use "word embeddings" - technique, that presents words in the dense vector representation. To find similar documents as the vectors, you can simply use cosine similarity.
An example how to build word2vec model using TensorFlow. One more example how to use embeddings layer from Keras.
The way to obtain embeddings for your language is either to train them yourself on a corpus of your choice (large enough, e.g. Wikipedia) or to download pretrained embeddings (for Python there are plenty of sources for embeddings trained with, or loadable by, the gensim module, which is a de facto standard for Python word2vec).
You can also use GloVe (using glove-python) or FastText word embeddings.
If you are interested, you can find more detailed descriptions of embeddings with code examples and source papers.
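As a concrete sketch of the cosine-similarity idea above (the pretrained model name, fetched via gensim's downloader, and the toy documents are assumptions):

```python
import numpy as np
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')   # any pretrained embedding would do

docs = {'d1': 'the cat sat on the mat',
        'd2': 'dogs and cats are pets',
        'd3': 'stock markets fell sharply today'}

def doc_vec(text):
    """Average the vectors of the in-vocabulary words in a document."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vec('kitten playing with a dog')
ranked = sorted(docs, key=lambda d: cosine(query, doc_vec(docs[d])), reverse=True)
print(ranked)   # most similar documents first
```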
Have a look at the paper https://arxiv.org/pdf/1805.10685.pdf, which gives you an overall idea.
Check this link for more references: https://github.com/Hironsan/awesome-embedding-models
I am using the gensim word2vec package in Python.
I would like to retrieve the W and W' weight matrices that have been learned during skip-gram training.
It seems to me that model.syn0 gives me the first one, but I am not sure how to get the other one. Any ideas?
I would actually love to find exhaustive documentation of the model's accessible attributes, because the official documentation does not seem to be precise (for instance, syn0 is not described as an attribute).
model.wv.syn0 contains the input embedding matrix. The output embeddings are stored in model.syn1 when the model is trained with hierarchical softmax (hs=1), or in model.syn1neg when it uses negative sampling (negative>0). That's it! When neither hierarchical softmax nor negative sampling is enabled, Word2Vec uses the single weight matrix model.wv.syn0 for training.
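A small sketch of where those attributes live, using a toy corpus (attribute names follow the gensim 1.x-3.x API used in the answer; in gensim 4.x the input matrix is model.wv.vectors):

```python
from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'],
             ['jumps', 'over', 'the', 'lazy', 'dog']]

# Skip-gram with negative sampling: W is wv.syn0, W' is syn1neg
model = Word2Vec(sentences, size=50, sg=1, negative=5, hs=0, min_count=1)
W_in = model.wv.syn0          # input (word) embedding matrix, shape (vocab, size)
W_out = model.syn1neg         # output embedding matrix for negative sampling

# With hierarchical softmax (hs=1, negative=0) the output weights live in syn1
model_hs = Word2Vec(sentences, size=50, sg=1, hs=1, negative=0, min_count=1)
W_out_hs = model_hs.syn1

print(W_in.shape, W_out.shape, W_out_hs.shape)
```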
See also a related discussion here.