I am using a pretrained Word2Vec model for tweets (https://www.fredericgodin.com/software/) to create vectors for each word. I will then average these word vectors per tweet and use a classifier on the result to determine sentiment.
My training data is very large, and the pretrained Word2Vec model was trained on millions of tweets with dimensionality 400. My problem is that looking up vectors for the words in my training data is taking too long. Is there a way to reduce the time it takes to build the word vectors?
Cheers.
It's unclear what you mean by "too long".
Looking up individual word-vectors from a pre-existing model should be very fast: it's a simple in-memory lookup of the word to the array index (from a dict), then an access of that array-index.
If it's slow for you, perhaps you've loaded a model larger than your available RAM? In that case, every operation may be relying on much slower virtual memory (paging working memory to and from disk). With these kinds of models, where access is essentially random across memory locations, you never want that to happen. If it is happening, get more RAM or use a smaller model.
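For illustration, a minimal sketch of the lookup-and-average step, assuming the Godin Twitter model is available as a word2vec-format binary file that fits comfortably in RAM (the file name and tokenization are placeholders):

import numpy as np
from gensim.models import KeyedVectors

# Load the pretrained vectors once, up front (assumed word2vec binary format).
wv = KeyedVectors.load_word2vec_format("word2vec_twitter_model.bin", binary=True, unicode_errors="ignore")

def tweet_vector(tokens, kv):
    # Average the vectors of the tokens that are in the model's vocabulary.
    vecs = [kv[t] for t in tokens if t in kv]
    if not vecs:
        return np.zeros(kv.vector_size)  # fallback for all-out-of-vocabulary tweets
    return np.mean(vecs, axis=0)

# Each lookup is an in-memory dict + array access, so this should be fast.
print(tweet_vector("i love this phone".split(), wv).shape)  # (400,)

Each individual lookup here is cheap; if this loop is slow, the model is almost certainly swapping.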
I am curious whether there are any implications of using a different source when calling build_vocab and train on a Gensim FastText model. Will this impact the contextual representation of the word embeddings?
My intention is that there is a specific set of words I want vector representations for, and when calling model.wv.most_similar I only want words from this vocab list to be returned, rather than all possible words in the training corpus. I would use the result to decide whether to group those words as related to each other, based on a similarity threshold.
Below is the code snippet I am using; I'd appreciate your thoughts on any concerns or implications of this approach.
vocab.txt contains a list of unique words of interest
corpus.txt contains the full conversation text (i.e. chat messages), where each line represents one paragraph/sentence per chat
A follow-up question: what values should I set for total_examples and total_words during training in this case?
from gensim.models.fasttext import FastText
model = FastText(min_count=1, vector_size=300,)
corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'
# Unsure if below counts should be based on the training corpus or vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)
# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
# train the model
model.train(corpus_file=corpus_path, epochs=100,
total_examples=corpus_count, total_words=total_words,
)
# save the model
model.save(f'models/gensim-fastext-model-{client}')
In case someone has a similar question, I'll paste the reply I got when I asked this in the Gensim Discussion Group, for reference:
You can try it, but I wouldn't expect it to work well for most purposes.
The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words – then:
You'll want your train() parameters to reflect the actual size of your training corpus: provide total_examples and total_words counts that are accurate for the training corpus.
Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure – it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.
More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.
If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly linear speedup from the number of cores. (In traditional corpus_iterable mode, max throughput is usually reached somewhere in the range of 6-12 worker threads, as long as you have that many cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)
If your corpus is so large that training time is a concern, often a more aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).
And again, if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus. (In general, larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)
-- Gordon
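Putting that advice together, here is a rough sketch of the adjusted approach, assuming the same corpus file is used for both build_vocab() and train(), and with client, get_lines_count and get_words_count carried over from my snippet above:

import os
from gensim.models.fasttext import FastText

corpus_path = f'data/{client}-corpus.txt'

# The counts must describe the actual training corpus.
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)

model = FastText(vector_size=300,
                 min_count=5,              # default; usually better than min_count=1
                 workers=os.cpu_count())   # corpus_file mode scales ~linearly with cores

model.build_vocab(corpus_file=corpus_path)  # same corpus for vocab and training
model.train(corpus_file=corpus_path,
            total_examples=corpus_count,
            total_words=total_words,
            epochs=5)                       # far fewer than 100 for a large corpus

model.save(f'models/gensim-fastext-model-{client}')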
I am working with a steadily growing corpus. I train my document vectors with Doc2Vec, as implemented in Python (gensim).
Is it possible to update a Document Vector?
I want to use the document vectors for document recommendations.
Individual vectors can be updated, but the gensim Doc2Vec model class doesn't have much support for adding more doc-vectors to itself.
It can, however, return individual vectors for new texts that are compatible (comparable) with the existing vectors, via the .infer_vector(words) method. You can retain these vectors in your own data structures for lookup.
When enough new documents have arrived that you think your core model would be better, if trained on all documents, you can re-train the model with all available data, using it as the new base for .infer_vector(). (Note that vectors from the retrained model won't usually be compatible/comparable with those from the prior model: each training session bootstraps a different self-consistent coordinate space.)
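For illustration, a minimal sketch of inferring and storing vectors for new documents alongside an existing model, assuming gensim 4.x (where the document vectors live under model.dv) and placeholder paths/tokens:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")   # previously trained model (placeholder path)

new_docs = {
    "doc_123": ["new", "document", "text", "tokens"],
    "doc_124": ["another", "incoming", "document"],
}

# Keep inferred vectors in your own lookup structure; they are comparable with the
# vectors already in model.dv, since they come from the same trained model.
inferred = {doc_id: model.infer_vector(tokens) for doc_id, tokens in new_docs.items()}

# e.g. find training documents most similar to a newly inferred vector
print(model.dv.most_similar([inferred["doc_123"]], topn=5))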
I have a corpus of free-text medical narratives which I am going to use for a classification task; right now I have about 4200 records.
To begin, I wish to create word embeddings using w2v, but I have a question about the train-test split for this task.
When I train the w2v model, is it appropriate to use all of the data for the model creation? Or should I only use the train data for creating the model?
Really, my question sort of comes down to: do I take the whole dataset, create the w2v model, transform the narratives with the model, and then split, or should I split, create w2v, and then transform the two sets independently?
Thanks!
EDIT
I found an internal project at my place of work which was built by a vendor; they create the split, create the w2v model on ONLY the train data, then transform the two sets independently in different jobs; so it's the latter of the two options I specified above. This is what I thought would be the case, as I wouldn't want to contaminate the w2v model with any of the test data.
The answer to most questions like these in NLP is "try both" :-)
Contamination of test vs train data is not relevant or a problem in generating word vectors. That is a relevant issue for the model you use the vectors with. I found performance to be better with whole-corpus vectors in my use cases.
Word vectors improve in quality with more data. If you don't use the test corpus, you will need a method for initializing out-of-vocabulary vectors and for understanding the impact they may have on your model's performance.
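As an illustration of that last point, a minimal sketch of one common fallback when the Word2Vec model is trained only on the training split (train_sentences and test_sentences are placeholder lists of token lists, and the zero-vector fallback is just one possible choice):

import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(train_sentences, vector_size=100, min_count=5)  # train split only

def doc_vector(tokens, kv):
    # Average in-vocabulary word vectors; fall back to zeros if every token is OOV.
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

# Test-set narratives may contain words the model never saw; those are simply skipped.
X_test = np.vstack([doc_vector(tokens, w2v.wv) for tokens in test_sentences])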
I am using the Doc2Vec model in gensim python library.
Every time I feed the model the same sentence data and set the seed parameter of Doc2Vec to a fixed number, the model gives different vectors after it is built.
For testing purposes, I need a deterministic result every time I give it unchanged input data. I have searched a lot and have not found a way to keep gensim's result unchanged.
Is there anything wrong in the way I am using it? Thanks in advance for replying.
Here is my code:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=5, hs=0, min_count=10, seed=64)
result = model.docvecs
The Doc2Vec algorithm makes use of randomness in both initialization and training, and efficient multi-threaded training introduces more randomness, because the batches across multiple threads won't necessarily be trained-against in the same order from run to run.
If the model is training well, the jitter in results from run-to-run shouldn't be large, and the quality of downstream assessments shouldn't vary much. If the quality-of-results does vary a lot, there are likely other problems with the application of the algorithm to your data or training.
Separately: you almost certainly don't want to be using the non-default dm_concat=1 mode. It results in a much larger, much slower-to-train model, and there aren't any clear public examples of it being worth that extra cost. (I'd only try it if I had a strong baseline result from simpler modes, and lots and lots of data and time.)
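If strict run-to-run determinism really is required for tests, the usual approach (an assumption here, not something confirmed by the question) is to use a single worker thread together with a fixed seed and a fixed PYTHONHASHSEED, accepting much slower training. A minimal sketch, using gensim 4.x parameter names:

# Run with a fixed hash seed so token hashing is stable, e.g.:
#   PYTHONHASHSEED=0 python train_doc2vec.py
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(sentences,
                dm=1, vector_size=100, window=5, hs=0, min_count=10,
                seed=64,
                workers=1)   # single thread removes ordering jitter, but trains slowly

result = model.dv   # model.docvecs in older gensim versions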
I'm working on a TREC task involving machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionality that seems to fit my needs, but I don't know whether it will scale to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
HashingVectorizer will work if you iteratively chunk your data into batches of, for instance, 10k or 100k documents that fit in memory.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting to have seen all the samples.
You can also do this in parallel on several machines on partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
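A minimal sketch of that out-of-core loop, assuming iter_batches() yields (texts, labels) chunks read from disk, load_validation_set() returns a held-out batch, and the full set of class labels is known up front (all of these helpers are placeholders):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20)  # stateless: no fitting required
clf = SGDClassifier()
classes = [0, 1]                                   # all labels, required on the first partial_fit

val_texts, val_labels = load_validation_set()
X_val = vectorizer.transform(val_texts)

for texts, labels in iter_batches():               # e.g. 10k-100k documents per batch
    X = vectorizer.transform(texts)                # sparse matrix that fits in memory
    clf.partial_fit(X, labels, classes=classes)
    print("validation accuracy:", clf.score(X_val, val_labels))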
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial