I'm getting the following prompt when calling model.train() from gensim word2vec
INFO : EPOCH 0: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s
The only solutions I found while searching for an answer point to the iterable-vs-iterator difference, and at this point I have tried everything I could to solve this on my own. Currently, my code looks like this:
class MyCorpus:
    def __init__(self, corpus):
        self.corpus = corpus.copy()

    def __iter__(self):
        for line in self.corpus:
            x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
            yield utils.simple_preprocess(x)

sentences = MyCorpus(corpus)
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=-1
)
The corpus variable is a list containing sentences, and each sentence is a string.
I tried numerous "tests" to check that my class is indeed iterable (and not a one-shot iterator), like:
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
All of them suggest that my class is indeed iterable, so at this point I think the problem must be something else.
workers=-1 is not a supported value for Gensim's Word2Vec model; it essentially means you're using no threads.
Instead, you must specify the actual number of worker threads you'd like to use.
When using an iterable corpus, the optimal number of workers is usually some number up to your number of CPU cores, but not higher than 8-12 even if you've got 16+ cores, because of some hard-to-remove inefficiencies in both Python's Global Interpreter Lock ("GIL") and Gensim's master-reader-thread approach.
Generally, also, you'll get better throughput if your iterable isn't doing anything expensive or repetitive in its preprocessing - like any regex-based tokenization, or a tokenization that's repeated on every epoch. So best to do such preprocessing once, writing the resulting simple space-delimited tokens to a new file. Then, read that file with a very-simple, no-regex, space-splitting only tokenization.
(If performance becomes a major concern on a large dataset, you can also look into the alternate corpus_file method of specifying your corpus. It expects a single file, where each text is on its own line, and tokens are already just space-delimited. But it then lets every worker thread read its own range of the file, with far less GIL/reader-thread bottlenecking, so using workers equal to the CPU core count is then roughly optimal for throughput.)
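For example, a minimal sketch of both fixes, reusing the question's variables (sentences, w2v_size, etc.); the explicit thread counts and the corpus_tokens.txt file name are assumptions, not fixed recommendations:

import os
from gensim.models import Word2Vec

# Sketch 1: keep the iterable corpus, but pass a real positive thread count, never -1
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=4,                 # e.g. 4-8 threads for an iterable corpus
)

# Sketch 2: if the corpus has been pre-tokenized into a space-delimited,
# one-text-per-line file ("corpus_tokens.txt" here is hypothetical), the
# corpus_file path parallelizes much better
w2v_model = Word2Vec(
    corpus_file="corpus_tokens.txt",
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=os.cpu_count(),    # with corpus_file, workers near the CPU core count is roughly optimal
)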
Related
I ran a word2vec algorithm on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all extremely close to 1. The tenth-closest score is still around .998, so I feel like I'm not getting any significant differences in similarity between words, which leads to meaningless "most similar" results.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
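For reference, a minimal way to enable that logging before building the model (standard Python logging; the format string is just a common choice):

import logging

# Show gensim's INFO-level vocabulary & training-progress reports
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)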
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW works better on smaller datasets; in the original paper, the skip-gram model was trained on about 1 billion words.
Use min_count=5, which is the value used in the paper, and that was with roughly 1 billion words. I think 30 is far too high for your data.
Don't remove the stop words, as doing so changes the neighboring words inside the moving window.
Use more iterations, e.g. iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, though I don't know whether that is applicable to your dataset.
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)
I am working on Google Colab using Python, and I have 12GB of RAM.
I am trying to use Google's pre-trained word2vec model to represent sentences as vectors.
I need vectors of the same length even when the sentences don't have the same number of words, so I used padding (the maximum sentence length here is my variable max).
The problem is that every time I try to build a matrix containing all of my vectors, I quickly run out of RAM (around the 20,000th of 128,000 vectors).
This is my code :
def buildWordVector(new_X, sent, model, l):
    for x in range(len(sent)):
        try:
            l[x] = list(model[sent[x]])
            gc.collect()  # doesn't do anything except slow the run time
        except KeyError:
            continue
    new_X.append([list(x) for x in l])

final_x_train = []
l = np.zeros((max, 300))  # the length of a Google pre-trained vector is 300
for i in new_X_train:
    buildWordVector(final_x_train, i, model, l)
    gc.collect()  # doesn't do anything except slow the run time
All of the variables that I have:
df: 16.8MiB
new_X_train: 1019.1KiB
X_train: 975.5KiB
y_train: 975.5KiB
new_X_test: 247.7KiB
X_test: 243.9KiB
y_test: 243.9KiB
l: 124.3KiB
final_x_train: 76.0KiB
stop_words: 8.2KiB
But I am at 12GB/12GB of RAM and the session has expired.
As you can see, the garbage collector is not doing anything, because apparently it cannot see the variables. I really need a solution to this problem; can anyone help me, please?
In general in a garbage-collected language like Python you don't need to explicitly request garbage-collection: it happens automatically when you've stopped retaining references (variables/transitive-property-references) to objects.
So, if you're getting a memory error here, it's almost certainly because you're really trying to use more than the available amount of memory at a time.
Your code is a bit incomplete and unclear – what is max? what is new_X_train? where are you getting those memory sizing estimates? etc.
But notably: it's not typical to represent a sentence as a concatenation of each word's vector. (So that, with 300d word-vectors, and an up-to-10-word sentence, you have a 3000d sentence-vector.) It's far more common to average the word-vectors together, so both words and sentences have the same size, and there's no blank padding at the end of short sentences.
(That's still a very crude way to create text-vectors, but more common than padding-to-maximum-sentence-size.)
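A rough sketch of that averaging approach, assuming model is the loaded set of Google word-vectors (supporting model[word] lookups and in checks, as in the question) and new_X_train holds token lists; the helper name is hypothetical:

import numpy as np

def average_vector(tokens, kv, dim=300):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [kv[w] for w in tokens if w in kv]
    if not vecs:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)

# One fixed-size 300-d vector per sentence -- no padding, no max*300 block per text
final_x_train = np.array(
    [average_vector(sent, model) for sent in new_X_train],
    dtype=np.float32,
)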
I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very long, I load it in chunks of 1 KB each:
def read_in_chunks(filename, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = filename.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go through the corpus piece by piece and find bigrams, using the gensim Phrases() and Phraser() classes, but while training, my model constantly loses state. So I tried to save and reload the model after each megabyte that I read, and then free the memory, but it still loses state. My code is here:
with open("./final/corpus.txt", "r", encoding='utf8') as r:
    max_vocab_size = 20000000
    phrases = Phrases(max_vocab_size=max_vocab_size)
    i = 1
    j = 1024
    sentences = ""
    for piece in read_in_chunks(r):
        if i <= j:
            sentences = sentences + piece
        else:
            phrases.add_vocab(sentences)
            phrases = Phrases(sentences)
            phrases = phrases.save('./final/phrases.txt')
            phrases = Phraser.load('./final/phrases.txt')
            sentences = ""
            j += 1024
        i += 1
print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory usage. At the very best, the .load() is unnecessary, only exactly reproducing what was just .save()d, while wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases in order to assign phrases to the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list of string tokens. You don't have to worry about chunk sizes, or about calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
For a good intro to how an iterable object can work in such situations, a good blog post is:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            for line in src:  # stream line-by-line; no need to read the whole file at once
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)
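For example, a hedged sketch of the same pipeline using that built-in class (assuming corpus.txt really is one space-delimited sentence per line; the save path is just illustrative):

from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases

sentences = LineSentence('./final/corpus.txt')      # streams one list-of-tokens per line
phrases = Phrases(sentences, max_vocab_size=20000000)
phrases.save('./final/phrases.model')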
I am trying to train a Doc2Vec model using gensim with 114M unique documents/labels and a vocabulary size of around 3M unique words. I have a 115GB-RAM Linux machine on Azure.
When I run build_vocab, the iterator parses all files and then throws the memory error listed below.
Traceback (most recent call last):
File "doc_2_vec.py", line 63, in <module>
model.build_vocab(sentences.to_array())
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 579, in build_vocab
self.finalize_vocab(update=update) # build tables & arrays
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 752, in finalize_vocab
self.reset_weights()
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 662, in reset_weights
self.docvecs.reset_weights(self)
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 390, in reset_weights
self.doctag_syn0 = empty((length, model.vector_size), dtype=REAL)
MemoryError
My code:
import glob
import parquet
import json
import collections
import multiprocessing

# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}

    def __iter__(self):
        for src in self.sources:
            with open(src) as fo:
                for row in parquet.DictReader(fo, columns=['Id', 'tokens']):
                    yield LabeledSentence(utils.to_unicode(row['tokens']).split('\x01'), [row['Id']])

## list of files to be opened ##
sources = glob.glob("/data/meghana_home/data/*")
sentences = LabeledLineSentence(sources)

#pre = Doc2Vec(min_count=0)
#pre.scan_vocab(sentences)
"""
for num in range(0, 20):
    print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)
print("done")
"""

NUM_WORKERS = multiprocessing.cpu_count()
NUM_VECTORS = 300
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=15, window=3, size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS)
model.build_vocab(sentences)
print("built vocab.......")
model.train(sentences, total_examples=model.corpus_count, epochs=10)
Memory usage as per top: [screenshot not preserved]
Can someone please tell me how much memory is expected? Which is the better option: adding swap space and slowing the process, or adding more memory (so that the cost of the cluster might eventually be equivalent)?
What vectors does gensim store in memory? Is there any flag I am missing for memory-efficient usage?
114 million doctags will require at least 114,000,000 doctags * 300 dimensions * 4 bytes/float = 136GB just to store the raw doctag-vectors during training.
(If the doctag keys row['Id'] are strings, there'll be extra overhead for remembering the string-to-int-index mapping dict. If the doctag keys are raw ints from 0 to 114 million, that will avoid filling that dict. If the doctag keys are raw ints, but include any int higher than 114 million, the model will attempt to allocate an array large enough to include a row for the largest int – even if many other lower ints are unused.)
The raw word-vectors and model output-layer (model.syn1) will require about another 8GB, and the vocabulary dictionary another few GB.
So you'd ideally want more addressable memory, or a smaller set of doctags.
You mention a 'cluster', but gensim Doc2Vec does not support multi-machine distribution.
Using swap space is generally a bad idea for these algorithms, which can involve a fair amount of random access and thus become very slow during swapping. But for the case of Doc2Vec, you can set its doctags array to be served by a memory-mapped file, using the Doc2Vec.__init__() optional parameter docvecs_mapfile. In the case of each document having a single tag, and those tags appearing in the same ascending order on each repeated sweep through the training texts, performance may be acceptable.
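As a rough sketch of that option, in the older gensim version the question is using (the size and docvecs_mapfile parameter names are from that era; the mapfile path is hypothetical, the other parameters are copied from the question's code):

model = Doc2Vec(
    alpha=0.025, min_alpha=0.0001, min_count=15, window=3,
    size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS,
    docvecs_mapfile='/data/meghana_home/doctag_vectors.mmap',  # mmap-backed doctag array on disk
)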
Separately:
Your management of training iterations and the alpha learning-rate is buggy. You're achieving 2 passes over the data, at alpha values of 0.025 and 0.023, even though each train() call is attempting a default 5 passes but then just getting a single iteration from your non-restartable sentences.to_array() object.
You should aim for more passes, with the model managing alpha from its initial-high to default final-tiny min_alpha value, in fewer lines of code. You need only call train() once unless you're absolutely certain you need to do extra steps between multiple calls. (Nothing shown here requires that.)
Make your sentences object a true iterable-object that can be iterated over multiple times, by changing to_array() to __iter__(), then passing the sentences alone (rather than sentences.to_array()) to the model.
Then call train() once with this multiply-iterable object, and let it do the specified number of iterations with a smooth alpha update from high-to-low. (The default inherited from Word2Vec is 5 iterations, but 10 to 20 are more commonly used in published Doc2Vec work. The default min_alpha of 0.0001 should hardly ever be changed.)
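Putting those points together, a minimal sketch of the intended shape, reusing the question's LabeledLineSentence and model objects (the epoch count is just an illustrative value):

# Assuming LabeledLineSentence exposes __iter__() rather than a one-shot to_array()
sentences = LabeledLineSentence(sources)
model.build_vocab(sentences)                  # pass 1: vocabulary discovery
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=20)                        # single train() call; alpha decays internally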
I am training a model with gensim. My corpus is many short sentences, and each sentence has a frequency that indicates how many times it occurs in the total corpus. I implemented it as follows; as you can see, I just chose to repeat each sentence freq times. Anyway, if the data is small it should work, but when the data grows the frequencies can be very large, and it costs too much memory, which my machine cannot afford.
So:
1. Can I just supply the frequency count for each record instead of repeating it freq times?
2. Or are there any other ways to save memory?
class AddressSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do the repeating
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1
print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yield()ed, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file and, rather than directly yielding TaggedDocuments, does the repetition to create a tangible file on disk which includes the repetitions. Then, use a simpler iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
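A rough sketch of that second option, assuming the CSV layout shown in the question and a hypothetical corpus_expanded.txt output file:

import csv

# One-time expansion pass: write each address `freq` times, one space-delimited line each
with open("/data/corpus.csv") as src, open("/data/corpus_expanded.txt", "w") as dst:
    for row in csv.DictReader(src):
        line = " ".join(row["address"].split()) + "\n"
        dst.write(line * int(row["freq"]))

# The expanded file can then be shuffled externally (e.g. with `shuf`) and streamed back
# with a simple line-per-TaggedDocument iterator, so no repetition logic or full
# in-memory list is needed at training time.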
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples that causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
In any case, I would try leaving out repetition entirely, or shuffling the repetitions, and then evaluate which steps are really helping with whatever the true end goal is.