Gensim Doc2vec finalize_vocab Memory Error - python

I am trying to train a Doc2Vec model using gensim with 114M unique documents/labels and a vocabulary of around 3M unique words. I have a 115 GB RAM Linux machine on Azure.
When I run build_vocab, the iterator parses all files and then throws the memory error listed below.
Traceback (most recent call last):
File "doc_2_vec.py", line 63, in <module>
model.build_vocab(sentences.to_array())
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 579, in build_vocab
self.finalize_vocab(update=update) # build tables & arrays
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 752, in finalize_vocab
self.reset_weights()
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 662, in reset_weights
self.docvecs.reset_weights(self)
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 390, in reset_weights
self.doctag_syn0 = empty((length, model.vector_size), dtype=REAL)
MemoryError
My code:
import glob
import json
import collections
import multiprocessing

import parquet

# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}

    def __iter__(self):
        for src in self.sources:
            with open(src) as fo:
                for row in parquet.DictReader(fo, columns=['Id', 'tokens']):
                    yield LabeledSentence(utils.to_unicode(row['tokens']).split('\x01'), [row['Id']])

## list of files to be opened ##
sources = glob.glob("/data/meghana_home/data/*")
sentences = LabeledLineSentence(sources)

#pre = Doc2Vec(min_count=0)
#pre.scan_vocab(sentences)
"""
for num in range(0, 20):
    print('min_count: {}, size of vocab: '.format(num),
          pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab'] / 700)
print("done")
"""

NUM_WORKERS = multiprocessing.cpu_count()
NUM_VECTORS = 300

model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=15, window=3,
                size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS)
model.build_vocab(sentences)
print("built vocab.......")
model.train(sentences, total_examples=model.corpus_count, epochs=10)
Memory usage as per top is-
Can someone please tell me how much memory is expected? Which is the better option: adding swap space and slowing the process down, or adding more memory (so that the cost of the cluster might eventually be equivalent)?
Which vectors does gensim keep in memory? Is there any flag I am missing for memory-efficient usage?

114 million doctags will require at least 114,000,000 doctags * 300 dimensions * 4 bytes/float = 136GB just to store the raw doctag-vectors during training.
(If the doctag keys row['Id'] are strings, there'll be extra overhead for remembering the string-to-int-index mapping dict. If the doctag keys are raw ints from 0 to 114 million, that will avoid filling that dict. If the doctag keys are raw ints, but include any int higher than 114 million, the model will attempt to allocate an array large enough to include a row for the largest int – even if many other lower ints are unused.)
The raw word-vectors and model output-layer (model.syn1) will require about another 8GB, and the vocabulary dictionary another few GB.
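As a rough sanity check (a back-of-the-envelope sketch, not gensim's exact accounting), you can estimate those dominant allocations yourself:

# Back-of-the-envelope estimate of the main Doc2Vec allocations (assumes float32 vectors).
n_doctags = 114000000        # unique documents/labels
n_words = 3000000            # surviving vocabulary words
dims = 300                   # vector_size
bytes_per_float = 4          # float32 (REAL)

doctag_vectors = n_doctags * dims * bytes_per_float   # docvecs.doctag_syn0
word_vectors = n_words * dims * bytes_per_float       # model.syn0
output_layer = n_words * dims * bytes_per_float       # model.syn1 / syn1neg

print("doctag vectors: %.1f GB" % (doctag_vectors / 1e9))                             # ~136.8 GB
print("word vectors + output layer: %.1f GB" % ((word_vectors + output_layer) / 1e9)) # ~7.2 GB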
So you'd ideally want more addressable memory, or a smaller set of doctags.
You mention a 'cluster', but gensim Doc2Vec does not support multi-machine distribution.
Using swap space is generally a bad idea for these algorithms, which can involve a fair amount of random access and thus become very slow during swapping. But for the case of Doc2Vec, you can set its doctags array to be served by a memory-mapped file, using the Doc2Vec.__init__() optional parameter docvecs_mapfile. In the case of each document having a single tag, and those tags appearing in the same ascending order on each repeated sweep through the training texts, performance may be acceptable.
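For illustration, with the older gensim API shown in the question, that option would look something like the sketch below (the mapfile path is just a placeholder, and the other parameters are the question's own):

# Sketch only: serve the large doctag-vectors array from a memory-mapped file on disk
# instead of RAM. Parameter names follow the older gensim Doc2Vec API used above;
# the mapfile path is a placeholder.
model = Doc2Vec(size=NUM_VECTORS, window=3, min_count=15, sample=1e-4, negative=10,
                workers=NUM_WORKERS,
                docvecs_mapfile='/data/meghana_home/doctag_syn0.mmap')
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=10)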
Separately:
Your management of training iterations and the alpha learning-rate is buggy. You're achieving 2 passes over the data, at alpha values of 0.025 and 0.023, even though each train() call is attempting a default 5 passes but then just getting a single iteration from your non-restartable sentences.to_array() object.
You should aim for more passes, with the model managing alpha from its initial-high to default final-tiny min_alpha value, in fewer lines of code. You need only call train() once unless you're absolutely certain you need to do extra steps between multiple calls. (Nothing shown here requires that.)
Make your sentences object a true iterable-object that can be iterated over multiple times, by changing to_array() to __iter__(), then passing the sentences alone (rather than sentences.to_array()) to the model.
Then call train() once with this multiply-iterable object, and let it do the specified number of iterations with a smooth alpha update from high-to-low. (The default inherited from Word2Vec is 5 iterations, but 10 to 20 are more commonly used in published Doc2Vec work. The default min_alpha of 0.0001 should hardly ever be changed.)
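A minimal sketch of that pattern, using the older parameter names (size, iter) that match the gensim version in the question (vector_size/epochs in current gensim); the class name and file path are illustrative placeholders:

# Sketch: a corpus object that can be iterated many times, passed to a single train() call.
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

class TaggedCorpus(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):                  # a fresh iterator on every scan
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield TaggedDocument(line.split(), [i])

corpus = TaggedCorpus('/data/tokenized_docs.txt')   # placeholder path
model = Doc2Vec(size=300, window=3, min_count=15, iter=20, workers=4)
model.build_vocab(corpus)                # pass 1: vocabulary discovery
model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)
# alpha decays smoothly from its initial 0.025 to min_alpha=0.0001 across the 20 epochs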

Related

Gensim Word2Vec exhausting iterable

I'm getting the following log message when calling model.train() from gensim Word2Vec:
INFO : EPOCH 0: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s
The only solutions I found while searching for an answer point to the iterable vs. iterator difference, and at this point I have tried everything I could to solve this on my own. Currently, my code looks like this:
import re

from gensim import utils
from gensim.models import Word2Vec

class MyCorpus:
    def __init__(self, corpus):
        self.corpus = corpus.copy()

    def __iter__(self):
        for line in self.corpus:
            x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
            yield utils.simple_preprocess(x)

sentences = MyCorpus(corpus)

w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=-1,
)
The corpus variable is a list containing sentences, and each sentence is a string.
I tried the numerous "tests" to see if my class is indeed iterable, like:
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
All of them suggest that my class is indeed iterable, so at this point I think the problem must be something else.
workers=-1 is not a supported value for Gensim's Word2Vec model; it essentially means you're using no threads.
Instead, you must specify the actual number of worker threads you'd like to use.
When using an iterable corpus, the optimal number of workers is usually some number up to your number of CPU cores, but not higher than 8-12 if you've got 16+ cores, because of some hard-to-remove inefficiencies in both Python's Global Interpreter Lock ("GIL") and Gensim's master-reader-thread approach.
Generally, also, you'll get better throughput if your iterable isn't doing anything expensive or repetitive in its preprocessing - like any regex-based tokenization, or a tokenization that's repeated on every epoch. So best to do such preprocessing once, writing the resulting simple space-delimited tokens to a new file. Then, read that file with a very-simple, no-regex, space-splitting only tokenization.
(If performance becomes a major concern on a large dataset, you can also look into the alternate corpus_file method of specifying your corpus. It expects a single file, where each text is on its own line, and tokens are already just space-delimited. But it then lets every worker thread read its own range of the file, with far less GIL/reader-thread bottlenecking, so using workers equal to the CPU core count is then roughly optimal for throughput.)
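A hedged sketch of both suggestions; the parameter values and file path are placeholders, and sentences is assumed to be the MyCorpus iterable from the question:

import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

# Iterable-corpus mode: give an explicit positive worker count (never -1).
w2v_model = Word2Vec(
    sentences=sentences,          # the MyCorpus iterable from the question
    vector_size=100,              # illustrative values
    window=5,
    min_count=5,
    workers=min(cores, 8),        # beyond ~8-12 threads, GIL/reader-thread overhead dominates
)

# corpus_file mode: a pre-tokenized file with one text per line, tokens space-delimited.
# Each worker reads its own range of the file, so workers=cores is roughly optimal.
w2v_model = Word2Vec(
    corpus_file='tokenized_corpus.txt',   # placeholder path
    vector_size=100,
    window=5,
    min_count=5,
    workers=cores,
)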

Unable to understand constant train,test split using hashlib

I was doing Chapter 1 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow"
and I came across code using hashlib which splits train/test data from our dataframe. The code is shown below:
"""
Creating shuffled testset with constant values in training and updated dataset values going to
test set in case dataset is updated, this done via hashlib
"""
import hashlib
import numpy as np
def test_set_check(identifier,test_ratio,hash):
return hash(np.int64(identifier)).digest()[-1]<256*test_ratio
def split_train_test(data,test_ratio,id_column,hash=hashlib.md5):
ids=data[id_column]
in_test_set=ids.apply(lambda id_:test_set_check(id_,test_ratio,hash))
return data.loc[~in_test_set],data.loc[in_test_set]
I want to understand:
1. why digest()[-1] gives an integer, even though the output of .digest() is a hash code
2. why the output is compared to 256 in the expression < 256 * test_ratio
3. how this is more robust than using np.random.seed(42)
I am a newbie to all this, so it would be great if you could help me figure this out.
The digest() method of a hashlib hash object returns a sequence of bytes, and each byte is a number from 0 to 255. In this particular example the digest is 16 bytes long, and indexing a particular location in the digest returns the value of that byte. As the book explains, the author is proposing to use the last byte of the hash as an identifier.
The output is compared to 256 * test_ratio because we want to keep only the fraction of samples in our test set that corresponds to test_ratio. Since each byte value is between 0 and 255, comparing it to 256 * test_ratio essentially sets a "cap" that decides whether or not a sample goes into the test set. For example, if you take the provided housing data and perform the hashing and splitting, you'll end up with a test set of around 4,000 samples, which is roughly 20% of the full dataset. If you're having trouble understanding, imagine we have a list of integers ranging from 0 to 100: keeping only the integers smaller than 100 * 0.2 = 20 will ensure that we end up with roughly 20% of the samples.
As the book explains, the entire motivation of going through this rather cumbersome hashing process is to mitigate data snooping bias, which refers to the phenomenon of our model having access to the test set. In order to do so, we want to make sure that we always have the same samples in our test set. Simply setting a random seed and then splitting the dataset isn't going to guarantee this, but using hash values will.
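A small self-contained sketch of the mechanism, using made-up identifiers rather than the book's housing data:

import hashlib
import numpy as np

test_ratio = 0.2

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # The last byte of the MD5 digest is pseudo-random but fully determined by the id,
    # so the same row always lands on the same side of the split.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

ids = np.arange(10000)                                   # made-up identifiers
in_test = np.array([test_set_check(i, test_ratio) for i in ids])
print(in_test.mean())                                    # ~0.2: only bytes below 256*0.2 = 51.2 pass
print(test_set_check(42, test_ratio) == test_set_check(42, test_ratio))   # True: deterministic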

Load a part of Glove vectors with gensim

I have a word list like ['like', 'Python'] and I want to load the pre-trained GloVe word vectors for these words, but the GloVe file is too large. Is there any fast way to do it?
What I tried
I iterated through each line of the file to see if the word is in the list and added it to a dict if so. But this method is a little slow.
import pandas as pd

def readWordEmbeddingVector(Wrd):
    f = open('glove.twitter.27B/glove.twitter.27B.200d.txt', 'r')
    words = []
    a = f.readline()
    while a != '':
        vector = a.split()
        if vector[0] in Wrd:
            words.append(vector)
            Wrd.remove(vector[0])
        a = f.readline()
    f.close()
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector
I also tried the following, but it loads the whole file instead of just the vectors I need:
gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')
What I want
A method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format, but where I can pass a word list to load.
There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).
You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: 1st, to find out exactly how many words in the file intersect with your set-of-words-of-interest (so you can allocate the right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words-of-interest.
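A sketch of such a two-pass filtering routine, building a plain dict and numpy array rather than a gensim KeyedVectors object (the file path and word set are placeholders):

import numpy as np

def load_filtered_glove(path, words_of_interest, dims=200):
    """Two-pass filtered load of a GloVe-format text file ('word v1 v2 ...' per line)."""
    wanted = set(words_of_interest)

    # Pass 1: count matching words so the vector array can be allocated at exactly the right size.
    count = 0
    with open(path, encoding='utf8') as f:
        for line in f:
            if line.split(' ', 1)[0] in wanted:
                count += 1

    # Pass 2: fill the preallocated array and record each word's row index.
    vectors = np.empty((count, dims), dtype=np.float32)
    index = {}
    with open(path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] in wanted:
                row = len(index)
                index[parts[0]] = row
                vectors[row] = np.asarray(parts[1:], dtype=np.float32)
    return index, vectors

# idx, vecs = load_filtered_glove('glove.twitter.27B/glove.twitter.27B.200d.txt', {'like', 'python'})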

How to count frequency in gensim.Doc2Vec?

I am training a model with gensim. My corpus consists of many short sentences, and each sentence has a frequency that indicates how many times it occurs in the total corpus. I implemented it as shown below; as you can see, I simply chose to repeat each sentence freq times. This works if the data is small, but when the data grows the frequencies can be very large, which costs too much memory, and my machine cannot afford it.
So:
1. Can I just record the frequency of each record instead of repeating it freq times?
2. Are there any other ways to save memory?
import datetime

import gensim
from gensim.models.doc2vec import TaggedDocument

class AddressSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do the repetition
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1

print("START %s" % datetime.datetime.now())

train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yielded on demand, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file, and rather than directly yielding TaggedDocuments, does the repetition to create a tangible file on disk which includes the repetitions. Then, use a more simple iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
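For illustration, a minimal sketch of that process-and-rewrite step, assuming the same CSV layout as in the question (file paths are placeholders):

import csv

# Sketch: expand each address into `freq` already-tokenized lines on disk, once, so the
# training-time iterator only needs cheap space-splitting afterwards.
with open("/data/corpus.csv") as src, open("/data/corpus_repeated.txt", "w") as dst:
    for row in csv.DictReader(src):
        line = " ".join(row["address"].split()) + "\n"
        for _ in range(int(row["freq"])):
            dst.write(line)

# The expanded file can then be shuffled externally (e.g. with `shuf` on Linux) before being
# streamed into the model by a simple one-TaggedDocument-per-line reader.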
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples which causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
In any case, I would try leaving out repetition entirely, or shuffling repetitions, and evaluate which steps are really helping on whatever the true end goal is.

how to feed pybrain ffn with one entry (to already trained network)?

I need to train a network and then feed it test data one entry at a time. Is there some example or documentation covering this?
To achieve this I serialized the trained network and use it with every new incoming entry.
The problem is that I get a crash from _convertToOneOfMany, and even though I understand its purpose (from here), I do not understand how exactly it works.
Its behaviour is not deterministic to me. It must somehow interpret classes and labels, and there must be some requirement I am missing. It works for the whole dataset, but if I take just a random line it goes crazy.
Traceback (most recent call last):
File "ffn_iris.py", line 29, in <module>
tstdata._convertToOneOfMany()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/PyBrain-0.3-py2.6.egg/pybrain/datasets/classification.py", line 142, in _convertToOneOfMany
newtarg[i, int(oldtarg[i])] = bounds[1]
IndexError: index (2) out of range (0<=index<1) in dimension 1
EDIT:
To be more precise, let me tell you what I am doing: I want to train a network on the most famous NN example on the Internet ;) - the Iris dataset.
It looks something like this:
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
etc...
The last zero is the class. The whole dataset holds 60 rows: 20 for class 0, 20 for class 1, and 20 for class 2.
I read the data file and construct the dataset:
alldata = ClassificationDataSet(4, class_labels=['Iris-setosa',
                                                 'Iris-versicolor',
                                                 'Iris-virginica'])

# --- loop over the rows of the data file ---
alldata.addSample(line[0:4], line[4])

# --- create testing and training sets ---
tstdata, trndata = alldata.splitWithProportion(0.7)

# --- convert targets to one-of-many matrices ---
trndata._convertToOneOfMany()
tstdata._convertToOneOfMany()

# --- not important, just for completeness ---
fnn = buildNetwork(trndata.indim, 10, trndata.outdim, outclass=SoftmaxLayer)
trainer = BackpropTrainer(fnn, dataset=trndata,
                          momentum=0.01, verbose=True,
                          weightdecay=0.01)
My problem relates to _convertToOneOfMany(). When the dataset or data file holds just a couple of entries (rather than 60 divided into three classes), it crashes with the exception quoted at the beginning of the question.
Example of a crashing dataset:
6.5,3.0,5.2,2.0,1
6.5,3.0,5.2,2.0,1
6.2,3.4,5.4,2.3,2
6.5,3.0,5.2,2.0,0
Example of working one:
6.5,3.0,5.2,2.0,1
6.2,3.4,5.4,2.3,2
6.5,3.0,5.2,2.0,0
How can _convertToOneOfMany() be connected to the number of entries in the dataset, or to the size of one class subset? Single-row datasets crash as well.
It might help if you pasted more of your code. Regarding your question, it is covered in the documentation: http://pybrain.org/docs/quickstart/network.html
Basically, the relevant command is net.activate([2, 1]).
In that example the network has 2 inputs, and you feed it the values 2 and 1.
I recommend going through the documentation.
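For the Iris setup in the question, feeding one entry to an already-trained network would look roughly like the sketch below; it assumes the fnn built and trained as in the question, and the feature values are just illustrative:

import numpy as np

# Sketch: activate an already-trained PyBrain network on a single new sample.
sample = [5.1, 3.5, 1.4, 0.2]            # sepal length/width, petal length/width
output = fnn.activate(sample)            # one activation per output unit (3 classes here)
predicted_class = int(np.argmax(output))
print(output, predicted_class)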
