I'm trying to train a Doc2Vec model on massive data. I have 20k files, 72GB in total, and wrote this code:
import random
from os import listdir
from os.path import isfile, join

from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

def train():
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    random.shuffle(onlyfiles)
    tagged_data = []
    t = 0
    try:
        for file_name in onlyfiles:
            with open(mypath + "/" + file_name, 'r', encoding="utf-8") as file:
                txt = file.read()
                tagged_data.append([word_tokenize(txt.lower()), [str(t)]])
                t += 1
    except Exception as e:
        print(t)
        return
    print("Files Loaded")

    max_epochs = 1000
    vec_size = 500
    alpha = 0.025

    model = Doc2Vec(vector_size=vec_size,
                    alpha=alpha, workers=1,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)
    print("Model Works")
    print("Building vocabulary")
    model.build_vocab(tagged_data)

    print("Training")
    for epoch in range(max_epochs):
        print("Iteration {0}".format(epoch))
        model.train(tagged_data,
                    total_examples=model.corpus_count,
                    epochs=model.iter)
        model.alpha -= 0.0002
        model.min_alpha = model.alpha

    model.save(model_name)
    print("Model Saved")
But when I run this method, this error appears:
Traceback (most recent call last):
File "doc2vec.py", line 20, in train
tagged_data.append([word_tokenize(txt.lower()), [str(t)]])
MemoryError
Only about 3k files are processed before this happens. But when I check memory usage, the Python process shows that only 1.7% of memory is used.
Is there any parameter I can pass to Python to solve this?
How can I fix it?
You're getting the error long before even trying Doc2Vec, so this isn't really a Doc2Vec question - it's a problem with your Python data handling. Do you have enough RAM to load 72GB of disk-data (which might expand a bit when represented in Python string objects) into RAM?
But also, you won't usually have to bring an entire corpus into memory, by appending to a giant list, to do any of these tasks. Read things one at a time, and process from an iterable/iterator, perhaps writing interim results (like tokenized text) back to the IO sources. This article may be helpful:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
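For example, here is a minimal streaming-corpus sketch (assuming gensim and NLTK are available, and using mypath as in your code); it yields one TaggedDocument at a time instead of building one giant list:

import os
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

class FileCorpus:
    """Re-iterable corpus: reads and tokenizes one file per step."""
    def __init__(self, dirpath):
        self.dirpath = dirpath

    def __iter__(self):
        t = 0
        for file_name in sorted(os.listdir(self.dirpath)):
            full_path = os.path.join(self.dirpath, file_name)
            if not os.path.isfile(full_path):
                continue
            with open(full_path, 'r', encoding='utf-8') as f:
                yield TaggedDocument(words=word_tokenize(f.read().lower()),
                                     tags=[str(t)])
            t += 1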
Finally, if your code did proceed to the Doc2Vec section, you'd have other problems. Whatever online example you're consulting as a model has many bad practices. For example:
a typical iteration count is 10-20; you certainly wouldn't use 1000 for a 72GB dataset
min_count=1 leads to a much bigger model; usually discarding low-frequency words is necessary and may even improve resulting vector quality, and larger datasets (and 72GB is very, very big) tend to use larger rather than minimal min_count settings
most people shouldn't be using non-default alpha/min_alpha values, or trying to manage them with their own calculations, or even calling train() more than once. train() has its own epochs parameter which, if used, will smoothly handle the learning-rate alpha for you (see the sketch at the end of this answer). As far as I can tell, 100% of the people who call train() multiple times in their own loop are doing it wrong, and I have no idea where they keep getting these examples.
Training goes much slower with workers=1; especially with a large dataset you'll want to try larger workers values, and the optimal value for training throughput in gensim versions up through 3.5.0 is usually somewhere in the range from 3-12 (assuming you have at least that many CPU cores).
So your current code would probably result in a model larger than RAM, training single-thread slowly and 1000s of times more than necessary, with much of the training happening with a nonsensical negative-alpha which makes the model worse every cycle. If it miraculously didn't MemoryError during model initialization, it'd run for months or years and end up with nonsense results.
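For illustration only, a rough sketch of the more conventional pattern (gensim 3.x API; the parameter values are plausible starting points, not tuned recommendations), reusing the FileCorpus iterable sketched earlier:

from gensim.models.doc2vec import Doc2Vec

corpus = FileCorpus(mypath)  # streaming corpus from the earlier sketch
model = Doc2Vec(vector_size=300, min_count=5, workers=4, dm=1, epochs=20)
model.build_vocab(corpus)
# A single train() call; gensim decays alpha smoothly across the epochs itself.
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save(model_name)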
Related
Can I save in a external file the dataset obtained from tensorflow.keras.preprocessing.text_dataset_from_directory()?
from tensorflow.keras import preprocessing

train_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='training',  # We are in training
    shuffle=True,
    seed=689
)
val_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='validation',
    shuffle=True,
    seed=689
)
test_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/test'
)
I'm reading the documentation but I'm not sure if it's possible.
Answer to @Lescurel's question
I want to do this because I want to avoid doing this preprocessing each time and having to wait while it's done. Furthermore, I want to see if this new saved file takes up less space on my computer.
Actually, I don't care about the format. I thought that if this can be done, there would already be a standard format that everyone uses.
Thank you very much.
Technically that is possible.
But you don't want that, because:
The preprocessing.text_dataset_from_directory call creates a generator-based dataset that supports
on the fly loading of data,
shuffle after each epoch (for training),
prefetching and other features.
If you just save a shuffled dataset as a file on your computer, you will have to do the shuffling again anyway (it should happen every epoch). And if the dataset were to become larger than your RAM, you would have to take care of that yourself, too.
If you still want to do it: you can get batches of data with dataset.take(1) (or by iterating over the dataset) and then either save the individual strings (using for .. in) or use pickle to write the binary objects... But I repeat myself: you do not want to do that.
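If you insist, a rough sketch of that (not recommended) approach could look like this, assuming train_ds was built as in the question; the output file name is hypothetical:

import pickle

samples = []
for text_batch, label_batch in train_ds:  # iterate over every batch
    for text, label in zip(text_batch.numpy(), label_batch.numpy()):
        samples.append((text.decode('utf-8'), int(label)))

with open('train_texts.pkl', 'wb') as f:
    pickle.dump(samples, f)  # order is now frozen, no on-the-fly loading anymore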
If you want to do the preprocessing up front, use a program that works on your text files and saves them back as text files again (e.g. for cleaning etc.) - but be aware that you will have to do the same for test and production data later on, so everything you remove from the (Keras) pipeline you have to take care of yourself.
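A rough sketch of that idea, writing cleaned copies into a parallel directory so text_dataset_from_directory can still read them (the destination path and clean_text are hypothetical placeholders):

import os

def clean_text(text):
    return text.lower().strip()  # put your real cleaning here

src_root, dst_root = 'aclImdb/train', 'aclImdb_clean/train'
for label_dir in os.listdir(src_root):
    src_dir = os.path.join(src_root, label_dir)
    if not os.path.isdir(src_dir):
        continue  # skip stray files at the top level
    os.makedirs(os.path.join(dst_root, label_dir), exist_ok=True)
    for fname in os.listdir(src_dir):
        with open(os.path.join(src_dir, fname), encoding='utf-8') as f:
            cleaned = clean_text(f.read())
        with open(os.path.join(dst_root, label_dir, fname), 'w', encoding='utf-8') as f:
            f.write(cleaned)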
I am relatively new to TensorFlow and have been putting together some model training based on the tutorial I found on the tf website. I have been able to put together something functional that satisfies my preliminary requirements.
I am reading locally a csv file that provides some links to images, associated with labels written on the same csv row. My code roughly looks like this:
def map_func(*row):
    img = process_img(row[0])  # row[0] holds the image filename
    output = read(row)
    return img, output

dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func)
dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
dataset = dataset.batch(NB_IMG)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

X, y = next(iter(dataset))
X_train, X_test = tf.split(X, split, axis=0)
y_train, y_test = tf.split(y, split, axis=0)

model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(x=X_train, y=y_train, epochs=EPOCHS, validation_data=(X_test, y_test))
NB_IMG is the total number of images I have. EPOCHS is arbitrarily fixed to a given value (in general 20 or 40), and split is a ratio applied to NB_IMG.
All my images are local on my computer, and with that code my GPU can currently manage roughly up to 50000 images. The training fails with more images (GPU exhausted). I can understand that this is because I am reading all the data at once, but I am a bit blocked on what the next step should be to manage a bigger dataset.
This part below is the one that needs improvement, I guess:
X, y = next(iter(dataset))
Could someone here help me move forward and guide me towards some examples or snippets where I can train the model on a bigger dataset? I am a bit lost on the next move and not sure where to focus in the tf documentation. I did not really find a clear example online that would suit my needs. How should I loop over different batches? How is the iterator coded?
Thanks!
Well, can you give more details about the two functions process_img and read?
During my experiments, I have noticed that the shuffle function can be slow when you have a lot of data and the buffer size is big. Try commenting out that line and check whether it runs faster. If so, you can use pandas to load your CSV file, then shuffle it, and use tf.data.Dataset.from_tensor_slices.
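A rough sketch of that suggestion (the column names 'filename' and 'label' are hypothetical; CSV_FILE, process_img, EPOCHS and the model come from the question), which also lets model.fit consume the dataset batch by batch instead of materializing everything with next(iter(dataset)):

import pandas as pd
import tensorflow as tf

df = pd.read_csv(CSV_FILE)
df = df.sample(frac=1.0, random_state=42)  # shuffle once, in memory, via pandas

dataset = tf.data.Dataset.from_tensor_slices(
    (df['filename'].values, df['label'].values))
dataset = dataset.map(lambda fname, label: (process_img(fname), label),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(32).prefetch(tf.data.experimental.AUTOTUNE)

model.fit(dataset, epochs=EPOCHS)  # batches stream in; no giant X, y held in memory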
Tensorflow has a great tool now to profile models and the dataset pipeline (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras).
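A minimal sketch of hooking the profiler in through the Keras TensorBoard callback (the log directory and profiled batch are arbitrary choices for illustration):

import tensorflow as tf

tb_callback = tf.keras.callbacks.TensorBoard(log_dir='logs/profile',
                                             profile_batch=5)  # profile the 5th batch
model.fit(x=X_train, y=y_train, epochs=EPOCHS,
          validation_data=(X_test, y_test),
          callbacks=[tb_callback])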
process_img and read are very simple functions:
def process_img(filename):
    img = tf.io.read_file(filename)
    return tf.image.decode_jpeg(img, channels=3)

def read(row):
    return row[1]
The shuffle part of my code is slow, but it does not seem to be the cause of the failure; I can remove it and shuffle the data directly from the csv. It seems to fail at the X, y = next(iter(dataset)) line if the dataset is too big.
Thanks for the suggestion to profile the code, I will give it a go. Is there any other possible approach to split and iterate over the dataset?
I'm using the Tensorflow Dataset API to take a bunch of filenames; shuffle the filenames; perform a python function to load the image files, preprocess them, and turn them into tensors; and then cache, repeat, and batch them. So far, so good.
When I add a shuffle() to the tensors, performance degrades by a factor of 9x. Likewise, when I do self.dataset.apply(tf.data.experimental.shuffle_and_repeat(16384)).
Why does shuffle hurt performance so badly, and how can I fix it?
Code:
filenames = tf.data.Dataset.list_files(self.FILE_PATTERN).shuffle(buffer_size=16384)
dataset = filenames.map(lambda filename: self.pp(filename),
                        num_parallel_calls=self.N_CPUS)
dataset = dataset.cache("./cachefile")
# The line below (shuffle_and_repeat) made performance very bad (1s/step without, 9s/step with)
# dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(16384))
# This too:
# dataset = dataset.repeat().shuffle(16384)
# This works fine, but doesn't shuffle:
dataset = dataset.repeat()
dataset = dataset.batch(self.BATCH_SIZE)
dataset = dataset.prefetch(4)
Try changing the prefetch parameter to buffer_size=2:
dataset = dataset.prefetch(2)
prefetch is a performance option: it reads the next buffer_size elements (here, batches) in the background for upcoming iterations. If prefetch's buffer_size is large, it keeps many of them materialized at once and may slow things down due to memory pressure.
Hi, I am studying the Dataset API in TensorFlow now and I have a question regarding the dataset.map() function which performs data preprocessing.
file_names = ["image1.jpg", "image2.jpg", ......]

im_dataset = tf.data.Dataset.from_tensor_slices(file_names)
im_dataset = im_dataset.map(lambda image: tuple(tf.py_func(image_parser, [image],
                                                           [tf.float32, tf.float32, tf.float32])))
im_dataset = im_dataset.batch(batch_size)
iterator = im_dataset.make_initializable_iterator()
The dataset takes in image names and parses them into 3 tensors (3 pieces of information about the image).
If I have a very large number of images in my training folder, preprocessing all of them is going to take a long time.
My question is: since the Dataset API is said to be designed for efficient input pipelines, is the preprocessing done for the whole dataset before I feed it to my workers (let's say GPUs), or does it only preprocess one batch of images each time I call iterator.get_next()?
The map() preprocessing is applied lazily, batch by batch as the iterator is consumed, not to the whole dataset up front. That said, if your preprocessing pipeline is very long and the output is small, the processed data should fit in memory. If this is the case, you can use tf.data.Dataset.cache to cache the processed data in memory or in a file.
From the official performance guide:
The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.
Example use of cache in memory
Here is an example where each pre-processing takes a lot of time (0.5s). The second epoch on the dataset will be much faster than the first
import time

import tensorflow as tf

def my_fn(x):
    time.sleep(0.5)
    return x

def parse_fn(x):
    return tf.py_func(my_fn, [x], tf.int64)

dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)
dataset = dataset.cache()    # cache the processed dataset, so every input will be processed once
dataset = dataset.repeat(2)  # repeat for multiple epochs

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(10):
        # First 5 iterations will take 0.5s each, last 5 will not
        print(sess.run(res))
Caching to a file
If you want to write the cached data to a file, you can provide an argument to cache():
dataset = dataset.cache('/tmp/cache') # will write cached data to a file
This will allow you to only process the dataset once, and run multiple experiments on the data without reprocessing it again.
Warning: You have to be careful when caching to a file. If you change your data, but keep the /tmp/cache.* files, it will still read the old data that was cached. For instance, if we use the data from above and change the range of the data to be in [10, 15], we will still obtain data in [0, 5]:
dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)
dataset = dataset.cache('/tmp/cache')
dataset = dataset.repeat(2)  # repeat for multiple epochs

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(res))  # will still be in [0, 5]...
Always delete the cached files whenever the data that you want to cache changes.
Another issue that may arise is if you interrupt the script before all the data is cached. You will receive an error like this:
AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/tmp/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator.
Make sure that you let the whole dataset be processed to have an entire cache file.
I am loading image files for a neural network, and their size becomes nontrivial when they are read into numpy. Currently I have all the files as .png files on my hard drive. I can safely load the file names and the response variable, since they are several orders of magnitude less than the images themselves.
I would like to read them like this:
import numpy as np
from PIL import Image

def open_files(files_list, indexes):
    file_subset = files_list[indexes[0]:indexes[1]]
    input_arrays = []
    for file in file_subset:
        img = Image.open(file)
        img.load()
        data = np.asarray(img, dtype="int32")
        input_arrays.append(data)
    return np.asarray(input_arrays)
I have my givens defined as follows
train_model = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: open_files(training_files, (index * batch_size, (index + 1) * batch_size)),
        y: training_y[index * batch_size:(index + 1) * batch_size]
    }
)
Obviously, this doesn't work, as I get an IndexError from an invalid slice, but I am not sure how to get Theano to update something that isn't already in memory. It should be possible, as part of the tutorial I am reading says the following:
If you are running your code on the GPU and the dataset you are using is too large to fit in memory the code will crash. In such a case you should store the data in a shared variable. You can however store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you got through the chunk, update the values it stores. This way you minimize the number of data transfers between CPU memory and GPU memory.
How do I do this?
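For reference, a rough sketch of the chunk-swapping pattern that quote describes. Here chunk_size, n_chunks and the image dimensions are hypothetical; x, y, cost, updates, batch_size, training_files, training_y and open_files are as above, and training_y is assumed to be a plain numpy array of labels:

import numpy as np
import theano
import theano.tensor as T

chunk_size = 10 * batch_size  # several minibatches per chunk

x_shared = theano.shared(np.zeros((chunk_size, img_height, img_width, 3),
                                  dtype=theano.config.floatX), borrow=True)
y_shared = theano.shared(np.zeros((chunk_size,), dtype='int32'), borrow=True)

index = T.lscalar('index')
train_model = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: x_shared[index * batch_size:(index + 1) * batch_size],
        y: y_shared[index * batch_size:(index + 1) * batch_size]
    }
)

for chunk in range(n_chunks):
    lo, hi = chunk * chunk_size, (chunk + 1) * chunk_size
    # Refill the shared variables from disk, then train on the chunk's minibatches.
    x_shared.set_value(open_files(training_files, (lo, hi)).astype(theano.config.floatX),
                       borrow=True)
    y_shared.set_value(np.asarray(training_y[lo:hi], dtype='int32'), borrow=True)
    for minibatch_index in range(chunk_size // batch_size):
        train_model(minibatch_index)  # index is relative to the current chunk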