keras model.fit_generator() several times slower than model.fit()

keras model.fit_generator() several times slower than model.fit() - python

Even as of Keras 1.2.2, referencing merge, it does have multiprocessing included, but model.fit_generator() is still about 4-5x slower than model.fit() due to disk reading speed limitations. How can this be sped up, say through additional multiprocessing?

You may want to check out the workers and max_queue_size parameters of fit_generator() in the documentation. Essentially, more workers creates more threads for loading the data into the queue that feeds data to your network. There is a chance that filling the queue might cause memory problems, though, so you might want to decrease max_queue_size to avoid this.

I had a similar problem where I switched to dask to load the data into memory rather than using a generator where I was using pandas. So, depending on your data size, if possible, load the data into memory and use the fit function.

Related

Erratic GPU Utilization while training GAN in Tensorflow 2.5

I'm using Tensorflow 2.5 to train a starGAN network for generating images (128x128 jpeg). I am using tf.keras.preprocessing.image_dataset_from_directory to load the images from the subfolders.
Additionally I am using arguments to maximize loading performance as suggested in various posts and threads such as loadedDataset.cache().repeat.prefetch
I'm also using the num_parallel_calls=tf.data.AUTOTUNE for the mapping functions for post-processing the images after loading.
While training the network on GPU the performance I am getting for GPU Utilization is in the picture attached below.
My question regarding this are:
Is the GPU utlization normal or is it not supposed to be so erratic for traning GANs?
Is there any way to make this performance more consistent?
Is there any way to improve the training performance to fully utlize the GPU?
Note that Ive logged my disk I/O also and there is no bottleneck reading/writing from the disk (nvme ssd).
The system has 32GB RAM and a RTX3070 with 8GB Vram. I have tried the running it on colab also; but the performance was similarly erratic.

It is fairly normal for utilization to be erratic like for any kind of parallelized software, including training GANs. Of course, it would be better if you could fully utilize your GPU, but writing software that does this is challenging and becomes virtually impossible when you are talking about complex applications like GANs.
Let me try to demonstrate with a trivial example. Say you have two threads, threadA and threadB. threadA is running the following python code:
x = some_time_comsuming_task()
y = get_y_from_threadB()
print(x+y)
Here threadA is performing lots of calculations to get the value for x, retrieving the value for y, and printing out the sum of x+y. Imagine threadB is also doing some kind of time consuming calculation to generate the value for y. Unless threadA is ready to retrieve y at the exact same time threadB finishes calculating it, you won't have 100% utilization of both threads for the entire duration of the program. And this is just two threads, when you have 100s of threads working together with multiple chained data dependencies, you can see how it becomes exponentially more difficult to eliminate any and all time threads spend waiting on other threads to deliver input to the next step of computation.
Trying to make your "performance more consistent" is pointless. Whether your GPU utilization went up and down (like in the graph you shared) or it stayed exactly at the average utilization for the entire execution would not change the overall execution time, which is probably the actually important metric here. Utilization is mostly useful to identify where you can optimize your code.
Fully utilize? Probably not. As explained in my answer to question one, it's going to be virtually impossible to orchestrate your GAN to completely remove bottlenecks. I would encourage you to try and improve execution time, rather than utilization, when optimizing your GAN. There's no magic setting that you're missing that will completely unlock all of your GPU's potential.

DataLoader num_workers vs torch.set_num_threads

Is there a difference between the parallelization that takes place between these two options? I’m assuming num_workers is solely concerned with the parallelizing the data loading. But is setting torch.set_num_threads for training in general? Trying to understand the difference between these options. Thanks!

The num_workers for the DataLoader specifies how many parallel workers to use to load the data and run all the transformations. If you are loading large images or have expensive transformations then you can be in situation where GPU is fast to process your data and your DataLoader is too slow to continuously feed the GPU. In that case setting higher number of workers helps. I typically increase this number until my epoch step is fast enough. Also, a side tip: if you are using docker, usually you want to set shm to 1X to 2X number of workers in GB for large dataset like ImageNet.
The torch.set_num_threads specifies how many threads to use for parallelizing CPU-bound tensor operations. If you are using GPU for most of your tensor operations then this setting doesn't matter too much. However, if you have tensors that you keep on cpu and you are doing lot of operations on them then you might benefit from setting this. Pytorch docs, unfortunately, don't specify which operations will benefit from this so see your CPU utilization and adjust this number until you can max it out.

Optimizing RAM usage when training a learning model

I have been working on creating and training a Deep Learning model for the first time. I did not have any knowledge about the subject prior to the project and therefor my knowledge is limited even now.
I used to run the model on my own laptop but after implementing a well working OHE and SMOTE I simply couldnt run it on my own device anymore due to MemoryError (8GB of RAM). Therefor I am currently running the model on a 30GB RAM RDP which allows me to do so much more, I thought.
My code seems to have some horribly inefficiencies of which I wonder if they can be solved. One example is that by using pandas.concat my model's RAM usages skyrockets from 3GB to 11GB which seems very extreme, afterwards I drop a few columns making the RAm spike to 19GB but actually returning back to 11GB after the computation is completed (unlike the concat). I also forced myself to stop using the SMOTE for now just because the RAM usage would just go up way too much.
At the end of the code, where the training happens the model breaths its final breath while trying to fit the model. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for exmaple preprocessing and training) but to do so I would need to store massive datasets in a pickle which can only reach 4GB (correct me if I'm wrong). I have also given thought about using pre-trained models but I truely did not understand how this process goes to work and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!

Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence there is between data points is that it needs to be clear what categories are there overall. So the OHE can be broken into two steps, both of which do not require that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
Step 1.3: feature selection
It may be worthwile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all data. While you can probably also throw something together which streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Step 1.1 and Step 1.2
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.

Slightly orthogonal to your actual question, if your high RAM usage is caused by having entire dataset in memory for the training, you could eliminate such memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read next batch and so on.

Modify Tensorflow Code to place preprocessing on CPU and training on GPU

I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated for training. To try to understand how one would actually implement this within an experiment (ie. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153 inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped with tf.device('/cpu:0') around these two lines would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - There is also a generate_input_fn and parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well or would that basically be forced by having the inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow up, I would like to understand how this can be further adapted in a distributed ML engine experiment - would any of the recommendations above need to change if there were say 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training so that each worker will be independently iterating through the data (and passing gradients asynchronously back to the PS) which suggests to me that no further modifications from the single GPU above would be needed if you train in this way. However, this seems a bit to easy to be true.

MAIN QUESTION:
The 2 codes your placed actually are 2 different parts of the training, Line 282/294 in my options is so called "pre-processing" part, for it's parse raw input data into Tensors, this operations not suitable for GPU accelerating, so it will be sufficient if allocated on CPU.
Line 152/152 is part of the training model for it's processing the raw feature into different type of features.
'cpu:0' means the operations of this section will be allocated on CPU, but not bind to specified core. The operations allocated on CPU will run in multi-threads and use multi-cores.
If your running machine has GPUs, the TensorFlow will prefer allocating the operations on GPUs if the device is not specified.

The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.

Multi-processing on Large Image Dataset in Python

I have a very large image dataset (>50G, single images in a folder) for training, to make loading of images more efficient, I firstly load parts of the images onto RAM and then send small batches to GPU for training.
I want to further speed up the data preparation process before feeding the images to the GPU and was thinking about multi-processing. But I'm not sure how should I do it, any ideas?

For speed I would advise to used HDF5 or LMDB:
I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.
It allows to create binary blobs (LMDB) and they can be read quite fast.
The link above comes with some simple examples on how to create and read the data. Including python generators/ iteratos
For multi-processing:
I personally work with Keras, and by using a python generator it is possible train with mutiple-processing for data using the fit_generator method.
fit_generator(self, generator, samples_per_epoch,
nb_epoch, verbose=1, callbacks=[],
validation_data=None, nb_val_samples=None,
class_weight={}, max_q_size=10, nb_worker=1,
pickle_safe=False)
Fits the model on data generated batch-by-batch by a Python generator. The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU. You can find the source code here , and the documentation here.

Don't know whether you prefer tensorflow/keras/torch/caffe whatever.
Multiprocessing is simply Using Multiple GPUs
Basically you are trying to leverage more hardware by delegating or spawning one child process for every GPU and let them do their magic. The example above is for Logistic Regression.
Of course you would be more keen on looking into Convnets -
This LSU Material (Pgs 48-52[Slides 11-14]) builds some intuition
Keras is yet to officially provide support but you can "proceed at your own risk"
For multiprocessing, tensorflow is a better way to go about this (my opinion)
In fact they have some good documentation on it too

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.