Tensorflow speed-up prediction - python

I have a problem regarding prediction performance. What I do is repeatedly call the test_predictions op in a Python loop and collect all of its return values into a list. The code looks like this:
from tqdm import trange  # trange provides the progress bar used below

predictions = []
for _ in trange(args.num_batches):
    predictions.extend(sess.run(model.test_predictions))
When I look at the performance statistics, my GPU is idle for more than 2/3 of the time, probably because of the continual switching between Python and TF code. I cannot make the batch size bigger, because it won't fit in memory. Is there a better solution I can implement?

There is no such thing as "switching between Python and TF code". If the GPU is idle a lot, it means that fetching the data (images?) you run the predictions on takes a long time, so the GPU has to wait for the data to arrive.
Try implementing pre-fetching.
Alternatively, if you have enough memory, just read all images in at once and feed your network that way.
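A minimal sketch of what pre-fetching with tf.data could look like in TF 1.x graph mode; file_paths, load_and_preprocess, and build_model below are placeholders for your own input and model code, not parts of TensorFlow:

import tensorflow as tf

# Build an input pipeline that preprocesses batches in the background and
# keeps a couple of them ready while the GPU runs the previous one.
dataset = (tf.data.Dataset.from_tensor_slices(file_paths)
           .map(load_and_preprocess, num_parallel_calls=4)
           .batch(batch_size)
           .prefetch(2))

images = dataset.make_one_shot_iterator().get_next()
test_predictions = build_model(images)  # the graph reads its input from the iterator

predictions = []
with tf.Session() as sess:
    # restore or initialize variables here, as in your existing code
    while True:
        try:
            predictions.extend(sess.run(test_predictions))
        except tf.errors.OutOfRangeError:
            break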

Related

Erratic GPU Utilization while training GAN in Tensorflow 2.5

I'm using TensorFlow 2.5 to train a StarGAN network for generating images (128x128 JPEG). I am using tf.keras.preprocessing.image_dataset_from_directory to load the images from the subfolders.
Additionally, I am using arguments to maximize loading performance as suggested in various posts and threads, such as loadedDataset.cache().repeat().prefetch().
I'm also using num_parallel_calls=tf.data.AUTOTUNE in the mapping functions that post-process the images after loading.
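Concretely, the pipeline I'm describing looks roughly like this (the directory path and the post-processing function here are placeholders, not my exact code):

import tensorflow as tf

# Placeholder sketch of the input pipeline described above.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "path/to/images", image_size=(128, 128), batch_size=32)

def post_process(images, labels):
    images = tf.cast(images, tf.float32) / 127.5 - 1.0  # e.g. scale to [-1, 1]
    return images, labels

dataset = (dataset
           .map(post_process, num_parallel_calls=tf.data.AUTOTUNE)
           .cache()
           .repeat()
           .prefetch(tf.data.AUTOTUNE))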
While training the network on the GPU, the GPU utilization I am getting is shown in the picture attached below.
My questions regarding this are:
Is the GPU utilization normal, or is it not supposed to be so erratic when training GANs?
Is there any way to make this performance more consistent?
Is there any way to improve the training performance to fully utilize the GPU?
Note that I've also logged my disk I/O and there is no bottleneck reading/writing from the disk (NVMe SSD).
The system has 32 GB RAM and an RTX 3070 with 8 GB VRAM. I have also tried running it on Colab, but the performance was similarly erratic.
It is fairly normal for utilization to be erratic for any kind of parallelized software, and that includes training GANs. Of course, it would be better if you could fully utilize your GPU, but writing software that does this is challenging and becomes virtually impossible for complex applications like GANs.
Let me try to demonstrate with a trivial example. Say you have two threads, threadA and threadB. threadA is running the following python code:
x = some_time_consuming_task()
y = get_y_from_threadB()
print(x+y)
Here threadA is performing lots of calculations to get the value for x, retrieving the value for y, and printing out the sum x+y. Imagine threadB is also doing some kind of time-consuming calculation to generate the value for y. Unless threadA is ready to retrieve y at the exact moment threadB finishes calculating it, you won't have 100% utilization of both threads for the entire duration of the program. And this is just two threads; when you have hundreds of threads working together with multiple chained data dependencies, you can see how it becomes exponentially more difficult to eliminate all the time threads spend waiting on other threads to deliver input to the next step of computation.
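Here is a runnable toy version of that example (the sleep durations and values are made up purely for illustration); threadA sits idle at the join() until threadB finishes:

import threading
import time

result = {}

def thread_b():
    # Stands in for threadB's slow calculation of y.
    time.sleep(2.0)
    result["y"] = 21

def thread_a():
    # Stands in for threadA's own calculation of x.
    time.sleep(1.0)
    x = 21
    worker.join()  # threadA is idle here, waiting for threadB
    print(x + result["y"])

worker = threading.Thread(target=thread_b)
worker.start()
thread_a()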
Trying to make your "performance more consistent" is pointless. Whether your GPU utilization went up and down (like in the graph you shared) or it stayed exactly at the average utilization for the entire execution would not change the overall execution time, which is probably the actually important metric here. Utilization is mostly useful to identify where you can optimize your code.
Fully utilize? Probably not. As explained in my answer to question one, it's going to be virtually impossible to orchestrate your GAN to completely remove bottlenecks. I would encourage you to try and improve execution time, rather than utilization, when optimizing your GAN. There's no magic setting that you're missing that will completely unlock all of your GPU's potential.

How to achieve Tensorflow Python model() (__call__) performance in C(++) API for small inputs?

The problem
I have a (very) small and fast model saved in the SavedModel format which I can load and run with the following code:
model = tf.keras.models.load_model("./<SavedModelDir>")
outputs = model(inputs, training=False)
This call runs in 0.05 seconds with a batch of 5 inputs (on an Nvidia GPU).
If however I use model.predict_on_batch(inputs) or model.predict(inputs) the performance drops significantly to 0.65 - 0.80 seconds for a batch of 5. This is consistent with the documentation that states that using model() (__call__) is usually faster for smaller inputs.
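For reference, the comparison I'm describing looks roughly like this (the input shape here is a placeholder, and the warm-up runs are omitted for brevity):

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./<SavedModelDir>")
inputs = np.random.rand(5, 10).astype(np.float32)  # placeholder batch of 5

t0 = time.time()
outputs = model(inputs, training=False)      # fast path for small batches
print("__call__:", time.time() - t0)

t0 = time.time()
outputs = model.predict_on_batch(inputs)     # noticeably slower here
print("predict_on_batch:", time.time() - t0)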
The problem I am having is that I am trying to port my model to a C(++) program, and using TF_SessionRun() in the C API or model_bundle.GetSession()->Run() in C++, I am getting performance similar to the "slow" Python inference methods.
What I have tried
Another (very) small model with small batch, same result.
I tried disabling optimizations with tf.config.optimizer.set_experimental_options({'disable_meta_optimizer': True}) to make sure this does not negatively impact performance, but this made things even slower.
I also tried converting the SavedModel to a TensorRT SavedModel. This increases the performance of the model() (__call__) method even further, but all the other methods stop working in Python, and with the downloaded precompiled TensorFlow C GPU API (2.5.0) and the C++ API compiled with Tensorflow_CC I get an error about the operation not being found (TensorRT does not seem to work).
All the performance numbers given were measured after a few warm-up runs.
Performance was measured both with the TensorFlow profiler and Python's time.time.
I checked that model() (__call__) works correctly by inspecting the output, and it does.
My question(s)
Is there a way to get model() (__call__) performance with the Tensorflow C(++) API?
The problem seems to be somewhere in TensorFlow's optimization for larger batch sizes, which hurts the performance of smaller batch sizes. Is there another API that allows faster inference on small batches out of the box (the TensorRT C++ API?)?
I think I figured it out by accident by doing the following for something else I was trying:
I put tf.compat.v1.disable_v2_behavior() at the top of the script and then print(len(outputs)) right after getting the outputs. This gives the following error: TypeError: len is not well defined for symbolic Tensors.
By Googling I found out that symbolic tensors are tensors that do not directly hold values, so the values are probably filled in later.
This means that model() (__call__) does its computation asynchronously, and timing the function gives a misleading value. This can be "fixed" by stopping the timer only after printing/using every output, or by just using the predict() method to avoid the issue completely.
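A minimal sketch of timing it that way (the input shape is a placeholder): convert the outputs to NumPy before stopping the clock, so the measurement includes the actual computation.

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./<SavedModelDir>")
inputs = np.random.rand(5, 10).astype(np.float32)  # placeholder batch of 5

t0 = time.time()
outputs = model(inputs, training=False)
_ = np.asarray(outputs)  # force the result to be materialized on the host
print("__call__ including materialization:", time.time() - t0)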

Tensorflow not fully utilizing GPU in GPT-2 program

I am running the GPT-2 code for the large model (774M). It is used for generating text samples through interactive_conditional_samples.py, link: here
So I've given an input file containing prompts which are automatically selected to generate output. This output is also automatically copied into a file. In short, I'm not training it, I'm using the model to generate text.
Also, I'm using a single GPU.
The problem I'm facing is that the code is not utilizing the GPU fully.
Using the nvidia-smi command, I was able to see the image below:
https://imgur.com/CqANNdB
It depends on your application. It is not unusual to have low GPU utilization when the batch_size is small. Try increasing the batch_size for more GPU utilization.
In your case, you have set batch_size=1 in your program. Increase the batch_size to a larger number and verify the GPU utilization.
Let me explain using MNIST-sized networks. They are tiny and it's hard to achieve high GPU (or CPU) efficiency for them. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models, the statistical efficiency drops off very quickly after a batch_size of 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
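To make the trade-off concrete, here is a toy sketch (a small Keras model, not GPT-2) that measures inference throughput at a few batch sizes:

import time
import numpy as np
import tensorflow as tf

# Toy model standing in for a real network, purely to illustrate the batch-size effect.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(256,)),
    tf.keras.layers.Dense(10),
])

for batch_size in (1, 16, 128):
    data = np.random.rand(batch_size, 256).astype(np.float32)
    model(data)  # warm-up
    t0 = time.time()
    for _ in range(100):
        model(data)
    elapsed = time.time() - t0
    print(batch_size, "->", round(100 * batch_size / elapsed), "examples/sec")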
Hope this answers your question. Happy Learning.

Best way to import data in google-colaboratory for fast computing and training?

I am running a simple deep learning model on Google's colab, but it's running slower than my MacBook Air with no GPU.
I read this question and found out it's a problem because of dataset importing over the internet, but I am unable to figure out how to speed up this process.
My model can be found here. Any idea of how I can make the epoch faster?
My local machine takes 0.5-0.6 seconds per epoch and Google Colab takes 3-4 seconds.
Is a GPU always faster than a CPU? No. Why? Because the speed-up from a GPU depends on a few factors:
How much of your code runs in parallel, i.e. how much of your code creates work that can execute in parallel. This is automatically taken care of by Keras and should not be a problem in your scenario.
The time spent sending data between the CPU and GPU. This is where people often falter: it is assumed that the GPU will always outperform the CPU, but if the data being passed is too small, the computation itself takes less time than splitting the data/processes across threads, executing them on the GPU, and then recombining them on the CPU.
The second scenario looks probable in your case, since you have used a batch_size of 5:
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5)
If your dataset is big enough, increasing the batch_size will increase the performance of the GPU over the CPU.
Other than that, you have used a fairly simple model, and as @igrinis pointed out, the data is loaded only once from drive to memory, so in theory the problem should not be loading time, because the data is loaded from drive only once.
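For illustration (build_classifier is your existing function and 32 is just an example value; with tf.keras the wrapper lives under tensorflow.keras.wrappers.scikit_learn), increasing the batch size would look like:

from keras.wrappers.scikit_learn import KerasClassifier

# Larger batches amortize the CPU-GPU transfer cost; 32 is only an example.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=32)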

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function to allow data to be sent to the GPU while it is doing backprops. I suspect the data generation is too slow compared to the speed at which the GPU processes it.
In the generation of data, as instructed in the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.
Maybe you can save the model at the end of the training. Then start another Python file and write os.environ["CUDA_VISIBLE_DEVICES"] = "-1" before you import any Keras or TensorFlow stuff. Now you should be able to load the model and make predictions with your CPU.
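A minimal sketch of such a prediction script (the model path and input shape are placeholders):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide the GPU before importing TF/Keras

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("saved_model_path")  # placeholder path
q_values = model.predict(np.random.rand(1, 4).astype(np.float32))  # placeholder state shape
print(q_values)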
It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force tensorflow to use a specific device.
from tensorflow.python.client import device_lib
import tensorflow as tf

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_devices()

with tf.device('/gpu:0'):
    # Do GPU stuff here
    pass

with tf.device('/cpu:0'):
    # Do CPU stuff here
    pass
