How to decrease GPU inference time and increase its utilization? - python

We ran inference on 1280×720 RGB images using a Faster R-CNN from the TensorFlow model zoo, trained on the COCO dataset, and got the following results.
Test case 1:
Created a tf session using tf.Session(graph=tf.Graph())
Batch Size = 4, GPU use = 100% (as default)
Time Taken for inference of 4 images = 0.32 secs
Test case 2:
Then we restricted the GPU memory available to each TF session to a fraction using gpu_options:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.40)
tf.Session(graph=tf.Graph(), config=tf.ConfigProto(gpu_options=gpu_options))
For a batch size of 2, the time taken for inference was 0.16 secs, which is understandable because the time scales roughly linearly with batch size.
Test case 3:
Ran 2 instances of inference as two different Python processes, each with a batch size of 2.
The inference slowed down considerably, and the final value was 0.32 secs, which is the same as one process running inference on a batch size of 4.
Batch Size = 2, GPU use = 40%, Processes = 2
Time Taken = 0.32 secs
Hence, in Case 1 and Case 3, the time taken is the same.
Q.1 Is there any way to reduce the time taken?
Q.2 Where is the bottleneck? Even in the first case, it doesn't seem that the entire GPU memory is being utilised, so we thought that splitting the inference into two processes would make more efficient use of the resources. Where are we going wrong in our understanding?
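For reference, a minimal sketch of how this benchmark can be reproduced in a single TF1 session. The path frozen_inference_graph.pb and the tensor names image_tensor / detection_boxes are the standard ones for model-zoo frozen graphs; they are assumptions here, not taken from the original post.

import time
import numpy as np
import tensorflow as tf

# Hypothetical path to the frozen Faster R-CNN graph from the model zoo.
GRAPH_PATH = 'frozen_inference_graph.pb'

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# Optionally cap the per-process GPU memory, as in test cases 2 and 3.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.40)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(graph=graph, config=config) as sess:
    image_tensor = graph.get_tensor_by_name('image_tensor:0')
    detections = graph.get_tensor_by_name('detection_boxes:0')
    batch = np.random.randint(0, 255, (4, 720, 1280, 3), dtype=np.uint8)
    sess.run(detections, feed_dict={image_tensor: batch})  # warm-up run
    start = time.time()
    sess.run(detections, feed_dict={image_tensor: batch})
    print('batch of 4:', time.time() - start, 'secs')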

Related

Are deep and wide autoencoder trainings just slow, or is there something wrong here?

I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around 30 million samples that add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+ reduction servers) and a batch size of 256 per replica, the performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization. This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be this slow.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be this slow, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU counts
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit; did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)

def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)
return dataset.prefetch(tf.data.AUTOTUNE)
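For reference, a hedged sketch (not part of the original code) of how one might time the input pipeline on its own, without calling model.fit, to see whether it can keep up with the GPUs:

import time

def benchmark_input_pipeline(dataset, num_batches=200):
    # Pull batches straight from the tf.data pipeline, bypassing the model.
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print('%d batches in %.1f s (%.1f batches/s)'
          % (num_batches, elapsed, num_batches / elapsed))

# If the batches/s here is close to what model.fit achieves, the GPUs are being
# starved by the input pipeline rather than limited by compute.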

OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference

I'm trying to accelerate my model by running it with ONNX Runtime. However, I'm getting weird results when trying to measure inference time.
While running only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms the OpenVINOExecutionProvider:
CPUExecutionProvider - 0.72 seconds
OpenVINOExecutionProvider - 4.47 seconds
But if I run, let's say, 5 iterations, the result is different:
CPUExecutionProvider - 3.83 seconds
OpenVINOExecutionProvider - 14.13 seconds
And if I run 100 iterations, the result is drastically different:
CPUExecutionProvider - 74.19 seconds
OpenVINOExecutionProvider - 46.96 seconds
It seems to me that the inference time of the OpenVINO EP is not linear, but I don't understand why.
So my questions are:
Why does OpenVINOExecutionProvider behave this way?
Which Execution Provider should I use?
The code is very basic:
import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm

limit = 5

# MODEL
device = 'CPU_FP32'
model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type': device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type': device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
Using ONNX Runtime with the OpenVINO Execution Provider lets you run ONNX models through the ONNX Runtime API while the OpenVINO toolkit runs in the backend.
This accelerates ONNX model performance on the same hardware compared to generic acceleration on Intel® CPUs, GPUs, VPUs and FPGAs.
Generally, the CPU Execution Provider works best with small iteration counts, since its goal is to keep the binary size small. The OpenVINO Execution Provider, on the other hand, is intended for deep learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).
This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider for larger iteration counts.
You should choose the Execution Provider that fits your requirements. If you are going to run a complex DL model for many iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary and run fewer iterations, you can choose the CPU Execution Provider instead.
For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
Regarding the non-linear timing, it might be that some preparation happens the first time you run the model with OpenVINO - perhaps the model is compiled for OpenVINO on the first call to sess.run. I observed a similar effect when using TFLite. In these scenarios, it makes sense to discard the time of the first iteration when benchmarking. There also tends to be quite a bit of variance, so running >10 or ideally >100 iterations is a good idea.
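A small sketch of that approach, reusing sess, input_name and image from the code above: run once to warm up, then average over many timed iterations.

import time

def benchmark_session(sess, input_name, image, iterations=100):
    # The first run may trigger provider-specific compilation; exclude it from timing.
    sess.run(None, {input_name: image})
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        sess.run(None, {input_name: image})
        timings.append(time.perf_counter() - start)
    print('mean over %d runs: %.4f s' % (iterations, sum(timings) / len(timings)))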

How does the batch size affect the quality of the predict/predict_on_batch method in Keras?

While working on speeding up my LSTM NN, I stumbled upon a question:
How does the batch size affect the quality of the predict/predict_on_batch method?
To clarify: I know it affects the calculation of the gradient during the training phase and therefore the prediction results,
but should the prediction results vary if the batch size used for batch prediction
differs from the batch size used during training?
Or should one always use the same batch size for training/fitting and prediction, and if so, why?
What I tried:
I have a well-working LSTM for anomaly detection (0.96 detection rate and 0 false alarms),
but it takes a long time to go through all test files because of the slow prediction (40 min),
using predict(x) on a model trained with a batch_size of 512.
When using predict_on_batch with batch_size=1, it again takes 35 min to test, with 0.96 DR and 0 FA.
When using predict_on_batch(batch, batch_size), and likewise predict(batch, batch_size),
with a batch size of 512 (the same as used in training) instead of predict(x), it takes only about 32 sec,
but the results are much worse (0.63 DR, 62455 FA).
The same applies to smaller batch_size values greater than 1.
The answer in this thread, Prediction is depending on the batch size in Keras, is pretty outdated, but is it still accurate (blaming standardization)?
I don't really understand what is happening here; I hope someone can clarify!
Versions used:
Keras 2.3.1
Python 3.7.4
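For reference, a minimal sketch of how such a timing comparison could be set up; model and x_test are placeholders for the trained LSTM and the test data from the question, not the original code.

import time

# Hypothetical timing comparison: same model, same data, different batch sizes.
for bs in (1, 32, 512):
    start = time.time()
    preds = model.predict(x_test, batch_size=bs, verbose=0)
    print('batch_size=%d: %.1f s, output shape %s' % (bs, time.time() - start, preds.shape))

# For a stateless model the outputs should not depend on batch_size (beyond
# floating-point noise); stateful layers, e.g. stateful LSTMs, are a common exception.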

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also very minimal, though noticeable, if I decrease the parameter count by 100x, but I'd rather not shrink the model that much anyway. In addition, the predict function is very slow the first time it runs (~0.2 s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly 10 samples can be evaluated very quickly; all I want is to predict the samples at different times rather than all at once, since I need to update the tree search before making a new prediction. Should I perhaps work with TensorFlow directly instead?
The batch size controls parallelism when predicting, so it is expected that increasing the batch size gives better performance, as you can use more cores and use the GPU more efficiently.
There is not really anything to work around: using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments run on a GPU, to minimize the overhead due to data transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.
One way that gave me a speed-up was switching from model.predict(x) to model.predict_on_batch(x), making sure your x shape has 1 as the first dimension.
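For example (a sketch; sample and model are placeholders for your own input and network):

import numpy as np

# predict_on_batch skips the batching machinery inside model.predict, which
# helps when it is called many times with a single sample.
x = np.expand_dims(sample, axis=0)   # add the batch dimension: shape (1, ...)
out = model.predict_on_batch(x)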
I don't think working with pure TensorFlow would change the performance much; Keras is a high-level API over low-level TensorFlow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to stay with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to the Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO cannot convert an HDF5 model directly, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Idle GPU between inferences when using Keras

Question:
There seems to be a slight idle time between inferences when using Keras with a TensorFlow backend, as shown in the NVIDIA profiler. The terminal output shows that Keras takes ~80 ms for each inference, while a check with the profiler shows that the calculation takes 62 ms (including host-to-device and device-to-host copies), so the GPU is basically idling the rest of the time. In addition, the average inference time differs (between 80 and 150 ms) each time I run the code below, and I am unsure why that is the case.
I was wondering if there is any way to speed up the inference time
by reducing this idling time.
Another question is whether I am missing some step, like clearing the GPU
memory, that causes the difference in inference time.
The input to my model is (1, 800, 700, 36), which is basically a point cloud in a voxelized grid. The outputs are two tensors of shape (1, 200, 175, 1) and (1, 200, 175, 6).
Setup:
Ubuntu 16.04
Intel(R) Core(TM) i7-7820X CPU
GeForce GTX 980
Python 2.7
Tensorflow 1.12.0
Keras 2.2.4
Attempts to reduce inference idle time
Allowing the memory to dynamically grow on the GPU
config.gpu_options.allow_growth = True
Limit the maximum gpu usage
config.gpu_options.per_process_gpu_memory_fraction = 0.66
Set learning phase to test
keras.backend.set_learning_phase(0)
Changed batch_size to None and steps to 1. Changing batch_size to 1 and steps to None makes inference slower.
model.predict(input_data, batch_size=None, verbose=1, steps=1)
Terminal Output
Output of Keras verbose in terminal showing each step taking ~80 ms
Same code: output of Keras verbose in terminal showing each step taking ~100 ms
Images of NV profiler
As the NV profiler timeline is very long, I have split the image into 3 parts and minimized process threads that are not running anything during that period. The images show that the actual calculation takes ~60 ms and the GPU is basically doing nothing for ~20 ms. This idle time changes from run to run and sometimes goes up to 70 ms.
Parts 1-3 of the profiler timeline: the time between the green lines is idle (nothing is running).
Code
import numpy as np
import tensorflow as tf
import keras.backend.tensorflow_backend
from keras.backend.tensorflow_backend import set_session
from keras.models import load_model

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.66
sess = tf.Session(config=config)
set_session(sess)
keras.backend.set_learning_phase(0)

# Custom objects needed to deserialize the model (huber_loss is assumed to come from tf.losses here).
custom_losses = {
    "binary_crossentropy": "binary_crossentropy",
    "huber_loss": tf.losses.huber_loss,
}
model = load_model('/home/louis/pixor/models/city_trainin_set/pixor_model_10_0.008.hdf5',
                   custom_objects=custom_losses)

# User-defined helpers that load and voxelize the point cloud input.
obj = massage_own_label(base_dir, data)
input_data = obj.process_input_data(obj.load_pc_data(0))

for i in range(0, 100):
    out_class, out_labels = model.predict(input_data, batch_size=None, verbose=1, steps=1)
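As a side note, a sketch of how the per-call time could be measured directly around that loop, independently of the Keras progress bar:

import time

timings = []
for i in range(0, 100):
    start = time.time()
    out_class, out_labels = model.predict(input_data, batch_size=None, verbose=0, steps=1)
    timings.append(time.time() - start)
print('mean per-inference time: {:.1f} ms'.format(1000.0 * sum(timings) / len(timings)))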
