Question:
There seems to be a slight idling time between inferences when using Keras with a TensorFlow backend, as shown in NVIDIA Visual Profiler. The attached image of my terminal shows that Keras takes ~80ms for each inference, while a check with the profiler shows that the calculation takes 62ms (including the host-to-device and device-to-host copies) and the GPU is basically idling for the rest of the time. In addition, the average inference time differs (between 80 and 150ms) each time I run the code below, and I am unsure why that is the case.
I was wondering if there is any way to speed up inference by reducing this idling time.
Another question: am I missing some step, such as clearing the GPU memory, that causes the difference in inference time?
The input to my model is (1,800,700,36), which is basically a point cloud in a voxelized grid. The output is two matrices, (1,200,175,1) and (1,200,175,6).
Setup:
Ubuntu 16.04
Intel(R) Core(TM) i7-7820X CPU
GeForce GTX 980
Python 2.7
Tensorflow 1.12.0
Keras 2.2.4
Attempts to reduce inference idle time
Allowing the memory to dynamically grow on the GPU
config.gpu_options.allow_growth = True
Limit the maximum GPU memory usage
config.gpu_options.per_process_gpu_memory_fraction = 0.66
Set learning phase to test
keras.backend.set_learning_phase(0)
Change batch size to None and steps to 1; changing batch size to 1 and steps to None makes inference slower
model.predict(input_data, batch_size=None, verbose=1, steps=1)
Terminal Output
Output of the Keras verbose log in the terminal showing each step taking ~80ms
Same code; output of the Keras verbose log in the terminal showing each step taking ~100ms
Images of NV profiler
As the NV Visual Profiler trace is very long, I have split the image into 3 parts and minimized the process threads that are not running anything during that period. The images show that the actual calculation takes ~60ms and the GPU is basically doing nothing for 20ms. This idle time changes from run to run and sometimes goes up to 70ms.
Part 1: nothing happens in the interval between the green lines
Part 2: nothing happens in the interval between the green lines
Part 3: nothing happens in the interval between the green lines
Code
import numpy as np
import tensorflow as tf
import keras.backend.tensorflow_backend
from keras.backend.tensorflow_backend import set_session
from keras.models import load_model

# Limit GPU memory usage and register the session with Keras
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.66
sess = tf.Session(config=config)
set_session(sess)
keras.backend.set_learning_phase(0)  # inference only

# Custom objects needed to deserialize the model's losses
losses = {
    "binary_crossentropy": "binary_crossentropy",
    "huber_loss": tf.losses.huber_loss,
}

model = load_model('/home/louis/pixor/models/city_trainin_set/pixor_model_10_0.008.hdf5', custom_objects=losses)

obj = massage_own_label(base_dir, data)  # project-specific preprocessing (not shown)
input_data = obj.process_input_data(obj.load_pc_data(0))

for i in range(0, 100):
    out_class, out_labels = model.predict(input_data, batch_size=None, verbose=1, steps=1)
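A minimal timing sketch over the same loop, treating the first predict call as a warm-up (it includes one-off graph finalization and CUDA context setup), gives numbers that are independent of the Keras progress bar:

import time

# Warm-up: the first call includes one-off graph and CUDA context setup
model.predict(input_data, batch_size=None, steps=1)

timings = []
for i in range(100):
    t0 = time.time()
    model.predict(input_data, batch_size=None, steps=1)
    timings.append(time.time() - t0)

print("mean %.1f ms, min %.1f ms, max %.1f ms"
      % (1000 * sum(timings) / len(timings), 1000 * min(timings), 1000 * max(timings)))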
Related
I want to implement model parallelism with the Trainer in PyTorch Lightning.
My environment is a single machine with two 16GB GPUs.
A single GPU cannot hold my whole model, because the model consists of several layers plus one very large layer (almost 14GB).
I run the following code.
import pytorch_lightning as pl
# from pytorch_lightning.strategies import DeepSpeedStrategy
from mymodel import Net
from mydatamodule import DataModule

net = Net()
dm = DataModule()

trainer = pl.Trainer(
    accelerator='gpu',
    strategy='ddp_sharded',
    # strategy=DeepSpeedStrategy(
    #     stage=2,
    #     config={
    #         "autotuning": {
    #             "mp_size": 2
    #         }
    #     }
    # ),
    num_nodes=1,
    precision=16,
    check_val_every_n_epoch=10,
)
trainer.fit(net, dm)
My understanding is that the model parallelism in Fairscale/DeepSpeed distributes the layers of a model across GPUs so that each GPU stays within its memory limit, as in the figures in these articles:
https://towardsdatascience.com/distributed-parallel-training-model-parallel-training-a768058aa02a
https://huggingface.co/transformers/v4.9.2/parallelism.html#naive-model-parallel-vertical-and-pipeline-parallel
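As a minimal sketch of that naive vertical model parallelism (with made-up layers, not my actual Net), splitting a model across two GPUs looks roughly like this:

import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Naive vertical model parallelism: early layers on GPU 0, the big layer on GPU 1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")  # stands in for the ~14GB layer

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))  # activations are moved between GPUs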
The first validation run triggered by check_val_every_n_epoch works.
But the code above behaves like Distributed Data Parallel during the training phase and raises an OOM error.
When the OOM error occurred, both GPUs were holding over 14GB.
This code also raises "DeadLock detected from rank: 0" before the OOM error.
I have not understood the cause of this behavior.
I tried the strategies "ddp_sharded", "deepspeed-stage-1", "deepspeed-stage-2", "deepspeed-stage-3", and "fsdp", but the results are all the same.
I want to resolve the two problems below.
Is my understanding of Model Parallelism wrong?
How do I fix my code?
System Details
OS: Ubuntu 20.04
Python: 3.8.10
PyTorch: 1.12.1+cu113
PyTorch Lightning: 1.7.7
Fairscale: 0.5.12
DeepSpeed: 0.7.4
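For reference, a sketch of an explicit DeepSpeed configuration along the lines of the commented-out block above, using stage 3 with CPU offloading in the Lightning 1.7 API; this is an illustration of that API only, not something I have verified resolves the OOM:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    num_nodes=1,
    precision=16,
    strategy=DeepSpeedStrategy(
        stage=3,                  # shard parameters, gradients, and optimizer state
        offload_optimizer=True,   # keep optimizer state in CPU memory
        offload_parameters=True,  # keep parameters in CPU memory when not in use
    ),
    check_val_every_n_epoch=10,
)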
I'm trying to accelerate my model by converting it to ONNX and running it with ONNX Runtime. However, I'm getting weird results when trying to measure inference time.
While running only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms the OpenVINOExecutionProvider:
CPUExecutionProvider - 0.72 seconds
OpenVINOExecutionProvider - 4.47 seconds
But if I run let's say 5 iterations the result is different:
CPUExecutionProvider - 3.83 seconds
OpenVINOExecutionProvider - 14.13 seconds
And if I run 100 iterations, the result is drastically different:
CPUExecutionProvider - 74.19 seconds
OpenVINOExecutionProvider - 46.96 seconds
It seems to me that the inference time with the OpenVINO EP does not scale linearly, but I don't understand why.
So my questions are:
Why does OpenVINOExecutionProvider behave this way?
What ExecutionProvider should I use?
The code is very basic:
import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm
limit = 5
# MODEL
device = 'CPU_FP32'
model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)
# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
Using ONNX Runtime with the OpenVINO Execution Provider enables inferencing of ONNX models through the ONNX Runtime API while the OpenVINO toolkit runs as the backend.
This accelerates ONNX model performance on the same hardware compared to generic acceleration on Intel® CPUs, GPUs, VPUs and FPGAs.
Generally, the CPU Execution Provider works best with small iteration counts, since its intention is to keep the binary size small. Meanwhile, the OpenVINO Execution Provider is intended for deep learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).
This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider at larger iteration counts.
You should choose the Execution Provider that suits your own requirements. If you are going to execute complex deep learning workloads with many iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary size and run fewer iterations, you can choose the CPU Execution Provider instead.
For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
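As a side note, ONNX Runtime accepts a priority-ordered list of providers, so a session can prefer OpenVINO and fall back to the default CPU provider when it is unavailable (a sketch reusing road.onnx from the question):

import onnxruntime as rt

# Providers are tried in order; ONNX Runtime falls back to the next one if needed.
sess = rt.InferenceSession(
    'road.onnx',
    providers=['OpenVINOExecutionProvider', 'CPUExecutionProvider'],
)
print(sess.get_providers())  # shows which providers were actually enabled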
Regarding non-linear time, it might be the case that there is some preparation that happens when you first run the model with OpenVINO - perhaps the model is first compiled to OpenVINO when you first call sess.run. I observed a similar effect when using TFLite. For these scenarios, it makes sense to discard the time of the first iteration when benchmarking. There also tends to be quite a bit of variance so running >10 or ideally >100 iterations is a good idea.
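A sketch of that benchmarking pattern, reusing sess, input_name and image from the question's code; the first run is discarded as a warm-up and the remaining iterations are averaged:

import time

# Warm-up: the first run may trigger model compilation inside the provider
sess.run(None, {input_name: image})

n_iters = 100
start = time.time()
for _ in range(n_iters):
    sess.run(None, {input_name: image})
print("average per-inference time: %.4f s" % ((time.time() - start) / n_iters))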
We ran inference on 1280x720 RGB images using a Faster R-CNN from the TensorFlow model zoo, trained on the COCO dataset, and got the following results.
Test case 1:
Created a tf session using tf.Session(graph=tf.Graph())
Batch Size = 4, GPU use = 100% (as default)
Time Taken for inference of 4 images = 0.32 secs
Test case 2:
Then we restricted the GPU usage for each TF session to a fraction using gpu_options
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.40)
tf.Session(graph=tf.Graph(), config=tf.ConfigProto(gpu_options=gpu_options))
For a batch size of 2, the time taken for inference was 0.16 secs, which is understandable because the time scales roughly linearly with batch size.
Test case 3:
Ran 2 instances of inference as two different Python processes. Here, both batch sizes were 2.
The inference slowed down considerably, and the final value was 0.32 secs, which is the same as one process running inference over a batch size of 4.
Batch Size = 2, GPU use = 40%, Processes = 2
Time Taken = 0.32 secs
Hence, in Case 1 and Case 3, time taken is the same.
Q.1 Is there any way to reduce the time taken?
Q.2 Where is the bottleneck? Even in the 1st case, it doesn't seem that the entire GPU memory is being utilised, so we thought that splitting the inference across two processes would use the resources more efficiently. Where are we going wrong in our understanding?
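For completeness, a sketch of how the two processes in test case 3 can be launched; the script name and flags are placeholders for our actual inference script, which sets per_process_gpu_memory_fraction as shown above:

import subprocess

# Launch two independent inference processes, each restricted to 40% of GPU memory
# (infer.py and its flags are placeholder names for illustration only)
procs = [
    subprocess.Popen(["python", "infer.py", "--mem-fraction", "0.40", "--batch-size", "2"])
    for _ in range(2)
]
for p in procs:
    p.wait()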
I am trying to perform some hyperparameter tuning of a convolutional neural network written in Tensorflow 2.0 with GPU extension.
My systems settings are:
Windows 10 64bit
GeForce RTX2070, 8GB
Tensorflow 2.0-beta
CUDA 10.0 properly installed (I hope, deviceQuery.exe and bandwidthTest.exe passed positively)
My neural network has 75,572,574 parameters and I am training it on 3,777 samples. In a single run, I have no problems training the CNN.
As a next step, I wanted to tune two hyperparameters of the CNN. To this end, I created a for loop (iterating over 20 steps) in which I build and compile a new model at every iteration, changing the hyperparameters each time.
The gist of the code (this is not an MWE) is the following
import numpy as np
import tensorflow as tf
from tensorflow import keras

def build_model(input_shape, output_shape, lr=0.01, dropout=0, summary=True):
    model = keras.models.Sequential(name="CNN")
    model.add(keras.layers.Conv2D(32, (7, 7), activation='relu', input_shape=input_shape, padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Conv2D(128, (3, 3), activation='relu', padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1024, activation='relu'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(output_shape, activation='linear'))
    model.build()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[RMSE])  # RMSE metric defined elsewhere
    if summary:
        print(model.summary())
    return model

...

for run_id in range(25):
    # learning_rate and dropout here are hyperparameter-range objects defined elsewhere
    lr = learning_rate.max_value + (learning_rate.min_value - learning_rate.max_value) * np.random.rand(1)
    dropout = dropout.min_value + (dropout.max_value - dropout.min_value) * np.random.rand(1)
    print("%=== Run #{0}".format(run_id))
    run_dir = hparamdir + "\\run{0}".format(run_id)
    model0 = build_model(IMG_SHAPE, Ytrain.shape[1], lr=lr, dropout=dropout)
    model0_history = model0.fit(Xtrain,
                                Ytrain,
                                validation_split=0.3,
                                epochs=2,
                                verbose=2)
The problem I encountered is that, after a few (6) loops, the program halts returning the error
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[73728,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: dense_12/kernel/Initializer/random_uniform/
Process finished with exit code 1.
I believe the problem is that the GPU does not release the memory in between each iteration of the for loop and, after a while, it saturates and crashes.
I have dug around and tried different solutions suggested in similar posts (post1, post2):
1. Releasing the memory using the Keras backend at the end of every iteration of the for loop:
from keras import backend as K
K.clear_session()
2. Clearing the GPU using Numba and CUDA:
from numba import cuda
cuda.select_device(0)
cuda.close()
3. Deleting the graph with del model0, but that did not work either.
I couldn't try tf.reset_default_graph(), since the programming style of TF 2.0 no longer has a default graph (AFAIK), so I have not found a way to kill/delete a graph at runtime.
Solutions 1 and 3 returned the same out-of-memory error, while solution 2 returned the following error during the second iteration of the for loop, while building the model in the build_model() call:
2019-07-24 19:51:31.909377: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle
Process finished with exit code -1073740791 (0xC0000409)
I tried to look around and I don't really understand the last error; I would guess the GPU has not been closed properly, is still occupied, or can no longer be seen by Python.
In any case, I could not find any solution to this issue, except running the training by hand for every hyperparameter to be tested.
Does anybody have any idea how to solve this problem?
Or a workaround for hyperparameter tuning?
Should I open an issue in the TF 2.0 GitHub issue tracker (it does not appear to be a TensorFlow issue per se, since they state that they deliberately do not free the GPU memory to avoid fragmentation problems)?
This is due to how TF handles memory.
If you monitor your system while iteratively training TF models, you will observe a linear increase in memory consumption. Additionally, if you watch -n 0.1 nvidia-smi you will notice that the PID for the process remains constant while iterating. TF does not fully release utilized memory until the PID controlling the memory is killed. Also, the Numba documentation notes that cuda.close() is not useful if you want to reset the GPU (though I definitely spent a while trying to make it work when I discovered it!).
The easiest solution is to iterate using the Ray python package, with something like the following:
import ray

@ray.remote(
    num_gpus=1  # or however many you want to use (e.g., 0.5, 1, 2)
)
class RayNetWrapper:
    def __init__(self, net):
        self.net = net

    def train(self):
        return self.net.train()

ray.init()
actors = [RayNetWrapper.remote(model) for _ in range(25)]
results = ray.get([actor.train.remote() for actor in actors])
You should then notice that GPU processes cycle on/off with a new PID each time, and your system memory will no longer keep increasing. Alternatively, you can put your model training code in a new python script and call it iteratively using python's subprocess module. You will also notice some latency when models shut down and new models boot up, but this is expected because the GPU is restarting. Ray also has an experimental asynchronous framework that I've had some success with, and it enables fractional sharing of GPUs (model size permitting).
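A sketch of that subprocess variant, assuming the per-trial training code lives in a hypothetical train_one.py that accepts the learning rate and dropout as arguments; each trial runs in a fresh process, so all GPU memory is released when the process exits:

import subprocess
import numpy as np

for run_id in range(25):
    lr = 10 ** np.random.uniform(-4, -2)   # example sampling; adjust to your ranges
    dropout = np.random.uniform(0.0, 0.5)
    # Each trial gets a fresh Python process and therefore a fresh CUDA context.
    subprocess.run(
        ["python", "train_one.py", "--lr", str(lr), "--dropout", str(dropout)],
        check=True,
    )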
You can place these lines at the top of your code:
import tensorflow as tf
from tensorflow.python.framework.config import set_memory_growth

tf.compat.v1.disable_v2_behavior()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Allow memory to grow on demand instead of grabbing it all up front
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
That works for me.
I'm running a series of neural networks (Keras with a TensorFlow backend), and I have the following results for the time it took to train each neural network in a Jupyter Notebook:
ELAPSED TIME: 2.7005105018615723
0
ELAPSED TIME: 2.4810903072357178
1
ELAPSED TIME: 2.801435708999634
2
ELAPSED TIME: 2.6753993034362793
3
ELAPSED TIME: 2.8625667095184326
4
ELAPSED TIME: 2.5828065872192383
5
while later on I have:
ELAPSED TIME: 5.062163829803467
0
ELAPSED TIME: 5.162402868270874
1
ELAPSED TIME: 5.301288366317749
2
ELAPSED TIME: 5.386904001235962
3
ELAPSED TIME: 6.126806020736694
4
The program consists of a function that trains a separate neural network model on its respective dataset and only exports the final training accuracy (saved to another file).
I thought the reason the later networks took longer to train was that the program was consuming too much memory, so I deleted the models (with the del keyword) after obtaining their training accuracy, but that doesn't seem to be doing much.
If I restart the Jupyter Notebook kernel, the time to run each network shortens back to about 2 seconds (the original duration), but the later models again take longer and longer to run.
What could be possible reasons for this, and what solutions could be implemented?
NOTE: I did not include any code because it would make this post denser, but I can upload it if necessary.
Are you running on an NVIDIA GPU? If so, it's possible some part of the old model is still on the GPU. Try running nvidia-smi while the slow model is running and see if anything else is using up GPU memory/resources.
If that doesn't work, you can also generate the TensorFlow timeline and compare it between the slow and fast runs. More info on how to generate the timeline in Keras is here: https://github.com/tensorflow/tensorflow/issues/9868, and the code I pasted below for creating the timeline was taken from that link.
import tensorflow as tf
from tensorflow.python.client import timeline

# Make your keras model
# ...
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
model.compile(loss='MSE', optimizer='Adam', options=run_options, run_metadata=run_metadata)
# Run model in your usual way
# ...
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.ctf.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
For more info on the TensorFlow timeline see https://stackoverflow.com/a/37774470/2826818. The timeline will let you see how much time each operation takes and determine which one is causing the slowdown.
You can clear your session after you are done with each model; this fixed the issue for me.
from keras import backend as K

for model in models_to_train:
    model.fit(X, y)
    # save necessary metrics
    K.clear_session()
(Also, I was getting a segmentation fault until I updated tensorflow-gpu to 1.9.0.)