Running Neural Networks Takes Longer Over Time - Python

I'm running a series of neural networks (Keras library using a Tensorflow backend), and I have the following results for the time it took to train each neural network in Jupyter Notebook:
ELAPSED TIME: 2.7005105018615723
0
ELAPSED TIME: 2.4810903072357178
1
ELAPSED TIME: 2.801435708999634
2
ELAPSED TIME: 2.6753993034362793
3
ELAPSED TIME: 2.8625667095184326
4
ELAPSED TIME: 2.5828065872192383
5
while later on I get:
ELAPSED TIME: 5.062163829803467
0
ELAPSED TIME: 5.162402868270874
1
ELAPSED TIME: 5.301288366317749
2
ELAPSED TIME: 5.386904001235962
3
ELAPSED TIME: 6.126806020736694
4
The program consists of a function that trains a separate neural network model on each dataset and only exports the final training accuracy (saved to another file).
I thought the reason the later networks took longer to train was that the program was consuming too much memory, so I delete the models (with the del keyword) after obtaining their training accuracy, but that doesn't seem to help much.
If I restart the Jupyter Notebook kernel, the time to run each network drops back to about 2 seconds (the original duration), but the later models again take longer to run.
What could be possible reasons for this, and what solutions could be implemented?
NOTE: I did not include any code because it would make this post more dense, but if necessary I can upload it.

Are you running on an NVIDIA GPU? If so, it's possible some part of the old model is still on the GPU. Try running nvidia-smi while the slow model is running and see if anything else is using up GPU memory/resources.
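If you want to check this from inside the notebook, here is a minimal sketch that shells out to nvidia-smi (not part of the original answer; it assumes the NVIDIA driver utilities are on your PATH):
import subprocess

def gpu_memory_used_mib():
    # Ask nvidia-smi for the used memory of each GPU, in MiB.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'])
    return [int(v) for v in out.decode().split()]

# e.g. call this before and after training each model
print(gpu_memory_used_mib())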
If that doesn't work, you can also generate the TensorFlow timeline and compare it between the slow and fast runs. There is more info on how to generate the timeline in Keras here: https://github.com/tensorflow/tensorflow/issues/9868 and the code I pasted below for creating the timeline was taken from that link.
from tensorflow.python.client import timeline
import tensorflow as tf

# Build your Keras model
# ...
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
model.compile(loss='MSE', optimizer='Adam', options=run_options, run_metadata=run_metadata)
# Run the model in your usual way (model.fit, model.predict, ...)
# ...
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.ctf.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
For more info on the TensorFlow timeline see https://stackoverflow.com/a/37774470/2826818. The timeline (open the generated timeline.ctf.json in chrome://tracing) will let you see how much time each operation takes and determine which one is causing the slowdown.

You can clear your session after you are done with each model; this fixed the issue for me.
from keras import backend as K

for model in models_to_train:
    model.fit(X, y)
    # save necessary metrics
    K.clear_session()
(Also, I was getting a segmentation fault until I updated tensorflow-gpu to 1.9.0.)

Related

Tensorflow slow on first prediction, much faster after

I trained a model and saved the weights. At a later date, running the Python script from scratch, I load the model and the weights and do a prediction; this takes, for example, 10 seconds, while all predictions afterwards take 0.5 seconds.
I am measuring the time of the prediction only:
from time import perf_counter

t = perf_counter()
a = model.predict(p, verbose=0, workers=8).reshape(1, -1)
print(f'prediction took {perf_counter()-t} seconds')
I was expecting there to be no difference.
I have seen this post, Tensorflow JS first prediction delay, but I'm not sure how to "warm up" in my case.
I am coding a server, hence the concern: the first time someone issues a request for a prediction (in my case that's 10 of them), the user has to wait a long time, which is not good for this use case.
Thanks for the help!
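One way to "warm up" here is to run a throwaway prediction once, right after loading the model at server startup, so the one-time initialization cost is paid before the first real request. A minimal sketch, not from the original post, reusing model from the code above; the dummy input shape is assumed and must match your model's input:
import numpy as np
from time import perf_counter

# Hypothetical warm-up: one dummy prediction right after loading the model.
# Replace (1, 10) with the real input shape of your model.
dummy = np.zeros((1, 10), dtype=np.float32)

t = perf_counter()
model.predict(dummy, verbose=0)
print(f'warm-up prediction took {perf_counter()-t} seconds')
# Real requests afterwards should see the fast (~0.5 s) path.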

Are deep and wide autoencoder trainings just slow or is there something wrong here?

I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around ~30 million samples that add up to about 55GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (+reduction servers) and 256 batch size per replica, the performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be slow like this.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why the GPU memory utilization is at 100% while the GPU utilization is <1%? Is it just supposed to be slow like this, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit; did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)

def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)

return dataset.prefetch(tf.data.AUTOTUNE)
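One way to check whether the input pipeline (rather than the model) is what keeps GPU utilization near zero is to time iteration over the dataset alone, without any training. A minimal sketch, not from the original post, applied to the dataset returned by the factory above:
import time

def benchmark_dataset(dataset, num_batches=200):
    # Pull batches from the pipeline without touching the model; if this is
    # not much faster than a training step, the GPUs are starved for data.
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.1f}s "
          f"({num_batches / elapsed:.1f} batches/s)")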

OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference

I'm trying to accelerate my model by converting it to ONNX and running it with ONNX Runtime. However, I'm getting weird results when trying to measure inference time.
While running only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms the OpenVINOExecutionProvider:
CPUExecutionProvider - 0.72 seconds
OpenVINOExecutionProvider - 4.47 seconds
But if I run let's say 5 iterations the result is different:
CPUExecutionProvider - 3.83 seconds
OpenVINOExecutionProvider - 14.13 seconds
And if I run 100 iterations, the result is drastically different:
CPUExecutionProvider - 74.19 seconds
OpenVINOExecutionProvider - 46.96 seconds
It seems to me that the inference time of the OpenVINO EP is not linear, but I don't understand why.
So my questions are:
Why does OpenVINOExecutionProvider behave this way?
What ExecutionProvider should I use?
The code is very basic:
import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm

limit = 5

# MODEL
device = 'CPU_FP32'
model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
The use of ONNX Runtime with the OpenVINO Execution Provider enables the inferencing of ONNX models using the ONNX Runtime API while the OpenVINO toolkit runs in the backend.
This accelerates ONNX model performance on the same hardware compared to generic acceleration on Intel® CPU, GPU, VPU and FPGA.
Generally, the CPU Execution Provider works best with small iteration counts, since its intention is to keep the binary size small. Meanwhile, the OpenVINO Execution Provider is intended for deep learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).
This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider at larger iteration counts.
You should choose the Execution Provider that suits your requirements. If you are going to execute a complex DL model with a large number of iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary size and run fewer iterations, you can choose the CPU Execution Provider instead.
For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
Regarding the non-linear time, it might be the case that there is some preparation that happens when you first run the model with OpenVINO - perhaps the model is compiled for OpenVINO on the first call to sess.run. I observed a similar effect when using TFLite. For these scenarios, it makes sense to discard the time of the first iteration when benchmarking. There also tends to be quite a bit of variance, so running >10 or ideally >100 iterations is a good idea.
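As an illustration, here is a minimal benchmarking sketch, not from the original answer, reusing sess, input_name and image from the question's code; it discards the warm-up runs and reports per-iteration statistics:
import time
import numpy as np

def benchmark(sess, input_name, image, iterations=100, warmup=1):
    # Warm-up runs absorb one-time costs such as OpenVINO model compilation.
    for _ in range(warmup):
        sess.run(None, {input_name: image})
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        sess.run(None, {input_name: image})
        times.append(time.perf_counter() - start)
    print('mean %.1f ms, std %.1f ms over %d runs'
          % (1000 * np.mean(times), 1000 * np.std(times), iterations))

benchmark(sess, input_name, image)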

Tensorflow distributed training pause after each epoch

I am training a neural network in parallel on 2 GPUs using the TensorFlow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete, whereas with 2 GPUs each epoch takes 13 seconds to finish. I am not surprised by this, since I know the scaling is not perfect due to the all_reduce overhead of updating the variables during training.
However, after each epoch of the distributed training, there is a pause of about 8 seconds. When using a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training distributed?
Alternatively, can anyone explain what happens differently in distributed training at the end of an epoch?
Apparently this had something to do with running TF in graph mode. By setting tf.compat.v1.enable_eager_execution() the problem went away. This also fixed a memory leak that was causing issues, so perhaps the pause was being caused by TF making copies of something that I wasn't expecting.
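For context, a minimal sketch of where that call would go; the model here is a stand-in, not the one from the question:
import tensorflow as tf

# Must run before any other TensorFlow work, e.g. at the top of the script.
tf.compat.v1.enable_eager_execution()

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Stand-in model; build and compile your own model here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')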

Idle GPU between inferences when using Keras

Question:
There seems to be a slight idle time between inferences when using Keras with a TensorFlow backend, as shown in nvprofiler. The attached image of my terminal shows that Keras takes ~80 ms for each inference, while a check with nvprofiler shows that the calculation takes 62 ms (including the host-to-device and device-to-host copies) and the GPU is basically idling the rest of the time. In addition, there is a difference in the average inference time (between 80 and 150 ms) each time I run the code below, and I am unsure why that is the case.
I was wondering if there is any way to speed up the inference time by reducing this idle time.
Another question is whether I am missing some step, like clearing the GPU memory, that causes the difference in inference time.
The input to my model is (1, 800, 700, 36), which is basically a point cloud in a voxelized grid. The output is two matrices, (1, 200, 175, 1) and (1, 200, 175, 6).
Setup:
Ubuntu 16.04
Intel(R) Core(TM) i7-7820X CPU
GeForce GTX 980
Python 2.7
Tensorflow 1.12.0
Keras 2.2.4
Attempts to reduce inference idle time
Allowing the memory to dynamically grow on the GPU
config.gpu_options.allow_growth = True
Limit the maximum gpu usage
config.gpu_options.per_process_gpu_memory_fraction = 0.66
Set learning phase to test
keras.backend.set_learning_phase(0)
Change batch size to None and step size to 1. Changing batch size to 1 and step size to None makes inference slower
model.predict(input_data, batch_size=None, verbose=1, steps=1)
Terminal Output
Output of keras verbose in terminal that shows each step taking ~80ms
Same code, Output of keras verbose in terminal that shows each step taking ~100ms
Images of NV profiler
As the NV profiler trace is very long, I have split the image into 3 parts and minimized process threads that are not running anything during that period. The images show that the actual calculation takes ~60 ms and the GPU is basically doing nothing for 20 ms. This idle time changes from run to run and sometimes goes up to 70 ms.
Part 1, time between the green lines is not doing anything
Part 2, time between the green lines is not doing anything
Part 3, time between the green lines is not doing anything
Code
import numpy as np
import tensorflow as tf
import keras.backend.tensorflow_backend
from keras.backend.tensorflow_backend import set_session
from keras.models import load_model

# Limit GPU memory usage and register the session before loading the model
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.66
sess = tf.Session(config=config)
set_session(sess)
keras.backend.set_learning_phase(0)

# Custom loss functions used when the model was trained
# (huber_loss taken from tf.losses so load_model can resolve it by name)
custom_losses = {
    "binary_crossentropy": "binary_crossentropy",
    "huber_loss": tf.losses.huber_loss,
}
model = load_model('/home/louis/pixor/models/city_trainin_set/pixor_model_10_0.008.hdf5',
                   custom_objects=custom_losses)

# massage_own_label is my own helper for loading the voxelized point cloud
obj = massage_own_label(base_dir, data)
input_data = obj.process_input_data(obj.load_pc_data(0))

for i in range(0, 100):
    out_class, out_labels = model.predict(input_data, batch_size=None, verbose=1, steps=1)
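To see on the Python side where the extra ~20 ms per call goes, the predict loop can be timed directly instead of relying on the Keras progress bar. A minimal sketch, not from the original post, reusing model and input_data from the code above:
import time

# Time each predict() call from Python; the gap between this and the
# ~62 ms of GPU work seen in nvprofiler is host-side/framework overhead.
timings = []
for i in range(100):
    start = time.time()
    model.predict(input_data, batch_size=None, verbose=0, steps=1)
    timings.append(time.time() - start)

timings = timings[1:]  # drop the first (warm-up) call
print('mean per-call time: %.1f ms' % (1000 * sum(timings) / len(timings)))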
