Why is inference time slower when using multiprocessing with Keras? - python

I would like to have several processes, each loading different images one at a time and performing inference (for example with VGG16).
I am using Keras with the TensorFlow backend and one GPU (GTX 1070). Here is the code:
import tensorflow as tf
import multiprocessing
from multiprocessing import Pool, Process, Queue
import os
from os.path import isfile, join
from PIL import Image
import time
from keras.applications.vgg16 import VGG16
import numpy as np
from keras.backend.tensorflow_backend import set_session
test_path = 'test path to images ...'
output = Queue()
def worker(file_names, output):
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.25
    config.gpu_options.visible_device_list = "0"
    set_session(tf.Session(config=config))

    inference_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
    model_image_size = (224, 224)

    times = []
    for file_name in file_names:
        image = Image.open(os.path.join(test_path, file_name))
        im_width = image.size[0]
        im_height = image.size[1]
        m = (im_width - im_height) // 2
        image = image.crop((m, 0, im_width - m, im_height))
        image = image.resize(model_image_size, Image.BICUBIC)
        image = np.array(image, dtype='float32')
        image /= 255.
        image = np.expand_dims(image, 0)  # Add batch dimension.

        start = time.time()
        res = inference_model.predict(image)
        end = time.time()
        elapsed_time = end - start
        print("elapsed time", elapsed_time)
        times.append(elapsed_time)

    average_time = np.mean(times[2:])
    print("average time ", average_time)

if __name__ == '__main__':
    file_names = [f for f in os.listdir(test_path) if isfile(join(test_path, f))]
    file_names.sort()
    num_workers = 3
    processes = [Process(target=worker, args=(file_names[x::num_workers], output)) for x in range(num_workers)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
I have noticed that the per-image inference time is slower with multiple processes than with a single process. For example, with a single process the inference time per image is 0.012 s, but when running 3 processes I would expect roughly the same per-image time; instead the average inference time per image is almost 0.02 s. What could be the reason for that (maybe CUDA context switching)? Is there a way to solve this?
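For comparison, a single-process sketch that batches several images into one predict() call, which is the usual way to raise throughput on one GPU; the helper below is illustrative only and assumes images have already been preprocessed to (224, 224, 3) as above:

# Sketch: batch preprocessed images in a single process instead of spawning
# several processes that all share the same GPU (illustrative, not tested).
import numpy as np

def predict_in_batches(inference_model, preprocessed_images, batch_size=16):
    # preprocessed_images: list of float32 arrays of shape (224, 224, 3)
    results = []
    for i in range(0, len(preprocessed_images), batch_size):
        batch = np.stack(preprocessed_images[i:i + batch_size])  # (b, 224, 224, 3)
        results.append(inference_model.predict(batch))
    return np.concatenate(results, 0)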

Related

Can someone help me use 'ray' or other multiprocessing libraries for my semantic segmentation?

I'm doing semantic segmentation on microscope image stacks.
My code works fine, but it only uses one core of my CPU, which makes me wait a long time for the segmented images.
I recently learned that there are ways to use multicore processing with other Python libraries, but I don't know how to implement it.
Could someone help me edit my code to use one of the multiprocessing libraries?
My code is below.
import numpy as np
from patchify import patchify, unpatchify
import os
import cv2
from tqdm import tqdm
from tensorflow import keras
from tensorflow.keras.utils import normalize
import natsort

model = keras.models.load_model("C:/mymodel.h5", compile=False)

# creating recon image directory
recon_image_directory = "C:/Users/recon"
if not os.path.exists(recon_image_directory):
    os.makedirs(recon_image_directory)

large_image_path = "C:/original_images/"
check_images = natsort.natsorted(os.listdir(large_image_path))

for num, large_image_name in tqdm(enumerate(check_images), total=len(check_images)):
    if large_image_name.split('.')[1] == "tif":
        img = cv2.imread(large_image_path + large_image_name, 0)
        patches = patchify(img, (256, 256), step=256)

        predicted_patches = []
        for i in range(patches.shape[0]):
            for j in range(patches.shape[1]):
                single_patch = patches[i, j, :, :]  # (256, 256)
                single_patch_norm = normalize(np.array(single_patch), axis=1)
                single_patch_input = np.stack((single_patch_norm,) * 3, axis=-1)  # (256, 256, 3)
                single_patch_input = np.expand_dims(single_patch_input, 0)  # (1, 256, 256, 3)
                single_patch_prediction = (model.predict(single_patch_input)[0, :, :, 0] > 0.5).astype(np.uint8)
                predicted_patches.append(single_patch_prediction)

        predicted_patches = np.array(predicted_patches)
        predicted_patches_reshaped = np.reshape(predicted_patches, (patches.shape[0], patches.shape[1], 256, 256))
        reconstructed_image = unpatchify(predicted_patches_reshaped, img.shape)
        cv2.imwrite(recon_image_directory + "/recon" + '_' + str(num) + ".tif", reconstructed_image)
Does this snippet work? It should run each prediction in a separate process.
import ray

@ray.remote
def predict(large_image_name: str, num: int) -> None:
    # num is passed in explicitly so each worker knows its output index.
    img = cv2.imread(large_image_path + large_image_name, 0)
    patches = patchify(img, (256, 256), step=256)
    predicted_patches = []
    for i in range(patches.shape[0]):
        for j in range(patches.shape[1]):
            single_patch = patches[i, j, :, :]  # (256, 256)
            single_patch_norm = normalize(np.array(single_patch), axis=1)
            single_patch_input = np.stack((single_patch_norm,) * 3, axis=-1)  # (256, 256, 3)
            single_patch_input = np.expand_dims(single_patch_input, 0)  # (1, 256, 256, 3)
            single_patch_prediction = (model.predict(single_patch_input)[0, :, :, 0] > 0.5).astype(np.uint8)
            predicted_patches.append(single_patch_prediction)
    predicted_patches = np.array(predicted_patches)
    predicted_patches_reshaped = np.reshape(predicted_patches, (patches.shape[0], patches.shape[1], 256, 256))
    reconstructed_image = unpatchify(predicted_patches_reshaped, img.shape)
    cv2.imwrite(recon_image_directory + "/recon" + '_' + str(num) + ".tif", reconstructed_image)

futures = []
for num, large_image_name in tqdm(enumerate(check_images), total=len(check_images)):
    if large_image_name.split('.')[1] == "tif":
        futures.append(predict.remote(large_image_name, num))
ray.get(futures)
We also have a high-level abstraction for doing this sort of thing. If you're interested, you should check out the Ray AI Runtime (AIR).
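Note that the snippet above captures the driver's global model inside the remote function, so Ray has to serialize the Keras model and ship it to every worker. One possible alternative (a sketch, not tested here) is to build the model once per worker process with a Ray actor; the Segmenter class, the worker count, and the reuse of the C:/mymodel.h5 path below are illustrative assumptions:

# Sketch only: a Ray actor that loads the Keras model in the worker process,
# so the model is never pickled from the driver.
import ray
from tensorflow import keras

ray.init()

@ray.remote
class Segmenter:
    def __init__(self, model_path):
        # Built inside the worker process, once per actor.
        self.model = keras.models.load_model(model_path, compile=False)

    def predict_batch(self, patch_batch):
        # patch_batch: numpy array of shape (n, 256, 256, 3)
        return self.model.predict(patch_batch)

# Illustrative usage: a small pool of actors, patch batches dispatched to them.
workers = [Segmenter.remote("C:/mymodel.h5") for _ in range(4)]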

How can I implement the make_one_shot_iterator() function of TensorFlow 1.x in TensorFlow 2.x?

In sample one of this tutorial, TensorFlow 1.x is used. I want to implement that code in TensorFlow 2.x.
There are two parts of the code:
import tensorflow as tf
from time import sleep
from time import time

# data generator
def py_gen(gen_name):
    gen_name = gen_name.decode('utf-8')
    for num in range(20):
        sleep(0.3)
        yield '{} yields {}'.format(gen_name, num)

# model operation
def model(data):
    sleep(0.1)
and
Dataset = tf.data.Dataset
name = 'Gen_0'
ds = Dataset.from_generator(py_gen,
                            output_types=(tf.string),
                            args=(name,))
data_tf = ds.make_one_shot_iterator().get_next()
and the run is:
def run_session(data_tf):
    with tf.Session() as sess:
        while True:
            try:
                t1 = time()
                data_py = sess.run(data_tf)
                t2 = time()
                t = t2 - t1
                model(data_tf)
                msg = 'elapsed time: {:.3f}, {}'.format(t, data_py.decode('utf-8'))
                print(msg)
            except tf.errors.OutOfRangeError:
                print('data generator(s) are exhausted')
                break
The make_one_shot_iterator() function is not implemented in TensorFlow 2.x, but tf.compat.v1.data.make_one_shot_iterator provides the same functionality.
However, I want to implement this with TF 2.x only, without using tf.compat.v1.data.make_one_shot_iterator.
How can I do this?
If you are planning to iterate over a dataset, you can just do
iterator = iter(ds)
I have modified your code to do that. FYI, I removed that "decode" because I did not understand what it was supposed to do.
import tensorflow as tf
from time import sleep
from time import time

# data generator
def py_gen(gen_name):
    gen_name = gen_name.decode('utf-8')
    for num in range(20):
        sleep(0.3)
        yield '{} yields {}'.format(gen_name, num)

# model operation
def model(data):
    sleep(0.1)

def run_session(d):
    while True:
        try:
            t1 = time()
            data_py = d.get_next()
            t2 = time()
            t = t2 - t1
            model(data_py)  # pass the fetched element, not the whole dataset
            msg = 'elapsed time: {:.3f}'.format(t)
            print(msg)
        except tf.errors.OutOfRangeError:
            print('data generator(s) are exhausted')
            break

Dataset = tf.data.Dataset
name = 'Gen_0'
ds = Dataset.from_generator(py_gen,
                            output_types=(tf.string),
                            args=(name,))
it = iter(ds)
run_session(it)
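As a side note, in TF 2.x you can also skip the explicit iterator and loop over the dataset directly, which removes the need to catch OutOfRangeError; this short sketch assumes ds and model are defined as above:

# Sketch: idiomatic TF 2.x iteration over the same dataset.
for data in ds:
    model(data)  # data is a scalar tf.string tensor
    print(data.numpy().decode('utf-8'))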

Tensorflow inference too slow when loading multiple models

I fine-tuned two MobileNet models on different datasets based on the TensorFlow object_detection API example from here. When I use eager mode (tf.executing_eagerly() is True) with only one model, inference runs at 0.036 seconds per image. When I load two models, Keras requires converting to graph mode (tf.executing_eagerly() is False) and inference runs at 1.8 seconds per image. What am I doing wrong?
def inference(pipeline_config, checkpoint_path):
    print('Building model and restoring weights', flush=True)
    num_classes = 3

    # Load pipeline config and build a detection model.
    configs = config_util.get_configs_from_pipeline_file(pipeline_config)
    model_config = configs['model']
    model_config.ssd.num_classes = num_classes
    detection_model = model_builder.build(
        model_config=model_config, is_training=False)

    ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
    ckpt.restore(checkpoint_path).expect_partial()

    # Run model through a dummy image so that variables are created
    image, shapes = detection_model.preprocess(tf.zeros([1, 320, 320, 3]))
    prediction_dict = detection_model.predict(image, shapes)
    _ = detection_model.postprocess(prediction_dict, shapes)
    print('Weights restored!')
    return detection_model

def get_model_detection_function(detection_model):
    """Get a tf.function for detection."""

    # Again, uncomment this decorator if you want to run inference eagerly
    # @tf.function
    def detect(input_tensor):
        """Run detection on an input image.

        Args:
          input_tensor: A [1, height, width, 3] Tensor of type tf.float32.
            Note that height and width can be anything since the image will be
            immediately resized according to the needs of the model within this
            function.

        Returns:
          A dict containing 3 Tensors (`detection_boxes`, `detection_classes`,
          and `detection_scores`).
        """
        preprocessed_image, shapes = detection_model.preprocess(input_tensor)
        prediction_dict = detection_model.predict(preprocessed_image, shapes)
        return detection_model.postprocess(prediction_dict, shapes)

    return detect

def mainProcess():
    print('Loading model 1...')
    g1 = tf.Graph()
    s1 = tf.compat.v1.Session(graph=g1)
    with g1.as_default(), s1.as_default():
        detection_model_1 = inference('config_1/pipeline.config', 'Checkpoint_1/ckpt-1')
        detect_fn_1 = get_model_detection_function(detection_model_1)
        s1.run(tf.compat.v1.global_variables_initializer())

    print('Loading model 2...')
    g2 = tf.Graph()
    s2 = tf.compat.v1.Session(graph=g2)
    with g2.as_default():
        detection_model_2 = inference('config_2/pipeline.config', 'Checkpoint_2/ckpt-1')
        detect_fn_2 = get_model_detection_function(detection_model_2)
        s2.run(tf.compat.v1.global_variables_initializer())

    for i, f in enumerate(listdir('images_dir/')):
        ...
        ... read the image
        ...
        with g1.as_default():
            with s1.as_default():
                sec = time.time()
                input_tensor = tf.convert_to_tensor(test_img, dtype=tf.float32)
                detections = detect_fn_1(input_tensor)
                detections = s1.run(detections)
                curr = time.time()
                print("Finished iterating in: " + str(curr - sec) + " seconds")
        # the same for detection_model_2
For eager mode with only one model the mainProcess is:
def mainProcess():
    print('Loading model...')
    detection_model_1 = inference('config_1/pipeline.config', 'Checkpoint_1/ckpt-1')
    detect_fn_1 = get_model_detection_function(detection_model_1)

    for i, f in enumerate(listdir('images_dir/')):
        ...
        ... read the image
        ...
        sec = time.time()
        input_tensor = tf.convert_to_tensor(test_img, dtype=tf.float32)
        detections = detect_fn_1(input_tensor)
        print(detections['detection_boxes'][0].numpy())
        print(detections['detection_scores'][0].numpy())
        curr = time.time()
        print("Finished iterating in: " + str(curr - sec) + " seconds")

Running inference on an InceptionV3 network twice brings totally different results

When I calculate the Inception score, I get NaN most of the time.
While investigating why this happens, I found that running the network twice on the same images can produce totally different results for some of the images (a difference greater than 0.9, while the maximum possible difference is 1); the images with a high difference change from run to run.
My GPU is a 2080 Ti, and I use Ubuntu with tensorflow==1.13.1.
I tried changing drivers and TensorFlow versions and running from Docker, but the same problem happens every time.
I have another server at the university with the same GPU (2080 Ti), and when I run there the problem disappears.
Thanks for the help.
My script:
# Code derived from tensorflow/tensorflow/models/image/imagenet/classify_image.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os.path
import tarfile
import numpy as np
from six.moves import urllib
import tensorflow as tf
import sys
import warnings
from scipy import linalg

MODEL_DIR = '/tmp/imagenet'
DATA_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz'
softmax = None
pool3 = None

# Call this function with list of images. Each of elements should be a
# numpy array with values ranging from 0 to 255.
def get_features(images):
    assert (images.shape[3]) == 3
    assert np.max(images) > 10
    assert np.min(images) >= 0.0
    images = images.astype(np.float32)
    bs = 100
    sess = tf.get_default_session()
    preds = []
    for inp in np.array_split(images, round(images.shape[0] / bs)):
        sys.stdout.write(".")
        sys.stdout.flush()
        pred = sess.run(softmax, {'InputTensor:0': inp})
        preds.append(pred)
    preds = np.concatenate(preds, 0)
    return preds

# This function is called automatically.
def _init_inception():
    global softmax
    global pool3
    if not os.path.exists(MODEL_DIR):
        os.makedirs(MODEL_DIR)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(MODEL_DIR, filename)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\r>> Downloading %s %.1f%%' % (
                filename, float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()
        filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress)
        print()
        statinfo = os.stat(filepath)
        print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')
    tarfile.open(filepath, 'r:gz').extractall(MODEL_DIR)
    with tf.gfile.GFile(os.path.join(
            MODEL_DIR, 'classify_image_graph_def.pb'), 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        # Import model with a modification in the input tensor to accept arbitrary
        # batch size.
        input_tensor = tf.placeholder(tf.float32, shape=[None, None, None, 3],
                                      name='InputTensor')
        _ = tf.import_graph_def(graph_def, name='inception_v3',
                                input_map={'ExpandDims:0': input_tensor})
    # Works with an arbitrary minibatch size.
    pool3 = tf.get_default_graph().get_tensor_by_name('inception_v3/pool_3:0')
    ops = pool3.graph.get_operations()
    for op_idx, op in enumerate(ops):
        if 'inception_v3' in op.name:
            for o in op.outputs:
                shape = o.get_shape()
                shape = [s.value for s in shape]
                new_shape = []
                for j, s in enumerate(shape):
                    if s == 1 and j == 0:
                        new_shape.append(None)
                    else:
                        new_shape.append(s)
                o.set_shape(tf.TensorShape(new_shape))
    w = tf.get_default_graph().get_operation_by_name("inception_v3/softmax/logits/MatMul").inputs[1]
    logits = tf.matmul(tf.squeeze(pool3, [1, 2]), w)
    softmax = tf.nn.softmax(logits)

_init_inception()

if __name__ == '__main__':
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    with tf.Session() as sess:
        preds1 = get_features(x_train)
        preds2 = get_features(x_train)
        print(abs(preds1 - preds2).max())

Tensorflow: model wrapper that can release GPU resources

Here is a wrapper for a TensorFlow .pb frozen model (ImageNet classification):
import tensorflow as tf
import numpy as np
import cv2
from numba import cuda

class ModelWrapper():
    def __init__(self, model_filepath):
        self.graph_def = self.load_graph_def(model_filepath)
        self.graph = self.load_graph(self.graph_def)
        self.set_inputs_and_outputs()
        self.sess = tf.Session(graph=self.graph)
        print(self.__class__.__name__, 'call __init__')  #

    def load_graph_def(self, model_filepath):
        # Expects frozen graph in .pb format
        with tf.gfile.GFile(model_filepath, "rb") as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
        return graph_def

    def load_graph(self, graph_def):
        with tf.Graph().as_default() as graph:
            tf.import_graph_def(graph_def, name="")
        return graph

    def set_inputs_and_outputs(self):
        input_list = []
        for op in self.graph.get_operations():  # tensorflow.python.framework.ops.Operation
            if op.type == "Placeholder":
                input_list.append(op.name)
        print('Inputs:', input_list)

        all_name_list = []
        input_name_list = []
        for node in self.graph_def.node:  # tensorflow.core.framework.node_def_pb2.NodeDef
            all_name_list.append(node.name)
            input_name_list.extend(node.input)
        output_list = list(set(all_name_list) - set(input_name_list))
        print('Outputs:', output_list)

        self.inputs = []
        self.input_tensor_names = [name + ":0" for name in input_list]
        for input_tensor_name in self.input_tensor_names:
            self.inputs.append(self.graph.get_tensor_by_name(input_tensor_name))
        self.outputs = []
        self.output_tensor_names = [name + ":0" for name in output_list]
        for output_tensor_name in self.output_tensor_names:
            self.outputs.append(self.graph.get_tensor_by_name(output_tensor_name))

        input_dim_list = []
        for op in self.graph.get_operations():  # tensorflow.python.framework.ops.Operation
            if op.type == "Placeholder":
                bs = op.get_attr('shape').dim[0].size
                h = op.get_attr('shape').dim[1].size
                w = op.get_attr('shape').dim[2].size
                c = op.get_attr('shape').dim[3].size
                input_dim_list.append([bs, h, w, c])
        assert len(input_dim_list) == 1
        _, self.input_img_h, self.input_img_w, _ = input_dim_list[0]

    def predict(self, img):
        h, w, c = img.shape
        if h != self.input_img_h or w != self.input_img_w:
            img = cv2.resize(img, (self.input_img_w, self.input_img_h))
        batch = img[np.newaxis, ...]
        feed_dict = {self.inputs[0]: batch}
        outputs = self.sess.run(self.outputs, feed_dict=feed_dict)  # (1, 1001)
        output = outputs[0]
        return output

    def __del__(self):
        print(self.__class__.__name__, 'call __del__')  #
        import time  #
        time.sleep(3)  #
        cuda.close()
What I'm trying to do is clean up GPU memory once I no longer need a model. In this example I just create and delete the model in a loop, but in real life it can be several different models.
wget https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz
tar -xvzf inception_v3_2016_08_28_frozen.pb.tar.gz
rm -f imagenet_slim_labels.txt
rm -f inception_v3_2016_08_28_frozen.pb.tar.gz
import os
import time
import tensorflow as tf
import numpy as np
from model_wrapper import ModelWrapper
MODEL_FILEPATH = './inception_v3_2016_08_28_frozen.pb'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
def create_and_delete_in_loop():
    for i in range(10):
        print('-' * 60)
        print('i:', i)
        model = ModelWrapper(MODEL_FILEPATH)
        input_batch = np.zeros((model.input_img_h, model.input_img_w, 3), np.uint8)
        y_pred = model.predict(input_batch)
        print('y_pred.shape', y_pred.shape)
        print('np.argmax(y_pred)', np.argmax(y_pred))
        del model

if __name__ == "__main__":
    create_and_delete_in_loop()
    print('START WAITING')
    time.sleep(10)
    print('END OF THE PROGRAM!')
Output:
------------------------------------------------------------
i: 0
Inputs: ['input']
Outputs: ['InceptionV3/Predictions/Reshape_1']
ModelWrapper call __init__
y_pred.shape (1, 1001)
np.argmax(y_pred) 112
ModelWrapper call __del__
------------------------------------------------------------
i: 1
Inputs: ['input']
Outputs: ['InceptionV3/Predictions/Reshape_1']
ModelWrapper call __init__
Segmentation fault (core dumped)
What is the proper way of releasing GPU memory?
TL;DR: Run your function as a new process.+
tf.reset_default_graph() is not guaranteed to release memory.# When a process dies, all the memory it was given (including your GPU memory) is released. Not only does this help keep things neatly organized, but you can also analyze how much CPU, GPU, RAM, and GPU memory each process consumes.
For example, if you had these functions,
def train_model(x, y, params):
    model = ModelWrapper(params.filepath)
    model.fit(x, y, epochs=params.epochs)

def predict_model(x, params):
    model = ModelWrapper(params.filepath)
    y_pred = model.predict(x)
    print(y_pred.shape)
You can use them like this:
import multiprocessing

for i in range(8):
    print(f"Training Model {i} from {params.filepath}")
    process_train = multiprocessing.Process(target=train_model, args=(x_train, y_train, params))
    process_train.start()
    process_train.join()

print("Predicting")
process_predict = multiprocessing.Process(target=predict_model, args=(x_train, params))
process_predict.start()
process_predict.join()
This way Python fires up a new process for each of your tasks, and each process runs with its own memory.
Bonus tip: You can also choose to run them in parallel if you have multiple CPUs and GPUs available; in that case, just call process_train.join() after the loop instead of inside it. If you have eight GPUs, you can use this parent script to serve parameters while each individual process runs on a different GPU, as in the sketch below.
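A minimal sketch of that per-GPU variant, assuming CUDA_VISIBLE_DEVICES is the mechanism used to pin each child process to one GPU, and reusing the hypothetical train_model, x_train, y_train, and params from above:

# Sketch: one training process per GPU, pinned via CUDA_VISIBLE_DEVICES.
import os
import multiprocessing

def train_on_gpu(gpu_id, x, y, params):
    # Must run before TensorFlow initializes CUDA in this child process.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    train_model(x, y, params)

processes = [
    multiprocessing.Process(target=train_on_gpu, args=(gpu_id, x_train, y_train, params))
    for gpu_id in range(8)
]
for p in processes:
    p.start()
for p in processes:
    p.join()  # join after the loop so all eight run in parallel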
# I tried a variety of things, separately and together, before I started using processes,
tf.reset_default_graph()
K.clear_session()
cuda.select_device(0); cuda.close()
model = get_new_model() # overwrite
model = None
del model
gc.collect()
+ I also considered using threads and subprocess.Popen, but I settled on multiprocessing since it offered full decoupling, which made it a lot easier to manage and allocate resources.
