I was playing with tflite and observed on my multicore CPU that it is not heavily stressed during inference. I eliminated the IO bottleneck by creating random input data with numpy beforehand (random matrices resembling images), but tflite still doesn't utilize the full potential of the CPU.
The documentation mentions the possibility of tweaking the number of threads used. However, I was not able to find out how to do that in the Python API. But since I have seen people using multiple interpreter instances for different models, I thought one could maybe use multiple instances of the same model and run them on different threads/processes. I have written the following short script:
import numpy as np
import os, time
import tflite_runtime.interpreter as tflite
from multiprocessing import Pool
# global, but for each process the module is loaded, so only one global var per process
interpreter = None
input_details = None
output_details = None
def init_interpreter(model_path):
    global interpreter
    global input_details
    global output_details
    interpreter = tflite.Interpreter(model_path=model_path)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.allocate_tensors()
    print('done init')
def do_inference(img_idx, img):
    print('Processing image %d' % img_idx)
    print('interpreter: %r' % (hex(id(interpreter)),))
    print('input_details: %r' % (hex(id(input_details)),))
    print('output_details: %r' % (hex(id(output_details)),))
    tstart = time.time()
    img = np.stack([img]*3, axis=2)  # replicate the single layer three times for RGB
    img = np.array([img])            # create batch dimension
    interpreter.set_tensor(input_details[0]['index'], img)
    interpreter.invoke()
    logit = interpreter.get_tensor(output_details[0]['index'])
    pred = np.argmax(logit, axis=1)[0]
    logit = list(logit[0])
    duration = time.time() - tstart
    return logit, pred, duration
def main_par():
    optimized_graph_def_file = r'./optimized_graph.lite'
    # init the model once to find out the input dimensions
    interpreter_main = tflite.Interpreter(model_path=optimized_graph_def_file)
    input_details = interpreter_main.get_input_details()
    input_w, input_h = tuple(input_details[0]['shape'][1:3])
    num_test_imgs = 1000
    # pregenerate random images with values in [0, 1]
    test_imgs = np.random.rand(num_test_imgs, input_w, input_h).astype(input_details[0]['dtype'])
    scores = []
    predictions = []
    it_times = []
    tstart = time.time()
    with Pool(processes=4, initializer=init_interpreter, initargs=(optimized_graph_def_file,)) as pool:  # start 4 worker processes
        results = pool.starmap(do_inference, enumerate(test_imgs))
        scores, predictions, it_times = list(zip(*results))
    duration = time.time() - tstart
    print('Parent process time for %d images: %.2fs' % (num_test_imgs, duration))
    print('Inference time for %d images: %.2fs' % (num_test_imgs, sum(it_times)))
    print('mean time per image: %.3fs +- %.3f' % (np.mean(it_times), np.std(it_times)))

if __name__ == '__main__':
    # main_seq()
    main_par()
However, the memory address of the interpreter instance printed via hex(id(interpreter)) is the same for every process, while the memory addresses of the input/output details differ. So I was wondering whether this way of doing it is potentially wrong, even though I did observe a speedup. If so, how could one achieve parallel inference with TFLite and Python?
tflite_runtime version: 1.14.0 from here (the x86-64 Python 3.5 version)
python version: 3.5
I know that this thread was created two and a half years ago.
For me,
import multiprocessing
tf.lite.Interpreter(modelfile, num_threads=multiprocessing.cpu_count())
works very well.
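For completeness, here is a minimal sketch of how such a single multi-threaded interpreter might be used end to end. This is my own illustration rather than the answerer's full code, and 'model.tflite' is a placeholder path:

import multiprocessing
import numpy as np
import tflite_runtime.interpreter as tflite

# One interpreter whose kernels use all available cores.
interpreter = tflite.Interpreter(model_path='model.tflite',
                                 num_threads=multiprocessing.cpu_count())
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one random input matching the model's expected shape and dtype.
dummy = np.random.rand(*input_details[0]['shape']).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])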
I did not set an initializer; instead I load the model and run inference in the same function to work around this issue, using the following code:
with Pool(processes=multiprocessing.cpu_count()) as pool:
    results = pool.starmap(inference, enumerate(test_imgs))
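A rough sketch of what such an inference function could look like, loading the interpreter inside the worker so nothing TFLite-related is shared across processes. This is my own paraphrase with a placeholder model path, and it assumes img already has the shape and dtype the model expects:

def inference(img_idx, img):
    # Each call creates its own interpreter inside the worker process.
    interpreter = tflite.Interpreter(model_path='./optimized_graph.lite')
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], np.expand_dims(img, 0))
    interpreter.invoke()
    return img_idx, interpreter.get_tensor(output_details[0]['index'])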
Related
I have trained a multiclass classifier for speech recognition using tensorflow, then converted the model with the tflite converter. The model can predict, but it always outputs a single class. I suspect the problem is with the inference code, because the .h5 model can predict multiple classes without any issue. I have been searching online for several days for some insight but can't quite figure it out. Here is my code; any suggestions would be really appreciated.
import sounddevice as sd
import numpy as np
import scipy.signal
import timeit
import python_speech_features
import tflite_runtime.interpreter as tflite
import importlib
# Parameters
debug_time = 0
debug_acc = 0
word_threshold = 0.95
rec_duration = 0.5 # 0.5
sample_length = 0.5
window_stride = 0.5 # 0.5
sample_rate = 8000 # The mic requires at least 44100 Hz to work
resample_rate = 8000
num_channels = 1
num_mfcc = 16
model_path = 'model.tflite'
mfccs_old = np.zeros((32, 25))
# Load model (interpreter)
interpreter = tflite.Interpreter(model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)
# Filter and downsample
def decimate(signal, old_fs, new_fs):
    # Check to make sure we're downsampling
    if new_fs > old_fs:
        print("Error: target sample rate higher than original")
        return signal, old_fs
    # Downsampling is possible only by an integer factor
    dec_factor = old_fs / new_fs
    if not dec_factor.is_integer():
        print("Error: can only downsample by integer factor")
    # Do decimation
    resampled_signal = scipy.signal.decimate(signal, int(dec_factor))
    return resampled_signal, new_fs
# Callback that gets called every 0.5 seconds
def sd_callback(rec, frames, time, status):
    # Start timing for debug purposes
    start = timeit.default_timer()
    # Notify errors
    if status:
        print('Error:', status)
    global mfccs_old
    # Compute MFCCs
    mfccs = python_speech_features.base.mfcc(rec,
                                             samplerate=resample_rate,
                                             winlen=0.02,
                                             winstep=0.02,
                                             numcep=num_mfcc,
                                             nfilt=26,
                                             nfft=512,  # 2048
                                             preemph=0.0,
                                             ceplifter=0,
                                             appendEnergy=True,
                                             winfunc=np.hanning)
    delta = python_speech_features.base.delta(mfccs, 2)
    mfccs_delta = np.append(mfccs, delta, axis=1)
    mfccs_new = mfccs_delta.transpose()
    mfccs = np.append(mfccs_old, mfccs_new, axis=1)
    # mfccs = np.insert(mfccs, [0], 0, axis=1)
    mfccs_old = mfccs_new
    # Run inference and make predictions
    in_tensor = np.float32(mfccs.reshape(1, mfccs.shape[0], mfccs.shape[1], 1))
    interpreter.set_tensor(input_details[0]['index'], in_tensor)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    val = np.amax(output_data)  # DEFINED FOR BINARY CLASSIFICATION, CHANGE TO MULTICLASS
    ind = np.where(output_data == val)
    prediction = ind[1].astype(int)
    if val > word_threshold:
        print('index:', ind[1])
        print('accuracy', val, '\n')
        print(int(prediction))
    if debug_acc:
        # print('accuracy:', val)
        # print('index:', ind[1])
        print('out tensor:', output_data)
    if debug_time:
        print(timeit.default_timer() - start)
# Start recording from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass
Since I figured out the issue, I am answering it myself in case others find it useful.
The issue is not having a "background noise" class in your dataset. Also make sure you have enough data for background noises. If you look at Google's teachable machine's audio project (https://teachablemachine.withgoogle.com/train/audio), a "background noise" class is already there, you cannot delete or disable the class.
I tested with both the code from tensorflow's github example (https://github.com/tensorflow/examples/blob/master/lite/examples/sound_classification/raspberry_pi/classify.py) and the one from tensorflow's website (https://www.tensorflow.org/tutorials/audio/simple_audio). Both predict well as long as you have enough background noise samples in your dataset for the particular environment you are testing in.
I made slight changes to the tensorflow's github code to output the category name and category confidence score.
# Loop until the user closes the classification results plot.
while True:
    # Wait until at least interval_between_inference seconds has passed since
    # the last inference.
    now = time.time()
    diff = now - last_inference_time
    if diff < interval_between_inference:
        time.sleep(pause_time)
        continue
    last_inference_time = now

    # Load the input audio and run classify.
    tensor_audio.load_from_audio_record(audio_record)
    result = classifier.classify(tensor_audio)
    for category in result.classifications[0].categories:
        print(category.category_name, category.score)
Hope it's helpful for people playing around with similar projects.
I have a very simple example with the Keras MobileNet implementation trying to classify a minivan. I run the same code on two different computers and get different results; not just slightly different, but different enough that the top classifications don't match.
(note that Tensorflow=1.7.0 and Keras=2.1.5 on both computers)
Code below
import sys
import argparse
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import time
try:
    import matplotlib.pyplot as plt
    HAS_MATPLOTLIB = True
except:
    HAS_MATPLOTLIB = False
from keras.preprocessing import image
#from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.applications.mobilenet import MobileNet, preprocess_input, decode_predictions
#model = ResNet50(weights='imagenet')
model = MobileNet()
target_size = (224, 224)
def predict(model, img, target_size, top_n=3):
    """Run model prediction on image
    Args:
        model: keras model
        img: PIL format image
        target_size: (w,h) tuple
        top_n: # of top predictions to return
    Returns:
        list of predicted labels and their probabilities
    """
    if img.size != target_size:
        img = img.resize(target_size)
    print "preprocessing input.."
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    print "making prediction..."
    preds = model.predict(x)
    print "prediction made: %s" % preds
    return decode_predictions(preds, top=top_n)[0]
if __name__=="__main__":
a = argparse.ArgumentParser()
a.add_argument("--image", help="path to image")
a.add_argument("--image_url", help="url to image")
args = a.parse_args()
if args.image is None and args.image_url is None:
a.print_help()
sys.exit(1)
if args.image is not None:
img = Image.open(args.image)
preds = predict(model, img, target_size)
if args.image_url is not None:
print "getting image from url"
response = requests.get(args.image_url)
print "image gotten from url"
img = Image.open(BytesIO(response.content))
print "predicting.."
before = time.time()
preds = predict(model, img, target_size)
print "total time to predict: %.2f" % (time.time() - before)
print preds
plot_preds(img, preds)
Now if I run this on my MacBook Pro
$ python classify_example_mobile.py --image_url http://i.imgur.com/cg37Ojo.jpg
[(u'n03770679', u'minivan', 0.39935172), (u'n02974003', u'car_wheel', 0.28071228), (u'n02814533', u'beach_wagon', 0.19400564)]
but if I then run it on another computer that I have
(venv) $ python classify_example_mobile.py --image_url http://i.imgur.com/cg37Ojo.jpg
[(u'n02974003', u'car_wheel', 0.39516035), (u'n02814533', u'beach_wagon', 0.27965376), (u'n03770679', u'minivan', 0.22706936)]
The prediction order is reversed; it no longer picks minivan as the top result.
How could this be? I know that different architectures can have different floating-point math accuracy, but would that be enough to account for these results? I also know that models can vary depending on the way the weights are initialized during training, but this is a pre-trained model, so what gives?
edit - to be clear, the image is a picture of a minivan, so in this case one architecture gets it right and the other one gets it wrong - so this is a big deal for me. (http://i.imgur.com/cg37Ojo.jpg)
So I don't quite understand what is going on here, but the error appears to have gone away once I did some more preprocessing of the input, which makes me think that maybe I had different PIL or numpy versions or something.
I added this line
img = img.convert("RGB")
and now the results between the two computers are identical
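For context, a sketch of where such a conversion might go in the predict() function above; this placement is my own illustration, not necessarily the original author's:

def predict(model, img, target_size, top_n=3):
    # Force a 3-channel RGB image so that e.g. RGBA or palette images
    # are preprocessed identically on every machine.
    img = img.convert("RGB")
    if img.size != target_size:
        img = img.resize(target_size)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    preds = model.predict(x)
    return decode_predictions(preds, top=top_n)[0]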
I am trying to use the function below to crop a large number of images (100,000s). I am doing this operation serially, but it's taking a lot of time. What is an efficient way to do this?
tf.image.crop_to_bounding_box
Below is my code:
def crop_images(img_dir, list_images):
    outlist = []
    with tf.Session() as session:
        for image1 in list_images[:5]:
            image = mpimg.imread(img_dir + image1)
            x = tf.Variable(image, name='x')
            data_t = tf.placeholder(tf.uint8)
            op = tf.image.encode_jpeg(data_t, format='rgb')
            model = tf.global_variables_initializer()
            img_name = "img/" + image1.split("_img_0")[0] + "/img_0" + image1.split("_img_0")[1]
            height = x.shape[1]
            [x1, y1, x2, y2] = img_bbox_dict[img_name]
            x = tf.image.crop_to_bounding_box(x, int(y1), int(x1), int(y2) - int(y1), int(x2) - int(x1))
            session.run(model)
            result = session.run(x)
            data_np = session.run(op, feed_dict={data_t: result})
            with open(img_path + image1, 'w+') as fd:
                fd.write(data_np)
I'll give a simplified version of one of the examples from Tensorflow's Programmer's guide on reading data which can be found here. Basically, it uses Reader and Filename Queues to batch together image data using a specified number of threads. These threads are coordinated using what is called a thread Coordinator.
import tensorflow as tf
import glob
images_path = "./" #RELATIVE glob pathname of current directory
images_extension = "*.png"
# Save the list of files matching pattern, so it is only computed once.
filenames = tf.train.match_filenames_once(glob.glob(images_path+images_extension))
batch_size = len(glob.glob1(images_path,images_extension))
num_epochs=1
standard_size = [500, 500]
num_channels = 3
min_after_dequeue = 10
num_preprocess_threads = 3
seed = 14131
"""
IMPORTANT: Cropping params. These are arbitrary values used only for this example.
You will have to change them according to your requirements.
"""
crop_size=[200,200]
boxes = [1,1,460,460]
"""
'WholeFileReader' is a Reader whose 'read' method outputs the next
key-value pair of the filename and the contents of the file (the image) from
the Queue, both of which are string scalar Tensors.
Note that the QueueRunner works in a thread separate from the
Reader that pulls filenames from the queue, so the shuffling and enqueuing
process does not block the reader.
'resize_images' is used so that all images are resized to the same
size (Aspect ratios may change, so in that case use resize_image_with_crop_or_pad)
'set_shape' is used because the height and width dimensions of 'image' are
data dependent and cannot be computed without executing this operation. Without
this Op, the 'image' Tensor's shape will have None as Dimensions.
"""
def read_my_file_format(filename_queue, standard_size, num_channels):
    image_reader = tf.WholeFileReader()
    _, image_file = image_reader.read(filename_queue)
    if "jpg" in images_extension:
        image = tf.image.decode_jpeg(image_file)
    elif "png" in images_extension:
        image = tf.image.decode_png(image_file)
    image = tf.image.resize_images(image, standard_size)
    image.set_shape(standard_size + [num_channels])
    print "Successfully read file!"
    return image
"""
'string_input_producer' Enters matched filenames into a 'QueueRunner' FIFO Queue.
'shuffle_batch' creates batches by randomly shuffling tensors. The 'capacity'
argument controls how long the prefetching is allowed to grow the queues.
'min_after_dequeue' defines how big a buffer we will randomly
sample from -- bigger means better shuffling but slower startup & more memory used.
'capacity' must be larger than 'min_after_dequeue' and the amount larger
determines the maximum we will prefetch.
Recommendation: min_after_dequeue + (num_threads + a small safety margin) * batch_size
"""
def input_pipeline(filenames, batch_size, num_epochs, standard_size, num_channels, min_after_dequeue, num_preprocess_threads, seed):
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True)
    example = read_my_file_format(filename_queue, standard_size, num_channels)
    capacity = min_after_dequeue + 3 * batch_size
    example_batch = tf.train.shuffle_batch([example], batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue, num_threads=num_preprocess_threads, seed=seed, enqueue_many=False)
    print "Batching Successful!"
    return example_batch
"""
Any transformation on the image batch goes here. Refer the documentation
for the details of how the cropping is done using this function.
"""
def crop_batch(image_batch, batch_size, b_boxes, crop_size):
    cropped_images = tf.image.crop_and_resize(image_batch, boxes=[b_boxes for _ in xrange(batch_size)], box_ind=[i for i in xrange(batch_size)], crop_size=crop_size)
    print "Cropping Successful!"
    return cropped_images
example_batch = input_pipeline(filenames, batch_size, num_epochs, standard_size, num_channels, min_after_dequeue, num_preprocess_threads, seed)
cropped_images = crop_batch(example_batch, batch_size, boxes, crop_size)
"""
if 'num_epochs' is not `None`, the 'string_input_producer' function creates local
counter `epochs`. Use `local_variables_initializer()` to initialize local variables.
'Coordinator' class implements a simple mechanism to coordinate the termination
of a set of threads. Any of the threads can call `coord.request_stop()` to ask for all
the threads to stop. To cooperate with the requests, each thread must check for
`coord.should_stop()` on a regular basis.
`coord.should_stop()` returns True` as soon as `coord.request_stop()` has been called.
A thread can report an exception to the coordinator as part of the `should_stop()`
call. The exception will be re-raised from the `coord.join()` call.
After a thread has called `coord.request_stop()` the other threads have a
fixed time to stop, this is called the 'stop grace period' and defaults to 2 minutes.
If any of the threads is still alive after the grace period expires `coord.join()`
raises a RuntimeError reporting the laggards.
IMPORTANT: 'start_queue_runners' starts threads for all queue runners collected in
the graph, & returns the list of all threads. This must be executed BEFORE running
any other training/inference/operation steps, or it will hang forever.
"""
with tf.Session() as sess:
    _, _ = sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            # Run training steps or whatever
            cropped_images1 = sess.run(cropped_images)
            print cropped_images1.shape
    except tf.errors.OutOfRangeError:
        print('Load and Process done -- epoch limit reached')
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
        coord.join(threads)
    sess.close()
I have about 60 thousand samples of size 200x870, they are all numpy arrays and I want to build a four-dimensional tensor out of them (with one singleton dimension) and train them with a CNN in tensorflow. Up to this point, I was using data that I could just load and create batches as below:
with tf.Graph().as_default():
    data_train = tf.to_float(getInput.data_train)
    phase, lr = tf.placeholder(tf.bool), tf.placeholder(tf.float32)
    global_step = tf.Variable(0, trainable=False)
    image_train, label_train = tf.train.slice_input_producer([data_train, labels_train], num_epochs=args.num_epochs)
    images_train, batch_labels_train = tf.train.batch([image_train, label_train], batch_size=args.bsize)
This no longer works because the full dataset does not fit into memory; can someone suggest a way to get around this?
I wanted to split the dataset into subsets and train on one after the other within a single epoch, using a Queue for the paths of these files:
import scipy.io as sc
import numpy as np
import threading
import time
import tensorflow as tf
from tensorflow.python.client import timeline
def testQueues():
    paths = ['data1', 'data2', 'data3', 'data4', 'data5']
    queue_capacity = 6
    bsize = 10
    num_epochs = 2
    filename_queue = tf.FIFOQueue(
        #min_after_dequeue=0,
        capacity=queue_capacity,
        dtypes=tf.string,
        shapes=[[]]
    )
    filenames_placeholder = tf.placeholder(dtype='string', shape=(None))
    filenames_enqueue_op = filename_queue.enqueue_many(filenames_placeholder)
    data_train, phase = tf.placeholder(tf.float32), tf.placeholder(tf.bool)
    sess = tf.Session()
    sess.run(filenames_enqueue_op, feed_dict={filenames_placeholder: paths})
    for i in range(len(paths)):
        train_set_batch_name = sess.run(filename_queue.dequeue())
        train_set_batch_name = train_set_batch_name.decode('utf-8')
        train_set_batch = np.load(train_set_batch_name + '.npy')
        train_set_batch = tf.cast(train_set_batch, tf.float32)
        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess.run(init_op)
        run_one_epoch(train_set_batch, sess)
        size = sess.run(filename_queue.size())
        print(size)
        print(train_set_batch)
def run_one_epoch(train_set, sess):
    image_train = tf.train.slice_input_producer([train_set], num_epochs=1)
    images_train = tf.train.batch(image_train, batch_size=10)
    x = tf.nn.relu(images_train)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(x)
    except tf.errors.OutOfRangeError:
        pass
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
        coord.join(threads)

testQueues()
However, I get the following error:
FailedPreconditionError: Attempting to use uninitialized value input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs
[[Node: input_producer/input_producer/fraction_of_32_full/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:#input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs)]]
Also, it seems I can't feed the dictionary with a tf.Tensor, only with a numpy array, but casting it to a tf.Tensor later is also troublesome.
Have a look at the Dataset API.
"The tf.data API enables you to build complex input pipelines from simple, reusable pieces."
In this approach you model your graph so that it handles the data for you, pulling in a limited amount at a time to train your model on.
If the memory issue still persists, you might want to look into using a generator to create your tf.data.Dataset (see the sketch below). Your next step could be to speed up the process by preparing TFRecords to create your Dataset.
Follow all the links to learn more and feel free to comment if you don't understand something.
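Here is a minimal sketch of the generator approach, under the assumption that the samples live in per-file .npy arrays of shape 200x870 as described in the question; the file names and label handling are placeholders:

import numpy as np
import tensorflow as tf

def sample_generator():
    # Yield one (sample, label) pair at a time so only a small part of
    # the data is ever held in memory; file names here are placeholders.
    for path in ['data1.npy', 'data2.npy', 'data3.npy']:
        chunk = np.load(path)                   # shape: (n, 200, 870)
        for sample in chunk:
            yield sample[..., np.newaxis], 0    # add singleton channel dim, dummy label

dataset = (tf.data.Dataset
           .from_generator(sample_generator,
                           output_types=(tf.float32, tf.int32),
                           output_shapes=((200, 870, 1), ()))
           .shuffle(1000)
           .batch(32)
           .prefetch(1))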
For data that doesn't fit into memory the standard solution is to use Queues. You can set up some ops that read from files directly (csv files, image files) and feed them into TensorFlow -- https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html
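A compact sketch of that queue-based pattern for CSV files, following the linked guide; the file names and the two-float-column record format are made up for illustration:

import tensorflow as tf

filename_queue = tf.train.string_input_producer(['data1.csv', 'data2.csv'], num_epochs=1)
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)
feature, label = tf.decode_csv(line, record_defaults=[[0.0], [0.0]])
features, labels = tf.train.batch([feature, label], batch_size=32)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            batch_x, batch_y = sess.run([features, labels])
    except tf.errors.OutOfRangeError:
        pass
    finally:
        coord.request_stop()
        coord.join(threads)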
I have trained a model with images.
And now I would like to extract the fc6 features to .npy files.
I'm using caffe.set_mode_gpu() to run the caffe.Classifier and extract the features.
Instead of extracting and saving the features per frame, I save all the features of a folder to a temporary variable and write the result for the complete video to a single .npy file (decreasing the number of write operations to disk).
I have also heard that I could use caffe.Net and then pass a batch of images, but I'm not sure what preprocessing has to be done and whether this is faster.
import os
import shutil
import sys
import glob
from multiprocessing import Pool
import numpy as np
import os, sys, getopt
import time
def keep_fldrs(path, listr):
    ll = list()
    for x in listr:
        if os.path.isdir(path + x):
            ll.append(x)
    return ll

def keep_img(path, listr):
    ll = list()
    for x in listr:
        if os.path.isfile(path + str(x)) & str(x).endswith('.jpg'):
            ll.append(x)
    return ll

def ifdir(path):
    if not os.path.isdir(path):
        os.makedirs(path)
# Main path to your caffe installation
caffe_root = '/home/anilil/projects/lstm/lisa-caffe-public/python'
# Model prototxt file
model_prototxt = '/home/anilil/projects/caffe2tensorflow/deploy_singleFrame.prototxt'
# Model caffemodel file
model_trained = '/home/anilil/projects/caffe2tensorflow/snapshots_singleFrame_flow_v2_iter_55000.caffemodel'
sys.path.insert(0, caffe_root)
import caffe
caffe.set_mode_gpu()
net = caffe.Classifier(model_prototxt, model_trained,
                       mean=np.array([128, 128, 128]),
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(255, 255))
Root='/media/anilil/Data/Datasets/UCf_scales/ori_mv_vis/Ori_MV/'
Out_fldr='/media/anilil/Data/Datasets/UCf_scales/ori_mv_vis/feat_fc6/'
allcalsses=keep_fldrs(Root,os.listdir(Root))
for classin in allcalsses:
    temp_class = Root + classin + '/'
    temp_out_class = Out_fldr + classin + '/'
    ifdir(temp_out_class)
    allvids_folders = keep_fldrs(temp_class, os.listdir(temp_class))
    for each_vid_fldr in allvids_folders:
        temp_pres_dir = temp_class + each_vid_fldr + '/'
        temp_out_pres_dir = temp_out_class + each_vid_fldr + '/'
        ifdir(temp_out_pres_dir)
        all_images = keep_img(temp_pres_dir, os.listdir(temp_pres_dir))
        frameno = 0
        if os.path.isfile(temp_out_pres_dir + 'video.npy'):
            continue
        start = time.time()
        temp_npy = np.ndarray((len(all_images), 4096), dtype=np.float32)
        for each_image in all_images:
            input_image = caffe.io.load_image(temp_pres_dir + each_image)
            prediction = net.predict([input_image], oversample=False)
            temp_npy[frameno, :] = net.blobs['fc6'].data[0]
            frameno = frameno + 1
        np.save(temp_out_pres_dir + 'video.npy', temp_npy)
        end = time.time()
        print "lenght of imgs {} and time taken is {}".format(len(all_images), (end - start))
    print('Class {} done'.format(classin))
Output
lenght of imgs 426 and time taken is 388.539139032
lenght of imgs 203 and time taken is 185.467905998
Time needed per image: around 0.9 seconds.
I found the best answer in this post.
Until now I had used
net = caffe.Classifier(model_prototxt, model_trained,
                       mean=np.array([128, 128, 128]),
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(255, 255))
to initialize a model and get the output per image.
But this method is really slow and requires around 0.9 seconds per image.
The better idea is to pass a batch of images (maybe 100, 200, or 250, depending on how much memory you have on your GPU).
For this I set caffe.set_mode_gpu(), as I have a GPU and it's faster when you send large batches.
Initialize the model with your trained model:
net = caffe.Net(model_prototxt, model_trained, caffe.TEST)
Create a Transformer and make sure to set the mean and other values depending on how you trained your model.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1)) # height*width*channel -> channel*height*width
mean_file = np.array([128, 128, 128])
transformer.set_mean('data', mean_file) #### subtract mean ####
transformer.set_raw_scale('data', 255) # pixel value range
transformer.set_channel_swap('data', (2,1,0)) # RGB -> BGR
data_blob_shape = net.blobs['data'].data.shape
data_blob_shape = list(data_blob_shape)
Read a group of images and convert to the network input.
net.blobs['data'].reshape(len(all_images), data_blob_shape[1], data_blob_shape[2], data_blob_shape[3])
images = [temp_pres_dir+str(x) for x in all_images]
net.blobs['data'].data[...] = map(lambda x: transformer.preprocess('data', caffe.io.load_image(x)),
                                  images)
Pass the batch of images through the network.
out = net.forward()
You can use this output as you wish.
The speed is now around 20 ms per image.
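To tie this back to the original fc6 question, here is a sketch of how the batched features might be collected; this is my own addition, assuming the feature blob is named 'fc6' as in the question's code:

# Hypothetical continuation: after net.forward(), read the fc6 activations
# for the whole batch and save them, one row per image.
feat_fc6 = net.blobs['fc6'].data.copy()   # shape: (len(all_images), 4096)
np.save(temp_out_pres_dir + 'video.npy', feat_fc6.astype(np.float32))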