OpenCV code snippet running slower inside Python multiprocessing process

I was running some tests with multiprocessing to parallelize face detection and recognition, and I came across a strange behaviour: detectMultiScale() (which performs the face detection) ran slower inside a child process than in the parent process (where I just call the function).
So I wrote the code below, in which 10 images are enqueued and face detection is then performed sequentially with one of two approaches: calling the detection function directly, or running it inside a single new process. The execution time of each detectMultiScale() call is printed. Running this code gives me an average of 0.22 s per call with the first approach and 0.54 s with the second. The total time to process the 10 images is also greater with the second approach.
I don't know why the same code snippet runs slower inside the new process. If only the total time were greater I would understand (given the overhead of setting up a new process), but I don't understand why each call is slower. For the record, I'm running this on a Raspberry Pi 3B+.
import cv2
import multiprocessing
from time import time, sleep

def detect(face_cascade, img_queue, bnd_queue):
    while True:
        image = img_queue.get()
        if image is not None:
            gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            ti = time()
            ########################################
            faces = face_cascade.detectMultiScale(
                gray_image,
                scaleFactor=1.1,
                minNeighbors=3,
                minSize=(130, 130))
            ########################################
            tf = time()
            print('det time: ' + str(tf-ti))
            if len(faces) > 0:
                max_bounds = (0,0,0,0)
                max_size = 0
                for (x,y,w,h) in faces:
                    if w*h > max_size:
                        max_size = w*h
                        max_bounds = (x,y,w,h)
            img_queue.task_done()
            bnd_queue.put('bound')
        else:
            img_queue.task_done()
            break

face_cascade = cv2.CascadeClassifier('../lbpcascade_frontalface_improved.xml')
cam = cv2.VideoCapture(0)
cam.set(cv2.CAP_PROP_FRAME_WIDTH, 2592)
cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 1944)
cam.set(cv2.CAP_PROP_BUFFERSIZE, 1)
img_queue = multiprocessing.JoinableQueue()
i = 0
while i < 10:
    is_there_frame, image = cam.read()
    if is_there_frame:
        image = image[0:1944, 864:1728]
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        img_queue.put(image)
        i += 1
bnd_queue = multiprocessing.JoinableQueue()
num_process = 1
ti = time()
# MULTIPROCESSING PROCESS APPROACH
for _ in range(num_process):
    p = multiprocessing.Process(target=detect, args=(face_cascade, img_queue, bnd_queue))
    p.start()
for _ in range(num_process):
    img_queue.put(None)
#
# FUNCTION CALL APPROACH
#img_queue.put(None)
#while not img_queue.empty():
#    detect(face_cascade, img_queue, bnd_queue)
img_queue.join()
tf = time()
print('TOTAL TIME: ' + str(tf-ti))
while not bnd_queue.empty():
    bound = bnd_queue.get()
    if bound != 'bound':
        print('ERROR')
    bnd_queue.task_done()

I am having the same issue. I think the reasons are that the task is somewhat I/O bound and that multiprocessing itself adds overhead.
You can also read this article: https://www.pyimagesearch.com/2019/09/09/multiprocessing-with-opencv-and-python/
The problem you describe with the detectMultiScale() method is the same as mine. I have also tried serialization and making the variables global or class-level, but nothing helped.
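One part of the overhead mentioned above can be estimated directly: objects put on a multiprocessing queue are pickled by the sender and unpickled by the receiver. The sketch below times that round trip for a buffer the size of one cropped frame from the question (1944 x 864 pixels, 3 bytes each). It is a rough illustration, not a diagnosis; `FRAME_BYTES` and `pickle_roundtrip` are names made up for this example, and a plain `bytes` object stands in for the NumPy image.

```python
import pickle
import time

# Size of one cropped frame from the question: 1944 rows x 864 columns x 3 channels.
FRAME_BYTES = 1944 * 864 * 3


def pickle_roundtrip(payload):
    """Serialize and deserialize payload once; return (seconds, restored copy)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    return time.perf_counter() - start, restored


if __name__ == "__main__":
    frame = bytes(FRAME_BYTES)  # stand-in for the raw bytes of one image
    elapsed, copy = pickle_roundtrip(frame)
    print("round trip for %d bytes: %.4f s" % (len(frame), elapsed))
```

Note that the timer in the question wraps only detectMultiScale(), so queue transfer would show up in the total time rather than in the per-call figure.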

Related

Python multiprocessing with a while loop and shared resources

I'm new to programming and I can't figure out how to optimise my project. I have a function which takes 2 images and uses OpenCV to stitch them together. This usually takes 0.5 seconds per stitched image, and I would like to make it faster.
At the moment I have 2 arrays, each containing 800 images, and a function called stitch_images which stitches each pair of images together. This function uses a while loop to go through the images and stitch each one to its corresponding image; this seems to be causing me issues, as the while loop blocks the process. I'm also using 2 shared global variables which contain the images.
What I would like to achieve is 4 processes, each of which takes a set of images and works on it, ideally reducing the computation time to about a quarter.
My question is: how would I go about achieving this? I understand that there are several ways to run work in parallel in Python, such as threading, multiprocessing, and queues. Which would be the best option for me? If there is an easy way to implement this, does anyone have example code?
This is my current setup:
import multiprocessing
import time
import cv2

# Global variables:
frames_1 = []
frames_2 = []
panorama = []

# converting the video into frames for individual image processing
def convert_video_to_frames():
    cap = cv2.VideoCapture("Sample_video_1.mp4")
    ret = True
    while ret:
        ret, img = cap.read() # read one frame from the 'capture' object; img is (H, W, C)
        if ret:
            frames_1.append(img)
    cap = cv2.VideoCapture("Sample_video_2.mp4")
    ret = True
    while ret:
        ret, img = cap.read() # read one frame from the 'capture' object; img is (H, W, C)
        if ret:
            frames_2.append(img)
    return frames_1, frames_2

#converting final output images back to video
def convert_frames_to_video():
    print("now creating stitched image video")
    height, width, layers = panorama[0].shape
    size = (width, height)
    out = cv2.VideoWriter('project.avi', cv2.VideoWriter_fourcc(*'DIVX'), 15, size)
    for i in range(len(panorama)):
        out.write(panorama[i])
    out.release()

def stitch_images():
    print("image processing starting...")
    stitcher = cv2.Stitcher_create(cv2.STITCHER_PANORAMA)
    while len(frames_1) != 0:
        status, result = stitcher.stitch((frames_1.pop(0), frames_2.pop(0)))
        if status == 0: # pass
            panorama.append(result)
        else:
            print("image stitching failed")

if __name__ == '__main__':
    convert_video_to_frames() # dummy function
    start = time.perf_counter()
    stitch_images()
    finish = time.perf_counter()
    print(f'finished in {round(finish - start, 2)} seconds(s)')
    print("now converting images to video...")
    convert_frames_to_video()
I have also tried using multiprocessing and locks to achieve this, by adding:
p1 = multiprocessing.Process(target=stitch_images)
p2 = multiprocessing.Process(target=stitch_images)
p1.start()
p2.start()
p1.join()
p2.join()
but when I run this, it seems to skip the while loop altogether.
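One likely reason the loop appears to be skipped: each multiprocessing.Process works on its own copy of the module globals, and on platforms that start children with "spawn" (Windows, macOS) the child re-imports the script, so frames_1 is empty there because convert_video_to_frames() only runs under the `__main__` guard. Results appended to panorama in a child never reach the parent either. A minimal sketch of an alternative, assuming the frames fit in memory, is to pass each pair of frames to a worker pool explicitly and collect the return values. `make_pairs`, `stitch_pair`, and `stitch_all` are hypothetical names, and a speedup is not guaranteed because every frame is pickled on its way to and from the workers.

```python
import multiprocessing


def make_pairs(frames_1, frames_2):
    """Pair up corresponding frames; extra frames in the longer list are dropped."""
    return list(zip(frames_1, frames_2))


def stitch_pair(pair):
    """Worker: stitch one pair of frames, return the panorama or None on failure."""
    import cv2  # imported here so each worker process sets up OpenCV itself

    stitcher = cv2.Stitcher_create(cv2.STITCHER_PANORAMA)
    status, result = stitcher.stitch(pair)
    return result if status == 0 else None  # 0 means success, as in the question


def stitch_all(frames_1, frames_2, workers=4):
    """Stitch all pairs in a pool of worker processes, keeping the input order."""
    pairs = make_pairs(frames_1, frames_2)
    with multiprocessing.Pool(workers) as pool:
        results = pool.map(stitch_pair, pairs)
    return [r for r in results if r is not None]
```

`stitch_all(frames_1, frames_2)` would be called from inside the `if __name__ == '__main__':` block, after convert_video_to_frames() has filled the lists, and its return value used in place of the global panorama.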

How can I recognize every 30th frame and ignore the rest?

I am developing a GUI with PyQt5 and I am stuck.
Because my program runs on a Raspberry Pi 4, I have limited processing power. I am getting video input from my webcam and want to perform face_recognition operations on it. Because of the limited processing power, I need to ignore most input frames and run face recognition only on every n-th frame to speed things up.
I tried to program a delay similar to this thread: Call function every x seconds (Python)
but it didn't work. Is there a possibility to directly refer to a frame?
This is the function where I am reading from the webcam:
def run(self):
    checker=0
    process_this_frame = 0
    # capture from web cam
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH,640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT,300)
    while True:
        ret, cv_img = cap.read()
        if ret:
            img = cv2.resize(cv_img, (0, 0), fx=0.25, fy=0.25)
            process_this_frame = process_this_frame+2
            print('process_this_frame: ' , process_this_frame)
            if process_this_frame % 20 == 0:
                predictions = predict(img, model_path="trained_knn_model.clf")
                print('showing predicted face')
                cv_img = show_prediction_labels_on_image(cv_img, predictions)
                checker=1
                self.change_pixmap_signal.emit(cv_img)
            else:
                checker=0
                self.change_pixmap_signal.emit(cv_img)
Specifically, I am looking for a suitable if condition that runs the predict function only on every n-th frame. For all other frames I just want to display the unmodified cv_img in the else branch. I tried several modulo expressions but did not find a suitable solution.
How can I do that? I would prefer to count frames rather than use a time delay, so that I can experiment to find the best interval.
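A frame counter combined with the modulo operator is enough for this. The sketch below, with the hypothetical helper `is_nth_frame` and an assumed interval `PROCESS_EVERY = 30`, increments the counter by one per captured frame and selects frames 0, 30, 60, and so on. In the code above the counter grows by 2 and is tested against 20, so prediction already runs on every 10th frame.

```python
PROCESS_EVERY = 30  # assumed interval; tune it for the Raspberry Pi


def is_nth_frame(frame_index, n=PROCESS_EVERY):
    """Return True for frames 0, n, 2n, ... counted from the start of capture."""
    return frame_index % n == 0


def select_frames(total_frames, n=PROCESS_EVERY):
    """List the frame indices that would be sent to the expensive prediction step."""
    return [i for i in range(total_frames) if is_nth_frame(i, n)]


if __name__ == "__main__":
    # In run(), keep a counter next to cap.read() and increment it once per frame:
    #     if is_nth_frame(frame_index): predictions = predict(...)
    #     frame_index += 1
    print(select_frames(100))
```

Emitting cv_img once after the if/else would also remove the duplicated emit call.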

Multi process Video Processing

I would like to do video processing on neighboring frames. More specifically, I would like to compute the mean squared error between neighboring frames:
mean_squared_error(prev_frame,frame)
I know how to compute this in a straightforward sequential way: I use the imutils package, which uses a queue to decouple loading the frames from processing them. Because the frames are stored in a queue, I don't need to wait for them before I can process them. But I want it to be even faster.
# import the necessary packages to read the video
import imutils
from imutils.video import FileVideoStream
# package to compute mean squared error
from skimage.metrics import mean_squared_error

if __name__ == '__main__':
    # SPECIFY PATH TO VIDEO FILE
    path_video = "VIDEO_PATH.mp4"
    # START IMUTILS VIDEO STREAM
    print("[INFO] starting video file thread...")
    fvs = FileVideoStream(path_video, transform=transform_image).start()
    # INITIALIZE LIST to store the results
    mean_square_error_list = []
    # READ PREVIOUS FRAME
    prev_frame = fvs.read()
    # LOOP over frames from the video file stream
    while fvs.more():
        # GRAB THE NEXT FRAME from the threaded video file stream
        frame = fvs.read()
        # COMPUTE the metric
        metric_val = mean_squared_error(prev_frame,frame)
        mean_square_error_list.append(1-metric_val) # Append to list
        # UPDATE previous frame variable
        prev_frame = frame
Now my question is: how can I use multiprocessing for the computation of the metric to increase speed and save time?
My operating system is Windows 10 and I am using Python 3.8.0.

There are many aspects to making things faster, so I'll focus only on the multiprocessing part.
Since you don't want to read the whole video at once, we have to read it frame by frame.
I'll be using OpenCV (cv2) for reading the frames and NumPy for calculating the MSE and saving it to disk.
First, we can start without any multiprocessing so that we have a baseline to benchmark against. I'm using a 1920x1080 video at 60 FPS, duration 1:29, size 100 MB.
import cv2
import sys
import time
import numpy as np
import subprocess as sp
import multiprocessing as mp

filename = '2.mp4'

def process_video():
    cap = cv2.VideoCapture(filename)
    proc_frames = 0
    mse = []
    prev_frame = None
    ret = True
    while ret:
        ret, frame = cap.read() # reading frames sequentially
        if ret == False:
            break
        if not (prev_frame is None):
            c_mse = np.mean(np.square(prev_frame-frame))
            mse.append(c_mse)
        prev_frame = frame
        proc_frames += 1
    np.save('data/' + 'sp' + '.npy', np.array(mse))
    cap.release()
    return

if __name__ == "__main__":
    t1 = time.time()
    process_video()
    t2 = time.time()
    print(t2-t1)
In my system, it runs for 142 secs.
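One caveat about `np.mean(np.square(prev_frame-frame))`: frames returned by cap.read() are typically uint8 arrays, and NumPy integer arithmetic on them wraps modulo 256, so the value actually averaged is the squared difference modulo 256. That is only correct while the absolute pixel difference stays below 16. The plain-Python sketch below mimics the wraparound for single pixel values to show the effect; `wrapped_sq_diff` and `true_sq_diff` are illustrative names. With NumPy, casting first, for example `prev_frame.astype(np.float32) - frame.astype(np.float32)`, avoids the problem at the cost of extra memory.

```python
def wrapped_sq_diff(a, b):
    """Squared difference as uint8 arithmetic computes it: every step is taken mod 256."""
    diff = (a - b) % 256
    return (diff * diff) % 256


def true_sq_diff(a, b):
    """Squared difference computed without overflow."""
    return (a - b) ** 2


if __name__ == "__main__":
    # Small differences survive the wraparound, large ones do not.
    for a, b in [(10, 20), (200, 100), (5, 5)]:
        print(a, b, wrapped_sq_diff(a, b), true_sq_diff(a, b))
```

Both versions of the code in this answer share the same behaviour, so the sanity check between them is unaffected.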
Now, we can take the multiprocessing approach. The idea can be summarized in the following illustration.
GIF credit: Google
We make some segments (based on how many cpu cores we have) and process those segmented frames in parallel.
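The segmentation can be sketched on its own, without OpenCV. Assuming the total frame count comes from cv2.CAP_PROP_FRAME_COUNT and there is one segment per process, group `g` covers frames `g * unit` up to (but not including) `(g + 1) * unit`, where `unit = total_frames // num_processes`. `segment_ranges` and `leftover_frames` are hypothetical helpers; note that with floor division the last `total_frames % num_processes` frames fall outside every segment.

```python
def segment_ranges(total_frames, num_processes):
    """Return (start, stop) frame indices per group; stop is exclusive."""
    unit = total_frames // num_processes
    return [(g * unit, (g + 1) * unit) for g in range(num_processes)]


def leftover_frames(total_frames, num_processes):
    """Number of frames at the end of the video that no group processes."""
    return total_frames % num_processes


if __name__ == "__main__":
    # 103 frames and 4 processes are made-up numbers for illustration.
    print(segment_ranges(103, 4))
    print(leftover_frames(103, 4))
```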
import cv2
import sys
import time
import numpy as np
import subprocess as sp
import multiprocessing as mp

filename = '2.mp4'

def process_video(group_number):
    cap = cv2.VideoCapture(filename)
    num_processes = mp.cpu_count()
    frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)
    proc_frames = 0
    mse = []
    prev_frame = None
    while proc_frames < frame_jump_unit:
        ret, frame = cap.read()
        if ret == False:
            break
        if not (prev_frame is None):
            c_mse = np.mean(np.square(prev_frame-frame))
            mse.append(c_mse)
        prev_frame = frame
        proc_frames += 1
    np.save('data/' + str(group_number) + '.npy', np.array(mse))
    cap.release()
    return

if __name__ == "__main__":
    t1 = time.time()
    num_processes = mp.cpu_count()
    print(f'CPU: {num_processes}')
    # only meta-data
    cap = cv2.VideoCapture(filename)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes
    cap.release()
    p = mp.Pool(num_processes)
    p.map(process_video, range(num_processes))
    # merging
    # the missing mse will be
    final_mse = []
    for i in range(num_processes):
        na = np.load(f'data/{i}.npy')
        final_mse.extend(na)
        try:
            cap = cv2.VideoCapture(filename) # you could also take it outside the loop to reduce some overhead
            frame_no = (frame_jump_unit) * (i+1) - 1
            print(frame_no)
            cap.set(1, frame_no)
            _, frame1 = cap.read()
            #cap.set(1, ((frame_jump_unit) * (i+1)))
            _, frame2 = cap.read()
            c_mse = np.mean(np.square(frame1-frame2))
            final_mse.append(c_mse)
            cap.release()
        except:
            print('failed in 1 case')
            # in the last few frames, nothing left
            pass
    t2 = time.time()
    print(t2-t1)
    np.save(f'data/final_mse.npy', np.array(final_mse))
I'm just using numpy save to store the partial results; you can try something better.
This one runs in 49.56 secs with my cpu_count = 12. There are definitely some bottlenecks that could be removed to make it run faster.
The only issue with my implementation is that it misses the MSE at the points where the video was segmented, but that is easy to add. Since OpenCV lets us seek to an individual frame at any position, we can go to those locations, calculate the MSE separately, and merge it into the final result. [Check the updated code; it fixes the merging part.]
You can write a simple sanity check to ensure that both versions give the same result.
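Which frame pairs are missing can be listed explicitly. Under the same assumptions as the code above (segments of `unit = total_frames // num_processes` frames), each worker compares frames only within its own segment, so the pair straddling each boundary, `(g * unit - 1, g * unit)`, has to be computed separately. `boundary_pairs` and `merged_length` are hypothetical helper names.

```python
def boundary_pairs(total_frames, num_processes):
    """Frame index pairs (last of segment g-1, first of segment g) that no worker compares."""
    unit = total_frames // num_processes
    return [(g * unit - 1, g * unit) for g in range(1, num_processes)]


def merged_length(total_frames, num_processes):
    """Number of MSE values after merging per-segment results with the boundary values."""
    unit = total_frames // num_processes
    per_segment = max(unit - 1, 0)  # unit frames give unit - 1 neighbouring pairs
    return per_segment * num_processes + len(boundary_pairs(total_frames, num_processes))
```

When the frame count is not a multiple of the process count, the frames after the last segment are not covered by this sketch.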
import numpy as np

a = np.load('data/sp.npy')
b = np.load('data/final_mse.npy')
print(a.shape)
print(b.shape)
print(a[:10])
print(b[:10])
for i in range(len(a)):
    if a[i] != b[i]:
        print(i)
Now, some additional speedups can come from using a CUDA-compiled opencv, ffmpeg, adding queuing mechanism plus multiprocessing, etc.

Optimized skeleton function for opencv with python

So I am using OpenCV on raspbian (raspberry pi 2 model B). I am doing vision/image processing obviously and the rasppi is what I was given (I would use a computer if I could for this).
I need to run a skeleton function. I found the following implementation:
import cv2
import numpy as np

img = cv2.imread('img.png',0)
size = np.size(img)
skeleton = np.zeros(img.shape,np.uint8)
ret,img = cv2.threshold(img,127,255,0)
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(3,3))
finished = False
while(not finished):
    eroded = cv2.erode(img,kernel)
    temp = cv2.dilate(eroded,kernel)
    temp = cv2.subtract(img,temp)
    skeleton = cv2.bitwise_or(skeleton,temp)
    img = eroded.copy()
    zeros = size - cv2.countNonZero(img)
    if zeros==size:
        finished = True
cv2.imshow("skeleton",skeleton)
cv2.waitKey(0)
cv2.destroyAllWindows()
It runs, but unsurprisingly it is very slow (I am doing an FFT and a bandpass filtering operation on the image before this, then running the skeleton operation). The other code is slow too, but it does complete.
The images are big. I could crop them somewhat, but I don't think that would be enough. I have been trying to find an optimized version of this, but so far haven't come up with anything. Any ideas or solutions?

In this answer, I'll focus on improving your implementation, rather than the algorithm. While this won't gain us a significant amount, I think it's still useful to be aware of.
Preparation
Let's begin with some boilerplate -- the necessary imports, a test image, and a few functions that let us compare variants easily:
from timeit import default_timer as timer
import numpy as np
import cv2

# Create a decent size test image...
img = cv2.imread('cage.png',0)
img = cv2.resize(img, (2048, 2048))
cv2.normalize(img, img, 0, 255, cv2.NORM_MINMAX)

def time_fn(fn, img, iters=1):
    start = timer()
    result = None
    for i in range(iters):
        result = fn(img)
    end = timer()
    return (result,((end - start) / iters) * 1000)

def run_test(fn, img, i):
    res, t = time_fn(fn, img, 4)
    cv2.imwrite("skeleton_%d.png" % i, res[0])
    print("Variant %d" % i)
    print("Input size = (%d, %d)" % img.shape[:2])
    print("Ran %d iterations to find skeleton." % res[1])
    print("Avg. find_skeleton time = %0.4f s." % (t/1000))
Variant 1 (Original)
Let's turn your implementation into a function, and remove a few unnecessary bits. Out of curiosity, let's track the number of iterations needed for the skeletonization.
def find_skeleton1(img):
    skeleton = np.zeros(img.shape,np.uint8)
    _,thresh = cv2.threshold(img,127,255,0)
    kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(3,3))
    iters = 0
    while(True):
        eroded = cv2.erode(thresh, kernel)
        temp = cv2.dilate(eroded, kernel)
        temp = cv2.subtract(thresh, temp)
        skeleton = cv2.bitwise_or(skeleton, temp)
        thresh = eroded.copy()
        iters += 1
        if cv2.countNonZero(thresh) == 0:
            return (skeleton,iters)
And let's see how it performs to set our baseline.
>>> run_test(find_skeleton1, img, 1)
Variant 1
Input size = (2048, 2048)
Ran 338 iterations to find skeleton.
Avg. find_skeleton time = 2.7969 s.
Variant 2
The first improvement we can make is to minimize the number of allocations of new array objects, and reuse as much as possible. We can create a few more temporary arrays (like skeleton), and use the dst parameter of the OpenCV functions in the loop ignoring the return value. Since we provide a destination of correct shape and data type, the existing array gets reused.
def find_skeleton2(img):
    skeleton = np.zeros(img.shape,np.uint8)
    eroded = np.zeros(img.shape,np.uint8)
    temp = np.zeros(img.shape,np.uint8)
    _,thresh = cv2.threshold(img,127,255,0)
    kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(3,3))
    iters = 0
    while(True):
        cv2.erode(thresh, kernel, eroded)
        cv2.dilate(eroded, kernel, temp)
        cv2.subtract(thresh, temp, temp)
        cv2.bitwise_or(skeleton, temp, skeleton)
        thresh = eroded.copy()
        iters += 1
        if cv2.countNonZero(thresh) == 0:
            return (skeleton,iters)
Let's try this out, and check that the results are the same:
>>> print(np.array_equal(find_skeleton1(img)[0], find_skeleton2(img)[0]))
True
>>> run_test(find_skeleton2, img, 2)
Variant 2
Input size = (2048, 2048)
Ran 338 iterations to find skeleton.
Avg. find_skeleton time = 1.4356 s.
Variant 3
The next step is to get rid of unnecessary copies -- there's one that's very obvious: thresh = eroded.copy(). Notice that in the following iteration, we immediately overwrite the contents of eroded. Hence, we don't really care what it contains, as long as it's the correct shape and data type. They are, so this means that rather than performing a copy, we can just swap the two objects.
def find_skeleton3(img):
    skeleton = np.zeros(img.shape,np.uint8)
    eroded = np.zeros(img.shape,np.uint8)
    temp = np.zeros(img.shape,np.uint8)
    _,thresh = cv2.threshold(img,127,255,0)
    kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(3,3))
    iters = 0
    while(True):
        cv2.erode(thresh, kernel, eroded)
        cv2.dilate(eroded, kernel, temp)
        cv2.subtract(thresh, temp, temp)
        cv2.bitwise_or(skeleton, temp, skeleton)
        thresh, eroded = eroded, thresh # Swap instead of copy
        iters += 1
        if cv2.countNonZero(thresh) == 0:
            return (skeleton,iters)
Again, let's verify the results match and do some timing.
>>> print(np.array_equal(find_skeleton1(img)[0], find_skeleton3(img)[0]))
True
>>> run_test(find_skeleton3, img, 3)
Variant 3
Input size = (2048, 2048)
Ran 338 iterations to find skeleton.
Avg. find_skeleton time = 0.9839 s.
A few simple changes got the timing down to ~35% of the original. Of course, it still does hundreds of iterations, each processing the entire image. The next step would be to look into ways to reduce the amount of work: in the later iterations, significant areas of the working image are black and don't contribute anything to the skeleton.
NB: Measurements were done on an i7-4930K. I don't have a Raspberry Pi; feel free to add timings from yours, so we can see what sort of effect it has.

Python/OpenCV : Laser curve analysis from camera video flow

Good morning,
I'm currently trying to study real-time liquid surface deformations by sending a laser sheet on the surface and gathering its reflection. What I obtain is typically a bright curve at each timestep, and I wish to analyze its coordinates.
I thus brought myself to write a Python script, which is displayed right below (The analysis part is retaken from laser curved line detection using opencv and python, as it represents nearly exactly what I'm trying to do, except that I'm working with a video flow) :
import cv2
from PIL import Image
import cv2.cv as cv
import numpy as np
import time

myfile = open("hauteur.txt","w")

#Import camera flow
class Target:
    def __init__(self):
        self.capture = cv.CaptureFromCAM(0)
        cv.namedWindow("Target", 1)
        cv.SetCaptureProperty(self.capture,cv.CV_CAP_PROP_FRAME_WIDTH, 150)
        cv.SetCaptureProperty(self.capture,cv.CV_CAP_PROP_FRAME_HEIGHT, 980)
        cv.SetCaptureProperty(self.capture,cv.CV_CAP_PROP_FPS, 60 )

    def run(self):
        frame = cv.QueryFrame(self.capture)
        frame_size = cv.GetSize(frame)
        color_image_cv = cv.CreateImage(cv.GetSize(frame), 8, 3)
        color_image = np.array(color_image_cv)
        grey_image = cv.CreateImage(cv.GetSize(frame), cv.IPL_DEPTH_8U, 1)
        first = True
        t = time.clock()
        # Frame analysis
        while True:
            ret, bw = cv2.threshold(color_image, 0, 255, cv2.THRESH_BINARY)
            contours, hierarchy = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
            curves = np.zeros((img.shape[0], img.shape[1], 3), np.uint8)
            for i in range(len(contours)):
                for col in range(draw.shape[1]):
                    M = cv2.moments(draw[:, col])
                    if M['m00'] != 0:
                        x = col
                        y = int (M['m01']/M['m00'])
                        curves[y, x, :] = (0, 0, 255)
                        res = {'X' : x, 'Y' : y, 't' : t}
                        print res
                        myfile.write('{X}\t{Y}\t{t}'.format(**res))
                        myfile.write("\n")
            cv2.ShowImage("Target", color_image)
            # Listen for ESC key
            c = cv2.WaitKey(7) % 0x100
            if c == 27:
                break

if __name__=="__main__":
    t = Target()
    t.run()
However, the use of cv and cv2 functions within the same code seems to bring a nice mess and I get the error
src data type = 17 is not supported
from line
ret, bw = cv2.threshold(color_image, 0, 255, cv2.THRESH_BINARY)
I understand this arises from the way cv and cv2 functions create and store images, but no conversion I have tried seems to work, and I didn't find equivalent cv2 functions to use in the part that imports the video flow (as you may have guessed, I'm not a programming pro and may have missed what I need in the documentation). Is there a way to reconcile these cv and cv2 functions, or to get an equivalent camera flow with cv2 functions?
Bonus question: how fast can a script like this run? (I'd eventually need it to run at 300-400 fps, and I'm not even sure that is feasible.)
Thanks for your attention

ok, cv2 video code:
    def __init__(self):
        self.capture = cv2.VideoCapture(0)
        cv2.namedWindow("Target", 1)
        self.capture.set(cv2.CAP_PROP_FRAME_WIDTH, 150)
        self.capture.set(cv2.CAP_PROP_FRAME_HEIGHT, 980)
        self.capture.set(cv2.CAP_PROP_FPS, 60 )

    def run(self):
        ok, frame = self.capture.read()
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        ...
Bonus question: of course, it can only run as fast as the capture delivers frames. 300 fps seems absurd; 30 fps is more likely.
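To see what rate the capture actually delivers, one option is to time a fixed number of reads. The sketch below is a rough measurement aid and not part of the original answer; `measure_fps` is a hypothetical helper that accepts any zero-argument callable, so with OpenCV it could be passed something like `lambda: capture.read()`.

```python
import time


def measure_fps(read_frame, n_frames=100):
    """Call read_frame() n_frames times and return the achieved calls per second."""
    start = time.perf_counter()
    for _ in range(n_frames):
        read_frame()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in frame source that takes about 10 ms per frame.
    print("%.1f fps" % measure_fps(lambda: time.sleep(0.01), n_frames=20))
```

Any per-frame analysis added inside the loop can only lower this figure.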
