I have a simple algorithm that I want to run fast in parallel. The algorithm is:
while stream:
    img = read_image()
    pre_processed_img = pre_process(img)
    text = ocr(pre_processed_img)
    fine_text = post_process(text)
Now I want to explore the fastest options Python offers for running this algorithm in parallel.
Some of the code is as follows:
import cv2
import pytesseract

def pre_process_img(frame):
    return cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)

def ocr(frame):
    return pytesseract.image_to_string(frame)
How can I run the given code in parallel/multiple threads/other options, especially the pre-process and ocr part?
I have tried joblib, but it is designed around for-loops, and I wasn't sure how to apply it to a while loop over a continuous stream of frames.
I have looked at other people's code, but I have not been able to adapt it to my example.
Edit
We can definitely combine it in a pipeline.
while stream:
    img = read_image()
    results = pipeline(img)
Now I want to execute the pipeline for different frames in multiple processes.
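A minimal sketch of one such setup (my addition, assuming read_image, pre_process, ocr, and post_process are picklable module-level functions, and handle is a hypothetical consumer of the results): a multiprocessing.Pool can consume the frame stream as a generator, which sidesteps the while-loop question.

import multiprocessing as mp

def pipeline(img):
    # Runs entirely inside a worker process.
    return post_process(ocr(pre_process(img)))

def frames(stream):
    # Adapt the while loop into an iterator that Pool.imap can consume.
    while stream:
        yield read_image()

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        # imap preserves frame order; imap_unordered trades order for throughput.
        for fine_text in pool.imap(pipeline, frames(stream)):
            handle(fine_text)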
Related
My main aim is to read in around 16k images for a data science project, and I can barely manage that serially.
I have done some parallelization in C++, but I am unfamiliar with doing it in Python. Essentially, all I need is to parallelize a for loop that calls a function that reads in an image using the matplotlib.image package and returns the image object; I then simply append that object to a list. Here is the function:
import matplotlib.image as mpimg

def read_img(name):
    try:
        img = mpimg.imread(name)
        return img
    except OSError:
        return "Did not find image"
I ran my code for 100, 1000, and then 5000 images in one go to see if it would run at all; it ran fine until I tried 5000, at which point my Jupyter notebook just crashed. My system has 24 GB of RAM and 12 cores, so I definitely need to find a way to parallelize this.
I know there are two modules in Python for parallelization, multiprocessing and joblib, but I am not sure how to approach this problem, which I know is very basic. Any guidance would be much appreciated.
You can use Python's ThreadPoolExecutor (see the concurrent.futures documentation).
Here is a general program; it is not perfect, but if you fill in the details it should work:
import matplotlib.image as mpimg
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_img(name):
    try:
        return mpimg.imread(name)
    except OSError:
        # Return None rather than a string sentinel: comparing a string
        # against a numpy array does not do what you would expect.
        return None

# suppose `files` contains the 16k file names
files = ['f1.jpg', 'f2.jpg']

future_to_file = {}
images_read = []
with ThreadPoolExecutor(max_workers=4) as executor:
    for file in files:
        future = executor.submit(read_img, file)
        future_to_file[future] = file
    for future in as_completed(future_to_file):
        file = future_to_file[future]
        img_read = future.result()
        if img_read is not None:
            images_read.append((file, img_read))
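If profiling shows the bottleneck is CPU-bound decoding rather than disk I/O, a process pool is a near drop-in swap. A sketch under that assumption (same placeholder file names as above; note each decoded array is pickled back to the parent, so peak memory can grow):

import matplotlib.image as mpimg
from concurrent.futures import ProcessPoolExecutor

def read_img(name):
    try:
        return mpimg.imread(name)
    except OSError:
        return None

if __name__ == '__main__':  # required when using process pools
    files = ['f1.jpg', 'f2.jpg']
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = executor.map(read_img, files)
        images_read = [(f, img) for f, img in zip(files, results) if img is not None]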
I'm writing a Python (3.4.3) program that uses VIPS (8.1.1) on Ubuntu 14.04 LTS to read many small tiles using multiple threads and put them together into a large image.
In a very simple test:
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Lock
from gi.repository import Vips

canvas = Vips.Image.black(8000, 1000, bands=3)
lock = Lock()

def do_work(x):
    global canvas
    img = Vips.Image.new_from_file('part.tif')  # RGB tiff image
    with lock:
        canvas = canvas.insert(img, x * 1000, 0)

with ThreadPoolExecutor(max_workers=8) as executor:
    for x in range(8):
        executor.submit(do_work, x)

canvas.write_to_file('complete.tif')
I get the correct result. In my full program, the work for each thread involves reading binary data from a source file, turning it into TIFF format, reading the image data, and inserting it into the canvas. It seems to work, but when I try to examine the result I run into trouble. Because the image is extremely large (~50000 x 100000 pixels), I couldn't save the entire image in one file, so I tried:
canvas = canvas.resize(.5)
canvas.write_to_file('test.jpg')
This takes an extremely long time, and the resulting JPEG has only black pixels. If I resize three times, the program gets killed. I also tried:
canvas.extract_area(20000,40000,2000,2000).write_to_file('test.tif')
This results in the error message segmentation fault (core dumped), but it does save an image. There is image content in it, but it seems to be in the wrong place.
I'm wondering what the problem could be.
Below is the code for the complete program. The same logic was also implemented using OpenCV + sharedmem (sharedmem handled the multiprocessing part), and it worked without a problem.
import os
import subprocess
import pickle
from multiprocessing import Lock
from concurrent.futures import ThreadPoolExecutor
import threading
import numpy as np
from gi.repository import Vips
lock = Lock()
def read_image(x):
    global canvas
    with open(file_name, 'rb') as fin:
        fin.seek(sublist[x]['dataStartPos'])
        temp_array = np.fromfile(fin, dtype='int8', count=sublist[x]['dataSize'])
    name_base = os.path.join(rd_path, threading.current_thread().name + 'tempimg')
    with open(name_base + '.jxr', 'wb') as fout:
        temp_array.tofile(fout)
    subprocess.call(['./JxrDecApp', '-i', name_base + '.jxr', '-o', name_base + '.tif'])
    temp_img = Vips.Image.new_from_file(name_base + '.tif')
    with lock:
        canvas = canvas.insert(temp_img, sublist[x]['XStart'], sublist[x]['YStart'])
def assemble_all(filename, ramdisk_path, scene):
    global canvas, sublist, file_name, rd_path, tilesize_x, tilesize_y
    file_name = filename
    rd_path = ramdisk_path
    file_info = fetch_pickle(filename)  # a custom function
    # this info includes where to begin reading image data, image size and coordinates
    tilesize_x = file_info['sBlockList_P0'][0]['XSize']
    tilesize_y = file_info['sBlockList_P0'][0]['YSize']
    sublist = [item for item in file_info['sBlockList_P0'] if item['SStart'] == scene]
    max_x = max([item['XStart'] for item in file_info['sBlockList_P0']])
    max_y = max([item['YStart'] for item in file_info['sBlockList_P0']])
    canvas = Vips.Image.black((max_x + tilesize_x), (max_y + tilesize_y), bands=3)
    with ThreadPoolExecutor(max_workers=4) as executor:
        for x in range(len(sublist)):
            executor.submit(read_image, x)
    return canvas
The above module (imported as mcv) is called in the driver script:
canvas = mcv.assemble_all(filename, ramdisk_path, 0)
To examine the content, I used:
canvas.extract_area(25000, 40000, 2000, 2000).write_to_file('test_vips1.jpg')
I think your problem has to do with the way libvips calculates pixels.
In systems like OpenCV, images are huge areas of memory. You perform a series of operations, and each operation modifies a memory image in some way.
libvips is not like this, though the interface looks similar. In libvips, when you perform an operation on an image, you are actually just adding a new section to a pipeline. It's only when you finally connect the output to some sink (a file on disk, or a region of memory you want filled with image data, or an area of the display) that libvips will actually do any calculations. libvips will then use a recursive algorithm to run a large set of worker threads up and down the whole length of the pipeline, evaluating all of the operations you created at the same time.
To make an analogy with programming languages, systems like OpenCV are imperative, libvips is functional.
The good thing about the way libvips does things is that it can see the whole pipeline at once and it can optimise away most of the memory use and make good use of your CPU. The bad thing is that long sequences of operations can need large amounts of stack to evaluate (whereas with systems like OpenCV you are more likely to be bounded by image size). In particular, the recursive system used by libvips to evaluate means that pipeline length is limited by the C stack, about 2MB on many operating systems.
Here's a simple test program that does more or less what you are doing:
#!/usr/bin/python3

import sys
import pyvips

if len(sys.argv) < 4:
    print("usage: %s image-in image-out n" % sys.argv[0])
    print("  make an n x n grid of image-in")
    sys.exit(1)

tile = pyvips.Image.new_from_file(sys.argv[1])
outfile = sys.argv[2]
size = int(sys.argv[3])

img = pyvips.Image.black(size * tile.width, size * tile.height, bands=3)
for y in range(size):
    for x in range(size):
        img = img.insert(tile, x * tile.width, y * tile.height)

# we're not interested in huge files for this test, just write a small patch
img.crop(10, 10, 100, 100).write_to_file(outfile)
You run it like this:
time ./bigjoin.py ~/pics/k2.jpg out.tif 2
real 0m0.176s
user 0m0.144s
sys 0m0.031s
It loads k2.jpg (a 2k x 2k JPEG image), repeats that image into a 2 x 2 grid, and saves a small part of it. This program will work well with very large images; try removing the crop and running it as:
./bigjoin.py huge.tif out.tif[bigtiff] 10
and it'll copy the huge tiff image 100 times into a REALLY huge tiff file. It'll be quick and use little memory.
However, this program will become very unhappy with small images being copied many times. For example, on this machine (a Mac), I can run:
./bigjoin.py ~/pics/k2.jpg out.tif 26
But this fails:
./bigjoin.py ~/pics/k2.jpg out.tif 28
Bus error: 10
With a 28 x 28 output, that's 784 tiles. The way we've built the image, repeatedly inserting a single tile, that's a pipeline 784 operations long -- long enough to cause a stack overflow. On my Ubuntu laptop I can get pipelines up to about 2,900 operations long before it starts failing.
There's a simple way to fix this program: build a wide rather than a deep pipeline. Instead of inserting a single image each time, make a set of strips, then join the strips. Now the pipeline depth will be proportional to the square root of the number of tiles. For example:
img = pyvips.Image.black(size * tile.width, size * tile.height, bands=3)
for y in range(size):
    strip = pyvips.Image.black(size * tile.width, tile.height, bands=3)
    for x in range(size):
        strip = strip.insert(tile, x * tile.width, 0)
    img = img.insert(strip, 0, y * tile.height)
Now I can run:
./bigjoin2.py ~/pics/k2.jpg out.tif 200
Which is 40,000 images joined together.
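As a side note (my addition, assuming a libvips new enough to have arrayjoin, i.e. 8.2+): the grid can also be built in a single operation, which keeps the pipeline depth constant no matter how many tiles there are:

import pyvips

tile = pyvips.Image.new_from_file('part.tif')
size = 200

# arrayjoin lays out the whole grid as one pipeline node instead of size*size inserts
img = pyvips.Image.arrayjoin([tile] * (size * size), across=size)
img.write_to_file('out.tif[bigtiff]')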
I'm looking for a way to output multiple values using the generic_filter module in scipy.ndimage like so:
import numpy as np
from scipy import ndimage

a = np.array([range(1, 5), range(5, 9), range(9, 13), range(13, 17)])

def summary(a):
    minVal = np.min(a)
    maxVal = np.max(a)
    return [minVal, maxVal]

[arrMin, arrMax] = ndimage.generic_filter(a, summary, footprint=np.ones((3, 3)))
But I keep getting the error that a float is expected.
I've played with the 'output' parameter, like so:
arrMin = np.zeros(np.shape(a))
arrMax = np.zeros(np.shape(a))
ndimage.generic_filter(a, summary, footprint=np.ones((3, 3)), output=[arrMin, arrMax])
to no avail. I've also tried returning a named tuple, a class, or a dictionary, as per this question, none of which have worked.
Based on the comments, you want to perform multiple filters simultaneously rather than performing them separately.
Unfortunately I do not think this filter works that way. It expects you to return a single filtered output value for each corresponding input value. I looked for a way to do simultaneous filters with numpy/scipy but couldn't find anything.
If you can manage a data flow that lets you load the image, filter, process, and produce some small result in separate parallel paths (one per filter), then you may get some benefit from multiprocessing, but used naively it is likely to take more time than doing everything sequentially. If you really have a bottleneck that multiprocessing solves, you should also look into sharing your input array rather than loading it in each process.
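As a sketch of the separate-filters route (my addition, not part of the answer above): for min and max specifically you don't need generic_filter at all, since scipy.ndimage ships dedicated filters that each return one output array and avoid the Python callback entirely:

import numpy as np
from scipy import ndimage

a = np.arange(1, 17).reshape(4, 4)
footprint = np.ones((3, 3))

# One pass per statistic; each filter returns a single value per pixel.
arrMin = ndimage.minimum_filter(a, footprint=footprint)
arrMax = ndimage.maximum_filter(a, footprint=footprint)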
I am trying to speed up my Python script, which uses VTK methods (and VTK objects) for processing geometric measurements. Since some of my methods involve looping over very similar meshes and computing enclosed points for each of them, I simply wanted to parallelise such for loops:
averaged_contained_points = []
for intersection_actor in intersection_actors:
    contained_points = vtk_mesh.points_inside_mesh(point_data=point_data,
                                                   mesh=intersection_actor.GetMapper().GetInput())
    mean_pos = np.mean(contained_points, axis=0)
    averaged_contained_points.append(mean_pos)
In this case the function vtk_mesh.points_inside_mesh calls vtk.vtkSelectEnclosedPoints() and takes a vtkActor and vtkPolyData as input.
The main question is: How can this be converted to run in parallel?
My initial attempt was to import multiprocessing, but I then switched to pathos.multiprocessing, which seems to have a few advantages; the two work fairly similarly.
The problem is that the code below doesn't work.
def _parallel_generate_intersection_avg(inputs):
    point_data = inputs[0]
    intersection_actor = inputs[1]
    contained_points = vtk_mesh.points_inside_mesh(point_data=point_data,
                                                   mesh=intersection_actor.GetMapper().GetInput())
    if len(contained_points) == 0:
        return np.array([-1, -1, -1])
    return np.mean(contained_points, axis=0)

pool = ProcessingPool(CPU_COUNT)
inputs = [[point_data, intersection_actor] for intersection_actor in intersection_actors]
averaged_contained_points = pool.map(_parallel_generate_intersection_avg, inputs)
It results in this sort of error:
pickle.PicklingError: Can't pickle 'vtkobject' object: (vtkPolyData)0x111ed5bf0
I have done some research and found that vtkobjects probably can't be pickled:
Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map()
However, since the available answers didn't give me a working way to run Python VTK code in parallel, please let me know if you have any suggestions.
[EDIT]
I didn't try to implement threading, mainly because I read the comments on the answer in this thread: How do I parallelize a simple Python loop?
Using multiple threads on CPython won't give you better performance
for pure-Python code due to the global interpreter lock (GIL)
It also seems that threading doesn't require pickling (see http://pymotw.com/2/multiprocessing/basics.html):
Unlike with threading, to pass arguments to a multiprocessing Process
the argument must be able to be serialized using pickle.
If you want to use multiprocessing or pickle anyway, you should use a picklable object as the input to your function. For example, see tvtk (http://docs.enthought.com/mayavi/tvtk/README.html#pickling-tvtk-objects), or use a string as the input of a VTK reader/writer.
For example:
def functionWithPickableInput(inputstring0):
    r0 = vtk.vtkPolyDataReader()
    r0.ReadFromInputStringOn()
    r0.SetInputString(inputstring0)
    r0.Update()
    polydata0 = r0.GetOutput()
    return functionWithVtkInput(polydata0)

# compute the strings to use as input (they are the content of the corresponding vtk file)
vtkstrings = []
w = vtk.vtkPolyDataWriter()
w.WriteToOutputStringOn()
for mesh in meshes:
    w.SetInputData(mesh)
    w.Update()
    vtkstrings.append(w.GetOutputString())
Here I chose to write everything to memory (see the methods in http://www.vtk.org/doc/nightly/html/classvtkDataReader.html#a122da63792e83f8eabc612c2929117c3 and http://www.vtk.org/doc/nightly/html/classvtkDataWriter.html#a8972eec261faddc3e8f68b86a1180c71).
Of course, you will have to call the writer outside the parallel loop, so you will have to judge whether the overhead of the writer is reasonable with respect to the function you want to parallelize. You can also read your polydata from a file if you have RAM problems.
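To tie this back to the question's pathos setup, a minimal usage sketch (assuming functionWithPickableInput, vtkstrings, and CPU_COUNT as defined above):

from pathos.multiprocessing import ProcessingPool

# Each worker receives a plain string, rebuilds the polydata, and processes it.
pool = ProcessingPool(CPU_COUNT)
results = pool.map(functionWithPickableInput, vtkstrings)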
If you are familiar with MPI, have a look at mpi4py (http://www.kitware.com/blog/home/post/716).
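For completeness, a hypothetical mpi4py sketch of the same idea (my addition; it assumes vtkstrings and functionWithPickableInput from above and is launched with mpiexec):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Rank 0 splits the picklable strings into one chunk per rank.
chunks = [vtkstrings[i::comm.size] for i in range(comm.size)] if comm.rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)
my_results = [functionWithPickableInput(s) for s in my_chunk]
all_results = comm.gather(my_results, root=0)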
As part of my project, I have to synchronize 2 videos. Since I am implementing it in Python, I started using GStreamer.
My pipeline looks like this:
filesrc -> decoder -> queuev -> videobox
filesrc-1 -> decoder -> queuev1 -> videobox1
Both of these videoboxes are joined to the mixer like this:
[videobox 1 and 2] -> mixer -> ffmpegcolorspace -> videosink
All of them are in a single pipeline.
But the problem is that when I run the code, I get 174% CPU usage, which I think is not really optimized. Is there any way to reduce this? Even when I simply run 3 videos in parallel pipelines, I only get 14% CPU usage.
I am also uploading part of my code here.
self.pipeline = gst.Pipeline('pipleline')
self.filesrc = gst.element_factory_make("filesrc", "filesrc")
self.filesrc.set_property('location', videoloc1)
self.pipeline.add(self.filesrc)
self.decode = gst.element_factory_make("decodebin2", "decode")
self.pipeline.add(self.decode)
self.queuev = gst.element_factory_make("queue", "queuev")
self.pipeline.add(self.queuev)
self.video = gst.element_factory_make("autovideosink", "video")
self.pipeline.add(self.video)
self.filesrc_2 = gst.element_factory_make("filesrc", "filesrc2")
self.filesrc_2.set_property('location', videoloc2)
self.pipeline.add(self.filesrc_2)
self.decode_2 = gst.element_factory_make("decodebin2", "decode_2")
self.pipeline.add(self.decode_2)
self.queuev_2 = gst.element_factory_make("queue", "queuev_2")
self.pipeline.add(self.queuev_2)
self.mixer = gst.element_factory_make("videomixer2", "mixer")
self.pipeline.add(self.mixer)
self.videobox_1 = gst.element_factory_make("videobox", "videobox_1")
self.pipeline.add(self.videobox_1)
self.videobox_2 = gst.element_factory_make("videobox", "videobox_2")
self.pipeline.add(self.videobox_2)
self.ffmpeg1 = gst.element_factory_make("ffmpegcolorspace", "ffmpeg1")
self.pipeline.add(self.ffmpeg1)
gst.element_link_many(self.filesrc,self.decode)
gst.element_link_many(self.filesrc_2,self.decode_2)
gst.element_link_many(self.queuev,self.videobox_1,self.mixer,self.ffmpeg1,self.video)
gst.element_link_many(self.queuev_2,self.videobox_2,self.mixer)
The videomixer is using the CPU to mix the videos. To know for sure, run a profiler (oprofile, sysprof) to see what code is using the most CPU. Also, you did not say anything about the resolutions and colorspaces involved or the hardware you run this on, so it is hard to say whether it is unexpectedly slow.
Finally, you don't need to mix the videos to sync them; you can just run them in a single pipeline. It is up to your application to e.g. render into separate drawing areas in your window, or whatever.
You can use streamsynchronizer: https://gstreamer.freedesktop.org/data/doc/gstreamer/head/gst-plugins-base-plugins/html/gst-plugins-base-plugins-streamsynchronizer.html