I have a function that has to loop through the individual pixels of an image and calculate some geometry. It takes a very long time to run (~5 hours on a 24-megapixel image), but it seems like it should be easy to run in parallel on multiple cores. However, I can't for the life of me find a well-documented, well-explained example of doing something like this using the multiprocessing package. Here is the code I am running right now as a toy example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import misc
from skimage import color
import multiprocessing
from multiprocessing import Process
# Some dumb stand-in function for this exercise
def dumb_func(image):
    ny, nx = image.shape
    temp = np.empty_like(image)
    for y in range(ny):
        for x in range(nx):
            temp[y, x] = np.square(image[y, x])
    return temp
#Convert image to greyscale
img = color.rgb2gray(misc.ascent())
#Resize the image
ns = 2048 #Pixel size
img = misc.imresize(img, size = (ns, ns))
#Split the image into equal chunks...not sure how this works for arrays that
#are weird shapes and aren't the same size in each dimension
divs = 4
init_split = np.array_split(img, divs, axis = 0)
side = init_split[0].shape[0]
chunked = np.empty((divs, divs, side, side))
cur = 0
for i in range(divs):
    split = np.array_split(init_split[i], divs, axis = 1)
    for j in range(divs):
        chunked[i, j, :, :] = split[j]
        cur +=1
#Pull core count and divide by two to be safe
cores = int(multiprocessing.cpu_count() / 2)
result = np.empty_like(chunked)
idxs = np.array(np.meshgrid(np.arange(0, divs, 1),
np.arange(0, divs, 1))).T.reshape(-1, 2)
Basically this code loads in an image, converts it to greyscale, makes it bigger, and then chunks it up. The chunked array is of shape (i, j, ny, nx) where i and j are indices that identify the chunk of the image I am working with, and ny,nx describe the size in pixels of each chunk.
Additionally, I am creating an array called idxs that stores all possible indices into the chunked array to pull the chunked images out.
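As a side note on the comment about oddly shaped arrays: np.array_split accepts sizes that do not divide evenly and simply makes the leading pieces one element longer. The sketch below uses a made-up 5x7 array and a hypothetical blocks dict (neither is from the code above) to show this. It also suggests why a single 4-D chunked array only fits the evenly divisible case, whereas a dict or list of blocks handles both.

```python
import numpy as np

# Hypothetical 5x7 "image" whose sides are not divisible by the split counts.
img = np.arange(35).reshape(5, 7)

# Split into 2 row bands, then each band into 3 column blocks.
blocks = {}
for i, row_band in enumerate(np.array_split(img, 2, axis=0)):
    for j, block in enumerate(np.array_split(row_band, 3, axis=1)):
        blocks[(i, j)] = block

# Blocks are allowed to have different shapes.
shapes = {key: block.shape for key, block in blocks.items()}
print(shapes)

# Reassemble the blocks to confirm that no pixel was lost or duplicated.
rebuilt = np.block([[blocks[(i, j)] for j in range(3)] for i in range(2)])
print(np.array_equal(rebuilt, img))
```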
What I want to do is run a function (dumb_func here, as an example) over the chunks in parallel and store the results in the result array, which has the same shape. The way I imagined doing it was to loop over the idxs array and assign chunks to processes, up to the number of cores, wait for those processes to finish, then hand out more chunks until everything is done. I got stuck because I couldn't figure out A) how to access the return value of the function, and B) how to handle a situation where I might have 16 chunks and 5 cores, so that the last round needs only a single process.
How can I go about doing this? I've spent the last 6-7 hours reading about Multiprocessing Pool, Process, Map, Starmap, etc... and can't for the life of me understand how to implement this.
Edit for Reedinationer:
This is my updated code and runs without error. However the new_data array is never updated. I filled it with a value of 100 and at the end of the routine new_data is exactly how it was initialized.
import numpy as np
import matplotlib.pyplot as plt
from scipy import misc
from multiprocessing import Process, JoinableQueue
from time import time
# Some dumb stand-in function for this exercise
def dumb_func(q, new_data):
    while True:
        index, image = q.get()
        temp = image **2
        new_data[index[0], index[1], :, :] = temp
        q.task_done()
if __name__ == "__main__":
    start = time()
    q = JoinableQueue()
    img = misc.ascent()
    #Resize the image
    ns = 2048 #Pixel size
    img = misc.imresize(img, size = (ns, ns))
    #Split the image into equal chunks...not sure how this works for arrays that
    #are weird shapes and aren't the same size in each dimension
    divs = 4
    init_split = np.array_split(img, divs, axis = 0)
    side = init_split[0].shape[0]
    chunked = np.empty((divs, divs, side, side))
    cur = 0
    for i in range(divs):
        split = np.array_split(init_split[i], divs, axis = 1)
        for j in range(divs):
            chunked[i, j, :, :] = split[j]
            cur +=1
    new_data = np.full(chunked.shape, 100)
    idxs = np.array(np.meshgrid(np.arange(0, divs, 1),
                                np.arange(0, divs, 1))).T.reshape(-1, 2)
    for i in range(len(idxs)):
        q.put((idxs[i], chunked[idxs[i][0], idxs[i][1], :, :]))
    print ('starting workers')
    worker_count = len(idxs)
    processes = []
    for i in range(worker_count):
        p = Process(target=dumb_func, args=[q, new_data])
        p.daemon = True
        p.start()
    print('main thread waiting')
    q.join()
    end = time()
    print('{:.3f} seconds elapsed'.format(end - start))
I'd do something like this, starting with dependencies:
from multiprocessing import Pool
import numpy as np
from PIL import Image
# and some for testing
from random import random
from time import sleep
First I define a function to divide an image up into "chunks", much as you described:
def chunkit(ys, xs, blocksize=64):
    for y in range(0, ys, blocksize):
        yt = (y, min(ys, y + blocksize))
        for x in range(0, xs, blocksize):
            xt = (x, min(xs, x + blocksize))
            yield yt, xt
It's a lazy iterator, so it yields chunk coordinates one at a time instead of building the full list up front.
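To make the edge handling concrete, here is a small sketch that runs the generator on a made-up 5x3 "image" with 2-pixel blocks (the sizes are chosen only for illustration). The last row and column of blocks are clipped to the image edge rather than running past it.

```python
def chunkit(ys, xs, blocksize=64):
    # Same generator as above, repeated so the example runs on its own.
    for y in range(0, ys, blocksize):
        yt = (y, min(ys, y + blocksize))
        for x in range(0, xs, blocksize):
            xt = (x, min(xs, x + blocksize))
            yield yt, xt

# Materialise the blocks only to inspect them; normally you would just iterate.
blocks = list(chunkit(5, 3, blocksize=2))
for block in blocks:
    print(block)
# ((0, 2), (0, 2))
# ((0, 2), (2, 3))
# ((2, 4), (0, 2))
# ((2, 4), (2, 3))
# ((4, 5), (0, 2))
# ((4, 5), (2, 3))
```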
I then define my worker function:
def dumb_func(cc):
    (y0,y1), (x0,x1) = cc
    # convert to floats for ease of processing
    chunk = image[y0:y1,x0:x1] / 255.
    # random slow down for testing
    # sleep(random() ** 6)
    res = chunk ** 2
    # convert back to bytes for efficiency
    return cc, (res * 255).astype(np.uint8)
I keep the source array as close to its original format as possible for efficiency, and send the result back in the same format (this might take some fiddling if you're dealing with other pixel formats, obviously).
Then I put it together:
if __name__ == '__main__':
    source = Image.open('tmp.jpeg')
    image = np.asarray(source)
    print("loaded", image.shape, image.dtype)
    with Pool() as pool:
        resit = pool.imap_unordered(
            dumb_func, chunkit(*image.shape[:2]))
        output = np.empty_like(image)
        for cc, res in resit:
            (y0,y1), (x0,x1) = cc
            output[y0:y1,x0:x1] = res
    im = Image.fromarray(output, 'RGB')
    im.save('out.jpeg')
This churns through a 15-megapixel image in a couple of seconds, with most of that spent loading and saving the image. It could probably be a lot more clever about array strides and cache friendliness, but I hope that helps!
Note: I think this code relies on CPython's Unix-style process forking semantics to make sure the image is shared between processes efficiently. I'm not sure what would happen if you ran it on something else.
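If forking is not available, one possible variant is to send each chunk's pixels along with its bounds, so the workers do not depend on a global image. The sketch below is an assumption-laden rewrite, not the code above: the names chunk_bounds, square_chunk and process_image are made up for illustration, and the squaring is the same stand-in computation. The trade-off is that every chunk is pickled and copied to a worker, which costs more than inheriting the image through fork.

```python
from multiprocessing import Pool

import numpy as np


def chunk_bounds(ys, xs, blocksize=64):
    """Yield ((y0, y1), (x0, x1)) bounds that tile a ys-by-xs image."""
    for y in range(0, ys, blocksize):
        for x in range(0, xs, blocksize):
            yield (y, min(ys, y + blocksize)), (x, min(xs, x + blocksize))


def square_chunk(job):
    """Worker: receives its bounds and pixel data, returns bounds and result."""
    bounds, chunk = job
    res = (chunk / 255.0) ** 2
    return bounds, (res * 255).astype(np.uint8)


def process_image(image, blocksize=64, workers=None):
    """Run square_chunk over every block of a uint8 image in a process pool."""
    jobs = (
        ((yb, xb), image[yb[0]:yb[1], xb[0]:xb[1]])
        for yb, xb in chunk_bounds(*image.shape[:2], blocksize)
    )
    output = np.empty_like(image)
    with Pool(workers) as pool:
        # Results arrive in completion order; the bounds say where each goes.
        for ((y0, y1), (x0, x1)), res in pool.imap_unordered(square_chunk, jobs):
            output[y0:y1, x0:x1] = res
    return output
```

process_image would be called from inside the if __name__ == '__main__': block, for example on np.asarray(Image.open('tmp.jpeg')). The pool hands out chunks as workers become free, so it does not matter whether the number of chunks is a multiple of the number of cores.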
I've been working on code for basically this same thing. Right now the goal is just to replace white pixels with transparent ones, but it seems to replace the entire image so there is a bug somewhere...It doesn't get an error within the multiprocessing module anymore though, so maybe it could serve as an example of how to load a Queue and then have your worker processes work on it!
from PIL import Image
from multiprocessing import Process, JoinableQueue
from threading import Thread
from time import time
def worker_function(q, new_data):
    while True:
        # print("Items in queue: {}".format(q.qsize()))
        index, pixel = q.get()
        if pixel[0] > 240 and pixel[1] > 240 and pixel[2] > 240:
            out_pixel = (0, 0, 0, 0)
        else:
            out_pixel = pixel
        new_data[index] = out_pixel
        q.task_done()
if __name__ == "__main__":
    start = time()
    q = JoinableQueue()
    my_image = Image.open('InputImage.jpg')
    my_image = my_image.convert('RGBA')
    datas = list(my_image.getdata())
    new_data = [0] * len(datas) # make a blank array the size of our image to fill later
    print('putting image into queue')
    for count, item in enumerate(datas):
        q.put((count, item))
    print('starting workers')
    worker_count = 50
    processes = []
    for i in range(worker_count):
        p = Process(target=worker_function, args=[q, new_data])
        p.daemon = True
        p.start()
    print('main thread waiting')
    q.join()
    my_image.putdata(new_data)
    my_image.save('output.png', "PNG")
    end = time()
    print('{:.3f} seconds elapsed'.format(end - start))
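I think it's important to "protect" your code inside the if __name__ == "__main__" block; otherwise the spawned processes seem to run it too, since on platforms that spawn rather than fork (such as Windows) each child re-imports the main module.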
update
It looks like you need to implement a Manager() (or there are probably other ways I am ignorant of as well!). I got my code to run by altering it into:
from PIL import Image
from multiprocessing import Process, JoinableQueue, Manager
from threading import Thread
from time import time
def worker_function(q, new_data):
    while True:
        # print("Items in queue: {}".format(q.qsize()))
        index, pixel = q.get()
        if pixel[0] > 240 and pixel[1] > 240 and pixel[2] > 240:
            out_pixel = (0, 0, 0, 0)
        else:
            out_pixel = pixel
        new_data[index] = out_pixel
        q.task_done()
if __name__ == "__main__":
    start = time()
    q = JoinableQueue()
    my_image = Image.open('InputImage.jpg')
    my_image = my_image.convert('RGBA')
    datas = list(my_image.getdata())
    # new_data = [(0, 0, 0, 0)]*len(datas)
    manager = Manager()
    new_data = manager.list([(0, 0, 0, 0)]*len(datas))
    print(new_data)
    print('putting image into queue')
    for count, item in enumerate(datas):
        q.put((count, item))
    print('starting workers')
    worker_count = 50
    processes = []
    for i in range(worker_count):
        p = Process(target=worker_function, args=[q, new_data])
        p.daemon = True
        p.start()
    print('main thread waiting')
    q.join()
    print("Saving Image")
    my_image.putdata(new_data)
    my_image.save('output.png', "PNG")
    end = time()
    print('{:.3f} seconds elapsed'.format(end - start))
Although this doesn't seem like the fastest option! I'm sure there are other ways to increase speed. My code to do the same thing with Threads looks VERY similar:
from PIL import Image
from threading import Thread
from queue import Queue
import time
start = time.time()
q = Queue()
planeIm = Image.open('InputImage.jpg')
planeIm = planeIm.convert('RGBA')
datas = planeIm.getdata()
new_data = [0] * len(datas)
print('putting image into queue')
for count, item in enumerate(datas):
q.put((count, item))
def worker_function():
    while True:
        # print("Items in queue: {}".format(q.qsize()))
        index, pixel = q.get()
        if pixel[0] > 240 and pixel[1] > 240 and pixel[2] > 240:
            out_pixel = (0, 0, 0, 0)
        else:
            out_pixel = pixel
        new_data[index] = out_pixel
        q.task_done()
print('starting workers')
worker_count = 100
for i in range(worker_count):
    t = Thread(target=worker_function)
    t.daemon = True
    t.start()
print('main thread waiting')
q.join()
print('Queue has been joined')
planeIm.putdata(new_data)
planeIm.save('output.png', "PNG")
end = time.time()
elapsed = end - start
print('{:3.3} seconds elapsed'.format(elapsed))
Yet processing my image takes ~23 seconds with threads and ~170 seconds with multiprocessing! I suspect this comes from the larger overhead of starting Process objects, and from the fact that my per-pixel algorithm is simple for now (just the if pixel[0] > 240 and pixel[1] > 240 and pixel[2] > 240: check), so I'm likely not seeing the speedup that a more complex pixel-processing algorithm would get. Also note that the multiprocessing documentation says of managers:
a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.
Which leads me to believe that there are alternatives that are faster.
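One such alternative is the standard library's multiprocessing.shared_memory module (Python 3.8 and later), which lets several processes view the same block of bytes as a NumPy array. The sketch below is only an illustration under assumptions: the helper names create_shared_array and attach_shared_array are made up, and for simplicity the "worker" attaches from within the same process; a real worker would receive the block's name, shape and dtype as arguments and attach in the same way.

```python
import numpy as np
from multiprocessing import shared_memory


def create_shared_array(shape, dtype):
    """Allocate a shared block and return (handle, NumPy view onto it)."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


def attach_shared_array(name, shape, dtype):
    """Attach to an existing block by name (what a worker would do)."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


def demo():
    shape, dtype = (4, 4), np.float64
    owner, out = create_shared_array(shape, dtype)
    out[:] = 100.0

    # Stand-in for a worker process: attach by name and write one row.
    worker_shm, worker_view = attach_shared_array(owner.name, shape, dtype)
    worker_view[1, :] = 7.0

    result = out.copy()  # the owner sees the write without any copying back

    # Views must be dropped before the handles can be closed.
    del worker_view, out
    worker_shm.close()
    owner.close()
    owner.unlink()  # free the block once nobody needs it
    return result


result = demo()
print(result[1])
```

Unlike a Manager list, writes here go straight to memory with no pickling or proxy calls, at the cost of having to manage the block's lifetime by hand.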
Related
The following Python code measures the speedup obtained by increasing the number of processes. Each task just multiplies a random matrix repeatedly; the matrix size is also varied, and the corresponding elapsed time is measured.
Note that the processes do not share any objects and are completely independent. So I expected the speedup curve as a function of the number of processes to be almost the same for all matrix sizes. However, when plotting the results (see below), I found that this expectation is false. Specifically, when the matrix size becomes large (80, 160), performance hardly improves as the number of processes increases. Note: the figure legend indicates the matrix sizes.
Could you explain why performance does not improve when the matrix size is large?
For your information, here is the spec of my CPU:
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
Product Family: AMD Ryzen™ Processors
Product Line: AMD Ryzen™ 9 Desktop Processors
# of CPU Cores: 12
# of Threads: 24
Max. Boost Clock: Up to 4.6GHz
Base Clock: 3.8GHz
L1 Cache: 768KB
L2 Cache: 6MB
L3 Cache: 64MB
main script
import numpy as np
import pickle
from dataclasses import dataclass
import time
import multiprocessing
import os
import subprocess
def split_number(n_total, n_split):
    return [n_total // n_split + (1 if x < n_total % n_split else 0) for x in range(n_split)]
def task(args):
    n_iter, idx, matrix_size = args
    #cores = "{},{}".format(2 * idx, 2 * idx+1)
    #os.system("taskset -p -c {} {}".format(cores, os.getpid()))
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)
def measure_time(n_process: int, matrix_size: int) -> float:
    n_total = 100
    assigne_list = split_number(n_total, n_process)
    pool = multiprocessing.Pool(n_process)
    ts = time.time()
    pool.map(task, zip(assigne_list, range(n_process), [matrix_size] * n_process))
    elapsed = time.time() - ts
    return elapsed
if __name__ == "__main__":
    n_experiment_sample = 5
    n_logical = os.cpu_count()
    n_physical = int(0.5 * n_logical)
    result = {}
    for mat_size in [5, 10, 20, 40, 80, 160]:
        subresult = {}
        result[mat_size] = subresult
        for n_process in range(1, n_physical + 1):
            elapsed = np.mean([measure_time(n_process, mat_size) for _ in range(n_experiment_sample)])
            subresult[n_process] = elapsed
            print("{}, {}, {}".format(mat_size, n_process, elapsed))
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
plot script
import numpy as np
import matplotlib.pyplot as plt
import pickle
with open("result.pkl", "rb") as f:
    result = pickle.load(f)
fig, ax = plt.subplots()
for matrix_size in result.keys():
    subresult = result[matrix_size]
    n_process_list = list(subresult.keys())
    elapsed_time_list = np.array(list(subresult.values()))
    speedups = elapsed_time_list[0] / elapsed_time_list
    ax.plot(n_process_list, speedups, label=matrix_size)
ax.set_xlabel("number of process")
ax.set_ylabel("speed up compared to single process")
ax.legend(loc="upper left", borderaxespad=0, fontsize=10, framealpha=1.0)
plt.show()
I am having trouble using multiprocessing and threading to speed up my program.
My program takes a list of points from an Excel file and creates a grayscale image from these points.
The problem is that I have a million points and it takes around 1 minute to process. I am sure there is a way to speed up the processing.
Here is the code without threading:
import os
import math
import json
import time
import pandas as pd
from PIL import Image
# FUNCTIONS
def CreateDataFrame(path, columns):
    print('DataFrame creation ... ', end='')
    with open(path) as excel_file:
        lines = excel_file.read().splitlines()
    np_array = []
    for line in lines:
        np_array.append(list(map(float, line.split(' '))))
    print('done')
    return pd.DataFrame(np_array, columns=columns)
def GetOffsets(df):
    print('Getting offsets ... ', end='')
    dict = {}
    for c in df.columns:
        dict[c] = min(df[c])
    print('done')
    return dict
def GetMaximums(df, offsets):
    print('Getting Maximums ... ', end='')
    max_dict = {}
    for c in df.columns:
        max_dict[c] = max(df[c]) - offsets[c]
    print('done')
    return max_dict
def CreateImage(maximums, scale = 1):
    return Image.new('RGB', (int(maximums['x'] * scale) + 1, int(maximums['z'] * scale) + 1), color='black')
# MAIN
columns = ['x', 'z', 'y']
scale = 1
df = CreateDataFrame('raw data.csv', columns)
offsets = GetOffsets(df)
maximums = GetMaximums(df, offsets)
img = CreateImage(maximums, scale)
pixels = img.load()
print('Printing ... ', end='')
for i in range(len(df)):
    line = df.iloc[i]
    color = int(255 * (line['y'] - offsets['y']) / maximums['y'])
    pixels[int((maximums['x'] - (line['x'] - offsets['x'])) * scale), int((line['z'] - offsets['z']) * scale)] = (color, color, color)
print('done')
img.save(f'terrain {scale}.png')
If you are interested, this is how it works. First, I create a dataframe from the Excel file and assign column names. Then I take the minimum of each column to get my offset values. Once that is done, I do the same thing with the maximums. With those maximums, I can create an image sized to the maximum x and z values. Finally, I iterate over the dataframe, using x and z for the pixel position and converting the y value to a gray level.
Now I am trying to multiprocess/thread it. To do that, I added this code:
import concurrent.futures
def Process(id, start, end):
    start_time = time.time()
    for i in range(start, end):
        line = df.iloc[i]
        color = int(255 * (line['y'] - offsets['y']) / maximums['y'])
        pixels[int((maximums['x'] - (line['x'] - offsets['x'])) * scale), int((line['z'] - offsets['z']) * scale)] = (color, color, color)
    print(f"Thread {id} ends: {time.time() - start_time}s")
nb_thread = 12
df_size = len(df)
nb_full_thread = df_size // nb_thread
thread_rest = df_size % nb_thread
with concurrent.futures.ThreadPoolExecutor(max_workers=nb_thread) as executor:
    for i in range(nb_thread):
        executor.submit(Process, i, i*nb_full_thread, (i+1) * nb_full_thread - 1)
        print(f"Process {i} launched")
img.save(f'threading terrain {scale}.png')
Some explanation: nb_thread is the number of threads I want to create. Then I get the number of rows in my dataframe (df_size), which is used to determine how many rows each thread will handle. Once that is done, I create the threads so that they fill in the image, and then I save it.
With ThreadPoolExecutor, the program works but takes the same amount of time as the previous version. With ProcessPoolExecutor, the for loop in the Process function does not work as expected: it stops at the first value of the range.
I don't understand these behaviours, which is why I am turning to you.
I hope this is clear enough; do not hesitate to ask if you have a question.
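On the row-splitting step described above, a small sketch may help. It assumes a hypothetical helper split_ranges (not part of the code above) that hands out half-open (start, end) ranges. Because range(start, end) already excludes end, no "- 1" is needed at the boundaries, and the leftover rows (thread_rest above) are spread over the first few workers instead of being dropped. This only addresses coverage of the rows, not the speed question.

```python
def split_ranges(n_items, n_workers):
    """Return (start, end) pairs, end exclusive, that cover range(n_items)."""
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one of the leftover items.
        end = start + base + (1 if worker < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```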
How do I build a performant video stream buffer that I can do numpy array operations on?
This is my implementation currently - I just shift the previous array forward 1 frame and assign the last element to the current frame.
import numpy as np
import cv2
import time
cap = cv2.VideoCapture(0)
status, frame = cap.read()
buffer = np.empty([100, frame.shape[0], frame.shape[1], frame.shape[2]])
i=0
total = 100
while i < total:
    if not i:
        start = time.time()
    status, frame = cap.read()
    t = time.time()
    if i < total/2:
        buffer[i] = frame
    else:
        buffer[:-1] = buffer[1:]
        buffer[-1] = frame
    if i == total/2:
        middle = t
    i += 1
    # Calculations on the buffer omitted for brevity but include mean, std, etc.
stop = time.time()
print((middle-start)/(total/2))
print((stop-middle)/(total/2))
It takes about 350X longer to shift the array as opposed to simply assigning the values of a frame to an element of the array. I know this is because I am shifting all the pointers in the array which is unnecessary and expensive. Keeping the frames in order is nice but not necessary.
One surprisingly simple way to make a minor improvement to this is to use a Python List for the actual shifting/appending, then re-instantiate the buffer as a new NumPy array, like so:
import numpy as np
import cv2
import itertools
import time
cap = cv2.VideoCapture(0)
status, frame = cap.read()
buffer = np.empty([100, frame.shape[0], frame.shape[1], frame.shape[2]])
i=0
total = 100
middle = 0
while i < total:
    if not i:
        start = time.time()
    status, frame = cap.read()
    t = time.time()
    if i < total/2:
        buffer[i] = frame
    else:
        list_buffer = [item for item in buffer[1:]]
        list_buffer.append(frame)
        buffer = np.asanyarray(list_buffer)
    if i == total/2:
        middle = t
    i += 1
    # Calculations on the buffer omitted for brevity but include mean, std, etc.
stop = time.time()
print((middle-start)/(total/2))
print((stop-middle)/(total/2))
On my machine that takes the second time total from 1.7 seconds down to about 1.36 seconds. Not a huge improvement, but not insignificant either (~20% speedup).
However, if we instead use the list_buffer in the whole loop to keep track of the contents of the buffer and simply do both our slicing and appending on that:
import numpy as np
import cv2
import itertools
import time
cap = cv2.VideoCapture(0)
status, frame = cap.read()
buffer = np.empty([100, frame.shape[0], frame.shape[1], frame.shape[2]])
i=0
total = 100
middle = 0
list_buffer = []
while i < total:
    if not i:
        start = time.time()
    status, frame = cap.read()
    t = time.time()
    if i < total/2:
        buffer[i] = frame
        list_buffer.append(frame)
    else:
        list_buffer = list_buffer[1:]
        list_buffer.append(frame)
        buffer = np.asanyarray(list_buffer)
    if i == total/2:
        middle = t
    i += 1
    # Calculations on the buffer omitted for brevity but include mean, std, etc.
stop = time.time()
print((middle-start)/(total/2))
print((stop-middle)/(total/2))
suddenly our output looks like this:
>>> 0.08505516052246094
>>> 0.08459827899932862
Hope that helps!
Reformatting the list as a numpy array costs very little. The deque/linked list was not really any more efficient with 100 frames in the buffer.
import numpy as np
import cv2
import time
import collections
cap = cv2.VideoCapture(0)
i=0
buff_len = 100
# buffer = [] #Standard list
# buffer = collections.deque() #linked list
status, frame = cap.read() #numpy array - replaces the first frame once it reaches the last frame
buffer = np.empty([buff_len, frame.shape[0], frame.shape[1], frame.shape[2]])
times_through = 3
start = time.time()
while i < times_through*buff_len:
    t = time.time()
    status, frame = cap.read()
    # buffer.append(frame) #list and linked list
    buffer[i%(buff_len)] = frame #numpy array
    # if i >= buff_len: #list and linked list
    #     buffer.pop(0) #list
    #     buffer.popleft() #linked list
    if i == buff_len:
        full = t
    i += 1
    # np.int was removed in NumPy 1.24; the builtin int is what it aliased
    print(i, np.mean(buffer, dtype=int), int((time.time()-t)*100)/100.)
stop = time.time()
print((full-start)/(buff_len))
print((stop-full)/(buff_len*(times_through-1)))
print(len(buffer))
Results in seconds/frame:
# list
# 0.19624330043792726
# 0.3691681241989136
# linked list
# 0.19301403045654297
# 0.3468885350227356
# numpy Array
# 0.316677029132843
# 0.30973124504089355
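The buffer[i%(buff_len)] = frame line in the snippet above is a circular (ring) buffer: the newest frame overwrites the oldest slot, so nothing is ever shifted. A minimal sketch of the same idea as a small class is below. The FrameRing name and its methods are made up for illustration, and it assumes uint8 frames of a fixed shape. Order-independent statistics (mean, std, max) can be taken on the storage directly; an oldest-first copy is only built when asked for.

```python
import numpy as np


class FrameRing:
    """Fixed-size circular buffer of equally shaped frames."""

    def __init__(self, capacity, frame_shape, dtype=np.uint8):
        self.data = np.empty((capacity, *frame_shape), dtype=dtype)
        self.capacity = capacity
        self.count = 0  # total number of frames pushed so far

    def push(self, frame):
        # Overwrite the oldest slot; no other frame is moved.
        self.data[self.count % self.capacity] = frame
        self.count += 1

    def filled(self):
        """Valid frames in storage order (a view, no copy)."""
        return self.data[:min(self.count, self.capacity)]

    def ordered(self):
        """Valid frames oldest-first (this makes a copy)."""
        if self.count <= self.capacity:
            return self.data[:self.count].copy()
        return np.roll(self.data, -(self.count % self.capacity), axis=0)


ring = FrameRing(capacity=3, frame_shape=(2, 2))
for value in range(5):
    ring.push(np.full((2, 2), value, dtype=np.uint8))

print(ring.ordered()[:, 0, 0])  # [2 3 4]
print(ring.filled().mean())     # 3.0
```

Storing frames in their native dtype rather than the float64 default of np.empty also cuts the buffer's memory use, which may matter as much as the shifting cost.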
What would be the fastest and most memory-efficient way to get the average over many frames of a 16-bit TIFF image as a numpy array?
What I have come up with so far is the code below. To my surprise, method2 was faster than method1.
But when profiling, never assume: test it! So I want to test more.
Is Wand worth trying? I did not include it here because after installing ImageMagick-6.8.9-Q16 and setting the MAGICK_HOME env var it still does not import... Is there any other library for multipage TIFF in Python? GDAL may be a little too much for this.
(edit) I included libtiff. method2 is still the fastest and quite memory efficient.
from time import time
#import cv2 ## no multi page tiff support
import numpy as np
from PIL import Image
#from scipy.misc import imread ## no multi page tiff support
import tifffile # http://www.lfd.uci.edu/~gohlke/code/tifffile.py.html
from libtiff import TIFF # https://code.google.com/p/pylibtiff/
fp = r"path/2/1000frames-timelapse-image.tif"
def method1(fp):
    '''
    using tifffile.py by Christoph (Version: 2014.02.05)
    (http://www.lfd.uci.edu/~gohlke/code/tifffile.py.html)
    '''
    with tifffile.TIFFfile(fp) as imfile:
        return imfile.asarray().mean(axis=0)
def method2(fp):
    'primitive peak memory friendly way with tifffile.py'
    with tifffile.TIFFfile(fp) as imfile:
        nframe, h, w = imfile.series[0]['shape']
        temp = np.zeros( (h,w), dtype=np.float64 )
        for n in range(nframe):
            curframe = imfile.asarray(n)
            temp += curframe
    return (temp / nframe)
def method3(fp):
    ' like method2 but using pillow 2.3.0 '
    im = Image.open(fp)
    w, h = im.size
    temp = np.zeros( (h,w), dtype=np.float64 )
    n = 0
    while True:
        curframe = np.array(im.getdata()).reshape(h,w)
        temp += curframe
        n += 1
        try:
            im.seek(n)
        except:
            break
    return (temp / n)
def method4(fp):
    '''
    https://code.google.com/p/pylibtiff/
    documentation seems outdated.
    '''
    tif = TIFF.open(fp)
    header = tif.info()
    meta = dict() # extracting meta
    for l in header.splitlines():
        if l:
            if l.find(':')>0:
                parts = l.split(':')
                key = parts[0]
                value = ':'.join(parts[1:])
            elif l.find('=')>0:
                key, value =l.split('=')
            meta[key] = value
    nframes = int(meta['frames'])
    h = int(meta['ImageLength'])
    w = int(meta['ImageWidth'])
    temp = np.zeros( (h,w), dtype=np.float64 )
    for frame in tif.iter_images():
        temp += frame
    return (temp / nframes)
t0 = time()
avgimg1 = method1(fp)
print(time() - t0)
# 1.17-1.33 s
t0 = time()
avgimg2 = method2(fp)
print(time() - t0)
# 0.90-1.53 s usually faster than method1 by 20%
t0 = time()
avgimg3 = method3(fp)
print(time() - t0)
# 21 s
t0 = time()
avgimg4 = method4(fp)
print(time() - t0)
# 1.96 - 2.21 s # may not be accurate. I got warning for every frame with the tiff file I tested.
np.testing.assert_allclose(avgimg1, avgimg2)
np.testing.assert_allclose(avgimg1, avgimg3)
np.testing.assert_allclose(avgimg1, avgimg4)
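The frame-by-frame accumulation that method2 relies on can be sketched independently of any TIFF library. In the example below, streaming_mean is a made-up name and the random uint16 stack merely stands in for frames read from a file. Only one frame plus one float64 accumulator is held at a time, and accumulating in float64 avoids the overflow that summing uint16 values in their own dtype would cause.

```python
import numpy as np


def streaming_mean(frames):
    """Average an iterable of equally shaped frames, one frame at a time."""
    total = None
    count = 0
    for frame in frames:
        if total is None:
            # float64 accumulator: 16-bit sums would overflow in uint16
            total = np.zeros(frame.shape, dtype=np.float64)
        total += frame
        count += 1
    if count == 0:
        raise ValueError("no frames given")
    return total / count


# Synthetic stand-in for a multipage 16-bit TIFF: 10 frames of 4x5 pixels.
rng = np.random.default_rng(0)
stack = rng.integers(0, 2**16, size=(10, 4, 5), dtype=np.uint16)

avg = streaming_mean(iter(stack))
print(np.allclose(avg, stack.mean(axis=0)))
```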
Simple logic would make me bet my money on method 1 or 3, since methods 2 and 4 have for-loops in them. For-loops always make your code go slower as the input grows.
I would definitely go for method 1: neat, clear to read...
To be really sure, just test them I would say. If you don't feel like testing, I would go for method one.
Kind regards,
I have a stack of bitmap images (between 2000 and 4000) on which I'm doing a z-projection (maximum intensity projection). So from the stack, I need to get a 2D array of maximum values for each x,y position.
I have devised a simple script that splits the files into chunks and uses multiprocessing.Pool to calculate the maximum array for each chunk. These arrays are then compared to find the maximum for the whole stack.
It works, but it is slow. My system monitor shows that my CPUs are hardly working.
Can anyone give me some pointers on how I might speed things up a bit?
import Image
import os
import numpy as np
import multiprocessing
import sys
#Get the stack of images
files = []
for fn in os.listdir(sys.argv[1]):
    if fn.endswith('.bmp'):
        files.append(os.path.join(sys.argv[1], fn))
def processChunk(filelist):
    first = True
    max_ = None
    for img in filelist:
        im = Image.open(img)
        array = np.array(im)
        if first:
            max_ = array
            first = False
        max_ = np.maximum(array, max_)
    return max_
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)
    #Chop list into chunks
    file_chunks = []
    chunk_size = 100
    ranges = range(0, len(files), chunk_size)
    for chunk_idx in ranges:
        file_chunks.append(files[chunk_idx:chunk_idx+chunk_size])
    #find the maximum x,y vals in chunks of 100
    first = True
    maxi = None
    max_arrays = pool.map(processChunk, file_chunks )
    #Find the maximums from the maximums returned from each process
    for array in max_arrays:
        if first:
            maxi = array
            first = False
        maxi = np.maximum(array, maxi)
    img = Image.fromarray(maxi)
    img.save("max_intensity.tif")
Edit:
Did some small benchmarking with sample data and you're right. Also, turns out (reading your code more closely), most of my original post is wrong. You are essentially doing the same number of iterations (slightly more, but not 3x more). I also found out that
x = np.maximum(x, y)
is slightly faster than both
x[y > x] = y[y > x]
#or
ind = y > x
x[ind] = y[ind]
I would then alter your code only slightly. Something like:
import numpy as np
from multiprocessing import Pool
from PIL import Image  # assumed import; the question used the older "import Image"
def process(chunk):
    max_ = np.zeros((4000, 4000))
    for im in chunk:
        im_array = np.array(Image.open(im))
        max_ = np.maximum(max_, im_array)
    return max_
if __name__ == "__main__":
    p = Pool(8)
    chunksize = 500 #4000/8 = 500, might have less overhead
    chunks = [files[i:i+chunksize]
              for i in range(0, len(files), chunksize)]
    # this returns an array of (len(files)/chunksize, 4000, 4000)
    max_arrays = np.array(p.map(process, chunks))
    maxi = np.amax(max_arrays, axis=0) #finds maximum along first axis
    img = Image.fromarray(maxi) #should be of shape (4000, 4000)
I think this is one of the fastest ways you can do this, although I have an idea for a tree or tournament style algorithm, possibly a recursive one too. Good job.
How big are the images? Small enough to load two images into memory at once? If so, then can you do something like:
maxi = np.zeros(image_shape) # something like (1024, 1024)
for im in files:
    im_array = np.array(Image.open(im))
    inds = im_array > maxi # find where image intensity > max intensity
    maxi[inds] = im_array[inds] # update the maximum value at each pixel
max_im = Image.fromarray(maxi)
max_im.save("max_intensity.tif")
After all iterations, the maxi array will contain the maximum intensity for each (x, y) coordinate. No need to break it into chunks. Also, there's only one for loop, not 3, so it will be faster and may not need multiprocessing.
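The running-maximum idea can be sketched without any image files. In the example below, max_projection is a made-up name and the random uint8 stack stands in for the loaded bitmaps. It starts from the first frame rather than np.zeros, which keeps the result in the frames' own dtype, and it updates in place with np.maximum(..., out=...). The last lines show that partial maxima from separate chunks (as a pool of workers would return) combine into the same answer.

```python
import numpy as np


def max_projection(frames):
    """Per-pixel maximum over an iterable of equally shaped frames."""
    result = None
    for frame in frames:
        if result is None:
            result = np.array(frame, copy=True)  # start from the first frame
        else:
            np.maximum(result, frame, out=result)  # in-place update
    return result


# Synthetic stand-in for a stack of bitmaps: 6 frames of 3x4 pixels.
rng = np.random.default_rng(1)
stack = rng.integers(0, 256, size=(6, 3, 4), dtype=np.uint8)

whole = max_projection(stack)

# Two chunks processed separately, then combined.
partials = [max_projection(stack[:3]), max_projection(stack[3:])]
combined = np.maximum.reduce(partials)

print(np.array_equal(whole, combined))
```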