Implement merge_sort with a multiprocessing solution - Python

I tried to write a merge sort using multiprocessing:
from heapq import merge
from multiprocessing import Process

def merge_sort1(m):
    if len(m) < 2:
        return m
    middle = len(m) // 2
    left = Process(target=merge_sort1, args=(m[:middle],))
    left.start()
    right = Process(target=merge_sort1, args=(m[middle:],))
    right.start()
    for p in (left, right):
        p.join()
    result = list(merge(left, right))
    return result
Testing it with an array:
In [47]: arr = list(range(9))
In [48]: random.shuffle(arr)
It reports an error:
In [49]: merge_sort1(arr)
TypeError: 'Process' object is not iterable
What's the problem with my code?

merge(left, right) tries to merge two processes, whereas you presumably want to merge the two lists that resulted from each process. Note that the return value of the function passed to Process is lost; it is a different process, not just a different thread, and you can't very easily shuffle data back to the parent, so Python doesn't do that by default. You need to be explicit and code such a channel yourself. Fortunately, there are multiprocessing datatypes to help you; for example, multiprocessing.Pipe:
from heapq import merge
import random
import multiprocessing

def merge_sort1(m, send_end=None):
    if len(m) < 2:
        result = m
    else:
        middle = len(m) // 2
        inputs = [m[:middle], m[middle:]]
        pipes = [multiprocessing.Pipe(False) for _ in inputs]
        processes = [multiprocessing.Process(target=merge_sort1, args=(input, send_end))
                     for input, (recv_end, send_end) in zip(inputs, pipes)]
        for process in processes: process.start()
        for process in processes: process.join()
        results = [recv_end.recv() for recv_end, send_end in pipes]
        result = list(merge(*results))
    if send_end:
        send_end.send(result)
    else:
        return result

arr = list(range(9))
random.shuffle(arr)
print(merge_sort1(arr))
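If the recursive Pipe wiring feels heavy, here is an alternative, non-recursive sketch of my own (not from the original answer): a multiprocessing.Pool sorts the two halves in worker processes and returns the results to the parent automatically, where heapq.merge combines them. The sorted_half helper is an illustrative name, not code from the question.

from heapq import merge
from multiprocessing import Pool
import random

def sorted_half(part):
    # each worker sorts its half sequentially and returns the sorted list
    return sorted(part)

if __name__ == '__main__':
    arr = list(range(9))
    random.shuffle(arr)
    middle = len(arr) // 2
    with Pool(2) as pool:
        # the pool pickles each worker's return value back to the parent
        halves = pool.map(sorted_half, [arr[:middle], arr[middle:]])
    print(list(merge(*halves)))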

Related

Parallel Processing in Python with nested loop

Due to a performance issue, I would like to run my function in parallel in Python:
import multiprocessing as mp

source_nodes = [10413173, 10414530, 10414530, 10437199]
sink_nodes = [10420346, 10438770, 10438711, 10414530, 10436258]
path = []

def createpath(source, sink):
    for i in source:
        for j in sink:
            path = path + list(nx.all_simple_paths(Directed_G, i, j))
    return path
From my understanding, I must give one iterable to the apply function, but my idea was to do something like:
results = [pool.apply(createpath, args=(source_nodes, sink_nodes))]
and then not give any iterable object to the apply function.
I managed to get it to work, but I don't think it runs in parallel.
Do you think I should include the apply call inside the first loop?
from multiprocessing import Pool

source_nodes = [1, 2, 3, 4, 5, 6]
sink_nodes = [1, 1, 1, 1, 1, 1, 1, 1, 1]

def sum_values(parameter_tuple):
    source, sink, start, stop = parameter_tuple
    out = 0
    for i in range(start, stop):
        val_i = source[i]
        for j in sink:
            out += val_i * j
    return out

if __name__ == "__main__":
    params = (source_nodes, sink_nodes, 0, 6)
    print(sum_values(params))
    with Pool(2) as p:
        print(p.map(sum_values, [
            (source_nodes, sink_nodes, 0, 3),
            (source_nodes, sink_nodes, 3, 6),
        ]))
You can try running this one. It runs in parallel using the map pattern on a pool of two worker processes. In this case your final result is the sum of the results returned by each process in the pool.
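Applying the same map pattern to the original nested-loop path problem might look like the sketch below. The small graph, the node lists, and the paths_for_pair helper are illustrative stand-ins of mine, not code from the question; nx.all_simple_paths is the networkx call the question already uses.

from itertools import product
from multiprocessing import Pool
import networkx as nx

# hypothetical small graph standing in for the question's Directed_G;
# replace the edges and node lists with your real data
Directed_G = nx.DiGraph([(1, 4), (1, 2), (2, 4), (2, 5), (3, 5)])
source_nodes = [1, 2, 3]
sink_nodes = [4, 5]

def paths_for_pair(pair):
    # one worker computes all simple paths for a single (source, sink) pair
    source, sink = pair
    return list(nx.all_simple_paths(Directed_G, source, sink))

if __name__ == "__main__":
    pairs = list(product(source_nodes, sink_nodes))
    with Pool(2) as pool:
        per_pair = pool.map(paths_for_pair, pairs)
    # flatten the per-pair lists into a single list of paths
    path = [p for sub in per_pair for p in sub]
    print(path)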

Optimizing a parallel implementation of a list comprehension

I have a dataframe, where each row contains a list of integers. I also have a reference-list that I use to check what integers in the dataframe appear in this list.
I have made two implementations of this, one single-threaded and one multi-threaded. The single-threaded implementation is quite fast (it takes roughly 0.1 s on my machine), whereas the multi-threaded one takes roughly 5 s.
My question is: Is this due to my implementation being poor, or is this merely a case where the overhead due to multithreading is so large that it doesn't make sense to use multiple threads?
The example is below:
import time
from random import randint
import pandas as pd
import multiprocessing
from functools import partial

class A:
    def __init__(self, N):
        self.ls = [[randint(0, 99) for i in range(20)] for j in range(N)]
        self.ls = pd.DataFrame({'col': self.ls})
        self.lst_nums = [randint(0, 99) for i in range(999)]

    @classmethod
    def helper(cls, lst_nums, col):
        return any([s in lst_nums for s in col])

    def get_idx_method1(self):
        method1 = self.ls['col'].apply(lambda nums: any(x in self.lst_nums for x in nums))
        return method1

    def get_idx_method2(self):
        pool = multiprocessing.Pool(processes=1)
        method2 = pool.map(partial(A.helper, self.lst_nums), self.ls['col'])
        pool.close()
        return method2

if __name__ == "__main__":
    a = A(50000)
    start = time.time()
    m1 = a.get_idx_method1()
    end = time.time()
    print(end - start)
    start = time.time()
    m2 = a.get_idx_method2()
    end = time.time()
    print(end - start)
First of all, multiprocessing is only useful when the cost of communicating data between the main process and the workers is small compared to the time cost of the function itself.
Another thing is that you made an error in your code:
def helper(cls, lst_nums, col):
    return any([s in lst_nums for s in col])
vs.
any(x in self.lst_nums for x in nums)
You have a list comprehension ([]) in the helper method, which forces any() to wait for the entire list to be built, while the second any() works on a generator and stops at the first True value.
In conclusion, if you remove the list brackets from the helper method and perhaps increase the randint range in the lst_nums initializer, you will notice a speed-up when using multiple processes.
self.lst_nums = [randint(0, 10000) for i in range(999)]
and
def helper(cls, lst_nums, col):
    return any(s in lst_nums for s in col)
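A small illustration of that point (my own sketch, separate from the original code): with a generator, any() can stop at the first hit instead of materializing the whole list first.

import timeit

nums = list(range(1_000_000))

def list_form():
    # builds the full million-element list before any() sees a single value
    return any([n > 5 for n in nums])

def gen_form():
    # the generator short-circuits as soon as the first True appears
    return any(n > 5 for n in nums)

print('list form :', timeit.timeit(list_form, number=10))
print('generator :', timeit.timeit(gen_form, number=10))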

Block-wise array writing with Python multiprocessing

I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some work on a huge array and want to speed things up by splitting it into blocks and running my function on those blocks, with each block processed in its own process. The problem is: the blocks are "cut" from one array, and the results must then be written into a new, common array. This is what I have done so far (a minimal working example; don't mind the array shaping, it is necessary for my real-world case):
import time
import numpy as np
import multiprocessing as mp

def calcArray(array, blocksize, n_cores=1):
    in_shape = (array.shape[0] * array.shape[1], array.shape[2])
    input_array = array[:, :, :array.shape[2]].reshape(in_shape)
    result_array = np.zeros(array.shape)
    # blockwise loop
    pix_count = array.size
    for position in range(0, pix_count, blocksize):
        if position + blocksize < array.shape[0] * array.shape[1]:
            num = blocksize
        else:
            num = pix_count - position
        result_part = input_array[position:position + num, :] * 2
        result_array[position:position + num] = result_part
    # finalize result
    final_result = result_array.reshape(array.shape)
    return final_result

if __name__ == '__main__':
    start = time.time()
    img = np.ones((4000, 4000, 4))
    result = calcArray(img, blocksize=100, n_cores=4)
    print('Input:\n', img)
    print('\nOutput:\n', result)
How can I now implement multiprocessing in such a way that I set a number of cores and calcArray then assigns a process to each block until n_cores is reached?
With the much appreciated help of @Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool

def calculate(array):
    return array * 2

if __name__ == '__main__':
    start = time.time()
    CORES = 4
    BLOCKSIZE = 100
    ARRAY = np.ones((4000, 4000, 4))
    pool = Pool(processes=CORES)
    in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
    input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
    result_array = np.zeros(input_array.shape)
    # do it
    pix_count = ARRAY.size
    handles = []
    for position in range(0, pix_count, BLOCKSIZE):
        if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
            num = BLOCKSIZE
        else:
            num = pix_count - position
        ### OLD APPROACH WITH NO PARALLELIZATION ###
        # part = calculate(input_array[position:position + num, :])
        # result_array[position:position + num] = part
        ### NEW APPROACH WITH PARALLELIZATION ###
        handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
        handles.append(handle)
    # finalize result
    ### OLD APPROACH WITH NO PARALLELIZATION ###
    # final_result = result_array.reshape(ARRAY.shape)
    ### NEW APPROACH WITH PARALLELIZATION ###
    final_result = [h.get() for h in handles]
    final_result = np.concatenate(final_result, axis=0)
    print('Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start)))
The code runs and really does start the number of processes I assigned, but it takes much, much longer than the old approach of just running the loop as-is (about 1 minute compared to 3 seconds). Something must be missing here.
The core functions are pool.apply_async and the handle's get().
I have recently been working on the same kind of functions and found it useful to write a standard utility function. balanced_parallel applies the function fn to the matrix a in parallel, silently; assigned_parallel explicitly applies the function to each element of a list.
i. The way I split the array is np.array_split. You could use a block scheme instead.
ii. I use np.concatenate rather than assigning into an empty matrix when collecting the results; there is no shared memory.
import numpy as np
from multiprocessing import cpu_count, Pool

def balanced_parallel(fn, a, processes=None, timeout=None):
    """ apply fn on slices of a, return concatenated result """
    if processes is None:
        processes = cpu_count()
    print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
    results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
    return np.concatenate(results, 0)

def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
    """ apply fn on each element of l, return list of results """
    if processes is None:
        processes = min(cpu_count(), len(l))
    pool = Pool(processes=processes)
    if verbose:
        print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))
    # add jobs to the pool
    handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]
    # pool running, join all results
    results = [handler[i].get(timeout=timeout) for i in range(len(handler))]
    pool.close()
    return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
    if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
        num = BLOCKSIZE
    else:
        num = pix_count - position
    handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
    handles.append(handle)
# multiple handles exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)
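For completeness, a sketch of my own that follows the np.array_split idea above: with BLOCKSIZE = 100 the loop submits hundreds of thousands of tiny tasks, so per-task pickling dominates the trivial multiplication; splitting into one chunk per core keeps the task count equal to the number of workers. Even so, for an operation as cheap as * 2 the data transfer may still outweigh the parallel gain.

import time
import numpy as np
from multiprocessing import Pool, cpu_count

def calculate(array):
    return array * 2

if __name__ == '__main__':
    CORES = cpu_count()
    input_array = np.ones((4000 * 4000, 4))
    start = time.time()
    with Pool(processes=CORES) as pool:
        # one large chunk per worker instead of thousands of tiny blocks
        chunks = np.array_split(input_array, CORES)
        results = pool.map(calculate, chunks)
    final_result = np.concatenate(results, axis=0)
    print('Duration: {:.2f} s'.format(time.time() - start))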

Python multiprocessing with Pool - the main process takes forever

I am trying to understand how multiprocessing works with Python. Here's my test code:
import numpy as np
import multiprocessing
import time

def worker(a):
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i] * j
    return len(a)

arr2 = np.random.rand(10000).tolist()

if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    result = pool.map(worker, [arr1], chunksize=1000000 // (cores - 1))
    print("mp time", time.time() - tmp)
I have 8 cores. It usually ends up with 7 processes using only ~3% of the CPU for about a second, and the last process using ~1/8 of the CPU seemingly forever (it has been running for about 15 minutes).
I understand that interprocess communication usually limits the benefit of parallel programming, but does it usually take this long? What else could cause the last process to take forever?
This thread: Python multiprocessing never joins seems to address a similar issue but it doesn't solve the problem with Pool.
It looks like you want to divide the work into chunks. You can use the range function to partition the data. On Linux, forked processes get a copy-on-write view of the parent's memory, so you can just pass down the indexes you want to work on. On Windows, there is no such luck: you need to pass in each sublist. This program should do it:
import numpy as np
import multiprocessing
import time
import platform

def worker(a):
    if platform.system() == "Linux":
        # on linux we passed in (start, end) indexes into the parent's arr1
        start, length = a
        a = arr1[start:length]
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i] * j
    return len(a)

arr2 = np.random.rand(10000).tolist()

if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    chunk = (len(arr1) + cores - 1) // cores
    # on Windows, pass the sublist; on linux just the indexes and let the
    # worker slice from its view of the parent memory space
    if platform.system() == "Linux":
        seq = [(i, i + chunk) for i in range(0, len(arr1), chunk)]
    else:
        seq = [arr1[i:i + chunk] for i in range(0, len(arr1), chunk)]
    result = pool.map(worker, seq, chunksize=1)
    print("mp time", time.time() - tmp)
Your problem is here:
pool.map automatically iterates over the object you pass it, which is [arr1] in your program. Please notice that the object is [arr1], not arr1; that means the length of the iterable you pass to pool.map is only one, so all of the work ends up in a single task.
I think the simplest solution is to replace [arr1] with arr1.
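To illustrate that point with a minimal sketch of my own (the worker here is hypothetical and just squares its chunk): the number of elements in the iterable handed to pool.map is the number of tasks the pool can distribute.

from multiprocessing import Pool

def worker(chunk):
    # hypothetical stand-in for the question's worker: square every element
    return [x * x for x in chunk]

if __name__ == '__main__':
    data = list(range(8))
    with Pool(4) as pool:
        one_task = pool.map(worker, [data])                            # length-1 iterable -> a single task
        four_tasks = pool.map(worker, [data[i::4] for i in range(4)])  # one task per sublist
    print(one_task)     # [[0, 1, 4, ..., 49]]
    print(four_tasks)   # [[0, 16], [1, 25], [4, 36], [9, 49]]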

Python multiprocessing and shared numpy array

I have a problem, which is similar to this:
import numpy as np

C = np.zeros((100, 10))
for i in range(10):
    C_sub = get_sub_matrix_C(i, other_args)  # shape 10x10
    C[i*10:(i+1)*10, :10] = C_sub
So, apparently there is no need to run this as a serial calculation, since each submatrix can be calculated independently.
I would like to use the multiprocessing module and create up to 4 processes for the for loop.
I read some tutorials about multiprocessing, but wasn't able to figure out how to use this to solve my problem.
Thanks for your help
A simple way to parallelize that code would be to use a Pool of processes:
pool = multiprocessing.Pool()
results = pool.starmap(get_sub_matrix_C, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
    C[i*10:(i+1)*10, :10] = res
I've used starmap since the get_sub_matrix_C function has more than one argument (starmap(f, [(x1, ..., xN)]) calls f(x1, ..., xN)).
Note however that serialization/deserialization may take significant time and space, so you may have to use a more low-level solution to avoid that overhead.
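One such lower-level option, sketched here purely as an illustration rather than a drop-in answer (the dummy get_sub_matrix_C and the init_worker/fill_block helpers are mine): back C with a shared multiprocessing.Array so each worker writes its block in place and only the small (i, other_args) tuples are pickled. The blocks are disjoint, so the workers never write to the same region.

import numpy as np
from multiprocessing import Pool, Array

ROWS, COLS, BLOCKS = 100, 10, 10

def get_sub_matrix_C(i, other_args):
    # placeholder for the real computation; returns a 10x10 block
    return np.full((ROWS // BLOCKS, COLS), i, dtype=np.float64)

def init_worker(shared):
    # give every worker a handle on the shared buffer
    global shared_C
    shared_C = shared

def fill_block(args):
    i, other_args = args
    # view the shared buffer as the 100x10 result matrix and fill one block
    C = np.frombuffer(shared_C.get_obj()).reshape(ROWS, COLS)
    C[i*10:(i+1)*10, :10] = get_sub_matrix_C(i, other_args)

if __name__ == '__main__':
    other_args = None
    shared = Array('d', ROWS * COLS)   # zero-initialized doubles
    with Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        pool.map(fill_block, [(i, other_args) for i in range(BLOCKS)])
    C = np.frombuffer(shared.get_obj()).reshape(ROWS, COLS)
    print(C)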
It looks like you are running an outdated version of Python. You can replace starmap with plain map, but then you have to provide a function that takes a single parameter:
def f(args):
    return get_sub_matrix_C(*args)

pool = multiprocessing.Pool()
results = pool.map(f, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
    C[i*10:(i+1)*10, :10] = res
The following recipe can perhaps do the job. Feel free to ask.
import numpy as np
import multiprocessing

def processParallel():
    def own_process(i, other_args, out_queue):
        C_sub = get_sub_matrix_C(i, other_args)
        out_queue.put(C_sub)

    sub_matrices_list = []
    out_queue = multiprocessing.Queue()
    other_args = 0
    procs = []
    for i in range(10):
        p = multiprocessing.Process(
            target=own_process,
            args=(i, other_args, out_queue))
        procs.append(p)
        p.start()
    for i in range(10):
        # note: results arrive in completion order, not submission order
        sub_matrices_list.append(out_queue.get())
    for p in procs:
        p.join()
    return sub_matrices_list

C = np.zeros((100, 10))
result = processParallel()
for i in range(10):
    C[i*10:(i+1)*10, :10] = result[i]
