Parallel Processing in Python with nested loop


Due to a performance issue, I would like to run my Python function in parallel:
import multiprocessing as mp
import networkx as nx
source_nodes = [10413173, 10414530, 10414530, 10437199]
sink_nodes = [10420346, 10438770, 10438711, 10414530, 10436258]
def createpath(source, sink):
    # Directed_G is the networkx directed graph, assumed to be defined elsewhere
    path = []
    for i in source:
        for j in sink:
            path = path + list(nx.all_simple_paths(Directed_G, i, j))
    return path
From my understanding, I must pass one iterable to the apply function, but my idea was to do something like:
results = [pool.apply(createpath, args=(source_nodes, sink_nodes))]
and not pass any iterable object to apply.
I managed to get it to work, but I don't think it runs in parallel.
Do you think I should call apply inside the first loop?

from multiprocessing import Pool
source_nodes = [1,2,3,4,5,6]
sink_nodes = [1,1,1,1,1,1,1,1,1]
def sum_values(parameter_tuple):
    source, sink, start, stop = parameter_tuple
    out = 0
    for i in range(start, stop):
        val_i = source[i]
        for j in sink:
            out += val_i*j
    return out
if __name__ == "__main__":
    params = (source_nodes, sink_nodes, 0, 6)
    print(sum_values(params))
    with Pool(2) as p:
        print(p.map(sum_values, [
            (source_nodes, sink_nodes, 0, 3),
            (source_nodes, sink_nodes, 3, 6),
        ]))
You can try running this one. It runs in parallel, using the map pattern on a pool of 2 processes (not threads). In this case, the overall result is the sum of the results returned by the individual processes (here 54 + 135 = 189).
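Applied to the nested loop from the question, the same map pattern could be sketched as follows. This is only an illustration under stated assumptions: GRAPH and simple_paths are hypothetical stand-ins for the asker's Directed_G and nx.all_simple_paths, so that the sketch runs without networkx. The idea is to flatten the two loops into a list of (source, sink) pairs and let the pool handle one pair per task.

```python
from itertools import product
from multiprocessing import Pool

# Hypothetical toy graph as an adjacency dict; the question uses a networkx graph.
GRAPH = {1: [2, 3], 2: [4], 3: [4], 4: []}

def simple_paths(source, sink, path=None):
    """Return all simple paths from source to sink (depth-first search)."""
    path = (path or []) + [source]
    if source == sink:
        return [path]
    found = []
    for nxt in GRAPH[source]:
        if nxt not in path:  # never revisit a node, so paths stay simple
            found.extend(simple_paths(nxt, sink, path))
    return found

def all_paths(sources, sinks, processes=2):
    pairs = list(product(sources, sinks))  # flattens the two nested loops
    with Pool(processes) as pool:
        per_pair = pool.starmap(simple_paths, pairs)  # one task per pair
    # Concatenate the per-pair lists into a single list of paths.
    return [p for paths in per_pair for p in paths]

if __name__ == "__main__":
    print(all_paths([1, 2], [4]))
```

With real data, whether this pays off depends on how long each path search takes compared with the cost of sending the results back to the parent.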

Related

Python multiprocessing outside main module: slow computations

I have code that uses multiprocessing in an inner part of a module that is NOT main.py. To keep it simple, I have created a replica that contains two files: main.py and f1_mod.py. My main.py looks something like this:
import os
from time import time
import f1_mod
parallel = 1
N = 10000000
if __name__ == '__main__':
    if not parallel:
        time_start = time()
        res = f1_mod.single_thread_exec( N )
        time_end = time()
        print("-->time is ", time_end - time_start )
    else:
        time_start = time()
        res = f1_mod.parallel_exec( N )
        time_end = time()
        print("-->time is ", time_end - time_start )
    for j in range( 100 ):
        print( res[j] )
else:
    print("--> CHILD PROCESS ID: ", os.getpid() )
    pass
and the f1_mod.py would be:
import multiprocessing as mp
import numpy as np
def pool_worker_function( k, args ):
    d = args[0]
    return d["p"]*( k**2 ) + d["x"]**2
def single_thread_exec( N ):
    a = list( np.linspace( 0, N, N ) )
    d = { "p": 0.2, "x": 0.5 }
    result = []
    for k in a:
        result.append( d["p"]*( k**2) + d["x"]**2 )
    return result
def parallel_exec( N ):
    number_processors = mp.cpu_count()
    number_used_processors = number_processors - 1
    #try pool map method
    from itertools import repeat
    pool = mp.Pool( processes = number_used_processors )
    d = { "p": 0.2, "x": 0.5 }
    a = list( np.linspace( 0, N, N ) )
    args = ( d, )
    number_tasks = number_used_processors
    chuncks = []
    n_chunck = int( ( len(a) - 1 )/number_tasks )
    for j in range( 0, number_tasks ):
        if ( j == number_tasks - 1 ):
            chuncks.append( a[ j*n_chunck: ] )
        else:
            chuncks.append( a[ j*n_chunck: j*n_chunck + n_chunck ] )
    result = pool.starmap( pool_worker_function, zip( a, repeat( args ) ) )
    pool.close()
    pool.join()
    return result
I checked that the serial and parallelized versions give the same results, but the serial version is much faster than the multiprocessed one. In my real code this is sometimes the case too, and there the "args" tuple passed to the worker function contains a big data container object with a much bigger dictionary, from which data is read to perform the operations. Can anyone explain why I observe this behaviour (i.e. slow performance when multiprocessing)? The data needs to be passed to the worker function every time, so the worker function takes a lot of arguments; maybe that is what slows the code down, given the IPC taking place?
The "repeat" in the args passed to the worker is used because all arguments passed in the tuple have to be the same for each worker; the only iterable is the list "a". Does anyone know how to do this multiprocessing efficiently? Also, note that multiprocessing does not happen at the "main.py" level, but rather in "deep" functions in some module within the logic of the code. I would appreciate some help to better understand how this multiprocessing works! I am using a 4-core machine under Windows, and I now know that Windows does not support "fork"-like behaviour when using multiprocessing. However, running the code in Ubuntu on my machine seems to be very slow too! Thanks!!

My best guess for the current code: you don't do enough work in pool_worker_function to offset the cost of communication/synchronization.
I may be wrong, but your code seems half-finished: the chuncks variable is never used. In any case, passing big arrays (your chunks) around probably wouldn't be the best solution.
You could try passing start/len parameters to your pool_worker_function, then creating the array and doing the computation there, but I'm not 100% sure that will be enough to offset the cost of communicating the result back.
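A minimal sketch of the start/len idea, under some assumptions: the names compute_range and make_bounds are hypothetical, the constants P and X come from the question's dictionary d, and integer k values are used in place of the question's np.linspace grid so that the sketch needs only the standard library. Each task receives just two integers and builds its own data locally.

```python
from multiprocessing import Pool

P, X = 0.2, 0.5  # the constants from the question's dictionary d

def compute_range(bounds):
    """Evaluate P*k**2 + X**2 for every integer k in [start, stop)."""
    start, stop = bounds
    return [P * k**2 + X**2 for k in range(start, stop)]

def make_bounds(n, n_tasks):
    """Return (start, stop) pairs that cover range(n) in at most n_tasks pieces."""
    step = (n + n_tasks - 1) // n_tasks  # ceiling division
    return [(s, min(s + step, n)) for s in range(0, n, step)]

def parallel_exec(n, n_tasks=4):
    with Pool(n_tasks) as pool:
        # Only two integers travel to each worker; the results still travel back.
        parts = pool.map(compute_range, make_bounds(n, n_tasks))
    # Flatten the per-task lists back into one result list.
    return [value for part in parts for value in part]

if __name__ == "__main__":
    print(parallel_exec(10)[:3])
```

As the answer cautions, the result lists still have to be pickled and sent back, so for arithmetic this cheap the serial loop may remain faster.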

Implement merge_sort with multiprocessing solution

I tried to write a merge sort that uses multiprocessing:
from heapq import merge
from multiprocessing import Process
def merge_sort1(m):
    if len(m) < 2:
        return m
    middle = len(m) // 2
    left = Process(target=merge_sort1, args=(m[:middle],))
    left.start()
    right = Process(target=merge_sort1, args=(m[middle:],))
    right.start()
    for p in (left, right):
        p.join()
    result = list(merge(left, right))
    return result
I tested it with arr:
In [47]: arr = list(range(9))
In [48]: random.shuffle(arr)
It reports an error:
In [49]: merge_sort1(arr)
TypeError: 'Process' object is not iterable
What's the problem with my code?

merge(left, right) tries to merge two processes, whereas you presumably want to merge the two lists that resulted from each process. Note that the return value of the function passed to Process is lost; it is a different process, not just a different thread, and you can't very easily shuffle data back to the parent, so Python doesn't do that by default. You need to be explicit and code such a channel yourself. Fortunately, there are multiprocessing datatypes to help you; for example, multiprocessing.Pipe:
from heapq import merge
import random
import multiprocessing
def merge_sort1(m, send_end=None):
    if len(m) < 2:
        result = m
    else:
        middle = len(m) // 2
        inputs = [m[:middle], m[middle:]]
        pipes = [multiprocessing.Pipe(False) for _ in inputs]
        processes = [multiprocessing.Process(target=merge_sort1, args=(input, send_end))
                     for input, (recv_end, send_end) in zip(inputs, pipes)]
        for process in processes: process.start()
        # receive before joining, so a child sending a large list cannot block forever
        results = [recv_end.recv() for recv_end, send_end in pipes]
        for process in processes: process.join()
        result = list(merge(*results))
    if send_end:
        send_end.send(result)
    else:
        return result
if __name__ == "__main__":
    arr = list(range(9))
    random.shuffle(arr)
    print(merge_sort1(arr))
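Starting two processes at every level of the recursion ends up creating a process for nearly every sub-list. As an alternative technique (not the approach used in the answer above), one could sort a few large chunks in a Pool and merge them in the parent. The sketch below assumes the hypothetical helper names split and parallel_sort; the return values come back through pool.map, so no explicit Pipe is needed.

```python
from heapq import merge
from multiprocessing import Pool
import random

def split(seq, n_chunks):
    """Split seq into at most n_chunks contiguous slices."""
    size = max(1, (len(seq) + n_chunks - 1) // n_chunks)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def parallel_sort(data, processes=4):
    """Sort chunks in worker processes, then k-way merge them in the parent."""
    chunks = split(data, processes)
    with Pool(processes) as pool:
        sorted_chunks = pool.map(sorted, chunks)  # built-in sorted runs in the workers
    return list(merge(*sorted_chunks))

if __name__ == "__main__":
    arr = list(range(9))
    random.shuffle(arr)
    print(parallel_sort(arr))
```

The number of processes stays fixed at the pool size regardless of the input length, which is usually the more practical trade-off.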

Python multiprocessing with Pool - the main process takes forever

I am trying to understand how multiprocessing works with Python. Here's my test code:
import numpy as np
import multiprocessing
import time
def worker(a):
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i]*j
    return len(a)
arr2 = np.random.rand(10000).tolist()
if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    result = pool.map(worker, [arr1], chunksize=1000000//(cores-1))
    print("mp time", time.time()-tmp)
I have 8 cores. It usually ends up with 7 processes using only ~3% of the CPU for about a second, and the last process using ~1/8 of the CPU forever (it has been running for about 15 minutes).
I understand that interprocess communication usually bounds the performance of parallel programs, but does it usually take this long? What else could cause the last process to take forever?
The thread "Python multiprocessing never joins" seems to address a similar issue, but it doesn't solve the problem with Pool.

It looks like you want to divide the work into chunks. You can use the range function to partition the data. On Linux, forked processes get a copy-on-write view of the parent's memory, so you can just pass down the indexes you want to work on. On Windows, no such luck: you need to pass in each sublist. This program should do it:
import numpy as np
import multiprocessing
import time
import platform
def worker(a):
    if platform.system() == "Linux":
        # on linux we passed in the (start, stop) indexes
        start, stop = a
        a = arr1[start:stop]
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i]*j
    return len(a)
arr2 = np.random.rand(10000).tolist()
if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    chunk = (len(arr1)+cores-1)//cores
    # on Windows, pass the sublist, on linux just the indexes and let the
    # worker split from the view of parent memory space
    if platform.system() == "Linux":
        seq = [(i, i+chunk) for i in range(0, len(arr1), chunk)]
    else:
        seq = [arr1[i:i+chunk] for i in range(0, len(arr1), chunk)]
    result = pool.map(worker, seq, chunksize=1)
    print("mp time", time.time()-tmp)

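The key point is this:
pool.map iterates over the object you pass it, which is [arr1] in your program. Note that the object is [arr1], not arr1, which means the iterable you pass to pool.map has a length of only one, so there is only a single task.
I think the simplest solution is to replace [arr1] with arr1. Note that worker would then receive a single number instead of a list, so its body has to be adapted accordingly.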
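The difference between the two calls can be illustrated with a small sketch. The names scale_chunk and split are hypothetical and not taken from the question; the point is only that pool.map creates one task per element of the iterable it receives.

```python
from multiprocessing import Pool

def scale_chunk(chunk):
    # Receives one element of the iterable given to pool.map: here, a sub-list.
    return [value * 2 for value in chunk]

def split(seq, n_chunks):
    """Split a non-empty seq into at most n_chunks contiguous slices."""
    size = (len(seq) + n_chunks - 1) // n_chunks  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    data = list(range(8))
    with Pool(4) as pool:
        one_task = pool.map(scale_chunk, [data])            # 1 task: one worker does everything
        four_tasks = pool.map(scale_chunk, split(data, 4))  # 4 tasks: work is spread out
    print(len(one_task), len(four_tasks))  # prints: 1 4
```

In the first call three of the four workers have nothing to do, which matches the behaviour described in the question.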

Python multiprocessing and shared numpy array

I have a problem, which is similar to this:
import numpy as np
C = np.zeros((100,10))
for i in range(10):
    C_sub = get_sub_matrix_C(i, other_args) # shape 10x10
    C[i*10:(i+1)*10,:10] = C_sub
So, apparently there is no need to run this as a serial calculation, since each submatrix can be calculated independently.
I would like to use the multiprocessing module and create up to 4 processes for the for loop.
I read some tutorials about multiprocessing, but wasn't able to figure out how to use this to solve my problem.
Thanks for your help

A simple way to parallelize that code would be to use a Pool of processes:
pool = multiprocessing.Pool()
results = pool.starmap(get_sub_matrix_C, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
    C[i*10:(i+1)*10,:10] = res
I've used starmap since the get_sub_matrix_C function has more than one argument (starmap(f, [(x1, ..., xN)]) calls f(x1, ..., xN)).
Note however that serialization/deserialization may take significant time and space, so you may have to use a more low-level solution to avoid that overhead.
It looks like you are running an outdated version of Python. You can replace starmap with plain map, but then you have to provide a function that takes a single parameter:
def f(args):
    return get_sub_matrix_C(*args)
pool = multiprocessing.Pool()
results = pool.map(f, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
    C[i*10:(i+1)*10,:10] = res
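One possible reading of the "more low-level solution" mentioned above is a shared buffer that the workers write into directly, so that no sub-matrix has to be pickled and sent back. The sketch below uses multiprocessing.Array from the standard library. It rests on several assumptions: get_sub_matrix_C is a hypothetical stand-in returning plain lists, fill_block and compute_shared are invented names, and one process is started per block rather than the 4 the question asks for.

```python
import multiprocessing as mp

ROWS, COLS, BLOCK = 100, 10, 10  # same shapes as in the question

def get_sub_matrix_C(i, other_args):
    # Hypothetical stand-in: a BLOCK x COLS block filled with i + other_args.
    return [[float(i + other_args)] * COLS for _ in range(BLOCK)]

def fill_block(shared, i, other_args):
    """Compute block i and write it straight into the shared flat buffer."""
    block = get_sub_matrix_C(i, other_args)
    for r, row in enumerate(block):
        offset = (i * BLOCK + r) * COLS  # row-major position of this row
        shared[offset:offset + COLS] = row

def compute_shared(other_args=0):
    shared = mp.Array('d', ROWS * COLS)  # zero-initialised doubles in shared memory
    procs = [mp.Process(target=fill_block, args=(shared, i, other_args))
             for i in range(ROWS // BLOCK)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return shared

if __name__ == "__main__":
    shared = compute_shared()
    print(shared[0], shared[BLOCK * COLS], shared[len(shared) - 1])
```

If NumPy is available, np.frombuffer(shared.get_obj()).reshape(ROWS, COLS) should give an array view of the same memory without copying; whether this beats the Pool version depends on how expensive each sub-matrix is to compute.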

The following recipe may do the job. Feel free to ask questions.
import numpy as np
import multiprocessing
def own_process(i, other_args, out_queue):
    # defined at module level so that it can be pickled on Windows
    C_sub = get_sub_matrix_C(i, other_args)
    # send the index along, because results arrive in arbitrary order
    out_queue.put((i, C_sub))
def processParallel():
    sub_matrices = {}
    procs = []
    out_queue = multiprocessing.Queue()
    other_args = 0
    for i in range(10):
        p = multiprocessing.Process(
            target=own_process,
            args=(i, other_args, out_queue))
        procs.append(p)
        p.start()
    # drain the queue before joining the processes
    for _ in range(10):
        i, C_sub = out_queue.get()
        sub_matrices[i] = C_sub
    for p in procs:
        p.join()
    return [sub_matrices[i] for i in range(10)]
if __name__ == '__main__':
    C = np.zeros((100,10))
    result = processParallel()
    for i in range(10):
        C[i*10:(i+1)*10,:10] = result[i]

Multiprocessing in python to speed up functions

I am confused with Python multiprocessing.
I am trying to speed up a function which processes strings from a database, but I must have misunderstood how multiprocessing works, because the function takes longer when given to a pool of workers than with "normal processing".
Here is an example of what I am trying to achieve.
from time import time
from multiprocessing import Pool, freeze_support
from random import choice
def foo(x):
    TupWerteMany = []
    for i in range(0,len(x)):
        TupWerte = []
        s = list(x[i][3])
        NewValue = choice(s)+choice(s)+choice(s)+choice(s)
        TupWerte.append(NewValue)
        TupWerte = tuple(TupWerte)
        TupWerteMany.append(TupWerte)
    return TupWerteMany
if __name__ == '__main__':
    start_time = time()
    List = [(u'1', u'aa', u'Jacob', u'Emily'),
            (u'2', u'bb', u'Ethan', u'Kayla')]
    List1 = List*1000000
    # METHOD 1 : NORMAL (takes 20 seconds)
    x2 = foo(List1)
    print(x2[1:3])
    # METHOD 2 : APPLY_ASYNC (takes 28 seconds)
    # pool = Pool(4)
    # Werte = pool.apply_async(foo, args=(List1,))
    # x2 = Werte.get()
    # print('--------')
    # print(x2[1:3])
    # print('--------')
    # METHOD 3: MAP (!! DOES NOT WORK !!)
    # pool = Pool(4)
    # Werte = pool.map(foo, args=(List1,))
    # x2 = Werte.get()
    # print('--------')
    # print(x2[1:3])
    # print('--------')
    print('Time Elapsed: ', time() - start_time)
My questions:
Why does apply_async take longer than the "normal way"?
What am I doing wrong with map?
Does it make sense to speed up such tasks with multiprocessing at all?
Finally: after all I have read here, I am wondering if multiprocessing in Python works on Windows at all?

So your first problem is that there is no actual parallelism happening in foo(x): you are passing the entire list to the function once.
1) The idea of a process pool is to have many processes doing computations on separate pieces of some data.
# METHOD 2 : APPLY_ASYNC
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = list(range(size//jobs, size, size//jobs)) + [size]
tails = range(0, size, size//jobs)
for tail, head in zip(tails, heads):
    werte = pool.apply_async(foo, args=(List1[tail:head],))
    results.append(werte)
pool.close()
pool.join() # wait for the pool to be done
for result in results:
    werte = result.get() # get the return value from the sub jobs
This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch the process. This reasoning applies to the case of four processes and four jobs; of course, the dynamics change if you have 4 processes and 100 jobs to be done. Remember that you are creating a completely new Python interpreter four times, which isn't free.
2) The problem you have with map is that it applies foo to EVERY element in List1 individually, handing the elements out to the worker processes, and this will take quite a while. Roughly speaking, if your pool has 4 processes, map will pop an item off the list four times and send each one to a process to be dealt with, wait for the processes to finish, pop some more items off the list, and wait again. This makes sense only if processing a single item takes a long time, for instance if every item is a file name pointing to a one-gigabyte text file. But as it stands, map will just take a single element of the list and pass it to foo, whereas apply_async takes a slice of the list. Try the following code:
def foo(thing):
    print(thing)
list(map(foo, ['a','b','c','d']))  # list() forces the lazy Python 3 map to run
That's the built-in Python map, which runs in a single process, but the idea is exactly the same for the multiprocessing version.
Added as per J.F.Sebastian's comment: you can, however, use the chunksize argument to map to specify an approximate size for each chunk.
pool.map(foo, List1, chunksize=size//jobs)
I don't know, though, whether there is a problem with map on Windows, as I don't have a Windows machine available for testing.
3) Yes, provided that your problem is big enough to justify forking out new Python interpreters.
4) I can't give you a definitive answer on that, as it depends on the number of cores/processors etc., but in general it should be fine on Windows.

On question (2)
With the guidance of Dougal and Matti, I figured out what went wrong.
The original foo function processes a list of lists, while map requires a function that processes single elements.
The new function should be
def foo2(x):
    TupWerte = []
    s = list(x[3])
    NewValue = choice(s)+choice(s)+choice(s)+choice(s)
    TupWerte.append(NewValue)
    TupWerte = tuple(TupWerte)
    return TupWerte
and the block to call it:
jobs = 4
size = len(List1)
pool = Pool()
#Werte = pool.map(foo2, List1, chunksize=size//jobs)
Werte = pool.map(foo2, List1)
pool.close()
print(Werte[1:3])
Thanks to all of you who helped me understand this.
Results of all methods:
For List * 2 million records: normal 13.3 seconds, parallel with apply_async 7.5 seconds, parallel with map and chunksize 7.3 seconds, parallel with map without chunksize 5.2 seconds.

Here's a generic multiprocessing template if you are interested.
import multiprocessing as mp
import time
def worker(x):
    time.sleep(0.2)
    print("x= %s, x squared = %s" % (x, x*x))
    return x*x
def apply_async():
    pool = mp.Pool()
    for i in range(100):
        pool.apply_async(worker, args = (i, ))
    pool.close()
    pool.join()
if __name__ == '__main__':
    apply_async()
And the output looks like this:
x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144
As you can see, the numbers are not in order, as they are being executed asynchronously.
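The template above discards the values returned by worker. If they are needed in submission order, one option is to keep the AsyncResult handles that apply_async returns and call get() on each in turn. This is a small sketch, with squares_in_order as a hypothetical helper name and the sleep and print removed for brevity.

```python
import multiprocessing as mp

def worker(x):
    return x * x

def squares_in_order(n):
    with mp.Pool() as pool:
        # Keep the AsyncResult handles in the order the tasks were submitted.
        handles = [pool.apply_async(worker, args=(i,)) for i in range(n)]
        # get() blocks until that particular task has finished.
        return [handle.get() for handle in handles]

if __name__ == "__main__":
    print(squares_in_order(5))  # prints: [0, 1, 4, 9, 16]
```

The tasks may still finish in any order; only the collection step imposes the ordering.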
