Python multiprocessing gets stuck when passing large array through pipe - python

I'm using multiprocessing in Python and trying to pass a large numpy array to a subprocess through a pipe. It works well with a small array but hangs for larger arrays without raising an error.
I believe the pipe is blocked, and I have already read a bit about it, but I cannot figure out how to solve the problem.
def f2(conn, x):
    conn.start()
    data = conn.recv()
    conn.join()
    print(data)
    do_something(x)
    conn.close()

if __name__ == '__main__':
    data_input = read_data()  # large numpy array
    parent_conn, child_conn = Pipe()
    p = multiprocessing.Pool(processes=8)
    func = partial(f2, child_conn)
    parent_conn.send(data_input)
    parent_conn.close()
    result = p.map(func, processes)
    p.close()
    p.join()

Ignoring all the other problems in this code (you don't have an x to pass to map, you don't use the x f2 receives, mixing Pool.map with Pipe is usually the wrong thing to do), your ultimate problem is the blocking send call being performed before a worker process is available to read from it.
Assuming you really want to mix map with Pipe, the solution is to launch the map asynchronously before beginning the send, so there is something on the other side to read from the Pipe while the parent is trying to write to it:
if __name__ == '__main__':
    data_input = read_data()  # large numpy array
    parent_conn, child_conn = Pipe()
    # Use with to avoid needing to explicitly close/join
    with multiprocessing.Pool(processes=8) as p:
        func = partial(f2, child_conn)
        # Launch async map to ensure workers are running
        future = p.map_async(func, x)
        # Can perform blocking send as workers will consume as you send
        parent_conn.send(data_input)
        parent_conn.close()
        # Now you can wait on the map to complete
        result = future.get()
As noted, this code will not run due to the issues with x, and even if it did, the Pipe documentation explicitly warns that two different processes should not be reading from the Pipe at the same time.
If you wanted to process the data in bulk in a single worker, you'd just use Process and Pipe, something like:
import multiprocessing
from multiprocessing import Pipe

def f2(conn):
    data = conn.recv()
    conn.close()
    print(data)

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    proc = multiprocessing.Process(target=f2, args=(child_conn,))
    # Start the reader first so the send below has something to drain it
    proc.start()
    data_input = read_data()  # large numpy array
    parent_conn.send(data_input)
    parent_conn.close()
    proc.join()
If you wanted to process each element separately across many workers, you'd just use Pool and map:
def f2(x):
    print(x)

if __name__ == '__main__':
    data_input = read_data()  # large numpy array
    with multiprocessing.Pool(processes=8) as p:
        result = p.map(f2, data_input)
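To see the blocking behaviour in isolation, here is a minimal sketch that sends a large payload through a pipe only after a reader is already waiting. It uses a thread in place of a worker process and a bytes object in place of the numpy array, purely to keep the example self-contained; the name send_with_reader is made up for illustration.

```python
import threading
from multiprocessing import Pipe


def send_with_reader(payload):
    """Send payload through a pipe while a reader is already waiting on it."""
    recv_conn, send_conn = Pipe(duplex=False)
    received = []

    def reader():
        # Blocks until a complete message has arrived
        received.append(recv_conn.recv())

    t = threading.Thread(target=reader)
    t.start()                # the reader is running before we start writing
    send_conn.send(payload)  # with no reader, a large send can block here
    send_conn.close()
    t.join()
    recv_conn.close()
    return received[0]


if __name__ == '__main__':
    big = bytes(10 * 1024 * 1024)  # stand-in for the large numpy array
    echoed = send_with_reader(big)
    print(len(echoed) == len(big))
```

If the send were issued before anything was reading, it could block once the operating system's pipe buffer filled up, which is the hang described in the question.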

Related

Run function in parallel and grab outputs using Queue

I would like to run a function using different arguments. For each different argument, I would like to run the function in parallel and then get the output of each run. It seems that the multiprocessing module can help here. I am not sure about the right steps to make this work.
Do I start all the processes, then get all the queues and then join all the processes in this order? Or do I get the results after I have joined? Or do I get the ith result after I have joined the ith process?
from numpy.random import uniform
from multiprocessing import Process, Queue

def function(x):
    return uniform(0.0, x)

if __name__ == "__main__":
    queue = Queue()
    processes = []
    x_values = [1.0, 10.0, 100.0]
    # Start all processes
    for x in x_values:
        process = Process(target=function, args=(x, queue, ))
        processes.append(process)
        process.start()
    # Grab results of the processes?
    outputs = [queue.get() for _ in range(len(x_values))]
    # Not even sure what this does but apparently it's needed
    for process in processes:
        process.join()
So let's make a simple example using a multiprocessing pool: a function that sleeps for 3 seconds and returns both the value passed to it (your parameter) and the result of the function, which is just that value doubled.
IIRC there are some issues with stopping pools cleanly.
from multiprocessing import Pool
import time

class KeyboardInterruptError(Exception):
    """Raised in a worker so the interrupt reaches the parent as a normal exception."""

def time_waster(val):
    try:
        time.sleep(3)
        return (val, val*2)  # return a tuple here but you can use a dict as well with all your parameters
    except KeyboardInterrupt:
        raise KeyboardInterruptError()

if __name__ == '__main__':
    x = list(range(5))  # values to pass to the function
    results = []
    p = Pool(2)  # I use 2 but you can use as many as you have cores
    try:
        results.append(p.map(time_waster, x))
        p.close()
    except KeyboardInterrupt:
        p.terminate()
    except Exception as e:
        p.terminate()
    finally:
        p.join()
    print(results)
As an extra, I added some KeyboardInterrupt handlers, as IIRC there are some issues with interrupting pools: https://stackoverflow.com/questions/1408356/keyboard-interrupts-with-pythons-multiprocessing-pool
proc.join() blocks until the process has ended. queue.get() blocks until there is something in the queue. Because your processes don't put anything into the queue (in this example), this code will never get beyond the queue.get() part. If your processes put something small in the queue at the very end, then it doesn't matter whether you join() or get() first, because they happen at about the same time. For large results, call get() before join(), because a process that has put items on a queue does not exit until they have been flushed to the underlying pipe.
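As a rough sketch of the order described above (start everything, get one item per process, then join), the question's code could be adapted as follows. It uses random.uniform from the standard library instead of numpy, and the names worker and run_all are made up for illustration.

```python
import random
from multiprocessing import Process, Queue


def worker(x, queue):
    # Put the input next to the result, since items arrive in completion order
    queue.put((x, random.uniform(0.0, x)))


def run_all(x_values):
    queue = Queue()
    processes = [Process(target=worker, args=(x, queue)) for x in x_values]
    for process in processes:
        process.start()
    # Drain the queue first: each process puts exactly one item
    outputs = dict(queue.get() for _ in processes)
    # Then wait for the processes to exit
    for process in processes:
        process.join()
    return outputs


if __name__ == "__main__":
    print(run_all([1.0, 10.0, 100.0]))
```

Keying the results by their input is one way to cope with the fact that a shared queue does not preserve the order in which the processes were started.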

How to return values from Process- or Thread instances?

So I want to run a function which can either search for information on the web or directly from my own mysql database.
The first process will be time-consuming, the second relatively fast.
With this in mind I create a process which starts this compound search (find_compound_view). If the process finishes relatively fast it means it's present on the database so I can render the results immediately. Otherwise, I will render "drax_retrieving_data.html".
The stupid solution I came up with was to run the function twice, once to check if the process takes a long time, the other to actually get the return values of the function. This is pretty much because I don't know how to return the values of my find_compound_view function. I've tried googling but I can't seem to find how to return the values from the class Process specifically.
p = Process(target=find_compound_view, args=(form,))
p.start()
is_running = p.is_alive()
start_time=time.time()
while is_running:
    time.sleep(0.05)
    is_running = p.is_alive()
    if time.time() - start_time > 10 :
        print('Timer exceeded, DRAX is retrieving info!',time.time() - start_time)
        return render(request,'drax_internal_dbs/drax_retrieving_data.html')
compound = find_compound_view(form,use_email=False)
if compound:
    data=*****
    return render(request, 'drax_internal_dbs/result.html',data)
You will need a multiprocessing.Pipe or a multiprocessing.Queue to send the results back to your parent process. If you just do I/O, you should use a Thread instead of a Process, since it's more lightweight and most of the time will be spent waiting. I'm showing you how it's done for Process and Thread in general.
Process with Queue
The multiprocessing queue is built on top of a pipe, and access is synchronized with locks/semaphores. Queues are thread- and process-safe, meaning you can use one queue for multiple producer/consumer processes and even multiple threads in these processes. Adding the first item to the queue will also start a feeder thread in the calling process. The additional overhead of a multiprocessing.Queue makes using a pipe for single-producer/single-consumer scenarios preferable and more performant.
Here's how to send and retrieve a result with a multiprocessing.Queue:
from multiprocessing import Process, Queue

SENTINEL = 'SENTINEL'

def sim_busy(out_queue, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    p = Process(target=sim_busy, args=(out_queue, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel breaks the loop
        print(result)
The queue is passed as an argument into the function, results are .put() on the queue, and the parent .get()s from the queue. .get() is a blocking call: execution does not resume until there is something to get (specifying a timeout parameter is possible). Note that the work sim_busy does here is CPU-intensive; that's when you would choose processes over threads.
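The timeout parameter mentioned above also gives a way to "wait up to N seconds, otherwise carry on", which is roughly what the question is after. The following is only a sketch: slow_lookup stands in for the real search, and both function names and the delays are made up for illustration.

```python
import time
from multiprocessing import Process, Queue
from queue import Empty  # multiprocessing.Queue.get raises queue.Empty on timeout


def slow_lookup(out_queue, delay):
    time.sleep(delay)  # stand-in for the real search
    out_queue.put('result after {}s'.format(delay))


def lookup_with_timeout(delay, timeout):
    """Return the worker's result, or None if it is not ready within timeout."""
    out_queue = Queue()
    p = Process(target=slow_lookup, args=(out_queue, delay))
    p.start()
    try:
        result = out_queue.get(timeout=timeout)
    except Empty:
        result = None
        p.terminate()  # demo only: a real application might let the worker finish
    p.join()
    return result


if __name__ == '__main__':
    print(lookup_with_timeout(0.1, 5))  # fast enough: prints the result
    print(lookup_with_timeout(5, 0.2))  # too slow: prints None
```

This avoids running the function twice: the parent either gets the value within the time limit or falls back to another code path.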
Process & Pipe
For one-to-one connections a pipe is enough. The setup is nearly identical, just the methods are named differently and a call to Pipe() returns two connection objects. In duplex mode, both objects are read-write ends, with duplex=False (simplex) the first connection object is the read-end of the pipe, the second is the write-end. In this basic scenario we just need a simplex-pipe:
from multiprocessing import Process, Pipe

SENTINEL = 'SENTINEL'

def sim_busy(write_conn, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    write_conn.send(result)
    # If all results are sent, send a sentinel-value to let the parent know
    # no more results will come.
    write_conn.send(SENTINEL)

if __name__ == '__main__':
    # duplex=False because we just need one-way communication in this case.
    read_conn, write_conn = Pipe(duplex=False)
    p = Process(target=sim_busy, args=(write_conn, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(read_conn.recv, SENTINEL):  # sentinel breaks the loop
        print(result)
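The difference between the two ends of a simplex pipe can be checked without starting a process at all. This is a small sketch with a made-up function name, simplex_demo. It also shows poll(), which is the pipe counterpart of waiting with a timeout.

```python
from multiprocessing import Pipe


def simplex_demo():
    # With duplex=False the first object can only receive, the second only send
    read_conn, write_conn = Pipe(duplex=False)
    write_conn.send('ping')
    has_data = read_conn.poll(1.0)  # wait up to one second for a message
    message = read_conn.recv()
    try:
        read_conn.send('pong')      # the read end is not writable
        read_end_can_send = True
    except OSError:
        read_end_can_send = False
    read_conn.close()
    write_conn.close()
    return has_data, message, read_end_can_send


if __name__ == '__main__':
    print(simplex_demo())
```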
Thread & Queue
For use with threading, you want to switch to queue.Queue. queue.Queue is built on top of a collections.deque, adding some locks to make it thread-safe. Unlike with multiprocessing's queue and pipe, objects put on a queue.Queue won't get pickled. Since threads share the same memory address space, serialization for memory-copying is unnecessary; only pointers are transmitted.
from threading import Thread
from queue import Queue
import time

SENTINEL = 'SENTINEL'

def sim_io(out_queue, query):
    time.sleep(1)
    result = query + '_result'
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    p = Thread(target=sim_io, args=(out_queue, 'my_query'))
    p.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel-value breaks the loop
        print(result)
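The point that queue.Queue passes references rather than pickled copies can be illustrated with a short sketch. The names producer and roundtrip are made up for this example.

```python
from threading import Thread
from queue import Queue


def producer(out_queue, payload):
    out_queue.put(payload)  # only a reference to payload is enqueued


def roundtrip(payload):
    """Pass payload through a queue.Queue from another thread and return it."""
    q = Queue()
    t = Thread(target=producer, args=(q, payload))
    t.start()
    received = q.get()
    t.join()
    return received


if __name__ == '__main__':
    data = {'rows': [1, 2, 3]}
    # The consumer gets the very same object, not a copy
    print(roundtrip(data) is data)
```

With a multiprocessing.Queue the receiver would instead get an equal but distinct object, because the data is pickled on one side and rebuilt on the other.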
Read here why for result in iter(out_queue.get, SENTINEL): should be preferred over a while True...break setup, where possible.
Read here why you should use if __name__ == '__main__': in all your scripts and especially in multiprocessing.
More about get()-usage here.

How do I retrieve output from Multiprocessing in Python?

So, I'm trying to speed up one routine by using the Multiprocessing module in Python. I want to be able to read several .csv files by splitting the job among several cores, for that I have:
def csvreader(string):
    from numpy import genfromtxt;
    time,signal=np.genfromtxt(string, delimiter=',',unpack="true")
    return time,signal
Then I call this function by saying:
if __name__ == '__main__':
    for i in range(0,2):
        p = multiprocessing.Process(target=CSVReader.csvreader, args=(string_array[i],))
        p.start()
The thing is that this doesn't store any output. I have read all the forums online and seen that there might be a way with multiprocessing.queue but I don't understand it quite well.
Is there any simple and straightforward method?
Your best bets are multiprocessing.Queue or multiprocessing.Pipe, which are designed exactly for this problem. They allow you to send data between processes in a safe and easy way.
If you'd like to return the output of your csvreader function, you should pass another argument to it: the multiprocessing.Queue through which the data will be sent back to the main process. Instead of returning the values, place them on the queue, and the main process will retrieve them at some point later. If they're not ready when the main process tries to get them, by default it will just block (wait) until they are available.
Your function would now look like this:
def csvreader(string, q):
    q.put(np.genfromtxt(string, delimiter=',', unpack=True))
The main routine would be:
if __name__ == '__main__':
    q = multiprocessing.Queue()
    for i in range(2):
        p = multiprocessing.Process(target=csvreader, args=(string_array[i], q))
        p.start()
    # Do anything else you need in here
    time = np.empty(2, dtype='object')
    signal = np.empty(2, dtype='object')
    for i in range(2):
        # Returns output or blocks until ready; results arrive in completion
        # order, which may differ from the order the processes were started in.
        time[i], signal[i] = q.get()
    # Process my output
Note that you have to call Queue.get() for each item you want to return.
Have a look at the documentation on the multiprocessing module for more examples and information.
Using the example from the introduction to the documentation:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(2)
    results = pool.map(CSVReader.csvreader, string_array[:2])
    print(results)

Returns values of functions Python Multiprocessing.Process

I'm trying to generate the checksum of two identical files (in two different directories) and am using multiprocessing.Process() to run the checksum of both files simultaneously instead of sequentially.
However, when I run the multiprocessing.Process() object on the checksum generating function I get this return value:
<Process(Process-1, stopped)>
<Process(Process-2, stopped)>
These should be a list of checksum strings.
The return statement from the generating function is:
return chksum_list
Pretty basic and the program works well when running sequentially.
How do I retrieve the return value of the function that is being processed through the multiprocessing.Process() object?
Thanks.
The docs are relatively good on this topic.
Pipes
You could communicate with the Process objects via a pipe.
From the docs:
from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # prints "[42, None, 'hello']"
    p.join()
Pool & map
Alternatively you could use a Pool of processes:
pool = Pool(processes=4)
returnvals = pool.map(f, range(10))
where f is your function, which will act on each member of range(10).
Similarly, you can pass in any list containing the inputs to your processes:
returnvals = pool.map(f, [input_to_process_1, input_to_process_2])
In your specific case, input_to_process_1/2 could be paths to the files you're doing checksums on, while f is your checksum function.
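A minimal sketch of that idea follows. SHA-256 is an assumption, since the question does not say which checksum is used, and the helper make_demo_files and the file names are made up so that the example can run on its own.

```python
import hashlib
import os
import tempfile
from multiprocessing import Pool


def checksum(path):
    """Return the SHA-256 hex digest of the file at path, read in blocks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for block in iter(lambda: fh.read(65536), b''):
            digest.update(block)
    return digest.hexdigest()


def checksum_all(paths, processes=2):
    # map returns the results in the same order as the input paths
    with Pool(processes=processes) as pool:
        return pool.map(checksum, paths)


def make_demo_files(directory):
    # Hypothetical helper: writes two identical files to compare
    paths = []
    for name in ('a.bin', 'b.bin'):
        path = os.path.join(directory, name)
        with open(path, 'wb') as fh:
            fh.write(b'identical contents')
        paths.append(path)
    return paths


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        first, second = checksum_all(make_demo_files(tmp))
        print(first == second)
```

Because map hands back the function's return values directly, there is no need to inspect the Process objects themselves.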

Collecting result from different process in python

I am running several processes together. Each of the processes returns some results. How would I collect those results from the processes?
task_1 = Process(target=do_this_task,args=(para_1,para_2))
task_2 = Process(target=do_this_task,args=(para_1,para_2))
do_this_task returns some results. I would like to collect those results and save them in some variable.
So right now I would suggest you should use the python multiprocessing module's Pool as it handles quite a bit for you. Could you elaborate what you're doing and why you want to use what I assume to be multiprocessing.Process directly?
If you still want to use multiprocessing.Process directly you should use a Queue to get the return values.
Example given in the docs:
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())  # prints "[42, None, 'hello']"
    p.join()
(from the multiprocessing docs)
Processes usually run in the background to do some work. When you do multiprocessing with them, you need to 'throw around' the data, since processes don't share memory like threads do. That's why you use the Queue: it does this for you. Another thing you can use is pipes, and conveniently the docs give an example for that as well :).
from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # prints "[42, None, 'hello']"
    p.join()
(from the multiprocessing docs)
What this does is manually use pipes to pass the finished results to the 'parent process' in this case.
Also, I sometimes find cases that multiprocessing cannot pickle well, so I use this great answer (or my own modified, specialized variants of it) by mrule that he posts here:
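from multiprocessing import Process, Pipe

def spawn(f):
    def fun(pipe, x):
        pipe.send(f(x))
        pipe.close()
    return fun

def parmap(f, X):
    # Note: the closure and the lambda below cannot be pickled, so this
    # relies on the 'fork' start method (it fails under 'spawn').
    pipe = [Pipe() for x in X]
    proc = [Process(target=spawn(f), args=(c, x)) for x, (p, c) in zip(X, pipe)]
    [p.start() for p in proc]
    # Receive before joining so a large result cannot fill the pipe and block the child.
    results = [p.recv() for (p, c) in pipe]
    [p.join() for p in proc]
    return results

if __name__ == '__main__':
    print(parmap(lambda x: x**x, range(1, 5)))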
You should be warned, however, that this takes over control of the processes manually, so certain things (unexpected signals, for example) can leave 'dead' processes lying around, which is not a good thing. It is an example of using pipes for multiprocessing, though :).
If those commands are not in Python, e.g. you want to run ls, then you might be better served by subprocess. os.system is no longer the recommended tool, as subprocess is considered easier to use and more flexible; a small discussion is presented here.
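For the external-command case, a minimal sketch with subprocess.run might look like this. The wrapper name run_command is made up, and the child command is just another Python interpreter so that the example runs on any platform.

```python
import subprocess
import sys


def run_command(args):
    """Run an external command and return what it wrote to stdout."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == '__main__':
    # Passing a list of arguments avoids shell quoting issues
    output = run_command([sys.executable, '-c', 'print("hello from a child")'])
    print(output, end='')
```

With check=True a non-zero exit status raises subprocess.CalledProcessError, so failures are not silently ignored.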
You can do something like this with multiprocessing:
from multiprocessing import Pool

mydict = {}
with Pool(processes=5) as pool:
    task_1 = pool.apply_async(do_this_task, args=(para_1, para_2))
    task_2 = pool.apply_async(do_this_task, args=(para_1, para_2))
    mydict.update({"task_1": task_1.get(), "task_2": task_2.get()})
print(mydict)
or if you would like to try multithreading with concurrent.futures then take a look at this answer.
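A thread-based variant with concurrent.futures could look roughly like the sketch below. The body of do_this_task is a placeholder, since the question does not show it, and run_tasks is a made-up name.

```python
from concurrent.futures import ThreadPoolExecutor


def do_this_task(para_1, para_2):
    # Placeholder body: the real task is not shown in the question
    return para_1 + para_2


def run_tasks(task_args):
    """Run do_this_task once per entry and collect the results by name."""
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {name: executor.submit(do_this_task, *args)
                   for name, args in task_args.items()}
        # result() blocks until the corresponding task has finished
        return {name: future.result() for name, future in futures.items()}


if __name__ == '__main__':
    print(run_tasks({'task_1': (1, 2), 'task_2': (3, 4)}))
```

Threads suit tasks that mostly wait on I/O. For CPU-bound work, ProcessPoolExecutor offers the same interface, with the usual requirement that the function and its arguments be picklable.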
If the processes are external scripts then try using the subprocess module. However, your code suggests you want to run functions in parallel. For this, try the multiprocessing module. Some code from this answer for specific details of using multiprocessing:
from multiprocessing.pool import ThreadPool

def foo(bar, baz):
    print('hello {0}'.format(bar))
    return 'foo' + baz

pool = ThreadPool(processes=1)
async_result = pool.apply_async(foo, ('world', 'foo'))  # tuple of args for foo
# do some other stuff in the main thread while foo runs
return_val = async_result.get()  # get the return value from your function
