Inverse of ProcessPoolExecutor in Python

This is related to my earlier problem which I'm still working on solving. Essentially I need the inverse design of ProcessPoolExecutor, where I have many querying processes and one worker which calculates and sends back results in batches.
Sending the work items is easy with one shared queue, but I still don't have a nice solution for sending all the results back to the right threads on the right processes.

I think it makes the most sense to have a separate multiprocessing.Pipe for each querying process. The worker process waits for an available item on any pipe, then dequeues and processes it, keeping track of which pipe it came from. When it's time to send data back, it feeds the results onto the correct pipe.
Here's a simple example:
#!/usr/bin/env python3
import multiprocessing as mp
import multiprocessing.connection  # make sure the connection submodule (for wait()) is loaded

def worker(pipes):
    quit = [False] * len(pipes)
    results = [''] * len(pipes)
    # Wait for all query procs to send None before quitting
    while not all(quit):
        ready = mp.connection.wait(pipes)
        for pipe in ready:
            # Get index of query proc's pipe
            i = pipes.index(pipe)
            # Receive and "process"
            obj = pipe.recv()
            if obj is None:
                quit[i] = True
                continue
            result = str(obj)
            results[i] += result
            # Send back to query proc
            pipes[i].send(result)
    print(results)

def query(pipe):
    for i in 'do some work':
        pipe.send(i)
        assert pipe.recv() == i
    pipe.send(None)  # Send sentinel

if __name__ == '__main__':
    nquery_procs = 8
    work_pipes, query_pipes = zip(*(mp.Pipe() for _ in range(nquery_procs)))
    query_procs = [mp.Process(target=query, args=(pipe,)) for pipe in query_pipes]
    for p in query_procs:
        p.start()
    worker(work_pipes)
    for p in query_procs:
        p.join()
Alternatively, you could give each querying process an ID number (which might just be its pipe's index), and require every request to be a tuple of (id_num, data). This just saves the worker process from doing pipes.index(pipe) on each loop, so I'm not sure how much it buys you; a sketch of that variant follows.
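Here is a minimal sketch of that variant (my own illustration, assuming the same per-process reply pipes as above): every message, including the sentinel, carries the sender's ID, so the worker never needs to look up the pipe's index.
import multiprocessing as mp
import multiprocessing.connection  # for mp.connection.wait()

def worker(pipes):
    quit = [False] * len(pipes)
    while not all(quit):
        for pipe in mp.connection.wait(pipes):
            id_num, data = pipe.recv()        # every message is (id_num, data)
            if data is None:                  # per-process sentinel
                quit[id_num] = True
                continue
            pipes[id_num].send(str(data))     # reply on that proc's own pipe

def query(pipe, id_num):
    for ch in 'work':
        pipe.send((id_num, ch))               # tag each request with our ID
        assert pipe.recv() == ch
    pipe.send((id_num, None))

if __name__ == '__main__':
    n = 4
    work_pipes, query_pipes = zip(*(mp.Pipe() for _ in range(n)))
    procs = [mp.Process(target=query, args=(pipe, i))
             for i, pipe in enumerate(query_pipes)]
    for p in procs:
        p.start()
    worker(work_pipes)
    for p in procs:
        p.join()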

Related

Python multiprocessing: How to start processes that depend on each other?

I have a basic question regarding the Python multiprocessing method: how can different processes, which use queues to transfer data, optimally be started?
For that I use a simple example where
Data is received
Data is processed
Data is sent
All of the above steps should happen in parallel through three different processes.
Here the example code:
import multiprocessing
import keyboard
import time

def getData(queue_raw):
    for num in range(1000):
        queue_raw.put(num)
        print("getData: put " + str(num) + " in queue_raw")
    while True:
        if keyboard.read_key() == "s":
            break

def calcFeatures(queue_raw, queue_features):
    while not queue_raw.empty():
        data = queue_raw.get()
        queue_features.put(data**2)
        print("calcFeatures: put " + str(data**2) + " in queue_features")

def sendFeatures(queue_features):
    while not queue_features.empty():
        feature = queue_features.get()
        print("sendFeatures: put " + str(feature) + " out")

if __name__ == "__main__":
    queue_raw = multiprocessing.Queue()
    queue_features = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=getData, args=(queue_raw,)),
        multiprocessing.Process(target=calcFeatures, args=(queue_raw, queue_features,)),
        multiprocessing.Process(target=sendFeatures, args=(queue_features,))
    ]
    processes[0].start()
    time.sleep(0.1)
    processes[1].start()
    time.sleep(0.1)
    processes[2].start()
    #for p in processes:
    #    p.start()
    for p in processes:
        p.join()
This program works, but my question is regarding the start of the different processes.
Ideally, process[1] should start only once process[0] has put data in queue_raw, and process[2] should start only once process[1] has put the calculated features in queue_features.
Right now I did that through the time.sleep() function, which is suboptimal, since I don't necessarily know how long the processes will take.
I also tried something like:
processes[0].start()
while queue_raw.empty():
time.sleep(0.5)
processes[1].start()
But it won't work, since it only handles the first dependency. Is there a way to implement this kind of dependent start?
@moooeeeep pointed out the right approach in a comment:
checking with while not queue.empty(): does not wait until data is actually in the queue!
An approach via a sentinel object (here None) and a while True loop will make the process wait until the other processes put data in the queue:
FLAG_STOP = False
while FLAG_STOP is False:
    data = queue_raw.get()  # get() blocks until an item is available
    if data is None:
        # Finish analysis
        FLAG_STOP = True
    else:
        ...  # work with data
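Putting it together, here is a minimal end-to-end sketch along those lines (my own illustration, dropping the keyboard check and using a fixed amount of data, just to show the sentinel hand-off between the three stages):
import multiprocessing

SENTINEL = None

def getData(queue_raw):
    for num in range(1000):
        queue_raw.put(num)
    queue_raw.put(SENTINEL)                 # tell the next stage we're done

def calcFeatures(queue_raw, queue_features):
    while True:
        data = queue_raw.get()              # blocks until data arrives
        if data is SENTINEL:
            queue_features.put(SENTINEL)    # forward the sentinel downstream
            break
        queue_features.put(data**2)

def sendFeatures(queue_features):
    while True:
        feature = queue_features.get()
        if feature is SENTINEL:
            break
        print("sendFeatures:", feature)

if __name__ == "__main__":
    queue_raw = multiprocessing.Queue()
    queue_features = multiprocessing.Queue()
    stages = [
        multiprocessing.Process(target=getData, args=(queue_raw,)),
        multiprocessing.Process(target=calcFeatures, args=(queue_raw, queue_features)),
        multiprocessing.Process(target=sendFeatures, args=(queue_features,)),
    ]
    for p in stages:
        p.start()                           # start order no longer matters
    for p in stages:
        p.join()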

How to return values from Process- or Thread instances?

So I want to run a function which can either search for information on the web or fetch it directly from my own MySQL database.
The first process will be time-consuming, the second relatively fast.
With this in mind I create a process which starts this compound search (find_compound_view). If the process finishes relatively fast it means it's present in the database, so I can render the results immediately. Otherwise, I will render "drax_retrieving_data.html".
The stupid solution I came up with was to run the function twice, once to check whether the process takes a long time, and once to actually get the return values of the function. This is pretty much because I don't know how to return the values of my find_compound_view function. I've tried googling but I can't seem to find how to return values from the Process class specifically.
p = Process(target=find_compound_view, args=(form,))
p.start()
is_running = p.is_alive()
start_time = time.time()
while is_running:
    time.sleep(0.05)
    is_running = p.is_alive()
    if time.time() - start_time > 10:
        print('Timer exceeded, DRAX is retrieving info!', time.time() - start_time)
        return render(request, 'drax_internal_dbs/drax_retrieving_data.html')
compound = find_compound_view(form, use_email=False)
if compound:
    data = *****
    return render(request, 'drax_internal_dbs/result.html', data)
You will need a multiprocessing.Pipe or a multiprocessing.Queue to send the results back to your parent process. If you're only doing I/O, you should use a Thread instead of a Process, since it's more lightweight and most of the time will be spent waiting. I'm showing you how it's done for Process and Thread in general.
Process with Queue
The multiprocessing queue is built on top of a pipe and access is synchronized with locks/semaphores. Queues are thread- and process-safe, meaning you can use one queue for multiple producer/consumer processes and even multiple threads within those processes. Adding the first item to the queue also starts a feeder thread in the calling process. Because of this additional overhead, a plain pipe is preferable and more performant for single-producer/single-consumer scenarios.
Here's how to send and retrieve a result with a multiprocessing.Queue:
from multiprocessing import Process, Queue

SENTINEL = 'SENTINEL'

def sim_busy(out_queue, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    p = Process(target=sim_busy, args=(out_queue, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel breaks the loop
        print(result)
The queue is passed as an argument into the function, results are .put() on the queue and the parent .get()s from the queue. .get() is a blocking call; execution does not resume until there is something to get (specifying a timeout parameter is possible). Note that the work sim_busy does here is CPU-intensive; that's when you would choose processes over threads.
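For completeness, a small sketch of that timeout option (my own illustration, reusing out_queue from the example above; not part of the original answer): .get(timeout=...) raises queue.Empty if nothing arrives in time.
import queue  # multiprocessing.Queue.get() raises queue.Empty on timeout

try:
    result = out_queue.get(timeout=2)  # wait at most 2 seconds for a result
except queue.Empty:
    result = None  # decide how to handle the missing result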
Process & Pipe
For one-to-one connections a pipe is enough. The setup is nearly identical; only the methods are named differently, and a call to Pipe() returns two connection objects. In duplex mode both objects are read-write ends; with duplex=False (simplex) the first connection object is the read end of the pipe and the second is the write end. In this basic scenario we just need a simplex pipe:
from multiprocessing import Process, Pipe

SENTINEL = 'SENTINEL'

def sim_busy(write_conn, x):
    for _ in range(int(x)):
        assert 1 == 1
    result = x
    write_conn.send(result)
    # If all results are sent, send a sentinel-value to let the parent know
    # no more results will come.
    write_conn.send(SENTINEL)

if __name__ == '__main__':
    # duplex=False because we just need one-way communication in this case.
    read_conn, write_conn = Pipe(duplex=False)
    p = Process(target=sim_busy, args=(write_conn, 150e6))  # 150e6 == 150000000.0
    p.start()
    for result in iter(read_conn.recv, SENTINEL):  # sentinel breaks the loop
        print(result)
Thread & Queue
For use with threading, you want to switch to queue.Queue. queue.Queue is built on top of a collections.deque, adding some locks to make it thread-safe. Unlike with multiprocessing's queue and pipe, objects put on a queue.Queue won't get pickled. Since threads share the same memory address space, serialization for memory copying is unnecessary; only references are passed.
from threading import Thread
from queue import Queue
import time

SENTINEL = 'SENTINEL'

def sim_io(out_queue, query):
    time.sleep(1)
    result = query + '_result'
    out_queue.put(result)
    # If all results are enqueued, send a sentinel-value to let the parent know
    # no more results will come.
    out_queue.put(SENTINEL)

if __name__ == '__main__':
    out_queue = Queue()
    p = Thread(target=sim_io, args=(out_queue, 'my_query'))
    p.start()
    for result in iter(out_queue.get, SENTINEL):  # sentinel-value breaks the loop
        print(result)
Read here why for result in iter(out_queue.get, SENTINEL): should be preferred over a while True...break setup, where possible.
Read here why you should use if __name__ == '__main__': in all your scripts and especially in multiprocessing.
More about get()-usage here.

join() method in multiprocessing with input and output Queue()

I have a SOMETIMES working piece of code:
def my_map2(func, iterable, processes=None):
    ''' FASTER THAN mp.Pool() - NOT '''
    import multiprocessing as mp  # Load multiprocessing library
    mp.freeze_support()
    if (processes == None):
        processes = mp.cpu_count()  # Set maximum number of cores?
    L = len(iterable)
    iter2 = zip(iterable, range(L))
    IN = mp.Queue()
    for x in iter2:
        IN.put(x)
    OUT = mp.Queue()  # = mp.JoinableQueue() Q3
    lock = mp.Lock()  # Q2

    def target_fun(IN, OUT):
        while not IN.empty():
            inp = IN.get()
            out = (inp[0], func(inp[1]))
            with lock:  # Q2
                OUT.put(out)

    proc = [mp.Process(target=target_fun, args=(IN, OUT,)) for x in range(processes)]
    for p in proc: p.start()                  # Run proc
    for p in proc: p.join()                   # Exit the completed proc Q1
    results = [OUT.get() for p in range(L)]   # Get Results Back Q1
    results.sort()                            # Sort
    return( [r[1] for r in results] )

import time

def f(x):
    time.sleep(0.1)
    return(x)

res = my_map2(f, range(100))
Questions:
join() is called before collecting from the queue. Why does this code work at all? join() is invoked before the Queue is empty, so it should deadlock according to the documentation:
any Queue that a Process has put data on must be drained prior to joining the processes which have put data there: otherwise, you’ll get a deadlock.
However, if the 2 lines marked with 'Q1' are swapped, then we should run into the issue of getting from an empty queue, yet both variants seem to SOMETIMES work...
If I remove the Lock() (edit the 2 lines marked Q2), it does deadlock. Why now?
Queues are thread and process safe. When using multiple processes, one generally uses message passing for communication between processes and avoids having to use any synchronization primitives like locks. (talking about pipes and queues)
JoinableQueue: if I try using it, it does not work. I tried everything:
If you use JoinableQueue then you must call JoinableQueue.task_done().
If a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread()), then that process will not terminate until all buffered items have been flushed to the pipe.
Question 3 is: So what is the actual difference between a normal and a joinable queue? Does it mean that you can terminate a child that has put something on a normal Queue at any time? I thought only manager processes worked like this...
Also, how do you correctly use both kinds of queue?
I have read the library reference, quite a few tutorials and Stack Overflow posts, but does anybody really understand how this works? (I'm talking about join(), the different Queue types, when the lock is needed, and an accurate understanding of the processes behind it.)
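Since the question asks how both kinds of queue are used correctly, here is a minimal sketch of the usual JoinableQueue pattern (my own illustration, not the asker's code, and not a full answer to the deadlock questions): every get() is matched by a task_done(), queue.join() blocks until all enqueued items are accounted for, and the result queue is drained before the worker processes are joined.
import multiprocessing as mp

def consumer(tasks, results):
    while True:
        item = tasks.get()
        if item is None:            # sentinel: stop this worker
            tasks.task_done()
            break
        results.put(item * item)
        tasks.task_done()           # mark the fetched item as processed

if __name__ == '__main__':
    tasks = mp.JoinableQueue()
    results = mp.Queue()
    nprocs = 4
    procs = [mp.Process(target=consumer, args=(tasks, results)) for _ in range(nprocs)]
    for p in procs:
        p.start()
    for x in range(20):
        tasks.put(x)
    for _ in range(nprocs):
        tasks.put(None)             # one sentinel per worker
    tasks.join()                    # blocks until every put() has a matching task_done()
    out = [results.get() for _ in range(20)]  # drain results before joining the workers
    for p in procs:
        p.join()
    print(sorted(out))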

Multiprocessing has cutoff at 992 integers being joined as result

I am following this book http://doughellmann.com/pages/python-standard-library-by-example.html
Along with some online references, I have an algorithm set up for multiprocessing where I have a large array of dictionaries and do some calculation on it. I use multiprocessing to divide up the indexes on which the calculations are done on the dictionary. To make the question more general, I replaced the algorithm with just some array of return values. From the information I found online and in other SO posts, I think it has to do with the join method.
The structure is like so,
Generate some fake data, call the manager function for multiprocessing, create a Queue, divide data over the number of index. Loop through the number of processes to use, send each process function the correct index range. Lastly join the processes and print out the results.
What I have figured out is that if the function used by the processes tries to return a range(0,992), it works quickly; with range(0,993), it hangs. I tried this on two different computers with different specs.
The code is here:
import multiprocessing

def main():
    data = []
    for i in range(0,10):
        data.append(i)
    CalcManager(data,start=0,end=50)

def CalcManager(myData,start,end):
    print 'in calc manager'
    #Multi processing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []
    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start
    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print 'starting processes'
        new_end = new_start + interval
        #Make sure we dont go past the size of the data
        if new_end > end:
            new_end = end
        #Generate a new process and pass it the arguments
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(data,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1
    print 'finished starting'
    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()
    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print result

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    print 'started process'
    results = range(0,(992))
    result_q.put(results)
    return

if __name__== '__main__':
    main()
Is there something about these numbers specifically, or am I just missing something basic that has nothing to do with these numbers?
From my searches, it seems this is some memory issue with the join method, but the book does not really explain how to solve it using this setup. Is it possible to use this structure (I understand it mostly, so it would be nice if I could continue to use it) and also pass back large results? I know there are other methods to share data between processes, but that's not what I need; I just want to return the values and join them into one array once completed.
I can't reproduce this on my machine, but it sounds like the items put into the queue haven't been flushed to the underlying pipe. This will cause a deadlock if you try to terminate the process, according to the docs:
As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe. This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
If you're in this situation, your p.join() calls will hang forever, because there's still buffered data in the queue. You can avoid it by consuming from the queue before you join the processes:
#Print out the results
for i in range(nprocs):
    result = result_q.get()
    print result

#Join the processes to wait for all data/processes to be finished
for p in procs:
    p.join()
This doesn't affect the way the code works; each result_q.get() call will block until the result is placed on the queue, which has the same effect as calling join on all processes prior to calling get. The only difference is that you avoid the deadlock.

Multiprocessing with python3 only runs once

I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an SQLite database and passes them to an input_queue
2. Creates multiple processes that take items off the input_queue, run them through a function and put the result on the output_queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 only runs as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times and then stops. I assumed it would keep running until it had taken all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code. I used a list of numbers as a substitute for the database entries as it behaves the same way. I have 300 items in the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned).
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread

## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue
    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry*2
        self.out_queue.put(result)

#Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()
    # Create the Data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()
    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)
    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)
    data_proc.join()
    output_queue.put(None)

if __name__ == '__main__':
    output_results = Queue()

    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)

    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    # Perform multi processing
    perform_distributed_processing(dbList, 10, Processor, output_results)
    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simply Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry*2

def perform_distributed_processing(dbList, threads, processor_factory):
    # The executor's keyword argument is max_workers (not processes).
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)

if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import ProcessPoolExecutor, Future, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because they weren't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods depending on your needs; if this is a batch job you can even use the synchronous multiprocessing.Pool.map(). Instead of pushing to an input queue, you just pass an iterable (or write a generator) that yields the inputs to the workers, as in the sketch below.
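A rough illustration of that suggestion (my own sketch, reusing the simple Processor function from the earlier answer): the whole queue-and-thread setup collapses to a single Pool.map() call.
import multiprocessing

def Processor(data_entry):
    return data_entry * 2

if __name__ == '__main__':
    dbList = range(300)  # stand-in for the database entries
    with multiprocessing.Pool(processes=8) as pool:
        # map() blocks until every entry has been processed, preserving order;
        # use imap_unordered() to consume results as they arrive instead.
        for result in pool.map(Processor, dbList):
            print(result)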
