Avoiding loading spaCy data in each subprocess when multiprocessing - python

I would like to use spaCy in a program which is currently implemented with multiprocessing. Specifically I am using ProcessingPool to spawn 4 subprocesses which then go off and do their merry tasks.
To use spaCy (specifically for POS tagging), I need to invoke spacy.load('en'), which is an expensive call (takes ~10 seconds). If I am to load this object within each subprocess then it will take ~40 seconds, as they are all reading from the same location. This is annoyingly long.
But I cannot figure out a way to get them to share the object which is being loaded. This object cannot be pickled, which means (as far as I know):
It cannot be passed into the Pool.map call
It cannot be stored and used by a Manager instance to then be shared amongst the processes
What can I do?

I don't know how you use Pool.map exactly, but be aware that Pool.map doesn't work well with a massive number of inputs. In Python 3.6 it's implemented in Lib/multiprocessing/pool.py; as you can see, it states that it takes an iterable as its first argument, but the implementation consumes the whole iterable before running the multiprocess map. So I don't think Pool.map is what you need if you have to process a lot of data. Pool.imap or Pool.imap_unordered might work instead.
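For example, a minimal sketch (my own, not from the question) that streams work through imap_unordered instead of materializing everything up front; tag_text and corpus.txt are hypothetical stand-ins for the real job and data:

import spacy
from multiprocessing import Pool

def tag_text(text):
    # hypothetical worker; stands in for the real POS-tagging work
    return len(text)

if __name__ == "__main__":
    with Pool(4) as pool:
        # the file is read lazily, so inputs are pulled as workers become free
        for result in pool.imap_unordered(tag_text, open("corpus.txt"), chunksize=32):
            print(result)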
About your actual issue: I have a solution that doesn't involve Pool.map and works kind of like a multiprocess foreach.
First you need to subclass Process and create a worker process:
import spacy

from multiprocessing import cpu_count
from multiprocessing import Queue
from multiprocessing import Process


class Worker(Process):

    english = spacy.load('en')

    def __init__(self, queue):
        super(Worker, self).__init__()
        self.queue = queue

    def run(self):
        for args in iter(self.queue.get, None):
            # process args here; you can use self.english for the POS tagging
            pass
You prepare the pool of processes like this:
queue = Queue()
workers = list()
for _ in range(cpu_count()):  # minus one if the main process is CPU intensive
    worker = Worker(queue)
    workers.append(worker)
    worker.start()
Then you can feed the pool via queue:
for args in iterable:
    queue.put(args)
iterable is the list of arguments that you pass to the workers. The above code will push the contents of iterable as fast as it can. Basically, if the workers are slow enough, almost all of the iterable will be pushed to the queue before the workers have finished their job. That's why the contents of the iterable must fit into memory.
If the worker arguments (i.e. the iterable) can't fit into memory, you must somehow synchronize the main process and the workers, for instance as in the sketch below.
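One simple way to do that (a sketch of my own, not part of the original answer) is to give the queue a maxsize, so that queue.put() applies back-pressure and blocks whenever the workers fall behind:

queue = Queue(maxsize=2 * cpu_count())  # put() blocks once this many items are pending

for args in iterable:
    queue.put(args)  # now paced by how fast the workers consume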
At the end make sure to call the following:
for worker in workers:
    queue.put(None)

for worker in workers:
    worker.join()
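If you also need to get results back into the main process (the question is about POS tagging, after all), one option is a second queue. A minimal sketch, where TaggingWorker and results_queue are hypothetical names of mine rather than anything from the question:

results_queue = Queue()

class TaggingWorker(Worker):
    def __init__(self, queue, results_queue):
        super(TaggingWorker, self).__init__(queue)
        self.results_queue = results_queue

    def run(self):
        for text in iter(self.queue.get, None):
            doc = self.english(text)
            self.results_queue.put([(token.text, token.pos_) for token in doc])

Remember to drain results_queue in the main process before joining the workers, otherwise the join can deadlock on the queue's internal buffer (see the last answer on this page).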

Related

Multiprocessing in python. Can a multiprocessed function call functions as multiprocesses?

Recently I have started using Python's ProcessPoolExecutor to accelerate my processing.
So instead of doing a
list_of_res = []
for n in range(a_number):
    res = calculate_something(list_of_sources[n])
    list_of_res.append(res)
joint_results = pd.concat(list_of_res)
I do
with ProcessPoolExecutor(max_workers=8) as executor:
    joint_results = pd.concat(executor.map(calculate_something, list_of_sources))
It works great.
However, I've noticed that inside the calculate_something function I call the same function about 8 times, one after another, so I might as well apply a map to them instead of a loop.
My question is, can I apply multiprocessing to a function that is already being called in multiprocess?
Yes, you can have a worker process spawn another pool of workers, but it is not optimal.
Each time you launch a new process, it takes a few hundred milliseconds to a few seconds for this new process to initialize and start executing work (OS-, disk- and code-dependent).
Launching a worker from a worker just wastes the overhead of spawning the first child to begin with; you are better off extracting the loop inside calculate_something and launching it directly within your initial executor.
A better approach is to launch your initial calculate_something using a ThreadPoolExecutor and have one shared ProcessPoolExecutor that all your thread workers push work into. This way you can limit the number of newly created processes and avoid creating and deleting many more workers than you actually need, and launching a thread pool only takes a few microseconds.
Here is an example of how to nest a thread pool and a process pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def process_worker(n):
    print(n)
    return n

def thread_worker(list_of_n, process_pool: ProcessPoolExecutor):
    work_done = list(process_pool.map(process_worker, list_of_n))
    return work_done

if __name__ == "__main__":
    list_of_lists_of_n = [[1, 2, 3], [4, 5, 6]]
    with ProcessPoolExecutor() as process_pool, ThreadPoolExecutor() as threadpool:
        tasks = []
        work_done = []
        for item in list_of_lists_of_n:
            tasks.append(threadpool.submit(thread_worker, item, process_pool))
        for item in tasks:
            work_done.append(item.result())
    print(work_done)

How to fork and join multiple subprocesses with a global timeout in Python?

I want to execute some tasks in parallel in multiple subprocesses and time out if the tasks were not completed within some delay.
A first approach consists of forking and joining the subprocesses individually, with remaining timeouts computed with respect to the global timeout, as suggested in this answer. It works fine for me.
A second approach, which I want to use here, consists of creating a pool of subprocesses and waiting with the global timeout, as suggested in this answer.
However I have a problem with the second approach: after feeding the pool of subprocesses with tasks that have multiprocessing.Event() objects, waiting for their completion raises this exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
Here is the Python code snippet:
import multiprocessing.pool
import time

class Worker:
    def __init__(self):
        self.event = multiprocessing.Event()  # commenting this removes the RuntimeError

    def work(self, x):
        time.sleep(1)
        return x * 10

if __name__ == "__main__":
    pool_size = 2
    timeout = 5
    with multiprocessing.pool.Pool(pool_size) as pool:
        result = pool.map_async(Worker().work, [4, 5, 2, 7])
        print(result.get(timeout))  # raises the RuntimeError
In the "Programming guidlines" section of the multiprocessing — Process-based parallelism documentation, there is this paragraph:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So multiprocessing.Event() caused a RuntimeError because it is not picklable, as demonstrated by the following Python code snippet:
import multiprocessing
import pickle
pickle.dumps(multiprocessing.Event())
which raises the same exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
A solution is to use a proxy object:
A proxy is an object which refers to a shared object which lives (presumably) in a different process.
because:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
multiprocessing.Manager().Event() creates a shared threading.Event() object and returns a proxy for it, so replacing this line:
self.event = multiprocessing.Event()
by the following line in the Python code snippet of the question solves the problem:
self.event = multiprocessing.Manager().Event()
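Putting it together, here is a sketch of the question's snippet with that one-line change applied (same Worker class and inputs as above):

import multiprocessing
import multiprocessing.pool
import time

class Worker:
    def __init__(self):
        self.event = multiprocessing.Manager().Event()  # proxy object, so it is picklable

    def work(self, x):
        time.sleep(1)
        return x * 10

if __name__ == "__main__":
    with multiprocessing.pool.Pool(2) as pool:
        result = pool.map_async(Worker().work, [4, 5, 2, 7])
        print(result.get(5))  # prints [40, 50, 20, 70] instead of raising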

multiprocessing's Queue inside Manager.Namespace()

I am currently creating a class which is supposed to execute some methods in a multi-threaded way, using the multiprocessing module. I execute the real computation using a Pool of n workers. Now I wanted to assign each of the currently n active workers an index between 0 and n for some other calculation. To do this, I wanted to use a shared Queue to assign indices in such a way that at any time no two workers have the same id. To share the same Queue inside the class between the different threads, I wanted to store it inside a Manager.Namespace(). But doing this, I ran into some problems with the Queue. Therefore, I created a minimal version of my problem and ended up with something like this:
from multiprocess import Process, Queue, Manager, Pool, cpu_count

class A(object):
    def __init__(self):
        manager = Manager()
        self.ns = manager.Namespace()
        self.ns.q = manager.Queue()

    def foo(self):
        for i in range(10):
            print(i)
            self.ns.q.put(i)
            print(self.ns.q.get())
            print(self.ns.q.qsize())

a = A()
a.foo()
In this code, the execution stops before the second print statement; therefore, I think, no data is actually written to the Queue. When I remove the namespace-related stuff, the code works flawlessly. Is this the intended behaviour of the multiprocessing objects and am I doing something wrong? Or is this some kind of bug?
Yes, you should not use Namespace here. When you put a Queue object into manager.Namespace(), each process will get a new Queue instance; all the writers/readers of those newly created queue objects have no connection with the parent process, therefore no message will be received by the worker processes. Share the Queue directly instead, as sketched below.
By the way, you mentioned "thread" many times, but in the context of the multiprocess module a worker is a process, not a thread.
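For illustration, a minimal sketch of "share the Queue directly" applied to the question's class (my own adaptation, keeping the names from the question):

from multiprocess import Manager

class A(object):
    def __init__(self):
        manager = Manager()
        self.q = manager.Queue()  # keep the manager queue itself, no Namespace in between

    def foo(self):
        for i in range(10):
            print(i)
            self.q.put(i)
            print(self.q.get())
            print(self.q.qsize())

a = A()
a.foo()

The manager queue proxy is picklable, so it can also be passed as an argument to Pool workers, which is what makes it usable for handing out worker indices.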

python multiprocessing .join() deadlock depends on worker function

I am using the Python multiprocessing library to spawn 4 Process() objects to parallelize a CPU-intensive task. The task (inspiration and code from this great article) is to compute the prime factors of every integer in a list.
main.py:
import random
import multiprocessing
import sys

num_inputs = 4000
num_procs = 4
proc_inputs = num_inputs/num_procs
input_list = [int(1000*random.random()) for i in xrange(num_inputs)]

output_queue = multiprocessing.Queue()
procs = []

for p_i in xrange(num_procs):
    print "Process [%d]"%p_i
    proc_list = input_list[proc_inputs * p_i:proc_inputs * (p_i + 1)]
    print " - num inputs: [%d]"%len(proc_list)

    # Using target=worker1 HANGS on join
    p = multiprocessing.Process(target=worker1, args=(p_i, proc_list, output_queue))
    # Using target=worker2 RETURNS with success
    #p = multiprocessing.Process(target=worker2, args=(p_i, proc_list, output_queue))

    procs.append(p)
    p.start()

for p in procs:
    print "joining ", p, output_queue.qsize(), output_queue.full()
    p.join()
    print "joined ", p, output_queue.qsize(), output_queue.full()

print "Processing complete."
ret_vals = []
while output_queue.empty() == False:
    ret_vals.append(output_queue.get())
print len(ret_vals)
print sys.getsizeof(ret_vals)
Observation:
If the target for each process is the function worker1, for an input list larger than 4000 elements the main thread gets stuck on .join(), waiting for the spawned processes to terminate and never returns.
If the target for each process is the function worker2, for the same input list the code works just fine and the main thread returns.
This is very confusing to me, as the only difference between worker1 and worker2 (see below) is that the former inserts individual lists in the Queue whereas the latter inserts a single list of lists for each process.
Why is there deadlock using worker1 and not using worker2 target?
Shouldn't both (or neither) exceed the multiprocessing Queue maxsize limit of 32767?
worker1 vs worker2:
def worker1(proc_num, proc_list, output_queue):
    '''worker function which deadlocks'''
    for num in proc_list:
        output_queue.put(factorize_naive(num))

def worker2(proc_num, proc_list, output_queue):
    '''worker function that works'''
    workers_stuff = []
    for num in proc_list:
        workers_stuff.append(factorize_naive(num))
    output_queue.put(workers_stuff)
There are a lot of similar questions on SO, but I believe the core of this question is clearly distinct from all of them.
Related Links:
https://sopython.com/canon/82/programs-using-multiprocessing-hang-deadlock-and-never-complete/
python multiprocessing issues
python multiprocessing - process hangs on join for large queue
Process.join() and queue don't work with large numbers
Python 3 Multiprocessing queue deadlock when calling join before the queue is empty
Script using multiprocessing module does not terminate
Why does multiprocessing.Process.join() hang?
When to call .join() on a process?
What exactly is Python multiprocessing Module's .join() Method Doing?
The docs warn about this:
Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
While a Queue appears to be unbounded, under the covers queued items are buffered in memory to avoid overloading inter-process pipes. A process cannot end normally before those memory buffers are flushed. Your worker1() puts a lot more items on the queue than your worker2(), and that's all there is to it. Note that the number of items that can be queued before the implementation resorts to buffering in memory isn't defined: it can vary across OS and Python release.
As the docs suggest, the normal way to avoid this is to .get() all the items off the queue before you attempt to .join() the processes. As you've discovered, whether it's necessary to do so depends in an undefined way on how many items have been put on the queue by each worker process.
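Applied to the snippet in the question, that would look roughly like this (a sketch, exploiting the fact that with worker1 exactly one result is produced per input, so the total number of get() calls is known):

ret_vals = []
for _ in xrange(num_inputs):      # drain the queue first...
    ret_vals.append(output_queue.get())

for p in procs:                   # ...then joining is safe
    p.join()

print "Processing complete."
print len(ret_vals)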

Shared pool map between processes with object-oriented python

(python2.7)
I'm trying to write a kind of scanner that has to walk through CFG nodes and split into different processes on branching, for parallelism purposes.
The scanner is represented by an object of class Scanner. This class has one method traverse that walks through the said graph and splits if necessary.
Here is how it looks:
class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1
        self.process_pool = Pool(processes=4)

    def traverse(self, ...):
        [...]
        if branch:
            self.process_pool.map(my_func, todo_list)
My problem is the following:
How do I create an instance of multiprocessing.Pool that is shared between all of my processes? I want it to be shared because, since a path can be split again, I do not want to end up with a kind of fork bomb, and having the same Pool will help me limit the number of processes running at the same time.
The above code does not work, since Pool cannot be pickled. As a consequence, I tried this:
class Scanner(object):
    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['process_pool']
        return self_dict

    [...]
But obviously, it results in having self.process_pool not defined in the created processes.
Then, I tried to create a Pool as a module attribute:
process_pool = Pool(processes=4)

def my_func(x):
    [...]

class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1

    def traverse(self, ...):
        [...]
        if branch:
            process_pool.map(my_func, todo_list)
It does not work, and this answer explains why.
But here comes the thing: wherever I create my Pool, something is missing. If I create this Pool at the end of my file, it does not see self.attribute1, the same way it did not see answer, and fails with an AttributeError.
I'm not even trying to share it yet, and I'm already stuck with multiprocessing's way of doing things.
I don't know if I have been thinking about the whole thing incorrectly, but I cannot believe it's so complicated to handle something as simple as "having a worker pool and giving them tasks".
Thank you,
EDIT:
I resolved my first problem (the AttributeError): my class had a callback as an attribute, and this callback was defined in the main script file, after the import of the scanner module... But the concurrency and the "do not fork bomb" thing is still a problem.
What you want to do can't be done safely. Think about what happens if you somehow had a single Pool shared across parent and worker processes, with, say, two worker processes. The parent runs a map that tries to perform two tasks, and each task needs to map two more tasks. The two parent-dispatched tasks go to each worker, and the parent blocks. Each worker sends two more tasks to the shared pool and blocks for them to complete. But now all the workers are occupied, waiting for a worker to become free: you've deadlocked.
A safer approach would be to have the workers return enough information to dispatch additional tasks in the parent. Then you could do something like:
class MoreWork(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args

pool = multiprocessing.Pool()
try:
    base_task = somefunc, someargs
    outstanding = collections.deque([pool.apply_async(*base_task)])
    while outstanding:
        result = outstanding.popleft().get()
        if isinstance(result, MoreWork):
            outstanding.append(pool.apply_async(result.func, result.args))
        else:
            ...  # do something with a "final" result, maybe breaking the loop
finally:
    pool.terminate()
What the functions are is up to you; they'd just return information in a MoreWork when there was more to do, rather than launch a task directly. The point is that, by having the parent be solely responsible for task dispatch and the workers solely responsible for task completion, you can't deadlock due to all workers being blocked waiting for tasks that are in the queue but not being processed.
This is also not at all optimized; ideally, you wouldn't block waiting on the first item in the queue if other items in the queue were complete. It's a lot easier to do this with the concurrent.futures module, specifically with concurrent.futures.wait to wait on the first available result from an arbitrary number of outstanding tasks, but you'd need a third-party PyPI package to get concurrent.futures on Python 2.7.
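For completeness, here is a rough sketch of that concurrent.futures variant (assuming Python 3, or the futures backport on 2.7; somefunc, someargs and MoreWork are the same placeholders as above):

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    outstanding = {executor.submit(somefunc, *someargs)}
    while outstanding:
        # block only until the first result is ready, not on a fixed task
        done, outstanding = concurrent.futures.wait(
            outstanding, return_when=concurrent.futures.FIRST_COMPLETED)
        for future in done:
            result = future.result()
            if isinstance(result, MoreWork):
                outstanding.add(executor.submit(result.func, *result.args))
            else:
                pass  # handle a "final" result here, maybe breaking out of the loop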
