Let me start by saying that I'm not using a Queue, so this question is not a duplicate of this one and I'm not using a process pool, so it's not a duplicate of this one.
I have a Process object that uses a pool of thread workers to accomplish some task. For the sake of an MCVE, this task is just constructing a list of the integers from 0 to 9. Here's my source:
#!/usr/bin/env python3
from multiprocessing.pool import ThreadPool as Pool
from multiprocessing import Process
from sys import stdout


class Quest():
    def __init__(self):
        pass

    def doIt(self, i):
        return i


class Test(Process):
    def __init__(self, arg):
        super(Test, self).__init__()
        self.arg = arg
        self.pool = Pool()

    def run(self):
        quest = Quest()
        done = self.pool.map_async(quest.doIt, range(10), error_callback=print)
        stdout.flush()
        self.arg = [item for item in done.get()]

    def __str__(self):
        return str(self.arg)

    # I tried both with and without this method
    def join(self, timeout=None):
        self.pool.close()
        self.pool.join()
        super(Test, self).join(timeout)
test = Test("test")
print(test) # should print 'test' (and does)
test.start()
# this line hangs forever
_ = test.join()
print(test) # should print '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]'
This is a pretty rough model of what I want my actual program to do. The problem, as indicated in the comments, is that Test.join always hangs forever. That's totally independent of whether or not that method is overridden in the Test class. It also never prints anything, but the output when I send a KeyboardInterrupt signal indicates that the problem lies in getting the results from the workers:
test
^CTraceback (most recent call last):
File "./test.py", line 44, in <module>
Process Test-1:
_ = test.join()
File "./test.py", line 34, in join
super(Test, self).join(timeout)
File "/path/to/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/path/to/multiprocessing/popen_fork.py", line 51, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/path/to/multiprocessing/popen_fork.py", line 29, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
File "/path/to/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "./test.py", line 25, in run
self.arg = [item for item in done.get()]
File "/path/to/multiprocessing/pool.py", line 638, in get
self.wait(timeout)
File "/path/to/multiprocessing/pool.py", line 635, in wait
self._event.wait(timeout)
File "/path/to/threading.py", line 551, in wait
signaled = self._cond.wait(timeout)
File "/path/to/threading.py", line 295, in wait
waiter.acquire()
KeyboardInterrupt
Why won't the stupid process just exit? The only thing a worker does is a single dereference and function call that executes one operation; it should be really simple.
I forgot to mention: This works fine if I make Test a subclass of threading.Thread instead of multiprocessing.Process. I'm really not sure why this breaks it in half.
Your goal is to do this work asynchronously. Why not run the worker pool directly in your main process, without spawning a child process (class Test) at all? The results will be available in your main process and no fancy stuff needs to be done. You can stop reading here if you choose to do this. Otherwise, read on.
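For instance, a minimal sketch of that first suggestion, reusing your Quest class and skipping the Process subclass entirely:

from multiprocessing.pool import ThreadPool as Pool

class Quest():
    def doIt(self, i):
        return i

if __name__ == '__main__':
    quest = Quest()
    with Pool() as pool:
        # The pool lives in the main process, so the results are
        # directly available here; nothing needs to be joined by hand.
        result = pool.map_async(quest.doIt, range(10), error_callback=print).get()
    print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]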
Your join hangs forever because there are two separate pools: one created when you construct the process object (local to your main process), and another cloned into the child when you fork the process by calling test.start() (local to the spawned process).
For example, this doesn't work:
def __init__(self, arg, shared):
    super(Test, self).__init__()
    self.arg = arg
    self.quest = Quest()
    self.shared = shared
    self.pool = Pool()

def run(self):
    iterable = list(range(10))
    self.shared.extend(self.pool.map_async(self.quest.doIt, iterable, error_callback=print).get())
    print("1" + str(self.shared))
    self.pool.close()
However, this works:
def __init__(self, arg, shared):
    super(Test, self).__init__()
    self.arg = arg
    self.quest = Quest()
    self.shared = shared

def run(self):
    pool = Pool()
    iterable = list(range(10))
    self.shared.extend(pool.map_async(self.quest.doIt, iterable, error_callback=print).get())
    print("1" + str(self.shared))
    pool.close()
This has to do with the fact that when you spawn a process, the entire code, stack, and heap segments of your process are cloned into the child, so that your main process and subprocess have separate contexts.
So, you are calling join() on the pool object that lives in your main process (that's the one your overridden join() closes). Meanwhile, in run() there's another pool object that was cloned into the subprocess when start() was called; that pool was never closed and cannot be joined the way you're doing it. Simply put, your main process has no reference to the cloned pool object in the subprocess.
This works fine if I make Test a subclass of threading.Thread instead of multiprocessing.Process. I'm really not sure why this breaks it in half.
Makes sense, because threads differ from processes in that they have independent call stacks but share the other segments of memory, so any update you make to an object created in another thread is visible in your main process (which is the parent of these threads) and vice versa.
The resolution is to create the pool object locally in the run() function: close the pool in the subprocess context, and join the subprocess from the main process. Which brings us to the second issue...
Shared state: There are these multiprocessing.Manager() objects that allow for some sort of magical process-safe shared state between processes. Doesn't seem like the manager allows for reassignment of object references, which makes sense, because if you reassign the managed value in a subprocess, when the subprocess is terminated, that process context (code, stack, heap) disappears and your main process never sees this assignment (since it was done referencing an object local to the context of the subprocess). It may work for ctype primitive values, though.
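For a primitive value, a minimal sketch of that ctypes route (my own illustration, not from the original post) could use multiprocessing.Value:

from multiprocessing import Process, Value

def worker(counter):
    # The child writes through the shared ctypes wrapper,
    # so the parent sees the update after join().
    with counter.get_lock():
        counter.value += 10

if __name__ == "__main__":
    counter = Value("i", 0)   # shared 32-bit signed int
    p = Process(target=worker, args=(counter,))
    p.start()
    p.join()
    print(counter.value)      # 10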
If someone more experienced with Manager() wants to chime in on its innards, that'd be cool. But, the following code gives you your expected behavior:
#!/usr/bin/env python3
from multiprocessing.pool import ThreadPool as Pool
from multiprocessing import Process, Manager
from sys import stdout


class Quest():
    def __init__(self):
        pass

    def doIt(self, i):
        return i


class Test(Process):
    def __init__(self, arg, shared):
        super(Test, self).__init__()
        self.arg = arg
        self.quest = Quest()
        self.shared = shared

    def run(self):
        with Pool() as pool:
            iterable = list(range(10))
            self.shared.extend(pool.map_async(self.quest.doIt, iterable, error_callback=print).get())
            print("1" + str(self.shared))  # can remove, just to make sure we've updated state

    def __str__(self):
        return str(self.arg)


with Manager() as manager:
    res = manager.list()
    test = Test("test", res)
    print(test)  # should print 'test' (and does)

    test.start()
    test.join()

    print("2" + str(res))  # should print '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]'
Outputs:
rpg711$ python multiprocess_async_join.py
test
1[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in IPython shell instead of the regular Python, things work out well.
I looked up some previous notes on this problem. They were all caused by using a pool to call a function defined within a class function. But this is not the case for me.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I would appreciate any help.
Update: The function I pickle is defined at the top level of the module, though it calls a function that contains a nested function. That is, f() calls g(), which calls h(), which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), and h() are all defined at the top level. I tried a simpler example with this pattern, though, and it works.
Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.
This piece of code:
import multiprocessing as mp


class Foo():
    @staticmethod
    def work(self):
        pass


if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()
yields an error almost identical to the one you posted:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
The problem is that the pool methods all use an mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
It can be fixed by defining a function at the top level, which calls foo.work():
def work(foo):
    foo.work()

pool.apply_async(work, args=(foo,))
Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.
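As a quick sanity check (my own addition, not part of the original answer), you can try a pickle round-trip on the callable before handing it to the pool; anything that fails here will also fail inside apply_async or map_async:

import pickle

def can_pickle(obj):
    """Return True if obj survives a pickle round-trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

def top_level(x):
    return x + 1

print(can_pickle(top_level))        # True: defined at module top level
print(can_pickle(lambda x: x + 1))  # False: lambdas can't be pickled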
I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...     def plus(self, x, y):
...         return x+y
...
>>> t = Test()
>>> x, y = [0, 1, 2, 3], [4, 5, 6, 7]  # example inputs (assumed; not shown in the original transcript)
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>>
>>> class Foo(object):
...     @staticmethod
...     def work(self, x):
...         return x+1
...
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101
Get pathos (and if you like, dill) here:
https://github.com/uqfoundation
When this problem comes up with multiprocessing, a simple solution is to switch from Pool to ThreadPool. This can be done with no change to the code other than the import:
from multiprocessing.pool import ThreadPool as Pool
This works because ThreadPool shares memory with the main thread rather than creating a new process, which means that pickling is not required.
The downside to this method is that Python isn't the greatest language for handling threads: it uses the Global Interpreter Lock to stay thread-safe, which can slow down some use cases. However, if you're primarily interacting with other systems (making HTTP requests, talking to a database, writing to filesystems), then your code is likely not CPU-bound and won't take much of a hit. In fact, when writing HTTP/HTTPS benchmarks I've found that the threaded model used here has less overhead and fewer delays, since the overhead of creating new processes is much higher than the overhead of creating new threads, and the program was otherwise just waiting for HTTP responses.
So if you're processing a ton of stuff in Python userspace, this might not be the best method.
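As a small illustration (my own sketch, using a hypothetical Foo class rather than anything from the question): a bound method works fine with ThreadPool because nothing has to be pickled.

# The only change from a process pool is the import.
from multiprocessing.pool import ThreadPool as Pool

class Foo():
    def work(self, x):
        return x + 1

if __name__ == '__main__':
    foo = Foo()
    with Pool() as pool:
        # Bound methods are fine here: nothing is pickled, because the
        # workers are threads sharing the parent's memory.
        print(pool.map(foo.work, range(5)))  # [1, 2, 3, 4, 5]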
As others have said, multiprocessing can only transfer Python objects to worker processes that can be pickled. If you cannot reorganize your code as described by unutbu, you can use dill's extended pickling/unpickling capabilities for transferring data (especially code) as I show below.
This solution requires only the installation of dill and no other libraries such as pathos:
import os
from multiprocessing import Pool
import dill


def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)


def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))


if __name__ == "__main__":

    pool = Pool(processes=5)

    # async execution of lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print job.get()
    print

    # async execution of static method
    class O(object):
        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print job.get()
I have found that I can also generate exactly that error output on a perfectly working piece of code by attempting to use the profiler on it.
Note that this was on Windows (where the forking is a bit less elegant).
I was running:
python -m profile -o output.pstats <script>
And found that removing the profiling removed the error and placing the profiling restored it. Was driving me batty too because I knew the code used to work. I was checking to see if something had updated pool.py... then had a sinking feeling and eliminated the profiling and that was it.
Posting here for the archives in case anybody else runs into it.
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This error will also occur if you have any built-in function inside the model object that is passed to the async job.
So make sure to check that the model objects you pass don't contain built-in functions. (In our case we were using the FieldTracker() function of django-model-utils inside the model to track a certain field.) Here is the link to the relevant GitHub issue.
This solution requires only the installation of dill and no other libraries such as pathos:
import dill


def apply_packed_function_for_map((dumped_function, item, args, kwargs),):
    """
    Unpack dumped function as target function and call it with arguments.

    :param (dumped_function, item, args, kwargs):
        a tuple of dumped function and its arguments
    :return:
        result of target function
    """
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res


def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to an object that can be sent from one
    multiprocessing.Process to another. The main problem is:
        «multiprocessing.Pool.map*» or «apply*»
        cannot use class methods or closures.
    It solves this problem with «dill».
    It takes the target function as an argument, dumps it («with dill»)
    and returns the dumped function together with the arguments of the target function.
    For more performance we dump only the target function itself
    and don't dump its arguments.

    How to use (pseudo-code):

        ~>>> import multiprocessing
        ~>>> images = [...]
        ~>>> pool = multiprocessing.Pool(100500)
        ~>>> features = pool.map(
        ~...     *pack_function_for_map(
        ~...         super(Extractor, self).extract_features,
        ~...         images,
        ~...         type='png',
        ~...         **options
        ~...     )
        ~... )
        ~>>>

    :param target_function:
        function, that you want to execute like target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
        * a function wrapper, that unpacks and calls the target function;
        * a list of the packed target function and its arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items
It also works for numpy arrays.
A quick fix is to make the function global:
from multiprocessing import Pool


class Test:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r

        def r(x):
            return Test.test(x + self.x)

        with Pool() as p:
            l = p.map(r, list_)

        return l


if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))
Building on @rocksportrocker's solution, it would make sense to use dill when sending and receiving the results as well.
import dill
import itertools


def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res


def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = itertools.izip(
        itertools.cycle([fun]),
        args_list)
    it = itertools.imap(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)


if __name__ == '__main__':
    import multiprocessing as mp
    import sys, os

    p = mp.Pool(4)
    res = dill_map_async(p, lambda x: [sys.stdout.write('%s\n' % os.getpid()), x][-1],
                         [lambda x: x + 1] * 10,)
    res = res.get(timeout=100)
    res = map(dill.loads, res)
    print(res)
As @penky Suresh has suggested in this answer, don't use built-in keywords.
Apparently args behaves like a reserved name when dealing with multiprocessing.
from concurrent.futures import ProcessPoolExecutor, as_completed  # imports added; not in the original snippet


class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]

        with ProcessPoolExecutor(max_workers=10) as executor:
            # Using args here is fine.
            future_processes = {
                executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }

            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                    print(f"Generated data for comment process: {future}")

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    def process_and_render_item(arg):
        print(arg)
        # This will print {"a": "b", "c": "d"} for the first process
        # and {"e": "f", "g": "h"} for the second process.
PS: The tabs/spaces may be a bit off.
From the multiprocessing docs on the spawn start method:
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver. [Available on Unix and Windows. The default on Windows and macOS.]
import multiprocessing as mp
import signal

FOO = 10


def foo():
    assert FOO == 0


def test1():
    global FOO
    FOO = 0
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=foo, args=())
    p.start()
    p.join()


def bar():
    assert signal.getsignal(signal.SIGTERM) == signal.SIG_DFL


def test2():
    orignal = signal.getsignal(signal.SIGTERM)
    assert orignal == signal.SIG_DFL
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=bar, args=())
    p.start()
    p.join()


if __name__ == "__main__":
    test1()
    test2()
output:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/Users/foo/opt/miniconda3/envs/iris-dev/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/foo/opt/miniconda3/envs/iris-dev/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/foo/Work/iris/iris2/example.py", line 7, in foo
assert FOO == 0
AssertionError
Process SpawnProcess-2:
Traceback (most recent call last):
File "/Users/foo/opt/miniconda3/envs/iris-dev/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/foo/opt/miniconda3/envs/iris-dev/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/foo/Work/iris/iris2/example.py", line 18, in bar
assert signal.getsignal(signal.SIGTERM) == signal.SIG_DFL
AssertionError
For test1, FOO is mutated in the parent process, but the child doesn't see this modification.
But for test2, the state mutated by the signal call (installing SIG_IGN) is reflected in both child and parent. (Maybe because it's C-level state???)
I know that when you fork, the child should have the same memory as its parent at the fork point. But it seems that's not always true for spawn.
So my questions are:
What's happening during spawn?
Why does the memory state of the child and parent sometimes differ and sometimes not?
Update: I have test3, where I register a signal handler (Python code) in the parent process, which is not replicated to the child process by spawn:
def bar():
    # assert this is not the handler installed at the parent
    assert signal.getsignal(signal.SIGTERM) == signal.SIG_DFL


def dummy(signum, frame):
    print("dummy", signum, frame)


def test3():
    orignal = signal.getsignal(signal.SIGTERM)
    assert orignal == signal.SIG_DFL
    signal.signal(signal.SIGTERM, dummy)
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=bar, args=())
    p.start()
    p.join()
Python relies on the exec primitive to implement the spawn start method on UNIX platforms.
When a new process is forked, exec loads a new Python interpreter and points it at the module and function you gave as a target to your Process object. When the module is loaded, the if __name__ == "__main__": guard evaluates to False. This prevents your logic from entering an endless loop that would end up spawning infinite processes.
Assuming you are executing this code on a UNIX machine, this is the correct behaviour based on POSIX specifications.
This volume of POSIX.1-2017 specifies that signals set to SIG_IGN remain set to SIG_IGN, and that the new process image inherits the signal mask of the thread that called exec in the old process image.
This works only for SIG_IGN. In fact, in test3 you can observe how your Python-level handler is reset.
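Since a spawned child re-imports the module and only inherits what you pass explicitly, the usual workaround for test1-style state is to hand the value to the child as an argument. A minimal sketch (mine, not from the answer):

import multiprocessing as mp

def foo(foo_value):
    # The parent's value arrives explicitly as an argument; module-level FOO
    # in the freshly imported module would still be 10 under spawn.
    assert foo_value == 0

def test1_fixed():
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=foo, args=(0,))
    p.start()
    p.join()

if __name__ == "__main__":
    test1_fixed()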
This is not very important, just a silly experiment. I would like to create my own message passing.
I would like to have a dictionary of queues, where each key is the PID of the process.
That's because I'd like the processes (created by Process()) to exchange messages by inserting them into the queue of the process they want to send them to (knowing its PID).
This is the silly code:
from multiprocessing import Process, Manager, Queue
from os import getpid
from time import sleep


def begin(dic, manager, parentQ):
    parentQ.put(getpid())
    dic[getpid()] = manager.Queue()
    dic[getpid()].put("Something...")


if __name__ == '__main__':
    manager = Manager()
    dic = manager.dict()
    parentQ = Queue()

    p = Process(target=begin, args=(dic, manager, parentQ))
    p.start()
    son = parentQ.get()
    print son
    sleep(2)
    print dic[son].get()
dic[getpid()] = manager.Queue() works fine. But when I perform dic[son].put()/get() I get this message:
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "mps.py", line 8, in begin
dic[getpid()].put("Something...")
File "<string>", line 2, in __getitem__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 773, in _callmethod
raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Unserializable message: ('#RETURN', <Queue.Queue instance at 0x8a92d0c>)
---------------------------------------------------------------------------
Do you know the right way to do this?
I believe your code is failing because Queues are not serializable, just like the traceback says. The multiprocessing.Manager() object can create a shared dict for you without a problem, just as you've done here, but values stored in the dict still need to be serializable (or picklable in Pythonese). If you're okay with the subprocesses not having access to each other's queues, then this should work for you:
from multiprocessing import Process, Manager, Queue
from os import getpid

number_of_subprocesses_i_want = 5


def begin(myQ):
    myQ.put("Something sentimental from your friend, PID {0}".format(getpid()))
    return


if __name__ == '__main__':
    queue_dic = {}
    queue_manager = Manager()
    process_list = []

    for i in xrange(number_of_subprocesses_i_want):
        child_queue = queue_manager.Queue()
        p = Process(target=begin, args=(child_queue,))
        p.start()
        queue_dic[p.pid] = child_queue
        process_list.append(p)

    for p in process_list:
        print(queue_dic[p.pid].get())
        p.join()
This leaves you with a dictionary whose keys are the child processes' PIDs, and whose values are their respective queues, which can be used from the main process.
I don't think your original goal is achievable with queues because queues that you want a subprocess to use must be passed to the processes when they are created, so as you launch more processes, you have no way to give an existing process access to a new queue.
One possible way to have inter-process communication would be to have everyone share a single queue to pass messages back to your main process bundled with some kind of header, such as in a tuple:
(destination_pid, sender_pid, message)
...and have the main process read the destination_pid and direct (sender_pid, message) to that subprocess's queue (see the sketch below). Of course, this implies that you need a method of notifying existing processes when a new process is available to communicate with.
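A minimal sketch of that routing idea (my own illustration; the upstream/inboxes names are made up, and it keys children by a label rather than a PID for simplicity): every child shares one upstream queue into the main process, which forwards each message to the destination child's inbox.

from multiprocessing import Process, Manager

def child(my_name, upstream, inbox, peer_name):
    # Send a message addressed to the peer, then wait for one addressed to us.
    upstream.put((peer_name, my_name, "hello from {}".format(my_name)))
    print(my_name, "received:", inbox.get())

if __name__ == "__main__":
    manager = Manager()
    upstream = manager.Queue()
    inboxes = {name: manager.Queue() for name in ("A", "B")}

    procs = [
        Process(target=child, args=("A", upstream, inboxes["A"], "B")),
        Process(target=child, args=("B", upstream, inboxes["B"], "A")),
    ]
    for p in procs:
        p.start()

    # Router loop in the main process: read (destination, sender, message)
    # and forward (sender, message) to the destination's inbox.
    for _ in range(len(procs)):
        destination, sender, message = upstream.get()
        inboxes[destination].put((sender, message))

    for p in procs:
        p.join()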
I have some Python code (on Windows) that uses the multiprocessing module to run a pool of worker processes. Each worker process needs to do some cleanup at the end of the map_async method.
Does anyone know how to do that?
Do you really want to run a cleanup function once for each worker process rather than once for every task created by the map_async call?
multiprocessing.pool.Pool creates a pool of, say, 8 worker processes. map_async might submit 40 tasks to be distributed among the 8 workers.
I can imagine why you might want to run cleanup code at the end of each task, but I'm having trouble imagining why you would want to run cleanup code just before each of the 8 worker processes is finalized.
Nevertheless, if that is what you want to do, you could do it by monkeypatching multiprocessing.pool.worker:
import multiprocessing as mp
import multiprocessing.pool as mpool
from multiprocessing.util import debug


def cleanup():
    print('{n} CLEANUP'.format(n=mp.current_process().name))


# This code comes from /usr/lib/python2.6/multiprocessing/pool.py,
# except for the single line at the end which calls cleanup().
def myworker(inqueue, outqueue, initializer=None, initargs=()):
    put = outqueue.put
    get = inqueue.get
    if hasattr(inqueue, '_writer'):
        inqueue._writer.close()
        outqueue._reader.close()

    if initializer is not None:
        initializer(*initargs)

    while 1:
        try:
            task = get()
        except (EOFError, IOError):
            debug('worker got EOFError or IOError -- exiting')
            break

        if task is None:
            debug('worker got sentinel -- exiting')
            break

        job, i, func, args, kwds = task
        try:
            result = (True, func(*args, **kwds))
        except Exception, e:
            result = (False, e)
        put((job, i, result))
    cleanup()


# Here we monkeypatch mpool.worker
mpool.worker = myworker


def foo(i):
    return i * i


def main():
    pool = mp.Pool(8)
    results = pool.map_async(foo, range(40)).get()
    print(results)


if __name__ == '__main__':
    main()
yields:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521]
PoolWorker-8 CLEANUP
PoolWorker-3 CLEANUP
PoolWorker-7 CLEANUP
PoolWorker-1 CLEANUP
PoolWorker-6 CLEANUP
PoolWorker-2 CLEANUP
PoolWorker-4 CLEANUP
PoolWorker-5 CLEANUP
Your only real option here is to run cleanup at the end of the function you pass to map_async.
If this cleanup is honestly intended to run at process death, you cannot use the concept of a pool. They are orthogonal: a pool does not dictate process lifetime unless you use maxtasksperchild, which is new in Python 2.7. Even then, you do not gain the ability to run code at process death. However, maxtasksperchild might suit you, because any resources that the process opens will definitely go away when the process is terminated.
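As a rough illustration of that option (my own sketch, not from the answer): with maxtasksperchild=1, each worker is retired after a single task, so any resources the process held are released even though no explicit at-exit hook runs.

import multiprocessing as mp
import os

def task(i):
    # Each call runs in a worker that will be retired after this one task.
    return (os.getpid(), i * i)

if __name__ == "__main__":
    # maxtasksperchild=1: workers are replaced after every task, so
    # process-held resources go away with the worker; no cleanup hook needed.
    pool = mp.Pool(processes=4, maxtasksperchild=1)
    results = pool.map(task, range(8))
    pool.close()
    pool.join()
    print(results)  # the PIDs should vary across tasks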
That being said, if you have a bunch of functions that you need to run cleanup on, you can save duplication of effort by designing a decorator. Here's an example of what I mean:
import functools
import multiprocessing


def cleanup(f):
    """Decorator for shared cleanup mechanism"""
    @functools.wraps(f)
    def wrapped(arg):
        result = f(arg)
        print("Cleaning up after f({0})".format(arg))
        return result
    return wrapped


@cleanup
def task1(arg):
    print("Hello from task1({0})".format(arg))
    return arg * 2


@cleanup
def task2(arg):
    print("Bonjour from task2({0})".format(arg))
    return arg ** 2


def main():
    p = multiprocessing.Pool(processes=3)
    print(p.map(task1, [1, 2, 3]))
    print(p.map(task2, [1, 2, 3]))


if __name__ == "__main__":
    main()
When you execute this (barring stdout being jumbled because I'm not locking it here for brevity), the order you get things out should indicate that your cleanup task is running at the end of each task:
Hello from task1(1)
Cleaning up after f(1)
Hello from task1(2)
Cleaning up after f(2)
Hello from task1(3)
Cleaning up after f(3)
[2, 4, 6]
Bonjour from task2(1)
Cleaning up after f(1)
Bonjour from task2(2)
Cleaning up after f(2)
Bonjour from task2(3)
Cleaning up after f(3)
[1, 4, 9]