Why does my parallel code generate an error? - python

Issue 1: When sys.stdout.write is not wrapped in a separate function, the code below fails.
Issue 2: When sys.stdout.write is wrapped in a separate function, the code prints spaces between each letter.
Code (v1):
#!/usr/bin/env python
import pp
import sys

def main():
    server = pp.Server()
    for c in "Hello World!\n":
        server.submit(sys.stdout.write, (c,), (), ("sys",))()

if __name__ == "__main__":
    main()
Trace:
$ ./parhello.py
Traceback (most recent call last):
File "./parhello.py", line 15, in <module>
main()
File "./parhello.py", line 12, in main
server.submit(write, (c,), (), ("sys",))()
File "/Library/Python/2.7/site-packages/pp.py", line 461, in submit
sfunc = self.__dumpsfunc((func, ) + depfuncs, modules)
File "/Library/Python/2.7/site-packages/pp.py", line 639, in __dumpsfunc
sources = [self.__get_source(func) for func in funcs]
File "/Library/Python/2.7/site-packages/pp.py", line 706, in __get_source
sourcelines = inspect.getsourcelines(func)[0]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 688, in getsourcelines
lines, lnum = findsource(object)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 527, in findsource
file = getsourcefile(object)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 446, in getsourcefile
filename = getfile(object)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 422, in getfile
'function, traceback, frame, or code object'.format(object))
TypeError: <built-in method write of file object at 0x1002811e0> is not a module, class, method, function, traceback, frame, or code object
make: *** [test] Error 1
Code (v2):
#!/usr/bin/env python
import pp
import sys

def hello(c):
    sys.stdout.write(c)

def main():
    server = pp.Server()
    for c in "Hello World!\n":
        server.submit(hello, (c,), (), ("sys",))()

if __name__ == "__main__":
    main()
Trace:
$ ./parhello.py
H e l l o W o r l d !

For the first part: pp wasn't designed to handle built-ins as arguments to submit. The second problem is more subtle. Before pp calls the submitted function, it redirects stdout and stderr to a StringIO object. On completing the task, it prints the value from the StringIO object with
print sout,
The trailing comma in a Python 2 print statement sets the stream's softspace flag, so a space gets inserted before the next thing printed; that is where the extra spaces between the letters come from. To get around this, don't have your functions use sys.stdout directly: write to a file instead, or return the output (or push it onto a queue you manage) and handle the printing yourself.
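For example, here is a minimal sketch along those lines (the return-and-join pattern is my suggestion, not part of the original answer): each task returns its output, and the parent performs a single write.
#!/usr/bin/env python
import pp
import sys

def hello(c):
    # Return the character instead of writing it; pp's stdout
    # redirection never sees any output this way.
    return c

def main():
    server = pp.Server()
    jobs = [server.submit(hello, (c,), (), ()) for c in "Hello World!\n"]
    # Calling a job blocks until the task finishes and returns its value.
    sys.stdout.write("".join(job() for job in jobs))

if __name__ == "__main__":
    main()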

Related

Python execute a function in parallel in loop

I tried to improve the execution time of a script that imports data from CSV files into a Graphite/Go-Carbon time-series DB.
This is the loop that walks over all the zip files and reads each one in the function execute_run.
I tried this code, but I got an error:
for idx4, Lst_f in enumerate(full_csvfile_paths):
    if lst_metrics in Lst_f:
        zip_file = Lst_f
        with zipfile.ZipFile(zip_file) as zipobj:
            print("Using ZipFile:", zipobj.filename)
            #execute_run(zipobj.filename, confcsv_path, storage_type, serial)
            output = subprocess.run(execute_run(zipobj.filename, confcsv_path, storage_type, serial), stdout=subprocess.PIPE)
            print("Return code: %i" % output.returncode)
            print("Output data: %s" % output.stdout)
Error:
Traceback (most recent call last):
File "./02-pickle-client.py", line 451, in <module>
main()
File "./02-pickle-client.py", line 361, in main
output = subprocess.run(execute_run(zipobj.filename, confcsv_path, storage_type, serial),stdout=subprocess.PIPE)
File "/usr/lib64/python3.6/subprocess.py", line 423, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1240, in _execute_child
args = list(args)
TypeError: 'NoneType' object is not iterable
Is there a way to execute the function execute_run X times in parallel and verify that it runs correctly?
Many thanks for any help.
The immediate problem is that subprocess.run(execute_run(...)) calls execute_run first and passes its return value, None, to subprocess.run, which then fails when Popen tries to do list(args); subprocess is for launching external commands, not for running Python functions. Instead of subprocess.run, I would recommend using a multiprocessing.Pool with its starmap method, as described in the multiprocessing docs.
This could look something like this:
import multiprocessing as mp

# Step 1: Create a multiprocessing.Pool and specify the number of cores to use (here, 4).
pool = mp.Pool(4)

# Step 2: Use pool.starmap, which takes multiple iterable arguments.
results = pool.starmap(My_Function, [(variable1, variable2, variable3) for i in data])

# Step 3: Don't forget to close the pool.
pool.close()
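Applied to the loop in the question, a sketch might look like this (a hedged illustration: it assumes execute_run, full_csvfile_paths, lst_metrics, confcsv_path, storage_type, and serial are defined as in the original script):
import multiprocessing as mp
import zipfile

# Build one argument tuple per matching zip file.
tasks = []
for Lst_f in full_csvfile_paths:
    if lst_metrics in Lst_f:
        with zipfile.ZipFile(Lst_f) as zipobj:
            print("Using ZipFile:", zipobj.filename)
            tasks.append((zipobj.filename, confcsv_path, storage_type, serial))

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        # starmap unpacks each tuple into execute_run's four parameters
        # and runs up to 4 calls concurrently.
        results = pool.starmap(execute_run, tasks)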

AttributeError while using plac: 'Namespace' object has no attribute

Trying to write a command line function, and I've been stymied by this AttributeError. I know that other people have asked similar questions but I haven't seen any using plac so I figured I'd write this out.
@plac.annotations(
    training_file=("The filename containing the text you wish to annotate", "option", "-tf", Path),
    entity_type=("The name of the entity you wish to annotate", "option", "-e", str)
)
def main(training_file=None, entity_type=None):
    """Script to more easily annotate spaCy NER training examples"""
    if not training_file:
        training_file = input("Please enter the filename of the data you wish to annotate: ")
    with open(training_file, 'r') as training_file:
        list_to_annotate = training_file.read()
    print(list_to_annotate)
and where it's run:
if __name__ == "__main__":
    plac.call(main)
There's more to my actual command, but whenever I run this I get the same error message:
Traceback (most recent call last):
File "C:\Users\Steve\PycharmProjects\GroceryListMaker\model_scripts\training_data_maker.py", line 79, in <module>
plac.call(main)
File "C:\Users\Steve\PycharmProjects\GroceryListMaker\lib\site-packages\plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "C:\Users\Steve\PycharmProjects\GroceryListMaker\lib\site-packages\plac_core.py", line 230, in consume
args = [getattr(ns, a) for a in self.argspec.args]
File "C:\Users\Steve\PycharmProjects\GroceryListMaker\lib\site-packages\plac_core.py", line 230, in <listcomp>
args = [getattr(ns, a) for a in self.argspec.args]
AttributeError: 'Namespace' object has no attribute 'training_file'
I'm really not sure what's wrong, and it's making me tear my hair out here. Any help greatly appreciated, thank you.
If you replace it with:
@plac.annotations(
    training_file=("The filename containing the text you wish to annotate",
                   "option", "tf", Path),
    entity_type=("The name of the entity you wish to annotate", "option", "e", str)
)
it works (note that I removed the - in the abbreviations).
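With that change, a hypothetical invocation (using the script name from the traceback; the file and entity values here are made up for illustration) would look something like:
$ python training_data_maker.py -tf groceries.txt -e FOOD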
In the future you can use pdb to track down problems like this more quickly. Here's what I did:
$ python -m pdb main.py
> /home/embray/src/junk/so/60005716/main.py(1)<module>()
-> import plac
(Pdb) cont
Traceback (most recent call last):
File "/usr/lib/python3.6/pdb.py", line 1667, in main
pdb._runscript(mainpyfile)
File "/usr/lib/python3.6/pdb.py", line 1548, in _runscript
self.run(statement)
File "/usr/lib/python3.6/bdb.py", line 434, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/home/embray/src/junk/so/60005716/main.py", line 1, in <module>
import plac
File "/home/embray/.virtualenvs/tmp-954ecd64f7669c29/lib/python3.6/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/embray/.virtualenvs/tmp-954ecd64f7669c29/lib/python3.6/site-packages/plac_core.py", line 230, in consume
args = [getattr(ns, a) for a in self.argspec.args]
File "/home/embray/.virtualenvs/tmp-954ecd64f7669c29/lib/python3.6/site-packages/plac_core.py", line 230, in <listcomp>
args = [getattr(ns, a) for a in self.argspec.args]
AttributeError: 'Namespace' object has no attribute 'training_file'
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/embray/.virtualenvs/tmp-954ecd64f7669c29/lib/python3.6/site-packages/plac_core.py(230)<listcomp>()
-> args = [getattr(ns, a) for a in self.argspec.args]
(Pdb) up
> /home/embray/.virtualenvs/tmp-954ecd64f7669c29/lib/python3.6/site-packages/plac_core.py(230)consume()
-> args = [getattr(ns, a) for a in self.argspec.args]
(Pdb) p ns
Namespace(e=None, tf=None)
Here you can see that your argument namespace ended up with attributes e and tf in place of the full argument names, suggesting that putting a - in the abbreviation actually replaces the argument name (this was just a guess on my part, but it turned out to be correct).
I'd consider that a bit of a bug on plac's part: it's very confusing, and the documentation doesn't indicate anything about this.

Error pickling a `matlab` object in joblib `Parallel` context

I'm running some Matlab code in parallel from inside a Python context (I know, but that's what's going on), and I'm hitting an import error involving matlab.double. The same code works fine in a multiprocessing.Pool, so I am having trouble figuring out what the problem is. Here's a minimal reproducing test case.
import matlab
from multiprocessing import Pool
from joblib import Parallel, delayed

# A global object that I would like to be available in the parallel subroutine
x = matlab.double([[0.0]])

def f(i):
    print(i, x)

with Pool(4) as p:
    p.map(f, range(10))
# This prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

for _ in Parallel(4, backend='multiprocessing')(delayed(f)(i) for i in range(10)):
    pass
# This also prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

# Now run with the default `backend='loky'`
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
    pass
# ^ this crashes.
So, the only problematic one is the one using the 'loky' backend.
The full traceback is:
exception calling callback for <Future at 0x7f63b5a57358 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 20, in <module>
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
self.retrieve()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Looking at the traceback, it seems like the root cause is an issue importing the matlab package in the child process.
It's probably worth noting that this all runs just fine if instead I had defined x = np.array([[0.0]]) (after importing numpy as np). And of course the main process has no problem with any matlab imports, so I am not sure why the child process would.
I'm not sure if this error has anything in particular to do with the matlab package, or if it's something to do with global variables and cloudpickle or loky. In my application it would help to stick with loky, so I'd appreciate any insight!
I should also note that I'm using the official Matlab engine for Python: https://www.mathworks.com/help/matlab/matlab-engine-for-python.html. I suppose that might make it hard for others to try out the test cases, so I wish I could reproduce this error with a type other than matlab.double, but I haven't found another yet.
Digging around more, I've noticed that the process of importing the matlab package is more circular than I would expect, and I'm speculating that this could be part of the problem. When import matlab is run by loky's _ForkingPickler, first the file matlab/mlarray.py is imported, which imports some other files, one of which contains import matlab; this causes matlab/__init__.py to be run, which internally has from mlarray import double, single, uint8, ..., and that is the line that causes the crash.
Could this circularity be the issue? If so, why can I import this module in the main process but not in the loky backend?
The error is caused by an incorrect loading order of global objects in the child processes. It can be seen clearly in the traceback
_ForkingPickler.loads(res) -> ... -> import matlab -> from mlarray import ...
that matlab is not yet fully imported when the global variable x is loaded by cloudpickle.
joblib with loky seems to treat modules as normal global objects and sends them dynamically to the child processes. joblib doesn't record the order in which those objects/modules were defined, so they are loaded (initialized) in an arbitrary order in the child processes.
A simple workaround is to manually pickle the matlab object and load it after importing matlab inside your function.
import matlab
import pickle

px = pickle.dumps(matlab.double([[0.0]]))

def f(i):
    import matlab
    x = pickle.loads(px)
    print(i, x)
Of course, you can also use joblib's dump and load to serialize the objects.
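For completeness, a minimal sketch of running this workaround under the default loky backend (using the same f and px as above):
from joblib import Parallel, delayed

# Each worker imports matlab first and only then unpickles px, so the
# module is fully initialized before the matlab.double is rebuilt.
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
    pass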
Use an initializer
Thanks to the suggestion of @Aaron, you can also use an initializer (for loky) to import matlab before loading x.
Currently there's no simple API to specify an initializer, so I wrote a small helper function:
def with_initializer(self, f_init):
    # Overwrite the initializer hook in the Loky ProcessPoolExecutor
    # https://github.com/tomMoral/loky/blob/f4739e123acb711781e46581d5ed31ed8201c7a9/loky/process_executor.py#L850
    hasattr(self._backend, '_workers') or self.__enter__()
    origin_init = self._backend._workers._initializer
    def new_init():
        origin_init()
        f_init()
    self._backend._workers._initializer = new_init if callable(origin_init) else f_init
    return self
It is a little bit hacky but works well with the current version of joblib and loky.
Then you can use it like:
import matlab
from joblib import Parallel, delayed

x = matlab.double([[0.0]])

def f(i):
    print(i, x)

def _init_matlab():
    import matlab

with Parallel(4) as p:
    for _ in with_initializer(p, _init_matlab)(delayed(f)(i) for i in range(10)):
        pass
I hope the developers of joblib will add an initializer argument to the constructor of Parallel in the future.

Store console output in a file - Python , unittest

I have a Python script in which I wrote some unit tests, and I am using Selenium.
I want to capture the whole console output (not only my prints but also the unittest results), so that I can import it later into my test management tool.
Here is my code:
import unittest
from selenium import webdriver
import json
import requests
import sys

class TestUbuntuHomepage(unittest.TestCase):
    global strs
    strs = []

    def setUp(self):
        sys.stdout = open("C:\\Users\\Marialena\\Downloads\\out2.log", 'wt')
        self.driver = webdriver.Firefox(executable_path="C:\\Users\\Marialena\\Downloads\\selenium-drivers\\geckodriver")

    def testTitle(self):
        self.driver.get('http://www.ubuntu.com/')
        if self.assertIn('Ubuntu', self.driver.title):
            strs.append('test')

    def tearDown(self):
        self.driver.quit()

if __name__ == '__main__':
    unittest.main(verbosity=2)
Using sys.stdout = open("C:\\Users\\Marialena\\Downloads\\out2.log", 'wt'), I get everything I have printed in the file, but I also get this exception:
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm\PyCharm Community Edition 2017.3.3\helpers\pycharm\_jb_unittest_runner.py", line 35, in <module>
    main(argv=args, module=None, testRunner=unittestpy.TeamcityTestRunner, buffer=not JB_DISABLE_BUFFERING)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\main.py", line 95, in __init__
    self.runTests()
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\main.py", line 256, in runTests
    self.result = testRunner.run(self.test)
  File "C:\Program Files\JetBrains\PyCharm\PyCharm Community Edition 2017.3.3\helpers\pycharm\teamcity\unittestpy.py", line 304, in run
    return super(TeamcityTestRunner, self).run(test)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\runner.py", line 176, in run
    test(result)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 122, in run
    test(result)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 122, in run
    test(result)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\suite.py", line 122, in run
    test(result)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\case.py", line 653, in __call__
    return self.run(*args, **kwds)
  File "C:\Users\Marialena\AppData\Local\Programs\Python\Python36-32\Lib\unittest\case.py", line 624, in run
    result.stopTest(self)
  File "C:\Program Files\JetBrains\PyCharm\PyCharm Community Edition 2017.3.3\helpers\pycharm\teamcity\unittestpy.py", line 260, in stopTest
    output = sys.stdout.getvalue()
AttributeError: '_io.TextIOWrapper' object has no attribute 'getvalue'
Any help with this, please? Thank you.
It looks like PyCharm is replacing sys.stdout with its own stream, so when you replace it with a file stream, PyCharm fails to use it.
So, limit your interventions to the scope of one function, to avoid interference with PyCharm.
This is the general idea:
def testTitle(self):
    original_stdout = sys.stdout
    sys.stdout = open("C:\\Users\\Marialena\\Downloads\\out2.log", 'wt')
    # your test code goes here
    sys.stdout = original_stdout
Now, from the outside, it will look like sys.stdout was never modified.
Of course, you'll want to improve on some things:
handle exceptions in test - restore stdout in a finally block
avoid deleting the file in each test - open the file in append mode
avoid having to copy-and-paste this code in each test - make a context manager
from contextlib import contextmanager

@contextmanager
def redirected_stdout(filename):
    original_stdout = sys.stdout
    sys.stdout = open(filename, 'at')
    try:
        yield
    finally:
        sys.stdout = original_stdout
and then:
def testTitle(self):
    with redirected_stdout("C:\\Users\\Marialena\\Downloads\\out2.log"):
        # your test code goes here
Alternatively:
Investigate how PyCharm expects sys.stdout to behave and make your own class which does both: writes to file and provides the API which PyCharm expects.
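One possible shape for that idea (my own sketch, not from the original answer; TeeStream is a made-up name) is a wrapper that writes to both the log file and whatever stream PyCharm installed, and delegates everything else, including getvalue, to that original stream:
import sys

class TeeStream:
    def __init__(self, original, filename):
        self._original = original      # PyCharm's stream (or the real stdout)
        self._log = open(filename, 'at')

    def write(self, text):
        self._original.write(text)     # keep PyCharm's runner working
        self._log.write(text)          # and keep a copy in the log file

    def flush(self):
        self._original.flush()
        self._log.flush()

    def __getattr__(self, name):
        # Delegate anything else (e.g. getvalue) to the original stream.
        return getattr(self._original, name)

sys.stdout = TeeStream(sys.stdout, "C:\\Users\\Marialena\\Downloads\\out2.log")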

Why would it throw "'module' object has no attribute XXX" when I call apply_async from multiprocessing.Pool?

The code is below. When I copy and paste it into my cmd prompt, it throws 'module' object has no attribute 'func', but when I save it as a .py file and execute python test.py, it works fine.
import multiprocessing
import time

def func(msg):
    for i in xrange(3):
        print msg
        time.sleep(1)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    for i in xrange(5):
        msg = "hello %d" % (i)
        pool.apply_async(func, (msg, ))
    pool.close()
    pool.join()
    print "Sub-process(es) done."
Could anyone explain the difference between running Python code at the prompt and running it from a file? Thanks a lot!
This is happening because on Windows, func needs to be pickled and sent to the child process via IPC. In order for the child to unpickle func, it needs to be able to import it from the parent's __main__ module. When this happens in a normal Python script, the child can re-import your script, and __main__ will contain all the functions declared at the top-level of your script, so it works fine. However, in the interactive interpreter, functions you've defined while in the interpreter can't simply be re-imported from a file like in a normal script, so they will not be in __main__ in the child. This is more clear if you use multiprocessing.Process directly to recreate the issue:
>>> def f():
...     print "HI"
...
>>> import multiprocessing
>>> p = multiprocessing.Process(target=f)
>>> p.start()
>>> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\python27\lib\multiprocessing\forking.py", line 381, in main
self = load(from_parent)
File "C:\python27\lib\pickle.py", line 1378, in load
return Unpickler(file).load()
File "C:\python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\python27\lib\pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "C:\python27\lib\pickle.py", line 1126, in find_class
klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'f'
This way, it's clearer what pickle fails to find. If you add some tracing to pickle.py, you can see that 'module' here refers to __main__:
def load_global(self):
    module = self.readline()[:-1]
    name = self.readline()[:-1]
    print("module {} name {}".format(module, name))  # I added this.
    klass = self.find_class(module, name)
    self.append(klass)
Rerunning the same code with that extra print statement yields this:
module multiprocessing.process name Process
module __main__ name f
< same traceback as before>
It's worth noting that this example actually works fine on Posix platforms, because os.fork() is used to spawn the child processes, which means that any function defined prior to the Pool being created will be available in the child's __main__ module. So, while the above example will work, this one will still fail, because the worker function is defined after creating the Pool (which means after os.fork() is called):
>>> import multiprocessing
>>> p = multiprocessing.Pool(2)
>>> def f(a):
...     print(a)
...
>>> p.apply(f, "hi")
Process PoolWorker-1:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 231, in _bootstrap
self.run()
File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/usr/lib64/python2.6/multiprocessing/queues.py", line 339, in get
return recv()
AttributeError: 'module' object has no attribute 'f'
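A common workaround for interactive sessions (a sketch; mymodule.py is a hypothetical file name) is to define the worker function in a real module, so the child process can import it by name instead of looking it up in __main__:
# mymodule.py
def func(msg):
    print msg
Then, in the interactive interpreter:
>>> import multiprocessing
>>> from mymodule import func
>>> pool = multiprocessing.Pool(2)
>>> pool.apply(func, ("hello",))
hello
Because func.__module__ is 'mymodule', pickle records it as mymodule.func, and the child process can re-import it.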
