Python DEAP and multiprocessing on Windows: AttributeError

I have the following situation:
Windows 10
Python 3.7
deap 1.3.1
There is a main.py with
def main():
    ...
    schedule.schedule()
    ...

if __name__ == "__main__":
    main()
Then, I also have a file schedule.py with
def schedule():
    ...
    toolbox = base.Toolbox()
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMin)
    toolbox.register('individual', init_indiv, creator.Individual, bounds=bounds)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", fitness, data=args)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # Further parameters
    cxpb = 0.7
    mutpb = 0.2

    # Measure how long it takes to calculate 1 generation
    MAX_HOURS_GA = parameter._MAX_HOURS_GA
    POPSIZE_GA = parameter._POPSIZE_GA

    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)

    pop = toolbox.population(n=POPSIZE_GA * len(bounds))
    result = algorithms.eaSimple(pop, toolbox, cxpb, mutpb, 1, verbose=False)
Now, executing this gives me the following error:
Process SpawnPoolWorker-1:
Traceback (most recent call last):
File "C:\Users\...\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "C:\Users\...\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\...\lib\multiprocessing\pool.py", line 110, in worker
task = get()
File "C:\Users\...\lib\multiprocessing\queues.py", line 354, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'Individual' on <module 'deap.creator' from 'C:\\Users...
Now, I do note that the DEAP documentation (https://deap.readthedocs.io/en/master/tutorials/basic/part4.html) says
Warning
As stated in the multiprocessing guidelines, under Windows, a process pool must be protected in an if __name__ == "__main__" section because of the way processes are initialized.
but that doesn't really help me, as I certainly don't want to put all the toolbox.register(...) calls in my main, and it might not even be possible to do so. Just moving the creation of the pool
pool = multiprocessing.Pool(processes=4)
toolbox.register("map", pool.map)
to the main did not help.
There seem to be other people with similar issues, even fairly recently (https://github.com/rsteca/sklearn-deap/issues/59). For most of them, some sort of workaround seems to exist but none of them seems to fit in my situation or at least I couldn't figure out how to make them work.
I've also tried changing the order in which the functions are registered and the pool is initialized, with no luck, and I've tried using SCOOP instead, with similar results.
Any ideas?

The solution is to create "FitnessMin" and "Individual" in the global scope, i.e. in main.py:
import ...

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

def main():
    ...
    schedule.schedule()
    ...

if __name__ == "__main__":
    main()
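This works because, with the spawn start method used on Windows, every pool worker re-imports main.py, so the module-level creator.create(...) calls also run in each worker and the Individual class exists on deap.creator before the pickled individuals are unpickled. For completeness, a minimal sketch of what schedule.py might then look like (init_indiv, fitness, bounds, args and parameter are the question's own helpers and are assumed to exist; the creator.create(...) calls are simply dropped, and closing the pool is added for tidiness):
# schedule.py (sketch)
import multiprocessing
from deap import base, creator, tools, algorithms

def schedule():
    toolbox = base.Toolbox()
    # No creator.create(...) here any more -- "FitnessMin" and "Individual"
    # now live at module level in main.py and are therefore re-created in
    # every spawned worker when it re-imports main.py.
    toolbox.register("individual", init_indiv, creator.Individual, bounds=bounds)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", fitness, data=args)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)

    pop = toolbox.population(n=parameter._POPSIZE_GA * len(bounds))
    result = algorithms.eaSimple(pop, toolbox, 0.7, 0.2, 1, verbose=False)

    pool.close()
    pool.join()
    return result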

Related

How to implement multi-processing into a python module?

I would like to implement multiprocessing in a simulation which I have written in Python. The simulation is fairly extensive, and to keep the code clean I have created a number of modules.
One of the modules is now supposed to do some number crunching, so I'd like to use multiprocessing there. However, I keep running into an issue because I cannot employ an if __name__ == "__main__" guard within the module.
I can reproduce the error by running the following:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(arg):
    return arg

class TestMpModule():
    def __init__(self):
        pass

    def do(self):
        para = [1, 2, 3]
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(test_fct, para)
            for result in results:
                print(result)
and
# filename: main.py
from test_mp_module import TestMpModule
test = TestMpModule()
test.do()
The Exception displayed states:
runfile('C:/XXX/test_mp.py', wdir='C:/XXX')
Reloaded modules: test_mp_module
Traceback (most recent call last):
File "C:\XXX\test_mp.py", line 17, in <module>
test.do()
File "C:\XXX\test_mp_module.py", line 22, in do
for result in results:
File "C:\YYY\Anaconda3\lib\concurrent\futures\process.py", line 484, in _chain_from_iterable_of_lists
for element in iterable:
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 611, in result_iterator
yield fs.pop().result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 439, in result
return self.__get_result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm using Python 3.8.3, usually execute my code in Spyder and run a Windows machine.
How can I adapt my code to use multiprocessing within a module? Would that even be possible in the first place? I have found very conflicting statements.
Any help is appreciated, cheers.
Try this for your file "main.py":
if __name__ == '__main__':
    test = TestMpModule()
    test.do()
For the multiprocessing part, I recommend using the multiprocessing package. Here is a little example of how to use it:
import multiprocessing

def my_func(i):
    return i

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        outputs = p.starmap(my_func, [(i, ) for i in range(5)])
    print(outputs)  # > [0, 1, 2, 3, 4]
I found a solution, though I'm not sure it would be considered pretty. The name guard needs to be carried into the module as follows:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(i):
    return i

class TestMpModule():
    def __init__(self):
        pass

    def do(self, name_guard):
        para = [1, 2, 3]
        if name_guard == 'parent_module_name':  # check parent module name here
            with concurrent.futures.ProcessPoolExecutor() as executor:
                results = executor.map(test_fct, para)
                for result in results:
                    print(result)
and
from test_mp_module import TestMpModule

if __name__ == "__main__":
    name_guard = "parent_module_name"  # insert __name__ here
    test = TestMpModule()
    test.do(name_guard)
Works fine now.
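A slightly cleaner variant of the same idea (a sketch, not part of the original answer) passes __name__ itself and compares it against "__main__", so no magic string has to be agreed on between the module and its caller:
# filename: test_mp_module.py (sketch)
import concurrent.futures

def test_fct(i):
    return i

class TestMpModule():
    def do(self, caller_name):
        para = [1, 2, 3]
        # Only spawn workers when the caller is the top-level script,
        # i.e. when it passed in __name__ == "__main__".
        if caller_name == "__main__":
            with concurrent.futures.ProcessPoolExecutor() as executor:
                for result in executor.map(test_fct, para):
                    print(result)
and
# filename: main.py (sketch)
from test_mp_module import TestMpModule

if __name__ == "__main__":
    TestMpModule().do(__name__)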

Getting erratic runtime exceptions trying to access persistent data in multiprocessing.Pool worker processes

Inspired by this solution, I am trying to set up a multiprocessing pool of worker processes in Python. The idea is to pass some data to the worker processes before they actually start their work and to reuse it repeatedly. It's intended to minimize the amount of data which needs to be packed/unpacked for every call into a worker process (i.e. to reduce the inter-process communication overhead). My MCVE looks like this:
import multiprocessing as mp
import numpy as np

def create_worker_context():
    global context  # create "global" context in worker process
    context = {}

def init_worker_context(worker_id, some_const_array, DIMS, DTYPE):
    context.update({
        'worker_id': worker_id,
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype = DTYPE),
    })  # store context information in global namespace of worker
    return True  # return True, verifying that the worker process received its data

class data_analysis:
    def __init__(self):
        self.DTYPE = 'float32'
        self.CPU_LEN = mp.cpu_count()
        self.DIMS = 100
        self.some_const_array = np.zeros((self.DIMS, self.DIMS), dtype = self.DTYPE)
        # Init multiprocessing pool
        self.cpu_pool = mp.Pool(processes = self.CPU_LEN, initializer = create_worker_context)  # create pool and context in workers
        pool_results = [
            self.cpu_pool.apply_async(
                init_worker_context,
                args = (core_id, self.some_const_array, self.DIMS, self.DTYPE)
            ) for core_id in range(self.CPU_LEN)
        ]  # pass information to workers' context
        result_batches = [result.get() for result in pool_results]  # check if they got the information
        if not all(result_batches):  # raise an error if things did not work
            raise SyntaxError('Workers could not be initialized ...')

    @staticmethod
    def process_batch(batch_data):
        context['tmp'][:,:] = context['some_const_array'] + batch_data  # some fancy computation in worker
        return context['tmp']  # return result

    def process_all(self):
        input_data = np.arange(0, self.DIMS ** 2, dtype = self.DTYPE).reshape(self.DIMS, self.DIMS)
        pool_results = [
            self.cpu_pool.apply_async(
                data_analysis.process_batch,
                args = (input_data,)
            ) for _ in range(self.CPU_LEN)
        ]  # let workers actually work
        result_batches = [result.get() for result in pool_results]
        for batch in result_batches[1:]:
            np.add(result_batches[0], batch, out = result_batches[0])  # reduce batches
        print(result_batches[0])  # show result

if __name__ == '__main__':
    data_analysis().process_all()
I am running the above with CPython 3.6.6.
The strange thing is ... sometimes it works, sometimes it does not. If it does not work, the process_batch method throws an exception because it cannot find some_const_array as a key in context. The full traceback looks like this:
(env) me#box:/path> python so.py
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/path/so.py", line 37, in process_batch
context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
KeyError: 'some_const_array'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/so.py", line 54, in <module>
data_analysis().process_all()
File "/path/so.py", line 48, in process_all
result_batches = [result.get() for result in pool_results]
File "/path/so.py", line 48, in <listcomp>
result_batches = [result.get() for result in pool_results]
File "/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
KeyError: 'some_const_array'
I am puzzled. What is going on here?
If my context dictionaries contain an object of "higher type", e.g. a database driver or similar, I do not get this kind of problem. I can only reproduce it if my context dictionaries contain basic Python data types, collections or numpy arrays.
(Is there a potentially better approach for achieving the same thing in a more reliable manner? I know my approach is considered a hack ...)
You need to relocate the content of init_worker_context into your initializer function create_worker_context.
Your assumption that every single worker process will run init_worker_context is responsible for your confusion.
The tasks you submit to a pool get fed into one internal task queue that all worker processes read from. What happens in your case is that some worker processes complete their task and compete again for new tasks. So it can happen that one worker process executes multiple tasks while another does not get a single one. Keep in mind that the OS schedules runtime for the threads (of the worker processes).
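A minimal sketch of that relocation (same shapes and dtype as in the question; the surrounding class is stripped down here for brevity). The constant data is handed to the initializer via initargs, so every worker builds its context exactly once before it accepts any task:
import multiprocessing as mp
import numpy as np

DIMS, DTYPE = 100, 'float32'
SOME_CONST_ARRAY = np.zeros((DIMS, DIMS), dtype=DTYPE)

def create_worker_context(some_const_array, dims, dtype):
    # Runs exactly once in every worker process, before it reads any
    # task from the queue, so process_batch can rely on the context.
    global context
    context = {
        'some_const_array': some_const_array,
        'tmp': np.zeros((dims, dims), dtype=dtype),
    }

def process_batch(batch_data):
    context['tmp'][:, :] = context['some_const_array'] + batch_data
    return context['tmp']

if __name__ == '__main__':
    batch = np.arange(0, DIMS ** 2, dtype=DTYPE).reshape(DIMS, DIMS)
    with mp.Pool(processes=mp.cpu_count(),
                 initializer=create_worker_context,
                 initargs=(SOME_CONST_ARRAY, DIMS, DTYPE)) as pool:
        results = [pool.apply_async(process_batch, args=(batch,))
                   for _ in range(mp.cpu_count())]
        batches = [r.get() for r in results]
    print(np.sum(batches, axis=0))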

How to pass classes into Pool.map as arguments - Pickling Error

I am trying to process a file by cutting it up into chunks and running them through a function which processes the chunks and returns a numpy array. After looking around, it seems the best method would be to use Pool.map and pass class instances as the arguments. These instances are initialized with the chunk sections as one attribute and another attribute to store the numpy array. The output list of instances can then be parsed to get out the information I need to continue with the problem. Here is a simplified version of the script I am trying to write:
from multiprocessing import Pool

class container():
    def __init__(self, k):
        self.input_section = k
        self.output_answer = 0

def compute(object_class):
    # Main operation would go on in here....
    object_class.output_answer = object_class.input_section
    return object_class

def Main():
    # Create list of class instances to pass as arguments
    sections = [container(k) for k in range(10)]
    # Create pool and compute modified instances
    with Pool(4) as p:
        results = p.map(compute, sections)
    # Decode here to get answers
    sections = [k.output_answer for k in results]
    # Print answers
    print(sections)

if __name__ == '__main__':
    Main()
This is the error that I get when I run the script:
Exception in thread Thread-9: Traceback (most recent call last):
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 463, in _handle_results
task = get()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'container' on module '__main__' from
'C:\\Users\\rbernon\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\spyder\\utils\\ipython\\start_kernel.py'>
Any help would be greatly appreciated!
Keep in mind that every piece of data you want to have processed needs to be pickled and sent to the worker processes.
The overhead of this will reduce (and might even eliminate) the advantages of using multiple processes.
If the data file is large, it is probably better to send each worker a start and end offset as a 2-tuple of numbers, so each worker can read part of the file and process it.
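A minimal sketch of that offset-based approach (FILENAME, the chunk count, and the per-chunk computation are placeholders, not taken from the question): each worker receives only a (start, end) tuple, opens the file itself, and returns a small result, so no class instances have to be pickled at all:
from multiprocessing import Pool
import os

FILENAME = 'data.bin'  # hypothetical input file
N_CHUNKS = 4

def process_chunk(offsets):
    start, end = offsets
    with open(FILENAME, 'rb') as f:
        f.seek(start)
        chunk = f.read(end - start)
    # ... do the real per-chunk computation here and return a small result ...
    return len(chunk)

if __name__ == '__main__':
    size = os.path.getsize(FILENAME)
    bounds = [i * size // N_CHUNKS for i in range(N_CHUNKS + 1)]
    offset_pairs = list(zip(bounds[:-1], bounds[1:]))
    with Pool(4) as p:
        results = p.map(process_chunk, offset_pairs)
    print(results)
Note that real code would usually have to align the offsets to record boundaries (e.g. line breaks) before handing them out.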

Python3 filling a dictionary concurrently

I want to fill a dictionary in a loop. The iterations in the loop are independent of each other. I want to perform this on a cluster with thousands of processors. Here is a simplified version of what I tried and need to do.
import multiprocessing

class Worker(multiprocessing.Process):
    def setName(self, name):
        self.name = name

    def run(self):
        print ('In %s' % self.name)
        return

if __name__ == '__main__':
    jobs = []
    names = dict()
    for i in range(10000):
        p = Worker()
        p.setName(str(i))
        names[str(i)] = i
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I tried this one in python3 on my own computer and received the following error:
..
In 249
Traceback (most recent call last):
File "test.py", line 16, in <module>
p.start()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
In 250
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
Is there any better way to do this?
multiprocessing talks to its subprocesses via pipes. Each subprocess requires two open file descriptors, one for reading and one for writing. If you launch 10000 workers, you'll end up opening 20000 file descriptors, which exceeds the default limit on OS X (which your paths indicate you're using).
You can fix the issue by raising the limit. See https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1 for details - basically, it amounts to setting two sysctl knobs and upping your shell's ulimit setting.
You are spawning 10000 processes at once at the moment. That really isn't a good idea.
The error you see is most definitely raised because the multiprocessing module (seems to) use pipes for inter-process communication, and there is a limit on open pipes/FDs.
I suggest using a Python interpreter without a global interpreter lock, such as Jython or IronPython, and simply replacing the multiprocessing module with the threading one.
If you still want to use the multiprocessing module, you could use a process pool like this to collect the return values:
from multiprocessing import Pool

def worker(params):
    name, someArg = params
    print ('In %s' % name)
    # do something with someArg here
    return (name, someArg)

if __name__ == '__main__':
    jobs = []
    names = dict()
    # Spawn 100 worker processes
    pool = Pool(processes=100)
    # Fill with real data
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # Process every task via our pool
    results = pool.map(worker, task_dict.items())
    # And convert the result to a dict
    results = dict(results)
    print (results)
This should work with minimal changes for the threading module, too.
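As a hedged illustration of that last point: the standard library ships a thread-backed pool with the same interface in multiprocessing.dummy, so the example above only needs its import changed (under CPython this only helps if the per-task work releases the GIL, e.g. for I/O):
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

def worker(params):
    name, some_arg = params
    # do something with some_arg here
    return (name, some_arg)

if __name__ == '__main__':
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # 100 threads instead of 100 processes; no pipes or extra FDs per worker
    with Pool(100) as pool:
        results = dict(pool.map(worker, task_dict.items()))
    print(len(results))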

Python multiprocessing, ValueError: I/O operation on closed file

I'm having a problem with the Python multiprocessing package. Below is a simple example code that illustrates my problem.
import multiprocessing as mp
import time

def test_file(f):
    f.write("Testing...\n")
    print f.name
    return None

if __name__ == "__main__":
    f = open("test.txt", 'w')
    proc = mp.Process(target=test_file, args=[f])
    proc.start()
    proc.join()
When I run this, I get the following error.
Process Process-1:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Ray\Google Drive\Programming\Python\tests\follow_test.py", line 24, in test_file
f.write("Testing...\n")
ValueError: I/O operation on closed file
Press any key to continue . . .
It seems that somehow the file handle is 'lost' during the creation of the new process. Could someone please explain what's going on?
I had similar issues in the past. Not sure whether it is done within the multiprocessing module or whether open sets the close-on-exec flag by default but I know for sure that file handles opened in the main process are closed in the multiprocessing children.
The obvious workaround is to pass the filename as a parameter to the child process's init function and open it once within each child (if using a pool), or to pass it as a parameter to the target function and open/close it on each invocation. The former requires the use of a global to store the file handle (not a good thing) - unless someone can show me how to avoid that :) - and the latter can incur a performance hit (but can be used with multiprocessing.Process directly).
Example of the former:
filehandle = None

def child_init(filename):
    global filehandle
    filehandle = open(filename,...)
    ../..

def child_target(args):
    ../..

if __name__ == '__main__':
    # some code which defines filename
    proc = multiprocessing.Pool(processes=1, initializer=child_init, initargs=[filename])
    proc.apply(child_target, args)
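And a minimal sketch of the latter approach (the filename and text are placeholders; the file is opened and closed inside the target function on every invocation, so it works with multiprocessing.Process directly):
import multiprocessing

def child_target(filename, text):
    # open/close on each invocation: slower, but no globals needed
    with open(filename, 'a') as f:
        f.write(text)

if __name__ == '__main__':
    filename = 'test.txt'
    proc = multiprocessing.Process(target=child_target,
                                   args=(filename, 'Testing...\n'))
    proc.start()
    proc.join()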
