How to implement multi-processing into a python module?

I would like to implement multiprocessing in a simulation I have written in Python. The simulation is quite extensive, and to keep the code clean I have split it into a number of modules.
One of these modules is now supposed to do some number crunching, so I'd like to use multiprocessing there. However, I keep running into an issue because I cannot employ an if __name__ == "__main__" guard within the module.
I can reproduce the error by running the following:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(arg):
    return arg

class TestMpModule():
    def __init__(self):
        pass

    def do(self):
        para = [1, 2, 3]
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(test_fct, para)
            for result in results:
                print(result)
and
# filename: main.py
from test_mp_module import TestMpModule
test = TestMpModule()
test.do()
The Exception displayed states:
runfile('C:/XXX/test_mp.py', wdir='C:/XXX')
Reloaded modules: test_mp_module
Traceback (most recent call last):
File "C:\XXX\test_mp.py", line 17, in <module>
test.do()
File "C:\XXX\test_mp_module.py", line 22, in do
for result in results:
File "C:\YYY\Anaconda3\lib\concurrent\futures\process.py", line 484, in _chain_from_iterable_of_lists
for element in iterable:
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 611, in result_iterator
yield fs.pop().result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 439, in result
return self.__get_result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm using Python 3.8.3, usually execute my code in Spyder, and work on a Windows machine.
How can I adapt my code to use multiprocessing within a module? Is that even possible in the first place? I have found very conflicting statements.
Any help is appreciated, cheers.

Try this for your file "main.py":
if __name__ == '__main__':
    test = TestMpModule()
    test.do()
For the multiprocessing part, I recommend using the multiprocessing package directly. Here is a little example of how to use it:
import multiprocessing

def my_func(i):
    return i

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        outputs = p.starmap(my_func, [(i,) for i in range(5)])
    print(outputs)  # > [0, 1, 2, 3, 4]
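Tied back to the original example, a minimal sketch (assuming test_mp_module.py is kept as above): the guard only has to live in the script you actually run, because on Windows every worker process re-imports that script, and the guard keeps the workers from re-running test.do() themselves. Nothing inside the module needs a guard of its own.
# filename: main.py
from test_mp_module import TestMpModule

if __name__ == '__main__':
    test = TestMpModule()
    test.do()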

I found a solution, though I'm not sure whether it would be considered pretty. The name guard needs to be carried into the module as follows:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(i):
    return i

class TestMpModule():
    def __init__(self):
        pass

    def do(self, name_guard):
        para = [1, 2, 3]
        if name_guard == 'parent_module_name':  # check parent module name here
            with concurrent.futures.ProcessPoolExecutor() as executor:
                results = executor.map(test_fct, para)
                for result in results:
                    print(result)
and
from test_mp_module import TestMpModule

if __name__ == "__main__":
    name_guard = "parent_module_name"  # insert __name__ here
    test = TestMpModule()
    test.do(name_guard)
Works fine now.

Related

Trouble importing name '_args_from_interpreter_flags' from 'subprocess'

I am attempting to write a Python script to download and unzip hundreds of files from an AWS server. As I understand it, these are I/O-bound tasks, so I would like to multi-thread them to speed up processing times.
Since I am new to Python, I've been reading guides like this one and that one on multithreading and multiprocessing.
Both of the above links suggest code to import methods from the subprocess library, but I am running into trouble completing these imports. The second link above suggests the following code to illustrate multithreading:
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool (import missing from the snippet as posted)
from urllib.request import urlopen

def run_tasks(function, args, pool, chunk_size=None):
    results = pool.map(function, args, chunk_size)
    return results

def work(n):
    with urlopen("https://www.google.com/#{n}") as f:
        contents = f.read(32)
    return contents

if __name__ == '__main__':
    numbers = [x for x in range(1, 100)]
    # Run the task using a thread pool
    t_p = ThreadPool()
    result = run_tasks(work, numbers, t_p)
    print(result)
    t_p.close()
When I tried running this script, I got the following error with traceback:
PS C:\Users\USERNAME> & "C:/Users/USERNAME/AppData/Local/Continuum/anaconda3/python.exe" "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py"
Traceback (most recent call last):
File "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py", line 38, in <module>
t_p = ThreadPool()
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\dummy\__init__.py", line 123, in Pool
from ..pool import ThreadPool
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\pool.py", line 26, in <module>
from . import util
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\util.py", line 17, in <module>
from subprocess import _args_from_interpreter_flags
ImportError: cannot import name '_args_from_interpreter_flags' from 'subprocess' (h:\PSO Post-Processing\API Query\Python Test\subprocess_test\subprocess.py)
I found this SO thread, in which the answer suggests adding
from subprocess import _args_from_interpreter_flags
to the list of imports. However, when I added this line, the import error seems to shift into my current script:
Traceback (most recent call last):
File "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py", line 20, in <module>
from subprocess import _args_from_interpreter_flags
File "h:\Post-Processing\API Query\Python Test\subprocess_test\subprocess.py", line 20, in <module>
from subprocess import _args_from_interpreter_flags
ImportError: cannot import name '_args_from_interpreter_flags' from 'subprocess' (h:\PSO Post-Processing\API Query\Python Test\subprocess_test\subprocess.py)
I am now suspecting that something is wrong with my Python installation, but I am not sure how to troubleshoot it.
I am running Windows 10 on a work computer and using Visual Studio Code as my editor. According to Visual Studio Code, I'm running Python 3.7.6 64-bit ('Continuum': virtualenv). I found that I have subprocess.py installed at
"C:\Users\USER\AppData\Local\Continuum\anaconda3\Lib\subprocess.py"
and this subprocess.py file indeed has a segment with
def _args_from_interpreter_flags():
    """Return a list of command-line arguments reproducing the current
    settings in sys.flags, sys.warnoptions and sys._xoptions."""
    flag_opt_map = {
        'debug': 'd',
        # 'inspect': 'i',
        # 'interactive': 'i',
        'dont_write_bytecode': 'B',
        'no_site': 'S',
        'verbose': 'v',
        'bytes_warning': 'b',
        'quiet': 'q',
        # -O is handled in _optim_args_from_interpreter_flags()
    }
    args = _optim_args_from_interpreter_flags()
    for flag, opt in flag_opt_map.items():
        v = getattr(sys.flags, flag)
        if v > 0:
            args.append('-' + opt * v)

    if sys.flags.isolated:
        args.append('-I')
    else:
        if sys.flags.ignore_environment:
            args.append('-E')
        if sys.flags.no_user_site:
            args.append('-s')

    # -W options
    warnopts = sys.warnoptions[:]
    bytes_warning = sys.flags.bytes_warning
    xoptions = getattr(sys, '_xoptions', {})
    dev_mode = ('dev' in xoptions)

    if bytes_warning > 1:
        warnopts.remove("error::BytesWarning")
    elif bytes_warning:
        warnopts.remove("default::BytesWarning")
    if dev_mode:
        warnopts.remove('default')
    for opt in warnopts:
        args.append('-W' + opt)

    # -X options
    if dev_mode:
        args.extend(('-X', 'dev'))
    for opt in ('faulthandler', 'tracemalloc', 'importtime',
                'showalloccount', 'showrefcount', 'utf8'):
        if opt in xoptions:
            value = xoptions[opt]
            if value is True:
                arg = opt
            else:
                arg = '%s=%s' % (opt, value)
            args.extend(('-X', arg))

    return args
Given all this information, I am sure that I'm missing a simple detail that's stopping the threading code from working. I appreciate any help you can give.
Thank you!!
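One quick way to check this (a diagnostic sketch, prompted by the path shown in the ImportError itself, which points at the project's own subprocess.py rather than the standard library): ask Python which file the name subprocess actually resolves to, e.g. from an interactive prompt started in the same project directory:
import subprocess

# If this prints a path inside the project (e.g. ...\subprocess_test\subprocess.py)
# instead of ...\anaconda3\lib\subprocess.py, the script is shadowing the
# standard-library module; renaming the script is the usual remedy.
print(subprocess.__file__)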

Python DEAP and multiprocessing on Windows: AttributeError

I have the following situation:
Windows 10
Python 3.7
deap 1.3.1
There is a main.py with
def main():
    ...
    schedule.schedule()
    ...

if __name__ == "__main__":
    main()
Then, I also have a file schedule.py with
def schedule():
    ...
    toolbox = base.Toolbox()
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMin)
    toolbox.register('individual', init_indiv, creator.Individual, bounds=bounds)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", fitness, data=args)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # Further parameters
    cxpb = 0.7
    mutpb = 0.2

    # Measure how long it takes to calculate 1 generation
    MAX_HOURS_GA = parameter._MAX_HOURS_GA
    POPSIZE_GA = parameter._POPSIZE_GA

    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)

    pop = toolbox.population(n=POPSIZE_GA * len(bounds))
    result = algorithms.eaSimple(pop, toolbox, cxpb, mutpb, 1, verbose=False)
Now, executing this gives me the following error:
Process SpawnPoolWorker-1:
Traceback (most recent call last):
File "C:\Users\...\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "C:\Users\...\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\...\lib\multiprocessing\pool.py", line 110, in worker
task = get()
File "C:\Users\...\lib\multiprocessing\queues.py", line 354, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'Individual' on <module 'deap.creator' from 'C:\\Users...
Now, I do note that the DEAP documentation (https://deap.readthedocs.io/en/master/tutorials/basic/part4.html) says:
Warning
As stated in the multiprocessing guidelines, under Windows, a process pool must be protected in an if __name__ == "__main__" section because of the way processes are initialized.
but that doesn't really help me, as I certainly don't want to have all the toolbox.register(...) calls in my main, and it might not even be possible to do so. Just moving the creation of the pool
pool = multiprocessing.Pool(processes=4)
toolbox.register("map", pool.map)
to the main did not help.
There seem to be other people with similar issues, even fairly recently (https://github.com/rsteca/sklearn-deap/issues/59). For most of them some sort of workaround seems to exist, but none of them seems to fit my situation, or at least I couldn't figure out how to make them work.
I've also tried moving around the order of registering the functions and initializing the pool, but with no luck. I've also tried using SCOOP instead, but with similar results.
Any ideas?
The solution is to create "FitnessMin" and "Individual" in the global scope, i.e. in main.py:
import ...

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

def main():
    ...
    schedule.schedule()
    ...

if __name__ == "__main__":
    main()
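A sketch of how the matching schedule.py could then look (assuming init_indiv, fitness, bounds, args and parameter are defined elsewhere, as in the question): the creator.create(...) calls disappear from schedule(), because each spawned worker re-imports main.py and therefore re-runs the module-level creator.create(...) calls, so the Individual class exists in the workers when individuals are unpickled.
# schedule.py (sketch)
import multiprocessing
from deap import base, creator, tools, algorithms

def schedule():
    toolbox = base.Toolbox()
    # No creator.create(...) here any more; FitnessMin and Individual now
    # come from the module-level calls in main.py.
    toolbox.register('individual', init_indiv, creator.Individual, bounds=bounds)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", fitness, data=args)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    cxpb, mutpb = 0.7, 0.2

    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)

    pop = toolbox.population(n=parameter._POPSIZE_GA * len(bounds))
    result = algorithms.eaSimple(pop, toolbox, cxpb, mutpb, 1, verbose=False)

    pool.close()
    pool.join()
    return result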

How to pass classes into Pool.map as arguments - Pickling Error

I am trying to process a file by cutting it into chunks and running each chunk through a function that processes it and returns a numpy array. After looking around, it seems the best method would be to use Pool.map and pass class instances as the arguments. These instances are initialized with a chunk of the file as one attribute and a second attribute to store the resulting numpy array. The returned list of instances can then be parsed to extract the information I need to continue with the problem. Here is a simplified version of the script I am trying to write:
from multiprocessing import Pool

class container():
    def __init__(self, k):
        self.input_section = k
        self.output_answer = 0

def compute(object_class):
    # Main operation would go on in here....
    object_class.output_answer = object_class.input_section
    return object_class

def Main():
    # Create list of classes to pass as arguments
    sections = [container(k) for k in range(10)]
    # Create pool and compute modified classes
    with Pool(4) as p:
        results = p.map(compute, sections)
    # Decode here to get answers
    sections = [k.output_answer for k in results]
    # Print answers
    print(sections)

if __name__ == '__main__':
    Main()
This is the error that I get when I run the script:
Exception in thread Thread-9:
Traceback (most recent call last):
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 463, in _handle_results
task = get()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'container' on module '__main__' from
'C:\\Users\\rbernon\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\spyder\\utils\\ipython\\start_kernel.py'>
Any help would be greatly appreciated!
Keep in mind that every piece of data you want to have processed needs to be pickled and sent to the worker processes.
The overhead of this will reduce (and might even eliminate) the advantages of using multiple processes.
If the data file is large, it is probably better to send each worker a start and end offset as a 2-tuple of numbers, so each worker can read part of the file and process it.
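A minimal sketch of that idea (the file name, chunk count and per-chunk work below are hypothetical stand-ins):
from multiprocessing import Pool
import os

FILENAME = "data.bin"   # hypothetical input file
N_WORKERS = 4

def process_chunk(offsets):
    start, end = offsets
    with open(FILENAME, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start)
    # ... real per-chunk processing would go here ...
    return len(chunk)

def make_offsets(size, n):
    # Split [0, size) into n contiguous (start, end) ranges
    step = size // n
    bounds = [i * step for i in range(n)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))

if __name__ == '__main__':
    offsets = make_offsets(os.path.getsize(FILENAME), N_WORKERS)
    with Pool(N_WORKERS) as p:
        results = p.map(process_chunk, offsets)
    print(results)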

Python3 filling a dictionary concurrently

I want to fill a dictionary in a loop. The iterations of the loop are independent of each other. I want to perform this on a cluster with thousands of processors. Here is a simplified version of what I tried and need to do.
import multiprocessing

class Worker(multiprocessing.Process):
    def setName(self, name):
        self.name = name

    def run(self):
        print('In %s' % self.name)
        return

if __name__ == '__main__':
    jobs = []
    names = dict()
    for i in range(10000):
        p = Worker()
        p.setName(str(i))
        names[str(i)] = i
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I tried this in Python 3 on my own computer and received the following error:
..
In 249
Traceback (most recent call last):
File "test.py", line 16, in <module>
p.start()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
In 250
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
Is there any better way to do this?
multiprocessing talks to its subprocesses via pipes. Each subprocess requires two open file descriptors, one for reading and one for writing. If you launch 10000 workers, you'll end up opening 20000 file descriptors, which exceeds the default limit on OS X (which your paths indicate you're using).
You can fix the issue by raising the limit. See https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1 for details - basically, it amounts to setting two sysctl knobs and upping your shell's ulimit setting.
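As an illustration only (a sketch for macOS/Linux; it cannot exceed the hard limit or the kernel caps described in that link), the per-process soft limit can also be inspected and raised from within Python via the resource module:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current file-descriptor limits:", soft, hard)

# Try to raise the soft limit; the hard limit and system-wide caps still
# apply, so this may fail with ValueError if the target is set too high.
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, 20000), hard))
except ValueError as exc:
    print("could not raise the limit from Python:", exc)
print("new limits:", resource.getrlimit(resource.RLIMIT_NOFILE))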
You are spawning 10000 processes at once at the moment. That really isn't a good idea.
The error you see is most definitely raised because the multiprocessing module (seems to) use pipes for inter-process communication, and there is a limit on the number of open pipes/file descriptors.
I suggest using a Python interpreter without a global interpreter lock, like Jython or IronPython, and simply replacing the multiprocessing module with the threading one.
If you still want to use the multiprocessing module, you could use a process pool like this to collect the return values:
from multiprocessing import Pool

def worker(params):
    name, someArg = params
    print('In %s' % name)
    # do something with someArg here
    return (name, someArg)

if __name__ == '__main__':
    jobs = []
    names = dict()

    # Spawn 100 worker processes
    pool = Pool(processes=100)
    # Fill with real data
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # Process every task via our pool
    results = pool.map(worker, task_dict.items())
    # And convert the result to a dict
    results = dict(results)
    print(results)
This should work with minimal changes for the threading module, too.
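For example, a small sketch of the same pattern with a thread pool instead (multiprocessing.dummy exposes a Pool with the same interface, backed by threads):
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def worker(params):
    name, some_arg = params
    # do something with some_arg here
    return (name, some_arg)

if __name__ == '__main__':
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    with Pool(100) as pool:
        results = dict(pool.map(worker, task_dict.items()))
    print(results)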

Python debugging with "assert" when using multiprocessing module

I have a specific version of this question on debugging multiprocessing in Python. I use assert statements extensively throughout my code to catch bugs. When an assert fails, the program stops, and the file name and line number of the offending assert are printed to stderr.
However, when I use multiprocessing.Pool, all I get back is that there was an AssertionError, with no information about where the offending assert is.
For example, the following minimal code uses multiprocessing's map for cores >= 2 and the regular map function for cores == 1:
import multiprocessing
import logging

mpl = multiprocessing.log_to_stderr()
mpl.setLevel(logging.INFO)

def test(foo):
    print foo
    assert False

cores = 2
if cores > 1:
    pool = multiprocessing.Pool(cores)
    pool.map(test, range(cores))
else:
    map(test, range(cores))
For cores == 1 I get the following error:
File "test_multiprocessing.py", line 16, in test
assert False
For cores == 2 I get:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
As you may notice, I have already tried the logging approach suggested here, but that doesn't provide the assert information either.
Is there a way using multiprocessing module or any other threading module to get the location of an offending assert statement?
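One common workaround (a sketch, not specific to any library): wrap the worker so the child process formats its own traceback, including the failing assert's file and line, and ships that text back to the parent as the exception message:
import functools
import multiprocessing
import traceback

def with_traceback(func):
    """Re-raise worker exceptions with the child-side traceback in the message."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # traceback.format_exc() still knows the file and line of the
            # failing assert here, inside the worker process.
            raise type(e)(traceback.format_exc())
    return wrapper

@with_traceback
def test(foo):
    print(foo)
    assert False

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    pool.map(test, range(2))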
