I want to fill a dictionary in a loop. Iterations in the loop are independent from each other. I want to perform this on a cluster with thousands of processors. Here is a simplified version of what I tried and need to do.
import multiprocessing
class Worker(multiprocessing.Process):
    def setName(self,name):
        self.name=name
    def run(self):
        print ('In %s' % self.name)
        return
if __name__ == '__main__':
    jobs = []
    names=dict()
    for i in range(10000):
        p = Worker()
        p.setName(str(i))
        names[str(i)]=i
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I tried this in Python 3 on my own computer and received the following error:
..
In 249
Traceback (most recent call last):
File "test.py", line 16, in <module>
p.start()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
In 250
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
Is there any better way to do this?
multiprocessing talks to its subprocesses via pipes. Each subprocess requires two open file descriptors, one for reading and one for writing. If you launch 10000 workers, you'll end up opening 20000 file descriptors, which exceeds the default limit on OS X (which your paths indicate you're using).
You can fix the issue by raising the limit. See https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1 for details - basically, it amounts to setting two sysctl knobs and upping your shell's ulimit setting.
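On Unix-like systems the per-process limit can also be inspected, and sometimes raised, from inside Python with the standard `resource` module. This is only a minimal sketch and not part of the linked answer: the helper name `raise_open_file_limit` and the target of 4096 are made up for illustration, and the OS may still refuse the request (the soft limit can never go above the hard limit).

```python
import resource


def raise_open_file_limit(wanted):
    """Try to raise the soft limit on open files to `wanted` (hypothetical helper).

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < wanted:
        # The soft limit may never exceed the hard limit.
        new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the OS refused the change; keep the existing limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit before:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
    print("soft limit after: ", raise_open_file_limit(4096))
```

Even with a higher limit, starting 10000 processes at once is unlikely to be faster than a pool sized to the number of cores.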
You are currently spawning 10000 processes at once. That really isn't a good idea.
The error you see is most likely raised because the multiprocessing module (seems to) use pipes for inter-process communication, and there is a limit on the number of open pipes/file descriptors.
I suggest using a Python interpreter without a global interpreter lock, like Jython or IronPython, and simply replacing the multiprocessing module with the threading one.
If you still want to use the multiprocessing module, you could use a process pool like this to collect the return values:
from multiprocessing import Pool
def worker(params):
    name, someArg = params
    print ('In %s' % name)
    # do something with someArg here
    return (name, someArg)
if __name__ == '__main__':
    jobs = []
    names=dict()
    # Spawn 100 worker processes
    pool = Pool(processes=100)
    # Fill with real data
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # Process every task via our pool
    results = pool.map(worker, task_dict.items())
    # And convert the result to a dict
    results = dict(results)
    print (results)
This should work with minimal changes for the threading module, too.
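For the thread-based variant, one option is `multiprocessing.pool.ThreadPool`, which offers the same `map` interface as `Pool` but runs the workers as threads in the current process. The sketch below assumes a placeholder computation (doubling the value), and `fill_dict` is a name I made up. Keep in mind that on CPython the GIL means threads mainly help when the work is I/O-bound or releases the GIL.

```python
from multiprocessing.pool import ThreadPool


def worker(params):
    name, some_arg = params
    # Placeholder computation; replace with the real per-item work.
    return name, some_arg * 2


def fill_dict(tasks, n_threads=8):
    """Run `worker` over the items of `tasks` on a thread pool and collect a dict."""
    with ThreadPool(processes=n_threads) as pool:
        return dict(pool.map(worker, tasks.items()))


if __name__ == "__main__":
    task_dict = {"name_{}".format(i): i for i in range(1000)}
    results = fill_dict(task_dict)
    print(len(results), results["name_10"])
```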
I am trying to:
share a dataframe between processes
update a shared dict based on calculations performed on (but not changing) that dataframe
I am using a multiprocessing.Manager() to create a dict in shared memory (to store results) and a Namespace to store/share my dataframe that I want to read from.
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df
if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()
    n = 100
    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')
    namespace.df = dataframe_to_be_shared
    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)
    jobs = []
    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    print(shared_dict[1])
When running the above, it writes to shared_dict correctly as my print statement executes with some data. I also get an error regarding the manager:
Process Process-88:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 788, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/henrysorsky/Library/Preferences/PyCharm2019.2/scratches/scratch_13.py", line 34, in edit_df_in_shared_dict
row_to_insert = namespace.df.loc[ind]
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 1099, in __getattr__
return callmethod('__getattribute__', (key,))
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 792, in _callmethod
self._connect()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 779, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
I understand this is coming from the manager and seems to be due to it not shutting down properly. The only similar issue I can find online:
Share list between process in python server
suggests joining all the child processes, which I am already doing.
So after a full night's sleep I realised that it was actually reading the dataframe from shared memory that was causing the issue: from around the 20th child process onwards, some of them were failing this read. I added a maximum number of processes to run at once and this solved it.
For anyone wondering, the code I used is:
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df
if __name__ == '__main__':
    # region define inputs
    max_jobs_running = 4
    n = 100
    # endregion
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()
    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')
    namespace.df = dataframe_to_be_shared
    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)
    jobs = []
    jobs_running = 0
    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()
        jobs_running += 1
        if jobs_running >= max_jobs_running:
            while jobs_running >= max_jobs_running:
                jobs_running = 0
                for p in jobs:
                    jobs_running += p.is_alive()
    for p in jobs:
        p.join()
    for key, value in shared_dict.items():
        print(f"key: {key}")
        print(f"value: {value}")
        print("-" * 50)
This would probably be better handled by a Queue and Pool setup rather than my hacky fix.
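A rough sketch of what a pool-based version could look like, under some simplifying assumptions: a plain dict of floats stands in for the dataframe so the example is self-contained, and `process_row` and `build_results` are names I made up. Each task carries only the row it needs and returns its result to the parent, so no Manager dict or Namespace is involved and the pool size caps how many processes run at once.

```python
import multiprocessing
import random


def process_row(item):
    """Compute the result for one (player_id, value) pair; placeholder logic."""
    player_id, value = item
    return player_id, {"player_id": player_id, "data": value}


def build_results(rows, max_jobs_running=4):
    """Process all rows with at most `max_jobs_running` worker processes."""
    with multiprocessing.Pool(processes=max_jobs_running) as pool:
        # Results come back through the pool and are collected in the parent.
        return dict(pool.imap_unordered(process_row, rows.items()))


if __name__ == "__main__":
    n = 100
    rows = {i: random.random() for i in range(n)}
    results = build_results(rows)
    print(len(results), results[1])
```

With pandas, the same idea would be to pass each row (or a small slice) as the task argument rather than sharing the whole dataframe through a manager.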
The problem is probably in your main process, which created the shared dict. If you forgot to use process.join() (or an infinite loop) in your main process, then the main process may finish before the other processes using the dict. This way the dict gets destroyed, and the processes cannot connect to it.
The number of processes should not be a problem. You should be able to use the dict with as many as you wish.
TL;DR This error might happen if you initiate too many new connections to multiprocessing.Manager() objects in parallel, due to the hard-coded backlog limit (16 at the time of writing) in multiprocessing/managers.py:
# do authentication later
self.listener = Listener(address=address, backlog=16)
self.address = self.listener.address
Details: I was starting a few hundred subprocesses, each trying to get a value from a multiprocessing.Manager().dict object at the very start of my program (basically instantly, in parallel). The first few worked fine, but then they started to fail sporadically.
Interestingly, in my case, this only happened under the VSCode debugger. I found a mailing list discussion mentioning this issue more than 10 years ago. Looking at the source code of multiprocessing, I found that the backlog limit is still hard-coded (it seems to have been increased from 5 to 16 in modern versions). I increased it to 64 and all errors were gone.
So if the pending connections queue reaches the limit, all new connections will be refused. Especially when you run your code under debugger, connections are getting served a tick slower and the backlog buffer may get full when hundreds of them are flowing fast in parallel.
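If editing the standard library is not an option, one possible workaround (my own suggestion, not something from the discussion mentioned above) is to retry the refused call in the child after a short pause, assuming the refusals are transient. `call_with_retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```python
import time


def call_with_retry(func, attempts=5, delay=0.1):
    """Call `func()`; on ConnectionRefusedError wait and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the error
            time.sleep(delay * 2 ** attempt)  # exponential backoff
```

In a child process this could wrap the first access to the proxy, for example `call_with_retry(lambda: shared[key])`, where `shared` and `key` are placeholders for your own manager dict and key. Limiting how many children start at the same time avoids the burst in the first place.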
Inspired by this solution I am trying to set up a multiprocessing pool of worker processes in Python. The idea is to pass some data to the worker processes before they actually start their work and reuse it eventually. It's intended to minimize the amount of data which needs to be packed/unpacked for every call into a worker process (i.e. reducing inter-process communication overhead). My MCVE looks like this:
import multiprocessing as mp
import numpy as np
def create_worker_context():
    global context # create "global" context in worker process
    context = {}
def init_worker_context(worker_id, some_const_array, DIMS, DTYPE):
    context.update({
        'worker_id': worker_id,
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype = DTYPE),
    }) # store context information in global namespace of worker
    return True # return True, verifying that the worker process received its data
class data_analysis:
    def __init__(self):
        self.DTYPE = 'float32'
        self.CPU_LEN = mp.cpu_count()
        self.DIMS = 100
        self.some_const_array = np.zeros((self.DIMS, self.DIMS), dtype = self.DTYPE)
        # Init multiprocessing pool
        self.cpu_pool = mp.Pool(processes = self.CPU_LEN, initializer = create_worker_context) # create pool and context in workers
        pool_results = [
            self.cpu_pool.apply_async(
                init_worker_context,
                args = (core_id, self.some_const_array, self.DIMS, self.DTYPE)
            ) for core_id in range(self.CPU_LEN)
        ] # pass information to workers' context
        result_batches = [result.get() for result in pool_results] # check if they got the information
        if not all(result_batches): # raise an error if things did not work
            raise SyntaxError('Workers could not be initialized ...')
    @staticmethod
    def process_batch(batch_data):
        context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
        return context['tmp'] # return result
    def process_all(self):
        input_data = np.arange(0, self.DIMS ** 2, dtype = self.DTYPE).reshape(self.DIMS, self.DIMS)
        pool_results = [
            self.cpu_pool.apply_async(
                data_analysis.process_batch,
                args = (input_data,)
            ) for _ in range(self.CPU_LEN)
        ] # let workers actually work
        result_batches = [result.get() for result in pool_results]
        for batch in result_batches[1:]:
            np.add(result_batches[0], batch, out = result_batches[0]) # reduce batches
        print(result_batches[0]) # show result
if __name__ == '__main__':
    data_analysis().process_all()
I am running the above with CPython 3.6.6.
The strange thing is ... sometimes it works, sometimes it does not. If it does not work, the process_batch method throws an exception, because it can not find some_const_array as a key in context. The full traceback looks like this:
(env) me@box:/path> python so.py
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/path/so.py", line 37, in process_batch
context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
KeyError: 'some_const_array'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/so.py", line 54, in <module>
data_analysis().process_all()
File "/path/so.py", line 48, in process_all
result_batches = [result.get() for result in pool_results]
File "/path/so.py", line 48, in <listcomp>
result_batches = [result.get() for result in pool_results]
File "/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
KeyError: 'some_const_array'
I am puzzled. What is going on here?
If my context dictionaries contain an object of "higher type", e.g. a database driver or similar, I am not getting this kind of problem. I can only reproduce this if my context dictionaries contain basic Python data types, collections or numpy arrays.
(Is there a potentially better approach for achieving the same thing in a more reliable manner? I know my approach is considered a hack ...)
You need to relocate the content of init_worker_context into your initializer function create_worker_context.
Your assumption that every single worker process will run init_worker_context is responsible for your confusion.
The tasks you submit to a pool get fed into one internal task queue that all worker processes read from. What happens in your case is that some worker processes complete their task and then compete for new tasks again. So it can happen that one worker process executes multiple tasks while another one does not get a single one. Keep in mind that the OS schedules runtime for the threads (of the worker processes).
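A minimal sketch of that relocation, assuming plain Python lists in place of the NumPy arrays so that it runs without extra packages (`init_worker` and `some_const_list` are illustrative names). The pool calls the initializer once in every worker process it starts, and `initargs` carries the constant data, so no worker can miss its setup.

```python
import multiprocessing as mp

context = {}  # per-process storage; each worker fills its own copy


def init_worker(some_const_list, dims):
    """Runs once in every worker process when the pool starts it."""
    context.update({
        "some_const_list": some_const_list,
        "tmp": [0.0] * dims,
    })


def process_batch(batch):
    """Use the data stored by `init_worker` instead of receiving it per call."""
    const = context["some_const_list"]
    tmp = context["tmp"]
    for i, value in enumerate(batch):
        tmp[i] = const[i] + value
    return list(tmp)


if __name__ == "__main__":
    dims = 4
    const = [1.0] * dims
    with mp.Pool(processes=2, initializer=init_worker, initargs=(const, dims)) as pool:
        print(pool.map(process_batch, [[float(i)] * dims for i in range(6)]))
```

The constant data is sent once per worker instead of once per task, which is what the original approach was trying to achieve.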
I have 100-1000 time series paths and a fairly expensive simulation that I'd like to parallelize. However, the library I'm using hangs on rare occasions and I'd like to make my code robust to those issues. This is the current setup:
with Pool() as pool:
    res = pool.map_async(simulation_that_occasionally_hangs, (p for p in paths))
    all_costs = res.get()
I know get() has a timeout parameter but if I understand correctly that works on the whole process of the 1000 paths. What I'd like to do is check if any single simulation is taking longer than 5 minutes (a normal path takes 4 seconds) and if so just stop that path and continue to get() the rest.
EDIT:
Testing timeout in pebble
from concurrent.futures import TimeoutError
from pebble import ProcessPool
def fibonacci(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return fibonacci(n - 1) + fibonacci(n - 2)
def main():
    with ProcessPool() as pool:
        future = pool.map(fibonacci, range(40), timeout=10)
        iterator = future.result()
        all = []
        while True:
            try:
                all.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as e:
                print(f'function took longer than {e.args[1]} seconds')
    print(all)
Errors:
RuntimeError: I/O operations still in flight while destroying Overlapped object, the process may crash
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\anaconda3\lib\multiprocessing\spawn.py", line 99, in spawn_main
new_handle = reduction.steal_handle(parent_pid, pipe_handle)
File "C:\anaconda3\lib\multiprocessing\reduction.py", line 87, in steal_handle
_winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] Access is denied
The pebble library has been designed to address these kinds of issues. It transparently handles job timeouts and failures such as C library crashes.
You can check the documentation examples to see how to use it. Its interface is similar to that of concurrent.futures.
Probably the easiest way is to run each heavy simulation in a separate subprocess, with the parent process watching it. Specifically:
def risky_simulation(path):
    ...
def safe_simulation(path):
    p = multiprocessing.Process(target=risky_simulation, args=(path,))
    p.start()
    p.join(timeout) # Your timeout here
    p.kill() # or p.terminate()
    # Here read and return the output of the simulation
    # Can be from a file, or using some communication object
    # between processes, from the `multiprocessing` module
with Pool() as pool:
    res = pool.map_async(safe_simulation, paths)
    all_costs = res.get()
Notes:
If the simulation may hang, you will want to run it in a separate process (i.e. the Process object should not be replaced by a thread), because, depending on how it hangs, it may hold on to the GIL.
This solution only uses the pool for the immediate sub-processes, but the computations are off-loaded to new processes. We can also make sure the computations share a pool, but that would result in uglier code, so I skipped it.
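One caveat: `multiprocessing.Pool` workers are daemonic, and a daemonic process is not allowed to create child processes, so starting a `Process` from inside a `Pool` worker raises an error. A variant that avoids this is to do the supervising from a `ThreadPool`, since each supervisor only waits on its child. The sketch below is one way to do it, not the only one: `risky_simulation` is a stand-in that "hangs" on negative inputs, the result travels back over a `Pipe`, and a timed-out path yields `None`. The helper names and the timeout are made up for illustration.

```python
import multiprocessing
import time
from functools import partial
from multiprocessing.pool import ThreadPool


def risky_simulation(path):
    """Stand-in for the real simulation; negative inputs simulate a hang."""
    if path < 0:
        time.sleep(3600)
    return path * 2


def _run_in_child(send_end, path):
    send_end.send(risky_simulation(path))
    send_end.close()


def safe_simulation(path, timeout):
    """Run one simulation in its own process; return None if it times out."""
    recv_end, send_end = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_run_in_child, args=(send_end, path))
    proc.start()
    send_end.close()  # the parent only reads
    try:
        if recv_end.poll(timeout):  # wait up to `timeout` seconds for a result
            return recv_end.recv()
        return None
    except EOFError:
        return None  # the child exited without sending anything
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
        recv_end.close()


def run_all(paths, timeout, n_watchers=4):
    # Threads are enough here: each one only waits on a child process.
    with ThreadPool(processes=n_watchers) as pool:
        return pool.map(partial(safe_simulation, timeout=timeout), paths)


if __name__ == "__main__":
    print(run_all([1, 2, -1, 3], timeout=2.0))
```

The number of watcher threads bounds how many simulations run at the same time, so it plays the role of the pool size.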
I am trying to process a file by cutting it up into chunks and running them through a function which processes the chunks and returns a numpy array. After looking around it seems the best method would be to use the Pool.map method by passing through classes as the arguments. These classes are initiated with the chunk sections as a variable, and another variable to store the numpy array. The output list of classes can then be parsed to get out the information I need to continue with the problem. Here is a simplified version of the script I am trying to write:
from multiprocessing import Pool
class container():
    def __init__(self, k):
        self.input_section = k
        self.output_answer = 0
def compute(object_class):
    # Main operation would go on in here....
    object_class.output_answer = object_class.input_section
    return object_class
def Main():
    # Create list of classes to pass as arguments
    sections = [container(k) for k in range(10)]
    # Create pool and compute modified classes
    with Pool(4) as p:
        results = p.map(compute, sections)
    # Decode here to get answers
    sections = [k.output_answer for k in results]
    # Print answers
    print(sections)
if __name__ == '__main__':
    Main()
This is the error that I get when I run the script:
Exception in thread Thread-9: Traceback (most recent call last):
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 463, in _handle_results
task = get()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'container' on module '__main__' from
'C:\\Users\\rbernon\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\spyder\\utils\\ipython\\start_kernel.py'>
Any help would be greatly appreciated!
Keep in mind that every piece of data you want to have processed needs to be pickled and sent to the worker processes.
The overhead of this will reduce (and might even eliminate) the advantages of using multiple processes.
If the data file is large, it is probably better to send each worker a start and end offset as a 2-tuple of numbers, so each worker can read part of the file and process it.
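A small sketch of the offset idea, with some assumptions of my own: the "processing" is just counting newline bytes, the file is a throwaway temporary file, and `byte_ranges` and `count_newlines` are illustrative names. Only the path and two integers are pickled per task; each worker opens the file itself and reads its own range.

```python
import os
import tempfile
from multiprocessing import Pool


def byte_ranges(file_size, n_chunks):
    """Split [0, file_size) into at most `n_chunks` (start, end) offset pairs."""
    step = max(1, -(-file_size // n_chunks))  # ceiling division
    return [(start, min(start + step, file_size))
            for start in range(0, file_size, step)]


def count_newlines(task):
    """Worker: open the file itself and read only its own byte range."""
    path, start, end = task
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start).count(b"\n")


if __name__ == "__main__":
    # Build a small throwaway file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"line\n" * 1000)
        path = tmp.name
    tasks = [(path, s, e) for s, e in byte_ranges(os.path.getsize(path), 4)]
    with Pool(4) as pool:
        print(sum(pool.map(count_newlines, tasks)))
    os.remove(path)
```

Plain byte offsets can cut a record in the middle, so a real parser would need to align each range to a record boundary (for example the next newline) before processing.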
I'm having a problem with the Python multiprocessing package. Below is a simple example code that illustrates my problem.
import multiprocessing as mp
import time
def test_file(f):
    f.write("Testing...\n")
    print f.name
    return None
if __name__ == "__main__":
    f = open("test.txt", 'w')
    proc = mp.Process(target=test_file, args=[f])
    proc.start()
    proc.join()
When I run this, I get the following error.
Process Process-1:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self.target(*self._args, **self._kwargs)
File "C:\Users\Ray\Google Drive\Programming\Python\tests\follow_test.py", line 24, in test_file
f.write("Testing...\n")
ValueError: I/O operation on closed file
Press any key to continue . . .
It seems that somehow the file handle is 'lost' during the creation of the new process. Could someone please explain what's going on?
I had similar issues in the past. Not sure whether it is done within the multiprocessing module or whether open sets the close-on-exec flag by default but I know for sure that file handles opened in the main process are closed in the multiprocessing children.
The obvious workaround is either to pass the filename as a parameter to the child process's init function and open the file once within each child (if using a pool), or to pass it as a parameter to the target function and open/close the file on each invocation. The former requires the use of a global to store the file handle (not a good thing) - unless someone can show me how to avoid that :) - and the latter can incur a performance hit (but can be used with multiprocessing.Process directly).
Example of the former:
filehandle = None
def child_init(filename):
    global filehandle
    filehandle = open(filename,...)
    ../..
def child_target(args):
    ../..
if __name__ == '__main__':
    # some code which defines filename
    proc = multiprocessing.Pool(processes=1,initializer=child_init,initargs=[filename])
    proc.apply(child_target,args)
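The latter option (pass the filename to the target function and open the file there) could look roughly like this. It is a Python 3 sketch; `write_line` and the temporary file are placeholders of mine, not code from the question.

```python
import multiprocessing as mp
import os
import tempfile


def write_line(filename, text):
    """Child-side function: open the file by name, write, and close it."""
    with open(filename, "a") as f:
        f.write(text)
        return f.name


if __name__ == "__main__":
    fd, filename = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the name is handed to the child, not the handle
    proc = mp.Process(target=write_line, args=(filename, "Testing...\n"))
    proc.start()
    proc.join()
    with open(filename) as f:
        print(f.read(), end="")
    os.remove(filename)
```

Opening in append mode means several children can each add lines without truncating what the others wrote, although the order of the lines is then up to the scheduler.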