This is not the first time I have had problems with Pool() while trying to debug something.
Please note that I am not trying to debug the parallelized code itself. I want to debug code which gets executed much later. The problem is that this later code is never reached, because pool.map() does not terminate.
I have created a simple pipeline which does a few things. Among those things is a very simple preprocessing step for textual data.
To speed things up I am running:
print('Preprocess text')
with Pool() as pool:
    df_preprocessed['text'] = pool.map(preprocessor.preprocess, df.text)
But here is the thing:
For some reason this code runs only once. The second time, I end up in an endless loop in _handle_workers() of the pool.py module:
@staticmethod
def _handle_workers(pool):
    thread = threading.current_thread()

    # Keep maintaining workers until the cache gets drained, unless the pool
    # is terminated.
    while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
        pool._maintain_pool()
        time.sleep(0.1)
    # send sentinel to stop workers
    pool._taskqueue.put(None)
    util.debug('worker handler exiting')
Note the while-loop. My script simply ends up there when calling my preprocessing function a second time.
Please note: This only happens during a debugging session! If the script is executed without a debugger everything is working fine.
What could be the reason for this?
Environment
$ python --version
Python 3.5.6 :: Anaconda, Inc.
Update
This could be a known bug
infinite waiting in multiprocessing.Pool
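A workaround I have seen for this class of hang, sketched under the assumption that the preprocessing itself does not need to be stepped through, is to fall back to a thread-based pool while a tracer is attached. Everything here is illustrative: preprocess stands in for preprocessor.preprocess, and sys.gettrace() is only a heuristic for detecting an attached debugger:

```python
import sys
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def preprocess(text):
    # stand-in for preprocessor.preprocess
    return text.lower()

def run_map(texts):
    # heuristic: sys.gettrace() is non-None while a tracing debugger runs
    PoolImpl = ThreadPool if sys.gettrace() is not None else Pool
    with PoolImpl() as pool:
        return pool.map(preprocess, texts)

if __name__ == "__main__":
    print(run_map(['Foo', 'Bar', 'Baz']))
```

multiprocessing.dummy.Pool exposes the same map interface backed by threads, which debuggers generally cope with better than worker processes.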
Related
I have a loop (all of this is in Python 3.10) that runs relatively fast compared to a function that needs to consume data from the loop. I don't want to slow down the data stream, and I am trying to find a way to run the function asynchronously but only execute it again after the previous call has completed... basically:
queue = []

def flow():
    thing = queue[0]
    time.sleep(.5)
    print(str(thing))
    delete = queue.pop(0)

p1 = multiprocessing.Process(target=flow)

while True:
    print('stream')
    time.sleep(.25)
    if len(queue) < 1:
        print('emptyQ')
        queue.append('flow')
        p1.start()
I've tried running the function in a thread and in a process, and both seem to try to start another instance while the function is still running. I tried using a queue both to pass the data and as a semaphore, by not removing the item until the end of the function and only adding an item (and starting the thread or process) if the queue was empty, but that didn't seem to work either.
EDIT : to add an explicit question...
Is there a way to execute a function asynchronously without executing it multiple times simultaneously?
EDIT2: Updated with functional test code (accurately reproduces the failure), since the real code is a bit more substantial... I have noticed that it seems to work on the first execution of the function (although the print inside the function doesn't work...), but the next execution fails for whatever reason. I assume it tries to start the process twice?
The error I get is: RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase...
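Assuming the RuntimeError above comes from reusing p1, here is a sketch of the pattern the question asks for: a Process object may be started at most once, so create a fresh one each round, and only once the previous run has finished. Names like stream and the fixed round count are mine, not from the original code:

```python
import multiprocessing
import time

def flow(q):
    # the slow consumer from the question
    thing = q.get()
    time.sleep(.5)
    print(str(thing))

def stream(rounds=4):
    queue = multiprocessing.Queue()
    worker = None
    started = 0
    for _ in range(rounds):  # stand-in for `while True`
        print('stream')
        time.sleep(.25)
        # only start the next run once the previous one has finished,
        # and build a new Process each time: start() works only once
        if worker is None or not worker.is_alive():
            queue.put('flow')
            worker = multiprocessing.Process(target=flow, args=(queue,))
            worker.start()
            started += 1
    if worker is not None:
        worker.join()
    return started

if __name__ == "__main__":
    stream()
```

Keeping the Process creation behind the if __name__ == "__main__" guard also avoids the bootstrapping RuntimeError on platforms that spawn rather than fork.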
I have a program which uses multiple processes to execute functions from an external hardware library. The communication between the worker process and my program happens via a JoinableQueue().
A part of the code looks like this:
# Main Code
queue_cmd.put("do_something")
queue_cmd.join()  # here is my problem

# multiprocess
task = queue_cmd.get()
if task == "do_something":
    external_class.do_something()
queue_cmd.task_done()
Note: external_class is the external hardware library.
This library sometimes crashes, and the line queue_cmd.task_done() then never gets executed. As a result, my main program hangs indefinitely at queue_cmd.join(), waiting for queue_cmd.task_done() to be called. Unfortunately, there is no timeout parameter for the join() function.
How can I wait for the element in the JoinableQueue to be processed, but also deal with the event of my multiprocess terminating (due to the crash in the do_something() function)?
Ideally, the join function would have a timeout parameter (.join(timeout=30)), which I could use to restart the multiprocess - but it does not.
You can always wrap a blocking function in another thread:
from datetime import datetime
from threading import Thread
import time

queue_cmd.put("do_something")

t = Thread(target=queue_cmd.join)
t.start()

# implement a timeout
start = datetime.now()
timeout = 10  # seconds
while t.is_alive() and (datetime.now() - start).seconds < timeout:
    # do something else while waiting for the join or the timeout
    time.sleep(0.1)

if t.is_alive():
    # the join timed out: kill the worker process that failed
    pass
I think the best approach here is to start the "crashable" module in (yet) another process:
Main code
queue_cmd.put("do_something")
queue_cmd.join()
Multiprocess (You can now move this to a thread)
task = queue_cmd.get()
if task == "do_something":
    subprocess.run(["python", "pleasedontcrash.py"])
queue_cmd.task_done()
pleasedontcrash.py
external_class.do_something()
As shown, I'd do it using subprocess. If you need to pass parameters, you could do that with subprocess using pipes or command-line arguments, but it's easier with multiprocessing.
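The hang case can also be handled on the subprocess side, since subprocess.run accepts a timeout argument. A sketch (run_safely and the use of sys.executable are my additions, not part of the original answer):

```python
import subprocess
import sys

def run_safely(script, timeout=30):
    """Run crash-prone code in a throwaway interpreter.

    Returns True on success, False if it crashed or hung.
    """
    try:
        completed = subprocess.run([sys.executable, script], timeout=timeout)
        return completed.returncode == 0  # non-zero means the child crashed
    except subprocess.TimeoutExpired:
        return False  # the child hung past the deadline and was killed
```

The caller could then requeue the task or restart the worker whenever run_safely returns False.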
Imports:
from dask.distributed import Client
import streamz
import time
Simulated workload:
def increment(x):
    time.sleep(0.5)
    return x + 1
Let's suppose I'd like to process some workload on a local Dask client:
if __name__ == "__main__":
    with Client() as dask_client:
        ps = streamz.Stream()
        ps.scatter().map(increment).gather().sink(print)

        for i in range(10):
            ps.emit(i)
This works as expected, but sink(print) will, of course, enforce waiting for each result, thus the stream will not execute in parallel.
However, if I use buffer() to allow results to be cached, then gather() no longer seems to collect all results correctly, and the interpreter exits before the results arrive. This approach:
if __name__ == "__main__":
    with Client() as dask_client:
        ps = streamz.Stream()
        ps.scatter().map(increment).buffer(10).gather().sink(print)
        #                           ^
        for i in range(10):         # - allow parallel execution
            ps.emit(i)              #   before gather()
...does not print any results for me. The Python interpreter exits shortly after starting the script, before buffer() emits its results, so nothing gets printed.
However, if the main process is forced to wait for some time, the results are printed in a parallel fashion (so they do not wait for each other but are printed nearly simultaneously):
if __name__ == "__main__":
    with Client() as dask_client:
        ps = streamz.Stream()
        ps.scatter().map(increment).buffer(10).gather().sink(print)

        for i in range(10):
            ps.emit(i)

        time.sleep(10)  # <- force main process to wait while ps is working
Why is that? I thought gather() should wait for a batch of 10 results since buffer() should cache exactly 10 results in parallel before flushing them to gather(). Why does gather() not block in this case?
Is there a nice way to otherwise check if a Stream still contains elements being processed in order to prevent the main process from exiting prematurely?
"Why is that?": because the Dask distributed scheduler (which executes the stream's map and sink functions) and your Python script run in different processes. When the with block's context ends, your Dask Client is closed and execution shuts down before the items emitted to the stream are able to reach the sink function.
"Is there a nice way to otherwise check if a Stream still contains elements being processed": not that I am aware of. However, if the behaviour you want is (I'm just guessing here) the parallel processing of a bunch of items, then Streamz is not what you should be using; vanilla Dask should suffice.
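For reference, a "vanilla Dask" version of processing a bunch of items in parallel is just a map/gather pair; client.gather blocks until every future is done, so the premature-exit problem disappears. A sketch reusing the increment function from the question:

```python
from dask.distributed import Client
import time

def increment(x):
    time.sleep(0.5)
    return x + 1

if __name__ == "__main__":
    with Client() as client:
        futures = client.map(increment, range(10))
        results = client.gather(futures)  # blocks until all tasks finish
        print(results)
```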
I am trying to use the Python multiprocessing library in order to parallelize a task I am working on:
import multiprocessing as MP

def myFunction((x, y, z)):
    ...create a sqlite3 database specific to x, y, z
    ...write to the database (one DB per process)

y = 'somestring'
z = <large read-only global dictionary to be shared>

jobs = []
for x in X:
    jobs.append((x, y, z,))

pool = MP.Pool(processes=16)
pool.map(myFunction, jobs)
pool.close()
pool.join()
Sixteen processes are started, as seen in htop; however, no errors are returned, no files are written, and no CPU is used.
Could it happen that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the python script is called from a bash script running in background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise, and the processes kept running as if nothing had happened.
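The swap is a one-line change, because multiprocessing.dummy exposes the same Pool API backed by threads, so exceptions raised in the worker function surface in the main process instead of dying silently (doubler is a stand-in worker, not from the original code):

```python
from multiprocessing.dummy import Pool  # threads, but the same Pool API

def doubler(args):
    x, y, z = args
    # any exception raised here propagates out of pool.map()
    return x * 2

jobs = [(x, 'somestring', {}) for x in range(4)]
with Pool(processes=4) as pool:
    results = pool.map(doubler, jobs)
print(results)  # [0, 2, 4, 6]
```

Once the function is known to work, switching back to multiprocessing.Pool restores real parallelism.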
I've been trying to parallelise some code using concurrent.futures.ProcessPoolExecutor but have kept having strange deadlocks that don't occur with ThreadPoolExecutor. A minimal example:
from concurrent import futures

def test():
    pass

with futures.ProcessPoolExecutor(4) as executor:
    for i in range(100):
        print('submitting {}'.format(i))
        executor.submit(test)
In Python 3.2.2 (on 64-bit Ubuntu), this seems to hang consistently after submitting all the jobs, and it seems to happen whenever the number of jobs submitted is greater than the number of workers. If I replace ProcessPoolExecutor with ThreadPoolExecutor, it works flawlessly.
As an attempt to investigate, I gave each future a callback to print the value of i:
from concurrent import futures

def test():
    pass

with futures.ProcessPoolExecutor(4) as executor:
    for i in range(100):
        print('submitting {}'.format(i))
        future = executor.submit(test)

        def callback(f):
            print('callback {}'.format(i))
        future.add_done_callback(callback)
This just confused me even more - the value of i printed out by callback is the value at the time it is called, rather than at the time it was defined (so I never see callback 0 but I get lots of callback 99s). Again, ThreadPoolExecutor prints out the expected value.
Wondering if this might be a bug, I tried a recent development version of python. Now, the code at least seems to terminate, but I still get the wrong value of i printed out.
So can anyone explain:
what happened to ProcessPoolExecutor in between python 3.2 and the current dev version that apparently fixed this deadlock
why the 'wrong' value of i is being printed
EDIT: as jukiewicz pointed out below, of course printing i will print the value at the time the callback is called; I don't know what I was thinking... if I pass a callable object with the value of i as one of its attributes, that works as expected.
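The late-binding behaviour and the callable-object fix mentioned here can be demonstrated without an executor at all (Callback is an illustrative name):

```python
class Callback:
    def __init__(self, i):
        self.i = i  # the value is bound when the object is created

    def __call__(self, future):
        print('callback {}'.format(self.i))

# closures capture the variable itself, not its value at definition time:
closures = [lambda: i for i in range(3)]
print([f() for f in closures])     # [2, 2, 2] -- every closure sees the final i

# callable objects capture the value explicitly:
objects = [Callback(i) for i in range(3)]
print([obj.i for obj in objects])  # [0, 1, 2]
```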
EDIT: a little bit more information: all of the callbacks are executed, so it looks like it is executor.shutdown (called by executor.__exit__) that is unable to tell that the processes have completed. This does seem to be completely fixed in the current python 3.3, but there seem to have been a lot of changes to multiprocessing and concurrent.futures, so I don't know what fixed this. Since I can't use 3.3 (it doesn't seem to be compatible with either the release or dev versions of numpy), I tried simply copying its multiprocessing and concurrent packages across to my 3.2 installation, which seems to work fine. Still, it seems a little weird that - as far as I can see - ProcessPoolExecutor is completely broken in the latest release version but nobody else is affected.
I modified the code as follows, which solved both problems. The callback function was defined as a closure, and thus would use the updated value of i every time. As for the deadlock, that is likely caused by shutting down the Executor before all the tasks are complete. Waiting for the futures to complete solves that, too.
from concurrent import futures

def test(i):
    return i

def callback(f):
    print('callback {}'.format(f.result()))

with futures.ProcessPoolExecutor(4) as executor:
    fs = []
    for i in range(100):
        print('submitting {}'.format(i))
        future = executor.submit(test, i)
        future.add_done_callback(callback)
        fs.append(future)

    for _ in futures.as_completed(fs):
        pass
UPDATE: oh, sorry, I hadn't read your updates; this seems to have been solved already.