In the following code, I'm trying to create a sandboxed master-worker system, in which changes to global variables in a worker aren't reflected in other workers.
To achieve this, a new process is created each time a task is submitted, and to make the execution parallel, the creation of processes itself is managed by a ThreadPoolExecutor.
import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pipe, Process

def task(conn, arg):
    conn.send(arg * 2)

def isolate_fn(fn, arg):
    def wrapped():
        parent_conn, child_conn = Pipe()
        p = Process(target=fn, args=(child_conn, arg), daemon=True)
        try:
            p.start()
            r = parent_conn.recv()
        finally:
            p.join()
        return r
    return wrapped

def main():
    with ThreadPoolExecutor(max_workers=4) as executor:
        pair = []
        for i in range(0, 10):
            pair.append((i, executor.submit(isolate_fn(task, i))))
            # This function makes the program broken.
            #
            print('foo')
        time.sleep(2)
        for arg, future in pair:
            if future.done():
                print('arg: {}, res: {}'.format(arg, future.result()))
            else:
                print('not finished: {}'.format(arg))
    print('finished')

main()
This program works fine until I put the print('foo') call inside the loop. If that call is present, some tasks remain unfinished and, what is worse, the program itself doesn't finish.
Results are not always the same, but the following is the typical output:
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
arg: 0, res: 0
arg: 1, res: 2
arg: 2, res: 4
not finished: 3
not finished: 4
not finished: 5
not finished: 6
not finished: 7
not finished: 8
not finished: 9
Why is this program so fragile?
I use Python 3.4.5.
Try using
from multiprocessing import set_start_method
... rest of your code here ....
if __name__ == '__main__':
    set_start_method('spawn')
    main()
If you search Stack Overflow for python multiprocessing and multithreading you will find a fair few questions mentioning similar hanging issues (especially for Python versions 2.7 and 3.2).
Mixing multithreading and multiprocessing is still a bit of an issue, and even the Python docs for multiprocessing.set_start_method mention that. In your case 'spawn' and 'forkserver' should work without any issues.
Another option might be to use multiprocessing.Pool directly, but this may not be possible for you in a more complex use case.
Btw. 'not finished' may still appear in your output, as you are not waiting for your subprocesses to finish, but the whole code should not hang anymore and should always finish cleanly.
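For completeness, here is a sketch of the question's main() that waits for every future (using concurrent.futures.wait) before reading the results, so nothing is reported as "not finished"; task and isolate_fn are assumed to be the ones from the question:

from concurrent.futures import ThreadPoolExecutor, wait

def main():
    with ThreadPoolExecutor(max_workers=4) as executor:
        pair = [(i, executor.submit(isolate_fn(task, i))) for i in range(10)]
        # Block until every worker has produced a result.
        wait([future for _, future in pair])
        for arg, future in pair:
            print('arg: {}, res: {}'.format(arg, future.result()))
    print('finished')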
You are not creating a ThreadPoolExecutor every time; rather, you are reusing the pre-initialized pool for every iteration. I am not really able to tell which print statement is hindering you.
Related
I'm trying to use "exec" to check some external code snippets for correctness and I wanted to trap infinite loops by spawning a process, waiting for a short period of time, then checking the local variables. I managed to shrink the code to this example:
import multiprocessing

def fHelper(queue, codeIn, globalsParamIn, localsParamIn):
    exec(codeIn, globalsParamIn, localsParamIn)  # Execute code string with limited builtins
    queue.put(localsParamIn['spam'])

def f(codeIn):
    globalsParam = {"float": float, "int": int, "len": len}
    spam = False
    localsParam = {'spam': spam}
    if __name__ == '__main__':
        queue = multiprocessing.Queue()
        p = multiprocessing.Process(target=fHelper, args=(queue, codeIn, globalsParam, localsParam))
        p.start()
        p.join(3)  # Wait for 3 seconds or until process finishes
        if p.is_alive():  # Just in case p hangs
            p.terminate()
            p.join()
        return queue.get(timeout=3)

fOut = f("spam=True")
print(fOut)
# assert fOut
Now the code as-is executes fine, but if you uncomment the last line (or use almost anything else - print(fOut.copy()) will do it) the queue times out. I'm using Python 3.8.2 on Windows.
I would welcome any suggestions on how to fix the bug, or better yet understand what on earth is going on.
Thanks!
I've been trying to write an interactive wrapper (for use in ipython) for a library that controls some hardware. Some calls are heavy on the IO so it makes sense to carry out the tasks in parallel. Using a ThreadPool (almost) works nicely:
from multiprocessing.pool import ThreadPool

class hardware():
    def __init__(self, IPaddress):
        connect_to_hardware(IPaddress)

    def some_long_task_to_hardware(self, wtime):
        wait(wtime)
        result = 'blah'
        return result

pool = ThreadPool(processes=4)
threads = []
h = [hardware(IP1), hardware(IP2), hardware(IP3), hardware(IP4)]
for tt in range(4):
    task = pool.apply_async(h[tt].some_long_task_to_hardware, (1000,))
    threads.append(task)
alive = [True] * 4
try:
    while any(alive):
        for tt in range(4):
            alive[tt] = not threads[tt].ready()
        do_other_stuff_for_a_bit()
except:
    # some command I cannot find that will stop the threads...
    raise
for tt in range(4):
    print(threads[tt].get())
The problem comes if the user wants to stop the process or there is an IO error in do_other_stuff_for_a_bit(). Pressing Ctrl+C stops the main process but the worker threads carry on running until their current task is complete.
Is there some way to stop these threads without having to rewrite the library or have the user exit python? pool.terminate() and pool.join() that I have seen used in other examples do not seem to do the job.
The actual routine (instead of the simplified version above) uses logging and although all the worker threads are shut down at some point, I can see the processes that they started running carry on until complete (and being hardware I can see their effect by looking across the room).
This is in python 2.7.
UPDATE:
The solution seems to be to switch to using multiprocessing.Process instead of a thread pool. The test code I tried is to run foo_pulse:
class foo(object):
    def foo_pulse(self, nPulse, name):  # just one method of *many*
        print('starting pulse for ' + name)
        result = []
        for ii in range(nPulse):
            print('on for ' + name)
            time.sleep(2)
            print('off for ' + name)
            time.sleep(2)
            result.append(ii)
        return result, name
If you try running this using ThreadPool, Ctrl+C does not stop foo_pulse from running (even though it does kill the threads right away, the print statements keep on coming):
from multiprocessing.pool import ThreadPool
import time

def test(nPulse):
    a = foo()
    pool = ThreadPool(processes=4)
    threads = []
    for rn in range(4):
        r = pool.apply_async(a.foo_pulse, (nPulse, 'loop ' + str(rn)))
        threads.append(r)
    alive = [True] * 4
    try:
        while any(alive):  # wait until all threads complete
            for rn in range(4):
                alive[rn] = not threads[rn].ready()
            time.sleep(1)
    except:  # stop threads if user presses ctrl-c
        print('trying to stop threads')
        pool.terminate()
        print('stopped threads')  # this line prints but output from foo_pulse carried on.
        raise
    else:
        for t in threads:
            print(t.get())
However a version using multiprocessing.Process works as expected:
import multiprocessing as mp
import time

def test_pro(nPulse):
    pros = []
    ans = []
    a = foo()
    for rn in range(4):
        q = mp.Queue()
        ans.append(q)
        r = mp.Process(target=wrapper, args=(a, "foo_pulse", q),
                       kwargs={'args': (nPulse, 'loop ' + str(rn))})
        r.start()
        pros.append(r)
    try:
        for p in pros:
            p.join()
        print('all done')
    except:  # stop threads if user stops findRes
        print('trying to stop threads')
        for p in pros:
            p.terminate()
        print('stopped threads')
    else:
        print('output here')
        for q in ans:
            print(q.get())
    print('exit time')
where I have defined a wrapper for the library foo (so that it did not need to be rewritten). If the return value is not needed, then neither is this wrapper:
def wrapper(a, target, q, args=(), kwargs={}):
    '''Used when return value is wanted'''
    q.put(getattr(a, target)(*args, **kwargs))
From the documentation I see no reason why a pool would not work (other than a bug).
This is a very interesting use of parallelism.
However, if you are using multiprocessing, the goal is to have many processes running in parallel, as opposed to one process running many threads.
Consider these few changes to implement it using multiprocessing:
You have these functions that will run in parallel:
import time
import multiprocessing as mp

def some_long_task_from_library(wtime):
    time.sleep(wtime)

class MyException(Exception): pass

def do_other_stuff_for_a_bit():
    time.sleep(5)
    raise MyException("Something Happened...")
Let's create and start the processes, say 4:
procs = []  # this is not a Pool, it is just a way to handle the
            # processes instead of calling them p1, p2, p3, p4...
for _ in range(4):
    p = mp.Process(target=some_long_task_from_library, args=(1000,))
    p.start()
    procs.append(p)

mp.active_children()  # returns the list of live children; as a side effect it joins any that already finished
The processes are running in parallel, presumably on separate CPU cores, but that is for the OS to decide. You can check in your system monitor.
In the meantime you run something that will break, and you want to stop the running processes rather than leaving them orphaned:
try:
    do_other_stuff_for_a_bit()
except MyException as exc:
    print(exc)
    print("Now stopping all processes...")
    for p in procs:
        p.terminate()
    print("The rest of the process will continue")
If it doesn't make sense to continue with the main process when one or all of the subprocesses have terminated, you should handle the exit of the main program.
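For instance, a minimal sketch of that clean shutdown, extending the except block above (sys.exit and the exit code are my additions):

import sys

try:
    do_other_stuff_for_a_bit()
except MyException as exc:
    print(exc)
    for p in procs:
        p.terminate()
    for p in procs:
        p.join()   # reap the terminated children before leaving
    sys.exit(1)    # stop the main program too, since continuing makes no sense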
Hope it helps, and you can adapt bits of this for your library.
In answer to the question of why Pool did not work: this is because (as quoted in the documentation) __main__ needs to be importable by the child processes, and due to the nature of this project interactive Python is being used.
At the same time it was not clear why ThreadPool would work, although the clue is right there in the name. ThreadPool creates its pool of workers using multiprocessing.dummy, which as noted here is just a wrapper around the threading module, whereas Pool uses multiprocessing.Process. This can be seen with this test:
p = ThreadPool(processes=3)
p._pool[0]
<DummyProcess(Thread23, started daemon 12345)>  # no terminate() method

p = Pool(processes=3)
p._pool[0]
<Process(PoolWorker-1, started daemon)>  # has handy terminate() method if needed
As threads do not have a terminate method the worker threads carry on running until they have completed their current task. Killing threads is messy (which is why I tried to use the multiprocessing module) but solutions are here.
The one warning about the solution using the above:
def wrapper(a, target, q, args=(), kwargs={}):
    '''Used when return value is wanted'''
    q.put(getattr(a, target)(*args, **kwargs))
is that changes to attributes inside the instance of the object are not passed back up to the main program. As an example the class foo above can also have methods such as:
def addIP(self, newIP):
    self.hardwareIP = newIP

A call to r = mp.Process(target=a.addIP, args=('127.0.0.1',)) does not update a.
The only way around this for a complex object seems to be shared memory using a custom manager, which can give access to both the methods and attributes of object a. For a very large complex object based on a library, this may be best done using dir(foo) to populate the manager. If I can figure out how, I'll update this answer with an example (for my future self as much as others).
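As a rough sketch of that manager idea (my own illustration, not the library's API: it assumes the default fork start method, a foo with an addIP/getIP pair, and it only exposes foo's methods, not its attributes, through the proxy):

from multiprocessing import Process
from multiprocessing.managers import BaseManager

class foo(object):
    def __init__(self):
        self.hardwareIP = None
    def addIP(self, newIP):
        self.hardwareIP = newIP
    def getIP(self):
        return self.hardwareIP

class FooManager(BaseManager):
    pass

FooManager.register('foo', foo)  # proxies created by the manager expose foo's public methods

if __name__ == '__main__':
    manager = FooManager()
    manager.start()
    a = manager.foo()  # the real instance lives in the manager's server process
    p = Process(target=a.addIP, args=('127.0.0.1',))
    p.start()
    p.join()
    print(a.getIP())  # the change made in the child is visible here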
If for some reason using threads is preferable, we can use this.
We can send some signal to the threads we want to terminate. The simplest signal is a global variable:
import time
from multiprocessing.pool import ThreadPool

_FINISH = False

def hang():
    while True:
        if _FINISH:
            break
        print 'hanging..'
        time.sleep(10)

def main():
    global _FINISH
    pool = ThreadPool(processes=1)
    pool.apply_async(hang)
    time.sleep(10)
    _FINISH = True
    pool.terminate()
    pool.join()
    print 'main process exiting..'

if __name__ == '__main__':
    main()
This code prints nothing:
def foo(i):
    print i

def main():
    pool = eventlet.GreenPool(size=100)
    for i in xrange(100):
        pool.spawn_n(foo, i)
    while True:
        pass
But this code prints numbers:
def foo(i):
    print i

def main():
    pool = eventlet.GreenPool(size=100)
    for i in xrange(100):
        pool.spawn_n(foo, i)
    pool.waitall()
    while True:
        pass
The only difference is pool.waitall(). In my mind, waitall() means wait until all greenthreads in the pool have finished working, but the infinite loop already keeps the program waiting for every greenthread, so pool.waitall() should not be necessary.
So why does this happen?
Reference: http://eventlet.net/doc/modules/greenpool.html#eventlet.greenpool.GreenPool.waitall
The threads created in an eventlet GreenPool are green threads. This means that they all exist within one thread at the operating-system level, and the Python interpreter handles switching between them. This switching can only happen when one thread either yields (deliberately provides an opportunity for other threads to run) or is waiting for I/O.
When your code runs:
while True:
    pass
… that thread of execution is blocked – stuck on that code – and no other green threads can get scheduled.
When you instead run:
pool.waitall()
… eventlet makes sure that it yields while waiting.
You could emulate this same behaviour by modifying your while loop slightly to call the eventlet.sleep function, which yields:
while True:
    eventlet.sleep()
This could be useful if you wanted to do something else in the while True: loop while waiting for the threads in your pool to complete. Otherwise, just use pool.waitall() – that’s what it’s for.
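As a rough, runnable sketch of both variants (assuming eventlet is installed; pool.running() is used on the assumption that it reports the number of greenthreads still working):

import eventlet

def foo(i):
    print(i)

pool = eventlet.GreenPool(size=100)
for i in range(100):
    pool.spawn_n(foo, i)

# Variant 1: block until every greenthread has finished, yielding while waiting.
pool.waitall()

# Variant 2: keep doing other work, but yield on every pass so the hub can
# schedule the pool's greenthreads.
while pool.running():
    eventlet.sleep(0.1)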
I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.
I would simply use a Queue for this. Start the threads and the first one which has a result ready writes to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue

def firstresult(*functions):
    queue = Queue()
    threads = []
    for f in functions:
        def thread_main(f=f):  # bind f per thread to avoid late-binding surprises
            queue.put(f())
        thread = Thread(target=thread_main)
        threads.append(thread)
        thread.start()
    result = queue.get()
    return result

def slow():
    sleep(1)
    return 42

def fast():
    return 0

if __name__ == '__main__':
    print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For this you need to add some state variable to the threads which is checked at regular intervals. As I want to keep this example short, I skipped that part and assumed that all workers get the time to finish their work even though the result is never read.
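A minimal sketch of such a state variable, using threading.Event (my choice here, not part of the code above):

import time
from threading import Thread, Event

stop = Event()

def worker(stop):
    while not stop.is_set():  # check the flag at regular intervals
        time.sleep(0.1)       # ... do one small chunk of work here ...
    print('worker saw the stop flag and returned')

t = Thread(target=worker, args=(stop,))
t.start()
time.sleep(1)
stop.set()   # ask the worker to finish
t.join()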
Skipping the discussion about the GIL as requested by the questioner. ;-)
Now, unlike my suggestion in the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time

def firstresult(func1, func2):
    queue = Queue()
    proc1 = Process(target=func1, args=(queue,))
    proc2 = Process(target=func2, args=(queue,))
    proc1.start(); proc2.start()
    result = queue.get()
    proc1.terminate(); proc2.terminate()
    return result

def algo1(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 1")

def algo2(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 2")

print firstresult(algo1, algo2)
Run each function in a new worker thread; the two worker threads send the result back to the main thread via a one-item queue or something similar. When the main thread receives the result from the winner, it kills (do Python threads support kill yet? lol) both worker threads to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.
You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithreading won't help much (precisely because of the GIL), so multiprocessing is the way to go.
If the return value is a pickleable (serializable) object, I have this decorator I created that simply runs the function in background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want, as both calls are non-blocking and start executing right away. The trick with this decorator is that it blocks (and waits for the function to complete) only when you try to use the return value.
But on the other hand - it is just a decorator that does all the work.
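For a rough idea of how such a decorator can work, here is a much-simplified stand-in, not the linked library: it assumes the default fork start method and a picklable return value, and it returns an explicit wait callable instead of a transparent proxy.

import functools
from multiprocessing import Pipe, Process

def background(fn):
    """Start fn in a child process; the call returns a zero-argument
    function that blocks until the result is available."""
    @functools.wraps(fn)
    def launcher(*args, **kwargs):
        parent_conn, child_conn = Pipe()

        def runner():
            child_conn.send(fn(*args, **kwargs))  # result must be picklable

        proc = Process(target=runner)
        proc.start()

        def wait():
            result = parent_conn.recv()  # blocks until the child sends the result
            proc.join()
            return result
        return wait
    return launcher

@background
def slow_add(a, b):
    return a + b

if __name__ == '__main__':
    pending = slow_add(2, 3)  # returns immediately, work happens in a child process
    print(pending())          # blocks here and prints 5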
I'm trying to understand multiprocessing in python.
from multiprocessing import Process

def multiply(a, b):
    print(a * b)
    return a * b

if __name__ == '__main__':
    p = Process(target=multiply, args=(5, 4))
    p.start()
    p.join()
    print("ok.")
In this code block, for example, if there were a variable called "result", how could we assign the return value of the multiply function to "result"?
And a little problem about IDLE: when I try to run this sample in the Python shell, it doesn't work properly. If I double-click the .py file, the output is like this:
20
ok.
But if I try to run it in IDLE:
ok.
Thanks...
OK, I somehow managed this. I looked at the Python documentation and learnt that by using the Queue class we can get return values from a function. The final version of my code is like this:
from multiprocessing import Process, Queue

def multiply(a, b, que):  # add an argument to the function for passing a queue
    que.put(a * b)  # we're putting the return value into the queue

if __name__ == '__main__':
    queue1 = Queue()  # create a queue object
    p = Process(target=multiply, args=(5, 4, queue1))  # we're passing queue1 as the 3rd argument
    p.start()
    print(queue1.get())  # and we're getting the return value: 20
    p.join()
    print("ok.")
There is also a Pipe() function; I think we could use Pipe(), too, but Queue worked for me for now.
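For reference, a sketch of the same example with Pipe() instead of Queue (the connection variable names are mine):

from multiprocessing import Process, Pipe

def multiply(a, b, conn):
    conn.send(a * b)  # send the result through the child end of the pipe
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=multiply, args=(5, 4, child_conn))
    p.start()
    print(parent_conn.recv())  # prints 20
    p.join()
    print("ok.")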
Does this help? This takes a list of functions (and their arguments), runs them in parallel, and returns their outputs. (This is old. A much newer version of this is at https://gitlab.com/cpbl/cpblUtilities/blob/master/parallel.py )
def runFunctionsInParallel(listOf_FuncAndArgLists):
    """
    Take a list of lists like [function, arg1, arg2, ...]. Run those functions in parallel,
    wait for them all to finish, and return the list of their return values, in order.
    (This still needs error handling, ie to ensure everything returned okay.)
    """
    from multiprocessing import Process, Queue

    def storeOutputFFF(fff, theArgs, que):  # add an argument to the function for passing a queue
        print 'MULTIPROCESSING: Launching %s in parallel ' % fff.func_name
        que.put(fff(*theArgs))  # we're putting the return value into the queue

    queues = [Queue() for fff in listOf_FuncAndArgLists]  # create a queue object for each function
    jobs = [Process(target=storeOutputFFF, args=[funcArgs[0], funcArgs[1:], queues[iii]])
            for iii, funcArgs in enumerate(listOf_FuncAndArgLists)]
    for job in jobs: job.start()  # Launch them all
    for job in jobs: job.join()   # Wait for them all to finish
    # And now, collect all the outputs:
    return [queue.get() for queue in queues]
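For example, a call could look like this (add and square are hypothetical stand-ins for real work):

def add(a, b):
    return a + b

def square(x):
    return x * x

# Each inner list is [function, arg1, arg2, ...]
print(runFunctionsInParallel([[add, 2, 3], [square, 4]]))  # [5, 16]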