Large objects and `multiprocessing` pipes and `send()` - python

I've recently found out that if we create a pair of parent-child connection objects using multiprocessing.Pipe, and the object obj we're trying to send through the pipe is too large, my program hangs without throwing an exception or doing anything at all. See the code below. (It uses the numpy package to produce a large array of floats.)
import multiprocessing as mp
import numpy as np

def big_array(conn, size=1200):
    a = np.random.rand(size)
    print "Child process trying to send array of %d floats." %size
    conn.send(a)
    return a

if __name__ == "__main__":
    print "Main process started."
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=big_array, args=[child_conn, 1200])
    proc.start()
    print "Child process started."
    proc.join()
    print "Child process joined."
    a = parent_conn.recv()
    print "Received the following object."
    print "Type: %s. Size: %d." %(type(a), len(a))
The output is the following.
Main process started.
Child process started.
Child process trying to send array of 1200 floats.
And it hangs here indefinitely. However, if instead of 1200, we try to send an array with 1000 floats, then the program executes successfully, with the following output as expected.
Main process started.
Child process started.
Child process trying to send array of 1000 floats.
Child process joined.
Received the following object.
Type: <type 'numpy.ndarray'>. Size: 1000.
Press any key to continue . . .
This looks like a bug to me. The documentation says the following.
send(obj)
Send an object to the other end of the connection which should be read using recv().
The object must be picklable. Very large pickles (approximately 32 MB+, though it depends on the OS) may raise a ValueError exception.
But in my run not even a ValueError was raised; the program just hangs there. Moreover, the 1200-element numpy array is only 9600 bytes, certainly nowhere near 32 MB. This looks like a bug. Does anyone know how to solve this problem?
By the way, I'm using Windows 7, 64-bit.

Try to move join() below recv():
import multiprocessing as mp

def big_array(conn, size=1200):
    a = "a" * size
    print "Child process trying to send array of %d floats." %size
    conn.send(a)
    return a

if __name__ == "__main__":
    print "Main process started."
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=big_array, args=[child_conn, 120000])
    proc.start()
    print "Child process started."
    a = parent_conn.recv()   # receive first, so the child's send() can complete
    proc.join()
    print "Child process joined."
    print "Received the following object."
    print "Type: %s. Size: %d." %(type(a), len(a))
But I don't really understand why your example works even for small sizes. I would have thought that writing to the pipe and then joining the process without first reading the data from the pipe would block the join: you should first receive from the pipe, then join. But apparently it does not block for small sizes (presumably because a small payload fits in the OS pipe buffer, so send() returns and the child can exit)?
Edit: from the docs (http://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming):
"An example which will deadlock is the following:"
from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join()            # this deadlocks
    obj = queue.get()
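As that same section of the docs goes on to say, the fix is to swap the last two lines (or simply remove the p.join() line), so the parent drains the queue before it waits for the child. A minimal sketch of the fixed version:

from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    obj = queue.get()   # drain the queue first...
    p.join()            # ...then join; no deadlock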

Related

Kill the sub-processes by signal: why does the position of the handler matter?

I use the signal module to kill all sub-processes in a multi-process program. The code is shown below, saved as a file named mul_process.py:
import time
import os
import signal
from multiprocessing import Process

processes = []

def fun(x):
    print 'current sub-process pid is %s' % os.getpid()
    while True:
        print 'args is %s' % x
        time.sleep(100)

def term(sig_num, frame):
    print 'terminate process %d' % os.getpid()
    for p in processes:
        print p.pid
    try:
        for p in processes:
            print 'process %d terminate' % p.pid
            p.terminate()
            p.join()
    except Exception as e:
        print str(e)

if __name__ == '__main__':
    print 'current main-process pid is %s' % os.getpid()
    for i in range(3):
        t = Process(target=fun, args=(str(i),))
        t.start()
        processes.append(t)
    signal.signal(signal.SIGTERM, term)
    try:
        for p in processes:
            p.join()
    except Exception as e:
        print str(e)
I launch the program with 'python mul_process.py' on Ubuntu 10.04.4 with Python 2.6. While it is running, I use kill -15 with the main process pid in another tab to send SIGTERM. When the main process receives SIGTERM it exits after terminating all sub-processes, but when I use kill -15 with a sub-process pid it does not work: the program stays alive and keeps running as before, and the message defined in the function term is never printed, so it seems the sub-process does not receive the SIGTERM. As far as I know the sub-process should inherit the signal handler, but it doesn't work here; that is the first question.
I then moved the line signal.signal(signal.SIGTERM, term) to the position right after the line if __name__ == '__main__':, like this:
if __name__ == '__main__':
    signal.signal(signal.SIGTERM, term)
    print 'current main-process pid is %s' % os.getpid()
    for i in range(3):
        t = Process(target=fun, args=(str(i),))
        t.start()
        processes.append(t)
    try:
        for p in processes:
            p.join()
    except Exception as e:
        print str(e)
Launching the program and using kill -15 with the main process pid to send SIGTERM, the program receives the signal and calls the function term, but it still does not kill any sub-processes, nor does it exit itself; that is the second question.
There are a few problems in your program. Agreed, the sub-processes inherit the signal handler in your second code snippet, but the global "processes" list is not shared between processes: the full list is only available in the main process, so inside each sub-process "processes" is an (essentially) empty list.
You could use a queue or pipe kind of mechanism to pass the list of processes to the sub-processes, but that brings another problem: you terminate process 1 and the handler of process 1 tries to terminate processes 2 to 4; process 2 has the same handler, so its handler in turn tries to terminate all the other processes, which pushes your program into an endless loop of handlers killing each other.
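A minimal sketch of one way around this, assuming all you actually need is for the parent to fan out termination: give the children a trivial handler that just exits (so a kill -15 on a child pid works too), and register the terminate-everything handler in the parent only, so there is no cascade. The names worker and parent_term are illustrative, not from the question.

import os
import signal
import sys
import time
from multiprocessing import Process

processes = []

def worker(x):
    # Children simply exit on SIGTERM instead of trying to kill siblings.
    signal.signal(signal.SIGTERM, lambda sig, frame: sys.exit(0))
    while True:
        print('worker %s (pid %d) alive' % (x, os.getpid()))
        time.sleep(5)

def parent_term(sig_num, frame):
    # Only the parent knows the full process list, so only it fans out.
    for p in processes:
        p.terminate()
    for p in processes:
        p.join()
    sys.exit(0)

if __name__ == '__main__':
    for i in range(3):
        p = Process(target=worker, args=(str(i),))
        p.start()
        processes.append(p)
    signal.signal(signal.SIGTERM, parent_term)   # registered after start(), so children never inherit it
    for p in processes:
        p.join()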

How to get the id of the process a signal came from in Python?

Please see the following Python code:
import signal

def Handler(signum, frame):
    # do something
    pass

signal.signal(signal.SIGCHLD, Handler)
Is there a way to get the ID of the process the signal came from?
Or is there another way to get the sender's process ID without blocking the main flow of the application?
You cannot do this directly. The signal module of the Python standard library has no provision for giving access to the POSIX siginfo_t structure, which is where the sender's PID (si_pid) is delivered. If you really need that, you will have to build a Python extension in C or C++.
You will find pointers for that in Extending and Embedding the Python Interpreter; this document should also be available in your Python distribution.
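For the specific SIGCHLD case shown above there is a common workaround that needs no C extension: reap children inside the handler with os.waitpid, which tells you which child the signal was about. This is only a sketch of that idea (Unix-only), not part of the answer above:

import os
import signal
import time

def handler(signum, frame):
    # Reap any children that have exited; waitpid() tells us which one.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:        # no children left to reap
            break
        if pid == 0:           # children exist but none have exited yet
            break
        print('SIGCHLD came from child pid %d (exit status %d)' % (pid, status))

if __name__ == '__main__':
    signal.signal(signal.SIGCHLD, handler)
    if os.fork() == 0:         # child: exit immediately
        os._exit(0)
    time.sleep(1)              # parent: give the signal time to arrive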
os.getpid() returns the current process id. So when you send a signal, you can print it out, for example.
import signal
import os
import time

def receive_signal(signum, stack):
    print 'Received:', signum

signal.signal(signal.SIGUSR1, receive_signal)
signal.signal(signal.SIGUSR2, receive_signal)

print 'My PID is:', os.getpid()
Check this for more info on signals.
To send the pid to another process, one may use a Pipe:
import os
from multiprocessing import Process, Pipe

def f(conn):
    conn.send([os.getpid()])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv()   # prints the child's os.getpid()
    p.join()
See also the multiprocessing docs section "Exchanging objects between processes".

python multiprocessing.Pool kill *specific* long running or hung process

I need to execute a pool of many parallel database connections and queries. I would like to use a multiprocessing.Pool or concurrent.futures ProcessPoolExecutor. Python 2.7.5
In some cases, query requests take too long or will never finish (hung/zombie process). I would like to kill the specific process from the multiprocessing.Pool or concurrent.futures ProcessPoolExecutor that has timed out.
Here is an example of how to kill/re-spawn the entire process pool, but ideally I would minimize that CPU thrashing since I only want to kill a specific long running process that has not returned data after timeout seconds.
For some reason the code below does not seem to be able to terminate/join the process Pool after all results are returned and completed. It may have to do with killing worker processes when a timeout occurs; however, the Pool creates new workers when they are killed, and the results are as expected.
from multiprocessing import Pool
import time
import numpy as np
from threading import Timer
import thread, time, sys

def f(x):
    time.sleep(x)
    return x

if __name__ == '__main__':
    pool = Pool(processes=4, maxtasksperchild=4)
    results = [(x, pool.apply_async(f, (x,))) for x in np.random.randint(10, size=10).tolist()]

    while results:
        try:
            x, result = results.pop(0)
            start = time.time()
            print result.get(timeout=5), '%d done in %f Seconds!' % (x, time.time()-start)
        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)
            for p in pool._pool:
                if p.exitcode is None:
                    p.terminate()

    pool.terminate()
    pool.join()
I don't fully understand your question. You say you want to stop one specific process, but then, in your exception-handling branch, you call terminate on all jobs; I am not sure why you are doing that. Also, I am pretty sure using internal variables of multiprocessing.Pool is not quite safe. Having said all of that, I think your question is really why this program does not finish when a timeout happens. If that is the problem, then the following does the trick:
from multiprocessing import Pool
import time
import numpy as np
from threading import Timer
import thread, time, sys

def f(x):
    time.sleep(x)
    return x

if __name__ == '__main__':
    pool = Pool(processes=4, maxtasksperchild=4)
    results = [(x, pool.apply_async(f, (x,))) for x in np.random.randint(10, size=10).tolist()]
    result = None
    start = time.time()

    while results:
        try:
            x, result = results.pop(0)
            print result.get(timeout=5), '%d done in %f Seconds!' % (x, time.time()-start)
        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)
            for i in reversed(range(len(pool._pool))):
                p = pool._pool[i]
                if p.exitcode is None:
                    p.terminate()
                    del pool._pool[i]

    pool.terminate()
    pool.join()
The point is you need to remove items from the pool; just calling terminate on them is not enough.
In your solution you are tampering with the internal variables of the Pool itself. The Pool relies on three different threads in order to operate correctly; it is not safe to intervene in their internal variables without being really aware of what you are doing.
There is no clean way to stop timed-out processes in the standard Python pools, but there are alternative implementations which expose such a feature.
You can take a look at the following libraries:
pebble
billiard
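For illustration, here is a rough sketch of how pebble's per-task timeout could look, assuming a reasonably recent pebble release (the exact API has changed across versions): schedule() takes a timeout, the worker that exceeds it is terminated for you, and result() raises TimeoutError.

from concurrent.futures import TimeoutError
from pebble import ProcessPool
import time

def f(x):
    time.sleep(x)
    return x

if __name__ == '__main__':
    with ProcessPool(max_workers=4) as pool:
        futures = [(x, pool.schedule(f, args=(x,), timeout=5)) for x in range(10)]
        for x, future in futures:
            try:
                print(future.result())
            except TimeoutError:
                print('task %d timed out; its worker was killed by the pool' % x)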
To avoid touching the internal variables, you can save multiprocessing.current_process().pid from the executing task into shared memory, then iterate over multiprocessing.active_children() from the main process and kill the target pid if it exists.
However, after such external termination of the workers they are recreated, but the pool becomes non-joinable and also requires explicit termination before the join().
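A sketch of that idea, under the stated caveats (the names work and task_pids are illustrative): each task records which worker pid is running it in a Manager dict, and on timeout the main process terminates only the matching child.

from multiprocessing import Pool, Manager, TimeoutError, active_children
import os
import time

def work(x, task_pids):
    task_pids[x] = os.getpid()   # record which worker runs this task
    time.sleep(x)
    return x

if __name__ == '__main__':
    manager = Manager()
    task_pids = manager.dict()
    pool = Pool(processes=4)
    results = [(x, pool.apply_async(work, (x, task_pids))) for x in [1, 2, 8, 3]]
    for x, result in results:
        try:
            print(result.get(timeout=5))
        except TimeoutError:
            pid = task_pids.get(x)
            for child in active_children():
                if child.pid == pid:
                    child.terminate()   # kill only the hung worker
    pool.terminate()                    # as noted above, terminate explicitly before join()
    pool.join()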
I also came across this problem.
The original code and the edited version by #stacksia have the same issue: in both cases all currently running processes are killed when the timeout is reached for just one of them (i.e. when the loop over pool._pool runs).
Find my solution below. It involves creating a .pid file for each worker process, as suggested by #luart. It will work as long as there is a way to tag each worker process (in the code below, x does this job).
If someone has a more elegant solution (such as saving the PID in memory), please share it.
#!/usr/bin/env python
from multiprocessing import Pool
import time, os
import subprocess
import sys   # needed for sys.exit() below

def f(x):
    PID = os.getpid()
    print 'Started:', x, 'PID=', PID

    pidfile = "/tmp/PoolWorker_"+str(x)+".pid"
    if os.path.isfile(pidfile):
        print "%s already exists, exiting" % pidfile
        sys.exit()
    file(pidfile, 'w').write(str(PID))

    # Do the work here
    time.sleep(x*x)

    # Delete the PID file
    os.remove(pidfile)

    return x*x

if __name__ == '__main__':
    pool = Pool(processes=3, maxtasksperchild=4)
    results = [(x, pool.apply_async(f, (x,))) for x in [1,2,3,4,5,6]]
    pool.close()

    while results:
        print results
        try:
            x, result = results.pop(0)
            start = time.time()
            print result.get(timeout=3), '%d done in %f Seconds!' % (x, time.time()-start)
        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)

            # We know which process gave us an exception: it is "x", so let's kill it!
            # First, let's get the PID of that process:
            pidfile = '/tmp/PoolWorker_'+str(x)+'.pid'
            PID = None
            if os.path.isfile(pidfile):
                PID = str(open(pidfile).read())
                print x, 'pidfile=', pidfile, 'PID=', PID

            # Now, let's check if there is indeed such a process running:
            for p in pool._pool:
                print p, p.pid
                if str(p.pid) == PID:
                    print 'Found it still running!', p, p.pid, p.is_alive(), p.exitcode

                    # We can also double-check how long it's been running with the system 'ps' command:
                    tt = str(subprocess.check_output('ps -p "'+str(p.pid)+'" o etimes=', shell=True)).strip()
                    print 'Run time from OS (may be way off the real time..) = ', tt

                    # Now, KILL the m*$#r:
                    p.terminate()
                    pool._pool.remove(p)
                    pool._repopulate_pool()

                    # Let's not forget to remove the pidfile
                    os.remove(pidfile)
                    break

    pool.terminate()
    pool.join()
Many people suggest pebble. It looks nice, but it is only available for Python 3. If someone has a way to get pebble imported on Python 2.6, that would be great.

How to keep track of multiple concurrent subprocesses created by a loop in Python

I have written a Python script for tiling imagery using the GDAL open source library and the command line utilities provided with that library. First, I read an input dataset that tells me each tile extent. Then, I loop through the tiles and start a subprocess to call gdalwarp in order to clip the input image to the current tile in the loop.
I don't want to use Popen.wait() because that would keep the tiles from being processed concurrently, but I do want to keep track of any messages returned by the subprocesses. In addition, once a particular tile is done being created, I need to calculate the statistics for the new file using gdalinfo, which requires another subprocess.
Here is the code:
processing = {}
for tile in tileNums:
    subp = subprocess.Popen(['gdalwarp', '-ot', 'Int16', '-r', 'cubic', '-of', 'HFA', '-cutline', tileIndexShp, '-cl', os.path.splitext(os.path.basename(tileIndexShp))[0], '-cwhere', "%s = '%s'" % (tileNumField, tile), '-crop_to_cutline', os.path.join(inputTileDir, 'mosaic_Proj.vrt'), os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_', tile))], stdout=subprocess.PIPE)
    processing[tile] = [subp]

while processing:
    for tile, subps in processing.items():
        for idx, subp in enumerate(subps):
            if subp == None: continue
            poll = subp.poll()
            if poll == None: continue
            elif poll != 0:
                subps[idx] = None
                print tile, "%s Unsuccessful" % ("Retile" if idx == 0 else "Statistics")
            else:
                subps[idx] = None
                print tile, "%s Succeeded" % ("Retile" if idx == 0 else "Statistics")
            if subps == [None, None]:
                del processing[tile]
                continue
            subps.append(subprocess.Popen(['gdalinfo', '-stats', os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_',tile))], stdout=subprocess.PIPE))
For the most part, this works for me, but the one issue I am seeing is that it seems to create an infinite loop when it gets to the last tile. I know this is not the best way to do this, but I am very new to the subprocess module and I basically just threw this together to try and get it to work.
Can anyone recommend a better way to loop through the list of tiles, spawn a subprocess for each tile that can process concurrently, and spawn a second subprocess when the first completes for each tile?
UPDATE:
Thanks for all the advice so far. I tried to refactor the code above to take advantage of the multiprocessing module and Pool.
Here is the new code:
def ProcessTile(tile):
    tileName = os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_', tile))
    warp = subprocess.Popen(['gdalwarp', '-ot', 'Int16', '-r', 'cubic', '-of', 'HFA', '-cutline', tileIndexShp, '-cl', os.path.splitext(os.path.basename(tileIndexShp))[0], '-cwhere', "%s = '%s'" % (tileNumField, tile), '-crop_to_cutline', os.path.join(inputTileDir, 'mosaic_Proj.vrt'), tileName], stdout=subprocess.PIPE)
    warpMsg = tile, "Retile %s" % "Successful" if warp.wait() == 0 else "Unsuccessful"
    info = subprocess.Popen(['gdalinfo', '-stats', tileName], stdout=subprocess.PIPE)
    statsMsg = tile, "Statistics %s" % "Successful" if info.wait() == 0 else "Unsuccessful"
    return warpMsg, statsMsg

print "Retiling..."
pool = multiprocessing.Pool()
for warpMsg, statsMsg in pool.imap_unordered(ProcessTile, tileNums):
    print "%s\n%s" % (warpMsg, statsMsg)
This is causing some major problems for me. First of all, I end up with many new processes being created. About half are python.exe and the other half are another gdal utility that I call before the code above to mosaic the incoming imagery if it is already tiled in another tiling scheme (gdalbuildvrt.exe). Between all the python.exe and gdalbuildvrt.exe processes that are being created, about 25% of my CPU (Intel I7 with 8 cores when hyperthreaded) and 99% of my 16gb of RAM are in use and the computer completely hangs. I can't even kill the processes in Task Manager or via command line with taskkill.
What am I missing here?
Instead of spawning and managing individual subprocesses, use the python multiprocessing module to create a Pool of processes.
I haven't tested it, but it should work:
import Queue
from threading import Thread

class Consumer(Thread):
    def __init__(self, queue=None):
        super(Consumer, self).__init__()
        self.daemon = True
        self.queue = queue

    def run(self):
        while True:
            task = self.queue.get()
            # Spawn your process and .wait() for it to finish.
            self.queue.task_done()

if __name__ == '__main__':
    queue = Queue.Queue()

    for task in get_tasks():
        queue.put(task)

    # You spawn 20 worker threads to process your queue nonstop
    for i in range(20):
        consumer = Consumer(queue)
        consumer.start()

    queue.join()
Basically, you have a queue filled with tasks that you need to accomplish. Then, you just spawn 20 worker threads to continually pull new tasks from the queue and process them concurrently.
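As a sketch of what the "spawn your process and .wait()" step could look like for this particular question (purely illustrative: the build_warp_cmd/build_info_cmd placeholders stand in for the real gdalwarp and gdalinfo argument lists from the question), each Consumer thread would call something like process_tile(task) from run():

import subprocess
import sys

def build_warp_cmd(tile):
    # Placeholder for the real gdalwarp argument list from the question.
    return [sys.executable, '-c', "print('gdalwarp for tile %s')" % tile]

def build_info_cmd(tile):
    # Placeholder for the real gdalinfo -stats argument list.
    return [sys.executable, '-c', "print('gdalinfo for tile %s')" % tile]

def process_tile(tile):
    # One tile per worker thread: run the warp step, wait for it, then run the stats step.
    warp = subprocess.Popen(build_warp_cmd(tile), stdout=subprocess.PIPE)
    warp_out, _ = warp.communicate()   # waits and captures the tool's messages
    if warp.returncode != 0:
        return tile, 'Retile Unsuccessful'
    info = subprocess.Popen(build_info_cmd(tile), stdout=subprocess.PIPE)
    info_out, _ = info.communicate()
    return tile, 'Statistics %s' % ('Successful' if info.returncode == 0 else 'Unsuccessful')

With 20 consumer threads, at most 20 tiles are processed at once, which avoids the flood of processes described in the update above.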

Python multiprocessing pipe recv() doc unclear or did I miss anything?

I have been learning how to use the Python multiprocessing module recently, and reading the official docs. In 16.6.1.2. Exchanging objects between processes there is a simple example of using a pipe to exchange data.
And, in 16.6.2.4. Connection Objects, there is this statement, quoted "Raises EOFError if there is nothing left to receive and the other end was closed."
So, I revised the example as shown below. IMHO this should trigger an EOFError exception: nothing sent and the sending end is closed.
The revised code:
from multiprocessing import Process, Pipe

def f(conn):
    #conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    #print parent_conn.recv() # prints "[42, None, 'hello']"
    try:
        print parent_conn.recv()
    except EOFError:
        pass
    p.join()
But when I tried the revised example on my Ubuntu 11.04 machine with Python 2.7.2, the script hung.
If anyone can point out what I missed, I would be very appreciative.
When you start a new process with mp.Process, the child process inherits the pipes of the parent. When the child closes conn, the parent process still has child_conn open, so the reference count for the pipe file descriptor is still greater than 0, and so EOFError is not raised.
To get the EOFError, close the end of the pipe in both the parent and child processes:
import multiprocessing as mp

def foo_pipe(conn):
    conn.close()

def pipe():
    conn = mp.Pipe()
    parent_conn, child_conn = conn
    proc = mp.Process(target = foo_pipe, args = (child_conn, ))
    proc.start()
    child_conn.close() # <-- Close the child_conn end in the main process too.
    try:
        print(parent_conn.recv())
    except EOFError as err:
        print('Got here')
    proc.join()

if __name__=='__main__':
    pipe()
