I'm currently launching a programme using subprocess.Popen(cmd, shell=True)
I'm fairly new to Python, but it 'feels' like there ought to be some API that lets me do something similar to:
subprocess.Popen(cmd, shell=True, postexec_fn=function_to_call_on_exit)
I am doing this so that function_to_call_on_exit can do something based on knowing that the cmd has exited (for example, keeping count of the number of external processes currently running).
I assume that I could fairly trivially wrap subprocess in a class that combined threading with the Popen.wait() method, but as I've not done threading in Python yet and it seems like this might be common enough for an API to exist, I thought I'd try and find one first.
Thanks in advance :)
You're right - there is no nice API for this. You're also right on your second point - it's trivially easy to design a function that does this for you using threading.
import threading
import subprocess

def popen_and_call(on_exit, popen_args):
    """
    Runs the given args in a subprocess.Popen, and then calls the function
    on_exit when the subprocess completes.
    on_exit is a callable object, and popen_args is a list/tuple of args that
    you would give to subprocess.Popen.
    """
    def run_in_thread(on_exit, popen_args):
        proc = subprocess.Popen(*popen_args)
        proc.wait()
        on_exit()
        return
    thread = threading.Thread(target=run_in_thread, args=(on_exit, popen_args))
    thread.start()
    # returns immediately after the thread starts
    return thread
Even threading is pretty easy in Python, but note that if on_exit() is computationally expensive, you'll want to put this in a separate process instead using multiprocessing (so that the GIL doesn't slow your program down). It's actually very simple - you can basically just replace all calls to threading.Thread with multiprocessing.Process since they follow (almost) the same API.
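For illustration, here is a minimal sketch of that swap (the helper name is made up, and note that with multiprocessing the callback runs in the child process, so on_exit and popen_args must be picklable):

import multiprocessing
import subprocess

def _wait_and_call(on_exit, popen_args):
    # runs in the child process: start the command, wait, then fire the callback
    proc = subprocess.Popen(*popen_args)
    proc.wait()
    on_exit()

def popen_and_call_in_process(on_exit, popen_args):
    # on_exit and popen_args are sent to the child, so they must be picklable
    p = multiprocessing.Process(target=_wait_and_call, args=(on_exit, popen_args))
    p.start()
    return p  # returns immediately after the child process starts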
There is a concurrent.futures module in Python 3.2 (available via pip install futures for older Python < 3.2):
pool = Pool(max_workers=1)
f = pool.submit(subprocess.call, "sleep 2; echo done", shell=True)
f.add_done_callback(callback)
The callback will be called in the same process that called f.add_done_callback().
Full program
import logging
import subprocess
# to install run `pip install futures` on Python <3.2
from concurrent.futures import ThreadPoolExecutor as Pool

info = logging.getLogger(__name__).info

def callback(future):
    if future.exception() is not None:
        info("got exception: %s" % future.exception())
    else:
        info("process returned %d" % future.result())

def main():
    logging.basicConfig(
        level=logging.INFO,
        format=("%(relativeCreated)04d %(process)05d %(threadName)-10s "
                "%(levelname)-5s %(msg)s"))
    # wait for the process completion asynchronously
    info("begin waiting")
    pool = Pool(max_workers=1)
    f = pool.submit(subprocess.call, "sleep 2; echo done", shell=True)
    f.add_done_callback(callback)
    pool.shutdown(wait=False)  # no .submit() calls after that point
    info("continue waiting asynchronously")

if __name__ == "__main__":
    main()
Output
$ python . && python3 .
0013 05382 MainThread INFO begin waiting
0021 05382 MainThread INFO continue waiting asynchronously
done
2025 05382 Thread-1 INFO process returned 0
0007 05402 MainThread INFO begin waiting
0014 05402 MainThread INFO continue waiting asynchronously
done
2018 05402 Thread-1 INFO process returned 0
I modified Daniel G's answer to simply pass the subprocess.Popen args and kwargs as themselves instead of as a separate tuple/list, since I wanted to use keyword arguments with subprocess.Popen.
In my case I had a method postExec() that I wanted to run after subprocess.Popen('exe', cwd=WORKING_DIR)
With the code below, it simply becomes popenAndCall(postExec, 'exe', cwd=WORKING_DIR)
import threading
import subprocess

def popenAndCall(onExit, *popenArgs, **popenKWArgs):
    """
    Runs a subprocess.Popen, and then calls the function onExit when the
    subprocess completes.
    Use it exactly the way you'd normally use subprocess.Popen, except include a
    callable to execute as the first argument. onExit is a callable object, and
    *popenArgs and **popenKWArgs are simply passed up to subprocess.Popen.
    """
    def runInThread(onExit, popenArgs, popenKWArgs):
        proc = subprocess.Popen(*popenArgs, **popenKWArgs)
        proc.wait()
        onExit()
        return

    thread = threading.Thread(target=runInThread,
                              args=(onExit, popenArgs, popenKWArgs))
    thread.start()
    return thread  # returns immediately after the thread starts
I had the same problem, and solved it using multiprocessing.Pool. There are two hacky tricks involved:
make the size of the pool 1
pass iterable arguments within an iterable of length 1
The result is one function executed with a callback on completion.
import multiprocessing

def sub(arg):
    print(arg)  # prints [1, 2, 3, 4, 5]
    return "hello"

def cb(arg):
    print(arg)  # prints ["hello"] (map_async passes the list of results)

pool = multiprocessing.Pool(1)
rval = pool.map_async(sub, ([[1, 2, 3, 4, 5]]), callback=cb)
# (do stuff)
pool.close()
In my case, I wanted the invocation to be non-blocking as well. It works beautifully.
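For what it's worth, a minimal sketch of the same idea without the two hacks, assuming a single call is enough - pool.apply_async takes a callback directly and passes it the single result rather than a list:

import multiprocessing

def sub(arg):
    return "hello"

def cb(result):
    print(result)  # prints "hello"

if __name__ == "__main__":
    pool = multiprocessing.Pool(1)
    pool.apply_async(sub, ([1, 2, 3, 4, 5],), callback=cb)
    # (do stuff)
    pool.close()
    pool.join()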
I was inspired by Daniel G's answer and implemented a very simple use case - in my work I often need to make repeated calls to the same (external) process with different arguments. I had hacked a way to determine when each specific call was done, but now I have a much cleaner way to issue callbacks.
I like this implementation because it is very simple, yet it allows me to issue asynchronous calls to multiple processors (notice I use multiprocessing instead of threading) and receive notification upon completion.
I tested the sample program and it works great. Please edit at will and provide feedback.
import multiprocessing
import subprocess

class Process(object):
    """This class spawns a subprocess asynchronously and calls a
    `callback` upon completion; it is not meant to be instantiated
    directly (derived classes are called instead)"""
    def __call__(self, *args):
        # store the arguments for later retrieval
        self.args = args
        # define the target function to be called by
        # `multiprocessing.Process`
        def target():
            cmd = [self.command] + [str(arg) for arg in self.args]
            process = subprocess.Popen(cmd)
            # the `multiprocessing.Process` process will wait until
            # the call to the `subprocess.Popen` object is completed
            process.wait()
            # upon completion, call `callback`
            return self.callback()
        mp_process = multiprocessing.Process(target=target)
        # this call issues the call to `target`, but returns immediately
        mp_process.start()
        return mp_process

if __name__ == "__main__":

    def squeal(who):
        """this serves as the callback function; its argument is the
        instance of a subclass of Process making the call"""
        print "finished %s calling %s with arguments %s" % (
            who.__class__.__name__, who.command, who.args)

    class Sleeper(Process):
        """Sample implementation of an asynchronous process - define
        the command name (available in the system path) and a callback
        function (previously defined)"""
        command = "./sleeper"
        callback = squeal

    # create an instance of Sleeper - this is the Process object that
    # can be called repeatedly in an asynchronous manner
    sleeper_run = Sleeper()
    # spawn three sleeper runs with different arguments
    sleeper_run(5)
    sleeper_run(2)
    sleeper_run(1)
    # the user should see the following message immediately (even
    # though the Sleeper calls are not done yet)
    print "program continued"
Sample output:
program continued
finished Sleeper calling ./sleeper with arguments (1,)
finished Sleeper calling ./sleeper with arguments (2,)
finished Sleeper calling ./sleeper with arguments (5,)
Below is the source code of sleeper.c - my sample "time consuming" external process
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]){
    unsigned int t = atoi(argv[1]);
    sleep(t);
    return EXIT_SUCCESS;
}
compile as:
gcc -o sleeper sleeper.c
There is also a ProcessPoolExecutor in concurrent.futures since Python 3.2 (https://docs.python.org/3/library/concurrent.futures.html). The usage is the same as for the ThreadPoolExecutor mentioned above, with the on-exit callback attached via executor.add_done_callback().
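A minimal sketch of that usage, reusing the shell command from the earlier example; the callback runs in the parent process once the future completes:

import subprocess
from concurrent.futures import ProcessPoolExecutor

def on_exit(future):
    print("process returned %d" % future.result())

if __name__ == "__main__":
    executor = ProcessPoolExecutor(max_workers=1)
    future = executor.submit(subprocess.call, "sleep 2; echo done", shell=True)
    future.add_done_callback(on_exit)
    executor.shutdown(wait=False)  # don't block; no further submits after this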
Thanks guys for pointing me in the right direction. I made a class from what I found here and added a stop function to kill the process:
from subprocess import Popen
from threading import Thread

class popenplus():
    def __init__(self, onExit, *popenArgs, **popenKWArgs):
        self.proc = None  # set before the thread starts so stop() is always safe
        thread = Thread(target=self.runInThread, args=(onExit, popenArgs, popenKWArgs))
        thread.start()

    def runInThread(self, onExit, popenArgs, popenKWArgs):
        self.proc = Popen(*popenArgs, **popenKWArgs)
        self.proc.wait()
        self.proc = None
        onExit()

    def stop(self):
        if self.proc:
            self.proc.kill()
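A brief usage sketch, with a made-up callback and command; note that stop() also triggers onExit, since wait() returns once the process is killed:

def on_done():
    print("command finished")

p = popenplus(on_done, ["sleep", "60"])
# ... later, if it is taking too long:
p.stop()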
On POSIX systems, the parent process receives a SIGCHLD signal when a child process exits. To run a callback when a subprocess command exits, handle the SIGCHLD signal in the parent. Something like this:
import signal
import subprocess

def sigchld_handler(signum, frame):
    # This is run when the child exits.
    # Do something here ...
    pass

signal.signal(signal.SIGCHLD, sigchld_handler)
process = subprocess.Popen('mycmd', shell=True)
Note that this will not work on Windows.
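As a slightly fuller sketch (POSIX, Python 3), the handler can also reap the child and read its exit status with os.waitpid; be aware that reaping here can interfere with a later process.wait() or poll() on the same Popen object:

import os
import signal
import subprocess

def sigchld_handler(signum, frame):
    # reap every child that has exited so far
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet
        print("child %d exited with status %d" % (pid, os.WEXITSTATUS(status)))

signal.signal(signal.SIGCHLD, sigchld_handler)
process = subprocess.Popen('mycmd', shell=True)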
AFAIK there is no such API, at least not in subprocess module. You need to roll something on your own, possibly using threads.
Related
I have a program which uses a separate process (via multiprocessing) to execute functions from an external hardware library. The communication between that process and my program happens with a JoinableQueue().
A part of the code looks like this:
# Main Code
queue_cmd.put("do_something")
queue_cmd.join()  # here is my problem

# multiprocess
task = queue_cmd.get()
if task == "do_something":
    external_class.do_something()
queue_cmd.task_done()
Note: external_class is the external hardware library.
This library sometimes crashes, and the line queue_cmd.task_done() never gets executed. As a result, my main program hangs indefinitely at queue_cmd.join(), waiting for queue_cmd.task_done() to be called. Unfortunately, there is no timeout parameter for the join() function.
How can I wait for the element in the JoinableQueue to be processed, but also handle the case where my worker process terminates (due to the crash in the do_something() function)?
Ideally, the join function would have a timeout parameter (.join(timeout=30)), which I could use to restart the multiprocess - but it does not.
You can always wrap a blocking function in another thread:

from datetime import datetime
from threading import Thread

queue_cmd.put("do_something")
t = Thread(target=queue_cmd.join)
t.start()

# implement a timeout
start = datetime.now()
timeout = 10  # seconds
while t.is_alive() and (datetime.now() - start).seconds < timeout:
    pass  # do something else while waiting for the join or the timeout

if t.is_alive():
    # kill the subprocess that failed
    pass
I think the best approach here is to start the "crashable" module in (yet) another process:
Main code
queue_cmd.put("do_something")
queue_cmd.join()
Multiprocess (You can now move this to a thread)
task = queue_cmd.get()
if task == "do_something":
    subprocess.run(["python", "pleasedontcrash.py"])
queue_cmd.task_done()
pleasedontcrash.py
external_class.do_something()
As shown, I'd do it using subprocess. If you need to pass parameters (which you could with subprocess using pipes or arguments), it's easier to use multiprocessing.
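For illustration, a rough sketch of the multiprocessing variant, under the assumption that external_class is importable in the child process; unlike JoinableQueue.join(), Process.join() accepts a timeout:

import multiprocessing

def run_task(conn):
    # runs in its own process, so a hard crash here cannot hang the caller
    conn.send(external_class.do_something())
    conn.close()

parent_conn, child_conn = multiprocessing.Pipe()
worker = multiprocessing.Process(target=run_task, args=(child_conn,))
worker.start()
worker.join(timeout=30)
if worker.is_alive():
    worker.terminate()           # hung: kill it and restart as needed
elif worker.exitcode != 0:
    pass                         # crashed: handle/restart here
else:
    result = parent_conn.recv()  # finished normally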
I need to convert 86,000 TEX files to XML using the LaTeXML library in the command line. I tried to write a Python script to automate this with the subprocess module, utilizing all 4 cores.
import os
import pathlib
import time
import multiprocessing
import subprocess

def get_outpath(tex_path):
    path_parts = pathlib.Path(tex_path).parts
    arxiv_id = path_parts[2]
    outpath = 'xml/' + arxiv_id + '.xml'
    return outpath

def convert_to_xml(inpath):
    outpath = get_outpath(inpath)

    if os.path.isfile(outpath):
        message = '{}: Already converted.'.format(inpath)
        print(message)
        return

    try:
        process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
                                   stderr=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
    except Exception as error:
        process.kill()
        message = "error: {} while converting {}".format(error, inpath)
        print(message)

    message = '{}: Converted!'.format(inpath)
    print(message)

def start():
    start_time = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                maxtasksperchild=1)
    print('Initialized {} threads'.format(multiprocessing.cpu_count()))
    print('Beginning conversion...')

    # preprints is assumed to be the list of input .tex paths, defined elsewhere
    for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
        pass

    pool.close()
    pool.join()

    total_time = time.time() - start_time
    print("TIME: {}".format(total_time))

start()
The script results in a "Too many open files" error and slows down my computer. From looking at Activity Monitor, it looks like this script is trying to create 86,000 conversion subprocesses at once, and each process is trying to open a file. Maybe this is the result of pool.imap_unordered(convert_to_xml, preprints) -- maybe I need to not use map in conjunction with subprocess.Popen, since I just have too many commands to call? What would be an alternative?
I've spent all day trying to figure out the right way to approach bulk subprocessing. I'm new to this part of Python, so any tips for heading in the right direction would be much appreciated. Thanks!
In convert_to_xml, the process = subprocess.Popen(...) statement spawns a latexml subprocess.
Without a blocking call such as process.communicate(), convert_to_xml returns even while latexml continues to run in the background.
Since convert_to_xml returns, the Pool sends the associated worker process another task to run, and so convert_to_xml is called again.
Once again another latexml process is spawned in the background.
Pretty soon, you are up to your eyeballs in latexml processes and the resource limit on the number of open files is reached.
The fix is easy: add process.communicate() to tell convert_to_xml to wait until the latexml process has finished.
try:
    process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
                               stderr=subprocess.PIPE,
                               stdout=subprocess.PIPE)
    process.communicate()
except Exception as error:
    process.kill()
    message = "error: {} while converting {}".format(error, inpath)
    print(message)
else:  # use else so that this won't run if there is an Exception
    message = '{}: Converted!'.format(inpath)
    print(message)
Regarding if __name__ == '__main__':
As martineau pointed out, there is a warning in the multiprocessing docs that
code that spawns new processes should not be called at the top level of a module.
Instead, the code should be contained inside an if __name__ == '__main__' statement.
In Linux, nothing terrible happens if you disregard this warning.
But on Windows, the code "fork-bombs". Or more accurately, the code
causes an unmitigated chain of subprocesses to be spawned, because on Windows fork is simulated by spawning a new Python process which then imports the calling script. Every import spawns a new Python process. Every Python process tries to import the calling script. The cycle is not broken until all resources are consumed.
So to be nice to our Windows-fork-bereft brethren, use
if __name__ == '__main__':
    start()
Sometimes processes require a lot of memory. The only reliable way to free memory is to terminate the process. maxtasksperchild=1 tells the pool to terminate each worker process after it completes 1 task. It then spawns a new worker process to handle another task (if there are any). This frees the (memory) resources the original worker may have allocated which could not otherwise have been freed.
In your situation it does not look like the worker process is going to require much memory, so you probably don't need maxtasksperchild=1.
The chunksize affects how many tasks a worker performs before sending the result back to the main process.
Sometimes this can affect performance, especially if interprocess communication is a significant portion of overall runtime.
In your situation, convert_to_xml takes a relatively long time (assuming we wait until latexml finishes) and it simply returns None. So interprocess communication probably isn't a significant portion of overall runtime. Therefore, I don't expect you would find a significant change in performance in this case (though it never hurts to experiment!).
In plain Python, map should not be used just to call a function multiple times.
For a similar stylistic reason, I would reserve using the pool.*map* methods for situations where I cared about the return values.
So instead of
for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
    pass
you might consider using
for preprint in preprints:
    pool.apply_async(convert_to_xml, args=(preprint, ))
instead.
The iterable passed to any of the pool.*map* functions is consumed
immediately. It doesn't matter if the iterable is an iterator. There is no
special memory benefit to using an iterator here. imap_unordered returns an
iterator, but it does not handle its input in any especially iterator-friendly
way.
No matter what type of iterable you pass, upon calling the pool.*map* function the iterable is
consumed and turned into tasks which are put into a task queue.
Here is code which corroborates this claim:
version1.py:
import multiprocessing as mp
import time

def foo(x):
    time.sleep(0.1)
    return x * x

def gen():
    for x in range(1000):
        if x % 100 == 0:
            print('Got here')
        yield x

def start():
    pool = mp.Pool()
    for item in pool.imap_unordered(foo, gen()):
        pass
    pool.close()
    pool.join()

if __name__ == '__main__':
    start()
version2.py:
import multiprocessing as mp
import time

def foo(x):
    time.sleep(0.1)
    return x * x

def gen():
    for x in range(1000):
        if x % 100 == 0:
            print('Got here')
        yield x

def start():
    pool = mp.Pool()
    for item in gen():
        result = pool.apply_async(foo, args=(item, ))
    pool.close()
    pool.join()

if __name__ == '__main__':
    start()
Running version1.py and version2.py both produce the same result.
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Crucially, you will notice that Got here is printed 10 times very quickly at
the beginning of the run, and then there is a long pause (while the calculation
is done) before the program ends.
If the generator gen() were somehow consumed slowly by pool.imap_unordered,
we should expect Got here to be printed slowly as well. Since Got here is
printed 10 times and quickly, we can see that the iterable gen() is being
completely consumed well before the tasks are completed.
Running these programs should hopefully give you confidence that
pool.imap_unordered and pool.apply_async are putting tasks in the queue
essentially in the same way: immediately after the call is made.
I have a web2py application that basically serves as a browser interface for a Python script. This script usually returns pretty quickly, but can occasionally take a long time. I want to provide a way for the user to stop the script's execution if it takes too long.
I am currently calling the function like this:
def myView():  # this function is called from ajax
    session.model = myFunc()  # myFunc is from a module which I have complete control over
    return dict(model=session.model)
myFunc, when called with certain options, uses multiprocessing but still ends up taking a long time. I need some way to terminate the function, or at the very least the thread's children.
The first thing I tried was to run myFunc in a new process, and roll my own simple event system to kill it:
# in the controller
def myView():
    p_conn, c_conn = multiprocessing.Pipe()
    events = multiprocessing.Manager().dict()
    proc = multiprocessing.Process(target=_fit, args=(options, events, c_conn))
    proc.start()
    sleep(0.01)
    session.events = events
    proc.join()
    session.model = p_conn.recv()
    return dict(model=session.model)

def _fit(options, events, pipe):
    pipe.send(fitting.logistic_fit(options=options, events=events))
    pipe.close()

def stop():
    try:
        session.events['kill']()
    except SystemExit:
        pass  # because it raises that error intentionally
    return dict()

# in the module
def kill():
    print multiprocessing.active_children()
    for p in multiprocessing.active_children():
        p.terminate()
    raise SystemExit

def myFunc(options, events):
    events['kill'] = kill
I ran into a few major problems with this.
The session in stop() wasn't always the same as the session in myView(), so session.events was None.
Even when the session was the same, kill() wasn't properly killing the children.
The long-running function would hang the web2py thread, so stop() wasn't even processed until the function finished.
I considered not calling join() and using AJAX to pick up the result of the function at a later time, but I wasn't able to save the process object in the session for later use. The pipe seemed to be picklable, but then I still had the problem of not being able to access the same session from another view.
How can I implement this feature?
For long running tasks, you are better off queuing them via the built-in scheduler. If you want to allow the user to manually stop a task that is taking too long, you can use the scheduler.stop_task(ref) method (where ref is the task id or uuid). Alternatively, when you queue a task, you can specify a timeout, so it will automatically stop if not completed within the timeout period.
You can do simple Ajax polling to notify the client when the task has completed (or implement something more sophisticated with websockets or SSE).
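A rough sketch of the scheduler approach described above, assuming a standard web2py Scheduler setup; the function name and timeout are just examples:

# in a model file (hypothetical setup)
from gluon.scheduler import Scheduler
scheduler = Scheduler(db)

# queue the long-running function with a 60-second timeout;
# the worker stops the task automatically if it runs longer
task = scheduler.queue_task(myFunc, timeout=60)

# later, e.g. from a "stop" controller action, cancel it by id or uuid
scheduler.stop_task(task.id)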
I want to create a multi-process app. Here is a sample:
import threading
import time
from logs import LOG

def start_first():
    LOG.log("First thread has started")
    time.sleep(1000)

def start_second():
    LOG.log("second thread has started")

if __name__ == '__main__':
    ### call birthday daemon
    first_thread = threading.Thread(target=start_first())
    ### call billing daemon
    second_thread = threading.Thread(target=start_second())

    ### starting all daemons
    first_thread.start()
    second_thread.start()
In this code the second thread does not work. I guess that after calling the sleep function inside first_thread, the main process sleeps. I found this post, but there sleep was used with a class. I just got "Process finished with exit code 0" as a result when I ran that answer. Could anybody explain where I made a mistake?
I am using Python 3.* on Windows.
When creating your threads you are actually invoking the functions while trying to set the target for the Thread, instead of passing the functions to it. This means that when you try to create first_thread you are actually calling start_first, which includes the very long sleep. I imagine you then get frustrated that you don't see the output from the second thread and kill it, right?
Remove the parens from your target= statements and you will get what you want:
first_thread = threading.Thread(target=start_first)
second_thread = threading.Thread(target=start_second)
first_thread.start()
second_thread.start()
will do what you are trying to do.
I am trying to automate some big data file processing using Python.
A lot of the processing is chained, i.e. script1 writes a file, which is then processed by script2, then script2's output by script3, etc.
I am using the subprocess module in a threaded context.
I have one class that creates tuples of chained scripts
("scr1.sh","scr2.sh","scr3.sh").
Then another class that uses a call like
for script in scriplist:
    subprocess.call(script)
My question is: in this for loop, is each script only called after subprocess.call(script1) returns a successful retcode?
Or do all three get called right after one another since I am using subprocess.call? Without using "sleep" or "wait", I want to make sure that the second script only starts after the first one is over.
edit: The pydoc says
"subprocess.call(*popenargs, **kwargs)
Run command with arguments. Wait for command to complete, then return the returncode attribute."
So in the for loop (above), does it wait for each retcode before iterating to the next script?
I am new to threading. I am attaching the stripped-down code for the class that runs the analysis here. The subprocess.call loop is part of this class.
import subprocess
from threading import Thread
from Queue import Queue  # Python 2; use `from queue import Queue` on Python 3

class ThreadedDataProcessor(Thread):
    def __init__(self, in_queue, out_queue):
        # Uses Queue
        Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            path = self.in_queue.get()
            if path is None:
                break
            myprocessor = ProcessorScriptCreator(path)
            scrfiles = myprocessor.create_and_return_shell_scripts()
            for index, file in enumerate(scrfiles):
                subprocess.call([file])
                print "CALLED%s%s" % (index, file) * 5
            #report(myfile.describe())
            #report("Done %s" % path)
            self.out_queue.put(path)

in_queue = Queue()
The loop will serially call each script, wait until it completes, and then call the next one regardless of success or failure of the previous call. You probably want to say:
try:
    map(subprocess.check_call, script_list)
except Exception as e:
    pass  # a script failed; handle it here
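Equivalently, an explicit loop makes the blocking behaviour obvious - each check_call only returns after its script has exited, and a non-zero return code raises CalledProcessError, which stops the chain:

import subprocess

script_list = ["scr1.sh", "scr2.sh", "scr3.sh"]
try:
    for script in script_list:
        subprocess.check_call([script])  # blocks until this script finishes
except subprocess.CalledProcessError as e:
    print("script failed: %s" % e)       # a script returned a non-zero code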
A new thread will start with each call to run(), and will also end when run() is done. You iterate over the scripts with subprocess within one thread.
You should make sure that the set of calls in each thread is not going to impact calls from other threads - for example, trying to read and write the same file from script calls in multiple threads at the same time.
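If the scripts in different threads do need to touch a shared file, one simple illustrative safeguard is to serialize just that part with a lock (the lock and file name here are made up for the example):

import threading

shared_output_lock = threading.Lock()  # shared by all worker threads

def append_result(line):
    # only one thread at a time may write to the shared results file
    with shared_output_lock:
        with open("results.txt", "a") as f:
            f.write(line + "\n")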