I'm writing a grading robot for a Python programming class, and the students' submissions may loop infinitely. I want to sandbox a shell call to their program so that it can't run longer than a specific amount of time. I'd like to run, say,
restrict --msec=100 --default=-1 python -c "while True: pass"
and have it return -1 if the program runs longer than 100 ms, and otherwise return the value of the executed command (in this case, the output of the Python program).
Does Python support this internally? I'm also writing the grading robot in Perl, so I could use some Perl module wrapped around the call to the shell script.
Use apply_async to call the student's function (foo in the example below). Use the get method with a timeout to retrieve the result if it returns within 0.1 seconds; otherwise get raises a TimeoutError:
import multiprocessing as mp
import time
import sys

def foo(x):
    time.sleep(x)
    return x * x

pool = mp.Pool(1)
for x in (0.01, 1.0):
    try:
        result = pool.apply_async(foo, args=(x,)).get(timeout=0.1)
    except KeyboardInterrupt:
        pool.terminate()
        sys.exit("Cancelled")
    except mp.TimeoutError:
        print('Timed out')
    else:
        print("Result: {r}".format(r=result))
Or, if the student submits a script instead of a function, then you could use jcollado's Command class.
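If you are on Python 3.5 or newer, subprocess.run with a timeout gives you the same effect without a separate Command class. Here is a minimal sketch (student.py and the -1 default are placeholders mirroring the flags in the question):
import subprocess

def run_with_limit(script_path, msec=100, default=-1):
    try:
        completed = subprocess.run(["python", script_path],
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE,
                                   timeout=msec / 1000.0)  # timeout is in seconds
        return completed.returncode
    except subprocess.TimeoutExpired:
        return default  # the child is killed and reaped for us on timeout

print(run_with_limit("student.py", msec=100))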
The standard approach is to do the following.
Create a subprocess which runs the student's program in its own Python instance.
Wait for a time.
If the student subprocess exits, good.
If the subprocess has not exited, you need to kill it.
You'll be happiest downloading the psutil module, which allows easy status checking of the subprocess.
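A rough sketch of that poll-and-kill loop, assuming you only need the exit status (psutil gives richer status checks, but Popen.poll() is enough to see whether the child has exited):
import subprocess
import time

def run_limited(cmd, timeout_sec=0.1, default=-1):
    proc = subprocess.Popen(cmd)
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if proc.poll() is not None:      # child exited on its own
            return proc.returncode
        time.sleep(0.01)                 # don't busy-wait
    proc.kill()                          # still running: kill it
    proc.wait()                          # reap the zombie
    return default

print(run_limited(["python", "-c", "while True: pass"], timeout_sec=0.1))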
I need to convert 86,000 TEX files to XML using the LaTeXML library on the command line. I tried to write a Python script to automate this with the subprocess module, utilizing all 4 cores.
def get_outpath(tex_path):
    path_parts = pathlib.Path(tex_path).parts
    arxiv_id = path_parts[2]
    outpath = 'xml/' + arxiv_id + '.xml'
    return outpath

def convert_to_xml(inpath):
    outpath = get_outpath(inpath)

    if os.path.isfile(outpath):
        message = '{}: Already converted.'.format(inpath)
        print(message)
        return

    try:
        process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
                                   stderr=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
    except Exception as error:
        process.kill()
        message = "error: %s run(*%r, **%r)" % (e, args, kwargs)
        print(message)

    message = '{}: Converted!'.format(inpath)
    print(message)

def start():
    start_time = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                maxtasksperchild=1)
    print('Initialized {} threads'.format(multiprocessing.cpu_count()))
    print('Beginning conversion...')

    for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
        pass

    pool.close()
    pool.join()

    print("TIME: {}".format(total_time))

start()
The script results in Too many open files and slows down my computer. From looking at Activity Monitor, it looks like this script is trying to create 86,000 conversion subprocesses at once, and each process is trying to open a file. Maybe this is the result of pool.imap_unordered(convert_to_xml, preprints) -- maybe I need to not use map in conjunction with subprocess.Popen, since I just have too many commands to call? What would be an alternative?
I've spent all day trying to figure out the right way to approach bulk subprocessing. I'm new to this part of Python, so any tips for heading in the right direction would be much appreciated. Thanks!
In convert_to_xml, the process = subprocess.Popen(...) statement spawns a latexml subprocess.
Without a blocking call such as process.communicate(), convert_to_xml returns even while latexml continues to run in the background.
Since convert_to_xml ends, the Pool sends the associated worker process another task to run and so convert_to_xml is called again.
Once again another latexml process is spawned in the background.
Pretty soon, you are up to your eyeballs in latexml processes and the resource limit on the number of open files is reached.
The fix is easy: add process.communicate() to tell convert_to_xml to wait until the latexml process has finished.
try:
    process = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
                               stderr=subprocess.PIPE,
                               stdout=subprocess.PIPE)
    process.communicate()
except Exception as error:
    process.kill()
    message = "error: %s while converting %s" % (error, inpath)
    print(message)
else:  # use else so that this won't run if there is an Exception
    message = '{}: Converted!'.format(inpath)
    print(message)
Regarding if __name__ == '__main__':
As martineau pointed out, there is a warning in the multiprocessing docs that
code that spawns new processes should not be called at the top level of a module.
Instead, the code should be contained inside an if __name__ == '__main__' block.
In Linux, nothing terrible happens if you disregard this warning.
But in Windows, the code "fork-bombs". Or more accurately, the code
causes an unmitigated chain of subprocesses to be spawned, because on Windows fork is simulated by spawning a new Python process which then imports the calling script. Every import spawns a new Python process. Every Python process tries to import the calling script. The cycle is not broken until all resources are consumed.
So to be nice to our Windows-fork-bereft brethren, use
if __name__ == '__main__':
    start()
Sometimes processes require a lot of memory. The only reliable way to free memory is to terminate the process. maxtasksperchild=1 tells the pool to terminate each worker process after it completes 1 task. It then spawns a new worker process to handle another task (if there are any). This frees the (memory) resources the original worker may have allocated which could not otherwise have been freed.
In your situation it does not look like the worker process is going to require much memory, so you probably don't need maxtasksperchild=1.
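If you want to see the effect of maxtasksperchild for yourself, here is a small sketch (not part of your script) that prints the worker PIDs: with maxtasksperchild=1 each task is handled by a fresh worker process, so several distinct PIDs show up.
import multiprocessing as mp
import os

def report_pid(x):
    return os.getpid()

if __name__ == '__main__':
    pool = mp.Pool(processes=2, maxtasksperchild=1)
    # several distinct PIDs, because each worker is replaced after one task
    print(sorted(set(pool.map(report_pid, range(8), chunksize=1))))
    pool.close()
    pool.join()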
The chunksize affects how many tasks a worker performs before sending the result back to the main process.
Sometimes this can affect performance, especially if interprocess communication is a significant portion of overall runtime.
In your situation, convert_to_xml takes a relatively long time (assuming we wait until latexml finishes) and it simply returns None. So interprocess communication probably isn't a significant portion of overall runtime. Therefore, I don't expect you would find a significant change in performance in this case (though it never hurts to experiment!).
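If you do want to experiment, a quick sketch (not from your script) is to time the same cheap workload with a few chunksize values and compare:
import multiprocessing as mp
import time

def work(x):
    return x * x   # cheap task, so interprocess communication dominates

if __name__ == '__main__':
    for chunksize in (1, 10, 100):
        pool = mp.Pool()
        start = time.time()
        list(pool.imap_unordered(work, range(20000), chunksize=chunksize))
        pool.close()
        pool.join()
        print('chunksize={}: {:.2f}s'.format(chunksize, time.time() - start))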
In plain Python, map should not be used merely for a function's side effects; it is meant for collecting return values.
For a similar stylistic reason, I would reserve using the pool.*map* methods for situations where I cared about the return values.
So instead of
for _ in pool.imap_unordered(convert_to_xml, preprints, chunksize=5):
    pass
you might consider using
for preprint in preprints:
    pool.apply_async(convert_to_xml, args=(preprint,))
The iterable passed to any of the pool.*map* functions is consumed
immediately. It doesn't matter if the iterable is an iterator. There is no
special memory benefit to using an iterator here. imap_unordered returns an
iterator, but it does not handle its input in any especially iterator-friendly
way.
No matter what type of iterable you pass, upon calling the pool.*map* function the iterable is
consumed and turned into tasks which are put into a task queue.
Here is code which corroborates this claim:
version1.py:
import multiprocessing as mp
import time

def foo(x):
    time.sleep(0.1)
    return x * x

def gen():
    for x in range(1000):
        if x % 100 == 0:
            print('Got here')
        yield x

def start():
    pool = mp.Pool()
    for item in pool.imap_unordered(foo, gen()):
        pass
    pool.close()
    pool.join()

if __name__ == '__main__':
    start()
version2.py:
import multiprocessing as mp
import time

def foo(x):
    time.sleep(0.1)
    return x * x

def gen():
    for x in range(1000):
        if x % 100 == 0:
            print('Got here')
        yield x

def start():
    pool = mp.Pool()
    for item in gen():
        result = pool.apply_async(foo, args=(item,))
    pool.close()
    pool.join()

if __name__ == '__main__':
    start()
Running version1.py and version2.py both produce the same result.
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Got here
Crucially, you will notice that Got here is printed 10 times very quickly at
the beginning of the run, and then there is a long pause (while the calculation
is done) before the program ends.
If the generator gen() were somehow consumed slowly by pool.imap_unordered,
we should expect Got here to be printed slowly as well. Since Got here is
printed 10 times and quickly, we can see that the iterable gen() is being
completely consumed well before the tasks are completed.
Running these programs should hopefully give you confidence that
pool.imap_unordered and pool.apply_async are putting tasks in the queue
essentially in the same way: immediately after the call is made.
Novice here: I am trying to execute some code serially and then create a pool of threads and execute some code in parallel. After the parallel execution is done, I want to execute some more code serially.
For example...
import time
from multiprocessing import Pool

print("I only want to print this statement once")

def worker(i):
    """worker function"""
    now = time.time()
    time.sleep(i)
    then = time.time()
    print(now, then)

if __name__ == '__main__':
    with Pool(3) as p:
        p.map(worker, [1, 1, 1])
        p.close()

print("Only print this once as well")
I would like this to return...
I only want to print this statement once
1533511478.0619314 1533511479.0620182
1533511478.0789354 1533511479.0791905
1533511478.0979397 1533511479.098235
Only print this once as well
However what it returns is this:
I only want to print this statement once
I only want to print this statement once
Only print this once as well
I only want to print this statement once
Only print this once as well
I only want to print this statement once
Only print this once as well
I only want to print this statement once
Only print this once as well
I only want to print this statement once
Only print this once as well
1533511478.0619314 1533511479.0620182
1533511478.0789354 1533511479.0791905
1533511478.0979397 1533511479.098235
Only print this once as well
So it seems to be running the print statements an additional time for each pool.
Any help would be appreciated!
Based on the observed behaviour, I assume you are on an NT/Windows operating system.
The reason you see all those prints is that on Windows the spawn start method is used. When a new process is "spawned", a new Python interpreter is launched and it receives the module and the function it's supposed to execute. When the new interpreter imports the module, the top-level print calls are executed. Hence the duplicate prints.
Just move those print statements inside the if __name__ == '__main__' block and you won't see them again.
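For example, a sketch of the corrected layout (same worker as in your script): everything that should run only once lives under the guard, so spawned workers do not re-execute it on import.
import time
from multiprocessing import Pool

def worker(i):
    """worker function"""
    now = time.time()
    time.sleep(i)
    then = time.time()
    print(now, then)

if __name__ == '__main__':
    print("I only want to print this statement once")
    with Pool(3) as p:
        p.map(worker, [1, 1, 1])
    print("Only print this once as well")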
I have two functions, draw_ascii_spinner and findCluster(companyid).
I would like to:
Run findCluster(companyid) in the background and, while it's processing...
Run draw_ascii_spinner until findCluster(companyid) finishes
How do I begin to solve this (Python 2.7)?
Use threads:
import threading, time

def wrapper(func, args, res):
    res.append(func(*args))

res = []
t = threading.Thread(target=wrapper, args=(findCluster, (companyid,), res))
t.start()
while t.is_alive():
    # print next iteration of ASCII spinner
    t.join(0.2)
print res[0]
You can use multiprocessing. Or, if findCluster(companyid) has sensible stopping points, you can turn it into a generator along with draw_ascii_spinner, to do something like this:
for tick in findCluster(companyid):
    ascii_spinner.next()
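A sketch of what that could look like, assuming findCluster can be split into chunks that yield at natural stopping points (the chunk loop, the sleep, and the spinner characters are hypothetical stand-ins):
import itertools
import sys
import time

def findCluster(companyid):
    for chunk in range(10):       # stand-in for real stopping points
        time.sleep(0.5)           # ... one slice of the clustering work ...
        yield chunk

ascii_spinner = itertools.cycle('|/-\\')

companyid = 42                    # placeholder argument
for tick in findCluster(companyid):
    sys.stdout.write(next(ascii_spinner) + '\r')
    sys.stdout.flush()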
Generally, you will use threads. Here is a simplistic approach which assumes that there are only two threads: 1) the main thread executing a task, 2) the spinner thread:
#!/usr/bin/env python
import time
import thread

def spinner():
    while True:
        print '.'
        time.sleep(1)

def task():
    time.sleep(5)

if __name__ == '__main__':
    thread.start_new_thread(spinner, ())
    # as soon as task finishes (and so the program)
    # spinner will be gone as well
    task()
This can be done with threads. FindCluster runs in a separate thread and when done, it can simply signal another thread that is polling for a reply.
You'll want to do some research on threading; the general form is going to be this:
Create a new thread for findCluster and create some way for the program to know the method is running - the simplest in Python is just a global boolean.
Run draw_ascii_spinner in a while loop conditioned on whether it is still running; you'll probably want this thread to sleep for a short period between iterations (a minimal sketch follows the tutorial link below).
Here's a short tutorial in Python - http://linuxgazette.net/107/pai.html
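A minimal sketch of the flag-based approach described above; findCluster here is a stand-in that just sleeps, and companyid is a placeholder argument:
import sys
import threading
import time

done = False

def findCluster(companyid):       # stand-in for the real, slow function
    time.sleep(3)
    return companyid

def run_findCluster(companyid):
    global done
    findCluster(companyid)
    done = True

companyid = 42                    # placeholder argument
t = threading.Thread(target=run_findCluster, args=(companyid,))
t.start()
while not done:
    sys.stdout.write('.')         # one frame of the ASCII spinner
    sys.stdout.flush()
    time.sleep(0.2)               # short sleep between spinner frames
t.join()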
Run findCluster() in a thread (the threading module makes this very easy), and then run draw_ascii_spinner until some condition is met.
Instead of using sleep() to set the pace of the spinner, you can call join() on the thread with a timeout.
Is it possible to have a working example? I am new to Python. I have 6 tasks to run in one Python program. These 6 tasks should work in coordination, meaning that one should start when another finishes. I saw the answers, but I couldn't adapt the code you shared to my program.
I used time.sleep, but I know that it is not good because I cannot know how long each task takes.
# Sending commands
for i in range(0, len(cmdList)):  # port Sending commands
    cmd = cmdList[i]
    cmdFull = convert(cmd)
    port.write(cmd.encode('ascii'))
    # s = port.read(10)
    print(cmd)

# Terminate the command + close serial port
port.write(cmdFull.encode('ascii'))
print('Termination')
port.close()
# time.sleep(1*60)
I am trying to automate some big data file processing using python.
A lot of the processing is chained, i.e. script1 writes a file, which is then processed by script2, then script2's output by script3, etc.
I am using the subprocess module in a threaded context.
I have one class that creates tuples of chained scripts
("scr1.sh","scr2.sh","scr3.sh").
Then another class that uses a call like
for script in scriplist:
    subprocess.call(script)
My question is: in this for loop, is each script only called after subprocess.call(script1) returns a successful return code?
Or do all three get called right after one another since I am using subprocess.call? Without using "sleep" or "wait", I want to make sure that the second script only starts after the first one is over.
edit: The pydoc says
"subprocess.call(*popenargs, **kwargs)
Run command with arguments. Wait for command to complete, then return the returncode attribute."
So in the for loop (above), does it wait for each return code before iterating to the next script?
I am new to threading. I am attaching the stripped-down code for the class that runs the analysis here. The subprocess.call loop is part of this class.
class ThreadedDataProcessor(Thread):
    def __init__(self, in_queue, out_queue):
        # Uses Queue
        Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            path = self.in_queue.get()
            if path is None:
                break
            myprocessor = ProcessorScriptCreator(path)
            scrfiles = myprocessor.create_and_return_shell_scripts()
            for index, file in enumerate(scrfiles):
                subprocess.call([file])
                print "CALLED%s%s" % (index, file) * 5
            #report(myfile.describe())
            #report("Done %s" % path)
            self.out_queue.put(path)

in_queue = Queue()
The loop will serially call each script, wait until it completes, and then call the next one regardless of success or failure of the previous call. You probably want to say:
try:
    map(subprocess.check_call, script_list)
except Exception, e:
    # failed script
    print "failed:", e
A new thread will start with each call to run, and also end when run is done. You iterate over the scripts with subprocess.call within one thread.
You should make sure that the calls in each thread do not impact calls from other threads, for example by reading and writing the same file from script calls in multiple threads at the same time (see the sketch below).
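One way to guard against that is to serialize access to the shared resource, for example with a lock around the subprocess call that touches it. A sketch (the shared_output_lock name and the script list are hypothetical):
import subprocess
import threading

shared_output_lock = threading.Lock()

def run_chain(script_list):
    for script in script_list:
        with shared_output_lock:             # only one thread at a time runs
            subprocess.check_call([script])  # a script that touches the file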
I'm currently launching a programme using subprocess.Popen(cmd, shell=True)
I'm fairly new to Python, but it 'feels' like there ought to be some api that lets me do something similar to:
subprocess.Popen(cmd, shell=True, postexec_fn=function_to_call_on_exit)
I am doing this so that function_to_call_on_exit can do something based on knowing that the cmd has exited (for example keeping count of the number of external processes currently running)
I assume that I could fairly trivially wrap subprocess in a class that combined threading with the Popen.wait() method, but as I've not done threading in Python yet and it seems like this might be common enough for an API to exist, I thought I'd try and find one first.
Thanks in advance :)
You're right - there is no nice API for this. You're also right on your second point - it's trivially easy to design a function that does this for you using threading.
import threading
import subprocess

def popen_and_call(on_exit, popen_args):
    """
    Runs the given args in a subprocess.Popen, and then calls the function
    on_exit when the subprocess completes.
    on_exit is a callable object, and popen_args is a list/tuple of args that
    you would give to subprocess.Popen.
    """
    def run_in_thread(on_exit, popen_args):
        proc = subprocess.Popen(*popen_args)
        proc.wait()
        on_exit()
        return

    thread = threading.Thread(target=run_in_thread, args=(on_exit, popen_args))
    thread.start()
    # returns immediately after the thread starts
    return thread
Even threading is pretty easy in Python, but note that if on_exit() is computationally expensive, you'll want to put this in a separate process instead, using multiprocessing (so that the GIL doesn't slow your program down). It's actually very simple - you can basically just replace all calls to threading.Thread with multiprocessing.Process, since they follow (almost) the same API.
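A sketch of that multiprocessing variant: swap threading.Thread for multiprocessing.Process, keeping in mind that on_exit and popen_args must be picklable if your platform uses the spawn start method (e.g. Windows).
import multiprocessing
import subprocess

def _run_and_notify(on_exit, popen_args):
    proc = subprocess.Popen(*popen_args)
    proc.wait()
    on_exit()

def popen_and_call_mp(on_exit, popen_args):
    process = multiprocessing.Process(target=_run_and_notify,
                                      args=(on_exit, popen_args))
    process.start()
    return process  # returns immediately after the child process starts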
There is the concurrent.futures module in Python 3.2 (available via pip install futures for older Python < 3.2):
pool = Pool(max_workers=1)
f = pool.submit(subprocess.call, "sleep 2; echo done", shell=True)
f.add_done_callback(callback)
The callback will be called in the same process that called f.add_done_callback().
Full program
import logging
import subprocess
# to install run `pip install futures` on Python <3.2
from concurrent.futures import ThreadPoolExecutor as Pool

info = logging.getLogger(__name__).info

def callback(future):
    if future.exception() is not None:
        info("got exception: %s" % future.exception())
    else:
        info("process returned %d" % future.result())

def main():
    logging.basicConfig(
        level=logging.INFO,
        format=("%(relativeCreated)04d %(process)05d %(threadName)-10s "
                "%(levelname)-5s %(msg)s"))

    # wait for the process completion asynchronously
    info("begin waiting")
    pool = Pool(max_workers=1)
    f = pool.submit(subprocess.call, "sleep 2; echo done", shell=True)
    f.add_done_callback(callback)
    pool.shutdown(wait=False)  # no .submit() calls after that point

    info("continue waiting asynchronously")

if __name__ == "__main__":
    main()
Output
$ python . && python3 .
0013 05382 MainThread INFO begin waiting
0021 05382 MainThread INFO continue waiting asynchronously
done
2025 05382 Thread-1 INFO process returned 0
0007 05402 MainThread INFO begin waiting
0014 05402 MainThread INFO continue waiting asynchronously
done
2018 05402 Thread-1 INFO process returned 0
I modified Daniel G's answer to simply pass the subprocess.Popen args and kwargs as themselves instead of as a separate tuple/list, since I wanted to use keyword arguments with subprocess.Popen.
In my case I had a method postExec() that I wanted to run after subprocess.Popen('exe', cwd=WORKING_DIR)
With the code below, it simply becomes popenAndCall(postExec, 'exe', cwd=WORKING_DIR)
import threading
import subprocess

def popenAndCall(onExit, *popenArgs, **popenKWArgs):
    """
    Runs a subprocess.Popen, and then calls the function onExit when the
    subprocess completes.

    Use it exactly the way you'd normally use subprocess.Popen, except include a
    callable to execute as the first argument. onExit is a callable object, and
    *popenArgs and **popenKWArgs are simply passed up to subprocess.Popen.
    """
    def runInThread(onExit, popenArgs, popenKWArgs):
        proc = subprocess.Popen(*popenArgs, **popenKWArgs)
        proc.wait()
        onExit()
        return

    thread = threading.Thread(target=runInThread,
                              args=(onExit, popenArgs, popenKWArgs))
    thread.start()

    return thread  # returns immediately after the thread starts
I had the same problem, and solved it using multiprocessing.Pool. There are two hacky tricks involved:
make the size of the pool 1
pass iterable arguments within an iterable of length 1
The result is one function executed with a callback on completion.
import multiprocessing

def sub(arg):
    print arg   # prints [1,2,3,4,5]
    return "hello"

def cb(arg):
    print arg   # prints "hello"

pool = multiprocessing.Pool(1)
rval = pool.map_async(sub, ([[1,2,3,4,5]]), callback=cb)
# (do stuff)
pool.close()
In my case, I wanted the invocation to be non-blocking as well. It works beautifully.
I was inspired by Daniel G.'s answer and implemented a very simple use case - in my work I often need to make repeated calls to the same (external) process with different arguments. I had hacked a way to determine when each specific call was done, but now I have a much cleaner way to issue callbacks.
I like this implementation because it is very simple, yet it allows me to issue asynchronous calls to multiple processors (notice I use multiprocessing instead of threading) and receive notification upon completion.
I tested the sample program and works great. Please edit at will and provide feedback.
import multiprocessing
import subprocess

class Process(object):
    """This class spawns a subprocess asynchronously and calls a
    `callback` upon completion; it is not meant to be instantiated
    directly (derived classes are called instead)"""
    def __call__(self, *args):
        # store the arguments for later retrieval
        self.args = args
        # define the target function to be called by
        # `multiprocessing.Process`
        def target():
            cmd = [self.command] + [str(arg) for arg in self.args]
            process = subprocess.Popen(cmd)
            # the `multiprocessing.Process` process will wait until
            # the call to the `subprocess.Popen` object is completed
            process.wait()
            # upon completion, call `callback`
            return self.callback()
        mp_process = multiprocessing.Process(target=target)
        # this call issues the call to `target`, but returns immediately
        mp_process.start()
        return mp_process

if __name__ == "__main__":
    def squeal(who):
        """this serves as the callback function; its argument is the
        instance of a subclass of Process making the call"""
        print "finished %s calling %s with arguments %s" % (
            who.__class__.__name__, who.command, who.args)

    class Sleeper(Process):
        """Sample implementation of an asynchronous process - define
        the command name (available in the system path) and a callback
        function (previously defined)"""
        command = "./sleeper"
        callback = squeal

    # create an instance of Sleeper - this is the Process object that
    # can be called repeatedly in an asynchronous manner
    sleeper_run = Sleeper()

    # spawn three sleeper runs with different arguments
    sleeper_run(5)
    sleeper_run(2)
    sleeper_run(1)

    # the user should see the following message immediately (even
    # though the Sleeper calls are not done yet)
    print "program continued"
Sample output:
program continued
finished Sleeper calling ./sleeper with arguments (1,)
finished Sleeper calling ./sleeper with arguments (2,)
finished Sleeper calling ./sleeper with arguments (5,)
Below is the source code of sleeper.c - my sample "time consuming" external process
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]){
    unsigned int t = atoi(argv[1]);
    sleep(t);
    return EXIT_SUCCESS;
}
compile as:
gcc -o sleeper sleeper.c
There is also ProcessPoolExecutor in concurrent.futures since Python 3.2 (https://docs.python.org/3/library/concurrent.futures.html). Usage is the same as with the ThreadPoolExecutor mentioned above, with the on-exit callback attached via future.add_done_callback().
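A minimal sketch of that variant (assuming a POSIX shell for the sleep command); note the callback is attached to the returned future, not the executor:
import subprocess
from concurrent.futures import ProcessPoolExecutor

def callback(future):
    print("process returned %d" % future.result())

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(subprocess.call, "sleep 2; echo done", shell=True)
        future.add_done_callback(callback)  # fires when the call completes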
Thanks guys, for pointing me in the right direction. I made a class from what I found here and added a stop function to kill the process:
from subprocess import Popen
from threading import Thread

class popenplus():
    def __init__(self, onExit, *popenArgs, **popenKWArgs):
        thread = Thread(target=self.runInThread, args=(onExit, popenArgs, popenKWArgs))
        thread.start()

    def runInThread(self, onExit, popenArgs, popenKWArgs):
        self.proc = Popen(*popenArgs, **popenKWArgs)
        self.proc.wait()
        self.proc = None
        onExit()

    def stop(self):
        if self.proc:
            self.proc.kill()
On POSIX systems, the parent process receives a SIGCHLD signal when a child process exits. To run a callback when a subprocess command exits, handle the SIGCHLD signal in the parent. Something like this:
import signal
import subprocess

def sigchld_handler(signum, frame):
    # This is run when the child exits.
    # Do something here ...
    pass

signal.signal(signal.SIGCHLD, sigchld_handler)
process = subprocess.Popen('mycmd', shell=True)
Note that this will not work on Windows.
AFAIK there is no such API, at least not in the subprocess module. You need to roll something of your own, possibly using threads.