python multiple threads redirecting stdout - python

I'm building an icecast2 radio station that restreams existing stations in lower quality. The program spawns multiple FFmpeg processes that restream 24/7. For troubleshooting, I would like the output of every FFmpeg process redirected to a separate file.
import ffmpeg, csv
from threading import Thread
def run(name, mount, source):
    icecast = "icecast://"+ICECAST2_USER+":"+ICECAST2_PASS+"@localhost:"+ICECAST2_PORT+"/"+mount
    stream = (
        ffmpeg
        .input(source)
        .output(
            icecast,
            audio_bitrate=BITRATE, sample_rate=SAMPLE_RATE, format=FORMAT, acodec=CODEC,
            reconnect="1", reconnect_streamed="1", reconnect_at_eof="1", reconnect_delay_max="120",
            ice_name=name, ice_genre=source
        )
    )
    return stream
with open('stations.csv', mode='r') as data:
    for station in csv.DictReader(data):
        stream = run(station['name'], station['mount'], station['url'])
        thread = Thread(target=stream.run)
        thread.start()
As I understand it, I can't redirect the stdout of each thread separately, and I can't use FFmpeg's reporting either, since it is only configured through an environment variable. Do I have any other options?

You need to create a thread function of your own:
import subprocess as sp
def stream_runner(stream, id):
    # open a stream-specific log file to write to
    with open(f'stream_{id}.log', 'wt') as f:
        # block until ffmpeg is done
        sp.run(stream.compile(), stderr=f)
with open('stations.csv', mode='r') as data:
    for i, station in enumerate(csv.DictReader(data)):
        stream = run(station['name'], station['mount'], station['url'])
        thread = Thread(target=stream_runner, args=(stream, i))
        thread.start()
Something like this should work.

ffmpeg-python doesn't quite give you the tools to do this: you want to control one of the arguments to subprocess, stderr, but ffmpeg-python doesn't have an argument for this.
However, what ffmpeg-python does have, is the ability to show the command line arguments that it would have used. You can make your own call to subprocess after that.
You also don't need to use threads to do this - you can set up each ffmpeg subprocess, without waiting for it to complete, and check in on it each second. This example starts up two ffmpeg instances in parallel, and monitors each one by printing out the most recent line of output from each one every second, as well as tracking if they've exited.
I made two changes for testing:
It gets the stations from a dictionary rather than a CSV file.
It transcodes an MP4 file rather than an audio stream, since I don't have an icecast server. If you want to test it, it expects to have a file named 'sample.mp4' in the same directory.
Both should be pretty easy to change back.
import ffmpeg
import subprocess
import os
import time
stations = [
    {'name': 'foo1', 'input': 'sample.mp4', 'output': 'output.mp4'},
    {'name': 'foo2', 'input': 'sample.mp4', 'output': 'output2.mp4'},
]
class Transcoder():
    def __init__(self, arguments):
        self.arguments = arguments
    def run(self):
        stream = (
            ffmpeg
            .input(self.arguments['input'])
            .output(self.arguments['output'])
        )
        args = stream.compile(overwrite_output=True)
        # ffmpeg writes its log output to stderr, so send that to the log file
        with open(self.log_name(), 'ab') as logfile:
            self.subproc = subprocess.Popen(
                args,
                stdin=None,
                stdout=None,
                stderr=logfile,
            )
    def log_name(self):
        return self.arguments['name'] + "-ffmpeg.log"
    def still_running(self):
        return self.subproc.poll() is None
    def last_log_line(self):
        with open(self.log_name(), 'rb') as f:
            try:  # catch OSError in case of a one line file
                f.seek(-2, os.SEEK_END)
                # walk backwards until the previous line terminator
                while f.read(1) not in [b'\n', b'\r']:
                    f.seek(-2, os.SEEK_CUR)
            except OSError:
                f.seek(0)
            last_line = f.readline().decode()
        last_line = last_line.split('\n')[-1]
        return last_line
    def name(self):
        return self.arguments['name']
transcoders = []
for station in stations:
    t = Transcoder(station)
    t.run()
    transcoders.append(t)
while True:
    for t in list(transcoders):
        if not t.still_running():
            print(f"{t.name()} has exited")
            transcoders.remove(t)
        print(t.name(), repr(t.last_log_line()))
    if len(transcoders) == 0:
        break
    time.sleep(1)
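Starting each process yourself also loosens the environment-variable restriction mentioned in the question, because `subprocess` accepts a separate `env` for every child. The sketch below assumes FFmpeg's documented `FFREPORT` variable in its `file=...:level=...` form. The helper names `ffreport_env` and `child_sees` and the report file naming are made up for illustration, and a small Python child stands in for ffmpeg so the example runs without it.

```python
import os
import subprocess
import sys

def ffreport_env(name, level=32):
    """Return a copy of the current environment with a per-process FFREPORT.

    `name` is a hypothetical station name used to build the report file name.
    """
    env = dict(os.environ)
    env["FFREPORT"] = f"file={name}-ffreport.log:level={level}"
    return env

def child_sees(name):
    """Start a child with its own environment and return the value it sees.

    A Python child stands in for ffmpeg here; with ffmpeg you would pass
    the compiled argument list instead.
    """
    code = "import os; print(os.environ['FFREPORT'])"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=ffreport_env(name),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    for station in ("foo1", "foo2"):
        print(station, "->", child_sees(station))
```

In the Transcoder class above, the same idea would amount to passing `env=ffreport_env(self.name())` to `subprocess.Popen`.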

Related

How to check output of a sub process but also hide it? [duplicate]

NB. I have seen Log output of multiprocessing.Process - unfortunately, it doesn't answer this question.
I am creating a child process (on windows) via multiprocessing. I want all of the child process's stdout and stderr output to be redirected to a log file, rather than appearing at the console. The only suggestion I have seen is for the child process to set sys.stdout to a file. However, this does not effectively redirect all stdout output, due to the behaviour of stdout redirection on Windows.
To illustrate the problem, build a Windows DLL with the following code
#include <iostream>
extern "C"
{
    __declspec(dllexport) void writeToStdOut()
    {
        std::cout << "Writing to STDOUT from test DLL" << std::endl;
    }
}
Then create and run a python script like the following, which imports this DLL and calls the function:
from ctypes import *
import sys
print
print "Writing to STDOUT from python, before redirect"
print
sys.stdout = open("stdout_redirect_log.txt", "w")
print "Writing to STDOUT from python, after redirect"
testdll = CDLL("Release/stdout_test.dll")
testdll.writeToStdOut()
In order to see the same behaviour as me, it is probably necessary for the DLL to be built against a different C runtime than the one Python uses. In my case, Python is built with Visual Studio 2010, but my DLL is built with VS 2005.
The behaviour I see is that the console shows:
> stdout_test.py
Writing to STDOUT from python, before redirect
Writing to STDOUT from test DLL
While the file stdout_redirect_log.txt ends up containing:
Writing to STDOUT from python, after redirect
In other words, setting sys.stdout failed to redirect the stdout output generated by the DLL. This is unsurprising given the nature of the underlying APIs for stdout redirection in Windows. I have encountered this problem at the native/C++ level before and never found a way to reliably redirect stdout from within a process. It has to be done externally.
This is actually the very reason I am launching a child process - it's so that I can connect externally to its pipes and thus guarantee that I am intercepting all of its output. I can definitely do this by launching the process manually with pywin32, but I would very much like to be able to use the facilities of multiprocessing, in particular the ability to communicate with the child process via a multiprocessing Pipe object, in order to get progress updates. The question is whether there is any way to both use multiprocessing for its IPC facilities and to reliably redirect all of the child's stdout and stderr output to a file.
UPDATE: Looking at the source code for multiprocessing.Process, it has a static member, _Popen, which looks like it can be used to override the class used to create the process. If it's set to None (default), it uses a multiprocessing.forking._Popen, but it looks like by saying
multiprocessing.Process._Popen = MyPopenClass
I could override the process creation. However, although I could derive this from multiprocessing.forking._Popen, it looks like I would have to copy a bunch of internal stuff into my implementation, which sounds flaky and not very future-proof. If that's the only choice I think I'd probably plump for doing the whole thing manually with pywin32 instead.
The solution you suggest is a good one: create your processes manually such that you have explicit access to their stdout/stderr file handles. You can then create a socket to communicate with the sub-process and use multiprocessing.connection over that socket (multiprocessing.Pipe creates the same type of connection object, so this should give you all the same IPC functionality).
Here's a two-file example.
master.py:
import errno
import multiprocessing.connection
import subprocess
import socket
import sys, os
## Listen for connection from remote process (and find free port number)
port = 10000
while True:
    try:
        # authkey must be a byte string on Python 3
        l = multiprocessing.connection.Listener(('localhost', int(port)), authkey=b"secret")
        break
    except socket.error as ex:
        if ex.errno != errno.EADDRINUSE:
            raise
        port += 1  ## EADDRINUSE means the port is not available; try the next one.
proc = subprocess.Popen((sys.executable, "subproc.py", str(port)), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
## open connection for remote process
conn = l.accept()
conn.send([1, "asd", None])
print(proc.stdout.readline())
subproc.py:
import multiprocessing.connection
import subprocess
import sys, os, time
port = int(sys.argv[1])
# authkey must match the one used by master.py (a byte string on Python 3)
conn = multiprocessing.connection.Client(('localhost', port), authkey=b"secret")
while True:
    try:
        obj = conn.recv()
        print("received: %s\n" % str(obj))
        sys.stdout.flush()
    except EOFError:  ## connection closed
        break
You may also want to see the first answer to this question to get non-blocking reads from the subprocess.
I don't think you have a better option than redirecting a subprocess to a file as you mentioned in your comment.
Console stdin/stdout/stderr work like this on Windows: each process has its standard handles defined when it is created, and you can change them with SetStdHandle. When you modify Python's sys.stdout, you only change where Python prints, not where other DLLs print, because the CRT in your DLL uses GetStdHandle to find out where to print. If you want, you can do whatever piping you like with the Windows API, either in your DLL or in your Python script with pywin32, though I do think it will be simpler with subprocess.
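To make the distinction between the Python-level `sys.stdout` object and the OS-level descriptor concrete, here is a minimal sketch of redirecting the descriptor itself with `os.dup2`. The helper name `redirect_fd` is made up for illustration. This is written with POSIX behaviour in mind. On Windows, a DLL linked against a different C runtime may keep its own idea of stdout, so this may still miss output there, which is the question's point about redirecting from outside the process.

```python
import contextlib
import os
import sys
import tempfile

@contextlib.contextmanager
def redirect_fd(fd, path):
    """Temporarily point OS-level descriptor `fd` at the file `path`."""
    sys.stdout.flush()
    sys.stderr.flush()
    saved = os.dup(fd)  # remember where fd pointed originally
    target = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.dup2(target, fd)  # fd now refers to the log file
        yield
    finally:
        os.dup2(saved, fd)  # restore the original destination
        os.close(saved)
        os.close(target)

if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "fd_redirect_log.txt")
    with redirect_fd(1, log_path):
        # os.write bypasses sys.stdout, much like output from native code
        os.write(1, b"written through descriptor 1\n")
    with open(log_path, "rb") as f:
        print(f.read())
```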
Alternatively (and I know this might be slightly off-topic, but it helped in my case with the same problem), this can be resolved with screen on Linux:
screen -L -Logfile './logfile_%Y-%m-%d.log' python my_multiproc_script.py
This way there is no need to implement all the master-child communication.
I assume I'm off base and missing something, but for what it's worth here is what came to mind when I read your question.
If you can intercept all of the stdout and stderr (I got that impression from your question), then why not add or wrap that capture functionality around each of your processes? Then send what is captured through a queue to a consumer that can do whatever you want with all of the outputs?
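A rough sketch of that idea, assuming the children are started with `subprocess` and their stderr is merged into stdout: one reader thread per child forwards lines to a shared `queue.Queue`, and a single consumer drains it. The names `pump` and `run_and_collect` are made up, and small Python children stand in for the real processes.

```python
import queue
import subprocess
import sys
import threading

def pump(stream, tag, out_queue):
    """Read lines from a child's pipe and forward them to a shared queue."""
    for line in stream:
        out_queue.put((tag, line.rstrip("\n")))
    stream.close()

def run_and_collect(commands):
    """Run each command, merge stderr into stdout, and collect every line."""
    out_queue = queue.Queue()
    procs, threads = [], []
    for tag, cmd in commands.items():
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                             universal_newlines=True)
        t = threading.Thread(target=pump, args=(p.stdout, tag, out_queue))
        t.start()
        procs.append(p)
        threads.append(t)
    for t in threads:
        t.join()
    for p in procs:
        p.wait()
    # The single consumer: here it simply drains the queue into a list.
    collected = []
    while not out_queue.empty():
        collected.append(out_queue.get())
    return collected

if __name__ == "__main__":
    demo = {
        "a": [sys.executable, "-c", "print('one'); print('two')"],
        "b": [sys.executable, "-c", "import sys; print('oops', file=sys.stderr)"],
    }
    for tag, line in run_and_collect(demo):
        print(tag, line)
```

In a long-running program the consumer would normally run in its own thread and write to a log file or a GUI widget instead of waiting for the children to finish.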
In my situation I changed sys.stdout.write to write to a PySide QTextEdit. I couldn't read from sys.stdout and I didn't know how to change sys.stdout to be readable. I created two Pipes. One for stdout and the other for stderr. In the separate process I redirect sys.stdout and sys.stderr to the child connection of the multiprocessing pipe. On the main process I created two threads to read the stdout and stderr parent pipe and redirect the pipe data to sys.stdout and sys.stderr.
import sys
import contextlib
import threading
import multiprocessing as mp
import multiprocessing.queues
from queue import Empty
import time
class PipeProcess(mp.Process):
    """Process to pipe the output of the sub process and redirect it to this sys.stdout and sys.stderr.
    Note:
        The use_queue = True argument will pass data between processes using Queues instead of Pipes. Queues will
        give you the full output and read all of the data from the Queue. A pipe is more efficient, but may not
        redirect all of the output back to the main process.
    """
    def __init__(self, group=None, target=None, name=None, args=tuple(), kwargs={}, *_, daemon=None,
                 use_pipe=None, use_queue=None):
        self.read_out_th = None
        self.read_err_th = None
        self.pipe_target = target
        self.pipe_alive = mp.Event()
        if use_pipe or (use_pipe is None and not use_queue):  # Default
            self.parent_stdout, self.child_stdout = mp.Pipe(False)
            self.parent_stderr, self.child_stderr = mp.Pipe(False)
        else:
            self.parent_stdout = self.child_stdout = mp.Queue()
            self.parent_stderr = self.child_stderr = mp.Queue()
        args = (self.child_stdout, self.child_stderr, target) + tuple(args)
        target = self.run_pipe_out_target
        super(PipeProcess, self).__init__(group=group, target=target, name=name, args=args, kwargs=kwargs,
                                          daemon=daemon)
    def start(self):
        """Start the multiprocess and reading thread."""
        self.pipe_alive.set()
        super(PipeProcess, self).start()
        self.read_out_th = threading.Thread(target=self.read_pipe_out,
                                            args=(self.pipe_alive, self.parent_stdout, sys.stdout))
        self.read_err_th = threading.Thread(target=self.read_pipe_out,
                                            args=(self.pipe_alive, self.parent_stderr, sys.stderr))
        self.read_out_th.daemon = True
        self.read_err_th.daemon = True
        self.read_out_th.start()
        self.read_err_th.start()
    @classmethod
    def run_pipe_out_target(cls, pipe_stdout, pipe_stderr, pipe_target, *args, **kwargs):
        """The real multiprocessing target to redirect stdout and stderr to a pipe or queue."""
        sys.stdout.write = cls.redirect_write(pipe_stdout)  # , sys.__stdout__)  # Is redirected in main process
        sys.stderr.write = cls.redirect_write(pipe_stderr)  # , sys.__stderr__)  # Is redirected in main process
        pipe_target(*args, **kwargs)
    @staticmethod
    def redirect_write(child, out=None):
        """Create a function to write out a pipe and write out an additional out."""
        if isinstance(child, mp.queues.Queue):
            send = child.put
        else:
            send = child.send_bytes  # No need to pickle with child_conn.send(data)
        def write(data, *args):
            try:
                if isinstance(data, str):
                    data = data.encode('utf-8')
                send(data)
                if out is not None:
                    out.write(data)
            except:
                pass
        return write
    @classmethod
    def read_pipe_out(cls, pipe_alive, pipe_out, out):
        if isinstance(pipe_out, mp.queues.Queue):
            # Queue has better functionality to get all of the data
            def recv():
                return pipe_out.get(timeout=0.5)
            def is_alive():
                return pipe_alive.is_set() or pipe_out.qsize() > 0
        else:
            # Pipe is more efficient
            recv = pipe_out.recv_bytes  # No need to unpickle with data = pipe_out.recv()
            is_alive = pipe_alive.is_set
        # Loop through reading and redirecting data
        while is_alive():
            try:
                data = recv()
                if isinstance(data, bytes):
                    data = data.decode('utf-8')
                out.write(data)
            except EOFError:
                break
            except Empty:
                pass
            except:
                pass
    def join(self, *args):
        # Wait for process to finish (unless a timeout was given)
        super(PipeProcess, self).join(*args)
        # Trigger to stop the threads
        self.pipe_alive.clear()
        # Pipe must close to prevent blocking and waiting on recv forever
        # (suppress() with no arguments suppresses nothing, so name Exception explicitly)
        if not isinstance(self.parent_stdout, mp.queues.Queue):
            with contextlib.suppress(Exception):
                self.parent_stdout.close()
            with contextlib.suppress(Exception):
                self.parent_stderr.close()
        # Close the pipes and threads
        with contextlib.suppress(Exception):
            self.read_out_th.join()
        with contextlib.suppress(Exception):
            self.read_err_th.join()
def run_long_print():
    for i in range(1000):
        print(i)
        print(i, file=sys.stderr)
    print('finished')
if __name__ == '__main__':
    # Example test write (My case was a QTextEdit)
    out = open('stdout.log', 'w')
    err = open('stderr.log', 'w')
    # Overwrite the write function and not the actual stdout object to prove this works
    sys.stdout.write = out.write
    sys.stderr.write = err.write
    # Create a process that uses pipes to read multiprocess output back into sys.stdout.write
    proc = PipeProcess(target=run_long_print, use_queue=True)  # If use_pipe=True Pipe may not write out all values
    # proc.daemon = True  # If daemon and use_queue Not all output may be redirected to stdout
    proc.start()
    # time.sleep(5)  # Not needed unless use_pipe or daemon and all of stdout/stderr is desired
    # Close the process
    proc.join()  # For some odd reason this blocks forever when use_queue=False
    # Close the output files for this test
    out.close()
    err.close()
Here is the simple and straightforward way for capturing stdout for multiprocessing.Process:
import app
import io
import sys
from multiprocessing import Process
def run_app(some_param):
    sys.stdout = io.TextIOWrapper(open(sys.stdout.fileno(), 'wb', 0), write_through=True)
    app.run()
app_process = Process(target=run_app, args=('some_param',))
app_process.start()
# Use app_process.terminate() for Python < 3.7 (Process.kill() was added in 3.7).
app_process.kill()

Python Subprocess - filter out logging

Python 3.6
I want to capture all output from a subprocess which I run with the subprocess module. I can easily pipe this output to a log file, and it works great.
But, I want to filter out a lot of the lines (lots of noisy output from modules I do not control).
Attempt 1
def run_command(command, log_file):
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, bufsize=1,
                               universal_newlines=True)
    while True:
        output = process.stdout.readline()
        if output == '' and process.poll() is not None:
            break
        if output and not_noisy_line(output):
            log_file.write(output)
            log_file.flush()
    return process.poll()
But this introduced a race condition between my subprocess and the output.
Attempt 2
I created a new method and a class to wrap the logging.
def run_command(command, log_file):
    process = subprocess.run(command, stdout=QuietLogger(log_file), stderr=QuietLogger(log_file), timeout=120)
    return process.returncode
class QuietLogger(io.TextIOWrapper):
    def write(self, data, encoding=sys.getdefaultencoding()):
        data = filter(data)
        super().write(data)
This does however just completely skip my filter function, my write method is not called at all by the subprocess. (If I call QuietLogger().write('asdasdsa') it goes through the filters)
Any clues?
This is an interesting situation in which the file object abstraction partially breaks down. The reason your solution does not work is that subprocess is not actually using your QuietLogger but is getting the raw file number out of it (then repackaging it as an io.TextIOWrapper object).
I don't know if this is an intrinsic limitation in how the subprocess is handled, relying on OS support, or if this is just a mistake in the Python design, but in order to achieve what you want, you need to use the standard subprocess.PIPE and then roll your own file writer.
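A small experiment can make this visible. The sketch below uses a made-up `CountingFile` wrapper (not part of any library) that counts calls to its `write()` and exposes `fileno()`. A Python child stands in for the real command. The child's output reaches the file, yet the wrapper's `write()` is never called, because only the descriptor is handed to the child.

```python
import subprocess
import sys
import tempfile

class CountingFile:
    """File-like wrapper that counts calls to write()."""
    def __init__(self, raw):
        self.raw = raw
        self.write_calls = 0
    def write(self, data):
        self.write_calls += 1
        return self.raw.write(data)
    def fileno(self):
        # subprocess only asks for this descriptor number
        return self.raw.fileno()

def demo():
    """Run a child with the wrapper as stdout; return (write calls, file content)."""
    with tempfile.TemporaryFile(mode="w+") as raw:
        wrapped = CountingFile(raw)
        subprocess.run([sys.executable, "-c", "print('from child')"], stdout=wrapped)
        raw.seek(0)
        return wrapped.write_calls, raw.read()

if __name__ == "__main__":
    calls, content = demo()
    print("write() calls:", calls)
    print("file content:", repr(content))
```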
If you can wait for the subprocess to finish, then it can be trivially done, using the subprocess.run and then picking the stream out of the CompletedProcess (p) object:
p = subprocess.run(command, stdout=subprocess.PIPE, universal_newlines=True)
data = filter(p.stdout)
with open(logfile, 'w') as f:
    f.write(data)
If you must work with the output while it is being generated (and thus cannot wait for the subprocess to end), the simplest way is to resort to subprocess.Popen and threads:
import subprocess
import threading
logfile = 'tmp.txt'
filter_passed = lambda line: line[:3] != 'Bad'
command = ['my_cmd', 'arg']
def writer(p, logfile):
    with open(logfile, 'w') as f:
        for line in p.stdout:
            if filter_passed(line):
                f.write(line)
p = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True)
t = threading.Thread(target=writer, args=(p, logfile))
t.start()
t.join()
[Edit: My brain got derailed along the way, and I ended up answering another question than was actually asked. The following solution is useful for concurrently writing to a file, not for using the logging module in any way. However, since at least it's useful for that, I'll leave the answer in place for now.]
If you were just using threads, not separate processes, you'd just have to have a standard lock. So you could try something similar.
There's always the option of locking the output file. I don't know if your operating system supports anything like that, but the usual Unix way of doing it is to create a lock file. Basically, if the file exists, then wait; otherwise create the file before writing to your log file, and after you're done, remove the lock file again. You could use a context manager like this:
import os
import os.path
from time import sleep
class LockedFile():
    def __init__(self, filename, mode):
        self.filename = filename
        self.lockfile = filename + '.lock'
        self.mode = mode
    def __enter__(self):
        while True:
            if os.path.isfile(self.lockfile):
                sleep(0.1)
            else:
                break
        with open(self.lockfile, 'a'):
            os.utime(self.lockfile)
        self.f = open(self.filename, self.mode)
        return self.f
    def __exit__(self, *args):
        self.f.close()
        os.remove(self.lockfile)
# And here's how to use it:
with LockedFile('blorg', 'a') as f:
    f.write('foo\n')
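One caveat with the context manager above: checking for the lock file and then creating it are two separate steps, so two processes can both see "no lock" and proceed together. A possible variant, sketched here under the assumption of a local filesystem, creates the lock file atomically with `os.open` and `O_CREAT | O_EXCL`. The class name `AtomicLockedFile` and the `poll` parameter are made up for illustration.

```python
import os
import tempfile
import time

class AtomicLockedFile:
    """Lock-file context manager that creates the lock atomically."""
    def __init__(self, filename, mode, poll=0.1):
        self.filename = filename
        self.lockfile = filename + '.lock'
        self.mode = mode
        self.poll = poll

    def __enter__(self):
        while True:
            try:
                # O_EXCL makes creation fail if the lock file already exists
                fd = os.open(self.lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                break
            except FileExistsError:
                time.sleep(self.poll)
        try:
            self.f = open(self.filename, self.mode)
        except Exception:
            os.remove(self.lockfile)  # do not leave a stale lock behind
            raise
        return self.f

    def __exit__(self, *args):
        self.f.close()
        os.remove(self.lockfile)

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), 'blorg')
    with AtomicLockedFile(path, 'a') as f:
        f.write('foo\n')
    with open(path) as fh:
        print(fh.read(), end='')
```

A process that dies while holding the lock still leaves the lock file behind, so a real deployment would need some form of stale-lock cleanup.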

tail multiple logfiles in python

This is probably a bit of a silly exercise for me, but it raises a bunch of interesting questions. I have a directory of logfiles from my chat client, and I want to be notified using notify-osd every time one of them changes.
The script that I wrote basically uses os.popen to run the linux tail command on every one of the files to get the last line, and then check each line against a dictionary of what the lines were the last time it ran. If the line changed, it used pynotify to send me a notification.
This script actually worked perfectly, except for the fact that it used a huge amount of cpu (probably because it was running tail about 16 times every time the loop ran, on files that were mounted over sshfs.)
It seems like something like this would be a great solution, but I don't see how to implement that for more than one file.
Here is the script that I wrote. Pardon my lack of comments and poor style.
Edit: To clarify, this is all linux on a desktop.
Not even looking at your source code, there are two ways you could easily do this more efficiently and handle multiple files.
Don't bother running tail unless you have to. Simply os.stat all of the files and record the last modified time. If the last modified time is different, then raise a notification.
Use pyinotify to call out to Linux's inotify facility; this will have the kernel do option 1 for you and call back to you when any files in your directory change. Then translate the callback into your osd notification.
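The first option can be sketched in a few lines. The function names `snapshot` and `changed_files` are made up, and comparing the file size alongside the modification time is an added assumption, in case two writes land within the timestamp granularity of the filesystem.

```python
import os
import tempfile

def snapshot(paths):
    """Map each path to its (mtime_ns, size) pair."""
    result = {}
    for path in paths:
        st = os.stat(path)
        result[path] = (st.st_mtime_ns, st.st_size)
    return result

def changed_files(previous, current):
    """Return the paths whose stat signature differs between two snapshots."""
    return sorted(p for p in current if previous.get(p) != current[p])

if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    logs = [os.path.join(folder, name) for name in ("a.log", "b.log")]
    for log in logs:
        open(log, "w").close()
    before = snapshot(logs)
    with open(logs[1], "a") as f:
        f.write("new message\n")
    print(changed_files(before, snapshot(logs)))
```

A polling loop would take a snapshot, sleep for a while, take another, and raise one notification per path returned by `changed_files`.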
Now, there might be some trickiness depending on how many notifications you want when there are multiple messages and whether you care about missing a notification for a message.
An approach that preserves the use of tail would be to instead use tail -f. Open all of the files with tail -f and then use the select module to have the OS tell you when there's additional input on one of the file descriptors open for tail -f. Your main loop would call select and then iterate over each of the readable descriptors to generate notifications. (You could probably do this without using tail and just calling readline() when it's readable.)
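As a rough illustration of that main loop, the sketch below uses the standard `selectors` module (a thin layer over select) to wait on the stdout pipes of several child processes at once. The function name `follow` is made up, and short-lived Python children stand in for `tail -f` so the example terminates. With `["tail", "-f", path]` as the command, the loop would simply keep running. Waiting on pipes this way works on Linux, which matches the question, but not on Windows.

```python
import os
import selectors
import subprocess
import sys

def follow(commands):
    """Run commands concurrently and yield (tag, bytes) chunks as they arrive."""
    sel = selectors.DefaultSelector()
    procs = []
    for tag, cmd in commands.items():
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        sel.register(p.stdout, selectors.EVENT_READ, data=tag)
        procs.append(p)
    while sel.get_map():
        for key, _ in sel.select(timeout=1.0):
            # one read per readiness event, so the loop does not block
            chunk = os.read(key.fileobj.fileno(), 4096)
            if chunk:
                yield key.data, chunk
            else:
                # an empty read means the child closed its end (EOF)
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()
    for p in procs:
        p.wait()

if __name__ == "__main__":
    demo = {
        "x": [sys.executable, "-c", "print('hello')"],
        "y": [sys.executable, "-c", "print('world')"],
    }
    for tag, chunk in follow(demo):
        print(tag, chunk)
```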
Other areas of improvement in your script:
Use os.listdir and native Python filtering (say, using list comprehensions) instead of a popen with a bunch of grep filters.
Update the list of buffers to scan periodically instead of only doing it at program boot.
Use subprocess.Popen instead of os.popen.
If you're already using the pyinotify module, it's easy to do this in pure Python (i.e. no need to spawn a separate process to tail each file).
Here is an example that is event-driven by inotify, and should use very little cpu. When IN_MODIFY occurs for a given path we read all available data from the file handle and output any complete lines found, buffering the incomplete line until more data is available:
import os
import select
import sys
import pynotify
import pyinotify
class Watcher(pyinotify.ProcessEvent):
    def __init__(self, paths):
        self._manager = pyinotify.WatchManager()
        self._notify = pyinotify.Notifier(self._manager, self)
        self._paths = {}
        for path in paths:
            self._manager.add_watch(path, pyinotify.IN_MODIFY)
            fh = open(path, 'rb')
            fh.seek(0, os.SEEK_END)
            self._paths[os.path.realpath(path)] = [fh, '']
    def run(self):
        while True:
            self._notify.process_events()
            if self._notify.check_events():
                self._notify.read_events()
    def process_default(self, evt):
        path = evt.pathname
        fh, buf = self._paths[path]
        data = fh.read()
        lines = data.split('\n')
        # output previous incomplete line.
        if buf:
            lines[0] = buf + lines[0]
        # keep the trailing incomplete line (or '' if the data ended with a
        # newline) for the next event, so the buffer never goes stale.
        buf = lines.pop()
        # display a notification
        notice = pynotify.Notification('%s changed' % path, '\n'.join(lines))
        notice.show()
        # and output to stdout
        for line in lines:
            sys.stdout.write(path + ': ' + line + '\n')
        sys.stdout.flush()
        self._paths[path][1] = buf
pynotify.init('watcher')
paths = sys.argv[1:]
Watcher(paths).run()
Usage:
% python watcher.py [path1 path2 ... pathN]
Simple pure-Python solution (not the best, but it doesn't fork; it prints 4 empty lines after an idle period and marks the source of each chunk whenever it changes):
#!/usr/bin/env python
from __future__ import with_statement
'''
Implement multi-file tail
'''
import os
import sys
import time
def print_file_from(filename, pos):
    with open(filename, 'rb') as fh:
        fh.seek(pos)
        while True:
            chunk = fh.read(8192)
            if not chunk:
                break
            sys.stdout.write(chunk)
def _fstat(filename):
    st_results = os.stat(filename)
    return (st_results[6], st_results[8])
def _print_if_needed(filename, last_stats, no_fn, last_fn):
    changed = False
    #Find the size of the file and move to the end
    tup = _fstat(filename)
    # print tup
    if last_stats[filename] != tup:
        changed = True
        if not no_fn and last_fn != filename:
            print '\n<%s>' % filename
        print_file_from(filename, last_stats[filename][0])
        last_stats[filename] = tup
    return changed
def multi_tail(filenames, stdout=sys.stdout, interval=1, idle=10, no_fn=False):
    S = lambda (st_size, st_mtime): (max(0, st_size - 124), st_mtime)
    last_stats = dict((fn, S(_fstat(fn))) for fn in filenames)
    last_fn = None
    last_print = 0
    while 1:
        # print last_stats
        changed = False
        for filename in filenames:
            if _print_if_needed(filename, last_stats, no_fn, last_fn):
                changed = True
                last_fn = filename
        if changed:
            if idle > 0:
                last_print = time.time()
        else:
            if idle > 0 and last_print is not None:
                if time.time() - last_print >= idle:
                    last_print = None
                    print '\n' * 4
        time.sleep(interval)
if '__main__' == __name__:
    from optparse import OptionParser
    op = OptionParser()
    op.add_option('-F', '--no-fn', help="don't print filename when changes",
                  default=False, action='store_true')
    op.add_option('-i', '--idle', help='idle time, in seconds (0 turns off)',
                  type='int', default=10)
    op.add_option('--interval', help='check interval, in seconds', type='int',
                  default=1)
    opts, args = op.parse_args()
    try:
        multi_tail(args, interval=opts.interval, idle=opts.idle,
                   no_fn=opts.no_fn)
    except KeyboardInterrupt:
        pass

Python: select() doesn't signal all input from pipe

I am trying to load an external command line program with Python and communicate with it via pipes. The program takes text input via stdin and produces text output in lines to stdout. Communication should be asynchronous using select().
The problem is that not all of the program's output is signalled by select(). Usually the last one or two lines are not signalled. If select() returns with a timeout and I try to read from the pipe anyway, readline() returns immediately with the line sent by the program. See the code below.
The program doesn't buffer the output and sends all output in text lines. Connecting to the program via pipes in many other languages and environments has worked fine so far.
I have tried Python 3.1 and 3.2 on Mac OSX 10.6.
import subprocess
import select
engine = subprocess.Popen("Engine", bufsize=0, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
engine.stdin.write(b"go\n")
engine.stdin.flush()
while True:
    inputready,outputready,exceptready = select.select( [engine.stdout.fileno()] , [], [], 10.0)
    if (inputready, outputready, exceptready) == ([], [], []):
        print("trying to read from engine anyway...")
        line = engine.stdout.readline()
        print(line)
    for s in inputready:
        line = engine.stdout.readline()
        print(line)
Note that internally file.readlines([size]) loops and invokes the read() syscall more than once, attempting to fill an internal buffer of size. The first call to read() will immediately return, since select() indicated the fd was readable. However the 2nd call will block until data is available, which defeats the purpose of using select. In any case it is tricky to use file.readlines([size]) in an asynchronous app.
You should call os.read(fd, size) once on each fd for every pass through select. This performs a non-blocking read, and lets you buffer partial lines until data is available and detects EOF unambiguously.
I modified your code to illustrate using os.read. It also reads from the process' stderr:
import os
import select
import subprocess
target = 'Engine'
PIPE = subprocess.PIPE
engine = subprocess.Popen(target, bufsize=0, stdin=PIPE, stdout=PIPE, stderr=PIPE)
engine.stdin.write(b"go\n")
engine.stdin.flush()
class LineReader(object):
    def __init__(self, fd):
        self._fd = fd
        self._buf = b''  # os.read returns bytes on Python 3
    def fileno(self):
        return self._fd
    def readlines(self):
        # a single read per call, so we never block after select()
        data = os.read(self._fd, 4096)
        if not data:
            # EOF
            return None
        self._buf += data
        if b'\n' not in data:
            return []
        tmp = self._buf.split(b'\n')
        # the last element is an incomplete line (or b''); keep it buffered
        lines, self._buf = tmp[:-1], tmp[-1]
        return lines
proc_stdout = LineReader(engine.stdout.fileno())
proc_stderr = LineReader(engine.stderr.fileno())
readable = [proc_stdout, proc_stderr]
while readable:
    ready = select.select(readable, [], [], 10.0)[0]
    if not ready:
        continue
    for stream in ready:
        lines = stream.readlines()
        if lines is None:
            # got EOF on this stream
            readable.remove(stream)
            continue
        for line in lines:
            print(line)

In Django, how to call a subprocess with a slow start-up time

Suppose you're running Django on Linux, and you've got a view, and you want that view to return the data from a subprocess called cmd that operates on a file that the view creates, for example like so:
def call_subprocess(request):
response = HttpResponse()
with tempfile.NamedTemporaryFile("W") as f:
f.write(request.GET['data']) # i.e. some data
# cmd operates on fname and returns output
p = subprocess.Popen(["cmd", f.name],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
out, err = p.communicate()
response.write(p.out) # would be text/plain...
return response
Now, suppose cmd has a very slow start-up time, but a very fast operating time, and it does not natively have a daemon mode. I would like to improve the response-time of this view.
I would like to make the whole system run much faster by starting up a number of instances of cmd in a worker pool, having them wait for input, and having call_process ask one of those worker-pool processes to handle the data.
This is really 2 parts:
Part 1. A function that calls cmd and cmd waits for input. This could be done with pipes, i.e.
def _run_subcmd():
p = subprocess.Popen(["cmd", fname],
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
# write 'out' to a tmp file
o = open("out.txt", "W")
o.write(out)
o.close()
p.close()
exit()
def _run_cmd(data):
f = tempfile.NamedTemporaryFile("W")
pipe = os.mkfifo(f.name)
if os.fork() == 0:
_run_subcmd(fname)
else:
f.write(data)
r = open("out.txt", "r")
out = r.read()
# read 'out' from a tmp file
return out
def call_process(request):
response = HttpResponse()
out = _run_cmd(request.GET['data'])
response.write(out) # would be text/plain...
return response
Part 2. A set of workers running in the background that are waiting on the data. i.e. We want to extend the above so that the subprocess is already running, e.g. when the Django instance initializes, or this call_process is first called, a set of these workers is created
WORKER_COUNT = 6
WORKERS = []
class Worker(object):
def __init__(index):
self.tmp_file = tempfile.NamedTemporaryFile("W") # get a tmp file name
os.mkfifo(self.tmp_file.name)
self.p = subprocess.Popen(["cmd", self.tmp_file],
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
self.index = index
def run(out_filename, data):
WORKERS[self.index] = Null # qua-mutex??
self.tmp_file.write(data)
if (os.fork() == 0): # does the child have access to self.p??
out, err = self.p.communicate()
o = open(out_filename, "w")
o.write(out)
exit()
self.p.close()
self.o.close()
self.tmp_file.close()
WORKERS[self.index] = Worker(index) # replace this one
return out_file
@classmethod
def get_worker() # get the next worker
# ... static, incrementing index
There should be some initialization of workers somewhere, like this:
def init_workers(): # create WORKERS_COUNT workers
for i in xrange(0, WORKERS_COUNT):
tmp_file = tempfile.NamedTemporaryFile()
WORKERS.push(Worker(i))
Now, what I have above becomes something like so:
def _run_cmd(data):
Worker.get_worker() # this needs to be atomic & lock worker at Worker.index
fifo = open(tempfile.NamedTemporaryFile("r")) # this stores output of cmd
Worker.run(fifo.name, data)
# please ignore the fact that everything will be
# appended to out.txt ... these will be tmp files, too, but named elsewhere.
out = fifo.read()
# read 'out' from a tmp file
return out
def call_process(request):
response = HttpResponse()
out = _run_cmd(request.GET['data'])
response.write(out) # would be text/plain...
return response
Now, the questions:
Will this work? (I've just typed this off the top of my head into StackOverflow, so I'm sure there are problems, but conceptually, will it work)
What are the problems to look for?
Are there better alternatives to this? e.g. Could threads work just as well (it's Debian Lenny Linux)? Are there any libraries that handle parallel process worker-pools like this?
Are there interactions with Django that I ought to be conscious of?
Thanks for reading! I hope you find this as interesting a problem as I do.
Brian
It may seem like I am punting this product, as this is the second time I have responded with a recommendation for it.
But it seems like you need a message queuing service, in particular a distributed message queue.
Here is how it will work:
Your Django App requests CMD
CMD gets added to a queue
CMD gets pushed to several workers
It is executed and results returned upstream
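To illustrate just the flow of those four steps, here is an in-process stand-in that uses `concurrent.futures.ThreadPoolExecutor` from the standard library instead of a distributed queue. It is not Celery's API and it does not remove the start-up cost of `cmd`; it only shows a request being queued, picked up by a worker, and its result returned to the caller. `FAKE_CMD`, `run_cmd` and `handle_request` are made-up names, and a Python child that upper-cases its input stands in for `cmd`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the slow `cmd`: a Python child that upper-cases its stdin.
FAKE_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]

def run_cmd(data):
    """Worker body: feed `data` to the command and return its output."""
    result = subprocess.run(FAKE_CMD, input=data, stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout

# The executor owns both the job queue and the pool of workers.
POOL = ThreadPoolExecutor(max_workers=6)

def handle_request(data):
    """What a view would do: enqueue the job, then wait for its result."""
    future = POOL.submit(run_cmd, data)
    return future.result(timeout=30)

if __name__ == "__main__":
    print(handle_request("some data"))
```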
Most of this code exists, and you don't have to go about building your own system.
Have a look at Celery which was initially built with Django.
http://www.celeryq.org/
http://robertpogorzelski.com/blog/2009/09/10/rabbitmq-celery-and-django/
Issy already mentioned Celery, but since comments don't work well with code samples, I'll reply as an answer instead.
You should try to use Celery synchronously with the AMQP result store.
You could distribute the actual execution to another process or even another machine. Executing synchronously in celery is easy, e.g.:
>>> from celery.task import Task
>>> from celery.registry import tasks
>>> class MyTask(Task):
...
...     def run(self, x, y):
...         return x * y
>>> tasks.register(MyTask)
>>> async_result = MyTask.delay(2, 2)
>>> retval = async_result.get() # Now synchronous
>>> retval
4
The AMQP result store makes sending back the result very fast, but it's only available in the current development version (in code-freeze to become 0.8.0).
How about "daemonizing" the subprocess call using python-daemon or its successor, grizzled?
