Python Popen running out of memory with large output

I am using the subprocess.Popen() function to run an external tool that reads and writes a lot of data (more than a GB) to stdout. However, I'm finding that the kernel is killing the Python process when the system runs out of memory:
Out of memory: Kill process 8221 (python) score 971 or sacrifice child
Killed process 8221 (python) total-vm:8532708kB, anon-rss:3703912kB, file-rss:48kB
Since I know I'm handling a large amount of data, I've set up Popen to write stdout and stderr to files, so I'm not using pipes. My code looks something like this:
errorFile = open(errorFilePath, "w")
outFile = open(outFilePath, "w")

# Use Popen to run the command
try:
    procExecCommand = subprocess.Popen(commandToExecute, shell=False, stderr=errorFile, stdout=outFile)
    exitCode = procExecCommand.wait()
except Exception, e:
    # Write exception to error log
    errorFile.write(str(e))

errorFile.close()
outFile.close()
I've tried changing the shell parameter to True and setting bufsize=-1, also with no luck.
I've profiled the memory usage when running this via the Python script and via bash, and I see a much bigger spike in memory usage via Python than via bash.
I'm not sure what exactly Python is doing to consume so much more memory than just using bash, unless it has something to do with trying to write the output to the file? The bash script just pipes the output to a file.
I initially found that my swap space was quite low, so I increased it and that helped at first, but as the volume of data grows I start running out of memory again.
So is there anything I can do in Python to handle these data volumes better, or is it just a case of recommending more memory and plenty of swap space? That, or jettisoning Python altogether.
System details:
Ubuntu 12.04
Python 2.7.3
The tool I'm running is mpileup from samtools.

The problem might be that you are using the wait() method (as in procExecCommand.wait()), which waits for the subprocess to run to completion before returning. Try the approach used in this question, which uses e.g. stdout.read() on the process handle. This way you can regularly empty the pipes, write to files, and there should be no build-up of memory.
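For example, a minimal sketch of that approach, reusing commandToExecute and outFilePath from the question; stdout is read in fixed-size chunks and written straight to the file, so the pipe never holds the whole multi-GB output:

import subprocess

with open(outFilePath, "wb") as outFile:
    proc = subprocess.Popen(commandToExecute, shell=False, stdout=subprocess.PIPE)
    # Read at most 1 MB at a time and flush it to disk immediately.
    for chunk in iter(lambda: proc.stdout.read(1024 * 1024), b""):
        outFile.write(chunk)
    exitCode = proc.wait()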

What kind of output does your process generate? Maybe the clue is in that.
Warning: the script won't terminate; you have to kill it.
This sample setup works as expected for me.
import subprocess
fobj = open("/home/tst//output","w")
subprocess.Popen("/home/tst//whileone",stdout=fobj).wait()
And whileone
#!/bin/bash
let i=1
while [ 1 ]
do
echo "We are in iteration $i"
let i=$i+1
usleep 10000
done


How to properly redirect stdin to multiple subprocesses created sequentially?

Context
I am experimenting with a script that is similar to vegeta's ramp-requests.py. In this script, I am running multiple subprocesses sequentially using subprocess.run(), and expect the standard input of the script to be redirected to those subprocesses during their entire lifetime (5s each).
#!/usr/bin/env python3
import json
import os
import subprocess
import sys
import time

rates = [1.0, 2.0, 3.0, 4.0]

# Run vegeta attack
for rate in rates:
    filename = 'results_%i.bin' % (1000*rate)
    if not os.path.exists(filename):
        cmd = 'vegeta attack -format=json -lazy --duration 5s -rate %i/1000s -output %s' % (1000*rate, filename)
        print(cmd, file=sys.stderr)
        subprocess.run(cmd, shell=True, encoding='utf-8')
I invoke the script as follows, piping an endless stream of inputs to it, one per line. vegeta reads this input continuously until --duration has elapsed:
$ target-generator | ./ramp-requests.py
Problem
The first subprocess (rate=1.0) seems to receive stdin as I expect, and the command runs successfully, every time.
The second iteration (rate=2.0), however, fails silently, along with all subsequent iterations. If I inspect the corresponding report files (e.g. results_2000.bin) using the vegeta report command, I see fragments of errors such as parse error: syntax error near offset 0 of 'ource":["c...'.
My intuition tells me that the second subprocess started consuming the input where the first one left off, in the middle of a line, but injecting a sys.stdin.readline() after subprocess.run() doesn't solve it. If that is the case, how can I cleanly solve this issue and ensure each subprocess starts reading from a "good" position?
Read a number of lines from stdin in your parent process, and pass that to your child process as its stdin. Repeat as needed. In this way, you do not need to worry about a child process making a mess of your stdin.
Feel free to borrow ideas from https://stromberg.dnsalias.org/~strombrg/mtee.html
HTH
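A minimal sketch of that idea, reusing the rates list and vegeta command from the question, with a hypothetical lines_per_batch; each child gets a fixed batch on its own stdin instead of sharing the parent's stream:

import subprocess
import sys

lines_per_batch = 100  # hypothetical batch size
for rate in rates:
    filename = 'results_%i.bin' % (1000 * rate)
    # Read whole lines in the parent, so no child can stop mid-line.
    batch = ''.join(sys.stdin.readline() for _ in range(lines_per_batch))
    cmd = 'vegeta attack -format=json -lazy --duration 5s -rate %i/1000s -output %s' % (1000 * rate, filename)
    # The child sees only this batch on its stdin.
    subprocess.run(cmd, shell=True, input=batch, encoding='utf-8')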
As mentioned in Barmar's comments, Python 3 opens stdin in buffered text mode, so both sys.stdin.read(1) and sys.stdin.readline() cause a read-ahead and do not reposition the sys.stdin stream to the beginning of a new line.
There is, however, a way to disable buffering by opening sys.stdin in binary mode, as pointed out by Denilson Sá Maia in his answer to Setting smaller buffer size for sys.stdin?:
unbuffered_stdin = os.fdopen(sys.stdin.fileno(), 'rb', buffering=0)
By doing so, it is possible to read the truncated input until the end of the line from this unbuffered io object after each subprocess returns:
# Run vegeta attack
for rate in rates:
    # [...]
    cmd = 'vegeta attack [...]'
    subprocess.run(cmd, shell=True, encoding='utf-8')
    # Read potentially truncated input until the next '\n' byte
    # to reposition stdin to a location that is safe to consume.
    unbuffered_stdin.readline()
Printing the read line shows something similar to the output below:
b'a4b-b142-fabe0e96a6ca"],"Ce-Type":["perf.drill"],"Ce-Source":["load-test"]}}\n'
All subprocesses are now being executed successfully:
$ for r in results_*.bin; do vegeta report "$r"; done
[...]
Success [ratio] 100.00%
Status Codes [code:count] 200:5
Error Set:
[...]
Success [ratio] 100.00%
Status Codes [code:count] 200:7
Error Set:
[...]
Success [ratio] 100.00%
Status Codes [code:count] 200:8
Error Set:
[...]
See also io - Raw I/O (Python 3 docs)

In Python, what is the difference between open(file).read() and subprocess(['cat', file]) and is there a preference for one over the other?

Let's say I want to read RAM usage from /proc/meminfo. There are two basic ways to do this that I can think of.
Use a shell command
output = subprocess.check_output('cat /proc/meminfo', shell=True)
# or output = subprocess.check_output(['cat', '/proc/meminfo'])
lines = output.splitlines()
Use open()
with open('/proc/meminfo') as meminfo:
    output = meminfo.read()
lines = output.splitlines()
My question is what is the difference between the two methods? Is there a significant performance difference? My assumption is that using open() is the preferred method, since using a shell command is a bit hackish and may be system dependent, but I can't find any information on this so I thought I'd ask.
...so, let's look at what output = subprocess.check_output('cat /proc/meminfo', shell=True) does:
Creates a pipe with the pipe() syscall (an anonymous pipe, not a named FIFO), and spawns a shell running sh -c 'cat /proc/meminfo' writing to the write end of that pipe (while the Python interpreter itself watches for output on the read end, either using the select() call or blocking IO operations). This means loading /bin/sh, opening all the libraries it depends on, etc.
The shell parses those arguments as code. This can be dangerous if, instead of opening /proc/meminfo, you're instead opening /tmp/$(rm -rf ~)/pwned.txt.
The shell forks a subprocess (optionally; shells may have an implicit exec), which then uses the execve system call to invoke /bin/cat with an argv of ['cat', '/proc/meminfo'] -- meaning that /bin/cat is again loaded as an executable, with its dynamic libraries, with all the performance overhead that implies.
/bin/cat then opens /proc/meminfo, reads from it, and writes to its stdout.
The shell, if it did not use the implicit-exec optimization, waits for the /bin/cat executable to finish and exit using a wait()-family syscall.
The Python interpreter reads from the remote end of the FIFO until it provides an EOF (which will not happen until after cat has closed its output pipeline, potentially by exiting), and then uses a wait()-family call to retrieve information on how the shell it spawned exited, checking that exit status to determine whether an error occurred.
Now, let's look at what open('/proc/meminfo').read() does:
Opens the file using the open() syscall.
Reads the file using the read() syscall.
Drops the reference count on the file, allowing it to be closed (either immediately or on a future garbage collection pass) with the close() syscall.
One of these things is much, much, much more efficient and generally sensible than the other.
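If you want to see the difference for yourself, here is a rough, illustrative timing sketch (not from the answer above; absolute numbers will vary by machine):

import subprocess
import timeit

# Time 100 iterations of each approach on /proc/meminfo.
t_subprocess = timeit.timeit(
    lambda: subprocess.check_output(['cat', '/proc/meminfo']), number=100)
t_open = timeit.timeit(
    lambda: open('/proc/meminfo').read(), number=100)
print('subprocess: %.3fs  open().read(): %.3fs' % (t_subprocess, t_open))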

Avoid python setup time

The image below shows that Python spends a lot of time in user space. Is it possible to reduce this time at all?
I will be running a script several hundred times. Is it possible to start Python so that it initializes once and doesn't have to do so again on subsequent runs?
I just searched for the same and found this:
http://blogs.gnome.org/johan/2007/01/18/introducing-python-launcher/
Python-launcher does not solve the problem directly, but it points in an interesting direction: if you create a small daemon which you can contact via the shell to fork a new instance, you might be able to get rid of your startup time.
For example get the python-launcher and socat¹ and do the following:
PYTHONPATH="../lib.linux-x86_64-2.7/" python python-launcher-daemon &

echo pass > 1
for i in {1..100}; do
    echo 1 | socat STDIN UNIX-CONNECT:/tmp/python-launcher-daemon.socket &
done
Todo: Adapt it to your program, remove the GTK stuff. Note the & at the end: Closing the socket connection seems to be slow.
The essential trick is to just create a server which opens a socket. Then it reads all the data from the socket. Once it has the data, it forks like the following:
pid = os.fork()
if pid:
    return

signal.signal(signal.SIGPIPE, signal.SIG_DFL)
signal.signal(signal.SIGCHLD, signal.SIG_DFL)

glob = dict(__name__="__main__")
print 'launching', program
execfile(program, glob, glob)
raise SystemExit
Running 100 programs that way took just 0.7 seconds for me.
You might have to switch from forking to just executing the code directly if you want to be really fast.
(That’s what I also do with emacsclient… My emacs takes ~30s to start (due to excessive use of additional libraries I added), but emacsclient -c shows up almost instantly.)
¹: http://www.socat.org
Write the "do this several 100 times" logic in your Python script. Call it ONCE from that other language.
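A trivial sketch of that idea, where do_work() is a hypothetical stand-in for whatever the script actually does; the loop lives inside a single interpreter, so the startup cost is paid once:

import sys

def do_work(i):
    # hypothetical placeholder for the real per-run logic
    return i * i

if __name__ == "__main__":
    runs = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    for i in range(runs):
        do_work(i)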
Use timeit instead:
http://docs.python.org/library/timeit.html
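For example, a minimal timeit sketch (the statement being timed is just an illustration):

import timeit

# Run the statement 100 times inside one interpreter and report total seconds.
print(timeit.timeit("sum(range(1000))", number=100))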

Keeping a pipe to a process open

I have an app that reads in stuff from stdin and returns, after a newline, results to stdout
A simple (stupid) example:
$ app
Expand[(x+1)^2]<CR>
x^2 + 2*x + 1
100 - 4<CR>
96
Opening and closing the app requires a lot of initialization and clean-up (it's an interface to a Computer Algebra System), so I want to keep this to a minimum.
I want to open a pipe in Python to this process, write strings to its stdin and read out the results from stdout. Popen.communicate() doesn't work for this, as it closes the file handle, requiring to reopen the pipe.
I've tried something along the lines of this related question:
Communicate multiple times with a process without breaking the pipe?, but I'm not sure how to wait for the output. It is also difficult to know a priori how long it will take the app to process the input at hand, so I don't want to make any assumptions. I guess most of my confusion comes from this question: Non-blocking read on a subprocess.PIPE in python, where it is stated that mixing high and low level functions is not a good idea.
EDIT:
Sorry that I didn't give any code before, got interrupted. This is what I've tried so far and it seems to work, I'm just worried that something goes wrong unnoticed:
from subprocess import Popen, PIPE

pipe = Popen(["MathPipe"], stdin=PIPE, stdout=PIPE)
expressions = ["Expand[(x+1)^2]", "Integrate[Sin[x], {x,0,2*Pi}]"] # ...

for expr in expressions:
    pipe.stdin.write(expr)
    while True:
        line = pipe.stdout.readline()
        if line != '':
            print line
        # output of MathPipe is always terminated by ';'
        if ";" in line:
            break
Potential problems with this?
Using subprocess, you can't do this reliably. You might want to look at using the pexpect library. That won't work on Windows - if you're on Windows, try winpexpect.
Also, if you're trying to do mathematical stuff in Python, check out SAGE. They do a lot of work on interfacing with other open-source maths software, so there's a chance they've already done what you're trying to.
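A minimal pexpect sketch of that suggestion, assuming the MathPipe binary from the question and that every reply ends with ';' (as stated there):

import pexpect

child = pexpect.spawn("MathPipe")
for expr in ["Expand[(x+1)^2]", "Integrate[Sin[x], {x,0,2*Pi}]"]:
    child.sendline(expr)
    child.expect(";")      # block until a complete answer has arrived
    print(child.before)    # everything received up to (not including) the ';'
child.close()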
Perhaps you could pass stdin=subprocess.PIPE as an argument to subprocess.Popen. This will make the process' stdin available as a general file-like object:
import sys, subprocess

proc = subprocess.Popen(["mathematica <args>"], stdin=subprocess.PIPE,
                        stdout=sys.stdout, shell=True)

proc.stdin.write("Expand[ (x-1)^2 ]")  # Write whatever to the process
proc.stdin.flush()                     # Ensure nothing is left in the buffer
proc.terminate()                       # Kill the process
This directs the subprocess' output directly to your python process' stdout. If you need to read the output and do some editing first, that is possible as well. Check out http://docs.python.org/library/subprocess.html#popen-objects.
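For instance, a small variant of the sketch above (keeping the same hypothetical command and the Python 2 style of this answer) that captures stdout instead of inheriting it, so you can edit the reply before printing it:

import subprocess

proc = subprocess.Popen(["mathematica <args>"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, shell=True)
proc.stdin.write("Expand[ (x-1)^2 ]\n")
proc.stdin.flush()
line = proc.stdout.readline()  # read one line of the reply for post-processing
print(line)
proc.terminate()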

subprocess.Popen.stdout - reading stdout in real-time (again)

Again, the same question.
The reason is - I still can't make it work after reading the following:
Real-time intercepting of stdout from another process in Python
Intercepting stdout of a subprocess while it is running
How do I get 'real-time' information back from a subprocess.Popen in python (2.5)
catching stdout in realtime from subprocess
My case is that I have a console app written in C; let's take for example this code in a loop:
tmp = 0.0;
printf("\ninput>>");
scanf_s("%f",&tmp);
printf ("\ninput was: %f",tmp);
It continuously reads some input and writes some output.
My python code to interact with it is the following:
p = subprocess.Popen([path], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
p.stdin.write('12345\n')
for line in p.stdout:
    print(">>> " + str(line.rstrip()))
    p.stdout.flush()
So far, whenever I read from p.stdout it always waits until the process is terminated and then outputs an empty string. I've tried lots of stuff - but still the same result.
I tried Python 2.6 and 3.1, but the version doesn't matter - I just need to make it work somewhere.
Trying to write to and read from pipes to a sub-process is tricky because of the default buffering going on in both directions. It's extremely easy to get a deadlock where one or the other process (parent or child) is reading from an empty buffer, writing into a full buffer or doing a blocking read on a buffer that's awaiting data before the system libraries flush it.
For more modest amounts of data the Popen.communicate() method might be sufficient. However, for data that exceeds its buffering you'd probably get stalled processes (similar to what you're already seeing?)
You might want to look for details on using the fcntl module and making one or the other (or both) of your file descriptors non-blocking. In that case, of course, you'll have to wrap all reads and/or writes to those file descriptors in the appropriate exception handling to handle the "EWOULDBLOCK" events. (I don't remember the exact Python exception that's raised for these).
A completely different approach would be for your parent to use the select module and os.fork() ... and for the child process to execve() the target program after directly handling any file dup()ing. (Basically you'd be re-implementing parts of Popen(), but with different parent file descriptor (PIPE) handling.)
Incidentally, .communicate, at least in Python's 2.5 and 2.6 standard libraries, will only handle about 64K of remote data (on Linux and FreeBSD). This number may vary based on various factors (possibly including the build options used to compile your Python interpreter, or the version of libc being linked to it). It is NOT simply limited by available memory (despite J.F. Sebastian's assertion to the contrary) but is limited to a much smaller value.
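A minimal sketch of the fcntl approach described above (not the answerer's exact code); "./a.out" stands in for the compiled C program from the question:

import errno
import fcntl
import os
import subprocess

p = subprocess.Popen(["./a.out"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Put the read end of the stdout pipe into non-blocking mode.
fd = p.stdout.fileno()
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

p.stdin.write(b"12345\n")
p.stdin.flush()

while p.poll() is None:
    try:
        chunk = os.read(fd, 4096)
        if chunk:
            print(chunk)
    except OSError as e:
        # EAGAIN/EWOULDBLOCK means "no data yet"; anything else is a real error.
        if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
            raise
        # A real program would select() or sleep here instead of spinning.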
Push reading from the pipe into a separate thread that signals when a chunk of output is available:
How can I read all availably data from subprocess.Popen.stdout (non blocking)?
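A minimal sketch of that thread-based idea (not taken verbatim from the linked answer); again "./a.out" stands in for the example C program:

import subprocess
import threading
try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2

def reader(pipe, q):
    # Runs in a background thread: forward each line of output to the queue.
    for line in iter(pipe.readline, b""):
        q.put(line)
    pipe.close()

p = subprocess.Popen(["./a.out"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
q = Queue()
t = threading.Thread(target=reader, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write(b"12345\n")
p.stdin.flush()
print(q.get())   # blocks only until the first line of output arrives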
The bufsize=256 argument prevents 12345\n from being sent to the child process in a chunk smaller than 256 bytes, as it will be when omitting bufsize or inserting p.stdin.flush() after p.stdin.write(). Default behaviour is line-buffering.
In either case you should at least see one empty line before blocking, emitted by the leading "\n" in the first printf() of your example.
Your particular example doesn't require "real-time" interaction. The following works:
from subprocess import Popen, PIPE
p = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)
output = p.communicate(b"12345")[0] # send input/read all output
print output,
where a.out is your example C program.
In general, for a dialog-based interaction with a subprocess you could use pexpect module (or its analogs on Windows):
import pexpect
child = pexpect.spawn("./a.out")
child.expect("input>>")
child.sendline("12345.67890") # send a number
child.expect(r"\d+\.\d+") # expect the number at the end
print float(child.after) # assert that we can parse it
child.close()
I had the same problem, and proc.communicate() does not solve it because it waits for the process to terminate.
So here is what is working for me, on Windows with Python 3.5.1:
import subprocess as sp
import time

i = 0
# cmd is the command line you want to run
myProcess = sp.Popen(cmd, creationflags=sp.CREATE_NEW_PROCESS_GROUP, stdout=sp.PIPE, stderr=sp.STDOUT)
while i < 40:
    i += 1
    time.sleep(.5)
    out = myProcess.stdout.readline().decode("utf-8").rstrip()
I guess creationflags and the other arguments are not mandatory (but I don't have time to test), so this would be the minimal syntax:
myProcess = sp.Popen(cmd, stdout=sp.PIPE)
for i in range(40):
    time.sleep(.5)
    out = myProcess.stdout.readline()
