closed fd with subprocesses, IPC, SMP - python

Given the function
def get_files_from_sha(sha, files):
    from subprocess import Popen, PIPE
    import tarfile
    if 0 == len(files):
        return {}
    p = Popen(["git", "archive", sha], bufsize=10240, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    tar = tarfile.open(fileobj=p.stdout, mode='r|')
    p.communicate()
    contents = {}
    doall = files == '*'
    if not doall:
        files = set(files)
    for entry in tar:
        if (isinstance(files, set) and entry.name in files) or doall:
            tf = tar.extractfile(entry)
            contents[entry.name] = tf.read()
            if not doall:
                files.discard(entry.name)
    if not doall:
        for fname in files:
            contents[fname] = None
    tar.close()
    return contents
which is called in a loop over some values of sha. After a while (in my case, four iterations) it starts to fail at the call to tf.read(), with the message:
Traceback (most recent call last):
  File "../yap-analysis/extract.py", line 243, in <module>
    commits, identities, identities_by_name, identities_by_email, identities_freq = build_commits(commits)
  File "../yap-analysis/extract.py", line 186, in build_commits
    commit = get_commit(commit)
  File "../yap-analysis/extract.py", line 84, in get_commit
    contents = get_files_from_sha(commit['sha'], files)
  File "../yap-analysis/extract.py", line 42, in get_files_from_sha
    contents[entry.name] = tf.read()
  File "/usr/lib/python2.7/tarfile.py", line 817, in read
    buf += self.fileobj.read()
  File "/usr/lib/python2.7/tarfile.py", line 737, in read
    return self.readnormal(size)
  File "/usr/lib/python2.7/tarfile.py", line 746, in readnormal
    return self.fileobj.read(size)
  File "/usr/lib/python2.7/tarfile.py", line 573, in read
    buf = self._read(size)
  File "/usr/lib/python2.7/tarfile.py", line 581, in _read
    return self.__read(size)
  File "/usr/lib/python2.7/tarfile.py", line 606, in __read
    buf = self.fileobj.read(self.bufsize)
ValueError: I/O operation on closed file
I suspect there is some parallelization that subprocess attempts behind the scenes (?).
What is the actual cause, and how can it be solved in a clean and robust way on Python 2?

Do not use .communicate() on the Popen instance; it'll read the stdout stream until it is finished. From the documentation:
Interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached.
The code for .communicate() even adds an explicit .close() call on the stdout of the pipe.
Simply removing the call to .communicate() should be enough, but do also add a .wait() after reading the tarfile contents:
tar.close()
p.stdout.close()
p.wait()
It could be that tar.close() also closes p.stdout, but an extra .close() there should not hurt.
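Putting the two fixes together, a corrected version of the function could look like this (a sketch; the only changes from the original are dropping .communicate() and adding the cleanup calls):

def get_files_from_sha(sha, files):
    from subprocess import Popen, PIPE
    import tarfile
    if len(files) == 0:
        return {}
    p = Popen(["git", "archive", sha], bufsize=10240, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    # No p.communicate() here: it would drain p.stdout and close it
    # before tarfile gets a chance to read from it.
    tar = tarfile.open(fileobj=p.stdout, mode='r|')
    contents = {}
    doall = files == '*'
    if not doall:
        files = set(files)
    for entry in tar:
        if doall or entry.name in files:
            contents[entry.name] = tar.extractfile(entry).read()
            if not doall:
                files.discard(entry.name)
    if not doall:
        for fname in files:
            contents[fname] = None
    tar.close()
    p.stdout.close()  # tar.close() may already do this; an extra close does not hurt
    p.wait()          # reap the child process
    return contents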

I think your problem is the p.communicate(). That method sends data to stdin, reads from stdout and stderr (which you are not capturing), and waits for the process to terminate.
tarfile is trying to read from the process's stdout, and by the time it does, the process has finished, hence the error.
I have not tried running your code (I don't have access to git), but you probably don't want the p.communicate() at all; try commenting it out.

mkstemp opening too many files

I'm using subprocess.run in a loop (more than 10,000 times) to call some Java command, like this:
import subprocess
import tempfile

for i in range(10000):
    ret = subprocess.run(["ls"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    (_, name) = tempfile.mkstemp()
    with open(name, 'w+') as fp:
        fp.write(ret.stdout.decode())
However, after some time, I got the following exception:
Traceback (most recent call last):
  File "mwe.py", line 5, in <module>
    ret = subprocess.run(["ls"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.5/subprocess.py", line 693, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.5/subprocess.py", line 1454, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
Am I missing something to close some file descriptor?
Thanks
mkstemp returns an already open file descriptor fd followed by the filename. You are ignoring the file descriptor (your choice of the name _ suggests you have explicitly chosen to ignore it) and as a result you are neglecting to close it. Instead, you open the file a second time using the filename, creating a file object that contains a second file descriptor for the same file. Regardless of whether you close that second one, the first one remains open.
Here's a fix to the mkstemp approach:
temporaryFiles = []
for i in range(1000):
    ...
    fd, name = tempfile.mkstemp()
    os.write(fd, ...)
    os.close(fd)
    temporaryFiles.append(name)  # remember the filename for future processing/deletion
Building on Wyrmwood's suggestion in the comments, an even better approach would be:
temporaryFiles = []
for i in range(1000):
    ...
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        # tmp is a context manager that will automatically close the file when you exit this clause
        tmp.file.write( ... )
        temporaryFiles.append(tmp.name)  # remember the filename for future processing/deletion
Note that both mkstemp and the NamedTemporaryFile constructor have arguments that allow you to be more specific about the file's location (dir) and naming (prefix, suffix). If you want to keep the files, you should specify dir so that you keep them out of the default temporary directory, since the default location may get cleaned up by the OS.
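Putting this back into the original loop, a minimal sketch of the fixed code (the .log suffix is made up here; adjust dir, prefix and suffix to taste) could be:

import subprocess
import tempfile

temporaryFiles = []
for i in range(10000):
    ret = subprocess.run(["ls"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Exactly one descriptor per iteration, closed automatically when the
    # with-block exits, so the loop no longer leaks file descriptors.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.log', delete=False) as tmp:
        tmp.write(ret.stdout.decode())
        temporaryFiles.append(tmp.name)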

How does one create custom output stream for subprocess.call

I am trying to get real-time output of a subprocess.call by defining my own output stream, but it doesn't seem to work.
Reason: I want to run a subprocess and send the output of that call both to stdout (in real time, so I can look at the script and see the current progress) and to a log file.
print_process.py:
import time

while True:
    print("Things")
    time.sleep(1)
mainprocess.py:
import subprocess
import io

class CustomIO(io.IOBase):
    def write(self, str):
        print("CustomIO: %s" % str)
        # logging to be implemented here

customio = CustomIO()
subprocess.call(["python3", "print_process.py"], stdout=customio)
But when I run this code I get this error message:
Traceback (most recent call last):
  File "call_test.py", line 9, in <module>
    subprocess.call(["python3", "print_process.py"], stdout=customio)
  File "/usr/lib/python3.4/subprocess.py", line 537, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.4/subprocess.py", line 823, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib/python3.4/subprocess.py", line 1302, in _get_handles
    c2pwrite = stdout.fileno()
io.UnsupportedOperation: fileno
So, does anyone have any clue if this is possible?
Am I inheriting the wrong base class?
Am I not overloading the proper methods?
Or am I completely off the rails and should be going about this in a completely different way?
If you want to process the output of a subprocess, you need to pass stdout=subprocess.PIPE. However, call() and run() will both wait until the process is finished before making it available, so you cannot handle it in real time using these functions.
You need to use subprocess.Popen:
import subprocess as sp

def handle_output(output_line):
    ...

my_process = sp.Popen(["python3", "print_process.py"],
                      stdout=sp.PIPE,
                      universal_newlines=True)  # changes stdout from bytes to text

for line in my_process.stdout:
    handle_output(line)

my_process.wait()
Update: Make sure to flush the output buffer in your child process:
while True:
    print("Things", flush=True)
    time.sleep(1)
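Since the stated goal is to both watch progress and log it, the reading loop can simply do both. A minimal sketch (the log file name is made up here):

import subprocess as sp

with open("output.log", "w") as logfile:
    my_process = sp.Popen(["python3", "print_process.py"],
                          stdout=sp.PIPE,
                          universal_newlines=True)
    for line in my_process.stdout:
        print(line, end="")   # show progress on the parent's stdout in real time
        logfile.write(line)   # and append the same line to the log file
    my_process.wait()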
You need an open stream with a real file descriptor. fileno() isn't implemented for io.IOBase, because a plain in-memory object has no OS-level file descriptor:
Frequently Used Arguments
stdin, stdout and stderr specify the executed program’s standard
input, standard output and standard error file handles, respectively.
Valid values are PIPE, DEVNULL, an existing file descriptor (a
positive integer), an existing file object, and None. PIPE indicates
that a new pipe to the child should be created. DEVNULL indicates that
the special file os.devnull will be used. With the default settings of
None, no redirection will occur;
So you can use sockets, pipes, and open files as stdout; the file descriptor is what gets passed to the child process as its stdout. I haven't used sockets with subprocess.Popen myself, but I expect them to work; what matters here is the file descriptor handed to the child, not what type of object that descriptor points to.
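To illustrate the point: any object with a real fileno(), such as an open file, can be passed as stdout directly, and the child then writes to it without any Python-level write() calls. A minimal sketch (print_process.py loops forever, so interrupt it with Ctrl-C):

import subprocess

with open("output.log", "wb") as logfile:
    # logfile.fileno() is a real OS-level descriptor, so the child process
    # inherits it as its stdout and writes to the file directly.
    subprocess.call(["python3", "print_process.py"], stdout=logfile)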

OSError: [Errno 36] File name too long while using Popen - Python

As I started asking on a previous question, I'm extracting a tarball using the tarfile module of python. I don't want the extracted files to be written on the disk, but rather get piped directly to another program, specifically bgzip.
#!/usr/bin/env python

import tarfile, subprocess, re

mov = []

def clean(s):
    s = re.sub('[^0-9a-zA-Z_]', '', s)
    s = re.sub('^[^a-zA-Z_]+', '', s)
    return s

with tarfile.open("SomeTarballHere.tar.gz", "r:gz") as tar:
    for file in tar.getmembers():
        if file.isreg():
            mov = file.name
            proc = subprocess.Popen(tar.extractfile(file).read(), stdout = subprocess.PIPE)
            proc2 = subprocess.Popen('bgzip -c > ' + clean(mov), stdin = proc, stdout = subprocess.PIPE)
            mov = None
But now I get stuck on this:
Traceback (most recent call last):
  File "preformat.py", line 12, in <module>
    proc = subprocess.Popen(tar.extractfile(file).read(), stdout = subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 36] File name too long
Is there any workaround for this? I have been using LightTableLinux.tar.gz (it contains the files for a text editor) as the tarball to test the script on.
The exception is raised in the forked-off child process when trying to execute the target program from this invocation:
proc = subprocess.Popen(tar.extractfile(file).read(), stdout = subprocess.PIPE)
This:
- reads the contents of an entry in the tar file,
- tries to execute a program whose name is the contents of that entry.
Also your second invocation won't work, as you are trying to use shell redirection without using shell=True in Popen():
proc2 = subprocess.Popen('bgzip -c > ' + clean(mov), stdin = proc, stdout = subprocess.PIPE)
The redirect may also not be necessary, as you should be able to redirect the output of bgzip to a file directly from Python.
Edit: Unfortunately, despite extractfile() returning a file-like object, Popen() expects a real file (with a fileno). Hence, a little wrapping is required:
import shutil

fileobj = tar.extractfile(file)   # file-like, but without a usable fileno()
with open(clean(mov), 'wb') as outfile:
    proc = subprocess.Popen(
        ('bgzip', '-c'),
        stdin=subprocess.PIPE,
        stdout=outfile,
    )
    shutil.copyfileobj(fileobj, proc.stdin)  # stream the tar entry into bgzip
    proc.stdin.close()  # send EOF so bgzip can finish
    proc.wait()

python subprocess Popen hangs

OpenSolaris derivate (NexentaStor), python 2.5.5
I've seen numerous examples and many seem to indicate that the problem is a deadlock. I'm not writing to stdin so I think the problem is that one of the shell commands exits prematurely.
What's executed in Popen is:
ssh <remotehost> "zfs send tank/dataset#snapshot | gzip -9" | gzip -d | zfs recv tank/dataset
In other words, login to a remote host and (send a replication stream of a storage volume, pipe it to gzip) pipe it to zfs recv to write to a local datastore.
I've seen the explanation about buffers, but I'm definitely not filling those up, and gzip is bailing out prematurely, so I think process.wait() never sees an exit.
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
process.wait()
if process.returncode == 0:
    for line in process.stdout:
        stdout_arr.append([line])
    return stdout_arr
else:
    return False
Here's what happens when I run and interrupt it
# ./zfs_replication.py
gzip: stdout: Broken pipe
^CKilled by signal 2.
Traceback (most recent call last):
  File "./zfs_replication.py", line 155, in <module>
    Exec(zfsSendRecv(dataset, today), LOCAL)
  File "./zfs_replication.py", line 83, in Exec
    process.wait()
  File "/usr/lib/python2.5/subprocess.py", line 1184, in wait
    pid, sts = self._waitpid_no_intr(self.pid, 0)
  File "/usr/lib/python2.5/subprocess.py", line 1014, in _waitpid_no_intr
    return os.waitpid(pid, options)
KeyboardInterrupt
I also tried the Popen.communicate() method, but that too hangs if gzip bails out. In this case the last part of my command (zfs recv) exits because the local dataset has been modified, so the incremental replication stream will not be applied. Even though that will be fixed, there has got to be a way of dealing with gzip's broken pipes?
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
stdout, stderr = process.communicate()
if process.returncode == 0:
    dosomething()
else:
    dosomethingelse()
And when run:
cannot receive incremental stream: destination tank/repl_test has been modified
since most recent snapshot
gzip: stdout: Broken pipe
^CKilled by signal 2.
Traceback (most recent call last):
  File "./zfs_replication.py", line 154, in <module>
    Exec(zfsSendRecv(dataset, today), LOCAL)
  File "./zfs_replication.py", line 83, in Exec
    stdout, stderr = process.communicate()
  File "/usr/lib/python2.5/subprocess.py", line 662, in communicate
    stdout = self._fo_read_no_intr(self.stdout)
  File "/usr/lib/python2.5/subprocess.py", line 1025, in _fo_read_no_intr
    return obj.read()
KeyboardInterrupt
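For reference, the subprocess documentation warns that calling wait() with stdout=PIPE can deadlock once the child fills the OS pipe buffer; the usual pattern is to drain stdout before waiting. A minimal sketch of that pattern (it does not fix the underlying zfs/gzip error itself):

import subprocess

def run_cmd(cmd):
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    stdout_arr = []
    for line in process.stdout:   # drain the pipe while the child is still running
        stdout_arr.append([line])
    process.wait()                # safe now: the pipe buffer cannot fill up
    if process.returncode == 0:
        return stdout_arr
    return False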

Collecting stderr in memory with subprocess.call

I'm trying to collect stderr in memory, instead of writing it directly to a file or stdout. I do this so I can generate the error log file in a certain way. I found the StringIO module, which provides an in-memory 'file', but it doesn't seem to do the trick. Here's my code:
buffer = StringIO.StringIO()
status = subprocess.call(args, stdout=log_fps["trace"], stderr=buffer)
if status and self.V_LEVEL:
    sys.stderr.write(buffer.getvalue())
    print "generated error"
if status:
    log_fps["fail"].write("==> Error with files %s and %s\n" % (domain_file, problem_file))
    log_fps["fail"].write(buffer.getvalue())
I get the following error:
Traceback (most recent call last):
  File "./runit.py", line 284, in <module>
    launcher.run_all_cff_domain_examples("ring")
  File "./runit.py", line 259, in run_all_cff_domain_examples
    result = self.run_clg(in_d["domain"], in_d["problem"], in_d["prefix"])
  File "./runit.py", line 123, in run_clg
    status = subprocess.call(args, stdout=log_fps["trace"], stderr=buffer)
  File "/usr/lib/python2.7/subprocess.py", line 493, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 672, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib/python2.7/subprocess.py", line 1075, in _get_handles
    errwrite = stderr.fileno()
AttributeError: StringIO instance has no attribute 'fileno'
I guess this means that I can't use StringIO to collect stderr in memory. What else can I do, short of writing to a file in /tmp?
stdout = subprocess.check_output(args)

See the check_output documentation for more options.
If you don't want to capture stdout, use Popen.communicate():

from subprocess import Popen, PIPE

p = Popen(args, stdout=log_fps["trace"], stderr=PIPE)
_, stderr = p.communicate()

import subprocess

p = subprocess.Popen(args, stdout=log_fps["trace"], stderr=subprocess.PIPE)
_, stderr = p.communicate()
print stderr,
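Applied to the original code, a sketch of the whole pattern (reusing log_fps, self.V_LEVEL, domain_file and problem_file from the question, so this assumes the question's surrounding method context):

import sys
from subprocess import Popen, PIPE

p = Popen(args, stdout=log_fps["trace"], stderr=PIPE)
_, stderr = p.communicate()   # stderr is now collected in memory as a string
status = p.returncode
if status and self.V_LEVEL:
    sys.stderr.write(stderr)
    print "generated error"
if status:
    log_fps["fail"].write("==> Error with files %s and %s\n" % (domain_file, problem_file))
    log_fps["fail"].write(stderr)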
