Python pipeline using GNU Parallel

Python pipeline using GNU Parallel - python

I'm trying to write a wrapper around GNU Parallel in Python to run a command in parallel, but seem to be misunderstanding either how GNU Parallel works, system pipes and/or python subprocess pipes.
Essentially I am looking to use GNU Parallel to handle splitting up an input file and then running another command in parallel on multiple hosts.
I can investigate some pure python way to do this in the future, but it seems like it should be easily implemented using GNU Parallel.
t.py
#!/usr/bin/env python
import sys
print
print sys.stdin.read()
print
p.py
from subprocess import *
import os
from os.path import *
args = ['--block', '10', '--recstart', '">"', '--sshlogin', '3/:', '--pipe', './t.py']
infile = 'test.fa'
fh = open('test.fa','w')
fh.write('''>M02261:11:000000000-ADWJ7:1:1101:16207:1115 1:N:0:1
CAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTT
>M02261:11:000000000-ADWJ7:1:1101:21410:1136 1:N:0:1
ATAGTAGATAGGGACATAGGGAATCTCGTTAATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACC
>M02261:11:000000000-ADWJ7:1:1101:13828:1155 1:N:0:1
GGTTTAGAGTCTCTAGTCGATAGATCAATGTAGGTAAGGGAAGTCGGCAAATTAGATCCGTAACTTCGGGATAAGGATTGGCTCTGAAGGCTGGGATGACTCGGGCTCTGGTGCCTTCGCGGGTGCTTTGCCTCAACGCGCGCCGGCCGGCTCGGGTGGTTTGCGCCGCCTGTGGTCGCGTCGGCCGCTGCAGTCATCAATAAACAGCCAATTCAGAACTGGCACGGCTGAGGGAATCCGACGGTCTAATTAAAACAAAGCATTGTGATGGACTCCGCAGGTGTTGACACAATGTGATTTT
>M02261:11:000000000-ADWJ7:1:1101:14120:1159 1:N:0:1
GAGTAGCTGCGAGCGAAAAGGGAAGAGCTCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCGAAAAGGGAAGCGCCCAAGGGGGGCAACAGGAACTAACAAGAATTCGCCGACTAGCTGCGACCTGAAAAGGAAAAACCCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCAGAAAAGGAAAAGCACAAGAGGAGGAAACGACACTAATAAGACTTCCCATACAAGCGGCGAGCAAAACAGCACGAGCCCAACGGCGAGAAAAGCAAAA
>M02261:11:000000000-ADWJ7:1:1101:8638:1172 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
''')
fh.close()
# Call 1
Popen(['parallel']+args, stdin=open(infile,'rb',0), stdout=open('output','w')).wait()
# Call 2
_cat = Popen(['cat', infile], stdout=PIPE)
Popen(['parallel']+args, stdin=_cat.stdout, stdout=open('output2','w')).wait()
# Call 3
Popen('cat '+infile+' | parallel ' + ' '.join(args), shell=True, stdout=open('output3','w')).wait()
Call 1 and Call 2 produce the same output while Call 3 produces the output I would expect where the input file was split up and contains empty lines between records.
I'm more curious about what the differences are between Call 1,2 and Call 3.

TL;DR Don't quote ">" when shell=False.
If you use shell=True, you can use all the shell's facilities, like globbing, I/O redirection, etc. You will need to quote anything which needs to be escaped from the shell. You can pass the entire command line as a single string, and the shell will parse it.
unsafe = subprocess.Popen('echo `date` "my files" * >output', shell=True)
With shell=False, you have no "secret" side effects behind the scenes, and none of the shell's facilities are available to you. So you need to take care of globbing, redirection, etc on the Python side. On the plus account, you save a (potentially significant) extra process, you have more control, and you don't need (and indeed mustn't) quote things which had to be quoted when the shell was involved. In summary, this is also safer, because you can see exactly what you are doing.
cmd = ['echo']
cmd.append(datestamp())
cmd.append['my files'] # notice absence of shell quotes around string
cmd.extend(glob('*'))
safer = subprocess.Popen(cmd, shell=False, stdout=open('output', 'w+'))
(This still differs slightly, because with modern shells, echo is a builtin, whereas now, we will be executing an external utility /bin/echo or whichever executable with that name comes first in your PATH.)
Now, returning to your examples, the problem in your args is that you are quoting a literal ">" as the record separator. When a shell is involved, an unquoted right broket would invoke redirection, so to specify it as a string, it has to be escaped or quoted; but when no shell is in the picture, there isn't anything which handles (or requires) those quotes, so to pass a literal > argument, simply pass that literally.
With that out of the way, your call #1 definitely seems like the way to go. (Though I'm not entirely convinced that it's sane to write a Python wrapper for a shell command implemented in Perl. I suspect that juggling a bunch of parallel child processes in Python directly would not be more complicated.)

Related

How to run a subprocess and store the results in a file?

I am trying to run a hive/spark submit from python using subprocess module. I am trying to write the data output to a file (log file). cCn you please help me in this?
import subprocess
file = ["hive" "-f" "test.sql"]
process = subprocess.Popen(file,shell=False,stderr=subprocess.PIPE,
stdout=subprocess.STDOUT,universal_newlines=True)
process.wait()
out,err=process.communicate()
The out file I need to write it to new file let say test.log/test.txt file.

You have an error in your command; the list needs to have commas between the strings (otherwise you are pasting together the individual strings to a single long string "hive-ftest.sql"!)
As pointed out in the subprocess documentation, you should generally avoid bare Popen when you can. If all you need is for a command to run to completion, subprocess.run or its legacy siblings check_call et al. should be preferred for simplicity and robustness.
import subprocess
# Renamed the variable; this is not a "file" by any stretch
cmd = ["hive", "-f", "test.sql"]
with open(filename, "wb") as outputfile:
process = subprocess.run(cmd, stdout=outputfile, check=True)
Specifying a binary output mode avoids having Python try to infer anything about the encoding of the bytes emitted; if you need to process text, you might want to add an encoding= keyword argument to the subprocess call.
Not specifying any destination for stderr means error messages will be displayed to the user, which is probably a useful simplification if the tool will be invoked interactively. If not, you will probably need to capture any diagnostic messages and display them in a log file or something.
check=True specifies that Python should check that the command succeeds, and raise an exception if not. This is usually good hygiene, but might need to be tweaked if the command you run could emit an error status in situations where your use case could nevertheless be completed, or if you need to avoid tracebacks in unattended use.
shell=False is the default, and so I omitted that.
I can see no reason to store the command in a variable, but perhaps you have one. Inlining the command will avoid having to come up with a useful name for the variable (^:

Subprocess call failed to parse argument (kill function) Python [duplicate]

import os
import subprocess
proc = subprocess.Popen(['ls','*.bc'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out,err = proc.communicate()
print out
This script should print all the files with .bc suffix however it returns an empty list. If I do ls *.bc manually in the command line it works. Doing ['ls','test.bc'] inside the script works as well but for some reason the star symbol doesnt work.. Any ideas ?

You need to supply shell=True to execute the command through a shell interpreter.
If you do that however, you can no longer supply a list as the first argument, because the arguments will get quoted then. Instead, specify the raw commandline as you want it to be passed to the shell:
proc = subprocess.Popen('ls *.bc', shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)

Expanding the * glob is part of the shell, but by default subprocess does not send your commands via a shell, so the command (first argument, ls) is executed, then a literal * is used as an argument.
This is a good thing, see the warning block in the "Frequently Used Arguments" section, of the subprocess docs. It mainly discusses security implications, but can also helps avoid silly programming errors (as there are no magic shell characters to worry about)
My main complaint with shell=True is it usually implies there is a better way to go about the problem - with your example, you should use the glob module:
import glob
files = glob.glob("*.bc")
print files # ['file1.bc', 'file2.bc']
This will be quicker (no process startup overhead), more reliable and cross platform (not dependent on the platform having an ls command)

Besides doing shell=True, also make sure that your path is not quoted. Otherwise it will not be expanded by shell.
If your path may have special characters, you will have to escape them manually.

How does subprocess.call() work with shell=False?

I am using Python's subprocess module to call some Linux command line functions. The documentation explains the shell=True argument as
If shell is True, the specified command will be executed through the shell
There are two examples, which seem the same to me from a descriptive viewpoint (i.e. both of them call some command-line command), but one of them uses shell=True and the other does not
>>> subprocess.call(["ls", "-l"])
0
>>> subprocess.call("exit 1", shell=True)
1
My question is:
What does running the command with shell=False do, in contrast to shell=True?
I was under the impression that subprocess.call and check_call and check_output all must execute the argument through the shell. In other words, how can it possibly not execute the argument through the shell?
It would also be helpful to get some examples of:
Things that can be done with shell=True that can't be done with
shell=False and why they can't be done.
Vice versa (although it seems that there are no such examples)
Things for which it does not matter whether shell=True or False and why it doesn't matter

UNIX programs start each other with the following three calls, or derivatives/equivalents thereto:
fork() - Create a new copy of yourself.
exec() - Replace yourself with a different program (do this if you're the copy!).
wait() - Wait for another process to finish (optional, if not running in background).
Thus, with shell=False, you do just that (as Python-syntax pseudocode below -- exclude the wait() if not a blocking invocation such as subprocess.call()):
pid = fork()
if pid == 0: # we're the child process, not the parent
execlp("ls", "ls", "-l", NUL);
else:
retval = wait(pid) # we're the parent; wait for the child to exit & get its exit status
whereas with shell=True, you do this:
pid = fork()
if pid == 0:
execlp("sh", "sh", "-c", "ls -l", NUL);
else:
retval = wait(pid)
Note that with shell=False, the command we executed was ls, whereas with shell=True, the command we executed was sh.
That is to say:
subprocess.Popen(foo, shell=True)
is exactly the same as:
subprocess.Popen(
["sh", "-c"] + ([foo] if isinstance(foo, basestring) else foo),
shell=False)
That is to say, you execute a copy of /bin/sh, and direct that copy of /bin/sh to parse the string into an argument list and execute ls -l itself.
So, why would you use shell=True?
You're invoking a shell builtin.
For instance, the exit command is actually part of the shell itself, rather than an external command. That said, this is a fairly small set of commands, and it's rare for them to be useful in the context of a shell instance that only exists for the duration of a single subprocess.call() invocation.
You have some code with shell constructs (ie. redirections) that would be difficult to emulate without it.
If, for instance, your command is cat one two >three, the syntax >three is a redirection: It's not an argument to cat, but an instruction to the shell to set stdout=open('three', 'w') when running the command ['cat', 'one', 'two']. If you don't want to deal with redirections and pipelines yourself, you need a shell to do it.
A slightly trickier case is cat foo bar | baz. To do that without a shell, you need to start both sides of the pipeline yourself: p1 = Popen(['cat', 'foo', 'bar'], stdout=PIPE), p2=Popen(['baz'], stdin=p1.stdout).
You don't give a damn about security bugs.
...okay, that's a little bit too strong, but not by much. Using shell=True is dangerous. You can't do this: Popen('cat -- %s' % (filename,), shell=True) without a shell injection vulnerability: If your code were ever invoked with a filename containing $(rm -rf ~), you'd have a very bad day. On the other hand, ['cat', '--', filename] is safe with all possible filenames: The filename is purely data, not parsed as source code by a shell or anything else.
It is possible to write safe scripts in shell, but you need to be careful about it. Consider the following:
filenames = ['file1', 'file2'] # these can be user-provided
subprocess.Popen(['cat -- "$#" | baz', '_'] + filenames, shell=True)
That code is safe (well -- as safe as letting a user read any file they want ever is), because it's passing your filenames out-of-band from your script code -- but it's safe only because the string being passed to the shell is fixed and hardcoded, and the parameterized content is external variables (the filenames list). And even then, it's "safe" only to a point -- a bug like Shellshock that triggers on shell initialization would impact it as much as anything else.

I was under the impression that subprocess.call and check_call and check_output all must execute the argument through the shell.
No, subprocess is perfectly capable of starting a program directly (via an operating system call). It does not need a shell
Things that can be done with shell=True that can't be done with shell=False
You can use shell=False for any command that simply runs some executable optionally with some specified arguments.
You must use shell=True if your command uses shell features. This includes pipelines, |, or redirections or that contains compound statements combined with ; or && or || etc.
Thus, one can use shell=False for a command like grep string file. But, a command like grep string file | xargs something will, because of the | require shell=True.
Because the shell has power features that python programmers do not always find intuitive, it is considered better practice to use shell=False unless you really truly need the shell feature. As an example, pipelines are not really truly needed because they can also be done using subprocess' PIPE feature.

In Python, what is the difference between open(file).read() and subprocess(['cat', file]) and is there a preference for one over the other?

Let's say I want to read RAM usage from /proc/meminfo. There are two basic ways to do this that I can think of.
Use a shell command
output = subprocess.check_output('cat /proc/meminfo', shell=True)
# or output = subprocess.check_output(['cat', '/proc/meminfo'])
lines = output.splitlines()
Use open()
with open('/proc/meminfo') as meminfo:
output = meminfo.read()
lines = output.splitlines()
My question is what is the difference between the two methods? Is there a significant performance difference? My assumption is that using open() is the preferred method, since using a shell command is a bit hackish and may be system dependent, but I can't find any information on this so I thought I'd ask.

...so, let's look at what output = subprocess.check_output('cat /proc/meminfo', shell=True) does:
Creates a FIFO pair with mkfifo(), and spawns a shell running sh -c 'cat /proc/meminfo' writing to the input end of the FIFO (while the Python interpreter itself watches for output on the other end, either using the select() call or blocking IO operations). This means opening /bin/sh, opening all the libraries it depends on, etc.
The shell parses those arguments as code. This can be dangerous if, instead of opening /proc/meminfo. you're instead opening /tmp/$(rm -rf ~)/pwned.txt.
The shell forks a subprocess (optionally; shells may have an implicit exec), which then uses the execve system call to invoke /bin/cat with an argv of ['cat', '/proc/meminfo'] -- meaning that /bin/cat is again loaded as an executable, with its dynamic libraries, with all the performance overhead that implies.
/bin/cat then opens /proc/meminfo, reads from it, and writes to its stdout
The shell, if it did not use the implicit-exec optimization, waits for the /bin/cat executable to finish and exit using a wait()-family syscall.
The Python interpreter reads from the remote end of the FIFO until it provides an EOF (which will not happen until after cat has closed its output pipeline, potentially by exiting), and then uses a wait()-family call to retrieve information on how the shell it spawned exited, checking that exit status to determine whether an error occurred.
Now, let's look at what open('/proc/meminfo').read() does:
Opens the file using the open() syscall.
Reads the file using the read() syscall.
Drops the reference count on the file, allowing it to be closed (either immediately or on a future garbage collection pass) with the close() syscall.
One of these things is much, much, much more efficient and generally sensible than the other.

Best way to pipe output of Linux sort

I would like process a file line by line. However I need to sort it first which I normally do by piping:
sort --key=1,2 data |./script.py.
What's the best to call sort from within python? Searching online I see subprocess or the sh module might be possibilities? I don't want to read the file into memory and sort in python as the data is very big.

Its easy. Use subprocess.Popen to run sort and read its stdout to get your data.
import subprocess
myfile = 'data'
sort = subprocess.Popen(['sort', '--key=1,2', myfile],
stdout=subprocess.PIPE)
for line in sort.stdout:
your_code_here
sort.wait()
assert sort.returncode == 0, 'sort failed'

I think this page will answer your question
The answer I prefer, from #Eli Courtwright is (all quoted verbatim):
Here's a summary of the ways to call external programs and the advantages and disadvantages of each:
os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example,
os.system("some_command < input_file | another_command > output_file")
However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs.
http://docs.python.org/lib/os-process.html
stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything.
http://docs.python.org/lib/os-newstreams.html
The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say
print Popen("echo Hello World", stdout=PIPE, shell=True).stdout.read()
instead of
print os.popen("echo Hello World").read()
but it is nice to have all of the options there in one unified class instead of 4 different popen functions.
http://docs.python.org/lib/node528.html
The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply wait until the command completes and gives you the return code. For example:
return_code = call("echo Hello World", shell=True)
http://docs.python.org/lib/node529.html
The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.
The subprocess module should probably be what you use.

I believe sort will read all data in memory, so I'm not sure you will won anything but you can use shell=True in subprocess and use pipeline
>>> subprocess.check_output("ls", shell = True)
'1\na\na.cpp\nA.java\na.php\nerase_no_module.cpp\nerase_no_module.cpp~\nWeatherSTADFork.cpp\n'
>>> subprocess.check_output("ls | grep j", shell = True)
'A.java\n'
Warning
Invoking the system shell with shell=True can be a security hazard if combined with untrusted input. See the warning under Frequently Used Arguments for details.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.