Best way to pipe output of Linux sort - python

I would like to process a file line by line. However, I need to sort it first, which I normally do by piping:
sort --key=1,2 data | ./script.py
What's the best way to call sort from within Python? Searching online, I see that subprocess or the sh module might be possibilities. I don't want to read the file into memory and sort it in Python, as the data is very big.

It's easy: use subprocess.Popen to run sort and read its stdout to get your data.
import subprocess
myfile = 'data'
sort = subprocess.Popen(['sort', '--key=1,2', myfile],
                        stdout=subprocess.PIPE)
for line in sort.stdout:
    your_code_here
sort.wait()
assert sort.returncode == 0, 'sort failed'

I think this page will answer your question.
The answer I prefer, from Eli Courtwright, is (all quoted verbatim):
Here's a summary of the ways to call external programs and the advantages and disadvantages of each:
os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example,
os.system("some_command < input_file | another_command > output_file")
However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs.
http://docs.python.org/lib/os-process.html
stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything.
http://docs.python.org/lib/os-newstreams.html
The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say
print Popen("echo Hello World", stdout=PIPE, shell=True).stdout.read()
instead of
print os.popen("echo Hello World").read()
but it is nice to have all of the options there in one unified class instead of 4 different popen functions.
http://docs.python.org/lib/node528.html
The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply waits until the command completes and gives you the return code. For example:
return_code = call("echo Hello World", shell=True)
http://docs.python.org/lib/node529.html
The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.
The subprocess module should probably be what you use.
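If you do want a full pipeline like the one in the question but without involving the shell, a hedged sketch is to chain two Popen objects (the filenames match the question; this is not part of the quoted answer):
from subprocess import Popen, PIPE

# sort --key=1,2 data | ./script.py, built without a shell
sort_proc = Popen(['sort', '--key=1,2', 'data'], stdout=PIPE)
script_proc = Popen(['./script.py'], stdin=sort_proc.stdout, stdout=PIPE)
sort_proc.stdout.close()   # let sort receive SIGPIPE if script.py exits early
output, _ = script_proc.communicate()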

I believe sort will read all the data into memory anyway, so I'm not sure you will gain anything, but you can use shell=True in subprocess and build a pipeline:
>>> subprocess.check_output("ls", shell = True)
'1\na\na.cpp\nA.java\na.php\nerase_no_module.cpp\nerase_no_module.cpp~\nWeatherSTADFork.cpp\n'
>>> subprocess.check_output("ls | grep j", shell = True)
'A.java\n'
Warning
Invoking the system shell with shell=True can be a security hazard if combined with untrusted input. See the warning under Frequently Used Arguments for details.
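Applied to the question's pipeline (a hedged sketch; the command string is fixed here, but heed the warning above if any part of it comes from untrusted input, and note this buffers the whole output in memory):
import subprocess

sorted_output = subprocess.check_output("sort --key=1,2 data | ./script.py",
                                        shell=True)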

Related

Python pipeline using GNU Parallel

I'm trying to write a wrapper around GNU Parallel in Python to run a command in parallel, but I seem to be misunderstanding how GNU Parallel works, how system pipes work, and/or how Python subprocess pipes work.
Essentially I am looking to use GNU Parallel to handle splitting up an input file and then running another command in parallel on multiple hosts.
I can investigate some pure python way to do this in the future, but it seems like it should be easily implemented using GNU Parallel.
t.py
#!/usr/bin/env python
import sys
print
print sys.stdin.read()
print
p.py
from subprocess import *
import os
from os.path import *
args = ['--block', '10', '--recstart', '">"', '--sshlogin', '3/:', '--pipe', './t.py']
infile = 'test.fa'
fh = open('test.fa','w')
fh.write('''>M02261:11:000000000-ADWJ7:1:1101:16207:1115 1:N:0:1
CAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTT
>M02261:11:000000000-ADWJ7:1:1101:21410:1136 1:N:0:1
ATAGTAGATAGGGACATAGGGAATCTCGTTAATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACC
>M02261:11:000000000-ADWJ7:1:1101:13828:1155 1:N:0:1
GGTTTAGAGTCTCTAGTCGATAGATCAATGTAGGTAAGGGAAGTCGGCAAATTAGATCCGTAACTTCGGGATAAGGATTGGCTCTGAAGGCTGGGATGACTCGGGCTCTGGTGCCTTCGCGGGTGCTTTGCCTCAACGCGCGCCGGCCGGCTCGGGTGGTTTGCGCCGCCTGTGGTCGCGTCGGCCGCTGCAGTCATCAATAAACAGCCAATTCAGAACTGGCACGGCTGAGGGAATCCGACGGTCTAATTAAAACAAAGCATTGTGATGGACTCCGCAGGTGTTGACACAATGTGATTTT
>M02261:11:000000000-ADWJ7:1:1101:14120:1159 1:N:0:1
GAGTAGCTGCGAGCGAAAAGGGAAGAGCTCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCGAAAAGGGAAGCGCCCAAGGGGGGCAACAGGAACTAACAAGAATTCGCCGACTAGCTGCGACCTGAAAAGGAAAAACCCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCAGAAAAGGAAAAGCACAAGAGGAGGAAACGACACTAATAAGACTTCCCATACAAGCGGCGAGCAAAACAGCACGAGCCCAACGGCGAGAAAAGCAAAA
>M02261:11:000000000-ADWJ7:1:1101:8638:1172 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
''')
fh.close()
# Call 1
Popen(['parallel']+args, stdin=open(infile,'rb',0), stdout=open('output','w')).wait()
# Call 2
_cat = Popen(['cat', infile], stdout=PIPE)
Popen(['parallel']+args, stdin=_cat.stdout, stdout=open('output2','w')).wait()
# Call 3
Popen('cat '+infile+' | parallel ' + ' '.join(args), shell=True, stdout=open('output3','w')).wait()
Call 1 and Call 2 produce the same output while Call 3 produces the output I would expect where the input file was split up and contains empty lines between records.
I'm more curious about what the differences are between Call 1,2 and Call 3.
TL;DR Don't quote ">" when shell=False.
If you use shell=True, you can use all the shell's facilities, like globbing, I/O redirection, etc. You will need to quote anything which needs to be escaped from the shell. You can pass the entire command line as a single string, and the shell will parse it.
unsafe = subprocess.Popen('echo `date` "my files" * >output', shell=True)
With shell=False, you have no "secret" side effects behind the scenes, and none of the shell's facilities are available to you. So you need to take care of globbing, redirection, etc. on the Python side. On the plus side, you save a (potentially significant) extra process, you have more control, and you don't need (and indeed mustn't) quote things which had to be quoted when the shell was involved. In summary, this is also safer, because you can see exactly what you are doing.
cmd = ['echo']
cmd.append(datestamp())
cmd.append('my files')  # notice absence of shell quotes around the string
cmd.extend(glob('*'))
safer = subprocess.Popen(cmd, shell=False, stdout=open('output', 'w+'))
(This still differs slightly, because with modern shells, echo is a builtin, whereas now, we will be executing an external utility /bin/echo or whichever executable with that name comes first in your PATH.)
Now, returning to your examples, the problem in your args is that you are quoting a literal ">" as the record separator. When a shell is involved, an unquoted right angle bracket would trigger redirection, so to specify it as a string it has to be escaped or quoted; but when no shell is in the picture, there isn't anything which handles (or requires) those quotes, so to pass a literal > argument, simply pass it literally.
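Concretely, a hedged sketch of the fix, with the same values as in the question:
# shell=False: no shell quoting layer, so pass the record separator literally
args = ['--block', '10', '--recstart', '>', '--sshlogin', '3/:', '--pipe', './t.py']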
With that out of the way, your call #1 definitely seems like the way to go. (Though I'm not entirely convinced that it's sane to write a Python wrapper for a shell command implemented in Perl. I suspect that juggling a bunch of parallel child processes in Python directly would not be more complicated.)

How to call a series of bash commands in python and store output

I am trying to run the following bash script in Python and store the readlist output. The readlist that I want to be stored as a python list, is a list of all files in the current directory ending in *concat_001.fastq.
I know it may be easier to do this in Python, e.g.
import os
readlist = [f for f in os.listdir(os.getcwd()) if f.endswith("concat_001.fastq")]
readlist = sorted(readlist)
However, this is problematic, as I need Python to sort the list in EXACTLY the same way as bash, and I was finding that bash and Python sort certain things in different orders (e.g. Python and bash handle capitalised and uncapitalised names differently); but when I tried
readlist = np.asarray(sorted(flist, key=str.lower))
I still found that two files starting with ML_ and M_ were sorted in a different order by bash and Python. Hence I am trying to run my exact bash script through Python, and then use the bash-sorted list in my subsequent Python code.
input_suffix="concat_001.fastq"
ender=`echo $input_suffix | sed "s/concat_001.fastq/\*concat_001.fastq/g" `
readlist="$(echo $ender)"
I have tried
proc = subprocess.call(command1, shell=True, stdout=subprocess.PIPE)
proc = subprocess.call(command2, shell=True, stdout=subprocess.PIPE)
proc = subprocess.Popen(command3, shell=True, stdout=subprocess.PIPE)
But I just get: <subprocess.Popen object at 0x7f31cfcd9190>
Also - I don't understand the difference between subprocess.call and subprocess.Popen. I have tried both.
Thanks,
Ruth
Your question is a little confusing and does not exactly explain what you want; however, I'll try to give some suggestions to help you update it, or, failing that, to answer it.
I will assume the following: your Python script passes 'input_suffix' on the command line, and you want your Python program to receive the contents of 'readlist' when the external script finishes.
To make our lives simpler, and to allow for more complicated things later, I would put your commands in the following bash script:
script.sh
#!/bin/bash
input_suffix=$1
ender=`echo $input_suffix | sed "s/concat_001.fastq/\*concat_001.fastq/g"`
readlist="$(echo $ender)"
echo $readlist
You would execute this as ./script.sh "concat_001.fastq", where $1 takes the first argument passed on the command line.
To use Python to execute external scripts, as you quite rightly found, you can use subprocess (or, as noted in another answer, os.system, although subprocess is recommended).
The docs tell you that subprocess.call:
"Wait for command to complete, then return the returncode attribute."
and that
"For more advanced use cases when these do not meet your needs, use the underlying Popen interface."
Given that you want to pipe the output from the bash script into your Python script, let's use Popen as suggested by the docs. As posted in the other Stack Overflow answer, it could look like the following:
import subprocess

# Execute our script and capture its output
process = subprocess.Popen(['./script.sh', 'concat_001.fastq'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)

# Obtain the standard out, and standard error
stdout, stderr = process.communicate()
and then:
>>> print stdout
*concat_001.fastq
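If the end goal is a Python list (a hedged follow-up, reusing the stdout captured above), splitting on whitespace is enough, since the script echoes the filenames space-separated:
readlist = stdout.split()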

Python inline linux commands

I am testing sorting algorithms, and therefore I would like to combine, in my Python code, the Linux command "time" (because it takes some interesting arguments) with, for example, the call to quicksort.
from subprocess import Popen
import quicksort
import rand
time=Popen("time quicksort.main(rand.main())")
This is totally wrong, but it is the closest I managed to get. I haven't grasped the idea of the subprocess module: is it possible to combine method calls with Linux commands, or can I only run commands like "grep ..." from Python and send the output to a variable?
If you use Popen from subprocess you need to do a lot of things differently.
I believe what you are looking for is check_output, another function belonging to the subprocess module.
But in order to further your understanding, since you are sort-of close, here is what you need to change to get it to work:
The command string "time quicksort.main(rand.main())" is not going to mean anything to bash; that is Python. But even if it were valid bash, it would need to be split on word boundaries (as bash would normally do), so you would turn it into a list:
['time', '...','...']
The only time you can pass Popen a command STRING (not a list) is when you set shell=True in the keywords to Popen.
But let's just leave shell at False, do some word-splitting for bash, and pass in a list. On to the next part.
Popen returns something you can communicate to/at/with, not the result of the process' stdout. Use subprocess.PIPE for the stdin and stdout keywords to Popen.
Once you have made a Popen object as described, you can call its communicate method.
The result is two things, stdout and stderr.
You're after the first one. One use case for Popen is when you need to keep errors and output separate. Obviously this isn't turning out to be the best option for an inline call, but oh well. Let's deal with stdout.
stdout will probably need to be decoded:
stdout.decode()
or perhaps even have newlines stripped as well:
stdout.decode().rstrip()
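Putting those steps together, a hedged sketch with a stand-in command (the original quicksort call is plain Python, not a shell command, so it can't be run this way):
import subprocess

# word-split argument list, no shell involved
proc = subprocess.Popen(['uname', '-a'],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
print(stdout.decode().rstrip())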
So as you can see, Popen does not fit the use case you have in mind. There is no need to use subprocess and make system calls in order to time Python code. Look into timeit.
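For the actual goal of timing quicksort, a hedged timeit sketch (assuming the quicksort and rand modules from the question exist and are importable):
import timeit

elapsed = timeit.timeit('quicksort.main(rand.main())',
                        setup='import quicksort, rand',
                        number=10)
print(elapsed)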

redirecting the output of shell script executing through python

Hi, I am trying to execute a shell script from Python using the following command:
os.system("sh myscript.sh")
In my shell script I have written some SOPs (print statements); now how do I get that output in my Python code so that I can log it to a file?
I know I can do it using subprocess.Popen, but for some reason I cannot use it.
p = subprocess.Popen(
    'DMEARAntRunner \"' + mount_path + '\"',
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT
)
while 1:
    line = p.stdout.readline()[:-1]
    if not line:
        break
    write_to_log('INFO', line)
p.communicate()
If I understand your question correctly, you want something like this:
import subprocess

find_txt_command = ['find', '-maxdepth', '2', '-name', '*.txt']
with open('mylog.log', 'w') as logfile:
    subprocess.call(find_txt_command, stdout=logfile, shell=False)
You can use Popen instead of call if you need to; the syntax is very similar. Notice that the command is a list containing the program you want to run and its arguments. In general you want to use Popen/call with shell=False: it prevents unexpected behavior that can be hard to debug, and it is more portable.
Kindly check the official documentation for the subprocess module in Python. It is currently the recommended way, over os.system calls, to execute system commands and retrieve their results, and it gives examples very close to what you need.
I personally would advise you to leave the shell argument at its default value of False. In that case, the first argument isn't a string as you'd type into a terminal, but a list of "words", the first being the program, the ones after that being arguments. This means that there is no need to quote arguments, making your program more resilient to whitespace arguments and injection attacks.
This should do the trick:
p = subprocess.Popen(['DMEARAntRunner', mount_path],
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
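From there (a hedged follow-up sketch, reusing the write_to_log helper from the question, which is not defined here), you can read the redirected output line by line and log it:
for raw_line in p.stdout:
    line = raw_line.rstrip()
    if line:
        write_to_log('INFO', line)   # write_to_log is the question's own helper
p.wait()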
As always with executing shell commands the question remains whether it's the easiest/best way to solve a problem, but that's another discussion altogether.

How to spawn multiple python scripts from a python program?

I want to spawn (fork?) multiple Python scripts from my program (written in Python as well).
My problem is that I want to dedicate one terminal to each script, because I'll gather their output using pexpect.
I've tried using pexpect, os.execlp, and os.forkpty, but none of them does what I expect.
I want to spawn the child processes and forget about them (they will process some data, write the output to the terminal which I could read with pexpect and then exit).
Is there any library/best practice/etc. to accomplish this job?
p.s. Before you ask why I would write to STDOUT and read from it, I should say that I don't write to STDOUT; I read the output of tshark.
See the subprocess module
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several other, older modules and functions, such as:
os.system
os.spawn*
os.popen*
popen2.*
commands.*
From Python 3.5 onwards you can do:
import subprocess
result = subprocess.run(['python', 'my_script.py', '--arg1', val1])
if result.returncode != 0:
    print('script returned error')
By default the child process inherits your stdout and stderr, so its output simply appears on your console; pass stdout=subprocess.PIPE and stderr=subprocess.PIPE if you want to capture it instead.
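A hedged sketch of capturing that output (the script name and argument values are placeholders):
import subprocess

# Capture the child's output rather than letting it go to the console
result = subprocess.run(['python', 'my_script.py', '--arg1', 'value1'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode())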
I don't understand why you need expect for this. tshark should send its output to stdout, and only for some strange reason would it send it to stderr.
Therefore, what you want should be:
import subprocess
fp = subprocess.Popen(('/usr/bin/tshark', 'option1', 'option2'),
                      stdout=subprocess.PIPE).stdout
# now, whenever you are ready, read stuff from fp
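# for example (a hedged sketch, not part of the original answer),
# consume the captured output line by line:
for line in fp:
    print(line.rstrip())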
You want to dedicate one terminal or one python shell?
You already have some useful answers for Popen and subprocess; you could also use pexpect if you're already planning on using it anyway.
#for multiple python shells
import pexpect
#make your commands however you want them, this is just one method
mycommand1 = "print 'hello first python shell'"
mycommand2 = "print 'this is my second shell'"
#add a "for" statement if you want
child1 = pexpect.spawn('python')
child1.sendline(mycommand1)
child2 = pexpect.spawn('python')
child2.sendline(mycommand2)
Make as many children/shells as you want, and then use child.before or child.after (after an expect call) to get your responses.
Of course you would want to add definitions or classes to be sent instead of "mycommand1", but this is just a simple example.
If you wanted to open a bunch of terminals in Linux, you would just need to replace the 'python' in the pexpect.spawn line.
Note: I haven't tested the above code. I'm just replying from past experience with pexpect.
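A hedged sketch of actually reading a response back with expect and before (the prompt pattern and command are illustrative, not from the answer above):
import pexpect

child = pexpect.spawn('python')
child.expect('>>> ')                # wait for the interactive prompt
child.sendline("print('hello first python shell')")
child.expect('>>> ')                # wait for the command to run
print(child.before)                 # output produced before the new prompt appeared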
