Pythonic equivalent of this sed command - python

I have this awk/sed command
awk '{full=full$0}END{print full;}' initial.xml | sed 's|</Product>|</Product>\
|g' > final.xml
to break up an XML document containing a large number of tags,
so that the new file has the entire contents of each Product node on a single line.
I am trying to run it using os.system and the subprocess module, but this wraps the entire contents of the file onto one line.
Can anyone convert it into an equivalent Python script?
Thanks!

Something like this?
from __future__ import print_function
import fileinput
for line in fileinput.input('initial.xml'):
    print(line.rstrip('\n').replace('</Product>','</Product>\n'),end='')
I'm using the print function because the default print in Python 2.x will add a space or newline after each set of output. There are various other ways to work around that, some of which involve buffering your output before printing it.
For the record, your problem could equally well be solved in just a simple Awk script.
awk '{ gsub(/<\/Product>/,"&\n"); printf "%s", $0 }' initial.xml
Printing output as it arrives without a trailing newline is going to be a lot more efficient than buffering the whole file and then printing it at the end, and of course, Awk has all the necessary facilities to do the substitution as well. (gsub is not available in all dialects of Awk, though.)
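A file-to-file variant of the Python approach above might look like the following. This is only a sketch: the names join_and_split and convert are made up for illustration, and it assumes the document fits in memory (as the original awk command also does).

```python
def join_and_split(text, closing_tag="</Product>"):
    """Join all lines into one, then start a new line after each closing tag."""
    joined = "".join(text.splitlines())
    return joined.replace(closing_tag, closing_tag + "\n")


def convert(src="initial.xml", dst="final.xml"):
    # Hypothetical helper: read src and write the reflowed text to dst.
    with open(src) as infile:
        text = infile.read()
    with open(dst, "w") as outfile:
        outfile.write(join_and_split(text))
```

Calling convert() with no arguments would read initial.xml and write final.xml, mirroring the file names in the question.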

Related

subprocess.call with command having embedded spaces and quotes

I would like to retrieve output from a shell command that contains spaces and quotes. It looks like this:
import subprocess
cmd = "docker logs nc1 2>&1 |grep mortality| awk '{print $1}'|sort|uniq"
subprocess.check_output(cmd)
This fails with "No such file or directory". What is the best/easiest way to pass commands such as these to subprocess?
The absolutely best solution here is to refactor the code to replace the entire tail of the pipeline with native Python code.
import subprocess
from collections import Counter
s = subprocess.run(
    ["docker", "logs", "nc1"],
    text=True, capture_output=True, check=True)
count = Counter()
for line in s.stdout.splitlines():
    if "mortality" in line:
        count[line.split()[0]] += 1
# most_common() yields (word, frequency) pairs
for word, freq in count.most_common():
    print(freq, word)
There are minor differences in how Counter objects resolve ties (if two words have the same count, the one which was seen first is returned first, rather than by sort order), but I'm guessing that's unimportant here.
I am also ignoring standard error from the subprocess; if you genuinely want to include output from error messages, just include s.stderr in the loop driver too.
However, my hunch is that you don't realize your code was doing that, which drives home the point nicely: Mixing shell script and Python raises the maintainability burden, because now you have to understand both shell script and Python to understand the code.
(And in terms of shell script style, I would definitely get rid of the useless grep by refactoring it into the Awk script, and probably also fold in the sort | uniq which has a trivial and more efficient replacement in Awk. But here, we are replacing all of that with Python code anyway.)
If you really wanted to stick to a pipeline, then you need to add shell=True to use shell features like redirection, pipes, and quoting. Without shell=True, Python looks for a command whose file name is the entire string you were passing in, which of course doesn't exist.
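If you do keep the pipeline, a minimal sketch with shell=True could look like this. The helper name run_pipeline is hypothetical, and the test command in the usage note assumes a POSIX shell with printf, sort and uniq available; the docker command from the question would be passed the same way.

```python
import subprocess


def run_pipeline(command):
    """Run a shell pipeline given as one string and return its standard output."""
    # shell=True hands the whole string to the system shell (/bin/sh on
    # POSIX), which interprets the pipes, redirections and quotes.
    result = subprocess.run(
        command, shell=True, text=True, capture_output=True, check=True)
    return result.stdout
```

For example, run_pipeline("printf 'b\\na\\nb\\n' | sort | uniq") returns the sorted, de-duplicated lines as one string. Only pass trusted strings this way, since the shell will execute whatever it is given.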

Executing awk command from python

I am trying to execute the following awk command from a python script
awk 'BEGIN {FS="\t"}; {print $1"\t"$2}' file_a > file_b
For this, I tried to use subprocess as follows:
subprocess.check_output(["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}',
file_a, ">",
file_b])
where file_a and file_b are strings pointing to the path of the files.
From this, I am getting the error
awk: cannot open > (No such file or directory)
I'm sure I'm passing the arguments to subprocess the wrong way, but I can't figure out what's wrong.
While it may look like it in your shell of choice, >, <, and | are not actually passed as arguments to the program you run. Rather, they're a special part of the shell that the program never gets to see.
Since they're part of the shell, and not part of the OS or program, you have to emulate their effects yourself with the normal facilities the language gives you. In your case, since you're trying to redirect output to a file, simply use Python's open() as you would normally. The subprocess API supports arguments to specify stdout, stdin, and stderr, and you can supply any file object for those.
Check it out:
with open(file_b, 'wb') as f:
    subprocess.call(["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}', file_a], stdout=f)
Since subprocess.check_output redirects output already, it doesn't take the stdout argument. Using subprocess.call avoids this. If you also need the output later in the script, you can instead assign the return value of check_output to a variable, and then save that to file_b.
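The second option, keeping the output in a variable and also saving it, might look like this sketch. run_and_save is a made-up helper name; the commented call shows how the awk command from the question would be passed, assuming awk is installed.

```python
import subprocess


def run_and_save(argv, path):
    """Run argv without a shell, save its stdout to path, and return it."""
    output = subprocess.check_output(argv)  # returned as bytes
    with open(path, "wb") as f:
        f.write(output)
    return output


# Hypothetical usage with the command from the question:
# data = run_and_save(
#     ["awk", 'BEGIN {FS="\t"}; {print $1"\t"$2}', file_a], file_b)
```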
If you use a lot of shell commands, you might also want to check out Plumbum, which gives you a large set of fairly silly shell-like operator overloads.

python: How does subprocess.check_output create its calls?

I'm trying to read the duration of video files using mediainfo. This shell command works
mediainfo --Inform="Video;%Duration/String3%" file
and produces an output like
00:00:33.600
But when I try to run it in python with this line
subprocess.check_output(['mediainfo', '--Inform="Video;%Duration/String3%"', file])
the whole --Inform thing is ignored and I get the full mediainfo output instead.
Is there a way to see the command constructed by subprocess to see what's wrong?
Or can anybody just tell what's wrong?
Try:
subprocess.check_output(['mediainfo', '--Inform=Video;%Duration/String3%', file])
The " in your python string are likely passed on to mediainfo, which can't parse them and will ignore the option.
These kinds of problems are often caused by shell commands requiring/swallowing various special characters. Quotes such as " are removed by bash during its own parsing, before the program ever sees them. In contrast, Python attaches no shell meaning to them inside a string, and will thus pass them along exactly as you wrote them. Why would you use them if you didn't need them? (Well, because bash makes you believe you need them.)
For example, in bash I can do
$ dd of="foobar"
and it will write to a file named foobar, swallowing the quotes.
In python, if I do
subprocess.check_output(["dd", 'of="barfoo"', 'if=foobar'])
it will write to a file named "barfoo", keeping the quotes.
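One way to see what a POSIX shell would hand to the program is shlex.split from the standard library, which applies shell-like quoting rules to a command string. It is only an approximation of bash (no variable or glob expansion), but it is enough to show the quotes disappearing:

```python
import shlex

# The command exactly as typed at the shell prompt in the question.
command = 'mediainfo --Inform="Video;%Duration/String3%" file'

# shlex.split removes the quotes the same way the shell would.
argv = shlex.split(command)
print(argv)
# ['mediainfo', '--Inform=Video;%Duration/String3%', 'file']
```

The resulting list has the form subprocess.check_output expects as its first argument.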

How to use awk if statement and for loop in subprocess.call

Trying to print filename of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time" so I googled and added ^ within parentheses. Next error was "unexpected newline or end of string" so googled again and added the quotes around NR==1 && NF!=12. With the current code it's printing many lines in each file so I suspect something is wrong with the if statement. I've used awk and for looped before in this style in subprocess.call but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through the entire files until you find any line that has other than 12 fields, and not only check the first line (NR==1). In this case, the condition would be only NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, of course changes the field separator to a comma from the default whitespace. On the first line of each file (FNR==1), AWK checks whether the number of fields NF differs from 12. If it does, it executes the block of code; otherwise it goes on to the next line. This block prints the FILENAME of the current file AWK is processing, then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes let the string contain the single quotes AWK needs without escaping. The output of AWK goes to stdout and I'm assuming you know how to use this in Python with the subprocess module.
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob
files_found = []
for filename in glob.glob('*dim*.csv'):
    # the with block closes the file automatically
    with open(filename, 'r') as f:
        firstline = f.readline()
    if len(firstline.split(',')) != 12:
        files_found.append(filename)
print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, it is added to the list files_found.
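The earlier suggestion of checking every line, not just the first, can be sketched in Python as well. The function names below are made up for illustration, and the plain split(',') assumes that no field contains a quoted comma (the csv module would be needed for that).

```python
import glob


def has_bad_line(filename, expected=12):
    """Return True if any line has a field count other than expected."""
    with open(filename) as f:
        # any() stops reading at the first offending line
        return any(len(line.rstrip('\n').split(',')) != expected for line in f)


def files_with_bad_lines(pattern='*dim*.csv', expected=12):
    """List the files matching pattern that contain at least one bad line."""
    return [name for name in glob.glob(pattern) if has_bad_line(name, expected)]
```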

Python equivalent to perl -pe?

I need to pick some numbers out of some text files. I can pick out the lines I need with grep, but didn't know how to extract the numbers from the lines. A colleague showed me how to do this from bash with perl:
cat results.txt | perl -pe 's/.+(\d\.\d+)\.\n/\1 /'
However, I usually code in Python, not Perl. So my question is, could I have used Python in the same way? I.e., could I have piped something from bash to Python and then gotten the result straight to stdout? ... if that makes sense. Or is Perl just more convenient in this case?
Yes, you can use Python from the command line. python -c <stuff> will run <stuff> as Python code. Example:
python -c "import sys; print(sys.path)"
There isn't a direct equivalent to the -p option for Perl (the automatic input/output line-by-line processing), but that's mostly because Python doesn't use the same concept of $_ and whatnot that Perl does - in Python, all input and output is done manually (via raw_input()/input(), and print/print()).
For your particular example:
cat results.txt | python -c "import re, sys; print(''.join(re.sub(r'.+(\d\.\d+)\.\n', r'\1 ', line) for line in sys.stdin))"
(Obviously somewhat more unwieldy. It's probably better to just write the script to do it in actual Python.)
You can use:
$ python -c '<your code here>'
You can in theory, but Python doesn't have anywhere near as much regex magic as Perl does, so the resulting command will be much more unwieldy, especially as you can't use regular expressions without importing re (and you'll probably need sys for sys.stdin too).
The Python equivalent of your colleague's Perl one-liner is approximately:
import sys, re
for line in sys.stdin:
    sys.stdout.write(re.sub(r'.+(\d\.\d+)\.\n', r'\1 ', line))
You have a problem which can be solved several ways.
I think you should consider using regular expressions (which is what Perl is doing in your example) directly from Python. Regular expressions are in the re module. An example would be:
import re
with open('somefile.txt') as f:
    filecontent = f.read()
# re.MULTILINE lets $ match at the end of every line
print(re.findall(r'.+(\d\.\d+)\.$', filecontent, re.MULTILINE))
(I would prefer using $ instead of '\n' for line endings, because line endings differ between operating systems and file encodings)
If you want to call bash commands from inside Python, you could use:
import os
os.system(mycommand)
where mycommand is the bash command. I use it all the time, because some operations are better to perform in bash than in Python.
Finally, if you want to extract the numbers with grep, use the -o option, which prints only the matched part.
Perl (or sed) is more convenient. However it is possible, if ugly:
python -c 'import sys, re; print("\n".join(re.sub(r".+(\d\.\d+)\.\n", r"\1 ", l) for l in sys.stdin))'
Quoting from https://stackoverflow.com/a/12259852/411282:
for ln in __import__("fileinput").input(): print ln.rstrip()
See the explanation linked above, but this does much more of what perl -p does, including support for multiple file names and stdin when no filename is given.
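Combining fileinput with re.sub gives a rough stand-in for perl -pe as a small script. This is a sketch: the function names and the script name extract.py are invented, and the pattern is the one from the question.

```python
import fileinput
import re
import sys

PATTERN = re.compile(r'.+(\d\.\d+)\.\n')


def substitute(lines, repl=r'\1 '):
    """Yield each line with the first match of PATTERN replaced, like s/.../.../."""
    for line in lines:
        yield PATTERN.sub(repl, line, count=1)


def main():
    # fileinput.input() reads the files named in sys.argv[1:], or stdin if none.
    for line in substitute(fileinput.input()):
        sys.stdout.write(line)


# main()  # uncomment when saving this as a script such as extract.py
```

With main() enabled, both cat results.txt | python extract.py and python extract.py results.txt would work, which is the part of perl -p that the plain sys.stdin loop does not cover.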
https://docs.python.org/3/library/fileinput.html#fileinput.input
You can use Python to execute code directly from your bash command line with python -c, or you can process input piped to stdin using sys.stdin.
