Having some issues calling awk from within Python. Normally, I'd do the following to run my awk command from the command line.
Open up the command line, in admin mode or not.
Change my directory to the folder containing awk.exe, namely cd R\GnuWin32\bin
Call awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv
My command is used to split up the large.csv file based on the 10th column into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call(), but I'm having some issues. I run the following code:
def split_ByInputColumn():
subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"',
'\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
cwd = 'C:/R/GnuWin32/bin/')
and clearly something is running when I execute the function (CPU usage, etc.), but when I go to check C:/R/GnuWin32/bin/, there are no split files in the directory. Any idea what's going wrong?
As I stated in my previous answer (which was downvoted), you are over-protecting the arguments, which makes awk's argument parsing fail.
Since there was no comment explaining the downvote, I supposed there was a typo, but the command worked... So I suppose it's because I should have strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).
Writing the equivalent in Python is not trivial, as we have to emulate the way awk opens files and appends to them afterwards. But it is more integrated, more Pythonic, and handles quoting properly if quoting occurs in the input file.
I took the time to code & test it:
import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)
    open_files = dict()
    with open('large.csv', newline='') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_column = r[9]  # the 10th column is index 9
            filename = "split-{}.csv".format(tenth_column)
            if filename not in open_files:
                # open each output file once; reuse its writer afterwards
                handle = open(filename, "w", newline='')
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)
    # close every output file at the end
    for f, _ in open_files.values():
        f.close()

split_ByInputColumn()
In detail:
- read the big file as CSV (advantage: quoting is handled properly)
- compute the destination filename from the tenth column
- if the filename is not in the dictionary yet, open the file and create a csv.writer object for it
- write the row with the corresponding writer
- at the end, close all file handles
Aside: My old solution, using awk properly:
import subprocess
def split_ByInputColumn():
subprocess.call(['awk.exe', '-F', ',',
'{ print > ("split-" $10 ".csv") }', 'large.csv'],cwd = 'some_directory')
Someone else posted an answer (and then subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:
def split_ByInputColumn():
subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                 '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                cwd='C:/R/GnuWin32/bin/')
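A handy way to debug this kind of quoting problem is to print the exact command line string that subprocess builds on Windows; subprocess.list2cmdline is the (undocumented) helper it uses internally:

import subprocess

# show the command line subprocess would hand to Windows for this argument
# list; over-escaped quotes become obvious in the printed string
args = ['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
        '{ print > ("split-" $10 ".csv") }', 'large.csv']
print(subprocess.list2cmdline(args))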
Related
I have an awk command that works in bash, but I'm now trying to put it into a Python script.
I have tried both os.system and subprocess.call; both return the same error: sh: 1: Syntax error: "(" unexpected
os.system('awk \'FNR<=27{print;next} ++count%10==0{print;count}\' \'{0} > {1}\'.format(inputfile, outpufile)')
This awk command takes the large input file and creates an output file that keeps the first 27 lines of header, but starting on line 28 it only takes every 10th line and puts it into the output file.
I'm using .format() because this is within a Python script where the input file will be different every time it's run.
I've also tried
subprocess.call('awk \'FNR<=27{print;next} ++count%10==0{print;count}\' \'{0} > {1}\'.format(inputfile, outpufile)')
Both come up with the same error as above. What am I missing?
As per the comment above, it is probably more Pythonic (and more manageable) to use Python directly.
But, if you want to use awk then one way is to format your command with your variable filenames separately.
This works using a basic test text file:
import os
def awk_runner(inputfile, outputfile):
cmd = "awk 'FNR<=27{print;next} ++count%10==0{print;count}' " + inputfile + " > " + outputfile
os.system(cmd)
awk_runner('test1.txt', 'testout1.txt')
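If a filename ever contains spaces or shell metacharacters, a variant that avoids the shell entirely may be safer. A sketch using subprocess, assuming awk is on the PATH:

import subprocess

def awk_runner(inputfile, outputfile):
    # the awk program is a single argument, so no shell quoting is needed;
    # stdout is redirected in Python instead of with '>'
    with open(outputfile, 'w') as out:
        subprocess.call(['awk', 'FNR<=27{print;next} ++count%10==0{print}', inputfile],
                        stdout=out)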
There are two main issues with your Python code:
- format() is a Python method call; it should not be put inside the awk_cmd string that is executed by the shell
- when calling the format() method, braces {} identify substitution targets in the format string, so literal braces need to be escaped as {{ ... }}
See below a modified version of your code:
awk_cmd = "awk 'FNR<=27{{print;next}} ++count%10==0{{print;count}}' {0} > {1}".format(inputfile, outpufile)
os.system(awk_cmd)
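For comparison, a pure-Python sketch of the same filter (filter_lines is a hypothetical helper name), which sidesteps the shell and the brace escaping altogether:

def filter_lines(inputfile, outputfile):
    # keep the first 27 header lines, then every 10th line after that
    with open(inputfile) as src, open(outputfile, 'w') as dst:
        count = 0
        for lineno, line in enumerate(src, start=1):
            if lineno <= 27:
                dst.write(line)
                continue
            count += 1
            if count % 10 == 0:
                dst.write(line)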
The problem: I want to iterate over folder in search of certain file type, then execute it with a program and the name.ext as argument, and then run python script that changes the output name of the first program.
I know there is probably a better way to do the above, but the way I thought of was this:
[BAT]
for /R "C:\..\folder" %%a IN (*.extension) do (
    SET name=%%a
    "C:\...\first_program.exe" "%%a"
    "C:\...\script.py" "%name%"
)
[PY]
import sys

def rename(i):
    name = i
    with open('my_file.txt', 'r') as file:
        data = file.readlines()
    # replace line 41 (index 40) with the new path
    data[40] = '"C:\\\\Users\\\\UserName\\\\Desktop\\\\folder\\\\folder\\\\' + name + '"\n'
    with open('my_file.txt', 'w') as file:
        file.writelines(data)

if __name__ == "__main__":
    rename(sys.argv[1])
Expected result: I want the Python script to change the name, but after I pass it once on the console, the name seems to stick with the script. The BAT file does not change it, and that bothers me.
PS. If there is a better way, I'll be glad to learn about it.
This is the Linux bash version; I am sure you can change the loop etc. to make it work as a batch file. Instead of your *.exe I use cat as a generic input/output example:
#!/bin/sh
suffix=".txt"
for f in *.txt
do
    name=${f%"$suffix"}
    cat "$f" > tmp.dat
    awk -v myName="$f" '{ if (NR==5) print $0 myName; else print $0 }' tmp.dat > "$name.dat"
done
This produces "unique" output *.dat files named after the input *.txt files. The files are processed by cat (standing in for your *.exe) and the output is put into a temporary file. Finally, awk changes line 5 (as done here) and the result is placed in the unique output file, as mentioned above.
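If you would rather drop the BAT file entirely, here is a rough Python sketch of the same loop (the paths are the placeholders from the question, and Python 3.5+ is assumed for the recursive glob):

import glob
import os
import subprocess

def process_folder(folder):
    # recursively collect the input files, run the external program on
    # each, then record the file name via rename() from the script above
    pattern = os.path.join(folder, '**', '*.extension')
    for path in glob.glob(pattern, recursive=True):
        subprocess.call([r'C:\...\first_program.exe', path])
        rename(os.path.basename(path))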
I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the nested quotation marks. Any suggestions?
Thanks!
While the question is answered already, I'll still jump in, because I assume that you want to execute that bash script only because you do not have the functionally equivalent Python code (which is less than 40 lines, basically; see below).
Why do this instead of the bash script?
Your script now is able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this (like your bash script) comes without any kind of error checking, and that the output file is a global variable, but that can be changed easily.
import gzip
import os
# create our output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')
def process_line(line):
"""
get the third column (delimiter is tab char) and write to output file
"""
columns = line.split('\t')
    # cut -f 3 selects the third tab-separated field, i.e. index 2
    if len(columns) >= 3:
        outfile.write(columns[2] + '\n')
def process_zipfile(filename):
"""
read zip file content (we assume text) and split into lines for processing
"""
print('Reading {0} ...'.format(filename))
with gzip.open(filename, mode='rb') as f:
lines = f.read().decode('utf-8').split('\n')
for line in lines:
process_line(line.strip())
def process_directory(dirtuple):
"""
loop thru the list of files in that directory and process any .gz file
"""
print('Processing {0} ...'.format(dirtuple[0]))
for filename in dirtuple[2]:
if filename.endswith('.gz'):
process_zipfile(os.path.join(dirtuple[0], filename))
# walk the directory tree from current directory downward
for dirtuple in os.walk('.'):
process_directory(dirtuple)
outfile.close()
Escape the ' marks with a \.
That is, for every ', replace it with \'.
Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See the Python documentation (yeah, it's the Python 3 documentation, but it all works 100% in Python 2).
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
I need to process a couple of very large files (>90GB each). Only a small portion of the files is important to me. I want to scan through the files and write the necessary lines to another file, so I don't need to process such large files every time I run an experiment. Every line is about 1000 characters.
I use the following code:
import datetime

def readFile(inputFile, outputFile):
    startDate = datetime.datetime.strptime('10/06/2010 00:00:00', '%m/%d/%Y %H:%M:%S')
    endDate = datetime.datetime.strptime('10/13/2010 23:59:59', '%m/%d/%Y %H:%M:%S')
    total_lines = 0
    with open(inputFile, 'r') as a_file:
        for a_line in a_file:
            total_lines += 1
            line_id, date, content = splitLine(a_line)
            datetime_object = datetime.datetime.strptime(date, '%m/%d/%Y %H:%M:%S')
            if startDate < datetime_object < endDate:
                appendToFile(outputFile, a_line)
    return total_lines
def splitLine(long_string):
values = long_string.split(",")
return values[0],values[1],values[2]
def appendToFile(outputFile, outputString):
try:
file = open(outputFile, 'a+')
file.write(outputString)
file.close()
except Exception as ex:
print("Error writing to file: " + outputFile)
return
The problem is, every time I run the script, the process gets stuck around the 10,000,000th line. When I use the htop command, I can see that Python only uses around 8GB of RAM when it gets stuck, while the virtual memory used keeps increasing, and then the OS kills the process after a while.
I used different files, and also both Python 2.7 and 3.5. I also tried using with open(inputFile, 'r', 16777216) to use buffering, but the result didn't change. I'm running the code on macOS Sierra 10.12.4 and the machine has 16GB of RAM.
Any ideas?
Open the file in pieces, until you find what you want. Like this:
with open('yourfile') as f:
    piece = f.read(4096)
    while piece:
        # process each piece here
        piece = f.read(4096)
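The same idea reads a little more cleanly as a generator (a sketch; 4096 is an arbitrary chunk size):

def read_in_chunks(file_obj, chunk_size=4096):
    # yield successive chunks until the file is exhausted
    while True:
        piece = file_obj.read(chunk_size)
        if not piece:
            break
        yield piece

with open('yourfile') as f:
    for piece in read_in_chunks(f):
        pass  # process each piece here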
A more efficient way to do this would be to call the Unix awk command from Python; this will work on both Mac and Unix.
You can call Unix commands from Python like this:
import os
os.popen('ls -l > result.txt')
Running this sample code will create a file named result.txt that contains the output of the ls -l command.
Similarly, you can scan through your files with awk and pipe the result to another file.
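For example, a rough sketch that keeps only lines whose second comma-separated field starts with a date between 10/06/2010 and 10/13/2010 (the field layout assumed here matches splitLine() in the question; input.txt and filtered.txt are placeholder names):

import os

# field 2 is the date; the regex matches 10/06/2010 through 10/13/2010
os.popen("awk -F',' '$2 ~ /^10\\/(0[6-9]|1[0-3])\\/2010/' input.txt > filtered.txt")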
From the man page of awk:
awk
NAME
awk - pattern-directed scanning and processing language
SYNOPSIS
awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
DESCRIPTION
Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile. With each pattern there can be an associated action that will be performed when a line of a file matches the pattern. Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern. The file name - means the standard input. Any file of the form var=value is treated as an assignment, not a filename, and is executed at the time it would have been opened if it were a filename. The option -v followed by var=value is an assignment to be done before prog is executed; any number of -v options may be present. The -F fs option defines the input field separator to be the regular expression fs.
Read this answer https://unix.stackexchange.com/questions/76805/read-log-file-between-two-dates to see how to use awk to read log files between two dates.
A python script I need to run takes input only from a file passed as a command line argument, like so:
$ markdown.py input_file
Is there any way to get it to accept input from STDIN instead? I want to be able to do this through Bash, without significantly modifying the python script:
$ echo "Some text here" | markdown.py
If I have to modify the Python script, how would I go about it?
(EDIT: Here is the script that is parsing the command line options.)
I'm not sure how portable it is, but on Unix-y systems you can name /dev/stdin as your file:
$ echo -n hi there | wc /dev/stdin
0 2 8 /dev/stdin
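Applied to the script in the question, that would look like:
echo "Some text here" | markdown.py /dev/stdin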
Make sure this is near the top of the file:
import sys
Then look for something like this:
filename = sys.argv[1]
f = open(filename)
and replace it with this:
f = sys.stdin
It's hard to be more specific without seeing the script that you're starting with.
In the code you have a line like this:
if not len(args) == 1:
What you could do there is check whether a filename was given and, if not, use "/dev/stdin" instead (on a system that allows it).
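A minimal sketch of that first option (assuming a Unix-like system where /dev/stdin exists):

if not len(args) == 1:
    input_file = "/dev/stdin"  # no filename given: fall back to standard input
else:
    input_file = args[0]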
Another solution is to just replace:
if not len(args) == 1:
parser.print_help()
return None, None
else:
input_file = args[0]
with
if not len(args) == 1:
input_file = sys.stdin
else:
input_file = open(args[0])
That means of course that the returned "input_file" is no longer a file name but a file object, which means further modifications in the calling function.
The first solution requires fewer modifications but is more platform-specific; the second is more work, but should work on more systems.
I'm guessing from the details of your question that you're asking about Python-Markdown, so I tracked down the relevant line in the source code for you: to do it Daniel's way, in line 443 of markdown/__init__.py, you'd want to replace
input_file = codecs.open(input, mode="r", encoding=encoding)
with
input_file = codecs.EncodedFile(sys.stdin, encoding)
Although then you wouldn't be able to actually process files afterwards, so for a more generally useful hack, you could put in a conditional:
if input:
input_file = codecs.open(input, mode="r", encoding=encoding)
else:
input_file = codecs.EncodedFile(sys.stdin, encoding)
and then you'd have to adjust markdown/commandline.py to not quit if it isn't given a filename: change lines 72-73
parser.print_help()
return None, None
to
input_file = None
The point is, it's not really a simple thing to do. At this point I was going to suggest using a special file like Mark Rushakoff did, if he hadn't beaten me to it ;-)
I suggest going here:
http://codaset.com/repo/python-markdown/tickets/new
And submitting a ticket requesting them to add the feature. It should be straightforward for them and so they might be willing to go ahead and do it.
In bash, you can also use process substitution:
markdown.py <(echo "Some text here")
For a single input, /dev/stdin works, but process substitution also applies to several inputs (and even outputs).
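For example, comparing two substituted inputs:
diff <(sort file1) <(sort file2)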