I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the quoting signs. Any suggestions?
Thanks!
While the question is answered already, I'll still jump in, because I assume you want to execute that bash script only because you do not have the functionally equivalent Python code (which is less than 40 lines, see below).
Why do this instead of the bash script?
Your script now is able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this (like your bash script) does no error checking and the output file is a global variable, but that can be changed easily.
import gzip
import os

# create our output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')

def process_line(line):
    """
    get the third column (delimiter is tab char) and write to output file
    """
    columns = line.split('\t')
    if len(columns) >= 3:
        # the third column is index 2 (this matches cut -f 3 in the bash version)
        outfile.write(columns[2] + '\n')

def process_zipfile(filename):
    """
    read zip file content (we assume text) and split into lines for processing
    """
    print('Reading {0} ...'.format(filename))
    with gzip.open(filename, mode='rb') as f:
        lines = f.read().decode('utf-8').split('\n')
        for line in lines:
            process_line(line.strip())

def process_directory(dirtuple):
    """
    loop thru the list of files in that directory and process any .gz file
    """
    print('Processing {0} ...'.format(dirtuple[0]))
    for filename in dirtuple[2]:
        if filename.endswith('.gz'):
            process_zipfile(os.path.join(dirtuple[0], filename))

# walk the directory tree from current directory downward
for dirtuple in os.walk('.'):
    process_directory(dirtuple)

outfile.close()
Escape the ' marks with a \.
i.e. For every: ', replace with: \'
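For example (a sketch for the command in the question, untested against your data): keep the whole command in a single-quoted Python string and escape every embedded ' as \'; the double quotes need no escaping.
import subprocess

subprocess.call(
    'find . -type d -exec bash -c \'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt\' {} \\;',
    shell=True)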
Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See here (yeah, it's Python 3 documentation, but it all works 100% in Python 2).
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
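Applied to the command from the question (again just a sketch), a raw triple-quoted string lets both quote styles and the backslash pass through untouched:
import subprocess

cmd = r"""find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;"""
subprocess.call(cmd, shell=True)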
Related
The problem: I want to iterate over a folder in search of a certain file type, execute each matching file with a program (passing name.ext as the argument), and then run a Python script that changes the output name of the first program.
I know there is probably a better way to do the above, but the way I thought of was this:
[BAT]
for /R "C:\..\folder" %%a IN (*.extension) do (
    SET name=%%a
    "C:\...\first_program.exe" "%%a"
    "C:\...\script.py" "%name%"
)
[PY]
import io
import sys
def rename(i):
    name = i
    with open('my_file.txt', 'r') as file:
        data = file.readlines()
    data[40] = '"C:\\\\Users\\\\UserName\\\\Desktop\\\\folder\\\\folder\\\\' + name + '"\n'
    with open('my_file.txt', 'w') as file:
        file.writelines(data)

if __name__ == "__main__":
    rename(sys.argv[1])
Expected result: I want the Python script to change the name each time, but once the value is put into the console it seems to stay the same for the script; the BAT does not change it, and that bothers me.
PS. If there is a better way, I'll be glad to get to know it.
This is the Linux bash version; I am sure you can change the loop etc. to make it work as a batch file. Instead of your *.exe I use cat as a generic input/output example.
#! /bin/sh
for f in *.txt
do
    suffix=".txt"
    name=${f%$suffix}
    cat $f > tmp.dat
    awk -v myName=$f '{if(NR==5) print $0 myName; else print $0 }' tmp.dat > $name.dat
done
This produces "unique" output *.dat files named after the input *.txt files. The files are processed by cat (standing in for your *.exe) and the output is put into a temporary file. That file is then handled by awk, which changes line 5 here, with the result placed in the uniquely named file mentioned above.
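Since the rest of the thread uses Python anyway, here is a rough Python sketch of the same loop (the program names and the extension are placeholders for the ones in your question, and this is untested):
import glob
import subprocess

for path in glob.glob('*.extension'):
    # run the first program on the file, then the rename script with the same name,
    # mirroring the two calls inside the batch loop
    subprocess.call(['first_program.exe', path])
    subprocess.call(['python', 'script.py', path])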
I'm trying to automate research on a list of domains I have (the list is a .txt file, about 350-400 lines).
I need to run the same command (which uses a py script) for each line in the txt file. Something like this:
import os
with open('/home/dogher/Desktop/copia.txt') as f:
    for line in f:
        process(line)
        os.system("/home/dogher/Desktop/theHarvester-master/theHarvester.py -d "(line)" -l 300 -b google -f "(line)".html")
I know the syntax with "os.system" is wrong, but I don't know how to insert the text of the line into the command.
Thanks so much, and sorry for my bad English.
import os
with open('data.txt') as f:
    for line in f:
        os.system('python other.py ' + line)
If the contents of other.py are as follows:
import sys
print sys.argv[1]
then the output of the first code snippet would be the contents of your data.txt.
I hope this is what you wanted; instead of simply printing the line, you can process it too.
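A small variation on the same idea (a sketch): each line read from the file still carries its trailing newline, so you may want to strip it before building the command.
import os

with open('data.txt') as f:
    for line in f:
        # strip the newline so it is not passed as part of the argument
        os.system('python other.py ' + line.strip())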
Since the question is tagged linux, I suggest a way to do what you want using bash.
process_file.sh:
#!/bin/bash
# your input file
input=my_file.txt

# your python script
py_script=script.py

# use each line of the file `input` as argument of your script
while read line
do
    python $py_script $line
done < "$input"
You can access the passed line in Python as follows:
script.py:
import sys
print sys.argv[1]
Hope the solution below is helpful for you:
import subprocess

with open('name.txt') as fp:
    for line in fp:
        subprocess.check_output('python name.py {}'.format(line), shell=True)
Sample files I have used:
name.py
import sys
name = sys.argv[1]
print name
name.txt:
harry
kat
patrick
Your approach subjects each line of your file to evaluation by the shell, which will break when (not if) it comes across a line with any of the characters with special meaning to the shell: spaces, quotes, parentheses, ampersands, semicolons, etc. Even if today's input file doesn't contain any such character, your next project will. So learn to do this correctly today:
import subprocess

for line in openfile:
    line = line.rstrip('\n')  # drop the trailing newline read from the file
    subprocess.call(["/home/dogher/Desktop/theHarvester-master/theHarvester.py",
                     "-d", line, "-l", "300", "-b", "google", "-f", line + ".html"])
Since the command line arguments do not need to be parsed, subprocess will execute your command without involving a shell.
Having some issues calling awk from within Python. Normally, I'd do the following to run the awk command from the command line.
Open up command line, in admin mode or not.
Change my directory to the one containing awk.exe, namely cd R\GnuWin32\bin
Call awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv
My command is used to split up the large.csv file based on the 10th column into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call() but I'm having some issues. I run the following code:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"',
                     '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')
and clearly, something is running when I execute the function (CPU usage, etc) but when I go to check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea on what's going wrong?
As I stated in my previous answer that was downvoted, you overprotect the arguments, making awk argument parsing fail.
Since there was no comment, I supposed there was a typo in it, but it worked... So I suppose the downvote came because I should have strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).
Writing the equivalent in python is not trivial as we have to emulate the way awk opens files and appends to them afterwards. But it is more integrated, pythonic and handles quoting properly if quoting occurs in the input file.
I took the time to code & test it:
import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)
    open_files = dict()
    with open('large.csv') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_row = r[9]
            filename = "split-{}.csv".format(tenth_row)
            if not filename in open_files:
                handle = open(filename, "wb")
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)
    for f, _ in open_files.values():
        f.close()
split_ByInputColumn()
in detail:
read the big file as csv (advantage: quoting is handled properly)
compute the destination filename
if filename not in dictionary, open it and create csv.writer object
write the row with the corresponding csv.writer stored in the dictionary
in the end, close file handles
Aside: My old solution, using awk properly:
import subprocess
def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'], cwd = 'some_directory')
Someone else posted an answer (and then subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > (\"split-\" $10 \".csv\") }', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')
I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input with spaces as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and I have to pass all of them as input to this command. I used
Python's os.system in a for loop over the directory listing, as follows.
for i,f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' %(f,i))
This works fine only for the first file; for all the other outputs like annotate_1, annotate_2 it creates the file with nothing inside it. I thought of looping over the files and passing each one to subprocess.Popen(), but that seems hopeless.
Now I am thinking of passing the files one by one in a loop and executing the command sequentially from a bash script. I am also wondering whether I can execute 10 files (at least) in parallel in different terminals at a time. Any solution is fine, but I think this question will help me gain some insight into different approaches.
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.
You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.
But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
To pass all .txt files in the current directory at once to the java subprocess:
#!/usr/bin/env python
from glob import glob
from subprocess import check_call
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)
It is similar to running the shell command but without running the shell:
$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident  # Python 3.3+
except ImportError:  # Python 2
    from thread import get_ident

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()

def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)

all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
It is similar to this xargs command (suggested by #abarnert):
$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt
except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"
procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()
So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
But now how do you get all of the results?
One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:
procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()
(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
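For reference, a sketch of that ExitStack variant (Python 3.3+; the Popen argument list is the same as before, so treat this as an illustration rather than part of the original answer):
import contextlib
import subprocess

with contextlib.ExitStack() as stack:
    procs = []
    for i, group in enumerate(groups):
        # files opened through the stack are closed automatically when the block exits
        file = stack.enter_context(open('output_{}'.format(i), 'w'))
        procs.append(subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                                       'edu.stanford.nlp.process.PTBTokenizer'] + group,
                                      stdout=file))
    for proc in procs:
        proc.wait()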
Inside your input file directory you can do the following in bash:
#!/bin/bash
for file in *.txt
do
    input=$input" \"$file\""
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > output.txt
If you want to run it as a script, save the file with some name, e.g. my_exec.bash:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi

if [ ! -d $1 ]; then
    echo "Please pass a valid directory"
    exit 1
fi

for file in $1*.txt
do
    input=$input" \"$file\""
done

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > $2
Make it an executable file
chmod +x my_exec.bash
USAGE:
./my_exec.bash <folder> <output_file>
I am converting some Python scripts I wrote in a Windows environment to run in Unix (Red Hat 5.4), and I'm having trouble converting the lines that deal with filepaths. In Windows, I usually read in all .txt files within a directory using something like:
pathtotxt = "C:\\Text Data\\EJC\\Philosophical Transactions 1665-1678\\*\\*.txt"
for file in glob.glob(pathtotxt):
It seems one can use the glob.glob() method in Unix as well, so I'm trying to implement this method to find all text files within a directory entitled "source" using the following code:
#!/usr/bin/env python
import commands
import sys
import glob
import os
testout = open('testoutput.txt', 'w')
numbers = [1,2,3]
for number in numbers:
    testout.write(str(number + 1) + "\r\n")
testout.close
sourceout = open('sourceoutput.txt', 'w')
pathtosource = "/afs/crc.nd.edu/user/d/dduhaime/data/hill/source/*.txt"
for file in glob.glob(pathtosource):
    with open(file, 'r') as openfile:
        readfile = openfile.read()
        souceout.write (str(readfile))
sourceout.close
When I run this code, the testout.txt file comes out as expected, but the sourceout.txt file is empty. I thought the problem might be solved if I change the line
pathtosource = "/afs/crc.nd.edu/user/d/dduhaime/data/hill/source/*.txt"
to
pathtosource = "/source/*.txt"
and then run the code from the /hill directory, but that didn't resolve my problem. Do others know how I might be able to read in the text files in the source directory? I would be grateful for any insights others can offer.
EDIT: In case it is relevant, the /afs/ tree of directories referenced above is located on a remote server that I'm ssh-ing into via Putty. I'm also using a test.job file to qsub the Python script above. (This is all to prepare myself to submit jobs on the SGE cluster system.) The test.job script looks like:
#!/bin/csh
#$ -M dduhaime#nd.edu
#$ -m abe
#$ -r y
#$ -o tmp.out
#$ -e tmp.err
module load python/2.7.3
echo "Start - `date`"
python tmp.py
echo "Finish - `date`"
Got it! I had misspelled the output command. I wrote
souceout.write (str(readfile))
instead of
sourceout.write (str(readfile))
What a dunce. I also added a newline bit to the line:
sourceout.write (str(readfile) + "\r\n")
and it works fine. I think it's time for a new IDE!
You haven't really closed the files. The method testout.close() isn't called, because you have forgotten the parentheses. The same goes for sourceout.close().
testout.close
...
sourceout.close
Has to be:
testout.close()
...
sourceout.close()
When the program finishes, all files are closed automatically, so the missing call only matters if you reopen the file before the program ends.
Even better (the pythonic version) would be to use the with statement. Instead of this:
testout = open('testoutput.txt', 'w')
numbers = [1,2,3]
for number in numbers:
    testout.write(str(number + 1) + "\r\n")
testout.close()
you would write this:
with open('testoutput.txt', 'w') as testout:
    numbers = [1,2,3]
    for number in numbers:
        testout.write(str(number + 1) + "\r\n")
In this case the file will be automatically closed even when an error occurs.