I am trying to execute a Python script in parallel, using the lines of a file as its arguments. The file is named experiments.txt and might look like this:
--x_timesteps 3 --y_timesteps 3 --exp_path ./logs
--x_timesteps 4 --y_timesteps 3 --exp_path ./logs
--x_timesteps 5 --y_timesteps 3 --exp_path ./logs
--x_timesteps 6 --y_timesteps 3 --exp_path ./logs
I want to speed up the processing by using xargs; however, I don't know how to do this with file input. How can I parallelize a python script by reading line by line from the file and piping to xargs?
I know I can solve this problem with a simple for-loop; however, I need to know how to do it with file input.
Typing this into the command line in the appropriate directory works:
for x in {3..6}; \
do printf '%s\0' "--x_timesteps=${x}" "--y_timesteps=3" "--exp_path=./logs"; \
done | xargs -0 -n 3 -P 8 python script.py
The for-loop style parallelization is derived from the answer to "Using xargs for parallel Python scripts".
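If you would rather read the arguments straight from experiments.txt instead of regenerating them in a loop, GNU xargs can split its input by line; a minimal sketch (assuming the lines contain no quote characters, which xargs would otherwise try to interpret):
# one invocation per line of experiments.txt, up to 8 jobs in parallel
xargs -L 1 -P 8 python script.py < experiments.txt
With -L 1, xargs treats each input line as one whitespace-separated argument list, so every line of experiments.txt becomes one python script.py run.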
IMHO, it is simpler and more controllable with GNU Parallel like this:
parallel --dry-run --colsep ' ' python script.py :::: experiments.txt
You can simply add or remove --dry-run to debug. You can add --eta or --bar for progress reports. You can distribute tasks across multiple hosts. You can easily do fail/retry processing. You can extract basenames, filenames, directory names from parameters. You can do permutations of parameters. And so on.
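For example, a sketch combining a few of those options (the jobs.log name is arbitrary):
# progress bar, per-job log, retry failed jobs up to 2 times
parallel --bar --joblog jobs.log --retries 2 --colsep ' ' python script.py :::: experiments.txt
The joblog records exit codes and runtimes per job, and --retries re-runs jobs that failed.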
Related
I have a Python script A.py that takes as an argument a target file containing a list of IPs and outputs a CSV file with information found about the IPs from some sources. (Run method: python A.py Input.txt -c Output.csv.)
It took ages to get the work done. Later, I split the input file (split -l 1000 Input.txt), created 10 directories, and executed the script in parallel on the split inputs in the 10 directories in screen mode.
How can I do this kind of job efficiently? Any suggestions, please?
Try this:
parallel --round --pipepart -a Input.txt --cat python A.py {} -c {#}.csv
If A.py can read from a fifo then this is more efficient:
parallel --round --pipepart -a Input.txt --fifo python A.py {} -c {#}.csv
If your disk has long seek times then it might be faster to use --pipe instead of --pipepart.
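A sketch of the --pipe variant, feeding chunks of roughly 1000 lines to each job (the -N1000 value is only an illustration):
parallel --pipe -N1000 --cat python A.py {} -c {#}.csv < Input.txt
As with --pipepart, --cat writes each chunk to a temporary file that is substituted for {}, and {#} is the job's sequence number.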
I am trying to run a piece of Python code from a Bash script, so I wanted to understand the difference between:
#!/bin/bash
#your bash code
python -c "
#your py code
"
vs
python - <<DOC
#your py code
DOC
I checked the web but couldn't piece together the bits I found on the topic. Is one better than the other?
And if you want to return a value from the Python code block to your Bash script, is a heredoc the only way?
The main flaw of using a here document is that the script's standard input will be the here document. So if you have a script which wants to process its standard input, python -c is pretty much your only option.
On the other hand, using python -c '...' ties up the single-quote for the shell's needs, so you can only use double-quoted strings in your Python script; using double-quotes instead to protect the script from the shell introduces additional problems (strings in double-quotes undergo various substitutions, whereas single-quoted strings are literal in the shell).
As an aside, notice that you probably want to single-quote the here-doc delimiter, too, otherwise the Python script is subject to similar substitutions.
python - <<'____HERE'
print("""Look, we can have double quotes!""")
print('And single quotes! And `back ticks`!')
print("$(and what looks to the shell like process substitutions and $variables!)")
____HERE
As an alternative, escaping the delimiter works identically, if you prefer that (python - <<\____HERE)
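Regarding the other part of the question, returning a value to the shell: both forms write to standard output, so plain command substitution works with either. A minimal sketch:
result=$(python - <<'____HERE'
print(40 + 2)
____HERE
)
echo "Python said: $result"   # Python said: 42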
If you are using bash, you can avoid heredoc problems if you apply a little more boilerplate:
python <(cat <<EoF
name = input()
print(f'hello, {name}!')
EoF
)
This will let you run your embedded Python script without giving up the standard input. The overhead is about the same as using cmda | cmdb. This technique is known as Process Substitution.
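To see that standard input really does stay available, pipe some data into the same construct; a quick sketch:
echo Alice | python <(cat <<EoF
name = input()
print(f'hello, {name}!')
EoF
)
Here the code arrives through the process-substitution file descriptor while input() still reads "Alice" from the pipe.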
If you want to be able to validate the script somehow, I suggest that you dump it to a temporary file:
#!/bin/bash
temp_file=$(mktemp my_generated_python_script.XXXXXX.py)
cat > "$temp_file" <<EoF
# embedded python script
EoF
python3 "$temp_file" && rm "$temp_file"
This will keep the script if it fails to run.
If you prefer to use python -c '...' without having to escape the double quotes, you can first load the code into a bash variable using a here-document:
read -r -d '' CMD << '--END'
print ("'quoted'")
--END
python -c "$CMD"
The python code is loaded verbatim into the CMD variable and there's no need to escape double quotes.
How to use here-docs with input
tripleee's answer has all the details, but there are Unix tricks to work around this limitation:
So if you have a script which wants to process its standard input, python -c is pretty much your only option.
This trick applies to all programs that want to read from a redirected stdin (e.g., ./script.py < myinputs) and also take user input:
python - <<'____HERE'
import os
os.dup2(1, 0)
print(input("--> "))
____HERE
Running this works:
$ bash heredocpy.sh
--> Hello World!
Hello World!
If you want to get the original stdin, run os.dup(0) first.
This works because as long as either stdout or stderr are a tty, one can read from them as well as write to them. (Otherwise, you could just open /dev/tty. This is what less does.)
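A sketch of that os.dup(0) variant for the ./script.py < myinputs case (the demo_dup.py and demo_input.txt names are made up for the example):
printf 'data from the redirected file\n' > demo_input.txt
cat > demo_dup.py <<'____HERE'
import os
saved = os.fdopen(os.dup(0))  # duplicate the redirected stdin before clobbering fd 0
os.dup2(1, 0)                 # fd 0 now points at the tty (as long as stdout is a tty)
answer = input("--> ")        # this prompt is answered interactively
print("you typed:", answer)
print("the file said:", saved.read().rstrip())
____HERE
python demo_dup.py < demo_input.txt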
In case you want to process inputs from a file instead, that's possible too -- you just have to use a new fd:
Example with a file
cat <<'____HERE' > file.txt
With software there are only two possibilities:
either the users control the programme
or the programme controls the users.
____HERE
python - <<'____HERE' 4< file.txt
import os
for line in os.fdopen(4):
    print(line.rstrip().upper())
____HERE
Example with a command
Unfortunately, pipelines don't work here -- but process substitution does:
python - <<'____HERE' 4< <(fortune)
import os
for line in os.fdopen(4):
    print(line.rstrip().upper())
____HERE
I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input, separated by spaces, as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and I have to pass all of them as input to this command. I used Python's os.system in a for loop over the directory, as follows.
for i, f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' % (f, i))
This works fine only for the first file. For all the other outputs, like annotate_1 and annotate_2, it creates only an empty file. I thought of looping over the files and passing each to subprocess.Popen(), but that seems hopeless.
Now I am thinking of passing the files one by one in a bash script so the command executes sequentially. I am also wondering whether I can execute at least 10 files in parallel in different terminals at a time. Any solution is fine, but I think this question will help me gain some insight into the different approaches.
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.
You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.
But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
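If you do want a separate output file per batch, one common workaround (a sketch, not the only way) is to wrap the command in sh -c so each batch gets its own redirection; here the annotate_$$ name just reuses each inner shell's PID to keep the files distinct:
ls | xargs --max-args=10 --max-procs=10 sh -c \
    'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "$@" > "annotate_$$.txt"' _
The trailing _ fills $0 of the inner shell, so "$@" expands to exactly the batch of filenames xargs appended.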
To pass all .txt files in the current directory at once to the java subprocess:
#!/usr/bin/env python
from glob import glob
from subprocess import check_call
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)
It is similar to running the shell command but without running the shell:
$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident  # Python 3.3+
except ImportError:  # Python 2
    from thread import get_ident

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()

def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)

all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
It is similar to this xargs command (suggested by @abarnert):
$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt
except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"
procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()
So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
But now how do you get all of the results?
One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:
procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()
(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
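For reference, the ExitStack version mentioned above might look roughly like this (Python 3.3+, reusing the same groups list as before):
from contextlib import ExitStack
import subprocess

with ExitStack() as stack:
    procs = []
    for i, group in enumerate(groups):
        # the stack closes every file automatically when the with-block exits
        file = stack.enter_context(open('output_{}'.format(i), 'w'))
        procs.append(subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                                       'edu.stanford.nlp.process.PTBTokenizer'] + group,
                                      stdout=file))
    for proc in procs:
        proc.wait()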
Inside your input file directory you can do the following in bash:
#!/bin/bash
input=()
for file in *.txt
do
    input+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${input[@]}" > output.txt
If you want to run it as a script, save the file with some name, e.g. my_exec.bash:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "Please pass a valid directory"
    exit 1
fi
input=()
for file in "$1"/*.txt
do
    input+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${input[@]}" > "$2"
Make it an executable file
chmod +x my_exec.bash
USAGE:
./my_exec.bash <folder> <output_file>
I am on a Linux (Ubuntu 11.10) machine, using the Bourne Again Shell (bash).
I have to process a directory full of files with a python script. My colleague wrote the python script and I have successfully used it before on one file at a time. It takes two arguments: a path to the file to be processed enclosed in quotes and a secondary argument called -min which requires an integer. Also, the script writes to standard out.
From my experience of shell scripting and following others on this forum, I used the following method to iterate over the directory of files:
for f in path/to/data_directory/*; do
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f;
done
I get the desired file names in the out_directory. The content of each is something only the Python script could have written; that is, the above for loop successfully passes the files to the script. However, the content of each file is completely wrong (as in, the computation the script performed was wrong). When I run the Python script on one of the files in the data_directory directly, the output file has the correct content (the computation performed by the script is correct).
The thing that makes it more complex is that the same shell method (the for loop) works perfectly on the Mac OS X machine my colleague has.
Where is the issue? Am I missing something very fundamental about Linux shells? Maybe it's a syntax error?
Any help will be appreciated.
Update: I just ran the for loop again but instead of pointing it to the data_directory of files, I pointed it to a file within the data_directory. I had the same problem - the python script did not compute the correct result.
The only problem I see is that filenames may contain whitespace, so you should quote the filenames:
for f in path/to/data_directory/*; do
path/to/pythonscript.py "$f" -min 1 > "path/to/out_directory/$f"
done
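A quick way to see the difference (the 'my file.txt' name is just for demonstration):
touch 'my file.txt'
for f in *; do printf '<%s>\n' $f; done     # unquoted: 'my file.txt' comes out as <my> and <file.txt>
for f in *; do printf '<%s>\n' "$f"; done   # quoted: it stays one word, <my file.txt>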
Well, I don't know if this helps, but:
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
expands to
path/to/pythonscript.py path/to/data_directory/myfile -min 1 > path/to/out_directory/path/to/data_directory/myfile
The script should be:
cd path/to/data_directory
for f in *; do
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
done
What version of bash are you running? What do you get if you run this script?
cd path/to/data_directory
for f in *; do
echo $f > /tmp/$f
done
Of course, that should give you a bunch of files containing their own file names.
I ran svn status and got the modified files:
svn status
?       .settings
?       .buildpath
?       .directory
A       A.php
M       B.php
D       html/C.html
M       html/D.fr
M       api/E.api
M       F.php
..
After that I want to archive all of these files:
tar zcvf MY.tar.gz (all the files that svn status displays)
(not including ?, just M, A, D)
My idea is to create a Python script that can run the shell commands, because right now I just copy the file names one by one.
tar zcvf MY.tar.gz (all the files listed by svn status)
Could anybody guide me on how to do this, or point me to a related tutorial? If you think it is more difficult than copy and paste, I will give up trying.
Thanks.
I don't see why you would use Python for this when you can do it with a single line of shell code.
svn status | grep "^[AMD]" | sed 's/^.\{8\}//' | xargs tar zcvf My.tar.gz
I used grep to select only the lines whose status is A, M, or D; if you want every file that svn status lists (including the ? entries) you can leave that part out. I used sed to delete the status columns at the start of each line; I'm sure there is an easier way to do that, but I can't think of one right now.
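An equivalent awk-only filter (assuming none of the paths contain whitespace), with the grep and sed steps folded into one:
svn status | awk '/^[AMD]/ {print $NF}' | xargs tar zcvf My.tar.gz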
Once you figure out your command as a string, you can just call it with the subprocess module. This module spawns child processes and allows you to control them. From there it's up to you.
You could use check_output() and check_call() functions:
#!/usr/bin/env python
from subprocess import check_call, check_output as qx
filenames = [line[8:]  # extract filename
             for line in qx(['svn', 'status']).splitlines()
             if not line.startswith('?')]  # exclude files that are not under VC
check_call(['tar', 'cvfz', 'MY.tar.gz'] + filenames)
On Python < 2.7 see check_output() recipe.
subprocess is the Pythonic way, but a small bash one-liner could be shorter.
Bash one-liner
svn status | egrep "^ M" | awk '{s=s $2 " "} END {print "tar cvfz MY.tar "s}'
Subprocess
import subprocess as sp

p = sp.Popen('svn status', shell=True, stdout=sp.PIPE,
             stderr=sp.PIPE).communicate()[0]
files = []
for line in p.splitlines():            # iterate over lines, not characters
    if line.strip().find('M') == 0:    # keep only the modified files
        files.append(line.split()[1])  # split on runs of whitespace
files = ' '.join(files)
sp.Popen('tar cvfz MY.tar.gz ' + files, shell=True).communicate()