Use GNU parallel to parallelise a bash for loop - python

I have a for loop which runs a Python script ~100 times on 100 different input folders. The python script is most efficient on 2 cores, and I have 50 cores available. So I'd like to use GNU parallel to run the script on 25 folders at a time.
Here's my for loop (it works fine, but is of course sequential); the Python script takes a bunch of input arguments, including -p 2, which runs it on two cores:
for folder in $(find /home/rob/PartitionFinder/ -maxdepth 2 -type d); do
python script.py --raxml --quick --no-ml-tree $folder --force -p 2
done
and here's my attempt to parallelise it, which does not work:
folders=$(find /home/rob/PartitionFinder/ -maxdepth 2 -type d)
echo $folders | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
The issue I'm hitting (perhaps it's just the first of many though) is that my folders variable is not a list, so it's really just passing a long string of 100 folders as the {} to the script.
All hints gratefully received.

Replace echo $folders | parallel ... with echo "$folders" | parallel ....
Without the double quotes, the shell word-splits $folders and passes each folder as a separate argument to echo, which prints them all on a single line. With the quotes, the newlines produced by find are preserved, and parallel provides each line as the argument to one job.
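You can see the difference directly in the shell:
echo $folders     # unquoted: word-split, everything printed on one line
echo "$folders"   # quoted: newlines preserved, one folder per line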
To avoid such quoting issues altogether, it is always a good idea to pipe find to parallel directly, and use the null character as the delimiter:
find ... -print0 | parallel -0 ...
This will work even when encountering file names that contain multiple spaces or a newline character.
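For the command in the question, that combination looks like this:
find /home/rob/PartitionFinder/ -maxdepth 2 -type d -print0 | parallel -0 -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2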

You can pipe find directly to parallel:
find /home/rob/PartitionFinder/ -maxdepth 2 -type d | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
If you want to keep the string in $folders, you can pipe the echo through xargs to split it into one folder per line:
echo $folders | xargs -n 1 | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2

You can create a Makefile like this:
#!/usr/bin/make -f
FOLDERS=$(shell find /home/rob/PartitionFinder/ -maxdepth 2 -type d)
all: ${FOLDERS}
# To execute the find before the all
find_folders:
	@echo $(FOLDERS) > /dev/null
${FOLDERS}: find_folders
	python script.py --raxml --quick --no-ml-tree $@ --force -p 2
and then run make -j 25
Be careful: use tabs, not spaces, to indent the recipe lines in your Makefile.
Also, files with spaces in the name won't work.

Related

How to measure execution time in a shell script in Python?

I created this shell script and want to measure the time spent running the programs:
#!/bin/bash
start=`date +%s`
cd nlu/creator
python3 move_file.py ../../../../../base_data/ ../../resources/kb/
python3 converter.py
python3 transformer.py
cd ../../resources/kb/
find . -name '*.xml' | xargs -I{} rm -rf {}
find . -name '*Object.txt' | xargs -I{} rm -rf {}
end =`date +%s`
runtime=$((end-start))
echo "Building time: ${runtime}"
I execute:
nlu/creator/builder.sh
The error message is:
nlu/creator/builder.sh: line 15: end: command not found
Building time: -1651032434
Why does it complain about 'end: command not found'?
Also, why is the time a negative number? I am on a Mac.
Why does it complain?
This is general shell script syntax: in general, P X Y runs the command P with the two arguments X and Y. Similarly, end = something runs the command end with the two arguments = and something.
Assignment is done by VAR=VALUE, for instance
end=something
If you one day need to execute a command named end=something (i.e. not treat it as an assignment), you still can, by quoting it:
'end=something'
As for the negative number: since the assignment failed, end was never set, and an unset variable evaluates to 0 inside the arithmetic expansion $((end-start)), so you get 0 minus the start timestamp.
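Applied to the script above, the fix is simply removing the space before the =:
end=`date +%s`
runtime=$((end-start))
echo "Building time: ${runtime}"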
BTW: If you are only interested in knowing the time and are not picky about a particular format of how the time is printed, you could also remove your timing calculation from the script completely, and run the script with the time builtin:
time nlu/creator/builder.sh
Give it a try!

Running python scripts for different input directory through bash terminal

I am trying to automate my task through the terminal using bash. I have a python script that takes two parameters (paths of input and output) and then the script runs and saves the output in a text file.
All the input directory names start with "g-", whereas the output directory stays the same.
So, I want to write a script that could run on its own so that I don't have to manually run it on hundreds of directories.
$ python3 program.py ../g-changing-directory/ ~/static-directory/ > ~/static-directory/final/results.txt
You can do it like this:
find .. -maxdepth 1 -type d -name "g-*" | xargs -n1 -P1 -I{} python3 program.py {} ~/static-directory/ >> ~/static-directory/final/results.txt
find .. will look in the parent directory; -maxdepth 1 will look only at the top level and not descend into subdirectories; -type d only takes directories; -name "g-*" takes objects starting with g- (use -iname "g-*" if you want objects starting with g- or G-).
We pipe it to xargs which will apply the input from stdin to the command specified. -n1 tells it to start a process per input word, -P1 tells it to only run one process at a time, -I{} tells it to replace {} with the input in the command.
Then we specify the command to run for the input, where {} is replaced by xargs: python3 program.py {} ~/static-directory/ >> ~/static-directory/final/results.txt. Have a look at the >>: it appends to the file if it exists, while > would overwrite it.
With -P4 you could start four processes in parallel. But you do not want to do that here, as all processes write into one file and their output can get interleaved. If every process wrote into its own file, multi-processing would be safe.
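For instance, a parallel-safe variant could write one result file per input directory (a sketch; naming each output file after its input directory is an assumption, not something the question requires):
find .. -maxdepth 1 -type d -name "g-*" |
    xargs -P4 -I{} sh -c 'python3 program.py "$1" ~/static-directory/ > ~/static-directory/final/"$(basename "$1")".txt' _ {}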
Refer to man find and man xargs for further details.
There are many other ways to do this as well, e.g. a for loop like this:
for F in $(ls .. | grep -oP "g-.*"); do
python3 program.py "../$F" ~/static-directory/ >> ~/static-directory/final/results.txt
done
There are many ways to do this; here's what I would write:
find .. -type d -name "g-*" -exec python3 program.py {} ~/static-directory/ \; > ~/static-directory/final/results.txt
You haven't mentioned whether you want nested directories to be included; if not, you have to add the -maxdepth parameter as in @toydarian's answer.

How to pass parameters from xargs to python script?

I have a command.list file with command parameters for my Python script my_script.py, which takes 3 parameters.
One line of it looks like:
<path1> <path2> -sc 4
It looks like it does not work this way, presumably because the parameters need to be split:
cat command.list | xargs -I {} python3 my_script.py {}
How do I split the string into parameters and pass them to the Python script?
What about cat command.list | xargs -L 1 python3 my_script.py? This will pass one line (-L 1) at a time to your script.
The documentation of -I from man xargs
-I replace-str
Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.
What you want is
xargs -L1 python3 my_script.py
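You can see the difference with a quick experiment (a b c stands in for your three parameters):
# with -I, the whole line becomes one single argument
printf '%s\n' 'a b c' | xargs -I{} python3 -c 'import sys; print(sys.argv[1:])' {}
# prints: ['a b c']
# with -L1, the line is split into separate arguments
printf '%s\n' 'a b c' | xargs -L1 python3 -c 'import sys; print(sys.argv[1:])'
# prints: ['a', 'b', 'c']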
By the way: cat is not necessary. Use one of the following commands
< command.list xargs -L1 python3 my_script.py
xargs -a command.list -L1 python3 my_script.py
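If you want to preview the commands before running anything, prefix them with echo; xargs will then just print each assembled command line:
xargs -a command.list -L1 echo python3 my_script.py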
Not sure what you are trying to do with xargs -I {} python3 my_script.py {} there.
But are you looking for something like this?
$ cat file
<path1> <path2> -sc 4
....
<path1n> <path2n> -sc 4
$ while read -r path1 path2 unwanted unwanted; do python3 my_script.py "$path2"; done < file

How to add argparse flags based on shell find results

I have a python script that accepts a -f flag, and appends multiple uses of the flag.
For example, if I run python myscript -f file1.txt -f file2.txt, I would have a list of files, files=['file1.txt', 'file2.txt']. This works great, but I am wondering how I can automatically use the results of a find command to append as many -f flags as there are files.
I've tried:
find ./ -iname '*.txt' -print0 | xargs python myscript.py -f
But it only grabs the first file
With the caveat that this will fail if there are more files than will fit on a single command line (whereas xargs would run myscript.py multiple times, each with a subset of the full list of arguments):
#!/usr/bin/env bash
args=( )
while IFS= read -r -d '' name; do
  args+=( -f "$name" )
done < <(find . -iname '*.txt' -print0)
python myscript.py "${args[@]}"
If you want to do this safely in a way that tolerates an arbitrary number of filenames, you're better off using a long-form option -- such as --file rather than -f -- with the = separator allowing the individual name to be passed as part of the same argv entry, thus preventing xargs from splitting a filename apart from the sigil that precedes it:
#!/usr/bin/env bash
# This requires -printf, a GNU find extension
find . -iname '*.txt' -printf '--file=%p\0' | xargs -0 python myscript.py
...or, more portably (running on MacOS, albeit still requiring a shell -- such as bash -- that can handle NUL-delimited reads):
#!/usr/bin/env bash
# requires find -print0 and xargs -0; these extensions are available on BSD as well as GNU
find . -iname '*.txt' -print0 |
while IFS= read -r -d '' f; do printf '--file=%s\0' "$f"; done |
xargs -0 python myscript.py
Your title seems to imply that you can modify the script. In that case, use the nargs (number of args) option to allow more arguments for the -f flag:
parser = argparse.ArgumentParser()
parser.add_argument('--files', '-f', nargs='+')
args = parser.parse_args()
print(args.files)
Then you can use your find command easily:
15:44 $ find . -depth 1 | xargs python args.py -f
['./args.py', './for_clint', './install.sh', './sys_user.json']
Otherwise, if you can't modify the script, see @CharlesDuffy's answer.

Restart a kivy program (python) with inotify

I fear that my question is a duplicate but I can't find the answer. Maybe you can help me?
I would like to restart my kivy-program if I save the kv or py file.
I tried with
inotifywait -mq -e close_write /home/name/kivy/ | while read FILE
do
pkill python
python /home/name/kivy/main.py
done
If I change a file the first time, main.py starts, but if I change it again I need to close the program by hand before it restarts.
Instead of pkill python I also tried to use
kill $(ps aux | pgrep '[p]ython' | awk '{print $2}')
but with the same result and the problem that the mintMenu.py is closing, too.
Should I use something totally different to inotify?
I'm using entr to achieve the same thing. Once installed (e.g. via brew), just run the following command in your work directory /home/name/kivy/:
find . -name "*.py" -or -name "*.kv" | entr sh -c "pkill -f 'python main.py'; python main.py &"
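Note that entr can also restart the child process itself via its -r flag, which avoids the pkill entirely:
find . -name "*.py" -or -name "*.kv" | entr -r python main.py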
