How to add argparse flags based on shell find results - python

I have a Python script that accepts a -f flag, and appends multiple uses of the flag.
For example, if I run python myscript.py -f file1.txt -f file2.txt, I get a list of files, files=['file1.txt', 'file2.txt']. This works great, but I'm wondering how I can automatically use the results of a find command to append as many -f flags as there are files.
I've tried:
find ./ -iname '*.txt' -print0 | xargs python myscript.py -f
But it only grabs the first file.
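(For reference, a script with the behavior described presumably looks something like this minimal sketch; the actual code isn't shown in the question:)
import argparse

parser = argparse.ArgumentParser()
# action='append' collects one filename per use of -f into a list.
parser.add_argument('-f', dest='files', action='append')
args = parser.parse_args()
print(args.files)  # e.g. ['file1.txt', 'file2.txt']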

With the caveat that this will fail if there are more files than will fit on a single command line (whereas xargs would run myscript.py multiple times, each with a subset of the full list of arguments):
#!/usr/bin/env bash
args=( )
while IFS= read -r -d '' name; do
  args+=( -f "$name" )
done < <(find . -iname '*.txt' -print0)
python myscript.py "${args[@]}"
If you want to do this safely in a way that tolerates an arbitrary number of filenames, you're better off using a long-form option -- such as --file rather than -f -- with the = separator. That way each name is passed as part of the same argv entry as the option itself, so xargs can never split a filename apart from the flag that precedes it:
#!/usr/bin/env bash
# This requires -printf, a GNU find extension
find . -iname '*.txt' -printf '--file=%p\0' | xargs -0 python myscript.py
...or, more portably (this also works on macOS, albeit still requiring a shell -- such as bash -- that can handle NUL-delimited reads):
#!/usr/bin/env bash
# requires find -print0 and xargs -0; these extensions are available on BSD as well as GNU
find . -iname '*.txt' -print0 |
while IFS= read -r -d '' f; do printf '--file=%s\0' "$f"; done |
xargs -0 python myscript.py
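For the --file=NAME form to be accepted on the Python side, the parser just needs the long option registered; argparse handles the = separator natively. A minimal sketch, assuming the same append behavior as the original -f:
import argparse

parser = argparse.ArgumentParser()
# Both --file=name and -f name append to the same list.
parser.add_argument('--file', '-f', dest='files', action='append')
args = parser.parse_args()
print(args.files)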

Your title seems to imply that you can modify the script. In that case, use the nargs (number of args) option to allow more arguments for the -f flag:
parser = argparse.ArgumentParser()
parser.add_argument('--files', '-f', nargs='+')
args = parser.parse_args()
print(args.files)
Then you can use your find command easily:
$ find . -depth 1 | xargs python args.py -f
['./args.py', './for_clint', './install.sh', './sys_user.json']
(Note: -depth 1 is BSD find syntax; with GNU find, use -maxdepth 1 instead.)
Otherwise, if you can't modify the script, see @CharlesDuffy's answer.

Related

Running python scripts for different input directory through bash terminal

I am trying to automate my task through the terminal using bash. I have a python script that takes two parameters (paths of input and output) and then the script runs and saves the output in a text file.
All the input directories have names that start with "g-", whereas the output directory remains static.
So, I want to write a script that could run on its own so that I don't have to manually run it on hundreds of directories.
$ python3 program.py ../g-changing-directory/ ~/static-directory/ > ~/static-directory/final/results.txt
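(program.py itself isn't shown; for testing the shell commands below, a hypothetical stand-in could be as simple as:)
import sys

# Hypothetical stand-in for program.py: echo the two path arguments.
# Output goes to stdout, which the calling shell redirects to a file.
input_dir, output_dir = sys.argv[1], sys.argv[2]
print(f'processing {input_dir} into {output_dir}')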
You can do it like this:
find .. -maxdepth 1 -type d -name "g-*" | xargs -n1 -P1 -I{} python3 program.py {} ~/static-directory/ >> ~/static-directory/final/results.txt
find .. will look in the parent directory. -maxdepth 1 restricts matching to the top level, without descending into subdirectories. -type d matches only directories. -name "g-*" matches names starting with g- (use -iname "g-*" if you want to match g- or G- case-insensitively).
We pipe it to xargs, which applies the input from stdin to the specified command. -n1 tells it to start one process per input word, -P1 tells it to run only one process at a time, and -I{} tells it to replace {} with the input in the command.
Then we specify the command to run for each input, where {} is replaced by xargs: python3 program.py {} ~/static-directory/ >> ~/static-directory/final/results.txt. Note the >>: this appends to the file if it exists, while > would overwrite it.
With -P4 you could start four processes in parallel. But you do not want to do that here, since all processes write into one file and multi-processing can garble the output. If every process wrote into its own file, multi-processing would be safe.
Refer to man find and man xargs for further details.
There are many other ways to do this as well, e.g. a for loop like this:
for F in $(ls .. | grep -oP "g-.*"); do
  python3 program.py "../$F" ~/static-directory/ >> ~/static-directory/final/results.txt
done
(Note that parsing ls output breaks on names containing whitespace; the find-based approaches above are more robust.)
There are many ways to do this; here's what I would write:
find .. -type d -name "g-*" -exec python3 program.py {} ~/static-directory/ \; > ~/static-directory/final/results.txt
You haven't mentioned whether you want nested directories to be included; if the answer is no, you have to add the -maxdepth parameter as in @toydarian's answer.

Pass output of 'find' command to Python with docopt (issue with spaces)

Consider this simple Python command-line script:
"""foobar
Description
Usage:
foobar [options] <files>...
Arguments:
<files> List of files.
Options:
-h, --help Show help.
--version Show version.
"""
import docopt
args = docopt.docopt(__doc__)
print(args['<files>'])
And consider that I have the following files in a folder:
file1.pdf
file 2.pdf
Now I want to pass the output of the find command to my simple command-line script. But when I try
foobar `find . -iname '*.pdf'`
I don't get the list of files that I wanted, because the input is split on spaces. I.e. I get:
['./file', '2.pdf', './file1.pdf']
How can I correctly do this?
This isn't a Python question. This is all about how the shell tokenizes command lines. Whitespace is used to separate command arguments, which is why file 2.pdf is showing up as two separate arguments.
You can combine find and xargs to do what you want like this:
find . -iname '*.pdf' -print0 | xargs -0 foobar
The -print0 argument to find tells it to output filenames separated by ASCII NUL characters rather than newlines, and the -0 argument to xargs tells it to expect that form of input. xargs will then call your foobar script with the correct arguments.
Compare:
$ ./foobar $(find . -iname '*.pdf' )
['./file', '2.pdf', './file1.pdf']
To:
$ find . -iname '*.pdf' -print0 | xargs -0 ./foobar
['./file 2.pdf', './file1.pdf']
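As an aside, if modifying the script is acceptable, it could also consume NUL-delimited names from stdin directly, avoiding xargs altogether. A sketch (bypassing docopt for the file list; the script name is hypothetical):
import sys

# Usage: find . -iname '*.pdf' -print0 | python foobar_stdin.py
# -print0 emits a NUL after every name, so drop the trailing empty field.
data = sys.stdin.buffer.read()
files = [name.decode() for name in data.split(b'\0') if name]
print(files)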

Use GNU parallel to parallelise a bash for loop

I have a for loop which runs a Python script ~100 times on 100 different input folders. The python script is most efficient on 2 cores, and I have 50 cores available. So I'd like to use GNU parallel to run the script on 25 folders at a time.
Here's my for loop (works fine, but is sequential of course), the python script takes a bunch of input variables including the -p 2 which runs it on two cores:
for folder in $(find /home/rob/PartitionFinder/ -maxdepth 2 -type d); do
  python script.py --raxml --quick --no-ml-tree $folder --force -p 2
done
and here's my attempt to parallelise it, which does not work:
folders=$(find /home/rob/PartitionFinder/ -maxdepth 2 -type d)
echo $folders | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
The issue I'm hitting (perhaps it's just the first of many though) is that my folders variable is not a list, so it's really just passing a long string of 100 folders as the {} to the script.
All hints gratefully received.
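(script.py isn't shown; a hypothetical stand-in for exercising the parallel invocations might look like:)
import argparse

# Hypothetical stand-in for script.py, just to test the shell side.
parser = argparse.ArgumentParser()
parser.add_argument('folder')
parser.add_argument('-p', type=int, default=1)
for flag in ('--raxml', '--quick', '--no-ml-tree', '--force'):
    parser.add_argument(flag, action='store_true')
args = parser.parse_args()
print(f'would process {args.folder} on {args.p} cores')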
Replace echo $folders | parallel ... with echo "$folders" | parallel ....
Without the double quotes, the shell parses spaces in $folders and passes them as separate arguments to echo, which causes them to be printed on one line. parallel provides each line as argument to the job.
To avoid such quoting issues altogether, it is always a good idea to pipe find to parallel directly, and use the null character as the delimiter:
find ... -print0 | parallel -0 ...
This will work even when encountering file names that contain multiple spaces or a newline character.
You can pipe find directly to parallel:
find /home/rob/PartitionFinder/ -maxdepth 2 -type d | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
If you want to keep the list in $folders, you can pipe the echo to xargs.
echo $folders | xargs -n 1 | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
You can create a Makefile like this:
#!/usr/bin/make -f
FOLDERS=$(shell find /home/rob/PartitionFinder/ -maxdepth 2 -type d)

all: ${FOLDERS}

# To execute the find before the all
find_folders:
	@echo $(FOLDERS) > /dev/null

${FOLDERS}: find_folders
	@python script.py --raxml --quick --no-ml-tree $@ --force -p 2
and then run make -j 25.
Be careful: recipe lines must be indented with tabs.
Also, files with spaces in their names won't work.

Piping find output to xargs to run python script

I'm seeing the weirdest results here and was hoping somebody can explain this to me.
So, I'm using a find command to locate all files of type .log, and piping the results of that command to a python script. Either ONLY the first result of the find command is being piped to xargs, or xargs is receiving all results and passing them as a single string to the python script.
Example:
# Find returns 3 .log files
find . -name "*.log"
file1.log
file2.log
file3.log
# xargs only runs the python script for first found file (or all 3 are being piped to script as a single string, and only first result is read in)
find . -name "*.log" | xargs python myscript.py -infile
Success: parsed file1.log
What I want to happen is for the python script to run for all 3 files found:
find . -name "*.log" | xargs python myscript.py -infile
Success: parsed file1.log
Success: parsed file2.log
Success: parsed file3.log
A safer way to do this is as follows:
find . -name "*.log" -print0 | \
xargs -0 -I {} python myscript.py -infile {}
When passing file names from find, it is very important to use the -0 or -d option to set the separator to \0 (null). Filenames cannot contain / or \0 characters, so this guarantees the filename arrives intact.
With xargs, you must supply -0 to inform it of the \0 separators. You also need one of:
"-L 1" if you just need the filename as the last argument;
"-I {}" to place the filename anywhere in the command (one invocation per input item).

How to pass UNIX find output as arguments for a Python script?

I have a list of files that I can obtain using the UNIX 'find' command such as:
$ find . -name "*.txt"
foo/foo.txt
bar/bar.txt
How can I pass this output into a Python script like hello.py so I can parse it using Python's argparse library?
Thanks!
If you want just text output of find(1), then use a pipe:
~$ find . -name "*.txt" | python hello.py
If you are looking to pass list of files as arguments to the script, use xargs(1):
~$ find . -name "*.txt" -print0 | xargs -0 python hello.py
or use -exec option of find(1).
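hello.py itself isn't shown in the question; a minimal argparse receiver, assuming the filenames arrive as positional arguments, could be:
import argparse

parser = argparse.ArgumentParser()
# Collect every filename xargs appends into a single list.
parser.add_argument('files', nargs='*')
args = parser.parse_args()
print(args.files)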
Use xargs:
find . -name "*.txt" | xargs python -c 'import sys; print(sys.argv[1:])'
From man find:
-exec command ;
    Execute command; true if 0 status is returned. All following
    arguments to find are taken to be arguments to the command until
    an argument consisting of `;' is encountered. The string `{}'
    is replaced by the current file name being processed everywhere
    it occurs in the arguments to the command, not just in arguments
    where it is alone, as in some versions of find. Both of these
    constructions might need to be escaped (with a `\') or quoted to
    protect them from expansion by the shell. See the EXAMPLES
    section for examples of the use of the -exec option. The
    specified command is run once for each matched file. The command
    is executed in the starting directory. There are unavoidable
    security problems surrounding use of the -exec action; you
    should use the -execdir option instead.
-exec command {} +
    This variant of the -exec action runs the specified command on
    the selected files, but the command line is built by appending
    each selected file name at the end; the total number of
    invocations of the command will be much less than the number of
    matched files. The command line is built in much the same way
    that xargs builds its command lines. Only one instance of `{}'
    is allowed within the command. The command is executed in the
    starting directory.
So you can do
find . -name "*.txt" -exec python myscript.py {} +
This helps if you need to pass additional arguments after the list of files from the find output (note that this breaks on filenames containing whitespace, as discussed above):
$ python hello.py `find . -name "*.txt"`
I used it to concatenate PDF files into one:
$ pdfunite `find . -name "*.pdf" | sort` all.pdf
