I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the number of arguments that can be passed to a command, I am using find -print0 with xargs -0.
I know another option would be to use Python's glob module, but that won't help when I have a more advanced find command, looking for modification times, etc.
When running my script on a large number of files, Python only accepts a subset of the arguments, a limitation I first thought was in argparse, but appears to be in sys.argv. I can't find any documentation on this. Is it a bug?
Here's a sample Python script illustrating the point:
import argparse
import sys
import os
parser = argparse.ArgumentParser()
parser.add_argument('input_files', nargs='+')
args = parser.parse_args(sys.argv[1:])
print 'pid:', os.getpid(), 'argv files', len(sys.argv[1:]), 'argparse files:', len(args.input_files)
I have a lot of files to run this on:
$ find ~/ -name "*" -print0 | xargs -0 ls > filelist
748709 filelist
But it appears xargs or Python is chunking my big list of files and processing it with several different Python runs:
$ find ~/ -name "*" -print0 | xargs -0 python test.py
pid: 4216 argv files 1819 number of files: 1819
pid: 4217 argv files 1845 number of files: 1845
pid: 4218 argv files 1845 number of files: 1845
pid: 4219 argv files 1845 number of files: 1845
pid: 4220 argv files 1845 number of files: 1845
pid: 4221 argv files 1845 number of files: 1845
...
Why are multiple processes being created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names and shouldn't -print0 and -0 take care of that issue? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the above example. What gives?
I almost forgot:
$ python -V
Python 2.7.2+
xargs will chunk your arguments by default. Have a look at the --max-args and --max-chars options of xargs. Its man page also explains the limits (under --max-chars).
Python does not seem to place a limit on the number of arguments but the operating system does.
Have a look here for a more comprehensive discussion.
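To make the chunking concrete, here is a rough sketch of the kind of batching xargs performs. It is an illustration only: chunk_by_size is a hypothetical helper, not xargs's actual implementation, and the real tool also accounts for the environment and leaves some headroom. On POSIX systems, os.sysconf("SC_ARG_MAX") reports the operating system's limit.

```python
import os


def chunk_by_size(args, max_chars):
    """Group args into batches whose combined size (each argument plus one
    terminating byte) stays within max_chars, roughly as xargs does."""
    batch, used = [], 0
    for arg in args:
        cost = len(arg.encode()) + 1
        if batch and used + cost > max_chars:
            yield batch
            batch, used = [], 0
        batch.append(arg)
        used += cost
    if batch:
        yield batch


if __name__ == "__main__":
    # Kernel limit for argv plus environment, in bytes (POSIX systems only).
    print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))
    names = ["file%03d" % i for i in range(10)]
    print([len(b) for b in chunk_by_size(names, 32)])
```

Each batch corresponds to one separate run of the command, which is why the script above reports several pids.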
Everything that you want from find is available from os.walk.
Don't use find and the shell for any of this.
Use os.walk and write all your rules and filters in Python.
"looking for modification times" means that you'll be using os.stat or some similar library function.
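As a minimal sketch of that idea, the function below walks a tree and keeps only regular files modified within the last few days, which is roughly what find's -mtime test does. The name modified_within and its parameters are made up for this illustration.

```python
import os
import stat
import time


def modified_within(root, days):
    """Yield paths of regular files under root modified in the last `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_mtime >= cutoff:
                yield path
```

Because the filtering happens inside one Python process, no argument-length limit applies, however many files match.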
xargs will pass as much as it can, but there's still a limit. For instance,
find ~/ -name "*" -print0 | xargs -0 wc -l | grep total
will give you multiple lines of output.
You probably want to have your script either take a file containing a list of filenames, or accept filenames on its stdin.
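A minimal sketch of the stdin variant, assuming the names arrive NUL-separated as produced by find -print0. The helper name read_nul_separated is hypothetical. Reading in binary mode and decoding with os.fsdecode keeps file names that are not valid UTF-8 usable.

```python
import os
import sys


def read_nul_separated(stream):
    """Return the file names found in a binary stream of NUL-terminated paths."""
    data = stream.read()
    # find -print0 ends every name with NUL, so drop the empty trailing piece.
    return [os.fsdecode(name) for name in data.split(b"\0") if name]


if __name__ == "__main__":
    files = read_nul_separated(sys.stdin.buffer)
    print("received", len(files), "file names")
```

It would be invoked as find ... -print0 | python script.py, so all names reach a single Python process.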
The problem is that xargs caps the total length, in characters, of each command line it builds (maximum 2091281).
A quick test showed that this works out to between 5000 and 55000 files per invocation, depending on the length of the paths.
To handle more files, have the script read the file paths from standard input instead.
find ... -print0 | script.py
#!/usr/bin/env python3
import sys
files = [f for f in sys.stdin.read().split('\0') if f]  # drop the empty string after the final NUL
...
I'm trying to find the location of the file that has been most recently modified. In bash, you can do this through
find /media/tiwa/usb/ -not -path '*/\.*' -type f -printf '%T@ %p\n' 2> >(grep -v 'Permission denied' >&2) | sort -k1,1nr | head -1
Indeed, on my system, this returns
1527379702.1060795850 /media/tiwa/usb/hi.txt
I intend to take the output of this command (within Python), split it on the first space, and parse the file path (yes, I could use awk, but the same errors get thrown regardless). So I did
import subprocess
bashCommand = "find /media/tiwa/usb/ -not -path '*/\.*' -type f -printf '%T@ %p\n' 2> >(grep -v 'Permission denied' >&2) | sort -k1,1nr | head -1"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
print(output)
but this prints out
find: paths must precede expression: `%p'
Escaping the backslashes doesn't appear to help either.
What is causing this issue, and how do I solve it?
You have an entire shell command line, not just a single command plus its arguments, which means you need to use the shell=True option instead of (erroneously) splitting the string into multiple strings. (Python string splitting is not equivalent to the shell's word splitting, which is much more involved and complicated.) Further, since your command line contains bash-specific features, you need to tell Popen to use /bin/bash explicitly, rather than the default /bin/sh.
import subprocess
# Raw string, so the backslashes reach bash and find unchanged
bashCommand = r"find /media/tiwa/usb/ -not -path '*/\.*' -type f -printf '%T@ %p\n' 2> >(grep -v 'Permission denied' >&2) | sort -k1,1nr | head -1"
path_to_bash = "/bin/bash" # or whatever is appropriate
process = subprocess.Popen(bashCommand,
                           stdout=subprocess.PIPE,
                           shell=True,
                           executable=path_to_bash)
output, error = process.communicate()
print(output)
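The point about word splitting can be seen with a shortened version of the command. This is only an illustration: shlex.split imitates the shell's quoting rules, but it still does not interpret pipes or redirections, so it would not rescue the full command line above.

```python
import shlex

# A shortened, pipe-free version of the find command from the question.
cmd = "find /media/tiwa/usb/ -type f -printf '%T@ %p\\n'"

naive = cmd.split()            # splits on every space, even inside quotes
shell_like = shlex.split(cmd)  # honours the single quotes

print(naive[-2:])       # the format string ends up as two broken arguments
print(shell_like[-1:])  # the format string stays one argument
```

With the naive split, find receives '%T@ and %p\n' as separate words, which is what produces the "paths must precede expression" error.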
(It would, however, be simpler and more robust to use os.walk() to get each file, use os.stat() to get the modification time of each relevant file, and keep only the newest file found so far until you have examined every file.
import os
newest = (0, "")
for (dirpath, subdirs, fnames) in os.walk("/media/tiwa/usb"):
    # Skip hidden directories, like -not -path '*/\.*' in the find command
    subdirs[:] = [d for d in subdirs if not d.startswith(".")]
    for fname in fnames:
        path = os.path.join(dirpath, fname)  # os.walk yields bare names
        if fname.startswith(".") or not os.path.isfile(path):
            continue
        mtime = os.stat(path).st_mtime
        if mtime > newest[0]:
            newest = (mtime, path)
Or perhaps
def names_and_times(d):
    for (dirpath, subdirs, fnames) in os.walk(d):
        subdirs[:] = [s for s in subdirs if not s.startswith(".")]
        for fname in fnames:
            path = os.path.join(dirpath, fname)
            if fname.startswith(".") or not os.path.isfile(path):
                continue
            yield (os.stat(path).st_mtime, path)
newest = max(names_and_times("/media/tiwa/usb"))
)
Keep in mind that any of the preceding approaches will only return one file with the newest modification time.
I have a Python script, A.py, that takes a target file containing a list of IPs as its argument and outputs a CSV file with information found about those IPs from several sources. (Run method: python A.py Input.txt -c Output.csv)
It took ages to get the work done. Later, I split the input file (split -l 1000 Input.txt), created 10 directories, and ran the script on the split input in those 10 directories in parallel under screen.
How can I do this kind of job efficiently? Any suggestions, please?
Try this:
parallel --round --pipepart -a Input.txt --cat python A.py {} -c {#}.csv
If A.py can read from a fifo then this is more efficient:
parallel --round --pipepart -a Input.txt --fifo python A.py {} -c {#}.csv
If your disk has long seek times then it might be faster to use --pipe instead of --pipepart.
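If GNU parallel is not available, something similar can be sketched in Python. This is an assumption-laden illustration, not a replacement for the commands above: deal_lines and run_chunk are hypothetical names, the chunk and CSV file names are made up, and A.py is assumed to accept the arguments shown in the question.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def deal_lines(lines, n):
    """Deal lines round-robin into at most n non-empty groups."""
    groups = [lines[i::n] for i in range(n)]
    return [g for g in groups if g]


def run_chunk(job):
    """Write one group to its own input file and run A.py on it."""
    index, lines = job
    chunk_name = "chunk_%d.txt" % index
    with open(chunk_name, "w") as f:
        f.writelines(lines)
    # Same invocation as in the question, with one output CSV per chunk.
    return subprocess.call(
        [sys.executable, "A.py", chunk_name, "-c", "%d.csv" % index])


if __name__ == "__main__":
    with open("Input.txt") as f:
        groups = deal_lines(f.readlines(), 10)
    # Threads are enough here: each one only waits for a child process.
    with ThreadPoolExecutor(max_workers=10) as pool:
        print(list(pool.map(run_chunk, enumerate(groups))))
```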
I need to find files with the same name but different content in a Linux folder structure with a lot of files.
Something like this does the job partially, but how do I eliminate files with different content?
#!/bin/sh
dirname=/path/to/directory
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find $dirname -type f | grep "$fileName"
done
(How to find duplicate filenames (recursively) in a given directory? BASH)
Thanks so much !
The first question is, how can you determine whether two files have the same content?
One obvious possibility is to read (or mmap) both files and compare them a block at a time. On some platforms, a stat is a lot faster than a read, so you may want to first compare sizes. And there are other optimizations that might be useful, depending on what you're actually doing (e.g., if you're going to run this thousands of times, and most of the files are the same every time, you could hash them and cache the hashes, and only check the actual files when the hashes match). But I doubt you're too worried about that kind of performance tweak if your existing code is acceptable (since it searches the whole tree once for every file in the tree), so let's just do the simplest thing.
Here's one way to do it in Python:
#!/usr/bin/env python3
import sys
def readfile(path):
    with open(path, 'rb') as f:
        return f.read()
contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
This will exit with code 1 if all files are identical, code 0 if any pair of files are different. So, save this as allequal.py, make it executable, and your bash code can just run allequal.py on the results of that grep, and use the exit value (e.g., via $?) to decide whether to print those results for you.
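For larger trees, the size check and hashing mentioned earlier could look roughly like the sketch below. The names file_digest and all_same_content are made up for this example, and comparing SHA-256 digests is a practical stand-in for a byte-by-byte comparison rather than a strict proof of equality.

```python
import hashlib
import os


def file_digest(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read one block at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def all_same_content(paths):
    """True if every file in paths has the same content (by size, then hash)."""
    paths = list(paths)
    if len({os.path.getsize(p) for p in paths}) > 1:
        return False  # different sizes, so the contents must differ
    return len({file_digest(p) for p in paths}) <= 1
```

Unlike the script above, this never holds more than one block of a file in memory at a time.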
I am facing the same problem as described in the question. In a large directory tree, some files have the same name and either same content or different content. The ones where the content differs need human attention to decide how to fix the situation in each case. I need to create a list of these files to guide the person doing this.
The code in the question and the code in abernet's response are both helpful. Here is how one would combine both: store the Python code from abernet's response in some file, e.g. /usr/local/bin/do_these_files_have_different_content:
sudo tee /usr/local/bin/do_these_files_have_different_content <<EOF
#!/usr/bin/env python3
import sys
def readfile(path):
    with open(path, 'rb') as f:
        return f.read()
contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
EOF
sudo chmod a+x /usr/local/bin/do_these_files_have_different_content
Then extend the bash code from Illusionist's question to call this program when needed, and react to its outcome:
#!/bin/sh
dirname=$1
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    if do_these_files_have_different_content $(find $dirname -type f | grep "$fileName")
    then
        find $dirname -type f | grep "$fileName"
        echo
    fi
done
This will write to stdout the paths of all files with same name but different content. Groups of files with same name but different content are separated by empty lines. I store the shell script in /usr/local/bin/find_files_with_same_name_but_different_content and invoke it as
find_files_with_same_name_but_different_content /path/to/my/storage/directory
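The same report can also be produced in a single pass in Python. The sketch below is one possible approach, not the method used above: same_name_different_content is a hypothetical name, it matches base names exactly (whereas grep matches substrings of the whole path), and it relies on filecmp.cmp with shallow=False for a byte-by-byte comparison.

```python
import filecmp
import os
from collections import defaultdict


def same_name_different_content(root):
    """Map each base name to its paths when those files differ in content."""
    by_name = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            by_name[name].append(os.path.join(dirpath, name))
    result = {}
    for name, paths in by_name.items():
        first = paths[0]
        # One differing file is enough to flag the whole group.
        if any(not filecmp.cmp(first, other, shallow=False)
               for other in paths[1:]):
            result[name] = sorted(paths)
    return result


if __name__ == "__main__":
    import sys
    for name, paths in sorted(same_name_different_content(sys.argv[1]).items()):
        print("\n".join(paths))
        print()
```

This walks the tree once instead of once per duplicate name, and it copes with spaces in file names.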
I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input with spaces as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and I have to pass all of them as input to this command. I used Python's os.system in a for loop over the directory as follows:
for i,f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' %(f,i))
This works fine only for the first file. For all the other outputs, such as annotate_1 and annotate_2, it creates only an empty file. I thought of looping over the files and passing them to subprocess.Popen(), but that seems hopeless.
Now I am thinking of passing the files one by one in a loop from a bash script, so that the command is executed sequentially. I am also wondering whether I can process at least 10 files in parallel, in different terminals, at the same time. Any solution is fine, but I think this question will help me gain some insight into the different approaches.
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.
You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.
But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
To pass all .txt files in the current directory at once to the java subprocess:
#!/usr/bin/env python
from glob import glob
from subprocess import check_call
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)
It is similar to running the shell command but without running the shell:
$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident # Python 3.3+
except ImportError: # Python 2
    from thread import get_ident
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)
all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
It is similar to this xargs command (suggested by @abarnert):
$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt
except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"
procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()
So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
But now how do you get all of the results?
One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:
procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()
(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
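For reference, the ExitStack variant mentioned above might look like the following sketch on Python 3.3+. The function name run_groups and its pattern parameter are invented for this example; cmd is whatever fixed argument list you want to run, such as the java command above.

```python
import subprocess
from contextlib import ExitStack


def run_groups(cmd, groups, pattern="output_{}"):
    """Start one process per group, each writing to its own file.

    Returns the exit codes in group order.
    """
    with ExitStack() as stack:
        procs = []
        for i, group in enumerate(groups):
            # The stack closes every opened file when the with block exits.
            out = stack.enter_context(open(pattern.format(i), "w"))
            procs.append(subprocess.Popen(cmd + group, stdout=out))
        # Wait for all processes before the files are closed.
        return [proc.wait() for proc in procs]
```

It could be called as run_groups(['java', '-cp', 'stanford-ner.jar', 'edu.stanford.nlp.process.PTBTokenizer'], groups). The files are closed even if starting one of the processes raises an exception.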
Inside your input file directory you can do the following in bash:
#!/bin/bash
# Collect the files in an array so names with spaces stay intact
input=()
for file in *.txt
do
    input+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${input[@]}" > output.txt
If you want to run it as a script, save the file under some name, e.g. my_exec.bash:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "Please pass a valid directory"
    exit 1
fi
# Collect the files in an array so names with spaces stay intact
input=()
for file in "$1"/*.txt
do
    input+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${input[@]}" > "$2"
Make it an executable file
chmod +x my_exec.bash
USAGE:
./my_exec.bash <folder> <output_file>
I ran svn status and got the list of modified files:
svn status
? .settings
? .buildpath
? .directory
A A.php
M B.php
D html/C.html
M html/D.fr
M api/E.api
M F.php
..
After that I want to archive all of these files:
tar zcvf MY.tar.gz <all the files that svn status displays>
(not including the ? entries, just M, A, D)
My idea is to write a Python script that runs the shell commands, because right now I just copy the file names one by one.
Could anybody guide me on how to do this, or point me to a related tutorial? If you think it is more difficult than copy and paste, I will give up on the idea.
Thanks
I don't see why you would use python for this if you can do it in a single line of code in the shell.
svn status | grep "^[AMD]" | sed 's/^.\{8\}//' | xargs tar zcvf My.tar.gz
I used grep to keep only the lines whose status is A, M, or D; if you want every file that svn status lists (unversioned ones too), you can leave that part out. I've used sed here to delete the first part of every line. I'm sure there is an easier way to do that, but I can't think of one right now.
Once you figure out your command as a string you can just call it with subprocess
subprocess
This module spawns processes and lets you control them. From there it's up to you.
You could use check_output() and check_call() functions:
#!/usr/bin/env python
from subprocess import check_call, check_output as qx
# universal_newlines=True makes check_output return text, not bytes, on Python 3
filenames = [line[8:] # extract filename
             for line in qx(['svn', 'status'], universal_newlines=True).splitlines()
             if not line.startswith('?')] # exclude files that are not under VC
check_call(['tar', 'cvfz', 'MY.tar.gz'] + filenames)
On Python < 2.7 see check_output() recipe.
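The parsing step can be separated into a small function that is easy to test without a working copy. This is only a sketch: paths_with_status is a hypothetical helper, and it assumes, like line[8:] and the sed expression above, that the path starts after an 8-character status prefix.

```python
def paths_with_status(status_output, wanted="AMD"):
    """Return the paths from `svn status` text whose first status column
    is one of the characters in `wanted`."""
    paths = []
    for line in status_output.splitlines():
        # Column 1 holds the item status; the path begins at column 9.
        if len(line) > 8 and line[0] in wanted:
            paths.append(line[8:])
    return paths


if __name__ == "__main__":
    import subprocess
    out = subprocess.check_output(["svn", "status"], universal_newlines=True)
    print(paths_with_status(out))
```

Entries with status D usually no longer exist in the working copy, so tar may fail on them; passing wanted="AM" would skip them.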
subprocess is the Pythonic way, but a small bash one-liner could be shorter.
Bash one-liner
svn status | egrep "^ M" | awk '{s=s $2 " "} END {print "tar cvfz MY.tar "s}'
Subprocess
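import subprocess as sp
p = sp.Popen('svn status', shell=True, stdout=sp.PIPE,
             stderr=sp.PIPE, universal_newlines=True).communicate()[0]
files = []
for line in p.splitlines():  # iterate over lines, not single characters
    if line.startswith('M'):
        files.append(line.split(None, 1)[1])  # path follows the status columns
# Pass the names as a list so spaces in file names do not break the command
sp.Popen(['tar', 'cvfz', 'MY.tar.gz'] + files).communicate()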