Parallel Processing Issue in Python

I have a Python script A.py that takes as an argument a target file containing a list of IPs, and writes out a CSV file with information found about those IPs from some sources (run method: python A.py Input.txt -c Output.csv).
It took ages to get the work done. Later, I split the input file (split -l 1000 Input.txt), created 10 directories, and ran the script on the split inputs in those 10 directories in parallel under screen.
How can I do this kind of job efficiently? Any suggestions, please?

Try this:
parallel --round --pipepart -a Input.txt --cat python A.py {} -c {#}.csv
If A.py can read from a fifo then this is more efficient:
parallel --round --pipepart -a Input.txt --fifo python A.py {} -c {#}.csv
If your disk has long seek times then it might be faster to use --pipe instead of --pipepart.

Related

Reading lines of file into xargs for parallel python script

I am trying to execute a Python script in parallel, using the lines of a file as arguments to the script. The file is named experiments.txt and might look like this:
--x_timesteps 3 --y_timesteps 3 --exp_path ./logs
--x_timesteps 4 --y_timesteps 3 --exp_path ./logs
--x_timesteps 5 --y_timesteps 3 --exp_path ./logs
--x_timesteps 6 --y_timesteps 3 --exp_path ./logs
I want to speed up the processing by using xargs; however, I don't know how to do this with file input. How can I parallelize a python script by reading line by line from the file and piping to xargs?
I know I can solve this problem with a simple for-loop; however, I need to know how to do it with file input.
Typing this into the command line in the appropriate directory does the job:
for x in {3..6}; \
do printf '%s\0' "--x_timesteps=${x}" "--y_timesteps=3" "--exp_path=./logs"; \
done | xargs -0 -n 3 -P 8 python script.py
The for-loop style parallelization is derived from the answer to "Using xargs for parallel Python scripts"
IMHO, it is simpler and more controllable with GNU Parallel like this:
parallel --dry-run --colsep ' ' python script.py :::: experiments.txt
You can simply add or remove --dry-run to debug. You can add --eta or --bar for progress reports. You can distribute tasks across multiple hosts. You can easily do fail/retry processing. You can extract basenames, filenames, directory names from parameters. You can do permutations of parameters. And so on.

time difference between counting lines in a file from python/unix

I'm using 'wc -l' on a file with 50 columns and 3000 records to count the lines from within my Python code, as below:
cmd = 'wc -l /path/to/file'
status, output = commands.getstatusoutput(cmd)
and then I tried the one below in Python:
row_count = sum(1 for line in open(file_path))
I timed both of them and wc -l is faster; I just don't know why. Could you let me know the reasons behind this?
Example timings:
wc -l : 0.005s
python : 0.54s
Try this one:
with open("inp.txt", "r") as inpt:
print(len(inpt.readlines()))
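As for the timing difference: roughly speaking, wc -l is a small C program that reads raw bytes in large buffers and counts newline characters, while the Python generator builds a string object for every line, which is where most of the extra time goes. A minimal sketch of a byte-counting approach in Python (the function name and buffer size are just illustrative choices) that narrows the gap:
def count_lines(path, bufsize=1024 * 1024):
    # Read raw bytes in large chunks and count newline characters,
    # similar in spirit to what wc -l does.
    count = 0
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
        while chunk:
            count += chunk.count(b"\n")
            chunk = f.read(bufsize)
    return count

print(count_lines("inp.txt"))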

python or bash script to pass all files in a folder to java command line

I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input with spaces as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and I have to pass all of them as input to this command. I used
Python's os.system in a for loop over the directory listing, as follows:
for i, f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' % (f, i))
This works fine only for the first file. For all the other outputs, like annotate_1 and annotate_2, it just creates the file with nothing inside it. I thought of looping over the files and passing each one to subprocess.Popen(), but that seems hopeless.
Now I am thinking of passing the files one by one in a loop from a bash script, so the command runs sequentially on each file. I am also wondering whether I can execute at least 10 files in parallel in different terminals at a time. Any solution is fine, but I think this question will help me gain some insight into the different options.
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.
You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.
But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
To pass all .txt files in the current directory at once to the java subprocess:
#!/usr/bin/env python
from glob import glob
from subprocess import check_call
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)
It is similar to running the shell command but without running the shell:
$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident  # Python 3.3+
except ImportError:  # Python 2
    from thread import get_ident
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)
all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
It is similar to this xargs command (suggested by @abarnert):
$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt
except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"
procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()
So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
But now how do you get all of the results?
One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:
procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()
(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
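For reference, a minimal sketch of the ExitStack variant (contextlib.ExitStack needs Python 3.3+ or the contextlib2 backport; groups is the same list of file groups as above):
from contextlib import ExitStack  # Python 3.3+ (or the contextlib2 backport)
import subprocess

with ExitStack() as stack:
    files = [stack.enter_context(open('output_{}'.format(i), 'w'))
             for i in range(len(groups))]
    procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                               'edu.stanford.nlp.process.PTBTokenizer'] + group,
                              stdout=file)
             for group, file in zip(groups, files)]
    for proc in procs:
        proc.wait()
# every output file is closed automatically when the with block exits,
# even if one of the Popen calls raises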
Inside your input file directory you can do the following in bash:
#!/bin/bash
for file in *.txt
do
    input=$input" \"$file\""
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > output.txt
If you want to run it as a script, save the file with some name, e.g. my_exec.bash:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d $1 ]; then
    echo "Please pass a valid directory"
    exit 1
fi
for file in $1*.txt
do
    input=$input" \"$file\""
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > $2
Make it an executable file
chmod +x my_exec.bash
USAGE:
./my_exec.bash <folder> <output_file>

Problems with running a python script over many files

I am on a Linux (Ubuntu 11.10) machine, using the Bourne-again shell (bash).
I have to process a directory full of files with a python script. My colleague wrote the python script and I have successfully used it before on one file at a time. It takes two arguments: a path to the file to be processed enclosed in quotes and a secondary argument called -min which requires an integer. Also, the script writes to standard out.
From my experience of shell scripting and following others on this forum, I used the following method to iterate over the directory of files:
for f in path/to/data_directory/*; do
    path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f;
done
I get the desired file names in the out_directory, and the content of each is something only the Python script could have written; that is, the above for loop successfully passes the files to the script. However, the content of each file is completely wrong (as in, the computation the script performed is wrong). When I run the Python script on one of the files in the data_directory by itself, the output file has the correct content (the computation performed by the script is correct).
What makes it more puzzling is that the same shell method (the for loop) works perfectly on the Mac OS X machine my colleague has.
Where is the issue? Am I missing something very fundamental about Linux shells? Maybe it's a syntax error?
Any help will be appreciated.
Update: I just ran the for loop again but instead of pointing it to the data_directory of files, I pointed it to a file within the data_directory. I had the same problem - the python script did not compute the correct result.
The only problem I see is that filenames may contain whitespace, so you should quote the filenames:
for f in path/to/data_directory/*; do
    path/to/pythonscript.py "$f" -min 1 > "path/to/out_directory/$f"
done
Well, I don't know if this helps, but:
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
expands to
path/to/pythonscript.py path/to/data_directory/myfile -min 1 > path/to/out_directory/path/to/data_directory/myfile
The script should be:
cd path/to/data_directory
for f in *; do
    path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
done
What version of bash are you running?
What do you get if you run this script?
cd path/to/data_directory
for f in *; do
    echo $f > /tmp/$f
done
Of course, that should give you a bunch of files, each containing its own file name.

Is Python 'sys.argv' limited in the maximum number of arguments?

I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the number of arguments that can be passed to a command, I am using find -print0 with xargs -0.
I know another option would be to use Python's glob module, but that won't help when I have a more advanced find command, looking for modification times, etc.
When running my script on a large number of files, Python only accepts a subset of the arguments, a limitation I first thought was in argparse, but appears to be in sys.argv. I can't find any documentation on this. Is it a bug?
Here's a sample Python script illustrating the point:
import argparse
import sys
import os
parser = argparse.ArgumentParser()
parser.add_argument('input_files', nargs='+')
args = parser.parse_args(sys.argv[1:])
print 'pid:', os.getpid(), 'argv files', len(sys.argv[1:]), 'argparse files:', len(args.input_files)
I have a lot of files to run this on:
$ find ~/ -name "*" -print0 | xargs -0 ls > filelist
$ wc -l filelist
748709 filelist
But it appears xargs or Python is chunking my big list of files and processing it with several different Python runs:
$ find ~/ -name "*" -print0 | xargs -0 python test.py
pid: 4216 argv files 1819 number of files: 1819
pid: 4217 argv files 1845 number of files: 1845
pid: 4218 argv files 1845 number of files: 1845
pid: 4219 argv files 1845 number of files: 1845
pid: 4220 argv files 1845 number of files: 1845
pid: 4221 argv files 1845 number of files: 1845
...
Why are multiple processes being created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names and shouldn't -print0 and -0 take care of that issue? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the above example. What gives?
I almost forgot:
$ python -V
Python 2.7.2+
xargs will chunk your arguments by default. Have a look at the --max-args and --max-chars options of xargs. Its man page also explains the limits (under --max-chars).
Python does not seem to place a limit on the number of arguments but the operating system does.
Have a look here for a more comprehensive discussion.
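If you want to see the OS-level limit for yourself from Python, a minimal sketch (POSIX systems only):
import os

# ARG_MAX is the kernel's upper bound on the combined size of argv and the
# environment for a single exec() call; xargs splits its input to stay under it.
print(os.sysconf('SC_ARG_MAX'))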
Everything that you want from find is available from os.walk.
Don't use find and the shell for any of this.
Use os.walk and write all your rules and filters in Python.
"looking for modification times" means that you'll be using os.stat or some similar library function.
xargs will pass as much as it can, but there's still a limit. For instance,
find ~/ -name "*" -print0 | xargs -0 wc -l | grep total
will give you multiple lines of output.
You probably want to have your script either take a file containing a list of filenames, or accept filenames on its stdin.
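A minimal sketch of the stdin variant, assuming one filename per line (the NUL-delimited version shown in the next answer is safer if names can contain newlines):
#!/usr/bin/env python
import sys

for line in sys.stdin:
    path = line.rstrip('\n')
    if path:
        process(path)  # hypothetical per-file handler; replace with your own code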
The problem is that xargs is limited by the total number of characters of the command-line arguments (a maximum of 2091281 on my system).
A quick test showed this corresponds to somewhere between 5000 and 55000 files, depending on the length of the paths.
The solution for passing more is to accept the file paths on standard input instead:
find ... -print0 | script.py
#!/usr/bin/env python3
import sys
files = sys.stdin.read().split('\0')
...
