Run python process (using Pool of Multiprocessing ) parallel for batch

I need to write code that runs a batch job over roughly a quarter of a million input files. I saw this post: https://codereview.stackexchange.com/questions/20416/python-parallelization-using-popen
but I can't figure out how to apply it to my code.
What I want
I want to cap the number of processes running at any one time at a specific number (for example, the number of cores).
When one process finishes, another should take its place.
My code (using subprocess)
Main.py
import subprocess
import os
import multiprocessing
import time

MAXCPU = multiprocessing.cpu_count()
try:
    cp = int(raw_input("Enter Number of CPU's to use (Total %d) = "%MAXCPU))
    assert cp <= MAXCPU
except:
    print "Bad command taking all %d cores"%MAXCPU
    cp =MAXCPU # set MAXCPU as CPU

list_pdb = [i for i in os.listdir(".") if i.endswith(".pdb")] # Input PDB files
assert len(list_pdb) != 0

c = {}
d = {}
t = {}
devnull = file("Devnull","wb")

for each in range(0, len(list_pdb), cp): # Number of cores in Use = 4
    for e in range(cp):
        if each + e < len(list_pdb):
            args = ["sh", "Child.sh", list_pdb[each + e], str(cp)]
            p = subprocess.Popen(args, shell=False,
                                 stdout=devnull, stderr=devnull)
            c[p.pid] = p
            print "Started Process : %s" % list_pdb[each + e]
    while c:
        print c.keys()
        pid, status = os.wait()
        if pid in c:
            print "Ended Process"
            del c[pid]
devnull.close()
Child.sh
#!/bin/sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
# Some heavy processes with $1
grand_Child.sh
#!/bin/sh
sleep 5
Output

Here's a version of the code using multiprocessing.Pool. It's a lot simpler, as the module does nearly all the work!
This version also:
logs when each process starts and ends
prints how many files will be processed
lets you run more processes than the number of CPUs
When running multiprocess jobs, it often pays to run more processes than there are CPUs, because at any moment some processes are waiting on I/O rather than using the CPU. A common rule of thumb is 2n+1, so on a 4-CPU system that is 2*4+1 = 9 processes. (I generally hardcode "5" or "10" until there's a reason to change; I'm lazy that way :) )
Enjoy!
source
import glob
import multiprocessing
import os
import subprocess

MAXCPU = multiprocessing.cpu_count()
TEST = False

def do_work(args):
    path,numproc = args
    curproc = multiprocessing.current_process()
    print curproc, "Started Process, args={}".format(args)
    devnull = open(os.devnull, 'w')
    cmd = ["sh", "Child.sh", path, str(numproc)]
    if TEST:
        cmd.insert(0, 'echo')
    try:
        return subprocess.check_output(
            cmd, shell=False,
            stderr=devnull,
        )
    finally:
        print curproc, "Ended Process"

if TEST:
    cp = MAXCPU
    list_pdb = glob.glob('t*.py')
else:
    cp = int(raw_input("Enter Number of processes to use (%d CPUs) = " % MAXCPU))
    list_pdb = glob.glob('*.pdb') # Input PDB files
    # assert cp <= MAXCPU

print '{} files, {} procs'.format(len(list_pdb), cp)
assert len(list_pdb) != 0

pool = multiprocessing.Pool(cp)
print pool.map(
    do_work, [ (path,cp) for path in list_pdb ],
)
pool.close()
pool.join()
output
27 files, 4 procs
<Process(PoolWorker-2, started daemon)> Started Process, args=('tdownload.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
<Process(PoolWorker-2, started daemon)> Started Process, args=('tscapy.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
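For readers on Python 3, the same idea might look roughly like the sketch below. It is not the answer's original code: run_one and run_batch are hypothetical names, and each job is passed in as an argument list so the example does not depend on Child.sh existing.

```python
import multiprocessing
import subprocess


def run_one(cmd):
    """Run one external command; return (cmd, exit code, captured stdout)."""
    done = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL)
    return cmd, done.returncode, done.stdout


def run_batch(commands, num_procs):
    """Run all commands, never more than num_procs at the same time."""
    with multiprocessing.Pool(num_procs) as pool:
        # imap_unordered hands the next command to whichever worker is free,
        # so a finished job is replaced immediately by a waiting one.
        return list(pool.imap_unordered(run_one, commands))
```

For the question's workload, commands would be something like ["sh", "Child.sh", path, str(num_procs)] for each .pdb file, and run_batch should be called from under an if __name__ == "__main__": guard so that worker processes can import the module safely.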

Related

Display Popen.communicate() in real time [duplicate]

I have a python subprocess that I'm trying to read output and error streams from. Currently I have it working, but I'm only able to read from stderr after I've finished reading from stdout. Here's what it looks like:
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout_iterator = iter(process.stdout.readline, b"")
stderr_iterator = iter(process.stderr.readline, b"")
for line in stdout_iterator:
    # Do stuff with line
    print line
for line in stderr_iterator:
    # Do stuff with line
    print line
As you can see, the stderr for loop can't start until the stdout loop completes. How can I modify this to read from both streams in the order the lines arrive?
To clarify: I still need to be able to tell whether a line came from stdout or stderr because they will be treated differently in my code.

The code in your question may deadlock if the child process produces enough output on stderr (~100KB on my Linux machine).
There is a communicate() method that lets you read stdout and stderr separately:
from subprocess import Popen, PIPE
process = Popen(command, stdout=PIPE, stderr=PIPE)
output, err = process.communicate()
If you need to read the streams while the child process is still running then the portable solution is to use threads (not tested):
from subprocess import Popen, PIPE
from threading import Thread
from Queue import Queue # Python 2

def reader(pipe, queue):
    try:
        with pipe:
            for line in iter(pipe.readline, b''):
                queue.put((pipe, line))
    finally:
        queue.put(None)

process = Popen(command, stdout=PIPE, stderr=PIPE, bufsize=1)
q = Queue()
Thread(target=reader, args=[process.stdout, q]).start()
Thread(target=reader, args=[process.stderr, q]).start()
for _ in range(2):
    for source, line in iter(q.get, None):
        print "%s: %s" % (source, line),
See:
Python: read streaming input from subprocess.communicate()
Non-blocking read on a subprocess.PIPE in python
Python subprocess get children's output to file and terminal?
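A Python 3 take on the thread-based approach could look like the following sketch. The names _reader and stream_tagged are made up for illustration, and each line is tagged with a string label instead of the pipe object so the caller can tell the two streams apart.

```python
import subprocess
from queue import Queue
from threading import Thread


def _reader(name, pipe, q):
    """Push (name, line) for every line on pipe, then a None marker at EOF."""
    try:
        with pipe:
            for line in iter(pipe.readline, b""):
                q.put((name, line))
    finally:
        q.put(None)


def stream_tagged(cmd):
    """Yield ("stdout" | "stderr", line) pairs while cmd is running."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    q = Queue()
    for name, pipe in (("stdout", proc.stdout), ("stderr", proc.stderr)):
        Thread(target=_reader, args=(name, pipe, q), daemon=True).start()
    for _ in range(2):                  # one None marker per reader thread
        for item in iter(q.get, None):
            yield item
    proc.wait()
```

Lines are yielded in the order the reader threads queued them, which is close to, but not guaranteed to match, the order in which the child wrote them.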

Here's a solution based on selectors, but one that preserves order, and streams variable-length characters (even single chars).
The trick is to use read1(), instead of read().
import selectors
import subprocess
import sys
p = subprocess.Popen(
    ["python", "random_out.py"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
sel = selectors.DefaultSelector()
sel.register(p.stdout, selectors.EVENT_READ)
sel.register(p.stderr, selectors.EVENT_READ)
while True:
    for key, _ in sel.select():
        data = key.fileobj.read1().decode()
        if not data:
            exit()
        if key.fileobj is p.stdout:
            print(data, end="")
        else:
            print(data, end="", file=sys.stderr)
If you want a test program, use this.
import sys
from time import sleep
for i in range(10):
    print(f" x{i} ", file=sys.stderr, end="")
    sleep(0.1)
    print(f" y{i} ", end="")
    sleep(0.1)

The order in which a process writes data to different pipes is lost after write.
There is no way you can tell if stdout has been written before stderr.
You can try to read data simultaneously from multiple file descriptors in a non-blocking way
as soon as data is available, but this would only minimize the probability that the order is incorrect.
This program should demonstrate this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import select
import subprocess

testapps={
'slow': '''
import os
import time
os.write(1, 'aaa')
time.sleep(0.01)
os.write(2, 'bbb')
time.sleep(0.01)
os.write(1, 'ccc')
''',
'fast': '''
import os
os.write(1, 'aaa')
os.write(2, 'bbb')
os.write(1, 'ccc')
''',
'fast2': '''
import os
os.write(1, 'aaa')
os.write(2, 'bbbbbbbbbbbbbbb')
os.write(1, 'ccc')
'''
}

def readfds(fds, maxread):
    while True:
        fdsin, _, _ = select.select(fds,[],[])
        for fd in fdsin:
            s = os.read(fd, maxread)
            if len(s) == 0:
                fds.remove(fd)
                continue
            yield fd, s
        if fds == []:
            break

def readfromapp(app, rounds=10, maxread=1024):
    f=open('testapp.py', 'w')
    f.write(testapps[app])
    f.close()

    results={}
    for i in range(0, rounds):
        p = subprocess.Popen(['python', 'testapp.py'], stdout=subprocess.PIPE
                                                     , stderr=subprocess.PIPE)
        data=''
        for (fd, s) in readfds([p.stdout.fileno(), p.stderr.fileno()], maxread):
            data = data + s
        results[data] = results[data] + 1 if data in results else 1

    print 'running %i rounds %s with maxread=%i' % (rounds, app, maxread)
    results = sorted(results.items(), key=lambda (k,v): k, reverse=False)
    for data, count in results:
        print '%03i x %s' % (count, data)

print
print "=> if output is produced slowly this should work as whished"
print " and should return: aaabbbccc"
readfromapp('slow', rounds=100, maxread=1024)

print
print "=> now mostly aaacccbbb is returnd, not as it should be"
readfromapp('fast', rounds=100, maxread=1024)

print
print "=> you could try to read data one by one, and return"
print " e.g. a whole line only when LF is read"
print " (b's should be finished before c's)"
readfromapp('fast', rounds=100, maxread=1)

print
print "=> but even this won't work ..."
readfromapp('fast2', rounds=100, maxread=1)
and outputs something like this:
=> if output is produced slowly this should work as whished
and should return: aaabbbccc
running 100 rounds slow with maxread=1024
100 x aaabbbccc
=> now mostly aaacccbbb is returnd, not as it should be
running 100 rounds fast with maxread=1024
006 x aaabbbccc
094 x aaacccbbb
=> you could try to read data one by one, and return
e.g. a whole line only when LF is read
(b's should be finished before c's)
running 100 rounds fast with maxread=1
003 x aaabbbccc
003 x aababcbcc
094 x abababccc
=> but even this won't work ...
running 100 rounds fast2 with maxread=1
003 x aaabbbbbbbbbbbbbbbccc
001 x aaacbcbcbbbbbbbbbbbbb
008 x aababcbcbcbbbbbbbbbbb
088 x abababcbcbcbbbbbbbbbb

This works for Python3 (3.6):
import selectors
import subprocess
import sys

p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE, universal_newlines=True)
# Read both stdout and stderr simultaneously
sel = selectors.DefaultSelector()
sel.register(p.stdout, selectors.EVENT_READ)
sel.register(p.stderr, selectors.EVENT_READ)
ok = True
while ok:
    for key, val1 in sel.select():
        line = key.fileobj.readline()
        if not line:
            ok = False
            break
        if key.fileobj is p.stdout:
            print(f"STDOUT: {line}", end="")
        else:
            print(f"STDERR: {line}", end="", file=sys.stderr)
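The loop above stops as soon as either stream reports end-of-file, which can drop output still pending on the other one. A possible variant, sketched here under the assumption of a Unix-like system (on Windows, select only works on sockets, not pipes), unregisters each stream at its own EOF and keeps going until both are done. The name read_both and the 4096-byte chunk size are arbitrary choices for illustration.

```python
import os
import selectors
import subprocess


def read_both(cmd):
    """Yield ("stdout" | "stderr", bytes) chunks until both pipes reach EOF."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE) as proc:
        sel = selectors.DefaultSelector()
        sel.register(proc.stdout, selectors.EVENT_READ, "stdout")
        sel.register(proc.stderr, selectors.EVENT_READ, "stderr")
        while sel.get_map():                     # some pipe is still registered
            for key, _ in sel.select():
                chunk = os.read(key.fd, 4096)    # b"" only at end-of-file
                if chunk:
                    yield key.data, chunk
                else:
                    sel.unregister(key.fileobj)  # this stream is finished
        sel.close()
```

Reading with os.read on the raw file descriptor avoids mixing the selector with Python's own buffering, at the cost of receiving arbitrary chunks rather than whole lines.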

From the subprocess documentation (https://docs.python.org/3/library/subprocess.html#using-the-subprocess-module):
If you wish to capture and combine both streams into one, use
stdout=PIPE and stderr=STDOUT instead of capture_output.
so the easiest solution would be:
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
stdout_iterator = iter(process.stdout.readline, b"")
for line in stdout_iterator:
    # Do stuff with line
    print line
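In Python 3 (3.7 or later for the text argument), the merged-stream approach can be sketched as follows. The function name merged_lines is hypothetical. Note the trade-off: once stderr is folded into stdout, the relative order of the lines is preserved, but there is no longer any way to tell which stream a line came from.

```python
import subprocess


def merged_lines(cmd):
    """Yield decoded lines from cmd with stderr folded into stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
```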

I know this question is very old, but this answer may help others who stumble upon this page in researching a solution for a similar situation, so I'm posting it anyway.
I've built a simple python snippet that will merge any number of pipes into a single one. Of course, as stated above, the order cannot be guaranteed, but this is as close as I think you can get in Python.
It spawns a thread for each of the pipes, reads them line by line and puts them into a Queue (which is FIFO). The main thread loops through the queue, yielding each line.
import threading, queue

def merge_pipes(**named_pipes):
    r'''
    Merges multiple pipes from subprocess.Popen (maybe other sources as well).
    The keyword argument keys will be used in the output to identify the source
    of the line.

    Example:
    p = subprocess.Popen(['some', 'call'],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    outputs = {'out': log.info, 'err': log.warn}
    for name, line in merge_pipes(out=p.stdout, err=p.stderr):
        outputs[name](line)

    This will output stdout to the info logger, and stderr to the warning logger
    '''

    # Constants. Could also be placed outside of the method. I just put them here
    # so the method is fully self-contained
    PIPE_OPENED=1
    PIPE_OUTPUT=2
    PIPE_CLOSED=3

    # Create a queue where the pipes will be read into
    output = queue.Queue()

    # This method is the run body for the threads that are instantiated below
    # This could be easily rewritten to be outside of the merge_pipes method,
    # but to make it fully self-contained I put it here
    def pipe_reader(name, pipe):
        r"""
        reads a single pipe into the queue
        """
        output.put( ( PIPE_OPENED, name, ) )
        try:
            for line in iter(pipe.readline,''):
                output.put( ( PIPE_OUTPUT, name, line.rstrip(), ) )
        finally:
            output.put( ( PIPE_CLOSED, name, ) )

    # Start a reader for each pipe
    for name, pipe in named_pipes.items():
        t=threading.Thread(target=pipe_reader, args=(name, pipe, ))
        t.daemon = True
        t.start()

    # Use a counter to determine how many pipes are left open.
    # If all are closed, we can return
    pipe_count = 0

    # Read the queue in order, blocking if there's no data
    for data in iter(output.get,''):
        code=data[0]
        if code == PIPE_OPENED:
            pipe_count += 1
        elif code == PIPE_CLOSED:
            pipe_count -= 1
        elif code == PIPE_OUTPUT:
            yield data[1:]
        if pipe_count == 0:
            return

This works for me (on windows):
https://github.com/waszil/subpiper
from subpiper import subpiper

def my_stdout_callback(line: str):
    print(f'STDOUT: {line}')

def my_stderr_callback(line: str):
    print(f'STDERR: {line}')

my_additional_path_list = [r'c:\important_location']
retcode = subpiper(cmd='echo magic',
                   stdout_callback=my_stdout_callback,
                   stderr_callback=my_stderr_callback,
                   add_path_list=my_additional_path_list)

How am I using the multiprocessing (python) module wrong?

Can someone help me figure out why the following code won't run properly? I want to spawn new processes as the previous ones finish, but this code launches everything at once: all the jobs are reported as finished and stopped when they aren't, and their windows are still open. Any thoughts on why is_alive() returns False when the process is actually still running?
import subprocess
import sys
import multiprocessing
import time
start_on = 33 #'!'
end_on = 34
num_processors = 4;
jobs = []

def createInstance():
    global start_on, end_on, jobs
    cmd = "python scrape.py" + " " + str(start_on) + " " + str(end_on)
    print cmd
    p = multiprocessing.Process(target=processCreator(cmd))
    jobs.append(p)
    p.start()
    start_on += 1
    end_on += 1
    print "length of jobs is: " + str(len(jobs))

def processCreator(cmd):
    subprocess.Popen(cmd, creationflags=subprocess.CREATE_NEW_CONSOLE)

if __name__ == '__main__':
    num_processors = input("How many instances to run simultaneously?: ")
    for i in range(num_processors):
        createInstance()
    while len(jobs) > 0:
        jobs = [job for job in jobs if job.is_alive()]
        for i in range(num_processors - len(jobs)):
            createInstance()
        time.sleep(1)
    print('*** All jobs finished ***')

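Your code spawns two processes on each createInstance() call, and I think that is what confuses the is_alive() check.
p = multiprocessing.Process(target=processCreator(cmd))
Note that target=processCreator(cmd) calls processCreator(cmd) immediately in the parent and passes its return value (None) as the target, so the multiprocessing.Process has nothing to run and exits almost at once. Meanwhile, subprocess.Popen(cmd, creationflags=subprocess.CREATE_NEW_CONSOLE) spawns a child process to run the command and returns immediately without waiting for it. As a result, is_alive() reports False while the scraper is still running.
I think this version will work; it drops multiprocessing and polls the Popen objects directly. I have also changed cmd to a list of arguments (see the docs):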
import subprocess
import sys
import time
start_on = 33 #'!'
end_on = 34
num_processors = 4;
jobs = []

def createInstance():
    global start_on, end_on, jobs
    cmd = ["python","scrape.py", str(start_on), str(end_on)]
    print(str(cmd))
    # Popen starts the child right away; there is no start() method to call
    p = subprocess.Popen(cmd, creationflags=subprocess.CREATE_NEW_CONSOLE)
    jobs.append(p)
    start_on += 1
    end_on += 1
    print "length of jobs is: " + str(len(jobs))

if __name__ == '__main__':
    num_processors = input("How many instances to run simultaneously?: ")
    for i in range(num_processors):
        createInstance()
    while len(jobs) > 0:
        # poll() returns None while the child is still running
        jobs = [job for job in jobs if job.poll() is None]
        for i in range(num_processors - len(jobs)):
            createInstance()
        time.sleep(1)
    print('*** All jobs finished ***')
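The polling idea can also be written as a small, platform-neutral Python 3 helper that works through a finite list of commands. This is only a sketch: run_throttled and its parameters are hypothetical names, and the Windows-specific creationflags argument is left out.

```python
import subprocess
import time


def run_throttled(commands, max_running, poll_interval=0.1):
    """Run every command, keeping at most max_running alive; return exit codes."""
    pending = list(commands)
    running = []
    exit_codes = []
    while pending or running:
        # Top up to the limit while there is still work to hand out.
        while pending and len(running) < max_running:
            running.append(subprocess.Popen(pending.pop(0)))
        time.sleep(poll_interval)
        still_running = []
        for proc in running:
            if proc.poll() is None:      # None means "not finished yet"
                still_running.append(proc)
            else:
                exit_codes.append(proc.returncode)
        running = still_running
    return exit_codes
```

Unlike the loop above, this one terminates on its own once the list of commands is exhausted and every child has exited.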

Why main process can not exit when I use multiple processes in Python?

I expected the main process to exit once all child processes had finished, but it never exits. Why?
#!/usr/bin/env python
# coding=utf-8
import os
from multiprocessing import Manager
from multiprocessing import Pool

def write_file_name_to_queue(q, src_folder):
    print('Process to write: %s' % os.getpid())
    if not os.path.exists(src_folder):
        print "Please input folder path"
        return
    for (dirpath, dirnames, filelist) in os.walk(src_folder):
        for name in filelist:
            if name[0] == '.':
                continue
            q.put(os.path.join(dirpath, name))

def read_file_name_from_queue(q):
    print('Process to read: %s' % os.getpid())
    while True:
        value = q.get(True)
        print('Get %s from queue.' % value)

if __name__ == "__main__":
    mg = Manager()
    q = mg.Queue()
    p = Pool()
    p.apply_async(func=write_file_name_to_queue, args=(q, "./test/"))
    for i in xrange(8):
        p.apply_async(func=read_file_name_from_queue, args=(q,))
    p.close()
    p.join()
Running it gives the following result:
➜ check python check_process.py
Process to write: 3918
Process to read: 3919
Process to read: 3920
Get ./test/a from queue.
Get ./test/b from queue.
Get ./test/c from queue.
Get ./test/e from queue.
Get ./test/f from queue.
Process to read: 3921
Process to read: 3918
The process still waits.

process stop working while queue is not empty

I am trying to write a Python script that converts URLs into their corresponding IP addresses. Since the URL file is huge (nearly 10 GB), I'm trying to use the multiprocessing library.
I create one process to write the output to a file and a set of processes to convert the URLs.
Here is my code:
import multiprocessing as mp
import socket
import time
num_processes = mp.cpu_count()
sentinel = None

def url2ip(inqueue, output):
    v_url = inqueue.get()
    print 'v_url '+v_url
    try:
        v_ip = socket.gethostbyname(v_url)
        output_string = v_url+'|||'+v_ip+'\n'
    except:
        output_string = v_url+'|||-1'+'\n'
    print 'output_string '+output_string
    output.put(output_string)
    print output.full()

def handle_output(output):
    f_ip = open("outputfile", "a")
    while True:
        output_v = output.get()
        if output_v:
            print 'output_v '+output_v
            f_ip.write(output_v)
        else:
            break
    f_ip.close()

if __name__ == '__main__':
    output = mp.Queue()
    inqueue = mp.Queue()
    jobs = []
    proc = mp.Process(target=handle_output, args=(output, ))
    proc.start()
    print 'run in %d processes' % num_processes
    for i in range(num_processes):
        p = mp.Process(target=url2ip, args=(inqueue, output))
        jobs.append(p)
        p.start()
    for line in open('inputfile','r'):
        print 'ori '+line.strip()
        inqueue.put(line.strip())
    for i in range(num_processes):
        # Send the sentinel to tell Simulation to end
        inqueue.put(sentinel)
    for p in jobs:
        p.join()
    output.put(None)
    proc.join()
However, it did not work. It produced some output (4 of the 10 URLs in the test file) and then suddenly stopped while the queues were still not empty (I checked with queue.empty()).
Could anyone suggest what's wrong? Thanks.

Your workers exit after processing a single URL each; they need to loop internally until they get the sentinel. However, you should probably just look at multiprocessing.Pool instead, as that does the bookkeeping for you.
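A worker that loops until it sees the sentinel might look like this Python 3 sketch. The names url2ip_worker and start_workers, and the resolve parameter (added so the lookup function can be swapped out), are assumptions made for illustration and not part of the question's code.

```python
import multiprocessing as mp
import socket

SENTINEL = None


def url2ip_worker(inqueue, output, resolve=socket.gethostbyname):
    """Resolve hostnames from inqueue until the sentinel arrives."""
    # iter(callable, sentinel) keeps calling inqueue.get() and stops
    # as soon as it returns SENTINEL.
    for host in iter(inqueue.get, SENTINEL):
        try:
            ip = resolve(host)
        except OSError:            # lookup failed
            ip = "-1"
        output.put("%s|||%s\n" % (host, ip))


def start_workers(inqueue, output, count):
    """Start count worker processes; the caller must put count sentinels."""
    workers = [mp.Process(target=url2ip_worker, args=(inqueue, output))
               for _ in range(count)]
    for w in workers:
        w.start()
    return workers
```

Each worker consumes exactly one sentinel before exiting, which is why the main process has to enqueue one sentinel per worker, as the question's code already does.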

Confusing on subprocess.Popen

This problem confuses me.
I just want to run one command on 18 different input files, so I wrote it like this:
while filenames or running:
    while filenames and len(running) < N_CORES:
        filename = filenames.pop(0)
        print 'Submiting process for %s' % filename
        cmd = COMMAND % dict(filename=filename, localdir=localdir)
        p = subprocess.Popen(cmd, shell=True)
        print 'running:', cmd
        running.append((cmd, p))
    i = 0
    while i < len(running):
        (cmd, p) = running[i]
        ret = p.poll()
        if ret is not None:
            rep = open('Crux.report.%d' % (report_number), 'w')
            rep.write('Command: %s' % cmd)
            print localdir
            print 'done!'
            report_number += 1
            running.remove((cmd, p))
        else:
            i += 1
    time.sleep(1)
But when I run it, after 3 hours all of the processes go into the sleep state.
If I call the command manually from a terminal (for each of the different files), they all complete fine.
Any help would be appreciated.

I assume you want to run 18 processes (one process per file) with no more than N_CORES processes in parallel.
The simplest way could be to use multiprocessing.Pool here:
import multiprocessing as mp
import subprocess

def process_file(filename):
    # cmd, localdir and filenames are assumed to be defined as in the question
    try:
        return filename, subprocess.call([cmd, filename], cwd=localdir)
    except OSError:
        return filename, None # failed to start subprocess

if __name__ == "__main__":
    pool = mp.Pool()
    for result in pool.imap_unordered(process_file, filenames):
        # report result here
        print(result)
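Because each task only waits on an external command, threads work about as well as processes here. The sketch below deliberately swaps multiprocessing.Pool for concurrent.futures.ThreadPoolExecutor (Python 3); run_all is a hypothetical name, and every command is given as a complete argument list.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(commands, max_workers):
    """Yield (command, exit code) as each finishes; None if it failed to start."""
    def call(cmd):
        try:
            return subprocess.call(cmd)
        except OSError:              # e.g. the executable was not found
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call, cmd): cmd for cmd in commands}
        for fut in as_completed(futures):
            yield futures[fut], fut.result()
```

Setting max_workers to the desired number of simultaneous commands plays the same role as N_CORES in the question.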

Without knowing what your subprocesses are supposed to do and how long they are supposed to run, it's hard to give an accurate answer here.
Some problems I see with your program:
You check i < len(running) while both incrementing i and removing from running.
Either use a counter or check whether the list still contains elements, but don't do both at the same time. This way you will break out of the loop halfway.
You increment i each time a process has not finished; you probably want to increment it when a process has finished.
