I'm performing parallel operations on a set of files. The basic function is:
def funcOnFile(fileName):
    print('Executing ', fileName)
    readFile
    ....
    saveOutputFile
Since each file is independent of the others, I can easily do this in parallel without the processes having to talk to each other.
However, I must be sure to keep at least one core free at all times, otherwise my old computer will freeze and die.
What I do is:
import multiprocessing
import os
for i in range(0, len(filenames), numProcesses):
    processes = []
    for j in range(numProcesses):
        index = i + j
        if index >= len(filenames):
            break
        filename = filenames[index]
        process = multiprocessing.Process(
            name=os.path.basename(filename),
            target=func, args=(filename, args)
        )
        processes.append(process)
        process.start()
    # wait for the whole batch before starting the next one
    for p in processes:
        p.join()
At the end of this process I want to sync the files to my S3 remote repository, and I do it using subprocess:
subprocess.run(['aws', 's3', 'sync', localOuputs, s3Output])
However, it happens that the sync starts before the last files are saved!
Does anyone have an explanation or fix for this? I thought that this would be avoided by the join().
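One way to avoid the manual batching, and with it any doubt about which processes have been joined, is to let a pool own the workers. The following is only a sketch under stated assumptions, not a diagnosis of the problem above: it targets Python 3, and `func_on_file`, `worker_count`, the input names, and the temporary output directory are hypothetical stand-ins for the real function and paths.

```python
import multiprocessing
import os
import tempfile


def worker_count(total_cores=None):
    """Leave one core free, but always keep at least one worker."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - 1)


def func_on_file(job):
    """Hypothetical stand-in for funcOnFile: write one output file."""
    file_name, out_dir = job
    out_path = os.path.join(out_dir, os.path.basename(file_name) + ".out")
    # The with-block closes the file before the function returns.
    with open(out_path, "w") as handle:
        handle.write("processed %s\n" % file_name)
    return out_path


if __name__ == "__main__":
    filenames = ["a.txt", "b.txt", "c.txt"]      # assumed input names
    with tempfile.TemporaryDirectory() as out_dir:
        pool = multiprocessing.Pool(processes=worker_count())
        jobs = [(name, out_dir) for name in filenames]
        written = pool.map(func_on_file, jobs)   # blocks until all jobs return
        pool.close()                             # no more tasks
        pool.join()                              # wait for the workers to exit
        missing = [p for p in written if not os.path.exists(p)]
        print("outputs written:", len(written), "missing:", len(missing))
        # Only at this point would the sync be launched, for example:
        # subprocess.run(['aws', 's3', 'sync', localOuputs, s3Output], check=True)
```

Because `pool.map` returns only after every call has returned, and each call closes its file before returning, the existence check gives a cheap way to confirm that all outputs are on disk before the sync starts.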
Related
I am reading various tutorials on the multiprocessing module in Python, and am having trouble understanding why/when to call process.join(). For example, I stumbled across this example:
nums = range(100000)
nprocs = 4
def worker(nums, out_q):
    """ The worker function, invoked in a process. 'nums' is a
        list of numbers to factor. The results are placed in
        a dictionary that's pushed to a queue.
    """
    outdict = {}
    for n in nums:
        outdict[n] = factorize_naive(n)
    out_q.put(outdict)
# Each process will get 'chunksize' nums and a queue to put his out
# dict into
out_q = Queue()
chunksize = int(math.ceil(len(nums) / float(nprocs)))
procs = []
for i in range(nprocs):
    p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)],
                  out_q))
    procs.append(p)
    p.start()
# Collect all results into a single result dict. We know how many dicts
# with results to expect.
resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())
# Wait for all worker processes to finish
for p in procs:
    p.join()
print resultdict
From what I understand, process.join() will block the calling process until the process whose join method was called has completed execution. I also believe that the child processes which have been started in the above code example complete execution upon completing the target function, that is, after they have pushed their results to the out_q. Lastly, I believe that out_q.get() blocks the calling process until there are results to be pulled. Thus, if you consider the code:
resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())
# Wait for all worker processes to finish
for p in procs:
    p.join()
the main process is blocked by the out_q.get() calls until every single worker process has finished pushing its results to the queue. Thus, by the time the main process exits the for loop, each child process should have completed execution, correct?
If that is the case, is there any reason for calling the p.join() methods at this point? Haven't all worker processes already finished, so how does that cause the main process to "wait for all worker processes to finish?" I ask mainly because I have seen this in multiple different examples, and I am curious if I have failed to understand something.
Try to run this:
import math
import time
from multiprocessing import Queue
import multiprocessing
def factorize_naive(n):
    factors = []
    for div in range(2, int(n**.5)+1):
        while not n % div:
            factors.append(div)
            n //= div
    if n != 1:
        factors.append(n)
    return factors
nums = range(100000)
nprocs = 4
def worker(nums, out_q):
    """ The worker function, invoked in a process. 'nums' is a
        list of numbers to factor. The results are placed in
        a dictionary that's pushed to a queue.
    """
    outdict = {}
    for n in nums:
        outdict[n] = factorize_naive(n)
    out_q.put(outdict)
# Each process will get 'chunksize' nums and a queue to put his out
# dict into
out_q = Queue()
chunksize = int(math.ceil(len(nums) / float(nprocs)))
procs = []
for i in range(nprocs):
    p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)],
                  out_q))
    procs.append(p)
    p.start()
# Collect all results into a single result dict. We know how many dicts
# with results to expect.
resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())
time.sleep(5)
# Wait for all worker processes to finish
for p in procs:
    p.join()
print resultdict
time.sleep(15)
Then open the task manager. You should be able to see that the 4 subprocesses go into a zombie state for some seconds before being cleaned up by the OS (due to the join calls).
In more complex situations the child processes could stay in the zombie state forever (like the situation you were asking about in another question), and if you create enough child processes you could fill the process table, causing trouble for the OS (which may kill your main process to avoid failures).
At the point just before you call join, all workers have put their results into the queue, but they did not necessarily return, and their processes may not yet have terminated. They may or may not have done so, depending on timing.
Calling join makes sure that all processes are given the time to properly terminate.
I am not exactly sure of the implementation details, but join also seems to be necessary to reflect that a process has indeed terminated (after calling terminate on it for example). In the example here, if you don't call join after terminating a process, process.is_alive() returns True, even though the process was terminated with a process.terminate() call.
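The timing described in these answers can be made visible with a small experiment. This is a sketch for Python 3, not the code from the question: the `linger` delay is an artificial assumption, added so that the child is still running after its result has been read, and all names are illustrative.

```python
import multiprocessing
import time


def worker(n, out_q, linger=0.5):
    """Put a result on the queue, then keep running for a moment."""
    out_q.put(n * n)
    time.sleep(linger)  # artificial delay standing in for cleanup work


if __name__ == "__main__":
    out_q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(7, out_q))
    p.start()

    print("result:", out_q.get())              # returns once put() has happened
    print("alive after get():", p.is_alive())  # usually True at this point
    p.join()                                   # blocks until the child has exited
    print("alive after join():", p.is_alive())
    print("exit code:", p.exitcode)
```

Receiving the result only proves that `put()` was called; `join()` is what guarantees that the child has actually exited and that its exit code is available.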
I have n files to analyze separately and independently of each other with the same Python script analysis.py. In a wrapper script, wrapper.py, I loop over those files and call analysis.py as a separate process with subprocess.Popen:
for a_file in all_files:
    command = "python analysis.py %s" % a_file
    analysis_process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    analysis_process.wait()
Now, I would like to use all the k CPU cores of my machine in order to speed up the whole analysis.
Is there a way to always have k-1 running processes as long as there are files to analyze?
This outlines how to use multiprocessing.Pool which exists exactly for these tasks:
from multiprocessing import Pool, cpu_count
# ...
all_files = ["file%d" % i for i in range(5)]
def process_file(file_name):
    # process file
    return "finished file %s" % file_name
pool = Pool(cpu_count())
# this is a blocking call - when it's done, all files have been processed
results = pool.map(process_file, all_files)
# no more tasks can go in the pool
pool.close()
# wait for all workers to complete their task (though we used a blocking call...)
pool.join()
# ['finished file file0', 'finished file file1', ... , 'finished file file4']
print results
Adding Joel's comment mentioning a common pitfall:
Make sure that the function you pass to pool.map() contains only objects that are defined at the module level. Python multiprocessing uses pickle to pass objects between processes, and pickle has issues with things like functions defined in a nested scope.
The docs for what can be pickled
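The pickling pitfall can be checked without starting any processes, since pickle itself decides what can be sent to a worker. A minimal sketch for Python 3; `make_nested_worker` and `can_pickle` are hypothetical helpers introduced only for this illustration.

```python
import pickle


def process_file(file_name):
    """Module-level function: pickled by reference to its name."""
    return "finished file %s" % file_name


def make_nested_worker():
    """Return a function defined in a nested scope."""
    def nested(file_name):
        return "finished file %s" % file_name
    return nested


def can_pickle(obj):
    """Report whether pickle accepts obj, without raising."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("module-level:", can_pickle(process_file))
    print("nested:", can_pickle(make_nested_worker()))
    print("lambda:", can_pickle(lambda name: name))
```

A function that fails this check would also fail when handed to `pool.map()`, so the check is a quick way to find the offending object.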
I have a Python script that runs a few external commands using the subprocess module. One of these steps takes a very long time, so I would like to run it separately. I need to launch the commands, check that they have finished, and then execute the next command, which is not parallel.
My code is something like this:
nproc = 24
for i in xrange(nproc):
    #Run program in parallel
    pass
#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    for i in xrange(nproc):
        for zline in open('Niben_%s_file%d_structures' % (zfile_name,i)):handle.write(zline)
    handle.close()
#Run next step
cmd = 'bowtie-build -f Niben_%s_precursors.fa bowtie-index/Niben_%s_precursors' % (zfile_name,zfile_name)
For your example, you just want to shell out in parallel - you don't need threads for that.
Use the Popen constructor in the subprocess module: http://docs.python.org/library/subprocess.htm
Collect the Popen instances for each process you spawned and then wait() for them to finish:
procs = []
for i in xrange(nproc):
    procs.append(subprocess.Popen(ARGS_GO_HERE)) #Run program in parallel
for p in procs:
    p.wait()
You can get away with this (as opposed to using the multiprocessing or threading modules), since you aren't really interested in having these interoperate: you just want the OS to run them in parallel and be sure they have all finished when you go to combine the results.
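As a runnable variant of the Popen-and-wait idea, the sketch below also collects the exit codes, so that the combine step can be skipped if a command failed. It assumes Python 3; `run_parallel` is a hypothetical helper, and the child commands are stand-ins that simply exit with a given status.

```python
import subprocess
import sys


def run_parallel(commands):
    """Start every command at once, then wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # wait() blocks until that child exits and returns its exit status.
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Hypothetical stand-in commands: each child just exits with a status.
    commands = [[sys.executable, "-c", "import sys; sys.exit(%d)" % code]
                for code in (0, 0, 3)]
    codes = run_parallel(commands)
    failed = [cmd for cmd, code in zip(commands, codes) if code != 0]
    print("exit codes:", codes)
    print("failed commands:", len(failed))
```

Note that this starts all commands at once, so it suits cases where the number of commands does not exceed what the machine can run comfortably.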
Running things in parallel can also be implemented using multiple processes in Python. I wrote a blog post on this topic a while ago; you can find it here:
http://multicodecjukebox.blogspot.de/2010/11/parallelizing-multiprocessing-commands.html
Basically, the idea is to use "worker processes" which independently retrieve jobs from a queue and then complete these jobs.
Works quite well in my experience.
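The worker-process idea can be sketched as follows. This is not the code from the blog post; it is a minimal illustration for Python 3 in which squaring a number stands in for the real job and a `None` sentinel (an assumption of this sketch) tells each worker to stop.

```python
import multiprocessing


def worker(jobs, results):
    """Take jobs until the None sentinel arrives; report each result."""
    while True:
        job = jobs.get()
        if job is None:                # sentinel: no more work for this worker
            break
        results.put((job, job * job))  # squaring stands in for real work


if __name__ == "__main__":
    n_workers = 3
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()

    numbers = list(range(10))
    for n in numbers:
        jobs.put(n)
    for _ in workers:
        jobs.put(None)                 # one sentinel per worker

    # Drain the results before joining, so no worker blocks on a full pipe.
    collected = dict(results.get() for _ in numbers)
    for w in workers:
        w.join()
    print(collected)
```

Each worker pulls its next job as soon as it is free, so the load balances itself even when jobs take different amounts of time.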
You can do it using threads. This is a very short (and untested) example with a very ugly if-else on what you are actually doing in the thread, but you can write your own worker classes.
import threading
class Worker(threading.Thread):
    def __init__(self, i):
        # initialise the Thread base class before setting our own state
        super(Worker, self).__init__()
        self._i = i
    def run(self):
        if self._i == 1:
            self.result = do_this()
        elif self._i == 2:
            self.result = do_that()
threads = []
nproc = 24
for i in xrange(nproc):
    #Run program in parallel
    w = Worker(i)
    threads.append(w)
    w.start()
# join only after every thread has been started, so they run concurrently
for w in threads:
    w.join()
# ...now all threads are done
#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    ...etc...
I have a Python script that I want to use as a controller to another Python script. I have a server with 64 processors, so want to spawn up to 64 child processes of this second Python script. The child script is called:
$ python create_graphs.py --name=NAME
where NAME is something like XYZ, ABC, NYU etc.
In my parent controller script I retrieve the name variable from a list:
my_list = [ 'XYZ', 'ABC', 'NYU' ]
So my question is, what is the best way to spawn off these processes as children? I want to limit the number of children to 64 at a time, so need to track the status (if the child process has finished or not) so I can efficiently keep the whole generation running.
I looked into using the subprocess package, but rejected it because it only spawns one child at a time. I finally found the multiprocessing package, but I admit to being overwhelmed by the whole threads vs. subprocesses documentation.
Right now, my script uses subprocess.call to only spawn one child at a time and looks like this:
#!/path/to/python
import subprocess, multiprocessing, Queue
from multiprocessing import Process
my_list = [ 'XYZ', 'ABC', 'NYU' ]
if __name__ == '__main__':
    processors = multiprocessing.cpu_count()
    for i in range(len(my_list)):
        if( i < processors ):
            cmd = ["python", "/path/to/create_graphs.py", "--name="+ my_list[i]]
            child = subprocess.call( cmd, shell=False )
I really want it to spawn up to 64 children at a time. In other Stack Overflow questions I saw people using Queue, but it seems like that creates a performance hit?
What you are looking for is the process pool class in multiprocessing.
import multiprocessing
import subprocess
def work(cmd):
    return subprocess.call(cmd, shell=False)
if __name__ == '__main__':
    count = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=count)
    print pool.map(work, ['ls'] * count)
And here is a calculation example to make it easier to understand. The following will divide 10000 tasks on N processes where N is the cpu count. Note that I'm passing None as the number of processes. This will cause the Pool class to use cpu_count for the number of processes (reference)
import multiprocessing
import subprocess
def calculate(value):
    return value * 10
if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    tasks = range(10000)
    results = []
    r = pool.map_async(calculate, tasks, callback=results.append)
    r.wait() # Wait on the results
    print results
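One detail of the calculation example is easy to miss: the callback given to `map_async` is called once, with the complete list of results, so `callback=results.append` leaves `results` holding a single nested list. The sketch below shows this for Python 3. As a substitution of my own, it uses `multiprocessing.pool.ThreadPool`, which offers the same `map_async` interface, so that the snippet runs with threads instead of worker processes; `run` is a hypothetical helper.

```python
from multiprocessing.pool import ThreadPool


def calculate(value):
    return value * 10


def run(tasks):
    results = []
    pool = ThreadPool(None)  # None: size the pool from the CPU count
    r = pool.map_async(calculate, tasks, callback=results.append)
    r.wait()                 # block until every task has finished
    pool.close()
    pool.join()
    # The callback fired once with the full list, so results is nested;
    # r.get() returns the flat list directly.
    return results, r.get()


if __name__ == "__main__":
    nested, flat = run(range(5))
    print(nested)
    print(flat)
```

If only the flat list is needed, calling `r.get()` (or using the blocking `pool.map`) avoids the callback altogether.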
Here is the solution I came up with, based on Nadia and Jim's comments. I am not sure if it is the best way, but it works. The original child script being called needs to be a shell script because I need to use some third-party apps, including Matlab. So I had to take it out of Python and code it in bash.
import sys
import os
import multiprocessing
import subprocess
def work(staname):
    print 'Processing station:',staname
    print 'Parent process:', os.getppid()
    print 'Process id:', os.getpid()
    cmd = [ "/bin/bash", "/path/to/executable/create_graphs.sh","--name=%s" % (staname) ]
    return subprocess.call(cmd, shell=False)
if __name__ == '__main__':
    my_list = [ 'XYZ', 'ABC', 'NYU' ]
    my_list.sort()
    print my_list
    # Get the number of processors available
    num_processes = multiprocessing.cpu_count()
    threads = []
    len_stas = len(my_list)
    print "+++ Number of stations to process: %s" % (len_stas)
    # run until all the threads are done, and there is no data left
    for list_item in my_list:
        # if we aren't using all the processors AND there is still data left to
        # compute, then spawn another thread
        if( len(threads) < num_processes ):
            p = multiprocessing.Process(target=work,args=[list_item])
            p.start()
            print p, p.is_alive()
            threads.append(p)
        else:
            for thread in threads:
                if not thread.is_alive():
                    threads.remove(thread)
Does this seem like a reasonable solution? I tried to use Jim's while loop format, but my script just returned nothing. I am not sure why that would be. Here is the output when I run the script with Jim's 'while' loop replacing the 'for' loop:
hostname{me}2% controller.py
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
hostname{me}3%
When I run it with the 'for' loop, I get something more meaningful:
hostname{me}6% controller.py
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
Processing station: ABC
Parent process: 1056
Process id: 1068
Processing station: NYU
Parent process: 1056
Process id: 1069
Processing station: XYZ
Parent process: 1056
Process id: 1071
hostname{me}7%
So this works, and I am happy. However, I still don't get why I can't use Jim's 'while' style loop instead of the 'for' loop I am using. Thanks for all the help - I am impressed with the breadth of knowledge at Stack Overflow.
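One thing to watch in the 'for' loop version: when the limit is reached, the else branch only cleans up finished children, so the item for that iteration appears never to be started. With 3 stations and 64 processors the branch is never taken, which may be why the problem does not show. The sketch below is one possible way to keep at most `limit` children running without dropping work. It assumes Python 3 and uses `subprocess.Popen` directly; `run_limited` and the stand-in commands are hypothetical, and `limit` must be at least 1.

```python
import subprocess
import sys
import time


def run_limited(commands, limit, poll_interval=0.05):
    """Run every command, with at most `limit` children alive at once."""
    pending = list(enumerate(commands))
    running = []                      # (index, Popen) pairs
    codes = [None] * len(commands)
    while pending or running:
        # Fill free slots; nothing is dropped when the limit is reached.
        while pending and len(running) < limit:
            index, cmd = pending.pop(0)
            running.append((index, subprocess.Popen(cmd)))
        # Retire finished children by rebuilding the list, rather than
        # removing items from a list that is being iterated over.
        still_running = []
        for index, proc in running:
            code = proc.poll()        # None while the child is still running
            if code is None:
                still_running.append((index, proc))
            else:
                codes[index] = code
        running = still_running
        if running:
            time.sleep(poll_interval)
    return codes


if __name__ == "__main__":
    # Hypothetical stand-in for create_graphs.sh: print the station name.
    stations = ['ABC', 'NYU', 'XYZ']
    commands = [[sys.executable, "-c", "print('Processing station: %s')" % s]
                for s in stations]
    print(run_limited(commands, limit=2))
```

The outer loop ends only when nothing is pending and nothing is running, so the function also waits for the last children before returning their exit codes.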
I would definitely use multiprocessing rather than rolling my own solution using subprocess.
I don't think you need a queue unless you intend to get data out of the applications (and if you do want data, I think it may be easier to add it to a database anyway).
But try this on for size:
Put the contents of your create_graphs.py script into a function called "create_graphs".
import threading
from create_graphs import create_graphs
num_processes = 64
my_list = [ 'XYZ', 'ABC', 'NYU' ]
threads = []
# run until all the threads are done, and there is no data left
while threads or my_list:
    # if we aren't using all the processors AND there is still data left to
    # compute, then spawn another thread
    if (len(threads) < num_processes) and my_list:
        t = threading.Thread(target=create_graphs, args=[ my_list.pop() ])
        t.daemon = True
        t.start()
        threads.append(t)
    # in the case that we have the maximum number of threads check if any of them
    # are done. (also do this when we run out of data, until all the threads are done)
    else:
        # iterate over a copy so that removing items is safe
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)
I know that this will result in one fewer thread than processors, which is probably good: it leaves a processor to manage the threads, disk I/O, and other things happening on the computer. If you decide you want to use the last core, just add one to it.
Edit: I think I may have misinterpreted the purpose of my_list. You do not need my_list to keep track of the threads at all (as they're all referenced by the items in the threads list). But this is a fine way of feeding the processes input, or even better: use a generator function ;)
The purpose of my_list and threads
my_list holds the data that you need to process in your function
threads is just a list of the currently running threads
The while loop does two things: it starts new threads to process the data, and it checks whether any threads are done running.
So as long as you have either (a) more data to process, or (b) threads that aren't finished running, you want the program to continue running. Once both lists are empty they will evaluate to False and the while loop will exit.
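The same bookkeeping (start work while slots are free, retire finished threads, stop when nothing is left) can also be delegated to the standard library. The sketch below assumes Python 3 and swaps the hand-written loop for `concurrent.futures.ThreadPoolExecutor`; the `create_graphs` body here is a hypothetical stand-in for the real function, and `run_all` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor


def create_graphs(name):
    """Hypothetical stand-in for the real create_graphs function."""
    return "graphs for %s" % name


def run_all(names, max_workers):
    # The executor starts work while slots are free and hands out the
    # remaining items as threads become available.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order; leaving the with-block
        # waits for all submitted work to finish.
        return list(executor.map(create_graphs, names))


if __name__ == "__main__":
    my_list = ['XYZ', 'ABC', 'NYU']
    print(run_all(my_list, max_workers=2))
```

Compared with the polling loop, this blocks instead of spinning while it waits, and it returns the results in the order of the input list.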