Graceful Termination of Worker Pool

I want to spawn X number of Pool workers and give each of them a share of the work to do. My issue is that the work takes about 20 minutes to exhaust, and longer for each extra process running. Due to the type of calculations being done, my answer may be found within minutes or hours. What I would like to do is implement some way for a single worker to go "HEY I FOUND IT" and use that signal to kill the remainder of the pool and move on with my calculations.
Key points:
I have tried callbacks; they don't seem to run on a starmap_async until the entire pool finishes.
I only care about the first suitable answer found.
I am not sharing resources and surprise process death, albeit rude, is perfectly acceptable.
I've also considered using a Queue, but it wouldn't make sense because the scope of work I'm passing to each worker is already built into the parameters of the function.
Below is a very dulled down version of what I'm working with (the calculations I'm working with can take hours to finish over a 4.2 billion complex iterable.)
def doWork():
    workers = Pool(2)
    results = workers.starmap_async(func=distSearch, iterable=Sections1_5, callback=killPool)
    workers.close()
    print("Found answer : {}".format(results.get()))
    workers.join()
def killPool():
    workers.terminate()
    print("Worker Pool Terminated")
I should probably specify that my process only returns if it finds an answer; otherwise it just exits once done. I have looked at this thread, but it has me completely lost, and it seems like a lot of overhead to constantly check for the win condition when that should come in the return/callback of the Worker Pool.
All the answers I've found result in significant overhead by supervising the worker pool. I'm looking for a solution that sources the kill signal at the worker level, autonomously.

I'm looking for a solution that sources the kill signal at the worker level, autonomously.
AFAIK, that doesn't exist. The methods of the Pool object (like Pool.terminate) should only be used in the process that created the pool.
What you could do is use Pool.imap_unordered. This returns an iterator in the parent process over the results which yields results as soon as they become available. As soon as the desired result pops up, you could then use Pool.terminate().
Edit:
From looking at the 3.5 implementation, starmap_async returns a MapResult instance, which is not an iterator.
You can wrap multiple inputs in a tuple and use imap_unordered over a list of those.
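A minimal sketch of that approach, assuming Python 3. The names dist_search and find_first and the (start, stop, target) tuples are hypothetical stand-ins for the real search function and its parameters; only Pool.imap_unordered and Pool.terminate come from the answer above.

```python
import multiprocessing


def dist_search(bounds):
    """Hypothetical stand-in for the real search: scan [start, stop) for target."""
    start, stop, target = bounds  # several inputs wrapped in one tuple
    for candidate in range(start, stop):
        if candidate == target:
            return candidate  # this section contained the answer
    return None  # this section held no answer


def find_first(sections, processes=2):
    """Return the first non-None result and kill the remaining workers."""
    pool = multiprocessing.Pool(processes)
    try:
        # Results are yielded in completion order, not submission order.
        for result in pool.imap_unordered(dist_search, sections):
            if result is not None:
                return result
        return None
    finally:
        # terminate() is called from the parent process that created the pool;
        # it stops outstanding work without waiting for it to finish.
        pool.terminate()
        pool.join()


if __name__ == "__main__":
    sections = [(0, 1000, 4242), (1000, 5000, 4242), (5000, 9000, 4242)]
    print("Found answer: {}".format(find_first(sections)))
```

The parent does no polling here: it simply blocks on the iterator until some worker returns, so the supervision overhead is small.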

Related

Should I create a new Pool object every time or reuse a single one?

I'm trying to understand the best practices with Python's multiprocessing.Pool object.
In my program I use Pool.imap very frequently. Normally every time I start tasks in parallel I create a new pool object and then close it after I'm done.
I recently encountered a hang where the number of tasks submitted to the pool was less than the number of processes. What was odd was that it only occurred in my test pipeline, which had a bunch of things run before it. Running the test standalone did not cause the hang. I assume it has to do with making multiple pools.
I'd really like to find some resources to help me understand the best practices in using Python's multiprocessing. Specifically I'm currently trying to understand the implications of making several pool objects versus using only one.

When you create a Pool of worker processes, new processes are spawned from the parent one. This is a very fast operation but it has its cost.
Therefore, as long as you don't have a very good reason, for example the Pool breaks due to one worker dying unexpectedly, it's better to always use the same Pool instance.
The reason for the hang is hard to tell without inspecting the code. You might not have cleaned up the previous instances properly (call close() or terminate(), and then always call join()). You might have sent too much data through the Pool channel, which usually ends in a deadlock, and so on.
Surely a pool does not break if you submit fewer tasks than workers. The pool is designed precisely to decouple the number of tasks from the number of workers.
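As a rough illustration of reusing one pool, here is a sketch assuming Python 3.3+, where Pool can be used as a context manager. The names square and run_batches are made up for the example.

```python
import multiprocessing


def square(x):
    """Hypothetical task used only for illustration."""
    return x * x


def run_batches(batches, processes=4):
    """Process several batches of tasks with one shared pool."""
    results = []
    # Leaving the with block calls terminate() on the pool, so every result
    # is collected inside the block.
    with multiprocessing.Pool(processes) as pool:
        for batch in batches:
            # A batch may hold fewer tasks than there are workers.
            results.append(list(pool.imap(square, batch)))
    return results


if __name__ == "__main__":
    print(run_batches([[1, 2], [3, 4, 5], range(10)]))
```

The worker processes are started once and serve every batch, including batches smaller than the pool.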

Python multiprocessing - Why does pool.close() take so long to return?

Sometimes a call to the function pool.close() takes a lot of time to return, and I want to understand why. Typically, I would have each process return a big set or a big dict, and the main merge them. It looks like this:
import multiprocessing
def worker(_):  # imap_unordered passes one argument per task; it is unused here
    s = set()
    # add millions of elements to s
    return s
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20)
    fullSet = set.union(*pool.imap_unordered(worker, xrange(100)))
    pool.close()  # This takes a LOT OF TIME!
    pool.join()
As I said, the pool.close() might take 5, 10 min or more to return. Same problem occurs when using dictionaries instead of sets. This is what the documentation says about close:
Prevents any more tasks from being submitted to the pool. Once all the
tasks have been completed the worker processes will exit.
I guess I don't understand what's going on. After the line fullSet = ..., all the work is done and I don't need the workers anymore. What are they doing that is taking so much time?

It is very unlikely that Pool.close is taking that long, simply because this is the source of close:
def close(self):
    debug('closing pool')
    if self._state == RUN:
        self._state = CLOSE
        self._worker_handler._state = CLOSE
So all that’s happening is that some state variables are changed. This has no measurable impact on the runtime of that method and will not cause it to return later. You can assume that close returns instantaneously.
Now instead, what’s way more likely is that your pool.join() line is the “culprit” of this delay. But it’s just doing its job:
Wait for the worker processes to exit.
It essentially calls join on every process in the pool. And if you are joining a process or thread, you are actively waiting for it to complete its work and terminate.
So in your case, you have 20 processes running that each add millions of elements to a set. That takes a while. To keep your main process from quitting early (which would also kill the child processes), you wait for the worker processes to finish their work by joining them. So what you’re experiencing is likely what should happen for the amount of work you do.
On a side note: if you do heavy CPU work in your worker functions, you shouldn’t spawn more processes than your CPU has hardware threads available, as you will only introduce additional overhead from managing and switching processes. For example, for a consumer Core i7 with four cores and Hyper-Threading, this number would be 8.
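One way to follow that advice is to cap the pool size at the number of logical CPUs reported by the OS. This is only a sketch, assuming Python 3.4+; pool_size and busy are hypothetical helper names.

```python
import multiprocessing
import os


def pool_size(requested=None):
    """Cap the requested number of workers at the number of logical CPUs."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def busy(n):
    """Hypothetical CPU-bound task."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # Asking for 20 workers yields at most one worker per logical CPU.
    with multiprocessing.Pool(pool_size(20)) as pool:
        print(pool.map(busy, [10 ** 5] * 8))
```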

It is probably the iteration over the result of pool.imap_unordered and the subsequent set.union that take a long time.
After each worker has finished building a set, it has to be pickled, sent back to the original process and unpickled. This takes time and memory. And then the * has to unpack all the sets for union to process.
You might get better results with map_async. Its callback receives the list of returned sets once all tasks are done; loop over that list, using union on each set.
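A different option from the one suggested above, sketched here under the assumption of Python 3, is to keep imap_unordered but merge each set as it arrives rather than unpacking all of them into a single set.union call. The worker below is a small hypothetical stand-in for the real one.

```python
import multiprocessing


def worker(seed):
    """Hypothetical worker: build and return a set of 1000 integers."""
    return {seed * 1000 + i for i in range(1000)}


def merge_results(n_tasks, processes=4):
    """Union the sets returned by the workers, one result at a time."""
    full_set = set()
    with multiprocessing.Pool(processes) as pool:
        # Each set is merged as soon as it has been unpickled in the parent,
        # instead of collecting every set first and unpacking them with *.
        for partial in pool.imap_unordered(worker, range(n_tasks)):
            full_set |= partial
    return full_set


if __name__ == "__main__":
    print(len(merge_results(5)))
```

The pickling and transfer cost described above is still paid for every set; this only changes how the parent combines them.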

When, why, and how to call thread.join() in Python?

I have this python threading code.
import threading
def sum(value):
    sum = 0
    for i in range(value+1):
        sum += i
    print "I'm done with %d - %d\n" % (value, sum)
    return sum
r = range(500001, 500000*2, 100)
ts = []
for u in r:
    t = threading.Thread(target=sum, args = (u,))
    ts.append(t)
    t.start()
for t in ts:
    t.join()
When I execute this, I have hundreds of threads working.
However, when I move the t.join() right after the t.start(), I have only two threads working.
for u in r:
    t = threading.Thread(target=sum, args = (u,))
    ts.append(t)
    t.start()
    t.join()
I tested with the code that does not invoke the t.join(), but it seems to work fine?
So when, why, and how should I use thread.join()?

You seem to not understand what Thread.join does. When you call join, the current thread blocks until that thread has finished. So you are waiting for the thread to finish, which prevents you from starting any other thread.
The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program. Otherwise the main thread would carry on, and possibly use results or exit, while the workers are still running (daemon threads are killed outright when the main program ends). So usually you should have a loop at the end that joins all created threads, so that the main thread does not move on early.

Short answer: this one:
for t in ts:
    t.join()
is generally the idiomatic way to wait for a small number of threads. Calling .join means that your main thread waits until the given thread finishes before proceeding in execution. You generally do this after you've started all of the threads.
Longer answer:
len(list(range(500001, 500000*2, 100)))
Out[1]: 5000
You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!
Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need to limit your thread creation to a reasonable limit (i.e. more than one, less than 5000) is a ThreadPool. There are various ways to do this. You could roll your own - this is fairly simple with a threading.Semaphore. You could use 3.2+'s concurrent.futures package. You could use some 3rd party solution. Up to you, each is going to have a different API so I can't really discuss that further.
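For the concurrent.futures route, a minimal sketch might look like the following (Python 3.2+). The names total and run_limited and the limit of 8 workers are assumptions made for the example, not something from the question.

```python
from concurrent.futures import ThreadPoolExecutor


def total(value):
    """Sum the integers from 0 to value inclusive."""
    return sum(range(value + 1))


def run_limited(values, max_workers=8):
    """Compute total() for every value using at most max_workers threads."""
    # Tasks beyond max_workers wait in the executor's queue until a thread
    # becomes free.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order; leaving the with block waits
        # for the worker threads to finish.
        return list(executor.map(total, values))


if __name__ == "__main__":
    results = run_limited(range(500001, 500000 * 2, 100))
    print(len(results), results[0])
```

No explicit join() loop is needed in this form, because the executor's context manager does the waiting.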
Obligatory GIL Discussion
CPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing Python bytecode at once. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O-bound tasks, such as retrieving a bunch of URLs.
multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free: data transfer between processes is expensive, so a lot of care needs to be taken not to write workers that depend on shared state.

join() waits for your thread to finish, so the first use starts all of the threads and then waits for all of them to finish. The second use waits for the end of every thread before it launches another one, which kind of defeats the purpose of threading.
The first use makes most sense. You run the threads (all of them) to do some parallel computation, and then wait until all of them finish, before you move on and use the results, to make sure the work is done (i.e. the results are actually there).

Clarification regarding python Pool.map function used for python parallelism

I have a couple of questions regarding the functioning of the following code fragment.
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    pool = Pool(processes=10)           # start 10 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print result.get(timeout=1)
    print pool.map(f, range(10))        # prints "[0, 1, 4,..., 81]"
In the line pool = Pool(processes=10), does it even make a difference if i'm running on 4 processor architecture (quad-core) and instantiate more than 4 worker processes since only up to 4 processes can execute at any point in time?
In the pool.map(f, range(10)) function, if I instantiate 10 worker processes and have maybe 50 mappers, does Python take care of assigning mappers to processes as they complete execution, or am I supposed to figure out how many mappers are created and instantiate that many processes in the line pool = Pool(processes=number_of_mappers)?
This is my first attempt at parallelizing anything and I am thoroughly confused, so any help would be much appreciated.
Thanks in advance!

If you create more worker processes than you have available CPUs, that's fine, but the processes will compete with each other for cycles. That is, you'll waste more cycles, in the sense that cycles devoted to switching among processes do nothing to get you closer to finishing. For CPU-bound tasks, it's just wasteful. For I/O-bound tasks, though, it may be just what you want, since in that case processes will spend lots of their time idle, waiting for blocking I/O to complete.
The map functions automatically slice up their iterable argument and send pieces of it to all worker processes. I really don't know what you mean by mappers, though. How many mappers do you think you created in your example? 10? 1? Something else? In what you wrote, pool.map() blocks until all work is completed.
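To see that distribution in action, here is a small sketch assuming Python 3: 50 tasks are handed to a pool of 4 workers, and the set of worker PIDs shows that no more than 4 processes did the work. The helper names tag and run are hypothetical.

```python
import multiprocessing
import os


def f(x):
    return x * x


def tag(x):
    """Return the input together with the PID of the worker that handled it."""
    return x, os.getpid()


def run(n_tasks=50, processes=4):
    """Map n_tasks inputs over a fixed-size pool and report which PIDs ran them."""
    with multiprocessing.Pool(processes) as pool:
        # map() blocks until all tasks are done and keeps the input order.
        squares = pool.map(f, range(n_tasks))
        pids = {pid for _, pid in pool.map(tag, range(n_tasks))}
    return squares, pids


if __name__ == "__main__":
    squares, pids = run()
    print(squares[:5], len(pids))
```

The number of tasks and the number of worker processes are chosen independently; the pool hands out chunks of the iterable to whichever workers are available.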

You can create more workers than the number of threads your CPU can execute. This is required in real-time applications, like a web server, where you must ensure that each client is able to communicate with you without having to wait for others. If it's not a real-time application and you just want to finish all the jobs as soon as possible, it would be wiser to create as many workers as your CPU can handle simultaneously.
Python takes care of assigning jobs to workers no matter how many jobs you have.

How do I limit the number of active threads in python?

Am new to python and making some headway with threading - am doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac","--decode","--stdout",self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame","--quiet","-",self.dest],
                                  stdin=decode.stdout)
        encode.communicate()
# some other code puts these threads with various src/dest pairs in a list
for proc in threads: # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next on the list until it is finished, then continue with the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command line decoder (flac) piped to stdout which is fed into a command line encoder (lame/mp3).

If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)
class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire() and a waiting thread will only continue once another thread goes through threadLimiter.release().

"Each of my threads is using subprocess.Popen to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory, not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.

If you're using the default CPython interpreter then this won't help you, because only one thread can execute Python code at a time; look up the Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6, which makes parallel programming a cinch. You can create a Pool object with 2*num_threads processes, and give it a bunch of tasks to do. It will execute up to 2*num_threads tasks at a time, until all are done.
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.

It looks to me that what you want is a pool of some sort, and in that pool you would like to have n threads, where n == the number of processors on your system. You would then have another thread whose only job is to feed jobs into a queue, which the worker threads pick up and process as they become free (so for a dual-core machine, you'd have three threads, but the main thread would be doing very little).
As you are new to Python, though, I'll assume you don't know about the GIL and its side effects with regard to threading. If you read the article I linked you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It sidesteps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions on how you share data (you are working in different memory spaces), but actually this is no bad thing: they just encourage good practice, such as minimising the contact points between threads (or processes in this case).
In your case you are probably interested in using a pool as specified here.

Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods and checking for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past I've written a version (before the subprocess module was added to Python) which used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions, with my own pipe and file descriptor plumbing, etc.
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account etc). It completed the job in just over two hours and had no issues. (There were about 60 of them that were timeouts due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly).
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact. (It's a moderately old FreeBSD box.) I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
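The spawn, poll and replenish loop described in this answer can be sketched roughly as follows. This is not the classh code itself: run_limited and its parameters are hypothetical, timeout handling is left out, and the example jobs are trivial child interpreters rather than ssh commands.

```python
import subprocess
import sys
import time


def run_limited(commands, max_running=2, poll_interval=0.05):
    """Run commands with at most max_running children alive at once.

    Returns the exit codes in the order the commands were given.
    """
    pending = list(enumerate(commands))
    running = {}  # index -> Popen object
    exit_codes = [None] * len(commands)
    while pending or running:
        # Replenish the pool while there is room and work left.
        while pending and len(running) < max_running:
            index, command = pending.pop(0)
            running[index] = subprocess.Popen(command)
        # Collect any children that have finished.
        for index, child in list(running.items()):
            code = child.poll()  # None while the child is still running
            if code is not None:
                exit_codes[index] = code
                del running[index]
        if running:
            time.sleep(poll_interval)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical jobs: each child simply exits with its own number.
    jobs = [[sys.executable, "-c", "import sys; sys.exit({})".format(n)]
            for n in range(5)]
    print(run_limited(jobs))
```

For the encoding question, each entry in commands would be one conversion job, and max_running would be the number of cores.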

I am not an expert in this, but I have read something about "Lock"s. This article might help you out
Hope this helps

I would like to add something, just as a reference for others looking to do something similar, but who might have coded things different from the OP. This question was the first one I came across when searching and the chosen answer pointed me in the right direction. Just trying to give something back.
import threading
import time
maximumNumberOfThreads = 2
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)
def simulateThread(a, b):
    threadLimiter.acquire()
    try:
        # do some stuff
        c = a + b
        print('a + b = ', c)
        time.sleep(3)
    except NameError:  # Or some other type of error
        print('some error')
    finally:
        # finally runs whether or not an exception occurred, so release
        # exactly once here; releasing a BoundedSemaphore a second time
        # would raise ValueError
        threadLimiter.release()
threads = []
sample = [1,2,3,4,5,6,7,8,9]
for value in sample:
    thread = threading.Thread(target=simulateThread, args=(value, 2))
    thread.daemon = True
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
This basically follows what you will find on this site:
https://www.kite.com/python/docs/threading.BoundedSemaphore
