My question is inspired by a comment on the solving embarrassingly parallel problems with multiprocessing post.
I am asking about the general case where Python multiprocessing is used to (1) read data from file, (2) manipulate the data, and (3) write the results to file. In the case I describe, data read from file in (1) is passed to a queue A and fetched from that queue in (2). (2) also passes its results to a separate queue B, and (3) fetches the results from queue B to write them to file.
When (1) is done, it passes a STOP signal* to queue A so (2) knows queue A is empty. (2) then terminates and passes a STOP signal to queue B so (3) knows queue B is empty and terminates when it has used up the results queue.
So is there any need to call the multiprocessing .join() method on (1) and (2)? I would have thought that (2) will not finish until (1) finishes and sends a STOP signal. For (3) it makes sense to wait, as any subsequent instructions might otherwise proceed without (3) having finished.
But maybe calling the .join() method costs nothing and can be used just to avoid having to think about it?
*Actually, the STOP signal consists of a sequence of N STOP signals, where N equals the number of processes running in (2).
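For concreteness, here is a minimal sketch of the pipeline I have in mind (the file names, the per-item work, and the SENTINEL value are placeholders, not my real code):

SENTINEL = None  # hypothetical STOP signal

def reader(path, queue_a, n_workers):
    # Stage (1): read data from file into queue A.
    with open(path) as f:
        for line in f:
            queue_a.put(line)
    for _ in range(n_workers):
        queue_a.put(SENTINEL)  # one STOP per worker in stage (2)

def worker(queue_a, queue_b):
    # Stage (2): manipulate data; several of these run at once.
    for item in iter(queue_a.get, SENTINEL):
        queue_b.put(item.upper())  # placeholder for the real manipulation
    queue_b.put(SENTINEL)  # tell stage (3) this worker is done

def writer(path, queue_b, n_workers):
    # Stage (3): write results to file until all STOPs have arrived.
    stops_seen = 0
    with open(path, 'w') as f:
        while stops_seen < n_workers:
            item = queue_b.get()
            if item is SENTINEL:
                stops_seen += 1
            else:
                f.write(item)

if __name__ == '__main__':
    import multiprocessing
    n_workers = 4
    queue_a, queue_b = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=reader, args=('in.txt', queue_a, n_workers)),
             multiprocessing.Process(target=writer, args=('out.txt', queue_b, n_workers))]
    procs += [multiprocessing.Process(target=worker, args=(queue_a, queue_b))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # cheap if the STOP signals already did their job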
According to the docs, it is safe to call join multiple times - this suggests that if p has already stopped, p.join() will return immediately. This means that if you expect p to have already stopped by this time, the cost of joining it should be negligible. If p hasn't stopped (as you say you expect the writer process might not have), there is a potential cost to joining it depending on what your main process needs to do. If it does any user interaction, it will appear hung. If that is a problem, you might consider this type of pattern:
while p.is_alive():
    iterate_mainloop()     # keep servicing the UI / main loop
    p.join(small_timeout)  # give the child a short window to finish
But if that process doesn't do user interaction, joining the others should be fine. That seems to be the most likely situation here - if you can afford to be blocked waiting for a disk read, you should also be fine waiting for another process to complete (modulo any defensive timeouts in case it misbehaves).
I have a program that uses threads to start another thread once a certain threshold is reached. Right now the second thread is being started multiple times. I implemented a lock but I don't think I did it right.
for i in range(max_threads):
    t1 = Thread(target=grab_queue)
    t1.start()
in grab_queue, I have:
...
rows.append(resultJson)
if len(rows.value()) >= 250:
    with Lock():
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
Which starts another thread to process the list of rows. I would like to make sure that as soon as one thread hits the if condition, the other threads won't run it, so that extra threads to process the list of rows aren't started.
Your lock is covering the wrong portion of the code. You have a race condition between the check for the size of rows, and the portion of the code where you reset the rows. Given that the lock is taken only after the size check, two threads could easily both decide that the array has grown too large, and only then would the lock kick in to serialize the resetting of the array. "Serialize" in this case means that the task would still be performed twice, once by each thread, but it would happen in succession rather than in parallel.
The correct code could look like this:
rows.append(resultJson)
with grow_lock:
    if len(rows.value()) >= 250:
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
There is another issue with the code as shown in the question: if Lock() refers to threading.Lock, it is creating and locking a new lock on each invocation, and in each thread! A lock protects a resource shared among threads, and to perform that function, the lock must itself be shared. To fix the problem, instantiate the lock once and pass it to the thread's target function.
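To illustrate, here is a minimal sketch of a shared lock (grab_queue below is a simplified stand-in that takes the lock as an argument, which differs from the question's signature, and max_threads is a dummy value):

import threading

grow_lock = threading.Lock()  # created once, shared by all workers
max_threads = 4               # stand-in for the question's value

def grab_queue(lock):
    # ... build up rows as in the question ...
    with lock:  # every thread synchronizes on the same Lock object
        pass    # size check, thread start, and reset would go here

threads = [threading.Thread(target=grab_queue, args=(grow_lock,))
           for _ in range(max_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()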
Taking a step back, your code implements a custom thread pool. Getting that right and covering all the corner cases takes a lot of work, testing, and debugging. There are production-tested modules specialized for that purpose, such as the multiprocessing module shipped with Python (which supports both process and thread pools), and it is a good idea to get acquainted with them before reimplementing their functionality. See, for example, this article for an accessible introduction to multiprocessing-based thread pools.
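As a hedged sketch of that approach, using the thread-pool flavor of multiprocessing (insert_rows here is a dummy stand-in for the question's function, and the batches are made up):

from multiprocessing.pool import ThreadPool

def insert_rows(rows):
    # Dummy stand-in for the question's insert_rows.
    print("inserting %d rows" % len(rows))

if __name__ == '__main__':
    batches = [[i] * 250 for i in range(8)]  # eight dummy batches of rows
    with ThreadPool(4) as pool:         # four worker threads managed for us
        pool.map(insert_rows, batches)  # blocks until every batch is done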
I want to spawn X number of Pool workers and give each of them X% of the work to do. My issue is that the work takes about 20 minutes to exhaust, longer for each extra process running, and due to the type of calculations being done my answer may be found within minutes or hours. What I would like to do is implement some way for a single worker to go "HEY I FOUND IT" and use that signal to kill the remainder of the pool and move on with my calculations.
Key points:
I have tried callbacks, they don't seem to run on a starmap_async until the entire pool finishes.
I only care about the first suitable answer found.
I am not sharing resources and surprise process death, albeit rude, is perfectly acceptable.
I've also considered using a Queue, but it wouldn't make sense because the scope of work I'm passing to each worker is already built into the parameters of the function.
Below is a very pared-down version of what I'm working with (the calculations I'm working with can take hours to finish over an iterable of 4.2 billion items).
def doWork():
    workers = Pool(2)
    results = workers.starmap_async(func=distSearch, iterable=Sections1_5, callback=killPool)
    workers.close()
    print("Found answer : {}".format(results.get()))
    workers.join()

def killPool():
    workers.terminate()
    print("Worker Pool Terminated")
I should probably specify that my process only returns if it finds an answer, otherwise it just exits once done. I have looked at this thread but it has me completely lost, and it seems like a lot of overhead to consistently check for the win condition when that should come in the return/callback of the worker pool.
All the answers I've found result in significant overhead by supervising the worker pool, I'm looking for a solution that sources the kill signal at the worker level, autonomously.
I'm looking for a solution that sources the kill signal at the worker level, autonomously.
AFAIK, that doesn't exist. The methods of the Pool object (like Pool.terminate) should only be used in the process that created the pool.
What you could do is use Pool.imap_unordered. This returns an iterator in the parent process which yields results as soon as they become available. As soon as the desired result pops up, you can then call Pool.terminate().
Edit:
From looking at the 3.5 implementation, starmap_async returns a MapResult instance, which is not an iterator.
You can wrap multiple inputs in a tuple and use imap_unordered over a list of those.
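Putting those two pieces together, a rough sketch (distSearch and Sections1_5 below are dummies standing in for the question's real search function and work items):

from multiprocessing import Pool

def distSearch(lo, hi):
    # Dummy stand-in for the question's search; returns an answer or None.
    return 42 if lo <= 42 <= hi else None

def dist_search_wrapper(args):
    # imap_unordered passes one item at a time, so unpack the tuple here.
    return distSearch(*args)

Sections1_5 = [(i, i + 9) for i in range(0, 100, 10)]  # dummy work items

if __name__ == '__main__':
    workers = Pool(2)
    try:
        for result in workers.imap_unordered(dist_search_wrapper, Sections1_5):
            if result is not None:  # a worker returns an answer only if it finds one
                print("Found answer : {}".format(result))
                break
    finally:
        workers.terminate()  # kill any workers still searching
        workers.join()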
When trying to put a large ndarray to a Queue in a Process, I encounter the following problem:
First, here is the code:
import numpy
import multiprocessing
from ctypes import c_bool
import time

def run(acquisition_running, data_queue):
    while acquisition_running.value:
        length = 65536
        data = numpy.ndarray(length, dtype='float')
        data_queue.put(data)
        time.sleep(0.1)

if __name__ == '__main__':
    acquisition_running = multiprocessing.Value(c_bool)
    data_queue = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run, args=(acquisition_running, data_queue))
    acquisition_running.value = True
    process.start()
    time.sleep(1)
    acquisition_running.value = False
    process.join()
    print('Finished')
    number_items = 0
    while not data_queue.empty():
        data_item = data_queue.get()
        number_items += 1
    print(number_items)
If I use length=10 or so, everything works fine. I get 9 items transmitted through the Queue.
If I increase to length=1000, on my computer the process.join() blocks, although the function run() is already done. If I comment out the line with process.join(), I see that there are only 2 items put in the Queue, so apparently putting data into the Queue got very slow.
My plan is actually to transport 4 ndarrays, each with length 65536. For a Thread this worked very fast (<1 ms). Is there a way to improve the speed of transmitting data between processes?
I used Python 3.4 on a Windows machine, but with Python 3.4 on Linux I get the same behavior.
"Is there a way to improve speed of transmitting data for processes?"
Surely, given the right problem to solve. Currently, you are just filling a buffer without emptying it simultaneously. Congratulations, you have just built yourself a so-called deadlock. The corresponding quote from the documentation is:
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe.
But, let's approach this slowly. First of all, "speed" is not your problem! I understand that you are just experimenting with Python's multiprocessing. The most important insight when reading your code is that the flow of communication between parent and child and especially the event handling does not really make sense. If you have a real-world problem that you are trying to solve, you definitely cannot solve it this way. If you do not have a real-world problem, then you first need to come up with a good one before you should start writing code ;-). Eventually, you will need to understand the communication primitives an operating system provides for inter-process communication.
Explanation for what you are observing:
Your child process generates about 10 * length * size(float) bytes of data (considering that the child can perform about 10 iterations while your parent sleeps for about 1 s before it sets acquisition_running to False). While your parent process sleeps, the child puts that amount of data into a queue. You need to appreciate that a queue is a complex construct. You do not need to understand every bit of it. But one thing really is important: a queue for inter-process communication clearly uses some kind of buffer* that sits between parent and child. Buffers usually have a limited size. You are writing to this buffer from within the child without simultaneously reading from it in the parent. That is, the buffer contents steadily grow while the parent is just sleeping. By increasing length you run into the situation where the queue buffer is full and the child process cannot write to it anymore. However, the child process cannot terminate before it has written all data. At the same time, the parent process waits for the child to terminate.
You see? One entity waits for the other. The parent waits for the child to terminate and the child waits for the parent to make some space. Such a situation is called deadlock. It cannot resolve itself.
Regarding the details, the buffer situation is a little more complex than described above. Your child process has spawned an additional thread that tries to push the buffered data through a pipe to the parent. Actually, the buffer of this pipe is the limiting entity. It is defined by the operating system and, at least on Linux, is usually not larger than 65536 Bytes.
The essential part is, in other words: the parent does not read from the pipe before the child finishes attempting to write to the pipe. In every meaningful scenario where pipes are used, reading and writing happen in a rather simultaneous fashion so that one process can quickly react to input provided by another process. You are doing the exact opposite: you put your parent to sleep and thereby render it unresponsive to input from the child, resulting in a deadlock situation.
(*) "When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe", from https://docs.python.org/2/library/multiprocessing.html
If you have really big arrays, you might want to only pass their pickled state -- or a better alternative might be to use multiprocessing.Array or multiprocessing.sharedctypes.RawArray to make a shared memory array (for the latter, see http://briansimulator.org/sharing-numpy-arrays-between-processes/). You have to worry about conflicts, as you'll have an array that's not bound by the GIL -- and needs locks. However, you only need to send array indices to access the shared array data.
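For example, a minimal sketch of the shared-memory approach with multiprocessing.Array (the fill function and the array contents are placeholders):

import multiprocessing
import numpy

def fill(shared_arr, length):
    # Reinterpret the shared buffer as a numpy array: no pickling, no copy.
    data = numpy.frombuffer(shared_arr.get_obj())
    data[:] = numpy.arange(length)

if __name__ == '__main__':
    length = 65536
    # 'd' is a C double; multiprocessing.Array ships with its own lock.
    shared_arr = multiprocessing.Array('d', length)
    process = multiprocessing.Process(target=fill, args=(shared_arr, length))
    process.start()
    process.join()  # safe here: no queue buffer needs draining
    result = numpy.frombuffer(shared_arr.get_obj())
    print(result[:5])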
One thing you could do to resolve that issue, in tandem with the excellent answer from JPG, is to drain your Queue before joining the process.
So do this instead:
process.start()
data_item = data_queue.get()  # drain the queue before joining
process.join()
While this does not fully replicate the behavior in the code (number of data counting), you get the idea ;)
Convert the array/list to a string with str(your_array) and put that on the queue instead:
q.put(str(your_array))
I have this python threading code.
import threading

def sum(value):
    sum = 0
    for i in range(value+1):
        sum += i
    print "I'm done with %d - %d\n" % (value, sum)
    return sum

r = range(500001, 500000*2, 100)
ts = []
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()

for t in ts:
    t.join()
Executing this, I have hundreds of threads working.
However, when I move the t.join() right after the t.start(), I have only two threads working.
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
    t.join()
I also tested the code without invoking t.join() at all, and it seems to work fine?
So when, how, and why should I use thread.join()?
You seem to misunderstand what Thread.join does. When calling join, the current thread will block until that thread has finished. So you are waiting for the thread to finish, which prevents you from starting any other thread.
The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program. Otherwise, if the main program ended first, any daemon threads it created would be killed (the interpreter does wait for non-daemon threads, but it is clearer to join explicitly). So usually, you should have a loop at the end that joins all created threads, to prevent the main thread from exiting too early.
Short answer: this one:
for t in ts:
    t.join()
is generally the idiomatic way to start a small number of threads. Doing .join means that your main thread waits until the given thread finishes before proceeding in execution. You generally do this after you've started all of the threads.
Longer answer:
len(list(range(500001, 500000*2, 100)))
Out[1]: 5000
You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!
Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need, to limit your thread creation to a reasonable number (i.e. more than one, less than 5000), is a thread pool. There are various ways to do this. You could roll your own - this is fairly simple with a threading.Semaphore. You could use the concurrent.futures package from 3.2+. You could use some 3rd party solution. Up to you, each is going to have a different API so I can't really discuss that further.
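For illustration, a hedged sketch with concurrent.futures (the worker mirrors the question's sum function, renamed to avoid shadowing the builtin; the pool size of 8 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

def sum_to(value):
    # Same work as the question's sum, under a non-shadowing name.
    total = 0
    for i in range(value + 1):
        total += i
    return total

if __name__ == '__main__':
    values = range(500001, 500000 * 2, 100)
    # At most 8 threads exist at once; the executor queues the remaining tasks.
    with ThreadPoolExecutor(max_workers=8) as executor:
        for value, total in zip(values, executor.map(sum_to, values)):
            print("I'm done with %d - %d" % (value, total))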
Obligatory GIL Discussion
CPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing Python bytecode at once. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O bound tasks, such as retrieving a bunch of URLs.
multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free - data transfer between processes is expensive, so a lot of care needs to be taken not to write workers that depend on shared state.
join() waits for your thread to finish, so the first use starts hundreds of threads, and then waits for all of them to finish. The second use waits for each thread to finish before it launches another one, which kind of defeats the purpose of threading.
The first use makes most sense. You run the threads (all of them) to do some parallel computation, and then wait until all of them finish, before you move on and use the results, to make sure the work is done (i.e. the results are actually there).
I'm using map_async with processes that return a ton of data. The normal map_async results in the data being stored in memory, then returned when everything is processed. To get around this, I've used a generator approach from:
Combining itertools and multiprocessing?
However, this doesn't make full use of multi-threading (as in, if you have 29 threads finished and 1 thread hanging, it won't start the next batch of jobs until everyone is done). Is there a way to have map_async, or some similar function, send its results to a callback as each thread finishes?
What you want is a producer-consumer-based solution. The producer puts tasks in a multiprocessing.Queue, and the consumers (subprocesses) get and process them in a loop.
This is a good SO question with a (detailed) possible solution.
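For reference, a rough sketch of that producer-consumer pattern (the squaring task, the worker count, and the task count are all dummies):

import multiprocessing

SENTINEL = None  # marks the end of the task stream

def consumer(task_queue, result_queue):
    # Each consumer loops, pulling tasks until it sees the sentinel.
    for task in iter(task_queue.get, SENTINEL):
        result_queue.put(task * task)  # dummy stand-in for the real work

if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=consumer,
                                       args=(task_queue, result_queue))
               for _ in range(4)]
    for w in workers:
        w.start()
    for task in range(30):  # the producer side
        task_queue.put(task)
    for _ in workers:
        task_queue.put(SENTINEL)  # one sentinel per consumer
    # Results arrive as soon as each task finishes, in completion order.
    for _ in range(30):
        print(result_queue.get())
    for w in workers:
        w.join()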