Python gevent pool.join() waiting forever - python

I have a function like this
def check_urls(res):
    pool = Pool(10)
    print pool.free_count()
    for row in res:
        pool.spawn(fetch, row[0], row[1])
    pool.join()
pool.free_count() outputs the value 10.
I used pdb to trace. The program works fine until the pool.spawn() loop.
But it's waiting forever at the pool.join() line.
Can someone tell me what's wrong?

But it's waiting forever at the pool.join() line.
Can someone tell me what's wrong?
Nothing!
Though I first wrote what's below the line, the join() function in gevent still behaves pretty much the same way as in subprocess/threading: it blocks until all the greenlets are done.
If you want to only test whether all the greenlets in the pool are over or not, you might want to check for the ready() on each greenlet of the pool:
is_over = all(gl.ready() for gl in pool.greenlets)
Basically, .join() is not waiting forever, it's waiting until your greenlets are done. If one of your greenlets never ends, then join() will block forever. So make sure every greenlet terminates, and join() will return once all the jobs are done.
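To see this concretely, here is a minimal self-contained sketch; the fetch worker and the URL/delay pairs are made-up stand-ins for the asker's data, not the original code:

import gevent
from gevent.pool import Pool

def fetch(url, delay):
    # hypothetical stand-in for the asker's fetch(): pretend to do some I/O
    gevent.sleep(delay)
    print('fetched %s' % url)

pool = Pool(10)
for url, delay in [('http://a.example', 1), ('http://b.example', 2)]:
    pool.spawn(fetch, url, delay)

pool.join()                # returns as soon as both greenlets have finished
print(pool.free_count())   # back to 10: every slot is free again

If fetch ever blocked forever (for example on a socket with no timeout), the join() above would block forever too.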
Edit: the following applies only to the standard API of the subprocess/threading/multiprocessing modules. gevent's greenlet pools do not match that "standard" API.
The join() method on a Thread/Process has for purpose to make the main process/thread wait forever until the children processes/threads are over.
You can use the timeout parameter to make it get back to execution after some time, or you can use the is_alive() method to check if it's running or not without blocking.
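A small sketch of those two options, with an invented worker function:

import threading
import time

def worker():
    time.sleep(5)          # stand-in for a long-running job

t = threading.Thread(target=worker)
t.start()

t.join(timeout=1)          # stop waiting after one second
if t.is_alive():           # the join timed out, the thread is still running
    print('worker still busy, doing other things in the meantime')
t.join()                   # now block until it really finishes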
In the context of a process/thread pool, the join() also needs to be triggered after a call to either close() or terminate(), so you may want to:
for row in res:
    pool.spawn(fetch, row[0], row[1])
pool.close()
pool.join()

Related

Is there ever a reason to call join when using pool.map while using python multiprocessing?

multiprocessing.Pool().map() blocks the main process from moving ahead with the execution, and yet it gets stated everywhere that join should be called after close as good practice. I wanted to understand, through an example, what scenario could ever make it sensible to use join after a multiprocessing.Pool().map() call.
Where does it state that "good practice"? If you have no further need of the pool, i.e. you do not plan on submitting any more tasks and your program is not terminating but you want to release the resource used by the pool and "clean up" right away, you can just call terminate either explicitly or implicitly, which happens if you use a with block as follows:
with Pool() as pool:
    ...
# terminate is called implicitly when the above block exits
But note that terminate will not wait for outstanding tasks, if any, to complete. If there are submitted tasks that are queued up but not yet running, or that are currently running, they will be canceled.
Calling close prevents further tasks from being submitted and should only be called when you have no further use for the pool. Calling join, which requires that you first call close, will wait for any outstanding tasks to complete and the processes in the pool to terminate. But if you are using map, by definition that blocks until the tasks submitted complete. So unless you have any other tasks you submitted there is no compelling need to call close followed by join. These calls are, however, useful to wait for outstanding tasks submitted with, for example, apply_async to complete without having to explicitly call get on the AsyncResult instance returned by that call:
pool = Pool()
pool.apply_async(worker1, args=(arg1, arg2))
pool.apply_async(worker2, args=(arg3,))
pool.apply_async(worker3)
# wait for all 3 tasks to complete
pool.close()
pool.join()
Of course, the above is only useful if you do not need any return values from the worker functions.
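As a self-contained illustration of that close()/join() pattern, here is a sketch with a made-up worker function and arguments (not the original poster's code):

from multiprocessing import Pool
import time

def worker(name, delay):
    time.sleep(delay)          # pretend to do some work
    print(name, 'done')

if __name__ == '__main__':
    pool = Pool()
    pool.apply_async(worker, args=('task-1', 1))
    pool.apply_async(worker, args=('task-2', 2))
    pool.close()               # no further tasks will be submitted
    pool.join()                # wait for both outstanding tasks to finish
    print('all tasks complete')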
So to answer your question: Not really; only if you happen to have other tasks submitted asynchronously whose completion you are awaiting. It is, however, one way of immediately releasing pool resource if you are not planning on exiting your program right away, the other way being to call method terminate.

python multiprocessing pool blocking main thread

I have the following snippet which attempts to split processing across multiple sub-processes.
def search(self):
    print("Checking queue for jobs to process")
    if self._job_queue.has_jobs_to_process():
        print("Queue threshold met, processing jobs.")
        job_sub_lists = partition_jobs(self._job_queue.get_jobs_to_process(), self._process_pool_size)
        populated_sub_lists = [sub_list for sub_list in job_sub_lists if len(sub_list) > 0]
        self._process_pool.map(process, populated_sub_lists)
        print("Job processing pool mapped")
The search function is being called by the main process in a while loop and if the queue reaches a threshold count, the processing pool is mapped to the process function with the jobs sourced from the queue. My question is, does the python multiprocessing pool block the main process during execution or does it immediately continue execution? I don't want to encounter the scenario where "has_jobs_to_process()" evaluates to true and during the processing of the jobs, it evaluates to true for another set of jobs and "self._process_pool.map(process, populated_sub_lists)" is called again as I do not know the consequences of calling map again while processes are running.
multiprocessing.Pool.map blocks the calling thread (not necessarily the MainThread!), not the whole process.
Other threads of the parent process will not be blocked. You could call pool.map from multiple threads in the parent process without breaking things (doesn't make much sense, though). That's because Pool uses a thread-safe queue.Queue internally for its _taskqueue.
From the multiprocessing docs, Pool.map will block the calling process during execution until the result is ready, and Pool.map_async will not.
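A short sketch of that difference, using a trivial square function as a stand-in for real work:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(square, range(5)))                 # blocks until all results are ready
        async_result = pool.map_async(square, range(5))   # returns immediately
        print('the calling thread keeps running while the workers compute')
        print(async_result.get())                         # block only when the results are needed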

python function not running as thread

This is done in Python 2.7.12.
serialHelper is a class module around python serial, and this code does work nicely:
#!/usr/bin/env python
import threading
from time import sleep
import serialHelper

sh = serialHelper.SerialHelper()

def serialGetter():
    h = 0
    while True:
        h = h + 1
        s_resp = sh.getResponse()
        print ('response ' + s_resp)
        sleep(3)

if __name__ == '__main__':
    try:
        t = threading.Thread(target=sh.serialReader)
        t.setDaemon(True)
        t.start()
        serialGetter()
        #tSR = threading.Thread(target=serialGetter)
        #tSR.setDaemon(True)
        #tSR.start()
    except Exception as e:
        print (e)
However, the attempt to run serialGetter as a thread (as remarked out above) just dies.
Any reason why that function cannot run as a thread?
Quoting from the Python documentation:
The entire Python program exits when no alive non-daemon threads are left.
So if you setDaemon(True) every new thread and then exit the main thread (by falling off the end of the script), the whole program will exit immediately. This kills all of the threads. Either don't use setDaemon(True), or don't exit the main thread without first calling join() on all of the threads you want to wait for.
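Applied to the question's snippet, a sketch of that second option might look like the following; it reuses the sh and serialGetter names from the question and simply leaves the new thread non-daemonic and joins it instead of falling off the end:

if __name__ == '__main__':
    t = threading.Thread(target=sh.serialReader)
    t.setDaemon(True)
    t.start()

    tSR = threading.Thread(target=serialGetter)   # no setDaemon(True) here
    tSR.start()
    tSR.join()   # the main thread waits instead of exiting immediately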
Stepping back for a moment, it may help to think about the intended use case of a daemon thread. In Unix, a daemon is a process that runs in the background and (typically) serves requests or performs operations, either on behalf of remote clients over the network or local processes. The same basic idea applies to daemon threads:
1. You launch the daemon thread with some kind of work queue.
2. When you need some work done on the thread, you hand it a work object.
3. When you want the result of that work, you use an event or a future to wait for it to complete.
4. After requesting some work, you always eventually wait for it to complete, or perhaps cancel it (if your worker protocol supports cancellation).
5. You don't have to clean up the daemon thread at program termination. It just quietly goes away when there are no other threads left.
The problem is step (4). If you forget about some work object, and exit the app without waiting for it to complete, the work may get interrupted. Daemon threads don't gracefully shut down, so you could leave the outside world in an inconsistent state (e.g. an incomplete database transaction, a file that never got closed, etc.). It's often better to use a regular thread, and replace step (5) with an explicit "Finish up your work and shut down" work object that the main thread hands to the worker thread before exiting. The worker thread then recognizes this object, stops waiting on the work queue, and terminates itself once it's no longer doing anything else. This is slightly more up-front work, but is much safer in the event that a work object is inadvertently abandoned.
Because of all of the above, I recommend not using daemon threads unless you have a strong reason for them.
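A minimal Python 3 sketch of that "finish up and shut down" pattern; the queue, worker function, and SHUTDOWN sentinel are all invented for illustration:

import threading
import queue

SHUTDOWN = object()                  # sentinel work object: "finish up and shut down"

def worker(work_queue):
    while True:
        item = work_queue.get()
        if item is SHUTDOWN:
            break                    # stop waiting on the queue and terminate
        print('processing', item)    # stand-in for real work

work_queue = queue.Queue()
t = threading.Thread(target=worker, args=(work_queue,))   # a regular, non-daemon thread
t.start()

for job in range(3):
    work_queue.put(job)
work_queue.put(SHUTDOWN)             # handed to the worker before the main thread exits
t.join()                             # clean shutdown, nothing is abandoned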

When, why, and how to call thread.join() in Python?

I have this python threading code.
import threading

def sum(value):
    sum = 0
    for i in range(value+1):
        sum += i
    print "I'm done with %d - %d\n" % (value, sum)
    return sum

r = range(500001, 500000*2, 100)
ts = []
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()

for t in ts:
    t.join()
Executing this, I have hundreds of threads working.
However, when I move the t.join() right after the t.start(), I have only two threads working.
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
    t.join()
I tested with code that does not invoke t.join() at all, and it seems to work fine?
Then when, why, and how should I use thread.join()?
You seem to not understand what Thread.join does. When calling join, the current thread will block until that thread has finished. So you are waiting for the thread to finish, preventing you from starting any other thread.
The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program. Otherwise, if you didn't do that and the main program ended, all the threads it created would be killed. So usually, you should have a loop at the end that joins all created threads, to prevent the main thread from exiting too early.
Short answer: this one:
for t in ts:
    t.join()
is generally the idiomatic way to wait on a small number of threads. Doing .join means that your main thread waits until the given thread finishes before proceeding in execution. You generally do this after you've started all of the threads.
Longer answer:
len(list(range(500001, 500000*2, 100)))
Out[1]: 5000
You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!
Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need to limit your thread creation to a reasonable limit (i.e. more than one, less than 5000) is a ThreadPool. There are various ways to do this. You could roll your own - this is fairly simple with a threading.Semaphore. You could use 3.2+'s concurrent.futures package. You could use some 3rd party solution. Up to you, each is going to have a different API so I can't really discuss that further.
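For example, a sketch using concurrent.futures (one of the options mentioned above), with a sum_to helper standing in for the question's worker:

from concurrent.futures import ThreadPoolExecutor

def sum_to(value):
    return sum(range(value + 1))

with ThreadPoolExecutor(max_workers=10) as executor:   # at most 10 threads, not 5000
    results = list(executor.map(sum_to, range(500001, 500000 * 2, 100)))
print(len(results))   # 5000 results, computed 10 at a time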
Obligatory GIL Discussion
cPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing python bytecode at once. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O bound tasks, such as retrieving a bunch of URLs.
multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free - data transfer between processes is expensive, so a lot of care needs to be taken not to write workers that depend on shared state.
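Here is a sketch of the same CPU-bound job pushed onto processes instead of threads, again with an invented sum_to worker; each task only passes an integer back and forth, so the data-transfer cost stays small:

from multiprocessing import Pool

def sum_to(value):
    return sum(range(value + 1))

if __name__ == '__main__':
    with Pool() as pool:   # one worker process per CPU core by default
        results = pool.map(sum_to, range(500001, 500000 * 2, 100))
    print(len(results))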
join() waits for your thread to finish, so the first use starts a hundred threads and then waits for all of them to finish. The second use waits for the end of every thread before it launches another one, which kind of defeats the purpose of threading.
The first use makes the most sense. You run the threads (all of them) to do some parallel computation, and then wait until all of them finish before you move on and use the results, to make sure the work is done (i.e. the results are actually there).

why should i join() after start() in multithreading in python?

just as the sample code shows:
for temp in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    threads.append(thread)

for current in range(10):
    threads[current].join()
The code is just a part of a Python file, but it stands for most circumstances: I should execute join() after start() in multithreading. I have been confused by this for a few days. As we all know, when we execute thread.start(), a new thread starts and Python runs through the different threads automatically; that's all we need. If so, why should I add thread.join() after start()? join() means waiting until the current thread has finished, IMO. But doesn't that mean a kind of single-threading? If I have to wait for each thread to finish its task, it's not multithreading! join() only means executing the specified functions one by one, IMO. Can't start() handle the multithreading perfectly on its own? Why should I add join() to make them finish one by one? Thanks for any help :)
You do it in order to be sure that your threads have actually finished (and do not become, for example, zombie processes, after your main thread exits).
However, you don't have to do it right after starting the threads. You can do it at the very end of your process.
Join will block the current thread until the thread upon which join is called has finished.
Essentially your code is starting a load of threads and then waiting for them all to complete.
If you didn't, then the chances are the process would exit and none of your threads would get anything done.
