Real difference between thread and threadpool? - python

In this example, is there any real difference, or is it just syntactic sugar?
threads = []
for job in jobs:
    t = threading.Thread(target=job, args=[exchange])
    t.start()
    threads.append(t)
for thread in threads:
    thread.join()
And
with concurrent.futures.ThreadPoolExecutor(max_workers=len(jobs)) as executor:
    for job in jobs:
        executor.submit(job, exchange)
The main point of a ThreadPool should be to reuse threads, but in this example all the threads are exited after the with statement, am I right?
How do I achieve reuse? Keep an instance of the ThreadPool alive somewhere, without the with statement?

You can keep the ThreadPool alive somewhere else for as long as you need. But in this particular case you probably want to utilize the result of .submit like this:
with concurrent.futures.ThreadPoolExecutor(max_workers=len(jobs)) as executor:
    futures = []
    for job in jobs:
        future = executor.submit(job, exchange)
        futures.append(future)
    for future in futures:
        future.result()
which is very similar to raw threads, except that the threads are reused and, with future.result(), we can retrieve the return value (if any) and catch exceptions (you may want to try/except the future.result() call).
Btw, I wouldn't use max_workers=len(jobs); it seems to defeat the point of a thread pool. I also encourage you to have a look at the asyncio API instead. Threads are of limited use in Python anyway.
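To make the reuse explicit, here is a minimal sketch (the names app_pool and run_jobs are illustrative, not from the question) that keeps one executor alive for the life of the program instead of using a with block, and wraps future.result() in try/except:

import concurrent.futures

# One long-lived pool, created once and reused for every batch of jobs.
app_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def run_jobs(jobs, exchange):
    futures = [app_pool.submit(job, exchange) for job in jobs]
    for future in futures:
        try:
            future.result()              # re-raises any exception from the job
        except Exception as exc:
            print("job failed:", exc)

# ... call run_jobs as often as needed; the same 4 worker threads are reused ...

# At program shutdown:
app_pool.shutdown(wait=True)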

What you're asking is like asking whether there is any real difference between owning a truck, and renting a truck just on the days when you need it.
A thread is like the truck. A thread pool is like the truck rental company. Any time you create a thread pool, you are indirectly creating threads—probably more than one.
Creating and destroying threads is a costly operation. Thread pools are useful in programs that continually create many small tasks that need to be performed in different threads. Instead of creating and destroying a new thread for each task, the program submits the task to a thread pool, and the thread pool assigns the tasks to one of its worker threads. The worker threads can live a long time. They don't need to be continually created and destroyed because each one can perform any number of tasks.
If the "tasks" that your program creates need to run for almost as long as the whole program itself, Then it might make more sense just to create raw threads for that. But, if your program creates many short-lived tasks, then the thread pool probably is the better choice.

Related

Is there ever a reason to call join when using pool.map while using python multiprocessing?

multiprocessing.Pool().map() blocks the main process from moving ahead with execution, and yet it gets stated everywhere that join should be called after close as good practice. I wanted to understand, through an example, in what scenario it could ever make sense to use join after a multiprocessing.Pool().map() call.
Where does it state that "good practice"? If you have no further need of the pool, i.e. you do not plan on submitting any more tasks and your program is not terminating, but you want to release the resources used by the pool and "clean up" right away, you can just call terminate, either explicitly or implicitly; the implicit call happens when you use a with block as follows:
with Pool() as pool:
    ...
# terminate is called implicitly when the above block exits
But note that terminate will not wait for outstanding tasks, if any, to complete. If there are submitted tasks that are queued up but not yet running, or tasks that are currently running, they will be canceled.
Calling close prevents further tasks from being submitted and should only be called when you have no further use for the pool. Calling join, which requires that you first call close, will wait for any outstanding tasks to complete and for the processes in the pool to terminate. But if you are using map, by definition it blocks until the submitted tasks complete. So unless you have submitted any other tasks, there is no compelling need to call close followed by join. These calls are, however, useful for waiting for outstanding tasks submitted with, for example, apply_async to complete without having to explicitly call get on the AsyncResult instance returned by that call:
pool = Pool()
pool.apply_async(worker1, args=(arg1, arg2))
pool.apply_async(worker2, args=(arg3,))
pool.apply_async(worker3)
# wait for all 3 tasks to complete
pool.close()
pool.join()
Of course, the above is only useful if you do not need any return values from the worker functions.
So to answer your question: not really; only if you happen to have other tasks submitted asynchronously whose completion you are awaiting. It is, however, one way of immediately releasing the pool's resources if you are not planning on exiting your program right away, the other way being to call terminate.
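A runnable sketch of the pattern above (the worker and timings are illustrative): close() plus join() waits for the tasks queued with apply_async, whereas terminate() would stop them abruptly:

import time
from multiprocessing import Pool

def work(n):
    time.sleep(0.5)          # simulate a slow task
    return n * n

if __name__ == "__main__":
    pool = Pool(2)
    for i in range(4):
        pool.apply_async(work, args=(i,))
    pool.close()             # no further submissions allowed
    pool.join()              # blocks until all four tasks have finished
    # pool.terminate() instead would have stopped queued/running tasks abruptly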

Can I fire & forget callables submitted to a thread pool?

In my Python 3 application I have to deal with many small (simultaneous) I/O tasks that should not block the main thread, so I want to make use of thread pool:
from concurrent.futures import ThreadPoolExecutor

class MyApp:
    def __init__(self):
        self.app_thread_pool = ThreadPoolExecutor()

    def submit_task(self, task):
        self.app_thread_pool.submit(MyApp.task_runner, task)

    @staticmethod
    def task_runner(task):
        # Do some stuff with the task, save to disk, etc.
        pass
This works fine, the jobs are being submitted and started in the threads of the thread pool, the tasks do what they are supposed to do. Now, reading the documentation on the concurrent.futures module, it seems this module / thread pool is to be used with Future objects in order to handle the outcome of the submitted tasks. In my case, however, I am not interested in these results, I want to fire & forget my tasks, they are able to handle themselves.
So my question is: do I have to use futures, or can I simply submit() a task, as shown above, and ignore any Future objects returned from the job submission? I ask in terms of memory and resource management. Note that I also don't want to use the thread pool with a with statement, as described in the docs, because I need this thread pool over the whole lifetime of my application, and the main thread that starts the thread pool has many other things to do ...
No, of course you don't have to use the returned Future objects. As you have discovered, your code seems to correctly work without doing so. It's just a question of observability and robustness.
Keeping the Futures allows you to keep track of the submitted tasks. You can learn when they finish, retrieve their results, cancel them, etc. This is important for just knowing what's going on with your tasks.
But an arguably more important reason to keep tabs on the tasks is for robustness. What if one of the tasks fails? You probably want to retry it somehow. And note that there are very few tasks that cannot fail. You say your tasks are "I/O", which is a classic example of something where failures are common, and only known after you try something.
So while there is nothing forcing you to keep track of the futures, you probably should, especially if you need a robust, long-running application.
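A sketch of a lightweight middle ground (the helper names are illustrative, not from the question): attach a done callback so failures at least get logged, while the submitter still never blocks on a result:

import logging
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def _log_failure(future):
    # future.exception() returns the exception the task raised, if any.
    exc = future.exception()
    if exc is not None:
        logging.error("background task failed: %r", exc)

def submit_task(task):
    future = pool.submit(task)
    future.add_done_callback(_log_failure)   # fire & forget, but failures stay visible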

python multiprocessing pool blocking main thread

I have the following snippet which attempts to split processing across multiple sub-processes.
def search(self):
    print("Checking queue for jobs to process")
    if self._job_queue.has_jobs_to_process():
        print("Queue threshold met, processing jobs.")
        job_sub_lists = partition_jobs(self._job_queue.get_jobs_to_process(), self._process_pool_size)
        populated_sub_lists = [sub_list for sub_list in job_sub_lists if len(sub_list) > 0]
        self._process_pool.map(process, populated_sub_lists)
        print("Job processing pool mapped")
The search function is being called by the main process in a while loop, and if the queue reaches a threshold count, the processing pool is mapped to the process function with the jobs sourced from the queue. My question is: does the Python multiprocessing pool block the main process during execution, or does it continue execution immediately? I don't want to encounter the scenario where has_jobs_to_process() evaluates to true and, during the processing of those jobs, it evaluates to true for another set of jobs, so that self._process_pool.map(process, populated_sub_lists) is called again while processes are still running, as I do not know the consequences of that.
multiprocessing.Pool.map blocks the calling thread (not necessarily the MainThread!), not the whole process.
Other threads of the parent process will not be blocked. You could call pool.map from multiple threads in the parent process without breaking things (though it doesn't make much sense). That's because Pool internally uses a thread-safe queue.Queue for its _taskqueue.
From the multiprocessing docs, Pool.map will block the calling process during execution until the result is ready, and Pool.map_async will not.
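A small sketch (the worker and sizes are illustrative) showing the difference: map blocks until every result is ready, while map_async returns an AsyncResult immediately and you block only when you call get():

import time
from multiprocessing import Pool

def square(n):
    time.sleep(0.2)
    return n * n

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(4)))        # blocks the calling thread

        async_result = pool.map_async(square, range(4))
        print("caller keeps going...")           # reached immediately
        print(async_result.get())                # block only when the results are needed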

prevent thread starvation Python

I have a function which does some file writing. The semaphore is for limiting the number of concurrent threads to 2. The total number of threads is 3. How can I prevent starvation among the 3 threads? Is a queue an option for that?
import time
import threading

sema = threading.Semaphore(2)

def write_file(file, data):
    sema.acquire()
    try:
        f = open(file, "a")
        f.write(data)
        f.close()
    finally:
        sema.release()
I have to object to the accepted answer. It is true that Condition queues the waits, but the more important part is when it tries to acquire the Condition lock.
The order in which threads are released is not deterministic
The implementation may pick one at random, so the order in which blocked threads are awakened should not be relied on.
In the case of three threads, I agree it's very unlikely that two are trying to acquire the lock at the same time (one working, one waiting, one acquiring the lock), but there could still be interference.
A good solution for your problem, IMO, would be a thread whose single purpose is to read your data from a queue and write it to a file; a sketch follows below. All other threads can write to the queue and continue working.
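A minimal sketch of that single-writer pattern (the file name and the None sentinel are illustrative):

import threading
import queue

write_queue = queue.Queue()

def writer_worker(path):
    # The only thread that touches the file: it drains the queue and appends.
    with open(path, "a") as f:
        while True:
            data = write_queue.get()
            if data is None:          # sentinel: shut the writer down
                break
            f.write(data)
            f.flush()

writer = threading.Thread(target=writer_worker, args=("out.txt",))
writer.start()

# Any other thread just enqueues and moves on; the queue is FIFO, so no writer starves:
write_queue.put("some data\n")

# At shutdown:
write_queue.put(None)
writer.join()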
If a thread is waiting to acquire the semaphore, either of the other two threads will be done writing and release the semaphore.
You might worry that, if there is a lot of writing going on, the writers could reacquire the semaphore before the waiting thread is notified. I don't think this can happen.
The Semaphore object in Python (2.7) uses a Condition. The Condition adds waiting threads (actually a lock that the waiting thread is blocking on) to the end of a waiters list, and when notifying threads, the notified threads are taken from the beginning of the list. So the list acts like a FIFO queue.
It looks something like this:
def wait(self, timeout=None):
    self.__waiters.append(waiter)
    ...

def notify(self, n=1):
    ...
    waiters = self.__waiters[:n]
    for waiter in waiters:
        waiter.release()
    ...
My understanding, after reading the source code, is that Python's Semaphores are FIFO. I couldn't find any other information about this, so please correct me if I'm wrong.

When, why, and how to call thread.join() in Python?

I have this python threading code.
import threading

def sum(value):
    sum = 0
    for i in range(value+1):
        sum += i
    print "I'm done with %d - %d\n" % (value, sum)
    return sum

r = range(500001, 500000*2, 100)
ts = []
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
for t in ts:
    t.join()
Executing this, I have hundreds of threads working.
However, when I move the t.join() right after the t.start(), I have only two threads working.
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
    t.join()
I tested the code that does not invoke t.join(), and it seems to work fine?
So when, why, and how should I use thread.join()?
You seem to not understand what Thread.join does. When calling join, the current thread will block until that thread has finished. So you are waiting for each thread to finish before you start the next one.
The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program. If you didn't do that and the main program reached its end, any daemon threads it created would be killed (non-daemon threads, like the ones here, keep the process alive until they finish, but joining makes the wait explicit). So usually you should have a loop at the end that joins all created threads, to prevent the main thread from exiting too early.
Short answer: this one:
for t in ts:
    t.join()
is generally the idiomatic pattern for a small number of threads: start them all first, then join them all. Doing .join means that your main thread waits until the given thread finishes before proceeding in execution. You generally do this after you've started all of the threads.
Longer answer:
len(list(range(500001, 500000*2, 100)))
Out[1]: 5000
You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!
Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need in order to limit your thread creation to a reasonable number (i.e. more than one, fewer than 5000) is a ThreadPool. There are various ways to do this. You could roll your own; this is fairly simple with a threading.Semaphore. You could use the concurrent.futures package from 3.2+ (a sketch follows below). You could use some third-party solution. Up to you; each is going to have a different API, so I can't really discuss that further.
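A minimal sketch of the concurrent.futures route, reusing the question's workload (the worker is renamed sum_to to avoid shadowing the built-in sum, and the cap of 8 workers is arbitrary):

from concurrent.futures import ThreadPoolExecutor

def sum_to(value):
    total = 0
    for i in range(value + 1):
        total += i
    return total

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(sum_to, range(500001, 500000 * 2, 100)))
# The with block joins the workers; all 5000 tasks share just 8 threads.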
Obligatory GIL Discussion
cPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing python bytecode at once. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O bound tasks, such as retrieving a bunch of URLs.
multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free: data transfer between processes is expensive, so care needs to be taken not to write workers that depend on shared state.
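For completeness, a sketch (not from the answer) of the multiprocessing route for this CPU-bound workload, where multiple processes can give a real speed-up:

from multiprocessing import Pool

def sum_to(value):
    # Same CPU-bound worker as in the previous sketch.
    total = 0
    for i in range(value + 1):
        total += i
    return total

if __name__ == "__main__":
    with Pool() as pool:     # defaults to one worker process per CPU core
        results = pool.map(sum_to, range(500001, 500000 * 2, 100))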
join() waits for your thread to finish, so the first use starts all of the threads and then waits for all of them to finish. The second use waits for each thread to end before it launches another one, which kind of defeats the purpose of threading.
The first use makes the most sense. You run the threads (all of them) to do some parallel computation, and then wait until all of them finish before you move on and use the results, to make sure the work is done (i.e. the results are actually there).
