In my Python 3 application I have to deal with many small, simultaneous I/O tasks that should not block the main thread, so I want to make use of a thread pool:
from concurrent.futures import ThreadPoolExecutor

class MyApp:
    def __init__(self):
        self.app_thread_pool = ThreadPoolExecutor()

    def submit_task(self, task):
        self.app_thread_pool.submit(MyApp.task_runner, task)

    @staticmethod
    def task_runner(task):
        # Do some stuff with the task, save to disk, etc.
        ...
This works fine: the jobs are submitted and started in the threads of the thread pool, and the tasks do what they are supposed to do. Now, reading the documentation on the concurrent.futures module, it seems this module / thread pool is meant to be used with Future objects in order to handle the outcome of the submitted tasks. In my case, however, I am not interested in these results; I want to fire & forget my tasks, as they are able to handle themselves.
So my question is: do I have to use futures, or can I simply submit() a task, as shown above, and ignore any Future objects returned from the job submission? I ask in terms of memory and resource management. Note that I also don't want to use the thread pool via the with statement, as described in the docs, because I need this thread pool over the whole lifetime of my application, and the main thread starting the thread pool has many other things to do ...
No, of course you don't have to use the returned Future objects. As you have discovered, your code seems to work correctly without doing so. It's just a question of observability and robustness.
Keeping the Futures allows you to keep track of the submitted tasks. You can learn when they finish, retrieve their results, cancel them, etc. This is important for just knowing what's going on with your tasks.
But an arguably more important reason to keep tabs on the tasks is for robustness. What if one of the tasks fails? You probably want to retry it somehow. And note that there are very few tasks that cannot fail. You say your tasks are "I/O", which is a classic example of something where failures are common, and only known after you try something.
So while there is nothing forcing you to keep track of the futures, you probably should, especially if you need a robust, long-running application.
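A minimal sketch of that middle ground, reusing the MyApp class from the question: the task stays fire-and-forget, but a done-callback (log_task_failure is a name made up here) logs any exception instead of letting it disappear silently.

import logging

logger = logging.getLogger(__name__)

def log_task_failure(future):
    # Called once the task finishes; exception() returns None on success.
    exc = future.exception()
    if exc is not None:
        logger.error("Background task failed: %r", exc)

# Inside MyApp:
def submit_task(self, task):
    future = self.app_thread_pool.submit(MyApp.task_runner, task)
    future.add_done_callback(log_task_failure)  # still fire-and-forget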
Related
Being new to using concurrency, I am confused about when to use the different python concurrency libraries. To my understanding, multiprocessing, multithreading and asynchronous programming are part of concurrency, while multiprocessing is part of a subset of concurrency called parallelism.
I searched around on the web about different ways to approach concurrency in python, and I came across the multiprocessing library, concurrent.futures' ProcessPoolExecutor() and ThreadPoolExecutor(), and asyncio. What confuses me is the difference between these libraries. Especially what the multiprocessing library does, since it has methods like pool.apply_async; does it also do the job of asyncio? If so, why is it called multiprocessing when it achieves concurrency by a different method than asyncio (multiple processes vs. cooperative multitasking)?
There are several different libraries at play:
threading: interface to OS-level threads. Note that CPU-bound work is mostly serialized by the GIL, so don't expect threading to speed up calculations. Use it when you need to invoke blocking APIs in parallel, and when you require precise control over thread creation. Avoid creating too many threads (e.g. thousands), as they are not free. If possible, don't create threads yourself, use concurrent.futures instead.
multiprocessing: interface to spawning multiple python processes with an API intentionally similar to threading. Multiple processes work in parallel, so you can actually speed up calculations using this method. The disadvantage is that you can't share in-memory data structures without using multiprocessing-specific tools.
concurrent.futures: A modern interface to threading and multiprocessing, which provides convenient thread/process pools it calls executors. The pool's main entry point is the submit method, which returns a handle that you can test for completion or wait on for its result. Getting the result gives you the return value of the submitted function and correctly propagates any raised exceptions, which would be tedious to do with threading. concurrent.futures should be the tool of choice when considering thread- or process-based parallelism (a short sketch follows this list).
asyncio: While the previous options are "async" in the sense that they provide non-blocking APIs (this is what methods like apply_async refer to), they are still relying on thread/process pools to do their magic, and cannot really do more things in parallel than they have workers in the pool. Asyncio is different: it uses a single thread of execution and async system calls across the board. It has no blocking calls at all, the only blocking part being the asyncio.run() entry point. Asyncio code is typically written using coroutines, which use await to suspend until something interesting happens. (Suspending is different than blocking in that it allows the event loop thread to continue to other things while you're waiting.) It has many advantages compared to thread-based solutions, such as being able to spawn thousands of cheap "tasks" without bogging down the system, and being able to cancel tasks or easily wait for multiple things at once. Asyncio should be the tool of choice for servers and for clients connecting to multiple servers.
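As promised above, a minimal sketch of the concurrent.futures workflow (crunch is a made-up CPU-bound function): submit() returns a Future, and result() either returns the function's return value or re-raises the exception from the worker.

from concurrent.futures import ProcessPoolExecutor, as_completed

def crunch(n):
    # Made-up CPU-bound work; runs in a separate process
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(crunch, n) for n in (10**5, 10**6, 10**7)]
        for future in as_completed(futures):
            print(future.result())  # return value, or re-raised exception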
When choosing between asyncio and multithreading/multiprocessing, consider the adage that "threading is for working in parallel, and async is for waiting in parallel".
Also note that asyncio can await functions executed in thread or process pools provided by concurrent.futures, so it can serve as glue between all those different models. This is part of the reason why asyncio is often used to build new library infrastructure.
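A minimal sketch of that glue (blocking_fetch is a made-up stand-in for any blocking call): run_in_executor submits the call to a pool and returns an awaitable, so the coroutine suspends while the event loop keeps running.

import asyncio
import time

def blocking_fetch(url):
    # Made-up stand-in for a blocking call (sync I/O, a legacy library, etc.)
    time.sleep(1)
    return "data from " + url

async def main():
    loop = asyncio.get_running_loop()
    # None means the loop's default ThreadPoolExecutor; a
    # concurrent.futures.ProcessPoolExecutor works here too.
    data = await loop.run_in_executor(None, blocking_fetch, "https://example.com")
    print(data)

asyncio.run(main())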
I'm looking for a conceptual answer on this question.
I'm wondering whether using ThreadPool in Python to perform concurrent tasks guarantees that data is not corrupted; I mean, that multiple threads don't access critical data at the same time.
If so, how does this ThreadPoolExecutor internally works to ensure that critical data is accessed by only one thread at a time?
Thread pools do not guarantee that shared data is not corrupted. Threads can swap at any bytecode execution boundary, so corruption is always a risk. Shared data should be protected by synchronization primitives such as locks, condition variables and events. See the threading module docs.
concurrent.futures.ThreadPoolExecutor is a thread pool specialized to the concurrent.futures async task model. But all of the risks of traditional threading are still there.
If you are using the python async model, things that fiddle with shared data should be dispatched on the main thread. The thread pool should be used for autonomous events, especially those that wait on blocking I/O.
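To make that concrete, a minimal sketch of guarding shared state yourself when using ThreadPoolExecutor; the lock is entirely your code, the pool contributes nothing here.

import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
counter_lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, the read-modify-write of `counter` can
        # interleave between threads and silently lose updates.
        with counter_lock:
            counter += 1

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(increment, 100000)

print(counter)  # reliably 400000 only because of the lock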
If so, how does this ThreadPoolExecutor internally works to ensure that critical data is accessed by only one thread at a time?
It doesn't, that's your job.
The high-level methods like map will use a safe work queue and not share work items between threads, but if you've got other resources that can be shared, then the pool does not know or care; it's your problem as the developer.
I'm trying to understand the best practices with Python's multiprocessing.Pool object.
In my program I use Pool.imap very frequently. Normally every time I start tasks in parallel I create a new pool object and then close it after I'm done.
I recently encountered a hang where the number of tasks submitted to the pool was less than the number of processes. What was odd was that it only occurred in my test pipeline, which had a bunch of things run before it. Running the test standalone did not cause the hang. I assume it has to do with making multiple pools.
I'd really like to find some resources to help me understand the best practices in using Python's multiprocessing. Specifically I'm currently trying to understand the implications of making several pool objects versus using only one.
When you create a Pool of worker processes, new processes are spawned from the parent one. This is a fast operation, but it is not free.
Therefore, unless you have a very good reason, for example the Pool breaking because a worker died unexpectedly, it's better to always use the same Pool instance.
The reason for the hang is hard to tell without inspecting the code. You might not have cleaned up the previous instances properly (call close() or terminate(), and then always call join()), or you might have sent data that was too big through the Pool channel, which usually ends up in a deadlock, and so on.
Surely a pool does not break if you submit fewer tasks than there are workers. The pool is designed exactly to decouple the number of tasks from the number of workers.
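A minimal sketch of that reuse pattern, including the cleanup described above (close() followed by join()):

from multiprocessing import Pool

def work(item):
    return item * item

if __name__ == "__main__":
    pool = Pool(processes=4)  # create once, reuse for the whole run
    try:
        for result in pool.imap(work, range(10)):
            print(result)
        for result in pool.imap(work, range(10, 20)):  # same pool, new batch
            print(result)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to exit cleanly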
I have recently been working on a pet project in Python using Flask. It is a simple pastebin with server-side syntax highlighting support via pygments. Because this is a costly task, I delegated the syntax highlighting to a celery task queue, and in the request handler I'm waiting for it to finish. Needless to say, this does no more than shift the CPU load to another worker, because waiting for the result still ties up the connection to the web server.
Despite my instincts telling me to avoid premature optimization like the plague, I still couldn't help myself from looking into async.
Async
If you have been following Python web development lately, you have surely seen that async is everywhere. What async does is bring back cooperative multitasking, meaning each "thread" decides when and where to yield to another. This non-preemptive process is more efficient than OS threads, but it still has its drawbacks. At the moment there seem to be two major approaches:
event/callback style multitasking
coroutines
The first one provides concurrency through loosely-coupled components executed in an event loop. Although this is safer with respect to race conditions and provides for more consistency, it is considerably less intuitive and harder to code than preemptive multitasking.
The other one is a more traditional solution, closer to threaded programming style, the programmer only having to manually switch context. Although more prone to race-conditions and deadlocks, it provides an easy drop-in solution.
Most async work at the moment is done on what is known as IO-bound tasks, tasks that block to wait for input or output. This is usually accomplished through the use of polling and timeout based functions that can be called and if they return negatively, context can be switched.
Despite the name, this could be applied to CPU-bound tasks too, which can be delegated to another worker (thread, process, etc.) and then waited on without blocking. Ideally, these tasks would be written in an async-friendly manner, but realistically this would imply separating code into chunks small enough not to block, preferably without scattering context switches after every line of code. This is especially inconvenient for existing synchronous libraries.
Due to the convenience, I settled on using gevent for async work and was wondering how CPU-bound tasks should be dealt with in an async environment (using futures, celery, etc.?).
How to use async execution models (gevent in this case) with traditional web frameworks such as flask? What are some commonly agreed-upon solutions to these problems in python (futures, task queues)?
EDIT: To be more specific - How to use gevent with flask and how to deal with CPU-bound tasks in this context?
EDIT2: Considering how Python has the GIL, which prevents optimal execution of threaded code, this leaves only the multiprocessing option, in my case at least. This means either using concurrent.futures or some other external service dealing with the processing (which can even open the door to something language-agnostic). What would, in this case, be some popular or often-used solutions with gevent (i.e. celery)? - best practices
It should be thread-safe to do something like the following to separate CPU-intensive tasks into asynchronous threads:
from threading import Thread

def send_async_email(msg):
    # mail is assumed to come from an extension such as Flask-Mail
    mail.send(msg)

def send_email(subject, sender, recipients, text_body, html_body):
    # Message is Flask-Mail's message class
    msg = Message(subject, sender=sender, recipients=recipients)
    msg.body = text_body
    msg.html = html_body
    thr = Thread(target=send_async_email, args=(msg,))
    thr.start()
If you need something more complicated, then perhaps Flask-Celery or the multiprocessing library with Pool might be useful to you.
I'm not too familiar with gevent, though, and I can't imagine what more complexity you might need or why.
I mean, if you're aiming for the efficiency of a major website, then I'd recommend building C++ applications to do your CPU-intensive work, and then using Flask-Celery or Pool to drive those processes. (This is what YouTube does when mixing C++ and Python.)
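For illustration, a minimal Celery sketch for this pastebin use case; the broker/backend URLs and the task name are assumptions, while the pygments calls are the library's real API.

from celery import Celery
from pygments import highlight
from pygments.lexers import guess_lexer
from pygments.formatters import HtmlFormatter

# Assumed broker/backend; any broker Celery supports works here
app = Celery("tasks", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def highlight_paste(source):
    # The CPU-heavy work runs in a separate Celery worker process
    return highlight(source, guess_lexer(source), HtmlFormatter())

# In the request handler:
#   result = highlight_paste.delay(source_code)  # fire the task
#   html = result.get(timeout=10)                # or poll later instead of blocking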
How about simply using ThreadPool and Queue? You can then process your stuff in a separate thread in a synchronous manner, and you won't have to worry about blocking at all. Well, Python is not suited for CPU-bound tasks in the first place, so you should also think about spawning subprocesses.
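A minimal sketch of that thread-and-Queue idea: a single worker thread drains a queue, so request handlers only enqueue and return (highlight_source is a placeholder for the real CPU-bound call).

import queue
import threading

task_queue = queue.Queue()

def highlight_source(source):
    # Placeholder for the CPU-bound pygments call
    return source.upper()

def worker():
    while True:
        source = task_queue.get()
        try:
            highlight_source(source)  # store the result wherever appropriate
        finally:
            task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# In a request handler: enqueue and return immediately
task_queue.put("print('hello')")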
I wrote a python script that:
1. submits search queries
2. waits for the results
3. parses the returned results (XML)
I used the threading and Queue modules to perform this in parallel (5 workers).
It works great for the querying portion because I can submit multiple search jobs and deal with the results as they come in.
However, it appears that all my threads get bound to the same core. This is apparent when it gets to the part where it processes the XML (CPU-intensive).
Has anyone else encountered this problem? Am I missing something conceptually?
Also, I was pondering the idea of having two separate work queues, one for making the queries and one for parsing the XML. As it is now, one worker will do both in serial. I'm not sure what that will buy me, if anything. Any help is greatly appreciated.
Here is the code: (proprietary data removed)
import sys
import Queue  # Python 2's queue module ("queue" in Python 3)
from threading import Thread

work_queue = Queue.Queue()
thread_list = []
sources = []  # proprietary data removed

def addWork(source_list):
    for item in source_list:
        #print "adding: '%s'" % (item)
        work_queue.put(item)

def doWork(thread_id):
    while 1:
        try:
            gw = work_queue.get(block=False)
        except Queue.Empty:
            #print "thread '%d' is terminating..." % (thread_id)
            sys.exit()  # no more work in the queue for this thread, die quietly
        ## Here is where I make the call to the REST API
        ## Here is where I wait for the results
        ## Here is where I parse the XML results and dump the data into a "global" dict

#MAIN
producer_thread = Thread(target=addWork, args=(sources,))
producer_thread.start()  # start the thread (i.e. call the target function)
producer_thread.join()   # wait for the thread/target function to terminate (block)

# start the consumers
for i in range(5):
    consumer_thread = Thread(target=doWork, args=(i,))
    consumer_thread.start()
    thread_list.append(consumer_thread)

for thread in thread_list:
    thread.join()
This is a byproduct of how CPython handles threads. There are endless discussions around the internet (search for GIL) but the solution is to use the multiprocessing module instead of threading. Multiprocessing is built with pretty much the same interface (and synchronization structures, so you can still use queues) as threading. It just gives every thread its own entire process, thus avoiding the GIL and forced serialization of parallel workloads.
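A minimal sketch of that swap (parse_xml stands in for the asker's CPU-heavy parsing step): Pool.map distributes the documents across processes, so the GIL no longer serializes the parsing.

from multiprocessing import Pool

def parse_xml(xml_text):
    # Stand-in for the CPU-intensive XML parsing
    return len(xml_text)

if __name__ == "__main__":
    responses = ["<a/>", "<b/>", "<c/>"]  # documents fetched by the I/O threads
    with Pool(processes=5) as pool:
        parsed = pool.map(parse_xml, responses)
    print(parsed)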
Using CPython, your threads will never actually run in parallel on two different cores. Look up information on the Global Interpreter Lock (GIL).
Basically, there's a mutual exclusion lock protecting the actual execution part of the interpreter, so no two threads can compute in parallel. Threading for I/O tasks will work just fine, because threads release the GIL while blocked on I/O.
edit: If you want to take full advantage of multiple cores, you need to use multiple processes. There are a lot of articles about this topic; I tried to look up one for you that I remember was great, but I can't find it =/.
As Nathon suggested, you can use the multiprocessing module. There are tools to help you share objects between processes (take a look at POSH, Python Object Sharing).