I've been reading about the asyncio module in Python 3, and more broadly about coroutines in Python, and I don't get what makes asyncio such a great tool.
I have the feeling that anything you can do with coroutines, you can do better by using task queues based on the multiprocessing module (Celery, for example).
Are there use cases where coroutines are better than task queues?
Not a proper answer, but a list of hints that could not fit into a comment:
You are mentioning the multiprocessing module (and let's consider threading too). Suppose you have to handle hundreds of sockets: can you spawn hundreds of processes or threads?
Again, with threads and processes: how do you handle concurrent access to shared resources? What is the overhead of mechanisms like locking?
Frameworks like Celery also add significant overhead. Can you use it, e.g., for handling every single request on a high-traffic web server? By the way, in that scenario, who is responsible for handling sockets and connections (Celery by its nature can't do that for you)?
Be sure to read the rationale behind asyncio. That rationale (among other things) mentions a system call: writev() -- isn't that much more efficient than multiple write()s?
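To make the sockets point concrete, here is a minimal, illustrative asyncio echo server: hundreds of concurrent connections become hundreds of cheap coroutines on a single thread (the host and port are arbitrary):

```python
import asyncio

async def handle(reader, writer):
    # Each connection is a cheap coroutine, not an OS thread or process.
    data = await reader.readline()
    writer.write(data)  # echo the line back
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```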
Adding to the above answer:
If the task at hand is I/O bound and operates on shared data, coroutines and asyncio are probably the way to go.
If, on the other hand, you have CPU-bound tasks where data is not shared, a multiprocessing system like Celery should be better.
If the task at hand is both CPU- and I/O-bound and sharing of data is not required, I would still use Celery. You can use async I/O from within Celery!
If you have a CPU-bound task but with the need to share data, the only viable option as I see it now is to save the shared data in a database. There have been recent attempts like pyparallel, but they are still a work in progress.
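As a rough illustration of the CPU-bound, no-shared-data case, here is a minimal sketch using the standard library's process pool rather than Celery (crunch is a made-up stand-in for real work):

```python
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # CPU-bound: pure computation, no I/O, no shared state.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Each input is crunched in a separate process, sidestepping the GIL.
        results = list(pool.map(crunch, [10**6] * 4))
    print(results)
```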
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes these messages, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed this process up a lot!
From what we have seen, our biggest bottleneck is the HTTP requests used to send and receive messages from the AWS SQS API. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to scale this process up by one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating different processes may still give some benefit, since the CPU will probably switch processes as one or another gets stuck at an I/O operation. But still, that seems like a lot of process-management overhead and resources for operations that don't need to run in parallel, only concurrently.
Using Python's threading: since the GIL locks all threads to a single core, and threads have less overhead than processes, this seems like a good option. As one thread gets stuck waiting for an HTTP response, the CPU can pick up another thread, and so on. This would get us our desired concurrent execution. But my question is: how does Python's threading know that it can switch one thread for another? Does it know that some thread is currently in an I/O operation, so that it can switch it for another one? Will this approach absolutely maximize CPU usage and avoid busy waiting? Do I specifically have to give up control of the CPU inside a thread, or is this done automatically in Python?
Recently, I also read about a concept called green threads, using Eventlet in Python. From what I saw, they seem like the perfect match for my project. They have little overhead and don't create OS threads the way threading does. But will we have the same problems as with threading regarding CPU control? Does a green thread need to warn the CPU that it may switch to another one? I saw in some examples that Eventlet offers some built-in libraries like urlopen, but not Requests.
The last option we considered was using Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased. But I can understand that most of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave something like Eventlet.
So what do you think would be the best option here? What library would allow me to maximize performance on a single-core machine, avoiding busy waits as much as possible?
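For reference, the asyncio route sketched with aiohttp looks roughly like this (a minimal sketch, not your SQS code; the URL and request count are placeholders):

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # While this request waits on the network, the event loop runs other tasks.
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = ["https://example.com/"] * 100  # placeholder endpoints
    async with aiohttp.ClientSession() as session:
        # All 100 requests are in flight concurrently, on a single thread.
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    print(len(results))

asyncio.run(main())
```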
Being new to concurrency, I am confused about when to use the different Python concurrency libraries. To my understanding, multiprocessing, multithreading and asynchronous programming are all forms of concurrency, while multiprocessing also belongs to a subset of concurrency called parallelism.
I searched around on the web for the different ways to approach concurrency in Python, and I came across the multiprocessing library, concurrent.futures' ProcessPoolExecutor() and ThreadPoolExecutor(), and asyncio. What confuses me is the difference between these libraries, especially what the multiprocessing library does: since it has methods like pool.apply_async, does it also do the job of asyncio? If so, why is it called multiprocessing when it achieves concurrency by a different method than asyncio (multiple processes vs. cooperative multitasking)?
There are several different libraries at play:
threading: interface to OS-level threads. Note that CPU-bound work is mostly serialized by the GIL, so don't expect threading to speed up calculations. Use it when you need to invoke blocking APIs in parallel, and when you require precise control over thread creation. Avoid creating too many threads (e.g. thousands), as they are not free. If possible, don't create threads yourself, use concurrent.futures instead.
multiprocessing: interface to spawning multiple python processes with an API intentionally similar to threading. Multiple processes work in parallel, so you can actually speed up calculations using this method. The disadvantage is that you can't share in-memory datastructures without using multi-processing specific tools.
concurrent.futures: A modern interface to threading and multiprocessing, which provides convenient thread/process pools it calls executors. The pool's main entry point is the submit method which returns a handle that you can test for completion or wait for its result. Getting the result gives you the return value of the submitted function and correctly propagates raised exceptions (if any), which would be tedious to do with threading. concurrent.futures should be the tool of choice when considering thread or process based parallelism.
asyncio: While the previous options are "async" in the sense that they provide non-blocking APIs (this is what methods like apply_async refer to), they are still relying on thread/process pools to do their magic, and cannot really do more things in parallel than they have workers in the pool. Asyncio is different: it uses a single thread of execution and async system calls across the board. It has no blocking calls at all, the only blocking part being the asyncio.run() entry point. Asyncio code is typically written using coroutines, which use await to suspend until something interesting happens. (Suspending is different than blocking in that it allows the event loop thread to continue to other things while you're waiting.) It has many advantages compared to thread-based solutions, such as being able to spawn thousands of cheap "tasks" without bogging down the system, and being able to cancel tasks or easily wait for multiple things at once. Asyncio should be the tool of choice for servers and for clients connecting to multiple servers.
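A minimal illustration of that model (asyncio.sleep stands in for any real awaitable I/O):

```python
import asyncio

async def worker(name):
    # await suspends this coroutine; the event loop runs others meanwhile.
    await asyncio.sleep(1)
    return name

async def main():
    # A thousand tasks are cheap: no OS threads are created for them.
    tasks = [asyncio.create_task(worker(i)) for i in range(1000)]
    results = await asyncio.gather(*tasks)
    print(len(results))  # completes in about 1 second, not 1000

asyncio.run(main())
```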
When choosing between asyncio and multithreading/multiprocessing, consider the adage that "threading is for working in parallel, and async is for waiting in parallel".
Also note that asyncio can await functions executed in thread or process pools provided by concurrent.futures, so it can serve as glue between all those different models. This is part of the reason why asyncio is often used to build new library infrastructure.
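For example, a coroutine can push a blocking call into a concurrent.futures pool and await the result (a minimal sketch; blocking_io is a made-up stand-in for any blocking function):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io():
    # A blocking call that would stall the event loop if run directly.
    time.sleep(1)
    return "done"

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        # The event loop stays responsive while the pool runs the blocking call.
        result = await loop.run_in_executor(pool, blocking_io)
    print(result)

asyncio.run(main())
```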
I'm looking for a conceptual answer on this question.
I'm wondering whether using ThreadPool in Python to perform concurrent tasks guarantees that data is not corrupted; I mean, that multiple threads don't access the critical data at the same time.
If so, how does this ThreadPoolExecutor work internally to ensure that critical data is accessed by only one thread at a time?
Thread pools do not guarantee that shared data is not corrupted. Threads can swap at any bytecode execution boundary, and corruption is always a risk. Shared data should be protected by synchronization primitives such as locks, condition variables and events. See the threading module docs.
concurrent.futures.ThreadPoolExecutor is a thread pool specialized to the concurrent.futures async task model. But all of the risks of traditional threading are still there.
If you are using the python async model, things that fiddle with shared data should be dispatched on the main thread. The thread pool should be used for autonomous events, especially those that wait on blocking I/O.
If so, how does this ThreadPoolExecutor work internally to ensure that critical data is accessed by only one thread at a time?
It doesn't, that's your job.
The high-level methods like map will use a safe work queue and not share work items between threads, but if you've got other resources that can be shared, the pool does not know or care; it's your problem as the developer.
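A minimal sketch of that responsibility, guarding a shared counter with a threading.Lock (the counter and worker function are illustrative, not part of the pool's API):

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

counter = 0
lock = Lock()

def increment(_):
    global counter
    # Without the lock, the read-modify-write on counter could interleave.
    with lock:
        counter += 1

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(increment, range(10_000)))

print(counter)  # reliably 10000 because the lock serializes the updates
```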
From the gevent docs:
The greenlets all run in the same OS thread and are scheduled cooperatively.
From asyncio docs:
This module provides infrastructure for writing single-threaded concurrent code using coroutines.
Try as I might, I haven't come across any major Python libraries that implement multi-threaded or multi-process coroutines, i.e. spreading coroutines across multiple threads so as to increase the number of I/O connections that can be made.
I understand that coroutines essentially allow the main thread to pause executing one I/O-bound task and move on to the next, forcing an interrupt only when one of these I/O operations finishes and requires handling. If that is the case, then distributing I/O tasks across several threads, each of which could be operating on a different core, should obviously increase the number of requests you could make.
Maybe I'm misunderstanding how coroutines work or are meant to work, so my question is in two parts:
Is it possible to even have a coroutine library that operates over multiple threads (possibly on different cores) or multiple processes?
If so, is there such a library?
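For what it's worth, the pieces can be composed by hand: run an independent event loop in each worker process. A rough sketch (fetch and the chunking are hypothetical placeholders, not a library API):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(i):
    # Hypothetical stand-in for real async I/O (e.g. an HTTP request).
    await asyncio.sleep(0.1)
    return i

def run_chunk(chunk):
    # Each worker process hosts its own independent asyncio event loop.
    async def main():
        return await asyncio.gather(*(fetch(i) for i in chunk))
    return asyncio.run(main())

if __name__ == "__main__":
    chunks = [range(0, 100), range(100, 200)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Coroutines run concurrently within each process; processes run in parallel.
        results = list(pool.map(run_chunk, chunks))
    print(sum(len(r) for r in results))  # 200
```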
As I understand, asynchronous networking frameworks/libraries like twisted, tornado, and asyncio provide asynchronous IO through implementing nonblocking sockets and an event loop. Gevent achieves essentially the same thing through monkey patching the standard library, so explicit asynchronous programming via callbacks and coroutines is not required.
On the other hand, asynchronous task queues, like Celery, manage background tasks and distribute those tasks across multiple threads or machines. I do not fully understand this process but it involves message brokers, messages, and workers.
My questions:
Do asynchronous task queues require asynchronous IO? Are they in any way related? The two concepts seem similar, but the implementations at the application level are different. I would think that the only thing they have in common is the word "asynchronous", so perhaps that is throwing me off.
Can someone elaborate on how task queues work and the relationship between the message broker (why are they required?), the workers, and the messages (what are messages? bytes?).
Oh, and I'm not trying to solve any specific problems, I'm just trying to understand the ideas behind asynchronous task queues and asynchronous IO.
Asynchronous IO is a way to use sockets (or more generally file descriptors) without blocking. This term is specific to one process or even one thread. You can even imagine mixing threads with asynchronous calls. It would be completely fine, yet somewhat complicated.
Now, I have no idea what asynchronous task queue means. IMHO there's only a task queue; it's a data structure. You can access it in an asynchronous or synchronous way. And by "access" I mean push and pop calls. These can use the network internally.
So a task queue is a data structure. (A)synchronous IO is a way to access it. That's all there is to it.
The term asynchronous is heavily overused nowadays. The hype is real.
As for your second question:
Message is just a set of data, a sequence of bytes. It can be anything. Usually these are some structured strings, like JSON.
Task == message. A different word is used to indicate the purpose of that data: to perform some task. For example, you would send a message {"task": "process_image"} and your consumer would fire an appropriate function.
Task queue Q is just a queue (the data structure).
Producer P is a process/thread/class/function/thing that pushes messages to Q.
Consumer (or worker) C is a process/thread/class/function/thing that pops messages from Q and does some processing on it.
Message broker B is a process that redistributes messages. In this case a producer P sends a message to B (rather than directly to a queue) and then B can (for example) duplicate this message and send it to 2 different queues Q1 and Q2, so that 2 different workers C1 and C2 will get that message. Message brokers can also act as protocol translators, transform messages, aggregate them and do many other things. Generally it's just a black box between producers and consumers.
As you can see there are no formal definitions of those things and you have to use a bit of intuition to fully understand them.
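A toy version of those roles using only the standard library (queue.Queue stands in for a real broker; the task names are made up):

```python
import json
import queue
import threading

Q = queue.Queue()  # the task queue: just a data structure

def producer():
    # Messages are plain bytes; here, JSON-encoded task descriptions.
    for name in ("process_image", "send_email"):
        Q.put(json.dumps({"task": name}).encode())
    Q.put(None)  # sentinel: no more work

def consumer():
    while True:
        msg = Q.get()
        if msg is None:
            break
        task = json.loads(msg)
        print("performing:", task["task"])

threading.Thread(target=producer).start()
consumer()
```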
Asynchronous tasks, or Celery tasks, are just tasks that are executed asynchronously. In the particular case of Celery, tasks are executed by multiple workers, thereby leveraging the full benefits of threading and multiprocessing as well as distributed nodes. So in a way, we could accomplish what Celery does ourselves by using libraries like multiprocessing or multithreading, but the benefit of Celery is that it handles all that complexity by itself.
Asynchronous IO, on the other hand, is quite different from multithreading or multiprocessing. Async IO is suitable for tasks that are IO-bound (not CPU-bound). It executes multiple IO requests simultaneously using a single thread only. Gevent or asyncio (in the case of Python 3) helps in accomplishing that.
Celery - ideal for tasks that don't need to run in real time
Multiprocessing - ideal for tasks that are CPU-bound
Asyncio/Gevent - ideal for tasks that are IO-bound
Multithreading - due to the inherent Global Interpreter Lock in Python, not of much use in CPU-bound programs; in the case of IO-bound programs, I believe asyncio is a better option
Tornado - a framework that performs IO requests asynchronously
Twisted - a networking framework that provides lots of features besides asynchronous IO