Being new to using concurrency, I am confused about when to use the different python concurrency libraries. To my understanding, multiprocessing, multithreading and asynchronous programming are part of concurrency, while multiprocessing is part of a subset of concurrency called parallelism.
I searched around on the web about different ways to approach concurrency in python, and I came across the multiprocessing library, concurrent.futures' ProcessPoolExecutor() and ThreadPoolExecutor(), and asyncio. What confuses me is the difference between these libraries. I'm especially unsure what the multiprocessing library does: since it has methods like pool.apply_async, does it also do the job of asyncio? If so, why is it called multiprocessing when it is a different method to achieve concurrency from asyncio (multiple processes vs cooperative multitasking)?
There are several different libraries at play:
threading: interface to OS-level threads. Note that CPU-bound work is mostly serialized by the GIL, so don't expect threading to speed up calculations. Use it when you need to invoke blocking APIs in parallel, and when you require precise control over thread creation. Avoid creating too many threads (e.g. thousands), as they are not free. If possible, don't create threads yourself, use concurrent.futures instead.
multiprocessing: interface to spawning multiple python processes with an API intentionally similar to threading. Multiple processes work in parallel, so you can actually speed up calculations using this method. The disadvantage is that you can't share in-memory data structures without using multiprocessing-specific tools.
concurrent.futures: A modern interface to threading and multiprocessing, which provides convenient thread/process pools it calls executors. The pool's main entry point is the submit method which returns a handle that you can test for completion or wait for its result. Getting the result gives you the return value of the submitted function and correctly propagates raised exceptions (if any), which would be tedious to do with threading. concurrent.futures should be the tool of choice when considering thread or process based parallelism.
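As a minimal sketch of the submit/result flow described above (parse_port and run_all are hypothetical names made up for illustration, not part of any library):

```python
from concurrent.futures import ThreadPoolExecutor


def parse_port(text):
    """Hypothetical task: convert text to a port number, raising on bad input."""
    port = int(text)  # raises ValueError for non-numeric text
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def run_all(inputs):
    """Submit every input to a thread pool and collect (input, outcome) pairs."""
    outcomes = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future handle.
        futures = [(text, pool.submit(parse_port, text)) for text in inputs]
        for text, future in futures:
            try:
                # result() waits for completion and re-raises any exception
                # that was raised inside the worker.
                outcomes.append((text, future.result()))
            except ValueError as exc:
                outcomes.append((text, exc))
    return outcomes


if __name__ == "__main__":
    for text, outcome in run_all(["80", "443", "http"]):
        print(text, "->", repr(outcome))
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor should keep the same calling code, as long as the submitted function and its arguments can be pickled.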
asyncio: While the previous options are "async" in the sense that they provide non-blocking APIs (this is what methods like apply_async refer to), they are still relying on thread/process pools to do their magic, and cannot really do more things in parallel than they have workers in the pool. Asyncio is different: it uses a single thread of execution and async system calls across the board. It has no blocking calls at all, the only blocking part being the asyncio.run() entry point. Asyncio code is typically written using coroutines, which use await to suspend until something interesting happens. (Suspending is different than blocking in that it allows the event loop thread to continue to other things while you're waiting.) It has many advantages compared to thread-based solutions, such as being able to spawn thousands of cheap "tasks" without bogging down the system, and being able to cancel tasks or easily wait for multiple things at once. Asyncio should be the tool of choice for servers and for clients connecting to multiple servers.
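A rough sketch of the "many cheap tasks" and cancellation points, assuming asyncio.sleep as a stand-in for real I/O (wait_and_return and the task counts are made up for illustration):

```python
import asyncio


async def wait_and_return(index, delay):
    """Hypothetical task: suspend (not block) for `delay` seconds."""
    await asyncio.sleep(delay)
    return index


async def main(task_count=1000):
    # Create many tasks at once; each one is a small object, not an OS thread.
    tasks = [
        asyncio.create_task(wait_and_return(i, 0.05)) for i in range(task_count)
    ]
    slow = asyncio.create_task(wait_and_return(-1, 60))

    # Wait for all the short tasks at once.
    results = await asyncio.gather(*tasks)

    # Cancel the long-running task instead of waiting a minute for it.
    slow.cancel()
    try:
        await slow
    except asyncio.CancelledError:
        cancelled = True
    else:
        cancelled = False
    return len(results), cancelled


if __name__ == "__main__":
    print(asyncio.run(main()))
```

All the short tasks sleep concurrently on one thread, so the whole run should take roughly as long as a single sleep rather than a thousand of them.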
When choosing between asyncio and multithreading/multiprocessing, consider the adage that "threading is for working in parallel, and async is for waiting in parallel".
Also note that asyncio can await functions executed in thread or process pools provided by concurrent.futures, so it can serve as glue between all those different models. This is part of the reason why asyncio is often used to build new library infrastructure.
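The glue role can be sketched with loop.run_in_executor, here assuming a hypothetical blocking_read function that stands in for a library with no async support:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(name):
    """Hypothetical blocking call; time.sleep stands in for blocking I/O."""
    time.sleep(0.1)
    return f"{name}: done"


async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=3) as pool:
        # run_in_executor() wraps each pool job in an awaitable future,
        # so the event loop stays free while the threads block.
        jobs = [
            loop.run_in_executor(pool, blocking_read, name)
            for name in ("a", "b", "c")
        ]
        return await asyncio.gather(*jobs)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A ProcessPoolExecutor can be passed in the same position for CPU-bound functions, subject to the usual pickling requirements.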
Related
I've read quite a few articles on threading and asyncio modules in python and the major difference I can seem to draw (correct me if I'm wrong) is that in,
threading: multiple threads can be used to execute the python program and these threads are juggled by the OS itself. Further, only when non-blocking I/O is happening on a thread can the GIL be released to allow another thread to use it (since the GIL makes the python interpreter single threaded). This is also more resource intensive than asyncio, since multiple threads will be utilising multiple resources.
asyncio: one single thread can have multiple tasks/coroutines that multitask cooperatively to achieve concurrency. Here, the issue of GIL doesn't arise since it is on a single thread anyway and whenever one non blocking I/O bound task is happening, python interpreter can be used by another coroutine - and all of this is managed by asyncio's event loop.
Also, one article: http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/
says,
    if io_bound:
        if io_very_slow:
            print("Use Asyncio")
        else:
            print("Use Threads")
    else:
        print("Multi Processing")
I'd like to understand, just for better clarity, why exactly we can't use asyncio and threading as substitutes for each other, given we have sufficient resources available. Use cases of when to use what would help understand better. Further, since this topic is very new for me, there might be gaps in my understanding, so any kind of resources, explanations and corrections would be really appreciated.
I'm looking for a conceptual answer on this question.
I'm wondering whether using ThreadPool in python to perform concurrent tasks, guarantees that data is not corrupted; I mean multiple threads don't access the critical data at the same time.
If so, how does this ThreadPoolExecutor internally work to ensure that critical data is accessed by only one thread at a time?
Thread pools do not guarantee that shared data is not corrupted. Threads can swap at any bytecode execution boundary and corruption is always a risk. Shared data should be protected by synchronization resources such as locks, condition variables and events. See the threading module docs.
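As a minimal sketch of protecting shared data yourself (Counter and count_in_pool are hypothetical names for illustration), a threading.Lock can guard the read-modify-write:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """Shared state whose update is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run this read-modify-write.
        with self._lock:
            self.value += 1


def count_in_pool(workers=8, increments_per_worker=10_000):
    counter = Counter()

    def work():
        for _ in range(increments_per_worker):
            counter.increment()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work) for _ in range(workers)]
        for future in futures:
            future.result()  # surface any exception raised in a worker
    return counter.value


if __name__ == "__main__":
    print(count_in_pool())
```

The pool only distributes the calls; the lock is what makes the total come out to workers * increments_per_worker every time.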
concurrent.futures.ThreadPoolExecutor is a thread pool specialized to the concurrent.futures async task model. But all of the risks of traditional threading are still there.
If you are using the python async model, things that fiddle with shared data should be dispatched on the main thread. The thread pool should be used for autonomous events, especially those that wait on blocking I/O.
If so, how does this ThreadPoolExecutor internally work to ensure that critical data is accessed by only one thread at a time?
It doesn't, that's your job.
The high-level methods like map will use a safe work queue and not share work items between threads, but if you've got other resources which can be shared then the pool does not know or care, it's your problem as the developer.
The crawler I am working on makes requests using pycurl multi.
What kind of efficiency improvement can I expect if I switch to aiohttp?
Skepticism has me doubting the potential improvement since python has the GIL. Most of the time is spent waiting for the requests (network IO), so if I could do them in a truly parallel way and then process them as they come in, I could get a good speedup.
Has anyone been through this and can offer some insights?
Thanks
The global interpreter lock is a mutex that protects access to Python
objects, preventing multiple threads from executing Python bytecodes
at once.
This means that the GIL affects the performance of your multithreaded code. AsyncIO is more about handling requests concurrently than in parallel. With AsyncIO your code will be able to handle more requests even with a single-threaded loop because the network IO is going to be async. This means that while a coroutine fetches a network resource it will "pause" without locking the thread it's running on, allowing other coroutines to execute. The main idea with asyncIO is that even with a single thread you can keep your CPU performing calculations constantly instead of waiting for network IO.
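The "pause" behaviour can be sketched without any network access, assuming asyncio.sleep as a stand-in for an aiohttp request (fake_fetch, crawl and the example.com URLs are hypothetical, and this is not a benchmark against pycurl multi):

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical request: asyncio.sleep stands in for waiting on the network."""
    await asyncio.sleep(delay)  # suspends this coroutine, not the thread
    return f"response from {url}"


async def crawl(urls):
    start = time.perf_counter()
    # All fetches wait at the same time on a single thread.
    responses = await asyncio.gather(*(fake_fetch(url) for url in urls))
    return responses, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    responses, elapsed = asyncio.run(crawl(urls))
    print(len(responses), "responses in", round(elapsed, 2), "seconds")
```

Ten simulated requests of 0.1 seconds each should finish in roughly 0.1 seconds overall rather than 1 second, because the waits overlap.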
If you want to understand more about asyncIO, you need to understand the difference between concurrency and parallelism. This is an excellent Go talk about this subject, but the principles are the same.
So even though python has the GIL, performance with asyncIO will be far better than with traditional threads for this kind of network-bound work. Here are some benchmarks:
From the gevent docs:
The greenlets all run in the same OS thread and are scheduled cooperatively.
From asyncio docs:
This module provides infrastructure for writing single-threaded concurrent code using coroutines. asyncio does provide
Try as I might, I haven't come across any major Python libraries that implement multi-threaded or multi-process coroutines i.e. spreading coroutines across multiple threads so as to increase the number of I/O connections that can be made.
I understand coroutines essentially allow the main thread to pause executing one I/O bound task and move on to the next I/O bound task, forcing an interrupt only when one of these I/O operations finishes and requires handling. If that is the case, then distributing I/O tasks across several threads, each of which could be operating on different cores, should obviously increase the number of requests you could make.
Maybe I'm misunderstanding how coroutines work or are meant to work, so my question is in two parts:
Is it possible to even have a coroutine library that operates over multiple threads (possibly on different cores) or multiple processes?
If so, is there such a library?
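On the first part, a rough sketch suggests it is at least possible with the standard library alone, by giving each thread its own event loop. The names handle, handle_batch, run_batch_in_own_loop and run_sharded are hypothetical, and this is not an established library pattern; note that under the GIL the threads still do not execute Python bytecode in parallel, so the benefit for pure I/O waiting is unclear.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


async def handle(item):
    """Hypothetical I/O-bound coroutine; records which thread ran it."""
    await asyncio.sleep(0.01)
    return item, threading.current_thread().name


async def handle_batch(batch):
    return await asyncio.gather(*(handle(item) for item in batch))


def run_batch_in_own_loop(batch):
    # asyncio.run() creates a fresh event loop in the calling thread.
    return asyncio.run(handle_batch(batch))


def run_sharded(items, shards=2):
    # Split the items round-robin, one batch per thread.
    batches = [items[i::shards] for i in range(shards)]
    with ThreadPoolExecutor(max_workers=shards) as pool:
        return list(pool.map(run_batch_in_own_loop, batches))


if __name__ == "__main__":
    for batch in run_sharded(list(range(6))):
        print(batch)
```

Coroutines in different loops cannot await each other directly, so any coordination between shards would need thread-safe tools.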
I've been reading about asyncio module in python 3, and more broadly about coroutines in python, and I can't get what makes asyncio such a great tool.
I have the feeling that all you can do with coroutines, you can do better by using task queues based on the multiprocessing module (celery for example).
Are there use cases where coroutines are better than task queues?
Not a proper answer, but a list of hints that could not fit into a comment:
You are mentioning the multiprocessing module (and let's consider threading too). Suppose you have to handle hundreds of sockets: can you spawn hundreds of processes or threads?
Again, with threads and processes: how do you handle concurrent access to shared resources? What is the overhead of mechanisms like locking?
Frameworks like Celery also add significant overhead. Can you use it e.g. for handling every single request on a high-traffic web server? By the way, in that scenario, who is responsible for handling sockets and connections (Celery by its nature can't do that for you)?
Be sure to read the rationale behind asyncio. That rationale (among other things) mentions a system call: writev() -- isn't that much more efficient than multiple write()s?
Adding to the above answer:
If the task at hand is I/O bound and operates on shared data, coroutines and asyncio are probably the way to go.
If on the other hand, you have CPU-bound tasks where data is not shared, a multiprocessing system like Celery should be better.
If the task at hand is both CPU and I/O bound and sharing of data is not required, I would still use Celery. You can use async I/O from within Celery!
If you have a CPU bound task but with the need to share data, the only viable option as I see now is to save the shared data in a database. There have been recent attempts like pyparallel but they are still work in progress.