I have recently been working on a pet project in Python using Flask. It is a simple pastebin with server-side syntax highlighting via Pygments. Because highlighting is a costly task, I delegated it to a Celery task queue, and in the request handler I wait for it to finish. Needless to say, this does no more than shift the CPU load to another worker, because waiting for the result still ties up the connection to the webserver.
Despite my instincts telling me to avoid premature optimization like the plague, I still couldn't help myself from looking into async.
Async
If you have been following Python web development lately, you have surely seen that async is everywhere. What async does is bring back cooperative multitasking, meaning each "thread" decides when and where to yield to another. This non-preemptive process is more efficient than OS threads, but still has its drawbacks. At the moment there seem to be two major approaches:
event/callback style multitasking
coroutines
The first one provides concurrency through loosely coupled components executed in an event loop. Although this is safer with respect to race conditions and provides more consistency, it is considerably less intuitive and harder to code than preemptive multitasking.
The other one is a more traditional solution, closer to a threaded programming style, with the programmer only having to switch context manually. Although more prone to race conditions and deadlocks, it provides an easy drop-in solution.
Most async work at the moment is done on what are known as IO-bound tasks: tasks that block while waiting for input or output. This is usually accomplished through polling and timeout-based functions that can be called, and if they report that nothing is ready yet, context can be switched, as in the sketch below.
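As an illustration of that polling pattern, a rough sketch (the host and the yield point are placeholders):

import select
import socket

# Connect and send the request the blocking way first.
sock = socket.create_connection(("example.com", 80))
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
sock.setblocking(False)

response = b""
while True:
    # Wait up to 100 ms for the socket to become readable.
    readable, _, _ = select.select([sock], [], [], 0.1)
    if not readable:
        continue  # timeout hit: a scheduler would switch tasks here
    chunk = sock.recv(4096)
    if not chunk:  # connection closed, response complete
        break
    response += chunk
sock.close()
print(len(response))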
Despite the name, this could be applied to CPU-bound tasks too: the work can be delegated to another worker (thread, process, etc.) and then waited for without blocking. Ideally, these tasks would be written in an async-friendly manner, but realistically this would mean separating code into chunks small enough not to block, preferably without scattering context switches after every line of code. This is especially inconvenient for existing synchronous libraries.
Due to its convenience, I settled on using gevent for async work, and I'm wondering how CPU-bound tasks should be dealt with in an async environment (using futures, Celery, etc.?).
How do I use async execution models (gevent in this case) with traditional web frameworks such as Flask? What are some commonly agreed-upon solutions to these problems in Python (futures, task queues)?
EDIT: To be more specific: how do I use gevent with Flask, and how should CPU-bound tasks be dealt with in this context?
EDIT2: Considering that Python has the GIL, which prevents optimal execution of threaded code, the only option left is multiprocessing, in my case at least. This means either using concurrent.futures or some other external service that deals with the processing (which can even open the door to something language-agnostic). What would, in this case, be some popular or often-used solutions alongside gevent (i.e. Celery)? Best practices?
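To make this concrete, the pattern I have in mind looks roughly like this sketch using concurrent.futures (the route and helper names are made up):

from concurrent.futures import ProcessPoolExecutor
from flask import Flask, request
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

app = Flask(__name__)
executor = ProcessPoolExecutor(max_workers=2)  # separate processes sidestep the GIL

def render_paste(code, language):
    # CPU-bound highlighting runs in a worker process
    return highlight(code, get_lexer_by_name(language), HtmlFormatter())

@app.route("/paste", methods=["POST"])
def paste():
    future = executor.submit(render_paste,
                             request.form["code"],
                             request.form.get("language", "text"))
    # This blocks the handler until the worker finishes -- exactly the
    # wait I would like gevent to turn into a cooperative one.
    return future.result()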
It should be thread-safe to do something like the following to push CPU-intensive tasks into background threads:
from threading import Thread
from flask_mail import Message  # assumes the Flask-Mail extension is in use

def send_async_email(msg):
    # mail is the Flask-Mail instance created during app setup;
    # in a real app, mail.send also needs the Flask application context
    mail.send(msg)

def send_email(subject, sender, recipients, text_body, html_body):
    msg = Message(subject, sender=sender, recipients=recipients)
    msg.body = text_body
    msg.html = html_body
    # Fire and forget: the request handler returns without waiting
    thr = Thread(target=send_async_email, args=[msg])
    thr.start()
If you need something more complicated, then perhaps Flask-Celery or the multiprocessing library's Pool might be useful to you.
I'm not too familiar with gevent, but I can't imagine what more complexity you might need or why.
I mean, if you're after the efficiency of a major high-traffic website, then I'd recommend building C++ applications to do your CPU-intensive work and using Flask-Celery or Pool to run those processes. (This is what YouTube does when mixing C++ and Python.)
How about simply using a ThreadPool and a Queue? You can then process your work in a separate thread in a synchronous manner, and you won't have to worry about blocking at all. That said, Python is not suited for CPU-bound tasks in the first place, so you should also think about spawning subprocesses.
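A minimal sketch of that thread-plus-queue setup (handle_task stands in for the real work):

import queue
import threading

task_queue = queue.Queue()

def worker():
    while True:
        func, args = task_queue.get()
        try:
            func(*args)  # runs synchronously inside the worker thread
        finally:
            task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_task(data):
    pass  # placeholder for the real CPU-heavy work

# A request handler just enqueues and returns immediately:
task_queue.put((handle_task, ("some payload",)))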
Related
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes these messages, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed this process up a lot!
From what we have seen, our biggest bottleneck is the HTTP requests to send and receive messages from the AWS SQS API. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to scale this process using one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating different processes may still give some benefit, since the CPU will probably switch processes as one or another is stuck at an I/O operation. But still, that seems like a lot of process-management overhead and resources for operations that don't need to run in parallel, only concurrently.
Using Python's threading: since the GIL locks all threads to a single core, and threads have less overhead than processes, this seems like a good option. As one thread is stuck waiting for an HTTP response, the CPU can take up another thread, and so on. This would give us the concurrent execution we want. But my question is: how does Python's threading know that it can switch one thread for another? Does it know that some thread is currently in an I/O operation and can be swapped for another one? Will this approach maximize CPU usage and avoid busy waiting? Do I specifically have to give up control of the CPU inside a thread, or is this done automatically in Python?
Recently, I also read about a concept called green threads, using Eventlet in Python. From what I saw, they seem like a perfect match for my project. They have little overhead and don't create OS threads the way threading does. But will we have the same problems as with threading regarding CPU control? Does a green thread need to signal that another one may take over the CPU? I saw in some examples that Eventlet offers patched built-in libraries like urlopen, but not Requests.
The last option we considered was using Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased, but I understand that most of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave much like Eventlet.
So what do you think would be the best option here? Which library would allow me to maximize performance on a single-core machine while avoiding busy waits as much as possible?
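For reference, the Eventlet examples I saw look roughly like this (my own sketch; the URLs are placeholders):

import eventlet
eventlet.monkey_patch()  # make blocking stdlib I/O cooperative
import urllib.request

pool = eventlet.GreenPool(size=100)

def fetch(url):
    # While this green thread waits on the socket, the hub
    # switches to another one; no explicit yield is needed.
    return urllib.request.urlopen(url).read()

urls = ["https://example.com"] * 5
for body in pool.imap(fetch, urls):
    print(len(body))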
Being new to concurrency, I am confused about when to use the different Python concurrency libraries. To my understanding, multiprocessing, multithreading and asynchronous programming are all forms of concurrency, while multiprocessing belongs to the subset of concurrency called parallelism.
I searched around on the web about different ways to approach concurrency in Python, and I came across the multiprocessing library, concurrent.futures' ProcessPoolExecutor() and ThreadPoolExecutor(), and asyncio. What confuses me is the difference between these libraries, especially what the multiprocessing library does. Since it has methods like pool.apply_async, does it also do the job of asyncio? If so, why is it called multiprocessing when it achieves concurrency by a different method than asyncio (multiple processes vs. cooperative multitasking)?
There are several different libraries at play:
threading: interface to OS-level threads. Note that CPU-bound work is mostly serialized by the GIL, so don't expect threading to speed up calculations. Use it when you need to invoke blocking APIs in parallel, and when you require precise control over thread creation. Avoid creating too many threads (e.g. thousands), as they are not free. If possible, don't create threads yourself, use concurrent.futures instead.
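For example, a sketch of overlapping blocking calls with threads (the URLs are placeholders):

import threading
import urllib.request

def fetch(url):
    # The GIL is released while the thread waits on the socket,
    # so several downloads proceed at once.
    with urllib.request.urlopen(url) as resp:
        print(url, resp.status)

threads = [threading.Thread(target=fetch, args=(url,))
           for url in ["https://example.com", "https://example.org"]]
for t in threads:
    t.start()
for t in threads:
    t.join()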
multiprocessing: interface to spawning multiple Python processes with an API intentionally similar to threading. Multiple processes work in parallel, so you can actually speed up calculations using this method. The disadvantage is that you can't share in-memory data structures without using multiprocessing-specific tools.
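A minimal sketch of a parallel CPU-bound calculation with a process pool:

from multiprocessing import Pool

def square(n):
    # Runs in a separate process, so the GIL is not a bottleneck.
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]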
concurrent.futures: A modern interface to threading and multiprocessing, which provides convenient thread/process pools it calls executors. The pool's main entry point is the submit method which returns a handle that you can test for completion or wait for its result. Getting the result gives you the return value of the submitted function and correctly propagates raised exceptions (if any), which would be tedious to do with threading. concurrent.futures should be the tool of choice when considering thread or process based parallelism.
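For instance, a sketch showing submit, result retrieval, and exception propagation:

from concurrent.futures import ThreadPoolExecutor

def work(n):
    if n == 3:
        raise ValueError("bad input")  # propagated through the future
    return n * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(work, n) for n in range(5)]
    for future in futures:
        try:
            print(future.result())  # re-raises the worker's exception
        except ValueError as exc:
            print("task failed:", exc)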
asyncio: While the previous options are "async" in the sense that they provide non-blocking APIs (this is what methods like apply_async refer to), they are still relying on thread/process pools to do their magic, and cannot really do more things in parallel than they have workers in the pool. Asyncio is different: it uses a single thread of execution and async system calls across the board. It has no blocking calls at all, the only blocking part being the asyncio.run() entry point. Asyncio code is typically written using coroutines, which use await to suspend until something interesting happens. (Suspending is different than blocking in that it allows the event loop thread to continue to other things while you're waiting.) It has many advantages compared to thread-based solutions, such as being able to spawn thousands of cheap "tasks" without bogging down the system, and being able to cancel tasks or easily wait for multiple things at once. Asyncio should be the tool of choice for servers and for clients connecting to multiple servers.
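A minimal sketch of the coroutine style:

import asyncio

async def job(name, delay):
    await asyncio.sleep(delay)  # suspends; the loop runs other tasks
    return name

async def main():
    # Both jobs wait concurrently, so this takes about 1 second, not 2.
    print(await asyncio.gather(job("a", 1), job("b", 1)))

asyncio.run(main())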
When choosing between asyncio and multithreading/multiprocessing, consider the adage that "threading is for working in parallel, and async is for waiting in parallel".
Also note that asyncio can await functions executed in thread or process pools provided by concurrent.futures, so it can serve as glue between all those different models. This is part of the reason why asyncio is often used to build new library infrastructure.
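A sketch of that glue, awaiting a CPU-bound function running in a process pool:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The coroutine suspends while another process does the work.
        result = await loop.run_in_executor(pool, cpu_bound, 10_000_000)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())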
In the crawler I am working on, requests are made using pycurl's multi interface.
What kind of efficiency improvement can I expect if I switch to aiohttp?
Skepticism has me doubting the potential improvement, since Python has the GIL. Most of the time is spent waiting for the requests (network IO), so if I could do them in a truly parallel way and then process them as they come in, I could get a good speedup.
Has anyone been through this and can offer some insights?
Thanks
The global interpreter lock is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
This means the GIL affects the performance of your multithreaded code. By contrast, asyncio is more about handling concurrent requests than parallel ones. With asyncio your code will be able to handle more requests even with a single-threaded loop, because the network IO is asynchronous. While a coroutine is fetching a network resource, it "pauses" without locking the thread it runs on, allowing other coroutines to execute. The main idea with asyncio is that even with a single thread you can keep the CPU performing calculations constantly instead of waiting on network IO.
If you want to understand asyncio better, you need to understand the difference between concurrency and parallelism. There is an excellent Go talk about this subject, and the principles are the same.
So even though Python has the GIL, performance with asyncio will be far better than with traditional threads.
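To illustrate, a fetch-many sketch with aiohttp (the URLs are placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # All requests are in flight at once; the loop switches
        # coroutines whenever one is waiting on the network.
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    print([len(page) for page in pages])

asyncio.run(main(["https://example.com", "https://example.org"]))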
As I understand it, asynchronous networking frameworks/libraries like Twisted, Tornado, and asyncio provide asynchronous IO by implementing nonblocking sockets and an event loop. Gevent achieves essentially the same thing by monkey-patching the standard library, so explicit asynchronous programming via callbacks and coroutines is not required.
On the other hand, asynchronous task queues, like Celery, manage background tasks and distribute those tasks across multiple threads or machines. I do not fully understand this process but it involves message brokers, messages, and workers.
My questions:
Do asynchronous task queues require asynchronous IO? Are they in any way related? The two concepts seem similar, but the implementations at the application level are different. I would think that the only thing they have in common is the word "asynchronous", so perhaps that is throwing me off.
Can someone elaborate on how task queues work and the relationship between the message broker (why are they required?), the workers, and the messages (what are messages? bytes?).
Oh, and I'm not trying to solve any specific problems, I'm just trying to understand the ideas behind asynchronous task queues and asynchronous IO.
Asynchronous IO is a way to use sockets (or more generally file descriptors) without blocking. This term is specific to one process or even one thread. You can even imagine mixing threads with asynchronous calls. It would be completely fine, yet somewhat complicated.
Now, I have no idea what "asynchronous task queue" means. IMHO there's only a task queue; it's a data structure. You can access it in an asynchronous or synchronous way, and by "access" I mean push and pop calls. These can use the network internally.
So task queue is a data structure. (A)synchronous IO is a way to access it. That's everything there is to it.
The term asynchronous is heavily overused nowadays. The hype is real.
As for your second question:
A message is just a set of data, a sequence of bytes. It can be anything. Usually these are structured strings, like JSON.
Task == message. A different word is used to indicate the purpose of that data: to perform some task. For example, you would send the message {"task": "process_image"} and your consumer would fire the appropriate function.
Task queue Q is a just a queue (the data structure).
Producer P is a process/thread/class/function/thing that pushes messages to Q.
Consumer (or worker) C is a process/thread/class/function/thing that pops messages from Q and does some processing on it.
Message broker B is a process that redistributes messages. In this case a producer P sends a message to B (rather than directly to a queue), and B can then (for example) duplicate this message and send it to two different queues Q1 and Q2, so that two different workers C1 and C2 will get that message. Message brokers can also act as protocol translators, and can transform, aggregate, and do many other things with messages. Generally it's just a black box between producers and consumers.
As you can see there are no formal definitions of those things and you have to use a bit of intuition to fully understand them.
Asynchronous tasks, or Celery tasks, are just tasks that are executed asynchronously. In the particular case of Celery, tasks are executed by multiple workers, thereby leveraging the full benefits of threading and multiprocessing as well as distributed nodes. So in a way we could accomplish what Celery does ourselves using libraries like multiprocessing or multithreading, but the benefit of using Celery is that it handles all that complexity by itself.
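A minimal Celery setup looks roughly like this (the broker and backend URLs are assumptions; any supported broker works):

# tasks.py
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def process_image(image_id):
    # Executed by a worker started with: celery -A tasks worker
    return f"processed {image_id}"

# Caller side: enqueue a message and get a result handle back.
# result = process_image.delay(42)
# result.get(timeout=10)  # blocks until a worker has finished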
Asynchronous IO, however, is quite different from what multithreading or multiprocessing do. Async IO is suitable for tasks that are IO bound (not CPU bound). It executes multiple IO requests simultaneously using a single thread only. Gevent or asyncio (in the case of Python 3) help in accomplishing that.
Celery - ideal for tasks that don't need to be real-time
Multiprocessing - ideal for tasks that are CPU bound
Asyncio/Gevent - ideal for tasks that are IO bound
Multithreading - due to the inherent global interpreter lock in Python, not of much use for CPU-bound programs; for IO-bound programs, I believe asyncio is a better option
Tornado - a framework that performs IO requests asynchronously
Twisted - a networking framework that provides lots of features besides asynchronous IO
I've been reading about the asyncio module in Python 3, and more broadly about coroutines in Python, and I can't see what makes asyncio such a great tool.
I have the feeling that everything you can do with coroutines, you can do better by using task queues based on the multiprocessing module (Celery, for example).
Are there use cases where coroutines are better than task queues?
Not a proper answer, but a list of hints that could not fit into a comment:
You are mentioning the multiprocessing module (and let's consider threading too). Suppose you have to handle hundreds of sockets: can you spawn hundreds of processes or threads?
Again, with threads and processes: how do you handle concurrent access to shared resources? What is the overhead of mechanisms like locking?
Frameworks like Celery also add significant overhead. Can you use it, e.g., for handling every single request on a high-traffic web server? By the way, in that scenario, who is responsible for handling sockets and connections (Celery by its nature can't do that for you)?
Be sure to read the rationale behind asyncio. That rationale (among other things) mentions a system call: writev() -- isn't that much more efficient than multiple write()s?
Adding to the above answer:
If the task at hand is I/O bound and operates on shared data, coroutines and asyncio are probably the way to go.
If on the other hand, you have CPU-bound tasks where data is not shared, a multiprocessing system like Celery should be better.
If the task at hand is both CPU and I/O bound and sharing of data is not required, I would still use Celery. You can use async I/O from within Celery!
If you have a CPU-bound task with the need to share data, the only viable option I see right now is to save the shared data in a database. There have been recent attempts like pyparallel, but they are still a work in progress.