Concurrent processing of messages (Python)

Concurrent processing of messages (Python) - python

I have the following scenario:
There is one thread that manages long-polling HTTP connection (non-stop) from an API. When a new message arrives, it must be processed within the special process() method.
I just want to design it in a way that incoming messages will be processed concurrently, but there is another important point: in the end of each processing an answer should be passed to the outcoming queue, which is organized in a separated thread. From there the answers will be sent via HTTP.
Here is a scheme:
Let's consider that it can be 30-50 messages in a second, and procces method will work from 1 up to 10 seconds.
The question is: what library or framework can I use to implement this architecture?
As far as I have researched, Python Tornado have good benchmarks, but here I do not need a web framework, just a tool that can provide a concurrent running of message processors.

Your message rate is pretty low. So you may freely use "standard" tools like RabbitMQ/Redis, Celery ("Celery Project") and asyncio.
RabbitMQ/Redis with Celery - are great tools to implement queues and manage your tasks and processes.
Asyncio is faster than Tornado but it doesn't matter for your task. What is more important is that asyncio gives you all the benefits of modern async/await coroutine technique.

Related

Concurrency: multiprocessing, threading, greenthreads and asyncio

I'm currently working on Python project that receives a lot os AWS SQS messages (more than 1 million each day), process these messages, and send then to another SQS queue with additional data. Everything works fine, but now we need to speed up this process a lot!
From what we have seen, or biggest bottleneck is in regards to HTTP requests to send and receive messages from AWS SQS api. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to escalate this process by one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating different process may still give some benefit, since the CPU will probably change process as one or another is stuck at an I/O operation. But still, that seems a lot of overhead of process managing and resources for an operations that doesn't need to run in parallel, but concurrently.
Using Python's threading: since GIL locks all threads at a single core, and threads have less overhead than processes, this seems like a good option. As one thread is stuck waiting for an HTTP response, the CPU can take another thread to process, and so on. This would get us to our desired concurrent execution. But my question is how dos Python's threading know that it can switch some thread for another? Does it knows that some thread is currently on an I/O operation and that he can switch her for another one? Will this approach absolutely maximize CPU usage avoiding busy wait? Do I specifically has to give up control of a CPU inside a thread or is this automatically done in Python?
Recently, I also read about a concept called green-threads, using Eventlet on Python. From what I saw, they seem the perfect match for my project. The have little overhead and don't create OS threads like threading. But will we have the same problems as threading referring to CPU control? Does a green-thread needs to warn the CPU that it may take another one? I saw on some examples that Eventlet offers some built-in libraries like Urlopen, but no Requests.
The last option we considered was using Python's AsyncIo and async libraries such as Aiohttp. I have done some basic experimenting with AsyncIo and wasn't very pleased. But I can understand that most of it comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave something like Eventlet.
So what do you think would be the best option here? What library would allow me to maximize performance on a single core machine? Avoiding busy waits as much as possible?

Asynchronous task queues and asynchronous IO

As I understand, asynchronous networking frameworks/libraries like twisted, tornado, and asyncio provide asynchronous IO through implementing nonblocking sockets and an event loop. Gevent achieves essentially the same thing through monkey patching the standard library, so explicit asynchronous programming via callbacks and coroutines is not required.
On the other hand, asynchronous task queues, like Celery, manage background tasks and distribute those tasks across multiple threads or machines. I do not fully understand this process but it involves message brokers, messages, and workers.
My questions,
Do asynchronous task queues require asynchronous IO? Are they in any way related? The two concepts seem similar, but the implementations at the application level are different. I would think that the only thing they have in common is the word "asynchronous", so perhaps that is throwing me off.
Can someone elaborate on how task queues work and the relationship between the message broker (why are they required?), the workers, and the messages (what are messages? bytes?).
Oh, and I'm not trying to solve any specific problems, I'm just trying to understand the ideas behind asynchronous task queues and asynchronous IO.

Asynchronous IO is a way to use sockets (or more generally file descriptors) without blocking. This term is specific to one process or even one thread. You can even imagine mixing threads with asynchronous calls. It would be completely fine, yet somewhat complicated.
Now I have no idea what asynchronous task queue means. IMHO there's only a task queue, it's a data structure. You can access it in asynchronous or synchronous way. And by "access" I mean push and pop calls. These can use network internally.
So task queue is a data structure. (A)synchronous IO is a way to access it. That's everything there is to it.
The term asynchronous is havily overused nowadays. The hype is real.
As for your second question:
Message is just a set of data, a sequence of bytes. It can be anything. Usually these are some structured strings, like JSON.
Task == message. The different word is used to notify the purpose of that data: to perform some task. For example you would send a message {"task": "process_image"} and your consumer will fire an appropriate function.
Task queue Q is a just a queue (the data structure).
Producer P is a process/thread/class/function/thing that pushes messages to Q.
Consumer (or worker) C is a process/thread/class/function/thing that pops messages from Q and does some processing on it.
Message broker B is a process that redistributes messages. In this case a producer P sends a message to B (rather then directly to a queue) and then B can (for example) duplicate this message and send to 2 different queues Q1 and Q2 so that 2 different workers C1 and C2 will get that message. Message brokers can also act as protocol translators, can transform messages, aggregate them and do many many things. Generally it's just a blackbox between producers and consumers.
As you can see there are no formal definitions of those things and you have to use a bit of intuition to fully understand them.

Asynchronous tasks or celery tasks are just tasks that are executed asynchronously. In particular case of celery, tasks are executed by multiple workers thereby leveraging full benefits of threading, multiprocessing as well as distributed nodes. So in a way, we can easily accomplish what celery does by using libraries like multiprocessing or multithreading, but the benefit of using celery is it handles all that complexity by itself.
Now Asyncronous IO is quite different from multithreading or multiprocessing does. Aync IO is suitable for tasks that are IO bound (not CPU). It executes multiple IO request simultaneously using single thread only. Gevent or asyncio (in case of python 3) helps in accomplishing that.
Celery - ideal for tasks that don't need to be realtime
Multiprocessing - ideal for tasks that are CPU bound.
Asyncio/Gevent - ideal for tasks that are IO bound
Multithreading - Due to inherent Global Interpreter Locking in Python, not of much use in CPU bound programs. In case of IO bound programs, I believe asyncio is a better option
Tornado - A framework that performs IO request asynchronously.
Twisted - A networking framework that provides lots of features besides Asynchronous IO.

Parallelism in one web request

Our server has a lot if CPUs, and some web requests could be faster if request handlers would do some parallel processing.
Example: Some work needs to be done on N (about 1-20) pictures, to severe one web request.
Caching or doing the stuff before the request comes in is not possible.
What can be done to use several CPUs of the hardware:
threads: I don't like them
multiprocessing: Every request needs to start N processes. Many CPU cycles will be lost for starting a new process and importing libraries.
special (hand made) service, which has N processes ready for processing
cellery (rabbitMQ): I don't know how big the communication overhead is...
Other solution?
Platform: Django (Python)

Regarding your second and third alternatives: you do not need to start a new process for every request. This is what process pools are for. New processes are created when your app starts up. When you submit a request to the pool, it is automatically queued until a worker is available. The disadvantage is that requests are blocking- if no worker is available at the moment, your user will sit and wait.

You could use the standard library module asyncore.
This module provides the basic infrastructure for writing asynchronous socket service clients and servers.
There is an example for how to create a basic HTML client.
Then there's Twisted, it can do lots and lots of things, which is why it's somewhat daunting. Here is an example using its HTTP client.
Twisted "speaks HTTP", asyncore does not, you'll have to.
Other libraries:
Tornado's httpclient
asynchttp

How can I offer concurrency with Pika in long-working consumers?

Short version: How can I prevent blocking Pika in a Remote Procedure Call situation?
Long version:
None of the Pika examples demonstrate my use case.
I have a Tornado server which communicates with other processes/machines over AMQP (RabbitMQ, Pika). These other processes are not very well-defined, but they will, for the most part, be returning data (see the RPC example on RabbitMQ's website). Sometimes, a process might need to take an extremely long time to process a large amount of information, but it shouldn't completely block smaller requests from being taken by the process. Or maybe the remote server is blocking because it sent out a web request. Think of it like a web server, but using AMQP instead of HTTP.
Since Pika documentation claims that it's not thread-safe, I cannot pass the connection to multiple threads (or processes, for that matter). What I want to do is start a new process, and add a socket event (for the pipe to that program) to the Pika IOLoop, as I would be able to do with Tornado. The Pika IOLoop is much different from the Tornado IOLoop, and it doesn't seem to support adding multiple handlers; it seems to operate using one "poller" on one socket.
I'd like to avoid requiring the Tornado package for this package, because I would only be using the IOLoop. It's not out of the question, but I want to see what my other options are, or if there is a solution to my problem by somehow connecting multiple Pika IOLoops/Pollers. RabbitMQ's documentation says that workers can often be "scaled up" by adding more. I'd like to avoid creating a connection for every request that comes in (if they're coming in fast).

From what you described, I believe you unfortunately either need a different communication model or need multiple Pika IOLoops/Pollers/Redundant Connections.
It sounds like from documentation and from other sites that RPC in Pika is always a blocking statement and unable to be passed around between threads. See http://www.rabbitmq.com/tutorials/tutorial-six-python.html where the author points out that RPC in Pika is inherently blocking once you actually call the ioloop.
"When in doubt avoid RPC. If you can, you should use an asynchronous pipeline - instead of RPC-like blocking"
If you want to keep sending multiple RPC calls on the same connection before one completes, you'll need a different Asynchronous model. Multiple RPC calls on the same connection before completion isn't the usual implementation of the RPC model, though it's not technically forbidden ( http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.progcomm%2Fdoc%2Fprogcomc%2Frpc_mod.htm ). I don't think Pika operates with this model, though it does have asynchronous support via callbacks (not what you are looking for I think).
If you just want to easily be able to generate new connections on the fly you could use a thread or process wrapper on a connection, where you create and block on the RPC in the other context and push to a common Queue which the main thread can monitor. Tornado might give you this, but I agree that it's a bit of overkill, and making such a connection wrapper shouldn't be all that difficult as I've done something similar for other I/O ops in less than 100 lines of Python (see Queue package for Threaded wrapper version). I think you already saw this possibility though based on your talk of multiple IOLoops.

twisted.web2 and spawining threads for synchronous code?

So, I'm writing a python web application using the twisted web2 framework. There's a library that I need to use (SQLAlchemy, to be specific) that doesn't have asynchronous code. Would it be bad to spawn a thread to handle the request, fetch any data from the DB, and then return a response? I'm afraid that if there was a flood of requests, too many threads would be started and the server would be overwhelmed. Is there something built into twisted that prevents this from happening (eg request throttling)?

See the docs, and specifically the thread pool which lets you control how many threads are active at most. Spawning one new thread per request would definitely be an inferior idea!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.