The requirement is as follows:
There are N producers that generate messages (or jobs, or whatever you want to call them).
Messages from each producer must be processed in order, and each message must be processed exactly once.
There's one more restriction: at any time, for any given producer, there must be no more than one message being processed.
The consuming side consists of a number of threads (they are identical in their functionality) that are spread across a number of processes - it is a WSGI application run via mod_wsgi.
At the moment, queueing on the consuming side is implemented as a custom class that subclasses Queue, but it has its own problems that I won't get into, the main one being that its contents are lost when the process restarts.
Is there a product that makes it possible to fulfill the requirements I've outlined above? Support for persistence would be great, though it is not that important (since the queue will no longer reside in the worker process's memory).
There are many products that do what you are looking for. People with Django experience will probably tell you "celery", but that's not a complete answer. Celery is a (useful) wrapper around the actual queuing system, and using a wrapper doesn't mean you don't have to think about your underlying technology.
ZeroMQ, Redis, and RabbitMQ are a few different solutions that come to mind. There are of course more options. I'm fairly certain that no queueing solution will support your "at any time for any given producer there must be not more than one message that is being processed" requirement as a configuration parameter; you should probably implement this requirement at the producer (i.e. do not submit job #2 until you receive confirmation that job #1 has completed).
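For instance, a minimal sketch of such producer-side gating; ProducerGate is a hypothetical helper, and how the "job done" confirmation travels back to the producer (result store, reply queue, etc.) is left open:

    import threading

    class ProducerGate:
        """Allow at most one in-flight message per producer."""

        def __init__(self, publish):
            self._publish = publish       # callable that actually enqueues a job
            self._idle = threading.Event()
            self._idle.set()              # nothing in flight initially

        def submit(self, job):
            self._idle.wait()             # block until the previous job has completed
            self._idle.clear()
            self._publish(job)

        def mark_done(self):
            self._idle.set()              # call when the completion confirmation arrives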
Redis is not a real queueing system, but a very fast database with pub/sub features; you would not be able to use Redis pub/sub to satisfy the "job processed exactly once" requirement out of the box, although you could use Redis pub/sub to publish jobs to a single subscriber which then pushes them into the database as a list (a poor man's queue). Your consumers would then atomically pull a job from the list. It'll work if you want to go this route.
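A minimal sketch of the list half of that approach with redis-py (skipping the pub/sub relay; the key name is arbitrary):

    import json
    import redis

    r = redis.Redis()

    # Producer side: append a job to the list.
    r.lpush("jobs", json.dumps({"producer": "p1", "payload": "..."}))

    # Consumer side: block until a job exists, then pop it atomically,
    # so each job is handed to exactly one consumer.
    _key, raw = r.brpop("jobs")
    job = json.loads(raw)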
RabbitMQ is an "enterprise" queueing system, and would absolutely meet your requirements, but you'd have to deploy the RabbitMQ server somewhere, and it might be overkill. For the record, I use RabbitMQ on numerous projects, and it gets the job done. Set up a "direct"-type exchange, bind it to a single queue, and subscribe all your consumers to this queue. You get pretty good persistence from RabbitMQ too.
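A rough sketch of that topology with the pika client (exchange and queue names and the handler are placeholders):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # One direct exchange bound to a single durable queue that all consumers share.
    ch.exchange_declare(exchange="jobs", exchange_type="direct", durable=True)
    ch.queue_declare(queue="jobs", durable=True)
    ch.queue_bind(queue="jobs", exchange="jobs", routing_key="job")

    # Producer: mark the message persistent so it survives a broker restart.
    ch.basic_publish(exchange="jobs", routing_key="job", body=b"do stuff",
                     properties=pika.BasicProperties(delivery_mode=2))

    # Consumer: take one unacked message at a time and ack after the work is done.
    def handle(channel, method, properties, body):
        process(body)                     # placeholder for your handler
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_qos(prefetch_count=1)
    ch.basic_consume(queue="jobs", on_message_callback=handle)
    ch.start_consuming()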
ZeroMQ has a very very flexible queueing model, and ZeroMQ can absolutely be made to do what you want. ZeroMQ is basically just the transport mechanism though, so when it comes to making your publishers and subscribers and a broker to distribute them, you may end up rolling your own.
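If you do roll your own, ZeroMQ's PUSH/PULL socket pair already hands each message to exactly one connected worker; a bare-bones sketch with pyzmq (addresses are placeholders, and note there is no persistence here):

    import zmq

    ctx = zmq.Context.instance()

    # Distributor side: PUSH round-robins each message to exactly one worker.
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://*:5557")
    push.send_json({"producer": "p1", "job": 42})

    # Worker side (one per consumer process):
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://distributor-host:5557")
    job = pull.recv_json()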
I have one big task which consists of 200 sub-tasks (messages) that will be published onto a queue. If I want to cancel this one task, the 200 messages (or the ones that are left and not yet processed) should be deleted. Is there any way to delete these published messages from a queue?
One solution I can think of is to create a queue (Q) to which I publish the name of a new, dynamically created queue (X). Each consumer then connects to this new queue (X) and processes the 200 published messages. If I want to abort the entire task, I delete only that queue (X) from the publisher side. Is that a common approach?
I see a few issues with your suggested approach.
The first problem is due to RabbitMQ consumer prefetch, which is intended to improve performance by reducing the number of requests to the broker. If your consumers have retrieved a batch of tasks, they will process them all before asking for new ones, and only then will they notice that the queue was deleted. As a result, your cancellation request would not be handled properly most of the time. You could reduce the prefetch count to 1 to avoid this side effect, but that would increase the pressure on the network and reduce overall speed.
The second issue is that the AMQP protocol does not provide a mechanism for gracefully dealing with queue deletion. Your consumers would therefore need to handle queues disappearing, as they would otherwise crash. By doing so, you would lose visibility over bugs and issues: how would you distinguish a queue that was deliberately deleted from one that disappeared because of an actual failure?
What I would recommend in this case is marking every task with an identifier of its parent job. Each time a consumer picks up a new task, it checks whether the parent job is still valid or has been cancelled; in the latter case, it simply ignores the task and moves on to the next one. You need a supporting service for that, and a Redis instance, for example, should be more than enough.
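A minimal sketch of that check with Redis (the key naming and the function shapes are assumptions):

    import redis

    r = redis.Redis()

    def cancel_job(job_id):
        """Publisher side: abort a whole job."""
        r.set("job:%s:cancelled" % job_id, 1)

    def handle_task(job_id, payload):
        """Consumer side: called for every sub-task."""
        if r.exists("job:%s:cancelled" % job_id):
            return                      # parent job was cancelled, skip the task
        do_work(payload)                # placeholder for the real processing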
This mechanism would be way simpler and more robust. You can spin up as many consumers as you want without needing to orchestrate their connection to the right queue, and out-of-order or interleaved tasks would not be a problem.
Reading about consumers in the RabbitMQ docs here revealed that there are two possible ways for a consumer to get messages to process:
Storing messages in queues is useless unless applications can consume them. In the AMQP 0-9-1 Model, there are two ways for applications to do this:
Have messages delivered to them ("push API")
Fetch messages as needed ("pull API")
With the "push API", applications have to indicate interest in consuming messages from a particular queue. When they do so, we say that they register a consumer or, simply put, subscribe to a queue.
I was just wondering:
Which way do Celery workers work?
Is there a way to choose or change it?
I didn't find anything specific about this in the Celery docs.
Celery uses the push method, based on the fact that it registers consumers to queues, and that it maintains long-lived connections to the broker.
No, as far as I can tell the pull method was never really accommodated in Celery's design.
The RabbitMQ doc has been updated (after this question was asked) to note that the push method is the strongly recommended option, and that the pull/polling method is “highly inefficient and should be avoided in most cases”. In a related doc, it says:
Fetching messages one by one is highly discouraged as it is very inefficient compared to regular long-lived consumers. As with any polling-based algorithm, it will be extremely wasteful in systems where message publishing is sporadic and queues can stay empty for prolonged periods of time.
This claim about extreme wastefulness probably doesn't hold when the tasks/messages are not needed on a real-time basis, such that the queue can just be polled and drained at intervals on the order of hours or even less frequently, though a solution like Celery/RabbitMQ is probably overkill for such use cases in the first place.
I've taken a quick (and limited) tour of Celery's source code and I can say that a big part of its architecture seems to simply assume that the push method is what's being used. There are complex components like heartbeat mechanisms that make the system more robust in the context of long-running operation and inevitable network failures; these components simply are not needed in the pull/polling mode.
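For concreteness, outside of Celery the two styles look roughly like this with the pika client (queue name and handler are placeholders):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="tasks", durable=True)

    # Pull ("polling") API: ask the broker for a single message, which may not exist.
    method, properties, body = ch.basic_get(queue="tasks", auto_ack=False)
    if method is not None:
        handle(body)                                   # placeholder handler
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # Push API: register a long-lived consumer and let the broker deliver messages
    # as they arrive; this is essentially what a Celery worker does.
    def on_message(channel, method, properties, body):
        handle(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue="tasks", on_message_callback=on_message)
    ch.start_consuming()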
I've run into a problem wherein the RabbitMQ (RMQ) Web UI reports different numbers from the command-line tools. I may be misunderstanding the results, but I'll step through what I think is representative of the problem.
I'm using Celery as a task manager. When I submit a set of tasks, the queues fill up and the workers start churning out results. The RMQ UI shows a list of ready, unacked, etc. messages (tasks). Meanwhile, if I look at the response from the command-line tools, I can get a list of the active, reserved, etc. messages (tasks on the worker). There seems to be a mismatch between these two sets of numbers, especially when the UI claims there are no messages left (tasks to be run), i.e. the queues have been drained.
I know that the UI numbers and the command-line results represent different views: one looks at the broker, while the other looks at the workers. Still, my feeling is that when the broker says there are no remaining messages, the workers should not be working on anything either. Unfortunately, this is not the case; instead I see that the workers are still busy consuming messages and executing tasks.
I thought that if the workers ack'd on completion of the work, rather than when they receive the task, there should be at least some messages reported in the Web UI's unacked column.
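For context, acking on completion rather than on receipt corresponds (as far as I understand) to Celery's acks_late option; a minimal sketch of turning it on, assuming Celery 4+ setting names:

    from celery import Celery

    app = Celery("proj", broker="amqp://guest@localhost//")

    # Ack only after the task body finishes, instead of right after delivery (the default).
    app.conf.task_acks_late = True
    # Optionally keep each worker from prefetching a large batch of messages.
    app.conf.worker_prefetch_multiplier = 1

    @app.task(acks_late=True)   # can also be enabled per task
    def crunch(x):
        return x * 2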
Perhaps I'm being silly asking the question but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is why I need Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing or crashing workers (read up on auto-acks and message rejections), but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
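To make the "some work for failing workers" part concrete, here is a rough sketch at the AMQP level with pika (queue name and handler are placeholders); Celery wraps the same mechanism behind its own options:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="feeds", durable=True)

    def on_message(channel, method, properties, body):
        try:
            process_feed(body)          # placeholder for the real work
        except Exception:
            # Put the message back so another worker can retry it.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        else:
            channel.basic_ack(delivery_tag=method.delivery_tag)

    # auto_ack=False: if this worker crashes mid-task, the broker redelivers the message.
    ch.basic_consume(queue="feeds", on_message_callback=on_message, auto_ack=False)
    ch.start_consuming()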
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time, e.g.:

    from celery import Celery

    app = Celery("tasks")

    @app.task
    def my_task(x):
        ...

    my_task.delay(1)   # or my_task.apply_async((1,))

The .delay (or .apply_async) call publishes a message to the message broker you are using (RabbitMQ, Redis, ...). The message then gets routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two messages like that (two my_task.delay(1) calls) from running at the same time; that is something you need to ensure within the task itself.
You need something that all workers can access, of course (Memcached, Redis, ...), as they might be running on different machines.
The example mentioned is typically used for a different goal: it prevents you from processing different messages that have the same meaning (not the same message). E.g., I have two processes: the first puts some URLs into a queue, and the second takes URLs from the queue and fetches them. What happens if the first process puts the same URL into the queue twice (or even more times)?
P.S. For this purpose I use Redis and its SETNX operation (which sets a key only if it does not already exist).
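A sketch of that SETNX-based de-duplication with redis-py (the key prefix and expiry are assumptions):

    import redis

    r = redis.Redis()

    def enqueue_once(url, enqueue):
        """Publish a URL only if it has not been enqueued recently."""
        # SETNX succeeds only for the first process that claims this URL.
        if r.setnx("seen:" + url, 1):
            r.expire("seen:" + url, 3600)   # let the marker age out eventually
            enqueue(url)                    # 'enqueue' is whatever publishes the message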
I'm fairly new to Celery/AMQP and am trying to come up with a task/queue/worker design to meet the following requirements.
I have multiple types of "per-user" tasks: e.g., TaskA, TaskB, TaskC. Each of these "per-user" tasks read/write data for one particular user in the system. So at any given time, I might need to create tasks User1_TaskA, User1_TaskB, User1_TaskC, User2_TaskA, User2_TaskB, etc. I need to ensure that, for each user, no two tasks of any task type execute concurrently. I want a system in which no worker can execute User1_TaskA at the same time as any other worker is executing User1_TaskB or User1_TaskC, but while User1_TaskA is executing, other workers shouldn't be blocked from concurrently executing User2_TaskA, User3_TaskA, etc.
I realize this could be implemented using some sort of external locking mechanism (e.g., in the DB), but I'm hoping there's a more elegant task/queue/worker design that would work.
I suppose one possible solution is to implement the queues as user buckets: when the workers are launched, there's config that specifies how many buckets to create, and each "bucket worker" is bound to exactly one bucket. Then an "intermediate worker" would pull tasks off the main task queue and assign them to the bucketed queues via, say, a hash/mod scheme (as sketched below). So UserA's tasks would always end up in the same queue, and multiple tasks for UserA would back up behind each other. I don't love this approach, as it requires the number of buckets to be defined ahead of time and seems to prevent (easily) adding workers dynamically. It seems to me there's got to be a better way -- suggestions would be greatly appreciated.
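For illustration, the hash/mod routing I have in mind might look roughly like this (queue names and the crc32 bucketing are just one possible choice):

    import zlib
    from celery import Celery

    app = Celery("proj", broker="amqp://guest@localhost//")
    N_BUCKETS = 8   # has to be fixed ahead of time, which is the drawback noted above

    def bucket_queue(user_id):
        # Stable hash so a given user always maps to the same queue.
        return "bucket-%d" % (zlib.crc32(str(user_id).encode()) % N_BUCKETS)

    @app.task
    def task_a(user_id):
        ...   # per-user work

    # All of a user's tasks land in the same bucket queue; run one single-threaded
    # worker per bucket, e.g.: celery -A proj worker -Q bucket-3 --concurrency=1
    task_a.apply_async(args=("user-42",), queue=bucket_queue("user-42"))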
What's so bad about using an external locking mechanism? It's simple, straightforward, and efficient enough. You can find an example of distributed task locking in Celery here. Extend it by creating a lock per user, and you're done!
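A sketch of the per-user variant with a Redis lock (lock naming, timeout, and retry policy are assumptions; the same idea works with the memcached-based lock from the linked example):

    import redis
    from celery import Celery

    app = Celery("proj", broker="amqp://guest@localhost//")
    r = redis.Redis()

    @app.task(bind=True, max_retries=None)
    def task_a(self, user_id):
        # One lock per user covers TaskA/TaskB/TaskC alike.
        lock = r.lock("user-lock:%s" % user_id, timeout=600)
        if not lock.acquire(blocking=False):
            # Another task for this user is running; try again shortly.
            raise self.retry(countdown=5)
        try:
            do_user_work(user_id)       # placeholder for the real work
        finally:
            lock.release()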