I've run into a problem wherein the RabbitMQ (RMQ) Web UI reports different numbers than the command-line tools. I may be misunderstanding the results, but I'll step through what I think is representative of the problem.
I'm using Celery as a task manager. When I submit a set of tasks, the queues fill up and the workers start churning out results. The RMQ UI shows counts of ready, unacked, etc. messages (tasks on the broker). Meanwhile, the command-line tools report the active, reserved, etc. messages (tasks on the workers). There seems to be a mismatch between these two sets of numbers, especially when the UI claims there are no messages left (no tasks to be run), i.e. the queues have been drained.
I know the UI numbers and the command-line results represent different views: one looks at the broker, while the other looks at the workers. Still, my feeling is that when the broker says there are no remaining messages, the workers should not be working on anything either. Unfortunately, this is not the case; instead I see that the workers are still busy consuming messages and executing tasks.
I thought that if the workers ack'd on completion of the work, rather than when they receive the task, there should be at least some messages reported in the Web UI's unacked column.
Related
I have 1 big task which consists of 200 sub-tasks (messages) which will be published onto a queue. If I want to cancel this 1 task, the 200 messages (or the ones that are left and not yet processed) should be deleted. Is there any way to delete these published messages from a queue?
One solution I could think of is to create a queue (Q) to which I publish the name of a new queue (X). Each consumer then connects to this dynamically created queue (X) and processes the 200 published messages. If I want to abort the entire task, I delete only that queue (X) from the publisher side. Is that a common approach?
I see a few issues with your suggested approach.
The first problem is due to RMQ consumer prefetch, which is intended to improve performance by reducing the number of requests to the broker. If your consumers have retrieved a batch of tasks, they will process them all before asking for new ones; only then will they realize the queue was deleted. Therefore, your cancellation request would not be handled promptly most of the time. You could reduce the prefetch count to 1 to avoid this side effect, but this would increase the pressure on the network and reduce overall speed.
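For reference, a minimal sketch of lowering the prefetch count with the pika client (the connection parameters are illustrative assumptions):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
# Deliver at most one unacknowledged message to each consumer at a time.
channel.basic_qos(prefetch_count=1)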
The second issue is that the AMQP protocol does not provide a mechanism for gracefully dealing with queue deletion. Your consumers would therefore need to handle queues disappearing carefully, as they would otherwise crash. By doing so, you would lose visibility into bugs and issues: how would you distinguish a queue that was explicitly deleted from one that disappeared because of an actual crash?
What I would recommend in this case is marking all your tasks with an identifier of their parent job. Each time a consumer starts consuming a new task, it would check whether the parent job is still valid or has been cancelled. In the latter case, it would simply ignore the task and move on to the next one. You need a supporting service for that; a Redis instance, for example, should be more than enough.
This mechanism would be far simpler and more robust. You can spin up as many consumers as you want without having to orchestrate their connections to the right queue. Out-of-order or interleaved tasks would not be a problem either.
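A minimal sketch of that check, assuming a Redis set named cancelled_jobs and a job_id field on each task (both names are illustrative):

import redis

r = redis.Redis()  # assumed shared Redis instance

def consume(task):
    # Skip tasks whose parent job has been cancelled.
    if r.sismember("cancelled_jobs", task["job_id"]):
        return
    process(task)  # hypothetical task-processing function

# Cancelling a job is then a single call from the publisher side:
# r.sadd("cancelled_jobs", job_id)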
Reading about consumers in the RabbitMQ docs revealed that there are two possible ways a consumer gets messages to process:
Storing messages in queues is useless unless applications can consume them. In the AMQP 0-9-1 Model, there are two ways for applications to do this:
Have messages delivered to them ("push API")
Fetch messages as needed ("pull API")
With the "push API", applications have to indicate interest in consuming messages from a particular queue. When they do so, we say that they register a consumer or, simply put, subscribe to a queue.
I was just wondering:
Which way do Celery workers work?
Is there a way to choose/change this behavior?
I didn't find anything specific about this in the Celery docs.
Celery uses the push method, based on the fact that it registers consumers to queues, and that it maintains long-lived connections to the broker.
No, as far as I can tell the pull method was never really accommodated in Celery's design.
The RabbitMQ doc has been updated (after this question was asked) to note that the push method is the strongly recommended option, and that the pull/polling method is “highly inefficient and should be avoided in most cases”. In a related doc, it says:
Fetching messages one by one is highly discouraged as it is very inefficient compared to regular long-lived consumers. As with any polling-based algorithm, it will be extremely wasteful in systems where message publishing is sporadic and queues can stay empty for prolonged periods of time.
This claim about extreme wastefulness probably doesn't hold when the tasks/messages are not needed on a real-time basis, such that the queue can just be polled and drained at intervals on the order of hours or even less frequently, though a solution like Celery/RabbitMQ is probably overkill for such use cases in the first place.
I've taken a quick (and limited) tour of Celery's source code and I can say that a big part of its architecture seems to simply assume that the push method is what's being used. There are complex components like heartbeat mechanisms that make the system more robust in the context of long-running operation and inevitable network failures; these components simply are not needed in the pull/polling mode.
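To make the distinction concrete, here is a minimal sketch of both styles using pika (the queue name and connection are illustrative assumptions; Celery itself uses the push style through its own connection machinery):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Pull ("polling") style: explicitly fetch a single message, if any.
method, properties, body = channel.basic_get(queue="tasks", auto_ack=True)
# method is None when the queue is empty.

# Push style (what Celery uses): register a consumer and let the broker
# deliver messages as they arrive over the long-lived connection.
def handle(ch, method, properties, body):
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="tasks", on_message_callback=handle)
channel.start_consuming()  # blocks and dispatches deliveries to handle()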
Perhaps I'm being silly asking the question but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers (read up on auto-acks and message rejections), but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: Memcached) as an alternative way to synchronize. And I can think of a few problems with that which wouldn't exist in a proper MQ setup.
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message are two different things.
You can ensure that a message will be processed by only one worker at a time.
E.g.:
@task(...)
def my_task(x):
    ...

my_task.apply_async(args=(1,))
The .apply_async call publishes a message to the message broker you are using (RabbitMQ, Redis, ...).
Then the message will get routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two tasks like that (my_task.apply_async(args=(1,))) from running at the same time; this is something you need to ensure within the task itself.
You need something which you can access from all workers, of course (Memcached, Redis, ...), as they might be running on different machines. A sketch of such a lock follows.
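This is a minimal sketch of an in-task lock using Redis (the key name, expiry, and do_work are illustrative assumptions; the cookbook example linked above uses Memcached, but the idea is the same):

import redis

r = redis.Redis()  # assumed shared Redis instance reachable from all workers

@task(...)
def my_task(x):
    lock_key = "lock:my_task:%s" % x
    # set(nx=True) succeeds only if the key does not exist yet; ex=600
    # makes the lock expire so a crashed worker can't hold it forever.
    if not r.set(lock_key, "locked", nx=True, ex=600):
        return  # another worker is already processing this argument
    try:
        do_work(x)  # hypothetical body of the task
    finally:
        r.delete(lock_key)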
The mentioned example is typically used for a different goal: it prevents you from processing different messages with the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs into a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL into the queue twice (or even more times)?
P.S. For this purpose I use Redis and its SETNX operation (which sets a key only if it does not already exist).
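A minimal sketch of that publish-side deduplication with redis-py (the key prefix and expiry window are illustrative assumptions):

import redis

r = redis.Redis()

def publish_url(url):
    # SETNX semantics: only the first attempt to set the key succeeds,
    # so the same URL is enqueued at most once per expiry window.
    if r.set("seen:" + url, 1, nx=True, ex=3600):
        my_task.apply_async(args=(url,))  # enqueue only on first sighting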
The requirement is as follows:
There are N producers that generate messages or jobs or whatever you want to call them.
Messages from each producer must be processed in order, and each message must be processed exactly once.
There's one more restriction: at any time for any given producer there must be not more than one message that is being processed.
The consuming side consists of a number of threads (they are identical in their functionality) that are spread across a number of processes - it is a WSGI application run via mod_wsgi.
At the moment, the queueing on the consuming side is implemented as a custom queue that subclasses Queue, but it has its own problems that I won't get into, the main one being that the queue contents are lost when the process restarts.
Is there a product that will make it possible to fulfill the requirements I've outlined above? Support for persistence would be great, though it is not that important (since the queue will no longer reside in the worker process's memory).
There are many products that do what you are looking for. People with Django experience will probably tell you "celery", but that's not a complete answer. Celery is a (useful) wrapper around the actual queuing system, and using a wrapper doesn't mean you don't have to think about your underlying technology.
ZeroMQ, Redis, and RabbitMQ are a few different solutions that come to mind. There are of course more options. I'm fairly certain that no queueing solution will support your "at any time for any given producer there must be not more than one message that is being processed" requirement as a configuration parameter; you should probably implement this requirement at the producer (i.e. do not submit job #2 until you receive confirmation that job #1 has completed).
Redis is not a real queueing system, but a very fast database with pub/sub features; you would not be able to use Redis pub/sub to satisfy the "job processed exactly once" requirement out of the box, although you could use Redis pub/sub to publish jobs to a single subscriber which then pushes them into the database as a list (a poor man's queue). Your consumers would then atomically pull a job from the list. It'll work if you want to go this route.
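A minimal sketch of that list-based approach with redis-py (the list name and handler are illustrative assumptions):

import redis

r = redis.Redis()

def submit_job(payload):
    r.lpush("jobs", payload)  # producer pushes onto the head of the list

def worker_loop():
    while True:
        # BRPOP blocks until a job is available and pops it atomically,
        # so each job is handed to exactly one consumer.
        _key, payload = r.brpop("jobs")
        process(payload)  # hypothetical job handler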
RabbitMQ is an "enterprise" queueing system, and would absolutely meet your requirements, but you'd have to deploy the RabbitMQ server somewhere, and it might be overkill. For the record, I use RabbitMQ on numerous projects, and it gets the job done. Set up a "direct"-type exchange, bind it to a single queue, and subscribe all your consumers to this queue. You get pretty good persistence from RabbitMQ too.
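For reference, a minimal pika sketch of that setup (the exchange, queue, and routing-key names are assumptions):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# One direct exchange bound to one durable queue; all consumers share it.
channel.exchange_declare(exchange="jobs", exchange_type="direct")
channel.queue_declare(queue="jobs", durable=True)
channel.queue_bind(queue="jobs", exchange="jobs", routing_key="jobs")

def handle(ch, method, properties, body):
    process(body)  # hypothetical job handler
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # at most one unacked job per consumer
channel.basic_consume(queue="jobs", on_message_callback=handle)
channel.start_consuming()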
ZeroMQ has a very very flexible queueing model, and ZeroMQ can absolutely be made to do what you want. ZeroMQ is basically just the transport mechanism though, so when it comes to making your publishers and subscribers and a broker to distribute them, you may end up rolling your own.
I need to handle a large (time- and memory-consuming) process asynchronously in a web2py application, called inside a controller method.
My specific use case is to call a process via stdlib.subprocess and wait for it to exit without blocking the web server, but I am open to alternative methods.
Hands-on examples would be a plus.
3rd-party library recommendations are welcome.
CRON scheduling is not required/wanted.
Assuming you'll need to start multiple, possibly simultaneous, instances of the background task, the solution is a task queue. I've heard good things about Celery and RabbitMQ if you're looking for 3rd-party options, and web2py includes its own task queue system that might be sufficient for your needs.
With either tool, you'll define a function that encapsulates the operation you want the background process to perform. Then bring the task queue workers online. The web2py manual and forums indicate this can be done with an @reboot statement in the web2py cron system, which is triggered whenever the web server starts. There are probably other ways to start the workers if this is unsatisfactory.
In your controller you'll insert a task into the task queue, passing any necessary parameters as inputs to the function (the background function will not run in the same environment as the controller, so it won't have access to the session, DB, etc. unless you explicitly pass the appropriate values into the task function).
Now, to get the output of the background operation back to the user: when you insert a task into the task queue, you should get back a unique ID for it. You would then implement controller logic (either something that expects an AJAX call, or a page that keeps refreshing until the task completes) that calls the task queue's API to check the status of the specified task. If the task's status is "finished", return the data to the user. If not, keep waiting.
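A minimal sketch using web2py's built-in scheduler (the task function, command, and status handling are illustrative assumptions based on the scheduler's documented API):

# in a model file: define the background function and the scheduler
from gluon.scheduler import Scheduler

def fetch_report(url):  # hypothetical long-running job
    import subprocess
    return subprocess.check_output(["report-tool", url])  # illustrative command

scheduler = Scheduler(db, dict(fetch_report=fetch_report))

# in a controller: enqueue the task, then poll for its status
def start():
    row = scheduler.queue_task("fetch_report", pvars=dict(url=request.vars.url))
    return dict(task_id=row.id)

def check():
    status = scheduler.task_status(int(request.vars.task_id), output=True)
    if status and status.scheduler_task.status == "COMPLETED":
        return dict(done=True, result=status.result)
    return dict(done=False)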
Maybe review the book section on running tasks in the background. You can use the new scheduler or create a homemade queue (email example). There's also a web2py-celery plugin, though I'm not sure what state that is in.
This is more difficult than one might expect. Note the deadlock warnings in the stdlib subprocess documentation. It's easy if you don't mind blocking: use Popen.communicate. To work around the blocking, you can manage the process using subprocess from a separate thread.
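A minimal sketch of that thread-based workaround (the command and callback are illustrative assumptions):

import subprocess
import threading

def run_in_background(cmd, on_done):
    def target():
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()  # blocks, but only inside this thread
        on_done(proc.returncode, out, err)
    threading.Thread(target=target).start()  # returns immediately

# usage: the web server keeps serving requests while the process runs
run_in_background(["sleep", "5"], lambda code, out, err: None)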
My favorite way to deal with subprocesses is to use Twisted's spawnProcess. But, it is not easy to get Twisted to play nicely with other frameworks.