I have 1 big task which consists of 200 sub-tasks (messages) that will be published onto a queue. If I want to cancel this one task, the 200 messages (or the ones that are left and not yet processed) should be deleted. Is there any way to delete these published messages from a queue?
One solution I could think of is to create a queue (Q) where I publish the name of a new queue (X). Each consumer then connects to this new dynamically created queue (X) and processes the 200 published messages. If I want to abort the entire task, I delete only that queue (X) from the publisher side. Is that a common approach?
I see a few issues with your suggested approach.
The first problem is due to the RabbitMQ consumer prefetch, which is intended to improve performance by reducing the number of requests to the broker. If your consumers have retrieved a batch of tasks, they will process them all before they ask for new ones, and only then will they realize the queue was cancelled. Therefore, your cancellation request would not be handled properly most of the time. You could reduce the prefetch count to 1 to avoid this side effect, but this would increase the pressure on the network and reduce overall speed.
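(For reference, this is how the prefetch window would be lowered with a raw pika client; the connection details are placeholders, and as said above it trades throughput for responsiveness.)

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# With prefetch_count=1 the broker hands out the next message only after the
# previous one was acknowledged, so fewer tasks sit in the consumer's buffer.
channel.basic_qos(prefetch_count=1)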
The second issue is that the AMQP protocol does not provide a mechanism for gracefully dealing with queue deletion. Your consumers would therefore need to handle queues disappearing carefully, as they would otherwise crash. By doing so, you would lose visibility over bugs and issues: how could you distinguish a queue that was explicitly deleted from one that disappeared because something actually crashed?
What I would recommend in this case is marking all your tasks with an identifier of their parent job. Each time a consumer starts consuming a new task, it checks whether the parent job is still valid or has been cancelled. In the latter case, it simply ignores the task and moves on to the next one. You need a supporting service for that; a Redis instance, for example, should be more than enough.
This mechanism is far simpler and more robust. You can spin up as many consumers as you want without having to orchestrate their connection to the right queue, and out-of-order or interleaved tasks are not a problem either.
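For concreteness, here is a rough sketch of that check, assuming pika on the consumer side and Redis for the cancellation flag; the key names and the JSON message format (with a job_id field) are made up for illustration:

import json
import pika
import redis

r = redis.Redis()

def cancel_job(job_id):
    # Publisher side: mark the whole parent job as cancelled.
    r.set(f'job:{job_id}:cancelled', 1)

def on_message(ch, method, properties, body):
    task = json.loads(body)
    if r.get(f'job:{task["job_id"]}:cancelled'):
        ch.basic_ack(delivery_tag=method.delivery_tag)  # drop silently
        return
    # ... process the sub-task ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_consume(queue='tasks', on_message_callback=on_message)
channel.start_consuming()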
Related
Suppose all my tasks on a Celery queue are hitting a 3rd-party API. However, the API has a rate limit, which I am keeping track of (there is a daily limit and an hourly limit that I need to respect). As soon as I hit the rate limit, I want to pause consumption of new tasks, and then resume when I know I am good.
I achieved this by using the following two tasks:
@celery.task()
def cancel_api_queue(minutes_to_resume):
    resume_api_queue.apply_async(countdown=minutes_to_resume * 60, queue='celery')
    celery.control.cancel_consumer('third_party', reply=True)

@celery.task(default_retry_delay=300, max_retries=5)
def resume_api_queue():
    celery.control.add_consumer('third_party', destination=['y@local'])
Then I can keep submitting my 3rd party API tasks, and as soon as my consumer is added back, all my tasks get consumed. Great.
However, since I have no consumer on this queue, this seems to mean I can no longer see the jobs that are being submitted in Flower (until my consumer gets added back).
Is there something I am doing wrong? Can I achieve this 'pause' another way to allow me to continue to see submitted jobs in flower?
p.s. maybe this is related to this issue, but not 100% sure: https://github.com/celery/celery/issues/1452
I am using amqp broker if that makes a difference.
thanks girls and boys.
I'd suspect that peeking into the contents of queued messages before a worker picks them up is not really part of Flower's intended design. Therefore, if you stop consuming tasks from a queue, the best Flower can do is show you how many of them have been enqueued, as a single number on the "Broker" pane.
One hackish way to observe the internals of the incoming tasks could be to add an intermediate dummy "forwarding" task, which simply forwards the message from one queue (let us call it query_inbox) to another (say, query_processing).
E.g. something like:
@celery.task(queue='query_inbox')
def query(params):
    process_query.delay(params)

@celery.task(queue='query_processing')
def process_query(params):
    ...  # do rate-limited stuff
Now you may stop consuming tasks from query_processing, but you will still be able to observe their parameters as they flow through the query_inbox worker.
I have read a lot about Python threading and the various means of 'talking' across thread boundaries. My case seems a little different, so I would like advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action - meaningful to the receiving thread. There are many different combinations of actions and recipients, so using an Event for every combination seems unwieldy.
The Queue object seems to be the recommended way to achieve this. However, if I have one shared queue and post an item addressed to a single recipient thread, then every other thread has to monitor the queue, pull every item, check whether it is addressed to it, and put it back if it was addressed to another thread. That seems like a lot of getting and putting of items for nothing.
Alternatively, I could employ a 'router' thread: one queue shared by all, plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items on the shared queue; the router pulls every item, inspects it and puts it on the addressee's queue. Still, a lot of putting and getting items from queues...
Are there any other ways to achieve what I need to do? It seems a pub-sub class is the right approach, but to my knowledge there is no such thread-safe module in the standard library.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach. Just remove 'identical' and 'shared' from the above statement, i.e.:
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell'
another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another Celery task from within the calling task. Each task can have its own separate queue.
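A minimal sketch of that idea, assuming a Celery app; the app name, broker URL, queue names and task bodies are only illustrative:

from celery import Celery

app = Celery('workers', broker='amqp://localhost')

@app.task(queue='thread_a')
def periodic_check():
    # ... autonomous periodic work ...
    # On some condition, tell the "thread B" worker to perform an action:
    do_action.delay('resync')

@app.task(queue='thread_b')
def do_action(action):
    # Only the worker consuming the 'thread_b' queue ever sees this message.
    print('performing', action)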
Thanks for the response. After some thought, I have decided to use the approach of many queues and a router thread (hub-and-spoke). Every 'normal' thread has its own private queue to the router, giving separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and used by the 'normal' threads as a send-only channel, i.e. they only post items to this queue and only the router pulls items from it. Additionally, each 'normal' thread uses its own queue as a receive-only channel, on which it listens and which is shared only with the router. Threads register themselves with the router on the router's queue/channel; the router maintains a list of registered threads, including their queues, so after registration it can send an item to a specific thread.
This means that peer to peer communication is not possible, all communication is sent via the router.
There are several reasons I did it this way:
1. There is no logic in the threads for checking whether an item is addressed to 'me', which keeps the code simpler, and there is no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their own queue; when a message arrives, the thread can be sure it is addressed to it. The same goes for the router itself.
2. The router can act as a message bus, do vocabulary translation, and can address messages to external programs or hosts.
3. Threads don't need to know anything about other threads' capabilities, i.e. they just speak the language of the router. In a peer-to-peer world, all peers must be able to understand each other, and since my threads are of many different classes, I would have to teach each class every other class's vocabulary.
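A condensed sketch of this hub-and-spoke layout, using only the standard library; the message tuple format and the registration handshake are invented for illustration:

import threading
import queue

router_q = queue.Queue()            # shared send-only channel to the router

class Worker(threading.Thread):
    def __init__(self, name):
        super().__init__(name=name)
        self.inbox = queue.Queue()  # private receive-only channel
        router_q.put(('register', name, self.inbox))

    def tell(self, recipient, action):
        # Ask the router to forward an action to another thread.
        router_q.put(('send', recipient, self.name, action))

    def run(self):
        while True:
            sender, action = self.inbox.get()
            if action == 'stop':
                break
            # ... perform the requested action ...

def router():
    registry = {}                   # thread name -> its private inbox queue
    while True:
        msg = router_q.get()
        if msg[0] == 'register':
            registry[msg[1]] = msg[2]
        else:                       # ('send', recipient, sender, action)
            _, recipient, sender, action = msg
            registry[recipient].put((sender, action))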
Hope this helps someone some day when faced with a similar challenge.
Perhaps I'm being silly asking the question but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it is being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that would achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers, read up on auto-acks and message rejections, but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
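To illustrate the auto-ack point, a rough pika sketch with manual acknowledgements, so a message held by a crashed worker is redelivered rather than lost; the queue name and the processing function are made up:

import pika

def process_feed(body):
    ...  # parse and process the RSS feed URL (placeholder)

def on_message(ch, method, properties, body):
    try:
        process_feed(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject and requeue so another worker can pick the feed up again.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)   # hand out one unacked message at a time
channel.basic_consume(queue='feeds', on_message_callback=on_message, auto_ack=False)
channel.start_consuming()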
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
eg.
@task()
def my_task(param):
    ...  # task body

my_task.delay(1)
Calling .delay (or .apply_async) publishes a message to the message broker you are using (RabbitMQ, Redis, ...).
Then the message will get routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two messages like that (my_task.delay(1)) from running at the same time; this is something you need to ensure within the task itself.
You need something that is accessible from all workers, of course (Memcached, Redis, ...), as they might be running on different machines.
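For illustration, a hedged sketch of that kind of lock using Redis (the cookbook uses Memcached); the key prefix, timeout and task body are assumptions:

import redis
from celery import Celery

app = Celery('tasks', broker='amqp://localhost')
r = redis.Redis()

@app.task
def process_feed(feed_url):
    # set(..., nx=True) succeeds only if the key does not exist yet, so two
    # workers handling the same feed_url race for a single lock.
    if not r.set(f'lock:{feed_url}', 1, nx=True, ex=600):
        return  # another worker is already processing this feed
    try:
        ...  # fetch and process the feed
    finally:
        r.delete(f'lock:{feed_url}')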
The mentioned example is typically used for a different goal: it prevents you from processing different messages with the same meaning (not the same message). E.g. I have two processes: the first one puts some URLs onto a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL onto the queue twice (or even more times)?
P.S. For this purpose I use Redis storage and the SETNX operation (which sets a key only if it does not already exist).
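A minimal sketch of that SETNX-based de-duplication with redis-py; the key prefix and the expiry are assumptions:

import redis

r = redis.Redis()

def enqueue_url(url):
    # SETNX sets the key only if it does not already exist; a second attempt
    # for the same URL returns False, so the duplicate is never enqueued.
    if r.setnx(f'queued:{url}', 1):
        r.expire(f'queued:{url}', 3600)  # safety expiry in case a worker dies
        # ... publish url to the queue here ...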
The requirement is as follows:
There are N producers, that generate messages or jobs or whatever you want to call it.
Messages from each producer must be processed in order, and each message must be processed exactly once.
There's one more restriction: at any time, for any given producer, there must be no more than one message being processed.
The consuming side consists of a number of threads (they are identical in their functionality) that are spread across a number of processes - it is a WSGI application run via mod_wsgi.
At the moment, the queueing on the consuming side is implemented as a custom queue, that subclasses Queue, but it has its own problems that I won't get into, the main one being that upon process restart its queue is lost.
Is there a product that makes it possible to fulfill the requirements I've outlined above? Support for persistence would be great, though that is not so important (since the queue will no longer reside in the worker process's memory).
There are many products that do what you are looking for. People with Django experience will probably tell you "celery", but that's not a complete answer. Celery is a (useful) wrapper around the actual queuing system, and using a wrapper doesn't mean you don't have to think about your underlying technology.
ZeroMQ, Redis, and RabbitMQ are a few different solutions that come to mind. There are of course more options. I'm fairly certain that no queueing solution will support your "at any time for any given producer there must be not more than one message that is being processed" requirement as a configuration parameter; you should probably implement this requirement at the producer (i.e. do not submit job #2 until you receive confirmation that job #1 has completed).
Redis is not a real queueing system, but a very fast database with pub/sub features; you would not be able to use Redis pub/sub to satisfy the "job processed exactly once" requirement out of the box, although you could use Redis pub/sub to publish jobs to a single subscriber which then pushes them into the database as a list (a poor man's queue). Your consumers would then atomically pull a job from the list. It'll work if you want to go this route.
RabbitMQ is an "enterprise" queueing system, and would absolutely meet your requirements, but you'd have to deploy the RabbitMQ server somewhere, and it might be overkill. For the record, I use RabbitMQ on numerous projects, and it gets the job done. Set up a "direct"-type exchange, bind it to a single queue, and subscribe all your consumers to this queue. You get pretty good persistence from RabbitMQ too.
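A rough pika sketch of that setup (not Celery-specific): a direct exchange bound to a single durable queue that all consumers share; the exchange, queue and routing-key names and the persistent-delivery flag are only illustrative.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='jobs', exchange_type='direct', durable=True)
channel.queue_declare(queue='jobs', durable=True)
channel.queue_bind(queue='jobs', exchange='jobs', routing_key='jobs')

# Producer side: delivery_mode=2 marks the message persistent so it survives
# a broker restart (together with the durable queue declared above).
channel.basic_publish(
    exchange='jobs',
    routing_key='jobs',
    body=b'job payload',
    properties=pika.BasicProperties(delivery_mode=2),
)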
ZeroMQ has a very very flexible queueing model, and ZeroMQ can absolutely be made to do what you want. ZeroMQ is basically just the transport mechanism though, so when it comes to making your publishers and subscribers and a broker to distribute them, you may end up rolling your own.
I have a queue, e.g.:
online_queue = self._channel.queue_declare(
    durable=True,
    queue='online'
)
At the moment, I need to flush all of the content in this queue.
However, another process may be publishing to this queue at the same time.
If I use channel.queue_purge(queue='online'), what will happen to messages published while queue_purge is still working?
Depending on your ultimate goal, you might be able to solve this issue by using a temporary queue.
To make things clearer, let's give things some names. Call your current queue (the one you want to purge) Queue A, and assume it is bound 1-to-1 to Exchange A.
If you create a new queue (Queue B) and bind it to Exchange A in the same way that Queue A is bound, Queue B will now get all of the messages (from the time of binding) that Queue A gets.
You can now safely purge Queue A without losing any of the messages that were sent in after Queue B was bound.
Re-bind Queue A to Exchange A and you are back up and running.
You can then deal with the "interim" messages in Queue B however you might need to.
This has the advantage of very well-defined behavior and doesn't get you into any race conditions, because you can completely blow Queue A away and re-create it instead of purging it.
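A hedged sketch of the bind-then-purge dance with pika; the exchange, queue and routing-key names follow the explanation above and are otherwise assumed:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# 1. Bind Queue B to Exchange A with the same routing key as Queue A, so it
#    starts receiving copies of everything Queue A receives from now on.
channel.queue_declare(queue='queue_b', durable=True)
channel.queue_bind(queue='queue_b', exchange='exchange_a', routing_key='online')

# 2. Queue A can now be purged without losing newly published messages,
#    because Queue B holds copies of them.
channel.queue_purge(queue='queue_a')

# 3. Deal with whatever landed in Queue B in the meantime, then carry on
#    consuming from Queue A as before.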
You're describing a race condition. Some messages might remain in the queue and others might get purged. Or all of them will get purged. Or none of them.
There's just no way to tell, because it's a time-dependent situation. You should re-examine your need to purge a queue that is still active, or build a more robust consumer that can live with the fact that there may be messages in the queue it is connecting to (which is basically what consumers have to live with anyway).