So far I have used a single channel for a single queue in RabbitMQ. But now I have multiple queues created dynamically, so do I have to create a new channel for each queue, or can one channel receive/send messages from/to different queues?
# consuming
for itm in items:
    channel1 = rconn.channel()
    channel1.queue_declare(queue=itm)
    channel1.basic_consume(some_callback, queue=itm, no_ack=True)
    channel1.start_consuming()  # note: this call blocks, so only the first queue is ever consumed

# publishing
for itm in items:
    # ....
    channel1.basic_publish(exchange="", routing_key=itm, body="fdsfds")
I've had weird issues when I tried to reuse a channel, so I'd go with multiple channels. One per each type of producer/consumer is what I ended up using, IIRC.
You do not need to have one queue per channel. You can both declare and consume from multiple queues on the same channel. See this question for more info.
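For illustration, here is a minimal pika sketch of that idea (the connection parameters, queue names, and callback are placeholders, not part of the question's code): a single channel declares and consumes from several queues at once.

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()

def on_message(ch, method, properties, body):
    print(method.routing_key, body)

for queue_name in ("alpha", "beta", "gamma"):
    channel.queue_declare(queue=queue_name)
    channel.basic_consume(queue=queue_name,
                          on_message_callback=on_message,
                          auto_ack=True)

channel.start_consuming()  # one blocking loop serves all three queues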
In many client libraries, the queue declaration "RPC" operations should not be mixed with the consume "streaming" operations. In such cases, it's better to have two channels: one for any number of RPC things like queue declarations, deletions, binding creation, etc., and one for any number of consumes.
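If your client does need that separation, the two-channel layout might look roughly like this (a pika-style sketch; the queue name and callback are assumptions):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
rpc_channel = conn.channel()      # declarations, bindings, deletions
consume_channel = conn.channel()  # long-lived consumes only

rpc_channel.queue_declare(queue="jobs")
consume_channel.basic_consume(queue="jobs",
                              on_message_callback=lambda ch, m, p, b: None,
                              auto_ack=True)
consume_channel.start_consuming()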
I think the official Python driver handles this correctly and does not require more than one channel for both.
To (very roughly and nondeterministically) test this, start a publisher somewhere sending a steady stream of messages to a queue, and create a consumer on that queue that consumes messages while repeatedly declaring other queues. If everything works well for a while, your client is fine mixing RPC and streaming operations. Of course, the client's documentation on the subject is a better authority than this test.
I have one big task that consists of 200 sub-tasks (messages) which are published onto a queue. If I want to cancel this one task, the 200 messages (or the ones that are left and not yet processed) should be deleted. Is there any way to delete these published messages from a queue?
One solution I could think of is to create a queue (Q) to which I publish the name of a new, dynamically created queue (X). Each consumer then connects to this queue (X) and processes the 200 published messages. If I want to abort the entire task, I delete only that queue (X) from the publisher side. Is that a common approach?
I see a few issues with your suggested approach.
The first problem is due to RabbitMQ's consumer prefetch, which is intended to improve performance by reducing the number of requests to the broker. If your consumers have retrieved a batch of tasks, they will process them all before asking for new ones; only then will they realize the queue was cancelled. Therefore, your cancellation request would not be handled properly most of the time. You could reduce the prefetch count to 1 to avoid this side effect, but this would increase the pressure on the network and reduce overall speed.
The second issue is that the AMQP protocol does not provide a mechanism for gracefully dealing with queue deletion. Your consumers would therefore need to handle queues disappearing out from under them, as they would otherwise crash. In doing so, you would lose visibility into bugs and issues: how would you distinguish a queue that was deliberately deleted from one that vanished because of a failure?
What I would recommend in this case is marking all your tasks with an identifier of their parent job. Each time a consumer starts a new task, it checks whether the parent job is still valid or has been cancelled; in the latter case, it simply ignores the task and moves on to the next one. You need a supporting service for that; a Redis instance, for example, should be more than enough.
This mechanism is far simpler and more robust. You can spin up as many consumers as you want without having to orchestrate their connection to the right queue. Out-of-order or interleaved tasks would not be a problem either.
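A rough sketch of this cancellation check using redis-py (the key scheme and function names are mine, purely illustrative):

import redis

r = redis.Redis()

def cancel_job(job_id):
    # Publisher side: flag the parent job as cancelled.
    r.set(f"job:{job_id}:cancelled", 1)

def handle_task(job_id, payload):
    # Consumer side: check the flag before doing any work.
    if r.get(f"job:{job_id}:cancelled"):
        return  # ack and skip; the parent job was aborted
    do_work(payload)  # stand-in for your actual task logic

def do_work(payload):
    pass  # placeholder

Note that the consumer still acks the skipped message normally, so the queue drains quickly even after a cancellation.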
Reading about consumers in the RabbitMQ docs here revealed that there are two possible ways a consumer gets messages to process:
Storing messages in queues is useless unless applications can consume them. In the AMQP 0-9-1 Model, there are two ways for applications to do this:
Have messages delivered to them ("push API")
Fetch messages as needed ("pull API")
With the "push API", applications have to indicate interest in consuming messages from a particular queue. When they do so, we say that they register a consumer or, simply put, subscribe to a queue.
I was just wondering:
Which way do Celery workers work?
Is there a way to choose/change the way?
I didn't find anything specific about this in the Celery docs.
Celery uses the push method, based on the fact that it registers consumers to queues, and that it maintains long-lived connections to the broker.
No, as far as I can tell the pull method was never really accommodated in Celery's design.
The RabbitMQ doc has been updated (after this question was asked) to note that the push method is the strongly recommended option, and that the pull/polling method is “highly inefficient and should be avoided in most cases”. In a related doc, it says:
Fetching messages one by one is highly discouraged as it is very inefficient compared to regular long-lived consumers. As with any polling-based algorithm, it will be extremely wasteful in systems where message publishing is sporadic and queues can stay empty for prolonged periods of time.
This claim about extreme wastefulness probably doesn't hold when the tasks/messages are not needed on a real-time basis, such that the queue can just be polled and drained at intervals on the order of hours or even less frequently, though a solution like Celery/RabbitMQ is probably overkill for such use cases in the first place.
I've taken a quick (and limited) tour of Celery's source code and I can say that a big part of its architecture seems to simply assume that the push method is what's being used. There are complex components like heartbeat mechanisms that make the system more robust in the context of long-running operation and inevitable network failures; these components simply are not needed in the pull/polling mode.
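For context, in raw AMQP terms the two modes look roughly like this with pika (a sketch, not Celery's actual internals; the queue name is a placeholder):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="tasks")

# Pull ("polling") API: ask the broker for one message, right now.
method, properties, body = channel.basic_get(queue="tasks", auto_ack=True)
if method is None:
    print("queue was empty")

# Push API: register a long-lived consumer; the broker delivers messages as they arrive.
channel.basic_consume(queue="tasks",
                      on_message_callback=lambda ch, m, p, b: print(b),
                      auto_ack=True)
channel.start_consuming()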
I have a Python messaging application that uses ZMQ. Each object has a PUB and a SUB socket, and they connect to each other. In some particular cases I want to wait for a particular message on the SUB socket, leaving the ones I am not interested in for later processing.
Right now, I am receiving all messages and putting those I am not interested in into a Python Queue until I find the one I am waiting for. But this means that each processing routine needs to check the Python Queue for old messages first. Is there a better way?
A ZMQ publisher doesn't do any queueing of its own: it drops messages when there is no SUB connected to receive them.
The better way in your situation would be to create a generic SUB that subscribes only to the messages it is interested in. That way you can spin up all of the different SUBs (even within one thread, using a zmq poller) and they will all process messages as they come from the PUB.
This is what the PUB/SUB pattern is primarily for: subscribers only subscribe to messages of interest, which eliminates the need to cycle through a queue of messages on every loop looking for the ones you care about.
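A minimal pyzmq sketch of that setup (the endpoint and topic prefixes are assumptions): two SUB sockets with different subscription filters, multiplexed in one thread with a poller.

import zmq

ctx = zmq.Context()

# Each SUB filters on a topic prefix, so unwanted messages never arrive at all.
sub_a = ctx.socket(zmq.SUB)
sub_a.connect("tcp://localhost:5556")
sub_a.setsockopt_string(zmq.SUBSCRIBE, "status.")

sub_b = ctx.socket(zmq.SUB)
sub_b.connect("tcp://localhost:5556")
sub_b.setsockopt_string(zmq.SUBSCRIBE, "orders.")

poller = zmq.Poller()
poller.register(sub_a, zmq.POLLIN)
poller.register(sub_b, zmq.POLLIN)

while True:
    for sock, _event in poller.poll():
        print(sock.recv_string())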
Perhaps I'm being silly asking this question, but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is that why do I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue not be consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can do to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers (read up on auto-acks and message rejections), but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so as far as I can tell they're using the lock mechanism (read: Memcached) to synchronize as an alternative, and I can think of a few problems with that which wouldn't exist in a proper MQ setup.
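To illustrate the auto-ack point (a pika sketch; the queue name and processing function are placeholders): with manual acknowledgements, a message goes back on the queue only if the worker dies before acking it, so each message is still handled by one worker at a time.

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="feeds")

def process_feed(body):
    pass  # placeholder for your actual work

def on_message(ch, method, properties, body):
    process_feed(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

channel.basic_consume(queue="feeds", on_message_callback=on_message)  # auto_ack defaults to False
channel.start_consuming()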
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
E.g.:

@task(...)
def my_task(x):
    ...

my_task.delay(1)

Calling .delay() (or .apply_async()) publishes a message to the message broker you are using (RabbitMQ, Redis, ...).
The message then gets routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two invocations like that (my_task.delay(1)) from running at the same time; this is something you need to ensure within the task itself.
Of course you need something that all workers can access (Memcached, Redis, ...), as they might be running on different machines.
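A rough sketch of that cookbook-style lock using redis-py (the key scheme, timeout, and function names are mine, for illustration only):

import redis

r = redis.Redis()

def my_task(feed_url):
    lock_key = f"lock:{feed_url}"
    # NX: acquire only if nobody holds the lock; EX: expire it as a crash safety net.
    if not r.set(lock_key, 1, nx=True, ex=300):
        return  # another worker is already processing this feed
    try:
        fetch_and_parse(feed_url)  # stand-in for the real work
    finally:
        r.delete(lock_key)

def fetch_and_parse(feed_url):
    pass  # placeholder

The expiry matters: without it, a worker that crashes while holding the lock would block that feed forever.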
The example mentioned above is typically used for a different goal: it prevents you from working on different messages with the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs on a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL on the queue twice (or even more times)?
P.S. For this purpose I use Redis and its SETNX operation (which sets a key only if it does not already exist).
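A tiny redis-py sketch of that deduplication idea (the key scheme is a made-up example; publish stands in for whatever actually enqueues the URL):

import redis

r = redis.Redis()

def enqueue_once(url, publish):
    # SETNX succeeds only for the first caller; duplicates are silently dropped.
    if r.setnx(f"seen:{url}", 1):
        publish(url)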
I'm getting into the whole AMQP thing and I have a question regarding which exchange type to use under the following scenarios:
1) I need to create a worker pool where each worker does something when it receives a message. I want different workers attached to different types of tasks, which I can specify by using the routing keys of each message in a topic fashion. On the consumer end, playing around a bit with kombu, I notice that if I specify the same queue name but with different routing keys, I cannot 'filter' the messages. E.g., if I have one consumer with '#' and another with 'foo.#', both using the same queue name, the two consumers work round robin on the queue. Is this expected? I am running both consumers on the same machine.
2) Given that, I construct unique queue names for each consumer, and this time each consumer gets only what I asked for with the routing key. However, because they are distinct queues, a task may arrive at more than one consumer. E.g., if consumer 1 has key '#' and consumer 2 has 'foo.#', then when consumer 2 receives (and acks) a message, consumer 1 also gets the same message. This is not what I want; I would like only one consumer to get each message. Is there a way I can achieve this without writing a 'task manager'?
cheers,
For most people it is best to just use a topic exchange for everything until you fully understand how AMQP works. You can get fanout and direct behavior just by choosing the right binding key for a queue. For instance, if you use "#" as a binding key, the queue receives every message, as if it were bound to a fanout exchange; if you use an exact binding key with no wildcards, the queue behaves as if it were bound to a direct exchange. And if you bind two or more queues with the same binding key, each of those queues gets its own copy of matching messages, fanout-style.
The round-robin behavior is expected: both consumers are subscribed to the exact same queue. The differing binding keys just confuse things; bindings accumulate on a queue, so the queue receives messages matching any of its binding keys, and all consumers on it share those messages round robin. Best not to do that. I built a system in which several queues each had anywhere from 4 to 15 instances of the exact same worker code pulling messages off the same queue and then collecting data from web services. I even had the workers running on different CPUs, although in the end that was not necessary for performance.
I'm not sure why you are using wildcards in the binding keys. If you have 8 consumers named A through H, and each one does a different job, then why not publish messages with routing keys work.A through work.H and use the same binding keys work.A through work.H? That way, if you have multiple instances of worker B, they all consume from the same work.B queue and no message is delivered twice.
Also, if you don't ack a message after handling it, it will eventually go back on the queue (when your channel closes) and be delivered again. Hopefully you are acking after successfully handling each message. No task manager is needed, just a better understanding of all the AMQP knobs.
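To make that scheme concrete, here is a minimal pika sketch (the exchange name, queue names, and message body are illustrative, not from the question): one topic exchange, one queue per worker type, and any number of worker instances sharing each queue.

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="work", exchange_type="topic")

# One queue per worker type; every instance of that worker consumes the same queue.
for name in ("work.A", "work.B"):
    ch.queue_declare(queue=name)
    ch.queue_bind(queue=name, exchange="work", routing_key=name)

# The publisher's routing key picks exactly one worker queue; instances of
# that worker then share the queue's messages round robin.
ch.basic_publish(exchange="work", routing_key="work.B", body=b"do stuff")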