I have read lots about python threading and the various means to 'talk' across thread boundaries. My case seems a little different, so I would like to get advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action -, meaningful to the receiving thread. There are many different combinations of actions and recipients, so using Events for every combination seems unwieldly. The queue object seems to be the recommended way to achieve this. However, if I have a shared queue and post an item on the queue having just one recipient thread, then every other thread needs monitor the queue, pull every item, check if it is addressed to it, and put it back in the queue if it was addressed to another thread. That seems a lot of getting and putting items from the queue for nothing. Alternatively, I could employ a 'router' thread: one shared-by-all queue plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items in the shared queue, the router pulls every item, inspects it and puts it on the addressee's queue. Still, a lot of putting and getting items from queues....
Are there any other ways to achieve what I need to do ? It seems a pub-sub class is the right approach, but there is no such thread-safe module in standard python, at least to my knowledge.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach to do this. Just remove identical and shared from the above statement. i.e.
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell'
another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another celery task from within the calling task. All the tasks can have separate queues.
Thanks for the response. After some thoughts, I have decided to use the approach of many queues and a router-thread (hub-and-spoke). Every 'normal' thread has its private queue to the router, enabling separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and used by 'normal' threads as a send-only-channel, ie they only post items to this queue, and only the router listens to it, ie pulls items. Additionally, each 'normal' thread uses its own queue as a 'receive-only-channel' on which it listens and which is shared only with the router. Threads register themselves with the router on the router queue/channel, the router maintains a list of registered threads including their queues, so it can send an item to a specific thread after its registration.
This means that peer to peer communication is not possible, all communication is sent via the router.
There are several reasons I did it this way:
1. There is no logic in the thread for checking if an item is addressed to 'me', making the code simpler and no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their queue, when a message arrives the thread can be sure that the message is addressed to it, including the router itself.
2. The router can act as a message bus, do vocabulary translation and has the possibility to address messages to external programs or hosts.
3. Threads don't need to know anything about other threads capabilities, ie they just speak the language of the router. In a peer-to-peer world, all peers must be able to understand each other, and since my threads are of many different classes, I would have to teach each class all other classes' vocabulary.
Hope this helps someone some day when faced with a similar challenge.
Related
I have 1 big task which consists out of 200 sub-tasks (messages) which will be published onto a queue. If I want to cancel this 1 task, the 200 messages (or the ones that are left and not processed yet) should be deleted. Is there any way to delete these published messages in a queue?
One solution I could think of is to create a queue (Q) which where I publish the name of a new queue (X). Each consumer connects then to this new dynamically created queue (X) and process the 200 published messages. If I want to abort the entire task I delete only that queue (X) from the publisher side. Is that a common approach?
I see few issues with your suggested approach.
The first problem is due to RMQ consumer prefetch which is intended to improve performance by reducing the amount of requests to the broker. If your consumers have retrieved a batch of tasks they will process them all before they ask for new ones, only then they will realize the queue was cancelled. Therefore, your cancellation request would not be handled properly most of the times. You could reduce the prefetch count to 1 to avoid this side effect but this would increase the pressure over the network and reduce overall speed.
The second issue is that the AMQP protocol does not provide mechanisms for gracefully dealing with queue deletion. Therefore your consumers would need to carefully deal with queues disappearing as they would otherwise crash. By doing so, you would loose visibility over bugs and issues. How can you distinguish when a queue was explicitly deleted from a case where it actually crashed?
What I would recommend in this case is marking all your tasks with an identifier of their parent job. Each time a consumer starts consuming a new task, it would check if the parent job is valid or has been cancelled. In the latter case, it would simply ignore the task and move to the next one. You need a supporting service for that. A Redis instance should be more then enough for example.
This mechanism would be way simpler and robust. You can spin as many consumers as you want without the need of orchestrating their connection to the right queue. Also out-of-order or interleaved tasks would not be a problem.
I'm trying to stay connected to multiple queues in RabbitMQ. Each time I pop a new message from one of these queue, I'd like to spawn an external process.
This process will take some time to process the message, and I don't want to start processing another message from that specific queue until the one I popped earlier is completed. If possible, I wouldn't want to keep a process/thread around just to wait on the external process to complete and ack the server. Ideally, I would like to ack in this external process, maybe passing some identifier so that it can connect to RabbitMQ and ack the message.
Is it possible to design this system with RabbitMQ? I'm using Python and Pika, if this is relevant to the answer.
Thanks!
RabbitMQ can do this.
You only want to read from the queue when you're ready - so spin up a thread that can spawn the external process and watch it, then fetch the next message from the queue when the process is done. You can then have mulitiple threads running in parallel to manage multiple queues.
I'm not sure what you want an ack for? Are you trying to stop RabbitMQ from adding new elements to that queue if it gets too full (because its elements are being processed too slowly/not at all)? There might be a way to do this when you add messages to the queues - before adding an item, check to make sure that the number of messages already in that queue is not "much greater than" the average across all queues?
Perhaps I'm being silly asking the question but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcahced-based solution for locking the feed when it's being processed.
But what I'm trying to understand is that why do I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue not be consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can do to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers, read up on auto-acks and message rejections, but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
As noted by others you are mixing apples and oranges.
Being a celery task and a MQ message.
You can ensure that a message will be processed by only one worker at the same time.
eg.
#task(...)
def my_task(
my_task.apply(1)
the .apply publishes a message to the message broker you are using (rabbit, redis...).
Then the message will get routed to a queue and consumed by one worker at time. you dont need locking for this, you have it for free :)
The example on the celery cookbook shows how to prevent two messages like that (my_task.apply(1)) from running at the same time, this is something you need to ensure within the task itself.
You need something which you can access from all workers of course (memcached, redis ...) as they might be running on different machines.
Mentioned example typically used for other goal: it prevents you from working with different messages with the same meaning (not the same message). Eg, I have two processes: first one puts to queue some URLs, and second one - takes URL from queue and fetch them. What will be if first process puts to queue one URL twice (or even more times)?
P.S. I use for this purpose Redis storage and setnx operation (which can set key only once).
I have some queue, for etc:
online_queue = self._channel.queue_declare(
durable = True,
queue = 'online'
)
At the moment, I need to flush all content in this queue.
But, at this moment, another process, probably, may publish to this queue.
If I use channel.queue_purge(queue='online'), what will happened with messages, published, while queue_purge still working?
Depending on your ultimate goal, you might be able to solve this issue by using a temporary queue.
To make things more clear, lets give things some names. Call your current queue (the one you want to purge) Queue A, and assume it is 1-1 bound to Exchange A.
If you create a new queue (Queue B) and bind it to Exchange A in the same way that Queue A is bound, Queue B will now get all of the messages (from the time of binding) that Queue A gets.
You can now safely purge Queue A without loosing any of the messages that got sent in after Queue B was bound.
Re-bind Queue A to Exchange A and you are back up and running.
You can then deal with the "interim" messages in Queue B however you might need to.
This has the advantage of having a very well defined behavior and doesn't get you into any race conditions because you can completely blow Queue A away and re-create it instead of purging.
You're describing a race condition. Some might remain in the queue and some others might get purged. Or all of them will get purged. Or none of them will get purged.
There's just no way to tell, because it's a time-dependent situation. You should re-examine your need to purge a queue which is still active, or build a more robust consumer that can live with the fact that there might be messages in the queue it is connecting to (which is basically what consumers have to live with, anyway).
I'm getting into the whole amqp thing and i have a question regarding which type of exchange type to use under the following scenario's:
1) i have the need to create a worker pool where each worker does something when they receive a message. now i want different workers attached to different types of tasks; which i can specify by using the routing keys of each message in a topic fashion. on the consumer end, playing around a bit with kombu i notice that if i specify the same queue name but with different routing keys i can not 'filter' the messages. eg if i have one consumer with '#' and another with 'foo.#' - both using the same queue name, the latter consumer will work round robin on the queue with the former consumer. is this expected? i am running both consumers on the same machine.
2) so given that, i construct unique queue names for each consumer and this time, each consumer does only get what i ask for with the routing key. however, because they are distinct queues, i may get a task in more than just one consumer. eg if consumer 1 has key '#', and consumer 2 has 'foo.#'; when consumer 2 receives (and acks) a message, consumer 1 also gets the same message. this is not what i want; i would like only one consumer to get the message only. is there a way i can achieve this without writing a 'task manager'?
cheers,
For most people it is best to just use a topic exchange for everything until you fully understand how AMQP works. You can get fanout and direct behavior just by choosing the right binding key for a queue. For instance if you use "#" for a binding key, then that queue behaves as if it was connected to a direct exchange. And if you bind two or more queues to the same routing keys, then those queues function as it if was a fanout exchange.
The round robin behavior is expected. Both tasks are subscribed to the exact same queue. The fact that the binding keys are different just confuses everything. Probably whoever binds last, will set the binding key for every queue user. Best not to do that. I build a system in which several queues have anywhere from 4 to 15 instances of the exact same worker code, pulling messages off the same queue and then collecting data from web services. I have even had the workers running on different CPUs although in the end that was not necessary for performance.
I'm not sure why you are using wildcards in the binding keys. If you have 8 consumers named A through H, and each one does a different job, then why not publish messages with routing keys work.A through work.H and then use the same binding keys work.A through work.H. That way if you have multiple instances of worker B, they all bind to work.B and no message is delivered twice.
Also, if you don't ack a message after handling it, then eventually it will go back on the queue and be delivered again. Hopefully you are acking after successfully handling the message. No task manager is needed, just better understanding of all the AMQP knobs.