RQ worker with multiple connections - Python

I have 3 servers on the same network. On each of those servers, a Redis service and some sort of producer are running. The producer enqueues jobs to a local rq queue named tasks.
So each server has its own tasks queue.
There is also one more server that runs an rq worker. Is it possible to have that worker check the tasks queue on each of the 3 servers?
I have tried creating a list of connections
import redis
from rq import Queue, Worker, push_connection
# urls = [url1, url2, url3]
connections = list(map(redis.from_url, urls))
which I then use to create a list of queues.
queues = list(map(lambda c: Queue('tasks', connection=c), connections))
Afterwards I push all the connections:
for connection in connections:
    push_connection(connection)
and pass the queues to Worker
Worker(queues=queues).work()
This results in the worker only listening on tasks on whatever connection was pushed last.
I've been looking into the rq source code and I think I could write a custom worker class that does this, but before I do that I wanted to ask if there's another way. Maybe even another queueing framework entirely?

Okay, I solved the problem. I'm still unsure whether I have permission to post the actual source code here, so I will outline my solution.
I had to override register_birth(self), register_death(self), and dequeue_job_and_maintain_ttl(self, timeout). The original implementations of these functions can be found here.
register_birth
Basically, you have to iterate over all connections, push_connection(connection), complete the registration process, and pop_connection().
Be careful to list only the queues corresponding to that connection in the mapping variable. The original implementation uses queue_names(self) to get a list of queue names. You'll have to do the same thing queue_names(self) does, but only for the relevant queues.
register_death
Essentially the same as register_birth: iterate over all connections, push_connection(connection), complete the same steps as the original implementation, and pop_connection().
dequeue_job_and_maintain_ttl
Let's take a look at the original implementation of this function. We'll want to keep everything the same until we get to the try block. There we want to iterate over all connections endlessly; you can do this with itertools.cycle.
Inside the loop, push_connection(connection) and set self.connection to the current connection. If self.connection = connection is missing, the job's result may not be returned properly.
Now we proceed to call self.queue_class.dequeue_any, similar to the original implementation, but with the timeout set to 1 so we can move on and check another connection if the current one has no jobs for the worker.
Make sure self.queue_class.dequeue_any is called with the list of queues corresponding to the current connection. In this case queues contains only the relevant queues.
result = self.queue_class.dequeue_any(
queues, 1, connection=connection, job_class=self.job_class)
Afterwards, pop_connection() and do the same check on result as the original implementation: if result is not None, we've found a job to do and need to break out of the loop.
Keep everything else from the original implementation, and don't forget the break at the end of the try block - it breaks out of the while True loop.
Another thing
Queues contain a reference to their connection, so you can use this to create a list of (connection, queues) pairs, where queues contains all the queues on that connection.
If you pass the resulting list to itertools.cycle, you get the endless iterator you need when overriding dequeue_job_and_maintain_ttl.
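The grouping step can be shown in isolation. Here FakeQueue stands in for rq.Queue (a real Queue exposes .connection the same way), so the snippet runs without a Redis server:

```python
import itertools
from collections import defaultdict

class FakeQueue:
    """Stand-in for rq.Queue; a real Queue exposes .connection identically."""
    def __init__(self, name, connection):
        self.name = name
        self.connection = connection

conn_a, conn_b = object(), object()
queues = [FakeQueue('tasks', conn_a), FakeQueue('tasks', conn_b)]

# group the queues by the connection they live on
by_connection = defaultdict(list)
for q in queues:
    by_connection[q.connection].append(q)

# endless (connection, queues) iterator for dequeue_job_and_maintain_ttl
pairs = itertools.cycle(list(by_connection.items()))
```

Each call to next(pairs) then yields the next (connection, queues) pair, wrapping around forever.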

Related

Python Tornado TCPServer - TCPClient alternative to interface with other objects through Queues?

I have a Tornado TCPServer which is acting as a "bridge" between two Python programs on different computers that need to exchange data (streaming & files) and commands. There is only ever one client at a time. Since the TCPServer runs using IOLoop I have it in a separate thread to avoid blocking other server actions.
Commands are received as strings from reading the TCP connection and are put in a Queue that can be accessed in both the TCPServer thread and the outer Python thread. There is an additional Queue for sending data back to the TCPServer after a command is interpreted and executed in the outer Python thread. This arrangement is mirrored on the client side with its TCPClient as well. Each Queue is used as one-directional.
Example simplified flowchart:
My questions:
Queues are very limited in the sense that there is no relation between a request and its response (i.e. two requests submitted at once could get each other's responses if one queue is used). Other than making a list of queues and routing commands/responses through them, are there good alternatives for parallel, cross-thread communication?
I would prefer not to reinvent the wheel, and I imagine this is not a totally unique use-case. Are there alternatives to this TCP-to-Queue routing?
Adding a partial answer to address the potential cross-communication of requests and responses in one Queue:
from queue import Queue

class OnDemandQueue(Queue):
    """Create and share new temporary queues as-needed via a single existing queue."""
    def get_queue(self):
        send = Queue()
        receive = Queue()
        # Remote reference
        self.put_nowait([receive, send])
        # Local reference
        return [send, receive]
One or two instances of OnDemandQueue are shared across threads, and any time a new command-response pair is needed, a new queue pair is generated and shared via get_queue. This keeps everything nicely segregated; the pairs can be maintained as long as they are needed and then cleanly discarded.
In typical use, a listener is needed to watch for new items in the shared OnDemandQueue and then determine what to do with the resulting queue pair. While communication is bidirectional through the get_queue pair, the OnDemandQueue itself is treated as unidirectional because of the listener, so two instances may be needed.
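A runnable sketch of the listener pattern just described (the class is repeated here so the snippet is self-contained; the upper-casing stands in for actually interpreting a command):

```python
import threading
from queue import Queue

class OnDemandQueue(Queue):
    """Hands out fresh per-request queue pairs via one shared queue."""
    def get_queue(self):
        send, receive = Queue(), Queue()
        self.put_nowait([receive, send])  # remote end's view of the pair
        return [send, receive]            # local end's view of the pair

shared = OnDemandQueue()

def listener():
    # watch the shared queue for a new pair; each pair serves one request
    respond_to, listen_on = shared.get()
    command = listen_on.get()
    respond_to.put(command.upper())  # stand-in for executing the command

threading.Thread(target=listener, daemon=True).start()

send, receive = shared.get_queue()
send.put('ping')
print(receive.get(timeout=5))  # prints "PING"
```

Because each request gets its own pair, the response can never be picked up by a different requester.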

How to find a list of jobs with no workers

When using Celery you can use i.inspect() to find the active tasks, the tasks that have been scheduled and all the registered tasks. However, I have disabled pre-fetch, so only 1 job is registered to a worker at a time. How do I access a list of all jobs in the queue, that do not have a registered worker?
If you use RabbitMQ, read this. If you use Redis, read this. It gives information about how many tasks there are in a particular queue.
Since I believe you want to list all the tasks waiting in the queue, you need to iterate through the list. If you use Redis, for example, instead of llen as in the example in the documentation, you would iterate through the QUEUE_NAME list (it is a Redis list object). NOTE: this process can take some time, so by the time you go through the list some of the tasks may already be finished.
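If the broker is Redis, the check could look like this from the shell. Both the broker choice and the queue name "celery" (Celery's default) are assumptions; substitute your own queue name:

```shell
redis-cli llen celery          # number of tasks waiting in the queue
redis-cli lrange celery 0 -1   # the raw task messages themselves
```

lrange returns the serialized task messages, which you can then inspect for task names and arguments.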

Delete a Queue from Python-RQ / Redis

I'm using Python RQ (backed by Redis) to feed tasks to a bunch of worker processes.
I accidentally sent a tuple when adding a job to a queue, so now I have queues like this:
high
medium
low
('low',)
default
I can't seem to figure out how to get rid of the ('low',) queue. The queue also seems to cause some issues due to its name (for instance, I couldn't view it or clear it in rq-dashboard as the page would refuse to load).
There is some discussion here: RQ - Empty & Delete Queues, but that only covers emptying a queue. I am able to empty the queue just fine from within Python, but I can't seem to actually delete the queue from the Redis server.
The RQ documentation doesn't seem to provide any information on getting rid of a queue you don't want.
I want to actually delete that queue (not just empty it) instead of carrying it around forever.
RQ stores the set of queues under the rq:queues key, which can be accessed with redis-cli:
smembers rq:queues
I also stumbled upon Destroying / removing a Queue() in Redis Queue (rq) programmatically. This might help!
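To actually delete the broken queue, you can unregister it from rq:queues and drop its job list directly in redis-cli. The exact key name "rq:queue:('low',)" is inferred from the queue name above; check the real spelling with smembers rq:queues first:

```shell
redis-cli srem rq:queues "rq:queue:('low',)"   # remove it from RQ's registry
redis-cli del "rq:queue:('low',)"              # delete the job list itself
```

After this, the queue should no longer appear in RQ or rq-dashboard.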

Recommended way to send messages between threads in python?

I have read lots about python threading and the various means to 'talk' across thread boundaries. My case seems a little different, so I would like to get advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action - meaningful to the receiving thread.
There are many different combinations of actions and recipients, so using Events for every combination seems unwieldy. The Queue object seems to be the recommended way to achieve this. However, if I have a shared queue and post an item addressed to just one recipient thread, then every other thread needs to monitor the queue, pull every item, check whether it is addressed to it, and put it back in the queue if it was addressed to another thread. That seems like a lot of getting and putting items from the queue for nothing.
Alternatively, I could employ a 'router' thread: one shared-by-all queue, plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items in the shared queue; the router pulls every item, inspects it, and puts it on the addressee's queue. Still, a lot of putting and getting items from queues...
Are there any other ways to achieve what I need to do? A pub-sub class seems like the right approach, but to my knowledge there is no such thread-safe module in standard Python.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach to do this. Just remove identical and shared from the above statement. i.e.
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another celery task from within the calling task. All the tasks can have separate queues.
Thanks for the response. After some thought, I decided to use the approach of many queues and a router thread (hub-and-spoke). Every 'normal' thread has its private queue to the router, enabling separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and used by 'normal' threads as a send-only channel, i.e. they only post items to this queue, and only the router listens to it, i.e. pulls items. Additionally, each 'normal' thread uses its own queue as a receive-only channel on which it listens and which is shared only with the router. Threads register themselves with the router on the router queue/channel; the router maintains a list of registered threads, including their queues, so it can send an item to a specific thread after its registration.
This means that peer to peer communication is not possible, all communication is sent via the router.
There are several reasons I did it this way:
1. There is no logic in a thread for checking whether an item is addressed to 'me', making the code simpler, and there is no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their own queue; when a message arrives, the thread can be sure it is addressed to it. The same goes for the router itself.
2. The router can act as a message bus, perform vocabulary translation, and address messages to external programs or hosts.
3. Threads don't need to know anything about other threads' capabilities, i.e. they just speak the language of the router. In a peer-to-peer world, all peers must be able to understand each other, and since my threads are of many different classes, I would have to teach each class every other class's vocabulary.
Hope this helps someone some day when faced with a similar challenge.
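The hub-and-spoke design described above could be sketched like this (the message format - (kind, sender, payload) tuples - and the thread names are illustrative choices, not part of the original solution):

```python
import threading
import queue

class Router(threading.Thread):
    """Hub-and-spoke router: every thread sends on the shared inbox and
    listens only on its own private queue."""
    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()   # shared send-only channel
        self._routes = {}            # thread id -> private receive queue

    def run(self):
        while True:
            kind, sender, payload = self.inbox.get()
            if kind == 'register':           # payload is the sender's queue
                self._routes[sender] = payload
            elif kind == 'send':             # payload is (recipient, message)
                recipient, message = payload
                if recipient in self._routes:
                    self._routes[recipient].put((sender, message))
            elif kind == 'stop':
                break

router = Router()
router.start()

# each 'normal' thread registers its own private queue with the router
queue_a = queue.Queue()
router.inbox.put(('register', 'thread-a', queue_a))
queue_b = queue.Queue()
router.inbox.put(('register', 'thread-b', queue_b))

# thread-a tells thread-b to do something; only thread-b sees it
router.inbox.put(('send', 'thread-a', ('thread-b', 'wake up')))
print(queue_b.get(timeout=5))  # prints ('thread-a', 'wake up')
```

All communication flows through the router, so no thread ever has to inspect and re-queue messages meant for someone else.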

Celery - Can a message in RabbitMQ be consumed by two or more workers at the same time?

Perhaps I'm being silly asking the question but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers, read up on auto-acks and message rejections, but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
eg.
@task(...)
def my_task(x):
    ...

my_task.delay(1)
The .delay() call publishes a message to the message broker you are using (rabbit, redis...).
The message then gets routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the celery cookbook shows how to prevent two messages like that (my_task.delay(1)) from running at the same time; that is something you need to ensure within the task itself.
You need something which you can access from all workers of course (memcached, redis ...) as they might be running on different machines.
The mentioned example is typically used for a different goal: it prevents you from working on different messages with the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs into a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL into the queue twice (or even more times)?
P.S. I use Redis storage and the setnx operation (which can set a key only once) for this purpose.
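The setnx-based dedup pattern can be illustrated without a Redis server. Here an in-process class mimics SETNX semantics; in production you would call redis_client.setnx(key, value) instead, and the 'fetch:' key prefix is an illustrative choice:

```python
import threading

class SetNXStore:
    """In-process stand-in for Redis SETNX: set a key only if it is absent."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def setnx(self, key, value):
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

store = SetNXStore()

def enqueue_url(url):
    # only the first producer to claim the key actually enqueues the URL
    if store.setnx('fetch:' + url, 1):
        return True   # new URL: enqueue it
    return False      # duplicate: skip it
```

The first call for a given URL wins the key and enqueues; every later call for the same URL sees the key already set and skips it.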
