Here's my situation:
I have a simple wrapper around Redis. It takes an item (a dict) as input and appends it to a Redis list with RPUSH. A request to put a new item on one queue may actually produce many new queues.
Now I want to read the left-most item in each queue without doing an LPOP, since the items need to stay in the queue. So I do an LRANGE and take the left-most item from the queue.
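For concreteness, a peek with redis-py might look like this (a sketch; the connection details and key name are assumptions):

    import redis

    r = redis.Redis()  # connection details are an assumption

    def peek_left(queue_key):
        # LRANGE with start == stop == 0 returns the left-most item
        # without removing it (unlike LPOP)
        items = r.lrange(queue_key, 0, 0)
        return items[0] if items else None

(A single LINDEX queue_key 0 would achieve the same thing.)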
My issue is that I don't know how to write a consumer for these queues. Should I have one consumer per queue? That would mean many, many instances of my consumer, which sounds like a bad design decision.
Should I just get all the keys for my queues and loop through them, processing the left-most item of each in turn?
How should I write my consumer for the queues I have?
I'm honestly stuck, and it seems like none of the standard producer/consumer implementations answer my question. I have also looked into Redis Streams, but I can't use those since the decision is out of my hands.
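In case it helps a future reader, here is a minimal sketch of the single-consumer, loop-over-keys approach described above, assuming redis-py and that all queue keys share a `queue:` prefix (the prefix and the `process()` helper are hypothetical):

    import time
    import redis

    r = redis.Redis()

    def consume_forever(prefix='queue:', idle_sleep=1.0):
        while True:
            # scan_iter uses SCAN under the hood, avoiding the blocking
            # behaviour of KEYS on large keyspaces
            for key in r.scan_iter(match=prefix + '*'):
                items = r.lrange(key, 0, 0)  # peek, do not pop
                if items:
                    process(key, items[0])   # process() is hypothetical
            time.sleep(idle_sleep)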
Here I have some questions about possible critical sections.
In my code I have a function dealing with a queue. This function is the one and only place that puts elements in the queue, but a number of threads operating concurrently get elements from it. Since there is a chance (I am not sure such a chance exists, to be honest) that multiple threads will attempt to get one element each from the queue at the same time, is it possible that they will get exactly the same element?
One of the things my workers do is open a file (different workers open different files in separate directories). I am using a context manager: "with open(<some file>, 'w') as file: ...". Is it possible that multiple threads opening different files at the same time, but using exactly the same variable name 'file', will mess things up? It looks like I have a critical section here, doesn't it?
Your first question is easy to answer from the documentation of the queue class. If you implemented a custom queue, the locking is on you, but the Python queue module states:
Internally, those three types of queues use locks to temporarily block competing threads; however, they are not designed to handle reentrancy within a thread.
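A quick demonstration of that guarantee with queue.Queue (a sketch; the thread and item counts are arbitrary): every item is handed to exactly one consumer, never to two.

    import queue
    import threading

    q = queue.Queue()
    for i in range(1000):
        q.put(i)

    seen = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()  # thread-safe: no item is handed out twice
            except queue.Empty:
                return
            with lock:
                seen.append(item)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()

    assert sorted(seen) == list(range(1000))  # no duplicates, nothing lost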
I am uncertain whether your second question follows from the first.
It would be helpful to clear it up with an example.
I have a multithreaded program with a management thread that, at regular intervals, puts jobs of various types into a queue shared among many worker threads, which pick jobs up as they are put in (it's a queue.SimpleQueue).
The time a worker thread needs to finish a job can vary greatly, so basically I need the management thread to know whether a given job type is already in the queue, to avoid putting in another one of the same type. However, I see no way to peek into the queue.
You need a separate data structure to keep track of the tasks put in the queue. A good approach is to generate a unique task ID for each task and keep them in a dictionary.
When a task completes, you set an attribute (say, done=True) using that task ID.
Using an external data store (a database or Redis, for example) might make this easier to manage in a distributed system.
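A sketch of that bookkeeping, assuming the single-management-thread setup from the question (the uuid-based IDs and dict layout are illustrative choices):

    import uuid

    tasks = {}  # task_id -> {'type': ..., 'done': bool}

    def submit(job_queue, job_type, payload):
        # skip if a job of this type is already queued and not yet finished
        if any(t['type'] == job_type and not t['done'] for t in tasks.values()):
            return None
        task_id = str(uuid.uuid4())
        tasks[task_id] = {'type': job_type, 'done': False}
        job_queue.put((task_id, payload))
        return task_id

    def mark_done(task_id):
        # workers call this when they finish; under CPython this single
        # assignment is safe, but add a lock if you do anything more elaborate
        tasks[task_id]['done'] = True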
I have a single process A in the first Pool, and several processes B1..Bk in the second Pool, and I would like to put items into the queue in A and consume them in B1..Bk.
My first attempt was to simply create a multiprocessing.Queue and pass it to all those processes. However, this gave me the error
RuntimeError: Queue objects should only be shared between processes through inheritance
I found advice suggesting the use of multiprocessing.Manager().Queue() instead. But when I do this and try to read from the queue in Bi, I get the error
TypeError: 'AutoProxy[Queue]' object is not iterable
So what is the correct way to do this?
OK, this was just silliness on my side; I misunderstood what a Queue is!
Most importantly, a queue is not iterable, so one cannot do "for el in somequeue".
(My mistake was to think that the queue proxy is not iterable because it is a proxy. In fact, the proxy works fine in place of the actual queue as long as put/get are used.)
Also, a (FIFO) queue cannot be closed and has no natural "end", which I found annoying: it means one has to send around special "end of queue" entries, but not too many of them, so as not to unintentionally block the queue.
So the bottom line is: to share the queue, I create a multiprocessing.Manager().Queue() and pass that around; I then use put/get to write/read the queue in different processes, and I send a special entry to the reader to indicate end of job.
That a queue cannot be closed so that consumers get an "end of queue" condition is really annoying, though, especially in error situations: if a queue is consumed by k consumers, the writer has to know k and send k end-of-job indicators, and all k consumers have to be well behaved and retrieve them in order to shut down. If there is any error, none of this can be guaranteed any more, and e.g. a consumer may lock up or time out waiting for an end-of-job indicator that will never arrive.
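To put the bottom line into code, a minimal sketch of that pattern (the None sentinel, item count, and pool sizes are illustrative choices):

    import multiprocessing as mp

    END = None  # sentinel marking end of queue; None is an arbitrary choice

    def writer(q, n_items, n_consumers):
        for i in range(n_items):
            q.put(i)
        for _ in range(n_consumers):  # one sentinel per consumer
            q.put(END)

    def reader(q):
        while True:
            item = q.get()
            if item is END:
                break
            print('got', item)

    if __name__ == '__main__':
        n_consumers = 3
        manager = mp.Manager()
        q = manager.Queue()  # the proxy can be passed to pool workers
        with mp.Pool(n_consumers + 1) as pool:
            pool.apply_async(writer, (q, 10, n_consumers))
            readers = [pool.apply_async(reader, (q,)) for _ in range(n_consumers)]
            for res in readers:
                res.get()  # wait for the consumers to drain the queue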
I have read a lot about Python threading and the various means to 'talk' across thread boundaries. My case seems a little different, so I would like advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action - meaningful to the receiving thread. There are many different combinations of actions and recipients, so using Events for every combination seems unwieldy.

The queue object seems to be the recommended way to achieve this. However, if I have a shared queue and post an item addressed to just one recipient thread, then every other thread has to monitor the queue, pull every item, check whether it is addressed to it, and put it back if it was addressed to another thread. That is a lot of getting and putting of items for nothing.

Alternatively, I could employ a 'router' thread: one shared-by-all queue plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items in the shared queue; the router pulls every item, inspects it and puts it on the addressee's queue. Still, a lot of putting and getting of items from queues...
Are there any other ways to achieve what I need? A pub-sub class seems like the right approach, but there is no such thread-safe module in standard Python, at least to my knowledge.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach. Just remove 'identical' and 'shared' from the statement above, i.e.
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell'
another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another Celery task from within the calling task. All the tasks can have separate queues.
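A rough illustration of that idea with Celery (the broker URL, task names and queue name are assumptions):

    from celery import Celery

    # the broker URL is an assumption
    app = Celery('tasks', broker='redis://localhost:6379/0')

    @app.task
    def periodic_job(payload):
        # ... this worker's own periodic work ...
        # 'tell' another worker to act by enqueueing its task on its own queue
        handle_action.apply_async(args=[payload], queue='actions')

    @app.task
    def handle_action(payload):
        ...  # a worker consuming the 'actions' queue picks this up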
Thanks for the response. After some thought, I have decided to use the approach of many queues plus a router thread (hub-and-spoke). Every 'normal' thread has its private queue to the router, which gives separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and used by 'normal' threads as a send-only channel, i.e. they only post items to this queue, and only the router listens to it, i.e. pulls items. Additionally, each 'normal' thread uses its own queue as a receive-only channel, on which it listens and which is shared only with the router. Threads register themselves with the router on the router queue/channel; the router maintains a list of registered threads, including their queues, so it can send items to a specific thread after its registration.
This means that peer-to-peer communication is not possible; all communication goes via the router.
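For reference, a minimal sketch of this hub-and-spoke pattern with queue.Queue (the message format and class names are illustrative choices):

    import queue
    import threading

    class Router(threading.Thread):
        # Hub thread: owns the shared inbox and forwards items
        # to per-thread private queues.
        def __init__(self):
            super().__init__()
            self.inbox = queue.Queue()   # shared send-only channel
            self.registry = {}           # thread name -> private queue

        def run(self):
            while True:
                msg = self.inbox.get()
                if msg['kind'] == 'register':
                    self.registry[msg['sender']] = msg['queue']
                elif msg['kind'] == 'send':
                    target = self.registry.get(msg['to'])
                    if target is not None:
                        target.put(msg['payload'])

    class Normal(threading.Thread):
        def __init__(self, name, router):
            super().__init__(name=name)
            self.router = router
            self.mailbox = queue.Queue() # private receive-only channel
            router.inbox.put({'kind': 'register', 'sender': name,
                              'queue': self.mailbox})

        def tell(self, other, payload):
            # send an action to another thread, via the router
            self.router.inbox.put({'kind': 'send', 'to': other,
                                   'payload': payload})

        def run(self):
            while True:
                action = self.mailbox.get()  # only messages addressed to me
                ...                          # act on it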
There are several reasons I did it this way:
1. There is no logic in the threads for checking whether an item is addressed to 'me', which makes the code simpler, and there is no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their own queue; when a message arrives, the thread can be sure it is addressed to it. The same goes for the router itself.
2. The router can act as a message bus, perform vocabulary translation, and potentially address messages to external programs or hosts.
3. Threads don't need to know anything about other threads' capabilities; they just speak the language of the router. In a peer-to-peer world, all peers would have to understand each other, and since my threads are of many different classes, I would have to teach each class every other class's vocabulary.
Hope this helps someone some day when faced with a similar challenge.
A producer thread queries a data store and puts objects into a queue. Consumer threads then each pull an object off the shared queue and make a very long call to an external service. When the call returns, the consumer marks the object as completed.
My problem is that I basically have to wait until the queue is empty before the producer can add to it again, or else I risk duplicates being sent through.
[edit] Someone asked a good question over IRC, and I figured I would add the answer here. The question was: "Why do your producers produce duplicates?" The answer is basically that we don't track a "sending" state for each object, only "sent" or "unsent".
Is there a way that I can check for duplicates in the queue?
It seems to me like it's not really a problem to have duplicate objects in the queue; you just want to make sure you only do the processing once per object.
EDIT: I originally suggested using a set or OrderedDict to keep track of the objects, but Python has a perfect solution: functools.lru_cache
Use functools.lru_cache as a decorator on your worker function, and it will manage a cache for you. You can set a maximum size, and the cache will not grow beyond it. If you use an ordinary set and don't manage it, it could grow very large and slow down your workers.
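A minimal sketch, assuming the objects can be keyed by a hashable ID (lru_cache requires hashable arguments; the maximum size is an arbitrary choice):

    from functools import lru_cache

    @lru_cache(maxsize=10_000)  # the cache will never grow past this
    def process(object_id):
        # the long call to the external service goes here; a repeated
        # call with the same object_id returns the cached result
        # instead of hitting the service again
        ...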
If you are using multiple worker processes instead of threads, you need a solution that works across processes. Instead of a set or an lru_cache, you could use a shared dict where the key is the unique ID you use to detect duplicates and the value is a timestamp for when the object went into the dict; from time to time you can then delete the really old entries. Here's a StackOverflow answer about shared dict objects:
multiprocessing: How do I share a dict among multiple processes?
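A sketch of that shared-dict idea with multiprocessing.Manager (the helper names and pruning age are illustrative):

    import time
    from multiprocessing import Manager

    manager = Manager()
    seen = manager.dict()  # shared across worker processes

    def is_duplicate(obj_id):
        # note: check-then-set is not atomic across processes; guard
        # with a manager.Lock() if exact-once matters
        if obj_id in seen:
            return True
        seen[obj_id] = time.time()  # remember when we first saw this ID
        return False

    def prune(max_age=3600):  # max_age in seconds is an arbitrary choice
        cutoff = time.time() - max_age
        for key, ts in list(seen.items()):
            if ts < cutoff:
                del seen[key]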
And the rest of my original answer follows:
If so, I suggest the consumer thread(s) use a set to keep track of objects that have been seen. If an object is not in the set, add it and process it; if it is in the set, ignore it as a duplicate.
If this will be a long-running system, use an OrderedDict instead of a set to track seen objects; then, from time to time, clean out the oldest entries in the OrderedDict.
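For example, a bounded "seen" tracker built on OrderedDict could look like this (the cap is an arbitrary choice):

    from collections import OrderedDict

    seen = OrderedDict()
    MAX_SEEN = 100_000  # cap on remembered IDs

    def first_time(obj_id):
        if obj_id in seen:
            return False              # duplicate: skip processing
        seen[obj_id] = True
        if len(seen) > MAX_SEEN:
            seen.popitem(last=False)  # evict the oldest entry
        return True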
If you are talking about the classes in the Queue module: according to the API, there is no way to detect whether a queue contains a given object.
What do you mean by "mark the object as having been completed"? Do you leave the object in the queue and change a flag, or do you mark it as completed in the data store? If the former, how does the queue ever become empty? If the latter, why not remove the object from the queue before you start processing?
Assuming you want to handle cases where processing fails without losing data, one approach is to create a separate work queue and processing queue. When a consumer pulls a job from the work queue, it moves the job to the processing queue and starts the long-running call to the external service. When that returns, it can mark the data complete and remove the job from the processing queue. If you add a field recording when a job entered the processing queue, you can run a periodic task that checks for processing jobs exceeding a certain age and attempts to reprocess them (updating the timestamp before restarting).
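As one possible concretization, here is a sketch of that two-queue pattern using Redis lists (the key names, the timeout, the `call_external_service()` helper, and the use of Redis itself are assumptions; any store with an atomic move works):

    import time
    import redis

    r = redis.Redis()

    def consume_one():
        # atomically move a job from the work queue to the processing queue
        raw = r.rpoplpush('work', 'processing')
        if raw is None:
            return
        r.hset('processing_started', raw, time.time())  # record start time
        call_external_service(raw)        # hypothetical long-running call
        r.lrem('processing', 1, raw)      # done: remove from processing
        r.hdel('processing_started', raw)

    def requeue_stale(max_age=600):
        # periodic job: push back anything stuck in processing for too long
        cutoff = time.time() - max_age
        for raw in r.lrange('processing', 0, -1):
            started = float(r.hget('processing_started', raw) or 0)
            if started < cutoff:
                r.lrem('processing', 1, raw)
                r.lpush('work', raw)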