How to prevent duplicate values in a shared queue - Python

A producer thread queries a data store and puts objects into a queue. Each consumer thread then pulls an object off the shared queue and makes a very long call to an external service. When the call returns, the consumer marks the object as completed.
My problem is that I basically have to wait until the queue is empty before the producer can add to it again, or else I risk duplicates being sent through.
[edit] Someone asked a good question over IRC and I figured I would add the answer here. The question was, "Why do your producers produce duplicates?" The answer is basically that we don't track a "sending" state for each object, only "sent" or "unsent".
Is there a way that I can check for duplicates in the queue?

It seems to me like it's not really a problem to have duplicate objects in the queue; you just want to make sure you only do the processing once per object.
EDIT: I originally suggested using a set or OrderedDict to keep track of the objects, but Python has a perfect solution: functools.lru_cache
Use functools.lru_cache as a decorator on your worker function, and it will manage a cache for you. You can set a maximum size, and it will not grow beyond that size. If you use an ordinary set and don't manage it, it could grow to a very large size and slow down your workers.
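A minimal sketch of that approach, assuming each object has a hashable, unique ID and using a stand-in for the external call:

from functools import lru_cache

def call_external_service(item_id):
    # stand-in for the real long-running external call
    return f"result for {item_id}"

@lru_cache(maxsize=10_000)            # bounded cache: oldest entries are evicted
def process(item_id):
    # the expensive call runs only once per unique (hashable) item_id;
    # repeated calls with the same ID return the cached result immediately
    return call_external_service(item_id)

One caveat: two threads that both miss the cache at the same moment can still each make the call once; if even that is unacceptable, you would need an explicit lock around the first call.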
If you are using multiple worker processes instead of threads, you would need a solution that works across processes. Instead of a set or an lru_cache you could use a shared dict where the key is the unique ID value you use to detect duplicates, and the value is a timestamp for when the object went into the dict; then from time to time you could delete the really old entries in the dict. Here's a StackOverflow answer about shared dict objects:
multiprocessing: How do I share a dict among multiple processes?
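As a rough sketch of that idea (the helper names and eviction policy are my own assumptions), multiprocessing.Manager provides a dict that can be shared with worker processes:

import time
from multiprocessing import Manager

manager = Manager()
seen = manager.dict()                 # shared across worker processes

def already_seen(obj_id):
    # return True if obj_id was handled before; otherwise record it
    if obj_id in seen:
        return True
    seen[obj_id] = time.time()        # remember when the object was first seen
    return False

def evict_older_than(max_age_seconds):
    # run from time to time so the dict doesn't grow forever
    cutoff = time.time() - max_age_seconds
    for key, added in list(seen.items()):
        if added < cutoff:
            del seen[key]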
And the rest of my original answer follows:
If so, I suggest you have the consumer thread(s) use a set to keep track of objects that have been seen. If an object is not in the set, add it and process it; if it is in the set, ignore it as a duplicate.
If this will be a long-running system, instead of a set, use an OrderedDict to track seen objects. Then from time to time clean out the oldest entries in the OrderedDict.
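A sketch of that for a long-running, threaded consumer; the size limit is arbitrary, and a lock is included because several consumer threads may check at once:

import threading
from collections import OrderedDict

MAX_SEEN = 100_000                    # arbitrary cap on remembered objects
seen = OrderedDict()
seen_lock = threading.Lock()

def should_process(obj_id):
    with seen_lock:
        if obj_id in seen:
            return False              # duplicate: ignore it
        seen[obj_id] = None
        if len(seen) > MAX_SEEN:
            seen.popitem(last=False)  # clean out the oldest entry
        return True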

If you are talking about the classes in the Queue module: according to the API, there is no way to check whether a queue contains a given object.

What do you mean by marking the object as having been completed? Do you leave the object in the queue and change a flag? Or do you mean you mark the object as completed in the data store? If the former, how does the queue ever become empty? If the latter, why not remove the object from the queue before you start processing?
Assuming you want to be able to handle cases where the processing fails without losing data, one approach would be to create a separate work queue and processing queue. When a consumer pulls a job from the work queue, it moves it to the processing queue and starts the long-running call to the external service. When that returns, it can mark the data complete and remove it from the processing queue. If you add a field for when the data was put into the processing queue, you can run a periodic job that checks for processing jobs that exceed a certain time and reprocesses them (updating the timestamp before restarting).
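A rough sketch of that two-stage idea in a threaded setup; here the "processing queue" is a dict keyed by object ID so the timestamps can be checked, and the service and data-store calls are placeholder stubs:

import queue
import threading
import time

work_q = queue.Queue()                 # the "work queue"
processing = {}                        # the "processing queue": obj_id -> start time
lock = threading.Lock()

def call_external_service(obj):        # stand-in for the real long-running call
    time.sleep(1)

def mark_complete(obj_id):             # stand-in for updating the data store
    pass

def consumer():
    while True:
        obj_id, obj = work_q.get()
        with lock:
            processing[obj_id] = time.time()
        try:
            call_external_service(obj)
            mark_complete(obj_id)
        finally:
            with lock:
                processing.pop(obj_id, None)

def requeue_stuck(max_age_seconds):
    # periodic job: retry items that have been "processing" for too long
    cutoff = time.time() - max_age_seconds
    with lock:
        for obj_id, started in list(processing.items()):
            if started < cutoff:
                processing[obj_id] = time.time()   # update the timestamp before restarting
                work_q.put((obj_id, None))         # reload the real object here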

Related

Python, concurrency, critical sections

I have a question about possible critical sections.
In my code I have a function dealing with a queue. This function is the one and only producer that puts elements into the queue, but a number of threads operating concurrently get elements from it. Since there is a chance (I am not sure such a chance even exists, to be honest) that multiple threads will attempt to get one element each from the queue at the same time, is it possible that they will get exactly the same element?
One of the things my workers do is open a file (different workers open different files in exclusive directories). I am using the context manager "with open(<some file>, 'w') as file: ...". So is it possible that multiple threads opening different files at the same time, but using exactly the same variable name 'file', will mess things up? It looks like I have a critical section here, doesn't it?
Your first question is easy to answer from the documentation of the queue classes. If you implemented a custom queue, the locking is on you, but the Python queue module documentation states:
Internally, those three types of queues use locks to temporarily block competing threads; however, they are not designed to handle reentrancy within a thread.
I am uncertain if your second question follows from the first question.
It would be helpful to clear up your question with an example.
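A small demonstration of the first point, if it helps; queue.Queue does its own locking, so each item is delivered to exactly one thread:

import queue
import threading

q = queue.Queue()
for i in range(10):
    q.put(i)

received = []
received_lock = threading.Lock()

def worker():
    while True:
        try:
            item = q.get_nowait()     # get() is internally locked, so no two
        except queue.Empty:           # threads ever receive the same item
            return
        with received_lock:
            received.append(item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(received) == list(range(10))   # every item seen exactly once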

Shared dict that gets filled by separate processes

I am trying to split across multiple processes a job that takes a 20k-item array, periodically computes each element of the array, and fills a global dict with the current result of that processing.
My issue is the result dict: I need it to live in a single place so that I can later, and periodically, send it in its entirety via an HTTP call.
My plan was to keep the dict in the main process/thread and divide the 20k items into chunks over 4 processes, each having about 500 threads, with each thread processing a number of items. But it seems I can't just pass a global variable to all processes and have it be filled, as each process creates its own empty copy and I get nothing in my global variable.
I had the idea of making each process send its results via HTTP to a server that would buffer them and then send the entire dict to the destination, but that would introduce huge latency, which is not desirable.
How can I achieve the division? Is there any way I can buffer the results coming from the multiple processes with minimal latency? The global variable must be a dict.
I believe you cannot share variables between subprocesses, or at least, it is not particularly easy. I am also not sure why you would need this for this problem.
Have you considered using Python's multiprocessing.Pool? Its official documentation with examples can be found here.
Each worker process could handle a subset of your input. After execution, the pool returns a list of the outputs, one per chunk of work submitted, which you can merge into a single dict once all the processes have finished. Would that work?
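Something along these lines, with a placeholder computation standing in for the real per-item work:

from multiprocessing import Pool

def compute(item):
    return item * 2                       # placeholder for the real computation

def process_chunk(chunk):
    return {item: compute(item) for item in chunk}   # partial result dict

if __name__ == "__main__":
    items = list(range(20_000))
    chunks = [items[i::4] for i in range(4)]          # split work across 4 workers

    with Pool(processes=4) as pool:
        partials = pool.map(process_chunk, chunks)    # one partial dict per chunk

    result = {}
    for partial in partials:
        result.update(partial)            # merge everything back in the main process
    # 'result' now holds all 20k entries and can be sent via a single HTTP call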

Best way to write consumers for many queues

Here's my situation:
I have a simple wrapper for Redis. What it does is take an item (a dict) as input and put it into a linked-list data type by doing an RPUSH. A request to put a new item on one queue may actually produce many new queues.
Now I want to read the left-most item in each queue without doing an LPOP. I need the items to stay in the queue. So what I do is, I do an LRANGE and get the left-most item from the queue.
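For reference, a minimal sketch of what that wrapper does with redis-py; the queue name handling and JSON serialization are assumptions on my part:

import json
import redis

r = redis.Redis()

def push(queue_name, item):
    # append the dict to the Redis list, as described above
    r.rpush(queue_name, json.dumps(item))

def peek_left(queue_name):
    # read the left-most item without removing it (LRANGE instead of LPOP)
    raw = r.lrange(queue_name, 0, 0)
    return json.loads(raw[0]) if raw else None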
My issue is that I don't know how to write a consumer for my queues. I assume I should have one consumer for each queue? That means I will have many, many instances of my consumer, which sounds like a bad design decision.
Should I just get all the keys for all my queues and loop through them processing the left-most item repeatedly?
How should I write my consumer for the queues I have?
I'm honestly stuck, and it seems like none of the standard producer/consumer implementations answer my question. I have also looked into Redis streams, but I can't use those since the decision is out of my hands.

Python multithread queue peeking

I have a multithreaded program that has a management thread that at regular intervals puts jobs of various types in a queue shared among many worker threads, that pick jobs up as they are put in the queue (it's a queue.SimpleQueue).
The time a worker thread needs to finish a job can vary greatly, so what I need, from the management thread, is a way to know whether a given job type is already in the queue, to avoid putting in another one of the same type. However, I see no way to peek into the queue.
You need to use a separate data structure to keep track of the tasks put in the queue. A good idea is to generate a unique task ID for each task and store the IDs in a dictionary.
When a task completes, you set an attribute (say, done=True) using that task ID.
Using an external data store (a database or Redis, for example) might make this easier to manage in a distributed system.
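A minimal sketch of that idea for the in-process, threaded case, assuming each job carries a type string (the names and structure are only illustrative):

import queue
import threading
import uuid

job_queue = queue.SimpleQueue()
pending = {}                            # task_id -> {"type": ..., "done": bool}
pending_lock = threading.Lock()

def submit_if_absent(job_type, payload):
    # management thread: enqueue only if no unfinished job of this type exists
    with pending_lock:
        if any(j["type"] == job_type and not j["done"] for j in pending.values()):
            return None                 # a job of this type is already queued/running
        task_id = uuid.uuid4().hex
        pending[task_id] = {"type": job_type, "done": False}
    job_queue.put((task_id, job_type, payload))
    return task_id

def mark_done(task_id):
    # worker thread: call once the job has finished
    with pending_lock:
        pending[task_id]["done"] = True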

RabbitMQ: purge queue

I have a queue, for example:
online_queue = self._channel.queue_declare(
    durable=True,
    queue='online'
)
At the moment, I need to flush all content from this queue.
However, another process may be publishing to this queue at the same time.
If I use channel.queue_purge(queue='online'), what will happen to messages published while queue_purge is still working?
Depending on your ultimate goal, you might be able to solve this issue by using a temporary queue.
To make things clearer, let's give things some names. Call your current queue (the one you want to purge) Queue A, and assume it is 1-1 bound to Exchange A.
If you create a new queue (Queue B) and bind it to Exchange A in the same way that Queue A is bound, Queue B will now get all of the messages (from the time of binding) that Queue A gets.
You can now safely purge Queue A without losing any of the messages that got sent in after Queue B was bound.
Re-bind Queue A to Exchange A and you are back up and running.
You can then deal with the "interim" messages in Queue B however you might need to.
This has the advantage of having a very well defined behavior and doesn't get you into any race conditions because you can completely blow Queue A away and re-create it instead of purging.
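A rough outline of those steps with pika; the exchange name, queue names, and routing key below are assumptions standing in for your actual setup:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# bind a temporary Queue B the same way Queue A is bound, so it starts
# receiving every message that Queue A receives
channel.queue_declare(queue='queue_b', durable=True)
channel.queue_bind(queue='queue_b', exchange='exchange_a', routing_key='online')

# Queue A can now be purged (or deleted and re-declared/re-bound) without
# losing messages that arrive in the meantime: they also land in Queue B
channel.queue_purge(queue='queue_a')

# ... consume or inspect the "interim" messages from Queue B as needed ...

# clean up the temporary queue when done
channel.queue_unbind(queue='queue_b', exchange='exchange_a', routing_key='online')
channel.queue_delete(queue='queue_b')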
You're describing a race condition. Some might remain in the queue and some others might get purged. Or all of them will get purged. Or none of them will get purged.
There's just no way to tell, because it's a time-dependent situation. You should re-examine your need to purge a queue which is still active, or build a more robust consumer that can live with the fact that there might be messages in the queue it is connecting to (which is basically what consumers have to live with, anyway).
