Python multithread queue peeking

I have a multithreaded program with a management thread that, at regular intervals, puts jobs of various types into a queue shared among many worker threads, which pick jobs up as they are added (it's a queue.SimpleQueue).
The time a worker thread needs to finish a job can vary greatly, so what I need is a way, from the management thread, to know whether a given job type is already in the queue, to avoid putting in another one of the same type. However, I see no way to peek into the queue.

You need to use a separate data structure to keep track of the tasks you put in the queue. A good approach is to generate a unique task ID for each task and store them in a dictionary.
When a task completes, you set an attribute (say, done=True) for that task ID.
Using an external data store (a database or Redis, for example) can make this easier to manage in a distributed system.
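For illustration, here is a minimal sketch of that idea, keyed by job type (which is what the question needs to check) rather than a per-task ID; the names submit_job, worker and process are placeholders, not from the question:

    import queue
    import threading

    job_queue = queue.SimpleQueue()
    pending = {}                 # job type -> done flag (the separate tracking structure)
    pending_lock = threading.Lock()

    def process(payload):
        pass                     # placeholder for the real work

    def submit_job(job_type, payload):
        """Management thread: enqueue only if no unfinished job of this type exists."""
        with pending_lock:
            if job_type in pending and not pending[job_type]:
                return False     # a job of this type is already queued or running
            pending[job_type] = False
        job_queue.put((job_type, payload))
        return True

    def worker():
        while True:
            job_type, payload = job_queue.get()
            try:
                process(payload)
            finally:
                with pending_lock:
                    pending[job_type] = True   # done; this type may be submitted again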

Related

How to find a list of jobs with no workers

When using Celery you can use i.inspect() to find the active tasks, the tasks that have been scheduled and all the registered tasks. However, I have disabled pre-fetch, so only 1 job is registered to a worker at a time. How do I access a list of all jobs in the queue, that do not have a registered worker?
If you use RabbitMQ, read this. If you use Redis, read this. It tells you how many tasks there are in a particular queue.
Since I believe you want to list all the tasks waiting in the queue, you need to iterate through the list. If you use Redis, for example, instead of llen as in the documentation example, you would iterate through the QUEUE_NAME list (it is a Redis list object). NOTE: this can take some time, so by the time you finish going through the list some of the tasks may already be finished.
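As a rough sketch, assuming the default Celery-on-Redis layout where the queue is a Redis list of JSON-encoded messages named "celery" (the queue name, broker address and message fields depend on your configuration):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)   # assumed broker location
    QUEUE_NAME = "celery"   # default Celery queue name; adjust to your routing setup

    # LLEN only gives the count; LRANGE returns the raw messages so we can inspect them.
    for raw in r.lrange(QUEUE_NAME, 0, -1):
        message = json.loads(raw)
        headers = message.get("headers", {})
        print(headers.get("task"), headers.get("id"))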

What is the best way to dispatch many tasks to concurrent worker threads in Python?

There is a large number of field devices (100,000, each having individual IP) from which I have to collect data.
I want to do it with a Python-based scheduler combined with a readily available executable written in C/C++, which handles the communication with and readout of the devices. The idea is to communicate with up to ~100 devices in parallel, so the first 100 devices could be read out using subprocess calls to the executable. I don't want to wait for all 100 tasks to complete, because some will take longer than others. Instead, I want to start the next process as soon as one task has finished, and so on, so that a simple "dispatcher" keeps starting tasks continuously over time.
Question: Which Python API is the best I can use for this purpose?
I considered using the concurrent.futures API, starting a ThreadPoolExecutor and submitting task after task, each starting the executable in a separate thread. A ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a separate process anyway...
But I think this is not intended to be used in such a way, because each submitted job will be remembered and therefore "kind of stored" in the executor forever; when a job is finished it ends up in status "finished" and is still visible, so I would clutter my executor with finished tasks. So I guess the Executor API is better suited to cases where there is a fixed number of tasks to be worked through, as in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for continuously submitting new tasks.
The other idea would be to start 100 worker threads in parallel, each running an endless loop and reading its next task from a Queue object. In that case I can decide myself which worker a new task is sent to next. I know this would work, because I have already implemented it, but I have the feeling there must be a more elegant way in Python to dispatch tasks.
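For reference, a minimal sketch of the worker-pool-plus-Queue pattern described above; the executable path, device list and worker count are placeholders:

    import queue
    import subprocess
    import threading

    NUM_WORKERS = 100
    device_ips = ["10.0.0.1", "10.0.0.2"]   # placeholder for the 100,000 device addresses
    task_queue = queue.Queue()

    def worker():
        while True:
            ip = task_queue.get()
            if ip is None:                   # sentinel: shut this worker down
                task_queue.task_done()
                return
            subprocess.run(["./readout", ip], check=False)   # placeholder executable call
            task_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    for ip in device_ips:
        task_queue.put(ip)

    task_queue.join()                        # block until every readout has finished
    for _ in threads:
        task_queue.put(None)                 # tell the workers to stop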

Separate process sharing queue with main process (producer/consumer)

I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem where I have a simple server with an endpoint that consumes from a queue and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run but I don't want it to block the server from receiving or servicing requests. I tried using multithreading but this producing process is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and I want it to run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which seems like it's actually several questions), then multiprocessing.Queue seems to be exactly what you need. I've used this in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
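A minimal sketch of that pattern, with a producer in a separate process feeding a bounded multiprocessing.Queue that the main process consumes; the queue size and the stand-in work are assumptions:

    import multiprocessing as mp
    import time

    def producer(q):
        """Runs in a separate process; blocks on put() whenever the queue is full."""
        i = 0
        while True:
            q.put(f"item-{i}")       # blocks while the queue is at maxsize
            i += 1

    if __name__ == "__main__":
        q = mp.Queue(maxsize=10)     # bounded, so the producer pauses when the queue is full
        p = mp.Process(target=producer, args=(q,), daemon=True)
        p.start()

        # The main process (e.g. the server endpoint) consumes from the same queue.
        for _ in range(25):
            item = q.get()
            print("consumed", item)
            time.sleep(0.1)          # stand-in for servicing a request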

Celery design help: how to prevent concurrently executing tasks

I'm fairly new to Celery/AMQP and am trying to come up with a task/queue/worker design to meet the following requirements.
I have multiple types of "per-user" tasks: e.g., TaskA, TaskB, TaskC. Each of these "per-user" tasks read/write data for one particular user in the system. So at any given time, I might need to create tasks User1_TaskA, User1_TaskB, User1_TaskC, User2_TaskA, User2_TaskB, etc. I need to ensure that, for each user, no two tasks of any task type execute concurrently. I want a system in which no worker can execute User1_TaskA at the same time as any other worker is executing User1_TaskB or User1_TaskC, but while User1_TaskA is executing, other workers shouldn't be blocked from concurrently executing User2_TaskA, User3_TaskA, etc.
I realize this could be implemented using some sort of external locking mechanism (e.g., in the DB), but I'm hoping there's a more elegant task/queue/worker design that would work.
I suppose one possible solution is to implement queues as user buckets such that, when the workers are launched there's config that specifies how many buckets to create, and each "bucket worker" is bound to exactly one bucket. Then an "intermediate worker" would pull off tasks from the main task queue and assign them into the bucketed queues via, say, a hash/mod scheme. So UserA's tasks would always end up in the same queue, and multiple tasks for UserA would back up behind each other. I don't love this approach, as it would require the number of buckets to be defined ahead of time, and would seem to prevent (easily) adding workers dynamically. Seems to me there's got to be a better way -- suggestions would be greatly appreciated.
What's so bad about using an external locking mechanism? It's simple, straightforward, and efficient enough. You can find an example of distributed task locking in Celery here. Extend it by creating a lock per user, and you're done!
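Along those lines, a rough sketch of a per-user lock held in Redis for the duration of a task; the broker URL, lock timeout, retry delay and task body are assumptions, not part of the linked example:

    import redis
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")   # assumed broker
    r = redis.Redis(host="localhost", port=6379, db=0)

    LOCK_TTL = 60 * 10   # seconds; must exceed the longest expected task runtime

    def do_per_user_work(user_id):
        pass             # placeholder for the real per-user work

    @app.task(bind=True, max_retries=None)
    def task_a(self, user_id):
        lock_key = f"user-lock:{user_id}"
        # SET with nx + ex is an atomic "acquire lock with expiry"
        if not r.set(lock_key, self.request.id, nx=True, ex=LOCK_TTL):
            # Another task for this user is running; try again shortly.
            raise self.retry(countdown=5)
        try:
            do_per_user_work(user_id)
        finally:
            r.delete(lock_key)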

How to prevent duplicate values in a shared queue

A producer thread queries a data store and puts objects into a queue. Each consumer thread(s) will then pull an object off of the shared queue and do a very long call to an external service. When the call returns, the consumer marks the object as having been completed.
My problem is that I basically have to wait until the queue is empty before the producer can add to it again, or else I risk duplicates being sent through.
[edit] Someone asked a good question over IRC and I figured I would add the answer here. The question was, "Why do your producers produce duplicates?" The answer is basically that the producer produces duplicates because we don't track a "sending" state of each object, only "sent" or "unsent".
Is there a way that I can check for duplicates in the queue?
It seems to me like it's not really a problem to have duplicate objects in the queue; you just want to make sure you only do the processing once per object.
EDIT: I originally suggested using a set or OrderedDict to keep track of the objects, but Python has a perfect solution: functools.lru_cache
Use functools.lru_cache as a decorator on your worker function, and it will manage the cache for you. You can set a maximum size, and it will not grow beyond that size. If you use an ordinary set and don't manage it, it could grow very large and slow down your workers.
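A minimal sketch of that idea, assuming the objects (or their IDs) are hashable so they can serve as cache keys; call_external_service is a placeholder:

    from functools import lru_cache

    def call_external_service(object_id):
        return f"result-for-{object_id}"    # placeholder for the long external call

    @lru_cache(maxsize=10_000)              # bounded, so the "seen" cache cannot grow forever
    def process_object(object_id):
        # Because of the cache, a duplicate object_id pulled off the queue returns
        # the cached result immediately instead of being processed a second time.
        return call_external_service(object_id)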
If you are using multiple worker processes instead of threads, you would need a solution that works across processes. Instead of a set or an lru_cache you could use a shared dict where the key is the unique ID value you use to detect duplicates, and the value is a timestamp for when the object went into the dict; then from time to time you could delete the really old entries in the dict. Here's a StackOverflow answer about shared dict objects:
multiprocessing: How do I share a dict among multiple processes?
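For the cross-process case, a rough sketch using a Manager dict keyed by object ID with a timestamp value and periodic cleanup; the age threshold and ID format are assumptions, and in a real system the check-then-set would need a shared lock to be fully race-free:

    import time
    from multiprocessing import Manager, Process

    MAX_AGE = 3600   # drop "seen" entries older than an hour

    def worker(seen, object_id):
        if object_id in seen:       # note: not atomic with the set below
            return                  # duplicate: skip it
        seen[object_id] = time.time()   # record when we first saw this object
        # ... the long call to the external service would go here ...

    def cleanup(seen):
        cutoff = time.time() - MAX_AGE
        for key, ts in list(seen.items()):
            if ts < cutoff:
                del seen[key]

    if __name__ == "__main__":
        manager = Manager()
        seen = manager.dict()       # shared across worker processes
        procs = [Process(target=worker, args=(seen, f"obj-{i % 3}")) for i in range(6)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(seen))           # only three distinct IDs recorded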
And the rest of my original answer follows:
If so, I suggest you have the consumer thread(s) use a set to keep track of objects that have been seen. If an object is not in the set, add it and process it; if it is in the set, ignore it as a duplicate.
If this will be a long-running system, instead of a set, use an OrderedDict to track seen objects. Then from time to time clean out the oldest entries in the OrderedDict.
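A minimal sketch of the OrderedDict variant for a long-running, threaded system; the size limit is an arbitrary assumption:

    from collections import OrderedDict
    from threading import Lock

    MAX_SEEN = 10_000
    seen = OrderedDict()
    seen_lock = Lock()

    def is_duplicate(object_id):
        """Return True if object_id was already seen; otherwise record it."""
        with seen_lock:
            if object_id in seen:
                seen.move_to_end(object_id)      # keep recently seen items at the end
                return True
            seen[object_id] = True
            if len(seen) > MAX_SEEN:
                seen.popitem(last=False)         # evict the oldest entry
            return False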
If you are talking about the classes in the Queue module: according to the API, there is no way to detect whether a queue contains a given object.
What do you mean by marking the object as having been completed? Do you leave the object in the queue and change a flag? Or do you mean you mark the object as having been completed in the data store? If the former, how does the queue ever become empty? If the latter, why not remove the object from the queue before you start processing?
Assuming you want to be able to handle cases where the processing fails without losing data, one approach would be to create a separate work queue and processing queue. Then, when a consumer pulls a job from the work queue, they move it to the processing queue and start the long running call to an external service. When that returns, it can mark the data complete and remove it from the processing queue. If you add a field for when the data was put into the processing queue, you could potentially run a periodic job that checks for processing jobs that exceed a certain time and attempt to reprocess them (updating the timestamp before restarting).
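A rough sketch of the work-queue/processing-queue idea, here using a dict keyed by job ID as the "processing queue" so stuck entries can be found by timestamp; all names and the timeout value are assumptions:

    import queue
    import threading
    import time

    STALE_AFTER = 300            # seconds before an in-flight job counts as stuck

    work_queue = queue.Queue()
    processing = {}              # job_id -> (job, started_at); acts as the processing queue
    processing_lock = threading.Lock()

    def call_external_service(job):
        time.sleep(0.1)          # placeholder for the long external call

    def consumer():
        while True:
            job_id, job = work_queue.get()
            with processing_lock:
                processing[job_id] = (job, time.time())
            try:
                call_external_service(job)
                # ... mark the data complete in the data store here ...
            finally:
                with processing_lock:
                    processing.pop(job_id, None)
                work_queue.task_done()

    def requeue_stale_jobs():
        """Periodic job: put back anything that has been 'processing' too long."""
        cutoff = time.time() - STALE_AFTER
        with processing_lock:
            stale = [(jid, job) for jid, (job, ts) in processing.items() if ts < cutoff]
            for jid, job in stale:
                processing[jid] = (job, time.time())   # refresh the timestamp before restarting
                work_queue.put((jid, job))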
