What is the best way to dispatch many tasks to concurrent worker threads in Python? - python

There is a large number of field devices (100,000, each with its own IP address) from which I have to collect data.
I want to do this with a Python-based scheduler combined with a readily available executable written in C/C++, which handles the communication with and readout of the devices. The idea is to communicate with up to ~100 devices in parallel, so the first 100 devices could be read out using subprocess calls to the executable. I don't want to wait for all 100 tasks to complete, because some might take longer while others finish faster. Instead I want to start the next process immediately after one task has finished, and so on. So, driven by a simple "dispatcher", tasks are started continuously over time.
Question: Which Python API is the best I can use for this purpose?
I considered using the concurrent.futures API, starting a ThreadPoolExecutor and submitting task after task, each one starting the executable in a separate thread. ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a separate process anyway...
But I think this is not intended to be used in such a way, because each submitted job is remembered and therefore "kind of stored" in the executor forever; when a job is finished, it ends up in the "finished" state and is still visible, so I would clutter my executor with finished tasks. So I guess the Executor API is more suitable when there is a fixed number of tasks to be worked through, as in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for permanently submitting tasks.
The other idea would be to start 100 worker threads in parallel, each working in an endless loop and reading its next task from a Queue object. In that case I can decide myself which worker a new task is dispatched to next. I know that this would work, because I have already implemented it. But I have the feeling that there must be a more elegant solution in Python for dispatching tasks.
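For reference, this is roughly the pattern I have in mind for the concurrent.futures variant: keep at most 100 futures in flight and drop each one as soon as it completes, immediately submitting the next device (the executable name "readout_tool" and the result handling are placeholders):

import subprocess
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from itertools import islice

def read_device(ip):
    # Each task just blocks on the external readout executable.
    return subprocess.run(["readout_tool", ip], capture_output=True)

def handle_result(completed_process):
    pass  # placeholder: parse stdout, store the reading, log errors, ...

def dispatch(device_ips, max_parallel=100):
    ips = iter(device_ips)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Prime the pool with the first batch of up to 100 devices.
        pending = {pool.submit(read_device, ip) for ip in islice(ips, max_parallel)}
        while pending:
            # As soon as any task finishes, drop its future and submit the next device.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                handle_result(future.result())
                ip = next(ips, None)
                if ip is not None:
                    pending.add(pool.submit(read_device, ip))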

Related

Python multithread queue peeking

I have a multithreaded program that has a management thread that at regular intervals puts jobs of various types in a queue shared among many worker threads, that pick jobs up as they are put in the queue (it's a queue.SimpleQueue).
The time a worker thread needs to finish a job can vary greatly, so basically what I need is a way, from the management thread, to know whether a given job type is already in the queue, to avoid putting in another one of the same type. However, I see no way to peek into the queue.
You need to use a separate data structure to keep track of the tasks put in the queue. A good approach is to generate a unique task ID for each task and store them in a dictionary.
When a task completes, you set an attribute (say, done=True) under that task ID.
Using an external data store (a database or Redis, for example) can make this easier to manage in a distributed system.
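A minimal sketch of that bookkeeping (job types, payloads and the actual job execution are placeholders):

import queue
import threading
import uuid

jobs = queue.SimpleQueue()
tasks = {}                     # task_id -> {"type": ..., "done": bool}
tasks_lock = threading.Lock()

def schedule(job_type, payload):
    # Called from the management thread; refuses duplicates of a pending type.
    with tasks_lock:
        if any(t["type"] == job_type and not t["done"] for t in tasks.values()):
            return None
        task_id = uuid.uuid4().hex
        tasks[task_id] = {"type": job_type, "done": False}
    jobs.put((task_id, job_type, payload))
    return task_id

def worker():
    while True:
        task_id, job_type, payload = jobs.get()
        try:
            pass               # ... run the job ...
        finally:
            with tasks_lock:
                tasks[task_id]["done"] = True   # mark complete, as suggested above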

Understanding the working of process pool in an infinite loop

In the following snippet, as I understand it, a pool of two processes is created, and then the main script enters an infinite loop, continuously checking for messages and delegating the task to some function action_fn whenever it finds one.
p = Pool(processes=2)
while True:
    message = receive_message_from_queue()
    if message is not None:
        # Do some task
        p.map_async(action_fn, [temp_iterables])
What would happen here if there are 100 messages in the queue? Will Python create 100 processes? Or will only two messages be processed at any given time? Also, in a case such as this, what is the way to kill a process when its task is done and recreate it when there is a new message?
The Pool of Workers is a design pattern which aims to separate the service logic from the business logic.
Service logic means all the logic needed to support a given task, such as data storage and retrieval, metrics, logging and error handling.
Business logic, instead, refers to the components which do the "actual job", such as enriching or transforming the data, generating statistics, etc.
It is usually implemented using the Publisher/Subscriber design pattern, where one or more workers listen to a queue of jobs which is fed from the service side.
Most Pool implementations require the user to set a static number of workers at declaration time. Some more advanced ones allow the number of workers to be changed dynamically.
Jobs can be scheduled in a non-blocking (asynchronous) fashion, allowing the service to continue its execution flow, or in a blocking (synchronous) mode, stopping execution until the results are ready.
In your specific example, you are declaring a Pool with 2 workers. Assuming you are using the multiprocessing.Pool class, the interpreter will start 2 processes which will wait for new jobs. When you call map_async, the iterable gets split into multiple chunks which are enqueued in the Pool's internal queue. The workers pick up the chunks in the order they arrive, run the action_fn function against them and publish the results to a second results queue, which is consumed by the service.
Multiple calls to map_async result in more chunks being appended to the internal queue. Virtually, the queue is infinite in size. In practice, if you manage to fill it up, the subsequent call to map_async will block until the workers free up space for new jobs to be enqueued.
You don't need to "kill the process when it is done", as the Pool manages the workflow for you transparently. Concretely, the process never dies. It simply picks the next task from the queue and executes it until no more tasks are available. At that point it sleeps until either new tasks are scheduled or the Pool itself is terminated.
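A hedged illustration of that point, with a stand-in action_fn: a Pool of two processes can serve any number of map_async calls, and only two distinct worker PIDs ever appear.

import os
from multiprocessing import Pool

def action_fn(item):
    return (os.getpid(), item)          # record which worker ran the item

if __name__ == "__main__":
    with Pool(processes=2) as p:
        # 100 "messages" worth of work: still only two worker processes exist.
        results = [p.map_async(action_fn, [i]) for i in range(100)]
        pids = {pid for r in results for pid, _ in r.get()}
        print("distinct worker PIDs:", pids)    # at most 2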

Separate process sharing queue with main process (producer/consumer)

I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem where I have a simple server with an endpoint that consumes from a queue and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run but I don't want it to block the server from receiving or servicing requests. I tried using multithreading but this producing process is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and I want it to run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which seems like it's actually several questions), then multiprocessing.Queue seems to be exactly what you need. I've used this in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
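A minimal sketch of that setup, assuming a bounded queue and a placeholder for the slow production step: the producer runs in its own process and simply blocks on put() while the queue is full, resuming as soon as the main process consumes an item.

import time
from multiprocessing import Process, Queue

def producer(q):
    i = 0
    while True:
        item = f"item-{i}"              # placeholder for the slow production step
        q.put(item)                     # blocks whenever the queue is full
        i += 1

def main():
    q = Queue(maxsize=10)               # bounded, so the producer pauses when full
    Process(target=producer, args=(q,), daemon=True).start()
    # "Server" loop in the main process: each get() frees a slot, which
    # immediately unblocks the producer process again.
    for _ in range(5):
        print("consumed:", q.get())
        time.sleep(0.5)

if __name__ == "__main__":
    main()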

pika connection times out during execution of long task (3+ minutes)

I have a process in which I need to assign long-running tasks among a pool of workers, in Python. So far I have been using RabbitMQ to queue the tasks (the input is a Node.js frontend); a Python worker subscribes to the queue, obtains a task and executes it. Each task takes several minutes at minimum.
After an update this process started breaking, and I eventually discovered this was due to RabbitMQ version 3.6.10 having changed the way it handles timeouts. I now believe I need to rethink my method of assigning tasks, but I want to make sure I do it the right way.
Until now I only had one worker (the task is to control a sequence of actions in a VM - I couldn't afford a new Windows license for a while, so until recently I had no practical way of testing parallel task execution); I suspect if I'd had two before I would have noticed this sooner. The worker attaches to a VM using libvirt to control it. The way my code is written currently implies that I would run one instance of the script per VM that I wish to control.
I suspect that part of my problem is the use of BlockingConnection - I think I need a way for the worker to disconnect from the queue once it has received and validated a task (this part takes less than 1 second), then reconnect once it has completed the actions, but I haven't figured out how to do this yet. Is this correct? If so, how should I do this, and if not, what should I do instead?
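Roughly, what I have in mind is something like this (queue names, host and the VM work are placeholders, not my actual runmanager code; acking up front also means a crashed run loses the message, which I could live with):

import pika

def handle_one_task():
    params = pika.ConnectionParameters(host="localhost")

    # 1. Connect only long enough to fetch and validate one task (< 1 s).
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="vm_tasks", durable=True)
    method, properties, body = channel.basic_get(queue="vm_tasks")
    if method is None:
        connection.close()
        return                          # nothing queued right now
    channel.basic_ack(delivery_tag=method.delivery_tag)
    connection.close()                  # no AMQP connection held during the long part

    # 2. Do the multi-minute VM work with no connection left open to time out.
    run_vm_sequence(body)               # stand-in for the actual libvirt steps

    # 3. Reconnect afterwards to report the result.
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="vm_results", durable=True)
    channel.basic_publish(exchange="", routing_key="vm_results", body=b"done")
    connection.close()

def run_vm_sequence(task_body):
    pass                                # the long-running VM control steps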
One other idea I've had is that instead of running a script per VM I could have a global control script that on receiving a task would spin off a thread which would handle the task. This would solve the problem of the connection timing out during task execution, but the timeout would just have moved to a different stage: I would potentially receive tasks while there were no idle VMs, and I would have to come up with a way to make the script await an available VM without breaking the RabbitMQ connection.
My current code can be seen here:
https://github.com/scherma/antfarm/blob/master/src/runmanager/runmanager.py#L342
Any thoughts folks?

celery and long running tasks

I just watched a YouTube video where the presenter mentioned that one should design one's Celery tasks to be short. Tasks running for several minutes are bad.
Is this correct? What I do see is that I have some long-running tasks which take, say, 10 minutes to finish. When these kinds of tasks are scheduled frequently, the queue is swamped and no other tasks get scheduled. Is this the reason?
If so, what should be used for long running tasks?
Long-running tasks aren't great, but it's by no means appropriate to say they are bad. The best way to handle long-running tasks is to create a queue for just those tasks and have them run on a separate worker from the short tasks.
The problem with long-running tasks is that you have to wait for them when you're pushing a new software version to your server. If you don't wait, your task may run possibly incompatible code, especially if you pickled some complex object as a parameter (which is strongly discouraged).
As #user2097159 said, it's good practice to keep the long-running tasks in a dedicated queue. You should do that by routing with "settings.CELERY_ROUTES"; more info here
If you can estimate how long a task will run, I recommend using soft_time_limit per task; that way you will be able to handle it gracefully.
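A minimal sketch of both suggestions, assuming a Redis broker; the names proj, long_tasks and crunch are placeholders, not from the question:

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("proj", broker="redis://localhost:6379/0")

# Route the long-running task to its own queue, served by a dedicated worker
# started with: celery -A proj worker -Q long_tasks
app.conf.task_routes = {"proj.crunch": {"queue": "long_tasks"}}

@app.task(name="proj.crunch", soft_time_limit=600)   # soft limit: 10 minutes
def crunch(dataset_id):
    try:
        pass                            # ... the long-running work goes here ...
    except SoftTimeLimitExceeded:
        pass                            # save partial progress / clean up here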
There is a gist from a talk I gave here
Augment the basic Task definition to optionally treat the task instantiation as a generator, and check for TERM or soft timeout on every iteration through the generator. Generically inject a "state" dict kwarg into tasks that support it. If it's the first time the task is run, allocate a new one in the results cache; otherwise look up the existing one from the results cache.
In your task, figure out a good place to yield that keeps execution times between yields short. Update the state parameter as necessary.
When control returns to the master task class, check for TERM or soft timeout, and if there is one, save off the state object and respond to the signal.
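A framework-free, simplified illustration of that flow (not the gist itself): the task body is a generator that yields between short chunks of work, and a wrapper checks a stop condition at every yield, saving the injected "state" dict so a later run can resume.

results_cache = {}                      # task_id -> saved "state" dict

def run_resumable(task_id, task_gen_fn, stop_requested):
    # Reuse the existing state if this task_id was suspended before.
    state = results_cache.setdefault(task_id, {})
    for _ in task_gen_fn(state):        # each yield is a safe checkpoint
        if stop_requested():            # stands in for TERM / soft time limit
            results_cache[task_id] = state
            return "suspended"
    results_cache.pop(task_id, None)
    return "finished"

def example_task(state):
    # Resume from wherever the previous run stopped.
    for i in range(state.get("next", 0), 1000):
        # ... a short chunk of work on item i ...
        state["next"] = i + 1
        yield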
