I'm fairly familiar with the python multiprocessing module, but I'm unsure of how to implement this setup. My project has this basic flow:
request serial device -> gather response -> parse response -> assert response -> repeat
It is right now a sequential operation that loops over this until it has gathered the desired number of asserted responses. I was hoping to speed this task up by having a 'master process' do the first two operations, and then pass off the parsing and assertion task to a queue of worker processes. However, this is only beneficial if the master process is ALWAYS running. I'm guaranteed to be working on a multi-core machine.
Is there any way with the multiprocessing module to keep one process always running / prioritized so I can achieve this?
From what I can gather (assuming you don't have a stringent requirement that the master is always logging data from the serial device), you just want the master to be ready to give any worker a chunk of data, and to receive data from any worker as soon as that worker is ready.
To achieve this, use two queues and multiprocessing:
Multiprocessing Queue in Python
How to use multiprocessing queue in Python?
This should be sufficient for your needs if time(parse data) >> time(gather data).
Here's one way to implement your workflow:
- Have two multiprocessing.Queue objects: tasks_queue and results_queue. The tasks_queue holds device outputs, and results_queue holds the results of the assertions.
- Have a pool of workers, where each worker pulls device output from tasks_queue, parses it, asserts it, and puts the result of the assertion on results_queue.
- Have another process continuously poll the device and put the device output on tasks_queue.
- Have one last process continuously poll results_queue, ending the overall program when the desired number of results (successful assertions) is reached.
Total number of processes (multiprocessing.Process objects) is 2 + k, where k is the number of workers in the pool.
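The 2 + k layout above could look roughly like the following minimal, runnable sketch. The names (`poller`, `worker`, `run_pipeline`) are illustrative, and the parse/assert step is a stand-in for real serial-device handling:

```python
import multiprocessing as mp

NUM_WORKERS = 2

def poller(tasks_queue, n):
    # Stand-in for the serial-device loop: emit n raw responses, then
    # one sentinel per worker so they know to shut down.
    for i in range(n):
        tasks_queue.put(f"response-{i}")
    for _ in range(NUM_WORKERS):
        tasks_queue.put(None)

def worker(tasks_queue, results_queue):
    while True:
        raw = tasks_queue.get()
        if raw is None:
            results_queue.put(None)  # tell the collector this worker is done
            break
        # Stand-in parse + assert; real code would decode the device payload.
        results_queue.put(raw.startswith("response"))

def run_pipeline(n_responses, desired):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=poller, args=(tasks, n_responses))]
    procs += [mp.Process(target=worker, args=(tasks, results))
              for _ in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    successes = finished = 0
    # Collector role: stop once enough successful assertions arrived.
    while successes < desired and finished < NUM_WORKERS:
        result = results.get()
        if result is None:
            finished += 1
        elif result:
            successes += 1
    for p in procs:
        p.join()
    return successes

if __name__ == "__main__":
    print(run_pipeline(10, 5))  # prints 5
```

Here the collector runs in the main process rather than a separate one, which saves a process without changing the structure of the design.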
In the following snippet, as I understand it, a pool of two processes is being created, and then the main script enters an infinite loop, continuously checking for messages and delegating the task to some function action_fn whenever it finds one.
p = Pool(processes=2)
while True:
    message = receive_message_from_queue()
    if message is not None:
        # Do some task
        p.map_async(action_fn, [temp_iterables])
What would happen here if there are 100 messages in the queue? Will there be 100 processes created by python? Or is it that at any time only two messages will be processed? Also, in the case such as this, what is the way to kill the process when its task is done and recreate the process when there is a new message?
The Pool of Workers is a design pattern which aims to separate the service logic from the business logic.
With service logic is intended all the logic needed to support a given task such as data storage and retrieval, metrics, logging and error handling.
Business logic instead refers to the components which do the "actual job" such as enriching or transforming the data, generating statistics etc.
It is usually implemented adopting the Publisher/Subscriber design pattern where one or more workers listen to a queue of jobs which is fed from the service side.
Most Pool implementations require the user to set a static number of workers at declaration time. Some more advanced ones allow changing the number of workers dynamically.
Jobs can be scheduled in a non-blocking (asynchronous) fashion, allowing the service to continue its execution flow, or in a blocking (synchronous) mode that stops execution until the results are ready.
In your specific example, you are declaring a Pool with 2 workers. Assuming you are using the multiprocessing.Pool class, the interpreter will start 2 processes which will wait for new jobs. When you call map_async, the iterable gets split into multiple chunks which get enqueued inside the Pool internal queue. The workers will pick the chunks in the order they arrive, run the action_fn function against them and publish the results in a second results queue which gets consumed by the service.
Multiple calls to map_async result in more chunks getting appended to the internal queue. The queue is virtually infinite in size. In practice, if you manage to fill it up, the subsequent call to map_async will block until the workers free up space for new jobs to be enqueued.
You don't need to "kill the process when it is done", as the Pool manages the workflow for you in a transparent manner. Concretely, the process never dies. It simply picks the next task from the queue and executes it until there are no more tasks available. At that point it goes to sleep until either new tasks are scheduled or the Pool itself is terminated.
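A small sketch can make the worker-reuse point concrete. Here `action_fn` is a stand-in that just reports which worker process handled each item; even with 100 queued items, only the two pool processes ever run:

```python
from multiprocessing import Pool
import os

def action_fn(item):
    # Each call reports which worker process handled it.
    return (item, os.getpid())

if __name__ == "__main__":
    with Pool(processes=2) as p:
        # 100 "messages" enqueued at once, but still only two worker processes:
        results = p.map(action_fn, range(100))
    distinct_pids = {pid for _, pid in results}
    print(len(distinct_pids))  # at most 2 distinct worker PIDs
```

The same two PIDs repeat across all 100 results, showing the processes are reused rather than created per message.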
There is a large number of field devices (100,000, each having individual IP) from which I have to collect data.
I want to do it in a Python-based scheduler combined with a readily available executable written in C/C++, which handles the communication and readout of the devices. The idea is to communicate with up to ~100 devices in parallel. So the first 100 devices could be read out using subprocess calls to the executable. I don't want to wait for all 100 tasks to complete, because some might take longer while others are faster. Instead, I want to put the next process on its journey immediately after one task has finished, and so on. So, conducted by a simple "dispatcher", there is a continuous starting of tasks over time.
Question: Which Python API is the best I can use for this purpose?
I considered using the concurrent.futures API, starting a ThreadPoolExecutor and submitting task by task, each starting the executable in a separate thread. ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a process anyway...
But I think this is not intended to be used in such a way, because each submitted job will be remembered and therefore "kind of stored" in the executor forever; when a job is finished it ends up in status "finished" and is still visible, so I would clutter my executor with finished tasks. So I guess the Executor API is more suitable when there is a fixed number of tasks to be worked through, as in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for permanently submitting tasks.
The other idea would be to start 100 worker threads in parallel, each working in an endless loop and reading its next task from a Queue object. In this case I can dispatch on my own which worker a new task is sent to. I know that this would work, because I have already implemented it. But I have the feeling that there must be a more elegant solution in Python for dispatching tasks.
I use a list of processes with a queue for each one. Another thread is used to fill these queues one after the other, and the processes fetch the data from them. The problem is that after a while the queues raise an Empty exception from within the processes, but the thread gets a Full exception. When I check the queue sizes, they are consistent with the exceptions.
To make it worse, this behavior can only be reproduced as part of a large code base; I can't create a small program that reproduces it.
Anyone had similar issues with multiprocessing queues not being consistent in different processes?
Edit
To add more to the description of the pipeline. I have multiple worker objects, each worker has an input queue (multiprocessing.Queue), a worker queue (multiprocessing.Queue), an output queue (threading.Queue), a worker process (multiprocessing.Process) and a manager thread (threading.Thread).
Against all these workers, I have a single feeder thread (threading.Thread) that adds sample identifiers to the input queues of all workers, one by one. The sample identifiers are very small in size (paths of files) so the feeder thread can keep up with the processes.
The worker gets the sample identifiers from the input queue, reads these samples, processes them, and puts them into the worker queue one by one. The manager thread reads the data from the worker queues and puts it into the output queue, because multiprocessing.Queue is slower to read from.
All .get() and .put() calls have timeouts and I keep track of time it takes to get new data from this pipeline. I also have mechanisms for closing it and reopening it, by joining all processes and threads (even for queues) and then recreating all of them from scratch. When everything is working, the main process goes over the workers and reads the data off of their output queue one by one. It also takes a few ms to read new data most of the time.
This whole pipeline exists twice in my code (used for machine learning with TensorFlow). One instance is used for training and is created close to the beginning of the program; the other is used for testing. The second instance is created after a while of training; it goes over all of my dataset and then resets. When the second instance is run for the second time, it gets stuck after 1000 samples or so. When it is stuck and I break in debug mode in the main process, I see that the input queue is full and the worker and output queues are empty. When I then break inside one of the worker processes, I see that their input queue is empty. It seems like for some reason the worker process sees a different input queue than it should. Note that this is not some race issue, because this result is stable.
Edit 2
I zeroed in on the point where the program hangs. It seems to be the json.loads() call on the file data that was read. This means the problem is different from what I originally described: the processes hang, and don't actually see an empty queue.
code for opening the file:
with open(json_path, 'r') as f:
    data = f.read()
json_data = json.loads(data)  # <== program hangs at this line
I tried using signal.alarm to pinpoint where in json.loads() the program hangs, but it doesn't raise the exception. The problem is reproduced with a single multiprocessing.Process as well, but not when all processing is done in the main process.
Rings a bell to anyone?
I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem where I have a simple server with an endpoint that consumes from a queue and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run but I don't want it to block the server from receiving or servicing requests. I tried using multithreading but this producing process is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and I want it to run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which seems like it's actually several questions), then multiprocessing.Queue seems to be exactly what you need. I've used this in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
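A minimal sketch of that sharing, assuming the producer process is handed the queue when it is created (the names `producer` and `run` are just illustrative; the real producer would fill the queue until its bound is reached):

```python
import multiprocessing as mp

def producer(q, n):
    # Runs in a child process; it operates on the same queue the parent
    # created, because the queue was passed to Process at creation time.
    for i in range(n):
        q.put(i)  # blocks whenever the bounded queue is full (back-pressure)

def run(n=5):
    q = mp.Queue(maxsize=2)  # small bound to exercise the "queue full" case
    p = mp.Process(target=producer, args=(q, n))
    p.start()
    items = [q.get() for _ in range(n)]  # main process consumes
    p.join()
    return items

if __name__ == "__main__":
    print(run())  # [0, 1, 2, 3, 4]
```

The key detail is that the multiprocessing.Queue must be created before the child process and passed to it as an argument; the child then blocks on put() whenever the queue is full, which matches the "producer only runs while there is room" behavior described in the question.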
I need to write a very specific data processing daemon.
Here is how I thought it could work with multiprocessing :
Process #1: One process to fetch some vital meta data, they can be fetched every second, but those data must be available in process #2. Process #1 writes the data, and Process #2 reads them.
Process #2: Two processes which will fetch the real data based on what has been received in process #1. Fetched data will be stored into a (big) queue to be processed "later"
Process #3: Two (or more) processes which poll the queue created in Process #2 and process those data. Once done, a new queue is filled up to be used in Process #4
Process #4 : Two processes which will read the queue filled by Process(es) #3 and send the result back over HTTP.
The idea behind all these different processes is to specialize them as much as possible and to make them as independent as possible.
All those processes will be wrapped into a main daemon, which is implemented here:
http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
I am wondering if what I have imagined is relevant/stupid/overkill/etc., especially since I would be running daemonic multiprocessing.Process(es) within a main parent process which will itself be daemonized.
Furthermore, I am a bit concerned about potential locking problems. In theory, processes that read and write data use different variables/structures, which should avoid a few problems, but I am still concerned.
Maybe using multiprocessing for my context is not the right thing to do. I would love to get your feedback about this.
Notes :
I can not use Redis as a data structure server
I thought about using ZeroMQ for IPC but I would avoid using another extra library if multiprocessing can do the job as well.
Thanks in advance for your feedback.
Generally, your division in different workers with different tasks as well as your plan to let them communicate already looks good. However, one thing you should be aware of is whenever a processing step is I/O or CPU bound. If you are I/O bound, I'd go for the threading module whenever you can: the memory footprint of your application will be smaller and the communication between threads can be more efficient, as shared memory is allowed. Only if you need additional CPU power, go for multiprocessing. In your system, you can use both (it looks like process 3 (or more) will do some heavy computing, while the other workers will predominantly be I/O bound).
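A rough sketch of mixing the two, assuming an I/O-bound fetch stage (a plain thread) feeding a CPU-bound processing stage (a process pool); `fetch`, `heavy`, and the payload format are invented for illustration:

```python
import threading
import queue
import multiprocessing as mp

def fetch(out_q, n):
    # I/O-bound stage: a thread suffices and shares memory cheaply with
    # the main thread via a plain queue.Queue.
    for i in range(n):
        out_q.put(f"payload-{i}")
    out_q.put(None)  # sentinel: no more data

def heavy(item):
    # Stand-in for the CPU-bound stage; a process pool escapes the GIL.
    return item.upper()

def run(n=3):
    fetched = queue.Queue()
    t = threading.Thread(target=fetch, args=(fetched, n))
    t.start()
    batch = []
    while (item := fetched.get()) is not None:
        batch.append(item)
    t.join()
    with mp.Pool(2) as pool:
        return pool.map(heavy, batch)

if __name__ == "__main__":
    print(run())  # ['PAYLOAD-0', 'PAYLOAD-1', 'PAYLOAD-2']
```

The thread and the pool can of course run concurrently in a real daemon (with the pool consuming as items arrive); batching here just keeps the sketch short.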