In the following snippet, as I understand, a pool of two processes is being created and then the main script enters an infinite loop while continuously checking for messages and delegates the task to some function action_fn if it finds any message.
from multiprocessing import Pool

p = Pool(processes=2)
while True:
    message = receive_message_from_queue()
    if message is not None:
        # Do some task
        p.map_async(action_fn, [temp_iterables])
What would happen here if there are 100 messages in the queue? Will Python create 100 processes? Or will only two messages be processed at any time? Also, in a case such as this, how do I kill the process when its task is done and recreate it when there is a new message?
The Pool of Workers is a design pattern which aims to separate the service logic from the business logic.
Service logic means all the logic needed to support a given task, such as data storage and retrieval, metrics, logging and error handling.
Business logic instead refers to the components which do the "actual job" such as enriching or transforming the data, generating statistics etc.
It is usually implemented adopting the Publisher/Subscriber design pattern where one or more workers listen to a queue of jobs which is fed from the service side.
Most Pool implementations require the user to set a fixed number of workers when the Pool is created. Some more advanced ones allow the number of workers to change dynamically.
Jobs can be scheduled in a non-blocking (asynchronous) fashion, allowing the service to continue its execution flow, or in blocking (synchronous) mode, stopping the execution until the results are ready.
In your specific example, you are declaring a Pool with 2 workers. Assuming you are using the multiprocessing.Pool class, the interpreter will start 2 processes which will wait for new jobs. When you call map_async, the iterable gets split into multiple chunks which get enqueued inside the Pool internal queue. The workers will pick the chunks in the order they arrive, run the action_fn function against them and publish the results in a second results queue which gets consumed by the service.
Multiple calls to map_async result in more chunks getting appended to the internal queue. Virtually the queue is infinite in size. Practically, if you manage to fill it up, the subsequent call to map_async would block until the workers make some more space for new jobs to be enqueued.
You don't need to "kill the process when it is done" as the Pool manages the workflow for you in a transparent manner. Concretely, the process never dies. It simply picks the next task from the queue and executes it until there are no more tasks available. At that point it goes to sleep until either new tasks are scheduled or the Pool itself is terminated.
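To see this behaviour for yourself, here is a minimal sketch that schedules 100 "messages" on a Pool of 2 workers and records which process handled each one. The body of action_fn and the run_demo helper are made up for illustration, and the "fork" start method is an assumption that keeps the sketch short (POSIX only).

```python
import multiprocessing as mp
import os


def action_fn(item):
    # Hypothetical task: report which worker process handled the item.
    return os.getpid(), item * item


def run_demo(n_messages=100, n_workers=2):
    # Assumption: the "fork" start method (POSIX only). With "spawn", the
    # calls below belong under an `if __name__ == "__main__":` guard.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_workers) as pool:
        # One map_async call per message, as in the loop from the question.
        pending = [pool.map_async(action_fn, [i]) for i in range(n_messages)]
        # Each call returns a list with a single (pid, value) tuple.
        results = [r.get(timeout=30)[0] for r in pending]
    worker_pids = {pid for pid, _ in results}
    values = [value for _, value in results]
    return worker_pids, values


worker_pids, values = run_demo()
print(len(worker_pids), "worker process(es) handled", len(values), "messages")
```

All 100 jobs go through the same worker processes: no new process is created per message, and the jobs that cannot run yet simply wait in the Pool's internal queue.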
Related
I am quite experienced in single-threaded Python as well as embarrassingly parallel multiprocessing, but this is the first time I have attempted to process something with a producer thread and a consumer thread via a shared queue.
The producer thread is going to download data items from URLs and put them in a queue. Simultaneously, a consumer thread is going to process the data items as they arrive on the queue.
Eventually, there will be no more data items to download and the program should terminate. I wish for the consumer thread to be able to distinguish whether it should keep waiting at an empty queue, because more items may be coming in, or it should terminate, because the producer thread is done.
I am considering signaling the latter situation by placing a special object on the queue in the producer thread when there are no more data items to download. When the consumer thread sees this object, it then stops waiting at the queue and terminates.
Is this a sensible approach?
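For reference, the sentinel idea described above could look roughly like this. It is only a sketch: the names _DONE, producer, consumer and run are made up, and the download and processing steps are replaced by string stand-ins.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel meaning "no more items will arrive"


def producer(q, urls):
    for url in urls:
        q.put(f"data from {url}")  # stand-in for the real download
    q.put(_DONE)  # tell the consumer that the producer has finished


def consumer(q, processed):
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is _DONE:
            break  # the producer is done, so stop waiting
        processed.append(item.upper())  # stand-in for the real processing


def run(urls):
    q = queue.Queue()
    processed = []
    threads = [
        threading.Thread(target=producer, args=(q, urls)),
        threading.Thread(target=consumer, args=(q, processed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


print(run(["a", "b"]))
```

With several consumer threads, one sentinel per consumer would be needed, since each consumer stops after reading one.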
I have a multithreaded program that has a management thread that at regular intervals puts jobs of various types in a queue shared among many worker threads, that pick jobs up as they are put in the queue (it's a queue.SimpleQueue).
The time that a worker thread needs to finish a job can vary greatly, so basically my need is, from the management thread, to know whether a given job type is already in the queue to avoid putting in another one of the same type. However I see no way to peek into the queue.
You need to use a separate data structure to keep track of the tasks put in the queue. A good idea is to generate a unique task ID for each task and store the IDs in a dictionary.
When a task completes, you set an attribute (say, done=True) using that task ID.
Using an external data store (a database or Redis, for example) might make this easier to manage in a distributed system.
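A minimal sketch of that bookkeeping, in a simplified variant: since the question only cares about job types, it keeps a set of pending job types next to the queue.SimpleQueue instead of a dictionary of task IDs. The TrackedQueue class and its method names are hypothetical.

```python
import queue
import threading


class TrackedQueue:
    """A SimpleQueue plus a record of which job types are still pending."""

    def __init__(self):
        self._queue = queue.SimpleQueue()
        self._pending = set()
        self._lock = threading.Lock()

    def put_unique(self, job_type, payload=None):
        # Called by the management thread: enqueue only if no job of
        # this type is queued or still being worked on.
        with self._lock:
            if job_type in self._pending:
                return False
            self._pending.add(job_type)
        self._queue.put((job_type, payload))
        return True

    def get(self, timeout=None):
        # Called by worker threads to fetch the next (job_type, payload).
        return self._queue.get(timeout=timeout)

    def mark_done(self, job_type):
        # Called by a worker when it has finished a job of this type.
        with self._lock:
            self._pending.discard(job_type)
```

Here a job type counts as pending until a worker calls mark_done, so a job that is currently running also blocks duplicates. If only queued jobs should count, the worker would call mark_done right after get instead.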
There is a large number of field devices (100,000, each having individual IP) from which I have to collect data.
I want to do it in a Python-based scheduler combined with a readily available executable written in C/C++, which handles the communication and readout of the devices. The idea is to communicate with up to ~100 devices in parallel. So the first 100 devices could be read out using a subprocess call to the executable. I don't want to wait for all 100 tasks to complete, because some might take longer while others are faster. Instead I want to start the next process immediately after one task has finished, and so on. So, conducted by a simple "dispatcher", there is a continuous starting of tasks over time.
Question: Which Python API is the best I can use for this purpose?
I considered using the concurrent.futures API, starting a ThreadPoolExecutor and submitting task by task, each starting the executable in a separate thread. ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a process anyway...
But I think that this is not intended to be used in such a way, because each submitted job will be remembered and therefore "kind of stored" in the executor forever; when a job is finished it ends up in status "finished" and is still visible, so I would clutter my executor with finished tasks. So I guess the Executor API is more suitable when there is a given fixed number of tasks to be worked through, like in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for permanently submitting tasks.
The other idea would be to start 100 worker threads in parallel, each working in an endless loop and reading its next task from a Queue object. In this case I can decide on my own which worker a new task is sent to next. I know that this would work, because I have implemented it already. But I have the feeling that there must be a more elegant solution in Python for dispatching tasks.
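For comparison, the ThreadPoolExecutor idea from the question can be sketched as follows. The real C/C++ executable is not known here, so a child Python process that echoes the IP stands in for it; read_device and read_all are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_device(ip):
    # Stand-in for the real reader executable: a child Python process
    # that simply prints the IP it was given.
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", ip]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return ip, done.stdout.strip()


def read_all(ips, max_parallel=100):
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as executor:
        # At most max_parallel subprocesses run at once; as soon as one
        # finishes, its thread picks up the next submitted task.
        futures = [executor.submit(read_device, ip) for ip in ips]
        for future in as_completed(futures):
            ip, output = future.result()
            results[ip] = output
    return results


print(read_all([f"10.0.0.{i}" for i in range(5)], max_parallel=3))
```

In this sketch the Future objects are held by the futures list in the calling code, so they can be dropped once their results have been read.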
I use a list of processes with a queue for each one. Another thread is used to fill these queues one after the other, and the processes fetch the data from them. The problem is that after a while the queues raise an Empty exception from within the processes, but the thread gets a Full exception. When I check the queue size, it is consistent with the exceptions.
To make it worse, this behavior can only be reproduced as part of a large code base; I can't generate a small program to reproduce it.
Anyone had similar issues with multiprocessing queues not being consistent in different processes?
Edit
To add more to the description of the pipeline. I have multiple worker objects, each worker has an input queue (multiprocessing.Queue), a worker queue (multiprocessing.Queue), an output queue (threading.Queue), a worker process (multiprocessing.Process) and a manager thread (threading.Thread).
Against all these workers, I have a single feeder thread (threading.Thread) that adds sample identifiers to the input queues of all workers, one by one. The sample identifiers are very small in size (paths of files) so the feeder thread can keep up with the processes.
The worker gets the sample identifiers from the input queue, reads these samples, processes them and puts them into the worker queue one by one. The manager thread reads the data in the worker queues and puts it into the output queue because multiprocessing.Queue is slower on read.
All .get() and .put() calls have timeouts and I keep track of time it takes to get new data from this pipeline. I also have mechanisms for closing it and reopening it, by joining all processes and threads (even for queues) and then recreating all of them from scratch. When everything is working, the main process goes over the workers and reads the data off of their output queue one by one. It also takes a few ms to read new data most of the time.
This whole pipeline exists two times in my code (used for machine learning with Tensorflow). One instance is used for training and is created close to the beginning of the program, the other is used for testing. The second instance is created after a while of training, it goes over all of my dataset and then resets. When the second instance is run for the second time it gets stuck after 1000 samples or so. When it is stuck and I break on debug mode in the main process, I see that the input queue is full and the worker and output queues are empty. When I then break inside one of the worker processes I see that their input queue is empty. It seems like for some reason the worker process sees a different input queue than it should. Note that this is not some race issue because this result is stable.
Edit 2
I zeroed in on the point where the program hangs. It seems to be when performing json.loads() on the data read from a file. This means that the problem is different from what I originally described: the processes hang, rather than seeing an empty queue.
code for opening the file:
with open(json_path, 'r') as f:
    data = f.read()
    json_data = json.loads(data)  # <== program hangs at this line
I tried using signal.alarm to pinpoint where in json.loads() the program hangs, but the exception is never raised. The problem is reproduced with a single multiprocessing.Process as well, but not when all processing is done in the main process.
Does this ring a bell for anyone?
I'm fairly familiar with the python multiprocessing module, but I'm unsure of how to implement this setup. My project has this basic flow:
request serial device -> gather response -> parse response -> assert response -> repeat
Right now it is a sequential operation that loops over these steps until it has gathered the desired number of asserted responses. I was hoping to speed this task up by having a 'master process' do the first two operations, and then pass off the parsing and assertion task to a queue of worker processes. However, this is only beneficial if the master process is ALWAYS running. I'm guaranteed to be working on a multi-core machine.
Is there any way to have a process in the multiprocessing module always have focus / make run so I can achieve this?
From what I can gather (assuming that you don't have a stringent requirement that the master is always logging the data from the serial device), you just want the master to be ready to give any worker a chunk of data and to receive data from any worker as soon as that worker is ready.
To achieve this, use two queues and multiprocessing:
Multiprocessing Queue in Python
How to use multiprocessing queue in Python?
This should be sufficient for your needs if time(parse data) >> time(gather data).
Here's one way to implement your workflow:
Have two multiprocessing.Queue objects: tasks_queue and results_queue. The tasks_queue will hold device outputs, and results_queue will hold results of the assertions.
Have a pool of workers, where each worker pulls device output from tasks_queue, parses it, asserts, and puts the result of the assertion on the results_queue.
Have another process continuously polling the device and putting device output on the tasks_queue.
Have one last process continuously polling results_queue, and ending the overall program when the desired number of results (successful assertions) is reached.
Total number of processes (multiprocessing.Process objects) is 2 + k, where k is the number of workers in the pool.
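A rough sketch of this layout follows, with several simplifications that are assumptions rather than part of the answer: the serial device is faked with strings, the "assertion" is a made-up even-number check, the main process plays the role of the process polling results_queue, and the "fork" start method is used (POSIX only) so that no `if __name__ == "__main__":` guard is needed.

```python
import multiprocessing as mp

STOP = None  # hypothetical sentinel telling a worker to exit


def poll_device(tasks_queue, n_readings, n_workers):
    for i in range(n_readings):
        tasks_queue.put(f"value={i}")  # stand-in for a serial read
    for _ in range(n_workers):
        tasks_queue.put(STOP)  # one sentinel per worker


def worker(tasks_queue, results_queue):
    while True:
        raw = tasks_queue.get()
        if raw is STOP:
            break
        value = int(raw.split("=")[1])     # parse the response
        results_queue.put(value % 2 == 0)  # made-up assertion: value is even


def run(n_readings=20, n_workers=3, wanted=5):
    ctx = mp.get_context("fork")  # assumption: POSIX "fork" start method
    tasks_queue, results_queue = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=poll_device,
                         args=(tasks_queue, n_readings, n_workers))]
    procs += [ctx.Process(target=worker, args=(tasks_queue, results_queue))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    passed = 0
    for _ in range(n_readings):
        if results_queue.get(timeout=30):
            passed += 1
        if passed == wanted:
            break  # desired number of successful assertions reached
    for p in procs:
        p.terminate()  # stop whatever is still running
        p.join()
    return passed


print(run())
```

Because the workers run in separate processes, the process that talks to the device is never held up by parsing, which is what the question was asking for.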