I'm using map_async with processes that return a lot of data. With the normal map_async, all the results are held in memory and only returned once everything has been processed. To get around this, I've used a generator approach from:
Combining itertools and multiprocessing?
However, this doesn't make full use of the pool (as in, if 29 workers have finished and 1 is still hanging, it won't start the next batch of jobs until everyone is done). Is there a way to have map_async, or some similar function, send its results to a callback as each worker finishes?
What you want is a producer-consumer-based solution. The producer puts tasks in a multiprocessing.Queue, and the consumers (subprocesses) get and process them in a loop.
This is a good SO question with a (detailed) possible solution.
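A minimal sketch of that pattern (process_item and handle are placeholders for the real per-item work and the per-result callback; everything else is the standard multiprocessing API):

    import multiprocessing as mp

    def process_item(item):
        return item * 2                    # placeholder for the real per-item work

    def handle(result):
        print(result)                      # placeholder for the per-result callback

    def worker(task_q, result_q):
        # consume tasks until the sentinel arrives, pushing each result
        # back as soon as it is ready
        for item in iter(task_q.get, None):
            result_q.put(process_item(item))

    if __name__ == "__main__":
        task_q, result_q = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(task_q, result_q))
                 for _ in range(4)]
        for p in procs:
            p.start()
        n_tasks = 100
        for t in range(n_tasks):
            task_q.put(t)
        for _ in procs:
            task_q.put(None)               # one sentinel per worker
        for _ in range(n_tasks):
            handle(result_q.get())         # handled as soon as any worker finishes one
        for p in procs:
            p.join()

Because each worker pushes its result the moment it is done, the main process handles results one at a time instead of holding everything in memory until the whole map completes.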
There is a large number of field devices (100,000, each with its own IP address) from which I have to collect data.
I want to do this with a Python-based scheduler combined with a readily available executable written in C/C++ that handles the communication with and readout of the devices. The idea is to communicate with up to ~100 devices in parallel. So the first 100 devices could be read out using a subprocess call to the executable. I don't want to wait for all 100 tasks to complete, because some might take longer while others finish faster. Instead, I want to put the next process on its journey immediately after one task has finished, and so on. So, conducted by a simple "dispatcher", tasks are started continuously over time.
Question: Which Python API is best suited for this purpose?
I considered using the concurrent.futures API, starting a ThreadPoolExecutor and submitting task by task, each starting the executable in a separate thread. A ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a process anyway...
But I think that this is not intended to be used in such a way, because each submitted job is remembered and therefore "kind of stored" in the executor forever; when a job is finished it ends up in the status "finished" and is still visible, so I would clutter up my executor with finished tasks. So I guess the Executor API is more suitable when there is a given, fixed number of tasks to be worked through, as in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for continuously submitting tasks.
The other idea would be to start 100 worker threads in parallel, each working in an endless loop and reading its next task from a Queue object. In this case I can dispatch on my own which worker a new task is sent to next. I know that this would work, because I have already implemented it. But I have the feeling that there must be a more elegant solution in Python for dispatching tasks. A minimal sketch of what I have in mind is below.
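(In this sketch, ./readout and device_ips are placeholders for the real executable and the real device list:)

    import queue
    import subprocess
    import threading

    def worker(task_q):
        while True:
            ip = task_q.get()
            if ip is None:                  # sentinel: no more devices
                break
            # hypothetical executable name; replace with the real readout tool
            subprocess.run(["./readout", ip])

    device_ips = []                         # placeholder: the 100,000 addresses
    task_q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(task_q,)) for _ in range(100)]
    for t in threads:
        t.start()
    # a free worker picks up the next IP immediately, so nobody waits
    # for the slowest device of a "batch"
    for ip in device_ips:
        task_q.put(ip)
    for _ in threads:
        task_q.put(None)
    for t in threads:
        t.join()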
I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem: a simple server with an endpoint that consumes from a queue, and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run, but I don't want it to block the server from receiving or servicing requests. I tried using multithreading, but this producing process is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and have it run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which actually seems to be several questions), then multiprocessing.Queue seems to be exactly what you need. I've used it in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
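A minimal sketch of that setup, assuming a bounded queue and a placeholder make_item for the slow producing work:

    import multiprocessing as mp
    import time

    def make_item():
        time.sleep(1)          # stand-in for the slow, CPU-heavy producing work
        return "item"

    def producer(q):
        # runs in its own process, so the server's GIL is not involved;
        # q.put blocks while the queue is full, which pauses the producer
        # until the server consumes something
        while True:
            q.put(make_item())

    if __name__ == "__main__":
        q = mp.Queue(maxsize=100)                 # bounded, so "full" is well defined
        p = mp.Process(target=producer, args=(q,), daemon=True)
        p.start()
        while True:                               # stand-in for the server loop
            item = q.get()                        # consume on each request
            print("served", item)

Note that giving the queue a maxsize makes the "run until the queue is full again" logic automatic: q.put simply blocks inside the producer process whenever the queue is full, and resumes as soon as the server consumes an item.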
In Python, what is the difference between using thread.join vs queue.join? I feel that they could both do the same job in some scenarios, especially if there is a one-to-one correspondence between the threads spawned and the items picked from the queue for a job. Is it something like: if you are going to use threading with a queue, it is best to rely on queue.join, and if you are just doing something in parallel where no queue data structure is used, but rather something like a list, you could use thread.join? Of course, in the thread.join scenario you need to mention all the threads spawned.
Also, just as an aside: is a queue something you would normally use for consuming input? I think in the scenario of chaining inputs for another job it makes sense to use it for output as well, but is a queue in general meant for processing input? Can someone clarify?
Queue.join will wait for the queue to be empty (actually, until Queue.task_done has been called for each item after processing it). Thread.join will block until all the threads have terminated. The behavior when using one or the other might be similar if all the threads take items from the queue, perform a task and return when there's nothing left. However, you can still have threads which don't use a queue at all, in which case Queue.join would be useless.
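A minimal sketch contrasting the two (do_task is a placeholder):

    import queue
    import threading

    q = queue.Queue()

    def do_task(item):
        pass                     # placeholder for real work

    def worker():
        while True:
            item = q.get()
            if item is None:
                q.task_done()
                break
            do_task(item)
            q.task_done()        # without this, q.join() blocks forever

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for item in range(20):
        q.put(item)

    q.join()                     # returns once every item has had task_done() called
    for _ in threads:
        q.put(None)              # sentinels so the worker loops can exit
    for t in threads:
        t.join()                 # returns once each thread has actually terminated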
I want to spawn X Pool workers and give each of them an equal share of the work to do. My issue is that the work takes about 20 minutes to exhaust (longer with each extra process running), and due to the type of calculations being done, my answer may be found within minutes or hours. What I would like is some way for a single worker to go "HEY, I FOUND IT" and use that signal to kill the remainder of the pool and move on with my calculations.
Key points:
I have tried callbacks; they don't seem to run on starmap_async until the entire pool finishes.
I only care about the first suitable answer found.
I am not sharing resources, and surprise process death, albeit rude, is perfectly acceptable.
I've also considered using a Queue, but it wouldn't make sense because the scope of the work I'm passing to each worker is already built into the parameters of the function.
Below is a very stripped-down version of what I'm working with (the calculations in the real code can take hours to finish over an iterable of 4.2 billion complex items).
from multiprocessing import Pool

def doWork():
    global workers
    workers = Pool(2)
    # distSearch and Sections1_5 are defined elsewhere in the real code
    results = workers.starmap_async(func=distSearch, iterable=Sections1_5,
                                    callback=killPool)
    workers.close()
    print("Found answer : {}".format(results.get()))
    workers.join()

def killPool(result):
    # the callback receives the full result list as its single argument;
    # workers is global so the callback can reach the pool
    workers.terminate()
    print("Worker Pool Terminated")
I should probably specify that my worker function only returns if it finds an answer; otherwise it just exits once done. I have looked at this thread, but it has me completely lost, and it seems like a lot of overhead to constantly check for the win condition when that should come in the return/callback of the worker pool.
All the answers I've found result in significant overhead from supervising the worker pool; I'm looking for a solution that sources the kill signal at the worker level, autonomously.
I'm looking for a solution that sources the kill signal at the worker level, autonomously.
AFAIK, that doesn't exist. The methods of the Pool object (like Pool.terminate) should only be used in the process that created the pool.
What you could do is use Pool.imap_unordered. This returns an iterator in the parent process over the results, which yields each result as soon as it becomes available. As soon as the desired result pops up, you can then use Pool.terminate().
Edit:
From looking at the 3.5 implementation, starmap_async returns a MapResult instance, which is not an iterator.
You can wrap multiple inputs in a tuple and use imap_unordered over a list of those.
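A minimal sketch of that suggestion, reusing the names from the question (the bodies of distSearch and Sections1_5 are invented here just to make the example self-contained):

    from multiprocessing import Pool

    def distSearch(lo, hi):
        # invented placeholder: "find" 42 if it falls inside this section
        return 42 if lo <= 42 < hi else None

    Sections1_5 = [(0, 10), (10, 50), (50, 100)]   # invented placeholder sections

    def search_one(args):
        # imap_unordered passes a single argument per task,
        # so unpack the original argument tuple here
        return distSearch(*args)

    if __name__ == "__main__":
        with Pool(2) as workers:
            for result in workers.imap_unordered(search_one, Sections1_5):
                if result is not None:     # assumes "no answer" comes back as None
                    print("Found answer : {}".format(result))
                    break                  # leaving the with block terminates the pool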
My question is inspired by a comment on the Solving embarrassingly parallel problems with multiprocessing post.
I am asking about the general case where Python multiprocessing is used to (1) read data from a file, (2) manipulate the data, and (3) write the results to a file. In the case I describe, data read from the file is passed to a queue A by (1) and fetched from that queue A by (2). (2) also passes results to a separate queue B, and (3) fetches results from that queue B to write them to file.
When (1) is done, it passes a STOP signal* to queue A so (2) knows queue A is empty. (2) then terminates and passes a STOP signal to queue B so (3) knows queue B is empty and terminates when it has used up the results queue.
So is there any need to call the multiprocessing Process.join() method on (1) and (2)? I would have thought that (2) will not finish until (1) finishes and sends its STOP signal? For (3) it makes sense to wait, as any subsequent instructions might otherwise proceed without (3) being finished.
But maybe calling the .join() method costs nothing and can be used just to avoid having to think about it?
*actually, the STOP signal consists of a sequence of N stop signals where N is equivalent to the number of processes running in (2).
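For concreteness, here is a minimal sketch of the pipeline described above, with stand-ins for the actual file I/O and data manipulation:

    import multiprocessing as mp

    N = 4  # number of worker processes in stage (2)

    def reader(qa):
        for item in range(100):        # stand-in for reading data from file
            qa.put(item)
        for _ in range(N):
            qa.put(None)               # one STOP signal per stage-(2) worker

    def worker(qa, qb):
        for item in iter(qa.get, None):
            qb.put(item * 2)           # stand-in for manipulating the data
        qb.put(None)                   # tell the writer this worker is done

    def writer(qb):
        stops = 0
        while stops < N:               # keep going until every worker said STOP
            item = qb.get()
            if item is None:
                stops += 1
            else:
                print(item)            # stand-in for writing results to file

    if __name__ == "__main__":
        qa, qb = mp.Queue(), mp.Queue()
        procs = ([mp.Process(target=reader, args=(qa,))]
                 + [mp.Process(target=worker, args=(qa, qb)) for _ in range(N)]
                 + [mp.Process(target=writer, args=(qb,))])
        for p in procs:
            p.start()
        for p in procs:
            p.join()                   # cheap for (1) and (2), which have already exited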
According to the docs, it is safe to call join multiple times; this suggests that if p has already stopped, p.join() will return immediately. That means that if you expect p to have already stopped by this time, the cost of joining it should be negligible. If p hasn't stopped (as you say you expect the writer process might not have), there is a potential cost to joining it, depending on what your main process needs to do. If it does any user interaction, it will appear hung. If that is a problem, you might consider this type of pattern:
while p.is_alive():
    iterate_mainloop()      # placeholder: one step of your UI/event loop
    p.join(small_timeout)   # placeholder: a short timeout in seconds
But if that process doesn't do user interaction, joining the others should be fine. That seems to be the most likely situation here: if you can afford to be blocked waiting for a disk read, you should also be fine waiting for another process to complete (modulo any defensive timeouts in case it misbehaves).