Queueing up workers in Dask - python

I have the following scenario that I need to solve with Dask scheduler and workers:
Dask program has N functions called in a loop (N defined by the user)
Each function is started with delayed(func)(args) to run in parallel.
When each function from the previous point starts, it triggers W workers. This is how I invoke the workers:
futures = client.map(worker_func, worker_args)
worker_responses = client.gather(futures)
That means that I need N * W workers to run everything in parallel. The problem is that this is not optimal: it allocates too many resources, and since I run it in the cloud it gets expensive. Also, N is defined by the user, so I don't know beforehand how much processing capacity I need.
Is there a way to queue up the workers in such a way that if I define that Dask has X workers, when a worker ends then the next one starts?

First, define the number of workers you need; treat them as ephemeral, but static for the entire duration of your processing
You can create them dynamically (when you start or later on), but probably want to have them all ready right at the beginning of your processing
From your view, the client is an executor (so when you refer to workers and running in parallel, you probably mean the same thing)
This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls. When a Client is instantiated it takes over all dask.compute and dask.persist calls by default.
Once your workers are available, Dask will distribute work given to them via the scheduler
You should express any dependency between tasks by passing the preceding function's result (which is a Future, not yet the actual result) into the next dask.delayed() call
This Futures-as-arguments will allow Dask to build a task graph of your work
Example use https://examples.dask.org/delayed.html
Future reference https://docs.dask.org/en/latest/futures.html#distributed.Future
Dependent Futures with dask.delayed
Here's a complete example from the Delayed docs (it actually combines several successive examples that arrive at the same result)
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)  # depends on a and b
    output.append(c)

total = dask.delayed(sum)(output)  # depends on everything
total.compute()  # 45
You can call total.visualize() to see the task graph
(image from Dask Delayed docs)
Collections of Futures
If you're already using .map(..) to map function and argument pairs, you can keep creating Futures and then .gather(..) them all at once, even if they're in a collection (which is convenient in your case)
The .gather()'ed results will be in the same arrangement as they were given (a list of lists)
[[fn1(args11), fn1(args12)], [fn2(args21)], [fn3(args31), fn3(args32), fn3(args33)]]
https://distributed.dask.org/en/latest/api.html#distributed.Client.gather
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

collection_of_futures = []
for worker_func, worker_args in iterable_of_pairs_of_fn_args:
    futures = client.map(worker_func, worker_args)
    collection_of_futures.append(futures)

results = client.gather(collection_of_futures)
notes
worker_args must be some iterable to map to worker_func, which can be a source of error
.gather()ing will block until all the futures are completed or raise
.as_completed()
If you need the results as quickly as possible, you could use .as_completed(..), but note that the results will arrive in a non-deterministic order, so I don't think this makes sense for your case. If you find it does, you'll need some extra guarantees:
include information about what to do with the result in the result
keep a reference to each and check them
only combine groups where it doesn't matter (ie. all the Futures have the same purpose)
also note that the yielded futures are complete, but they are still Future objects, so you still need to call .result() on them or .gather() them (see the short sketch below)
https://distributed.dask.org/en/latest/api.html#distributed.as_completed
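For illustration, here is a minimal, hedged sketch of the .as_completed(..) pattern using the first guarantee from the list above (tagging each task so the result describes itself); worker_func here is a hypothetical stand-in for your own function:

from dask.distributed import Client, as_completed

client = Client()  # or Client("scheduler-address:8786") to connect to an existing cluster

# Hypothetical stand-in: tag each task so its result is self-describing,
# because as_completed yields futures in completion order, not submission order.
def worker_func(tag, x):
    return tag, x * 2

futures = client.map(worker_func, ["a", "b", "c"], [1, 2, 3])
for future in as_completed(futures):
    tag, value = future.result()  # the future is finished, but you still unwrap it
    print(tag, value)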

Related

Multiprocessing task in python

I am trying to figure out how to perform a multiprocessing task with an unusual formulation.
Basically, given two lists containing 10 matrices each, I have to check whether applying an operation (that I'll call fn) gives the same result when the input is (A, B) as when it is (B, A).
With a sequential approach, the solution is straightforward:
# Given
A = [matrix_a1, ... , matrix_a10]
B = [matrix_b1, ... , matrix_b10]

AB_BA = [fn(A[i], B[i]) == fn(B[i], A[i]) for i in range(0, len(A))]
The next task is a bit strange because it requires setting strictly more than ten threads while applying multiprocessing. The restriction is that you cannot assign all the individual comparisons to just ten different processes, because the remaining processes would be unused. I do not know why, but the request seems to use "process" and "thread" interchangeably.
This task seems a bit confusing because in multiprocessing, generally, you set the maximum number of workers, not the minimum.
I tried to use a solution that uses a ProcessPoolExecutor, as follows:
def equality(A, B, i):
    res = fn(A[i], B[i]) == fn(B[i], A[i])
    return res

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    idx = range(0, len(A))
    results = executor.map(equality, A, B, idx)
    for result in results:
        print(result)
My problem is that I am not sure how to check resource usage. I have naively tried to monitor the CPU usage using the Ubuntu system monitor as well as "top" from the command line.
In addition, this solution is the most efficient among those I tried, but there is no direct way to specify that at least 11 workers must be used, so it does not seem to meet what was requested.
I also tried other solutions, such as using Pool directly. This shows 10 Python instances in top, but again, not more than 10. Here's what I tried:
def equality(A, B):
    res = fn(A, B) == fn(B, A)
    return res

with mp.Pool(20) as p:
    print(p.starmap(equality, ((A[i], B[i]) for i in range(0, len(A)))))
Do you have any suggestions to address this request as well as monitor the resource usage to be sure it is working as expected?
Thank you very much for your help in advance.
I wish you had published the actual problem word for word, since your description is a bit unclear. But this is what I know (or think I know):
Unless the CPU work done by your worker function equality is great enough that the gain from running it in parallel more than offsets the additional multiprocessing overhead you would not otherwise have (i.e. starting processes, moving data from one address space to another, etc.), your multiprocessing code will run more slowly. Therefore, you should design your worker function to do as much work as possible and to pass as little data as possible.
When you specify ...
results = executor.map(equality, A, B, idx)
... your equality function will be invoked once for each element of A, B and idx. So what is being passed is not the entire lists A and B but rather individual elements (e.g. matrix_a1 and matrix_b1). Therefore, there is no point in even passing an idx argument:
def equality(matrix_a, matrix_b):
    """
    matrix_a and matrix_b are each single elements of
    lists A and B respectively.
    """
    return fn(matrix_a, matrix_b) == fn(matrix_b, matrix_a)

def main():
    from os import cpu_count
    from concurrent.futures import ProcessPoolExecutor

    A = [matrix_a1, ... , matrix_a10]
    B = [matrix_b1, ... , matrix_b10]
    # Do not create more processes than we have either
    # CPU cores or tasks that need to be submitted:
    pool_size = min(cpu_count(), len(A))
    with ProcessPoolExecutor(max_workers=pool_size) as executor:
        AB_BA = list(executor.map(equality, A, B))
    # This will be a list of 10 elements, each either `True` or `False`:
    print(AB_BA)

# Required for Windows:
if __name__ == '__main__':
    main()
So we will be submitting 10 tasks to a pool size of 10. Internally there is a "task queue" on which all the arguments being passed to equality exist:
matrix_a1, matrix_b1 # task 1
matrix_a2, matrix_b2 # task 2
...
matrix_a10, matrix_b10 # task 10
Any process in the pool that is idle will grab the next task in the queue to work on, and the results will be returned in task-submission order. But since equality is such a short-running function (unless fn is sufficiently complicated), there is the possibility that the pool process that grabs the first task completes it and then grabs the second task before some other pool process has been dispatched by the operating system and can grab it. So there is no guarantee that all 10 tasks will be worked on in parallel by 10 pool processes, even if fn is sufficiently CPU-intensive. If you were to insert a call to time.sleep(.1) at the beginning of equality, that would give the other pool processes a chance to "wake up" and grab their own tasks from the task queue. But that would slow your program down even more, since sleeping for this purpose is totally non-productive. The point I am trying to make is that you cannot ensure that all pool processes will always be active concurrently.
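If you want to see which pool process actually handled each task, rather than inferring it from top, a small sketch along these lines can help; traced_equality is a hypothetical stand-in for the real comparison, and the sleep merely simulates enough work for several pool processes to get involved:

import os
import time
from concurrent.futures import ProcessPoolExecutor

def traced_equality(i):
    # Stand-in for the real comparison; the sleep simulates CPU-bound work
    # long enough for several pool processes to pick up tasks.
    time.sleep(0.5)
    return i, os.getpid()

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        for task_id, pid in executor.map(traced_equality, range(10)):
            print(f"task {task_id} ran in process {pid}")

Counting the distinct PIDs printed shows how many pool processes actually participated; with very little work per task it may well be fewer than 10.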

Dask: Gather futures remotely

I have a large number of futures pointing to data which I subsequently need to aggregate remotely in another task:
futures = [client.submit(get_data, i) for i in range(0, LARGE_NUMBER)]
agg_future = client.submit(aggregate, futures)
The issue with the above is that the client complains about the size of the argument I am submitting due to the large number of futures.
If I were willing to pull the data back to the client, I would simply use gather to collect the results:
agg_local = aggregate(client.gather(futures))
This, however, I would explicitly like to avoid. Is there a way (ideally non-blocking) to effectively gather the futures results within a remote task without having the client complain about the size of the list of futures being aggregated?
If your workload really suits the creation of very many futures and aggregating them in a single function, you could simply ignore the warning and continue.
However, you might find it more efficient to perform something like the tree summation example from the docs. That case is for delayed, but a client.submit version would look pretty similar, something like replacing the line
lazy = add(L[i], L[i + 1])
with
lazy = client.submit(agg_func, L[i], L[i + 1])
but you will have to figure out a version of your aggregation function which can work pairwise to produce a grand result. Presumably, this would result in a rather large number of futures in-play on the scheduler, which may cause additional latency, so profile to see what works well!
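For concreteness, here is a minimal sketch of that pairwise tree reduction with client.submit; get_data, agg_func and LARGE_NUMBER are placeholders standing in for your real task, aggregation function and problem size, and agg_func must be associative for the tree to produce the same grand result:

from dask.distributed import Client

client = Client()  # or Client("scheduler-address:8786") for an existing cluster

def get_data(i):
    # Placeholder for the real data-producing task from the question
    return i

def agg_func(x, y):
    # Placeholder pairwise aggregation; it must be associative for a tree reduction
    return x + y

LARGE_NUMBER = 1000  # placeholder
futures = [client.submit(get_data, i) for i in range(0, LARGE_NUMBER)]

# Repeatedly combine neighbouring futures until a single future remains;
# every combination step runs on the workers, not on the client.
L = futures
while len(L) > 1:
    new_L = []
    for i in range(0, len(L), 2):
        if i + 1 < len(L):
            new_L.append(client.submit(agg_func, L[i], L[i + 1]))
        else:
            new_L.append(L[i])  # odd element carried over to the next round
    L = new_L

agg_future = L[0]
print(agg_future.result())  # == sum(range(LARGE_NUMBER)) with these placeholders

The appeal of this shape is that each submit call only carries two futures as arguments, so the client never has to ship one enormous list of futures in a single task.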
I think you could probably do this within a worker:
>>> from dask.distributed import get_client
>>> def f():
...     client = get_client()
...     futures = client.map(lambda x: x + 1, range(10))  # spawn many tasks
...     results = client.gather(futures)
...     return sum(results)
>>> future = client.submit(f)
>>> future.result()
55
This example is lifted directly from the docs on get_client

Write python with joblib in parallel in the list

I use joblib to run work in parallel, and I want to write the results into a list.
To avoid problems, I create an ldata = [] list beforehand, so that it can be easily accessed.
During parallelization the data is available in the list, but no longer when the results are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing

data = []

def worker(i):
    ldata = []
    ...  # create list ldata
    data[i].append(ldata)

for i in range(0, 1000):
    data.append([])

num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))

resultlist = []
for i in range(0, 1000):
    resultlist.extend(data[i])
You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that by default it starts new worker copies of the master process, serialises the input data, sends it over to the workers, has them iterate over it, and then collects the return values. Any change a worker performs on data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
    ...
    return ldata

data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
The second option is to force shared-memory semantics, which uses threads instead of processes. When workers execute in threads, their memory space is that of the master process, which is where data lies originally. To enforce these semantics, add the require='sharedmem' keyword argument to the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.
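For illustration, here is a minimal self-contained sketch of that first option; the body of worker is just a placeholder for whatever really builds ldata:

from joblib import Parallel, delayed
import multiprocessing

def worker(i):
    # Placeholder for the real work: build and return the per-item list
    ldata = [i, i ** 2]
    return ldata

num_cores = multiprocessing.cpu_count()

# Parallel(...) collects the workers' return values in input order
data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))

resultlist = []
for ldata in data:
    resultlist.extend(ldata)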

Most efficient way to multiprocess separate functions over same object

In a python script, I have a large dataset that I would like to apply multiple functions to. The functions are responsible for creating certain outputs that get saved to the hard drive.
A few things of note:
the functions are independent
none of the functions return anything
the functions will take variable amounts of time
some of the functions may fail, and that is fine
Can I multiprocess this in a way such that each function and the dataset are sent separately to a core and run there? That way I would not need the first function to finish before the second one can kick off; there is no need for them to be sequentially dependent.
Thanks!
Since your functions are independent and only read the data, they are also thread safe, as long as it is not a problem if the data happens to be modified while a function is running.
Use a thread pool. You would have to create one task per function you want to run.
Note: in order for it to run on more than one core you must use Python multiprocessing; otherwise all the threads will run on a single core. This happens because Python has a Global Interpreter Lock (GIL). For more information, see "Python threads all executing on a single core". A process-pool sketch follows below.
Alternatively, you could use Dask, which partitions the data in order to run the work in parallel. While it adds some overhead, it might be quicker for your needs.
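For concreteness, a minimal hedged sketch of the process-pool route; func_a and func_b are hypothetical stand-ins for your real functions, which save their own output and return nothing:

from concurrent.futures import ProcessPoolExecutor

# Hypothetical independent functions; each writes its own output and returns nothing
def func_a(dataset):
    pass  # e.g. compute a summary and write it to disk

def func_b(dataset):
    pass  # e.g. render a plot and write it to disk

if __name__ == '__main__':
    dataset = list(range(1_000_000))  # stand-in for the large dataset
    functions = [func_a, func_b]

    with ProcessPoolExecutor(max_workers=len(functions)) as executor:
        futures = [executor.submit(fn, dataset) for fn in functions]
        for future in futures:
            try:
                future.result()  # surface any exception; failures are tolerated
            except Exception as exc:
                print(f"a task failed: {exc}")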
I was in a similar situation as yours, and used Processes with the following function:
import multiprocessing as mp

def launch_proc(nproc, lst_functions, lst_args, lst_kwargs):
    n = len(lst_functions)
    r = 1 if n % nproc > 0 else 0
    for b in range(n // nproc + r):
        bucket = []
        for p in range(nproc):
            i = b * nproc + p
            if i == n:
                break
            proc = mp.Process(target=lst_functions[i], args=lst_args[i], kwargs=lst_kwargs[i])
            bucket.append(proc)
        for proc in bucket:
            proc.start()
        for proc in bucket:
            proc.join()
This has a major drawback: all Processes in a bucket have to finish before a new bucket can start. I tried to use a JoinableQueue to avoid this, but could not make it work.
Example:
def f(i):
    print(i)

nproc = 2
n = 11
lst_f = [f] * n
lst_args = [[i] for i in range(n)]
lst_kwargs = [{}] * n
launch_proc(nproc, lst_f, lst_args, lst_kwargs)
Hope it can help.

how to throttle a large number of tasks without using all workers

Imagine I have a Dask grid with 10 workers and 40 cores in total. This is a shared grid, so I don't want to fully saturate it with my work. I have 1000 tasks to do, and I want to submit (and have actively running) a maximum of 20 tasks at a time.
To be concrete,
from time import sleep
from random import random

def inc(x):
    from random import random
    sleep(random() * 2)
    return x + 1

def double(x):
    from random import random
    sleep(random())
    return 2 * x
>>> from distributed import Executor
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=10 threads=40>
If I setup a system of Queues
>>> from queue import Queue
>>> input_q = Queue()
>>> remote_q = e.scatter(input_q)
>>> inc_q = e.map(inc, remote_q)
>>> double_q = e.map(double, inc_q)
This will work, BUT, this will just dump ALL of my tasks to the grid, saturating it. Ideally I could:
e.scatter(input_q, max_submit=20)
It seems that the example from the docs here would allow me to use a maxsize queue. But it looks like, from a user perspective, I would still have to deal with the backpressure myself. Ideally Dask would take care of this automatically.
Use maxsize=
You're very close. All of scatter, gather, and map take the same maxsize= keyword argument that Queue takes. So a simple workflow might be as follows:
Example
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

your_input_data = list(range(1000))

from queue import Queue               # Put your data into a queue
q = Queue()
for i in your_input_data:
    q.put(i)

from dask.distributed import Executor
e = Executor('127.0.0.1:8786')        # Connect to cluster

futures = e.map(inc, q, maxsize=20)   # Map inc over data
results = e.gather(futures)           # Gather results

L = []
while not q.empty() or not futures.empty() or not results.empty():
    L.append(results.get())  # this blocks waiting for all results
All of q, futures, and results are Python Queue objects. The q and results queues don't have a limit, so they'll greedily pull in as much as they can. The futures queue however has a maximum size of 20, so it will only allow 20 futures in flight at any given time. Once the leading future is complete it will immediately be consumed by the gather function and its result will be placed into the results queue. This frees up space in futures and causes another task to be submitted.
Note that this isn't exactly what you wanted. These queues are ordered so futures will only get popped off when they're in the front of the queue. If all of the in-flight futures have finished except for the first they'll still stay in the queue, taking up space. Given this constraint you might want to choose a maxsize= slightly more than your desired 20 items.
Extending this
Here we do a simple map->gather pipeline with no logic in between. You could also put other map computations in here or even pull futures out of the queues and do custom work with them on your own. It's easy to break out of the mold provided above.
The solution posted on github was very useful - https://github.com/dask/distributed/issues/864
Solution:
inputs = iter(inputs)
futures = [c.submit(func, next(inputs)) for i in range(maxsize)]
ac = as_completed(futures)

for finished_future in ac:
    # submit new future
    try:
        new_future = c.submit(func, next(inputs))
        ac.add(new_future)
    except StopIteration:
        pass
    result = finished_future.result()
    ...  # do stuff with result
Query:
However, to determine which workers are free so I can throttle the tasks, I am trying to use the client.has_what() API. It seems that the load on the workers is not reflected immediately, unlike what is shown on the status UI page. At times it takes quite a while for has_what to reflect any data.
Is there another API that can be used to determine the number of free workers, which can then be used to set the throttle range, similar to what the UI is using?
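As a hedged sketch (not necessarily what the status UI itself uses), you could compare the tasks currently running per worker, via client.processing(), with each worker's thread count from client.nthreads(); note that both reflect the scheduler's current view, which can also lag slightly behind reality:

from dask.distributed import Client

client = Client('127.0.0.1:8786')  # same scheduler address as in the question

def free_worker_slots(client):
    # processing() maps each worker address to the task keys it is currently
    # running; nthreads() maps each worker address to its thread count.
    processing = client.processing()
    nthreads = client.nthreads()
    return sum(max(nthreads.get(w, 1) - len(tasks), 0)
               for w, tasks in processing.items())

print(free_worker_slots(client))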
