Dask: Gather futures remotely - python

I have a large number of futures pointing to data which I subsequently need to aggregate remotely in another task:
futures = [client.submit(get_data, i) for i in range(0, LARGE_NUMBER)]
agg_future = client.submit(aggregate, futures)
The issue with the above is that the client complains about the size of the argument I am submitting due to the large number of futures.
If I were willing to pull the data back to the client, I would simply use gather to collect the results:
agg_local = aggregate(client.gather(futures))
This, however, I would explicitly like to avoid. Is there a way (ideally non-blocking) to effectively gather the futures results within a remote task without having the client complain about the size of the list of futures being aggregated?

If your workload really suits creating a great many futures and aggregating them in a single function, you could simply ignore the warning and continue.
However, you might find it more efficient to perform something like the tree summation example from the docs. That case is for delayed, but a client.submit version would look pretty similar, something like replacing the line
lazy = add(L[i], L[i + 1])
with
lazy = client.submit(agg_func, L[i], L[i + 1])
but you will have to figure out a version of your aggregation function that can work pairwise to produce the grand result; a sketch follows below. Presumably this would leave a rather large number of futures in play on the scheduler, which may add latency, so profile to see what works well!
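For illustration, here is a minimal sketch of such a pairwise tree reduction with client.submit, assuming agg_func(a, b) is a hypothetical pairwise version of your aggregate that combines two partial results into one:

L = futures
while len(L) > 1:
    new_L = []
    for i in range(0, len(L) - 1, 2):
        # combine neighbouring partial results on the cluster
        new_L.append(client.submit(agg_func, L[i], L[i + 1]))
    if len(L) % 2:
        new_L.append(L[-1])  # odd one out: carry it up to the next level
    L = new_L

agg_future = L[0]  # a single future holding the grand result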

I think you could probably do this within a worker:
>>> from dask.distributed import get_client
>>> def f():
...     client = get_client()
...     futures = client.map(lambda x: x + 1, range(10))  # spawn many tasks
...     results = client.gather(futures)
...     return sum(results)
>>> future = client.submit(f)
>>> future.result()
55
This example is lifted directly from the docs on get_client

Related

Queueing up workers in Dask

I have the following scenario that I need to solve with Dask scheduler and workers:
Dask program has N functions called in a loop (N defined by the user)
Each function is started with delayed(func)(args) to run in parallel.
When each function from the previous point starts, it triggers W workers. This is how I invoke the workers:
futures = client.map(worker_func, worker_args)
worker_responses = client.gather(futures)
That means that I need N * W workers to run everything in parallel. The problem is that this is not optimal, as it allocates too many resources; I run it on the cloud and it's expensive. Also, N is defined by the user, so I don't know beforehand how much processing capacity I need.
Is there a way to queue up the workers in such a way that if I define that Dask has X workers, when a worker ends then the next one starts?
First define the number of workers you need, treat them as ephemeral, but static for the entire duration of your processing
You can create them dynamically (when you start or later on), but probably want to have them all ready right at the beginning of your processing
From your point of view, the client is an executor (so when you refer to workers and running in parallel, you probably mean the same thing)
This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls. When a Client is instantiated it takes over all dask.compute and dask.persist calls by default.
Once your workers are available, Dask will distribute work given to them via the scheduler
You should make tasks that depend on each other do so by passing the preceding function's result (which is a Future, not yet the concrete result) into the next dask.delayed() call
Passing Futures as arguments like this allows Dask to build a task graph of your work
Example use https://examples.dask.org/delayed.html
Future reference https://docs.dask.org/en/latest/futures.html#distributed.Future
Dependent Futures with dask.delayed
Here's a complete example from the Delayed docs (it actually combines several successive examples into one result)
import dask
from dask.distributed import Client
client = Client(...) # connect to distributed cluster
def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)    # depends on a and b
    output.append(c)

total = dask.delayed(sum)(output)  # depends on everything
total.compute()                    # 45
You can call total.visualize() to see the task graph
(image from Dask Delayed docs)
Collections of Futures
If you're already using .map(..) to map function and argument pairs, you can keep creating Futures and then .gather(..) them all at once, even if they're in a collection (which is convenient for you here)
The .gather()'ed results will be in the same arrangement as they were given (a list of lists)
[[fn1(args11), fn1(args12)], [fn2(args21)], [fn3(args31), fn3(args32), fn3(args33)]]
https://distributed.dask.org/en/latest/api.html#distributed.Client.gather
import dask
from dask.distributed import Client
client = Client(...) # connect to distributed cluster
collection_of_futures = []
for worker_func, worker_args in iterable_of_pairs_of_fn_args:
    futures = client.map(worker_func, worker_args)
    collection_of_futures.append(futures)

results = client.gather(collection_of_futures)
notes
worker_args must be some iterable to map to worker_func, which can be a source of error
.gather()ing will block until all the futures are completed, or raise if any of them failed
.as_completed()
If you need the results as quickly as possible, you could use .as_completed(..), but note the results will come back in a non-deterministic order, so I don't think this makes sense for your case. If you find it does, you'll need some extra guarantees:
include information about what to do with the result in the result
keep a reference to each future and check them
only combine groups where the order doesn't matter (i.e. all the Futures have the same purpose)
Also note that the yielded futures are complete but are still Future objects, so you still need to call .result() on them (or .gather() them); a minimal sketch follows below.
https://distributed.dask.org/en/latest/api.html#distributed.as_completed
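As a sketch only (reusing the worker_func and worker_args names from the question), consuming results as they finish could look like this:

from dask.distributed import as_completed

futures = client.map(worker_func, worker_args)
for future in as_completed(futures):
    result = future.result()  # the future has already finished, so this does not block
    ...                       # decide here what to do with `result`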

Using local memory in Pool workers with python's multiprocessing module

I'm working on implementing a randomized algorithm in python. Since this involves doing the same thing many (say N) times, it rather naturally parallelizes and I would like to make use of that. More specifically, I want to distribute the N iterations on all of the cores of my CPU. The problem in question involves computing the maximum of something and is thus something where every worker could compute their own maximum and then only report that one back to the parent process, which then only needs to figure out the global maximum out of those few local maxima.
Somewhat surprisingly, this does not seem to be an intended use-case of the multiprocessing module, but I'm not entirely sure how else to go about it. After some research I came up with the following solution (toy problem to find the maximum in a list that is structurally the same as my actual one):
import random
import multiprocessing

l = []
N = 100
numCores = multiprocessing.cpu_count()

# globals for every worker
mySendPipe = None
myRecPipe = None

def doWork():
    pipes = zip(*[multiprocessing.Pipe() for i in range(numCores)])
    pool = multiprocessing.Pool(numCores, initializeWorker, (pipes,))
    pool.map(findMax, range(N))
    results = []
    # collate results
    for p in pipes[0]:
        if p.poll():
            results.append(p.recv())
    print(results)
    return max(results)

def initializeWorker(pipes):
    global mySendPipe, myRecPipe
    # ID of a worker process; they are consistently named PoolWorker-i
    myID = int(multiprocessing.current_process().name.split("-")[1]) - 1
    # Modulo: when starting a second pool for the second iteration of doWork(),
    # the workers are named with IDs 5-8.
    mySendPipe = pipes[1][myID % numCores]
    myRecPipe = pipes[0][myID % numCores]

def findMax(count):
    myMax = 0
    if myRecPipe.poll():
        myMax = myRecPipe.recv()
    value = random.choice(l)
    if myMax < value:
        myMax = value
    mySendPipe.send(myMax)

l = range(1, 1001)
random.shuffle(l)
max1 = doWork()

l = range(1001, 2001)
random.shuffle(l)
max2 = doWork()

print(max1, max2)
This works, sort of, but I've got a problem with it. Namely, using pipes to store the intermediate results feels rather silly (and is probably slow). But it also has the real problem that I can't send arbitrarily large things through the pipe, and my application unfortunately sometimes exceeds this size (and deadlocks).
So, what I'd really like is a function analogue to the initializer that I can call once for every worker on the pool to return their local results to the parent process. I could not find such functionality, but maybe someone here has an idea?
A few final notes:
I use a global variable for the input because in my application the input is very large and I don't want to copy it to every process. Since the processes never write to it, I believe it should never be copied (or am I wrong there?). I'm open to suggestions to do this differently, but mind that I need to run this on changing inputs (sequentially though, just like in the example above).
I'd like to avoid using the Manager-class, since (by my understanding) it introduces synchronisation and locks, which in this problem should be completely unnecessary.
The only other similar question I could find is Python's multiprocessing and memory, but they wish to actually process the individual results of the workers, whereas I do not want the workers to return N things, but to instead only run a total of N times and return only their local best results.
I'm using Python 2.7.15.
tl;dr: Is there a way to use local memory for every worker process in a multiprocessing pool, so that every worker can compute a local optimum and the parent process only needs to worry about figuring out which one of those is best?
You might be overthinking this a little.
By making your worker-functions (in this case findMax) actually return a value instead of communicating it, you can store the result from calling pool.map() - it is just a parallel variant of map, after all! It will map a function over a list of inputs and return the list of results of that function call.
The simplest example illustrating my point follows your "distributed max" example:
import multiprocessing
# [0,1,2,3,4,5,6,7,8]
x = range(9)
# split the list into 3 chunks
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
input = zip(*[iter(x)]*3)
pool = multiprocessing.Pool(2)
# compute the max of each chunk:
# max((0,1,2)) == 2
# max((3,4,5)) == 5
# ...
res = pool.map(max, input)
print(res)
This returns [2, 5, 8].
Be aware that there is some light magic going on: I use the built-in max() function, which expects an iterable as input. Now, if I were to pool.map over a plain list of integers, say range(9), that would result in calls to max(0), max(1), etc. - not very useful, huh? Instead, I partition my list into chunks, so effectively, when mapping, we now map over a list of tuples, thus feeding a tuple to max on each call.
So perhaps you have to:
return a value from your worker func
think about how you structure your input domain so that you feed meaningful chunks to each worker (see the sketch just below)
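A rough sketch of that shape for the toy "find the maximum in a shuffled list" problem (the chunking scheme here is just one illustrative choice; adapt it to your real input):

import multiprocessing
import random

def local_max(chunk):
    # each worker computes only its local optimum and returns that single value
    return max(chunk)

if __name__ == "__main__":
    l = list(range(1, 1001))
    random.shuffle(l)

    num_cores = multiprocessing.cpu_count()
    chunk_size = -(-len(l) // num_cores)  # ceiling division
    chunks = [l[i:i + chunk_size] for i in range(0, len(l), chunk_size)]

    pool = multiprocessing.Pool(num_cores)
    local_maxima = pool.map(local_max, chunks)  # one small result per worker
    print(max(local_maxima))                    # the parent only reduces the local maxima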
PS: You wrote a great first question! Thank you, it was a pleasure reading it :) Welcome to StackOverflow!

how to throttle a large number of tasks without using all workers

Imagine I have a dask grid with 10 workers and 40 cores total. This is a shared grid, so I don't want to fully saturate it with my work. I have 1000 tasks to do, and I want to submit (and have actively running) a maximum of 20 tasks at a time.
To be concrete,
from time import sleep
from random import random
def inc(x):
    from random import random
    sleep(random() * 2)
    return x + 1

def double(x):
    from random import random
    sleep(random())
    return 2 * x
>>> from distributed import Executor
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=10 threads=40>
If I setup a system of Queues
>>> from queue import Queue
>>> input_q = Queue()
>>> remote_q = e.scatter(input_q)
>>> inc_q = e.map(inc, remote_q)
>>> double_q = e.map(double, inc_q)
This will work, BUT, this will just dump ALL of my tasks to the grid, saturating it. Ideally I could:
e.scatter(input_q, max_submit=20)
It seems that the example from the docs here would allow me to use a maxsize queue. But from a user perspective it looks like I would still have to deal with the backpressure myself. Ideally dask would take care of this automatically.
Use maxsize=
You're very close. All of scatter, gather, and map take the same maxsize= keyword argument that Queue takes. So a simple workflow might be as follows:
Example
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

your_input_data = list(range(1000))

from queue import Queue                # Put your data into a queue
q = Queue()
for i in your_input_data:
    q.put(i)

from dask.distributed import Executor
e = Executor('127.0.0.1:8786')         # Connect to cluster

futures = e.map(inc, q, maxsize=20)    # Map inc over data
results = e.gather(futures)            # Gather results

L = []
while not q.empty() or not futures.empty() or not results.empty():
    L.append(results.get())            # this blocks waiting for all results
All of q, futures, and results are Python Queue objects. The q and results queues don't have a limit, so they'll greedily pull in as much as they can. The futures queue however has a maximum size of 20, so it will only allow 20 futures in flight at any given time. Once the leading future is complete it will immediately be consumed by the gather function and its result will be placed into the results queue. This frees up space in futures and causes another task to be submitted.
Note that this isn't exactly what you wanted. These queues are ordered so futures will only get popped off when they're in the front of the queue. If all of the in-flight futures have finished except for the first they'll still stay in the queue, taking up space. Given this constraint you might want to choose a maxsize= slightly more than your desired 20 items.
Extending this
Here we do a simple map->gather pipeline with no logic in between. You could also put other map computations in here or even pull futures out of the queues and do custom work with them on your own. It's easy to break out of the mold provided above.
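For instance, here is a sketch that bounds work in flight at every stage of the question's inc/double pipeline. This assumes the same old Executor queue API used above, where scatter, map, and gather all accept the maxsize= keyword:

remote_q = e.scatter(input_q, maxsize=20)
inc_q = e.map(inc, remote_q, maxsize=20)
double_q = e.map(double, inc_q, maxsize=20)
result_q = e.gather(double_q)  # results can be .get() from this queue as they arrive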
The solution posted on github was very useful - https://github.com/dask/distributed/issues/864
Solution:
from dask.distributed import as_completed

inputs = iter(inputs)
futures = [c.submit(func, next(inputs)) for i in range(maxsize)]
ac = as_completed(futures)

for finished_future in ac:
    # submit a new future to replace the one that just finished
    try:
        new_future = c.submit(func, next(inputs))
        ac.add(new_future)
    except StopIteration:
        pass
    result = finished_future.result()
    ...  # do stuff with result
Query:
However, to determine which workers are free so I can throttle the tasks, I am trying to use the client.has_what() API. It seems the load on the workers is not reflected immediately, unlike what is shown on the status UI page. At times it takes quite a while for has_what to reflect any data.
Is there another API that can be used to determine the number of free workers, which could then be used to set the throttle range, similar to what the UI uses?

Using Concurrent Futures without running out of RAM

I'm doing some file parsing that is a CPU-bound task. No matter how many files I throw at the process, it uses no more than about 50MB of RAM.
The task is parallelisable, and I've set it up below using concurrent futures to parse each file in a separate process:
from concurrent import futures

with futures.ProcessPoolExecutor(max_workers=6) as executor:
    # A dictionary which will contain a list of the future info in the key, and the filename in the value
    jobs = {}

    # Loop through the files, and run the parse function for each file, sending the file-name to it.
    # The results can come back in any order.
    for this_file in files_list:
        job = executor.submit(parse_function, this_file, **parser_variables)
        jobs[job] = this_file

    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        # Send the result of the file the job is based on (jobs[job]) and the job (job.result)
        results_list = job.result()
        this_file = jobs[job]
        # delete the result from the dict as we don't need to store it.
        del jobs[job]
        # post-processing (putting the results into a database)
        post_process(this_file, results_list)
The problem is that when I run this using futures, RAM usage rockets and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_processing, the application has no further need of them. As you can see, I'm trying del jobs[job] to clear items out of jobs, but this has made no difference, memory usage remains unchanged, and seems to increase at the same rate.
I've also confirmed it's not because it's waiting on the post_process function by only using a single process, plus throwing in a time.sleep(1).
There's nothing in the futures docs about memory management, and while a brief search indicates it has come up before in real-world applications of futures (Clear memory in python loop and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures) - the answers don't translate to my use-case (they're all concerned with timeouts and the like).
So, how do you use Concurrent futures without running out of RAM?
(Python 3.5)
I'll take a shot (Might be a wrong guess...)
You might need to submit your work bit by bit, since on each submit you're making a copy of parser_variables, which may end up chewing up your RAM.
Here is working code with "<----" on the interesting parts
with futures.ProcessPoolExecutor(max_workers=6) as executor:
    # A dictionary which will contain a list of the future info in the key, and the filename in the value
    jobs = {}

    # Loop through the files, and run the parse function for each file, sending the file-name to it.
    # The results can come back in any order.
    files_left = len(files_list)   # <----
    files_iter = iter(files_list)  # <------

    while files_left:
        for this_file in files_iter:
            job = executor.submit(parse_function, this_file, **parser_variables)
            jobs[job] = this_file
            if len(jobs) > MAX_JOBS_IN_QUEUE:
                break  # limit the job submission for now

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):
            files_left -= 1  # one down - many to go... <---
            # Send the result of the file the job is based on (jobs[job]) and the job (job.result)
            results_list = job.result()
            this_file = jobs[job]
            # delete the result from the dict as we don't need to store it.
            del jobs[job]
            # post-processing (putting the results into a database)
            post_process(this_file, results_list)
            break  # give a chance to add more jobs <-----
Try adding del to your code like this:
for job in futures.as_completed(jobs):
    del jobs[job]  # or `val = jobs.pop(job)`
    # del job     # or `job._result = None`
Looking at the concurrent.futures.as_completed() function, I learned it is enough to ensure there is no longer any reference to the future. If you dispense this reference as soon as you've got the result, you'll minimise memory usage.
I use a generator expression for storing my Future instances because everything I care about is already returned by the future in its result (basically, the status of the dispatched work). Other implementations use a dict for example like in your case, because you don't return the input filename as part of the thread workers result.
Using a generator expression means once the result is yielded, there is no longer any reference to the Future. Internally, as_completed() has already taken care of removing its own reference, after it yielded the completed Future to you.
futures = (executor.submit(thread_worker, work) for work in workload)

for future in concurrent.futures.as_completed(futures):
    output = future.result()
    ...  # on next loop iteration, garbage will be collected for the result data, too
Edit: Simplified from using a set and removing entries, to simply using a generator expression.
Same problem for me.
In my case I need to start millions of threads. For Python 2, I would write a thread pool myself using a dict. But in Python 3 I encountered the following error when I deleted finished threads dynamically:
RuntimeError: dictionary changed size during iteration
So I have to use concurrent.futures, at first I coded like this:
from concurrent.futures import ThreadPoolExecutor
......

if __name__ == '__main__':
    all_resouces = get_all_resouces()
    with ThreadPoolExecutor(max_workers=50) as pool:
        for r in all_resouces:
            pool.submit(handle_resource, *args)
But memory was soon exhausted, because memory is only released after all threads have finished. I need to delete finished threads before too many threads have started. So I read the docs here: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
I found that Executor.shutdown(wait=True) might be what I need.
And this is my final solution:
from concurrent.futures import ThreadPoolExecutor
......

if __name__ == '__main__':
    all_resouces = get_all_resouces()
    i = 0
    while i < len(all_resouces):
        with ThreadPoolExecutor(max_workers=50) as pool:
            for r in all_resouces[i:i+1000]:
                pool.submit(handle_resource, *args)
        i += 1000
You can avoid having to call this method explicitly if you use the with statement, which will shutdown the Executor (waiting as if Executor.shutdown() were called with wait set to True)
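As an alternative to fixed 1000-item batches, here is a sketch that keeps a bounded window of futures in flight instead, so finished futures (and their results) can be dropped as you go. It reuses the asker's handle_resource and get_all_resouces names, passes the loop variable r instead of the elided *args, and the window size is arbitrary:

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

MAX_IN_FLIGHT = 100

def run_bounded(resources):
    with ThreadPoolExecutor(max_workers=50) as pool:
        in_flight = set()
        for r in resources:
            if len(in_flight) >= MAX_IN_FLIGHT:
                # block until at least one future finishes, then drop the done ones
                done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            in_flight.add(pool.submit(handle_resource, r))
        wait(in_flight)  # drain whatever is still running

run_bounded(get_all_resouces())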

python for large data processing

I'm relatively new to Python, and have been able to answer most of my questions based on similar problems answered on forums, but I'm at a point where I'm stuck and could use some help.
I have a simple nested for loop script that generates an output of strings. What I need to do next is have each grouping go through a simulation, based on numerical values that the strings will be matched to.
Really, my question is how do I go about this in the best way? I'm not sure if multithreading will work, since the strings are generated and then need to undergo the simulation one set at a time. I was reading about queues and wasn't sure if the combinations could be passed into a queue for storage and then undergo the simulation in the same order they entered the queue.
Regardless of the research I've done, I'm open to any suggestion anyone can make on the matter.
Thanks!
Edit: I'm not looking for an answer on how to do the simulation, but rather a way to store the combinations while simulations are being computed.
example
X = ["a","b"]
Y = ["c","d","e"]
Z= ["f","g"]
for A in itertools.combinations(X,1):
for B in itertools.combinations(Y,2):
for C in itertools.combinations(Z, 2):
D = A + B + C
print(D)
As was hinted at in the comments, the multiprocessing module is what you're looking for. Threading won't help you because of the Global Interpreter Lock (GIL), which limits execution to one Python thread at a time. In particular, I would look at multiprocessing pools. These objects give you an interface to have a pool of subprocesses do work for you in parallel with the main process, and you can go back and check on the results later.
Your example snippet could look something like this:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()  # by default, this will create a number of workers equal to
                               # the number of CPU cores you have available

combination_list = []  # create a list to store the combinations
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            combination_list.append(D)  # append this combination to the list

results = pool.map(simulation_function, combination_list)
# simulation_function is the function you're using to actually run your
# simulation - assuming it only takes one parameter: the combination
The call to pool.map is blocking - meaning that once you call it, execution in the main process will halt until all the simulations are complete, but it is running them in parallel. When they complete, whatever your simulation function returns will be available in results, in the same order that the input arguments were in the combination_list.
If you don't want to wait for them, you can also use apply_async on your pool and store the result to look at later:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()

result_list = []  # create a list to store the simulation results
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            result_list.append(pool.apply_async(
                simulation_function,
                args=(D,)))  # note the extra comma - args must be a tuple

# do other stuff
# now iterate over result_list to check the results when they're ready
If you use this structure, result_list will be full of multiprocessing.AsyncResult objects, which allow you to check whether they are ready with result.ready() and, if so, retrieve the result with result.get(). This approach kicks off each simulation as soon as its combination is calculated, instead of waiting until all of them have been calculated before processing starts. The downside is that it's a little more complicated to manage and retrieve the results: you have to make sure a result is ready before calling .get(), and you need to be prepared to catch exceptions that may have been raised in the worker function. The caveats are explained pretty well in the documentation; a rough way to drain result_list is sketched below.
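For instance, one rough way to drain result_list as described (the polling interval is arbitrary):

import time

finished_results = []
while result_list:
    still_pending = []
    for result in result_list:
        if result.ready():
            # .get() returns the value, or re-raises any exception from the worker
            finished_results.append(result.get())
        else:
            still_pending.append(result)
    result_list = still_pending
    if result_list:
        time.sleep(0.1)  # avoid spinning while simulations are still running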
If calculating the combinations doesn't actually take very long and you don't mind your main process halting until they're all ready, I suggest the pool.map approach.
