I am confused about the difference between client.persist() and client.compute(). Both seem (in some cases) to start my calculations, and both return asynchronous objects, but not in this simple example:
from dask.distributed import Client
from dask import delayed

client = Client()

def f(*args):
    return args

result = [delayed(f)(x) for x in range(1000)]

x1 = client.compute(result)
x2 = client.persist(result)
Here x1 and x2 are different, but in a less trivial calculation where result is also a list of Delayed objects, using client.persist(result) starts the calculation just as client.compute(result) does.
Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures
As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.
Client.persist returns a copy of each Dask collection with its previously-lazy computations now submitted to run on the cluster. The task graph of each collection now just points to the currently running Future objects.
So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.
Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. It is typically used for small results.
So if you compute a dask.dataframe with 100 partitions you get back a Future pointing to a single Pandas dataframe that holds all of the data.
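To make that concrete, here is a minimal sketch contrasting the two calls on the same collection (the file pattern and partition count are made up):

import dask.dataframe as dd
from dask.distributed import Client

client = Client()
ddf = dd.read_csv('data-*.csv')    # assume this yields 100 partitions

persisted = client.persist(ddf)    # still a dask dataframe with 100 partitions,
                                   # each backed by a future running on the cluster
future = client.compute(ddf)       # a single Future
result = future.result()           # one pandas DataFrame holding all of the data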
More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.
In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.
import dask.dataframe as dd

df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist()  # compute up to here, keep results in memory

>>> df.value.max().compute()
100
>>> df.value.min().compute()
0
When using delayed
Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. persist will give you back a lazy dask.delayed object, while compute will give you back an immediate Future object.
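For a single Delayed the contrast looks like this (a minimal sketch; the function is illustrative):

from dask import delayed
from dask.distributed import Client

client = Client()

def inc(x):
    return x + 1

d = delayed(inc)(1)

d2 = client.persist(d)   # still a lazy Delayed, now backed by work on the cluster
fut = client.compute(d)  # an immediate Future
fut.result()             # 2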
Related
I have the following scenario that I need to solve with Dask scheduler and workers:
Dask program has N functions called in a loop (N defined by the user)
Each function is started with delayed(func)(args) to run in parallel.
When each function from the previous point starts, it triggers W workers. This is how I invoke the workers:
futures = client.map(worker_func, worker_args)
worker_responses = client.gather(futures)
That means that I need N * W workers to run everything in parallel. The problem is that this is not optimal: it's too much resource allocation, I run it on the cloud, and it's expensive. Also, N is defined by the user, so I don't know beforehand how much processing capability I need to have.
Is there a way to queue up the workers in such a way that if I define that Dask has X workers, when a worker ends then the next one starts?
First define the number of workers you need, treat them as ephemeral, but static for the entire duration of your processing
You can create them dynamically (when you start or later on), but probably want to have them all ready right at the beginning of your processing
From your point of view, the client is an executor (so when you refer to workers and running in parallel, you probably mean the same thing):
This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls. When a Client is instantiated it takes over all dask.compute and dask.persist calls by default.
Once your workers are available, Dask will distribute work given to them via the scheduler
You should make any tasks that depend on each other do so by passing the preceding function's result (which is a Future, not yet the actual result) into the next dask.delayed() call
Passing Futures as arguments like this allows Dask to build a task graph of your work
Example use https://examples.dask.org/delayed.html
Future reference https://docs.dask.org/en/latest/futures.html#distributed.Future
Dependent Futures with dask.delayed
Here's a complete example from the Delayed docs (it actually combines several successive examples into one)
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)  # depends on a and b
    output.append(c)

total = dask.delayed(sum)(output)  # depends on everything
total.compute()  # 45
You can call total.visualize() to see the task graph
(image from Dask Delayed docs)
Collections of Futures
If you're already using .map(..) to map function and argument pairs, you can keep creating Futures and then .gather(..) them all at once, even if they're in a collection (which is convenient to you here)
The .gather()'ed results will be in the same arrangement as they were given (a list of lists)
[[fn1(args11), fn1(args12)], [fn2(args21)], [fn3(args31), fn3(args32), fn3(args33)]]
https://distributed.dask.org/en/latest/api.html#distributed.Client.gather
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

collection_of_futures = []
for worker_func, worker_args in iterable_of_pairs_of_fn_args:
    futures = client.map(worker_func, worker_args)
    collection_of_futures.append(futures)

results = client.gather(collection_of_futures)
notes
worker_args must be some iterable to map to worker_func, which can be a source of error
.gather()ing will block until all of the futures are complete (or one of them raises)
.as_completed()
If you need the results as quickly as possible, you could use .as_completed(..), but note that the results will arrive in a non-deterministic order, so I don't think this makes sense for your case. If you find it does, you'll need some extra guarantees:
include information about what to do with the result in the result
keep a reference to each and check them
only combine groups where it doesn't matter (i.e. all the Futures have the same purpose)
also note that the yielded futures are complete, but they are still Future objects, so you still need to call .result() on them or .gather() them
https://distributed.dask.org/en/latest/api.html#distributed.as_completed
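A minimal sketch of the .as_completed(..) pattern (the function and inputs are illustrative):

from dask.distributed import Client, as_completed

client = Client()

def work(x):
    return x * 2

futures = client.map(work, range(10))

for future in as_completed(futures):
    # the yielded future is finished, but you still need .result()
    print(future.result())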
I have a small dataframe (about ~100MB) and an expensive computation that I want to perform for each row. It is not a vectorizable computation; it requires some parsing and a DB lookup for each row.
As such, I have decided to try Dask to parallelize the task. The task is "embarrassingly parallel" and order of execution or repeated execution is no issue. However, for some unknown reason, memory usage blows up to about ~100GB.
Here is the offending code sample:
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import LSFCluster
cluster = LSFCluster(memory="6GB", cores=1, project='gRNA Library Design')
cluster.scale(jobs=16)
client = Client(cluster)
required_dict = load_big_dict()
score_guide = lambda row: expensive_computation(required_dict, row)
library_df = pd.read_csv(args.library_csv)
meta = library_df.dtypes
meta = meta.append(pd.Series({
    'specificity': np.dtype('int64'),
    'cutting_efficiency': np.dtype('int64'),
    '0 Off-targets': np.dtype('object'),
    '1 Off-targets': np.dtype('object'),
    '2 Off-targets': np.dtype('object'),
    '3 Off-targets': np.dtype('object')}))
library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.apply(score_guide, axis=1, meta=meta)
library_ddf = library_ddf.compute()
library_ddf = library_ddf.drop_duplicates()
library_ddf.to_csv(args.outfile, index=False)
My guess is that somehow the big dictionary required for lookup is the issue, but its size is only ~1.5GB in total and is not included in the resultant dataframe.
Why might Dask be blowing up memory usage?
Not 100% sure this will resolve it in this case, but you can try to futurize the dictionary:
# broadcasting makes sure that every worker has a copy
[fut_dict] = client.scatter([required_dict], broadcast=True)
score_guide = lambda row: expensive_computation(fut_dict, row)
What this does is put a copy of the dict on every worker and store a reference to the object in fut_dict, obviating the need to hash the large dict on every call to the function:
Every time you pass a concrete result (anything that isn’t delayed) Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well.
Note that this will eat away a part of each worker's memory (e.g. given your information, each worker will have 1.5GB allocated for the dict). You can read more in this Q&A.
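As background, a Future passed as an explicit argument to client.submit or client.map is resolved to its value on the worker before the function runs; here is a minimal sketch with made-up names:

from dask.distributed import Client

client = Client()

big_lookup = {'a': 1, 'b': 2}  # stand-in for required_dict

# one copy per worker, referenced through a Future
[fut_lookup] = client.scatter([big_lookup], broadcast=True)

def score(lookup, key):
    # on the worker, lookup is the real dict, not the Future
    return lookup[key]

future = client.submit(score, fut_lookup, 'a')
future.result()  # 1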
The problem is that required_dict needs to be serialized and sent to all of the workers. Since required_dict is large and many workers need it simultaneously, the repeated serializations cause a massive memory blowup.
There are many fixes; for me it was easiest to simply load the dictionary inside the worker tasks and explicitly use map_partitions instead of apply.
Here is the solution in code:
def do_df(df):
    required_dict = load_big_dict()
    score_guide = lambda row: expensive_computation(required_dict, row)
    return df.apply(score_guide, axis=1)

library_ddf = dd.from_pandas(library_df, npartitions=128)
library_ddf = library_ddf.map_partitions(do_df)
library_ddf = library_ddf.compute()
I'm creating a function that reads an entire folder, creates a Dask dataframe, then processes the partitions of this dataframe and sums the results, like this:
import dask.dataframe as dd
from dask import delayed, compute
def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    partial_results = []
    for partition in df.partitions:
        partial = another_function(partition)
        partial_results.append(partial)
    total = delayed(sum)(partial_results)
    return total
The function being called in partitions_func (another_function) is also delayed.
@delayed
def another_function(partition):
    # Partition processing
    return result
I checked and the variables created during the processing are all small, so they shouldn't cause any issues. The partitions can be quite large but not larger than the available RAM.
When I execute partitions_func(folder), the process gets killed. At first, I thought the problem had to do with having two levels of delayed, one on another_function and one on delayed(sum).
Removing the delayed decorator from another_function causes issues because the argument is a Dask dataframe and you can't do operations like tolist() on it. I tried removing delayed from sum because I thought it could be a problem with parallelisation and the available resources, but the process also gets killed.
However, I know there are 5 partitions. If I remove the statement total = delayed(sum)(partial_results) from partitions_func and compute the sum "manually" instead, everything works as expected:
total = partial_results[0].compute() + partial_results[1].compute() + partial_results[2].compute() \
+ partial_results[3].compute() + partial_results[4].compute()
Thanks!
A Dask dataframe is built from a series of delayed objects, so when you call a delayed function (another_function) on a partition you end up with a nested delayed, which dask.compute cannot handle. One option is to use .map_partitions(); the typical example is df.map_partitions(len).compute(), which will compute the length of each partition. So if you can rewrite another_function to accept a pandas dataframe and remove the delayed decorator, your code will roughly look like this:
df = dd.read_csv(f'{folder}/*.csv')
total = df.map_partitions(another_function)
Now total is a lazy Dask object which you can pass to dask.compute (or simply run total = df.map_partitions(another_function).compute()).
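For instance, a minimal sketch assuming another_function reduces each partition to a single number (the column name is made up):

import dask.dataframe as dd

def another_function(partition_pdf):
    # plain pandas in, one number out per partition
    return partition_pdf['value'].sum()

df = dd.read_csv(f'{folder}/*.csv')                   # folder as in the question
per_partition = df.map_partitions(another_function)   # lazy, one value per partition
total = per_partition.sum().compute()                 # combine and run the whole graph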
I am coding in python a machine learning task and I distribute it with spark.
I use spark 1.3.1 with python 2.7 on ubuntu (The master and one worker with 2 slots are on the same machine)
My (pseudo)code:
p_params = sc.parallelize(small_index_collection,numSlices=4)
eval_grid = p_params.map(highly_computational_intensive_mapper)
# in eval_grid we will have a dictionary with some numbers,
# representing various performance metrics
p1 = eval_grid.map(lambda x: x['dict_entry_1']).collect()
#Spark is lazy so basically the p1 will trigger the compute intensive mapper
p2 = eval_grid.map(lambda x: x['dict_entry_2']).collect()
p3 = eval_grid.map(lambda x: x['dict_entry_3']).collect()
p4 = eval_grid.map(lambda x: x['dict_entry_4']).collect()
......
I am timing each operation and p1 takes ~ the same amount of time as p2, p3. In the logs, I also see the highly_computational_intensive_mapper being called for each collect() action.
What am I doing wrong? Is the eval_grid RDD deleted from workers after each collect()? Do I have to specify some flags? Mark the RDD somehow? Do some sort of action directly on eval_grid before the aggregation mappers and then run the px = ... code on resulting RDD? What action should I use?
Thanks!
p.s. I didn't try any of the enumerated methods yet.
p.p.s. A similar question is Why the RDD is not persisted in memory for every iteration in spark?, but in my case the RDD is recomputed, not loaded from disk. And of course there is no code there.
You need to call cache on eval_grid so that after the first run it is stored in memory. There is some buffer caching that should occur, but if you want true storage, then cache it.
On its own, eval_grid is just a graph describing how to compute the data. Each time you call an action (collect) on it, Spark runs through that graph again. cache short-circuits that DAG and grabs the data directly from memory.
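A minimal sketch of where the cache() call goes, reusing the names from the question:

eval_grid = p_params.map(highly_computational_intensive_mapper).cache()

p1 = eval_grid.map(lambda x: x['dict_entry_1']).collect()  # the expensive mapper runs once here
p2 = eval_grid.map(lambda x: x['dict_entry_2']).collect()  # reuses the cached RDD
p3 = eval_grid.map(lambda x: x['dict_entry_3']).collect()

If the RDD might not fit in memory, persist(StorageLevel.MEMORY_AND_DISK) is the more explicit alternative to cache().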
I have an interesting multiprocessing problem with structure that I might be able to exploit. The problem involves a largish Pandas DataFrame df with ~80 columns and a function func that operates on pairs (~80*79/2 pairs) of those columns and takes a fairly short amount of time on each run.
The code looks like this:
from itertools import combinations
from multiprocessing import Manager, Pool

mgr = Manager()
ns = mgr.Namespace()
ns.df = df

pool = Pool(processes=16)
args = [(ns, list(combo)) for combo in list(combinations(df.columns, 2))]
results = pool.map(func, args)
pool.close()
The above is faster than without the pool, but only by a factor of 7 or so. I'm worried that the overhead from so many calls is the issue. Is there a good way to exploit the structure here for multiprocessing?
That is a fairly standard result. Nothing will scale perfectly linearly when run in parallel because of the overhead required to set up each process and pass data between processes. Keep in mind that (80 * 79) / 2 = 3,160 is actually a very small number assuming the function is not extremely computationally intensive (i.e. takes a really long time). All else equal, the faster the function the greater the overhead cost to using multiprocessing because the time to set up an additional process is relatively fixed.
Overhead on multiprocessing mainly comes in memory if you have to make several duplications of a large dataset (one duplication for each process if the function is poorly designed) because processes do not share memory. Assuming your function is set up such that it can be easily parallelized, adding more processes is good so long as you do not exceed the number of processors on your computer. Most home computers do not have 16 processors (at most 8 is typical) and your result (that it is 7 times faster in parallel) is consistent with you having fewer than 16 processors. You can check the number of processors on your machine with multiprocessing.cpu_count().
EDIT:
If you parallelize a function by passing the column string then it will repeatedly make copies of the dataframe. For example:
def StringPass(string1, string2):
    return df[string1] * df[string2]
If you parallelize StringPass it will copy the data frame at least once per process. In contrast:
def ColumnPass(column1, column2):
    return column1 * column2
If you pass just the necessary columns ColumnPass will only copy the columns necessary for each call to the function when run in parallel. So while StringPass(string1, string2) and ColumnPass(df[string1], df[string2]) will return the same result, in multiprocessing the former will make several inefficient copies of the global df, while the latter will only copy the necessary columns for each call to the function.
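Putting it together, a minimal sketch of the ColumnPass approach (the DataFrame here is a random stand-in for your real df, and the reduction inside the worker is made up):

import multiprocessing as mp
from itertools import combinations

import numpy as np
import pandas as pd

def column_pass(col1, col2):
    # only the two columns are pickled and sent to the worker process
    return (col1 * col2).sum()

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(1000, 8), columns=list('abcdefgh'))
    pairs = [(df[a], df[b]) for a, b in combinations(df.columns, 2)]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.starmap(column_pass, pairs)
    print(len(results))  # one result per column pair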