I'm creating a function that reads an entire folder, creates a Dask dataframe, then processes the partitions of this dataframe and sums the results, like this:
import dask.dataframe as dd
from dask import delayed, compute
def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    partial_results = []
    for partition in df.partitions:
        partial = another_function(partition)
        partial_results.append(partial)
    total = delayed(sum)(partial_results)
    return total
The function being called in partitions_func (another_function) is also delayed.
@delayed
def another_function(partition):
    # Partition processing
    return result
I checked and the variables created during the processing are all small, so they shouldn't cause any issues. The partitions can be quite large but not larger than the available RAM.
When I execute partitions_func(folder), the process gets killed. At first, I thought the problem had to do with having two levels of delayed: one on another_function and one on delayed(sum).
Removing the delayed decorator from another_function causes issues because the argument is then a Dask dataframe and operations like tolist() aren't available on it. I also tried removing delayed from sum, thinking the problem might be parallelisation and the available resources, but the process still gets killed.
However, I know there are 5 partitions. If I remove the statement total = delayed(sum)(partial_results) from partitions_func and compute the sum "manually" instead, everything works as expected:
total = partial_results[0].compute() + partial_results[1].compute() + partial_results[2].compute() \
+ partial_results[3].compute() + partial_results[4].compute()
Thanks!
A Dask dataframe is itself built from a series of delayed objects, so when you call a delayed function (another_function) on its partitions you create nested delayed objects, which dask.compute cannot handle properly. One option is to use .map_partitions(); the typical example is df.map_partitions(len).compute(), which computes the length of each partition. So if you can rewrite another_function to accept a pandas dataframe and remove the delayed decorator, your code will roughly look like this:
df = dd.read_csv(f'{folder}/*.csv')
total = df.map_partitions(another_function)
Now total is a lazy Dask object which you can pass to dask.compute (or simply run total = df.map_partitions(another_function).compute()).
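For illustration, a minimal sketch of that rewrite, assuming another_function can be expressed as a plain pandas function that reduces each partition to a single number (the len-based body is a placeholder, and summing across partitions mirrors the delayed(sum) from the question):
import dask.dataframe as dd

def another_function(partition):
    # 'partition' is now a plain pandas DataFrame; this body is a placeholder
    return len(partition)

def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    # one lazy result per partition; nothing has been read or computed yet
    per_partition = df.map_partitions(another_function)
    # sum across partitions and trigger the actual work
    return per_partition.sum().compute()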
I have the following scenario that I need to solve with Dask scheduler and workers:
Dask program has N functions called in a loop (N defined by the user)
Each function is started with delayed(func)(args) to run in parallel.
When each function from the previous point starts, it triggers W workers. This is how I invoke the workers:
futures = client.map(worker_func, worker_args)
worker_responses = client.gather(futures)
That means that I need N * W workers to run everything in parallel. The problem is that this is not optimal: it's too much resource allocation, I run it on the cloud, and it's expensive. Also, N is defined by the user, so I don't know beforehand how much processing capability I need.
Is there a way to queue up the work so that, if I define that Dask has X workers, the next task starts as soon as a worker finishes one?
First define the number of workers you need, treat them as ephemeral, but static for the entire duration of your processing
You can create them dynamically (when you start or later on), but probably want to have them all ready right at the beginning of your processing
From your view, the client is an executor (so when you refer to workers and running in parallel, you probably mean the same thing)
This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls. When a Client is instantiated it takes over all dask.compute and dask.persist calls by default.
Once your workers are available, Dask will distribute work given to them via the scheduler
You should express any dependencies between tasks by passing the preceding function's result (which is a Future, not yet the actual result) into the next dask.delayed() call
Passing Futures as arguments like this allows Dask to build a task graph of your work
Example use https://examples.dask.org/delayed.html
Future reference https://docs.dask.org/en/latest/futures.html#distributed.Future
Dependent Futures with dask.delayed
Here's a complete example from the Delayed docs (it actually combines several successive examples into one result)
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)    # depends on a and b
    output.append(c)

total = dask.delayed(sum)(output)  # depends on everything
total.compute()                    # 45
You can call total.visualize() to see the task graph
(image from Dask Delayed docs)
Collections of Futures
If you're already using .map(..) to map function and argument pairs, you can keep creating Futures and then .gather(..) them all at once, even if they're in a collection (which is convenient in your case).
The .gather()'ed results will be in the same arrangement as they were given (a list of lists):
[[fn1(args11), fn1(args12)], [fn2(args21)], [fn3(args31), fn3(args32), fn3(args33)]]
https://distributed.dask.org/en/latest/api.html#distributed.Client.gather
import dask
from dask.distributed import Client

client = Client(...)  # connect to distributed cluster

collection_of_futures = []
for worker_func, worker_args in iterable_of_pairs_of_fn_args:
    futures = client.map(worker_func, worker_args)
    collection_of_futures.append(futures)

results = client.gather(collection_of_futures)
notes
worker_args must be some iterable to map to worker_func, which can be a source of error
.gather()ing will block until all the futures are completed, or raise if any of them fails
.as_completed()
If you need the results as quickly as possible, you could use .as_completed(..), but note the results will be in a non-deterministic order, so I don't think this makes sense for your case. If you find it does, you'll need some extra guarantees:
include information about what to do with the result in the result
keep a reference to each and check them
only combine groups where it doesn't matter (i.e. all the Futures have the same purpose)
Also note that the yielded futures are complete, but they are still Future objects, so you still need to call .result() on them or .gather() them
https://distributed.dask.org/en/latest/api.html#distributed.as_completed
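For illustration, a minimal sketch of consuming results with as_completed, assuming a connected Client; worker_func and worker_args here are placeholders standing in for the names from the question:
from dask.distributed import Client, as_completed

def worker_func(x):
    # placeholder for the real worker function
    return x * 2

if __name__ == '__main__':
    client = Client()        # or Client(cluster) for an existing cluster
    worker_args = range(10)  # placeholder arguments

    futures = client.map(worker_func, worker_args)

    # futures are yielded as they finish, in completion order (non-deterministic)
    for future in as_completed(futures):
        result = future.result()  # the future is done, but you still unwrap it
        print(result)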
I have a small dataframe (about ~100MB) and an expensive computation that I want to perform for each row. It is not a vectorizable computation; it requires some parsing and a DB lookup for each row.
As such, I have decided to try Dask to parallelize the task. The task is "embarrassingly parallel" and order of execution or repeated execution is no issue. However, for some unknown reason, memory usage blows up to about ~100GB.
Here is the offending code sample:
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import LSFCluster
cluster = LSFCluster(memory="6GB", cores=1, project='gRNA Library Design')
cluster.scale(jobs=16)
client = Client(cluster)
required_dict = load_big_dict()
score_guide = lambda row: expensive_computation(required_dict, row)
library_df = pd.read_csv(args.library_csv)
meta = library_df.dtypes
meta = meta.append(pd.Series({
    'specificity': np.dtype('int64'),
    'cutting_efficiency': np.dtype('int64'),
    '0 Off-targets': np.dtype('object'),
    '1 Off-targets': np.dtype('object'),
    '2 Off-targets': np.dtype('object'),
    '3 Off-targets': np.dtype('object')}))
library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.apply(score_guide, axis=1, meta=meta)
library_ddf = library_ddf.compute()
library_ddf = library_ddf.drop_duplicates()
library_ddf.to_csv(args.outfile, index=False)
My guess is that somehow the big dictionary required for lookup is the issue, but its size is only ~1.5GB in total and is not included in the resultant dataframe.
Why might Dask be blowing up memory usage?
Not 100% sure this will resolve it in this case, but you can try to futurize the dictionary:
# broadcasting makes sure that every worker has a copy
[fut_dict] = client.scatter([required_dict], broadcast=True)
score_guide = lambda row: expensive_computation(fut_dict, row)
What this does is put a copy of the dict on every worker and store a reference to that object in fut_dict, obviating the need to hash the large dict on every call to the function:
Every time you pass a concrete result (anything that isn’t delayed) Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well.
Note that this will eat away a part of each worker's memory (e.g. given your information, each worker will have 1.5GB allocated for the dict). You can read more in this Q&A.
The problem is that the required_dict needs to be serialized and sent to all the worker threads. As required_dict is large and many workers need it simultaneously, repeated serializations cause a massive memory blowup.
There are many fixes; for me it was easiest to simply load the dictionary from the worker threads and explicitly use map_partitions instead of apply.
Here is the solution in code,
def do_df(df):
    required_dict = load_big_dict()
    score_guide = lambda row: expensive_computation(required_dict, row)
    return df.apply(score_guide, axis=1)

library_ddf = dd.from_pandas(library_df, npartitions=128)
library_ddf = library_ddf.map_partitions(do_df)
library_ddf = library_ddf.compute()
I am using pool.map from multiprocessing to apply my custom function:
def my_func(data):  # This is just a dummy function.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped dataframe, so I am passing one dataframe at a time to pool.map. I keep getting a memory error.
As far as I can see, VIRT increases many-fold and leads to this error.
Two questions:
How do I solve this growing memory (VIRT) issue? Maybe by playing with the chunksize?
Second, although it launches as many Python subprocesses as I specified in Pool(processes=...), not all CPUs hit 100%; it seems not all processes are used and only one or two run at a time. Could this be because the same chunksize is applied to dataframes of different sizes (some dataframes are small)? How do I utilise every CPU here?
Just for anyone looking for an answer in the future: I solved this by using imap instead of map, because map consumes the whole input iterable up front and builds a full list of results in memory, which is memory-intensive.
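A minimal sketch of that change, assuming a grouped dataframe gpd as in the question (my_func here is a placeholder, and the chunksize value is an illustrative choice):
import multiprocessing as mp
import pandas as pd

def my_func(data):
    # placeholder for the real per-group computation
    return data.assign(new_col=len(data))

def main(gpd):
    with mp.Pool(processes=16, maxtasksperchild=100) as mypool:
        # imap consumes the groups lazily and yields results one at a time,
        # instead of materialising the whole input list and result list up front
        result_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=1)
        result = pd.concat(result_iter, axis=0)
    return result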
I am confused about the difference between client.persist() and client.compute(): both seem (in some cases) to start my calculations, and both return asynchronous objects; however, not in my simple example:
In this example
from dask.distributed import Client
from dask import delayed

client = Client()

def f(*args):
    return args

result = [delayed(f)(x) for x in range(1000)]
x1 = client.compute(result)
x2 = client.persist(result)
Here x1 and x2 are different, but in a less trivial calculation, where result is also a list of Delayed objects, using client.persist(result) starts the calculation just like client.compute(result) does.
Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures
As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.
Client.persist returns a copy for each of the dask collections with their previously-lazy computations now submitted to run on the cluster. The task graphs of these collections now just point to the currently running Future objects.
So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.
Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. This is typically used for small results.
So if you compute a dask.dataframe with 100 partitions you get back a Future pointing to a single Pandas dataframe that holds all of the data.
More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.
In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.
import dask.dataframe as dd

df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist() # compute up to here, keep results in memory
>>> df.value.max().compute()
100
>>> df.value.min().compute()
0
When using delayed
Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. Persist will give you back a lazy dask.delayed object while compute will give you back an immediate Future object.
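A small sketch of that distinction on a single Delayed object, assuming a connected Client (inc is just a placeholder function):
import dask
from dask.distributed import Client

client = Client()  # connect to a cluster (or start a local one)

def inc(x):
    return x + 1

lazy = dask.delayed(inc)(10)

persisted = client.persist(lazy)  # still a lazy Delayed; its graph now points at running work
future = client.compute(lazy)     # a Future for the single concrete result

print(type(persisted))            # <class 'dask.delayed.Delayed'>
print(future.result())            # 11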
I have an interesting multiprocessing problem with structure that I might be able to exploit. The problem involves a largish Pandas DataFrame df with ~80 columns and a function func that operates on pairs of those columns (~80*79/2 pairs); each call takes a fairly short amount of time.
The code looks like:
from itertools import combinations
from multiprocessing import Manager, Pool

mgr = Manager()
ns = mgr.Namespace()
ns.df = df

pool = Pool(processes=16)
args = [(ns, list(combo)) for combo in combinations(df.columns, 2)]
results = pool.map(func, args)
pool.close()
The above is faster than running without the pool, but only by a factor of 7 or so. I'm worried that the overhead from so many calls is the issue. Is there a good way to exploit the structure here for multiprocessing?
That is a fairly standard result. Nothing will scale perfectly linearly when run in parallel because of the overhead required to set up each process and pass data between processes. Keep in mind that (80 * 79) / 2 = 3,160 is actually a very small number assuming the function is not extremely computationally intensive (i.e. takes a really long time). All else equal, the faster the function the greater the overhead cost to using multiprocessing because the time to set up an additional process is relatively fixed.
Overhead on multiprocessing mainly comes in memory if you have to make several duplications of a large dataset (one duplication for each process if the function is poorly designed) because processes do not share memory. Assuming your function is set up such that it can be easily parallelized, adding more processes is good so long as you do not exceed the number of processors on your computer. Most home computers do not have 16 processors (at most 8 is typical) and your result (that it is 7 times faster in parallel) is consistent with you having fewer than 16 processors. You can check the number of processors on your machine with multiprocessing.cpu_count().
EDIT:
If you parallelize a function by passing the column string then it will repeatedly make copies of the dataframe. For example:
def StringPass(string1, string2):
    return df[string1] * df[string2]
If you parallelize StringPass it will copy the data frame at least once per process. In contrast:
def ColumnPass(column1, column2):
    return column1 * column2
If you pass just the necessary columns, ColumnPass will only copy the columns needed for each call to the function when run in parallel. So while StringPass(string1, string2) and ColumnPass(df[string1], df[string2]) will return the same result, in multiprocessing the former will make several inefficient copies of the global df, while the latter will only copy the necessary columns for each call to the function.
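For illustration, a small runnable sketch of the column-passing approach using pool.starmap; the toy DataFrame, pool size, and ColumnPass body are placeholders standing in for the question's df and func:
from itertools import combinations
from multiprocessing import Pool

import numpy as np
import pandas as pd

def ColumnPass(column1, column2):
    # each worker receives only the two columns it needs, not the whole df
    return column1 * column2

if __name__ == '__main__':
    # toy stand-in for the ~80-column DataFrame from the question
    df = pd.DataFrame(np.random.rand(1000, 5), columns=list('abcde'))
    # build (column, column) argument pairs instead of (namespace, column-name) pairs
    args = [(df[c1], df[c2]) for c1, c2 in combinations(df.columns, 2)]
    with Pool(processes=4) as pool:
        results = pool.starmap(ColumnPass, args)  # one product Series per pair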