Dask: why is memory usage blowing up?

I have a small dataframe (~100 MB) and an expensive computation that I want to perform for each row. The computation is not vectorizable; it requires some parsing and a DB lookup for each row.
As such, I have decided to try Dask to parallelize the task. The task is "embarrassingly parallel", and order of execution or repeated execution is no issue. However, for some unknown reason memory usage blows up to roughly 100 GB.
Here is the offending code sample:
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import LSFCluster
cluster = LSFCluster(memory="6GB", cores=1, project='gRNA Library Design')
cluster.scale(jobs=16)
client = Client(cluster)
required_dict = load_big_dict()
score_guide = lambda row: expensive_computation(required_dict, row)
library_df = pd.read_csv(args.library_csv)
meta = library_df.dtypes
meta = meta.append(pd.Series({
    'specificity': np.dtype('int64'),
    'cutting_efficiency': np.dtype('int64'),
    '0 Off-targets': np.dtype('object'),
    '1 Off-targets': np.dtype('object'),
    '2 Off-targets': np.dtype('object'),
    '3 Off-targets': np.dtype('object')}))
library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.apply(score_guide, axis=1, meta=meta)
library_ddf = library_ddf.compute()
library_ddf = library_ddf.drop_duplicates()
library_ddf.to_csv(args.outfile, index=False)
My guess is that somehow the big dictionary required for lookup is the issue, but its size is only ~1.5GB in total and is not included in the resultant dataframe.
Why might Dask be blowing up memory usage?

Not 100% sure this will resolve it in this case, but you can try to futurize the dictionary:
# broadcasting makes sure that every worker has a copy
[fut_dict] = client.scatter([required_dict], broadcast=True)
score_guide = lambda row: expensive_computation(fut_dict, row)
What this does is put a copy of the dict on every worker and store a reference to the object in fut_dict, obviating the need to hash the large dict on every call to the function:
Every time you pass a concrete result (anything that isn’t delayed) Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well.
Note that this will take up part of each worker's memory (e.g. given your information, each worker will have 1.5GB allocated for the dict). You can read more in this Q&A.
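If the closure above doesn't hand your function the plain dict the way you expect, a closely related pattern is to pass the scattered future as an explicit argument to map_partitions, so the scheduler substitutes the real object on each worker. This is only a minimal sketch under that assumption; score_partition is an illustrative name:
def score_partition(df, big_dict):
    # big_dict arrives as the already-materialized dict on the worker
    return df.apply(lambda row: expensive_computation(big_dict, row), axis=1)

library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.map_partitions(score_partition, fut_dict, meta=meta)
result = library_ddf.compute()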

The problem is that required_dict needs to be serialized and sent to every worker. Because required_dict is large and many workers need it at the same time, the repeated serialization causes a massive memory blowup.
There are many fixes; for me it was easiest to simply load the dictionary inside the workers and explicitly use map_partitions instead of apply.
Here is the solution in code:
def do_df(df):
    required_dict = load_big_dict()
    score_guide = lambda row: expensive_computation(required_dict, row)
    return df.apply(score_guide, axis=1)

library_ddf = dd.from_pandas(library_df, npartitions=128)
library_ddf = library_ddf.map_partitions(do_df)
library_ddf = library_ddf.compute()
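One optional refinement (my addition, not part of the original answer): pass the meta you already built to map_partitions so Dask doesn't have to call do_df on an empty frame just to infer the output schema.
# reuse the previously built meta so Dask skips schema inference
library_ddf = library_ddf.map_partitions(do_df, meta=meta)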

Related

Dask delayed sum gets killed but there are enough resources

I'm creating a function that reads an entire folder, creates a Dask dataframe, then processes the partitions of this dataframe and sums the results, like this:
import dask.dataframe as dd
from dask import delayed, compute

def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    partial_results = []
    for partition in df.partitions:
        partial = another_function(partition)
        partial_results.append(partial)
    total = delayed(sum)(partial_results)
    return total
The function being called in partitions_func (another_function) is also delayed.
@delayed
def another_function(partition):
    # Partition processing
    return result
I checked and the variables created during the processing are all small, so they shouldn't cause any issues. The partitions can be quite large but not larger than the available RAM.
When I execute partitions_func(folder), the process gets killed. At first, I thought the problem had to do with having two levels of delayed, one on another_function and one on delayed(sum).
Removing the delayed decorator from another_function causes issues because the argument is a Dask dataframe and you can't do operations like tolist() on it. I tried removing delayed from sum, because I thought it could be a problem with parallelisation and the available resources, but the process also gets killed.
However, I know there are 5 partitions. If I remove the statement total = delayed(sum)(partial_results) from partitions_func and compute the sum "manually" instead, everything works as expected:
total = partial_results[0].compute() + partial_results[1].compute() + partial_results[2].compute() \
+ partial_results[3].compute() + partial_results[4].compute()
Thanks!
A Dask dataframe is itself built from a series of delayed objects, so when you call the delayed function another_function on a partition you create a nested delayed, which dask.compute cannot handle. One option is to use .map_partitions(); the typical example is df.map_partitions(len).compute(), which computes the length of each partition. So if you can rewrite another_function to accept a pandas dataframe, and remove the delayed decorator, then your code will roughly look like this:
df = dd.read_csv(f'{folder}/*.csv')
total = df.map_partitions(another_function)
Now total is a lazy Dask object which you can pass to dask.compute (or simply run total = df.map_partitions(another_function).compute()).
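For concreteness, here is a minimal sketch of what that rewrite might look like. The column name 'value' and the final .sum() over the per-partition results are assumptions about what another_function does:
import dask.dataframe as dd

def another_function(partition_pdf):
    # receives a plain pandas DataFrame; 'value' is a hypothetical column
    return partition_pdf['value'].sum()

df = dd.read_csv(f'{folder}/*.csv')
per_partition = df.map_partitions(another_function)  # one partial result per partition
total = per_partition.compute().sum()                # combine the partial sums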

python multiprocessing subprocess - high VIRT usage leads to memory error

I am using pool.map from multiprocessing to run my custom function:
def my_func(data):  # This is just a dummy function.
    data = data.assign(new_col = data.apply(lambda x: f(x), axis = 1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped dataframe, so I am passing one dataframe at a time to pool.map. I keep getting a memory error here.
As I can see from here, VIRT increases many-fold and leads to this error.
Two questions:
How do I solve this growing-memory issue with VIRT? Maybe there is a way to play with the chunk size here?
Second, although it launches as many Python subprocesses as I specified in Pool(processes=...), I can see that not all the CPUs hit 100%; it seems not to use all the processes, with only one or two running at a time. Maybe that is because it applies the same chunk size to dataframes of different sizes (some dataframes will be small)? How do I utilise every CPU here?
Just for anyone looking for an answer in the future: I solved this by using imap instead of map, because map builds a full list from the iterable up front, which is memory-intensive.
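A minimal sketch of the imap version, assuming the same my_func and gpd from the question; chunksize=4 is just an illustrative value:
from multiprocessing import Pool
import pandas as pd

def main():
    with Pool(processes=16, maxtasksperchild=100) as mypool:
        # imap pulls groups from the generator lazily instead of
        # materialising the whole list of dataframes up front
        ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=4)
        result = pd.concat(list(ret_iter), axis=0)
    return result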

dask: difference between client.persist and client.compute

I am confused about the difference between client.persist() and client.compute(); both seem (in some cases) to start my calculations and both return asynchronous objects, however not in my simple example:
In this example:
from dask.distributed import Client
from dask import delayed

client = Client()

def f(*args):
    return args

result = [delayed(f)(x) for x in range(1000)]
x1 = client.compute(result)
x2 = client.persist(result)
Here x1 and x2 are different, but in a less trivial calculation where result is also a list of Delayed objects, client.persist(result) starts the calculation just like client.compute(result) does.
Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures
As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.
Client.persist returns a copy for each of the dask collections with their previously-lazy computations now submitted to run on the cluster. The task graphs of these collections now just point to the currently running Future objects.
So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.
Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. This is typically used for small results.
So if you compute a dask.dataframe with 100 partitions you get back a Future pointing to a single Pandas dataframe that holds all of the data.
More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.
In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.
df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist() # compute up to here, keep results in memory
>>> df.value.max().compute()
100
>>> df.value.min().compute()
0
When using delayed
Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. Persist will give you back a lazy dask.delayed object while compute will give you back an immediate Future object.
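To make the contrast concrete, here is a small sketch (my own illustration, using the client and result list from the question) of what each call hands back:
futs = client.compute(result)   # list of Future objects, one per delayed value
lazy = client.persist(result)   # list of Delayed objects whose graphs point at running futures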

How to best share static data between ipyparallel client and remote engines?

I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. The process of copying the data (if pickled, data is 40MB) is only a few seconds. However, it appears that if the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just once per engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))

    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)

    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print 'Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id)

        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])

        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)

    print 'Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time)
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine, and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started; I see the CPUs throttle up on all engines, but no progress print statements are seen for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provided new responses every 4 seconds - this may have been the time delay to write the output pickle files.
Lastly, host1 also runs the ipycontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engine doesn't really show much memory usage at all. What would cause this spike in memory?
I used this kind of logic in some code a couple of years ago, and this is what I ended up with. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

engines[:].push(shared_dict)

results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
In my case, my_func() was a complex method where I wrote lots of logging messages to a file, so I had my print statements.
About the task assignment, as I used load_balanced_view(), I left it to the library to find its way, and it did great.
If simulation counts are large (~1000), it takes a long time to get started; I see the CPUs throttle up on all engines, but no progress print statements are seen for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provided new responses every 4 seconds - this may have been the time delay to write the output pickle files.
About the long startup time, I haven't experienced that, so I can't say anything.
I hope this might cast some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work.
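If you do try the multiprocessing.Pool route, one common pattern (my own sketch, not something from the answer above) is to hand the read-only DataFrame to each worker process once via a pool initializer, so it is copied once per process rather than once per task:
import multiprocessing as mp
import pandas as pd

_data = None  # module-level slot filled in each worker process

def _init_worker(df):
    global _data
    _data = df  # one copy per worker, set when the process starts

def do_simulation(tweaks):
    # read from the global _data DataFrame here; run_sim is a placeholder
    return run_sim(_data, tweaks)

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = [dict(i=i, j=j, k=k) for i in range(4) for j in range(5) for k in range(6)]
    with mp.Pool(processes=8, initializer=_init_worker, initargs=(data,)) as pool:
        results = pool.map(do_simulation, threads)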
Sometimes you need to scatter your data grouped by a category, so that you are sure that each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the function I'm suggesting makes sure each cluster will host a similar number of observations, which is usually a good idea.

Multiprocessing doing Many Fast Calculations

I have an interesting multiprocessing problem with structure that I might be able to exploit. The problem involves a largish pandas DataFrame (df) with ~80 columns and a function func that operates on pairs of those columns (~80*79/2 pairs) and takes a fairly short amount of time on each run.
The code looks like:
from multiprocessing import Manager, Pool
from itertools import combinations

mgr = Manager()
ns = mgr.Namespace()
ns.df = df

pool = Pool(processes=16)
args = [(ns, list(combo)) for combo in list(combinations(df.columns, 2))]
results = pool.map(func, args)
pool.close()
The above is faster than without the pool, but only by a factor of 7 or so. I'm worried that the overhead from so many calls is the issue. Is there a good way to exploit the structure here for multiprocessing?
That is a fairly standard result. Nothing will scale perfectly linearly when run in parallel because of the overhead required to set up each process and pass data between processes. Keep in mind that (80 * 79) / 2 = 3,160 is actually a very small number assuming the function is not extremely computationally intensive (i.e. takes a really long time). All else equal, the faster the function the greater the overhead cost to using multiprocessing because the time to set up an additional process is relatively fixed.
Overhead on multiprocessing mainly comes in memory if you have to make several duplications of a large dataset (one duplication for each process if the function is poorly designed) because processes do not share memory. Assuming your function is set up such that it can be easily parallelized, adding more processes is good so long as you do not exceed the number of processors on your computer. Most home computers do not have 16 processors (at most 8 is typical) and your result (that it is 7 times faster in parallel) is consistent with you having fewer than 16 processors. You can check the number of processors on your machine with multiprocessing.cpu_count().
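For reference, that check is just the standard library call:
import multiprocessing
print(multiprocessing.cpu_count())  # number of logical processors available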
EDIT:
If you parallelize a function by passing the column string then it will repeatedly make copies of the dataframe. For example:
def StringPass(string1, string2):
    return df[string1] * df[string2]
If you parallelize StringPass it will copy the data frame at least once per process. In contrast:
def ColumnPass(column1, column2):
    return column1 * column2
If you pass just the necessary columns ColumnPass will only copy the columns necessary for each call to the function when run in parallel. So while StringPass(string1, string2) and ColumnPass(df[string1], df[string2]) will return the same result, in multiprocessing the former will make several inefficient copies of the global df, while the latter will only copy the necessary columns for each call to the function.
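As a hedged sketch of what the column-passing version could look like in practice (my own illustration; the pairwise product with .sum() is a stand-in for the real func, and df is the DataFrame from the question):
from itertools import combinations
from multiprocessing import Pool

def ColumnPass(column1, column2):
    # stand-in for the real pairwise computation
    return (column1 * column2).sum()

if __name__ == '__main__':
    # pass the actual column data, not column names, so each task only ships two columns
    pairs = [(df[a], df[b]) for a, b in combinations(df.columns, 2)]
    with Pool(processes=8) as pool:
        results = pool.starmap(ColumnPass, pairs)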
