Dask Dataframe nunique operation: Worker running out of memory (MRE)

tl;dr
I want to
dd.read_parquet('*.parq')['column'].nunique().compute()
but I get
WARNING - Worker exceeded 95% memory budget. Restarting
a couple of times before the workers get killed altogether.
Long version
I have a dataset with
10 billion rows,
~20 columns,
and a single machine with around 200GB memory. I am trying to use dask's LocalCluster to process the data, but my workers quickly exceed their memory budget and get killed even if I use a reasonably small subset and try using basic operations.
I have recreated a toy problem demonstrating the issue below.
Synthetic data
To approximate the problem above on a smaller scale, I will create a single column with 32-character ids with
a million unique ids
total length of 200 million rows
split into 100 parquet files
The result will be
100 files of 66MB each on disk, each taking 178MB when loaded as a pandas dataframe (estimated by df.memory_usage(deep=True).sum())
If loaded as a pandas dataframe, all the data take 20GB in memory
A single Series with all ids (which is what I assume the workers also have to keep in memory when computing nunique) takes about 90MB
import string
import os
import numpy as np
import pandas as pd
chars = string.ascii_letters + string.digits
n_total = int(2e8)
n_unique = int(1e6)
# Create random ids
ids = np.sum(np.random.choice(np.array(list(chars)).astype(object), size=[n_unique, 32]), axis=1)
outputdir = os.path.join('/tmp', 'testdata')
os.makedirs(outputdir, exist_ok=True)
# Sample from the ids to create 100 parquet files
for i in range(100):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
Attempt at a solution
Let's assume that my machine only has 8GB of memory. Since the partitions take about 178MB and the result 90MB, according to Wes McKinney's rule of thumb, I might need up to 2-3GB of memory. Therefore, either
n_workers=2, memory_limit='4GB', or
n_workers=1, memory_limit='8GB'
seems like a good choice. Sadly, when I try it, I get
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
a couple of times, before the worker(s) get killed altogether.
import os
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
cluster = LocalCluster(n_workers=4, memory_limit='6GB')
client = Client(cluster)
dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'))['id'].nunique().compute()
In fact, it seems, for example, with 4 workers, they each need 6GB of memory before being able to perform the task.
Can this situation be improved?

That's a great example of a recurring problem. The only shocking thing is that delayed was not used during the synthetic data creation:
import dask
@dask.delayed
def create_sample(i):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
    return
# Sample from the ids to create 100 parquet files
dels = [create_sample(i) for i in range(100)]
_ = dask.compute(dels)
For the following answer I will actually just use a small number of partitions (so change to range(5)), to have sane visualizations. Let's start with the loading:
df = dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'), columns=['id'])
print(df.npartitions) # 5
This is a minor point, but passing columns=['id'] to .read_parquet() exploits parquet's columnar layout, so only that column is read from disk (it might be that dask will do some optimization behind the scenes, but if you know the columns you want, there's no harm in being explicit).
Now, when you run df['id'].nunique(), here's the DAG that dask will compute:
[Figure: task graph of df['id'].nunique() for 5 partitions]
With more partitions, there would be more steps, but it's apparent that there's a potential bottleneck when each partition is trying to send data that is quite large. This data can be very large for high-cardinality columns, so if each partition sends an object of about 100MB, then the receiving worker will need 5 times that memory to accept the data (which could potentially decrease after further value-counting).
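To make the bottleneck concrete, here is a rough pure-Python sketch of the same two-step shape (one "distinct values" task per partition, then one combining task). It is only an analogy, not dask's actual implementation, and the sizes and id format are made up for illustration:

```python
import random

random.seed(0)
n_unique = 1_000            # assumed toy cardinality
n_partitions = 5
rows_per_partition = 20_000

ids = [f"id-{i:04d}" for i in range(n_unique)]
partitions = [random.choices(ids, k=rows_per_partition) for _ in range(n_partitions)]

# Step 1 (one task per partition): reduce each partition to its distinct values.
partials = [set(p) for p in partitions]

# Step 2 (single combining task): every partial result has to be held at once.
held_at_once = sum(len(s) for s in partials)
result = len(set().union(*partials))

print("distinct values per partition:", [len(s) for s in partials])
print("values held by the combining task:", held_at_once)
print("final nunique:", result)
```

Because almost every id shows up in every partition, each partial result is nearly as large as the final answer, and the combining step has to hold roughly n_partitions copies of it.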
An additional consideration is how many tasks a single worker can run at a given time. The easiest way to control this is with resources. If you initiate the cluster with resources:
cluster = LocalCluster(n_workers=2, memory_limit='4GB', resources={'foo': 1})
Then every worker has the specified resources (in this case it's 1 unit of arbitrary foo), so if you think that processing a single partition should happen one at a time (due to high memory footprint), then you can do:
# note, no split_every is needed in this case since we're just
# passing a single number
df['id'].nunique().compute(resources={'foo': 1})
This will ensure that any single worker is busy with 1 task at a time, preventing excessive memory usage. (side note: there's also .nunique_approx(), which may be of interest)
To control the amount of data that any given worker receives for further processing, one approach is to use the split_every option. Here's what the DAG will look like with split_every=3:
[Figure: task graph of df['id'].nunique(split_every=3) for 5 partitions]
You can see that now (for this number of partitions), the max memory that a worker will need is 3 times the max size of a single partition's result. So depending on your worker memory settings you might want to set split_every to a low value (2, 3, 4 or so).
In general, the more unique the variable, the more memory is needed for each partition's object with unique counts, and so a lower value of split_every is going to be useful to put a cap on the max memory usage. If the variable is not very unique, then each individual partition's unique count will be a small object, so there's no need to have a split_every restriction.

Related

Dask word count on single large file doesn't quite perform well

I've noticed a significant performance degradation with the following script when increasing my cluster size: from a single node to 3 node cluster. Running times are 2min and 6min respectively.
Also, noticed CPU activity is very low in both cases.
The task here is a word count over a single text file (10GB, 100M lines, ~2B words).
I've made the file available to all nodes prior to launching the script.
What could possibly impede Dask from scaling this out?
from dask.distributed import Client
import dask.dataframe as dd
df = dd.read_csv(file_url, header=None)
# count words
new_df = (
    df[0]
    .str
    .split()
    .explode()
    .value_counts()
)
print(new_df.compute().head())

One potential problem is the communication that arises when you have many workers sending value counts on a high cardinality variable. The core issue is somewhat similar to the one discussed here: even if memory is not a bottleneck for the workers, they still will end up passing around (potentially) large counters, which can be very slow due to (de)serialization and transmission over the network.
To test if this is the issue, you can try creating a low cardinality file, e.g. using terminal yes | head -1000000 > test_low_cardinality.csv, and testing your snippet on this file.
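To get a feel for how much the cardinality matters, here is a small standard-library sketch that compares the serialized size of one partition's partial word counts for a low-cardinality and a high-cardinality input. It uses collections.Counter and pickle as stand-ins for what dask would send between workers, and the word lists are synthetic, so treat the numbers as indicative only:

```python
import pickle
import random
from collections import Counter

random.seed(0)
n_words = 100_000

# Low cardinality: the same word repeated (like the output of `yes`).
low = ["y"] * n_words
# High cardinality: words drawn from a large made-up vocabulary.
high = [f"w{random.randrange(50_000)}" for _ in range(n_words)]

# Size of the partial result one partition would have to ship for combining.
low_bytes = len(pickle.dumps(Counter(low)))
high_bytes = len(pickle.dumps(Counter(high)))

print(f"low cardinality:  {low_bytes} bytes")
print(f"high cardinality: {high_bytes} bytes")
```

The input size is identical in both cases; only the number of distinct words changes how much has to be serialized and transferred.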

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset and then group by a specific column, so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that item. It seems to work on small data, and using dask seems to work for large datasets, but I'm having problems when I try to save the result to a file. I've tried both CSV and parquet. I want to save the results so that I can open them later in chunks.
Here's code to show the issue (the script below generates 2M rows and up to 30k unique values to one-hot encode).
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
sizeOfRows = 2000000
columnsForDF = 30000
partitionsforDask = 500
print("partition is ", partitionsforDask)
cluster = LocalCluster()
client = Client(cluster)
print(client)
df = pd.DataFrame(np.random.randint(0,columnsForDF,size=(sizeOfRows, 2)), columns=list('AB'))
ddf = dd.from_pandas(df, npartitions=partitionsforDask)
# ddf = ddf.persist()
wait(ddf)
# %%time
# need to globally know the categories before one hot encoding
ddf = ddf.categorize(columns=["B"])
one_hot = dd.get_dummies(ddf, columns=['B'])
print("starting groupby")
# result = one_hot.groupby('A').max().persist() # or to_parquet/to_csv/compute/etc.
# result = one_hot.groupby('A', sort=False).max().to_csv('./daskDF.csv', single_file = True)
result = one_hot.groupby('A', sort=False).max().to_parquet('./parquetFile')
wait(result)
It seems to work until it does the groupby to csv or parquet. At that point, I get many errors about workers exceeded 95% of memory and then the program exits with a "killedworker" exception:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
KilledWorker: ("('dataframe-groupby-max-combine-3ddcd8fc854613101b4bdc7fccde32cd', 1, 0, 0)", <Worker 'tcp://127.0.0.1:33815', name: 6, memory: 0, processing: 22>)
Monitoring my machine, I never get close to exceeding memory, and my drive has over 300 GB of free space which is never used (no file is created during this process, although it's in the groupby section).
What can I do?
Update - I thought I'd add an award. I'm having the same problem with .to_csv as well, since someone else had a similar problem I hope it has value for a wide audience.

Let's first think of the end result: it will be a dataframe with 30'000 columns and 30'000 rows. This object will take about 6.7 GB in memory. (there's scope in playing around with dtypes to reduce memory footprint and also not all combinations might appear in the data, but let's ignore these points for simplicity)
Now, imagine we only had two partitions and each partition contained all possible dummy variable combinations. That would mean that each worker would need at least 6.7 GB to store the .groupby().max() object, but the final step would require 13.4 GB because the final worker would need to find the .max of those two objects. Naturally, if you have more partitions, the memory requirements on the final worker will grow. There is a way of controlling that in dask by specifying split_every in the relevant function. For example, if you specify .max(split_every=2), then any worker will receive at most 2 objects (the default value of split_every is 8).
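The numbers above can be reproduced with a bit of arithmetic. This assumes 8 bytes per cell (e.g. int64 or float64); the 1-byte line is only there to show how much a smaller dtype would help:

```python
n_rows = 30_000
n_cols = 30_000
GIB = 2**30

# One dense 30,000 x 30,000 object at 8 bytes per cell (assumed dtype width).
one_object_gib = n_rows * n_cols * 8 / GIB

# The final step in the two-partition thought experiment holds two such objects.
two_objects_gib = 2 * one_object_gib

# Same object if every cell were stored in 1 byte (e.g. uint8 or bool).
one_object_1byte_gib = n_rows * n_cols * 1 / GIB

print(f"one object, 8 bytes/cell:  {one_object_gib:.1f} GiB")
print(f"two objects, 8 bytes/cell: {two_objects_gib:.1f} GiB")
print(f"one object, 1 byte/cell:   {one_object_1byte_gib:.2f} GiB")
```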
Early on in the processing of 500 partitions, it's likely that each partition will contain only a subset of possible dummy values, so memory requirements are low. However, as dask marches on in computing the final result, it will combine objects with different dummy value combinations, so memory requirements will grow towards the end of the pipeline.
In principle, you could also use resources to restrict how many tasks a worker will undertake at one time, but that's not going to help if the worker doesn't have sufficient memory to handle the tasks.
What are potential ways out of this? At least a few options:
use workers with bigger resources;
simplify the task (e.g. split the task into several sub-tasks based on subsets of possible categories);
develop a custom workflow with delayed/futures that will sort the data and implement custom priorities, ensuring that workers complete a subset of work before proceeding to the final aggregation.
If worker memory is a constraint, then the subsetting will have to be very fine-grained. For example, at the limit, subsetting to only one possible dummy variable combination will have very low memory requirements (the initial data load and filter will still require enough memory to fit a partition), but of course that's an extreme example that would generate tens of thousands of tasks, so larger category groups are recommended (balancing the number of tasks and memory requirements). To see an example, you can check this related answer.
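As a rough illustration of the subsetting idea (not the code from the linked answer), here is a small pandas sketch that processes the categories of B in groups and checks that stitching the pieces together gives the same result as doing everything at once. The sizes, the group size and the helper name onehot_max_for are made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy version of the data: 200 rows, values 0..11 in both columns.
df = pd.DataFrame(rng.integers(0, 12, size=(200, 2)), columns=list("AB"))

def onehot_max_for(frame, categories):
    # Hypothetical helper: one-hot encode only the given categories of B,
    # then take the per-A max, so the result has len(categories) columns.
    subset = frame[frame["B"].isin(categories)]
    dummies = pd.get_dummies(
        subset["B"].astype(pd.CategoricalDtype(categories)),
        prefix="B",
        dtype="uint8",
    )
    return dummies.groupby(subset["A"]).max()

all_categories = sorted(df["B"].unique())
all_groups = sorted(df["A"].unique())

group_size = 4  # categories handled per sub-task
pieces = []
for start in range(0, len(all_categories), group_size):
    chunk = all_categories[start:start + group_size]
    # Values of A with no row in this chunk get an all-zero row.
    piece = onehot_max_for(df, chunk).reindex(all_groups, fill_value=0)
    pieces.append(piece)  # in a real workflow each piece could be saved to disk here

combined = pd.concat(pieces, axis=1)
reference = onehot_max_for(df, all_categories).reindex(all_groups, fill_value=0)
print(combined.shape)
```

Each piece only ever has group_size columns, so the peak memory is set by the group size rather than by the total number of categories.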

Optimizing dask.distributed scheduling for data reduction

I have a question pertaining to the scheduling/execution order of tasks in dask.distributed for the case of strong data reduction of a large raw dataset.
We are using dask.distributed for a code which extracts information from movie frames. Its specific application is crystallography, but quite generally the steps are:
Read frames of a movie stored as a 3D array in a HDF5 file (or a few thereof which are concatenated) into a dask array. This is obviously quite I/O-heavy
Group these frames into consecutive sub-stacks of typically 10 movie stills, the frames of which are aggregated (summed or averaged), resulting in a single 2D image.
Run several, computationally heavy, analysis functions on the 2D image (such as positions of certain features), returning a dictionary of results, which is negligibly small compared to the movie itself.
We implement this by using the dask.array API for steps 1 and 2 (the latter using map_blocks with a block/chunk size of one or a few of the aggregation sub-stacks), then converting the array chunks to dask.delayed objects (using to_delayed) which are passed to a function doing the actual data reduction (step 3). We take care to properly align the chunks of the HDF5 arrays, the dask computation and the aggregation ranges in step 2 such that the task graph of each final delayed object (elements of tasks) is very clean. Here's the example code:
import h5py
import numpy as np
import dask.array as da
from dask import delayed
from dask.distributed import Client

def sum_sub_stacks(mov):
    # aggregation function: sum each consecutive group of 10 frames
    sub_stk = []
    for k in range(mov.shape[0]//10):
        sub_stk.append(mov[k*10:k*10+10,...].sum(axis=0, keepdims=True))
    return np.concatenate(sub_stk)

def get_info(mov):
    # reduction function: one small dict of results per aggregated frame
    results = []
    for frame in mov:
        results.append({
            'sum': frame.sum(),
            'variance': frame.var()
            # ...actually much more complex/expensive stuff
        })
    return results

# connect to dask.distributed scheduler
client = Client(address='127.0.0.1:8786')
# 1: get the movie
fh = h5py.File('movie_stack.h5', 'r')
movie = da.from_array(fh['/entry/data/raw_counts'], chunks=(100,-1,-1))
# 2: sum sub-stacks within movie
movie_aggregated = movie.map_blocks(sum_sub_stacks,
                                    chunks=(10,) + movie.chunks[1:],
                                    dtype=movie.dtype)
# 3: create and run reduction tasks
tasks = [delayed(get_info)(chk)
         for chk in movie_aggregated.to_delayed().ravel()]
info = client.compute(tasks, sync=True)
The ideal scheduling of operations would clearly be for each worker to perform the 1-2-3 sequence on a single chunk and then move on to the next, which would keep I/O load constant, CPUs maxed out and memory low.
What happens instead is that first all workers are trying to read as many chunks as possible from the files (step 1) which creates an I/O bottleneck and quickly exhausts the worker memory causing thrashing to the local drives. Often, at some point workers eventually move to steps 2/3 which quickly frees up memory and properly uses all CPUs, but in other cases workers get killed in an uncoordinated way or the entire computation is stalling. Also intermediate cases happen where surviving workers behave reasonably for a while only.
Is there any way to give hints to the scheduler to process the tasks in the preferred order as described above or are there other means to improve the scheduling behavior? Or is there something inherently stupid about this code/way of doing things?

First, there is nothing inherently stupid about what you are doing at all!
In general, Dask tries to reduce the number of temporaries it is holding onto and it also balances this with parallelizability (width of the graph and the number of workers). Scheduling is complex and there is yet another optimization Dask uses which fuses tasks together to make them more optimal. With lots of little chunks you may run into issues: https://docs.dask.org/en/latest/array-best-practices.html?highlight=chunk%20size#select-a-good-chunk-size
Dask does have a number of optimization configurations which I would recommend playing with after considering other chunksizes. I would also encourage you to read through the following issue as there is a healthy discussion around scheduling configurations.
Lastly, you might consider additional memory configuration of your workers, as you may want to control more tightly how much memory each worker should use.
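As a starting point for the memory configuration, here is a sketch of the worker memory thresholds in distributed's configuration. The keys are the ones distributed reads; the values are only illustrative (somewhat stricter than the defaults of 0.60, 0.70, 0.80 and 0.95) and would need tuning for your workload:

```python
import dask

# Fractions of each worker's memory_limit. Illustrative values, not a recommendation.
dask.config.set({
    "distributed.worker.memory.target": 0.50,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.60,      # spill based on process memory
    "distributed.worker.memory.pause": 0.70,      # stop starting new tasks on the worker
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})

print(dask.config.get("distributed.worker.memory"))
```

These settings are read when workers start, so they need to be in place (here or in the dask YAML config) before the cluster is created.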

How does dask work for larger than memory datasets

Would anyone be able to tell me how dask works for larger-than-memory datasets in simple terms? For example, I have a dataset which is 6GB and a machine with 4GB RAM and 2 cores. How would dask go about loading the data and doing a simple calculation such as the sum of a column?
Does dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces, and then, once asked to compute, bring the data into memory chunk by chunk and do the computation using each of the available cores? Am I right on this?
Thanks
Michael

By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern in loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should comfortably fit into memory - not be too big - but the time it takes to load and process each should be much longer than the overhead imposed by scheduling the task to run on a worker (the latter <1ms) - not be too small.
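The load-reduce-discard pattern described above can be imitated in a single process with plain pandas. This is only an analogy for what each dask worker does per partition (the in-memory CSV and the chunk size of 100 rows are made up for the example):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file that is too big to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 1001))

partial_sums = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    # Only the partial sum is kept; the chunk itself is dropped on the next iteration.
    partial_sums.append(chunk["value"].sum())

# Final step: combine the per-chunk results.
total = sum(partial_sums)
print(len(partial_sums), total)
```

With dask the chunks are the partitions, several of them are processed at the same time (one per worker thread or process), and the final addition is just another small task.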

Multiprocessing pool with large shared objects

I'm working with a large pandas DataFrame of roughly 17million rows * 5 columns.
I'm then running a regression on a moving window within this DataFrame. I'm attempting to parallelize this computationally intensive part by passing the (static) DataFrame to multiple processes.
To simplify the example I'll assume I'm just working on the DataFrame 1000 times:
import multiprocessing as mp
def helper_func(input_tuple):
    # Runs a regression on the DataFrame and outputs the
    # results (coefficients)
    ...
if __name__ == '__main__':
    # For simplicity's sake let's assume the input tuple is
    # just a list of copies of the DataFrame
    input_tuples = [df for x in range(1000)]
    pl = mp.Pool(10)
    jobs = pl.map_async(helper_func, input_tuples)
    pl.close()
    result = jobs.get()
While tracking the resource usage on repeated runs I'm noticing that the memory is constantly increasing and right before the process completes it tops out at 100%. Once it finishes it resets back to whatever it was prior to code run.
To give actual numbers, I can see the parent process using about 450mb while each of the workers is using around 1-2 GB of memory.
I'm worried (perhaps unnecessarily) that this could have memory related issues. Is there a way to reduce the memory that is held by the child processes? It is not clear to me why they are constantly increasing (and substantially larger than the parent process that held the DataFrame).
Edit:
Have tried other workarounds such as setting maxtasksperchild (per High Memory Usage Using Python Multiprocessing), without much success.
Edit2:
Example below of what memory usage looks like while running. There are small peaks and dips (where I assume memory is being released?) however it without fail reaches 100% at the very end of the code run.
