I'm trying to one-hot encode a dataset and then group by a specific column, so I end up with one row per item in that column and an aggregated view of which one-hot columns are true for that row. It seems to work on small data, and using dask seems to work for large datasets, but I'm having problems when I try to save the file. I've tried both CSV and Parquet files. I want to save the results so I can open them later in chunks.
Here's code to show the issue (the script below generates 2M rows and up to 30k unique values to one-hot encode).
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
sizeOfRows = 2000000
columnsForDF = 30000
partitionsforDask = 500
print("partition is ", partitionsforDask)
cluster = LocalCluster()
client = Client(cluster)
print(client)
df = pd.DataFrame(np.random.randint(0,columnsForDF,size=(sizeOfRows, 2)), columns=list('AB'))
ddf = dd.from_pandas(df, npartitions=partitionsforDask)
# ddf = ddf.persist()
wait(ddf)
# %%time
# need to globally know the categories before one hot encoding
ddf = ddf.categorize(columns=["B"])
one_hot = dd.get_dummies(ddf, columns=['B'])
print("starting groupby")
# result = one_hot.groupby('A').max().persist() # or to_parquet/to_csv/compute/etc.
# result = one_hot.groupby('A', sort=False).max().to_csv('./daskDF.csv', single_file = True)
result = one_hot.groupby('A', sort=False).max().to_parquet('./parquetFile')
wait(result)
It seems to work until it gets to the groupby and the write to CSV or Parquet. At that point, I get many warnings about workers exceeding 95% of their memory budget, and then the program exits with a KilledWorker exception:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
KilledWorker: ("('dataframe-groupby-max-combine-3ddcd8fc854613101b4bdc7fccde32cd', 1, 0, 0)", <Worker 'tcp://127.0.0.1:33815', name: 6, memory: 0, processing: 22>)
Monitoring my machine, I never get close to exhausting memory, and my drive has over 300 GB free, which is never used (no file is created during this process, even though it's in the groupby section).
What can I do?
Update: I thought I'd add a bounty. I'm having the same problem with .to_csv as well; since someone else had a similar problem, I hope this has value for a wide audience.
Let's first think of the end result: it will be a dataframe with 30'000 columns and 30'000 rows. This object will take about 6.7 GB in memory. (There's scope for playing with dtypes to reduce the memory footprint, and not all combinations might appear in the data, but let's ignore these points for simplicity.)
Now, imagine we only had two partitions and each partition contained all possible dummy variable combinations. That would mean that each worker would need at least 6.7 GB to store the .groupby().max() object, but the final step would require 13.4 GB because the final worker would need to find the .max of those two objects. Naturally, if you have more partitions, the memory requirements on the final worker will grow. There is a way of controlling that in dask by specifying split_every in the relevant function. For example, if you specify .max(split_every=2), then any worker will receive at most 2 objects (the default value of split_every is 8).
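In code, that could look like this (a sketch reusing the one_hot frame from the question; 2 is purely illustrative and needs tuning against your worker memory):
# cap how many partial groupby results any worker combines at once
result = one_hot.groupby('A', sort=False).max(split_every=2)
result.to_parquet('./parquetFile')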
Early on in the processing of 500 partitions, it's likely that each partition will contain only a subset of possible dummy values, so memory requirements are low. However, as dask marches on in computing the final result, it will combine objects with different dummy value combinations, so memory requirements will grow towards the end of the pipeline.
In principle, you could also use resources to restrict how many tasks a worker will undertake at one time, but that's not going to help if the worker doesn't have sufficient memory to handle the tasks.
What are potential ways out of this? At least a few options:
use workers with bigger resources;
simplify the task (e.g. split the task into several sub-tasks based on subsets of possible categories);
develop a custom workflow with delayed/futures that will sort the data and implement custom priorities, ensuring that workers complete a subset of work before proceeding to the final aggregation.
If worker memory is a constraint, then the subsetting will have to be very fine-grained. For example, at the limit, subsetting to only one possible dummy variable combination will have very low memory requirements (the initial data load and filter will still require enough memory to fit a partition), but of course that's an extreme example that would generate tens of thousands of tasks, so larger category groups are recommended (balancing the number of tasks and memory requirements). To see an example, you can check this related answer.
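As a rough sketch of the subsetting idea (option 2 above), assuming get_dummies produced columns named like 'B_<value>' and that ~30 column groups is an acceptable granularity:
import numpy as np

# process the dummy columns in groups, so no single reduction has to hold all 30'000 columns
dummy_cols = [c for c in one_hot.columns if c.startswith('B_')]
for i, cols in enumerate(np.array_split(dummy_cols, 30)):
    subset = one_hot[['A'] + list(cols)]  # roughly 1'000 dummy columns per pass
    subset.groupby('A', sort=False).max(split_every=2).to_parquet(f'./parquetFile_{i}')
Each to_parquet call triggers its own computation, so the pieces can be written (and later read back) independently; the trade-off is that the source data is scanned once per column group.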
I've noticed a significant performance degradation with the following script when increasing my cluster size from a single node to a 3-node cluster. Running times are 2 min and 6 min, respectively.
I've also noticed that CPU activity is very low in both cases.
The task here is a word count over a single text file (10GB, 100M lines, ~2B words).
I've made the file available to all nodes prior to launching the script.
What could possibly impede Dask from scaling this out?
from dask.distributed import Client
import dask.dataframe as dd
df = dd.read_csv(file_url, header=None)
# count words
new_df = (
    df[0]
    .str
    .split()
    .explode()
    .value_counts()
)
print(new_df.compute().head())
One potential problem is the communication that arises when you have many workers sending value counts for a high-cardinality variable. The core issue is somewhat similar to the one discussed here: even if memory is not a bottleneck for the workers, they will still end up passing around (potentially) large counters, which can be very slow due to (de)serialization and transmission over the network.
To test whether this is the issue, you can try creating a low-cardinality file, e.g. in a terminal with yes | head -1000000 > test_low_cardinality.csv, and running your snippet on that file.
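Something like this (a sketch; the file name matches the command above, and dd is the dask.dataframe import from your snippet) should run noticeably faster if large counters are indeed the bottleneck:
low_df = dd.read_csv('test_low_cardinality.csv', header=None)
# same word-count pipeline, but each partition's value_counts result is tiny
print(low_df[0].str.split().explode().value_counts().compute().head())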
I'm working with a dataset stored in an S3 bucket (parquet files) consisting of a total of ~165 million records (with ~30 columns). The requirement is to first group by a certain ID column and then generate 250+ features for each of these grouped records based on the data. Building these features is quite complex, using multiple Pandas functions along with 10+ supporting functions. The groupby should produce ~5-6 million records, so the final output should be a 6M x 250 dataframe.
Now, I've tested the code on a smaller sample and it works fine. The issue is that when I run it on the entire dataset, it takes a very long time: the progress bar in the Spark display doesn't change even after 4+ hours of running. I'm running this in an AWS EMR Notebook connected to a cluster (1 m5.xlarge master & 2 m5.xlarge core nodes).
I've tried 1 m5.4xlarge master & 2 m5.4xlarge core nodes, and 1 m5.xlarge master & 8 m5.xlarge core nodes, among other combinations. None of them have shown any progress.
I've also tried running it with Pandas in memory on my local machine for ~650k records; the progress was ~3.5 iterations/sec, which works out to an ETA of ~647 hours.
So, the question is: can anyone suggest a better solution to reduce the time consumption and speed up the processing? Should another cluster type be used for this use case? Should the code be refactored, should the Pandas dataframe usage be removed, or is there any other pointer that would be helpful?
Thanks much in advance!
First things first: is your data partitioned enough to take advantage of all of your workers? If some part of your process causes it to coalesce to e.g. a single partition, then you're basically running single-threaded.
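A quick way to check (a sketch; sdf and 'id_column' are placeholders for your Spark DataFrame and groupby key):
# if this collapses to a handful of partitions, only that many tasks can run in parallel
print(sdf.rdd.getNumPartitions())

# repartitioning by the groupby key before the expensive step spreads the work out;
# 200 is an arbitrary illustrative number
sdf = sdf.repartition(200, "id_column")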
Beyond that, I don't know for certain without seeing the code, but here's a subtle behaviour that can cause runtimes to become massive:
from sklearn.linear_model import LinearRegression

source_df = ...  # some pandas dataframe with a lot of features in columns

flattened_df = source_df.stack().reset_index().unstack()  # turn the features into rows
spark_df = spark.createDataFrame(flattened_df)  # 'index' is the column that contains the feature name

# a function to do a linear regression and calculate residuals
def your_good_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[subset_of_columns]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return y - predicted

def your_bad_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[subset_of_columns]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return source_df[key] - predicted  # references source_df, which is outside the UDF's scope

spark_df.groupBy('index').applyInPandas(your_good_pandas_function, schema=some_schema)  # fast
spark_df.groupBy('index').applyInPandas(your_bad_pandas_function, schema=some_schema)   # slow
These two ApplyInPandas functions do the same thing - they linear-regress some characteristics against a feature and calculate the residual. The first uses variables that are in scope within the pandas UDF. The second uses a variable that is out of scope of the pandas UDF. In the second case, Spark will help you out by broadcasting source_df to every single invocation of your pandas UDF. This will cause enormous memory usage and definitely kill your job.
Your data don't seem large enough to take that long, so my guess is that the reason it works on a small subset and not on the larger set is that you're inadvertently broadcasting the larger set to your applyInPandas function calls.
I'm trying to load a dask dataframe from a MySQL table which takes about 4gb space on disk. I'm using a single machine with 8gb of memory but as soon as I do a drop duplicate and try to get the length of the dataframe, an out of memory error is encountered.
Here's a snippet of my code:
df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))
I have tried more partitions for the dataframe (as many as 64), but they also failed. I'm confused about why this causes an OOM; the dataframe should fit in memory even without any parallel processing.
"which takes about 4gb space on disk"
It is very likely to be much, much bigger than this in memory. Disk storage is optimised for compactness, with various encoding and compression mechanisms.
"The dataframe should fit in memory"
So, have you measured its size as a single pandas dataframe?
You should also keep in mind that any processing you do to your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding duplicates, which must happen before you can discard any data.
Finally, in a parallel framework like dask, there may be multiple threads and processes (you don't specify how you are running dask) which need to marshal their work and assemble the final output while the client and scheduler also take up some memory. In short, you need to measure your situation, perhaps tweak worker config options.
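As a concrete way to measure (a sketch, using the df returned by read_sql_table in your snippet, before selecting columns and dropping duplicates):
part = df.get_partition(0).compute()  # one partition as a plain pandas dataframe
print(part.memory_usage(deep=True).sum() / 1e6, "MB in this partition;", df.npartitions, "partitions in total")
If a single partition is already a sizeable fraction of your 8GB, the OOM is not surprising.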
You don't want to read an entire DataFrame into a Dask DataFrame and then perform filtering in Dask. It's better to perform filtering at the database level and then read a small subset of the data into a Dask DataFrame.
MySQL can select columns and drop duplicates with distinct. The resulting data is what you should read in the Dask DataFrame.
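A hedged sketch of that idea: since the de-duplicated pairs are comparatively small, you can let MySQL do the heavy lifting and pull back only the result (plain SQLAlchemy here, with the table and column names from the question; db_uri is the same connection string):
import sqlalchemy as sa

engine = sa.create_engine(db_uri)
# MySQL removes the duplicates; only the distinct (gene_id, genome_id) pairs leave the database
query = "SELECT DISTINCT gene_id, genome_id FROM testtable"
with engine.connect() as conn:
    n_unique = conn.execute(sa.text(f"SELECT COUNT(*) FROM ({query}) AS sub")).scalar()
print(n_unique)
If you need the distinct rows themselves for further Dask processing, the same DISTINCT query can be handed to Dask's SQL reader instead (see the syntax docs referenced below).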
See here for more information on syntax. It's easiest to query databases that have official connectors, like dask-snowflake.
tl;dr
I want to
dd.read_parquet('*.parq')['column'].nunique().compute()
but I get
WARNING - Worker exceeded 95% memory budget. Restarting
a couple of times before the workers get killed altogether.
Long version
I have a dataset with
10 billion rows,
~20 columns,
and a single machine with around 200GB memory. I am trying to use dask's LocalCluster to process the data, but my workers quickly exceed their memory budget and get killed even if I use a reasonably small subset and try using basic operations.
I have recreated a toy problem demonstrating the issue below.
Synthetic data
To approximate the problem above on a smaller scale, I will create a single column with 32-character ids with
a million unique ids
total length of 200 million rows
split into 100 parquet files
The result will be
100 files of 66MB each, each taking 178MB when loaded as a Pandas dataframe (estimated by df.memory_usage(deep=True).sum())
If loaded as a pandas dataframe, all the data take 20GB in memory
A single Series with all unique ids (which is what I assume the workers also have to keep in memory when computing nunique) takes about 90MB
import string
import os
import numpy as np
import pandas as pd
chars = string.ascii_letters + string.digits
n_total = int(2e8)
n_unique = int(1e6)
# Create random ids
ids = np.sum(np.random.choice(np.array(list(chars)).astype(object), size=[n_unique, 32]), axis=1)
outputdir = os.path.join('/tmp', 'testdata')
os.makedirs(outputdir, exist_ok=True)
# Sample from the ids to create 100 parquet files
for i in range(100):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
Attempt at a solution
Let's assume that my machine only has 8GB of memory. Since the partitions take about 178MB and the result 90MB, according to Wes McKinney's rule of thumb, I might need up to 2-3 GB of memory. Therefore, either
n_workers=2, memory_limit='4GB', or
n_workers=1, memory_limit='8GB'
seems like a good choice. Sadly, when I try it, I get
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
a couple of times, before the worker(s) get killed altogether.
import os
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
cluster = LocalCluster(n_workers=4, memory_limit='6GB')
client = Client(cluster)
dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'))['id'].nunique().compute()
In fact, it seems that with 4 workers, for example, each worker needs 6GB of memory before being able to perform the task.
Can this situation be improved?
That's a great example of a recurring problem. The only shocking thing is that delayed was not used during the synthetic data creation:
import dask

@dask.delayed
def create_sample(i):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
    return

# Sample from the ids to create 100 parquet files
dels = [create_sample(i) for i in range(100)]
_ = dask.compute(dels)
For the following answer I will actually just use a small number of partitions (so change to range(5)), to have sane visualizations. Let's start with the loading:
df = dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'), columns=['id'])
print(df.npartitions) # 5
This is a minor point, but passing columns=['id'] to .read_parquet() exploits parquet's columnar layout (dask might do some of this optimization behind the scenes, but if you know which columns you want, there's no harm in being explicit).
Now, when you run df['id'].nunique(), here's the DAG that dask will compute:
With more partitions, there would be more steps, but it's apparent that there's a potential bottleneck when each partition is trying to send data that is quite large. This data can be very large for high-cardinality columns, so if each worker sends a result that is a 100MB object, then the receiving worker will need 5 times that memory to accept the data (which could potentially decrease after further value-counting).
Additional consideration is how many tasks a single worker can run at a given time. The easiest way to control how many tasks can run at the same time on a given worker is resources. If you initiate the cluster with resources:
cluster = LocalCluster(n_workers=2, memory_limit='4GB', resources={'foo': 1})
Then every worker has the specified resources (in this case it's 1 unit of arbitrary foo), so if you think that processing a single partition should happen one at a time (due to high memory footprint), then you can do:
# note, no split_every is needed in this case since we're just
# passing a single number
df['id'].nunique().compute(resources={'foo': 1})
This will ensure that any single worker is busy with 1 task at a time, preventing excessive memory usage. (side note: there's also .nunique_approx(), which may be of interest)
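For completeness, a sketch of that approximate alternative; it keeps only a small fixed-size sketch per partition, so the combine steps stay cheap at the cost of a small counting error:
approx = df['id'].nunique_approx().compute()
print(approx)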
To control the amount of data that any given worker receives for further processing, one approach is to use the split_every option. Here's what the DAG will look like with split_every=3:
You can see that now (for this number of partitions) the max memory that a worker will need is 3 times the max size of those intermediate results. So depending on your worker memory settings you might want to set split_every to a low value (2, 3, 4 or so).
In general, the more unique the variable, the more memory is needed for each partition's object with unique counts, and so a lower value of split_every is going to be useful to put a cap on the max memory usage. If the variable is not very unique, then each individual partition's unique count will be a small object, so there's no need to have a split_every restriction.
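Putting the two knobs together (a sketch; split_every=3 and the 'foo' resource are illustrative and assume the cluster was started with resources={'foo': 1} as above):
# at most 3 intermediate unique-sets are merged per reduction step,
# and each worker runs only one such task at a time
result = df['id'].nunique(split_every=3).compute(resources={'foo': 1})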
I have a question pertaining to the scheduling/execution order of tasks in dask.distributed for the case of strong data reduction of a large raw dataset.
We are using dask.distributed for code that extracts information from movie frames. Its specific application is crystallography, but quite generally the steps are:
1. Read the frames of a movie, stored as a 3D array in an HDF5 file (or a few such files, concatenated), into a dask array. This is obviously quite I/O-heavy.
2. Group these frames into consecutive sub-stacks of typically 10 movie stills, whose frames are aggregated (summed or averaged), resulting in a single 2D image.
3. Run several computationally heavy analysis functions on the 2D image (such as finding the positions of certain features), returning a dictionary of results that is negligibly small compared to the movie itself.
We implement this by using the dask.array API for steps 1 and 2 (the latter via map_blocks with a block/chunk size of one or a few of the aggregation sub-stacks), then converting the array chunks to dask.delayed objects (using to_delayed), which are passed to the function doing the actual data reduction (step 3). We take care to properly align the chunks of the HDF5 arrays, the dask computation, and the aggregation ranges in step 2, so that the task graph of each final delayed object (the elements of tasks) is very clean. Here's the example code:
import numpy as np
import h5py
import dask.array as da
from dask import delayed
from dask.distributed import Client

def sum_sub_stacks(mov):
    # aggregation function
    sub_stk = []
    for k in range(mov.shape[0]//10):
        sub_stk.append(mov[k*10:k*10+10, ...].sum(axis=0, keepdims=True))
    return np.concatenate(sub_stk)

def get_info(mov):
    # reduction function
    results = []
    for frame in mov:
        results.append({
            'sum': frame.sum(),
            'variance': frame.var()
            # ...actually much more complex/expensive stuff
        })
    return results

# connect to dask.distributed scheduler
client = Client(address='127.0.0.1:8786')

# 1: get the movie
fh = h5py.File('movie_stack.h5')
movie = da.from_array(fh['/entry/data/raw_counts'], chunks=(100, -1, -1))

# 2: sum sub-stacks within movie
movie_aggregated = movie.map_blocks(sum_sub_stacks,
                                    chunks=(10,) + movie.chunks[1:],
                                    dtype=movie.dtype)

# 3: create and run reduction tasks
tasks = [delayed(get_info)(chk)
         for chk in movie_aggregated.to_delayed().ravel()]
info = client.compute(tasks, sync=True)
The ideal scheduling of operations would clearly be for each worker to perform the 1-2-3 sequence on a single chunk and then move on to the next, which would keep I/O load constant, CPUs maxed out and memory low.
What happens instead is that, at first, all workers try to read as many chunks as possible from the files (step 1), which creates an I/O bottleneck and quickly exhausts worker memory, causing spilling/thrashing to the local drives. Often, at some point, workers eventually move on to steps 2/3, which quickly frees up memory and properly uses all CPUs; but in other cases workers get killed in an uncoordinated way, or the entire computation stalls. Intermediate cases also happen, where surviving workers behave reasonably only for a while.
Is there any way to give hints to the scheduler to process the tasks in the preferred order as described above or are there other means to improve the scheduling behavior? Or is there something inherently stupid about this code/way of doing things?
First, there is nothing inherently stupid about what you are doing at all!
In general, Dask tries to reduce the number of temporaries it is holding onto, and it balances this against parallelizability (the width of the graph and the number of workers). Scheduling is complex, and there is yet another optimization Dask applies which fuses tasks together to make them more efficient. With lots of little chunks you may run into issues: https://docs.dask.org/en/latest/array-best-practices.html?highlight=chunk%20size#select-a-good-chunk-size
Dask has a number of optimization configurations which I would recommend playing with after considering other chunk sizes. I would also encourage you to read through the following issue, as there is a healthy discussion around scheduling configurations.
Lastly, you might consider additional memory configuration of your workers, as you may want to control more tightly how much memory each worker should use.
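As a sketch of that last point (the thresholds are the standard distributed.worker.memory settings; the numbers are illustrative, and they have to be in effect where the workers are actually started, e.g. a LocalCluster created in the same process or the workers' dask config file):
import dask
from dask.distributed import Client, LocalCluster

dask.config.set({
    "distributed.worker.memory.target": 0.6,      # start spilling to disk at 60% of the limit
    "distributed.worker.memory.spill": 0.7,       # spill more aggressively at 70%
    "distributed.worker.memory.pause": 0.8,       # stop accepting new tasks at 80%
    "distributed.worker.memory.terminate": 0.95,  # the nanny kills the worker at 95%
})

cluster = LocalCluster(n_workers=4, memory_limit="16GB")  # per-worker limit; adjust to your machine
client = Client(cluster)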