The documentation of the Dask package for dataframes says:
Dask dataframes look and feel like pandas dataframes, but operate on
datasets larger than memory using multiple threads.
But later in the same page:
One dask DataFrame is comprised of several in-memory pandas DataFrames
separated along the index.
Does Dask read the different DataFrame partitions from disk sequentally and perform computations to fit into memory? Does it spill some partitions to disk when needed? In general, how does Dask manage the memory <--> disk IO of data to allow larger-than-memory data analysis?
I tried to perform some basic computations (e.g. mean rating) on the 10M MovieLens dataset and my laptop (8GB RAM) started to swap.
Dask.dataframe loads data lazily and attempts to perform your entire computation in one linear scan through the dataset. Surprisingly, this is usually doable.
Intelligently dumping down to disk is also an option that it can manage, especially when shuffles are required, but generally there are ways around this.
I happen to come to this page after 2 years and now there is an easy option to limit memory usage by each worker. Think that was included by #MRocklin after this thread got inactive.
$ dask-worker tcp://scheduler:port --memory-limit=auto # total available RAM on the machine
$ dask-worker tcp://scheduler:port --memory-limit=4e9 # four gigabytes per worker process.
This feature is called Spill-to-disk policy for workers and details can be found here in the documentation.
Apparently, extra data will be spilled to a directory as specified by the command below:
$ dask-worker tcp://scheduler:port --memory-limit 4e9 --local-directory /scratch
That data is still available and will be read back from disk when necessary.
Related
I'm trying to load a dask dataframe from a MySQL table which takes about 4gb space on disk. I'm using a single machine with 8gb of memory but as soon as I do a drop duplicate and try to get the length of the dataframe, an out of memory error is encountered.
Here's a snippet of my code:
df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))
I have tried more partitions for the dataframe(as many as 64) but they also failed. I'm confused why this could cause an OOM? The dataframe should fit in memory even without any parallel processing.
which takes about 4gb space on disk
It is very likely to be much much bigger than this in memory. Disk storage is optimised for compactness, with various encoding and compression mechanisms.
The dataframe should fit in memory
So, have you measured its size as a single pandas dataframe?
You should also keep in mind than any processing you do to your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding duplicates, which must happen before you can discard any data.
Finally, in a parallel framework like dask, there may be multiple threads and processes (you don't specify how you are running dask) which need to marshal their work and assemble the final output while the client and scheduler also take up some memory. In short, you need to measure your situation, perhaps tweak worker config options.
You don't want to read an entire DataFrame into a Dask DataFrame and then perform filtering in Dask. It's better to perform filtering at the database level and then read a small subset of the data into a Dask DataFrame.
MySQL can select columns and drop duplicates with distinct. The resulting data is what you should read in the Dask DataFrame.
See here for more information on syntax. It's easiest to query databases that have official connectors, like dask-snowflake.
since i couldn't find the best way to deal with my issue i came here to ask..
I'm a beginner with Python but i have to handle a large dataset.
However, i don't know what's the best way to handle the "Memory Error" problem.
I already have a 64 bits 3.7.3 Python version.
I saw that we can use TensorFlow or specify chunks in the pandas instruction or use the library Dask but i don't know which one is the best to fit with my problem and as a beginner it's not very clear.
I have a huge dataset (over 100M observations) i don't think reducing the dataset would decrease a lot the memory.
What i want to do is to test multiple ML algorithms with a train and test samples. I don't know how to deal with the problem.
Thanks!
This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask:
Use a columnar file format like Parquet so you can leverage column pruning
Use column dtypes that require less memory int8 instead of int64
Strategically persist in memory, where appropriate
Use a cluster that's sized well for your data (running an analysis on 2GB of data requires different amounts of memory than 2TB)
split data into multiple files so it's easier to process in parallel
Your data has 100 million rows, which isn't that big (unless it had thousands of columns). Big data typically has billions or trillions of rows.
Feel free to add questions that are more specific and I can provide more specific advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)) and the actual code you're trying to run.
I have approximately 60,000 small CSV files of varying sizes 1MB to several hundred MB that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB. This is larger than the memory of the server that I am using (678 GB available).
Since all the CSVs have same fields, I've concatenated them into a single large file. I tried to process this file with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent dask running out of memory when the job was split over multiple workers.
What happens is that eventually Dask does run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?
You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
In general I recommend using Dask's dashboard to watch the computation, and see what is taking up your memory. This might help you find a good solution. https://docs.dask.org/en/latest/diagnostics-distributed.html
Would anyone be able to tell me how dask works for larger than memory dataset in simple terms. For example I have a dataset which is 6GB and 4GB RAM with 2 Cores. How would dask go about loading the data and doing a simple calculation such as sum of a column.
Does dask automatically check the size of the memory and chunk the dataset to smaller than memory pieces. Then, once requested to compute bring chunk by chunk into memory and do the computation using each of the available cores. Am I right on this.
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern in loading the data
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should comfortably fit into memory - not be too big - but the time it takes to load and process each should be much longer than the overhead imposed by scheduling the task to run on a worker (the latter <1ms) - not be too small.
There is already a nice question about it in SO but the best answer is now 5years old, So I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for larger than memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns.)
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (This is an iterative process like day1: create 4
features, day2: create 4 more ...)
Attempt with parquet and dask:
First, I splitted the big csv file in multiple small "parquet" files. With this, dask is very efficient for the calculation of new features but then, I need to merge them to the initial dataset and atm, we cannot add new columns to parquet files. Reading the csv by chunk, merging and resaving to multiple parquet files is too time consuming as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries and it is still a binary file storage. Once again I splitted the big csv file to multiple HDF with the same key='base' for the base features, in order to use the concurrent writing with DASK (not allowed by HDF).
data = data.repartition(npartitions=10) # otherwise it was saving 8Mo files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Annex quetion: specifying data_columns seems useless for dask as there is no "where" in dask.read_hdf?)
Unlike what I expected, I am not able to merge the new feature to the multiples small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
with dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and create new files
What are the most appropriated tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. I addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using database (indexed access) as a storage or even using Apache Spark (for processing data in a distributed / clustered way) and Hive / Impala as a backend ...