Reading large number of parquet files: read_parquet vs from_delayed

Reading large number of parquet files: read_parquet vs from_delayed - python

I'm reading a larger number (100s to 1000s) of parquet files into a single dask dataframe (single machine, all local). I realized that
files = ['file1.parq', 'file2.parq', ...]
ddf = dd.read_parquet(files, engine='fastparquet')
ddf.groupby(['col_A', 'col_B']).value.sum().compute()
is a lot less efficient than
from dask import delayed
from fastparquet import ParquetFile
#delayed
def load_chunk(pth):
return ParquetFile(pth).to_pandas()
ddf = dd.from_delayed([load_chunk(f) for f in files])
ddf.groupby(['col_A', 'col_B']).value.sum().compute()
For my particular application, the second approach (from_delayed) takes 6 seconds to complete, the first approach takes 39 seconds. In the dd.read_parquet case there seems to be a lot of overhead before the workers even start to do something, and there are quite a few transfer-... operations scattered across the task stream plot. I'd like to understand what's going on here. What could be the reason that the read_parquet approach is so much slower? What does it do differently than just reading the files and putting them in chunks?

You are experiencing the client trying to establish the min/max statistics of the columns of the data, and thereby establish a good index for the dataframe. An index can be very useful in preventing reading from data files which are not needed for your particular job.
In many cases, this is a good idea, where the amount of data in a file is large and the total number of files is small. In other cases, the same information might be contained in a special "_metadata" file, so that there would be no need to read from all the files first.
To prevent the scan of the files' footers, you should call
dd.read_parquet(..,. gather_statistics=False)
This should be the default in the next version of dask.

Related

Dask dataframe concatenate and repartitions large files for time series and correlation

I have 11 years of data with a record (row) every second, over about 100 columns. It's indexed with a series of datetime (created with Pandas to_datetime())
We need to be able to make some correlation analysis between the columns, that can work just 2 columns loaded at a time. We may be resampling at lower time cadence (e.g. 48s, 1 hours, months, etc...) over up to 11 years and visualize those correlations over the 11 years.
The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files load up to about 20GB. The intended target machine will only have 16 GB, loading even just 1 columns over the 11 years takes about 10 GB, so 2 columns will not fit either.
Is there a more effective solution than working with Pandas, for working on the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate them, and (ii) repartition to some number of partitions so Dask can work with 2 columns at a time without blowing up the memory?
I tried the latter solution following this post, and did:
# Read all 11 parquet files in `data/`
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")
but at the 2nd step, Dask blew up my memory and I got a kernel shutdown.
As Dask is a lot about working with larger-than-memory data, I am surprise this memory escalation happened.
----------------- UPDATE 1 ROW GROUPS---------------
I reprocessed the parquet files with Pandas, to create about 20 row groups (it had defaulted to just 1 group per file). Now regardless of setting split_row_groups to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean(). I have to do compute() on the Dask series first to get it as a Pandas dataframe, which seems to defeat the purpose of working with the row groups within Dask.
When doing that resampling, I get instead:
ValueError: Can only resample dataframes with known divisions See
https://docs.dask.org/en/latest/dataframe-design.html#partitions for
more information.
I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.

dask.dataframe by default is structured a bit more toward reading smaller "hive" parquet files rather than chunking individual huge parquet files into manageable pieces. From the dask.dataframe docs:
By default, Dask will load each parquet file individually as a partition in the Dask dataframe. This is performant provided all files are of reasonable size.
We recommend aiming for 10-250 MiB in-memory size per file once loaded into pandas. Too large files can lead to excessive memory usage on a single worker, while too small files can lead to poor performance as the overhead of Dask dominates. If you need to read a parquet dataset composed of large files, you can pass split_row_groups=True to have Dask partition your data by row group instead of by file. Note that this approach will not scale as well as split_row_groups=False without a global _metadata file, because the footer will need to be loaded from every file in the dataset.
I'd try a few strategies here:
Only read in the columns you need. Since your files are so huge, you don't want dask even trying to load the first chunk to infer structure. You can provide the columns key dd.read_parquet which will be passed through to various stages of the parsing engines. In this case, dd.read_parquet(filepath, columns=list_of_columns).
If your parquet files have multiple row groups, you can make use of the dd.read_parquet argument split_row_groups=True. This will create smaller chunks which are each smaller than the full file size.
If (2) works, you may be able to avoid repartitioning, or if you need to, repartition to a multiple of your original number of partitions (22, 33, etc). When reading data from a file, dask doesn't know how large each partition is, and if you specify a number less than a multiple of the current number of partitions, the partitioning behavior isn't very well defined. On some small tests I've run, repartitioning 11 --> 20 will leave the first 10 partitions as-is and split the last one into the remaining 10!
If your file is on disk, you may be able to read the file as a memory map to avoid loading the data prior to repartitioning. You can do this by passing memory_map=True to dd.read_parquet.
I'm sure you're not the only one with this problem. Please let us know how this goes and report back what works!

dask out of memory error despite data size doesnt exceed memory

I'm trying to load a dask dataframe from a MySQL table which takes about 4gb space on disk. I'm using a single machine with 8gb of memory but as soon as I do a drop duplicate and try to get the length of the dataframe, an out of memory error is encountered.
Here's a snippet of my code:
df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))
I have tried more partitions for the dataframe(as many as 64) but they also failed. I'm confused why this could cause an OOM? The dataframe should fit in memory even without any parallel processing.

which takes about 4gb space on disk
It is very likely to be much much bigger than this in memory. Disk storage is optimised for compactness, with various encoding and compression mechanisms.
The dataframe should fit in memory
So, have you measured its size as a single pandas dataframe?
You should also keep in mind than any processing you do to your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding duplicates, which must happen before you can discard any data.
Finally, in a parallel framework like dask, there may be multiple threads and processes (you don't specify how you are running dask) which need to marshal their work and assemble the final output while the client and scheduler also take up some memory. In short, you need to measure your situation, perhaps tweak worker config options.

You don't want to read an entire DataFrame into a Dask DataFrame and then perform filtering in Dask. It's better to perform filtering at the database level and then read a small subset of the data into a Dask DataFrame.
MySQL can select columns and drop duplicates with distinct. The resulting data is what you should read in the Dask DataFrame.
See here for more information on syntax. It's easiest to query databases that have official connectors, like dask-snowflake.

Efficient use of dask with parquet files

I have received a huge (140MM records) dataset and Dask has come in handy but I'm not sure if I could perhaps do a better job. Imagine the records are mostly numeric (two columns are dates), so the process to transform from CSV to parquet was a breeze (dask.dataframe.read_csv('in.csv').to_parquet('out.pq')), but
(i) I would like to use the data on Amazon Athena, so a single parquet file would be nice. How to achieve this? As it stands, Dask saved it as hundreds of files.
(ii) For the Exploratory Data Analysis I'm trying with this dataset, there are certain operations where I need more then a couple of variables, which won't fit into memory so I'm constantly dumping two/three-variable views into SQL, is this code efficient use of dask?
mmm = ['min','mean','max']
MY_COLUMNS = ['emisor','receptor','actividad', 'monto','grupo']
gdict = {'grupo': mmm, 'monto': mmm, 'actividad': ['mean','count']}
df = dd.read_parquet('out.pq', columns=MY_COLUMNS).groupby(['emisor','receptor']).agg(gdict)
df = df.compute()
df.columns = ['_'.join(c) for c in df.columns] # ('grupo','max') -> grupo_max
df.to_sql('er_stats',conn,index=False,if_exists='replace')
Reading the file takes about 80 and writing to SQL about 60 seconds.

To reduce the number of partitions, you should either set the blocksize when reading the CSV (preferred), or repartition before writing the parquet. The "best" size depends on your memory and number of workers, but a single partition is probably not possible if your data is "huge". Putting the many partitions into a single file is also not possible (or, rather, not implemented), because dask writes in parallel and there would be no way of knowing where in the file the next part goes before the previous part is finished. I could imagine writing code to read in successive dask-produced parts and streaming them into a single output, it would not be hard but perhaps not trivial either.
writing to SQL about 60 seconds
This suggests that your output is still quite large. Is SQL the best option here? Perhaps writing again to parquet files would be possible.

Convert a csv larger than RAM into parquet with Dask

I have approximately 60,000 small CSV files of varying sizes 1MB to several hundred MB that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB. This is larger than the memory of the server that I am using (678 GB available).
Since all the CSVs have same fields, I've concatenated them into a single large file. I tried to process this file with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent dask running out of memory when the job was split over multiple workers.
What happens is that eventually Dask does run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?

You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
In general I recommend using Dask's dashboard to watch the computation, and see what is taking up your memory. This might help you find a good solution. https://docs.dask.org/en/latest/diagnostics-distributed.html

Python large dataset feature engineering workflow using dask hdf/parquet

There is already a nice question about it in SO but the best answer is now 5years old, So I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for larger than memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns.)
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (This is an iterative process like day1: create 4
features, day2: create 4 more ...)
Attempt with parquet and dask:
First, I splitted the big csv file in multiple small "parquet" files. With this, dask is very efficient for the calculation of new features but then, I need to merge them to the initial dataset and atm, we cannot add new columns to parquet files. Reading the csv by chunk, merging and resaving to multiple parquet files is too time consuming as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries and it is still a binary file storage. Once again I splitted the big csv file to multiple HDF with the same key='base' for the base features, in order to use the concurrent writing with DASK (not allowed by HDF).
data = data.repartition(npartitions=10) # otherwise it was saving 8Mo files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Annex quetion: specifying data_columns seems useless for dask as there is no "where" in dask.read_hdf?)
Unlike what I expected, I am not able to merge the new feature to the multiples small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
with dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and create new files
What are the most appropriated tools (storage and processing) for this workflow?

I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. I addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.

I would seriously consider using database (indexed access) as a storage or even using Apache Spark (for processing data in a distributed / clustered way) and Hive / Impala as a backend ...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.