Merge small parquet files into a single large parquet file - python

I have been trying to merge small parquet files, each with 10k rows; for each set the number of small files will be 60-100, so the merged parquet file ends up with around 600k rows at minimum.
I have been using pandas concat, and it works fine when merging around 10-15 small files.
But since a set may consist of 50-100 files, the Python script gets killed while running because the memory limit is breached.
So I am looking for a memory-efficient way to merge any number of small parquet files, in the range of 100 files per set.
I used pandas read_parquet to read each individual file into a dataframe and combined them with pd.concat(all_dataframes).
Is there a better library than pandas, or, if it is possible in pandas, how can this be done efficiently?
Time is not a constraint; it can run for a long time.

For data this large you should definitely use the PySpark library; split it into smaller pieces if possible, and only then use Pandas.
PySpark is very similar to Pandas.
link
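
For what it's worth, here is a minimal PySpark sketch of such a merge, assuming the small files live in a directory called small_files/ (the paths and session settings are placeholders, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Spark reads the directory of small files lazily, so they are never all in memory at once
df = spark.read.parquet("small_files/")

# coalesce(1) produces a single part file inside the merged/ output directory;
# drop it if one file per partition is acceptable
df.coalesce(1).write.mode("overwrite").parquet("merged/")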

You can open the files one by one and append them to the output parquet file. It is best to use pyarrow for this.
import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]

# Reuse the schema of the first file for the writer; all input files must share it
with pq.ParquetWriter("output.parquet", schema=pq.ParquetFile(files[0]).schema_arrow) as writer:
    for file in files:
        # Only one small table is held in memory at a time
        writer.write_table(pq.read_table(file))
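
If a single input file were ever too large to read in one go (not the case here, with only 10k rows per file), a variant of the same idea is to stream record batches instead of whole tables; this is an extension of the answer above, not part of it:

import pyarrow as pa
import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]
schema = pq.ParquetFile(files[0]).schema_arrow

with pq.ParquetWriter("output.parquet", schema=schema) as writer:
    for file in files:
        # iter_batches yields RecordBatch objects of at most batch_size rows,
        # so peak memory is bounded by the batch size rather than the file size
        for batch in pq.ParquetFile(file).iter_batches(batch_size=10_000):
            writer.write_table(pa.Table.from_batches([batch], schema=schema))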

Reading and processing multiple csv files with limited RAM in Python

I need to read thousands of csv files and output them as a single csv file in Python.
Each of the original files will be used to create a single row in the final output, with the columns being some operation over the rows of that original file.
Due to the combined size of the files, this takes many hours to process, and the data cannot all be loaded into memory at once.
I am able to read in each csv and delete it from memory to solve the RAM issue. However, I am currently iteratively reading and processing each csv (in Pandas) and appending the output row to the final csv, which seems slow. I believe I can use the multiprocessing library to have each process read and process its own csv, but wasn't sure if there was a better way than this.
What is the fastest way to complete this in Python while having RAM limitations?
As an example, ABC.csv and DEF.csv would be read and processed into individual rows in the final output csv. (The actual files would have tens of columns and hundreds of thousands of rows)
ABC.csv:
id,col1,col2
abc,2.3,3
abc,3.7,5
abc,3.0,9
DEF.csv:
id,col1,col2
def,1.9,3
def,2.8,2
def,1.6,1
Final Output:
id,col1_avg,col2_max
abc,3.0,9
def,2.1,3
I would suggest using dask for this. It's a library that allows you to do parallel processing on large datasets.
import dask.dataframe as dd

# Read every csv in the current directory into one lazy dataframe
df = dd.read_csv('*.csv')
# One output row per id: mean of col1, max of col2
df = df.groupby('id').agg({'col1': 'mean', 'col2': 'max'})
# single_file=True writes one csv rather than one file per partition
df.to_csv('output.csv', single_file=True)
Code explanation
dd.read_csv will read all the csv files in the current directory and concatenate them into a single dataframe.
df.groupby('id').agg({'col1': 'mean', 'col2': 'max'}) will group the dataframe by the id column and then calculate the mean of col1 and the max of col2 for each group.
df.to_csv('output.csv', single_file=True) will write the result to a single csv file; without single_file=True, dask writes one csv per partition.
Performance
I tested this on my machine with a directory containing 10,000 csv files with 10,000 rows each. The code took about 2 minutes to run.
Installation
To install dask, run pip install dask.

Is it possible to have one meta file for multiple parquet data files?

I have a process that generates millions of small dataframes and saves them to parquet in parallel.
All the dataframes have the same columns and index information, and the same number of rows (about 300).
Because the dataframes are small, when they are saved as parquet files the metadata is quite big in comparison with the data. And since the metadata for each parquet file is basically the same, disk space is wasted because the same metadata is repeated millions of times.
Is it possible to save one copy of the metadata while the other parquet files contain only the data, so that when I need to read a dataframe I read the metadata and the data from two different files?
Some updates:
Concatenating them into one big dataframe can save disk space, but it is not friendly to parallel processing of each small dataframe.
I also tried other formats like feather, but it seems that feather does not store the data as efficiently as parquet: its metadata is smaller, but the overall size is larger than the parquet metadata plus the parquet data.
This is not possible, at least using python pandas (fastparquet and pyarrow don't have any such feature).
I do see a parameter that disables statistics in the footer:
pyarrow - write_statistics=False
fastparquet - stats=False
However, this will not save you a lot of disk space; only some statistics-related info will be left out of the parquet metadata footer.
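
For reference, a minimal sketch of the pyarrow option (the table and file name are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})
# write_statistics=False leaves per-column statistics out of the footer;
# the schema and row-group metadata are still written, so the saving is small
pq.write_table(table, "no_stats.parquet", write_statistics=False)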
Is it possible to save one copy of the metadata while the other parquet files contain only the data?
You want to write multiple data files that contain only row groups, without footers, plus a single metadata file consisting of only a footer. In that case none of those files would be a valid parquet file. This should be possible in theory, but no known implementation exists. Check out the comments on this thread. Maybe reach out to the parquet community on Slack to find out whether any such implementation exists.
My suggestion would be to combine the dataframes somehow before writing them to parquet on disk, or to run a job at a later stage that merges these files. Neither option is efficient, since you have a huge number of small dataframes/files.
You could also write the 300 rows per dataframe into some kind of intermediate database and convert to parquet later.
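
As a sketch of the "combine before writing" suggestion: a single shared file can hold each small dataframe as its own row group, so the footer is written only once. The dataframe generator below is a stand-in for the real process:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the stream of ~300-row dataframes produced by the real process
small_dfs = (pd.DataFrame({"x": range(300)}) for _ in range(1000))

writer = None
for df in small_dfs:
    table = pa.Table.from_pandas(df)
    if writer is None:
        # The schema and footer are written once for the whole file
        writer = pq.ParquetWriter("combined.parquet", table.schema)
    # Each small dataframe becomes one row group in the shared file
    writer.write_table(table)
if writer is not None:
    writer.close()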

Dask dataframe concatenate and repartitions large files for time series and correlation

I have 11 years of data with a record (row) every second, over about 100 columns. It's indexed with a series of datetime (created with Pandas to_datetime())
We need to be able to do some correlation analysis between the columns, which can work with just 2 columns loaded at a time. We may be resampling at a lower time cadence (e.g. 48 s, 1 hour, months, etc.) over up to 11 years and visualizing those correlations over the 11 years.
The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files loads up to about 20 GB. The intended target machine will only have 16 GB; loading even just 1 column over the 11 years takes about 10 GB, so 2 columns will not fit either.
Is there a more effective solution than working with Pandas, for working on the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate them, and (ii) repartition to some number of partitions so Dask can work with 2 columns at a time without blowing up the memory?
I tried the latter solution following this post, and did:
# Read all 11 parquet files in `data/`
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")
but at the 2nd step, Dask blew up my memory and I got a kernel shutdown.
As Dask is largely about working with larger-than-memory data, I am surprised this memory escalation happened.
----------------- UPDATE 1: ROW GROUPS -----------------
I reprocessed the parquet files with Pandas to create about 20 row groups (it had defaulted to just 1 group per file). Now, regardless of setting split_row_groups to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean()). I have to call compute() on the Dask series first to get it as a Pandas dataframe, which seems to defeat the purpose of working with the row groups within Dask.
When doing that resampling, I get instead:
ValueError: Can only resample dataframes with known divisions See
https://docs.dask.org/en/latest/dataframe-design.html#partitions for
more information.
I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.
dask.dataframe by default is structured a bit more toward reading smaller "hive" parquet files rather than chunking individual huge parquet files into manageable pieces. From the dask.dataframe docs:
By default, Dask will load each parquet file individually as a partition in the Dask dataframe. This is performant provided all files are of reasonable size.
We recommend aiming for 10-250 MiB in-memory size per file once loaded into pandas. Too large files can lead to excessive memory usage on a single worker, while too small files can lead to poor performance as the overhead of Dask dominates. If you need to read a parquet dataset composed of large files, you can pass split_row_groups=True to have Dask partition your data by row group instead of by file. Note that this approach will not scale as well as split_row_groups=False without a global _metadata file, because the footer will need to be loaded from every file in the dataset.
I'd try a few strategies here:
Only read in the columns you need. Since your files are so huge, you don't want dask even trying to load the first chunk to infer structure. You can provide the columns keyword to dd.read_parquet, which will be passed through to the parsing engine. In this case, dd.read_parquet(filepath, columns=list_of_columns).
If your parquet files have multiple row groups, you can make use of the dd.read_parquet argument split_row_groups=True. This will create smaller chunks which are each smaller than the full file size (a combined sketch of these first two suggestions follows this list).
If (2) works, you may be able to avoid repartitioning, or if you need to, repartition to a multiple of your original number of partitions (22, 33, etc). When reading data from a file, dask doesn't know how large each partition is, and if you specify a number less than a multiple of the current number of partitions, the partitioning behavior isn't very well defined. On some small tests I've run, repartitioning 11 --> 20 will leave the first 10 partitions as-is and split the last one into the remaining 10!
If your file is on disk, you may be able to read the file as a memory map to avoid loading the data prior to repartitioning. You can do this by passing memory_map=True to dd.read_parquet.
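Putting the first two suggestions together, a minimal sketch; the column names are placeholders, and calculate_divisions is an assumption about newer dask versions (older versions used gather_statistics), not something from the original answer:

import dask.dataframe as dd

# Read only the two columns needed for one correlation, partitioned by row group.
# calculate_divisions=True reads the index min/max statistics so divisions are
# known and resample() can work without an explicit set_index.
df = dd.read_parquet(
    "/blah/parquet/",
    engine="pyarrow",
    columns=["col_a", "col_b"],   # hypothetical column names
    split_row_groups=True,
    calculate_divisions=True,
)

# Resample one column and correlate the pair without loading the full files
resampled = df["col_a"].resample("48s").mean()
correlation = df[["col_a", "col_b"]].corr().compute()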
I'm sure you're not the only one with this problem. Please let us know how this goes and report back what works!

From hdf5 files to csv files with Python

I have to process hdf5 files. Each of them contains data that can be loaded into a pandas DataFrame formed by 100 columns and almost 5E5 rows. Each hdf5 file weighs approximately 130MB.
So I want to fetch the data from the hdf5 file then apply some processing and finally save the new data in a csv file. In my case, the performance of the process is very important because I will have to repeat it.
So far I have focused on Pandas and Dask to get the job done. Dask is good for parallelization and I will get good processing times with a stronger PC and more CPUs.
However, have some of you already encountered this problem and found the best optimization?
As others have mentioned in the comments, unless you have to move it to CSV, I'd recommend keeping it in HDF5. However, below is a description of how you might do it if you do have to carry out the conversion.
It sounds like you have a function for loading the HDF5 file into a pandas data frame. I would suggest using dask's delayed API to create a list of delayed pandas data frames, and then convert them into a dask data frame. The snippet below is copied from the linked page, with an added line to save to CSV.
import dask.dataframe as dd
from dask.delayed import delayed
from my_custom_library import load

filenames = ...
# One delayed (lazy) pandas dataframe per HDF5 file
dfs = [delayed(load)(fn) for fn in filenames]
# Combine the delayed objects into a single dask dataframe
df = dd.from_delayed(dfs)
df.to_csv(filename, **kwargs)
See dd.to_csv() documentation for info on options for saving to CSV.

Efficient use of dask with parquet files

I have received a huge (140MM records) dataset and Dask has come in handy but I'm not sure if I could perhaps do a better job. Imagine the records are mostly numeric (two columns are dates), so the process to transform from CSV to parquet was a breeze (dask.dataframe.read_csv('in.csv').to_parquet('out.pq')), but
(i) I would like to use the data on Amazon Athena, so a single parquet file would be nice. How can I achieve this? As it stands, Dask saved it as hundreds of files.
(ii) For the Exploratory Data Analysis I'm trying with this dataset, there are certain operations where I need more than a couple of variables, which won't fit into memory, so I'm constantly dumping two/three-variable views into SQL. Is this code an efficient use of dask?
mmm = ['min','mean','max']
MY_COLUMNS = ['emisor','receptor','actividad', 'monto','grupo']
gdict = {'grupo': mmm, 'monto': mmm, 'actividad': ['mean','count']}
df = dd.read_parquet('out.pq', columns=MY_COLUMNS).groupby(['emisor','receptor']).agg(gdict)
df = df.compute()
df.columns = ['_'.join(c) for c in df.columns] # ('grupo','max') -> grupo_max
df.to_sql('er_stats',conn,index=False,if_exists='replace')
Reading the file takes about 80 seconds and writing to SQL about 60 seconds.
To reduce the number of partitions, you should either set the blocksize when reading the CSV (preferred), or repartition before writing the parquet. The "best" size depends on your memory and number of workers, but a single partition is probably not possible if your data is "huge". Putting the many partitions into a single file is also not possible (or, rather, not implemented), because dask writes in parallel and there would be no way of knowing where in the file the next part goes before the previous part is finished. I could imagine writing code to read in successive dask-produced parts and stream them into a single output; it would not be hard, but perhaps not trivial either.
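A minimal sketch of the two options mentioned above (the blocksize value and partition count are illustrative assumptions, not recommendations):

import dask.dataframe as dd

# Option 1: control partition size when reading the CSV (preferred)
df = dd.read_csv('in.csv', blocksize='256MB')   # bigger blocks -> fewer output files
df.to_parquet('out.pq')

# Option 2: repartition an already-written dataset before rewriting it
df = dd.read_parquet('out.pq')
df.repartition(npartitions=20).to_parquet('out_fewer_parts.pq')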
writing to SQL about 60 seconds
This suggests that your output is still quite large. Is SQL the best option here? Perhaps writing again to parquet files would be possible.
