Split a parquet file in smaller chunks using dask

Split a parquet file in smaller chunks using dask - python

I am trying to split a parquet file using DASK with the following piece of code
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical file in input, i.e. file.parquet
The output of this script is as well only one file, i.e. part.0.parquet.
Based on the partition_size & chunksize parameters, I should have multiple files in output
Any help would be appreciated

df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write :
df = df.repartition(partition_size="100MB")
You can check the number of partitions created looking at df.npartitions
Also, you can use the following to write your parquet files :
df.to_parquet(output_path)
Because Parquet files are meant to deal with large files, you should also consider using the argument compression= when writing you parquet files.
You should get what you expect.
NB: Writing import dask.dataframe as pd is missleading because import dask.dataframe as dd is commonly used

Related

How to read a large file as Pandas dataframe?

I want to read a large file (4GB) as a Pandas dataframe. Since using Dask directly still consumes maximum CPU, I read the file as a pandas dataframe, then use dask_cudf, and then convert back to a pandas dataframe.
However, my code is still using maximum CPU on Kaggle. GPU accelerator is switched on.
import pandas as pd
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()

I have had similar problem. With some research, I came to know about Vaex.
You can read about its performance here and here.
Essentially this is what you can try to do:
Read the csv file using Vaex and convert it to a hdf5 file (file format most optimised for Vaex)
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
Open the hdf5 file using Vaex. Vaex will do the memory-mapping and thus will not load data into memory.
vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
Now you can perform operations on your Vaex dataframe just like you would be doing with Pandas. It will be blazingly fast and you will certainly notice huge performance gains (lower CPU and memory usage).
You can also try to read your csv file directly into Vaex dataframe without converting it to hdf5. I had read somewhere that Vaex works fastest with hdf5 files therefore I suggested the above approach.
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5', chunk_size=5_000)

Right now your code suggests that you first attempt to load data using pandas and then convert it to dask-cuDF dataframe. That's not optimal (or might not even be feasible). Instead, one can use dask_cudf.read_csv function (see docs):
from dask_cudf import read_csv
ddf = read_csv('example_output/foo_dask.csv')

Splitting very large csv files into smaller files

Is Dask proper to read large csv files in parallel and split them into multiple smaller files?

Yes, dask can read large CSV files. It will split them into chunks
df = dd.read_csv("/path/to/myfile.csv")
Then, when saving, Dask always saves CSV data to multiple files
df.to_csv("/output/path/*.csv")
See the read_csv and to_csv docstrings for much more information about this.
dd.read_csv
dd.DataFrame.to_csv

Hi Nutsa Nazgaide and welcome on SO. First of all I'd suggest you to read about how-to-ask and mcve. your question is good enough but it will be great to produce a sample of your original dataframe. I'm going to produce a basic dataframe but the logic shouldn't be too different in your case as you just need to consider location.
Generate dataframe
import dask.dataframe as dd
import numpy as np
import pandas as pd
import string
letters = list(string.ascii_lowercase)
N = int(1e6)
df = pd.DataFrame({"member":np.random.choice(letters, N),
"values":np.random.rand(N)})
df.to_csv("file.csv", index=False)
One parquet file (folder) per member
If you're happy to have the output in as parquet you can just use the option partition_on as
df = dd.read_csv("file.csv")
df.to_parquet("output", partition_on="member")
If you then really need csv you can convert to this format. I strongly suggest you to move your data to parquet.

Dask merge and export csv

I have several big CSV files more than 5GB, which need to merge. My RAM is only 8 GB.
Currently, I am using Dask to merge all of the files together and tried to export the data frame into CSV. I cannot export them due to low memory.
import dask.dataframe as dd
file_loc_1=r"..."
file_loc_2=r"..."
data_1=dd.read_csv(file_loc_1,dtype="object",encoding='cp1252')
data_2=dd.read_csv(file_loc_2,dtype="object",encoding='cp1252')
final_1=dd.merge(file_data_1,file_data_2,left_on="A",right_on="A",how="left")
final_loc=r"..."
dd.to_csv(final_1,final_loc,index=False,low_memory=False)
If Dask is not the good way to process the data, please feel free to suggest new methods!
Thanks!

You can read the csv files with pandas.read_csv: setting the chunksize parameter the method returns an iterators. Afterwards you can write a single csv in append mode.
Code example (not tested):
import pandas ad pd
import os
src = ['file1.csv', 'file2.csv']
dst = 'file.csv'
for f in src:
for df in pd.read_csv(f,chuncksize=200000):
if not os.path.isfile(dst):
df.to_csv(dst)
else:
df.to_csv(dst,mode = 'a', header=False)
Useful links:
http://acepor.github.io/2017/08/03/using-chunksize/
Panda's Write CSV - Append vs. Write
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Merge csv files using dask

I am new to python. I am using dask to read 5 large (>1 GB) csv files and merge (SQL like) them into a dask dataframe. Now, I am trying to write the merged result into a single csv. I used compute() on dask dataframe to collect data into a single df and then call to_csv. However, compute() is slow in reading data across all partitions. I tried calling to_csv directly on dask df and it created multiple .part files (I didn't try merging those .part files into a csv). Is there any alternative to get dask df into a single csv or any parameter to compute() to gather data. I am using 6GB RAM with HDD and i5 processor.
Thanks

Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to just avoid Pandas and Dask all-together and just read in bytes from multiple files and dump them to another file
with open(out_filename, 'w') as outfile:
for in_filename in filenames:
with open(in_filename, 'r') as infile:
# if your csv files have headers then you might want to burn a line here with `next(infile)
for line in infile:
outfile.write(line + '\n')
If you don't need to do anything except for merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to read the CSV data into in-memory data, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html

As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like #mrocklin said, I recommend using other file formats.

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples seem all to be going through an external Hive runtime.

pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The above link explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

Update: since the time I answered this there has been a lot of work on this look at Apache Arrow for a better read and write of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create python objects and then you will have to move them to a Pandas DataFrame so the process will be slower than pd.read_csv for example.

Aside from pandas, Apache pyarrow also provides way to transform parquet to dataframe
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the document from Apache pyarrow Reading and Writing Single Files

Parquet
Step 1: Data to play with
df = pd.DataFrame({
'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
'marks': [20,10,22,21,22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction of 8GB file parquet file by using brotli compression. Brotli makes for a smaller file and faster read/writes than gzip, snappy, pickle. Although pickle can do tuples whereas parquet does not.
df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')

Parquet files are always large. so read it using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
files = glob.glob('data/*.parquet')
#delayed
def load_chunk(path):
return ParquetFile(path).to_pandas()
df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()

Considering the .parquet file named data.parquet
parquet_file = '../data.parquet'
open( parquet_file, 'w+' )
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use pandas.to_parquet (this function requires either the fastparquet or pyarrow library) as follows
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows
new_parquet_df = pd.read_parquet(parquet_file)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Split a parquet file in smaller chunks using dask - python

Related

How to read a large file as Pandas dataframe?

Splitting very large csv files into smaller files

Dask merge and export csv

Merge csv files using dask

How to read a Parquet file into Pandas DataFrame?

Categories

Resources