I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge (SQL-like) them into a Dask dataframe. Now I am trying to write the merged result into a single CSV. I used compute() on the Dask dataframe to collect the data into a single pandas df and then called to_csv. However, compute() is slow at reading data across all partitions. I tried calling to_csv directly on the Dask df and it created multiple .part files (I didn't try merging those .part files into a CSV). Is there any alternative to get the Dask df into a single CSV, or any parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention, it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to avoid Pandas and Dask altogether and just read bytes from the multiple files and dump them into another file:
with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # each line already ends with '\n'
If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV data into in-memory dataframes, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards, as in the sketch below.
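For example, a hedged sketch of that second workflow (the file names and the processing step are placeholders; it assumes Dask's default of writing a header into every part file):

import dask.dataframe as dd

df = dd.read_csv("input-*.csv")
# ... do whatever pandas/dask processing you need here ...
# one CSV per partition; to_csv returns the list of paths it wrote
part_files = df.to_csv("processed-*.csv", index=False)

# merge the partition files with the byte-copy trick above,
# keeping only the first file's header
with open("merged.csv", "w") as outfile:
    for i, in_filename in enumerate(part_files):
        with open(in_filename) as infile:
            header = next(infile)
            if i == 0:
                outfile.write(header)
            for line in infile:
                outfile.write(line)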
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
As @mrocklin said, I recommend using other file formats.
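Tying that back to the question, a rough sketch of merging several CSVs and writing one output file (the file names and the join key are placeholders):

import dask.dataframe as dd

left = dd.read_csv("left-*.csv")
right = dd.read_csv("right-*.csv")

merged = dd.merge(left, right, on="key", how="left")
# single_file=True (Dask >= 2.4.0) writes one CSV instead of one file per partition
merged.to_csv("merged.csv", single_file=True, index=False)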
Related
I am trying to perform analysis on dozens of very large CSV files, each with hundreds of thousands of rows of time series data and each roughly 5 GB in size.
My goal is to read in each of these CSV files as a dataframe, perform calculations on these dataframes, append some new columns based on those calculations, and then write each new dataframe to a unique output CSV file for each input CSV file. This whole process occurs within a for loop iterating through a folder containing all of these large CSV files. The whole process is very memory intensive, and when I try to run my code, I am met with this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
And so I want to explore a way to make the process of reading in my CSVs much less memory intensive, which is why I want to try out the pickle module in Python.
To "pickle" each CSV and then read it back in, I try the following:
# Pickle the CSV's DataFrame and read it back in as a pickle
import pickle
import pandas as pd

df = pd.read_csv(path_to_csv)

filename = "pickle.csv"
with open(filename, 'wb') as file:
    pickle.dump(df, file)

with open(filename, 'rb') as file:
    pickled_df = pickle.load(file)

print(pickled_df)
However, after including this pickling code to read in my data in my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas to begin with before pickling and then reading that pickle. My question is: how do I avoid the memory-intensive process of reading my data into a pandas dataframe by just reading in the CSV with pickle? Most of the instruction I am finding tells me to pickle the CSV and then read in that pickle, but I do not understand how to pickle the CSV without first reading it in with pandas, which is what is causing my code to crash. I am also confused about whether reading in my data as a pickle would still provide me with a dataframe I can perform calculations on.
I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical file in input, i.e. file.parquet
The output of this script is likewise only one file, i.e. part.0.parquet.
Based on the partition_size and chunksize parameters, I would expect multiple output files.
Any help would be appreciated
df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write:
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions.
Also, you can use the following to write your parquet files:
df.to_parquet(output_path)
Because Parquet is meant for large files, you should also consider using the compression= argument when writing your parquet files.
You should get what you expect.
NB: Writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the common convention.
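Putting those corrections together, a minimal sketch of the full script might look like this (dataset_path and output_path are the paths from the question; compression="snappy" is just one example value):

import dask.dataframe as dd

df = dd.read_parquet(dataset_path)
df = df.repartition(partition_size="100MB")
print(df.npartitions)  # number of output files to expect
df.to_parquet(output_path, compression="snappy")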
I have several big CSV files, each more than 5 GB, which need to be merged. My RAM is only 8 GB.
Currently, I am using Dask to merge all of the files together and have tried to export the data frame to CSV. I cannot export it due to low memory.
import dask.dataframe as dd
file_loc_1=r"..."
file_loc_2=r"..."
data_1=dd.read_csv(file_loc_1,dtype="object",encoding='cp1252')
data_2=dd.read_csv(file_loc_2,dtype="object",encoding='cp1252')
final_1=dd.merge(data_1,data_2,left_on="A",right_on="A",how="left")
final_loc=r"..."
dd.to_csv(final_1,final_loc,index=False,low_memory=False)
If Dask is not the good way to process the data, please feel free to suggest new methods!
Thanks!
You can read the csv files with pandas.read_csv: when you set the chunksize parameter, the method returns an iterator of chunks. Afterwards you can write a single csv in append mode.
Code example (not tested):
import pandas as pd
import os

src = ['file1.csv', 'file2.csv']
dst = 'file.csv'

for f in src:
    for df in pd.read_csv(f, chunksize=200000):
        if not os.path.isfile(dst):
            df.to_csv(dst)  # first chunk: create the file with a header
        else:
            df.to_csv(dst, mode='a', header=False)  # subsequent chunks: append without a header
Useful links:
http://acepor.github.io/2017/08/03/using-chunksize/
Panda's Write CSV - Append vs. Write
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I have a huge list of GZip files which need to be converted to Parquet. Due to the nature of GZip compression, decompressing a single file cannot be parallelized.
However, since I have many, is there a relatively easy way to let every node do a part of the files? The files are on HDFS. I assume that I cannot use the RDD infrastructure for the writing of the Parquet files because this is all done on the driver as opposed to on the nodes themselves.
I could parallelize the list of file names and write a function that handles the Parquet files locally and saves them back to HDFS, but I don't know how to do that. I feel like I'm missing something obvious. Thanks!
This was marked as a duplicate question; however, that is not the case. I am fully aware of the ability of Spark to read them in as RDDs without having to worry about the compression; my question is more about how to parallelize converting these files to structured Parquet files.
If I knew how to interact with Parquet files without Spark itself I could do something like this:
def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    write_csv_to_parquet_on_hdfs(gzipped_csv, file_to)

# Filename RDD contains tuples with file_from and file_to
filenameRDD.map(lambda x: convert_gzip_to_parquet(x[0], x[1]))
That would allow me to parallelize this; however, I don't know how to interact with HDFS and Parquet from a local environment. I want to know either:
1) How to do that
Or..
2) How to parallelize this process in a different way using PySpark
I would suggest one of the following two approaches (in practice I have found the first one to give better performance).
Write each Zip-File to a separate Parquet-File
Here you can use pyarrow to write a Parquet-File to HDFS:
def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    pyarrow_table = to_pyarrow_table(gzipped_csv)
    hdfs_client = pyarrow.HdfsClient()
    with hdfs_client.open(file_to, "wb") as f:
        pyarrow.parquet.write_table(pyarrow_table, f)

# Filename RDD contains tuples with file_from and file_to
filenameRDD.map(lambda x: convert_gzip_to_parquet(x[0], x[1]))
There are two ways to obtain pyarrow.Table objects:
either obtain it from a pandas DataFrame (in which case you can also use pandas' read_csv() function): pyarrow_table = pyarrow.Table.from_pandas(pandas_df)
or manually construct it using pyarrow.Table.from_arrays
For pyarrow to work with HDFS one needs to set several environment variables correctly, see here
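As a hedged illustration of the first route (the file names are placeholders; the table is written to a local path here rather than HDFS, but the write_table call is the same one used inside the function above):

import pandas as pd
import pyarrow
import pyarrow.parquet

# pandas can read the gzipped CSV directly
pandas_df = pd.read_csv("local_file.csv.gz", compression="gzip")

# build the Arrow table from the pandas DataFrame and write it out as Parquet
pyarrow_table = pyarrow.Table.from_pandas(pandas_df)
pyarrow.parquet.write_table(pyarrow_table, "local_file.parquet")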
Concatenate the rows from all Zip-Files into one Parquet-File
def get_rows_from_gzip(file_from):
    rows = read_gzip_file(file_from)
    return rows

# read the rows of each zip file into a Row object
rows_rdd = filenameRDD.map(lambda x: get_rows_from_gzip(x[0]))

# flatten list of lists
rows_rdd = rows_rdd.flatMap(lambda x: x)

# convert to DataFrame and write to Parquet
df = spark_session.createDataFrame(rows_rdd)
df.write.parquet(file_to)
If you know the schema of the data in advance, passing a schema object to createDataFrame will speed up the creation of the DataFrame, as in the sketch below.
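A rough sketch of supplying an explicit schema (the column names and types here are invented for illustration):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

# with an explicit schema, Spark skips sampling the RDD to infer types
df = spark_session.createDataFrame(rows_rdd, schema=schema)
df.write.parquet(file_to)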
How do I read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Update: since the time I answered this there has been a lot of work in this area. Have a look at Apache Arrow for a better way to read and write parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the document from Apache pyarrow Reading and Writing Single Files
Parquet
Step 1: Data to play with
import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle, although pickle can handle tuples whereas parquet does not.
df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
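If you want to check the size difference on your own data, a small sketch like this (the DataFrame is just dummy data, and it assumes a pyarrow build with snappy and brotli support) compares the compressed file sizes:

import os
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randint(0, 1000, size=1_000_000)})

for comp in ["gzip", "snappy", "brotli"]:
    path = f"df.parquet.{comp}"
    df.to_parquet(path, compression=comp)
    print(comp, os.path.getsize(path), "bytes")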
Parquet files are often large, so read them using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
Considering the .parquet file named data.parquet
parquet_file = '../data.parquet'  # to_parquet will create this file; there is no need to open() it first
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use pandas.DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows
new_parquet_df = pd.read_parquet(parquet_file)