Parallelize GZip file processing Spark - python

I have a huge list of GZip files which need to be converted to Parquet. Because of the way GZip compression works, a single file cannot be decompressed in parallel.
However, since I have many files, is there a relatively easy way to let every node handle a subset of them? The files are on HDFS. I assume that I cannot use the RDD infrastructure for writing the Parquet files because that is all done on the driver as opposed to on the nodes themselves.
I could parallelize the list of file names, then write a function that converts each file to Parquet locally and saves it back to HDFS, but I don't know how to do that. I feel like I'm missing something obvious, thanks!
This was marked as a duplicate question, this is not the case however. I am fully aware of the ability of Spark to read them in as RDDs without having to worry about the compression, my question is more about how to parallelize converting these files to structured Parquet files.
If I knew how to interact with Parquet files without Spark itself I could do something like this:
def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    write_csv_to_parquet_on_hdfs(gzipped_csv, file_to)

# Filename RDD contains tuples with file_from and file_to
filenameRDD.map(lambda x: convert_gzip_to_parquet(x[0], x[1]))
That would allow me to parallelize the work; however, I don't know how to interact with HDFS and Parquet from a local (non-Spark) environment. I want to know either:
1) How to do that
Or...
2) How to parallelize this process in a different way using PySpark

I would suggest one of the following two approaches (in practice I have found the first one to give better performance).
Write each GZip file to a separate Parquet file
Here you can use pyarrow to write a Parquet file to HDFS:
import pyarrow
import pyarrow.parquet

def convert_gzip_to_parquet(file_from, file_to):
    # read_gzip_file and to_pyarrow_table are placeholders for your own logic
    gzipped_csv = read_gzip_file(file_from)
    pyarrow_table = to_pyarrow_table(gzipped_csv)
    hdfs_client = pyarrow.HdfsClient()
    with hdfs_client.open(file_to, "wb") as f:
        pyarrow.parquet.write_table(pyarrow_table, f)

# Filename RDD contains tuples with file_from and file_to
filenameRDD.map(lambda x: convert_gzip_to_parquet(x[0], x[1]))
There are two ways to obtain pyarrow.Table objects:
either obtain it from a pandas DataFrame (in which case you can also use pandas' read_csv() function): pyarrow_table = pyarrow.Table.from_pandas(pandas_df)
or manually construct it using pyarrow.Table.from_arrays
For pyarrow to work with HDFS one needs to set several environment variables correctly, see here
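For example, a minimal sketch of the first option, assuming the GZip files contain CSV data; the two helpers below are hypothetical implementations of the placeholders in the snippet above, not part of the original answer:

import gzip
import io

import pandas as pd
import pyarrow

def read_gzip_file(file_from):
    # fetch the raw gzipped bytes from HDFS and decompress them
    hdfs_client = pyarrow.HdfsClient()
    with hdfs_client.open(file_from, "rb") as f:
        return gzip.decompress(f.read())

def to_pyarrow_table(csv_bytes):
    # build a pyarrow Table via a pandas DataFrame (option 1 above)
    pandas_df = pd.read_csv(io.BytesIO(csv_bytes))
    return pyarrow.Table.from_pandas(pandas_df)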
Concatenate the rows from all GZip files into one Parquet file
def get_rows_from_gzip(file_from):
    # read_gzip_file is a placeholder: it should return the rows of one file
    rows = read_gzip_file(file_from)
    return rows

# read the rows of each GZip file (ideally as Row objects)
rows_rdd = filenameRDD.map(lambda x: get_rows_from_gzip(x[0]))

# flatten the resulting list of lists
rows_rdd = rows_rdd.flatMap(lambda x: x)

# convert to DataFrame and write to Parquet
df = spark_session.createDataFrame(rows_rdd)
df.write.parquet(file_to)
If you know the schema of the data in advance, passing in a schema object to createDataFrame will speed up the creation of the DataFrame.
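For instance, a hedged sketch of passing an explicit schema (the column names and types here are made up for illustration):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema describing the rows produced by get_rows_from_gzip
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

df = spark_session.createDataFrame(rows_rdd, schema=schema)
df.write.parquet(file_to)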

Related

Split a spark dataframe into multiple frames and write as CSV

I have a use case wherein I am reading data from a source into a dataframe, doing a groupBy on a field, and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array: df_array = [(df.where(df[column_name] == i), i) for i in distinct_values]
And df.toPandas().to_csv(output_path + '.csv', index=False) on each dataframe individually to convert them to CSV files - but the challenges I'm facing with this approach are:
My understanding is that since I require a single CSV file per grouping-field value, to_csv will bring data from all worker nodes to the driver and may cause a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed on worker nodes and I get an error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation
No space left on device.
The pipeline is pretty slow as well. What is a better way to approach this use case?
[EDIT]
I want to control the name of the CSV file which gets created as well. The target state is 1 CSV file per group-by field value (let's call that field Name), so if there are 10 different Names in my initial df, the output will be 10 CSV files titled Name1.csv, Name2.csv, and so on.
As you're using PySpark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
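If you need the exact Name1.csv, Name2.csv file names from the edit, one hedged alternative (a sketch building on the df.where filtering already used in the question, and assuming each individual group fits in driver memory) is to collect and write one group at a time:

# collect the distinct group values first
distinct_names = [row['Name'] for row in df.select('Name').distinct().collect()]

for name in distinct_names:
    # the filter runs distributed; only this group's rows reach the driver
    group_pdf = df.where(df['Name'] == name).toPandas()
    group_pdf.to_csv(output_path + '/' + str(name) + '.csv', index=False)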

Workflow for modifying an hdf5 file in vaex

As a sort of follow-on to my previous question [1], is there a way to open an HDF5 dataset in vaex, perform operations, and then store the results to the same dataset?
I tried the following:
import vaex as vx
vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')
This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.
[1] Convert large hdf5 dataset written via pandas/pytables to vaex
Copying to a new file would not be less efficient than writing to the same file (at least not for this example), since it has to write the same number of bytes either way. I also would not recommend it, since if you make a mistake, you will mess up your data.
Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:
import vaex

df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
df2[['new_column1', 'new_column2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # joining without a column name merges on a row basis
We used this approach a lot to create precomputed auxiliary datasets on disk. Joining them back (on a row basis) is instant; it does not take any time or memory.

Merging datasets using dask proves unsuccessful

I am trying to merge a number of large data sets using Dask in Python in order to avoid memory issues when loading them. I want to save the merged result as a single .csv file. The task is proving harder than imagined:
I put together a toy example with just two data sets
The code I then use is the following:
import dask.dataframe as dd
import glob
import os
os.chdir('C:/Users/Me/Working directory')
file_list = glob.glob("*.txt")
dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)
dd_all = dd.concat(dfs)
If I use dd_all.to_csv('*.csv') I simply get the two original data sets written back out as separate files.
If I use dd_all.to_csv('name.csv') I get an error saying the file does not exist.
(FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part')
Using dd_all.compute() I can check that the merged data set has been created successfully.
You are misunderstanding how Dask works - the behaviour you see is as expected. In order to be able to write from multiple workers in parallel, it is necessary for each worker to be able to write to a separate file; there is no way to know the length of the first chunk before writing it has finished, for example. To write to a single file is therefore necessarily a sequential operation.
The default operation, therefore, is to write one output file for each input partition, and this is what you see. Since Dask can read from these in parallel, it does raise the question of why you would want to create one output file at all.
For the second method without the "*" character, Dask is assuming that you are supplying a directory, not a file, and is trying to write two files within this directory, which doesn't exist.
If you really wanted to write a single file, you could do one of the following:
use the repartition method to make a single output piece and then to_csv (sketched just after this list)
write the separate files and concatenate them after the fact (taking care of the header line)
iterate over the partitions of your dataframe in sequence to write to the same file.
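A minimal sketch of the first option (assuming the merged data fits comfortably on one machine, since a single worker has to stream it all out):

# collapse to one partition, then write a single CSV file
dd_all.repartition(npartitions=1).to_csv('name-*.csv')  # writes name-0.csv

# recent Dask versions can express this directly:
# dd_all.to_csv('name.csv', single_file=True)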

Pyspark split csv file in packets

I'm very new to Spark and I'm still doing my first tests with it. I installed a single node and I'm using it as my master on a decent server, running:
pyspark --master local[20]
And of course I'm facing some difficulties with my first steps using pyspark.
I have a 40 GB CSV file with around 300 million lines in it. What I want to do is find the fastest way to split this file into small packages and store them as CSV files as well. For that I have two scenarios:
First one. Split the file without any criteria. Just split it equally into, let's say, 100 pieces (3 million rows each).
Second one. The CSV data I'm loading is tabular and I have one column X with 100K different IDs. What I would like to do is to create a set of dictionaries and create smaller CSV files where my dictionaries will tell me to which package each row should go.
So far, this is where I'm now:
sc=SparkContext.getOrCreate()
file_1 = r'D:\PATH\TOFILE\data.csv'
sdf = spark.read.option("header","true").csv(file_1, sep=";", encoding='cp1252')
Thanks for your help!
The best (and probably "fastest") way to do this would be to take advantage of Spark's built-in partitioning of RDDs and write one CSV file from each partition. You may repartition or coalesce to create the desired number of partitions (let's say, 100). This will give you maximum parallelism (based on your cluster resources and configuration), as each Spark executor works on one partition at a time.
You may do one of these:
Do a mapPartitions over the DataFrame and write each partition to a unique CSV file.
OR df.write.partitionBy("X").csv('mycsv.csv'), which will create one output folder (and thereby one set of files) per unique entry in "X". Both options are sketched after the note below.
Note. If you use HDFS to store your CSV files, Spark will automatically create multiple files to store the different partitions (number of files created = number of RDD partitions).
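A hedged sketch of both options against the dataframe from the question (sdf and file_1 come from the snippet above; the output paths are made up for illustration):

# Scenario 1: split into roughly 100 equally sized pieces, one CSV file per partition
sdf.repartition(100).write.csv('D:/PATH/output/equal_chunks', header=True, sep=';')

# Scenario 2: one output folder (with its own files) per distinct value of column X
sdf.write.partitionBy('X').csv('D:/PATH/output/by_id', header=True, sep=';')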
What I did in the end was to load the data as a Spark dataframe (Spark automatically creates equally sized partitions of 128 MB, the default configuration) and then use the repartition method to redistribute my rows according to the values of a specific column in my dataframe.
# This will load my CSV data into a Spark dataframe and generate the required number of 128MB partitions to store my raw data.
sdf = spark.read.option('header','true').csv(file_1, sep=';', encoding='utf-8')
# This line will redistribute the rows of each partition according to the values of a specific column. Here I'm placing all rows with the same set of values on the same partition and I'm creating 20 of them. (Spark handles allocating the rows, so the partitions end up roughly the same size.)
sdf_2 = sdf.repartition(20, 'TARGET_COLUMN')
# This line will save all my 20 partitions to different csv files
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
The easiest way to split a CSV file is to use the Unix utility called split.
Just Google "split unix command line".
I split my files using split -l 3500 XBTUSDorderbooks4.csv orderbooks

Merge csv files using dask

I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge (SQL-like) them into a dask dataframe. Now, I am trying to write the merged result into a single CSV. I used compute() on the dask dataframe to collect the data into a single pandas df and then called to_csv. However, compute() is slow at reading the data across all partitions. I tried calling to_csv directly on the dask df and it created multiple .part files (I didn't try merging those .part files into a CSV). Is there any alternative to get the dask df into a single CSV, or any parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to avoid Pandas and Dask altogether and just read in bytes from the multiple files and dump them into another file:
with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # lines already end with a newline
If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV data into in-memory dataframes, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write to many CSV files, and then use the trick above to merge them afterwards.
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
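For example, a hedged one-liner (assuming the merged dask dataframe is named dask_df and that pyarrow or fastparquet is installed):

# write the merged dataframe to Parquet, one file per partition
dask_df.to_parquet('merged_parquet/')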
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like @mrocklin said, I recommend using other file formats.
