Split a spark dataframe into multiple frames and write as CSV - python

I have a use case wherein I read data from a source into a dataframe, do a groupBy on a field, and essentially break that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and must be human readable).
What's the best way of going about this?
I used this to split df into df_array: df_array = [(df.where(df[column_name] == i), i) for i in distinct_values]
And df.toPandas().to_csv(output_path + '.csv', index=False) on each dataframe individually to convert it to a CSV file - but the challenges I am facing with this approach are:
My understanding is that since I require a single CSV file per grouping-field value, toPandas() will bring the data from all worker nodes to the driver and may cause a driver OOM error.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed across worker nodes, and I get this error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
No space left on device.
The pipeline is pretty slow as well. What is a better way to approach this use case?
[EDIT]
I want to control the name of the CSV file that gets created as well. The target state is one CSV file per value of my group-by field (let's call it Name), so if there are 10 different Names in my initial df, the output will be 10 CSV files titled Name1.csv, Name2.csv, and so on.

As you're using pyspark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
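A minimal sketch (an assumption added here, not part of the answer above) of getting one output file per group with a controllable folder name; the column Name and the S3 prefix are placeholders taken from the question's edit. Note that Spark still names the part file itself (part-0000...), so renaming it to Name1.csv etc. needs a separate step (e.g. with boto3) after the write:
names = [row["Name"] for row in df.select("Name").distinct().collect()]
for name in names:
    (df.filter(df["Name"] == name)
       .coalesce(1)                              # one part file per group
       .write.mode("overwrite")
       .option("header", True)
       .csv(f"s3://my-bucket/output/{name}"))    # hypothetical bucket/prefix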

Adding file name column to Dask DataFrame

I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is to choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series and not a single appended data frame, I should group the data by the file name it came from.
Is there any Dask efficient way to add a column to each CSV so I could later groupby my results?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed(), but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed does not remove parallelism; the code just needs to make sure that the actual computation happens after the delayed tasks are defined.
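A minimal sketch of that delayed approach (file names reused from the question, everything else illustrative): each file becomes one delayed task, so the reads still run in parallel, and the extra file column enables the later groupby:
import pandas as pd
import dask.dataframe as dd
from dask import delayed

filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']

@delayed
def load_with_name(path):
    pdf = pd.read_parquet(path, engine='pyarrow')
    pdf['file'] = path  # record which file each row came from
    return pdf

ddf = dd.from_delayed([load_with_name(f) for f in filenames])
df_grouped = ddf.groupby('file')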

overwrite existing Parquet dataset with modified PySpark DataFrame

The use case is to append a column to a Parquet dataset and then re-write it efficiently at the same location. Here is a minimal example.
Create a pandas DataFrame and write as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'value': [0,1,2,3,4,5,6,7,8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy', index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql(f"""SELECT id, value, 0 AS segment FROM data""")
At this point the sf data is the same as the df data but with an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset, and rename it, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but this is not a correct usage.
My guesses as to why it could (should) fail:
you add a column, so the written dataset has a different schema than the one currently stored there. This can create schema confusion
you overwrite the input data while you are processing it. So Spark reads some lines, processes them, and overwrites the input files. But those files are still the inputs for other lines yet to be processed.
What I usually do in such a situation is create another dataset and, when there is no reason to keep the old one (i.e. when the processing is completely finished), clean it up. To remove files, you can check this post on how to delete HDFS files. It should work for all files accessible by Spark. However, it is in Scala, so I'm not sure whether it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite; it does more work than simply writing.
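A rough sketch of that pattern in pyspark (the new path is hypothetical; the deletion goes through the Hadoop FileSystem API via the JVM gateway, which is one way the Scala approach can be adapted):
# write the modified data to a new location first
new_path = 'c:/data_v2.parquet'
sf.write.partitionBy('id').mode('overwrite').parquet(new_path)

# once the new dataset has been verified, remove the old one
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(path).getFileSystem(hadoop_conf)
fs.delete(Path(path), True)  # True = recursive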

Dataframe instance management in Python

I recently worked on a project parsing CSV files containing cable modem MAC address (CMMAC) data, which made it useful to incorporate dataframes through the Pandas module. One of the problems I encountered related to the overall approach to and structure of the dataframes themselves. Specifically, I was concerned with having to increment the number of dataframe instances to perform specific actions on the data. I did not feel that having to invoke "df1", "df2", "df3", etc. was an efficient approach to writing in Python.
Below is a segment of the code where I had to instantiate the dataframes for different actions. The sample files (file1.csv and file2.csv) are identical and posted below as well.
file1.csv and file2.csv
cmmac,match
AABBCCDDEEFF,true
001122334455,false
001122334455,false
Python script:
import os
import glob
from functools import partial
import pandas as pd
#read and concatenate all CSV files in working directory
df1 = pd.concat(map(partial(pd.read_csv, header=0), glob.glob(os.path.join('', "*.csv"))))
#sort by column labeled "cmmac"
df2 = df1.sort_values(by='cmmac')
#delete any duplicate records
df3 = df2.drop_duplicates()
#convert MAC address format to colon notation (e.g. 001122334455 to 00:11:22:33:44:55)
df3['cmmac'] = df3['cmmac'].apply(lambda x: ':'.join(x[i:i+2] for i in range(0, len(x), 2)))
Additional actions were performed on the data in the CSV files, and by the end I had thirteen dataframes (df13). With more complex projects I would be in a death spiral of dataframes using this method.
The question I have is: how should dataframes be managed in order to avoid using this many instances? If it were necessary to drop a column or rearrange the columns, would each of those actions require creating a new dataframe? In "df1" I am able to combine two distinct actions, reading in all the CSV files and concatenating them, but I was unable to add further actions, and even so that line would eventually become difficult to read. Which approach have you adopted when working with dataframes that involve many smaller tasks? Thanks.
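For illustration only (this is not an answer from the thread), the same steps from the script above can be written as one method chain, so no intermediate names are needed:
import os
import glob
from functools import partial
import pandas as pd

df = (
    pd.concat(map(partial(pd.read_csv, header=0),
                  glob.glob(os.path.join('', "*.csv"))))
      .sort_values(by='cmmac')
      .drop_duplicates()
      .assign(cmmac=lambda d: d['cmmac'].map(
          lambda x: ':'.join(x[i:i+2] for i in range(0, len(x), 2))))
)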

Pyspark split csv file in packets

I'm very new to Spark and I'm still on my first tests with it. I installed a single node and I'm using it as my master on a decent server, running:
pyspark --master local[20]
And of course I'm facing some difficulties with my first steps using pyspark.
I have a CSV file of 40GB with around 300 million lines in it. What I want to do is find the fastest way to split this file into small packages and store them as CSV files as well. For that I have two scenarios:
First one: split the file without any criteria, just split it equally into, let's say, 100 pieces (3 million rows each).
Second one: the CSV data I'm loading is tabular and I have one column X with 100K different IDs. What I would like to do is create a set of dictionaries and create smaller CSV files, where my dictionaries tell me to which package each row should go.
So far, this is where I'm now:
sc=SparkContext.getOrCreate()
file_1 = r'D:\PATH\TOFILE\data.csv'
sdf = spark.read.option("header","true").csv(file_1, sep=";", encoding='cp1252')
Thanks for your help!
The best (and probably "fastest") way to do this would be to take advantage of Spark's built-in partitioning of RDDs and write one CSV file from each partition. You may repartition or coalesce to create the desired number of partitions (let's say, 100). This will give you maximum parallelism (based on your cluster resources and configuration), as each Spark executor works on one partition at a time.
You may do one of these:
Do a mapPartitions over the DataFrame and write each partition to a unique CSV file.
OR df.write.partitionBy("X").csv('mycsv.csv'), which will create one partition (and thereby file) per unique entry in "X"
Note: if you use HDFS to store your CSV files, Spark will automatically create multiple files to store the different partitions (number of files created = number of RDD partitions).
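A minimal sketch of the first scenario (the path and read options are assumed from the question's snippet): repartition into roughly 100 equal pieces and let each partition become one CSV part file:
sdf = spark.read.option("header", "true").csv(file_1, sep=";", encoding="cp1252")
sdf.repartition(100).write.option("header", True).csv(r"D:\PATH\TOFILE\output_split")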
What I did in the end was load the data as a Spark dataframe; Spark automatically creates equally sized partitions of 128MB (the default configuration), and then I used the repartition method to redistribute my rows according to the values of a specific column in my dataframe.
# This will load my CSV data into a Spark dataframe and generate the required number of 128MB partitions to store my raw data.
sdf = spark.read.option('header','true').csv(file_1, sep=';', encoding='utf-8')
# This line will redistribute the rows of each partition according to the values of a specific column. Here I'm placing all rows with the same values on the same partition, and I'm creating 20 partitions. (Spark handles allocating the rows so the partitions end up roughly the same size.)
sdf_2 = sdf.repartition(20, 'TARGET_COLUMN')
# This line will save all my 20 partitions on different csv files
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
The easiest way to split a CSV file is to use the Unix utility called split.
Just google "split unix command line".
I split my files using split -l 3500 XBTUSDorderbooks4.csv orderbooks

Parallelize GZip file processing Spark

I have a huge list of gzip files which need to be converted to Parquet. Due to the nature of gzip compression, this cannot be parallelized for a single file.
However, since I have many, is there a relatively easy way to let every node do a part of the files? The files are on HDFS. I assume that I cannot use the RDD infrastructure for the writing of the Parquet files because this is all done on the driver as opposed to on the nodes themselves.
I could parallelize the list of file names and write a function that handles the Parquet conversion locally and saves the results back to HDFS, but I don't know how to do that. I feel like I'm missing something obvious, thanks!
This was marked as a duplicate question; however, that is not the case. I am fully aware of Spark's ability to read the files in as RDDs without having to worry about the compression; my question is more about how to parallelize converting these files to structured Parquet files.
If I knew how to interact with Parquet files without Spark itself I could do something like this:
def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    write_csv_to_parquet_on_hdfs(gzipped_csv, file_to)

# Filename RDD contains tuples with file_from and file_to
filenameRDD.map(lambda x: convert_gzip_to_parquet(x[0], x[1]))
That would allow me to parallelize this, however I don't know how to interact with HDFS and Parquet from a local environment. I want to know either:
1) How to do that
Or..
2) How to parallelize this process in a different way using PySpark
I would suggest one of the two following approaches (where in practice I have found the first one to give better results in terms of performance).
Write each gzip file to a separate Parquet file
Here you can use pyarrow to write a Parquet-File to HDFS:
import pyarrow
import pyarrow.parquet

def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    pyarrow_table = to_pyarrow_table(gzipped_csv)
    hdfs_client = pyarrow.HdfsClient()
    with hdfs_client.open(file_to, "wb") as f:
        pyarrow.parquet.write_table(pyarrow_table, f)

# Filename RDD contains tuples with file_from and file_to;
# foreach (rather than map) forces the conversion to actually run
filenameRDD.foreach(lambda x: convert_gzip_to_parquet(x[0], x[1]))
There are two ways to obtain pyarrow.Table objects:
either obtain it from a pandas DataFrame (in which case you can also use pandas' read_csv() function): pyarrow_table = pyarrow.Table.from_pandas(pandas_df)
or manually construct it using pyarrow.Table.from_arrays
For pyarrow to work with HDFS one needs to set several environment variables correctly, see here
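For instance (a sketch, not part of the original answer), the read_gzip_file / to_pyarrow_table pair above could be collapsed into a single helper via pandas, which decompresses gzip transparently:
import pandas as pd
import pyarrow

def gzip_csv_to_table(path):
    # pandas reads gzip-compressed CSVs directly
    pandas_df = pd.read_csv(path, compression="gzip")
    return pyarrow.Table.from_pandas(pandas_df)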
Concatenate the rows from all gzip files into one Parquet file
def get_rows_from_gzip(file_from):
    rows = read_gzip_file(file_from)
    return rows

# read the rows of each gzip file into Row objects
rows_rdd = filenameRDD.map(lambda x: get_rows_from_gzip(x[0]))
# flatten list of lists
rows_rdd = rows_rdd.flatMap(lambda x: x)
# convert to DataFrame and write to Parquet
df = spark_session.createDataFrame(rows_rdd)
df.write.parquet(file_to)
If you know the schema of the data in advance, passing a schema object to createDataFrame will speed up the creation of the DataFrame.
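For example (column names are illustrative), a schema can be declared up front like this:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", LongType(), True),
])
df = spark_session.createDataFrame(rows_rdd, schema=schema)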
