partitionBy taking too long while saving a dataset on S3 using PySpark

I am trying to save a dataset to S3 using partitionBy in PySpark, partitioning on a date column. The Spark job takes more than an hour to execute; if I run the code without partitionBy, it takes only 3-4 minutes.
Could somebody help me fine-tune the partitionBy?

OK, so Spark is terrible at doing IO, especially with respect to S3. Currently, when you write in Spark it will use a whole executor to write the data sequentially. That, combined with the back and forth between S3 and Spark, leads to it being quite slow. So you can do a few things to help mitigate/sidestep these issues.
Use a different partitioning strategy, if possible, with the goal of minimizing the number of files written.
If there is a shuffle involved before the write, you can change the setting that controls the default number of shuffle partitions: spark.sql.shuffle.partitions (200 is the default). You'll probably want to reduce this and/or repartition the data before writing.
You can go around Spark's IO and write your own HDFS writer, or use the S3 API directly, using something like foreachPartition and then a function for writing to S3. That way things will be written in parallel instead of sequentially.
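A minimal sketch of that foreachPartition approach, assuming boto3 is installed on the executors; the bucket name, key prefix, and CSV serialization are all hypothetical choices:

import csv
import io
import uuid

def write_partition_to_s3(rows):
    # Runs on the executor: serialize this partition and upload it as a single object.
    import boto3  # imported here so each executor builds its own client
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-bucket",                             # hypothetical bucket
        Key="output/part-{}.csv".format(uuid.uuid4()),  # hypothetical key prefix
        Body=buf.getvalue().encode("utf-8"),
    )

df.rdd.foreachPartition(write_partition_to_s3)

Each partition is uploaded by its own task, so the writes happen in parallel instead of being funneled through Spark's committer.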
Finally, you may want to use repartition and partitionBy together when writing (see "DataFrame partitionBy to a single Parquet file (per partition)"). This will lead to one file per partition; mixed with maxRecordsPerFile (below), this will help keep your file sizes down.
As a side note: you can use the option spark.sql.files.maxRecordsPerFile (e.g. 1000000) to help control file sizes and make sure they don't get out of control.
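Putting the shuffle setting, repartition + partitionBy, and maxRecordsPerFile together, a sketch of what the tuned write might look like; the column name, partition count, and output path are illustrative:

spark.conf.set("spark.sql.shuffle.partitions", "10")           # reduce from the default of 200
spark.conf.set("spark.sql.files.maxRecordsPerFile", "1000000")  # cap the size of each output file

(df
    .repartition("date")          # all rows for one date end up in one task
    .write
    .partitionBy("date")          # one S3 prefix per date value
    .mode("overwrite")
    .parquet("s3a://my-bucket/output/"))  # hypothetical path

With this combination each date partition is written as a single file, split further only when it exceeds maxRecordsPerFile.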
In short, you should avoid creating too many files, especially small ones. Also note: you will see a big performance hit when you go to read those 2000*n files back in as well.
We use all of the above strategies in different situations, but in general we just try to use a reasonable partitioning strategy plus repartitioning before the write. Another note: if a shuffle is performed, your partitioning is destroyed and Spark's automatic partitioning takes over; hence the need for the constant repartitioning.
Hope these suggestions help. Spark IO is quite frustrating, but just remember to keep the number of files read/written to a minimum and you should see fine performance.

Use version 2 of the FileOutputCommitter:
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
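In PySpark this can be passed through to the Hadoop configuration when building the session, for example (a sketch; setting it on the existing SparkContext's Hadoop configuration also works):

from pyspark.sql import SparkSession

# The spark.hadoop. prefix forwards the setting to the underlying Hadoop configuration.
spark = (SparkSession.builder
    .appName("partitioned-write")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate())

Version 2 of the committer moves task output into place as each task finishes, avoiding the slow sequential rename pass that version 1 performs against S3 at job commit.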

Related

Py-Spark mapPartitions: how to craft the function?

We are using Databricks on Azure with a reasonably large cluster (20 cores, 70 GB of memory across 5 executors). I have a parquet file with 4 million rows, which Spark can read just fine; call that sdf.
I am hitting the problem that the data must be converted to a pandas DataFrame. The easy/obvious way, pdf = sdf.toPandas(), causes an out-of-memory error.
So I want to apply my function separately to subsets of the Spark DataFrame. The sdf itself is in 19 partitions, so what I want to do is write a function and apply it to each partition separately. Here's where mapPartitions comes in.
I was trying to write my own function like
def example_function(sdf):
    pdf = sdf.toPandas()
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Then I'd use mapPartitions to run that.
sdf.rdd.mapPartitions(example_function)
That fails with all kinds of errors.
Looking back at the instructions, I realize I'm clueless! I was too optimistic/simplistic about what they expect to get from me. They don't seem to imagine that I'm using my own functions to handle the whole Spark DataFrame that exists in a partition. They seem to plan only for code that would handle the rows of the Spark DataFrame one row at a time, with iterators as the parameters.
Can you please share your thoughts on this?
In your example case it might be counterproductive to start from a Spark DataFrame and fall back to an RDD if you're aiming at using pandas.
Under the hood, toPandas() triggers collect(), which retrieves all the data on the driver node and will fail on large data.
If you want to use pandas code on Spark, you can use pandas UDFs, which are equivalent to regular UDFs but designed and optimized for pandas code.
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
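For the partition-at-a-time pattern described above, a closely related option (a sketch, and only one of the approaches covered on that page) is mapInPandas, available in Spark 3.0+: your function receives an iterator of pandas DataFrames and yields pandas DataFrames back, so great_function never has to see the whole dataset at once. The output schema string here is hypothetical:

def process_chunks(pdf_iter):
    # pdf_iter yields pandas DataFrames, one Arrow batch of a partition at a time
    for pdf in pdf_iter:
        yield great_function(pdf)  # great_function must return a pandas DataFrame

result_sdf = sdf.mapInPandas(process_chunks, schema="id long, score double")  # illustrative schema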
I did not find a solution using Spark map or similar. Here is the best option I've found.
The parquet folder has lots of smaller parquet files inside it. As long as the default settings were used, these files have the extension .snappy.parquet. Use Python's os.listdir and filter the file list down to the ones with the correct extension.
Use Python and pandas tools, NOT SPARK, to read the individual parquet files. It is much faster to load a parquet file with a few hundred thousand rows with pandas than it is with Spark.
For the loaded dataframes, run the function I described in the first message, where the dataframe gets put through the wringer.
def example_function(pdf):
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Since the work for each data section has to happen in Pandas anyway, there's no need to keep fighting with Spark tools.
Another bit worth mentioning is that joblib's Parallel tool can be used to distribute this work among cluster nodes.
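A minimal sketch of that workflow, assuming a hypothetical folder path and running joblib locally (fanning the work out across cluster nodes would additionally require a cluster-aware joblib backend):

import os
import pandas as pd
from joblib import Parallel, delayed

folder = "/dbfs/path/to/table.parquet"   # hypothetical parquet folder
files = [os.path.join(folder, name) for name in os.listdir(folder)
         if name.endswith(".snappy.parquet")]

def process_file(path):
    # Read one small parquet file with pandas and put it through the wringer.
    pdf = pd.read_parquet(path)
    return example_function(pdf)

outputs = Parallel(n_jobs=8)(delayed(process_file)(f) for f in files)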

Are temporary files generated while working with pandas data frames

Until now, I have always used SAS to work with sensitive data. It would be nice to change to Python instead. However, I realize I do not understand how the data is handled during processing in pandas.
While running SAS, one knows exactly where all the temporary files are stored (hence it is easy to keep them in an encrypted container). But what happens when I use pandas data frames? I think I would not even notice if the data left my computer during processing.
The flat files alone, of which I typically have dozens to merge, are a couple of GB in size. Hence I cannot simply rely on the hope that everything will be kept in RAM during processing - or can I? I am currently using a desktop with 64 GB of RAM.
If it's a matter of life and death, I would write the data merging function in C. This is the only way to be 100% sure of what happens with the data. The general philosophy of Python is to hide whatever happens "under the hood", which does not seem to fit your particular use case.

Pandas/Dask - Very long time to write to file

I have a few files. The big one is ~87 million rows; the others are ~500K rows. Part of what I am doing is joining them, and when I try to do it with Pandas, I get memory issues. So I have been using Dask. It is super fast to do all the joins/applies, but then it takes 5 hours to write out to a CSV, even though I know the resulting dataframe is only 26 rows.
I've read that some joins/applies are not the best for Dask, but does that mean it is slower using Dask? Because mine have been very quick. It takes seconds to do all of my computations/manipulations on the millions of rows. But it takes forever to write out. Any ideas how to speed this up/why this is happening?
You can use Dask's parallel processing, or try writing to a Parquet file instead of CSV, as Parquet writes are very fast with Dask.
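A quick sketch of swapping the CSV write for Parquet, assuming ddf is the already-joined Dask DataFrame and the output paths are hypothetical:

# ddf is the already-joined Dask DataFrame from the steps above.
ddf.to_parquet("output/result_parquet/", write_index=False)   # hypothetical output directory

# For comparison, the CSV route would be:
# ddf.to_csv("output/result-*.csv", index=False)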
Dask uses lazy evaluation. This means that when you perform the operations, you are actually only building the processing graph.
Once you try to write your data to a CSV file, Dask starts performing the operations.
That is why it takes 5 hours: at that point it still has to process all of the data.
See https://tutorial.dask.org/01x_lazy.html for more information on the topic.
One way to speed up the processing would be to increase the parallelism by using a machine with more resources.
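To illustrate where the time actually goes, a sketch with hypothetical file names and join key:

import dask.dataframe as dd

big = dd.read_csv("big_file.csv")        # ~87M rows; nothing is read yet
small = dd.read_csv("small_file.csv")    # ~500K rows; still nothing is read

joined = big.merge(small, on="key")      # still lazy: this only extends the task graph

# Only here does Dask actually read the files and perform the join,
# which is why the "write" step appears to take hours even for a tiny result.
joined.to_csv("joined-*.csv", index=False)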

Python large dataset feature engineering workflow using dask hdf/parquet

There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns).
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (this is an iterative process, like day 1: create 4 features, day 2: create 4 more ...)
Attempt with parquet and dask:
First, I split the big CSV file into multiple small parquet files. With this, dask is very efficient for the calculation of new features, but then I need to merge them back into the initial dataset and, at the moment, we cannot add new columns to parquet files. Reading the CSV by chunks, merging, and re-saving to multiple parquet files is too time consuming, as feature engineering is an iterative process in this project.
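For reference, a sketch of that initial split step; the paths and block size are hypothetical, and suitable dtypes would be passed via the dtype argument:

import dask.dataframe as dd

# Read the larger-than-memory CSV in blocks and write it out as many small parquet files.
data = dd.read_csv("big_file.csv", blocksize="64MB")  # dtype={...} to pin suitable dtypes
data.to_parquet("parquet/base/", write_index=False)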
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries, and it is still a binary file storage. Once again I split the big CSV file into multiple HDF files with the same key='base' for the base features, in order to use concurrent writing with dask (concurrent writes to a single HDF file are not allowed).
data = data.repartition(npartitions=10) # otherwise it was saving 8 MB files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Side question: specifying data_columns seems useless for dask, as there is no "where" argument in dask.read_hdf?)
Contrary to what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriate tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed/clustered way) with Hive/Impala as a backend...

How do I get PySpark to write intermediate results to disk before running out of memory?

Background: in Hadoop Streaming, each reduce job writes to HDFS as it finishes, thus clearing the way for the Hadoop cluster to execute the next reduce.
I am having trouble mapping this paradigm to (Py)Spark.
As an example,
df = spark.read.load('path')
df.rdd.reduceByKey(my_func).toDF().write.save('output_path')
When I run this, the cluster collects all of the data in the dataframe before it writes anything to disk. At least, that is what appears to be happening as I watch the job progress.
My problem is that my data is much bigger than my cluster memory, so I run out of memory before any data is written. In Hadoop Streaming, we don't have this problem because the output data is streamed to the disk to make room for the subsequent batches of data.
I have considered something like this:
for i in range(100):
    (df.filter(df.loop_index == i)
       .rdd
       .reduceByKey(my_func)
       .toDF()
       .write.mode('append')
       .save('output_path'))
where I only process a subset of my data in each iteration. But this seems kludgy, mainly because I have to either persist df, which isn't possible because of memory constraints, or re-read from the HDFS input source in each iteration.
One way to make the loop work is to partition the source folders by day or some other subset of the data. But for the sake of the question, let's assume that isn't possible.
Questions: How do I run a job like this in PySpark? Do I just have to have a much bigger cluster? If so, what are the common practices for sizing a cluster before processing the data?
It might help to repartition your data into a large number of partitions. The example below would be similar to your for loop, although you may want to try with fewer partitions first:
df = spark.read.load('path').repartition(100)
You should also review the number of executors you are currently using (--num-executors). Reducing this number should also reduce your memory footprint.
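Putting the repartitioning together with the original pipeline, a sketch; the key/value extraction is hypothetical, since reduceByKey needs an RDD of (key, value) pairs rather than Rows:

df = spark.read.load('path').repartition(100)

result = (df.rdd
            .map(lambda row: (row['key'], row['value']))  # hypothetical column names
            .reduceByKey(my_func)
            .toDF(['key', 'value']))

result.write.save('output_path')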
