Py-Spark mapPartitions: how to craft the function? - python

We are using Databricks on Azure with a reasonably large cluster (20 cores, 70 GB memory across 5 executors). I have a parquet file with 4 million rows. Spark reads it without trouble; call the result sdf.
I am hitting the problem that the data must be converted to a Pandas DataFrame. Taking the easy/obvious way, pdf = sdf.toPandas(), causes an out-of-memory error.
So I want to apply my function separately to subsets of the Spark DataFrame. The sdf itself is in 19 partitions, so what I want to do is write a function and apply it to each partition separately. Here's where mapPartitions comes in.
I was trying to write my own function like
def example_function(sdf):
    pdf = sdf.toPandas()
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Then I'd use mapPartitions to run that.
sdf.rdd.mapPartitions(example_function)
That fails with all kinds of errors.
Looking back at the instructions, I realize I'm clueless! I was too optimistic/simplistic in what they expect to get from me. They don't seem to anticipate that I'm using my own functions to handle the whole chunk of the Spark DataFrame that exists in each partition. They seem to plan only for code that handles the rows of the Spark DataFrame one at a time, with the parameters being iterators.
Can you please share your thoughts on this?

In your example case it might be counterproductive to start from a Spark DataFrame and fall back to an RDD if you're aiming to use pandas.
Under the hood, toPandas() triggers collect(), which retrieves all the data onto the driver node and will fail on large data.
If you want to use pandas code on Spark, you can use pandas UDFs, which are equivalent to regular UDFs but designed and optimized for pandas code.
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
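For the per-partition workflow described in the question, a closely related option is mapInPandas, which belongs to the same family of pandas-aware APIs covered in the linked docs and hands each partition to your function as pandas DataFrames. A minimal sketch, assuming Spark 3.x; great_function is the placeholder from the question and the output schema is an assumption:
from typing import Iterator
import pandas as pd

def apply_per_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each partition arrives as an iterator of pandas DataFrames.
    for pdf in batches:
        yield great_function(pdf)   # great_function must return a pandas DataFrame

# The schema string describes the columns great_function returns (placeholder here).
result_sdf = sdf.mapInPandas(apply_per_partition, schema="col_a string, col_b double")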

I did not find a solution using Spark's map or similar. Here is the best option I've found.
The parquet folder contains many smaller parquet files. As long as the default settings were used, these files have the extension .snappy.parquet. Use Python's os.listdir and filter the file list down to the ones with the correct extension.
Use Python and Pandas tools, NOT SPARK, to read the individual parquet files. It is much faster to load a parquet file with a few hundred thousand rows with pandas than it is with Spark.
For the loaded dataframes, run the function I described in the first message, where the dataframe gets put through the wringer.
def example_function(pdf):
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Since the work for each data section has to happen in Pandas anyway, there's no need to keep fighting with Spark tools.
Another bit worth mentioning is that joblib's Parallel tool can be used to distribute this work across the cluster; a rough sketch follows.
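A rough sketch of the workflow described above; the folder path, extension filter and worker count are assumptions, and great_function is the per-chunk function from the earlier snippet:
import os
import pandas as pd
from joblib import Parallel, delayed

parquet_dir = "/dbfs/path/to/parquet_folder"   # hypothetical location of the parquet folder
files = [os.path.join(parquet_dir, f)
         for f in os.listdir(parquet_dir)
         if f.endswith(".snappy.parquet")]

def process_file(path):
    pdf = pd.read_parquet(path)   # pandas, not Spark
    return great_function(pdf)

# Fan the per-file work out across workers with joblib
outputs = Parallel(n_jobs=8)(delayed(process_file)(p) for p in files)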

Related

Saving list of pandas dataframes for use in another file

I have written an optimization algorithm that tests some functions on historical stock data, then returns a 2D list of the pandas dataframes generated by each run and the function parameters used. This list takes the form [[df,params],[df,params], ... [df,params],[df,params]]. After it has been generated, I would like to save this data to be processed in another script, but I am having trouble. Currently I am converting this list to a dataframe and using the to_csv() method from pandas, but this mangles my data when I open it in another file - I expect the data types to be [[dataframe,list],[dataframe,list],...,[dataframe,list],[dataframe,list]], but they instead become [[str,str],[str,str],...,[str,str],[str,str]]. I open the file using the read_csv() method from pandas, then I convert the resulting dataframe back into a list using the df.values.tolist() method.
To clarify, I save the list to a .csv like this, where out is the list:
out = pd.DataFrame(out)
out.to_csv('optimized_ticker.csv')
And I open the .csv and convert it back from a dataframe to a list like this:
df = pd.read_csv('optimized_ticker.csv')
list = df.values.tolist()
I figured that the problem was that my dataframes had commas in them somewhere, so I tried changing the delimiter on the .csv to a few different things, but the issue persisted. How can I fix this issue, so that my datatypes aren't mangled into strings? It is not imperative that I use the .csv format, so if there's a filetype more suited to the job I can switch to using it. The only purpose of saving the data is so that I can process it with any number of other scripts without having to re-run the simulation each time.
The best way to save a pandas dataframe is not via CSV if its only purpose is to be read by another pandas script. Parquet offers a much more robust option: it saves the datatypes of each column, can be compressed, and you won't have to worry about things like commas in values. Just use the following:
out.to_parquet('optimized_ticker.parquet')
df = pd.read_parquet('optimized_ticker.parquet')
EDIT:
As mentioned in the comments, pickle is also a possibility, so the solution depends on your case. Google will be your best friend in figuring out whether to use pickle, parquet, or feather.
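A quick sketch of the pickle route; out here is the original [[df, params], ...] list from the question and the filename is arbitrary:
import pickle

with open("optimized_ticker.pkl", "wb") as f:
    pickle.dump(out, f)

with open("optimized_ticker.pkl", "rb") as f:
    out_loaded = pickle.load(f)   # dataframes and parameter lists come back with their types intact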

Efficient use of dask with parquet files

I have received a huge (140MM-record) dataset and Dask has come in handy, but I'm not sure if I could perhaps do a better job. Imagine the records are mostly numeric (two columns are dates), so the transformation from CSV to parquet was a breeze (dask.dataframe.read_csv('in.csv').to_parquet('out.pq')), but
(i) I would like to use the data on Amazon Athena, so a single parquet file would be nice. How can I achieve this? As it stands, Dask saved it as hundreds of files.
(ii) For the exploratory data analysis I'm trying with this dataset, there are certain operations where I need more than a couple of variables, which won't fit into memory, so I'm constantly dumping two/three-variable views into SQL. Is this code an efficient use of dask?
import dask.dataframe as dd

mmm = ['min','mean','max']
MY_COLUMNS = ['emisor','receptor','actividad', 'monto','grupo']
gdict = {'grupo': mmm, 'monto': mmm, 'actividad': ['mean','count']}
df = dd.read_parquet('out.pq', columns=MY_COLUMNS).groupby(['emisor','receptor']).agg(gdict)
df = df.compute()
df.columns = ['_'.join(c) for c in df.columns]  # ('grupo','max') -> grupo_max
df.to_sql('er_stats', conn, index=False, if_exists='replace')  # conn is an existing database connection
Reading the file takes about 80 seconds and writing to SQL about 60 seconds.
To reduce the number of partitions, you should either set the blocksize when reading the CSV (preferred), or repartition before writing the parquet. The "best" size depends on your memory and number of workers, but a single partition is probably not possible if your data is "huge". Putting the many partitions into a single file is also not possible (or, rather, not implemented), because dask writes in parallel and there would be no way of knowing where in the file the next part goes before the previous part is finished. I could imagine writing code to read in the successive dask-produced parts and stream them into a single output; it would not be hard, but perhaps not trivial either.
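A sketch of both options; the block size and partition count are assumptions to be tuned against your memory and number of workers:
import dask.dataframe as dd

# Option 1: larger blocks at read time mean fewer partitions from the start
ddf = dd.read_csv('in.csv', blocksize='256MB')

# Option 2: explicitly repartition before writing the parquet dataset
ddf = ddf.repartition(npartitions=20)
ddf.to_parquet('out.pq')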
writing to SQL about 60 seconds
This suggests that your output is still quite large. Is SQL the best option here? Perhaps writing again to parquet files would be possible.
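Along those lines, a tiny sketch: df in the question's snippet is already an in-memory pandas DataFrame after compute(), so it can be written straight back out (the output path is an assumption):
# Persist the aggregated result as parquet instead of SQL
df.to_parquet('er_stats.parquet')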

partitionBy taking too long while saving a dataset on S3 using Pyspark

I am trying to save a dataset using partitionBy on S3 using PySpark, partitioning on a date column. The Spark job takes more than an hour to execute. If I run the code without partitionBy it takes just 3-4 minutes.
Could somebody help me fine-tune the partitionBy?
OK, so Spark is terrible at doing IO, especially with respect to S3. Currently, when you write in Spark it will use a whole executor to write the data SEQUENTIALLY. That, plus the back and forth between S3 and Spark, leads to it being quite slow. So you can do a few things to help mitigate/sidestep these issues.
Use a different partitioning strategy, if possible, with the goal of minimizing the number of files written.
If there is a shuffle involved before the write, you can change the setting for the default number of shuffle partitions: spark.sql.shuffle.partitions (200 is the default). You'll probably want to reduce this and/or repartition the data before writing.
You can go around Spark's IO and write your own HDFS writer, or use the S3 API directly, with something like foreachPartition and a function that writes to S3. That way things will write in parallel instead of sequentially.
Finally, you may want to use repartition and partitionBy together when writing (DataFrame partitionBy to a single Parquet file (per partition)). This will lead to one file per partition and, mixed with maxRecordsPerFile (below), will help keep your file sizes down.
As a side note: you can use the option spark.sql.files.maxRecordsPerFile 1000000 to help control file sizes and make sure they don't get out of control. A short sketch combining these suggestions follows.
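A hedged PySpark sketch of the repartition + partitionBy + maxRecordsPerFile combination described above; the column name, record cap, and output path are assumptions, and df stands for the dataset from the question:
# Repartition on the partition column so each output partition is written
# by a single task, cap records per file, then partition the output by date.
(df.repartition("event_date")
   .write
   .option("maxRecordsPerFile", 1000000)
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://my-bucket/output/"))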
In short, you should avoid creating too many files, especially small ones. Also note: you will see a big performance hit when you go to read those 2000*n files back in as well.
We use all of the above strategies in different situations, but in general we just try to use a reasonable partitioning strategy plus repartitioning before the write. Another note: if a shuffle is performed, your partitioning is destroyed and Spark's automatic partitioning takes over; hence the need for the constant repartitioning.
Hope these suggestions help. Spark IO is quite frustrating, but just remember to keep the files read/written to a minimum and you should see fine performance.
Use version 2 of the FileOutputCommitter:
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Python large dataset feature engineering workflow using dask hdf/parquet

There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns.)
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (this is an iterative process, e.g. day 1: create 4 features, day 2: create 4 more ...)
Attempt with parquet and dask:
First, I split the big csv file into multiple small parquet files. With this, dask is very efficient for calculating the new features, but then I need to merge them into the initial dataset, and at the moment we cannot add new columns to parquet files. Reading the csv by chunks, merging, and re-saving to multiple parquet files is too time-consuming, as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries, and it is still a binary file storage. Once again I split the big csv file into multiple HDF files with the same key='base' for the base features, in order to use concurrent writing with dask (which a single HDF file does not allow).
data = data.repartition(npartitions=10)  # otherwise it was saving 8 MB files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Side question: specifying data_columns seems useless for dask, as there is no "where" in dask.read_hdf?)
Contrary to what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriated tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so the new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as the storage, or even using Apache Spark (for processing data in a distributed/clustered way) with Hive/Impala as a backend.

Collecting attributes from dask dataframe providers

TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format I'm using to feed into dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame, used internally by dask.DataFrame successfully to load multiple files into the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of data, etc.).
It's important to note that, when reasonable, I'm using MultiIndexes quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe, and not specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes to the DataFrame objects.
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
Thanks!
There are a few potential questions here:
Q: How do I load data from many files in a custom format into a single dask dataframe?
A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions; a minimal sketch also follows after these answers.
Q: How do I store arbitrary metadata on a dask.dataframe?
A: This is not supported. Generally I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that is the case, please raise an issue. Generally, though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.
Q: I use multi-indexes heavily in Pandas; how can I integrate this workflow into dask.dataframe?
A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.
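A minimal sketch of the delayed/from_delayed route from the first answer; parse_file, the file list, and the meta frame are placeholders standing in for the proprietary parser described in the question:
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def parse_file(path):
    # Stand-in for the proprietary parser; must return a pandas.DataFrame.
    return pd.DataFrame({"origin": [path], "value": [1.0]})

paths = ["file_a.bin", "file_b.bin"]          # hypothetical input files
parts = [parse_file(p) for p in paths]

# meta tells dask the column names and dtypes without computing anything
meta = pd.DataFrame({"origin": pd.Series(dtype="object"),
                     "value": pd.Series(dtype="float64")})
ddf = dd.from_delayed(parts, meta=meta)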
