Parallelise a custom function with PySpark - python

I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.
Here's a simplified example:
import numpy as np
import pandas as pd

dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.
The simplistic way I'm doing it right now is to loop over my PySpark DataFrame and, for each 'id', convert the data to a pandas DataFrame and then apply the function. It works, but it's obviously slow and makes no use of Spark.
How can I parallelise this?

This answer is short enough that it should really be a comment, but I don't have enough reputation to comment.
Spark 2.3 introduced pandas vectorized UDFs that are exactly what you're looking for: executing a custom pandas transformation over a grouped Spark DataFrame, in a distributed fashion, and with great performance thanks to PyArrow serialization.
See the following for more information and examples:
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?pyspark.sql.functions.pandas_udf#pyspark.sql.functions.pandas_udf
Using Collect_set after exploding in a groupedBy object in Pyspark
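To make this concrete for the example in the question, here is a minimal sketch of the grouped pandas UDF pattern (Spark 2.3+). It assumes a SparkSession named spark, uses my_func as a stand-in for your custom function that takes an array of numbers and returns a scalar, and the output schema is only an assumption:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
import pandas as pd

spark_df = spark.createDataFrame(dummy_data)  # dummy_data as defined in the question

out_schema = StructType([StructField('id', StringType()),
                         StructField('result', DoubleType())])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def apply_my_func(pdf):
    # pdf is a pandas DataFrame holding every row for one value of 'id'
    return pd.DataFrame({'id': [pdf['id'].iloc[0]],
                         'result': [float(my_func(pdf['val'].values))]})

result = spark_df.groupBy('id').apply(apply_my_func)
Each group is shipped to an executor as a pandas DataFrame, so the custom function runs once per 'id' in parallel across the cluster.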

Related

PySpark - SciPy Functions

I am dealing with Spark DataFrames with very long columns that represent time-domain signals. Many millions of rows.
I need to perform signal processing on these using some functions from SciPy, which require me to pass the columns in as NumPy arrays. Specifically, I am trying to use these two:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.stft.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html
My current approach is to turn the column into a NumPy array using collect, as follows:
x = np.array(df.select("column_name").collect()).reshape(-1)
Then I feed the result to SciPy
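Roughly like this (the filter parameters here are just placeholders):
from scipy.signal import savgol_filter

# Placeholder parameters; the window length and polynomial order depend on the signal
filtered = savgol_filter(x, window_length=11, polyorder=3)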
However:
This is very slow.
collect() loads all the data onto the driver and is therefore not scalable.
Could somebody please help me find the most performant way to do this?
I found this somewhat old post, but it reaches no conclusion and I have the following objections to it:
How to convert a pyspark dataframe column to numpy array
Objections to that post:
It is not clear to me that Dask arrays are compatible with SciPy, because Dask apparently implements only a subset of the NumPy algorithms.
Dask would not solve the problem of how to get from a PySpark DataFrame to an array in the first place (which I currently do with collect()).
Help on this is much needed and would be greatly appreciated.
Many thanks in advance.

Why Does Pandas Convert One Row (or Column) of a DataFrame to a Series?

Context: I was passing what I thought was a DataFrame, df.iloc[n], into a function. Thanks to the dialogue here I figured out this was causing errors because pandas automatically converts a single row or column of a DataFrame into a Series, which is easily solved by using df.iloc[[n]] instead of df.iloc[n].
Question: My question is why does Pandas do this? Is there some performance benefit in using Series instead of DataFrames? What is the reasoning behind this automatic conversion to a Series?
As per the pandas documentation, Why more than one data structure?
The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
So, no conversion is happening here; rather, objects with different properties/methods are being retrieved.
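A quick way to see the distinction (a minimal sketch with made-up data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>  -> one row comes back as a Series
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'> -> a one-row DataFrame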

How to use a GROUPED_MAP pandas udf in every Spark Dataframe partition?

I would like to use a pandas UDF to speed up a user defined function.
The kind of pandas UDF I am interested in is the one that takes a pandas DataFrame as input and returns a pandas DataFrame (PandasUDFType.GROUPED_MAP).
However, it seems that these pandas UDFs must be used inside a groupby().apply() construct, whereas in my case I would simply like to apply the pandas UDF to every partition of the PySpark DataFrame, the idea being to turn each partition into a local pandas DataFrame on its executor. In fact, I would like to avoid any kind of groupby, because that would cause data reshuffling.
Is there a way to achieve this, maybe by specifically saying that the groupby should be done by partition or something similar?
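For reference, a hedged sketch of one alternative: this is not the GROUPED_MAP pandas UDF the question asks about, but Spark 3.0 added DataFrame.mapInPandas, which hands each partition to a Python function as an iterator of pandas DataFrames with no groupby involved (spark_df stands for the PySpark DataFrame, my_transform is a placeholder for the custom pandas-to-pandas function, and reusing the input schema for the output is an assumption):
import pandas as pd

def per_partition(pdf_iter):
    # Each partition arrives as an iterator of pandas DataFrame chunks
    for pdf in pdf_iter:
        yield my_transform(pdf)  # placeholder: your pandas DataFrame -> pandas DataFrame function

result = spark_df.mapInPandas(per_partition, schema=spark_df.schema)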

How are pandas dataframes implemented? Can we implement similar dataframes with some customizations?

I need only some of the functionality of the pandas DataFrame and need to remove the rest, or restrict users from using it. So, I am planning to write my own dataframe class that would have only a subset of the methods of a pandas DataFrame.
The code for the pandas DataFrame object can be found here.
Theoretically you could clone the repository and rewrite sections of it. However, it's not a simple object, and it may take a fair amount of reading through the code to understand how it works.
For example: pandas describes the dataframe object as a
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects.
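An alternative to rewriting the pandas source, in the spirit of what the question describes, would be a thin wrapper that exposes only an approved subset of methods. A minimal sketch (the class name and the whitelist here are just examples):
import pandas as pd

class RestrictedFrame:
    """Wraps a pandas DataFrame and exposes only a whitelist of its methods."""

    _allowed = {'head', 'describe', 'groupby'}  # example whitelist

    def __init__(self, data):
        self._df = pd.DataFrame(data)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for DataFrame methods
        if name in self._allowed:
            return getattr(self._df, name)
        raise AttributeError(f"'{name}' is not available on RestrictedFrame")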

Uncertainties in Pandas

How can I easily handle uncertainties on a Series or DataFrame in pandas (the Python Data Analysis Library)? I recently discovered the Python uncertainties package, but I am wondering if there is any simpler way to manage uncertainties directly within pandas. I didn't find anything about this in the documentation.
To be more precise, I don't want to store the uncertainties as a new column in my DataFrame, because I think they are part of a data series and shouldn't be logically separated from it. For example, it doesn't make any sense to delete a column in a DataFrame but not its uncertainties, so I would have to handle that case by hand.
I was looking for something like data_frame.uncertainties, which would work like the data_frame.values attribute. A data_frame.units attribute (for data units) would be great too, but I think those things don't exist in pandas (yet?)...
If you really want it to be built in, you can just create a class to put your DataFrame in. Then you can define whatever values or functions you want. Below I wrote a quick example, but you could easily add a units definition or a more complicated uncertainty formula.
import pandas as pd

data = {'target_column': [100, 105, 110]}

class data_analysis():
    def __init__(self, data, percentage_uncertainty):
        self.df = pd.DataFrame(data)
        # Uncertainty as a fixed percentage of each value in the target column
        self.uncertainty = percentage_uncertainty * self.df['target_column'].values
When I run
example = data_analysis(data, .01)
example.uncertainty
I get
array([1. , 1.05, 1.1 ])
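Along the same lines, a units attribute could be added in the same style (just a sketch; the unit string is arbitrary):
class data_analysis_with_units(data_analysis):
    def __init__(self, data, percentage_uncertainty, units):
        super().__init__(data, percentage_uncertainty)
        self.units = units  # e.g. 'm/s', kept alongside the data and its uncertainties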
Hope this helps
