How to add custom methods to the PySpark DataFrame class by inheritance - python

I am trying to inherit the DataFrame class and add custom methods as below, so that I can chain them fluently and also ensure all methods refer to the same dataframe. I get the exception "Column is not iterable".
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import lit

class MyClass(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df.sql_ctx)

    def add_column3(self):
        # Add col3 to the dataframe received
        self._jdf.withColumn("col3", lit(3))
        return self

    def add_column4(self):
        # Add col4 to the dataframe received
        self._jdf.withColumn("col4", lit(4))
        return self
if __name__ == "__main__":
    '''
    Spark Context initialization code
    col1 col2
    a    1
    b    2
    '''
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
    myobj = MyClass(df)
    ## Trying to accomplish below where I can chain MyClass methods & DataFrame methods
    myobj.add_column3().add_column4().drop_columns(["col1"])
    '''
    Expected Output
    col2, col3, col4
    1, 3, 4
    2, 3, 4
    '''

Actually, you don't need to inherit from the DataFrame class in order to add custom methods to DataFrame objects.
In Python, you can add a custom property that wraps your methods, like this:
from functools import wraps
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import lit

# decorator to attach a function to an attribute
def add_attr(cls):
    def decorator(func):
        @wraps(func)
        def _wrapper(*args, **kwargs):
            f = func(*args, **kwargs)
            return f
        setattr(cls, func.__name__, _wrapper)
        return func
    return decorator

# custom functions
def custom(self):
    @add_attr(custom)
    def add_column3():
        return self.withColumn("col3", lit(3))

    @add_attr(custom)
    def add_column4():
        return self.withColumn("col4", lit(4))

    return custom

# add new property to the class pyspark.sql.DataFrame
DataFrame.custom = property(custom)

# use it
df.custom.add_column3().show()
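Note that each attached method returns a plain DataFrame, so to keep chaining custom methods you go back through the .custom property each time. A minimal usage sketch (my addition, assuming a SparkSession named spark and the definitions above):

df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
# chain custom methods via .custom, and built-in DataFrame methods directly
df.custom.add_column3().custom.add_column4().drop("col1").show()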

The answer by blackbishop is worth a look, even if it has no upvotes as of this writing. This seems a good general approach for extending the Spark DataFrame class, and, I presume, other complex objects. I rewrote it slightly as this:
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col
from functools import wraps

# Create a decorator to add a function to a python object
def add_attr(cls):
    def decorator(func):
        @wraps(func)
        def _wrapper(*args, **kwargs):
            f = func(*args, **kwargs)
            return f
        setattr(cls, func.__name__, _wrapper)
        return func
    return decorator

# Extensions to the Spark DataFrame class go here
def dataframe_extension(self):
    @add_attr(dataframe_extension)
    def drop_records():
        return (
            self
            .where(~((col('test1') == 'ABC') & (col('test2') == 'XYZ')))
            .where(~col('test1').isin(['AAA', 'BBB']))
        )
    return dataframe_extension

DataFrame.dataframe_extension = property(dataframe_extension)
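A usage sketch of the extension (my addition; the test1/test2 rows are made up to match the filters above):

df = spark.createDataFrame(
    [('ABC', 'XYZ'), ('AAA', '123'), ('DEF', '456')],
    ['test1', 'test2'])
# drop_records applies the filters and hands back a plain DataFrame
df.dataframe_extension.drop_records().show()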

Below is my solution (which is based on your code).
I don't know if it's the best practice, but at least it does what you want correctly. DataFrames are immutable objects, so after we add a new column we create a new object, but not a DataFrame object: a MyClass object, because we want to have both the DataFrame methods and the custom methods.
from pyspark.sql.dataframe import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

class MyClass(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df.sql_ctx)
        self._df = df

    def add_column3(self):
        # Add col3 to the dataframe received
        newDf = self._df.withColumn("col3", F.lit(3))
        return MyClass(newDf)

    def add_column4(self):
        # Add col4 to the dataframe received
        newDf = self._df.withColumn("col4", F.lit(4))
        return MyClass(newDf)

df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
myobj = MyClass(df)
myobj.add_column3().add_column4().na.drop().show()
# Result:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 1| 3| 4|
| b| 2| 3| 4|
+----+----+----+----+

I think you are looking for something like this:
class dfc:
    def __init__(self, df):
        self.df = df

    def func(self, num):
        self.df = self.df.selectExpr(f"id * {num} AS id")

    def func1(self, num1):
        self.df = self.df.selectExpr(f"id * {num1} AS id")

    def dfdis(self):
        self.df.show()
In this example, a dataframe is passed to the constructor and is then used by the methods defined inside the class. The state of the dataframe is stored in the instantiated object and updated whenever the corresponding methods are called.
df = spark.range(10)
ob = dfc(df)
ob.func(2)
ob.func(2)
ob.dfdis()
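If you prefer the fluent chaining from the original question, a small variation (my sketch, not part of the answer above) is to have each method return self:

class dfc:
    def __init__(self, df):
        self.df = df

    def func(self, num):
        # update the stored dataframe, then return the object so calls can be chained
        self.df = self.df.selectExpr(f"id * {num} AS id")
        return self

    def dfdis(self):
        self.df.show()
        return self

dfc(spark.range(10)).func(2).func(2).dfdis()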

Note: PySpark is deprecating df.sql_ctx in an upcoming version, so this answer is not future-proof.
I like many of the other answers, but there are a few lingering questions in comments. I think they can be addressed as such:
we need to think of everything as immutable, so we return the subclass
we do not need to call self._jdf anywhere -- instead, just use self as if it were a DataFrame (since it is one -- this is why we used inheritance!)
we need to explicitly construct a new one of our class since returns from self.foo will be of the base DataFrame type
I have added a DataFrameExtender subclass that mediates creation of new classes. Subclasses will inherit parent constructors if not overridden, so we can neaten up the DataFrame constructor to take a DataFrame, and add in the capability to store metadata.
We can make a new class for conceptual stages that the data arrives in, and we can sidecar flags that help us identify the state of the data in the dataframe. Here I add a flag when either add column method is called, and I push forward all existing flags. You can do whatever you like.
This setup means that you can create a sequence of DataFrameExtender objects, such as:
RawData, which implements .clean() method, returning CleanedData
CleanedData, which implements .normalize() method, returning ModelReadyData
ModelReadyData, which implements .train(model) and .predict(model), or .summarize(), and which can be used by a model just as a base DataFrame object would be used.
By splitting these methods into different classes, it means that we cannot call .train() on RawData, but we can take a RawData object and chain together .clean().normalize().train(). This is a functional-like approach, but using immutable objects to assist in interpretation.
Note that DataFrames in Spark are lazily evaluated, which is great for this approach. All of this code just produces a final unevaluated DataFrame object that contains all of the operations that will be performed. We don't have to worry about memory or copies or anything.
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import lit

class DataFrameExtender(DataFrame):
    def __init__(self, df, **kwargs):
        self.flags = kwargs
        super().__init__(df._jdf, df.sql_ctx)

class ColumnAddedData(DataFrameExtender):
    def add_column3(self):
        df_added_column = self.withColumn("col3", lit(3))
        return ColumnAddedData(df_added_column, with_col3=True, **self.flags)

    def add_column4(self):
        ## Add a bit of complexity: do not call again if we have already called this method
        if not self.flags.get('with_col4'):
            df_added_column = self.withColumn("col4", lit(4))
            return ColumnAddedData(df_added_column, with_col4=True, **self.flags)
        return self
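A short usage sketch (my addition, assuming the df from the question):

df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
data = ColumnAddedData(df)
result = data.add_column3().add_column4().add_column4()  # the second add_column4 is a no-op
result.drop("col1").show()
print(result.flags)  # both with_col3 and with_col4 are now True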

Related

PySpark applyInPandas/grouped_map pandas_udf too many arguments

I'm trying to use the pyspark applyInPandas in my python code. Problem is, the function that I want to pass to it exists in the same class, and so it is defined as def func(self, key, df). This becomes an issue because applyInPandas will error out saying I'm passing too many arguments to the underlying func (at most it allows a key and df params, so the self is causing the issue). Is there any way around this?
The underlying goal is to process a pandas function on dataframe groups in parallel.
As OP mentioned, one way is to just use @staticmethod, which may not be desirable in some cases.
The pyspark source code for creating pandas_udf uses inspect.getfullargspec().args (line 386, 436); this includes self even if the class method is called from the instance. I would think this is a bug on their part (maybe worthwhile to raise a ticket).
To overcome this, the easiest way is to use functools.partial which can help change the argspec, i.e. remove the self argument and restore the number of args to 2.
This is based on the idea that calling an instance method is the same as calling the method directly from the class and supplying the instance as the first argument (because of the descriptor magic):
A.func(A(), *args, **kwargs) == A().func(*args, **kwargs)
In a concrete example,
import functools
import inspect

class A:
    def __init__(self, y):
        self.y = y

    def sum(self, a: int, b: int):
        return (a + b) * self.y

    def x(self):
        # calling the method using the class and then supplying the self argument
        f = functools.partial(A.sum, self)
        print(f(1, 2))
        print(inspect.getfullargspec(f).args)

A(2).x()
This will print
6 # can still use 'self.y'
['a', 'b'] # 2 arguments (without 'self')
Then, in OP's case, one can simply do the same for key, df parameters:
class A:
    def __init__(self):
        ...

    def func(self, key, df):
        ...

    def x(self):
        f = functools.partial(A.func, self)
        self.df.groupby(...).applyInPandas(f)
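For completeness, a self-contained sketch of the pattern (my own example: the class, column names, and multiplier are made up, and applyInPandas also needs an output schema). Note that Spark pickles the bound self to ship it to the workers, so the instance should hold only picklable state:

import functools
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class Processor:
    def __init__(self, factor):
        self.factor = factor  # plain picklable state only

    def scale(self, key, pdf: pd.DataFrame) -> pd.DataFrame:
        # runs once per group as a pandas DataFrame; can still read self.factor
        pdf["value"] = pdf["value"] * self.factor
        return pdf

    def run(self, sdf):
        # bind self so the callable exposes only (key, pdf) to applyInPandas
        f = functools.partial(Processor.scale, self)
        return sdf.groupBy("group").applyInPandas(f, schema=sdf.schema)

sdf = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "value"])
Processor(factor=10).run(sdf).show()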

using static method as internal method

Apologies if the title is misleading/incorrect, as I am not entirely familiar with the terminology.
I have a class, let's call it Cleaner, and it should have a couple of methods in it.
For example:
class Cleaner:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def clean(self, dataframe=None):
        if dataframe is None:
            tmp = self.df
            # do cleaning operation
The function clean should behave as both staticmethod and internal method. What I mean by that is, I should be able to call it in both of the following ways:
tble = pd.read_csv('./some.csv')
cleaner = Cleaner(tble)
#method 1
cleaner.clean()
#method 2
Cleaner.clean(tble)
I will acknowledge that I have very nascent knowledge of OOP concepts in Python and would like your advice on whether this is doable, and how.
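One common way to get this dual behaviour (my own sketch, not from this thread) is to drop @staticmethod and dispatch on what was actually passed in; the dropna call just stands in for the real cleaning operation:

import pandas as pd

class Cleaner:
    def __init__(self, df):
        self.df = df

    def clean(self):
        # cleaner.clean() arrives here with self as a Cleaner instance;
        # Cleaner.clean(tble) arrives with self as the raw DataFrame
        tmp = self.df if isinstance(self, Cleaner) else self
        return tmp.dropna()

tble = pd.DataFrame({'a': [1, None, 3]})
Cleaner(tble).clean()   # method 1
Cleaner.clean(tble)     # method 2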

Specify helper function that's used by another helper function inside a class

Update to question:
I want to include a helper function in my class that uses another helper function that's only used within one of the methods of the class. Using @staticmethod and self.func_name is what I'd do if I had one staticmethod. However, if I want to call another staticmethod from a staticmethod and refer to it using self.helper_func, I get a "name 'self' is not defined" error.
To give you some context, the reason I'm doing this is that in my actual use case I'm working with a list of grouped dataframes. Within the outer apply statement, I iterate through sets of specific columns in each grouped dataframe and apply the actual function. So the outer helper function is just an apply over the groups in the grouped dataframes, and it then calls the inner helper that performs manipulations on groups of columns.
import pandas as pd
import numpy as np

class DataManipulation():
    def __init__(self, data):
        self.data = data

    @staticmethod
    def helper_func(const):
        return const

    @staticmethod
    def add_constant(var):
        res = var + self.helper_func(5)
        return res

    def manipulate_data(self):
        res = self.data.apply(add_constant)
        return res

test_df = pd.DataFrame({'a': np.arange(4), 'b': np.arange(4)})
data_manip = DataManipulation(test_df)
data_manip.manipulate_data()
How can a @staticmethod access self?
A static method can be called without creating an object or instance.
So what would self be when the staticmethod is called before any object has been created?
PS: that's my understanding and I may be wrong (I am new to Python; that's how it works in C / C++ / Java).
Maybe you need to call DataManipulation.helper_func(5) instead of self.helper_func(5).
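Applying that suggestion (and referring to add_constant through the class as well when passing it to apply), a corrected sketch might look like this; this is my reading of the fix, not code from the thread:

import pandas as pd
import numpy as np

class DataManipulation():
    def __init__(self, data):
        self.data = data

    @staticmethod
    def helper_func(const):
        return const

    @staticmethod
    def add_constant(var):
        # reference the other staticmethod through the class, not through self
        return var + DataManipulation.helper_func(5)

    def manipulate_data(self):
        # likewise, pass the staticmethod itself to apply
        return self.data.apply(DataManipulation.add_constant)

test_df = pd.DataFrame({'a': np.arange(4), 'b': np.arange(4)})
print(DataManipulation(test_df).manipulate_data())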

Can I define how accessing a python object (and not just its attributes) is handled?

I have a custom class in python that I would like to behave in a certain way if the object itself (i.e., not one of its methods/properties) is accessed.
This is a contrived minimal working example to show what I mean. I have a class that holds various pandas DataFrames so that they can separately be manipulated:
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    @property
    def asonedf(self):
        return pd.concat(self._dfs, axis=1)

d = SplitDataFrame(pd.DataFrame(np.random.rand(2,2), columns=['a','b']),
                   pd.DataFrame(np.random.rand(2,2), columns=['q','r']))
d.increase(0, 10)
This works, and I can examine that d._dfs now indeed is
[ a b
0 10.845681 10.561956
1 10.036739 10.262282,
q r
0 0.164336 0.412171
1 0.440800 0.945003]
So far, so good.
Now, I would like to change/add to the class's definition so that, when not using the .increase method, it returns the concatenated dataframe. In other words, when accessing d, I would like it to return the same dataframe as when typing d.asonedf, i.e.,
a b q r
0 10.143904 10.154455 0.776952 0.247526
1 10.039038 10.619113 0.443737 0.040389
That way, the object more closely follows the pandas.DataFrame api:
instead of needing to use d.asonedf['a'], I could access d['a'];
instead of needing to use d.asonedf + 12, I could do d + 12;
etc.
Is that possible?
I could make SplitDataFrame inherit from pandas.DataFrame, but that does not magically add the desired behaviour.
Many thanks!
You could of course proxy all relevant magic methods to a concatenated dataframe on demand. If you don't want to repeat yourself endlessly, you could dynamically do that.
I'm not saying this is the way to go, but it kind of works:
import textwrap
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    # inside the class body: generate a proxy for every callable DataFrame attribute
    for name in dir(pd.DataFrame):
        if name in (
            "__init__",
            "__new__",
            "__getattribute__",
            "__getattr__",
            "__setattr__",
        ) or not callable(getattr(pd.DataFrame, name)):
            continue
        exec(
            textwrap.dedent(
                f"""
                def {name}(self, *args, **kwargs):
                    return pd.concat(self._dfs, axis=1).{name}(*args, **kwargs)
                """
            )
        )
As you might guess, there are all kinds of strings attached to this solution, and it uses horrible practices (using exec, using dir, ...).
At the very least I would implement __repr__ so you don't lie to yourself about what kind of object this is, and maybe you'd want to explicitly enumerate all methods you want defined instead of getting them via dir(). Instead of exec() you can define the function normally and then set it on the class with setattr.
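A sketch of that setattr variant (my illustration of the suggestion above, proxying just a fixed handful of methods instead of everything from dir()):

import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

def _make_proxy(name):
    # build a method that forwards the call to the concatenated dataframe
    def proxy(self, *args, **kwargs):
        return getattr(pd.concat(self._dfs, axis=1), name)(*args, **kwargs)
    proxy.__name__ = name
    return proxy

# dunder methods must live on the class, not the instance, to be picked up
for _name in ("__getitem__", "__add__", "__repr__", "head", "sum"):
    setattr(SplitDataFrame, _name, _make_proxy(_name))

d = SplitDataFrame(pd.DataFrame(np.random.rand(2,2), columns=['a','b']),
                   pd.DataFrame(np.random.rand(2,2), columns=['q','r']))
d.increase(0, 10)
print(d['a'])   # behaves like the concatenated dataframe
print(d + 12)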

How do I subclass or otherwise extend a pandas DataFrame without breaking DataFrame.append()?

I have a complex object I'd like to build around a pandas DataFrame. I've tried to do this with a subclass, but appending to the DataFrame reinitializes all properties in a new instance even when using _metadata, as recommended here. I know subclassing pandas objects is not recommended but I don't know how to do what I want with composition (or any other method), so if someone can tell me how to do this without subclassing that would be great.
I'm working with the following code:
import pandas as pd

class thisDF(pd.DataFrame):
    @property
    def _constructor(self):
        return thisDF

    _metadata = ['new_property']

    def __init__(self, data=None, index=None, columns=None, copy=False, new_property='reset'):
        super(thisDF, self).__init__(data=data, index=index, columns=columns, dtype='str', copy=copy)
        self.new_property = new_property

cols = ['A', 'B', 'C']
new_property = cols[:2]
tdf = thisDF(columns=cols, new_property=new_property)
As in the examples I linked to above, operations like tdf[['A', 'B']].new_property work fine. However, modifying the data in a way that creates a new copy initializes a new instance that doesn't retain new_property. So the code
print(tdf.new_property)
tdf = tdf.append(pd.Series(['a', 'b', 'c'], index=tdf.columns), ignore_index=True)
print(tdf.new_property)
outputs
['A', 'B']
reset
How do I extend pd.DataFrame so that thisDF.append() retains instance attributes (or some equivalent data structure if not using a subclass)? Note that I can do everything I want by making a class with a DataFrame as an attribute, but I don't want to do my_object.dataframe.some_method() for all DataFrame operations.
"[...] or wrapping all DataFrame methods with my_object class methods (because I'm assuming that would be a lot of work, correct?)"
No, it doesn't have to be a lot of work. You actually don't have to wrap every function of the wrapped object yourself. You can use __getattr__ to pass calls down to your wrapped object like this:
class WrappedDataFrame:
    def __init__(self, df, new_property):
        self._df = df
        self.new_property = new_property

    def __getattr__(self, attr):
        if attr in self.__dict__:
            return getattr(self, attr)
        return getattr(self._df, attr)

    def __getitem__(self, item):
        return self._df[item]

    def __setitem__(self, item, data):
        self._df[item] = data
__getattr__ is a dunder method that is called whenever an attribute lookup on an instance of the class fails through the normal route. In this implementation, when __getattr__ is called it first checks the instance's own __dict__ and, failing that, looks the attribute up on the wrapped DataFrame object and returns it from there.
So this class works almost exactly like a DataFrame for the most part. You could now just implement the methods you want to behave differently, like append in your example.
You could either make it so that append modifies the wrapped DataFrame object
def append(self, *args, **kwargs):
    self._df = self._df.append(*args, **kwargs)
or so that it returns a new instance of the WrappedDataFrame class, which of course keeps all your functionality.
def append(self, *args, **kwargs):
    # pass new_property along so the new wrapper keeps it
    return self.__class__(self._df.append(*args, **kwargs), self.new_property)
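A quick check of the second variant (my sketch; note that DataFrame.append itself was removed in pandas 2.0, so here the wrapper forwards to pd.concat instead):

import pandas as pd

class WrappedDataFrame:
    def __init__(self, df, new_property):
        self._df = df
        self.new_property = new_property

    def __getattr__(self, attr):
        return getattr(self._df, attr)

    def append(self, row, **kwargs):
        # pd.concat stands in for DataFrame.append here
        new_df = pd.concat([self._df, pd.DataFrame([row])], **kwargs)
        return self.__class__(new_df, self.new_property)

tdf = WrappedDataFrame(pd.DataFrame(columns=['A', 'B', 'C']), new_property=['A', 'B'])
tdf = tdf.append(pd.Series(['a', 'b', 'c'], index=tdf.columns))
print(tdf.new_property)  # ['A', 'B'] survives the append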
