Apologies if the title is misleading or incorrect; I'm not entirely familiar with the terminology.
I have a class, let's call it Cleaner, and it should have a couple of methods in it.
For example:
class Cleaner:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def clean(self, dataframe=None):
        if dataframe is None:
            tmp = self.df
        # do cleaning operation
The clean function should behave as both a staticmethod and a regular instance method. What I mean is that I should be able to call it in both of the following ways:
tble = pd.read_csv('./some.csv')
cleaner = Cleaner(tble)
#method 1
cleaner.clean()
#method 2
Cleaner.clean(tble)
I will acknowledge that I have very nascent knowledge of OOP concepts in Python and would like your advice on whether this is doable and, if so, how.
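One possible direction (a minimal sketch, not necessarily idiomatic): let clean inspect its first argument, so the same function works whether it is called on an instance or with a bare DataFrame. The isinstance check is my suggestion, not something from the question.

import pandas as pd

class Cleaner:
    def __init__(self, df=None):
        self.df = df

    def clean(obj, dataframe=None):
        # `obj` is a Cleaner when called as cleaner.clean(), and a raw
        # DataFrame when called as Cleaner.clean(tble).
        df = dataframe if dataframe is not None else (
            obj.df if isinstance(obj, Cleaner) else obj
        )
        # ... do cleaning operation on df ...
        return df

tble = pd.DataFrame({"a": [1, 2]})   # stand-in for pd.read_csv('./some.csv')
cleaner = Cleaner(tble)
cleaner.clean()        # method 1: instance call
Cleaner.clean(tble)    # method 2: "static-like" call with a bare DataFrame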
I'm designing an ETL that will run as a Spark job, written in Python. It is the ML model pre-training process, in which I enrich the data, fill in missing values, perform various kinds of aggregation and filtering, and so on, bringing the raw data to the point where it is ready for feature processing.
What is the best practice for working with a DataFrame when I also need to change different config files according to the data, perform a lot of logic, combine DataFrames from multiple sources, and still keep it readable, testable, etc.?
I thought it might be good to stick with the transform pipeline composition we are familiar with (sketched below) and extend the DataFrame, so that a Python class wraps it with all the methods and members I need for each data source.
I don't know if this is considered best practice. What do you think? Is there a better way to deal with it?
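(For context, by "transform pipeline composition" I mean roughly this style, sketched with made-up data and column names, and assuming Spark 3.0+ where DataFrame.transform is available:)

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame([(1.0, 2.0), (None, 4.0)], ["a", "b"])  # made-up data

def fill_missing(df: DataFrame) -> DataFrame:
    return df.fillna(0)

def add_ratio(df: DataFrame) -> DataFrame:
    # Hypothetical derived column, purely for illustration.
    return df.withColumn("ratio", F.col("a") / F.col("b"))

# Plain functions compose into a readable, testable pipeline.
prepared = raw_df.transform(fill_missing).transform(add_ratio)
prepared.show()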
Another question on the same subject - talking about this idea, I can think of two ways to do that -
The first -
class EnrichedDataframe:
    def __init__(self, df, *args):
        self.df = df
        self.args = args

    def func1(self):
        self.df = self.df.doSomeLogic()
        return self

    def func2(self):
        self.df = self.df.doSomeLogic()
        return self

    def func3(self):
        self.df = self.df.doSomeLogic()
        return self
The second -
class EnrichedDataframe(DataFrame):
    def __init__(self, df, *args):
        super(self.__class__, self).__init__(df._jdf, df.sql_ctx)
        self.df = df
        self.args = args

    def func1(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)

    def func2(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)

    def func3(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)
Which one is better, and why? Or maybe it doesn't matter?
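(For reference, both designs are meant to be used the same way from the caller's side; doSomeLogic stands in for real transformations and raw_df is just a placeholder for the source DataFrame:)

# A usage sketch, identical for either design:
enriched = EnrichedDataframe(raw_df).func1().func2().func3()
result_df = enriched.df   # the underlying Spark DataFrame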
I have a custom class in Python that I would like to behave in a certain way when the object itself (i.e., not one of its methods/properties) is accessed.
This is a contrived minimal working example to show what I mean. I have a class that holds various pandas DataFrames so that they can be manipulated separately:
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    @property
    def asonedf(self):
        return pd.concat(self._dfs, axis=1)

d = SplitDataFrame(pd.DataFrame(np.random.rand(2,2), columns=['a','b']),
                   pd.DataFrame(np.random.rand(2,2), columns=['q','r']))
d.increase(0, 10)
This works, and I can confirm that d._dfs now indeed is
[ a b
0 10.845681 10.561956
1 10.036739 10.262282,
q r
0 0.164336 0.412171
1 0.440800 0.945003]
So far, so good.
Now, I would like to change/add to the class's definition so that, when not using the .increase method, it returns the concatenated dataframe. In other words, when accessing d, I would like it to return the same dataframe as when typing d.asonedf, i.e.,
a b q r
0 10.143904 10.154455 0.776952 0.247526
1 10.039038 10.619113 0.443737 0.040389
That way, the object more closely follows the pandas.DataFrame api:
instead of needing to use d.asonedf['a'], I could access d['a'];
instead of needing to use d.asonedf + 12, I could do d + 12;
etc.
Is that possible?
I could make SplitDataFrame inherit from pandas.DataFrame, but that does not magically add the desired behaviour.
Many thanks!
You could of course proxy all the relevant magic methods to a concatenated dataframe built on demand. If you don't want to repeat yourself endlessly, you could do that dynamically.
I'm not saying this is the way to go, but it kind of works:
import textwrap
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    for name in dir(pd.DataFrame):
        if name in (
            "__init__",
            "__new__",
            "__getattribute__",
            "__getattr__",
            "__setattr__",
        ) or not callable(getattr(pd.DataFrame, name)):
            continue
        exec(
            textwrap.dedent(
                f"""
                def {name}(self, *args, **kwargs):
                    return pd.concat(self._dfs, axis=1).{name}(*args, **kwargs)
                """
            )
        )
As you might guess, there are all kinds of strings attached to this solution, and it uses horrible practices (exec, dir, ...).
At the very least I would implement __repr__ so you don't lie to yourself about what kind of object this is, and you may want to explicitly enumerate all the methods you want defined instead of getting them via dir(). Instead of exec() you can define the function normally and then set it on the class with setattr.
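For example, a minimal sketch of that setattr variant, with an explicit whitelist of method names (the helper name and the whitelist are mine; extend it with whatever DataFrame methods you need):

import pandas as pd

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    def __repr__(self):
        return f"SplitDataFrame({pd.concat(self._dfs, axis=1)!r})"

def _make_delegate(name):
    def delegate(self, *args, **kwargs):
        # Build the concatenated dataframe on demand and forward the call.
        return getattr(pd.concat(self._dfs, axis=1), name)(*args, **kwargs)
    return delegate

# Explicit whitelist instead of dir(); dunders must live on the class to work.
for _name in ["__getitem__", "__add__", "head", "sum"]:
    setattr(SplitDataFrame, _name, _make_delegate(_name))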
Apologies for the clunky title, but I was wondering what the best practice is for the example below in terms of TDD and maintainability.
Take the class below.
import pandas

class sampleClass():
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.other_dataframe = pandas.read_csv(....)

    def modify_dataframe_method1(self):
        self.dataframe = self.dataframe.join(self.other_dataframe)

    def modify_dataframe_method2(self, df):
        df = df.join(self.other_dataframe)
        return df
Both of those methods can do the same thing, just with different syntax. If I created another method within the class, either of the following would end up with the same result in self.dataframe.

def process(self):
    self.modify_dataframe_method1()

def process(self):
    self.dataframe = self.modify_dataframe_method2(self.dataframe)

What are the pros and cons of each approach? While I am using a dataframe in the example, I can imagine doing similar things to JSON objects or other data structures.
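As a rough illustration of the trade-off (a standalone sketch with made-up names, not the class above): a method that returns a value can be exercised in isolation with any input, while a mutating method is checked through the object's state.

import pandas as pd

class Example:
    def __init__(self, df, other):
        self.dataframe = df
        self.other_dataframe = other

    def modify_in_place(self):
        # Mutating style: the result lands back on the instance.
        self.dataframe = self.dataframe.join(self.other_dataframe)

    def modify_pure(self, df):
        # Pure style: the caller decides what to do with the result.
        return df.join(self.other_dataframe)

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})
ex = Example(left, right)

assert list(ex.modify_pure(left).columns) == ["a", "b"]   # easy to test in isolation
ex.modify_in_place()
assert list(ex.dataframe.columns) == ["a", "b"]           # tested via object state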
I am trying to inherit from the DataFrame class and add custom methods as below, so that I can chain fluently and also ensure all methods refer to the same dataframe. I get an exception: column is not iterable.
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import lit

class MyClass(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df.sql_ctx)

    def add_column3(self):
        # Add col3 to the dataframe received
        self._jdf.withColumn("col3", lit(3))
        return self

    def add_column4(self):
        # Add col4 to the dataframe received
        self._jdf.withColumn("col4", lit(4))
        return self
if __name__ == "__main__":
    '''
    Spark Context initialization code
    col1 col2
    a    1
    b    2
    '''
    df = spark.createDataFrame([("a",1), ("b",2)], ["col1","col2"])
    myobj = MyClass(df)
    ## Trying to accomplish below, where I can chain MyClass methods & DataFrame methods
    myobj.add_column3().add_column4().drop_columns(["col1"])
    '''
    Expected output
    col2, col3, col4
    1, 3, 4
    2, 3, 4
    '''
Actually, you don't need to inherit from the DataFrame class in order to add custom methods to DataFrame objects.
In Python, you can add a custom property that wraps your methods like this:
from functools import wraps
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

# decorator to attach a function to an attribute
def add_attr(cls):
    def decorator(func):
        @wraps(func)
        def _wrapper(*args, **kwargs):
            f = func(*args, **kwargs)
            return f
        setattr(cls, func.__name__, _wrapper)
        return func
    return decorator

# custom functions
def custom(self):
    @add_attr(custom)
    def add_column3():
        return self.withColumn("col3", lit(3))

    @add_attr(custom)
    def add_column4():
        return self.withColumn("col4", lit(4))

    return custom

# add new property to the class pyspark.sql.DataFrame
DataFrame.custom = property(custom)

# use it
df.custom.add_column3().show()
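One consequence of this design worth noting (my reading, not stated in the original answer): each call returns a plain DataFrame, so further chaining has to go back through .custom on every result:

# Chaining the custom methods means accessing .custom again on each returned DataFrame:
df.custom.add_column3().custom.add_column4().show()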
The answer by blackbishop is worth a look, even if it has no upvotes as of this writing. This seems a good general approach for extending the Spark DataFrame class, and I presume other complex objects. I rewrote it slightly as this:
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col
from functools import wraps

# Create a decorator to add a function to a python object
def add_attr(cls):
    def decorator(func):
        @wraps(func)
        def _wrapper(*args, **kwargs):
            f = func(*args, **kwargs)
            return f
        setattr(cls, func.__name__, _wrapper)
        return func
    return decorator

# Extensions to the Spark DataFrame class go here
def dataframe_extension(self):
    @add_attr(dataframe_extension)
    def drop_records():
        return (
            self
            .where(~((col('test1') == 'ABC') & (col('test2') == 'XYZ')))
            .where(~col('test1').isin(['AAA', 'BBB']))
        )
    return dataframe_extension

DataFrame.dataframe_extension = property(dataframe_extension)
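A usage sketch (assuming a SparkSession named spark; the rows are made up just to exercise both filters):

df = spark.createDataFrame(
    [("ABC", "XYZ"), ("AAA", "1"), ("CCC", "2")],
    ["test1", "test2"],
)
# Only the ("CCC", "2") row survives the two .where() filters.
df.dataframe_extension.drop_records().show()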
Below is my solution (based on your code).
I don't know if it's best practice, but at least it does what you want correctly. DataFrames are immutable objects, so after adding a new column we create a new object, not a plain DataFrame but a MyClass object, because we want both the DataFrame methods and the custom methods.
from pyspark.sql.dataframe import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

class MyClass(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df.sql_ctx)
        self._df = df

    def add_column3(self):
        # Add col3 to the dataframe received
        newDf = self._df.withColumn("col3", F.lit(3))
        return MyClass(newDf)

    def add_column4(self):
        # Add col4 to the dataframe received
        newDf = self._df.withColumn("col4", F.lit(4))
        return MyClass(newDf)
df = spark.createDataFrame([("a",1), ("b",2)], ["col1","col2"])
myobj = MyClass(df)
myobj.add_column3().add_column4().na.drop().show()
# Result:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 1| 3| 4|
| b| 2| 3| 4|
+----+----+----+----+
I think you are looking for something like this:
class dfc:
    def __init__(self, df):
        self.df = df

    def func(self, num):
        self.df = self.df.selectExpr(f"id * {num} AS id")

    def func1(self, num1):
        self.df = self.df.selectExpr(f"id * {num1} AS id")

    def dfdis(self):
        self.df.show()
In this example, a dataframe is passed to the constructor and then used by the methods defined inside the class. The state of the dataframe is stored in the instantiated object whenever the corresponding methods are called.
df = spark.range(10)
ob = dfc(df)
ob.func(2)
ob.func(2)
ob.dfdis()
Note: Pyspark is deprecating df.sql_ctx in an upcoming version, so this answer is not future-proof.
I like many of the other answers, but there are a few lingering questions in comments. I think they can be addressed as such:
we need to think of everything as immutable, so we return the subclass
we do not need to call self._jdf anywhere -- instead, just use self as if it were a DataFrame (since it is one -- this is why we used inheritance!)
we need to explicitly construct a new one of our class since returns from self.foo will be of the base DataFrame type
I have added a DataFrameExtender subclass that mediates creation of new classes. Subclasses will inherit parent constructors if not overridden, so we can neaten up the DataFrame constructor to take a DataFrame, and add in the capability to store metadata.
We can make a new class for conceptual stages that the data arrives in, and we can sidecar flags that help us identify the state of the data in the dataframe. Here I add a flag when either add column method is called, and I push forward all existing flags. You can do whatever you like.
This setup means that you can create a sequence of DataFrameExtender objects, such as:
RawData, which implements .clean() method, returning CleanedData
CleanedData, which implements .normalize() method, returning ModelReadyData
ModelReadyData, which implements .train(model) and .predict(model), or .summarize() and which is used in a model as a base DataFrame object would be used.
By splitting these methods into different classes, it means that we cannot call .train() on RawData, but we can take a RawData object and chain together .clean().normalize().train(). This is a functional-like approach, but using immutable objects to assist in interpretation.
Note that DataFrames in Spark are lazily evaluated, which is great for this approach. All of this code just produces a final unevaluated DataFrame object that contains all of the operations that will be performed. We don't have to worry about memory or copies or anything.
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import lit

class DataFrameExtender(DataFrame):
    def __init__(self, df, **kwargs):
        self.flags = kwargs
        super().__init__(df._jdf, df.sql_ctx)

class ColumnAddedData(DataFrameExtender):
    def add_column3(self):
        df_added_column = self.withColumn("col3", lit(3))
        return ColumnAddedData(df_added_column, with_col3=True, **self.flags)

    def add_column4(self):
        ## Add a bit of complexity: do not call again if we have already called this method
        if not self.flags.get('with_col4'):
            df_added_column = self.withColumn("col4", lit(4))
            return ColumnAddedData(df_added_column, with_col4=True, **self.flags)
        return self
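A usage sketch (assumes a SparkSession named spark, as in the earlier answers):

df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
data = ColumnAddedData(df)
result = data.add_column3().add_column4()
result.show()          # col1, col2, col3, col4
print(result.flags)    # both flags are now set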
I recently moved from Matlab to Python and want to port some Matlab code to Python. However, an obstacle popped up.
In Matlab you can define a class with its methods and create nd-arrays of instances. The nice thing is that you can apply the class methods to the array of instances, as long as the method is written so it can deal with arrays. In Python I found that this is not possible: when applying a class method to a list of instances, it will not find the class method. Below is an example of how I would write the code:
class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

classlist = [testclass(1), testclass(10), testclass(100)]
times5(classlist)
This gives an error on the times5(classlist) line. This is a simple example of what I want to do (the final class will have multiple numpy arrays as variables).
What is the best way to get this kind of functionality in Python? The reason I want to do this is that it allows batch operations, which make the class a lot more powerful. The only solution I can think of is to define a second class that holds a list of instances of the first class; the batch processing would then need to be implemented in that second class, as sketched below.
Thanks!
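For illustration, a minimal sketch of that second-class idea (the wrapper name and its apply method are hypothetical):

class TestClassBatch:
    def __init__(self, instances):
        self.instances = list(instances)

    def apply(self, method_name, *args, **kwargs):
        # Call the named method on every wrapped instance and collect the results.
        return TestClassBatch(
            getattr(obj, method_name)(*args, **kwargs) for obj in self.instances
        )

batch = TestClassBatch([testclass(1), testclass(10), testclass(100)])
times5_batch = batch.apply("times5")
print([obj.data for obj in times5_batch.instances])   # [5, 50, 500]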
UPDATE:
In your comment, I noticed this sentence:
"For example a function that takes the data of the first class in the list and subtracts the data of all following classes."
This can be solved with the reduce function.
from functools import reduce

class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

classlist = [x.data for x in [testclass(1), testclass(10), testclass(100)]]
result = reduce(lambda x, y: x - y, classlist[1:], classlist[0])
print(result)
ORIGINAL ANSWER:
In fact, what you need is a list comprehension.
Let me show you the code:
class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

classlist = [testclass(1), testclass(10), testclass(100)]
results = [x.times5() for x in classlist]
print(results)
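Note that the print above shows default object reprs; to see the underlying values, a small follow-up:

# Pull the data attribute out of each result to get readable output.
print([x.data for x in results])   # [5, 50, 500]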