Apologies for the clunky title, but I was wondering what the best practice is for the example below in terms of TDD and maintainability.
Take the class below.
class sampleClass():
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.other_dataframe = pandas.read_csv(...)

    def modify_dataframe_method1(self):
        self.dataframe = self.dataframe.join(self.other_dataframe)

    def modify_dataframe_method2(self, df):
        df = df.join(self.other_dataframe)
        return df
Both of those methods can do the same thing, just with different syntax. If I created another method within the class, either of these statements would end in the same result for self.dataframe.
def process(self):
    self.modify_dataframe_method1()

def process(self):
    self.dataframe = self.modify_dataframe_method2(self.dataframe)
What are the pros and cons of each approach? While I am using a DataFrame in the example, I can imagine doing similar things to JSON or other data structures.
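One way to see the testability trade-off is a toy sketch (plain lists stand in for DataFrames here, and `+` stands in for `join`; all names are made up). The mutating style hides its input and output in instance state, while the pure style can be tested in isolation:

```python
class Sample:
    def __init__(self, data, other):
        self.data = data
        self.other = other

    def modify_in_place(self):
        # method1 style: reads and writes instance state, returns nothing
        self.data = self.data + self.other

    def modify_pure(self, data):
        # method2 style: takes input, returns output; the only state it
        # touches is self.other
        return data + self.other

s = Sample([1, 2], [3])
s.modify_in_place()
print(s.data)  # [1, 2, 3]

# The pure style can be unit-tested without inspecting s.data afterwards:
assert s.modify_pure([9]) == [9, 3]
```

In a test, the pure method needs only a minimal instance and an input; the mutating method forces the test to set up and then inspect instance state, which tends to couple tests to the class internals.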
Related
I'm designing an ETL job, which will actually be a Spark job written in Python. This is the ML model pre-training process, in which I enrich the data, complete missing values, execute different kinds of aggregation and filtering, and so on, bringing the raw data to the point where it is ready for feature processing.
What is the best practice for working with a DataFrame when I also need to change different config files according to the data, perform a lot of logic, combine DataFrames from multiple sources, and want it all to remain readable, testable, etc.?
I thought it might be good to stick with the transform pipeline composition we are familiar with and extend the DataFrame, so that a Python class wraps it with all the methods and members I need for each data source.
I don't know if this is considered best practice; what do you think? Is there a better way to deal with it?
Another question on the same subject: thinking about this idea, I can see two ways to do it.
The first:
class EnrichedDataframe:
    def __init__(self, df, *args):
        self.df = df
        self.args = args

    def func1(self):
        self.df = self.df.doSomeLogic()
        return self

    def func2(self):
        self.df = self.df.doSomeLogic()
        return self

    def func3(self):
        self.df = self.df.doSomeLogic()
        return self
The second:
class EnrichedDataframe(DataFrame):
    def __init__(self, df, *args):
        super().__init__(df._jdf, df.sql_ctx)
        self.df = df
        self.args = args

    def func1(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)

    def func2(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)

    def func3(self):
        df = self.df.doSomeLogic()
        return EnrichedDataframe(df, *self.args)
Which one is better, and why? Or maybe it doesn't matter?
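Whichever wrapper shape you pick, the pipeline idea itself can also be expressed without any wrapper class, as plain frame-to-frame functions composed in order. A minimal sketch, using a dict of column lists as a stand-in for a real DataFrame (all step names here are illustrative); with real frames, pandas' `DataFrame.pipe` and PySpark's `DataFrame.transform` chain functions of exactly this shape:

```python
# Each step is a plain function from frame to frame; a dict of column
# lists stands in for the real DataFrame.
def fill_missing(df):
    # replace None with 0 in every column
    return {k: [0 if v is None else v for v in vals] for k, vals in df.items()}

def enrich(df):
    # add a derived column from two existing ones
    return {**df, "c": [a + b for a, b in zip(df["a"], df["b"])]}

def pipeline(df, steps):
    for step in steps:
        df = step(df)
    return df

raw = {"a": [1, None], "b": [10, 20]}
result = pipeline(raw, [fill_missing, enrich])
print(result)  # {'a': [1, 0], 'b': [10, 20], 'c': [11, 20]}
```

Because every step is a pure function of the frame, each one can be unit-tested with a tiny input frame, and the pipeline definition itself is just a list, which keeps the composition readable.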
I am looking to build fairly detailed annotations for methods in a Python class, to be used in troubleshooting, documentation, tooltips for a user interface, etc. However, it's not clear how I can keep these annotations associated with the functions.
For context, this is a feature engineering class, so two example methods might be:
def create_feature_momentum(self):
    return self.data['mass'] * self.data['velocity']

def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
For example:
It'd be good to tell easily which core features were used in each engineered feature.
It'd be good to track arbitrary metadata about each method.
It'd be good to embed non-string data as metadata about each function, e.g. some example calculations on sample dataframes.
So far I've been manually creating docstrings like:
def create_feature_kinetic_energy(self) -> pd.Series:
    '''Calculate the non-relativistic kinetic energy.

    Depends on: ['mass', 'velocity']
    Supports NaN Values: False
    Unit: Energy (J)
    Example:
        self.data = pd.DataFrame({'mass': [0, 1, 2], 'velocity': [0, 1, 2]})
        self.create_feature_kinetic_energy()
        >>> pd.Series([0, 0.5, 4])
    '''
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
And then I'm using regex to get the data about a function by inspecting its __doc__ attribute. However, is there a better place than __doc__ to store information about a function? In the example above it's fairly easy to parse the 'Depends on' list, but in my use case it'd be good to also embed some example data as dataframes somehow (and I think writing them as markdown in the docstring would be hard).
Any ideas?
I ended up writing a class as follows:
class ScubaDiver(pd.DataFrame):
    accessed = None

    def __getitem__(self, key):
        if self.accessed is None:
            self.accessed = set()
        self.accessed.add(key)
        return pd.Series(dtype=float)

    @property
    def columns(self):
        return list(self.accessed)
The way my code is written, I can do this:
sd = ScubaDiver()
foo(sd)
sd.columns
and sd.columns contains all the columns accessed by foo.
Though this might not work in your codebase.
I also wrote this decorator:
def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute.'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes |= notes  # dict merge (Python 3.9+)
        return f
    return _
You can use it as follows:
@add_note({'Units': 'J', 'Relativity': False})
def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
and then you can do:
create_feature_kinetic_energy.notes['Units'] # J
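To turn those notes into documentation or tooltips, one option is to walk the class with the stdlib inspect module and collect every .notes attribute. A sketch (the Features class and its methods are made up for illustration; dict.update is used instead of |= so it also runs before Python 3.9):

```python
import inspect

def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute (same decorator as above).'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes.update(notes)
        return f
    return _

class Features:
    @add_note({'Units': 'J', 'Relativity': False})
    def create_feature_kinetic_energy(self):
        pass

    @add_note({'Units': 'kg*m/s'})
    def create_feature_momentum(self):
        pass

# Collect every annotated method into one catalog for docs/tooltips
catalog = {
    name: fn.notes
    for name, fn in inspect.getmembers(Features, inspect.isfunction)
    if hasattr(fn, 'notes')
}
print(catalog['create_feature_kinetic_energy'])  # {'Units': 'J', 'Relativity': False}
```

Because the notes are plain Python objects on the function, they can hold non-string data (sample DataFrames, callables) with no docstring parsing at all.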
Apologies if the title is misleading/incorrect, as I am not familiar with the exact terminology.
I have a class, let's call it Cleaner, and it should have a couple of methods in it.
For example:
class Cleaner:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def clean(self, dataframe=None):
        if dataframe is None:
            tmp = self.df
        # do cleaning operation
The function clean should behave as both a static method and an instance method. What I mean by that is that I should be able to call it in both of the following ways:
tble = pd.read_csv('./some.csv')
cleaner = Cleaner(tble)

# method 1
cleaner.clean()

# method 2
Cleaner.clean(tble)
I will acknowledge that I have very nascent knowledge of OOP concepts in Python and would like your advice: is this something doable, and if so, how?
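This dual behaviour is doable with a small descriptor: when the attribute is looked up on the class it returns the raw function, and when looked up on an instance it binds the instance's stored frame automatically. A sketch, with a list of rows standing in for the DataFrame and a made-up cleaning rule:

```python
class hybridmethod:
    """Descriptor: Cleaner.clean(df) works like a static method,
    while cleaner.clean() uses the instance's stored df."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self.func                 # accessed on the class
        return lambda: self.func(obj.df)     # accessed on an instance

class Cleaner:
    def __init__(self, df):
        self.df = df

    @hybridmethod
    def clean(df):
        # cleaning logic (illustrative): drop rows containing None
        return [row for row in df if None not in row]

rows = [(1, 2), (3, None)]
print(Cleaner.clean(rows))    # [(1, 2)] -- static-style call
print(Cleaner(rows).clean())  # [(1, 2)] -- instance call
```

A simpler, arguably more Pythonic alternative is a plain module-level clean(df) function plus a thin instance method that calls it; the descriptor is only needed if both call spellings must live under the same name.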
Sometimes I find it useful to split a bigger class into smaller classes with their own methods and attributes, and then assign an instance of the smaller class to an attribute of the bigger class. This way I can organize the class better: when I work in the console I can use nested dot notation instead of seeing a lot of attributes. For instance, I have an instrument with some parameters that can be grouped together and a method that is linked to these parameters. I would structure the class like this:
class params(object):
    def __init__(self, P, I, D):
        self.P = P
        self.I = I
        self.D = D

    def compute_PID(self):
        pass

class instrument(object):
    def __init__(self, name, SN, P, I, D):
        self.name = name
        self.SN = SN
        self.params = params(P, I, D)

    def switch_on(self):
        pass
myinstrument = instrument('blender','123',45,4,3)
myinstrument.params.P
Is there any drawback to this design pattern? I imagine that it requires more memory, but working with dot notation makes things easier compared to a dictionary.
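The pattern itself is ordinary composition and is widely used; the per-instance memory overhead is one extra object, which is rarely worth worrying about. If the boilerplate bothers you, the same structure can be written more compactly with dataclasses (a sketch; the field types are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Params:
    P: float
    I: float
    D: float

    def compute_PID(self):
        pass  # placeholder for the PID computation

@dataclass
class Instrument:
    name: str
    SN: str
    params: Params

    def switch_on(self):
        pass  # placeholder

myinstrument = Instrument('blender', '123', Params(45, 4, 3))
print(myinstrument.params.P)  # 45
```

dataclasses generate __init__ and __repr__ for you, so the nested dot notation comes with readable console output for free.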
I recently moved from Matlab to Python and want to transfer some Matlab code to Python. However, an obstacle popped up.
In Matlab you can define a class with its methods and create nd-arrays of instances. The nice thing is that you can apply the class methods to the array of instances, as long as the method is written so it can deal with arrays. Now in Python I found that this is not possible: when applying a class method to a list of instances, it will not find the class method. Below is an example of how I would write the code:
class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

classlist = [testclass(1), testclass(10), testclass(100)]
times5(classlist)
This will give an error on the times5(classlist) line. This is a simple example explaining what I want to do (the final class will have multiple numpy arrays as variables).
What is the best way to get this kind of functionality in Python? The reason I want this is that it allows batch operations, which make the class a lot more powerful. The only solution I can think of is to define a second class that holds a list of instances of the first class as a variable; the batch processing would then need to be implemented in that second class.
thanks!
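The "second class that holds a list of instances" idea from the question can be made generic with __getattr__ delegation, so every method of the element class is automatically available in batch form. A sketch (class names are illustrative):

```python
class Element:
    def __init__(self, data):
        self.data = data

    def times5(self):
        return Element(self.data * 5)

class Batch:
    """Holds a list of instances; unknown attribute lookups become
    element-wise method calls that return a new Batch."""
    def __init__(self, items):
        self.items = items

    def __getattr__(self, name):
        # only called for names not found normally, e.g. element methods
        def batched(*args, **kwargs):
            return Batch([getattr(item, name)(*args, **kwargs)
                          for item in self.items])
        return batched

batch = Batch([Element(1), Element(10), Element(100)])
result = batch.times5()
print([x.data for x in result.items])  # [5, 50, 500]
```

This keeps the element class simple while giving Matlab-style "apply to the whole array" calls; for heavy numeric work, storing the data in one numpy array inside the batch class would be faster still.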
UPDATE:
In your comment, I noticed this sentence:
For example a function that takes the data of the first class in the list and subtracts the data of all following classes.
This can be solved with the reduce function.
class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

from functools import reduce

classlist = [x.data for x in [testclass(1), testclass(10), testclass(100)]]
result = reduce(lambda x, y: x - y, classlist[1:], classlist[0])
print(result)
ORIGINAL ANSWER:
In fact, what you need is a list comprehension.
Let me show you the code:
class testclass():
    def __init__(self, data):
        self.data = data

    def times5(self):
        return testclass(self.data * 5)

classlist = [testclass(1), testclass(10), testclass(100)]
results = [x.times5() for x in classlist]
print(results)