Processing arbitrary data: inherit and override base methods or provide callbacks? - python

I'm writing a Python class, let's call it CSVProcessor. Its purpose is the following:
extract data from a CSV file
process that data in an arbitrary way
update a database with the freshly processed data
Now it sounds like this is way too much for one class but it's already relying on high-level components for steps 1 and 3, so I only need to focus on step 2.
I also established the following:
the data extracted in step 1 would be stored in a list
every single element of that list needs to be processed individually and independently of one another by step 2
the processed data needs to come out of step 2 as a list in order for step 3 to be continued
It's not a hard problem; Python is amazingly flexible, and in fact I have already found two solutions. But I'm wondering what the side effects of each are (if any), and basically which should be preferred over the other and why.
Solution 1
At runtime, my class CSVProcessor accepts a function object and uses it in step 2 to process every single element output by step 1. It simply aggregates the results from that function in a list and carries on with step 3.
Sample code (outrageously simplified but gives an idea):
class CSVProcessor:
    ...
    def step_1(self):
        self.data = self.extract_data_from_CSV()

    def step_2(self, processing_function):
        # Rebinding the loop variable would not update self.data, so
        # collect the processed elements into a new list instead.
        self.data = [processing_function(element) for element in self.data]

    def step_3(self):
        self.update_database(self.data)
Usage:
csv_proc = CSVProcessor()
csv_proc.step_1()
csv_proc.step_2(my_custom_function)  # my_custom_function would be defined elsewhere
csv_proc.step_3()
Solution 2
My class CSVProcessor defines an "abstract method" whose purpose is to process single elements in a concrete implementation of the class. Before runtime, a new class inherits from CSVProcessor and overrides the abstract method to process the elements.
class CSVProcessor:
    ...
    def step_1(self):
        self.data = self.extract_data_from_CSV()

    def processing_function(self, element):  # Abstract method to be overridden
        raise NotImplementedError

    def step_2(self):
        # Same note as in Solution 1: collect the results into a new list.
        self.data = [self.processing_function(element) for element in self.data]

    def step_3(self):
        self.update_database(self.data)
Usage:
class ConcreteCSVProcessor(CSVProcessor):  # must inherit from CSVProcessor
    def processing_function(self, element):  # Here it gets overridden
        # Do actual stuff
        # Blah blah blah
        return element
csv_proc = ConcreteCSVProcessor()
csv_proc.step_1()
csv_proc.step_2() # No need to pass anything!
csv_proc.step_3()
In hindsight these two solutions share essentially the same workflow; my question is more like "where should the data-processing function reside?".
In C++ I'd obviously have gone with the second solution, but in Python both ways are just as easy to implement, and I don't really see a noticeable difference between them apart from what I mentioned above.
And today there's also such a thing as considering one's ways of doing things more or less Pythonic... :p
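For what it's worth, the two approaches can also be combined. Below is a minimal sketch (my illustration, not from the original post; the pass-through default is an assumption) where the callback argument falls back to an overridable method:

class CombinedCSVProcessor:
    def processing_function(self, element):
        # Default: pass elements through unchanged.
        # Subclasses may override this instead of supplying a callback.
        return element

    def step_2(self, processing_function=None):
        # Use the callback if one is given, otherwise the (possibly
        # overridden) method.
        func = processing_function or self.processing_function
        self.data = [func(element) for element in self.data]

This keeps Solution 1's runtime flexibility while still allowing Solution 2's subclassing style.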

Related

How to dynamically return object attributes in Python, including attributes of objects that are attributes

I am trying to write a testing program for a Python program that takes data, does calculations on it, then puts the output in a class instance object. This object contains several other objects, each with their own attributes. I'm trying to access all the attributes and sub-attributes dynamically with a one-size-fits-all solution, corresponding to elements in a dictionary I wrote, in order to cycle through and get all those attributes for printing onto a test output file.
Edit: this may not be clear from the above, but I have a list of the attributes I want, so actually getting those attributes is not a problem, although I'm aware Python has methods that accomplish this. What I need is to be able to get all of those attributes with the same function call, regardless of whether they are top-level object attributes or attributes of object attributes.
Python is having some trouble with this - first I tried doing something like this:
for string in attr_dictionary:
    ...
    outputFile.print(outputclass.string)
    ...
But Python did not like this and raised an AttributeError.
After checking SE, I learned that this is a supposed solution:
for string in attr_dictionary:
    ...
    outputFile.print(getattr(outputclass, string))
    ...
The only problem is that I want to dynamically access the attributes of objects that are attributes of outputclass. So ideally it would be something like outputclass.objectAttribute.attribute, but this does not work in Python. When I use getattr(outputclass, objectAttribute.string), Python raises an AttributeError.
Any good solution here?
One thing I have thought of trying is creating methods to return those sub-attributes, something like:
class outputObject:
    ...
    def attributeIWant(self, ...):
        return self.subObject.attributeIWant
    ...
Even then, it seems like getattr() will return an error, because attributeIWant() is a method to be called, not actually an attribute. I'm not certain that making this happen is even within Python's capabilities.
Thank you in advance for reading and/or responding, if anyone is familiar with a way to do this it would save me a bunch of refactoring or additional code.
edit: Additional Clarification
The class for example is outputData, and inside that class you could have an instance of the class furtherData, which has the attribute dataIWant:
class outputData:
    example: furtherData
    example = furtherData()
    example.dataIWant = someData
    ...
With getattr I can't reach both attributes directly on outputData and attributes of example in a single call; the attribute of example needs two calls to getattr.
Edit2: I have found a solution I think works for this, see below
I was able to figure this out - I just wrote a quick function that splits the attribute string (for example outputObj.subObj.propertyIWant) then proceeds down the resultant array, calling getattr on each subobject until it reaches the end of the array and returns the actual attribute.
Code:
def obtainAttribute(sample, attributeString: str):
    baseObj = sample
    attrArray = attributeString.split(".")
    # Use enumerate rather than attrArray.index(string), which would return
    # the wrong position if the same name appeared twice in the path.
    for i, string in enumerate(attrArray):
        if i == len(attrArray) - 1:
            return getattr(baseObj, string)
        else:
            baseObj = getattr(baseObj, string)
    return "failed"
Here sample is the object, and attributeString is, for example, "object.subObject.attributeYouWant".
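As a side note (not part of the original answer), the standard library already covers this: operator.attrgetter accepts dotted paths, and functools.reduce expresses the same walk in one line. A sketch, where output_data_instance stands in for an instance of outputData:

import operator
from functools import reduce

# attrgetter supports dotted names out of the box:
get = operator.attrgetter("example.dataIWant")
value = get(output_data_instance)

# Equivalent hand-rolled version:
def obtain_attribute(sample, attribute_string):
    return reduce(getattr, attribute_string.split("."), sample)

Both raise AttributeError if any segment of the path is missing, which is usually preferable to returning a sentinel string like "failed".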

Avoiding Unnecessary Class Declarations

I'm doing an ML project and decided to use classes to organize my code, although I'm not sure my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that chains these 3 methods and creates the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with the df and just call the wrapper method on it. This feels unnatural and makes me think there is a better way of doing it.
from A_module import A_class  # import the class itself, not the module

instance = A_class(df)
output = instance.wrapper()
Classes are great if you need to keep track of/modify internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
With the context you've given I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3

def func_1(data):
    """Does X to data."""
    pass

# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
Choosing better variable and function names is also a major component of organizing your code - you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). Likewise for the data_xyz variable.
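If the number of steps keeps growing, one common variation (my sketch, not from the answer above; the third step name is hypothetical) is to keep the steps in a list and fold over it, so adding or reordering steps becomes a one-line change:

from functools import reduce

PREPROCESSING_STEPS = [
    add_numerical_column,
    parse_datetime_column,
    drop_incomplete_rows,  # hypothetical third step
]

def preprocess_data_xyz(data):
    # Apply each step to the output of the previous one.
    return reduce(lambda df, step: step(df), PREPROCESSING_STEPS, data)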

Receiving data in python callback function from dll

I am writing a program in Python that communicates with a spectrometer from Avantes. There are some proprietary DLLs available whose code I don't have access to, but they have some decent documentation. I am having some trouble finding a good way to store the data received via callbacks.
The proprietary shared library
Basically, the DLL contains a function that I have to call to start measuring, and that function receives a callback which will be called whenever the spectrometer has finished a measurement. The function is the following:
int AVS_MeasureCallback(AvsHandle a_hDevice,void (*__Done)(AvsHandle*, int*),short a_Nmsr)
The first argument is a handle object that identifies the spectrometer, the second is the actual callback function, and the third is the number of measurements to be made.
The callback function will then receive another type of handle identifying the spectrometer and information about the amount of data available after a measurement.
Python library
I am using a library that has Python wrappers for many instruments, including my spectrometer.
def measure_callback(self, num_measurements, callback=None):
    self.sdk.AVS_MeasureCallback(self._handle, callback, num_measurements)
And they also have defined the following decorator:
MeasureCallback = FUNCTYPE(None, POINTER(c_int32), POINTER(c_int32))
The idea is that when the callback function is finally called, this will trigger the get_data() function that will retrieve data from the equipment.
The recommended example is:
@MeasureCallback
def callback_fcn(handle, info):
    print('The DLL handle is:', handle.contents.value)
    if info.contents.value == 0:  # equals 0 if everything is okay (see manual)
        print(' callback data:', ava.get_data())

ava.measure_callback(-1, callback_fcn)
My problem
I have to store the received data in a 2D numpy array that I have created somewhere else in my main code, but I can't figure out the best way to update this array with the new data available inside the callback function.
I wondered if I could pass this numpy array as an argument to the callback function, but even then I cannot find a good way to do it, since the callback function is expected to have only those two arguments.
Edit 1
I found a possible solution here but I am not sure it is the best way to do it. I'd rather not create a new class just to hold a single numpy array inside.
Edit 2
I actually changed my mind about my approach, because inside my callback I'd like to perform many operations on the received data and save the results in many different variables. So I went back to the class approach mentioned here, where I would basically have a class holding all the variables that are somehow used in the callback function, and that would also inherit from or hold an object of the class ava.
However, as shown in this other question, the self parameter is a problem in this case.
If you don't want to create a new class, you can use a function closure:
# Initialize it however you want
numpy_array = ...

def callback_fcn(handle, info):
    # Do what you want with the value of the variable
    store_data(numpy_array, ...)

# After the callback is called, you can access the changes made to the object
print(get_data(numpy_array))
How this works is that when callback_fcn is defined, it keeps a reference to the variable numpy_array, so when it's called it can manipulate it, as if it were passed as an argument to the function. So you get the effect of passing it in, without the callback caller having to worry about it.
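A concrete version of that sketch in the context of the question (my illustration; the shapes are hypothetical, and it assumes the MeasureCallback decorator and the ava object from above):

import numpy as np

pixel_amount = 2048  # hypothetical
spectral_data = np.empty((100, pixel_amount), dtype=np.float64)
spectrum_index = 0

@MeasureCallback
def callback_fcn(handle, info):
    global spectrum_index  # needed because the index is rebound
    if info.contents.value >= 0:
        timestamp, spectrum = ava.get_data()
        spectral_data[spectrum_index, :] = np.ctypeslib.as_array(spectrum[0:pixel_amount])
        spectrum_index += 1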
I finally managed to solve my problem with a solution involving a new class, and also a closure function to deal with the self parameter, as described here. Besides that, another problem would have appeared through garbage collection of the newly created callback, which is why the wrapped callback is stored on self.
My final solution is:
import numpy as np

class spectrometer():
    def measurement_callback(self, handle, info):
        if info.contents.value >= 0:
            timestamp, spectrum = self.ava.get_data()
            # pixel_amount is stored in register_callback so it is
            # visible here.
            self.spectral_data[self.spectrum_index, :] = \
                np.ctypeslib.as_array(spectrum[0:self.pixel_amount])
            self.timestamps[self.spectrum_index] = timestamp
            self.spectrum_index += 1

    def __init__(self, ava):
        self.ava = ava
        # Wrap the bound method and keep a reference on self, so the
        # ctypes callback object is not garbage collected.
        self.measurement_callback = MeasureCallback(self.measurement_callback)

    def register_callback(self, scans, pattern_amount, pixel_amount):
        self.spectrum_index = 0
        self.pixel_amount = pixel_amount
        self.timestamps = np.empty((pattern_amount), dtype=np.uint32)
        self.spectral_data = np.empty((pattern_amount, pixel_amount), dtype=np.float64)
        self.ava.measure_callback(scans, self.measurement_callback)
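A usage sketch under the same assumptions (scans=-1 follows the question's continuous-measurement example; the other amounts are hypothetical):

spec = spectrometer(ava)
spec.register_callback(scans=-1, pattern_amount=100, pixel_amount=2048)
# ... wait for the measurements to arrive ...
print(spec.spectral_data.shape, spec.timestamps[:spec.spectrum_index])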

Is it appropriate to use a class for the purpose of organizing functions that share inputs?

To provide a bit of context, I am building a risk model that pulls data from various different sources. Initially I wrote the model as a single function that, when executed, read in the different data sources as pandas.DataFrame objects and used those objects when necessary. As the model grew in complexity, it quickly became unreadable and I found myself copying and pasting blocks of code often.
To cleanup the code I decided to make a class that when initialized reads, cleans and parses the data. Initialization takes about a minute to run and builds my model in its entirety.
The class also has some additional functionality. There is a generate_email method that sends an email with details about high risk factors, and another method, append_history, that snapshots the risk model at a point in time and saves it so I can run time comparisons.
The thing about these two additional methods is that I cannot imagine a scenario where I would call them without first re-calibrating my risk model. So I have considered calling them in __init__() like my other methods. I haven't, only because I am trying to justify having a class in the first place.
I am consulting this community because my project structure feels clunky and awkward. I am inclined to believe that I should not be using a class at all. Is it frowned upon to create classes merely for the purpose of organization? Also, is it bad practice to call instance methods (that take upwards of a minute to run) within __init__()?
Ultimately, I am looking for reassurance or a better code structure. Any help would be greatly appreciated.
Here is some pseudo code showing my project structure:
class RiskModel:
    def __init__(self, data_path_a, data_path_b):
        self.data_path_a = data_path_a
        self.data_path_b = data_path_b
        self.historical_data = None
        self.raw_data = None
        self.lookup_table = None
        self._read_in_data()
        self.risk_breakdown = None
        self._generate_risk_breakdown()
        self.risk_summary = None
        self._generate_risk_summary()

    def _read_in_data(self):
        # read in a .csv
        self.historical_data = pd.read_csv(self.data_path_a)
        # read an excel file containing many sheets into an ordered dictionary
        self.raw_data = pd.read_excel(self.data_path_b, sheet_name=None)
        # store a specific sheet from the excel file that is used by most of
        # my class's methods
        self.lookup_table = self.raw_data["Lookup"]

    def _generate_risk_breakdown(self):
        '''
        A function that creates a DataFrame from self.historical_data,
        self.raw_data, and self.lookup_table and stores it in
        self.risk_breakdown
        '''
        self.risk_breakdown = some_dataframe

    def _generate_risk_summary(self):
        '''
        A function that creates a DataFrame from self.lookup_table and
        self.risk_breakdown and stores it in self.risk_summary
        '''
        self.risk_summary = some_dataframe

    def generate_email(self, recipient):
        '''
        A function that sends an email with details about high risk factors
        '''
if __name__ == "__main__":
    risk_model = RiskModel(data_path_a, data_path_b)
    risk_model.generate_email("recipient@generic.com")
In my opinion it is a good way to organize your project, especially given how much of the code you mentioned reusing.
One thing though: I wouldn't put the _read_in_data, _generate_risk_breakdown and _generate_risk_summary methods inside __init__, but would instead let the user call these methods after initializing the RiskModel class instance.
This way the user would be able to read in data from a different path, or to generate only the risk breakdown or summary, without reading in the data once again.
Something like this:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
my_risk_model.generate_risk_breakdown(parameters)
my_risk_model.generate_risk_summary(other_parameters)
If there is a risk of the user calling these methods in an order that would break the logical chain, you could raise an exception if generate_risk_breakdown or generate_risk_summary is called before read_in_data, as sketched below. Of course, you could also move only the generate... methods out, leaving the data import inside __init__.
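A minimal sketch of such a guard (my illustration, not part of the original answer):

def generate_risk_breakdown(self, parameters):
    if self.historical_data is None:
        raise RuntimeError(
            "read_in_data() must be called before generate_risk_breakdown()"
        )
    ...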
To advocate more for exposing the generate... methods outside of __init__, consider a scenario where you would like to generate multiple risk summaries, changing various parameters. It would make sense not to create the RiskModel and read the same data every time, but instead to change the input to the generate_risk_summary method:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
for parameter in [50, 60, 80]:
    my_risk_model.generate_risk_summary(parameter)
    my_risk_model.generate_email('test@gmail.com')

How do I retain the method attributes of the functions generated through yield in Python 2.7?

I have been doing a lot of searching, and I don't think I've really found what I have been looking for. I will try my best to explain what I am trying to do, and hopefully there is a simple solution, and I'll be glad to have learned something new.
This is ultimately what I am trying to accomplish: using nosetests, decorate some test cases with the attribute selector plugin, then execute the test cases that match a criterion by using the -a switch during command-line invocation. The attribute values of the tests that are executed are then stored in an external location. The command-line call I'm using is like below:
nosetests \testpath\ -a attribute='someValue'
I have also created a customized nosetests plugin, which stores the test cases' attributes and writes them to an external location. The idea is that I can select a batch of tests, and by storing the attributes of these tests, I can filter on these results later for reporting purposes. I am accessing the method attributes in my plugin by overriding the "wantMethod" method with code similar to the following:
def set_attribs(self, method, attribute):
    if hasattr(method, attribute):
        if not self.method_attributes.has_key(method.__name__):
            self.method_attributes[method.__name__] = {}
        self.method_attributes[method.__name__][attribute] = getattr(method, attribute)

def wantMethod(self, method):
    self.set_attribs(method, "attribute1")
    self.set_attribs(method, "attribute2")
    pass
I have this working for pretty much all the tests, except for one case where the test uses the "yield" keyword. What is happening is that the generated methods are executed fine, but the method attributes are empty for each of the generated functions.
Below is an example of what I am trying to achieve. The test below retrieves a list of values and, for each of those values, yields the results from another function:
@attr(attribute1='someValue', attribute2='anotherValue')
def sample_test_generator(self):
    for (key, value) in _input_dictionary.items():
        f = partial(self._do_test, key, value)
        f.attribute1 = 'someValue'
        yield (lambda x: f(), key)

def _do_test(self, input1, input2):
    # Some code
From what I have read, and think I understand, when yield is called it creates a new callable function which then gets executed. I have been trying to figure out how to retain the attribute values from my sample_test_generator method, but I have not been successful. I thought I could create a partial function and then add the attribute to it, but no luck. The tests execute without any errors; it just seems that from my plugin's perspective the method attributes aren't present, so they don't get recorded.
I realize this a pretty involved question, but I wanted to make sure that the context for what I am trying to achieve is clear. I have been trying to find information that could help me for this particular case, but I feel like I've reached a stumbling block now, so I would really like to ask the experts for some advice.
Thanks.
** Update **
After reading through the feedback and playing around some more, it looks like if I modify the lambda expression, it achieves what I am looking for. In fact, I didn't even need to create the partial function:
def sample_test_generator(self):
    for (key, value) in _input_dictionary.items():
        yield (lambda: self._do_test)
The only downside to this approach is that the test name will not change. As I play around more, it looks like in nosetests, when a test generator is used, the test name in the results actually changes based on the keywords it contains. The same thing was happening when I was using the lambda expression with a parameter.
For example:
Using a lambda expression with a parameter:
yield (lambda x: self._do_test, "value1")
In the nosetests plugin, when you access the test case name, it is displayed as "sample_test_generator(value1)".
Using lambda expression without a parameter:
yield (lambda: self._do_test)
The test case name in this case would be "sample_test_generator". In my example above, if there are multiple values in the dictionary, the yield call occurs multiple times, yet the test name always remains "sample_test_generator". This is not as bad as getting unique test names but not being able to store the attribute values at all. I will keep playing around, but thanks for the feedback so far!
EDIT
I forgot to come back and provide my final update on how I was able to get this to work in the end. There was a little confusion on my part at first, and after I looked through it some more, I figured out that it had to do with how the tests are recognized:
My original implementation assumed that every test that gets picked up for execution goes through the "wantMethod" call from the plugin's base class. This is not true when "yield" is used to generate the test, because at that point the test method has already passed the "wantMethod" call.
However, once the test case is generated through the "yield" call, it does go through the "startTest" call from the plugin base class, and this is where I was finally able to store the attribute successfully.
So in a nutshell, my test execution order looked like this:
nose -> wantMethod(method_name) -> yield -> startTest(yielded_test_name)
In my override of the startTest method, I have the following:
def startTest(self, test):
    # If a test is spawned by using the 'yield' keyword, its name is the
    # parent test name followed by a '(' character.
    # Example: if the parent test is "smoke_test", the generated test from
    # yield would be "smoke_test('input')".
    test_name = str(test)  # assumption: test_name was undefined in the original snippet
    parent_test_name = test_name.split('(')[0]
    if self.method_attributes.has_key(test_name):
        self._test_attrib = self.method_attributes[test_name]
    elif self.method_attributes.has_key(parent_test_name):
        self._test_attrib = self.method_attributes[parent_test_name]
    else:
        self._test_attrib = None
With this implementation, along with my override of wantMethod, each test spawned by the parent test case also inherits the attributes of the parent method, which is what I needed.
Again, thanks to all who sent replies. This was quite a learning experience.
Would this fix your name issue?
def _actual_test(x, y):
    assert x == y

def test_yield():
    _actual_test.description = "test_yield_%s_%s" % (5, 5)
    yield _actual_test, 5, 5
    _actual_test.description = "test_yield_%s_%s" % (4, 8)  # fail
    yield _actual_test, 4, 8
    _actual_test.description = "test_yield_%s_%s" % (2, 2)
    yield _actual_test, 2, 2
The rename survives @attr too.
Does this work?
@attr(attribute1='someValue', attribute2='anotherValue')
def sample_test_generator(self):
    def get_f(f, key):
        # A new scope per call, so f and key keep their current values.
        return (lambda x: f(), key)

    for (key, value) in _input_dictionary.items():
        f = partial(self._do_test, key, value)
        f.attribute1 = 'someValue'
        yield get_f(f, key)

def _do_test(self, input1, input2):
    # Some code
The problem is that the local variables change after you create the lambda.
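To illustrate the late-binding behaviour referred to here (a standalone sketch, not from the original thread):

# Each lambda looks up i when called, so all of them see its final value.
funcs = [lambda: i for i in range(3)]
print([f() for f in funcs])  # [2, 2, 2]

# Binding i as a default argument captures the current value instead.
funcs = [lambda i=i: i for i in range(3)]
print([f() for f in funcs])  # [0, 1, 2]

The helper get_f above works for the same reason: each call creates a new scope in which f and key are bound to the values from that iteration.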
