Stale pickling of class objects - best practice - python

I have a basic ETL workflow that grabs data from an API, builds a class object, performs various operations that result in storing the data in the applicable tables in a DB, and ultimately pickles the object and stores that in the DB as well. The reason for pickling is to save these events and reuse the data for new features.
The problem is how best to handle adding attributes for new features. Of course, when a new attribute is added, previously pickled objects are stale and need to be checked (AttributeError, etc.). This is simple with one or two changes, but over time it seems like it will become problematic.
Any design tips? Pythonic best practices for inherently updating pickled objects? Seems like a common problem in database design?!

You can define an update method for the class. The update method takes an object of the same class (but an older version, as you described) and copies all of the data from the object passed in into the new class object.
Here is an example:
class MyClass:
    def __init__(self):
        self.data = []

    def add_data(self, data):
        self.data.append(data)

    def update(self, obj):
        self.data = obj.data

my_class = MyClass()
my_class.add_data(34)  # Class object then gets pickled...

class MyClass2:
    def __init__(self):
        self.data = []

    def add_data(self, data):
        self.data.append(data)

    def new_attr(self):
        print('This is a new attribute.')

    def update(self, obj):
        self.data = obj.data

my_class2 = MyClass2()
my_class2.update(my_class)  # Remember to unpickle the class object
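For completeness, here is a rough sketch of the full round trip with the classes above; the storage layer is simplified to an in-memory bytes blob rather than a real DB column, so treat it as an illustration of the idea rather than your actual pipeline:

import pickle

# Serialize the old-style object, as the ETL job would before storing it in the DB.
stored_blob = pickle.dumps(my_class)

# Later, after MyClass2 has been deployed: unpickle the stale object and
# migrate its state into a fresh instance of the new class.
old_obj = pickle.loads(stored_blob)
fresh = MyClass2()
fresh.update(old_obj)

fresh.new_attr()   # prints: This is a new attribute.
print(fresh.data)  # [34]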

Related

Creating objects of derived class in base class - python

I have an abstract class called IDataStream, and it has one method. I create two implementations of this abstract class called IMUDataStream and GPSDataStream. In the future I may add another implementation of the IDataStream abstract class. I have another class called DataVisualizer that visualizes all the data pulled by the different DataStream classes.
If I add another implementation of the IDataStream abstract class in the future, I should not have to modify the DataVisualizer class to visualize the data. Is there a way to create objects of all the derived classes of the IDataStream class, add them to a list, and iterate through the list to call the methods that will give me the data?
Please note that I'm new to Python and design patterns. Trying to learn. This may be a completely dumb question and total madness. I actually have a requirement for this. If this can be achieved through a design pattern, I request the reader to point me to the material. Help much appreciated. Thanks!
#!/usr/bin/env python3
from abc import ABC, abstractmethod

class IDataStream(ABC):
    def get_data(self):
        pass

class IMUDataStream(IDataStream):
    def __init__(self):
        self.__text = "this is IMU data"

    def get_data(self):
        print(self.__text)

class GPSDataStream(IDataStream):
    def __init__(self):
        self.__text = "this is GPS data"

    def get_data(self):
        print(self.__text)

class DataVisualizer:
    def __init__(self):
        # somehow create objects of all derived classes of IDataStream here and call the get_data() function
        # even if I add another derived class in the future. I should not be modifying the code here
        pass
What you're asking for is a way to find all instantiated objects that are in memory and then filter them for a particular class/subclass/parent class/whatever; take a look at this Stack Overflow question regarding how to get all current objects and methods from memory.
That said... any time you have to ask yourself how to find ALL instances of something GLOBALLY in memory, you should stop yourself and ask (which it seems like you did, so kudos): is there a better/easier way?
Most of the time, you'd want to make the data visualizer independent, such that it only consumes the data stream (which is specified during construction); see below:
ds = myDataStream()
vis = myDataVisualizer(ds)
vis.show() # or whatever
or
ds = myDataStream()
vis = myDataVisualizer()
vis.show(ds)
If you want your data visualizer to be data-agnostic at runtime (e.g. with data coming from multiple sources), then you have a couple of choices: add methods for adding and removing data sources, or link them together using something like the producer-consumer pattern with Queues and Processes (this is how I do it); a sketch follows below.
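A minimal sketch of that producer-consumer wiring (hypothetical names, not the answerer's actual code): each stream pushes its readings onto a shared Queue, and the visualizer only ever reads from the queue, so it never has to know which concrete streams exist.

from multiprocessing import Process, Queue

def imu_producer(queue):
    # Each data source pushes its readings onto the shared queue.
    for i in range(3):
        queue.put(f"IMU sample {i}")
    queue.put(None)  # sentinel: this producer is done

def visualizer_consumer(queue, n_producers):
    # The visualizer only knows about the queue, not the concrete streams.
    finished = 0
    while finished < n_producers:
        item = queue.get()
        if item is None:
            finished += 1
        else:
            print("visualizing:", item)

if __name__ == "__main__":
    q = Queue()
    producers = [Process(target=imu_producer, args=(q,))]
    consumer = Process(target=visualizer_consumer, args=(q, len(producers)))
    for p in producers:
        p.start()
    consumer.start()
    for p in producers:
        p.join()
    consumer.join()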
BUT, if you really must manage your own memory entirely (like through a map, or heap, or whatever), then there are design patterns that can help you (a rough factory-style sketch follows this list):
Factory
Abstract Factory
Decorator
Or maybe some other one, look at the catalog at refactoring.guru
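As a rough illustration of the factory idea (a hypothetical registry-based sketch, not the OP's code): concrete stream classes register themselves with the base class at definition time, so the visualizer only ever asks the factory for "all streams".

from abc import ABC, abstractmethod

class IDataStream(ABC):
    _registry = []  # concrete stream classes register themselves here

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        IDataStream._registry.append(cls)

    @abstractmethod
    def get_data(self):
        ...

    @classmethod
    def create_all(cls):
        # Factory method: instantiate every registered concrete stream.
        return [stream_cls() for stream_cls in cls._registry]

class IMUDataStream(IDataStream):
    def get_data(self):
        return "this is IMU data"

class GPSDataStream(IDataStream):
    def get_data(self):
        return "this is GPS data"

# The visualizer never names the concrete classes:
for stream in IDataStream.create_all():
    print(stream.get_data())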
First, you probably want method get_data to return data rather than print it (otherwise it is doing its own visualization). This may do what you want. The following code figures out all the subclasses of IDataStream, instantiates each class that is not abstract, calls get_data on the instance, and collects the return values in a list:
#!/usr/bin/env python3
from abc import ABC, abstractmethod

class IDataStream(ABC):
    @abstractmethod  # you probably meant to add this
    def get_data(self):
        pass

class IMUDataStream(IDataStream):
    def __init__(self):
        self.__text = "this is IMU data"

    def get_data(self):
        return self.__text

class GPSDataStream(IDataStream):
    def __init__(self):
        self.__text = "this is GPS data"

    def get_data(self):
        return self.__text

def is_abstract(cls):
    return bool(getattr(cls, "__abstractmethods__", False))

def get_all_non_abstract_subclasses(cls):
    all_subclasses = []
    for subclass in cls.__subclasses__():
        if not is_abstract(subclass):
            all_subclasses.append(subclass)
        all_subclasses.extend(get_all_non_abstract_subclasses(subclass))
    return all_subclasses

class DataVisualizer:
    def __init__(self):
        data = [cls().get_data() for cls in get_all_non_abstract_subclasses(IDataStream)]
        print(data)

dv = DataVisualizer()
Prints:
['this is IMU data', 'this is GPS data']

Method parameter or instance attribute?

Sometimes when I am designing a new method for a class that needs to act on a certain variable, I can't decide whether it's better to pass this variable as a method parameter or to save it as an instance attribute and just use it inside the method. What are the advantages/disadvantages of both approaches?
class A:
    def __init__(self, data):
        self.data = data

    def my_method(self):
        # does something with self.data
        ...

Or

class B:
    def my_method(self, data):
        # does something with data
        ...
It depends on all the other things the class may do.
Most generally, what is the abstraction that your class encapsulates?
Does it need data for lots of operations, or only this one? Will data change? If data is the "point" of this class, then it should probably be set in __init__, but if the method merely uses data to act on the object, then probably not.
We need to know more....
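As a rough illustration of the rule of thumb above (hypothetical classes, not from the question): in the first class the data is what the object is about, in the second the data is just something a method happens to operate on.

class Report:
    # The data is the "point" of the object: fixed at construction,
    # and every method operates on it.
    def __init__(self, records):
        self.records = records

    def total(self):
        return sum(self.records)


class Formatter:
    # The object just acts on whatever value it is handed,
    # so the data arrives as a method parameter instead.
    def format(self, value):
        return f"{value:,.2f}"


print(Report([1, 2, 3]).total())   # 6
print(Formatter().format(1234.5))  # 1,234.50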

Python class variables or @property

I am writing a Python class to store data, and then another class will create an instance of that class to print different variables. Some class variables require a lot of formatting, which may take multiple lines of code to get them into their "final state".
Is it bad practice to just access the variables from outside the class with this structure?
class Data():
    def __init__(self):
        self.data = "data"

Or is it better practice to use a @property method to access variables?

class Data:
    @property
    def data(self):
        return "data"
Be careful, if you do:
class Data:
    @property
    def data(self):
        return "data"

d = Data()
d.data = "try to modify data"

will give you the error:
AttributeError: can't set attribute
And as I see in your question, you want to be able to transform the data until it reaches its final state, so go for the other option:

class Data2():
    def __init__(self):
        self.data = "data"

d2 = Data2()
d2.data = "now I can be modified"

or modify the previous one:

class Data:
    def __init__(self):
        self._data = "data"

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, value):
        self._data = value

d = Data()
d.data = "now I can be modified"
Common Practice
The normal practice in Python is to expose the attributes directly. A property can be added later if additional actions are required when getting or setting.
Most of the modules in the standard library follow this practice. Public variables (not prefixed with an underscore) typically don't use property() unless there is a specific reason (such as making an attribute read-only).
Rationale
Normal attribute access (without property) is simple to implement, simple to understand, and runs very fast.
The possibility of using property() afterwards means that we don't have to practice defensive programming; we can avoid prematurely implementing getters and setters, which bloats the code and makes accesses slower.
Basically, you can hide a lot of complexity in the property and make it look like an attribute, which improves code readability.
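For example (a hypothetical sketch, loosely modelled on the multi-line formatting mentioned in the question), the heavy formatting can live behind a property while callers still see a plain attribute access:

class Data:
    def __init__(self, raw):
        self._raw = raw

    @property
    def data(self):
        # All the multi-step formatting lives here; callers just read .data.
        cleaned = self._raw.strip().lower()
        return cleaned.replace(" ", "_")

d = Data("  Some Raw Value ")
print(d.data)  # some_raw_value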
Also, you need to understand the difference between property and attribute.
Please refer What's the difference between a Python "property" and "attribute"?

Is usage of isinstance justified when calling different methods based on the passed object?

I'm working on a little tool that is able to read a file's low-level data, i.e. mappings, etc., and store the results in a sqlite DB, using Python's builtin sqlite API.
For parsed file data I have 3 classes:
class GenericFile:  # general file class
    # bunch of methods here
    ...

class SomeFileObject_A:  # low level class for storing objects of kind SomeFileObject_A
    # bunch of methods here
    ...

class SomeFileObject_B:  # low level class for storing objects of kind SomeFileObject_B
    # bunch of methods here
    ...
sqlite interface is implemented as a separate class:
class Database:
    def insert(self, object_to_insert):
        ...

    def _insert_generic_file_object(self, object_to_insert):
        ...

    def _insert_file_object_a(self, object_to_insert):
        ...

    def _insert_file_object_b(self, object_to_insert):
        ...

    # bunch of sqlite related methods
When I need to insert some object into the DB, I use db.insert(object).
Now, I thought it could be a good idea to use isinstance in my insert method, so it takes care of any inserted object without needing to explicitly call the suitable method for each object, which looks more elegant.
But after reading more about isinstance, I'm beginning to suspect that my design is not so good.
Here is the implementation of the generic insert method:
class Database:
    def insert(self, object_to_insert):
        self._logger.info("inserting %s object", object_to_insert.__class__.__name__)
        if isinstance(object_to_insert, GenericFile):
            self._insert_generic_file_object(object_to_insert)
        elif isinstance(object_to_insert, SomeFileObject_A):
            self._insert_file_object_a(object_to_insert)
        elif isinstance(object_to_insert, SomeFileObject_B):
            self._insert_file_object_b(object_to_insert)
        else:
            self._logger.error("Insert Failed. Bad object type %s" % type(object_to_insert))
            raise Exception
        self._db_connection.commit()
So, should isinstance be avoided in my case, and if so, what is a better solution here?
Thanks
One of the basic principles in OO is to replace explicit switches with polymorphic dispatch. In your case, the solution would be to use double dispatch, so that it's the file object's responsibility to know which Database method to call, i.e.:
class GenericFile:  # general file class
    # bunch of methods here
    ...

    def insert(self, db):
        return db.insert_generic_file_object(self)

class SomeFileObject_A:  # low level class for storing objects of kind SomeFileObject_A
    # bunch of methods here
    ...

    def insert(self, db):
        return db.insert_file_object_a(self)

class SomeFileObject_B:  # low level class for storing objects of kind SomeFileObject_B
    # bunch of methods here
    ...

    def insert(self, db):
        return db.insert_file_object_b(self)

class Database:
    def insert(self, obj):
        return obj.insert(self)
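With that wiring in place, the call site stays a single db.insert(obj) regardless of the concrete type. A hypothetical usage sketch (it assumes Database also defines the three public insert_* methods that the file classes call, e.g. by renaming the _insert_* helpers):

db = Database()
db.insert(GenericFile())       # dispatches to db.insert_generic_file_object
db.insert(SomeFileObject_A())  # dispatches to db.insert_file_object_a
db.insert(SomeFileObject_B())  # dispatches to db.insert_file_object_b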

update metadata in objects when changing shared subobjects

I need advice from experienced programmers, as I am failing to wrap my head around this.
I have the following data structure.
class obj(object):
    def __init__(self, data=[], meta=[]):
        self.data = data
        self.meta = meta

class subobj(object):
    def __init__(self, data=[]):
        self.data = data
Say, I'm creating the following objects from it.
sub1=subobj([0,1,5])
sub2=subobj([0,1,6])
sub3=subobj([0,1,7])
objA=obj(data=[sub1,sub2], meta=[3,3])
objB=obj(data=[sub3,sub1], meta=[3,3])
Now I am changing sub1 by operating on the second object, as well as its metadata. For simplicity, I'm writing this via the objects' attributes directly, without setters/getters:
objB.data[1].data+=[10,11]
objB.meta[1]=5
Now, objA.data[0] has (obviously) changed, but objA.meta[0] stayed the same. I want some func(objB.meta[1]) to be triggered right after the value in objA.data changes (caused via objB.data), and to change objA.meta as well. Important: this func() uses the metadata of the changed sub1 from objB.
I simply don't know how to make every obj aware of all the other objs that share the same subobj. With that knowledge, I could have func() triggered. I would appreciate any hints.
Notes:
I want to pass those subobjs around between objs without metadata and let them be changed by those objs. Metadata is supposed to store information that is defined within objs, not subobjs. Hence, the value of func() depends on the obj itself, but its definition is the same for all instances of the class obj.
For simplicity, this func(metadata) can be something like multiply3(metadata).
I will have thousands of those objects, so I am looking for a rather abstract solution that is not constrained by a small number of objects.
Is that possible in the current design? I am lost as to how to implement this.
Assuming that objs' data property can only contain subobjs and can never change, this code should work.
class obj(object):
    __slots__ = ("data", "meta")

    def __init__(self, data=(), meta=None):
        meta = meta or []
        self.data = data
        self.meta = meta
        for so in data:
            so._objs_using.append(self)

class subobj(object):
    __slots__ = ("_data", "_objs_using")

    def __init__(self, data=()):
        self._objs_using = []
        self._data = data

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, val):
        self._data = val
        for obj in self._objs_using:
            metadata_changed(obj.meta)
I called the function that you want to run on the metadata metadata_changed. This works by keeping track of the list of objs each subobj is used by, and then creating a special data property that notifies each of those objs whenever the data changes.
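A hypothetical end-to-end check, reusing the question's setup, with metadata_changed stubbed out as a print (the real function would recompute the metadata, e.g. the multiply3 mentioned in the notes):

def metadata_changed(meta):
    # Stand-in for the real recomputation of the metadata.
    print("metadata needs updating:", meta)

sub1 = subobj([0, 1, 5])
sub2 = subobj([0, 1, 6])
sub3 = subobj([0, 1, 7])
objA = obj(data=[sub1, sub2], meta=[3, 3])
objB = obj(data=[sub3, sub1], meta=[3, 3])

# Changing sub1 through objB fires the data setter, which notifies every
# obj that shares sub1 -- both objA and objB here.
objB.data[1].data += [10, 11]
# metadata needs updating: [3, 3]    <- objA.meta
# metadata needs updating: [3, 3]    <- objB.meta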
