Using a classmethod to retrieve or load data on init - python

I have a time-consuming database lookup (downloads data from online) which I want to avoid doing constantly, so I would like to pickle the data if I don't already have it.
This data is being used by the class which has this classmethod.
Is this a 'proper' or expected use of a classmethod? I feel like I could fairly easily refactor it to be an instance method but it feels like it should be a classmethod due to what it's doing. Below is a mockup of the relevant parts of the class.
import os
import pickle

class Example:
    def __init__(self):
        self.records = self.get_records()

    @classmethod
    def get_records(cls):
        """
        If the records aren't already downloaded from the server,
        get them and add to a pickle file.
        Otherwise, just load the pickle file.
        """
        if not os.path.exists('records.pkl'):
            # Slow request
            records = get_from_server()
            with open('records.pkl', 'wb') as rec_file:
                pickle.dump(records, rec_file)
        else:
            with open('records.pkl', 'rb') as rec_file:
                records = pickle.load(rec_file)
        return records

    def use_records(self):
        for item in self.records:
            ...
Is there also an easy way to refactor this so that I can retrieve the data on request, even if the pickle file exists? Is that as simple as just adding another argument to the classmethod?
Thanks for any help.
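One plausible shape for that refactor (a sketch with a hypothetical refresh flag, not from the original code):

@classmethod
def get_records(cls, refresh=False):
    # force a fresh download when refresh=True, otherwise fall back to the cached pickle
    if refresh or not os.path.exists('records.pkl'):
        records = get_from_server()
        with open('records.pkl', 'wb') as rec_file:
            pickle.dump(records, rec_file)
    else:
        with open('records.pkl', 'rb') as rec_file:
            records = pickle.load(rec_file)
    return records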

Related

Pickling an object containing a large numpy array

I'm pickling an object which has the following structure:
obj
|---metadata
|---large numpy array
I'd like to be able to access the metadata. However, if I pickle.load() the object and iterate over a directory (say, because I'm looking for some specific metadata to determine which one to return), then it gets lengthy. I'm guessing pickle wants to load, well, the whole object.
Is there a way to access only the top-level metadata of the object without having to load the whole thing?
I thought about maintaining an index, but then it means I have to implement the logic of it and keep it current, which I'd rather avoid if there's a simpler solution.
Yes, ordinary pickle will load everything. In Python 3.8, the new pickle protocol allows one to control how objects are serialized and to use a side channel for the large part of the data, but that is mainly useful when using pickle for inter-process communication. It would require a custom implementation of the pickling for your objects.
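For reference, that protocol 5 side channel looks roughly like this for a plain numpy array (a sketch; assumes Python 3.8+ and a recent numpy - your own classes would need a custom __reduce_ex__ to benefit the same way):

import pickle
import numpy as np

arr = np.arange(1_000_000)
buffers = []
# the raw array bytes go to the buffer_callback instead of the main pickle stream
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)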
However, even with older Python versions it is possible to customize how to serialize your objects to disk.
For example, instead of having your arrays as ordinary members of your objects, you could have them "living" in another data structure - say, a dictionary, and implement data-access to your arrays indirectly, through that dictionary.
In Python versions prior to 3.8, this will require you to "cheat" on the pickle customization, in the sense that, upon serialization of your object, the custom method should save the separate data as a side effect. But other than that, it should be straightforward.
In more concrete terms, when you have something like:
import numpy as np

class MyObject:
    def __init__(self, data: np.ndarray, meta_data: any):
        self.data = data
        self.meta_data = meta_data
Augment it this way - you should still be able to do whatever you do with your objects, but pickling now will only pickle the metadata - the numpy arrays will "live" in a separate data structure that won't be automatically serialized:
import numpy as np
from uuid import uuid4

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]

class MyObject:
    data = SeparateSerializationDescriptor()

    def __init__(self, data: np.ndarray, meta_data: any):
        self.data = data
        self.meta_data = meta_data
Really - that is all that is needed to customize the attribute access: all ordinary uses of the self.data attribute will retrieve the original numpy array seamlessly - self.data[0:10] will just work. But pickle, at this point, will retrieve the contents of the instance's __dict__ - which only contains a key to the real data in the "vault" object.
Besides allowing you to serialize the metadata and data to separate files, it also allows you fine-grained control of the data in memory, by manipulating the VAULT.
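For example, a rough usage sketch of the class above (names are illustrative, not from the original answer):

import pickle
import numpy as np

obj = MyObject(np.arange(1000), {"label": "example"})
obj.data[0:10]                 # transparent access through the descriptor
payload = pickle.dumps(obj)    # small payload: only meta_data and the uuid key get pickled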
And now, customize the pickling of the class so that it will save the data separately to disk, and retrieve it on reading. On Python 3.8, this probably can be done "within the rules" (I will take the time, since I am answering this, to take a look at that). For traditional pickle, we "break the rules" in that we save the extra data to disk, and load it, as side effects of serialization.
Actually, it just occurred to me that customizing the methods used directly by the pickle protocol, like __reduce_ex__ and __setstate__, while it would work, would, again, automatically unpickle the whole object from disk.
A way to go is: upon serialization, save the full data in a separate file, and create some more metadata so that the array file can be found. Upon deserialization, always load only the metadata - and build into the descriptor above a mechanism to lazily load the arrays as needed.
So, we provide a Mixin class, and its dump method should be called instead of pickle.dump, so the data is written to separate files. To unpickle the object, use Python's pickle.load normally: it will retrieve only the "normal" attributes of the object. The object's .load() method can then be called explicitly to load all the arrays, or it will be called automatically when the data is first accessed, in a lazy way:
import pathlib
from uuid import uuid4
import pickle

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return VAULT[instance.__dict__[self.name]]
        except KeyError:
            # attempt to silently load missing data from disk upon first array access after unpickling:
            instance.load()
            return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]

class SeparateSerializationMixin:
    def _iter_descriptors(self, data_dir):
        for attr in self.__class__.__dict__.values():
            if not isinstance(attr, SeparateSerializationDescriptor):
                continue
            id = self.__dict__[attr.name]
            if not data_dir:
                # use saved absolute path instead of the passed-in folder
                data_path = pathlib.Path(self.__dict__[attr.name + "_path"])
            else:
                data_path = data_dir / (id + ".pickle")
            yield attr, id, data_path

    def dump(self, file, protocol=None, **kwargs):
        data_dir = pathlib.Path(file.name).absolute().parent

        # Annotate paths and pickle all numpy arrays into separate files:
        for attr, id, data_path in self._iter_descriptors(data_dir):
            self.__dict__[attr.name + "_path"] = str(data_path)
            pickle.dump(getattr(self, attr.name), data_path.open("wb"), protocol=protocol)

        # Pickle the metadata as originally intended:
        pickle.dump(self, file, protocol, **kwargs)

    def load(self, data_dir=None):
        """Load all saved arrays associated with this object.

        If data_dir is not passed, the absolute path used on pickling is used. Otherwise
        the files are searched by their name in the given folder.
        """
        if data_dir:
            data_dir = pathlib.Path(data_dir)
        for attr, id, data_path in self._iter_descriptors(data_dir):
            VAULT[id] = pickle.load(data_path.open("rb"))

    def __del__(self):
        for attr, id, path in self._iter_descriptors(None):
            VAULT.pop(id, None)
        try:
            super().__del__()
        except AttributeError:
            pass

class MyObject(SeparateSerializationMixin):
    data = SeparateSerializationDescriptor()

    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data
Of course this is not perfect, and there are likely corner cases.
I included some safeguards in case the data-files are moved to another directory - but I did not test that.
Other than that, using these in an interactive session here went smoothly, and I could create a MyObject instance that would be pickled separately from its data attribute, which would then be loaded just when needed on unpickling.
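Roughly, such a session might look like this (file names are illustrative, and this is a sketch rather than tested code):

import pickle
import numpy as np

obj = MyObject(np.arange(1_000_000), {"label": "example"})
with open("myobject.pickle", "wb") as f:
    obj.dump(f)                # metadata goes here, the array goes to a side file

with open("myobject.pickle", "rb") as f:
    restored = pickle.load(f)  # only the metadata is read back immediately
restored.data[:10]             # in a fresh session this would lazily load the array file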
As for the suggestion to just "keep stuff in a database" - some of the code here can be used just as well if your objects live in a database and you prefer to keep the raw data on the filesystem rather than in a "blob column" in the database.

How to control access to a file from multiple processes in python

I am stuck finding a solution to the multiprocessing issue below.
I have a class Record in the record.py module. The responsibility of the Record class is to process the input data and save it into a JSON file.
The Record class has method put() to update JSON file.
The record class is initialized in the class decorator. The decorator is applied over most of the classes of various sub-modules.
Decorator extracts information of each method it decorates and sends data to put() method of Record class.
put() method of Record class then updates the JSON file.
The problem is that when the different processes run, each process creates its own instance of the Record object, and the JSON data gets corrupted since multiple processes try to update the same JSON file.
Also, each process may have threads running that try to access and update the same JSON file.
Please let me know how I can resolve this problem.
from multiprocessing import Pool

class Record():
    def put(self, data):
        # read json file
        # update json file with new data
        # close json file
        pass

def decorate_method(theMethod):
    # Extract method details
    data = extract_method_details(theMethod)
    # Initialize record object
    rec = Record()
    rec.put(data)

class ClassDeco(cls):
    # This class decorator decorates all methods of the target class
    for method in cls():  # <---- This is just a pseudo codebase
        decorate_method(method)

@ClassDeco
class Test():
    def __init__(self):
        pass

    def run(self, a):
        # some function calls
        ...

if __name__ == "__main__":
    t = Test()
    p = Pool(processes=len(process_pool))  # process_pool is not defined in this snippet
    p.apply_async(t.run, args=(10,))
    p.apply_async(t.run, args=(20,))
    p.close()
You should lock the file prior to reading and writing it. Check another question related to file locking in python: Locking a file in Python
Have you ever heard about the critical section concept in multiprocessing/multithreading programming?
If so, think about using multiprocessing locks to allow only one process at a time to write to the file.
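A minimal sketch of that idea (names are illustrative, not the original Record code): create one lock in the parent process, make sure every worker uses that same lock, and wrap the read-modify-write of the JSON file in it.

import json
from multiprocessing import Lock

write_lock = Lock()   # create once in the parent; workers must inherit or receive this same lock

def put(data, path="record.json"):
    with write_lock:                          # critical section: one writer at a time
        try:
            with open(path) as f:
                current = json.load(f)
        except (FileNotFoundError, ValueError):
            current = {}
        current.update(data)
        with open(path, "w") as f:
            json.dump(current, f)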

Saving a python object along with an external function used by it

I have a class which takes a function as one of its init arguments:
class A(object):
    def __init__(self, some_var, some_function):
        self.some_function = some_function
        self.x = self.some_function(some_var)
I can create a function, pass it to an instance of the object, and save it using pickle:
import pickle as pcl
def some_function(x):
    return x*x
a = A(some_var=2, some_function=some_function)
pcl.dump(a, open('a_obj.p', 'wb'))
Now I want to open this object in some other code. However, I don't want to include the def some_function(x): code in each file which uses this specific saved object.
So, what's the best Python practice to pass an external function as an argument to a Python object and then save the object, such that the external function is "implemented" inside the object instance, so it doesn't have to be written in every file which uses the saved object?
Edit: Let me clarify, I don't want to save the function. I want to save only the object. Is there any elegant way to "combine" the external function inside the object so I can pass it as an argument and then it "becomes" part of this object's instance?
The easiest way to do what you are asking is with the dill module.
You can dump an instance of an object like this:
import dill

def f(x):
    return x*x

class A(object):
    def __init__(self, some_var, some_function):
        self.some_function = some_function
        self.x = self.some_function(some_var)

a = A(2, f)
a.x
# returns:
# 4

with open('a.dill', 'wb') as fp:
    dill.dump(a, fp)
Then, in a new instance of Python, you can load it back in using:
import dill
with open('a.dill', 'rb') as fp:
    a = dill.load(fp)
a.x
# returns:
# 4
a.some_function(4)
# returns:
# 16
If you really, really wanted to do this, it would be possible with the marshal module, which pickle is based on. Pickling functions is intentionally not possible for security reasons.
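For completeness, the marshal route looks roughly like this (a sketch; marshal only captures the compiled code object, so default arguments, closures, and the function's original globals are not preserved):

import marshal
import types

def f(x):
    return x * x

blob = marshal.dumps(f.__code__)      # serialize just the compiled code object
g = types.FunctionType(marshal.loads(blob), globals(), "g")
g(4)                                  # 16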
There is also a lot of info you would probably find useful in this question:
Is there an easy way to pickle a python function (or otherwise serialize its code)?

New file for each instance creation

I have a class that needs to write to a file. My program creates multiple instances of this class and I want to avoid write collisions. I tried to avoid them by using a static variable so each instance has a unique file name, i.e.:
class Foo:
    instance_count = 1

    @staticmethod
    def make():
        file_name = str(Foo.instance_count) + '-' + 'file.foo'
        Foo.instance_count += 1
        return Foo(file_name)

    def __init__(self, fname):
        self.fname = fname
This works to some extent but doesn't work in cases where the class may be created in parallel. How can I make this more robust?
EDIT:
My use case has this class being created in my app, which is served by gunicorn. So I launch my app with gunicorn with, say, 10 workers, so I can't actually manage the communication between them.
You could make use of something like uuid instead, if a unique name is what you are after.
EDIT:
If you want readable but unique names, I would suggest you look into guarding your increment statement above with a lock, so that only one process is increasing it at any point in time, and perhaps also make the file creation and the increment operation atomic.
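For the uuid suggestion, a minimal sketch could be (hypothetical names):

from uuid import uuid4

class Foo:
    def __init__(self):
        # unique per instance, and safe across gunicorn workers without any coordination
        self.fname = uuid4().hex + "-file.foo"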
Easy, just use another text file to keep the filenames and the number/code you need to identify. You can use JSON, pickle, or just your own format.
In the __init__ function, you can read this file. Then, make a new file based on the information you get.
File example:
File1.txt,1
File2.txt,2
And the __init__ function:
def __init__(self):
    # read the current counter from the bookkeeping file
    counter = int(open('counter.txt', 'r').read()[-2:].strip())
    with open("File%d.txt" % (counter + 1), "w"):
        # do things
        # don't forget to keep the information of your new file
        ...
Really, what I was looking for was a way to avoid write contention. Then I realized: why not just use the logger? It might be a wrong assumption, but I would imagine the logger takes care of locking files for writing. Plus, it flushes on every write, so it meets that requirement. As for speed, the overhead definitely does not affect me in this case.
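What that might look like (only a sketch: stdlib logging handlers are thread-safe within one process, but they do not lock the file across separate processes):

import logging

logger = logging.getLogger("foo_writer")
handler = logging.FileHandler("foo.log")          # flushes after every record
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("one line per write")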
The other solution I found was to use the tempfile module. This creates a unique file for each instantiation of the class.
import tempfile as tf

class Foo:
    @staticmethod
    def make():
        # NamedTemporaryFile returns an open file object with a unique name
        file_name = tf.NamedTemporaryFile('w+', suffix="foofile", dir="foo/dir")
        return Foo(file_name)

    def __init__(self, fname):
        self.fname = fname

Using Python "json" module to make a class serializable

I am working in Python with an Email() class that I would like to extend into a SerializeEmail() class, which simply adds two further methods, .write_email() and .read_email(). I would like this sort of behaviour:
# define email
my_email = SerializeEmail()
my_email.recipients = 'link#hyrule.com'
my_email.subject = 'RE: Master sword'
my_email.body = "Master using it and you can have this."
# write email to file system for hand inspection
my_email.write_email('my_email.txt')
...
# Another script reads in email
my_verified_email = SerializeEmail()
my_verified_email.read_email('my_email.txt')
my_verified_email.send()
I have navigated the json encode/decode process, and I can successfully write my SerializeEmail() object and read it in; however, I can't find a satisfactory way to recreate my object via a SerializeEmail.read_email() call.
class SerializeEmail(Email):

    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    def read_email(self, file_name):
        with open(file_name, "r") as f:
            json.load(f, cls=SerializeEmailJSONDecoder)
The problem here is that the json.load() call in my read_email() method returns an instance of my SerializeEmail object, but doesn't assign that object to the current instance that I'm using to call it. So right now I'd have to do something like this,
another_email = my_verified_email.read_email('my_email.txt')
when what I want is for the call to my_verified_email.read_email() to populate the current instance of my_verified_email with the data from the file. I've tried
self = json.load(f,cls=SerializeEmailJSONDecoder)
but that doesn't work. I could just assign each individual element of my returned object to my "self" object, but that seems ad-hoc and inelegant, and I'm looking for the "right way" to do this, if it exists. Any suggestions? If you think that my whole approach is flawed and recommend a different way of accomplishing this task, please sketch it out for me.
While you could jump through a number of hoops to load serialized content into an existing instance, I wouldn't recommend doing so. It's an unnecessary complication which really gains you nothing; it means that the extra step of creating a dummy instance is required every time you want to load an e-mail from JSON. I'd recommend using either a factory class or a factory method which loads the e-mail from the serialized JSON and returns it as a new instance. My personal preference would be a factory method, which you'd accomplish as follows:
class SerializeEmail(Email):

    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    @staticmethod
    def read_email(file_name):
        with open(file_name, "r") as f:
            return json.load(f, cls=SerializeEmailJSONDecoder)

# You can now create a new instance by simply doing the following:
new_email = SerializeEmail.read_email('my_email.txt')
Note the @staticmethod decorator, which allows you to call the method on the class without any implicit first argument being passed in. Normally factory methods would be @classmethods, but since you're loading the object from JSON, the implicit class argument is unnecessary.
Notice how, with this modification, you don't need to instantiate a SerializeEmail object before you can load another one from JSON. You simply call the method directly on the class and get the desired behavior.
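The encoder/decoder classes referenced in the question are not shown; one plausible shape for them, assuming the e-mail's state lives in plain instance attributes, might be:

import json

class SerializeEmailJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, SerializeEmail):
            return obj.__dict__                 # serialize the plain instance attributes
        return super().default(obj)

class SerializeEmailJSONDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        super().__init__(object_hook=self._to_email, **kwargs)

    @staticmethod
    def _to_email(d):
        email = SerializeEmail()
        email.__dict__.update(d)                # rebuild the instance from the stored attributes
        return email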
