Pickling an object containing a large numpy array - python

I'm pickling an object which has the following structure:
obj
|---metadata
|---large numpy array
I'd like to be able to access the metadata. However, if I pickle.load() each object while iterating over a directory (say because I'm looking for some specific metadata to determine which one to return), it gets lengthy. I'm guessing pickle wants to load, well, the whole object.
Is there a way to access only the top-level metadata of the object without having to load the whole thing?
I thought about maintaining an index, but then it means I have to implement the logic for it and keep it current, which I'd rather avoid if there's a simpler solution.

Yes, ordinary pickle will load everything. In Python 3.8, the new pickle protocol (protocol 5) allows one to control how objects are serialized and to use a side channel for the large part of the data, but that is mainly useful when using pickle for inter-process communication. It would require a custom implementation of pickling for your objects.
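For context, a minimal sketch of that protocol-5 side channel (the Wrapper class and array size are made up for the demo; buffer_callback/buffers are the standard pickle parameters in 3.8+, and recent numpy versions emit out-of-band buffers for contiguous arrays):
import pickle
import numpy as np

class Wrapper:
    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data

obj = Wrapper(np.zeros(1_000_000), {"name": "example"})

# Serialization: large contiguous buffers are handed to buffer_callback
# instead of being copied into the pickle stream.
buffers = []
payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)

# Deserialization: the same buffers must be supplied back.
restored = pickle.loads(payload, buffers=buffers)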
However, even with older Python versions it is possible to customize how to serialize your objects to disk.
For example, instead of having your arrays as ordinary members of your objects, you could have them "living" in another data structure - say, a dictionary, and implement data-access to your arrays indirectly, through that dictionary.
In Python versions earlier than 3.8, this will require you to "cheat" on the pickle customization, in the sense that, upon serialization of your object, the custom method should save the separate data as a side effect. But other than that, it should be straightforward.
In more concrete terms, when you have something like:
import numpy as np

class MyObject:
    def __init__(self, data: np.ndarray, meta_data: any):
        self.data = data
        self.meta_data = meta_data
Augment it this way - you should still be able to do whatever you do with your objects, but pickling will now only pickle the metadata - the numpy arrays will "live" in a separate data structure that won't be automatically serialized:
from uuid import uuid4

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]
class MyObject:
    data = SeparateSerializationDescriptor()

    def __init__(self, data: np.ndarray, meta_data: any):
        self.data = data
        self.meta_data = meta_data
Really - that is all that is needed to customize the attribute access: all ordinary uses of the self.data attribute will retrieve the original numpy array seamlessly - self.data[0:10] will just work. Pickle, at this point, will only retrieve the contents of the instance's __dict__ - which contains just a key to the real data in the "vault" object.
Besides allowing you to serialize the metadata and data in separate files, this also gives you fine-grained control of the data in memory, by manipulating the VAULT dictionary.
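For illustration, a quick check of that claim (array and metadata values are arbitrary): the pickled payload holds only the metadata and the generated key, while the array itself stays in VAULT:
import pickle
import numpy as np

obj = MyObject(np.arange(1_000_000), meta_data={"label": "example"})

payload = pickle.dumps(obj)   # small: only meta_data and a uuid string end up in the stream
print(len(payload))           # a few hundred bytes, not several megabytes
print(obj.data[:5])           # the descriptor still resolves the real array from VAULT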
And now, customize the pickling of the class so that it saves the data separately to disk and retrieves it on reading. On Python 3.8 this can probably be done "within the rules" (I will take the time, since I am answering this, to take a look at that). For traditional pickle, we "break the rules": we save the extra data to disk, and load it, as side effects of serialization.
Actually, it just occurred to me that customizing the methods used directly by the pickle protocol, such as __reduce_ex__ and __setstate__, while it would work, would again automatically unpickle the whole object from disk.
A way to go is: upon serialization, save the full data in a separate file, and create some more metadata so that the array file can be found. Upon deserialization, always load only the metadata, and build into the descriptor above a mechanism to lazily load the arrays as needed.
So, we provide a Mixin class whose dump method should be called instead of pickle.dump, so that the data is written to separate files. To unpickle the object, use Python's pickle.load normally: it will retrieve only the "normal" attributes of the object. The object's .load() method can then be called explicitly to load all the arrays, or it will be called automatically, lazily, when the data is first accessed:
import pathlib
from uuid import uuid4
import pickle

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return VAULT[instance.__dict__[self.name]]
        except KeyError:
            # attempt to silently load missing data from disk upon first array access after unpickling:
            instance.load()
            return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]

class SeparateSerializationMixin:
    def _iter_descriptors(self, data_dir):
        for attr in self.__class__.__dict__.values():
            if not isinstance(attr, SeparateSerializationDescriptor):
                continue
            id = self.__dict__[attr.name]
            if not data_dir:
                # use saved absolute path instead of passed-in folder
                data_path = pathlib.Path(self.__dict__[attr.name + "_path"])
            else:
                data_path = data_dir / (id + ".pickle")
            yield attr, id, data_path

    def dump(self, file, protocol=None, **kwargs):
        data_dir = pathlib.Path(file.name).absolute().parent
        # Annotate paths and pickle all numpy arrays into separate files:
        for attr, id, data_path in self._iter_descriptors(data_dir):
            self.__dict__[attr.name + "_path"] = str(data_path)
            pickle.dump(getattr(self, attr.name), data_path.open("wb"), protocol=protocol)
        # Pickle the metadata as originally intended:
        pickle.dump(self, file, protocol, **kwargs)

    def load(self, data_dir=None):
        """Load all saved arrays associated with this object.

        If data_dir is not passed, the absolute path recorded at pickling time is used.
        Otherwise the files are searched for by name in the given folder.
        """
        if data_dir:
            data_dir = pathlib.Path(data_dir)
        for attr, id, data_path in self._iter_descriptors(data_dir):
            VAULT[id] = pickle.load(data_path.open("rb"))

    def __del__(self):
        for attr, id, path in self._iter_descriptors(None):
            VAULT.pop(id, None)
        try:
            super().__del__()
        except AttributeError:
            pass

class MyObject(SeparateSerializationMixin):
    data = SeparateSerializationDescriptor()

    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data
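A usage sketch under these definitions (file names are arbitrary): dump through the mixin so the array lands in its own file, unpickle normally, and only touch .data when the array is actually needed:
import pickle
import numpy as np

obj = MyObject(np.arange(1_000_000), meta_data="some metadata")
with open("obj.pickle", "wb") as f:
    obj.dump(f)               # also writes <uuid>.pickle next to obj.pickle

# later, possibly in another process that has these classes imported:
with open("obj.pickle", "rb") as f:
    obj2 = pickle.load(f)     # fast: loads only the metadata and the array's path
print(obj2.meta_data)         # available immediately
print(obj2.data[:5])          # first access triggers load() for the array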
Of course this is not perfect, and there are likely corner cases. I included some safeguards in case the data files are moved to another directory, but I did not test that. Other than that, using these classes in an interactive session here went smoothly, and I could create a MyObject instance that was pickled separately from its data attribute, which was then loaded just when needed after unpickling.
As for the suggestion to just "keep stuff in a database": some of the code here can be used just as well if your objects live in a database and you prefer to keep the raw data on the filesystem rather than in a "blob column" in the database.

Related

Stale pickling of class objects - best practice

I have a basic ETL workflow that grabs data from an API, builds a class object, and performs various operations which result in storing the data in the applicable tables in a DB, but ultimately I pickle the object and store that in the DB as well. The reason for pickling is to save these events and reuse the data for new features.
The problem is how best to implement adding attributes for new features. Of course when a new attribute is added, pickled objects are now stale and need to be checked (AttributeError, etc). This is simple with one or two changes but over time it seems like it will be problematic.
Any design tips? Pythonic best practices for inherently updating pickled objects? Seems like a common problem in database design?!
You can define an update method for the class. The update method takes one object of the same class (but an older version, as you specified) and copies all of its data onto the new class object.
Here is an example:
class MyClass:
    def __init__(self):
        self.data = []

    def add_data(self, data):
        self.data.append(data)

    def update(self, obj):
        self.data = obj.data

my_class = MyClass()
my_class.add_data(34)  # Class object then gets pickled...

class MyClass2:
    def __init__(self):
        self.data = []

    def add_data(self, data):
        self.data.append(data)

    def new_attr(self):
        print('This is a new attribute.')

    def update(self, obj):
        self.data = obj.data

my_class2 = MyClass2()
my_class2.update(my_class)  # Remember to unpickle the class object

Using a classmethod to retrieve or load data on init

I have a time-consuming database lookup (downloads data from online) which I want to avoid doing constantly, so I would like to pickle the data if I don't already have it.
This data is being used by the class which has this classmethod.
Is this a 'proper' or expected use of a classmethod? I feel like I could fairly easily refactor it to be an instance method but it feels like it should be a classmethod due to what it's doing. Below is a mockup of the relevant parts of the class.
import os
import pickle

class Example:
    def __init__(self):
        self.records = self.get_records()

    @classmethod
    def get_records(cls):
        """
        If the records aren't already downloaded from the server,
        get them and add to a pickle file.
        Otherwise, just load the pickle file.
        """
        if not os.path.exists('records.pkl'):
            # Slow request
            records = get_from_server()
            with open('records.pkl', 'wb') as rec_file:
                pickle.dump(records, rec_file)
        else:
            with open('records.pkl', 'rb') as rec_file:
                records = pickle.load(rec_file)
        return records

    def use_records(self):
        for item in self.records:
            ...
Is there also an easy way to refactor this so that I can retrieve the data on request, even if the pickle file exists? Is that as simple as just adding another argument to the classmethod?
Thanks for any help.
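Not an authoritative answer, but yes, it can be as simple as a keyword argument; a minimal sketch (the refresh flag and its default are assumptions added here, and get_from_server is the same undefined helper as in the question):
import os
import pickle

class Example:
    def __init__(self, refresh=False):
        self.records = self.get_records(refresh=refresh)

    @classmethod
    def get_records(cls, refresh=False):
        """Fetch from the server when forced or when no cache exists,
        otherwise load the pickled cache."""
        if refresh or not os.path.exists('records.pkl'):
            records = get_from_server()   # slow request, defined elsewhere
            with open('records.pkl', 'wb') as rec_file:
                pickle.dump(records, rec_file)
        else:
            with open('records.pkl', 'rb') as rec_file:
                records = pickle.load(rec_file)
        return records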

Avoid global variables for unpicklable shared state among multiprocessing.Pool workers

I frequently find myself writing programs in Python that construct a large (megabytes) read-only data structure and then use that data structure to analyze a very large (hundreds of megabytes in total) list of small records. Each of the records can be analyzed in parallel, so a natural pattern is to set up the read-only data structure and assign it to a global variable, then create a multiprocessing.Pool (which implicitly copies the data structure into each worker process, via fork) and then use imap_unordered to crunch the records in parallel. The skeleton of this pattern tends to look like this:
classifier = None

def classify_row(row):
    return classifier.classify(row)

def classify(classifier_spec, data_file):
    global classifier
    try:
        classifier = Classifier(classifier_spec)
        with open(data_file, "rt") as fp, \
             multiprocessing.Pool() as pool:
            rd = csv.DictReader(fp)
            yield from pool.imap_unordered(classify_row, rd)
    finally:
        classifier = None
I'm not happy with this because of the global variable and the implicit coupling between classify and classify_row. Ideally, I would like to write
def classify(classifier_spec, data_file):
    classifier = Classifier(classifier_spec)
    with open(data_file, "rt") as fp, \
         multiprocessing.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)
but this does not work, because the Classifier object usually contains objects which cannot be pickled (because they are defined by extension modules whose authors didn't care about that); I have also read that it would be really slow if it did work, because the Classifier object would get copied into the worker processes on every invocation of the bound method.
Is there a better alternative? I only care about 3.x.
This was surprisingly tricky. The key here is to preserve read-access to variables that are available at fork-time without serialization. Most solutions to sharing memory in multiprocessing end up serializing. I tried using a weakref.proxy to pass in a classifier without serialization, but that didn't work because both dill and pickle will try to follow and serialize the referent. However, a module-ref works.
This organization gets us close:
import multiprocessing as mp
import csv

def classify(classifier, data_file):
    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

def orchestrate(classifier_spec, data_file):
    # construct a classifier from the spec; note that we can
    # even dynamically import modules here, using config values
    # from the spec
    import classifier_module
    classifier_module.init(classifier_spec)
    return classify(classifier_module, data_file)

if __name__ == '__main__':
    list(orchestrate(None, 'data.txt'))
A few changes to note here:
we add an orchestrate method for some DI goodness; orchestrate figures out how to construct/initialize a classifier, and hands it to classify, decoupling the two
classify only needs to assume that the classifier parameter has a classify method; it doesn't care if it's an instance or a module
For this Proof of Concept, we provide a Classifier that is obviously not serializable:
# classifier_module.py
def _create_classifier(spec):
# obviously not pickle-able because it's inside a function
class Classifier():
def __init__(self, spec):
pass
def classify(self, x):
print(x)
return x
return Classifier(spec)
def init(spec):
global __classifier
__classifier = _create_classifier(spec)
def classify(x):
return __classifier.classify(x)
Unfortunately, there's still a global in here, but it's now nicely encapsulated inside a module as a private variable, and the module exports a tight interface composed of the classify and init functions.
This design unlocks some possibilities:
orchestrate can import and init different classifier modules, based on what it sees in classifier_spec
one could also pass an instance of some Classifier class to classify, as long as this instance is serializable and has a classify method of the same signature
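A minimal sketch of that second possibility (class name and spec are illustrative): any picklable object with a matching classify method can be handed to the same classify function, since bound methods of picklable instances can be sent to the pool:
import csv
import multiprocessing as mp

class PicklableClassifier:
    """Holds only picklable state, so instances survive pickling."""
    def __init__(self, spec):
        self.spec = spec

    def classify(self, row):
        return self.spec, row

def classify(classifier, data_file):
    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

if __name__ == '__main__':
    for result in classify(PicklableClassifier("spec1"), "data.csv"):
        print(result)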
If you want to use forking, I don't see a way around using a global. But I also don't see a reason why you would have to feel bad about using a global in this case; you're not manipulating a global list with multi-threading or the like.
It's possible to cope with the ugliness in your example, though. You want to pass classifier.classify directly, but the Classifier object contains objects which cannot be pickled.
import os
import csv
import uuid
from threading import Lock
from multiprocessing import Pool
from weakref import WeakValueDictionary

class Classifier:

    def __init__(self, spec):
        self.lock = Lock()  # unpickleable
        self.spec = spec

    def classify(self, row):
        return f'classified by pid: {os.getpid()} with spec: {self.spec}', row
I suggest we subclass Classifier and define __getstate__ and __setstate__ to enable pickling. Since you're using forking anyway, all the state it has to pickle is information about how to get a reference to a forked global instance. Then we'll just update the pickled object's __dict__ with the __dict__ of the forked instance (which hasn't gone through the reduction of pickling) and your instance is complete again.
To achieve this without additional boilerplate, the subclassed Classifier instance has to generate a name for itself and register it as a global variable. This first reference will be a weak reference, so the instance can be garbage collected when the user expects it. The second reference is created by the user when he assigns classifier = Classifier(classifier_spec). This one doesn't have to be global.
The generated name in the example below is built with the help of the standard library's uuid module. A uuid is converted to a string and edited into a valid identifier (it wouldn't have to be, but it's convenient for debugging in interactive mode).
class SubClassifier(Classifier):

    def __init__(self, spec):
        super().__init__(spec)
        self.uuid = self._generate_uuid_string()
        self.pid = os.getpid()
        self._register_global()

    def __getstate__(self):
        """Define pickled content."""
        return {'uuid': self.uuid}

    def __setstate__(self, state):
        """Set state in child process."""
        self.__dict__ = state
        self.__dict__.update(self._get_instance().__dict__)

    def _get_instance(self):
        """Get reference to instance."""
        return globals()[self.uuid][self.uuid]

    @staticmethod
    def _generate_uuid_string():
        """Generate id as valid identifier."""
        # return 'uuid_' + '123' # for testing
        return 'uuid_' + str(uuid.uuid4()).replace('-', '_')

    def _register_global(self):
        """Register global reference to instance."""
        weakd = WeakValueDictionary({self.uuid: self})
        globals().update({self.uuid: weakd})

    def __del__(self):
        """Clean up globals when deleted in parent."""
        if os.getpid() == self.pid:
            globals().pop(self.uuid)
The sweet thing here is that the boilerplate is totally gone. You don't have to mess around manually with declaring and deleting globals, since the instance manages everything itself in the background:
def classify(classifier_spec, data_file, n_workers):
    classifier = SubClassifier(classifier_spec)
    # assert globals()['uuid_123']['uuid_123'] # for testing
    with open(data_file, "rt") as fh, Pool(n_workers) as pool:
        rd = csv.DictReader(fh)
        yield from pool.imap_unordered(classifier.classify, rd)

if __name__ == '__main__':
    PATHFILE = 'data.csv'
    N_WORKERS = 4

    g = classify(classifier_spec='spec1', data_file=PATHFILE, n_workers=N_WORKERS)
    for record in g:
        print(record)

    # assert 'uuid_123' not in globals() # no reference left
The multiprocessing.sharedctypes module provides functions for allocating ctypes objects from shared memory which can be inherited by child processes, i.e., parent and children can access the shared memory.
You could use
1. multiprocessing.sharedctypes.RawArray to allocate a ctypes array from the shared memory.
2. multiprocessing.sharedctypes.RawValue to allocate a ctypes object from the shared memory.
Dr Mianzhi Wang has written a very detailed document on this. You could share multiple multiprocessing.sharedctypes objects.
You may find the solution there useful.
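For reference, a minimal sketch of the RawArray variant (size and values are made up), assuming the default fork start method on Linux so the children inherit shared memory created before the Pool:
import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray

# created in the parent before the Pool forks, so children inherit it
SHARED = RawArray('d', [0.5] * 1000)   # 1000 doubles in shared memory

def work(i):
    # read-only access; the array itself is never pickled or copied per task
    return SHARED[i] * 2

if __name__ == '__main__':
    with mp.Pool() as pool:
        print(pool.map(work, range(10)))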

Most optimized way for creating objects from a huge set of objects

I have an iterator (userEnvironments) that yields a lot of user environment objects, from which I want to create a dictionary with environment.name as the key and a new EnvironmentStore object built from that environment as the value.
for environment in userEnvironments:
    set_cache(user, environment)

def set_cache(user, environment):
    _environments_cache[environment.name] = EnvironmentStore(user, environment)
Memory efficiency is not important here, but creating all of these objects takes approximately 4 seconds.
So the question is: what would be a good approach in Python to create the objects only on demand (that is, when another method actually needs them), similar to a generator?
If creating the EnvironmentStore() instances is the time sink here, you can postpone creating them by using a custom mapping:
from collections.abc import Mapping  # "from collections import Mapping" no longer works on Python 3.10+

class EnvironmentStoreMapping(Mapping):
    def __init__(self, user, environments):
        self._user = user
        self._envs = {env.name: env for env in environments}
        self._cache = {}

    def __getitem__(self, key):
        try:
            return self._cache[key]
        except KeyError:
            store = EnvironmentStore(self._user, self._envs[key])
            self._cache[key] = store
            return store

    def __iter__(self):
        return iter(self._envs)

    def __len__(self):
        return len(self._envs)

    def __contains__(self, key):
        return key in self._envs

_environments_cache = EnvironmentStoreMapping(user, userEnvironments)
Now creation of the EnvironmentStore is postponed until it is actually looked up. If not all env.name values are ever looked up, this will not only postpone the cost, but avoid it altogether for those keys never used.
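A brief usage sketch (the environment names here are hypothetical): the first lookup of a key builds the EnvironmentStore, later lookups reuse it, and membership tests never trigger a build:
store = _environments_cache["production"]        # EnvironmentStore built here, on demand
store_again = _environments_cache["production"]  # served from the cache, no rebuild
assert store is store_again
print("staging" in _environments_cache)          # cheap: checks _envs only, builds nothing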

python JSON complex objects (accounting for subclassing)

What is the best practice for serializing/deserializing complex Python objects into/from JSON in a way that accounts for subclassing and prevents the same object (assuming we know how to distinguish between different instances of the same class) from being stored multiple times?
In a nutshell, I'm writing a small scientific library and want people to use it. But after watching Raymond Hettinger's talk Python's Class Development Toolkit I decided that it would be a good exercise for me to implement subclassing-aware behaviour. So far it went fine, but now I've hit the JSON serialization task.
Until now I've looked around and found the following about JSON serialization in Python:
python docs about json module
python cookbook about json serialization
dive into python 3 in regards to json
very interesting article from 2009
Two main obstacles that I have are accounting for possible subclassing and keeping a single copy per instance.
After multiple attempts to solve it in pure Python, without any changes to the JSON representation of the object, I came to understand that at the time of deserializing JSON there is no way to know which subclass's instance was serialized before. So some mention of it has to be made, and I've ended up with something like this:
class MyClassJSONEncoder(json.JSONEncoder):
    @classmethod
    def represent_object(cls, obj):
        """
        This is a way to serialize all built-ins as is, and all complex objects as their id, which is hash(obj) in this implementation
        """
        if isinstance(obj, (int, float, str, bool)) or obj is None:
            return obj
        elif isinstance(obj, (list, dict, tuple)):
            return cls.represent_iterable(obj)
        else:
            return hash(obj)

    @classmethod
    def represent_iterable(cls, iterable):
        """
        JSON supports iterables, so they shall be processed
        """
        if isinstance(iterable, (list, tuple)):
            return [cls.represent_object(value) for value in iterable]
        elif isinstance(iterable, dict):
            return {cls.represent_object(key): cls.represent_object(value) for key, value in iterable.items()}

    def default(self, obj):
        if isinstance(obj, MyClass):
            result = {"MyClass_id": hash(obj),
                      "py__class__": ":".join([obj.__class__.__module__, obj.__class__.__qualname__])}
            for attr, value in obj.__dict__.items():
                result[attr] = self.represent_object(value)
            return result
        return super().default(obj)  # accounting for JSONEncoder subclassing
here the accounting for subclassing is done in
"py__class__": ":".join([obj.__class__.__module, obj.__class__.__qualname__]
the JSONDecoder is to be implemented as follows:
class MyClassJSONDecoder(json.JSONDecoder):
    def decode(self, data):
        if isinstance(data, str):
            data = super().decode(data)
        if "py__class__" in data:
            module_name, class_name = data["py__class__"].split(":")
            object_class = getattr(importlib.__import__(module_name, fromlist=[class_name]), class_name)
        else:
            object_class = MyClass
        data = {key: value for key, value in data.items() if not key.endswith("_id") and key != "py__class__"}
        return object_class(**data)
As can be seen, here we account for possible subclassing with a "py__class__" attribute in the JSON representation of the object; if no such attribute is present (this can be the case if the JSON was generated by another program, say in C++, which just wants to pass us information about a plain MyClass object and doesn't really care about inheritance), the default approach of creating an instance of MyClass
is pursued. This is, by the way, the reason why a single JSONDecoder cannot be created for all objects: it has to have a default class to create if no py__class__ is specified.
In terms of a single copy for every instance, this is achieved by serializing the object with a special JSON key myclass_id and serializing all attribute values as primitives (lists, tuples, dicts and built-ins are preserved, while when a complex object is the value of some attribute, only its hash is stored). Storing object hashes this way allows one to serialize each object exactly once; then, knowing the structure of the object to be decoded from its JSON representation, one can look up the respective objects and assign them afterwards. To simply illustrate this, the following example can be observed:
class MyClass(object):

    json_encoder = MyClassJSONEncoder()
    json_decoder = MyClassJSONDecoder()

    def __init__(self, attr1):
        self.attr1 = attr1
        self.attr2 = [complex_object_1, complex_object_2]

    def to_json(self, top_level=None):
        if top_level is None:
            top_level = {}
        top_level["my_class"] = self.json_encoder.encode(self)
        top_level["complex_objects"] = [obj.to_json(top_level=top_level) for obj in self.attr2]
        return top_level

    @classmethod
    def from_json(cls, data, class_specific_data=None):
        if isinstance(data, str):
            data = json.loads(data)
        if class_specific_data is None:
            class_specific_data = data["my_class"]  # I know the flat structure of the json, and I know the attribute name this class is stored under
        result = cls.json_decoder.decode(class_specific_data)
        # repopulate complex-valued attributes with real python objects
        # rather than their id aliases
        complex_objects = {co_data["ComplexObject_id"]: ComplexObject.from_json(data, class_specific_data=co_data)
                           for co_data in data["complex_objects"]}
        result.attr2 = [c_o for c_o_id, c_o in complex_objects.items() if c_o_id in result.attr2]
        # finish such repopulation
        return result
Is this even the right way to go? Is there a more robust way? Have I missed some programming pattern to apply in this very particular situation?
I just really want to understand the most correct and Pythonic way to implement a JSON serialization that accounts for subclassing and also prevents multiple copies of the same object from being stored.
