Python JSON complex objects (accounting for subclassing)

What is the best practice for serializing/deserializing complex Python objects into/from JSON, in a way that accounts for subclassing and prevents multiple copies of the same object (assuming we know how to distinguish between different instances of the same class) from being stored multiple times?
In a nutshell, I'm writing a small scientific library and want people to use it. But after watching Raymond Hettinger's talk Python's Class Development Toolkit I decided it would be a good exercise for me to implement subclassing-aware behaviour. So far it went fine, but now I've hit the JSON serialization task.
Until now I've looked around and found the following about JSON serialization in Python:
python docs about json module
python cookbook about json serialization
dive into python 3 in regards to json
very interesting article from 2009
The two main obstacles I have are accounting for possible subclassing and keeping a single copy per instance.
After multiple attempts to solve this in pure Python, without any changes to the JSON representation of the object, I came to understand that at deserialization time there is no way to know which subclass's instance was serialized earlier. So some mention of it has to be made in the JSON itself, and I've ended up with something like this:
import json

class MyClassJSONEncoder(json.JSONEncoder):
    @classmethod
    def represent_object(cls, obj):
        """
        This is a way to serialize all built-ins as is, and all complex objects as their id,
        which is hash(obj) in this implementation
        """
        if isinstance(obj, (int, float, str, bool)) or obj is None:
            return obj
        elif isinstance(obj, (list, dict, tuple)):
            return cls.represent_iterable(obj)
        else:
            return hash(obj)

    @classmethod
    def represent_iterable(cls, iterable):
        """
        JSON supports iterables, so they shall be processed
        """
        if isinstance(iterable, (list, tuple)):
            return [cls.represent_object(value) for value in iterable]
        elif isinstance(iterable, dict):
            return {cls.represent_object(key): cls.represent_object(value)
                    for key, value in iterable.items()}

    def default(self, obj):
        if isinstance(obj, MyClass):
            result = {"MyClass_id": hash(obj),
                      "py__class__": ":".join([obj.__class__.__module__,
                                               obj.__class__.__qualname__])}
            for attr, value in obj.__dict__.items():
                result[attr] = self.represent_object(value)
            return result
        return super().default(obj)  # accounting for JSONEncoder subclassing
Here the accounting for subclassing is done in
"py__class__": ":".join([obj.__class__.__module__, obj.__class__.__qualname__])
The JSONDecoder is to be implemented as follows:
import importlib
import json

class MyClassJSONDecoder(json.JSONDecoder):
    def decode(self, data):
        if isinstance(data, str):
            data = super().decode(data)
        if "py__class__" in data:
            module_name, class_name = data["py__class__"].split(":")
            object_class = getattr(importlib.import_module(module_name), class_name)
        else:
            object_class = MyClass
        data = {key: value for key, value in data.items()
                if not key.endswith("_id") and key != "py__class__"}
        return object_class(**data)
As can be seen, here we account for possible subclassing with a "py__class__" attribute in the JSON representation of the object, and if no such attribute is present (this can be the case if the JSON was generated by another program, say in C++, and they just want to pass us information about a plain MyClass object without caring about inheritance), the default approach of creating an instance of MyClass is pursued. This is, by the way, the reason why a single JSONDecoder cannot be created for all objects: it has to have a default class to create if no py__class__ is specified.
In terms of a single copy for every instance, this is achieved by serializing the object with a special JSON key MyClass_id, while all attribute values are serialized as primitives (lists, tuples, dicts and built-ins are preserved, whereas when a complex object is the value of some attribute, only its hash is stored). Storing object hashes this way allows each object to be serialized exactly once; then, knowing the structure of the object to be decoded from its JSON representation, one can look up the respective objects and assign them afterwards. To illustrate this simply, consider the following example:
class MyClass(object):
    json_encoder = MyClassJSONEncoder()
    json_decoder = MyClassJSONDecoder()

    def __init__(self, attr1):
        self.attr1 = attr1
        self.attr2 = [complex_object_1, complex_object_2]  # instances of ComplexObject

    def to_json(self, top_level=None):
        if top_level is None:
            top_level = {}
        top_level["my_class"] = self.json_encoder.encode(self)
        top_level["complex_objects"] = [obj.to_json(top_level=top_level) for obj in self.attr2]
        return top_level

    @classmethod
    def from_json(cls, data, class_specific_data=None):
        if isinstance(data, str):
            data = json.loads(data)
        if class_specific_data is None:
            # I know the flat structure of the json, and I know the attribute name this class is stored under
            class_specific_data = data["my_class"]
        result = cls.json_decoder.decode(class_specific_data)
        # repopulate complex-valued attributes with real python objects
        # rather than their id aliases
        complex_objects = {co_data["ComplexObject_id"]: ComplexObject.from_json(data, class_specific_data=co_data)
                           for co_data in data["complex_objects"]}
        result.attr2 = [c_o for c_o_id, c_o in complex_objects.items() if c_o_id in result.attr2]
        # finish such repopulation
        return result
Is this even the right way to go? Is there a more robust way? Have I missed some programming pattern that fits this very particular situation?
I just really want to understand the most correct and Pythonic way to implement a JSON serialization that accounts for subclassing and also prevents multiple copies of the same object from being stored.

Related

Is customizing YAML Serialization in python using decorator implemented somewhere?

I'm writing a YAML configuration serialization in Python (using YAML because it's a tree-of-objects configuration and I want the configuration to be as human-readable as possible).
I have several problems with this:
Several internal (non-configuration) members are used by the objects, and I don't wish to store them in the config file
Some configuration members have default values, and I don't want to store them if they still hold the default (this also does not touch deserialization)
In Java you had Jackson annotations such as @JsonInclude(Include.NON_NULL) and others that do this for JSON files. I found nothing similar for YAML (or even for JSON) in Python. I know how to write this (using the YAML package API) but I'd rather not if it's already implemented somewhere.
An example of a class I would like to serialize:
class Locator(object):
    def __init__(self, multiple=False):
        # configurable parameters
        self.multiple = multiple
        # internal parameters (used in locate method)
        self.precedents = []
        self.segments = []
        self.probabilities = []

    def locate(self, message):
        """
        do stuff to locate stuff on message
        """
        ...
        yield segment
Here we see the root class that holds a configuration parameter (multiple), which I only wish to serialize if it is True, and other members that are used in its operation, such as its sons (precedents), which I don't want to serialize at all.
Can anyone help me with this?
I think the honest answer is "probably not", and the reason is that what you're reaching for here just isn't really idiomatic in Python. If you squint a little, there is a strong resemblance between Python dicts and JSON objects -- and squint some more and YAML looks like a whitespace-y dialect of JSON -- so when we need to serialize things from Python we tend to write some custom mapping of thing to dict, stuff it in a JSON/YAML serializer, and be done with it.
There are some shortcuts and idiomatic trickery that can come in handy in the thing => dict step. For example, a namedtuple subclass with methods on it will leave said methods out when you call _asdict on it:
In [1]: from collections import namedtuple
In [2]: class Locator(namedtuple("Locator", "foo bar baz")):
...: def hide(self):
...: pass
...:
In [3]: wow = Locator(1,2,3)
In [4]: wow._asdict()
Out[4]: OrderedDict([('foo', 1), ('bar', 2), ('baz', 3)])
Of course a tuple is not mutable, so this is not a general purpose solution if you really need a class with mutable attributes, and furthermore this doesn't address your desire to drop certain attributes from the serialization in a declarative way.
One nice third party library that might fit your needs is attrs... this library provides something like an extra-fancy namedtuple with a lot of customizability, including filters and defaults, which you might be able to work in to something you find comfortable. It's not 1:1 with what you're reaching for here but it could be a start.
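For illustration, here is a minimal sketch of how attrs' filter hook could express the "skip defaults" rule from the question; the Locator field layout and the non_default helper are assumptions of this sketch, not part of attrs itself:
import attr

@attr.s
class Locator:
    multiple = attr.ib(default=False)   # configurable parameter
    # internal state (precedents, segments, ...) can be kept out of the attrs fields

def non_default(attribute, value):
    # keep only values that differ from their declared default
    return value != attribute.default

config = attr.asdict(Locator(multiple=True), filter=non_default)
# -> {'multiple': True}; with multiple=False the dict would be empty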
This is a solution I wrote in the meantime for JSON, but it's very specific and I would love to find a package that already solves this in a more general manner.
import inspect
import json
import re
from collections import OrderedDict

class LocatorEncoder(json.JSONEncoder):
    """
    Custom Locator json encoder that encodes only configuration-related parameters.
    It encodes only:
    1. protected data members (whose names start with a single '_')
    2. members that are different from their default value
    NOTE: for this filtering encoder to work, two strong conventions must be upheld, namely:
    1. Configuration data members must start with a single preceding '_'
    2. They must differ from their correlated __init__ parameter by only that '_'
    """
    @staticmethod
    def get_default_args(f):
        return {
            k: v.default
            for k, v in inspect.signature(f).parameters.items()
            if v.default is not inspect.Parameter.empty
        }

    @staticmethod
    def filter(d, defaults):
        """
        This is the filtering method.
        :param d: dictionary of members to filter
        :param defaults: default values to filter out
        :return: ordered dictionary with only protected members ('_') that do not have their default value
                 (the returned dictionary is ordered because it prints nicer)
        """
        filtered = OrderedDict()
        for k, v in d.items():
            # this is the filter logic (key starts with one _ and is not its default value)
            if re.match(r'_[^_]', k) and (k[1:] not in defaults or defaults[k[1:]] != v):
                filtered[k] = v
        return filtered

    def default(self, o):
        """
        Iterate over the classes in the object's mro and build a dict of default constructor values.
        :param o: the object to be json encoded
        :return: encoded dictionary for the object serialization
        """
        if isinstance(o, Locator):
            defaults = {}
            for cl in o.__class__.mro():
                # iterate over all the default arguments of the __init__ method
                for k, v in self.get_default_args(cl.__init__).items():
                    # update the key with the value if it doesn't already exist
                    defaults[k] = defaults.get(k, v)
            # build the filtered configuration data members and add precedents in a recursive call to this default method
            filtered_dictionary = self.filter(o.__dict__, defaults)
            precedents = []
            for precedent in o.precedents:
                precedents.append(self.default(precedent))
            filtered_dictionary["precedents"] = precedents
            return {'__{}__'.format(o.__class__.__name__): filtered_dictionary}
        return super().default(o)
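A quick usage sketch for the encoder above (assuming a Locator that follows the encoder's underscore convention, i.e. it stores its configuration as self._multiple rather than self.multiple):
locator = Locator(multiple=True)
print(json.dumps(locator, cls=LocatorEncoder, indent=2))
# roughly: {"__Locator__": {"_multiple": true, "precedents": []}}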

Wrapping a python class around JSON data, which is better?

Preamble: I'm writing a python API against a service that delivers JSON.
The files are stored in JSON format on disk to cache the values.
The API should sport classful access to the JSON data, so IDEs and users can have a clue what (read-only) attributes there are in the object before runtime while also providing some convenience functions.
Question: I have two possible implementations, I'd like to know which is nicer or 'pythonic'. While I like both, I am open for suggestions, if you come up with a better solution.
First solution: defining and inheriting from JsonDataWrapper. While nice, it is pretty verbose and repetitive.
class JsonDataWrapper:
    def __init__(self, json_data):
        self._data = json_data

    def get(self, name):
        return self._data[name]

class Course(JsonDataWrapper):
    def __init__(self, data):
        super().__init__(data)
        self._users = {}   # class omitted
        self._groups = {}  # class omitted
        self._assignments = {}

    @property
    def id(self): return self.get('id')

    @property
    def name(self): return self.get('full_name')

    @property
    def short_name(self): return self.get('short_name')

    @property
    def users(self): return self._users

    @users.setter
    def users(self, data):
        users = [User(u) for u in data]
        for user in users:
            self.users[user.id] = user
            # self.groups = user  # this does not make much sense without the rest of the code
            # (It works, but that decision will be revised :D)
Second solution: using lambda for shorter syntax. While working and short, it does not quite look right (see edit1 below.)
def json(name): return property(lambda self: self.get(name))

class Group(JsonDataWrapper):
    def __init__(self, data):
        super().__init__(data)
        self.group_members = []  # elements are of type(User). edit1: was self.members = []

    id = json('id')
    description = json('description')
    name = json('name')
    description_format = json('description_format')
(Naming this function 'json' is not a problem, since I don't import json there.)
I have a possible third solution in mind that I can't quite wrap my head around: overriding the property builtin, so I can define a decorator that wraps the returned field name for lookup:
@json  # just like a property fget
def short_name(self): return 'short_name'
That could be a little shorter, dunno if that makes code better.
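For what it's worth, a minimal sketch of that third idea could look like the following; json_key is a hypothetical name, and the decorated function body is only used to supply the lookup key:
def json_key(fget):
    # build a read-only property that looks up the key returned by the decorated function
    return property(lambda self: self.get(fget(self)))

class Group(JsonDataWrapper):
    @json_key
    def short_name(self): return 'short_name'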
Disqualified solutions (IMHO):
JSON{De,En}coder: kills all flexibility, provides no means of read-only attributes
__{get,set}attr__: makes it impossible to determine attributes before runtime. While it would shorten self.get('id') to self['id'], it would also further complicate matters where an attribute is not in the underlying JSON data.
Thank you for reading!
Edit 1: 2016-07-20T08:26Z
To further clarify (@SuperSaiyan) why I don't quite like the second solution:
I feel the lambda function is completely disconnected from the rest of the class's semantics (which is also the reason why it is shorter :D). I think I can help myself like it more by properly documenting the decision in the code. The first solution is easy to understand for everybody who understands the meaning of @property, without any additional explanation.
On the second comment of @SuperSaiyan: you ask why I put Group.members as an attribute in there? The list stores type(User) entities, which might not be what you think it is, so I changed the example.
@jwodder: I will use Code Review next time, did not know that was a thing.
(Also: I really think Group.members threw some of you off, so I edited the code to make it a little more obvious: Group members are Users that will be added to the list.
The complete code is on github, while undocumented it may be interesting for somebody. Keep in mind: this is all WIP :D)
(note: this got an update, I'm now using dataclasses with run-time type enforcement. see bottom :3)
So, it's been a year and I'm going to answer my own question. I don't quite like answering it myself, but: this will mark the thread as resolved which in itself might help others.
On the other hand, I want to document and give reason to why I chose my solution over proposed answers. Not, to prove me right, but to highlight the different tradeoffs.
I just realized, that this got quite long, so:
tl;dr
collections.abc contains powerful abstractions and you should use them if you have access to it (cpython >= 3.3).
@property is nice to use, enables adding documentation easily and provides read-only access.
Nested classes look weird but replicate the structure of deeply nested JSON just fine.
Proposed solutions
python meta-classes
So first off: I love the concept.
I've considered many applications for where they prove useful, especially when:
writing a pluggable API where meta-classes enforce correct usage of derived classes and their implementation specifics
having a fully automated registry of classes that derive from a meta-class.
On the other hand, python's meta-class logic felt obscure to wrap my head around (took me at least three days to figure it out). While simple in principle, the devil is in the details.
So, I decided against it, simply because I might abandon the project in the not so far future and others should be able to pick up where I left off easily.
namedtuple
collections.namedtuple is very efficient and concise enough to boil my solution down to several lines instead of the current 800+ lines. My IDE will also be able to introspect possible members of the generated class.
Cons: the brevity of namedtuple leaves much less room for the awfully necessary documentation of the API's returned values. So with less insane APIs you could possibly get away with just that.
It also feels weird to nest class objects into the namedtuple, but that's just personal preference.
What I went with
So in the end, I chose to stick to my first original solution with a few minor details added, if you find the details interesting, you can look at the source on github.
collections.abc
When I started the project, my python knowledge was next to none, so I went with what I knew about python ("everything is a dict") and wrote code like that. For example: classes that work like a dict, but have a file structure underneath (that was before pathlib).
While looking through python's code I noticed how they implement and enforce container "traits" through abstract base classes which sounds far more complicated than it really is in python.
the very basics
The following is indeed very basic, but we'll build up from there.
from collections.abc import Mapping, Sequence, Sized

class JsonWrapper(Sized):
    def __init__(self, json):
        self._data = json

    def __len__(self):
        return len(self._data)

    @property
    def raw(self): return self._data
The most basic class I could come up with, this will just enable you to call len on the container. You also can get read-only access through raw if you really want to bother with the underlying dictionary.
So why am I inheriting from Sized instead of just starting from scratch and def __len__ just like that?
not overriding __len__ will not be accepted: the abstract base class machinery raises a TypeError as soon as you try to instantiate a class that is missing an abstract method, so you're not getting screwed deep at runtime.
While Sized does not provide any mixin methods, the next two abstractions do provide them. I'll explain there.
With that down, we've only got two more basic cases to cover: JSON lists and dicts.
Lists
So, with the API I had to worry about, we weren't always sure what we got; so I wanted a way of checking whether I got a list when initializing the wrapper class, mostly to abort early instead of hitting "object has no member" during more complicated processing.
Deriving from Sequence will enforce overriding __getitem__ and __len__ (which is already implemented in JsonWrapper).
class JsonListWrapper(JsonWrapper, Sequence):
    def __init__(self, json_list):
        if type(json_list) is not list:
            raise TypeError('received type {}, expected list'.format(type(json_list)))
        super().__init__(json_list)

    def __getitem__(self, index):
        return self._data[index]

    def __iter__(self):
        raise NotImplementedError('__iter__')

    def get(self, index):
        try:
            return self._data[index]
        except Exception as e:
            print(index)
            raise e
So you might have noted, that I chose to not implement __iter__.
I wanted an iterator that yielded typed objects, so my IDE is able to autocomplete. To illustrate:
class CourseListResponse(JsonListWrapper):
    def __iter__(self):
        for course in self._data:
            yield self.Course(course)

    class Course(JsonDictWrapper):
        pass  # for now
Implementing the abstract methods of Sequence, the mixin methods __contains__, __reversed__, index and count are gifted to you, so you don't have to worry about possible side-effects.
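A small usage sketch of those gifted mixin methods, using the classes defined above (the sample data is made up):
courses = CourseListResponse([{"id": 1}, {"id": 2}])
len(courses)              # 2, via JsonWrapper.__len__
courses.index({"id": 2})  # 1, provided by the Sequence mixin
{"id": 1} in courses      # __contains__ also comes from Sequence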
Dictionaries
To complete the basic types to wrangle JSON, here's the class derived from Mapping:
class JsonDictWrapper(JsonWrapper, Mapping):
    def __init__(self, json_dict):
        super().__init__(json_dict)
        if type(self._data) is not dict:
            raise TypeError('received type {}, expected dict'.format(type(json_dict)))

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        return self._data[key]

    __marker = object()

    def get(self, key, default=__marker):
        try:
            return self._data[key]
        except KeyError:
            if default is self.__marker:
                raise
            else:
                return default
Mapping only enforces __iter__, __getitem__ and __len__.
To avoid confusion: There is also MutableMapping which will enforce the writing methods. But that's neither needed nor wanted here.
With the abstract methods out of the way, python provides the mixins __contains__, keys, items, values, get, __eq__, and __ne__ based on them.
I'm not sure why I chose to override the get mixin; I might update the post when it comes back to me.
__marker serves as a fallback to detect if the default keyword was not set. If somebody decided to call get(*args, default=None) you won't be able to detect that otherwise.
So to pick up the previous example:
class CourseListResponse(JsonListWrapper):
    # [...]
    class Course(JsonDictWrapper):
        # Jn is just a class that contains the keys for JSON, so I only mistype once.
        @property
        def id(self): return self[Jn.id]

        @property
        def short_name(self): return self[Jn.short_name]

        @property
        def full_name(self): return self[Jn.full_name]

        @property
        def enrolled_user_count(self): return self[Jn.enrolled_user_count]
        # [...] you get the idea
The properties provide read-only access to members and can be documented like a function definition.
Although verbose, for basic accessors you can easily define a template in your editor, so it's less tedious to write.
Properties also allow abstracting from magic numbers and optional JSON return values, providing defaults instead of guarding against KeyError everywhere:
@property
def isdir(self): return 1 == self[Jn.is_dir]

@property
def time_created(self): return self.get(Jn.time_created, 0)

@property
def file_size(self): return self.get(Jn.file_size, -1)

@property
def author(self): return self.get(Jn.author, "")

@property
def license(self): return self.get(Jn.license, "")
class nesting
It seems a little weird to nest classes in others.
I chose to do that because the API uses the same name for various objects with different attributes, depending on which remote function you called.
Another benefit: new people can easily understand the structure of the returned JSON.
The end of the file contains various aliases to the nested classes for easier access from outside the module.
adding logic
Now that we have encapsulated most of the returned values, I wanted to have more logic associated with the data, to add some convenience.
It also seemed necessary to merge some of the data into a more comprehensive tree that contained all of the data gathered through several API calls:
get all "assignments". each assignment contains many submissions, so:
for(assignment in assigmnents) get all "submissions"
merge submissions into respective assignment.
now get grades for the submissions, and so on...
I chose to implement them separately, so I just inherited from the "dumb" accessors (full source):
So in this class
class Assignment(MoodleAssignment):
    def __init__(self, data, course=None):
        super().__init__(data)
        self.course = course
        self._submissions = {}  # accessed via submission.id
        self._grades = {}       # accessed via user_id
these properties do the merging
@property
def submissions(self): return self._submissions

@submissions.setter
def submissions(self, data):
    if data is None:
        self.submissions = {}
        return
    for submission in data:
        sub = Submission(submission, assignment=self)
        if sub.has_content:
            self.submissions[sub.id] = sub

@property
def grades(self):
    return self._grades

@grades.setter
def grades(self, data):
    if data is None:
        self.grades = {}
        return
    grades = [Grade(g) for g in data]
    for g in grades:
        self.grades[g.user_id] = g
and these implement some logic that can be abstracted from the data.
@property
def is_due(self):
    now = datetime.now()
    return now > self.due_date

@property
def due_date(self): return datetime.fromtimestamp(super().due_date)
While the setters obscure the wrangling, they are nice to write and use: so it's just a trade-off.
Caveat: The logic implementation is not quite what I want it to be; there's much interdependence where there should not be. It grew from me not knowing enough Python to get the abstractions right while still getting things done, so I could do the actual work with the tedium out of my way.
Now that I know what could have been done: I look at some of that spaghetti, and well … you know the feeling.
Conclusion
Encapsulating the JSON into classes proved quite useful to me and the project's structure, and I'm quite happy with it.
The rest of the project is fine and works, although some parts are just awful :D
Thank you all for the feedback, I'll be around for questions and remarks.
update: 2019-05-02
As @RickTeachey points out in the comments, Python's dataclasses (DCs) can be used here as well.
And I forgot to put an update here, since I already did that some time ago and extended it with Python's typing functionality :D
The reason for that: I was growing tired of manually checking whether the documentation of the API I was abstracting from was correct, or whether I had got my implementation wrong.
With dataclasses.fields I'm able to check whether the response conforms to my schema; and now I'm able to find changes in the external API much faster, since the assumptions are checked during run-time on instantiation.
DCs provide a __post_init__(self) hook to do some post-processing once __init__ has completed successfully. Python's type hints are only in place to provide hints for static checkers, so I built a little system that does enforce the types on dataclasses in the post-init phase.
Here is the BaseDC, from which all other DCs inherit (abbreviated)
import dataclasses as dc

@dc.dataclass
class BaseDC:
    def _typecheck(self):
        for field in dc.fields(self):
            expected = field.type
            f = getattr(self, field.name)
            actual = type(f)
            if expected is list or expected is dict:
                log.warning(f'untyped list or dict in {self.__class__.__qualname__}: {field.name}')
            if expected is actual:
                continue
            if is_generic(expected):
                return self._typecheck_generic(expected, actual)
            # Subscripted generics cannot be used with class and instance checks
            if issubclass(actual, expected):
                continue
            print(f'mismatch {field.name}: should be: {expected}, but is {actual}')
            print(f'offending value: {f}')

    def __post_init__(self):
        for field in dc.fields(self):
            castfunc = field.metadata.get('castfunc', False)
            if castfunc:
                attr = getattr(self, field.name)
                new = castfunc(attr)
                setattr(self, field.name, new)
        if DEBUG:
            self._typecheck()
Fields have an additional metadata attribute that is allowed to store arbitrary information; I'm using it to store functions that convert the response value, but more on that later.
A basic response wrapper looks like this:
from typing import Optional

@dc.dataclass
class DCcore_enrol_get_users_courses(BaseDC):
    id: int  # id of course
    shortname: str  # short name of course
    fullname: str  # long name of course
    enrolledusercount: int  # Number of enrolled users in this course
    idnumber: str  # id number of course
    visible: int  # 1 means visible, 0 means hidden course
    summary: Optional[str] = None  # summary
    summaryformat: Optional[int] = None  # summary format (1 = HTML, 0 = MOODLE, 2 = PLAIN or 4 = MARKDOWN)
    format: Optional[str] = None  # course format: weeks, topics, social, site
    showgrades: Optional[int] = None  # true if grades are shown, otherwise false
    lang: Optional[str] = None  # forced course language
    enablecompletion: Optional[int] = None  # true if completion is enabled, otherwise false
    category: Optional[int] = None  # course category id
    progress: Optional[float] = None  # Progress percentage
    startdate: Optional[int] = None  # Timestamp when the course starts
    enddate: Optional[int] = None  # Timestamp when the course ends

    def __str__(self): return f'{self.fullname[0:39]:40} id:{self.id:5d} short: {self.shortname}'

core_enrol_get_users_courses = destructuring_list_cast(DCcore_enrol_get_users_courses)
Responses that are just lists were giving me trouble in the beginning, since I could not enforce type checking on them with a plain List[DCcore_enrol_get_users_courses].
This is where the destructuring_list_cast solves that problem for me, which is a little more involved. We're entering higher order function territory:
import typing
from typing import List

T = typing.TypeVar('T')

def destructuring_list_cast(cls: typing.Callable[[dict], T]) -> typing.Callable[[list], List[T]]:
    def cast(data: list) -> List[T]:
        if data is None:
            return []
        if not isinstance(data, list):
            raise SystemExit(f'listcast expects a list, you sent: {type(data)}')
        try:
            return [cls(**entry) for entry in data]
        except TypeError as err:
            # here is more code that explains errors
            raise SystemExit(f'listcast for class {cls} failed:\n{err}')
    return cast
This expects a Callable that accepts a dict and returns a class instance of type T, which is what you'd expect from a constructor or a factory.
It returns a Callable that will accept a list, here it's cast.
return [cls(**entry) for entry in data] does all the work here, by constructing a list of dataclasses, when you call core_enrol_get_users_courses(response.json()).
(Throwing SystemExit is not nice, but that's handled in the upper layers, so it works for me; I want that to fail hard and fast.)
Its other use case is defining nested fields, for when the responses are deeply nested: remember the field.metadata.get('castfunc', False) in the BaseDC? That's where these two shortcuts come in:
# destructured_cast_field
def dcf(cls):
    return dc.field(metadata={'castfunc': destructuring_list_cast(cls)})

def optional_dcf(cls):
    return dc.field(metadata={'castfunc': destructuring_list_cast(cls)}, default_factory=list)
These are used in nested cases like this (see bottom):
@dc.dataclass
class core_files_get_files(BaseDC):
    @dc.dataclass
    class parent(BaseDC):
        contextid: int
        # abbrev ...

    @dc.dataclass
    class file(BaseDC):
        contextid: int
        component: str
        timecreated: Optional[int] = None  # Time created
        # abbrev ...

    parents: List[parent] = dcf(parent)
    files: Optional[List[file]] = optional_dcf(file)
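A hedged usage sketch: response is a hypothetical HTTP response whose JSON body matches the (abbreviated) schema above; __post_init__ then runs the castfuncs on the nested fields and type-checks the result:
payload = response.json()   # e.g. {"parents": [{"contextid": 3, ...}], "files": [...]}
files_info = core_files_get_files(**payload)
for f in files_info.files:
    print(f.component, f.timecreated)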
Have you considered using a meta-class?
class JsonDataWrapper(object):
    def __init__(self, json_data):
        self._data = json_data

    def get(self, name):
        return self._data[name]

class JsonDataWrapperMeta(type):
    def __init__(self, name, bases, namespace):
        for mbr in self.members:
            # bind mbr as a default argument so each property keeps its own name
            prop = property(lambda self, mbr=mbr: self.get(mbr))
            setattr(self, mbr, prop)

# You can use the metaclass inside a class block
class Group(JsonDataWrapper, metaclass=JsonDataWrapperMeta):
    members = ['id', 'description', 'name', 'description_format']

# Or more programmatically
def jsonDataFactory(name, members):
    d = {"members": members}
    return JsonDataWrapperMeta(name, (JsonDataWrapper,), d)

Course = jsonDataFactory("Course", ["id", "name", "short_name"])
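A brief usage sketch for the generated wrappers (the sample JSON is made up):
c = Course({"id": 7, "name": "Intro to Python", "short_name": "py101"})
print(c.id, c.short_name)   # -> 7 py101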
When developing an API like this - in which all the members are read-only (meaning you do not want them overwritten, but may still have mutable data structures as members) - I have often found collections.namedtuple a hard-to-beat approach unless I have a very good reason to do otherwise. It is fast, and needs a bare minimum of code.
from collections import namedtuple as nt
Group = nt('Group', 'id name shortname users')
g = Group(**json)
Simple.
If there is more data in your json than will be used in the object, just filter it out:
g = Group(**{k:v for k,v in json.items() if k in Group._fields})
If you want defaults for missing data, you can do that, too:
Group.__new__.__defaults__ = (0, 'DefaultName', 'DefN', None)
# now this works:
g = Group()
# and now this will still work even if some keys are missing;
g = Group(**{k:v for k,v in json.items() if k in Group._fields})
One gotcha using the above technique of setting defaults: don't set the default value for one of the members to any mutable object, such as a list, because it will be the same mutable shared object across all instances:
# don't do this:
Group.__new__.__defaults__ = (0, 'DefaultName', 'DefN', [])
g1 = Group()
g2 = Group()
g1.users.append(user1)
g2.users  # output: [user1] <-- whoops!
Instead, wrap it all up in a nice factory that instantiates a new list (or dict or whatever user-defined data structure you need) for the members that need them:
# jsonfactory.py
from collections import namedtuple as nt

new_list = object()

def JsonClassFactory(name, *args, defaults=None):
    '''Produces a new namedtuple class. Any members
    intended to default to a blank list should be set to
    the new_list object.
    '''
    cls = nt(name, *args)
    if defaults is not None:
        cls.__new__.__defaults__ = tuple(([] if d is new_list else d) for d in defaults)
    return cls
Now given some json object that defines the fields you want present:
from jsonfactory import JsonClassFactory, new_list

MyJsonClass = JsonClassFactory('MyJsonClass', *json_definition,
                               defaults=(0, 'DefaultName', 'DefN', new_list))
And then as before:
obj = MyJsonClass(**json)
OR, if there is extra data:
obj = MyJsonClass(**{k:v for k,v in json.items() if k in MyJsonClass._fields})
If you want the default container to be something other than a list, this is simple enough- just replace the new_list sentinel with whatever sentinel you wish. If needed you could have multiple sentinels at the same time.
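For example, a second sentinel for dict-valued defaults could sit next to new_list in jsonfactory.py; new_dict and the extended factory below are illustrative only, not part of the code above:
new_dict = object()

def JsonClassFactory(name, *args, defaults=None):
    '''Same factory as above, but recognising both sentinels.'''
    cls = nt(name, *args)
    if defaults is not None:
        cls.__new__.__defaults__ = tuple(
            [] if d is new_list else {} if d is new_dict else d
            for d in defaults)
    return cls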
And if you still need extra functionality, you can always extend your MyJsonClass:
class ExtJsonClass(MyJsonClass):
__slots__ = () # optional- needed if you want the low memory benefits of namedtuple
def __new__(cls, *args, **kwargs):
self = super().__new__(cls, *args, **{k:v for k,v in kwargs.items()
if k in cls._fields})
return self
def add_user(self, user):
self.users.append(user)
The __new__ method above takes care of the missing data problem for good. So now you can always just do this:
obj = ExtJsonClass(**json)
Simple.
I myself am a newbie in Python, so excuse me if I sound naive. One of the solutions could be using __dict__, as discussed in the article below:
https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch06s02.html
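A minimal sketch of that idea with a throwaway class (Point is just an example name):
import json

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

s = json.dumps(Point(1, 2).__dict__)   # serialize the attribute dict: '{"x": 1, "y": 2}'
p = Point(**json.loads(s))             # feed the decoded dict back into the constructor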
Of course this solution will create issues if there are objects inside a class which belong to another class and need to be serialized or deserialized. I would love to hear the opinion of the experts here on this solution and its limitations.
Any feedback on jsonpickle?
Update:
I just saw your objection about the serialization and how you don't like it as everything is runtime. Understood. Thanks a lot.
Below is the code I wrote to get around that. A bit of a stretch, but it works well and I do not have to add get/set every time!
import json

class JSONObject:
    exp_props = {"id": "", "title": "Default"}

    def __init__(self, d):
        self.__dict__ = d
        for key in [x for x in JSONObject.exp_props if x not in self.__dict__]:
            setattr(self, key, JSONObject.exp_props[key])

    @staticmethod
    def fromJSON(s):
        return json.loads(s, object_hook=JSONObject)

    def toJSON(self):
        return json.dumps(self.__dict__, indent=4)
s = '{"name": "ACME", "shares": 50, "price": 490.1}'
anObj = JSONObject.fromJSON(s)
print("Name - {}".format(anObj.name))
print("Shares - {}".format(anObj.shares))
print("Price - {}".format(anObj.price))
print("Title - {}".format(anObj.title))
sAfter = anObj.toJSON()
print("Type of dumps is {}".format(type(sAfter)))
print(sAfter)
Results below
Name - ACME
Shares - 50
Price - 490.1
Title - Default
Type of dumps is <type 'str'>
{
"price": 490.1,
"title": "Default",
"name": "ACME",
"shares": 50,
"id": ""
}

What's the point of inheriting from 'dict' in a class?

I saw one of my colleagues write his code like:
class a(dict):
    # something
    pass
Is this a common technique? What purpose does it serve?
This can be done when you want a class with the default behaviour of a dictionary (getting and setting keys), but whose instances are going to be used in highly specific circumstances, and you anticipate the need to provide custom methods or constructors specific to those.
For example, you may want to have a dynamic KeyStorage that starts as an in-memory store, but later adapt the class to keep the data on disk.
You can also mangle the keys and values as needed - for storage of unicode data on a database with a specific encoding, for example.
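A tiny sketch of that key/value mangling idea (purely illustrative): a dict subclass that stores every key encoded to a fixed encoding, e.g. for a backend that only accepts bytes in that encoding:
class EncodedKeyDict(dict):
    def __setitem__(self, key, value):
        super().__setitem__(key.encode("utf-8"), value)

    def __getitem__(self, key):
        return super().__getitem__(key.encode("utf-8"))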
In some cases it makes sense. For example you could create a dict that allows case insensitive lookup:
class case_insensitive_dict(dict):
    def __getitem__(self, key):
        return super(case_insensitive_dict, self).__getitem__(key.lower())

    def __setitem__(self, key, value):
        return super(case_insensitive_dict, self).__setitem__(key.lower(), value)

d = case_insensitive_dict()
d["AbCd"] = 1
print d["abcd"]
(this might require additional error handling)
Extending the built-in dict class can be useful to create dict "supersets" (e.g. "bunch" class where keys can be accessed object-style, as in javascript) without having to reimplement MutableMapping's 5 methods by hand.
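A minimal sketch of such a "bunch" class, for illustration:
class Bunch(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

b = Bunch(foo=3)
b.bar = 4
print(b.foo, b["bar"])   # -> 3 4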
But if your colleague literally writes
class MyDict(dict):
    pass
without any customisation, I can only see evil uses for it, such as adding attributes to the dict:
>>> a = {}
>>> a.foo = 3
AttributeError: 'dict' object has no attribute 'foo'
>>> b = MyDict()
>>> b.foo = 3
>>>

How to use large dicts in Python which do not fit in memory?

We use a dict which contains about 4GB of data for data processing. It's convenient and fast.
The problem we are having is that this dict might grow over 32GB.
I'm looking for a way to use a dict (just like a variable with get()-method etc) which can be bigger than the available memory. It would be great if this dict somehow stored the data on disk and retrieved the data from disk when get(key) is called and value for the key is not in memory.
Preferably I wouldn't like to use an external service, like a SQL database.
I did find Shelve, but it seems to need the memory too.
Any ideas on how to approach this problem?
That sounds like you could use a key-value store, which are currently hyped under the buzzword NoSQL. A good introduction can be found, for instance, in
http://ayende.com/blog/4449/that-no-sql-thing-key-value-stores.
It is simply a database with the API you described.
Use the pickle module to serialize the dictionary to disk. Then, initially, take successive values from the iterator of the dictionary and place them into your cache. Then implement a cache scheme such as LRU: remove a dictionary item using the popitem() method of dictionaries, and add the previously accessed item in the case of LRU.
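A rough sketch of that disk-plus-LRU idea, here using shelve for the on-disk part; DiskBackedDict, its max_items limit and the string-only keys are assumptions of this sketch, not a drop-in solution:
import shelve
from collections import OrderedDict

class DiskBackedDict:
    def __init__(self, path, max_items=100000):
        self._disk = shelve.open(path)      # all data lives here (keys must be strings)
        self._cache = OrderedDict()         # bounded in-memory LRU cache
        self._max_items = max_items

    def __setitem__(self, key, value):
        self._disk[key] = value
        self._cache[key] = value
        self._cache.move_to_end(key)
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict the least recently used item

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        value = self._disk[key]              # cache miss: fall back to disk
        self[key] = value                    # warm the cache (rewrites to disk, harmless)
        return value

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default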
I couldn't find any (fast) module to do this and decided to create my own (my first Python project - thanks @blubber for some ideas :P). You can find it on GitHub: https://github.com/ddofborg/diskdict Comments are welcome!
If you don't want to use an SQL database (which is a reasonable solution to a problem like this) you'll have to either figure out a way to compress the data you're working with or use a library like this one (or your own) to do the mapping to disc yourself.
You can also look at this question for some more strategies.
/EDIT: This is now a Python module: fdict.
I had a similar issue, but I was working on nested dicts. Because of the very nested nature of my dataset, none of the available solutions fit my use case: shelve, chest, shove, sqlite, zodb, and any other NoSQL db work mainly when you have one level, because they all pickle/JSON the values to serialize them into the database, so all values after the 1st level are serialized and it is not possible to do incremental updates of the nested objects (you have to load the whole branch from the 1st-level node, modify it, and then put it back), which is not possible if your nested objects are huge by themselves.
This means that these solutions will help when you have lots of 1st-level nodes, but not when you have a few 1st-level nodes and a deep hierarchy.
To counter that, I have made a custom dict class over the native Python dict, to internally represent nested dict in a flattened form: dict()['a']['b'] will internally be represented as dict()['a/b'].
Then on top of this "internally flattened" dict, you can now use any object to disk solution like shelve, which is what I used but you can easily replace with something else.
Here is the code:
import shelve

class fdict(dict):
    '''Flattened nested dict, all items are settable and gettable through ['item1']['item2'] standard form or ['item1/item2'] internal form.
    This allows to replace the internal dict with any on-disk storage system like a shelve's shelf (great for huge nested dicts that cannot fit into memory).
    Main limitation: an entry can be both a singleton and a nested fdict, and there is no way to tell what is what, no error will be shown, the singleton will always be returned.
    '''
    def __init__(self, d=None, rootpath='', delimiter='/', *args):
        if d:
            self.d = d
        else:
            self.d = {}
        self.rootpath = rootpath
        self.delimiter = delimiter

    def _buildpath(self, key):
        return self.rootpath + self.delimiter + key if self.rootpath else key

    def __getitem__(self, key):
        # Node or leaf?
        if key in self.d:  # Leaf: return the value
            return self.d.__getitem__(key)
        else:  # Node: return a new full fdict based on the old one but with a different rootpath to limit the results by default
            return fdict(d=self.d, rootpath=self._buildpath(key))

    def __setitem__(self, key, value):
        self.d.__setitem__(self._buildpath(key), value)

    def keys(self):
        if not self.rootpath:
            return self.d.keys()
        else:
            pattern = self.rootpath + self.delimiter
            lpattern = len(pattern)
            return [k[lpattern:] for k in self.d.keys() if k.startswith(pattern)]

    def items(self):
        # Filter items to keep only the ones below the rootpath level
        if not self.rootpath:
            return self.d.items()
        else:
            pattern = self.rootpath + self.delimiter
            lpattern = len(pattern)
            return [(k[lpattern:], v) for k, v in self.d.items() if k.startswith(pattern)]

    def values(self):
        if not self.rootpath:
            return self.d.values()
        else:
            pattern = self.rootpath + self.delimiter
            lpattern = len(pattern)
            return [v for k, v in self.d.items() if k.startswith(pattern)]

    def update(self, d2):
        return self.d.update(d2.d)

    def __repr__(self):
        # Filter the items if there is a rootpath and return as a new fdict
        if self.rootpath:
            return repr(fdict(d=dict(self.items())))
        else:
            return self.d.__repr__()

    def __str__(self):
        if self.rootpath:
            return str(fdict(d=dict(self.items())))
        else:
            return self.d.__str__()
class sfdict(fdict):
    '''A nested dict with flattened internal representation, combined with shelve to allow for efficient storage and memory allocation of huge nested dictionaries.
    If you change leaf items (eg, list.append), do not forget to sync() to commit changes to disk and empty the memory cache, because otherwise this class has no way to know if leaf items were changed!
    '''
    def __init__(self, *args, **kwargs):
        if not ('filename' in kwargs):
            self.filename = None
        else:
            self.filename = kwargs['filename']
            del kwargs['filename']
        fdict.__init__(self, *args, **kwargs)
        self.d = shelve.open(filename=self.filename, flag='c', writeback=True)

    def __setitem__(self, key, value):
        fdict.__setitem__(self, key, value)
        self.sync()

    def get_filename(self):
        return self.filename

    def sync(self):
        self.d.sync()

    def close(self):
        self.d.close()
Both fdict and sfdict are usable just like a standard dict (but I did not implement all methods heh).
The full code is here:
https://gist.github.com/lrq3000/8ce9174c1c7a5ef546df1e1361417213
This was further developed into a full Python module: fdict.
After benchmarking, this is about 10x slower than a dict when using indirect access (ie, x['a']['b']['c']), and about as fast when using direct access (ie, x['a/b/c']), although here I do not account for the overhead of shelve saving to an anydbm file, just the fdict data structure compared to dict.
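A two-line usage sketch of the two access forms compared above:
x = fdict()
x['a']['b']['c'] = 1      # indirect access (about 10x slower)
assert x['a/b/c'] == 1    # direct access on the internal flattened key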

Easiest way to serialize a simple class object with simplejson?

I'm trying to serialize a list of python objects with JSON (using simplejson) and am getting the error that the object "is not JSON serializable".
The class is a simple class having fields that are only integers, strings, and floats, and inherits similar fields from one parent superclass, e.g.:
class ParentClass:
    def __init__(self, foo):
        self.foo = foo

class ChildClass(ParentClass):
    def __init__(self, foo, bar):
        ParentClass.__init__(self, foo)
        self.bar = bar

bar1 = ChildClass(my_foo, my_bar)
bar2 = ChildClass(my_foo, my_bar)
my_list_of_objects = [bar1, bar2]
simplejson.dump(my_list_of_objects, my_filename)
where foo, bar are simple types like I mentioned above. The only tricky thing is that ChildClass sometimes has a field that refers to another object (of a type that is not ParentClass or ChildClass).
What is the easiest way to serialize this as a JSON object with simplejson? Is it sufficient to make it serializable as a dictionary? Is the best way to simply write a dict method for ChildClass? Finally, does having a field that refers to another object significantly complicate things? If so, I can rewrite my code to only have simple fields in classes (like strings/floats etc.)
thank you.
I've used this strategy in the past and been pretty happy with it: Encode your custom objects as JSON object literals (like Python dicts) with the following structure:
{ '__ClassName__': { ... } }
That's essentially a one-item dict whose single key is a special string that specifies what kind of object is encoded, and whose value is a dict of the instance's attributes. If that makes sense.
A very simple implementation of an encoder and a decoder (simplified from code I've actually used) is like so:
TYPES = {'ParentClass': ParentClass,
         'ChildClass': ChildClass}

class CustomTypeEncoder(json.JSONEncoder):
    """A custom JSONEncoder class that knows how to encode core custom
    objects.
    Custom objects are encoded as JSON object literals (ie, dicts) with
    one key, '__TypeName__' where 'TypeName' is the actual name of the
    type to which the object belongs. That single key maps to another
    object literal which is just the __dict__ of the object encoded."""
    def default(self, obj):
        if isinstance(obj, tuple(TYPES.values())):
            key = '__%s__' % obj.__class__.__name__
            return {key: obj.__dict__}
        return json.JSONEncoder.default(self, obj)

def CustomTypeDecoder(dct):
    if len(dct) == 1:
        type_name, value = list(dct.items())[0]
        type_name = type_name.strip('_')
        if type_name in TYPES:
            return TYPES[type_name].from_dict(value)
    return dct
This implementation assumes that the objects you're encoding have a from_dict() class method that knows how to recreate an instance from a dict decoded from JSON.
It's easy to expand the encoder and decoder to support custom types (e.g. datetime objects).
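For example, a hedged sketch of such an extension could add a datetime branch next to the custom-class branch (the '__datetime__' key is an arbitrary choice for this sketch):
import datetime

class CustomTypeEncoderWithDates(CustomTypeEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return {'__datetime__': obj.isoformat()}
        return CustomTypeEncoder.default(self, obj)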
EDIT, to answer your edit: The nice thing about an implementation like this is that it will automatically encode and decode instances of any object found in the TYPES mapping. That means that it will automatically handle a ChildClass like so:
class ChildClass(object):
    def __init__(self):
        self.foo = 'foo'
        self.bar = 1.1
        self.parent = ParentClass(1)
That should result in JSON something like the following:
{ '__ChildClass__': {
    'bar': 1.1,
    'foo': 'foo',
    'parent': {
        '__ParentClass__': {
            'foo': 1}
        }
    }
}
An instance of a custom class could be represented as JSON formatted string with help of following function:
import json

def json_repr(obj):
    """Represent instance of a class as JSON.
    Arguments:
    obj -- any object
    Return:
    String that represents the JSON-encoded object.
    """
    def serialize(obj):
        """Recursively walk object's hierarchy."""
        if isinstance(obj, (bool, int, long, float, basestring)):
            return obj
        elif isinstance(obj, dict):
            obj = obj.copy()
            for key in obj:
                obj[key] = serialize(obj[key])
            return obj
        elif isinstance(obj, list):
            return [serialize(item) for item in obj]
        elif isinstance(obj, tuple):
            return tuple(serialize([item for item in obj]))
        elif hasattr(obj, '__dict__'):
            return serialize(obj.__dict__)
        else:
            return repr(obj)  # Don't know how to handle, convert to string
    return json.dumps(serialize(obj))
This function will produce a JSON-formatted string for:
an instance of a custom class,
a dictionary that has instances of custom classes as leaves,
a list of instances of custom classes.
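A quick usage sketch with a throwaway class (the exact key order in the output may vary):
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

print(json_repr([Point(1, 2), {"origin": Point(0, 0)}]))
# -> [{"x": 1, "y": 2}, {"origin": {"x": 0, "y": 0}}]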
As specified in Python's JSON docs (see help(json.dumps)):
You should simply override the default() method of JSONEncoder in order to provide a custom type conversion, and pass it as the cls argument.
Here is one I use to cover Mongo's special data types (datetime and ObjectId)
class MongoEncoder(json.JSONEncoder):
    def default(self, v):
        types = {
            'ObjectId': lambda v: str(v),
            'datetime': lambda v: v.isoformat()
        }
        vtype = type(v).__name__
        if vtype in types:
            return types[type(v).__name__](v)
        else:
            return json.JSONEncoder.default(self, v)
Calling it is as simple as
data = json.dumps(data, cls=MongoEncoder)
If you are using Django, it can be easily done via Django's serializers module. More info can be found here: https://docs.djangoproject.com/en/dev/topics/serialization/
This is kind of hackish and I'm sure there's probably a lot that can be wrong with it. However, I was producing a simple script and ran into the issue that I did not want to subclass my JSON serializer just to serialize a list of model objects. I ended up using a list comprehension.
Let:
assets = list of modelobjects
Code:
myJson = json.dumps([x.__dict__ for x in assets])
So far it seems to have worked charmingly for my needs.
I have a similar problem, but the json.dump function is not called by me.
So, to make MyClass JSON serializable without passing a custom encoder to json.dump, you have to monkey patch the json encoder.
First create your encoder in your module my_module:
import json

class JSONEncoder(json.JSONEncoder):
    """To make MyClass JSON serializable you have to Monkey patch the json
    encoder with the following code:
    >>> import json
    >>> import my_module
    >>> json.JSONEncoder.default = my_module.JSONEncoder.default
    """
    def default(self, o):
        """For JSON serialization."""
        if isinstance(o, MyClass):
            return o.__repr__()
        # mirror the stock behaviour for unknown types: reject them
        # (delegating to the patched base default here would not work)
        raise TypeError('Object of type {} is not JSON serializable'.format(o.__class__.__name__))

class MyClass:
    def __repr__(self):
        return "my class representation"
Then as it is described in the comment, monkey patch the json encoder:
import json
import my_module
json.JSONEncoder.default = my_module.JSONEncoder.default
Now, even a call to json.dump in an external library (where you cannot change the cls parameter) will work for your my_module.MyClass objects.
I feel a bit silly about my possible 2 solutions rereading them now;
of course, when you use django-rest-framework, that framework has some excellent features built in for the problem mentioned above.
See this model view example on their website.
If you're not using django-rest-framework, this can help anyway:
I found 2 helpful solutions for this problem on this page (I like the second one the most!):
Possible solution 1 (or way to go):
David Chambers Design made a nice solution.
I hope David does not mind that I copy-paste his solution code here:
Define a serialization method on the instance's model:
def toJSON(self):
    import simplejson
    return simplejson.dumps(dict([(attr, getattr(self, attr)) for attr in [f.name for f in self._meta.fields]]))
and he even extracted the method above, so it's more readable:
def toJSON(self):
    fields = []
    for field in self._meta.fields:
        fields.append(field.name)
    d = {}
    for attr in fields:
        d[attr] = getattr(self, attr)
    import simplejson
    return simplejson.dumps(d)
Please mind, it's not my solution; all the credit goes to the link included. Just thought this should be on Stack Overflow.
This could be implemented in the answers above as well.
Solution 2:
My preferable solution is found on this page:
http://www.traddicts.org/webdevelopment/flexible-and-simple-json-serialization-for-django/
By the way, I saw that the writer of this second and best solution, Selaux, is on Stack Overflow as well.
I hope he sees this, and we can talk about starting to implement and improve his code in an open solution?
