Problem: I have an object with some values and a schema which defines field and schema level validation for such an object using marshmallow.
The schema can validate serialized data, which is enough for most use cases. But when I want to partially update the object, I have to serialize the already existing object, update the returned dictionary with the new fields, validate the dict, and then reassign those values to the object. This is required to run the schema-level validation.
Question: I was wondering if there is a cleaner way to validate partial updates, for example by partially loading the data, assigning it to the existing object, and validating the object through the schema (not possible, because a schema's validation function only accepts dicts).
Note: I cannot just recreate the object, because it is synced with the database. I have to assign the new values.
Example:
from marshmallow import Schema, fields, validates_schema

class Obj:
    def __init__(self):
        self.a = 1
        self.b = 2

    def update(self, **kwargs):
        self.__dict__.update(kwargs)

class ObjSchema(Schema):
    a = fields.Integer()
    b = fields.Integer()

    @validates_schema
    def validate_whole_schema(self, data, **kwargs):
        ...

o = Obj()
sc = ObjSchema()
new_values = {
    "a": 3
}

# ugly partial update
o_data = sc.dump(o)
o_data.update(new_values)
errors = sc.validate(o_data)
if errors:
    ...
else:
    o.update(**o_data)
You might have already solved this problem by now, but marshmallow's schema.load() offers a partial flag that will ignore missing fields when loading the data.
Refer to the resource below:
https://marshmallow.readthedocs.io/en/stable/quickstart.html#partial-loading
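As a rough sketch (reusing sc, o, and new_values from the question), the validation and load calls can then be limited to the supplied fields:

errors = sc.validate(new_values, partial=True)
if not errors:
    o.update(**sc.load(new_values, partial=True))

Note that with partial=True the schema-level validator only sees the fields that were actually supplied, so cross-field checks that need the full object may still require merging in the existing values first.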
We'd like to enforce parameter checking before people can insert into a schema as shown below, but the code below doesn't work.
Is there a way to implement pre-insertion parameter checking?
@schema
class AutomaticCurationParameters(dj.Manual):
    definition = """
    auto_curation_params_name: varchar(200) # name of this parameter set
    ---
    merge_params: blob # dictionary of params to merge units
    label_params: blob # dictionary of params to label units
    """

    def insert1(key, **kwargs):
        # validate the labels and then insert
        # TODO: add validation for merge_params
        for metric in key['label_params']:
            if metric not in _metric_name_to_func:
                raise Exception(f'{metric} not in list of available metrics')
            comparison_list = key['label_params'][metric]
            if comparison_list[0] not in _comparison_to_function:
                raise Exception(f'{metric}: {comparison_list[0]} not in list of available comparisons')
            if type(comparison_list[1]) != int and type(comparison_list[1]) != float:
                raise Exception(f'{metric}: {comparison_list[1]} not a number')
            for label in comparison_list[2]:
                if label not in valid_labels:
                    raise Exception(f'{metric}: {comparison_list[2]} not a valid label: {valid_labels}')
        super().insert1(key, **kwargs)
This is a great question that has come up many times for us.
Most likely the issue is either that you are missing the class's self reference or that you are missing the case where the key is passed in as a keyword argument (it actually arrives as the row keyword in that case).
I'll demonstrate a simple example that hopefully illustrates how to inject your validation code, which you can then tweak to do what you intend above.
Suppose we want to track file paths within a dj.Manual table, but we'd like to validate that only file paths with a certain extension are inserted.
As you've already discovered, we can achieve this through overloading like so:
import datajoint as dj

schema = dj.Schema('rguzman_insert_validation')

@schema
class FilePath(dj.Manual):
    definition = '''
    file_id: int
    ---
    file_path: varchar(100)
    '''

    def insert1(self, *args, **kwargs):  # Notice that we need a reference to the class
        key = kwargs['row'] if 'row' in kwargs else args[0]  # Handles key passed as arg or kwarg
        if '.md' not in key['file_path']:
            raise Exception('Sorry, we only support Markdown files...')
        super().insert1(*args, **kwargs)
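For example (hypothetical calls, not from the original post), the overloaded method now rejects offending rows before they ever reach the database:

FilePath().insert1({'file_id': 1, 'file_path': 'README.md'})   # accepted
FilePath().insert1({'file_id': 2, 'file_path': 'notes.txt'})   # raises Exception: Sorry, we only support Markdown files...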
P.S. Though this example is meant to illustrate the concept, there is actually a better way of doing the above if you are using MySQL 8. MySQL provides a CHECK constraint that allows simple validation, and DataJoint will respect it. If those conditions are met, you can simplify it to:
import datajoint as dj

schema = dj.Schema('rguzman_insert_validation')

@schema
class FilePath(dj.Manual):
    definition = '''
    file_id: int
    ---
    file_path: varchar(100) CHECK(REGEXP_LIKE(file_path, '^.*\.md$', 'c'))
    '''
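With the constraint in the table definition, the server itself rejects offending rows (MySQL enforces CHECK constraints from 8.0.16 onwards), so a call like the hypothetical one below fails at the database level instead of needing Python-side code:

FilePath().insert1({'file_id': 2, 'file_path': 'notes.txt'})   # rejected by MySQL with a CHECK constraint violation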
I have a model A and want to make subclasses of it.
from django.db import models

class A(models.Model):
    type = models.ForeignKey(Type, on_delete=models.CASCADE)
    data = models.JSONField()

    def compute(self):
        pass

class B(A):
    def compute(self):
        df = self.go_get_data()
        self.data = self.process(df)

class C(A):
    def compute(self):
        df = self.go_get_other_data()
        self.data = self.process_another_way(df)

# ... other subclasses of A
B and C should not have their own tables, so I decided to use the proxy attribute of Meta. However, I want there to be a table of all the implemented proxies.
In particular, I want to keep a record of the name and description of each subclass.
For example, for B, the name would be "B" and the description would be the docstring for B.
So I made another model:
class Type(models.Model):
    # The name of the class
    name = models.CharField(max_length=100)
    # The docstring of the class
    desc = models.TextField()
    # A unique identifier, different from the Django ID,
    # that allows for smoothly changing the name of the class
    identifier = models.IntegerField()
Now, I want it so when I create an A, I can only choose between the different subclasses of A.
Hence the Type table should always be up-to-date.
For example, if I want to unit-test the behavior of B, I'll need to use the corresponding Type instance to create an instance of B, so that Type instance already needs to be in the database.
Looking over on the Django website, I see two ways to achieve this: fixtures and data migrations.
Fixtures aren't dynamic enough for my use case, since the attributes literally come from the code. That leaves me with data migrations.
I tried writing one that goes something like this:
from django.db import migrations

def update_results(apps, schema_editor):
    A = apps.get_model("app", "A")
    Type = apps.get_model("app", "Type")
    subclasses = get_all_subclasses(A)
    for cls in subclasses:
        id = cls.get_identifier()
        Type.objects.update_or_create(
            identifier=id,
            defaults=dict(name=cls.__name__, desc=cls.__doc__),
        )

class Migration(migrations.Migration):
    operations = [
        migrations.RunPython(update_results)
    ]
    # ... other stuff
The problem is, I don't see how to store the identifier within the class, so that the Django Model instance can recover it.
So far, here is what I have tried:
I have tried using the fairly new __init_subclass__ construct of Python. So my code now looks like:
class A:
    def __init_subclass__(cls, identifier=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if identifier is None:
            raise ValueError()
        cls.identifier = identifier
        Type.objects.update_or_create(
            identifier=identifier,
            defaults=dict(name=cls.__name__, desc=cls.__doc__),
        )
    # ... the rest of A

# The identifier should never change, so that even if the
# name of the class changes, we still know which subclass is referred to
class B(A, identifier=3):
    # ... the rest of B
But this update_or_create fails when the database is new (e.g. during unit tests), because the Type table does not exist.
When I have this problem in development (we're still in early stages, so deleting the DB is still sensible), I have to comment out the update_or_create in __init_subclass__, migrate, and then put it back in.
Of course, this solution is also not great because __init_subclass__ is run way more than necessary. Ideally this machinery would only happen at migration.
So there you have it! I hope the problem statement makes sense.
Thanks for reading this far and I look forward to hearing from you; even if you have other things to do, I wish you a good rest of your day :)
With a little help from Django-expert friends, I solved this with the post_migrate signal.
I removed the update_or_create call from __init_subclass__, and in project/app/apps.py I added:
from django.apps import AppConfig
from django.db.models.signals import post_migrate

def get_all_subclasses(cls):
    """Get all subclasses of a class, recursively.

    Used to get a list of all the implemented As.
    """
    all_subclasses = []
    for subclass in cls.__subclasses__():
        all_subclasses.append(subclass)
        all_subclasses.extend(get_all_subclasses(subclass))
    return all_subclasses

def update_As(sender=None, **kwargs):
    """Get a list of all implemented As and write them in the database.

    More precisely, each model is used to instantiate a Type, which will be used to identify As.
    """
    from app.models import A, Type

    subclasses = get_all_subclasses(A)
    for cls in subclasses:
        id = cls.identifier
        Type.objects.update_or_create(identifier=id, defaults=dict(name=cls.__name__, desc=cls.__doc__))

class MyAppConfig(AppConfig):
    default_auto_field = "django.db.models.BigAutoField"
    name = "app"

    def ready(self):
        post_migrate.connect(update_As, sender=self)
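As a rough illustration (the test and import names below are assumptions, not from the original post), the test database is created by running migrations, so post_migrate fires and the Type rows already exist when a unit test needs them:

from django.test import TestCase
from app.models import B, Type

class BTypeTest(TestCase):
    def test_b_has_a_registered_type(self):
        # update_As ran during the post_migrate phase of test-database setup
        b_type = Type.objects.get(identifier=B.identifier)
        self.assertEqual(b_type.name, "B")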
Hope this is helpful for future Django coders in need!
I have 3 marshmallow Schemas with Nested fields that form a dependency cycle/triangle. If I use the boilerplate from two-way nesting, I seem to have no problem.
from marshmallow import Schema, fields

class A(Schema):
    id = fields.Integer()
    b = fields.Nested('B')

class B(Schema):
    id = fields.Integer()
    c = fields.Nested('C')

class C(Schema):
    id = fields.Integer()
    a = fields.Nested('A')
However, I have my own, thin subclass of fields.Nested that looks something like the following:
from marshmallow import fields

class NestedRelationship(fields.Nested):
    def __init__(self, nested,
                 include_data=True,
                 **kwargs):
        super(NestedRelationship, self).__init__(nested, **kwargs)
        self.schema.is_relationship = True
        self.schema.include_relationship_data = include_data
When I change each Schema to use NestedRelationship instead of the native Nested type, I get:
marshmallow.exceptions.RegistryError: Class with name 'B' was not found. You may need to import the class.
NestedRelationship is a relatively thin subclass and I am surprised at the difference in behavior. Am I doing something wrong here? Am I not calling super appropriately?
The problem is the extra code that accesses self.schema. When the A.b field is defined, it tries to resolve schema B, which hasn't been registered yet. The native marshmallow.fields.Nested, by contrast, does not try to resolve the schema at construction time and thus does not have this problem.
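One possible workaround (a sketch offered here as an assumption, not part of the original answer) is to defer the customization until the nested schema is actually resolved, for example by overriding the lazily-evaluated schema property instead of touching it in __init__:

from marshmallow import fields

class NestedRelationship(fields.Nested):
    def __init__(self, nested, include_data=True, **kwargs):
        super(NestedRelationship, self).__init__(nested, **kwargs)
        self._include_data = include_data  # only remember the flag for later

    @property
    def schema(self):
        # Resolution happens here, after all schema classes have been registered
        schema = super(NestedRelationship, self).schema
        schema.is_relationship = True
        schema.include_relationship_data = self._include_data
        return schema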
Say I have an object, "Order," a field of which, "items," holds a list of order items. The list of items will never be searched or individually selected in the database so I just want to store it in a DB field as a JSON string.
I'm trying to figure out the best way to embed this functionality so it's fairly transparent to anyone using the model. I think saving the model is pretty easy: just override the save method, serialize the "items" list into an internal "_items" field, and then write that to the DB. I'm confused about how to deserialize, though. Having looked into possibly some kind of classmethod for creation, or creating a custom manager, or something to do with signals, I've thoroughly confused myself. I'm sure this has been solved a hundred times over and I'm curious what people consider to be best practice.
Example classes:
class OrderItem:
    def __init__(self, desc="", qty=0):
        self.desc = desc
        self.qty = qty

class Order(Model):
    user = ForeignKey(User)
    _items = TextField()

    def save(self, *args, **kwargs):
        self._items = jsonpickle.encode(self.items)
        super(Order, self).save(*args, **kwargs)
Example usage:
order = Order()
order.items = [OrderItem("widget", 5)]
order.save()
This would create a record in the DB in which
_items = [{"desc":"widget", "qty":5}]
Now I want to be able to later select the object
order = Order.objects.get(id=whatever)
and have order.items be the unpacked array of items, not the stored JSON string.
EDIT:
The solution turned out to be quite simple, and I'm posting here in case it helps any other newbies. Based on Daniel's suggestion, I went with this custom model field:
class JSONField(with_metaclass(SubfieldBase, TextField)):
    def db_type(self, connection):
        return 'JSONField'

    def to_python(self, value):
        if isinstance(value, basestring):
            return jsonpickle.decode(value)
        else:
            return value

    def get_prep_value(self, value):
        return jsonpickle.encode(value)
A much better approach is to subclass TextField and override the relevant methods to do the serialization/deserialization transparently as required. In fact there are a number of implementations of this already: here's one, for example.
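With a custom field like the JSONField shown in the edit above, the Order model could then (a sketch under the same assumptions, not part of the original post) store the list directly and get it back deserialized on load:

class Order(Model):
    user = ForeignKey(User)
    items = JSONField()

order = Order.objects.get(id=whatever)
order.items   # already the decoded list of OrderItem objects, not a JSON string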
OK, so I'm having a bit of a problem here. I need to create a sort of import/export functionality for some SQLAlchemy objects. These are not objects I'm defining, so to get the columns I'm doing:
for attr, value in res.__class__.__dict__.iteritems():
    if isinstance(value, InstrumentedAttribute):
        data = eval("res." + str(attr))
        param_dict[attr] = data
Now, this correctly gets me the attributes of that object. However, I can't be certain that the __init__ parameters have the same names as the fields, since I'm not the one writing these objects. So there could be a situation like:
class SomeClass(model.Base):
    my_column = Column(String)
    # ...some other stuff...

    def __init__(self, mycolumn, ...):
        self.my_column = mycolumn
So in this case I don't have any correspondence between the name of the field and the name of the parameter received by __init__. I'm currently constraining the people who define these classes to give all __init__ parameters a default value, so I can just do:
obj = SomeClass()
exec "obj." + attr + " = " + param[attr]
However, I would like to get away from even this constraint. Is there any way I can achieve this?
Serializing can't really be generalized for all possible SQLAlchemy mapped classes: classes might have properties that aren't stored in the database, or that must be inferred across multiple levels of indirection through relationship properties. In short, only you know how to serialize a particular class for a particular use.
Let's pretend that you only need or care about the column values for the specific instance of the specific class under consideration (in res). Here's a crude function that will return a dict containing only those values.
from sqlalchemy.orm.attributes import manager_of_class
from sqlalchemy.orm.properties import ColumnProperty

def get_state_dict(instance):
    cls = type(instance)
    mgr = manager_of_class(cls)
    return dict((key, getattr(instance, key))
                for key, attr in mgr.iteritems()
                if isinstance(attr.property, ColumnProperty))
and to recreate an instance from the dict*
def create_from_state_dict(cls, state_dict):
    mgr = manager_of_class(cls)
    instance = mgr.new_instance()
    for key, value in state_dict.iteritems():
        setattr(instance, key, value)
    return instance
If you need something more complex, probably handling the relationships not shown as columns (as in many-to-many relationships), you can probably add that case by looking for sqlalchemy.orm.properties.RelationshipProperty, and then iterating over the collection.
*Serializing the intermediate dict and class is left as an exercise.
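A rough usage sketch under the same assumptions (the Widget class below is hypothetical and not part of the original question, and the helpers above follow the answer's Python 2-era style):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    desc = Column(String(50))

w = Widget(id=1, desc='gear')
state = get_state_dict(w)                      # {'id': 1, 'desc': 'gear'}
clone = create_from_state_dict(Widget, state)  # a new, detached Widget with the same column values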