We'd like to enforce parameter checking before rows can be inserted into a table, as shown below, but the code doesn't work.
Is there a way to implement pre-insertion parameter checking?
@schema
class AutomaticCurationParameters(dj.Manual):
    definition = """
    auto_curation_params_name: varchar(200)   # name of this parameter set
    ---
    merge_params: blob   # dictionary of params to merge units
    label_params: blob   # dictionary of params to label units
    """

    def insert1(key, **kwargs):
        # validate the labels and then insert
        # TODO: add validation for merge_params
        for metric in key['label_params']:
            if metric not in _metric_name_to_func:
                raise Exception(f'{metric} not in list of available metrics')
            comparison_list = key['label_params'][metric]
            if comparison_list[0] not in _comparison_to_function:
                raise Exception(f'{metric}: {comparison_list[0]} not in list of available comparisons')
            if type(comparison_list[1]) != int and type(comparison_list) != float:
                raise Exception(f'{metric}: {comparison_list[1]} not a number')
            for label in comparison_list[2]:
                if label not in valid_labels:
                    raise Exception(f'{metric}: {comparison_list[2]} not a valid label: {valid_labels}')
        super().insert1(key, **kwargs)
This is a great question that has come up many times for us.
Most likely the issue is either that you are missing the class's self reference or that you are missing the case where the key is passed in as a keyword argument (we actually expect it as a row).
I'll demonstrate with a simple example that illustrates how to inject your validation code; you can tweak it to perform the checks you're intending above.
Suppose we want to track file paths in a dj.Manual table, but we'd like to validate that only file paths with a certain extension are inserted.
As you've already discovered, we can achieve this through overloading, like so:
import datajoint as dj

schema = dj.Schema('rguzman_insert_validation')

@schema
class FilePath(dj.Manual):
    definition = '''
    file_id: int
    ---
    file_path: varchar(100)
    '''

    def insert1(self, *args, **kwargs):  # notice that we need the self reference here
        key = kwargs['row'] if 'row' in kwargs else args[0]  # handles row passed as arg or kwarg
        if '.md' not in key['file_path']:
            raise Exception('Sorry, we only support Markdown files...')
        super().insert1(*args, **kwargs)
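With this in place, the validation fires on every insert. For instance (a hypothetical usage sketch with made-up rows):

FilePath().insert1({'file_id': 1, 'file_path': 'notes.md'})   # passes validation
FilePath().insert1({'file_id': 2, 'file_path': 'data.csv'})   # raises the Exception above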
P.S. Though this example is meant to illustrate the concept, there is actually a better way of doing the above if you are using MySQL 8. MySQL 8 provides a CHECK utility that allows simple validation, and DataJoint will respect it. If those conditions are met, you can simplify it to:
import datajoint as dj

schema = dj.Schema('rguzman_insert_validation')

@schema
class FilePath(dj.Manual):
    definition = '''
    file_id: int
    ---
    file_path: varchar(100) CHECK(REGEXP_LIKE(file_path, '^.*\.md$', 'c'))
    '''
Related
Problem: I have an object with some values, and a schema that defines field-level and schema-level validation for such an object using marshmallow.
The schema can validate serialized data, which, for most use cases, is enough. But when I want to partially update the object, I have to serialize the already existing object, update the returned dictionary with the new fields, validate the dict, and then reassign those values to the object. This is required to run the schema-level validation.
Question: I was wondering if there is a cleaner way to validate partial updates, for example by partially loading the data, assigning it to the existing object, and validating the object using the schema (not possible, because the validation function of a schema only accepts dicts).
Note: I cannot just recreate the object, because it is synced with the database. I have to assign the new values.
Example:
from marshmallow import Schema, fields, validates_schema

class Obj:
    def __init__(self):
        self.a = 1
        self.b = 2

    def update(self, **kwargs):
        self.__dict__.update(kwargs)

class ObjSchema(Schema):
    a = fields.Integer()
    b = fields.Integer()

    @validates_schema
    def validate_whole_schema(self, data, **kwargs):
        ...

o = Obj()
sc = ObjSchema()
new_values = {
    "a": 3
}

# ugly partial update
o_data = sc.dump(o)
o_data.update(new_values)
errors = sc.validate(o_data)
if errors:
    ...
else:
    o.update(**o_data)
You might have already solved this problem, but Marshmallow's schema.load() offers a "partial" flag that will ignore missing fields when loading the data.
Refer to the resource below:
https://marshmallow.readthedocs.io/en/stable/quickstart.html#partial-loading
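Applied to the example above, this removes the dump/update dance. A minimal sketch, assuming schema.validate() is given the same partial flag that schema.load() accepts:

# validate only the changed fields; partial=True skips missing ones
errors = sc.validate(new_values, partial=True)
if not errors:
    o.update(**new_values)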
I'm new to SQLAlchemy and I'm trying to understand some concepts.
Here is a code example without SQLAlchemy:
class Token:
    def __init__(self, key):
        # generate a token based on the given string
        ...

class User:
    def __init__(self, name):
        self.name = name
        self.token = Token(name)

    def check_token(self, token_to_check):
        return self.token.is_valid(token_to_check)
How can I move this to SQLAlchemy? I think I have to do something like:
class UserDatabase(Base):
    __tablename__ = 'testing_sql_alchemy_v2_users'
    name = Column(String(256))
    token = Column(String(256))
But when I get the User, token will be a String instead of a Token object. Can I create a Column with my own objects? For example:
token = Column(Token)
If I can't do this, all my objects that use a database must have only "simple" variables (string, int, etc.). I think this breaks OOP, right?
When defining columns in a model (class UserDatabase), you are limited to the types that exist in the database engine you are using.
However, some database engines allow you to overcome this difficulty.
In PostgreSQL it is possible to define your own custom types, either in pure SQL or with an ORM such as SQLAlchemy.
import sqlalchemy.types as types

class MyType(types.TypeDecorator):
    '''Prefixes Unicode values with "PREFIX:" on the way in and
    strips it off on the way out.
    '''
    impl = types.Unicode

    def process_bind_param(self, value, dialect):
        return "PREFIX:" + value

    def process_result_value(self, value, dialect):
        return value[7:]

    def copy(self, **kw):
        return MyType(self.impl.length)
source: SQLAlchemy Docs
As you can see, it implements additional logic on top of already existing types, so it is limited in that sense.
But for your use case it should be sufficient: you can build your own type on top of a varchar and add logic to token-ify that string.
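For instance, a minimal sketch, assuming a hypothetical Token class that keeps its original string in a .key attribute and can be rebuilt from that string:

import sqlalchemy.types as types

class TokenType(types.TypeDecorator):
    '''Stores a Token as a plain string, rebuilding the Token on load.'''
    impl = types.String
    cache_ok = True  # values are safe to cache (expected by SQLAlchemy 1.4+)

    def process_bind_param(self, value, dialect):
        # Token -> string on the way into the database
        return value.key if value is not None else None

    def process_result_value(self, value, dialect):
        # string -> Token object on the way out
        return Token(value) if value is not None else None

The column would then be declared as token = Column(TokenType(256)), and queries would return Token objects instead of strings.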
I have a custom EncryptedCharField, which I want to basically appear as a CharField when interfacing UI, but before storing/retrieving in the DB it encrypts/decrypts it.
The custom fields documentation says to:
1. add __metaclass__ = models.SubfieldBase
2. override to_python to convert the data from its raw storage into the desired format
3. override get_prep_value to convert the value before storing to the db
So you would think this would be easy enough: for 2, just decrypt the value, and for 3, just encrypt it.
Based loosely on a django snippet, and the documentation this field looks like:
class EncryptedCharField(models.CharField):
    """Just like a char field, but encrypts the value before it enters the database,
    and decrypts it when it retrieves it"""
    __metaclass__ = models.SubfieldBase

    def __init__(self, *args, **kwargs):
        super(EncryptedCharField, self).__init__(*args, **kwargs)
        cipher_type = kwargs.pop('cipher', 'AES')
        self.encryptor = Encryptor(cipher_type)

    def get_prep_value(self, value):
        return encrypt_if_not_encrypted(value, self.encryptor)

    def to_python(self, value):
        return decrypt_if_not_decrypted(value, self.encryptor)

def encrypt_if_not_encrypted(value, encryptor):
    if isinstance(value, EncryptedString):
        return value
    else:
        encrypted = encryptor.encrypt(value)
        return EncryptedString(encrypted)

def decrypt_if_not_decrypted(value, encryptor):
    if isinstance(value, DecryptedString):
        return value
    else:
        decrypted = encryptor.decrypt(value)
        return DecryptedString(decrypted)

class EncryptedString(str):
    pass

class DecryptedString(str):
    pass
and the Encryptor looks like:
class Encryptor(object):
    def __init__(self, cipher_type):
        imp = __import__('Crypto.Cipher', globals(), locals(), [cipher_type], -1)
        self.cipher = getattr(imp, cipher_type).new(settings.SECRET_KEY[:32])

    def decrypt(self, value):
        # values should always be encrypted no matter what!
        # raise an error if things may have been tampered with
        return self.cipher.decrypt(binascii.a2b_hex(str(value))).split('\0')[0]

    def encrypt(self, value):
        if value is not None and not isinstance(value, EncryptedString):
            padding = self.cipher.block_size - len(value) % self.cipher.block_size
            if padding and padding < self.cipher.block_size:
                value += "\0" + ''.join([random.choice(string.printable) for index in range(padding - 1)])
            value = EncryptedString(binascii.b2a_hex(self.cipher.encrypt(value)))
        return value
When saving a model, an "Odd-length string" error occurs as a result of attempting to decrypt an already-decrypted string. When debugging, it appears that to_python ends up being called twice: the first time with the encrypted value, and the second time with the decrypted value, though not as a DecryptedString but as a raw string, which causes the error. Furthermore, get_prep_value is never called.
What am I doing wrong?
This should not be that hard - does anyone else think this Django field code is very poorly written, especially when it comes to custom fields, and not that extensible? Simple overridable pre_save and post_fetch methods would easily solve this problem.
I think the issue is that to_python is also called when you assign a value to your custom field (perhaps as part of validation, based on this link). So the problem is to distinguish between to_python calls in the following situations:
1. When a value from the database is assigned to the field by Django (that's when you want to decrypt the value)
2. When you manually assign a value to the custom field, e.g. record.field = value
One hack you could use is to add prefix or suffix to the value string and check for that instead of doing isinstance check.
I was going to write an example, but I found this one (even better :)).
Check BaseEncryptedField:
https://github.com/django-extensions/django-extensions/blob/2.2.9/django_extensions/db/fields/encrypted.py (link to an older version because the field was removed in 3.0.0; see Issue #1359 for reason of deprecation)
Source:
Django Custom Field: Only run to_python() on values from DB?
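For illustration, the prefix hack might look something like this (a hypothetical sketch, not taken from the linked field; the sentinel string is made up):

PREFIX = '$ENC$'  # hypothetical sentinel prepended to values stored in the db

def to_python(self, value):
    if isinstance(value, str) and value.startswith(PREFIX):
        # came from the database: strip the sentinel and decrypt
        return self.encryptor.decrypt(value[len(PREFIX):])
    return value  # manually assigned plain value; leave as-is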
You should be overriding to_python, like the snippet did.
If you take a look at the CharField class you can see that it doesn't have a value_to_string method:
django/db/models/fields/__init__.py
The docs say that the to_python method needs to deal with three things:
1. An instance of the correct type
2. A string (e.g., from a deserializer)
3. Whatever the database returns for the column type you're using
You are currently only dealing with the third case.
One way to handle this is to create a special class for a decrypted string:
class DecryptedString(str):
    pass
Then you can detect this class and handle it in to_python():
def to_python(self, value):
    if isinstance(value, DecryptedString):
        return value
    decrypted = self.encryptor.decrypt(value)
    return DecryptedString(decrypted)
This prevents you from decrypting more than once.
You forgot to set the metaclass:
class EncryptedCharField(models.CharField):
    __metaclass__ = models.SubfieldBase
The custom fields documentation explains why this is necessary.
Since this question was originally answered, a number of packages have been written to solve this exact problem.
For example, as of 2018, the package django-encrypted-model-fields handles this with a syntax like
from encrypted_model_fields.fields import EncryptedCharField

class MyModel(models.Model):
    encrypted_char_field = EncryptedCharField(max_length=100)
    ...
As a rule of thumb, it's usually a bad idea to roll your own solution to a security challenge when a more mature solution exists out there -- the community is a better tester and maintainer than you are.
You need to add a to_python method that deals with a number of cases, including passing on an already-decrypted value.
(warning: snippet is cut from my own code - just for illustration)
def to_python(self, value):
    if not value:
        return
    if isinstance(value, _Param):  # THIS IS THE PASSING-ON CASE
        return value
    elif isinstance(value, unicode) and value.startswith('{'):
        param_dict = str2dict(value)
    else:
        try:
            param_dict = pickle.loads(str(value))
        except:
            raise TypeError('unable to process {}'.format(value))
    param_dict['par_type'] = self.par_type
    classname = '{}_{}'.format(self.par_type, param_dict['rule'])
    return getattr(get_module(self.par_type), classname)(**param_dict)
By the way: instead of get_db_prep_value you should use get_prep_value (the former is for db-specific conversions; see https://docs.djangoproject.com/en/1.4/howto/custom-model-fields/#converting-python-objects-to-query-values).
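A minimal sketch of that distinction, applied to the question's field (method signatures as in Django 1.4):

def get_prep_value(self, value):
    # general Python value -> query value, database-agnostic
    return encrypt_if_not_encrypted(value, self.encryptor)

def get_db_prep_value(self, value, connection, prepared=False):
    # backend-specific conversion; normally just delegates to the parent
    return super(EncryptedCharField, self).get_db_prep_value(value, connection, prepared)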
Ok, so I'm having a bit of a problem here. I need to be able to create a sort of import/export functionality for some SQLAlchemy objects. Now these are not objects I'm defining, so to get the columns I'm doing:

for attr, value in res.__class__.__dict__.iteritems():
    if isinstance(value, InstrumentedAttribute):
        data = eval("res." + str(attr))
        param_dict[attr] = data

Now this correctly gets me the attributes of that object. However, I can't be certain that the parameters of __init__ are the same, since I'm not the one handling these objects. So there could be a situation like:
class SomeClass(model.Base):
    my_column = Column(String)
    # ...some other stuff...

    def __init__(self, mycolumn, ...):
        self.my_column = mycolumn
So in this case I don't have any correspondence between the name of the field and the name of the parameter as received by __init__. I'm currently constraining the ones who define these classes to give all the __init__ parameters a default value, so I can just do:
obj = SomeClass()
exec "obj." + attr + " = " + param[attr]
However, I would like to get away even from this constraint. Is there any way I can achieve this?
Serializing can't really be generalized for all possible SQLAlchemy-mapped classes; classes might have properties that aren't stored in the database, or that must be inferred across multiple levels of indirection through relationship properties. In short, only you know how to serialize a particular class for a particular use.
Let's pretend that you only need or care about the column values for the specific instance of the specific class under consideration (in res). Here's a crude function that will return a dict containing only those values.
from sqlalchemy.orm.attributes import manager_of_class
from sqlalchemy.orm.properties import ColumnProperty

def get_state_dict(instance):
    cls = type(instance)
    mgr = manager_of_class(cls)
    return dict((key, getattr(instance, key))
                for key, attr in mgr.iteritems()
                if isinstance(attr.property, ColumnProperty))
and to recreate an instance from the dict*
def create_from_state_dict(cls, state_dict):
    mgr = manager_of_class(cls)
    instance = mgr.new_instance()
    for key, value in state_dict.iteritems():
        setattr(instance, key, value)
    return instance
If you need something more complex, probably handling the relationships not shown as columns (as in many-to-many relationships), you can probably add that case by looking for sqlalchemy.orm.properties.RelationshipProperty, and then iterating over the collection.
*Serializing the intermediate dict and class is left as an exercise.
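A hypothetical round trip with the two helpers, reusing the question's res and SomeClass:

state = get_state_dict(res)                       # e.g. {'my_column': u'foo'}
clone = create_from_state_dict(SomeClass, state)  # new, unpersisted instance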
I wanted to write some code like this:
class SomeModel(models.Model):
    field = models.ForeignKey(SomeOtherModel)

    def __init__(self, *args, **kwargs):
        super(SomeModel, self).__init__(*args, **kwargs)
        if self.field is None:
            self.field = SomeOtherModel()
        ...
However this raises self.field.rel.to.DoesNotExist. The Django code is very clear on that:
class ReverseSingleRelatedObjectDescriptor(object):
    def __get__(self, instance, instance_type=None):
        ...
        if val is None:
            # If NULL is an allowed value, return it.
            if self.field.null:
                return None
            raise self.field.rel.to.DoesNotExist
An obvious workaround would of course be to make the field nullable; however, as far as I understand, that would actually affect the database schema, and I like the integrity checks Django offers. Another option would be to catch the exception and handle it appropriately. However, this adds a lot of boilerplate code, especially when there are multiple fields like that (a separate try...except block for each one of them; now that's ugly). What could I do?
I could use initial, however this is quite limited when it comes to foreign keys. I do not always know, at the moment of creation, the kind of default I would like to have; I will, however, know it at the initialization phase. Moreover, it could then depend on the values of the other fields.
Check if it has the attribute set -
if hasattr(self, 'field')
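Applied to the question's __init__, that might look like this (a sketch; it works because hasattr returns False when accessing the unset ForeignKey raises, either because Python 2's hasattr swallows any exception or because modern Django's RelatedObjectDoesNotExist subclasses AttributeError):

class SomeModel(models.Model):
    field = models.ForeignKey(SomeOtherModel)

    def __init__(self, *args, **kwargs):
        super(SomeModel, self).__init__(*args, **kwargs)
        if not hasattr(self, 'field'):  # no related object assigned yet
            self.field = SomeOtherModel()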
The proper way to refer to a field in a form is like this:
self.fields['myfield']
so, in your case, the null check should go like this:
self.fields['myfield'].value is None
On another note, don't use reserved or nearly reserved words like 'field' to name your fields.