pydantic convert to jsonable dict (not full json string) - python

I'd like to use pydantic for handling data (bidirectionally) between an API and a datastore, due to its nice support for several types I care about that are not natively JSON-serializable. It has better read/validation support than my current approach, but I also need to create JSON-serializable dict objects to write out.
from uuid import UUID, uuid4
from pydantic import BaseModel

class Model(BaseModel):
    the_id: UUID

instance = Model(the_id=uuid4())
print("1: %s" % instance.dict())
print("2: %s" % instance.json())
which prints:
1: {'the_id': UUID('4108356a-556e-484b-9447-07b56a664763')}
2: {"the_id": "4108356a-556e-484b-9447-07b56a664763"}
I'd like the following instead:
{"the_id": "4108356a-556e-484b-9447-07b56a664763"}  # i.e. a "json-compatible" dict
It appears that pydantic has all the mappings, but I can't find any usage of that serialization machinery outside the standard recursive JSON encoder (json.dumps(..., default=pydantic_encoder) in pydantic/main.py). I'd prefer to keep to one library both for validating raw->obj (pydantic is great at this) and for obj->raw(dict), so that I don't have to manage multiple serialization mappings. I suppose I could implement something similar to the JSON usage of the encoder, but surely this is a common use case?
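For reference, the recursive encoder usage I'm referring to looks roughly like this (in pydantic v1 the encoder is importable from pydantic.json) - but it still produces a full JSON string rather than the dict I want:

import json
from pydantic.json import pydantic_encoder

# handles UUIDs, datetimes, etc., but the result is a JSON string, not a dict
json.dumps(instance.dict(), default=pydantic_encoder)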
Other approaches, such as dataclasses (builtin) plus libraries like dataclasses_jsonschema, provide this serialization to a JSON-ready dict, but again, I'm hoping to use pydantic for its more robust input validation while keeping things symmetrical.

The current version of pydantic does not support creating a jsonable dict directly, but you can use the following trick:
import json
from uuid import UUID, uuid4
from pydantic import BaseModel, Field

class Model(BaseModel):
    the_id: UUID = Field(default_factory=uuid4)

print(json.loads(Model().json()))
# {'the_id': '4c94e7bc-78fe-48ea-8c3b-83c180437774'}
Or, more efficiently, by means of orjson:
orjson.loads(Model().json())

It appears this functionality has been proposed, and (maybe) favored by pydantic's author Samuel Colvin: https://github.com/samuelcolvin/pydantic/issues/951#issuecomment-552463606
which proposes adding a simplify parameter to Model.dict() to output jsonable data.
This code runs in a production API layer and is exercised heavily enough that we can't use the one-line workaround suggested above (a full serialize via .json() followed by a full deserialize). We implemented a custom function to do this, descending the result of .dict() and converting values to jsonable types - hopefully the proposed functionality is added to pydantic in the future.
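A minimal sketch of that kind of helper (not the actual production code; the handled types here are only examples):

import datetime
import uuid

def to_jsonable(value):
    """Recursively convert the output of Model.dict() into json-compatible types."""
    if isinstance(value, dict):
        return {key: to_jsonable(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    return value

payload = to_jsonable(instance.dict())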

Another alternative is to use the jsonable_encoder function from FastAPI, if you're using that already: https://fastapi.tiangolo.com/tutorial/encoder/
The code seems pretty self-contained, so you could copy-paste it if the license allows.
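Assuming FastAPI is already a dependency, usage is a one-liner (reusing the instance from the question):

from fastapi.encoders import jsonable_encoder

payload = jsonable_encoder(instance)
# {'the_id': '4108356a-556e-484b-9447-07b56a664763'}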

Related

Can I extend Django model fields without creating mixed types?

I have subclassed built-in model fields to reduce repetition in similar columns. This triggers exceptions in tests against Django 3.2 (but, interestingly, does work in the now unsupported version 2.2):
django.core.exceptions.FieldError: Expression contains mixed types: DecimalField, DecimalFWB. You must set output_field.
from decimal import Decimal
from django.core.validators import MinValueValidator
from django.db.models import DecimalField, F, Model

class DecimalFWB(DecimalField):
    @property
    def validators(self):
        return super().validators + [MinValueValidator(0.1), ]
    ...

class Repro(Model):
    frac = DecimalFWB(max_digits=4, decimal_places=4, default=Decimal("0.2"))
    ...

# same internal type
assert DecimalFWB().get_internal_type() == DecimalField().get_internal_type()

# 3.2: works
# 2.2: works
Repro.objects.annotate(dec_annotation=-F("frac") + Decimal(1)).first()

# 3.2: django.core.exceptions.FieldError
# 2.2: works
Repro.objects.annotate(dec_annotation=Decimal(1) - F("frac")).first()
I found this entry in the Django 3.2 release notes that could explain the change in behaviour from the earlier version:
[..] resolving an output_field for database functions and combined expressions may now crash with mixed types when using Value(). You will need to explicitly set the output_field in such cases.
That suggestion does not solve my problem. If I were to bloat all annotations with ExpressionWrapper/output_field=, I could just as well bloat the model definition and not use the subclass in the first place.
I am trying to emulate the internal type. I want the combined output_field of DecimalField and DecimalFWB to be DecimalField - regardless of order of super/subclass. How do I express that no mixing is happening here?
Automatically selecting the shared field as the output has been fixed as of Bug #33397, released in Django 4.1 (but not backported). The change does, however, come with a warning (emphasis mine):
As a guess, if the output fields of all source fields match then
simply infer the same type here.
This guess is mostly a bad idea, but there is quite a lot of code
(especially 3rd party Func subclasses) that depend on it, we'd need a
deprecation path to fix it.
Meaning this behaviour might change again in a future release, but at least then intentionally, with a proper deprecation path that reliably raises a DeprecationWarning.
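For versions before the 4.1 fix, the explicit output_field route from the release notes (the approach the question dismisses as bloat) would look roughly like this, with field arguments assumed to match the model:

from decimal import Decimal
from django.db.models import DecimalField, ExpressionWrapper, F

Repro.objects.annotate(
    dec_annotation=ExpressionWrapper(
        Decimal(1) - F("frac"),
        output_field=DecimalField(max_digits=4, decimal_places=4),
    )
).first()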

How to test if object has a type odict_values in Python3?

On a project I have a generic function which can take different data types as input data. While migrating the project to Python 3 I ran into an issue with odict_values. I need to convert those to a list; unfortunately, not all data types should be converted. So I decided to do something like this:
if isinstance(data, odict_values):
    data = list(data)
But I get an error - undefined variable odict_values. I don't understand what I should provide as the second argument to isinstance. I can clearly see <class 'odict_values'> if I use type(data). The best solution I've come up with so far is to use:
str(type(data)) == "<class 'odict_values'>"
but it feels wrong.
The odict_values type is not accessible in the built-in types, nor in the collections module.
That means you have to define it yourself:
from collections import OrderedDict
odict_values = type(OrderedDict().values())
You can (and probably should) use a more descriptive name for this type than odict_values.
You can then use this type as the second argument for isinstance checks:
isinstance({1: 1}.values(), odict_values) # False
isinstance(OrderedDict([(1, 1)]).values(), odict_values) # True
If you want a more general test if it's a view on the values of a mapping (like dict and OrderedDict), then you could use the abstract base class ValuesView:
from collections.abc import ValuesView
isinstance({1: 1}.values(), ValuesView) # True
isinstance(OrderedDict([(1, 1)]).values(), ValuesView) # True
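Applied to the snippet from the question, the check then becomes:

from collections.abc import ValuesView

if isinstance(data, ValuesView):
    data = list(data)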

How to use default datetime serialization in Django REST Framework?

I've got a Django REST Framework serializer containing the following:
from rest_framework import serializers

class ThingSerializer(serializers.ModelSerializer):
    last_changed = serializers.SerializerMethodField(read_only=True)

    def get_last_changed(self, instance: Thing) -> str:
        log_entry = LogEntry.objects.get_for_object(instance).latest()
        representation: str = serializers.DateTimeField('%Y-%m-%dT%H:%M:%SZ').to_representation(log_entry.timestamp)
        return representation
This is problematic because if the datetime formatting ever changes it will be different to all the other datetimes. I want to reuse the code path which DRF uses to serialize other datetime fields.
What I've tried so far:
The only answer which looked relevant doesn't actually produce the same result as DRF (it includes milliseconds, which DRF does not), presumably because it's using the Django rather than DRF serializer.
rest_framework.serializers.DateTimeField().to_representation(log_entry.timestamp), rest_framework.fields.DateTimeField().to_representation(log_entry.timestamp) and rest_framework.fields.DateTimeField(format=api_settings.DATETIME_FORMAT).to_representation(log_entry.timestamp) don't work either; they produce strings with microsecond accuracy. I've verified with a debugger that DRF calls the latter when serializing other fields, so I can't understand why it produces a different result in my case.
LogEntry.timestamp is declared as a django.db.models.DateTimeField, but if I try something like LogEntry.timestamp.to_representation(log_entry.timestamp) it fails badly:
AttributeError: 'DeferredAttribute' object has no attribute 'to_representation'
Taking a look through the source of DRF, the interesting stuff is happening in rest_framework/fields.py.
In particular, all of the formatting stuff is happening directly in the DateTimeField.to_representation method.
You have a couple of ways of replicating DRF's behaviour.
First, you could just not pass a format at all. DRF should use its default if you don't explicitly supply a format.
representation: str = serializers.DateTimeField().to_representation(log_entry.timestamp)
Alternatively, keep doing what you're doing, but explicitly pass the format string from DRF's api_settings.DATETIME_FORMAT. This might feel less magical, but honestly it's probably more brittle to API changes in the future.
This might look like:
from rest_framework.settings import api_settings
...
representation: str = serializers.DateTimeField(api_settings.DATETIME_FORMAT).to_representation(log_entry.timestamp)
However, given that you attempted the first and it failed, we need to look a bit deeper!
The default datetime format for DRF is ISO_8601, and its handling includes the following code:
value = value.isoformat()
if value.endswith('+00:00'):
    value = value[:-6] + 'Z'
return value
That is, it effectively just leans on the python isoformat function.
isoformat formats the value differently depending on whether it has microseconds.
From the Python docs, isoformat will:
Return a string representing the date and time in ISO 8601 format, YYYY-MM-DDTHH:MM:SS.ffffff or, if microsecond is 0, YYYY-MM-DDTHH:MM:SS
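You can see the difference directly in the standard library:

from datetime import datetime

datetime(2021, 1, 1, 12, 0, 0, 123456).isoformat()  # '2021-01-01T12:00:00.123456'
datetime(2021, 1, 1, 12, 0, 0).isoformat()          # '2021-01-01T12:00:00'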
In this case, the solution is to explicitly set the microseconds to zero on the timestamp. There are a couple of ways to do this. One is to go via a Unix timestamp, clipping to whole seconds and converting back:
from datetime import datetime, timezone
ts = int(log_entry.timestamp.timestamp())  # clip to whole seconds
representation: str = serializers.DateTimeField().to_representation(
    datetime.fromtimestamp(ts, tz=timezone.utc))
or keep using the datetime object directly, which has better timezone handling:
representation: str = serializers.DateTimeField().to_representation(
    log_entry.timestamp.replace(microsecond=0)
)
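Putting that together, the question's serializer method might end up looking like this (a sketch, keeping the LogEntry lookup from the question as-is):

from rest_framework import serializers

class ThingSerializer(serializers.ModelSerializer):
    last_changed = serializers.SerializerMethodField()

    def get_last_changed(self, instance: Thing) -> str:
        log_entry = LogEntry.objects.get_for_object(instance).latest()
        # drop microseconds so the ISO_8601 output matches DRF's other datetime fields
        return serializers.DateTimeField().to_representation(
            log_entry.timestamp.replace(microsecond=0)
        )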

Pythonic way to parse command line output into a container object

Please read this whole question before answering, as it's not what you think... I'm looking at creating python object wrappers that represent hardware devices on a system (trimmed example below).
class TPM(object):
    @property
    def attr1(self):
        """
        Protects value from being accidentally modified after
        the constructor is called.
        """
        return self._attr1

    def __init__(self, attr1, ...):
        self._attr1 = attr1
        ...

    @classmethod
    def scan(cls):
        """Calls Popen, parses to dict, and passes **dict to constructor"""
Most of the constructor inputs come from running command-line tools via subprocess.Popen and then parsing the output to fill in object attributes. I've come up with a few ways to handle this, but I'm unsatisfied with what I've put together so far and am trying to find a better solution. Here are the common catches I've found. (Quick note: tool versions are tightly controlled, so parsed outputs don't change unexpectedly.)
Many tools produce variant outputs, sometimes including fields and sometimes not. This means that if you assemble a dict to be wrapped in a container object, the constructor is more or less forced to take **kwargs and not really have defined fields. I don't like this because it makes static analysis via pylint, etc less than useful. I'd prefer a defined interface so that sphinx documentation is clearer and errors can be more reliably detected.
In lieu of **kwargs, I've also tried setting default args to None for many of the fields, with what ends up as pretty ugly results. One thing I dislike strongly about this option is that optional fields don't always come at the end of the command line tool output. This makes it a little mind-bending to look at the constructor and match it up to tool output.
I'd greatly prefer to avoid constructing a dictionary in the first place, but using setattr to create attributes will make pylint unable to detect the _attr1, etc... and create warnings. Any ideas here are welcome...
Basically, I am looking for the proper Pythonic way to do this. My requirements, for a re-summary are the following:
Command line tool output parsed into a container object.
Container object protects attributes via properties post-construction.
Varying number of inputs to constructor, with working static analysis and error detection for missing required fields during runtime.
Is there a good way of doing this (hopefully without a ton of boilerplate code) in Python? If so, what is it?
EDIT:
Per some of the clarification requests, we can take a look at the tpm_version command. Here's the output for my laptop, but for this TPM it doesn't include every possible attribute. Sometimes, the command will return extra attributes that I also want to capture. This makes parsing to known attribute names on a container object fairly difficult.
TPM 1.2 Version Info:
Chip Version: 1.2.4.40
Spec Level: 2
Errata Revision: 3
TPM Vendor ID: IFX
Vendor Specific data: 04280077 0074706d 3631ffff ff
TPM Version: 01010000
Manufacturer Info: 49465800
Example code (ignore the lack of sanity checks, please; trimmed for brevity):
def __init__(self, chip_version, spec_level, errata_revision,
             tpm_vendor_id, vendor_specific_data, tpm_version,
             manufacturer_info):
    self._chip_version = chip_version
    ...

@classmethod
def scan(cls):
    tpm_proc = Popen("/usr/sbin/tpm_version", stdout=PIPE)
    stdout, stderr = tpm_proc.communicate()
    tpm_dict = dict()
    for line in stdout.splitlines():
        if "Version Info:" in line:
            pass
        else:
            split_line = line.split(":")
            attribute_name = (
                split_line[0].strip().replace(' ', '_').lower())
            tpm_dict[attribute_name] = split_line[1].strip()
    return cls(**tpm_dict)
The problem here is that this tool (or another one whose source I can't review to learn every possible field) could emit extra fields that my parser handles but my object doesn't capture. That's what I'm really trying to solve in an elegant way.
I've been working on a more solid answer to this over the last few months, since I basically work on hardware support libraries, and I've finally come up with a satisfactory (though pretty verbose) approach:
Parse the tool outputs, whatever they look like, into object structures that match up to how the tool views the device. These can have very generic dict structures, but should be broken out as much as possible.
Create another container class on top of that which uses properties to access items in the tool-container objects. This enforces an API and can return sane errors across multiple versions of the tool, and across differing tool outputs!
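A much-condensed sketch of that layering (the function and class split here is illustrative, not the actual library code):

from subprocess import PIPE, Popen

def scan_raw():
    """Layer 1: dump whatever tpm_version printed into a generic dict."""
    stdout, _ = Popen(["/usr/sbin/tpm_version"], stdout=PIPE, text=True).communicate()
    raw = {}
    for line in stdout.splitlines():
        if ":" in line and "Version Info:" not in line:
            key, _, value = line.partition(":")
            raw[key.strip().replace(" ", "_").lower()] = value.strip()
    return raw

class TPM:
    """Layer 2: a fixed, documented API over the raw dict."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def chip_version(self):
        try:
            return self._raw["chip_version"]
        except KeyError:
            raise AttributeError("tool output did not include 'Chip Version'")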

tornado maps GET and POST arguments to lists. How can I disable this "feature"?

The HTTPRequest class in the tornado* web framework helpfully maps GET and POST arguments to lists. I understand why -- in case a given argument name is used multiple times. But for some RequestHandlers, this is a pain. For instance, if I want to pass a json object and parse it as-is on the server.
What's the most straightforward way to disable the map-to-list behavior so that I can send unaltered json to a tornado/cyclone server?
*Cyclone, actually, in case there's an implementation difference here.
Instead of accessing self.request.arguments directly you should use the accessor functions:
self.get_argument("ID", default=None, strip=False)
This returns a single item.
If you want to turn the arguments into a JSON object you can quite easily do so:
json.dumps({ k: self.get_argument(k) for k in self.request.arguments })
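If the goal is really to receive unaltered JSON, another common route (not part of the accessor-function answer above; the handler name here is made up) is to skip the form-argument machinery entirely and parse the raw request body:

import json
import tornado.web

class JSONHandler(tornado.web.RequestHandler):
    def post(self):
        # the raw body is untouched by tornado's argument parsing
        data = json.loads(self.request.body)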
I'm going to go with "you're out of luck." You could re-write the class in question (looks like that would not be fun), but aside from that I don't see many options.
I would just use a dict comprehension.
# in Python 3, the values in request.arguments are lists of byte strings
{k: b''.join(v).decode() for k, v in self.request.arguments.items()}
