from typing import Union
from pydantic import BaseModel, Field

class Category(BaseModel):
    name: str = Field(alias="name")

class OrderItems(BaseModel):
    name: str = Field(alias="name")
    category: Category = Field(alias="category")
    unit: Union[str, None] = Field(alias="unit")
    quantity: int = Field(alias="quantity")
When instantiated like this:
OrderItems(**{'name': 'Test','category':{'name': 'Test Cat'}, 'unit': 'kg', 'quantity': 10})
It returns data like this:
OrderItems(name='Test', category=Category(name='Test Cat'), unit='kg', quantity=10)
But I want the output like this:
OrderItems(name='Test', category='Test Cat', unit='kg', quantity=10)
How can I achieve this?
You should try as much as possible to define your schema the way you actually want the data to look in the end, not the way you might receive it from somewhere else.
UPDATE: Generalized solution (one nested field or more)
To generalize this problem, let's assume you have the following models:
from pydantic import BaseModel

class Foo(BaseModel):
    x: bool
    y: str
    z: int

class _BarBase(BaseModel):
    a: str
    b: float

    class Config:
        orm_mode = True

class BarNested(_BarBase):
    foo: Foo

class BarFlat(_BarBase):
    foo_x: bool
    foo_y: str
Problem: You want to be able to initialize BarFlat with a foo argument just like BarNested, but the data to end up in the flat schema, wherein the fields foo_x and foo_y correspond to x and y on the Foo model (and you are not interested in z).
Solution: Define a custom root_validator with pre=True that checks if a foo key/attribute is present in the data. If it is, it validates the corresponding object against the Foo model, grabs its x and y values and then uses them to extend the given data with foo_x and foo_y keys:
from pydantic import BaseModel, root_validator
from pydantic.utils import GetterDict

...

class BarFlat(_BarBase):
    foo_x: bool
    foo_y: str

    @root_validator(pre=True)
    def flatten_foo(cls, values: GetterDict) -> GetterDict | dict[str, object]:
        foo = values.get("foo")
        if foo is None:
            return values
        # Assume `foo` must be valid `Foo` data:
        foo = Foo.validate(foo)
        return {
            "foo_x": foo.x,
            "foo_y": foo.y,
        } | dict(values)
Note that we need to be a bit more careful inside a root validator with pre=True because the values may be passed in the form of a GetterDict (for example when the model is populated via from_orm), which is an immutable mapping-like object. So we cannot simply assign new values foo_x/foo_y to it like we would to a dictionary. But nothing is stopping us from returning the cleaned-up data in the form of a regular old dict.
To demonstrate, we can throw some test data at it:
test_dict = {"a": "spam", "b": 3.14, "foo": {"x": True, "y": ".", "z": 0}}
test_orm = BarNested(a="eggs", b=-1, foo=Foo(x=False, y="..", z=1))
test_flat = '{"a": "beans", "b": 0, "foo_x": true, "foo_y": ""}'
bar1 = BarFlat.parse_obj(test_dict)
bar2 = BarFlat.from_orm(test_orm)
bar3 = BarFlat.parse_raw(test_flat)
print(bar1.json(indent=4))
print(bar2.json(indent=4))
print(bar3.json(indent=4))
The output:
{
"a": "spam",
"b": 3.14,
"foo_x": true,
"foo_y": "."
}
{
"a": "eggs",
"b": -1.0,
"foo_x": false,
"foo_y": ".."
}
{
"a": "beans",
"b": 0.0,
"foo_x": true,
"foo_y": ""
}
The first example simulates a common situation, where the data is passed to us in the form of a nested dictionary. The second example is the typical database ORM object situation, where BarNested represents the schema we find in a database. The third is just to show that we can still correctly initialize BarFlat without a foo argument.
One caveat to note is that the validator does not get rid of the foo key, if it finds it in the values. If your model is configured with Extra.forbid that will lead to an error. In that case, you'll just need to have an extra line, where you coerce the original GetterDict to a dict first, then pop the "foo" key instead of getting it.
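The dictionary handling described in this caveat can be sketched without pydantic. This is only an illustration under stated assumptions: the function below is a plain helper (not the validator itself), and the nested "foo" value is assumed to be a plain dict rather than something validated through Foo. Inside the real validator, the same two steps (copy to a dict, then pop) would replace the values.get("foo") call.

```python
from typing import Any, Dict, Mapping


def flatten_foo(values: Mapping[str, Any]) -> Dict[str, Any]:
    """Replace a nested "foo" entry by flat foo_x/foo_y keys."""
    data = dict(values)          # copy into a mutable dict first
    foo = data.pop("foo", None)  # remove the key instead of merely reading it
    if foo is None:
        return data
    # Assumption for this sketch: `foo` is a plain dict with "x" and "y".
    # Explicitly given foo_x/foo_y still win, as in the validator above.
    return {"foo_x": foo["x"], "foo_y": foo["y"], **data}


result = flatten_foo({"a": "spam", "b": 3.14, "foo": {"x": True, "y": ".", "z": 0}})
print(result)  # {'foo_x': True, 'foo_y': '.', 'a': 'spam', 'b': 3.14}
```

Because "foo" is no longer part of the returned data, a model configured with Extra.forbid would not see it as an extra field.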
Original post (flatten single field)
If you need the nested Category model for database insertion, but you want a "flat" order model with category being just a string in the response, you should split that up into two separate models.
Then in the response model you can define a custom validator with pre=True to handle the case when you attempt to initialize it providing an instance of Category or a dict for category.
Here is what I suggest:
from pydantic import BaseModel, validator

class Category(BaseModel):
    name: str

class OrderItemBase(BaseModel):
    name: str
    unit: str | None
    quantity: int

class OrderItemCreate(OrderItemBase):
    category: Category

class OrderItemResponse(OrderItemBase):
    category: str

    @validator("category", pre=True)
    def handle_category_model(cls, v: object) -> object:
        if isinstance(v, Category):
            return v.name
        if isinstance(v, dict) and "name" in v:
            return v["name"]
        return v
Here is a demo:
if __name__ == "__main__":
    insert_data = '{"name": "foo", "category": {"name": "bar"}, "quantity": 1}'
    insert_obj = OrderItemCreate.parse_raw(insert_data)
    print(insert_obj.json(indent=2))
    ...  # insert into DB
    response_obj = OrderItemResponse.parse_obj(insert_obj.dict())
    print(response_obj.json(indent=2))
Here is the output:
{
"name": "foo",
"unit": null,
"quantity": 1,
"category": {
"name": "bar"
}
}
{
"name": "foo",
"unit": null,
"quantity": 1,
"category": "bar"
}
One of the benefits of this approach is that the JSON Schema stays consistent with what you have on the model. If you use this in FastAPI that means the swagger documentation will actually reflect what the consumer of that endpoint receives. You could of course override and customize schema creation, but... why? Just define the model correctly in the first place and avoid headache in the future.
Try this when instantiating:
myCategory = Category(name="test cat")
OrderItems(
name="test",
category=myCategory.name,
unit="kg",
quantity=10)
Well, I was curious, so here's the insane way:
from typing import Union
from pydantic import BaseModel, Field

class Category(BaseModel):
    name: str = Field(alias="name")

class OrderItems(BaseModel):
    name: str = Field(alias="name")
    category: Category = Field(alias="category")
    unit: Union[str, None] = Field(alias="unit")
    quantity: int = Field(alias="quantity")

    def json(self, *args, **kwargs) -> str:
        # Note: this overwrites the nested model on the instance itself.
        self.__dict__.update({'category': self.__dict__['category'].name})
        return super().json(*args, **kwargs)
c = Category(name='Dranks')
m = OrderItems(name='sodie', category=c, unit='can', quantity=1)
m.json()
And you get:
'{"name": "sodie", "category": "Dranks", "unit": "can", "quantity": 1}'
The sane way would probably be:
class Category(BaseModel):
    name: str = Field(alias="name")

class OrderItems(BaseModel):
    name: str = Field(alias="name")
    category: Category = Field(alias="category")
    unit: Union[str, None] = Field(alias="unit")
    quantity: int = Field(alias="quantity")

c = Category(name='Dranks')
m = OrderItems(name='sodie', category=c, unit='can', quantity=1)
r = m.dict()
r['category'] = r['category']['name']
Recently I got stuck inserting an element into an array that is held by a parent document.
Basically, my model represents a train that can have many wagons. The wagons field is an array that holds wagon documents.
Here is my database model:
from typing import Optional, Union
from pydantic import BaseModel, EmailStr, Field
#Wagon Model
class Wagon(BaseModel):
no: int = Field(...)
wagon_code: str = Field(...)
wagon_type: str = Field(...)
passengers_on_board: int = Field(...)
max_capacity: int = Field(...)
weight: float = Field(...)
dimension: tuple = Field(...)
class Config:
schema_extra = {
"example": {
"no": 3,
"wagon_code": "GA151-03",
"wagon_type": "Penumpang",
"passengers_on_board": 40,
"max_capacity":80,
"weight":1050.01,
"dimension": (20.62,9.91,4.15)
}
}
class UpdateWagon(BaseModel):
no: Optional[int]
wagon_code: Optional[str]
wagon_type: Optional[str]
passengers_on_board: Optional[int]
max_capacity: Optional[int]
weight: Optional[float]
dimension: Optional[tuple]
class Config:
schema_extra = {
"example": {
"no": 3,
"wagon_code": "GA151-03",
"wagon_type": "Penumpang",
"passengers_on_board": 64,
"max_capacity":80,
"weight":1050.01,
"dimension": (20.62,9.91,4.15)
}
}
#Train Model
class Train(BaseModel):
name: str = Field(...)
no: str = Field(...)
first_station: str = Field(...)
last_station: str = Field(...)
from_: str = Field(...)
to_: str = Field(...)
current_station: str = Field(...)
position: Optional[tuple]
wagons: Union[list[Wagon], None] = None
class Config:
schema_extra = {
"example": {
"name": "Gajayana",
"no": "GA-151",
"first_station": "Malang - Kota Baru",
"last_station": "Jakarta - Gambir",
"from_":"Malang - Kota Baru",
"to_":"Malang - Kota Lama",
"current_station":"Malang - Kota Baru",
"position": (2.12,2.12),
"wagons" : [],
}
}
class UpdateTrain(BaseModel):
name: Optional[str]
no: Optional[str]
first_station: Optional[str]
last_station: Optional[str]
from_: Optional[str]
to_: Optional[str]
current_station: Optional[str]
position: Optional[tuple]
wagons: Optional[list]
class Config:
schema_extra = {
"example": {
"name": "Gajayana",
"no": "GA-151",
"first_station": "Malang - Kota Baru",
"last_station": "Jakarta - Gambir",
"from_":"Malang - Kota Lama",
"to_":"Kepanjen",
"current_station":"Malang - Kota Baru",
"position": (2.08,2.16),
"wagons": [],
}
}
I was using arrayFilters to attach the newly created wagon to the train. However, I had no luck attaching it: a new wagon was created, but it was not added to the train, as shown in the function below.
async def add_wagon(train_no: str, wagon_data: dict) -> dict:
    wagon = await wagon_collection.insert_one(wagon_data)
    new_wagon = await wagon_collection.find_one({"_id": wagon.inserted_id})
    train = await train_collection.find_one({"no": train_no})
    if train:
        train_collection.update_one({"no": train_no},{"$set":{"wagons.$[element]":new_wagon}},array_filters=[{"element":{"$exists":"false"}}],upsert=True)
    return wagon_helper(new_wagon)
The model is served as an API with FastAPI. The operation above is performed with the PUT method on the route shown below.
@router.put("{train_no}/wagons/", response_description="Wagon data added into the database")
async def add_wagon_data(train_no: str, wagon: Wagon = Body(...)):
    wagon = jsonable_encoder(wagon)
    new_wagon = await add_wagon(train_no, wagon)
    return ResponseModel(new_wagon, "Wagon added successfully")
Is my arrayFilters wrong? Or is the way I use $exists as an exact equality match incorrect?
It turns out I was overcomplicating things. Using $push with upsert=True solves the problem.
async def add_wagon(train_no: str, wagon_data: dict) -> dict:
    wagon = await wagon_collection.insert_one(wagon_data)
    new_wagon = await wagon_collection.find_one({"_id": wagon.inserted_id})
    train = await train_collection.find_one({"no": train_no})
    if train:
        # Await the update so that errors surface here instead of being lost.
        await train_collection.update_one({"no": train_no}, {"$push": {"wagons": new_wagon}}, upsert=True)
    return wagon_helper(new_wagon)
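As a small sketch of the update document used above: the helper below only builds the filter and update dictionaries, so it runs without a database. The function name build_push_update is hypothetical (it is not part of the original code), and the $each modifier is shown as one way to append several wagons in a single $push.

```python
from typing import Any, Dict, List, Tuple


def build_push_update(
    train_no: str, wagons: List[Dict[str, Any]]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the (filter, update) pair that appends wagons to one train."""
    filter_doc = {"no": train_no}
    if len(wagons) == 1:
        # A single element is pushed directly onto the array.
        update_doc = {"$push": {"wagons": wagons[0]}}
    else:
        # $each appends every element of the list instead of the list itself.
        update_doc = {"$push": {"wagons": {"$each": wagons}}}
    return filter_doc, update_doc


filter_doc, update_doc = build_push_update("GA-151", [{"no": 3, "wagon_code": "GA151-03"}])
print(filter_doc, update_doc)
# With an async driver this pair would be passed on, for example:
# await train_collection.update_one(filter_doc, update_doc)
```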
Let's say I have a simple HTTP endpoint accepting a JSON payload from the user.
Here the user passed None explicitly:
payload = {
"name": "Paul",
"option": None
}
and here the user didn't provide option at all:
payload = {
"name": "Paul",
}
For some reason, I want to 'normalize' such payloads into
payload = {
"name": "Paul",
"option": Missing(),
}
where Missing() is an instance of a class:
class Missing:
    """Represents missing value in a dictionary"""
    pass
so I can differentiate between user passing None explicitly and not passing option key at all.
What I'm struggling to define is a custom "type constructor", such that I could then annotate function argument like so:
def process_user_option(option: Missing[Optional[str]]):
...do something with 'option'
and the semantics of Missing would be something like this:
Missing(x) -> Union[Optional[x], Missing]
Any suggestions? Thank you!
One suggestion:
from typing import NewType, Optional, TypeVar

class MissingType: pass
Missing = MissingType()

T = TypeVar('T')
PossiblyMissing = T | MissingType

Name = NewType("Name", str)
Option = NewType("Option", dict)

def process_user_option(option: PossiblyMissing[Optional[Option]]):
    if option is Missing:
        ...
    if option is None:
        ...
Additionally, you might want to declare optional keys in a TypedDict explicitly with total=False*.
from typing import TypedDict

class _PayloadBase(TypedDict):
    name: str

class Payload(_PayloadBase, total=False):
    option: Optional[str]  # optional key
in order to safely normalize and process the payload:
from typing import final

@final
class MissingType:
    pass

Missing = MissingType()

class NormalizedPayload(_PayloadBase):
    option: str | MissingType

def normalize_payload(payload: Payload) -> NormalizedPayload:
    option = Missing if (option := payload.get("option")) is None else option
    normalized_payload: NormalizedPayload = {
        "name": payload["name"],
        "option": option,
    }
    return normalized_payload

def process_payload(payload: NormalizedPayload) -> None:
    if payload["option"] is Missing:
        print("No option specified!")
*It will be much easier in Python 3.11 with PEP-655 (Required and NotRequired).
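Note that payload.get("option") returns None both for an explicit None and for an absent key, so the normalization above maps both cases to Missing. If the two cases need to stay distinguishable, as the question asks, one possible variant is to pass the sentinel as the default of dict.get. This is a minimal sketch using only the standard library; the __repr__ is an addition for readable output.

```python
from typing import Any, Dict, Mapping


class MissingType:
    """Type of the sentinel that marks an absent key."""

    def __repr__(self) -> str:
        return "Missing"


Missing = MissingType()


def normalize_payload(payload: Mapping[str, Any]) -> Dict[str, Any]:
    return {
        "name": payload["name"],
        # The sentinel is used only when the key is absent,
        # so an explicit None passes through unchanged.
        "option": payload.get("option", Missing),
    }


print(normalize_payload({"name": "Paul", "option": None}))  # {'name': 'Paul', 'option': None}
print(normalize_payload({"name": "Paul"}))                  # {'name': 'Paul', 'option': Missing}
```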
Setup:
# Pydantic Models
class TMDB_Category(BaseModel):
    name: str = Field(alias="strCategory")
    description: str = Field(alias="strCategoryDescription")

class TMDB_GetCategoriesResponse(BaseModel):
    categories: list[TMDB_Category]

@router.get(path="category", response_model=TMDB_GetCategoriesResponse)
async def get_all_categories():
    async with httpx.AsyncClient() as client:
        response = await client.get(Endpoint.GET_CATEGORIES)
        return TMDB_GetCategoriesResponse.parse_obj(response.json())
Problem:
Alias is being used when creating a response, and I want to avoid it. I only need this alias to correctly map the incoming data but when returning a response, I want to use actual field names.
Actual response:
{
  "categories": [
    {
      "strCategory": "Beef",
      "strCategoryDescription": "Beef is ..."
    },
    {
      "strCategory": "Chicken",
      "strCategoryDescription": "Chicken is ..."
    }
  ]
}
Expected response:
{
  "categories": [
    {
      "name": "Beef",
      "description": "Beef is ..."
    },
    {
      "name": "Chicken",
      "description": "Chicken is ..."
    }
  ]
}
Switch aliases and field names and use the allow_population_by_field_name model config option:
class TMDB_Category(BaseModel):
    strCategory: str = Field(alias="name")
    strCategoryDescription: str = Field(alias="description")

    class Config:
        allow_population_by_field_name = True
Let the aliases configure the names of the fields that you want to return, but enable allow_population_by_field_name to be able to parse data that uses different names for the fields.
An alternate option (which likely won't be as popular) is to use a de-serialization library other than pydantic. For example, the Dataclass Wizard library is one which supports this particular use case. If you need the same round-trip behavior that Field(alias=...) provides, you can pass the all param to the json_field function. Note that with such a library, you do lose out on the ability to perform complete type validation, which is arguably one of pydantic's greatest strengths; however, it does perform type conversion in a similar fashion to pydantic. There are also a few reasons why I feel that validation is not as important, which I list below.
Reasons why I would argue that data validation is a nice-to-have feature in general:
If you're building and passing in the input yourself, you can most likely trust that you know what you are doing, and are passing in the correct data types.
If you're getting the input from another API, then assuming that API has decent docs, you can just grab an example response from their documentation, and use that to model your class structure. You generally don't need any validation if an API documents its response structure clearly.
Data validation takes time, so it can slow down the process slightly, compared to if you just perform type conversion and catch any errors that might occur, without validating the input type beforehand.
So to demonstrate that, here's a simple example for the above use case using the dataclass-wizard library (which relies on the usage of dataclasses instead of pydantic models):
from dataclasses import dataclass
from dataclass_wizard import JSONWizard, json_field

@dataclass
class TMDB_Category:
    name: str = json_field('strCategory')
    description: str = json_field('strCategoryDescription')

@dataclass
class TMDB_GetCategoriesResponse(JSONWizard):
    categories: list[TMDB_Category]
And the code to run that, would look like this:
input_dict = {
    "categories": [
        {
            "strCategory": "Beef",
            "strCategoryDescription": "Beef is ..."
        },
        {
            "strCategory": "Chicken",
            "strCategoryDescription": "Chicken is ..."
        }
    ]
}

c = TMDB_GetCategoriesResponse.from_dict(input_dict)
print(repr(c))
# TMDB_GetCategoriesResponse(categories=[TMDB_Category(name='Beef', description='Beef is ...'), TMDB_Category(name='Chicken', description='Chicken is ...')])
print(c.to_dict())
# {'categories': [{'name': 'Beef', 'description': 'Beef is ...'}, {'name': 'Chicken', 'description': 'Chicken is ...'}]}
Measuring Performance
If anyone is curious, I've set up a quick benchmark test to compare deserialization and serialization times with pydantic vs. just dataclasses:
from dataclasses import dataclass
from timeit import timeit

from pydantic import BaseModel, Field
from dataclass_wizard import JSONWizard, json_field


# Pydantic Models
class Pydantic_TMDB_Category(BaseModel):
    name: str = Field(alias="strCategory")
    description: str = Field(alias="strCategoryDescription")

class Pydantic_TMDB_GetCategoriesResponse(BaseModel):
    categories: list[Pydantic_TMDB_Category]


# Dataclasses
@dataclass
class TMDB_Category:
    name: str = json_field('strCategory', all=True)
    description: str = json_field('strCategoryDescription', all=True)

@dataclass
class TMDB_GetCategoriesResponse(JSONWizard):
    categories: list[TMDB_Category]


# Input dict which contains sufficient data for testing (100 categories)
input_dict = {
    "categories": [
        {
            "strCategory": f"Beef {i * 2}",
            "strCategoryDescription": "Beef is ..." * i
        }
        for i in range(100)
    ]
}

n = 10_000

print('=== LOAD (deserialize)')
print('dataclass-wizard: ',
      timeit('c = TMDB_GetCategoriesResponse.from_dict(input_dict)',
             globals=globals(), number=n))
print('pydantic:         ',
      timeit('c = Pydantic_TMDB_GetCategoriesResponse.parse_obj(input_dict)',
             globals=globals(), number=n))

c = TMDB_GetCategoriesResponse.from_dict(input_dict)
pydantic_c = Pydantic_TMDB_GetCategoriesResponse.parse_obj(input_dict)

print('=== DUMP (serialize)')
print('dataclass-wizard: ',
      timeit('c.to_dict()',
             globals=globals(), number=n))
print('pydantic:         ',
      timeit('pydantic_c.dict()',
             globals=globals(), number=n))
And the benchmark results (tested on Mac OS Big Sur, Python 3.9.0):
=== LOAD (deserialize)
dataclass-wizard: 1.742989194
pydantic: 5.31538175
=== DUMP (serialize)
dataclass-wizard: 2.300118940
pydantic: 5.582638598
In their docs, pydantic claims to be the fastest library in general, but it's rather straightforward to prove otherwise. As you can see, for the above dataset pydantic is about 2x slower in both the deserialization and serialization process. It’s worth noting that pydantic is already quite fast, though.
Disclaimer: I am the creator (and maintainer) of said library.
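A single timeit run like the ones in the benchmark can be affected by whatever else the machine is doing. One common way to reduce that noise, sketched here with only the standard library, is to repeat the measurement and keep the fastest round. The helper name best_time and the toy statement being timed are assumptions made for illustration; the statements from the benchmark above could be passed in the same way.

```python
from timeit import repeat


def best_time(stmt: str, namespace: dict, number: int = 1_000, rounds: int = 5) -> float:
    """Return the fastest total time in seconds over `rounds` rounds of `number` runs."""
    return min(repeat(stmt, globals=namespace, number=number, repeat=rounds))


# Toy workload standing in for a (de)serialization call.
sample = {"categories": [{"strCategory": f"Beef {i}"} for i in range(100)]}
elapsed = best_time("[c['strCategory'] for c in sample['categories']]", {"sample": sample})
print(f"best of 5 rounds: {elapsed:.6f}s")
```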
Maybe you could use this approach:
from pydantic import BaseModel, Field

class TMDB_Category(BaseModel):
    name: str = Field(alias="strCategory")
    description: str = Field(alias="strCategoryDescription")

data = {
    "strCategory": "Beef",
    "strCategoryDescription": "Beef is ..."
}

obj = TMDB_Category.parse_obj(data)
# {'name': 'Beef', 'description': 'Beef is ...'}
print(obj.dict())
I was trying to do something similar (migrate a field pattern to a list of patterns while gracefully handling old versions of the data). The best solution I could find was to do the field mapping in the __init__ method. In the terms of OP, this would be like:
class TMDB_Category(BaseModel):
    name: str
    description: str

    def __init__(self, **data):
        if "strCategory" in data:
            data["name"] = data.pop("strCategory")
        if "strCategoryDescription" in data:
            data["description"] = data.pop("strCategoryDescription")
        super().__init__(**data)
Then we have:
>>> TMDB_Category(strCategory="name", strCategoryDescription="description").json()
'{"name": "name", "description": "description"}'
If you need to use field aliases to do this but still use the name/description fields in your code, one option is to alter Hernán Alarcón's solution to use properties:
class TMDB_Category(BaseModel):
    strCategory: str = Field(alias="name")
    strCategoryDescription: str = Field(alias="description")

    class Config:
        allow_population_by_field_name = True

    @property
    def name(self):
        return self.strCategory

    @name.setter
    def name(self, value):
        self.strCategory = value

    @property
    def description(self):
        return self.strCategoryDescription

    @description.setter
    def description(self, value):
        self.strCategoryDescription = value
That's still a bit awkward, since the repr uses the "alias" names:
>>> TMDB_Category(name="name", description="description")
TMDB_Category(strCategory='name', strCategoryDescription='description')
Use the by_alias argument of the model's dict() method.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class Item(BaseModel):
    name: str = Field(..., alias="keck")

@app.post("/item")
async def read_items(
    item: Item,
):
    return item.dict(by_alias=False)
Given the request:
{
"keck": "string"
}
this will return
{
"name": "string"
}
Let's say I want to initialize the below dataclass
from dataclasses import dataclass

@dataclass
class Req:
    id: int
    description: str
I can of course do it in the following way:
data = make_request() # gives me a dict with id and description as well as some other keys.
# {"id": 123, "description": "hello", "data_a": "", ...}
req = Req(data["id"], data["description"])
But is it possible for me to do it with dictionary unpacking, given that the keys I need are always a subset of the dictionary?
req = Req(**data) # TypeError: __init__() got an unexpected keyword argument 'data_a'
Here's a solution that can be used generically for any class. It simply filters the input dictionary to exclude keys that aren't field names of the class with init==True:
from dataclasses import dataclass, fields

@dataclass
class Req:
    id: int
    description: str

def classFromArgs(className, argDict):
    fieldSet = {f.name for f in fields(className) if f.init}
    filteredArgDict = {k : v for k, v in argDict.items() if k in fieldSet}
    return className(**filteredArgDict)

data = {"id": 123, "description": "hello", "data_a": ""}
req = classFromArgs(Req, data)
print(req)
Output:
Req(id=123, description='hello')
UPDATE: Here's a variation on the strategy above which creates a utility class that caches dataclasses.fields for each dataclass that uses it (prompted by a comment by @rv.kvetch expressing performance concerns around duplicate processing of dataclasses.fields by multiple invocations for the same dataclass).
from dataclasses import dataclass, fields

class DataClassUnpack:
    classFieldCache = {}

    @classmethod
    def instantiate(cls, classToInstantiate, argDict):
        if classToInstantiate not in cls.classFieldCache:
            cls.classFieldCache[classToInstantiate] = {f.name for f in fields(classToInstantiate) if f.init}

        fieldSet = cls.classFieldCache[classToInstantiate]
        filteredArgDict = {k : v for k, v in argDict.items() if k in fieldSet}
        return classToInstantiate(**filteredArgDict)

@dataclass
class Req:
    id: int
    description: str

req = DataClassUnpack.instantiate(Req, {"id": 123, "description": "hello", "data_a": ""})
print(req)
req = DataClassUnpack.instantiate(Req, {"id": 456, "description": "goodbye", "data_a": "my", "data_b": "friend"})
print(req)

@dataclass
class Req2:
    id: int
    description: str
    data_a: str

req2 = DataClassUnpack.instantiate(Req2, {"id": 123, "description": "hello", "data_a": "world"})
print(req2)

print("\nHere's a peek at the internals of DataClassUnpack:")
print(DataClassUnpack.classFieldCache)
Output:
Req(id=123, description='hello')
Req(id=456, description='goodbye')
Req2(id=123, description='hello', data_a='world')
Here's a peek at the internals of DataClassUnpack:
{<class '__main__.Req'>: {'description', 'id'}, <class '__main__.Req2'>: {'description', 'data_a', 'id'}}
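The same per-class caching could also be delegated to functools.lru_cache instead of a hand-written dictionary. This is only a sketch of that alternative; the function names init_field_names and instantiate are made up for this example.

```python
from dataclasses import dataclass, fields
from functools import lru_cache
from typing import Any, FrozenSet, Mapping


@lru_cache(maxsize=None)
def init_field_names(cls: type) -> FrozenSet[str]:
    """Names accepted by the generated __init__, computed once per class."""
    return frozenset(f.name for f in fields(cls) if f.init)


def instantiate(cls, arg_dict: Mapping[str, Any]):
    """Build `cls` from `arg_dict`, silently dropping unknown keys."""
    allowed = init_field_names(cls)
    return cls(**{k: v for k, v in arg_dict.items() if k in allowed})


@dataclass
class Req:
    id: int
    description: str


req = instantiate(Req, {"id": 123, "description": "hello", "data_a": ""})
print(req)  # Req(id=123, description='hello')
```

Classes are hashable, so they can be used directly as the cache key; init_field_names.cache_info() shows how often the cached result was reused.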
You can possibly introduce a new function that will perform the given conversion from dict to dataclass:
import inspect
from dataclasses import dataclass

@dataclass
class Req:
    id: int
    description: str

def from_dict_to_dataclass(cls, data):
    return cls(
        **{
            key: (data[key] if val.default == val.empty else data.get(key, val.default))
            for key, val in inspect.signature(cls).parameters.items()
        }
    )

from_dict_to_dataclass(Req, {"id": 123, "description": "hello", "data_a": ""})
# Output: Req(id=123, description='hello')
Note that the val.default == val.empty check is needed to find out whether the field has a default value. If it has none, the key must be present in data; otherwise the default is used whenever the key is missing from data.
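To make the role of the default check concrete, here is a small sketch with a dataclass that has one required field and one field with a default. The class name ReqWithDefault and its default value are invented for this illustration.

```python
import inspect
from dataclasses import dataclass


@dataclass
class ReqWithDefault:
    id: int                   # required: no default in the signature
    description: str = "n/a"  # optional: hypothetical default value


def from_dict_to_dataclass(cls, data):
    return cls(
        **{
            # Required parameters are looked up strictly (KeyError if absent);
            # parameters with a default fall back to that default.
            key: (data[key] if val.default is val.empty else data.get(key, val.default))
            for key, val in inspect.signature(cls).parameters.items()
        }
    )


print(from_dict_to_dataclass(ReqWithDefault, {"id": 1, "extra": True}))
# ReqWithDefault(id=1, description='n/a')
```

Leaving out a required key such as "id" raises a KeyError from the data[key] lookup.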
A workaround is to override the __init__ of the dataclass and filter out the fields that are not recognized.
from dataclasses import dataclass, fields

@dataclass
class Req1:
    id: int
    description: str

@dataclass
class Req2:
    id: int
    description: str

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            if key in REQ2_FIELD_NAMES:
                setattr(self, key, value)

# To not re-evaluate the field names for each and every creation of Req2, list them here.
REQ2_FIELD_NAMES = {field.name for field in fields(Req2)}

data = {
    "id": 1,
    "description": "some",
    "data_a": None,
}

try:
    print("Call for Req1:", Req1(**data))
except Exception as error:
    print("Call for Req1:", error)

try:
    print("Call for Req2:", Req2(**data))
except Exception as error:
    print("Call for Req2:", error)
Output:
Call for Req1: __init__() got an unexpected keyword argument 'data_a'
Call for Req2: Req2(id=1, description='some')
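One thing to keep in mind with a hand-written __init__ like the one above is that it replaces the generated one, so required fields are no longer checked at construction time. A possible alternative, sketched here under the assumption that callers can use a separate constructor, is a classmethod that filters the keys and then calls the generated __init__. The names Req3 and from_dict are made up for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class Req3:  # hypothetical class for this sketch
    id: int
    description: str = "n/a"

    @classmethod
    def from_dict(cls, data):
        """Build an instance from `data`, ignoring keys that are not init fields."""
        names = {f.name for f in fields(cls) if f.init}
        return cls(**{k: v for k, v in data.items() if k in names})


print(Req3.from_dict({"id": 1, "description": "some", "data_a": None}))
# Req3(id=1, description='some')
```

Because the generated __init__ still runs, defaults are applied as usual and a missing required field such as id raises a TypeError.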
Related question:
How does one ignore extra arguments passed to a data class?