Pydantic Models inside a Pandas DataFrame

Cannot insert a pydantic model into a pandas DataFrame.
I cannot figure out what it is about classes that inherit from pydantic's BaseModel that prevents them from being inserted into a DataFrame, whereas other classes can be.
For example:
import dataclasses
import pandas as pd

@dataclasses.dataclass
class Test:
    name: str

df = pd.DataFrame()
inst = Test(name='Brian')
df.at[0, 'test'] = inst
print(df)
will output
                 test
0  Test(name='Brian')
(the above works for regular, non-dataclass classes too)
whereas for a pydantic model
class Test(BaseModel):
    name: str
pandas seems to interpret the model as list-like and fails with the error:
TypeError: object of type 'Test' has no len()
The same happens with insert, loc, iloc, at and iat.
Whilst it's not the most obvious use case for a DataFrame, I am working with a codebase that expects to be able to insert and retrieve classes as objects from DataFrame fields, and I am currently attempting to change some classes to use Pydantic.
Why does pandas assume pydantic models are list_like and is there a way around this?
UPDATE:
I found a way to make this work, although it is not ideal...
Pandas uses a method is_list_like; one can find it in pandas._libs.lib.pyx, but as it is written in Pyrex (Cython) it is hard to tell why a dataclass is not list-like while a pydantic class is. Someone may be able to enlighten me on that one.
Anyway, to get around this I noticed that pandas checks whether the value's ndim attribute is greater than 0, with a default of 1 if the attribute is missing. If that condition fails, the value is set using the _setitem_single_column method of the pandas _iLocIndexer class.
Therefore, setting ndim = 0 on the pydantic class allows it to be set as a field value in a pandas DataFrame.
class Test(BaseModel):
    name: str
    ndim = 0
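For completeness, a minimal sketch of the workaround in use, assuming pydantic v1 (where an untyped class attribute like ndim stays a plain class attribute rather than becoming a model field):
import pandas as pd
from pydantic import BaseModel

class Test(BaseModel):
    name: str
    ndim = 0  # plain class attribute; makes pandas treat instances as scalars

df = pd.DataFrame()
df.at[0, 'test'] = Test(name='Brian')  # no longer raises TypeError
print(df)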

Pydantic does not validate the key/values of dict fields

I have the following simple data model:
from typing import Dict
from pydantic import BaseModel

class TableModel(BaseModel):
    table: Dict[str, str]
I want to add multiple tables like this:
tables = TableModel(table={'T1': 'Tea'})
print(tables) # table={'T1': 'Tea'}
tables.table['T2'] = 'coffee'
tables.table.update({'T3': 'Milk'})
print(tables) # table={'T1': 'Tea', 'T2': 'coffee', 'T3': 'Milk'}
So far everything is working as expected. However, the next piece of code does not raise any error:
tables.table[1] = 2
print(tables) # table={'T1': 'Tea', 'T2': 'coffee', 'T3': 'Milk', 1: 2}
I changed the tables field name to __root__; with that change I see the same behavior.
I also added validate_assignment = True in the Model Config, but that does not help either.
How can I get the model to validate the dict fields? Am I missing something basic here?
There are actually two distinct issues here that I'll address separately.
Mutating a dict on a Pydantic model
Observed behavior
from typing import Dict
from pydantic import BaseModel

class TableModel(BaseModel):
    table: Dict[str, str]

    class Config:
        validate_assignment = True

instance = TableModel(table={"a": "b"})
instance.table[1] = object()
print(instance)
Output: table={'a': 'b', 1: <object object at 0x7f7c427d65a0>}
Both key and value type clearly don't match our annotation of table. So, why does the assignment instance.table[1] = object() not cause a validation error?
Explanation
The reason is rather simple: there is no mechanism to enforce validation here. You need to understand what happens from the point of view of the model.
A model can validate attribute assignment (if you configure validate_assignment = True). It does so by hooking into the __setattr__ method and running the value through the appropriate field validator(s).
But in that example above, we never called BaseModel.__setattr__. Instead, we called the __getattribute__ method that BaseModel inherits from object to access the value of instance.table. That returned the dictionary object ({"a": "b"}). And then we called the dict.__setitem__ method on that dictionary and added a key-value-pair of 1: object() to it.
The dictionary is just a regular old dictionary without any validation logic. And the mutation of that dictionary is completely obscure to the Pydantic model. It has no way of knowing that after accessing the object currently assigned to the table field, we changed something inside that object.
Validation would only be triggered, if we actually assigned a new object to the table field of the model. But that is not what happens here.
If we instead tried to do instance.table = {1: object()}, we would get a validation error because now we are actually setting the table attribute and trying to assign a value to it.
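To make the contrast concrete, a minimal sketch reusing the TableModel with validate_assignment = True from above:
from pydantic import ValidationError

instance = TableModel(table={"a": "b"})
try:
    instance.table = {1: object()}  # real attribute assignment -> validation runs
except ValidationError as err:
    print(err)  # the object() value cannot be coerced to str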
Possible workaround
Depending on how you intend to use the model, you could ensure that changes in the table dictionary will always happen "outside" of the model and are followed by a re-assignment in the form instance.table = .... I would say that is probably the most practical option. In general, re-parsing (subsets of) data should ensure consistency, if you mutated values. Something like this should work (i.e. cause an error):
tables.table[1] = 2
tables = TableModel.parse_obj(tables.dict())
Another option might be to play around and define your own subtype of Dict and add validation logic there (a rough sketch follows below), but I am not sure how much "reinventing the wheel" that might entail.
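A rough sketch of that idea, using pydantic v1's __get_validators__ hook on a custom dict subclass; the name StrStrDict and its details are illustrative, not part of any library API:
from pydantic import BaseModel

class StrStrDict(dict):
    """A dict that only accepts str keys and str values on mutation."""

    def __setitem__(self, key, value):
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("StrStrDict only accepts str keys and str values")
        super().__setitem__(key, value)

    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        # Coerce incoming data once at parse time; later mutation is guarded.
        return cls({str(k): str(val) for k, val in dict(v).items()})

class TableModel(BaseModel):
    table: StrStrDict

tables = TableModel(table={"T1": "Tea"})
tables.table["T2"] = "coffee"  # fine
# tables.table[1] = 2          # raises TypeError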
The most sophisticated option could maybe be a descriptor-based approach, where instead of just calling __getattribute__, a custom descriptor intercepts the attribute access and triggers the assignment validation. But that is just an idea. I have not tried this and don't know if that might break other Pydantic magic.
Implicit type coercion
Observed behavior
from typing import Dict
from pydantic import BaseModel

class TableModel(BaseModel):
    table: Dict[str, str]

instance = TableModel(table={1: 2})
print(instance)
Output: table={'1': '2'}
Explanation
This is easily explained: it is expected behavior and was put in place by choice. The idea is that if we can "simply" coerce a value to the specified type, we want to do that. Although you defined both the key and value type as str, passing an int for each is no big deal because the default string validator can just do str(1) and str(2) respectively.
Thus, instead of raising a validation error, the tables value ends up with {"1": "2"} instead.
Possible workaround
If you do not want this implicit coercion to happen, there are strict types that you can use for the annotation. In this case you could use table: Dict[StrictStr, StrictStr]. Then the previous example would indeed raise a validation error.
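A minimal sketch (StrictStr is importable directly from pydantic in v1):
from typing import Dict
from pydantic import BaseModel, StrictStr

class TableModel(BaseModel):
    table: Dict[StrictStr, StrictStr]

TableModel(table={1: 2})  # raises ValidationError instead of coercing to {'1': '2'}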

Cannot determine if type of field in a Pydantic model is of type List

I am trying to automatically convert a Pydantic model to a DB schema. To do that, I am recursively looping through a Pydantic model's fields to determine the type of field.
As an example, I have this simple model:
from typing import List
from pydantic import BaseModel

class TestModel(BaseModel):
    tags: List[str]
I am recursing through the model using the __fields__ property as described here: https://docs.pydantic.dev/usage/models/#model-properties
If I do type(TestModel).__fields__['tags'] I see:
ModelField(name='tags', type=List[str], required=True)
I want to programmatically check if the ModelField type has a List origin. I have tried the following, and none of them work:
type(TestModel).__fields__['tags'].type_ is List[str]
type(TestModel).__fields__['tags'].type_ == List[str]
typing.get_origin(type(TestModel).__fields__['tags'].type_) is List
typing.get_origin(type(TestModel).__fields__['tags'].type_) == List
Frustratingly, this does return True:
type(TestModel).__fields__['tags'].type_ is str
What is the correct way for me to confirm a field is a List type?
Pydantic has the concept of the shape of a field. These shapes are encoded as integers and available as constants in the fields module. The more-or-less standard types have been accommodated there already. If a field was annotated with list[T], then the shape attribute of the field will be SHAPE_LIST and the type_ will be T.
The type_ refers to the element type in the context of everything that is not SHAPE_SINGLETON, i.e. with container-like types. This is why you get str in your example.
Thus for something as simple as list, you can simply check the shape against that constant:
from pydantic import BaseModel
from pydantic.fields import SHAPE_LIST

class TestModel(BaseModel):
    tags: list[str]
    other: tuple[str]

tags_field = TestModel.__fields__["tags"]
other_field = TestModel.__fields__["other"]
assert tags_field.shape == SHAPE_LIST
assert other_field.shape != SHAPE_LIST
If you want more insight into the actual annotation of the field, that is stored in the annotation attribute of the field. With that you should be able to do all the typing related analyses like get_origin.
That means another way of accomplishing your check would be this:
from typing import get_origin

from pydantic import BaseModel

class TestModel(BaseModel):
    tags: list[str]
    other: tuple[str]

tags_field = TestModel.__fields__["tags"]
other_field = TestModel.__fields__["other"]
assert get_origin(tags_field.annotation) is list
assert get_origin(other_field.annotation) is tuple
Sadly, neither of those attributes is officially documented anywhere as far as I know, but the beauty of open-source is that we can just check ourselves. Neither the attributes nor the shape constants are obfuscated, protected or made private in any of the usual ways, so I'll assume they are stable (at least until Pydantic v2 drops).

A way to set field validation attribute in pydantic

I have the following pydantic dataclass:
from pydantic.dataclasses import dataclass

@dataclass
class LocationPolygon:
    type: int
    coordinates: list[list[list[float]]]
This is taken from a JSON schema where the innermost array has maxItems=2, minItems=2.
I couldn't find a way to set up this validation in pydantic.
Setting this on the field works only on the outer level of the list:
@dataclass
class LocationPolygon:
    type: int
    coordinates: list[list[list[float]]] = Field(maxItems=2, minItems=2)
Using @validator and updating the field attribute doesn't help either, as the value was already set and basic validations were already run:
@validator('coordinates')
def coordinates_come_in_pair(cls, values, field):
    field.sub_fields[0].sub_fields[0].field_info.min_items = 2
    field.sub_fields[0].sub_fields[0].field_info.max_items = 2
I thought about using root_validator with pre=True, but there are only the raw values there.
Is there a way to tweak the field validation attributes or use pydantic basic rules to make that validation?
You can use the conlist function to create a nested constrained list:
from pydantic import conlist
from pydantic.dataclasses import dataclass

@dataclass
class LocationPolygon:
    type: int
    coordinates: list[list[conlist(float, min_items=2, max_items=2)]]
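A quick check of that model (a sketch; min_items/max_items are the pydantic v1 parameter names):
LocationPolygon(type=1, coordinates=[[[1.0, 2.0], [3.0, 4.0]]])  # OK

LocationPolygon(type=1, coordinates=[[[1.0, 2.0, 3.0]]])
# pydantic.error_wrappers.ValidationError:
#   ensure this value has at most 2 items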

How to limit choices for pydantic using Enum

I have the following Enum:
from enum import Enum

class ModeEnum(str, Enum):
    """ mode """
    map = "map"
    cluster = "cluster"
    region = "region"
This enum is used in two Pydantic data structures.
In one data structure I need all the Enum options.
In the other data structure I need to exclude region.
If I use custom validation for this and try to enter some other value, the standard validation error message states that all three values are allowed.
So what is the best approach in this situation?
P.S.
I use a map variable in ModeEnum. Is that bad? I can't imagine a situation where it could override the built-in map object, but still, is it OK?
It's a little bit of a hack, but if you mark your validator with pre=True, you should be able to force it to run first, and then you can throw a custom error with the allowed values.
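A minimal sketch of that idea (pydantic v1; the model name RestrictedMode and the error message are illustrative):
from enum import Enum
from pydantic import BaseModel, validator

class ModeEnum(str, Enum):
    map = "map"
    cluster = "cluster"
    region = "region"

class RestrictedMode(BaseModel):
    mode: ModeEnum

    @validator("mode", pre=True)
    def exclude_region(cls, v):
        # Runs before the Enum coercion, so we can narrow the allowed values.
        allowed = {ModeEnum.map.value, ModeEnum.cluster.value}
        if v not in allowed:
            raise ValueError(f"mode must be one of {sorted(allowed)}")
        return v

RestrictedMode(mode="cluster")  # OK
# RestrictedMode(mode="region")  # ValidationError with the custom message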

Define CQLEngine model dynamically using 'type'

I am using the Datastax Cassandra python driver's Object Mapper to define cassandra table columns at run time (the requirements call for it).
The table name, column names and column types are all resolved at run time.
I am trying to define a cassandra cqlengine model at runtime using 'type' to define a class.
It looks like the Model class defined in the python driver has a metaclass added to it:
@six.add_metaclass(ModelMetaClass)
class Model(BaseModel):
    ...
Is there even a way to define Models using type?
I am seeing the following error while defining a Model class:
from cassandra.cqlengine.models import Model
from cassandra.cqlengine import columns as Columns

attributes_dict = {
    'test_id': Columns.Text(primary_key=True),
    'test_col1': Columns.Text()
}
RunTimeModel = type('NewModelName', tuple(Model), attributes_dict)
Error:
RunTimeModel = type ('NewModelName', tuple(Model), attributes_dict)
TypeError: 'ModelMetaClass' object is not iterable
I'll stay away from the rest, but to answer the question about the error, I think you have a simple syntax error trying to construct a tuple from a non-sequence argument. Instead, you might use the tuple literal notation:
RunTimeModel = type ('NewModelName', (Model,), attributes_dict)
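Putting it together, a sketch of the corrected definition; as far as I can tell, the three-argument type() call defers to the metaclass of the bases, so ModelMetaClass still runs as usual:
from cassandra.cqlengine.models import Model
from cassandra.cqlengine import columns as Columns

attributes_dict = {
    'test_id': Columns.Text(primary_key=True),
    'test_col1': Columns.Text(),
}
# (Model,) is a one-element tuple of base classes; tuple(Model) tried to
# iterate over the class object itself, hence the TypeError above.
RunTimeModel = type('NewModelName', (Model,), attributes_dict)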
