In Python 3.8, I'm trying to mock up a validation JSON schema for the structure below:
{
    # some other key/value pairs
    "data_checks": {
        "check_name": {
            "sql": "SELECT col FROM blah",
            "expectations": {
                "expect_column_values_to_be_unique": {
                    "column": "col",
                },
                # additional items as required
            }
        },
        # additional items as required
    }
}
The requirements I'm trying to enforce include:
At least one item in data_checks that can have a dynamic name. Item keys should be unique.
sql and expectations keys must be present
sql should be a text string
At least one item in expectations. Item keys should be unique.
Within expectations, item keys must be equal to available methods provided by dir(class_name)
More advanced capability would include:
Enforcing expectations method items to only include kwargs for that method
I currently have the following JSON schema for the data_checks portion:
"data_checks": {
"description": "Data quality checks against provided sources.",
"minProperties": 1,
"type": "object",
"patternProperties": {
".+": {
"required": ["expectations", "sql"],
"sql": {
"description": "SQL for data quality check.",
"minLength": 1,
"type": "string",
},
"expectations": {
"description": "Great Expectations function name.",
"minProperties": 1,
"type": "object",
"anyOf": [
{
"type": "string",
"minLength": 1,
"pattern": [e for e in dir(SqlAlchemyDataset) if e.startswith("expect_")],
}
],
},
},
},
},
This JSON schema does not enforce expectations to have at least one item, nor does it enforce valid method names for the nested keys as expected from [e for e in dir(SqlAlchemyDataset) if e.startswith("expect_")]. I haven't really looked into enforcing kwargs for the selected method (is that even possible?).
I don't know if this is related to things being nested, but how would I enforce the proper validation requirements?
Thanks!
I'm using ElasticSearch 8.3.2 to store some data I have. The data consists of metabolites and several "studies" for each metabolite, with each study in turn containing concentration values. I am also using the Python ElasticSearch client to communicate with the backend, which works fine.
To associate metabolites with studies, I was considering using a join field as described here.
I have defined this index mapping:
INDEXMAPPING_MET = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "entry_type": {"type": "text"},
            "pc_relation": {
                "type": "join",
                "relations": {
                    "metabolite": "study"
                }
            },
            "concentration": {
                "type": "nested",
            }
        }
    }
}
pc_relation is the join field here, with metabolites being the parent documents of each study document.
I can create metabolite entries (the parent documents) just fine using the Python client, for example
self.client.index(index="metabolitesv2", id=metabolite, body=json.dumps({
    # [... some other fields here]
    "pc_relation": {
        "name": "metabolite",
    },
}))
However, once I try adding child documents, I get a mapping_parser_exception. Notably, I only get this exception when trying to add the pc_relation field; any other fields work just fine, and I can create documents if I omit the join field. Here is an example of a study document I am trying to create (on the same index):
self.client.index(index="metabolitesv2", id=study, body=json.dumps({
    # [... some other fields here]
    "pc_relation": {
        "name": "study",
        "parent": metabolite_id
    },
}))
At first I thought there might be a typing issue, but casting everything to a string sadly does not change the outcome. I would really appreciate any help with figuring out where the error could be, as I am not really sure what the issue is: from what I can tell from the official ES documentation and other Python+ES projects, I am not doing anything differently.
Tried: Creating an index with a join field, creating a parent document, creating a child document with a join relation to the parent.
Expectation: Documents get created and can be queried using has_child or has_parent tags.
Result: MappingParserException when trying to create the child document
TL;DR
You need to provide a routing value at indexing time for the child document.
The routing value is mandatory because parent and child documents must be indexed on the same shard.
By default the routing value of a document is its _id, so in practice you need to provide the _id of the parent document when indexing the child.
Solution
self.client.index(index="metabolitesv2", id=study, routing=metabolite, body=json.dumps({
    # [... some other fields here]
    "pc_relation": {
        "name": "study",
        "parent": metabolite_id
    },
}))
To reproduce
PUT 75224800
{
  "settings": {
    "number_of_shards": 4
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "pc_relation": {
        "type": "join",
        "relations": {
          "metabolite": "study"
        }
      }
    }
  }
}

PUT 75224800/_doc/1
{
  "id": "1",
  "pc_relation": "metabolite"
}

# No routing Id this is going to fail
PUT 75224800/_doc/2
{
  "id": 2,
  "pc_relation": {
    "name": "study",
    "parent": "1"
  }
}

PUT 75224800/_doc/3
{
  "id": "3",
  "pc_relation": "metabolite"
}
PUT 75224800/_doc/4?routing=3
{
  "id": "4",
  "pc_relation": {
    "name": "study",
    "parent": "3"
  }
}
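Once the routing value is supplied, the expectation from the question (querying with has_child / has_parent) should hold. A minimal sketch with the 8.x elasticsearch-py client, reusing the index and field names from the question (illustrative only, not tested against your data; the host is an assumption):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # assumption: local cluster

# All study documents whose parent metabolite has a given _id
studies = client.search(index="metabolitesv2", query={
    "has_parent": {
        "parent_type": "metabolite",
        "query": {"ids": {"values": ["some_metabolite_id"]}}
    }
})

# All metabolite documents that have at least one study child
metabolites = client.search(index="metabolitesv2", query={
    "has_child": {
        "type": "study",
        "query": {"match_all": {}}
    }
})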
I'm working with fastjsonschema to validate JSON objects that contain a list of product details.
If the object is missing a value, the validation should create it with the default value, e.g.
validate = fastjsonschema.compile({
    'type': 'object',
    'properties': {
        'a': {'type': 'number', 'default': 42},
    },
})
data = validate({})
assert data == {'a': 42}
But with an array, it only fills in defaults for as many of the array items as are defined in the schema, which means that if the user enters more array items than the schema covers, the extra items will not be validated by the schema.
Is there a way to declare that all items in the array follow the same schema, and that they should all be validated?
Currently, when I define the following in the schema
{
    "products": {
        "type": "array",
        "default": [],
        "items": [
            {
                "type": "object",
                "default": {},
                "properties": {
                    "string_a": {
                        "type": "string",
                        "default": "a"
                    },
                    "string_b": {
                        "type": "string",
                        "default": "b"
                    }
                }
            }
        ]
    }
}
What will happen when I try to validate
{"products":[{},{}]}
is that it becomes
{"products":[{"string_a":"a","string_b":"b"},{}]}
This can cause issues with missing data, and of course it's better to have the whole thing validated.
So is there a way to define a schema for an object in an array, and then have that schema applied to every item in the array?
Thanks
You've got an extra array around your items schema. The way you have written it (for JSON Schema versions before 2020-12), an items keyword whose value is an array specifies a separate schema for each item position, rather than one schema for all of them:
"items": [
{ .. this schema is only used for the first item .. },
{ .. this schema is only used for the second item .. },
...
]
compare to:
"items": { .. this schema is used for ALL items ... }
(The implementation really shouldn't be filling in defaults like that anyway, as that's contrary to the specification, but that's orthogonal.)
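Applied to the schema from the question, a corrected version might look roughly like this (a sketch that keeps the question's defaults; with a single items object, fastjsonschema should validate and fill defaults for every element of products):

import fastjsonschema

validate = fastjsonschema.compile({
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "default": [],
            "items": {                     # single schema object -> applies to ALL items
                "type": "object",
                "default": {},
                "properties": {
                    "string_a": {"type": "string", "default": "a"},
                    "string_b": {"type": "string", "default": "b"},
                },
            },
        },
    },
})

print(validate({"products": [{}, {}]}))
# Expected: {'products': [{'string_a': 'a', 'string_b': 'b'},
#                         {'string_a': 'a', 'string_b': 'b'}]}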
My question is about performance.
I am using filtered queries a lot, and I am not certain what the proper way to query by type is.
So first, let's have a look at the mappings:
{
  "my_index": {
    "mappings": {
      "type_Light_Yellow": {
        "properties": {
          "color_type": {
            "properties": {
              "color": {
                "type": "string",
                "index": "not_analyzed"
              },
              "brightness": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "details": {
            "properties": {
              "FirstName": {
                "type": "string",
                "index": "not_analyzed"
              },
              "LastName": {
                "type": "string",
                "index": "not_analyzed"
              },
              ...
            }
          }
        }
      }
    }
  }
}
Above, we can see an example of one mapping, for the type type_Light_Yellow. There are many more mappings for other types (colors), e.g. dark Yellow, light Brown, and so on.
Please notice color_type's subfields.
For the type type_Light_Yellow the values are always "color": "Yellow" and "brightness": "Light", and likewise for all other types.
And now my performance question: I wonder whether there is a preferred way to query my index.
For example, let's search for all documents where "details.FirstName": "John" and "details.LastName": "Doe" under type type_Light_Yellow.
Current method I'm using:
curl -XPOST 'http://somedomain.com:1234/my_index/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "color_type.color": "Yellow"
              }
            },
            {
              "term": {
                "color_type.brightness": "Light"
              }
            },
            {
              "term": {
                "details.FirstName": "John"
              }
            },
            {
              "term": {
                "details.LastName": "Doe"
              }
            }
          ]
        }
      }
    }
  }
}'
As can be seen above, by filtering on
"color_type.color": "Yellow" and "color_type.brightness": "Light", I am querying the whole index and treating the type type_Light_Yellow as if it were just another field on the documents I'm searching.
The alternative method is to query directly against the type:
curl -XPOST 'http://somedomain.com:1234/my_index/type_Light_Yellow/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "details.FirstName": "John"
              }
            },
            {
              "term": {
                "details.LastName": "Doe"
              }
            }
          ]
        }
      }
    }
  }
}'
Please notice the first line: my_index/type_Light_Yellow/_search.
Which of these is more efficient to query, performance-wise?
Would the answer be different if I am querying via code (I am using Python with the elasticsearch package)?
Types in Elasticsearch work by adding a _type attribute to documents, and every time you search a specific type it automatically filters by that _type attribute. So, performance-wise there shouldn't be much of a difference. Types are an abstraction, not actual data. What I mean is that fields across multiple document types are flattened out over the entire index, i.e. fields of one type also occupy space in documents of the other types, even though they are not indexed there (think of it the same way as null occupying space).
But it's important to keep in mind that the order of filtering impacts performance. You must aim to exclude as many documents as possible in one go. So, if you think it is better not to filter by type first, the first way of querying is preferable. Otherwise, I don't think there would be much of a difference if the ordering is the same.
Since the Python API also queries over HTTP in the default settings, using Python shouldn't impact performance.
In your case there is a certain degree of data duplication though, as the color is captured both in the _type meta field and in the color field.
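For completeness, a sketch of the same query through the Python client of that era (an assumption: the old elasticsearch-py that still accepted a doc_type argument); it sends the same JSON over HTTP, so the performance considerations above are unchanged:

from elasticsearch import Elasticsearch

es = Elasticsearch(["somedomain.com:1234"])

body = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"details.FirstName": "John"}},
                        {"term": {"details.LastName": "Doe"}},
                    ]
                }
            }
        }
    }
}

# Equivalent of POSTing to my_index/type_Light_Yellow/_search
res = es.search(index="my_index", doc_type="type_Light_Yellow", body=body)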
I'm trying to test a lot of JSON documents against a schema, and I use an object with all the required field names to keep track of how many errors each has.
Is there a function in any Python library that creates a sample object with boolean values indicating whether a particular field is required? I.e.
From this schema:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "type": {
            "type": "string"
        },
        "position": {
            "type": "array"
        },
        "content": {
            "type": "object"
        }
    },
    "additionalProperties": false,
    "required": [
        "type",
        "content"
    ]
}
I need to get something like:
{
    "type": True,
    "position": False,
    "content": True
}
I need it to support references to definitions ($ref) as well.
I don't know of a library that will do this, but this simple function uses a dict comprehension to get the desired result.
def required_dict(schema):
    return {
        key: key in schema['required']
        for key in schema['properties']
    }
print(required_dict(schema))
Example output from your provided schema
{'content': True, 'position': False, 'type': True}
Edit: link to repl.it example
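The question also asks for support for references to definitions. A possible starting point (a sketch I'm adding, assuming only local "#/definitions/..." style refs as in draft-04; the helper names are my own, not from any library):

def resolve_ref(schema, root):
    # Follow local "#/..." pointers until we reach a concrete schema
    while isinstance(schema, dict) and '$ref' in schema:
        target = root
        for part in schema['$ref'].lstrip('#/').split('/'):
            target = target[part]
        schema = target
    return schema

def required_dict_with_refs(schema, root=None):
    root = root if root is not None else schema
    schema = resolve_ref(schema, root)
    return {
        key: key in schema.get('required', [])
        for key in schema.get('properties', {})
    }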
This is a structure I'm getting from elsewhere, that is, a list of deeply nested dictionaries:
{
  "foo_code": 404,
  "foo_rbody": {
    "query": {
      "info": {
        "acme_no": "444444",
        "road_runner": "123"
      },
      "error": "no_lunch",
      "message": "runner problem."
    }
  },
  "acme_no": "444444",
  "road_runner": "123",
  "xyzzy_code": 200,
  "xyzzy_rbody": {
    "api": {
      "items": [
        {
          "desc": "OK",
          "id": 198,
          "acme_no": "789",
          "road_runner": "123",
          "params": {
            "bicycle": "2wheel",
            "willie": "hungry",
            "height": "1",
            "coyote_id": "1511111"
          },
          "activity": "TRAP",
          "state": "active",
          "status": 200,
          "type": "chase"
        }
      ]
    }
  }
}
{
  "foo_code": 200,
  "foo_rbody": {
    "query": {
      "result": {
        "acme_no": "260060730303258",
        "road_runner": "123",
        "abyss": "26843545600"
      }
    }
  },
  "acme_no": "260060730303258",
  "road_runner": "123",
  "xyzzy_code": 200,
  "xyzzy_rbody": {
    "api": {
      "items": [
        {
          "desc": "OK",
          "id": 198,
          "acme_no": "789",
          "road_runner": "123",
          "params": {
            "bicycle": "2wheel",
            "willie": "hungry",
            "height": "1",
            "coyote_id": "1511111"
          },
          "activity": "TRAP",
          "state": "active",
          "status": 200,
          "type": "chase"
        }
      ]
    }
  }
}
Asking for different structures is out of the question (legacy APIs, etc.).
So I'm wondering if there's some clever way of extracting selected values from such a structure.
The candidates I was thinking of:
Flatten particular dictionaries, building composite keys, something like:
{
    "foo_rbody.query.info.acme_no": "444444",
    "foo_rbody.query.info.road_runner": "123",
    ...
}
Pro: every value can be fetched with a single lookup, and if a predictable key is missing, the corresponding structure was not there (as you might have noticed, the dictionaries may have different structures depending on whether the operation was successful, an error happened, etc.).
Con: what to do with lists?
Use some recursive function that does successive key lookups, say by "foo_rbody", then "query", then "info", etc.
Any better candidates?
You can try this rather trivial function to access nested properties:
import re

def get_path(dct, path):
    # Split the path into dict keys and list indices, e.g.
    # "xyzzy_rbody.api.items[0].params.bicycle" -> 'xyzzy_rbody', 'api', 'items', '0', 'params', 'bicycle'
    for i, p in re.findall(r'(\d+)|(\w+)', path):
        # Exactly one of the two groups matches: p is a dict key, i is a list index
        dct = dct[p or int(i)]
    return dct
Usage:
value = get_path(data, "xyzzy_rbody.api.items[0].params.bicycle")
Maybe the function byPath in my answer to this post might help you.
You could create your own path mechanism and then query the complicated dict with paths. Example:
/ : get the root object
/key: get the value of root_object['key'], e.g. /foo_code --> 404
/key/key: nesting: /foo_rbody/query/info/acme_no -> 444444
/key[i]: get the i-th element of that list, e.g. /xyzzy_rbody/api/items[0]/desc --> "OK"
The path can also return a dict which you then run more queries on, etc.
It would be fairly easy to implement recursively.
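For illustration, a minimal sketch of that idea (a hypothetical helper I'm adding here; the name and path syntax are my own, not from any library):

def by_path(obj, path):
    """Resolve paths like '/foo_rbody/query/info/acme_no' or '/xyzzy_rbody/api/items[0]/desc'."""
    if path in ('', '/'):
        return obj
    for part in path.strip('/').split('/'):
        key, _, rest = part.partition('[')   # split off optional list indices, e.g. items[0]
        obj = obj[key]
        while rest:
            idx, _, rest = rest.partition(']')
            obj = obj[int(idx)]
            rest = rest.lstrip('[')
    return obj

# by_path(data, '/foo_code')                       -> 404
# by_path(data, '/foo_rbody/query/info/acme_no')   -> '444444'
# by_path(data, '/xyzzy_rbody/api/items[0]/desc')  -> 'OK'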
I can think of two more solutions:
You can try the package Pynq, described here: a structured query language for JSON (in Python). As far as I understand, it's some kind of LINQ for Python.
You may also try converting your JSON to XML and then using the XQuery language to get data from it (there is an XQuery library for Python).