I'm working with fastjsonschema to validate JSON objects that contain a list of product details.
If the object is missing a value, validation should fill it in with the default value, e.g.:
import fastjsonschema

validate = fastjsonschema.compile({
    'type': 'object',
    'properties': {
        'a': {'type': 'number', 'default': 42},
    },
})
data = validate({})
assert data == {'a': 42}
But with an array, it will only fill in the defaults for as many array items as are defined in the schema, which means that if the user supplies more array items than the schema covers, the extra items are not validated at all.
Is there a way to declare that all items in the array will follow the same schema, and that they should all be validated?
Currently, when I define the following in the schema:
{
    "products": {
        "type": "array",
        "default": [],
        "items": [
            {
                "type": "object",
                "default": {},
                "properties": {
                    "string_a": {
                        "type": "string",
                        "default": "a"
                    },
                    "string_b": {
                        "type": "string",
                        "default": "b"
                    }
                }
            }
        ]
    }
}
What will happen when I try to validate
{"products":[{},{}]}
is that it becomes
{"products":[{"string_a":"a","string_b":"b"},{}]}
This can cause issues with missing data, and of course it's better to have the whole thing validated.
So is there a way to define a schema for an object in an array, and then have that schema applied to every item in the array?
Thanks
You've got an extra array around your items schema. As you have it written (for JSON Schema versions before 2020-12), items with an array value specifies a schema for each item position individually, rather than one schema for all of them:
"items": [
{ .. this schema is only used for the first item .. },
{ .. this schema is only used for the second item .. },
...
]
Compare to:
"items": { .. this schema is used for ALL items ... }
(The implementation really shouldn't be filling in defaults like that anyway, as that's contrary to the specification, but that's orthogonal.)
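For instance, here is a sketch of the products schema from the question rewritten with the object form, assuming fastjsonschema applies defaults to object-form items the same way it does in your array-form example:
import fastjsonschema

validate = fastjsonschema.compile({
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "default": [],
            # Object form: this one schema applies to EVERY item in the array.
            "items": {
                "type": "object",
                "default": {},
                "properties": {
                    "string_a": {"type": "string", "default": "a"},
                    "string_b": {"type": "string", "default": "b"},
                },
            },
        },
    },
})

data = validate({"products": [{}, {}]})
print(data)
# Expected: {"products": [{"string_a": "a", "string_b": "b"},
#                         {"string_a": "a", "string_b": "b"}]}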
Related
I have been storing references as ObjectId instead of strings to make it easier for $lookup.
However, whenever I have to return a document from a route, I have to convert the document's id and reference ids to strings first. Otherwise, I receive an error message like the following:
TypeError: Object of type ObjectId is not JSON serializable
To make things worse, after I update a document, I have to re-convert all ids and reference ids back to ObjectId before storing it in my MongoDB collection.
Is there a smarter way to do this?
Option 1: Project _id as string in each query
The aggregation operator $toString can be used in the projection of non-aggregated find queries starting from MongoDB version 4.4.
The downside is that this projection needs to be applied across all read queries.
db.collection.find(
    {"name": "Jack"},
    {"_id": {"$toString": "$_id"}, "name": 1}
)
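With pymongo, the same projection might look like the sketch below; the connection URI, database, and collection names are placeholders, not from the question:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust the URI to your setup
collection = client["mydb"]["people"]              # hypothetical db/collection names

# Requires MongoDB 4.4+ for aggregation expressions in find() projections.
for doc in collection.find(
    {"name": "Jack"},
    {"_id": {"$toString": "$_id"}, "name": 1},
):
    print(doc)  # "_id" is now a plain string, so the document is JSON serializable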
Option 2: Store the refs as strings
You may store the refs as strings and modify the $lookup to match between ObjectIDs and hex strings.
Example:
A lookup from an accounts collection, where each document contains a string field userId, to the users collection, matching on _id in users.
[
    {
        "$lookup": {
            "from": "users",
            "as": "user",
            "let": {
                "userId": "$userId"
            },
            "pipeline": [
                {
                    "$match": {
                        "$expr": {
                            "$eq": [
                                { "$toObjectId": "$$userId" },
                                "$_id"
                            ]
                        }
                    }
                }
            ]
        }
    }
]
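To run that pipeline from Python, a sketch along these lines should work; pymongo plus the accounts/users collection names from the example are assumed, and the database name is a placeholder:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]  # hypothetical database name

pipeline = [
    {
        "$lookup": {
            "from": "users",
            "as": "user",
            "let": {"userId": "$userId"},
            "pipeline": [
                {"$match": {"$expr": {"$eq": [{"$toObjectId": "$$userId"}, "$_id"]}}}
            ],
        }
    }
]

for account in db["accounts"].aggregate(pipeline):
    print(account["user"])  # the joined user document(s)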
In Python 3.8, I'm trying to mock up a validation JSON schema for the structure below:
{
    # some other key/value pairs
    "data_checks": {
        "check_name": {
            "sql": "SELECT col FROM blah",
            "expectations": {
                "expect_column_values_to_be_unique": {
                    "column": "col",
                },
                # additional items as required
            }
        },
        # additional items as required
    }
}
The requirements I'm trying to enforce include:
At least one item in data_checks that can have a dynamic name. Item keys should be unique.
sql and expectations keys must be present
sql should be a text string
At least one item in expectations. Item keys should be unique.
Within expectations, item keys must be equal to available methods provided by dir(class_name)
More advanced capability would include:
Enforcing expectations method items to only include kwargs for that method
I currently have the following JSON schema for the data_checks portion:
"data_checks": {
"description": "Data quality checks against provided sources.",
"minProperties": 1,
"type": "object",
"patternProperties": {
".+": {
"required": ["expectations", "sql"],
"sql": {
"description": "SQL for data quality check.",
"minLength": 1,
"type": "string",
},
"expectations": {
"description": "Great Expectations function name.",
"minProperties": 1,
"type": "object",
"anyOf": [
{
"type": "string",
"minLength": 1,
"pattern": [e for e in dir(SqlAlchemyDataset) if e.startswith("expect_")],
}
],
},
},
},
},
This JSON schema does not enforce expectations to have at least one item, nor does it enforce valid method names for the nested keys as expected from [e for e in dir(SqlAlchemyDataset) if e.startswith("expect_")]. I haven't really looked into enforcing kwargs for the selected method (is that even possible?).
I don't know if this is related to things being nested, but how would I enforce the proper validation requirements?
Thanks!
I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation.
I use the Python client running in containers to index and perform the search.
My corpus is a JSONL file a bit like this:
{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...
I create an Elasticsearch index successfully, trying to use custom settings and analyzer, taking inspiration from the official examples and MinHash docs:
def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 5,
                            "max_shingle_size": 5,
                            "output_unigrams": False
                        },
                        "my_minhash_filter": {
                            "type": "min_hash",
                            "hash_count": 10,
                            "bucket_count": 512,
                            "hash_set_size": 1,
                            "with_rotation": True
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "tokenizer": "standard",
                            "filter": [
                                "my_shingle_filter",
                                "my_minhash_filter"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )
I verify via Kibana that index creation has no big problems, and by visiting http://localhost:9200/documents/_settings I get something that seems in order.
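If you want to check this from Python instead of Kibana, the client can fetch the settings directly; this is just a quick sanity check I added, not part of the original code:
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # same client passed to create_index

# Confirm the custom shingle/min_hash filters and analyzer were stored.
settings = client.indices.get_settings(index="documents")
print(settings["documents"]["settings"]["index"]["analysis"])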
However, querying the index with:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            "match": {
                "body": {
                    "query": body,
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]
My res['hits'] is consistently empty, even if I set body to match exactly the text of one of the entries in my corpus. In other words, I don't get any results if I try values for body such as
"I come up here for perception and clarity"
or substrings like
"I come up here for perception"
while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.
Is there something wrong in my query and/or the way I index documents into Elasticsearch? Am I missing something else entirely?
P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)
Here are some things that you should double-check as they are likely culprits:
when you create your mapping, you should change "name" to "text" in the body param of your client.indices.create call, because your JSON documents have a field called text:
"mappings": {
"properties": {
"text": {"type": "text", "analyzer": "my_analyzer"}
}
in the indexing phase you could also rework your generate_actions() method following the documentation, with something like:
for elem in corpus:
    yield {
        "_op_type": "index",
        "_index": "documents",
        "_id": elem["id"],
        "_source": {"text": elem["text"]}
    }
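For completeness, a sketch of the indexing step could look like this; the host, the corpus.jsonl path, and the exact generate_actions() wrapper are my assumptions:
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust host/auth to your setup

def generate_actions(corpus):
    # Yields one bulk action per document, as in the snippet above.
    for elem in corpus:
        yield {
            "_op_type": "index",
            "_index": "documents",
            "_id": elem["id"],
            "_source": {"text": elem["text"]},
        }

with open("corpus.jsonl") as f:  # hypothetical path to the JSONL corpus
    corpus = [json.loads(line) for line in f]

helpers.bulk(es, generate_actions(corpus))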
Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.
Also, according to your mapping, you are using a min_hash token filter, so Lucene will transform the text inside the text field into hashes. So you can query against this field with a hash, and not with a plain string as you have done in your example "I come up here for perception and clarity".
So the best way to use it is to retrieve the content of the text field and then query Elasticsearch for that same retrieved value. Also, the _id metafield is not inside the _source metafield, so you should change your get_duplicate_documents() method to:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['text'],
        'size': K,
        'query': {
            "match": {
                "text": {  # I changed this line!
                    "query": body
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    # also changed the list comprehension, and added a return
    top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
    return top_matches
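A sketch of the retrieve-then-query flow described above might then look like this; the document id 1 comes from the example corpus, everything else is assumed:
# Fetch the stored text for one document, then search with that exact text.
stored = es.get(index="documents", id=1)
query_text = stored["_source"]["text"]

top_matches = get_duplicate_documents(query_text, K=10, es=es)
print(top_matches)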
I'm using Python to validate JSON against a JSON schema (Draft 4). jsonschema is working nicely for basic validation, but I'd like to add additional checks (outside of jsonschema) for specific subschemas (defined under the "definitions" keyword). To do that, I'd like to pull all fields implementing a particular subschema from an instance being checked.
e.g. if I have the schema
{ "definitions": {
"fu": {
"type": "object",
...
},
...
"properties": {
"P1": {
"items": { "$ref": "'#/definitions/fu"},
"type": "array"
},
"P2": { "$ref": "'#/definitions/fu" }
}
"P3": { "type": "string" }
}
I'd like to pull all instances of fields that use the "#/definitions/fu" subschema.
So, from
{
    "P1": ...,
    "P2": ...,
    "P3": ...
}
I should get the list of fu objects from P1, the fu object from P2, and nothing from P3.
I can't work out how to do this, or if this is possible using the main python libraries: jsonschema or json_schema_validator. Any suggestions?
I'm trying to test a lot of JSON documents against a schema, and I use an object with all the required field names to keep track of how many errors each has.
Is there a function in any Python library that creates a sample object with boolean values indicating whether a particular field is required? i.e.
From this schema:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "type": {
            "type": "string"
        },
        "position": {
            "type": "array"
        },
        "content": {
            "type": "object"
        }
    },
    "additionalProperties": false,
    "required": [
        "type",
        "content"
    ]
}
I need to get something like:
{
    "type": True,
    "position": False,
    "content": True
}
I need it to support references to definitions as well
I don't know of a library that will do this, but this simple function uses a dict comprehension to get the desired result.
def required_dict(schema):
    return {
        key: key in schema['required']
        for key in schema['properties']
    }

print(required_dict(schema))
Example output from your provided schema:
{'content': True, 'position': False, 'type': True}
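If some of the schemas you test against omit the required keyword entirely, a slightly more defensive variant (my addition, not part of the original answer) avoids a KeyError:
def required_dict(schema):
    # Treat a missing "required" list as "no properties are required".
    required = set(schema.get("required", []))
    return {key: key in required for key in schema.get("properties", {})}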