I'm using Python to validate JSON against a JSON Schema (Draft 4). jsonschema is working nicely for basic validation, but I'd like to add additional checks (outside of jsonschema) for specific subschemas (defined under the "definitions" keyword). To do that, I'd like to pull all fields implementing a particular subschema from an instance being checked.
e.g. if I have the schema
{ "definitions": {
"fu": {
"type": "object",
...
},
...
"properties": {
"P1": {
"items": { "$ref": "'#/definitions/fu"},
"type": "array"
},
"P2": { "$ref": "'#/definitions/fu" }
}
"P3": { "type": "string" }
}
I'd like to pull all values in an instance that are described by the subschema "#/definitions/fu".
So, from
{
  "P1": ...,
  "P2": ...,
  "P3": ...
}
I should get the list of fu objects from P1, the fu object from P2, nothing
from P3.
I can't work out how to do this, or whether it is possible at all, using the main Python libraries: jsonschema or json_schema_validator. Any suggestions?
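For concreteness, here is roughly what I could hand-roll for this exact schema shape (top-level properties only); I'm hoping a library can do this generically, following $ref through items, allOf, nested objects and so on:
# Hand-rolled sketch for this exact schema shape only, not a general solution
def collect_fu(schema, instance, target="#/definitions/fu"):
    found = []
    for name, subschema in schema.get("properties", {}).items():
        if subschema.get("$ref") == target and name in instance:
            # property is a single fu object
            found.append(instance[name])
        elif subschema.get("items", {}).get("$ref") == target:
            # property is an array of fu objects
            found.extend(instance.get(name, []))
    return found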
Related
I am trying to validate JSON for required fields using Python. I am doing it manually by iterating through the JSON and reading it. However, I am looking for more of a library / generic solution to handle all scenarios.
For example, I want to check whether a particular attribute is present in all the items of a list.
Here is the sample JSON which I am trying to validate.
{
"service": {
"refNumber": "abc",
"item": [{
"itemnumber": "1",
"itemloc": "uk"
}, {
"itemnumber": "2",
"itemloc": "us"
}]
}
}
I want to validate if I have refNumber and itemnumber in all the list items.
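My manual check currently looks roughly like this (simplified); I'd rather express it declaratively with a library:
# Simplified version of the manual iteration I am doing today
def has_required_fields(doc):
    service = doc.get("service", {})
    if "refNumber" not in service:
        return False
    return all("itemnumber" in item for item in service.get("item", []))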
A JSON Schema is a way to define the structure of JSON.
There are accompanying Python packages which can use a JSON Schema to validate JSON (jsonschema).
The JSON Schema for your example would look approximately like this:
{
"type": "object",
"properties": {
"service": {
"type": "object",
"properties": {
"refNumber": {
"type": "string"
},
"item": {
"type": "array",
"items": {
"type": "object",
"properties": {
"itemnumber": {
"type": "string"
},
"itemloc": {
"type": "string"
}
}
}
}
}
}
}
}
i.e., an object containing service, which itself contains a refNumber and a list of items.
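To actually flag a missing refNumber or itemnumber (rather than only checking types), you would also add "required" lists to the schema. A rough usage sketch, assuming the schema above has been loaded into a dict called schema and your document into data (both names are placeholders):
# Sketch: add "required" constraints to the schema above, then validate
from jsonschema import validate, ValidationError

service = schema["properties"]["service"]
service["required"] = ["refNumber", "item"]
service["properties"]["item"]["items"]["required"] = ["itemnumber"]

try:
    validate(instance=data, schema=schema)
except ValidationError as err:
    print(err.message)  # e.g. "'itemnumber' is a required property"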
Since I don't have enough rep to add a comment, I will post this as an answer.
First, I have to say I don't program in Python.
According to my Google search, there is a jsonschema module available for Python.
from jsonschema import validate
schema = {
    "type": "object",
    "properties": {
        "service": {
            "type": "object",
            "properties": {
                "refNumber": {"type": "string"},
                "item": {"type": "array"}
            },
            "required": ["refNumber"]
        }
    }
}
validate(instance=yourJSON, schema=yourValidationSchema)
This example is untested, but it should give you the idea.
Link to jsonschema docs
The title says it all, really. I'm struggling to figure out how to make a Google Cloud Pub/Sub schema that has optional fields. Or would having optional fields in an Avro schema contradict the whole point of having a schema?
The structure I tried is this, with no success:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "TestStringField",
"type": ["null", "string"],
"default": ""
},
{
"name": "TestIntField",
"type": ["null", "int"],
"default": 0
}
]
}
The schema as presented in the question needs to use null as the default value for each field, since null is the first type in each union:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "TestStringField",
"type": ["null", "string"],
"default": null
},
{
"name": "TestIntField",
"type": ["null", "int"],
"default": null
}
]
}
A message that wants to have non-null fields has to wrap each value in an object keyed by its type, e.g.:
{
"TestStringField": {
"string": "Hi"
},
"TestIntField": {
"int": 7
}
}
There is currently an issue with supporting nullable fields with Avro schemas: https://issuetracker.google.com/issues/242757468.
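For what it's worth, publishing one of those JSON-encoded messages from Python might look roughly like this; the project and topic names are placeholders, and the topic is assumed to have the Avro schema attached with JSON encoding:
# Rough sketch: publish the JSON-encoded message shown above
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

message = {
    "TestStringField": {"string": "Hi"},
    "TestIntField": {"int": 7},
}
future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print(future.result())  # message id once the publish succeeds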
Transforming the schema into a protobuf (proto2) schema worked for me. The messages themselves can stay in JSON format.
Your protobuf schema will probably look like this:
syntax = "proto2";
message ProtocolBuffer {
optional string TestStringField = 1;
optional int32 TestIntField = 2;
}
If you use the GCP console, make sure to select Protocol Buffer as the schema type. Also note that Protocol Buffers version 3 did not work for me; stay with version 2.
gcloud pubsub schemas create key-schema \
    --definition='{ "type": "record", "namespace": "my.ns", "name": "KeyMsg", "fields": [ { "name": "key", "type": "string" } ] }' \
    --type=AVRO
I have a set of jsonschema compliant documents. Some documents contain references to other documents (via the $ref attribute). I do not wish to host these documents such that they are accessible at an HTTP URI. As such, all references are relative. All documents live in a local folder structure.
How can I make python-jsonschema understand to properly use my local file system to load referenced documents?
For instance, suppose I have a document with filename defs.json containing some definitions, and I try to load a different document which references it, like:
{
"allOf": [
{"$ref":"defs.json#/definitions/basic_event"},
{
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": ["page_load"]
}
},
"required": ["action"]
}
]
}
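I load and validate it naively, roughly like this (the file name is just illustrative):
# Naive load-and-validate that triggers the error below
import json
from jsonschema import validate

with open("event.schema.json") as fp:
    schema = json.load(fp)

validate({"action": "page_load"}, schema)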
I get an error RefResolutionError: <urlopen error [Errno 2] No such file or directory: '/defs.json'>
It may be important that I'm on a Linux box.
(I'm writing this as a Q&A because I had a hard time figuring this out and observed other folks having trouble too.)
I had the hardest time figuring out how to resolve against a set of schemas that $ref each other without going to the network. It turns out the key is to create the RefResolver with a store, which is a dict mapping URLs to schemas.
import json
from jsonschema import RefResolver, Draft7Validator
address="""
{
"$id": "https://example.com/schemas/address",
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"],
"additionalProperties": false
}
"""
customer="""
{
"$id": "https://example.com/schemas/customer",
"type": "object",
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"shipping_address": { "$ref": "/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"],
"additionalProperties": false
}
"""
data = """
{
"first_name": "John",
"last_name": "Doe",
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
},
"billing_address": {
"street_address": "1st Street SE",
"city": "Washington",
"state": "DC"
}
}
"""
address_schema = json.loads(address)
customer_schema = json.loads(customer)
schema_store = {
address_schema['$id'] : address_schema,
customer_schema['$id'] : customer_schema,
}
resolver = RefResolver.from_schema(customer_schema, store=schema_store)
validator = Draft7Validator(customer_schema, resolver=resolver)
jsonData = json.loads(data)
validator.validate(jsonData)
The above was built with jsonschema==4.9.1.
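For anyone on newer versions: jsonschema 4.18+ deprecates RefResolver in favour of the separate referencing package. A rough sketch of the same setup with that API, reusing address_schema, customer_schema and jsonData from above (treat it as a starting point rather than a drop-in):
from referencing import Registry
from referencing.jsonschema import DRAFT7
from jsonschema import Draft7Validator

# Register every schema under its $id so local $refs resolve without the network
registry = Registry().with_resources(
    (schema["$id"], DRAFT7.create_resource(schema))
    for schema in (address_schema, customer_schema)
)
Draft7Validator(customer_schema, registry=registry).validate(jsonData)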
You must build a custom jsonschema.RefResolver for each schema which uses a relative reference and ensure that your resolver knows where on the filesystem the given schema lives.
Such as...
import os
import json
from jsonschema import Draft4Validator, RefResolver # We prefer Draft7, but jsonschema 3.0 is still in alpha as of this writing
abs_path_to_schema = '/path/to/schema-doc-foobar.json'
with open(abs_path_to_schema, 'r') as fp:
    schema = json.load(fp)
resolver = RefResolver(
# The key part is here where we build a custom RefResolver
# and tell it where *this* schema lives in the filesystem
# Note that `file:` is for unix systems
schema_path='file:{}'.format(abs_path_to_schema),
schema=schema
)
Draft4Validator.check_schema(schema) # Unnecessary but a good idea
validator = Draft4Validator(schema, resolver=resolver, format_checker=None)
# Then you can...
data_to_validate = {...}  # your instance document here
validator.validate(data_to_validate)
EDIT-1
Fixed a wrong reference ($ref) to base schema.
Updated the example to use the one from the docs: https://json-schema.org/understanding-json-schema/structuring.html
EDIT-2
As pointed out in the comments, I'm using the following imports:
from jsonschema import validate, RefResolver
from jsonschema.validators import validator_for
This is just another version of #Daniel's answer -- which was the one that worked for me. Basically, I decided to define the $schema in a base schema, which frees the other schemas from it and makes for a clearer call when instantiating the resolver.
Given that RefResolver.from_schema() takes (1) some schema and also (2) a schema store, it was not very clear to me whether the order mattered and which "some" schema was relevant here; hence the structure you see below.
I have the following:
base.schema.json:
{
"$schema": "http://json-schema.org/draft-07/schema#"
}
definitions.schema.json:
{
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
address.schema.json:
{
"type": "object",
"properties": {
"billing_address": { "$ref": "definitions.schema.json#" },
"shipping_address": { "$ref": "definitions.schema.json#" }
}
}
I like this setup for two reasons:
It makes for a cleaner call to RefResolver.from_schema():
base = json.loads(open('base.schema.json').read())
definitions = json.loads(open('definitions.schema.json').read())
schema = json.loads(open('address.schema.json').read())
schema_store = {
base.get('$id','base.schema.json') : base,
definitions.get('$id','definitions.schema.json') : definitions,
schema.get('$id','address.schema.json') : schema,
}
resolver = RefResolver.from_schema(base, store=schema_store)
Then I use the handy tool the library provides to give you the best validator for your schema (according to its $schema key), validator_for:
Validator = validator_for(base)
And then just put them together to instantiate validator:
validator = Validator(schema, resolver=resolver)
Finally, you validate your data:
data = {
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
},
"billing_address": {
"street_address": "1st Street SE",
"city": "Washington",
"state": 32
}
}
This one will fail, since "state" is 32:
>>> validator.validate(data)
ValidationError: 32 is not of type 'string'
Failed validating 'type' in schema['properties']['billing_address']['properties']['state']:
{'type': 'string'}
On instance['billing_address']['state']:
32
Change that to "DC", and it will validate.
Following up on the answer #chris-w provided, I wanted to do the same thing with jsonschema 3.2.0, but his answer didn't quite cover it. I hope this answer helps those who are still coming to this question for help but are using a more recent version of the package.
To extend a JSON schema using the library, do the following:
Create the base schema:
base.schema.json
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
Create the extension schema
extend.schema.json
{
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
Create your JSON file you want to test against the schema
data.json
{
"prop": "This is the property",
"extra": true
}
Create your RefResolver and Validator for the base Schema and use it to check the data
#Set up schema, resolver, and validator on the base schema
baseSchema = json.loads(baseSchemaJSON) # Create a schema dictionary from the base JSON file
relativeSchema = json.loads(relativeJSON) # Create a schema dictionary from the relative JSON file
resolver = RefResolver.from_schema(baseSchema) # Creates your resolver, uses the "$id" element
validator = Draft7Validator(relativeSchema, resolver=resolver) # Create a validator against the extended schema (but resolving to the base schema!)
# Check validation!
data = json.loads(dataJSON) # Create a dictionary from the data JSON file
validator.validate(data)
You may need to make a few adjustments to the above entries, such as not using the Draft7Validator. This should work for single-level references (children extending a base); you will need to be careful with your schemas and how you set up the RefResolver and Validator objects.
P.S. Here is a snippet that exercises the above. Try modifying the data string to remove one of the required attributes:
import json
from jsonschema import RefResolver, Draft7Validator
base = """
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
"""
extend = """
{
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
"""
data = """
{
"prop": "This is the property string",
"extra": true
}
"""
schema = json.loads(base)
extendedSchema = json.loads(extend)
resolver = RefResolver.from_schema(schema)
validator = Draft7Validator(extendedSchema, resolver=resolver)
jsonData = json.loads(data)
validator.validate(jsonData)
My approach is to preload all schema fragments into the RefResolver cache. I created a gist that illustrates this: https://gist.github.com/mrtj/d59812a981da17fbaa67b7de98ac3d4b
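A minimal sketch of that idea (the directory name and the presence of an "$id" in every fragment are assumptions):
# Preload every schema fragment into the resolver's cache, keyed by $id
import json
from pathlib import Path
from jsonschema import Draft7Validator, RefResolver

fragments = [json.loads(p.read_text()) for p in Path("schemas").glob("*.json")]
main_schema = fragments[0]  # whichever schema you validate against
resolver = RefResolver.from_schema(main_schema)
for fragment in fragments:
    resolver.store[fragment["$id"]] = fragment

validator = Draft7Validator(main_schema, resolver=resolver)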
This is what I used to dynamically generate a schema_store from all the schemas in a given directory:
base.schema.json
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
extend.schema.json
{
"$id": "extend.schema.json",
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
instance.json
{
"prop": "This is the property string",
"extra": true
}
validator.py
import json
from pathlib import Path
from jsonschema import Draft7Validator, RefResolver
from jsonschema.exceptions import RefResolutionError
schemas = (json.load(open(source)) for source in Path("schema/dir").iterdir())
schema_store = {schema["$id"]: schema for schema in schemas}
schema = json.load(open("schema/dir/extend.schema.json"))
instance = json.load(open("instance/dir/instance.json"))
resolver = RefResolver.from_schema(schema, store=schema_store)
validator = Draft7Validator(schema, resolver=resolver)
try:
    # sort on a list copy of each path; deques themselves are not orderable
    errors = sorted(validator.iter_errors(instance), key=lambda e: list(e.path))
except RefResolutionError as e:
    print(e)
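If resolution succeeds, the collected errors can then be reported, for example:
# Report each validation error with the path to the offending element
for error in errors:
    print(f"{list(error.path)}: {error.message}")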
I'm trying to test a lot of JSON documents against a schema, and I use an object with all the required field names to keep track of how many errors each one has.
Is there a function in any Python library that creates a sample object with boolean values indicating whether a particular field is required? i.e.
From this schema:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"type": {
"type": "string"
},
"position": {
"type": "array"
},
"content": {
"type": "object"
}
},
"additionalProperties": false,
"required": [
"type",
"content"
]
}
I need to get something like:
{
"type" : True,
"position" : False,
"content" : True
}
I need it to support references to definitions as well.
I don't know of a library that will do this, but this simple function uses a dict comprehension to get the desired result.
def required_dict(schema):
    return {
        key: key in schema['required']
        for key in schema['properties']
    }
print(required_dict(schema))
Example output from your provided schema
{'content': True, 'position': False, 'type': True}
Edit: link to repl.it example
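Since the question also asks about references to definitions, here is a hedged variant that additionally follows simple local "#/definitions/..." refs (it does not handle nested or external references):
def required_dict(schema, root=None):
    # Follow a simple local $ref such as "#/definitions/fu", if present
    root = root if root is not None else schema
    if "$ref" in schema:
        name = schema["$ref"].split("/")[-1]
        schema = root["definitions"][name]
    required = set(schema.get("required", []))
    return {key: key in required for key in schema.get("properties", {})}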
I used the following function in Python to initialize an index in Elasticsearch.
def init_index():
    constants.ES_CLIENT.indices.create(
        index=constants.INDEX_NAME,
        body={
            "settings": {
                "index": {
                    "type": "default"
                },
                "number_of_shards": 1,
                "number_of_replicas": 1,
                "analysis": {
                    "filter": {
                        "ap_stop": {
                            "type": "stop",
                            "stopwords_path": "stoplist.txt"
                        },
                        "shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 5,
                            "output_unigrams": True
                        }
                    },
                    "analyzer": {
                        constants.ANALYZER_NAME: {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["standard",
                                       "ap_stop",
                                       "lowercase",
                                       "shingle_filter",
                                       "snowball"]
                        }
                    }
                }
            }
        }
    )
    new_mapping = {
        constants.TYPE_NAME: {
            "properties": {
                "text": {
                    "type": "string",
                    "store": True,
                    "index": "analyzed",
                    "term_vector": "with_positions_offsets_payloads",
                    "search_analyzer": constants.ANALYZER_NAME,
                    "index_analyzer": constants.ANALYZER_NAME
                }
            }
        }
    }
    constants.ES_CLIENT.indices.put_mapping(
        index=constants.INDEX_NAME,
        doc_type=constants.TYPE_NAME,
        body=new_mapping
    )
Using this function I was able to create an index with user-defined specs.
I recently started to work with Scala and Spark. To integrate Elasticsearch I can either use Spark's API, i.e. org.elasticsearch.spark, or I can use Hadoop's, i.e. org.elasticsearch.hadoop. Most of the examples I see relate to Hadoop's methodology, but I don't wish to use Hadoop here. I went through the Spark-Elasticsearch documentation and was able to at least index documents without involving Hadoop, but I noticed that this created everything with defaults; I can't even specify the _id there. It generates the _id on its own.
In scala I use the following code for indexing (not complete code):
val document = mutable.Map[String, String]()
document("id") = docID
document("text") = textChunk.mkString(" ") //textChunk is a list of Strings
sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document")
This created an index this way:
{
"es_park_ap": {
"mappings": {
"document": {
"properties": {
"id": {
"type": "string"
},
"text": {
"type": "string"
}
}
}
},
"settings": {
"index": {
"creation_date": "1433006647684",
"uuid": "QNXcTamgQgKx7RP-h8FVIg",
"number_of_replicas": "1",
"number_of_shards": "5",
"version": {
"created": "1040299"
}
}
}
}
}
So if I pass a document to it, the following document is created:
{
"_index": "es_park_ap",
"_type": "document",
"_id": "AU2l2ixcAOrl_Gagnja5",
"_score": 1,
"_source": {
"text": "some large text",
"id": "12345"
}
}
Just like in Python, how can I use Spark and Scala to create an index with user-defined specifications?
I think we should divide your question into several smaller issues.
If you want to create an index with specific mappings / settings, you should use the Elasticsearch Java API directly (you can use it from Scala code, of course).
You can use the following sources for examples of index creation using Scala:
Creating index and adding mapping in Elasticsearch with java api gives missing analyzer errors
Define custom ElasticSearch Analyzer using Java API
The Elasticsearch Hadoop / Spark plugin is used to transport data easily from HDFS to ES; ES maintenance should be done separately.
The fact that you are still seeing an automatically generated _id is because you must tell the plugin which field to use as the id, with the following syntax:
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "your_id_field"))
Or in your case:
sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document", Map("es.mapping.id" -> "your_id_field"))
You can find more details about syntax and proper use here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Michael