jsonschema: validating object keys with a custom function - Python

I am using python-jsonschema for JSON validation. I have an object with localised texts whose keys are RFC 1766 language codes, as follows:
"Description": {
"en": "English Description",
"sv": "Swedish Description",
"fr": "French Description"
},
I've read in the documentation that I could use the 'format' attribute to check a custom format with a function. So, I wrote a method that takes a string as a parameter and returns True if it is an RFC 1766 language string:
@_checks_drafts('rfc1766lang')
def rfc1766lang(instance):
    """some logic, return True if rfc1766"""
However, I couldn't find any example of how to apply this to validate an object key rather than a value.
Is this possible?
I have tried something like the following, but without success:
rfc1766_string_schema_v2 = {
    'type': 'object',
    'format': 'rfc1766lang',
    'additionalProperties': False
}
I know that it would be much easier if the JSON were structured as follows. However, this is not an option for now.
"Description": [{
"lan": "en",
"text": "Description in English"
}, {
"lan": "sv",
"name": "Description in Swedish"
}]

This is a very good and relevant question because this is actually part of the proposed syntax for v5, so the official meta-schema will have to deal with this as well.
JSON Schema cannot specify a "format" for object keys. The only "validation" JSON Schema supports for object keys is patternProperties, which matches property names against a regular expression.
For language codes, the best you can do is probably something like:
{
    "type": "object",
    "patternProperties": {
        "^[a-zA-Z]+(-[a-zA-Z]+)*$": {...}
    },
    "additionalProperties": false
}
That would restrict the data to properties whose names match that pattern, but it's not the full RFC 1766 validation you're looking for, I'm afraid.
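If that is close enough, a minimal sketch of how it could be wired up with python-jsonschema is below; the schema handles the shape and the coarse key pattern, and a separate helper checks each key (the is_rfc1766 function here is a hypothetical stand-in for the asker's rfc1766lang check, not part of the library):

import re
import jsonschema

description_schema = {
    "type": "object",
    "patternProperties": {
        "^[a-zA-Z]+(-[a-zA-Z]+)*$": {"type": "string"}
    },
    "additionalProperties": False
}

def is_rfc1766(tag):
    # Hypothetical stand-in: a stricter RFC 1766 check would go here.
    return re.match(r"^[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*$", tag) is not None

def validate_description(instance):
    # Schema validation covers the overall shape and the coarse key pattern...
    jsonschema.validate(instance, description_schema)
    # ...and the keys are checked separately, since "format" is not applied
    # to property names.
    bad = [key for key in instance if not is_rfc1766(key)]
    if bad:
        raise jsonschema.ValidationError("not RFC 1766 language tags: %r" % bad)

validate_description({"en": "English Description", "sv": "Swedish Description"})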

Related

Using Solr in Docker with Python returns wrong results

I have a Flask app that runs in a Docker container, and I wanted to use Solr with it for indexing and searching, so I built a container for Solr from the official Solr image and wired it to my app with docker-compose.
In the app I have multiple types of objects that I want to index, for example type1 and type2, and each type has its own fields. So in Solr I have documents with different fields: doc1 could have field1 and field2, while doc2 could have field3, field4 and field5, and each document has a field called type to specify its type.
I have two kinds of search. The first searches for documents of a specific type; here is an example, issued with the requests Python package:
response = requests.get("http://solr:8983/solr/myCollection/select?q=*val*&defType=edismax&fq=type:type1&qf=field1^2&qf=field2^1")
The other is an overall search across documents of all types; here is its URL example:
response = requests.get("http://solr:8983/solr/myCollection/select?q=*val*&defType=edismax&fq=type:type1||type2&qf=field1^1&qf=field2^1&qf=field3^1&qf=field4^1&qf=field1^1")
I have two problems with my work:
I don't get the results I expect when I run some queries.
Some fields have values with special characters, like (z=x+y*f), and when I try to escape these special characters with '\' it doesn't work.
So, is there something wrong with the queries I wrote, and is there any article or tutorial that could help me? I searched the documentation and the internet a lot, but I couldn't find a way to solve my problems.
Note: I didn't change the schema file; I left it at the default.
I solved the problems by using tokenizers and filters for indexing and querying.
You can configure them through the API that Solr provides.
Here is an example of the JSON payload to add tokenizers and filters to a field type:
{
    "replace-field-type": {
        "name": "field_name",
        "class": "solr.TextField",
        "multiValued": true,
        "indexAnalyzer": {
            "tokenizer": {
                "class": "solr.LowerCaseTokenizerFactory"
            },
            "filters": [
                {
                    "class": "solr.LowerCaseFilterFactory"
                }
            ]
        },
        "queryAnalyzer": {
            "tokenizer": {
                "class": "solr.WhitespaceTokenizerFactory",
                "rule": "java"
            },
            "filters": [
                {
                    "class": "solr.LowerCaseFilterFactory"
                }
            ]
        }
    }
}
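For reference, here is a rough sketch of how such a payload could be sent from Python and how the special characters from the question could be escaped before querying. The base URL and collection name are taken from the question; the helper names and the exact escape list are my assumptions, not part of the original answer:

import re
import requests

SOLR = "http://solr:8983/solr/myCollection"

def replace_field_type(definition):
    # POST the "replace-field-type" payload shown above to the Schema API.
    resp = requests.post(SOLR + "/schema", json={"replace-field-type": definition})
    resp.raise_for_status()
    return resp.json()

def escape_solr_query(value):
    # Backslash-escape Lucene/Solr query-syntax characters such as + - * ( ).
    return re.sub(r'([+\-!(){}\[\]^"~*?:\\/=&|])', r"\\\1", value)

params = {
    "q": escape_solr_query("z=x+y*f"),
    "defType": "edismax",
    "fq": "type:type1",
    "qf": "field1^2 field2^1",
}
response = requests.get(SOLR + "/select", params=params)
print(response.json()["response"]["numFound"])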

JSON Schema validation in Python

I wonder whether it is possible to keep one Swagger JSON file with all schemas and validate against it across all tests.
Assume somewhere in a long .json file a schema is specified like so:
...
"Stop": {
"type": "object",
"properties": {
"id": {
"description": "id identifier",
"type": "string"
},
"lat": {
"$ref": "#/components/schemas/coordinate"
},
"lng": {
"$ref": "#/components/schemas/coordinate"
},
"name": {
"description": "Name of location",
"type": "string"
},
"required": [
"id",
"name",
"lat",
"lng"
]
}
...
So the lat and lng schemas are defined by another schema in the same file (at the top).
Now I want to get a response from the backend containing an array of those stops and validate it against the schema.
How should I approach it?
I am trying to load a partial schema and validate against it, but then the $ref won't resolve. Also, is there any way to tell the validator to accept arrays? The schema only describes what a single object looks like. If I manually add
"type": "array",
"items": {
Stop...
with hardcoded coordinates
}
Then it seems to work.
Here are my functions to validate arbitrary response JSON against a chosen "chunk" of the full Swagger schema:
import json
import jsonschema
from jsonschema import validate

def pick_schema(schema):
    with open("json_schemas/full_schema.json", "r") as f:
        jschema = json.load(f)
    return jschema["components"]["schemas"][schema]

def validate_json_with_schema(json_data, json_schema_name):
    validate(
        json_data,
        schema=pick_schema(json_schema_name),
        format_checker=jsonschema.FormatChecker(),
    )
Maybe another approach is preferable, such as separate files with schemas for API responses? (That would be quite tedious to write, though; full_schema.json is generated from Swagger itself.)
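One possible direction, sketched below under the assumption that jsonschema's RefResolver is available (the function names mirror the ones above but are otherwise my own), is to resolve $ref against the full Swagger document and wrap the chosen schema in an array schema, so a list of stops can be validated without hardcoding coordinates:

import json
import jsonschema

def pick_array_schema(full_schema, name):
    # Wrap the single-object schema so the validator accepts a list of them.
    return {"type": "array", "items": full_schema["components"]["schemas"][name]}

def validate_stop_list(json_data):
    with open("json_schemas/full_schema.json", "r") as f:
        full_schema = json.load(f)
    # Resolve "#/components/schemas/..." references against the full document.
    resolver = jsonschema.RefResolver(base_uri="", referrer=full_schema)
    jsonschema.validate(
        json_data,
        pick_array_schema(full_schema, "Stop"),
        resolver=resolver,
        format_checker=jsonschema.FormatChecker(),
    )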

JSON Type Validation - Guidance

I have a stream of events coming in as JSON. The schema for the JSON is well defined, but the source producing them doesn't always behave when it comes to types.
Example Schema:
{
    "type": "object",
    "$schema": "http://json-schema.org/draft-03/schema",
    "properties": {
        "FirstName": {
            "type": "string",
            "id": "http://jsonschema.net/FirstName",
            "required": false
        },
        "MiddleName": {
            "type": "string",
            "id": "http://jsonschema.net/MiddleName",
            "required": false
        },
        "LastName": {
            "type": "string",
            "id": "http://jsonschema.net/LastName",
            "required": false
        },
        "Age": {
            "type": "number",
            "id": "http://jsonschema.net/Age",
            "required": false
        }
    }
}
In some cases the Age shows up as a "-" character, meaning it was left blank when the record was created. Obviously this isn't a number, thus my problem.
I'm not using any formal JSON validation library, but I was considering looping through each element of the event and handling any needed type conversion. In the example above, I would just make Age 0.
Is there a way to validate each element and then apply some kind of conversion function if it fails validation?
I ended up using Schematics with custom Types to do this. Works perfectly.
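Roughly along these lines (a sketch of the idea rather than the exact code used; the LenientAgeType name and its coercion rule are my own illustration), a custom Schematics type can coerce the placeholder "-" to 0 before validation:

from schematics.models import Model
from schematics.types import IntType, StringType

class LenientAgeType(IntType):
    # Hypothetical custom type: treat placeholder values as 0 instead of failing.
    def to_native(self, value, context=None):
        if value in ("-", "", None):
            return 0
        return super(LenientAgeType, self).to_native(value, context)

class Person(Model):
    FirstName = StringType()
    MiddleName = StringType()
    LastName = StringType()
    Age = LenientAgeType(default=0)

person = Person({"FirstName": "Ada", "LastName": "Lovelace", "Age": "-"})
person.validate()
print(person.Age)  # 0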

ElasticSearch-Haystack: Spanish Tokenizer "Fails"

I'm using:
Haystack - 2.1.0
ElasticSearch - 0.90.3
pyelasticsearch - 0.6
I've configured a custom backend to change the default Elasticsearch settings and use the Spanish analyzer.
I'm using these settings for Elasticsearch:
"settings" : {
"index": {
"uuid": "IPwcMthwRpSJzpjtarc9eQ",
"analysis": {
"analyzer": {
"default": {
"filter": ["standard", "lowercase", "asciifolding", ],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"number_of_shards": "10",
}
},
"analyzer": {
"spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
I read these settings in some answer here. When I apply them to Elasticsearch and reindex my models, I get behaviour that I'm not sure I understand.
I have some objects with names like "Ciencias" and others like "Ciéncies". When I do a search like "ciencias" I get objects with names like "Ciencias" and "Ciéncies", and the same happens when I search for "ciencies" or "ciéncies".
I want Elasticsearch to ignore accents, which is why I'm using asciifolding, and I'm using the Spanish analyzer because most of the text is in Spanish. I don't understand why different words like "cienciAs" and "cienciEs" return the same results.
Why is this happening? Is it because a default ngram analyzer is splitting the words?
Why do I get objects with names like "ciénciEs" when I search for "cienciAs"?
Probably because the stemmer is doing its job. If you want to find out what happens while tokenising or stemming, install the inquisitor plugin and go to the Analyzers tab (see here).
In the end I removed the Spanish analyzer and everything began to work as expected.
Now I'm using only the asciifolding and lowercase filters; accents and ñ's are indexed correctly, and I no longer have the issue with "ciencias" and "ciencies".
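For anyone wanting to reproduce that setup, a rough sketch of the analysis settings this leaves you with is below; the ELASTICSEARCH_INDEX_SETTINGS name assumes Haystack's Django setting for overriding index settings, and the exact dict layout is my reconstruction rather than the poster's code:

# Django settings.py: keep only the standard tokenizer with the lowercase
# and asciifolding filters, so accents and ñ are folded but words are not stemmed.
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}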

MongoDB Collection with Non-repeated field value

So I'm rather new to MongoDB. Here is an imaginary database with the following format.
{
    "_id": "message_id",
    "headers": {
        "from": <from_email>,
        "to": <to_email>,
        "timestamp": <timestamp>
    },
    "message": {
        "message": <the message contents>,
        "signature": <signature contents>
    }
}
Suppose all emails received are inserted into it, and sometimes emails are double sent. How can one return a collection of emails from an author without any double sends?
I thought this might do it but it doesn't seem to work as expected:
db.mycoll.find({"headers.from": <authorname>}).distinct("message.message")
Edit:
Please excuse me, it seems I had made some kind of typo; the above query works, but it only returns message.message values without the headers. How would I keep the headers intact as well?
It's hard to tell from your question which part is the "duplicate", and therefore which part should be unique. It stands to reason, though, that things such as the message "_id" and "timestamp" are not going to duplicate, so this only really leaves the message content, with the possible additional paranoia of that message being "from" the same person.
Document reshaping is generally best handled by the aggregation framework:
db.collection.aggregate([
    { "$group": {
        "_id": { "message": "$message.message", "from": "$headers.from" },
        "message_id": { "$first": "$_id" },
        "headers": { "$first": "$headers" },
        "message": { "$first": "$message" }
    }},
    { "$project": {
        "_id": "$message_id",
        "headers": 1,
        "message": 1
    }}
])
The $group collapses any duplicated message content, with the $first operations selecting only the "first" found value for each field on the document grouping boundary.
There is an assumption in here that the existing order is by "timestamp" but if not then you might want to apply a $sort as the first pipeline stage before the others:
{ "$sort": { "headers.timestamp": 1 } }
The final $project really just restores the original document form and removes the "grouping key" that was supplied earlier. Just prettier than duplicating information and/or putting things out of place.
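As a rough illustration of the whole thing from Python (the client setup and database name are my assumptions; the pipeline mirrors the one above, with the $sort stage added first):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]
pipeline = [
    # Sort first so $first picks the earliest message per group.
    {"$sort": {"headers.timestamp": 1}},
    {"$group": {
        "_id": {"message": "$message.message", "from": "$headers.from"},
        "message_id": {"$first": "$_id"},
        "headers": {"$first": "$headers"},
        "message": {"$first": "$message"},
    }},
    {"$project": {"_id": "$message_id", "headers": 1, "message": 1}},
]
for doc in coll.aggregate(pipeline):
    print(doc["_id"], doc["headers"]["from"], doc["message"]["message"])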
You could use distinct() to return an array of distinct messages from a specific author as follows:
db.collection.distinct('message.message', {"headers.from": <authorname>})
What you're looking for is not currently implemented (at least as far as I know). One workaround would be this:
db.mycoll.aggregate([
    {
        $match: { "headers.from": <authorname> }
    },
    {
        $group: {
            _id: "$headers.from",
            "message": { $addToSet: "$message.message" }
        }
    }
])
Building on Neil Lunn's answer above:
I think one can do:
db.collection.aggregate([
    { "$match": { "headers.from": <from email> } },
    { "$group": {
        "_id": "$message.message",
        "headers": { "$first": "$headers" },
        "signature": { "$first": "$message.signature" },
        "message_id": { "$first": "$_id" }
    }},
    { "$project": {
        "_id": "$message_id",
        "headers": "$headers",
        "message": { "message": "$_id", "signature": "$signature" }
    }}
])
Since _id must be unique, the consequence is that duplicate messages will not make the list, and $project then restores the original object structure with the correct key names.
I guess I only have one question in this regard: is there a way to force uniqueness without aggregating into _id, or is this generally considered the correct way to do it in MongoDB?
