TTL in elasticsearch - python

I am using the Elasticsearch Python client to create an index and store data in an AWS Elasticsearch instance.
def create_index():
    """
    create mapping of data
    """
    mappings = '''
    {
        "tweet":{
            "_ttl":{
                "enabled": true,
                "default": "2m"
            },
            "properties": {
                "text":{
                    "type": "string"
                },
                "location":{
                    "type": "geo_point"
                }
            }
        }
    }
    '''
    # Ignore if index already exists
    es.indices.create(index='tweetmap', ignore=400, body=mappings)
With the mapping defined above, I expect the records to be deleted automatically after 2 minutes, but they persist.
What could be the reason?

The mapping definition was wrong: the type mapping has to be nested under a top-level "mappings" key. I corrected it as shown below:
mappings = '''
{
    "mappings":{
        "tweet":{
            "_ttl":{
                "enabled": true,
                "default": "1d"
            },
            "properties": {
                "text":{
                    "type": "string"
                },
                "location":{
                    "type": "geo_point"
                }
            }
        }
    }
}
'''
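One way to make the nesting harder to get wrong is to build the body as a Python dict and wrap the per-type mapping in a small helper. This is only a sketch: `build_index_body` is a hypothetical name, not part of the Elasticsearch client, and the resulting dict is assumed to be passed as `body=` to `es.indices.create` the same way the JSON string is in the question.

```python
import json


def build_index_body(doc_type, type_mapping):
    """Wrap a per-type mapping under the top-level "mappings" key.

    Hypothetical helper for illustration; not an Elasticsearch client API.
    """
    return {"mappings": {doc_type: type_mapping}}


# Same mapping as in the corrected version above, written as a dict.
tweet_mapping = {
    "_ttl": {"enabled": True, "default": "1d"},
    "properties": {
        "text": {"type": "string"},
        "location": {"type": "geo_point"},
    },
}

body = build_index_body("tweet", tweet_mapping)
print(json.dumps(body, indent=2))
```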

Related

Elasticsearch in Python: "unknown parameter [analyser] on mapper [institution] of type [text]" when trying to create index

I'm trying to create my first index in Elasticsearch with Python. I keep getting errors such as "unknown parameter [analyser] on mapper [institution] of type [text]" or "index has not been configured in mapper" when trying to create the index. I've also tried the "put_settings" and "put_mappings" methods, but nothing works. I use version 7 of the Python API and Elasticsearch 7.12.0 on Docker. Here is my code:
def create_expert_index(index_name="expert_query"):
    es = connect()
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name, ignore=[400, 404])
    mappings = {
        "settings": {
            "analysis": {
                "analyser": {
                    "standard_asciifolding": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["asciifolding", "lowercase"]
                    }
                }
            }
        },
        "mappings": {
            "properties":
            {
                "first_name":
                {
                    "type": "text",
                    "analyser": "standard_asciifolding"
                },
                "last_name":
                {
                    "type": "text",
                    "analyser": "standard_asciifolding"
                },
                "email":
                {
                    "type": "text",
                    "analyser": "standard_asciifolding"
                },
                "specializations":
                {
                    "type": "text",
                    "analyser": "standard_asciifolding"
                },
                "institution":
                {
                    "type": "text",
                    "analyser": "standard_asciifolding"
                }
            }
        }
    }
    es.indices.create(index_name, body=mappings)
Tried creating the index with the create, put_settings and put_mappings methods. I expect the index to get created.
You misspelled "analyzer": you wrote it with an "s", as "analyser", both in the settings and in every field mapping. The rest of the request looks OK. Consider adding Kibana to your setup and testing your requests with Dev Tools. That's a really simple way of catching those mistakes, and it's how I found this misspelling.
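If Kibana is not at hand, a misspelled key can also be located from Python before the request is sent. The sketch below walks a request body and reports every place a given key occurs; `find_key` is a hypothetical helper written for this illustration, and the body is a shortened copy of the one in the question.

```python
def find_key(node, wanted, path=()):
    """Yield the path to every occurrence of `wanted` used as a dict key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted:
                yield path + (key,)
            yield from find_key(value, wanted, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, wanted, path + (index,))


# Shortened version of the index body from the question.
body = {
    "settings": {
        "analysis": {
            "analyser": {
                "standard_asciifolding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "institution": {"type": "text", "analyser": "standard_asciifolding"}
        }
    },
}

hits = list(find_key(body, "analyser"))
for hit in hits:
    print(" -> ".join(str(part) for part in hit))
```

Each reported path is a place where "analyser" should be renamed to "analyzer". A blind search-and-replace would also rename a document field that happened to be called "analyser", which is why this sketch only reports locations.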

Python JSON schema validation for array of objects

I am trying to validate a JSON file using the schema listed below, but validation passes even when I add extra fields. What am I doing wrong, and why?
Sample JSON Data
{
"npcs":
[
{
"id": 0,
"name": "Pilot Alpha",
"isNPC": true,
"race": "1e",
"testNotValid": false
},
{
"id": 1,
"name": "Pilot Beta",
"isNPC": true,
"race": 1
}
]
}
JSON Schema
I have set "required" and "additionalProperties" so I thought the validation would fail....
FileSchema = {
"definitions":
{
"NpcEntry":
{
"properties":
{
"id": { "type": "integer" },
"name": { "type" : "string" },
"isNPC": { "type": "boolean" },
"race": { "type" : "integer" }
},
"required": [ "id", "name", "isNPC", "race" ],
"additionalProperties": False
}
},
"type": "object",
"required": [ "npcs" ],
"additionalProperties": False,
"properties":
{
"npcs":
{
"type": "array",
"npcs": { "$ref": "#/definitions/NpcEntry" }
}
}
}
The JSON file and schema are processed using the jsonschema package for Python, (I am using python 3.7 on a Mac).
The method I use to read and validate is below, I have removed a lot of the general validation to make the code as short and usable as possible:
import json
import jsonschema

def _ReadJsonfile(self, filename, schemaSystem, fileType):
    with open(filename) as fileHandle:
        fileContents = fileHandle.read()
    jsonData = json.loads(fileContents)
    try:
        jsonschema.validate(instance=jsonData, schema=schemaSystem)
    except jsonschema.exceptions.ValidationError as ex:
        print(f"JSON schema validation failed for file '{filename}'")
        return None
    return jsonData
In the line "npcs": { "$ref": "#/definitions/NpcEntry" }
change the inner key "npcs" to "items". "npcs" is not a valid schema keyword, so it is ignored. As written, the only validation that happens is at the top level: it verifies that the data is an object and that its one property is an array.
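To see the effect of that one-word change, here is a minimal sketch that runs the question's sample data against the schema with "items" in place of the inner "npcs". It assumes the jsonschema package is installed and uses `Draft7Validator.iter_errors` so that all failures are listed rather than only the first; the choice of Draft 7 is an assumption, since the question does not name a draft.

```python
from jsonschema import Draft7Validator

schema = {
    "definitions": {
        "NpcEntry": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
                "isNPC": {"type": "boolean"},
                "race": {"type": "integer"},
            },
            "required": ["id", "name", "isNPC", "race"],
            "additionalProperties": False,
        }
    },
    "type": "object",
    "required": ["npcs"],
    "additionalProperties": False,
    "properties": {
        "npcs": {
            "type": "array",
            # "items" applies the referenced schema to every array element.
            "items": {"$ref": "#/definitions/NpcEntry"},
        }
    },
}

data = {
    "npcs": [
        {"id": 0, "name": "Pilot Alpha", "isNPC": True,
         "race": "1e", "testNotValid": False},
        {"id": 1, "name": "Pilot Beta", "isNPC": True, "race": 1},
    ]
}

errors = list(Draft7Validator(schema).iter_errors(data))
for error in errors:
    # absolute_path points at the offending part of the instance.
    print(list(error.absolute_path), error.message)
```

With "items" in place, the first entry should be rejected twice (the string-valued "race" and the extra "testNotValid" field), while the second entry passes.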

How to set up local file references in python-jsonschema document?

I have a set of jsonschema compliant documents. Some documents contain references to other documents (via the $ref attribute). I do not wish to host these documents such that they are accessible at an HTTP URI. As such, all references are relative. All documents live in a local folder structure.
How can I make python-jsonschema understand to properly use my local file system to load referenced documents?
For instance, if I have a document with filename defs.json containing some definitions. And I try to load a different document which references it, like:
{
"allOf": [
{"$ref":"defs.json#/definitions/basic_event"},
{
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": ["page_load"]
}
},
"required": ["action"]
}
]
}
I get an error RefResolutionError: <urlopen error [Errno 2] No such file or directory: '/defs.json'>
It may be important that I'm on a linux box.
(I'm writing this as a Q&A because I had a hard time figuring this out and observed other folks having trouble too.)
I had the hardest time figuring out how to resolve against a set of schemas that $ref each other without going to the network. It turns out the key is to create the RefResolver with a store that is a dict which maps from url to schema.
import json
from jsonschema import RefResolver, Draft7Validator
address="""
{
"$id": "https://example.com/schemas/address",
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"],
"additionalProperties": false
}
"""
customer="""
{
"$id": "https://example.com/schemas/customer",
"type": "object",
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"shipping_address": { "$ref": "/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"],
"additionalProperties": false
}
"""
data = """
{
"first_name": "John",
"last_name": "Doe",
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
},
"billing_address": {
"street_address": "1st Street SE",
"city": "Washington",
"state": "DC"
}
}
"""
address_schema = json.loads(address)
customer_schema = json.loads(customer)
schema_store = {
address_schema['$id'] : address_schema,
customer_schema['$id'] : customer_schema,
}
resolver = RefResolver.from_schema(customer_schema, store=schema_store)
validator = Draft7Validator(customer_schema, resolver=resolver)
jsonData = json.loads(data)
validator.validate(jsonData)
The above was built with jsonschema==4.9.1.
You must build a custom jsonschema.RefResolver for each schema which uses a relative reference and ensure that your resolver knows where on the filesystem the given schema lives.
Such as...
import os
import json
from jsonschema import Draft4Validator, RefResolver # We prefer Draft7, but jsonschema 3.0 is still in alpha as of this writing
abs_path_to_schema = '/path/to/schema-doc-foobar.json'
with open(abs_path_to_schema, 'r') as fp:
    schema = json.load(fp)
resolver = RefResolver(
    # The key part is here where we build a custom RefResolver
    # and tell it where *this* schema lives in the filesystem
    # Note that `file:` is for unix systems
    # (RefResolver's parameters are named `base_uri` and `referrer`)
    base_uri='file:{}'.format(abs_path_to_schema),
    referrer=schema
)
Draft4Validator.check_schema(schema) # Unnecessary but a good idea
validator = Draft4Validator(schema, resolver=resolver, format_checker=None)
# Then you can...
data_to_validate = {...}  # placeholder: put the instance to validate here
validator.validate(data_to_validate)
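The hand-built `file:` prefix above is tied to unix-style paths. A possibly more portable way to produce the base URI is to let `pathlib` build it. This is a sketch only: `schema_base_uri` is a hypothetical helper name and the path is made up for the example.

```python
import os
from pathlib import Path


def schema_base_uri(schema_file):
    """Return a file:// URI for the absolute location of a schema file.

    Hypothetical helper; Path.as_uri() requires an absolute path, so the
    path is made absolute first.
    """
    return Path(os.path.abspath(schema_file)).as_uri()


# Illustrative relative path; the file does not need to exist for this step.
uri = schema_base_uri("schemas/event.json")
print(uri)
```

The resulting string can be used as the resolver's base URI, so that a relative reference such as "defs.json#/definitions/basic_event" is resolved against the directory the referring schema lives in.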
EDIT-1
Fixed a wrong reference ($ref) to base schema.
Updated the example to use the one from the docs: https://json-schema.org/understanding-json-schema/structuring.html
EDIT-2
As pointed out in the comments, the code below uses the following imports:
from jsonschema import validate, RefResolver
from jsonschema.validators import validator_for
This is just another version of @Daniel's answer, which was the correct one for me. Basically, I decided to define the $schema in a base schema. This frees the other schemas from declaring it and makes for a clear call when instantiating the resolver.
RefResolver.from_schema() takes (1) a schema and (2) a schema store, and it was not clear to me whether the order mattered or which schema should be passed first. Hence the structure you see below.
I have the following:
base.schema.json:
{
"$schema": "http://json-schema.org/draft-07/schema#"
}
definitions.schema.json:
{
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
address.schema.json:
{
"type": "object",
"properties": {
"billing_address": { "$ref": "definitions.schema.json#" },
"shipping_address": { "$ref": "definitions.schema.json#" }
}
}
I like this setup for two reasons:
It makes for a cleaner call to RefResolver.from_schema():
base = json.loads(open('base.schema.json').read())
definitions = json.loads(open('definitions.schema.json').read())
schema = json.loads(open('address.schema.json').read())
schema_store = {
base.get('$id','base.schema.json') : base,
definitions.get('$id','definitions.schema.json') : definitions,
schema.get('$id','address.schema.json') : schema,
}
resolver = RefResolver.from_schema(base, store=schema_store)
Then I use the handy validator_for helper the library provides, which picks the best validator class for your schema (according to its $schema key):
Validator = validator_for(base)
And then just put them together to instantiate validator:
validator = Validator(schema, resolver=resolver)
Finally, you validate your data:
data = {
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
},
"billing_address": {
"street_address": "1st Street SE",
"city": "Washington",
"state": 32
}
}
This one fails validation because "state" is 32:
>>> validator.validate(data)
ValidationError: 32 is not of type 'string'
Failed validating 'type' in schema['properties']['billing_address']['properties']['state']:
{'type': 'string'}
On instance['billing_address']['state']:
32
Change that to "DC" and it validates.
Following up on the answer @chris-w provided: I wanted to do the same thing with jsonschema 3.2.0, but his answer didn't quite cover it. I hope this answer helps those who still come to this question but are using a more recent version of the package.
To extend a JSON schema using the library, do the following:
Create the base schema:
base.schema.json
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
Create the extension schema
extend.schema.json
{
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
Create your JSON file you want to test against the schema
data.json
{
"prop": "This is the property",
"extra": true
}
Create your RefResolver and Validator for the base Schema and use it to check the data
#Set up schema, resolver, and validator on the base schema
baseSchema = json.loads(baseSchemaJSON) # Create a schema dictionary from the base JSON file
relativeSchema = json.loads(relativeJSON) # Create a schema dictionary from the relative JSON file
resolver = RefResolver.from_schema(baseSchema) # Creates your resolver, uses the "$id" element
validator = Draft7Validator(relativeSchema, resolver=resolver) # Create a validator against the extended schema (but resolving to the base schema!)
# Check validation!
data = json.loads(dataJSON) # Create a dictionary from the data JSON file
validator.validate(data)
You may need to make a few adjustments to the above, such as not using the Draft7Validator. This should work for single-level references (children extending a base). You will need to be careful with your schemas and with how you set up the RefResolver and Validator objects.
P.S. Here is a snippet that exercises the above. Try modifying the data string to remove one of the required attributes:
import json
from jsonschema import RefResolver, Draft7Validator
base = """
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
"""
extend = """
{
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
"""
data = """
{
"prop": "This is the property string",
"extra": true
}
"""
schema = json.loads(base)
extendedSchema = json.loads(extend)
resolver = RefResolver.from_schema(schema)
validator = Draft7Validator(extendedSchema, resolver=resolver)
jsonData = json.loads(data)
validator.validate(jsonData)
My approach is to preload all schema fragments to RefResolver cache. I created a gist that illustrates this: https://gist.github.com/mrtj/d59812a981da17fbaa67b7de98ac3d4b
This is what I used to dynamically generate a schema_store from all schemas in a given directory
base.schema.json
{
"$id": "base.schema.json",
"type": "object",
"properties": {
"prop": {
"type": "string"
}
},
"required": ["prop"]
}
extend.schema.json
{
"$id": "extend.schema.json",
"allOf": [
{"$ref": "base.schema.json"},
{
"properties": {
"extra": {
"type": "boolean"
}
},
"required": ["extra"]
}
]
}
instance.json
{
"prop": "This is the property string",
"extra": true
}
validator.py
import json
from pathlib import Path
from jsonschema import Draft7Validator, RefResolver
from jsonschema.exceptions import RefResolutionError
schemas = (json.load(open(source)) for source in Path("schema/dir").iterdir())
schema_store = {schema["$id"]: schema for schema in schemas}
schema = json.load(open("schema/dir/extend.schema.json"))
instance = json.load(open("instance/dir/instance.json"))
resolver = RefResolver.from_schema(schema, store=schema_store)
validator = Draft7Validator(schema, resolver=resolver)
try:
    errors = sorted(validator.iter_errors(instance), key=lambda e: e.path)
except RefResolutionError as e:
    print(e)
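The store-building step can be pulled into a small function that closes its file handles and skips non-JSON files. A minimal sketch follows; `load_schema_store` is a hypothetical name, and falling back to the file name when a schema has no "$id" is an assumption borrowed from the `.get('$id', ...)` pattern used in an earlier answer. The demo writes two throwaway schemas to a temporary directory so it can run on its own.

```python
import json
import tempfile
from pathlib import Path


def load_schema_store(directory):
    """Map each schema's "$id" (or its file name, if absent) to the schema."""
    store = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            schema = json.load(handle)
        store[schema.get("$id", path.name)] = schema
    return store


# Demo with two tiny schemas written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "base.schema.json").write_text(
        json.dumps({"$id": "base.schema.json", "type": "object"}),
        encoding="utf-8",
    )
    Path(tmp, "extend.schema.json").write_text(
        json.dumps({"allOf": [{"$ref": "base.schema.json"}]}),
        encoding="utf-8",
    )
    store = load_schema_store(tmp)

print(sorted(store))
```

The returned dict has the same shape as `schema_store` above, so it could be handed to the resolver as its `store` argument.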

How to specify Stopwords in Elasticsearch mapping using python

I have this Python code where I first create an Elasticsearch mapping and then, after data is inserted, search that data:
# Create Data mapping
data_mapping = {
"mappings": {
(doc_type): {
"properties": {
"data_id": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"data":{
"type": "array",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"resp": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"update": {
"type": "integer",
"fields": {
"stemmed": {
"type": "integer",
"analyzer": "english"
}
}
}
}
}
}
}
#Search
data_search = {
"query": {
"function_score": {
"query": {
"match": {
'data': question
}
},
"field_value_factor": {
"field": "update",
"modifier": "log2p"
}
}
}
}
response = es.search(index=doc_type, body=data_search)
What I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping I have written above.
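One possible shape for this, offered as a sketch rather than a tested answer: the analysis section sits under a top-level "settings" key next to "mappings", and a field then refers to the analyzer by name. The analyzer name `english_custom_stop`, the helper `build_stopword_index_body`, and the stopword list are all made up for the example. It also assumes an Elasticsearch version that still accepts the "string" type and per-type mappings used in the question, and only the "data_id" field is shown.

```python
import json


def build_stopword_index_body(doc_type, stopwords):
    """Build an index body with a custom analyzer next to the mappings.

    Hypothetical helper; "english_custom_stop" is an illustrative name.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # The built-in "english" analyzer, configured with
                    # a custom stopword list.
                    "english_custom_stop": {
                        "type": "english",
                        "stopwords": stopwords,
                    }
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "data_id": {
                        "type": "string",
                        "fields": {
                            "stemmed": {
                                "type": "string",
                                # Refer to the analyzer defined in settings.
                                "analyzer": "english_custom_stop",
                            }
                        },
                    }
                }
            }
        },
    }


body = build_stopword_index_body("my_doc_type", ["a", "an", "the", "is"])
print(json.dumps(body, indent=2))
```

Under these assumptions the stopwords are declared once, at index creation, and the search request itself does not need to change.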

python jsonschema validation using schema list

I'm trying to validate a json file against a schema using python and jsonschema module. My schema is made up from a list of schemas, one of them has definitions of basic elements and the rest are collections of these elements and other objects.
I can't find documentation for a function that loads the list of schemas so that I can validate against it. I tried separating the schemas into a dictionary and calling the appropriate one on a jsonObject, but that doesn't work since they cross-reference each other.
How do I load/assemble all schemas into one for validation?
Part of the schema I'm trying to load:
[{
"definitions": {
"location": {
"required": [
"name",
"country"
],
"additionalProperties": false,
"properties": {
"country": {
"pattern": "^[A-Z]{2}$",
"type": "string"
},
"name": {
"type": "string"
}
},
"type": "object"
}
},
"required": [
"type",
"content"
],
"additionalProperties": false,
"properties": {
"content": {
"additionalProperties": false,
"type": "object"
},
"type": {
"type": "string"
}
},
"type": "object",
"title": "base",
"$schema": "http://json-schema.org/draft-04/schema#"
},
{
"properties": {
"content": {
"required": [
"address"
],
"properties": {
"address": {
"$ref": "#/definitions/location"
}
},
"type": {
"pattern": "^person$"
}
}
}]
And the json object would look something like this:
{
"type":"person",
"content":{
"address": {
"country": "US",
"name" : "1,Street,City,State,US"
}
}
}
You can only validate against one schema at a time, but that schema can reference ($ref) external schemas. These references are usually URIs that can be used to GET the schema. A filepath might work too if your schemas are not public. Using a fixed up version of your example, this would look something like this ...
http://your-domain.com/schema/person
{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "Person",
"allOf": [{ "$ref": "http://your-domain.com/schema/base#" }],
"properties": {
"type": { "enum": ["person"] },
"content": {
"properties": {
"address": { "$ref": "http://your-domain.com/schema/base#/definitions/location" }
},
"required": ["address"],
"additionalProperties": false
}
}
}
http://your-domain.com/schema/base
{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "base",
"type": "object",
"properties": {
"content": { "type": "object" },
"type": { "type": "string" }
},
"required": ["type", "content"],
"additionalProperties": false,
"definitions": {
"location": {
"type": "object",
"properties": {
"country": {
"type": "string",
"pattern": "^[A-Z]{2}$"
},
"name": { "type": "string" }
},
"required": ["name", "country"],
"additionalProperties": false
}
}
}
Some documentation that might be useful
https://python-jsonschema.readthedocs.org/en/latest/validate/#the-validator-interface
https://python-jsonschema.readthedocs.org/en/latest/references/
Instead of hand coding a single schema from all your schemata, you can create a small schema which refers to the other schema files. This way you can use multiple existing JSONschema files and validate against them in combination:
import yaml
import jsonschema

A_yaml = """
id: http://foo/a.json
type: object
properties:
    prop:
        $ref: "./num_string.json"
"""
num_string_yaml = """
id: http://foo/num_string.json
type: string
pattern: ^[0-9]*$
"""
# safe_load avoids the Loader argument that newer PyYAML requires for load()
A = yaml.safe_load(A_yaml)
num_string = yaml.safe_load(num_string_yaml)
resolver = jsonschema.RefResolver("", None,
    store={
        "http://foo/A.json": A,
        "http://foo/num_string.json": num_string,
    })
validator = jsonschema.Draft4Validator(
    A, resolver=resolver)
try:
    validator.validate({"prop": "1234"})
    print("Properly accepted object")
except jsonschema.exceptions.ValidationError:
    print("Failed to accept object")
try:
    validator.validate({"prop": "12d34"})
    print("Failed to reject object")
except jsonschema.exceptions.ValidationError:
    print("Properly rejected object")
Note that you may want to combine the external schemata using one of the schema combinators oneOf, allOf, or anyOf, like so:
[A.yaml]
oneOf:
- $ref: "sub-schema1.json"
- $ref: "sub-schema2.json"
- $ref: "sub-schema3.json"
