Improving elasticsearch performance - python

I'm using Elasticsearch in a Python web app to query news documents. There are currently 100,000 documents in the database.
The original database is MongoDB, and Elasticsearch is plugged in through the mongoriver plugin.
The problem is that the function takes ~850 ms to return the results. I'd like to decrease that number as much as possible.
Here's the Python code I'm using to query the database (the limit is usually 16):
import json

import requests
from bson import json_util  # assuming json_util comes from pymongo's bson package


def search_news(term, limit, page, flagged_articles):
    query = {
        "query": {
            "from": page * limit,
            "size": limit,
            "multi_match": {
                "query": term,
                "fields": ["title^3", "category^5", "entities.name^5",
                           "art_text^1", "summary^1"]
            }
        },
        "filter": {
            "not": {
                "filter": {
                    "ids": {
                        "values": flagged_articles
                    }
                },
                "_cache": True
            }
        }
    }
    es_query = json_util.dumps(query)
    uri = 'http://localhost:9200/newsidx/_search'
    r = requests.get(uri, data=es_query)
    results = json.loads(r.text)
    data = []
    for res in results['hits']['hits']:
        data.append(res['_source'])
    return data
And here's the index mapping:
{
    "news": {
        "properties": {
            "actual_rank": {"type": "long"},
            "added": {"type": "date", "format": "dateOptionalTime"},
            "api_id": {"type": "long"},
            "art_text": {"type": "string"},
            "category": {"type": "string"},
            "downvotes": {"type": "long"},
            "entities": {
                "properties": {
                    "etype": {"type": "string"},
                    "name": {"type": "string"}
                }
            },
            "flags": {
                "properties": {
                    "a": {"type": "long"},
                    "b": {"type": "long"},
                    "bad_image": {"type": "long"},
                    "c": {"type": "long"},
                    "d": {"type": "long"},
                    "innapropiate": {"type": "long"},
                    "irrelevant_info": {"type": "long"},
                    "miscategorized": {"type": "long"}
                }
            },
            "media": {"type": "string"},
            "published": {"type": "string"},
            "published_date": {"type": "date", "format": "dateOptionalTime"},
            "show": {"type": "boolean"},
            "source": {"type": "string"},
            "source_rank": {"type": "double"},
            "summary": {"type": "string"},
            "times_showed": {"type": "long"},
            "title": {"type": "string"},
            "top_entities": {
                "properties": {
                    "einfo_test": {"type": "string"},
                    "etype": {"type": "string"},
                    "name": {"type": "string"}
                }
            },
            "tweet_article_poster": {"type": "string"},
            "tweet_favourites": {"type": "long"},
            "tweet_retweets": {"type": "long"},
            "tweet_user_rank": {"type": "double"},
            "upvotes": {"type": "long"},
            "url": {"type": "string"}
        }
    }
}
Edit: The response time was measured on the server, based on the Tornado server's logging output.
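One way to narrow down where the ~850 ms go is to compare Elasticsearch's own query time, reported in the took field of the search response (milliseconds spent inside Elasticsearch), with the wall-clock time of the whole HTTP call. A minimal, untested sketch against the same endpoint:
import json
import time

import requests

uri = 'http://localhost:9200/newsidx/_search'
body = json.dumps({"query": {"match_all": {}}, "size": 16})

start = time.time()
r = requests.get(uri, data=body)
elapsed_ms = (time.time() - start) * 1000

results = json.loads(r.text)
# "took" is reported by Elasticsearch itself; the remainder is spent in
# HTTP, JSON serialization and the Python layer.
print("inside Elasticsearch: %d ms" % results['took'])
print("total round trip:     %.1f ms" % elapsed_ms)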

I've rewritten your query somewhat here, moving from and size to the outer scope, adding a filtered query clause, and changing your not filter to a bool/must_not filter, which should be cached by default:
{
    "query": {
        "filtered": {
            "query": {
                "multi_match": {
                    "query": term,
                    "fields": ["title^3", "category^5", "entities.name^5", "art_text^1", "summary^1"]
                }
            },
            "filter": {
                "bool": {
                    "must_not": {
                        "ids": {"values": flagged_articles}
                    }
                }
            }
        }
    },
    "from": page * limit,
    "size": limit
}
I haven't tested this, and I haven't made sense of your mapping as it is jumbled, so there might be some improvements to be made there.
Edit: This is a great read on why to use the bool filter: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/ - in short, bool uses 'bitsets', which are very fast on subsequent queries.
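For reference, here is how the rewritten body above might be wired back into the question's search_news function (an untested sketch; the endpoint and field names are taken from the question, and json_util.dumps would work just as well as json.dumps here):
import json

import requests


def search_news(term, limit, page, flagged_articles):
    # Filtered query with a bool/must_not filter; from/size sit at the top level.
    query = {
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {
                        "query": term,
                        "fields": ["title^3", "category^5", "entities.name^5",
                                   "art_text^1", "summary^1"]
                    }
                },
                "filter": {
                    "bool": {
                        "must_not": {
                            "ids": {"values": flagged_articles}
                        }
                    }
                }
            }
        },
        "from": page * limit,
        "size": limit
    }
    r = requests.get('http://localhost:9200/newsidx/_search', data=json.dumps(query))
    return [hit['_source'] for hit in json.loads(r.text)['hits']['hits']]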

First of all you can add the boosts to your mapping (assuming it doesn't interfere with your other queries) like this:
"title": {
"boost": 3.0,
"type": "string"
},
"category": {
"boost": 5.0,
"type": "string"
},
etc.
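If you go this route, the boosted mapping has to be applied to the index; a rough, untested sketch over HTTP against the newsidx index and news type from the question (existing string fields usually cannot be changed in place, so in practice this belongs in index creation followed by a reindex):
import json

import requests

boosted_mapping = {
    "news": {
        "properties": {
            "title": {"type": "string", "boost": 3.0},
            "category": {"type": "string", "boost": 5.0}
        }
    }
}
r = requests.put('http://localhost:9200/newsidx/_mapping/news',
                 data=json.dumps(boosted_mapping))
print(r.text)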
Then set up a bool query with field (or term) queries like this:
"query": {
    "bool": {
        "should": [
            {
                "field": {
                    "title": term
                }
            },
            {
                "field": {
                    "category": term
                }
            }
        ],
        "must_not": {
            "ids": {"values": flagged_articles}
        }
    }
},
"from": page * limit,
"size": limit
This should perform better, but without access to your setup I can't test it :)

Related

Can't do case-insensitive search in Elasticsearch

I'm new to Elasticsearch and trying to get this query right.
So I have a document like this:
{
"id": 1,
"name": "Văn Hiến"
}
I want to get that document in 3 cases:
1/ User input is: "v" or "h" or "i",...
2/ User input is: "Văn" or "văn" or "hiến",...
3/ User input is: "va" or "van" or "van hi",...
I can currently search for cases 1 and 2, but not case 3, where the user input doesn't have the tonal marks of the Vietnamese language.
This is my query (I'm using Python):
query = {
    "bool": {
        "should": [
            {
                "match": {
                    "name": name.lower()
                }
            },
            {
                "wildcard": {
                    "name": {
                        "value": f"*{name.lower()}*"
                    }
                }
            }
        ]
    }
}
Can anyone help me with this? Any help will be appreciated.
Use the lowercase token filter and a mapping character filter in your analyzer.
The following mapping and query will work for all three use cases you mentioned.
Mapping Example:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"filter": [
"lowercase"
],
"char_filter": [
"my_mappings_char_filter"
]
}
},
"char_filter": {
"my_mappings_char_filter": {
"type": "mapping",
"mappings": [
"ă => a",
"ế => e"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff" : "9"
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"facet": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Example Query:
{
"query" : {
"query_string" :{
"query":"van hi",
"type": "best_fields",
"default_field": "name"
}
}
}
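Since the question is using Python, a sketch of the same thing through the official elasticsearch client might look like this (the index name names, the client setup, and the sample document are assumptions for illustration):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index_body repeats the settings/mappings shown above in compact form.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                    "filter": ["lowercase"],
                    "char_filter": ["my_mappings_char_filter"]
                }
            },
            "char_filter": {
                "my_mappings_char_filter": {
                    "type": "mapping",
                    "mappings": ["ă => a", "ế => e"]
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter"]
                }
            }
        },
        "max_ngram_diff": "9"
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "my_analyzer"}
        }
    }
}

es.indices.create(index="names", body=index_body)
es.index(index="names", id=1, body={"id": 1, "name": "Văn Hiến"}, refresh=True)

resp = es.search(index="names", body={
    "query": {
        "query_string": {"query": "van hi", "default_field": "name"}
    }
})
print(resp["hits"]["hits"])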

Python Elasticsearch: Errors when trying to apply an analyzer to Index documents

So I'm trying to apply an analyzer to my index but no matter what I do I get some sort of error. I've been looking stuff up all day but can't get it to work. If I run it as it is below, I get an error which says
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'analyzer [{settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}}] has not been configured in mappings')
If I add a "mappings" section below the body= part of the code and above the "properties" part, I get this error:
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'Root mapping definition has unsupported parameters: [mappings : {properties={Name={analyzer={settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}} (and it'll go through every name in the body part of the code)
def text_normalization():
    normalization_analyzer = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
    elasticsearch.indices.put_mapping(
        index=index_name,
        body={
            "properties": {
                "Year of Birth": {
                    "type": "integer",
                },
                "Name": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Status": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Country": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Blood Type": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                }
            }
        }
    )
    match_docments = elasticsearch.search(index=index_name, body={"query": {"match_all": {}}})
    print(match_docments)
Any help would be appreciated.
Your analyzer is simply missing a name; you should specify it like this:
normalization_analyzer = {
    "settings": {
        "analysis": {
            "analyzer": {
                "normalization_analyzer": {   # <--- add this
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}
You need to install this analyzer using
elasticsearch.indices.put_settings(...)
Also, in the mappings section you need to reference the analyzer by name, so you simply need to add the analyzer name as a string:
body={
    "properties": {
        "Year of Birth": {
            "type": "integer",
        },
        "Name": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Status": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Country": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Blood Type": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        }
    }
}
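Putting both pieces together, a rough sequence with the elasticsearch Python client could be: close the index, install the analysis settings, reopen it, then apply the mapping that references the analyzer by name. A sketch with an assumed client instance and index name (creating a fresh index with settings and mappings in one indices.create call is the simpler alternative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index_name = "people"  # assumed name for illustration

analysis_settings = {
    "analysis": {
        "analyzer": {
            "normalization_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"]
            }
        }
    }
}

# Analysis settings can only be changed while the index is closed.
es.indices.close(index=index_name)
es.indices.put_settings(index=index_name, body=analysis_settings)
es.indices.open(index=index_name)

# Now the mapping can reference the analyzer by name.
es.indices.put_mapping(index=index_name, body={
    "properties": {
        "Name": {"type": "text", "analyzer": "normalization_analyzer"}
    }
})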

Access nested objects in Elasticsearch using a script

I'm trying to use data from Elasticsearch 6 results to set up the scoring of my results.
Part of my mapping looks like:
{
"properties": {
"annotation_date": {
"type": "date"
},
"annotation_date_time": {
"type": "date"
},
"annotations": {
"properties": {
"details": {
"type": "nested",
"properties": {
"filter": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"bucket": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keyword": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"frequency": {
"type": "long",
}
}
}
}
}
}
}
Example part of a document JSON:
"annotations": {
"details": [
{
"filter": "filter_A",
"bucket": "bucket_A",
"keyword": "keyword_A",
"frequency": 6
},
{
"filter": "filter_B",
"bucket": "bucket_B",
"keyword": "keyword_B",
"frequency": 7
}
]
I want to use the frequency of my annotations.details if it hits a certain 'bucket', which I try to do with the following:
GET my_index/_search
{
"size": 10000,
"query": {
"function_score": {
"query": {
"match": { "title": "<search term>" }
},
"script_score": {
"script": {
"lang": "painless",
"source": """
int score = 0;
for (int i = 0; i < doc['annotations.details.filter'].length; i++){
if (doc['annotations.details.filter'][i].keyword == "bucket_A"){
score += doc['annotations.details.frequency'][i].value;
}
}
return score;
"""
}
}
}
}
}
Ultimately, this would mean that in this specific situation a score of 6 is expected. If it had hit more buckets, the score would be incremented with the frequency of each hit.
You should use a bool query with must clauses, combining a match with a range (gt) clause inside a nested query. Example:
GET /_search
{
"query": {
"nested" : {
"path" : "obj1",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{ "match" : {"obj1.name" : "blue"} },
{ "range" : {"obj1.count" : {"gt" : 5}} }
]
}
}
}
}
}
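Adapted to the mapping from the question, such a nested query might look like the sketch below (untested; the field names come from the question, and note this filters and scores on the nested clause rather than summing frequencies exactly as the original script intended):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical adaptation: keep only documents that have a nested
# annotations.details entry for bucket_A with frequency > 5, and let the
# nested clause contribute to the score via score_mode.
query = {
    "size": 10000,
    "query": {
        "nested": {
            "path": "annotations.details",
            "score_mode": "sum",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"annotations.details.bucket": "bucket_A"}},
                        {"range": {"annotations.details.frequency": {"gt": 5}}}
                    ]
                }
            }
        }
    }
}
resp = es.search(index="my_index", body=query)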

How to validate a specific jsonschema based on document type

I have a JSON schema for a user's new message:
message_creation = {
"title": "Message",
"type": "object",
"properties": {
"post": {
"oneOf": [
{
"type": "object",
"properties": {
"content": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["content"]
},
{
"type": "object",
"properties": {
"image": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["image"]
},
{
"type": "object",
"properties": {
"video_path": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["video"]
}
]
},
"doc_type": {
"type": "string",
"enum": ["text", "image", "video"]
}
},
"required": ["post", "doc_type"],
"additionalProperties": False
}
It's as simple as that! There are two fields: one is type and the other is post. So a payload like the one below succeeds:
{
"post": {
"image": "Hey there!"
},
"type": "image"
}
Now the problem is that if the user sets the type value to text, I cannot validate that the text schema has been given. How should I verify this? How should I check, in case type is set to image, that image exists inside of post?
You can do it, but it's complicated. This uses a boolean logic concept called implication to ensure that if schema A matches then schema B must also match.
{
"type": "object",
"properties": {
"post": {
"type": "object",
"properties": {
"content": { "type": "string" },
"image": { "type": "string" },
"video_path": { "type": "string" }
},
"additionalProperties": false
},
"doc_type": {
"type": "string",
"enum": ["text", "image", "video"]
}
},
"required": ["post", "doc_type"],
"additionalProperties": false,
"allOf": [
{ "$ref": "#/definitions/image-requires-post-image" },
{ "$ref": "#/definitions/text-requires-post-content" },
{ "$ref": "#/definitions/video-requires-post-video-path" }
],
"definitions": {
"image-requires-post-image": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-image" } },
{ "$ref": "#/definitions/post-image-required" }
]
},
"type-image": {
"properties": {
"doc_type": { "const": "image" }
}
},
"post-image-required": {
"properties": {
"post": { "required": ["image"] }
}
},
"text-requires-post-content": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-text" } },
{ "$ref": "#/definitions/post-content-required" }
]
},
"type-text": {
"properties": {
"doc_type": { "const": "text" }
}
},
"post-content-required": {
"properties": {
"post": { "required": ["content"] }
}
},
"video-requires-post-video-path": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-video" } },
{ "$ref": "#/definitions/post-video-path-required" }
]
},
"type-video": {
"properties": {
"doc_type": { "const": "video" }
}
},
"post-video-path-required": {
"properties": {
"post": { "required": ["video_path"] }
}
}
}
}
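To see the implication pattern itself in action from Python, here is a small, self-contained sketch using the jsonschema package, reduced to just the image case (the full schema above works the same way):
from jsonschema import ValidationError, validate

# Minimal illustration of implication: "doc_type == image" implies
# "post must contain image", expressed as anyOf[ not(A), B ].
schema = {
    "type": "object",
    "allOf": [
        {
            "anyOf": [
                {"not": {"properties": {"doc_type": {"const": "image"}}}},
                {"properties": {"post": {"required": ["image"]}}}
            ]
        }
    ]
}

validate({"doc_type": "image", "post": {"image": "cat.png"}}, schema)  # passes

try:
    validate({"doc_type": "image", "post": {"content": "hi"}}, schema)
except ValidationError as err:
    print("rejected:", err.message)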

How to specify Stopwords in Elasticsearch mapping using python

I have this Python code where I first create an Elasticsearch mapping and then, after data is inserted, I search that data:
# Create Data mapping
data_mapping = {
"mappings": {
(doc_type): {
"properties": {
"data_id": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"data":{
"type": "array",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"resp": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"update": {
"type": "integer",
"fields": {
"stemmed": {
"type": "integer",
"analyzer": "english"
}
}
}
}
}
}
}
#Search
data_search = {
"query": {
"function_score": {
"query": {
"match": {
'data': question
}
},
"field_value_factor": {
"field": "update",
"modifier": "log2p"
}
}
}
}
response = es.search(index=doc_type, body=data_search)
Now what I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example help would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping section I have written above.
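For what it's worth, a rough sketch of the direction that link points in: stopwords are configured in a custom analyzer under settings.analysis at index-creation time, and the stemmed subfields then reference that analyzer by name in the mapping. This is untested and written against a recent Elasticsearch (text instead of string, no doc_type level), with made-up analyzer and index names:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical "english_stop" analyzer: standard tokenizer, lowercasing,
# and an explicit English stopword list via a stop token filter.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {"type": "stop", "stopwords": "_english_"}
            },
            "analyzer": {
                "english_stop": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "data": {
                "type": "text",
                "fields": {
                    "stemmed": {"type": "text", "analyzer": "english_stop"}
                }
            }
        }
    }
}

es.indices.create(index="qa_data", body=index_body)  # assumed index name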
