Elasticsearch - IndicesClient.put_settings not working - python

I am trying to update my original index settings.
My initial settings look like this:
client.create(index="movies", body={
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "my_custom_stop_words": {
                    "type": "stop",
                    "stopwords": stop_words
                }
            },
            "analyzer": {
                "my_custom_analyzer": {
                    "filter": [
                        "lowercase",
                        "my_custom_stop_words"
                    ],
                    "type": "custom",
                    "tokenizer": "standard"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "my_custom_analyzer",
                "search_analyzer": "my_custom_analyzer",
                "search_quote_analyzer": "my_custom_analyzer"
            }
        }
    }
}, ignore=400)
And I am trying to add the synonym filter to my existing analyzer (my_custom_analyzer) using client.put_settings:
client.put_settings(index='movies', body={
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "filter": [
                        "lowercase",
                        "my_stops",
                        "my_synonyms"
                    ],
                    "type": "custom",
                    "tokenizer": "standard"
                }
            },
            "filter": {
                "my_custom_stops": {
                    "type": "stop",
                    "stopwords": stop_words
                },
                "my_custom_synonyms": {
                    "ignore_case": "true",
                    "type": "synonym",
                    "synonyms": ["Harry Potter, HP => HP", "Terminator, TM => TM"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "my_custom_analyzer",
                "search_analyzer": "my_custom_analyzer",
                "search_quote_analyzer": "my_custom_analyzer"
            }
        }
    }
}, ignore=400)
However, when I issue a search query for "HP" against the movies index, I expect the document containing "Harry Potter" 5 times to rank first. Right now, the document with "HP" 3 times tops the list, so the synonym filter isn't working. I closed the movies index before calling client.put_settings and re-opened it afterwards.
Any help would be greatly appreciated!

You should re-index all your data in order to apply the updated settings to all your data and fields.
Data that has already been indexed won't be affected by the updated analyzer; only documents indexed after you update the settings will be.
Not re-indexing your data will produce incorrect results, since your old data was analyzed with the old custom analyzer and not with the new one.
The most efficient way to resolve this is to create a new index and move your data from the old one to the new one with the updated settings, using the Reindex API.
Follow these steps:
POST _reindex
{
  "source": {
    "index": "movies"
  },
  "dest": {
    "index": "new_movies"
  }
}
DELETE movies
PUT movies
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "filter": [
            "lowercase",
            "my_custom_stops",
            "my_custom_synonyms"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_custom_stops": {
          "type": "stop",
          "stopwords": "stop_words"
        },
        "my_custom_synonyms": {
          "ignore_case": "true",
          "type": "synonym",
          "synonyms": [
            "Harry Potter, HP => HP",
            "Terminator, TM => TM"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "search_analyzer": "my_custom_analyzer",
        "search_quote_analyzer": "my_custom_analyzer"
      }
    }
  }
}
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "new_movies"
  },
  "dest": {
    "index": "movies"
  }
}
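Since the question uses the Python client, the same flow can be scripted there as well. A minimal sketch, assuming elasticsearch-py; new_settings_body is a placeholder for the updated PUT body shown above:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# 1. Copy the existing documents into a scratch index.
es.reindex(
    body={"source": {"index": "movies"}, "dest": {"index": "new_movies"}},
    wait_for_completion=True
)

# 2. Drop and recreate "movies" with the updated analysis settings.
es.indices.delete(index="movies")
es.indices.create(index="movies", body=new_settings_body)  # placeholder for the PUT body above

# 3. Copy the documents back so they are re-analyzed with the new analyzer.
es.reindex(body={"source": {"index": "new_movies"}, "dest": {"index": "movies"}})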
After you've verified that all your data is in place, you can delete the new_movies index: DELETE new_movies
Hope this helps!

Related

Can't do case-insensitive search in Elasticsearch

I'm new to Elasticsearch and trying to get this query right.
So I have a document like this:
{
  "id": 1,
  "name": "Văn Hiến"
}
I want to get that document in 3 cases:
1/ User input is "v", "h", "i", ...
2/ User input is "Văn", "văn", "hiến", ...
3/ User input is "va", "van", "van hi", ...
I can currently search for cases 1 and 2, but not case 3, where the user input doesn't have the tonal marks of the Vietnamese language.
This is my query (I'm using Python):
query = {
    "bool": {
        "should": [
            {
                "match": {
                    "name": name.lower()
                }
            },
            {
                "wildcard": {
                    "name": {
                        "value": f"*{name.lower()}*"
                    }
                }
            }
        ]
    }
}
Can anyone help me with this? Any help will be appreciated.
Use a lowercase token filter and a mapping character filter in your analysis settings.
The following mapping and query will work for all three use cases you mentioned.
Mapping Example:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "my_mappings_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_mappings_char_filter": {
          "type": "mapping",
          "mappings": [
            "ă => a",
            "ế => e"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    },
    "max_ngram_diff": "9"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "facet": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
Example Query:
{
  "query": {
    "query_string": {
      "query": "van hi",
      "type": "best_fields",
      "default_field": "name"
    }
  }
}
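For reference, this is roughly how the settings and query above could be wired together with the Python client. A sketch only: the es client, the names index name, and settings_body (holding the mapping example above) are all assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with the analyzer defined above.
es.indices.create(index="names", body=settings_body)  # settings_body = the mapping example
es.index(index="names", id=1, body={"id": 1, "name": "Văn Hiến"}, refresh=True)

# "van hi" now matches: the mapping char_filter strips the diacritics and the
# ngram tokenizer indexes partial tokens, covering all three cases.
res = es.search(index="names", body={
    "query": {
        "query_string": {
            "query": "van hi",
            "type": "best_fields",
            "default_field": "name"
        }
    }
})
print(res["hits"]["hits"])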

Python Elasticsearch: Errors when trying to apply an analyzer to Index documents

So I'm trying to apply an analyzer to my index, but no matter what I do I get some sort of error. I've been looking things up all day but can't get it to work. If I run it as it is below, I get an error which says:
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'analyzer [{settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}}] has not been configured in mappings')
If I add a "mappings" key below the body= part of the code and above the "properties" part, I get this error instead (and it goes through every name in the body part of the code):
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'Root mapping definition has unsupported parameters: [mappings : {properties={Name={analyzer={settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}}'
def text_normalization():
    normalization_analyzer = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
    elasticsearch.indices.put_mapping(
        index=index_name,
        body={
            "properties": {
                "Year of Birth": {
                    "type": "integer",
                },
                "Name": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Status": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Country": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Blood Type": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                }
            }
        }
    )
    match_documents = elasticsearch.search(index=index_name, body={"query": {"match_all": {}}})
    print(match_documents)
Any help would be appreciated.
Your analyzer is simply missing a name; you should specify it like this:
normalization_analyzer = {
    "settings": {
        "analysis": {
            "analyzer": {
                "normalization_analyzer": {  # <--- add this
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}
You need to install this analyzer using elasticsearch.indices.put_settings(...).
Also, in the mappings section you need to reference the analyzer by name, so you simply need to add the analyzer name as a string:
body={
    "properties": {
        "Year of Birth": {
            "type": "integer",
        },
        "Name": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Status": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Country": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        },
        "Blood Type": {
            "type": "text",
            "analyzer": "normalization_analyzer"
        }
    }
}
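Putting it together, the full sequence might look like this (a sketch; the close/open steps are there because analysis settings can only be changed on a closed index, and elasticsearch, index_name, normalization_analyzer, and body are the names from the code above):

# Install the analyzer, then apply the mapping that references it by name.
elasticsearch.indices.close(index=index_name)
elasticsearch.indices.put_settings(index=index_name, body=normalization_analyzer)
elasticsearch.indices.open(index=index_name)
elasticsearch.indices.put_mapping(index=index_name, body=body)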

Access nested objects in Elasticsearch using a script

I'm trying to use data from Elasticsearch 6 results when setting up the scoring for my results.
Part of my mapping looks like this:
{
    "properties": {
        "annotation_date": {
            "type": "date"
        },
        "annotation_date_time": {
            "type": "date"
        },
        "annotations": {
            "properties": {
                "details": {
                    "type": "nested",
                    "properties": {
                        "filter": {
                            "type": "text",
                            "fielddata": True,
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        },
                        "bucket": {
                            "type": "text",
                            "fielddata": True,
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        },
                        "keyword": {
                            "type": "text",
                            "fielddata": True,
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        },
                        "frequency": {
                            "type": "long"
                        }
                    }
                }
            }
        }
    }
}
Example part of a document JSON:
"annotations": {
"details": [
{
"filter": "filter_A",
"bucket": "bucket_A",
"keyword": "keyword_A",
"frequency": 6
},
{
"filter": "filter_B",
"bucket": "bucket_B",
"keyword": "keyword_B",
"frequency": 7
}
]
I want to use the frequency from my annotations.details if it hits a certain 'bucket', which I try to do with the following:
GET my_index/_search
{
  "size": 10000,
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "<search term>" }
      },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": """
            int score = 0;
            for (int i = 0; i < doc['annotations.details.filter'].length; i++) {
              if (doc['annotations.details.filter'][i].keyword == "bucket_A") {
                score += doc['annotations.details.frequency'][i].value;
              }
            }
            return score;
          """
        }
      }
    }
  }
}
Ultimately, this would mean that in this specific situation a score of 6 is expected. If it had hit more buckets, the score would be incremented by each frequency it hit on.
You should use a nested query with bool and must, combining a match clause with a range clause using gt.
Example:
GET /_search
{
  "query": {
    "nested": {
      "path": "obj1",
      "score_mode": "avg",
      "query": {
        "bool": {
          "must": [
            { "match": { "obj1.name": "blue" } },
            { "range": { "obj1.count": { "gt": 5 } } }
          ]
        }
      }
    }
  }
}
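Applied to the field names from the question, the same pattern might look like this in Python (a sketch, not tested against the exact mapping; my_index and the es client are assumptions):

# Score each matching nested detail and sum the scores per document.
query = {
    "query": {
        "nested": {
            "path": "annotations.details",
            "score_mode": "sum",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"annotations.details.bucket": "bucket_A"}},
                        {"range": {"annotations.details.frequency": {"gt": 5}}}
                    ]
                }
            }
        }
    }
}
res = es.search(index="my_index", body=query)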

How to specify Stopwords in Elasticsearch mapping using python

I have this Python code where I first create an Elasticsearch mapping, and then after data is inserted I search that data:
# Create data mapping
data_mapping = {
    "mappings": {
        (doc_type): {
            "properties": {
                "data_id": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "data": {
                    "type": "array",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "resp": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "update": {
                    "type": "integer",
                    "fields": {
                        "stemmed": {
                            "type": "integer",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
}
# Search
data_search = {
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "data": question
                }
            },
            "field_value_factor": {
                "field": "update",
                "modifier": "log2p"
            }
        }
    }
}
response = es.search(index=doc_type, body=data_search)
Now what I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping section I have written above.
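For illustration only, a stop filter would live in a settings/analysis section next to mappings, and the string fields would then reference the custom analyzer by name; a sketch where my_stop_analyzer, my_stop_filter, and the stopword list are placeholders:

data_mapping = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop_filter": {
                    "type": "stop",
                    "stopwords": ["a", "an", "the"]  # placeholder list; "_english_" also works
                }
            },
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop_filter"]
                }
            }
        }
    },
    "mappings": {
        # ... the properties shown above, with "analyzer": "my_stop_analyzer"
        # on each string field that should drop stopwords ...
    }
}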

Filters not working in Elasticsearch

I have the following mapping and settings for an index:
def init_index():
    ES_CLIENT.indices.create(
        index="social_media",
        body={
            "settings": {
                "index": {
                    "number_of_shards": 3,
                    "number_of_replicas": 0
                },
                "analysis": {
                    "analyzer": {
                        "my_english": {
                            "type": "standard",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "asciifolding",
                                "cust_stop",
                                "my_snow"
                            ]
                        },
                        "my_english_shingle": {
                            "type": "standard",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "asciifolding",
                                "cust_stop",
                                "my_snow",
                                "shingle_filter"
                            ]
                        }
                    },
                    "filter": {
                        "cust_stop": {
                            "type": "stop",
                            "stopwords_path": "stoplist.txt"
                        },
                        "shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 2,
                            "output_unigrams": True
                        },
                        "my_snow": {
                            "type": "snowball",
                            "language": "English"
                        }
                    }
                }
            }
        }
    )
press_mapping = {
    "tweet": {
        "dynamic": "strict",
        "properties": {
            "_id": {
                "type": "string",
                "store": True,
                "index": "not_analyzed"
            },
            "text": {
                "type": "multi_field",
                "fields": {
                    "text": {
                        "include_in_all": False,
                        "type": "string",
                        "store": False,
                        "index": "not_analyzed"
                    },
                    "_analyzed": {
                        "type": "string",
                        "store": True,
                        "index": "analyzed",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "my_english"
                    },
                    "_analyzed_shingles": {
                        "type": "string",
                        "store": True,
                        "index": "analyzed",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "my_english_shingle"
                    }
                }
            }
        }
    }
}
constants.ES_CLIENT.indices.put_mapping(
    index="social_media",
    doc_type="tweet",
    body=press_mapping
)
I notice that, except for lowercase, no other filter is working. The term vectors for both analyzers are the same, since shingle_filter is not working either.
GET /social_media/_analyze?analyzer=my_english_shingle&text=complaining when should remove "when", stem "complaining" to "complain", and return the shingle "complain _", but instead it gives me:
{
  "tokens": [
    {
      "token": "complaining",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "when",
      "start_offset": 12,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
What could be the reason?
Since you are trying to define new custom analyzers and not new standard analyzers, you need to change the types in the mapping of both your analyzers from standard to custom. Standard analyzers don't actually take any of the settings you are passing in the mapping. I would personally prefer ES to throw an exception in such a case, but instead it just creates new standard analyzers with no custom fields and ignores everything else you passed in (try removing lowercase from both your analyzers and re-run your analyzer; the output will still be lowercased!):
"analyzer": {
"my_english": {
"type": "custom", // <--- CUSTOM
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop",
"my_snow"
]
},
"my_english_shingle": {
"type": "custom", // <--- CUSTOM
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop",
"my_snow",
"shingle_filter"
]
}
With this, the query GET /social_media/_analyze?analyzer=my_english_shingle&text=COMPLAINING TEST (I changed the query, and changed the custom stop filter to just stop since I don't have your file) returns:
{
  "tokens": [
    {
      "token": "complain",
      "start_offset": 0,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "complain test",
      "start_offset": 0,
      "end_offset": 16,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 2
    }
  ]
}
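The same check can also be run from the Python client instead of the REST console (a sketch; depending on client version, analyzer and text may go in the body or as keyword parameters):

res = constants.ES_CLIENT.indices.analyze(
    index="social_media",
    body={"analyzer": "my_english_shingle", "text": "COMPLAINING TEST"}
)
print([t["token"] for t in res["tokens"]])  # e.g. ['complain', 'complain test', 'test']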
Also, I'm not sure about your ES version, but mine requires the boolean values true and false to be lowercase.
