How to specify Stopwords in Elasticsearch mapping using python

I have this Python code where I first create an Elasticsearch mapping and then, after data is inserted, search that data:
# Create Data mapping
data_mapping = {
"mappings": {
(doc_type): {
"properties": {
"data_id": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"data":{
"type": "array",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"resp": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"update": {
"type": "integer",
"fields": {
"stemmed": {
"type": "integer",
"analyzer": "english"
}
}
}
}
}
}
}
#Search
data_search = {
"query": {
"function_score": {
"query": {
"match": {
'data': question
}
},
"field_value_factor": {
"field": "update",
"modifier": "log2p"
}
}
}
}
response = es.search(index=doc_type, body=data_search)
What I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping I have written above.
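A minimal sketch of what the comments seem to suggest, under a few assumptions: stopwords are configured on an analyzer in the "settings" > "analysis" section of the index body, and the mapping then refers to that analyzer by name. The analyzer name english_custom_stop, the helper build_index_body, and the index name qa are made up for illustration. The mapping keeps the doc_type level and the "string" type from the question, which newer Elasticsearch versions replace with a typeless mapping and "text".

```python
def build_index_body(doc_type, stopwords):
    """Return a create-index body whose English analyzer uses custom stopwords."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name. The built-in "english"
                    # analyzer type accepts a "stopwords" list.
                    "english_custom_stop": {
                        "type": "english",
                        "stopwords": list(stopwords),
                    }
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "data": {
                        "type": "string",
                        # Refer to the analyzer by its name.
                        "analyzer": "english_custom_stop",
                    }
                }
            }
        },
    }


body = build_index_body("qa", ["a", "an", "the"])
# With a live cluster this body would be passed when creating the index, e.g.
# es.indices.create(index="qa", body=body)
```

Because a match query analyzes the search text with the field's analyzer unless told otherwise, the search body should not need a separate stopword setting in this setup.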

Related

Can't do case-insensitive search in Elasticsearch

I'm new to Elasticsearch and trying to get this query right.
I have a document like this:
{
"id": 1,
"name": "Văn Hiến"
}
I want to match that document in three cases:
1/ User input is: "v" or "h" or "i",...
2/ User input is: "Văn" or "văn" or "hiến",...
3/ User input is: "va" or "van" or "van hi",...
I can currently search for cases 1 and 2, but not case 3, where the user input lacks the tone marks (diacritics) of the Vietnamese language.
This is my query, I'm using Python:
query = {
"bool": {
"should": [
{
"match": {
"name": name.lower()
}
},
{
"wildcard": {
"name": {
"value": f"*{name.lower()}*"
}
}
}
]
}
}
Can anyone help me with this? Any help will be appreciated.

Use the lowercase token filter and the mapping character filter in your analyzer.
The following mapping and query will work for all three use cases you mentioned.
Mapping Example:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"filter": [
"lowercase"
],
"char_filter": [
"my_mappings_char_filter"
]
}
},
"char_filter": {
"my_mappings_char_filter": {
"type": "mapping",
"mappings": [
"ă => a",
"ế => e"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff" : "9"
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"facet": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Example Query:
{
"query" : {
"query_string" :{
"query":"van hi",
"type": "best_fields",
"default_field": "name"
}
}
}
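To see why this matches "van hi", here is a rough pure-Python imitation of the analyzer above. It is only an approximation written for illustration, not Elasticsearch's implementation: it applies the two character mappings, splits the text into runs of letters, emits every n-gram of length 1 to 10, and lowercases the result. The function name analyze and the NFC normalization step are assumptions added here.

```python
import re
import unicodedata

# Same two mappings as "my_mappings_char_filter" above.
CHAR_MAP = {"\u0103": "a", "\u1ebf": "e"}  # ă => a, ế => e


def analyze(text, min_gram=1, max_gram=10):
    """Approximate the analyzer: char mapping, letter n-grams, lowercase."""
    # Normalize so that accented letters are single code points.
    text = unicodedata.normalize("NFC", text)
    mapped = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    tokens = set()
    # [^\W\d_]+ matches runs of letters, like token_chars: ["letter"].
    for word in re.findall(r"[^\W\d_]+", mapped):
        for start in range(len(word)):
            longest = min(max_gram, len(word) - start)
            for size in range(min_gram, longest + 1):
                tokens.add(word[start:start + size].lower())
    return tokens


indexed = analyze("V\u0103n Hi\u1ebfn")   # the stored name, "Văn Hiến"
query = analyze("van hi")                 # the user input from case 3
print(sorted(query - indexed))            # query tokens missing from the index
```

Every token produced from the query also appears among the tokens produced from the stored name, which is why the document is found. Only "ă" and "ế" are mapped here, so other accented Vietnamese letters would each need their own entry in the mapping list.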

Python Elasticsearch: Errors when trying to apply an analyzer to Index documents

I'm trying to apply an analyzer to my index, but no matter what I do I get some sort of error. I've been looking things up all day but can't get it to work. If I run it as it is below, I get this error:
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'analyzer [{settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}}] has not been configured in mappings')
If I add a "mappings" key below the body= part of the code and above the "properties" part, I get this error:
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'Root mapping definition has unsupported parameters: [mappings : {properties={Name={analyzer={settings={analysis={analyzer={filter=[lowercase], type=custom, tokenizer=keyword}}}} (and it'll go through every name in the body part of the code)
def text_normalization():
    normalization_analyzer = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
    elasticsearch.indices.put_mapping(
        index=index_name,
        body={
            "properties": {
                "Year of Birth": {
                    "type": "integer",
                },
                "Name": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Status": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Country": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                },
                "Blood Type": {
                    "type": "text",
                    "analyzer": normalization_analyzer
                }
            }
        }
    )
    match_docments = elasticsearch.search(index=index_name, body={"query": {"match_all": {}}})
    print(match_docments)
Any help would be appreciated.

Your analyzer is simply missing a name. You should specify it like this:
normalization_analyzer = {
"settings": {
"analysis": {
"analyzer": {
"normalization_analyzer": {  # <--- add this
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
}
You need to register this analyzer on the index using
elasticsearch.indices.put_settings(...)
Also, in the mappings section you need to reference the analyzer by name, so you simply pass the analyzer name as a string:
body={
"properties": {
"Year of Birth": {
"type": "integer",
},
"Name": {
"type": "text",
"analyzer": "normalization_analyzer"
},
"Status": {
"type": "text",
"analyzer": "normalization_analyzer"
},
"Country": {
"type": "text",
"analyzer": "normalization_analyzer"
},
"Blood Type": {
"type": "text",
"analyzer": "normalization_analyzer"
}
}
}
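Putting the two pieces together, a sketch of the order of calls might look like the following. It assumes the body= keyword style used in the question and an index on which these fields are not mapped yet, since Elasticsearch rejects a change of analyzer on an existing field. The helper names build_mapping and apply_analyzer are made up for illustration, and es stands for the client object that the question calls elasticsearch.

```python
# Analyzer definition, now with a name under "analyzer".
ANALYSIS_SETTINGS = {
    "analysis": {
        "analyzer": {
            "normalization_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}

TEXT_FIELDS = ["Name", "Status", "Country", "Blood Type"]


def build_mapping():
    """Mapping body that refers to the analyzer by its name (a string)."""
    properties = {"Year of Birth": {"type": "integer"}}
    for field in TEXT_FIELDS:
        properties[field] = {"type": "text", "analyzer": "normalization_analyzer"}
    return {"properties": properties}


def apply_analyzer(es, index_name):
    """Register the analyzer first, then the mapping that uses it."""
    # New analyzers can only be added while the index is closed.
    es.indices.close(index=index_name)
    es.indices.put_settings(index=index_name, body=ANALYSIS_SETTINGS)
    es.indices.open(index=index_name)
    es.indices.put_mapping(index=index_name, body=build_mapping())
```

If the index does not exist yet, passing both "settings" and "mappings" in a single create-index call avoids the close and reopen steps.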

Elasticsearch - IndicesClient.put_settings not working

I am trying to update my original index settings.
My initial setting looks like this:
client.create(index = "movies", body= {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"my_custom_stop_words": {
"type": "stop",
"stopwords": stop_words
}
},
"analyzer": {
"my_custom_analyzer": {
"filter": [
"lowercase",
"my_custom_stop_words"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"properties": {
"body": {
"type": "text",
"analyzer": "my_custom_analyzer",
"search_analyzer": "my_custom_analyzer",
"search_quote_analyzer": "my_custom_analyzer"
}
}
}
},
ignore=400
)
And I am trying to add the synonym filter to my existing analyzer (my_custom_analyzer) using client.put_settings:
client.put_settings(index='movies', body={
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"filter": [
"lowercase",
"my_stops",
"my_synonyms"
],
"type": "custom",
"tokenizer": "standard"
}
},
"filter": {
"my_custom_stops": {
"type": "stop",
"stopwords": stop_words
},
"my_custom_synonyms": {
"ignore_case": "true",
"type": "synonym",
"synonyms": ["Harry Potter, HP => HP", "Terminator, TM => TM"]
}
}
}
},
"mappings": {
"properties": {
"body": {
"type": "text",
"analyzer": "my_custom_analyzer",
"search_analyzer": "my_custom_analyzer",
"search_quote_analyzer": "my_custom_analyzer"
}
}
}
},
ignore=400
)
However, when I issue a search query for "HP" against the movies index, I expect the document containing "Harry Potter" 5 times to be ranked at the top of the list. Right now, the document containing "HP" 3 times tops the list, so it seems the synonym filter isn't working. I closed the movies index before calling client.put_settings and re-opened it afterwards.
Any help would be greatly appreciated!

You should re-index all your data in order to apply the updated settings on all your data and fields.
Data that has already been indexed won't be affected by the updated analyzer; only documents indexed after you updated the settings will be affected.
Not re-indexing your data might produce incorrect results since your old data is analyzed with the old custom analyzer and not with the new one.
The most efficient way to resolve this issue is to create a new index, and move your data from the old one to the new one with the updated settings.
Reindex API
Follow these steps:
POST _reindex
{
"source": {
"index": "movies"
},
"dest": {
"index": "new_movies"
}
}
DELETE movies
PUT movies
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"filter": [
"lowercase",
"my_custom_stops",
"my_custom_synonyms"
],
"type": "custom",
"tokenizer": "standard"
}
},
"filter": {
"my_custom_stops": {
"type": "stop",
"stopwords": "stop_words"
},
"my_custom_synonyms": {
"ignore_case": "true",
"type": "synonym",
"synonyms": [
"Harry Potter, HP => HP",
"Terminator, TM => TM"
]
}
}
}
},
"mappings": {
"properties": {
"body": {
"type": "text",
"analyzer": "my_custom_analyzer",
"search_analyzer": "my_custom_analyzer",
"search_quote_analyzer": "my_custom_analyzer"
}
}
}
}
POST _reindex?wait_for_completion=false
{
"source": {
"index": "new_movies"
},
"dest": {
"index": "movies"
}
}
After you've verified all your data is in place, you can delete the new_movies index:
DELETE new_movies
Hope this helps.
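Since the question uses the Python client, the same steps could be sketched roughly as below. This assumes a client object es that exposes reindex, indices.delete and indices.create with the body= keyword, as in the question's code. The function name rebuild_index and its parameters are made up for illustration, and both copies wait for completion here, unlike the second request above.

```python
def rebuild_index(es, index, tmp_index, index_body):
    """Recreate `index` with new settings while keeping its documents."""
    # 1. Copy the documents to a temporary index.
    es.reindex(body={"source": {"index": index}, "dest": {"index": tmp_index}})
    # 2. Drop the old index and recreate it with the new analysis settings.
    es.indices.delete(index=index)
    es.indices.create(index=index, body=index_body)
    # 3. Copy the documents back; they are analyzed again on the way in.
    es.reindex(body={"source": {"index": tmp_index}, "dest": {"index": index}})
```

A call such as rebuild_index(client, "movies", "new_movies", new_body) would mirror the requests above. In practice it is worth checking the document count of the temporary index before the delete step, because that step cannot be undone.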

Why is JSON-validation always successful using this schema containing 'allOf'?

I have a JSON schema with which I want to validate some data, using python and the jsonschema module. However, this doesn't quite work as expected, as some of the accepted data doesn't appear valid at all (to me and the purpose of my application). Sadly, the schema is provided, so I can't change the schema itself - at least not manually.
This is a shortened version of the schema ('schema.json' in code below):
{
"type": "object",
"allOf": [
{
"type": "object",
"allOf": [
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
}
}
},
{
"type": "object",
"properties": {
"language": {
"type": "integer"
}
}
}
]
},
{
"type": "object",
"properties": {
"addressArray": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
}
}
}
}
}
]
}
This is an example of what should be a valid instance ('person.json' in code below):
{
"firstName": "Sherlock",
"lastName": "Holmes",
"language": 1,
"addresses": [
{
"streetNumber": "221B",
"street": "Baker Street",
"city": "London"
}
]
}
This is an example of what should be considered invalid ('no_person.json' in code below):
{
"name": "eggs",
"colour": "white"
}
And this is the code I used for validating:
from json import load
from jsonschema import Draft7Validator, exceptions
with open('schema.json') as f:
    schema = load(f)
with open('person.json') as f:
    person = load(f)
with open('no_person.json') as f:
    no_person = load(f)
validator = Draft7Validator(schema)
try:
    validator.validate(person)
    print("person.json is valid")
except exceptions.ValidationError:
    print("person.json is invalid")
try:
    validator.validate(no_person)
    print("no_person.json is valid")
except exceptions.ValidationError:
    print("no_person.json is invalid")
Result:
person.json is valid
no_person.json is valid
I expected no_person.json to be invalid. What can be done so that only data such as person.json validates successfully? Thank you very much for your help; I'm very new to this (and spent ages searching for an answer).

This is a working schema. Pay attention to "required": without it, a property that is missing from the instance is simply skipped rather than reported as an error:
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
},
"language": {
"type": "integer"
},
"addresses": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
},
"required": [
"streetNumber",
"street",
"city"
]
}
}
},
"required": [
"firstName",
"lastName",
"language",
"addresses"
]
}
I've got:
person.json is valid
no_person.json is invalid
If you have a more complex response structure (arrays of objects that contain objects, etc.), let me know.
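The effect of "required" can be illustrated with a tiny stand-in written for this purpose. It is not the jsonschema library and covers only the two keywords discussed here, with the names check, PROPERTIES and REQUIRED made up for illustration: "properties" constrains a key only when the key is present, while "required" is what makes a missing key an error.

```python
# Expected Python types for the top-level properties of the schema above.
PROPERTIES = {"firstName": str, "lastName": str, "language": int, "addresses": list}
REQUIRED = ["firstName", "lastName", "language", "addresses"]


def check(instance, properties, required=()):
    """Return a list of error messages; an empty list means 'valid'."""
    errors = []
    for key in required:
        if key not in instance:
            errors.append(f"missing required key: {key}")
    for key, expected_type in properties.items():
        # "properties" says nothing about keys that are absent.
        if key in instance and not isinstance(instance[key], expected_type):
            errors.append(f"wrong type for key: {key}")
    return errors


person = {
    "firstName": "Sherlock",
    "lastName": "Holmes",
    "language": 1,
    "addresses": [{"streetNumber": "221B", "street": "Baker Street", "city": "London"}],
}
no_person = {"name": "eggs", "colour": "white"}

print(check(no_person, PROPERTIES))            # no "required": nothing to complain about
print(check(no_person, PROPERTIES, REQUIRED))  # every required key is reported missing
print(check(person, PROPERTIES, REQUIRED))     # still valid
```

This mirrors why the original schema accepted no_person.json: none of its "properties" keys were present, and nothing was marked as required.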

Improving elasticsearch performance

I'm using Elasticsearch in a Python web app to query news documents. There are currently 100,000 documents in the database.
The original DB is a MongoDB one, and Elasticsearch is plugged in through the mongoriver plugin.
The problem is that the function takes ~850 ms to return the results. I'd like to decrease that number as much as possible.
Here's the Python code I'm using to query the DB (the limit is usually 16):
def search_news(term, limit, page, flagged_articles):
    query = {
        "query": {
            "from": page*limit,
            "size": limit,
            "multi_match" : {
                "query" : term,
                "fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
            }
        },
        "filter" : {
            "not" : {
                "filter" : {
                    "ids" : {
                        "values" : flagged_articles
                    }
                },
                "_cache" : True
            }
        }
    }
    es_query = json_util.dumps(query)
    uri = 'http://localhost:9200/newsidx/_search'
    r = requests.get(uri, data=es_query)
    results = json.loads( r.text )
    data = []
    for res in results['hits']['hits']:
        data.append(res['_source'])
    return data
And here's the index mapping:
{
"news": {
"properties": {
"actual_rank": {
"type": "long"
},
"added": {
"type": "date",
"format": "dateOptionalTime"
},
"api_id": {
"type": "long"
},
"art_text": {
"type": "string"
},
"category": {
"type": "string"
},
"downvotes": {
"type": "long"
},
"entities": {
"properties": {
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"flags": {
"properties": {
"a": {
"type": "long"
},
"b": {
"type": "long"
},
"bad_image": {
"type": "long"
},
"c": {
"type": "long"
},
"d": {
"type": "long"
},
"innapropiate": {
"type": "long"
},
"irrelevant_info": {
"type": "long"
},
"miscategorized": {
"type": "long"
}
}
},
"media": {
"type": "string"
},
"published": {
"type": "string"
},
"published_date": {
"type": "date",
"format": "dateOptionalTime"
},
"show": {
"type": "boolean"
},
"source": {
"type": "string"
},
"source_rank": {
"type": "double"
},
"summary": {
"type": "string"
},
"times_showed": {
"type": "long"
},
"title": {
"type": "string"
},
"top_entities": {
"properties": {
"einfo_test": {
"type": "string"
},
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"tweet_article_poster": {
"type": "string"
},
"tweet_favourites": {
"type": "long"
},
"tweet_retweets": {
"type": "long"
},
"tweet_user_rank": {
"type": "double"
},
"upvotes": {
"type": "long"
},
"url": {
"type": "string"
}
}
}
}
Edit: The response time was measured on the server, given the tornado server information output.

I've rewritten your query somewhat here, moving from and size to the outer scope, wrapping the query in a filtered query, and changing your not filter to a bool/must_not filter, which should be cached by default:
{
"query": {
"filtered": {
"query": {
"multi_match" : {
"query" : term,
"fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
}
},
"filter" : {
"bool" : {
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
}
}
},
"from": page * limit,
"size": limit
}
I haven't tested this, and I haven't made sense of your mapping as it is jumbled, so there might be some improvements to be made there.
Edit: This is a great read on why to use the bool filter: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/ - in short, bool uses 'bitsets', which are very fast on subsequent queries.
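As a sketch, the rewritten body could be built in Python like this and then serialized the same way as in the question. The function name build_search_body is made up for illustration. Note that the "filtered" query only exists in older Elasticsearch releases (it was removed in 5.0), so this assumes a version from the era of the question.

```python
def build_search_body(term, limit, page, flagged_articles):
    """Build the search body with paging at the top level."""
    return {
        # Paging belongs next to "query", not inside it.
        "from": page * limit,
        "size": limit,
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {
                        "query": term,
                        "fields": [
                            "title^3",
                            "category^5",
                            "entities.name^5",
                            "art_text^1",
                            "summary^1",
                        ],
                    }
                },
                "filter": {
                    "bool": {
                        # Exclude the flagged articles by document id.
                        "must_not": {"ids": {"values": list(flagged_articles)}}
                    }
                },
            }
        },
    }


body = build_search_body("elections", limit=16, page=2, flagged_articles=["a1", "b2"])
```

The resulting dict can replace the query variable in search_news, leaving the rest of that function unchanged.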

First of all you can add the boosts to your mapping (assuming it doesn't interfere with your other queries) like this:
"title": {
"boost": 3.0,
"type": "string"
},
"category": {
"boost": 5.0,
"type": "string"
},
etc.
Then set up a bool query with field (or term) queries like this:
"query": {
"bool" : {
"should" : [ {
"field" : {
"title" : term
}
}, {
"field" : {
"category" : term
}
} ],
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
},
"from": page * limit,
"size": limit
This should perform better, but without access to your setup I can't test it :)
