I'm trying to find a way to use Elasticsearch to query a field that is both period- and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id   text         tag
===  ===========  ==============
1    some-text    A.B.c3
2    more. text   A.B-C.c4
3    even more.   B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account; I'm just including it for illustration):
A.B.c3          1
A.B-C.c4        2
B.A-32.D-24.f9  3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish all of the following: match "A." at the beginning, differentiate between a period and a hyphen (possibly boosting on this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data, my searches return the results I would expect, but when querying against the full data-set, I get fewer hits.
This can be done with the N-Gram tokenizer.
Based on what you've provided in the question, I've created the corresponding mapping, documents, and a sample query to give you what you're looking for.
Mapping
PUT idtesttag
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "id": {
          "type": "long"
        },
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        },
        "tag": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
What this does is: if a document with id = 1 has the tag A.B, it stores the following groups of characters in its inverted index:
A. -> 1
.B -> 1
A.B -> 1
So if your query contains any of these three terms, your document with id = 1 is returned.
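If you want to double-check what the analyzer emits, the Analyze API shows the exact terms. A quick sketch using elasticsearch-py against the mapping above (the client setup is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on localhost:9200

# Ask the index to analyze a sample tag with the custom ngram analyzer
resp = es.indices.analyze(index="idtesttag", body={
    "analyzer": "my_analyzer",
    "text": "A.B"
})
print([t["token"] for t in resp["tokens"]])  # expected: ['A.', 'A.B', '.B']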
Sample Documents
POST idtesttag/mydocs/1
{
  "id": 1,
  "text": "some-text",
  "tag": "A.B.c3"
}

POST idtesttag/mydocs/2
{
  "id": 2,
  "text": "more. text",
  "tag": "A.B-C.c4"
}

POST idtesttag/mydocs/3
{
  "id": 3,
  "text": "even more.",
  "tag": "B.A-32.D-24.f9"
}

POST idtesttag/mydocs/4
{
  "id": 3,
  "text": "even more.",
  "tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
  "query": {
    "match": {
      "tag": "A.B"
    }
  }
}
Query Response
{
  "took": 139,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 0.66078395,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 0.46659434,
        "_source": {
          "id": 3,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      }
    ]
  }
}
Note that documents 1, 2, and 4 are returned in the response. Document 4 is the mid-string match, while documents 1 and 2 match at the beginning.
Also note how the score values differ.
Boosting based on hyphen
Now, with regard to boosting based on the hyphen character, I'd suggest a bool query combined with a regexp query and a boost. Below is a sample query I came up with.
Note that, just for simplicity's sake, I've added a regex that only boosts when a hyphen comes right after A.B.
POST idtesttag/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "tag": "A.B" }
      },
      "should": [
        {
          "regexp": {
            "tag": {
              "value": "A\\.B-.*",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}
Boosting Query Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 3.660784,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 3.660784,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 3.4665942,
        "_source": {
          "id": 3,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      }
    ]
  }
}
Just make sure your testing is thorough when it comes to boosting, because it's all about influencing the score. Do that testing with production data ingested into a DEV/TEST Elastic index; that way you won't be spooked by totally different results when you move to the PROD cluster.
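One way to sanity-check the boost is the Explain API, which breaks down exactly how a document's score was computed. A minimal sketch with elasticsearch-py (the index, type, and document id come from the examples above; the client setup is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Explain how document 2 scored against the boosted query
explanation = es.explain(
    index="idtesttag",
    doc_type="mydocs",  # needed on the pre-7.x cluster used in this answer
    id="2",
    body={
        "query": {
            "bool": {
                "must": {"match": {"tag": "A.B"}},
                "should": [
                    {"regexp": {"tag": {"value": "A\\.B-.*", "boost": 3}}}
                ]
            }
        }
    },
)
print(explanation["explanation"])  # nested breakdown of every scoring component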
Sorry for the rather long answer, but I hope this helps!
Based on what you've described in your post regarding the 'tag' field, here are my two cents.
Your MySQL data should live in a single type (in 6.5 it's 'doc' by default). You do need to explicitly define your index mapping, though, especially on the 'tag' field, since you have search requirements on it.
I would define your 'tag' field as a multi-field of:
type 'keyword' for aggregations
type 'text' for searches, with a custom analyzer (one that might use the 'whitespace' tokenizer and an 'edge ngram' token filter)
(If you don't need aggregations, just define a 'text' field with the custom analyzer; see the sketch below.)
FYI, the Analyze API will show you what ES does with your 'tag' data, and will help you define a mapping that meets your requirements.
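A minimal sketch of that multi-field mapping, assuming a 6.x cluster and elasticsearch-py (the index name, analyzer name, and gram sizes are made-up placeholders to tune for your tags):

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index="mytags", body={
    "settings": {
        "analysis": {
            "filter": {
                "tag_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "tag_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "tag_edge_ngram"]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "tag": {
                    "type": "text",
                    "analyzer": "tag_analyzer",
                    "fields": {
                        "raw": {"type": "keyword"}  # exact values, for aggregations
                    }
                }
            }
        }
    }
})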
Related
I'm trying to find a way to retrieve some data from MongoDB through Python scripts,
but I got stuck in the following situation:
I have to retrieve some data, check a field value, and compare it with other data (MongoDB documents).
But the object's name may vary from module to module; see below:
Document 1
{
  "_id": "001",
  "promotion": {
    "Avocado": {
      "id": "01",
      "timestamp": "202005181407"
    },
    "Banana": {
      "id": "02",
      "timestamp": "202005181407"
    }
  },
  "product": {
    "id": "11"
  }
}
Document 2
{
  "_id": "002",
  "promotion": {
    "Grape": {
      "id": "02",
      "timestamp": "202005181407"
    },
    "Dragonfruit": {
      "id": "02",
      "timestamp": "202005181407"
    }
  },
  "product": {
    "id": "15"
  }
}
I'll always have an object called promotion, but the children's names may vary; sometimes it's an ordered number, sometimes it's not. The field whose value I need is the id inside each promotion child, and it will always have that same name.
So if a document matches the criteria, I'll retrieve it with Python and get the rest of the work done.
PS: I'm not the one responsible for this kind of document structure.
I've already tried the operators below, but couldn't get them to work the way I need:
$all
$elemMatch
Try this aggregation pipeline:
[
    {
        '$addFields': {
            'fruits': {
                '$objectToArray': '$promotion'
            }
        }
    },
    {
        '$addFields': {
            'FruitIds': '$fruits.v.id'
        }
    },
    {
        '$project': {
            '_id': 0,
            'FruitIds': 1
        }
    }
]
Output produced:
{FruitIds:["01","02"]},
{FruitIds:["02","02"]}
Is this the desired output?
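If so, here's a minimal sketch of running that pipeline from Python with PyMongo (the connection string, database, and collection names are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["promotions"]

pipeline = [
    # Turn the variable-keyed promotion object into a key/value array
    {"$addFields": {"fruits": {"$objectToArray": "$promotion"}}},
    # Collect the id of every child, whatever its name happens to be
    {"$addFields": {"FruitIds": "$fruits.v.id"}},
    {"$project": {"_id": 0, "FruitIds": 1}},
]

for doc in collection.aggregate(pipeline):
    print(doc)  # e.g. {'FruitIds': ['01', '02']}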
I have three indexes, and all three share a particular key-value pair. When I do a blanket search with the API "http://localhost:9200/_search" using the request body
{"query":{
"query_string":
{
"query":"city*"
}
}
}
It only returns results from two of the indexes. I tried the same request body, altering the URL to search only the missed index ("http://localhost:9200/index_name/_search"), and that works. Am I missing anything here?
The code for inserting into all three indexes follows the same procedure, and I used elasticsearch-py to ingest the data.
I'm using the GET HTTP method and have also tried POST; both return the same results. The Elasticsearch version is 7.6.0.
The results for a specific-index search look like this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "index_name",
        "_type": "meta",
        "_id": "LMRqDnIBh5wU6Ax_YsOD",
        "_score": 1.0,
        "_source": {
          "table_schema": "test_table",
          "table_name": "citymaster_old"
        }
      }
    ]
  }
}
The reason might be that you haven't provided the size parameter in the query, which limits the result count to 10 by default. Of all the matches, the top 10 might come from just two of the indexes even though matches exist in the third index as well, giving the impression that results from the third index are not returned.
Try adding the size parameter:
{
  "query": {
    "query_string": {
      "query": "city*"
    }
  },
  "size": 20
}
You can figure out the number of documents that matched the query from the total key in the response:
"total": {
"value": 1,
"relation": "eq"
}
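For reference, a quick sketch of the same search through elasticsearch-py, printing the total and the index each hit came from (the client setup is assumed; on 7.x the total is an object):

from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.search(body={
    "query": {"query_string": {"query": "city*"}},
    "size": 20  # raise this above the default of 10
})
print(res["hits"]["total"])  # e.g. {'value': 12, 'relation': 'eq'}
for hit in res["hits"]["hits"]:
    print(hit["_index"], hit["_id"])  # shows which index each hit came from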
I am using a match phrase query to search in ES, but I have noticed that the results returned are not appropriate.
Code:
res = es.search(
    index="indice_1",
    body={
        "_source": ["content"],
        "query": {
            "match_phrase": {
                "content": "xyz abc"
            }
        }
    },
    size=500,
    scroll="60s",
)
It doesn't get me records where the content is:
"hi my name isxyz abc." and "hey wassupxyz abc. how is life"
Doing a similar search in MongoDB using a regex gets both records. Any help would be appreciated.
If you didn't specify an analyzer, then you are using standard by default. It does grammar-based tokenization, so your terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc], and match_phrase is looking for the terms [xyz, abc] right next to each other (unless you specify slop).
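You can confirm this with the Analyze API; a quick sketch, assuming elasticsearch-py and a local cluster:

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "hi my name isxyz abc."
})
print([t["token"] for t in resp["tokens"]])
# ['hi', 'my', 'name', 'isxyz', 'abc'] -- no standalone 'xyz' term to match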
You can either use a different analyzer or modify your query. If you use a match query, it will match on the term "abc". If you want the phrase to match, you'll need a different analyzer; nGrams should work for you.
Here's an example:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test_index/_doc/1
{
  "content": "hi my name isxyz abc."
}

PUT test_index/_doc/2
{
  "content": "hey wassupxyz abc. how is life"
}

POST test_index/_doc/_search
{
  "query": {
    "match_phrase": {
      "content": "xyz abc"
    }
  }
}
That results in finding both documents.
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "hey wassupxyz abc. how is life"
        }
      },
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "content": "hi my name isxyz abc."
        }
      }
    ]
  }
}
EDIT:
If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:
PUT test_index/_doc/3
{
  "content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}
And you can query it with wildcard:
POST test_index/_doc/_search
{
  "query": {
    "wildcard": {
      "content.keyword": {
        "value": "*Lasit Pant*"
      }
    }
  }
}
Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
You can also set the match query's type parameter to phrase (this option was deprecated and later removed; on recent versions use match_phrase instead):
res = es.search(
    index="indice_1",
    body={
        "_source": ["content"],
        "query": {
            "match": {
                "content": {
                    "query": "xyz abc",
                    "type": "phrase"
                }
            }
        }
    },
    size=500,
    scroll="60s",
)
I am a newbie to Elasticsearch. I am trying to implement it in Python for one of my college projects; I want to use Elasticsearch as a resume indexer. Everything is working fine except that it shows all the fields in the _source field. I don't want some of those fields, and I've tried many things, but nothing is working. Below is my code:
es = Elasticsearch()
query = {
    "_source": {
        "exclude": ["resume_content"]
    },
    "query": {
        "match": {
            "resume_content": {
                "query": keyword,
                "fuzziness": "Auto",
                "operator": "and",
                "store": "false"
            }
        }
    }
}
res = es.search(
    size=es_conf["MAX_SEARCH_RESULTS_LIMIT"],
    index=es_conf["ELASTIC_INDEX_NAME"],
    body=query,
)
return res
where es_conf is my local config dictionary.
Apart from the above code, I have also tried _source: false, _source: [names of my fields], and fields: [names of my fields]. I also tried store=False in my search method. Any ideas?
Did you try just using fields?
Here's a simple example. I set up a mapping with three fields, (imaginatively) named "field1", "field2", and "field3":
PUT /test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "field1": {
          "type": "string"
        },
        "field2": {
          "type": "string"
        },
        "field3": {
          "type": "string"
        }
      }
    }
  }
}
Then I indexed three documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"text11","field2":"text12","field3":"text13"}
{"index":{"_id":2}}
{"field1":"text21","field2":"text22","field3":"text23"}
{"index":{"_id":3}}
{"field1":"text31","field2":"text32","field3":"text33"}
And let's say I want to find docs that contain "text22" in field "field2", but I only want to return the contents of "field1" and "field2". Here's the query:
POST /test_index/doc/_search
{
  "fields": [
    "field1",
    "field2"
  ],
  "query": {
    "match": {
      "field2": "text22"
    }
  }
}
which returns:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1.4054651,
        "fields": {
          "field1": [
            "text21"
          ],
          "field2": [
            "text22"
          ]
        }
      }
    ]
  }
}
Here's the code I used: http://sense.qbox.io/gist/69dabcf9f6e14fb1961ec9f761645c92aa8e528b
It should be straightforward to set this up with the Python adapter.
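For completeness, a sketch of what that request might look like through elasticsearch-py, against the same 1.x-era index as the example above (the client setup is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.search(index="test_index", body={
    "fields": ["field1", "field2"],
    "query": {"match": {"field2": "text22"}}
})
for hit in res["hits"]["hits"]:
    print(hit["fields"])  # only field1/field2 come back, not the full _source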
Using Elasticsearch's query DSL, this is how I am currently constructing my query:
elastic_sort = [
    {"timestamp": {"order": "desc"}},
    "_score",
    {"name": {"order": "desc"}},
    {"channel": {"order": "desc"}},
]

elastic_query = {
    "fuzzy_like_this": {
        "fields": ["msgs.channel", "msgs.msg", "msgs.name"],
        "like_text": search_string,
        "max_query_terms": 10,
        "fuzziness": 0.7,
    }
}

res = self.es.search(index="chat", body={
    "from": from_result,
    "size": results_per_page,
    "track_scores": True,
    "query": elastic_query,
    "sort": elastic_sort,
})
I've been trying to implement a filter or an analyzer that will allow the inclusion of "#" in searches (I want a search for "#thing" to return results that include "#thing"), but I am coming up short. The error messages I am getting are not helpful; they just tell me that my query is malformed.
I attempted to incorporate the method found here: http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html but it doesn't make any sense to me in context.
Does anyone have a clue how I can do this?
Did you create a mapping for your index? You can specify within your mapping that certain fields should not be analyzed.
For example, a tweet mapping can be something like:
"tweet": {
"properties": {
"id": {
"type": "long"
},
"msg": {
"type": "string"
},
"hashtags": {
"type": "string",
"index": "not_analyzed"
}
}
}
You can then perform a term query on "hashtags" for an exact string match, including the "#" character.
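For instance, a term query sketch from Python (the index name is a made-up placeholder; the client setup is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.search(index="tweets", body={
    "query": {
        # exact match; "#" survives because the field is not analyzed
        "term": {"hashtags": "#thing"}
    }
})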
If you want "hashtags" to be tokenized as well, you can always create a multi-field for "hashtags".