Elasticsearch not giving exact results in Python

I am using a match_phrase query to search in Elasticsearch, but I have noticed that the results returned are not what I expect.
Code:
res = es.search(
    index='indice_1',
    body={
        "_source": ["content"],
        "query": {
            "match_phrase": {
                "content": "xyz abc"
            }
        }
    },
    size=500,
    scroll='60s')
It doesn't return records where content is:
"hi my name isxyz abc." and "hey wassupxyz abc. how is life"
A similar search in MongoDB using a regex returns both records. Any help would be appreciated.

If you didn't specify an analyzer then you are using standard by default. It will do grammar based tokenization. So your terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc] and match_phrase is looking for the terms [xyz, abc] right next to each other (unless you specify slop).
You can either use a different analyzer or modify your query. If you use a match query, it will match on the term "abc". If you want the phrase to match, you'll need to use a different analyzer. NGrams should work for you.
Here's an example:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
PUT test_index/_doc/1
{
"content": "hi my name isxyz abc."
}
PUT test_index/_doc/2
{
"content": "hey wassupxyz abc. how is life"
}
POST test_index/_doc/_search
{
"query": {
"match_phrase": {
"content": "xyz abc"
}
}
}
That results in finding both documents.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "2",
"_score": 0.5753642,
"_source": {
"content": "hey wassupxyz abc. how is life"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"content": "hi my name isxyz abc."
}
}
]
}
}
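To see why the trigram analyzer finds both documents while the standard analyzer does not, here is a minimal Python sketch. It only approximates the tokenizers (the helper names trigram_tokens and contains_phrase are made up for illustration and are not part of any Elasticsearch client), and it treats a phrase match as "the query tokens appear next to each other, in order".

```python
import re

def trigram_tokens(text, n=3):
    """Rough stand-in for an ngram tokenizer with token_chars letter/digit."""
    tokens = []
    # Split on anything that is not a letter or digit, then emit n-grams per run.
    for run in re.findall(r"[^\W_]+", text):
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occur as a contiguous run inside doc_tokens."""
    m = len(query_tokens)
    return any(doc_tokens[i:i + m] == query_tokens
               for i in range(len(doc_tokens) - m + 1))

text = "hi my name isxyz abc."

# Word-level tokens (roughly what the standard analyzer produces here).
words = re.findall(r"[^\W_]+", text.lower())
print(words)                                     # 'isxyz' is a single term
print(contains_phrase(words, ["xyz", "abc"]))    # no phrase match

# Trigram tokens: 'xyz' and 'abc' become adjacent terms.
grams = trigram_tokens(text)
print(grams)
print(contains_phrase(grams, trigram_tokens("xyz abc")))
```

With word-level tokens the term "isxyz" never equals "xyz", so the phrase cannot match. With trigrams, "xyz" and "abc" sit next to each other in the token stream, which is what match_phrase needs.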
EDIT:
If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:
PUT test_index/_doc/3
{
"content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}
And you can query it with wildcard:
POST test_index/_doc/_search
{
"query": {
"wildcard": {
"content.keyword": {
"value": "*Lasit Pant*"
}
}
}
}
Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
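As a rough illustration of what the wildcard query does against the keyword sub-field: the pattern is compared with the whole, unanalyzed field value. The sketch below uses Python's fnmatchcase as a stand-in. That is an assumption made for illustration only, since Python globbing also understands [...] character classes, which are not part of the wildcard syntax shown above. The function name wildcard_match is hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_match(value, pattern):
    """Approximate a wildcard query on an unanalyzed (keyword) value."""
    return fnmatchcase(value, pattern)

value = "RegionLasit Pant0Q00B000001KBQ1SAO00"

print(wildcard_match(value, "*Lasit Pant*"))  # substring search via leading/trailing *
print(wildcard_match(value, "Lasit Pant"))    # without wildcards the whole value must match
```

Keep in mind that this query targets content.keyword, so it assumes the index has a keyword sub-field for content (for example, from default dynamic mapping).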

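You can also set the type parameter to phrase in a match query. Note that this only works on older Elasticsearch versions (the parameter was removed in 6.0) and is equivalent to match_phrase:
res = es.search(
    index='indice_1',
    body={
        "_source": ["content"],
        "query": {
            "match": {
                "content": {
                    "query": "xyz abc",
                    "type": "phrase"
                }
            }
        }
    },
    size=500,
    scroll='60s')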

Related

How to filter ElasticSearch results without having it affect the document score?

I am trying to filter my results on the "publication_year" field without affecting the document score. But whether I add the "range" clause to the query or to "filter", it seems to affect the score: documents whose "publication_year" is closer to the upper limit ("lte") of the range are scored higher.
My query:
import datetime

query = {
    'bool': {
        'should': [
            {
                'match_phrase': {
                    "title": keywords
                }
            },
            {
                'match_phrase': {
                    "abstract": keywords
                }
            },
        ]
    }
}
if publication_year_constraint:
    range_query = {"range": {"publication_year": {"gte": publication_year_constraint, "lte": datetime.datetime.today().year}}}
    query["bool"]["filter"] = [range_query]
I tried putting the "range" inside the "should" block as well, with similar results.
Try using filter context.
In a filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated.
Example:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "Search" }},
{ "match": { "content": "Elasticsearch" }}
],
"filter": [
{ "term": { "status": "published" }},
{ "range": { "publish_date": { "gte": "2015-01-01" }}}
]
}
}
}
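Applied to the query from the question, this could look like the Python sketch below. The function name build_query and its parameters are made up for illustration. One assumption is labeled in the code: once a bool query has a filter clause, its should clauses become optional by default, so the sketch sets minimum_should_match to 1 to keep requiring a keyword match. Drop that line if documents that only pass the filter should also be returned.

```python
def build_query(keywords, year_from=None, year_to=None):
    """Build a bool query whose year range runs in filter context (no scoring)."""
    query = {
        "bool": {
            "should": [
                {"match_phrase": {"title": keywords}},
                {"match_phrase": {"abstract": keywords}},
            ],
            # Assumption: at least one should clause has to match.
            "minimum_should_match": 1,
        }
    }
    if year_from is not None:
        year_range = {"gte": year_from}
        if year_to is not None:
            year_range["lte"] = year_to
        # The range only decides yes/no; it does not contribute to _score.
        query["bool"]["filter"] = [{"range": {"publication_year": year_range}}]
    return query

print(build_query("neural networks", year_from=2015, year_to=2020))
```

The result goes under the "query" key of the search body, for example es.search(index=..., body={"query": build_query(...)}).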

Elasticsearch not returning result for single word query

I have a basic Elasticsearch index that consists of a variety of help articles. Users can search for them in my Python/Django app.
The index has the following mappings:
{
"mappings": {
"properties": {
"body": {
"type": "text"
},
"category": {
"type": "nested",
"properties": {
"category_id": {
"type": "long"
},
"category_title": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
}
},
"title": {
"type": "keyword"
},
"date_updated": {
"type": "date"
},
"position": {
"type": "integer"
}
}
}
}
I basically want the user to be able to search for a query and get any results that match the article title or category.
Say I have an article called "I Can't Remember My Password" in the "Your Account" category.
If I search for the article title exactly, I see the result. If I search for the category title exactly, I also see the result.
But if I search for just "password", I get nothing. What do I need to change in my setup/query to make it so that this query (or similarly non-exact queries) also returns the result?
My query looks like:
{
"query": {
"bool": {
"should": [{
"multi_match": {
"fields": ["title"],
"query": "password"
}
},
{
"nested": {
"path": "category",
"query": {
"multi_match": {
"fields": ["category.category_title"],
"query": "password"
}
}
}
}
]
}
}
}
I have read other questions and experimented with various settings but no luck so far. I am not doing anything particularly special at index time in terms of preparing the fields so I don't know if that's something to look at. I'm just using the elasticsearch-dsl defaults.
The solution was to reindex the title field as text rather than keyword. The latter only allows exact matching.
Credit to LeBigCat for pointing that out in the comments. They haven't posted it as an answer so I'm doing it on their behalf to improve visibility.
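For anyone wondering why the field type matters, here is a small Python sketch of the difference. It is only an approximation: text_terms is a crude stand-in for the standard analyzer, and both helper names are made up for illustration.

```python
import re

def keyword_terms(value):
    # A keyword field indexes the whole value as a single term, unchanged.
    return [value]

def text_terms(value):
    # Crude approximation of the standard analyzer: lowercase, then split on
    # anything that is not a letter, digit, or apostrophe.
    return re.findall(r"[a-z0-9']+", value.lower())

title = "I Can't Remember My Password"

print(keyword_terms(title))
print("password" in keyword_terms(title))   # only the exact full title matches
print(text_terms(title))
print("password" in text_terms(title))      # individual words are searchable
```

If exact matching or sorting on the title is still needed, a keyword sub-field under "fields" (as category_title already has in the mapping above) keeps both options.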

Searching period and hyphen-delimited fields in Elasticsearch

I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id  text        tag
====================================
1   some-text   A.B.c3
2   more. text  A.B-C.c4
3   even more.  B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3          1
A.B-C.c4        2
B.A-32.D-24.f9  3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
With this mapping, a document with id = 1 and the tag A.B would have the following character groups stored in its inverted index:
A. -> 1
.B -> 1
A.B -> 1
So if your query contains any of these three tokens, the document with id=1 would be returned.
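The grams above can be reproduced with a few lines of Python. This is a sketch of the idea only, not the tokenizer itself, and the function name ngrams is made up for illustration. It also shows why tag 3 from the question shares no gram with the query "A.B" while a tag like "B.A.B-32.D-24.f9" does.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """All substrings of text with length between min_gram and max_gram."""
    return [text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)]

query_grams = set(ngrams("A.B"))
print(sorted(query_grams))

# A match query on the n-gram field hits any document sharing at least one gram.
for tag in ["A.B.c3", "A.B-C.c4", "B.A-32.D-24.f9", "B.A.B-32.D-24.f9"]:
    shared = query_grams & set(ngrams(tag))
    print(tag, sorted(shared))
```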
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that documents 1, 2 and 4 are returned in the response. Document 4 is the mid-string match, while documents 1 and 2 match at the beginning.
Also note how the score values differ.
Boosting based on hyphen
As for boosting based on the hyphen character, I'd suggest a bool query combined with a regexp query that carries a boost. Below is the sample query I came up with.
Note that, for the sake of simplicity, the regex only boosts when a hyphen immediately follows A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Make sure your testing is thorough when it comes to boosting, because it is all about influencing the score, and do it with production data ingested into a DEV/TEST Elasticsearch index.
That way you won't be surprised by totally different results when you move to PROD.
Sorry for the long answer, but I hope this helps!
Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.
Your MySQL data should be in one type (in 6.5 it's 'doc' by default). You do need to explicitly define your index mapping though, especially for the 'tag' field, as you seem to have search requirements on it.
I would define your 'tag' field as a multi-field of:
type 'keyword' for aggregations
type 'text' for searches, with a custom analyzer (that might use a 'whitespace' tokenizer and an 'edge ngram' token filter)
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, the Analyze API will show you what ES is doing with your 'tag' data, and will help you define the mapping that meets your requirements.
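To get a feel for what a whitespace tokenizer plus an edge n-gram filter would put in the index, here is a small Python sketch. It is a simplification under stated assumptions: the gram lengths (1 to 10) are picked arbitrarily, the search term is looked up as-is rather than being n-grammed itself, and the helper name edge_ngrams is made up for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of token with length between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

tags = {1: "A.B.c3", 2: "A.B-C.c4", 3: "B.A-32.D-24.f9"}

# Index time: split on whitespace, then store every prefix of each token.
index = {}
for doc_id, tag in tags.items():
    for token in tag.split():
        for gram in edge_ngrams(token):
            index.setdefault(gram, set()).add(doc_id)

print(sorted(index.get("A.", set())))   # tags that start with "A."
print(sorted(index.get("A-", set())))   # "A-" only occurs mid-tag in doc 3
```

Edge n-grams cover the "starts with" part of the requirement. Matching "A-" in the middle of a tag, as in the truth table from the question, would need a different tokenization (for example, full n-grams).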

Elastic search is not showing the fields

I am a newbie with Elasticsearch. I am trying to use it from Python for one of my college projects, as a resume indexer. Everything is working fine except that it shows all the fields in the _source field. I don't want some of the fields, and I have tried many things, but nothing works. Below is my code:
es = Elasticsearch()
query = {
"_source":{
"exclude":["resume_content"]
},
"query":{
"match":{
"resume_content":{
"query":keyword,
"fuzziness":"Auto",
"operator":"and",
"store":"false"
}
}
}
}
res = es.search(size=es_conf["MAX_SEARCH_RESULTS_LIMIT"],index=es_conf["ELASTIC_INDEX_NAME"], body=query)
return res
where es_conf is my local dictionary.
Apart from the above code I have also tried _source: false, _source: [name of my fields], and fields: [name of my fields]. I also tried store=False in my search method. Any ideas?
Did you try just using fields?
Here's a simple example. I set up a mapping with three fields, (imaginatively) named "field1", "field2", "field3":
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"field1": {
"type": "string"
},
"field2": {
"type": "string"
},
"field3": {
"type": "string"
}
}
}
}
}
Then I indexed three documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"text11","field2":"text12","field3":"text13"}
{"index":{"_id":2}}
{"field1":"text21","field2":"text22","field3":"text23"}
{"index":{"_id":3}}
{"field1":"text31","field2":"text32","field3":"text33"}
And let's say I want to find docs that contain "text22" in field "field2", but I only want to return the contents of "field1" and "field2". Here's the query:
POST /test_index/doc/_search
{
"fields": [
"field1", "field2"
],
"query": {
"match": {
"field2": "text22"
}
}
}
which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4054651,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1.4054651,
"fields": {
"field1": [
"text21"
],
"field2": [
"text22"
]
}
}
]
}
}
Here's the code I used: http://sense.qbox.io/gist/69dabcf9f6e14fb1961ec9f761645c92aa8e528b
It should be straightforward to set this up with the Python adapter.
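A minimal Python sketch of building such a request body follows. It uses _source filtering rather than the fields parameter, on the assumption that the Elasticsearch version in use accepts "includes"/"excludes" lists under _source. The function name build_search_body is made up for illustration, and the field name resume_content is taken from the question.

```python
def build_search_body(keyword, include=None, exclude=None):
    """Build a search body that limits which _source fields come back."""
    source = {}
    if include:
        source["includes"] = list(include)
    if exclude:
        source["excludes"] = list(exclude)

    body = {
        "query": {
            "match": {
                "resume_content": {"query": keyword, "operator": "and"}
            }
        }
    }
    if source:
        body["_source"] = source
    return body

body = build_search_body("python developer", exclude=["resume_content"])
print(body)

# With a client instance (assumed to exist as `es`), this would be sent as:
# res = es.search(index="<your index>", body=body)
```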

Elastic Search: including #/hashtags in search results

Using elastic search's query DSL this is how I am currently constructing my query:
elastic_sort = [
{ "timestamp": {"order": "desc" }},
"_score",
{ "name": { "order": "desc" }},
{ "channel": { "order": "desc" }},
]
elastic_query = {
"fuzzy_like_this" : {
"fields" : [ "msgs.channel", "msgs.msg", "msgs.name" ],
"like_text" : search_string,
"max_query_terms" : 10,
"fuzziness": 0.7,
}
}
res = self.es.search(index="chat", body={
"from" : from_result, "size" : results_per_page,
"track_scores": True,
"query": elastic_query,
"sort": elastic_sort,
})
I've been trying to implement a filter or an analyzer that will allow the inclusion of "#" in searches (I want a search for "#thing" to return results that include "#thing"), but I am coming up short. The error messages I am getting are not helpful and just tell me that my query is malformed.
I attempted to incorporate the method found here : http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html but it doesn't make any sense to me in context.
Does anyone have a clue how I can do this?
Did you create a mapping for your index? You can specify within your mapping that certain fields should not be analyzed.
For example, a tweet mapping can be something like:
"tweet": {
"properties": {
"id": {
"type": "long"
},
"msg": {
"type": "string"
},
"hashtags": {
"type": "string",
"index": "not_analyzed"
}
}
}
You can then perform a term query on "hashtags" for an exact string match, including the "#" character.
If you want "hashtags" to be tokenized as well, you can always create a multi-field for "hashtags".
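On the Python side, this could look like the sketch below: pull the hashtags out of each message before indexing, store them in the non-analyzed "hashtags" field, and search that field with a term query. The helper names and the regular expression are assumptions made for illustration. The field names follow the example mapping above.

```python
import re

def extract_hashtags(msg):
    """Return every '#word' occurrence in msg, keeping the '#' character."""
    return re.findall(r"#\w+", msg)

def build_doc(doc_id, msg):
    # The hashtags go into their own field so they are indexed verbatim.
    return {"id": doc_id, "msg": msg, "hashtags": extract_hashtags(msg)}

def hashtag_query(tag):
    # A term query is not analyzed, so "#thing" is looked up exactly as typed.
    return {"query": {"term": {"hashtags": tag}}}

doc = build_doc(1, "trying out #thing with #elasticsearch")
print(doc["hashtags"])
print(hashtag_query("#thing"))
```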
