I am a newbie to Elasticsearch. I am trying to implement it in Python for one of my college projects. I want to use Elasticsearch as a resume indexer. Everything is working fine except that it shows all the fields in the _source field. I don't want some of the fields, and I have tried many things, but nothing is working. Below is my code:
es = Elasticsearch()
query = {
    "_source": {
        "exclude": ["resume_content"]
    },
    "query": {
        "match": {
            "resume_content": {
                "query": keyword,
                "fuzziness": "AUTO",
                "operator": "and",
                "store": "false"
            }
        }
    }
}
res = es.search(size=es_conf["MAX_SEARCH_RESULTS_LIMIT"], index=es_conf["ELASTIC_INDEX_NAME"], body=query)
return res
where es_conf is my local dictionary.
Apart from the above code, I have also tried _source: false, _source: [names of my fields], and fields: [names of my fields]. I also tried store=False in my search method. Any ideas?
Did you try just using fields?
Here's a simple example. I set up a mapping with three fields, (imaginatively) named "field1", "field2", "field3":
PUT /test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "field1": {
          "type": "string"
        },
        "field2": {
          "type": "string"
        },
        "field3": {
          "type": "string"
        }
      }
    }
  }
}
Then I indexed three documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"text11","field2":"text12","field3":"text13"}
{"index":{"_id":2}}
{"field1":"text21","field2":"text22","field3":"text23"}
{"index":{"_id":3}}
{"field1":"text31","field2":"text32","field3":"text33"}
And let's say I want to find docs that contain "text22" in field "field2", but I only want to return the contents of "field1" and "field2". Here's the query:
POST /test_index/doc/_search
{
  "fields": [
    "field1", "field2"
  ],
  "query": {
    "match": {
      "field2": "text22"
    }
  }
}
which returns:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1.4054651,
        "fields": {
          "field1": [
            "text21"
          ],
          "field2": [
            "text22"
          ]
        }
      }
    ]
  }
}
Here's the code I used: http://sense.qbox.io/gist/69dabcf9f6e14fb1961ec9f761645c92aa8e528b
It should be straightforward to set this up with the Python adapter.
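For instance, here's a minimal sketch of the same search with the elasticsearch-py client, using the index and field names from the example above (the localhost connection is an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node on localhost:9200

query = {
    "fields": ["field1", "field2"],  # only return these fields per hit
    "query": {
        "match": {
            "field2": "text22"
        }
    }
}

res = es.search(index="test_index", body=query)
for hit in res["hits"]["hits"]:
    print(hit["fields"])  # e.g. {'field1': ['text21'], 'field2': ['text22']}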
I'm new to Elasticsearch and I'm trying to get this query right.
I have a document like this:
{
  "id": 1,
  "name": "Văn Hiến"
}
I want to get that document in 3 cases:
1/ User input is: "v" or "h" or "i", ...
2/ User input is: "Văn" or "văn" or "hiến", ...
3/ User input is: "va" or "van" or "van hi", ...
I can currently search for cases 1 and 2, but not case 3, where the user input doesn't have the tonal marks of the Vietnamese language.
This is my query (I'm using Python):
query = {
    "bool": {
        "should": [
            {
                "match": {
                    "name": name.lower()
                }
            },
            {
                "wildcard": {
                    "name": {
                        "value": f"*{name.lower()}*"
                    }
                }
            }
        ]
    }
}
Can anyone help me with this? Any help will be appreciated.
Use the lowercase token filter and a mapping character filter in your mapping.
The following mapping and query will work for all three use cases you mentioned. (Note that the char_filter below only maps the characters in your example; in practice you would extend the mappings to cover all Vietnamese diacritics, or use the asciifolding token filter instead.)
Mapping Example:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "my_mappings_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_mappings_char_filter": {
          "type": "mapping",
          "mappings": [
            "ă => a",
            "ế => e"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    },
    "max_ngram_diff": "9"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "facet": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
Example Query:
{
  "query": {
    "query_string": {
      "query": "van hi",
      "type": "best_fields",
      "default_field": "name"
    }
  }
}
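Since you're using Python, here's a sketch of how you might wire this up with elasticsearch-py; the index name my_index is an assumption, and the settings are the ones shown above (abbreviated to the parts that matter for matching "van hi" against "Văn Hiến"):

from elasticsearch import Elasticsearch

es = Elasticsearch()

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                    "filter": ["lowercase"],
                    "char_filter": ["my_mappings_char_filter"]
                }
            },
            "char_filter": {
                "my_mappings_char_filter": {
                    "type": "mapping",
                    "mappings": ["ă => a", "ế => e"]
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter"]
                }
            }
        },
        "max_ngram_diff": "9"
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "my_analyzer"}
        }
    }
}

es.indices.create(index="my_index", body=index_body)
es.index(index="my_index", id=1, body={"id": 1, "name": "Văn Hiến"}, refresh=True)

res = es.search(index="my_index", body={
    "query": {
        "query_string": {
            "query": "van hi",
            "type": "best_fields",
            "default_field": "name"
        }
    }
})
print(res["hits"]["hits"])  # should contain the "Văn Hiến" document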
I am currently using this to push a 'review' to my array of reviews in my perfumes collection:
mongo.db.perfumes.update(
    {"_id": perfume["_id"]},
    {
        "$push": {
            "reviews": {
                "_id": review_id,
                "review_content": form.review.data,
                "reviewer": current_user.username,
                "date_reviewed": datetime.utcnow(),
                "reviewer_picture": current_user.avatar,
            }
        }
    },
)
So as a result my document is:
[
  {
    "_id": {
      "$oid": "5ebf29dd1f3fe19434e41761"
    },
    "author": "Guillermo",
    "brand": "A test brand",
    "name": "A test perfume",
    "perfume_type": "Woody",
    "description": "<p>A test description</p>",
    "date_updated": {
      "$date": "2020-05-15T23:46:37.242Z"
    },
    "public": false,
    "picture": "generic.png",
    "reviews": [
      {
        "_id": {
          "$oid": "5ebf29e90000000000000000"
        },
        "review_content": "<p>A test review</p>",
        "reviewer": "Guillermo",
        "date_reviewed": {
          "$date": "2020-05-15T23:46:49.308Z"
        },
        "reviewer_picture": "a92de23ae01cdfde.jpg"
      }
    ]
  }
]
I want to create another route to update or edit the contents of my review (review_content).
What's the way to update that subarray in my collection?
Thank you!!
Let's assume you want to update the review_content of a particular review. You can use the query below; the positional operator $ updates the first array element that matches the query condition:
mongo.db.perfumes.update(
    {"_id": perfume["_id"], "reviews._id": review["_id"]},
    {"$set": {"reviews.$.review_content": "This is my new content"}},
)
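As a side note, Collection.update is deprecated in PyMongo (and removed in PyMongo 4), so a sketch of the same update with update_one, reusing the form field from your push code, would be:

mongo.db.perfumes.update_one(
    {"_id": perfume["_id"], "reviews._id": review["_id"]},
    {"$set": {"reviews.$.review_content": form.review.data}},  # the edited content
)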
I'm trying to find a way to use Elasticsearch to query a field that is both period- and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id   text          tag
====================================
1    some-text     A.B.c3
2    more. text    A.B-C.c4
3    even more.    B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account; I'm just including it for illustration):
A.B.c3          1
A.B-C.c4        2
B.A-32.D-24.f9  3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish all of this: match "A." at the beginning, differentiate between a period and a hyphen (and possibly boost on this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data, my searches return the results I would expect, but when querying against the full data-set, I get fewer hits.
This can be done with the N-gram tokenizer.
Based on what you've provided in the question, I've created the corresponding mapping, documents, and a sample query to give you what you're looking for.
Mapping
PUT idtesttag
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "id": {
          "type": "long"
        },
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        },
        "tag": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
What this does: if a document with id = 1 has the tag A.B, the following groups of characters are stored in its inverted index:
A. -> 1
.B -> 1
A.B -> 1
So if your query contains any of these three terms, the document with id = 1 is returned.
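You can check this yourself with the Analyze API; here's a quick sketch with elasticsearch-py, assuming the idtesttag index above has been created:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Run "A.B" through my_analyzer and print the tokens it emits.
res = es.indices.analyze(index="idtesttag", body={
    "analyzer": "my_analyzer",
    "text": "A.B"
})
print([t["token"] for t in res["tokens"]])  # expect ['A.', 'A.B', '.B'] (2- to 5-grams)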
Sample Documents
POST idtesttag/mydocs/1
{
  "id": 1,
  "text": "some-text",
  "tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
  "id": 2,
  "text": "more. text",
  "tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
  "id": 3,
  "text": "even more.",
  "tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
  "id": 4,
  "text": "even more.",
  "tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
  "query": {
    "match": {
      "tag": "A.B"
    }
  }
}
Query Response
{
  "took": 139,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 0.66078395,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 0.46659434,
        "_source": {
          "id": 4,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      }
    ]
  }
}
Note that documents 1, 2, and 4 are returned in the response. Document 4 is the mid-phrase match, while documents 1 and 2 match at the beginning.
Also note how the scores reflect this.
Boosting based on hyphen
Now, with regard to boosting based on the hyphen character, I'd suggest a bool query along with a regexp query for the boosting. Below is the sample query I came up with.
Note that, just for the sake of simplicity, I've written the regex so it only boosts when a hyphen comes right after A.B.
POST idtesttag/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "tag": "A.B" }
      },
      "should": [
        {
          "regexp": {
            "tag": {
              "value": "A.B-.*",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}
Boosting Query Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 3.660784,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 3.660784,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 3.4665942,
        "_source": {
          "id": 4,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      }
    ]
  }
}
Just ensure that your testing is thorough when it comes to boosting, because it's all about influencing the score; make sure you test with production data ingested into a DEV/TEST Elastic index.
That way you won't be spooked by totally different results when you move to the PROD Elastic cluster.
Sorry for the long answer, but I hope this helps!
Based on what you've described in your post regarding the 'tag' field, here are my 2 cents.
Your MySQL data should go into a single type (in 6.5 it's 'doc' by default). You do need to explicitly define your index mapping, though, especially on the 'tag' field, since you have search requirements.
I would define your 'tag' field as a multi-field of:
type 'keyword' for aggregations
type 'text' for searches, with a custom analyzer (one that might use the 'whitespace' tokenizer and an 'edge ngram' token filter), as sketched below
(If you don't need aggregations, just define a 'text' type field with the custom analyzer.)
FYI, the Analyze API will show you what ES does with your 'tag' data, and will help you define a mapping that meets your requirements.
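Here's a minimal sketch of what such a multi-field mapping could look like, created via elasticsearch-py; the index name tags_index, the analyzer/filter names, and the edge-gram sizes are all assumptions for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "settings": {
        "analysis": {
            "filter": {
                # hypothetical edge-ngram filter: "a.b-c.c4" -> "a.", "a.b", "a.b-", ...
                "tag_edge": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                "tag_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "tag_edge"]
                }
            }
        }
    },
    "mappings": {
        "doc": {  # single mapping type, as in 6.5
            "properties": {
                "tag": {
                    "type": "keyword",  # exact values, for aggregations
                    "fields": {
                        "search": {  # query tag.search for prefix-style matches
                            "type": "text",
                            "analyzer": "tag_analyzer"
                        }
                    }
                }
            }
        }
    }
}
es.indices.create(index="tags_index", body=body)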
I am using a match_phrase query to search in ES, but I have noticed that the results returned are not appropriate.
Code:
res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match_phrase": {
                            "content": "xyz abc"
                        }
                    }
                },
                size=500,
                scroll='60s')
It doesn't get me the records where the content is
"hi my name isxyz abc." and "hey wassupxyz abc. how is life"
Doing a similar search in MongoDB using a regex gets both records. Any help would be appreciated.
If you didn't specify an analyzer, then you are using standard by default. It does grammar-based tokenization, so the terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc], and match_phrase looks for the terms [xyz, abc] right next to each other (unless you specify slop).
You can either use a different analyzer or modify your query. If you use a match query, it will match on the term "abc". If you want the phrase to match, you'll need a different analyzer; n-grams should work for you.
Here's an example:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test_index/_doc/1
{
  "content": "hi my name isxyz abc."
}
PUT test_index/_doc/2
{
  "content": "hey wassupxyz abc. how is life"
}
POST test_index/_doc/_search
{
  "query": {
    "match_phrase": {
      "content": "xyz abc"
    }
  }
}
That results in finding both documents.
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "hey wassupxyz abc. how is life"
        }
      },
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "content": "hi my name isxyz abc."
        }
      }
    ]
  }
}
EDIT:
If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:
PUT test_index/_doc/3
{
  "content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}
And you can query it with wildcard:
POST test_index/_doc/_search
{
  "query": {
    "wildcard": {
      "content.keyword": {
        "value": "*Lasit Pant*"
      }
    }
  }
}
Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
You can also set the type parameter to phrase inside a match query (this is the older syntax that match_phrase replaces):
res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match": {
                            "content": {
                                "query": "xyz abc",
                                "type": "phrase"
                            }
                        }
                    }
                },
                size=500,
                scroll='60s')
I have this Python code where I first create an Elasticsearch mapping, and then after data is inserted I search for that data:
# Create data mapping
data_mapping = {
    "mappings": {
        (doc_type): {
            "properties": {
                "data_id": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "data": {
                    "type": "array",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "resp": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "update": {
                    "type": "integer",
                    "fields": {
                        "stemmed": {
                            "type": "integer",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
}
# Search
data_search = {
    "query": {
        "function_score": {
            "query": {
                "match": {
                    'data': question
                }
            },
            "field_value_factor": {
                "field": "update",
                "modifier": "log2p"
            }
        }
    }
}
response = es.search(index=doc_type, body=data_search)
Now, what I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping section I have written above.
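For reference, a minimal sketch of that suggestion: stopword removal is configured as a stop token filter inside settings/analysis when the index is created, and the mapping then references the resulting analyzer by name. The filter and analyzer names below are made up for illustration, and note that the built-in english analyzer you already use for the stemmed subfields removes English stopwords by default:

from elasticsearch import Elasticsearch

es = Elasticsearch()

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {
                    "type": "stop",
                    "stopwords": "_english_"  # or a custom list like ["a", "the"]
                }
            },
            "analyzer": {
                "english_no_stop": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop"]
                }
            }
        }
    },
    "mappings": {
        (doc_type): {
            "properties": {
                "data": {
                    "type": "string",  # "text" on ES 5+
                    "analyzer": "english_no_stop"  # applied at index and search time
                }
            }
        }
    }
}
es.indices.create(index=doc_type, body=index_body)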