Consider the following documents are in my elastic search . I want to group the documents based on rank, but any rank below 1000 must be displayed individually and anything above 1000 must be grouped how do I achieve this using composite aggregation, I am new and I am using composite because I want to use the after key function to allow pagination.
Documents
{
rank : 200,
name:abcd,
score1 :100,
score2:200
},
{
rank 300,
name:abcd,
score1:100,
score2:200
}
Expected Result:
{
key:{
rank:101
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
}
{
key:{
rank:1000-*
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
},
{
key:{
rank:300
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
}
######## QUery that I tried
{
"query":{"match_all":{}},
"aggs":{
"_scores":{
"composite"{
"sources":[
{"_rank":{"terms":{"field":"rank"}}}
]
}
},
"aggs":{
"_ranks":{
"field":"rank:[
{"to":1000},
{"from":1000}
]
}
"_score1": {"sum": {"field": "score1"}}
"_score2": {"sum": {"field": "score2"}}
}
}
}
From what I understand, you want to
Group the aggregations whose value is below 1000 rank to their own buckets
Group the aggregations whose value is 1000 and above to a single bucket with key 1000-*
And for each buckets, calculate the sum of _score1 of all buckets
Similarly calculate the sum of _score2 of all buckets
For this scenario, you can simply make use of Terms Aggregation as I've mentioned in below answer.
I've mentioned sample mapping, sample documents, query and response so that you'll have clarity on what's happening.
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"rank":{
"type": "integer"
},
"name":{
"type": "keyword"
},
"_score1": {
"type":"integer"
},
"_score2":{
"type": "integer"
}
}
}
}
Sample Documents:
POST my_sample_index/_doc/1
{
"rank": 100,
"name": "john",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/2
{
"rank": 1001, <--- Rank > 1000
"name": "constantine",
"_score1": 200,
"_score2": 200
}
POST my_sample_index/_doc/3
{
"rank": 200,
"name": "bruce",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/4
{
"rank": 2001, <--- Rank > 1000
"name": "arthur",
"_score1": 200,
"_score2": 200
}
Aggregation Query:
POST my_sample_index/_search
{
"size":0,
"aggs": {
"_score": {
"terms": {
"script": {
"source": """
if(doc['rank'].value < 1000){
return doc['rank'];
}else
return '1000-*';
"""
}
},
"aggs":{
"_score1_sum":{
"sum": {
"field": "_score1"
}
},
"_score2_sum":{
"sum":{
"field": "_score2"
}
}
}
}
}
}
Note that I've used Scripted Terms Aggregation where I've mentioned by logic in the script. Logic I believe is self-explainable once you go through it.
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1000-*", <---- Note this
"doc_count" : 2, <---- Note this
"_score2_sum" : {
"value" : 400.0
},
"_score1_sum" : {
"value" : 400.0
}
},
{
"key" : "100",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
},
{
"key" : "200",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
}
]
}
}
}
Note that there are two keys having rank > 1000, both of their scores for _score1 and _score2 sum to 400, which is what is expected.
Let me know if this helps!
Related
I want to retrieve data from elasticsearch based on timestamp. The timestamp is in epoch_millis and I tried to retrieve the data like this:
{
"query": {
"bool": {
"must":[
{
"range": {
"TimeStamp": {
"gte": "1632844180",
"lte": "1635436180"
}
}
}
]
}
},
"size": 10
}
But the response is this:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
How can I retrieve data for a given period of time from a certain index?
The data looks like this:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "zWpMNXcBTeKmGB84eksSD",
"_score" : 1.0,
"_source" : {
"Source" : "Market",
"Category" : "electronics",
"Value" : 20,
"Price" : 45.6468,
"Currency" : "EUR",
"TimeStamp" : 1611506922000 }
Also, the result has 10.000 hits when using the _search on the index. How could I access other entries? (more than 10.000 results) and to be able to choose the desired timestamp interval.
For your first question, assume that you have the mappings like this:
{
"mappings": {
"properties": {
"Source": {
"type": "keyword"
},
"Category": {
"type": "keyword"
},
"Value": {
"type": "integer"
},
"Price": {
"type": "float"
},
"Currency": {
"type": "keyword"
},
"TimeStamp": {
"type": "date"
}
}
}
}
Then I indexed 2 sample documents (1 is yours above, but the timestamp is definitely not in your range):
[{
"Source": "Market",
"Category": "electronics",
"Value": 30,
"Price": 55.6468,
"Currency": "EUR",
"TimeStamp": 1633844180000
},
{
"Source": "Market",
"Category": "electronics",
"Value": 20,
"Price": 45.6468,
"Currency": "EUR",
"TimeStamp": 1611506922000
}]
If you really need to query using the range above, you will first need to convert your TimeStamp field to seconds (/1000), then query based on that field:
{
"runtime_mappings": {
"secondTimeStamp": {
"type": "long",
"script": "emit(doc['TimeStamp'].value.millis/1000);"
}
},
"query": {
"bool": {
"must": [
{
"range": {
"secondTimeStamp": {
"gte": 1632844180,
"lte": 1635436180
}
}
}
]
}
},
"size": 10
}
Then you will get the first document.
About your second question, by default, Elasticsearch's max_result_window is only 10000. You can increase this limit by updating the settings, but it will increase the memory usage.
PUT /index/_settings
{
"index.max_result_window": 999999
}
You should use the search_after API instead.
I'm a total beginner in PyMongo. I'm trying to find activities that are registered multiple times. This code is returning an empty list. Could you please help me in finding the mistake:
rows = self.db.Activity.aggregate( [
{ '$group':{
"_id":
{
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{'$match':
{ "count": { '$gt': 1 } }
},
{'$project':
{"_id":0,
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
]
)
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
When you pass _id: 0 in the $project stage, it will not project the sub-objects even if they are projected in the follow up, since the rule is overwritten.
Try the below $project stage.
{
'$project': {
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
rows = self.db.Activity.aggregate( [
{
'$group':{
"_id": {
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{
'$match':{
"count": { '$gt': 1 }
}
},
{
'$project': {
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1,
}
}
])
Your group criteria is likely too narrow.
The $group stage will create a separate output document for each distinct value of the _id field. The pipeline in the question will only include two input documents in the same group if they have exactly the same value in all four of those fields.
In order for a count to be greater than 1, there must exist 2 documents with the same user, mode, and exactly the same start and end.
In the same data you show, there are no two documents that would be in the same group, so all of the output documents from the $group stage would have a count of 1, and therefore none of them satisfy the $match, and the return is an empty list.
I have elasticsearch configured for my django project.
Elasticsearch index has two fields user_id and address, my goal is to search a list of comma separated addresses on elasticsearch.
Example:
i have this list of addresses ["abc", "def","ghi","jkl","mno"] and i want to search them on elasticsearch in one hit, the result i'm expecting for the above list is ["abc", "def","ghi"] if these three addresses "abc", "def" and "ghi" (individually) exist on elasticsearch in address field.
Ingest data
POST test_foki/_doc
{
"user_id": 1,
"address": "abc"
}
POST test_foki/_doc
{
"user_id": 2,
"address": "def"
}
POST test_foki/_doc
{
"user_id": 3,
"address": "ghi"
}
If you want to do exact matches then you can use a terms query to filter up by an array of addresses.
Request
We use filter because we dont care about score on exact matches (it matches or not)
POST test_foki/_search
{
"query": {
"bool": {
"filter": [
{
"terms": {
"address.keyword": [
"abc",
"def",
"ghi",
"jkl",
"mno"
]
}
}
]
}
}
}
Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "YzkL4HcBv0SJscHMrZB8",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"address" : "abc"
}
},
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "ZDkL4HcBv0SJscHMsZAx",
"_score" : 1.0,
"_source" : {
"user_id" : 2,
"address" : "def"
}
},
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "ZTkL4HcBv0SJscHMtpAd",
"_score" : 1.0,
"_source" : {
"user_id" : 3,
"address" : "ghi"
}
}
]
}
}
If you want to do full text searches you will have to do a boolean query
POST test_foki/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"address": "abc"
}
},
{
"match": {
"address": "def"
}
},
{
"match": {
"address": "ghi"
}
},
{
"match": {
"address": "jkl"
}
},
{
"match": {
"address": "mno"
}
}
]
}
}
}
This produces the same Lucene query address:abc address:def address:ghi address:jkl address:mno
POST test_foki/_search
{
"query": {
"match": {
"address": "abc def ghi jkl mno"
}
}
}
I tried to aggregate over the inner-hits, but the code aggregates on all the docs. I need aggregation on each document inner-hits instead of global aggregation. How do I code it?
I have tried this code after the query
"aggs": {
"jobs": {
"nested": {
"path": "departures"
},
"aggs": {
"departures_only": {
"filter": {
"range" : {
"departures.eb_price": {
"gte" : 90000,
"lte" : 100000
}
}
},
"aggs": {
"start_price" : { "min" : { "field" : "departures.eb_price" } },
"start_date" : { "min" : { "field" : "departures.start_date"} }
}
}
}
}
}
The actual result is
"jobs" : {
"doc_count" : 55,
"departures_only" : {
"doc_count" : 55,
"start_price" : {
"value" : 93440.0
},
"start_date" : {
"value" : 1.5696288E12,
"value_as_string" : "28 Sep 2019"
}
}
}
}
But the actual result should show this aggregation for each document instead of the complete docs in the query results. Because the aggregations to be run on each individual document's inner-hits.
Any suggestions? I am just a beginner in elasticsearch
I want to count how many entries i have for each field in my elasticsearch DB for one index. I have tried with the code below, but this only returns the total number of entries. I'm working in Python.
What I have tried so far:
qry = {
"aggs": {
"field": {
"terms" : {"field": "field"}
}
}, "size": 0
}
r = es.search(body=qry,
index="webhose_english")
My current result:
Out[64]:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
'aggregations': {'field': {'buckets': [],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0}},
'hits': {'hits': [], 'max_score': 0.0, 'total': 4519134},
'timed_out': False,
'took': 16}
And I would ideally have something like:
{'field_1': 321,
'field_2': 231,
'field_3': 132}
This information used to be part of the _field_stats API, but it has been removed in 6.0. So you are on the right track, you will need an aggregation. I think value_count is the one you need and for good measure I've added global as well, so we know how many documents are there in total.
Three sample docs:
PUT foo/_doc/1
{
"foo": "bar"
}
PUT foo/_doc/2
{
"foo": "bar",
"bar": "bar"
}
PUT foo/_doc/3
{
"foo": "bar",
"bar": "bar",
"baz": "bar"
}
Aggregation (I'm not sure if there might be a shorter version of this especially with many fields):
GET foo/_search
{
"aggs": {
"count_fields": {
"global": {},
"aggs": {
"count_foo": {
"value_count": {
"field": "foo.keyword"
}
},
"count_bar": {
"value_count": {
"field": "bar.keyword"
}
},
"count_baz": {
"value_count": {
"field": "baz.keyword"
}
}
}
}
},
"size": 0
}
Result:
{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"count_fields" : {
"doc_count" : 3,
"count_foo" : {
"value" : 3
},
"count_bar" : {
"value" : 2
},
"count_baz" : {
"value" : 1
}
}
}
}
I did it by iterating over the following query and then collecting the "total" values in a dictionary:
qry = {
"query": {
"exists": {
"field": "fields_to_iterate"
}
}
}