Count all occurrences per field in one index


I want to count how many entries I have for each field in one index of my Elasticsearch DB. I have tried the code below, but it only returns the total number of entries. I'm working in Python.
What I have tried so far:
qry = {
    "aggs": {
        "field": {
            "terms": {"field": "field"}
        }
    },
    "size": 0
}
r = es.search(body=qry, index="webhose_english")
My current result:
Out[64]:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
'aggregations': {'field': {'buckets': [],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0}},
'hits': {'hits': [], 'max_score': 0.0, 'total': 4519134},
'timed_out': False,
'took': 16}
And I would ideally have something like:
{'field_1': 321,
'field_2': 231,
'field_3': 132}

This information used to be part of the _field_stats API, but that API was removed in 6.0. You are on the right track: you need an aggregation. I think value_count is the one you need, and for good measure I've added global as well, so we know how many documents there are in total.
Three sample docs:
PUT foo/_doc/1
{
"foo": "bar"
}
PUT foo/_doc/2
{
"foo": "bar",
"bar": "bar"
}
PUT foo/_doc/3
{
"foo": "bar",
"bar": "bar",
"baz": "bar"
}
Aggregation (I'm not sure if there might be a shorter version of this especially with many fields):
GET foo/_search
{
"aggs": {
"count_fields": {
"global": {},
"aggs": {
"count_foo": {
"value_count": {
"field": "foo.keyword"
}
},
"count_bar": {
"value_count": {
"field": "bar.keyword"
}
},
"count_baz": {
"value_count": {
"field": "baz.keyword"
}
}
}
}
},
"size": 0
}
Result:
{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"count_fields" : {
"doc_count" : 3,
"count_foo" : {
"value" : 3
},
"count_bar" : {
"value" : 2
},
"count_baz" : {
"value" : 1
}
}
}
}
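For many fields, the request body above can be generated instead of written by hand. The following is a minimal Python sketch, not part of the original answer: the helper names build_count_query and counts_from_response are made up for illustration, and the ".keyword" suffix assumes the default dynamic mapping for text fields used in the sample docs.

```python
def build_count_query(fields, suffix=".keyword"):
    """Build a global + value_count aggregation body for the given fields.

    `suffix` is an assumption: it matches the default dynamic mapping,
    where text fields get a `.keyword` sub-field.
    """
    return {
        "size": 0,
        "aggs": {
            "count_fields": {
                "global": {},
                "aggs": {
                    f"count_{name}": {"value_count": {"field": name + suffix}}
                    for name in fields
                },
            }
        },
    }


def counts_from_response(response):
    """Flatten the aggregation result into {field_name: count}."""
    agg = response["aggregations"]["count_fields"]
    return {
        key[len("count_"):]: value["value"]
        for key, value in agg.items()
        if key.startswith("count_")  # skips the bucket's own "doc_count"
    }
```

With the client call style used in the question, this could be run as counts_from_response(es.search(index="foo", body=build_count_query(["foo", "bar", "baz"]))).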

I did it by running the following query once per field (replacing "fields_to_iterate" with each field name) and then collecting the "total" values in a dictionary:
qry = {
"query": {
"exists": {
"field": "fields_to_iterate"
}
}
}
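The loop described above could look roughly like this. It is a sketch under assumptions: `search` is a hypothetical callable that takes a request body and returns the response dict (for example a thin wrapper around es.search for one index), and the helper names are invented for illustration.

```python
def total_hits(response):
    """Return the hit count from a search response.

    `hits.total` is a plain integer in 6.x responses and a dict with a
    "value" key in 7.x responses; both shapes are handled.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def count_docs_per_field(search, fields):
    """Run one `exists` query per field and collect the totals.

    `search` is a hypothetical callable: body dict -> response dict.
    """
    counts = {}
    for name in fields:
        body = {"size": 0, "query": {"exists": {"field": name}}}
        counts[name] = total_hits(search(body))
    return counts
```

A possible call, following the question's client usage, is count_docs_per_field(lambda body: es.search(index="webhose_english", body=body), ["field_1", "field_2"]). Note that on 7.x the reported total is capped at 10,000 by default unless track_total_hits is set in the request.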

Related

MongoDB Aggregation - Creating variable for $sum

Sample input:
{
"students":[
{
"name" : "John",
"semesters":[
{
"semester": "fall",
"grades": [
{"EXAM_1" : 25},
{"EXAM_2" : 45},
{"EXAM_3" : 22}
]
},
{
"semester": "winter",
"grades": [
{"EXAM_1" : 85},
{"EXAM_2" : 32},
{"EXAM_3" : 17}
]
}
]
},{
"name" : "Abraham",
"semesters":[
{
"semester": "fall",
"grades": [
{"EXAM_1" : 5},
{"EXAM_2" : 91},
{"EXAM_3" : 51}
]
},
{
"semester": "winter",
"grades": [
{"EXAM_1" : 55},
{"EXAM_2" : 62},
{"EXAM_3" : 17}
]
}
]
},{
"name" : "Zach",
"semesters":[
{
"semester": "spring",
"grades": [
{"EXAM_1" : 18},
{"EXAM_2" : 19},
{"EXAM_3" : 26}
]
},
{
"semester": "winter",
"grades": [
{"EXAM_1" : 100},
{"EXAM_2" : 94},
{"EXAM_3" : 45}
]
}
]
}
]
}
So this is what I have so far
data = await db.userstats.aggregate([
{ "$unwind": "$students.semesters" },
{ "$unwind": "$students.semesters.fall" },
{ "$unwind": f"$students.semesters.fall.grades" },
{
{ "$sum": [
{"$match" : { "$students.semesters.fall.grades" : "EXAM_3" } },
{"$multiply": [2, {"$match" : { "$students.semesters.fall.grades" : "EXAM_1" } }]}
]
}
},
{
"$project": {
"name" : "$name",
"character" : "$students.semesters.fall",
"exam_name" : "$students.semesters.fall.grades",
"exam_value" : "2*exam 1 + exam 3"
}
},
{ "$sort": { "exam_value": -1 }},
{ '$limit' : 30 }
]).to_list(length=None)
print(data)
I've been trying to implement a calculation performed on exam grades for each student in a data sample and comparing it to other students. I am stuck on how to properly perform the calculation. The basic rundown is that I need the output to be sorted calculations of
2*exam 1 + exam3.
I understand that $sum cannot be used in the pipeline stage, but I am unaware of how to use the $match command within the $sum operator.
Sample output:
{name: John, calculated_exam_grade: 202, 'semester':'winter'},
{name: Abraham, calculated_exam_grade: 101, 'semester':'fall'},
{name: John, calculated_exam_grade: 95, 'semester':'fall'},
etc...

Based on the expected result, the query is very similar to the one in the link I posted in the comment.
$unwind - Deconstruct the students array.
$unwind - Deconstruct the students.semesters array.
$project - Decorate the output documents with the calculation for the calculated_exam_grade field.
$sort - Order by calculated_exam_grade, descending.
$limit - Keep the first 30 documents.
db.collection.aggregate([
{
"$unwind": "$students"
},
{
"$unwind": "$students.semesters"
},
{
"$project": {
_id: 0,
"name": "$students.name",
"semester": "$students.semesters.semester",
"calculated_exam_grade": {
$sum: [
{
"$multiply": [
2,
{
$sum: [
"$students.semesters.grades.EXAM_1"
]
}
]
},
{
$sum: [
"$students.semesters.grades.EXAM_3"
]
}
]
}
}
},
{
"$sort": {
"calculated_exam_grade": -1
}
},
{
"$limit": 30
}
])
Sample Mongo Playground
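To sanity-check what the pipeline computes, the same steps can be mirrored in plain Python on one input document. This is only an illustrative sketch (the function name calculated_grades is made up); it does not talk to MongoDB. Summing over the grades list mirrors how $sum treats the array produced by the path "$students.semesters.grades.EXAM_1".

```python
def calculated_grades(doc, limit=30):
    """Mirror the pipeline: unwind students and semesters, project, sort, limit."""
    rows = []
    for student in doc["students"]:             # $unwind "$students"
        for semester in student["semesters"]:   # $unwind "$students.semesters"
            grades = semester["grades"]
            # Each grade object holds a single exam key, so summing with a
            # default of 0 picks out the one matching value.
            exam_1 = sum(g.get("EXAM_1", 0) for g in grades)
            exam_3 = sum(g.get("EXAM_3", 0) for g in grades)
            rows.append({                        # $project
                "name": student["name"],
                "semester": semester["semester"],
                "calculated_exam_grade": 2 * exam_1 + exam_3,
            })
    rows.sort(key=lambda r: r["calculated_exam_grade"], reverse=True)  # $sort
    return rows[:limit]                          # $limit
```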

Querying Elasticsearch Index from Python returns 0 hits but thousands from Kibana

I am trying to query an Elasticsearch index from Jupyter using Python. This is my code:
from elasticsearch import Elasticsearch
esHost="http://myhost.com:9600"
esUser="myuser"
esPass="mypass"
es = Elasticsearch(hosts=esHost, http_auth=(esUser, esPass))
query = {
"query": {
"bool": {
"must": {
"exists": {
"field": "pkey"
}
}
}
}
}
res = es.search(index="myindex", doc_type="articles", body=query)
res
The result I am getting is:
{'took': 1,
'timed_out': False,
'_shards': {'total': 6, 'successful': 6, 'skipped': 0, 'failed': 0},
'hits': {'total': 0, 'max_score': None, 'hits': []}}
However, when I run the same query in Kibana:
GET /myindex/_search?
{
"query": {
"bool": {
"must": {
"exists": {
"field": "pkey"
}
}
}
}
}
I am getting thousands of results:
{
"took" : 67,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 6,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 12605121,
"max_score" : 1.0,
"hits" : [
{
...
What am I missing?

Just remove doc_type from your Python code, since that seems to be the only difference:
res = es.search(index="myindex", body=query)  # doc_type="articles" removed

Search a list of comma separated strings on elasticsearch

I have Elasticsearch configured for my Django project.
The Elasticsearch index has two fields, user_id and address. My goal is to search for a list of comma-separated addresses in Elasticsearch.
Example:
I have this list of addresses ["abc", "def","ghi","jkl","mno"] and I want to search for all of them in one request. The result I'm expecting for the above list is ["abc", "def","ghi"] if those three addresses "abc", "def" and "ghi" each exist in the address field.

Ingest data
POST test_foki/_doc
{
"user_id": 1,
"address": "abc"
}
POST test_foki/_doc
{
"user_id": 2,
"address": "def"
}
POST test_foki/_doc
{
"user_id": 3,
"address": "ghi"
}
If you want to do exact matches, you can use a terms query to filter by an array of addresses.
Request
We use filter because we don't care about the score on exact matches (a document either matches or it doesn't).
POST test_foki/_search
{
"query": {
"bool": {
"filter": [
{
"terms": {
"address.keyword": [
"abc",
"def",
"ghi",
"jkl",
"mno"
]
}
}
]
}
}
}
Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "YzkL4HcBv0SJscHMrZB8",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"address" : "abc"
}
},
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "ZDkL4HcBv0SJscHMsZAx",
"_score" : 1.0,
"_source" : {
"user_id" : 2,
"address" : "def"
}
},
{
"_index" : "test_foki",
"_type" : "_doc",
"_id" : "ZTkL4HcBv0SJscHMtpAd",
"_score" : 1.0,
"_source" : {
"user_id" : 3,
"address" : "ghi"
}
}
]
}
}
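From Python, the terms request above can be built directly from the list of addresses, and the matching addresses can be read back from the hits. This is a minimal sketch with made-up helper names (build_terms_query, matched_addresses); it assumes the address.keyword sub-field used in the request above.

```python
def build_terms_query(addresses, field="address.keyword"):
    """Build the bool/filter/terms request body for a list of addresses."""
    return {
        "query": {
            "bool": {
                "filter": [{"terms": {field: list(addresses)}}]
            }
        }
    }


def matched_addresses(response, wanted):
    """Return the wanted addresses that appear in the response hits,
    keeping the order of the input list."""
    found = {hit["_source"]["address"] for hit in response["hits"]["hits"]}
    return [address for address in wanted if address in found]
```

A search returns only 10 hits by default, so for longer lists a "size" value would need to be added to the body.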
If you want to do full-text searches, you will have to use a bool query:
POST test_foki/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"address": "abc"
}
},
{
"match": {
"address": "def"
}
},
{
"match": {
"address": "ghi"
}
},
{
"match": {
"address": "jkl"
}
},
{
"match": {
"address": "mno"
}
}
]
}
}
}
The following match query produces the same Lucene query (address:abc address:def address:ghi address:jkl address:mno):
POST test_foki/_search
{
"query": {
"match": {
"address": "abc def ghi jkl mno"
}
}
}
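Both full-text variants can also be generated from a Python list. The helper names below are hypothetical, and joining the addresses with spaces only mirrors the bool query when each address analyzes to a single token, as in the example values.

```python
def build_should_query(addresses, field="address"):
    """One match clause per address inside bool/should."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: address}} for address in addresses]
            }
        }
    }


def build_match_query(addresses, field="address"):
    """Single match query over the space-joined addresses.

    Assumption: each address is a single token after analysis.
    """
    return {"query": {"match": {field: " ".join(addresses)}}}
```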

Elastic Search composite grouping with range

Consider the following documents in my Elasticsearch index. I want to group the documents by rank: any rank below 1000 must be displayed individually, and anything at or above 1000 must be grouped into one bucket. How do I achieve this with a composite aggregation? I am new to this, and I am using composite because I want to use the after key for pagination.
Documents
{
rank : 200,
name:abcd,
score1 :100,
score2:200
},
{
rank 300,
name:abcd,
score1:100,
score2:200
}
Expected Result:
{
key:{
rank:101
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
}
{
key:{
rank:1000-*
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
},
{
key:{
rank:300
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
}
Query that I tried:
{
"query":{"match_all":{}},
"aggs":{
"_scores":{
"composite"{
"sources":[
{"_rank":{"terms":{"field":"rank"}}}
]
}
},
"aggs":{
"_ranks":{
"field":"rank:[
{"to":1000},
{"from":1000}
]
}
"_score1": {"sum": {"field": "score1"}}
"_score2": {"sum": {"field": "score2"}}
}
}
}

From what I understand, you want to:
Put each rank below 1000 into its own bucket
Put all ranks of 1000 and above into a single bucket with key 1000-*
For each bucket, calculate the sum of _score1
Similarly, for each bucket, calculate the sum of _score2
For this scenario, you can simply use a Terms Aggregation, as shown in the answer below.
I've included a sample mapping, sample documents, the query and the response so that it's clear what's happening.
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"rank":{
"type": "integer"
},
"name":{
"type": "keyword"
},
"_score1": {
"type":"integer"
},
"_score2":{
"type": "integer"
}
}
}
}
Sample Documents:
POST my_sample_index/_doc/1
{
"rank": 100,
"name": "john",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/2
{
"rank": 1001, <--- Rank > 1000
"name": "constantine",
"_score1": 200,
"_score2": 200
}
POST my_sample_index/_doc/3
{
"rank": 200,
"name": "bruce",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/4
{
"rank": 2001, <--- Rank > 1000
"name": "arthur",
"_score1": 200,
"_score2": 200
}
Aggregation Query:
POST my_sample_index/_search
{
"size":0,
"aggs": {
"_score": {
"terms": {
"script": {
"source": """
if(doc['rank'].value < 1000){
return doc['rank'];
}else
return '1000-*';
"""
}
},
"aggs":{
"_score1_sum":{
"sum": {
"field": "_score1"
}
},
"_score2_sum":{
"sum":{
"field": "_score2"
}
}
}
}
}
}
Note that I've used a Scripted Terms Aggregation, with my logic in the script. I believe the logic is self-explanatory once you go through it.
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1000-*", <---- Note this
"doc_count" : 2, <---- Note this
"_score2_sum" : {
"value" : 400.0
},
"_score1_sum" : {
"value" : 400.0
}
},
{
"key" : "100",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
},
{
"key" : "200",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
}
]
}
}
}
Note that there are two documents with rank > 1000. Their _score1 values sum to 400 and their _score2 values sum to 400, which is what is expected.
Let me know if this helps!
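The bucketing rule from the script can be reproduced in plain Python to check the expected sums on the four sample documents. This is only an illustration of the logic with made-up function names; it does not call Elasticsearch.

```python
from collections import defaultdict


def bucket_key(rank, threshold=1000):
    """Same rule as the script: own bucket below the threshold, shared one otherwise."""
    return str(rank) if rank < threshold else f"{threshold}-*"


def aggregate_by_rank(docs, threshold=1000):
    """Count documents and sum both scores per bucket key."""
    buckets = defaultdict(
        lambda: {"doc_count": 0, "_score1_sum": 0, "_score2_sum": 0}
    )
    for doc in docs:
        bucket = buckets[bucket_key(doc["rank"], threshold)]
        bucket["doc_count"] += 1
        bucket["_score1_sum"] += doc["_score1"]
        bucket["_score2_sum"] += doc["_score2"]
    return dict(buckets)
```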

Elasticsearch query - print certain field based on other field

My goal is to find the documents with the maximum value in one field and print another field from those documents.
My query so far:
{
"fields": ["text"], //NOT WORKING
"query": {
"query_string": {
"query": "_type:bmw AND _exists_:car_type",
"analyze_wildcard": True
}
},
"size": 0,
"aggs": {
"2": {
"terms": {
"field": "compound",
"size": 5,
"order": {
"2-orderAgg": "desc"
}
},
"aggs": {
"2-orderAgg": {
"max": {
"field": "compound"
}
}
}
}
}
}
Result is
'buckets': [{'doc_count': 1, '2-orderAgg': {'value': 0.8442}, 'key': 0.8442}, {'doc_count': 1, '2-orderAgg': {'value': 0.7777}, 'key': 0.7777}, {'doc_count': 1, '2-orderAgg': {'value': 0.7579}, 'key': 0.7579}, {'doc_count': 1, '2-orderAgg': {'value': 0.6476}, 'key': 0.6476}, {'doc_count': 1, '2-orderAgg': {'value': 0.6369}, 'key': 0.6369}]
Now I need to print the text field of the document that contains the compound value 0.8442, and so on for the other values. Thank you for your advice.

I achieved this with a small workaround. It's not pretty, but in the end I got what I wanted.
First I took the response from the first query. Then I grabbed all the keys from its buckets and ran a new query for each key to find the id of the matching document.
{
"size": 0,
"query": {
"query_string": {
"analyze_wildcard": True,
"query": "_type:bmw AND compound:"+str(0.8442)+" AND _exists_:car_type"
}
},
"aggs": {
"3": {
"terms": {
"field": "id_str",
"size": 20,
"order": {
"_count": "desc"
}
}
}
}
}
Then I iterate through the response and search for the document by this id field:
# res1 is assumed to hold the buckets of aggregation "3" from the previous response
for y in res1:
    res3 = es.search(index='indexname', body={
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "id_str": y['key']
                        }
                    }
                ]
            }
        }
    })
    for x in res3['hits']['hits']:
        print(x['_source']['text'])
The result is now:
Diamond stitch leather is a great addition to any custom vehicle. Prices start from 2k! #bmw i8 getting under car...
which is the text I wanted.
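A possibly simpler alternative, which is not what the answer above does, is to sort the search on compound and limit the returned source fields, so a single request returns the text. The sketch below assumes compound is a sortable numeric field and reuses the query string from the question; the helper names are made up. Unlike the terms aggregation, this returns the top documents rather than the top distinct compound values.

```python
def build_top_compound_query(size=5):
    """Request the `size` documents with the highest compound value,
    returning only the text and compound fields."""
    return {
        "size": size,
        "_source": ["text", "compound"],
        "query": {
            "query_string": {
                "query": "_type:bmw AND _exists_:car_type",
                "analyze_wildcard": True,
            }
        },
        "sort": [{"compound": {"order": "desc"}}],
    }


def texts_from_response(response):
    """Extract the text field from each hit, in response order."""
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```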
