I've been trying to figure out a way to paginate the results of a terms aggregation in Elasticsearch and so far I have not been able to achieve the desired result.
Here's the problem I am trying to solve. In my index, I have a bunch of documents that have a score (separate from the ES _score) that is calculated based on the values of the other fields in the document. Each document "belongs" to a customer, referenced by the customer_id field. The document also has an id, referenced by the doc_id field, which is the same as the ES meta-field _id. Here is an example:
{
    "_id": "1",
    "doc_id": "1",
    "doc_score": "85",
    "customer_id": "123"
}
For each customer_id there are multiple documents, all with different document ids and different scores. What I want to be able to do is, given a list of customer ids, return the top document for each customer_id (only 1 per customer) and be able to paginate those results, similar to the size/from mechanism in the regular ES search API. The field that I want to use for the document score is the doc_score field.
So far in my current Python script, what I've tried is a nested aggregation with a top_hits aggregation to get only the top document for each customer:
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "match_all": {}
                },
                {
                    "terms": {
                        "customer_id": customer_ids  # a list of the customer ids I want documents for
                    }
                },
                {
                    "exists": {
                        "field": "doc_score"  # sometimes it's possible a document does not have a score
                    }
                }
            ]
        }
    },
    "aggs": {
        "customers": {
            "terms": {
                "field": "customer_id",
                "min_doc_count": 1
            },
            "aggs": {
                "top_documents": {
                    "top_hits": {
                        "sort": [
                            {"doc_score": {"order": "desc"}}
                        ],
                        "size": 1
                    }
                }
            }
        }
    }
}
I then "paginate" by going through each customer bucket, appending the top document blob to a list and then sorting the list based on the value of the score field and finally taking a slice documents_list[from:from+size].
The issue with this is that, say I have 500 customers in the list but I only want the 2nd 20 documents, i.e. size=20, from=20: each time I call the function I first have to fetch the top document for all 500 customers and only then slice. This is very inefficient and a speed problem, since I need that function to be as fast as I can possibly make it.
Ideally, I could just get the 2nd 20 directly from ES without having to do any slicing in my function.
I have looked into the Composite aggregations that ES offers, but it looks to me like I would not be able to use them in my case, since I need to get the entire doc, i.e. everything in the _source field in the regular search API response.
I would greatly appreciate any suggestions.
The best way to do this would be to use partitions.
According to the documentation:
GET /_search
{
"size": 0,
"aggs": {
"expired_sessions": {
"terms": {
"field": "account_id",
"include": {
"partition": 1,
"num_partitions": 25
},
"size": 20,
"order": {
"last_access": "asc"
}
},
"aggs": {
"last_access": {
"max": {
"field": "access_date"
}
}
}
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
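Adapted to the question, a sketch might look like this (assuming an Elasticsearch client es and roughly 500 customers paged 20 at a time; note that partitions split the terms by hash, not by score, so each partition is an arbitrary slice of customers rather than the next 20 by doc_score):

# Sketch adapting partitions to the question's case; customer_id and
# doc_score come from the question, page_number and num_partitions are
# assumptions (~500 customers / 20 per page).
body = {
    "size": 0,
    "query": {
        "terms": {"customer_id": customer_ids}
    },
    "aggs": {
        "customers": {
            "terms": {
                "field": "customer_id",
                "include": {
                    "partition": page_number,  # 0-based "page" of customers
                    "num_partitions": 25
                },
                "size": 20
            },
            "aggs": {
                "top_documents": {
                    "top_hits": {
                        "sort": [{"doc_score": {"order": "desc"}}],
                        "size": 1
                    }
                }
            }
        }
    }
}
res = es.search(index="my_index", body=body)  # "my_index" is a placeholder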
I have this task: "Write a MongoDB query to find the count of movies released after the year 1999". I'm trying different lines of code (shown below), but none of them works. Any thoughts?
PS: the collection's name is movies, and the fields are the year and _id of the movies.
These are the lines I'm trying:
docs = db.movies.find({"year":{"$gt":"total"("1999")}}).count()
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"$1999"}}}])
docs = db.movies.count( {"year": { "$gt": "moviecount"("1999") } } )
docs = db.movies.find({"year":{"$gt":"1999"}})
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"1999"}}}])
You can do it with an aggregation pipeline:
[
    {
        "$match": {
            "year": {
                "$gt": "1999"
            }
        }
    },
    {
        "$group": {
            "_id": 1,
            "count": {
                "$sum": 1
            }
        }
    }
]
The first stage of the pipeline is $match; it filters to only those documents with a year greater than 1999. (If year is stored as a number rather than a string, use 1999 without quotes, since a string comparison will not match numeric values.)
Then in the $group stage we count the matching documents by adding 1 for each with "$sum".
The "_id": 1 is a dummy value because we are not grouping on any particular field; we just want a single overall count.
I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation.
I use the Python client running in containers to index and perform the search.
My corpus is a JSONL file a bit like this:
{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...
I create an Elasticsearch index successfully, using custom settings and a custom analyzer, taking inspiration from the official examples and the MinHash docs:
def create_index(client):
client.indices.create(
index="documents",
body={
"settings": {
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 5,
"max_shingle_size": 5,
"output_unigrams": False
},
"my_minhash_filter": {
"type": "min_hash",
"hash_count": 10,
"bucket_count": 512,
"hash_set_size": 1,
"with_rotation": True
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"my_shingle_filter",
"my_minhash_filter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {"type": "text", "analyzer": "my_analyzer"}
}
},
},
ignore=400,
)
I verify via Kibana that index creation had no big problems, and by visiting http://localhost:9200/documents/_settings I get settings that seem in order.
However, querying the index with:
def get_duplicate_documents(body, K, es):
doc = {
'_source': ['_id', 'body'],
'size': K,
'query': {
"match": {
"body": {
"query": body,
"analyzer" : "my_analyzer"
}
}
}
}
res = es.search(index='documents', body=doc)
top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]
My res['hits'] is consistently empty, even if I set body to match exactly the text of one of the entries in my corpus. In other words, I don't get any results if I try, as values for body, e.g.
"I come up here for perception and clarity"
or substrings like
"I come up here for perception"
while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.
Is there something wrong in my query and/or way I index Elasticsearch? Am I missing something else entirely?
P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)
Here are some things that you should double-check, as they are likely culprits:
When you create your mapping, you should change "name" to "text" in the body param of your client.indices.create method, because your JSON documents have a field called text:
"mappings": {
"properties": {
"text": {"type": "text", "analyzer": "my_analyzer"}
}
In the indexing phase, you could also rework your generate_actions() method following the documentation, with something like:
for elem in corpus:
    yield {
        "_op_type": "index",
        "_index": "documents",
        "_id": elem["id"],
        "_source": {"text": elem["text"]}
    }
Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.
Also, according to your mapping, you are using a min_hash token filter, so Lucene will transform the text inside the text field into hashes. You can therefore query against this field with a hash, not with a plain string as you did in your example ("I come up here for perception and clarity").
So the best way to use it is to retrieve the content of the text field and then query Elasticsearch for that same retrieved value. Also, the _id metafield is not inside the _source metafield, so you should change your get_duplicate_documents() method to:
def get_duplicate_documents(body, K, es):
doc = {
'_source': ['text'],
'size': K,
'query': {
"match": {
"text": { # I changed this line!
"query": body
}
}
}
}
res = es.search(index='documents', body=doc)
# also changed the list comprehension!
top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
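For example, a hypothetical round trip (the id "1" below is a placeholder) is to fetch a stored document's text and then search for near-duplicates of that exact value:

# Hypothetical round trip: fetch a stored document's text, then search
# for near-duplicates of that exact value ("1" is a placeholder id).
doc = es.get(index="documents", id="1")
matches = get_duplicate_documents(doc["_source"]["text"], 10, es)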
I am trying to do an aggregation and then sort the best-scored result to the top. Even though the aggregation succeeds, its buckets are not sorted by score. How can I make this possible?
"aggs": {
"group": {
"terms": {
"field": "description.keyword",
"order": {
"description_score": "desc"
}
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1
}
}
}
},
}
As @Lupanoide mentioned, scores are not calculated in aggregations, so you cannot sort your buckets by document score. Instead, field collapsing can be used:
Allows to collapse search results based on field values. The
collapsing is done by selecting only the top sorted document per
collapse key
{
    "query": {
        "match_all": {}
    },
    "collapse": {
        "field": "description.keyword"   <-- group by on the given field
    }
}
By default, search results are sorted by _score, so you don't need to add a "sort" clause.
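Since collapse runs inside the regular search API, from/size pagination works directly, unlike with terms buckets. A sketch (the index name is a placeholder):

# Sketch: collapse supports regular from/size pagination
# ("my_index" is a placeholder for your index name).
body = {
    "query": {"match_all": {}},
    "collapse": {"field": "description.keyword"},
    "from": 20,  # pagination works as in a regular search
    "size": 20
}
res = es.search(index="my_index", body=body)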
I'm using Elasticsearch to filter one document, and I use a loop to filter many documents. But now I want to filter many documents in one request to optimize my script.
For the moment I have this query, and I'm using a "for" loop to filter by uuid.
for id in id_list:
    filter(id)
def filter(id):
result = requests.get(
settings + '/data/_search?size=10000',
json={
"query": {
"bool": {
"filter": {
"terms": {
"id": id
}
}
}
},
"_source": {
"exclude": ["type", "date"]
}
}
)
I would like to do only one request to get all my document at once to optimize my code.
The terms query takes an array of values; see the reference for an example.
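For instance, a single request along these lines should replace the loop (a sketch based on the question's code; settings is the base-URL variable from the question):

import requests

# One request with the whole list instead of one request per id;
# `settings` is the base-URL variable from the question.
def filter_many(id_list):
    return requests.get(
        settings + '/data/_search?size=10000',
        json={
            "query": {
                "bool": {
                    "filter": {
                        "terms": {
                            "id": id_list  # terms accepts an array of values
                        }
                    }
                }
            },
            "_source": {
                "exclude": ["type", "date"]
            }
        }
    )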
To provide some context:
I want to write a bulk update query (possibly affecting 0.5-1M docs). The update would be in the aspects field (shown below), whose values are mostly duplicated.
My thinking was that if I normalised it into another entity (aspect_label), the number of docs updated would be reduced drastically (say 500-1000 max).
Query: I want to find out if there is a way to get linked documents via id in Elasticsearch.
E.g. if I have documents in index my_db according to the mapping below.
Just to point out: processed_reviews is a child of aspect_label.
{
"my_db":{
"mappings":{
"processed_reviews":{
"_all":{
"enabled":false
},
"_parent":{
"type":"aspect_label"
},
"_routing":{
"required":true
},
"properties":{
"data":{
"properties":{
"insights":{
"type":"nested",
"properties":{
"aspects":{
"type":"nested",
"properties":{
"aspect_label_id":{
"type":"keyword"
},
"aspect_term_frequency":{
"type":"long"
}
}
}
}
},
"preprocessed_text":{
"type":"text"
},
"preprocessed_title":{
"type":"text"
}
}
}
}
}
}
}
}
And another entity, aspect_label:
{
"my_db": {
"mappings": {
"aspect_label": {
"_all": {
"enabled": false
},
"properties": {
"aspect": {
"type": "keyword"
},
"aspect_label_new": {
"type": "keyword"
},
"aspect_label_old": {
"type": "text"
}
}
}
}
}
}
Now, I want to write a search query on the processed_reviews type such that the aspect_label_id field is replaced with the value of aspect_label_new (or with the entire doc) from the aspect_label document matching that id.
{
"_index":"my_db",
"_type":"processed_reviews",
"_id":"191b3bff-4915-4404-a05a-10e6bd2b19d4",
"_score":1,
"_routing":"5",
"_parent":"5",
"_source":{
"data":{
"preprocessed_text":"Good product I really like so comfortable and so light wait and looks good",
"preprocessed_title":"Good choice",
"insights":[
{
"aspects":[
{
"aspect_label":"color",
"aspect_term_frequency":1
}
]
}
]
}
}
}
Also, if there is a better way to approach this problem, or if something is wrong with my approach, or if this is not possible at all, please let me know.