How could I improve my function so that it inserts the id column of my dataframe as the "_id" of each Elasticsearch document, in order to handle duplicates?
Dataframe structure
print(df.info())
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             412 non-null    object
 1   email_address  412 non-null    object
 2   first_name     412 non-null    object
 3   last_name      412 non-null    object
The function to convert the dataframe to an Elasticsearch-compatible bulk format:
def to_elastic_json(df, index_name):
    import json
    for record in df.to_dict(orient="records"):
        yield '{ "index" : { "_index" : "%s"}}' % index_name
        yield json.dumps(record, default=str)

es_response = elastic_client.bulk(to_elastic_json(df, INDEX_name))
EDIT
Yes, ES will update the doc with a new _version number if you ingest a doc with an already existing _id.
Here's how to do it:
def to_elastic_json(df, index_name):
    import json
    for record in df.to_dict(orient="records"):
        yield '{ "index" : { "_index" : "%s", "_id": "%s"}}' % (index_name, str(record['id']))
        yield json.dumps(record, default=str)
Verify by calling
GET INDEX_NAME/_search?version=true
and looking for the _version attribute.
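Alternatively, if you'd rather not build the NDJSON action lines by hand, the bulk helper in the Python client accepts plain action dicts. A minimal sketch, assuming the same df, elastic_client, and INDEX_name as above:

from elasticsearch.helpers import bulk

def df_to_actions(df, index_name):
    # Each action dict carries the metadata that the NDJSON header line
    # encoded above; "_id" reuses the dataframe's own id, so re-ingesting
    # the same row updates the existing document instead of duplicating it.
    for record in df.to_dict(orient="records"):
        yield {
            "_index": index_name,
            "_id": str(record["id"]),
            "_source": record,
        }

success_count, errors = bulk(elastic_client, df_to_actions(df, INDEX_name))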
ORIGINAL
Why not let ES auto-generate the _id while you keep your own id separate? That way, you can write a script to find docs with the same id and only keep the 'correct' ones.
E.g.:
Two dupes and one unique:
POST df/_doc
{
  "doc_id": 0,
  "email_addr": "e#f.com",
  "timestamp": 10
}
POST df/_doc
{
  "doc_id": 0,
  "email_addr": "a#b.com",
  "timestamp": 100
}
POST df/_doc
{
  "doc_id": 1,
  "email_addr": "a#e.com"
}
Then find the duplicated ids and, arbitrarily, keep just the 'most recent' doc for each:
GET df/_search
{
  "size": 0,
  "aggs": {
    "scripted_terms": {
      "terms": {
        "size": 1000,
        "field": "doc_id",
        "min_doc_count": 2
      },
      "aggs": {
        "top_hits_agg": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
yielding
...
"aggregations" : {
"scripted_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"doc_count" : 2,
"top_hits_agg" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "df",
"_type" : "_doc",
"_id" : "Ev635HEBW-D5QnrWDjzH",
"_score" : null,
"_source" : {
"doc_id" : 0,
"email_addr" : "a#b.com",
"timestamp" : 100
},
"sort" : [
100
]
}
]
}
}
}
]
}
}
}
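If you go this route, you still need a cleanup step. Here's a sketch of one, assuming the df index and the doc_id/timestamp fields from the example above; it reruns the aggregation through the Python client, keeps the newest doc in each duplicate bucket, and deletes the rest (the top_hits size of 100 is an assumed upper bound on duplicates per id):

# Sketch only: find duplicate doc_ids, keep the newest doc per bucket,
# delete the older ones.
resp = elastic_client.search(
    index="df",
    body={
        "size": 0,
        "aggs": {
            "dupe_ids": {
                "terms": {"field": "doc_id", "size": 1000, "min_doc_count": 2},
                "aggs": {
                    "newest_first": {
                        "top_hits": {
                            "size": 100,
                            "sort": [{"timestamp": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    },
)

for bucket in resp["aggregations"]["dupe_ids"]["buckets"]:
    hits = bucket["newest_first"]["hits"]["hits"]
    # hits[0] is the newest doc; everything after it is a stale duplicate
    for hit in hits[1:]:
        elastic_client.delete(index="df", id=hit["_id"])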
I'm a total beginner with PyMongo. I'm trying to find activities that are registered multiple times. This code returns an empty list. Could you please help me find the mistake?
rows = self.db.Activity.aggregate([
    {'$group': {
        "_id": {
            "user_id": "$user_id",
            "transportation_mode": "$transportation_mode",
            "start_date_time": "$start_date_time",
            "end_date_time": "$end_date_time"
        },
        "count": {'$sum': 1}
    }},
    {'$match':
        {"count": {'$gt': 1}}
    },
    {'$project':
        {"_id": 0,
         "user_id": "_id.user_id",
         "transportation_mode": "_id.transportation_mode",
         "start_date_time": "_id.start_date_time",
         "end_date_time": "_id.end_date_time",
         "count": 1
        }
    }
])
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
When you pass _id: 0 in the $project stage, the sub-objects of _id will not be projected even if they are referenced afterwards, since the exclusion rule overrides them.
Try the below $project stage. Note that field references on the right-hand side need the $ prefix; without it they are treated as string literals rather than paths:
{
    '$project': {
        "user_id": "$_id.user_id",
        "transportation_mode": "$_id.transportation_mode",
        "start_date_time": "$_id.start_date_time",
        "end_date_time": "$_id.end_date_time",
        "count": 1
    }
}
rows = self.db.Activity.aggregate([
    {
        '$group': {
            "_id": {
                "user_id": "$user_id",
                "transportation_mode": "$transportation_mode",
                "start_date_time": "$start_date_time",
                "end_date_time": "$end_date_time"
            },
            "count": {'$sum': 1}
        }
    },
    {
        '$match': {
            "count": {'$gt': 1}
        }
    },
    {
        '$project': {
            "user_id": "$_id.user_id",
            "transportation_mode": "$_id.transportation_mode",
            "start_date_time": "$_id.start_date_time",
            "end_date_time": "$_id.end_date_time",
            "count": 1
        }
    }
])
Your group criteria are likely too narrow.
The $group stage will create a separate output document for each distinct value of the _id field. The pipeline in the question will only include two input documents in the same group if they have exactly the same value in all four of those fields.
In order for a count to be greater than 1, there must exist 2 documents with the same user, mode, and exactly the same start and end.
In the sample data you show, there are no two documents that would fall into the same group, so all of the output documents from the $group stage have a count of 1; none of them satisfy the $match, and the result is an empty list.
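For illustration, a hedged PyMongo sketch of a looser grouping, assuming that a duplicate is defined by user and transportation mode alone (swap the _id keys for whatever actually defines a duplicate in your data):

# Group on fewer fields so near-identical activities can share a bucket;
# the choice of grouping keys here is an assumption.
rows = self.db.Activity.aggregate([
    {"$group": {
        "_id": {
            "user_id": "$user_id",
            "transportation_mode": "$transportation_mode",
        },
        "count": {"$sum": 1},
    }},
    {"$match": {"count": {"$gt": 1}}},
    {"$project": {
        "user_id": "$_id.user_id",
        "transportation_mode": "$_id.transportation_mode",
        "count": 1,
    }},
])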
I have a field distribution in my record schema that looks like this:
...
"distribution": {
"properties": {
"availability": {
"type": "keyword"
}
}
}
...
I want to rank the records with distribution.availability == "ondemand" lower than other records.
I looked in the Elasticsearch docs but can't find a way to reduce the scores of these records at index time so that they appear lower in search results.
How can I achieve this? Any pointers to related sources would be enough as well.
More Info:
I was completely omitting these ondemand records with the help of the Python client at query time, like this:
from elasticsearch_dsl.query import Q
_query = Q("query_string", query=query_string) & ~Q('match', **{'availability.keyword': 'ondemand'})
Now I want to include these records, but place them lower than the other records.
If it is not possible to implement something like this at index time, please suggest how I can achieve it at query time with the Python client.
After applying the suggestion from llermaly, the Python client query looks like this:
boosting_query = Q(
    "boosting",
    positive=Q("match_all"),
    negative=Q(
        "bool", filter=[Q({"term": {"distribution.availability.keyword": "ondemand"}})]
    ),
    negative_boost=0.5,
)

if query_string:
    _query = Q("query_string", query=query_string) & boosting_query
else:
    _query = Q() & boosting_query
EDIT2: elasticsearch-dsl-py version of the boosting query:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q

client = Elasticsearch()
q = Q(
    'boosting',
    positive=Q("match_all"),
    negative=Q('bool', filter=[Q({"term": {"test.available.keyword": "ondemand"}})]),
    negative_boost=0.5,
)
s = Search(using=client, index="test_parths007").query(q)
response = s.execute()

print(response)
for hit in response:
    print(hit.meta.score, hit.test.available)
EDIT: Just read that you need to do it at index time.
Elasticsearch deprecated index-time boosting in 5.0:
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-boost.html
You can use a Boosting query to achieve that at query time.
Ingest Documents
POST test_parths007/_doc
{
  "name": "doc1",
  "test": {
    "available": "ondemand"
  }
}
POST test_parths007/_doc
{
  "name": "doc1",
  "test": {
    "available": "higherscore"
  }
}
POST test_parths007/_doc
{
  "name": "doc2",
  "test": {
    "available": "higherscore"
  }
}
Query (query time)
POST test_parths007/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match_all": {}
      },
      "negative": {
        "term": {
          "test.available.keyword": "ondemand"
        }
      },
      "negative_boost": 0.5
    }
  }
}
Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_parths007",
        "_type" : "_doc",
        "_id" : "VMdY7XcB50NMsuQPelRx",
        "_score" : 1.0,
        "_source" : {
          "name" : "doc2",
          "test" : {
            "available" : "higherscore"
          }
        }
      },
      {
        "_index" : "test_parths007",
        "_type" : "_doc",
        "_id" : "Vcda7XcB50NMsuQPiVRB",
        "_score" : 1.0,
        "_source" : {
          "name" : "doc1",
          "test" : {
            "available" : "higherscore"
          }
        }
      },
      {
        "_index" : "test_parths007",
        "_type" : "_doc",
        "_id" : "U8dY7XcB50NMsuQPdlTo",
        "_score" : 0.5,
        "_source" : {
          "name" : "doc1",
          "test" : {
            "available" : "ondemand"
          }
        }
      }
    ]
  }
}
For more advanced manipulation you can check the Function Score Query.
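For example, a minimal function_score sketch in the same elasticsearch-dsl style as the EDIT2 snippet above; the 0.5 weight is an arbitrary choice for illustration:

from elasticsearch_dsl import Q

# Demote ondemand docs by multiplying their score by a weight below 1;
# all other docs keep their original query score.
q = Q(
    "function_score",
    query=Q("match_all"),
    functions=[
        {
            "filter": {"term": {"test.available.keyword": "ondemand"}},
            "weight": 0.5,
        }
    ],
    boost_mode="multiply",
)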
I have Elasticsearch configured for my Django project.
The Elasticsearch index has two fields, user_id and address; my goal is to search a list of comma-separated addresses on Elasticsearch.
Example:
I have this list of addresses ["abc", "def", "ghi", "jkl", "mno"] and I want to search for them on Elasticsearch in one hit. The result I expect for the above list is ["abc", "def", "ghi"] if those three addresses exist (individually) in the address field.
Ingest data
POST test_foki/_doc
{
  "user_id": 1,
  "address": "abc"
}
POST test_foki/_doc
{
  "user_id": 2,
  "address": "def"
}
POST test_foki/_doc
{
  "user_id": 3,
  "address": "ghi"
}
If you want exact matches, you can use a terms query to filter by an array of addresses.
Request
We use a filter because we don't care about the score on exact matches (it either matches or it doesn't).
POST test_foki/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "address.keyword": [
              "abc",
              "def",
              "ghi",
              "jkl",
              "mno"
            ]
          }
        }
      ]
    }
  }
}
Response
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_foki",
        "_type" : "_doc",
        "_id" : "YzkL4HcBv0SJscHMrZB8",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1,
          "address" : "abc"
        }
      },
      {
        "_index" : "test_foki",
        "_type" : "_doc",
        "_id" : "ZDkL4HcBv0SJscHMsZAx",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 2,
          "address" : "def"
        }
      },
      {
        "_index" : "test_foki",
        "_type" : "_doc",
        "_id" : "ZTkL4HcBv0SJscHMtpAd",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 3,
          "address" : "ghi"
        }
      }
    ]
  }
}
If you want to do full-text searches, you will have to use a bool query:
POST test_foki/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "address": "abc"
          }
        },
        {
          "match": {
            "address": "def"
          }
        },
        {
          "match": {
            "address": "ghi"
          }
        },
        {
          "match": {
            "address": "jkl"
          }
        },
        {
          "match": {
            "address": "mno"
          }
        }
      ]
    }
  }
}
This bool query compiles to the Lucene query address:abc address:def address:ghi address:jkl address:mno, which a single match query also produces:
POST test_foki/_search
{
  "query": {
    "match": {
      "address": "abc def ghi jkl mno"
    }
  }
}
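From the Django side, the same exact-match filter can be sent through the Python client. A sketch, assuming the test_foki index above and the default .keyword sub-field:

from elasticsearch import Elasticsearch

client = Elasticsearch()
wanted = ["abc", "def", "ghi", "jkl", "mno"]

resp = client.search(
    index="test_foki",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"address.keyword": wanted}}
                ]
            }
        }
    },
)

# Collect the addresses that actually exist in the index,
# e.g. ["abc", "def", "ghi"] for the sample data above.
found = [hit["_source"]["address"] for hit in resp["hits"]["hits"]]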
I'm looking to update one field inside a nested array (an array within an array):
db.germain.updateOne({}, {$set: { "items.$[elem].sub_items.price" : 2}}, {arrayFilters: [ { "elem.sub_item_name": "my_item_two_one" } ] } )
The filter finds a matching document, but it doesn't update.
{
  "_id" : ObjectId("4faaba123412d654fe83hg876"),
  "user_id" : 123456,
  "items" : [
    {
      "item_name" : "my_item_one",
      "sub_items" : [
        {
          "sub_item_name" : "my_item_one_one",
          "price" : 20
        }
      ]
    },
    {
      "item_name" : "my_item_two",
      "sub_items" : [
        {
          "sub_item_name" : "my_item_two_one",
          "price" : 30
        },
        {
          "sub_item_name" : "my_item_two_two",
          "price" : 50
        }
      ]
    }
  ]
}
It's actually a nested array, so you need to specify which object in the parent array to change, and which object inside that object's sub-array to change.
db.collection.update({},
  {
    $set: {
      "items.$[parent].sub_items.$[child].price": 2
    }
  },
  {
    arrayFilters: [
      { "parent.item_name": "my_item_two" },
      { "child.sub_item_name": "my_item_two_one" }
    ]
  })
If you need to change the whole object in the parent array, you can simply use the $ positional operator.
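The same arrayFilters update from PyMongo, as a sketch against the collection handle used in the question:

# One array filter per placeholder used in the update path.
db.germain.update_one(
    {},
    {"$set": {"items.$[parent].sub_items.$[child].price": 2}},
    array_filters=[
        {"parent.item_name": "my_item_two"},
        {"child.sub_item_name": "my_item_two_one"},
    ],
)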
Consider the following documents in my Elasticsearch index. I want to group the documents based on rank, but any rank below 1000 must be displayed individually, and anything above that must be grouped. How do I achieve this using a composite aggregation? I am new to this, and I am using composite because I want the after key for pagination.
Documents
{
  "rank": 200,
  "name": "abcd",
  "score1": 100,
  "score2": 200
},
{
  "rank": 300,
  "name": "abcd",
  "score1": 100,
  "score2": 200
}
Expected Result:
{
  "key": {
    "rank": 101
  },
  "doc_count": 1,
  "_score1": { "value": 3123 },
  "_score2": { "value": 3323 }
},
{
  "key": {
    "rank": "1000-*"
  },
  "doc_count": 1,
  "_score1": { "value": 3123 },
  "_score2": { "value": 3323 }
},
{
  "key": {
    "rank": 300
  },
  "doc_count": 1,
  "_score1": { "value": 3123 },
  "_score2": { "value": 3323 }
}
The query that I tried:
{
  "query": { "match_all": {} },
  "aggs": {
    "_scores": {
      "composite": {
        "sources": [
          { "_rank": { "terms": { "field": "rank" } } }
        ]
      },
      "aggs": {
        "_ranks": {
          "range": {
            "field": "rank",
            "ranges": [
              { "to": 1000 },
              { "from": 1000 }
            ]
          }
        },
        "_score1": { "sum": { "field": "score1" } },
        "_score2": { "sum": { "field": "score2" } }
      }
    }
  }
}
From what I understand, you want to
Group the aggregations whose value is below 1000 rank to their own buckets
Group the aggregations whose value is 1000 and above to a single bucket with key 1000-*
And for each bucket, calculate the sum of _score1 over its documents
Similarly, calculate the sum of _score2 over its documents
For this scenario, you can simply make use of a Terms Aggregation, as shown below.
I've mentioned sample mapping, sample documents, query and response so that you'll have clarity on what's happening.
Mapping:
PUT my_sample_index
{
  "mappings": {
    "properties": {
      "rank": {
        "type": "integer"
      },
      "name": {
        "type": "keyword"
      },
      "_score1": {
        "type": "integer"
      },
      "_score2": {
        "type": "integer"
      }
    }
  }
}
Sample Documents:
POST my_sample_index/_doc/1
{
  "rank": 100,
  "name": "john",
  "_score1": 100,
  "_score2": 100
}
POST my_sample_index/_doc/2
{
  "rank": 1001,       <--- Rank > 1000
  "name": "constantine",
  "_score1": 200,
  "_score2": 200
}
POST my_sample_index/_doc/3
{
  "rank": 200,
  "name": "bruce",
  "_score1": 100,
  "_score2": 100
}
POST my_sample_index/_doc/4
{
  "rank": 2001,       <--- Rank > 1000
  "name": "arthur",
  "_score1": 200,
  "_score2": 200
}
Aggregation Query:
POST my_sample_index/_search
{
  "size": 0,
  "aggs": {
    "_score": {
      "terms": {
        "script": {
          "source": """
            if (doc['rank'].value < 1000) {
              return String.valueOf(doc['rank'].value);
            } else {
              return '1000-*';
            }
          """
        }
      },
      "aggs": {
        "_score1_sum": {
          "sum": {
            "field": "_score1"
          }
        },
        "_score2_sum": {
          "sum": {
            "field": "_score2"
          }
        }
      }
    }
  }
}
Note that I've used a scripted Terms Aggregation, where the bucketing logic lives in the script. The logic, I believe, is self-explanatory once you go through it.
Response:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "_score" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1000-*",       <---- Note this
          "doc_count" : 2,        <---- Note this
          "_score2_sum" : {
            "value" : 400.0
          },
          "_score1_sum" : {
            "value" : 400.0
          }
        },
        {
          "key" : "100",
          "doc_count" : 1,
          "_score2_sum" : {
            "value" : 100.0
          },
          "_score1_sum" : {
            "value" : 100.0
          }
        },
        {
          "key" : "200",
          "doc_count" : 1,
          "_score2_sum" : {
            "value" : 100.0
          },
          "_score1_sum" : {
            "value" : 100.0
          }
        }
      ]
    }
  }
}
Note that the two documents having rank > 1000 land in a single bucket, and their _score1 and _score2 values each sum to 400, which is what is expected.
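Since you mentioned wanting the after key for pagination, note that a plain terms aggregation does not return one. Below is a sketch of the same bucketing script used as a composite terms source instead, via the Python client; whether scripts are accepted in composite sources depends on your Elasticsearch version, so treat this as an assumption to verify:

from elasticsearch import Elasticsearch

client = Elasticsearch()

# Same bucketing logic as above, but inside a composite source so the
# response carries an "after_key" usable for pagination.
bucket_script = """
if (doc['rank'].value < 1000) {
    return String.valueOf(doc['rank'].value);
}
return '1000-*';
"""

body = {
    "size": 0,
    "aggs": {
        "_score": {
            "composite": {
                "size": 100,  # page size; pass "after" with the last after_key to page
                "sources": [
                    {"_rank": {"terms": {"script": {"source": bucket_script}}}}
                ],
            },
            "aggs": {
                "_score1_sum": {"sum": {"field": "_score1"}},
                "_score2_sum": {"sum": {"field": "_score2"}},
            },
        }
    },
}

resp = client.search(index="my_sample_index", body=body)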
Let me know if this helps!