I'm a total beginner with PyMongo. I'm trying to find activities that are registered multiple times, but this code returns an empty list. Could you please help me find the mistake?
rows = self.db.Activity.aggregate([
    {'$group': {
        "_id": {
            "user_id": "$user_id",
            "transportation_mode": "$transportation_mode",
            "start_date_time": "$start_date_time",
            "end_date_time": "$end_date_time"
        },
        "count": {'$sum': 1}
    }},
    {'$match': {
        "count": {'$gt': 1}
    }},
    {'$project': {
        "_id": 0,
        "user_id": "_id.user_id",
        "transportation_mode": "_id.transportation_mode",
        "start_date_time": "_id.start_date_time",
        "end_date_time": "_id.end_date_time",
        "count": 1
    }}
])
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
When you pass _id: 0 in the $project stage, the sub-fields of _id are suppressed as well, even if they are referenced later in the same stage, because the exclusion rule overrides them.
Try the $project stage below (note the $ prefix on the _id sub-paths, which is required to reference their values rather than emit literal strings).
{
    '$project': {
        "user_id": "$_id.user_id",
        "transportation_mode": "$_id.transportation_mode",
        "start_date_time": "$_id.start_date_time",
        "end_date_time": "$_id.end_date_time",
        "count": 1
    }
}
rows = self.db.Activity.aggregate([
    {
        '$group': {
            "_id": {
                "user_id": "$user_id",
                "transportation_mode": "$transportation_mode",
                "start_date_time": "$start_date_time",
                "end_date_time": "$end_date_time"
            },
            "count": {'$sum': 1}
        }
    },
    {
        '$match': {
            "count": {'$gt': 1}
        }
    },
    {
        '$project': {
            "user_id": "$_id.user_id",
            "transportation_mode": "$_id.transportation_mode",
            "start_date_time": "$_id.start_date_time",
            "end_date_time": "$_id.end_date_time",
            "count": 1
        }
    }
])
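As a side note, aggregate in PyMongo returns a CommandCursor, which can only be consumed once; a minimal sketch to inspect the output (reusing the rows variable from the code above):

for row in rows:  # each row is a plain dict
    print(row)

# or drain the cursor into a reusable list first:
# results = list(rows)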
Your grouping criteria are likely too narrow.
The $group stage will create a separate output document for each distinct value of the _id field. The pipeline in the question will only include two input documents in the same group if they have exactly the same value in all four of those fields.
For a count to be greater than 1, there must exist two documents with the same user, the same mode, and exactly the same start and end times.
In the sample data you show, no two documents would fall into the same group, so every output document from the $group stage has a count of 1; none of them satisfies the $match, and the result is an empty list.
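For example, if a "duplicate" in your data only needs to share the user and the transportation mode (an assumption on my part; pick whichever fields truly define a duplicate), a looser grouping would actually let documents collide:

pipeline = [
    {"$group": {
        "_id": {
            "user_id": "$user_id",
            "transportation_mode": "$transportation_mode",
        },
        "count": {"$sum": 1},
    }},
    {"$match": {"count": {"$gt": 1}}},
]
duplicates = list(self.db.Activity.aggregate(pipeline))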
I have a collection of documents like this:
"RecordId": 1,
"CurrentState" : {
"collection_method" : "Phone",
"collection_method_convert" : 1,
"any_amount_outside_of_min_max_fx_margin" : null,
"amounts_and_rates" : [
{
"_id" : ObjectId("5ef870670000000000000000"),
"amount_from" : 1000.0,
"time_collected_researcher_input" : null,
"date_collected_researcher_input" : null,
"timezone_researcher_input" : null,
"datetime_collected_utc" : ISODate("2020-03-02T21:45:00.000Z"),
"interbank_rate" : 0.58548,
"ib_api_url" : null,
"fx_rate" : 0.56796,
"fx_margin" : 2.9924164787866,
"amount_margin_approved" : true,
"outside_of_min_max_fx_margin" : null,
"amount_duplicated" : false,
"fx_margin_delta_mom" : null,
"fx_margin_reldiff_pct_mom" : null,
"fx_margin_reldiff_gt15pct_mom" : null
},
{
"_id" : ObjectId("5efdadae0000000000000000"),
"amount_from" : 10000.0,
"time_collected_researcher_input" : null,
"date_collected_researcher_input" : null,
"timezone_researcher_input" : null,
"datetime_collected_utc" : ISODate("2020-03-02T21:45:00.000Z"),
"interbank_rate" : 0.58548,
"ib_api_url" : null,
"fx_rate" : 0.57386,
"fx_margin" : 1.9846963175514,
"amount_margin_approved" : true,
"outside_of_min_max_fx_margin" : null,
"amount_duplicated" : false,
"fx_margin_delta_mom" : null,
"fx_margin_reldiff_pct_mom" : null,
"fx_margin_reldiff_gt15pct_mom" : null
}
]
}
}
The amounts_and_rates array can contain different fields in different documents, and even within a single document.
I need to find the document with the largest number of fields.
I also need to find all possible fields in amounts_and_rates. The collection can be rather large, and checking documents one by one could take a long time. Is it possible to find what I need with MongoDB's aggregation functions?
I want to have in the end something like:
[{RecordId: 1, number_of_fields: [13, 12, 14]}, {RecordId: 2, number_of_fields: [9, 12, 14]}]
Or even just max_records_number, as in [{RecordId: 2}, {RecordId: 4}].
I would also like to get the set of fields used in amounts_and_rates across the whole collection, like:
set = ["_id", "amount_from", "time_collected_researcher_input" ...]
Here are solutions for your two requirements.
The set of unique fields:
set = ["_id", "amount_from", "time_collected_researcher_input" ...]
$unwind amounts_and_rates, because it's an array and we need to work on its elements in $project
$project converts each object to an array of key/value pairs using $objectToArray
$unwind again, because amounts_and_rates is now an array and we need its elements in $group
$group by a null _id and collect the unique keys into the amounts_and_rates set using $addToSet
$project removes _id
db.collection.aggregate([
{
$unwind: "$CurrentState.amounts_and_rates"
},
{
$project: {
amounts_and_rates: {
$objectToArray: "$CurrentState.amounts_and_rates"
}
}
},
{
$unwind: "$amounts_and_rates"
},
{
$group: {
_id: null,
amounts_and_rates: {
$addToSet: "$amounts_and_rates.k"
}
}
},
{
$project: {
_id: 0
}
}
])
Working Playground: https://mongoplayground.net/p/6dPGM2hZ4vW
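If you are running this from Python, a minimal PyMongo sketch of the same pipeline (the client setup and database name are assumptions here):

from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance; adjust the URI as needed
db = client["my_database"]  # hypothetical database name

pipeline = [
    {"$unwind": "$CurrentState.amounts_and_rates"},
    {"$project": {"amounts_and_rates": {"$objectToArray": "$CurrentState.amounts_and_rates"}}},
    {"$unwind": "$amounts_and_rates"},
    {"$group": {"_id": None, "amounts_and_rates": {"$addToSet": "$amounts_and_rates.k"}}},
    {"$project": {"_id": 0}},
]
result = list(db.collection.aggregate(pipeline))
# result[0]["amounts_and_rates"] holds the set of unique keys (order not guaranteed)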
Field counts per sub-document:
[{RecordId: 1, number_of_fields: [13, 12, 14]}, {RecordId: 2, number_of_fields: [9, 12, 14]}]
$unwind amounts_and_rates, because it's an array and we need to work on its elements in $project
$project converts each object to an array using $objectToArray and takes its $size to get that element's field count
$group by RecordId, push each element's key count into number_of_fields, and add a total for the overall count
$project removes _id
db.collection.aggregate([
{
$unwind: "$CurrentState.amounts_and_rates"
},
{
"$project": {
RecordId: 1,
arrayofkeyvalue: {
$size: {
$objectToArray: "$CurrentState.amounts_and_rates"
}
}
}
},
{
$group: {
_id: "$RecordId",
RecordId: {
$first: "$RecordId"
},
number_of_fields: {
$push: {
$sum: "$arrayofkeyvalue"
}
},
total: {
$sum: "$arrayofkeyvalue"
}
}
},
{
$project: {
_id: 0
}
}
])
Working Playground: https://mongoplayground.net/p/TRFsj11BqVR
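As a quick client-side sanity check on a small sample (a sketch in PyMongo, assuming documents shaped like the example above and a db handle as in the previous sketch):

doc = db.collection.find_one({"RecordId": 1})
if doc is not None:
    counts = [len(element) for element in doc["CurrentState"]["amounts_and_rates"]]
    print(counts)  # one key count per array element, e.g. [16, 16] for the sample document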
How could I improve my function to insert the id of my dataframe as the "_id" of the Elasticsearch document, to handle duplicates?
Dataframe structure
print(df.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 412 non-null object
1 email_address 412 non-null object
2 first_name 412 non-null object
3 last_name 412 non-null object
The function to convert the dataframe to an Elasticsearch-compatible format:
def to_elastic_json(df, index_name):
import json
for record in df.to_dict(orient="records"):
yield ('{ "index" : { "_index" : "%s"}}'% (index_name))
yield (json.dumps(record, default=str))
es_response = elastic_client.bulk(to_elastic_json(df, INDEX_name))
EDIT
Yes, ES will update the doc with a new _version number if you ingest a doc with an already existing _id.
Here's how to do it:
def to_elastic_json(df, index_name):
import json
for record in df.to_dict(orient="records"):
yield ('{ "index" : { "_index" : "%s", "_id": "%s"}}'% (index_name, str(record['id'])))
yield (json.dumps(record, default=str))
Verify by calling
GET INDEX_NAME/_search?version=true
and looking for the _version attribute.
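Or from Python (a sketch, assuming the same elastic_client and INDEX_name as in your question):

resp = elastic_client.search(index=INDEX_name, version=True)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_version"])  # _version increments on each overwrite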
ORIGINAL
Why not let ES auto-generate the _id and keep your own id as a separate field? That way, you can later write a script to find docs with the same id and keep only the 'correct' ones.
E.g.:
Two dupes and one unique:
POST df/_doc
{
"doc_id": 0,
"email_addr": "e#f.com",
"timestamp": 10
}
POST df/_doc
{
"doc_id": 0,
"email_addr": "a#b.com",
"timestamp": 100
}
POST df/_doc
{
"doc_id": 1,
"email_addr": "a#e.com"
}
Then finding uniques and including, arbitrarily, just the 'most recent' one:
GET df/_search
{
"size": 0,
"aggs": {
"scripted_terms": {
"terms": {
"size": 1000,
"field": "doc_id",
"min_doc_count": 2
},
"aggs": {
"top_hits_agg": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}
yielding
...
"aggregations" : {
"scripted_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"doc_count" : 2,
"top_hits_agg" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "df",
"_type" : "_doc",
"_id" : "Ev635HEBW-D5QnrWDjzH",
"_score" : null,
"_source" : {
"doc_id" : 0,
"email_addr" : "a#b.com",
"timestamp" : 100
},
"sort" : [
100
]
}
]
}
}
}
]
}
}
}
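From there, a Python sketch (assuming the official elasticsearch client; query stands for the aggregation body above, and the index name df matches the example) could collect which _id to keep for each duplicated doc_id:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster
resp = es.search(index="df", body=query)  # query = the terms/top_hits body above

# Map each duplicated doc_id to the _id of its most recent document.
keep = {
    bucket["key"]: bucket["top_hits_agg"]["hits"]["hits"][0]["_id"]
    for bucket in resp["aggregations"]["scripted_terms"]["buckets"]
}
print(keep)  # everything else with those doc_ids is a candidate for deletion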
Consider the following documents in my Elasticsearch index. I want to group the documents based on rank, but any rank below 1000 must be displayed individually, while anything above 1000 must be grouped. How do I achieve this using a composite aggregation? I am new to this, and I am using composite because I want to use the after key to allow pagination.
Documents
{
    rank: 200,
    name: "abcd",
    score1: 100,
    score2: 200
},
{
    rank: 300,
    name: "abcd",
    score1: 100,
    score2: 200
}
Expected Result:
{
key:{
rank:101
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
},
{
key:{
rank:1000-*
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
},
{
key:{
rank:300
},
doc_count:1,
_score1: {value:3123}
_score2 : {value :3323}
}
The query that I tried:
{
"query":{"match_all":{}},
"aggs":{
"_scores":{
"composite"{
"sources":[
{"_rank":{"terms":{"field":"rank"}}}
]
}
},
"aggs":{
"_ranks":{
"field":"rank:[
{"to":1000},
{"from":1000}
]
}
"_score1": {"sum": {"field": "score1"}}
"_score2": {"sum": {"field": "score2"}}
}
}
}
From what I understand, you want to:
Group documents whose rank is below 1000 into their own buckets
Group documents whose rank is 1000 and above into a single bucket with the key 1000-*
For each bucket, calculate the sum of _score1
Similarly, calculate the sum of _score2
For this scenario, you can simply make use of a Terms Aggregation, as in the answer below.
I've mentioned sample mapping, sample documents, query and response so that you'll have clarity on what's happening.
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"rank":{
"type": "integer"
},
"name":{
"type": "keyword"
},
"_score1": {
"type":"integer"
},
"_score2":{
"type": "integer"
}
}
}
}
Sample Documents:
POST my_sample_index/_doc/1
{
"rank": 100,
"name": "john",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/2
{
"rank": 1001, <--- Rank > 1000
"name": "constantine",
"_score1": 200,
"_score2": 200
}
POST my_sample_index/_doc/3
{
"rank": 200,
"name": "bruce",
"_score1": 100,
"_score2": 100
}
POST my_sample_index/_doc/4
{
"rank": 2001, <--- Rank > 1000
"name": "arthur",
"_score1": 200,
"_score2": 200
}
Aggregation Query:
POST my_sample_index/_search
{
"size":0,
"aggs": {
"_score": {
"terms": {
"script": {
"source": """
            if (doc['rank'].value < 1000) {
              return String.valueOf(doc['rank'].value);
            } else {
              return '1000-*';
            }
"""
}
},
"aggs":{
"_score1_sum":{
"sum": {
"field": "_score1"
}
},
"_score2_sum":{
"sum":{
"field": "_score2"
}
}
}
}
}
}
Note that I've used a scripted Terms Aggregation, where the bucketing logic lives in the script. I believe the logic is self-explanatory once you go through it.
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1000-*", <---- Note this
"doc_count" : 2, <---- Note this
"_score2_sum" : {
"value" : 400.0
},
"_score1_sum" : {
"value" : 400.0
}
},
{
"key" : "100",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
},
{
"key" : "200",
"doc_count" : 1,
"_score2_sum" : {
"value" : 100.0
},
"_score1_sum" : {
"value" : 100.0
}
}
]
}
}
}
Note that two documents have rank > 1000, and their _score1 and _score2 values each sum to 400, which is what's expected.
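Since you mentioned pagination via after_key: I believe the composite aggregation's terms source also accepts a script, so a hedged sketch of the same bucketing logic with pagination support (same index and fields as above, written as a Python request body) would be:

query = {
    "size": 0,
    "aggs": {
        "_scores": {
            "composite": {
                "size": 100,  # page size; pass the returned after_key as "after" for the next page
                "sources": [
                    {"_rank": {"terms": {"script": {
                        "source": "doc['rank'].value < 1000 ? String.valueOf(doc['rank'].value) : '1000-*'"
                    }}}}
                ]
            },
            "aggs": {
                "_score1_sum": {"sum": {"field": "_score1"}},
                "_score2_sum": {"sum": {"field": "_score2"}}
            }
        }
    }
}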
Let me know if this helps!
I have a Mongo collection with documents as follows:
{
"_id" : ObjectId("55a9378ee2874f0ed7b7cb7e"),
"_uid" : 10,
"impressions" : [
{
"pos" : 6,
"id" : 123,
"service" : "furniture"
},
{
"pos" : 0,
"id" : 128,
"service" : "electronics"
},
{
"pos" : 2,
"id" : 127,
"service" : "furniture"
},
{
"pos" : 2,
"id" : 125,
"service" : "electronics"
},
{
"pos" : 10,
"id" : 124,
"service" : "electronics"
}
]
},
{
"_id" : ObjectId("55a9378ee2874f0ed7b7cb7f"),
"_uid" : 11,
"impressions" : [
{
"pos" : 1,
"id" : 124,
"service" : "furniture"
},
{
"pos" : 10,
"id" : 124,
"service" : "electronics"
},
{
"pos" : 1,
"id" : 123,
"service" : "furniture"
},
{
"pos" : 21,
"id" : 122,
"service" : "furniture"
},
{
"pos" : 3,
"id" : 125,
"service" : "electronics"
},
{
"pos" : 10,
"id" : 121,
"service" : "electronics"
}
]
}
My aim is to find all the "id" values in a particular "service", say "furniture", i.e. to get a result like this:
[122,123,124,127]
But I'm not able to figure out how to frame the condition in
db.collection_name.find()
because of the difficulty of writing a condition for the nth element of an array, "impressions[n]": "value".
One option is to take the "id"s obtained and perform an aggregate operation to find the impressions for each "id" in a service, as suggested by the answer to this question I asked earlier:
MapReduce in PyMongo.
But I only want the list of distinct "id" values in a service, not the impressions.
Kindly help!
You need the aggregation framework for meaningful results. So, much like this:
result = db.collection.aggregate([
{ "$match": {
"impressions.service": "furniture"
}},
{ "$unwind": "$impressions" },
{ "$match": {
"impressions.service": "furniture"
}},
{ "$group": {
"_id": "$impressions.id"
}}
])
Or better yet, with MongoDB 2.6 or greater you can remove the unmatched array items prior to $unwind using $redact:
result = db.collection.aggregate([
{ "$match": {
"impressions.service": "furniture"
}},
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$ifNull": [ "$service", "furniture" ] },
"furniture"
]
},
"then": "$$DESCEND",
"else": "$$PRUNE"
}
}},
{ "$unwind": "$impressions" },
{ "$group": {
"_id": "$impressions.id"
}}
])
Which yields:
{ "_id" : 122 }
{ "_id" : 124 }
{ "_id" : 127 }
{ "_id" : 123 }
That is not a plain "list", but you can simply transform it (note that in Python 3, map returns an iterator, so wrap it in list()):
def mapper(x):
    return x["_id"]

list(map(mapper, result))
Or:
list(map(lambda x: x["_id"], result))
To give you:
[122, 124, 127, 123]
If you want it "sorted" then either add a $sort stage at the end of the aggregation pipeline or sort the resulting list in code.
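For example, a sketch of the sorted variant (just appending $sort to the first pipeline above):

result = db.collection.aggregate([
    {"$match": {"impressions.service": "furniture"}},
    {"$unwind": "$impressions"},
    {"$match": {"impressions.service": "furniture"}},
    {"$group": {"_id": "$impressions.id"}},
    {"$sort": {"_id": 1}},  # ascending by id
])
ids = [doc["_id"] for doc in result]  # [122, 123, 124, 127]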
MongoDB noob here...
So, I'm trying to print out the minimum value score inside a collection that looks like this...
> db.students.find({'_id': 1}).pretty()
{
"_id" : 1,
"name" : "Aurelia Menendez",
"scores" : [
{
"type" : "exam",
"score" : 60.06045071030959
},
{
"type" : "quiz",
"score" : 52.79790691903873
},
{
"type" : "homework",
"score" : 71.76133439165544
},
{
"type" : "homework",
"score" : 34.85718117893772
}
]
}
The incantation I'm using is as such...
db.students.aggregate(
// Initial document match (uses index, if a suitable one is available)
{ $match: {
_id : 1
}},
// Expand the scores array into a stream of documents
{ $unwind: '$scores' },
// Filter to 'homework' scores
{ $match: {
'scores.type': 'homework'
}},
// grab the minimum value score
{ $match: {
'scores.min.score': 1
}}
)
The output I'm getting is this...
{ "result" : [ ], "ok" : 1 }
What am I doing wrong?
You've got the right idea, but in the last step of the aggregation what you want to do is group all the scores by student and find the $min value.
Change the last pipeline operation to:
{ $group: {
_id: "$_id",
minScore: {$min: "$scores.score"}
}}
> db.students.aggregate(
    { $unwind: "$scores" },
    { $match: { "scores.type": "homework" } },
    { $group: {
        _id: "$_id",
        maxScore: { $max: "$scores.score" },
        minScore: { $min: "$scores.score" }
    }}
);
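And if you are driving this from PyMongo rather than the shell, a minimal sketch (assuming an existing db handle) would be:

pipeline = [
    {"$match": {"_id": 1}},
    {"$unwind": "$scores"},
    {"$match": {"scores.type": "homework"}},
    {"$group": {"_id": "$_id", "minScore": {"$min": "$scores.score"}}},
]
for doc in db.students.aggregate(pipeline):
    print(doc)  # {'_id': 1, 'minScore': 34.85718117893772}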