MongoDB: find document with largest amount of - python

I have a collection of documents like this:
"RecordId": 1,
"CurrentState" : {
"collection_method" : "Phone",
"collection_method_convert" : 1,
"any_amount_outside_of_min_max_fx_margin" : null,
"amounts_and_rates" : [
{
"_id" : ObjectId("5ef870670000000000000000"),
"amount_from" : 1000.0,
"time_collected_researcher_input" : null,
"date_collected_researcher_input" : null,
"timezone_researcher_input" : null,
"datetime_collected_utc" : ISODate("2020-03-02T21:45:00.000Z"),
"interbank_rate" : 0.58548,
"ib_api_url" : null,
"fx_rate" : 0.56796,
"fx_margin" : 2.9924164787866,
"amount_margin_approved" : true,
"outside_of_min_max_fx_margin" : null,
"amount_duplicated" : false,
"fx_margin_delta_mom" : null,
"fx_margin_reldiff_pct_mom" : null,
"fx_margin_reldiff_gt15pct_mom" : null
},
{
"_id" : ObjectId("5efdadae0000000000000000"),
"amount_from" : 10000.0,
"time_collected_researcher_input" : null,
"date_collected_researcher_input" : null,
"timezone_researcher_input" : null,
"datetime_collected_utc" : ISODate("2020-03-02T21:45:00.000Z"),
"interbank_rate" : 0.58548,
"ib_api_url" : null,
"fx_rate" : 0.57386,
"fx_margin" : 1.9846963175514,
"amount_margin_approved" : true,
"outside_of_min_max_fx_margin" : null,
"amount_duplicated" : false,
"fx_margin_delta_mom" : null,
"fx_margin_reldiff_pct_mom" : null,
"fx_margin_reldiff_gt15pct_mom" : null
}
]
}
The sub-documents in the amounts_and_rates array can contain different fields, both across documents and within a single document.
I need to find the document with the largest number of fields.
I also need to find all possible fields in amounts_and_rates. The collection can be rather large, so checking documents one by one could take a long time. Is it possible to find what I need with MongoDB's aggregation functions?
In the end I would like to have something like:
[{RecordId: 1, number_of_fields: [13, 12, 14]}, {RecordId: 2, number_of_fields: [9, 12, 14]}]
Or even just max_records_number as [{RecordId: 2}, {RecordId: 4}].
I would also like to get the set of fields used in amounts_and_rates across the collection, like:
set = ["_id", "amount_from", "time_collected_researcher_input" ...]

Here are solutions for your two requirements.
The set of unique fields:
set = ["_id", "amount_from", "time_collected_researcher_input" ...]
$unwind amounts_and_rates, because it is an array and each element is needed in $project
$project converts each sub-document to an array of key/value pairs using $objectToArray
$unwind again, because amounts_and_rates is now an array of key/value pairs and each pair is needed in $group
$group by a null _id and collect the unique keys into the set amounts_and_rates using $addToSet
$project removes _id
db.collection.aggregate([
{
$unwind: "$CurrentState.amounts_and_rates"
},
{
$project: {
amounts_and_rates: {
$objectToArray: "$CurrentState.amounts_and_rates"
}
}
},
{
$unwind: "$amounts_and_rates"
},
{
$group: {
_id: null,
amounts_and_rates: {
$addToSet: "$amounts_and_rates.k"
}
}
},
{
$project: {
_id: 0
}
}
])
Working Playground: https://mongoplayground.net/p/6dPGM2hZ4vW
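As a cross-check of what this pipeline computes, here is a minimal plain-Python sketch of the same steps over in-memory dicts. It does not talk to MongoDB; the sample documents and the helper name collect_field_names are made up for illustration.

```python
def collect_field_names(documents):
    """Union of all keys used by sub-documents in CurrentState.amounts_and_rates."""
    field_names = set()
    for doc in documents:
        # Equivalent of the first $unwind: visit each array element on its own.
        for entry in doc.get("CurrentState", {}).get("amounts_and_rates", []):
            # Equivalent of $objectToArray plus $addToSet on the "k" values.
            field_names.update(entry.keys())
    return field_names


# Hypothetical sample data, much smaller than the real documents.
sample = [
    {"RecordId": 1, "CurrentState": {"amounts_and_rates": [
        {"_id": "a", "amount_from": 1000.0},
        {"_id": "b", "fx_rate": 0.57386},
    ]}},
    {"RecordId": 2, "CurrentState": {"amounts_and_rates": [
        {"_id": "c", "interbank_rate": 0.58548},
    ]}},
]

print(sorted(collect_field_names(sample)))
```

This is only useful for checking the result on a handful of documents; for a large collection the aggregation above keeps the work on the server.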
Field counts per sub-document:
[{RecordId: 1, number_of_fields: [13, 12, 14]}, {RecordId: 2, number_of_fields: [9, 12, 14]}]
$unwind amounts_and_rates, because it is an array and each element is needed in $project
$project converts each sub-document to an array using $objectToArray and takes its $size, which is the number of fields in that sub-document
$group by RecordId, push each arrayofkeyvalue count into number_of_fields, and add total as the sum of all counts
$project removes _id
db.collection.aggregate([
{
$unwind: "$CurrentState.amounts_and_rates"
},
{
"$project": {
RecordId: 1,
arrayofkeyvalue: {
$size: {
$objectToArray: "$CurrentState.amounts_and_rates"
}
}
}
},
{
$group: {
_id: "$RecordId",
RecordId: {
$first: "$RecordId"
},
number_of_fields: {
$push: "$arrayofkeyvalue"
},
total: {
$sum: "$arrayofkeyvalue"
}
}
},
{
$project: {
_id: 0
}
}
])
Working Playground: https://mongoplayground.net/p/TRFsj11BqVR
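The question also asked which record has the most fields. A rough plain-Python sketch of the counting step, extended with a lookup of the maximum, might look like this. The sample data and both function names are hypothetical, and the sketch assumes at least one record has a non-empty array.

```python
def count_fields_per_record(documents):
    """Map each RecordId to the key counts of its amounts_and_rates entries."""
    counts = {}
    for doc in documents:
        entries = doc.get("CurrentState", {}).get("amounts_and_rates", [])
        # Same idea as $size over $objectToArray, pushed per RecordId.
        counts.setdefault(doc["RecordId"], []).extend(len(e) for e in entries)
    return counts


def records_with_most_fields(counts):
    """RecordIds whose largest sub-document reaches the overall maximum key count."""
    best = max(max(c) for c in counts.values() if c)
    return [rid for rid, c in counts.items() if c and max(c) == best]


# Hypothetical sample data.
sample = [
    {"RecordId": 1, "CurrentState": {"amounts_and_rates": [
        {"_id": "a", "amount_from": 1000.0},
        {"_id": "b", "amount_from": 10000.0, "fx_rate": 0.57386},
    ]}},
    {"RecordId": 2, "CurrentState": {"amounts_and_rates": [
        {"_id": "c"},
        {"_id": "d", "amount_from": 1.0, "fx_rate": 0.5, "fx_margin": 2.0},
    ]}},
]

counts = count_fields_per_record(sample)
print(counts)
print(records_with_most_fields(counts))
```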

Related

MongoDb summing number of distinct elements in string field array

I have an ip address collection:
{
"_id" : "uezyuLx4jjfvcqN",
"CVE" : ["CVE2020-123", "CVE2022-789", "CVE2019-456"],
"ip" : "1.2.3.4"
}
{
"_id" : "dCC8GrNdEjym3ryua",
"CVE" : ["CVE2020-123", "CVE2021-469"],
"ip" : "5.6.7.8"
}
{
"_id" : "dCC8GrNdEjym3ryua",
"CVE" : ["CVE2020-123", "CVE2021-469"],
"ip" : "7.6.7.6"
}
I'm trying to calculate the distinct sum of the CVE field, where IPs are in ["5.6.7.8", "1.2.3.4"].
Expected output:
{
ip: ['1.2.3.4', '5.6.7.8'],
sum_distinct_cve:4,
CVES: ["CVE2020-123", "CVE2022-789", "CVE2019-456", "CVE2021-469"]
}
So I'm doing the following:
db = db.getSiblingDB("test");
hosts = db.getCollection("test-collection")
hosts.aggregate([
{$match:
{"ip": {$in: ["1.2.3.4", "5.6.7.8"]}}},
{$group:
{_id: "$CVE",
totals: {$sum: "$CVE"}}}
]);
The sum is returning 0, which I've realised is because of MongoDB's behaviour when trying to sum a string field. This is detailed here: mongodb sum query returning zero
What I would like to know, though, is how I can count the number of elements, and also find the count of distinct elements.
Simple option:
db.collection.aggregate([
{
$match: {
"ip": {
$in: [
"1.2.3.4",
"5.6.7.8"
]
}
}
},
{
$unwind: "$CVE"
},
{
$group: {
_id: "",
ip: {
$addToSet: "$ip"
},
CVE: {
$addToSet: "$CVE"
}
}
},
{
$project: {
_id: 0,
ip: 1,
CVE: 1,
sum_distinct_cve: {
$size: "$CVE"
}
}
}
])
Explained:
Match the IPs
Unwind the CVE arrays
Group so that only the distinct ip and CVE values are kept
Project the necessary fields and use $size to count the distinct CVEs
Playground
I agree with @R2D2. One more option, which avoids $unwind (considered costly in terms of performance), is to use $reduce instead:
db.collection.aggregate([
{$match: {ip: {$in: ["1.2.3.4", "5.6.7.8"]}}},
{$group: {
_id: 0,
ip: {$addToSet: "$ip"},
CVE: {$addToSet: "$CVE"}
}},
{$project: {
_id: 0, ip: 1,
CVE: {
$reduce: {
input: "$CVE",
initialValue: [],
in: {$setUnion: ["$$value", "$$this"]}
}
}
}},
{$set: {sum_distinct_cve: {$size: "$CVE"}}}
])
See how it works on the playground example
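For intuition, the $reduce with $setUnion step behaves much like a fold over a list of arrays. Here is a small plain-Python sketch of that idea using the two matching documents from the question; it runs locally and is not a PyMongo call.

```python
from functools import reduce

# Documents from the question that match the $in filter on "ip".
matched = [
    {"ip": "1.2.3.4", "CVE": ["CVE2020-123", "CVE2022-789", "CVE2019-456"]},
    {"ip": "5.6.7.8", "CVE": ["CVE2020-123", "CVE2021-469"]},
]

# $group with $addToSet on "$CVE" collects whole arrays (an array of arrays).
cve_arrays = [doc["CVE"] for doc in matched]

# $reduce with $setUnion folds those arrays into one set of distinct values.
distinct_cves = reduce(lambda acc, arr: acc | set(arr), cve_arrays, set())

result = {
    "ip": sorted({doc["ip"] for doc in matched}),
    "CVE": sorted(distinct_cves),
    "sum_distinct_cve": len(distinct_cves),
}
print(result)
```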

PyMongo not returning results on aggregation

I'm a total beginner in PyMongo. I'm trying to find activities that are registered multiple times. This code is returning an empty list. Could you please help me in finding the mistake:
rows = self.db.Activity.aggregate( [
{ '$group':{
"_id":
{
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{'$match':
{ "count": { '$gt': 1 } }
},
{'$project':
{"_id":0,
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
]
)
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
The values in the $project stage need a $ prefix to be read as field paths. As written, "_id.user_id" is a literal string, not a reference to the grouped key. Excluding _id with _id: 0 is not required, so the stage below leaves it out.
Try the below $project stage.
{
'$project': {
"user_id":"$_id.user_id",
"transportation_mode":"$_id.transportation_mode",
"start_date_time":"$_id.start_date_time",
"end_date_time":"$_id.end_date_time",
"count": 1
}
}
The full pipeline then becomes:
rows = self.db.Activity.aggregate( [
{
'$group':{
"_id": {
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{
'$match':{
"count": { '$gt': 1 }
}
},
{
'$project': {
"user_id":"$_id.user_id",
"transportation_mode":"$_id.transportation_mode",
"start_date_time":"$_id.start_date_time",
"end_date_time":"$_id.end_date_time",
"count": 1
}
}
])
Your group criteria are likely too narrow.
The $group stage creates a separate output document for each distinct value of the _id field. The pipeline in the question will only put two input documents in the same group if they have exactly the same value in all four of those fields.
For a count to be greater than 1, there must be at least 2 documents with the same user, the same mode, and exactly the same start and end.
In the sample data you show, no two documents would fall into the same group, so every output document from the $group stage has a count of 1. None of them satisfies the $match, and the result is an empty list.
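This can be checked locally without MongoDB. The sketch below counts the five sample rows from the question by the same four-field key, using collections.Counter as a stand-in for $group with $sum: 1. In this sketch a missing field becomes None.

```python
from collections import Counter

# The five sample rows from the question.
rows = [
    {"_id": 0, "user_id": "000", "start_date_time": "2008-10-23 02:53:04", "end_date_time": "2008-10-23 11:11:12"},
    {"_id": 1, "user_id": "000", "start_date_time": "2008-10-24 02:09:59", "end_date_time": "2008-10-24 02:47:06"},
    {"_id": 2, "user_id": "000", "start_date_time": "2008-10-26 13:44:07", "end_date_time": "2008-10-26 15:04:07"},
    {"_id": 3, "user_id": "000", "start_date_time": "2008-10-27 11:54:49", "end_date_time": "2008-10-27 12:05:54"},
    {"_id": 4, "user_id": "000", "start_date_time": "2008-10-28 00:38:26", "end_date_time": "2008-10-28 05:03:42"},
]


def group_key(row):
    # Same four fields as the _id of the $group stage.
    return (
        row.get("user_id"),
        row.get("transportation_mode"),
        row.get("start_date_time"),
        row.get("end_date_time"),
    )


counts = Counter(group_key(row) for row in rows)

# Equivalent of the $match on count > 1.
duplicates = {key: n for key, n in counts.items() if n > 1}
print(len(counts), duplicates)
```

Every key occurs exactly once, so nothing passes the count filter for these rows.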

Basic request to mongodb with pymongo

I need to get all objects inside "posts" that have "published: true"
with pymongo. I've tried already so many variants but all I can do:
for elt in db[collection].find({}, {"posts"}):
print(elt)
That shows all "posts". I've tried something like this:
for elt in db[collection].find({}, {"posts", {"published": {"$eq": True}}}):
print(elt)
But it doesn't work. Help, I'm trying for 3 days already =\
What you want is the aggregation $filter operator, like so:
db[collection].aggregate([
{
"$match": {  # only fetch documents with such posts
"posts.published": {"$eq": True}
}
},
{
"$project": {
"posts": {
"$filter": {
"input": "$posts",
"as": "post",
"cond": {"$eq": ["$$post.published", True]}
}
}
}
}
])
Note that the structure currently returned will be:
[
{posts: [post1, post2]},
{posts: [post3, post4]}
]
If you want a flat list of posts instead, add an $unwind stage to flatten the array.
The options on a plain query are quite limited: you can use the $elemMatch projection or the positional $ operator, but both return only the first post that matches the condition, which is not what you want.
------- EDIT --------
Realizing posts is actually an object and not an array, you'll have to turn the object to an array, iterate over to filter and then restore the structure like so:
db.collection.aggregate([
{
$project: {
"posts": {
"$arrayToObject": {
$filter: {
input: {
"$objectToArray": "$posts"
},
as: "post",
cond: {
$eq: [
"$$post.v.published",
true
]
}
}
}
}
}
}
])
Mongo Playground
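The object-to-array round trip above has a close analogue in plain Python, which may help when reading the pipeline. This is only a local sketch; the function name keep_published and the sample document are assumptions for illustration.

```python
def keep_published(posts):
    """Keep only the entries of a posts mapping whose value has published == True."""
    # items() plays the role of $objectToArray, the condition that of $filter,
    # and building a dict again that of $arrayToObject.
    return {key: post for key, post in posts.items() if post.get("published") is True}


# Hypothetical document where "posts" is an object keyed by post id.
document = {
    "posts": {
        "p1": {"published": False, "title": "draft"},
        "p2": {"published": True, "title": "first"},
        "p3": {"published": True, "title": "second"},
    }
}

filtered = {"posts": keep_published(document["posts"])}
print(filtered)
```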
I assumed that your document looks like this:
{
"_id" : ObjectId("5f8570f8afdefd2cfe7473a7"),
"posts" : {
"a" : {
"p" : false,
"name" : "abhishek"
},
"k" : {
"p" : true,
"name" : "jack"
},
"c" : {
"p" : true,
"name" : "abhinav"
}
}}
You can try the following query. The result format will be a bit different, so I've added it below for clarification:
db.getCollection('temp2').aggregate([
{
$project: {
subPost: { $objectToArray: "$posts" }
}
},
{
'$unwind' : '$subPost'
},
{
'$match' : {'subPost.v.p':true}
},
{
'$group': {_id:'$_id', subPosts: { $push: { subPost: "$subPost"} }}
}
])
result format,
{
"_id" : ObjectId("5f8570f8afdefd2cfe7473a7"),
"subPosts" : [
{
"subPost" : {
"k" : "k",
"v" : {
"p" : true,
"name" : "jack"
}
}
},
{
"subPost" : {
"k" : "c",
"v" : {
"p" : true,
"name" : "abhinav"
}
}
}
]
}

python mongodb $match and $group

I want to write a simple query that gives me the user with the most followers who is in the Brasilia time zone and has tweeted 100 or more times.
This is my pipeline:
pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
{"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
{"$sort":{"count":-1}} ]
I adapted it from a practice problem.
This was given as an example for the structure :
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Can anybody spot my mistakes?
Suppose you have a couple of sample documents as a minimal test case. Insert the test documents into a collection in the mongo shell:
db.collection.insert([
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"user" : {
"friends_count" : 145,
"statuses_count" : 457,
"screen_name" : "Catherinemull",
"time_zone" : "Brasilia",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong(22819398300)
},
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"user" : {
"friends_count" : 145,
"statuses_count" : 12334,
"time_zone" : "Brasilia",
"screen_name" : "marble",
"followers_count" : 2597,
"id" : 37486278
},
"id" : NumberLong(22819398301)
}])
For you to get the user with the most followers that is in the timezone "Brasilia" and has tweeted 100 or more times, this pipeline achieves the desired result but doesn't use the $group operator:
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gt":99
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marble',
u'tweets': 12334}]}
The following aggregation pipeline will also give you the desired result. The first stage is the $match operator, which keeps the documents where the user's time_zone is "Brasilia" and the tweet count (statuses_count) is greater than or equal to 100, using the $gte comparison operator.
The second stage is the $group operator, which groups the filtered documents by the $user.id field and applies the $max accumulator to $user.followers_count to get the greatest number of followers for each user. The system variable $$ROOT, which references the top-level document currently being processed, is collected into an extra array field named data for later use. This is done with the $addToSet accumulator.
The next stage, $unwind, outputs one document for each element in the data array for processing in the next step.
The following stage, $project, then reshapes each document by adding new fields whose values come from the previous stage.
The last two stages, $sort and $limit, reorder the document stream by the sort key followers and return the single document for the user with the highest number of followers.
Your final aggregation pipeline should thus look like this:
db.collection.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
Executing this in Robomongo gives you the result
/* 0 */
{
"result" : [
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"followers" : 2597,
"screen_name" : "marble",
"tweets" : 12334
}
],
"ok" : 1
}
In python, the implementation should be essentially the same:
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
... {"$unwind": "$data"},
... {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
... {"$sort": { "followers": -1 }},
... {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
... print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>
where
pipeline = [
{"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
{"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
{"$unwind": "$data"},
{"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
{"$sort": { "followers": -1 }},
{"$limit" : 1}
]
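To sanity-check the expected answer without a running server, the same selection can be sketched in plain Python over the two test documents inserted earlier. This mirrors the $match, $sort and $limit logic only; it is not a replacement for the pipeline.

```python
# The two test documents from this answer, reduced to the fields that matter.
docs = [
    {"user": {"statuses_count": 457, "screen_name": "Catherinemull",
              "time_zone": "Brasilia", "followers_count": 169, "id": 37486277}},
    {"user": {"statuses_count": 12334, "screen_name": "marble",
              "time_zone": "Brasilia", "followers_count": 2597, "id": 37486278}},
]

# Equivalent of $match: time zone "Brasilia" and at least 100 tweets.
eligible = [
    d["user"] for d in docs
    if d["user"]["time_zone"] == "Brasilia" and d["user"]["statuses_count"] >= 100
]

# Equivalent of $sort by followers descending plus $limit 1.
top = max(eligible, key=lambda user: user["followers_count"], default=None)
print(top["screen_name"], top["followers_count"])
```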

How to print minimum result in MongoDB

MongoDB noob here...
So, I'm trying to print out the minimum value score inside a collection that looks like this...
> db.students.find({'_id': 1}).pretty()
{
"_id" : 1,
"name" : "Aurelia Menendez",
"scores" : [
{
"type" : "exam",
"score" : 60.06045071030959
},
{
"type" : "quiz",
"score" : 52.79790691903873
},
{
"type" : "homework",
"score" : 71.76133439165544
},
{
"type" : "homework",
"score" : 34.85718117893772
}
]
}
The incantation I'm using is as such...
db.students.aggregate(
// Initial document match (uses index, if a suitable one is available)
{ $match: {
_id : 1
}},
// Expand the scores array into a stream of documents
{ $unwind: '$scores' },
// Filter to 'homework' scores
{ $match: {
'scores.type': 'homework'
}},
// grab the minimum value score
{ $match: {
'scores.min.score': 1
}}
)
The output I'm getting is this...
{ "result" : [ ], "ok" : 1 }
What am I doing wrong?
You've got the right idea, but in the last step of the aggregation what you want to do is group all the scores by student and find the $min value.
Change the last pipeline operation to:
{ $group: {
_id: "$_id",
minScore: {$min: "$scores.score"}
}}
The full pipeline, which here also computes the maximum, then becomes:
> db.students.aggregate([
{ $unwind: "$scores" },
{ $match: {"scores.type": "homework"} },
{ $group: {
_id: "$_id",
maxScore: { $max: "$scores.score" },
minScore: { $min: "$scores.score" }
}}
]);
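As a quick check of the expected numbers, here is the same filter-then-min logic in plain Python, using the student document from the question. It runs locally and only illustrates what the $match and $group stages compute.

```python
# The student document from the question.
student = {
    "_id": 1,
    "name": "Aurelia Menendez",
    "scores": [
        {"type": "exam", "score": 60.06045071030959},
        {"type": "quiz", "score": 52.79790691903873},
        {"type": "homework", "score": 71.76133439165544},
        {"type": "homework", "score": 34.85718117893772},
    ],
}

# Equivalent of $unwind followed by $match on scores.type.
homework = [s["score"] for s in student["scores"] if s["type"] == "homework"]

# Equivalent of $group with $min and $max for this one student.
summary = {"_id": student["_id"], "minScore": min(homework), "maxScore": max(homework)}
print(summary)
```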