How to paginate subdocuments in a MongoDB collection? - python

I have a MongoDB collection with the following data structure:
[
    {
        "_id": "1",
        "name": "businessName1",
        "reviews": [
            {
                "_id": "1",
                "comment": "comment1",
            },
            {
                "_id": "2",
                "comment": "comment1",
            },
            ...
        ]
    }
]
As you can see, the reviews for each business are a subdocument within the collection; businessName1 has a total of 2 reviews. In my real MongoDB collection, each business has hundreds of reviews, and I want to show only 10 per page using pagination.
I currently have a find_one() call in Python that retrieves this single business, but it also retrieves all of its reviews:
businesses.find_one(
    { "_id": ObjectId(1) },
    { "reviews": 1, "_id": 0 }
)
I'm aware of the skip() and limit() cursor methods in PyMongo, which limit the number of results retrieved, but as far as I know they can only be applied to find() cursors. Is this correct?

Option 1: You can use $slice for pagination as follows:
db.collection.find(
    { _id: 1 },
    { _id: 0, reviews: { $slice: [3, 5] } }
)
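From pymongo, a minimal sketch of the same projection (the database/collection names and the 0-based page convention here are assumptions):
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
businesses = client["mydb"]["businesses"]  # hypothetical db/collection names

def get_reviews_page(business_id, page, page_size=10):
    # $slice in the projection takes [skip, limit]: skip page*page_size
    # reviews, then return the next page_size of them
    return businesses.find_one(
        {"_id": business_id},
        {"_id": 0, "reviews": {"$slice": [page * page_size, page_size]}},
    )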
Option 2: Via aggregation, which can also return the total array size (handy for computing the number of pages):
db.collection.aggregate([
    {
        $project: {
            _id: 0,
            reviews: { $slice: ["$reviews", 3, 5] },
            total: { $size: "$reviews" }
        }
    }
])
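And a sketch of calling the aggregation variant from pymongo, reusing the businesses handle from above:
def get_reviews_page_with_total(business_id, page, page_size=10):
    pipeline = [
        {"$match": {"_id": business_id}},
        {"$project": {
            "_id": 0,
            "reviews": {"$slice": ["$reviews", page * page_size, page_size]},
            "total": {"$size": "$reviews"},  # overall review count for page links
        }},
    ]
    # aggregate returns a cursor; the $match on _id yields at most one document
    return next(businesses.aggregate(pipeline), None)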


MongoDB query using "$gt"

I have this task: "Write a MongoDB query to find the count of movies released after the year 1999". I'm trying to do it with the different lines of code below, but none of them works. Any thoughts?
PS: the collection's name is movies; its fields are the year and _id of the movies.
These are the lines I'm trying:
docs = db.movies.find({"year":{"$gt":"total"("1999")}}).count()
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"$1999"}}}])
docs = db.movies.count( {"year": { "$gt": "moviecount"("1999") } } )
docs = db.movies.find({"year":{"$gt":"1999"}})
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"1999"}}}])
You can do it with an aggregation:
[
    {
        "$match": {
            "year": { "$gt": 1999 }
        }
    },
    {
        "$group": {
            "_id": 1,
            "count": { "$sum": 1 }
        }
    }
]
The first stage of the pipeline is $match; it keeps only the documents with a year greater than 1999 (use the number 1999 if the field is numeric, the string "1999" only if years are stored as strings).
Then the $group stage adds 1 to count for every matching document.
The "_id": 1 is a dummy value because we are not grouping on any particular field; we just want a single overall count.

Get field value in MongoDB without parent object name

I'm trying to find a way to retrieve some data from MongoDB through Python scripts,
but I got stuck in the following situation:
I have to retrieve some documents, check a field value, and compare it with other documents.
But the object's name may vary from document to document; see below:
Document 1
{
    "_id": "001",
    "promotion": {
        "Avocado": {
            "id": "01",
            "timestamp": "202005181407"
        },
        "Banana": {
            "id": "02",
            "timestamp": "202005181407"
        }
    },
    "product": {
        "id": "11"
    }
}
Document 2
{
    "_id": "002",
    "promotion": {
        "Grape": {
            "id": "02",
            "timestamp": "202005181407"
        },
        "Dragonfruit": {
            "id": "02",
            "timestamp": "202005181407"
        }
    },
    "product": {
        "id": "15"
    }
}
I'll always have an object called promotion, but the children's names may vary; sometimes a child's name is an ordered number, sometimes it is not. The field whose value I need is the id inside each child of promotion; it will always have the same name.
So if a document matches the criteria, I'll retrieve it with Python and get the rest of the work done.
PS: I'm not the one responsible for this kind of document structure.
I've already tried these docs, but couldn't get them to work the way I need.
$all
$elemMatch
Try this aggregation pipeline from Python:
[
    {
        '$addFields': {
            'fruits': { '$objectToArray': '$promotion' }
        }
    },
    {
        '$addFields': {
            'FruitIds': '$fruits.v.id'
        }
    },
    {
        '$project': {
            '_id': 0,
            'FruitIds': 1
        }
    }
]
Output produced:
{FruitIds:["01","02"]},
{FruitIds:["02","02"]}
Is this the desired output?
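If it is, a minimal pymongo call for this pipeline could look like the following sketch (the db.collection handle is an assumption):
pipeline = [
    {"$addFields": {"fruits": {"$objectToArray": "$promotion"}}},
    {"$addFields": {"FruitIds": "$fruits.v.id"}},  # collect every child's id
    {"$project": {"_id": 0, "FruitIds": 1}},
]
for doc in db.collection.aggregate(pipeline):
    print(doc["FruitIds"])  # e.g. ["01", "02"]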

Why is the 'from' keyword not recognized when paginating an aggregation in Elasticsearch?

I am trying to paginate an aggregation 50 results at a time, so I gave it a try with the code below.
"aggs": {
"source_list": {
"terms": {
"field": "source.keyword",
"from": 0,
"size": 50,
},
},
},
This seemed pretty straightforward, but instead I hit a wall with the following error:
{"detail":"RequestError(400, 'x_content_parse_exception', '[1:59] [terms] unknown field [from]')"}
Pagination within an aggregation is not supported in Elasticsearch.
Since only size is supported, you have to remove the from param from the aggs query. If the total number of buckets is reasonable, just increase size to cover them all. Otherwise you could try partitioning the aggregation.
For example :
"aggs": {
"source_list": {
"terms": {
"field": "source.keyword",
"size": 50,
"include": {
"partition": 0,
"num_partitions": 10
}
},
},
}
Pick a value for num_partitions that breaks the terms up into manageable chunks, and a size for the number of buckets you want back from each partition.
Source: Elasticsearch docs, "Filtering values with partitions"
You can only paginate the returned search results, not the aggregation:
{
    "query": {
        ....
    },
    "from": 0,
    "size": 50,
    "aggs": {
        ....
    }
}
from and size as used in the query are not available inside aggregations.
You can use the options below to paginate through aggregations:
Composite aggregation: combines multiple sources into single buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3; you fetch "n" records, then pass the returned after_key to fetch the next "n" records.
GET index22/_search
{
    "size": 0,
    "aggs": {
        "pagination": {
            "composite": {
                "size": 1,
                "sources": [
                    {
                        "source_list": {
                            "terms": {
                                "field": "source.keyword"
                            }
                        }
                    }
                ]
            }
        }
    }
}
Result:
"aggregations": {
    "pagination": {
        "after_key": {
            "source_list": "a"    --> used to fetch the next records linearly
        },
        "buckets": [
            {
                "key": {
                    "source_list": "a"
                },
                "doc_count": 1
            }
        ]
    }
}
To fetch the next records:
{
    "size": 0,
    "aggs": {
        "pagination": {
            "composite": {
                "size": 1,
                "after": { "source_list": "a" },
                "sources": [
                    {
                        "source_list": {
                            "terms": {
                                "field": "source.keyword"
                            }
                        }
                    }
                ]
            }
        }
    }
}
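From Python, the after_key loop could be sketched like this (the index name, field name, and cluster address are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a locally reachable cluster

def iter_source_buckets(index="index22", page_size=50):
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"source_list": {"terms": {"field": "source.keyword"}}}],
        }
        if after_key is not None:
            composite["after"] = after_key  # resume after the last bucket seen
        resp = es.search(index=index, body={
            "size": 0,
            "aggs": {"pagination": {"composite": composite}},
        })
        agg = resp["aggregations"]["pagination"]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if after_key is None:  # no after_key means this was the last page
            break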
Include partitions: groups the field's values into a number of partitions at query time and processes only one partition in each request. Terms are distributed evenly across partitions, so you must know the number of terms beforehand; you can use a cardinality aggregation to get that count.
GET index/_search
{
    "size": 0,
    "aggs": {
        "source_list": {
            "terms": {
                "field": "source.keyword",
                "include": {
                    "partition": 1,
                    "num_partitions": 3
                }
            }
        }
    }
}
Bucket sort aggregation: sorts the buckets of its parent multi-bucket aggregation. Each bucket may be sorted based on its _key, _count, or its sub-aggregations. It only applies to buckets returned by the parent aggregation, so you need to set the terms size to 10,000 (the max value) and truncate the buckets in bucket_sort. You can paginate using from and size just like in a query. If you have more than 10,000 terms you won't be able to use this approach, since it only selects from the buckets the terms aggregation returns.
GET index/_search
{
    "size": 0,
    "aggs": {
        "source_list": {
            "terms": {
                "field": "source.keyword",
                "size": 10000    --> use a large value to get all terms
            },
            "aggs": {
                "my_bucket": {
                    "bucket_sort": {
                        "sort": [
                            {
                                "_key": {
                                    "order": "asc"
                                }
                            }
                        ],
                        "from": 1,
                        "size": 1
                    }
                }
            }
        }
    }
}
In terms of performance, the composite aggregation is the better choice.

MongoDB Query in Pymongo

I have a collection in this format:
{
    "name": ....,
    "users": [...., ...., ...., ....]
}
I have two different names and I want to find the total number of users that belong to both documents. At the moment I do it in Python: I fetch the document for name 1 and the document for name 2, then check how many users appear in both. I was wondering whether there is a way to do it with MongoDB alone and just return the number.
Example:
{
    "name": "John",
    "users": ["001", "003", "008", "010"]
}
{
    "name": "Peter",
    "users": ["002", "003", "004", "005", "006", "008"]
}
The result would be 2, since users 003 and 008 belong to both documents.
How I do it:
doc1 = db.collection.find_one({"name": "John"})
doc2 = db.collection.find_one({"name": "Peter"})
total = 0
for user in doc1["users"]:
    if user in doc2["users"]:
        total += 1
You could also do this with the aggregation framework. I think it only really pays off when you are doing this over more than two names, though you can certainly use it this way:
db.collection.aggregate([
    { "$match": {
        "name": { "$in": [ "John", "Peter" ] }
    }},
    { "$unwind": "$users" },
    { "$group": {
        "_id": "$users",
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }},
    { "$group": {
        "_id": null,
        "count": { "$sum": 1 }
    }}
])
That lets you compute the same kind of count over whichever names you supply to $in in the $match stage.
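Run from pymongo, the same pipeline could look like this sketch (collection handle as in the question):
pipeline = [
    {"$match": {"name": {"$in": ["John", "Peter"]}}},
    {"$unwind": "$users"},
    {"$group": {"_id": "$users", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "count": {"$sum": 1}}},
]
result = list(db.collection.aggregate(pipeline))
shared = result[0]["count"] if result else 0  # 2 for the example documents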

Pymongo: Remove an element from array

I'm trying to remove the lowest-priced iPad from the items in my collection. I know how to find it using pymongo, but I don't know how to remove it.
Here's my schema:
{
    "_id": "sjobs",
    "items": [
        {
            "type": "iPod",
            "price": 20.00
        },
        {
            "type": "iPad",
            "price": 399.99
        },
        {
            "type": "iPad",
            "price": 199.99
        },
        {
            "type": "iPhone 5",
            "price": 300.45
        }
    ]
}
{
    "_id": "bgates",
    "items": [
        {
            "type": "MacBook",
            "price": 2900.99
        },
        {
            "type": "iPad",
            "price": 399.99
        },
        {
            "type": "iPhone 4",
            "price": 100.00
        },
        {
            "type": "iPad",
            "price": 99.99
        }
    ]
}
I've got a Python loop that finds the lowest sale price for an iPad:
cursor = db.sales.find({'items.type': 'iPad'}).sort([('items', pymongo.DESCENDING)])
for doc in cursor:
    cntr = 0
    for item in doc['items']:
        if item['type'] == 'iPad' and resetCntr == 0:
            cntr = 1
            sales.update(doc, {'$pull': {'items': {item['type']}}})
That doesn't work. What do I need to do to remove lowest iPad price item?
Your Python code isn't doing what you think it's doing (unless there is a lot of it you didn't include). You don't need to do the sorting and iterating on the client side - you should make the server do the work. Run this aggregation pipeline (I'm giving shell syntax, you can call it from your Python, of course):
> r = db.sales.aggregate([
      { "$match": { "items.type": "iPad" } },
      { "$unwind": "$items" },
      { "$match": { "items.type": "iPad" } },
      { "$group": {
          "_id": "$_id",
          "lowest": { "$min": "$items.price" },
          "count": { "$sum": 1 }
      }},
      { "$match": { "count": { "$gt": 1 } } }
  ]);
{
    "result": [
        {
            "_id": "bgates",
            "lowest": 99.99,
            "count": 2
        },
        {
            "_id": "sjobs",
            "lowest": 199.99,
            "count": 2
        }
    ],
    "ok": 1
}
Now you can iterate over the "r.result" array and execute your update:
db.sales.update(
    { "_id": r.result[0]._id },
    { "$pull": { "items": { "type": "iPad", "price": r.result[0].lowest } } }
);
Note that I only include documents which have more than one iPad, since otherwise you may end up deleting the only iPad record in the array. If you want to delete all non-highest prices instead, you'd find the max and $pull all the elements $lt that price.
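For the Python side, a sketch of the same two steps with pymongo (update_one is the modern method name) could be:
pipeline = [
    {"$match": {"items.type": "iPad"}},
    {"$unwind": "$items"},
    {"$match": {"items.type": "iPad"}},
    {"$group": {"_id": "$_id",
                "lowest": {"$min": "$items.price"},
                "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for r in db.sales.aggregate(pipeline):
    # pull only the iPad entry at this document's lowest price
    db.sales.update_one(
        {"_id": r["_id"]},
        {"$pull": {"items": {"type": "iPad", "price": r["lowest"]}}},
    )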
Disclaimer: the code below is not tested, as I do not have mongo installed locally. However, I did take my time writing it, so I'm pretty confident it's close to working.
def remove_lowest_price(collection):
    cursor = collection.find({}, {'items': 1})
    for doc in cursor:
        lowest = float('inf')  # sentinel larger than any real price
        for item in doc['items']:
            if item['type'] == 'iPad' and item['price'] < lowest:
                lowest = item['price']
        # lowest now contains the price of the cheapest iPad, if there was one
        if lowest != float('inf'):
            collection.update_one(
                {'_id': doc['_id']},
                {'$pull': {'items': {'price': lowest}}}
            )
Of course there will be a problem here if another item happens to have exactly the same price, but I think it will be easy to improve from here (for instance, by also matching on 'type' in the $pull).
{'$pull': {'items': {item['type']}}}
This doesn't look like a valid update document, does it? (In Python, {item['type']} is a set literal, not a dict.)
shouldn't be "sales.update(...)" be "db.sales.update(...)" in your example?
Also, it's probably better to pass a query to the update operation:
db.sales.update({'_id': doc['_id']}, ...)
rather than the entire doc.
And finally, the update body itself should be:
{'$pull': {'items': {'type': item['type']}}}
