MongoDB query using "$gt" - python

I have this task: "Write a MongoDB query to find the count of movies released after the year 1999". I have tried the different lines of code below, but none of them work. Any thoughts?
PS: the collection is named movies, and its fields are the year and _id of the movies.
These are the lines I'm trying:
docs = db.movies.find({"year":{"$gt":"total"("1999")}}).count()
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"$1999"}}}])
docs = db.movies.count( {"year": { "$gt": "moviecount"("1999") } } )
docs = db.movies.find({"year":{"$gt":"1999"}})
docs = db.movies.aggregate([{"$group":{"_id":"$year","count":{"$gt":"1999"}}}])

You can do it with an aggregation pipeline:
[
  {
    "$match": {
      "year": {
        "$gt": "1999"
      }
    }
  },
  {
    "$group": {
      "_id": 1,
      "count": {
        "$sum": "$total"
      }
    }
  }
]
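The first stage of the pipeline is $match, which keeps only the documents with a year greater than 1999. Note that "1999" is a string here, so the comparison only works if year is stored as a string; if it is stored as a number, use 1999 without the quotes.
Then the $group stage sums the total field of all the matched documents. If your documents have no total field and you just want to count them, use "$sum": 1 instead.
The "_id": 1 is a dummy value: we are not grouping on any particular field, we just want a single sum over all matched documents.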
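Since the question is about Python, the same count can also be sketched with PyMongo. This is a minimal sketch, not the pipeline above: it assumes year is stored as a number (pass "1999" instead if it is stored as a string), and the helper names build_year_filter and count_movies_after are made up for illustration. count_documents is the PyMongo method that replaced the deprecated count().

```python
def build_year_filter(year):
    """Return a filter matching documents whose "year" is greater than `year`."""
    return {"year": {"$gt": year}}


def count_movies_after(collection, year):
    """Count the matching documents; `collection` is expected to be a pymongo Collection."""
    return collection.count_documents(build_year_filter(year))


# Hypothetical usage, assuming a MongoDB server on localhost and a database named "test":
#   from pymongo import MongoClient
#   db = MongoClient()["test"]
#   print(count_movies_after(db.movies, 1999))
```

Keeping the filter in its own function makes it easy to check the query shape without a running server.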

Related

How to paginate subdocuments in a MongoDB collection?

I have a MongoDB collection with the following data structure:
[
{
"_id": "1",
"name": "businessName1",
"reviews": [
{
"_id": "1",
"comment": "comment1",
},
{
"_id": "2",
"comment": "comment1",
},
...
]
}
]
As you can see, the reviews for each business are a subdocument within the collection, where businessName1 has a total of 2 reviews. In my real MongoDB collection, each business has 100s of reviews. I want to view only 10 on one page using pagination.
I currently have a find_one() function in Python that retrieves this single business, but it also retrieves all of its reviews as well.
businesses.find_one( \
{ "_id" : ObjectId(1) }, \
{ "reviews" : 1, "_id" : 0 } )
I'm aware of the skip() and limit() methods in Python, where you can limit the number of results that are retrieved, but as far as I'm aware, you can only perform these methods on the find() method. Is this correct?
Option 1: You can use $slice for pagination as follows. Here [3, 5] means skip the first 3 reviews and return the next 5:
db.collection.find({
  _id: 1
},
{
  _id: 0,
  reviews: {
    $slice: [
      3,
      5
    ]
  }
})
Option 2: Or use an aggregation, which can also return the total array size and is probably the better choice for pagination:
db.collection.aggregate([
  {
    $project: {
      _id: 0,
      reviews: {
        $slice: [
          "$reviews",
          3,
          5
        ]
      },
      total: {
        $size: "$reviews"
      }
    }
  }
])
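To turn a page number into the [skip, limit] pair that $slice expects, a small helper can do the arithmetic. This is a sketch in Python, since the question uses PyMongo; the function names, the 1-based page numbering and the default page size of 10 are assumptions, and collection is expected to be a PyMongo collection such as the question's businesses.

```python
def slice_args(page, page_size=10):
    """Return [skip, limit] for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return [(page - 1) * page_size, page_size]


def reviews_page(collection, business_id, page, page_size=10):
    """Fetch one page of the embedded reviews array using a $slice projection."""
    return collection.find_one(
        {"_id": business_id},
        {"_id": 0, "reviews": {"$slice": slice_args(page, page_size)}},
    )
```

For example, slice_args(2) gives [10, 10]: skip the first 10 reviews and return the next 10.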

How to paginate terms aggregation results in Elasticsearch

I've been trying to figure out a way to paginate the results of a terms aggregation in Elasticsearch and so far I have not been able to achieve the desired result.
Here's the problem I am trying to solve. In my index, I have a bunch of documents that have a score (separate to the ES _score) that is calculated based on the values of the other fields in the document. Each document "belongs" to a customer, referenced by the customer_id field. The document also has an id, referenced by the doc_id field, and is the same as the ES meta-field _id. Here is an example.
{
'_id': '1',
'doc_id': '1',
'doc_score': '85',
'customer_id': '123'
}
For each customer_id there are multiple documents, all with different document ids and different scores. What I want to be able to do is, given a list of customer ids, return the top document for each customer_id (only 1 per customer) and be able to paginate those results similar to the size, from method in the regular ES search API. The field that I want to use for the document score is the doc_score field.
So far, in my current Python script, I've tried a nested aggregation with a "top hits" aggregation to get only the top document for each customer.
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "match_all": {}
                },
                {
                    "terms": {
                        "customer_id": customer_ids  # a list of the customer ids I want documents for
                    }
                },
                {
                    "exists": {
                        "field": "score"  # sometimes it's possible a document does not have a score
                    }
                }
            ]
        }
    },
    "aggs": {
        "customers": {
            "terms": {"field": "customer_id", "min_doc_count": 1},
            "aggs": {
                "top_documents": {
                    "top_hits": {
                        "sort": [
                            {"score": {"order": "desc"}}
                        ],
                        "size": 1
                    }
                }
            }
        }
    }
}
I then "paginate" by going through each customer bucket, appending the top document blob to a list and then sorting the list based on the value of the score field and finally taking a slice documents_list[from:from+size].
The issue with this is that, say I have 500 customers in the list but I only want the 2nd 20 documents, i.e. size = 20, from=20. So each time I call the function I have to first get the list for each of the 500 customers and then slice. This sounds very inefficient and is also a speed issue, since I need that function to be as fast as I can possibly make it.
Ideally, I could just get the 2nd 20 directly from ES without having to do any slicing in my function.
I have looked into Composite aggregations that ES offers, but it looks to me like I would not be able to use it in my case, since I need to get the entire doc, i.e. everything in the _source field in the regular search API response.
I would greatly appreciate any suggestions.
The best way to do this would be to use partitions: the terms are split into num_partitions groups, and each request processes only one of them.
According to the documentation:
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 1,
          "num_partitions": 25
        },
        "size": 20,
        "order": {
          "last_access": "asc"
        }
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "access_date"
          }
        }
      }
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
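Applied to the question, a request body for one partition might be built like this in Python. It is only a sketch: the helper name, the default of 25 partitions and the best_score sub-aggregation are assumptions, the field names (customer_id, score) are taken from the question's query, and score is assumed to be mapped as a numeric field so that max can be applied to it.

```python
def partitioned_top_docs_body(customer_ids, partition, num_partitions=25, size=20):
    """Build a search body returning one partition of customer buckets,
    each with its single highest-scoring document."""
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in [0, num_partitions)")
    return {
        "size": 0,
        "query": {"terms": {"customer_id": customer_ids}},
        "aggs": {
            "customers": {
                "terms": {
                    "field": "customer_id",
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": size,
                    # Order the buckets by each customer's best score.
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    "best_score": {"max": {"field": "score"}},
                    "top_documents": {
                        "top_hits": {"sort": [{"score": {"order": "desc"}}], "size": 1}
                    },
                },
            }
        },
    }
```

The caller would send one body per partition number, from 0 to num_partitions - 1. Keep in mind that the ordering applies within a partition, so this is a way to process all customers in chunks rather than a globally sorted page.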

PyMongo aggregate by taking most recent value of field

I would like to group my documents and for certain fields take the value of the record with the most recent timestamp (i.e. most recently inserted/updated value). In the example below, I want to group by user ID and phone, and take the email of the record with the most recent timestamp in that group. My initial strategy is to sort by descending timestamp and take the first value for an aggregation like so:
import pymongo
...
pipeline = [
{
"$sort": {"timestamp": -1 }
},
{ "$group": {
"_id": {
"userId": "$userId",
"userPhone": "$userPhone",
"userEmail": { "$first" : "$userEmail"},
"count": {"$sum": 1}
}
}
]
However I run into the following error:
pymongo.errors.OperationFailure: Unrecognized expression '$first'
Is there an equivalent $first function available for pymongo?
Your pipeline syntax is incorrect: the accumulators ($first, $sum) must be top-level fields of the $group stage, not nested inside _id.
Something like:
pipeline = [
{ "$sort": {"timestamp": -1 } },
{ "$group": { "_id": { "userId": "$userId", "userPhone": "$userPhone" }, "userEmail": { "$first" : "$userEmail"}, "count": {"$sum": 1} } }
]

Group by and filter max(date) between two dates in Elasticsearch

Currently we are able to group by customer_id in Elasticsearch.
The following is the document structure:
{
  "order_id":"6",
  "customer_id":"1",
  "customer_name":"shailendra",
  "mailing_addres":"shailendra#gmail.com",
  "actual_order_date":"2000-04-30",
  "is_veg":"0",
  "total_amount":"2499",
  "store_id":"276",
  "city_id":"12",
  "payment_mode":"cod",
  "is_elite":"0",
  "product":["1","2"],
  "coupon_id":"",
  "client_source":"1",
  "vendor_id":"",
  "vendor_name":"",
  "brand_id":"",
  "third_party_source":""
}
Now we need to filter each group to find the documents where:
the last order date is between two dates
the first order date is between two dates
How can we achieve this?
You can try the query below. Within each customer bucket, we further filter the documents to those between two dates (here I've taken the month of August 2016) and then run a stats aggregation on the date field. The min value will be the first order date and the max value will be the last order date.
{
  "aggs": {
    "customer_ids": {
      "terms": {
        "field": "customer_id"
      },
      "aggs": {
        "date_filter": {
          "filter": {
            "range": {
              "actual_order_date": {
                "gt": "2016-08-01T00:00:00.000Z",
                "lt": "2016-09-01T00:00:00.000Z"
              }
            }
          },
          "aggs": {
            "min_max": {
              "stats": {
                "field": "actual_order_date"
              }
            }
          }
        }
      }
    }
  }
}

Return the number of documents each month in MongoDB

I have a collection in my MongoDB with the following structure:
{
"_id" : "17812",
"date" : ISODate("2014-03-26T18:48:20Z"),
"text" : "............."
}
I want to use pyMongo to return a list with the number of documents per month, from a certain date until now. The way I am doing it now is to retrieve all the documents without filters and then build the list in Python. Is there a way to do this with a pymongo query alone?
To get the counts for each month, you can use the date operators in the aggregation pipeline. You can also combine them with a $match stage to restrict the results to the range of dates you want.
var start = new Date("2013-06-01");
var today = new Date();
db.collection.aggregate([
    { "$match": {
        "date": { "$gte": start, "$lt": today }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$date" },
            "month": { "$month": "$date" },
        },
        "count": { "$sum": 1 }
    }}
])
Take a look at the aggregation framework in MongoDB which is ideal for this kind of task. It can be used from pyMongo with the aggregate method on a collection.
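Since the question asks about pyMongo, the shell pipeline above translates almost directly into Python. This is a sketch with some assumptions: the function names are made up, collection is expected to be a PyMongo collection, and a $sort stage has been added so that the months come back in chronological order.

```python
from datetime import datetime, timezone


def monthly_count_pipeline(start, end):
    """Build a pipeline counting documents per (year, month) for start <= date < end."""
    return [
        {"$match": {"date": {"$gte": start, "$lt": end}}},
        {"$group": {
            "_id": {"year": {"$year": "$date"}, "month": {"$month": "$date"}},
            "count": {"$sum": 1},
        }},
        # Added for convenience: return the months in chronological order.
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]


def monthly_counts(collection, start):
    """Run the pipeline from `start` until now and return the result documents as a list."""
    now = datetime.now(timezone.utc)
    return list(collection.aggregate(monthly_count_pipeline(start, now)))
```

A call such as monthly_counts(db.collection, datetime(2013, 6, 1)) would then return one document per month, each holding the year, the month and the count.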
