I want to write a simple query that gives me the user with the most followers that has the timezone brazil and has tweeted 100 or more times:
this is my line :
pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
{"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
{"$sort":{"count":-1}} ]
I adapted it from a practice problem.
This was given as an example for the structure :
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Can anybody spot my mistakes?
Suppose you have a couple of sample documents with the minimum test case. Insert the test documents to a collection in mongoshell:
db.collection.insert([
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"user" : {
"friends_count" : 145,
"statuses_count" : 457,
"screen_name" : "Catherinemull",
"time_zone" : "Brasilia",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong(22819398300)
},
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"user" : {
"friends_count" : 145,
"statuses_count" : 12334,
"time_zone" : "Brasilia",
"screen_name" : "marble",
"followers_count" : 2597,
"id" : 37486278
},
"id" : NumberLong(22819398301)
}])
For you to get the user with the most followers that is in the timezone "Brasilia" and has tweeted 100 or more times, this pipeline achieves the desired result but doesn't use the $group operator:
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gt":99
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marble',
u'tweets': 12334}]}
The following aggregation pipeline will will also give you the desired result. In the pipeline, the first stage is the $match operator which filters those documents where the user has got the timezone field value "Brasilia" and has a tweet count (represented by the statuses_count) greater than or equal to 100 matched via the $gte comparison operator.
The second pipeline stage has the $group operator which groups the filtered documents by the specified identifier expression which is the $user.id field and applies the accumulator expression $max to each group on the $user.followers_count field to get the greatest number of followers for each user. The system variable $$ROOT which references the root document, i.e. the top-level document, currently being processed in the $group aggregation pipeline stage, is added to an extra array field for use later on. This is achieved by using the $addToSet array operator.
The next pipeline stage $unwinds to output a document for each element in the data array for processing in the next step.
The following pipeline step, $project, then transforms each document in the stream, by adding new fields which have values from the previous stream.
The last two pipeline stages $sort and $limit reorders the document stream by the specified sort key followers and returns one document which contains the user with the highest number of followers.
You final aggregation pipeline thus should look like this:
db.collection.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
Executing this in Robomongo gives you the result
/* 0 */
{
"result" : [
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"followers" : 2597,
"screen_name" : "marble",
"tweets" : 12334
}
],
"ok" : 1
}
In python, the implementation should be essentially the same:
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROO
T" }}},
... {"$unwind": "$data"},
... {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets":
"$data.user.statuses_count"}},
... {"$sort": { "followers": -1 }},
... {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
... print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>
where
pipeline = [
{"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
{"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
{"$unwind": "$data"},
{"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
{"$sort": { "followers": -1 }},
{"$limit" : 1}
]
Related
I'm a total beginner in PyMongo. I'm trying to find activities that are registered multiple times. This code is returning an empty list. Could you please help me in finding the mistake:
rows = self.db.Activity.aggregate( [
{ '$group':{
"_id":
{
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{'$match':
{ "count": { '$gt': 1 } }
},
{'$project':
{"_id":0,
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
]
)
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
When you pass _id: 0 in the $project stage, it will not project the sub-objects even if they are projected in the follow up, since the rule is overwritten.
Try the below $project stage.
{
'$project': {
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
rows = self.db.Activity.aggregate( [
{
'$group':{
"_id": {
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{
'$match':{
"count": { '$gt': 1 }
}
},
{
'$project': {
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1,
}
}
])
Your group criteria is likely too narrow.
The $group stage will create a separate output document for each distinct value of the _id field. The pipeline in the question will only include two input documents in the same group if they have exactly the same value in all four of those fields.
In order for a count to be greater than 1, there must exist 2 documents with the same user, mode, and exactly the same start and end.
In the same data you show, there are no two documents that would be in the same group, so all of the output documents from the $group stage would have a count of 1, and therefore none of them satisfy the $match, and the return is an empty list.
I have a field distribution in record schema that looks likes this:
...
"distribution": {
"properties": {
"availability": {
"type": "keyword"
}
}
}
...
I want to rank the records with distribution.availability == "ondemand" lower than other records.
I looked in Elasticsearch docs but can't find a way to reduce the scores of this type of records in index-time to appear lower in search results.
How can I achieve this, any pointers to related source would be enough as well.
More Info:
I was completely omitting these ondemand records with help of python client in query-time like this:
from elasticsearch_dsl.query import Q
_query = Q("query_string", query=query_string) & ~Q('match', **{'availability.keyword': 'ondemand'})
Now, I want to include these records but I want to place them lower than other records.
If it is not possible to implement something like this in index-time, please suggest how can I achieve this in query-time with python client.
After applying the suggestion from llermaly, the python client query looks like this:
boosting_query = Q(
"boosting",
positive=Q("match_all"),
negative=Q(
"bool", filter=[Q({"term": {"distribution.availability.keyword": "ondemand"}})]
),
negative_boost=0.5,
)
if query_string:
_query = Q("query_string", query=query_string) & boosting_query
else:
_query = Q() & boosting_query
EDIT2 : elasticsearch-dsl-py version of boosting query
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q
client = Elasticsearch()
q = Q('boosting', positive=Q("match_all"), negative=Q('bool', filter=[Q({"term": {"test.available.keyword": "ondemand"}})]), negative_boost=0.5)
s = Search(using=client, index="test_parths007").query(q)
response = s.execute()
print(response)
for hit in response:
print(hit.meta.score, hit.test.available)
EDIT : Just read you need to do it on index time.
Elasticsearch deprecated index time boosting on 5.0
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-boost.html
You can use a Boosting query to achieve that on query time.
Ingest Documents
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "ondemand"
}
}
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "higherscore"
}
}
POST test_parths007/_doc
{
"name": "doc2",
"test": {
"available": "higherscore"
}
}
Query (index time)
POST test_parths007/_search
{
"query": {
"boosting": {
"positive": {
"match_all": {}
},
"negative": {
"term": {
"test.available.keyword": "ondemand"
}
},
"negative_boost": 0.5
}
}
}
Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "VMdY7XcB50NMsuQPelRx",
"_score" : 1.0,
"_source" : {
"name" : "doc2",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "Vcda7XcB50NMsuQPiVRB",
"_score" : 1.0,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "U8dY7XcB50NMsuQPdlTo",
"_score" : 0.5,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "ondemand"
}
}
}
]
}
}
For more advanced manipulation you can check the Function Score Query
I have DB with my users:
{
"_id": {
"$oid": "5a0decadefcb09087c08a868"
},
"user_id": "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email": "email#email.com",
"password": "***",
"registration_date": {
"$date": "2017-11-16T19:53:17.946Z"
},
"type": "user"
},
{
"_id": {
"$oid": "5a0ded3aefcb090887d7f4fb"
},
"user_id": "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email": "second#gmail.com",
"password": "***",
"registration_date": {
"$date": "2017-11-16T19:55:38.194Z"
},
"type": "user"
}
I want to count users by each date (registration_date) and get some thing like that:
01.01.2017 – 10
01.02.2017 – 20
01.03.2017 – 15
...
I'm trying that code, but it doesn't work:
def registrations_by_date(self):
users = self.users_db.aggregate([
{'$group': {
'_id': {'registration_date':'$date'},
'count': {'$sum':1}
}},
])
return users
What i'm doing wrong? How to get this data?
If the date in your schema is of ISODate
then the below aggregate query will work, the date format is done before grouping so that the timestamp is not taken while grouping the data
{
"_id" : "5a0decadefcb09087c08a868",
"user_id" : "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email" : "email#email.com",
"password" : "***",
"registration_date" : ISODate("2017-11-16T19:53:17.946Z"),
"type" : "user"
}
{
"_id" : "5a0ded3aefcb090887d7f4fb",
"user_id" : "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email" : "second#gmail.com",
"password" : "***",
"registration_date" : ISODate("2017-11-16T19:55:38.194Z"),
"type" : "user"
}
The aggregation query to get the result is
db.userReg.aggregate([
{$project:
{ formattedRegDate:
{ "$dateToString": {format:"%Y-%m-%d", date:"$registration_date"}}
}
},
{$group:{_id:"$formattedRegDate", count:{$sum:1}}}]);
and the result is
{ "_id" : "2017-11-16", "count" : 2 }
If the date in your schema is of String
then the below approach to be used
Sample Data
{
"_id" : "5a0decadefcb09087c08a868",
"user_id" : "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email" : "email#email.com",
"password" : "***",
"registration_date" : "2017-11-16T19:53:17.946Z",
"type" : "user"
}
{
"_id" : "5a0ded3aefcb090887d7f4fb",
"user_id" : "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email" : "second#gmail.com",
"password" : "***",
"registration_date" : "2017-11-16T19:55:38.194Z",
"type" : "user"
}
Query
db.userReg.aggregate([{
$group:{ _id: { date: {"$substr":["$registration_date", 0, 10]}},
count:{$sum:1}
}
}]);
and the result is
{ "_id" : { "date" : "2017-11-16" }, "count" : 2 }
It seems you have an extra ,
db.userReg.aggregate([
{$group: {_id: "$registration_date", count: {$sum:1}}}
])
This gives the correct result(ON the basis of record on my mcahine) :
{
"_id" : ISODate("2017-11-15T19:55:38.194Z"),
"count" : 1.0 }
{
"_id" : ISODate("2017-11-16T19:55:38.194Z"),
"count" : 2.0 }
I have a mongo collection with doc as follows:-
{
"_id" : ObjectId("55a9378ee2874f0ed7b7cb7e"),
"_uid" : 10,
"impressions" : [
{
"pos" : 6,
"id" : 123,
"service" : "furniture"
},
{
"pos" : 0,
"id" : 128,
"service" : "electronics"
},
{
"pos" : 2,
"id" : 127,
"service" : "furniture"
},
{
"pos" : 2,
"id" : 125,
"service" : "electronics"
},
{
"pos" : 10,
"id" : 124,
"service" : "electronics"
}
]
},
{
"_id" : ObjectId("55a9378ee2874f0ed7b7cb7f"),
"_uid" : 11,
"impressions" : [
{
"pos" : 1,
"id" : 124,
"service" : "furniture"
},
{
"pos" : 10,
"id" : 124,
"service" : "electronics"
},
{
"pos" : 1,
"id" : 123,
"service" : "furniture"
},
{
"pos" : 21,
"id" : 122,
"service" : "furniture"
},
{
"pos" : 3,
"id" : 125,
"service" : "electronics"
},
{
"pos" : 10,
"id" : 121,
"service" : "electronics"
}
]
}
My aim is to find all the "id" in a particular "service" say "furniture" i.e to get results like this:
[122,123,124,127]
But i'm not able to figure out how to frame the condition in
db.collection_name.find()
because of the difficulty of having condition for the 'n' th element in an array, "impressions[n]":"value".
One option is to use the "id"s obtained perform aggregate operation to find impressions for each "id" for a service as suggested by the answer to this question I asked earlier:-
MapReduce in PyMongo.
But I only want the list of distinct 'id' in a service not the impressions.
Kindly help!
You need the aggregration framework for meaningful results. So much like this:
result = db.collection.aggregate([
{ "$match": {
"impressions.service": "furniture"
}},
{ "$unwind": "$impressions" },
{ "$match": {
"impressions.service": "furniture"
}},
{ "$group": {
"_id": "$impressions.id"
}}
])
Or better yet with MongoDB 2.6 or greater, which can remove the array items unmatched "prior" to $unwind with $redact:
result = db.collection.aggregate([
{ "$match": {
"impressions.service": "furniture"
}},
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$ifNull": [ "$service", "furniture" ] },
"furniture"
]
},
"then": "$$DESCEND",
"else": "$$PRUNE"
}
}},
{ "$unwind": "$impressions" },
{ "$group": {
"_id": "$impressions.id"
}}
])
Which yields:
{ "_id" : 122 }
{ "_id" : 124 }
{ "_id" : 127 }
{ "_id" : 123 }
Not a plain "list", but just transform it, therefore :
def mapper (x):
return x["_id"]
map(mapper,result)
Or:
map(lambda x: x["_id"], result)
To give you:
[122, 124, 127, 123]
If you want it "sorted" then either add a $sort stage at the end of the aggregation pipeline or sort the resulting list in code.
i am a beginner , i wrote a line for my pipeline that works, but i wanna add other information to my output, like screen name , or number of tweets.I tried to add that under $group but gave me an syntax error everytime
here is my pipeline:
def make_pipeline():
# complete the aggregation pipeline
pipeline = [
{
'$match': {
"user.statuses_count": {"$gt":99 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"followers": { "$max": "$user.followers_count" }
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
];
I am using it on this example :
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Use $first and your aggregation pipeline query as below :
db.collectionName.aggregate({
"$match": {
"user.statuses_count": {
"$gt": 99
},
"user.time_zone": "Brasilia"
}
}, {
"$sort": {
"user.followers_count": -1 // sort followers_count first
}
}, {
"$group": {
"_id": "$user.id",
"followers": {
"$first": "$user.followers_count" //use mongo $first method to get followers count or max followers count
},
"screen_name": {
"$first": "$user.screen_name"
},
"retweet_count": {
"$first": "$retweet_count"
}
}
})
Or using $limit and $project as
db.collectionName.aggregate({
"$match": {
"user.statuses_count": {
"$gt": 99
},
"user.time_zone": "Brasilia"
}
}, {
"$sort": {
"user.followers_count": -1 // sort followers_count
}
}, {
"$limit": 1 // Set limit 1 so get max followers_count document first
}, {
"$project": { // user project here
"userId": "$user.id",
"screen_name": "$user.screen_name",
"retweet_count": "$retweet_count"
}
}).pretty()
The following aggregation pipeline uses the $$ROOT system variable which references the root document, i.e. the top-level document, currently being processed in the $group aggregation pipeline stage. This is added to an array using the $addToSet operator. In the following pipeline stage, you can then $unwind the array to get the desired fields through a $project operator modifies the form of the output document:
db.tweet.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
The following pipeline also achieves the same result but doesn't use the $group operator:
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gte": 100
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('5304e2d34149692bc5172729'),
u'followers': 17209,
u'screen_name': u'AndreHenning',
u'tweets': 8219}]}