Mongodb How to add addtional information when aggregating? - python

i am a beginner , i wrote a line for my pipeline that works, but i wanna add other information to my output, like screen name , or number of tweets.I tried to add that under $group but gave me an syntax error everytime
here is my pipeline:
def make_pipeline():
# complete the aggregation pipeline
pipeline = [
{
'$match': {
"user.statuses_count": {"$gt":99 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"followers": { "$max": "$user.followers_count" }
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
];
I am using it on this example :
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}

Use $first and your aggregation pipeline query as below :
db.collectionName.aggregate({
"$match": {
"user.statuses_count": {
"$gt": 99
},
"user.time_zone": "Brasilia"
}
}, {
"$sort": {
"user.followers_count": -1 // sort followers_count first
}
}, {
"$group": {
"_id": "$user.id",
"followers": {
"$first": "$user.followers_count" //use mongo $first method to get followers count or max followers count
},
"screen_name": {
"$first": "$user.screen_name"
},
"retweet_count": {
"$first": "$retweet_count"
}
}
})
Or using $limit and $project as
db.collectionName.aggregate({
"$match": {
"user.statuses_count": {
"$gt": 99
},
"user.time_zone": "Brasilia"
}
}, {
"$sort": {
"user.followers_count": -1 // sort followers_count
}
}, {
"$limit": 1 // Set limit 1 so get max followers_count document first
}, {
"$project": { // user project here
"userId": "$user.id",
"screen_name": "$user.screen_name",
"retweet_count": "$retweet_count"
}
}).pretty()

The following aggregation pipeline uses the $$ROOT system variable which references the root document, i.e. the top-level document, currently being processed in the $group aggregation pipeline stage. This is added to an array using the $addToSet operator. In the following pipeline stage, you can then $unwind the array to get the desired fields through a $project operator modifies the form of the output document:
db.tweet.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
The following pipeline also achieves the same result but doesn't use the $group operator:
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gte": 100
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('5304e2d34149692bc5172729'),
u'followers': 17209,
u'screen_name': u'AndreHenning',
u'tweets': 8219}]}

Related

How to retrieve elasticsearch data from index based on timestamp?

I want to retrieve data from elasticsearch based on timestamp. The timestamp is in epoch_millis and I tried to retrieve the data like this:
{
"query": {
"bool": {
"must":[
{
"range": {
"TimeStamp": {
"gte": "1632844180",
"lte": "1635436180"
}
}
}
]
}
},
"size": 10
}
But the response is this:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
How can I retrieve data for a given period of time from a certain index?
The data looks like this:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "zWpMNXcBTeKmGB84eksSD",
"_score" : 1.0,
"_source" : {
"Source" : "Market",
"Category" : "electronics",
"Value" : 20,
"Price" : 45.6468,
"Currency" : "EUR",
"TimeStamp" : 1611506922000 }
Also, the result has 10.000 hits when using the _search on the index. How could I access other entries? (more than 10.000 results) and to be able to choose the desired timestamp interval.
For your first question, assume that you have the mappings like this:
{
"mappings": {
"properties": {
"Source": {
"type": "keyword"
},
"Category": {
"type": "keyword"
},
"Value": {
"type": "integer"
},
"Price": {
"type": "float"
},
"Currency": {
"type": "keyword"
},
"TimeStamp": {
"type": "date"
}
}
}
}
Then I indexed 2 sample documents (1 is yours above, but the timestamp is definitely not in your range):
[{
"Source": "Market",
"Category": "electronics",
"Value": 30,
"Price": 55.6468,
"Currency": "EUR",
"TimeStamp": 1633844180000
},
{
"Source": "Market",
"Category": "electronics",
"Value": 20,
"Price": 45.6468,
"Currency": "EUR",
"TimeStamp": 1611506922000
}]
If you really need to query using the range above, you will first need to convert your TimeStamp field to seconds (/1000), then query based on that field:
{
"runtime_mappings": {
"secondTimeStamp": {
"type": "long",
"script": "emit(doc['TimeStamp'].value.millis/1000);"
}
},
"query": {
"bool": {
"must": [
{
"range": {
"secondTimeStamp": {
"gte": 1632844180,
"lte": 1635436180
}
}
}
]
}
},
"size": 10
}
Then you will get the first document.
About your second question, by default, Elasticsearch's max_result_window is only 10000. You can increase this limit by updating the settings, but it will increase the memory usage.
PUT /index/_settings
{
"index.max_result_window": 999999
}
You should use the search_after API instead.

Mongodb aggregation with $or nested clauses

I have mongodb documents like this:
{
"_id" : ObjectId("5d35ba501545d248c383871f"),
"key1" : 1,
"currentTime" : ISODate("2019-07-18T19:41:54.000Z"),
"iState" : "START - 1",
"errGyro" : -4.0,
"states" : [
{
"ts" : 3,
"accY" : -165.877227783203,
"gyroZ" : 8.2994499206543,
},
{
"ts" : 4,
"accY" : -15.843573,
"gyroZ" : 12.434643,
},
{
"ts" : 3,
"accY" : 121.32667,
"gyroZ" : 98.45566,
}
]
}
I want to return all the states objects and the parent document where "ts" is 3 or 5.
I tried this query at first:
db.getCollection('logs').find(
{"states" :
{ "$elemMatch" : { "$or":[
{ "ts":
{ "$eq" : 3}
},
{ "ts":
{ "$eq" : 5}
}
]
}
}
},{"states.$":1 })
But this returns only the first "state" document where the "eq" occurred.
How can I return all the matching documents?
Thank you.
You can use aggregation pipelines
db.getCollection('logs').aggregate([
{
$unwind: "$states"
},
{
$match: {
$or: [
{ "states.ts": 3 },
{ "states.ts": 5 },
]
}
},
{
$group: {
_id: "$_id",
"key1": { $first: "key1" },
"currentTime": { $first: "currentTime" },
"iState": { $first: "$iState" },
"errGyro": { $first: "$errGyro" },
states: { $push: "$states" }
}
}
])
As $elemMatch returns only the first matching element of array, you have to use aggregation to achieve your goal. Here's the query :
db.collection.aggregate([
{
$match: {
$or: [
{
"states.ts": {
$eq: 3
}
},
{
"states.ts": {
$eq: 5
}
}
]
}
},
{
$project: {
states: 1
}
},
{
$unwind: "$states"
},
{
$match: {
$or: [
{
"states.ts": {
$eq: 3
}
},
{
"states.ts": {
$eq: 5
}
}
]
}
},
{
$group: {
_id: "$_id",
states: {
$push: "$states"
}
}
}
])
First $match and $project stages are here for query optimization and memory save, if many documents are returned.

Access nested objects in Elasticsearch using a script

I'm trying to use data from ElasticSearch 6 results in setting up the scoring for my results.
Part of my mapping looks like:
{
"properties": {
"annotation_date": {
"type": "date"
},
"annotation_date_time": {
"type": "date"
},
"annotations": {
"properties": {
"details": {
"type": "nested",
"properties": {
"filter": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"bucket": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keyword": {
"type": "text",
"fielddata": True,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"frequency": {
"type": "long",
}
}
}
}
}
}
}
Example part of a document JSON:
"annotations": {
"details": [
{
"filter": "filter_A",
"bucket": "bucket_A",
"keyword": "keyword_A",
"frequency": 6
},
{
"filter": "filter_B",
"bucket": "bucket_B",
"keyword": "keyword_B",
"frequency": 7
}
]
I want to use the the frequency of my annotation.details if it hits a certain 'bucket', which I try to do with the following:
GET my_index/_search
{
"size": 10000,
"query": {
"function_score": {
"query": {
"match": { "title": "<search term>" }
},
"script_score": {
"script": {
"lang": "painless",
"source": """
int score = 0;
for (int i = 0; i < doc['annotations.details.filter'].length; i++){
if (doc['annotations.details.filter'][i].keyword == "bucket_A"){
score += doc['annotations.details.frequency'][i].value;
}
}
return score;
"""
}
}
}
}
}
Ultimately, this would mean that in this specific situation a score is expected of 6. If it would have hit on more buckets, the score is incremented with the frequency it hit on.
You should use bool,must with range and gt
example
GET /_search
{
"query": {
"nested" : {
"path" : "obj1",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{ "match" : {"obj1.name" : "blue"} },
{ "range" : {"obj1.count" : {"gt" : 5}} }
]
}
}
}
}
}

Python ElasticSearch query error

I have used the mapping query to perform a search in ElasticSearch, and it works fine as below.
{
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "{}mi".format(list_info.radius_b),
'location': {
"lat": zip.lat,
"lon": zip.lng
}
}
},
},
},
"sort" : [
{
"_geo_distance" : {
'location': {"lat": zip.lat,
"lon": zip.lng},
"order" : "asc",
"unit" : "mi",
"mode" : "min",
"distance_type" : "sloppy_arc"
}
}
],
"from": 0,
"size": 0,
}
However, even I add "terms", the I'm getting error: TransportError(400, u'parsing_exception', u'[term] malformed query, expected [END_OBJECT] but found [FIELD_NAME]')
{
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "{}mi".format(list_info.radius_b),
'location': {
"lat": zip.lat,
"lon": zip.lng
}
}
},
},
"term" : { "status" : "approved" }
},
"sort" : [
{
"_geo_distance" : {
'location': {"lat": zip.lat,
"lon": zip.lng},
"order" : "asc",
"unit" : "mi",
"mode" : "min",
"distance_type" : "sloppy_arc"
}
}
],
"from": 0,
"size": 0,
}
Your new term query must be located inside the bool/filter query:
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [
{
"geo_distance": {
"distance": "{}mi".format(list_info.radius_b),
"location": {
"lat": zip.lat,
"lon": zip.lng
}
}
},
{
"term": {
"status": "approved"
}
}
]
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": zip.lat,
"lon": zip.lng
},
"order": "asc",
"unit": "mi",
"mode": "min",
"distance_type": "sloppy_arc"
}
}
],
"from": 0,
"size": 0
}

python mongodb $match and $group

I want to write a simple query that gives me the user with the most followers that has the timezone brazil and has tweeted 100 or more times:
this is my line :
pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
{"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
{"$sort":{"count":-1}} ]
I adapted it from a practice problem.
This was given as an example for the structure :
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Can anybody spot my mistakes?
Suppose you have a couple of sample documents with the minimum test case. Insert the test documents to a collection in mongoshell:
db.collection.insert([
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"user" : {
"friends_count" : 145,
"statuses_count" : 457,
"screen_name" : "Catherinemull",
"time_zone" : "Brasilia",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong(22819398300)
},
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"user" : {
"friends_count" : 145,
"statuses_count" : 12334,
"time_zone" : "Brasilia",
"screen_name" : "marble",
"followers_count" : 2597,
"id" : 37486278
},
"id" : NumberLong(22819398301)
}])
For you to get the user with the most followers that is in the timezone "Brasilia" and has tweeted 100 or more times, this pipeline achieves the desired result but doesn't use the $group operator:
pipeline = [
{
"$match": {
"user.statuses_count": {
"$gt":99
},
"user.time_zone": "Brasilia"
}
},
{
"$project": {
"followers": "$user.followers_count",
"screen_name": "$user.screen_name",
"tweets": "$user.statuses_count"
}
},
{
"$sort": {
"followers": -1
}
},
{"$limit" : 1}
]
Pymongo Output:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marble',
u'tweets': 12334}]}
The following aggregation pipeline will will also give you the desired result. In the pipeline, the first stage is the $match operator which filters those documents where the user has got the timezone field value "Brasilia" and has a tweet count (represented by the statuses_count) greater than or equal to 100 matched via the $gte comparison operator.
The second pipeline stage has the $group operator which groups the filtered documents by the specified identifier expression which is the $user.id field and applies the accumulator expression $max to each group on the $user.followers_count field to get the greatest number of followers for each user. The system variable $$ROOT which references the root document, i.e. the top-level document, currently being processed in the $group aggregation pipeline stage, is added to an extra array field for use later on. This is achieved by using the $addToSet array operator.
The next pipeline stage $unwinds to output a document for each element in the data array for processing in the next step.
The following pipeline step, $project, then transforms each document in the stream, by adding new fields which have values from the previous stream.
The last two pipeline stages $sort and $limit reorders the document stream by the specified sort key followers and returns one document which contains the user with the highest number of followers.
You final aggregation pipeline thus should look like this:
db.collection.aggregate([
{
'$match': {
"user.statuses_count": { "$gte": 100 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"max_followers": { "$max": "$user.followers_count" },
"data": { "$addToSet": "$$ROOT" }
}
},
{
"$unwind": "$data"
},
{
"$project": {
"_id": "$data._id",
"followers": "$max_followers",
"screen_name": "$data.user.screen_name",
"tweets": "$data.user.statuses_count"
}
},
{
"$sort": { "followers": -1 }
},
{
"$limit" : 1
}
])
Executing this in Robomongo gives you the result
/* 0 */
{
"result" : [
{
"_id" : ObjectId("52fd2490bac3fa1975477702"),
"followers" : 2597,
"screen_name" : "marble",
"tweets" : 12334
}
],
"ok" : 1
}
In python, the implementation should be essentially the same:
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROO
T" }}},
... {"$unwind": "$data"},
... {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets":
"$data.user.statuses_count"}},
... {"$sort": { "followers": -1 }},
... {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
... print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>
where
pipeline = [
{"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
{"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
{"$unwind": "$data"},
{"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
{"$sort": { "followers": -1 }},
{"$limit" : 1}
]

Categories

Resources