MongoDB aggregate compare with previous document - python

I have this query in Motor:
history = yield self.db.stat.aggregate([
    {'$match': {'user_id': user.get('uid')}},
    {'$sort': {'date_time': -1}},
    {'$project': {'user_id': 1, 'cat_id': 1, 'doc_id': 1, 'date_time': 1}},
    {'$group': {
        '_id': '$user_id',
        'info': {'$push': {'doc': '$doc_id', 'date': '$date_time', 'cat': '$cat_id'}},
        'total': {'$sum': 1}
    }},
    {'$unwind': '$info'},
])
Documents in the stat collection look like this:
{
    "_id" : ObjectId("5788fa45bc54f428d8e77903"),
    "vrr_id" : 2,
    "date_time" : ISODate("2016-07-15T14:59:17.411Z"),
    "ip" : "10.79.0.230",
    "cat_id" : "rsl01",
    "vrr_group" : ObjectId("55f6d1b5aaab934a00bae1a4"),
    "col" : [
        "dledu"
    ],
    "vrr_type" : "TH",
    "doc_type" : "local",
    "user_id" : "696230",
    "page" : null,
    "method" : "OpenView",
    "branch" : 9,
    "sc" : 200,
    "doc_id" : "004894802",
    "spec" : 0
}
{
    "_id" : ObjectId("5788fa45bc54f428d8e77904"),
    "vrr_id" : 2,
    "date_time" : ISODate("2016-07-15T14:59:17.500Z"),
    "ip" : "10.79.0.230",
    "cat_id" : "rsl01",
    "vrr_group" : ObjectId("55f6d1b5aaab934a00bae1a4"),
    "col" : [
        "autoref"
    ],
    "vrr_type" : "TH",
    "doc_type" : "open",
    "user_id" : "696230",
    "page" : null,
    "method" : "OpenView",
    "branch" : 9,
    "sc" : 200,
    "doc_id" : "000000002",
    "spec" : "07"
}
I want to compare the date_time field with the date_time of the previous document, and include a document in the result only if the two are not equal (or differ by more than a 5-second timedelta).
Filtering this in Python was easy; is it possible in Mongo? How can I achieve this?
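For reference, here is roughly what the Python-side filter looks like (a minimal sketch; it assumes the pipeline above has already run and history holds the unwound documents in date order):
from datetime import timedelta

# Keep a document when its date differs from the previously kept one
# by more than 5 seconds (sketch of the Python-side filtering).
filtered = []
for item in history:
    prev = filtered[-1]['info']['date'] if filtered else None
    if prev is None or abs(item['info']['date'] - prev) > timedelta(seconds=5):
        filtered.append(item)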

If you include some example documents from the "stat" collection I can give a more reliable answer. But with the information you've provided, I can guess. Add a stage something like:
{'$group': {'_id': '$info.date', 'info': {'$first': '$info'}}}
That gives you one document in the result list per distinct "date" value.
That said, if all you need is a distinct list of dates, this is simpler and faster:
db.stat.distinct("date_time")
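On MongoDB 5.0+ there is also a fully server-side option: $setWindowFields with $shift exposes the previous document's date_time, so the 5-second comparison can happen inside the pipeline. A sketch, not tested against your data ($subtract on two dates yields milliseconds, hence the 5000):
pipeline = [
    {'$match': {'user_id': user.get('uid')}},
    {'$setWindowFields': {
        'partitionBy': '$user_id',
        'sortBy': {'date_time': 1},
        'output': {
            # date_time of the previous document; null for the first one
            'prev_date': {'$shift': {'output': '$date_time', 'by': -1}}
        }
    }},
    # keep the first document, plus any document more than 5s after its predecessor
    {'$match': {'$expr': {'$or': [
        {'$eq': ['$prev_date', None]},
        {'$gt': [{'$abs': {'$subtract': ['$date_time', '$prev_date']}}, 5000]}
    ]}}},
]
history = yield self.db.stat.aggregate(pipeline)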

Related

How can I get a subfield of dictionary in mongodb?

I have data structured as follows:
{
    "_id" : ObjectId("61e810b946788906359966"),
    "titles" : {
        "full_name" : "full name",
        "name" : "name"
    },
    "duration" : 161,
    "work_ids" : {
        "plasma_id" : "METRO_3423659",
        "product_code" : "34324000"
    }
}
I would like the query result to be:
{'full_name': 'full name', 'plasma_id': 'METRO_3423659'}
The query that I run is:
.find(query, {"_id": 0, "work_ids.plasma_id": 1, "titles.full_name": 1})
but the result I get is {'titles': {'full_name': 'full name'}, 'work_ids': {'plasma_id': 'METRO_3423659'}}
Is there any way to get directly the result I want? Thank you very much.
Query
You can do it using field paths in a $project stage, assigning new names to the fields:
Playmongo
aggregate([
    {"$project": {
        "_id": 0,
        "full_name": "$titles.full_name",
        "plasma_id": "$work_ids.plasma_id"
    }}
])

Generalize algorithm for a loop comparing to last record?

I have a data set which I can represent by this toy example of a list of dictionaries:
data = [{
    "_id" : "001",
    "Location" : "NY",
    "start_date" : "2022-01-01T00:00:00Z",
    "Foo" : "fruits"
},
{
    "_id" : "002",
    "Location" : "NY",
    "start_date" : "2022-01-02T00:00:00Z",
    "Foo" : "fruits"
},
{
    "_id" : "011",
    "Location" : "NY",
    "start_date" : "2022-02-01T00:00:00Z",
    "Bar" : "vegetables"
},
{
    "_id" : "012",
    "Location" : "NY",
    "Start_Date" : "2022-02-02T00:00:00Z",
    "Bar" : "vegetables"
},
{
    "_id" : "101",
    "Location" : "NY",
    "Start_Date" : "2022-03-01T00:00:00Z",
    "Baz" : "pizza"
},
{
    "_id" : "102",
    "Location" : "NY",
    "Start_Date" : "2022-03-02T00:00:00Z",
    "Baz" : "pizza"
},
]
Here is an algorithm in Python which collects the keys of each record and, whenever the set of keys changes from one record to the next, adds those keys to the output.
data_keys = []
for i, lst in enumerate(data):
    all_keys = []
    for k, v in lst.items():
        all_keys.append(k)
        if k.lower() == 'start_date':
            start_date = v
    this_coll = {'start_date': start_date, 'all_keys': all_keys}
    if i == 0:
        data_keys.append(this_coll)
    else:
        last_coll = data_keys[-1]
        if this_coll['all_keys'] == last_coll['all_keys']:
            continue
        else:
            data_keys.append(this_coll)
The correct output given here records each change of field name: Foo, Bar, Baz as well as the change of case in field start_date to Start_Date:
[{'start_date': '2022-01-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
{'start_date': '2022-02-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
{'start_date': '2022-02-02T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
{'start_date': '2022-03-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]
Is there a general algorithm which covers this pattern of comparing the current item to the previous one in a sequence?
I need to generalize this algorithm and find a solution that does exactly the same thing with MongoDB documents in a collection. To discover whether Mongo has an Aggregation Pipeline Operator I could use, I must first understand whether this basic algorithm has other common forms, so I know what to look for.
Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?
EDIT: If you want to use a query for this, one option is something like the pipeline below:
$objectToArray formats the keys as values, and $ifNull checks both spellings of start_date.
$unwind lets us sort the keys.
$group undoes the $unwind, but now with sorted keys.
$reduce builds a single string from all keys, so we have something to compare.
A second $group on that string leaves only the documents where the keys changed.
db.collection.aggregate([
    {
        $project: {
            data: {$objectToArray: "$$ROOT"},
            start_date: {$ifNull: ["$start_date", "$Start_Date"]}
        }
    },
    {$unwind: "$data"},
    {$project: {start_date: 1, key: "$data.k", _id: 0}},
    {$sort: {start_date: 1, key: 1}},
    {$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
    {
        $project: {
            all_keys: 1,
            all_keys_string: {
                $reduce: {
                    input: "$all_keys",
                    initialValue: "",
                    in: {$concat: ["$$value", "$$this"]}
                }
            }
        }
    },
    {
        $group: {
            _id: "$all_keys_string",
            all_keys: {$first: "$all_keys"},
            start_date: {$first: "$_id"}
        }
    },
    {$unset: "_id"}
])
Playground example
itertools.groupby starts a new subiterator each time the key value changes, so it does the work of tracking a changing key for you. In your case, the key is the dictionary's keys. You can use a list comprehension that takes the first value from each of these subiterators.
import itertools

data = ... your data ...

data_keys = [next(val)
             for _, val in itertools.groupby(data, lambda record: record.keys())]

for row in data_keys:
    print(row)
Result
{'_id': '001', 'Location': 'NY', 'start_date': '2022-01-01T00:00:00Z', 'Foo': 'fruits'}
{'_id': '011', 'Location': 'NY', 'start_date': '2022-02-01T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '012', 'Location': 'NY', 'Start_Date': '2022-02-02T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '101', 'Location': 'NY', 'Start_Date': '2022-03-01T00:00:00Z', 'Baz': 'pizza'}
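If you need the exact {'start_date': ..., 'all_keys': ...} shape from the question, a small reshaping step on top of groupby does it. A sketch, assuming each record has exactly one key whose lowercase form is 'start_date':
import itertools

firsts = (next(val) for _, val in itertools.groupby(data, lambda r: r.keys()))
data_keys = [
    {
        # pick whichever spelling of start_date this record uses
        'start_date': next(v for k, v in rec.items() if k.lower() == 'start_date'),
        'all_keys': list(rec.keys()),
    }
    for rec in firsts
]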

Pymongo finding all Mongo records where value is between two mongo index values

I have a list of values
myValues = [5,6,7,8,9]
I have records as follows:
record1 = {"_id" : someID, "index" : datetime, "StartValue" : 1, "EndValue" : 8}
record2 = {"_id" : someID, "index" : datetime, "StartValue" : 9, "EndValue" : 16}
record3 = {"_id" : someID, "index" : datetime, "StartValue" : 17, "EndValue" : 24}
...
Now, I would like to perform a search query using find() such that the returned cursor contains the records whose StartValue and EndValue ranges together cover all of the values in the list myValues. In the above case, record1 and record2 would be returned: values [5,6,7,8] correspond to record1, since they fall between its StartValue and EndValue inclusive, and record2 is returned because element 9 of myValues is within record2's StartValue and EndValue.
I have tried the following:
myData = myMongoCollection.find({"index" : {"$gt" : current_time - datetime.timedelta(minutes=5)}, "$and" : {{"StartValue" : {"$gte" : {"$or" : myValues}}}, {"EndValue" : { "$lte" : {"$or" : myValues}}}} } )
Edit: I have tried using $gte and $lte to capture array values that may not match the StartValue and EndValue exactly. For this reason, the $in operator would not work. For example, if myValues = [5,6,7], then record1 will not be returned when using $in, yet I would like record1 returned in that case as well.
You can get the minimum and maximum numbers from the myValues array, and check $lte / $gte conditions on both properties with the $or operator:
myValues = [5, 6, 7, 8, 9]
minValue = min(myValues)
maxValue = max(myValues)

myData = myMongoCollection.find({
    "index": {
        "$gt": current_time - datetime.timedelta(minutes=5)
    },
    "$or": [
        {
            "StartValue": { "$lte": minValue },
            "EndValue": { "$gte": minValue }
        },
        {
            "StartValue": { "$lte": maxValue },
            "EndValue": { "$gte": maxValue }
        }
    ]
})
Playground
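Note that taking only the minimum and maximum assumes myValues covers a contiguous range; if a middle value could fall into a record that contains neither endpoint, a safer sketch builds one range clause per value:
# One {StartValue <= v <= EndValue} clause for each value in myValues.
myData = myMongoCollection.find({
    "index": {"$gt": current_time - datetime.timedelta(minutes=5)},
    "$or": [
        {"StartValue": {"$lte": v}, "EndValue": {"$gte": v}}
        for v in myValues
    ]
})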

Print only specific parts of json file

I am wondering what I am doing wrong when trying to print the name field with the following Python code.
import urllib.request, json

with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())

print(data['Departure']['Product']['name'])
print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API response I am fetching the data from:
{
    "Departure" : [ {
        "Product" : {
            "name" : "Länstrafik - Buss 201",
            "num" : "201",
            "catCode" : "7",
            "catOutS" : "BLT",
            "catOutL" : "Länstrafik - Buss",
            "operatorCode" : "254",
            "operator" : "JLT",
            "operatorUrl" : "http://www.jlt.se"
        },
        "Stops" : {
            "Stop" : [ {
                "name" : "Gislaved Lundåkerskolan",
                "id" : "740040260",
                "extId" : "740040260",
                "routeIdx" : 12,
                "lon" : 13.530096,
                "lat" : 57.298178,
                "depTime" : "20:55:00",
                "depDate" : "2019-03-05"
            }
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
    "Departure" : [ {
        "Product" : {
            "name" : "Länstrafik - Buss 201",
            "num" : "201",
            "catCode" : "7",
            "catOutS" : "BLT",
            "catOutL" : "Länstrafik - Buss",
            "operatorCode" : "254",
            "operator" : "JLT",
            "operatorUrl" : "http://www.jlt.se"
        },
        "Stops" : {
            "Stop" : [ {
                "name" : "Gislaved Lundåkerskolan",
                "id" : "740040260",
                "extId" : "740040260",
                "routeIdx" : 12,
                "lon" : 13.530096,
                "lat" : 57.298178,
                "depTime" : "20:55:00",
                "depDate" : "2019-03-05"
            }]}}]}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
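Applied to the original script, the fix is just the extra [0] (the URL placeholder is kept from the question):
import urllib.request, json

with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())

print(data['Departure'][0]['Product']['name'])
print(data['Departure'][0]['Stops']['Stop'][0]['depTime'])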
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are two possible solutions:
1. Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
2. Change the JSON file to make Departure a dictionary, which would allow you to keep your original code. This would make the final JSON snippet look like this:
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
}

Mongod at 100% CPU - Not sure how to diagnose?

I have a Python application (I'm a Python & Mongo newbie) that runs via cron every hour to fetch data, clean it, and insert it into Mongo. During execution, the application queries Mongo to check for duplicates and inserts the document if it is new.
I noticed recently that mongod is at 100% cpu utilization ... and I'm not sure when/why this started happening.
I'm running on an EC2 micro instance with a dedicated EBS volume for mongo, which is at ~2.2GB in size.
I'm not really sure where to start diagnosing the issue. Here is the output of stats() and serverStatus() on the system:
> db.myApp.stats()
{
    "ns" : "myApp.myApp",
    "count" : 138096,
    "size" : 106576816,
    "avgObjSize" : 771.7588923647318,
    "storageSize" : 133079040,
    "numExtents" : 13,
    "nindexes" : 1,
    "lastExtentSize" : 27090944,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 4496800,
    "indexSizes" : {
        "_id_" : 4496800
    },
    "ok" : 1
}
> db.serverStatus()
{
    "host" : "kar",
    "version" : "2.0.4",
    "process" : "mongod",
    "uptime" : 4146089,
    "uptimeEstimate" : 3583433,
    "localTime" : ISODate("2013-04-07T21:18:05.466Z"),
    "globalLock" : {
        "totalTime" : 4146088784941,
        "lockTime" : 1483742858,
        "ratio" : 0.0003578656741237909,
        "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        },
        "activeClients" : {
            "total" : 2,
            "readers" : 2,
            "writers" : 0
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 139,
        "virtual" : 1087,
        "supported" : true,
        "mapped" : 208,
        "mappedWithJournal" : 416
    },
    "connections" : {
        "current" : 7,
        "available" : 812
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 359456,
        "page_faults" : 634
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 3431,
            "hits" : 3431,
            "misses" : 0,
            "resets" : 0,
            "missRatio" : 0
        }
    },
    "backgroundFlushing" : {
        "flushes" : 69092,
        "total_ms" : 448897,
        "average_ms" : 6.497090835407862,
        "last_ms" : 0,
        "last_finished" : ISODate("2013-04-07T21:17:15.620Z")
    },
    "cursors" : {
        "totalOpen" : 0,
        "clientCursors_size" : 0,
        "timedOut" : 1
    },
    "network" : {
        "bytesIn" : 297154435,
        "bytesOut" : 222773714,
        "numRequests" : 1721768
    },
    "opcounters" : {
        "insert" : 138004,
        "query" : 359,
        "update" : 0,
        "delete" : 0,
        "getmore" : 0,
        "command" : 1583416
    },
    "asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 0,
        "rollovers" : 0
    },
    "writeBacksQueued" : false,
    "dur" : {
        "commits" : 9,
        "journaledMB" : 0,
        "writeToDataFilesMB" : 0,
        "compression" : 0,
        "commitsInWriteLock" : 0,
        "earlyCommits" : 0,
        "timeMs" : {
            "dt" : 3180,
            "prepLogBuffer" : 0,
            "writeToJournal" : 0,
            "writeToDataFiles" : 0,
            "remapPrivateView" : 0
        }
    },
    "ok" : 1
}
And top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18477 mongodb 20 0 1087m 139m 122m R 99.9 23.7 10729:36 mongod
I'm curious how to go about debugging Mongo to determine where, what, and why this awful performance is happening.
UPDATE:
I learned I can use explain() to get details, though I'm not yet sure how to interpret the results:
> db.myApp.find({'id':'320969221423124481'}).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 138124,
    "nscannedObjects" : 138124,
    "n" : 0,
    "millis" : 3949,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
    }
}
UPDATE:
OK, I see now that the example query (which executes a BUNCH of times) takes nearly 4 seconds. The "BasicCursor" and an nscanned equal to the collection count mean it is NOT using any index and scans the whole collection. I need to look up how to add an index... doing that now.
UPDATE:
So I did the following
db.myApp.ensureIndex({'id':1})
And it fixed everything. heh.
See the updates in my question above, but the answer was that a missing index needed to be added:
db.myApp.ensureIndex({'id':1})
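A side note, not from the original thread: ensureIndex has since been deprecated in favor of createIndex in the shell and create_index in PyMongo. On a modern setup the equivalent would be something like:
# PyMongo; assumes `db` is a pymongo Database object.
db.myApp.create_index('id')  # ascending single-field index on 'id'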
