I have a Mongo collection, and I want to get the total value of 'number_of_ad_clicks' for a given site name, timestamp and variant id. Because we have a lot of data, it would be better to use map/reduce. Could anyone give me a suggestion?
Here is my collection's JSON format:
{ "_id" : ObjectId( "4e3c280ecacbd1333b00f5ff" ),
"timestamp" : "20110805",
"variants" : { "94" : { "number_of_ad_clicks" : 41,
"number_of_search_keywords" : 9,
"total_duration" : 0,
"os" : { "os_2" : 2,
"os_1" : 1,
"os_0" : 0 },
"countries" : { "ge" : 6,
"ca" : 1,
"fr" : 8,
"uk" : 4,
"us" : 6 },
"screen_resolutions" : { "(320, 240)" : 1,
"(640, 480)" : 5,
"(1024, 960)" : 5,
"(1280, 768)" : 5 },
"widgets" : { "widget_1" : 1,
"widget_0" : 0 },
"languages" : { "ua_uk" : 8,
"ca_en" : 2,
"ca_fr" : 2,
"us_en" : 5 },
"search_keywords" : { "search_keyword_8" : 8,
"search_keyword_5" : 5,
"search_keyword_4" : 4,
"search_keyword_7" : 7,
"search_keyword_6" : 6,
"search_keyword_1" : 1,
"search_keyword_3" : 3,
"search_keyword_2" : 2 },
"number_of_pageviews" : 18,
"browsers" : { "browser_4" : 4,
"browser_0" : 0,
"browser_1" : 1,
"browser_2" : 2,
"browser_3" : 3 },
"keywords" : { "keyword_5" : 5,
"keyword_4" : 4,
"keyword_1" : 1,
"keyword_0" : 0,
"keyword_3" : 3,
"keyword_2" : 2 },
"number_of_keyword_clicks" : 83,
"number_of_visits" : 96 } },
"site_name" : "fonter.com",
"number_of_variants" : 1 }
Here is my attempt, but it failed:
m = function() {
    emit(this.query, {variants: this.variants});
}
r = function(key, vals) {
    var clicks = 0;
    for (var i = 0; i < vals.length(); i++) {
        clicks = vals[i]['number_of_ad_clicks'];
    }
    return clicks;
}
res = db.variant_daily_collection.mapReduce(m, r, {out : "myoutput", "query":{"site_name": 'fonter.com', 'timestamp': '20110805'}})
db.myoutput.find()
Could somebody give me any suggestion?
Thank you very much. I tried your solution, but nothing is returned.
I invoke the mapReduce as follows; is there anything wrong?
res = db.variant_daily_collection.mapReduce(map, reduce, {out : "myoutput", "query":{"site_name": 'facee.com', 'timestamp': '20110809', 'variant_id': '305'}})
db.myoutput.find()
The emit function emits both a key and a value.
If you are used to SQL, think of the key as your GROUP BY and the value as your SUM(), AVG(), etc.
In your case you want to group by site_name, timestamp and variant id. It looks like you may have more than one variant, so you will need to loop through the variants, like this:
map = function() {
    for (var i in this.variants) {
        var key = {};
        key.timestamp = this.timestamp;
        key.site_name = this.site_name;
        key.variant_id = i; // that's the "94" string.
        var value = {};
        value.clicks = this.variants[i].number_of_ad_clicks;
        emit(key, value);
    }
}
The reduce function will get an array of values, each one like this: { clicks: 41 }. The function needs to return one object that looks the same.
So if you get values = [ {clicks:21}, {clicks:10}, {clicks:5} ] you must output {clicks:36}.
So you do something like this:
reduce = function(key, vals) {
    var returnValue = { clicks: 0 }; // initializing to zero
    for (var i = 0; i < vals.length; i++) {
        returnValue.clicks += vals[i].clicks;
    }
    return returnValue;
}
Note that the value from map has the same shape as the return from reduce.
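If you prefer to drive this from Python, a rough equivalent with PyMongo 3.x looks like the sketch below (Collection.map_reduce was removed in PyMongo 4; the database name here is an assumption):
from pymongo import MongoClient
from bson.code import Code

client = MongoClient()                                    # assumes a local mongod
coll = client["mydb"]["variant_daily_collection"]         # database name is a guess

mapper = Code("""
    function () {
        for (var i in this.variants) {
            emit({timestamp: this.timestamp, site_name: this.site_name, variant_id: i},
                 {clicks: this.variants[i].number_of_ad_clicks});
        }
    }""")

reducer = Code("""
    function (key, vals) {
        var total = {clicks: 0};
        for (var i = 0; i < vals.length; i++) {
            total.clicks += vals[i].clicks;
        }
        return total;
    }""")

out = coll.map_reduce(mapper, reducer, "myoutput",
                      query={"site_name": "fonter.com", "timestamp": "20110805"})
for doc in out.find():
    print(doc)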
Related
I have defined a dataclass in which all variables are in snake_case, whereas when I return my object I want everything in lowerCamelCase. The problem is that the nesting is very deep. Is there any way to automate this?
Although I have defined the top-level response object's keys in camelCase, what can I do for the nested ones?
My JSON looks like this:
{
"highLevelObj1" : {
"low_level_obj1" : 1,
"low_level_obj2" : 2
},
"someRandomText" : {
"some_random_info1" : 1,
"some_random_info2" : 2
}
}
My expected output is
{
"highLevelObj1" : {
"lowLevelObj1" : 1,
"lowLevelObj2" : 2
},
"someRandomText" : {
"someRandomInfo1" : 1,
"someRandomInfo2" : 2
}
}
We can define a method to convert a snake_case string to a lowerCamelCase string (source). Then I think the easiest way is to convert your JSON into a string, convert that string to camelCase, and convert it back into a dictionary:
import json

def to_camel_case(snake_str):
    components = snake_str.split('_')
    return components[0] + ''.join(x.title() for x in components[1:])

my_dict = {
    "highLevelObj1": {
        "low_level_obj1": 1,
        "low_level_obj2": 2
    },
    "someRandomText": {
        "some_random_info1": 1,
        "some_random_info2": 2
    }
}

string_dict = json.dumps(my_dict)
string_dict_camel_case = to_camel_case(string_dict)
my_dict = json.loads(string_dict_camel_case)
Output :
{'highLevelObj1': {'lowLevelObj1': 1, 'LowLevelObj2': 2},
'Somerandomtext': {'SomeRandomInfo1': 1, 'SomeRandomInfo2': 2}}
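Note that running to_camel_case over the whole JSON string also titles parts of keys and values that happen to follow an underscore, which is why the output above differs slightly from the expected one. A variant that walks the structure and converts only the keys is a possible alternative (a sketch, reusing to_camel_case and my_dict from above):
def keys_to_camel_case(obj):
    # Recursively rebuild dicts with camelCase keys; leave values untouched.
    if isinstance(obj, dict):
        return {to_camel_case(k): keys_to_camel_case(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [keys_to_camel_case(v) for v in obj]
    return obj

print(keys_to_camel_case(my_dict))
# {'highLevelObj1': {'lowLevelObj1': 1, 'lowLevelObj2': 2},
#  'someRandomText': {'someRandomInfo1': 1, 'someRandomInfo2': 2}}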
I am trying to parse a JSON structure in Python and add a new value with key 'cat':
data = []
for x in a:
    for y in x['Hp'].values():
        for z in y:
            for k in z['abc']['xyz']:
                for m in data:
                    det = m['response']
                    # Some processing with det whose output is stored in s
                    k['cat'] = s
    print x
However, when x is printed, only the last value appears throughout the whole dictionary, whereas there are different values for s.
It's obvious that the 'cat' key is being overwritten every time the loop runs, but I can't find a way to make it right.
Below is a sample JSON structure:
{
    "_id" : ObjectId("asdasda156121s"),
    "Hp" : {
        "bermud" : [
            {
                "abc" : {
                    "gfh" : 1,
                    "fgh" : 0.0,
                    "xyz" : [
                        {
                            "kjl" : "0",
                            "bnv" : 0
                        }
                    ],
                    "xvc" : "bv",
                    "hgth" : "INnn",
                    "sdf" : 0
                }
            },
            {
                "abc" : {
                    "gfh" : 1,
                    "fgh" : 0.0,
                    "xyz" : [
                        {
                            "kjl" : "0",
                            "bnv" : 0
                        }
                    ],
                    "xvc" : "bv",
                    "hgth" : "INnn",
                    "sdf" : 0
                }
            },
            ..
If you want to store all values, change
k['cat'] = s
to
if 'cat' in k.keys():
    k['cat'] += s
else:
    k['cat'] = s
If you want to store only the first one, change
k['cat'] = s
to
if 'cat' not in k.keys():
    k['cat'] = s
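As a side note, the same two behaviours can be written with dict helpers (a sketch; the accumulating form assumes s is numeric):
k['cat'] = k.get('cat', 0) + s   # accumulate every value
k.setdefault('cat', s)           # keep only the first value seen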
I'm trying to create a collection named ttl, and using a TTL index, make the documents in that collection expire after 30 seconds.
I've created the collection using mongoengine, like so:
class Ttl(Document):
    meta = {
        'indexes': [
            {
                'name': 'TTL_index',
                'fields': ['expire_at'],
                'expireAfterSeconds': 0
            }
        ]
    }
    expire_at = DateTimeField()
The index has been created, and Robo3T shows it as expected.
The actual documents are inserted to the collection using mongoengine as well:
current_ttl = models.monkey.Ttl(expire_at=datetime.now() + timedelta(seconds=30))
current_ttl.save()
The save is successful (the document is inserted into the DB), but it never expires!
How can I make the documents expire?
I'm adding the collection's contents here as well in case I'm saving them wrong. These are the results of running db.getCollection('ttl').find({}):
/* 1 */
{
"_id" : ObjectId("5ccf0f5a4bdc6edcd3773cd6"),
"created_at" : ISODate("2019-05-05T19:31:10.715Z")
}
/* 2 */
{
"_id" : ObjectId("5ccf121c0b792dae8f55cc80"),
"expire_at" : ISODate("2019-05-05T19:41:08.220Z")
}
/* 3 */
{
"_id" : ObjectId("5ccf127d6729084a24772fad"),
"expire_at" : ISODate("2019-05-05T19:42:47.522Z")
}
/* 4 */
{
"_id" : ObjectId("5ccf15bab124a97322da28de"),
"expire_at" : ISODate("2019-05-05T19:56:56.359Z")
}
The indexes themselves, as per the results of db.getCollection('ttl').getIndexes(), are:
/* 1 */
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "monkeyisland.ttl"
},
{
"v" : 2,
"key" : {
"expire_at" : 1
},
"name" : "TTL_index",
"ns" : "monkeyisland.ttl",
"background" : false,
"expireAfterSeconds" : 0
}
]
My db.version() is 4.0.8 and it's running on Ubuntu 18.04.
The issue is with:
current_ttl = models.monkey.Ttl(expire_at=datetime.now() + timedelta(seconds=30))
That should instead use datetime.utcnow(); MongoDB's TTL monitor compares the indexed date against the current time in UTC, so a naive local datetime from datetime.now() is off by your UTC offset:
current_ttl = models.monkey.Ttl(expire_at=datetime.utcnow() + timedelta(seconds=30))
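A quick way to see the size of the error on your machine (a small sketch): the difference between the two calls is roughly your local UTC offset, which is how far past the intended 30 seconds the TTL monitor will wait.
from datetime import datetime

# Roughly the local UTC offset; this is how much later than intended
# a datetime.now()-based expire_at will actually expire.
print(datetime.now() - datetime.utcnow())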
I have the following valid dictionary. I'm trying to add another group of terms under the "expansion_modules" group.
lan_router = {
    'HOSTNAME1': {
        'system_type': 'MDF',
        'chassis': {
            0: {
                'model_num': 'EX4550',
                'vc_role': 'MASTER',
                'expansion_modules': {
                    1: {
                        'pic_slot': 1,
                        'expan_model': 'EX4550VCP'
                    }
                },
                'built-in_modules': {
                    0: {
                        'pic_slot': 2,
                        'built-in_model': 'EX4550BI'
                    }
                }
            }
        }
    }
}
I want to add the following under "expansion_modules" without removing "1"...
2:{'pic_slot': 2, 'expan_model': 'EX4550SFP'}
The following code adds what I want, but removes the existing term...
print lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'][1]['expan_model']
lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'] = { 2: {} }
lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'][2] = {'pic_slot' : 1, 'expan_model' : 'EX45504XSFP'}
You do not need the line lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'] = { 2: {} }; it replaces the whole dictionary inside expansion_modules. Just remove that line and execute the rest.
Code:
print lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'][1]['expan_model']
lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'][2] = {'pic_slot' : 1, 'expan_model' : 'EX45504XSFP'}
To add a new entry without removing the existing ones, assign to the new key directly:
lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'][2] = {}
Anand's answer is correct as it answers your question.
I would add that often dictionaries with [0, 1, ...] as keys should be just lists. Instead of:
'expansion_modules': {
    1: {
        'pic_slot': 1,
        'expan_model': 'EX4550VCP'
    },
    2: { ... }
}
perhaps you should have:
'expansion_modules': [
    {
        'pic_slot': 1,
        'expan_model': 'EX4550VCP'
    },
    { ... }
]
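With a list, adding the second module from your example then becomes a plain append (a sketch assuming expansion_modules has been changed to the list shown above):
lan_router['HOSTNAME1']['chassis'][0]['expansion_modules'].append(
    {'pic_slot': 2, 'expan_model': 'EX4550SFP'})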
I have a Python application (I'm a Python & Mongo newbie) that runs via cron every hour to fetch data, clean it and insert it into Mongo. During execution, the application queries Mongo to check for duplicates and inserts the document only if it is new.
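In essence the hourly job does a check-then-insert along these lines (a rough PyMongo sketch; the client setup, helper name and the use of the id field are assumptions, the field taken from the explain() output further down):
from pymongo import MongoClient

client = MongoClient()              # assumed local mongod
coll = client["myApp"]["myApp"]     # namespace taken from db.myApp.stats() below

def insert_if_new(doc):
    # Without an index on 'id', this lookup scans every document in the collection.
    if coll.find_one({"id": doc["id"]}) is None:
        coll.insert_one(doc)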
I noticed recently that mongod is at 100% cpu utilization ... and I'm not sure when/why this started happening.
I'm running on an EC2 micro instance with a dedicated EBS volume for mongo, which is at ~2.2GB in size.
I'm not really sure where to start diagnosing the issue. Here is the output of stats() and serverStatus() on the system:
> db.myApp.stats()
{
"ns" : "myApp.myApp",
"count" : 138096,
"size" : 106576816,
"avgObjSize" : 771.7588923647318,
"storageSize" : 133079040,
"numExtents" : 13,
"nindexes" : 1,
"lastExtentSize" : 27090944,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 4496800,
"indexSizes" : {
"_id_" : 4496800
},
"ok" : 1
}
> db.serverStatus()
{
"host" : "kar",
"version" : "2.0.4",
"process" : "mongod",
"uptime" : 4146089,
"uptimeEstimate" : 3583433,
"localTime" : ISODate("2013-04-07T21:18:05.466Z"),
"globalLock" : {
"totalTime" : 4146088784941,
"lockTime" : 1483742858,
"ratio" : 0.0003578656741237909,
"currentQueue" : {
"total" : 0,
"readers" : 0,
"writers" : 0
},
"activeClients" : {
"total" : 2,
"readers" : 2,
"writers" : 0
}
},
"mem" : {
"bits" : 64,
"resident" : 139,
"virtual" : 1087,
"supported" : true,
"mapped" : 208,
"mappedWithJournal" : 416
},
"connections" : {
"current" : 7,
"available" : 812
},
"extra_info" : {
"note" : "fields vary by platform",
"heap_usage_bytes" : 359456,
"page_faults" : 634
},
"indexCounters" : {
"btree" : {
"accesses" : 3431,
"hits" : 3431,
"misses" : 0,
"resets" : 0,
"missRatio" : 0
}
},
"backgroundFlushing" : {
"flushes" : 69092,
"total_ms" : 448897,
"average_ms" : 6.497090835407862,
"last_ms" : 0,
"last_finished" : ISODate("2013-04-07T21:17:15.620Z")
},
"cursors" : {
"totalOpen" : 0,
"clientCursors_size" : 0,
"timedOut" : 1
},
"network" : {
"bytesIn" : 297154435,
"bytesOut" : 222773714,
"numRequests" : 1721768
},
"opcounters" : {
"insert" : 138004,
"query" : 359,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 1583416
},
"asserts" : {
"regular" : 0,
"warning" : 0,
"msg" : 0,
"user" : 0,
"rollovers" : 0
},
"writeBacksQueued" : false,
"dur" : {
"commits" : 9,
"journaledMB" : 0,
"writeToDataFilesMB" : 0,
"compression" : 0,
"commitsInWriteLock" : 0,
"earlyCommits" : 0,
"timeMs" : {
"dt" : 3180,
"prepLogBuffer" : 0,
"writeToJournal" : 0,
"writeToDataFiles" : 0,
"remapPrivateView" : 0
}
},
"ok" : 1
}
And top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18477 mongodb 20 0 1087m 139m 122m R 99.9 23.7 10729:36 mongod
I'm curious how to go about debugging Mongo to determine where, what, and why this awful performance is happening.
UPDATE:
I learned I can use explain() to get details, though I'm not yet sure how to interpret the results:
> db.myApp.find({'id':'320969221423124481'}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 138124,
"nscannedObjects" : 138124,
"n" : 0,
"millis" : 3949,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
UPDATE:
OK, I see now that the example query (which gets executed a BUNCH of times) is taking nearly 4 seconds. I guess it is NOT using any index ("cursor" : "BasicCursor" means a full collection scan). I need to look up how to add an index... doing that now.
UPDATE:
So I did the following
db.myApp.ensureIndex({'id':1})
And it fixed everything. heh.
See the updates in my original post, but the answer was that a missing index needed to be added:
db.myApp.ensureIndex({'id':1})
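For completeness, the same index can also be created from the Python side with PyMongo (ensureIndex is the legacy shell helper; the database and collection names below follow the myApp.myApp namespace from the stats() output):
from pymongo import MongoClient, ASCENDING

client = MongoClient()
# Equivalent of db.myApp.ensureIndex({'id': 1}) in the shell.
client["myApp"]["myApp"].create_index([("id", ASCENDING)])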