Mongod at 100% CPU - Not sure how to diagnose? - python

I have a Python application (I'm a Python & Mongo newbie) that runs via cron every hour to fetch data, clean it, and insert it into Mongo. During execution, the application queries Mongo to check for duplicates and inserts the document if it is new.
I noticed recently that mongod is at 100% cpu utilization ... and I'm not sure when/why this started happening.
I'm running on an EC2 micro instance with a dedicated EBS volume for mongo, which is at ~2.2GB in size.
I'm not really sure where to start diagnosing the issue. Here is the output of stats() and serverStatus() on the system:
> db.myApp.stats()
{
"ns" : "myApp.myApp",
"count" : 138096,
"size" : 106576816,
"avgObjSize" : 771.7588923647318,
"storageSize" : 133079040,
"numExtents" : 13,
"nindexes" : 1,
"lastExtentSize" : 27090944,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 4496800,
"indexSizes" : {
"_id_" : 4496800
},
"ok" : 1
}
> db.serverStatus()
{
"host" : "kar",
"version" : "2.0.4",
"process" : "mongod",
"uptime" : 4146089,
"uptimeEstimate" : 3583433,
"localTime" : ISODate("2013-04-07T21:18:05.466Z"),
"globalLock" : {
"totalTime" : 4146088784941,
"lockTime" : 1483742858,
"ratio" : 0.0003578656741237909,
"currentQueue" : {
"total" : 0,
"readers" : 0,
"writers" : 0
},
"activeClients" : {
"total" : 2,
"readers" : 2,
"writers" : 0
}
},
"mem" : {
"bits" : 64,
"resident" : 139,
"virtual" : 1087,
"supported" : true,
"mapped" : 208,
"mappedWithJournal" : 416
},
"connections" : {
"current" : 7,
"available" : 812
},
"extra_info" : {
"note" : "fields vary by platform",
"heap_usage_bytes" : 359456,
"page_faults" : 634
},
"indexCounters" : {
"btree" : {
"accesses" : 3431,
"hits" : 3431,
"misses" : 0,
"resets" : 0,
"missRatio" : 0
}
},
"backgroundFlushing" : {
"flushes" : 69092,
"total_ms" : 448897,
"average_ms" : 6.497090835407862,
"last_ms" : 0,
"last_finished" : ISODate("2013-04-07T21:17:15.620Z")
},
"cursors" : {
"totalOpen" : 0,
"clientCursors_size" : 0,
"timedOut" : 1
},
"network" : {
"bytesIn" : 297154435,
"bytesOut" : 222773714,
"numRequests" : 1721768
},
"opcounters" : {
"insert" : 138004,
"query" : 359,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 1583416
},
"asserts" : {
"regular" : 0,
"warning" : 0,
"msg" : 0,
"user" : 0,
"rollovers" : 0
},
"writeBacksQueued" : false,
"dur" : {
"commits" : 9,
"journaledMB" : 0,
"writeToDataFilesMB" : 0,
"compression" : 0,
"commitsInWriteLock" : 0,
"earlyCommits" : 0,
"timeMs" : {
"dt" : 3180,
"prepLogBuffer" : 0,
"writeToJournal" : 0,
"writeToDataFiles" : 0,
"remapPrivateView" : 0
}
},
"ok" : 1
}
And top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18477 mongodb 20 0 1087m 139m 122m R 99.9 23.7 10729:36 mongod
I'm curious how to go about debugging mongo to determine where and why this awful performance is happening.
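One general way to see which operations are burning CPU is to turn on MongoDB's profiler and read back the slowest entries. A minimal pymongo sketch (the connection and the myApp database name are assumptions based on the stats() output above):
from pymongo import MongoClient, DESCENDING

client = MongoClient()          # assumes a local mongod on the default port
db = client.myApp               # database name taken from the stats() output

db.command("profile", 2)        # profile every operation (level 2); 0 turns it off

# ... let the hourly job run, then inspect the slowest recorded operations:
for op in db.system.profile.find().sort("millis", DESCENDING).limit(5):
    print(op.get("op"), op.get("ns"), op.get("millis"))

db.command("profile", 0)        # switch the profiler back off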
UPDATE:
I learned I can use explain() to get details, though I'm not sure yet how to interpret the results:
> db.myApp.find({'id':'320969221423124481'}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 138124,
"nscannedObjects" : 138124,
"n" : 0,
"millis" : 3949,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
UPDATE:
OK, I see now that the example query (which the application executes a BUNCH of times) is taking nearly 4 seconds. I guess it is NOT using any index. I need to look up how to add an index... doing that now.
UPDATE:
So I did the following
db.myApp.ensureIndex({'id':1})
And it fixed everything. heh.

See my OP thread, but the answer was that a missing index needed to be added:
db.myApp.ensureIndex({'id':1})
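For reference, the same fix can be applied from Python. A sketch assuming a pymongo connection to the myApp database (ensureIndex is the older shell spelling; pymongo's create_index is a no-op if the index already exists):
from pymongo import MongoClient, ASCENDING

coll = MongoClient().myApp.myApp            # db and collection are both "myApp" per stats() above

# Build the index the duplicate check needs.
coll.create_index([("id", ASCENDING)])

# The duplicate check should now report an index scan instead of a BasicCursor/COLLSCAN.
print(coll.find({"id": "320969221423124481"}).explain())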

Related

Print only specific parts of json file

I am wondering what I am doing wrong when trying to print the name value with the following code in Python.
import urllib.request, json
with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())
    print(data['Departure']['Product']['name'])
    print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API I am fetching the data from:
{
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
    "Departure" : [ {
        "Product" : {
            "name" : "Länstrafik - Buss 201",
            "num" : "201",
            "catCode" : "7",
            "catOutS" : "BLT",
            "catOutL" : "Länstrafik - Buss",
            "operatorCode" : "254",
            "operator" : "JLT",
            "operatorUrl" : "http://www.jlt.se"
        },
        "Stops" : {
            "Stop" : [ {
                "name" : "Gislaved Lundåkerskolan",
                "id" : "740040260",
                "extId" : "740040260",
                "routeIdx" : 12,
                "lon" : 13.530096,
                "lat" : 57.298178,
                "depTime" : "20:55:00",
                "depDate" : "2019-03-05"
            } ]
        }
    } ]
}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
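Put together, the original script only needs that extra [0]. A sketch (the placeholder URL from the question is kept as-is):
import urllib.request, json

with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())

# "Departure" is a list of departures, so pick an element before going deeper.
print(data["Departure"][0]["Product"]["name"])
print(data["Departure"][0]["Stops"]["Stop"][0]["depTime"])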
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are 2 possible solutions.
Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
Change the JSON file to make departure into a dictionary, which would allow you to keep your original code. This would make the final JSON snippet look like this:
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
}

How to extract data from json into a string

I am not able to extract the "Data" value "12639735;7490484;3469776;9164745;650;0" from this file using Python.
In PHP it's a piece of cake for me, but I cannot manage it in Python.
Other answers on Stack Exchange didn't give me the answer.
Here is the contents of the file test.json:
{
"ActTime" : 1494535483,
"ServerTime" : "2017-05-11 22:44:43",
"Sunrise" : "05:44",
"Sunset" : "21:14",
"result" : [
{
"AddjMulti" : 1.0,
"AddjMulti2" : 1.0,
"AddjValue" : 0.0,
"AddjValue2" : 0.0,
"BatteryLevel" : 255,
"Counter" : "20130.221",
"CounterDeliv" : "12634.521",
"CounterDelivToday" : "0.607 kWh",
"CounterToday" : "1.623 kWh",
"CustomImage" : 0,
"Data" : "12639735;7490484;3469776;9164745;650;0",
"Description" : "",
"Favorite" : 1,
"HardwareID" : 3,
"HardwareName" : "Slimme Meter",
"HardwareType" : "P1 Smart Meter USB",
"HardwareTypeVal" : 4,
"HaveTimeout" : false,
"ID" : "1",
"LastUpdate" : "2017-05-11 22:44:39",
"Name" : "Elektriciteitsmeter",
"Notifications" : "false",
"PlanID" : "0",
"PlanIDs" : [ 0 ],
"Protected" : false,
"ShowNotifications" : true,
"SignalLevel" : "-",
"SubType" : "Energy",
"SwitchTypeVal" : 0,
"Timers" : "false",
"Type" : "P1 Smart Meter",
"TypeImg" : "counter",
"Unit" : 1,
"Usage" : "650 Watt",
"UsageDeliv" : "0 Watt",
"Used" : 1,
"XOffset" : "0",
"YOffset" : "0",
"idx" : "1"
}
],
"status" : "OK",
"title" : "Devices"
}
This should work
import json
with open('test.json') as f:
    contents = json.load(f)
print(contents['result'][0]['Data'])
Similar questions have been asked before: Parsing values from a JSON file using Python?
Got it:
import urllib
import json

url = "http://192.168.2.1:8080/json.htm?type=devices&rid=1"
response = urllib.urlopen(url)        # Python 2; on Python 3 use urllib.request.urlopen
data = json.loads(response.read())

for i in data["result"]:
    datastring = i["Data"]
    elementstring = i["Data"].split(';')
    counter = 0
    for j in elementstring:
        if counter == 4:
            usage = j
        counter += 1
    delivery = get_num(i["UsageDeliv"])   # get_num is a helper defined elsewhere in my script
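For what it's worth, the counter loop can be replaced by indexing the split string directly. A sketch assuming the test.json layout shown above, where the usage value sits at position 4 of the semicolon-separated "Data" string:
import json

with open('test.json') as f:
    contents = json.load(f)

data_field = contents['result'][0]['Data']     # "12639735;7490484;3469776;9164745;650;0"
usage = data_field.split(';')[4]               # "650" in the sample document
print(usage)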

MongoDB aggregate compare with previous document

I have this query in Motor:
history = yield self.db.stat.aggregate([
{'$match': {'user_id': user.get('uid')}},
{'$sort': {'date_time': -1}},
{'$project': {'user_id': 1, 'cat_id': 1, 'doc_id': 1, 'date_time': 1}},
{'$group': {
'_id': '$user_id',
'info': {'$push': {'doc': '$doc_id', 'date': '$date_time', 'cat': '$cat_id'}},
'total': {'$sum': 1}
}},
{'$unwind': '$info'},
])
Documents in stat collection look like this:
{
"_id" : ObjectId("5788fa45bc54f428d8e77903"),
"vrr_id" : 2,
"date_time" : ISODate("2016-07-15T14:59:17.411Z"),
"ip" : "10.79.0.230",
"cat_id" : "rsl01",
"vrr_group" : ObjectId("55f6d1b5aaab934a00bae1a4"),
"col" : [
"dledu"
],
"vrr_type" : "TH",
"doc_type" : "local",
"user_id" : "696230",
"page" : null,
"method" : "OpenView",
"branch" : 9,
"sc" : 200,
"doc_id" : "004894802",
"spec" : 0
}
/* 40 */
{
"_id" : ObjectId("5788fa45bc54f428d8e77904"),
"vrr_id" : 2,
"date_time" : ISODate("2016-07-15T14:59:17.500Z"),
"ip" : "10.79.0.230",
"cat_id" : "rsl01",
"vrr_group" : ObjectId("55f6d1b5aaab934a00bae1a4"),
"col" : [
"autoref"
],
"vrr_type" : "TH",
"doc_type" : "open",
"user_id" : "696230",
"page" : null,
"method" : "OpenView",
"branch" : 9,
"sc" : 200,
"doc_id" : "000000002",
"spec" : "07"
}
I want to compare the date_time field with the date_time of the previous document and, if they are not equal (or not within a 5-second timedelta), include the document in the result.
Filtering this in Python was easy; is it possible in Mongo? How can I achieve this?
If you include some example documents from the "stat" collection I can give a more reliable answer. But with the information you've provided, I can guess. Add a stage something like:
{'$group': {'_id': '$info.date', 'info': {'$first': '$info'}}}
That gives you each document in the result list that has a distinct "date" from the previous document.
That said, if all you need is a distinct list of dates, this is simpler and faster:
db.stats.distinct("date_time")
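In the pipeline from the question, the suggested stage would slot in after the $unwind. A sketch in the original Motor syntax (whether $first keeps the chronologically previous entry depends on the earlier $sort, so treat it as a starting point; the 5-second tolerance is not expressed here and is easier to post-filter in Python, as the asker noted):
history = yield self.db.stat.aggregate([
    {'$match': {'user_id': user.get('uid')}},
    {'$sort': {'date_time': -1}},
    {'$project': {'user_id': 1, 'cat_id': 1, 'doc_id': 1, 'date_time': 1}},
    {'$group': {
        '_id': '$user_id',
        'info': {'$push': {'doc': '$doc_id', 'date': '$date_time', 'cat': '$cat_id'}},
        'total': {'$sum': 1}
    }},
    {'$unwind': '$info'},
    # One document per distinct date, keeping the first entry pushed for that date.
    {'$group': {'_id': '$info.date', 'info': {'$first': '$info'}}},
])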

Why should 2700 records (320KB each) take 30 seconds to be fetched?

I have 2700 records in MongoDB. Each document has a size of approximately 320KB. The engine I use is wiredTiger and the total size of the collection is about 885MB.
My MongoDB config is as below:
systemLog:
  destination: file
  path: /usr/local/var/log/mongodb/mongo.log
  logAppend: true
storage:
  dbPath: /usr/local/var/mongodb
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
      statisticsLogDelaySecs: 0
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: false
net:
  bindIp: 127.0.0.1
My connection is via socket:
mongo_client = MongoClient('/tmp/mongodb-27017.sock')
And collection stats reveal this result:
db.mycol.stats()
{
"ns" : "bi.mycol",
"count" : 2776,
"size" : 885388544,
"avgObjSize" : 318944,
"storageSize" : 972476416,
"capped" : false,
"wiredTiger" : {
"metadata" : {
"formatVersion" : 1
},
"creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=0,checkpoint=(WiredTigerCheckpoint.9=(addr=\"01e30275da81e4b9e99f78e30275db81e4c61d1e01e30275dc81e40fab67d5808080e439f6afc0e41e80bfc0\",order=9,time=1444566832,size=511762432,write_gen=13289)),checkpoint_lsn=(24,52054144),checksum=uncompressed,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=5,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=1MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=u,version=(major=1,minor=1)",
"type" : "file",
"uri" : "statistics:table:collection-0-6630292038312816605",
"LSM" : {
"bloom filters in the LSM tree" : 0,
"bloom filter false positives" : 0,
"bloom filter hits" : 0,
"bloom filter misses" : 0,
"bloom filter pages evicted from cache" : 0,
"bloom filter pages read into cache" : 0,
"total size of bloom filters" : 0,
"sleep for LSM checkpoint throttle" : 0,
"chunks in the LSM tree" : 0,
"highest merge generation in the LSM tree" : 0,
"queries that could have benefited from a Bloom filter that did not exist" : 0,
"sleep for LSM merge throttle" : 0
},
"block-manager" : {
"file allocation unit size" : 4096,
"blocks allocated" : 0,
"checkpoint size" : 511762432,
"allocations requiring file extension" : 0,
"blocks freed" : 0,
"file magic number" : 120897,
"file major version number" : 1,
"minor version number" : 0,
"file bytes available for reuse" : 460734464,
"file size in bytes" : 972476416
},
"btree" : {
"column-store variable-size deleted values" : 0,
"column-store fixed-size leaf pages" : 0,
"column-store internal pages" : 0,
"column-store variable-size leaf pages" : 0,
"pages rewritten by compaction" : 0,
"number of key/value pairs" : 0,
"fixed-record size" : 0,
"maximum tree depth" : 4,
"maximum internal page key size" : 368,
"maximum internal page size" : 4096,
"maximum leaf page key size" : 3276,
"maximum leaf page size" : 32768,
"maximum leaf page value size" : 1048576,
"overflow pages" : 0,
"row-store internal pages" : 0,
"row-store leaf pages" : 0
},
"cache" : {
"bytes read into cache" : 3351066029,
"bytes written from cache" : 0,
"checkpoint blocked page eviction" : 0,
"unmodified pages evicted" : 8039,
"page split during eviction deepened the tree" : 0,
"modified pages evicted" : 0,
"data source pages selected for eviction unable to be evicted" : 1,
"hazard pointer blocked page eviction" : 1,
"internal pages evicted" : 0,
"pages split during eviction" : 0,
"in-memory page splits" : 0,
"overflow values cached in memory" : 0,
"pages read into cache" : 10519,
"overflow pages read into cache" : 0,
"pages written from cache" : 0
},
"compression" : {
"raw compression call failed, no additional data available" : 0,
"raw compression call failed, additional data available" : 0,
"raw compression call succeeded" : 0,
"compressed pages read" : 10505,
"compressed pages written" : 0,
"page written failed to compress" : 0,
"page written was too small to compress" : 0
},
"cursor" : {
"create calls" : 7,
"insert calls" : 0,
"bulk-loaded cursor-insert calls" : 0,
"cursor-insert key and value bytes inserted" : 0,
"next calls" : 0,
"prev calls" : 2777,
"remove calls" : 0,
"cursor-remove key bytes removed" : 0,
"reset calls" : 16657,
"search calls" : 16656,
"search near calls" : 0,
"update calls" : 0,
"cursor-update value bytes updated" : 0
},
"reconciliation" : {
"dictionary matches" : 0,
"internal page multi-block writes" : 0,
"leaf page multi-block writes" : 0,
"maximum blocks required for a page" : 0,
"internal-page overflow keys" : 0,
"leaf-page overflow keys" : 0,
"overflow values written" : 0,
"pages deleted" : 0,
"page checksum matches" : 0,
"page reconciliation calls" : 0,
"page reconciliation calls for eviction" : 0,
"leaf page key bytes discarded using prefix compression" : 0,
"internal page key bytes discarded using suffix compression" : 0
},
"session" : {
"object compaction" : 0,
"open cursor count" : 7
},
"transaction" : {
"update conflicts" : 0
}
},
"nindexes" : 2,
"totalIndexSize" : 208896,
"indexSizes" : {
"_id_" : 143360,
"date_1" : 65536
},
"ok" : 1
}
How can I tell whether MongoDB is using swap? How can I work out where exactly the bottleneck is?
EDIT:
The way I fetch the data in Python is:
for doc in mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}}, {'_id': False}):
    doc['uids'] = set(doc['uids'])
    records.append(doc)
The date field is indexed.
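A quick check from Python that the range query really uses that index (a sketch; mycol is the same collection handle as above, and on MongoDB 3.0+ the plan appears under queryPlanner.winningPlan):
plan = mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}},
                  {'_id': False}).explain()
print(plan['queryPlanner']['winningPlan'])   # expect an IXSCAN on date_1 beneath a FETCH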
EDIT 2:
These are the results while fetching the data:
CPU core1: ~65%
CPU core2: ~65%
CPU core3: ~65%
CPU core4: ~65%
RAM: 7190/8190MB
swap: 1140/2048MB
EDIT 3:
MongoDB log is:
2015-10-11T17:25:08.317+0330 I NETWORK [initandlisten] connection accepted from anonymous unix socket #18 (2 connections now open)
2015-10-11T17:25:08.321+0330 I NETWORK [initandlisten] connection accepted from anonymous unix socket #19 (3 connections now open)
2015-10-11T17:25:36.501+0330 I QUERY [conn19] getmore bi.mycol cursorid:10267473126 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:3 nreturned:14 reslen:4464998 locks:{} 199ms
2015-10-11T17:25:37.665+0330 I QUERY [conn19] getmore bi.mycol cursorid:10267473126 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:5 nreturned:14 reslen:4464998 locks:{} 281ms
2015-10-11T17:25:50.331+0330 I NETWORK [conn19] end connection anonymous unix socket (2 connections now open)
2015-10-11T17:25:50.363+0330 I NETWORK [conn18] end connection anonymous unix socket (1 connection now open)
EDIT 4:
Sample data is:
{"date": "2012-09-12", "uids": [1,2,3,4,...,30000]}
NB: I have 30k UIDs inside the uids field.
EDIT 5:
Explaining the query shows that it used an IXSCAN stage:
$ db.mycol.find({'date': {"$lte": '2018-11-27', '$gte': '2011-04-23'}}, {'_id': 0}).explain("executionStats")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "bi.mycol",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"date" : {
"$lte" : "2018-11-27"
}
},
{
"date" : {
"$gte" : "2011-04-23"
}
}
]
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"_id" : 0
},
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"date" : 1
},
"indexName" : "date_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"date" : [
"[\"2011-04-23\", \"2018-11-27\"]"
]
}
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 2776,
"executionTimeMillis" : 2312,
"totalKeysExamined" : 2776,
"totalDocsExamined" : 2776,
"executionStages" : {
"stage" : "PROJECTION",
"nReturned" : 2776,
"executionTimeMillisEstimate" : 540,
"works" : 2777,
"advanced" : 2776,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 31,
"restoreState" : 31,
"isEOF" : 1,
"invalidates" : 0,
"transformBy" : {
"_id" : 0
},
"inputStage" : {
"stage" : "FETCH",
"nReturned" : 2776,
"executionTimeMillisEstimate" : 470,
"works" : 2777,
"advanced" : 2776,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 31,
"restoreState" : 31,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 2776,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 2776,
"executionTimeMillisEstimate" : 0,
"works" : 2776,
"advanced" : 2776,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 31,
"restoreState" : 31,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"date" : 1
},
"indexName" : "date_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"date" : [
"[\"2011-04-23\", \"2018-11-27\"]"
]
},
"keysExamined" : 2776,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
}
}
},
"serverInfo" : {
"host" : "MySys.local",
"port" : 27017,
"version" : "3.0.0",
"gitVersion" : "nogitversion"
},
"ok" : 1
}
EDIT 6:
OS: Mac osX Yosemite
MongoDB version: 3.0.0
Total RAM: 8G
Filesystem: Mac OS Extended (Journaled)
The methods I used to improve performance:
First of all, instead of using a for loop to traverse the cursor and fetch the data, I hand the cursor to Pandas rather than building a large list object in Python:
cursor = mycol.find({'date': {"$lte": end_date, '$gte': start_date}}, {'_id': False})
df = pandas.DataFrame(list(cursor))
Performance got much better; it now takes at most 10 seconds rather than 30.
Instead of using doc['uids'] = set(doc['uids']), which took around 6 seconds, I no longer convert the default list to a set and instead handle duplicates in the DataFrame itself.
You have two problems here:
Use ISODate instead of a string date for faster index lookups: string dates do a lexicographic string comparison, whereas ISODate values compare numerically.
Since your total record count is low, the type of index should not be a big problem; the problem is more likely the size of the documents and their network transfer plus deserialization.
Try a query that does not select the uids field, i.e.
for doc in mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}}, {'_id': False, 'uids': False}):
Your query time will improve by a huge margin. You will then need to investigate the transfer times between your application and the MongoDB server, and also benchmark single-document fetches using find_one() to see how much time deserialization takes.
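A rough way to measure where the time goes, along the lines suggested above (a sketch; it assumes the same mycol handle, and the large array field is called uids as in the sample document):
import time

# 1) The same range query, but without shipping the big uids arrays over the wire.
t0 = time.time()
docs = list(mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}},
                       {'_id': False, 'uids': False}))
print('without uids:', time.time() - t0, 'seconds for', len(docs), 'documents')

# 2) One full document, to see how long transfer plus BSON deserialization of a
#    single ~320KB record takes.
t0 = time.time()
one = mycol.find_one({'date': {"$gte": '2012-09-09'}})
print('single full document:', time.time() - t0, 'seconds')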

Get child dict values use Mongo Map/Reduce

I have a mongo collection; I want to get the total value of 'number_of_ad_clicks' for a given site name, timestamp and variant id. Because we have a lot of data it would be better to use map/reduce. Could anyone give me a suggestion?
Here is my collection's JSON format:
{ "_id" : ObjectId( "4e3c280ecacbd1333b00f5ff" ),
"timestamp" : "20110805",
"variants" : { "94" : { "number_of_ad_clicks" : 41,
"number_of_search_keywords" : 9,
"total_duration" : 0,
"os" : { "os_2" : 2,
"os_1" : 1,
"os_0" : 0 },
"countries" : { "ge" : 6,
"ca" : 1,
"fr" : 8,
"uk" : 4,
"us" : 6 },
"screen_resolutions" : { "(320, 240)" : 1,
"(640, 480)" : 5,
"(1024, 960)" : 5,
"(1280, 768)" : 5 },
"widgets" : { "widget_1" : 1,
"widget_0" : 0 },
"languages" : { "ua_uk" : 8,
"ca_en" : 2,
"ca_fr" : 2,
"us_en" : 5 },
"search_keywords" : { "search_keyword_8" : 8,
"search_keyword_5" : 5,
"search_keyword_4" : 4,
"search_keyword_7" : 7,
"search_keyword_6" : 6,
"search_keyword_1" : 1,
"search_keyword_3" : 3,
"search_keyword_2" : 2 },
"number_of_pageviews" : 18,
"browsers" : { "browser_4" : 4,
"browser_0" : 0,
"browser_1" : 1,
"browser_2" : 2,
"browser_3" : 3 },
"keywords" : { "keyword_5" : 5,
"keyword_4" : 4,
"keyword_1" : 1,
"keyword_0" : 0,
"keyword_3" : 3,
"keyword_2" : 2 },
"number_of_keyword_clicks" : 83,
"number_of_visits" : 96 } },
"site_name" : "fonter.com",
"number_of_variants" : 1 }
Here is my try, but it failed:
m = function() {
    emit(this.query, {variants: this.variants});
}
r = function(key, vals) {
    var clicks = 0;
    for (var i = 0; i < vals.length(); i++) {
        clicks = vals[i]['number_of_ad_clicks'];
    }
    return clicks;
}
res = db.variant_daily_collection.mapReduce(m, r, {out : "myoutput", "query":{"site_name": 'fonter.com', 'timestamp': '20110805'}})
db.myoutput.find()
Could somebody give me a suggestion?
Thank you very much. I tried your solution but nothing was returned.
I invoke the mapReduce as follows; is there anything wrong?
res = db.variant_daily_collection.mapReduce(map, reduce, {out : "myoutput", "query":{"site_name": 'facee.com', 'timestamp': '20110809', 'variant_id': '305'}})
db.myoutput.find()
The emit function emits both a key and a value.
If you are used to SQL, think of key as your GROUP BY and value as your SUM(), AVG(), etc.
In your case you want to "group by": site_name, timestamp and variant id. It looks like you may have more than one variant, so you will need to loop through the variants, like this:
map = function() {
    for (var i in this.variants) {
        var key = {};
        key.timestamp = this.timestamp;
        key.site_name = this.site_name;
        key.variant_id = i; // that's the "94" string.
        var value = {};
        value.clicks = this.variants[i].number_of_ad_clicks;
        emit(key, value);
    }
}
The reduce function will get an array of values each one like this { clicks: 41 }. The function needs to return one object that looks the same.
So if you get values = [ {clicks:21}, {clicks:10}, {clicks:5} ] you must output {clicks:36}.
So you do something like this:
reduce = function(key, vals) {
    var returnValue = { clicks: 0 }; // initializing to zero
    for (var i = 0; i < vals.length; i++) {
        returnValue.clicks += vals[i].clicks;
    }
    return returnValue;
}
Note that the value from map has the same shape as the return from reduce.
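If you need to drive this from Python, older pymongo releases (before 4.0) expose Collection.map_reduce; a sketch assuming such a version, with the corrected functions passed as bson Code objects (the database name is an assumption):
from pymongo import MongoClient
from bson.code import Code

coll = MongoClient().mydb.variant_daily_collection   # "mydb" is a placeholder database name

map_js = Code("""
function() {
    for (var i in this.variants) {
        emit({timestamp: this.timestamp, site_name: this.site_name, variant_id: i},
             {clicks: this.variants[i].number_of_ad_clicks});
    }
}
""")

reduce_js = Code("""
function(key, vals) {
    var total = {clicks: 0};
    for (var i = 0; i < vals.length; i++) {
        total.clicks += vals[i].clicks;
    }
    return total;
}
""")

out = coll.map_reduce(map_js, reduce_js, "myoutput",
                      query={"site_name": "fonter.com", "timestamp": "20110805"})
for doc in out.find():
    print(doc)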
