update depending on item within document mongodb pymongo

update depending on item within document mongodb pymongo - python

suppose that i have a mongodb collection with dictionary objects like such:
{
'value1' : 4 ,
'value2' : 0
}
and i want to update each dictionary object in the database such that value2 = value1 / 2, is there a simple way to do it?
the simple way to do doesnt seem to work because you cannot reference to the value1 value:
some_db.update( {} , { 'value2' : 'this.value1'/2 } ) # wont work, right?
the other way would be to perform batch jobs, pulling in data batch by batch on my own computer such that i can retreive the value of a to then update the value of b. i would rather have the server perform this opertion though.

MongoDB does not have functionality for this. You will have to do this in a batch job. If you do so, I would suggest that you make that you sleep a little between each update as to allow the server to perform normally for the application as well. Otherwise you might end up read/write starving the database. Of course, that's only really necessary if you have loads of (millions of) records really.

hey just do something like this in pymongo
from pymongo import MongoClient
cursor_object = MongoClient()[your_db][your_collection]
for object in cursor_object.find():
id = object['_id']
val1 = object['value1']
update = val1/2
cursor_object.update({"_id":id},{"$set":{"value2":update}})

Related

How can I get the total number of calls to my dynamodb table?

I'm working with AWS Lambda. I created a lambda function that perform a get operation to my dynamoDB table. Depending on the id (primary key) I pass to this get function, it should return me the correct item in JSON format. For that, I'm using the get_item function from boto3:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.get_item
So normally, if a call my lambda function via an API (created from API Gateway) by specifying and ID, I should get the corresponding item. The problem is that I also need to get the number of times I retrieved a result. For example, if it's the seventh time I call my lambda function, I should get an item (still depending on the id) and the index 7, like this :
{
"7":{
"id" :1246 ,
"toy":"car",
"color": "red"
}
}
Logically, the number of times that I call my lambda function is the number of times that I call dynamoDB. Than I suppose that the correct way to get this number is by maybe using dynamodb, but I already spent hours trying to find a way to get this number of events/calls to my table by looking everywhere... What can I do to get this number ? and how could I implement this using boto3?

There is no out of box solution to get the number of calls to the table in DynamoDb. You need to write a custom counter that will be shared across Lambda calls.
The easiest option and the fastest solution is probably using Redis and it's INCR operation to perform atomic increments. If you're not familiar with Redis, check the doc for INCR operation and specifically the Pattern: Counter section.
If you only can use the DynamoDb, you need to maintain a counter in a separate single item. Example:
{
"partionKey": "counter_item",
"counter": 1
}
Then you can execute update calls to increment the counter like that:
response = table.update_item(
Key={'partionKey': {'S': ':pk'}},
TableName='your_table_name',
ReturnValues='ALL_NEW',
UpdateExpression='SET #counter = if_not_exists(#counter, :default) + :incr',
ExpressionAttributeValues={
':pk': 'counter_item',
':incr': 1,
':default': 0
},
ExpressionAttributeNames={
'#counter': 'counter'
}
)
There will be an updated item in the response so you can get the counter field from it.
You can check this DynamoDb guide for better examples in python.

I feel there are plenty of ways to find out the number of server calls.
If you have the logs you can easily get the server calls of any of the specific pages.
Use the AWS Dashboard to get the info matrices. (It has everything, latency, failure ratio, calls, etc.)
Write your own function which can count the get, post, and update calls. (It will be similar to profile hits, generally, this is used at
the initial stage of a project.)

ObjectID generated by server on pymongo

I am using pymongo (python module for mongodb).
I want the ObjectID to be created automatically by the server, however it seems to be created by pymongo itself when we don't specify it.
The problem it raises is that I use ObjectID to sort by time (by just sorting by the _id field). However it seems that it is using the time set on each computer so we cannot truly rely on it.
Any idea on how to solve this problem?

If you call save and pass it a document without an _id field, you can force the server to add the _id instead of the client by setting the (enigmatically-named) manipulate option to False:
coll.save({'foo': 'bar'}, manipulate=False)

I'm not Python user but I'm afraid there's no way to generate _id by server. For performance reasons _id is always generated by driver thus when you insert a document you don't need to do another query to get the _id back.
Here's a possible way you can do it by generating a int sequence _id, just like the IDENTITY ID of SqlServer. To do this, you need to keep a record in you certain collection for example in my project there's a seed, which has only one record:
{_id: ObjectId("..."), seqNo: 1 }
The trick is, you have to use findAndModify to keep the find and modify in the same "transaction".
var idSeed = db.seed.findAndModify({
query: {},
sort: {seqNo: 1},
update: { $inc: { seqNo: 1 } },
new: false
});
var id = idSeed.seqNo;
This way you'll have all you instances get a unique sequence# and you can use it to sort the records.

Make changes persistent in Boto

I have a SimpleDB instance that I update and read using boto for Python:
sdb = boto.connect_sdb(access_key, secret_key)
domain = sdb.get_domain('DomainName')
itemName = 'UserID'
itemAttr = {'key1': 'val1', 'key2': val2}
userDom.put_attributes(itemName, itemAttr)
That works a expected. A new item with name 'UserID' and values val1 and val2 will be inserted in the domain.
Now, the problem that I am facing is that if I query that domain right after updating its attributes,
query = 'select * from `DomainName` where key1=val1'
check = domain.select(query)
itemName = check.next()['key2']
I will get an error because the values in the row could not be found. However, if I add a time.sleep(1) between the write and the read everything works.
I suspect this problem is due to the fact that put_atributes signals the data base for writing, but does not wait until this change has been made persistent. I have also tried to write using creating an item and then saving that item (item.save()) without much success. Does anyone know how can I make sure that the values have been written in the SimpleDB instance before proceeding with the next operations?
Thanks.

The issue here is that SimpleDB is, by default, eventually consistent. So, when you write data and then immediately try to read it, you are not guaranteed to get the newest data although you are guaranteed that eventually the data will be consistent. With SimpleDB, eventually usually means less than a second but there are no guarantees on how long that could take.
There is, however, a way to tell SimpleDB that you want a consistent view of the data and are willing to wait for it, if necessary. You could do this by changing your query code slightly:
query = 'select * from `DomainName` where key1=val1'
check = domain.select(query, consistent_read=True)
itemName = check.next()['key2']
This should always return the latest values.

How to sort mongodb with pymongo

I'm trying to use the sort feature when querying my mongoDB, but it is failing. The same query works in the MongoDB console but not here. Code is as follows:
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.myDB
print db.posts.count()
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({u'entities.user_mentions.screen_name':1}):
print post
The error I get is as follows:
Traceback (most recent call last):
File "find_ow.py", line 7, in <module>
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({'entities.user_mentions.screen_name':1},1):
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/cursor.py", line 430, in sort
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/helpers.py", line 67, in _index_document
TypeError: first item in each key pair must be a string
I found a link elsewhere that says I need to place a 'u' infront of the key if using pymongo, but that didn't work either. Anyone else get this to work or is this a bug.

.sort(), in pymongo, takes key and direction as parameters.
So if you want to sort by, let's say, id then you should .sort("_id", 1)
For multiple fields:
.sort([("field1", pymongo.ASCENDING), ("field2", pymongo.DESCENDING)])

You can try this:
db.Account.find().sort("UserName")
db.Account.find().sort("UserName",pymongo.ASCENDING)
db.Account.find().sort("UserName",pymongo.DESCENDING)

This also works:
db.Account.find().sort('UserName', -1)
db.Account.find().sort('UserName', 1)
I'm using this in my code, please comment if i'm doing something wrong here, thanks.

Why python uses list of tuples instead dict?
In python, you cannot guarantee that the dictionary will be interpreted in the order you declared.
So, in mongo shell you could do .sort({'field1':1,'field2':1}) and the interpreter would sort field1 at first level and field 2 at second level.
If this syntax was used in python, there is a chance of sorting by field2 at first level. With tuple, there is no such risk.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])

Sort by _id descending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", -1 )])
Sort by _id ascending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", 1 )])

DESC & ASC :
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", -1) #
for x in doc:
print(x)
###################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", 1) #
for x in doc:
print(x)

TLDR: Aggregation pipeline is faster as compared to conventional .find().sort().
Now moving to the real explanation. There are two ways to perform sorting operations in MongoDB:
Using .find() and .sort().
Or using the aggregation pipeline.
As suggested by many .find().sort() is the simplest way to perform the sorting.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
However, this is a slow process compared to the aggregation pipeline.
Coming to the aggregation pipeline method. The steps to implement simple aggregation pipeline intended for sorting are:
$match (optional step)
$sort
NOTE: In my experience, the aggregation pipeline works a bit faster than the .find().sort() method.
Here's an example of the aggregation pipeline.
db.collection_name.aggregate([{
"$match": {
# your query - optional step
}
},
{
"$sort": {
"field_1": pymongo.ASCENDING,
"field_2": pymongo.DESCENDING,
....
}
}])
Try this method yourself, compare the speed and let me know about this in the comments.
Edit: Do not forget to use allowDiskUse=True while sorting on multiple fields otherwise it will throw an error.

.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
Python uses key,direction. You can use the above way.
So in your case you can do this
for post in db.posts.find().sort('entities.user_mentions.screen_name',pymongo.ASCENDING):
print post

Say, you want to sort by 'created_on' field, then you can do like this,
.sort('{}'.format('created_on'), 1 if sort_type == 'asc' else -1)

mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.
I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
I don't want to have duplicate documents.
I don't want to remove a document which has previously been saved, but is not in my update.
95% (estimated) of the records are unmodified from day to day.
I am using the Python driver (pymongo).
What I currently do is (pseudo-code):
for each document in update:
existing_document = collection.find_one(document)
if not existing_document:
document['insertion_date'] = now
else:
document = existing_document
document['last_update_date'] = now
my_collection.save(document)
My problem is that it is very slow (40 mins for less than 100 000 records, and I have millions of them in the update).
I am pretty sure there is something builtin for doing this, but the document for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )
Can someone advise how to do it faster?

Sounds like you want to do an upsert. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:
key = {'key':'value'}
data = {'key2':'value2', 'key3':'value3'};
coll.update(key, data, upsert=True); #In python upsert must be passed as a keyword argument
This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.
Before:
{"key":"value", "key2":"Ohai."}
After:
{"key":"value", "key2":"value2", "key3":"value3"}
You can also specify what data you want to write:
data = {"$set":{"key2":"value2"}}
Now your selected document will update the value of key2 only and leave everything else untouched.

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)
Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.
To turn your pseudocode into a working example:
now = datetime.utcnow()
for document in update:
collection.update_one(
filter={
'_id': document['_id'],
},
update={
'$setOnInsert': {
'insertion_date': now,
},
'$set': {
'last_update_date': now,
},
},
upsert=True,
)

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:
> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13}) # This works
> db.getCollection("test").insert({a:1, b:12, c:13}) # This fails
E11000 duplicate key error index: foo.test.$a_1 dup key: { : 1.0 }

You may use Upsert with $setOnInsert operator.
db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})

Summary
You have an existing collection of records.
You have a set records that contain updates to the existing records.
Some of the updates don't really update anything, they duplicate what you have already.
All updates contain the same fields that are there already, just possibly different values.
You want to track when a record was last changed, where a value actually changed.
Note, I'm presuming PyMongo, change to suit your language of choice.
Instructions:
Create the collection with an index with unique=true so you don't get duplicate records.
Iterate over your input records, creating batches of them of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag=true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock, I've gotten 15k/second performance levels. Further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/
Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.
Iterate over all your input records again, creating batches of 15K or so. Extract out the keys (best if there's one key, but can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2,3 ...}) query. For each of these records, determine if there's an update, and if so, issue the update, including updating the 'updated' timestamp.
Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
Key Optimization Points:
The inserts will vastly speed up your operations in bulk.
Retrieving records en masse will speed things up, too.
Individual updates are the only possible route now, but 10Gen is working on it. Presumably, this will be in 2.6, though I'm not sure if it will be finished by then, there's a lot of stuff to do (I've been following their Jira system).

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:
update( { "name": "abc" },
{ $set: { "created": "2010-07-14 11:11:11",
"updated": "2010-07-14 11:11:11" }},
true, true )
Scenario #1 - document with 'name' of 'abc' does not exist:
New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.
Scenario #2 - document with 'name' of 'abc' already exists with the following:
'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10.
After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields be set if inserting, and which fields be left alone if updating.
My solution was to create a unique index on the critera fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.

1. Use Update.
Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.
NOTE: This method overrides the entire document when found (From the docs)
var conditions = { name: 'borne' } , update = { $inc: { visits: 1 }} , options = { multi: true };
Model.update(conditions, update, options, callback);
function callback (err, numAffected) { // numAffected is the number of updated documents })
1.a. Use $set
If you want to update a selection of the document, but not the whole thing, you can use the $set method with update. (again, From the docs)...
So, if you want to set...
var query = { name: 'borne' }; Model.update(query, ***{ name: 'jason borne' }***, options, callback)
Send it as...
Model.update(query, ***{ $set: { name: 'jason borne' }}***, options, callback)
This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.

In general, using update is better in MongoDB as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your python adapter.
Second, if you only need to know whether or not that document exists, count() which returns only a number will be a better option than find_one which supposedly transfer the whole document from your MongoDB causing unnecessary traffic.

Method For Pymongo
The Official MongoDB Driver for Python
5% of the times you may want to update and overwrite, while other times you like to insert a new row, this is done with updateOne and upsert
95% (estimated) of the records are unmodified from day to day.
The following solution is taken from this core mongoDB function:
db.collection.updateOne(filter, update, options)
Updates a single document within the collection based on the filter.
This is done with this Pymongo's function update_one(filter, new_values, upsert=True)
Code Example:
# importing pymongo's MongoClient
from pymongo import MongoClient
conn = MongoClient('localhost', 27017)
db = conn.databaseName
# Filter by appliances called laptops
filter = { 'user_id': '4142480', 'question_id': '2801008' }
# Update number of laptops to
new_values = { "$set": { 'votes': 1400 } }
# Using update_one() method for single update with upsert.
db.collectionName.update_one(filter, new_values, upsert=True)
What upsert=True Do?
Creates a new document if no documents match the filter.
Updates a single document that matches the filter.

I do propose the using of await now.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.