MongoDB $set changing datatypes - python

I have a MongoDB 3.2 instance running on Ubuntu 14.04, in a single-node setup. Last night I performed a migration in which I ran this code over ~1400 documents in a collection:
for r in responses:  # find cursor with ~1400 documents in it
    database.responses.update_one(
        {"_id": r["_id"]},
        {"$set": {"client_id": client["_id"]}}
    )
After the migration, some of the fields in my response documents in the responses collection had switched from Date types to Int32 timestamp representations, and some Int32 fields had changed to Doubles. These fields were not touched by my $set statement (obviously). Only a small subset of the cursor (~75 documents) was affected.
This caused catastrophic failure, because our models expected those fields to have data types they no longer had. Can someone explain what went wrong here?

After reading your question I got curious about what went wrong. My guess is that if you had explicitly set the type when creating/updating these records, you would not have faced this issue, for example:
from bson import ObjectId

for r in responses:  # find cursor with ~1400 documents in it
    database.responses.update_one(
        {"_id": r["_id"]},
        # cast explicitly so the driver cannot infer a different type
        {"$set": {"client_id": ObjectId(client["_id"])}}
    )

My guess is that somewhere else in your code Python has made changes to the types (maybe some code that is trying to automatically infer the type?).
I am pretty sure that before your "for r in responses:" loop there is something else that is trying to detect the type of the fields. Is this the case? Can you provide the code that runs before the snippet you posted?
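If it helps the diagnosis, a quick way to confirm which types actually landed in the collection is to inspect the Python types pymongo returns (a minimal sketch; database and the responses collection are assumed from the question):
# Diagnostic sketch: print the Python type of every field in one document.
# datetime.datetime maps to BSON Date, int to Int32/Int64, float to Double.
doc = database.responses.find_one()
for field, value in doc.items():
    print(field, type(value))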

Related

How to set type of field when inserting into mongodb

So I am querying an API, receiving data and then storing it in MongoDB.
All was working fine so far, except that I have now started using Mongo's aggregation pipeline. In doing so I realized that Mongo is inserting the numeric data as strings, so my aggregation pipeline won't work for numerical computation such as calculating averages, because Mongo sees the values as strings.
How can I set the type of a field during insert, so that I can specify that it is a float etc.?
What I have tried so far is the code below, but it does not work: the mongo shell complains because the field name starts with a number:
db.weeklycol.find().forEach(function(ch) {
    db.weeklycol.update(
        {"_id": ch._id},
        {"$set": {"4_close": parseInt(ch.4_close)}}
    );
});
To access a property whose name is unusual, use []:
ch['4_close']
Then, about saving numbers, I made a test:
> db.test.insertOne({_id:1, field: 2})
{ "acknowledged" : true, "insertedId" : 1 }
> db.test.find({_id:1})
{ "_id" : 1, "field" : 2 }
The number seems to be stored correctly. Can you please post an exact example of code with some dummy values, where the inserted object has a property with a numeric value and the stored document has that value turned into a string?
I have managed to resolve this with the code below. I change the variable types before inserting, using this function, which I call on the whole dictionary just before inserting: if a value is a number it gets converted, otherwise it is left as-is.
I have a similar function that converts dates as well; instead of float on line 4, it is changed to a date conversion.
def convertint(bdic):
    for key, value in bdic.items():
        try:
            bdic[key] = float(value)
        except (TypeError, ValueError):
            pass  # not a number: leave the value unchanged
    return bdic
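For example, calling it just before the insert (a sketch; the record contents are made up for illustration):
# Hypothetical usage: coerce numeric strings to floats, then insert.
record = {"4_close": "101.5", "symbol": "ABC"}
db.weeklycol.insert_one(convertint(record))
# "4_close" is now stored as a Double, so aggregation averages work.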

Simple MongoDB query slow

I am new to MongoDB. I am trying to write some data to a Mongo database from a Python script; the data structure is simple:
{"name":name, "first":"2016-03-01", "last":"2016-03-01"}
I have a script that queries whether the "name" exists; if yes, it updates the "last" date, otherwise it creates the document.
if db.collections.find_one({"name": the_name}):
The data is actually very small: under 5 MB and fewer than 150k records.
It was fast at first (e.g. the first 20,000 records) and then got slower and slower. I checked the profiler output; some queries took more than 50 milliseconds, but I don't see anything abnormal about those records.
Any ideas?
Update 1:
It seems there is no index on the "name" field:
> db.my_collection.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "domains.my_collection"
    }
]
First, you should check whether the collection has an index on the "name" field. See the output of the following command in the mongo CLI:
db.my_collection.getIndexes();
If there is no index, create one (note: on a production environment you should create the index in the background):
db.my_collection.createIndex({name:1},{unique:true});
And if you want to insert a document when it does not exist, or update a field when it does, you can do it in one step without pre-querying. Use the update command with the upsert option and the $set/$setOnInsert operators (see https://docs.mongodb.org/manual/reference/operator/update/setOnInsert/).
db.my_collection.update(
    {name: "the_name"},
    {
        $set: {last: "current_date"},
        $setOnInsert: {first: "current_date"}
    },
    {upsert: true}
);
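Since the question works from a Python script, the pymongo equivalent would look roughly like this (a sketch; db and the_name are assumed from the question, and the dates are kept as ISO strings to match the documents shown):
import datetime

# One-time setup: unique index on "name" (background=True for production).
db.my_collection.create_index([("name", 1)], unique=True, background=True)

today = datetime.date.today().isoformat()  # e.g. "2016-03-01"
db.my_collection.update_one(
    {"name": the_name},
    {"$set": {"last": today}, "$setOnInsert": {"first": today}},
    upsert=True,
)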

including a NumberInt in a dict for pymongo

I need to load a list of dicts (see below) into MongoDB. Within mongo, you have to define an int type as NumberInt(). Python doesn't recognize this as a valid value for a dict. I've found pages on custom encoding for pymongo that don't actually do what I need. I'm totally stuck. Someone must have encountered this before!
I need to insert a list of dicts like this into MongoDB from Python:
agg = {
    '_id': unique_id_str,
    'total': NumberInt(int(total)),
    'mode': NumberInt(int(mymode))
}
You should be able to just insert the dict with a plain int; I've never needed to use NumberInt to insert documents using pymongo.
Also, fwiw, folks at MongoDB told me that letting Mongo create the _id itself tends to be more efficient, but it may obviously work better for you to define it yourself in your case.
agg = {
    '_id': unique_id_str,
    'total': int(total),
    'mode': int(mymode)
}
should work
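If you ever do need to force a 64-bit integer from Python (for example so the shell displays NumberLong), pymongo ships a wrapper for that; a minimal sketch:
from bson.int64 import Int64

agg = {
    '_id': unique_id_str,
    'total': Int64(total),  # stored as BSON long / NumberLong
    'mode': int(mymode)     # a plain int maps to Int32/Int64 automatically
}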

Pymongo find if value has a datatype of NumberLong

I'm using the Pymongo driver and my documents look like this:
{
"_id" : ObjectId("5368a4d583bcaff3629bf412"),
"book_id" : NumberLong(23302213),
"serial_number" : '1122',
}
This works because the serial number is a string:
find_one({"serial_number": "1122"})
However, this doesn't:
find_one({"book_id": "23302213"})
Obviously it's because the book_id has a datatype of NumberLong. How can I execute the find method based on this datatype?
Update:
Still can't get this to work; I can only find string values. Any advice would be much appreciated.
You need to ensure your data types match; MongoDB is strict about types. When you execute this:
find_one({"book_id": "23302213"})
you are asking MongoDB for documents with book_id equal to the string "23302213". As you are storing the book_id not as a string but as a long, the query needs to respect that:
find_one({"book_id": long(23302213)})
If, for some reason, you have the ID as string in your app this would also work:
find_one({"book_id": long("23302213")})
Update
Just checked it (macOS 64-bit, MongoDB 2.6, Python 2.7.5, pymongo 2.7), and it works even when providing a plain integer.
Document in collection (as displayed by Mongo shell):
{ "_id" : ObjectId("536960b9f7e8090e3da4e594"), "n" : NumberLong(222333444) }
Output of python shell:
>>> collection.find_one({"n": 222333444})
{u'_id': ObjectId('536960b9f7e8090e3da4e594'), u'n': 222333444L}
>>> collection.find_one({"n": long(222333444)})
{u'_id': ObjectId('536960b9f7e8090e3da4e594'), u'n': 222333444L}
You can use $type:
http://docs.mongodb.org/manual/reference/operator/query/type/
int is type 16, long (NumberLong) is type 18.
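For example, from pymongo you could match only the documents whose book_id is stored as a long (a sketch using the numeric type code):
# Find a document whose book_id field is a 64-bit integer (BSON type 18).
collection.find_one({"book_id": {"$type": 18}})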

mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.
I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
I don't want to have duplicate documents.
I don't want to remove a document which has previously been saved, but is not in my update.
95% (estimated) of the records are unmodified from day to day.
I am using the Python driver (pymongo).
What I currently do is (pseudo-code):
for document in update:
    existing_document = collection.find_one(document)
    if not existing_document:
        document['insertion_date'] = now
    else:
        document = existing_document
    document['last_update_date'] = now
    my_collection.save(document)
My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update).
I am pretty sure there is something built in for doing this, but the documentation for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating)
Can someone advise how to do it faster?
Sounds like you want to do an upsert. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:
key = {'key': 'value'}
data = {'key2': 'value2', 'key3': 'value3'}
coll.update(key, data, upsert=True)  # in Python, upsert must be passed as a keyword argument
This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.
Before:
{"key":"value", "key2":"Ohai."}
After:
{"key":"value", "key2":"value2", "key3":"value3"}
You can also specify what data you want to write:
data = {"$set":{"key2":"value2"}}
Now your selected document will update the value of key2 only and leave everything else untouched.
As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)
Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.
To turn your pseudocode into a working example:
from datetime import datetime

now = datetime.utcnow()
for document in update:
    collection.update_one(
        filter={
            '_id': document['_id'],
        },
        update={
            '$setOnInsert': {
                'insertion_date': now,
            },
            '$set': {
                'last_update_date': now,
            },
        },
        upsert=True,
    )
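With millions of records, issuing the upserts one at a time will still be slow; pymongo 3.0+ can batch them with bulk_write (a sketch, assuming the same update iterable and now as above):
from pymongo import UpdateOne

# Batch all upserts into one round trip per chunk instead of one per record.
ops = [
    UpdateOne(
        {'_id': document['_id']},
        {'$setOnInsert': {'insertion_date': now},
         '$set': {'last_update_date': now}},
        upsert=True,
    )
    for document in update
]
collection.bulk_write(ops, ordered=False)  # ordered=False continues past individual failures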
You could always create a unique index, which causes MongoDB to reject a conflicting save. Consider the following, done using the mongodb shell:
> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13}) # This works
> db.getCollection("test").insert({a:1, b:12, c:13}) # This fails
E11000 duplicate key error index: foo.test.$a_1 dup key: { : 1.0 }
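The pymongo equivalent, catching the duplicate-key error instead of letting it propagate (a minimal sketch):
from pymongo.errors import DuplicateKeyError

db.test.create_index([("a", 1)], unique=True)
try:
    db.test.insert_one({"a": 1, "b": 12, "c": 13})
except DuplicateKeyError:
    pass  # a document with a == 1 already exists; skip it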
You may use upsert with the $setOnInsert operator:
db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})
Summary
You have an existing collection of records.
You have a set of records that contain updates to the existing records.
Some of the updates don't really update anything; they duplicate what you already have.
All updates contain the same fields that are already there, just possibly with different values.
You want to track when a record was last changed, i.e. when a value actually changed.
Note: I'm presuming PyMongo; change to suit your language of choice.
Instructions:
Create the collection with a unique index so you don't get duplicate records.
Iterate over your input records, creating batches of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record, and add the 'created' and 'updated' timestamps. Issue this as a batch insert command with the ContinueOnError flag set to true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be); see the sketch after these steps. THIS WILL HAPPEN VERY FAST. Bulk inserts rock; I've gotten 15k/second performance levels. Further notes on ContinueOnError: http://docs.mongodb.org/manual/core/write-operations/
Record inserts happen VERY fast, so you'll be done with those in no time. Now it's time to update the relevant records. Do this with batch retrieval, which is much faster than one at a time.
Iterate over all your input records again, creating batches of 15k or so. Extract the keys (best if there's one key, but it can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2, 3, ... ] } }) query. For each of these records, determine whether there's an update, and if so, issue the update, including updating the 'updated' timestamp.
Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation; they're working on that.
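In modern pymongo, ContinueOnError corresponds to ordered=False on insert_many; a sketch of step 2 under that assumption (input_records and now are placeholders):
from pymongo.errors import BulkWriteError

batch = []
for record in input_records:
    record['created'] = now
    record['updated'] = now
    batch.append(record)
    if len(batch) == 15000:
        try:
            # ordered=False ~ ContinueOnError: duplicates are skipped,
            # the rest of the batch still inserts.
            collection.insert_many(batch, ordered=False)
        except BulkWriteError:
            pass  # expected when some keys already exist
        batch = []
if batch:
    try:
        collection.insert_many(batch, ordered=False)
    except BulkWriteError:
        pass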
Key Optimization Points:
Bulk inserts will vastly speed up your operations.
Retrieving records en masse will speed things up, too.
Individual updates are the only possible route for now, but 10gen is working on it. Presumably this will land in 2.6, though I'm not sure it will be finished by then; there's a lot of stuff to do (I've been following their Jira system).
I don't think MongoDB supports this type of selective upserting. I have the same problem as LeMiz: using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and an 'updated' timestamp. Given the following upsert statement:
update( { "name": "abc" },
{ $set: { "created": "2010-07-14 11:11:11",
"updated": "2010-07-14 11:11:11" }},
true, true )
Scenario #1 - document with 'name' of 'abc' does not exist:
New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.
Scenario #2 - document with 'name' of 'abc' already exists with the following:
'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10.
After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set when inserting and which fields should be left alone when updating.
My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.
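That workaround translates to roughly this in pymongo (a sketch; the unique index on 'name' is assumed to already exist):
from pymongo.errors import DuplicateKeyError

try:
    collection.insert_one({'name': 'abc', 'created': now, 'updated': now})
except DuplicateKeyError:
    # The document already existed: only touch the 'updated' field.
    collection.update_one({'name': 'abc'}, {'$set': {'updated': now}})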
1. Use Update.
Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.
NOTE: This method overwrites the entire document when found (from the docs)
var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 }},
    options = { multi: true };

Model.update(conditions, update, options, callback);

function callback(err, numAffected) {
    // numAffected is the number of updated documents
}
1.a. Use $set
If you want to update a selection of the document, but not the whole thing, you can use the $set method with update (again, from the docs)...
So, if you want to set...
var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)
Send it as...
Model.update(query, { $set: { name: 'jason borne' }}, options, callback)
This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.
In general, using update is better in MongoDB, as it will just create the document if it doesn't exist yet, though I'm not sure how to do that with your Python adapter.
Second, if you only need to know whether the document exists, count(), which returns only a number, is a better option than find_one, which transfers the whole document from MongoDB and causes unnecessary traffic.
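In current pymongo the same existence check can be done with count_documents and a limit (a minimal sketch):
# Returns 0 or 1 without transferring the document itself.
exists = db.collection.count_documents({'name': the_name}, limit=1) > 0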
Method for pymongo
The official MongoDB driver for Python
5% of the time you may want to update and overwrite, while the rest of the time you want to insert a new row; this is done with update_one and upsert.
95% (estimated) of the records are unmodified from day to day.
The following solution is based on this core MongoDB function:
db.collection.updateOne(filter, update, options)
Updates a single document within the collection based on the filter.
In pymongo, this is done with the function update_one(filter, new_values, upsert=True).
Code Example:
# importing pymongo's MongoClient
from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
db = conn.databaseName

# Filter on the user and question identifiers
filter = { 'user_id': '4142480', 'question_id': '2801008' }

# Set the number of votes
new_values = { "$set": { 'votes': 1400 } }

# Using the update_one() method for a single update with upsert.
db.collectionName.update_one(filter, new_values, upsert=True)
What does upsert=True do?
It creates a new document if no document matches the filter; otherwise it updates the single document that matches the filter.
Nowadays I would also suggest using await (i.e. an asynchronous driver).
