Formatting JSON before inserting to MongoDB


I'm new to MongoDB and I'm still learning so please bear with me.
Let's say I have a document in Mongo with the following form:
{ 'objName':
    {
        'id': 12345678,
        'name': 'someName'
    }
}
If I insert this JSON using PyMongo, it gets a default ObjectId.
What I'd like to do is set
_id = 'id'
using the 'id' that is given inside 'objName'.
The problem I'm facing is that I don't know the value inside 'objName' in advance, so I need something generic that works regardless of the value inside.

You can assign the value to _id yourself before inserting:
doc = {'objName': {'id': 12345678, 'name': 'someName'}}
doc['_id'] = doc['objName']['id']
collection.insert_one(doc)
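If the name of the outer key is not known ahead of time, one option is to take whatever single top-level value the document has. This is only a minimal sketch: it assumes each document has exactly one top-level key whose value is a dict with an 'id' field, and `with_id_from_inner` is a hypothetical helper name, not part of PyMongo.

```python
def with_id_from_inner(doc):
    """Return a copy of doc with _id taken from the single nested object's 'id'."""
    # Assumption: exactly one top-level key (its name is unknown), holding a dict.
    if len(doc) != 1:
        raise ValueError("expected exactly one top-level key")
    (inner,) = doc.values()
    new_doc = dict(doc)  # shallow copy so the caller's dict is left untouched
    new_doc['_id'] = inner['id']
    return new_doc


doc = {'anyName': {'id': 12345678, 'name': 'someName'}}
prepared = with_id_from_inner(doc)
```

The result could then be passed to collection.insert_one(prepared) as in the answer above.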

Related

solve E11000 duplicate key error collection: _id_ dup key in pymongo

I am trying to insert a large number of documents (1M+) using a bulk_write call. To do that, I create a list of InsertOne operations.
python version = 3.7.4
pymongo version = 3.8.0
Document creation:
document = {
    'dictionary': ObjectId(dictionary_id),
    'price': price,
    'source': source,
    'promo': promo,
    'date': now_utc,
    'updatedAt': now_utc,
    'createdAt': now_utc
}
# add line to debug
if '_id' in document:
    print(document)
return document
I build the full list of documents by adding a new field taken from a list of elements, and create the operations with InsertOne:
bulk = []
for element in list_elements:
    for document in documents:
        document['new_field'] = element
        # add line to debug
        if '_id' in document:
            print(document)
        insert = InsertOne(document)
        bulk.append(insert)
return bulk
I run the insert with the bulk_write command:
collection.bulk_write(bulk, ordered=False)
The documentation is here: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.bulk_write
According to the documentation, the _id field is added automatically:
Parameter - document: The document to insert. If the document is missing an _id field one will be added.
And somehow it seems to be doing it wrong, because some of them end up with the same value.
I receive this error (with different _id values, of course) for 700k of the 1M documents:
'E11000 duplicate key error collection: database.collection index: _id_ dup key: { _id: ObjectId(\'5f5fccb4b6f2a4ede9f6df62\') }'
This looks like a PyMongo bug to me, because I have used this approach in many situations, though never with this many documents.
The _id field has to be unique, of course, but since PyMongo adds it automatically, I don't know how to approach this problem. Perhaps I could use UpdateOne with upsert=True and an impossible filter, and hope for the best.
I would appreciate any solution or workaround for this problem.

It turns out that because I was adding the new field to the document and appending that same document to the list, I ended up with repeated references to the same objects rather than distinct documents. So I had the same operations len(list_elements) times, and that is why I got the duplicate key error.
To solve the problem, I append a copy of the document to the list:
bulk.append(document.copy())
and then create the operations from that list.
I would like to thank @Belly Buster for his help with this issue.
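The aliasing behind this can be reproduced with plain dicts and no database at all. The sketch below uses made-up sample data and only counts distinct objects; it does not call PyMongo.

```python
documents = [{'price': 10}, {'price': 20}]
list_elements = ['a', 'b']

# Without a copy: the list holds repeated references to the same two dicts,
# so each dict ends up with whichever 'new_field' was written last.
shared = []
for element in list_elements:
    for document in documents:
        document['new_field'] = element
        shared.append(document)

# With a copy: every entry is an independent dict.
copied = []
for element in list_elements:
    for document in documents:
        new_document = document.copy()
        new_document['new_field'] = element
        copied.append(new_document)

distinct_shared = len({id(d) for d in shared})
distinct_copied = len({id(d) for d in copied})
```

Since a missing _id is added to the document object that is passed in, entries that refer to the same dict would also share one _id, which is consistent with the duplicate key errors described in the question.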

If any of the documents from your code snippet already contain an _id, a new one won't be added, and you run the risk of getting a duplicate error as you have observed.

How can one make Salesforce Bulk API calls via simple_salesforce?

I'm using the module simple-salesforce, and I'm not seeing anything in the docs about making bulk API calls. Anybody know how to do this?
https://github.com/simple-salesforce/simple-salesforce

The code does have some comments. There's also this readthedocs page, but even that looks like it could use some help.
Good stuff first, explanation below.
Code example (written assuming you're running the whole block of code at once):
from simple_salesforce import Salesforce
sf = Salesforce(<credentials>)
# query
accounts = sf.bulk.Account.query('SELECT Id, Name FROM Account LIMIT 5')
# returns a list of dictionaries similar to: [{'Name': 'Something totally new!!!', 'attributes': {'url': '/services/data/v38.0/sobjects/Account/object_id_1', 'type': 'Account'}, 'Id': 'object_id_1'}]
# assuming you've pulled data, modify it to use in the next statement
accounts[0]['Name'] = accounts[0]['Name'] + ' - Edited'
# update
result = sf.bulk.Account.update(accounts)
# result would look like [{'errors': [], 'success': True, 'created': False, 'id': 'object_id_1'}]
# insert
new_accounts = [{'Name': 'New Bulk Account - 1', 'BillingState': 'GA'}]
new_accounts = sf.bulk.Account.insert(new_accounts)
# new_accounts would look like [{'errors': [], 'success': True, 'created': True, 'id': 'object_id_2'}]
# upsert
accounts[0]['Name'] = accounts[0]['Name'].replace(' - Edited', '')
accounts.append({'Name': 'Bulk Test Account'})
# 'Id' is the column to "join" on. this uses the object's id column
upserted_accounts = sf.bulk.Account.upsert(accounts, 'Id')
# upserted_accounts would look like [{'errors': [], 'success': True, 'created': False, 'id': 'object_id_1'}, {'errors': [], 'success': True, 'created': True, 'id': 'object_id_3'}]
# how i assume hard_delete would work (i never managed to run hard_delete due to insufficient permissions in my org)
# get last element from the response.
# *NOTE* This ASSUMES the last element in the results of the upsert is the new Account.
# This is a naive assumption
new_accounts.append(upserted_accounts[-1])
sf.bulk.Account.hard_delete(new_accounts)
Using simple_salesforce, you can access the bulk api by doing
<your Salesforce object>.bulk.<Name of the Object>.<operation to perform>(<appropriate parameter, based on your operation>)
<your Salesforce object> is the object you get back from constructing simple_salesforce.Salesforce(<credentials>)
<credentials> is your username, password, security_token, and sandbox (a bool, if you're connecting to a sandbox), or a session_id. (These are the two ways that I know of.)
<Name of the Object> is just Account or Opportunity or whatever object you're trying to manipulate
<operation to perform> is one of the below:
query
insert
update
upsert
hard_delete (my account did not have appropriate permissions to test this operation. any mention is pure speculation)
<appropriate parameter> is dependent on which operation you wish to perform
query - a string that contains a SOQL
insert - a list of dictionaries. remember to have a key for all fields required by your org when creating a new record
update - a list of dictionaries. you'll obviously need a valid Object Id per dictionary
upsert - a list of dictionaries and a string representing the "external id" column. The "external id" can be the Salesforce Object 'Id' or any other column; choose wisely. If any dictionary does not have a key that is the same as the "external id", a new record will be created.
What's returned: depends on the operation.
query returns a list of dictionaries with your results. In addition to the columns in your query, each dictionary has an 'attributes' key. This contains a 'url' key, which looks like it can be used for API requests for the specific object, and a 'type' key, which is the type of the object returned.
insert/update/upsert returns a list of dictionaries. each dictionary is like {'errors': [], 'success': True, 'created': False, 'id': 'id of object would be here'}
Thanks to @ATMA's question for showing how query was used. With that question and the source code, I was able to figure out insert, update, and upsert.

I ran into this same problem a few weeks ago. Sadly, there isn't a way to do it with simple-salesforce. I searched through the source and couldn't find a way to do it, or a way to hack it to make it work.
I looked into a number of other Python based Bulk API Tools. These included Salesforce-bulk 1.0.7 (https://pypi.python.org/pypi/salesforce-bulk/1.0.7), Salesforce-bulkipy 1.0 (https://pypi.python.org/pypi/salesforce-bulkipy), and Salesforce_bulk_api (https://github.com/safarijv/salesforce-bulk-api).
I ran into some issues getting Salesforce-bulk 1.0.7 and Salesforce-bulkipy 1.0 configured on my system, but Salesforce_bulk_api worked pretty well. It uses simple-salesforce as the authentication mechanism but handles the creation of the bulk jobs and uploading the records for you.
A word of caution: simple-salesforce and the bulk APIs work differently. Simple-salesforce works via REST, so you only create JSON strings, which are readily compatible with Python dicts. The Bulk APIs work with CSV files that are uploaded to Salesforce. Creating those CSVs can be a bit dangerous, since the order of the field names in the header must correspond to the order of the data elements in the file. It isn't a huge deal, but you need to be more careful when creating your CSV rows so that the order matches between the header and the data rows.
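One way to keep the header and the data rows aligned is to let the standard library's csv.DictWriter do the ordering, so each value is placed by field name rather than by position. This is a minimal sketch with made-up account data; it only builds the CSV text and does not upload anything to Salesforce.

```python
import csv
import io

fieldnames = ['Name', 'BillingState']
rows = [
    {'BillingState': 'GA', 'Name': 'New Bulk Account - 1'},  # key order differs on purpose
    {'Name': 'New Bulk Account - 2', 'BillingState': 'FL'},
]

buffer = io.StringIO()
# DictWriter writes every row in the order given by fieldnames,
# regardless of the key order inside each dict.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```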

including a NumberInt in a dict for pymongo

I need to load a list of dicts (see below) into a mongoDB. Within mongo, you have to define an int type as NumberInt(). Python doesn't recognize this as a valid type for a dict. I've found pages on custom encoding for pymongo that don't actually do what I need. I'm totally stuck. Someone has to have encountered this before!
Need to insert a list of dicts like this into mongoDB from python.
agg = {
    '_id': unique_id_str,
    'total': NumberInt(int(total)),
    'mode': NumberInt(int(mymode))
}

You should be able to just insert the dict with an int, I've never needed to use NumberInt to insert documents using pymongo.
Also, fwiw, folks at MongoDB told me that letting Mongo create the _id itself tends to be more efficient, but obviously defining it yourself may work better in your case.
agg = {
    '_id': unique_id_str,
    'total': int(total),
    'mode': int(mymode)
}
should work.
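For background: NumberInt is a helper from the mongo shell, where numeric literals are JavaScript doubles unless wrapped. Python ints are already integers, and to my knowledge PyMongo stores an int as a 32-bit BSON integer when it fits and as a 64-bit one otherwise. The sketch below only mirrors that size rule for illustration; `bson_int_type` is a hypothetical helper, not a PyMongo function, and the sample values are made up.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def bson_int_type(value):
    """Name the BSON integer type a Python int would be expected to map to."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected a plain int")
    if INT32_MIN <= value <= INT32_MAX:
        return 'int32'
    if INT64_MIN <= value <= INT64_MAX:
        return 'int64'
    raise OverflowError("too large for an 8-byte BSON integer")


agg = {'_id': 'some-unique-id', 'total': int('42'), 'mode': 30000000000}
types = {key: bson_int_type(val) for key, val in agg.items() if key != '_id'}
```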

ObjectID generated by server on pymongo

I am using pymongo (python module for mongodb).
I want the ObjectID to be created automatically by the server, however it seems to be created by pymongo itself when we don't specify it.
The problem it raises is that I use ObjectID to sort by time (by just sorting by the _id field). However it seems that it is using the time set on each computer so we cannot truly rely on it.
Any idea on how to solve this problem?

If you call save and pass it a document without an _id field, you can force the server to add the _id instead of the client by setting the (enigmatically-named) manipulate option to False:
coll.save({'foo': 'bar'}, manipulate=False)

I'm not a Python user, but I'm afraid there's no way to have the server generate the _id. For performance reasons the _id is always generated by the driver, so when you insert a document you don't need another query to get the _id back.
Here's a possible way to do it: generate an int sequence _id, just like the IDENTITY column in SQL Server. To do this, you need to keep a record in a dedicated collection. For example, in my project there's a seed collection, which has only one record:
{_id: ObjectId("..."), seqNo: 1 }
The trick is, you have to use findAndModify to keep the find and modify in the same "transaction".
var idSeed = db.seed.findAndModify({
    query: {},
    sort: {seqNo: 1},
    update: { $inc: { seqNo: 1 } },
    new: false
});
var id = idSeed.seqNo;
This way all your instances get a unique sequence number, and you can use it to sort the records.
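In PyMongo the closest counterpart to findAndModify is find_one_and_update. A rough sketch of the same idea follows; `next_sequence_number` is a hypothetical helper name, and it assumes the seed collection already holds one document with a seqNo field.

```python
def next_sequence_number(seed_collection):
    """Atomically increment the seed document and return the value it held before."""
    # find_one_and_update returns the document as it was before the update
    # by default, which corresponds to `new: false` in the shell version.
    seed = seed_collection.find_one_and_update(
        {},
        {'$inc': {'seqNo': 1}},
        sort=[('seqNo', 1)],
    )
    if seed is None:
        raise LookupError("no seed document found; insert one with a seqNo field first")
    return seed['seqNo']
```

It would be called as next_sequence_number(db.seed), and the returned number used as the _id of the document about to be inserted.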

mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.
I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
I don't want to have duplicate documents.
I don't want to remove a document which has previously been saved, but is not in my update.
95% (estimated) of the records are unmodified from day to day.
I am using the Python driver (pymongo).
What I currently do is (pseudo-code):
for each document in update:
    existing_document = collection.find_one(document)
    if not existing_document:
        document['insertion_date'] = now
    else:
        document = existing_document
    document['last_update_date'] = now
    my_collection.save(document)
My problem is that it is very slow (40 mins for less than 100 000 records, and I have millions of them in the update).
I am pretty sure there is something built in for doing this, but the documentation for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )
Can someone advise how to do it faster?

Sounds like you want to do an upsert. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:
key = {'key':'value'}
data = {'key2':'value2', 'key3':'value3'}
coll.update(key, data, upsert=True)  # In Python, upsert must be passed as a keyword argument
This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.
Before:
{"key":"value", "key2":"Ohai."}
After:
{"key":"value", "key2":"value2", "key3":"value3"}
You can also specify what data you want to write:
data = {"$set":{"key2":"value2"}}
Now your selected document will update the value of key2 only and leave everything else untouched.

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)
Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.
To turn your pseudocode into a working example:
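from datetime import datetime

now = datetime.utcnow()
for document in update:
    collection.update_one(
        filter={
            '_id': document['_id'],
        },
        update={
            # only applied when the upsert creates a new document
            '$setOnInsert': {
                'insertion_date': now,
            },
            # applied on every upsert, insert or update
            '$set': {
                'last_update_date': now,
            },
        },
        upsert=True,
    )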

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:
> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13}) # This works
> db.getCollection("test").insert({a:1, b:12, c:13}) # This fails
E11000 duplicate key error index: foo.test.$a_1 dup key: { : 1.0 }

You can use an upsert with the $setOnInsert operator:
db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})

Summary
You have an existing collection of records.
You have a set of records that contain updates to the existing records.
Some of the updates don't really update anything; they duplicate what you already have.
All updates contain the same fields that are there already, just possibly different values.
You want to track when a record was last changed, where a value actually changed.
Note, I'm presuming PyMongo, change to suit your language of choice.
Instructions:
Create the collection with an index with unique=true so you don't get duplicate records.
Iterate over your input records, creating batches of them of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag=true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock, I've gotten 15k/second performance levels. Further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/
Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.
Iterate over all your input records again, creating batches of 15K or so. Extract out the keys (best if there's one key, but can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2, 3, ... ] } }) query. For each of these records, determine if there's an update, and if so, issue the update, including updating the 'updated' timestamp.
Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
Key Optimization Points:
The inserts will vastly speed up your operations in bulk.
Retrieving records en masse will speed things up, too.
Individual updates are the only possible route now, but 10Gen is working on it. Presumably, this will be in 2.6, though I'm not sure if it will be finished by then, there's a lot of stuff to do (I've been following their Jira system).
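The batching step in the instructions above can be sketched in plain Python. `batched` is a hypothetical helper name, and the tiny batch size is only there to keep the example readable; the answer suggests batches of about 15,000.

```python
from itertools import islice


def batched(records, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
```

In current PyMongo, each batch could be sent with collection.insert_many(batch, ordered=False). To my understanding this plays the role of the older ContinueOnError flag: the remaining inserts are still attempted after a duplicate key, and a BulkWriteError is raised at the end if any of them failed.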

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:
update( { "name": "abc" },
{ $set: { "created": "2010-07-14 11:11:11",
"updated": "2010-07-14 11:11:11" }},
true, true )
Scenario #1 - document with 'name' of 'abc' does not exist:
New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.
Scenario #2 - document with 'name' of 'abc' already exists with the following:
'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10.
After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set when inserting and which fields should be left alone when updating.
My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update on just the 'updated' field.

1. Use Update.
Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.
NOTE: This method overrides the entire document when found (From the docs)
var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 } },
    options = { multi: true };
Model.update(conditions, update, options, callback);
function callback (err, numAffected) {
  // numAffected is the number of updated documents
}
1.a. Use $set
If you want to update a selection of the document, but not the whole thing, you can use the $set method with update. (again, From the docs)...
So, if you want to set...
var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)
Send it as...
Model.update(query, { $set: { name: 'jason borne' } }, options, callback)
This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.

In general, using update is better in MongoDB as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your python adapter.
Second, if you only need to know whether or not that document exists, count(), which returns only a number, will be a better option than find_one, which transfers the whole document from your MongoDB and causes unnecessary traffic.
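With a recent PyMongo, the existence check described here might look like the sketch below. It relies on count_documents, which accepts a limit option; `document_exists` is a hypothetical helper name, and the collection is passed in by the caller.

```python
def document_exists(collection, query):
    """Return True if at least one document matches, without fetching it."""
    # limit=1 lets the server stop counting after the first match.
    return collection.count_documents(query, limit=1) > 0
```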

Method For Pymongo
The Official MongoDB Driver for Python
Sometimes (say 5% of the time) you want to update and overwrite, while other times you want to insert a new row. Both are done with updateOne and upsert.
95% (estimated) of the records are unmodified from day to day.
The following solution is taken from this core mongoDB function:
db.collection.updateOne(filter, update, options)
Updates a single document within the collection based on the filter.
In PyMongo this is done with the function update_one(filter, new_values, upsert=True)
Code Example:
# importing pymongo's MongoClient
from pymongo import MongoClient
conn = MongoClient('localhost', 27017)
db = conn.databaseName
# Filter on user_id and question_id
filter = { 'user_id': '4142480', 'question_id': '2801008' }
# Set the number of votes
new_values = { "$set": { 'votes': 1400 } }
# Using update_one() method for single update with upsert.
db.collectionName.update_one(filter, new_values, upsert=True)
What does upsert=True do?
Creates a new document if no documents match the filter.
Updates a single document that matches the filter.

I'd suggest using await now.
