MongoDB generating same ID between inserts


I am using pymongo to insert dicts into a MongoDB database. My dictionaries look like this:
{
"name" : "abc",
"Jobs" : [
{
"position" : "Systems Engineer (Data Analyst)",
"time" : [
"October 2014",
"May 2015"
],
"is_current" : 1,
"location" : "xyz",
"organization" : "xyz"
},
{
"position" : "Systems Engineer (MDM Support Lead)",
"time" : [
"January 2014",
"October 2014"
],
"is_current" : 1,
"location" : "xxx",
"organization" : "xxx"
},
{
"position" : "Asst. Systems Engineer (ETL Support Executive)",
"time" : [
"May 2012",
"December 2013"
],
"is_current" : 1,
"location" : "zzz",
"organization" : "xzx"
},
],
"location" : "Buffalo, New York",
"education" : [
{
"school" : "State University of New York at Buffalo - School of Management",
"major" : "Management Information Systems, General",
"degree" : "Master of Science (MS), "
},
{
"school" : "Rajiv Gandhi Prodyogiki Vishwavidyalaya",
"major" : "Electrical and Electronics Engineering",
"degree" : "Bachelor of Engineering (B.E.), "
}
],
"id" : "abc123",
"profile_link" : "example.com",
"html_source" : "<html> some_source_code </html>"
}
I am getting this error:
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index:
Linkedin_DB.employee_info.$id dup key: { :
ObjectId('56b64f6071c54604f02510a8') }
When I run my program 1st document gets inserted properly but when I insert the second document I get this error. When I start my script again the document which was not inserted because of this error get inserted properly and error comes for next document and this continues.
Clearly MongoDB is using the same ObjectId for two inserts. I don't understand why MongoDB is failing to generate a unique ID for new documents.
My code to save passed data:
import pymongo
class Mongosave:
    """
    Pass collection_name and dict data
    This module stores the passed dict in collection
    """
    def __init__(self):
        self.connection = pymongo.MongoClient()
        self.db = self.connection['Linkedin_DB']
    def _exists(self, id):
        # To check if user already exists
        return True if list(self.collection.find({'id': id})) else False
    def save(self, collection_name, data):
        self.collection = self.db[collection_name]
        if not self._exists(data['id']):
            print(data['id'])
            self.collection.insert(data)
        else:
            self.collection.update({'id': data['id']}, {"$set": data})
I can't figure out why this is happening. Any help is appreciated.

The problem is that your save method uses a field called "id" to decide whether it should do an insert or an update. You want to use "_id" instead. You can read about the _id field and index here. PyMongo automatically adds an _id to your document if one is not already present. You can read more about that here.
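A minimal sketch of that suggestion, assuming the application-level "id" value is unique and can double as the document's _id. The names save_profile and RecordingCollection are hypothetical; the stand-in class only exists so the example runs without a server. With PyMongo, a real collection's replace_one(filter, replacement, upsert=True) would take its place.

```python
def save_profile(collection, data):
    """Insert or replace a document keyed on its application-level id."""
    doc = dict(data)            # shallow copy so the caller's dict is never mutated
    doc["_id"] = doc["id"]      # assumption: "id" is unique per profile
    # One call: replaces the document if that _id exists, inserts it otherwise.
    return collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)


class RecordingCollection:
    """Hypothetical stand-in for a pymongo collection; stores documents by _id."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["_id"]
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)
        return None  # a real collection returns an UpdateResult


coll = RecordingCollection()
profile = {"id": "abc123", "name": "abc"}
save_profile(coll, profile)
save_profile(coll, profile)      # second call replaces instead of raising
print(len(coll.docs))            # 1
print("_id" in profile)          # False: the original dict is untouched
```

Copying the dict before saving also sidesteps the case where the same dict object is saved twice.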

You might have inserted two copies of the same document into your collection in one run.
I cannot quite understand what you mean by:
When I start my script again the document which was not inserted because of this error get inserted properly and error comes for next document and this continues.
What I do know is if you do:
from pymongo import MongoClient
client = MongoClient()
db = client['someDB']
collection = db['someCollection']
someDocument = {'x': 1}
for i in range(10):
    collection.insert_one(someDocument)
You'll get a:
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index:
This makes me think that, although pymongo generates an _id for you if you don't provide one, inserting the same dict object twice does not give you two different _ids. The likely reason is that insert_one adds the generated _id to the dict you pass in, so the second insert of that same object already carries the first _id.
Try generating your own _id and see if it happens again.
Edit:
I just tried this and it works:
for i in range(10):
    collection.insert_one({'x':1})
This makes me think the _id is tied to the object you feed into pymongo: this time I'm not passing the same object any more, and the problem disappeared.
Are you giving your database two references to the same object?
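The behaviour described in this answer can be reproduced without a server. The sketch below uses a hypothetical in-memory FakeCollection, not PyMongo itself. It mimics one thing only: insert_one adds an _id to the dict it receives when none is present. A uuid stands in for the ObjectId PyMongo would generate.

```python
import uuid


class FakeCollection:
    """Hypothetical in-memory stand-in that mimics one PyMongo behaviour:
    insert_one() adds an _id to the dict it is given when none is present."""

    def __init__(self):
        self.ids = set()

    def insert_one(self, document):
        if "_id" not in document:
            document["_id"] = uuid.uuid4().hex   # real PyMongo uses an ObjectId
        if document["_id"] in self.ids:
            raise KeyError("E11000 duplicate key: %s" % document["_id"])
        self.ids.add(document["_id"])


collection = FakeCollection()
some_document = {"x": 1}
collection.insert_one(some_document)
print("_id" in some_document)        # True: the caller's dict was mutated

try:
    collection.insert_one(some_document)   # same dict, same _id
except KeyError as exc:
    print("second insert failed:", exc)

# Passing a fresh dict each time avoids the problem.
for _ in range(3):
    collection.insert_one(dict(x=1))
print(len(collection.ids))           # 4
```

If this is what happens in the question's script, passing a copy such as dict(data) to the insert call would be one way to avoid the error.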

Related

Delete all documents returned in a find().limit()

I'm using db.collection.find({}, {'_id': False}).limit(2000) to get the documents from a collection. These documents are sent to a Facebook API; after the API returns success, they need to be deleted from the collection.
My main question is:
Is there a way to delete all 2000 of these documents without using a for loop? I know that collection.find returns a cursor; is there a way to use this cursor in a delete_many?
The structure of my document is:
{
"_id" : ObjectId("61608068887f1a0e2162d94b"),
"event_time" : "1632582893",
"value" : "549.9000",
"contents" : [
{
"product_id" : "1-1",
"quantity" : "1.000000",
"value" : "10"
}
]
}

To solve this problem, based on the comments of @adarsh and @J.F, I've used the following code:
rm = [x['_id'] for x in MongoDB(mongo).db.get_collection("DataToSend").find({}, {'_id': 1}).limit(2000)]
MongoDB(mongo).db.get_collection("DataToSend").delete_many({'_id': {'$in': rm}})
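The same two steps can be wrapped in one helper. This is a sketch only: delete_batch, FakeCursor and FakeCollection are hypothetical names, and the stand-in classes exist so the example runs without a server. A real pymongo collection offers the same find(...).limit(...) and delete_many(...) calls. Note that the projection must include _id, unlike the {'_id': False} projection in the question, because the ids are what the delete filter needs.

```python
def delete_batch(collection, limit=2000):
    """Delete up to `limit` documents and return how many ids were targeted."""
    # Fetch only the _id of each document; nothing else is needed for deletion.
    ids = [doc["_id"] for doc in collection.find({}, {"_id": 1}).limit(limit)]
    if ids:
        collection.delete_many({"_id": {"$in": ids}})
    return len(ids)


class FakeCursor(list):
    """Hypothetical cursor stand-in supporting only limit()."""

    def limit(self, n):
        return FakeCursor(self[:n])


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self, docs):
        self.docs = list(docs)

    def find(self, filter=None, projection=None):
        return FakeCursor({"_id": d["_id"]} for d in self.docs)

    def delete_many(self, filter):
        targets = set(filter["_id"]["$in"])
        self.docs = [d for d in self.docs if d["_id"] not in targets]


coll = FakeCollection({"_id": i, "value": "10"} for i in range(5))
print(delete_batch(coll, limit=3))        # 3
print([d["_id"] for d in coll.docs])      # [3, 4]
```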

MongoDB check if list item exists for an item in collection before inserting to DB

I'm developing a Django application using Python and MongoDB. I'm building a form that takes user input and saves it to the DB.
Before inserting, I want to check whether the data is already present in the DB.
I have a mongo collection which looks something like below :
coll_1 :
{ "_id" : ObjectId("56e0a3a2d59feaa43fba49d5"), "timestamp" : ISODate("2017-11-18T10:23:29.620Z"), "City_list" : "[PN-City1, PN-City2,PN-City3, PN-City4]", "LDE" : "LDE-1234, LDE-345, LDE-456" , "Name": "ABC"}
{ "_id" : ObjectId("56e0a3a2d59feaa43fba49d6"), "timestamp" : ISODate("2016-12-18T10:23:29.620Z"), "City_list" : "[PN-City4, PN-City5,PN-City6,PN- City7]", "LDE" : "LDE-444, LDE-3445, LDE-456", "Name": "BCD"}
{ "_id" : ObjectId("56e0a3a2d59feaa43fd67873"), "timestamp" : ISODate("2016-12-18T10:23:29.620Z"), "City_list" : "[PN-City1, PN-City6,PN-City9,PN- City10]", "LDE" : "LDE-444, LDE-3445, LDE-456", "Name": "XYZ"}
I have a form from which I take the user inputs: Name, Cities (one or more, comma separated), LDE (comma separated).
In my script I want to check the following before inserting into MongoDB:
If the user is a new user, insert directly into the DB.
If it is an existing user, check whether the cities entered by the user are already present in the DB. If not, update the DB; otherwise send a message to the HTML page saying the city is already present in the DB.
Say my input is something like this :
Name: PQR
City_list : PN-City4, PN-City12
LDE: LDE-6767
My code is as below :
if 'Name' in pdata and ('city_list' in pdata and re.match("(PN-\w*-\d)(PN-\w*-\d)*", pdata['city_list'])):
    user_input = pdata['city_list'].split(",")
    pname = pdata['Name']
    for data in user_input:
        if db.coll_1.find({"Name": pname , 'City_list': { "$in": data}})
This is giving me an error.
How do I achieve this?
I tried something like this:
for data in user_input:
    data = str(data) # it was taking as unicode
    if (db.coll_1.find({"Name": pname , 'City_list': { "$in": data}}).count() > 0):
This gives the error: OperationFailure: $in needs an array
City_list is a string.
Can someone please help me with this?
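The error message is consistent with the code above: $in expects a list, while data is a single string. Because City_list is stored as one string rather than an array, one option is to fetch the user's document and compare the cities in Python. The sketch below shows only that comparison step. The helper names parse_city_list and split_new_and_existing are hypothetical, and the stored format is assumed to match the sample documents.

```python
def parse_city_list(raw):
    """Turn '[PN-City1, PN-City2,PN-City3]' or 'PN-City4, PN-City12' into a list."""
    return [c.strip() for c in raw.strip("[] ").split(",") if c.strip()]


def split_new_and_existing(stored_raw, submitted_raw):
    """Return (new, existing) cities from the form relative to the stored string."""
    stored = set(parse_city_list(stored_raw))
    submitted = parse_city_list(submitted_raw)
    new = [c for c in submitted if c not in stored]
    existing = [c for c in submitted if c in stored]
    return new, existing


stored = "[PN-City1, PN-City2,PN-City3, PN-City4]"
new, existing = split_new_and_existing(stored, "PN-City4, PN-City12")
print(new)       # ['PN-City12']
print(existing)  # ['PN-City4']
```

If City_list were stored as a real array instead, a filter such as {"Name": pname, "City_list": {"$in": user_input}} with the whole list passed to $in would let the server do the check.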

Nested complex query to MongoDB with Python

I have the following document in my MongoDB:
{
"_id" : ObjectId("5a672fe5c9afd19e04d011ca"),
"data" : [
{
"name" : "Smith",
"age" : 10,
"spouse" : "Lopez"
},
{
"name" : "Davis",
"age" : 10,
"spouse" : "Peter"
},
{
"name" : "Clark",
"age" : 10
}
],
"header" : {
"sourece" : "http://www.some.com/api/json/data?department=security&gender=female",
"fetch_time" : "2018-01-23T09:35:51"
}
}
Now I want to:
Get all the data under the "data" node.
Get all the people who have a "spouse" node.
The following code doesn't work:
from pymongo import MongoClient
from pprint import pprint
client = MongoClient('mongodb://localhost:27017/')
db = client['test']
coll = db['test_2']
print('All content:')
for item in coll.find():
    pprint(item)
print('-'*20)
print("Content under 'data':")
for item in coll.find({"data": "$all"}):
    pprint(item)
for item in coll.find({"data": []}):
    pprint(item)
for item in coll.find({"data": ["$all"]}):
    pprint(item)
print('-'*20)
print("People who have 'spouse':")
for item in coll.find({"data": [{"spouse":"$all"}]}):
    pprint(item)
The above code outputs the following:
All content:
{u'_id': ObjectId('5a672fe5c9afd19e04d011ca'),
u'data': [{u'age': 10, u'name': u'Smith', u'spouse': u'Lopez'},
{u'age': 10, u'name': u'Davis', u'spouse': u'Peter'},
{u'age': 10, u'name': u'Clark'}],
u'header': {u'fetch_time': u'2018-01-23T09:35:51',
u'sourece': u'http://www.some.com/api/json/data?department=security&gender=female'}}
--------------------
Content under 'data':
--------------------
People who have 'spouse':
I can get all the content from my MongoDB, which means the data is there in the database. But when I run the subsequent code, nothing was printed. I tried different ways but none of them work.
Moreover, is there any document, something like an Oracle SQL reference PDF, that states the query grammar with a strict structure specification, so that I can build any query statement based on it?

No need to get all data.
First part (regular query) - Read here - Use a projection to output only the "data" field, with no query filter.
Something like coll.find({}, {"data": 1}).
Second part (aggregate query) - Read here - Use $match to keep the documents where "data" has at least one array element with a spouse field, followed by $project with $filter, using a $type expression to check for the missing field, so that only the matching array elements are output.
Something like
coll.aggregate([
    {"$match": {"data.spouse": {"$exists": True}}},
    {"$project": {
        "data": {
            "$filter": {
                "input": "$data",
                "as": "result",
                "cond": {"$ne": [{"$type": "$$result.spouse"}, "missing"]}
            }
        }
    }}
])
Also note that query operators are different from aggregation comparison operators.
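To illustrate what the $match and $filter combination keeps, here is a pure-Python equivalent applied to the document from the question. It is only a sketch of the filtering logic, with people_with_spouse as a hypothetical helper name. It does not talk to MongoDB.

```python
def people_with_spouse(document):
    """Pure-Python equivalent of the $filter stage: keep entries that have 'spouse'."""
    return [person for person in document.get("data", []) if "spouse" in person]


doc = {
    "data": [
        {"name": "Smith", "age": 10, "spouse": "Lopez"},
        {"name": "Davis", "age": 10, "spouse": "Peter"},
        {"name": "Clark", "age": 10},
    ]
}
print([p["name"] for p in people_with_spouse(doc)])   # ['Smith', 'Davis']
```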

Mongodb find with wrong result Int64 object

I'm using MongoDB 3.2.1 / python 3.4 / pymongo / pandas 0.17 (although the latter two are probably completely irrelevant to this question).
I'm having a really strange (and wrong) behavior with MongoDB find.
I have a collection, containing a document like this:
{
"_id" : NumberLong(-1819413477243867792),
"targetentity" : "NODOGENERICO .ag.HP_BAR_DEG_APP_1",
"tx" : false,
"ocname" : ".oc.serv6",
"specificproblem" : null,
"saf" : false,
"iscriticalnode" : null,
"checkmask" : null,
"notificationidentifier" : 1347592,
"province" : null,
"usertext" : null,
"additionaltext" : "AAA Invalid Response",
"director" : ".temip.madrids01_director",
"problemoccurences" : 1,
"usertags" : null,
"managedobject" : "NODOGENERICO .ag.HP_BAR_DEG_APP_1",
"isacceptednode" : null,
"elementcode" : null,
"state" : "Terminated",
"probablecause" : "Unknown",
"ran" : false,
"counttotal" : 1,
"locationcode" : "NULL",
"problemstatus" : "Closed",
"structurednotes" : null,
"collection" : "serv6",
"operatornotes" : null,
"alarmtype" : "CommunicationsAlarm",
"workinfo" : null,
"perceivedseverity" : "Major",
"core" : true,
"eventtime" : NumberLong(1467342666000),
"originalseverity" : "Major",
"vendor" : "Several",
"controlelementcode" : null,
"outageflag" : false,
"incident" : null,
}
This "_id" is basically a hash computed using the builtin "hash" function of Python 3.4.
The problem is that I cannot find any element with this id after I insert it.
I've tried (at this point I'm trying this on mongo terminal directly, but over Pymongo it gets me the same results):
db.getCollection('unique_alarm').find({"_id": NumberLong(-1819413477243867792)})
and
db.getCollection('unique_alarm').find({"_id": -1819413477243867792})
And for both I get this:
Fetched 0 record(s) in 1ms
I thought the problem was about how I deal with NumberLong, but for field eventtime (which has the same type) I have absolutely no problem.
I.e., for the eventtime if I query:
db.getCollection('unique_alarm').find({"eventtime" : NumberLong(1467342666000)})
or by:
db.getCollection('unique_alarm').find({"eventtime" :1467342666000})
Both these queries return this first document again, no problem.
Any clues on what is happening? Why are the first two queries returning 0 results?
More information on my trial and error:
It doesn't matter whether the field is "_id" or any other field; I cannot search for these numbers.
I'm inserting these documents using pymongo.
If I try to insert this document again (either using pymongo or the mongodb terminal), I get a duplicate key error.

The answer may seem trivial, but it all comes down to the quotes around the NumberLong value.
Both when inserting and when querying, the value needs to be quoted:
db.sofia.find({"_id" : NumberLong("-1819413477243867792")}).pretty()
{
"_id" : NumberLong("-1819413477243867792"),
"targetentity" : "NODOGENERICO .ag.HP_BAR_DEG_APP_1",
"tx" : false,
"ocname" : ".oc.serv6",
....
}

I think you're hitting some sort of limit within mongo's NumberLong.
I've opened a mongo console and this is the output:
> NumberLong(-1819413477243867792)
NumberLong("-1819413477243867904")
So I would assume that if you search for NumberLong("-1819413477243867904") you would magically find your record, which would probably prove that your hash is hitting some sort of MongoDB limit of NumberLong.
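The value shown by the console matches what happens when the number passes through a 64-bit floating-point double, which is how an unquoted numeric literal is assumed to be read by the JavaScript-based shell before NumberLong receives it. The Python sketch below reproduces the rounding. It also shows why eventtime is unaffected, since that value is small enough to be represented exactly.

```python
hash_id = -1819413477243867792
event_time = 1467342666000

# A double has a 53-bit significand, so integers beyond 2**53 may not round-trip.
print(abs(hash_id) > 2**53)                   # True
print(int(float(hash_id)))                    # -1819413477243867904
print(int(float(event_time)) == event_time)   # True
```

Passing the value as a string, as in NumberLong("-1819413477243867792"), avoids the intermediate double.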

How can I insert records which have dicts and lists in Flask Eve?

I'm using Flask-Eve to provide an API for my data. I would like to insert my records using Eve, so that I get a _created attribute and the other Eve-added attributes.
Two of my fields are dicts, and one is a list. When I try to insert that to Eve the structure seems to get flattened, losing some information. Trying to tell Eve about the dict & list elements gives me an error on POST, saying those fields need to be dicts and lists, but they already are! Please can someone help me & tell me what I'm doing wrong?
My Eve conf looked like this:
'myendpoint': { 'allow_unknown': True,
'schema': { 'JobTitle': { 'type': 'string',
'required': True,
'empty': False,
'minlength': 3,
'maxlength': 99 },
'JobDescription': { 'type': 'string',
'required': True,
'empty': False,
'minlength': 32,
'maxlength': 102400 },
},
},
But when I POST the following structure using requests:
{
"_id" : ObjectId("56e840686dbf9a5fe069220e"),
"Salary" : {
"OtherPay" : "On Application"
},
"ContactPhone" : "xx",
"JobTypeCodeList" : [
"Public Sector",
"Other"
],
"CompanyName" : "Scc",
"url" : "xx",
"JobTitle" : "xxx",
"WebAdID" : "TA7494725_1_1",
"JobDescription" : "xxxx",
"JobLocation" : {
"DisplayCity" : "BRIDGWATER",
"City" : "BRIDGWATER",
"StateProvince" : "Somerset",
"Country" : "UK",
"PostalCode" : "TA6"
},
"CustomField1" : "Permanent",
"CustomField3" : "FTJOBUKNCSG",
"WebAdManagerEmail" : "xxxx",
"JobType" : "Full",
"ProductID" : "JCPRI0UK"
}
The post line looks like this:
resp = requests.post(url, data = job)
It gets 'flattened' and loses the information from the dicts and list:
{
"_id" : ObjectId("56e83f5a6dbf9a6395ea559d"),
"Salary" : "OtherPay",
"_updated" : ISODate("2016-03-15T16:59:06Z"),
"ContactPhone" : "xx",
"JobTypeCodeList" : "Public Sector",
"CompanyName" : "Scc",
"url" : "xxx",
"JobTitle" : "xx",
"WebAdID" : "TA7494725_1_1",
"JobDescription" : "xxx",
"JobLocation" : "DisplayCity",
"CustomField1" : "Permanent",
"_created" : ISODate("2016-03-15T16:59:06Z"),
"CustomField3" : "FTJOBUKNCSG",
"_etag" : "55d8d394141652f5dc2892a900aa450403a63d10",
"JobType" : "Full",
"ProductID" : "JCPRI0UK"
}
I've tried updating my schema to say some are dicts and lists:
'JobTypeCodeList': { 'type': 'list'},
'Salary': { 'type': 'dict'},
'JobLocation': { 'type': 'dict'},
But then when I POST in the new record I get an error saying
{u'Salary': u'must be of dict type', u'JobTypeCodeList': u'must be of list type', u'JobLocation': u'must be of dict type'},
I've verified before the POST that type(job.Salary) == dict etc, so I'm not sure how to resolve this. While I can POST the record directly into MongoDB ok, bypassing Eve, I'd prefer to use Eve if possible.

In case this is useful to anyone else, I ended up working around this issue by posting a flat structure into Eve, and then using the on_insert and on_update events to loop through the keys and construct objects (and lists) from them.
It's a bit convoluted but it does the trick and now that it's in place it's fairly transparent to use. My objects added to MongoDB through Eve now have embedded lists and hashes, but they also get the handy Eve attributes like _created and _updated, while the POST and PATCH requests also get validated through Eve's normal schema.
The only really awkward thing is that on_insert and on_update send slightly different arguments, so there's a lot of repetition in the code below which I haven't yet refactored out.
Any characters can be used as flags: I'm using two underscores to indicate key/values which should end up as a single object, and two ampersands for values which should be split into a list. The structure I'm posting in now looks like this:
"Salary__OtherPay" : "On Application"
"ContactPhone" : "xx",
"JobTypeCodeList" : "Public Sector&&Other",
"CompanyName" : "Scc",
"url" : "xx",
"JobTitle" : "xxx",
"WebAdID" : "TA7494725_1_1",
"JobDescription" : "xxxx",
"JobLocation__DisplayCity" : "BRIDGWATER",
"JobLocation__City" : "BRIDGWATER",
"JobLocation__StateProvince" : "Somerset",
"JobLocation__Country" : "UK",
"JobLocation__PostalCode" : "TA6"
"CustomField1" : "Permanent",
"CustomField3" : "FTJOBUKNCSG",
"WebAdManagerEmail" : "xxxx",
"JobType" : "Full",
"ProductID" : "JCPRI0UK"
And my Eve schema has been updated accordingly to validate the values of those new key names. Then in the backend I've defined the function below which checks the incoming keys/values and converts them into objects/lists, and also deletes the original __ and && data:
import re
def flat_to_complex(items=None, orig=None):
    if type(items) is dict: # inserts of new objects
        if True: # just to force indentation
            objects = {} # hash-based container for each object
            lists = {} # hash-based container for each list
            for key,value in items.items():
                has_object_wildcard = re.search(r'^([^_]+)__', key, re.IGNORECASE)
                if bool(has_object_wildcard):
                    objects[has_object_wildcard.group(1)] = None
                elif bool(re.search(r'&&', unicode(value))):
                    lists[key] = str(value).split('&&')
            for list_name, this_list in lists.items():
                items[list_name] = this_list
            for obj_name in objects:
                this_obj = {}
                for key,value in items.items():
                    if key.startswith('{s}__'.format(s=obj_name)):
                        match = re.search(r'__(.+)$', key)
                        this_obj[match.group(1)] = value
                        del(items[key])
                objects[obj_name] = this_obj
            for obj_name, this_obj in objects.items():
                items[obj_name] = this_obj
    elif type(items) is list: # updates to existing objects
        for idx in range(len(items)):
            if type(items[idx]) is dict:
                objects = {} # hash-based container for each object
                lists = {} # hash-based container for each list
                for key,value in items[idx].items():
                    has_object_wildcard = re.search(r'^([^_]+)__', key, re.IGNORECASE)
                    if bool(has_object_wildcard):
                        objects[has_object_wildcard.group(1)] = None
                    elif bool(re.search(r'&&', unicode(value))):
                        lists[key] = str(value).split('&&')
                for list_name, this_list in lists.items():
                    items[idx][list_name] = this_list
                for obj_name in objects:
                    this_obj = {}
                    for key,value in items[idx].items():
                        if key.startswith('{s}__'.format(s=obj_name)):
                            match = re.search(r'__(.+)$', key)
                            this_obj[match.group(1)] = value
                            del(items[idx][key])
                    objects[obj_name] = this_obj
                for obj_name, this_obj in objects.items():
                    items[idx][obj_name] = this_obj
And then I just tell Eve to run that function on inserts and updates to that collection:
app.on_insert_myendpoint += flat_to_complex
app.on_update_myendpoint += flat_to_complex
This achieves what I needed and the resulting record in Mongo is the same as the one from the question above (with _created and _updated attributes). It's obviously not ideal but it gets there, and it's fairly easy to work with once it's in place.
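A possible cause of the original flattening, not confirmed in this thread, is the request encoding rather than Eve itself: requests form-encodes a dict passed as data=, and form encoding cannot carry nested structures. The sketch below approximates that encoding with the standard library's urlencode (an assumption about equivalence, using a shortened version of the job record) and compares it with JSON.

```python
import json
from urllib.parse import urlencode, parse_qs

job = {
    "JobTitle": "xxx",
    "Salary": {"OtherPay": "On Application"},
    "JobTypeCodeList": ["Public Sector", "Other"],
}

# Form encoding: iterating a dict yields only its keys, so the nested value is lost.
form_body = urlencode(job, doseq=True)
print(parse_qs(form_body)["Salary"])            # ['OtherPay']
print(parse_qs(form_body)["JobTypeCodeList"])   # ['Public Sector', 'Other']

# JSON encoding keeps the nested structure intact.
json_body = json.dumps(job)
print(json.loads(json_body) == job)             # True
```

If that is what happened here, sending the record as JSON, for example with requests.post(url, json=job), might let the dict and list schema types validate without the flattening workaround.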
