MongoEngine 0.8.3 NotUniqueError on _id field - python

After upgrading MongoEngine from 0.7.9 to 0.8.3, any attempts to save any existing documents in any collection results in a NotUniqueError (user collection shown in example):
Tried to save duplicate unique keys (E11000 duplicate key error index: foo.user.$_id_ dup key: { : ObjectId('xxxxxx') })
I get the same error if I create a new document and save it more than once:
a = Foo()
a.save()
a.save() # results in duplicate error
Mongo by default creates an index on _id which cannot be removed, and I have no other indexes which use _id. Most issues similar to this that I've seen have been on duplicate indexes that aren't _id and can be removed, but this is really odd. I am doing nothing weird with the _id field, just letting Mongo generate it on its own.
Any ideas on what might be causing this to happen?
Thanks!

There was a custom save function which hadn't been migrated to using the new save() arguments, so one of them was caused force_insert to evaluate to true.
So dumb...

Related

solve E11000 duplicate key error collection: _id_ dup key in pymongo

I am trying to insert a great number of document(+1M) using a bulk_write instruction. In order to do that, I create a list of InsertOne function.
python version = 3.7.4
pymongo version = 3.8.0
Document creation:
document = {
'dictionary': ObjectId(dictionary_id),
'price': price,
'source': source,
'promo': promo,
'date': now_utc,
'updatedAt': now_utc,
'createdAt:': now_utc
}
# add line to debug
if '_id' in document.keys():
print(document)
return document
I create the full list of document by adding a new field from a list of elements and create the query by using InsertOne
bulk = []
for element in list_elements:
for document in documents:
document['new_field'] = element
# add line to debug
if '_id' in document.keys():
print(document)
insert = InsertOne(document)
bulk.append(insert)
return bulk
I do the insert by using bulk_write command
collection.bulk_write(bulk, ordered=False)
I attach the documentation https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.bulk_write
According to the documentation,the _id field is added automatically
Parameter - document: The document to insert. If the document is missing an _id field one will be added.
And somehow it seems that is doing it wrong because some of them have the same value.
Receiving this error(with differents _id of course) for 700k of the 1M documents
'E11000 duplicate key error collection: database.collection index: _id_ dup key: { _id: ObjectId(\'5f5fccb4b6f2a4ede9f6df62\') }'
Seems a bug to me from pymongo, because I used this approach in many situations but I didn't with such size of documents
The _id field has to be unique for sure, but, due to this is done automatically by pymongo, I don't know how to approach to this problem, perhaps using a UpdateOne with upsert True with an impossible filter and hope for the best.
I would appreciate any solution or work around for this problem
It seems that as I was adding the new field of the document and append it into the list, I created similar instances of the same element, so I had the same queries len(list_elements) times and that is why I had the duplicated key error.
to solve the problem, I append to the list a copy of the document
bulk.append(document.copy())
and then create the queries with that list
I would like to thank #Belly Buster for his help in the issue
If any of the documents from your code snippet already contain an _id, a new one won't be added, and you run the risk of getting a duplicate error as you have observed.

How to do ...ON DUPLICATE KEY UPDATE... in django

Is there a way to do the following in django's ORM?
INSERT INTO mytable
VALUES (1,2,3)
ON DUPLICATE KEY
UPDATE field=4
I'm familiar with get_or_create, which takes default values, but that doesn't update the record if there are differences in the defaults. Usually I use the following approach, but it takes two queries instead of one:
item = Item(id=1)
item.update(**fields)
item.save()
Is there another way to do this?
I'm familiar with get_or_create, which takes default values, but that doesn't update the record if there are differences in the defaults.
update_or_create should provide the behavior you're looking for.
Item.objects.update_or_create(
id=1,
defaults=fields,
)
It returns the same (object, created) tuple as get_or_create.
Note that this will still perform two queries, but only in the event the record does not already exist (as is the case with get_or_create). If that is for some reason unacceptable, you will likely be stuck writing raw SQL to handle this, which would be unfortunate in terms of readability and maintainability.
I think get_or_create() is still the answer, but only specify the pk field(s).
item, _ = Item.objects.get_or_create(id=1)
item.update(**fields)
item.save()
Django 4.1 has added the support for INSERT...ON DUPLICATE KEY UPDATE query. It will update the fields in case the unique validation fails.
Example of above in a single query:
# Let's say we have an Item model with unique on key
items = [
Item(key='foobar', value=10),
Item(key='foobaz', value=20),
]
# this function will create 2 rows in a single SQL query
Item.objects.bulk_create(items)
# this time it will update the value for foobar
# and create new row for barbaz
# all in a single SQL query
items = [
Item(key='foobar', value=30),
Item(key='barbaz', value=50),
]
Item.objects.bulk_create(
items,
update_conflicts=True,
update_fields=['rate']
)

Mongo Query that always returns zero documents

It's possible to write a query that always returns all of the elements in a collection, to use pymongo as an example:
MongoClient()["database"]["collection"].find({})
However, due to the structure of my code, I would quite like to be able to construct a query that does the opposite, a query that will necessarily return zero elements in all situations:
MongoClient()["database"]["collection"].find(null_query)
How can I define null_query, such that this is correct?
You can ask for any field to be in an empty list. It seems reasonable to use the _id field for this:
db.collection.find({_id: {$in: []}})
If you want a shorter query you don't need to use the _id field
at all:
db.collection.find({_:{$in:[]}})
Alternative if MongoDB version >= 3.4:
Arguably one can also ask if the _id field does not exists, which has been suggested by #Marco13:
db.collection.find({_id: {$exists: false}})
However, this assumes that all documents have the _id field, which is not necessarily true for MongoDB versions before 3.4 where a collection could be created with db.createCollection("mycol", {autoIndexID : false}) so all documents were not automatically given an _id field.

Why do I need to run the postgresql nextval function? And how to prevent it?

I just had an issue with Django and PostgreSQL that I don't understand.
I have a simple model, defined such as:
class MyModel(models.Model):
my_field = models.IntegerField()
my_other_field = models.TextField()
In my view, i have something similar to:
my_object = MyModel(my_field=1, my_other_field='blah')
my_object.save()
Everything was working fine, until this morning. I got this error:
IntegrityError at /my_url/
duplicate key value violates unique constraint "my_model_pkey"
DETAIL: Key (id)=(3) already exists.
CONTEXT: Remote SQL command: INSERT INTO public.my_model(id, my_field, my_other_field) VALUES ($1, $2, $3) RETURNING id
I had this error once, I know it is related to the way PostgreSQL syncs the sequential table associated with my model with the id column. I has to run this function in PostgreSQL until the id returned was greater than the biggest value of the id.
select nextval('my_model_id_seq'::regclass);
My question is: Why did this happen in the first place? And how to prevent it in the future ?
By the way, that's the only way I insert data into the table, I've never inserted data manually.
I hope the question is clear enough
I think the question is not "why is my sequence getting messed up" - rather it is "why is Django trying to supply a value for the id column when inserting a row, instead of allowing the database to insert the next value in the sequence".
The Django documentation describes the algorithm it uses to decide whether it should be doing an UPDATE or an INSERT when you call save().
This algorithm involves checking if the 'id' field of the object is already set to some value. If it is not, then it does an INSERT (presumably not specifying a value for the 'id' field). If it is set, then it first tries to do an UPDATE; if that does not result in an updated record, then it will do an INSERT (this time presumably it would specify a value for the 'id' field).
As pointed out in Erwin's answer, the error message which you seeing indicates it is trying to insert a row while specifying the value for the 'id' field.
I note that it appears this algorithm has changed in version 1.6 of Django. Previously it used a SELECT first to see if a record existed, then an UPDATE if it did or an INSERT if it did not. If your problem has started occurring since upgrading, then that could be a cause. The documentation notes:
There are some rare cases where the database doesn’t report that a row
was updated even if the database contains a row for the object’s
primary key value. An example is the PostgreSQL ON UPDATE trigger
which returns NULL. In such cases it is possible to revert to the old
algorithm by setting the select_on_save option to True.
If this were happening for you, then it would explain your symptoms: the error would actually be occurring when trying to update a value in the database, and django would erroneously think that the row did not exist and then try to create it.
You could check for this by setting 'select_on_save' to true to revert to the old behavior.
Another possible reason for this would be if your code inadvertently set the 'id' attribute on an object to some value, and then called save(). This could cause various problems, depending on whether the value already existed in the database or not. In particular, it might result in creating a row which has an 'id' value which is ahead of the current range of the sequence associated with the column, so that later on you would get errors trying to insert into the row.
Another possible reason could be using the 'force_insert' argument to save(), on a row which had previously loaded from the database (so that it was actually an existing row you should be updating).
The root of the problem lies here (SQL command from your error message):
INSERT INTO public.my_model(id, my_field, my_other_field)
VALUES ($1, $2, $3)
RETURNING id
Since your id column seems to be a serial type, do not insert values manually. Let the default draw from the sequence automatically. Should be:
INSERT INTO public.my_model(my_field, my_other_field)
VALUES ($1, $2)
RETURNING id;
That's the whole point of adding RETURNING id to begin with: to return the newly generated id. If you pass in a value yourself, you wouldn't need to have it returned.
Fix
If the sequence got out of sync somehow, because manual entries conflict with the numbers from nextval(), run this query once:
SELECT setval('my_model_id_seq', max(id)) FROM my_model;
This sets the sequence to the current maximum. Next call is next number, no off-by-one error.

Django get_or_create raises Duplicate entry for key Primary with defaults

Help! Can't figure this out! I'm getting a Integrity error on get_or_create even with a defaults parameter set.
Here's how the model looks stripped down.
class Example(models.Model):model
user = models.ForeignKey(User)
text = models.TextField()
def __unicode__(self):
return "Example"
I run this in Django:
def create_example_model(user, textJson):
defaults = {text: textJson.get("text", "undefined")}
model, created = models.Example.objects.get_or_create(
user=user,
id=textJson.get("id", None),
defaults=defaults)
if not created:
model.text = textJson.get("text", "undefined")
model.save()
return model
I'm getting an error on the get_or_create line:
IntegrityError: (1062, "Duplicate entry '3020' for key 'PRIMARY'")
It's live so I can't really tell what the input is.
Help? There's actually a defaults set, so it's not like, this problem where they do not have a defaults. Plus it doesn't have together-unique. Django : get_or_create Raises duplicate entry with together_unique
I'm using python 2.6, and mysql.
You shouldn't be setting the id for objects in general, you have to be careful when doing that.
Have you checked to see the value for 'id' that you are putting into the database?
If that doesn't fix your issue then it may be a database issue, for PostgreSQL there is a special sequence used to increment the ID's and sometimes this does not get incremented. Something like the following:
SELECT setval('tablename_id_seq', (SELECT MAX(id) + 1 FROM
tablename_id_seq));
get_or_create() will try to create a new object if it can't find one that is an exact match to the arguments you pass in.
So is what I'm assuming is happening is that a different user has made an object with the id of 3020. Since there is no object with the user/id combo you're requesting, it tries to make a new object with that combo, but fails because a different user has already created an item with the id of 3020.
Hopefully that makes sense. See what the following returns. Might give a little insight as to what has gone on.
models.Example.objects.get(id=3020)
You might need to make 3020 a string in the lookup. I'm assuming a string is coming back from your textJson.get() method.
One common but little documented cause for get_or_create() fails is corrupted database indexes.
Django depends on the assumption that there is only one record for given identifier, and this is in turn enforced using UNIQUE index on this particular field in the database. But indexes are constantly being rewritten and they may get corrupted e.g. when the database crashes unexpectedly. In such case the index may no longer return information about an existing record, another record with the same field is added, and as result you'll be hitting the IntegrityError each time you try to get or create this particular record.
The solution is, at least in PostgreSQL, to REINDEX this particular index, but you first need to get rid of the duplicate rows programmatically.

Categories

Resources