SQLAlchemy: Flush Order for Inserts wrong?

I have two tables (contracts and contract_items), where the latter has a foreign key to the first.
When using SQLAlchemy to insert new data from a list into my Postgres database, I'm basically doing the following:
for row in list:
    # Get contract and item from row
    ...
    session.add(contract)
    session.add(contract_item)
    # Do some select statements (which will raise an auto-flush)
    ...
session.commit()
Now... this works for maybe 2-3 runs, sometimes more, sometimes less. Then the part where an auto-flush is executed ends in an exception telling me that contract_item could not be inserted because it has a foreign key to contract and the contract row does not exist yet.
Is the order in which I pass the data to the session's add() function not the order in which the data will be flushed? I actually hoped SQLAlchemy would find the right order in which to flush statements on its own, based on the dependencies. It should be clear that the contract_item row must not be inserted before the contract row when contract_item has a foreign key to contract. Yet the order seems to be random.
I then tried to flush the contract manually before adding contract_item:
for row in list:
    # Getting contract and item from row
    ...
    session.add(contract)
    session.flush()  # Flushing manually
    session.add(contract_item)
    # Do some select statements (which will raise an auto-flush)
    ...
session.commit()
This worked without any problems and the rows got inserted into the database.
Is there any way to set the order in which statements will be flushed for the session? Does SQLAlchemy really not care about dependencies such as foreign keys, or am I making a mistake when adding the data? I'd rather not manage the flushes manually if at all possible.
Is there a way to make SQLAlchemy get the order right?

Had the same problem. What solved it in my case was creating a bidirectional relationship - you need to add a relationship from contracts to contract_items, as described HERE
UPD: actually you can do it more simply: just add a relationship from the contract_items table to the contract table and that should do the trick.

The way the session handles related objects is defined by cascades. Use the "save-update" cascade on a relationship (it is enabled by default) to automatically add related objects, so that you only have to make one add() call. The documentation I linked contains a code example.
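A minimal sketch of what that could look like for the tables in the question (class and column names are assumptions based on the description); with the relationship in place the unit of work knows the contract must be inserted before its items, and the default save-update cascade pulls the item into the session when the contract is added:
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Contract(Base):
    __tablename__ = "contracts"
    id = Column(Integer, primary_key=True)
    items = relationship("ContractItem", back_populates="contract")

class ContractItem(Base):
    __tablename__ = "contract_items"
    id = Column(Integer, primary_key=True)
    contract_id = Column(Integer, ForeignKey("contracts.id"))
    contract = relationship("Contract", back_populates="items")

# link the objects instead of setting the foreign key by hand;
# adding the contract cascades the item in as well, in the right order
contract_item.contract = contract
session.add(contract)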

Related

In SQLAlchemy, how can I get affected rows after update using optimistic lock?

I understand that in a SQLAlchemy session, update() will not take effect, and may not even communicate with the database, until you call session.commit().
Here is my code; I think the technique I'm using is called optimistic locking:
from sqlalchemy import update

with Session() as session:
    # get one record
    obj = session.query(Table).filter_by(key=param).first()
    content = "here is the calculated value of my business"
    # update the record by the id of the record fetched above;
    # if two threads get the same record and reach this update statement,
    # I think only one can succeed, because that is ensured by MySQL MVCC
    session.execute(
        update(Table).where(Table.id == obj.id).values(content=content)
    )
    session.commit()
But I have a question here: what if I want to do something, after session.commit(), only in the thread that got the lock?
For example, the code I expect is like this:
affected_rows = session.commit()
if affected_rows:
    do_something_after_get_lock_success()
According to the docs, ResultProxy has a rowcount property; you can get the affected rows from that.
But note that:
This attribute returns the number of rows matched, which is not necessarily the same as the number of rows that were actually modified - an UPDATE statement, for example, may have no net change on a given row if the SET values given are the same as those present in the row already. Such a row would be matched but not modified. On backends that feature both styles, such as MySQL, rowcount is configured by default to return the match count in all cases.
See here also for a great explanation from Mike Bayer about this.
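A minimal sketch of how that could fit together, assuming a hypothetical version column on Table for the optimistic-lock condition (expected_version and do_something_after_get_lock_success are also assumptions based on the question); rowcount is read from the result of the UPDATE:
from sqlalchemy import update

with Session() as session:
    stmt = (
        update(Table)
        .where(Table.id == obj.id)
        .where(Table.version == expected_version)   # optimistic-lock condition
        .values(content=content, version=expected_version + 1)
    )
    result = session.execute(stmt)
    session.commit()

    # rowcount is the number of rows matched by the UPDATE
    if result.rowcount:
        do_something_after_get_lock_success()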

Why do I need to run the postgresql nextval function? And how to prevent it?

I just had an issue with Django and PostgreSQL that I don't understand.
I have a simple model, defined as:
class MyModel(models.Model):
    my_field = models.IntegerField()
    my_other_field = models.TextField()
In my view, I have something similar to:
my_object = MyModel(my_field=1, my_other_field='blah')
my_object.save()
Everything was working fine until this morning, when I got this error:
IntegrityError at /my_url/
duplicate key value violates unique constraint "my_model_pkey"
DETAIL: Key (id)=(3) already exists.
CONTEXT: Remote SQL command: INSERT INTO public.my_model(id, my_field, my_other_field) VALUES ($1, $2, $3) RETURNING id
I had this error once before; I know it is related to the way PostgreSQL syncs the sequence associated with my model's id column. I had to run this function in PostgreSQL until the id returned was greater than the biggest id value in the table.
select nextval('my_model_id_seq'::regclass);
My question is: why did this happen in the first place? And how can I prevent it in the future?
By the way, that's the only way I insert data into the table; I've never inserted data manually.
I hope the question is clear enough.
I think the question is not "why is my sequence getting messed up" - rather it is "why is Django trying to supply a value for the id column when inserting a row, instead of allowing the database to insert the next value in the sequence".
The Django documentation describes the algorithm it uses to decide whether it should be doing an UPDATE or an INSERT when you call save().
This algorithm involves checking if the 'id' field of the object is already set to some value. If it is not, then it does an INSERT (presumably not specifying a value for the 'id' field). If it is set, then it first tries to do an UPDATE; if that does not result in an updated record, then it will do an INSERT (this time presumably it would specify a value for the 'id' field).
As pointed out in Erwin's answer, the error message which you are seeing indicates it is trying to insert a row while specifying the value for the 'id' field.
I note that it appears this algorithm has changed in version 1.6 of Django. Previously it used a SELECT first to see if a record existed, then an UPDATE if it did or an INSERT if it did not. If your problem has started occurring since upgrading, then that could be a cause. The documentation notes:
There are some rare cases where the database doesn’t report that a row
was updated even if the database contains a row for the object’s
primary key value. An example is the PostgreSQL ON UPDATE trigger
which returns NULL. In such cases it is possible to revert to the old
algorithm by setting the select_on_save option to True.
If this were happening for you, it would explain your symptoms: the error would actually be occurring when trying to update a value in the database, and Django would erroneously think that the row did not exist and then try to create it.
You could check for this by setting 'select_on_save' to True to revert to the old behavior.
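For illustration, setting that option on the model from the question would look something like this (select_on_save is a standard Meta option since Django 1.6):
class MyModel(models.Model):
    my_field = models.IntegerField()
    my_other_field = models.TextField()

    class Meta:
        # revert to the old SELECT-then-UPDATE-or-INSERT algorithm on save()
        select_on_save = True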
Another possible reason for this would be if your code inadvertently set the 'id' attribute on an object to some value and then called save(). This could cause various problems, depending on whether the value already existed in the database or not. In particular, it might result in creating a row whose 'id' value is ahead of the current range of the sequence associated with the column, so that later on you would get errors trying to insert into the table.
Another possible reason could be using the 'force_insert' argument to save() on a row which had previously been loaded from the database (so that it was actually an existing row you should be updating).
The root of the problem lies here (SQL command from your error message):
INSERT INTO public.my_model(id, my_field, my_other_field)
VALUES ($1, $2, $3)
RETURNING id
Since your id column seems to be a serial type, do not insert values manually. Let the default draw from the sequence automatically. Should be:
INSERT INTO public.my_model(my_field, my_other_field)
VALUES ($1, $2)
RETURNING id;
That's the whole point of adding RETURNING id to begin with: to return the newly generated id. If you pass in a value yourself, you wouldn't need to have it returned.
Fix
If the sequence got out of sync somehow, because manual entries conflict with the numbers from nextval(), run this query once:
SELECT setval('my_model_id_seq', max(id)) FROM my_model;
This sets the sequence to the current maximum. Next call is next number, no off-by-one error.
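If you prefer to run that reset from Django rather than psql, a sketch using a raw cursor (this assumes the default sequence name my_model_id_seq; Django's sqlsequencereset management command can also print the equivalent statements for a whole app):
from django.db import connection

with connection.cursor() as cursor:
    # set the sequence to the current maximum id, so the next nextval() yields max(id) + 1
    cursor.execute("SELECT setval('my_model_id_seq', (SELECT max(id) FROM my_model));")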

Automatically merging arbitrary Django models

I have two Django-ORM managed databases that I'd like to merge. Both have a very similar schema, and both have the standard auth_users table, along with a few other shared tables that reference each other as well as auth_users, which I'd like to merge into a single database automatically.
Understandably, this could be very non-trivial depending upon the foreign-key relationships, and what constitutes a "unique" record in each table.
Does anyone know if there exists a tool to do this merge operation?
If nothing like this currently exists, I was considering writing my own management command, based on the standard loaddata command. Essentially, you'd use the standard dumpdata command to export tables from a source database, and then use a modified version of loaddata to "merge" them into the destination database.
For example, if I have databases A and B, and I want to merge database B into database A, then I'd want to follow a procedure according to the pseudo-code:
merge_database_dst = A
merge_database_src = B
for table in sorted(merge_database_dst.get_redundant_tables(merge_database_src), key=acyclic_dependency):
    key = table.get_unique_column_key()
    src_id_to_dst_id = {}
    for record_src in merge_database_src.table.objects.all():
        src_key_value = record_src.get_key_value(key)
        try:
            record_dst = merge_database_dst.table.objects.get(**{key: src_key_value})
            dst_key_value = record_dst.get_key_value(key)
        except merge_database_dst.table.DoesNotExist:
            record_dst = merge_database_dst.table(**{k: convert_fk(v) for k, v in record_src._meta.fields})
            record_dst.save()
            dst_key_value = record_dst.get_key_value(key)
        src_id_to_dst_id[(table, record_src.id)] = record_dst.id
The convert_fk() function would use the src_id_to_dst_id index to convert foreign key references in the source table to the equivalent IDs in the destination table.
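A rough sketch of what that helper could look like (the names are hypothetical, and the mapping would need to be kept across tables rather than reset per table as in the pseudo-code above):
def convert_fk(src_id_to_dst_id, table, src_fk_id):
    # a source-side foreign key points at (table, src_fk_id); look up the id of the
    # equivalent record that was merged into, or already existed in, the destination
    return src_id_to_dst_id[(table, src_fk_id)]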
To summarize, the algorithm would iterate over the table to be merged in the order of dependency, with parents iterated over first. So if we wanted to merge tables auth_users and mycustomprofile, which is dependent on auth_users, we'd iterate ['auth_users','mycustomprofile'].
Each merged table would need some sort of indicator documenting the combination of columns that denotes a universally unique record (i.e. the "key"). For auth_users, that might be the "username" and/or "email" column.
If the value of the key in database B already exists in A, then the record is not imported from B, but the ID of the existing record in A is recorded.
If the value of the key in database B does not exist in A, then the record is imported from B, and the ID of the new record is recorded.
Using the previously recorded ID, a mapping is created, explaining how to map foreign-key references to that specific record in B to the new merged/pre-existing record in A. When future records are merged into A, this mapping would be used to convert the foreign keys.
I could still envision some cases where an imported record references a table not included in the dumpdata, which might cause the entire import to fail; therefore some sort of "dryrun" option would be needed to simulate the import and ensure all FK references can be translated.
Does this seem like a practical approach? Is there a better way?
EDIT: This isn't exactly what I'm looking for, but I thought others might find it interesting. The Turbion project has a mechanism for copying changes between equivalent records in different Django models within the same database. It works by defining a translation layer (i.e. merging.ModelLayer) between two Django models, so, say, if you update the "www" field in user bob@bob.com's profile, it'll automatically update the "url" field in user bob@bob.com's otherprofile.
The functionality I'm looking for is a bit different, in that I want to merge an entire (or partial) database snapshot at infrequent intervals, sort of the way the loaddata management command does.
Wow. This is going to be a complex job regardless. That said:
If I understand the needs of your project correctly, this can be something that can be done using a data migration in South. Even so, I'd be lying if I said it was going to be a joke.
My recommendation is -- and this is mostly a parrot of an assumption in your question, but I want to make it clear -- that you have one "master" table that is the base, and which has records from the other table added to it. So, table A keeps all of its existing records, and only gets additions from B. B feeds additions into A, and once done, B is deleted.
I'm hesitant to write you sample code because your actual job will be so much more complex than this, but I will anyway to try and point you in the right direction. Consider something like...
import datetime
from south.db import db
from south.v2 import DataMigration
from django.db import models

class Migration(DataMigration):

    def forwards(self, orm):
        for b in orm.B.objects.all():
            # sanity check: does this item get copied into A at all?
            if orm.A.objects.filter(username=b.username):
                continue
            # make an A record with the properties of my B record
            a = orm.A(
                first_name=b.first_name,
                last_name=b.last_name,
                email_address=b.email_address,
                [...]
            )
            # save the new A record, and delete the B record
            a.save()
            b.delete()

    def backwards(self, orm):
        # backwards method, if you write one
This would end up migrating all of the Bs not in A to A, and leave you a table of Bs that are expected duplicates, which you could then check by some other means before deleting.
Like I said, this sample isn't meant to be complete. If you decide to go this route, spend time in the South documentation, and particularly make sure you look at data migrations.
That's my 2¢. Hope it helps.

Django get_or_create raises Duplicate entry for key Primary with defaults

Help! Can't figure this out! I'm getting an IntegrityError on get_or_create even with a defaults parameter set.
Here's how the model looks stripped down.
class Example(models.Model):
    user = models.ForeignKey(User)
    text = models.TextField()

    def __unicode__(self):
        return "Example"
I run this in Django:
def create_example_model(user, textJson):
    defaults = {"text": textJson.get("text", "undefined")}
    model, created = models.Example.objects.get_or_create(
        user=user,
        id=textJson.get("id", None),
        defaults=defaults)
    if not created:
        model.text = textJson.get("text", "undefined")
        model.save()
    return model
I'm getting an error on the get_or_create line:
IntegrityError: (1062, "Duplicate entry '3020' for key 'PRIMARY'")
It's live so I can't really tell what the input is.
Help? There's actually a defaults set, so it's not like this problem, where they do not have defaults. Plus it doesn't have unique_together. Django : get_or_create Raises duplicate entry with together_unique
I'm using Python 2.6 and MySQL.
You shouldn't be setting the id for objects in general; you have to be careful when doing that.
Have you checked to see the value for 'id' that you are putting into the database?
If that doesn't fix your issue, then it may be a database issue. For PostgreSQL there is a special sequence used to increment the IDs, and sometimes it does not get incremented. Something like the following will fix that:
SELECT setval('tablename_id_seq', (SELECT MAX(id) + 1 FROM tablename));
get_or_create() will try to create a new object if it can't find one that is an exact match for the arguments you pass in.
So what I'm assuming is happening is that a different user has made an object with the id of 3020. Since there is no object with the user/id combo you're requesting, it tries to make a new object with that combo, but fails because a different user has already created an item with the id of 3020.
Hopefully that makes sense. See what the following returns. Might give a little insight as to what has gone on.
models.Example.objects.get(id=3020)
You might need to make 3020 a string in the lookup. I'm assuming a string is coming back from your textJson.get() method.
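To illustrate the failure mode described above (user_a and user_b are hypothetical):
# user_a already has (or now creates) the row with primary key 3020
models.Example.objects.get_or_create(user=user_a, id=3020, defaults=defaults)

# no row matches user=user_b AND id=3020, so get_or_create falls through to an
# INSERT with the explicit primary key 3020, which MySQL rejects as a duplicate
models.Example.objects.get_or_create(user=user_b, id=3020, defaults=defaults)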
One common but little-documented cause of get_or_create() failures is corrupted database indexes.
Django depends on the assumption that there is only one record for a given identifier, and this is in turn enforced using a UNIQUE index on this particular field in the database. But indexes are constantly being rewritten and they may get corrupted, e.g. when the database crashes unexpectedly. In such a case the index may no longer return information about an existing record, another record with the same field is added, and as a result you'll hit the IntegrityError each time you try to get or create this particular record.
The solution is, at least in PostgreSQL, to REINDEX this particular index, but you first need to get rid of the duplicate rows programmatically.
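A rough sketch of that cleanup from Django, assuming PostgreSQL and the default table and index names myapp_example and myapp_example_pkey (adjust both to your schema):
from django.db import connection

with connection.cursor() as cursor:
    # find primary-key values that the corrupted unique index let through more than once
    cursor.execute("SELECT id FROM myapp_example GROUP BY id HAVING count(*) > 1;")
    duplicate_ids = [row[0] for row in cursor.fetchall()]
    # ...delete or merge the duplicate rows here, then rebuild the index
    cursor.execute("REINDEX INDEX myapp_example_pkey;")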

Continue loading after IntegrityError

In Python, I am populating a SQLite database using executemany, so I can import tens of thousands of rows of data at once. My data is contained in a list of tuples. I had my database set up with the primary keys where I wanted them.
The problem I ran into was that primary key violations would raise an IntegrityError. If I handle the exception, my script stops importing at the primary key conflict.
try:
    self.curs.executemany("INSERT into towers values (NULL,?,?,?,?)", self.insertList)
except IntegrityError:
    print "Primary key error"
conn.commit()
So my questions are: in Python, using executemany, can I:
1. Capture the values that violate the primary key?
2. Continue loading data after I get the primary key errors?
I get why it doesn't continue to load: after the exception I commit the data to the database. I don't know how to continue where I left off, however.
Unfortunately I cannot copy and paste all the code on this network; any help would be greatly appreciated. Right now I have no PKs set as a workaround...
To answer (2) first, if you want to continue loading after you get an error, it's a simple fix on the SQL side:
INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)
This will successfully insert any rows that don't have any violations, and gracefully ignore the conflicts. Please do note however that the IGNORE clause will still fail on Foreign Key violations.
Another option for a conflict resolution clause in your case is: INSERT OR REPLACE INTO .... I strongly recommend the SQLite docs for more information on conflicts and conflict resolution.
As far as I know you cannot do both (1) and (2) simultaneously in an efficient way. You could possibly create a trigger to fire before insertions that can capture conflicting rows but this will impose a lot of unnecessary overhead on all of your insertions. (Someone please let me know if you can do this in a smarter way.) Therefore I would recommend you consider whether you truly need to capture the values of the conflicting rows or whether a redesign of your schema is required, if possible/applicable.
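For reference, a minimal sketch of the INSERT OR IGNORE approach combined with executemany (towers and insertList come from the question; the database file name is made up):
import sqlite3

conn = sqlite3.connect("towers.db")
curs = conn.cursor()

# conflicting rows are skipped instead of aborting the whole batch
curs.executemany("INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)", insertList)
conn.commit()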
You could use lastrowid to get the point where you stopped:
http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.lastrowid
If you use it, however, you can't use executemany.
Use a for loop to iterate through the list and use execute instead of executemany, with the try/except inside the loop so execution continues after an exception. Something like this:
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
    except IntegrityError:
        # here you could insert the items that were rejected into a temporary table
        # without constraints, for later use (question 1)
        pass
conn.commit()
You can even count how many items of the list were really inserted.

Categories

Resources