I'm selecting an object for update and then performing an operation on it:
obj = Model.objects.select_for_update().get(id=someID)
obj.somefield = 1
obj.save()
But I still need to keep the FOR UPDATE lock on this object.
The PostgreSQL documentation says that a FOR UPDATE lock lives until the end of the transaction, and the transaction will end because save() triggers a commit.
Even if I manage commits manually, I still need to save some info to the database (and to do that I need to commit).
So, what can I do in this situation?
If I select the object again, some other process may change it before I acquire a new lock.
(I'm using Django 1.7 and PostgreSQL 9.3.)
You can't hold a row lock after the transaction commits or rolls back.
This might be a reasonable use case for advisory locks, though it's hard to say with the limited detail provided. Advisory locks can operate at the session level, crossing transaction boundaries.
You can't (ab)use a WITH HOLD cursor:
test=> DECLARE test_curs CURSOR WITH HOLD FOR SELECT * FROM dummy FOR UPDATE;
ERROR: DECLARE CURSOR WITH HOLD ... FOR UPDATE is not supported
DETAIL: Holdable cursors must be READ ONLY.
so I think an advisory lock is pretty much your only option.
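For illustration, here's a minimal sketch of taking a session-level advisory lock through Django's connection; the lock key is an arbitrary placeholder that you'd normally derive from the object's id:

from django.db import connection

def hold_advisory_lock(key):
    # pg_advisory_lock() takes a session-level lock that survives commits and
    # is only released by pg_advisory_unlock() or when the session ends.
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_advisory_lock(%s)", [key])

def release_advisory_lock(key):
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_advisory_unlock(%s)", [key])

Keep in mind the lock lives on the database session, so it only protects you for as long as you keep using the same underlying connection.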
The usual way of handling this, by the way, is to let the other task make changes in between. Make sure that each transaction leaves the object in a sensible state. If someone else makes a change between your two changes, design things so that's OK.
To avoid overwriting the changes made in between you can use optimistic concurrency control, otherwise known as optimistic locking, using a row-version column. If you see that someone snuck in and made a change, you reload the object to get the new version, repeat your change, and try saving again. Unfortunately, unlike more sophisticated ORMs such as Hibernate, Django doesn't appear to have built-in support for optimistic concurrency control, but there seem to be extensions that add it.
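To sketch the row-version idea (the model and its version field here are hypothetical, since Django doesn't provide this out of the box):

from django.db import models

class Article(models.Model):
    body = models.TextField()
    version = models.IntegerField(default=0)

def save_if_unchanged(article, new_body):
    # Only touch the row if nobody has bumped the version since we read it.
    updated = Article.objects.filter(
        pk=article.pk, version=article.version,
    ).update(body=new_body, version=article.version + 1)
    if not updated:
        # Someone else changed the row: reload it, reapply the change, retry.
        raise RuntimeError("concurrent modification detected")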
I'm using psycopg2 to manage some PostgreSQL database connections.
As I have found here and in the docs, it seems psycopg2 simulates non-autocommit mode by default, while PostgreSQL itself treats every statement as its own transaction, which is basically autocommit mode.
My question is: which of these cases happens if both psycopg2 and PostgreSQL stay in their default modes? And what exactly happens if it's neither of the two? Any performance advice would be appreciated too.
Code                Psycopg2                 PostgreSQL
Some statements --> One big transaction --> Multiple simple transactions
or
Some statements --> One big transaction --> Big transaction
First, my interpretation of the two documents is that when running psycopg2 with PostgreSQL you will, by default, be running in simulated non-autocommit mode, by virtue of psycopg2 having started a transaction. You can, of course, override that default with autocommit=True. Now to answer your question:
By default you will not be using autocommit=True, and this will require you to do a commit any time you make an update to the database that you wish to be permanent. That may seem inconvenient. But there are many instances when you need to do multiple updates and either they must all succeed or none must succeed. If you specified autocommit=True, then you would have to explicitly start a transaction for these cases. With autocommit=False, you are saved the trouble of ever having to start a transaction at the price of always having to do a commit or rollback. It seems to be a question of preference. I personally prefer autocommit=False.
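To make the two styles concrete, here's a rough sketch with psycopg2 (the connection string and the accounts table are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN
cur = conn.cursor()

# Default (autocommit off): psycopg2 starts a transaction on the first statement,
# and nothing becomes permanent until you commit.
cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
conn.commit()  # both updates succeed together or not at all

# autocommit=True: every statement is its own transaction.
conn.autocommit = True
cur.execute("UPDATE accounts SET balance = 0 WHERE id = 3")  # committed immediately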
As far as performance is concerned, specifying autocommit=True will save you the cost of starting a needless transaction in many instances. But I can't quantify how much of a performance savings that really is.
I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options for how to do this: 1) use a subselect, 2) use a transaction, 3) let it fail.
It seems easiest to do something like this:
try:
    cursor.execute('INSERT...')
except psycopg2.IntegrityError:
    pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?
The foolproof way to do it at the moment is to try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter which one you choose when it comes to performance, since either way you're sending a request to the server and retrieving the result. (Where it may matter is in your need to define a savepoint if you're trying it from within a transaction, for the same reason. Or, as highlighted in Craig's answer, if you have many failed statements.)
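For instance, a savepoint lets a failed insert be rolled back without aborting the surrounding transaction (the items table and values are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN
cursor = conn.cursor()
# ... other work in the same transaction ...
cursor.execute("SAVEPOINT before_insert")
try:
    cursor.execute("INSERT INTO items (id, name) VALUES (%s, %s)", (1, 'foo'))
    cursor.execute("RELEASE SAVEPOINT before_insert")
except psycopg2.IntegrityError:
    # Undo only the failed insert; the rest of the transaction stays usable.
    cursor.execute("ROLLBACK TO SAVEPOINT before_insert")
# ... more work, then ...
conn.commit()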
In future releases, a proper merge and upsert are on the radar, but as the near-decade-long discussion suggests, implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically, though: using a subselect is cheap, as noted by Erwin, but isn't concurrency-proof (unless you lock properly); using locks basically amounts to locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, less so for potentially new ones that are inserted concurrently, if you seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.
Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
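For reference, the syntax that eventually shipped in 9.5 looks roughly like this when driven from psycopg2 (not available in 9.3; the items table is a placeholder):

cursor.execute(
    "INSERT INTO items (id, name) VALUES (%s, %s) ON CONFLICT (id) DO NOTHING",
    (1, 'foo'),
)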
In the meantime, you must attempt the insert and, if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.
When using more complex, hierarchical models with differing settings for how cascade deletes are handled, it gets quite hard to figure out beforehand exactly what a delete() will do to the database.
I couldn't find any way to get this piece of information ("Hey SQLAlchemy, what will be deleted if I delete that object over there?") from SQLAlchemy. Implementing this by myself doesn't really seem like an option since this would result sooner or later in situations where my prediction and the actual consequences of the delete() differ, which would be very… unpleasant for the user.
I asked this question on the SQLAlchemy mailing list, and Michael Bayer explained the possible options (thanks a lot again! :-)):
The only deletes that aren't present in session.deleted before the flush are those that will occur because a particular object is an "orphan", and the objects which would be deleted as a result of a cascade on that orphan.
So without orphans taken into account, session.deleted tells you everything that is to be deleted.
To take orphans into account requires traversing through all the relationships as the unit of work does, looking for objects that are currently orphans (there's an API function that will tell you this - if the object is considered an "orphan" by any attribute that refers to it with delete-orphan cascade, it's considered an "orphan"), and then traversing through the relationships of those orphans, considering them to be marked as "deleted", and then doing all the rules again for those newly-deleted objects.
The system right now is implemented by orm/dependency.py. It is probably not hard to literally run a unit of work process across the session normally, but just not emit the SQL, this would give you the final flush plan. But this is an expensive process that I wouldn't want to be calling all the time.
A feature add is difficult here because the use case is not clear. Knowing what will be deleted basically requires half the flush process actually proceed. But you can already implement events inside the flush process itself, most directly the before_delete() and after_delete() events that will guaranteed catch everything. So the rationale for a new feature that basically runs half the flush, before you just do the flush anyway and could just put events inside of it, isn't clear.
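For illustration, a minimal sketch of hooking those events (SomeModel is a placeholder mapped class):

from sqlalchemy import event

@event.listens_for(SomeModel, "before_delete")
def on_before_delete(mapper, connection, target):
    # Fires inside the flush for every row actually being deleted, including
    # rows removed by cascades, just before the DELETE is emitted.
    print("about to delete:", target)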
I guess the big question is, "when are you calling this".
An easy system would be to add a new event "flush_plan_complete" which will put you into a flush() right as the full plan has been assembled, but before any SQL occurs. It could allow you to register more objects for activity, and it would then rerun the flush plan to consider any new changes (since that's how it works right now anyway). How this registration would proceed is tricky, since it would be nice to use the Session normally there, but that makes this more complicated to implement. But then it could iterate again through the new changes and find any remaining steps to take before proceeding normally.
A user can perform an action, which may trigger dependent actions (which themselves may have dependent actions) and I want to be able to cancel the whole thing if the user cancels a dependent action.
The typical way I've seen this done is some variant of an undo stack: each action needs to know how to undo itself, and if a child action is cancelled the undos cascade their way up. Writing undo methods is sometimes tricky, and there isn't always enough information in context to properly know how to undo an action in an isolated manner.
I just thought of a (potentially) easier way, which is to just pickle the state of the (relevant parts of the) program; the cancel would then restore its former state, without needing to create separate undo logic for each action.
Has anyone tried this? Any gotchas to watch out for? Any reason not to do this?
Edit: The dependent actions must happen after the parent action (and even whether there are dependent actions may depend on the result of the parent action), so just checking all the dependencies before doing anything isn't an option. I guess you could say an action triggers other actions, but if one of the triggered actions cannot be performed, then none of it should happen.
Well, as you mentioned the data design is loosely coupled, so I don't think you need to pickle it if it's in memory. Just take a copy of all the relevant variables; transaction.abort() would then just copy them back, and transaction.commit() would just discard the copy of the data.
There are issues, but none that you don't have with the pickle solution.
You can use pickle to store your state if all elements of the state are serializable (usually they are). The only reasons not to do so:
if you have to store pointers to any objects that are not saved in the state, you will have problems with those pointers after performing an undo operation.
this method could be expensive, depending on the size of your state.
Also, you can compress the pickled state (e.g. with zlib) to lower memory usage in exchange for higher CPU usage.
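A minimal sketch of the snapshot/restore idea, assuming the relevant state is picklable and using zlib for the compression mentioned above (app_state is a placeholder for whatever object holds that state):

import pickle
import zlib

def snapshot(state):
    # Serialize the relevant parts of the program and compress the result.
    return zlib.compress(pickle.dumps(state))

def restore(blob):
    # Rebuild the state exactly as it was when the snapshot was taken.
    return pickle.loads(zlib.decompress(blob))

saved = snapshot(app_state)   # before running the dependent actions
# ... actions run, user cancels ...
app_state = restore(saved)    # roll everything back in one step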
I'm making a Django web-app which allows a user to build up a set of changes over a series of GETs/POSTs before committing them to the database (or reverting) with a final POST. I have to keep the updates isolated from any concurrent database users until they are confirmed (this is a configuration front-end), ruling out committing after each POST.
My preferred solution is to use a per-session transaction. This keeps all the problems of remembering what's changed (and how it affects subsequent queries), together with implementing commit/rollback, in the database where it belongs. Deadlock and long-held locks are not an issue, as due to external constraints there can only be one user configuring the system at any one time, and they are well-behaved.
However, I cannot find documentation on setting up Django's ORM to use this sort of transaction model. I have thrown together a minimal monkey-patch (ew!) to solve the problem, but dislike such a fragile solution. Has anyone else done this before? Have I missed some documentation somewhere?
(My version of Django is 1.0.2 Final, and I am using an Oracle database.)
Multiple, concurrent, session-scale transactions will generally lead to deadlocks or worse (worse == livelock: long delays while locks are held by another session).
This design is not the best policy, which is why Django discourages it.
The better solution is the following.
Design a Memento class that records the user's change. This could be a saved copy of their form input. You may need to record additional information if the state changes are complex. Otherwise, a copy of the form input may be enough.
Accumulate the sequence of Memento objects in their session. Note that each step in the transaction will involve fetches from the database and validation to see if the chain of mementos will still "work". Sometimes they won't work, because someone else changed something in this chain of mementos. What now?
When you present the 'ready to commit?' page, you've replayed the sequence of Mementos and are pretty sure they'll work. When they submit "Commit", you have to replay the Mementos one last time, hoping they're still going to work. If they do, great. If they don't, someone changed something, and you're back at step 2: what now?
This seems complex.
Yes, it does. However it does not hold any locks, allowing blistering speed and little opportunity for deadlock. The transaction is confined to the "Commit" view function which actually applies the sequence of Mementos to the database, saves the results, and does a final commit to end the transaction.
The alternative -- holding locks while the user steps out for a quick cup of coffee on step n-1 out of n -- is unworkable.
For more information on Memento, see this.
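A rough sketch of the memento idea described above, assuming a hypothetical ConfigForm (a ModelForm) and a modern Django with transaction.atomic(); on 1.0 you would use transaction.commit_on_success instead:

from django.db import transaction

def remember_step(request, form_data):
    # Accumulate each step's raw form input as a memento in the session.
    request.session.setdefault("mementos", []).append(form_data)
    request.session.modified = True

def commit_steps(request):
    # The "Commit" view: replay every memento inside one short transaction.
    with transaction.atomic():
        for data in request.session.get("mementos", []):
            form = ConfigForm(data)
            if not form.is_valid():
                # Someone changed the data underneath us: back to "what now?"
                raise ValueError("a step no longer applies")
            form.save()
    request.session["mementos"] = []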
In case anyone else ever has the exact same problem as me (I hope not), here is my monkeypatch. It's fragile and ugly, and changes private methods, but thankfully it's small. Please don't use it unless you really have to. As mentioned by others, any application using it effectively prevents multiple users doing updates at the same time, on penalty of deadlock. (In my application, there may be many readers, but multiple concurrent updates are deliberately excluded.)
I have a "user" object which persists across a user session, and contains a persistent connection object. When I validate a particular HTTP interaction is part of a session, I also store the user object on django.db.connection, which is thread-local.
def monkeyPatchDjangoDBConnection():
    import django.db

    def validConnection():
        # Hand Django the persistent per-user connection instead of a fresh one.
        if django.db.connection.connection is None:
            django.db.connection.connection = django.db.connection.user.connection
        return True

    def close():
        # Detach instead of really closing, so the session's transaction
        # survives between requests.
        django.db.connection.connection = None

    django.db.connection._valid_connection = validConnection
    django.db.connection.close = close

monkeyPatchDjangoDBConnection()

def setUserOnThisThread(user):
    import django.db
    django.db.connection.user = user
This last is called automatically at the start of any method annotated with @login_required, so 99% of my code is insulated from the specifics of this hack.
I came up with something similar to the Memento pattern, but different enough that I think it bears posting. When a user starts an editing session, I duplicate the target object to a temporary object in the database. All subsequent editing operations affect the duplicate. Instead of saving the object state in a memento at each change, I store operation objects. When I apply an operation to an object, it returns the inverse operation, which I store.
Saving operations is much cheaper for me than mementos, since the operations can be described with a few small data items, while the object being edited is much bigger. Also I apply the operations as I go and save the undos, so that the temporary in the db always corresponds to the version in the user's browser. I never have to replay a collection of changes; the temporary is always only one operation away from the next version.
To implement "undo," I pop the last undo object off the stack (as it were--by retrieving the latest operation for the temporary object from the db) apply it to the temporary and return the transformed temporary. I could also push the resultant operation onto a redo stack if I cared to implement redo.
To implement "save changes," i.e. commit, I de-activate and time-stamp the original object and activate the temporary in it's place.
To implement "cancel," i.e. rollback, I do nothing! I could delete the temporary, of course, because there's no way for the user to retrieve it once the editing session is over, but I like to keep the canceled edit sessions so I can run stats on them before clearing them out with a cron job.