I have a view that needs to perform updates to the database involving a shared resource that needs locking (the implementation is complex, but nothing more than a shared counter at heart).
To shield myself from race conditions, I'm using code that looks roughly like this:
@transaction.commit_manually
def do_it(request):
    affected_models = Something.objects.select_for_update(blah = 1)
    for model in affected_models:
        model.modify()
        model.save()
    transaction.commit()
Is this usage of commit_manually, select_for_update() and save() ok? How can I write a test that confirms this? I can't find, say, a signal that Django fires between transactions; and I can't just run it and hope concurrency issues arise and are dealt with.
Why not use commit_on_success there?
I think the query itself should look like:
Something.objects.select_for_update().filter(...)
I don't think Django does anything special on select_for_update that you could assert on. The only assertion that comes to mind is assertTrue(queryset.query.select_for_update). It tests nothing and might be useful only if somebody accidentally(?) removes the call.
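For what it's worth, a minimal sketch of that assertion (the app and model names follow the question; it only proves the flag is set on the queryset, nothing about actual locking behaviour):

from django.test import TestCase
from myapp.models import Something  # hypothetical import path

class SelectForUpdateFlagTest(TestCase):
    def test_queryset_requests_row_locks(self):
        qs = Something.objects.select_for_update().filter(blah=1)
        # query.select_for_update is the flag Django sets when FOR UPDATE is requested
        self.assertTrue(qs.query.select_for_update)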
Even if you come up with some unit test for this race condition problem, I don't think it would be wise to put that into the project.
Rather focus on testing your code, not the db behaviour.
Flask example applications Flasky and Flaskr create, drop, and re-seed their entire database between each test. Even if this doesn't make the test suite run slowly, I wonder if there is a way to accomplish the same thing while not being so "destructive". I'm surprised there isn't a "softer" way to roll back any changes. I've tried a few things that haven't worked.
For context, my tests call endpoints through the Flask test_client using something like self.client.post('/things'), and within the endpoints session.commit() is called.
I've tried making my own "commit" function that actually only flushes during tests, but then if I make two sequential requests like self.client.post('/things') and self.client.get('/things'), the newly created item is not present in the result set because the new request has a new request context with a new DB session (and transaction) which is not aware of changes that are merely flushed, not committed. This seems like an unavoidable problem with this approach.
I've tried using subtransactions with db.session.begin(subtransactions=True), but then I run into an even worse problem. Because I have autoflush=False, nothing actually gets committed OR flushed until the outer transaction is committed. So again, any requests that rely on data modified by earlier requests in the same test will fail. Even with autoflush=True, the earlier problem would occur for sequential requests.
I've tried nested transactions with the same result as subtransactions, and apparently they don't do what I was hoping they would do. I saw that nested transactions issue a SAVEPOINT command to the DB. I hoped that would allow commits to happen, visible to other sessions, and then let me roll back to that savepoint at an arbitrary time, but that's not what they do. They're used within transactions, and have the same issues as the previous approach.
Update: Apparently there is a way of using nested transactions on a Connection rather than a Session, which might work but requires some restructuring of the application to use a Connection created by the test code. I haven't tried this yet; I'll get around to it eventually, but meanwhile I hope there's another way. Some say this approach may not work with MySQL because of the distinction between "real nested transactions" and savepoints, but since the Postgres documentation also says to use SAVEPOINT rather than attempting to nest transactions, I suspect that warning can be disregarded: if the savepoint-based approach works on one database, it will probably work on the other.
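For reference, a rough sketch of that Connection-based approach, adapted from the "join a Session into an external transaction" recipe in the SQLAlchemy docs; app and db are assumed to be the usual Flask and Flask-SQLAlchemy objects, and commits issued inside request handlers may still need the recipe's SAVEPOINT-restart handling:

import unittest

class RollbackTestCase(unittest.TestCase):
    def setUp(self):
        self.client = app.test_client()
        self.connection = db.engine.connect()
        self.trans = self.connection.begin()  # outer transaction owned by the test
        # Bind the app's session to this connection so every request in the
        # test participates in the outer transaction started above.
        db.session = db.create_scoped_session(options={"bind": self.connection})

    def tearDown(self):
        db.session.remove()
        self.trans.rollback()  # undo everything the test's requests did
        self.connection.close()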
Another option that avoids a DB drop_all, create_all, and re-seeding with data is to manually undo the changes that a test introduces. But when testing an endpoint, many rows could be inserted into many tables, and reliably undoing this manually would be both exhausting and bug-prone.
After trying all those things, I start to see the wisdom in dropping and creating between tests. However, is there something I've tried above that SHOULD work, but I'm simply doing something incorrectly? Or is there yet another method that someone is aware of that I haven't tried yet?
Update: Another method I just found on StackOverflow is to truncate all the tables instead of dropping and creating them. This is apparently about twice as fast, but it still seems heavy-handed and isn't as convenient as a rollback (which would not delete any sample data placed in the DB prior to the test case).
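A sketch of that idea (using DELETE rather than TRUNCATE for portability; it assumes a Flask-SQLAlchemy db object and deletes children before parents by walking sorted_tables in reverse):

def clear_all_tables():
    for table in reversed(db.metadata.sorted_tables):
        db.session.execute(table.delete())
    db.session.commit()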
For unit tests I think the standard approach of regenerating the entire database is what makes the most sense, as you've seen in my examples and many others. But I agree, for large applications this can take a lot of time during your test run.
Thanks to SQLAlchemy you can get away with writing a lot of generic database code that runs on your production database, which might be MySQL, Postgres, etc., and at the same time runs on sqlite for tests. It is not possible for every application out there to use 100% generic SQLAlchemy, since sqlite has some important differences from the others, but in many cases this works well.
So whenever possible, I set up a sqlite database for my tests. Even for large databases, using an in-memory sqlite database should be pretty fast. Another very fast alternative is to generate your tables once, make a backup of your sqlite file with all the empty tables, then before each test restore the file instead of doing a create_all().
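Roughly what such a sqlite-backed test setup looks like (the config keys are standard Flask-SQLAlchemy; the rest is illustrative, and depending on how connection pooling is configured you may need a file-based sqlite URI instead of the in-memory one):

import unittest

class BasicTestCase(unittest.TestCase):
    def setUp(self):
        app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite://"  # in-memory database
        app.config["TESTING"] = True
        self.client = app.test_client()
        self.ctx = app.app_context()
        self.ctx.push()
        db.create_all()

    def tearDown(self):
        db.session.remove()
        db.drop_all()
        self.ctx.pop()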
I have not explored the idea of doing an initial backup of the database with empty tables and then using file-based restores between tests for MySQL or Postgres, but in theory that should work as well, so I guess that is one solution you haven't mentioned in your list. You will need to stop and restart the db service in between your tests, though.
A bit of background:
I am using pyramid framework with SQLAlchemy. My db session is handled by pyramid_tm and ZTE
from sqlalchemy.orm import scoped_session, sessionmaker
from zope.sqlalchemy import ZopeTransactionExtension

DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
I have a very complicated database design with lots of model classes, foreign keys, and complex relationships between my models.
So while doing some very complicated logic on my models (deleting, updating, inserting, and moving objects around between relationships in different models) I used to get random IntegrityErrors, which would go away after restarting pserve.
This is very strange because autoflush is on, and in theory the session should be flushed as soon as I change anything on the models.
So my solution to the random IntegrityError was to manually flush the session within my logic whenever things get very complicated.
Since adding the DBSession.flush() calls to my logic I haven't gotten the IntegrityError any more.
The question
Now I have 2 questions:
How come autoflush does not prevent the IntegrityError? Does autoflush not flush the pending changes on the models the way an explicit DBSession.flush() does?
Are there any side effects of calling DBSession.flush() within my code? I can't really think of any (apart from the small performance overhead of the extra round trips to the DB). I don't really like calling DBSession.flush() in my code, as it is something that should really be handled by the framework.
See also
When should I be calling flush() on SQLAlchemy?
Thank you.
It is very hard to say why you used to get IntegrityErrors without seeing any code, but in theory there are a few scenarios where autoflush may actually cause one by flushing the session prematurely. For example:
COURSE_ID = 10
student = Student(name="Bob")
student.course_id = COURSE_ID  # refers to a course that is not in the database yet
DBSession.add(student)
# a query here would autoflush and INSERT the student before its course exists
course = Course(id=COURSE_ID, name="SQLAlchemy")
DBSession.add(course)
The above code will probably (I haven't tested it) fail if autoflush kicks in before the course is added, and should succeed if you let SQLAlchemy flush the changes only once both objects are in the session.
I don't think there's any harm in flushing the session periodically if it helps, but again, it's hard to tell whether something can be done to avoid the manual flush without any code samples.
I have a task running in a thread that saves its input data only after quite a while. Django typically saves the whole object, which may have changed in the meantime. I also don't want to keep the transaction open that long, since it will fail or block other tasks. My solution is to reload the data and save the result. Is this the way to go, or is there some optimistic locking scheme, partial save, or something else I should use?
My solution:
with transaction.atomic():
    obj = mod.TheModel.objects.get(id=the_id)

# Work the task

with transaction.atomic():
    obj = mod.TheObject.objects.get(id=obj.id)
    obj.result = result
    obj.save()
Generally, if you don't want long-running operations to block other tasks, you want these operations to run asynchronously.
There are libraries to do this kind of task with Django. The most famous is probably Celery: http://www.celeryproject.org/
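A minimal sketch of pushing the long-running step into a Celery task; do_long_work is a hypothetical placeholder for the work itself, and the model names follow the question:

from celery import shared_task
from django.db import transaction
import myapp.models as mod  # hypothetical module path

@shared_task
def work_and_save(the_id):
    result = do_long_work(the_id)  # the long-running part, outside any transaction
    with transaction.atomic():
        obj = mod.TheObject.objects.get(id=the_id)
        obj.result = result
        obj.save(update_fields=["result"])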
I would recommend a partial save via the Django QuerySet's update method. There's also an update_fields keyword argument to the save method on instances which limits the fields to be saved; however, any logic within the save method itself might rely on other data being up to date, and there's a relatively new instance method, refresh_from_db, for that purpose. If your save method isn't overridden, both approaches produce exactly the same SQL anyway, with update avoiding any potential issues with data integrity.
Example:
num_changed = mod.TheObject.objects.filter(id=obj.id).update(result=result)
if num_changed == 0:
    # Handle as you would have handled DoesNotExist from your get call
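For comparison, a rough sketch of the save()-based variant mentioned above (assuming no custom save() logic interferes):

obj.refresh_from_db()                # re-read the current row into the instance
obj.result = result
obj.save(update_fields=["result"])   # emits an UPDATE limited to the result column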
I'm beginning to develop a site in Pyramid and, before I commit to using SQLAlchemy, would like to know if it's possible to wrap/extend it to add in 'database lock' functionality.
One quick example as to why I'd like this functionality is for write throttling. My wrapper will be able to detect if a user is flooding the database with writes and, if they are, they'll be prevented from further writes for X amount of time.
I was looking into extending sqlalchemy.orm.session.Session and overriding the add method so that it performs this throttle check. If the user passes the check, it would simply pass the call off to super(MyWrapper, self).add(*args, **kwargs).
This is easy enough to do. However, it only adds the throttle functionality to DBSession.add. If somewhere in my code I use DBSession.execute, the throttle check is bypassed.
Is there a cleaner way to accomplish this?
Detecting excessive network traffic from particular clients is something you'd probably want to do outside the ORM, perhaps even outside the Python app, such as at the network or database client configuration level.
If within the Python app, definitely not in the ORM. add() doesn't correspond very cleanly to a SQL statement in any case (no SQL is emitted until flush(), and only if the given object was previously pending. add() also cascades to many objects and can result in any number of INSERT statements).
For a simple count of statements, cursor execute events are the best way to go. This gives you a hook at the point of calling execute() on the DBAPI cursor. See before_cursor_execute() at http://www.sqlalchemy.org/docs/core/events.html#connection-events.
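A rough sketch of counting statements with that hook; the per-user bookkeeping (get_current_user_id, too_many_writes) is invented for illustration, and engine is your application's Engine:

from sqlalchemy import event

@event.listens_for(engine, "before_cursor_execute")
def throttle_writes(conn, cursor, statement, parameters, context, executemany):
    # Fires just before execute() is called on the DBAPI cursor.
    if statement.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        user_id = get_current_user_id()   # hypothetical: resolve the current client
        if too_many_writes(user_id):      # hypothetical: your throttling policy
            raise RuntimeError("write throttled for user %s" % user_id)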
I'm making a Django web-app which allows a user to build up a set of changes over a series of GETs/POSTs before committing them to the database (or reverting) with a final POST. I have to keep the updates isolated from any concurrent database users until they are confirmed (this is a configuration front-end), ruling out committing after each POST.
My preferred solution is to use a per-session transaction. This keeps all the problems of remembering what's changed (and how it affects subsequent queries), together with implementing commit/rollback, in the database where it belongs. Deadlock and long-held locks are not an issue, as due to external constraints there can only be one user configuring the system at any one time, and they are well-behaved.
However, I cannot find documentation on setting up Django's ORM to use this sort of transaction model. I have thrown together a minimal monkey-patch (ew!) to solve the problem, but dislike such a fragile solution. Has anyone else done this before? Have I missed some documentation somewhere?
(My version of Django is 1.0.2 Final, and I am using an Oracle database.)
Multiple, concurrent, session-scale transactions will generally lead to deadlocks or worse (worse == livelock, long delays while locks are held by another session.)
This design is not the best policy, which is why Django discourages it.
The better solution is the following.
Design a Memento class that records the user's change. This could be a saved copy of their form input. You may need to record additional information if the state changes are complex. Otherwise, a copy of the form input may be enough.
Accumulate the sequence of Memento objects in their session. Note that each step in the transaction will involve fetches from the database and validation to see if the chain of mementos will still "work". Sometimes they won't work, because someone else changed something the chain of mementos depends on. What now?
When you present the 'ready to commit?' page, you've replayed the sequence of Mementos and are pretty sure they'll work. When they submit "Commit", you have to replay the Mementos one last time, hoping they're still going to work. If they do, great. If they don't, someone changed something, and you're back at step 2: what now?
This seems complex.
Yes, it does. However it does not hold any locks, allowing blistering speed and little opportunity for deadlock. The transaction is confined to the "Commit" view function which actually applies the sequence of Mementos to the database, saves the results, and does a final commit to end the transaction.
The alternative -- holding locks while the user steps out for a quick cup of coffee on step n-1 out of n -- is unworkable.
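A minimal sketch of the memento approach described above, using modern Django idioms (transaction.atomic and select_for_update did not exist in Django 1.0); ConfigObject and the session keys are purely illustrative:

from django.db import transaction

def record_step(request, form):
    # Step 2: accumulate each validated edit as plain form data in the session.
    request.session.setdefault("mementos", []).append(form.cleaned_data)
    request.session.modified = True  # mutated in place, so mark the session dirty

def commit_view(request):
    # Step 3: replay the whole chain inside one short transaction.
    with transaction.atomic():
        obj = ConfigObject.objects.select_for_update().get(pk=request.session["target_pk"])
        for data in request.session["mementos"]:
            for field, value in data.items():
                setattr(obj, field, value)
        obj.full_clean()  # re-validate; someone else may have changed things
        obj.save()
    del request.session["mementos"]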
For more information on Memento, see this.
In case anyone else ever has the exact same problem as me (I hope not), here is my monkeypatch. It's fragile and ugly, and changes private methods, but thankfully it's small. Please don't use it unless you really have to. As mentioned by others, any application using it effectively prevents multiple users doing updates at the same time, on penalty of deadlock. (In my application, there may be many readers, but multiple concurrent updates are deliberately excluded.)
I have a "user" object which persists across a user session, and contains a persistent connection object. When I validate a particular HTTP interaction is part of a session, I also store the user object on django.db.connection, which is thread-local.
def monkeyPatchDjangoDBConnection():
    import django.db

    def validConnection():
        if django.db.connection.connection is None:
            django.db.connection.connection = django.db.connection.user.connection
        return True

    def close():
        django.db.connection.connection = None

    django.db.connection._valid_connection = validConnection
    django.db.connection.close = close

monkeyPatchDjangoDBConnection()

def setUserOnThisThread(user):
    import django.db
    django.db.connection.user = user
This last function is called automatically at the start of any method decorated with @login_required, so 99% of my code is insulated from the specifics of this hack.
I came up with something similar to the Memento pattern, but different enough that I think it bears posting. When a user starts an editing session, I duplicate the target object to a temporary object in the database. All subsequent editing operations affect the duplicate. Instead of saving the object state in a memento at each change, I store operation objects. When I apply an operation to an object, it returns the inverse operation, which I store.
Saving operations is much cheaper for me than mementos, since the operations can be described with a few small data items, while the object being edited is much bigger. Also I apply the operations as I go and save the undos, so that the temporary in the db always corresponds to the version in the user's browser. I never have to replay a collection of changes; the temporary is always only one operation away from the next version.
To implement "undo," I pop the last undo object off the stack (as it were--by retrieving the latest operation for the temporary object from the db) apply it to the temporary and return the transformed temporary. I could also push the resultant operation onto a redo stack if I cared to implement redo.
To implement "save changes," i.e. commit, I de-activate and time-stamp the original object and activate the temporary in it's place.
To implement "cancel," i.e. rollback, I do nothing! I could delete the temporary, of course, because there's no way for the user to retrieve it once the editing session is over, but I like to keep the canceled edit sessions so I can run stats on them before clearing them out with a cron job.