Force commit of nested save() within a transaction - python

I have a function where I save a large number of models (thousands at a time). This takes several minutes, so I have written a progress bar to display progress to the user. The progress bar works by polling a URL (from JavaScript) and looking at a request.session value to see the state of the first call (the one that is saving).
The problem is that the first call is within a @transaction.commit_on_success decorator, and because I am using database-backed sessions, when I try to force request.session.save() it is appended to the ongoing transaction instead of committing immediately. This results in the progress bar only being updated once all the saves are complete, thus rendering it useless.
My question is (and I'm 99.99% sure I already know the answer): can you commit individual statements within a transaction without committing the whole lot? i.e. I need to commit just the request.session.save() whilst leaving all of the others pending.
Many thanks, Alex

No, both your main saves and the status bar updates will be conducted using the same database connection so they will be part of the same transaction.
I can see two options to avoid this.
You can create your own, separate database connection and save the status bar updates using that.
Don't save the status bar updates to the database at all and instead use a cache to store them. As long as you don't use the database cache backend (ideally you'd use memcached) this will work fine.
My preferred option would be the second one. You'd need to delve into the Django internals to get your own database connection, so that code is likely to end up fragile and messy.
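A minimal sketch of the cache-based approach, assuming a non-database cache backend (e.g. memcached) is configured in CACHES; the key name and view functions are made up for illustration:
from django.core.cache import cache
from django.http import JsonResponse

def bulk_save(request, items):
    # Cache writes are not part of the database transaction, so the
    # progress value is visible to other requests immediately.
    key = 'save_progress_%s' % request.session.session_key
    for i, item in enumerate(items, start=1):
        item.save()
        cache.set(key, int(100.0 * i / len(items)), timeout=300)

def save_progress(request):
    # Polled from JavaScript instead of reading request.session.
    key = 'save_progress_%s' % request.session.session_key
    return JsonResponse({'progress': cache.get(key, 0)})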


Is a Rollback of my database possible in py2neo after execution of program?

I am trying to run a cypher query in py2neo and overcome some restrictions.
I actually want to add some weight to the edges of my graph for a specific execution, but after the program has run I don't want the changes to remain in my neo4j DB (i.e. a rollback).
I want this in order to run the program/query with the edge weights as parameters every time.
Thanks in advance!
It depends on timing. I believe py2neo uses the transactional Cypher HTTP endpoint. Rollback is possible before a transaction has been committed or finished, not after.
So let's say you're running your cypher query and doing other things at the same time.
tx = graph.cypher.begin()          # open a new transaction
statement = "some nifty mutating cypher in here"
tx.append(statement)               # queue the statement in the transaction
tx.commit()                        # once committed, there is no going back
By the time you hit commit, you're done. The database doesn't work like git, where you can undo any past change or revert to a previous state of the database at a certain time. Generally you're creating transactions, then you're either committing them or rolling them back.
So, you can roll-back a transaction if you have not yet finished/committed it. This is useful because your transaction might include 8-9 different queries, all making changes. What if the first 4 succeed, then #5 fails? That's really what transaction rollback is meant to address, to make a big multi-query change into something atomic, that either all works, or all doesn't.
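For completeness, a minimal sketch of the rollback route, using the same py2neo 2.x transaction style as above; the query and parameter are made up:
tx = graph.cypher.begin()
tx.append("MATCH (a:Node) SET a.weight = {w} RETURN a", w=0.5)
results = tx.process()   # execute within the still-open transaction
# ... run whatever needs the temporary weights, inside this transaction ...
tx.rollback()            # discard all of the changes instead of committing
Note that the temporary weights are only visible to queries run inside this transaction; anything executed outside it will still see the original values.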
If this isn't what you want, then you should probably formulate a second cypher query which undoes whatever changes you're making to your weights. An example might be a pair of queries like this:
MATCH (a:Node)
SET a.old_weight=a.weight
WITH a
SET a.weight={myNewValue}
RETURN a;
Then undo it with:
MATCH (a:Node)
SET a.weight=a.old_weight
WITH a
REMOVE a.old_weight
RETURN a;
Here's further documentation on transactions from the Java API that describes a bit more about how they work.

Best way to avoid data loss in a high-load Django app?

Imagine a quite complex Django application with both frontend and backend parts. Some users modify some data on the frontend part. Some scripts modify the same data periodically on the backend part.
Example:
instance = SomeModel.objects.get(...)
# (long-running part where various fields are changed, takes from 3 to 20 seconds)
instance.field = 123
instance.another_field = 'abc'
instance.save()
If somebody (or something) changes the instance while that part is changing some fields, then those changes will be lost, because the instance will be saved later, overwriting the row with the data held in the Python (Django) object. In other words, if something in the code reads data, waits for some time, and then saves the data back, only the latest 'saver' will keep its data; all the previous ones will lose their changes.
It's a "high-load" app, the database load (we use Postgres) is quite high and I'd like to avoid anything that would cause a significant increase of the DB activity or memory taken.
Another issue - we have many signals attached, and even the save() method overridden, so I'd like to avoid anything that might break the signals or might be incompatible with custom save() or update() methods.
What would you recommend in this situation? Any special app for that? Transactions? Anything else?
Thank you!
The correct way to protect against this is to use select_for_update to make sure that the data doesn't change between reading and writing. However, this causes the row to be locked for updates, so it might slow down your application significantly.
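For the simple (always-lock) variant, a minimal sketch assuming SomeModel and the field names from the question; the function and app module are hypothetical:
from django.db import transaction
from myapp.models import SomeModel   # hypothetical app

def update_instance(pk):
    with transaction.atomic():
        # The row stays locked until the atomic block ends, so no other
        # select_for_update() caller can read-modify-write it in between.
        instance = SomeModel.objects.select_for_update().get(pk=pk)
        instance.field = 123
        instance.another_field = 'abc'
        instance.save()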
One solution might be to read the data and perform your long-running tasks. Then, before saving it back, you start a transaction, read the data again (but now with select_for_update) and verify that the original data hasn't changed. If the data is still the same, you save. If the data has changed, you abort and re-run the long-running task. That way you hold the lock for as short a time as possible.
Something like:
success = False
while not success:
    instance1 = SomeModel.objects.get(...)
    # (long-running part)
    with django.db.transaction.atomic():
        instance2 = SomeModel.objects.select_for_update().get(...)
        # (compare relevant data from instance1 vs instance2)
        if unchanged:
            # (make the changes on instance2)
            instance2.field = 123
            instance2.another_field = 'abc'
            instance2.save()
            success = True
Whether this is a viable approach does depend on what exactly your long-running task is. And a user might still overwrite the data you save here.

Can anyone tell me what's the point of connection.commit() in Python pyodbc?

I used to be able to run and execute Python using simply an execute statement. This would insert the values 1, 2 into columns A, B accordingly. But starting last week, I got no error, yet nothing happened in my database. No flag - nothing... 1, 2 didn't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I have added:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little bit: why do I need this if I know my connection did not get lost?
New to Python - thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently - for instance, transferring money from one account to another using two UPDATE queries?
You can set autocommit to true if you don't want this behaviour, but there aren't many reasons to do that.
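A minimal pyodbc sketch of that money-transfer pattern; the DSN, table, and column names are made up for illustration:
import pyodbc

conn = pyodbc.connect("DSN=mydb")          # autocommit is off by default
cursor = conn.cursor()
try:
    cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = ?", 1)
    cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = ?", 2)
    conn.commit()        # both updates become visible together...
except pyodbc.Error:
    conn.rollback()      # ...or neither is applied if anything failed
    raise
Passing autocommit=True to pyodbc.connect() would issue a commit after every statement instead, at the cost of losing this atomicity.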

Efficient approach to catching database errors

I have a desktop app that has 65 modules, about half of which read from or write to an SQLite database. I've found that there are 3 ways that the database can throw an SQLiteDatabaseError:
SQL logic error or missing database (happens unpredictably every now and then)
Database is locked (if it's being edited by another program, like SQLite Database Browser)
Disk I/O error (also happens unpredictably)
Although these errors don't happen often, when they do they lock up my application entirely, and so I can't just let them stand.
And so I've started re-writing every single access of the database to be a pointer to a common "database-access function" in its own module. That function can then catch these three errors as exceptions, thereby not crash, and also alert the user accordingly. For example, if it is a "database is locked" error, it will announce this and ask the user to close any program that is also using the database and then try again. (If it's one of the other errors, perhaps it will tell the user to try again later... not sure yet). Updating all the database accesses to do this is mostly a matter of copy/pasting the redirect to the common function--easy work.
The problem is: it is not sufficient to just provide this database-access function and its announcements, because at all of the points of database access in the 65 modules there is code that follows the access that assumes the database will successfully return data or complete a write--and when it doesn't, that code has to have a condition for that. But writing those conditionals requires carefully going into each access point and seeing how best to handle it. This is laborious and difficult for the couple of hundred database accesses I'll need to patch in this way.
I'm willing to do that, but I thought I'd inquire if there were a more efficient/clever way or at least heuristics that would help in finishing this fix efficiently and well.
(I should state that there is no particular "architecture" of this application...it's mostly what could be called "ravioli code", where the GUI and database calls and logic are all together in units that "go together". I am not willing to re-write the architecture of the whole project in MVC or something like this at this point, though I'd consider it for future projects.)
Your gut feeling is right. There is no way to add robustness to the application without reviewing each database access point separately.
You still have a lot of important choices about how the application should react to errors, depending on factors like:
Is it attended, or sometimes completely unattended?
Is delay OK, or is it important to report database errors promptly?
What are relative frequencies of the three types of failure that you describe?
Now that you have a single wrapper, you can use it to do some common configuration and error handling, especially:
set reasonable connect timeouts
set reasonable busy timeouts
enforce command timeouts on client side
retry automatically on errors, especially on SQLITE_BUSY (insert large delays between retries, fail after a few retries) - see the sketch after this list
use exceptions to reduce the number of application-level handlers. You may be able to restart the whole application on database errors. However, do that only if you are confident about the state in which you are aborting the application; consistent use of transactions can ensure that the restart does not leave inconsistent data behind.
ask a human for help when you detect a locking error
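A minimal sketch of such a wrapper combining the busy-timeout and retry points above; the path, function name, and retry parameters are made up:
import sqlite3
import time

DB_PATH = "app.db"   # hypothetical path to the application's database

def run_query(sql, params=(), retries=3, delay=0.5):
    """Shared database-access wrapper: retries on 'database is locked',
    re-raises anything it cannot reasonably handle."""
    for attempt in range(retries):
        try:
            # timeout=5 sets SQLite's busy timeout for this connection.
            with sqlite3.connect(DB_PATH, timeout=5) as conn:
                return conn.execute(sql, params).fetchall()
        except sqlite3.OperationalError as e:
            if "locked" in str(e) and attempt < retries - 1:
                time.sleep(delay)   # back off briefly, then retry
                continue
            raise                   # give up: let the caller (or UI) handle it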
...but there comes a moment where you need to bite the bullet and let the error out into the application, and see what all the particular callers are likely to do with it.

Database migrations on django production

From someone who has a Django application in a non-trivial production environment: how do you handle database migrations? I know there is South, but it seems like that would miss quite a lot if anything substantial is involved.
The other two options (that I can think of, or have used) are doing the changes on a test database and then (taking the app offline) importing that SQL export. Or, perhaps a riskier option, doing the necessary changes on the production database in real time, and if anything goes wrong, reverting to the backup.
How do you usually handle your database migrations and schema changes?
I think there are two parts to this problem.
First is managing the database schema and its changes. We do this using South, keeping both the working models and the migration files in our SCM repository. For safety (or paranoia), we take a dump of the database before (and, if we are really scared, after) running any migrations. South has been adequate for all our requirements so far.
Second is deploying the schema change, which goes beyond just running the migration file generated by South. In my experience, a change to the database normally requires a change to the deployed code. If you have even a small web farm, keeping the deployed code in sync with the current version of your database schema may not be trivial - this gets worse if you consider the different caching layers and the effect on already-active site users. Different sites handle this problem differently, and I don't think there is a one-size-fits-all answer.
Solving the second part of this problem is not necessarily straightforward. I don't believe there is a one-size-fits-all approach, and there is not enough information about your website and environment to suggest a solution that would be most suitable for your situation. However, I think there are a few considerations that can be kept in mind to help guide deployment in most situations.
Taking the whole site (web servers and database) offline is an option in some cases. It is certainly the most straightforward way to manage updates. But frequent downtime (even when planned) can be a good way to go out of business quickly, makes it tiresome to deploy even small code changes, and might take many hours if you have a large dataset and/or a complex migration. That said, for sites I help manage (which are all internal and generally only used during working hours on business days) this approach works wonders.
Be careful if you do the changes on a copy of your master database. The main problem here is that your site is still live, and presumably accepting writes to the database. What happens to data written to the master database while you are busy migrating the clone for later use? Your site has to either be down the whole time or put into some temporary read-only state, otherwise you'll lose those writes.
If your changes are backwards compatible and you have a web farm, sometimes you can get away with updating the live production database server (which I think is unavoidable in most situations) and then incrementally updating nodes in the farm by taking them out of the load balancer for a short period. This can work OK - however, the main problem is that if an already-updated node serves a page pointing to a URL that isn't yet supported by an older node, the request will fail, and you can't manage that at the load balancer level.
I've seen/heard a couple of other ways work well.
The first is wrapping all code changes in a feature flag which is then configurable at run-time through some site-wide configuration option. This essentially means you can release code where all your changes are turned off, and then, after you have made all the necessary updates to your servers, you change your configuration option to enable the feature. But this makes for quite heavy code...
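A minimal sketch of that idea, assuming a simple settings-based switch; the setting name, helper, and fields are made up:
from django.conf import settings

def feature_enabled(name):
    # FEATURES is a hypothetical site-wide dict, e.g. {'prices_in_cents': True}
    return getattr(settings, 'FEATURES', {}).get(name, False)

def price_for(item):
    # Both code paths are deployed; only the flag decides which one runs,
    # so the new schema column can exist before any code depends on it.
    if feature_enabled('prices_in_cents'):
        return item.price_cents / 100.0
    return item.price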
The second is letting the code manage the migration. I've heard of sites where changes to the code are written in such a way that they handle the migration at runtime. The code is able to detect the version of the schema being used and the format of the data it gets back - if the data is from the old schema it does the migration in place; if the data is already in the new schema it does nothing. Through natural site usage a high proportion of your data will be migrated by people using the site; the rest you can do with a migration script whenever you like.
But I think at this point Google becomes your friend, because, as I say, the solution is very context specific and I'm worried this answer will start to get meaningless... Search for something like "zero downtime deployment" and you'll get results such as this with plenty of ideas...
I use South for a production server with a codebase of ~40K lines and we have had no problems so far. We have also been through a couple of major refactors for some of our models and we have had zero problems.
One thing that we also have is version control on our models, which helps us revert any changes we make to model data on the software side, with South being more for the schema itself. We use Django Reversion.
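A minimal sketch of how django-reversion is typically wired up, assuming a recent reversion API; the model and comment are placeholders:
import reversion
from myapp.models import SomeModel   # hypothetical model

reversion.register(SomeModel)         # usually done once, at app setup

with reversion.create_revision():
    obj = SomeModel.objects.get(pk=1)
    obj.field = 'new value'
    obj.save()
    reversion.set_comment('Data change made alongside the schema migration')
Old versions can later be browsed and reverted from the admin or via the reversion API.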
I have sometimes taken an unconventional approach (reading the other answers perhaps it's not that unconventional) to this problem. I never tried it with django so I just did some experiments with it.
In short, I let the code catch the exception resulting from the old schema and apply the appropriate schema upgrade. I don't expect this to be the accepted answer - it is only appropriate in some cases (and some might argue never). But I think it has an ugly-duckling elegance.
Of course, I have a test environment which I can reset back to the production state at any point. Using that test environment, I update my schema and write code against it - as usual.
I then revert the schema change and test the new code again. I catch the resulting errors, perform the schema upgrade and then re-try the erring query.
The upgrade function must be written so it will "do no harm" so that if it's called multiple times (as may happen when put into production) it only acts once.
Actual python code - I put this at the end of my settings.py to test the concept, but you would probably want to keep it in a separate module:
from django.db.models.sql.compiler import SQLCompiler
from MySQLdb import OperationalError

orig_exec = SQLCompiler.execute_sql

def new_exec(self, *args, **kw):
    try:
        return orig_exec(self, *args, **kw)
    except OperationalError as e:
        if e.args[0] != 1054:  # 1054 = unknown column
            raise
        upgradeSchema(self.connection)
        return orig_exec(self, *args, **kw)

SQLCompiler.execute_sql = new_exec

def upgradeSchema(conn):
    cursor = conn.cursor()
    try:
        cursor.execute("alter table users add phone varchar(255)")
    except OperationalError as e:
        if e.args[0] != 1060:  # 1060 = duplicate column name
            raise
Once your production environment is up to date, you are free to remove this self-upgrade code from your codebase. But even if you don't, the code isn't doing any significant unnecessary work.
You would need to tailor the exception class (MySQLdb.OperationalError in my case) and numbers (1054 "unknown column" / 1060 "duplicate column" in my case) to your database engine and schema change, but that should be easy.
You might want to add some additional checks to ensure the SQL being executed is actually failing because of the schema change in question rather than some other problem, but even if you don't, this should re-raise unrelated exceptions. The only penalty is that in that situation you'd be attempting the upgrade and running the bad query twice before raising the exception.
One of my favorite things about python is one's ability to easily override system methods at run-time like this. It provides so much flexibility.
If your database is non-trivial and is PostgreSQL, you have a whole bunch of excellent options SQL-wise, including:
snapshotting and rollback
live replication to a backup server
trial upgrade then live
The trial upgrade option is nice (but best done in collaboration with a snapshot)
su postgres
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql
psql template1
# create database upgrade_test template current_db;
# \c upgrade_test
# \i upgrade_file.sql
...assuming all well...
# \q
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql # we're paranoid
psql <db>
# \i upgrade_file.sql
If you like the above arrangement, but you are worried about the time it takes to run the upgrade twice, you can lock db for writes, and then if the upgrade to upgrade_test goes well you can rename db to dbold and upgrade_test to db. There are lots of options.
If you have an SQL file listing all the changes you want to make, an extremely handy psql command is \set ON_ERROR_STOP 1. This stops the upgrade script in its tracks the moment something goes wrong. And, with lots of testing, you can make sure nothing does.
There are a whole host of database schema diffing tools available, with a number noted in this StackOverflow answer. But it is basically pretty easy to do by hand ...
pg_dump --schema-only production_db > production_schema.sql
pg_dump --schema-only upgraded_db > upgrade_schema.sql
vimdiff production_schema.sql upgrade_schema.sql
or
diff -Naur production_schema.sql upgrade_schema.sql > changes.patch
vim changes.patch (to check/edit)
South isn't used everywhere. In my organization, for example, we have 3 levels of code testing: a local dev environment, a staging dev environment, and production.
Local dev is in the developers' hands, where they can experiment as needed. Then comes staging dev, which is kept identical to production - of course, until a DB change has to be made on the live site, in which case we do the DB changes on staging first, check that everything is working fine, and then manually change the production DB, making it identical to staging again.
If it's not trivial, you should have a pre-prod database/app that mimics the production one, to avoid downtime on production.
