I have a desktop app with 65 modules, about half of which read from or write to an SQLite database. I've found that there are three ways the database can throw an SQLiteDatabaseError:
SQL logic error or missing database (happens unpredictably every now and then)
Database is locked (if it's being edited by another program, like SQLite Database Browser)
Disk I/O error (also happens unpredictably)
Although these errors don't happen often, when they do they lock up my application entirely, and so I can't just let them stand.
So I've started rewriting every database access as a call to a common "database-access function" in its own module. That function can then catch these three errors as exceptions, avoid crashing, and alert the user accordingly. For example, if it is a "database is locked" error, it will announce this, ask the user to close any program that is also using the database, and then try again. (If it's one of the other errors, perhaps it will tell the user to try again later...not sure yet.) Updating all the database accesses to do this is mostly a matter of copy/pasting the redirect to the common function--easy work.
The problem is: it is not sufficient to just provide this database-access function and its announcements, because at every point of database access across the 65 modules there is code that follows the access and assumes the database will successfully return data or complete a write--and when it doesn't, that code needs a branch to handle the failure. Writing those conditionals requires carefully going into each access point and seeing how best to handle it. This is laborious and difficult for the couple of hundred database accesses I'll need to patch this way.
I'm willing to do that, but I thought I'd inquire if there were a more efficient/clever way or at least heuristics that would help in finishing this fix efficiently and well.
(I should state that there is no particular "architecture" of this application...it's mostly what could be called "ravioli code", where the GUI and database calls and logic are all together in units that "go together". I am not willing to re-write the architecture of the whole project in MVC or something like this at this point, though I'd consider it for future projects.)
Your gut feeling is right. There is no way to add robustness to the application without reviewing each database access point separately.
You still have a lot of important choices about how the application should react to errors, which depend on factors like:
Is it attended, or sometimes completely unattended?
Is delay OK, or is it important to report database errors promptly?
What are the relative frequencies of the three types of failure you describe?
Now that you have a single wrapper, you can use it to do some common configuration and error handling (a minimal sketch follows below), especially:
set reasonable connect timeouts
set reasonable busy timeouts
enforce command timeouts on client side
retry automatically on errors, especially on SQLITE_BUSY (insert large delays between retries, fail after a few retries)
use exceptions to reduce the number of application-level handlers. You may be able to restart the whole application on database errors. However, do that only if you are confident about the state in which you are aborting the application; consistent use of transactions can ensure that the restart does not leave inconsistent data behind.
ask a human for help when you detect a locking error
...but there comes a moment where you need to bite the bullet and let the error out into the application, and see what all the particular callers are likely to do with it.
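To make the wrapper idea concrete, here is a minimal sketch of such a central access function using Python's sqlite3 module. The function name, database path, and retry parameters are assumptions for illustration; adapt them to your application.

import sqlite3
import time

DB_PATH = "app.db"  # hypothetical path to your database file

def run_query(sql, params=(), retries=3, retry_delay=2.0):
    """Central access point: retry on 'database is locked', re-raise the rest."""
    for attempt in range(retries):
        conn = sqlite3.connect(DB_PATH, timeout=5.0)  # busy timeout, in seconds
        try:
            with conn:  # commits on success, rolls back on exception
                return conn.execute(sql, params).fetchall()
        except sqlite3.OperationalError as e:
            if "locked" in str(e).lower() and attempt < retries - 1:
                time.sleep(retry_delay)  # large delay between retries
                continue
            raise  # give up and let the caller decide what to do
        finally:
            conn.close()

Even with such a wrapper, each caller still has to decide what a failed read or write means at its own call site, which is exactly the per-access-point review described above.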
Related
I'm writing a set of programs that have to operate on a common database, possibly concurrently. For the sake of simplicity (for the user), I didn't want to require the setup of a database server. Therefore I settled on Berkeley DB, where one can just fire up a program and let it create the DB if it doesn't exist.
In order to let programs work concurrently on a database, one has to use the transactional features present in the 5.x release (here I use python3-bsddb3 6.1.0-1+b2 with libdb5.3 5.3.28-12): the documentation clearly says that it can be done. However, I quickly ran into trouble, even with some basic tasks:
Program 1 initializes records in a table
Program 2 has to scan the records previously added by program 1 and update them with additional data.
To speed things up, there is an index for said additional data. When program 1 creates the records, the additional data isn't present, so the pointer to that record is added to the index under an empty key. Program 2 can then just quickly seek to the not-yet-updated records.
Even when not run concurrently, the record updating program crashes after a few updates. First it complained about insufficient space in the mutex area. I had to resolve this with an obscure DB_CONFIG file and then run db_recover.
Next, again after a few updates it complained 'Cannot allocate memory -- BDB3017 unable to allocate space from the buffer cache'. db_recover and relaunching the program did the trick, only for it to crash again with the same error a few records later.
I'm not even mentioning concurrent use: when one of the programs is launched while the other is running, they almost instantly crash with deadlocks, panic about corrupted segments, and ask to run recovery. I made many changes, so I went through a wide spectrum of errors which often yield irrelevant matches when searched for. I even rewrote the db calls to use lmdb, which in fact works quite well and is really quick, which tends to indicate my program logic isn't at fault. Unfortunately, it seems the datafile produced by lmdb is quite sparse, and it quickly grew to unacceptable sizes.
From what I said, it seems that maybe some resources are being leaked somewhere. I'm hesitant to rewrite all this directly in C to check if the problem can come from the Python binding.
I can and will update the question with code, but for the moment it is long enough. I'm looking for people who have used the transactional features of BDB for similar purposes and who could point me to some of the gotchas.
Thanks
RPM (see http://rpm5.org) uses Berkeley DB in transactional mode. There's a fair number of gotchas, depending on what you are attempting.
You have already found DB_CONFIG: you MUST configure the sizes for mutexes and locks; the defaults are invariably too small.
Needing to run db_recover while developing is quite painful too. The best fix (imho) is to automate recovery while opening by checking the return code for DB_RUNRECOVERY, and then reopening the dbenv with DB_RECOVER.
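A hedged sketch covering both of the previous points (sizing you would otherwise put in DB_CONFIG, and automated recovery on DB_RUNRECOVERY), using the bsddb3 bindings mentioned in the question; the sizes are illustrative placeholders, not recommendations:

from bsddb3 import db

ENV_FLAGS = (db.DB_CREATE | db.DB_INIT_TXN | db.DB_INIT_MPOOL |
             db.DB_INIT_LOCK | db.DB_INIT_LOG)

def make_env():
    env = db.DBEnv()
    # Sizing you would otherwise put in DB_CONFIG; tune to your workload.
    env.set_cachesize(0, 64 * 1024 * 1024, 1)  # 64 MB buffer cache
    env.set_lk_max_locks(10000)
    env.set_lk_max_objects(10000)
    return env

def open_env(home):
    try:
        env = make_env()
        env.open(home, ENV_FLAGS)
    except db.DBRunRecoveryError:
        # The environment is marked as needing recovery: reopen with
        # DB_RECOVER, which performs the equivalent of running db_recover.
        env = make_env()
        env.open(home, ENV_FLAGS | db.DB_RECOVER)
    return env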
Deadlocks are usually design/coding errors: run db_stat -CA to see what is deadlocked (or what locks are held) and adjust your program. "Works with lmdb" isn't sufficient to claim working code ;-)
Leaks can be seen with valgrind and/or by compiling BDB with -fsanitize=address. Note that valgrind will report false uninitializations unless you use overrides and/or compile BDB to initialize.
Actual Problem (Crash Log Generation)
Is there a python module that could help me produce meaningful crash logs? Or a good way to go about producing them?
I want my crash logs to contain:
All variables within the current execution stack
All global variables
Parameters at invocation (i.e. flags, their values, etc)
Is this something I'd just be better off writing myself?
Context (Not particularly relevant)
I have a program that is used by a large number of people within my company that I am responsible for supporting. Unfortunately it doesn't always work correctly (about 1 out of 1000 times) and I am having difficulty tracking the bug down. I think that having solid crash logs would really help here so that my users could just submit those rather than making vague phone calls for help.
You can look at how django generates its error pages: https://github.com/django/django/blob/master/django/views/debug.py#L59
In non-debug mode these are mailed to the list of admins in the settings file. It's super handy!
The python logging module has everything you need already.
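If you do go the do-it-yourself route, a minimal sketch (not the Django code referenced above) is to install an excepthook that writes the traceback, the invocation parameters, and each frame's local variables to a log file via the logging module; the file name and format here are assumptions:

import logging
import sys
import traceback

logging.basicConfig(filename="crash.log", level=logging.ERROR)

def log_crash(exc_type, exc_value, tb):
    trace = "".join(traceback.format_exception(exc_type, exc_value, tb))
    frames = []
    while tb is not None:
        frame = tb.tb_frame
        frames.append("%s:%d locals=%r" % (frame.f_code.co_filename,
                                           tb.tb_lineno, frame.f_locals))
        tb = tb.tb_next
    logging.error("Unhandled exception\nargv=%r\n%s\nLocals per frame:\n%s",
                  sys.argv, trace, "\n".join(frames))

sys.excepthook = log_crash

Global variables can be captured the same way via frame.f_globals, though dumping all of them tends to make the log very noisy.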
For those of you who have a django application in a non-trivial production environment: how do you handle database migrations? I know there is South, but it seems like that would miss quite a lot if anything substantial is involved.
The other two options (that I can think of or have used) are doing the changes on a test database and then (taking the app offline) importing that SQL export, or, perhaps a riskier option, doing the necessary changes on the production database in real time and, if anything goes wrong, reverting to the backup.
How do you usually handle your database migrations and schema changes?
I think there are two parts to this problem.
First is managing the database schema and its changes. We do this using South, keeping both the working models and the migration files in our SCM repository. For safety (or paranoia), we take a dump of the database before (and, if we are really scared, after) running any migrations. South has been adequate for all our requirements so far.
Second is deploying the schema change, which goes beyond just running the migration file generated by South. In my experience, a change to the database normally requires a change to the deployed code. If you have even a small web farm, keeping the deployed code in sync with the current version of your database schema may not be trivial - this gets worse if you consider the different caching layers and the effect on already-active site users. Different sites handle this problem differently, and I don't think there is a one-size-fits-all answer.
Solving the second part of this problem is not necessarily straightforward. I don't believe there is a one-size-fits-all approach, and there is not enough information about your website and environment to suggest a solution that would be most suitable for your situation. However, I think there are a few considerations that can be kept in mind to help guide deployment in most situations.
Taking the whole site (web servers and database) offline is an option in some cases. It is certainly the most straightforward way to manage updates. But frequent downtime (even when planned) can be a good way to go out of business quickly, makes it tiresome to deploy even small code changes, and might take many hours if you have a large dataset and/or a complex migration. That said, for sites I help manage (which are all internal and generally only used during working hours on business days) this approach works wonders.
Be careful if you do the changes on a copy of your master database. The main problem here is that your site is still live and presumably accepting writes to the database. What happens to data written to the master database while you are busy migrating the clone for later use? Your site has to either be down the whole time or put into some temporary read-only state, otherwise you'll lose those writes.
If your changes are backwards compatible and you have a web farm, sometimes you can get away with updating the live production database server (which I think is unavoidable in most situations) and then incrementally updating nodes in the farm by taking them out of the load balancer for a short period. This can work OK - however, the main problem is that if a node that has already been updated sends a request for a URL which isn't supported by an older node, it will fail, and you can't manage that at the load balancer level.
I've seen/heard a couple of other ways work well.
The first is wrapping all code changes in a feature flag which is then configurable at run time through some site-wide configuration options. This essentially means you can release code where all your changes are turned off, and then, after you have made all the necessary updates to your servers, you change your configuration option to enable the feature. But this adds quite a bit of extra code...
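A minimal sketch of that idea in Django, with made-up names; for a true run-time toggle you would read the flag from the database or a cache rather than from settings, but the shape is the same:

# settings.py
FEATURES = {"new_billing": False}  # flip to True once every node runs the new code

# views.py
from django.conf import settings
from django.http import HttpResponse

def old_billing_flow(request):
    return HttpResponse("old billing page")

def new_billing_flow(request):
    return HttpResponse("new billing page")

def billing_view(request):
    # The flag decides which code path runs; the old behaviour stays available
    # until every node has the new code and the flag is flipped.
    if settings.FEATURES.get("new_billing"):
        return new_billing_flow(request)
    return old_billing_flow(request)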
The second is letting the code manage the migration. I've heard of sites where changes to the code are written in such a way that they handle the migration at runtime. The code is able to detect the version of the schema being used and the format of the data it got back - if the data is from the old schema it does the migration in place, and if the data is already from the new schema it does nothing. From natural site usage a high portion of your data will be migrated by people using the site; the rest you can do with a migration script whenever you like.
But I think at this point Google becomes your friend, because as I say, the solution is very context specific and I'm worried this answer will start to get meaningless... Search for something like "zero down time deployment" and you'll get results such as this with plenty of ideas...
I use South for a production server with a codebase of ~40K lines and we have had no problems so far. We have also been through a couple of major refactors for some of our models and we have had zero problems.
One thing we also have is version control on our models, which helps us revert any changes we make on the software side, with South being more for the actual data. We use Django Reversion.
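For reference, a hedged sketch of how django-reversion is typically wired up inside one of your installed apps (the model is made up; check the docs for the exact API in your version):

import reversion
from django.db import models

@reversion.register()
class Article(models.Model):  # hypothetical model
    title = models.CharField(max_length=200)

# somewhere in a view or management command:
with reversion.create_revision():
    article = Article.objects.create(title="draft")
    reversion.set_comment("initial import")  # stored with the revision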
I have sometimes taken an unconventional approach (reading the other answers, perhaps it's not that unconventional) to this problem. I had never tried it with django, so I just did some experiments with it.
In short, I let the code catch the exception resulting from the old schema and apply the appropriate schema upgrade. I don't expect this to be the accepted answer - it is only appropriate in some cases (and some might argue never). But I think it has an ugly-duckling elegance.
Of course, I have a test environment which I can reset back to the production state at any point. Using that test environment, I update my schema and write code against it - as usual.
I then revert the schema change and test the new code again. I catch the resulting errors, perform the schema upgrade and then re-try the erring query.
The upgrade function must be written so it will "do no harm" so that if it's called multiple times (as may happen when put into production) it only acts once.
Actual python code - I put this at the end of my settings.py to test the concept, but you would probably want to keep it in a separate module:
from django.db.models.sql.compiler import SQLCompiler
from MySQLdb import OperationalError

orig_exec = SQLCompiler.execute_sql

def new_exec(self, *args, **kw):
    try:
        return orig_exec(self, *args, **kw)
    except OperationalError, e:
        if e[0] != 1054:  # unknown column
            raise
        upgradeSchema(self.connection)
        return orig_exec(self, *args, **kw)

SQLCompiler.execute_sql = new_exec

def upgradeSchema(conn):
    cursor = conn.cursor()
    try:
        cursor.execute("alter table users add phone varchar(255)")
    except OperationalError, e:
        if e[0] != 1060:  # duplicate column name
            raise
Once your production environment is up to date, you are free to remove this self-upgrade code from your codebase. But even if you don't, the code isn't doing any significant unnecessary work.
You would need to tailor the exception class (MySQLdb.OperationalError in my case) and numbers (1054 "unknown column" / 1060 "duplicate column" in my case) to your database engine and schema change, but that should be easy.
You might want to add some additional checks to ensure the SQL being executed is actually erring because of the schema change in question rather than some other problem, but even if you don't, this should re-raise unrelated exceptions. The only penalty is that in that situation you'd be trying the upgrade and the bad query twice before raising the exception.
One of my favorite things about python is one's ability to easily override system methods at run-time like this. It provides so much flexibility.
If your database is non-trivial and PostgreSQL, you have a whole bunch of excellent options SQL-wise, including:
snapshotting and rollback
live replication to a backup server
trial upgrade then live
The trial upgrade option is nice (but best done in collaboration with a snapshot)
su postgres
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql
psql template1
# create database upgradetest template <db>;
# \c upgradetest
# \i upgrade_file.sql
...assuming all well...
# \q
pg_dump <db> > $(date "+%Y%m%d_%H%M").sql # we're paranoid
psql <db>
# \i upgrade_file.sql
If you like the above arrangement but are worried about the time it takes to run the upgrade twice, you can lock db for writes, and then, if the upgrade to upgradetest goes well, rename db to dbold and upgradetest to db. There are lots of options.
If you have an SQL file listing all the changes you want to make, an extremely handy psql command is \set ON_ERROR_STOP 1. This stops the upgrade script in its tracks the moment something goes wrong. And, with lots of testing, you can make sure nothing does.
There are a whole host of database schema diffing tools available, with a number noted in this StackOverflow answer. But it is basically pretty easy to do by hand ...
pg_dump --schema-only production_db > production_schema.sql
pg_dump --schema-only upgraded_db > upgrade_schema.sql
vimdiff production_schema.sql upgrade_schema.sql
or
diff -Naur production_schema.sql upgrade_schema.sql > changes.patch
vim changes.patch (to check/edit)
South isn't used everywhere. In my organization, for example, we have three levels of code testing: a local dev environment, a staging environment, and production.
Local dev is in the developers' hands, where they can experiment according to their needs. Then comes staging, which is kept identical to production - of course, until a db change has to be made on the live site. We make the db changes on staging first, check that everything is working fine, and then manually change the production db, making it identical to staging again.
If it's not trivial, you should have a pre-prod database/app that mimics the production one, to avoid downtime on production.
For simple debugging in a complex project, is there a reason to use the python logger instead of print? What about other use cases? Is there an accepted best use case for each (especially when you're only looking for stdout)?
I've always heard that this is a "best practice" but I haven't been able to figure out why.
The logging package has a lot of useful features:
Easy to see where and when (even what line no.) a logging call is being made from.
You can log to files, sockets, pretty much anything, all at the same time.
You can differentiate your logging based on severity.
Print doesn't have any of these (a minimal logging setup is sketched below).
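To illustrate the three points above, here is a minimal sketch that logs everything to a file, shows only warnings and above on the console, and includes the source location in every message; the logger and file names are placeholders:

import logging

logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)

fmt = logging.Formatter("%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(message)s")

file_handler = logging.FileHandler("myapp.log")
file_handler.setLevel(logging.DEBUG)   # everything goes to the file
file_handler.setFormatter(fmt)
logger.addHandler(file_handler)

console = logging.StreamHandler()
console.setLevel(logging.WARNING)      # only warnings and up on screen
console.setFormatter(fmt)
logger.addHandler(console)

logger.debug("only in the file")
logger.warning("in the file and on the console")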
Also, if your project is meant to be imported by other python tools, it's bad practice for your package to print things to stdout, since the user likely won't know where the print messages are coming from. With logging, users of your package can choose whether or not they want to propagate logging messages from your tool.
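The usual library-side convention is a sketch like this: attach a NullHandler so the package stays silent unless the application configures logging itself (the package name is made up):

import logging

logger = logging.getLogger("mypackage")   # hypothetical package logger
logger.addHandler(logging.NullHandler())  # silent by default

def do_work():
    logger.debug("internal detail, visible only if the app enables it")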
One of the biggest advantages of proper logging is that you can categorize messages and turn them on or off depending on what you need. For example, it might be useful to turn on debugging level messages for a certain part of the project, but tone it down for other parts, so as not to be taken over by information overload and to easily concentrate on the task for which you need logging.
Also, logs are configurable. You can easily filter them, send them to files, format them, add timestamps, and any other things you might need on a global basis. Print statements are not easily managed.
Print statements are sort of the worst of both worlds, combining the negative aspects of an online debugger with diagnostic instrumentation. You have to modify the program, but you don't get any lasting, useful code out of it.
An online debugger allows you to inspect the state of a running program, and the nice thing about a real debugger is that you don't have to modify the source, neither before nor after the debugging session; you just load the program into the debugger, tell the debugger where you want to look, and you're all set.
Instrumenting the application might take some work up front, modifying the source code in some way, but the resulting diagnostic output can have enormous amounts of detail and can be turned on or off to a very specific degree. The python logging module can show not just the message logged, but also the file and function that called it, a traceback if there was one, the actual time that the message was emitted, and so on. More than that, diagnostic instrumentation need never be removed; it's just as valid and useful when the program is finished and in production as it was the day it was added, but its output can be stuck in a log file where it's not likely to annoy anyone, or the log level can be turned down to keep all but the most urgent messages out.
Anticipating the need for a debugger is really no harder than using ipython while you're testing and becoming familiar with the commands it uses to control the built-in pdb debugger.
When you find yourself thinking that a print statement might be easier than using pdb (as it often is), you'll find that using a logger leaves your program in a much easier-to-work-on state than if you use and later remove print statements.
I have my editor configured to highlight print statements as syntax errors, and logging statements as comments, since that's about how I regard them.
In brief, the advantages of using logging libraries outweigh print for the following reasons:
Control what’s emitted
Define what types of information you want to include in your logs
Configure how it looks when it’s emitted
Most importantly, set the destination for your logs
In detail, segmenting log events by severity level is a good way to sift through which log messages may be most relevant at a given time. A log event's severity level also gives you an indication of how worried you should be when you see a particular message - for instance, dividing log messages into debug, info, warning, error, and critical levels. Timing can be everything when you're trying to understand what went wrong with an application. You want to know the answers to questions like:
“Was this happening before or after my database connection died?”
“Exactly when did that request come in?”
Furthermore, it is easy to see where a log message originated - the line number, the filename or method name, and even which thread (a minimal configuration showing this is sketched below).
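For instance, a format string along these lines answers the "where and when" questions above (the fields are standard LogRecord attributes; the rest of the setup is a minimal assumption):

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)-8s [%(threadName)s] "
           "%(filename)s:%(lineno)d %(funcName)s() %(message)s",
)
logging.getLogger(__name__).info("request received")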
Here's a functional logging library for Python named loguru.
If you use logging then the person responsible for deployment can configure the logger to send it to a custom location, with custom information. If you only print, then that's all they get.
Logging essentially creates a searchable plain text database of print outputs with other meta data (timestamp, loglevel, line number, process etc.).
This is pure gold: I can run egrep over the log file after the python script has run.
I can tune my egrep pattern search to pick exactly what I am interested in and ignore the rest. This reduction of cognitive load and freedom to pick my egrep pattern later on by trial and error is the key benefit for me.
tail -f mylogfile.log | egrep "key_word1|key_word2"
Now throw in the other cool things that print can't do (sending to a socket, setting debug levels, logrotate, adding metadata, etc.) and you have every reason to prefer logging over plain print statements.
I tend to use print statements because it's lazy and easy; adding logging needs some boilerplate code. But hey, we have yasnippet (emacs), ultisnips (vim) and other templating tools, so why give up logging for plain print statements?!
I would add to all the other mentioned advantages that print, in its standard configuration, is buffered: the flush may happen only when the buffer fills up or when the program exits. This is true for any program launched in a non-interactive shell (CodeBuild or GitLab CI, for instance) or whose output is redirected.
If for any reason the program is killed (kill -9, hard reset of the computer, …), you may be missing some lines of output if you used print for them.
The logging library's stream handlers, however, flush after every record, so messages sent to stderr or stdout appear immediately.
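A tiny sketch of the difference, assuming output is redirected to a file: print needs an explicit flush to be crash-safe, while a logging StreamHandler flushes each record as it is emitted.

import logging
import sys

print("progress: step 1", flush=True)      # explicit flush, not lost on a crash

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logging.info("progress: step 2")           # handler flushes immediately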
I would like to enable students to submit python code solutions to a few simple python problems. My application will be running in GAE. How can I limit the risk from malicious code that is submitted? I realize that this is a hard problem, and I have read related Stack Overflow and other posts on the subject. I am curious whether the restrictions already in place in the GAE environment make it simpler to limit the damage that untrusted code could inflict. Is it possible to simply scan the submitted code for a few restricted keywords (exec, import, etc.) and then ensure the code only runs for less than a fixed amount of time, or is it still difficult to sandbox untrusted code even in the restricted GAE environment? For example:
# Import and execute untrusted code in GAE
untrustedCode = """#Untrusted code from students."""

class TestSpace(object):
    pass

testspace = TestSpace()

try:
    # Check the untrusted code somehow and throw an exception.
    pass
except:
    print "Code attempted to import or access network"

try:
    # exec code in a new namespace (Thanks Alex Martelli)
    # limit runtime somehow
    exec untrustedCode in vars(testspace)
except:
    print "Code took more than x seconds to run"
@mjv's smiley comment is actually spot-on: make sure the submitter IS identified and associated with the code in question (which presumably is going to be sent to a task queue), and log any diagnostics caused by an individual's submissions.
Beyond that, you can indeed prepare a test space that's more restrictive (thanks for the acknowledgment;-), including a special __builtins__ that has only what you want the students to be able to use and redefines __import__ etc. That, plus a pass over the tokens to forbid exec, eval, import, __subclasses__, __bases__, __mro__, ..., gets you closer. A totally secure sandbox in a GAE environment, however, is a real challenge, unless you can whitelist a tiny subset of the language that the students are allowed to use.
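As a toy illustration of that token-scan plus restricted-builtins idea (emphatically not a secure sandbox, and written as modern Python rather than for the GAE runtime in the question):

import io
import tokenize

FORBIDDEN = {"exec", "eval", "import", "__import__", "open",
             "__subclasses__", "__bases__", "__mro__"}
SAFE_BUILTINS = {"len": len, "range": range, "print": print}  # tiny whitelist

def run_student_code(source):
    # Reject any forbidden name before execution...
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in FORBIDDEN:
            raise ValueError("forbidden name: %s" % tok.string)
    # ...then exec in a namespace whose builtins are whitelisted.
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)  # still escapable; defence in depth is required
    return namespace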
So I would suggest a layered approach: the sandbox GAE app in which the students upload and execute their code has essentially no persistent layer to worry about; rather, it "persists" by sending urlfetch requests to ANOTHER app, which never runs any untrusted code and is able to vet each request very critically. Default-denial with whitelisting is still the holy grail, but with such an extra layer for security you may be able to afford a default-acceptance with blacklisting...
You really can't sandbox Python code inside App Engine with any degree of certainty. Alex's idea of logging who's running what is a good one, but if the user manages to break out of the sandbox, they can erase the event logs. The only place this information would be safe is in the per-request logging, since users can't erase that.
For a good example of what a rathole trying to sandbox Python turns into, see this post. For Guido's take on securing Python, see this post.
There are another couple of options: If you're free to choose the language, you could run Rhino (a Javascript interpreter) on the Java runtime; Rhino is nicely sandboxed. You may also be able to use Jython; I don't know if it's practical to sandbox it, but it seems likely.
Alex's suggestion of using a separate app is also a good one. This is pretty much the approach that shell.appspot.com takes: It can't prevent you from doing malicious things, but the app itself stores nothing of value, so there's no harm if you do.
Here's an idea. Instead of running the code server-side, run it client-side with Skulpt:
http://www.skulpt.org/
This is both safer and easier to implement.