I'm wondering if someone can recommend a good pattern for deploying database changes via python.
In my scenario, I've got one or more PostgreSQL databases and I'm trying to deploy a code base to each one. Here's an example of the directory structure for my SQL scripts:
my_db/
    main.sql
    some_directory/
        foo.sql
        bar.sql
    some_other_directory/
        baz.sql
Here's an example of what's in main.sql
/* main.sql has the following contents: */
BEGIN TRANSACTION;
\i some_directory/bar.sql
\i some_directory/foo.sql
\i some_other_directory/baz.sql
COMMIT;
As you can see, main.sql defines a specific order of operations and a transaction for the database updates.
I've also got a python / twisted service monitoring SVN for changes in this db code, and I'd like to automatically deploy this code upon discovery of new stuff from the svn repository.
Can someone recommend a good pattern to use here?
Should I be parsing each file?
Should I be shelling out to psql?
...
What you're doing is actually a decent approach if you control all the servers and they're all postgresql servers.
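If you do end up shelling out to psql from the monitoring service, a minimal sketch might look like this (the database name, working directory and ON_ERROR_STOP flag are my assumptions, not part of the original setup):
import subprocess

# Run main.sql through psql; ON_ERROR_STOP makes psql exit non-zero on the
# first error so the deploy can be flagged as failed.
subprocess.run(
    ["psql", "-v", "ON_ERROR_STOP=1", "-d", "my_db", "-f", "main.sql"],
    cwd="my_db",   # so the relative \i paths inside main.sql resolve
    check=True,
)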
A more general approach is to have a directory of "migrations", which are generally classes with an apply() and an undo() method that actually do the work in your database, and which often come with abstractions like .create_table() that generate the DDL specific to whatever RDBMS you're using.
Generally, you have some naming convention that ensures the migrations run in the order they were created.
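As an illustration only (this is not any particular library's API), one of those migrations might look roughly like:
# 0002_add_bar_table.py -- the numeric prefix keeps migrations in creation order.
class AddBarTable:
    def apply(self, cursor):
        # Forward change: create the new table.
        cursor.execute("""
            CREATE TABLE bar (
                id   SERIAL PRIMARY KEY,
                name TEXT NOT NULL
            )
        """)

    def undo(self, cursor):
        # Reverse change: drop it again.
        cursor.execute("DROP TABLE bar")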
There's a migration library for python called South, though it appears to be geared specifically toward django development.
http://south.aeracode.org/docs/about.html
We just integrated sqlalchemy-migrate, which has some pretty Rails-like conventions but with the power of SQLAlchemy. It's shaping up to be a really awesome product, but it does have some drawbacks. It's pretty seamless to integrate, though.
I have a couple of SQL statements stored as files which get executed by a Python script. The database is hosted in Snowflake and I use Snowflake SQLAlchemy to connect to it.
How can I test those statements? I don't want to execute them, I just want to check if they could be executable.
One very basic check would be whether it is valid standard SQL. A better answer would be something that considers Snowflake-specific syntax like
copy into s3://example from table ...
The best answer would be something that also checks permissions, e.g. for SELECT statements if the table is visible / readable.
An in-memory SQLite database is one option. But if you are executing raw SQL queries against Snowflake in your code, your tests may fail if the same syntax isn't valid against SQLite. Recording your HTTP requests against a test Snowflake database, and then replaying them for your unit tests, suits this purpose better. There are two very good libraries that do this; check them out:
vcrpy
betamax
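For instance, with vcrpy you record the HTTP traffic of a real query once and replay it in later test runs. A sketch assuming snowflake-sqlalchemy; the connection URL and cassette path are placeholders:
import vcr
from sqlalchemy import create_engine, text

engine = create_engine("snowflake://user:password@account/database/schema")

# The first run hits Snowflake and records the traffic into the cassette;
# subsequent runs replay the cassette without touching the network.
with vcr.use_cassette("fixtures/cassettes/snowflake_query.yaml"):
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT CURRENT_VERSION()")).fetchall()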
We do run integration tests on our Snowflake databases. We maintain clones of our production databases, for example, one of our production databases is called data_lake and we maintain a clone that gets cloned on a nightly basis called data_lake_test which is used to run our integration tests against.
Like Tim Biegeleisen mentioned, a "true" unit test would mock the response, but our integration tests do run real Snowflake queries on our cloned test databases. There is the possibility that a test drastically alters the test database, but we run integration tests only during our CI/CD process, so it is rare for two tests to conflict.
I very much like this idea; however, I can suggest a workaround, as I often have to check my syntax and need help there. If you plan on using the Snowflake interface, what I would recommend is adding LIMIT 10 or LIMIT 1 to the SELECT statements that you need to validate.
Another tip I would recommend is talking to a Snowflake representative about a trial if you are just getting started. They will also have a lot of tips for the more specific queries you are seeking to validate.
And finally, based on some comments, make sure your SQL is ANSI-compliant, and live in the Snowflake documentation (https://docs.snowflake.net/manuals/index.html) for reference.
As far as the validity of the SQL statement is concerned, you can run EXPLAIN on the statement; it will give you an error if the syntax is incorrect or if you do not have permission to access the object/database. That said, there are still some exceptions you cannot run EXPLAIN for, like the USE command, which I do not think is needed for validation anyway.
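A rough sketch of that idea with Snowflake SQLAlchemy (the connection URL is a placeholder):
from sqlalchemy import create_engine, text

engine = create_engine("snowflake://user:password@account/database/schema")

def is_valid(statement: str) -> bool:
    # EXPLAIN compiles the statement without executing it; syntax errors and
    # missing access rights both surface as exceptions here.
    try:
        with engine.connect() as conn:
            conn.execute(text("EXPLAIN " + statement))
        return True
    except Exception:
        return False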
Hope this helps.
I'm working on a web app that's very heavily database driven. I'm nearing the initial release, so I've locked down the features for this version, but there are going to be lots of other features implemented after release. These features will inevitably require some modification to the database models, so I'm concerned about the complexity of migrating the database on each release. What I'd like to know is how much I should concern myself with locking down a solid database design now so that I can release quickly, versus trying to anticipate certain features now so that I can build them into the database before release. I'm also anticipating finding flaws with my current model and would probably then want to make changes to it, but if I release the app and data starts coming in, migrating that data would be a difficult task, I imagine. Are there conventional methods to tackle this type of problem? A pointer in the right direction would be very useful.
For a bit of background I'm developing an asset management system for a CG production pipeline. So lots of pieces of data with lots of connections between them. It's web-based, written entirely in Python and it uses SQLAlchemy with a SQLite engine.
Some thoughts for managing databases for a production application:
Make backups nightly. This is crucial because if you try to do an update (to the data or the schema), and you mess up, you'll need to be able to revert to something more stable.
Create environments. You should have something like a local copy of the database for development, a staging database for other people to see and test before going live and of course a production database that your live system points to.
Make sure all three environments are in sync before you start development locally. This way you can track changes over time.
Start writing scripts and version them for releases. Make sure you store these in a source control system (SVN, Git, etc.) You just want a historical record of what has changed and also a small set of scripts that need to be run with a given release. Just helps you stay organized.
Make your changes to your local database and test them. Make sure you have scripts that do two things: 1) scripts that modify the data or the schema, and 2) scripts that undo what you've done in case things go wrong. Test these over and over locally. Run the scripts, test, and then roll back. Are things still ok?
Run the scripts on staging and see if everything is still ok. Just another chance to prove your work is good and that if needed you can undo your changes.
Once staging is good and you feel confident, run your scripts on the production database. Remember you have scripts to change data (update, delete statements) and scripts to change schema (add fields, rename fields, add tables).
In general take your time and be very deliberate in your actions. The more disciplined you are the more confident you'll be. Updating the database can be scary, so don't rush things, write out your plan of action, and test, test, test!
One approach that I saw (and liked) was a table called versions that contained an id only.
Then there was an updates.sql script that had a structure similar to this:
DROP PROCEDURE IF EXISTS DBUpdate;
DELIMITER $$
CREATE PROCEDURE DBUpdate()
BEGIN
    IF (SELECT id FROM versions) = 1 THEN
        CREATE TABLE IF NOT EXISTS new_feature_table(
            id INT AUTO_INCREMENT PRIMARY KEY,
            blah VARCHAR(128)  -- ... remaining columns elided
        );
        UPDATE versions SET id = 2;  -- bump the version so the next block also applies
    END IF;
    IF (SELECT id FROM versions) = 2 THEN
        CREATE TABLE IF NOT EXISTS newer_feature_table(
            id INT AUTO_INCREMENT PRIMARY KEY,
            blah VARCHAR(128)  -- ... remaining columns elided
        );
        UPDATE versions SET id = 3;
    END IF;
END$$
DELIMITER ;

CALL DBUpdate();
Then you write a Python script to check the repository for updates, connect to the db, and run any changes to the schema via this procedure. It's nice because you only need a versions table with the appropriate id value to build out the entire database (with no data, that is; see ryan1234's answer concerning data backups).
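The deploy step of that Python script can stay tiny. A sketch using PyMySQL (the driver choice and connection details are assumptions on my part):
import pymysql

conn = pymysql.connect(host="localhost", user="deploy",
                       password="secret", database="my_db")
try:
    with conn.cursor() as cursor:
        # Applies every pending schema change recorded in the procedure.
        cursor.execute("CALL DBUpdate()")
    conn.commit()
finally:
    conn.close()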
I'm working on a project (written in Django) which has only a few entities, but many rows for each entity.
In my application I have several static "reports", directly written in plain SQL. The users can also search the database via a generic filter form. Since the target audience is really tech-savvy and at some point the filter doesn't fit their needs, I'm thinking about creating a query language for my database, like YQL or Jira's advanced search.
I found http://sourceforge.net/projects/littletable/ and http://www.quicksort.co.uk/DeeDoc.html, but it seems that they only operate on in-memory objects. Since the database can be too large to hold in memory, I would prefer that the query be translated into SQL (or better, a Django query) before doing the actual work.
Are there any libraries or best practices on how to do this?
Writing such a DSL is actually surprisingly easy with PLY, and what ho, there's already an example available for doing just what you want, in Django. You see, Django has this fancy thing called a Q object which makes the Django querying side of things fairly easy.
At DjangoCon EU 2012, Matthieu Amiguet gave a session entitled Implementing Domain-specific Languages in Django Applications in which he went through the process, right down to implementing such a DSL as you desire. His slides, which include all you need, are available on his website. The final code (linked to from the last slide, anyway) is available at http://www.matthieuamiguet.ch/media/misc/djangocon2012/resources/compiler.html.
Reinout van Rees also produced some good comments on that session. (He normally does!) These cover a little of the missing context.
You see in there something very similar to YQL and JQL in the examples given:
groups__name="XXX" AND NOT groups__name="YYY"
(modified > 1/4/2011 OR NOT state__name="OK") AND groups__name="XXX"
It can also be tweaked very easily; for example, you might want to use groups.name rather than groups__name (I would). This modification could be made fairly trivially (allow . in the FIELD token, by modifying t_FIELD, and then replacing . with __ before constructing the Q object in p_expression_ID).
So, that satisfies simple querying; it also gives you a good starting point should you wish to make a more complex DSL.
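To give a flavour of the approach, here is a stripped-down sketch of my own (not Matthieu Amiguet's actual code), assuming PLY and Django are installed and handling only quoted equality comparisons:
import ply.lex as lex
import ply.yacc as yacc
from django.db.models import Q

tokens = ("AND", "OR", "NOT", "FIELD", "STRING", "EQUALS")

t_EQUALS = r"="
t_ignore = " \t"

def t_AND(t):
    r"AND"
    return t

def t_OR(t):
    r"OR"
    return t

def t_NOT(t):
    r"NOT"
    return t

def t_FIELD(t):
    r"[a-z][a-z_.]*"
    return t

def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]  # strip the surrounding quotes
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

precedence = (("left", "OR"), ("left", "AND"), ("right", "NOT"))

def p_expr_binop(p):
    """expr : expr AND expr
            | expr OR expr"""
    p[0] = (p[1] & p[3]) if p[2] == "AND" else (p[1] | p[3])

def p_expr_not(p):
    "expr : NOT expr"
    p[0] = ~p[2]

def p_expr_compare(p):
    "expr : FIELD EQUALS STRING"
    # groups.name becomes the groups__name lookup that Django expects.
    p[0] = Q(**{p[1].replace(".", "__"): p[3]})

def p_error(p):
    raise SyntaxError("syntax error in query")

lexer = lex.lex()
parser = yacc.yacc()

q = parser.parse('groups.name = "XXX" AND NOT groups.name = "YYY"')
# q can now be handed to SomeModel.objects.filter(q)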
I've faced exactly this problem - a large database which needs searching. I made some static reports and several fancy filters using django (very easy with django) just like you have.
However the power users were clamouring for more. I decided that there already was a DSL that they all knew - SQL. The question was how to make it secure enough.
So I used django permissions to give the power users permission to make SQL queries in a new table. I then made a view for the not-quite-so-power users to use these queries. I made them take optional parameters. The queries were run using Python's lower level DB-API which django is using under the hood for its ORM anyway.
The real trick was opening a read only database connection to run these queries just to make sure that no updates were ever run. I made a read only connection by creating a different user in the database with lower permissions and opening a specific connection for that in the view.
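A rough sketch of that view-side piece, assuming a second "readonly" alias in settings.DATABASES that logs in as the low-privilege database user:
from django.db import connections

def run_stored_query(sql, params=None):
    # The 'readonly' connection's database user only has SELECT rights,
    # so stored queries cannot modify anything even if they try.
    with connections["readonly"].cursor() as cursor:
        cursor.execute(sql, params or [])
        return cursor.fetchall()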
TL;DR - SQL is the way to go!
Depending on the form of your data, the types of queries your users need to use, and the frequency that your data is updated, an alternative to the pure SQL solution suggested by Nick Craig-Wood is to index your data in Solr and then run queries against it.
Solr is an added layer of complexity (configuration, data synchronization) but it is super-fast, can handle large datasets, and provides a (relatively) intuitive query language.
You could write your own SQL-ish language using pyparsing, actually. There is even a pretty verbose example you could extend.
I'm programming a web application using sqlalchemy. Everything was smooth during the first phase of development when the site was not in production. I could easily change the database schema by simply deleting the old sqlite database and creating a new one from scratch.
Now the site is in production and I need to preserve the data, but I still want to keep my original development speed by easily converting the database to the new schema.
So let's say that I have model.py at revision 50 and model.py at revision 75, describing the schema of the database. Between those two schemas most changes are trivial; for example, a new column is declared with a default value and I just want to add this default value to old records.
Eventually a few changes may not be trivial and require some pre-computation.
How do (or would) you handle fast-changing web applications with, say, one or two new versions of the production code per day?
By the way, the site is written in Pylons if this makes any difference.
Alembic is a new database migrations tool, written by the author of SQLAlchemy. I've found it much easier to use than sqlalchemy-migrate. It also works seamlessly with Flask-SQLAlchemy.
Auto generate the schema migration script from your SQLAlchemy models:
alembic revision --autogenerate -m "description of changes"
Then apply the new schema changes to your database:
alembic upgrade head
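The generated revision file is plain Python that you can edit before applying. It looks roughly like this (the revision ids, table and column names here are placeholders):
"""add nickname to user"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = "1a2b3c4d5e6f"
down_revision = None

def upgrade():
    op.add_column("user", sa.Column("nickname", sa.String(50), nullable=True))

def downgrade():
    op.drop_column("user", "nickname")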
More info here: http://readthedocs.org/docs/alembic/
What we do.
Use "major version"."minor version" identification of your applications. Major version is the schema version number. The major number is no some random "enough new functionality" kind of thing. It's a formal declaration of compatibility with database schema.
Release 2.3 and 2.4 both use schema version 2.
Release 3.1 uses the version 3 schema.
Make the schema version very, very visible. For SQLite, this means keep the schema version number in the database file name. For MySQL, use the database name.
Write migration scripts. 2to3.py, 3to4.py. These scripts work in two phases. (1) Query the old data into the new structure creating simple CSV or JSON files. (2) Load the new structure from the simple CSV or JSON files with no further processing. These extract files -- because they're in the proper structure, are fast to load and can easily be used as unit test fixtures. Also, you never have two databases open at the same time. This makes the scripts slightly simpler. Finally, the load files can be used to move the data to another database server.
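A minimal sketch of such a two-phase 2to3.py, assuming SQLite files and made-up table/column names (the version-3 schema is assumed to exist already):
import json
import sqlite3

EXTRACT_FILE = "customer_v3.json"

# Phase 1: query the old data into the new structure as a plain JSON file.
old = sqlite3.connect("app_v2.db")
records = [
    {"id": row[0], "name": row[1], "status": "active"}  # new column gets a default
    for row in old.execute("SELECT id, name FROM customer")
]
old.close()
with open(EXTRACT_FILE, "w") as f:
    json.dump(records, f)

# Phase 2: load the new structure from the file, with no further processing.
new = sqlite3.connect("app_v3.db")
with open(EXTRACT_FILE) as f:
    for rec in json.load(f):
        new.execute(
            "INSERT INTO customer (id, name, status) VALUES (?, ?, ?)",
            (rec["id"], rec["name"], rec["status"]),
        )
new.commit()
new.close()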
It's very, very hard to "automate" schema migration. It's easy (and common) to have database surgery so profound that an automated script can't easily map data from old schema to new schema.
Use sqlalchemy-migrate.
It is designed to support an agile approach to database design, and make it easier to keep development and production databases in sync, as schema changes are required. It makes schema versioning easy.
Think of it as a version control for your database schema. You commit each schema change to it, and it will be able to go forwards/backwards on the schema versions. That way you can upgrade a client and it will know exactly which set of changes to apply on that client's database.
It does what S.Lott proposes in his answer, automatically for you. Makes a hard thing easy.
The best way to deal with your problem is to reflect your schema instead of doing it the declarative way. I wrote an article about the reflective approach here:
http://petrushev.wordpress.com/2010/06/16/reflective-approach-on-sqlalchemy-usage/
but there are other resources about this also. In this manner, every time you make changes to your schema, all you need to do is restart the app and the reflection will fetch the new metadata for the changed tables. This is quite fast, and SQLAlchemy does it only once per process. Of course, you'll have to manage the relationship changes you make yourself.
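A minimal reflection sketch, assuming a local SQLite file and an existing users table:
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("sqlite:///app.db")  # placeholder connection URL
metadata = MetaData()

# The table definition is read from the live database at startup, so a
# restart is enough to pick up schema changes.
users = Table("users", metadata, autoload_with=engine)

with engine.connect() as conn:
    rows = conn.execute(users.select()).fetchall()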
Disclaimer:
I'm very new to Django. I must say that so far I really like it. :)
(now for the "but"...)
But, there seems to be something I'm missing related to unit testing. I'm working on a new project with an Oracle backend. When you run the unit tests, it immediately gives a permissions error when trying to create the schema. So, I get what it's trying to do (create a clean sandbox), but what I really want is to test against an existing schema. And I want to run the test with the same username/password that my server is going to use in production. And of course, that user is NOT going to have any kind of DDL type rights.
So, the basic problem/issue that I see boils down to this: my system (and most) want to have their "app_user" account to have ONLY the permissions needed to run. Usually, this is basic "CRUD" permissions. However, Django unit tests seem to need more than this to do a test run.
How do other people handle this? Is there some settings/work around/feature of Django that I'm not aware (please refer to the initial disclaimer).
Thanks in advance for your help.
David
Don't force Django to do something unnatural.
Allow it to create the test schema. It's a good thing.
From your existing schema, do an unload to create .JSON dump files of the data. These files are your "fixtures". These fixtures are used by Django to populate the test database. This is The Greatest Testing Tool Ever. Once you get your fixtures squared away, this really does work well.
Put your fixture files into fixtures directories within each app package.
Update your unit tests to name the various fixtures files that are required for that test case.
This -- in effect -- tests with an existing schema. It rebuilds, reloads and tests in a virgin database so you can be absolutely sure that it works without destroying (or even touching) live data.
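Concretely, the flow is: dump the existing data once, then name the fixture file in your test cases. The app name and fixture file name below are placeholders:
# One-time dump from the existing schema, run from the shell:
#   python manage.py dumpdata myapp --indent 2 > myapp/fixtures/initial_data.json

from django.test import TestCase

class AssetTestCase(TestCase):
    # Django loads this fixture into the freshly created test schema
    # before each test method runs.
    fixtures = ["initial_data.json"]

    def test_something(self):
        ...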
As you've discovered, Django's default test runner makes quite a few assumptions, including that it'll be able to create a new test database to run the tests against.
If you need to override this or any of these default assumptions, you probably want to write a custom test runner. By doing so you'll have full control over exactly how tests are discovered, bootstrapped, and run.
(If you're running Django's development trunk, or are looking forward to Django 1.2, note that defining custom test runners has recently gotten quite a bit easier.)
If you poke around, you'll find a few examples of custom test runners you could use to get started.
Now, keep in mind that once you've taken control of test running you'll need to ensure that you somehow meet the same assumptions about the environment that Django's built-in runner does. In particular, you'll need to somehow guarantee that whatever test database you use is a clean, fresh one for the tests -- you'll be quite unhappy if you try to run tests against a database with unpredictable contents.
After I read David's (OP) question, I was curious about this too, but I don't see the answer I was hoping to see. So let me try to rephrase at least part of what I think David is asking. In a production environment, his Django models probably will not have access to create or drop tables. His DBA will probably not allow him to have permission to do this. (Let's assume this is true.) He will only be logged into the database with regular user privileges. But in his development environment, the Django unittest framework forces him to run the unit tests with higher privileges than a regular user, because Django requires the ability to create/drop tables for the model unit tests. Since the unit tests are now running at a higher privilege than will occur in production, you could argue that running the unit tests in development is not 100% valid, and errors could happen in production that might have been caught in development if Django could run the unit tests with user privileges.
I'm curious if Django unittests will ever have the ability to create/drop tables with one user's (higher) privileges, and run the unittests with a different user's (lower) privileges. This would help more accurately simulate the production environment in development.
Maybe in practice this is really not an issue, and the risk is so minor compared to the reward that it's not worth worrying about.
Generally speaking, when unit tests depend on test data to be present, they also depend on it to be in a specific format/state. As such, your framework's policy is to not only execute DML (delete/insert test data records), but it also executes DDL (drop/create tables) to ensure that everything is in working order prior to running your tests.
What I would suggest is that you grant the necessary privileges for DDL to your app_user ONLY on your test_ database.
If you don't like that solution, then have a look at this blog entry where a developer also ran into your scenario and solved it with a workaround:
http://www.stopfinder.com/blog/2008/07/26/flexible-test-database-engine-selection-in-django/
Personally, my choice would be to modify the privileges for the test database. This way, I could rule out all other variables when comparing performance/results between testing/production environments.
HTH,
-aj
What you can do is create separate test settings.
As I've learned at http://mindlesstechnology.wordpress.com/2008/08/16/faster-django-unit-tests/ you can use the sqlite3 backend, which is created in memory by the Django unit test framework.
Quoting:
Create a new test_settings.py file next to your app’s settings.py
containing:
from projectname.settings import *
DATABASE_ENGINE = 'sqlite3'
Then when you want to run tests real fast, instead of manage.py test,
you run
manage.py test --settings=test_settings
This runs my test suite in less than 5 seconds.
Obviously you still want to run tests on your real db backend, but
this is awesome for sanity checks, and while you’re doing test
development.
To load initial data, provide fixtures in your testcase.
class MyAppTestCase(TestCase):
fixtures = ['myapp/fixtures/filename']