SQLAlchemy trying to access a table in the wrong database - python

We use CherryPy and SQLAlchemy to build our web app, and everything was fine until we tested with 2 concurrent users - then things started to go wrong! Not very good for a web app, so I'd be very appreciative if anyone could shed some light on this.
TL;DR
We're getting the following error about 10% of the time when two users are using our site (but accessing different databases) at the same time:
ProgrammingError: (ProgrammingError) (1146, "Table 'test_one.other_child_entity' doesn't exist")
This table is not present in that database so the error makes sense but the problem is that SQLAlchemy shouldn't be looking for the table in that database.
I have reproduced the error in an example here https://gist.github.com/1729817
Explanation
We're developing an application that is very dynamic and is based on the entity_name pattern found at http://www.sqlalchemy.org/trac/wiki/UsageRecipes/EntityName
We've since grown that idea so that it stores entities in different databases depending on what user you're logged in as. This is because each user in the system has their own database and can create their own entities (tables). To do this we extend a base entity for each database and then extend that new entity for each additional entity they create in their database.
When the app starts we create a dictionary containing the engine, metadata, classes and tables of all these databases and reflect all of the metadata. When a user logs in they get access to one.
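Roughly, that registry looks something like the sketch below (simplified and with illustrative names, not the gist's exact code):
from sqlalchemy import MetaData, create_engine

databases = {}
for db_name in ('test_one', 'test_two'):  # illustrative database names
    engine = create_engine('mysql://user:password@localhost/' + db_name)
    metadata = MetaData()
    metadata.reflect(bind=engine)  # load this database's tables into the metadata
    databases[db_name] = {
        'engine': engine,
        'metadata': metadata,
        'tables': metadata.tables,
        'classes': {},  # mapped classes built per entity, as in the entity_name recipe
    }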
When two users are accessing the site at the same time something is going wrong and SQLAlchemy ends up looking for tables in the wrong database. I guess this is to do with threading but as far as I can see we are following all the rules when it comes to sessions (CP and SQLA), engines, metadata, tables and mappers.
If anyone could give my example (https://gist.github.com/1729817) a quick glance over and point out any glaring problems that would be great.
Update
I can fix the problem by changing the code to use my own custom session router like so:
# Thank you zzzeek (http://techspot.zzzeek.org/2012/01/11/django-style-database-routers-in-sqlalchemy/)
class RoutingSession(Session):
    def get_bind(self, mapper=None, clause=None):
        return databases[cherrypy.session.get('database')]['engine']
And then:
Session = scoped_session(sessionmaker(autoflush = True, autocommit = False, class_ = RoutingSession))
So just hard-coding it to return the engine that's linked to the database that's set in the session. This is great news but now I want to know why my original code didn't work. Either I'm doing it wrong or the following code is not completely safe:
# Before each request (but after the session tool)
def before_request_body():
    if cherrypy.session.get('logged_in', None) is True:
        # Configure the DB session for this thread to point to the correct DB
        Session.configure(bind=databases[cherrypy.session.get('database')]['engine'])
I guess the bind being set here was getting overwritten by the other user's thread, which is strange, because I thought scoped_session was all about thread safety?
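To illustrate what I suspect is happening, here is a single-threaded sketch of the interleaving (the engine URLs and names are placeholders, not my real code): scoped_session gives every thread its own Session instance, but .configure() changes the one shared sessionmaker factory behind it, so the "scoped" part only isolates the session objects, not the factory configuration.
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# Stand-ins for two users' per-database engines
engine_for_user_a = create_engine('sqlite:///user_a.db')
engine_for_user_b = create_engine('sqlite:///user_b.db')

factory = sessionmaker(autoflush=True)
Session = scoped_session(factory)

# "Thread A", at the start of its request:
Session.configure(bind=engine_for_user_a)  # reconfigures the single shared factory

# "Thread B" runs its own before-request hook before A has built a session:
Session.configure(bind=engine_for_user_b)  # silently overwrites A's bind

# Thread A now lazily creates its thread-local session for the first time...
session_a = Session()
print(session_a.get_bind().url)  # ...and it is bound to user B's engine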

Related

Proper way to rollback on DB connection fail in Django

This is more of a design question than anything else.
Until recently I had been using Django with SQLite in my development environment, but I have now changed to PostgreSQL for production. My app is deployed on Heroku, and after some days I realized that they do random maintenance on the DB and it goes down for a few minutes.
For example, consider a model with these tables: each Procedure points to a ProcedureList, and a ProcedureList can have more than one Procedure. A ProcedureUser links a ProcedureList to a user and sets some user-specific variables for that ProcedureList. Finally, a ProcedureState links a Procedure with its state for a specific user.
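A simplified sketch of what those models might look like (the field names here are illustrative, not my exact code):
from django.conf import settings
from django.db import models

class ProcedureList(models.Model):
    name = models.CharField(max_length=100)

class Procedure(models.Model):
    procedure_list = models.ForeignKey(ProcedureList, on_delete=models.CASCADE)

class ProcedureUser(models.Model):
    procedure_list = models.ForeignKey(ProcedureList, on_delete=models.CASCADE)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    # ...user-specific variables for this list...

class ProcedureState(models.Model):
    procedure = models.ForeignKey(Procedure, on_delete=models.CASCADE)
    procedure_user = models.ForeignKey(ProcedureUser, on_delete=models.CASCADE)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    state = models.CharField(max_length=50)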
On my app, in one of the views I have a function that modifies the DB in the following way:
user = request.user
plist = ProcedureList.objects.get(id=idFromUrl)
procedures = Procedure.objects.filter(ProcedureList=plist)
pUser = ProcedureUser(plist, user, someVariables)
pUser.save()
for procedure in procedures:
    pState = ProcedureState(plist, user, pUser, procedure, otherVariables)
    pState.save()
So what I'm thinking now is that if Heroku decides to go into maintenance between those .save() calls, we will have a problem. The later calls to .save() will fail and the DB will be left in an inconsistent state. The user's request will of course fail, and there will be no way to roll back the previous insertions, because a connection to the DB is not possible.
My question is, in case of a DB fail (given by Heroku maintenance, network error or whatever), how are we supposed to correctly rollback the DB? Shall we make a list of insertions and wait for DB to go up again to roll them back?
I am using Python 3 and Django 4 but I think this is more of a general question than specific to any platform.
in case of a DB fail (given by Heroku maintenance, network error or whatever), how are we supposed to correctly rollback the DB?
This is solved by databases through atomic transactions [wiki]. An atomic transaction is a set of queries that is committed all-or-nothing: it is thus not possible that, for such a transaction, certain queries are applied whereas others are not.
Django offers a transaction context manager [Django-doc] to perform work in a transaction:
from django.db import transaction

with transaction.atomic():
    user = request.user
    plist = ProcedureList.objects.get(id=idFromUrl)
    procedures = Procedure.objects.filter(ProcedureList=plist)
    pUser = ProcedureUser(plist, user, someVariables)
    pUser.save()
    ProcedureState.objects.bulk_create([
        ProcedureState(plist, user, pUser, procedure, otherVariables)
        for procedure in procedures
    ])
At the end of the context block, it will commit the changes. This means that if the database fails in between, the actions will not be committed, and the block will raise an exception (for example a DatabaseError if the connection drops, or an IntegrityError for constraint violations).
Note: Django has a .bulk_create(…) method [Django-doc] to create multiple items with a single database query, minimizing the bandwidth between the database and the application layer. This will usually outperform creating items in a loop.

SQLAlchemy - query without writing/locking the database

I have a multithreaded data analysis pipeline, which queries a database (via SQLAlchemy). Additionally, the database is synchronized across multiple systems by syncthing - long story short, this means that write permission cannot always be guaranteed.
Even when I am able to guarantee write access, I still occasionally and rather randomly get operational errors:
OperationalError: (sqlite3.OperationalError) database is locked
The code I use to load the session for the query is the following:
def loadSession(db_path):
    db_path = "sqlite:///" + path.expanduser(db_path)
    engine = create_engine(db_path, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()
    Base.metadata.create_all(engine)
    return session, engine
And can be seen in its full context here.
My query (and the way I turn it into a value) look like this:
session, engine = loadSession(db_path)
sql_query = session.query(LaserStimulationProtocol).filter(
    LaserStimulationProtocol.code == stim_protocol_dictionary[scan_type])
mystring = sql_query.statement
mydf = pd.read_sql_query(mystring, engine)
delay = int(mydf["stimulation_onset"][0])
And again, the full context can be found here.
How could I change my code so the database can be queried without having to rely on the file being writeable/unlocked? I have checked the file's checksum, and it does not change upon query, so clearly I'm not writing anything to it. As such, I guess there should be some way to extract the info I am looking for without write access?
I've written a blog post on the subject which provides some more explanation of the issue and some ways to work around it: http://charlesleifer.com/blog/multi-threaded-sqlite-without-the-operationalerrors/
Peewee ORM has an extension that is designed to support multiple threads writing to a SQLite database. http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#sqliteq
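If the pipeline only ever reads, another option (a sketch, assuming the standard sqlite3 URI mode; this is not from the linked post) is to open the file read-only, so the connection can never take or wait on a write lock:
import sqlite3
from os import path

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def load_readonly_session(db_path):
    # Hand SQLAlchemy a creator that opens the SQLite file in read-only URI mode;
    # any accidental write then fails loudly instead of locking the database.
    expanded = path.expanduser(db_path)
    engine = create_engine(
        'sqlite://',
        creator=lambda: sqlite3.connect('file:' + expanded + '?mode=ro', uri=True),
    )
    Session = sessionmaker(bind=engine)
    return Session(), engine
Note that the Base.metadata.create_all() call from the original loadSession is deliberately dropped here, since a read-only connection cannot create tables.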

AppEngine NDB Query return different Results

I have a query in my live app that has gone "odd"...
Running 1.8.4 SDK... 1.8.5 live instance using Python 2.7
Measurement is an NDB model... with a string property called status and a key property called asset....
(Deep in my handler code.... )
cursor=None
limit=10
asset_key = <a key to an actual asset>
qry = Measurement.query(
    Measurement.status == 'PENDING',
    Measurement.asset == asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)
Now the weird thing is, if I run this, sometimes I get 4 items and sometimes only 1 (the right answer is 4).
The dump of the query is exactly the same... cursor is set to None... the limit is always the same... same handler... same query, and no new records in between each query. Fresh instance (e.g. first run + no other users).
Each query is only separated by seconds, yet the results are different.
Am I missing something here... has anyone else experienced this? Is this some sort of corrupt index? (It is a relatively large "table" with 482,911 items) Is NDB caching a cursor variable???
Very very odd.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (as per the docs). https://developers.google.com/appengine/docs/python/ndb/cache#incontext
Perhaps review the caching policy for the entity in question. However, from your snippet I'm unsure if your query is strongly consistent. That is more likely the cause of this issue: https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
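For reference, a sketch of the ancestor-query approach from that doc (this assumes Measurements could be written into the asset's entity group, which is a schema change and not the asker's current model):
from google.appengine.ext import ndb

class Measurement(ndb.Model):
    status = ndb.StringProperty()
    asset = ndb.KeyProperty()

# Measurements would have to be written with the asset as their parent, e.g.
# Measurement(parent=asset_key, status='PENDING', asset=asset_key).put()

asset_key = ndb.Key('Asset', 42)  # placeholder key
qry = Measurement.query(Measurement.status == 'PENDING', ancestor=asset_key)
results, cursor, more = qry.fetch_page(10)  # ancestor queries are strongly consistent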

How to retrieve the real SQL from the Django logger?

I am trying to analyse the SQL performance of our Django (1.3) web application. I have added a custom log handler which attaches to django.db.backends, and set DEBUG = True; this allows me to see all the database queries that are being executed.
However the logged SQL is not valid SQL! The actual query is select * from app_model where name = %s with some parameters passed in (e.g. "admin"), but the logging message doesn't quote the params, so the SQL comes out as select * from app_model where name = admin, which is wrong. This also happens using django.db.connection.queries. AFAIK the Django debug toolbar has a complex custom cursor to handle this.
Update: For those suggesting the Django debug toolbar: I am aware of that tool, and it is great. However, it does not do what I need. I want to run a sample interaction of our application and aggregate the SQL that's used. DjDT is great for inspecting a single page, but not for aggregating and summarizing the interaction of dozens of pages.
Is there any easy way to get the real, legit, SQL that is run?
Check out django-debug-toolbar. Open a page, and a sidebar will be displayed with all SQL queries plus other information.
select * from app_model where name = %s is a prepared statement. I would recommend logging the statement and the parameters separately. In order to get a well-formed query you need to do something like "select * from app_model where name = %s" % quote_string("admin") or, more generally, query % tuple(map(quote_string, params)).
Please note that quote_string is DB-specific and the Python DB-API 2.0 does not define a quote_string method, so you need to write one yourself. For logging purposes I'd recommend keeping the queries and parameters separate, as it allows for far better profiling: you can easily group the queries without taking the actual values into account.
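As a sketch of that idea outside Django's own logging (assuming PostgreSQL with psycopg2, which is not stated in the question), the driver itself can render the fully-quoted SQL for you:
import psycopg2

conn = psycopg2.connect('dbname=myapp user=me')  # hypothetical connection settings
cur = conn.cursor()
# mogrify() returns the query with the parameters safely quoted by the driver
real_sql = cur.mogrify('select * from app_model where name = %s', ['admin'])
print(real_sql)  # b"select * from app_model where name = 'admin'"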
The Django Docs state that this incorrect quoting only happens for SQLite.
https://docs.djangoproject.com/en/dev/ref/databases/#sqlite-connection-queries
Have you tried another Database Engine?
Every QuerySet object has a query attribute. One way to do what you want (perhaps not an ideal one, I admit) is to chain the lookups each view produces into a kind of scripted user story, using Django's test client. For each lookup your user story contains, just append the query to a file-like object that you write out at the end, for example (using a list instead for brevity):
l = []
o = Object.objects.all()
l.append(o.query)

How to disable SQLAlchemy caching?

I have a caching problem when I use sqlalchemy.
I use sqlalchemy to insert data into a MySQL database. Then, I have another application process this data, and update it directly.
But sqlalchemy always returns the old data rather than the updated data. I think sqlalchemy cached my request ... so ... how should I disable it?
The usual cause for people thinking there's a "cache" at play, besides the usual SQLAlchemy identity map which is local to a transaction, is that they are observing the effects of transaction isolation. SQLAlchemy's session works by default in a transactional mode, meaning it waits until session.commit() is called in order to persist data to the database. During this time, other transactions in progress elsewhere will not see this data.
However, due to the isolated nature of transactions, there's an extra twist. Those other transactions in progress will not only not see your transaction's data until it is committed, they also can't see it in some cases until they are committed or rolled back also (which is the same effect your close() is having here). A transaction with an average degree of isolation will hold onto the state that it has loaded thus far, and keep giving you that same state local to the transaction even though the real data has changed - this is called repeatable reads in transaction isolation parlance.
http://en.wikipedia.org/wiki/Isolation_%28database_systems%29
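A small runnable sketch of the session-level half of this effect (the Widget model, the file name and the SQLAlchemy 1.4+ API are assumptions, not from the question); with a stricter isolation level such as repeatable read, the database itself would also hold back the new row until the reading transaction ends:
import os

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Widget(Base):
    __tablename__ = 'widgets'
    id = Column(Integer, primary_key=True)
    name = Column(String)

if os.path.exists('demo.db'):
    os.remove('demo.db')  # keep the sketch repeatable

engine = create_engine('sqlite:///demo.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

writer, reader = Session(), Session()
writer.add(Widget(id=1, name='old'))
writer.commit()

print(reader.get(Widget, 1).name)  # 'old' -- the reader now holds this state

writer.get(Widget, 1).name = 'new'
writer.commit()  # the "other" transaction commits a change

print(reader.get(Widget, 1).name)  # still 'old': served from the reader's identity map
reader.commit()  # end the reader's transaction; its loaded state is expired
print(reader.get(Widget, 1).name)  # 'new': the fresh transaction sees the committed change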
This issue has been really frustrating for me, but I have finally figured it out.
I have a Flask/SQLAlchemy Application running alongside an older PHP site. The PHP site would write to the database and SQLAlchemy would not be aware of any changes.
I tried the sessionmaker setting autoflush=True unsuccessfully
I tried db_session.flush(), db_session.expire_all(), and db_session.commit() before querying and NONE worked. Still showed stale data.
Finally I came across this section of the SQLAlchemy docs: http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html#transaction-isolation-level
Setting the isolation_level worked great. Now my Flask app is "talking" to the PHP app. Here's the code:
engine = create_engine(
    "postgresql+pg8000://scott:tiger@localhost/test",
    isolation_level="READ UNCOMMITTED"
)
When the SQLAlchemy engine is started with the "READ UNCOMMITTED" isolation_level it will perform "dirty reads", which means it will read uncommitted changes directly from the database.
Hope this helps
Here is a possible solution, courtesy of AaronD in the comments:
from flask.ext.sqlalchemy import SQLAlchemy

class UnlockedAlchemy(SQLAlchemy):
    def apply_driver_hacks(self, app, info, options):
        if "isolation_level" not in options:
            options["isolation_level"] = "READ COMMITTED"
        return super(UnlockedAlchemy, self).apply_driver_hacks(app, info, options)
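Hypothetical usage, assuming an otherwise standard Flask-SQLAlchemy setup (the URI is a placeholder):
from flask import Flask

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://scott:tiger@localhost/test"  # placeholder
db = UnlockedAlchemy(app)  # sessions now default to READ COMMITTED isolation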
In addition to zzzeek's excellent answer:
I had a similar issue, and I solved it by using short-lived sessions.
from contextlib import closing

with closing(new_session()) as sess:
    # do your stuff
I used a fresh session per task, task group or request (in the case of a web app). That solved the "caching" problem for me.
This material was very useful for me:
When do I construct a Session, when do I commit it, and when do I close it
This was happening in my Flask application, and my solution was to expire all objects in the session after every request.
from flask.signals import request_finished

def expire_session(sender, response, **extra):
    app.db.session.expire_all()

request_finished.connect(expire_session, flask_app)
Worked like a charm.
I tried session.commit() and session.flush(); neither worked for me.
After going through the SQLAlchemy source code, I found a way to disable this caching:
Setting query_cache_size=0 in create_engine worked.
create_engine(connection_string, convert_unicode=True, echo=True, query_cache_size=0)
First, SQLAlchemy itself has no result cache.
Depending on the method you use to fetch data from the DB, you should run a test after the database has been updated by the other process and see whether you can get the new data.
(1) use a Connection:
connection = engine.connect()
result = connection.execute("select username from users")
for row in result:
    print("username:", row['username'])
connection.close()
(2) use Engine ...
(3) use MetaData...
Please follow the steps in: http://docs.sqlalchemy.org/en/latest/core/connections.html
Another possible reason is that your MySQL DB is not actually being updated permanently. Restart the MySQL service and check.
As far as I know, SQLAlchemy does not cache results, so you need to look at the logging output.
