Summary
One of our threads in production hit an error and is now raising "InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction." on every request it serves that runs a query, for the rest of its life! It's been doing this for days now! How is this possible, and how can we prevent it going forward?
Background
We are using a Flask app on uWSGI (4 processes, 2 threads), with Flask-SQLAlchemy providing us DB connections to SQL Server.
The problem seemed to start when one of our threads in production was tearing down its request, inside this Flask-SQLAlchemy method:
@teardown
def shutdown_session(response_or_exc):
    if app.config['SQLALCHEMY_COMMIT_ON_TEARDOWN']:
        if response_or_exc is None:
            self.session.commit()
    self.session.remove()
    return response_or_exc
...and somehow managed to call self.session.commit() when the transaction was invalid. This resulted in sqlalchemy.exc.InvalidRequestError: Can't reconnect until invalid transaction is rolled back being written to stdout, in defiance of our logging configuration. That makes sense, since it happened while the app context was tearing down, which is never supposed to raise exceptions. I'm not sure how the transaction got to be invalid without response_or_exc getting set, but that's actually the lesser problem AFAIK.
The bigger problem is, that's when the "'prepared' state" errors started, and haven't stopped since. Every time this thread serves a request that hits the DB, it 500s. Every other thread seems to be fine: as far as I can tell, even the thread that's in the same process is doing OK.
Wild guess
The SQLAlchemy mailing list has an entry about the "'prepared' state" error saying it happens if a session started committing and hasn't finished yet, and something else tries to use it. My guess is that the session in this thread never got to the self.session.remove() step, and now it never will.
I still feel like that doesn't explain how this session is persisting across requests though. We haven't modified Flask-SQLAlchemy's use of request-scoped sessions, so the session should get returned to SQLAlchemy's pool and rolled back at the end of the request, even the ones that are erroring (though admittedly, probably not the first one, since that raised during the app context tearing down). Why are the rollbacks not happening? I could understand it if we were seeing the "invalid transaction" errors on stdout (in uwsgi's log) every time, but we're not: I only saw it once, the first time. But I see the "'prepared' state" error (in our app's log) every time the 500s occur.
Configuration details
We've turned off expire_on_commit in the session_options, and we've turned on SQLALCHEMY_COMMIT_ON_TEARDOWN. We're only reading from the database, not writing yet. We're also using Dogpile-Cache for all of our queries (using the memcached lock since we have multiple processes, and actually, 2 load-balanced servers). The cache expires every minute for our major query.
Updated 2014-04-28: Resolution steps
Restarting the server seems to have fixed the problem, which isn't entirely surprising. That said, I expect to see it again until we figure out how to stop it. benselme (below) suggested writing our own teardown callback with exception handling around the commit, but I feel like the bigger problem is that the thread was messed up for the rest of its life. The fact that this didn't go away after a request or two really makes me nervous!
Edit 2016-06-05:
A PR that solves this problem was merged on May 26, 2016.
Flask PR 1822
Edit 2015-04-13:
Mystery solved!
TL;DR: Be absolutely sure your teardown functions succeed, by using the teardown-wrapping recipe in the 2014-12-11 edit!
Started a new job also using Flask, and this issue popped up again, before I'd put in place the teardown-wrapping recipe. So I revisited this issue and finally figured out what happened.
As I thought, Flask pushes a new request context onto the request context stack every time a new request comes down the line. This is used to support request-local globals, like the session.
Flask also has a notion of "application" context which is separate from request context. It's meant to support things like testing and CLI access, where HTTP isn't happening. I knew this, and I also knew that that's where Flask-SQLA puts its DB sessions.
During normal operation, both a request and an app context are pushed at the beginning of a request, and popped at the end.
However, it turns out that when pushing a request context, the request context checks whether there's an existing app context, and if one's present, it doesn't push a new one!
So if the app context isn't popped at the end of a request due to a teardown function raising, not only will it stick around forever, it won't even have a new app context pushed on top of it.
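The mechanism described above can be illustrated with a toy model. This is only a sketch of the behaviour as described in this answer, not Flask's actual implementation: the App class, its attributes, and failing_teardown are all hypothetical names invented for the illustration.

```python
class App:
    """Toy model of the context handling described above (not Flask's real code)."""

    def __init__(self):
        self.app_ctx_stack = []   # stand-in for the application context stack
        self.teardown_funcs = []  # stand-in for registered teardown functions
        self.created = 0          # how many app contexts were ever created

    def handle_request(self):
        # An app context is pushed only if none is present.
        pushed_here = not self.app_ctx_stack
        if pushed_here:
            self.created += 1
            self.app_ctx_stack.append({"id": self.created})
        ctx_id = self.app_ctx_stack[-1]["id"]
        # Only the request that pushed the app context pops it again.
        if pushed_here:
            self.pop_app_context()
        return ctx_id

    def pop_app_context(self):
        ctx = self.app_ctx_stack[-1]
        for func in self.teardown_funcs:
            func(ctx)             # if this raises, the pop below never runs
        self.app_ctx_stack.pop()


def failing_teardown(ctx):
    raise RuntimeError("commit failed during teardown")


app = App()
app.teardown_funcs.append(failing_teardown)

# The first request creates context 1, and its teardown raises.
try:
    app.handle_request()
except RuntimeError:
    pass

# Every later request silently reuses the stale context.
stale_ids = [app.handle_request() for _ in range(3)]
print(stale_ids)  # [1, 1, 1]
```

In this model the teardown functions never run again either, because no later request believes it owns the app context. That matches the symptom of a thread staying broken until the process restarts.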
That also explains some magic I hadn't understood in our integration tests. You can INSERT some test data, then run some requests and those requests will be able to access that data despite you not committing. That's only possible since the request has a new request context, but is reusing the test application context, so it's reusing the existing DB connection. So this really is a feature, not a bug.
That said, it does mean you have to be absolutely sure your teardown functions succeed, using something like the teardown-function wrapper below. That's a good idea even without that feature to avoid leaking memory and DB connections, but is especially important in light of these findings. I'll be submitting a PR to Flask's docs for this reason. (Here it is)
Edit 2014-12-11:
One thing we ended up putting in place was the following code (in our application factory), which wraps every teardown function to make sure it logs the exception and doesn't raise further. This ensures the app context always gets popped successfully. Obviously this has to go after you're sure all teardown functions have been registered.
# Flask specifies that teardown functions should not raise.
# However, they might not have their own error handling,
# so we wrap them here to log any errors and prevent errors from
# propagating.
from functools import wraps  # needed for @wraps below

def wrap_teardown_func(teardown_func):
    @wraps(teardown_func)
    def log_teardown_error(*args, **kwargs):
        try:
            teardown_func(*args, **kwargs)
        except Exception as exc:
            app.logger.exception(exc)
    return log_teardown_error

if app.teardown_request_funcs:
    for bp, func_list in app.teardown_request_funcs.items():
        for i, func in enumerate(func_list):
            app.teardown_request_funcs[bp][i] = wrap_teardown_func(func)
if app.teardown_appcontext_funcs:
    for i, func in enumerate(app.teardown_appcontext_funcs):
        app.teardown_appcontext_funcs[i] = wrap_teardown_func(func)
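To see the effect of the wrapper without a Flask app, here is a small standalone sketch using only the standard library. The logger name, the two teardown functions, and the plain list standing in for Flask's registry are assumptions made for the demo. It also returns the wrapped function's result, which the recipe above does not do.

```python
import logging
from functools import wraps

logger = logging.getLogger("teardown_demo")  # hypothetical logger name


def wrap_teardown_func(teardown_func):
    @wraps(teardown_func)
    def log_teardown_error(*args, **kwargs):
        try:
            return teardown_func(*args, **kwargs)
        except Exception:
            # Log with traceback, then swallow so later teardowns still run.
            logger.exception("teardown function %s failed", teardown_func.__name__)
    return log_teardown_error


calls = []


def broken_teardown(exc):
    calls.append("broken")
    raise RuntimeError("boom")


def healthy_teardown(exc):
    calls.append("healthy")


# A plain list standing in for the framework's registry of teardown functions.
teardown_funcs = [wrap_teardown_func(f) for f in (broken_teardown, healthy_teardown)]

for func in teardown_funcs:
    func(None)

print(calls)  # ['broken', 'healthy']
```

The second function still runs even though the first one raised, which is the property that keeps the context pop from being skipped.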
Edit 2014-09-19:
Ok, turns out --reload-on-exception isn't a good idea if 1.) you're using multiple threads and 2.) terminating a thread mid-request could cause trouble. I thought uWSGI would wait for all requests for that worker to finish, like uWSGI's "graceful reload" feature does, but it seems that's not the case. We started having problems where a thread would acquire a dogpile lock in Memcached, then get terminated when uWSGI reloads the worker due to an exception in a different thread, meaning the lock is never released.
Removing SQLALCHEMY_COMMIT_ON_TEARDOWN solved part of our problem, though we're still getting occasional errors during app teardown during session.remove(). It seems these are caused by SQLAlchemy issue 3043, which was fixed in version 0.9.5, so hopefully upgrading to 0.9.5 will allow us to rely on the app context teardown always working.
Original:
How this happened in the first place is still an open question, but I did find a way to prevent it: uWSGI's --reload-on-exception option.
Our Flask app's error handling ought to be catching just about anything, so it can serve a custom error response, which means only the most unexpected exceptions should make it all the way to uWSGI. So it makes sense to reload the whole app whenever that happens.
We'll also turn off SQLALCHEMY_COMMIT_ON_TEARDOWN, though we'll probably commit explicitly rather than writing our own callback for app teardown, since we're writing to the database so rarely.
A surprising thing is that there's no exception handling around that self.session.commit(). And a commit can fail, for example if the connection to the DB is lost. So the commit fails, the session is not removed, and the next time that particular thread handles a request it still tries to use that now-invalid session.
Unfortunately, Flask-SQLAlchemy doesn't offer any clean way to plug in your own teardown function. One way would be to set SQLALCHEMY_COMMIT_ON_TEARDOWN to False and then write your own teardown function.
It should look like this:
@app.teardown_appcontext
def shutdown_session(response_or_exc):
    try:
        if response_or_exc is None:
            sqla.session.commit()
    finally:
        sqla.session.remove()
    return response_or_exc
Now, you will still have your failing commits, and you'll have to investigate that separately... But at least your thread should recover.
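The difference the try/finally makes can be checked without a database. In this sketch FakeSession is a hypothetical stand-in for the scoped session, and both teardown functions are simplified versions written for the demo, not Flask-SQLAlchemy code.

```python
class FakeSession:
    """Hypothetical stand-in for a scoped session; records whether remove() ran."""

    def __init__(self, commit_fails):
        self.commit_fails = commit_fails
        self.removed = False

    def commit(self):
        if self.commit_fails:
            raise ConnectionError("lost connection to the database")

    def remove(self):
        self.removed = True


def teardown_without_finally(session, exc=None):
    if exc is None:
        session.commit()
    session.remove()      # skipped when commit() raises


def teardown_with_finally(session, exc=None):
    try:
        if exc is None:
            session.commit()
    finally:
        session.remove()  # runs whether or not commit() raised


leaky = FakeSession(commit_fails=True)
try:
    teardown_without_finally(leaky)
except ConnectionError:
    pass

safe = FakeSession(commit_fails=True)
try:
    teardown_with_finally(safe)
except ConnectionError:
    pass

print(leaky.removed, safe.removed)  # False True
```

Note that the exception still propagates in both versions; the finally block only guarantees the cleanup.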
Related
I am trying to find a bug which happens from time to time on our production server, but could not be reproduced otherwise: some value in the DB gets changed in a way which I don't want it to.
I could write a PostgreSQL trigger which fires if this bug happens, and raise an exception from said trigger. I would see the Python traceback which executes the unwanted SQL statement.
But in this case I don't want to stop the processing of the request.
Is there a way to log the Python/Django traceback from within a PostgreSQL trigger?
I know that this is not trivial since the DB code runs under a different Linux process with a different user id.
I am using Python, Django, PostgreSQL, Linux.
I guess this is not easy since the DB trigger runs in a different context than the python interpreter.
Please ask if you need further information.
Update
One solution might be to overwrite connection.notices of psycopg2.
Is there a way to log the Python/Django traceback from within a PostgreSQL trigger?
No, there is not
The (SQL) query is executed on the DBMS-server, and so is the code inside the trigger
The Python code is executed on the client which is a different process, possibly executed by a different user, and maybe even on a different machine.
The only connection between the server (which detects the condition) and the client (which needs to perform the stack dump) is the connected socket. You could try to extend the server's reply (if there is one) with some status code, which the client then uses to dump its own stack. This will only work if the trigger is part of the current transaction, not of some unrelated process.
The other way is: massive logging. Make the DBMS write every submitted SQL to its logfile. This can cause huge amounts of log entries, which you have to inspect.
Given this setup
(django/python) -[SQL connection]-> (PostgreSQL server)
your intuition that
I guess this is not easy since the DB trigger runs in a different context than the python interpreter.
is correct. At least, we won't be able to do this exactly the way you want it; not without much acrobatics.
However, there are options, each with drawbacks:
If you are using django with SQLAlchemy, you can register event listeners (either ORM events or Core Events) that detect this bad SQL statement you are hunting, and log a traceback.
Write a wrapper around your SQL driver, check for the bad SQL statement you are hunting, and log the traceback every time it's detected.
Give every SQL transaction, or every django request, an ID (could just be some UUID in werkzeug's request-bound storage manager). From here, we gain more options:
Configure the logger to log this request ID everywhere, and log all SQL statements in SQLAlchemy. This lets you correlate Django requests, and specific function invocations, with SQL statements. You can do this with echo= in SQLAlchemy.
Include this request ID in every SQL statement (extra column?), then log this ID in the PostgreSQL trigger with RAISE NOTICE. This lets you correlate client-side activity in django against server-side activity in PostgreSQL.
In the spirit of "Test in Production" espoused by Charity Majors, send every request to a sandbox copy of your Django app that reads/writes a sandboxed copy of your production database. In the sandbox database, raise the exception and log your traceback.
You can take this idea further and create smaller "async" setups. For example, you can, for each request, trigger an async duplicate (say, with celery) of the same request that hits a DB configured with your PostgreSQL trigger to fail and log the traceback.
Use RAISE EXCEPTION in the PostgreSQL trigger to roll back the current transaction. In Python, catch that specific exception, log it, then repeat the transaction, changing the data slightly (extra column?) to indicate that this is a retry and the trigger should not fail.
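As a rough sketch of the "wrapper around your SQL driver" option, the following uses the standard library's sqlite3 as a stand-in for the real driver so it can run anywhere. The AuditingCursor class, the logger name, the account table, and the suspicious statement prefix are all hypothetical. psycopg2 cursors can be subclassed and passed via cursor_factory in a similar spirit, but that variant is not shown or tested here.

```python
import logging
import sqlite3
import traceback

logger = logging.getLogger("sql_audit")  # hypothetical logger name


class AuditingCursor(sqlite3.Cursor):
    """Logs the Python stack whenever a suspicious statement is executed."""

    suspicious = "UPDATE account SET balance"  # hypothetical statement to hunt
    hits = []                                  # captured stacks, kept for inspection

    def execute(self, sql, parameters=()):
        if self.suspicious in sql:
            stack = "".join(traceback.format_stack())
            self.hits.append(stack)
            logger.warning("suspicious SQL %r issued from:\n%s", sql, stack)
        return super().execute(sql, parameters)


conn = sqlite3.connect(":memory:")
cur = conn.cursor(factory=AuditingCursor)
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account (id, balance) VALUES (1, 100)")


def buggy_code_path():
    # The call we are trying to locate in a large code base.
    cur.execute("UPDATE account SET balance = ? WHERE id = ?", (0, 1))


buggy_code_path()
conn.commit()
```

The statement still runs, so request processing is not interrupted; only the traceback is recorded. The logged stack names buggy_code_path as the caller.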
Is there a reason you can't SELECT all row values into Python, then do the detection in Python entirely?
So if you're able to detect the condition after the queries execute, then you can log the condition and/or throw an exception.
Then what you need is tooling like Sentry or New Relic.
You could use LISTEN+NOTIFY.
First let some daemon thread LISTEN and in the db trigger you can execute a NOTIFY.
The daemon thread receives the notify event and can dump the stacktrace of the main thread.
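The "dump the stack of another thread" part can be sketched with the standard library alone. Here a queue.Queue stands in for the NOTIFY events (a real listener would wait on the database connection instead), and the function names are invented for the demo. sys._current_frames() is CPython-specific.

```python
import queue
import sys
import threading
import traceback

notifications = queue.Queue()  # stand-in for NOTIFY events from the database
dumps = []                     # (payload, formatted stack) pairs
done = threading.Event()


def listener(worker_thread_id):
    """Wait for one notification, then record the worker thread's current stack."""
    payload = notifications.get()
    frame = sys._current_frames()[worker_thread_id]
    dumps.append((payload, "".join(traceback.format_stack(frame))))
    done.set()


def do_request_work():
    notifications.put("row 42 changed")  # pretend the trigger just fired NOTIFY
    done.wait(timeout=5)                 # the worker is still busy at this point


worker_thread_id = threading.get_ident()
t = threading.Thread(target=listener, args=(worker_thread_id,), daemon=True)
t.start()
do_request_work()
t.join(timeout=5)

print(dumps[0][0])
```

One caveat to keep in mind: PostgreSQL delivers notifications sent inside a transaction only when that transaction commits, so the dump shows where the thread is when the event arrives, which may be later than the statement that fired the trigger.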
If you use psycopg2, you can use this
# Overwriting connection.notices via Django
import logging
import traceback
from django.apps import AppConfig
from django.db.backends.signals import connection_created

logger = logging.getLogger(__name__)

class MyAppConfig(AppConfig):
    def ready(self):
        connection_created.connect(connection_created_check_for_notice_in_connection)

class ConnectionNoticeList(object):
    def append(self, message):
        # psycopg2 appends each server notice here; ignore unrelated ones.
        if 'some_magic_of_db_trigger' not in message:
            return
        logger.warning('%s %s' % (message, ''.join(traceback.format_stack())))

def connection_created_check_for_notice_in_connection(sender, connection, **kwargs):
    connection.connection.notices = ConnectionNoticeList()
I'm experiencing some strange bugs which seem to be caused by connections used by SQLAlchemy, which I can't pin down exactly. I was hoping someone has a clue what's going on here.
We're working on a Pyramid (version 1.5b1) project and use SQLAlchemy (version 0.9.6) for all our database connectivity. Sometimes we get errors related to the db connection or session. Most of the time this is a "cursor already closed" or "This Connection is closed" error, but we get other related exceptions too:
(OperationalError) connection pointer is NULL
(InterfaceError) cursor already closed
Parent instance <...> is not bound to a Session, and no contextual session is established; lazy load operation of attribute '...' cannot proceed
A conflicting state is already present in the identity map for key (<class '...'>, (1001L,))
This Connection is closed (original cause: ResourceClosedError: This Connection is closed)
(InterfaceError) cursor already closed
Parent instance <...> is not bound to a Session; lazy load operation of attribute '...' cannot proceed
Parent instance <...> is not bound to a Session, and no contextual session is established; lazy load operation of attribute '...' cannot proceed
'NoneType' object has no attribute 'twophase'
(OperationalError) connection pointer is NULL
This session is in 'prepared' state; no further
There is no silver bullet to reproduce them; only by refreshing many times is one bound to happen at some point. So I made a script using multi-mechanize to spam different URLs concurrently and see where and when it happens.
It appears the URL triggered doesn't really matter; the errors happen when there are concurrent requests that span a longer time (and other requests get served in between). This seems to indicate there is some kind of threading problem: either the session or the connection is shared among different threads.
After googling for these issues I found a lot of topics. Most of them say to use scoped sessions, but the thing is we already do:
db_session = scoped_session(sessionmaker(extension=ZopeTransactionExtension(), autocommit=False, autoflush=False))
db_meta = MetaData()
We have a BaseModel for all our orm objects:
BaseModel = declarative_base(cls=BaseModelObj, metaclass=BaseMeta, metadata=db_meta)
We use the pyramid_tm tween to handle transactions during the request
We hook db_session.remove() to the Pyramid NewResponse event (which is fired after everything has run). I also tried putting it in a separate tween running after pyramid_tm, or even not doing it at all. None of these seem to have any effect, so the response event seemed like the cleanest place to put it.
We create the engine in our main entrypoint of our pyramid project and use a NullPool and leave connection pooling to pgbouncer. We also configure the session and the bind for our BaseModel here:
engine = engine_from_config(config.registry.settings, 'sqlalchemy.', poolclass=NullPool)
db_session.configure(bind=engine, query_cls=FilterQuery)
BaseModel.metadata.bind = engine
config.add_subscriber(cleanup_db_session, NewResponse)
return config.make_wsgi_app()
In our app we access all db operation using:
from project.db import db_session
...
db_session.query(MyModel).filter(...)
db_session.execute(...)
We use psycopg2==2.5.2 to handle the connection to postgres with pgbouncer in between
I made sure no references to db_session or connections are saved anywhere (which could result in other threads reusing them)
I also tried the spamming test using different webservers. Using waitress and cogen I got the errors very easily; using wsgiref (which is single-threaded) we unsurprisingly have no errors. Using uwsgi and gunicorn (4 workers, gevent) I didn't get any errors.
Given the differences between the webservers, I thought it might have to do with some webservers handling requests in threads and some using new processes (maybe a forking problem)? To complicate matters even more, as time went on and I did some new tests, the problem had gone away in waitress but now happened with gunicorn (when using gevent)! I have no clue how to go about debugging this...
Finally, to test what happens to the connection, I attached an attribute to the connection at the start of the cursor execute and tried to read the attribute back at the end of the execute:
@event.listens_for(Engine, "before_cursor_execute")
def _before_cursor_execute(conn, cursor, stmt, params, context, execmany):
    conn.pdtb_start_timer = time.time()

@event.listens_for(Engine, "after_cursor_execute")
def _after_cursor_execute(conn, cursor, stmt, params, context, execmany):
    print(conn.pdtb_start_timer)
Surprisingly this sometimes raised an exception: 'Connection' object has no attribute 'pdtb_start_timer'
This struck me as very strange. I found one discussion about something similar: https://groups.google.com/d/msg/sqlalchemy/GQZSjHAGkWM/rDflJvuyWnEJ
So I tried adding strategy='threadlocal' to the engine, which from what I understand should force one connection per thread. But it didn't have any effect on the errors I'm seeing (besides some unit tests failing, because I need two different sessions/connections for some tests and this forces a single connection to be associated).
Does anyone have any idea what might go on here or have some more pointers on how to attack this problem?
Thanks in advance!
Matthijs Blaas
Update: The errors were caused by multiple commands being sent in one prepared SQL statement. Psycopg2 seems to allow this, but apparently it can cause strange issues. The PG8000 connector is stricter and bailed out on the multiple commands. Sending one command per statement fixed the issue!
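The "one command per statement" fix can be illustrated with the standard library's sqlite3, which, like the stricter driver mentioned above, refuses several commands in a single execute() call. This is an analogy only, not the original psycopg2/pg8000 setup; the table and the naive split on ";" are assumptions for the demo (a real splitter has to respect string literals).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
script = "CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1); INSERT INTO t VALUES (2)"

# A strict driver rejects several commands in a single execute() call.
# The exception type differs between Python versions, so accept both.
try:
    conn.execute(script)
    rejected = False
except (sqlite3.Warning, sqlite3.ProgrammingError):
    rejected = True

# Sending one command per execute() call works.
for statement in script.split(";"):  # naive split, fine for this fixed script
    if statement.strip():
        conn.execute(statement)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(rejected, count)
```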
I have a Flask app on GAE, and it is working correctly. I am trying to add Appstats support, but once I enable it, I get a deadlock.
This deadlock apparently happens when I try to set up a werkzeug LocalProxy with the logged-in user's ndb model (it is called current_user, like in Flask-Login, to give you more details).
The error is:
RuntimeError: Deadlock waiting for <Future 104c02f50 created by get_async(key.py:545) for tasklet get(context.py:612) suspended generator get(context.py:645); pending>
The LocalProxy object is setup using this syntax (as per Werkzeug doc):
current_user = LocalProxy(lambda: _get_user())
And _get_user() makes a simple synchronous ndb query.
Thanks in advance for any help.
I ran into this issue today. In my case it seems that the request to get a user's details is triggering appstats. Appstats then goes through the call stack and stores details of all the local variables in each stack frame.
The session itself is in one of these stack frames, so appstats tries to print it out and triggers the user fetching code again.
Came up with 2 "solutions", though neither of them are great.
Disable appstats altogether.
Disable logging of local variables in appstats.
I've gone for the latter at the moment. appstats allows you to configure various settings in your appengine_config.py file. I was able to avoid logging of local variable details (which stops the code from triggering the bug) by adding this:
appstats_MAX_LOCALS = 0
I've run into a strange situation. I'm writing some test cases for my program. The program is written to work on SQLite or PostgreSQL depending on preferences. Now I'm writing my test code using unittest. Very basically, what I'm doing is:
def setUp(self):
    """
    Reset the database before each test.
    """
    if os.path.exists(root_storage):
        shutil.rmtree(root_storage)
    reset_database()
    initialize_startup()
    self.project_service = ProjectService()
    self.structure_helper = FilesHelper()
    user = model.User("test_user", "test_pass", "test_mail@tvb.org",
                      True, "user")
    self.test_user = dao.store_entity(user)
In the setUp I remove any folders that exist(created by some tests) then I reset my database (drop tables cascade basically) then I initialize the database again and create some services that will be used for testing.
def tearDown(self):
    """
    Remove project folders and clean up database.
    """
    created_projects = dao.get_projects_for_user(self.test_user.id)
    for project in created_projects:
        self.structure_helper.remove_project_structure(project.name)
    reset_database()
Tear down does the same thing except creating the services, because this test module is part of the same suite with other modules and I don't want things to be left behind by some tests.
Now all my tests run fine with SQLite. With PostgreSQL I'm running into a very weird situation: at some point in the execution, which actually differs from run to run by a small margin (e.g. one or two extra calls), the program just halts. I mean no error is generated, no exception is thrown, the program just stops.
The only thing I can think of is that somehow I leave a connection open somewhere, and after a while it times out and something happens. But I have A LOT of connections, so before I start going through all that code, I would appreciate some suggestions/opinions.
What could cause this kind of behaviour? Where to start looking?
Regards,
Bogdan
PostgreSQL based applications freeze because PG locks tables fairly aggressively, in particular it will not allow a DROP command to continue if any connections are open in a pending transaction, which have accessed that table in any way (SELECT included).
If you're on a unix system, the command "ps -ef | grep 'post'" will show you all the Postgresql processes and you'll see the status of current commands, including your hung "DROP TABLE" or whatever it is that's freezing. You can also see it if you select from the pg_stat_activity view.
So the key is to ensure that no pending transactions remain. At the DBAPI level this means that any result cursors are closed, and any connection that is currently open has rollback() called on it, or is otherwise explicitly closed. In SQLAlchemy, this means any result sets (i.e. ResultProxy) with pending rows are fully exhausted and any Connection objects have been close()d, which returns them to the pool and calls rollback() on the underlying DBAPI connection. You'd want some kind of unconditional teardown code which makes sure this happens before any DROP TABLE type of command is emitted.
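One way to get that kind of unconditional teardown is to route every connection through a helper that remembers it. The sketch below uses the standard library's sqlite3 and a temporary file purely so it can run anywhere. SQLite's locking differs from PostgreSQL's, so this shows only the bookkeeping pattern, not the hang itself. DatabaseTestCase, connect(), and the project table are hypothetical names.

```python
import os
import sqlite3
import tempfile
import unittest


class DatabaseTestCase(unittest.TestCase):
    def setUp(self):
        fd, self.db_path = tempfile.mkstemp(suffix=".db")
        os.close(fd)
        self._open_connections = []
        admin = self.connect()
        admin.execute("CREATE TABLE project (name TEXT)")
        admin.commit()

    def connect(self):
        """Open a connection and remember it so tearDown can always close it."""
        conn = sqlite3.connect(self.db_path)
        self._open_connections.append(conn)
        return conn

    def tearDown(self):
        # Unconditionally end every transaction before any DROP is attempted.
        for conn in self._open_connections:
            conn.rollback()
            conn.close()
        cleanup = sqlite3.connect(self.db_path)
        cleanup.execute("DROP TABLE project")
        cleanup.commit()
        cleanup.close()
        os.remove(self.db_path)

    def test_leaves_transaction_open(self):
        conn = self.connect()
        conn.execute("INSERT INTO project VALUES ('demo')")  # never committed
        self.assertTrue(conn.in_transaction)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DatabaseTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The test deliberately leaves a transaction open; the teardown still rolls it back and closes the connection before the table is dropped.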
As far as "I have A LOT of connections", you should get that under control. When the SQLA test suite runs through its 3000-something tests, we make sure we're absolutely in control of connections, and typically only one connection is opened at a time (still, running on PyPy has some behaviors that cause hangs with PG... it's tough). There's a pool class called AssertionPool you can use for this, which ensures only one connection is ever checked out at a time; otherwise an informative error is raised (it shows where the connection was checked out).
One solution I found to this problem was to call db.session.close() before any attempt to call db.drop_all(). This will close the connection before dropping the tables, preventing Postgres from locking the tables.
See a much more in-depth discussion of the problem here.
I'm using psycopg2 for the cherrypy app I'm currently working on, and the CLI & phpPgAdmin to handle some operations manually. Here's the Python code:
#One connection per thread
cherrypy.thread_data.pgconn = psycopg2.connect("...")
...
#Later, an object is created by a thread :
class dbobj(object):
    def __init__(self):
        self.connection=cherrypy.thread_data.pgconn
        self.curs=self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
...
#Then,
try:
    blabla
    self.curs.execute(...)
    self.connection.commit()
except:
    self.connection.rollback()
    lalala
...
#Finally, the destructor is called :
def __del__(self):
    self.curs.close()
I'm having a problem with either psycopg or postgres (although I think the latter is more likely). After having sent a few queries, my connections drop dead. Similarly, phpPgAdmin usually gets dropped as well; it prompts me to reconnect after I have made several requests. Only the CLI remains persistent.
The problem is, these happen very randomly and I can't even track down what the cause is. I can either get locked down after a few page requests or never really encounter anything after having requested hundreds of pages. The only error I've found in postgres log, after terminating the app is :
...
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
...
I thought of creating a new connection every time a new dbobj instance is created but I absolutely don't want to do this.
Also, I've read that one may run into similar problems unless all transactions are committed: I use the try/except block for every single INSERT/UPDATE query, but I never use it for SELECT queries, nor do I want to write even more boilerplate code (btw, do they need to be committed?). Even if that's the case, why would phpPgAdmin close down?
max_connections is set to 100 in the .conf file, so I don't think that's the reason either. A single cherrypy worker has only 10 threads.
Does anyone have an idea where I should look first ?
Psycopg2 needs a commit or rollback after every transaction, including SELECT queries, or it leaves the connections "IDLE IN TRANSACTION". This is now a warning in the docs:
Warning: By default, any query execution, including a simple SELECT will start a transaction: for long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat...). For long lived scripts, either ensure to terminate a transaction as soon as possible or use an autocommit connection.
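To end every transaction without repeating try/except around each query, a small context manager is one option. This is a sketch, not code from psycopg2's docs: the transaction() helper is a hypothetical name, and sqlite3 stands in for the real driver so the example is self-contained. It relies only on the cursor(), commit(), and rollback() methods that DB-API connections share.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Yield a cursor and always end the transaction: commit on success, roll back on error."""
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


conn = sqlite3.connect(":memory:")

with transaction(conn) as cur:
    cur.execute("CREATE TABLE page (title TEXT)")
    cur.execute("INSERT INTO page VALUES ('home')")

# Reads go through the same helper, so they are ended as well.
with transaction(conn) as cur:
    cur.execute("SELECT title FROM page")
    titles = [row[0] for row in cur.fetchall()]

# A failure inside the block rolls back and re-raises.
try:
    with transaction(conn) as cur:
        cur.execute("INSERT INTO page VALUES ('draft')")
        raise ValueError("simulated failure")
except ValueError:
    pass

remaining = conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]
print(titles, remaining)  # ['home'] 1
```

With psycopg2 the same helper would be handed the per-thread connection, so that SELECT-only code paths no longer leave the session idle in transaction.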
It's a bit difficult to see exactly where you're populating and accessing cherrypy.thread_data. I'd recommend investigating psycopg2.pool.ThreadedConnectionPool instead of trying to bind one conn to each thread yourself.
Even though I have no idea why successful SELECT queries block the connection, adding .commit() after pretty much every query that doesn't have to work in conjunction with another solved the problem.