Problem
I am working on a long-running Python process that performs a lot of database access (mostly reads, occasional writes). Sometimes it may be necessary to terminate the process before it finishes (e.g. by using the kill command), and when this happens I would like to log a value to the database indicating that the particular run was canceled. (I am also logging the occurrence to a log file; I would like to have the information in both places.)
I have found that if I interrupt the process while the database connection is active, the connection becomes unusable; specifically, it hangs the process if I try to use it in any way.
Minimum working example
The actual application is rather large and complex, but this snippet reproduces the problem reliably.
The table test in the database has two columns, id (serial) and message (text). I prepopulated it with one row so the UPDATE statement below would have something to change.
import psycopg2
import sys
import signal

pg_host = 'localhost'
pg_user = 'redacted'
pg_password = 'redacted'
pg_database = 'test_db'

def write_message(msg):
    print "Writing: " + msg
    cur.execute("UPDATE test SET message = %s WHERE id = 1", (msg,))
    conn.commit()

def signal_handler(signal, frame):
    write_message('Interrupted!')
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

if __name__ == '__main__':
    conn = psycopg2.connect(host=pg_host, user=pg_user, password=pg_password, database=pg_database)
    cur = conn.cursor()
    write_message("Starting")
    for i in xrange(10000):
        # I press ^C somewhere in here
        cur.execute("SELECT * FROM test")
        cur.fetchall()
    write_message("Finishing")
When I run this script without interruption, it completes as expected. That is, the row in the database is updated to say "Starting" then "Finishing".
If I press Ctrl-C during the loop indicated by the comment, Python hangs indefinitely. It no longer responds to keyboard input, and the process has to be killed from elsewhere. Looking at my PostgreSQL log, the UPDATE statement with "Interrupted!" never reaches the database server.
If I add a debugging breakpoint at the beginning of signal_handler() I can see that doing almost anything with the database connection at that point causes the same hang. Trying to execute a SELECT, issuing a conn.rollback(), conn.commit(), conn.close() or conn.reset() all cause the hang. Executing conn.cancel() does not cause a hang, but it doesn't improve the situation; subsequent use of the connection still causes a hang. If I remove the database access from write_message() then the script is able to exit gracefully when interrupted, so the hang is definitely database connection related.
Also worth noting: if I change the script so that I am interrupting something other than database activity, it works as desired, logging "Interrupted!" to the database. E.g., if I replace the for i in xrange(10000) loop with a simple sleep(10) and interrupt that, it works fine. So the problem seems to be specifically related to interrupting psycopg2 with a signal while it is performing database access, then trying to use the connection.
Questions
Is there any way to salvage the existing psycopg2 connection and use it to update the database after this kind of interruption?
If not, is there at least a way to terminate it cleanly so if some subsequent code tried to use it, it wouldn't cause a hang?
Finally, is this somehow expected behavior, or is it a bug that should be reported? It makes sense to me that the connection could be in a bad state after this kind of interruption, but ideally it would throw an exception indicating the problem rather than hanging.
Workaround
In the meantime, I have discovered that if I create an entirely new connection with psycopg2.connect() after the interrupt and am careful not to access the old one, I can still update the database from the interrupted process. This is probably what I'll do for now, but it feels untidy.
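For reference, a minimal sketch of that workaround, assuming the same connection parameters as in the example above; the key point is that the handler never touches the old conn or cur:

def signal_handler(signal, frame):
    # Leave the interrupted connection alone; open a fresh one just for the final update.
    new_conn = psycopg2.connect(host=pg_host, user=pg_user,
                                password=pg_password, database=pg_database)
    new_cur = new_conn.cursor()
    new_cur.execute("UPDATE test SET message = %s WHERE id = 1", ('Interrupted!',))
    new_conn.commit()
    new_conn.close()
    sys.exit(0)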
Environment
OS X 10.11.6
python 2.7.11
psycopg2 2.6.1
postgresql 9.5.1.0
I filed an issue for this on the psycopg2 GitHub repository and received a helpful response from the developer. In summary:
The behavior of an existing connection within a signal handler is OS dependent and there's probably no way to use the old connection reliably; creating a new one is the recommended solution.
Using psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) improves the situation a bit (at least in my environment) by causing execute() statements called from within the signal handler to throw an exception rather than hang. However, doing other things with the connection (e.g. reset()) still caused a hang for me, so ultimately it's still best to just create a new connection within the signal handler rather than trying to salvage the existing one.
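For completeness, registering that wait callback is a one-liner done once at startup, before the connection is created (it only makes the failure mode nicer; it is not a full fix):

import psycopg2.extensions
import psycopg2.extras

# With this callback installed, an execute() interrupted by a signal raises an
# exception instead of hanging the process.
psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)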
Related
I'm playing around with SQLAlchemy core in Python, and I've read over the documentation numerous times and still need clarification about engine.execute() vs connection.execute().
As I understand it, engine.execute() is the same as doing connection.execute(), followed by connection.close().
The tutorials I've followed led me to use this in my code:
Initial setup in script
import sys

import sqlalchemy as db
from sqlalchemy import exc

try:
    engine = db.create_engine("postgres://user:pass@ip/dbname", connect_args={'connect_timeout': 5})
    connection = engine.connect()
    metadata = db.MetaData()
except exc.OperationalError:
    print_error(f":: Could not connect to {db_ip}!")
    sys.exit()
Then, I have functions that handle my database access, for example:
def add_user(a_username):
    query = db.insert(table_users).values(username=a_username)
    connection.execute(query)
Am I supposed to be calling connection.close() before my script ends? Or is that handled efficiently enough by itself? Would I be better off closing the connection at the end of add_user(), or is that inefficient?
If I do need to be calling connection.close() before the script ends, does that mean interrupting the script will cause hanging connections on my Postgres DB?
I found this post helpful to better understand the different interaction paradigms in sqlalchemy, in case you haven't read it yet.
Regarding your question as to when to close your db connection: it is indeed very inefficient to create and close connections for every statement execution. However, you should make sure that your application does not have connection leaks in its global flow.
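One way to keep the global flow leak-free without paying for a reconnect on every statement is to check a connection out only for the duration of each unit of work. A minimal sketch, assuming the same engine and table_users objects defined above:

def add_user(a_username):
    # engine.begin() borrows a connection from the engine's pool, wraps the block
    # in a transaction, commits on success (rolls back on error), and returns the
    # connection to the pool when the block exits.
    with engine.begin() as conn:
        conn.execute(db.insert(table_users).values(username=a_username))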
I am trying to find a bug which happens from time to time on our production server, but could not be reproduced otherwise: some value in the DB gets changed in a way which I don't want it to.
I could write a PostgreSQL trigger which fires if this bug happens, and raise an exception from said trigger. I would see the Python traceback which executes the unwanted SQL statement.
But in this case I don't want to stop the processing of the request.
Is there a way to log the Python/Django traceback from within a PostgreSQL trigger?
I know that this is not trivial since the DB code runs under a different Linux process with a different user id.
I am using Python, Django, PostgreSQL, Linux.
I guess this is not easy since the DB trigger runs in a different context than the python interpreter.
Please ask if you need further information.
Update
One solution might be to overwrite connection.notices of psycopg2.
Is there a way to log the Python/Django traceback from within a PostgreSQL trigger?
No, there is not.
The (SQL) query is executed on the DBMS server, and so is the code inside the trigger.
The Python code is executed on the client, which is a different process, possibly run by a different user, and maybe even on a different machine.
The only connection between the server (which detects the condition) and the client (which needs to produce the stack dump) is the connected socket. You could try to extend the server's reply (if there is one) with some status code, which the client then uses to dump its own stack. This will only work if the trigger is part of the current transaction, not of some unrelated process.
The other way is massive logging: make the DBMS write every submitted SQL statement to its log file. This can produce huge amounts of log entries, which you then have to inspect.
Given this setup
(django/python) -[SQL connection]-> (PostgreSQL server)
your intuition that
I guess this is not easy since the DB trigger runs in a different context than the python interpreter.
is correct. At least, we won't be able to do this exactly the way you want it; not without much acrobatics.
However, there are options, each with drawbacks:
If you are using django with SQLAlchemy, you can register event listeners (either ORM events or Core events) that detect the bad SQL statement you are hunting and log a traceback (a sketch of this appears further below).
Write a wrapper around your SQL driver, check for the bad SQL statement you are hunting, and log the traceback every time it's detected.
Give every SQL transaction, or every django request, an ID (could just be some UUID in werkzeug's request-bound storage manager). From here, we gain more options:
Configure the logger to log this request ID everywhere, and log all SQL statements in SQLAlchemy. This lets you correlate Django requests, and specific function invocations, with SQL statements. You can do this with echo= in SQLAlchemy.
Include this request ID in every SQL statement (extra column?), then log this ID in the PostgreSQL trigger with RAISE NOTICE. This lets you correlate client-side activity in django against server-side activity in PostgreSQL.
In the spirit of "Test in Production" espoused by Charity Majors, send every request to a sandbox copy of your Django app that reads/writes a sandboxed copy of your production database. In the sandbox database, raise the exception and log your traceback.
You can take this idea further and create smaller "async" setups. For example, you can, for each request, trigger an async duplicate (say, with celery) of the same request that hits a DB configured with your PostgreSQL trigger to fail and log the traceback.
Use RAISE EXCEPTION in the PostgreSQL trigger to rollback the current transaction. In Python, catch that specific exception, log it, then repeat the transaction, changing the data slightly (extra column?) to indicate that this is a retry and the trigger should not fail.
Is there a reason you can't SELECT all row values into Python, then do the detection in Python entirely?
So if you're able to detect the condition after the queries execute, then you can log the condition and/or throw an exception.
Then what you need is tooling like Sentry or New Relic.
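As an illustration of the event-listener option above, here is a sketch using a SQLAlchemy Core event hook; the engine URL and the substring used to detect the "bad" statement are made-up placeholders:

import logging
import traceback

from sqlalchemy import create_engine, event

logger = logging.getLogger(__name__)
engine = create_engine("postgresql://user:pass@localhost/dbname")  # placeholder URL

@event.listens_for(engine, "before_cursor_execute")
def log_suspicious_sql(conn, cursor, statement, parameters, context, executemany):
    # Hypothetical detection rule; replace with whatever identifies the unwanted statement.
    if "UPDATE some_table SET some_value" in statement:
        logger.warning("Suspicious SQL: %s\nparams: %r\n%s",
                       statement, parameters, "".join(traceback.format_stack()))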
You could use LISTEN+NOTIFY.
First, have a daemon thread LISTEN on a channel, and execute a NOTIFY from the db trigger.
The daemon thread receives the notify event and can dump the stack trace of the main thread.
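A rough sketch of that approach with psycopg2 (Python 3; the channel name bug_detected is an assumption, and the trigger would have to issue a NOTIFY on that same channel):

import select
import sys
import threading
import traceback

import psycopg2
import psycopg2.extensions

def listen_for_trigger_notifications(dsn, main_thread_id):
    conn = psycopg2.connect(dsn)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN bug_detected;")  # assumed channel name
    while True:
        # Block until the connection's socket becomes readable, then collect notifications.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            # Dump the main thread's stack as it was when the NOTIFY arrived.
            frame = sys._current_frames()[main_thread_id]
            print("Trigger fired on %s: %s\n%s" % (
                notify.channel, notify.payload, "".join(traceback.format_stack(frame))))

# Usage: start the listener as a daemon thread early in the process, e.g.
# t = threading.Thread(target=listen_for_trigger_notifications,
#                      args=("dbname=mydb user=me", threading.main_thread().ident),
#                      daemon=True)
# t.start()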
If you use psycopg2, you can use this:
# Overwriting connection.notices via Django
import logging
import traceback

from django.apps import AppConfig
from django.db.backends.signals import connection_created

logger = logging.getLogger(__name__)

class MyAppConfig(AppConfig):
    def ready(self):
        connection_created.connect(connection_created_check_for_notice_in_connection)

class ConnectionNoticeList(object):
    def append(self, message):
        if 'some_magic_of_db_trigger' not in message:
            return
        logger.warn('%s %s' % (message, ''.join(traceback.format_stack())))

def connection_created_check_for_notice_in_connection(sender, connection, **kwargs):
    connection.connection.notices = ConnectionNoticeList()
I want to stop executing SQL statement if it takes too long to run.
To achieve this I hacked django.db.backends.oracle.base. In FormatStylePlaceholderCursor.execute and executemany, instead of:
return self.cursor.execute(query, self._param_generator(params))
I do:
return timelimited(TIMEOUT, self.cursor.execute, query, self._param_generator(params))
And timelimited is a function from this recipe: http://code.activestate.com/recipes/576780-timeout-for-nearly-any-callable/. It wraps a function (i.e. cursor.execute) in a separate thread and waits up to TIMEOUT. If the function doesn't return in time, the thread is stopped.
With this modification, the application I'm running starts throwing ORA-01000 (maximum open cursors exceeded) after a short period of time. I'm wondering why wrapping cursor.execute causes this problem, how to fix it, and what other solutions to this problem are available.
I'm not familiar with Django nor Python. I can tell you what OCI drivers offer to users.
You must close the query handle (or whatever its name is in Python). Otherwise you're leaking resources on the database side.
If the query is still active, you can interrupt it using the OCIBreak call. This one is thread-safe and can be called from any thread, regardless of what the background thread is doing with the connection.
Try to check, whether Python drivers for Oracle do allow you to call OCIBreak and OCIReset
This is what you need: Connection.cancel()
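A hedged sketch of that idea with cx_Oracle: a watchdog timer calls connection.cancel() from another thread if a statement runs too long, which makes the blocked execute() raise in the calling thread instead of leaking a cursor per timeout. The timeout value and exception handling here are illustrative:

import threading

import cx_Oracle

TIMEOUT_SECONDS = 10  # illustrative value

def execute_with_timeout(connection, cursor, sql, params=None):
    # Connection.cancel() may be called from another thread and interrupts the
    # statement currently running on that connection.
    watchdog = threading.Timer(TIMEOUT_SECONDS, connection.cancel)
    watchdog.start()
    try:
        return cursor.execute(sql, params or [])
    except cx_Oracle.DatabaseError:
        # A cancelled statement typically surfaces as ORA-01013
        # ("user requested cancel of current operation").
        raise
    finally:
        watchdog.cancel()  # no-op if the statement finished in time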
I've run into a strange situation. I'm writing some test cases for my program. The program is written to work on SQLite or PostgreSQL depending on preferences. Now I'm writing my test code using unittest. Very basically what I'm doing:
def setUp(self):
    """
    Reset the database before each test.
    """
    if os.path.exists(root_storage):
        shutil.rmtree(root_storage)
    reset_database()
    initialize_startup()

    self.project_service = ProjectService()
    self.structure_helper = FilesHelper()
    user = model.User("test_user", "test_pass", "test_mail@tvb.org",
                      True, "user")
    self.test_user = dao.store_entity(user)
In the setUp I remove any folders that exist (created by some tests), then I reset my database (basically drop tables cascade), then I initialize the database again and create some services that will be used for testing.
def tearDown(self):
    """
    Remove project folders and clean up database.
    """
    created_projects = dao.get_projects_for_user(self.test_user.id)
    for project in created_projects:
        self.structure_helper.remove_project_structure(project.name)
    reset_database()
tearDown does the same thing except for creating the services, because this test module is part of the same suite as other modules and I don't want things to be left behind by some tests.
Now all my tests run fine with SQLite. With PostgreSQL I'm running into a very weird situation: at some point in the execution, which actually differs from run to run by a small margin (e.g. one or two extra calls), the program just halts. I mean no error is generated, no exception is thrown, the program just stops.
Now the only thing I can think of is that somehow I leave a connection open somewhere and after a while it times out and something happens. But I have A LOT of connections, so before I start going through all that code, I would appreciate some suggestions/opinions.
What could cause this kind of behaviour? Where to start looking?
PostgreSQL-based applications can freeze because PG locks tables fairly aggressively; in particular, it will not allow a DROP command to continue while any open connection has a pending transaction that has accessed that table in any way (SELECT included).
If you're on a unix system, the command "ps -ef | grep 'post'" will show you all the Postgresql processes and you'll see the status of current commands, including your hung "DROP TABLE" or whatever it is that's freezing. You can also see it if you select from the pg_stat_activity view.
So the key is to ensure that no pending transactions remain. At the DBAPI level this means that any result cursors are closed, and any connection that is currently open has rollback() called on it, or is otherwise explicitly closed. In SQLAlchemy, this means any result sets (i.e. ResultProxy) with pending rows are fully exhausted and any Connection objects have been close()d, which returns them to the pool and calls rollback() on the underlying DBAPI connection. You'd want to make sure there is some kind of unconditional teardown code which makes sure this happens before any DROP TABLE type of command is emitted.
As far as "I have A LOT of connections", you should get that under control. When the SQLA test suite runs through its 3000 something tests, we make sure we're absolutely in control of connections and typically only one connection is opened at a time (still, running on Pypy has some behaviors that still cause hangs with PG..its tough). There's a pool class called AssertionPool you can use for this which ensures only one connection is ever checked out at a time else an informative error is raised (shows where it was checked out).
One solution I found to this problem was to call db.session.close() before any attempt to call db.drop_all(). This will close the connection before dropping the tables, preventing Postgres from locking the tables.
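In a unittest-style teardown that could look roughly like this (assuming a Flask-SQLAlchemy-style db object; adapt to your own session/engine setup):

def tearDown(self):
    # Close the session first so no connection is left holding an open transaction
    # that would block the DROP TABLE statements issued by drop_all().
    db.session.close()
    db.drop_all()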
See a much more in-depth discussion of the problem here.
I'm using psycopg2 for the CherryPy app I'm currently working on, and the CLI & phpPgAdmin to handle some operations manually. Here's the Python code:
# One connection per thread
cherrypy.thread_data.pgconn = psycopg2.connect("...")
...

# Later, an object is created by a thread:
class dbobj(object):
    def __init__(self):
        self.connection = cherrypy.thread_data.pgconn
        self.curs = self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
...

# Then,
try:
    blabla
    self.curs.execute(...)
    self.connection.commit()
except:
    self.connection.rollback()
    lalala
...

# Finally, the destructor is called:
def __del__(self):
    self.curs.close()
I'm having a problem with either psycopg or postgres (although I think the latter is more likely). After having sent a few queries, my connections drop dead. Similarly, phpPgAdmin usually gets dropped as well; it prompts me to reconnect after I've made several requests. Only the CLI remains persistent.
The problem is, these happen very randomly and I can't even track down what the cause is. I can either get locked down after a few page requests or never really encounter anything after having requested hundreds of pages. The only error I've found in the postgres log, after terminating the app, is:
...
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
...
I thought of creating a new connection every time a new dbobj instance is created but I absolutely don't want to do this.
Also, I've read that one may run into similar problems unless all transactions are committed: I use the try/except block for every single INSERT/UPDATE query, but I never use it for SELECT queries, nor do I want to write even more boilerplate code (btw, do they need to be committed?). Even if that's the case, why would phpPgAdmin close down?
max_connections is set to 100 in the .conf file, so I don't think that's the reason either. A single CherryPy worker has only 10 threads.
Does anyone have an idea where I should look first?
Psycopg2 needs a commit or rollback after every transaction, including SELECT queries, or it leaves the connections "IDLE IN TRANSACTION". This is now a warning in the docs:
Warning: By default, any query execution, including a simple SELECT will start a transaction: for long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat...). For long lived scripts, either ensure to terminate a transaction as soon as possible or use an autocommit connection.
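As a quick illustration of the autocommit option mentioned in that warning (connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test_db user=redacted")  # placeholder DSN
conn.autocommit = True  # every statement commits immediately; no idle-in-transaction session

cur = conn.cursor()
cur.execute("SELECT * FROM test")
rows = cur.fetchall()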
It's a bit difficult to see exactly where you're populating and accessing cherrypy.thread_data. I'd recommend investigating psycopg2.pool.ThreadedConnectionPool instead of trying to bind one conn to each thread yourself.
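A small sketch of what using that pool could look like (pool sizes and the DSN are illustrative):

import psycopg2
import psycopg2.pool

# One pool shared by the whole app; each unit of work borrows a connection and returns it.
pool = psycopg2.pool.ThreadedConnectionPool(minconn=1, maxconn=10,
                                            dsn="dbname=test_db user=redacted")

def run_query(sql, params=None):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall()
        conn.commit()  # end the transaction so the connection is not left idle in transaction
        return rows
    finally:
        pool.putconn(conn)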
Even though I don't have any idea why successful SELECT queries would block the connection, adding .commit() after pretty much every single query that doesn't have to work in conjunction with another solved the problem.