Query execution hanging in specific circumstances - python

The problem
For a while now I've been running into a bug where a data retrieval query hangs during execution. If that were all, debugging would be manageable, but the bug is not easy to reproduce:
It only occurs on my Linux laptop (Manjaro XFCE); there are no problems on my Windows PC
It primarily occurs at a few specific timestamps (mostly 4:05)
Even then it doesn't appear consistently
I know how this can be fixed (by prepending the query with a SELECT 1;), but I don't understand why the problem occurs or why my fix works, which is where I'm stuck. I haven't seen any other questions that describe this specific issue.
Code
The query in question is below. It selects a range of measurements and then averages them per timestep (interpolating where necessary) to produce a range of averages.
SELECT datetime, AVG(wc) as wc
FROM (
SELECT public.time_bucket_gapfill('5 minutes', m.datetime)
AS datetime, public.interpolate(AVG(m.wc)) as wc
FROM growficient.measurement AS m
INNER JOIN growficient.placement AS p ON m.placement_id = p.id
WHERE m.datetime >= '2022-09-30T22:00:00+00:00'
AND m.datetime < '2022-10-01T04:05:00+00:00'
AND p.section_id = 'bd5114b8-4aab-11eb-af66-32bd66d4e25c'
GROUP BY public.time_bucket_gapfill('5 minutes', m.datetime), p.id
) AS placement_averages
GROUP BY datetime
ORDER BY datetime;
This is then executed via SQLAlchemy at the session level. When the bug appears, execution never reaches the fetchall().
execute_result = session.execute(query)
readings = execute_result.fetchall()
We're using session management very similar to what's shown in the SQLAlchemy documentation. This is meant to be a debug session, however, so no commit statements are included.
sessionMaker = sessionmaker(
    autocommit=False,
    autoflush=False,
    bind=create_engine(
        config.get_settings().main_db,
        echo=False,
        connect_args=connect_options,
        pool_pre_ping=True,
    ),
)

@contextlib.contextmanager
def managed_session() -> Session:
    session = sessionMaker()
    try:
        yield session
    except Exception as e:
        session.rollback()
        logger.error("Session error: %s", e)
        raise
    finally:
        session.close()
Observations
I can see the hanging transaction if I execute select * from pg_catalog.pg_stat_activity psa
Printing the identical query and executing it directly inside the database (e.g. via DBeaver) correctly returns the results
None of the timeouts mentioned in the Postgres documentation do anything to break out of the hang
Prepending a SELECT 1; statement works, but setting pool_pre_ping=True on the engine doesn't, which confuses me since, to my understanding, they do the same thing (see the sketch below).
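For reference, a minimal sketch of how I apply the workaround, assuming query holds the SQL string shown above and the extra statement is issued separately on the same connection just before it; why this unblocks the main query is exactly what I don't understand:
from sqlalchemy import text

with managed_session() as session:
    # Workaround: a trivial round-trip on the same connection first.
    session.execute(text("SELECT 1;"))
    execute_result = session.execute(text(query))
    readings = execute_result.fetchall()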

Related

Multi-file Python project, one file unable to connect to SQL Server after a while

I have a multi-file Python Project, of which many of the files make connections to an Azure SQL Database. The project works fine but, for some reason, one of the files stops being able to connect to the database after a while of the application running, and I can see no reason as to why; especially when other connection attempts work fine.
The connection string, for all the connections (so across all the files), is defined as follows:
SQLServer = os.getenv('SQL_SERVER')
SQLDatabase = os.getenv('SQL_DATABASE')
SQLLogin = os.getenv('SQL_LOGIN')
SQLPassword = os.getenv('SQL_PASSWORD')
SQLConnString = 'Driver={ODBC Driver 17 for SQL Server};Server=' + SQLServer + ';Database=' + SQLDatabase + ';UID='+ SQLLogin +';PWD=' + SQLPassword
sqlConn = pyodbc.connect(SQLConnString,timeout=20)
And the function I am calling when the error happens is below:
def iscaptain(guild,user):
    userRoles = user.roles
    roleParam = ""
    for role in userRoles:
        roleParam = roleParam + "," + str(role.id)
    cursor = sqlConn.cursor()
    roleParam = roleParam[1:]
    cursor.execute('EXEC corgi.GetUserAccess ?, ?;', guild.id, roleParam)
    for row in cursor:
        if row[1] == "Team Captain":
            cursor.close()
            return True
    cursor.close()
    return False
The error specifically happens at cursor.execute. I currently get the error
pyodbc.OperationalError: ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x68 (104) (SQLExecDirectW)')
Previously I didn't have the timeout in the connection on the specific file that was having a problem, and I did get a different error:
Communication link failure
Apologies, I don't have the full previous error.
Other connections, in other files in the same project, work fine, so the problem is not a network issue; if it were, none of the connections would work. The problem only happens in one file, where all the connection attempts fail.
Googling the latest error really doesn't get me far. For example, there's a GitHub issue that goes nowhere, and this question isn't related since connecting works fine from other files.
Note, as well, that this happens after a period of time; I don't know exactly how long that period is, but it's certainly hours. Restarting the project fixes the issue as well; the above function will then work fine. That isn't really a solution, though; I can't keep restarting the application ad hoc.
The error is immediate as well; it's like Python/PyODBC isn't trying to connect. When stepping into the cursor.execute the error is generated straight after; it's not like when you get a timeout and you'll be waiting a few seconds, or more, for the timeout to occur.
I'm at a loss here. Why is the file (and only that one) unable to connect any more later on? There are no locks on the database either, so it's not like I have a transaction left hanging; though in that case I would expect a timeout error again, as the procedure would be unable to gain a lock on the data.
Note, as well, that if I manually execute the procedure, in sqlcmd/SSMS/ADS, data is returned fine as well, so the Procedure does work fine. And, again, if I restart the application it'll work without issue for many hours.
Edit: I attempted the answer from Sabik below; however, this only broke the application, unfortunately. The solution they provided had the parameter self on the function validate_conn, so calling validate_conn() failed as I don't have an argument for this "self". The method they said to use, just validate_conn, didn't do anything; it doesn't call the function (which I expected). Removing the parameter, and the references to self, also broke the application, stating that sqlConn wasn't declared even though it was; see the image below where you can clearly see that sqlConn has a value:
Yet immediately after that line I get the error below:
UnboundLocalError: local variable 'sqlConn' referenced before assignment
So something appears to be wrong with their code, but I don't know what.
One possibility being discussed in the comments is that it's a 30-minute idle timeout on the database end, in which case one solution would be to record the time the connection has been opened, then reconnect if it's been more than 25 minutes.
This would be a method like:
def validate_conn(self):
    if self.sqlConn is None or datetime.datetime.now() > self.conn_expiry:
        try:
            self.sqlConn.close()
        except:  # pylint: disable=broad-except
            # suppress all exceptions; we're in any case about to reconnect,
            # which will either resolve the situation or raise its own error
            pass
        self.sqlConn = pyodbc.connect(...)
        self.conn_expiry = datetime.datetime.now() + datetime.timedelta(minutes=25)
(Adjust as appropriate if sqlConn is a global.)
At the beginning of each function which uses sqlConn, call validate_conn first, then use the connection freely.
Note: this is one of the rare situations in which suppressing all exceptions is reasonable; we're in any case about to reconnect to the database, which will either resolve the situation satisfactorily, or raise its own error.
Edit: If sqlConn is a global, it will need to be declared as such in the function:
def validate_conn():
    global sqlConn, conn_expiry
    if sqlConn is None or datetime.datetime.now() > conn_expiry:
        try:
            sqlConn.close()
        except:  # pylint: disable=broad-except
            # suppress all exceptions; we're in any case about to reconnect,
            # which will either resolve the situation or raise its own error
            pass
        sqlConn = pyodbc.connect(...)
        conn_expiry = datetime.datetime.now() + datetime.timedelta(minutes=25)
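For illustration, a rough sketch of the intended call pattern in the question's iscaptain function, assuming the global sqlConn and SQLConnString from the question and the with/any style suggested just below; this is a sketch, not the asker's final code:
def iscaptain(guild, user):
    validate_conn()  # reconnect first if the global connection has gone stale
    roleParam = ",".join(str(role.id) for role in user.roles)
    with sqlConn.cursor() as cursor:
        cursor.execute('EXEC corgi.GetUserAccess ?, ?;', guild.id, roleParam)
        return any(row[1] == "Team Captain" for row in cursor)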
As an unrelated style note, a shorter way to write the function would be using (a) a with statement and (b) the any() function, like this:
with sqlConn.cursor() as cursor:
    roleParam = roleParam[1:]
    cursor.execute('EXEC corgi.GetUserAccess ?, ?;', guild.id, roleParam)
    return any(row[1] == "Team Captain" for row in cursor)
The with statement has the advantage that the cursor is guaranteed to be closed regardless of how the code is exited, even if there's an unexpected exception or if a later modification adds more branches.
Although the solution from Sabik didn't work for me, the answer did push me in the right direction to find a solution; specifically, the use of the with clauses.
Instead of having a long-lived connection, as I have been informed I had, I've now changed these to short-lived connections which I open with a with statement, and I've also changed the cursor to use a with as well. So, for the iscaptain function, I now have code that looks like this:
def iscaptain(guild,user):
    userRoles = user.roles
    roleParam = ""
    for role in userRoles:
        roleParam = roleParam + "," + str(role.id)
    #sqlConn = pyodbc.connect(SQLConnString,timeout=20)
    with pyodbc.connect(SQLConnString,timeout=20) as sqlConn:
        with sqlConn.cursor() as cursor:
            roleParam = roleParam[1:]
            cursor.execute('EXEC corgi.GetUserAccess ?, ?;', guild.id, roleParam)
            return any(row[1] == "Team Captain" for row in cursor)
    return False
It did appear that Azure was closing the connections after a period of time, so when the old connection was reused it failed to connect. As both hosts are in Azure, however (the server running the Python application and the SQL Database), I am happy to reconnect as needed, since speed should not be a major issue; certainly it hasn't been during the last 48 hours of testing.
This does mean I have a lot of code to refactor, but for the stability it's a must.

Why don't simultaneous updates to the same record in sqlalchemy fail?

(Sorry in advance for the long question. I tried to break it up into sections to make it clearer what I'm asking. Please let me know if I should add anything else or reorganize it at all.)
Background:
I'm writing a web crawler that uses a producer/consumer model with jobs (pages to crawl or re-crawl) stored in a postgresql database table called crawler_table. I'm using SQLAlchemy to access and make changes to the database table. The exact schema is not important for this question. The important thing is that I (will) have multiple consumers, each of which repeatedly selects a record from the table, loads the page with phantomjs, and then writes information about the page back to the record.
It can happen on occasion that two consumers select the same job. This is not itself a problem; however, it is important that if they update the record with their results simultaneously, that they make consistent changes. It's good enough for me to just find out if an update would cause the record to become inconsistent. If so, I can deal with it.
Investigation:
I initially assumed that if two transactions in separate sessions read then updated the same record simultaneously, the second one to commit would fail. To test that assumption, I ran the following code (simplified slightly):
SQLAlchemySession = sessionmaker(bind=create_engine(my_postgresql_uri))

class Session(object):
    # A simple wrapper for use with the `with` statement
    def __enter__(self):
        self.session = SQLAlchemySession()
        return self.session

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            self.session.rollback()
        else:
            self.session.commit()
        self.session.close()

with Session() as session:  # Create a record to play with
    if session.query(CrawlerPage) \
              .filter(CrawlerPage.url == 'url').count() == 0:
        session.add(CrawlerPage(website='website', url='url',
                                first_seen=datetime.utcnow()))
    page = session.query(CrawlerPage) \
                  .filter(CrawlerPage.url == 'url') \
                  .one()
    page.failed_count = 0
# commit

# Actual experiment:
with Session() as session:
    page = session.query(CrawlerPage) \
                  .filter(CrawlerPage.url == 'url') \
                  .one()
    print 'initial (session)', page.failed_count
    # 0 (expected)
    page.failed_count += 5
    with Session() as other_session:
        same_page = other_session.query(CrawlerPage) \
                                 .filter(CrawlerPage.url == 'url') \
                                 .one()
        print 'initial (other_session)', same_page.failed_count
        # 0 (expected)
        same_page.failed_count += 10
        print 'final (other_session)', same_page.failed_count
        # 10 (expected)
    # commit other_session, no errors (expected)
    print 'final (session)', page.failed_count
    # 5 (expected)
# commit session, no errors (why?)

with Session() as session:
    page = session.query(CrawlerPage) \
                  .filter(CrawlerPage.url == 'url') \
                  .one()
    print 'final value', page.failed_count
    # 5 (expected, given that there were no errors)
(Apparently Incorrect) Expectations:
I would have expected that reading a value from a record then updating that value within the same transaction would:
Be an atomic operation. That is, either succeed completely or fail completely. This much appears to be true, since the final value is 5, the value set in the last transaction to be committed.
Fail if the record being updated is updated by a concurrent session (other_session) upon attempting to commit the transaction. My rationale is that all transactions should behave as though they are performed independently in order of commit whenever possible, or should fail to commit. In these circumstances, the two transactions read then update the same value of the same record. In a version-control system, this would be the equivalent of a merge conflict. Obviously databases are not the same as version-control systems, but they have enough similarities to inform some of my assumptions about them, for better or worse.
Questions:
Why doesn't the second commit raise an exception?
Am I misunderstanding something about how SQLAlchemy handles transactions?
Am I misunderstanding something about how postgresql handles transactions? (This one seems most likely to me.)
Something else?
Is there a way to get the second commit to raise an exception?
PostgreSQL has SELECT ... FOR UPDATE, which SQLAlchemy seems to support.
My rationale is that all transactions should behave as though they are performed independently in order of commit whenever possible, or should fail to commit.
Well, in general there's a lot more to transactions than that. PostgreSQL's default transaction isolation level is "read committed". Loosely speaking, that means multiple transactions can simultaneously read committed values from the same rows in a table. If you want to prevent that, you can set the transaction isolation level to serializable (which might not work for your case), use SELECT ... FOR UPDATE, lock the table, or use a column-by-column WHERE clause, among other options.
You can test and demonstrate transaction behavior by opening two psql connections.
-- session 1                            -- session 2
begin transaction;                      begin transaction;

select *
from test
where pid = 1
and date = '2014-10-01'
for update;

(1 row)
                                        select *
                                        from test
                                        where pid = 1
                                        and date = '2014-10-01'
                                        for update;

                                        (waiting)
update test
set date = '2014-10-31'
where pid = 1
and date = '2014-10-01';

commit;
                                        -- Locks released. SELECT for update fails.
                                        (0 rows)
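On the SQLAlchemy side, a rough sketch of the same idea, assuming the CrawlerPage model and Session wrapper from the question and a reasonably recent SQLAlchemy where Query.with_for_update() is available; it emits the SELECT ... FOR UPDATE, and how a conflict is then handled is up to the application:
with Session() as session:
    # Lock the row for the rest of this transaction; a concurrent
    # transaction issuing the same query will block until commit/rollback.
    page = session.query(CrawlerPage) \
                  .filter(CrawlerPage.url == 'url') \
                  .with_for_update() \
                  .one()
    page.failed_count += 5
# exiting the with block commits and releases the lock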

psycopg2 occasionally returns null

So I'm using psycopg2, and I have a simple table:
CREATE TABLE IF NOT EXISTS feed_cache (
feed_id int REFERENCES feeds(id) UNIQUE,
feed_cache text NOT NULL,
expire_date timestamp --without time zone
);
I'm calling the following method and query:
@staticmethod
def get_feed_cache(conn, feed_id):
    c = conn.cursor()
    try:
        sql = 'SELECT feed_cache FROM feed_cache WHERE feed_id=%s AND localtimestamp <= expire_date;'
        c.execute(sql, (feed_id,))
        result = c.fetchone()
        if result:
            conn.commit()
            return result[0]
        else:
            print 'DBSELECT.get_feed_cache: %s' % result
            print 'sql: %s' % (c.mogrify(sql, (feed_id,)))
    except:
        conn.rollback()
        raise
    finally:
        c.close()
    return None
I've added the else statement to output the exact sql and result that is being executed and returned.
The get_feed_cache() method is called from a database connection thread pool. When the get_feed_cache() method is called "slowishly" (~1/sec or less) the result is returned as expected, however when called concurrently it will occasionally return None. I have tried multiple ways of writing this query & method.
Some observations:
If I remove 'AND localtimestamp <= expire_date' from the query, the query ALWAYS returns a result.
Executing the query rapidly in serial in psql always returns a result.
After reading about the fetch*() methods of psycopg's cursor class, which note that results are cached for the cursor, I'm assuming that the cache is not shared between different cursors. http://initd.org/psycopg/docs/faq.html#best-practices
I have tried using postgresql's now() and current_timestamp functions with the same results. (I am aware of the timezone aspect of now() & current_timestamp)
Conditions to note:
There will NEVER be a case where there is not a feed_cache value for a provided feed_id.
There will NEVER be a case where any value in the feed_cache table is NULL
While testing I have completely disabled any & all writes to this table
I have set the expire_date to be sufficiently far in the future for all values such that the expression 'AND localtimestamp <= expire_date' will always be true.
Here is a copy & pasted output of it returning None:
DBSELECT.get_feed_cache: None
sql: SELECT feed_cache FROM feed_cache WHERE feed_id=5 AND localtimestamp < expire_date;
Well that's pretty much it, I'm not sure what's going on. Maybe I'm making some really dumb mistake and I just don't notice it! My current guess is that it has something to do with psycopg2 and perhaps the way it's caching results between cursors. If the cursors DO share the cache and the queries happen near-simultaneously then it could be possible that the first cursor fetches the result, the second cursor sees there is a cache of the same query, so it does not execute, then the first cursor closes and deletes the cache and the second cursor tries to fetch a now null/None cache.*
That said, psycopg2 states that it's thread-safe for read-only queries, so unless I'm misinterpreting their implementation of thread-safe, this shouldn't be the case.
Thank you for your time!
*After adding a thread lock for the get_feed_cache, acquiring before creating the cursor and releasing before returning, I still occasionally get a None result
I think this might have to do with the fact that the time stamps returned by localtimestamp or current_timestamp are fixed when the transaction starts, not when you run the statement. And psycopg manages the transactions behind your back to some degree. So you might be getting a slightly older time stamp.
You could debug this by setting log_statement = all in your server and then observing when the BEGIN statements are executed relative to your queries.
You might want to look into using a function such as clock_timestamp(), which returns the actual wall-clock time at the moment it is called rather than a value fixed at transaction start. See http://www.postgresql.org/docs/current/static/functions-datetime.html.
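For illustration, a rough sketch of the lookup from get_feed_cache with that substitution; get_feed_cache_clock is a hypothetical variant, not code from the question, and the timezone handling is a point to check since clock_timestamp() returns a timestamp with time zone while expire_date is without one:
def get_feed_cache_clock(conn, feed_id):
    # Same lookup as get_feed_cache above, but compared against
    # clock_timestamp() instead of localtimestamp.
    sql = ('SELECT feed_cache FROM feed_cache '
           'WHERE feed_id=%s AND clock_timestamp() <= expire_date;')
    c = conn.cursor()
    try:
        c.execute(sql, (feed_id,))
        row = c.fetchone()
        return row[0] if row else None
    finally:
        c.close()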

sqlalchemy caching some queries

I have this running on a live website. When a user logs in, I query his profile to see how many "credits" he has available. Credits are purchased via PayPal. If a person buys credits and the payment comes through, the query still shows 0 credits, even though running the same query in phpMyAdmin returns the right result. If I restart the Apache web server and reload the page, the right number of credits is shown. Here's my mapper code which computes the number of credits each user has:
mapper(User, users_table, order_by='user.date_added DESC, user.id DESC', properties={
    'userCreditsCount': column_property(
        select(
            [func.ifnull(func.sum(orders_table.c.quantity), 0)],
            orders_table.c.user_id == users_table.c.id
        ).where(and_(
            orders_table.c.date_added > get_order_expire_limit(),  # order must not be older than a month
            orders_table.c.status == STATUS_COMPLETED
        )).label('userCreditsCount'),
        deferred=True
    )
    # other properties....
})
I'm using sqlalchemy with flask framework but not using their flask-sqlalchemy package (just pure sqlalchemy)
Here's how I initiate my database:
engine = create_engine( config.DATABASE_URI, pool_recycle = True )
metadata = MetaData()
db_session = scoped_session( sessionmaker( bind = engine, autoflush = True, autocommit = False ) )
I learned both python and sqlalchemy on this project so I may be missing things but this one is driving me nuts. Any ideas?
When you work with a Session, as soon as it starts working with a connection, it holds onto that connection until commit(), rollback() or close() is called. With the DBAPI, the connection to the database also remains in a transaction until the transaction is committed or rolled back.
In this case, when you've loaded data into your session, SQLAlchemy doesn't refresh the data until the transaction is ended (or unless you explicitly expire some part of the data with expire()). This is the natural behavior, since due to transaction isolation it's very likely that the current transaction cannot see changes that have occurred since that transaction started in any case.
So while using expire() or refresh() may or may not be part of how to get the latest data into your Session, really you need to end your transaction and start a new one to truly see what's been changed elsewhere since that transaction started. You should organize your application so that a particular Session() is ready to go when a new request comes in, but when that request completes, the Session() should be closed out, and a new one (or at least a new transaction) started on the next request.
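As a rough sketch of that per-request lifecycle with the scoped_session from the question, assuming a Flask app object is available (the teardown hook is Flask's teardown_appcontext; remove() closes out the request's session and therefore its transaction):
@app.teardown_appcontext
def shutdown_session(exception=None):
    # End this request's session/transaction so the next request starts a
    # fresh transaction and can see rows committed in the meantime
    # (e.g. by the PayPal payment handler).
    db_session.remove()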
Please try to call refresh or expire on your object before accessing the field userCreditsCount:
user1 = session.query(User).get(1)
# ...
session.refresh(user1, ('userCreditsCount',))
This will make the query execute again (when refresh is called).
However, depending on the isolation mode your transaction uses, it might not resolve the problem, in which case you might need to commit/rollback the transaction (session) in order for the query to give you new result.
Lifespan of a Contextual Session
I'd make sure you're closing the session when you're done with it.
session = db_session()
try:
    return session.query(User).get(5)
finally:
    session.close()
Set sessionmaker's autocommit to True and see if that helps; according to the documentation, the Session uses
the identity map pattern, and stores objects keyed to their primary key. However, it doesn’t do any kind of query caching.
so in your code it would become:
sessionmaker(bind = engine, autoflush = True, autocommit = True)

SQLAlchemy autocommiting?

I have an issue with SQLAlchemy apparently committing. A rough sketch of my code:
trans = self.conn.begin()
try:
    assert not self.conn.execute(my_obj.__table__.select(my_obj.id == id)).first()
    self.conn.execute(my_obj.__table__.insert().values(id=id))
    assert not self.conn.execute(my_obj.__table__.select(my_obj.id == id)).first()
except:
    trans.rollback()
    raise
I don't commit, and the second assert always fails! In other words, it seems the data is getting inserted into the database even though the code is within a transaction! Is this assessment accurate?
You're right that the changes aren't getting committed to the DB. But they are auto-flushed by SQLAlchemy when you perform a query; in your case the flush is performed on the lines with the asserts. So if you don't explicitly call commit, you will never see these changes in the DB as real data. However, you will get them back as long as you use the same conn object.
You can pass autoflush=False to the Session constructor to disable this behavior.
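As a minimal sketch of where that flag goes, assuming an ORM Session is used and an engine already exists (this is illustrative, not the asker's code):
from sqlalchemy.orm import sessionmaker

# With autoflush disabled, pending changes are only sent to the database
# when flush() or commit() is called, not implicitly before each query.
Session = sessionmaker(bind=engine, autoflush=False)
session = Session()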
