I periodically get connection issues to PostgreSQL - either "FATAL: remaining connection slots are reserved for non-replication superuser connections" or "QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30" depending on whether psycopg or Pyramid is raising the exception. I have established that the transaction manager is properly installed, so it's frustrating not to know why I am still running out of connections.
I know the connection data is in pg_stat_activity, but that's a single snapshot. Is there any way of seeing connections over time, so I can see what is actually running over a period (ideally from before it's an issue up until the time the issue requires an application restart)?
The first step is properly identifying all of the queries running at a point in time. For that I used this query:
SELECT (SELECT COUNT(1) FROM pg_stat_activity) AS total_connections,
       (SELECT COUNT(1) FROM pg_stat_activity
        WHERE current_query IN ('<IDLE>', '<IDLE> in transaction')) AS idle_connections,
       current_query
FROM pg_stat_activity
WHERE current_query NOT IN ('<IDLE>', '<IDLE> in transaction')
  AND NOT procpid = pg_backend_pid();
NOTE! "current_query" is simply called "query" in later versions of PostgreSQL (from 9.2 on), and "procpid" is likewise renamed to "pid".
This strips out all idle database connections (seeing IDLE connections is not going to help you fix it) and the "NOT procpid=pg_backend_pid()" bit excludes this query itself from showing up in the results (which would bloat your output considerably). You can also filter by datname if you want to isolate a particular database.
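On 9.2 and later, a rough equivalent using the renamed columns and the new "state" column would be:

SELECT (SELECT COUNT(1) FROM pg_stat_activity) AS total_connections,
       (SELECT COUNT(1) FROM pg_stat_activity
        WHERE state LIKE 'idle%') AS idle_connections,
       query
FROM pg_stat_activity
WHERE state NOT LIKE 'idle%'
  AND pid <> pg_backend_pid();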
I needed these results in a form that was easy to query, so I stored them in a table on the database itself. This should work:
CREATE TABLE connection_audit
(
snapshot timestamp without time zone NOT NULL DEFAULT now(),
total_connections integer,
idle_connections integer,
query text
)
WITH (
OIDS=FALSE
);
This will store the current timestamp in "snapshot", the total and idle connections and the query itself.
I wrote a script to insert the top query into the table and saved that into a file called "pg_connections.sql".
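For reference, the contents of pg_connections.sql could be a sketch like this, wrapping the query above in an INSERT (assuming the pre-9.2 column names; the "snapshot" column fills itself via its DEFAULT):

INSERT INTO connection_audit (total_connections, idle_connections, query)
SELECT (SELECT COUNT(1) FROM pg_stat_activity),
       (SELECT COUNT(1) FROM pg_stat_activity
        WHERE current_query IN ('<IDLE>', '<IDLE> in transaction')),
       current_query
FROM pg_stat_activity
WHERE current_query NOT IN ('<IDLE>', '<IDLE> in transaction')
  AND NOT procpid = pg_backend_pid();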
I ran a script to insert these results into the table every second:
while true ; do psql -U user -d database_name -f 'pg_connections.sql' >> connections.log ; sleep 1; done
What this is effectively doing is writing all CURRENTLY executing queries to the table.
Tailing the connections.log file showed me whether the script was running as expected (but it isn't really required). Obviously, running a script like this every second can be taxing on a system, but it's a short-term measure when you don't have any other way of finding this information out, so it should be worth it. Run this script for as long as you need to accumulate sufficient data and hopefully it will hit pay dirt.
I'm trying to refactor some code and have come up with this
import pyodbc

def get_inpatients():
    """
    Get all the inpatients currently sitting in A&E.
    """
    cnxn = pyodbc.connect(f'DRIVER={DB_DRIVER};SERVER={DB_SERVER};DATABASE={DB_NAME};UID={DB_USER};PWD={DB_PASS}')
    cursor = cnxn.cursor()
    cursor.execute('EXEC spGetInpatients')
    row = cursor.fetchone()
    while row is not None:
        yield row[0]
        row = cursor.fetchone()
In the main file I then do this:
for nhs_number in get_inpatients():
    ...  # this then goes and grabs details from several APIs,
         # meaning it will be a few seconds for each loop
My question is whether a generator is a good choice here. I previously had it so that the function would return a list. Thinking about it now, would this then mean the connection is open for as long as the for loop is running in the main file, in which case I am better off returning a list?
Yes, the connection will remain open. Whether that is a good idea depends on the circumstances. Normally it is a good idea to use the generator because it allows the processing in your application to run concurrently with the fetching of more rows by the database. It also reduces memory consumption and improves CPU cache efficiency in your application. When done right, it also reduces latency which is very user-visible.
But of course you could run into the maximum connection limit sooner. I'd argue that increasing the connection limit is better than artificially making your application perform worse.
Also note that you can have multiple cursors per connection. See for example
Max SQL connections with Python and pyodbc on a local database showing as 1
I am adding this answer for two reasons:
To point out that the cursor is an iterator.
To make it clearer that the "maximum connection limit" (as per the answer of @Homer512) is a client-side setting, not a server-side one, and that it defaults to 0 both for the database connection and for the queries.
So:
According to the pyodbc wiki, you can avoid the boilerplate code:
The fetchall() function returns all remaining rows in a list. Bear in mind those rows will all be stored in memory, so if there are a lot of rows, you may run out of memory. If you are going to process the rows one at a time, you can use the cursor itself as an iterator:
for row in cursor.execute("select user_id, user_name from users"):
    print(row.user_id, row.user_name)
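Applied to the question's function, a minimal sketch of that pattern (reusing the same assumed DB_* configuration names) could look like this:

import pyodbc

def get_inpatients():
    """Yield the NHS numbers of all inpatients currently sitting in A&E."""
    cnxn = pyodbc.connect(f'DRIVER={DB_DRIVER};SERVER={DB_SERVER};DATABASE={DB_NAME};UID={DB_USER};PWD={DB_PASS}')
    try:
        cursor = cnxn.cursor()
        # the cursor is an iterator, so no fetchone() loop is needed
        for row in cursor.execute('EXEC spGetInpatients'):
            yield row[0]
    finally:
        # runs when the generator is exhausted, closed, or garbage collected
        cnxn.close()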
The connection limit lies on the client side, not the server side.
The comment on that answer reads:
You should clarify what server scoped means. SQL Server has a remote query timeout value that refers to queries it issues over linked servers, not to queries issued by clients to it. I believe the query timeout is a client property, not a server property. The server runs the query indefinitely. There is such a thing as a query governor for addressing this issue, which is disabled by default.
Indeed, the docs verify:
This value applies to an outgoing connection initiated by the Database Engine as a remote query. This value has no effect on queries received by the Database Engine. A query will wait until it completes.
Regarding the question if it is safe to keep open a database connection for a long time, I found this old but relevant question which has an extended answer in favor of "yes, if you know what you are doing".
I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs
FATAL: sorry, too many clients already
despite being no-where close to its configured limit of connections. This query:
SELECT current_setting('max_connections') AS max,
COUNT(*) AS total
FROM pg_stat_activity
tells me that the database is configured for a maximum of 100 connections. I have never seen over about 45 connections into the database with this query, not even moments before a running program receives a database error saying too many clients backed by the above message in the Postgres logs.
Absolutely everything I can find on this issue on the Internet suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I have not.
For what it's worth, pyspark is the only database client that triggers this error, and only when it's writing into tables from dataframes. The regular python code using psycopg2 (that is the main client) never triggers it (not even when writing into tables in the same manner from Pandas dataframes), and admin tools like pgAdmin also never trigger it. If I didn't see the error in the database logs directly, I would think that Spark is lying to me about the error. Most of the time, if I use a query like this:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';
then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?
This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.
I couldn't find it in the docs, but I think by default the number of connections is equal to the number of partitions in the dataframe you want to write into the DB table. This explains the intermittency you've noticed.
However, you can control this number by setting numPartitions option:
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
Example:
(spark.read.format("jdbc")
    .option("numPartitions", "20")
    # ... remaining options (url, dbtable, credentials) and .load()
)
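Since the question is about writing dataframes, the same option applies on the write path. A sketch, where df and the url/dbtable values are assumed placeholders:

(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/mydb")  # assumed connection URL
    .option("dbtable", "target_table")                  # assumed target table
    .option("numPartitions", "20")  # caps concurrent JDBC connections on write
    .mode("append")
    .save())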
Three possibilities:
The connections are very short-lived, and they were already gone by the time you looked.
You have a lower connection limit on that database.
You have a lower connection limit on the database user.
But options 2 and 3 would result in a different error message, so it must be the short-lived connections.
Whatever it is, the answer to your problem would be a well-configured connection pool.
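For the plain-Python side, a minimal sketch of client-side pooling with psycopg2 (the connection parameters are placeholders; cap maxconn safely below the server's max_connections):

from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=20,
                              dbname="mydb", user="myuser",
                              password="secret", host="localhost")

conn = pool.getconn()
try:
    with conn, conn.cursor() as cur:  # commits the transaction on clean exit
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)  # return the connection so the slot is reused, not closed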
I am using Python with psycopg2 2.8.6 against PostgreSQL 11.6 (also tried on 11.9).
When I am running a query
CREATE TABLE tbl AS (SELECT (row_number() over())::integer "id", "col" FROM tbl2)
The code gets stuck (cursor.execute never returns); killing the backend with pg_terminate_backend removes the query from the server, but the client code is still not released. Yet in this case, the target table is created.
Nothing is locking the transaction. The inner SELECT query was tested on its own and works well.
I tried analysing clues on the server and found out the following inside pg_stat_activity:
Transaction state is idle in transaction
wait_event_type is Client
wait_event is ClientRead
The same effect happens when I run the query from within an SQL editor (pgModeler), but in that case the query is stuck in the Idle state and the target table is created.
I am not sure what is wrong and how to proceed from here.
Thanks!
I am answering my own question here, to make it helpful for others.
The problem was solved by modifying tcp_keepalives_idle Postgres setting from default 2 hours to 5 minutes.
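For reference, one way to apply that change without hand-editing postgresql.conf (PostgreSQL 9.4+) is:

ALTER SYSTEM SET tcp_keepalives_idle = 300;  -- seconds; 0 (the default) defers to the OS value, typically 2 hours
SELECT pg_reload_conf();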
The problem is not reproducible as described; you have to investigate more. Please share more details about your database table, your Python code and the server OS.
You could also attach strace to the Python process and share the output, so we can see what actually happens during the query.
wait_event_type = Client: The server process is waiting for some activity on a socket from user applications, and that the server expects something to happen that is independent from its internal processes. wait_event will identify the specific wait point.
wait_event = ClientRead: A session that waits for ClientRead is done processing the last query and waits for the client to send the next request. The only way that such a session can block anything is if its state is idle in transaction. All locks are held until the transaction ends, and no locks are held once the transaction finishes.
Idle in transaction: The activity can be idle (i.e., waiting for a client command), idle in transaction (waiting for client inside a BEGIN block), or a command type name such as SELECT. Also, waiting is appended if the server process is presently waiting on a lock held by another session.
The problem could be related to:
Network problems
Uncommitted transaction someplace that has created the same table name.
The transaction is not committed
You pointed out that it is not a commit problem because the SQL editor does the same, but in your question you specify that the editor successfully creates the table.
In pgModeler you see idle; that means the session is idle, not the query.
If the session is idle, the "query" column of pg_stat_activity shows the last executed statement in that session.
So this simply means all those sessions properly ended their transaction using a ROLLBACK statement.
If sessions remain in state idle in transaction for a longer time, that is always an application bug where the application is not ending the transaction.
You can do two things:
Set the idle_in_transaction_session_timeout so that these transactions are automatically rolled back by the server after a while (a one-line example follows this list). This will keep locks from being held indefinitely, but your application will receive an error.
Fix the application as shown below
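A one-liner sketch for the first option (the setting exists since PostgreSQL 9.6 and can also be set in postgresql.conf or per database/user):

ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();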
.commit() solution
The only way that I found to reproduce the problem is to omit the commit action.
The module psycopg2 is Python DB API-compliant, so the auto-commit feature is off by default.
With this option set to False you need to call conn.commit() to commit any pending transaction to the database.
Enable auto-commit
You can enable auto-commit as follows:
import psycopg2

connection = None
try:
    connection = psycopg2.connect("dbname='myDB' user='myUser' host='localhost' password='myPassword'")
    connection.autocommit = True
except psycopg2.OperationalError:
    print("Connection failed.")

if connection is not None:
    cursor = connection.cursor()
    try:
        cursor.execute("""CREATE TABLE tbl AS (SELECT (row_number() over())::integer "id", "col" FROM tbl2)""")
    except psycopg2.Error:
        print("Failed to create table.")
with statement
You can also use the with statement to auto-commit a transaction:
with connection, connection.cursor() as cursor:  # start a transaction and create a cursor
    cursor.execute("""CREATE TABLE tbl AS (SELECT (row_number() over())::integer "id", "col" FROM tbl2)""")
Traditional way
If you don't want to auto-commit the transaction you need to do it manually calling .commit() after your execute.
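A minimal sketch of the manual route, reusing the connection from the snippets above:

cursor = connection.cursor()
cursor.execute("""CREATE TABLE tbl AS (SELECT (row_number() over())::integer "id", "col" FROM tbl2)""")
connection.commit()  # ends the transaction; without this the session sits "idle in transaction"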
Just remove the ( ) around the SELECT...
https://www.postgresql.org/docs/11/sql-createtableas.html
I'm doing something along the lines of:
conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string, execution_options={'autocommit': True}, encoding='utf-8', isolation_level="AUTOCOMMIT")
rows = conn.execute(sql_query)
To run queries on a Redshift cluster. Lately, I've been doing maintenance tasks such as running vacuum reindex on large tables that get truncated and reloaded every day.
The problem is that that command above takes around 7 minutes for a particular table (the table is huge, 60 million rows across 15 columns) and when I run it using the method above it just never finishes and hangs. I can see in the cluster dashboard in AWS that parts of the vacuum command are being run for about 5 minutes and then it just stops. No python errors, no errors on the cluster, no nothing.
My guess is that the connection is lost during the command. So, how do I prove my theory? Anybody else with the issue? What do I change the connection string to keep it alive longer?
EDIT:
I changed my connection to this after the comments here:
conn = sqlalchemy.engine.create_engine(conn_string,
execution_options={'autocommit': True},
encoding='utf-8',
connect_args={"keepalives": 1, "keepalives_idle": 60,
"keepalives_interval": 60},
isolation_level="AUTOCOMMIT")
And it had been working for a while. However, the same behaviour started again on even larger tables, where the vacuum reindex takes around 45 minutes (at least that is my estimate; the command never finishes running in Python).
How can I make this work regardless of the query runtime?
It's most likely not a connection drop issue. To confirm this, try pushing a few million rows into a dummy table (something which takes more than 5 minutes) and see if the statement fails. Once a query has been submitted to Redshift, it keeps executing in the background regardless of whether your connection stays open.
Now, coming to the problem itself: my guess is that you are running out of memory or disk space. Can you please be more elaborate and list out your Redshift setup (how many nodes of dc1/ds2)? Also, try running some admin queries and see how much space you have left on the disk. Sometimes when the cluster is loaded to the brim, a disk full error is thrown, but in your case the connection might be dropped well before the error is surfaced to your Python shell.
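As one example of such an admin query, a commonly used disk-space check against the STV_PARTITIONS system table (a sketch; adjust to your needs):

SELECT host,
       ROUND(SUM(used)::numeric / SUM(capacity) * 100, 1) AS pct_disk_used
FROM stv_partitions
GROUP BY host
ORDER BY host;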
I'm running a few large UNLOAD queries from Redshift to S3 from a python script using SQLAlchemy. (along with the sqlalchemy-redshift package)
The first couple work, but the last, which runs the longest (~30 minutes), is marked Terminated in the Redshift Query Dashboard. Some data is loaded to S3 but I suspect it's not ALL of it.
I'm fairly confident the query itself works because I've used it to download locally in the past.
Does SQLAlchemy close queries that take too long? Is there a way to set or lengthen the query-timeout? The script itself continues as if nothing went wrong and the Redshift logs don't indicate a problem either but when a query is marked Terminated it usually means something external has killed the process.
There are two places where you can control timeouts in Redshift:
In the workload manager console, you get an option to specify timeout for each queue.
The ODBC/JDBC driver settings. Update your registry based on the steps in the link below:
http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-firewall-guidance.html
It turned out to be more an issue with sqlalchemy than AWS/Redshift.
SQLAlchemy does not implicitly "Commit Transactions" so if the connection is closed while uncommitted transactions are still open (even if the query itself appears to be finished), all transactions within that connection are marked Terminated.
Solution is to finish your connection or each transaction with "commit transaction;"
conn = engine.connect()
conn.execute("""SELECT .... """)
conn.execute("""COMMIT TRANSACTION""")
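Alternatively, with the same engine, SQLAlchemy's begin() context manager commits automatically when the block exits without an error (a sketch; the UNLOAD body is elided as in the question):

with engine.begin() as conn:  # opens a transaction, commits on clean exit
    conn.execute("""UNLOAD ('SELECT ...') TO 's3://...' ...""")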