Redshift + SQLAlchemy long query hangs

Redshift + SQLAlchemy long query hangs - python

I'm doing something among the lines of:
conn_string = "postgresql+pg8000://%s:%s#%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string,execution_options={'autocommit':True},encoding='utf-8',isolation_level="AUTOCOMMIT")
rows = cur.execute(sql_query)
To run queries on a Redshift cluster. Lately, I've been doing maintenance tasks such as running vacuum reindex on large tables that get truncated and reloaded every day.
The problem is that that command above takes around 7 minutes for a particular table (the table is huge, 60 million rows across 15 columns) and when I run it using the method above it just never finishes and hangs. I can see in the cluster dashboard in AWS that parts of the vacuum command are being run for about 5 minutes and then it just stops. No python errors, no errors on the cluster, no nothing.
My guess is that the connection is lost during the command. So, how do I prove my theory? Anybody else with the issue? What do I change the connection string to keep it alive longer?
EDIT:
I change my connection this after the comments here:
conn = sqlalchemy.engine.create_engine(conn_string,
execution_options={'autocommit': True},
encoding='utf-8',
connect_args={"keepalives": 1, "keepalives_idle": 60,
"keepalives_interval": 60},
isolation_level="AUTOCOMMIT")
And it has been working for a while. However, it decided to start with the same behaviour for even larger tables in which the vacuum reindex actually takes around 45 minutes (at least that is my estimate, the command never finishes running in Python).
How can I make this work regardless of the query runtime?

It's most likely not a connection drop issue. To confirm this , try pushing a few million rows into a dummy table (something which takes more than 5 minutes) and see if the statement fails. Once a query has been submitted to redshift , regardless of your connection string shutting the query executes in the background.
Now, coming to the problem itself - my guess is that you are running out of memory or disk space, can you please be more elaborate and list out your redshift setup (How many nodes of dc1/ds2) ? Also, try running some admin queries and see how much space you have left on the disk. Sometimes when the cluster is loaded to the brim a disk full error is thrown but in your case since the connection might be dropped much before the error is thrown to your python shell.

Related

Why does PostgreSQL say FATAL: sorry, too many clients already when I am nowhere close to the maximum connections?

I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs
FATAL: sorry, too many clients already
despite being no-where close to its configured limit of connections. This query:
SELECT current_setting('max_connections') AS max,
COUNT(*) AS total
FROM pg_stat_activity
tells me that the database is configured for a maximum of 100 connections. I have never seen over about 45 connections into the database with this query, not even moments before a running program receives a database error saying too many clients backed by the above message in the Postgres logs.
Absolutely everything I can find on issue on the Internet this suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I am not.
For what it's worth, pyspark is the only database client that triggers this error, and only when it's writing into tables from dataframes. The regular python code using psycopg2 (that is the main client) never triggers it (not even when writing into tables in the same manner from Pandas dataframes), and admin tools like pgAdmin also never trigger it. If I didn't see the error in the database logs directly, I would think that Spark is lying to me about the error. Most of the time, if I use a query like this:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';
then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?

This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.
I couldn't find it in the docs but I think by default the number of connections is equal to the number of partitions in the datafame you want to write into db table. This explains the intermittency you've noticed.
However, you can control this number by setting numPartitions option:
The maximum number of partitions that can be used for parallelism in
table reading and writing. This also determines the maximum number of
concurrent JDBC connections. If the number of partitions to write
exceeds this limit, we decrease it to this limit by calling
coalesce(numPartitions) before writing.
Example:
spark.read.format("jdbc") \
.option("numPartitions", "20") \
# ...

Three possibilities:
The connections are very short-lived, and they were already gone by the time you looked.
You have a lower connection limit on that database.
You have a lower connection limit on the database user.
But options 2 and 3 would result in a different error message, so it must be the short-lived connections.
Whatever it is, the answer to your problem would be a well-configured connection pool.

Can i do vagrant ssh in a parallel window for a same guest machine which also currently runs a job? Will it anyway disturb the current job?

As part of my homework I need to load large data files into two MySQL tables, parsed using Python, on my guest machine that is called via Vagrant SSH.
I also then need to run a Sqoop job on one of the 2 tables so now I'm up to the point where I loaded one of the tables successfully and ran the Python script to load the second table and it's been more than 3 hours and still loading.
I was wondering whether I could complete my Sqoop job on the already loaded table instead of staring at a black screen for almost 4 hours now.
My questions are:
Is there any other way to Vagrant SSH into the same machine without doing Vagrant reload (because --reload eventually shuts down my virtual machine thereby killing all the current jobs running on my guests).
If there is, then given that I open a parallel window to log in to the guest machine as usual and start working on my Sqoop job on the first table that already loaded; will it any way affect my current job with the second table that is still loading? Or will it have a data loss as I can't risk re-doing it as it is super large and extremely time-consuming.
python code goes like this
~~
def parser():
with open('1950-sample.txt', 'r', encoding='latin_1') as input:
for line in input:
....
Inserting into tables
def insert():
if (tablename == '1950_psr'):
cursor.execute("INSERT INTO 1950_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press)VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(USAF,WBAN,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press))
elif (tablename == '1986_psr'):
cursor.execute("INSERT INTO 1986_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press)VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(USAF,WBAN,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press))
parser()
Saving & closing
conn.commit()
conn.close()

I don't know what's in your login scripts, and I'm not clear what that --reload flag is, but in general you can have multiple ssh sessions to the same machine. Just open another terminal and ssh into the VM.
However, in your case, that's probably not a good idea. I suspect that the second table is taking long time to load because your database is reindexing or it's waiting on a lock to be released.
Unless you are loading hundreds of meg's, I suggest you first check for locks and see what queries are pending.
Even if you are loading very large dataset and there are no constraints on the table you need for you script, you are just going to pile up on a machine that's already taxed pretty heavily...

Redshift unload terminates when called by sqlalchemy

I'm running a few large UNLOAD queries from Redshift to S3 from a python script using SQLAlchemy. (along with the sqlalchemy-redshift package)
The first couple work but the last, which runs the longs (~30 minutes) is marked Terminated in the Redshift Query Dashboard. Some data is loaded to S3 but I suspect it's not ALL of it.
I'm fairly confident the query itself works because I've used it to download locally in the past.
Does SQLAlchemy close queries that take too long? Is there a way to set or lengthen the query-timeout? The script itself continues as if nothing went wrong and the Redshift logs don't indicate a problem either but when a query is marked Terminated it usually means something external has killed the process.

There are two places where you can control timeouts in Redshift:
In the workload manager console, you get an option to specify timeout for each queue.
The ODBC/ JDBC driver settings. Update your registry based on the steps in the link below,
http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-firewall-guidance.html

It turned out to be more an issue with sqlalchemy than AWS/Redshift.
SQLAlchemy does not implicitly "Commit Transactions" so if the connection is closed while uncommitted transactions are still open (even if the query itself appears to be finished), all transactions within that connection are marked Terminated.
Solution is to finish your connection or each transaction with "commit transaction;"
conn = engine.connect()
conn.execute("""SELECT .... """)
conn.execute("""COMMIT TRANSACTION""")

Tracking out of database connection issues with Python and PostgreSQL

I periodically get connection issues to PostgreSQL - either "FATAL: remaining connection slots are reserved for non-replication superuser connections" or "QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30" depending on whether psycopg or Pyramid is raising the exception. Having established that the transaction manager is properly installed, it's frustrating to not know why I am still running out of connections.
I know the connection data is in pg_stat_activity but it's a single snapshot. Is there any way of seeing connections over time so that I can see just what is actually running over a period of time (ideally from before it's an issue up until the time the issue requires an application restart)?

The first part is in properly identifying all of the queries running at a point in time. For that I used this query:
SELECT (SELECT COUNT(1) FROM pg_stat_activity) AS total_connections,
(SELECT COUNT(1) FROM pg_stat_activity
WHERE current_query in ('<IDLE>', '<IDLE> in transaction'))
AS idle_connections,
current_query
FROM pg_stat_activity
WHERE current_query NOT IN ('<IDLE>', '<IDLE> in transaction')
AND NOT procpid=pg_backend_pid();
NOTE! "current_query" is simply called "query" in later versions of postgresql (from 9.2 on)
This strips out all idle database connections (seeing IDLE connections is not going to help you fix it) and the "NOT procpid=pg_backend_pid()" bit excludes this query itself from showing up in the results (which would bloat your output considerably). You can also filter by datname if you want to isolate a particular database.
I needed these results in a way that was really easy to query them and so I used a table on the database. This should work:
CREATE TABLE connection_audit
(
snapshot timestamp without time zone NOT NULL DEFAULT now(),
total_connections integer,
idle_connections integer,
query text
)
WITH (
OIDS=FALSE
);
This will store the current timestamp in "snapshot", the total and idle connections and the query itself.
I wrote a script to insert the top query into the table and saved that into a file called "pg_connections.sql".
I ran a script to insert these results into the table every second:
while true ; do psql -U user -d database_name -f 'pg_connections.sql' >> connections.log ; sleep 1; done
What this is effectively doing is writing all CURRENTLY executing queries to the table.
Tailing the connections.log file showed me if the script was running as expected (but it isn't really required). Obviously, running a script like this every second can be taxing on a system but it's a short-term measure when you don't have any other way of finding this information out so it should be worth it. Run this script for as long as you need to accumulate sufficient data and hopefully it should pay dirt.

Query failing on first try of the day, succeeding on second try

Exact error I get is here:
{'trace': "(Error) ('08S01', '[08S01] [FreeTDS][SQL Server]Write to the server failed (20006) (SQLExecDirectW)')"}
I get this when I first run a query in my Pyramid application. Any query I run (In my case, it is a web search form that returns info from a database)
The entire application is read-only, as is the account used to connect to the db. I don't know what it would be writing that would fail. And like I said, if I re-run the exact same thing (or refresh the page) it runs just fine without error.
Edit: Emphasis on the "first try of the day". If no queries for x amount of time, I get this write error again, and then it'll work. It's almost like it's fallen asleep and that first query will wake it up.

I would guess that there's a pool of DB connections that is kept open for some time, T. The server, however, terminates open connections after some time, S, which is less than T.
The first connection of the day (or after S elapses in general) would give you this error.
Try to look for a way to change the "timeout" of the connections in the pool to be less than S and that should fix the problem.
Edit: These times (T and S) are dependent on configs or default values for the server and libraries you use. I've experienced a similar issue with a Flask+SQLAlchemy+MySQL app in the past and I had to change the connection timeouts, etc.
Edit 2: T might be "keep connections open forever" or a very high value

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Redshift + SQLAlchemy long query hangs - python

Related

Why does PostgreSQL say FATAL: sorry, too many clients already when I am nowhere close to the maximum connections?

Can i do vagrant ssh in a parallel window for a same guest machine which also currently runs a job? Will it anyway disturb the current job?

Redshift unload terminates when called by sqlalchemy

Tracking out of database connection issues with Python and PostgreSQL

Query failing on first try of the day, succeeding on second try

Categories

Resources