Query failing on first try of the day, succeeding on second try - python

Exact error I get is here:
{'trace': "(Error) ('08S01', '[08S01] [FreeTDS][SQL Server]Write to the server failed (20006) (SQLExecDirectW)')"}
I get this the first time I run a query in my Pyramid application, and it happens for any query I run (in my case, a web search form that returns info from a database).
The entire application is read-only, as is the account used to connect to the db. I don't know what it would be writing that would fail. And like I said, if I re-run the exact same thing (or refresh the page) it runs just fine without error.
Edit: Emphasis on the "first try of the day". If no queries for x amount of time, I get this write error again, and then it'll work. It's almost like it's fallen asleep and that first query will wake it up.

I would guess that there's a pool of DB connections that is kept open for some time, T. The server, however, terminates open connections after some time, S, which is less than T.
The first connection of the day (or after S elapses in general) would give you this error.
Try to look for a way to change the "timeout" of the connections in the pool to be less than S and that should fix the problem.
Edit: These times (T and S) depend on the configuration or default values of the server and libraries you use. I've experienced a similar issue with a Flask+SQLAlchemy+MySQL app in the past and had to change the connection timeouts, etc.
Edit 2: T might be "keep connections open forever" or a very high value
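A minimal sketch of what that tuning can look like, assuming the connections live in an SQLAlchemy pool (as in the Flask+SQLAlchemy setup mentioned above); the connection URL is hypothetical:

from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@my_freetds_dsn",  # hypothetical DSN
    pool_recycle=1800,   # reopen connections older than 30 minutes (pick a value below S)
    pool_pre_ping=True,  # test each pooled connection with a ping before handing it out
)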

Related

Why does PostgreSQL say FATAL: sorry, too many clients already when I am nowhere close to the maximum connections?

I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs
FATAL: sorry, too many clients already
despite being no-where close to its configured limit of connections. This query:
SELECT current_setting('max_connections') AS max,
       COUNT(*) AS total
FROM pg_stat_activity;
tells me that the database is configured for a maximum of 100 connections. I have never seen more than about 45 connections to the database with this query, not even moments before a running program receives a database error saying too many clients, backed by the above message in the Postgres logs.
Absolutely everything I can find about this issue on the Internet suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I have not.
For what it's worth, pyspark is the only database client that triggers this error, and only when it's writing into tables from dataframes. The regular python code using psycopg2 (that is the main client) never triggers it (not even when writing into tables in the same manner from Pandas dataframes), and admin tools like pgAdmin also never trigger it. If I didn't see the error in the database logs directly, I would think that Spark is lying to me about the error. Most of the time, if I use a query like this:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';
then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?
This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.
I couldn't find it in the docs, but I think by default the number of connections is equal to the number of partitions in the dataframe you want to write into the database table. This explains the intermittency you've noticed.
However, you can control this number by setting the numPartitions option:
The maximum number of partitions that can be used for parallelism in
table reading and writing. This also determines the maximum number of
concurrent JDBC connections. If the number of partitions to write
exceeds this limit, we decrease it to this limit by calling
coalesce(numPartitions) before writing.
Example:
(spark.read.format("jdbc")
    .option("numPartitions", "20")
    # ...
)
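For the write path (which is where the connection storm happens here), a hedged sketch with hypothetical connection settings; numPartitions caps the number of concurrent JDBC connections Spark opens, and partitions above that cap are coalesced before writing:

(df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")  # hypothetical URL
    .option("dbtable", "my_table")                         # hypothetical table
    .option("user", "writer")
    .option("password", "secret")
    .option("numPartitions", "8")
    .mode("append")
    .save())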
Three possibilities:
The connections are very short-lived, and they were already gone by the time you looked.
You have a lower connection limit on that database.
You have a lower connection limit on the database user.
But options 2 and 3 would result in a different error message, so it must be the short-lived connections.
Whatever it is, the answer to your problem would be a well-configured connection pool.
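To rule out the second and third possibilities explicitly, a hedged sketch (connection details hypothetical) that reads the per-database and per-role limits; -1 means no extra limit beyond max_connections:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("SELECT datconnlimit FROM pg_database WHERE datname = current_database()")
    print("database connection limit:", cur.fetchone()[0])
    cur.execute("SELECT rolconnlimit FROM pg_roles WHERE rolname = current_user")
    print("role connection limit:", cur.fetchone()[0])
conn.close()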

Python: handle any and all postgres timeout situations

I am struggling to find a solution to this.
We use Python to monitor our Postgres databases that are running on AWS RDS. This means we have extremely limited control over the server side.
If there is an issue and the server fails (hardware fault, network fault, you name it), our scripts just hang, sometimes for 8-10 minutes (non-SSL) and up to 15-20 minutes (SSL connections). By the time they recover and finally hit whatever timeout produces this seemingly random number of minutes, the server has failed over and everything works again.
Obviously, this renders our tools useless, if we can't catch these situations.
We basically run this (pseudo-code):
while True:
    try:
        query("select current_user")
    except:
        page("Database X failed")
For the basic use cases, this works just fine. E.g. if an instance is restarted, or something of the sort, no problems.
But, if there is an actual issue, the query just hangs. For minutes and minutes.
We've tried setting the statement_timeout on the psycopg2 connection. But that is a server setting, and if the instance fails and fails over, well, there is no server. So the client ends up waiting indefinitely or until it hits one of those arbitrary timeouts.
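For reference, a minimal sketch of what that amounts to (connection details hypothetical): the timeout is shipped to the server as a session setting at connect time, so it only helps while a server is alive to enforce it.

import psycopg2

conn = psycopg2.connect(
    host="mydb.example.rds.amazonaws.com",   # hypothetical endpoint
    dbname="postgres",
    user="monitor",
    password="secret",
    connect_timeout=5,                       # only bounds the initial connect/login step
    options="-c statement_timeout=10000",    # 10 s, enforced server-side
)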
I've looked into sockets and tested something like this:
import socket
import struct

pg.connect('user', 'db-name', instanceName='foo')
fd = pg.Connection.fileno()           # file descriptor of the underlying socket
s = socket.socket(fileno=fd)          # wrap the same fd in a Python socket object
print(s)
s.settimeout(0.0000001)
timeval = struct.pack('ll', 0, 1)     # 0 seconds, 1 microsecond
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)
data = pg.query("SELECT pg_sleep(10);")
In dumping the socket with the print(s) statement, I can clearly see that we've got the right socket.
But the timeouts I set do nothing whatsoever.
I've tried many values, and they have no effect. With the above, it should raise a timeout if more than 1 microsecond has elapsed. Ignoring common sense, I checked with tcpdump and made sure that we definitely do not get a response within 1 microsecond. Yet the thing just sits there and waits for pg_sleep(10) to complete.
Could someone shed some light on this?
It seems simple enough:
All I want is that ANY CALL made to postgres can NEVER TAKE LONGER than, say, 10 seconds. Regardless of what it is, regardless of what happens. If more than 10 seconds have elapsed, it needs to raise an exception.
From what I can see, the only way would be to use subprocess with a process-timeout. But we run threaded (we monitor hundreds of instances and spawn a persistent connection in a thread for each instance) and I've seen posts saying that this isn't reliable inside threads. It also seems silly to have each thread spawn yet another subprocess. Inception comes to mind. Where does it end?
But I digress. The question seems simple enough, yet my wall is showing clear signs of a large dent developing.
Greatly appreciate any insights
PS: Python 3.6 on Ubuntu LTS
Cheers
Stefan
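For illustration, a minimal sketch of the "bound the call from the client side" idea discussed in the question, using a worker thread rather than a subprocess (assuming psycopg2; the DSN and query are placeholders). It only bounds how long the caller waits; the worker and its socket may linger until the OS gives up on them.

import queue
import threading

import psycopg2

def query_with_timeout(dsn, sql, timeout=10):
    result = queue.Queue()

    def worker():
        try:
            conn = psycopg2.connect(dsn, connect_timeout=5)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    result.put(cur.fetchall())
            finally:
                conn.close()
        except Exception as exc:  # hand the driver error back to the caller
            result.put(exc)

    threading.Thread(target=worker, daemon=True).start()
    try:
        out = result.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("query exceeded %s seconds" % timeout)
    if isinstance(out, Exception):
        raise out
    return out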

MongoDB query returns no results after a while

I am running a process that updates some entries in my Mongo DB using PyMongo. I have another process that polls these entries (using 'find' every minute) to see if the other process is done.
I noticed that after about 30-40 minutes I get an empty cursor even though these entries are still in the database.
At first I thought it happened because these entries were being changed, but then I ran a process that just used the same query once every minute and I saw the same phenomenon: after 30-40 minutes I get no results.
I noticed that if I wait 2-3 minutes I get the results I am requesting.
I tried to use the explain function but couldn't find anything helpful there.
Did you ever see something similar? If so what can I do?
Is there a way to tell that the cursor is empty? Is the rate limit configurable?
thank you in advance!
Apparently it was due to high CPU in mongo.
The database was synced with another one once every hour and during that time the queries returned empty results.
When we scheduled the sync to happen only once a day, we stopped seeing this problem. (We also added a retry mechanism to avoid errors at sync time. However, this retry is only helpful when you know for sure that the query should not return an empty cursor.)
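A minimal sketch of that retry idea, assuming PyMongo and a hypothetical collection and filter: re-run the find a few times before trusting an empty result.

import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
coll = client["mydb"]["jobs"]                      # hypothetical collection

def find_with_retry(query, attempts=3, delay=60):
    # Re-run the query a few times before believing an empty result,
    # e.g. while the hourly sync is hogging the CPU.
    for _ in range(attempts):
        docs = list(coll.find(query))
        if docs:
            return docs
        time.sleep(delay)
    return []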

Postgres 8.4.4 + psycopg2 + python 2.6.5 + Win7 instability

You can see the combination of software components I'm using in the title of the question.
I have a simple 10-table database running on a Postgres server (Win 7 Pro). I have client apps (Python, using psycopg to connect to Postgres) that connect to the database at random intervals to conduct relatively light transactions. There's only one client app at a time doing any kind of heavy transaction, and those are typically < 500ms. The rest of them spend more time connecting than actually waiting for the database to execute the transaction. The point is that the database is under light load, but the load is evenly split between reads and writes.
My client apps run as servers/services themselves. I've found that it is pretty common for me to be able to (1) take the Postgres server completely down, and (2) ruin the database by killing the client app with a keyboard interrupt.
By (1), I mean that the Postgres process on the server aborts and the service needs to be restarted.
By (2), I mean that the database crashes again whenever a client tries to access the database after it has restarted and (presumably) finished "recovery mode" operations. I need to delete the old database/schema from the database server, then rebuild it each time to return it to a stable state. (After recovery mode, I have tried various combinations of Vacuums to see whether that improves stability; the vacuums run, but the server will still go down quickly when clients try to access the database again.)
I don't recall seeing the same effect when I kill the client app using a "taskkill" - only when using a keyboard interrupt to take the python process down. It doesn't happen all the time, but frequently enough that it's a major concern (25%?).
Really surprised that anything on a client would actually be able to take down an "enterprise class" database. Can anyone share tips on how to improve robustness, and hopefully help me to understand why this is happening in the first place? Thanks, M
If you're having problems with postgresql acting up like this, you should read this page:
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
For an example of a real bug, and how to ask a question that gets action and answers, read this thread.
http://archives.postgresql.org/pgsql-general/2010-12/msg01030.php

Psycopg / Postgres : Connections hang out randomly

I'm using psycopg2 for the CherryPy app I'm currently working on, and the CLI & phpPgAdmin to handle some operations manually. Here's the Python code:
# One connection per thread
cherrypy.thread_data.pgconn = psycopg2.connect("...")
...
# Later, an object is created by a thread:
class dbobj(object):
    def __init__(self):
        self.connection = cherrypy.thread_data.pgconn
        self.curs = self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
...
# Then,
try:
    blabla
    self.curs.execute(...)
    self.connection.commit()
except:
    self.connection.rollback()
    lalala
...
# Finally, the destructor is called:
def __del__(self):
    self.curs.close()
I'm having a problem with either psycopg or Postgres (although I think the latter is more likely). After a few queries have been sent, my connections drop dead. Similarly, phpPgAdmin usually gets dropped as well; it prompts me to reconnect after I have made several requests. Only the CLI remains persistent.
The problem is, these drops happen very randomly and I can't track down the cause. I can get locked out after a few page requests, or never encounter anything after requesting hundreds of pages. The only errors I've found in the Postgres log, after terminating the app, are:
...
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
...
I thought of creating a new connection every time a new dbobj instance is created but I absolutely don't want to do this.
Also, I've read that one may run into similar problems unless all transactions are committed: I use the try/except block for every single INSERT/UPDATE query, but I never use it for SELECT queries, nor do I want to write even more boilerplate code (btw, do they need to be committed?). Even if that's the case, why would phpPgAdmin close down?
max_connections is set to 100 in the .conf file, so I don't think that's the reason either. A single cherrypy worker has only 10 threads.
Does anyone have an idea where I should look first ?
Psycopg2 needs a commit or rollback after every transaction, including SELECT queries, or it leaves the connections "IDLE IN TRANSACTION". This is now a warning in the docs:
Warning: By default, any query execution, including a simple SELECT will start a transaction: for long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat...). For long lived scripts, either ensure to terminate a transaction as soon as possible or use an autocommit connection.
It's a bit difficult to see exactly where you're populating and accessing cherrypy.thread_data. I'd recommend investigating psycopg2.pool.ThreadedConnectionPool instead of trying to bind one conn to each thread yourself.
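A minimal sketch of the pooled approach (the DSN is hypothetical); getconn()/putconn() replace the per-thread connection stored on cherrypy.thread_data, and the transaction is always ended, even for SELECTs:

import psycopg2.extras
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(1, 10, "dbname=app user=web")  # hypothetical DSN

def run_query(sql, params=None):
    conn = pool.getconn()
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as curs:
            curs.execute(sql, params)
            rows = curs.fetchall()
        conn.commit()  # end the transaction even for a SELECT
        return rows
    finally:
        pool.putconn(conn)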
Even though I don't have any idea why successful SELECT queries would block the connection, sprinkling .commit() after pretty much every single query that doesn't have to run in conjunction with another solved the problem.
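The docs' other suggestion, an autocommit connection, achieves the same thing without sprinkling .commit() calls everywhere; a sketch with a hypothetical DSN:

import psycopg2

conn = psycopg2.connect("dbname=app user=web")  # hypothetical DSN
conn.autocommit = True  # plain SELECTs no longer leave the session idle in transaction

with conn.cursor() as curs:
    curs.execute("SELECT 1")
    print(curs.fetchone())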
