Python: handle any and all postgres timeout situations

I am struggling to find a solution to this.
We use Python to monitor our Postgres databases that are running on AWS RDS. This means we have extremely limited control over the server side.
If there is an issue and the server fails (hardware fault, network fault, you name it), our scripts just hang, sometimes for 8-10 minutes (non-SSL) and up to 15-20 minutes (SSL connections). By the time they recover, finally hitting whatever timeout governs that seemingly random number of minutes, the server has failed over and everything works again.
Obviously, this renders our tools useless, if we can't catch these situations.
We basically run this (pseudo-code):
while True:
    try:
        query("select current_user")
    except Exception:
        page("Database X failed")
For the basic use cases, this works just fine. E.g. if an instance is restarted, or something of the sort, no problems.
But, if there is an actual issue, the query just hangs. For minutes and minutes.
We've tried setting statement_timeout on the psycopg2 connection. But that is a server-side setting, and if the instance fails and fails over, there is no server left to enforce it. So the client ends up waiting indefinitely, or until it hits one of those arbitrary timeouts.
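For reference, a minimal sketch of that approach (the connection parameters are placeholders; statement_timeout is passed as a libpq startup option, in milliseconds):

import psycopg2

conn = psycopg2.connect(
    host="db-host",        # placeholder
    dbname="db-name",      # placeholder
    user="monitor",        # placeholder
    options="-c statement_timeout=10000",  # 10 seconds
)
cur = conn.cursor()
cur.execute("select pg_sleep(60)")  # the server cancels this after 10 s and psycopg2 raises an error

This works fine for runaway queries, but since the server enforces the timeout, it is useless once the server itself is gone.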
I've looked into sockets and tested something like this:
import socket
import struct

conn = pg.connect('user', 'db-name', instanceName='foo')
fd = conn.fileno()  # file descriptor of the connection's underlying socket
s = socket.socket(fileno=fd)
print(s)
s.settimeout(0.0000001)
timeval = struct.pack('ll', 0, 1)  # 0 seconds, 1 microsecond
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)
data = conn.query("SELECT pg_sleep(10);")
In dumping the socket with the print(s) statement, I can clearly see that we've got the right socket.
But the timeouts I set do nothing whatsoever.
I've tried many values, and they just have no effect. With the above, it should raise a timeout if more than 1 microsecond has elapsed. Ignoring common sense, I checked with tcpdump and made sure that we definitely do not get a response within 1 microsecond. Yet the thing just sits there and waits for pg_sleep(10) to complete.
Could someone shed some light on this?
It seems simple enough:
All I want is that ANY CALL made to postgres can NEVER TAKE LONGER than, say, 10 seconds. Regardless of what it is, regardless of what happens. If more than 10 seconds have elapsed, it needs to raise an exception.
From what I can see, the only way would be to use a subprocess with a process timeout (a rough sketch of the idea follows). But we run threaded (we monitor hundreds of instances and spawn a persistent connection in a thread for each instance), and I've seen posts saying that this isn't reliable inside threads. It also seems silly to have each thread spawn yet another subprocess. Inception comes to mind. Where does it end?
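For context, the process-timeout idea would look something like this sketch (probe_query is a hypothetical stand-in for the health-check query; this is not our production code):

import multiprocessing

def probe_query(result_queue):
    # connect and run the health-check query here, then report back
    result_queue.put("ok")

def run_with_timeout(timeout=10):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=probe_query, args=(queue,))
    proc.start()
    proc.join(timeout)            # wait at most `timeout` seconds
    if proc.is_alive():           # still hung: kill it and raise
        proc.terminate()
        raise TimeoutError("query exceeded %d seconds" % timeout)
    return queue.get()

This does bound every call to 10 seconds, but at the cost of one extra process per monitored instance, which is exactly the part that feels silly.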
But I digress. The question seems simple enough, yet my wall is showing clear signs of a large dent developing.
Greatly appreciate any insights
PS: Python 3.6 on Ubuntu LTS
Cheers
Stefan

Related

AutobahnPython + Twisted 'Publish' floods messages after script is finished

I have a Python script that sometimes runs a process that lasts ~5-60 seconds. During this time, ten calls to session.publish() are ignored until the script is done. As soon as the script finishes, all ten messages are published in a flood.
I have corroborated this by running the Crossbar.io router in debug mode: it shows log entries for the published messages only after the run is over, not during it as I would expect.
The script in question is long, complex and includes a combined frontend and backend for Crossbar/Twisted/AutobahnPython. I feel I would risk misreporting the problem if I tried to condense and include it here.
What reasons are there for publish to not happen instantaneously?
A couple of unsuccessful tries so far:
Source: Twisted needs 'non-blocking code'. So I tried to incorporate reactor.callLater, but without success (I also don't really know how to do this for a publish event).
I looked into the idea of using Pool to spawn workers to perform the publish.
The AutobahnPython repo doesn't seem to have any examples that really include this kind of situation.
Thanks!
What reasons are there for publish to not happen instantaneously?
The reactor has to get a chance to run for I/O to happen. The example code doesn't let the reactor run because it keeps execution in a while loop in user code for a long time.
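A minimal sketch of one way to apply that (deferToThread, the hypothetical long_running_job, and the topic name are illustrative additions, not the asker's code): move the blocking work off the reactor thread so the reactor stays free to flush publishes as they happen.

from twisted.internet.threads import deferToThread

def long_running_job():
    ...  # placeholder for the 5-60 second blocking work

def on_trigger(session):
    # Run the blocking job in a worker thread so the reactor keeps spinning.
    d = deferToThread(long_running_job)
    # The callback fires back on the reactor thread, so publish() goes out
    # immediately instead of queueing behind a long-running loop.
    d.addCallback(lambda result: session.publish(u'com.example.topic', result))
    return d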

Python SimpleHTTPServer keeps going down and I don't know why

This is my first time working with SimpleHTTPServer, and honestly my first time working with web servers in general, and I'm having a frustrating problem. I'll start up my server (via SSH) and then I'll go try to access it and everything will be fine. But I'll come back a few hours later and the server won't be running anymore. And by that point the SSH session has disconnected, so I can't see if there were any error messages. (Yes, I know I should use something like screen to save the shell messages -- trying that right now, but I need to wait for it to go down again.)
I thought it might just be that my code was throwing an exception, since I had no error handling, but I added what should be a pretty catch-all try/catch block, and I'm still experiencing the issue. (I feel like this is probably not the best method of error handling, but I'm new at this... so let me know if there's a better way to do this)
import SimpleHTTPServer

class MyRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
    # (this is the only function my request handler has)
    def do_GET(self):
        if 'search=' in self.path:
            try:
                pass  # (my code that does stuff)
            except Exception as e:
                pass  # (log the error to a file)
            return
        else:
            SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
Does anyone have any advice for things to check, or ways to diagnose the issue? Most likely, I guess, is that my code is just crashing somewhere else... but if there's anything in particular I should know about the way SimpleHTTPServer operates, let me know.
I've never had SimpleHTTPServer running for an extended period of time; usually I just use it to transfer a couple of files in an ad-hoc manner. But I guess it wouldn't be so bad, as long as your security restraints are elsewhere (i.e. a firewall) and you don't have need for much scale.
The SSH session is ending, which is killing your tasks (both foreground and background tasks). There are two solutions to this:
Like you've already mentioned, use a utility such as screen to prevent your session from ending.
If you really want this to run for an extended period of time, you should look into your operating system's documentation on how to start/stop/enable services (nowadays most of the cool kids are using systemd, but you might also find yourself using SysVinit or some other init system); a sketch of a unit file follows.
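For the systemd route, a minimal unit file might look like this (all paths and names are placeholders; drop it in /etc/systemd/system/ and run systemctl enable --now myserver.service):

[Unit]
Description=My SimpleHTTPServer instance
After=network.target

[Service]
ExecStart=/usr/bin/python /home/me/server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target

Restart=on-failure also gets you automatic restarts if the script really is crashing, which screen alone won't do.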
EDIT:
This link is in the comments, but I thought I should put it here as it answers this question pretty well

Console output consuming much CPU? (about 140 lines per second)

I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I think it is not too high: in my test I have 15 servers, with 2 requests every second that go like this:
Server 1 requests information from servers 3-15 via multicast. Each of servers 3-15 must respond. If one response is missing after 0.5 sec, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If there are still missing results after 5 retries, the missing servers are marked as dead and the change is synced with the other management server (1/2).
So there are 2 multicasts every second and 26 unicasts every second. I think this should not be too much?
Servers 1 and 2 are running Python web servers, which I use to trigger the request every second on each server (via a web client).
The whole scenario is running in a Mininet environment inside a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I see via htop that the CPUs are at 100% while the RAM is at 50%, so the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1680 messages) there are too many missing results, causing too many resends while new requests are already coming in, so that the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers, which is not true...
I am wondering if the problem could be my debug output. I am printing 3-5 messages for every message that is sent and received, so that is roughly (let's guess 5 lines per sent/received message) (26 + 2) * 5 = 140 lines printed to the console every second.
I use python 2.6 for the servers.
So the question here is: can the console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete, 5 times in a row? The request processing in my test is simple, with no complex calculations or anything like that; basically it is something like "return request_param in ["bla", "blaaaa", ...]" (a small list of 5 items).
If yes, how can I disable the output completely without having to comment out every print statement? Or is there even a way to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep becomes active all the prints have already finished... I mean directly in Python.) A sketch of one option follows.
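For what it's worth, one way to get exactly that behaviour is the standard logging module instead of bare prints (a minimal sketch; "node" is just a placeholder logger name). Each message gets a level, and the configured threshold silences everything below it without touching the call sites:

import logging

logging.basicConfig(level=logging.WARNING)  # suppress DEBUG/INFO entirely
log = logging.getLogger("node")

log.debug("sent multicast to servers 3-15")    # hidden at WARNING level
log.warning("missing response from server 7")  # still printed

Switching the level back to logging.DEBUG re-enables the verbose output for debugging sessions.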
What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has some experience with mininet and network applications...
I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. This lock was contended across multiple CPU cores, making the whole thing very slow.
It even got slower the more cores I added to the executing VM, which was very strange...
Now the new bottleneck seems to be the APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)

Query failing on first try of the day, succeeding on second try

Exact error I get is here:
{'trace': "(Error) ('08S01', '[08S01] [FreeTDS][SQL Server]Write to the server failed (20006) (SQLExecDirectW)')"}
I get this when I first run a query in my Pyramid application. Any query I run triggers it (in my case, it is a web search form that returns info from a database).
The entire application is read-only, as is the account used to connect to the db, so I don't know what it would be writing that would fail. And like I said, if I re-run the exact same thing (or refresh the page), it runs just fine without error.
Edit: Emphasis on the "first try of the day". If no queries run for x amount of time, I get this write error again, and then it'll work. It's almost like it's fallen asleep and that first query will wake it up.
I would guess that there's a pool of DB connections that is kept open for some time, T. The server, however, terminates open connections after some time, S, which is less than T.
The first connection of the day (or after S elapses in general) would give you this error.
Try to look for a way to change the "timeout" of the connections in the pool to be less than S and that should fix the problem.
Edit: These times (T and S) are dependent on configs or default values for the server and libraries you use. I've experienced a similar issue with a Flask+SQLAlchemy+MySQL app in the past and I had to change the connection timeouts, etc.
Edit 2: T might be "keep connections open forever" or a very high value
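For example, if the app happens to use SQLAlchemy for its pool (an assumption on my part; the connection string is a placeholder), the knobs I mean look like this:

from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:pass@my_dsn",  # placeholder connection string
    pool_recycle=3600,    # reopen connections older than 1 hour (keep this < S)
    pool_pre_ping=True,   # SQLAlchemy >= 1.2: test each connection on checkout
)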

Python/PySerial and CPU usage

I've created a script to monitor the output of a serial port that receives 3-4 lines of data every half hour. The script runs fine and grabs everything that comes off the port, which at the end of the day is what matters...
What bugs me, however, is that the CPU usage seems rather high for a program that's just monitoring a single serial port: 1 core will always be at 100% usage while this script is running.
I'm basically running a modified version of the code in this question: pyserial - How to Read Last Line Sent from Serial Device
I've tried polling the inWaiting() function at regular intervals and sleeping when inWaiting() is 0 - I've tried intervals from 1 second down to 0.001 seconds (basically, as often as I can without driving up the CPU usage). This succeeds in grabbing the first line but seems to miss the rest of the data.
Adjusting the timeout of the serial port doesn't seem to have any effect on CPU usage, nor does putting the listening function into its own thread (not that I really expected a difference, but it was worth trying).
Should python/pyserial be using this much cpu? (this seems like overkill)
Am I wasting my time on this quest / Should I just bite the bullet and schedule the script to sleep for the periods that I know no data will be coming?
Maybe you could issue a blocking read(1) call, and when it succeeds, use read(inWaiting()) to get the right number of remaining bytes.
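A minimal sketch of that idea, assuming a pyserial port opened with no timeout so read(1) blocks (the device path, baud rate, and handle() are placeholders):

import serial

def handle(data):
    ...  # placeholder for whatever processing the script does

port = serial.Serial('/dev/ttyUSB0', 9600)  # placeholder device and baud rate

while True:
    first = port.read(1)                # blocks (no CPU spin) until a byte arrives
    rest = port.read(port.inWaiting())  # then drain whatever else is buffered
    handle(first + rest)

Because the blocking read parks the process in the kernel instead of looping, it should use essentially no CPU between those half-hourly bursts.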
Would a system-style solution be better? Create the Python script and have it executed via cron / Scheduled Task?
pySerial shouldn't be using that much CPU, but if it's just sitting there polling for an hour, I can see how it may happen. Sleeping may be a better option, in conjunction with periodic wakeups and polls.
