psycopg2.OperationalError: SSL SYSCALL error: EOF detected - python

I was using psycopg2 in python script to connect to Redshift database and occasionally I receive the error as below:
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
This error only happened once awhile and 90% of the time the script worked.
I tried to put it into a try and except block to catch the error, but it seems like the catching didn't work. For example, I try to capture the error so that it will automatically send me an email if this happens. However, the email was not sent when error happened. Below are my code for try except:
try:
conn2 = psycopg2.connect(host="localhost", port = '5439',
database="testing", user="admin", password="admin")
except psycopg2.Error as e:
print ("Unable to connect!")
print (e.pgerror)
print (e.diag.message_detail)
# Call check_row_count function to check today's number of rows and send
mail to notify issue
print("Trigger send mail now")
import status_mail
print (status_mail.redshift_failed(YtdDate))
sys.exit(1)
else:
print("RedShift Database Connected")
cur2 = conn2.cursor()
rowcount = cur2.rowcount
Errors I received in my log:
Traceback (most recent call last):
File "/home/ec2-user/dradis/dradisetl-daily.py", line 579, in
load_from_redshift_to_s3()
File "/home/ec2-user/dradis/dradisetl-daily.py", line 106, in load_from_redshift_to_s3
delimiter as ','; """.format(YtdDate, s3location))
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
So the question is, what causes this error and why isn't my try except block catching it?

From the docs:
exception psycopg2.OperationalError
Exception raised for errors that are related to the database’s
operation and not necessarily under the control of the programmer,
e.g. an unexpected disconnect occurs, the data source name is not
found, a transaction could not be processed, a memory allocation error
occurred during processing, etc.
This is an error which can be a result of many different things.
slow query
the process is running out of memory
other queries running causing tables to be locked indefinitely
running out of disk space
firewall
(You should definitely provide more information about these factors and more code.)
You were connected successfully but the OperationalError happened later.
Try to handle these disconnects in your script:
Put the command you want to execute into a try-catch block and try to reconnect if the connection is lost.

Recently encountered this error. The cause in my case was the network instability while working with database. If network will became down for enough time that the socket detect the timeout you will see this error. If down time is not so long you wont see any errors.
You may control timeouts of Keepalive and RTO features using this code sample
s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 6)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
More information you can find in this post

If you attach the actual code that you are trying to except it would be helpful. In your attached stack trace its : " File "/home/ec2-user/dradis/dradisetl-daily.py", line 106"
Similar except code works fine for me. mind you, e.pgerror will be empty if the error occurred on the client side, such as the error in my example. the e.diag object will also be useless in this case.
try:
conn = psycopg2.connect('')
except psycopg2.Error as e:
print('Unable to connect!\n{0}'.format(e))
else:
print('Connected!')

Maybe it will be helpful for someone but I had such an error when I've tried to restore backup to the database which has not sufficient space for it.

Related

Why does dnspython module give LifetimeTimeout error?

I am trying to check if a domain name has MX records resolved using dnspython module. I am getting the following error while connecting to the mx record server. Can anyone explain why I am facing this issue?
Traceback (most recent call last):
File "c:\Users\iamfa\OneDrive\Desktop\test\email_mx.py", line 26, in <module>
dns.resolver.resolve("cmrit.ac.in", 'MX')
File "c:\Users\iamfa\OneDrive\Desktop\test\env1\lib\site-packages\dns\resolver.py", line 1193, in resolve
return get_default_resolver().resolve(qname, rdtype, rdclass, tcp, source,
File "c:\Users\iamfa\OneDrive\Desktop\test\env1\lib\site-packages\dns\resolver.py", line 1066, in resolve
timeout = self._compute_timeout(start, lifetime,
File "c:\Users\iamfa\OneDrive\Desktop\test\env1\lib\site-packages\dns\resolver.py", line 879, in _compute_timeout
raise LifetimeTimeout(timeout=duration, errors=errors)
dns.resolver.LifetimeTimeout: The resolution lifetime expired after 5.001 seconds: Server 10.24.0.1 UDP port 53 answered The DNS operation timed out.; Server 198.51.100.1 UDP port 53 answered The DNS operation timed out.; Server 10.95.11.110 UDP port 53 answered The DNS operation timed out.
This is my code:
import dns.resolver
if dns.resolver.resolve("cmrit.ac.in", 'MX'):
print(True)
else:
print(False)
However it was working fine till yesterday but when I try to run the same code today I am facing this issue.
If the remote DNS server takes a long time to respond, or accepts the connection but does not respond at all, the only thing you can really do is move along. Perhaps try again later. You can catch the error with try/except:
import dns.resolver
try:
if dns.resolver.resolve("cmrit.ac.in", 'MX'):
print(True)
else:
print(False)
except dns.resolver.LifetimeError:
print("timed out, try again later maybe?")
If you want to apply a longer timeout, the resolve method accepts a lifetime keyword argument which is documented in the Resolver.resolve documentation.
The Resolver class (documented at the top of the same page) also has a timeout parameter you can tweak if you build your own resolver.
For production code, you should probably add the other possible errors to the except clause; the exemplary documentation shows you precisely which exceptions resolve can raise.
...
except (dns.resolver.LifetimeTimeout, dns.resolver.NXDOMAIN,
dns.resolver.YXDOMAIN, dns.resolver.NoAnswer,
dns.resolver.NoNameservers) as err:
print("halp, something went wrong:", err)
Probably there is a base exception class which all of these inherit from; I was too lazy to go back and check. Then you only have to list the base class in the except statement.
It's probably more useful to extract the actual MX record and display it, rather than just print True, but that's a separate topic.
Your error message indicates that you were able to connect to your own resolver at 10.24.0.1 but in the general case, this error could also happen if your network (firewall etc) prevents you from accessing DNS for some reason.

Python Azure module error handling TCP 104 not being caught

I am currently using python 3.8.8 with version 12.9.0 of azure.storage.blob and 1.14.0 of azure.core.
I am downloading multiple files using the azure.storage.blob package. My code looks something like the following
from azure.storage.blob import ContainerClient
from azure.core.exceptions import ResourceNotFoundError, AzureError
from time import sleep
max_attempts = 5
container_client = ContainerClient(DETAILS)
for file in multiple_files:
attempts = 0
while attempts < max_attempts:
try:
data = container.download_blob(file).readall()
break
except ResourceNotFoundError:
# log missing data
break
except AzureError:
# This is mainly here as connections seem to drop randomly.
attempts += 1
sleep(1)
if attempts >= max_attempts:
#log connection error
#do something with the data.
It seems to be running fine, and I don't see any loss of data. However, within my terminal I keep getting the message
Unable to stream download: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
This appears to be a TCP 104 return message but isn't being handled by the azure module. My questions are as follows.
Where is this message coming from? I can't see it in any of the packages I am using.
How do I handle this error better? It doesn't appear to be caught as an exception as it isn't crashing my code.
Can I get this to print to a log?
Where is this message coming from? I can't see it in any of the packages I am using.
Looks like The clients seemed to be connected to the server, but when they attempted to transfer data, they received a Errno 104 Connection reset by peer error. This also means, that the other side has reset the connection else the client would encounter with [Errno 32] Broken pipe exception.
How do I handle this error better? It doesn't appear to be caught as an exception as it isn't crashing my code.
One of the workarounds you can try is to have try and catch block to handle that exception:
from socket import error as SocketError
import errno
try:
response = urllib2.urlopen(request).read()
except SocketError as e:
if e.errno != errno.ECONNRESET:
raise # Not error we are looking for
pass # Handle error here.
Also try referring to this similar issue where sudo pip3 install urllib3 solved the issue.
Can I get this to print to a log?
One workaround is that you can pass exception instance in exc_info argument:
import logging
try:
1/0
except Exception as e:
logging.error('Error at %s', 'division', exc_info=e)
For more information you can refer How to log python exception?
Here is a related issue that you can follow up
azure storage blob download: ConnectionResetError(104, 'Connection reset by peer')
REFERENCE:
Connection broken: ConnectionResetError(104, 'Connection reset by peer') error while streaming

How to determine if remote ESXI Host has booted fully?

I am writing a Python Script to fully boot up a handful of ESXI hosts remotely, and I am having trouble with determining when ESXI has finished booting and is ready to receive commands send over SSH. I am running the script on a windows host that is hardwired to each ESXI host and the system is air-gapped so there is no firewalls in the way and no security software would interfere.
Currently I am doing this: I remote into the chassis through SSH and send the power commands to the ESXI host - this works and has always worked. Then, I attempt to SSH into each blade and send the following command
esxcli system stats uptime get
The command doesn't matter, I just need a response to make sure that the host is up. Below is the function I am using to send the SSH commands in hopes of getting a response
def send_command(ip, port, timeout, retry_interval, cmd, user, password):
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
retry_interval = float(retry_interval)
timeout = int(timeout)
timeout_start = time.time()
worked = False
while worked == False:
time.sleep(retry_interval)
try:
ssh.connect(ip, port, user, password, timeout=5)
stdin,stdout,stderr=ssh.exec_command(cmd)
outlines=stdout.readlines()
resp=''.join(outlines)
print(resp)
worked = True
return (resp)
except socket_error as e:
worked = False
print(e)
continue
except paramiko.ssh_exception.SSHException as e:
worked = False
# socket is open, but not SSH service responded
print(e)
continue
except TimeoutError as e:
print(e)
worked = False
pass
except socket.timeout as e:
print(e)
worked = False
continue
except paramiko.ssh_exception.NoValidConnectionsError as e:
print(e)
worked = False
continue
except socket.error as serr:
print(serr)
worked = False
continue
except IOError as e:
print(e)
worked = False
continue
except:
print(e)
worked = False
continue
My goal here is to catch all of the exceptions long enough for the host to finish booting and then receive a response. The issue is that sometimes it will loop for several minutes (as expected when booting a system like this), and then it will print
IO error: [Errno 111] Connection refused
And then drop out of the function/try catch block and never establish the connection. I know that this is a fault of my exceptions handling because when this happens, I stop the script, wait a few minutes, run it again without touching anything else and the esxcli command will work perfectly and the script will work great.
How do I prevent the Errno 111 error from breaking my loop? Any help is greatly appreciated
Edit: One possible duct tape solution could be changing the command to "esxcli system hostname get" and checking the response for the word "Domain". This might work because the IOError seems to be a response and not an exception, I'll have to wait until monday to test that solution though.
I solved it. It occured to me that I was handling all possible exceptions that any python code could possibly throw, so my defect wasn't a python error and that would make sense why I wasn't finding anything online about the relationship between Python, SSH and the Errno 111 error.
The print out is in fact a response from the ESXI host, and my code is looking for any response. So I simply changed the esxcli command from requesting the uptime to
esxcli system hostname get
and then through this into the function
substring = "Domain"
if substring not in resp:
print(resp)
continue
I am looking for the word "Domain" because that must be there if that call is successful.
How I figure it out: I installed ESXI 7 on an old Intel Nuc, turned on SSH in the kickstart script, started the script and then turned on the nuc. The reason I used the NUC is because a fresh install on simple hardware boots up much faster and quietly than Dell Blades! Also, I wrapped the resp variable in a print(type(OBJECT)) line and was able to determine that it was infact a string and not an error object.
This may not help someone that has a legitimate Errno 111 error, I knew I was going to run into this error each and everytime I ran the code and I just needed to know how to handle it and hold the loop until I got the response I wanted.
Edit: I suppose it would be easier to just filter for the world "errno" and then continue the loop instead of using a different substring. That would handle all of my use cases and eliminate the need for a different function.

Why does StreamReader.readexactly() cause a socket error but not StreamReader.read()?

I'm writing application in python using Asyncio for networking. I have code similar too:
try:
data = await self._reader.readexactly(10000)
# Code that uses data
except IncompleteReadError as e:
data = e.parial
# More code
When I try running this code, it never seems to actually run. If I set a breakpoint on the second line, the breakpoint will trip, but the rest of the function is ignored.
The closest thing I get to an error is this from the asyncio logger:
Traceback (most recent call last):
File "c:\python36\Lib\asyncio\selector_events.py", line 724, in _read_ready
data = self._sock.recv(self.max_size)
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
Replacing the second line with data = await self._reader.read(10000) appears to solve this issue, but read() doesn't solve my issue, I need to use readexactly(). So why does readexactly() cause a socket error but read() not?
the only difference between the two is that "read" read up to n bytes while readexactly reads exactly n bytes and if reaches the end before n bytes is raises an IncompleteReadError , which may cause your socket to get the error you have pointed out.

Where can arbitrary exceptions during WebSocket connections in Autobahn be caught in the Python code?

I have created a Python program that uses Autobahn to make WebSocket connections to a remote host, and receive a data flow over these connections.
From time to time, some different exceptions occur during these connections, most often either an exception immediately when attempting to connect, stating that the initial WebSocket handshake failed (most likely due to an overloaded server), like this:
2017-05-03T20:31:10 dropping connection to peer tcp:1.2.3.4:443 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
Or a later exception during a successful and ongoing connection, saying that the connection timed out due to lack of pong response to a ping, as follows:
2017-05-04T13:33:40 dropping connection to peer tcp:1.2.3.4:443 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
2017-05-04T13:33:40 session closed with reason wamp.close.transport_lost [WAMP transport was lost without closing the session before]
2017-05-04T13:33:40 While firing onDisconnect: Traceback (most recent call last):
File "c:\Python36\lib\site-packages\txaio\aio.py", line 450, in done
f._result = x
AttributeError: attribute '_result' of '_asyncio.Future' objects is not writable
As can be seen above, this also triggers some other strange exception in the txaio module in this particular case.
No matter what kind of exception that occurs, I would like to catch them and handle them gracefully, but for some reason the exceptions (none of them) seem to bubble up to the code that initiated these connections (i.e. get caught by my try ... except clause there), which looks like this:
from autobahn.asyncio.wamp import ApplicationSession
from autobahn.asyncio.wamp import ApplicationRunner
...
class MyComponent(ApplicationSession):
...
try:
runner = ApplicationRunner("wss://my.websocket.server.com:443", "realm1")
runner.run(MyComponent)
except Exception as e:
print('Unexpected connection error')
...
Instead, all these exceptions just hang my program completely after the error messages have been dumped out to the terminal as above, why is this?
So, the question is: How and where in the code can I catch these exceptions that occur during the WebSocket connections in Autobahn, and react/handle them gracefully?

Categories

Resources