We have a script which talks to Scylla (a Cassandra drop-in replacement). The script is supposed to run against a few thousand systems, and it issues a few thousand queries to get the data it needs. However, after some time the script crashes with this error:
2021-09-29 12:13:48 Could not execute query because of : errors={'x.x.x.x': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=x.x.x.x
2021-09-29 12:13:48 Trying for : 4th time
Traceback (most recent call last):
File ".../db_base.py", line 92, in db_base
ret_val = SESSION.execute(query)
File "cassandra/cluster.py", line 2171, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'x.x.x.x': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=x.x.x.x
The DB Connection code:
def db_base(current_keyspace, query, try_for_times, current_IPs, port):
    global SESSION
    if SESSION is None:
        # Retry connecting to the cluster up to the given number of times
        for i in range(try_for_times):
            try:
                cluster = Cluster(contact_points=current_IPs, port=port)
                session = cluster.connect()  # error can be encountered in this command
                break
            except NoHostAvailable:
                print("No Host Available! Trying for : " + str(i) + "th time")
                if i == try_for_times - 1:
                    # shutting down cluster
                    cluster.shutdown()
                    raise db_connection_error("Could not connect to the cluster even in " + str(try_for_times) + " tries! Exiting")
        SESSION = session

    # Retry the actual query up to the given number of times
    for i in range(try_for_times):
        try:
            # setting keyspace
            SESSION.set_keyspace(current_keyspace)
            # execute actual query - error can be encountered in this
            ret_val = SESSION.execute(query)
            break
        except Exception as e:
            print("Could not execute query because of : " + str(e))
            print("Trying for : " + str(i) + "th time")
            if i == (try_for_times - 1):
                # shutting down session and cluster
                cluster.shutdown()
                session.shutdown()
                raise db_connection_error("Could not execute query even in " + str(try_for_times) + " tries! Exiting")
    return ret_val
How can this code be improved so that it can sustain this large number of queries? Or should we look into other tools / approaches to get this data? Thank you.
The client request timeout indicates that the driver is timing out before the server does, or - should the server be overloaded - that Scylla hasn't replied to the driver within the timeout. There are a couple of ways to figure this out:
1 - Ensure that your default_timeout is higher than the timeouts Scylla enforces in /etc/scylla/scylla.yaml (see the sketch after this list).
2 - Check the Scylla logs for any sign of overload. If there is any, consider throttling your requests to find a balanced sweet spot where they no longer fail. If it continues, consider resizing your instances.
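As a rough illustration of point 1, a minimal sketch; the contact point, keyspace/table names and timeout values are placeholders, not taken from your setup:
import logging
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["x.x.x.x"], port=9042)
session = cluster.connect()

# The client-side timeout should exceed the server-side timeouts
# (e.g. read_request_timeout_in_ms / write_request_timeout_in_ms in /etc/scylla/scylla.yaml),
# so Scylla gets a chance to reply before the driver gives up.
session.default_timeout = 60  # seconds, applies to every execute()

# A per-query override is also possible:
rows = session.execute("SELECT * FROM my_keyspace.my_table", timeout=120)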
In addition to these, it is worth mentioning that your sample code is not using prepared statements, token awareness, and other best practices mentioned under https://docs.datastax.com/en/developer/python-driver/3.19/api/cassandra/policies/ that will certainly improve your overall throughput down the road.
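For example, a minimal sketch of a prepared statement combined with a token-aware load-balancing policy; the keyspace, table, column and the system_ids iterable are assumptions for illustration:
from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=["x.x.x.x"],
    port=9042,
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect("my_keyspace")

# Prepare once, bind many times; the statement is parsed on the server only once,
# and the token-aware policy routes each execution to a replica owning the key.
select_stmt = session.prepare("SELECT * FROM my_table WHERE id = ?")
for system_id in system_ids:  # system_ids: whatever iterable drives your per-system queries
    rows = session.execute(select_stmt, (system_id,))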
You can find further information on Scylla docs:
https://docs.scylladb.com/using-scylla/drivers/cql-drivers/scylla-python-driver/
and Scylla University
https://university.scylladb.com/courses/using-scylla-drivers/lessons/coding-with-python/
Related
MySQL-Connector for Python is taking > 1 minute to open a connection to the database.
I'm writing an AWS Lambda function to simply pull the current secret from AWS Secret manager and then use that secret to open a connection to an SQL Database and run a query to fetch some test data.
secret = {
    'user': (responseDict['User ID']),
    'password': (responseDict['Password']),
    'host': (responseDict['Data Source']),
    'database': (responseDict['Initial Catalog']),
    'raise_on_warnings': True
}
def connect_database(secret):
    # Creates a client connection to the database, using the secret
    log.info('Attempting to connect')
    try:
        dbClient = mysql.connector.connect(**secret)
    except mysql.connector.Error as err:
        if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
            logger.error('ERROR: Failed to auth with the SQL Database.')
        elif err.errno == errorcode.ER_BAD_DB_ERROR:
            logger.error('ERROR: Database does not exist.')
        else:
            logger.error('ERROR: Unknown error when connecting to database')
            logger.error(err)
    logging.info('Connected successfully')
    return dbClient
When I first ran the Lambda function, it failed because it reached the timeout threshold of 3 seconds, so I increased the timeout to 60 seconds and placed a few info logs at various points of the script to find where it times out. To my surprise, with a 60-second threshold, it still times out.
The last log shown is the "log.info('Attempting to connect')" just before it tries to open a connection to the SQL server. The 'Connected successfully' message does not log, and neither do any of the error catches.
Can anyone clarify whether the connection should take this long, or, more likely, point out where I've gone wrong?
EDIT: This ended up being related to a networking issue that I was unaware of, thanks!
The AWS Lambda function itself takes some time to get ready and execute, even though Lambda is in hot-standby mode. Please consider this fact when setting the timeout. Try increasing the timeout; you can increase it to 300 seconds and try again.
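If you also want the connector itself to give up sooner than the Lambda timeout (so you see an error instead of a silent timeout), mysql-connector-python accepts a connection_timeout argument. A minimal sketch, assuming a 10-second cap is acceptable:
import logging
import mysql.connector

logger = logging.getLogger(__name__)

def connect_database(secret):
    # connection_timeout makes the connector fail fast instead of hanging
    # until the Lambda-level timeout kills the invocation.
    try:
        return mysql.connector.connect(connection_timeout=10, **secret)
    except mysql.connector.Error as err:
        logger.error('Could not connect within 10 seconds: %s', err)
        raise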
I was using psycopg2 in a Python script to connect to a Redshift database, and occasionally I receive the error below:
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
This error only happened once in a while, and 90% of the time the script worked.
I tried to put it into a try/except block to catch the error, but it seems like the catching didn't work. For example, I try to capture the error so that it will automatically send me an email if this happens. However, the email was not sent when the error happened. Below is my code for the try/except:
try:
    conn2 = psycopg2.connect(host="localhost", port='5439',
                             database="testing", user="admin", password="admin")
except psycopg2.Error as e:
    print("Unable to connect!")
    print(e.pgerror)
    print(e.diag.message_detail)
    # Call check_row_count function to check today's number of rows and send mail to notify issue
    print("Trigger send mail now")
    import status_mail
    print(status_mail.redshift_failed(YtdDate))
    sys.exit(1)
else:
    print("RedShift Database Connected")
    cur2 = conn2.cursor()
    rowcount = cur2.rowcount
Errors I received in my log:
Traceback (most recent call last):
File "/home/ec2-user/dradis/dradisetl-daily.py", line 579, in
load_from_redshift_to_s3()
File "/home/ec2-user/dradis/dradisetl-daily.py", line 106, in load_from_redshift_to_s3
delimiter as ','; """.format(YtdDate, s3location))
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
So the question is, what causes this error and why isn't my try except block catching it?
From the docs:
exception psycopg2.OperationalError
Exception raised for errors that are related to the database’s
operation and not necessarily under the control of the programmer,
e.g. an unexpected disconnect occurs, the data source name is not
found, a transaction could not be processed, a memory allocation error
occurred during processing, etc.
This is an error which can be a result of many different things.
slow query
the process is running out of memory
other queries running causing tables to be locked indefinitely
running out of disk space
firewall
(You should definitely provide more information about these factors and more code.)
You were connected successfully but the OperationalError happened later.
Try to handle these disconnects in your script:
Put the command you want to execute into a try/except block and try to reconnect if the connection is lost, as in the sketch below.
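A minimal sketch of that pattern; the run_with_reconnect helper, the query and the retry count are illustrative, while conn2 and the connection parameters come from the question above:
import time
import psycopg2

def run_with_reconnect(sql, retries=3):
    global conn2
    for attempt in range(retries):
        try:
            cur = conn2.cursor()
            cur.execute(sql)
            return cur.fetchall()
        except psycopg2.OperationalError as e:
            print("Lost connection ({0}), reconnecting...".format(e))
            time.sleep(2 ** attempt)  # simple backoff before retrying
            conn2 = psycopg2.connect(host="localhost", port='5439',
                                     database="testing", user="admin", password="admin")
    raise RuntimeError("Query still failing after {0} retries".format(retries))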
I recently encountered this error. The cause in my case was network instability while working with the database. If the network goes down for long enough that the socket detects the timeout, you will see this error. If the downtime is not that long, you won't see any errors.
You may control the keepalive and RTO timeouts using this code sample:
import socket

# conn is the open psycopg2 connection; tune TCP keepalive / retransmission timeouts on its socket
s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 6)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
You can find more information in this post.
It would be helpful if you attached the actual code that you are trying to wrap in the except block. In your attached stack trace it's: File "/home/ec2-user/dradis/dradisetl-daily.py", line 106
Similar except code works fine for me. Mind you, e.pgerror will be empty if the error occurred on the client side, such as the error in my example. The e.diag object will also be useless in this case.
try:
    conn = psycopg2.connect('')
except psycopg2.Error as e:
    print('Unable to connect!\n{0}'.format(e))
else:
    print('Connected!')
Maybe it will be helpful for someone, but I had such an error when I tried to restore a backup to a database that did not have sufficient space for it.
I wrote a bot for TeamSpeak 3 that runs over ServerQuery (a telnet interface).
But the bot keeps responding later and later: in the beginning it takes about 0.1 sec, after about a minute the bot takes about 10 seconds to respond, and using commands makes it degrade even faster.
Any idea why?
So basically the telnet interface sends data from the TS3 server to my Python script, the ts3 module receives and processes the data, and then the script decides what the action will be.
As modules I am using MySQLdb and ts3(https://github.com/benediktschmitt/py-ts3)
My sourcecode is here: https://pastebin.com/cJuyB9ZH
Another script, which just takes all clients and pushes them into a database every 5 min, runs multiple days without any issues.
I checked the code multiple times now and even deleted variables right after they have been used, but it still has the same issue.
My guess would be that it somehow clogs up the RAM, so I looked through the code multiple times, but couldn't find out why or where.
Side note: I know I sometimes call commit() when it's totally not necessary; I don't know if that might cause problems, but I don't see how.
Short(er) version of my code:
import ts3
import MySQLdb
# Some other imports like time and threading and such

## Connect to TS3
tsConn = ts3.query.TS3Connection(tsAddr, tsPort)
try:
    tsConn.login(client_login_name=tsUser, client_login_password=tsPass)
    tsConn.use(sid=tsSID, virtual=True)
    print(" ==>> CONNECTED TO TS3 SERVER: " + tsAddr)
except ts3.query.TS3QueryError as e:
    print("Login to TS Server failed! Aborting...")
    exit(1)

## Connect to mySQL
try:
    qConn = MySQLdb.connect(host=qHost, user=qUser, passwd=qPass, db=qDB)
    qServer = qConn.cursor()
    print(" ==>> CONNECTED TO mySQL SERVER: " + qHost)
except OperationalError:
    print("Cannot connect to mySQL Database! Aborting...")
    exit(1)

running = True
while running:
    tsConn.send_keepalive()
    qServer.execute("SELECT 1")  # keepalive
    try:
        event = tsConn.wait_for_event(timeout=1)
    except TS3TimeoutError:
        pass
    else:
        try:
            pass  # <some command processing here>
        except KeyError:
            try:
                if event[0]["reasonid"] == "0":
                    tsConn.sendtextmessage(targetmode=1, target=event[0]["clid"], msg=greetingmsg.format(event[0]["client_nickname"]))
            except:
                pass
I am totally new to Kafka and Docker, and have been handed a problem to fix. Our continuous-integration tests for (Apache) Kafka queues run just fine on local machines, but on the Jenkins CI server they occasionally fail with this sort of error:
%3|1508247800.270|FAIL|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: 1/1 brokers are down
The working theory is that the Docker image takes time to get started, by which time the Kafka producer has given up. The offending code is
producer_properties = {
    'bootstrap.servers': self._job_queue.bootstrap_server,
    'client.id': self._job_queue.client_id
}
try:
    self._producer = kafka.Producer(**producer_properties)
except:
    print("Bang!")
with the error lines above appearing in the creation of the producer. However, no exception is raised, and the call returns an otherwise valid looking producer, so I can't programmatically test the presence of the broker endpoint. Is there an API to check the status of a broker?
It seems the client doesn't throw an exception if the connection to the broker fails. It actually tries to connect to the bootstrap servers the first time the producer tries to send a message. If the connection fails, it repeatedly tries to connect to any of the brokers passed in the bootstrap list. Eventually, if the brokers come up, the send will happen (and we may check the status in the callback function).
The confluent-kafka Python library uses the librdkafka library, and this client doesn't seem to have proper documentation. Some of the Kafka producer options specified by the Kafka protocol seem not to be supported by librdkafka.
Here is the sample code with callback I used:
from confluent_kafka import Producer

def notifyme(err, msg):
    print err, msg.key(), msg.value()

p = Producer({'bootstrap.servers': '127.0.0.1:9092', 'retry.backoff.ms': 100,
              'message.send.max.retries': 20,
              "reconnect.backoff.jitter.ms": 2000})
try:
    p.produce(topic='sometopic', value='this is data', on_delivery=notifyme)
except Exception as e:
    print e
p.flush()
Also, to check for the presence of the broker, you may just telnet to the broker IP on its port (in this example it is 9092). And on the ZooKeeper used by the Kafka cluster, you may check the contents of the znodes under /brokers/ids.
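As a rough programmatic equivalent of the telnet check, a minimal sketch; the host and port are the example values above, and this only proves something is listening on the port, not that Kafka itself is healthy:
import socket

def broker_port_open(host="127.0.0.1", port=9092, timeout=3):
    # Attempt a plain TCP connection to the broker port.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

print(broker_port_open())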
Here is the code that seems to work for me. If it looks a bit Frankenstein, then you are right, it is! If there is a clean solution, I look forward to seeing it:
import time
import uuid
from threading import Event
from typing import Dict

import confluent_kafka as kafka
# pylint: disable=no-name-in-module
from confluent_kafka.cimpl import KafkaError
# more imports...

LOG = # ...

# Default number of times to retry connection to Kafka Broker
_DEFAULT_RETRIES = 3

# Default time in seconds to wait between connection attempts
_DEFAULT_RETRY_DELAY_S = 5.0

# Number of times to scan for an error after initiating the connection. It appears that calling
# flush() once on a producer after construction isn't sufficient to catch the 'broker not available'
# error. At least twice seems to work.
_NUM_ERROR_SCANS = 2


class JobProducer(object):
    def __init__(self, connection_retries: int=_DEFAULT_RETRIES,
                 retry_delay_s: float=_DEFAULT_RETRY_DELAY_S) -> None:
        """
        Constructs a producer.
        :param connection_retries: how many times to retry the connection before raising a
                                   RuntimeError. If 0, retry forever.
        :param retry_delay_s: how long to wait between retries in seconds.
        """
        self.__error_event = Event()
        self._job_queue = JobQueue()
        self._producer = self.__wait_for_broker(connection_retries, retry_delay_s)
        self._topic = self._job_queue.topic

    def produce_job(self, job_definition: Dict) -> None:
        """
        Produce a job definition on the queue
        :param job_definition: definition of the job to be executed
        """
        value = ...  # Conversion to JSON
        key = str(uuid.uuid4())
        LOG.info('Produced message: %s', value)
        self.__error_event.clear()
        self._producer.produce(self._topic,
                               value=value,
                               key=key,
                               on_delivery=self._on_delivery)
        self._producer.flush(self._job_queue.flush_timeout)

    @staticmethod
    def _on_delivery(error, message):
        if error:
            LOG.error('Failed to produce job %s, with error: %s', message.key(), error)

    def __create_producer(self) -> kafka.Producer:
        producer_properties = {
            'bootstrap.servers': self._job_queue.bootstrap_server,
            'error_cb': self.__on_error,
            'client.id': self._job_queue.client_id,
        }
        return kafka.Producer(**producer_properties)

    def __wait_for_broker(self, retries: int, delay: float) -> kafka.Producer:
        retry_count = 0
        while True:
            self.__error_event.clear()
            producer = self.__create_producer()
            # Need to call flush() several times with a delay between to ensure errors are caught.
            if not self.__error_event.is_set():
                for _ in range(_NUM_ERROR_SCANS):
                    producer.flush(0.1)
                    if self.__error_event.is_set():
                        break
                    time.sleep(0.1)
                else:
                    # Success: no errors.
                    return producer
            # If we get to here, the error callback was invoked.
            retry_count += 1
            if retries == 0:
                msg = '({})'.format(retry_count)
            else:
                if retry_count <= retries:
                    msg = '({}/{})'.format(retry_count, retries)
                else:
                    raise RuntimeError('JobProducer timed out')
            LOG.warn('JobProducer: could not connect to broker, will retry %s', msg)
            time.sleep(delay)

    def __on_error(self, error: KafkaError) -> None:
        LOG.error('KafkaError: %s', error.str())
        self.__error_event.set()
I've been using my script for a Unix server and it's working perfectly. However, when I use the same script (with some minor command changes) to connect to HP ProCurve switches, the script crashes with an error. Part of the script is below:
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(address, username=userna, password=passwd)
stdin, stdout, stderr = ssh.exec_command("show ver")
for line in stdout:
    print '... ' + line.strip('\n')
ssh.close()
This gives the error:
Traceback (most recent call last):
File "C:/Users/kucar/Desktop/my_python/switchmodel", line 34, in <module>
stdin,stdout,stderr= ssh.exec_command("show ver")
File "C:\Python27\lib\site-packages\paramiko\client.py", line 379, in exec_command
chan.exec_command(command)
File "C:\Python27\lib\site-packages\paramiko\channel.py", line 218, in exec_command
self._wait_for_event()
File "C:\Python27\lib\site-packages\paramiko\channel.py", line 1122, in _wait_for_event
raise e
SSHException: Channel closed.
I've found similar complaints on the web, but it seems no solution is provided at all. The switch is open to SSH and works fine with PuTTY. I'd appreciate any ideas that could help me. I cannot run the "show ver" command manually for 100 switches.
As @dobbo mentioned above, you have to call invoke_shell() on the channel so that you can execute multiple commands. Also, HP ProCurve has ANSI escape codes in the output, so you have to strip those out. Finally, HP ProCurve throws up a "Press any key to continue" message which you have to get past, at least on some devices.
I have an HP ProCurve handler in this library https://github.com/ktbyers/netmiko
Set device_type to "hp_procurve".
Exscript also has some sort of a ProCurve handler though I haven't dug into it enough to get it to work.
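For reference, a minimal netmiko sketch; the host address and credentials are placeholders, not taken from the question:
from netmiko import ConnectHandler

switch = {
    'device_type': 'hp_procurve',
    'host': '192.0.2.10',   # placeholder switch address
    'username': 'admin',    # placeholder credentials
    'password': 'secret',
}

conn = ConnectHandler(**switch)
output = conn.send_command('show ver')
print(output)
conn.disconnect()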
I had the same experience connecting to my Samsung S4 phone with an SSH server.
I had no problem connecting to a SUSE VM or a Raspberry Pi, and I also tried MobaXterm (PuTTY is SO last week).
I have not found the answer but will share my research.
I had a look at the source and found line 1122 in channel.py (copied below).
With my phone (and possibly your HP switch) I have noticed that there is no login message or MOTD at all, and when exiting (with PuTTY/MobaXterm) the session doesn't end properly.
In some other reading, I have found that paramiko is not getting much support from the author any more, but others are working to port it to Python 3.x.
Here is the source code I found.
def _wait_for_send_window(self, size):
    """
    (You are already holding the lock.)
    Wait for the send window to open up, and allocate up to C{size} bytes
    for transmission. If no space opens up before the timeout, a timeout
    exception is raised. Returns the number of bytes available to send
    (may be less than requested).
    """
    # you are already holding the lock
    if self.closed or self.eof_sent:
        return 0
    if self.out_window_size == 0:
        # should we block?
        if self.timeout == 0.0:
            raise socket.timeout()
        # loop here in case we get woken up but a different thread has filled the buffer
        timeout = self.timeout
        while self.out_window_size == 0:
            if self.closed or self.eof_sent:
                return 0
            then = time.time()
            self.out_buffer_cv.wait(timeout)
            if timeout != None:
                timeout -= time.time() - then
                if timeout <= 0.0:
                    raise socket.timeout()
    # we have some window to squeeze into
It seems that if you don't clean up the connection buffer, Paramiko goes nuts when working with HP ProCurves. First off, you need to invoke a shell or Paramiko will simply drop the connection after the first command (normal behavior, but confusing).
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(switch_ip, username=switch_user, password=switch_pass, look_for_keys=False)
conn = ssh.invoke_shell()
recieveData() # <-- see below
It's important to actually handle the data, and as I've learned, you need to make sure Paramiko has actually received all the data before you ask it to do stuff with it. I do this using the following function. You can adjust the sleep as needed; in some cases 0.050 will work fine.
def recieveData():
    tCheck = 0
    while not conn.recv_ready():
        time.sleep(1)
        tCheck += 1
        if tCheck >= 10:
            print "time out"
    cleanThatStuffUp(conn.recv(1024))  # <-- see below
This is an example of the garbage that is returned to your SSH client:
[1;24r[24;1H[24;1H[2K[24;1H[?25h[24;1H[24;1HProCurve Switch 2650# [24;1H[24;23H[24;1H[?5h[24;23H[24;23Hconfigure[24;23H[?25h[24;32H[24;0HE[24;1H[24;32H[24;1H[2K[24;1H[?5h[24;1H[1;24r[24;1H[1;24r[24;1H[24;1H[2K[24;1H[?25h[24;1H[24;1H
There are also escape codes to deal with before each "[". So to deal with that, I figured out some regex to clean all of that "stuff" up.
procurve_re1 = re.compile(r'(\[\d+[HKJ])|(\[\?\d+[hl])|(\[\d+)|(\;\d+\w?)')
procurve_re2 = re.compile(r'([E]\b)')
procurve_re3 = re.compile(ur'[\u001B]+')  # remove stupid escapes

def cleanThatStuffUp(message):
    message = procurve_re1.sub("", message)
    message = procurve_re2.sub("", message)
    message = procurve_re3.sub("", message)
    print message
Now you can go about entering commands, just make sure you clear out the buffer each time using recieveData().
conn.send("\n") # Get past "Press any key"
recieveData()