I am using Python and pika on a Linux environment.
My Message/Topic Receiver keeps crashing when RabbitMQ is not running.
I am wondering whether there is a way to keep the Message/Topic Receiver running when RabbitMQ is not, because RabbitMQ will not be on the same virtual machine as the Message/Topic Receiver.
This also covers the case where RabbitMQ crashes for some reason: the Message/Topic Receiver should keep running, saving me from having to start/restart it again.
As far as I understand, the "Message/Topic Receiver" in your case is the consumer.
It is your application's responsibility to catch the exception that is raised when it tries to connect to a RabbitMQ server that is not running.
For example:
import logging
import time

import pika
from pika.exceptions import (AMQPConnectionError, AuthenticationError,
                             ChannelClosed, ProbableAccessDeniedError,
                             ProbableAuthenticationError)

LOG = logging.getLogger(__name__)


def connect_to_rabbit(creds, conn_params, reconnection_interval=5):
    # creds and conn_params are dicts holding your credentials and
    # connection settings (host, port, virtual_host, ...).
    creds = pika.PlainCredentials(**creds)
    params = pika.ConnectionParameters(credentials=creds, **conn_params)
    try:
        connection = pika.BlockingConnection(params)
        LOG.info("Connection to Rabbit was established")
        return connection
    except (ProbableAuthenticationError, AuthenticationError):
        LOG.error("Authentication failed", exc_info=True)
    except ProbableAccessDeniedError:
        LOG.error("The virtual host is configured wrong!", exc_info=True)
    except ChannelClosed:
        LOG.error("ChannelClosed error", exc_info=True)
    except AMQPConnectionError:
        LOG.error("RabbitMQ server is down or host unreachable")
        LOG.error("Connection attempt timed out!")
        LOG.error("Trying to re-connect to RabbitMQ...")
        time.sleep(reconnection_interval)
        # <here goes your reconnection logic>
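The reconnection logic itself can be as simple as an outer loop around a helper like the one above. A minimal sketch, assuming the helper is the connect_to_rabbit() function shown above and returns None on failure:

import time

def get_connection_forever(creds, conn_params):
    # Hypothetical outer loop: keep calling the connect helper until RabbitMQ
    # becomes reachable, so the consumer process itself never dies just
    # because the broker is down.
    connection = None
    while connection is None:
        connection = connect_to_rabbit(creds, conn_params)
        if connection is None:
            time.sleep(5)  # back off a little between attempts
    return connection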
And as far as making sure that your RabbitMQ server is always up and running, you can:
create a cluster and make your queues durable and HA;
install some kind of process supervision (say monit or supervisord) and configure it to watch the rabbit process. For example:
check process rabbitmq with pidfile /var/run/rabbitmq/pid
    start program = "/etc/init.d/rabbitmq-server start"
    stop program  = "/etc/init.d/rabbitmq-server stop"
    if 3 restarts within 5 cycles then alert
Related
I am writing a Python script to fully boot up a handful of ESXi hosts remotely, and I am having trouble determining when ESXi has finished booting and is ready to receive commands sent over SSH. I am running the script on a Windows host that is hardwired to each ESXi host, and the system is air-gapped, so there are no firewalls in the way and no security software to interfere.
Currently I am doing this: I remote into the chassis through SSH and send the power commands to the ESXi host - this works and has always worked. Then I attempt to SSH into each blade and send the following command:
esxcli system stats uptime get
The command itself doesn't matter; I just need a response to make sure that the host is up. Below is the function I am using to send the SSH commands in hopes of getting a response:
import socket
import time
from socket import error as socket_error

import paramiko


def send_command(ip, port, timeout, retry_interval, cmd, user, password):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    retry_interval = float(retry_interval)
    timeout = int(timeout)
    timeout_start = time.time()
    worked = False
    while worked == False:
        time.sleep(retry_interval)
        try:
            ssh.connect(ip, port, user, password, timeout=5)
            stdin, stdout, stderr = ssh.exec_command(cmd)
            outlines = stdout.readlines()
            resp = ''.join(outlines)
            print(resp)
            worked = True
            return (resp)
        except socket_error as e:
            worked = False
            print(e)
            continue
        except paramiko.ssh_exception.SSHException as e:
            worked = False
            # socket is open, but no SSH service responded
            print(e)
            continue
        except TimeoutError as e:
            print(e)
            worked = False
            pass
        except socket.timeout as e:
            print(e)
            worked = False
            continue
        except paramiko.ssh_exception.NoValidConnectionsError as e:
            print(e)
            worked = False
            continue
        except socket.error as serr:
            print(serr)
            worked = False
            continue
        except IOError as e:
            print(e)
            worked = False
            continue
        except Exception as e:
            print(e)
            worked = False
            continue
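For context, a hypothetical call to the function above while waiting for a blade to finish booting; the IP address, credentials and timing values are placeholders, not the real ones:

resp = send_command(
    ip="192.0.2.10",        # placeholder ESXi host address
    port=22,
    timeout=900,            # intended overall budget (not enforced by the loop above)
    retry_interval=10,      # seconds to wait between attempts
    cmd="esxcli system stats uptime get",
    user="root",
    password="password",
)
print("Host responded:", resp)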
My goal here is to catch all of the exceptions long enough for the host to finish booting and then receive a response. The issue is that sometimes it will loop for several minutes (as expected when booting a system like this), and then it will print
IO error: [Errno 111] Connection refused
And then it drops out of the function/try-except block and never establishes the connection. I know that this is a fault of my exception handling, because when this happens I can stop the script, wait a few minutes, run it again without touching anything else, and the esxcli command will work perfectly and the script will work great.
How do I prevent the Errno 111 error from breaking my loop? Any help is greatly appreciated.
Edit: One possible duct-tape solution could be changing the command to "esxcli system hostname get" and checking the response for the word "Domain". This might work because the IOError seems to be a response and not an exception; I'll have to wait until Monday to test that solution though.
I solved it. It occurred to me that I was handling every exception that any Python code could possibly throw, so my defect wasn't a Python error, which would explain why I wasn't finding anything online about the relationship between Python, SSH and the Errno 111 error.
The printout is in fact a response from the ESXi host, and my code is looking for any response. So I simply changed the esxcli command from requesting the uptime to
esxcli system hostname get
and then threw this into the function:
substring = "Domain"
if substring not in resp:
print(resp)
continue
I am looking for the word "Domain" because that must be there if that call is successful.
How I figured it out: I installed ESXi 7 on an old Intel NUC, turned on SSH in the kickstart script, started the script and then turned on the NUC. The reason I used the NUC is that a fresh install on simple hardware boots up much faster and more quietly than Dell blades! Also, I wrapped the resp variable in a print(type(resp)) line and was able to determine that it was in fact a string and not an error object.
This may not help someone who has a legitimate Errno 111 error; I knew I was going to run into this error each and every time I ran the code, and I just needed to know how to handle it and hold the loop until I got the response I wanted.
Edit: I suppose it would be easier to just filter for the word "errno" and then continue the loop, instead of using a different substring. That would handle all of my use cases and eliminate the need for a different function.
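That simpler filter might look like this inside the retry loop (a sketch under the same assumption that the error text arrives as part of resp):

# Treat any response that still contains "errno" as "not ready yet" and
# stay in the retry loop, regardless of which esxcli command was sent.
if "errno" in resp.lower():
    print(resp)
    continue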
I have a very simple implementation in Python using redis-py to interface with Redis.
As part of development, I am shutting Redis down to simulate a timeout exception.
The problem is that I am setting the timeout to a few seconds, but the connection just sits there without timing out.
from redis import StrictRedis

print('Connecting')
redis_instance = StrictRedis(host=settings.REDIS_HOST,
                             port=settings.REDIS_PORT,
                             db=settings.REDIS_DB,
                             socket_connect_timeout=5,
                             socket_timeout=5,
                             )
print('Setting key')
redis_instance.set('X', 'Y')
print('Key SET')
I can see that it gets as far as the 'Setting key' message, but it doesn't go beyond that or throw a timeout.
Any idea what I am doing wrong?
If you shut Redis down before running the code, redis-py raises the socket exception ConnectionRefusedError and a redis ConnectionError.
You have not connected to Redis yet, so how could the connection time out?
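For completeness, a minimal sketch of catching both cases around the set() call, using redis-py's exception classes (TimeoutError is a subclass of ConnectionError, so it is listed first):

from redis.exceptions import ConnectionError, TimeoutError

try:
    redis_instance.set('X', 'Y')
except TimeoutError:
    print('Redis did not answer within socket_timeout seconds')
except ConnectionError as exc:
    # Raised more or less immediately (e.g. connection refused) when the
    # server is down, which is why no timeout is ever observed.
    print('Could not connect to Redis: %s' % exc)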
Edit:
The main issue is that the 3rd-party RabbitMQ machine seems to kill idle connections every now and then. That's when I start getting "Broken pipe" exceptions. The only way to get comms back to normal is for me to kill the processes and restart them. I assume there's a better way?
--
I'm a little lost here. I am connecting to a 3rd-party RabbitMQ server to push messages to. Every now and then all the sockets on their machine get dropped and I end up getting a "Broken pipe" exception.
I've been told to implement a heartbeat check in my code but I'm not sure how exactly. I've found some info here: http://kombu.readthedocs.org/en/latest/changelog.html#version-2-3-0 but no real example code.
Do I only need to add "?heartbeat=x" to the connection string? Does Kombu do the rest? I see I need to call "Connection.heartbeat_check()" at "x/2". Should I create a periodic task to call this? How does the connection get re-established?
I'm using:
celery==3.0.12
kombu==2.5.4
My code looks like this right now. A simple Celery task gets called to send the message through to the 3rd-party RabbitMQ server (logging and comments removed to keep it short, but basic enough):
from celery.task import Task
from kombu import BrokerConnection
from kombu.pools import producers

# HOSTNAME, PORT, USER_ID, PASSWORD, VHOST, OUT_ROUTING_KEY and EXCHANGE
# come from the project's settings.


class SendMessageTask(Task):
    name = "campaign.backends.send"
    routing_key = "campaign.backends.send"
    ignore_result = True
    default_retry_delay = 60  # 1 minute.
    max_retries = 5

    def run(self, send_to, message, **kwargs):
        payload = "Testing message"
        try:
            conn = BrokerConnection(
                hostname=HOSTNAME,
                port=PORT,
                userid=USER_ID,
                password=PASSWORD,
                virtual_host=VHOST
            )
            with producers[conn].acquire(block=True) as producer:
                publish = conn.ensure(producer, producer.publish,
                                      errback=sending_errback, max_retries=3)
                publish(
                    body=payload,
                    routing_key=OUT_ROUTING_KEY,
                    delivery_mode=2,
                    exchange=EXCHANGE,
                    serializer=None,
                    content_type='text/xml',
                    content_encoding='utf-8'
                )
        except Exception, ex:
            print ex
Thanks for any and all help.
While you certainly can add heartbeat support to a producer, it makes more sense for consumer processes.
Enabling heartbeats means that you have to send heartbeats regularly; e.g. if the heartbeat is set to 1 second, then you have to send a heartbeat at least every second, or the remote will close the connection.
This means that you have to use a separate thread or async I/O to reliably send heartbeats in time, and since a connection cannot be shared between threads, this leaves us with async I/O.
The good news is that you probably won't get much benefit from adding heartbeats to a produce-only connection.
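For a consumer, the usual drain_events pattern looks roughly like this; a minimal sketch assuming kombu >= 2.3 (which added the heartbeat argument), with placeholder broker URL, exchange and queue names:

import socket
from kombu import Connection, Consumer, Exchange, Queue

exchange = Exchange('example-exchange', type='direct')
queue = Queue('example-queue', exchange, routing_key='example-key')

def handle(body, message):
    print(body)
    message.ack()

with Connection('amqp://guest:guest@localhost:5672//', heartbeat=10) as conn:
    with Consumer(conn, queues=[queue], callbacks=[handle]):
        while True:
            try:
                # Wake up well within the heartbeat interval so heartbeats
                # can be sent/verified; the docs suggest checking at rate=2.
                conn.drain_events(timeout=5)
            except socket.timeout:
                conn.heartbeat_check(rate=2)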
I built a RESTful API with Bottle and Python, and everything works fine. The API runs as a daemon on the system; if I stop the daemon from the command line, the service stops cleanly and closes all ports and connections. But when I shut the service down through the API itself, the port stays in the LISTEN state and later in TIME_WAIT, and the port is never freed. From what I've read over the past two days, the problem is that Bottle holds a socket and does not shut the server down cleanly, but I cannot find the solution.
The code that closes the API as a service launches a subprocess from Python, like this:
import logging

from bottle import HTTPError, get, response


@get('/v1.0/services/<id_service>/restart')
def restart_service(id_service):
    try:
        service = __find_a_specific_service(id_service)
        if(service == None or len(service) < 1):
            logging.warning("RESTful URI: /v1.0/services/<id_service>/restart " + id_service + " , restart a specific service, service does not exists")
            response.status = utils.CODE_404
            return utils.convert_to_json(utils.FAILURE, utils.create_failed_resource(utils.WARNING, utils.SERVICES_API_SERVICE_NOT_EXIST))
        else:
            if id_service != "API":
                api.ServiceApi().restart(id_service)
            else:
                import subprocess
                args = '/var/lib/stackops-head/bin/apirestd stop; sleep 5; /var/lib/stackops-head/bin/apirestd start'
                subprocess.Popen(args, shell=True)
            logging.info("RESTful URI: /v1.0/services/<id_service>/restart " + id_service + " , restart a specific service, ready to construct json response...")
            return utils.convert_to_json(utils.SERVICE, None)
    except Exception, e:
        logging.error("Services: Error during the process of restart a specific service. %r", e)
        raise HTTPError(code=utils.CODE_500, output=e.message, exception=e, traceback=None, head
To terminate a bottle process from the outside, send SIGINT.
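A minimal sketch of that approach, assuming the daemon writes its PID to a file (the path below is hypothetical):

import os
import signal

with open('/var/run/apirestd.pid') as f:
    pid = int(f.read().strip())

# Same effect as pressing Ctrl+C in the bottle process: the process exits
# and the OS releases its listening socket.
os.kill(pid, signal.SIGINT)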
If the app exits or is killed, all of its file descriptors/handles, including sockets, are closed by the OS.
You can also use
sudo netstat -anp --tcp
on Linux to check whether the specified port is still owned by some process. Or use
netstat -a -n -b -p tcp
in Windows to do the same thing.
TIME_WAIT is a normal state managed by the OS rather than the app, to keep a connection/port around for a while. Sometimes it is annoying. You can tune how long the OS keeps it, but that is not safe.
#import ssh
import socket
from fabric.operations import run


def connect_and_wait():
    #ssh.config.socket.setdefaulttimeout(5)
    socket.setdefaulttimeout(5)
    print('SSTART')
    run('echo START')
    run('sleep 10')
    run('echo END')
    print('EEND')
The script above prints everything without any error/exception.
Python 2.6.5, Fabric 1.4.2.
socket.setdefaulttimeout() does not work.
ssh.config.socket.setdefaulttimeout() does not work.
fabric.api.env['timeout'] is for the connecting phase only, I suppose.
Fabric uses "lazy" connections to remote hosts and can automatically reconnect when executing task on a host and connection is lost. Seems there is no way to explicitly drop idling connections, but you can close all connections and let fabric reconnect to "active" hosts. fabric.network.disconnect_all() do the trick.