RabbitMQ heartbeat vs connection drain events timeout - python

I have a RabbitMQ server and an AMQP consumer (Python) using kombu.
I have installed my app on a system behind a firewall that closes idle connections after 1 hour.
This is my amqp_consumer.py:
try:
    # connections
    with Connection(self.broker_url, ssl=_ssl, heartbeat=self.heartbeat) as conn:
        chan = conn.channel()
        # more stuff here
        with conn.Consumer(queue, callbacks=[messageHandler], channel=chan):
            # Process messages and handle events on all channels
            while True:
                conn.drain_events()
except Exception as e:
    # do stuff
What I want is that if the firewall closes the connection, I reconnect. Should I use the heartbeat argument, or should I pass a timeout argument (of 3600 sec) to the drain_events() function?
What are the differences between the two options? (They seem to do the same thing.)
Thanks.

drain_events() on its own will not produce any heartbeats unless there are messages to consume and acknowledge. If the queue is idle, the connection will eventually be closed (by the RabbitMQ server or by your firewall).
What you should do is use both the heartbeat and the timeout like so:
import socket

while True:
    try:
        conn.drain_events(timeout=1)
    except socket.timeout:
        conn.heartbeat_check()
This way, even if the queue is idle, the connection won't be closed.
Besides that, you might want to wrap the whole thing in a retry policy in case the connection does get closed or some other network error occurs, as sketched below.
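A minimal sketch of such a retry loop, assuming the broker_url, queue and messageHandler names from the question's amqp_consumer.py:

import socket
import time

from kombu import Connection

while True:
    try:
        with Connection(broker_url, heartbeat=60) as conn:
            with conn.Consumer(queue, callbacks=[messageHandler]):
                while True:
                    try:
                        conn.drain_events(timeout=1)
                    except socket.timeout:
                        conn.heartbeat_check()
    except conn.connection_errors:
        # Connection dropped (firewall, broker restart, ...):
        # back off briefly, then loop around and reconnect.
        time.sleep(1)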

Related

How do I forcibly disconnect all currently connected clients to my TCP or HTTP server during shutdown?

I have a fake HTTP server that I use as a fixture in my testing. At some point in the test, I want to stop the server regardless of any still open connections. Clients on these open connections should get a TCP FIN.
I am aware that production servers usually need to solve a different problem, that of quiescing, sometimes called graceful shutdown. This is the opposite of what I want.
With a standalone process, it is usually possible to simply get the process to quit and the OS will take care of the rest. (Forcibly killing processes is easy, while forcibly killing threads is not.) My fake server is, however, running in a thread of the test process itself, so I don't have this option (and I don't want to externalize it if there is other way around).
I investigated this issue in Python, with the HTTPServer class, where I was not able to find any solution.
I also investigated this in Go, where I found the concept of Contexts, which is close to what I need, but it works the other way around: an HTTP server would propagate a Context that can be used to cancel e.g. a database lookup if a client disconnected.
Edit: looks like Go actually does what I need and has separate graceful and non-graceful shutdown methods, the non-graceful one being net/http#Server.Close.
server = http.server.HTTPServer(...)
thread = threading.Thread(target=server.serve_forever)
thread.start()
# a client has connected ...
server.shutdown()
# at this point I want to have the server stopped,
# without waiting for the request handling to complete
I've implemented the Go solution in Python. When a new client connects, I remember the client socket, and when I want to quit, I shut down all remembered sockets.
It seems to work.
import socket
from http.server import HTTPServer
from typing import Any, List, Tuple

class MyHTTPServer(HTTPServer):
    """Adds a method to the HTTPServer to allow it to exit gracefully"""

    def __init__(self, addr, handler_cls):
        super().__init__(addr, handler_cls)
        self._client_sockets: List[socket.socket] = []
        self.server_killed = False

    def get_request(self) -> Tuple[socket.socket, Any]:
        """Remember the client socket"""
        sock, addr = super().get_request()
        self._client_sockets.append(sock)
        return sock, addr

    def shutdown_request(self, request: socket.socket) -> None:
        """Forget the client socket"""
        self._client_sockets.remove(request)
        print(f"{self._client_sockets=}")
        super().shutdown_request(request)

    def force_disconnect_clients(self) -> None:
        """Shut down the remembered sockets"""
        for client in self._client_sockets:
            client.shutdown(socket.SHUT_RDWR)
Usage:
server = MyHTTPServer(server_addr, MyRequestHandler)

# in a new thread
while not server.server_killed:
    server.handle_request()

# ... use the server (keep in mind it can have at most one client at a time) ...

# in the main program
server.server_killed = True
server.force_disconnect_clients()
server.server_close()
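Putting the pieces together, a self-contained sketch of the fixture could look like the following. MyRequestHandler and the address are made-up placeholders, and server.timeout is set so that handle_request() wakes up periodically to re-check the kill flag:

import threading
from http.server import BaseHTTPRequestHandler

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = MyHTTPServer(("127.0.0.1", 8080), MyRequestHandler)
server.timeout = 0.5  # handle_request() returns after at most 0.5 s

def serve():
    while not server.server_killed:
        server.handle_request()

thread = threading.Thread(target=serve)
thread.start()

# ... run the test against http://127.0.0.1:8080 ...

server.server_killed = True
server.force_disconnect_clients()  # open clients get a TCP FIN
server.server_close()
thread.join()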

Python: Multithreaded socket server runs endlessly when client stops unexpectedly

I have created a multithreaded socket server to connect many clients to the server using Python. If a client stops unexpectedly due to an exception, the server runs nonstop. Is there a way to kill that particular thread alone in the server while keeping the rest running?
Server:
class ClientThread(Thread):
    def __init__(self, ip, port):
        Thread.__init__(self)
        self.ip = ip
        self.port = port
        print("New server socket thread started for " + ip + ":" + str(port))

    def run(self):
        while True:
            try:
                message = conn.recv(2048)
                dataInfo = message.decode('ascii')
                print("recv:::::" + str(dataInfo) + "::")
            except:
                print("Unexpected error:", sys.exc_info()[0])
                Thread._stop(self)

tcpServer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcpServer.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
tcpServer.bind((TCP_IP, 0))
tcpServer.listen(10)
print("Port:" + str(tcpServer.getsockname()[1]))
threads = []
while True:
    print("Waiting for connections from clients...")
    (conn, (ip, port)) = tcpServer.accept()
    newthread = ClientThread(ip, port)
    newthread.start()
    threads.append(newthread)
for t in threads:
    t.join()
Client:
def Main():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, int(port)))
    while True:
        try:
            message = input("Enter Command")
            s.send(message.encode('ascii'))
        except Exception as ex:
            logging.exception("Unexpected error:")
            break
    s.close()
Sorry about a very, very long answer but here goes.
There are quite a few issues with your code. First of all, your client does not actually close the socket, as s.close() will never be executed: the loop is interrupted at break, and anything that follows it is skipped. Change the order of these statements for the sake of good programming, but it has nothing to do with your problem.
Your server code is wrong in quite a few ways. As currently written, it never exits, and your threads do not work right either. I have fixed your code so that it is a working multithreaded server, but it still does not exit, as I have no idea what the trigger to make it exit would be. Let us start from the main loop:
while True:
    print("Waiting for connections from clients...")
    (conn, (ip, port)) = tcpServer.accept()
    newthread = ClientThread(conn, ip, port)
    newthread.daemon = True
    newthread.start()
    threads.append(newthread)  # Do we need this?
for t in threads:
    t.join()
I have added passing of conn to your client thread; the reason for this becomes apparent in a moment. However, your while True loop never breaks, so you will never reach the for loop where you join your threads. If your server is meant to run indefinitely, this is not a problem at all: just remove the for loop and this part is fine. You do not need to join threads just for the sake of joining them; joining only allows your program to block until a thread has finished executing.
Another addition is newthread.daemon = True. This makes your threads daemonic, which means they will exit as soon as your main thread exits. Now your server responds to Ctrl+C even when there are active connections.
If your server is meant to be never-ending, there is also no need to store threads in the threads list in your main loop. The list just keeps growing, as a new entry is added every time a client connects and disconnects, and this leaks memory since you are not using the list for anything. I have kept it as it was, but there still is no mechanism to exit the infinite loop.
Then let us move on to your thread. If you want to simplify the code, you can replace the run part with a function; there is no need to subclass Thread in this case, but it works, so I have kept your structure:
class ClientThread(Thread):
    def __init__(self, conn, ip, port):
        Thread.__init__(self)
        self.ip = ip
        self.port = port
        self.conn = conn
        print("New server socket thread started for " + ip + ":" + str(port))

    def run(self):
        while True:
            try:
                message = self.conn.recv(2048)
                if not message:
                    print("closed")
                    try:
                        self.conn.close()
                    except:
                        pass
                    return
                try:
                    dataInfo = message.decode('ascii')
                    print("recv:::::" + str(dataInfo) + "::")
                except UnicodeDecodeError:
                    print("non-ascii data")
                    continue
            except socket.error:
                print("Unexpected error:", sys.exc_info()[0])
                try:
                    self.conn.close()
                except:
                    pass
                return
First of all, we store conn to self.conn. Your version used the global conn variable, which caused unexpected results when you had more than one connection to the server. conn is actually a new socket created for the client connection at accept, and it is unique to each thread. This is how servers differentiate between client connections: they listen on a known port, but when the server accepts a connection, accept creates another socket for that particular connection and returns it. This is why we need to pass it to the thread and then read from self.conn instead of the global conn.
Your server "hung" upon client connection errors because there was no mechanism in your loop to detect this. If the client closes the connection, socket.recv() does not raise an exception but returns nothing. This is the condition you need to detect. I am fairly sure you do not even need try/except here, but it does not hurt; you do, however, need to name the exception you are expecting. In this case catching everything with a bare except is just wrong. You also have another statement there that can raise exceptions: if your client sends something that cannot be decoded with the ascii codec, you get a UnicodeDecodeError (try this without error handling, telnet to your server port, and copy-paste some Hebrew or Japanese into the connection to see what happens). If you just caught everything and treated it as a socket error, you would enter the thread-ending part of the code just because you could not parse a message. Typically we just ignore "illegal" messages and carry on, which is what I have added. If you want to shut down the connection upon receiving a "bad" message, just add self.conn.close() and return to this exception handler as well.
Then, when you really do encounter a socket error, or the client has closed the connection, you need to close the socket and exit the thread. Call close() on the socket, encapsulating it in try/except, as you do not really care if it fails because the socket is not there anymore.
And when you want to exit your thread, you just return from your run() loop. When you do this, your thread exits in an orderly manner. As simple as that.
Then there is yet another potential problem, if you are not only printing the messages but are parsing them and doing something with the data you receive. This I do not fix but leave this to you.
TCP sockets transmit data, not messages. When you build a communication protocol, you must not assume that when recv() returns, it returns exactly one message. When recv() returns something, it can mean one of five things:
1. The client has closed the connection and nothing is returned
2. There is exactly one full message and you receive it
3. There is only a partial message, either because you read the socket before the client had transmitted all data, or because the client sent more than 2048 bytes (even if your client never sends over 2048 bytes, a malicious client would definitely try this)
4. There is more than one message waiting and you receive them all
5. As in 4, but the last message is partial
Most socket programming mistakes are related to this. The programmer expects case 2 to happen (as you do now) and does not cater for 3-5. You should instead analyse what was received and act accordingly. If there seems to be less data than a full message, store it somewhere and wait for more data to appear. When more data appears, concatenate it with what you stored and check whether you now have a full message. When you have parsed a full message from this buffer, inspect the buffer to see if there is more data there: the first part of the next message, or even more full messages if your client is fast and your server is slow. If you process a message and then wipe the buffer, you might also wipe bytes of the next message.
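For illustration, a minimal sketch of this buffering approach using newline-delimited framing; handle_message() is a placeholder for whatever the application does with a complete message:

buffer = b""
while True:
    chunk = self.conn.recv(2048)
    if not chunk:           # case 1: client closed the connection
        break
    buffer += chunk         # cases 3 and 5: partial data stays in the buffer
    while b"\n" in buffer:  # cases 2 and 4: extract every complete message
        line, buffer = buffer.split(b"\n", 1)
        handle_message(line)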

RabbitMQ broken pipe error or lost messages

Using the pika library's BlockingConnection to connect to RabbitMQ, I occasionally get an error when publishing messages:
Fatal Socket Error: error(32, 'Broken pipe')
This is from a very simple sub-process that takes some information out of an in-memory queue and sends a small JSON message into AMQP. The error only seems to come up when the system hasn't sent any messages for a few minutes.
Setup:
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
channel.exchange_declare(
    exchange='xyz',
    exchange_type='fanout',
    passive=False,
    durable=True,
    auto_delete=False
)
Enqueue code catches any connection errors and retries:
def _enqueue(self, message_id, data):
    try:
        published = self.channel.basic_publish(
            self.amqp_exchange,
            self.amqp_routing_key,
            json.dumps(data),
            pika.BasicProperties(
                content_type="application/json",
                delivery_mode=2,
                message_id=message_id
            )
        )
        # Confirm delivery or retry
        if published:
            self.retry_count = 0
        else:
            raise EnqueueException("Message publish not confirmed.")
    except (EnqueueException, pika.exceptions.AMQPChannelError, pika.exceptions.AMQPConnectionError,
            pika.exceptions.ChannelClosed, pika.exceptions.ConnectionClosed, pika.exceptions.UnexpectedFrameError,
            pika.exceptions.UnroutableError, socket.timeout) as e:
        self.retry_count += 1
        if self.retry_count < 5:
            logging.warning("Reconnecting and resending")
            if self.connection.is_open:
                self.connection.close()
            self.connect()
            self._enqueue(message_id, data)
        else:
            raise e
This sometimes works on the second attempt. It often hangs for a while or simply throws away messages before eventually raising an exception (possibly related bug report). Since it only happens when the system has been quiet for a few minutes, I'm guessing it's due to a connection timeout. But AMQP has a heartbeat system, and pika reportedly uses it (related bug report).
Why do I get this error or lose messages, and why won't the connection stay open when not in use?
From another bug report:
As BlockingConnection doesn't handle heartbeats in the background and the heartbeat_interval can't override the server's suggested heartbeat interval (that's a bug too), I suggest that heartbeats should be disabled by default (rely on TCP keep-alive instead).
If processing a task in a consume block takes longer than the server's suggested heartbeat interval, the connection will be closed by the server, and the client won't be able to ack the message when it's done processing.
An update in v1.0.0 may help with the issue.
So I implemented a workaround: every 30 seconds I publish a heartbeat message through the queue. This keeps the connection open and has the added benefit of confirming to clients that my application is up and running.
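A rough sketch of that pattern, folded into the publisher's main loop. pop_from_in_memory_queue() and enqueue() are placeholders for the real in-memory queue and the _enqueue() method shown above; pika's BlockingConnection is not thread-safe, so the heartbeat is published from the same loop rather than a separate thread:

import time

HEARTBEAT_INTERVAL = 30
last_heartbeat = time.monotonic()

while True:
    data = pop_from_in_memory_queue(timeout=1)  # placeholder
    if data is not None:
        enqueue(data)                           # placeholder for _enqueue()
    if time.monotonic() - last_heartbeat > HEARTBEAT_INTERVAL:
        enqueue({"type": "heartbeat"})          # keeps the connection busy
        last_heartbeat = time.monotonic()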
The Broken Pipe error means that the server is trying to write something into the socket when the connection has been closed on the client's side.
As far as I can see, you have some shared self.connection that may be closed beforehand or in a parallel thread.
You could also set the logging level to DEBUG and look at the client's log to determine the moment when the client closes the connection.
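For example, something along these lines turns on verbose client-side logging, which should show when the connection is being closed:

import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("pika").setLevel(logging.DEBUG)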

Will select.poll detect read event (select.POLLIN) if the socket is closed

I am not able to detect a socket client closing in a particular network. I am running a socket server, and once a client connects I save the client socket and periodically send requests to the client. I then use select.poll to check whether there is any data to be read from the socket, and if there is, I read it. All this works fine so far.
The question is: if the remote socket client is terminated, will select.poll signal a read event on the client socket? If it does, I can check the data length returned by socket.recv to detect that the client has disconnected, as described here.
Adding a code snippet for select
def _wait_for_socket_poller(self, read, write, message=None):
    """
    Instead of a blocking wait, this polls to check whether the read or write socket is ready. If so it proceeds
    with reading from or writing to the socket. The advantage is that while the poll blocks, it yields back to
    the other waiting greenlets; poll blocks because we have not given a timeout
    :param read: The read function
    :param write: The write function
    :param message: The CrowdBox API call
    :return: The result: in case of read, in JSON format; but the catch is that the caller cannot wait on the
             result being available, as else the thread will block
    """
    if not self.client_socket:
        logging.error("CB ID =%d - Connection closed", self.id)
        return
    poller = select.poll()
    # Commonly used flag sets
    READ_ONLY = select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
    WRITE_ONLY = select.POLLOUT
    READ_WRITE = READ_ONLY | select.POLLOUT
    if read and write:
        poller.register(self.client_socket, READ_WRITE)
    elif write:
        poller.register(self.client_socket, WRITE_ONLY)
    elif read:
        poller.register(self.client_socket, READ_ONLY)
    # Map file descriptors to socket objects
    fd_to_socket = {self.client_socket.fileno(): self.client_socket}
    result = ''
    retry = True
    while retry:
        # Poll will block! Using poll instead of select, as the latter runs out of file descriptors under load.
        # Note: poll needs a timeout or it will block; there is no gevent-patched poll, and the moment it blocks,
        # neither greenlets nor Twisted Deferreds can help - everything freezes, as all of this runs in the main thread.
        events = poller.poll(1)
        if not events:
            retry = True
            gevent.sleep(0)  # This is needed to yield in case no input comes from the CB
        else:
            retry = False
    clientsock = None
    fd = None
    flag = None
    for fd, flag in events:
        # Retrieve the actual socket from its file descriptor to map the return of poll to a socket
        clientsock = fd_to_socket[fd]
    if clientsock is None:
        logging.error("Problem Houston")
        raise ValueError("Client socket has become invalid")
    if flag & select.POLLHUP:
        logging.error("Client Socket Closed")
        self.client_socket.close()
        self.client_socket = None
        return None
    if flag & (select.POLLIN | select.POLLPRI):
        if read:
            result = read()
    if flag & select.POLLOUT:
        if write:
            result = write(message)
    # poller.unregister(self.client_socket)
    return result
In general, yes, a socket will be marked as "readable" when a TCP connection is closed. But this assumes that there was a normal closing, meaning a TCP FIN or RST packet.
Sometimes TCP connections don't end that way. In particular, if TCP Keep-Alive is not enabled (and by default it is not), a network outage between server and client could effectively terminate the connection without either side knowing until they try to send data.
So if you want to make sure you are promptly notified when a TCP connection is broken, you need to send keep-alive messages at either the TCP layer or the application layer.
Keep-alive messages have the additional benefit that they can prevent unused connections from being automatically dropped by various network appliances due to long periods of inactivity.
For more on keep-alive, see here: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
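As an illustration, enabling TCP-layer keep-alive on a Python socket could look like this; the three fine-tuning options are Linux-specific and not available on every platform:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific knobs:
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the connection is dropped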
Thought of adding an answer here so that I can post a tcpdump trace. We tested this in a live network. The socket client process on the remote machine was terminated, and Python's socket.send (on a non-blocking socket, client_socket.setblocking(0)) did not return any error for subsequent requests sent to the client from the server. There was no event generated to indicate (EPOLLIN) something to read either.
So, to detect the loss of the client connection, we ping the client periodically, and if there is no expected response after three retries, we disconnect the client. Basically we handled this at the application layer. Clients were also changed to reply with some data to our 'are you alive' requests instead of just ignoring them.
sent = 0
try:
    sent = self.client_socket.send(out)
except socket.error as e:
    if e.args[0] == errno.EPIPE:
        logging.error("Socket connection is closed or broken")
if sent == 0 and self.client_socket is not None:
    logging.error("socket connection is already closed by client, cannot write request")
    self.close_socket_connection()
else:
    pass  # sent successfully
Below is the tcpdump/Wireshark trace where you can see the retransmissions happening (IP details masked for security).

How to reconnect to RabbitMQ?

My Python script constantly has to send messages to RabbitMQ once it receives one from another data source. The frequency with which the script sends them can vary, say, from 1 minute to 30 minutes.
Here's how I establish a connection to RabbitMQ:
rbt_conn = pika.BlockingConnection(pika.ConnectionParameters("some_host"))
channel = rbt_conn.channel()
I just got an exception
pika.exceptions.ConnectionClosed
How can I reconnect to it? What's the best way? Is there any "strategy"? Is there a way to send pings to keep the connection alive, or to set a timeout?
Any pointers will be appreciated.
RabbitMQ uses heartbeats to detect and close "dead" connections and to prevent network devices (firewalls etc.) from terminating "idle" connections. From version 3.5.5 on, the default timeout is set to 60 seconds (previously it was ~10 minutes). From the docs:
Heartbeat frames are sent about every timeout / 2 seconds. After two missed heartbeats, the peer is considered to be unreachable.
The problem with Pika's BlockingConnection is that it is unable to respond to heartbeats until some API call is made (for example, channel.basic_publish(), connection.sleep(), etc).
The approaches I found so far:
Increase or deactivate the timeout
RabbitMQ negotiates the timeout with the client when establishing the connection. In theory, it should be possible to override the server's default value with a bigger one using the heartbeat_interval argument, but the current Pika version (0.10.0) uses the minimum of the values offered by the server and the client. This issue is fixed on current master.
On the other hand, it is possible to deactivate the heartbeat functionality completely by setting the heartbeat_interval argument to 0, which may well drive you into new issues (firewalls dropping connections, etc.).
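For example (the parameter was called heartbeat_interval in Pika 0.10.x; later releases renamed it to heartbeat):

import pika

params = pika.ConnectionParameters(host='some_host', heartbeat_interval=0)  # 0 disables heartbeats
connection = pika.BlockingConnection(params)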
Reconnecting
Expanding on #itsafire's answer, you can write your own publisher class that lets you reconnect when required. An example naive implementation:
import logging
import json
import pika

class Publisher:
    EXCHANGE = 'my_exchange'
    TYPE = 'topic'
    ROUTING_KEY = 'some_routing_key'

    def __init__(self, host, virtual_host, username, password):
        self._params = pika.connection.ConnectionParameters(
            host=host,
            virtual_host=virtual_host,
            credentials=pika.credentials.PlainCredentials(username, password))
        self._conn = None
        self._channel = None

    def connect(self):
        if not self._conn or self._conn.is_closed:
            self._conn = pika.BlockingConnection(self._params)
            self._channel = self._conn.channel()
            self._channel.exchange_declare(exchange=self.EXCHANGE,
                                           type=self.TYPE)

    def _publish(self, msg):
        self._channel.basic_publish(exchange=self.EXCHANGE,
                                    routing_key=self.ROUTING_KEY,
                                    body=json.dumps(msg).encode())
        logging.debug('message sent: %s', msg)

    def publish(self, msg):
        """Publish msg, reconnecting if necessary."""
        try:
            self._publish(msg)
        except pika.exceptions.ConnectionClosed:
            logging.debug('reconnecting to queue')
            self.connect()
            self._publish(msg)

    def close(self):
        if self._conn and self._conn.is_open:
            logging.debug('closing queue connection')
            self._conn.close()
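Usage with placeholder credentials might look like this:

publisher = Publisher('localhost', '/', 'guest', 'guest')
publisher.connect()
publisher.publish({'event': 'something_happened'})
publisher.close()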
Other possibilities
Other possibilities which I have not yet explored:
Using an asynchronous adapter for publishing
Keeping your RabbitMQ connection and your "publish" code on a background thread which periodically calls connection.sleep() to respond to server heartbeats (sketched below)
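A rough, untested sketch of that background-thread idea; the exchange and routing key are made-up placeholders, and a single thread owns the connection so that pika's thread-unsafe BlockingConnection is never shared:

import queue
import threading

import pika

def publisher_thread(params, outbox):
    conn = pika.BlockingConnection(params)
    channel = conn.channel()
    while True:
        try:
            msg = outbox.get(timeout=1)
            channel.basic_publish(exchange='my_exchange',
                                  routing_key='some_routing_key',
                                  body=msg)
        except queue.Empty:
            conn.sleep(1)  # lets pika process heartbeat frames while idle

outbox = queue.Queue()
threading.Thread(target=publisher_thread,
                 args=(pika.ConnectionParameters('some_host'), outbox),
                 daemon=True).start()
outbox.put(b'hello')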
Dead simple: some pattern like this.
import time

while True:
    try:
        communication_handles = connect_pika()
        do_your_stuff(communication_handles)
    except pika.exceptions.ConnectionClosed:
        print('oops. lost connection. trying to reconnect.')
        # avoid rapid reconnection on longer RMQ server outage
        time.sleep(0.5)
You will probably have to refactor your code, but basically it is about catching the exception, mitigating the problem, and carrying on.
The communication_handles contain all the pika elements, like channels and queues, that your code needs to communicate with RabbitMQ via pika.
