RabbitMQ broken pipe error or lost messages

Using the pika library's BlockingConnection to connect to RabbitMQ, I occasionally get an error when publishing messages:
Fatal Socket Error: error(32, 'Broken pipe')
This is from a very simple sub-process that takes some information out of an in-memory queue and sends a small JSON message into AMQP. The error only seems to come up when the system hasn't sent any messages for a few minutes.
Setup:
self.connection = pika.BlockingConnection(parameters)
self.channel = self.connection.channel()
self.channel.exchange_declare(
    exchange='xyz',
    exchange_type='fanout',
    passive=False,
    durable=True,
    auto_delete=False
)
Enqueue code catches any connection errors and retries:
def _enqueue(self, message_id, data):
    try:
        published = self.channel.basic_publish(
            self.amqp_exchange,
            self.amqp_routing_key,
            json.dumps(data),
            pika.BasicProperties(
                content_type="application/json",
                delivery_mode=2,
                message_id=message_id
            )
        )
        # Confirm delivery or retry
        if published:
            self.retry_count = 0
        else:
            raise EnqueueException("Message publish not confirmed.")
    except (EnqueueException, pika.exceptions.AMQPChannelError, pika.exceptions.AMQPConnectionError,
            pika.exceptions.ChannelClosed, pika.exceptions.ConnectionClosed, pika.exceptions.UnexpectedFrameError,
            pika.exceptions.UnroutableError, socket.timeout) as e:
        self.retry_count += 1
        if self.retry_count < 5:
            logging.warning("Reconnecting and resending")
            if self.connection.is_open:
                self.connection.close()
            self.connect()
            self._enqueue(message_id, data)
        else:
            raise e
This sometimes works on the second attempt. It often hangs for a while or just throws away messages before eventually throwing an exception (possibly related bug report). Since it only happens when the system is quiet for a few minutes I'm guessing it's due to a connection timeout. But AMQP has a heartbeat system and pika reportedly uses it (related bug report).
Why do I get this error or lose messages, and why won't the connection stay open when not in use?

From another bug report:
As BlockingConnection doesn't handle heartbeats in the background and the heartbeat_interval can't override the servers suggested heartbeat interval (that's a bug too), i suggest that heartbeats should be disabled by default (rely on TCP keep-alive instead).
If processing a task in a consume block takes longer time then the server suggested heartbeat interval, the connection will be closed by the server and the client won't be able to ack the message when it's done processing.
An update in v1.0.0 may help with the issue.
So I implemented a workaround. Every 30 seconds I publish a heartbeat message through the queue. This keeps the connection open and has the added benefit of confirming to clients that my application is up and running.
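A minimal sketch of that workaround, assuming the sub-process already blocks on an in-memory queue as described in the question. The names pump, STOP and the heartbeat message body are hypothetical, and publish stands in for whatever wraps channel.basic_publish in your own code.

```python
import json
import queue

HEARTBEAT_INTERVAL = 30  # seconds, as in the workaround above
STOP = object()          # hypothetical sentinel that ends the loop


def pump(work_queue, publish, interval=HEARTBEAT_INTERVAL):
    """Forward queued items to publish(); publish a heartbeat when idle.

    publish is an assumed callable taking one JSON string, for example a
    thin wrapper around channel.basic_publish.
    """
    while True:
        try:
            # Wait for work, but never longer than the heartbeat interval.
            item = work_queue.get(timeout=interval)
        except queue.Empty:
            # Nothing arrived in time: send a heartbeat so the link stays busy.
            publish(json.dumps({"type": "heartbeat"}))
            continue
        if item is STOP:
            return
        publish(json.dumps(item))
```

Because the heartbeat is sent from the same loop that publishes real messages, only one thread ever touches the connection.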

A broken pipe error (errno 32, EPIPE) is raised when one side writes to a socket after the other side has already closed the connection.
As far as I can see, you have a shared self.connection. Could it be closed earlier, or from a parallel thread?
You could also set the logging level to DEBUG and check the client's log to find the moment when the connection gets closed.

Related

1006 Connection closed abnormally error with python 3.7 websockets

I'm having the same problem as this github issue with python websockets:
https://github.com/aaugustin/websockets/issues/367
The proposed solution isn't working for me though. The error I'm getting is:
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), no reason
This is my code:
async def get_order_book(symbol):
    with open('test.csv', 'a+') as csvfile:
        csvw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        DT = Data(data=data, last_OB_id=ob_id, last_TR_id=tr_id, sz=10, csvw=csvw)
        while True:
            if not websocket.open:
                print('Reconnecting')
                websocket = await websockets.connect(ws_url)
            else:
                resp = await websocket.recv()
                update = ujson.loads(resp)
                DT.update_data(update)
async def get_order_books():
    r = requests.get(url='https://api.binance.com/api/v1/ticker/24hr')
    await asyncio.gather(*[get_order_book(data['symbol']) for data in r.json()])
if __name__ == '__main__':
    asyncio.run(get_order_books())
The way I've been testing it is by closing my internet connection, but after a ten second delay it still returns the 1006 error.
I'm running Python 3.7 and Websockets 7.0.
Let me know what your thoughts are, thanks!

I encountered the same problem.
After digging a while I found multiple versions of the answer that tells to just reconnect, but I didn't think it was a reasonable route, so I dug some more.
Enabling DEBUG level logging, I found out that python websockets sends ping packets by default and, when it fails to receive a response, times out the connection. I am not sure if this lines up with the standard, but at least JavaScript websockets are completely fine with the server that my Python script times out with.
The fix is simple: add another kw argument to connect:
websockets.connect(uri, ping_interval=None)
The same argument should also work for server side function serve.
More info at https://websockets.readthedocs.io/en/stable/api.html

So I found the solution:
When the connection closes, it breaks out of the while loop for some reason. So in order to keep the websocket running you have to surround
resp = await websocket.recv()
with try ... except and have
print('Reconnecting')
websocket = await websockets.connect(ws_url)
in the exception handling part.
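The try ... except approach described above can be sketched as follows. This is an illustration under assumptions, not the library's own reconnect logic: receive_forever, connect, handle and closed_errors are hypothetical names, and connect is any callable whose result can be awaited to obtain an object with an async recv() method.

```python
import asyncio


async def receive_forever(connect, handle, closed_errors=(ConnectionError,),
                          retry_delay=1.0, max_reconnects=None):
    """Read messages in a loop and reconnect when recv() raises.

    connect       -- assumed callable; awaiting its result yields a connection
    handle        -- callback invoked with every received message
    closed_errors -- exception types that mean "the connection is gone"
    """
    reconnects = 0
    ws = await connect()
    while True:
        try:
            message = await ws.recv()
        except closed_errors:
            if max_reconnects is not None and reconnects >= max_reconnects:
                raise  # give up and let the caller decide what to do
            reconnects += 1
            await asyncio.sleep(retry_delay)  # avoid a tight reconnect loop
            ws = await connect()
            continue
        handle(message)
```

With the websockets library, the call would presumably look like receive_forever(lambda: websockets.connect(ws_url), DT.update_data, closed_errors=(websockets.exceptions.ConnectionClosed,)). An error raised by connect() itself during a reconnect is not caught in this sketch.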

I ran into this same issue. The solution by shinola worked for a while, but I would still get errors sometimes.
To handle this I put the connection into a while True: loop and added two separate try except blocks. The consumer variable is a function that processes the messages received from the websocket connection.
async def websocketConnect(uri, payload, consumer):
    websocket = await websockets.connect(uri, ssl=True)
    await websocket.send(json.dumps(payload))
    while True:
        if not websocket.open:
            try:
                print('Websocket is NOT connected. Reconnecting...')
                websocket = await websockets.connect(uri, ssl=True)
                await websocket.send(json.dumps(payload))
            except:
                print('Unable to reconnect, trying again.')
        try:
            async for message in websocket:
                if message is not None:
                    consumer(json.loads(message))
        except:
            print('Error receiving message from websocket.')
I start the connection using:
def startWebsocket(uri, payload, consumer):
    asyncio.run(websocketConnect(uri, payload, consumer))

I might be a year late, but I was just having this issue. There were no connection issues with my HTML5 WebSocket client, but my .py test client would crash after around a minute (raising 1006 exceptions on both the client and the server). As a test, I started calling await connection.recv() after every frame the client sends. No more issues. I didn't need to receive data in my .py test client, but apparently it causes problems if you let unread data build up. That is probably also why my web version was working fine, since I was handling the .onmessage callbacks.
I'm pretty sure this is why the error occurs. So simply receiving the data is an actual solution, unlike disabling pinging, which breaks a vital function of the protocol.

I think they explained here: https://websockets.readthedocs.io/en/stable/faq.html
it means that the TCP connection was lost. As a consequence, the WebSocket connection was closed without receiving a close frame, which is abnormal.
You can catch and handle ConnectionClosed to prevent it from being logged.
There are several reasons why long-lived connections may be lost:
End-user devices tend to lose network connectivity often and unpredictably because they can move out of wireless network coverage, get unplugged from a wired network, enter airplane mode, be put to sleep, etc.
HTTP load balancers or proxies that aren’t configured for long-lived connections may terminate connections after a short amount of time, usually 30 seconds.

I solved this problem by replacing uvicorn with hypercorn:
hypercorn app:app

How to reconnect to RabbitMQ?

My python script constantly has to send messages to RabbitMQ once it receives one from another data source. The frequency in which the python script sends them can vary, say, 1 minute - 30 minutes.
Here's how I establish a connection to RabbitMQ:
rbt_conn = pika.BlockingConnection(pika.ConnectionParameters("some_host"))
channel = rbt_conn.channel()
I just got an exception
pika.exceptions.ConnectionClosed
How can I reconnect to it? What's the best way? Is there any "strategy"? Is there an ability to send pings to keep a connection alive or set timeout?
Any pointers will be appreciated.

RabbitMQ uses heartbeats to detect and close "dead" connections and to prevent network devices (firewalls etc.) from terminating "idle" connections. From version 3.5.5 on, the default timeout is set to 60 seconds (previously it was ~10 minutes). From the docs:
Heartbeat frames are sent about every timeout / 2 seconds. After two missed heartbeats, the peer is considered to be unreachable.
The problem with Pika's BlockingConnection is that it is unable to respond to heartbeats until some API call is made (for example, channel.basic_publish(), connection.sleep(), etc).
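To make the quoted rule concrete, here is a small arithmetic sketch. heartbeat_schedule is a hypothetical helper, not part of pika or RabbitMQ; it only applies the "timeout / 2" and "two missed heartbeats" figures from the docs excerpt above, and 600 is used as a round stand-in for the "~10 minutes" figure.

```python
def heartbeat_schedule(timeout_s):
    """Return (send interval, tolerated silence) in seconds for a timeout."""
    send_every = timeout_s / 2       # frames are sent about every timeout / 2
    silence_limit = 2 * send_every   # two missed heartbeats => unreachable
    return send_every, silence_limit


for timeout_s in (60, 600):
    send_every, silence_limit = heartbeat_schedule(timeout_s)
    print(f"timeout={timeout_s}s: heartbeat about every {send_every:.0f}s, "
          f"peer considered unreachable after about {silence_limit:.0f}s of silence")
```

Under this reading, a BlockingConnection that makes no API call for longer than the negotiated timeout can be closed by the server.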
The approaches I found so far:
Increase or deactivate the timeout
RabbitMQ negotiates the timeout with the client when establishing the connection. In theory, it should be possible to override the server default value with a bigger one using the heartbeat_interval argument, but the current Pika version (0.10.0) uses the min value between those offered by the server and the client. This issue is fixed on current master.
On the other hand, it is possible to deactivate the heartbeat functionality completely by setting the heartbeat_interval argument to 0, which may well lead to new issues (firewalls dropping connections, etc.).
Reconnecting
Expanding on @itsafire's answer, you can write your own publisher class that reconnects when required. A naive example implementation:
import logging
import json
import pika
class Publisher:
    EXCHANGE = 'my_exchange'
    TYPE = 'topic'
    ROUTING_KEY = 'some_routing_key'
    def __init__(self, host, virtual_host, username, password):
        self._params = pika.connection.ConnectionParameters(
            host=host,
            virtual_host=virtual_host,
            credentials=pika.credentials.PlainCredentials(username, password))
        self._conn = None
        self._channel = None
    def connect(self):
        if not self._conn or self._conn.is_closed:
            self._conn = pika.BlockingConnection(self._params)
            self._channel = self._conn.channel()
            # exchange_type is the keyword used in the question above;
            # older pika releases spelled this argument `type`.
            self._channel.exchange_declare(exchange=self.EXCHANGE,
                                           exchange_type=self.TYPE)
    def _publish(self, msg):
        self._channel.basic_publish(exchange=self.EXCHANGE,
                                    routing_key=self.ROUTING_KEY,
                                    body=json.dumps(msg).encode())
        logging.debug('message sent: %s', msg)
    def publish(self, msg):
        """Publish msg, reconnecting if necessary."""
        try:
            self._publish(msg)
        except pika.exceptions.ConnectionClosed:
            logging.debug('reconnecting to queue')
            self.connect()
            self._publish(msg)
    def close(self):
        if self._conn and self._conn.is_open:
            logging.debug('closing queue connection')
            self._conn.close()
Other possibilities
Other possibilities which I haven't explored yet:
Using an asynchronous adapter for publishing
Keeping your RabbitMQ connection and your "publish" code on a background thread, which periodically calls connection.sleep() to respond to server heartbeats.
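The background-thread idea from the last item could look roughly like this. It is a sketch under assumptions: publisher_loop, STOP and outbox are hypothetical names, connection is expected to offer a sleep(seconds) method as pika's BlockingConnection does, and publish is an assumed callable wrapping channel.basic_publish. Only the loop's own thread touches the connection; other threads hand messages over through the queue.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel used to end the loop


def publisher_loop(connection, publish, outbox, idle_step=1.0):
    """Own the connection in one thread: publish queued messages and
    service heartbeats whenever the outbox is empty."""
    while True:
        try:
            message = outbox.get_nowait()
        except queue.Empty:
            # connection.sleep() blocks for idle_step seconds while still
            # processing I/O, which lets the client answer heartbeats.
            connection.sleep(idle_step)
            continue
        if message is STOP:
            return
        publish(message)


def start_publisher(connection, publish, idle_step=1.0):
    """Start the loop in a daemon thread; returns (thread, outbox)."""
    outbox = queue.Queue()
    thread = threading.Thread(
        target=publisher_loop,
        args=(connection, publish, outbox, idle_step),
        daemon=True,
    )
    thread.start()
    return thread, outbox
```

Producers then call outbox.put(message) from any thread, and outbox.put(STOP) shuts the loop down. A message may wait up to idle_step seconds before it is published, so that value trades latency against wake-ups.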

Dead simple: some pattern like this.
import time
import pika
while True:
    try:
        communication_handles = connect_pika()
        do_your_stuff(communication_handles)
    except pika.exceptions.ConnectionClosed:
        print('oops. lost connection. trying to reconnect.')
        # avoid rapid reconnection on longer RMQ server outage
        time.sleep(0.5)
You will probably have to refactor your code, but basically it is about catching the exception, mitigating the problem and continuing to do your stuff.
The communication_handles contain all the pika elements like channels, queues and whatever that your stuff needs to communicate with RabbitMQ via pika.
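A slightly more defensive variant of the same pattern, sketched with a capped exponential backoff and a retry limit. Everything here is an assumption layered on top of the answer: run_with_reconnect, connect, work and retry_on are hypothetical names, and sleep is injectable only so the function can be exercised without waiting.

```python
import time


def run_with_reconnect(connect, work, retry_on, max_attempts=5,
                       base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call work(connect()) and retry after connection errors.

    connect  -- assumed callable returning the communication handles
    work     -- assumed callable doing the actual publishing or consuming
    retry_on -- tuple of exception types that should trigger a reconnect
    """
    failures = 0
    while True:
        try:
            handles = connect()
            return work(handles)
        except retry_on:
            failures += 1
            if failures >= max_attempts:
                raise  # the outage lasted too long; surface the error
            # 0.5 s, 1 s, 2 s, ... capped at max_delay
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

With pika, retry_on would presumably be (pika.exceptions.ConnectionClosed,) as in the loop above. The failure counter is never reset in this sketch, so max_attempts bounds the total number of failed tries.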

RabbitMQ heartbeat vs connection drain events timeout

I have a rabbitmq server and a amqp consumer (python) using kombu.
I have installed my app in a system that has a firewall that closes idle connections after 1 hour.
This is my amqp_consumer.py:
try:
    # connections
    with Connection(self.broker_url, ssl=_ssl, heartbeat=self.heartbeat) as conn:
        chan = conn.channel()
        # more stuff here
        with conn.Consumer(queue, callbacks = [messageHandler], channel = chan):
            # Process messages and handle events on all channels
            while True:
                conn.drain_events()
except Exception as e:
    # do stuff
What I want is to reconnect if the firewall closes the connection. Should I use the heartbeat argument, or should I pass a timeout argument (of 3600 sec) to the drain_events() function?
What are the differences between the two options? (They seem to do the same thing.)
Thanks.

drain_events on its own would not produce any heartbeats unless there are messages to consume and acknowledge. If the queue is idle, the connection will eventually be closed (by the RabbitMQ server or by your firewall).
What you should do is use both the heartbeat and the timeout like so:
while True:
    try:
        conn.drain_events(timeout=1)
    except socket.timeout:
        conn.heartbeat_check()
This way even if the queue is idle the connection won't be closed.
Besides that you might want to wrap the whole thing with a retry policy in case the connection does get closed or some other network error.
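One way the loop and the retry policy might fit together is sketched below. It assumes conn behaves like a kombu Connection created with a heartbeat value, offering drain_events(timeout=...), heartbeat_check(), close() and a connection_errors tuple; consume, consume_forever, make_connection and should_stop are hypothetical names, and setting up the Consumer is left to make_connection.

```python
import socket
import time


def consume(conn, should_stop, poll_timeout=1):
    """Drain events; on every idle timeout, run the heartbeat check."""
    while not should_stop():
        try:
            conn.drain_events(timeout=poll_timeout)
        except socket.timeout:
            conn.heartbeat_check()


def consume_forever(make_connection, should_stop, retry_delay=1.0,
                    sleep=time.sleep):
    """Re-create the connection whenever it is lost."""
    while not should_stop():
        conn = make_connection()
        try:
            consume(conn, should_stop)
        except conn.connection_errors:
            # The broker or a firewall closed the link: wait, then rebuild it.
            sleep(retry_delay)
        finally:
            conn.close()
```

The short drain_events timeout is what gives the loop a chance to send heartbeats, while the outer loop covers the case where the firewall drops the connection anyway.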

twisted: test if connection exists before writing to transport

Is there a possibility to test if the connection still exists before executing a transport.write()?
I have modified the simpleserv/simpleclient examples so that a message is sent (written to Protocol.transport) every 5 seconds. The connection is persistent.
When disconnecting my wifi, it still writes to transport (of course the messages don't arrive on the other side) but no error is thrown.
When enabling the wifi again, the messages are being delivered, but the next attempt to send a message fails (and Protocol.connectionLost is called).
Here again what happens chronologically:
1. Sending a message establishes the connection, the message is delivered.
2. Disabling wifi
3. Sending a message writes to transport, does not throw an error, the message does not arrive
4. Enabling wifi
5. Message sent in 3. arrives
6. Sending a message results in Protocol.connectionLost call
It would be nice to know before executing step 6 if I can write to transport. Is there any way?
Server:
# Copyright (c) Twisted Matrix Laboratories.
# See LICENSE for details.
from twisted.internet import reactor, protocol
class Echo(protocol.Protocol):
    """This is just about the simplest possible protocol"""
    def dataReceived(self, data):
        "As soon as any data is received, write it back."
        print
        print data
        self.transport.write(data)
def main():
    """This runs the protocol on port 8000"""
    factory = protocol.ServerFactory()
    factory.protocol = Echo
    reactor.listenTCP(8000,factory)
    reactor.run()
# this only runs if the module was *not* imported
if __name__ == '__main__':
    main()
Client:
# Copyright (c) Twisted Matrix Laboratories.
# See LICENSE for details.
"""
An example client. Run simpleserv.py first before running this.
"""
from twisted.internet import reactor, protocol
# a client protocol
counter = 0
class EchoClient(protocol.Protocol):
    """Once connected, send a message, then print the result."""
    def connectionMade(self):
        print 'connectionMade'
    def dataReceived(self, data):
        "As soon as any data is received, write it back."
        print "Server said:", data
    def connectionLost(self, reason):
        print "connection lost"
    def say_hello(self):
        global counter
        counter += 1
        msg = '%s. hello, world' %counter
        print 'sending: %s' %msg
        self.transport.write(msg)
class EchoFactory(protocol.ClientFactory):
    def buildProtocol(self, addr):
        self.p = EchoClient()
        return self.p
    def clientConnectionFailed(self, connector, reason):
        print "Connection failed - goodbye!"
    def clientConnectionLost(self, connector, reason):
        print "Connection lost - goodbye!"
    def say_hello(self):
        self.p.say_hello()
        reactor.callLater(5, self.say_hello)
# this connects the protocol to a server running on port 8000
def main():
    f = EchoFactory()
    reactor.connectTCP("REMOTE_SERVER_ADDR", 8000, f)
    reactor.callLater(5, f.say_hello)
    reactor.run()
# this only runs if the module was *not* imported
if __name__ == '__main__':
    main()

Protocol.connectionLost is the only way to know when the connection no longer exists. It is also called at the earliest time when it is known that the connection no longer exists.
It is obvious to you or me that disconnecting your network adapter (i.e., turning off your wifi card) will break the connection, at least if you leave it off or if you configure it differently when you turn it back on again. It's not obvious to your platform's TCP implementation though.
Since network communication isn't instant and any individual packet may be lost for normal (non-fatal) reasons, TCP includes various timeouts and retries. When you disconnect your network adapter these packets can no longer be delivered but the platform doesn't know that this condition will outlast the longest TCP timeout. So your TCP connection doesn't get closed when you turn off your wifi. It hangs around and starts retrying the send and waiting for an acknowledgement.
At some point the timeouts and retries all expire and the connection really does get closed (although the way TCP works means that if there is no data waiting to be sent then there actually isn't a timeout, a "dead" connection will live forever; addressing this is the reason the TCP "keepalive" feature exists). This is made slightly more complicated by the fact that there are timeouts on both sides of the connection. If the connection closes as soon as you do the write in step six (and no sooner) then the cause is probably a "reset" (RST) packet.
A reset will occur after the timeout on the other side of the connection expires and closes the connection while the connection is still open on your side. Now when your side sends a packet for this TCP connection the other side won't recognize the TCP connection it belongs to (because as far as the other side is concerned that connection no longer exists) and reply with a reset message. This tells the original sender that there is no such connection. The original sender reacts to this by closing its side of the connection (since one side of a two-sided connection isn't very useful by itself). This is presumably when Protocol.connectionLost is called in your application.
All of this is basically just how TCP works. If the timeout behavior isn't suitable for your application then you have a couple options. You could turn on TCP keepalives (this usually doesn't help, by default TCP keepalives introduce timeouts that are hours long though you can tune this on most platforms) or you could build an application-level keepalive feature. This is simply some extra traffic that your protocol generates and then expects a response to. You can build your own timeouts (no response in 3 seconds? close the connection and establish a new one) on top of this or just rely on it to trigger one of the somewhat faster (~2 minute) TCP timeouts. The downside of a faster timeout is that spurious network issues may cause you to close the connection when you really didn't need to.
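The application-level keepalive described above boils down to a little bookkeeping, sketched here independently of Twisted. KeepaliveMonitor and its method names are hypothetical; the 3-second response timeout is the figure from the paragraph above, and the 1-second ping interval is an assumed example value.

```python
import time


class KeepaliveMonitor:
    """Track when to send a ping and when to declare the peer dead."""

    def __init__(self, ping_interval=1.0, response_timeout=3.0,
                 clock=time.monotonic):
        self._clock = clock
        self.ping_interval = ping_interval
        self.response_timeout = response_timeout
        now = clock()
        self._last_ping = now
        self._last_response = now

    def ping_due(self):
        """True when it is time to send the next keepalive message."""
        return self._clock() - self._last_ping >= self.ping_interval

    def mark_ping_sent(self):
        self._last_ping = self._clock()

    def mark_response(self):
        """Call whenever any data arrives from the peer."""
        self._last_response = self._clock()

    def connection_dead(self):
        """True when the peer has been silent longer than the timeout."""
        return self._clock() - self._last_response > self.response_timeout
```

In the client from the question, one plausible wiring is to call mark_response() from dataReceived, check ping_due() and connection_dead() from a periodic timer, and close the transport and reconnect once connection_dead() returns True.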

zeromq: how to prevent infinite wait?

I just got started with ZMQ. I am designing an app whose workflow is:
one of many clients (who have random PULL addresses) PUSH a request to a server at 5555
the server is forever waiting for client PUSHes. When one comes, a worker process is spawned for that particular request. Yes, worker processes can exist concurrently.
When that process completes its task, it PUSHes the result to the client.
I assume that the PUSH/PULL architecture is suited for this. Please correct me on this.
But how do I handle these scenarios?
the client_receiver.recv() will wait for an infinite time when server fails to respond.
the client may send request, but it will fail immediately after, hence a worker process will remain stuck at server_sender.send() forever.
So how do I setup something like a timeout in the PUSH/PULL model?
EDIT: Thanks to user938949's suggestions, I got a working answer and I am sharing it for posterity.

If you are using zeromq >= 3.0, then you can set the RCVTIMEO socket option:
client_receiver.RCVTIMEO = 1000 # in milliseconds
But in general, you can use pollers:
poller = zmq.Poller()
poller.register(client_receiver, zmq.POLLIN) # POLLIN for recv, POLLOUT for send
And poller.poll() takes a timeout:
evts = poller.poll(1000) # wait *up to* one second for a message to arrive.
evts will be an empty list if there is nothing to receive.
You can poll with zmq.POLLOUT, to check if a send will succeed.
Or, to handle the case of a peer that might have failed, a:
worker.send(msg, zmq.NOBLOCK)
might suffice, which will always return immediately - raising a ZMQError(zmq.EAGAIN) if the send could not complete.

This was a quick hack I made after referring to user938949's answer and http://taotetek.wordpress.com/2011/02/02/python-multiprocessing-with-zeromq/ . If you do better, please post your answer and I will recommend it.
For those wanting lasting solutions on reliability, refer http://zguide.zeromq.org/page:all#toc64
Version 3.0 of zeromq (beta ATM) supports timeout in ZMQ_RCVTIMEO and ZMQ_SNDTIMEO. http://api.zeromq.org/3-0:zmq-setsockopt
Server
The zmq.NOBLOCK ensures that when a client does not exist, the send() does not block.
import time
import zmq
context = zmq.Context()
ventilator_send = context.socket(zmq.PUSH)
ventilator_send.bind("tcp://127.0.0.1:5557")
i=0
while True:
    i=i+1
    time.sleep(0.5)
    print ">>sending message ",i
    try:
        ventilator_send.send(repr(i),zmq.NOBLOCK)
        print " succeed"
    except:
        print " failed"
Client
The poller object can listen in on many receiving sockets (see the "Python Multiprocessing with ZeroMQ" article linked above). I registered only work_receiver with it. In the infinite loop, the client polls with a timeout of 1000 ms. The socks object is empty if no message has been received in that time.
import time
import zmq
context = zmq.Context()
work_receiver = context.socket(zmq.PULL)
work_receiver.connect("tcp://127.0.0.1:5557")
poller = zmq.Poller()
poller.register(work_receiver, zmq.POLLIN)
# Loop and accept messages from both channels, acting accordingly
while True:
    socks = dict(poller.poll(1000))
    if socks:
        if socks.get(work_receiver) == zmq.POLLIN:
            print "got message ",work_receiver.recv(zmq.NOBLOCK)
    else:
        print "error: message timeout"

The send won't block if you use ZMQ_NOBLOCK, but if you try closing the socket and context, that step will block the program from exiting.
The reason is that the socket waits for a peer so that pending outgoing messages can still be delivered. To close the socket immediately and discard the outgoing messages in the buffer, set ZMQ_LINGER to 0.

If you're only waiting for one socket, rather than create a Poller, you can do this:
if work_receiver.poll(1000, zmq.POLLIN):
    print "got message ",work_receiver.recv(zmq.NOBLOCK)
else:
    print "error: message timeout"
You can use this if your timeout changes depending on the situation, instead of setting work_receiver.RCVTIMEO.
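The per-call timeout can be wrapped in a small helper that retries a few times before giving up. This is a sketch with hypothetical names (recv_or_timeout, poll, recv); the two callables are passed in so the helper itself does not depend on a particular socket object.

```python
def recv_or_timeout(poll, recv, timeout_ms=1000, retries=3):
    """Wait up to retries * timeout_ms for a message, then give up.

    poll -- assumed callable taking a timeout in milliseconds and returning
            a truthy value once a message is ready to be read
    recv -- assumed callable that returns the waiting message
    """
    for _ in range(retries):
        if poll(timeout_ms):
            return recv()
    raise TimeoutError(
        "no message within %d ms" % (timeout_ms * retries)
    )
```

With the socket from the snippet above, the call would presumably be recv_or_timeout(lambda t: work_receiver.poll(t, zmq.POLLIN), lambda: work_receiver.recv(zmq.NOBLOCK)). Raising TimeoutError rather than blocking is what keeps a client from waiting forever on a server that never answers.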
