A good heartbeat interval for pika-rabbitmq in Amazon ec2

I am using the latest pika library (0.9.9+) for RabbitMQ. My usage of RabbitMQ and pika is as follows:
I have long-running tasks (about 5 minutes) as workers. These tasks take their requests from RabbitMQ. The requests come very infrequently, i.e. there is a long idle time between requests.
The problem I was facing previously was connections being closed because they were idle. So I enabled heartbeats in pika.
Now choosing the heartbeat interval is the problem. Pika seems to be a single-threaded library in which heartbeats are received and acknowledged only in the time between requests.
So, if the heartbeat interval is shorter than the time the callback function spends on its long-running computation, the server does not receive any heartbeat acknowledgements and closes the connection.
So I assume that, with a blocking connection, the minimum heartbeat interval should be the maximum computation time of the callback function.
What would be a good heartbeat value on Amazon EC2 to prevent it from closing idle connections?
Also, some suggest using RabbitMQ keepalive (or libkeepalive) to maintain TCP connections. I think managing heartbeats at the TCP layer is much better because the application need not manage them. Is this true? Is keepalive a good method compared to RabbitMQ heartbeats?
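For reference, here is a minimal sketch of what keepalive at the TCP layer looks like from Python. It assumes direct access to the underlying socket, which a client library may or may not expose. The function name enable_tcp_keepalive and the timing values are made up for illustration.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an already created socket.

    idle, interval and count are illustrative values, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options are platform specific (they exist on Linux),
    # so only set the ones this platform's socket module exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With this, the kernel sends the probes and the application does not have to, but the probes only show that the TCP peer is alive, not that the application on the other end is responsive.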
I have seen that some suggest using multiple threads and a queue for long-running tasks. But is this the only option for long-running tasks? It is quite disappointing that another queue must be used for this scenario.
Thank you in advance. I think I have detailed the problem. Let me know if I can provide more details.

If you're not tied to using pika, this thread helped me achieve what you're trying to do using kombu:
#!/usr/bin/env python
import time, logging, weakref, eventlet
from kombu import Connection, Exchange, Queue
from kombu.utils.debug import setup_logging
from kombu.common import eventloop
from eventlet import spawn_after
eventlet.monkey_patch()
log_format = ('%(levelname) -10s %(asctime)s %(name) -30s %(funcName) '
              '-35s %(lineno) -5d: %(message)s')
logging.basicConfig(level=logging.INFO, format=log_format)
logger = logging.getLogger('job_worker')
logger.setLevel(logging.INFO)
def long_running_function(body):
    # stands in for the real 5-minute job
    time.sleep(300)
def job_worker(body, message):
    long_running_function(body)
    message.ack()
def monitor_heartbeats(connection, rate=2):
    """Function to send heartbeat checks to RabbitMQ. This keeps the
    connection alive over long-running processes."""
    if not connection.heartbeat:
        logger.info("No heartbeat set for connection: %s" % connection.heartbeat)
        return
    interval = connection.heartbeat
    # weak reference, so the monitor does not keep the connection alive
    cref = weakref.ref(connection)
    logger.info("Starting heartbeat monitor.")
    def heartbeat_check():
        conn = cref()
        if conn is not None and conn.connected:
            conn.heartbeat_check(rate=rate)
            logger.info("Ran heartbeat check.")
            # reschedule itself for the next interval
            spawn_after(interval, heartbeat_check)
    return spawn_after(interval, heartbeat_check)
def main():
    setup_logging(loglevel='INFO')
    # process for heartbeat monitor
    p = None
    try:
        with Connection('amqp://guest:guest@localhost:5672//', heartbeat=300) as conn:
            conn.ensure_connection()
            monitor_heartbeats(conn)
            queue = Queue('job_queue',
                          Exchange('job_queue', type='direct'),
                          routing_key='job_queue')
            logger.info("Starting worker.")
            with conn.Consumer(queue, callbacks=[job_worker]) as consumer:
                consumer.qos(prefetch_count=1)
                for _ in eventloop(conn, timeout=1, ignore_timeouts=True):
                    pass
    except KeyboardInterrupt:
        logger.info("Worker was shut down.")
if __name__ == "__main__":
    main()
I stripped out my domain specific code but essentially this is the framework I use.
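The self-rescheduling idea in monitor_heartbeats does not depend on eventlet. The following is a rough sketch of the same pattern with threading.Timer from the standard library. FakeConnection is a hypothetical stand-in used only to make the sketch runnable, and whether a real connection object can safely be called from a second thread is an assumption this sketch does not check.

```python
import threading
import weakref


def monitor_heartbeats(connection, interval, rate=2):
    """Call connection.heartbeat_check(rate=rate) every `interval` seconds.

    Returns a threading.Event; set it to stop the monitor.
    """
    cref = weakref.ref(connection)  # do not keep the connection alive
    stopped = threading.Event()

    def schedule():
        timer = threading.Timer(interval, heartbeat_check)
        timer.daemon = True  # never block interpreter exit
        timer.start()

    def heartbeat_check():
        conn = cref()
        if stopped.is_set() or conn is None or not conn.connected:
            return  # stop rescheduling
        conn.heartbeat_check(rate=rate)
        schedule()

    schedule()
    return stopped


class FakeConnection:
    """Hypothetical stand-in that only counts heartbeat checks."""
    connected = True

    def __init__(self):
        self.checks = 0
        self.enough = threading.Event()

    def heartbeat_check(self, rate=2):
        self.checks += 1
        if self.checks >= 3:
            self.enough.set()
```

Calling monitor_heartbeats(FakeConnection(), interval=0.01) starts the timer chain, and setting the returned event ends it after the current check.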

Related

How do I forcibly disconnect all currently connected clients to my TCP or HTTP server during shutdown?

I have a fake HTTP server that I use as a fixture in my testing. At some point in the test, I want to stop the server regardless of any still open connections. Clients on these open connections should get a TCP FIN.
I am aware that usually production servers need to solve different problem, that of quiescing, sometimes called graceful shutdown. This is the opposite of what I want.
With a standalone process, it is usually possible to simply get the process to quit and the OS will take care of the rest. (Forcibly killing processes is easy, while forcibly killing threads is not.) My fake server is, however, running in a thread of the test process itself, so I don't have this option (and I don't want to externalize it if there is other way around).
I investigated this issue in Python, with the HTTPServer class, where I was not able to find any solution.
I also investigated this in Go, where I was able to find the concept of Contexts, which is close to what I need, but it works the other way around: a http server would propagate a Context that can be used to cancel e.g. a database lookup if a client disconnected.
Edit: looks like Go actually does what I need and has a separate graceful and nongraceful shutdown methods, with the nongraceful being net/http#Server.Close.
server = http.server.HTTPServer(...)
thread = threading.Thread(target=server.serve_forever)
thread.start()
# a client has connected ....
server.shutdown()
# at this point I want to have the server stopped,
# without waiting for the request handling to complete

I've implemented the Go solution in Python. When a new client connects, I remember the client socket, and when I want to quit, I shut down all remembered sockets.
It seems to work.
import socket
from http.server import HTTPServer
from typing import Any, List, Tuple
class MyHTTPServer(HTTPServer):
    """Adds a method to the HTTPServer to forcibly disconnect its clients"""
    def __init__(self, addr, handler_cls):
        super().__init__(addr, handler_cls)
        self._client_sockets: List[socket.socket] = []
        self.server_killed = False
    def get_request(self) -> Tuple[socket.socket, Any]:
        """Remember the client socket"""
        sock, addr = super().get_request()
        self._client_sockets.append(sock)
        return sock, addr
    def shutdown_request(self, request: socket.socket) -> None:
        """Forget the client socket"""
        self._client_sockets.remove(request)
        print(f"{self._client_sockets=}")
        super().shutdown_request(request)
    def force_disconnect_clients(self) -> None:
        """Shutdown the remembered sockets"""
        for client in self._client_sockets:
            client.shutdown(socket.SHUT_RDWR)
Usage
server = MyHTTPServer(server_addr, MyRequestHandler)
# in a new thread
while not server.server_killed:
    server.handle_request()
# ... use the server (keep in mind it can have at most one client at a time) ...
# in the main program
server.server_killed = True
server.force_disconnect_clients()
server.server_close()
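To check the behaviour end to end, here is a self-contained sketch of the same idea. It is a variation rather than the code above: it uses ThreadingHTTPServer with serve_forever so that more than one client can be connected, and TrackingServer, StuckHandler and demo are names invented for this example. The handler deliberately never answers, which simulates a request that is still in progress at shutdown time.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class TrackingServer(ThreadingHTTPServer):
    """HTTP server that remembers client sockets so it can cut them off."""
    daemon_threads = True

    def __init__(self, addr, handler_cls):
        super().__init__(addr, handler_cls)
        self._clients = set()
        self._lock = threading.Lock()
        self.handler_entered = threading.Event()  # used only by demo()

    def get_request(self):
        sock, addr = super().get_request()
        with self._lock:
            self._clients.add(sock)  # remember the client socket
        return sock, addr

    def shutdown_request(self, request):
        with self._lock:
            self._clients.discard(request)  # forget the client socket
        super().shutdown_request(request)

    def force_disconnect_clients(self):
        with self._lock:
            clients = list(self._clients)
        for sock in clients:
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # socket was already closed


class StuckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.server.handler_entered.set()
        self.rfile.read(1)  # blocks until the socket is shut down

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Connect one client, force-disconnect it, return what the client reads."""
    server = TrackingServer(("127.0.0.1", 0), StuckHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    client = socket.create_connection(server.server_address)
    client.settimeout(5)
    client.sendall(b"GET / HTTP/1.0\r\n\r\n")
    server.handler_entered.wait(timeout=5)  # request is now being handled
    server.force_disconnect_clients()
    data = client.recv(1024)  # empty bytes means the peer closed the stream
    client.close()
    server.shutdown()
    server.server_close()
    return data
```

In demo() the client sees an empty read, i.e. an orderly end of stream, even though the handler never sent a response.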

Is there an easy way to shut down a Python gRPC server gracefully?

Here is a blog explaining how to gracefully shut down a gRPC server in Kotlin.
Is this the only way to do it? Counting live calls and handling SIGTERM manually? This should have been normal behavior.
I couldn't find how to count live calls in Python. Can someone point me to docs that will help?

Turns out there is an easy way instead of counting RPCs; here is how I got it done:
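import signal
import threading
from concurrent import futures
import grpc
# "{}" is a placeholder for your own service name;
# logger, port and NUM_SECS_TO_WAIT are assumed to be defined elsewhere
server = grpc.server(futures.ThreadPoolExecutor(max_workers=100))
{} = {}Impl()
add_{}Servicer_to_server({}, server)
server.add_insecure_port('[::]:' + port)
server.start()
logger.info('Started server at ' + port)
done = threading.Event()
def on_done(signum, frame):
    logger.info('Got signal {}, {}'.format(signum, frame))
    done.set()
signal.signal(signal.SIGTERM, on_done)
# block the main thread until SIGTERM arrives
done.wait()
logger.info('Stopped RPC server, Waiting for RPCs to complete...')
# refuse new RPCs and give running ones NUM_SECS_TO_WAIT seconds to finish
server.stop(NUM_SECS_TO_WAIT).wait()
logger.info('Done stopping server')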

gRPC Python servers have a (newish) method for this. Just call server.wait_for_termination()

ZeroMQ REQ .recv() hangs with messages larger than ~1kB if run inside Docker

I'm working on a relatively simple Python / ZeroMQ based work distribution system, using REQ/ROUTER sockets. The system is distributed and worker nodes are geographically distributed on different continents.
The ROUTER, responsible for distributing work, .bind()-s a ROUTER socket. Workers .connect() to it over TCP using a REQ socket.
In the process of setting up a new worker node, I've noticed that while smaller messages (up to 1kB) do the trip with no issues, replies of ~2kB and up, sent by the ROUTER-end are never received by the worker into their REQ-socket - when I call recv(), the socket just hangs.
The worker code runs inside Docker containers, and I was able to work around the issue when running the same image with --net=host - it seems to not happen if Docker is using the host network.
I'm wondering if this is something in the network stack configuration on the host machine or in Docker, or maybe something that can be prevented in my code?
Here is a simplified version of my code that reproduces this issue:
Worker
import sys
import zmq
import logging
import time
READY = 'R'
def worker(connect_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.REQ)
    socket.connect(connect_to)
    log = logging.getLogger(__name__)
    while True:
        socket.send_string(READY)
        log.debug("Send READY message, waiting for reply")
        message = socket.recv()
        log.debug("Got reply of %d bytes", len(message))
        time.sleep(5)
if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    worker(sys.argv[1])
Router
import sys
import zmq
import logging
REPLY_SIZE = 1024 * 8
def router(bind_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.ROUTER)
    socket.bind(bind_to)
    poller = zmq.Poller()
    poller.register(socket, zmq.POLLIN)
    log = logging.getLogger(__name__)
    while True:
        socks = dict(poller.poll(5000))
        if socks.get(socket) == zmq.POLLIN:
            message = socket.recv_multipart()
            log.debug("Received message of %d parts", len(message))
            # frames from a REQ peer: [identity, empty delimiter, payload...]
            identity, _ = message[:2]
            res = handle_message(message[2:])
            log.debug("Sending %d bytes back in response on socket", len(res))
            # bytes literals, so this also runs on Python 3
            socket.send_multipart([identity, b'', res])
def handle_message(parts):
    log = logging.getLogger(__name__)
    log.debug("Got message: %s", parts)
    return b'A' * REPLY_SIZE
if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    router(sys.argv[1])
FWIW I was able to reproduce this on Ubuntu 16.04 (both router and worker) with Docker 17.09.0-ce, libzmq 4.1.5 and PyZMQ 15.4.0.

No, sir, the socket does not hang at all.
Why?
The issue is that you have instructed the socket instance to block indefinitely: .recv() was called without the zmq.NOBLOCK flag (the ZMQ_DONTWAIT flag in the original ZeroMQ API).
Under the circumstances you reported, that is what puts the code into an infinite wait, because other issues seem to prevent the Docker container from delivering even the first message to the ZeroMQ Context() I/O engine inside the worker's container, and from there to the REQ access point. The REQ archetype uses a strict two-step finite-state automaton that alternates .send() -> .recv() -> .send() -> ... ad infinitum, so a reply that never arrives stalls it.
Reversing cause and effect here is wrong and misleading: a "socket that just hangs" cannot be told apart from Docker not delivering a single message (which is what would allow .recv() to return).
Next steps:
Use .poll() on the REQ side to check, without blocking, whether any message has already arrived in the worker.
If none has, focus on Docker first. After that, the ZeroMQ Context() I/O-engine performance and link-level configuration options may be worth tuning.

How to reconnect to RabbitMQ?

My Python script has to send a message to RabbitMQ every time it receives one from another data source. The interval between sends can vary, say, from 1 to 30 minutes.
Here's how I establish a connection to RabbitMQ:
rbt_conn = pika.BlockingConnection(pika.ConnectionParameters("some_host"))
channel = rbt_conn.channel()
I just got an exception
pika.exceptions.ConnectionClosed
How can I reconnect to it? What's the best way? Is there any "strategy"? Is there an ability to send pings to keep a connection alive or set timeout?
Any pointers will be appreciated.

RabbitMQ uses heartbeats to detect and close "dead" connections and to prevent network devices (firewalls etc.) from terminating "idle" connections. From version 3.5.5 on, the default timeout is set to 60 seconds (previously it was ~10 minutes). From the docs:
Heartbeat frames are sent about every timeout / 2 seconds. After two missed heartbeats, the peer is considered to be unreachable.
The problem with Pika's BlockingConnection is that it is unable to respond to heartbeats until some API call is made (for example, channel.basic_publish(), connection.sleep(), etc).
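To make the arithmetic from that quote concrete, here is a rough model of it in plain Python. The function names are made up for this illustration, and the model only encodes the "timeout / 2" and "two missed heartbeats" rule quoted above, not the broker's exact behaviour.

```python
def heartbeat_schedule(timeout):
    """Return (send_interval, dead_after) in seconds for a negotiated timeout."""
    send_interval = timeout / 2        # frames go out about every timeout / 2
    dead_after = 2 * send_interval     # two missed frames: peer unreachable
    return send_interval, dead_after


def callback_is_safe(timeout, callback_seconds):
    """True if a callback blocking for callback_seconds stays under the limit.

    A timeout of 0 means heartbeats are disabled, so nothing can be missed.
    """
    if timeout == 0:
        return True
    _, dead_after = heartbeat_schedule(timeout)
    return callback_seconds < dead_after
```

Under this model, a 300-second blocking call with the 60-second default is not safe (callback_is_safe(60, 300) is False), while the same call with a 600-second timeout is.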
The approaches I found so far:
Increase or deactivate the timeout
RabbitMQ negotiates the timeout with the client when establishing the connection. In theory, it should be possible to override the server default value with a bigger one using the heartbeat_interval argument, but the current Pika version (0.10.0) uses the min value between those offered by the server and the client. This issue is fixed on current master.
On the other hand, it is possible to deactivate the heartbeat functionality completely by setting the heartbeat_interval argument to 0, which may well lead to new issues (firewalls dropping connections, etc.).
Reconnecting
Expanding on @itsafire's answer, you can write your own publisher class, letting you reconnect when required. An example naive implementation:
import logging
import json
import pika
class Publisher:
    EXCHANGE = 'my_exchange'
    TYPE = 'topic'
    ROUTING_KEY = 'some_routing_key'
    def __init__(self, host, virtual_host, username, password):
        self._params = pika.connection.ConnectionParameters(
            host=host,
            virtual_host=virtual_host,
            credentials=pika.credentials.PlainCredentials(username, password))
        self._conn = None
        self._channel = None
    def connect(self):
        if not self._conn or self._conn.is_closed:
            self._conn = pika.BlockingConnection(self._params)
            self._channel = self._conn.channel()
            # this argument is named exchange_type in pika 1.x
            self._channel.exchange_declare(exchange=self.EXCHANGE,
                                           type=self.TYPE)
    def _publish(self, msg):
        self._channel.basic_publish(exchange=self.EXCHANGE,
                                    routing_key=self.ROUTING_KEY,
                                    body=json.dumps(msg).encode())
        logging.debug('message sent: %s', msg)
    def publish(self, msg):
        """Publish msg, reconnecting if necessary."""
        try:
            self._publish(msg)
        except pika.exceptions.ConnectionClosed:
            logging.debug('reconnecting to queue')
            self.connect()
            self._publish(msg)
    def close(self):
        if self._conn and self._conn.is_open:
            logging.debug('closing queue connection')
            self._conn.close()
Other possibilities
Other possibilities which I haven't explored yet:
Using an asynchronous adapter for publishing
Keeping your RabbitMQ connection and your "publish" code on a background thread, which periodically calls connection.sleep() to respond to server heartbeats.

Dead simple: some pattern like this.
import time
import pika
while True:
    try:
        communication_handles = connect_pika()
        do_your_stuff(communication_handles)
    except pika.exceptions.ConnectionClosed:
        print('oops. lost connection. trying to reconnect.')
        # avoid rapid reconnection on longer RMQ server outage
        time.sleep(0.5)
You will probably have to refactor your code, but basically it is about catching the exception, mitigating the problem and continuing to do your stuff.
The communication_handles contain all the pika elements like channels, queues and whatever that your stuff needs to communicate with RabbitMQ via pika.
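A variation on the loop above, offered as a sketch rather than a drop-in: the fixed 0.5-second pause is replaced by an exponential backoff with a cap, so that a long broker outage does not cause a tight retry loop. The function and its parameters are hypothetical. With pika you would pass retry_on=(pika.exceptions.ConnectionClosed,) and your own connect and work callables.

```python
import time


def run_with_reconnect(connect, work, retry_on=(ConnectionError,),
                       base_delay=0.5, max_delay=30.0, max_attempts=None,
                       sleep=time.sleep):
    """Run work(connect()), reconnecting with a growing delay on failure.

    Returns whatever work() returns. Re-raises the last error once
    max_attempts failures have happened (None means retry forever).
    """
    delay = base_delay
    attempts = 0
    while True:
        try:
            handles = connect()
            delay = base_delay  # a successful connect resets the backoff
            return work(handles)
        except retry_on:
            attempts += 1
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double, up to the cap
```

The sleep parameter exists so that the waiting can be replaced in tests. In production code the default time.sleep is what you want.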

Dbus/GLib Main Loop, Background Thread

I'm starting out with DBus and event driven programming in general. The service that I'm trying to create really consists of three parts but two are really "server" things.
1) The actual DBus server talks to a remote website over HTTPS, manages sessions, and conveys info to the clients.
2) The other part of the service calls a keep alive page every 2 minutes to keep the session active on the external website.
3) The clients make calls to the service to retrieve info from the service.
I found some simple example programs. I'm trying to adapt them to prototype #1 and #2. Rather than building separate programs for both, I thought that I could run them in a single, two-threaded process.
The problem that I'm seeing is that I call time.sleep(X) in my keep alive thread. The thread goes to sleep, but won't ever wake up. I think that the GIL isn't released by the GLib main loop.
Here's my thread code:
class Keepalive(threading.Thread):
    def __init__(self, interval=60):
        super(Keepalive, self).__init__()
        self.interval = interval
        bus = dbus.SessionBus()
        self.remote = bus.get_object("com.example.SampleService", "/SomeObject")
    def run(self):
        while True:
            print('sleep %i' % self.interval)
            time.sleep(self.interval)
            print('sleep done')
            reply_status = self.remote.keepalive()
            if reply_status:
                print('Keepalive: Success')
            else:
                print('Keepalive: Failure')
From the print statements, I know that the sleep starts, but I never see "sleep done."
Here is the main code:
if __name__ == '__main__':
    try:
        dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)
        session_bus = dbus.SessionBus()
        name = dbus.service.BusName("com.example.SampleService", session_bus)
        object = SomeObject(session_bus, '/SomeObject')
        mainloop = gobject.MainLoop()
        ka = Keepalive(15)
        ka.start()
        print('Begin main loop')
        mainloop.run()
    except Exception as e:
        print(e)
    finally:
        ka.join()
Some other observations:
I see the "begin main loop" message, so I know it's getting control. Then, I see "sleep %i," and after that, nothing.
If I ^C, then I see "sleep done." After ~20 seconds, I get an exception from self.run() that the remote application didn't respond:
DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
What's the best way to run my keep alive code within the server?
Thanks,

You have to explicitly enable multithreading when using gobject by calling gobject.threads_init(). See the PyGTK FAQ for background info.
Next to that, for the purpose you're describing, timeouts seem to be a better fit. Use as follows:
# Enable timer
self.timer = gobject.timeout_add(time_in_ms, self.remote.keepalive)
# Disable timer
gobject.source_remove(self.timer)
This calls the keepalive function every time_in_ms milliseconds, for as long as the function returns a true value. Further details, again, can be found in the PyGTK reference.
