After an issue where my RabbitMQ server reached its file descriptor limit and stopped accepting new connections, I noticed that my clients consuming from queues behaved in a very undesirable way: when trying to open their connections, they hung indefinitely without raising any errors.
I'm currently using the Kombu library, and upon recreating the issue, no amount of tweaking the connection parameters would prevent the connection instantiation from blocking indefinitely. The timeout doesn't trigger, and enabling heartbeats doesn't help either. Looking at strace, I can see that the client opens a connection to the RabbitMQ server and then waits for data forever.
I've also just tried the Pika library and have experienced the same issue. The only difference is that strace shows the connection being polled. The connection instantiation still blocks indefinitely, though.
Is there something I'm missing? What is the correct way to open connections to a RabbitMQ server without them hanging silently forever when something goes wrong?
Edit:
Here's some code; it's pretty much hello world.
Pika:
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters(
    'localhost', socket_timeout=2, heartbeat_interval=1))
channel = connection.channel() # Hangs indefinitely
Kombu:
import kombu
connection = kombu.Connection('amqp://guest:guest@localhost:5672//')
connection.connect() # Hangs indefinitely
Related: This Question
Okay, I've read this post in search of the right answer, but it does not seem to serve my purpose.
Now, getting to the trouble:
I have a conventional client-server architecture in C (all sockets are non-blocking), where the server is listening for incoming connections and the client tries to connect. The first connect succeeds and everything goes on just fine until I press Ctrl + C on my server.
The client side of the code detects that the connection is lost and arms a retry timer.
The client code is supposed to retry the connection to the server again and again, using a POSIX interval timer, on each timer expiry. It does not, however, close the socket or start afresh. Now, every time it retries, connect() returns:
Transport endpoint is already connected
Even after restarting the server, which uses SO_REUSEADDR and successfully starts, the connect does not complete.
One thing I still need to implement is a signal handler on the server for a clean shutdown on Ctrl+C.
But still, do I need to close the socket descriptor on the client side and start afresh every time a disconnect happens, or is there a way out of this?
Sockets cannot be reused. Once the connection a socket served has gone down in both directions, the socket is unusable. close() the client socket on loss of connection and create a new socket for the new connection.
Update (based on the comments below):
In the OP's case, one side (the server side) went down when the server process ended. This implies that all sockets held by that process were implicitly close()d and therefore shut down in both directions.
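A minimal sketch of that retry pattern, in Python to match the rest of this page (the same idea applies directly to C's socket API): create a fresh socket for every attempt and discard the old one first.
import socket
import time

def connect_with_retry(host, port, delay=5):
    while True:
        # A brand-new socket for every attempt; a failed socket is never reused.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((host, port))
            return s
        except socket.error:
            s.close()          # discard the failed socket entirely
            time.sleep(delay)  # then retry with a fresh one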
I am using RabbitMQ to send tasks from my rabbit server to my consumers. I have noticed that when I run some rather lengthy tests, 20+ minutes, my consumer loses contact with the producer after it completes its task. In my rabbit logs, I have seen the error:
closing AMQP connection <0.14009.27> (192.168.101.2:64855 ->
192.168.101.3:5672):
missed heartbeats from client, timeout: 60s
Also, I receive this error from Pika:
pika.exceptions.ConnectionClosed: (-1, "error(10054, 'An existing connection was forcibly closed by the remote host')")
I'm assuming this is caused by the code below, where the heartbeats conflict with the lengthy blocking connection time:
self.connection = pika.BlockingConnection(
    pika.ConnectionParameters('192.168.101.2', 5672, 'user', credentials))
self.channel = self.connection.channel()
self.channel.queue_declare(queue=self.tool,
                           durable=True,
                           arguments={'x-message-ttl': 1000,
                                      'x-dead-letter-exchange': 'dlx',
                                      'x-dead-letter-routing-key': 'dl'})
Is there a proper way to increase the heartbeat timeout, or how would I turn it off completely (and would that be wise)? Like I said, tests that run 20+ minutes seem to lead to a ConnectionClosed error, but I've run plenty of tests in the 1-15 minute range where everything is fine and the consumer client continues to wait for a message to be delivered.
Please don't disable heartbeats. Instead, use Pika correctly. This means:
Use Pika version 0.12.0
Do your long-running task in a separate thread
When your task completes, use the add_callback_threadsafe method to schedule the basic_ack call.
Example code can be found here: link. A rough sketch of the same pattern also follows below.
I'm a RabbitMQ core team member and Pika maintainer so if you have further questions or issues, I recommend following up on either the pika-python or rabbitmq-users mailing list. Thanks!
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
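Here is a rough sketch of that pattern (my own illustration, not the linked example; the queue name and the task body are placeholders, and the consume call uses the pika 1.x signature):
import functools
import threading
import time
import pika

def do_work(connection, channel, delivery_tag, body):
    time.sleep(1200)  # stand-in for the long-running task (20+ minutes)
    # basic_ack must run on the connection's own thread, so schedule it there.
    connection.add_callback_threadsafe(
        functools.partial(channel.basic_ack, delivery_tag))

def on_message(channel, method, properties, body, connection):
    # Run the task off the connection's thread so heartbeats keep flowing.
    worker = threading.Thread(
        target=do_work,
        args=(connection, channel, method.delivery_tag, body))
    worker.start()

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='long_tasks')  # hypothetical queue name
channel.basic_consume(queue='long_tasks',
                      on_message_callback=functools.partial(
                          on_message, connection=connection))
channel.start_consuming()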
You can set the minimum heartbeat interval when creating the connection.
You can see an example in the pika documentation.
I'd recommend against disabling the heartbeat, as it might lead to hanging connections piling up on the broker. We experienced exactly that issue in production.
Always make sure the connections have a reasonable minimum heartbeat. If the heartbeat interval needs to be long (hours, for example), make sure you close the connection when the application crashes or exits. That way you won't leave the connection open on the broker side.
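For illustration (not from the linked docs), a connection created with an explicit heartbeat; recent pika versions take heartbeat, while older ones called the parameter heartbeat_interval:
import pika

# 60s is just an illustrative value; pick something sane for your workload.
params = pika.ConnectionParameters('localhost', heartbeat=60)
connection = pika.BlockingConnection(params)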
As @Luke mentioned, heartbeats are useful, but if you still want to disable them, just set the heartbeat parameter to zero when creating a connection. So:
For URL parameters: connection = pika.BlockingConnection(pika.URLParameters("amqp://user:pass@127.0.0.1?heartbeat=0"))
For Connection parameters: connection = pika.BlockingConnection(pika.ConnectionParameters(heartbeat=0))
I am using a RabbitMQ producer to send long running tasks (30 mins+) to a consumer. The problem is that the consumer is still working on a task when the connection to the server is closed and the unacknowledged task is requeued.
From researching, I understand that either a heartbeat or an increased connection timeout can be used to solve this. Both of these solutions raise errors when I attempt them. In reading answers to similar posts, I've also learned that many changes have been made to RabbitMQ since those answers were posted (e.g. the default heartbeat timeout changed from 580 seconds to 60 in RabbitMQ 3.5.5).
When specifying a heartbeat and blocked connection timeout:
credentials = pika.PlainCredentials('user', 'password')
parameters = pika.ConnectionParameters('XXX.XXX.XXX.XXX', port, '/', credentials, blocked_connection_timeout=2000)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
The following error is displayed:
TypeError: __init__() got an unexpected keyword argument 'blocked_connection_timeout'
When specifying heartbeat_interval=1000 in the connection parameters a similar error is shown: TypeError: __init__() got an unexpected keyword argument 'heartbeat_interval'
And similarly for socket_timeout = 1000 the following error is displayed: TypeError: __init__() got an unexpected keyword argument 'socket_timeout'
I am running RabbitMQ 3.6.1, pika 0.10.0 and Python 2.7 on Ubuntu 14.04.
Why are the above approaches producing errors?
Can a heartbeat approach be used where there is a long-running continuous task? For example, can heartbeats be used while performing large database joins that take 30+ minutes? I am in favour of the heartbeat approach, as it is often difficult to judge how long a task such as a database join will take.
I've read through answers to similar questions.
Update: running code from the pika documentation produces the same error.
I've run into the same problem you're seeing in my own systems: dropped connections during very long tasks.
It's possible the heartbeat might help keep your connection alive, if your network setup is such that idle TCP/IP connections are forcefully dropped. If that's not the case, though, changing the heartbeat won't help.
Changing the connection timeout won't help at all. This setting is only used when initially creating the connection.
"I am using a RabbitMQ producer to send long running tasks (30 mins+) to a consumer. The problem is that the consumer is still working on a task when the connection to the server is closed and the unacknowledged task is requeued."
There are two reasons for this, both of which you have run into already:
Connections drop randomly, even under the best of circumstances
Re-starting a process because of a re-queued message can cause problems
Having deployed RabbitMQ code with tasks that range from less than a second to several hours, I have found that acknowledging the message immediately and then updating the system with status messages works best for very long tasks like these.
You will need to have a system of record (probably with a database) that keeps track of the status of a given job.
When the consumer picks up a message and starts the process, it should acknowledge the message right away and send a "started" status message to the system of record.
As the process completes, send another message to say it's done.
This won't solve the dropped connection problem; nothing will solve that 100%. Instead, it prevents the message re-queueing problem from happening when a connection is dropped.
This solution does introduce another problem, though: when the long running process crashes, how do you resume the work?
The basic answer is to use the job's status in your system of record (your database) to tell you that the work needs to be picked up again. When the app starts, check the database for unfinished work. If there is any, resume or restart it in whatever manner is appropriate.
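A bare-bones sketch of that idea (entirely illustrative: SQLite stands in for the system of record, and run_long_task is a placeholder for the real work):
import sqlite3

db = sqlite3.connect('jobs.db')
db.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, status TEXT)")

def run_long_task(job_id):
    pass  # placeholder for the actual long-running work

def handle_message(channel, method, properties, body):
    job_id = body.decode()
    channel.basic_ack(method.delivery_tag)  # ack immediately, before the work
    db.execute("REPLACE INTO jobs VALUES (?, 'started')", (job_id,))
    db.commit()
    run_long_task(job_id)  # may take hours; the broker no longer cares
    db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    db.commit()

def resume_unfinished_work():
    # On startup, pick up anything that crashed mid-flight.
    for (job_id,) in db.execute("SELECT id FROM jobs WHERE status = 'started'"):
        run_long_task(job_id)
        db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        db.commit()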
I've seen this issue before. The reason is that you declared the queue but didn't bind it to the exchange.
For example:
@Bean(name = "test_queue")
public Queue testQueue() {
    return new Queue("test_queue");
}

@RabbitListener(queues = "test_queue_1")  // listening on a queue that was never declared or bound
public void listenCreateEvent() {
}
If you listen on a queue that isn't bound to the exchange, this will happen.
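For the Python readers of this page, the equivalent missing step in pika terms would be the queue_bind call (the exchange, queue, and routing key names here are illustrative):
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='events', exchange_type='direct')
channel.queue_declare(queue='test_queue')
# Without this bind, messages published to 'events' never reach 'test_queue'.
channel.queue_bind(queue='test_queue', exchange='events', routing_key='create')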
This may or may not be a coding issue; it may also be an xinetd daemon issue, I do not know.
I have a Python script which is triggered from a Linux server running xinetd. xinetd has been set up to allow only one instance, as I only want one machine to be able to connect to the service, which is therefore also limited by IP.
Currently, when the client connects to xinetd, the service works correctly and the script begins sending its output to the client machine. However, when the client disconnects (e.g. due to a reboot), the process stays alive on the server, and this blocks the client from connecting again once it has finished rebooting.
Q: How can I detect in Python that the client has disconnected? Perhaps I can test whether stdout is no longer being read by the client (and then exit the script), or is there a much easier way in xinetd to have the child process killed when the client disconnects?
(I'm using Python 2.4.3 on RHEL5 Linux - solutions for 2.4 are needed, but 3.1 solutions would be useful to know also.)
Add a signal handler for SIGHUP. (x)inetd sends this upon the socket disconnecting.
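A minimal sketch of that (it works on Python 2.4 as well), exiting as soon as the signal arrives:
import signal
import sys

def on_hangup(signum, frame):
    sys.exit(0)  # client went away; stop producing output

signal.signal(signal.SIGHUP, on_hangup)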
Monitor the signals sent to your process. Maybe your script isn't responding to the SIGHUP sent by xinetd; monitor the signal and let the process die.
You don't seem to get a SIGHUP, but you do get a SIGPIPE, at least as long as you are attempting any IO on the connection. If the application spends long periods of time not doing any IO, you could start a thread reading stdin so the disconnection is noticed (the read returns EOF) as soon as it happens. This was good enough for my application, but then I didn't use any pipes other than the ones xinetd gave me.
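A sketch of that reader-thread idea (Python 2 style, to match the question; the hard exit via os._exit is my choice, not the author's):
import os
import sys
import threading

def watch_stdin():
    # readline() returns '' at EOF, i.e. once the client has gone away.
    while sys.stdin.readline():
        pass
    os._exit(0)  # client disconnected: stop the whole process

watcher = threading.Thread(target=watch_stdin)
watcher.setDaemon(True)
watcher.start()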
I've seen several places on the net where people talk about SIGHUP being sent on client disconnection, so I've written an inetd Python script to test a couple of servers (one inetd and another xinetd); you could use it to check which signals actually get sent. It just logs what it finds to /var/log/test.log. Perhaps it will be useful.
#!/usr/bin/python
import os, signal, sys

# Signals we can't or shouldn't install handlers for.
skip = ["SIGKILL", "SIG_DFL", "SIGSTOP", "SIG_IGN", "SIGCLD", "SIGCHLD"]

# Map signal numbers back to their names for logging.
name_map = {}
identifiers = [i for i in dir(signal) if i.startswith("SIG") and i not in skip]
for i in identifiers:
    name_map[getattr(signal, i)] = i

def handler(num, frame):
    signame = name_map[num]
    os.system("echo handled %s >> /var/log/test.log" % signame)

if __name__ == "__main__":
    # Install the logging handler for every catchable signal.
    for id, name in name_map.iteritems():
        signal.signal(id, handler)
    # Echo stdin back so the connection stays busy.
    while True:
        print sys.stdin.readline()
        sys.stdout.flush()
Sometimes in our production environment a situation occurs where the connection between a service (a Python program that uses MySQLdb) and the MySQL server is flaky: some packets are lost, some black magic happens, and .execute() on a MySQLdb.Cursor object never returns (or takes a huge amount of time to return).
This is very bad because it wastes service worker threads. Sometimes it exhausts the worker pool and the service stops responding altogether.
So the question is: is there a way to interrupt a MySQLdb.Connection.execute operation after a given amount of time?
If the communication is such a problem, consider writing a 'proxy' that receives your SQL commands over the flaky connection and relays them to the MySQL server over a reliable channel (perhaps running on the same box as the MySQL server). This way you have total control over failure detection and retrying.
You need to analyse exactly what the problem is. MySQL connections should eventually time out if the server is gone; TCP keepalives are generally enabled. You may be able to tune the OS-level TCP timeouts.
If the database is "flaky", then you definitely need to investigate how. It seems unlikely that the database really is the problem, more likely that networking in between is.
If you are using stateful firewalls of any kind, it's possible that they're losing some of their state, causing otherwise good long-lived connections to go dead.
You might want to consider changing the idle timeout parameter in MySQL; otherwise, a long-lived, unused connection may go "stale", where the server and client both think it's still alive but some stateful network element in between has "forgotten" about the TCP connection. An application trying to use such a "stale" connection will have a long wait before receiving an error (but it should eventually receive one).
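One client-side mitigation along these lines (my own suggestion, not from the answers above; it assumes a MySQLdb/mysqlclient build recent enough to support read_timeout and write_timeout, and the host/credentials are hypothetical):
import MySQLdb

conn = MySQLdb.connect(
    host='db.example.com',   # hypothetical host
    user='service',
    passwd='secret',
    db='prod',
    connect_timeout=5,       # fail fast if the server is unreachable
    read_timeout=30,         # abort a hung .execute()/fetch after 30 seconds
    write_timeout=30,
)
cursor = conn.cursor()
cursor.execute("SELECT 1")   # raises OperationalError instead of hanging forever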