I am using RabbitMQ to send tasks from my rabbit server to my respective consumers. I have noticed that when I run some rather lengthy tests, 20+ minutes, my consumer loses contact with the producer after it completes its task. In my rabbit logs, I have seen this error:
closing AMQP connection <0.14009.27> (192.168.101.2:64855 ->
192.168.101.3:5672):
missed heartbeats from client, timeout: 60s
Also, I receive this error from pika:
pika.exceptions.ConnectionClosed: (-1, "error(10054, 'An existing connection was forcibly closed by the remote host')")
I'm assuming this is due to the code right here and the conflict between heartbeats and the lengthy blocking task: while the task blocks the connection's thread, no heartbeats are sent.
self.connection = pika.BlockingConnection(
    pika.ConnectionParameters('192.168.101.2', 5672, 'user', credentials))
self.channel = self.connection.channel()
self.channel.queue_declare(queue=self.tool,
                           durable=True,
                           arguments={'x-message-ttl': 1000,
                                      'x-dead-letter-exchange': 'dlx',
                                      'x-dead-letter-routing-key': 'dl'})
Is there a proper way to increase the heartbeat timeout, or how would I turn it off completely (and would that be wise)? Like I said, tests that are 20+ min seem to lead to a ConnectionClosed error, but I've run plenty of tests in the 1-15 minute range where everything is fine and the consumer client continues to wait for a message to be delivered.
Please don't disable heartbeats. Instead, use Pika correctly. This means:
Use Pika version 0.12.0
Do your long-running task in a separate thread
When your task completes, use the add_callback_threadsafe method to schedule the basic_ack call.
Example code can be found here: link
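For a rough idea, here is a minimal sketch of that pattern, assuming the pika 0.12 basic_consume signature; the queue name and the simulated work are illustrative:

import functools
import threading
import time

import pika

def do_work(connection, channel, delivery_tag, body):
    # The long-running task runs here, off the connection's thread.
    time.sleep(1800)  # stand-in for 30 minutes of real work
    # Channel methods are not thread-safe: schedule the ack back onto
    # the connection's thread instead of calling basic_ack here.
    ack = functools.partial(channel.basic_ack, delivery_tag)
    connection.add_callback_threadsafe(ack)

def on_message(channel, method, properties, body, connection):
    worker = threading.Thread(target=do_work,
                              args=(connection, channel,
                                    method.delivery_tag, body))
    worker.start()

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue')
channel.basic_qos(prefetch_count=1)
callback = functools.partial(on_message, connection=connection)
channel.basic_consume(callback, queue='task_queue')
# start_consuming() keeps servicing heartbeats on this thread while
# the worker thread runs.
channel.start_consuming()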
I'm a RabbitMQ core team member and Pika maintainer, so if you have further questions or issues, I recommend following up on either the pika-python or rabbitmq-users mailing list. Thanks!
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
You can set the minimum heartbeat interval when creating the connection.
You can see an example in the pika documentation.
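For example, in pika 0.11+ the parameter is named heartbeat (it was heartbeat_interval in 0.10); the values below are illustrative:

import pika

parameters = pika.ConnectionParameters('localhost',
                                       heartbeat=600,
                                       blocked_connection_timeout=300)
connection = pika.BlockingConnection(parameters)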
I'd recommend against disabling the heartbeat, as it might lead to hanging connections piling up on the broker. We experienced such an issue in production.
Always make sure the connections have a minimum reasonable heartbeat. If the heartbeat interval needs to be long (hours for example), make sure you close the connection when the application crashes or exits. In this way you won't leave the connection open on the broker side.
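A minimal sketch of that advice: pair a long heartbeat (the one-hour value here is illustrative) with a close that runs on every exit path.

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters('localhost', heartbeat=3600))
try:
    channel = connection.channel()
    # ... the application's work goes here ...
finally:
    # Close even on crash/exit paths so the broker is not left holding
    # a dead connection.
    connection.close()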
As @Luke mentioned, heartbeats are useful, but if you still want to disable them, just set the heartbeat parameter to zero when creating a connection. So,
For URL parameters: connection = pika.BlockingConnection(pika.URLParameters("amqp://user:pass@127.0.0.1?heartbeat=0"))
For Connection parameters: connection = pika.BlockingConnection(pika.ConnectionParameters(heartbeat=0))
Related
Discussion here talks at a high level about some of the impacts of running celery workers with the --without-heartbeat --without-gossip --without-mingle flags.
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect a broker disconnect and attempt to reconnect. The celery documentation only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically to detect a broker disconnect and then try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have an impact on worker functionality at the most basic level.
What are the actual, application-specific implications of turning off heartbeats?
So this is the explanation of the heartbeat mechanism. Now, since AMQP uses TCP, the celery workers will try to reconnect if they can't establish a connection or whenever the TCP protocol dictates. So it looks like the heartbeat mechanism is not needed. But it has a few advantages:
It doesn't rely on the broker protocol, so if the broker has some internal issues or uses UDP, the worker will still know when events are not received and will be able to act accordingly.
The heartbeat mechanism checks that events are sent and received, which is a much stronger indicator that the app is running as expected. If, for example, the broker doesn't have enough space and starts to drop events, the worker will get an indication of that through the heartbeat mechanism. And if the worker is using multiple brokers, it can also decide to connect to another broker that should be less busy.
NOTE: regarding "[heartbeat] does not rely on the broker protocol... or uses UDP":
Given that celery supports multiple brokers and may use UDP, celery wants to guarantee the connection to the broker even if the broker protocol uses UDP, and the only way to guarantee that connection in such a case is to implement its own application-level heartbeats.
Sometimes our rabbit messaging server requires a restart. Afterwards, however, some consumers that are listening via a blocking basic_consume call do not consume any messages until they are restarted themselves, and neither do they raise any exception.
What is the reason for this, and how might I fix it?
On the ConnectionFactory, please ensure the following property is set to true:
factory.setAutomaticRecoveryEnabled(true);
For more details, please refer to the document here
As I mentioned in my comment, every AMQP client library has a different way to recover connections, and some depend on the developer to do that. There is NO canonical method.
Pika has this example as a starting point for connection recovery. Note that the code is for the unreleased version of Pika (1.0.0). If you're on 0.12.0 you will have to adjust the parameters to the method calls.
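As a rough starting point, here is a minimal reconnect loop using the 1.0-style API; the queue name and callback are illustrative, and on 0.12.0 the basic_consume argument order differs:

import time

import pika

def handle_message(channel, method, properties, body):
    # Process the message, then acknowledge it.
    channel.basic_ack(delivery_tag=method.delivery_tag)

def consume_forever():
    while True:
        try:
            connection = pika.BlockingConnection(
                pika.ConnectionParameters('localhost'))
            channel = connection.channel()
            channel.queue_declare(queue='task_queue')
            channel.basic_consume(queue='task_queue',
                                  on_message_callback=handle_message)
            channel.start_consuming()
        except pika.exceptions.ConnectionClosedByBroker:
            # The broker closed the connection (e.g. a restart):
            # back off and retry.
            time.sleep(5)
        except pika.exceptions.AMQPConnectionError:
            # Network-level failure: back off and retry.
            time.sleep(5)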
The best way to test and implement connection recovery is to simulate failure conditions and then code for them. Run your application, then kill the beam.smp process (RabbitMQ) to see what happens. If you have a RabbitMQ cluster, use firewall rules to simulate a network partition. Can your application handle that? What happens when you run rabbitmqctl stop_app; sleep 10; rabbitmqctl start_app? Can your app handle that?
Run your application through a TCP proxy like toxiproxy and introduce latency and other non-optimal conditions. Shut down the proxy to simulate a sudden TCP connection close. In each case, code for that failure condition and log the event so that someone can later diagnose what has happened.
I have seen too many developers code for the "happy path" only to have their applications fail spectacularly in production with zero ability to determine the source of the failure.
I am using a RabbitMQ producer to send long running tasks (30 mins+) to a consumer. The problem is that the consumer is still working on a task when the connection to the server is closed and the unacknowledged task is requeued.
From researching I understand that either a heartbeat or an increased connection timeout can be used to solve this. Both of these solutions raise errors when I attempt them. In reading answers to similar posts I've also learned that many changes have been implemented in RabbitMQ since those answers were posted (e.g. the default heartbeat timeout changed from 580 seconds to 60 seconds in RabbitMQ 3.5.5).
When specifying a heartbeat and blocked connection timeout:
credentials = pika.PlainCredentials('user', 'password')
parameters = pika.ConnectionParameters('XXX.XXX.XXX.XXX', port, '/', credentials, blocked_connection_timeout=2000)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
The following error is displayed:
TypeError: __init__() got an unexpected keyword argument 'blocked_connection_timeout'
When specifying heartbeat_interval=1000 in the connection parameters a similar error is shown: TypeError: __init__() got an unexpected keyword argument 'heartbeat_interval'
And similarly for socket_timeout = 1000 the following error is displayed: TypeError: __init__() got an unexpected keyword argument 'socket_timeout'
I am running RabbitMQ 3.6.1, pika 0.10.0 and python 2.7 on Ubuntu 14.04.
Why are the above approaches producing errors?
Can a heartbeat approach be used where there is a long-running continuous task? For example, can heartbeats be used when performing large database joins which take 30+ mins? I am in favour of the heartbeat approach, as many times it is difficult to judge how long a task such as a database join will take.
I've read through answers to similar questions
Update: running code from the pika documentation produces the same error.
I've run into the same problem that you are seeing with my systems: dropped connections during very long tasks.
It's possible the heartbeat might help keep your connection alive, if your network setup is such that idle TCP/IP connections are forcefully dropped. If that's not the case, though, changing the heartbeat won't help.
Changing the connection timeout won't help at all. This setting is only used when initially creating the connection.
I am using a RabbitMQ producer to send long running tasks (30 mins+) to a consumer. The problem is that the consumer is still working on a task when the connection to the server is closed and the unacknowledged task is requeued.
There are two reasons for this, both of which you have run into already:
Connections drop randomly, even under the best of circumstances
Re-starting a process because of a re-queued message can cause problems
Having deployed RabbitMQ code with tasks ranging from less than a second to several hours, I found that acknowledging the message immediately and updating the system with status messages works best for very long tasks like this.
You will need to have a system of record (probably with a database) that keeps track of the status of a given job.
When the consumer picks up a message and starts the process, it should acknowledge the message right away and send a "started" status message to the system of record.
As the process completes, send another message to say it's done.
This won't solve the dropped connection problem, but nothing will 100% solve that anyway. Instead, it will prevent the message re-queueing problem from happening when a connection is dropped.
This solution does introduce another problem, though: when the long running process crashes, how do you resume the work?
The basic answer is to use the system of record (your database) status for the job to tell you that you need to pick up that work again. When the app starts, check the database to see if there is work that is unfinished. If there is, resume or restart that work in whatever manner is appropriate.
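A minimal sketch of that design; the job store here is a hypothetical in-memory stand-in (use a real database in practice, so status survives a crash), and run_long_task is a placeholder:

import json
import pika

# Hypothetical in-memory stand-in for the system of record.
job_status = {}

def mark_started(job_id):
    job_status[job_id] = 'started'

def mark_done(job_id):
    job_status[job_id] = 'done'

def find_unfinished():
    return [job_id for job_id, status in job_status.items()
            if status == 'started']

def run_long_task(job):
    pass  # the 30+ minute task goes here

def on_message(channel, method, properties, body):
    job = json.loads(body)
    # Acknowledge right away so a dropped connection can't requeue the task.
    channel.basic_ack(delivery_tag=method.delivery_tag)
    mark_started(job['id'])
    run_long_task(job)
    mark_done(job['id'])

def resume_unfinished():
    # On startup, recover from the system of record, not from the queue.
    for job_id in find_unfinished():
        run_long_task({'id': job_id})
        mark_done(job_id)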
I've seen this issue before. The reason is that the queue you are listening on was never declared or bound to the exchange.
For example:
@Bean(name = "test_queue")
public Queue testQueue() {
    return new Queue("test_queue");
}

@RabbitListener(queues = "test_queue_1")
public void listenCreateEvent() {
}
If you listen on a queue that wasn't declared or bound to the exchange, this will happen. (Note that the listener above uses test_queue_1 while only test_queue was declared.)
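For comparison, the same fix in pika terms, using 1.x-style signatures; the exchange, queue, and routing-key names are illustrative. Declare and bind the queue, then consume from that same queue name:

import pika

def handle_message(channel, method, properties, body):
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare the exchange and queue, bind them together, and consume from
# the same queue name that was declared and bound.
channel.exchange_declare(exchange='test_exchange', exchange_type='direct')
channel.queue_declare(queue='test_queue')
channel.queue_bind(queue='test_queue', exchange='test_exchange',
                   routing_key='test_key')
channel.basic_consume(queue='test_queue', on_message_callback=handle_message)
channel.start_consuming()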
After experiencing an issue where my rabbitmq server reached its file descriptor limit and ceased to accept any new connections, I noticed that my clients consuming from queues behaved in a very undesirable way.
When trying to open their connections, they hung indefinitely without throwing any errors.
I'm currently using the Kombu library, and upon recreating the issue, no amount of tweaking the connection parameters would prevent the connection instantiation from blocking indefinitely. The timeout doesn't trigger, and enabling heartbeat doesn't help either. Looking at strace I see that it opens a connection to the rabbitmq server and then waits for data forever.
I've also just tried using the Pika library and have experienced the same issue. The difference being that strace shows the connection being polled. The connection instantiation still blocks indefinitely though.
Is there something I'm missing? What is the correct way to open connections to your rabbitmq server without them hanging silently forever when something goes wrong?
Edit:
Here's some code; it's pretty much hello world.
Pika:
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters(
    'localhost', socket_timeout=2, heartbeat_interval=1))
channel = connection.channel()  # Hangs indefinitely
Kombu:
import kombu
connection = kombu.Connection('amqp://guest:guest@localhost:5672//')
connection.connect() # Hangs indefinitely
I've been trying to figure out which form of connection I should use with pika. As far as I understand, I've got two alternatives:
Either the BlockingConnection or the SelectConnection. However, I'm not really sure about the differences between these two (i.e. what is the BlockingConnection blocking, and more).
The documentation for pika says that SelectConnection is the preferred way to connect to rabbit since it provides "multiple event notification methods including select, epoll, kqueue and poll."
So I'm wondering what are the implications of these two different kinds of connections?
PS: I know I shouldn't put a tag in the title but in this case I think it does help to clarify the question.
The SelectConnection is useful if your application architecture can benefit from an asynchronous design, e.g. doing something else while the RabbitMQ IO completes (switching to some other IO, etc.). This type of connection uses callbacks to indicate when functions return. For example, you can declare callbacks for
on_connected, on_channel_open, on_exchange_declared, on_queue_declared etc.
...to perform operations when these events are triggered.
The benefit is especially good if your RabbitMQ server (or connection to that server) is slow or overloaded.
BlockingConnection, on the other hand, is just that: it blocks until the called function returns, so it will block the execution thread until, for example, connect, channel_open, exchange_declare, or queue_declare returns. That said, it's often simpler to program this sort of serialized logic than the async SelectConnection logic. For simple apps with responsive RabbitMQ servers, these also work OK IMO.
I suppose you've already read the Pika documentation, http://pika.readthedocs.io/en/stable/intro.html; if not, this is absolutely vital information before you use Pika!
Cheers!
The Pika documentation is quite clear about the differences between the connection types. The main difference is that the pika.adapters.blocking_connection.BlockingConnection() adapter is used for non-asynchronous programming and that the pika.adapters.select_connection.SelectConnection() adapter is used for asynchronous programming.
If you don't know what the difference is between non-asynchronous/synchronous and asynchronous programming, I suggest that you read this question or, for a deeper technical explanation, this article.
Now let's dive into the different Pika adapters and see what they do. For example purposes, imagine that we use Pika to set up a client connection with RabbitMQ as the AMQP message broker.
BlockingConnection()
In the following example, a connection is made to RabbitMQ listening to port 5672 on localhost using the username guest and password guest and virtual host '/'. Once connected, a channel is opened and a message is published to the test_exchange exchange using the test_routing_key routing key. The BasicProperties value passed in sets the message to delivery mode 1 (non-persisted) with a content-type of text/plain. Once the message is published, the connection is closed:
import pika
parameters = pika.URLParameters('amqp://guest:guest@localhost:5672/%2F')
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
channel.basic_publish('test_exchange',
                      'test_routing_key',
                      'message body value',
                      pika.BasicProperties(content_type='text/plain',
                                           delivery_mode=1))
connection.close()
SelectConnection()
In contrast, using this connection adapter is more complicated and less Pythonic, but when used with other asynchronous services it can yield tremendous performance improvements. In the following code example, all of the same parameters and values are used as in the previous example:
import pika

# Step #3
def on_open(connection):
    connection.channel(on_open_callback=on_channel_open)

# Step #4
def on_channel_open(channel):
    channel.basic_publish('test_exchange',
                          'test_routing_key',
                          'message body value',
                          pika.BasicProperties(content_type='text/plain',
                                               delivery_mode=1))
    connection.close()

# Step #1: Connect to RabbitMQ
parameters = pika.URLParameters('amqp://guest:guest@localhost:5672/%2F')
connection = pika.SelectConnection(parameters=parameters,
                                   on_open_callback=on_open)

try:
    # Step #2 - Block on the IOLoop
    connection.ioloop.start()
# Catch a Keyboard Interrupt to make sure that the connection is closed cleanly
except KeyboardInterrupt:
    # Gracefully close the connection
    connection.close()
    # Start the IOLoop again so Pika can communicate; it will stop on its own
    # when the connection is closed
    connection.ioloop.start()
Conclusion
For those doing simple, non-asynchronous/synchronous programming, the BlockingConnection() adapter proves to be the easiest way to get up and running with Pika to publish messages. But if you are looking for a way to implement asynchronous message handling, the SelectConnection() handler is your better choice.
Happy coding!