Application impacts of celery workers running with the `--without-heartbeat` flag

The discussion here talks at a high level about some of the impacts of running celery workers with the --without-heartbeat --without-gossip --without-mingle flags.
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect broker disconnect and attempts to reconnect. The celery documentation only opaquely refers to these heartbeats acting at the application layer rather than TCP/IP layer. Ok--what I really want to know is does eliminating these messages affect my worker's ability to function--specifically to detect broker disconnect and then to try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyways? It's unclear to me, but they don't appear to have impact on worker functionality at the most basic level.
What are the actual, application-specific implications of turning off heartbeats?

So this is the explanation of the heartbeat mechanism. Now, since AMQP uses TCP, the celery workers will try to reconnect if they can't establish a connection or whenever the TCP protocol dictates. So it looks like the heartbeat mechanism is not needed. But it has a few advantages:
It doesn't rely on the broker protocol, so if the broker has internal issues or uses UDP, the worker will still know when events are not being received and will be able to act accordingly.
The heartbeat mechanism checks that events are sent and received, which is a much stronger indicator that the app is running as expected. If, for example, the broker runs out of space and starts to drop events, the heartbeat mechanism gives the worker an indication of that. And if the worker is using multiple brokers, it can also decide to connect to another broker, which should be less busy.
NOTE: regarding "[heartbeat] does not rely on the broker protocol... or uses UDP":
Given that celery supports multiple brokers and may use UDP, celery wants to guarantee the connection to the broker even if the broker protocol uses UDP --> so the only way for celery to guarantee that connection when the broker protocol uses UDP is to implement its own application-level heartbeats.
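To make the idea of an application-level heartbeat concrete, here is a minimal, transport-independent sketch. It is not Celery's implementation: the HeartbeatMonitor class, the 2-second interval and the "two missed beats" threshold are all assumptions chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Track when each peer was last heard from (hypothetical helper)."""

    def __init__(self, interval=2.0, missed_allowed=2, clock=time.monotonic):
        # A peer counts as offline after `missed_allowed` silent intervals.
        self.timeout = interval * missed_allowed
        self.clock = clock
        self.last_seen = {}

    def beat(self, peer):
        # Call this whenever a heartbeat message arrives from `peer`.
        self.last_seen[peer] = self.clock()

    def offline(self):
        # Peers whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

The point of the sketch is that the check depends only on messages actually arriving, not on what the underlying transport reports about the connection.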

Related

How to disable heartbeats with pika and rabbitmq

I am using rabbitmq to facilitate some tasks from my rabbit server to my respective consumers. I have noticed that when I run some rather lengthy tests, 20+ minutes, my consumer will lose contact with the producer after it completes its task. In my rabbit logs, I have seen the error
closing AMQP connection <0.14009.27> (192.168.101.2:64855 ->
192.168.101.3:5672):
missed heartbeats from client, timeout: 60s
Also, I receive this error from pika
pika.exceptions.ConnectionClosed: (-1, "error(10054, 'An existing connection was forcibly closed by the remote host')")
I'm assuming this is due to this code right here and the conflict of heartbeats with the lengthy blocking connection time.
self.connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.101.2', 5672, 'user', credentials))
self.channel = self.connection.channel()
self.channel.queue_declare(queue=self.tool,
                           arguments={'x-message-ttl': 1000,
                                      "x-dead-letter-exchange": "dlx",
                                      "x-dead-letter-routing-key": "dl",
                                      'durable': True})
Is there a proper way to increase the heartbeat time, or how would I turn it off completely (and would that be wise)? Like I said, tests that are 20+ min seem to lead to a ConnectionClosed error, but I've run plenty of tests in the 1-15 minute range where everything is fine and the consumer client continues to wait for a message to be delivered.

Please don't disable heartbeats. Instead, use Pika correctly. This means:
Use Pika version 0.12.0
Do your long-running task in a separate thread
When your task completes, use the add_callback_threadsafe method to schedule the basic_ack call.
Example code can be found here: link
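A minimal sketch of the three steps above, for readers without the linked example at hand. The helper names, the "tasks" queue, the localhost broker and the 20-minute placeholder task are assumptions; the basic_consume call uses the Pika 1.x keyword form, and on 0.12.0 the arguments would need adjusting.

```python
import functools
import threading
import time


def ack_message(channel, delivery_tag):
    # Runs on the connection's own thread, where channel calls are safe.
    if channel.is_open:
        channel.basic_ack(delivery_tag)


def do_work(connection, channel, delivery_tag, body, task):
    # Runs in a worker thread, so the connection thread stays free
    # to service heartbeats while the task is busy.
    task(body)
    connection.add_callback_threadsafe(
        functools.partial(ack_message, channel, delivery_tag))


def on_message(channel, method, properties, body, connection, task, threads):
    # Consumer callback: hand the message to a thread and return at once.
    thread = threading.Thread(
        target=do_work,
        args=(connection, channel, method.delivery_tag, body, task))
    thread.start()
    threads.append(thread)


def main():
    import pika  # imported here so the helpers above do not depend on it

    def long_task(body):  # placeholder for the real 20+ minute job
        time.sleep(20 * 60)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost", heartbeat=60))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=1)
    threads = []
    callback = functools.partial(
        on_message, connection=connection, task=long_task, threads=threads)
    # Pika 1.x keyword form (assumption); "tasks" is a hypothetical queue.
    channel.basic_consume(queue="tasks", on_message_callback=callback)
    try:
        channel.start_consuming()
    finally:
        for thread in threads:
            thread.join()
        if connection.is_open:
            connection.close()
```

Calling main() starts the consumer; the helpers take the connection and channel as arguments so they can be exercised without a broker.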
I'm a RabbitMQ core team member and Pika maintainer so if you have further questions or issues, I recommend following up on either the pika-python or rabbitmq-users mailing list. Thanks!
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

You can set the minimum heartbeat interval when creating the connection.
You can see an example in the pika documentation.
I'd recommend against disabling the heartbeat as it might lead to hanging connections piling up on the broker. We experienced such an issue in production.
Always make sure the connections have a minimum reasonable heartbeat. If the heartbeat interval needs to be long (hours for example), make sure you close the connection when the application crashes or exits. In this way you won't leave the connection open on the broker side.
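One way to make sure the connection is closed on exit is a small context manager. This is a sketch only: broker_connection is a hypothetical helper, and the 600-second heartbeat in the usage comment is an arbitrary illustrative value.

```python
from contextlib import contextmanager


@contextmanager
def broker_connection(connect):
    """Open a connection via `connect()` and always close it afterwards."""
    connection = connect()
    try:
        yield connection
    finally:
        # Reached on normal exit and on unhandled exceptions alike.
        if connection.is_open:
            connection.close()


# Usage with Pika (assumes a broker on localhost):
#
#   import pika
#   params = pika.ConnectionParameters(host="localhost", heartbeat=600)
#   with broker_connection(lambda: pika.BlockingConnection(params)) as conn:
#       channel = conn.channel()
#       ...
```

This covers exceptions and normal exits. It cannot help when the process is killed outright, which is the case the heartbeat is there to catch.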

As @Luke mentioned, heartbeats are useful, but if you still want to disable them, just set the heartbeat parameter to zero when creating a connection. So,
For URL parameters: connection = pika.BlockingConnection(pika.URLParameters("amqp://user:pass@127.0.0.1?heartbeat=0"))
For Connection parameters: connection = pika.BlockingConnection(pika.ConnectionParameters(heartbeat=0))

Consumers do not reconnect after Rabbit MQ restart

Sometimes our RabbitMQ messaging server requires a restart. Afterwards, however, some consumers that are listening via a blocking basic consume call do not consume any messages until they are restarted themselves, and they do not raise any exception either.
What is the reason for this and how might I fix it?

In the connectionFactory, please ensure the following property is set to true:
factory.setAutomaticRecoveryEnabled(true);
For more details, please refer to the documentation here

As I mentioned in my comment, every AMQP client library has a different way to recover connections, and some depend on the developer to do that. There is NO canonical method.
Pika has this example as a starting point for connection recovery. Note that the code is for the unreleased version of Pika (1.0.0). If you're on 0.12.0 you will have to adjust the parameters to the method calls.
The best way to test and implement connection recovery is to simulate failure conditions and then code for them. Run your application, then kill the beam.smp process (RabbitMQ) to see what happens. If you have a RabbitMQ cluster, use firewall rules to simulate a network partition. Can your application handle that? What happens when you run rabbitmqctl stop_app; sleep 10; rabbitmqctl start_app? Can your app handle that?
Run your application through a TCP proxy like toxiproxy and introduce latency and other non-optimal conditions. Shut down the proxy to simulate a sudden TCP connection close. In each case, code for that failure condition and log the event so that someone can later diagnose what has happened.
I have seen too many developers code for the "happy path" only to have their applications fail spectacularly in production with zero ability to determine the source of the failure.
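As a starting point for coding against those failure conditions, a generic retry loop might look like the sketch below. It is not the Pika example referred to above: run_with_recovery, its parameters and the 5-second delay are assumptions, and the exception types to recover from are passed in by the caller.

```python
import logging
import time

log = logging.getLogger(__name__)


def run_with_recovery(connect_and_consume, recoverable, delay=5.0,
                      max_attempts=None, sleep=time.sleep):
    """Call `connect_and_consume()` again whenever it raises `recoverable`.

    Returns the number of failures seen before a clean return.
    """
    failures = 0
    while True:
        try:
            connect_and_consume()
            return failures
        except recoverable as exc:
            failures += 1
            # Log every failure so it can be diagnosed later.
            log.warning("connection lost (%r), failure %d", exc, failures)
            if max_attempts is not None and failures >= max_attempts:
                raise
            sleep(delay)


# Usage with Pika (assumes a local broker and a hypothetical "tasks" queue):
#
#   import pika
#   def consume():
#       conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#       channel = conn.channel()
#       channel.basic_consume(queue="tasks", on_message_callback=handle)
#       channel.start_consuming()
#   run_with_recovery(consume, pika.exceptions.AMQPConnectionError)
```

Each retry opens a fresh connection and channel rather than reusing the old ones, and the loop is easy to exercise against the failure scenarios described above.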

ZeroMQ bidirectional async communication with subprocesses

I have a server process which receives requests from web clients.
The server has to call an external worker process ( another .py ) which streams data to the server and the server streams back to the client.
The server has to monitor these worker processes and send messages to them ( basically kill them or send messages to control which kind of data gets streamed ). These messages are asynchronous ( e.g. depend on the web client )
I thought of using ZeroMQ sockets over an ipc:// transport class, but the call to the socket.recv() method is blocking.
Should I use two sockets ( one for streaming data to the server and another to receive control messages from server )?

Using a separate socket for signalling and messaging is always better
While a Poller instance will help a bit, the cardinal step is to use one socket for signalling and a separate one for data streaming. Always. The point is that in such a setup, both Poller.poll() and the event loop can remain socket-specific and spend no more than a predefined amount of time during real-time controlled code execution.
So, do not hesitate to setup a bit richer signalling/messaging infrastructure as an environment where you will only enjoy the increased simplicity of control, separation of concerns and clarity of intents.
ZeroMQ is an excellent tool for doing this - including per-socket IO-thread affinity, so indeed a fine-grain performance tuning is available at your fingertips.
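A minimal sketch of the two-socket layout with pyzmq, assuming PUSH/PULL socket pairs and a b"STOP" control message (neither is specified in the answer). The worker polls the control socket with a bounded timeout and streams on the data socket.

```python
import zmq


def worker_loop(data_sock, ctrl_sock, n_messages, poll_timeout_ms=10):
    """Stream on data_sock while checking ctrl_sock for control messages.

    Returns the list of control messages received.
    """
    poller = zmq.Poller()
    poller.register(ctrl_sock, zmq.POLLIN)
    received = []
    for i in range(n_messages):
        # Wait at most poll_timeout_ms for a control message.
        events = dict(poller.poll(poll_timeout_ms))
        if ctrl_sock in events:
            msg = ctrl_sock.recv()
            received.append(msg)
            if msg == b"STOP":  # hypothetical control command
                break
        data_sock.send(b"sample %d" % i)
    return received
```

The caller creates and connects both sockets, so the same loop works over ipc://, tcp:// or inproc:// endpoints.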

I think I figured out a solution, but I don't know if there is a better (more efficient, safer, ...) way of doing this.
The client makes a request to the server, which spawns N worker processes to attend to the request.
This is the relevant excerpt from worker.py:
for i in range(start_counter, 10):
    # Check if there is any message from server
    while True:
        try:
            msg = worker.recv(zmq.DONTWAIT)
            print("Received {} from server".format(msg))
        except zmq.Again:
            break
    # Send data to server (encode the str: bytes has no .format() in Python 3)
    worker.send("Message {} from {}".format(i, worker_id).encode())
    # Take some sleep
    time.sleep(random.uniform(0.3, 1.1))
In this way, the worker a) does not need a separate socket and b) does not need a separate thread to process messages from server.
In the real implementation, each worker must stream 128-byte messages at 100 Hz to the server, and the server must receive lots of these messages (many clients making requests that need 3-10 workers each).
Will this approach suffer a performance hit if implemented this way?

Celery worker not reconnecting on network change/IP Change

I deployed celery for some tasks that need to be performed at my workplace. These tasks are huge and I bought a few high-spec machines for performing these. Before I detail my issue, let me briefly describe what I've deployed:
RabbitMQ broker on a remote server
Producer that pushes tasks on another remote server
Workers at 3 machines deployed at my workplace
Now, when I started, the whole process ran as smoothly as in my tests and everything was processed just fine!
The problem
Unfortunately, I forgot to consult my network guy about a fixed IP address, and as per our location, we do not have a fixed IP address from our ISP. So my celery workers upon network disconnect freeze and do nothing. Even when the network is running, because the IP Address changed, and the connection to the broker is not being recreated or worker is not retrying connection. I have tried configuration like BROKER_CONNECTION_MAX_RETRIES = 0 and BROKER_HEARTBEAT = 10. But I had no option but to post it out here and look for experts on this matter!
PS: I cannot restart the workers manually with kill -9 every time the network changes the IP address

Restarting the app using:
sudo rabbitmqctl stop_app
sudo rabbitmqctl start_app
solved the issue for me.
Also, since I had virtual host setup, I needed to get that reset too.
Not sure why that was needed, or in fact why any of the above was needed, but it did solve the problem for me.

The issue was that I did not understand the nature of the AMQP protocol or RabbitMQ.
When a celery worker starts, it opens a channel to RabbitMQ. After a network change this channel tries to reconnect, but the port/socket previously opened for the channel is registered with a different public IP address for the client. As a result, the negotiation between the celery worker (client) and RabbitMQ (server) cannot resume because the client's address has changed, so a new channel must be established whenever the client's public IP address changes.
The answer by @qreOct above probably reflects either my failure to express the question properly or a difference in how we understood it. Still, thanks a lot for taking the time!

Message queue proxy in Python + Twisted

I want to implement a lightweight Message Queue proxy. Its job is to receive messages from a web application (PHP) and send them to the Message Queue server asynchronously. The reason for this proxy is that the MQ isn't always available and is sometimes lagging, or even down, but I want to make sure the messages are delivered, and the web application returns immediately.
So, PHP would send the message to the MQ proxy running on the same host. That proxy would save the messages to SQLite for persistence, in case of crashes. At the same time it would send the messages from SQLite to the MQ in batches when the connection is available, and delete them from SQLite.
Now, the way I understand, there are these components in this service:
message listener (listens to the messages from PHP and writes them to a Incoming Queue)
DB flusher (reads messages from the Incoming Queue and saves them to a database; due to SQLite single-threadedness)
MQ connection handler (keeps the connection to the MQ server online by reconnecting)
message sender (collects messages from the SQLite db and sends them to the MQ server, then removes them from the db)
I was thinking of using Twisted for #1 (TCPServer), but I'm having problems integrating it with the other points, which aren't event-driven. Intuition tells me that each of these points should be running in a separate thread, because all are IO-bound and independent of each other, but I could easily put them in a single thread. Even so, I couldn't find any good and clear (to me) examples of how to implement this worker thread alongside Twisted's main loop.
The example I've started with is chatserver.py, which uses service.Application and internet.TCPServer objects. If I start my own thread prior to creating the TCPServer service, it runs a few times, but then it stops and never runs again. I'm not sure why this is happening, but it's probably because I don't use threads with Twisted correctly.
Any suggestions on how to implement a separate worker thread and keep Twisted? Do you have any alternative architectures in mind?
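The persist-then-forward part of the design described above (the DB flusher and message sender) can be sketched with the standard library alone. The Outbox class, its table layout and the use of ConnectionError to signal an unavailable MQ are all assumptions for illustration; `send` stands in for whatever MQ client call is actually used.

```python
import sqlite3


class Outbox:
    """SQLite-backed buffer: store first, forward later (hypothetical)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS outbox ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, body BLOB NOT NULL)")

    def store(self, body):
        # Persist the message before acknowledging it to the web app.
        with self.db:
            self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))

    def pending(self):
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self, send, batch_size=100):
        """Send up to batch_size messages in order; stop at first failure."""
        rows = self.db.execute(
            "SELECT id, body FROM outbox ORDER BY id LIMIT ?",
            (batch_size,)).fetchall()
        sent = 0
        for row_id, body in rows:
            try:
                send(body)
            except ConnectionError:
                break  # MQ unavailable: keep the rest for the next flush
            # Delete only after the send succeeded.
            with self.db:
                self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            sent += 1
        return sent
```

Deleting a row only after its send succeeds gives at-least-once delivery: a crash between send and delete leads to a duplicate rather than a lost message.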

You're basically considering writing an ad-hoc extension to your messaging server, whose job is to provide whatever reliability guarantees you've asked of it.
Instead, perhaps you should take the hardware where you were planning to run this new proxy and run another MQ node on it. The new node should take care of persisting and relaying messages that you deliver to it while the other nodes are overloaded or offline.

Maybe it's not the best bang for your buck to use a separate thread in Twisted to get around a blocking call, but sometimes the least evil solution is the best. Here's a link that shows you how to integrate threading into Twisted:
http://twistedmatrix.com/documents/10.1.0/core/howto/threading.html
Sometimes in a pinch easy-to-implement is faster than hours/days of research which may all turn out to be for nought.

A neat solution to this problem would be to use the key-value store Redis. It's a high-speed persistent data store with plenty of clients, including PHP and Python clients (useful if you want a timed/batch process to handle the messages). It saves you from creating a database and also takes care of your persistence needs. It runs fine in Cygwin/Windows and POSIX environments.
PHP Redis client is here.
Python client is here.
Both have a very clean and simple API. Redis also offers a publish/subscribe mechanism, should you need it, although it sounds like it would be of limited value if you're publishing to an inconsistent queue.
