Celery worker not reconnecting on network change/IP Change - python

I deployed celery for some tasks that need to be performed at my workplace. These tasks are huge and I bought a few high-spec machines for performing these. Before I detail my issue, let me brief about what all I've deployed:
RabbitMQ broker on a remote server
Producer that pushes tasks on another remote server
Workers at 3 machines deployed at my workplace
When I started, the whole process went just as smoothly as in my tests and everything processed just great!
The problem
Unfortunately, I forgot to consult my network guy about a fixed IP address, and at our location we do not get a fixed IP address from our ISP. So on a network disconnect my celery workers freeze and do nothing, even once the network is back up: because the IP address has changed, the connection to the broker is not recreated and the workers do not retry the connection. I have tried configuration like BROKER_CONNECTION_MAX_RETRIES = 0 and BROKER_HEARTBEAT = 10, but it did not help, so I had no option but to post it out here and look for experts on this matter!
PS: I cannot manually restart the workers with kill -9 every time the network changes the IP address.
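For reference, this is roughly the configuration I tried; a minimal sketch, with a placeholder broker URL and the old upper-case setting names:

from celery import Celery

app = Celery('tasks', broker='amqp://user:password@broker.example.com:5672//')

app.conf.update(
    BROKER_CONNECTION_MAX_RETRIES=0,  # 0 means retry forever
    BROKER_HEARTBEAT=10,              # AMQP heartbeat interval, in seconds
)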

Restarting the app using:
sudo rabbitmqctl stop_app
sudo rabbitmqctl start_app
solved the issue for me.
Also, since I had a virtual host set up, I needed to reset that too.
Not sure why that was needed, or in fact whether any of the above was needed, but it did solve the problem for me.

The issue was down to my not understanding the nature of the AMQP protocol and RabbitMQ.
When a celery worker starts, it opens a channel to RabbitMQ. On a network change this channel tries to reconnect, but the port/socket previously opened for the channel is registered against a different public IP address of the client. The negotiation between the celery worker (client) and RabbitMQ (server) therefore cannot resume because the client's address has changed, so a new channel has to be established whenever the public IP address of the client changes.
The answer by @qreOct above didn't address this, either because I was unable to express the question properly or because of the difference in our perceptions. Still, thanks a lot for taking the time!

Related

Application impacts of celery workers running with the `--without-heartbeat` flag

The discussion here talks at a high level about some of the impacts of running celery workers with the --without-heartbeat --without-gossip --without-mingle flags.
I wanted to know whether the --without-heartbeat flag would impact the worker's ability to detect a broker disconnect and its attempts to reconnect. The celery documentation only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically its ability to detect a broker disconnect and then try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect a broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), they attempt to reconnect to the broker, and they do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What is the point of them, then? It's unclear to me, but they don't appear to affect worker functionality at the most basic level.
What are the actual, application-specific implications of turning off heartbeats?
So that is the explanation of the heartbeat mechanism. Now, since AMQP uses TCP, the celery workers will try to reconnect if they can't establish a connection or whenever the TCP protocol dictates, so it looks like the heartbeat mechanism is not needed. But it has a few advantages:
It doesn't rely on the broker protocol, so if the broker has some internal issues or uses UDP, the worker will still know when events are not received and will be able to act accordingly.
The heartbeat mechanism checks that events are both sent and received, which is a much better indicator that the app is running as expected. If, for example, the broker doesn't have enough space and is starting to drop events, the worker will get an indication of that via the heartbeat mechanism. And if the worker is using multiple brokers, it can also decide to connect to another broker that should be less busy.
NOTE: regarding "[heartbeat] does not rely on the broker protocol... or uses UDP":
Celery supports multiple brokers, and a broker transport may use UDP. Celery wants to guarantee the connection to the broker even when the broker protocol uses UDP, and the only way to guarantee that connection in such a case is to implement its own application-level heartbeats.
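For what it's worth, these application-level heartbeats are ordinary celery events, so you can observe them yourself. A minimal sketch using the celery events API (the broker URL is a placeholder); workers started with --without-heartbeat simply stop emitting these events:

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

def on_worker_heartbeat(event):
    # fires every time a worker sends its periodic heartbeat event
    print('heartbeat from', event['hostname'], 'at', event['timestamp'])

with app.connection() as connection:
    receiver = app.events.Receiver(connection, handlers={
        'worker-heartbeat': on_worker_heartbeat,
    })
    receiver.capture(limit=None, timeout=None, wakeup=True)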

Can talk to Zookeeper but not to the message brokers

I'm using kafka-python to produce messages for a Kafka 2.2.1 cluster (a managed cluster instance from AWS's MSK service). I'm able to retrieve the bootstrap servers and establish a network connection to them, but no message ever gets through. Instead, after each message of type A I immediately receive one of type B, and eventually one of type C:
A [INFO] 2019-11-19T15:17:19.603Z <BrokerConnection ... <connecting> [IPv4 ('10.0.128.56', 9094)]>: Connection complete.
B [ERROR] 2019-11-19T15:17:19.605Z <BrokerConnection ... <connected> [IPv4 ('10.0.128.56', 9094)]>: socket disconnected
C [ERROR] KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
What causes a broker node to accept a TCP connection from a hopeful producer, but then immediately close it again?
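For context, the producer boils down to something like this (the topic name and bootstrap servers are placeholders):

from kafka import KafkaProducer

BOOTSTRAP_SERVERS = ['10.0.128.56:9094']  # retrieved from the MSK cluster

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
# send() blocks waiting for topic metadata and raises KafkaTimeoutError
# after max_block_ms (60 s by default) if the metadata never arrives
producer.send('my-topic', b'some payload')
producer.flush()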
Edit
The topic already exists, and kafka-topics.sh --list displays it.
I have the same problem with all clients I've used: Kafka's kafka-console-producer.sh, kafka-python, confluent-kafka, and kafkacat
The Kafka cluster is in the same VPC as all my other machines, and its security group allows any incoming and outgoing traffic within that VPC.
However, it's managed by Amazon's Managed Streaming for Kafka (MSK) service, which means I don't have fine-grained control over the server installation settings (or even know what they are). MSK just publishes the zookeeper and message broker URLs for clients to use.
The producer runs as an AWS Lambda function, but the problem persists when I run it on a normal EC2 instance.
Permissions are not the issue. I have assigned the lambda role all the AWS permissions it needs (AWS is always very explicit about which operation required which missing permission).
Connectivity is not the issue. I can reach the URLs of both the zookeepers and the message brokers with standard telnet. However, issuing commands to the zookeepers works, while issuing commands to the message brokers always eventually fails. Since Kafka uses a binary protocol over TCP, I'm at a loss how to debug the problem further.
Edit
As suggested, I debugged this with
./kafkacat -b $BROKERS -L -d broker
and got:
%7|1574772202.379|FEATURE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1574772202.379|STATE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1574772202.379|BROKERFAIL|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: failed: err: Local: Broker transport failure: (errno: Operation now in progress)
%7|1574772202.379|FEATURE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Updated enabled protocol features -ApiVersion to
%7|1574772202.380|STATE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Broker changed state APIVERSION_QUERY -> DOWN
So, is this a kind of mismatch between client and broker API versions? How can I recover from this, bearing in mind that I have no control over the version or the configuration of the Kafka cluster that AWS provides?
I think that this is related to the TLS encryption. By default, MSK spins up a cluster that accepts both PLAINTEXT and TLS but if you are grabbing the bootstrap servers programmatically from the cluster it will only provide you with the TLS ports. If this is the case for you, try using the PLAINTEXT port 9092 instead.
To authenticate the client for TLS you need to generate a certificate (https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html), get this certificate onto your lambda, and reference it in your Producer configuration.
If you are able to configure your MSK cluster as PLAINTEXT only then when you grab the bootstrap servers from the AWS SDK it will give you the PLAINTEXT port and you should be good.
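In kafka-python terms, the two options look roughly like this (hostnames and certificate paths are placeholders):

from kafka import KafkaProducer

# Option 1: the PLAINTEXT listener on port 9092
producer = KafkaProducer(
    bootstrap_servers=['b-1.mycluster.kafka.us-east-1.amazonaws.com:9092'],
)

# Option 2: the TLS listener on port 9094, with client certificate material
producer = KafkaProducer(
    bootstrap_servers=['b-1.mycluster.kafka.us-east-1.amazonaws.com:9094'],
    security_protocol='SSL',
    ssl_cafile='/path/to/ca.pem',
    ssl_certfile='/path/to/client-cert.pem',
    ssl_keyfile='/path/to/client-key.pem',
)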
Since it doesn't work for non-python clients either, it's unlikely that it's a bug in the library.
It seems to be a networking issue.
There is a kafka broker setting called advertised.listeners which specifies the address the client will use after the first connection. In other words, this is what happens when a client consumes or produces:
Using bootstrap.servers, it establishes the first connection and asks for the real address to use.
The broker answers back with the address specified by advertised.listeners in the broker's configuration.
The client tries consuming or producing using that new address.
This is a security feature that prevents brokers that might be publicly accessible from being consumed from/produced to by clients that shouldn't have access.
How to diagnose
Run the following command:
$ kafkacat -b ec2-54-191-84-122.us-west-2.compute.amazonaws.com:9092 -L
which returns
Metadata for all topics (from broker -1: ec2-54-191-84-122.us-west-2.compute.amazonaws.com:9092/bootstrap):
1 brokers:
broker 0 at ip-172-31-18-160.us-west-2.compute.internal:9092
In this scenario, ec2-54-191-84-122.us-west-2.compute.amazonaws.com:9092 is the address specified by the client, and even if the client has access to that address/port, ip-172-31-18-160.us-west-2.compute.internal:9092 is the address that will actually be used to consume/produce.
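If you would rather do the same check from Python with kafka-python, something along these lines should print the advertised broker addresses (a sketch; the bootstrap address is a placeholder):

from kafka import KafkaAdminClient

admin = KafkaAdminClient(
    bootstrap_servers='ec2-54-191-84-122.us-west-2.compute.amazonaws.com:9092',
)
# describe_cluster() reports each broker's advertised host and port
for broker in admin.describe_cluster()['brokers']:
    print(broker['host'], broker['port'])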
Now, if you are running kafka in AWS MSK, it is probably managing this for you. You have to make sure that you can access the address returned by that command. If you can't, you might need to either change it or run your command from a host that has access to it.
Another option is to open an SSH tunnel through a bastion host that has internal access to that address.
You can find more detailed info at: https://rmoff.net/2018/08/02/kafka-listeners-explained

Consumers do not reconnect after RabbitMQ restart

Sometimes our RabbitMQ messaging server requires a restart. Afterwards, however, some consumers that are listening via a blocking basic consume call do not consume any messages until they themselves are restarted, and they do not raise any exception either.
What is the reason for this, and how might I fix it?
In the ConnectionFactory, please ensure the following property is set to true:
factory.setAutomaticRecoveryEnabled(true);
For more details, please refer to the documentation here.
As I mentioned in my comment, every AMQP client library has a different way to recover connections, and some depend on the developer to do that. There is NO canonical method.
Pika has this example as a starting point for connection recovery. Note that the code is for the unreleased version of Pika (1.0.0). If you're on 0.12.0 you will have to adjust the parameters to the method calls.
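A bare-bones sketch of that pattern with Pika 1.0's BlockingConnection (host, queue name, and callback are placeholders):

import time
import pika

def handle(channel, method, properties, body):
    print('got message:', body)

while True:
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
        channel = connection.channel()
        channel.basic_consume(queue='tasks', on_message_callback=handle, auto_ack=True)
        channel.start_consuming()
    except pika.exceptions.AMQPConnectionError:
        # broker unreachable or connection lost mid-consume; wait and reconnect
        time.sleep(5)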
The best way to test and implement connection recovery is to simulate failure conditions and then code for them. Run your application, then kill the beam.smp process (RabbitMQ) to see what happens. If you have a RabbitMQ cluster, use firewall rules to simulate a network partition. Can your application handle that? What happens when you run rabbitmqctl stop_app; sleep 10; rabbitmqctl start_app? Can your app handle that?
Run your application through a TCP proxy like toxiproxy and introduce latency and other non-optimal conditions. Shut down the proxy to simulate a sudden TCP connection close. In each case, code for that failure condition and log the event so that someone can later diagnose what has happened.
I have seen too many developers code for the "happy path" only to have their applications fail spectacularly in production with zero ability to determine the source of the failure.

Executing a command on a remote server with decoupling, redundancy, and asynchronous execution

I have a few servers that require executing commands on other servers. For example a Bitbucket Server post receive hook executing a git pull on another server. Another example is the CI server pulling a new docker image and restarting an instance on another server.
I would normally use ssh for this, creating a user/group specifically for the job with limited permission.
A few downsides with ssh:
A synchronous ssh call means a git push will have to wait until the command completes.
If a host is not contactable for whatever reason, the ssh command will fail.
Maintaining keys, users, and sudoers permissions can become unwieldy.
A few possibilities:
Find an open-source, out-of-the-box solution (I have tried, with no luck so far)
Set up an REST API on each server that accepts calls with some type of authentication, e.g. POST https://server/git/pull/?apikey=a1b2c3
Set up Python/Celery to execute tasks on a different queue for each host. This means a celery worker on each server that can execute commands and possibly a service that accepts REST API calls, converting them to Celery tasks.
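To make the third possibility concrete, here is a rough sketch (broker URL, queue name, task, and command are all illustrative, not a finished design):

# tasks.py, deployed on each target host
import subprocess
from celery import Celery

app = Celery('remote_exec', broker='amqp://user:password@broker.example.com:5672//')

@app.task
def git_pull(repo_path):
    # run a fixed command only; never pass arbitrary shell strings from the caller
    result = subprocess.run(['git', '-C', repo_path, 'pull'], capture_output=True, text=True)
    return result.returncode

# each host runs a worker bound to its own queue, e.g.:
#   celery -A tasks worker -Q host-web01
# and the caller routes the task to that queue:
#   git_pull.apply_async(args=['/srv/myrepo'], queue='host-web01')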
Is there a nice solution to this problem?
Defining the problem
You want to be able to trigger a remote task without waiting for it to complete.
This can be achieved in any number of ways, including with SSH. You can execute a remote command without waiting for it to complete by closing or redirecting all I/O streams, e.g. like this:
ssh user@host "/usr/bin/foobar </dev/null >/dev/null 2>&1"
You want to be able to defer the task if the host is currently unavailable.
This requires a queuing/retry system of some kind. You will also need to decide whether the target hosts will be querying for messages ("pull") or whether messages will be sent to the target hosts from elsewhere ("push").
You want to simplify access control as much as possible.
There's no way to completely avoid this issue. One solution would be to put most of the authentication logic in a centralized task server. This splits the problem into two parts: configuring access rights in the task server, and configuring authentication between the task server and the target hosts.
Example solutions
Hosts attempt to start tasks over SSH using method above for asynchrony. If host is unavailable, task is written to local file. Cron job periodically retries sending failed tasks. Access control via SSH keys.
Hosts add tasks by writing commands to files on an SFTP server. Cron job on target hosts periodically checks for new commands and executes them if found. Access control managed via SSH keys on the SFTP server.
Hosts post tasks to REST API which adds them to queue. Celery daemon on each target host consumes from queue and executes tasks. Access managed primarily by credentials sent to the task queuing server.
Hosts post tasks to API which adds tasks to queue. Task consumer nodes pull tasks off the queue and send requests to API on target hosts. Authentication managed by cryptographic signature of sender appended to request, verified by task server on target host.
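The cryptographic signature in the last option can be as simple as an HMAC over the request body with a shared secret; a minimal sketch (secret and field names are illustrative):

import hashlib
import hmac
import json

SECRET = b'shared-secret-distributed-out-of-band'

def sign(payload):
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(payload, signature):
    return hmac.compare_digest(sign(payload), signature)

# sender
task = {'command': 'git_pull', 'repo': '/srv/myrepo'}
request = {'payload': task, 'signature': sign(task)}

# receiver (on the target host)
assert verify(request['payload'], request['signature'])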
You can also look into tools that do some or all of the required functions out of the box. For example, some Google searching came up with Rundeck which seems to have some job scheduling capabilities and a REST API. You should also consider whether you can leverage any existing automated deployment or management tools already present in your system.
Conclusions
Ultimately, there's no single right answer to this question. It really depends on your particular needs. Ask yourself: How much time and effort do you want to spend creating this system? What about maintenance? How reliable does it need to be? How much does it need to scale? And so on, ad infinitum...

Celery doesn't send a response to the backend

In my app I use Celery and RabbitMQ.
On localhost everything works fine:
I send tasks to a few workers, they compute them and return the results to call.py (I use groups).
The problem start here:
On my laptop (a MacBook) I have RabbitMQ, and on my desktop (a Windows PC) the celery workers. I start call.py (on the laptop), it sends data to my desktop (to the workers), they receive and calculate the tasks, and in the end (when all tasks have succeeded) my laptop doesn't receive any response from the workers.
No errors, nothing.
My laptop IP is 192.168.1.14. This is the IP I use in the broker and backend parameters when I create the Celery instance.
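The Celery instance looks roughly like this (credentials are placeholders; the backend shown is the RPC backend, which sends results back over the broker):

from celery import Celery

app = Celery(
    'tasks',
    broker='amqp://myuser:mypassword@192.168.1.14:5672//',
    backend='rpc://',  # results travel back over the same broker connection
)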
In rabbitmq-env.conf:
NODE_IP_ADRESS=192.168.1.14
On my router I forward port 5672 to 192.168.1.14.
So, if the whole app runs on localhost and I use my public IP (5.57.N.N.), everything works.
If I use workers on another host (192.168.1.14), I don't get any response from them (the calculated result).
How to fix that?
Thanks!
Are you using the default user guest:guest? If so, that user can only connect via localhost. See https://www.rabbitmq.com/access-control.html#loopback-users
