Can talk to Zookeeper but not to the message brokers - python

I'm using kafka-python to produce messages for a Kafka 2.2.1 cluster (a managed cluster instance from AWS's MSK service). I'm able to retrieve the bootstrap servers and establish a network connection to them, but no message ever gets through. Instead, after each log message of type A I immediately receive one of type B, and eventually one of type C:
A [INFO] 2019-11-19T15:17:19.603Z <BrokerConnection ... <connecting> [IPv4 ('', 9094)]>: Connection complete.
B [ERROR] 2019-11-19T15:17:19.605Z <BrokerConnection ... <connected> [IPv4 ('', 9094)]>: socket disconnected
C [ERROR] KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
What causes a broker node to accept a TCP connection from a hopeful producer, but then immediately close it again?
The topic already exists, and --list displays it.
I have the same problem with all clients I've used: Kafka's, kafka-python, confluent-kafka, and kafkacat
The Kafka cluster is in the same VPC as all my other machines, and its security group allows any incoming and outgoing traffic within that VPC.
However, it's managed by Amazon's Managed Streaming for Kafka (MSK) service, which means I don't have fine-grained control over the server installation settings (or even know what they are). MSK just publishes the zookeeper and message broker URLs for clients to use.
The producer runs as an AWS Lambda function, but the problem persists when I run it on a normal EC2 instance.
Permissions are not the issue. I have assigned the lambda role all the AWS permissions it needs (AWS is always very explicit about which operation required which missing permission).
Connectivity is not the issue. I can reach the URLs of both the zookeepers and the message brokers with standard telnet. However, issuing commands to the zookeepers works, while issuing commands to the message brokers always eventually fails. Since Kafka uses a binary protocol over TCP, I'm at a loss how to debug the problem further.
As suggested, I debugged this with
./kafkacat -b $BROKERS -L -d broker
and got:
%7|1574772202.379|FEATURE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1574772202.379|STATE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1574772202.379|BROKERFAIL|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: failed: err: Local: Broker transport failure: (errno: Operation now in progress)
%7|1574772202.379|FEATURE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Updated enabled protocol features -ApiVersion to
%7|1574772202.380|STATE|rdkafka#producer-1| [thrd:HOSTNAME]: HOSTNAME:9094/bootstrap: Broker changed state APIVERSION_QUERY -> DOWN
So, is this a kind of mismatch between client and broker API versions? How can I recover from this, bearing in mind that I have no control over the version or the configuration of the Kafka cluster that AWS provides?

I think that this is related to the TLS encryption. By default, MSK spins up a cluster that accepts both PLAINTEXT and TLS but if you are grabbing the bootstrap servers programmatically from the cluster it will only provide you with the TLS ports. If this is the case for you, try using the PLAINTEXT port 9092 instead.
To authenticate the client for TLS you need to generate a certificate. You would then need to get this certificate onto your Lambda and reference it in your producer configuration.
If you are able to configure your MSK cluster as PLAINTEXT only then when you grab the bootstrap servers from the AWS SDK it will give you the PLAINTEXT port and you should be good.
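As a rough sketch of the point above, the client's security protocol has to match the listener it connects to. The helper below assumes the port numbers mentioned in this thread (9094 for TLS, 9092 for PLAINTEXT); the function names and the optional CA file argument are made up for illustration, and only `security_protocol` and `ssl_cafile` are actual kafka-python settings.

```python
TLS_PORT = 9094        # TLS listener port, as seen in the logs in the question
PLAINTEXT_PORT = 9092  # plaintext listener port suggested in the answer


def producer_kwargs(bootstrap_servers, cafile=None):
    """Build KafkaProducer keyword arguments that match the listener type.

    bootstrap_servers is a list of "host:port" strings. Mixed or unknown
    ports are rejected, because one client cannot speak both protocols.
    """
    ports = {int(server.rsplit(":", 1)[1]) for server in bootstrap_servers}
    if ports == {TLS_PORT}:
        protocol = "SSL"
    elif ports == {PLAINTEXT_PORT}:
        protocol = "PLAINTEXT"
    else:
        raise ValueError(f"unexpected or mixed listener ports: {sorted(ports)}")

    kwargs = {
        "bootstrap_servers": list(bootstrap_servers),
        "security_protocol": protocol,
    }
    if protocol == "SSL" and cafile is not None:
        kwargs["ssl_cafile"] = cafile  # CA bundle used to verify the brokers
    return kwargs


def make_producer(bootstrap_servers, cafile=None):
    """Create a producer; imports kafka-python lazily so the helper above
    can be used without the library installed."""
    from kafka import KafkaProducer

    return KafkaProducer(**producer_kwargs(bootstrap_servers, cafile))
```

A client that sends plaintext to the TLS port would show the connect-then-disconnect pattern described in the question, so checking this setting first is cheap.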

Since it doesn't work for non-python clients either, it's unlikely that it's a bug in the library.
It seems to be a networking issue.
There is a kafka broker setting called advertised.listeners which specifies the address that the client will be using after the first connection. In other words, this is what happens when a client consumes or produces:
Using bootstrap.servers, the client establishes the first connection and asks for the real address to use.
The broker answers back with the address specified by advertised.listeners within the brokers configuration.
The client tries consuming or producing using that new address.
This is a security feature that prevents brokers that could be publicly accessible from being consumed from or produced to by clients that shouldn't have access.
How to diagnose
Run the following command:
$ kafkacat -b -L
which returns
Metadata for all topics (from broker -1:
1 brokers:
broker 0 at
In this scenario, the address passed to -b is the one specified by the client, and even if the client has access to that address/port, the address reported for broker 0 is the one that will be used to consume/produce.
Now, if you are running Kafka in AWS MSK, it is probably managing this for you. You have to make sure that you can access the address returned by that command. If you can't, you might need to either change it or run your command from a host that has access to it.
Another option might be to open an SSH tunnel through a bastion host that has internal access to that address.
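To check whether the advertised addresses are reachable from the client host, something like the following sketch could help. It assumes the kafkacat -L output contains lines of the form "broker N at host:port" and uses only the standard library; the function names are made up for this example. A successful TCP connection only rules out routing and firewall problems, not protocol or TLS mismatches.

```python
import re
import socket

# Matches lines such as "  broker 1 at some-host:9092 (controller)".
BROKER_LINE = re.compile(r"broker\s+(\d+)\s+at\s+(\S+):(\d+)")


def advertised_brokers(kafkacat_output):
    """Extract (broker_id, host, port) tuples from `kafkacat -L` output."""
    return [
        (int(broker_id), host, int(port))
        for broker_id, host, port in BROKER_LINE.findall(kafkacat_output)
    ]


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(kafkacat_output):
    """Print one reachability line per advertised broker."""
    for broker_id, host, port in advertised_brokers(kafkacat_output):
        status = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(f"broker {broker_id} at {host}:{port} is {status}")
```

Feeding the saved output of the kafkacat command above into report() from the machine that runs the producer shows which advertised address, if any, the client cannot reach.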


Google Cloud Run - Container failed to start workarounds

Similarly to Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable I cannot start my container because it does not (need to) listen on a port. It is a Discord bot that just needs outbound connections for the APIs.
Is there a way I can get around this error? I've tried listening on a port using the socket module with
import socket
s = socket.socket()
s.bind(("", 8080))
Cloud Run is oriented to request-driven tasks and this explains Cloud Run's listen requirement.
Generally (!) clients make requests to your Cloud Run service endpoint triggering the creation of instances to process the requests and generate responses.
Generally (!) if there are no outstanding responses to be sent, the service scales down to zero instances.
Running a bot, you will need to configure Cloud Run artificially to:
1. Always run (and pay for) at least one instance (so that the bot isn't terminated)
2. Respond (!) to incoming requests (on one thread|process)
3. Run your bot (on one thread|process)
To do both #2 and #3 you'll need to consider Python multithreading|multiprocessing.
For #2, the code in your question is insufficient. You can use low-level sockets, but it will need to respond to incoming requests and so you will need to implement a server. It would be simpler to use e.g. Flask which gives you an HTTP server with very little code.
This server code exists only to satisfy the Cloud Run requirement; it is not required by your bot.
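A minimal sketch of that arrangement follows. It swaps Flask for the standard library's http.server to avoid an extra dependency, answers every GET with a plain "ok", and runs the server on a background thread so the main thread stays free for the bot. run_bot is a hypothetical placeholder for the Discord client's blocking start call, and the fallback port 8080 is an assumption for running it locally.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200 so the platform sees a listening service."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep request logging out of the bot's output


def start_health_server(port):
    """Start the HTTP server on a daemon thread and return it."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_bot():
    """Hypothetical placeholder: start the Discord bot here (blocking)."""


def main():
    # Cloud Run passes the port to listen on in the PORT environment variable.
    start_health_server(int(os.environ.get("PORT", "8080")))
    run_bot()
```

Calling main() from the container's entrypoint starts both parts. The thread is a daemon so that the process exits when the bot does.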
If I were you, I'd run the bot on a Compute Engine VM. You can do this for free.
If your bot is already packaged using a container, you can deploy the container directly to a VM.

Should we always have a KAFKA_LISTENERS (inside and outside) specified even if the producer and consumer are on the same n/w?

What if the containers (producer, consumer and Kafka) are on the same network bridge?
I am new to Kafka and am just trying to run a simple producer and consumer example. I have a Docker container which produces messages and pushes them to Kafka. This works by declaring kafka:9092 as a bootstrap server, since my Docker container for Kafka is called kafka.
Do I still need to declare inside and outside ports for Kafka? Can't the consumer listen on the same port as the producer?
Using kafka-python to send and receive messages.
Consumers and producers don't listen on ports, but as long as you have (at least) PLAINTEXT://kafka:9092 as the advertised listener, and listeners includes port 9092, then you don't necessarily need any other listener.
However, if you add other brokers in the same network for replication, I'd strongly recommend using at least SASL_PLAINTEXT for the inter-broker communication. That way all brokers in the same network "trust" each other as a cluster (and you can fine tune network traffic for replication, but that's not really needed for Docker)

Celery worker not reconnecting on network change/IP Change

I deployed Celery for some tasks that need to be performed at my workplace. These tasks are huge, and I bought a few high-spec machines to run them. Before I detail my issue, let me briefly describe what I've deployed:
RabbitMQ broker on a remote server
Producer that pushes tasks on another remote server
Workers at 3 machines deployed at my workplace
When I started, the whole process ran as smoothly as it had in testing, and everything worked just great!
The problem
Unfortunately, I forgot to consult my network guy about a fixed IP address, and at our location we do not get a fixed IP address from our ISP. So when the network disconnects, my Celery workers freeze and do nothing. Even once the network is back up, the IP address has changed, and the worker neither recreates the connection to the broker nor retries it. I have tried configuration such as BROKER_CONNECTION_MAX_RETRIES = 0 and BROKER_HEARTBEAT = 10, so I am posting here in the hope that experts on this matter can help.
PS: I cannot restart the workers manually with kill -9 every time the network changes the IP address.
Restarting the app using:
sudo rabbitmqctl stop_app
sudo rabbitmqctl start_app
solved the issue for me.
Also, since I had virtual host setup, I needed to get that reset too.
I'm not sure why that was needed, or in fact why any of the above was needed, but it did solve the problem for me.
The issue arose because I did not understand the nature of the AMQP protocol or RabbitMQ.
When a Celery worker starts, it opens a channel at RabbitMQ. Upon any network change this channel tries to reconnect, but the port/socket previously opened for the channel is registered with a different public IP address of the client. The negotiation between the Celery worker (client) and RabbitMQ (server) therefore cannot resume, because the client's address has changed; a new channel needs to be established whenever the public IP address of the client changes.
The answer by @qreOct above reflects either my failure to express the question properly or a difference in our perceptions. Still, thanks a lot for taking the time!

Executing a command on a remote server with decoupling, redundancy, and asynchronous

I have a few servers that require executing commands on other servers. For example a Bitbucket Server post receive hook executing a git pull on another server. Another example is the CI server pulling a new docker image and restarting an instance on another server.
I would normally use ssh for this, creating a user/group specifically for the job with limited permission.
A few downsides with ssh:
Synchronous ssh call means a git push will have to wait until complete.
If a host is not contactable for whatever reason, the ssh command will fail.
Maintaining keys, users, and sudoers permissions can become unwieldy.
Few possibilities:
Find an open source out of the box solution (I have tried with no luck so far)
Set up a REST API on each server that accepts calls with some type of authentication, e.g. POST https://server/git/pull/?apikey=a1b2c3
Set up Python/Celery to execute tasks on a different queue for each host. This means a celery worker on each server that can execute commands and possibly a service that accepts REST API calls, converting them to Celery tasks.
Is there a nice solution to this problem?
Defining the problem
You want to be able to trigger a remote task without waiting for it to complete.
This can be achieved in any number of ways, including with SSH. You can execute a remote command without waiting for it to complete by closing or redirecting all I/O streams, e.g. like this:
ssh user@host "/usr/bin/foobar </dev/null >/dev/null 2>&1"
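If the caller is a Python script (a hook, say), a comparable effect can be had on the calling side by starting the command without waiting for it. This is a sketch of a slightly different technique from the one-liner above: it detaches the local process rather than redirecting the remote command's streams. launch_detached is a name invented here.

```python
import subprocess


def launch_detached(command):
    """Start `command` (a list of arguments) and return immediately.

    All three standard streams are pointed at the null device so the child
    never blocks on I/O, and on POSIX the child gets its own session so it
    is not tied to the caller's terminal.
    """
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


# Example call, mirroring the command above:
# launch_detached(["ssh", "user@host", "/usr/bin/foobar"])
```

The caller learns nothing about success or failure this way, which is the price of not waiting.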
You want to be able to defer the task if the host is currently unavailable.
This requires a queuing/retry system of some kind. You will also need to decide whether the target hosts will be querying for messages ("pull") or whether messages will be sent to the target hosts from elsewhere ("push").
You want to simplify access control as much as possible.
There's no way to completely avoid this issue. One solution would be to put most of the authentication logic in a centralized task server. This splits the problem into two parts: configuring access rights in the task server, and configuring authentication between the task server and the target hosts.
Example solutions
Hosts attempt to start tasks over SSH using method above for asynchrony. If host is unavailable, task is written to local file. Cron job periodically retries sending failed tasks. Access control via SSH keys.
Hosts add tasks by writing commands to files on an SFTP server. Cron job on target hosts periodically checks for new commands and executes them if found. Access control managed via SSH keys on the SFTP server.
Hosts post tasks to REST API which adds them to queue. Celery daemon on each target host consumes from queue and executes tasks. Access managed primarily by credentials sent to the task queuing server.
Hosts post tasks to API which adds tasks to queue. Task consumer nodes pull tasks off the queue and send requests to API on target hosts. Authentication managed by cryptographic signature of sender appended to request, verified by task server on target host.
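The first example solution (try to send, spool to a local file on failure, retry from cron) could be sketched roughly as follows. The send callable, the JSON-lines spool format, and the function names are all assumptions made for illustration; send is expected to raise OSError (for example ConnectionError) when the target host is unavailable.

```python
import json
from pathlib import Path


def submit(task, send, spool_path):
    """Try to deliver `task`; append it to the spool file if delivery fails.

    Returns True if the task was delivered, False if it was spooled.
    """
    try:
        send(task)
        return True
    except OSError:
        with Path(spool_path).open("a", encoding="utf-8") as spool:
            spool.write(json.dumps(task) + "\n")
        return False


def retry_spooled(send, spool_path):
    """Resend spooled tasks, keeping those that still fail.

    Intended to be run periodically (e.g. from cron). Returns the number
    of tasks delivered in this pass.
    """
    path = Path(spool_path)
    if not path.exists():
        return 0
    lines = path.read_text(encoding="utf-8").splitlines()
    tasks = [json.loads(line) for line in lines if line]
    remaining = []
    delivered = 0
    for task in tasks:
        try:
            send(task)
            delivered += 1
        except OSError:
            remaining.append(task)
    # Rewrite the spool with only the tasks that are still undelivered.
    path.write_text(
        "".join(json.dumps(task) + "\n" for task in remaining),
        encoding="utf-8",
    )
    return delivered
```

This sketch does not lock the spool file, so a real deployment would need to guard against submit and the cron job touching it at the same time.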
You can also look into tools that do some or all of the required functions out of the box. For example, some Google searching came up with Rundeck which seems to have some job scheduling capabilities and a REST API. You should also consider whether you can leverage any existing automated deployment or management tools already present in your system.
Ultimately, there's no single right answer to this question. It really depends on your particular needs. Ask yourself: How much time and effort do you want to spend creating this system? What about maintenance? How reliable does it need to be? How much does it need to scale? And so on, ad infinitum...

EC2 fails to connect via FTPS, but works locally

I'm running Python 2.6.5 on ec2 and I've replaced the old ftplib with the newer one from Python2.7 that allows importing of FTP_TLS. Yet the following hangs up on me:
from ftplib import FTP_TLS
ftp = FTP_TLS('host', 'username', 'password')
ftp.retrlines('LIST')  # times out after 15-20 min
I'm able to run these three lines successfully in a matter of seconds on my local machine, but it fails on ec2. Any idea as to why this is?
It certainly sounds like a problem related to whether or not you're in PASSIVE mode on your FTP connection, and whether both ends of the connection can support it.
The ftplib documentation suggests that it is on by default, which is a shame, because I was going to suggest that you turn it on. Instead, I'll suggest that you use set_debuglevel to raise the debug level to where you can see the lower levels of the protocol and see what mode you're in. That should give you information on how to proceed. Either you're in passive mode and the other end can't deal with it properly, or (hopefully) you're not, but you should be.
FTP and FTPS (but not SFTP) can be configured so that the server makes a backwards connection to the client for the actual transfers or so that the client makes a second forward connection to the server for the transfers. The former, especially, is prone to complications whenever network address translation is involved. Without the TLS, some firewalls can actually rewrite the FTP session traffic to make it magically work, but with TLS that's impossible due to encryption.
The fact that you are presumably authenticating and then timing out when you try to transfer data (LIST requires a second connection in one direction or the other) is usually the classic symptom of a setup that needs passive mode. Or there's this, from the ftplib documentation:
Connect as usual to port 21 implicitly securing* the FTP control connection before authenticating. Securing the data connection requires the user to explicitly ask for it by calling the prot_p() method.
ftps.prot_p() # switch to secure data connection
ftps.retrlines('LIST') # list directory content securely
I don't work with FTPS often, since SFTP is so much less problematic, but if you're not doing that, the far end server might not be cooperating.
*note, I suspect this sentence is trying to say that FTP_TLS "implicitly secures the FTP control connection" in contrast with the explicit securing of the data connection.
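Putting the suggestions so far together (protocol debugging, explicit passive mode, and prot_p() before the transfer), a sketch might look like this. It is written for Python 3's ftplib rather than the backported module from the question. The list_directory name, the 30-second timeout, and the ftp_factory parameter (there only so the function can be exercised without a real server) are assumptions of this example.

```python
from ftplib import FTP_TLS


def list_directory(host, user, password, passive=True, timeout=30,
                   ftp_factory=FTP_TLS):
    """Log in over FTPS and return the lines of a LIST response.

    A socket timeout is set so that a blocked data connection fails in
    seconds instead of hanging for many minutes.
    """
    ftps = ftp_factory(host=host, user=user, passwd=password, timeout=timeout)
    try:
        ftps.set_debuglevel(2)   # print the control-channel conversation
        ftps.set_pasv(passive)   # make the transfer mode explicit
        ftps.prot_p()            # switch to a secure data connection
        lines = []
        ftps.retrlines("LIST", lines.append)
        return lines
    finally:
        ftps.close()
```

Running it once with passive=True and once with passive=False, and comparing the debug output, should show at which command the transfer stalls.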
If you're still having trouble, could you try ruling out Amazon firewall problems? (I'm assuming you're not using a host-based firewall.)
If your EC2 instance is in a VPC then in the AWS Management Console could you:
ensure you have an internet gateway
ensure that the subnet your EC2 instance is in has a default route configured pointing at the internet gateway
in the Security Group for both inbound and outbound allow All Traffic from all sources
in the Network ACLs for both inbound and outbound allow All Traffic from all sources
If your EC2 instance is NOT in a VPC then in the AWS Management Console could you:
in the Security Group for inbound allow All Traffic from all sources
Only do this in a test environment! (obviously)
This will open your EC2 instance up to all traffic from the internet. Hopefully you'll find that your FTPS is now working. Then you can gradually reapply the security rules until you find out the cause of the problem. If it's still not working then the AWS firewall is not the cause of the problem (or you have more than one problem).

