I have established a successful connection between a Kafka producer and consumer on a Google Cloud Platform cluster by running:
$ cd /usr/lib/kafka
$ bin/kafka-console-producer.sh config/server.properties --broker-list \
PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092 --topic test
and by executing in a new shell:
$ cd /usr/lib/kafka
$ bin/kafka-console-consumer.sh --bootstrap-server \
PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092 --topic test \
--from-beginning
Now I want to send messages to the Kafka broker using the following Python script:
from kafka import *
topic = 'test'
producer = KafkaProducer(bootstrap_servers='PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092',
api_version=(0,10))
producer.send(topic, b"Test test test")
However, this results in a KafkaTimeoutError:
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
Looking around online suggested:
uncommenting listeners=... and advertised.listeners=... in the /usr/lib/kafka/config/server.properties file.
However, listeners=PLAINTEXT://:9092 does not work, and this post suggests setting PLAINTEXT://<external-ip>:9092.
So I started wondering about accessing the Kafka server through an external (static) IP address of the GCP cluster. We then set up a firewall rule to open the port (?) and allow HTTPS access to the cluster, but I am unsure whether this is overkill for the problem.
I definitely need some guidance to connect successfully to the Kafka server from the python script.
You need to set advertised.listeners to the address that your client connects to.
More info: https://rmoff.net/2018/08/02/kafka-listeners-explained/
Thanks Robin! The link you posted was very helpful in finding the working configuration below.
Although SimpleProducer seems to be a deprecated approach, the following settings finally worked for me:
Python script:
from kafka import *
topic = 'test'
kafka = KafkaClient('[project-name]-w-0.c.[cluster-id].internal:9092')
producer = SimpleProducer(kafka)
message = "Test"
producer.send_messages(topic, message.encode('utf-8'))
and uncomment and set the following in the /usr/lib/kafka/config/server.properties file:
listeners=PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092
advertised.listeners=PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092
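For anyone on a newer kafka-python release where SimpleProducer and KafkaClient are no longer available, a roughly equivalent (untested) sketch using the current KafkaProducer API and the same placeholder broker address would be:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='[project-name]-w-0.c.[cluster-id].internal:9092')
producer.send('test', b'Test')
producer.flush()  # make sure the message is actually delivered before exiting
producer.close()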
Related
I am facing a problem making my Apache Beam pipeline work on Cloud Dataflow with the DataflowRunner.
The first step of the pipeline is to connect to an external PostgreSQL server hosted on a VM which is only externally accessible through SSH (port 22) and extract some data. I can't change these firewall rules, so I can only connect to the DB server via SSH tunneling, aka port forwarding.
In my code I make use of the Python library sshtunnel. It works perfectly when the pipeline is launched from my development computer with the DirectRunner:
from sshtunnel import open_tunnel

with open_tunnel(
        (user_options.ssh_tunnel_host, user_options.ssh_tunnel_port),
        ssh_username=user_options.ssh_tunnel_user,
        ssh_password=user_options.ssh_tunnel_password,
        remote_bind_address=(user_options.dbhost, user_options.dbport)
) as tunnel:
    with beam.Pipeline(options=pipeline_options) as p:
        (p | "Read data" >> ReadFromSQL(
                host=tunnel.local_bind_host,
                port=tunnel.local_bind_port,
                username=user_options.dbusername,
                password=user_options.dbpassword,
                database=user_options.dbname,
                wrapper=PostgresWrapper,
                query=select_query
             )
           | "Format CSV" >> DictToCSV(headers)
           | "Write CSV" >> WriteToText(user_options.export_location)
        )
The same code, launched with the DataflowRunner inside a non-default VPC where all ingress is denied but there is no egress restriction, and with Cloud NAT configured, fails with this message:
psycopg2.OperationalError: could not connect to server: Connection refused Is the server running on host "0.0.0.0" and accepting TCP/IP connections on port 41697? [while running 'Read data/Read']
So, obviously something is wrong with my tunnel but I cannot spot what exactly. I was beginning to wonder whether a direct SSH tunnel setup was even possible through CloudNAT, until I found this blog post: https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1 stating:
A core strength of Cloud Dataflow is that you can call external services for data enrichment. For example, you can call a micro service to get additional data for an element.
Within a DoFn, call-out to the service (usually done via HTTP). You have full control to make any type of connection that you choose, so long as the firewall rules you set up within your project/network allow it.
So it should be possible to set up this tunnel! I don't want to give up, but I don't know what to try next. Any idea?
Thanks for reading
Problem solved! I can't believe I spent two full days on this... I was looking completely in the wrong direction.
The issue was not with some Dataflow or GCP networking configuration, and as far as I can tell...
You have full control to make any type of connection that you choose, so long as the firewall rules you set up within your project/network allow it
is true.
The problem was of course in my code; it was only revealed in a distributed environment. I had made the mistake of opening the tunnel from the main pipeline processor instead of from the workers. So the SSH tunnel was up, but not between the workers and the target server, only between the main pipeline and the target!
To fix this, I had to change my requesting DoFn to wrap the query execution with the tunnel:
class TunnelledSQLSourceDoFn(sql.SQLSourceDoFn):
    """Wraps SQLSourceDoFn in a ssh tunnel"""

    def __init__(self, *args, **kwargs):
        self.dbport = kwargs["port"]
        self.dbhost = kwargs["host"]
        self.args = args
        self.kwargs = kwargs
        super().__init__(*args, **kwargs)

    def process(self, query, *args, **kwargs):
        # Remote side of the SSH Tunnel
        remote_address = (self.dbhost, self.dbport)
        ssh_tunnel = (self.kwargs['ssh_host'], self.kwargs['ssh_port'])
        with open_tunnel(
                ssh_tunnel,
                ssh_username=self.kwargs["ssh_user"],
                ssh_password=self.kwargs["ssh_password"],
                remote_bind_address=remote_address,
                set_keepalive=10.0
        ) as tunnel:
            forwarded_port = tunnel.local_bind_port
            self.kwargs["port"] = forwarded_port
            source = sql.SQLSource(*self.args, **self.kwargs)
            sql.SQLSouceInput._build_value(source, source.runtime_params)
            logging.info("Processing - {}".format(query))
            for records, schema in source.client.read(query):
                for row in records:
                    yield source.client.row_as_dict(row, schema)
As you can see, I had to override some bits of the pysql_beam library.
Finally, each worker opens its own tunnel for each request. It's probably possible to optimize this behavior, but it's enough for my needs.
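For completeness, here is a rough sketch of how this DoFn could be wired into the pipeline. The constructor keyword names (host, port, ssh_host, ...) are the kwargs referenced in the DoFn above; the exact pysql_beam constructor signature and the ParDo wiring are an assumption, not a verified usage:
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | "Queries" >> beam.Create([select_query])
     | "Read data" >> beam.ParDo(TunnelledSQLSourceDoFn(
            host=user_options.dbhost,
            port=user_options.dbport,
            username=user_options.dbusername,
            password=user_options.dbpassword,
            database=user_options.dbname,
            ssh_host=user_options.ssh_tunnel_host,
            ssh_port=user_options.ssh_tunnel_port,
            ssh_user=user_options.ssh_tunnel_user,
            ssh_password=user_options.ssh_tunnel_password))
     | "Format CSV" >> DictToCSV(headers)
     | "Write CSV" >> WriteToText(user_options.export_location)
    )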
I am trying to create a simple client with pykafka. For this I need SSL certificates. The client runs under RHEL 7 and Python 3.6.x.
It looks like the connection works, but I don't get any feedback or data, only a black screen.
How can I check the connection or get error messages?
#!/usr/bin/scl enable rh-python36 -- python3
from pykafka import KafkaClient, SslConfig
from pykafka.common import OffsetType  # needed for OffsetType.EARLIEST below

config = SslConfig(cafile='key/root_ca.crt',
                   certfile='key/cert.crt',
                   keyfile='key/key.key',
                   password='xxxx')

client = KafkaClient(hosts="xxxxxxx:9093",
                     ssl_config=config)

print("topics", client.topics)

topic = client.topics['xxxxxx']
consumer = topic.get_simple_consumer(
    consumer_group="yyyyy",
    auto_offset_reset=OffsetType.EARLIEST,
    reset_offset_on_start=False
)
for message in consumer:
    if message is not None:
        print(message.offset, message.value)
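One way to see what pykafka is actually doing is to enable Python's standard logging before creating the client, since pykafka reports connection and SSL problems through the logging module; a minimal sketch:
import logging

# Print pykafka's internal connection/SSL diagnostics to the console
logging.basicConfig(level=logging.DEBUG)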
I am stuck on an issue with a Kafka consumer that uses confluent-kafka's Python library.
CONTEXT
I have a Kafka topic on AWS EC2 that I need to consume.
SCENARIO
Consumer Script (my_topic_consumer.py) uses confluent-kafka-python to create a consumer (shown below) and subscribe to the 'my_topic' topic. The issue is that the consumer is not able to read messages from the Kafka cluster.
All required security steps are met:
1. SSL - security protocol for the consumer and broker.
2. The consumer EC2 instance's IP block has been added to the Security Group on the cluster.
#my_topic_consumer.py
from confluent_kafka import Consumer, KafkaError

c = Consumer({
    'bootstrap.servers': 'my_host:my_port',
    'group.id': 'my_group',
    'auto.offset.reset': 'latest',
    'security.protocol': 'SSL',
    'ssl.ca.location': '/path/to/certificate.pem'
})

c.subscribe(['my_topic'])

try:
    while True:
        msg = c.poll(5)
        if msg is None:
            print('None')
            continue
        if msg.error():
            print(msg)
            continue
        else:
            pass  # process_msg(msg) - Writes messages to a data file.
except KeyboardInterrupt:
    print('Aborted by user\n')
finally:
    c.close()
URLS
Broker Host: my_host
Port: my_port
Group ID: my_group
CONSOLE COMMANDS
working - Running the console-consumer script, I am able to see the data:
kafka-console-consumer --bootstrap-server my_host:my_port --consumer.config client-ssl.properties --skip-message-on-error --topic my_topic | jq
Note: client-ssl.properties points to the JKS file which has the certs.
Debugging further on the Kafka cluster (a separate EC2 instance from the consumer), I couldn't see any registration of my consumer under my group_id (my_group):
kafka-consumer-groups --bootstrap-server my_host:my_port --command-config client-ssl.properties --describe --group my_group
This leads me to believe the consumer is not getting registered on the cluster, so maybe the SSL handshake is failing? How do I check this from the consumer side in Python?
Note
- The cluster is behind a corporate proxy, but I do run the proxy on the consumer EC2 instance before testing.
- I ran the process via pm2, yet didn't see any error logs like request timeouts etc.
Is there any way I can definitively check that the Consumer creation is failing and find out the root cause? Any help and feedback is appreciated.
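One way to surface such errors from the Python side is librdkafka's built-in diagnostics: a debug setting and an error callback passed to the consumer configuration. This is only a sketch, reusing the same placeholder host and certificate path as above:
from confluent_kafka import Consumer

def on_error(err):
    # Called by librdkafka for client-level errors, e.g. failed SSL handshakes
    print('Consumer error: {}'.format(err))

c = Consumer({
    'bootstrap.servers': 'my_host:my_port',
    'group.id': 'my_group',
    'security.protocol': 'SSL',
    'ssl.ca.location': '/path/to/certificate.pem',
    'debug': 'security,broker',  # librdkafka debug contexts, printed to stderr
    'error_cb': on_error,
})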
I have Kafka v1.0.1 running on a single node, and I am able to push messages to the topic, but somehow I am unable to consume messages from another node using the Python code below.
from kafka import KafkaConsumer
consumer = KafkaConsumer(
'kotak-test',
bootstrap_servers=['kmblhdpedge:9092'],
auto offset reset='earliest',
enable auto commit=True,
group id=' test1',
value_deserializer-lambda x: loads (x.decode('utf-8')))
for message in consumer:
message = message.value
print (message)
I am constantly pushing the messages from the console using the below command:
bin/kafka-console-producer --zookeeper <zookeeper-node>:<port> --topic <topic_name>
and I can also read the messages via the console consumer.
You're using the old Zookeeper-based producer, but the newer Kafka-based consumer. The logic for how these work and store offsets is not the same.
You need to use --broker-list on the Console Producer
Similarly with Console Consumer, use --bootstrap-server, not --zookeeper
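For example, with the same placeholders:
bin/kafka-console-producer --broker-list <broker-host>:9092 --topic <topic_name>
bin/kafka-console-consumer --bootstrap-server <broker-host>:9092 --topic <topic_name> --from-beginning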
Also, these properties should not have spaces in them
auto offset reset='earliest',
enable auto commit=True,
group id=' test1',
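In kafka-python the parameter names use underscores (and = for assignment), so the consumer would look roughly like this (untested sketch, with the leading space in the group id dropped):
from json import loads
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'kotak-test',
    bootstrap_servers=['kmblhdpedge:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='test1',
    value_deserializer=lambda x: loads(x.decode('utf-8')))

for message in consumer:
    print(message.value)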
I am having trouble getting KafkaConsumer to read from the beginning, or from any other explicit offset.
Running the command-line tools for the consumer for the same topic, I do see messages with the --from-beginning option, and it hangs otherwise:
$ ./kafka-console-consumer.sh --zookeeper {localhost:port} --topic {topic_name} --from-beginning
If I run it through Python, it hangs, which I suspect to be caused by incorrect consumer configs:
consumer = KafkaConsumer(topic_name,
                         bootstrap_servers=['localhost:9092'],
                         group_id=None,
                         auto_commit_enable=False,
                         auto_offset_reset='smallest')

print "Consuming messages from the given topic"
for message in consumer:
    print "Message", message
    if message is not None:
        print message.offset, message.value
print "Quit"
Output:
Consuming messages from the given topic
(hangs after that)
I am using kafka-python 0.9.5 and the broker runs kafka 8.2. Not sure what the exact problem is.
Set group_id=None, as suggested by dpkp, to emulate the behavior of the console consumer.
The difference between the console consumer and the Python consumer code you have posted is that the Python consumer uses a consumer group to save offsets: group_id="test-consumer-group". If instead you set group_id=None, you should see the same behavior as the console consumer.
I ran into the same problem: I can receive messages in the Kafka console but can't get messages with a Python script using the kafka-python package.
Finally I figured out that the reason was that I didn't call producer.flush() and producer.close() in my producer.py, which is not mentioned in its documentation.
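In other words, something along these lines at the end of the producer script (a sketch with a placeholder topic and broker):
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('my-topic', b'some message')
producer.flush()  # block until buffered messages are actually sent
producer.close()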
auto_offset_reset='earliest' solved it for me.
auto_offset_reset='earliest' and group_id=None solved it for me.
My take is: print and make sure the offset is what you expect it to be, using position() and seek_to_beginning(); please see the comments in the code.
I can't explain:
Why, after instantiating KafkaConsumer, the partitions are not assigned. Is this by design? A workaround is to call poll() once before seek_to_beginning().
Why sometimes, after seek_to_beginning(), the first call to poll() returns no data and doesn't change the offset.
Code:
import kafka
print(kafka.__version__)
from kafka import KafkaProducer, KafkaConsumer
from time import sleep
KAFKA_URL = 'localhost:9092' # kafka broker
KAFKA_TOPIC = 'sida3_sdtest_topic' # topic name
# ASSUMING THAT the topic exist
# write to the topic
producer = KafkaProducer(bootstrap_servers=[KAFKA_URL])
for i in range(20):
    producer.send(KAFKA_TOPIC, ('msg' + str(i)).encode())
producer.flush()
# read from the topic
# auto_offset_reset='earliest', # auto_offset_reset is needed when offset is not found, it's NOT what we need here
consumer = KafkaConsumer(KAFKA_TOPIC,
                         bootstrap_servers=[KAFKA_URL],
                         max_poll_records=2,
                         group_id='sida3')
# (!?) wtf, why we need this to get partitions assigned
# AssertionError: No partitions are currently assigned if poll() is not called
consumer.poll()
consumer.seek_to_beginning()
# also AssertionError: No partitions are currently assigned if poll() is not called
print('partitions of the topic: ',consumer.partitions_for_topic(KAFKA_TOPIC))
from kafka import TopicPartition
print('before poll() x2: ')
print(consumer.position(TopicPartition(KAFKA_TOPIC, 0)))
print(consumer.position(TopicPartition(KAFKA_TOPIC, 1)))
# (!?) sometimes the first call to poll() returns nothing and doesnt change the offset
messages = consumer.poll()
sleep(1)
messages = consumer.poll()
print('after poll() x2: ')
print(consumer.position(TopicPartition(KAFKA_TOPIC, 0)))
print(consumer.position(TopicPartition(KAFKA_TOPIC, 1)))
print('messages: ', messages)
Output:
2.0.1
partitions of the topic: {0, 1}
before poll() x2:
0
0
after poll() x2:
0
2
messages: {TopicPartition(topic='sida3_sdtest_topic', partition=1): [ConsumerRecord(topic='sida3_sdtest_topic', partition=1, offset=0, timestamp=1600335075864, timestamp_type=0, key=None, value=b'msg0', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=4, serialized_header_size=-1), ConsumerRecord(topic='sida3_sdtest_topic', partition=1, offset=1, timestamp=1600335075864, timestamp_type=0, key=None, value=b'msg1', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=4, serialized_header_size=-1)]}
I faced the same issue before, so I ran kafka-topics on the machine running the code to test, and I got an UnknownHostException. I added the IP and the host name to the hosts file, and it worked fine in both kafka-topics and the code.
It seems that KafkaConsumer was trying to fetch the messages but failed without raising any exception.
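That is, an entry along these lines in the hosts file (hypothetical IP and broker hostname):
10.0.0.12    kafka-broker-hostname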
For me, I had to specify the router's IP in the Kafka PLAINTEXT configuration.
Get the router's IP with:
echo $(ifconfig | grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" | grep -v 127.0.0.1 | awk '{ print $2 }' | cut -f2 -d: | head -n1)
and then add PLAINTEXT_HOST://<router_ip>:9092 to the Kafka advertised listeners. In the case of a Confluent Docker service, the configuration is as follows:
kafka:
  image: confluentinc/cp-kafka:7.0.1
  container_name: kafka
  depends_on:
    - zookeeper
  ports:
    - 9092:9092
    - 29092:29092
  environment:
    - KAFKA_BROKER_ID=1
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
    - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:29092,PLAINTEXT_HOST://172.28.0.1:9092
    - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    - KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT
    - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
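(The compose file above also assumes a zookeeper service; a minimal sketch, with the image tag assumed to match the Kafka one:)
zookeeper:
  image: confluentinc/cp-zookeeper:7.0.1
  container_name: zookeeper
  environment:
    - ZOOKEEPER_CLIENT_PORT=2181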
and finally the python consumer is:
from kafka import KafkaConsumer
from json import loads

consumer = KafkaConsumer(
    'my-topic',
    bootstrap_servers=['172.28.0.1:9092'],
    auto_offset_reset='earliest',
    group_id=None,
)
print('Listening')
for msg in consumer:
    print(msg)