Read from specific Kafka topic using Python

I have a topic with 3 partitions, and I'm trying to read from a specific partition using the following code:
from kafka import KafkaConsumer, TopicPartition
brokers = 'localhost:9092'
topic = 'b3'
m = KafkaConsumer(topic, bootstrap_servers=['localhost:9092'])
par = TopicPartition(topic=topic, partition=1)
m.assign(par)
but I am getting this error:
raise IllegalStateError(self._SUBSCRIPTION_EXCEPTION_MESSAGE)
kafka.errors.IllegalStateError: IllegalStateError: You must choose only one way to configure your consumer: (1) subscribe to specific topics by name, (2) subscribe to topics matching a regex pattern, (3) assign itself specific topic-partitions.
Can somebody help me with this?

Can you remove the topic parameter from KafkaConsumer() and try again?
example:
# manually assign the partition list for the consumer
from kafka import TopicPartition, KafkaConsumer
consumer = KafkaConsumer(bootstrap_servers='localhost:1234')
consumer.assign([TopicPartition('foobar', 2)])
msg = next(consumer)
ref: http://kafka-python.readthedocs.io/en/master/
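Applied to the snippet from the question, the fix would look roughly like this (note that assign() expects a list of TopicPartition objects, and that the topic must not also be passed to the constructor):

from kafka import KafkaConsumer, TopicPartition

topic = 'b3'
# No topic in the constructor: constructor subscription and assign()
# are mutually exclusive.
m = KafkaConsumer(bootstrap_servers=['localhost:9092'])
# assign() takes a list of TopicPartition objects.
m.assign([TopicPartition(topic=topic, partition=1)])
for msg in m:
    print(msg.partition, msg.offset, msg.value)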

Related

Kafka consumer implementation not working in Python

This is my first time working with Kafka, and I've run into a problem.
I have the following implementation of my consumer:
from kafka import KafkaConsumer
import json
import config


class KafkaMessageConsumer:
    def __init__(self):
        self.consumer = KafkaConsumer(
            bootstrap_servers=config.KAFKA_BOOTSTRAP_SERVER,
            security_protocol=config.KAFKA_SECURITY_PROTOCOL,
            sasl_mechanism=config.KAFKA_SASL_MECHANISM,
            sasl_plain_username=config.KAFKA_USERNAME,
            sasl_plain_password=config.KAFKA_PASSWORD,
            value_deserializer=lambda x: json.loads(x.decode("utf-8")),
        )

    def receive_messages(self, topic):
        self.consumer.subscribe(topics=[topic])
        print(f"Subscribed to topics: {self.consumer.subscription()}")
        for msg in self.consumer:
            yield msg.value


if __name__ == "__main__":
    consumer = KafkaMessageConsumer()
    for message in consumer.receive_messages(config.KAFKA_TOPIC):
        print("Received message:", message)
The credentials should be configured correctly. I see the subscription message without error, but no messages are yielded, even though I know for sure that there are messages to be consumed on the topic. Am I missing some necessary config here?
I'm no expert in Python, but it looks like you haven't consumed any messages: you have subscribed to the topic, but you would need to poll() for messages: https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.poll
Also, where have you set the topic name ([topic])?
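For illustration, a minimal explicit poll() loop could look like this (a sketch with a hypothetical topic name; poll() returns a dict mapping TopicPartition to a list of ConsumerRecord objects):

from kafka import KafkaConsumer

consumer = KafkaConsumer('some-topic',  # hypothetical topic name
                         bootstrap_servers='localhost:9092')
# poll() returns {TopicPartition: [ConsumerRecord, ...]}
records = consumer.poll(timeout_ms=1000)
for tp, batch in records.items():
    for record in batch:
        print(record.topic, record.partition, record.value)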

Polling data based on date filters

I have the code below, which reads data from a Kafka topic:
from kafka import KafkaConsumer
from pandas import json_normalize
import json

bootstrap_servers = ['server.com']
topicName = 'topicname'
consumer = KafkaConsumer(topicName, group_id='topic',
                         bootstrap_servers=bootstrap_servers,
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=20000)
data_list = []
for message in consumer:
    print(message)
    data = json.loads(message.value)
    df = json_normalize(data)
    data_list.append(df)
However, I am trying to see if I can restrict the data based on a certain timestamp.
You can filter out, with a simple if statement, each message that doesn't match the criteria you want to process.
Or you can get the offsets for a particular timestamp and then seek your consumer to those offsets before polling (be careful about handling application restarts so you don't reprocess the same range); see the sketch after the link below:
https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html?highlight=seek#kafka.KafkaConsumer.offsets_for_times
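A rough sketch of the offsets_for_times() approach, assuming a single manually assigned partition (the timestamp is in milliseconds since the epoch, and the value here is hypothetical):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['server.com'])
tp = TopicPartition('topicname', 0)
consumer.assign([tp])

start_ts = 1600000000000  # hypothetical start timestamp, in ms
# offsets_for_times() maps each partition to the earliest offset whose
# timestamp is >= the given timestamp (or None if no such message exists).
offsets = consumer.offsets_for_times({tp: start_ts})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)
    for message in consumer:
        print(message.timestamp, message.value)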

kafka-python: Can a consumer object join a consumer group without subscribing to any topics?

I have a topics_subscriber.py that continuously checks for any new topics and subscribes to them.
import time

import kafka
from kafka import KafkaConsumer


def _subscribe_topics(consumer: KafkaConsumer) -> None:
    topics_to_subscribe = {
        topic
        for topic in consumer.topics()
        if topic.startswith("topic-")
    }
    subscribed_topics = consumer.subscription()
    print(subscribed_topics)
    new_topics = (
        topics_to_subscribe - subscribed_topics
        if subscribed_topics
        else topics_to_subscribe
    )
    if new_topics:
        print("new topics detected:")
        for topic in new_topics:
            print(topic)
        print("\nsubscribing to old+new topics\n")
        # subscribe() is not incremental, hence subscribing to old+new topics
        consumer.subscribe(topics=list(topics_to_subscribe))
        print(consumer.subscription())
    else:
        print("\nno new topics detected: exiting normally\n")


def main() -> None:
    consumer = kafka.KafkaConsumer(
        client_id="client_1",
        group_id="my_group",
        bootstrap_servers=['localhost:9092'],
    )
    while True:
        _subscribe_topics(consumer)
        print("\nsleeping 10 sec\n")
        time.sleep(10)
Now, in another script, kafka_extractor.py, I want to create a new consumer that joins the my_group consumer group and starts consuming messages from the topics the group is subscribed to, i.e. without specifically subscribing to any topics for this new consumer.
import kafka


def main() -> None:
    consumer = kafka.KafkaConsumer(
        client_id="client_2",
        group_id="my_group",
        bootstrap_servers=['localhost:9092'],
    )
    print("created consumer")
    print(consumer.subscription())
    for msg in consumer:
        print(msg.topic)
Two things to note in the output of kafka_extractor.py:
print(consumer.subscription()) outputs as None
for msg in consumer: gets stuck; it neither moves forward nor exits the program.
Any directions would be appreciated here.
Joining a consumer to a group does not make it pick up the group's existing topics, because a group can contain multiple topics, and each consumer instance decides which topics to consume by subscribing.
Your loop is stuck because the consumer defaults to reading from the end of its subscribed topics, and it has subscribed to none.
If you want to consume a list of topics that is refreshed periodically, use a regex (pattern) subscription, as sketched below.
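With kafka-python that is a pattern subscription. A minimal sketch, assuming the same local broker and the topic- prefix used above:

import kafka

consumer = kafka.KafkaConsumer(
    client_id="client_2",
    group_id="my_group",
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',  # otherwise the consumer starts at the log end
)
# Pattern subscription: the subscription is re-evaluated on metadata
# refresh, so newly created matching topics are picked up automatically.
consumer.subscribe(pattern="^topic-.*")
for msg in consumer:
    print(msg.topic)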

Get Latest Message for a Confluent Kafka Topic in Python

Here's what I've tried so far:
from confluent_kafka import Consumer
c = Consumer({
    # ... several security/server settings skipped ...
    'auto.offset.reset': 'beginning',
    'group.id': 'my-group',
})
c.subscribe(['my.topic'])
msg = c.poll(30.0)  # msg is of None type.
msg almost always ends up being None though. I think the issue might be that 'my-group' has already consumed all the messages for 'my.topic'... but I don't care whether a message has already been consumed or not - I still need the latest message. Specifically, I need the timestamp from that latest message.
I tried a bit more, and from this it looks like there are probably 25 messages in the topic, but I have no idea how to get at them:
a = c.assignment()
print(a) # Outputs [TopicPartition{topic=my.topic,partition=0,offset=-1001,error=None}]
offsets = c.get_watermark_offsets(a[0])
print(offsets) # Outputs: (25, 25)
If there are no messages because the topic has never had anything written to it at all, how can I determine that? And if that's the case, how can I determine how long the topic has existed for? I'm looking to write a script that automatically deletes any topics that haven't been written to in the past X days (14 initially - will probably tweak it over time.)
I ran into the same issue, and found no example for this. In my case there is one partition, and I need to read the last message to get some info from it in order to set up the consumer/producer component I have.
The logic is: start the Consumer, subscribe to the topic, and poll for a message. This triggers on_assign, where the rewinding happens by assigning the modified partitions back. After on_assign finishes, the poll continues and reads the last message from the topic.
from confluent_kafka import Consumer

settings = {
    "bootstrap.servers": "my.kafka.server",
    "group.id": "my-work-group",
    "client.id": "my-work-client-1",
    "enable.auto.commit": False,
    "session.timeout.ms": 6000,
    "default.topic.config": {"auto.offset.reset": "largest"},
}
consumer = Consumer(settings)


def on_assign(a_consumer, partitions):
    # get the (low, high) watermark offsets of the first partition
    last_offset = a_consumer.get_watermark_offsets(partitions[0])
    # position [1] is the high watermark; the last message is one before it
    partitions[0].offset = last_offset[1] - 1
    a_consumer.assign(partitions)


consumer.subscribe(["test-topic"], on_assign=on_assign)
msg = consumer.poll(6.0)
Now msg holds the last message.
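Since the original question also needed the timestamp of that last message: in confluent-kafka the Message object exposes it via timestamp(), which returns a (timestamp_type, timestamp) tuple, e.g.:

from confluent_kafka import TIMESTAMP_NOT_AVAILABLE

if msg is not None and not msg.error():
    ts_type, ts = msg.timestamp()  # timestamp in milliseconds
    if ts_type != TIMESTAMP_NOT_AVAILABLE:
        print("latest message timestamp (ms):", ts)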
If anyone still needs an example for the case with multiple partitions, this is how I did it:
from confluent_kafka import OFFSET_END, Consumer

settings = {
    'bootstrap.servers': "my.kafka.server",
    'group.id': "my-work-group",
    'auto.offset.reset': "latest",
}


def on_assign(consumer, partitions):
    for partition in partitions:
        partition.offset = OFFSET_END
    consumer.assign(partitions)


consumer = Consumer(settings)
consumer.subscribe(["test-topic"], on_assign=on_assign)
msg = consumer.poll(1.0)

Confluent kafka python pause-resume functionality example

I was trying to use the Confluent Kafka consumer's pause and resume functionality but couldn't find any examples on the internet apart from the main link:
https://docs.confluent.io/5.0.0/clients/confluent-kafka-python/index.html
I couldn't understand the parameters to be passed to them. Is it a list of partitions, or topic names, or what?
As OneCricketeer mentioned, pause() and resume() take a list of TopicPartition, and to initialize the TopicPartition class you need the topic, partition, and offset, which you can get from the message object.
This is how you can achieve it through Confluent-Kafka-Python:
import time
from confluent_kafka import Consumer, Producer, TopicPartition

conf = {
    'bootstrap.servers': "localhost:9092",
    'group.id': "test-consumer-group",
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,
}
topics = ['topic1']
consumer = Consumer(conf)
consumer.subscribe(topics)

while True:
    try:
        msg = consumer.poll(1.0)
        if msg is None:
            print("Waiting for message or event/error in poll()...")
            continue
        if msg.error():
            print("Error: {}".format(msg.error()))
            continue
        else:
            # Call to your processing function and pause the consumer
            consumer.pause([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            time.sleep(60)  # Think of it as processing time
            # Once the processing is done resume the consumer and commit the message
            consumer.resume([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            consumer.commit()
    except Exception as e:
        print(e)
This is just an example and you might want to modify it based on your use case.
Pause and resume take a list of TopicPartition. From the docs:
class confluent_kafka.TopicPartition
    TopicPartition is a generic type to hold a single partition and various information about it.
    It is typically used to provide a list of topics or partitions for various operations, such as Consumer.assign().
    TopicPartition(topic[, partition][, offset])
        Instantiate a TopicPartition object.
        Parameters:
            topic (string) – Topic name
            partition (int) – Partition id
            offset (int) – Initial partition offset
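If you just want to pause everything the consumer currently owns rather than one partition at a time, a simple variant (a sketch reusing the consumer from the pause/resume example above; do_processing() is a hypothetical placeholder) is to feed Consumer.assignment() straight into pause() and resume():

# assignment() returns the list of TopicPartition currently assigned.
assigned = consumer.assignment()
consumer.pause(assigned)
do_processing()  # hypothetical long-running work
consumer.resume(assigned)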
