Polling data based on date filters - python

I have the code below that reads data from a Kafka topic.
from kafka import KafkaConsumer
from pandas import json_normalize
import json

bootstrap_servers = ['server.com']
topicName = 'topicname'
consumer = KafkaConsumer(topicName, group_id='topic',
                         bootstrap_servers=bootstrap_servers,
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=20000)
data_list = []
for message in consumer:
    print(message)
    data = json.loads(message.value)
    df = json_normalize(data)
    data_list.append(df)
I am, however, trying to see if I can restrict the data based on a certain timestamp.

You can filter each message that doesn't match your criteria with a simple if statement inside the consuming loop.
Or you can look up the offsets for a particular timestamp and seek your consumer to those offsets before polling (be careful about handling application restarts and not reprocessing the same range); see the sketch below the link:
https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html?highlight=seek#kafka.KafkaConsumer.offsets_for_times
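A minimal sketch of the second approach with kafka-python (the cutoff date is a placeholder, and manual assignment replaces the group subscription for illustration):

from datetime import datetime, timezone
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['server.com'],
                         consumer_timeout_ms=20000)

# messages older than this timestamp (in ms since epoch) will be skipped
start_ts_ms = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)

partitions = [TopicPartition('topicname', p)
              for p in consumer.partitions_for_topic('topicname')]
consumer.assign(partitions)

# offsets_for_times returns, per partition, the earliest offset whose
# timestamp is >= the requested one, or None if no such message exists
offsets = consumer.offsets_for_times({tp: start_ts_ms for tp in partitions})
for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)

for message in consumer:
    ...  # only messages from start_ts_ms onwards arrive here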


How do I send data to a specific partition (let's say partition 0) in Event Hub

I have created 2 partitions in Event Hub and am following the same code to send data to it:
https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/README.md#publish-events-to-an-event-hub
With the code below I was able to send data, but not to a specific partition:
from azure.eventhub import EventHubProducerClient, EventData

def submit_images(jsondata):
    connection_str = '****************************'
    eventhub_name = '*********'
    client = EventHubProducerClient.from_connection_string(
        connection_str, eventhub_name=eventhub_name)
    event_data_batch = client.create_batch(partition_id=0)
    event_data_batch.add(EventData(jsondata))
    with client:
        client.send_batch(event_data_batch)
    return jsondata
The partition_id kwarg takes a string value, so changing this line to event_data_batch = client.create_batch(partition_id='0') should send the data to the intended partition.
Another way is to pass partition_id to send_batch; this works when you hand it a plain list of events rather than an EventDataBatch:
client.send_batch([EventData(jsondata)], partition_id='1')
Here is a sample that has more examples
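For completeness, a hedged sketch of both options (the connection string and hub name are placeholders):

from azure.eventhub import EventHubProducerClient, EventData

client = EventHubProducerClient.from_connection_string(
    '<connection string>', eventhub_name='<hub name>')

with client:
    # Option 1: pin the batch itself to partition '0'
    batch = client.create_batch(partition_id='0')
    batch.add(EventData('payload'))
    client.send_batch(batch)

    # Option 2: pass partition_id to send_batch; this form takes a plain
    # list of events rather than an EventDataBatch
    client.send_batch([EventData('payload')], partition_id='1')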

Get Latest Message for a Confluent Kafka Topic in Python

Here's what I've tried so far:
from confluent_kafka import Consumer

c = Consumer({
    # ... several security/server settings skipped ...
    'auto.offset.reset': 'beginning',
    'group.id': 'my-group'})
c.subscribe(['my.topic'])
msg = c.poll(30.0)  # msg is of None type.
msg almost always ends up being None though. I think the issue might be that 'my-group' has already consumed all the messages for 'my.topic'... but I don't care whether a message has already been consumed or not - I still need the latest message. Specifically, I need the timestamp from that latest message.
I tried a bit more, and from this it looks like there are probably 25 messages in the topic, but I have no idea how to get at them:
a = c.assignment()
print(a) # Outputs [TopicPartition{topic=my.topic,partition=0,offset=-1001,error=None}]
offsets = c.get_watermark_offsets(a[0])
print(offsets) # Outputs: (25, 25)
If there are no messages because the topic has never had anything written to it at all, how can I determine that? And if that's the case, how can I determine how long the topic has existed for? I'm looking to write a script that automatically deletes any topics that haven't been written to in the past X days (14 initially - will probably tweak it over time.)
I ran into the same issue, and found no example of this. In my case there is one partition, and I need to read the last message to get some info from it to set up a consumer/producer component I have.
The logic is: start the consumer, subscribe to the topic, and poll for a message. This triggers on_assign, where the rewinding happens by assigning the modified partitions back. After on_assign finishes, the poll continues and reads the last message from the topic.
settings = {
    "bootstrap.servers": "my.kafka.server",
    "group.id": "my-work-group",
    "client.id": "my-work-client-1",
    "enable.auto.commit": False,
    "session.timeout.ms": 6000,
    "default.topic.config": {"auto.offset.reset": "largest"},
}
consumer = Consumer(settings)

def on_assign(a_consumer, partitions):
    # get the (low, high) watermark offsets of the first partition
    last_offset = a_consumer.get_watermark_offsets(partitions[0])
    # position [1] is the high watermark; the last message is one before it
    partitions[0].offset = last_offset[1] - 1
    a_consumer.assign(partitions)

consumer.subscribe(["test-topic"], on_assign=on_assign)
msg = consumer.poll(6.0)
Now msg holds the last message in the topic.
If anyone still needs an example for the case with multiple partitions, this is how I did it:
from confluent_kafka import OFFSET_END, Consumer

settings = {
    'bootstrap.servers': "my.kafka.server",
    'group.id': "my-work-group",
    'auto.offset.reset': "latest"
}

def on_assign(consumer, partitions):
    for partition in partitions:
        partition.offset = OFFSET_END
    consumer.assign(partitions)

consumer = Consumer(settings)
consumer.subscribe(["test-topic"], on_assign=on_assign)
msg = consumer.poll(1.0)
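Note that OFFSET_END positions the consumer after the existing data, so the poll above only returns messages that arrive from then on. If you instead want the last message that already exists in each partition (plus its timestamp, as the original question asks), you can seek each partition to its high watermark minus one. A sketch along those lines, untested against a broker:

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': "my.kafka.server",
    'group.id': "my-work-group",
    'enable.auto.commit': False,
})

topic = "test-topic"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p)
              for p in metadata.topics[topic].partitions]

assignment = []
for tp in partitions:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    if high > low:            # partition holds at least one message
        tp.offset = high - 1  # offset of the last existing message
        assignment.append(tp)

consumer.assign(assignment)
for _ in assignment:
    msg = consumer.poll(10.0)
    if msg is not None and not msg.error():
        print(msg.partition(), msg.offset(), msg.timestamp())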

Python produce to different Kafka partition

I am trying to learn Kafka using the classic Twitter streaming example. I am trying to use my producer to stream Twitter data, based on 2 filters, to different partitions of the same topic. For example, Twitter data with track='Google' should go to one partition and track='Apple' to another.
class Producer(StreamListener):
    def __init__(self, producer):
        self.producer = producer

    def on_data(self, data):
        self.producer.send(topic_name, value=data)
        return True

    def on_error(self, error):
        print(error)

twitter_stream = Stream(auth, Producer(producer))
twitter_stream.filter(track=["Google"])
How do I add another track and stream that data to another partition?
Likewise, how do I make my consumer consume from a specific partition?
consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=True,
    auto_commit_interval_ms=5000,
    max_poll_records=100,
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))
After some research, I was able to resolve this issue:
On the producer side, specify the partition:
self.producer.send(topic_name, value=data, partition=0)
On the consumer side, assign the partition explicitly:
consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=True,
    auto_commit_interval_ms=5000,
    max_poll_records=100,
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))
consumer.assign([TopicPartition('trial', 0)])
Kafka partitions data on the key of the message. In your given code, you are only passing a value to the producer, so the key will be null, and messages will therefore round-robin between all partitions.
Refer to the documentation for your Kafka library to see how you can give each message a key; see the sketch below.
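A small sketch with kafka-python (topic name and payloads are placeholders): messages with equal keys always land in the same partition, so keying by track keeps the two streams apart without hard-coding partition numbers.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
# the key is hashed to pick the partition, so equal keys stay together
producer.send('trial', key=b'Google', value=b'{"text": "..."}')
producer.send('trial', key=b'Apple', value=b'{"text": "..."}')
producer.flush()

Note that two distinct keys can still hash to the same partition, so the explicit partition= argument shown above remains the deterministic option.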

Confluent kafka python pause-resume functionality example

I was trying to use the Confluent Kafka consumer's pause and resume functionality but couldn't find any examples on the internet apart from the main link:
https://docs.confluent.io/5.0.0/clients/confluent-kafka-python/index.html
I couldn't understand the parameters to be passed to it. Is it a list of partitions, topic names, or what?
As OneCricketeer mentioned, pause() and resume() take a list of TopicPartition, and to instantiate the TopicPartition class you need the topic, partition, and offset, which you can get from the message object.
This is how you can achieve it through Confluent-Kafka-Python:
import time
from confluent_kafka import Consumer, TopicPartition

conf = {
    'bootstrap.servers': "localhost:9092",
    'group.id': "test-consumer-group",
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False
}
topics = ['topic1']
consumer = Consumer(conf)
consumer.subscribe(topics)

while True:
    try:
        msg = consumer.poll(1.0)
        if msg is None:
            print("Waiting for message or event/error in poll()...")
            continue
        if msg.error():
            print("Error: {}".format(msg.error()))
            continue
        else:
            # Call your processing function and pause the consumer
            consumer.pause([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            time.sleep(60)  # Think of it as processing time
            # Once the processing is done, resume the consumer and commit the message
            consumer.resume([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            consumer.commit()
    except Exception as e:
        print(e)
This is just an example and you might want to modify it based on your use case.
Pause and resume take a list of TopicPartition
class confluent_kafka.TopicPartition

TopicPartition is a generic type to hold a single partition and various information about it.
It is typically used to provide a list of topics or partitions for various operations, such as Consumer.assign().

TopicPartition(topic[, partition][, offset])

Instantiate a TopicPartition object.

Parameters:
    topic (string) – Topic name
    partition (int) – Partition id
    offset (int) – Initial partition offset
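Since pause() and resume() only need the topic and partition fields (the offset is not used when pausing), a simpler pattern is to pause and resume the consumer's entire current assignment; a small sketch, reusing the consumer from the example above:

# pause every partition currently assigned to this consumer
consumer.pause(consumer.assignment())
time.sleep(60)  # long-running processing
consumer.resume(consumer.assignment())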

Read from specific Kafka topic using Python

I have a topic with 3 partitions and I'm trying to read from each specific partition using the following code:
from kafka import KafkaConsumer, TopicPartition
brokers = 'localhost:9092'
topic = 'b3'
m = KafkaConsumer(topic, bootstrap_servers=['localhost:9092'])
par = TopicPartition(topic=topic, partition=1)
m.assign(par)
but I am getting this error:
raise IllegalStateError(self._SUBSCRIPTION_EXCEPTION_MESSAGE)
kafka.errors.IllegalStateError: IllegalStateError: You must choose only one way to configure your consumer: (1) subscribe to specific topics by name, (2) subscribe to topics matching a regex pattern, (3) assign itself specific topic-partitions.
Can somebody help me with this?
Can you remove the topic parameter from KafkaConsumer() and try again?
example:
# manually assign the partition list for the consumer
from kafka import TopicPartition, KafkaConsumer
consumer = KafkaConsumer(bootstrap_servers='localhost:1234')
consumer.assign([TopicPartition('foobar', 2)])
msg = next(consumer)
ref: http://kafka-python.readthedocs.io/en/master/
