I am trying to learn Kafka using the classic Twitter streaming example. I want my producer to stream Twitter data matching two filters to different partitions of the same topic: for example, tweets matching track='Google' to one partition and tweets matching track='Apple' to another.
class Producer(StreamListener):
    def __init__(self, producer):
        self.producer = producer

    def on_data(self, data):
        self.producer.send(topic_name, value=data)
        return True

    def on_error(self, error):
        print(error)

twitter_stream = Stream(auth, Producer(producer))
twitter_stream.filter(track=["Google"])
How do I add another track and stream that data to another partition? Likewise, how do I make my consumer consume from a specific partition?
consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=True,
    auto_commit_interval_ms=5000,
    max_poll_records=100,
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))
After some research, I was able to resolve this issue:
On the producer side, specify the partition:

self.producer.send(topic_name, value=data, partition=0)
On the consumer side, skip the topic argument and assign the partition explicitly:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=True,
    auto_commit_interval_ms=5000,
    max_poll_records=100,
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))
consumer.assign([TopicPartition('trial', 0)])
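To extend this to the original two-track question, here is a minimal sketch, assuming tweepy's v3 StreamListener API and a kafka-python KafkaProducer (auth is the tweepy auth handler from the question). The topic name 'trial' is carried over from the consumer snippet above, and the topic must have been created with at least two partitions, since auto-created topics default to one:

import threading

from kafka import KafkaProducer
from tweepy import Stream
from tweepy.streaming import StreamListener

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

class PartitionedProducer(StreamListener):
    """Forwards every tweet it receives to one fixed partition of 'trial'."""
    def __init__(self, producer, partition):
        super().__init__()
        self.producer = producer
        self.partition = partition

    def on_data(self, data):
        # data arrives as a JSON string; encode it since no value_serializer is set
        self.producer.send('trial', value=data.encode('utf-8'),
                           partition=self.partition)
        return True

    def on_error(self, error):
        print(error)

# 'Google' tweets go to partition 0, 'Apple' tweets to partition 1.
# filter() blocks, so the first stream runs on a background thread.
google_stream = Stream(auth, PartitionedProducer(producer, partition=0))
apple_stream = Stream(auth, PartitionedProducer(producer, partition=1))
threading.Thread(target=google_stream.filter,
                 kwargs={"track": ["Google"]}, daemon=True).start()
apple_stream.filter(track=["Apple"])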
Kafka partitions data on the key of the message. In the code shown, you only pass a value to each message, so the key is null, and messages will therefore be round-robined across all partitions. Refer to the documentation for your Kafka library to see how to set a key on each message; a sketch follows below.
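For example, with kafka-python you can let the key drive the partitioning instead of hardcoding partition numbers. A hedged sketch, reusing the 'trial' topic from above, where the track word is the key so all 'Google' tweets hash to one partition and all 'Apple' tweets to another:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: v.encode('utf-8'))

# Messages sharing a key always hash to the same partition.
producer.send('trial', key='Google', value='{"text": "..."}')
producer.send('trial', key='Apple', value='{"text": "..."}')
producer.flush()

Note that distinct keys are not guaranteed to land on distinct partitions; with only two partitions they usually will, but an explicit partition= argument is the only way to force it.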
First time working with Kafka and I've run into a problem.
I have the following implementation of my consumer:
import json

from kafka import KafkaConsumer

import config

class KafkaMessageConsumer:
    def __init__(self):
        self.consumer = KafkaConsumer(
            bootstrap_servers=config.KAFKA_BOOTSTRAP_SERVER,
            security_protocol=config.KAFKA_SECURITY_PROTOCOL,
            sasl_mechanism=config.KAFKA_SASL_MECHANISM,
            sasl_plain_username=config.KAFKA_USERNAME,
            sasl_plain_password=config.KAFKA_PASSWORD,
            value_deserializer=lambda x: json.loads(x.decode("utf-8")),
        )

    def receive_messages(self, topic):
        self.consumer.subscribe(topics=[topic])
        print(f"Subscribed to topics: {self.consumer.subscription()}")
        for msg in self.consumer:
            yield msg.value

if __name__ == "__main__":
    consumer = KafkaMessageConsumer()
    for message in consumer.receive_messages(config.KAFKA_TOPIC):
        print("Received message:", message)
Assume the credentials are set correctly. I get the subscription message without error, but no messages are yielded, even though I know for sure that there are messages to be consumed on the topic. Am I missing some necessary config here?
I'm no expert in Python, but it looks like you haven't actually consumed any messages: you have subscribed to the topic, but you would need to poll() for messages (https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.poll); a sketch follows below.
Also, where have you set the topic name? [topic]
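A hedged sketch of a poll()-based receive_messages, per the linked docs: poll() returns a dict mapping TopicPartition to a batch of records. It may also be worth checking auto_offset_reset, since the kafka-python default of 'latest' means a brand-new consumer only sees messages produced after it subscribes:

def receive_messages(self, topic):
    self.consumer.subscribe(topics=[topic])
    while True:
        # {TopicPartition: [ConsumerRecord, ...]}; empty dict if nothing arrived
        records = self.consumer.poll(timeout_ms=1000)
        for tp, batch in records.items():
            for msg in batch:
                yield msg.value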
I have the below code that reads data from a Kafka queue.
import json

from kafka import KafkaConsumer
from pandas import json_normalize

bootstrap_servers = ['server.com']
topicName = 'topicname'

consumer = KafkaConsumer(topicName, group_id='topic',
                         bootstrap_servers=bootstrap_servers,
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=20000)

data_list = []
for message in consumer:
    print(message)
    data = json.loads(message.value)
    df = json_normalize(data)
    data_list.append(df)
However, I am trying to see if I can restrict the data I read based on a certain timestamp.
You can filter out each message that doesn't match your criteria with a simple if statement as you consume.
Or you can get the offsets for a particular timestamp and seek your consumer to those offsets before polling; see the sketch after the link below (be careful about handling application restarts so you don't reprocess the same range):
https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html?highlight=seek#kafka.KafkaConsumer.offsets_for_times
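A hedged sketch of the seek-by-timestamp route with kafka-python, reusing the broker and topic names from the question; partition 0 and the timestamp value are illustrative, and offsets_for_times() returns None for a partition with no message at or after the timestamp:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['server.com'])
tp = TopicPartition('topicname', 0)
consumer.assign([tp])  # seek() requires an assigned partition

start_ms = 1600000000000  # desired start time, in ms since the epoch
offsets = consumer.offsets_for_times({tp: start_ms})
if offsets[tp] is not None:
    # jump to the first message with timestamp >= start_ms
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    print(message.offset, message.timestamp)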
I have a topics_subscriber.py that continuously checks for any new topics and subscribes to them.
import time

import kafka
from kafka import KafkaConsumer

def _subscribe_topics(consumer: KafkaConsumer) -> None:
    topics_to_subscribe = {
        topic
        for topic in consumer.topics()
        if topic.startswith("topic-")
    }
    subscribed_topics = consumer.subscription()
    print(subscribed_topics)
    new_topics = (
        topics_to_subscribe - subscribed_topics
        if subscribed_topics
        else topics_to_subscribe
    )
    if new_topics:
        print("new topics detected:")
        for topic in new_topics:
            print(topic)
        print("\nsubscribing to old+new topics\n")
        # subscribe() is not incremental, hence subscribing to old+new topics
        consumer.subscribe(topics=list(topics_to_subscribe))
        print(consumer.subscription())
    else:
        print("\nno new topics detected: exiting normally\n")
def main() -> None:
    consumer = kafka.KafkaConsumer(
        client_id="client_1",
        group_id="my_group",
        bootstrap_servers=['localhost:9092']
    )
    while True:
        _subscribe_topics(consumer)
        print("\nsleeping 10 sec\n")
        time.sleep(10)
Now, in another script, kafka_extractor.py, I want to create a new consumer that joins the my_group consumer group and starts consuming messages from the topics the group is subscribed to, i.e. without explicitly subscribing to topics in this new consumer.
def main() -> None:
    consumer = kafka.KafkaConsumer(
        client_id="client_2",
        group_id="my_group",
        bootstrap_servers=['localhost:9092']
    )
    print("created consumer")
    print(consumer.subscription())
    for msg in consumer:
        print(msg.topic)
Two things to note in the output of kafka_extractor.py:
print(consumer.subscription()) outputs None
for msg in consumer: gets stuck; it neither moves forward nor exits the program.
Any directions would be appreciated here.
Joining a consumer to a group does not inherit the group's existing topics, because a group can contain multiple topics and each consumer instance decides which topics to consume by subscribing itself.
Your loop is stuck because the consumer defaults to reading from the end of its subscribed topics, of which there are none.
If you want to consume a list of topics that is refreshed periodically, use a regex subscription, as in the sketch below.
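A minimal sketch of that with kafka-python: subscribe(pattern=...) re-evaluates the regex as topic metadata refreshes, so topics created later that match "topic-" are picked up without a loop like the one in topics_subscriber.py (auto_offset_reset='earliest' is an added assumption, so existing messages are read too):

import kafka

consumer = kafka.KafkaConsumer(
    client_id="client_2",
    group_id="my_group",
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest')
consumer.subscribe(pattern='^topic-.*')

for msg in consumer:
    print(msg.topic)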
I'm trying to create a simple Kafka producer based on confluent_kafka. My code is the following:
#!/usr/bin/env python
from confluent_kafka import Producer
import json

def delivery_report(err, msg):
    """Called once for each message produced to indicate delivery result.
    Triggered by poll() or flush().
    See https://github.com/confluentinc/confluent-kafka-python/blob/master/README.md"""
    if err is not None:
        print('Message delivery failed: {}'.format(err))
    else:
        print('Message delivered to {} [{}]'.format(
            msg.topic(), msg.partition()))

class MySource:
    """Kafka producer"""
    def __init__(self, kafka_hosts, topic):
        """
        :kafka_hosts list(str): hostnames or 'host:port' of Kafka
        :topic str: topic to produce messages to
        """
        self.topic = topic
        # see https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
        config = {
            'metadata.broker.list': ','.join(kafka_hosts),
            'group.id': 'mygroup',
        }
        self.producer = Producer(config)

    @staticmethod
    def main():
        topic = 'my-topic'
        message = json.dumps({
            'measurement': [1, 2, 3]})
        mys = MySource(['kafka'], topic)
        mys.producer.produce(
            topic, message, on_delivery=delivery_report)
        mys.producer.flush()

if __name__ == "__main__":
    MySource.main()
The first time I use a topic (here: "my-topic"), Kafka reacts with "Auto creation of topic my-topic with 1 partitions and replication factor 1 is successful (kafka.server.KafkaApis)". However, the callback function (on_delivery=delivery_report) is never called, and the program hangs at flush() (it terminates if I set a timeout for flush), neither the first time nor on subsequent runs. The Kafka logs do not show anything if I use an existing topic.
I have a topic with 3 partitions and I'm trying to read from each specific partition using the following code:
from kafka import KafkaConsumer, TopicPartition
brokers = 'localhost:9092'
topic = 'b3'
m = KafkaConsumer(topic, bootstrap_servers=['localhost:9092'])
par = TopicPartition(topic=topic, partition=1)
m.assign(par)
but I am getting this error:
raise IllegalStateError(self._SUBSCRIPTION_EXCEPTION_MESSAGE)
kafka.errors.IllegalStateError: IllegalStateError: You must choose only one way to configure your consumer: (1) subscribe to specific topics by name, (2) subscribe to topics matching a regex pattern, (3) assign itself specific topic-partitions.
Can somebody help me with this?
Can you remove the topic parameter from KafkaConsumer() and try again? Note also that assign() expects a list of TopicPartition objects, so it should be m.assign([par]) rather than m.assign(par).
example:
# manually assign the partition list for the consumer
from kafka import TopicPartition, KafkaConsumer
consumer = KafkaConsumer(bootstrap_servers='localhost:1234')
consumer.assign([TopicPartition('foobar', 2)])
msg = next(consumer)
ref: http://kafka-python.readthedocs.io/en/master/
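Extending that to the question's setup, a sketch that assigns all three partitions of topic 'b3' and prints which partition each record came from:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
consumer.assign([TopicPartition('b3', p) for p in range(3)])

for msg in consumer:
    print(msg.partition, msg.offset, msg.value)

To read each partition in isolation instead, create one consumer per partition, each with a single-element assign() list.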