I am trying to launch a dynamic consumer whenever a new topic is created in Kafka, but the dynamically launched consumer always misses the starting/first message and only consumes messages from there on. I am using the kafka-python module with the updated KafkaConsumer and KafkaProducer.
The code for the producer is
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# send() is asynchronous and returns a future; block on it to get the RecordMetadata
record_metadata = producer.send(topic, data).get(timeout=10)
and the code for the consumer is
from kafka import KafkaConsumer

consumer = KafkaConsumer(topic,
                         group_id="abc",
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
Please suggest something to overcome this problem, or any configuration I have to include in my producer and consumer instances.
Can you set auto_offset_reset to earliest?
When a new consumer stream is created, it starts from the latest offset (which is the default value for auto_offset_reset), and you will miss messages that were sent while the consumer wasn't started.
You can read about it in the kafka-python docs. The relevant portion is below:
auto_offset_reset (str) – A policy for resetting offsets on
OffsetOutOfRange errors: ‘earliest’ will move to the oldest available
message, ‘latest’ will move to the most recent. Any other value will
raise the exception. Default: ‘latest’.
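For example, a sketch of a consumer created with earliest (the topic name here is just an assumption) that will also pick up messages produced before it started:

from kafka import KafkaConsumer

# With auto_offset_reset='earliest', a consumer that has no committed offset yet
# starts from the oldest available message instead of only newly arriving ones.
consumer = KafkaConsumer('new-topic',                  # hypothetical topic name
                         group_id='abc',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for msg in consumer:
    print(msg.value)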
Related
I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it consumes the same data anyway. Is there a way to make Kafka skip previous messages and continue from new messages?
from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
OK, I finally got your question. Avoiding a message that has been sent multiple times by a producer (accidentally) could be very complicated.
There are generally 2 cases:
The simple one is where you have a single instance that consumes the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of the processed messages in an in-memory cache (see the sketch below).
The complicated one is where you have multiple instances that consume messages (that is usually why you'd need message brokers - a distributed system). In this scenario you would need to use an external service that would play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
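A minimal sketch of the single-instance case, assuming JSON-encoded messages with a hypothetical 'id' field carrying the producer-side uuid (the topic and broker names are assumptions too):

import json
from kafka import KafkaConsumer

# In-memory cache of ids that have already been processed (single-instance case only)
processed_ids = set()

consumer = KafkaConsumer('my-topic',                           # hypothetical topic name
                         bootstrap_servers='localhost:9092',
                         group_id='my-group',
                         value_deserializer=lambda v: json.loads(v))

for msg in consumer:
    msg_id = msg.value.get('id')       # unique id added by the producer
    if msg_id in processed_ids:
        continue                       # duplicate: skip it
    processed_ids.add(msg_id)
    # ... process the message here ...

For the multi-instance case, the same check would go against Redis or a database table instead of the local set.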
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         auto_commit_interval_ms=1000,
                         group_id='my-group')
I am going to use Kafka as a message broker in my application. This application is written entirely using Python. For a part of this application (Login and Authentication), I need to implement a request-reply messaging system. In other words, the producer needs to get the response of the produced message from the consumer, synchronously.
Is this feasible using Kafka and its Python libraries (kafka-python, ...)?
I'm facing the same issue (request-reply for an HTTP hit in my case)
My first bet (100% Python) was to:
start a consumer thread,
publish the request message (including a request_id)
join the consumer thread
get the answer from the consumer thread
The consumer thread subscribes to the reply topic (seeked to the end) and processes received messages until it finds the request_id (modulo a timeout).
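A rough sketch of that approach, assuming JSON payloads and hypothetical 'request-topic'/'reply-topic' names:

import json
import threading
import uuid
from kafka import KafkaConsumer, KafkaProducer

def request_reply(payload, timeout_s=5):
    request_id = str(uuid.uuid4())
    result = {}

    def wait_for_reply():
        # Subscribe to the reply topic from the latest offset, then scan
        # incoming messages until our request_id shows up (or we time out).
        consumer = KafkaConsumer('reply-topic',
                                 bootstrap_servers='localhost:9092',
                                 auto_offset_reset='latest',
                                 consumer_timeout_ms=timeout_s * 1000)
        for msg in consumer:
            reply = json.loads(msg.value)
            if reply.get('request_id') == request_id:
                result['reply'] = reply
                break
        consumer.close()

    t = threading.Thread(target=wait_for_reply)
    t.start()

    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             value_serializer=lambda v: json.dumps(v).encode())
    producer.send('request-topic', {'request_id': request_id, 'payload': payload})
    producer.flush()

    t.join(timeout_s)
    return result.get('reply')      # None if no matching reply arrived in time

As noted below, the expensive part is the KafkaConsumer construction inside the thread, which has to happen once per request.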
While this works for basic testing, unfortunately creating a KafkaConsumer object is a slow process (~300 ms), so it's not an option for a system with massive traffic.
In addition, if your system handles parallel request-reply (for example, multi-threaded like a web server), you'll need a KafkaConsumer dedicated to each request_id (basically by using the request_id as the consumer group) to avoid having the reply to a request published by thread A consumed (and ignored) by thread B.
So you can't recycle your KafkaConsumer here and have to pay the creation time for each request (in addition to the processing time on the backend).
If your request-reply processing is not parallelizable, you can try to keep the KafkaConsumer object available for the threads started to get the answer.
The only solution I can see at this point is to use a DB (relational/NoSQL):
the requestor stores the request_id in the DB (as local as possible) and publishes the request to Kafka
the requestor polls the DB until it finds the answer for the request_id (see the sketch below)
in parallel, a consumer process receives messages from the reply topic and stores the results in the DB
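A minimal sketch of that polling variant, using sqlite3 purely for illustration (the table layout, topic names, and timeout are assumptions):

import json
import sqlite3
import time
import uuid
from kafka import KafkaProducer

db = sqlite3.connect('replies.db')
db.execute('CREATE TABLE IF NOT EXISTS replies (request_id TEXT PRIMARY KEY, payload TEXT)')

def request_and_wait(payload, timeout_s=5):
    request_id = str(uuid.uuid4())
    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             value_serializer=lambda v: json.dumps(v).encode())
    producer.send('request-topic', {'request_id': request_id, 'payload': payload})
    producer.flush()
    # Poll the DB until the separate reply consumer has stored the answer.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        row = db.execute('SELECT payload FROM replies WHERE request_id = ?',
                         (request_id,)).fetchone()
        if row:
            return json.loads(row[0])
        time.sleep(0.1)
    return None                      # timed out

# The reply consumer (a separate process) would do roughly:
#   for msg in KafkaConsumer('reply-topic', bootstrap_servers='localhost:9092'):
#       reply = json.loads(msg.value)
#       db.execute('INSERT OR REPLACE INTO replies VALUES (?, ?)',
#                  (reply['request_id'], json.dumps(reply)))
#       db.commit()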
But I don't like polling... it will generate heavy load on the DB in a massive-traffic system.
My 2 cents.
I am just getting started with Kafka and kafka-python. In the code below, I am trying to read the messages as they arrive. But for some reason, the consumer seems to wait until a certain number of messages accrue before fetching them.
I initially thought it was because of the producer, which was publishing in batches. But when I ran "kafka-console-consumer --bootstrap-servers --topic ", I could see every message being received as soon as it was published (as seen on the consumer console). The Python script, however, is not able to receive the messages in the same way.
from kafka import KafkaConsumer

def run():
    success_consumer = KafkaConsumer('success_logs',
                                     bootstrap_servers=KAFKA_BROKER_URL,
                                     group_id=None,
                                     fetch_min_bytes=1,
                                     fetch_max_bytes=10,
                                     enable_auto_commit=True)
    # dummy poll
    success_consumer.poll()
    for msg in success_consumer:
        print(msg)
    success_consumer.close()
Can someone point out what configuration needs to change for KafkaConsumer? Why is it not able to read messages the way "kafka-console-consumer" does?
The KafkaConsumer class also has a fetch_max_wait_ms parameter. You should set it to 0:
success_consumer = KafkaConsumer(...,fetch_max_wait_ms=0)
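Applied to the consumer from the question, that would look roughly like this (a sketch, keeping the question's other settings as they were):

success_consumer = KafkaConsumer('success_logs',
                                 bootstrap_servers=KAFKA_BROKER_URL,
                                 group_id=None,
                                 fetch_min_bytes=1,
                                 fetch_max_bytes=10,
                                 fetch_max_wait_ms=0,    # don't wait for more data to accumulate
                                 enable_auto_commit=True)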
I have two microservices.
MProducer - sends messages to the Kafka queue
MConsumer - reads messages from the Kafka queue
When the consumer crashes and restarts, I want it to continue consuming from the last message.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         auto_offset_reset='latest',
                         enable_auto_commit=False)
It looks like you are using kafka-python, so you'll need to pass the group_id argument to your Consumer. See the description for this argument in the KafkaConsumer documentation.
By setting a group id, the Consumer will periodically commit its position to Kafka and will automatically retrieve it upon restarting.
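For example, a sketch with kafka-python (the topic and group names are assumptions, and auto commit is left enabled so the position is committed periodically):

from kafka import KafkaConsumer

# With a group_id and auto commit enabled, the consumer's position is committed
# to Kafka and picked up again automatically when the service restarts.
consumer = KafkaConsumer('my-topic',                    # hypothetical topic name
                         bootstrap_servers='localhost:9092',
                         group_id='mconsumer-group',    # hypothetical group name
                         enable_auto_commit=True)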
You do that by having a consumer group. Assuming you're using the Confluent library, just add 'group.id': 'your-group' to your consumer configuration.
When the service goes down and then comes back up, it will start from the last committed point.
The information about each consumer group is saved in a special Kafka topic (starting from v0.9) called __consumer_offsets. More info in the Kafka docs [https://kafka.apache.org/intro#intro_consumers]
I'm using kafka-python to read from a topic on a Kafka broker, but I can't seem to get the consumer iterator to return anything:
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup")
for record in consumer:
print(record)
It seems like it's just hanging. I've verified that the topic exists and has data on the broker and that new data is being produced. When I change the call to the KafkaConsumer constructor, and add auto_offset_reset="earliest", everything works as expected and the consumer iterator returns records. The default value for this param is "latest", but with that value I can't seem to consume data.
Why would this be the case?
You also need to include auto_offset_reset='earliest' when instantiating KafkaConsumer, which is equivalent to --from-beginning for the command-line tool kafka-console-consumer.sh
i.e.
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup", auto_offset_reset='smallest')
The reason you might see no data being consumed is probably that, while your consumer is up and running, no data is being produced from the producer's side. Therefore, you need to indicate that you want to consume all the data already in the topic (even if no data is being inserted at the moment).
According to the official documentation:
The Kafka consumer works by issuing "fetch" requests to the brokers
leading the partitions it wants to consume. The consumer specifies its
offset in the log with each request and receives back a chunk of log
beginning from that position. The consumer thus has significant
control over this position and can rewind it to re-consume data if
need be.
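For example, with kafka-python you can rewind an already-assigned consumer back to the beginning of its partitions (a sketch, reusing the topic and group from the question):

from kafka import KafkaConsumer

consumer = KafkaConsumer("topic",
                         bootstrap_servers=bootstrap_server + ":" + str(port),
                         group_id="mygroup")
consumer.poll(timeout_ms=1000)     # the first poll triggers partition assignment
consumer.seek_to_beginning()       # rewind all assigned partitions to the oldest offset
for record in consumer:
    print(record)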