I have two microservices.
MProducer - sending messages to kafka queue
MConsumer - reading messages from kafka queue
When the consumer crashes and restarts, I want it to continue consuming from the last message it read.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         auto_offset_reset='latest',
                         enable_auto_commit=False)
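It looks like you are using kafka-python, so you'll need to pass the group_id argument to your KafkaConsumer. See the description of this argument in the KafkaConsumer documentation.
With a group id set, the consumer can commit its position to Kafka and will automatically retrieve it upon restarting. The periodic commit only happens while enable_auto_commit is True (the default); with enable_auto_commit=False, as in your snippet, you have to call consumer.commit() yourself.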
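As a minimal sketch of the resume-after-restart behaviour, assuming kafka-python and the asker's enable_auto_commit=False: once a group_id is set, the consumer can commit its position explicitly after each message is handled. The function name `consume_and_commit` and the `handle` callback are illustrative and not part of the original question.

```python
def consume_and_commit(consumer, handle):
    """Process records one at a time, committing after each success.

    `consumer` is expected to behave like kafka.KafkaConsumer created with
    a group_id and enable_auto_commit=False; `handle` is a hypothetical
    callback supplied by the caller.
    """
    processed = 0
    for record in consumer:
        handle(record)
        # Commit the consumed position for this group, so that a restarted
        # consumer resumes after this record instead of re-reading it.
        consumer.commit()
        processed += 1
    return processed
```

With a real broker this might be called as `consume_and_commit(KafkaConsumer('my-topic', bootstrap_servers='localhost:9092', group_id='my-group', enable_auto_commit=False), print)`, where the topic and group names are placeholders. The loop blocks waiting for messages unless consumer_timeout_ms is set. Committing after every record is simple but slow; committing every N records is a common trade-off.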
You do that by using a consumer group. Assuming you're using the confluent library, just add 'group.id': 'your-group' to the consumer configuration.
When the service goes down and comes back up, it will start from the last committed offset.
The information about each consumer group is saved in a special topic in Kafka (starting from v0.9) called __consumer_offsets. More info in kafka docs [https://kafka.apache.org/intro#intro_consumers]
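For illustration, a configuration dictionary of the kind the confluent client accepts could look like the sketch below. Only 'group.id' comes from the answer above; the broker address and the other two settings are assumed values, shown because they influence where a restarted consumer resumes.

```python
# Hypothetical consumer configuration; every value except the 'group.id'
# key mentioned in the answer is an assumption for illustration.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "your-group",               # offsets are stored per group
    "auto.offset.reset": "latest",          # used only when no offset is committed
    "enable.auto.commit": True,             # commit the position periodically
}
```

Such a dictionary would be passed to the client's consumer constructor; the group name is what ties the stored offsets to the restarted service.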
I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic again, the consumer consumes it again anyway. Is there a way to make Kafka skip previously seen messages and continue with new messages only?
from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
OK, I finally got your question. Avoiding a message that a producer has (accidentally) sent multiple times can be very complicated.
There are generally 2 cases:
The simple one is where you have a single instance that consumes the messages. In that case your producer can add a UUID to the message payload, and your consumer can keep the ids of the processed messages in an in-memory cache.
The complicated one is where you have multiple instances that consume messages (that is usually why you'd need message brokers - a distributed system). In this scenario you would need to use an external service that would play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
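The simple single-instance case could be sketched roughly as follows. This is an illustration, not a full solution: the class and function names, the "id" key in the message payload, and the cache size are all assumptions.

```python
from collections import OrderedDict


class SeenIds:
    """Bounded in-memory record of message ids that were already processed."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._ids = OrderedDict()

    def first_time(self, message_id):
        """Return True only the first time `message_id` is seen."""
        if message_id in self._ids:
            return False
        self._ids[message_id] = True
        # Evict the oldest id once the cache is full, so memory stays bounded.
        if len(self._ids) > self.max_size:
            self._ids.popitem(last=False)
        return True


def process_unique(messages, handle, seen=None):
    """Call `handle` once per distinct message id, skipping repeats.

    Each message is assumed to be a dict carrying the producer's id under
    the hypothetical key "id".
    """
    seen = seen if seen is not None else SeenIds()
    handled = 0
    for message in messages:
        if seen.first_time(message["id"]):
            handle(message)
            handled += 1
    return handled
```

The cache is lost when the process restarts and is not shared between instances, which is why the multi-instance case moves the same check into Redis or a database table.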
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')
I have gone through the fundamentals of RabbitMQ. One thing I figured out is that a publisher does not publish directly to a queue. The exchange decides which queue the message is routed to, based on the routing key and the type of exchange (the code below uses the default exchange). I have also found example code for a publisher.
import pika, os, logging
logging.basicConfig()
# Parse CLOUDAMQP_URL (fallback to localhost)
url = os.environ.get('CLOUDAMQP_URL', 'amqp://guest:guest@localhost/%2f')
params = pika.URLParameters(url)
params.socket_timeout = 5
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue='pdfprocess')
# send a message
channel.basic_publish(exchange='', routing_key='pdfprocess', body='User information')
print ("[x] Message sent to consumer")
connection.close()
In line #9 the queue is being declared. I am a bit confused, because the publisher should not have to be aware of the queue. For example, if it is using a fanout exchange and there are 100 queues with different names, how would the consumer know about and declare all 100 queues?
The consumer can declare the queue and bind it to the exchange when the consumer connects to RabbitMQ. A fanout exchange then copies and routes a received message to all queues bound to it, regardless of routing keys or pattern matching as with direct and topic exchanges.
So no, the publisher does not have to be aware of all the queues bound to the exchange. The publisher can still declare a queue to make sure it exists before publishing, but that matters more for other exchange types, such as the default exchange in your example, where a message whose routing key matches no queue is dropped.
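The consumer-side declare-and-bind step described above could look roughly like this with pika. It is a sketch under assumptions: the function name is invented for illustration, and the channel is passed in rather than created here.

```python
def bind_fanout_queue(channel, exchange_name):
    """Declare a fanout exchange plus a private queue bound to it.

    `channel` is expected to be a pika channel (for example from
    pika.BlockingConnection(...).channel()); `exchange_name` is whatever
    name the publisher and the consumers agree on.
    """
    # Creates the exchange only if it does not already exist.
    channel.exchange_declare(exchange=exchange_name, exchange_type="fanout")
    # An empty name asks the broker to generate one; exclusive=True ties the
    # queue to this connection, so it is removed when the consumer disconnects.
    result = channel.queue_declare(queue="", exclusive=True)
    queue_name = result.method.queue
    # A fanout exchange ignores the routing key, so none is given.
    channel.queue_bind(exchange=exchange_name, queue=queue_name)
    return queue_name
```

Each of the 100 consumers would run this once for its own queue. The publisher only calls basic_publish with the exchange name and an empty routing key, and never names a queue.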
Any client (Publisher or Consumer) can create queues in RabbitMQ. Sometimes you might want the Publisher to create a queue, but for me that is usually the role of the Consumer. The Publisher doesn't need to know where or even whether anything it sends will be consumed.
For example, the Publisher can get an acknowledgement from the RabbitMQ server that a message has been received. The RabbitMQ server can get an acknowledgement from the Consumer when a message is consumed from a Queue.
A Publisher cannot get an acknowledgement when a message is consumed from a Queue; it has no visibility of whether the message was routed to zero, one, or multiple queues, or whether it was consumed from those queues.
I am just getting started with Kafka, kafka-python. In the code below, I am trying to read the messages as they arrive. But for some reason, the consumer seems to be waiting till a certain number of messages accrue before fetching them.
I initially thought it was because the producer was publishing in batches. But when I ran "kafka-console-consumer --bootstrap-server --topic ", I could see every message as soon as it was published (as seen on the consumer console).
But the python script is not able to receive the messages in the same way.
def run():
    success_consumer = KafkaConsumer('success_logs',
                                     bootstrap_servers=KAFKA_BROKER_URL,
                                     group_id=None,
                                     fetch_min_bytes=1,
                                     fetch_max_bytes=10,
                                     enable_auto_commit=True)
    # dummy poll
    success_consumer.poll()
    for msg in success_consumer:
        print(msg)
    success_consumer.close()
Can someone point out which KafkaConsumer configuration I need to change? Why is it not able to read messages the way "kafka-console-consumer" does?
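The KafkaConsumer class also has a fetch_max_wait_ms parameter: the maximum time the broker waits for fetch_min_bytes of data before answering a fetch request (default 500 ms). You should set it to 0.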
success_consumer = KafkaConsumer(...,fetch_max_wait_ms=0)
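A minimal sketch, assuming kafka-python: besides the constructor setting above, records can be pulled explicitly with poll(), which returns after at most timeout_ms even when nothing has arrived. The helper name `poll_once` is illustrative.

```python
def poll_once(consumer, timeout_ms=500):
    """Fetch whatever is available now and return the records as a flat list.

    `consumer` is expected to behave like kafka.KafkaConsumer, whose poll()
    returns a dict mapping each TopicPartition to a list of records.
    """
    batches = consumer.poll(timeout_ms=timeout_ms)
    records = []
    for partition_records in batches.values():
        records.extend(partition_records)
    return records
```

Calling this in a loop and printing each returned record makes it easier to see whether delays come from the fetch settings or from somewhere else in the script.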
I'm using kafka-python to read from a topic on a Kafka broker, but I can't seem to get the consumer iterator to return anything.
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup")
for record in consumer:
    print(record)
It seems like it's just hanging. I've verified that the topic exists and has data on the broker and that new data is being produced. When I change the call to the KafkaConsumer constructor, and add auto_offset_reset="earliest", everything works as expected and the consumer iterator returns records. The default value for this param is "latest", but with that value I can't seem to consume data.
Why would this be the case?
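You also need to pass auto_offset_reset='smallest' when instantiating KafkaConsumer, which is equivalent to --from-beginning for the command-line tool kafka-console-consumer.sh. ('smallest' is the older spelling; current kafka-python releases call it 'earliest' and treat 'smallest' as a deprecated alias.)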
i.e.
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup", auto_offset_reset='smallest')
The most likely reason you see no data is that nothing is being produced while your consumer is up and running: with the default 'latest', a consumer with no committed offset only receives messages produced after it starts. You therefore need to indicate that you want to consume all the data already in the topic (even if no data is being inserted at the moment).
According to the official documentation:
The Kafka consumer works by issuing "fetch" requests to the brokers
leading the partitions it wants to consume. The consumer specifies its
offset in the log with each request and receives back a chunk of log
beginning from that position. The consumer thus has significant
control over this position and can rewind it to re-consume data if
need be.
I am trying to launch a consumer dynamically whenever a new topic is created in Kafka, but the dynamically launched consumer always misses the first message and only consumes the messages after it. I am using the kafka-python module with the current KafkaConsumer and KafkaProducer classes.
Code for Producer is
producer = KafkaProducer(bootstrap_servers='localhost:9092')
record_metadata = producer.send(topic, data)
and code for consumer is
consumer = KafkaConsumer(topic,group_id="abc",bootstrap_servers='localhost:9092',auto_offset_reset='earliest')
Please suggest how to overcome this problem, or any configuration I have to include in my producer and consumer instances.
Can you set auto_offset_reset to earliest?
When a new consumer stream is created, it starts from the latest offset (the default value for auto_offset_reset), so you will miss messages that were sent before the consumer started.
You can read about it in the kafka-python docs. The relevant portion is below:
auto_offset_reset (str) – A policy for resetting offsets on
OffsetOutOfRange errors: ‘earliest’ will move to the oldest available
message, ‘latest’ will move to the most recent. Any other value will
raise the exception. Default: ‘latest’.
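To make the interplay between committed offsets and auto_offset_reset concrete, here is a deliberately simplified model of the decision for a single partition. It is not kafka-python code; the function name and its parameters are invented for illustration.

```python
def starting_offset(committed, earliest, latest, auto_offset_reset="latest"):
    """Simplified model of where a consumer begins reading one partition.

    committed: offset stored for the consumer group, or None if there is none.
    earliest/latest: oldest available offset and the next offset to be written.
    """
    # A committed offset that is still available in the log always wins.
    if committed is not None and earliest <= committed <= latest:
        return committed
    # Otherwise (no commit, or it is out of range) the reset policy applies.
    if auto_offset_reset == "earliest":
        return earliest
    if auto_offset_reset == "latest":
        return latest
    raise ValueError("no valid offset and no usable reset policy")
```

Under this model a brand-new group with the default 'latest' starts at the end of the log and only sees messages produced afterwards, while 'earliest' starts at the oldest retained message. Once the group has committed an offset, the policy no longer matters.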