I'm using kafka-python to read from a topic on a Kafka broker, but I can't get the consumer iterator to return anything:
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup")
for record in consumer:
print(record)
It seems like it's just hanging. I've verified that the topic exists and has data on the broker and that new data is being produced. When I change the call to the KafkaConsumer constructor, and add auto_offset_reset="earliest", everything works as expected and the consumer iterator returns records. The default value for this param is "latest", but with that value I can't seem to consume data.
Why would this be the case?
You also need to include auto_offset_reset='earliest' when instantiating KafkaConsumer; it is the equivalent of --from-beginning for the command-line tool kafka-console-consumer.sh.
i.e.
consumer = KafkaConsumer("topic",bootstrap_servers=bootstrap_server + ":" + str(port), group_id="mygroup", auto_offset_reset='smallest')
The reason you see no data being consumed is most likely that no data is produced while your consumer is up and running. With the default of "latest", a consumer group that has no committed offsets only sees messages produced after it starts, so you need to indicate that you want to consume all the data already in the topic (even if nothing is being inserted at the moment).
According to the official documentation:
The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.
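For completeness, kafka-python also lets you rewind programmatically. A minimal sketch (the broker address is assumed) using seek_to_beginning() once partitions have been assigned:

from kafka import KafkaConsumer

consumer = KafkaConsumer("topic",
                         bootstrap_servers="localhost:9092",  # assumed address
                         group_id="mygroup")
# poll() triggers the group join and partition assignment
consumer.poll(timeout_ms=1000)
# rewind every assigned partition to its first available offset
consumer.seek_to_beginning()
for record in consumer:
    print(record)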
Related
I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it is consumed again anyway. Is there a way to make Kafka skip previously seen messages and continue from new messages?
from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
    for record in consumer:
        print(record)
OK, I finally got your question. Avoiding a message that has been sent multiple times by a producer (accidentally) can be quite complicated.
There are generally 2 cases:
The simple one is where you have a single instance consuming the messages. In that case your producer can add a uuid to the message payload, and your consumer can keep the ids of the processed messages in an in-memory cache (a sketch of this appears below).
The complicated one is where you have multiple instances consuming messages (which is usually why you need a message broker in the first place - a distributed system). In this scenario you would need an external service to play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
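A minimal sketch of the simple case, assuming the producer puts its unique id in a JSON payload under a hypothetical uuid field (topic and broker names are placeholders):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'TOPIC',                               # placeholder topic name
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))

seen_ids = set()  # in-memory cache of processed message ids
for record in consumer:
    msg_id = record.value.get('uuid')      # assumed field added by the producer
    if msg_id in seen_ids:
        continue                           # skip duplicates
    seen_ids.add(msg_id)
    # ... process record.value here ...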
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         auto_commit_interval_ms=1000,
                         group_id='my-group')
I have two microservices.
MProducer - sends messages to the Kafka queue
MConsumer - reads messages from the Kafka queue
When the consumer crashes and restarts, I want it to continue consuming from the last message.
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
auto_offset_reset='latest',
enable_auto_commit=False)
It looks like you are using kafka-python, so you'll need to pass the group_id argument to your Consumer. See the description for this argument in the KafkaConsumer documentation.
By setting a group id, the Consumer will periodically commit its position to Kafka and will automatically retrieve it upon restarting.
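A minimal sketch of that approach (the topic name and group id are placeholders; the broker address matches the question):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',                       # placeholder topic name
    bootstrap_servers='localhost:9092',
    group_id='mconsumer-group',       # committed offsets are stored per group
    enable_auto_commit=True,          # commit positions periodically
    auto_commit_interval_ms=5000)

for record in consumer:
    # after a crash and restart, consumption resumes from the last committed offset
    print(record.offset, record.value)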
You do that by using a consumer group. Assuming you're using the confluent library, just add 'group.id': 'your-group' to the consumer configuration.
When the service goes down and comes back up, it will start from the last committed point.
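A minimal sketch with confluent-kafka (the topic name and group id are placeholders):

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'your-group',          # enables committed offsets for the group
    'auto.offset.reset': 'earliest',   # only used when the group has no committed offset
    'enable.auto.commit': True,
})
consumer.subscribe(['my-topic'])       # placeholder topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.value())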
The information about each consumer group is saved in a special topic in Kafka (starting from v0.9) called __consumer_offsets. More info in the Kafka docs: https://kafka.apache.org/intro#intro_consumers
I am using the Python client (that comes as part of google-cloud 0.30.0) to process messages.
Sometimes (about 10% of the time) my messages are duplicated. I will get the same message again and again, up to 50 times within a few hours.
My subscription is set up with a 600-second ack deadline, but a message may be resent a minute after its predecessor.
While running, I occasionally get 503 errors (which I log with my policy_class).
Has anybody experienced this behavior? Any ideas?
My code looks like this:
def callback_func(msg):
    try:
        log.info('got %s', msg.data)
        ...
    finally:
        msg.ack()

c = pubsub_v1.SubscriberClient(policy_class)
subscription = c.subscribe(c.subscription_path(my_proj, my_topic))
res = subscription.open(callback=callback_func)
res.result()
The client library you are using uses a new Pub/Sub API for subscribing called StreamingPull. One effect of this is that the subscription deadline you have set is no longer used, and instead one calculated by the client library is. The client library also automatically extends the deadlines of messages for you.
When you get these duplicate messages - have you already ack'd the message when it is redelivered, or is this while you are still processing it? If you have already ack'd, are there some messages you have avoided acking? Some messages may be duplicated if they were ack'd but messages in the same batch needed to be sent again.
Also keep in mind that some duplicates are expected currently if you take over a half hour to process a message.
This seems to be an issue with the google-cloud-pubsub Python client; I upgraded to version 0.29.4 and ack() works as expected.
In general, duplicates can happen given that Google Cloud Pub/Sub offers at-least-once delivery. Typically, this rate should be very low. A rate of 10% would be very high. In this particular instance, it was likely an issue in the client libraries that resulted in excessive duplicates, which was fixed in April 2018.
For the general case of excessive duplicates there are a few things to check to determine if the problem is on the user side or not. There are two places where duplication can happen: on the publish side (where there are two distinct messages that are each delivered once) or on the subscribe side (where there is a single message delivered multiple times). The way to distinguish the cases is to look at the messageID provided with the message. If the same ID is repeated, then the duplication is on the subscribe side. If the IDs are unique, then duplication is happening on the publish side. In the latter case, one should look at the publisher to see if it is getting errors that are resulting in publish retries.
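As a quick way to check which case you are in, you can log the message_id in the callback and look for repeats. A minimal sketch against the current client API (the project and subscription names are placeholders):

from google.cloud import pubsub_v1

seen = set()

def callback(message):
    # A repeated message_id means duplication on the subscribe side;
    # distinct ids for the "same" payload point at the publish side.
    if message.message_id in seen:
        print('duplicate delivery:', message.message_id)
    seen.add(message.message_id)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path('my-project', 'my-subscription')  # placeholders
subscriber.subscribe(path, callback).result()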
If the issue is on the subscriber side, then one should check to ensure that messages are being acknowledged before the ack deadline. Messages that are not acknowledged within this time will be redelivered. If this is the issue, then the solution is to either acknowledge messages faster (perhaps by scaling up with more subscribers for the subscription) or by increasing the acknowledgement deadline. For the Python client library, one sets the acknowledgement deadline by setting the max_lease_duration in the FlowControl object passed into the subscribe method.
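A minimal sketch of setting that deadline via FlowControl (the 600-second value mirrors the question; project and subscription names are placeholders):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path('my-project', 'my-subscription')  # placeholders

# allow up to 600 seconds before an unacked message's lease is given up
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

future = subscriber.subscribe(path,
                              callback=lambda message: message.ack(),  # ack after real processing
                              flow_control=flow_control)
future.result()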
I am trying to launch a dynamic consumer whenever a new topic is created in Kafka, but the dynamically launched consumer always misses the starting/first message, while consuming every message from there on. I am using the kafka-python module with the updated KafkaConsumer and KafkaProducer.
The code for the producer is:
producer = KafkaProducer(bootstrap_servers='localhost:9092')
record_metadata = producer.send(topic, data)
and the code for the consumer is:
consumer = KafkaConsumer(topic,group_id="abc",bootstrap_servers='localhost:9092',auto_offset_reset='earliest')
Please suggest something to overcome this problem, or any configuration I have to include in my producer and consumer instances.
Can you set auto_offset_reset to earliest?
When a new consumer stream is created, it starts from the latest offset (which is the default value for auto_offset_reset), and you will miss messages that were sent before the consumer started.
You can read about it in the kafka-python docs. The relevant portion is below:
auto_offset_reset (str) – A policy for resetting offsets on OffsetOutOfRange errors: ‘earliest’ will move to the oldest available message, ‘latest’ will move to the most recent. Any other value will raise the exception. Default: ‘latest’.
I'd like to do some routing magic with AMQP. My setup is Python with Pika on the consumer/producer side and RabbitMQ for the AMQP server.
What I'd like to achieve:
send a message to a single exchange
(insert magic here)
consume messages like so:
one set of subscribers should be able to retrieve messages based on a routing key
one set of subscribers should just get all messages.
The tricky part is that if any server in the second set has received a message, no other server from the second set should receive it. All the servers from the first set should still be able to consume this message.
Is this possible with a single basic_publish call or do I need to send the message to a routing exchange (for the first set of consumers) and to a "global" exchange for the second set of consumers?
CLARIFICATION:
What I'd like to achieve is a single call to publish a message and have it received by 2 distinct sets of consumers.
Case 1: Just receive messages based on routing key (that is, a message with routing key foo will be received by all the consumers currently interested in that topic).
Case 2: This basically resembles the RabbitMQ tutorial for worker queues. There are a number of workers that will receive messages dispatched in a round-robin way; only one worker will receive a message.
Still, the message that is received by the consumers interested in a certain routing key should be exactly the same as the messages received by the workers, produced by a single API call.
(Hope my question makes sense; I'm not too familiar with AMQP terms.)
To start with, you need to use a topic exchange and publish your messages with a different routing key for each queue. The magic happens when the consumer binds the queue with a binding key (or pattern to be matched). Some consumers just use the routing keys as their binding key. But the second set will use a wildcard pattern for their binding key.
For Case 1, you need to create a queue per consumer, and bind each queue with an appropriate routing key.
For Case 2, just create a single shared queue bound with the wildcard binding key # and have each of your worker consumers consume from it. The broker will dispatch messages to the workers in a round-robin manner (see the sketch below).
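A minimal Pika sketch of that topology (the exchange and queue names are made up for illustration):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='events', exchange_type='topic')  # assumed exchange name

# Case 1: one queue per interested consumer, bound to a specific routing key.
channel.queue_declare(queue='foo-consumer')
channel.queue_bind(queue='foo-consumer', exchange='events', routing_key='foo')

# Case 2: one shared queue for all workers, bound with a wildcard so it sees everything;
# RabbitMQ round-robins its messages across the workers consuming from it.
channel.queue_declare(queue='workers')
channel.queue_bind(queue='workers', exchange='events', routing_key='#')

# A single publish reaches both the matching Case 1 queue and the workers queue.
channel.basic_publish(exchange='events', routing_key='foo', body=b'hello')
connection.close()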
Here's a screenshot of what it would look like in RabbitMQ. In this example there are two consumers from your "case 1" (Foo and Bar) and one queue for all the workers to satisfy "case 2".
This model should be supported by all AMQP-compliant brokers and wouldn't require any vendor-specific enhancements.