avoid duplicate message from kafka consumer in kafka-python

avoid duplicate message from kafka consumer in kafka-python - python

I have a unique id in my data and I am sending to kafka with kafka-python library. When I send samne data to kafka topic, it consumes same data anyway. Is there way to make kafka skip previous messages and contiunue from new messages.
def consume_from_kafka():
consumer = KafkaConsumer(
TOPIC,
bootstrap_servers=["localhost"],
group_id='my-group')

Ok, I finally got your question. Avoiding a message that has been sent multiple times by a producer (incidentally) could be very complicated.
There are generally 2 cases:
The simple one where you have a single instance that consumes the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of the processed messages in a in memory cache.
The complicated one is where you have multiple instances that consume messages (that is usually why you'd need message brokers - a distributed system). In this scenario you would need to use an external service that would play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
Hope that helps.

Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
auto_offset_reset='earliest', enable_auto_commit=True,
auto_commit_interval_ms=1000, group_id='my-group')

Related

Is it possible to unacknowledge messages in synchronous pull for pubsub

I am pulling pubsub messages through a subscription and need to acknowledge these before processing as I am doing multiprocessing and that throws an error of SSL corruption on account of the grpc module.
I want to ack all messages beforehand and unack in case there was an error, I am aware that we can do this for an asynchronous pull but is there a way where we can implement unack in synchronous pull as well?
I am using the official python module to pull from subscription

I suppose that unack you mean nack explained in Python API reference:
In Pub/Sub, the term ack stands for “acknowledge”.
...
It is also possible to nack a message, which is the opposite...
The same documentation contain part Pulling a Subscription Synchronously
in which it is explained how to nack with modify_ack_deadline():
If you want to nack some of the received messages (...), you can use the modify_ack_deadline() method and set their
acknowledge deadlines to zero. This will cause them to be dropped by
this client and the backend will try to re-deliver them.

How to implement request-reply (synchronous) messaging paradigm in Kafka?

I am going to use Kafka as a message broker in my application. This application is written entirely using Python. For a part of this application (Login and Authentication), I need to implement a request-reply messaging system. In other words, the producer needs to get the response of the produced message from the consumer, synchronously.
Is it feasible using Kafka and its Python libraries (kafka-python, ...) ?

I'm facing the same issue (request-reply for an HTTP hit in my case)
My first bet was (100% python):
start a consumer thread,
publish the request message (including a request_id)
join the consumer thread
get the answer from the consumer thread
The consumer thread subscribe to the reply topic (seeked to end) and deals with received messages until finding the request_id (modulus timeout)
If it works for a basic testing, unfortunatly, creating a KafkaConsumer object is a slow process (~300ms) so it's not an option for a system with massive traffic.
In addition, if your system deals with parallel request-reply (for example, multi-threaded like a web server is) you'll need to create a KafkaConsumer dedicated to request_id (basically by using request_id as consumer_group) to avoid to have reply to request published by thread-A consumed (and ignored) by thread-B.
So you can't here reclycle your KafkaConsumer and have to pay the creation time for each request (in addition to processing time on backend).
If your request-reply processing is not parallelizable you can try to keep the KafkaConsuser object available for threads started to get answer
The only solution I can see at this point is to use a DB (relational/noSQL):
requestor store request_id in DB (as local as possible) aznd publish request in kafka
requestor poll DB until finding answer to request_id
In parallel, a consumer process receiving messages from reply topic and storing result in DB
But I don't like polling..... It wil generate heavy load on DB in a massive traffic system
My 2CTS

PyKafka Api usage

I am a newbie to Kafka and PyKafka.I know that a producer and a consumer are made in PyKafka via the below code.
from pykafka import KafkaClient
client = KafkaClient("localhost:9092")
topic = client.topics["topicname"]
producer = topic.get_producer()
consumer = topic.get_simple_consumer()
I want to know what is KafkaClient, and how it is helping in creating producer and consumer.
I have read we can create cluster and broker also using client.cluster and client.broker, but I can't understand the use of client here.

To make terms simpler, replace Kafka with "server".
You interact with servers with clients.
To interact with Kafka, in particular, you send messages to topics via producers, and get messages with consumers.
I don't know this library off-hand, but .broker and .cluster aren't actually "making a Kafka broker / cluster", only establishing a connection to an existing one, from which you can issue later commands.
You need the client. on those function calls because the client is a wrapper around both
To know why it is structured in this way, you'd have to ask the developers themselves

pykafka.KafkaClient is the root object of the PyKafka API, providing an interface to Kafka brokers as well as the ability to instantiate consumer and producer instances. The KafkaClient can be thought of as a representation of the totality of one Python process' interaction with a given Kafka cluster. There is no direct comparison between KafkaClient and any of the concepts mentioned in the official Kafka documentation.
It's totally possible in theory to design a python Kafka client library that doesn't have a "client" class like KafkaClient. We decided not to since in our opinion a single root class provides a cleaner, more learnable interface than a bag of various classes.

Kafka-Python consume last unread message

I have two microservices.
MProducer - sending messages to kafka queue
MConsumer - reading messages from kafka queue
When consumer crashes and restart, I want to continue consuming from last message.
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
auto_offset_reset='latest',
enable_auto_commit=False)

It looks like you are using kafka-python, so you'll need to pass the group_id argument to your Consumer. See the description for this argument in the KafkaConsumer documentation.
By setting a group id, the Consumer will periodically commit its position to Kafka and will automatically retrieve it upon restarting.

You do that by having a consumer group. Assuming you're using confluent library then just add 'group.id': 'your-group'
When the service is down then coming up, it will start from last committed point.
The information about each consumer group is saved in a special topic in Kafka (starting from v0.9) called __consumer_offsets. More info in kafka docs [https://kafka.apache.org/intro#intro_consumers]

Google PubSub message duplication

I am using The python client (That comes as part of google-cloud 0.30.0) to process messages.
Sometimes (about 10% ) my messages are being duplicated. I will get the same message again and again up to 50 instances within a few hours.
My Subscription setup is for a 600 seconds ack time but a message may be resent a minute after its predecessor.
While running , I would occasionally get 503 errors (Which I log with my policy_class)
Has anybody experienced that behavior? any ideas ?
My code look like
c = pubsub_v1.SubscriberClient(policy_class)
subscription = c.subscribe(c.subscription_path(my_proj ,my_topic)
res = subscription.open(callback=callback_func)
res.result()
def callback_func(msg)
try:
log.info('got %s', msg.data )
...
finally:
ms.ack()

The client library you are using uses a new Pub/Sub API for subscribing called StreamingPull. One effect of this is that the subscription deadline you have set is no longer used, and instead one calculated by the client library is. The client library also automatically extends the deadlines of messages for you.
When you get these duplicate messages - have you already ack'd the message when it is redelivered, or is this while you are still processing it? If you have already ack'd, are there some messages you have avoided acking? Some messages may be duplicated if they were ack'd but messages in the same batch needed to be sent again.
Also keep in mind that some duplicates are expected currently if you take over a half hour to process a message.

This seems to be an issue with google-cloud-pubsub python client, I upgraded to version 0.29.4 and ack() work as expected

In general, duplicates can happen given that Google Cloud Pub/Sub offers at-least-once delivery. Typically, this rate should be very low. A rate of 10% would be very high. In this particular instance, it was likely an issue in the client libraries that resulted in excessive duplicates, which was fixed in April 2018.
For the general case of excessive duplicates there are a few things to check to determine if the problem is on the user side or not. There are two places where duplication can happen: on the publish side (where there are two distinct messages that are each delivered once) or on the subscribe side (where there is a single message delivered multiple times). The way to distinguish the cases is to look at the messageID provided with the message. If the same ID is repeated, then the duplication is on the subscribe side. If the IDs are unique, then duplication is happening on the publish side. In the latter case, one should look at the publisher to see if it is getting errors that are resulting in publish retries.
If the issue is on the subscriber side, then one should check to ensure that messages are being acknowledged before the ack deadline. Messages that are not acknowledged within this time will be redelivered. If this is the issue, then the solution is to either acknowledge messages faster (perhaps by scaling up with more subscribers for the subscription) or by increasing the acknowledgement deadline. For the Python client library, one sets the acknowledgement deadline by setting the max_lease_duration in the FlowControl object passed into the subscribe method.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.