Google PubSub message duplication - python

I am using the Python client (the one that ships as part of google-cloud 0.30.0) to process messages.
Sometimes (about 10% of the time) my messages are duplicated: I get the same message again and again, up to 50 instances within a few hours.
My subscription is set up with a 600-second ack deadline, yet a message may be resent a minute after its predecessor.
While running, I occasionally get 503 errors (which I log with my policy_class).
Has anybody experienced this behavior? Any ideas?
My code looks like this:
from google.cloud import pubsub_v1

c = pubsub_v1.SubscriberClient(policy_class)
subscription = c.subscribe(c.subscription_path(my_proj, my_topic))
res = subscription.open(callback=callback_func)
res.result()

def callback_func(msg):
    try:
        log.info('got %s', msg.data)
        ...
    finally:
        msg.ack()

The client library you are using subscribes with a newer Pub/Sub API called StreamingPull. One effect of this is that the ack deadline you configured on the subscription is no longer used; instead, the client library calculates one itself and automatically extends the deadlines of outstanding messages for you.
When you get these duplicate messages - have you already ack'd the message when it is redelivered, or is this while you are still processing it? If you have already ack'd, are there some messages you have avoided acking? Some messages may be duplicated if they were ack'd but messages in the same batch needed to be sent again.
Also keep in mind that some duplicates are expected currently if you take over a half hour to process a message.

This seems to be an issue with the google-cloud-pubsub Python client. After I upgraded to version 0.29.4, ack() works as expected.

In general, duplicates can happen given that Google Cloud Pub/Sub offers at-least-once delivery. Typically, this rate should be very low. A rate of 10% would be very high. In this particular instance, it was likely an issue in the client libraries that resulted in excessive duplicates, which was fixed in April 2018.
For the general case of excessive duplicates there are a few things to check to determine if the problem is on the user side or not. There are two places where duplication can happen: on the publish side (where there are two distinct messages that are each delivered once) or on the subscribe side (where there is a single message delivered multiple times). The way to distinguish the cases is to look at the messageID provided with the message. If the same ID is repeated, then the duplication is on the subscribe side. If the IDs are unique, then duplication is happening on the publish side. In the latter case, one should look at the publisher to see if it is getting errors that are resulting in publish retries.
If the issue is on the subscriber side, then one should check to ensure that messages are being acknowledged before the ack deadline. Messages that are not acknowledged within this time will be redelivered. If this is the issue, then the solution is to either acknowledge messages faster (perhaps by scaling up with more subscribers for the subscription) or by increasing the acknowledgement deadline. For the Python client library, one sets the acknowledgement deadline by setting the max_lease_duration in the FlowControl object passed into the subscribe method.
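For illustration, here is a minimal sketch of setting the deadline this way; the project, subscription, and handle() function are placeholders, and exact parameter names may differ between google-cloud-pubsub releases:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Let the client library lease a message (keep extending its ack deadline)
# for up to 10 minutes before it gives up and the backend redelivers it.
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

def callback(message):
    handle(message)  # placeholder for your processing logic
    message.ack()

future = subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)
future.result()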

Related

avoid duplicate message from kafka consumer in kafka-python

I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, the consumer processes it again anyway. Is there a way to make Kafka skip previously seen messages and continue from new ones?
def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
Ok, I finally got your question. Avoiding reprocessing of a message that a producer has (incidentally) sent multiple times can be quite involved.
There are generally 2 cases:
The simple one, where you have a single instance consuming the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of processed messages in an in-memory cache.
The complicated one is where you have multiple instances that consume messages (that is usually why you'd need message brokers - a distributed system). In this scenario you would need to use an external service that would play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
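A minimal sketch of the first (single-instance) case with kafka-python, assuming the producer puts a uuid field into a JSON payload (the topic name and process() function are placeholders):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',
    bootstrap_servers=['localhost'],
    group_id='my-group',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))

seen_ids = set()  # in-memory cache of already-processed message ids

for record in consumer:
    msg_id = record.value['uuid']
    if msg_id in seen_ids:
        continue  # duplicate: already handled, skip it
    process(record.value)  # placeholder for your processing logic
    seen_ids.add(msg_id)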
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')

Have channels coincide with rooms in Python-socketio, and general pubsub questions

I’m working on a project involving websockets using python-socketio.
Our primary concern is fluidity: each connected user has a cursor whose position on the board is sent as an event every 50 ms, boards are identified as (socket) rooms, and we are expecting many of them.
I'm new to pub/sub; we are horizontally scaling our architecture, and it seems to be the right fit for broadcasting events.
I had a look at the AsyncRedisManager class and, from my understanding, any message sent by any socket on any socketio server (with pub/sub) is published from that server to Redis on a single channel. Subscribers to this channel then see the whole flow of messages.
I'm hence concerned about 3 things:
Since all messages go through one channel, isn't this a "design flaw"? Some servers might have no sockets connected to one specific room at the moment, yet they will still receive (and pickle.loads) messages they don't care about at that time.
The actual details of these messages (room, payload, etc.) are pickle.dumps'd and pickle.loads'd by the servers. With 50 rooms of 50 cursors each sending 25 events/s, isn't this going to be a huge CPU-bound bottleneck?
Comparing the socket.io Redis adapter docs side by side with the python-socketio pub/sub manager, it seems channels there are dynamically namespaced like "socketio#room_name" and messages are broadcast to these namespaced channels, so psubscribe would be a viable solution. Some other message queues use the term "topics" for this.
Even if the former assumption is correct, a server still cannot decide whether it should psubscribe to channel#room_name unless it knows whether at least one of its own sockets is in that room.
I understand the paradigm of pub/sub is, from Redis page:
Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be. Subscribers express interest in one or more channels, and only receive messages that are of interest, without knowledge of what (if any) publishers there are.
But my question would be summarized as:
Is it possible to make python-socketio servers dynamically subscribe/unsubscribe to channels whenever there is a need for it, with channels identified as rooms, hence having as many channels as rooms in total? Would that be feasible while keeping this simple "plug-and-play" logic as a PubSubManager subclass? Am I missing something, or does this make sense?
Thank you for your time, any ideas, corrections, or “draft” code would be greatly appreciated.
is it possible to make Python-socketio servers dynamically subscribe/unsubscribe to channels whenever there is a need for it, with channels identified as rooms
I guess it is possible, with a custom client manager class. You would need to inherit from one of the existing client managers or the base client manager and implement different pub/sub logic that fits your needs. But keep in mind that if you have 10,000 clients, there are going to be at least 10,000 rooms, since each client gets a personal room.

Is it possible to unacknowledge messages in synchronous pull for pubsub

I am pulling Pub/Sub messages through a subscription and need to acknowledge them before processing, because I am doing multiprocessing and otherwise the grpc module throws an SSL-corruption error.
I want to ack all messages beforehand and unack them in case there was an error. I am aware that we can do this with an asynchronous pull, but is there a way to implement unack with a synchronous pull as well?
I am using the official Python module to pull from the subscription.
I suppose that by unack you mean nack, as explained in the Python API reference:
In Pub/Sub, the term ack stands for “acknowledge”.
...
It is also possible to nack a message, which is the opposite...
The same documentation contains a section, Pulling a Subscription Synchronously, which explains how to nack with modify_ack_deadline():
If you want to nack some of the received messages (...), you can use the modify_ack_deadline() method and set their acknowledge deadlines to zero. This will cause them to be dropped by this client and the backend will try to re-deliver them.
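A minimal sketch of this approach with a synchronous pull; the project, subscription, and process() function are placeholders, and the request-dict call style follows newer google-cloud-pubsub releases (older ones take positional arguments):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 10})

ack_ids, nack_ids = [], []
for received in response.received_messages:
    try:
        process(received.message)  # placeholder for your processing logic
        ack_ids.append(received.ack_id)
    except Exception:
        nack_ids.append(received.ack_id)

if ack_ids:
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": ack_ids})
if nack_ids:
    # A zero deadline "nacks" the messages: this client drops them and the
    # backend will try to re-deliver them.
    subscriber.modify_ack_deadline(
        request={"subscription": subscription_path,
                 "ack_ids": nack_ids,
                 "ack_deadline_seconds": 0})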

Allowing message dropping in websockets

Is there a simple method or library to allow a websocket to drop certain messages if bandwidth doesn't allow? Or any one of the following?
to measure the queue size of outgoing messages that haven't yet reached a particular client
to measure the approximate bitrate that a client has been receiving recent messages at
to measure the time that a particular write_message finished being transmitted to the client
I'm using Tornado on the server side (tornado.websocket.WebSocketHandler) and vanilla JS on the client side. In my use case it's really only important that the server notices when a client is slow and throttles its messages (or uses lossier compression) once it detects that condition.
You can implement this on top of what you have by having the client confirm every message it gets and then use that information on the server to adapt the sending of messages to each client.
This is the only way you will know which outgoing messages haven't yet reached the client, be able to approximate the bitrate, or figure out how long a message took to reach the client. Keep in mind that the confirmation back to the server also takes time, and that if you use timestamps on the client, they will likely not match your server's, since clients have their clocks set incorrectly more often than not.
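As a rough illustration, here is a sketch of that idea on top of tornado.websocket.WebSocketHandler; it assumes the JS client echoes back {"ack": <seq>} for every message it receives, and the backlog threshold is arbitrary:

import json
import time
import tornado.websocket

MAX_OUTSTANDING = 20  # arbitrary backlog limit before we start dropping

class ThrottledHandler(tornado.websocket.WebSocketHandler):
    def open(self):
        self.seq = 0
        self.sent_at = {}  # seq -> send time, for messages not yet confirmed

    def send_update(self, payload):
        if len(self.sent_at) >= MAX_OUTSTANDING:
            return  # client is falling behind: drop this update
        self.seq += 1
        self.sent_at[self.seq] = time.monotonic()
        self.write_message(json.dumps({"seq": self.seq, "data": payload}))

    def on_message(self, message):
        ack = json.loads(message).get("ack")
        sent = self.sent_at.pop(ack, None)
        if sent is not None:
            # Round-trip time and the current backlog size give a rough
            # per-client picture that can drive throttling or compression.
            self.last_rtt = time.monotonic() - sent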

kafka producer parameters require one message sent to take effect

I'm using confluent-kafka-python (https://github.com/confluentinc/confluent-kafka-python) to send some messages to Kafka from Python. I send messages infrequently, so I want the latency to be really low.
If I do this, I can get messages to appear to my consumer with about a 2ms delay:
conf = { "bootstrap.servers" : "kafka-test-10-01",
"queue.buffering.max.ms" : 0,
'batch.num.messages': 1,
'queue.buffering.max.messages': 100,
"default.topic.config" : {"acks" : 0 }}
p = confluent_kafka.Producer(**conf)
p.produce(...)
BUT: the latency only drops to near zero after I've sent a first message with this new producer. Subsequent messages have latency near the 2ms mark.
The first message, though, has a latency of around 1 second. Why?
Magnus Edenhill, the author of librdkafka, documented some useful parameters to set to decrease latency in any librdkafka client:
https://github.com/edenhill/librdkafka/wiki/How-to-decrease-message-latency
You don't show your consumer parameters, but from your description it sounds like the consumer is polling and rightly getting nothing (null messages) before the first message is published. It then waits the default 500 ms fetch.error.backoff.ms interval before polling again and receiving the first message. After that, the messages are probably coming fast enough that the error backoff is not triggered. Perhaps try setting fetch.error.backoff.ms lower and see if that helps.
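A hedged sketch of that suggestion with a confluent-kafka consumer (broker address, group id, and topic are placeholders):

import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "kafka-test-10-01",
    "group.id": "my-group",
    # librdkafka default is 500 ms; lowering it shortens the pause after an
    # empty or failed fetch, at the cost of more frequent fetch requests
    "fetch.error.backoff.ms": 10,
})
consumer.subscribe(["my-topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.value())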
