I have an IoT device that streams data to a Python server every 15 ms. The Python server uploads the data to Kafka, and another server consumes it.
I have a partition key for the data based on sensor id. The latency for the first few messages is sub-30 ms, but then it skyrockets to 500 ms before slowly coming down and then repeating (image below).
My assumption is that the producer is batching the data before sending it. I can't seem to find a setting to turn this off so that my latency is consistent. The issue happens even if I send a blank message.
Here's the code for my producer:

producer = KafkaProducer(
    bootstrap_servers=get_kafka_brokers(),
    security_protocol='SSL',
    ssl_context=get_kafka_ssl_context(),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    batch_size=0,
    acks=1
)

message = {}
producer.send(app.config['NEW_LOG_TOPIC'], message, key=str(device.id).encode('utf-8'))
I have been reading the documentation up and down and have tried several different configurations, but nothing has helped. My servers and Kafka instance are running on Heroku.
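For reference, a sketch of the kind of variations I have been experimenting with (the linger_ms setting and the explicit flush() here are guesses on my part, not a confirmed fix):

producer = KafkaProducer(
    bootstrap_servers=get_kafka_brokers(),
    security_protocol='SSL',
    ssl_context=get_kafka_ssl_context(),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    batch_size=0,   # intended to disable size-based batching
    linger_ms=0,    # do not wait to fill a batch (already the kafka-python default)
    acks=1
)

future = producer.send(app.config['NEW_LOG_TOPIC'], message,
                       key=str(device.id).encode('utf-8'))
producer.flush()    # block until the message has actually left the client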
Any help is appreciated.
I have noticed that under high load Pub/Sub gives great throughput with pretty low latency. But if I want to send a single message, the latency can often be several seconds. I have used the publish_time on the incoming message to see how long the message spent in the queue, and it is usually pretty low. I can't tell whether, under very low traffic conditions, a published message doesn't actually get sent by the client libraries right away, or whether the libraries don't deliver it to the application immediately. I am using asynchronous pull in Python.
There can be several factors that impact latency of low-throughput Pub/Sub streams. First of all, the publish-side client library does wait a period of time to try to batch messages by default. You can get a little bit of improvement by setting the max_messages property of the pubsub_v1.types.BatchSettings to 1, which will ensure that every message is sent as soon as it is ready.
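For example, a minimal sketch of that setting (project and topic names here are placeholders):

from google.cloud import pubsub_v1

# Limit each batch to a single message so publishes go out immediately.
# 'my-project' and 'my-topic' are placeholders.
batch_settings = pubsub_v1.types.BatchSettings(max_messages=1)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path('my-project', 'my-topic')

future = publisher.publish(topic_path, b'payload')
print(future.result())  # resolves to the message ID once the publish completes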
In general, there is also the issue of cold caches in the Pub/Sub service. If your publish rate is infrequent, say, O(1) publish call every 10-15 minutes, then the service may have to load state on each publish that can delay the delivery. If low latency for these messages is very important, our current recommendation is to send a heartbeat message every few seconds to keep all of the state active. You can add an attribute to the messages to indicate it is a heartbeat message and have your subscriber ignore them.
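A rough sketch of that heartbeat pattern (the attribute name and the 5-second interval are arbitrary choices, not anything the library requires):

import time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')  # placeholders

def send_heartbeats(interval_seconds=5):
    # Publisher side: keep the service state warm with periodic heartbeats.
    while True:
        publisher.publish(topic_path, b'heartbeat', heartbeat='true')
        time.sleep(interval_seconds)

def callback(message):
    # Subscriber side: acknowledge and skip heartbeat messages.
    if message.attributes.get('heartbeat') == 'true':
        message.ack()
        return
    # ... handle real messages here ...
    message.ack()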
I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it consumes the same data anyway. Is there a way to make Kafka skip previous messages and continue from new messages?
def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
Ok, I finally got your question. Avoiding reprocessing of a message that a producer has (accidentally) sent multiple times can be quite complicated.
There are generally 2 cases:
The simple one is where you have a single instance that consumes the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of the processed messages in an in-memory cache.
The complicated one is where you have multiple instances that consume messages (that is usually why you'd need message brokers - a distributed system). In this scenario you would need to use an external service that would play the role of the distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
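A rough sketch of the second approach (the Redis key prefix, the TTL, and the process() handler are arbitrary choices for illustration):

import json
import redis
from kafka import KafkaConsumer

r = redis.Redis(host='localhost', port=6379)
consumer = KafkaConsumer(TOPIC, bootstrap_servers=["localhost"], group_id='my-group')

for record in consumer:
    payload = json.loads(record.value)
    message_id = payload['uuid']  # the uuid the producer added to the payload
    # SET with nx=True succeeds only the first time this id is seen;
    # the 24h expiry keeps the cache from growing forever.
    first_time = r.set('processed:' + message_id, 1, nx=True, ex=86400)
    if not first_time:
        continue  # duplicate, skip it
    process(payload)  # hypothetical handler for your real work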
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
consumer = KafkaConsumer(
    'TOPIC',
    bootstrap_servers=KAFKA,
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    auto_commit_interval_ms=1000,
    group_id='my-group')
I followed the link How can I send large messages with Kafka (over 15MB)? to resolve the Kafka message-size limit issue, but had no luck.
I tried increasing
A.) On Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
B.) On Consumer: fetch.message.max.bytes=15728640
Still facing the same problem. Not able to consume data that is over 1.3 MB on a particular topic
In my application, a message is sent on a topic from Python code and is consumed on a Node server.
Kafka does have a strict restriction on message size; the default is 1 MB.
I believe you have missed the topic-level config.
There are multiple configs at different levels:
The broker has message.max.bytes (default is 1000012): http://kafka.apache.org/documentation/#brokerconfigs
The topic has max.message.bytes (default is 1000012): http://kafka.apache.org/documentation/#topicconfigs
The producer has max.request.size (default is 1048576): http://kafka.apache.org/documentation/#producerconfigs
The consumer has max.partition.fetch.bytes (default is 1048576): http://kafka.apache.org/documentation/#consumerconfigs
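With kafka-python on the client side, those producer and consumer limits map to constructor arguments, roughly like this (a sketch; the 15 MB values just mirror the numbers above, and the topic-level max.message.bytes still has to be changed on the topic itself with Kafka's admin tooling):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    max_request_size=15728640,           # client-side max.request.size
)

consumer = KafkaConsumer(
    'my-topic',                          # placeholder topic name
    bootstrap_servers=['localhost:9092'],
    fetch_max_bytes=52428800,            # overall fetch size per request (default shown)
    max_partition_fetch_bytes=15728640,  # max.partition.fetch.bytes
)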
I am using the Python client (the one that comes as part of google-cloud 0.30.0) to process messages.
Sometimes (about 10% of the time) my messages are being duplicated. I will get the same message again and again, up to 50 instances within a few hours.
My subscription is set up with a 600-second ack deadline, but a message may be resent a minute after its predecessor.
While running, I occasionally get 503 errors (which I log with my policy_class).
Has anybody experienced this behavior? Any ideas?
My code looks like:

def callback_func(msg):
    try:
        log.info('got %s', msg.data)
        ...
    finally:
        msg.ack()

c = pubsub_v1.SubscriberClient(policy_class)
subscription = c.subscribe(c.subscription_path(my_proj, my_topic))
res = subscription.open(callback=callback_func)
res.result()
The client library you are using uses a new Pub/Sub API for subscribing called StreamingPull. One effect of this is that the subscription deadline you have set is no longer used; instead, the client library calculates one and also automatically extends the deadlines of messages for you.
When you get these duplicate messages - have you already ack'd the message when it is redelivered, or is this while you are still processing it? If you have already ack'd, are there some messages you have avoided acking? Some messages may be duplicated if they were ack'd but messages in the same batch needed to be sent again.
Also keep in mind that some duplicates are expected currently if you take over a half hour to process a message.
This seems to be an issue with the google-cloud-pubsub Python client; I upgraded to version 0.29.4 and ack() works as expected.
In general, duplicates can happen given that Google Cloud Pub/Sub offers at-least-once delivery. Typically, this rate should be very low. A rate of 10% would be very high. In this particular instance, it was likely an issue in the client libraries that resulted in excessive duplicates, which was fixed in April 2018.
For the general case of excessive duplicates there are a few things to check to determine if the problem is on the user side or not. There are two places where duplication can happen: on the publish side (where there are two distinct messages that are each delivered once) or on the subscribe side (where there is a single message delivered multiple times). The way to distinguish the cases is to look at the messageID provided with the message. If the same ID is repeated, then the duplication is on the subscribe side. If the IDs are unique, then duplication is happening on the publish side. In the latter case, one should look at the publisher to see if it is getting errors that are resulting in publish retries.
If the issue is on the subscriber side, then one should check to ensure that messages are being acknowledged before the ack deadline. Messages that are not acknowledged within this time will be redelivered. If this is the issue, then the solution is to either acknowledge messages faster (perhaps by scaling up with more subscribers for the subscription) or by increasing the acknowledgement deadline. For the Python client library, one sets the acknowledgement deadline by setting the max_lease_duration in the FlowControl object passed into the subscribe method.
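A minimal sketch of that last point (the subscription path is a placeholder and the 600-second lease is just an example value):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-subscription')

# Raise the effective acknowledgement deadline the client library manages.
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

def callback(message):
    # ... process the message ...
    message.ack()

future = subscriber.subscribe(subscription_path, callback=callback,
                              flow_control=flow_control)
future.result()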
This project is for a real-time search engine, for log-analysis performance.
I have data streaming live out of Spark processing into Kafka.
Now, with that Kafka output,
I want to get the data from Kafka using Flask and visualize it using Chart.js or some other visualization library.
How do I get the live streaming data from Kafka using Python and Flask?
Any idea how I should start?
Any help would be greatly appreciated!
Thanks!
I would check out the Kafka package for python:
http://kafka-python.readthedocs.org/en/master/usage.html
This should get you set up to stream data from Kafka. Additionally, I might check out this project: https://github.com/travel-intelligence/flasfka, which has to do with using Flask and Kafka together (just found it in a Google search).
I'm working on a similar problem (small Flask app with live streaming data coming out of Kafka).
You have to do a couple things to set this up. First, you need a KafkaConsumer to grab messages:
from kafka import KafkaConsumer

consumer = KafkaConsumer(group_id='groupid', bootstrap_servers=kafkaserver)
consumer.subscribe(topics=['topicid'])
try:
    # This iterator should auto-commit offsets as you consume them.
    # If it doesn't, turn on logging.DEBUG to see why auto-commit gets turned off.
    # Not assigning a group_id can be one cause.
    for msg in consumer:
        pass  # TODO: process the kafka messages.
finally:
    # Always close your producers/consumers when you're done
    consumer.close()
This is about the most basic KafkaConsumer. The for loop blocks the thread and keeps iterating as messages arrive, committing offsets as it goes. There is also the consumer.poll() method to grab whatever messages are available within a given time window, depending on how you want to architect the data flow. Kafka was designed with long-running consumer processes in mind, but if you commit offsets properly you can open and close consumers on an as-needed basis as well.
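For instance, a small sketch of the poll() style (the timeout, max_records, and the handle() function are placeholders):

# Pull whatever is available within a short window instead of blocking
# on the iterator; values here are just examples.
records = consumer.poll(timeout_ms=500, max_records=100)
for tp, messages in records.items():
    for msg in messages:
        handle(msg.value)  # hypothetical handler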
Now that you have the data, you can stream it to the browser with Flask. I'm not familiar with ChartJS, but live streaming from Flask centers on a view function that yields data inside a loop instead of simply returning once at the end of processing.
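A bare-bones sketch of that pattern (the topic, group, and broker names and the UTF-8 decode are assumptions; server-sent-events framing is one convenient way to feed a JS chart):

from flask import Flask, Response
from kafka import KafkaConsumer

app = Flask(__name__)

@app.route('/stream')
def stream():
    def generate():
        # A fresh consumer per request; topic/group/broker names are placeholders.
        consumer = KafkaConsumer('topicid', group_id='groupid',
                                 bootstrap_servers='kafkaserver:9092')
        try:
            for msg in consumer:
                # Server-sent-events framing so the browser can read it as it arrives.
                yield 'data: {}\n\n'.format(msg.value.decode('utf-8'))
        finally:
            consumer.close()
    return Response(generate(), mimetype='text/event-stream')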
Check out Miguel Grinberg's blog and his follow-up on streaming as practical examples of streaming with Flask. (Note: anyone actually streaming video in a serious web app will probably want to encode it with a video codec like the widely used H.264 using ffmpy and wrap it in MPEG-DASH ...or maybe choose a framework that does more of this stuff for you.)