Not able to consume large messages with Kafka (over 1.3 MB)? - python

I followed the link How can I send large messages with Kafka (over 15MB)? to resolve the Kafka message size limit issue, but no luck.
I tried increasing:
A.) On Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
B.) On Consumer: fetch.message.max.bytes=15728640
Still facing the same problem: not able to consume data that is over 1.3 MB on a particular topic.
In my application, a message is sent on a topic from Python code and is consumed on a Node.js server.

Kafka does have a strict restriction on message size; the default is about 1 MB.
I believe you have missed the topic-level config.
There are multiple configs at different levels:
The broker has message.max.bytes (default is 1000012): http://kafka.apache.org/documentation/#brokerconfigs
The topic has max.message.bytes (default is 1000012): http://kafka.apache.org/documentation/#topicconfigs
The producer has max.request.size (default is 1048576): http://kafka.apache.org/documentation/#producerconfigs
The consumer has max.partition.fetch.bytes (default is 1048576): http://kafka.apache.org/documentation/#consumerconfigs
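On the client side, kafka-python takes these as constructor arguments rather than dotted property names. A minimal sketch, assuming a hypothetical topic called big-messages and a local broker:

from kafka import KafkaProducer, KafkaConsumer

# Producer: allow requests larger than the ~1 MB default.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    max_request_size=15728640,           # ~15 MB, matching message.max.bytes on the broker
)

# Consumer: allow large records to be fetched from a single partition.
consumer = KafkaConsumer(
    'big-messages',                      # hypothetical topic name
    bootstrap_servers='localhost:9092',
    max_partition_fetch_bytes=15728640,  # raise the 1 MB per-partition default
)

The topic-level override still has to be applied on the broker side, for example:
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name big-messages --add-config max.message.bytes=15728640
The Node.js consumer on the other end may also need its own fetch-size setting raised.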

Related

avoid duplicate message from kafka consumer in kafka-python

I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it consumes the same data anyway. Is there a way to make Kafka skip previous messages and continue from new messages?
from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
OK, I finally got your question. Avoiding a message that has been (accidentally) sent multiple times by a producer can be quite complicated.
There are generally two cases:
The simple one is where you have a single instance that consumes the messages. In that case your producer can add a UUID to the message payload and your consumer can keep the ids of the processed messages in an in-memory cache.
The complicated one is where you have multiple instances that consume messages (which is usually why you'd need a message broker - a distributed system). In this scenario you would need an external service to play the role of a distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
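For the simple single-instance case, a minimal kafka-python sketch could look like this (the topic name, broker address, and the uuid field in the payload are assumptions):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',                          # hypothetical topic
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

seen_ids = set()  # lost on restart; use Redis or a database for multiple instances

for record in consumer:
    message_id = record.value.get('uuid')
    if message_id in seen_ids:
        continue  # duplicate, skip it
    seen_ids.add(message_id)
    # ... process the message ...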
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the kafka-python lib.
from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')

how to set kafka-python with kerberos (and how to set JAAS and krb5)

I want to set up a Kafka consumer (using Python) that connects to a remote Kafka broker, but it requires Kerberos authentication.
So from what I understood, I am required to have jaas.conf and krb5.conf.
The following is my code snippet:
from kafka import KafkaConsumer
consumer = KafkaConsumer(bootstrap_servers=brokers, group_id='group_id', auto_offset_reset='earliest',
                         security_protocol='SASL_PLAINTEXT', sasl_mechanism='GSSAPI',
                         sasl_kerberos_service_name='kafka')
But I am not sure how and where to put the JAAS and krb5 files.
I read that I need to set them as
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
-Djava.security.krb5.conf=/etc/kafka/krb5.conf
but if my understanding is correct, that is for the Kafka server (not for a client consumer).
If I do indeed need to set both jaas.conf and krb5.conf, how should I do it as a consumer?
Because I am not familiar with Kerberos, it seems that I have been taking bits of information from everywhere and reached the wrong conclusion. Any help is much appreciated!
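In case it helps, here is a rough sketch of how a kafka-python GSSAPI consumer is usually wired up; this is an assumption-laden outline, not a verified setup. jaas.conf is a JVM-client concept with no kafka-python equivalent, while krb5.conf is read by the system Kerberos libraries (used through the gssapi Python package) and can be pointed at a custom file via the KRB5_CONFIG environment variable. The paths, principal, topic, and broker address below are placeholders.

import os
from kafka import KafkaConsumer

# Point the system Kerberos libraries at a custom krb5.conf (placeholder path).
os.environ['KRB5_CONFIG'] = '/etc/kafka/krb5.conf'

# Assumes a ticket was already obtained, e.g.:
#   kinit user@EXAMPLE.COM
# or with a keytab:
#   kinit -kt /path/to/user.keytab user@EXAMPLE.COM
# and that the gssapi package is installed (pip install kafka-python gssapi).
consumer = KafkaConsumer(
    'my-topic',                              # placeholder topic
    bootstrap_servers=['broker.example.com:9092'],
    group_id='group_id',
    auto_offset_reset='earliest',
    security_protocol='SASL_PLAINTEXT',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name='kafka',
)

for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)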

Azure Storage Queue data overhead with python

I'm using Wireshark to understand the data usage when posting messages to an Azure Storage queue via the Python SDK (https://pypi.org/project/azure-storage-queue/).
The Wireshark capture filter has been set to show communication to and from the Azure queue. The following table shows the data transferred for a single post to the queue (the certificate exchange has already occurred). If I post multiple messages using queue.send_message, the entire block repeats.
The message itself is posted as part of line 2 (length 429 bytes, which varies with message size as expected). Then there is a TCP ACK and the response comes back (fixed length, 753 bytes); see https://learn.microsoft.com/en-us/rest/api/storageservices/put-message.
I do not understand the first (619 bytes) or last (88 bytes) packets, which are also of fixed length even when the message size varies. Any idea what these other packets are?
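For reference, the send path being measured looks roughly like the sketch below with the v12 SDK; the connection string and queue name are placeholders. Each send_message call issues one Put Message REST request over the already-established TLS connection.

from azure.storage.queue import QueueClient

# Placeholders: substitute a real connection string and queue name.
queue = QueueClient.from_connection_string(
    conn_str='<storage-account-connection-string>',
    queue_name='my-queue',
)

queue.send_message('{"sensor": 1, "value": 42}')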

Need help debugging kafka latency for high frequency data

I have an IoT device that is streaming data to a Python server every 15 ms. The Python server uploads the data to Kafka and another server consumes it.
I have a partition key for the data based on sensor id. The latency for the first few messages is sub-30 ms, but then it skyrockets to 500 ms before slowly coming down and then repeating (image below).
My assumption here is that the producer is batching the data before sending it. I can't seem to find a setting to turn this off so that my latency is consistent. The issue seems to happen even if I send a blank message.
Here's the code for my producer
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=get_kafka_brokers(),
    security_protocol='SSL',
    ssl_context=get_kafka_ssl_context(),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    batch_size=0,
    acks=1
)
message = {}
producer.send(app.config['NEW_LOG_TOPIC'], message, key=str(device.id).encode('utf-8'))
I have been reading the documentation up and down and have tried several different configurations, but nothing has helped. My servers and Kafka instance are running on Heroku.
Any help is appreciated.
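For what it's worth, client-side batching in kafka-python is governed by both batch_size and linger_ms. A sketch of a producer configured to hand each record to the network layer immediately might look like this (broker address, topic, and payload are placeholders); blocking on the returned future is one way to measure per-record round-trip latency:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',   # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    batch_size=0,    # do not accumulate records into batches
    linger_ms=0,     # send as soon as a record is available (the default)
    acks=1,          # wait only for the partition leader to acknowledge
)

future = producer.send('sensor-readings', {'device': 1, 'value': 42},
                       key=b'device-1')
record_metadata = future.get(timeout=10)  # blocks until the broker acknowledges
print(record_metadata.partition, record_metadata.offset)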

Kafka Consumer not fetching all messages

I am trying to launch a dynamic consumer whenever a new topic is created in Kafka, but the dynamically launched consumer always misses the starting/first message, though it consumes messages from there on. I am using the kafka-python module with the updated KafkaConsumer and KafkaProducer.
Code for Producer is
producer = KafkaProducer(bootstrap_servers='localhost:9092')
record_metadata = producer.send(topic, data)
and code for consumer is
consumer = KafkaConsumer(topic, group_id="abc", bootstrap_servers='localhost:9092', auto_offset_reset='earliest')
Please suggest something to overcome this problem, or any configuration I have to include in my producer and consumer instances.
Can you set auto_offset_reset to earliest?
When a new consumer stream is created, it starts from the latest offset (which is the default value for auto_offset_reset), and you will miss messages that were sent while the consumer wasn't started.
You can read about it in the kafka-python docs. The relevant portion is below:
auto_offset_reset (str) – A policy for resetting offsets on OffsetOutOfRange errors: ‘earliest’ will move to the oldest available message, ‘latest’ will move to the most recent. Any other value will raise the exception. Default: ‘latest’.
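As a minimal sketch of that suggestion (topic and broker are the ones from the question; consumer_timeout_ms is an added assumption so the demo loop exits when idle): with auto_offset_reset='earliest' and a group id that has no committed offsets yet, the consumer starts from the beginning of the topic instead of skipping records produced before it joined.

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    topic,
    group_id='abc',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,   # stop iterating after 10 s with no new records
)

for record in consumer:
    print(record.offset, record.value)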
