PyKafka API usage - Python

I am a newbie to Kafka and PyKafka. I know that a producer and a consumer are created in PyKafka via the code below.
from pykafka import KafkaClient
client = KafkaClient("localhost:9092")
topic = client.topics["topicname"]
producer = topic.get_producer()
consumer = topic.get_simple_consumer()
I want to know what KafkaClient is and how it helps in creating the producer and consumer.
I have read that we can also create a cluster and a broker using client.cluster and client.broker, but I can't understand the use of the client here.

To make terms simpler, replace Kafka with "server".
You interact with servers with clients.
To interact with Kafka, in particular, you send messages to topics via producers, and get messages with consumers.
I don't know this library off-hand, but .broker and .cluster aren't actually "making a Kafka broker / cluster", only establishing a connection to an existing one, from which you can issue later commands.
You need the client. prefix on those calls because the client is a wrapper around both.
To know why it is structured this way, you'd have to ask the developers themselves.

pykafka.KafkaClient is the root object of the PyKafka API, providing an interface to Kafka brokers as well as the ability to instantiate consumer and producer instances. The KafkaClient can be thought of as a representation of the totality of one Python process' interaction with a given Kafka cluster. There is no direct comparison between KafkaClient and any of the concepts mentioned in the official Kafka documentation.
It's totally possible in theory to design a Python Kafka client library that doesn't have a "client" class like KafkaClient. We decided not to, since in our opinion a single root class provides a cleaner, more learnable interface than a bag of various classes.
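For illustration, here is a minimal sketch of a full produce/consume round trip driven by a single KafkaClient, following the pattern from the question; the broker address and topic name are assumptions.

from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")   # one client per process and cluster
topic = client.topics["topicname"]             # topic lookup goes through the client

# Everything below is obtained from the client-owned topic object.
with topic.get_sync_producer() as producer:
    producer.produce(b"hello from pykafka")    # message payloads are bytes

consumer = topic.get_simple_consumer()
message = consumer.consume()                   # blocks until a message is available
if message is not None:
    print(message.offset, message.value)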

Related

avoid duplicate messages from kafka consumer in kafka-python

I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it consumes the same data anyway. Is there a way to make Kafka skip previously seen messages and continue from new messages?
from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')
OK, I finally got your question. Avoiding a message that has been sent multiple times by a producer (accidentally) can be quite complicated.
There are generally 2 cases:
The simple one is where you have a single instance that consumes the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of already-processed messages in an in-memory cache (a minimal sketch follows after this list).
The complicated one is where you have multiple instances that consume messages (which is usually why you need a message broker in the first place - a distributed system). In this scenario you need an external service to play the role of a distributed cache. Redis is a good choice. Alternatively you can use a relational database (which you probably already have in your stack) and record processed message ids there.
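A minimal sketch of the single-instance case with kafka-python; the topic name, broker address, payload field names, and the handle() function are assumptions made for illustration.

import json

from kafka import KafkaConsumer

def consume_without_duplicates():
    consumer = KafkaConsumer('TOPIC',
                             bootstrap_servers=['localhost'],
                             group_id='my-group')
    seen_ids = set()  # in-memory cache; swap in Redis or a DB when running several instances
    for message in consumer:
        record = json.loads(message.value)
        if record['uuid'] in seen_ids:
            continue  # already processed, skip the duplicate
        seen_ids.add(record['uuid'])
        handle(record)  # hypothetical processing function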
Hope that helps.
Someone might need this here. I solved the duplicate message problem using the code below; I am using the Kafka-python lib.
from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')

how to set up kafka-python with Kerberos (and how to set JAAS and krb5)

I want to set up a Kafka consumer (using Python) that connects to a remote Kafka broker, but it requires Kerberos authentication.
So from what I understood, I am required to have jaas.conf and krb5.conf.
The following is my code snippet:
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers=brokers, group_id='group_id', auto_offset_reset='earliest',
                         security_protocol='SASL_PLAINTEXT', sasl_mechanism='GSSAPI',
                         sasl_kerberos_service_name='kafka')
But I am not sure how and where to put the JAAS and krb5 configuration.
I read that I need to set them as
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
-Djava.security.krb5.conf=/etc/kafka/krb5.conf
but if my understanding is correct, those are for the Kafka server (not for a client consumer).
If I do indeed need to set both JAAS and krb5, how should I do it as a consumer?
Because I am not familiar with Kerberos, it seems that I have been taking bits of information from everywhere and reached the wrong conclusion. Any help is much appreciated!
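One note, offered as an assumption rather than a verified answer: jaas.conf is read by the JVM, so a pure-Python client such as kafka-python should not need it; the GSSAPI mechanism goes through the system Kerberos libraries, which read krb5.conf (its location can be overridden with the KRB5_CONFIG environment variable) and expect a valid ticket obtained via kinit or a keytab. A rough sketch along those lines, with paths and broker addresses as placeholders:

import os

from kafka import KafkaConsumer

# Point the system Kerberos libraries at a client-side krb5.conf (path is a placeholder);
# no jaas.conf is set because JAAS is a Java-only mechanism.
os.environ["KRB5_CONFIG"] = "/etc/krb5.conf"

# A valid ticket must already exist, e.g. obtained beforehand with `kinit user@REALM`
# or a keytab, before the consumer is created.
brokers = ["remote-broker:9092"]  # placeholder
consumer = KafkaConsumer(bootstrap_servers=brokers, group_id='group_id',
                         auto_offset_reset='earliest',
                         security_protocol='SASL_PLAINTEXT', sasl_mechanism='GSSAPI',
                         sasl_kerberos_service_name='kafka')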

How to implement request-reply (synchronous) messaging paradigm in Kafka?

I am going to use Kafka as a message broker in my application. This application is written entirely using Python. For a part of this application (Login and Authentication), I need to implement a request-reply messaging system. In other words, the producer needs to get the response of the produced message from the consumer, synchronously.
Is this feasible using Kafka and its Python libraries (kafka-python, ...)?
I'm facing the same issue (request-reply for an HTTP hit in my case).
My first bet was (100% Python):
start a consumer thread,
publish the request message (including a request_id)
join the consumer thread
get the answer from the consumer thread
The consumer thread subscribes to the reply topic (seeked to the end) and processes received messages until it finds the request_id (modulo a timeout); a minimal sketch of this approach follows below.
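A minimal sketch of that thread-based approach with kafka-python; the topic names, broker address, and JSON payload layout are assumptions made for illustration.

import json
import threading
import uuid

from kafka import KafkaConsumer, KafkaProducer

def request_reply(payload, brokers="localhost:9092",
                  request_topic="requests", reply_topic="replies",
                  timeout_s=10.0):
    request_id = str(uuid.uuid4())
    result = {}

    def wait_for_reply():
        # Dedicated consumer on the reply topic, starting from the latest offsets.
        consumer = KafkaConsumer(reply_topic,
                                 bootstrap_servers=brokers,
                                 auto_offset_reset='latest',
                                 consumer_timeout_ms=int(timeout_s * 1000))
        for message in consumer:
            reply = json.loads(message.value)
            if reply.get("request_id") == request_id:
                result["reply"] = reply
                break
        consumer.close()

    listener = threading.Thread(target=wait_for_reply)
    listener.start()

    producer = KafkaProducer(bootstrap_servers=brokers)
    producer.send(request_topic,
                  json.dumps({"request_id": request_id, "payload": payload}).encode())
    producer.flush()

    listener.join(timeout_s)
    return result.get("reply")  # None if the timeout expired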
While it works for basic testing, unfortunately creating a KafkaConsumer object is a slow process (~300 ms), so it's not an option for a system with massive traffic.
In addition, if your system handles parallel request-reply (for example, multi-threaded like a web server is), you'll need a KafkaConsumer dedicated to each request_id (basically by using the request_id as the consumer group) to avoid having the reply to a request published by thread A consumed (and ignored) by thread B.
So you can't recycle your KafkaConsumer here and have to pay the creation time for each request (in addition to the processing time on the backend).
If your request-reply processing is not parallelizable, you can try to keep the KafkaConsumer object available for the threads started to get answers.
The only solution I can see at this point is to use a DB (relational/NoSQL); a rough sketch of the requestor side follows after this list:
the requestor stores the request_id in the DB (as local as possible) and publishes the request to Kafka
the requestor polls the DB until it finds the answer to the request_id
in parallel, a consumer process receives messages from the reply topic and stores the results in the DB
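A rough sketch of the requestor side of that DB-based variant, using sqlite3 as a local stand-in for the shared database; the table name, topic name, and broker address are assumptions.

import json
import sqlite3
import time
import uuid

from kafka import KafkaProducer

def request_via_db(payload, db_path="replies.db", brokers="localhost:9092",
                   request_topic="requests", timeout_s=10.0):
    request_id = str(uuid.uuid4())
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS replies (request_id TEXT PRIMARY KEY, body TEXT)")
    db.commit()

    producer = KafkaProducer(bootstrap_servers=brokers)
    producer.send(request_topic,
                  json.dumps({"request_id": request_id, "payload": payload}).encode())
    producer.flush()

    # Poll the DB until a separate reply-consumer process has stored the answer.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        row = db.execute("SELECT body FROM replies WHERE request_id = ?",
                         (request_id,)).fetchone()
        if row:
            return json.loads(row[0])
        time.sleep(0.1)
    return None  # timed out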
But I don't like polling... it will generate heavy load on the DB in a massive-traffic system.
My 2 cents.

MQTT-like Publish-Subscribe with Python and WebSockets?

I'm working on a project that needs a framework to handle pub/sub connections between a webpage and Python.
I've already used mosquitto (an open-source implementation of MQTT) and it worked, but the server needs a modded Apache module to redirect WebSocket connections to the broker.
Right now, I'm looking at Tornado, but it doesn't fit my requirements. I need a solution for the following:
A web page connects to a Python server or some kind of broker and subscribes to a topic to receive data associated with that topic.
Every time Python has data associated with that topic (let's say every 10 seconds), the data is sent to the specific client (or clients) that subscribed to that topic.
Thanks in advance
You could try the HiveMQ MQTT broker instead of mosquitto, as it has MQTT-over-WebSocket support built in.
http://www.hivemq.com/
Autobahn provides Publish & Subscribe (and RPC) over WebSocket via the WAMP protocol, and comes with clients for JS (among others) and Python/Twisted for the server.
Here is a complete example.
Disclosure: I am the original author of Autobahn and work for Tavendo.
websockify provides a WebSocket-to-TCP proxy that you could put in front of mosquitto. You would have to run it on a port other than 80 if you already have a web server, of course, but it is easier than dealing with custom Apache/lighttpd modules.

kafka consumer in R

I am looking to hack together a kafka consumer in Python or R (preferably R).
Using the kafka console consumer I can grep for a string and retrieve the relevant data but I am at a loss when it comes to parsing it suitably in R.
There are kafka clients available in other languages (for example: PHP, CPP) but one in R would be helpful from a data analytics point of view.
It would be great if the expert R developers on this forum could hint at/suggest resources that would allow me to make headway in this direction.
Apache Kafka : incubator.apache.org/kafka/
Kafka Consumer Client(s) : https://github.com/kafka-dev/kafka/tree/master/clients
[2015 Update] There is a library that allows you to connect to Kafka - rkafka:
http://cran.r-project.org/web/packages/rkafka/rkafka.pdf
As there is a C++ API for Kafka, you could use Rcpp to bring it to R.
Edit in response to a comment on an R-only solution: I do not know Kafka well enough to answer, but generally speaking, middleware runs fast, connecting multiple clients, streams etc. So you would have to simplify something somewhere to get R (single-threaded as it is) to play with it.
