Confluent Kafka poll, when does a message get committed - python

I have a Python application that has autocommit=True and is using poll() to fetch messages with an interval of 1 second. I was reading the documentation, and it mentions that polling reads messages in a background thread and queues them so that the main thread can take them afterwards. I was a bit confused about what happens if I have multiple messages queued and my consumer crashes. Would those messages queued by the background thread have been committed already and hence get lost?

As mentioned in the docs, every auto.commit.interval.ms any offsets returned by poll() will be committed.
If you are concerned about losing data, you should always disable auto-commit, in any Kafka client, and handle commits yourself once you know you've actually processed those records.
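A minimal sketch of that manual-commit pattern with confluent-kafka, assuming placeholder broker, topic, and group names and a hypothetical handle() processing step:

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'group.id': 'my-group',                  # placeholder group
    'enable.auto.commit': False,             # take control of commits
})
consumer.subscribe(['my-topic'])             # placeholder topic

try:
    while True:
        msg = consumer.poll(1.0)             # the 1-second interval from the question
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        handle(msg.value())                  # hypothetical processing step
        consumer.commit(message=msg, asynchronous=False)  # commit only after processing
finally:
    consumer.close()

This way, a crash between poll() and commit() means the offsets were never committed, so the messages are redelivered on restart instead of being lost.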

Related

Data race between different processes using the same database

I have a system which includes MySQL as database and RabbitMQ for organizing asynchronous data processing.
There are two processes (in two different containers) that work with the same record. The first one updates the record's status in a db transaction and sends a message to a rabbit queue. The second process fetches the record from the db and does some job. The problem is that the second process can read the message from the queue before the first process completes the update of the record.
Currently, in order to avoid this problem, the second process checks the status of the record; if it does not correspond to the target value, the process waits for the update by re-sending the message to the same queue (sketched below).
This behavior occurs because sending to the queue is performed within the transaction context. If I move the send outside the transaction, it is possible that an error will occur, or the process will be interrupted, after the db transaction has completed: the status in the database will change, but the message will never be sent to the queue, and the second process will never process this record.
What can you suggest to solve this architectural problem?
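For reference, a minimal sketch of the re-queue workaround described above, using Pika; fetch_record(), process(), the 'records' queue, and the 'READY' status are all hypothetical names:

def on_message(ch, method, properties, body):
    record = fetch_record(body.decode())      # hypothetical: read current status from MySQL
    if record.status != 'READY':
        # the first process has not committed yet: re-send and retry later
        ch.basic_publish(exchange='', routing_key='records', body=body)
    else:
        process(record)                       # safe to do the actual job now
    ch.basic_ack(delivery_tag=method.delivery_tag)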

RabbitMQ: Consuming only one message at a time from multiple queues

I'm trying to stay connected to multiple queues in RabbitMQ. Each time I pop a new message from one of these queues, I'd like to spawn an external process.
This process will take some time to process the message, and I don't want to start processing another message from that specific queue until the one I popped earlier is completed. If possible, I wouldn't want to keep a process/thread around just to wait on the external process to complete and ack the server. Ideally, I would like to ack in this external process, maybe passing some identifier so that it can connect to RabbitMQ and ack the message.
Is it possible to design this system with RabbitMQ? I'm using Python and Pika, if this is relevant to the answer.
Thanks!
RabbitMQ can do this.
You only want to read from a queue when you're ready, so spin up a thread that spawns the external process and watches it, then fetches the next message from the queue when the process is done. You can then have multiple threads running in parallel to manage multiple queues; see the sketch below.
I'm not sure what you want the ack for. Are you trying to stop RabbitMQ from adding new elements to that queue if it gets too full (because its elements are being processed too slowly, or not at all)? There might be a way to do this when you add messages to the queues: before adding an item, check that the number of messages already in that queue is not "much greater than" the average across all queues.
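A minimal sketch of the thread-per-queue approach with Pika's BlockingConnection; the queue names and the external command are placeholders, and prefetch_count=1 ensures the broker never hands a consumer a second message before the first is acked:

import subprocess
import threading
import pika

def consume(queue_name):
    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.basic_qos(prefetch_count=1)       # at most one unacked message at a time

    def on_message(ch, method, properties, body):
        # block this thread (not the others) until the external process exits
        subprocess.run(['process_message', body.decode()], check=True)  # placeholder command
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()

for name in ['queue_a', 'queue_b']:           # placeholder queue names
    threading.Thread(target=consume, args=(name,), daemon=True).start()

One caveat: if the external process runs longer than the broker's heartbeat timeout, a BlockingConnection that sits in the callback can miss heartbeats, so you may need to lengthen or disable them for these connections.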

Delay message consumption with SelectConnection

I want to write a consumer with a SelectConnection.
We have several devices in our network infrastructure that close connections after a certain time, therefore I want to use the heartbeat functionality.
As far as I know, the IOLoop runs on the main thread, so heartbeat frames cannot be processed while this thread is processing a message.
My idea is to create several worker threads that process the messages so that the main thread can handle the IOLoop. The processing of a message takes a lot of resources, so only a certain amount of the messages should be processed at once. Instead of storing the remaining messages on the client side, I would like to leave them in the queue.
Is there a way to interrupt the consumption of messages, without interrupting the heartbeat?
I am not an expert on SelectConnection for pika, but you could implement this by setting the consumer prefetch (QoS) to the desired number of worker processes.
This basically means that once a message comes in, you offload it to a process or thread; once the message has been processed, you acknowledge it.
As an example, if you set the QoS to 10, the client will pull at most 10 messages and won't pull any new ones until at least one of those has been acknowledged.
The important part here is that you would need to acknowledge messages only once you are finished processing them.
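A minimal sketch of that pattern with pika's SelectConnection, assuming a pre-existing 'tasks' queue and a placeholder work() body; acks are marshalled back onto the IOLoop thread with add_callback_threadsafe so heartbeats keep flowing:

import functools
import threading
import pika

def work(connection, channel, delivery_tag, body):
    ...                                       # expensive processing happens here
    # acks must be issued from the connection's IOLoop thread
    cb = functools.partial(channel.basic_ack, delivery_tag=delivery_tag)
    connection.add_callback_threadsafe(cb)

def on_message(channel, method, properties, body, connection=None):
    threading.Thread(
        target=work,
        args=(connection, channel, method.delivery_tag, body)).start()

def on_channel_open(connection, channel):
    channel.basic_qos(prefetch_count=10)      # cap in-flight messages at 10
    channel.basic_consume(
        queue='tasks',                        # placeholder queue name
        on_message_callback=functools.partial(on_message, connection=connection))

def on_open(connection):
    connection.channel(
        on_open_callback=functools.partial(on_channel_open, connection))

params = pika.ConnectionParameters('localhost', heartbeat=30)
conn = pika.SelectConnection(params, on_open_callback=on_open)
conn.ioloop.start()

The remaining messages stay in the broker's queue, as wanted; the client only ever holds prefetch_count of them at once.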

Python: How to combine a process poll and a non-blocking WebSocket server?

I have an idea: write a WebSocket-based RPC that processes messages according to the scenario below.
Client connects to a WS (web socket) server
Client sends a message to the WS server
WS server puts the message into the incoming queue (can be a multiprocessing.Queue or RabbitMQ queue)
One of the workers in the process pool picks up the message for processing
Message is being processed (can be blazingly fast or extremely slow - it is irrelevant for the WS server)
After the message is processed, the results of the processing are pushed to the outgoing queue
WS server pops the result from the queue and sends it to the client
NOTE: the key point is that the WS server should be non-blocking and responsible only for:
connection acceptance
getting messages from the client and putting them into the incoming queue
popping messages from the outgoing queue and sending them back to the client
NOTE2: it might be a good idea to store client identifier somehow and pass it around with the message from the client
NOTE3: it is completely fine that, because of queueing the messages back and forth, simple message processing (e.g. get a message as input and push it straight back as the result) will become slower. The goal is to be able to run processor-expensive operations (rough non-practical example: several nested "for" loops) in the pool with the same code style as handling fast messages, i.e. pop a message from the input queue together with some sort of client identifier, process it (which might take a while) and push the processing results together with the client ID to the output queue.
Questions:
In TornadoWeb, if I have a queue (multiprocessing or Rabbit), how can I make Tornado's IOLoop trigger some callback whenever there is a new item in that queue? Can you point me to some existing implementation if there is any?
Is there any ready implementation of such a design? (Not necessarily with Tornado)
Maybe I should use another language (not python) to implement such a design?
Acknowledgments:
Recommendations to use REST and WSGI for whatever goal I aim to achieve are not welcome
Comments like "Here is a link to the code that I found by googling for 2 seconds. It has some imports from tornado and multiprocessing. I am not sure what it does, however I am 99% certain that it is exactly what you need" are not welcome either
Recommendations to use asynchronous libraries instead of normal blocking ones are ... :)
Tornado's IOLoop lets you handle events from any file object via its file descriptor, so you could try this:
connect to each of your worker processes through a multiprocessing.Pipe
call add_handler for each pipe's parent end (using the connection's fileno())
make the workers write some random garbage each time they put something into the output queue, no matter whether that's a multiprocessing.Queue or any MQ
handle the answers from the workers in the event handlers, as in the sketch below
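A minimal sketch of that wiring with a single worker; the in/out queues, the heavy process() step, and the WebSocket write-back are placeholders:

import multiprocessing
from tornado.ioloop import IOLoop

def worker(child_conn, in_q, out_q):
    for msg in iter(in_q.get, None):
        out_q.put(process(msg))               # hypothetical heavy processing
        child_conn.send(b'x')                 # the "random garbage" wake-up token

parent_conn, child_conn = multiprocessing.Pipe()
in_q, out_q = multiprocessing.Queue(), multiprocessing.Queue()
multiprocessing.Process(target=worker, args=(child_conn, in_q, out_q)).start()

def on_result(fd, events):
    parent_conn.recv()                        # consume the wake-up token
    result = out_q.get_nowait()               # fetch the actual payload
    # ... look up the client by its identifier and write to its WebSocket ...

loop = IOLoop.current()
loop.add_handler(parent_conn.fileno(), on_result, IOLoop.READ)
loop.start()                                  # in practice your WS server already runs this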

Producing and consuming messages with no declared queue

Where are the messages stored (in rabbit) when you produce a message and send it without declaring a queue or mentioning it in basic_publish? The code I have to work with looks something like this:
... bunch of setup code (no queue declaring tho)...
channel.exchange_declare(exchange='exch_name', exchange_type='direct')
channel.basic_publish(exchange='exch_name', routing_key='rkey', body='message')
conn.close()
I've looked through the web to the best of my abilities but haven't found an answer to this. I have a hunch that rabbit creates a queue for as long as this message isn't consumed, and my worry is that this would be quite heavy for rabbit if it has to declare this queue and then destroy it several (thousand!?) times per minute/hour.
When you publish you (usually) publish to an exchange, as you are doing. The exchange decides what to do with that message. If there is nothing to do with the message it is discarded. If there is something to do with the message then it is routed accordingly.
In your original code snippet, where no queue is declared, the message will be discarded.
As you say in your comment, there was a queue created by the producer. There are options here that you haven't stated, so I will try to run through the possibilities.
Usually you would declare the queue in the consumer. However, if you wish to make sure that the consumer sees all the messages, then the queue must be created by the producer and bound to the exchange by the producer, to ensure that every message ends up in this queue. Then, when the queue is consumed by the consumer, it will see all the messages.
Alternatively, you can create the queue externally from the code, as a non-autodelete queue and possibly as a durable queue (this will keep the queue even if you restart RabbitMQ), using the command line or the management GUI. You will still need a declaration in the producer for the exchange in order to send, and a declaration in the consumer to receive, but the exchange and queue will already exist and you will simply be connecting to them.
Queues and exchanges are not "persistent"; they are either durable or not, which means they will still exist after restarting RabbitMQ. Queues can also be autodelete, so that when their consumer disconnects they no longer exist.
Messages can be persistent, so that if you send a message to an exchange, it is routed to a queue, and RabbitMQ is restarted before the message is read, the message will still be there upon restart. Even a persistent message will be lost if the queue is not durable, or if the message is never routed to a queue in the first place.
Make sure that after you create a queue you bind the queue properly to the exchange using the same key that you are using as the routing key for your messages.
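A minimal sketch of a producer that does all of this with Pika, assuming placeholder exchange, queue, and routing-key names:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()

channel.exchange_declare(exchange='exch_name', exchange_type='direct')
channel.queue_declare(queue='work', durable=True)         # survives a broker restart
channel.queue_bind(queue='work', exchange='exch_name', routing_key='rkey')

channel.basic_publish(
    exchange='exch_name',
    routing_key='rkey',                                   # must match the binding key
    body='message',
    properties=pika.BasicProperties(delivery_mode=2),     # persistent message
)
conn.close()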
