Consume multiple messages at a time - python

I am using an external service ("the Service") to process a particular type of object. The Service works faster if I send objects in batches of 10. My current architecture is as follows: a producer broadcasts objects one by one, and a bunch of consumers pull them (one by one) from a queue and send them to the Service. This is obviously suboptimal.
I don't want to modify the producer code, as it is used in other cases as well. I can modify the consumer code, but only at the cost of additional complexity. I'm also aware of the prefetch_count option, but I think it only works at the network level -- the client library (pika) does not allow fetching multiple messages at once in the consumer callback.
So, can RabbitMQ create batches of messages before sending them to consumers? I'm looking for an option like "consume n messages at a time".

You cannot batch messages in the consumer callback, but you could use a thread-safe library and multiple threads to consume data. The advantage here is that you can fetch five messages on five different threads and combine the data if needed.
As an example, you can take a look at how I would implement this using my AMQP library:
https://github.com/eandersson/amqpstorm/blob/master/examples/scalable_consumer.py
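For illustration, here is a minimal sketch of that threaded, batching consumer using amqpstorm. The connection details, queue name, batch size and the send_to_service() helper are assumptions, not part of the linked example, and a real consumer would also flush incomplete batches on a timer:

import threading
import amqpstorm

BATCH_SIZE = 10

def worker():
    # Each thread gets its own connection and channel.
    connection = amqpstorm.Connection('localhost', 'guest', 'guest')
    channel = connection.channel()
    channel.basic.qos(prefetch_count=BATCH_SIZE)
    batch = []

    def on_message(message):
        batch.append(message)
        if len(batch) >= BATCH_SIZE:
            send_to_service([m.body for m in batch])  # hypothetical helper
            for m in batch:
                m.ack()
            batch.clear()

    channel.basic.consume(on_message, 'my_queue', no_ack=False)
    channel.start_consuming()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(5)]
for t in threads:
    t.start()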

The code below uses channel.consume to start consuming messages and breaks out of the loop once the desired number of messages has been reached.
I have set a batch_size to prevent pulling a huge number of messages at once; you can always change batch_size to fit your needs.
import json
import logging

from pika.exceptions import AMQPConnectionError, ChannelWrongStateError, StreamLostError

logger = logging.getLogger(__name__)

# channel: an open pika BlockingChannel created elsewhere

def consume_messages(queue_name: str):
    msgs = []
    batch_size = 500
    q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
    q_length = q.method.message_count
    if not q_length:
        return msgs
    msgs_limit = batch_size if q_length > batch_size else q_length
    try:
        # Get messages and break out
        for method_frame, properties, body in channel.consume(queue_name):
            # Append the decoded message
            try:
                msgs.append(json.loads(body.decode()))
            except (UnicodeDecodeError, json.JSONDecodeError):
                logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")
            # Acknowledge the message
            channel.basic_ack(method_frame.delivery_tag)
            # Escape out of the loop when the desired number of messages is fetched.
            # Delivery tags start at 1 on a fresh channel, so this counts delivered messages.
            if method_frame.delivery_tag == msgs_limit:
                # Cancel the consumer and return any pending messages
                requeued_messages = channel.cancel()
                print('Requeued %i messages' % requeued_messages)
                break
    except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
        logger.info(f'Connection Interrupted: {str(e)}')
    finally:
        # Close the channel and the connection
        channel.stop_consuming()
        channel.close()
    return msgs

Related

Sending object metadata causes FPS drops in the stream

I want to send some object metadata (class_id, confidence value, etc.) to another PC when an object is detected, but it causes FPS drops and the stream freezes. Which parallel programming technique should I use to solve this? Can you give me an example?
Checking if the detected object is in class_dict:
if obj_meta.class_id in class_dict:
    send_one(obj_meta.class_id)
I am using this function to send the class_id message.
from __future__ import print_function

import can

def send_one(class_id):
    bus = can.interface.Bus(bustype='socketcan', channel='vcan0', bitrate=250000)
    msg = can.Message(arbitration_id=0xc0ffee,
                      data=[class_id],
                      is_extended_id=True)
    try:
        bus.send(msg)
        print("Message sent on {}".format(bus.channel_info))
    except can.CanError:
        print("Message NOT sent")
I am not sure what your use case is, but I would recommend having a look at the msgbroker plugin (DeepStream) for message passing between applications.
A little bit more code would help, but I'm assuming you are (were?) doing the check inside a GStreamer buffer probe. A buffer probe blocks the buffer downstream, so no new buffers keep coming until you've disposed of it.
A: using an external service: use the msgbroker element to produce messages and inject them into an alternative service (e.g. RabbitMQ, Kafka). See the reference implementation here. Then use a service-specific consumer to process the data (and call your send_one).
B: from Python: you should extract the metadata as quickly as possible and then process it outside the probe.
from queue import Queue, Empty
from threading import Thread

q = Queue()

...

# in buffer probe:
if obj_meta.class_id in class_dict:
    q.put(obj_meta.class_id)

...

def consume():
    while True:
        try:
            data = q.get(block=True, timeout=1)
        except Empty:
            continue
        ...  # process data here

consumer = Thread(target=consume)
consumer.start()
You could improve on this, e.g. by reading in batches, running multiple consumer threads, etc.
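For instance, here is one possible batched variant of consume(), using the same q object as above; BATCH_SIZE and the flush() helper are illustrative only:

BATCH_SIZE = 32

def consume_batched():
    while True:
        batch = []
        try:
            # Block for the first item, then drain whatever else is already queued.
            batch.append(q.get(block=True, timeout=1))
            while len(batch) < BATCH_SIZE:
                batch.append(q.get(block=False))
        except Empty:
            pass
        if batch:
            flush(batch)  # hypothetical: send the collected class_ids over CAN here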

Task delegation in Python/Redis

I'm having trouble coming up with an architecture that will solve the following problem:
I have a web application (producer) that receives some data on request. I also have a number of processes (consumers) that should process this data. One request generates one batch of data, which should be processed by exactly one consumer.
My current solution consists of receiving the data, caching it in memory with Redis, and sending a message through a message channel that data has been written, while the consumers listen on the same channel and then process the data. The issue is that I need to stop multiple consumers from working on the same data. So how can I inform the other consumers that I have started working on this task?
Producer code (Flask endpoint):
data = request.get_json()
db = redis.Redis(connection_pool=pool)
db.set(data["externalId"], data)
# Subscribe to the batches channel and publish the id
db.pubsub()
db.publish('batches', request_key)
results = None
result_key = str(data["externalId"])
# Wait till the batch is processed
while results is None:
    results = db.get(result_key)
if results is not None:
    results = results.decode('utf8')
db.delete(data["externalId"])
db.delete(result_key)
Consumer:
db = redis.Redis(connection_pool=pool)
channel = db.pubsub()
channel.subscribe('batches')
while True:
    try:
        message = channel.get_message()
        message_data = bytes(message['data']).decode('utf8')
        external_id = message_data.split('-')[-1]
        data = json.loads(db.get(external_id).decode('utf8'))
        result = DataProcessor.process(data)
        db.set(str(external_id), result)
    except Exception:
        pass
PUBSUB is often problematic for task queuing for exactly this reason. From the docs (https://redis.io/topics/pubsub):
SUBSCRIBE, UNSUBSCRIBE and PUBLISH implement the Publish/Subscribe messaging paradigm where (citing Wikipedia) senders (publishers) are not programmed to send their messages to specific receivers (subscribers). Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be.
A popular alternative to consider would be to implement "publish" by pushing an element to the end of a Redis list, and "subscribe" by having your worker poll that list at some interval (exponential backoff is often an appropriate choice). In order to avoid cases where multiple workers get the same job, use lpop to get and remove an element from the list. Redis is single-threaded, so you're guaranteed only one worker will receive each element.
So, on the publish side, aim for something like this:
db = redis.Redis(connection_pool=pool)
db.rpush("my_queue", task_payload)
And on the subscribe side, you can safely run a loop like this in parallel as many times as you need:
import time

db = redis.Redis(connection_pool=pool)
while True:
    payload = db.lpop("my_queue")
    if not payload:
        time.sleep(1)  # poll interval; back off as needed
        continue
    # < deserialize and process payload here >
Note this is a first-in-first-out (FIFO) queue, since we push onto the right side with rpush and pop off the left with lpop. If you want last-in-first-out (LIFO) behaviour instead, push and pop from the same side by combining lpush/lpop (or rpush/rpop).
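If you would rather not poll at all, Redis also provides the blocking BLPOP command; a minimal sketch of the same consumer loop with it (queue name as above):

db = redis.Redis(connection_pool=pool)
while True:
    item = db.blpop("my_queue", timeout=5)  # blocks up to 5 s, returns None on timeout
    if item is None:
        continue
    _key, payload = item  # blpop returns a (queue_name, value) pair
    # < deserialize and process payload here >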

RabbitMQ + kombu: write/read to one-time use queues with random names

I am new to working with message exchanges and could not find a proper guide for this task.
I need to organize a pool of queues so that:
A producer creates some random empty queue and writes a whole pack of messages to it (usually 100 messages).
A consumer finds a non-empty, non-locked queue, reads from it until it is empty, then deletes it and looks for the next one.
So my task is to work with messages as packs. I understand how to produce and consume using the same key in one queue, but I can't find how to work with a pool of queues.
We can have several producers and consumers running in parallel, and it does not matter which of them sends to which. We don't need to, and in fact can't, link a particular producer with a particular consumer.
General task: we have a lot of clients that receive push notifications. We group the pushes by some parameters to process them later as a group, so each such group should live in one RabbitMQ queue and be produced and consumed as a group, but each group is independent of the others.
Big thanks to Hannu for the help. The key idea of his simple and robust solution is that we can have one persistent queue with a known name, into which the producer writes the names of the queues it creates and from which the consumer reads those names.
To make his solution more readable and easier to work with for my own task, I divided publish_data() in the producer into two functions: one creates the random queue and writes its name to control_queue, the other receives this random queue and fills it with messages (a rough sketch of that split is below). A similar idea works for the consumer: one function processes a queue, another is called to process each message.
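A rough sketch of that split, assuming the same kombu SimpleQueue API used in the answer below; the function names are mine, and the name is only announced on control_queue after the pack is written, so a consumer never sees a half-filled queue:

import uuid

def make_random_queue(conn):
    # Create a randomly named queue for one pack of messages.
    random_name = "q" + uuid.uuid4().hex
    return conn.SimpleQueue(random_name), random_name

def fill_and_announce(conn, messages):
    # Fill the random queue, then publish its name on the control queue.
    random_queue, random_name = make_random_queue(conn)
    for message in messages:
        random_queue.put(message)
    random_queue.close()
    conn.SimpleQueue('control_queue').put(random_name)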
I have done something like this, but with Pika. I had to clean up and "kombufy" an old code snippet for these examples. It is probably not very kombu-ish (this is my absolute first code snippet written using it), but this is how I would solve it. Basically, I would set up a control queue with a known name.
Publishers create a random queue name for a pack of messages, dump N messages to it (in my case the numbers 1-42) and then post the queue name to the control queue. A consumer then receives this queue name, binds to it, reads messages until the queue is empty and then deletes the queue.
This keeps things relatively simple, as publishers do not need to figure out where they are allowed to publish their groups of data (every queue is new, with a random name). Receivers do not need to worry about timeouts or "all done" messages, as a receiver only receives a queue name once a group of data has been written to the queue and every message is there waiting.
There is also no need to tinker with locks or signalling or anything else that would complicate things. You can have as many consumers and producers as you want. And of course, using exchanges and routing keys, there could be different sets of consumers for different tasks, etc.
Publisher
from kombu import Connection
import uuid
from time import sleep

def publish_data(conn):
    random_name = "q" + str(uuid.uuid4()).replace("-", "")
    random_queue = conn.SimpleQueue(random_name)
    for i in range(0, 42):
        random_queue.put(i)
    random_queue.close()
    return random_name

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    control_queue = conn.SimpleQueue('control_queue')
    _a = 0
    while True:
        y_name = publish_data(conn)
        message = y_name
        control_queue.put(message)
        print('Sent: {0}'.format(message))
        _a += 1
        sleep(0.3)
        if _a > 20:
            break
    control_queue.close()
Consumer
from queue import Empty

from kombu import Connection, Queue

def process_msg(foo):
    print(str(foo))
    with Connection("amqp://guest:guest@localhost:5672//") as _conn:
        sub_queue = _conn.SimpleQueue(str(foo))
        while True:
            try:
                _msg = sub_queue.get(block=False)
                print(_msg.payload)
                _msg.ack()
            except Empty:
                break
        sub_queue.close()
        chan = _conn.channel()
        dq = Queue(name=str(foo), exchange="")
        bdq = dq(chan)
        bdq.delete()

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    rec = conn.SimpleQueue('control_queue')
    while True:
        msg = rec.get(block=True)
        entry = msg.payload
        msg.ack()
        process_msg(entry)

Python script with multiple threads works normally only in debug mode

I am currently working on a Python 2.7 script with multiple threads. One of the threads listens for JSON data in long-polling mode and parses it after receiving it, or times out after some period. I noticed that it works as expected only in debug mode (I use Wing IDE). In a normal run, this particular thread seems to hang after the first GET request, before entering the for loop. The loop condition doesn't affect the result. At the same time, the other threads continue to work normally.
I believe this is related to multi-threading. How do I properly troubleshoot and fix this issue?
Below is the code of the class responsible for the long-polling job.
class Listener(threading.Thread):
    def __init__(self, router, *args, **kwargs):
        self.stop = False
        self._cid = kwargs.pop("cid", None)
        self._auth = kwargs.pop("auth", None)
        self._router = router
        self._c = webclient.AAHWebClient()
        threading.Thread.__init__(self, *args, **kwargs)

    def run(self):
        while True:
            try:
                # Data items that should be routed to the device is retrieved by doing a
                # long polling GET request on the "/tunnel" resource. This will block until
                # there are data items available, or the request times out
                log.info("LISTENER: Waiting for data...")
                response = self._c.send_request("GET", self._cid, auth=self._auth)

                # A timed out request will not contain any data
                if len(response) == 0:
                    log.info("LISTENER: No data this time")
                else:
                    items = response["resources"]["tunnel"]
                    undeliverable = []
                    # print items  # - reaching this point, able to return output
                    for item in items:
                        # The data items contains the data as a base64 encoded string and the
                        # external reference ID for the device that should receive it
                        extId = item["extId"]
                        data = base64.b64decode(item["data"])
                        # Try to deliver the data to the device identified by "extId"
                        if not self._router.route(extId, data):
                            item["message"] = "Could not be routed"
                            undeliverable.append(item)
                    # Data items that for some reason could not be delivered to the device should
                    # be POST:ed back to the "/tunnel" resource as "undeliverable"
                    if len(undeliverable) > 0:
                        log.warning("LISTENER: Sending error report...")
                        response = self._c.send_request("POST", "/tunnel", body={"undeliverable": undeliverable}, auth=self._auth)
            except webclient.RequestError as e:
                log.error("LISTENER: ERROR %d - %s", e.status, e.response)
UPD:
class Router:
    def route(self, extId, data):
        log.info("ROUTER: Received data for %s: %s", extId, repr(data))
        # nothing special
        return True
If you're using the CPython interpreter, you're not actually getting true parallel execution from threads:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
So your process is probably blocking while listening on the first request, because you are long polling.
Multiprocessing might be a better choice; a minimal sketch follows. I haven't tried it with long polling, but the Twisted framework might also work in your situation.
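A minimal sketch, assuming the long-polling call can live in its own process and hand results back over a queue; poll_tunnel() is a hypothetical stand-in for the blocking GET, not part of the original code:

import multiprocessing as mp

def listen(out_queue):
    while True:
        response = poll_tunnel()  # hypothetical stand-in for the long-polling GET
        if response:
            out_queue.put(response)  # hand the items back to the main process

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=listen, args=(q,), daemon=True)
    p.start()
    # the main process reads from q and routes items as they arrive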

RPC calls to multiple consumers

I have a consumer that listens for messages; if the flow of messages is more than one consumer can handle, I want to start another instance of this consumer.
But I also want to be able to poll the consumer(s) for information. My thought was that I could use RPC to request this information from them by using a fanout exchange, so that they all get the RPC call.
My question is, first of all, is this possible, and secondly, is it reasonable?
If the question is "is it possible to send an RPC message to more than one server?", the answer is yes.
When you build an RPC call you attach a temporary queue to the message (usually in header.reply_to, but you could also use internal message fields). This is the queue where the RPC targets will publish their answers.
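For the sending side, a minimal pika sketch of attaching that temporary reply queue to the fanned-out request; the exchange name and message body are illustrative only:

import uuid
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='rpc_fanout', exchange_type='fanout')

# Server-named, exclusive queue where every RPC target will publish its answer.
reply_queue = channel.queue_declare(queue='', exclusive=True).method.queue

channel.basic_publish(
    exchange='rpc_fanout',
    routing_key='',
    body='{"command": "status"}',
    properties=pika.BasicProperties(
        reply_to=reply_queue,
        correlation_id=str(uuid.uuid4()),
    ),
)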
When you send an RPC this way you can receive more than one message on the temporary queue: this means that an RPC answer could be formed by:
a single message from a single source
more than one message from a single source
more than one message from several sources
The problems arising in this scenario are:
when do you stop listening? If you know the number of RPC servers you can wait until each of them has sent you an answer; otherwise you have to implement some form of timeout
do you need to track the source of each answer? You can add some special fields to your message to keep this information. The same goes for message ordering.
Here is some code to show how you can do it (Python with the Pika library). Be aware that this is far from perfect: the biggest problem is that you should reset the timeout when you get a new answer.
def consume_rpc(self, queue, result_len=1, callback=None, timeout=None, raise_timeout=False):
    if timeout is None:
        timeout = self.rpc_timeout
    result_list = []

    def _callback(channel, method, header, body):
        print("### Got 1/%s RPC result" % result_len)
        msg = self.encoder.decode(body)
        result_dict = {}
        result_dict.update(msg['content']['data'])
        result_list.append(result_dict)
        if callback is not None:
            callback(msg)
        if len(result_list) == result_len:
            print("### All results are here: stopping RPC")
            channel.stop_consuming()

    def _outoftime():
        self.channel.stop_consuming()
        raise TimeoutError

    if timeout != -1:
        print("### Setting timeout %s seconds" % timeout)
        # pika < 1.0 API; in pika >= 1.0 this is connection.call_later()
        self.conn_broker.add_timeout(timeout, _outoftime)

    # pika < 1.0 API; in pika >= 1.0 the callback is passed as on_message_callback
    self.channel.basic_consume(_callback, queue=queue, consumer_tag=queue)
    if raise_timeout is True:
        print("### Start consuming RPC with raise_timeout")
        self.channel.start_consuming()
    else:
        try:
            print("### Start consuming RPC without raise_timeout")
            self.channel.start_consuming()
        except TimeoutError:
            pass
    return result_list
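As a usage sketch (the rpc_client object and reply_queue name are assumptions, not part of the code above): if the request was fanned out to three workers, a call like this would collect all three answers or give up after five seconds:

results = rpc_client.consume_rpc(reply_queue, result_len=3, timeout=5)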
After some research it seems that this is not possible. If you look at the tutorial on RabbitMQ.com, you see that there is an id for the call which, as far as I understand, gets consumed.
I've chosen to go another way instead, which is reading the log files and aggregating the data.
