I'm using RabbitMQ as a queue for different kinds of messages. When I consume these messages with two different consumers from one queue, I process them and insert the processing results into a DB:
def consumer_callback(self, channel, delivery_tag, properties, message):
    result = make_some_processing(message)
    insert_to_db(result)
    channel.basic_ack(delivery_tag)
I want to consume messages from the queue in bulk, which will reduce DB load. Since RabbitMQ does not support bulk reading of messages by consumers, I'm going to do something like this:
some_messages_list = []

def consumer_callback(self, channel, delivery_tag, properties, message):
    some_messages_list.append((delivery_tag, message))
    if len(some_messages_list) > 1000:
        results_list = make_some_processing_bulk(some_messages_list)
        insert_to_db_bulk(results_list)
        for tag, _ in some_messages_list:
            channel.basic_ack(tag)
        some_messages_list.clear()
Messages stay in the queue until they are all fully processed.
If the consumer crashes or disconnects, the messages stay safe.
What do you think about that solution?
If it's okay, how can I get all the unacknowledged messages again if the consumer crashes?
I've tested this solution for several months and can say that it works pretty well. Until AMQP provides a feature for bulk consuming, we have to use workarounds like this.
Note: if you decide to use this solution, beware of concurrent consuming with several consumers (threads), or use some locks (I've used Python's threading.Lock module) to guarantee that no race conditions happen.
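If it helps, here is a minimal sketch of that locking idea. It assumes the same hypothetical helpers as in the question, and that every delivery tag in a batch belongs to the channel passed into the callback (delivery tags are only valid on their own channel):

import threading

BULK_SIZE = 1000
buffer_lock = threading.Lock()
some_messages_list = []

def consumer_callback(self, channel, delivery_tag, properties, message):
    with buffer_lock:  # serialize access to the shared buffer
        some_messages_list.append((delivery_tag, message))
        if len(some_messages_list) < BULK_SIZE:
            return
        batch = list(some_messages_list)
        some_messages_list.clear()
    # Process outside the lock so other threads can keep buffering.
    results_list = make_some_processing_bulk(batch)
    insert_to_db_bulk(results_list)
    for tag, _ in batch:
        channel.basic_ack(tag)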
I want to make a Flask application/API with gunicorn that on every request:
reads a single value from a Kafka topic
does some processing
and returns the processed value to the user (or any application calling the API).
So far I couldn't find any examples of this. Is the following function the correct way of doing it?
consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers='xxxxxx',
    auto_offset_reset='xxxx',
    group_id="my_group")

def get_value_from_topic():
    for msg in consumer:
        return msg

if __name__ == "__main__":
    print(get_value_from_topic())
Or is there any better way of doing this using any library like Faust?
My reason for using Kafka is to avoid all the hassle of synchronization among the Flask workers (as with a traditional database), because I want each value from Kafka to be used only once.
This seems okay at first glance. Your consumer iterator is iterated once, and you return that value.
However, the more idiomatic way to do that would be:
def get_value_from_topic():
    return next(consumer)
With your other settings, though, there's no guarantee this polls only one message, because Kafka consumers poll in batches and will auto-commit those batches of offsets. Therefore, you'll want to disable auto-commit and handle offsets on your own: committing after handling the HTTP request gives you at-least-once delivery, and committing before gives you at-most-once. Since you're interacting with an HTTP server, Kafka can't give you exactly-once processing.
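For example, a minimal sketch of the at-least-once variant with kafka-python; process() is a hypothetical stand-in for your request handling, and the bootstrap servers are the same placeholder as above:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers='xxxxxx',
    enable_auto_commit=False,  # take over offset management
    group_id="my_group")

def get_value_from_topic():
    msg = next(consumer)
    process(msg)       # hypothetical: handle the HTTP request
    consumer.commit()  # commit after processing -> at-least-once
    return msg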
I am listening to financial data published by Google Cloud Platform, Monday to Friday. I would like to save all messages to disk. I am doing this in Python.
I need to recover any missing packets if my application goes down. I understand Google will automatically resend un-ack'd messages.
The GCP documentation lists many subscription techniques (asynchronous/synchronous, push/pull, streaming pull, etc.). Here is their asynchronous sample code:
def callback(message):
    print(f"Received {message}.")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}..\n")

# Wrap subscriber in a 'with' block to automatically call close() when done.
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result(timeout=5)
    except TimeoutError:
        streaming_pull_future.cancel()
https://cloud.google.com/pubsub/docs/pull
Is the callback thread-safe, i.e. is there only one thread calling back?
What is the best way to ignore already-received messages? Does the client need to maintain a map?
UPDATE for Kamal Aboul-Hosn
I think I can persist OK, but my problem is that I need to manually check that all messages have indeed been received. To do this I enabled ordered delivery. Our message data contains a sequence number, so I wanted to add a global variable like next_expected_seq_num. After I receive each message I will process it, ack the message, and increment next_expected_seq_num.
However, if I have, say, 10 threads invoking the callback method, I assume any of the 10 could receive the next message? And I'd have to make my callback method smart enough to block processing on the other 9 threads while the 10th thread processes the next message. Something like:
(pseudo code)
def callback(msg):
    seq_num = getSeqNum(msg.data)
    while seq_num != next_expected_seq_num:  # make atomic
        pass  # spin until it is this message's turn
    # When we reach here, we have the next message
    assert not db.exists(seq_num)
    # persist message
    next_expected_seq_num += 1  # make atomic / cannot happen earlier
    msg.ack()
Should I just disable multiple callback threads, given that I'm preventing multithreading anyway?
Is there a better way to check/guarantee we process every message?
I'm wondering if we should trust GCP like TCP, enable multithreading (and just lock around the database-write)?
def callback(msg):
    seq_num = getSeqNum(msg.data)
    with lock:  # e.g. a global threading.Lock()
        if not db.exists(seq_num):
            # persist message
            ...
    msg.ack()
The callback is not thread-safe if you are running in a Python environment that doesn't have a global interpreter lock. Multiple callbacks could be executed in parallel in that case, and you would have to guard any shared data structures with locks.
Since Cloud Pub/Sub has at-least-once delivery semantics, if you need to ignore duplicate messages then yes, you will need to maintain some kind of data structure with the already-received messages. Note that duplicates could be delivered across subscriber restarts. Therefore, you will probably need this to be some kind of persistent storage. Redis tends to be a popular choice for this type of deduplication.
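For illustration, a minimal sketch of that deduplication with the redis-py client; the key scheme and TTL are assumptions, and getSeqNum/persist are the hypothetical helpers from your pseudocode, not part of the Pub/Sub API:

import redis

r = redis.Redis()

def is_first_delivery(seq_num):
    # SET with nx=True is atomic: it returns True only for the first
    # writer of this key. The TTL bounds memory use; it must exceed
    # the redelivery window you expect.
    return r.set(f"seen:{seq_num}", 1, nx=True, ex=7 * 24 * 3600)

def callback(msg):
    if is_first_delivery(getSeqNum(msg.data)):
        persist(msg)  # hypothetical write to your datastore
    msg.ack()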
With ordered delivery, it is guaranteed that the callback will only run for one message for an ordering key at a time. Therefore, you would not have to program expecting multiple messages to be running simultaneously for the key. Note that in general, using ordering keys to totally order all messages in the topic will only work if your throughput is no more than 1MB/s as that is the publish limit for messages with ordering keys. Also, only use ordering keys if it is important to process the messages in order.
With regard to when to use multithreading or not, it really depends on the nature of the processing. If most of the callback would need to be guarded with a lock, then multithreading won't help much. If, though, only small portions need to be guarded by locks, e.g., checking for duplicates, while most of the processing can safely be done in parallel, then multithreading could result in better performance.
If all you want to do is prevent duplicates, then you probably don't need to guard the writes to the database with a lock unless the database doesn't guarantee consistency. Also, keep in mind that the locking only helps if you have a single subscriber client.
I have an architecture that works like this: as soon as a message is sent to an SQS queue, an ECS task picks up this message and processes it.
This means that if X messages are sent into the queue, X ECS tasks will be spun up in parallel. An ECS task is only able to fetch one message (per my code below).
The ECS task runs a dockerized Python container and uses the boto3 SQS client to retrieve and parse the SQS message:
sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)

while sqs_message is not None:
    # Process it
    # Delete it from the queue
    # Get next message in queue
    sqs_response = get_sqs_task_data('<sqs_queue_url>')
    sqs_message = parse_sqs_message(sqs_response)
def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1
    )
    return response

def parse_sqs_message(response_sqs_message):
    if 'Messages' not in response_sqs_message:
        logging.info('No messages found in queue')
        return None
    # ... parse it and return a dict
    return {
        'data_1': ...,
        'data_2': ...
    }
All in all, pretty straightforward.
In get_sqs_task_data(), I explicitly specify that I want to retrieve only one message (because one ECS task has to process only one message).
In parse_sqs_message(), I test whether there are any messages left in the queue with:
if 'Messages' not in response_sqs_message:
    logging.info('No messages found in queue')
    return None
When there is only one message in the queue (meaning one ECS task has been triggered), everything works fine. The ECS task picks up the message, processes it and deletes it.
However, when the queue is populated with X messages (X > 1) at the same time, X ECS tasks are triggered, but only one ECS task is able to fetch one of the messages and process it.
All the other ECS tasks exit with No messages found in queue, although there are X - 1 messages left to be processed.
Why is that? Why are the other tasks not able to pick up the remaining messages?
If it matters, the VisibilityTimeout of the queue is set to 30 minutes.
Any help would be greatly appreciated!
Feel free to ask for more details if you need them.
I forgot to give an answer to this question.
The problem was the fact that the SQS queue was set up as a FIFO queue.
A FIFO queue delivers the messages of a given message group to only one consumer at a time (to preserve message order). Changing it to a standard queue fixed this issue.
I'm not sure I understand how the tasks are triggered from SQS, but from what I understand of the SQS SDK documentation, this might happen if the number of messages is small when using short polling. From the get_sqs_task_data definition, I see that you are using short polling.
Short poll is the default behavior where a weighted random set of machines
is sampled on a ReceiveMessage call. Thus, only the messages on the
sampled machines are returned. If the number of messages in the queue
is small (fewer than 1,000), you most likely get fewer messages than you requested
per ReceiveMessage call.
If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response.
If this happens, repeat the request.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message
You might want to try using long polling, by setting WaitTimeSeconds on receive_message (or ReceiveMessageWaitTimeSeconds on the queue) to a non-zero value.
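For instance, a small tweak to the get_sqs_task_data function from the question (20 seconds is the maximum wait boto3's receive_message accepts):

def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20  # long polling: wait up to 20s for a message
    )
    return response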
I hope it helps
I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example by appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
from celery.contrib.batches import Batches

# Flush after 100 messages, or 10 seconds.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    from collections import Counter
    count = Counter(request.kwargs['url'] for request in requests)
    for url, count in count.items():
        print('>>> Clicks: {0} -> {1}'.format(url, count))
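Calls are queued as usual and only the execution is batched; a call like this (hypothetical URL) feeds the counter:

count_click.delay(url='http://example.com')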
Be wary though: it works fine for my usage, but the documentation calls it an "experimental task class". This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processed twice!
There is a nice example of how to do something similar at the kombu level (http://ask.github.com/celery/tutorials/clickcounter.html).
Personally, I like the way Sentry does something like this to batch increments at the DB level (the sentry.buffers module).
I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with long polling on an SQS queue.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
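If you'd rather configure this once on the queue than pass WaitTimeSeconds on every call, a boto3 sketch (the queue URL is a placeholder; the question uses the older boto library, but the attribute is the same):

import boto3

sqs = boto3.client('sqs')
sqs.set_queue_attributes(
    QueueUrl='<sqs_queue_url>',
    Attributes={'ReceiveMessageWaitTimeSeconds': '20'}  # long polling by default
)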
Further, to point out a problem with SQS: you must poll for new notifications, and there is no guarantee that any particular poll will return an event that exists in the queue (this is due to the redundancy of their architecture). This means you need to handle the possibility that a poll didn't return a message that existed (which for me meant I needed to increase the polling rate).
All in all, I found too many limitations with SQS (as I have with some other AWS tools such as SimpleDB). But that's just my opinion.
Actually, if you don't require low latency, you can try this:
Create a CloudWatch alarm on your queue, e.g. on messages visible or messages received > 0.
As an action, the alarm sends a message to an SNS topic, which can then push the message to your workers via an HTTP/S endpoint.
Normally this kind of approach is used for autoscaling.
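A rough boto3 sketch of that alarm; the queue name, topic ARN and thresholds are placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='work-in-queue',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'my-queue'}],
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:my-workers-topic'],
)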
There is now a JMS wrapper for SQS from Amazon that will let you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg