ECS task only able to pick one message from SQS queue - python

I have an architecture which looks like this:
As soon as a message is sent to an SQS queue, an ECS task picks up this message and processes it.
Which means that if X messages are sent to the queue, X ECS tasks will be spun up in parallel. An ECS task is only able to fetch one message (per my code below).
The ECS task uses a dockerized Python container, and uses boto3 SQS client to retrieve and parse the SQS message:
sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)

while sqs_message is not None:
    # Process it
    # Delete it from the queue
    # Get next message in queue
    sqs_response = get_sqs_task_data('<sqs_queue_url>')
    sqs_message = parse_sqs_message(sqs_response)
def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1
    )
    return response
def parse_sqs_message(response_sqs_message):
    if 'Messages' not in response_sqs_message:
        logging.info('No messages found in queue')
        return None

    # ... parse it and return a dict
    return {
        'data_1': ...,
        'data_2': ...
    }
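(The delete step is only a comment in the loop above; with boto3 it is roughly the following, using the ReceiptHandle from the raw received message. delete_sqs_message is just a placeholder helper name.)

def delete_sqs_message(queue_url, raw_message):
    # raw_message is one entry of response['Messages']; the ReceiptHandle
    # identifies this particular receive of the message.
    client = boto3.client('sqs')
    client.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=raw_message['ReceiptHandle']
    )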
All in all, pretty straightforward.
In get_sqs_task_data(), I explicitly specify that I want to retrieve only one message (because one ECS task has to process only one message).
In parse_sqs_message(), I test if there are some messages left in the queue with
if 'Messages' not in response_sqs_message:
    logging.info('No messages found in queue')
    return None
When there is only one message in the queue (meaning one ECS task has been triggered), everything is working fine. The ECS task is able to pick the message, process it and delete it.
However, when the queue is populated with X messages (X > 1) at the same time, X ECS tasks are triggered, but only one ECS task is able to fetch one of the messages and process it.
All the other ECS tasks exit with No messages found in queue, although there are X - 1 messages left to be processed.
Why is that? Why are the other tasks not able to pick up the remaining messages?
If that matters, the VisibilityTimeout of the SQS queue is set to 30 minutes.
Any help would greatly be appreciated!
Feel free to ask for more details if needed.

I forgot to post the answer to this question.
The problem was the fact that the SQS queue was set up as a FIFO queue.
A FIFO queue delivers the messages of a message group one at a time, in order, so while one message is in flight the remaining ones are held back and the other consumers see an empty queue. Changing it to a normal (standard) queue fixed this issue.
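If you do need to keep the FIFO queue, note that ordering (and the one-at-a-time delivery) is only enforced within a message group, so giving each independent job its own MessageGroupId lets the tasks consume in parallel. A rough sketch (the queue URL and IDs are placeholders):

import boto3

sqs = boto3.client('sqs')

# FIFO queue names must end in ".fifo"
queue_url = 'https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue.fifo'

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"job_id": "42"}',
    MessageGroupId='job-42',             # different groups can be consumed in parallel
    MessageDeduplicationId='job-42-v1'   # or enable content-based deduplication
)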

I'm not sure I understand how the tasks are triggered from SQS, but from what I understand of the SQS SDK documentation, this can happen when the number of messages is small and you are using short polling. From the get_sqs_task_data definition, I see that you are using short polling.
Short poll is the default behavior where a weighted random set of machines is sampled on a ReceiveMessage call. Thus, only the messages on the sampled machines are returned. If the number of messages in the queue is small (fewer than 1,000), you most likely get fewer messages than you requested per ReceiveMessage call.
If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. If this happens, repeat the request.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message
You might want to try long polling by setting WaitTimeSeconds on the receive_message call (the maximum allowed value is 20 seconds), so that a single call waits for a message to arrive instead of returning empty.
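For example, get_sqs_task_data could look roughly like this (a sketch; WaitTimeSeconds can be at most 20):

def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    # Long polling: wait up to 20 seconds for a message to arrive
    # instead of sampling a subset of servers and returning immediately.
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20
    )
    return response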
I hope it helps

Related

RabbitMQ bulk consuming messages solution

I'm using RabbitMQ as a queue for different messages. When I consume these messages with two different consumers from one queue, I process them and insert the processing results into a DB:
def consumer_callback(self, channel, delivery_tag, properties, message):
    result = make_some_processing(message)
    insert_to_db(result)
    channel.basic_ack(delivery_tag)
I want to bulk-consume messages from the queue to reduce DB load. Since RabbitMQ does not support consumers reading messages in bulk, I'm going to do something like this:
some_messages_list = []

def consumer_callback(self, channel, delivery_tag, properties, message):
    some_messages_list.append({delivery_tag: message})
    if len(some_messages_list) > 1000:
        results_list = make_some_processing_bulk(some_messages_list)
        insert_to_db_bulk(results_list)
        for message_dict in some_messages_list:
            for tag in message_dict:
                channel.basic_ack(tag)
        some_messages_list.clear()
Messages stay in the queue until they are all fully processed.
If the consumer crashes or disconnects, the messages stay safe.
What do you think about that solution?
If it's okay, how can I get all the unacknowledged messages back if the consumer crashes?
I've tested this solution for several months and can say that it is pretty good. Until AMQP provides a feature for bulk consuming, we have to use workarounds like this.
Note: if you decide to use this solution, beware of concurrent consumption by several consumers (threads), or use locks (I've used Python's threading.Lock) to guarantee that no race conditions happen.
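Something like this is what I mean by the lock (a sketch that keeps the callback signature and the hypothetical make_some_processing_bulk / insert_to_db_bulk helpers from the question):

import threading

buffer_lock = threading.Lock()
some_messages_list = []

def consumer_callback(self, channel, delivery_tag, properties, message):
    with buffer_lock:
        # Only one thread at a time may touch the shared buffer,
        # so the flush below cannot race with another append/clear.
        some_messages_list.append({delivery_tag: message})
        if len(some_messages_list) >= 1000:
            results_list = make_some_processing_bulk(some_messages_list)
            insert_to_db_bulk(results_list)
            for message_dict in some_messages_list:
                for tag in message_dict:
                    channel.basic_ack(tag)
            some_messages_list.clear()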

AWS/Python: Peeking SQS message

There is a receive function at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message to get SQS messages.
Is there a function that lets me just take a peek at the SQS messages without actually receiving them? Because if I receive the messages, they will be deleted from the queue, but I want the messages to stay in the queue after peeking.
you can check
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
The AWS SQS SDKs (and the client libraries written on top of them) don't delete messages by default, but they have a 'visibility timeout', which is 30 seconds by default. That means that after you read a message, it won't be visible to other consumers for 30 seconds. It is up to the client to delete it within that time frame so that no one else will ever get that message.
So you can reduce that visibility timeout to something really small, like 1 second, so that you can download the message and within 1 second it will be available to other consumers again. You can even set it to 0 so everyone can read the message at any point.
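A rough sketch of that kind of peek with boto3 (the queue URL is a placeholder):

import boto3

sqs = boto3.client('sqs')

response = sqs.receive_message(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/my-queue',
    MaxNumberOfMessages=1,
    VisibilityTimeout=0   # do not hide the message from other consumers
)

for message in response.get('Messages', []):
    print(message['Body'])   # look at it, but do not delete it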
But this still means you will receive the message. SQS is a pretty simple queue system. You might want to check other queue systems like Kafka, or a different way to design your system, such as using a notification service like SNS.

How to retry a celery task without duplicating it - SQS

I have a Celery task that takes a message from an SQS queue and tries to run it. If it fails, it is supposed to retry every 10 minutes, at least 144 times. What I think is happening is that when it fails it goes back into the queue, and at the same time a new copy is created, duplicating it to 2. These 2 fail again and follow the same pattern, creating 2 new ones and becoming 4 messages in total. So if I let it run for some time, the queue gets clogged.
What I am not getting is the proper way to retry it without duplicating. Following is the code that retries. Please see if someone can guide me here.
from celery import shared_task
from celery.exceptions import MaxRetriesExceededError

@shared_task
def send_br_update(bgc_id, xref_id, user_id, event):
    from myapp.models.mappings import BGC

    try:
        bgc = BGC.objects.get(pk=bgc_id)
        return bgc.send_br_update(user_id, event)
    except BGC.DoesNotExist:
        pass
    except MaxRetriesExceededError:
        pass
    except Exception as exc:
        # retry every 10 minutes for at least 24 hours
        raise send_br_update.retry(exc=exc, countdown=600, max_retries=144)
Update:
More explanation of the issue...
A user creates an object in my database. Other users act upon that object, and as they change its state, my code emits signals. The signal handler then initiates a Celery task, which means that it connects to the desired SQS queue and submits the message to the queue. The Celery server running the workers sees that new message and tries to execute the task. This is where it fails and the retry logic comes in.
According to the Celery documentation, to retry a task all we need to do is raise self.retry() with a countdown and/or max_retries. If a Celery task raises an exception it is considered failed. I am not sure how SQS handles this. All I know is that one task fails and then there are two in the queue; both of these fail and then there are 4 in the queue, and so on...
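For reference, the pattern from the Celery docs, written as a bound task, looks roughly like this (a sketch, not my exact code; the countdown/max_retries values just mirror mine):

from celery import shared_task

@shared_task(bind=True, default_retry_delay=600, max_retries=144)
def send_br_update(self, bgc_id, xref_id, user_id, event):
    try:
        # ... do the actual work ...
        pass
    except Exception as exc:
        # self.retry() raises a Retry exception so Celery re-schedules
        # this task instead of marking it as failed.
        raise self.retry(exc=exc)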
This is NOT a Celery or SQS issue.
The real issue is the workflow, i.e. the way you send messages to the MQ service and handle them, which is what causes the duplication. You will face the same problem with any other MQ service.
Imagine your flow:
1. Script: read the task message. MQ: the message is locked for 30 seconds.
2. Script: the task fails. MQ: the lock times out and the message is now free to be grabbed again.
3. Script: create another task message.
4. Script: repeat step 1. MQ: there are now 2 messages with the same task, so step 1 will launch 2 tasks.
So if the task keeps failing, it will keep multiplying: 2, 4, 8, 16, 32...
If the Celery script is meant to "recreate the failed task and send it to the message queue", you want to make sure these messages can only be read ONCE. You MUST discard the task message after it has been read once, even if the task failed.
There are at least 2 ways to do this; choose one:
1. Delete the message before recreating the task, OR
2. In SQS, enforce this by creating a dead-letter queue and configuring the redrive policy with Maximum Receives set to 1. This will make sure a message whose task has already been read is never recycled.
You may prefer method 2, because method 1 requires you to configure Celery to "consume" (read and delete) the message as soon as it reads it, which is not very practical (and you must make sure you delete it before creating a new message for the failed task).
The dead-letter queue also gives you a way to check whether Celery CRASHED: a message that has been read once but not consumed (deleted) means the program stopped somewhere.
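A rough sketch of option 2 with boto3 (the queue URL and dead-letter queue ARN are placeholders):

import json
import boto3

sqs = boto3.client('sqs')

sqs.set_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/celery-queue',
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789012:celery-dlq',
            # After one receive, the message moves to the DLQ instead of
            # becoming visible again, so a failing task cannot multiply.
            'maxReceiveCount': '1'
        })
    }
)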
This is probably a little bit late, but I have written a backoff policy for Celery + SQS as a patch.
You can see how it is implemented in this repository
https://github.com/galCohen88/celery_sqs_retry_policy/blob/master/svc/celery.py

How to implement a priority queue using SQS(Amazon simple queue service)

I have a situation where a message fails, and I would like to replay that message with the highest priority using the Python boto package so it will be picked up first. If I'm not wrong, SQS does not support priority queues, so I would like to implement something simple.
Important note: when a message fails I no longer have the message object; I only persist the receipt_handle, so I can delete the message (if there have been more than x retries) or change its visibility timeout in order to push it back to the queue.
Thanks!
I don't think there is any way to do this with a single SQS queue. You have no control over delivery of messages and, therefore, no way to impose a priority on messages. If you find a way, I would love to hear about it.
I think you could possibly use two queues (or more generally N queues where N is the number of levels of priority) but even this seems impossible if you don't actually have the message object at the time you determine that it has failed. You would need the message object so that the data could be written to the high-priority queue.
I'm not sure this actually qualifies as an answer 8^)
As far as I know, AWS SQS does not provide a native way of doing a priority queue (single-queue priority). If you are open to considering other options, RabbitMQ can do this. The client can specify a priority of 0-255 in the message, and the queue will make sure a higher-priority message gets to the consumer first.
For more information, please look at https://www.rabbitmq.com/priority.html
By "when a msg fails", if you meant "processing failure" then you could look into Dead Letter Queue (DLQ) feature that comes with SQS. You can set the receive count threshold to move the failed messages to DLQ. Each DLQ is associated with an SQS.
In your case, you could make "max receive count" = 1 and you deal with that message seperately
Looking at the original question, we need a prioritized retry queue, which means:
the items in the queue should be prioritized, because low-priority items may depend on high-priority items, e.g. a low-priority item can be successfully processed only after a high-priority item has been successfully processed;
we need to implement multiple retries for queue items;
in this case, yes, we can use SQS with some not-too-important assumptions.
Let's say your message is in JSON format and it contains a Priority field. Add it to your SQS queue.
At the consumer, get the message, process it, and if it failed, set the visibility timeout to (priority) * (60 minutes), as shown in the sketch below.
Set maxReceiveCount for the queue equal to the number of retries you need, or configure the retention period to move the message to the dead-letter queue after a specific number of retries.
So, in theory, a low-priority request may be executed before a high-priority one only once at the beginning, and after that it will be "moved after" the high-priority messages using the visibility timeout.
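A rough sketch of that failure-handling step, using only the persisted receipt handle (the queue URL is a placeholder, and priority is the value parsed from the failed message, where a larger value means the message should be pushed further back):

import boto3

sqs = boto3.client('sqs')

def push_back_failed_message(queue_url, receipt_handle, priority):
    # The message becomes visible (and retryable) again only after
    # priority * 60 minutes, so more urgent messages resurface sooner.
    # Note: SQS caps the visibility timeout at 12 hours.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=priority * 60 * 60
    )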

How does Amazon's SQS notify one of my "worker" servers whenever there is something in the queue?

I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with long polling on an SQS queue.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
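So with boto3, for example, a worker can effectively block on the queue by looping on a long-poll receive, roughly like this (the queue URL and handle_work are placeholders):

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/work-queue'

while True:
    # Blocks for up to 20 seconds waiting for a message to arrive.
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20
    )
    for message in response.get('Messages', []):
        handle_work(message['Body'])   # hypothetical worker function
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])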
Further, to point out a problem with SQS: you must poll for new notifications, and there is no guarantee that on any particular poll you will receive an event that exists in the queue (this is due to the redundancy of their architecture). This means you need to account for the possibility that your polling didn't return a message that exists (which for me meant I needed to increase the polling rate).
All in all I found too many limitations with SQS (as I've found with some other AWS tools such as SimpleDB). But that's just my injected opinion.
Actually, if you don't require low latency, you can try this:
Create a CloudWatch alarm on your queue, e.g. on messages visible or messages received > 0.
As an action, send a message to an SNS topic, which can then push the message to your workers via an HTTP/S endpoint.
Normally this kind of approach is used for autoscaling.
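A rough sketch of that wiring with boto3 (the queue name and SNS topic ARN are placeholders; the alarm fires when at least one message is visible):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='work-queue-has-messages',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'work-queue'}],
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    # The SNS topic then pushes to the workers' HTTP/S endpoint subscriptions.
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:work-queue-alerts']
)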
There is now a JMS wrapper for SQS from Amazon that will let you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg
