managing AWS SQS and DLQ - python

Scenario :
create a lambda and it will be triggered whenever a message comes to SQS(let's assume SQS-A). The lambda (written in python)is responsible for sending the incoming payload to the another endpoint.
The problem is, whenever the target endpoint or server is down, I was trying to place it into the another SQS (let's assume SQS-B), if other exceptions comes than placing it into Deal Letter Queue.
Here I want to two things.
If ConnectionError (it is the python exception says which says endpoint is down) comes I want to stop the the SQS-A(there is no point to run the lambda as the target server is down).
(or)
As whenever I get this error I am sending it to the SQS-B, I want the SQS-B to be triggered for when the first request comes and it should check if still there is a connection error, it has to trigger after 10 minutes, and again check, if exception persists trigger after 30 minutes, like this
I want to increment the time up to 4 hours and after that check/trigger the lambda every 4 hours. If there is no exception then it should read all the messages in the SQS-B.
Help me how to achieve any one of the approach or recommend any other better approach

You are creating a complex architecture due to a simple problem (the target not being available). Try not to over-complicate things.
I would recommend:
Have the originating system send the message to an Amazon SNS topic
The topic triggers a Lambda function
If it successfully processes the message, no further action required
If the remote endpoint is not available, put the message into an Amazon SQS queue for later processing
Use Amazon CloudWatch Events to trigger a Lambda function every n minutes that grabs any messages in the queue and tries to send them again. If the remote endpoint is still down, it exits and the process will be attempted again n minutes later.
Probably worth also sending an email to an Admin if a message gets older than a few hours.
If you must send the original message to an SQS queue, then you could do as you described... send to Queue-A first, which triggers a Lambda function. If the endpoint is down, Lambda sends the message to Queue-B for later processing. However, only process from Queue-B every n minutes (rather than trying to make each individual message have its own delay timer).

Related

AWS/Python: Peeking SQS message

There is a receive function at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message to get SQS message,
Is there a function that I can just take a Peek at the SQS messages, without actually receiving it. Because If I receive the messages, it will be deleted from the queue. But I want the messages to be stay in the queue after peeking.
you can check
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
aws sqs sdk's (and client libraries written on top of them) by default they dont delete messages. but they have 'Visibility Timeout' which is 30 seconds by default. That means after you read the message , it wont be visible to other consumers for 30 seconds. It is up to a client to delete it within that time frame so that no one else will get that message ever.
So You can reduce that visility time out to something really small like 1 second. So you can download the message and within 1 second it will be available to other consumers . you can even set it to 0 so everyone can read the message at any point.
But this still means you will receieve the message. SQS is pretty simple queue system. you might want to check other queue system like Kafka or different way to design your system like using Notification services such as SNS

Interupt backend function and wait for frontend response

I'm looking for a good solution to implement a handshake between a python backend server and a react frontend connected through a websocket.
The frontend allows the user to upload a file and then send it to the backend for processing. Now it might be possible that the processing encounters some issues and likes the user's confirmation to proceed or cancel - and that's where I'm stuck.
My current implementation has different "endpoints" in the backend which call then different function implementations and a queue which is continuously processed and content (messages) is sent to the frontend. But these are always complete actions, they either succeed or fail and the returned message is accordingly. I have no system in place to interupt a running task (e.g. file processing), send a request to the frontend and then wait for response before I continue the function.
Is there a design pattern or common approach for this kind of problem?
How long it takes to process? Maybe a good solution is set up a message broker like RabbitMQ and create a queue for this process. In the front-end you have to create a panel to see the state of the process, which is running in an async task, and if it has found some issues, let the user know and ask what to do.

ECS task only able to pick one message from SQS queue

I have an architecture which looks like that:
As soon as a message is sent to a SQS queue, an ECS task picks this message and process it.
Which means that if X messages are sent into the queue, X ECS task will be spun up in parallel. An ECS task is only able to fetch one message (per my code above)
The ECS task uses a dockerized Python container, and uses boto3 SQS client to retrieve and parse the SQS message:
sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)
while sqs_message is not None:
# Process it
# Delete if from the queue
# Get next message in queue
sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)
def get_sqs_task_data(queue_url):
client = boto3.client('sqs')
response = client.receive_message(
QueueUrl=queue_url,
MaxNumberOfMessages=1
)
return response
def parse_sqs_message(response_sqs_message):
if 'Messages' not in response_sqs_message:
logging.info('No messages found in queue')
return None
# ... parse it and return a dict
return {
data_1 = ...,
data_2 = ...
}
All in all, pretty straightforward.
In get_sqs_data(), I explicitely specify that I want to retrieve only one message (because 1 ECS task has to process only one message).
In parse_sqs_message(), I test if there are some messages left in the queue with
if 'Messages' not in response_sqs_message:
logging.info('No messages found in queue')
return None
When there is only one message in the queue (meaning one ECS task has been triggered), everything is working fine. The ECS task is able to pick the message, process it and delete it.
However, when the queue is populated with X messages (X > 1) at the same time, X ECS task are triggered, but only ECS task is able to fetch one of the message and process it.
All the others ECS tasks will exit with No messages found in queue, although there are X - 1 messages left to be processed.
Why is that? Why are the others task not able to pick the messages left to be picked?
If that matters, the VisibilityTimeout of SQS is set to 30mins.
Any help would greatly be appreciated!
Feel free to ask for more precision if you want so.
I forgot to give an answer to that question.
The problem was the fact the the SQS was setup as a FIFO queue.
A FIFO Queue only allows one consumer at a time (to preserve the order of the message). Changing it to a normal (standard) queue fixed this issue.
I'm not sure to understand how the tasks are triggered from SQS, but from what I understand in the SQS SDK documentation, this might happen if the number of messages is small when using short polling. From the get_sqs_task_data definition, I see that your are using short polling.
Short poll is the default behavior where a weighted random set of machines
is sampled on a ReceiveMessage call. Thus, only the messages on the
sampled machines are returned. If the number of messages in the queue
is small (fewer than 1,000), you most likely get fewer messages than you requested
per ReceiveMessage call.
If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response.
If this happens, repeat the request.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message
You might want to try to use Long polling with a value superior to the visibility timeout
I hope it helps

How to retry a celery task without duplicating it - SQS

I have a Celery task that takes a message from an SQS queue and tries to run it. If it fails it is supposed to retry every 10 seconds at least 144 times. What I think is happening is that it fails and gets back into the queue, and at the same time it creates a new one, duplicating it to 2. These 2 fail again and follow the same pattern to create 2 new and becoming 4 messages in total. So if I let it run for some time the queue gets clogged.
What I am not getting is the proper way to retry it without duplicating. Following is the code that retries. Please see if someone can guide me here.
from celery import shared_task
from celery.exceptions import MaxRetriesExceededError
#shared_task
def send_br_update(bgc_id, xref_id, user_id, event):
from myapp.models.mappings import BGC
try:
bgc = BGC.objects.get(pk=bgc_id)
return bgc.send_br_update(user_id, event)
except BGC.DoesNotExist:
pass
except MaxRetriesExceededError:
pass
except Exception as exc:
# retry every 10 minutes for at least 24 hours
raise send_br_update.retry(exc=exc, countdown=600, max_retries=144)
Update:
More explanation of the issue...
A user creates an object in my database. Other users act upon that object and as they change the state of that object, my code emits signals. The signal handler then initiates a celery task, which means that it connects to the desired SQS queue and submits the message to the queue. The celery server, running the workers, see that new message and try to execute the task. This is where it fails and the retry logic comes in.
According to celery documentation to retry a task all we need to do is to raise self.retry() call with countdown and/or max_retries. If a celery task raises an exception it is considered as failed. I am not sure how SQS handles this. All I know is that one task fails and there are two in the queue, both of these fail and then there are 4 in the queue and so on...
This is NOT celery nor SQS issues.
The real issues is the workflow , i.e. way of you sending message to MQ service and handle it that cause duplication. You will face the same problem using any other MQ service.
Imagine your flow
script : read task message. MQ Message : lock for 30 seconds
script : task fail. MQ Message : locking timeout, message are now free to be grab again
script : create another task message
Script : Repeat Step 1. MQ Message : 2 message with the same task, so step 1 will launch 2 task.
So if the task keep failing, it will keep multiply, 2,4,8,16,32....
If celery script are mean to "Recreate failed task and send to message queue", you want to make sure these message can only be read ONCE. **You MUST discard the task message after it already been read 1 time, even if the task failed. **
There are at least 2 ways to do this, choose one.
Delete the message before recreate the task. OR
In SQS, you can enforce this by create DeadLetter Queue, configure the Redrive Policy, set Maximum Receives to 1. This will make sure the message
with the task that have been read never recycle.
You may prefer method 2, because method 1 require you to configure celery to "consume"(read and delete) ASAP it read the message, which is not very practical. (and you must make sure you delete it before create a new message for failed task)
This dead letter queue is a way to let you to check if celery CRASH, i.e. message that have been read once but not consumed (delete) means program stop somewhere.
This is probably a little bit late, I have written a backoff policy for Celery + SQS as a patch.
You can see how it is implemented in this repository
https://github.com/galCohen88/celery_sqs_retry_policy/blob/master/svc/celery.py

How does Amazon's SQS notify one of my "worker" servers whenever there is something in the queue?

I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with Long polling on a SQS queue.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
Further to point out a problem with SQS - You must poll for new notifications, and there is no guarantee that on any particular poll you will receive an event that exists in the queue (this is due to the redundancy of their architecture). This means you need to consider the possibility that your polling didn't return a message that existed (which for me meant I needed to increase the polling rate).
All in all I found too many limitations with SQS (as I've found with some other AWS tools such as SimpleDB). But that's just my injected opinion.
Actual if you dont require a low latency, you can try this:
Create an cloudwatch alarm on your queue, like messages visible or messages received > 0.
As an action you will send a message to an sns topic, which then can send the message to your workers via an http/s endpoint.
normally this kind of approach is used for autoscaling.
There is now an JMS wrapper for SQS from Amazon that will let you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg

Categories

Resources