I have a Celery task that takes a message from an SQS queue and tries to run it. If it fails, it is supposed to retry every 10 minutes, up to 144 times (roughly 24 hours). What I think is happening is that when it fails, the message goes back into the queue, and at the same time the retry creates a new one, so there are now 2. These 2 fail again and follow the same pattern, becoming 4 messages in total. So if I let it run for some time, the queue gets clogged.
What I don't understand is the proper way to retry without duplicating the message. Below is the code that retries. Can someone guide me here?
from celery import shared_task
from celery.exceptions import MaxRetriesExceededError

@shared_task
def send_br_update(bgc_id, xref_id, user_id, event):
    from myapp.models.mappings import BGC

    try:
        bgc = BGC.objects.get(pk=bgc_id)
        return bgc.send_br_update(user_id, event)
    except BGC.DoesNotExist:
        pass
    except MaxRetriesExceededError:
        pass
    except Exception as exc:
        # retry every 10 minutes for at least 24 hours (144 retries)
        raise send_br_update.retry(exc=exc, countdown=600, max_retries=144)
Update:
More explanation of the issue...
A user creates an object in my database. Other users act upon that object, and as they change its state, my code emits signals. The signal handler then initiates a Celery task, which means it connects to the desired SQS queue and submits the message to the queue. The Celery server running the workers sees that new message and tries to execute the task. This is where it fails and the retry logic comes in.
According to the Celery documentation, to retry a task all we need to do is raise self.retry() with a countdown and/or max_retries. If a Celery task raises an exception, it is considered failed. I am not sure how SQS handles this. All I know is that one task fails and there are two in the queue; both of those fail and then there are 4 in the queue, and so on...
This is NOT a Celery or SQS issue.
The real issue is the workflow, i.e. the way you send messages to the MQ service and handle them, which causes the duplication. You would face the same problem with any other MQ service.
Imagine your flow:
1. Script: reads a task message. MQ: the message is locked for 30 seconds.
2. Script: the task fails. MQ: the lock times out, the message is free to be grabbed again.
3. Script: creates another task message.
4. Script: repeats step 1. MQ: there are now 2 messages with the same task, so step 1 launches 2 tasks.
So if the task keeps failing, the messages keep multiplying: 2, 4, 8, 16, 32...
If the Celery script is meant to "recreate the failed task and send it to the message queue", you have to make sure these messages can only be read ONCE. **You MUST discard the task message after it has been read once, even if the task failed.**
There are at least 2 ways to do this; choose one.
Delete the message before recreating the task, OR
In SQS, enforce this by creating a dead-letter queue and configuring the redrive policy with Maximum Receives set to 1. This makes sure a message whose task has already been read is never recycled.
You may prefer method 2, because method 1 requires you to configure Celery to "consume" (read and delete) the message as soon as it reads it, which is not very practical (and you must make sure you delete it before creating a new message for the failed task).
The dead-letter queue also lets you check whether Celery CRASHED: a message that has been read once but not consumed (deleted) means the program stopped somewhere.
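For method 2, a rough sketch with boto3 (the queue URL and dead-letter queue ARN below are placeholders, and the dead-letter queue must already exist):

import json
import boto3

sqs = boto3.client('sqs')

# Placeholder identifiers -- replace with your own queue URL and DLQ ARN.
SOURCE_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-celery-queue'
DEAD_LETTER_ARN = 'arn:aws:sqs:us-east-1:123456789012:my-celery-dlq'

# Redrive policy: after 1 receive, SQS moves the message to the dead-letter
# queue instead of making it visible again, so a failed task is never re-read.
sqs.set_queue_attributes(
    QueueUrl=SOURCE_QUEUE_URL,
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': DEAD_LETTER_ARN,
            'maxReceiveCount': '1',
        })
    }
)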
This is probably a little bit late, but I have written a backoff policy for Celery + SQS as a patch.
You can see how it is implemented in this repository
https://github.com/galCohen88/celery_sqs_retry_policy/blob/master/svc/celery.py
Related
I have an API endpoint to register a new user. The "welcome email" task is enqueued and runs asynchronously. I have 2 unit tests that check:
The API does save the user's information to the DB
The Celery task does send the email with the right content and template
I want to add a 3rd unit test to ensure that the endpoint enqueues the email-sending task after saving the user form to the DB.
I tried celery.AsyncResult, but it requires running the worker. Furthermore, even if the worker is running, we still can't verify whether the task was enqueued because of the ambiguous PENDING state:
Task exists in the queue but has not executed yet: PENDING
Task doesn't exist in the queue: PENDING
Has anyone faced this problem? How do I solve it?
The common way to solve this problem in testing environments is to use the task_always_eager configuration setting, which basically instructs Celery to run the task like a regular function. Instead of an AsyncResult, Celery will return an object of the EagerResult type that behaves the same but has completely different execution logic.
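For example, a minimal sketch of such a test under that setting, assuming a Django project (the Celery app location myproject.celery, the /register endpoint, and sending the welcome email via Django's mail framework are all assumptions for illustration). Because the task runs inline, asserting its side effect also proves the endpoint enqueued it:

from django.core import mail
from django.test import TestCase

from myproject.celery import app  # assumed location of your Celery app instance

class RegisterEndpointTest(TestCase):
    def setUp(self):
        # Run tasks inline instead of sending them to the broker.
        app.conf.task_always_eager = True
        app.conf.task_eager_propagates = True  # re-raise task exceptions inside the test

    def tearDown(self):
        app.conf.task_always_eager = False
        app.conf.task_eager_propagates = False

    def test_registration_enqueues_welcome_email(self):
        # Hypothetical endpoint and payload -- adjust to your own API.
        response = self.client.post('/register', {'email': 'user@example.com', 'password': 'secret'})
        self.assertEqual(response.status_code, 201)
        # The task ran eagerly, so its side effect is visible immediately:
        # the welcome email was "sent" to Django's in-memory test outbox.
        self.assertEqual(len(mail.outbox), 1)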
I have an architecture which looks like this:
As soon as a message is sent to an SQS queue, an ECS task picks up the message and processes it.
That means that if X messages are sent to the queue, X ECS tasks will be spun up in parallel. An ECS task is only able to fetch one message (per my code below).
The ECS task uses a Dockerized Python container and the boto3 SQS client to retrieve and parse the SQS message:
sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)

while sqs_message is not None:
    # Process it
    # Delete it from the queue
    # Get the next message in the queue
    sqs_response = get_sqs_task_data('<sqs_queue_url>')
    sqs_message = parse_sqs_message(sqs_response)


def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1
    )
    return response


def parse_sqs_message(response_sqs_message):
    if 'Messages' not in response_sqs_message:
        logging.info('No messages found in queue')
        return None
    # ... parse it and return a dict
    return {
        'data_1': ...,
        'data_2': ...
    }
All in all, pretty straightforward.
In get_sqs_task_data(), I explicitly specify that I want to retrieve only one message (because 1 ECS task has to process only one message).
In parse_sqs_message(), I test if there are some messages left in the queue with
if 'Messages' not in response_sqs_message:
logging.info('No messages found in queue')
return None
When there is only one message in the queue (meaning one ECS task has been triggered), everything is working fine. The ECS task is able to pick the message, process it and delete it.
However, when the queue is populated with X messages (X > 1) at the same time, X ECS tasks are triggered, but only one ECS task is able to fetch one of the messages and process it.
All the other ECS tasks exit with No messages found in queue, although there are X - 1 messages left to be processed.
Why is that? Why are the other tasks not able to pick up the remaining messages?
If that matters, the VisibilityTimeout of the SQS queue is set to 30 minutes.
Any help would be greatly appreciated!
Feel free to ask for more details if you need them.
I forgot to give an answer to this question.
The problem was the fact that the SQS queue was set up as a FIFO queue.
A FIFO queue delivers messages from the same message group to only one consumer at a time (to preserve message ordering), so with a single message group it effectively allows only one consumer at a time. Changing it to a normal (standard) queue fixed the issue.
I'm not sure I understand how the tasks are triggered from SQS, but from what I read in the SQS SDK documentation, this can happen when the number of messages is small and you are using short polling. From the get_sqs_task_data definition, I see that you are using short polling.
Short poll is the default behavior where a weighted random set of machines is sampled on a ReceiveMessage call. Thus, only the messages on the sampled machines are returned. If the number of messages in the queue is small (fewer than 1,000), you most likely get fewer messages than you requested per ReceiveMessage call. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. If this happens, repeat the request.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message
You might want to try long polling by setting WaitTimeSeconds on the ReceiveMessage call (the maximum is 20 seconds), so the call waits for a message instead of returning an empty response immediately; see the sketch below.
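For example, a minimal variant of the get_sqs_task_data function from the question, adding WaitTimeSeconds (20 seconds is the SQS maximum for long polling):

import boto3

def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    # Long polling: wait up to 20 seconds for a message to arrive
    # instead of returning an empty response immediately.
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    )
    return response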
I hope it helps
I'm new to Celery and I'm trying to understand if it can solve my problem.
I need to start a number of tasks (An) and then run another task (B) after these are done. The problem is that tasks An are added sequentially and I don't want to wait for the last one to be added before I start the first one. Can I configure task B to execute after tasks An are done?
Now to the real scenario:
Task An - process a file uploaded by a user (added after each file is uploaded)
Task B - do something with the results of processing all the uploaded files
Alternative solutions are welcome as well
For sure you can do this; Celery canvas supports many options, including the behaviour you require of running a task after a group of tasks ... it is called a "chord", e.g.:
from celery import chord
from tasks import task_upload1, task_upload2, task_upload3, final_execution

result = chord([task_upload1.s(), task_upload2.s(), task_upload3.s()])(final_execution.s())
get_required_result = result.get()
you can refer to this link for more details
With RabbitMQ you can get exactly this behavior using message acknowledgements and the aggregator pattern.
You start a worker that consumes messages (A) and does some work (processes a file uploaded by the user, in your case), but doesn't send an ack when it finishes. Instead it takes the next message from the queue, and if it's another A task, it does the same thing. At some point it will receive task B; it can then process all the previous A results at once and send acks for all of them.
Unfortunately, this scenario can't be done with Celery, because you have to specify all the A tasks and the final B task (chains, chords, callbacks, etc.) at creation time.
Alternatively, you can save the Task.id of each successful A task in a separate store (not a Celery queue) and process those entries when executing task B. Celery can fit this algorithm.
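A rough sketch of that alternative, assuming Redis is available as the side store; the key name, task names and the processing steps are illustrative, not part of the original question:

import redis
from celery import shared_task

r = redis.Redis()  # assumed side store, separate from the Celery broker/result backend

@shared_task(bind=True)
def process_file(self, file_id):
    # ... A task: process the uploaded file here ...
    # Record this task's id so the B task can find its result later.
    r.rpush('finished_a_tasks', self.request.id)

@shared_task
def finalize_all():
    # B task: collect the ids of all completed A tasks and combine their results
    # (e.g. by loading them from the Celery result backend).
    task_ids = [tid.decode() for tid in r.lrange('finished_a_tasks', 0, -1)]
    r.delete('finished_a_tasks')
    # ... combine the results for task_ids here ...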
I need different messages to be sent to the queue, each on its own schedule, so I have a list of messages and the interval at which to resend each one. I use RabbitMQ/pika and APScheduler.
Following numerous examples, I created the simplest BlockingConnection/channel/queue. When I push messages immediately after that, everything works fine; I can see in the RabbitMQ web interface that all the messages end up in the queue. Here is the piece of code that works:
self.cr = Queue('DIRECT_C_QUEUE', True, ex_type='direct')
for i in range(1, 10000):
    self.cr.channel.basic_publish(exchange='', routing_key='DIRECT_C_QUEUE', body='hello_world')
But if I try to push messages (in exactly the same way) from an APScheduler callback function, only a few (about 1-10) messages appear in the queue, even though the callbacks fire all the time and no exceptions are raised when publishing.
Eventually I begin to receive warnings like this:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pika/connection.py:642: UserWarning: Pika: Write buffer exceeded warning threshold at 1125 bytes and an estimated 43 frames behind
warn(message % (self.outbound_buffer.size, est_frames_behind))
and still no new messages in the queue.
I am new to Python; any help is much appreciated.
I found the source of the problem:
APScheduler runs the basic_publish calls in a separate thread, and pika does not recommend sharing connections between threads - http://pika.github.com/faq.html
So I had a choice: either create a new connection each time, or put new messages into an internal queue and publish them from the main thread (where the connection was created).
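A minimal sketch of the first option, opening a fresh, short-lived connection per publish from the scheduler callback thread (host and queue name mirror the example above but are otherwise placeholders):

import pika

def publish_message(body):
    # Called from the APScheduler job thread: open a short-lived connection,
    # publish, and close it, so no connection is shared between threads.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='DIRECT_C_QUEUE', durable=True)
    channel.basic_publish(exchange='', routing_key='DIRECT_C_QUEUE', body=body)
    connection.close()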
I fixed that problem by increasing the ulimit:
edit /etc/default/rabbitmq-server and set
ulimit -n 4096
then restart RabbitMQ:
sudo /etc/init.d/rabbitmq-server restart
I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
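For example, a rough sketch using boto3 (the tutorial you linked uses the older boto library; the topic ARN below is a placeholder, and your workers would need to be subscribed to that topic):

import boto3

sns = boto3.client('sns')

# Placeholder topic ARN -- workers subscribe to this topic
# (e.g. via an HTTP endpoint or an SQS queue subscribed to it).
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:work-available'

def notify_workers(queue_name):
    # Tell subscribed workers that new work has been added to the queue.
    sns.publish(TopicArn=TOPIC_ARN, Message=f'New work available in {queue_name}')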
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with long polling on an SQS queue.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
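A minimal worker-loop sketch using long polling with boto3 (the queue URL is a placeholder and the processing step is left as a comment):

import boto3

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-work-queue'  # placeholder
sqs = boto3.client('sqs')

def worker_loop():
    while True:
        # Block for up to 20 seconds waiting for a message (long polling).
        response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for message in response.get('Messages', []):
            # ... process message['Body'] here ...
            # Delete the message only after it has been processed successfully.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])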
Further, to point out a problem with SQS: you must poll for new messages, and there is no guarantee that any particular poll will return a message that exists in the queue (this is due to the redundancy of their architecture). This means you need to handle the possibility that a poll didn't return a message that existed (which for me meant I needed to increase the polling rate).
All in all I found too many limitations with SQS (as I have with some other AWS tools such as SimpleDB). But that's just my opinion.
Actually, if you don't require low latency, you can try this:
Create a CloudWatch alarm on your queue, e.g. on messages visible or messages received > 0.
As an action, send a message to an SNS topic, which can then deliver the message to your workers via an HTTP/S endpoint.
Normally this kind of approach is used for autoscaling.
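A rough sketch of creating such an alarm with boto3 (the queue name, alarm name and SNS topic ARN are placeholders; pick the metric and threshold that suit your workload):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='queue-has-messages',                                   # illustrative name
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'my-work-queue'}],      # placeholder queue
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    # Placeholder topic; subscribed workers (or an autoscaling hook) receive the alarm.
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:workers-topic'],
)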
There is now a JMS wrapper for SQS from Amazon that lets you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg