I would like to combine a chain and a group in a small workflow of immutable tasks, without a results backend.
However, when I try this, Celery automatically converts it to a chord and then complains that there is no results backend.
Is there any way I can get this to work without a results backend?
Code:
from celery import chain, group, shared_task

@shared_task
def test_canvas():
    workflow = chain(group(test_task_a.si(), test_task_b.si()), test_task_c.si())
    workflow.delay()
Here is the error message I get:
raised unexpected: NotImplementedError('Starting chords requires a result backend to be configured.
Note that a group chained with a task is also upgraded to be a chord, as this pattern requires synchronization.
Result backends that supports chords: Redis, Database, Memcached, and more.',)
Interestingly, running a chain or a group by itself works just fine.
Example:
workflow = chain(test_task_a.si(), test_task_b.si(), test_task_c.si())
workflow.delay()
Unfortunately, I think the answer is no. You can't run a chord without a result backend:
Tasks used within a chord must not ignore their results. In practice this means that you must enable a result_backend in order to use chords.
Your first example in test_canvas is implicitly a chord:
A chord is a task that only executes after all of the tasks in a group have finished executing (link).
If you think about the logic behind it (well explained here), something (the backend) needs to figure out when all the parallel tasks in the group have finished, so that it knows when to trigger the next (chained) task.
In the second example, running multiple tasks concurrently with a group is simple (there is nothing to coordinate afterwards if no follow-up action is needed).
The same goes for the chain: each task is responsible for triggering the next one, so no complicated coordination is needed.
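To make the coordination concrete, here is a minimal sketch in plain Python (no Celery involved) of what has to happen for a group followed by a task. The ChordCounter class and the thread pool are stand-ins I made up for illustration: the counter plays the role of the shared state that a result backend would normally hold.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ChordCounter:
    """Hypothetical stand-in for the shared state a result backend provides."""

    def __init__(self, group_size, callback):
        self._remaining = group_size
        self._lock = threading.Lock()
        self._callback = callback

    def task_done(self):
        # Every group member reports completion here. Only the shared
        # counter can tell which report is the last one.
        with self._lock:
            self._remaining -= 1
            is_last = self._remaining == 0
        if is_last:
            self._callback()


events = []
counter = ChordCounter(group_size=2, callback=lambda: events.append("c"))


def run_group_task(name):
    events.append(name)   # the task's own work
    counter.task_done()   # report completion to the shared state


# Tasks "a" and "b" run in parallel, like the members of a group.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(run_group_task, ["a", "b"]))
```

Without somewhere shared to keep that counter, neither "a" nor "b" can know whether it was the last one to finish, which is why the follow-up task cannot be triggered reliably.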
I'm currently leveraging celery for periodic tasks. I am new to celery. I have two workers running two different queues: one for slow background jobs and one for jobs that users queue up in the application.
I am monitoring my tasks on datadog because it's an easy way to confirm that my workers are running appropriately.
What I want to do is after each task completes, record which queue the task was completed on.
@after_task_publish.connect()
def on_task_publish(sender=None, headers=None, body=None, **kwargs):
    statsd.increment("celery.on_task_publish.start.increment")
    task = celery.tasks.get(sender)
    queue_name = task.queue
    statsd.increment("celery.on_task_publish.increment", tags=[f"{queue_name}:{task}"])
The function above is something that I implemented after researching the celery docs and some StackOverflow posts, but it's not working as intended. I get the first statsd increment, but the remaining code does not execute.
I am wondering if there is a simpler way to inspect inside/after each task completes, what queue processed the task.
Since your question asks whether there is a way to inspect inside/after each task completes, I'm assuming you haven't tried Celery's result backend yet. You could check out this feature, which is provided by Celery itself: the task result backend.
It is very useful for storing the results of your Celery tasks.
Read through this => https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-result-backend-settings
Once you have an idea of how to set up the result backend, search for the result_extended key (in the same link) to be able to add queue names to your stored task results.
A number of options are available. For example, you can set up these results to go to any of these:
Sql-DB / NoSql-DB / S3 / Azure / Elasticsearch / etc
I have made use of this result backend feature with Elasticsearch, and this is how my task results are stored:
It is just a matter of adding a few configurations to the settings.py file as per your requirements. It worked really well for my application. I also have a weekly cron that clears only the successful task results, since we don't need them anymore, so I can see only the failed results (like the one in the image).
These were the main keys for my requirement: task_track_started and task_acks_late, along with result_backend.
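As a rough illustration, the keys mentioned above could look like this in a plain Celery configuration module. The URL (host, index name and doc type) is a placeholder, and the module name is an assumption. In a Django settings.py loaded with namespace="CELERY", the same keys would be written upper-case with a CELERY_ prefix.

```python
# celeryconfig.py (hypothetical module name), loaded with
# app.config_from_object("celeryconfig")

# Placeholder URL: host, port, index name and doc type are assumptions.
result_backend = "elasticsearch://example.com:9200/index_name/doc_type"

# Write extended task attributes (name, args, kwargs, worker, retries,
# queue, delivery_info) to the backend along with each result.
result_extended = True

# Report a "started" state when a worker begins executing a task.
task_track_started = True

# Acknowledge messages after the task has run instead of just before.
task_acks_late = True
```

With result_extended enabled, the queue should then be part of what is stored for each task, so it can be read from the stored result rather than from a signal handler.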
I have a task chain:
result = celery.chain(task_a.s(), task_b.s())()
I am interested only in the result of task_b(); however, celery saves the results of both task_a() and task_b() to the backend.
Is there any way to store the result only for task_b()?
I haven't tested this, but based on the docs (1, 2)
it should be possible to add an ignore_result=True parameter to the .s() calls.
If the above doesn't work, you can always configure the whole task
to not store results (by adding ignore_result=True to the task class or decorator).
Important: as per Celery docs, tasks used within chords cannot ignore their results.
So while it shouldn't concern chains, it's something to be aware of if you plan to use chords.
Try combining link with ignore_result for the first task:
add.apply_async((2, 2), ignore_result=True, link=add.s(16))
With Airflow, is it possible to restart an upstream task if a downstream task fails? This seems to be against the "Acyclic" part of the term DAG. I would think this is a common problem though.
Background
I'm looking into using Airflow to manage a data processing workflow that has been managed manually.
There is a task that will fail if a parameter x is set too high, but increasing the parameter value gives better quality results. We have not found a way to calculate a safe but maximally high parameter x. The process by hand has been to restart the job if failed with a lower parameter until it works.
The workflow looks something like this:
Task A - Gather the raw data
Task B - Generate config file for job
Task C - Modify config file parameter x
Task D - Run the data manipulation Job
Task E - Process Job results
Task F - Generate reports
Issue
If task D fails because of parameter x being too high, I want to rerun task C and task D. This doesn't seem to be supported. I would really appreciate some guidance on how to handle this.
First of all: that's an excellent question; I wonder why it hasn't been discussed widely until now.
I can think of two possible approaches:
Fusing Operators: As pointed out by @Kris, combining operators together appears to be the most obvious workaround
Separate Top-Level DAGs: Read below
Separate Top-Level DAGs approach
Given
Say you have tasks A & B
A is upstream to B
You want execution to resume (retry) from A if B fails
(Possible) idea, if you're feeling adventurous:
Put tasks A & B in separate top-level DAGs, say DAG-A & DAG-B
At the end of DAG-A, trigger DAG-B using TriggerDagRunOperator
In all likelihood, you will also have to use an ExternalTaskSensor after TriggerDagRunOperator
In DAG-B, put a BranchPythonOperator after Task-B with trigger_rule=all_done
This BranchPythonOperator should branch out to another TriggerDagRunOperator that then invokes DAG-A (again!)
Useful references
Fusing Operators Together
Wiring Top-Level DAGs together
EDIT-1
Here's a much simpler way that can achieve similar behaviour
How can you re-run upstream task if a downstream task fails in Airflow (using Sub Dags)
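For the "fusing operators" workaround, the retry logic for tasks C and D can live inside one Python callable. The following is only a sketch under assumptions: JobFailed, fake_job and the candidate values for x are made up for illustration, and in Airflow the function would be wrapped in something like a PythonOperator's python_callable.

```python
class JobFailed(Exception):
    """Raised by the hypothetical job when parameter x is too high."""


def run_with_decreasing_x(run_job, x_candidates):
    """Try each x from highest to lowest; return (x, result) of the first success."""
    last_error = None
    for x in sorted(x_candidates, reverse=True):
        try:
            return x, run_job(x)
        except JobFailed as exc:
            last_error = exc  # x was too high, fall through to the next lower value
    raise RuntimeError("job failed for every candidate x") from last_error


def fake_job(x):
    # Stand-in for Task D: pretend that anything above 7 fails.
    if x > 7:
        raise JobFailed(f"x={x} is too high")
    return f"processed with x={x}"


best_x, output = run_with_decreasing_x(fake_job, [10, 8, 6, 4])
```

The trade-off is that the DAG no longer shows C and D as separate tasks, but the graph stays acyclic and the "lower x and try again" loop is ordinary code.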
My scenario is as follows: I have a large machine learning model, which is computed by a bunch of workers. In essence, the workers each compute their own part of the model and then exchange results in order to maintain a globally consistent state of the model.
So every celery task computes its own part of the job.
But this means that tasks aren't stateless, and here is my trouble: if I call some_task.delay(123, 456), in reality I'm NOT sending just two integers!
I'm sending the whole state of the task, which is pickled somewhere in Celery. This state is typically about 200 MB :-((
I know that it's possible to select a decent serializer in Celery, but my question is how NOT to pickle just ANY data that could be in the task.
How to pickle arguments of task only?
Here is a citation from celery/app/task.py:
def __reduce__(self):
    # - tasks are pickled into the name of the task only, and the receiver
    # - simply grabs it from the local registry.
    # - in later versions the module of the task is also included,
    # - and the receiving side tries to import that module so that
    # - it will work even if the task has not been registered.
    mod = type(self).__module__
    mod = mod if mod and mod in sys.modules else None
    return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I just forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data. Once you go over a certain limit, you start to prevent the proper operation of its primary task: the communication of task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (to e.g. facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
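As a minimal sketch of that suggestion: the task body writes the heavy data to a dedicated store and returns only a small reference. Here DATA_STORE is an in-memory dict standing in for a separate Redis instance or MongoDB, and compute_model_part, save_blob and load_blob are hypothetical names chosen for illustration.

```python
import hashlib
import pickle

# Stand-in for a dedicated data store (separate Redis instance, MongoDB, ...).
DATA_STORE = {}


def save_blob(payload):
    """Serialize the payload, store it, and return a short key that refers to it."""
    raw = pickle.dumps(payload)
    key = hashlib.sha256(raw).hexdigest()
    DATA_STORE[key] = raw
    return key


def load_blob(key):
    """Fetch and deserialize a payload previously stored with save_blob()."""
    return pickle.loads(DATA_STORE[key])


def compute_model_part(part_id, size):
    """Hypothetical task body: heavy data goes to the store, metadata is returned."""
    weights = [float(i) for i in range(size)]  # pretend this is large
    key = save_blob(weights)
    # Only this small dict would travel through the results backend.
    return {"part_id": part_id, "blob_key": key, "n_weights": len(weights)}


meta = compute_model_part(part_id=3, size=1000)
restored = load_blob(meta["blob_key"])
```

Other workers that need the actual data can look it up by blob_key, while the results backend only ever sees the small metadata dict.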
You have to set ignore_result on your task, as it says in the docs:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
This is a little off-topic, but still.
Here is what I understand is happening: you have several processes which do heavy calculations in parallel, with inter-process communication. So, instead of celery, which doesn't fit your case well, you could:
use zmq for inter-process communication (to send only the necessary data),
use supervisor for managing and running processes (numprocs in particular will help with running multiple identical workers).
While this will not require writing your own celery, some code will have to be written.
I have a bunch of Feed objects in my database, and I'm trying to get each Feed to be updated every hour. My issue here is that I need to make sure there aren't any duplicate updates -- it needs to happen no more than once an hour, but I also don't want feeds waiting two hours for an update. (It's okay if it happens every hour +/- a few minutes, but twice in a few minutes is bad.)
I'm using Django and Celery with Amazon SQS as a broker. I have the feed update code set up as a Celery task, but I'm failing to find a way to prevent duplicates while remaining compatible with Celery running on multiple nodes.
My current solution is to add a last_update_scheduled attribute to the Feed model and run the following task every 5 minutes (pseudo-code):
now = datetime.now()
threshold = now - timedelta(seconds=3600)
for f in Feed.objects.filter(Q(last_update_scheduled__lt=threshold) |
                             Q(last_update_scheduled=None)):
    updateFeed.delay(f)
    f.last_update_scheduled = now
    f.save()
This is susceptible to a number of synchronization issues. For example, if my task queues get backed up, this task could run twice at the same time, causing duplicate updates. I've seen some solutions for this (like Celery's recipe and an adaptation on Stack Overflow), but the memcached solution isn't reliable, e.g. duplicates could happen when restarting memcached or if it happens to run out of memory and purge old data. Not to mention I'd hate to have to add memcached to my production configuration just for a simple lock.
In a perfect world, I'd like to be able to say:
@modelTask(Feed, run_every=3600)
def updateFeed(feed):
    # do something expensive
But so far my imagination fails me on how to implement that decorator.
To be clear, the Celery recipe is not using memcached per se, but rather Django's caching middleware. There are a number of other caching methods that would suit your needs without the downside of memcached. See the Django caching documentation for details.
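The locking pattern itself relies on an atomic "add only if absent" operation, which is what Django's cache.add() offers. The sketch below imitates that operation with a small in-memory class so it runs on its own. ExpiringCache, update_feed and the key format are made up for illustration, and unlike the Celery recipe the lock here is simply left to expire rather than released.

```python
import threading
import time


class ExpiringCache:
    """In-memory stand-in for a cache backend with an atomic add()."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, value, timeout):
        """Store value only if key is absent or expired; return True if stored."""
        now = time.monotonic()
        with self._lock:
            entry = self._data.get(key)
            if entry is not None and entry[1] > now:
                return False
            self._data[key] = (value, now + timeout)
            return True


cache = ExpiringCache()
updated = []


def update_feed(feed_id, min_interval=3600):
    # The lock is left to expire, so a second call within min_interval is skipped.
    if not cache.add(f"feed-update-{feed_id}", "locked", min_interval):
        return False
    updated.append(feed_id)  # the expensive update would run here
    return True


first = update_feed(42)
second = update_feed(42)
other = update_feed(7)
```

In a real deployment the in-memory class would be replaced by a cache backend shared between nodes. A persistent one, such as Django's database cache, avoids the restart and eviction problems mentioned in the question.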