How to avoid running previously successful tasks in Airflow? - python

I have multiple tasks that pass data objects to each other. In some tasks, if a condition is not met, I raise an exception, which causes that task to fail. When the next DAG run is triggered, the already successful tasks run once again. I'm looking for a way to skip the previously successful tasks and resume the DAG run from the failed task in the next DAG run.

As mentioned, every DAG has its own set of tasks, and those tasks are executed on every run. To avoid re-running previously successful tasks, you can have each task check an external flag before doing its work: via Airflow XComs or Airflow Variables, by querying the metadata database for the status of previous runs, or by storing a variable in something like Redis or a similar external database.
Using that flag, you can then skip the execution of a task and directly mark it successful, until the run reaches the task that still needs to complete.
Of course, you need to be mindful of potential race conditions if DAG run times can overlap.
def task_1(**kwargs):
    if external_variable:
        # Work was already done in a previous run; skip it and end as success.
        pass
    else:
        perform_task()
    return True
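
A slightly more concrete sketch of the same idea, assuming the flag lives in an Airflow Variable (the variable name task_1_done is illustrative, not a fixed convention; perform_task is the placeholder from above):

from airflow.models import Variable

def task_1(**kwargs):
    # Hypothetical flag written by a previous successful run of this task.
    if Variable.get("task_1_done", default_var="false") == "true":
        return True  # nothing to do; the task instance still ends as success
    perform_task()                        # the actual work
    Variable.set("task_1_done", "true")   # remember success for the next DAG run
    return True

Remember to delete or reset the Variable once the whole DAG run finally succeeds, otherwise the task will be skipped forever.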

Related

How to get all running tasks in luigi, including dynamic ones

I currently have a list of tasks that I run through the command
luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
I would like to get the status of the tasks programmatically; I could not use the central scheduler for my application, hence the local one. How can I get the list of running, pending, and completed tasks?
At first, I used to check for the creation of the output files of some known tasks, but this is no longer practical since the complexity has increased and I also have dynamic tasks (yielded by other tasks at runtime).
I noticed that I could list dependencies through:
import luigi.tools.deps
luigi.tools.deps.find_deps(my_main_task, luigi.tools.deps.upstream().family)
but it does not help any more than looking at each task's output().
I have also noticed that Worker has a nice _running_tasks attribute, so I would need to get hold of the worker and list it; but I also wonder what happens if I have more than one worker with pending tasks alongside running ones.
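For what it's worth, detailed_summary=True already makes luigi.build return a result object that can be inspected once the run has finished; this does not give live status, but it covers the completed/failed/pending breakdown after the fact (a sketch; the attribute names are from recent luigi versions, so treat them as assumptions):

import luigi

result = luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
print(result.status)        # overall LuigiStatusCode, e.g. SUCCESS
print(result.summary_text)  # textual breakdown of completed / failed / pending tasks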

Airflow: Best way to store a value in Airflow task that could be retrieved in the recurring task runs

In Airflow, for a DAG, I'm writing a monitoring task which will run again and again until a certain condition is met. In this task, when some event happens, I need to store the timestamp, retrieve this value on the next run of the same task, and update it again if required.
What's the best way to store this value?
So far I have tried the approaches below:
storing in XComs, but the value could not be retrieved on the next task run, as the XCom entry gets deleted on each new run of the same task within a DAG run.
storing in Airflow Variables - this serves the purpose; I can store, update, and delete as needed, but it doesn't look clean for my use case, as a lot of new Variables get generated per DAG and we have over 2k DAGs (pipelines).
global variables in the Python class, but the value gets overridden on the next task run.
Any suggestion would be helpful.
If you have a task that is re-run with the same execution_date, using Airflow Variables is your best choice. XCom will, by definition, be deleted when you re-run the same task with the same execution date, so it won't persist.
Basically, what you want to do is store the "state" of task execution, which is somewhat against Airflow's principle of idempotent tasks (re-running a task should produce the same final result every time you run it). You, on the other hand, want to store state between re-runs and have the task behave differently on subsequent re-runs, based on the stored state.
Another option you could use is to store the state in external storage (for example, an object in S3). This might be better for performance if you do not want to load your database too much. You could come up with a naming convention for such state objects, pull the object at the start of the task, and push it when the task finishes, as sketched below.
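A minimal sketch of that convention with boto3 (the bucket name, key layout, and JSON format are all illustrative assumptions):

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-airflow-state"  # hypothetical bucket dedicated to task state

def load_state(dag_id, task_id):
    # Pull the task's state object at the start; empty state on the first run.
    key = f"state/{dag_id}/{task_id}.json"  # the naming convention
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return json.loads(body)
    except s3.exceptions.NoSuchKey:
        return {}

def save_state(dag_id, task_id, state):
    # Push the updated state back when the task finishes.
    key = f"state/{dag_id}/{task_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(state))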
You could use XComs with include_prior_dates parameter. Docs state the following:
include_prior_dates (bool) -- If False, only XComs from the current execution_date are returned. If True, XComs from previous dates are returned as well.
(Default value is False)
Then you would do: xcom_pull(task_ids='previous_task', include_prior_dates=True)
I haven't tried it personally, but it looks like this may be a good solution for your case.
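For illustration, here is how that pull might look inside the monitoring task's Python callable (the task id "monitor", the event_happened helper, and the timestamp format are assumptions):

from datetime import datetime, timezone

def monitor(**context):
    ti = context["ti"]
    # include_prior_dates=True also returns XComs pushed by earlier runs
    # of this task, not just those from the current execution_date.
    last_seen = ti.xcom_pull(task_ids="monitor", include_prior_dates=True)
    if event_happened(last_seen):  # hypothetical condition check
        # The return value is pushed as this run's XCom, so the next run
        # will pull the updated timestamp.
        return datetime.now(timezone.utc).isoformat()
    return last_seen  # carry the previous timestamp forward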

Celery Django running periodic tasks after the previous one is done [django-celery-beat]

I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as my task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and that causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"

Is there a way to avoid starting the same task if the previous one isn't done?
There is definitely a way. It's locking.
There is a whole page about it in the Celery documentation - Ensuring a task is only executed one at a time.
Shortly explained - you can use a cache or even a database to put a lock in, and then every time a task starts, check whether that lock is still in use or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock, and the expiration should be set long enough in case the task is still running.
There already is a good thread on SO - link.
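A minimal sketch of that locking pattern using the Django cache (the lock name and the 15-minute expiry are illustrative; BlockchainVerify is from the question):

from celery import shared_task
from django.core.cache import cache

from .utils.database_blockchain import BlockchainVerify

LOCK_EXPIRE = 60 * 15  # seconds; lock auto-expires in case the task dies

@shared_task()
def run_function():
    # cache.add only sets the key if it does not already exist (atomic on
    # backends like Memcached or Redis), so one invocation gets the lock.
    if not cache.add("run_function-lock", "locked", LOCK_EXPIRE):
        return "Skipped: previous run still in progress"
    try:
        build_block = BlockchainVerify()
        return "Database updated"
    finally:
        cache.delete("run_function-lock")  # release even if the work fails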

What happens if you run the same DAG multiple times while it is already running?

What happens if the same DAG is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently triggered a DAG manually, and it was still running when its actual scheduled run time passed; at that point, from the perspective of the web server UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how the run was triggered and whether the DAG has a schedule. If the run is based on the schedule defined in the DAG (say, a task that runs daily), is incomplete / still working, and you click rerun, then that instance of the task - i.e. the one for today - will be rerun. Likewise for any other frequency.
If you wanted to rerun other instances, you would need to delete them from the previous jobs, as described by @lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule.
If you trigger a DAG run, it will get the trigger's execution timestamp, and the run will be displayed separately from the scheduled runs, as described in the documentation here: external-triggers documentation.
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.
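As an aside, if you never want scheduled and manual runs to overlap, you can cap concurrency at the DAG level with max_active_runs (a sketch; the dag_id, schedule, and start date are illustrative):

from datetime import datetime
from airflow import DAG

# With max_active_runs=1 the scheduler will not start a new run of this DAG
# while another run (scheduled or manually triggered) is still active.
dag = DAG(
    dag_id="example_dag",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    max_active_runs=1,
)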

Re-running Failed SubDAGs

I've been playing around with SubDAGs. A big problem I've faced is that whenever something within the SubDAG fails and I re-run things by hitting Clear, only the cleared task re-runs; the success does not propagate to downstream tasks in the SubDAG to get them running.
How do I re-run a failed task in a SubDAG such that the downstream tasks will flow correctly? Right now, I literally have to re-run every task in the SubDAG that is downstream of the failed task.
I think I followed the best practices for SubDAGs: the SubDAG inherits the parent DAG's properties wherever possible (including schedule_interval), and I don't turn the SubDAG on in the UI; the parent DAG is on and triggers it instead.
A bit of a workaround, but in case you have given your tasks task_ids consistently, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies.
(Remember to add the rest of the command-line options, like --start_date.)
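For illustration, a hypothetical invocation (Airflow 1.x CLI; the regex, dates, and parent_dag.subdag_task id are made up - a SubDAG's dag_id is the parent's id followed by the SubDagOperator task's id):

airflow backfill -t "transform_.*" -s 2019-01-01 -e 2019-01-01 parent_dag.subdag_task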
