Job status in Airflow - python

I have created a DAG with 3 tasks. One of the tasks failed, so in the Airflow UI the job is shown as failed. But when I do a backfill of the previous tasks, they now succeed, yet the status of the job remains unchanged: it still shows failed even though the individual tasks have succeeded. Am I missing something?

Backfill in Airflow doesn't work the same way as scheduler runs. The backfill command doesn't create DAG runs; it simply queues the tasks of the DAG. The scheduler, on the other hand, creates DAG runs in the database and uses them to schedule tasks. Therefore, when the backfill is done, no actual DAG run objects are created or updated in the database, and the circle keeps its old color (state).
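If you want the circle to turn green once the tasks have succeeded, one workaround is to update the stored DAG run state yourself, either from the UI (Browse > DAG Runs) or with a small script against the metadata database. A minimal sketch, assuming a hypothetical dag_id "my_dag" and a configured Airflow environment (not part of the original answer); verify that the run's tasks really did succeed before flipping its state:

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()
# Find the runs of this (hypothetical) DAG whose circle is still red.
stale_runs = (
    session.query(DagRun)
    .filter(DagRun.dag_id == "my_dag", DagRun.state == State.FAILED)
    .all()
)
for run in stale_runs:
    print(run.execution_date, run.state)  # the circle color reflects this state
    run.state = State.SUCCESS             # flip the stale failed run to success
session.commit()
session.close()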

Related

How to limit the number of tasks run simultaneously in a specific DAG in Airflow?

I have a DAG in Airflow that looks like this:
start >> [task1,task2,....task16] >> end
I want to limit the number of tasks running simultaneously in this DAG to, for example, 4.
I know there is the 'max_active_tasks_per_dag' parameter, but it affects all DAGs, and in my case I want to define it only for my DAG.
How can I do that? Is it even possible?
If you look at the description of the parameter, you will see that you can configure it per DAG:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-active-tasks-per-dag
max_active_tasks_per_dag (new in version 2.2.0):
The maximum number of task instances allowed to run concurrently in each DAG. To calculate the number of tasks that is running concurrently for a DAG, add up the number of running tasks for all DAG runs of the DAG. This is configurable at the DAG level with max_active_tasks, which is defaulted as max_active_tasks_per_dag.
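For example, a minimal sketch of a per-DAG cap using the DAG-level max_active_tasks argument (Airflow 2.2+; older versions use the concurrency argument). The DAG id, dates and dummy tasks below are placeholders, not taken from the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="limited_parallelism_dag",   # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    max_active_tasks=4,                 # at most 4 of task1..task16 run at once
) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")
    tasks = [DummyOperator(task_id=f"task{i}") for i in range(1, 17)]
    start >> tasks >> end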

How do you mark DAG run as success when majority of tasks succeed, and only a few fail?

I have a DAG that runs hundreds of tasks. For some of the tasks the failures are handled elsewhere, so it is OK if they fail. However, Airflow marks the whole DAG run as a failure.
What I want to do is the following: I want to count the tasks, and if more than a certain percentage succeed, mark the DAG run as a success.
You can achieve this by defining an Airflow trigger rule on a final join task:
all_done: all parents are done with their execution
op = DummyOperator(task_id='join', dag=dag, trigger_rule='all_done')
https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#trigger-rules
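Building on that, here is a minimal sketch (the 90% threshold, the task names, and the existing dag object are assumptions, not part of the answer above) of a final task with trigger_rule='all_done' that inspects the upstream task states and only fails the run when the success rate is too low:

from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

def check_success_rate(**context):
    dag_run = context["dag_run"]
    # Look at every other task instance in this DAG run; with all_done they
    # have all finished by the time this task starts.
    tis = [ti for ti in dag_run.get_task_instances() if ti.task_id != "join"]
    succeeded = sum(1 for ti in tis if ti.state == State.SUCCESS)
    if tis and succeeded / len(tis) < 0.9:  # hypothetical 90% threshold
        raise AirflowFailException("too many upstream tasks failed")

join = PythonOperator(
    task_id="join",
    python_callable=check_success_rate,
    trigger_rule=TriggerRule.ALL_DONE,  # run even if upstream tasks failed
    dag=dag,
)

For this to control the run's state, the join task needs to sit downstream of all the tasks you are counting, so that it is the only leaf task in the DAG.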

What happens if you run the same DAG multiple times while it is already running?

What happens if the same DAG is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently manually triggered a DAG that was still running when its actual scheduled run time passed, at which point, from the perspective of the webserver UI, it appeared to start running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, this depends on how the run was triggered and whether the DAG has a schedule. If it's based on the schedule defined in the DAG, say a task that runs daily, and that run is incomplete / still working and you click rerun, then this instance of the task will be rerun, i.e. the one for today. Likewise if the frequency were any other unit of time.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule.
If you trigger a DAG run, it will get the trigger's execution timestamp and the run will be displayed separately from the scheduled runs, as described in the documentation here: external-triggers documentation.
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
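For illustration (the dag_id and run_id below are hypothetical, not from the question), such an externally triggered run with its own run_id could be created like this and would show up in the UI next to the scheduled runs:

airflow trigger_dag -r manual_rerun_1 my_dag

In newer Airflow versions the equivalent command is airflow dags trigger --run-id manual_rerun_1 my_dag.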
In your instance it sounds like the latter. Hope that helps.

How to avoid running previously successful tasks in Airflow?

I have multiple tasks that pass data objects to each other. In some tasks, if a condition is not met, I raise an exception, which leads to the failure of that task. When the next DAG run is triggered, the already successful tasks run once again. I'm looking for a way to avoid running the previously successful tasks and to resume the DAG run from the failed task in the next DAG run.
As mentioned, every DAG has its set of tasks that are executed every run. To avoid running previously successful tasks, you could check an external flag via Airflow XComs or Airflow Variables, or query the metadata database for the status of previous runs. You could also store the flag in something like Redis or a similar external database.
Using that flag you can then skip the execution of a task and directly mark it successful, until the run reaches the task that still needs to be completed.
Of course, you need to be mindful of potential race conditions if the DAG run times can overlap.
def task_1(**kwargs):
    # 'external_variable' and 'perform_task' are placeholders for the flag
    # lookup and the real work described above.
    if external_variable:
        pass  # already done in a previous run, so skip the work
    else:
        perform_task()
    return True
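A minimal sketch of the same idea with an Airflow Variable as the external flag (the variable name, the task and perform_task are hypothetical placeholders, not from the answer above):

from airflow.models import Variable

def task_1(**kwargs):
    # Skip the real work if a previous DAG run already completed it.
    if Variable.get("task_1_done", default_var=None) == "done":
        return True
    perform_task()                       # placeholder for the real work
    Variable.set("task_1_done", "done")  # record success for the next run
    return True

Remember to reset or delete the Variable when you want the task to do its work again.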

Re-running Failed SubDAGs

I've been playing around with SubDAGs. A big problem I've faced is that whenever something within the SubDAG fails and I re-run things by hitting Clear, only the cleared task re-runs; the success does not propagate to downstream tasks in the SubDAG and get them running.
How do I re-run a failed task in a SubDAG such that the downstream tasks will flow correctly? Right now, I have to literally re-run every task in the SubDAG that is downstream of the failed task.
I think I followed the best practices of SubDAGs; the SubDAG inherits the Parent DAG properties wherever possible (including schedule_interval), and I don't turn the SubDAG on in the UI; the parent DAG is on and triggers it instead.
A bit of a workaround, but in case you have given your tasks consistent task_ids, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies.
(remember to add the rest of the command line options, like --start_date).
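For example (the task naming pattern, dates, and DAG id below are hypothetical; a SubDAG's dag_id is usually parent_dag_id.subdag_task_id):

airflow backfill -t "load_.*" -s 2018-01-01 -e 2018-01-02 parent_dag.my_subdag

This backfills the matching tasks (and their dependencies, as described above) within the SubDAG for the given date range.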
