I am experimenting with Airbnb's Airflow. When I run it with the 'backfill' option for one day with a timedelta of 60 minutes, only 13 instances are executed. The rest are shown as waiting and are never executed.
Please provide more information, such as which Airflow version you are using and which executor you defined in "airflow.cfg".
For the CeleryExecutor, you need "airflow worker" running alongside the scheduler.
Also, your DAG runs may stay pending if the end date ("-e END_DATE") has been reached.
What happens if the same dag is triggered concurrently (or such that the run times overlap)?
I am asking because I recently triggered a DAG manually, and it was still running when its scheduled run time passed. At that point, from the perspective of the web-server UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, this depends on how the run was triggered and whether the DAG has a schedule. If the run comes from the schedule defined in the DAG (say, a task that runs daily), it is incomplete or still working, and you click rerun, then that instance of the task is rerun, i.e. the one for today. The same applies to any other schedule frequency.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run manually, it gets the trigger's execution timestamp, and the run is displayed separately from the scheduled runs, as described in the external-triggers documentation:
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.
I am extremely confused by something in our Airflow UI. In the tree view (and the graph view), a DAG is shown as failed. However, all of its member tasks appear to have succeeded. You can see it here below (third from the end):
Does anyone know how this is possible, what it means, or how one would investigate it?
I have experienced the same. All tasks complete with success, but the DAG fails. Did not find anything in any logs.
In my case, the DAG's dagrun_timeout was set too low: my tasks ran for more than 30 minutes in total.
dag = DAG(...,
          dagrun_timeout=timedelta(minutes=30),
          ...)
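To illustrate why every task can succeed while the run itself is marked failed, here is a minimal sketch of the kind of check involved. This is not Airflow's code: the function name run_exceeds_timeout and the timestamps are hypothetical, and it only assumes that the run's elapsed time is compared against dagrun_timeout independently of the task states.

```python
from datetime import datetime, timedelta


def run_exceeds_timeout(run_start, now, dagrun_timeout):
    """Return True once a run has been active longer than dagrun_timeout."""
    return (now - run_start) > dagrun_timeout


# Hypothetical values: a 30-minute timeout, as in the snippet above.
timeout = timedelta(minutes=30)
run_start = datetime(2019, 1, 1, 0, 0)

# A run whose tasks finish after 20 minutes stays within the limit.
within_limit = run_exceeds_timeout(run_start, run_start + timedelta(minutes=20), timeout)
# A run whose tasks need 45 minutes goes over it, even if each task succeeds.
over_limit = run_exceeds_timeout(run_start, run_start + timedelta(minutes=45), timeout)
print(within_limit, over_limit)
```

If your tasks legitimately need more time, raising dagrun_timeout above the longest expected total runtime should avoid the mismatch.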
I am on Airflow version 1.10.1.
I have created a DAG with 3 tasks. One of the tasks failed, so the Airflow UI shows the job as failed. But when I backfill the previous tasks, they now succeed, yet the status of the job remains unchanged: it still shows failed even though the individual tasks have succeeded. Am I missing something?
Backfill in Airflow doesn't work the same way as scheduler runs. The backfill command doesn't create DAG runs but simply queues the tasks of the DAG. The scheduler, however, creates DAG runs in the DB and uses them to schedule tasks. Therefore, when the backfill is done, no DAG run objects are created or updated in the DB, so the circle keeps its old color (state).
Let's say the scheduler is stopped for 5 hours and I have a DAG scheduled to run twice every hour. When I restart the scheduler, I do not want Airflow to backfill all the instances that were missed. Instead, I want it to continue from the current hour.
To achieve this behavior, you can add the LatestOnlyOperator, which was recently introduced to master, to the start of your DAG. It is not yet part of a released version, though (1.7.1.3 is the latest version as of this writing).
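The idea behind a "latest only" guard can be sketched in plain Python. This is only my reading of the concept, not the operator's actual implementation: the function name is_latest_run is hypothetical, and a fixed timedelta stands in for the DAG's schedule.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, schedule_interval, now):
    """Return True if no newer run is already due for this schedule.

    A run for execution_date starts once its interval has closed, so it is
    the latest one while `now` lies in the interval that follows.
    """
    window_start = execution_date + schedule_interval
    window_end = window_start + schedule_interval
    return window_start < now <= window_end


# Hypothetical scenario: a DAG running twice an hour, scheduler restarted at 05:10.
interval = timedelta(minutes=30)
now = datetime(2016, 6, 1, 5, 10)

missed_run = is_latest_run(datetime(2016, 6, 1, 4, 0), interval, now)
current_run = is_latest_run(datetime(2016, 6, 1, 4, 30), interval, now)
print(missed_run, current_run)
```

Downstream tasks would then be skipped for every run where the check is False, so only the most recent run does real work after a restart.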
I'm sure you're no longer waiting for an answer, but for reference, this is covered here: https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls.
"When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc..."
I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. Then it will run the DAG instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is in the roadmap for Airflow, but does not currently exist.
See:
Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As it says in the documentation, make sure you have set depends_on_past=False (this is the default). I do not have airflow set up so I can't test and provide you example code at this time.
Airflow was designed with backfilling in mind, so the roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or the whole dag.
Every operator has a start_date
http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not meant to be stopped. If you start it today, you can set your task's start_date to today, which seems logical to me.
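To make the catch-up behaviour from the question concrete, here is a rough sketch that lists the execution dates a scheduler would consider due. It is an illustration, not Airflow code: the function name pending_execution_dates and the dates are hypothetical, and it assumes a run becomes due once its whole interval has elapsed.

```python
from datetime import datetime, timedelta


def pending_execution_dates(start_date, now, interval):
    """List execution dates whose interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


# Hypothetical dates: scheduler started on the morning of the 20th, daily schedule.
now = datetime(2016, 1, 20, 9, 0)
daily = timedelta(days=1)

# start_date on the 15th: the runs for the 15th to the 19th are all due.
old_start = pending_execution_dates(datetime(2016, 1, 15), now, daily)
# start_date moved to today: nothing to catch up on.
today_start = pending_execution_dates(datetime(2016, 1, 20), now, daily)
print(len(old_start), len(today_start))
```

With the start_date on the 15th this yields five due runs, matching the five scrapes described in the question, while moving start_date to the day you start the scheduler leaves nothing to catch up on.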