Why does the Airflow scheduler not run my DAG?

I'm not able to get the Airflow scheduler to run my DAG. I have checked multiple threads here on the forum, but I'm still not able to find the root cause. Of course the DAG toggle is set to ON. Below you can find the DAG definition:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
        dag_id='blablabla',
        default_args=default_args,
        description='run my DAG',
        schedule_interval='45 0 * * *',
        start_date=datetime(2021, 8, 5, 0, 45),
        max_active_runs=1,
        tags=['bla']) as dag:

    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
        dag=dag
    )
I have checked the cron expression, which seems to be fine, and start_date is hardcoded, which rules out the issue of it being set to "now". When I check the DAG run history, all other scheduled DAGs are listed there; only this one seems to be invisible to the scheduler.
Triggering the DAG manually works fine and the Python code runs properly; the issue is only with the scheduler.
What was done:
checked CRON expression
checked start_date whether it's hardcoded
tried changing start_date to date couple months ago
tried many schedule_interval values (but always daily)
checked multiple threads here but did not find anything beyond the above bullets

Looks okay. One thing that comes to mind is the once-a-day schedule interval, which sometimes causes confusion because the first run will only start at the end of the interval, i.e. the next day. Since you set your start_date to more than one day ago, that shouldn't be a problem.
To find a solution, we would need more information:
Could you post the default_args, or your full DAG?
Any details about your Airflow setup (versions, executor, etc.)
Could you check the scheduler logs for any information/errors? Specifically, $AIRFLOW_HOME/logs/dag_processor_manager.log and $AIRFLOW_HOME/logs/scheduler/[date]/[yourdagfile.py].log

The issue was resolved by the step below, found in another post:
Try creating a new Python file, copy your DAG code there, rename it so that the file is unique, and then test again. It could be that the Airflow scheduler got confused by the inconsistency between previous DAG runs' metadata and the current schedule.
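For concreteness, a minimal sketch of that workaround, reusing the DAG from the question (the new file name and the _v2 suffix are purely illustrative, and default_args is omitted to keep the snippet self-contained):
# Save this as a new, uniquely named file, e.g. blablabla_v2.py, so the
# scheduler parses it as a brand-new DAG with no conflicting run metadata.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
        dag_id='blablabla_v2',          # new, unique dag_id
        description='run my DAG',
        schedule_interval='45 0 * * *',
        start_date=datetime(2021, 8, 5, 0, 45),
        max_active_runs=1,
        tags=['bla']) as dag:

    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
    )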

Related

Why does Airflow keep running the DAG?

I am learning Airflow for a data engineering project, and I set up a DAG to retrieve a CSV file online. I was testing out the schedule_interval and set it to 30 minutes initially.
I started the Airflow scheduler at 22:17, expecting the DAG to be executed at 22:47 at the earliest. However, the DAG is running almost every second, and I can see from the log that the execution date was a few hours ago.
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Your DAG is being backfilled. Airflow will attempt to catch up to your current time from when it was started.
E.g. if the exact moment at which you launched your DAG is 10:00 AM on 6 March, and the DAG has an execution date of 6:00 AM on 6 March (assuming the same timezone), with a scheduling interval of 30 minutes, then the DAG will run immediately until it has "caught up" to 10:00 AM.
That is, it would run (10:00 AM - 6:00 AM = 4 hours; 4 hours / 30 minutes = 8) 8 times one after another until it has reached the current moment in time.
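For what it's worth, the same arithmetic in a few lines of Python (the timestamps and interval are just the example values above, not tied to any real DAG):
from datetime import datetime, timedelta

# Example values from the explanation above.
launched_at = datetime(2021, 3, 6, 10, 0)      # moment the scheduler was started
execution_date = datetime(2021, 3, 6, 6, 0)    # earliest interval still owed
interval = timedelta(minutes=30)               # schedule_interval

# Number of intervals the scheduler will backfill back to back.
missed_runs = (launched_at - execution_date) // interval
print(missed_runs)  # 8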
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Seems like it, if the DAG's execution start date is whatever time you launched your DAG at.
It would be very helpful if you could paste the DAG as well, or at least the DAG configuration object.
Make sure you set the flag catchup=False so that backfilling does not happen. The default value is True. If you did not set catchup=False, the scheduler assumes that it needs to backfill, and hence it keeps triggering catch-up runs one after another.
See the example below
dag = DAG(
    dag_id='my_test_dag',
    default_args=default_args,
    schedule_interval='1 * * * *',
    start_date=datetime(2020, 9, 22, tzinfo=local_tz),
    catchup=False,
)

What happens if the same DAG is run multiple times while it is already running?

What happens if the same dag is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently manually triggered a DAG that ended up still running when its actual scheduled run time passed, at which point, from the perspective of the web-server UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how it was triggered and whether the DAG has a schedule. If it's based on the schedule defined in the DAG, say a task that runs daily, and it is incomplete / still working when you click rerun, then this instance of the task will be rerun, i.e. the one for today. Likewise if the frequency were any other unit of time.
If you wanted to rerun other instances you would need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run then it will get the trigger's execution timestamp, and the run will be displayed separately from the scheduled runs, as described in the documentation here: external-triggers documentation
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.
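On a related note, if the goal is to avoid having a scheduled run start while a previous run is still in progress at all, one option (not from the answer above, just a common DAG setting) is to cap concurrency with max_active_runs; a minimal sketch with illustrative names:
from datetime import datetime

from airflow import DAG

# With max_active_runs=1 the scheduler will not start a new run of this DAG
# while an earlier run is still executing; it waits for the running one to finish.
dag = DAG(
    dag_id='my_non_overlapping_dag',   # hypothetical dag_id
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    max_active_runs=1,
    catchup=False,
)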

airflow dag failed... but all tasks succeeded

I am extremely confused by something in our Airflow UI. In the tree view (and the graph view), a DAG is indicated to have failed. However, all of its member tasks appear to have succeeded. You can see it here below (third from the end):
Does anyone know how this is possible, what it means, or how one would investigate it?
I have experienced the same: all tasks complete with success, but the DAG fails, and I did not find anything in any logs.
In my case, it was the DAG's dagrun_timeout setting that was set too low for my tasks, which ran for more than 30 minutes:
dag = DAG(...,
          dagrun_timeout=timedelta(minutes=30),
          ...)
I am on Airflow version 1.10.1.
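If dagrun_timeout turns out to be the culprit in your case too, the fix is simply to raise it above the longest expected run (or remove it); a sketch, where the names and the two-hour value are just examples:
from datetime import datetime, timedelta

from airflow import DAG

# Example only: give the whole DAG run more time than the slowest run needs.
# When a run exceeds dagrun_timeout, Airflow can mark the DAG run as failed
# even though the individual task instances finished successfully.
dag = DAG(
    dag_id='my_long_running_dag',        # hypothetical dag_id
    schedule_interval='@daily',
    start_date=datetime(2019, 1, 1),
    dagrun_timeout=timedelta(hours=2),   # was timedelta(minutes=30)
)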

How to deploy modified airflow dag from a different start time?

Let's say the scheduler is stopped for 5 hours and I had a DAG scheduled to run twice every hour. Now when I restart the scheduler, I do not want Airflow to backfill all the instances that were missed; instead, I want it to continue from the current hour.
To achieve this behavior, you can add the LatestOnlyOperator, which was just recently introduced to master, to the start of your DAG. It is not currently part of a released version though (1.7.1.3 is the latest version as of the writing of this post).
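A rough sketch of that pattern once LatestOnlyOperator is available in your installed version (the import path below is the Airflow 2.x one; in 1.8-1.10 it lived under airflow.operators.latest_only_operator, and all names here are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.latest_only import LatestOnlyOperator

# Tasks downstream of LatestOnlyOperator are skipped for every backfilled
# (non-latest) interval, so only the most recent run does the actual work.
with DAG(
        dag_id='latest_only_example',          # hypothetical dag_id
        schedule_interval='*/30 * * * *',      # twice every hour
        start_date=datetime(2021, 1, 1)) as dag:

    latest_only = LatestOnlyOperator(task_id='latest_only')
    do_work = BashOperator(task_id='do_work', bash_command='echo "doing the real work"')

    latest_only >> do_work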
I'm sure you're no longer waiting for an answer, but for reference, this is covered here: https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls.
"When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc..."

In Python's Airflow, how can I stop a task from running after a certain time?

I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. Then it will run the DAG instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is in the roadmap for Airflow, but does not currently exist.
See:
Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As it says in the documentation, make sure you have set depends_on_past=False (this is the default). I do not have Airflow set up, so I can't test and provide example code at this time.
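For what it's worth, a rough sketch of that kind of BranchPythonOperator hack, under the assumption that "too old" means the run's execution date is more than a day behind the current time (the names, the Airflow 2.x import paths, and the one-day threshold are all illustrative):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils import timezone

def choose_branch(execution_date, **_):
    # Route runs whose execution date is more than a day behind "now"
    # (i.e. stale backfill runs) to the no-op branch.
    if timezone.utcnow() - execution_date > timedelta(days=1):
        return 'skip_scrape'
    return 'scrape_page'

with DAG(
        dag_id='scrape_example',            # hypothetical dag_id
        schedule_interval='@daily',
        start_date=datetime(2021, 1, 15)) as dag:

    branch = BranchPythonOperator(task_id='branch', python_callable=choose_branch)
    scrape_page = BashOperator(task_id='scrape_page',
                               bash_command='echo "scraping the page"')   # stand-in for the real scraper
    skip_scrape = BashOperator(task_id='skip_scrape',
                               bash_command='echo "skipping stale backfill run"')

    branch >> [scrape_page, skip_scrape]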
Airflow was designed with backfilling in mind, so the roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date:
http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not made to be stopped. If you run it today you may set your task's start_date to today, which seems logical to me.
