Airflow: start date concepts - python

I'm working with Airflow and struggling a little bit with its concept of time. My situation is this: I would like to schedule my DAG like this:
with DAG(
    'MY_DAG',
    default_args=default_args,
    catchup=False,
    # The 1st of the month and every day from the 11th through the 20th
    schedule_interval='0 0 1,11-20 * *',
) as dag:
According to the documentation, Airflow schedules runs at the END of the interval. My understanding is this: a DAG with an hourly schedule starting at 8am will trigger its first DAG Run at 9am, and the execution_date of that DAG Run will be 8am. So at 9am, the 8am DAG Run is triggered. We can think of it as “at 9am, I’m ready to process the 8am data, so run the workflow with a data date of 8am”.
So in my case, using the same logic, the first DAG Run will be triggered on the 11th, right? And on the 1st of the next month, Airflow will execute the run for the 20th of the previous month? Am I right? If not, could you please tell me why?
Thank you guys !!!
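The reasoning in the question can be checked with a small simulation. This is only a sketch in plain Python, not Airflow's scheduler: it lists the points at which '0 0 1,11-20 * *' fires and pairs each point with the next one, on the assumption that a run's execution_date is the start of its interval and the run is triggered at the end. The date range (January 2022) and the helper name cron_points are made up for illustration, and start_date and catchup=False (which limits which of these runs are actually created) are ignored.

```python
from datetime import datetime, timedelta

# Day-of-month field "1,11-20" from the cron expression '0 0 1,11-20 * *'.
DAYS_OF_MONTH = {1, *range(11, 21)}


def cron_points(start, end):
    """Return every midnight in [start, end) whose day of month matches."""
    points = []
    day = datetime(start.year, start.month, start.day)
    if day < start:
        day += timedelta(days=1)
    while day < end:
        if day.day in DAYS_OF_MONTH:
            points.append(day)
        day += timedelta(days=1)
    return points


# Illustrative range only: January 2022 plus the first day of February.
points = cron_points(datetime(2022, 1, 1), datetime(2022, 2, 2))

# Each run covers one interval and is triggered when that interval ends.
runs = list(zip(points, points[1:]))

for execution_date, triggered_at in runs:
    print(f"execution_date={execution_date:%Y-%m-%d}  triggered at {triggered_at:%Y-%m-%d}")
```

Under these assumptions the run with the execution_date of the 1st is triggered on the 11th, and the run with the execution_date of the 20th is triggered on the 1st of the following month, which matches the understanding described in the question.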

Related

Airflow DAG Schedule Meaning

What does the below airflow dag schedule mean?
schedule: "12 0-4,14-23 * * *"
Thanks,
cha
I want to schedule an Airflow DAG to run hourly, but not between midnight and 7 in the morning. Also, I want to pass more resources during the last run of the day, so I am trying to figure out how to do that in Airflow. I usually schedule once a day at a certain hour. I want to understand how to schedule multiple times a day.
It's a cron expression. There are several tools on the internet to explain a cron expression in human-readable language. For example https://crontab.guru/#12_0-4,14-23___*:
"At minute 12 past every hour from 0 through 4 and every hour from 14 through 23."

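To see where the "hour from 0 through 4 and from 14 through 23" reading of the schedule above comes from, the individual cron fields can be expanded by hand. The helper below is a hypothetical, simplified sketch: expand_field is not part of Airflow or any cron library, and it only understands '*', single numbers, and 'a-b' ranges separated by commas (no step values such as '*/5', no month or weekday names).

```python
def expand_field(field, low, high):
    """Expand a cron field of '*', single values, and 'a-b' ranges joined by commas."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif "-" in part:
            first, last = part.split("-")
            values.update(range(int(first), int(last) + 1))
        else:
            values.add(int(part))
    return sorted(values)


# Field order in a five-field cron expression:
# minute, hour, day of month, month, day of week.
minute, hour, day_of_month, month, day_of_week = "12 0-4,14-23 * * *".split()

minutes = expand_field(minute, 0, 59)
hours = expand_field(hour, 0, 23)

print("minutes:", minutes)
print("hours:  ", hours)
```

The same helper can be used to try out an hour field that leaves out the night, for example expand_field("7-23", 0, 23) for a schedule that should not fire between midnight and 7am.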
How can I set an Airflow DAG to trigger every hour

Here is the thing: I have an Airflow DAG that triggers every day (@daily) to collect market data. For example, today is 2022-03-23, so it triggers at midnight, goes back 2 days, and collects the data for Monday. To make this happen my program needs a startDate, so I pass this as an argument to SparkSubmitOperator:
"{{ macros.ds_add(ds, -1) }}"
Now I want it to trigger every hour and apply the same kind of offset as above, but in hours.
I have tried a lot of things but nothing works. I set the DAG to @hourly, but it still doesn't work. I read the Airflow documentation but found nothing except the "ts" variable, which didn't work for me, and last but not least I'm new to Python.
Any thoughts?
Thanks in advance.
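The day-based offset above has a straightforward hour-based counterpart in plain Python. The sketch below only shows the arithmetic: ts_add_hours is a hypothetical helper (it is not an Airflow macro), and the sample timestamp is made up in the ISO 8601 shape that the "ts" template variable uses. Inside a template, an expression along the lines of {{ (execution_date - macros.timedelta(hours=1)).isoformat() }} may achieve the same thing, assuming macros.timedelta is available in your Airflow version and that your Spark job accepts an ISO timestamp.

```python
from datetime import datetime, timedelta


def ts_add_hours(ts, hours):
    """Hypothetical helper: shift an ISO 8601 timestamp by a number of hours."""
    return (datetime.fromisoformat(ts) + timedelta(hours=hours)).isoformat()


# Made-up sample value in the same shape as Airflow's "ts" variable.
sample_ts = "2022-03-23T10:00:00+00:00"
shifted = ts_add_hours(sample_ts, -1)
print(shifted)
```

Because the shift is done on a full timestamp rather than on a date string, an offset that crosses midnight moves to the previous day automatically.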

Why airflow scheduler does not run my DAG?

I'm not able to get the scheduler to run my Airflow DAG. I have checked multiple threads here on the forum, but I'm still not able to find the root cause. Of course, the DAG slider is set to ON. Below you can find the DAG definition:
with DAG(
        dag_id='blablabla',
        default_args=default_args,
        description='run my DAG',
        schedule_interval='45 0 * * *',
        start_date=datetime(2021, 8, 5, 0, 45),
        max_active_runs=1,
        tags=['bla']) as dag:
    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
        dag=dag
    )
I have checked the cron expression, which seems to be fine, and start_date is hardcoded, which rules out the issue with the start time being set to "now". When I check the DAG run history, all other scheduled DAGs are listed there; only this one seems to be invisible to the scheduler.
Triggering the DAG manually works fine and the Python code works properly. The issue is only with the scheduler.
What was done:
- checked the cron expression
- checked that start_date is hardcoded
- tried changing start_date to a date a couple of months ago
- tried many schedule_interval values (but always daily)
- checked multiple threads here but did not find anything more than the above bullets
Looks okay. One thing that comes to mind is the once-a-day schedule interval, which sometimes confuses because the first run will start at the end of the interval, i.e. the next day. Since you set your start_date to more than one day ago, that shouldn't be a problem.
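For the DAG in the question, the "first run starts at the end of the interval" rule works out as follows. This is plain date arithmetic under the assumption that '45 0 * * *' behaves like a fixed one-day interval aligned with the start_date; it is not a call into Airflow.

```python
from datetime import datetime, timedelta

start_date = datetime(2021, 8, 5, 0, 45)
interval = timedelta(days=1)  # '45 0 * * *' fires once a day at 00:45

# The first run is labelled with the start of its interval ...
first_execution_date = start_date
# ... but it is only triggered once that interval has ended.
first_trigger_time = start_date + interval

print("first execution_date:", first_execution_date)
print("first triggered at:  ", first_trigger_time)
```

So with this start_date the first scheduled run would not be expected before 2021-08-06 00:45, which is already in the past for a start_date set months ago.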
To find a solution, we would need more information:
- Could you post the default_args, or your full DAG?
- Any details about your Airflow setup (versions, executor, etc.)
- Could you check the scheduler logs for any information/errors? Specifically, $AIRFLOW_HOME/logs/dag_processor_manager.log and $AIRFLOW_HOME/logs/scheduler/[date]/[yourdagfile.py].log
The issue was resolved by the steps below, found in another post:
Try creating a new Python file, copy your DAG code there, rename it so that the file name is unique, and then test again. It could be that the Airflow scheduler got confused by an inconsistency between the metadata of previous DAG Runs and the current schedule.

Run Airflow Dag at the third of a month but not on Sundays

I'm having trouble finding the correct cron notation to schedule my DAG on the third of a month, but not on Sundays.
The following statement does not take Sundays into account:
schedule_interval='0 16 3 * *'
Can someone help?
There's unfortunately no way to express exclusions in cron.
A workaround in Airflow could be to have one task at the start which checks if the execution_date is a Sunday, and skips all remaining tasks if so.
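A minimal sketch of the check itself is shown here in plain Python. The function name is_not_sunday is made up for illustration. One possible way to wire it in, stated as an assumption rather than a tested setup, is to use it as the python_callable of a ShortCircuitOperator placed first in the DAG, so that a False result skips the downstream tasks. Note that execution_date marks the start of the schedule interval, so you may need to check the date on which the run is actually triggered instead, depending on which of the two "not on Sundays" refers to.

```python
from datetime import datetime


def is_not_sunday(date):
    """Return True unless the given date falls on a Sunday.

    datetime.weekday() counts Monday as 0, so Sunday is 6.
    """
    return date.weekday() != 6


# Illustrative dates: the 3rd of two different months in 2022.
print(is_not_sunday(datetime(2022, 3, 3)))   # a Thursday
print(is_not_sunday(datetime(2022, 4, 3)))   # a Sunday
```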
There's an Airflow AIP (it's currently being worked on) to provide more detailed scheduling intervals: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval, which would allow you to express this interval in future Airflow versions.

Why does Airflow keep running the DAG?

I am learning Airflow for a Data Engineering project, and I set up a DAG to retrieve a CSV file online. I was testing out the schedule_interval and initially set it to 30 minutes.
I started the Airflow scheduler at 22:17 and expected the DAG to be executed at 22:47 at the earliest. However, the DAG is running almost every second, and I see from the log that the execution date was a few hours ago.
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Your DAG is being backfilled. Airflow will attempt to catch up to your current time from when it was started.
E.g. if the exact moment at which you launched your DAG is 6th March, 10:00AM, and the DAG has a first execution date of 6th March 6:00AM (assuming the same timezone), with a scheduling interval of 30 mins, then the DAG will run immediately and repeatedly until it has "caught up" to 10:00AM.
That is, it would run 8 times one after another (10:00AM - 6:00AM = 4 hours; 4 hours / 30 mins = 8) until it has reached the current moment in time.
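The count of catch-up runs in that example can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only, not of Airflow's scheduler; the function name catchup_runs and the year 2022 are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def catchup_runs(start, now, interval):
    """Count the complete intervals that fit between start and now."""
    count = 0
    while start + interval <= now:
        count += 1
        start += interval
    return count


n = catchup_runs(
    start=datetime(2022, 3, 6, 6, 0),    # first execution date, 6:00AM
    now=datetime(2022, 3, 6, 10, 0),     # moment the DAG is launched, 10:00AM
    interval=timedelta(minutes=30),
)
print(n)
```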
As for the question "Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?":
It seems like it, if the DAG's start date is whatever time you launched your DAG at.
It would be very helpful if you could paste the DAG as well, or at least the DAG configuration object.
Make sure you set the flag catchup=False so that backfilling does not happen. The default value is True. If you did not set catchup=False, the scheduler assumes that it needs to backfill, which is why the DAG keeps running one run after another.
See the example below:
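from datetime import datetime
from airflow import DAG

# default_args and local_tz are assumed to be defined elsewhere
# (local_tz being a tzinfo object for your local timezone).
dag = DAG(
    dag_id='my_test_dag',
    default_args=default_args,
    schedule_interval='1 * * * *',
    start_date=datetime(2020, 9, 22, tzinfo=local_tz),
    catchup=False,  # do not backfill intervals missed since start_date
)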
