Why does Airflow keep running the DAG?

I am learning Airflow for a Data Engineering project, and I set up a DAG to retrieve a CSV file online. I was testing out the schedule_interval and initially set it to 30 minutes.
I started the Airflow scheduler at 22:17 and expected the DAG to be executed at 22:47 at the earliest. However, the DAG is running almost every second, and I can see from the log that the execution date was a few hours ago.
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?

Your DAG is being backfilled. Airflow will attempt to catch up to your current time from when it was started.
E.g., if the exact moment you launched your DAG is 6th March, 10:00 AM, and the DAG has an execution date of 6th March, 6:00 AM (assuming the same timezone), with a scheduling interval of 30 minutes, then the DAG will run immediately until it has "caught up" to 10:00 AM.
That is, it would run (10:00 AM - 6:00 AM = 4 hours; 4 hrs / 30 mins = 8) 8 times one after another until it has reached the current moment in time.
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Seems like it, if the DAG's execution start date is whatever time you launched your DAG at.

It would be very helpful if you could paste the DAG as well, or at least the DAG configuration object.
Make sure you set the flag catchup=False so that backfilling does not happen; the default value is True. If you did not set catchup=False, the scheduler assumes it needs to backfill, and hence it keeps triggering run after run.
See the example below
dag = DAG(
    dag_id='my_test_dag',
    default_args=default_args,
    schedule_interval='1 * * * *',
    start_date=datetime(2020, 9, 22, tzinfo=local_tz),
    catchup=False,
)

Related

Airflow DAG Schedule Meaning

What does the below airflow dag schedule mean?
schedule: "12 0-4,14-23 * * *"
Thanks,
cha
I want to schedule an Airflow DAG to run hourly, but not between midnight and 7 in the morning. Also, I want to pass more resources during the last run of the day, so I am trying to figure out how to do that in Airflow. I usually schedule once a day at a certain hour; I want to understand how to schedule multiple runs per day.
It's a cron expression. There are several tools on the internet that explain a cron expression in human-readable language, for example https://crontab.guru/#12_0-4,14-23_*_*_*:
"At minute 12 past every hour from 0 through 4 and every hour from 14 through 23."

Airflow: start date concepts

I'm working with Airflow and struggling a little bit with its concept of time. My situation is that I would like to schedule my DAG like this:
with DAG(
    'MY_DAG',
    default_args=default_args,
    catchup=False,
    schedule_interval='0 0 1,11-20 * *'
    # Every 1st of the month and every day between the 11th and the 20th
) as dag:
According to the documentation, Airflow schedules tasks at the END of the interval. So my understanding is: for example, for a DAG with an hourly schedule starting at 8 AM, the first DAG Run will be triggered at 9 AM, and the execution_date of that DAG Run will be 8 AM. So at 9 AM, the 8 AM DAG Run is triggered. We can think of it as "at 9 AM, I'm ready to process the 8 AM data, so run the workflow with a data date of 8 AM".
So in my case, using the same logic, the first DAG Run will be triggered on the 11th, right? And on the 1st of the next month, Airflow will execute the job for the 20th of the previous month? Am I right? If not, could you please tell me why?
Thank you guys !!!
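To illustrate the end-of-interval behaviour described above with this cron expression, here is a small sketch using croniter (the library Airflow itself uses to expand cron schedules); the dates are arbitrary:

from datetime import datetime

from croniter import croniter

cron = '0 0 1,11-20 * *'
it = croniter(cron, datetime(2021, 3, 10))
ticks = [it.get_next(datetime) for _ in range(12)]

# Each run's execution_date is the start of an interval; the run itself only
# fires at the next tick, i.e. at the end of that interval.
for interval_start, fires_at in zip(ticks, ticks[1:]):
    print(f'execution_date {interval_start:%Y-%m-%d} -> run fires at {fires_at:%Y-%m-%d}')

# The interval that starts on the 20th only closes on the 1st of the next month,
# so that run is triggered on the 1st with an execution_date of the 20th.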

Why airflow scheduler does not run my DAG?

I'm not able to get an Airflow DAG to run via the scheduler. I have checked multiple threads here on the forum, but I'm still not able to find the root cause. Of course, the DAG slider is set to ON. Below you can find the DAG information:
with DAG(
    dag_id='blablabla',
    default_args=default_args,
    description='run my DAG',
    schedule_interval='45 0 * * *',
    start_date=datetime(2021, 8, 5, 0, 45),
    max_active_runs=1,
    tags=['bla']) as dag:

    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
        dag=dag
    )
I have checked the cron expression, which seems to be fine, and start_date is hardcoded, which rules out the issue of the time being set to "now". When I check the DAG run history, all other scheduled DAGs are listed; only this one seems to be invisible to the scheduler.
Triggering the DAG manually works fine and the Python code runs properly; the issue is only with the scheduler.
What was done:
checked the cron expression
checked that start_date is hardcoded
tried changing start_date to a date a couple of months ago
tried many schedule_interval values (but always daily)
checked multiple threads here but did not find anything beyond the above bullets
Looks okay. One thing that comes to mind is the once-a-day schedule interval, which sometimes causes confusion because the first run will only start at the end of the interval, i.e. the next day. Since you set your start_date to more than one day ago, that shouldn't be a problem.
To find a solution, we would need more information:
Could you post the default_args, or your full DAG?
Any details about your Airflow setup (versions, executor, etc.)
Could you check the scheduler logs for any information/errors? Specifically, $AIRFLOW_HOME/logs/dag_processor_manager.log and $AIRFLOW_HOME/logs/scheduler/[date]/[yourdagfile.py].log
The issue was resolved by the steps below, found in another post:
Try creating a new Python file, copy your DAG code there, and rename it so that the file is unique, then test again. It could be that the Airflow scheduler got confused by the inconsistency between the previous DAG Runs' metadata and the current schedule.
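For illustration, a minimal sketch of what such a renamed copy might look like; the new file name and dag_id are hypothetical, default_args is omitted, and the Airflow 2.x import path is assumed:

# blablabla_v2.py -- a fresh copy of the DAG with a unique file name and dag_id,
# so the scheduler does not associate it with the old runs' metadata
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='blablabla_v2',                 # new, unique dag_id
    description='run my DAG',
    schedule_interval='45 0 * * *',
    start_date=datetime(2021, 8, 5, 0, 45),
    max_active_runs=1,
    tags=['bla']) as dag:

    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
    )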

Run Airflow Dag at the third of a month but not on Sundays

I'm having trouble finding the correct cron notation to schedule my DAG on the third of the month, but not on Sundays.
The following statement does not take Sundays into account:
schedule_interval='0 16 3 * *'
Can someone help?
There's unfortunately no way to express exclusions in cron.
A workaround in Airflow could be to have one task at the start which checks if the execution_date is a Sunday, and skips all remaining tasks if so.
There's an Airflow AIP (it's currently being worked on) to provide more detailed scheduling intervals: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval, which would allow you to express this interval in future Airflow versions.
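A minimal sketch of that workaround, using a ShortCircuitOperator as the first task; the DAG and task names are illustrative, and the Airflow 2.x import paths are assumed:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import ShortCircuitOperator


def _not_sunday(execution_date, **_):
    # weekday(): Monday is 0 ... Sunday is 6. Returning False short-circuits
    # (skips) all downstream tasks. Note that execution_date is the start of
    # the data interval; if you care about the actual run day instead, check
    # the run's own date rather than execution_date.
    return execution_date.weekday() != 6


with DAG(
    dag_id='third_of_month_not_sunday',    # hypothetical name
    schedule_interval='0 16 3 * *',
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    check_day = ShortCircuitOperator(task_id='skip_if_sunday', python_callable=_not_sunday)
    do_work = DummyOperator(task_id='do_work')
    check_day >> do_work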

Airflow - long running task in SubDag marked as failed after an hour

I have a SubDAG in Airflow with a long-running step (typically about 2 hours, though it varies based on which unit is being run). Under 1.7.1.3, this step would consistently cause AIRFLOW-736, and the SubDAG would stall in the 'running' state when all steps within it were successful. We could work around this, since we didn't have steps after the SubDAG, by manually marking the SubDagOperator as successful (rather than running) in the database.
We're testing Airflow 1.8.1 now, upgrading by doing the following:
Shutting down our scheduler and workers
Via pip, uninstalling airflow and installing apache-airflow (version 1.8.1)
Running airflow upgradedb
Running the airflow scheduler and workers
With the system otherwise untouched, the same DAG is now failing 100% of the time, roughly after the long-running task hits the 1-hour mark (though oddly, not exactly 3600 seconds later; it can be anywhere from 30 to 90 seconds after the hour ticks over), with the message "Executor reports task instance finished (failed) although the task says its running. Was the task killed externally?". However, the task itself continues running on the worker unabated. Somehow, the scheduler is mistaken in thinking the task failed (see this line of jobs.py), based on the state in the database, even though the actual task is running fine.
I've confirmed that, somehow, the state is 'failed' in the task_instance table of the airflow database. Thus, I'd like to know what could be setting the task state to failed when the task itself is still running.
Here's a sample dag which triggers the issue:
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator

DEFAULT_ARGS = {'owner': 'jdoe', 'start_date': datetime(2017, 5, 30)}


def define_sub(dag, step_name, sleeptime):
    op = BashOperator(
        task_id=step_name, bash_command='sleep %i' % sleeptime, queue="model", dag=dag
    )
    return dag


def gen_sub_dag(parent_name, step_name, sleeptime):
    sub = DAG(dag_id='%s.%s' % (parent_name, step_name), default_args=DEFAULT_ARGS)
    define_sub(sub, step_name, sleeptime)
    return sub


long_runner_parent = DAG(dag_id='long_runner', default_args=DEFAULT_ARGS, schedule_interval=None)

long_sub_dag = SubDagOperator(
    subdag=gen_sub_dag('long_runner', 'long_runner_sub', 7500), task_id='long_runner_sub', dag=long_runner_parent
)
If you are indeed running with Celery and Redis, have a look at the visibility timeout setting for Celery and increase it beyond the expected end time of your task.
Although we configure Celery to acknowledge tasks late (acks_late), it still has issues with tasks disappearing. We consider this a bug in Celery.
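For reference, the underlying Celery setting is the broker transport option visibility_timeout. A rough sketch in Celery-config style is below; the exact value and where it lives in your Airflow configuration depend on your versions, so treat both as assumptions and check the docs for your setup:

# Celery-level setting (celeryconfig.py style). With a Redis broker, a task
# that is not acknowledged within visibility_timeout seconds gets redelivered,
# so keep it comfortably above the longest expected task runtime.
broker_transport_options = {
    'visibility_timeout': 4 * 3600,  # e.g. 4 hours for a task that runs ~2 hours
}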
