How can I set an Airflow DAG to trigger every hour? - python

Here is the situation: I have an Airflow DAG that triggers every day (@daily) to collect market data. For example, today is 2022-03-23, so it triggers at midnight, subtracts 2 days, and collects the data for Monday. For this to work my program needs a startDate, so I pass this as an argument to SparkSubmitOperator:
"{{ macros.ds_add(ds, -1) }}"
Now I want it to trigger every hour and apply the same kind of delta, but in hours.
I have tried a lot of things but nothing works. I set the DAG to @hourly, but it still doesn't work. I read the Airflow documentation, but found nothing except the "ts" variable, which doesn't work for me either. Last but not least, I am new to Python.
Any thoughts?
Thanks in advance.
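For reference, the date arithmetic being asked for can be sketched in plain Python. ds_add below mirrors what Airflow's macros.ds_add does with a YYYY-MM-DD string; ts_add_hours is a hypothetical helper (not part of Airflow) that applies the same idea at hour granularity. The template string at the end is an untested assumption about how the same shift could be written inside a templated field, relying on macros.timedelta and execution_date being available in the template context.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S"


def ds_add(ds: str, days: int) -> str:
    """Plain-Python equivalent of macros.ds_add: shift a YYYY-MM-DD date by whole days."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


def ts_add_hours(ts: str, hours: int) -> str:
    """Hypothetical helper: shift a timestamp string by whole hours."""
    shifted = datetime.strptime(ts, TS_FORMAT) + timedelta(hours=hours)
    return shifted.strftime(TS_FORMAT)


# Assumption, not tested against a running Airflow instance: the same
# one-hour shift written as a Jinja template for an operator argument.
HOURLY_START_TEMPLATE = (
    "{{ (execution_date - macros.timedelta(hours=1))"
    ".strftime('%Y-%m-%dT%H:%M:%S') }}"
)

print(ds_add("2022-03-22", -1))
print(ts_add_hours("2022-03-23T10:00:00", -1))
```

The daily case reproduces the behaviour described in the question (the run triggered at midnight on 2022-03-23 carries ds = 2022-03-22, and subtracting one day gives Monday 2022-03-21); the hourly helper shows the analogous one-hour step.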

Related

Why airflow scheduler does not run my DAG?

I'm not able to get the scheduler to run my Airflow DAG. I have checked multiple threads on this forum, but I still can't find the root cause. Of course, the DAG toggle is set to ON. The DAG definition is below:
with DAG(
    dag_id='blablabla',
    default_args=default_args,
    description='run my DAG',
    schedule_interval='45 0 * * *',
    start_date=datetime(2021, 8, 5, 0, 45),
    max_active_runs=1,
    tags=['bla'],
) as dag:
    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
        dag=dag,
    )
I have checked the cron expression, which seems fine, and start_date is hardcoded, which rules out the issue with it being set to "now". When I check the DAG run history, all other scheduled DAGs are listed there; only this one seems to be invisible to the scheduler.
Triggering the DAG manually works fine and the Python code works properly; the issue is only with the scheduler.
What was done:
checked CRON expression
checked that start_date is hardcoded
tried changing start_date to a date a couple of months ago
tried many schedule_interval values (but always daily)
checked multiple threads here but did not find anything beyond the points above
Looks okay. One thing that comes to mind is the once-a-day schedule interval, which is sometimes confusing because the first run starts at the end of the interval, i.e. the next day. Since you set your start_date to more than one day ago, that shouldn't be a problem.
To find a solution, we would need more information:
Could you post the default_args, or your full DAG?
Could you share any details about your Airflow setup (versions, executor, etc.)?
Could you check the scheduler logs for any information/errors? Specifically, $AIRFLOW_HOME/logs/dag_processor_manager.log and $AIRFLOW_HOME/logs/scheduler/[date]/[yourdagfile.py].log
The issue was resolved by the steps below, found in another post:
Try creating a new Python file, copy your DAG code into it, rename it so that the file name is unique, and then test again. It could be that the Airflow scheduler got confused by an inconsistency between the previous DAG runs' metadata and the current schedule.

Run Airflow Dag at the third of a month but not on Sundays

I am having trouble finding the correct cron notation to schedule my DAG on the third of each month, but not on Sundays.
The following expression does not take Sundays into account:
schedule_interval='0 16 3 * *'
Can someone help?
There's unfortunately no way to express exclusions in cron.
A workaround in Airflow could be to have one task at the start which checks if the execution_date is a Sunday, and skips all remaining tasks if so.
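A minimal sketch of the check such a first task could perform is shown below. The function name should_run is hypothetical, and the code is plain Python so it can be tried without an Airflow installation.

```python
from datetime import datetime


def should_run(execution_date: datetime) -> bool:
    """Return False when the run falls on a Sunday, True otherwise.

    datetime.weekday() numbers the days Monday=0 through Sunday=6.
    """
    return execution_date.weekday() != 6


# 2022-04-03 was a Sunday, 2022-05-03 a Tuesday.
print(should_run(datetime(2022, 4, 3, 16, 0)))
print(should_run(datetime(2022, 5, 3, 16, 0)))
```

In a DAG, a callable like this could be handed to a ShortCircuitOperator as its python_callable, since that operator skips the downstream tasks when the callable returns a falsy value. How the execution date reaches the callable depends on the Airflow version, so that wiring is left out here.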
There's an Airflow AIP (it's currently being worked on) to provide more detailed scheduling intervals: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval, which would allow you to express this interval in future Airflow versions.

Why does Airflow keep running the DAG?

I am learning Airflow for a Data Engineering project, and I set up a DAG to retrieve a csv file online. I was testing out the schedule_interval and I set it to 30 mins initially.
I started the Airflow scheduler at 22:17 and expected the DAG to run no earlier than 22:47. However, the DAG is running almost every second, and I see from the log that the execution date was a few hours ago.
DAG
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Your DAG is being backfilled. Airflow will attempt to catch up to your current time from when it was started.
E.g. if the exact moment in which you launched your DAG is on 6th March, 10:00AM, and the DAG has an execution date of 6th March 6:00AM (assuming the same timezone), with a scheduling interval of 30 mins, then the DAG will run immediately until it has "caught up" to 10:00AM.
That is, it would run 8 times one after another (10:00AM - 6:00AM = 4 hours; 4 hours / 30 mins = 8) until it has reached the current moment in time.
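The arithmetic above can be checked with a few lines of plain Python. This is a simplified count under the assumptions of the example (same timezone, fixed interval); the function name pending_runs is made up for illustration, and the real scheduler may differ by one run depending on how it treats the interval that is still open.

```python
from datetime import datetime, timedelta


def pending_runs(first_execution: datetime, now: datetime, interval: timedelta) -> int:
    """Count the whole intervals that fit between the first execution date and now."""
    if now <= first_execution:
        return 0
    # timedelta // timedelta yields an int: the number of complete intervals.
    return (now - first_execution) // interval


runs = pending_runs(
    first_execution=datetime(2022, 3, 6, 6, 0),
    now=datetime(2022, 3, 6, 10, 0),
    interval=timedelta(minutes=30),
)
print(runs)
```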
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Seems like it, if the DAG's execution start date is whatever time you launched your DAG at.
It would be very helpful if you could paste the DAG as well, or at least the DAG configuration object.
Make sure you set catchup=False so that backfilling does not happen. The default value is True, so unless you set catchup=False the scheduler assumes it needs to backfill, which is why the runs fire one after another.
See the example below
dag = DAG(
    dag_id='my_test_dag',
    default_args=default_args,
    schedule_interval='1 * * * *',
    start_date=datetime(2020, 9, 22, tzinfo=local_tz),
    catchup=False,
)

Airflow: Retry up to a specific time

I need to create an Airflow job that absolutely must finish before 9h.
I currently have a job that starts at 7h, with retries=8 at a 15-minute interval (8*15m=2h). Unfortunately, my job takes more time than that, so the task ends up failing after 9h, which is the hard deadline.
How can I make it retry every 15 minutes but fail if it's after 9h, so a human can take a look at the issue?
Thanks for your help
You could use the execution_timeout argument when creating the task to control how long it will run before timing out. So if you run your task at 7AM and want it to end at 9AM, set the timeout to 2 hours. Below is an example from the Airflow documentation:
aggregate_db_message_job = BashOperator(
    task_id='aggregate_db_message_job',
    execution_timeout=timedelta(hours=2),
    pool='ep_data_pipeline_db_msg_agg',
    bash_command=aggregate_db_message_job_cmd,
    dag=dag)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
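If retries stay enabled, the timeout and the retry settings have to fit inside the 7h to 9h window together. The sketch below is one way to budget that, under the assumption that execution_timeout is applied to each attempt separately and that scheduling latency is negligible. The function name max_retries_before_deadline is hypothetical, and the 15-minute per-attempt timeout in the example is an illustrative value, not something stated in the question.

```python
from datetime import datetime, timedelta


def max_retries_before_deadline(
    start: datetime,
    deadline: datetime,
    attempt_timeout: timedelta,
    retry_delay: timedelta,
) -> int:
    """Largest retries value whose worst case still ends by the deadline.

    Worst case: every attempt runs for the full attempt_timeout, and each
    retry waits retry_delay before starting.
    """
    window = deadline - start
    if attempt_timeout > window:
        raise ValueError("a single attempt can already overrun the deadline")
    # n attempts take n * timeout + (n - 1) * delay in the worst case,
    # so n <= (window + delay) / (timeout + delay).
    attempts = (window + retry_delay) // (attempt_timeout + retry_delay)
    return attempts - 1


retries = max_retries_before_deadline(
    start=datetime(2022, 3, 23, 7, 0),
    deadline=datetime(2022, 3, 23, 9, 0),
    attempt_timeout=timedelta(minutes=15),  # illustrative value
    retry_delay=timedelta(minutes=15),
)
print(retries)
```

With a 2-hour timeout, as in the answer above, the same calculation leaves no room for retries at all, which is worth keeping in mind when choosing the two values.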

In Python's Airflow, how can I stop a task from running after a certain time?

I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. And then it will run the DAG instance for the 16th on the 20th, etc.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is in the roadmap for Airflow, but does not currently exist.
See:
Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As it says in the documentation, make sure you have set depends_on_past=False (this is the default). I do not have airflow set up so I can't test and provide you example code at this time.
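An untested sketch of what the branching decision might look like follows. The task ids scrape_page and skip_scrape, the function name choose_branch, and the two-day default threshold are all hypothetical; the code is plain Python so the logic can be checked without Airflow.

```python
from datetime import datetime, timedelta

SCRAPE_TASK_ID = "scrape_page"  # hypothetical task id
SKIP_TASK_ID = "skip_scrape"    # hypothetical task id


def choose_branch(
    execution_date: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=2),
) -> str:
    """Return the task id to follow, skipping runs that are too old.

    A daily run normally starts about one day after its execution date,
    so the default threshold is set above one day.
    """
    if now - execution_date > max_age:
        return SKIP_TASK_ID
    return SCRAPE_TASK_ID


today = datetime(2022, 3, 20)
print(choose_branch(datetime(2022, 3, 15), today))
print(choose_branch(datetime(2022, 3, 19), today))
```

BranchPythonOperator expects its callable to return the task_id of the branch to follow, so a function shaped like this could serve as that callable once the execution date is passed in from the task context.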
Airflow was designed with "backfilling" in mind, so the roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date:
http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not made to be stopped. If you run it today, you can set your task's start_date to today, which seems logical to me.
