Airflow DAG Schedule Meaning - python

What does the Airflow DAG schedule below mean?
schedule: "12 0-4,14-23 * * *"
Thanks,
cha
I want to schedule an Airflow DAG to run hourly, but not between midnight and 7 in the morning. I also want to allocate more resources to the last run of the day, so I am trying to figure out how to do this in Airflow. I usually schedule a DAG once a day at a certain hour, and I want to understand how to schedule it multiple times a day.

It's a cron expression. There are several tools on the internet to explain a cron expression in human-readable language. For example https://crontab.guru/#12_0-4,14-23___*:
"At minute 12 past every hour from 0 through 4 and every hour from 14 through 23."
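To check a reading like this without an online tool, the hour and minute fields can be expanded by hand. The following is only a rough sketch: expand_field and daily_run_times are made-up helper names, and it assumes the day, month and weekday fields are all '*'.

```python
def expand_field(field):
    """Expand a cron field made of numbers, ranges (a-b) and comma lists."""
    values = set()
    for part in field.split(","):
        if "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def daily_run_times(cron_expr):
    """Return the HH:MM times at which the expression fires on a given day.

    Only the minute and hour fields are read; the sketch assumes the
    remaining three fields are all '*'.
    """
    minute_field, hour_field = cron_expr.split()[:2]
    return [
        f"{hour:02d}:{minute:02d}"
        for hour in expand_field(hour_field)
        for minute in expand_field(minute_field)
    ]


times = daily_run_times("12 0-4,14-23 * * *")
print(len(times), times[0], times[-1])  # 15 00:12 23:12
```

For the goal described in the question (hourly, but skipping midnight to 7), an expression such as "0 7-23 * * *" would list 07:00 through 23:00, and the last entry is the run that would need the extra resources.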

Related

Run Airflow Dag at the third of a month but not on Sundays

I am having trouble finding the correct cron notation to schedule my DAG on the third of each month, but not on Sundays.
The following statement does not take Sundays into account:
schedule_interval='0 16 3 * *'
Can someone help?
There's unfortunately no way to express exclusions in cron.
A workaround in Airflow could be to have one task at the start which checks if the execution_date is a Sunday, and skips all remaining tasks if so.
There's an Airflow AIP (it's currently being worked on) to provide more detailed scheduling intervals: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval, which would allow you to express this interval in future Airflow versions.
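The Sunday check from the workaround could look roughly like this. It is only a sketch: is_not_sunday is a made-up name, and wiring it into the DAG is not shown. One option, as an assumption, would be to use it as the python_callable of a ShortCircuitOperator, which skips downstream tasks when the callable returns False.

```python
from datetime import datetime


def is_not_sunday(execution_date):
    """Return True when the remaining tasks should run.

    datetime.weekday() numbers Monday as 0 and Sunday as 6.
    """
    return execution_date.weekday() != 6


# 3 October 2021 fell on a Sunday, 3 November 2021 on a Wednesday.
print(is_not_sunday(datetime(2021, 10, 3, 16, 0)))  # False
print(is_not_sunday(datetime(2021, 11, 3, 16, 0)))  # True
```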

Why does Airflow keep running the DAG?

I am learning Airflow for a Data Engineering project, and I setup a DAG to retrieve a csv file online. I was testing out the schedule_interval and I set it to 30 mins initially.
I started the Airflow scheduler at 22:17 and expected the DAG to run for the first time around 22:47. However, the DAG runs almost every second, and I can see from the log that the execution date was a few hours ago.
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Your DAG is being backfilled. Airflow will attempt to catch up to your current time from when it was started.
E.g. if the exact moment in which you launched your DAG is on 6th March, 10:00AM, and the DAG has an execution date of 6th March 6:00AM (assuming the same timezone), with a scheduling interval of 30 mins, then the DAG will run immediately until it has "caught up" to 10:00AM.
That is, it would run 8 times, one after another (10:00AM - 6:00AM = 4 hours; 4 hrs / 30 mins = 8), until it has reached the current moment in time.
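The same arithmetic as a small sketch. catchup_run_count is a made-up helper, not part of Airflow, and it only counts complete intervals between the start date and now.

```python
from datetime import datetime, timedelta


def catchup_run_count(start_date, now, interval):
    """Number of complete schedule intervals between start_date and now."""
    if now <= start_date:
        return 0
    # Floor division of two timedeltas yields an int.
    return (now - start_date) // interval


runs = catchup_run_count(
    start_date=datetime(2021, 3, 6, 6, 0),
    now=datetime(2021, 3, 6, 10, 0),
    interval=timedelta(minutes=30),
)
print(runs)  # 8
```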
Is this because of the time difference from UTC to my local time? The DAG is trying to catch up to the time difference?
Seems like it, if the DAG's execution start date is whatever time you launched your DAG at.
It would be very helpful if you could paste the DAG as well, or at least the DAG configuration object.
Make sure you set the flag catchup=False so that backfilling does not happen. The default value is True. If you did not set catchup=False, the scheduler assumes that it needs to backfill, which is why the DAG runs again and again in quick succession.
See the example below:
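from datetime import datetime
import pendulum
from airflow import DAG

# Placeholders (assumed): the original answer does not define these two names.
local_tz = pendulum.timezone("Europe/Amsterdam")
default_args = {"owner": "airflow"}
dag = DAG(
    dag_id="my_test_dag",
    default_args=default_args,
    schedule_interval="1 * * * *",  # minute 1 of every hour
    start_date=datetime(2020, 9, 22, tzinfo=local_tz),
    catchup=False,  # do not backfill runs missed since start_date
)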

What does the landing time mean in airflow?

There is a section called "landing time" in the DAG view on the web console of airflow.
An example screen shot taken from airbnb's blog:
But what does it mean? There is no definition in the documents or in their repository.
Since the existing answer here wasn't totally clear, and this is the top hit for "airflow landing time" I went to the chat archives and found the original answer being referenced here:
Maxime Beauchemin @mistercrunch Jun 09 2016 11:12
it's the number of hours after the time the scheduling period ended
take a schedule_interval='@daily' run for 2016-01-01 that finishes at 2016-01-02 03:52:00
landing time is 3:52
https://gitter.im/apache/incubator-airflow/archives/2016/06/09
It seems the Y axis is in hours, and the negative landing times are a result of running jobs manually so they finish hours before they "should have finished" based on the schedule.
I directly asked the author Maxime. His answer was landing_time is when the job completes minus when the job should have started (for airflow, it's the end of the scheduled period).
source:
http://gitter.im/apache/incubator-airflow
It is a good place to get help, and Maxime is very nice and helpful. But the answers are not persistent.
For me, it's easier to understand landing_time using an example.
So let's say we have a DAG scheduled to run daily at 0 0 * * *. This DAG has 2 tasks that execute sequentially:
first_task >> second_task
The first_task starts at 00:00:10 and finishes 5 minutes later, at 00:05:10.
The landing_time for first_task will be 5 minutes and 10 seconds.
The second_task starts at 00:07:00 and finishes 2 minutes later, at 00:09:00. The landing_time for second_task would be 9 minutes.
So we just subtract the time the DAG run was due to start (the end of its schedule period, i.e. the execution_date plus one schedule interval) from the task's end_time.
Thanks to @Kalinde Pride for commenting and pointing me to the only source of truth, the Airflow code base.
I usually use landing_time as a metric for the performance of the whole Airflow system. For example, an increase in the landing times of the first tasks seems to mean that the scheduler is under heavy load, or that we should adapt task parallelization (through airflow.cfg).
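Following Maxime's definition quoted above (the time after the scheduling period ended), the calculation can be sketched like this. landing_time here is a made-up helper with assumed parameter names, not the function Airflow itself uses.

```python
from datetime import datetime, timedelta


def landing_time(task_end, execution_date, schedule_interval):
    """Time between the end of the schedule period and the task finishing."""
    period_end = execution_date + schedule_interval
    return task_end - period_end


one_day = timedelta(days=1)

# Maxime's example: the @daily run for 2016-01-01 finishes at 2016-01-02 03:52.
print(landing_time(datetime(2016, 1, 2, 3, 52), datetime(2016, 1, 1), one_day))  # 3:52:00

# The two sequential tasks from the example above, period ending at 00:00.
first = landing_time(datetime(2016, 1, 2, 0, 5, 10), datetime(2016, 1, 1), one_day)
second = landing_time(datetime(2016, 1, 2, 0, 9, 0), datetime(2016, 1, 1), one_day)
print(first, second)  # 0:05:10 0:09:00
```

A run triggered manually before its period has ended gives a negative result, which matches the negative values seen on the chart.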
Landing Times: Total time spent including retries.

Multiple Time Zones in Google Appengine Cron Job

I want to schedule a task for 9:00 AM in every country (basically, 9:00 AM in every time zone). How can I schedule that in Google App Engine?
Does the time zone parameter accept multiple time zones?
Thanks in advance
You can schedule a cron job to run every hour, because every hour there is 9 am somewhere.
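Inside that hourly job, the handler would then have to work out which zones are at 9 AM right now. A rough sketch with fixed UTC offsets follows; offsets_at_local_hour is a made-up helper, and real code would more likely use named time zones so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def offsets_at_local_hour(utc_now, utc_offsets_hours, target_hour=9):
    """Return the UTC offsets (in hours) whose local hour equals target_hour."""
    hits = []
    for offset in utc_offsets_hours:
        local = utc_now.astimezone(timezone(timedelta(hours=offset)))
        if local.hour == target_hour:
            hits.append(offset)
    return hits


# Whole-hour offsets from UTC-12 to UTC+14 (an illustrative list only).
all_offsets = range(-12, 15)

midnight_utc = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(offsets_at_local_hour(midnight_utc, all_offsets))  # [9]

noon_utc = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(offsets_at_local_hour(noon_utc, all_offsets))  # [-3]
```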

Python or Bash commands to determine time since a cron job string would have triggered

I'm writing a Django app (although parts can be Bash) that stores the cron job strings of many other machines. It needs to calculate the amount of time since that cron job would have triggered on that machine. Is there a python library useful for converting cron style strings to another Python friendly scheduler format that has a function for determining when that should have last triggered?
For example:
a machine has a cron job at "0 8 * * 1-5" (every weekday at 8am local time to that server). Assuming my Django app was in the same time zone, and the current time was 10:15 AM on a Tuesday, then my app would need to be able to calculate 2 hours and 15 minutes as the answer.
Celery is the package that's usually used with Django for job scheduling. It has a module for parsing cron specs. It might be of use.
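If pulling in Celery is not an option, the same question can be answered with a brute-force search over past minutes. This is not Celery's API, just a standard-library sketch with made-up helper names. It only understands '*', plain numbers, ranges and comma lists, and it expects Sunday to be written as 0.

```python
from datetime import datetime, timedelta


def parse_field(field, lo, hi):
    """Expand one cron field ('*', 'a', 'a-b', comma lists) into a set of ints."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def matches(expr, moment):
    """True if the five-field cron expression fires at this minute."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    time_ok = (
        moment.minute in parse_field(minute, 0, 59)
        and moment.hour in parse_field(hour, 0, 23)
        and moment.month in parse_field(month, 1, 12)
    )
    dom_ok = moment.day in parse_field(dom, 1, 31)
    dow_ok = cron_dow in parse_field(dow, 0, 6)
    # Standard cron ORs the two day fields when both are restricted.
    day_ok = (dom_ok or dow_ok) if dom != "*" and dow != "*" else (dom_ok and dow_ok)
    return time_ok and day_ok


def time_since_last_trigger(expr, now, max_lookback=timedelta(days=366)):
    """Step back one minute at a time until the expression matches."""
    candidate = now.replace(second=0, microsecond=0)
    earliest = now - max_lookback
    while candidate >= earliest:
        if matches(expr, candidate):
            return now - candidate
        candidate -= timedelta(minutes=1)
    return None


# 2 January 2024 was a Tuesday.
print(time_since_last_trigger("0 8 * * 1-5", datetime(2024, 1, 2, 10, 15)))  # 2:15:00
```

Stepping minute by minute is slow for sparse schedules, but it keeps the sketch short and avoids any calendar arithmetic.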
