I'm trying to figure out the best way to schedule a DAG in airflow that doesn't conform to the ways that they are typically scheduled.
The times I want the DAG to run are between 9:40 AM and 4:00 PM, Monday-Friday and to run every ten minutes.
1) cron could sort of work here as I could set up multiple DAGs that execute the same code and give them different cron triggers. For instance, trigger the first to run at 9:40 and run once, run the second at 9:50 (also run once) and then the third run from 10 AM to 4, Mon-Fri every ten minutes.
2) The Airflow presets (e.g. @hourly) or a timedelta interval wouldn't really work either, since as far as I can tell there is no way to set up an interval with the odd start time (9:40 AM) and the Mon-Fri restriction. But at least here I can set the timedelta to 10 minutes.
3) The other option would be to set the schedule_interval to None and have a second script trigger the DAG externally, using the subprocess module.
In my ideal scenario, I could write a generator which would give python datetimes that I want the dag to be triggered and give that to the DAG object. I guess I could combine that solution with 3 above.
Solution 1 could work, but seems hacky.
Wanted to know what other folks have done in this situation.
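The generator idea mentioned above could be sketched in plain Python as follows. This is only an illustration of the schedule itself (every ten minutes from 9:40 to 16:00, Monday to Friday); the function name run_times and its parameters are made up for this sketch, and how the resulting datetimes would be handed to Airflow is left open.

```python
from datetime import date, datetime, time, timedelta
from typing import Iterator


def run_times(
    start_day: date,
    end_day: date,
    first: time = time(9, 40),
    last: time = time(16, 0),
    step: timedelta = timedelta(minutes=10),
) -> Iterator[datetime]:
    """Yield every run time from `first` to `last` (inclusive) on weekdays."""
    day = start_day
    while day <= end_day:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            current = datetime.combine(day, first)
            stop = datetime.combine(day, last)
            while current <= stop:
                yield current
                current += step
        day += timedelta(days=1)


# Example: all run times for the week starting Monday 2024-01-01.
week = list(run_times(date(2024, 1, 1), date(2024, 1, 7)))
```

An external trigger script (option 3) could iterate over such a generator and sleep until the next datetime.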
*/10 9-16 * * 1-5
This cron expression gives you a run every 10 minutes between 9 AM and 4 PM (hours 9-16, so the last run is at 16:50) and only Monday to Friday (1-5).
I don't know how you can get the finer granularity to have 9:40am to 4pm.
*/10 indicates a run every 10 minutes
9-16 indicates runs only between hour 9 and hour 16
1-5 indicates runs as per the following table:
0 - Sun Sunday
1 - Mon Monday
2 - Tue Tuesday
3 - Wed Wednesday
4 - Thu Thursday
5 - Fri Friday
6 - Sat Saturday
7 - Sun Sunday
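One possible way to get the finer 9:40 AM to 4:00 PM window on top of a coarse cron schedule is to let cron fire every 10 minutes during hours 9-16 on weekdays and skip the runs that fall outside the window. The sketch below shows only the check; in_window, WINDOW_START and WINDOW_END are hypothetical names. In Airflow such a function could presumably serve as the callable of a ShortCircuitOperator placed first in the DAG, but that wiring is an assumption and is not shown.

```python
from datetime import datetime, time

# Assumed window boundaries, taken from the question (09:40 to 16:00).
WINDOW_START = time(9, 40)
WINDOW_END = time(16, 0)


def in_window(run_time: datetime) -> bool:
    """Return True for Mon-Fri times between 09:40 and 16:00 inclusive."""
    is_weekday = run_time.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and WINDOW_START <= run_time.time() <= WINDOW_END
```

Runs at 9:00 through 9:30 and at 16:10 through 16:50 would then start but do nothing.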
What does the below airflow dag schedule mean?
schedule: "12 0-4,14-23 * * *"
Thanks,
cha
I want to schedule an Airflow DAG to run hourly, but not between midnight and 7 in the morning. Also, I want to give the last run of the day more resources, so I am trying to figure out how to do that in Airflow. I usually schedule once a day at a certain hour; I want to understand how to schedule multiple runs per day.
It's a cron expression. There are several tools on the internet to explain a cron expression in human-readable language. For example https://crontab.guru/#12_0-4,14-23___*:
"At minute 12 past every hour from 0 through 4 and every hour from 14 through 23."
I am currently working on a program that needs to run every 14 days. I have looked into Schedule which works fine, but I have a few doubts about how to go about this.
I will create a service which will handle the execution of the python program itself on a CentOS 7 system.
The issue here is that every 14 days I will run a function that generates a lot of email addresses and send them to a support entity. I am afraid that if something unintended happens, and the program restart - the support entity will get spammed with emails outside the time frame in which they should receive emails.
As far as I can tell, Schedule does not have any way of determining if the program has restarted, and therefore a reboot of either the system or the service will cause this behaviour.
Would it be a correct solution to write a date to a text file after each completed function run, and then check that text file once a day to determine whether the function should run or not? This method would survive a service and/or system reboot, but is it a "correct" way of doing it?
****UPDATE**** Having the cron job run on specific days of the month (for example the 1st and 15th) is not sufficient. This could cause gaps in the data which the program processes. The script makes a call which pulls data from 14 days back, and this is the maximum number of days supported by the script (licensing and stuff, can't be changed, so not that important except that it is a limitation). So it needs to run on, let's say, odd or even week numbers (to get 14 days).
Any ideas on how to accomplish this given this new information?
You should look into the use of cron (or google it yourself if you don't like the link).
I suggest creating a simple Python script that is called by cron roughly every two weeks. The crontab entry could look like the following:
# this will run at 00:01 on the 1st, 16th and (where it exists) 31st of every month
1 0 */15 * * /path/to/python/script.py
# this will run at 00:01 on the 1st and 15th of every month
1 0 1,15 * * /path/to/python/script.py
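To check which values a single cron field matches, a small expansion helper can be handy. The sketch below is not a full cron parser; expand_field is a hypothetical helper that handles only "*", single values, ranges, comma lists and a "/step" suffix, and it assumes that a step counts from the start of the range.

```python
from typing import List


def expand_field(field: str, low: int, high: int) -> List[int]:
    """Expand one cron field into the sorted list of values it matches.

    `low` and `high` are the bounds of the field (e.g. 1 and 31 for day of month).
    """
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, int(step) if step else 1))
    return sorted(values)


days_step = expand_field("*/15", 1, 31)   # day-of-month field of the first entry
days_list = expand_field("1,15", 1, 31)   # day-of-month field of the second entry
```

Under these assumptions the step form counts from day 1, so it matches the 1st, 16th and 31st rather than the 15th and 30th.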
You still could make your script write some sort of result (with maybe a timestamp) to a file, so that you could easily check that file to see if it ran correctly (or log some error info).
# this will run at 00:01 on the 1st and 15th of every month
1 0 1,15 * * /path/to/python/script.py >> /path/to/logfile.log 2>&1
EDIT
You can also configure cron to run every Monday (or another day) if the 1st and 15th of every month are not sufficient. The script could then check a log file to see whether it ran the previous Monday, to ensure it only executes your business logic every 2 weeks.
# this will run at 00:01 once a week on Mondays
1 0 * * 1 /path/to/python/script.py >> /path/to/logfile.log 2>&1
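The "check when it last ran" idea could look roughly like the sketch below, which stores the date of the last completed run in a small state file and only reports the job as due once 14 days have passed. The function names and the state file are placeholders for illustration; a real script would also need to decide what happens if a run fails halfway.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Optional


def last_run(state_file: Path) -> Optional[date]:
    """Read the date of the last completed run, or None if nothing is recorded."""
    if not state_file.exists():
        return None
    text = state_file.read_text().strip()
    return date.fromisoformat(text) if text else None


def is_due(state_file: Path, today: date, interval_days: int = 14) -> bool:
    """Return True if no run is recorded or the interval has elapsed."""
    previous = last_run(state_file)
    return previous is None or today - previous >= timedelta(days=interval_days)


def record_run(state_file: Path, today: date) -> None:
    """Store today's date after the business logic has completed."""
    state_file.write_text(today.isoformat())
```

Called from the weekly cron entry, the script would exit early whenever is_due returns False, so a reboot or an extra invocation does not send the emails again.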
There is a section called "landing time" in the DAG view on the web console of airflow.
An example screen shot taken from airbnb's blog:
But what does it mean? There is no definition in the documents or in their repository.
Since the existing answer here wasn't totally clear, and this is the top hit for "airflow landing time" I went to the chat archives and found the original answer being referenced here:
Maxime Beauchemin @mistercrunch Jun 09 2016 11:12
it's the number of hours after the time the scheduling period ended
take a schedule_interval='@daily' run for 2016-01-01 that finishes at 2016-01-02 03:52:00
landing time is 3:52
https://gitter.im/apache/incubator-airflow/archives/2016/06/09
It seems the Y axis is in hours, and the negative landing times are a result of running jobs manually so they finish hours before they "should have finished" based on the schedule.
I directly asked the author Maxime. His answer was landing_time is when the job completes minus when the job should have started (for airflow, it's the end of the scheduled period).
source:
http://gitter.im/apache/incubator-airflow
It is a good place to get help and Maxime is very nice and helpful. But the answers are not persistent.
For me it's easier to understand landing_time using an example.
So let's say we have a dag scheduled to run daily at 0 0 * * *. This dag has 2 tasks that execute sequentially:
first_task >> second_task
The first_task starts at 00:00:10 and finishes 5 minutes later, at 00:05:10.
The landing_time for first_task will be 5 minutes and 10 seconds.
The second_task starts execution at 00:07:00 and finishes after 2 minutes, at 00:09:00. The landing_time for second_task would be 9 minutes.
So we just subtract the DAG run's scheduled start (00:00 here, the end of the schedule period) from the task end_time.
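The arithmetic of the example can be written out in a few lines. This is not Airflow's own implementation, just the subtraction described above; landing_time, due and the concrete dates are chosen for illustration.

```python
from datetime import datetime, timedelta


def landing_time(task_end: datetime, period_end: datetime) -> timedelta:
    """Time between the end of the schedule period and the task finishing."""
    return task_end - period_end


# The daily run in the example is due at midnight (date chosen arbitrarily).
due = datetime(2024, 1, 2, 0, 0, 0)
first_task = landing_time(datetime(2024, 1, 2, 0, 5, 10), due)
second_task = landing_time(datetime(2024, 1, 2, 0, 9, 0), due)

print(first_task)   # 0:05:10
print(second_task)  # 0:09:00
```

A task that finishes before its period has ended (for example after a manual trigger) yields a negative timedelta, which matches the negative values seen on the chart.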
Thanks to @Kalinde Pride for commenting and pointing me to the only source of truth, the Airflow code base.
I usually use landing_time as a metric for the performance of the whole Airflow system. For example, an increase in the landing_time of the first tasks suggests that the scheduler is under heavy load or that we should adjust task parallelism (through airflow.cfg).
Landing Times: Total time spent including retries.
I wanted to run my cron job as 'schedule: every saturday every 2 minutes from 01:00 to 3:00', and it won't allow this format. Is it possible to set a cron job to target another cron job? Or is my schedule possible just not in the correct format?
Unfortunately, you cannot combine the weekday option with the interval.
You could add a switch in the request handler of your cron job that just exits if the current weekday is not Saturday, while the cron job itself is scheduled "every 2 minutes from 01:00 to 03:00". But that means your handler will be called about 360 times per week to do nothing, and will only do something the other 60 times.
Alternatively, you could combine an "every saturday 01:00" cron job (as dispatcher) that creates 60 push tasks (as workers) with a countdown or ETA, spread between 01:00 and 03:00. However, I don't think the exact execution time is guaranteed.
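For the dispatcher variant, the 60 ETAs could be computed as in the sketch below. Only the timestamps are shown; actually enqueueing the push tasks with the task queue API is left out, and spread_etas is a made-up helper name.

```python
from datetime import datetime, timedelta
from typing import List


def spread_etas(
    start: datetime,
    count: int = 60,
    step: timedelta = timedelta(minutes=2),
) -> List[datetime]:
    """Return `count` ETAs spaced `step` apart, beginning at `start`."""
    return [start + i * step for i in range(count)]


# Saturday 2024-01-06, dispatcher fired at 01:00.
etas = spread_etas(datetime(2024, 1, 6, 1, 0))
```

The first ETA is 01:00 and the last one 02:58, so all 60 tasks fall inside the two-hour window.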
How do I configure a cron job to run every 5 minutes between 9:00 and 20:00,
but every 10 minutes during the rest of the day?
I would recommend just using every 5 minutes synchronized in the cron.yaml, and then just terminate immediately in the handler if the exact time is not to your liking (hour before 9 or after 20 and minute // 5 is odd, for example). GAE's cron is not very sophisticated, but running a trivial handler which just gets the time, checks whether that's OK, and terminates immediately otherwise, is pretty simple and cheap (and the 70 or so "extra hits per day", each with a trivial amount of resource consumption, will hardly make a difference to your app's overall resource consumption anyway).
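The check inside such a handler might look like the sketch below. It assumes the handler is invoked every 5 minutes on the 5-minute marks and treats 9:00 up to (but not including) 20:00 as the busy window; should_run is a hypothetical helper, and the request-handler plumbing is omitted.

```python
from datetime import datetime


def should_run(now: datetime) -> bool:
    """Decide whether this 5-minute tick should do real work."""
    if 9 <= now.hour < 20:
        return True  # busy window: every 5 minutes
    # Outside the window keep only minutes 0-4, 10-14, 20-24, ... (every 10 minutes).
    return (now.minute // 5) % 2 == 0
```

With these boundaries, 78 of the 288 daily ticks return False and terminate immediately, in line with the "70 or so" extra hits mentioned above.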
The new API for cron now can do it. Please check the document at: https://cloud.google.com/appengine/docs/python/config/cron#Python_app_yaml_The_schedule_format
every 12 hours
every 5 minutes from 10:00 to 14:00
every day 00:00
every monday 09:00
2nd,third mon,wed,thu of march 17:00
1st monday of sep,oct,nov 17:00
1 of jan,april,july,oct 00:00