I need to create an Airflow job that absolutely must finish before 9 AM.
I currently have a job that starts at 7 AM, with retries=8 at a 15-minute interval (8 * 15m = 2h). Unfortunately, my job sometimes takes longer than that, and because of this the task ends up failing after 9 AM, which is the hard deadline.
How can I make it retry every 15 minutes but fail outright once it is past 9 AM, so a human can take a look at the issue?
Thanks for your help
You could use the execution_timeout argument when creating the task to control how long it will run before timing out. So if you run your task at 7 AM and want it to end by 9 AM, set the timeout to 2 hours. Below is an example from the Airflow documentation:
from datetime import timedelta

from airflow.operators.bash_operator import BashOperator

aggregate_db_message_job = BashOperator(
    task_id='aggregate_db_message_job',
    execution_timeout=timedelta(hours=2),
    pool='ep_data_pipeline_db_msg_agg',
    bash_command=aggregate_db_message_job_cmd,
    dag=dag)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
I have a Python script where a certain job needs to be done at, say, 8 AM every day. To do this, my idea was to have a while loop keep the program running all the time, and inside the while loop use a scheduler-type package to specify the time at which a specific subroutine needs to start. That way, if there are other routines that run at different times of the day, this would still work.
import schedule

def job(t):
    print("I'm working...", t)
    return

schedule.every().day.at("08:00").do(job, 'It is 08:00')
Then I would let the Windows scheduler run this program, and be done. But I was wondering if this is terribly inefficient, since the while loop wastes CPU cycles, and it could also freeze the computer as the program grows larger in the future. Could you please advise whether there is a more efficient way to schedule tasks that need to execute down to the second, without having to run a while loop?
I noted that you have a hard time requirement for executing your script. Just set your Windows Scheduler to start the script a few minutes before 8 AM. Once the script starts, it will run your schedule code. When your task is done, exit the script. This entire process will start again the next day.
And here is the correct way to use the Python module schedule:
from time import sleep

import schedule

def job(variable):
    print(f"I'm working...{variable}")
    return

def schedule_actions():
    # Every day, job() is called at 08:00
    schedule.every().day.at('08:00').do(job, variable="It is 08:00")
    # Check whether a scheduled task is pending to run or not
    while True:
        schedule.run_pending()
        # Set the sleep time to fit your needs
        sleep(1)

schedule_actions()
Here are other answers of mine on this topic:
How schedule a job run at 8 PM CET using schedule package in python
How can I run task every 10 minutes on the 5s, using BlockingScheduler?
Execute logic every X minutes (without cron)?
Why a while loop? Why not just let your Windows Scheduler, or a cron job on Linux, run your simple Python script to do whatever it needs to do, and then stop?
Maintenance tends to become a big problem over time, so try to keep things as lightweight as possible.
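For instance, here is a minimal sketch of the cron approach, assuming the script lives at the hypothetical path /path/to/job.py (on Windows, the equivalent trigger is configured through the Task Scheduler GUI):

# Run the script every day at 08:00
0 8 * * * /usr/bin/python3 /path/to/job.py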
I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as my task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and that causes an error.
My task looks like this:
from celery import shared_task

from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one isn't done yet?
There is definitely a way. It's locking.
There is a whole page in the Celery documentation about this - Ensuring a task is only executed one at a time.
Shortly explained: you can use a cache or even the database to hold a lock, and then every time the task starts, check whether that lock is still in use or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock, and the lock expiration should be long enough just in case the task is still running.
There already is a good thread on SO - link.
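A minimal sketch of that locking pattern, assuming a Django cache backend is configured and using a hypothetical lock key:

from celery import shared_task
from django.core.cache import cache

from .utils.database_blockchain import BlockchainVerify

LOCK_EXPIRE = 60 * 15  # seconds; long enough to outlive a normal run

@shared_task()
def run_function():
    lock_id = 'run_function-lock'  # hypothetical cache key for this task
    # cache.add is atomic: it only sets the key if it does not already exist,
    # so at most one concurrent invocation can acquire the lock.
    if not cache.add(lock_id, 'locked', LOCK_EXPIRE):
        return "Skipped - previous run still in progress"
    try:
        build_block = BlockchainVerify()
        return "Database updated"
    finally:
        # Release the lock even if the task failed; the expiration above
        # covers the case where the worker dies before reaching this line.
        cache.delete(lock_id)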
I am trying to figure out how to configure a periodic task in Celery so that it is scheduled to run on load, regardless of its interval.
For example,
beat_schedule = {
    'my-task': {
        'task': 'module.my_task',
        'schedule': 60.0,
    },
}
will wait 60 seconds after the beat is started to run for the first time.
This is problematic for a longer interval, such as an hour, where the task does work that is immediately valuable but does not need to be refreshed at shorter intervals.
This question addresses the issue, but neither of the answers is satisfactory:
Adding startup lag for the task to be enqueued is both undesirable performance-wise and bad for maintainability, since the initial run and the schedule are now separated.
Re-implementing the schedule within the task is bad for maintainability.
This seems to me like something that should be obvious, so I am quite surprised that that SO question is all I can find on the matter. I am unable to figure this out from the docs or the Celery GitHub issues, so I wonder if I am missing something obvious.
Edit:
There seems to be more to the story here, because after trying a different task with an hour interval, it ran immediately when the project's Celery was started.
If I stop and clear the queue with celery purge -A proj -f and then start Celery again, the task does not run within the heartbeat interval. This would make sense, because the worker handles the messages, but beat has its own schedule record, celerybeat-schedule, which would be unaffected by the purge.
If I delete celerybeat-schedule and restart beat, the task still does not run. Starting celery beat with a non-default schedule db location also does not cause the task to run. The next time the task runs is one hour from the time I started the new beat (14:59), not one hour from the first start time of the task (13:47).
There seems to be some state, not well documented or simply unknown, that is the basis of this issue. My question can also be stated as: how do you force beat to clear its record of last runs?
I am also concerned that while running the worker and beat, running celery -A proj inspect scheduled gives - empty -, but presumably the task had to be scheduled at some point, because it gets run.
There is a section called "landing time" in the DAG view on the web console of Airflow.
An example screenshot taken from Airbnb's blog:
But what does it mean? There is no definition in the documentation or in the repository.
Since the existing answer here wasn't totally clear, and this is the top hit for "airflow landing time", I went to the chat archives and found the original answer being referenced here:
Maxime Beauchemin @mistercrunch Jun 09 2016 11:12
it's the number of hours after the time the scheduling period ended
take a schedule_interval='@daily' run for 2016-01-01 that finishes at 2016-01-02 03:52:00
landing time is 3:52
https://gitter.im/apache/incubator-airflow/archives/2016/06/09
It seems the Y axis is in hours, and the negative landing times are a result of running jobs manually so they finish hours before they "should have finished" based on the schedule.
I directly asked the author, Maxime. His answer was that landing_time is when the job completes minus when the job should have started (for Airflow, that is the end of the scheduled period).
source:
http://gitter.im/apache/incubator-airflow
It is a good place to get help, and Maxime is very nice and helpful. But the answers are not persistent.
For me, it's easier to understand landing_time using an example.
So let's say we have a DAG scheduled to run daily at 0 0 * * * (midnight). This DAG has 2 tasks that execute sequentially:
first_task >> second_task
The first_task starts at 00:00:10 and finishes 5 minutes later, at 00:05:10.
The landing_time for first_task will be 5 minutes and 10 seconds.
The second_task starts execution at 00:07:00 and finishes 2 minutes later. The landing_time for second_task would be 9 minutes.
So we just subtract from the task's end_time the end of the scheduled period (the midnight at which this daily run starts).
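As a minimal sketch of that subtraction, using hypothetical timestamps matching the example above:

from datetime import datetime

# End of the scheduled period: the midnight at which this daily run starts
scheduled_period_end = datetime(2023, 1, 2, 0, 0, 0)

first_task_end = datetime(2023, 1, 2, 0, 5, 10)
second_task_end = datetime(2023, 1, 2, 0, 9, 0)

print(first_task_end - scheduled_period_end)   # 0:05:10 -> landing_time of first_task
print(second_task_end - scheduled_period_end)  # 0:09:00 -> landing_time of second_task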
Thanks to @Kalinde Pride for commenting and pointing me to the only source of truth, the Airflow code base.
I usually use landing_time as a metric for the performance of the whole Airflow system. For example, an increase in landing_times for the first tasks seems to mean that the scheduler is under heavy load, or that we should adapt task parallelization (through airflow.cfg).
Landing Times: Total time spent including retries.
I am experimenting with Airbnb's Airflow. When I try to run it with the 'backfill' option for one day with a timedelta of 60 minutes, only 13 instances are executed. The rest are shown as waiting and are never executed.
Please provide more info, such as which Airflow version you are using and which executor you defined in airflow.cfg.
For CeleryExecutor, you have to have "airflow worker" running together with the scheduler.
Also, your DAG may be left pending if your end date ("-e END_DATE") has already been reached.
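For reference, a minimal sketch of the commands involved, with a hypothetical DAG id and date range (Airflow 1.x CLI):

# Backfill one day; with a 60-minute interval this should produce 24 runs
airflow backfill my_dag -s 2016-01-01 -e 2016-01-02

# With CeleryExecutor, a worker must run alongside the scheduler
airflow worker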