How to deploy modified airflow dag from a different start time? - python

Let's say the scheduler has been stopped for 5 hours, and I had a DAG scheduled to run twice every hour. When I restart the scheduler I do not want Airflow to backfill all the instances that were missed; instead, I want it to continue from the current hour.

To achieve this behavior, you can add the LatestOnlyOperator, which was only recently introduced to master, at the start of your DAG. It is not part of a released version yet, though (1.7.1.3 is the latest version as of the writing of this post).
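The operator's effect can be pictured with a small, framework-free check: a run counts as "latest" only if its schedule window contains the current time, and backfilled runs fall outside it. This is a simplified illustration of the idea, not Airflow's actual API (the function name and signature are mine):

```python
from datetime import datetime, timedelta

def is_latest_run(execution_date, schedule_interval, now=None):
    """Simplified sketch of the LatestOnlyOperator idea: a run is
    'latest' only when its schedule window covers the current time."""
    now = now or datetime.utcnow()
    return execution_date <= now < execution_date + schedule_interval

# A run from two days ago being replayed now is a backfill, not "latest":
old = datetime(2016, 6, 1)
print(is_latest_run(old, timedelta(hours=1), now=datetime(2016, 6, 3)))  # False
```

In the real operator, downstream tasks of a non-latest run are skipped rather than executed, which gives exactly the "continue from the current hour" behavior asked about.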

I'm sure you're no longer waiting for an answer, but for reference, this is covered here: https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls.
"When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc..."

Related

What happens if you run the same dag multiple times while it is already running?

What happens if the same dag is triggered concurrently (or such that the run times overlap)?
Asking because I recently manually triggered a DAG that was still running when its actual scheduled run time passed, at which point, from the perspective of the web server UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how the run was triggered and whether the DAG has a schedule. If the run is based on the schedule defined in the DAG (say, a task that runs daily), it is incomplete / still working, and you click rerun, then this instance of the task, i.e. the one for today, will be rerun. Likewise for any other scheduling frequency.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run, it will get the trigger's execution timestamp, and the run will be displayed separately from the scheduled runs, as described in the external-triggers documentation:
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.

Django Crontab : How to stop parallel execution

I have a few cron jobs running with the help of django-crontab. Let us take one cron job as an example: suppose job A is scheduled to run every two minutes.
However, while the job is running and if it is not finished in two minutes, I do not want another instance of this job to execute.
Exploring a few resources, I came across this article, but I am not sure where it fits in.
https://bencane.com/2015/09/22/preventing-duplicate-cron-job-executions/
Has someone already come across this issue? How did you fix it?
According to the readme, you should be able to set:
CRONTAB_LOCK_JOBS = True
in your Django settings. That will prevent a new job instance from starting if a previous one is still running.
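For reference, the technique in the linked article (an exclusive lock file) takes only a few lines of standard-library Python, in case you ever need it outside django-crontab. A sketch under assumptions: fcntl is Unix-only, and the lock-file path is just an example:

```python
import fcntl
import sys

def acquire_job_lock(path="/tmp/job_a.lock"):
    """Try to take an exclusive, non-blocking lock on a file.
    Returns the open file handle on success (keep it open for the
    lifetime of the job), or None if another instance holds it."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle

if __name__ == "__main__":
    lock = acquire_job_lock()
    if lock is None:
        sys.exit("Previous instance still running; exiting.")
    # ... do the actual work here; the lock is released when the
    # process exits and the handle is closed ...
```

The CRONTAB_LOCK_JOBS setting does essentially this for you, so prefer it when all your jobs go through django-crontab.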

In Python's Airflow, how can I stop a task from running after a certain time?

I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th: it will see that a DAG instance was supposed to run on the 15th and will run that instance (the one for the 15th) on the 20th, then it will run the instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is in the roadmap for Airflow, but does not currently exist.
See:
Issue #1155
You may be able to hack together a solution using the BranchPythonOperator. As the documentation says, make sure you have set depends_on_past=False (this is the default). I do not have Airflow set up, so I can't test and provide you example code at this time.
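For what it's worth, the branch callable itself is plain Python: compare the run's execution_date to a freshness cut-off and return the task_id to follow. An untested sketch along those lines (the task IDs, the MAX_AGE threshold, and the function name are all invented for illustration; the BranchPythonOperator wiring is only shown in comments):

```python
from datetime import datetime, timedelta

# Hypothetical cut-off: runs older than this are considered stale backfills.
MAX_AGE = timedelta(hours=12)

def choose_branch(execution_date, now=None, **kwargs):
    """Return the task_id of the branch to follow: scrape for fresh
    runs, skip for stale (backfilled) ones."""
    now = now or datetime.utcnow()
    if now - execution_date > MAX_AGE:
        return "skip_stale_run"   # e.g. a DummyOperator that does nothing
    return "scrape_page"          # the real scraping task

# In the DAG file this would feed a BranchPythonOperator, roughly:
# branch = BranchPythonOperator(task_id="branch",
#                               python_callable=choose_branch,
#                               provide_context=True, dag=dag)
```

Downstream tasks not returned by the callable get skipped, so stale backfill runs would do no scraping.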
Airflow was designed with backfilling in mind, so the roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date:
http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not made to be stopped. If you run it today, setting your task's start_date to today seems logical to me.

Add Repeating Task With Redis

How do I schedule a task to run once every six hours (on repeat)?
I am trying to implement a Redis queue for the first time.
I went through Heroku's tutorial : https://devcenter.heroku.com/articles/python-rq
But the tutorial did not explain how to run a task repeatedly on a given timeframe (such as checking a couple of websites for info once every six hours).
Also, since I am new to this, if I should not be using Redis for such a task, please let me know what I should be using instead.
Thanks
You don't need Redis for this functionality at all.
Take a look at the Heroku Scheduler here: https://devcenter.heroku.com/articles/scheduler
You can set this to run your code every hour, and have your code check whether the current hour is 0, 6, 12, or 18 (or whatever other interval you may need).
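The hourly entry point then reduces to a small guard. Everything below (the hour set and the function names) is just an illustration of that approach:

```python
from datetime import datetime

RUN_HOURS = {0, 6, 12, 18}  # every six hours

def should_run_now(now=None):
    """Return True only on the hours we actually want to do work."""
    now = now or datetime.utcnow()
    return now.hour in RUN_HOURS

def check_websites():
    """Placeholder for the real work (fetch pages, compare, notify)."""
    pass

def main():
    if not should_run_now():
        return  # scheduled hourly, but it's not our hour: exit quietly
    check_websites()

if __name__ == "__main__":
    main()
```

The Scheduler invokes the script every hour; three out of every four invocations exit immediately.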

Exceptions from Buildbots PeriodicScheduler intervals?

Buildbot's Periodic scheduler triggers builds at fixed intervals (e.g. every 30 minutes). But there are certain times (e.g. at night, during the weekend, or while regular maintenance is performed) when I'd like it to relax.
Is there a way to have a more fine-grained description for the Periodic scheduler? Or should I rather use the Nightly scheduler and explicitly list all build trigger times I need for the whole week?
If you use Nightly, you can specify that the scheduler only triggers if there are interesting changes to build. Presumably you won't have commits during those quiet periods either, so it will not trigger builds then.
One way to solve this would of course be to write a piece of Python code that generates the build times for the Nightly scheduler according to some rules.
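Since the master config is plain Python, those trigger times are easy to generate. A sketch under assumed rules (every 30 minutes, 08:00-18:00, weekdays only; the helper name is made up, and the Nightly wiring is only sketched in comments):

```python
def build_hours(start=8, end=18, step_minutes=30):
    """Expand a working-hours window into (hour, minute) pairs,
    e.g. every 30 minutes between 08:00 and 18:00."""
    times = []
    hour, minute = start, 0
    while hour < end:
        times.append((hour, minute))
        minute += step_minutes
        hour += minute // 60
        minute %= 60
    return times

# Feed the pairs into Nightly schedulers, weekdays only
# (in Buildbot's convention dayOfWeek 0 is Monday):
# for h, m in build_hours():
#     c['schedulers'].append(
#         Nightly(name='periodic_%02d%02d' % (h, m),
#                 builderNames=['mybuilder'],
#                 hour=h, minute=m, dayOfWeek=[0, 1, 2, 3, 4]))
```

Nightly also accepts lists for hour and minute, so the same rules can often be expressed as a single scheduler with hour=range(8, 18) and minute=[0, 30].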
