What happens if the same dag is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently triggered a DAG manually, and it was still running when its actual scheduled run time passed. At that point, from the perspective of the web-server UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, this depends on how the run was triggered and whether the DAG has a schedule. If the run comes from the schedule defined in the DAG (say, a task that runs daily), is incomplete or still working, and you click rerun, then that instance of the task (i.e. the one for today) will be rerun. The same applies if the frequency is any other unit of time.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run manually, it gets the trigger's execution timestamp, and the run is displayed separately from the scheduled runs, as described in the external-triggers documentation:
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.
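As a rough illustration of the quoted CLI usage, here is a minimal sketch that builds an airflow trigger_dag command with an explicit run_id from Python. The helper name build_trigger_command and the DAG name are hypothetical, and the sketch assumes the older CLI form quoted above (-r for the run_id); it only builds the argument list and does not run anything.

```python
from datetime import datetime, timezone


def build_trigger_command(dag_id, run_id=None, now=None):
    """Build the argument list for `airflow trigger_dag -r <run_id> <dag_id>`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    now = now or datetime.now(timezone.utc)
    if run_id is None:
        # Assumption: derive a run_id from the trigger time so that each
        # manual trigger is tracked as its own DAG run.
        run_id = "manual__" + now.strftime("%Y-%m-%dT%H:%M:%S")
    return ["airflow", "trigger_dag", "-r", run_id, dag_id]


# Example: a fixed run_id for a hypothetical DAG called "my_dag".
command = build_trigger_command("my_dag", run_id="adhoc_rerun_1")
```

The resulting list could be passed to subprocess.run on a machine where the airflow CLI is installed.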
Related
I am creating an airflow pipeline for pulling comment data from an API for a popular forum. For this I am creating two separate dags:
one dag with schedule_interval set to every minute that checks for new posts and inserts these posts into a database
another dag that I run manually to backfill my database with historic data. This dag simply looks for posts older than the oldest post in my database. For example if the oldest post in my db had id 1000, I would trigger the dag with argument 100 (number of historic posts I want) to fetch all posts in between 1000 and 900.
I have already created both dags, and right now I want to keep dag #2 manual so that I can trigger it whenever I want more historic data. The problem is that I do not want this to interfere with the schedule of dag #1. For this reason, I would like to be able to implement a system where, on calling dag #2, airflow first checks to see if the dag #1 is running, and IF SO, waits until dag #1 is finished to proceed. Likewise, I want to do this the other way around, where dag #1 will check if dag #2 is running before executing, and if so wait until dag #2 is finished. This is kind of confusing, but I want to build a dual-dependency between both dags, so that both cannot run at the same time, and respect each other by waiting until the other is finished before proceeding.
I am currently using Airflow to run a DAG (say dag.py) which has a few tasks, and then, it has a python script to execute (done via bash_operator). The python script (say report.py) basically takes data from a cloud (s3) location as a dataframe, does a few transformations, and then sends them out as a report over email.
But the issue I'm having is that Airflow is running this Python script, report.py, every time it scans the repository for changes (i.e. every 2 mins). So the script is being run every 2 mins (and hence the email is being sent out every two minutes!).
Is there any workaround for this? Can we use something other than a bash operator (bear in mind that we need to do a few dataframe transformations before sending out the report)?
Thanks!
Just make sure you do all the serious work in the tasks, not in the body of the Python script. The script will be executed often by the scheduler, but it should simply create tasks and build the dependencies between them. The actual work is done in the 'execute' methods of the tasks.
For example, rather than sending the email in the script, you should add an 'EmailOperator' as a task with the right dependencies. The execute method of the operator is then run not when the file is parsed by the scheduler, but when all of its dependencies (the other tasks) have completed.
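To make the parse-time versus execute-time distinction concrete, here is a small pure-Python sketch. It does not use Airflow at all: ToyTask, parse_dag_file and send_report are hypothetical stand-ins for an operator, the scheduler parsing the DAG file, and the reporting logic.

```python
# Records every "email" that gets sent, so we can count them.
sent_emails = []


class ToyTask:
    """Hypothetical stand-in for an operator: it stores work, it does not run it."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        # The real work happens only here, when the task is run.
        return self.python_callable()


def send_report():
    sent_emails.append("report")


def parse_dag_file():
    """Stands in for the scheduler importing the DAG file."""
    # Only builds the task object; send_report is NOT called here.
    return ToyTask("send_report", send_report)


# The file is re-parsed many times...
for _ in range(5):
    task = parse_dag_file()
emails_after_parsing = len(sent_emails)

# ...but the email goes out only when the task itself is executed.
task.execute()
emails_after_execute = len(sent_emails)
```

If send_report() were called directly inside parse_dag_file(), every parse would send an email, which matches the behaviour described in the question.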
I have an Airflow task that has run daily for the past year or so. After making some changes to the DAG, I would like to rerun it based on the new code. The UI still treats these as the same tasks that have already run successfully, even though they are different now.
Manually going to each DAG to clear all > run seems counterintuitive and there are hundreds (even thousands) of runs, is there a way I can make them all run again?
Airflow doesn't reset task instance statuses after they have been completed if there is a code change. You could however give the task a new name and then set the DAG Runs to running state.
Option 1 - User Interface
Airflow can clear the state of tasks within the UI - just use "Browse -> Task Instances". There you can:
Create a filter to select the specific task instances, e.g. by filtering for a task/DAG name and a time frame, then "Apply" the filter
You can then see a table with all the task runs that match this filter; select all and choose "With selected -> clear"
Option 2 - Command Line
Use airflow clear:
Clear a set of task instance, as if they never ran
For example:
airflow clear -s <start_date> -e <end_date> -t task_to_reset <DAG_NAME>
I have multiple tasks that pass data objects to each other. In some tasks, if a condition is not met, I raise an exception, which causes that task to fail. When the next DAG run is triggered, the tasks that already succeeded run once again. I'm looking for a way to avoid rerunning the previously successful tasks and to resume from the failed task in the next DAG run.
As mentioned, every DAG has its set of tasks that are executed on every run. To avoid rerunning previously successful tasks, you could check an external variable via Airflow XComs or Airflow Variables, or query the meta database for the status of previous runs. You could also store a variable in something like Redis or a similar external database.
Using that variable, you can skip the execution of a task and directly mark it successful, until the run reaches the task that still needs to be completed.
Of course you need to be mindful of any potential race conditions if the DAG run times can overlap.
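def task_1(**kwargs):
    # external_variable and perform_task are placeholders: the flag would come
    # from an XCom, an Airflow Variable, or an external store such as Redis.
    if external_variable:
        pass  # work already done in an earlier run; skip it
    else:
        perform_task()
    return True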
I have a few cron jobs running with the help of django-crontab. Take one cron job as an example: suppose job A is scheduled to run every two minutes.
However, if the job is still running and has not finished within two minutes, I do not want another instance of it to start.
While exploring a few resources, I came across this article, but I am not sure where to fit it in.
https://bencane.com/2015/09/22/preventing-duplicate-cron-job-executions/
Has anyone already come across this issue? How did you fix it?
According to the readme, you should be able to set:
CRONTAB_LOCK_JOBS = True
in your Django settings. That will prevent a new job instance from starting if a previous one is still running.
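For jobs that are not managed by django-crontab, the same idea can be sketched by hand with a non-blocking file lock. This is only an illustration of the general technique, not a description of how django-crontab implements its locking; run_exclusive and the lock path are hypothetical, and fcntl is available on Unix-like systems only.

```python
import fcntl
import os
import tempfile


def run_exclusive(lock_path, job):
    """Run job() unless the lock is already held; return True if it ran."""
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # a previous run is still in progress
        try:
            job()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
        return True


# Demo: while job A holds the lock, an overlapping invocation is refused.
lock_path = os.path.join(tempfile.mkdtemp(), "job_a.lock")
overlap_results = []


def job_a():
    # Simulate a second invocation starting while job A is still running.
    overlap_results.append(run_exclusive(lock_path, lambda: None))


first_run = run_exclusive(lock_path, job_a)
# Once job A has finished, the lock is free again.
later_run = run_exclusive(lock_path, lambda: None)
```

Because the lock is released in a finally block, a job that raises does not block later runs.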