How to rerun past DAG runs after making changes - python

I have an Airflow task that has been running daily for the past year or so. After making some changes to the DAG, I would like to rerun it based on the new code -- the UI still treats the tasks as the same ones that have already run successfully, even though they are different now.
Manually opening each DAG run to clear it and rerun seems impractical, and there are hundreds (even thousands) of runs. Is there a way I can make them all run again?

Airflow doesn't reset task instance statuses after they have completed just because there was a code change. You could, however, give the task a new name and then set the DAG runs to the running state.
Option 1 - User Interface
Airflow can clear the state of tasks within the UI - just use "Browse -> Task Instances". There you can:
Create a filter to select the specific task instances, e.g. by filtering for a task/DAG name and a time frame, then "Apply" the filter
You will then see a table with all the task instances that match this filter; select them all and choose "With selected -> Clear"
Option 2 - Command Line
Use airflow clear:
Clear a set of task instances, as if they never ran
For example:
airflow clear -s <start_date> -e <end_date> -t task_to_reset <DAG_NAME>
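If you prefer to do this programmatically rather than through the CLI (in Airflow 2.x the equivalent command is airflow tasks clear), a rough sketch using the DagBag and DAG.clear APIs could look like the following; the DAG id and dates are placeholders, not taken from the question:

# Hedged sketch: clear task instances over a date range so they rerun with the new code.
# "my_dag_id" and the dates below are placeholders.
from datetime import datetime

from airflow.models import DagBag

dag = DagBag().get_dag("my_dag_id")
dag.clear(
    start_date=datetime(2020, 1, 1),
    end_date=datetime(2021, 1, 1),
    only_failed=False,  # clear successful task instances as well
)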

Related

Airflow: Best way to store a value in Airflow task that could be retrieved in the recurring task runs

In an Airflow DAG, I'm writing a monitoring task which will run again and again until a certain condition is met. In this task, when some event happens, I need to store the timestamp, retrieve that value in the next run of the same task, and update it again if required.
What's the best way to store this value?
So far I have tried below approaches to store:
storing in XComs, but this value couldn't be retrieved in the next task run, as the XCom entry gets deleted for each new task run within the same DAG run.
storing in Airflow Variables - this serves the purpose; I could store, update, and delete as needed, but it doesn't look clean for my use case, as a lot of new Variables get generated per DAG and we have over 2k DAGs (pipelines).
global variables in the Python class, but the value gets overwritten in the next task run.
Any suggestion would be helpful.
If you have a task that is re-run with the same "execution date", using Airflow Variables is your best choice. XCom will be deleted by definition when you re-run the same task with the same execution date, and that won't change.
Basically, what you want to do is store the "state" of task execution, which is somewhat "against" Airflow's principle of idempotent tasks (re-running a task should produce the same "final" result every time you run it). You, on the other hand, want to store state between re-runs and have the task behave differently on subsequent re-runs, based on that stored state.
Another option is to store the state in external storage (for example, an object in S3). This might be better performance-wise if you do not want to load your DB too much. You could come up with a naming "convention" for such a state object, pull it at the start of the task, and push it when the task finishes.
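As a rough illustration of the Variables approach (the Variable key and the event check below are hypothetical, not taken from the question):

# Hedged sketch: persist a timestamp between re-runs of the same task with an
# Airflow Variable. The Variable key and the event check are illustrative only.
from datetime import datetime, timezone

from airflow.models import Variable


def event_happened():
    # Placeholder for whatever condition the monitoring task checks.
    return False


def monitor(**context):
    key = "monitor_state_{}".format(context["dag"].dag_id)  # one Variable per DAG
    last_seen = Variable.get(key, default_var=None)          # None on the very first run

    if event_happened():
        Variable.set(key, datetime.now(timezone.utc).isoformat())

    return last_seen

Because the key is derived from the dag_id, each DAG gets exactly one state Variable, which keeps the Variable count bounded even with a couple of thousand DAGs.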
You could use XComs with include_prior_dates parameter. Docs state the following:
include_prior_dates (bool) -- If False, only XComs from the current execution_date are returned. If True, XComs from previous dates are returned as well.
(Default value is False)
Then you would do: xcom_pull(task_ids='previous_task', include_prior_dates=True)
I haven't tried it out personally, but it looks like this may be a good solution for your case.
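For illustration, a hedged sketch of what the callable might look like; the task id "monitor" and the general shape are assumptions, not something prescribed by the answer above:

# Hedged sketch: read the value this same task pushed on an earlier run by
# allowing XComs from prior execution dates. The task id is illustrative only.
def monitor(**context):
    ti = context["ti"]
    last_timestamp = ti.xcom_pull(
        task_ids="monitor",          # this same task, from an earlier run
        include_prior_dates=True,    # look beyond the current execution_date
    )
    # ... perform the check, then return the (possibly updated) value so the
    # PythonOperator pushes it as this run's XCom for the next run to pick up
    return last_timestamp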

Airflow: How to ensure that two DAGs are not running at the same time

I am creating an airflow pipeline for pulling comment data from an API for a popular forum. For this I am creating two separate dags:
one dag with schedule_interval set to every minute that checks for new posts and inserts these posts into a database
another dag that I run manually to backfill my database with historic data. This dag simply looks for posts older than the oldest post in my database. For example if the oldest post in my db had id 1000, I would trigger the dag with argument 100 (number of historic posts I want) to fetch all posts in between 1000 and 900.
I have already created both dags, and right now I want to keep dag #2 manual so that I can trigger it whenever I want more historic data. The problem is that I do not want it to interfere with the schedule of dag #1. So I would like a system where, on triggering dag #2, Airflow first checks whether dag #1 is running and, if so, waits until dag #1 is finished before proceeding. Likewise, the other way around: dag #1 should check whether dag #2 is running before executing and, if so, wait until dag #2 is finished. In short, I want a mutual dependency between both dags so that they cannot run at the same time, and each waits for the other to finish before proceeding.
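One way to sketch this, purely as an illustration (the DAG ids, schedule, and sensor import are assumptions, and this is not a true lock): put a sensor at the start of each DAG that waits until the other DAG has no run in the RUNNING state.

# Hedged sketch: make each DAG wait until the other has no active run.
# DAG ids, schedule, and task names are illustrative only (Airflow 2-style imports).
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.sensors.python import PythonSensor
from airflow.utils.state import State


def other_dag_is_idle(other_dag_id):
    # True when the other DAG currently has no DagRun in the RUNNING state.
    return len(DagRun.find(dag_id=other_dag_id, state=State.RUNNING)) == 0


# This would sit at the start of dag #2 (the manual backfill DAG); a mirror-image
# sensor checking "dag_2" would go at the start of dag #1.
with DAG("dag_2", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    wait_for_dag_1 = PythonSensor(
        task_id="wait_for_dag_1",
        python_callable=other_dag_is_idle,
        op_args=["dag_1"],
        poke_interval=30,
        mode="reschedule",  # release the worker slot between pokes
    )
    # wait_for_dag_1 >> backfill_task  # the actual backfill work goes downstream

Note that this is best effort rather than true mutual exclusion: if both DAGs start at exactly the same moment, both sensors can pass before either run becomes visible to the other.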

multiple filepaths in S3KeySensor on Airflow

I have some tasks that need to run when one of a few specific files or directories changes on S3.
Let's say I have a PythonOperator, and it needs to run if /path/file.csv changes or if /path/nested_path/some_other_file.csv changes.
I have tried to create dynamic KeySensors like this:
trigger_path_list = ['/path/file.csv', '/path/nested_path/some_other_file.csv']

for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean both S3KeySensors would have to succeed before the main task is triggered.
I have also tried to make both tasks unique for each path, like this:
for trigger_path in trigger_path_list:
    main_task = PythonOperator(
        task_id='{}_task_triggered_by_{}'.format(dag_name, trigger_path),
        ...)
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that the DAG would not finish unless all of the files from the list appeared. And if /path/file.csv appeared twice in a row, the second appearance would not trigger anything, as that part of the DAG would already be completed.
Isn't there a way to pass multiple files to the S3KeySensor? I do not want to create one DAG for every path, as for me that would be 40 DAGs x around 5 paths, which gives around 200 DAGs.
Any ideas?
Couple ideas for this:
Use Airflow's other task trigger rules; specifically, you probably want one_success on the main task, which means just one of however many upstream sensors needs to succeed for the task to run (see the sketch after these ideas). This does mean the other sensors will keep running, but you could potentially use the soft_fail flag with a lower sensor timeout to avoid any failure. Alternatively, you can have the main task or a separate post-cleanup task mark the rest of the sensors in the DAG as success.
Depending on how many possible paths there are, if it's not too many, then maybe just have a single task sensor that loops through the paths to check for changes. As soon as one path passes the check, you can return so the sensor succeeds. Otherwise, keep polling if no path passes.
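A hedged sketch of the first idea, reusing the variable names from the question's snippet; the import paths depend on your Airflow/provider versions, process_new_file is a hypothetical callable, and the block is assumed to sit inside the same DAG definition as the original code:

# Hedged sketch: one main task with trigger_rule "one_success" downstream of all
# sensors, plus soft_fail so sensors that never match end up skipped, not failed.
import os

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.utils.trigger_rule import TriggerRule

main_task = PythonOperator(
    task_id='{}_main_task'.format(dag_name),
    python_callable=process_new_file,        # hypothetical processing callable
    trigger_rule=TriggerRule.ONE_SUCCESS,    # run as soon as any one sensor succeeds
)

for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60,      # shorter timeout so unmatched sensors give up sooner
        soft_fail=True,       # timed-out sensors are marked skipped, not failed
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task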
In either case, you would still have to schedule this DAG frequently/non-stop if you're looking to keep listening for new files. In general, Airflow isn't really intended for long-running processes. If the main task logic is easier to perform via Airflow, you could still consider having an external process monitor changes, and then trigger a DAG that contains the main task via the API or CLI.
Also not sure if applicable here or something you considered already, but you may be interested in S3 Event Notifications to more explicitly learn about changed files or directories, which could then be consumed by the SQSSensor.
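If you go the S3 Event Notifications route, the consuming side might look roughly like this; the queue URL and connection id are made up for illustration, and the exact sensor class and import path depend on your amazon provider version:

# Hedged sketch: wait for S3 event notifications delivered to an SQS queue
# instead of poking S3 directly. Queue URL and connection id are illustrative.
from airflow.providers.amazon.aws.sensors.sqs import SqsSensor

wait_for_s3_event = SqsSensor(
    task_id='wait_for_s3_event',
    sqs_queue='https://sqs.us-east-1.amazonaws.com/123456789012/s3-change-events',
    aws_conn_id='aws_default',
    max_messages=1,
    poke_interval=30,
)
# wait_for_s3_event >> main_task   # received messages are pushed to XCom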

What happens if you run the same dag multiple times while it is already running?

What happens if the same dag is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently manually triggered a dag that was still running when its actual scheduled run time passed, at which point, from the perspective of the webserver UI, it began running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how the run was triggered and whether the DAG has a schedule. If it's based on the schedule defined in the DAG (say, a task that runs daily), the run is incomplete / still working, and you click rerun, then this instance of the task will be rerun, i.e. the one for today. Likewise if the frequency were any other unit of time.
If you wanted to rerun other instances, you need to delete them from the previous jobs, as described by #lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run, it will get the trigger's execution timestamp and the run will be displayed separately from the scheduled runs, as described in the documentation here: external-triggers documentation
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.

Methods to schedule a task prior to runtime

What are the best methods to set a .py file to run at one specific time in the future? Ideally, I'd like to do everything within a single script.
Details: I often travel for business so I built a program to automatically check me in to my flights 24 hours prior to takeoff so I can board earlier. I currently am editing my script to input my confirmation number and then setting up cron jobs to run said script at the specified time. Is there a better way to do this?
Options I know of:
• current method
• put code in the script to delay until x time. Run the script immediately after booking the flight and it would stay open until the specified time, then check me in and close. This would prevent me from shutting down my computer, though, and my machine is prone to overheating.
Ideal method: input my confirmation number & flight date, run the script, have it set up whatever cron job is needed automatically, and be done with it. I want to make sure whatever method I use doesn't involve keeping a script open and running in the background.
cron is best for jobs that you want to repeat periodically. For one-time jobs, use at or batch.
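As a rough sketch of that "ideal method" (the script path, confirmation number, and departure time below are placeholders), the booking script itself could hand the one-shot job to at and then exit, so nothing stays running in the background:

# Hedged sketch: schedule a one-shot check-in with `at` from the same script.
# The departure time, script path, and confirmation number are placeholders;
# `at -t` expects a [[CC]YY]MMDDhhmm timestamp.
import subprocess
from datetime import datetime, timedelta

takeoff = datetime(2024, 6, 1, 14, 30)
checkin_time = takeoff - timedelta(hours=24)   # check in 24 hours before takeoff

command = "python3 /home/me/checkin.py --confirmation ABC123"
subprocess.run(
    ["at", "-t", checkin_time.strftime("%Y%m%d%H%M")],
    input=command.encode(),
    check=True,
)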
