Dynamically building collection to loop over in Airflow dag - python

I have been working with Airflow a lot recently, and a very common pattern I keep finding is looping over some collection to create multiple tasks, very similar to the example_python_operator.py DAG found in the example dags folder on GitHub.
My question has to do with dynamically building up the collection the loop iterates over. Let's say you want to create a task for each of an unknown set of clients stored in a database, and you plan to query them to populate your list. Something like this:
first_task = PythonOperator(
    task_id='some_upstream_task',
    provide_context=True,
    python_callable=some_upstream_task,
    dag=dag)

clients = my_database_query()

for client in clients:
    task = PythonOperator(
        task_id='client_' + str(client),
        python_callable=some_function,
        dag=dag)
    task.set_upstream(first_task)
From what I have seen, this means that even if your DAG only runs weekly, your database is being polled every 30 seconds for these clients. Even if you add an upstream operator, return the clients via XComs, and replace my_database_query() with an xcom_pull(), you're still polling XComs every 30 seconds. This seems wasteful to me, so I'm wondering if there are any better patterns for this type of DAG?

In your code sample we don't see the schedule interval of the DAG, but I'm assuming that you have it scheduled, let's say @daily, and that you want the DB query to run once a day.
In Airflow, the DAG file is parsed periodically by the scheduler (hence the "every 30 seconds"), so any top-level Python code, including your database query, runs on every parse. That is what causes the issue.
In your case, I would consider changing perspective: why not run the database query in a PostgresOperator and make it part of the DAG? Based on the output of that operator (which you can propagate via XCom, for example, or via a file in object storage), you can then have a downstream PythonOperator that runs a function not for one client but for all of them.
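A minimal sketch of that idea, here using PostgresHook inside PythonOperators so the query result lands in XCom; the connection id, SQL, and the per-client some_function are assumptions, not from the original post:

from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_clients():
    # Assumed connection id and query -- replace with your own.
    hook = PostgresHook(postgres_conn_id='clients_db')
    rows = hook.get_records('SELECT client_id FROM clients;')
    return [row[0] for row in rows]  # the return value is pushed to XCom

def process_all_clients(ti):
    # One task handles every client, instead of one task per client.
    clients = ti.xcom_pull(task_ids='fetch_clients')
    for client in clients or []:
        some_function(client)  # the per-client work from the question

fetch_task = PythonOperator(
    task_id='fetch_clients',
    python_callable=fetch_clients,
    dag=dag)

process_task = PythonOperator(
    task_id='process_all_clients',
    python_callable=process_all_clients,
    dag=dag)

fetch_task >> process_task

This keeps the database query inside a scheduled task, so it runs once per DAG run instead of on every parse of the DAG file.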

Related

how to limit the number of tasks run simultaneously in a specific dag on airflow?

I have a DAG in Airflow that looks like this:
start >> [task1,task2,....task16] >> end
I want to limit the number of tasks running simultaneously in this DAG to, for example, 4.
I know there is the 'max_active_tasks_per_dag' parameter, but it affects all DAGs, and in my case I want to set it only for my DAG.
How can I do so? Is it even possible?
If you look at the description of the parameter, you will find that you can configure it per DAG:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-active-tasks-per-dag
max_active_tasks_per_dag (new in version 2.2.0):
The maximum number of task instances allowed to run concurrently in each DAG. To calculate the number of tasks that is running concurrently for a DAG, add up the number of running tasks for all DAG runs of the DAG. This is configurable at the DAG level with max_active_tasks, which is defaulted as max_active_tasks_per_dag.
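For the layout in the question, a minimal sketch would look something like this (Airflow 2.x; the DAG id and dates are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id='limited_concurrency_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    max_active_tasks=4,  # at most 4 task instances of this DAG run at once
) as dag:
    start = EmptyOperator(task_id='start')
    end = EmptyOperator(task_id='end')
    tasks = [EmptyOperator(task_id=f'task{i}') for i in range(1, 17)]
    start >> tasks >> end

Tasks task1..task16 all become runnable after start, but the scheduler only lets 4 of them run at any one time, without affecting any other DAG.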

Airflow: Best way to store a value in Airflow task that could be retrieved in the recurring task runs

In Airflow for a DAG, I'm writing a monitoring task which will run again and again until a certain condition is met. In this task, when some event happens, I need to store the timestamp, retrieve that value in the next run of the same task, and update it again if required.
What's the best way to store this value?
So far I have tried the below approaches to store it:
storing in XComs, but the value couldn't be retrieved in the next task run, as the XCom gets deleted on each new run of the same task for the same DAG run.
storing in Airflow Variables: this serves the purpose, as I can store, update, and delete as needed, but it doesn't look clean for my use case, since a lot of new Variables get generated per DAG and we have over 2k DAGs (pipelines).
global variables in the Python class, but the value gets overridden in the next task run.
Any suggestions would be helpful.
If you have a task that is re-run with the same "execution date", using Airflow Variables is your best choice. XCom will, by definition, be deleted when you re-run the same task with the same execution date, and this won't change.
Basically, what you want to do is store the "state" of task execution, which is somewhat against Airflow's principle of idempotent tasks (re-running a task should produce the same "final" results every time you run it). You, on the other hand, want to store state between re-runs and have the task behave differently on subsequent re-runs, based on that stored state.
Another option you could use is storing the state in external storage (for example, an object in S3). This might be better performance-wise if you do not want to load your DB too much. You could come up with a naming convention for such a state object, pull it at the start, and push it when you finish the task.
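A minimal sketch of the Variable approach; the variable key and the event_happened() check are assumptions for illustration:

from datetime import datetime, timezone
from airflow.models import Variable

def monitoring_task(**context):
    # One Variable per DAG, keyed by dag_id, to survive task re-runs.
    key = 'monitor_state_{}'.format(context['dag'].dag_id)
    last_event_ts = Variable.get(key, default_var=None)
    # ... use last_event_ts to decide whether the condition is met ...
    if event_happened():  # hypothetical check for the monitored event
        Variable.set(key, datetime.now(timezone.utc).isoformat())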
You could use XComs with the include_prior_dates parameter. The docs state the following:
include_prior_dates (bool) -- If False, only XComs from the current execution_date are returned. If True, XComs from previous dates are returned as well.
(Default value is False)
Then you would do: xcom_pull(task_ids='previous_task', include_prior_dates=True)
I haven't tried it out personally, but it looks like this may be a good solution for your case.
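A short sketch of what the pull could look like inside a task; the task id is an assumption:

def check_state(ti, **context):
    # Pull the value pushed by the monitoring task across earlier runs,
    # not just the current execution_date.
    previous_ts = ti.xcom_pull(
        task_ids='monitoring_task',   # assumed upstream task id
        include_prior_dates=True,     # include XComs from previous dates
    )
    return previous_ts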

Airflow: How to ensure that two DAGs are not running at the same time

I am creating an Airflow pipeline for pulling comment data from an API for a popular forum. For this I am creating two separate DAGs:
one DAG with schedule_interval set to every minute, which checks for new posts and inserts them into a database
another DAG that I run manually to backfill my database with historic data. This DAG simply looks for posts older than the oldest post in my database. For example, if the oldest post in my db had id 1000, I would trigger the DAG with argument 100 (the number of historic posts I want) to fetch all posts between 1000 and 900.
I have already created both DAGs, and right now I want to keep DAG #2 manual so that I can trigger it whenever I want more historic data. The problem is that I do not want this to interfere with the schedule of DAG #1. For this reason, I would like to implement a system where, on triggering DAG #2, Airflow first checks whether DAG #1 is running, and IF SO, waits until DAG #1 is finished before proceeding. Likewise, I want this the other way around, where DAG #1 checks whether DAG #2 is running before executing, and if so waits until DAG #2 is finished. In short, I want to build a dual dependency between the two DAGs, so that they cannot run at the same time, and each waits until the other is finished before proceeding.
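One hedged way to sketch such a mutual-exclusion gate (the approach and all names are assumptions, not from the post): give each DAG a first task that polls the metadatabase for active runs of the other DAG.

from airflow.models import DagRun
from airflow.sensors.python import PythonSensor
from airflow.utils.state import DagRunState

def other_dag_is_idle(other_dag_id):
    # True when the other DAG currently has no running DAG runs.
    return len(DagRun.find(dag_id=other_dag_id, state=DagRunState.RUNNING)) == 0

# Placed in DAG #2; swap the dag_id when adding the same gate to DAG #1.
wait_for_other = PythonSensor(
    task_id='wait_for_other_dag',
    python_callable=lambda: other_dag_is_idle('dag_1'),
    poke_interval=30,
)

Note that this simple check can still race if both DAGs start at the same instant; a pool with a single slot around the critical tasks is another common way to serialize them.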

multiple filepaths in S3KeySensor on Airflow

I have some tasks that need to run when one of a few specific files or directories changes on S3.
Let's say I have a PythonOperator, and it needs to run if /path/file.csv changes or if /path/nested_path/some_other_file.csv changes.
I have tried to create dynamic S3KeySensors like this:
trigger_path_list = ['/path/file.csv', '/path/nested_path/some_other_file.csv']

for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean both S3KeySensors would have to succeed for the main task to run.
I have also tried making each main task unique, like here:
for trigger_path in trigger_path_list:
    main_task = PythonOperator(
        task_id='{}_task_triggered_by_{}'.format(dag_name, trigger_path),
        ...)
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that the DAG would not finish unless every file from the list appeared. And if /path/file.csv appeared twice in a row, the main task would not be triggered the second time, as that part of the DAG would already be complete.
Isn't there a way to pass multiple files to the S3KeySensor? I do not want to create one DAG for every path; for me that would be 40 DAGs x around 5 paths, which gives around 200 DAGs.
Any ideas?
A couple of ideas for this:
Use Airflow's other task trigger rules; specifically, you probably want one_success on the main task, which means only one of the upstream sensors needs to succeed for the task to run. Other sensors will still keep running, but you could use the soft_fail flag with a low timeout to avoid any failure. Alternatively, you can have the main task, or a separate post-cleanup task, mark the rest of the sensors in the DAG as success. A sketch of this follows below.
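A sketch of that first idea; process_changes is an assumed callable, while trigger_path_list, s3_bucket_name, and get_sensor_task_name come from the question:

import os
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.utils.trigger_rule import TriggerRule

def process_changes():
    # Placeholder for the real main-task logic from the question.
    ...

main_task = PythonOperator(
    task_id='main_task',
    python_callable=process_changes,
    trigger_rule=TriggerRule.ONE_SUCCESS,   # fire when any one sensor succeeds
    dag=dag)

for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True,
        poke_interval=30,
        soft_fail=True,   # time out as skipped rather than failed
        timeout=60 * 60,  # assumed: shorter timeout so skips happen promptly
        dag=dag)
    file_sensor_task >> main_task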
Depending on how many possible paths there are, if it's not too many, maybe just have a single sensor task that loops through the paths to check for changes. As soon as one path passes the check, you can return True so the sensor succeeds; otherwise, keep polling. For example, see the sketch below.
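A sketch of that second idea, polling all the paths from one sensor; the hook usage is an assumption:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.sensors.python import PythonSensor

def any_path_changed():
    # Succeed as soon as any one of the watched keys exists.
    hook = S3Hook()
    return any(
        hook.check_for_wildcard_key(path.lstrip('/'), bucket_name=s3_bucket_name)
        for path in trigger_path_list)

wait_for_any_path = PythonSensor(
    task_id='wait_for_any_path',
    python_callable=any_path_changed,
    poke_interval=30,
    dag=dag)

wait_for_any_path >> main_task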
In either case, you would still have to schedule this DAG frequently or non-stop if you're looking to keep listening for new files. In general, Airflow isn't really intended for long-running processes. If the main task logic is easier to perform via Airflow, you could still consider having an external process monitor for changes and then trigger the DAG containing the main task via the API or CLI.
Also, not sure if it's applicable here or something you've considered already, but you may be interested in S3 Event Notifications to learn more explicitly about changed files or directories; these could then be consumed by the SQSSensor.

How to avoid running previously successful tasks in Airflow?

I have multiple tasks that pass data objects to each other. In some tasks, if a condition is not met, I raise an exception, which leads to the failure of that task. When the next DAG run is triggered, the already successful tasks run once again. I'm looking for a way to avoid running the previously successful tasks and to resume the DAG run from the failed task in the next DAG run.
As mentioned, every DAG has its set of tasks, and these are executed on every run. To avoid running previously successful tasks, you could check an external variable via Airflow XComs or Airflow Variables; you could also query the metadatabase for the status of previous runs, or store a variable in something like Redis or a similar external database.
Using that variable, you can then skip the execution of a task and directly mark it successful, until you reach the task that still needs to run.
Of course, you need to be mindful of potential race conditions if DAG run times can overlap.
def task_1(**kwargs):
    if external_variable:
        pass
    else:
        perform_task()
    return True
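A slightly fuller sketch of that pattern, using an Airflow Variable as the external flag; the variable key and perform_task are assumptions:

from airflow.models import Variable

def task_1(**kwargs):
    # If a previous DAG run already completed this work, return immediately
    # so the task is marked successful without redoing anything.
    if Variable.get('task_1_done', default_var='false') == 'true':
        return True
    perform_task()                        # the actual work for this task
    Variable.set('task_1_done', 'true')   # record success for future runs
    return True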
