I have some tasks that need to run when one of a few specific files or directories changes on S3.
Let's say I have a PythonOperator, and it needs to run if /path/file.csv changes or if /path/nested_path/some_other_file.csv changes.
I have tried to create dynamic KeySensors like this:
trigger_path_list = ['/path/file.csv', '/path/nested_path/some_other_file.csv']

for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that both S3KeySensors would have to succeed before the main task runs.
I have also tried making both tasks unique, like this:
for trigger_path in trigger_path_list:
    main_task = PythonOperator(
        task_id='{}_task_triggered_by_{}'.format(dag_name, trigger_path),
        ...)
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that the DAG would not finish unless all of the files from the list appeared. So if /path/file.csv appeared 2 times in a row, the second appearance would not trigger anything, as that part of the DAG would already be completed.
Isn't there a way to pass multiple files to the S3KeySensor? I do not want to create one DAG for every path, as for me that would be 40 DAGs x around 5 paths, which gives around 200 DAGs.
Any ideas?
A couple of ideas for this:
Use Airflow's other task trigger rules; specifically, you probably want one_success on the main task, which means just one of however many upstream sensors needs to succeed for the task to run. The other sensors will still keep running, but you could potentially use the soft_fail flag with a low timeout to avoid any failure. Alternatively, you can have the main task or a separate post-cleanup task mark the rest of the sensors in the DAG as success. (See the sketch after these two ideas.)
Depending on how many possible paths there are, if it's not too many, then maybe just have a single sensor task that loops through the paths to check for changes. As soon as one path passes the check, you can return so the sensor succeeds. Otherwise, keep poking while no path passes.
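Here's a minimal sketch of the trigger-rule idea, assuming Airflow 2 style imports; the bucket name, callable, timeout values and task ids are placeholders, not anything from your code:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor  # import path for recent Amazon provider versions

s3_bucket_name = 'my-bucket'                                     # placeholder
trigger_path_list = ['path/file.csv', 'path/nested_path/some_other_file.csv']

with DAG('s3_any_file', start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    main_task = PythonOperator(
        task_id='main_task',
        python_callable=lambda: print('processing'),             # placeholder callable
        trigger_rule='one_success')                              # run as soon as ANY sensor succeeds

    for trigger_path in trigger_path_list:
        S3KeySensor(
            task_id='wait_for_{}'.format(trigger_path.replace('/', '_').replace('.', '_')),
            bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
            wildcard_match=True,
            poke_interval=30,
            timeout=60 * 60,
            soft_fail=True) >> main_task                         # soft_fail marks a timed-out sensor as skipped instead of failed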
In either case, you would still have to schedule this DAG frequently/non-stop if you're looking to keep listening for new files. In general, Airflow isn't really intended for long-running processes. If the main task logic is easier to perform via Airflow, you could still have an external process monitor for changes and then trigger a DAG containing the main task via the API or CLI.
Also, not sure if it's applicable here or something you considered already, but you may be interested in S3 Event Notifications to learn about changed files or directories more explicitly, which could then be consumed by the SQSSensor.
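If you go the event-notification route, a rough sketch of consuming those events with the SQS sensor could look like this (the queue URL, connection id and schedule are assumptions):

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.sensors.sqs import SqsSensor

with DAG('s3_event_listener', start_date=datetime(2023, 1, 1),
         schedule_interval='*/5 * * * *', catchup=False) as dag:
    # S3 is configured (outside Airflow) to publish event notifications to this queue
    wait_for_event = SqsSensor(
        task_id='wait_for_s3_event',
        sqs_queue='https://sqs.us-east-1.amazonaws.com/123456789012/s3-events',  # assumed queue URL
        aws_conn_id='aws_default',
        max_messages=1,
        wait_time_seconds=20)
    # received messages are pushed to XCom for downstream tasks to inspect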
I am using cx_Oracle and the schedule module in Python. The following is pseudo-code:
import schedule, cx_Oracle

def db_operation(query):
    '''
    Some DB operations like
    1. Get connection
    2. Execute query
    3. Commit result (in case of DML operations)
    '''

schedule.every().hour.at(":10").do(db_operation, query='some_query_1')  # runs at the 10th minute of every hour
schedule.every().day.at("13:10").do(db_operation, query='some_query_2')  # runs at 1:10 p.m. every day
Both of the above scheduled jobs call the same function (which does some DB operations) and will coincide at 13:10.
Questions:
So how does the scheduler handle this scenario of running 2 jobs at the same time? Does it put them in some sort of queue and run them one by one even though the time is the same, or do they run in parallel?
Which one gets picked first? And if I want the first job to have priority over the second, how do I do that?
Also, importantly, only one of these should be accessing the database at a time, otherwise it may lead to inconsistent data. How do I take care of this scenario? Is it possible to put some sort of lock around the function, or should the table be locked somehow?
I took a look at the source code of schedule and came to the following conclusions:
The schedule library does not run jobs in parallel or concurrently. Jobs that are due are therefore processed one after the other, sorted by their due date; the job that is furthest overdue is performed first.
If jobs are due at the same time, schedule executes them according to the FIFO scheme with respect to when the jobs were created. So in your example, some_query_1 would be executed before some_query_2.
The third question is practically self-explanatory: since only one function can be executed at a time, the two jobs should not get in each other's way.
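A small sketch you can run to observe that ordering (times and job names are just for illustration):

import time
import schedule

def job(name):
    print(time.strftime('%H:%M:%S'), 'running', name)

# Both jobs are due at the same moment; schedule runs them one after the
# other in this single thread, in the order they were created: job_1, then job_2.
schedule.every().day.at("13:10").do(job, name='job_1')
schedule.every().day.at("13:10").do(job, name='job_2')

while True:
    schedule.run_pending()   # executes all due jobs sequentially, oldest due first
    time.sleep(1)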
I have the following task to solve:
Files are sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files, and for each file the same tasks will be performed.
Overall the flow looks as follows: for each file, run tasks A->B->C->D.
Files are processed in batches. While this task seemed trivial to me, I have found several ways to do it and I am confused about which one is the "proper" one (if any).
First pattern: Use the experimental REST API to trigger the DAG.
That is, expose a web service which ingests the request and the file, stores it in a folder, and uses the experimental REST API to trigger the DAG, passing the file_id as conf.
Cons: the REST API is still experimental, and I'm not sure how Airflow would handle many requests coming in at one point (which shouldn't happen, but what if it does?).
Second pattern: 2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes.
Always using the same web service as described before, but this time it just stores the file. Then we have:
First DAG: uses a FileSensor along with the TriggerDagRunOperator to trigger N DAG runs given N files
Second dag: Task A->B->C
Cons: Need to avoid that the same files are being sent to two different DAG runs.
Example:
File x.json is in the folder
Sensor finds x, triggers DAG (1)
Sensor goes back and is scheduled again. If DAG (1) did not process/move the file, the sensor DAG might schedule a new DAG run with the same file, which is unwanted.
Third pattern: for file in files, task A->B->C
As seen in this question.
Cons: this could work; however, what I dislike is that the UI will probably get messed up, because every DAG run will not look the same but will change with the number of files being processed. Also, if there are 1000 files to be processed, the run would probably be very difficult to read.
Fourth pattern: Use subdags
I am not yet sure how they completely work, as I have seen they are not encouraged (at the end), but it should be possible to spawn a subdag for each file and have it run. Similar to this question.
Cons: Seems like subdags can only be used with the sequential executor.
Am I missing something and over-thinking something that should be (in my mind) quite straightforward? Thanks
I know I am late, but I would choose the second pattern: "2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes", because:
Every file can be processed in parallel
The first DAG could pick a file to process and rename it (adding a '_processing' suffix or moving it to a processing folder); see the sketch after this list
If I am a new developer at your company and I open the workflow, I want to understand the logic of the workflow, rather than a graph that was dynamically built from whichever files happened to be processed last time
If DAG 2 finds an issue with the file, it renames it (adding an '_error' suffix or moving it to an error folder)
It's a standard way to process files without creating any additional operator
It makes the DAG idempotent and easier to test. More info in this article
Renaming and/or moving files is a pretty standard way to process files in every ETL.
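As a rough sketch of that claim-by-renaming step in the first DAG (the folder, DAG ids, schedule and the target DAG name are assumptions, not from the question):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

INCOMING_DIR = '/data/incoming'   # assumed drop folder used by the web service

def claim_one_file(**_):
    # Pick the first unclaimed file and rename it so no other run grabs it.
    for name in sorted(os.listdir(INCOMING_DIR)):
        if name.endswith('.json'):
            src = os.path.join(INCOMING_DIR, name)
            dst = src + '_processing'
            os.rename(src, dst)           # atomic on the same filesystem
            return dst                    # pushed to XCom for the trigger below
    raise ValueError('no file to claim')  # or put a FileSensor upstream to wait instead

with DAG('sense_and_trigger', start_date=datetime(2023, 1, 1),
         schedule_interval='*/2 * * * *', catchup=False, max_active_runs=1) as dag:
    claim = PythonOperator(task_id='claim_file', python_callable=claim_one_file)
    trigger = TriggerDagRunOperator(
        task_id='trigger_processing',
        trigger_dag_id='process_file',    # the DAG that runs tasks A->B->C
        conf={'file_path': "{{ ti.xcom_pull(task_ids='claim_file') }}"})
    claim >> trigger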
By the way, I always recommend this article: https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753
Seems like you should be able to run a batch-processor DAG with a BashOperator to clear the folder; just make sure you set depends_on_past=True on your DAG so that the folder is successfully cleared before the next time the DAG is scheduled.
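A tiny sketch of that suggestion, with an assumed script path, folder and schedule:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('batch_processor', start_date=datetime(2023, 1, 1),
         schedule_interval='@hourly', catchup=False,
         default_args={'depends_on_past': True}) as dag:       # next run waits for this one to succeed
    process = BashOperator(task_id='process_batch',
                           bash_command='python /opt/scripts/process_files.py /data/incoming')  # assumed script
    clear = BashOperator(task_id='clear_folder',
                         bash_command='rm -f /data/incoming/*.json')
    process >> clear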
I found this article: https://medium.com/@igorlubimov/dynamic-scheduling-in-airflow-52979b3e6b13
where a new operator, namely TriggerMultiDagRunOperator, is used. I think this suits my needs.
As of Airflow 2.3.0, you can use Dynamic Task Mapping, which was added to support use cases like this:
https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html#
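A minimal sketch of how that could look for the per-file use case (the listing logic and file paths are placeholders):

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def process_files():
    @task
    def list_files():
        return ['/data/a.json', '/data/b.json']   # e.g. scan the drop folder here

    @task
    def process(path: str):
        print('processing', path)                 # per-file tasks A->B->C logic

    process.expand(path=list_files())             # one mapped task instance per file

process_files()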
With Airflow, is it possible to restart an upstream task if a downstream task fails? This seems to go against the "Acyclic" part of the term DAG. I would think this is a common problem, though.
Background
I'm looking into using Airflow to manage a data processing workflow that has been managed manually.
There is a task that will fail if a parameter x is set too high, but increasing the parameter value gives better-quality results. We have not found a way to calculate a safe but maximally high value for parameter x. The manual process has been to restart the job with a lower parameter if it failed, until it works.
The workflow looks something like this:
Task A - Gather the raw data
Task B - Generate config file for job
Task C - Modify config file parameter x
Task D - Run the data manipulation Job
Task E - Process Job results
Task F - Generate reports
Issue
If task D fails because of parameter x being too high, I want to rerun task C and task D. This doesn't seem to be supported. I would really appreciate some guidance on how to handle this.
First of all: that's an excellent question; I wonder why it hasn't been discussed widely until now.
I can think of two possible approaches
Fusing Operators: As pointed out by @Kris, combining operators together appears to be the most obvious workaround
Separate Top-Level DAGs: Read below
Separate Top-Level DAGs approach
Given
Say you have tasks A & B
A is upstream to B
You want execution to resume (retry) from A if B fails
(Possible) Idea: If you're feeling adventurous
Put tasks A & B in separate top-level DAGs, say DAG-A & DAG-B
At the end of DAG-A, trigger DAG-B using TriggerDagRunOperator
In all likelihood, you will also have to use an ExternalTaskSensor after TriggerDagRunOperator
In DAG-B, put a BranchPythonOperator after Task-B with trigger_rule=all_done
This BranchPythonOperator should branch out to another TriggerDagRunOperator that then invokes DAG-A (again!), as sketched below
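A rough sketch of the DAG-B side of this idea (DAG and task ids are placeholders, and Task-B is stubbed out with a DummyOperator):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def choose_next(**context):
    # This runs even if task_b failed (trigger_rule below); branch accordingly.
    ti_b = context['dag_run'].get_task_instance('task_b')
    return 'retrigger_dag_a' if ti_b.state == 'failed' else 'done'

with DAG('dag_b', start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag_b:
    task_b = DummyOperator(task_id='task_b')                    # stand-in for the real Task-B
    branch = BranchPythonOperator(task_id='branch_on_b',
                                  python_callable=choose_next,
                                  trigger_rule='all_done')      # evaluate even on upstream failure
    retrigger = TriggerDagRunOperator(task_id='retrigger_dag_a',
                                      trigger_dag_id='dag_a')   # invokes DAG-A again
    done = DummyOperator(task_id='done')
    task_b >> branch >> [retrigger, done]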
Useful references
Fusing Operators Together
Wiring Top-Level DAGs together
EDIT-1
Here's a much simpler way that can achieve similar behaviour
How can you re-run upstream task if a downstream task fails in Airflow (using Sub Dags)
I have been working with Airflow a lot recently, and a very common pattern I find is looping over some collection to create multiple tasks, very similar to the example_python_operator.py DAG found in the example dags folder on GitHub.
My question has to do with dynamically building up the collection the loop iterates over. Let's say you want to create a task for each of an unknown set of clients stored in a database, and you plan to query the database to populate your list. Something like this:
first_task = PythonOperator(
    task_id='some_upstream_task',
    provide_context=True,
    python_callable=some_upstream_task,
    dag=dag)

clients = my_database_query()

for client in clients:
    task = PythonOperator(
        task_id='client_' + str(client),
        python_callable=some_function,
        dag=dag)
    task.set_upstream(first_task)
From what I have seen, this means that even if your DAG only runs weekly, your database is being polled every 30 seconds for these clients. Even if you set an upstream operator from the iterator, return the clients via XComs, and replace my_database_query() with an xcom_pull(), you're still polling XComs every 30 seconds. This seems wasteful to me, so I'm wondering if there are any better patterns for this type of DAG?
In your code sample we don't see the schedule interval of the DAG, but I'm assuming that you have it scheduled, let's say @daily, and that you want the DB query to run once a day.
In Airflow, the DAG file is parsed periodically by the scheduler (hence the "every 30 seconds"), so your module-level Python code causes the issue.
In your case, I would consider changing perspective: why not run the database query in a PostgresOperator and make that part of the DAG? Based on the output of that operator (which you can propagate via XCom, for example, or via a file in object storage), you can then have a downstream PythonOperator that runs a function not for one client but for all of them.
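A hedged sketch of that restructuring, using a PostgresHook inside PythonOperators rather than the PostgresOperator mentioned above, so the result is easy to pass through XCom (the connection id, table and DAG names are assumptions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_clients(**_):
    hook = PostgresHook(postgres_conn_id='my_db')                 # assumed connection id
    rows = hook.get_records('SELECT client_id FROM clients')      # assumed table
    return [row[0] for row in rows]                               # pushed to XCom

def process_all_clients(ti, **_):
    clients = ti.xcom_pull(task_ids='fetch_clients')
    for client in clients:
        print('processing client', client)                        # real per-client logic goes here

with DAG('clients_dag', start_date=datetime(2023, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    fetch = PythonOperator(task_id='fetch_clients', python_callable=fetch_clients)
    process = PythonOperator(task_id='process_all_clients',
                             python_callable=process_all_clients)
    fetch >> process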
I want to add a task to the queue at app startup; currently I am adding a scheduler.queue_task(...) call to the main db.py file. This is not ideal, as I had to define the task function in this file.
I also want the task to repeat every 2 minutes continuously.
I would like to know the best practice for this.
As stated in the web2py documentation, to rerun a task continuously you just have to specify it at task-queuing time:
scheduler.queue_task(your_function,
                     pargs=your_args,
                     timeout=120,     # just in case
                     period=120,      # as you want to run it every 2 minutes
                     immediate=True,  # starts task ASAP
                     repeats=0)       # just does the infinite repeat magic
To queue it at startup, you might want to use the web2py cron feature in this simple way:
@reboot root *your_controller/your_function_that_calls_queue_task
Do not forget to enable this feature (-Y, more details in the doc).
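To avoid defining the task function in db.py, the usual place for both the function and the scheduler is a model file; a minimal sketch, with placeholder names, might be:

# models/scheduler.py
from gluon.scheduler import Scheduler

def your_function():
    # the work you want to repeat every 2 minutes
    return 'done'

scheduler = Scheduler(db)   # db is the usual DAL instance defined in db.py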
There is no real mechanism for this within web2py, it seems.
There are a few hacks one could use to continuously repeat tasks or schedule them at startup, but as far as I can see the web2py scheduler needs a lot of work.
The best option is to just abandon this web2py feature and use Celery or something similar for advanced usage.