Airflow: Proper way to run DAG for each file - python

I have the following task to solve:
Files are being sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files. For each file, the same tasks will be performed.
Overall, the flow looks as follows: for each file, run tasks A->B->C->D.
Files are being processed in batch. While this task seemed trivial to me, I have found several ways to do this and I am confused about which one is the "proper" one (if any).
First pattern: Use experimental REST API to trigger dag.
That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST api to trigger the DAG, by passing the file_id as conf
Cons: The REST API is still experimental, and I am not sure how Airflow would handle a burst of many requests arriving at once (which shouldn't happen, but what if it does?).
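For the first pattern, the request the web service would send can be sketched as follows. This is only a sketch, assuming the stable REST API of Airflow 2.x (POST /api/v1/dags/{dag_id}/dagRuns) with basic authentication enabled; the experimental API mentioned above uses a different path. The base URL, the DAG id "process_file", and the credentials are placeholders, not values from the question.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, file_id, username, password):
    """Build (but do not send) a request that starts one DAG run for one file."""
    # Assumed endpoint: the stable REST API path used by Airflow 2.x.
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # The file id travels in `conf`, so the DAG can read it from dag_run.conf.
    body = json.dumps({"conf": {"file_id": file_id}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values for illustration only.
request = build_trigger_request(
    "http://localhost:8080", "process_file", "x.json", "user", "pass"
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to add retries or a queue in front of Airflow if a burst of uploads ever becomes a concern.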
Second pattern: 2 dags. One senses and triggers with TriggerDagOperator, one processes.
Always using the same web service as described before, but this time it just stores the file. Then we have:
First dag: Uses a FileSensor along with the TriggerDagOperator to trigger N dags given N files
Second dag: Task A->B->C
Cons: I need to make sure the same file is not sent to two different DAG runs.
Example:
File in folder: x.json
Sensor finds x, triggers DAG (1)
The sensor is scheduled again. If DAG (1) has not yet processed or moved the file, the sensor DAG might trigger a new DAG run for the same file, which is unwanted.
Third pattern: for file in files, task A->B->C
As seen in this question.
Cons: This could work. However, what I dislike is that the UI will probably get messy, because DAG runs will not all look the same: their shape will change with the number of files being processed. Also, if there are 1000 files to process, the run would probably be very difficult to read.
Fourth pattern: Use subdags
I am not yet sure how they completely work as I have seen they are not encouraged (at the end), however it should be possible to spawn a subdag for each file and have it running. Similar to this question.
Cons: Seems like subdags can only be used with the sequential executor.
Am I missing something and over-thinking something that should be (in my mind) quite straight-forward? Thanks

I know I am late, but I would choose the second pattern: "2 dags. One senses and triggers with TriggerDagOperator, one processes", because:
Every file can be executed in parallel
The first DAG could pick a file to process, rename it (adding a suffix '_processing' or moving it to a processing folder)
If I were a new developer in your company and opened the workflow, I would want to understand the logic of the workflow, rather than see a graph that was dynamically built from whichever files were processed last time
If DAG 2 finds an issue with the file, it renames it (adding an '_error' suffix) or moves it to an error folder
It's a standard way to process files without creating any additional operator
It makes the DAG idempotent and easier to test. More info in this article
Renaming and/or moving files is a pretty standard way to process files in every ETL.
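The "move it to a processing folder" step can be sketched like this. It is a minimal sketch that assumes the incoming and processing folders live on the same filesystem, where a rename is atomic on POSIX systems; the folder names and the helper name claim_file are made up for illustration.

```python
import os
import tempfile


def claim_file(path, processing_dir):
    """Move `path` into `processing_dir`; return the new path, or None if it is gone."""
    os.makedirs(processing_dir, exist_ok=True)
    target = os.path.join(processing_dir, os.path.basename(path))
    try:
        # Only one caller can win the rename; the loser finds the source missing.
        os.rename(path, target)
    except FileNotFoundError:
        return None
    return target


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    incoming = os.path.join(root, "incoming")
    processing = os.path.join(root, "processing")
    os.makedirs(incoming)
    source = os.path.join(incoming, "x.json")
    with open(source, "w") as handle:
        handle.write("{}")

    first_claim = claim_file(source, processing)   # succeeds and moves the file
    second_claim = claim_file(source, processing)  # file already claimed
```

The sensing DAG would call something like claim_file before triggering the processing DAG and pass the returned path as the run's conf, so a second sensor pass never sees the same file again.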
By the way, I always recommend this article https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753. It doesn't

Seems like you should be able to run a batch processor dag with a bash operator to clear the folder, just make sure you set depends_on_past=True on your dag to make sure the folder is successfully cleared before the next time the dag is scheduled.

I found this article: https://medium.com/#igorlubimov/dynamic-scheduling-in-airflow-52979b3e6b13
where a new operator, namely TriggerMultiDagRunOperator is used. I think this suits my needs.

As of Airflow 2.3.0, you can use Dynamic Task Mapping, which was added to support use cases like this:
https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html#
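A rough sketch of what this could look like for the per-file case, assuming Airflow 2.3 or later and the TaskFlow decorators. The folder /data/incoming, the DAG name, and the task names are hypothetical. The import is guarded only so that the plain helper can be tried without Airflow installed, and the schedule argument is left out because its name differs between Airflow versions.

```python
from datetime import datetime
from pathlib import Path

INCOMING_DIR = Path("/data/incoming")  # hypothetical folder written by the web service


def find_pending_files(folder):
    """Return the sorted paths of the JSON files waiting in `folder`."""
    return sorted(str(path) for path in Path(folder).glob("*.json"))


try:
    from airflow.decorators import dag, task
except ImportError:  # Airflow not installed: only the helper above is usable
    dag = task = None

if dag is not None:

    @dag(start_date=datetime(2022, 1, 1), catchup=False)
    def process_incoming_files():
        @task
        def list_files():
            # Evaluated at run time, so each DAG run sees the current folder content.
            return find_pending_files(INCOMING_DIR)

        @task
        def task_a(path):
            # Placeholder for the real per-file work; B, C and D would follow.
            return path

        # One mapped task instance is created per file returned by list_files().
        task_a.expand(path=list_files())

    per_file_dag = process_incoming_files()
```

The DAG keeps a fixed shape in the UI (one list task, one mapped task), while the number of mapped instances follows the number of files found in each run.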

Related

Is Apache Airflow or Luigi a good tool for this use case?

I'm working at an org that has an embarrassingly large amount of manual tasks that touch multiple databases (internal and external), multiple file types/formats, and a huge amount of datasets. The "general" workflow is to take data from one place, extract/create metadata for it, change the file format into something more standardised, and publish the dataset.
I'm trying to improve automation here and I've been looking at Luigi and Apache-Airflow to try standardise some of the common blocks that get used but I'm not sure if these are the appropriate tools. Before I sink too much time in figuring out these tools I thought I'd ask here.
A dummy example:
Check a REST API end point to see if a dataset has changed (www.some_server.org/api/datasets/My_dataset/last_update_time)
If it's changed download the zip file (My_dataset.zip)
Unzip the file (My_dataset.zip >> my_file1.csv, my_file2.csv ... my_fileN.csv)
Do something with each CSV: filter, delete, pivot, whatever
Combine the csv's and transform into "My_filtered_dataset.json"
For each step create/append a "my_dataset_metadata.csv" file to show things like the processing date, inputs, authors, pipeline version etc.
Upload json and metadata files somewhere else
My end goal would be to quickly swap out blocks, like the "csv_to_json" function with a "csv_to_xlsx" function, for different processing tasks. Also have things like alerting on failure, job visualisation, worker management etc.
Some problems I'm seeing are that Luigi isn't so good at handling dynamic filenames and would struggle to create N branches when I don't know the number of files coming out of the zip file. It's also very basic and doesn't seem to have much community support.
Also from the Airflow docs: "This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. (although there does seem to be some support for this ability with XComs)" In my dummy case I would probably need to share, at least, the filenames and the metadata between each step. Combining all steps into a single operator would kind of defeat the point of Airflow...
Am I misunderstanding things here? Are these tools good for this kind of application? Is this task too simple/complex and should just be stuck into a single script?
With Airflow you can achieve all your goals:
there are sensor operators to wait for a condition: check an API, check if a file exists, run a query on a database and check the result, ...
create a dag to define the dependencies between your tasks, and decide which tasks can run in parallel and which should be run sequentially
a lot of existing operators developed by the community: SSH operators, operators to interact with cloud providers services, ...
a built-in mechanism to send emails on run failure and retry
it's based on python scripts, so you can create a method to create dags dynamically (dags factory), so if you have several dags which share a part of the same logic, you can create them by a conditional method
a built-in messaging system (XCom), to send small data between tasks
a secure way to store your credentials and secrets (Airflow variables and connections)
a modern UI to manage your dag runs and read the logs of your tasks, with Access Control.
you can develop your own plugins and add them to Airflow (ex: UI plugin using FlaskAppBuilder)
you can process each file in a separate task in parallel, on a cluster of nodes (Celery or K8S), using the new feature Dynamic Task Mapping (Airflow >= 2.3)
To pass files between the tasks, which is a basic need for everyone, you can use an external storage service (google GCS, AWS S3, ...) to store the output of each task, use XCom to pass the file path, then read the file in the second task. You can also use a custom backend for XCom to use S3 for example, instead of Airflow metastore db, in this case all the variables and the files passed by XCom will be stored automatically on S3, and there will be no more limit on message size.

How to run same dag two times in a single run in Airflow

I am absolutely new to Airflow. I have a requirement to run two EMR jobs. Currently I have a Python script which depends on some input files; if they are present, it triggers an EMR job.
My new requirement is that I will have two different input files (of the same type), and these two files will be the inputs to the EMR jobs. In both cases Spark will do the same thing; only the input file differs.
create_job_workflow = EmrCreateJobFlowOperator(
    task_id='some-task',
    job_flow_overrides=job_flow_args,
    aws_conn_id=aws_conn,
    emr_conn_id=emr_conn,
    dag=dag
)
How can I run the same DAG twice, changing only the input file passed to spark-submit? Basically, whenever I 'trigger DAG', it should take two different input files and start two different EMR jobs on two different EMR clusters. Can anyone suggest a best practice for this? Or is it possible by setting max_active_runs=2?
Best practice would be to have two different tasks for it. Setting max_active_runs=2 only limits the number of concurrent DAG runs to 2. You can put the configuration for your tasks in any data structure, iterate over it, and build the tasks from each entry.
Another thing you can do:
You can receive the filename as the payload of your dag
Access it like: context['dag_run'].conf.get('filename')
And retrigger the same dag with a trigger dag_run operator, updating the desired payload with the other file
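The "iterate over a data structure and build the tasks" suggestion could look roughly like this. It is a sketch under assumptions: the S3 locations and task ids are invented, job_flow_args here is a tiny stand-in for the one in the question, and the step layout (command-runner.jar with spark-submit arguments) is the usual EMR step format rather than anything taken from the question.

```python
import copy

# Hypothetical locations; replace with your own script and input files.
SPARK_SCRIPT = "s3://my-bucket/jobs/process.py"
INPUT_FILES = {
    "first": "s3://my-bucket/input/first.csv",
    "second": "s3://my-bucket/input/second.csv",
}


def build_overrides(base_args, input_path):
    """Copy the shared cluster settings and add one spark-submit step for `input_path`."""
    overrides = copy.deepcopy(base_args)  # never mutate the shared settings
    overrides["Steps"] = [
        {
            "Name": f"process {input_path}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", SPARK_SCRIPT, input_path],
            },
        }
    ]
    return overrides


job_flow_args = {"Name": "file-processing"}  # stand-in for the question's job_flow_args

# One set of overrides per input file, keyed by the task id it would get.
overrides_by_task_id = {
    f"create_cluster_{name}": build_overrides(job_flow_args, path)
    for name, path in INPUT_FILES.items()
}

# Inside the DAG file, each entry would become its own task, for example:
# for task_id, overrides in overrides_by_task_id.items():
#     EmrCreateJobFlowOperator(task_id=task_id, job_flow_overrides=overrides,
#                              aws_conn_id=aws_conn, emr_conn_id=emr_conn, dag=dag)
```

With this layout a single 'trigger DAG' starts both cluster tasks in the same run, and adding a third file is a one-line change to INPUT_FILES.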

multiple filepaths in S3KeySensor on Airflow

I have some tasks that need to be run when one of few certain files or directories changes on S3.
Let's say I have PythonOperator, and it needs to run if /path/file.csv changes or if /path/nested_path/some_other_file.csv changes.
I have tried to create dynamic KeySensors like this:
trigger_path_list = ['/path/file.csv', '//path/nested_path/some_other_file.csv']
for trigger_path in trigger_path_list:
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that both S3KeySensors would have to succeed before main_task runs.
I have also tried to make both tasks unique like here:
for trigger_path in trigger_path_list:
    main_task = PythonOperator(
        task_id='{}_task_triggered_by_{}'.format(dag_name, trigger_path),
        ...)
    file_sensor_task = S3KeySensor(
        task_id=get_sensor_task_name(trigger_path),
        poke_interval=30,
        timeout=60 * 60 * 24 * 8,
        bucket_key=os.path.join('s3://', s3_bucket_name, trigger_path),
        wildcard_match=True)
    file_sensor_task >> main_task
However, this would mean that the DAG would not finish unless all of the files from the list appeared. And if /path/file.csv appeared twice in a row, the task would not be triggered the second time, as that part of the DAG would already be completed.
Isn't there a way to pass multiple files to the S3KeySensor ? I do not want to create one DAG for every path, as for me it would be 40 DAGS x around 5 paths, which gives around 200 DAGs.
Any ideas?
Couple ideas for this:
Use one of Airflow's other task trigger rules. You probably want one_success on the main task, which means that just one of however many upstream sensors needs to succeed for the task to run. This does mean the other sensors will keep running, but you could potentially use the soft_fail flag with a low timeout to avoid any failure. Alternatively, you can have the main task or a separate post-cleanup task mark the rest of the sensors in the DAG as success.
Depending on how many possible paths there are, if it's not too many, then maybe just have a single task sensor that loops through the paths to check for changes. As soon as one path passes the check, you can return so the sensor succeeds. Otherwise, keep polling if no path passes.
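The second idea, one check that loops over all paths, can be sketched in plain Python. The is_ready callable is an assumption standing in for whatever "has this key appeared or changed" test you use against S3; in Airflow, a function like poke below could be handed to a PythonSensor as its python_callable.

```python
TRIGGER_PATHS = ["path/file.csv", "path/nested_path/some_other_file.csv"]


def first_ready_path(paths, is_ready):
    """Return the first path for which `is_ready(path)` is true, else None."""
    for path in paths:
        if is_ready(path):
            return path
    return None


def poke(paths, is_ready):
    """Sensor-style check: True ends the wait, False means poll again later."""
    return first_ready_path(paths, is_ready) is not None


# Stand-in for a real S3 check: pretend only the nested file has arrived.
arrived = {"path/nested_path/some_other_file.csv"}
ready_now = poke(TRIGGER_PATHS, lambda path: path in arrived)
ready_none = poke(TRIGGER_PATHS, lambda path: False)
```

Because the loop returns on the first match, one sensor task covers all paths, so 40 DAGs with 5 paths each stay at 40 DAGs.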
In either case, you would still have to schedule this DAG frequently/non-stop if you're looking to keep listening on new files. In general, Airflow isn't really intended for long-running processes. If the main task logic is easier to perform via Airflow, you could still consider having an external process monitor changes, but then trigger a DAG via the API or CLI that contains the main task.
Also not sure if applicable here or something you considered already, but you may be interested in S3 Event Notifications to more explicitly learn about changed files or directories, which could then be consumed by the SQSSensor.

how to efficiently make airflow dag definitions database-driven

Background
I have some DAGs that pull data from a third-party API.
The accounts we need to pull can change over time. To determine which accounts to pull, depending on the process we may need to query a database or make an HTTP request.
Before airflow, we would just get the account list at the start of the python script. Then we would iterate through the account list and pull each account to file or whatever it was we needed to do.
But now, using airflow, it makes sense to define tasks at the account level and let airflow handle retry functionality and date range and parallel execution etc.
Thus my dag might look something like this:
Problem
Since each account is a task, the account list needs to be accessed with every dag parse. But since dag files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every dag parse from every machine all day long. This could be resource intensive, and could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
write to csv or pickle file and use mtime to expire.
the concern with this is that i might get collisions if two processes try to expire the file at the same time. i don't know how likely this is or what the consequences would be but probably nothing terrible.
create a common sqlite DB for all such processes. should be auto created first time a variable is accessed. each config variable gets a row in table. use last_modified_datetime column to tell when to expire.
requires more elaborate code & dependencies.
use airflow variables
nice thing about this would be that it uses existing DB, so would be no $ per query and reasonable network lag, but it still requires network round trip.
has benefit of being identical across all nodes in a multi-node setup.
determining when to expire would probably be problematic so would probably create config manager dag to update the config variables periodically.
but then this would add complexity to the deployment and development process -- the variables need to be populated in order to define the DAGs properly -- all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
never used them, but I have a suspicion they could be used here. But the community seems to discourage their use anyway...
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
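For reference, the first approach (a local file expired by mtime) can be sketched as below. This is a minimal sketch, not a tested production cache: the function name load_cached and the JSON format are choices made for illustration. Writing to a temporary file and swapping it in with os.replace is one way to address the collision concern, since readers then never see a half-written file.

```python
import json
import os
import tempfile
import time


def load_cached(cache_path, ttl_seconds, fetch):
    """Return the cached value if the file is fresh, else call `fetch` and rewrite it."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < ttl_seconds:
            with open(cache_path) as handle:
                return json.load(handle)
    except (OSError, ValueError):
        pass  # missing or unreadable cache file: fall through and refresh

    value = fetch()
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(value, handle)
    os.replace(tmp_path, cache_path)  # atomic swap, even with concurrent writers
    return value


# Demonstration with a fake "expensive" lookup that counts its calls.
calls = []


def fetch_accounts():
    calls.append(1)
    return ["account_1", "account_2"]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "accounts.json")
    first = load_cached(cache_file, 300, fetch_accounts)   # fetches and writes
    second = load_cached(cache_file, 300, fetch_accounts)  # served from the file
```

If two processes refresh at the same moment, both fetch and the last os.replace wins, which costs one extra query but never corrupts the file.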
Airflow default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase that if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the Webserver will trigger DAG parses as well; I'm not sure how frequently. Probably following the gunicorn workers' refresh interval (which is 30 seconds by default). Workers will also do it by default at the start of every task, but that can be avoided if you activate DAG pickling. Not sure if that's a good idea though; I've heard this is destined to be deprecated.
One other thing you can try to do is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has a built-in functools for that (lru_cache) and together with pickling it might be enough and very very much easier than the other options.
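A small sketch of that memoization idea, with a time-to-live added by making the current time window part of the cache key. The names fetch_accounts and get_accounts are placeholders, and the cache lives only inside one Python process, so every parsing process keeps its own copy.

```python
import time
from functools import lru_cache

TTL_SECONDS = 300
fetch_count = 0


def fetch_accounts():
    """Stand-in for the expensive database query or HTTP request."""
    global fetch_count
    fetch_count += 1
    return ("account_1", "account_2")


@lru_cache(maxsize=1)
def _accounts_for_window(window):
    # `window` is only a cache key; a new window forces a fresh fetch.
    return fetch_accounts()


def get_accounts(now=None):
    """Return the account list, refetching at most once per TTL window."""
    now = time.time() if now is None else now
    return _accounts_for_window(int(now // TTL_SECONDS))


accounts = get_accounts(now=1000)        # first call fetches
accounts_again = get_accounts(now=1001)  # same window, served from the cache
fetches_in_first_window = fetch_count
get_accounts(now=1300)                   # next window, fetches again
```

Entries expire at window boundaries rather than exactly TTL_SECONDS after the fetch, which is usually good enough for a list of accounts.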
I have the same exact scenario.
Have API call for multiple accounts. Initially created a python script to iterate the list.
When I started using Airflow thought about what you are planning to do. Tried 2 of the alternatives you listed. After some experimentation decided to handle retry logic within python with simple try-except blocks if HTTP calls fail. Reasons are
One script to maintain
Less Airflow objects
Restartability is easier with one script in place.
(restarting failed job in Airflow is not a breeze (no pun intended))
At the end it's up to you, that was my experience.

Using ZooKeeper to manage tasks which are in process or have been processed

I have a python script which periodically scans directories, processing new files. Each file takes a long time to process (many hours). I currently have the script running on a single computer, writing the names of processed files to a local file. Not fancy or robust, but it more or less works. I would like to use multiple worker machines to improve throughput (and robustness). My goals are to keep it as simple as possible. A zookeeper cluster is readily available.
My plan is to have in zookeeper a directory "started_files" with ephemeral nodes with the filename, which is known to be unique. I would have another directory "completed_files" with regular nodes with the filename. In pseudocode,
if filename does not exist in completed_files:
    try:
        create ephemeral node filename in started_files
        process(filename)
        create node filename in completed_files
    except node exists error:
        do nothing, another worker is processing it
My first question is whether or not this is safe. Under any circumstance, can two different machines each create the same node successfully? I don't fully understand the doc. Having a file processed twice won't cause anything ALL that bad, but I would prefer it to be correct out of principle.
Secondly, is this a decent approach? Is there another approach which is clearly better? I will be processing 10's of files per DAY, so performance of this part of the application doesn't really matter to me (I sure wish processing the file was faster). Alternatively, I could have another script with just a single instance (or elect a leader) to scan for files and put them in a queue. I could modify the code which is causing these files to magically appear in the first place. I could use celery or storm. However all of those alternatives grow the scope which I am trying to keep small and simple.
In general your approach should work. It is possible to write znodes to ZooKeeper in such a way that creating a path which already exists will fail.
For the ephemeral znodes, you already found out quite well that these vanish automatically when a client closes its connection to ZooKeeper, which could be especially useful in the case of failing compute nodes.
Other nodes can actually monitor the path with the ephemeral znodes in order to figure out when it would be a good idea to scan for new tasks.
It would even be possible to implement a queue on top of ZooKeeper, for instance using the sequencing of znodes; there are possibly better ways.
In general I believe that a message queue system with publish subscribe pattern would scale a bit better. In that case you would only need to think about how to reschedule jobs of failed compute nodes.
