With Airflow, is it possible to restart an upstream task if a downstream task fails? This seems to be against the "Acyclic" part of the term DAG. I would think this is a common problem though.
Background
I'm looking into using Airflow to manage a data processing workflow that has been managed manually.
There is a task that will fail if a parameter x is set too high, but increasing the parameter value gives better quality results. We have not found a way to calculate a safe but maximally high value for parameter x. The manual process has been to restart the job with a lower parameter whenever it failed, until it succeeds.
The workflow looks something like this:
Task A - Gather the raw data
Task B - Generate config file for job
Task C - Modify config file parameter x
Task D - Run the data manipulation Job
Task E - Process Job results
Task F - Generate reports
Issue
If task D fails because of parameter x being too high, I want to rerun task C and task D. This doesn't seem to be supported. I would really appreciate some guidance on how to handle this.
First of all: that's an excellent question; I wonder why it hasn't been discussed widely until now.
I can think of two possible approaches
Fusing Operators: As pointed out by @Kris, combining operators together appears to be the most obvious workaround
Separate Top-Level DAGs: Read below
Separate Top-Level DAGs approach
Given
Say you have tasks A & B
A is upstream to B
You want execution to resume (retry) from A if B fails
(Possible) Idea: If you're feeling adventurous
Put tasks A & B in separate top-level DAGs, say DAG-A & DAG-B
At the end of DAG-A, trigger DAG-B using TriggerDagRunOperator
In all likelihood, you will also have to use an ExternalTaskSensor after TriggerDagRunOperator
In DAG-B, put a BranchPythonOperator after Task-B with trigger_rule=all_done
This BranchPythonOperator should branch out to another TriggerDagRunOperator that then invokes DAG-A (again!)
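A minimal sketch of what DAG-B could look like, using recent Airflow 2.x-style imports; the DAG/task ids, the no-op task body and the state check are illustrative assumptions rather than a tested recipe:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.state import State


def _choose_branch(**context):
    # If Task-B failed, branch to the task that re-triggers DAG-A; otherwise continue.
    ti = context["dag_run"].get_task_instance("task_b")
    return "retrigger_dag_a" if ti.state == State.FAILED else "continue_pipeline"


with DAG("dag_b", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    task_b = PythonOperator(task_id="task_b", python_callable=lambda: None)

    branch = BranchPythonOperator(
        task_id="branch_on_task_b",
        python_callable=_choose_branch,
        trigger_rule="all_done",  # run the branch even when task_b has failed
    )

    retrigger_dag_a = TriggerDagRunOperator(
        task_id="retrigger_dag_a",
        trigger_dag_id="dag_a",  # DAG-A can lower parameter x and start the cycle again
    )
    continue_pipeline = EmptyOperator(task_id="continue_pipeline")

    task_b >> branch >> [retrigger_dag_a, continue_pipeline]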
Useful references
Fusing Operators Together
Wiring Top-Level DAGs together
EDIT-1
Here's a much simpler way that can achieve similar behaviour
How can you re-run upstream task if a downstream task fails in Airflow (using Sub Dags)
Related
In Airflow, I'm writing a monitoring task for a DAG which will run again and again until a certain condition is met. In this task, when some event happens, I need to store the timestamp and retrieve this value in the next task run (for the same task) and update it again if required.
What's the best way to store this value?
So far I have tried below approaches to store:
storing in XComs, but this value couldn't be retrieved in the next task run as the XCom variable gets deleted for each new task run of the same DAG run.
storing in Airflow Variables - this solves the purpose; I could store, update, and delete as needed, but it doesn't look clean for my use case as a lot of new Variables get generated per DAG and we have over 2k DAGs (pipelines).
global variables in the Python class, but the value gets overridden in the next task run.
Any suggestion would be helpful.
If you have a task that is re-run with the same "Execution Date", using Airflow Variables is your best choice. XCom will be deleted by definition when you re-run the same task with the same execution date, and that won't change.
Basically, what you want to do is store the "state" of task execution, and it's kind of "against" Airflow's principle of idempotent tasks (where re-running the task should produce the "final" result every time you run it). You, on the other hand, want to store the state of the task between re-runs and have it behave differently on subsequent re-runs, based on the stored state.
Another option you could use is to store the state in external storage (for example, an object in S3). This might be better for performance if you do not want to load your DB too much. You could come up with a naming "convention" for such a state object, pull it at the start, and push it when you finish the task.
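Either way, the pull-at-start / push-at-finish pattern could look roughly like this, here using an Airflow Variable as the store; the key-naming convention and the event check are made-up placeholders:

import json

from airflow.models import Variable


def some_event_happened():
    # placeholder for the real monitoring condition
    return False


def monitor(**context):
    # one state object per DAG + execution date, so re-runs of the same task see it
    key = f"monitor_state_{context['dag'].dag_id}_{context['ds']}"
    state = json.loads(Variable.get(key, default_var="{}"))

    if some_event_happened():
        state["last_event_ts"] = context["ts"]  # remember when the event happened
        Variable.set(key, json.dumps(state))    # survives re-runs of the same task

    print("last event seen at:", state.get("last_event_ts"))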
You could use XComs with include_prior_dates parameter. Docs state the following:
include_prior_dates (bool) -- If False, only XComs from the current execution_date are returned. If True, XComs from previous dates are returned as well.
(Default value is False)
Then you would do: xcom_pull(task_ids='previous_task', include_prior_dates=True)
I haven't tried it out personally, but it looks like this may be a good solution for your case.
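For illustration, pulling it from inside the task's own callable could look roughly like this (the task id and key are placeholders):

def monitor(**context):
    ti = context["ti"]
    # pull whatever a previous run of this same task pushed,
    # including pushes made under earlier execution dates
    previous = ti.xcom_pull(
        task_ids="monitor_task",
        key="last_event_ts",
        include_prior_dates=True,
    )
    print("previous value:", previous)
    ti.xcom_push(key="last_event_ts", value=context["ts"])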
I need a scheduled task in Airflow to run with different parameters. One way would be to write different DAGs for different parameters, but I was wondering if there is a better way of doing this, like how we pass parameters to a manual trigger.
The bigger issue with writing the same DAG with different URLs is that you're breaking the DRY (Don't Repeat Yourself) principle. If you need different schedules and you don't want to repeat the code twice, you can take this DAG factory idea and build your own factory for these two DAGs. You'll end up with two DAG files that invoke the same factory with the parameters of schedule (1 AM and 2 AM) and URL.
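A rough sketch of that factory idea; the ids, schedules, URLs and the task body are placeholders, and each thin DAG file would just call the factory with its own parameters:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_fetch_dag(dag_id, schedule, url):
    with DAG(dag_id, start_date=datetime(2023, 1, 1),
             schedule_interval=schedule, catchup=False) as dag:
        PythonOperator(
            task_id="fetch",
            python_callable=lambda: print(f"fetching {url}"),
        )
    return dag


# e.g. dags/fetch_a.py and dags/fetch_b.py would each contain one call like:
dag_a = build_fetch_dag("fetch_a", "0 1 * * *", "https://example.com/a")
dag_b = build_fetch_dag("fetch_b", "0 2 * * *", "https://example.com/b")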
I have the following task to solve:
Files are being sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files. For each file, the same tasks will be performed.
Overall the flow looks as follows: for each file, run tasks A->B->C->D
Files are being processed in batch. While this task seemed trivial to me, I have found several ways to do this and I am confused about which one is the "proper" one (if any).
First pattern: Use experimental REST API to trigger dag.
That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST api to trigger the DAG, by passing the file_id as conf
Cons: REST APIs are still experimental, and I'm not sure how Airflow would handle a load test with many requests coming in at one point (which shouldn't happen, but what if it does?)
Second pattern: 2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes.
Always using the same web service as described before, but this time it just stores the file. Then we have:
First dag: Uses a FileSensor along with the TriggerDagRunOperator to trigger N dags given N files
Second dag: Task A->B->C
Cons: Need to avoid that the same files are being sent to two different DAG runs.
Example:
Files in folder x.json
Sensor finds x, triggers DAG (1)
Sensor goes back and is scheduled again. If DAG (1) did not process/move the file, the sensor DAG might schedule a new DAG run with the same file, which is unwanted.
Third pattern: for file in files, task A->B->C
As seen in this question.
Cons: This could work; however, what I dislike is that the UI will probably get messed up, because every DAG run will not look the same but will change with the number of files being processed. Also, if there are 1000 files to be processed, the run would probably be very difficult to read.
Fourth pattern: Use subdags
I am not yet sure how they completely work, as I have seen they are not encouraged (at the end); however, it should be possible to spawn a subdag for each file and have it running. Similar to this question.
Cons: Seems like subdags can only be used with the sequential executor.
Am I missing something and over-thinking something that should be (in my mind) quite straight-forward? Thanks
I know I am late, but I would choose the second pattern: "2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes", because:
Every file can be processed in parallel
The first DAG could pick a file to process and rename it (adding a '_processing' suffix or moving it to a processing folder); a rough sketch of this follows the list below
If I am a new developer in your company and I open the workflow, I want to understand the logic of what the workflow is doing, rather than which files happened to be processed the last time it was dynamically built
If DAG 2 finds an issue with the file, it renames it (with the '_error' suffix) or moves it to an error folder
It's a standard way to process files without creating any additional operator
It makes the DAG idempotent and easier to test. More info in this article
Renaming and/or moving files is a pretty standard way to process files in every ETL.
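For what it's worth, the claim-by-renaming step can be very small; a rough sketch (the folders are placeholders, and os.rename is atomic on a single filesystem, so two sensor runs cannot both claim the same file):

import os

INBOX = "/data/inbox"            # where the web service drops files
PROCESSING = "/data/processing"  # where claimed files are moved


def claim_one_file(**context):
    for name in sorted(os.listdir(INBOX)):
        src = os.path.join(INBOX, name)
        dst = os.path.join(PROCESSING, name)
        try:
            os.rename(src, dst)  # atomic claim; fails if another run already took it
        except FileNotFoundError:
            continue
        return dst               # hand the claimed path on, e.g. via XCom or conf
    return None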
By the way, I always recommend this article: https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753
It seems like you should be able to run a batch-processor DAG with a bash operator to clear the folder; just make sure you set depends_on_past=True on your DAG to ensure the folder is successfully cleared before the next time the DAG is scheduled.
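A bare-bones version of that suggestion might look like this; the schedule, folder and shell command are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "batch_folder_processor",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"depends_on_past": True},  # next run waits for a successful clear
) as dag:
    process_and_clear = BashOperator(
        task_id="process_and_clear",
        bash_command="process_files.sh /data/inbox && rm -f /data/inbox/*",
    )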
I found this article: https://medium.com/@igorlubimov/dynamic-scheduling-in-airflow-52979b3e6b13
where a new operator, namely TriggerMultiDagRunOperator, is used. I think this suits my needs.
As of Airflow 2.3.0, you can use Dynamic Task Mapping, which was added to support use cases like this:
https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html#
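A minimal sketch of that feature, with the file-listing logic stubbed out as a placeholder:

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def per_file_pipeline():
    @task
    def list_files():
        # placeholder: return whatever file discovery you actually need
        return ["/data/inbox/a.json", "/data/inbox/b.json"]

    @task
    def process(path: str):
        print(f"processing {path}")

    # one mapped task instance is created per discovered file
    process.expand(path=list_files())


per_file_pipeline()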
I like to combine a chain and a group in a small workflow of immutable tasks and without a results backend.
However, when I try this, Celery automatically converts it to a chord and then complains that there is no results backend.
Is there any way I can get this to work without a results backend?
Code:
from celery import chain, group, shared_task

@shared_task
def test_canvas():
    workflow = chain(group(test_task_a.si(), test_task_b.si()), test_task_c.si())
    workflow.delay()
Here is the error message I get:
raised unexpected: NotImplementedError('Starting chords requires a result backend to be configured.
Note that a group chained with a task is also upgraded to be a chord, as this pattern requires synchronization.
Result backends that supports chords: Redis, Database, Memcached, and more.',)
Interestingly, running a chain or a group by itself works just fine.
Example:
workflow = chain(test_task_a.si(), test_task_b.si(), test_task_c.si())
workflow.delay()
Unfortunately, I think that the answer is no - you can't run a chord without a backend:
Tasks used within a chord must not ignore their results. In practice this means that you must enable a result_backend in order to use chords.
Your first example in test_canvas is implicitly a chord:
A chord is a task that only executes after all of the tasks in a group have finished executing (link).
If you think about the logic behind it (well explained here):
someone (the backend) needs to figure out when all the parallel tasks (the group) have ended, in order to know when it should trigger the next (chained) task.
In the second example, running multiple tasks concurrently with group is simple (nothing to coordinate later if no action should be taken).
Same for the chain - each task is responsible for triggering the next one, no complicated coordination is needed.
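If you do decide to enable a result backend to make the chord work, it is a single Celery setting; the Redis URLs below are only examples:

from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",  # required for chords / group-then-chain
)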
Background
I have some DAGs that pull data from a 3rd-party API.
The accounts we need to pull can change over time. To determine which accounts to pull, we may need to query a database or make an HTTP request, depending on the process.
Before Airflow, we would just get the account list at the start of the Python script. Then we would iterate through the account list and pull each account to a file or whatever it was we needed to do.
But now, using Airflow, it makes sense to define tasks at the account level and let Airflow handle retry functionality, date ranges, parallel execution, etc.
Thus my dag might look something like this:
Problem
Since each account is a task, the account list needs to be accessed with every DAG parse. But since DAG files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every DAG parse from every machine all day long. This could be resource intensive, and could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
write to a CSV or pickle file and use mtime to expire.
the concern with this is that I might get collisions if two processes try to expire the file at the same time. I don't know how likely this is or what the consequences would be, but probably nothing terrible. (A rough sketch of this idea follows the list below.)
create a common SQLite DB for all such processes. should be auto-created the first time a variable is accessed. each config variable gets a row in the table. use a last_modified_datetime column to tell when to expire.
requires more elaborate code & dependencies.
use Airflow Variables
nice thing about this would be that it uses the existing DB, so there would be no extra $ per query and reasonable network lag, but it still requires a network round trip.
has the benefit of being identical across all nodes in a multi-node setup.
determining when to expire would probably be problematic, so I would probably create a config-manager DAG to update the config variables periodically.
but then this would add complexity to the deployment and development process -- the variables need to be populated in order to define the DAGs properly -- all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
never used them, but I have a suspicion they could be used here. But the community seems to discourage their use anyway...
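For the first idea above (a pickled file expired via mtime), a rough sketch could look like this; the path and TTL are placeholders, and the collision concern still applies:

import os
import pickle
import time

CACHE_PATH = "/tmp/account_list.pkl"  # placeholder location
TTL_SECONDS = 600                     # placeholder time-to-live


def get_account_list(fetch):
    """Return the cached account list, refreshing it via fetch() when stale."""
    try:
        if time.time() - os.path.getmtime(CACHE_PATH) < TTL_SECONDS:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
    except (OSError, pickle.PickleError):
        pass  # missing or unreadable cache: fall through and refetch

    accounts = fetch()  # the expensive DB query or REST call
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(accounts, f)
    return accounts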
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow's default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase it if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the Webserver will trigger DAG parses as well, though I'm not sure how frequently. Probably following the gunicorn workers' refresh interval (which is 30 seconds by default). Workers will do it also by default at the start of every task, but that can be avoided if you activate pickling DAGs. I'm not sure that's a good idea though; I've heard this is something destined to be deprecated.
One other thing you can try is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has a built-in in functools for that (lru_cache), and together with pickling it might be enough and very much easier than the other options.
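A sketch of that memoization idea; lru_cache itself has no time-to-live, so the bucket argument below is one simple way to let the cached value expire (the fetch function and the 10-minute window are assumptions):

import time
from functools import lru_cache


def fetch_accounts_from_api():
    # placeholder for the real DB query or REST call
    return ["account_1", "account_2"]


@lru_cache(maxsize=1)
def _accounts_for_bucket(bucket):
    # bucket exists only to invalidate the cache once per TTL window
    return fetch_accounts_from_api()


def get_accounts(ttl_seconds=600):
    return _accounts_for_bucket(int(time.time() // ttl_seconds))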
I have the same exact scenario.
I have API calls for multiple accounts. Initially I created a Python script to iterate the list.
When I started using Airflow, I thought about what you are planning to do. I tried 2 of the alternatives you listed. After some experimentation, I decided to handle retry logic within Python with simple try-except blocks if HTTP calls fail. The reasons are:
One script to maintain
Fewer Airflow objects
Restartability is easier with one script in place.
(restarting a failed job in Airflow is not a breeze (no pun intended))
In the end it's up to you; that was my experience.