airflow dag failed... but all tasks succeeded - python

I am extremely confused by something in our airflow ui. In the tree view (and the graph view), a dag is indicated to have failed. However, all of its member tasks appear to have succeeded. You can see it here below (third from the end):
Does anyone know how this is possible, what it means, or how one would investigate it?

I have experienced the same. All tasks complete with success, but the DAG fails. Did not find anything in any logs.
In my case, the DAG's dagrun_timeout was set too low for tasks that ran for more than 30 minutes:
dag = DAG(...,
          dagrun_timeout=timedelta(minutes=30),
          ...)
I am on Airflow version 1.10.1.
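If this is the cause, raising (or removing) dagrun_timeout so it comfortably exceeds your longest runs should fix it. A minimal sketch with hypothetical dag_id and schedule values, not taken from the original DAG:

from datetime import datetime, timedelta
from airflow import DAG

# Give each DAG run up to two hours before it is marked failed; this should
# comfortably exceed the longest expected task runtime.
dag = DAG(
    dag_id='example_long_running',       # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    dagrun_timeout=timedelta(hours=2),
)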

Why airflow scheduler does not run my DAG?

I'm not able to run airflow DAG by scheduler. I have checked multiple threads here on forum, but I'm still not able to find the root cause. Of course DAG slider is set to ON. Below you can find DAG information:
with DAG(
        dag_id='blablabla',
        default_args=default_args,
        description='run my DAG',
        schedule_interval='45 0 * * *',
        start_date=datetime(2021, 8, 5, 0, 45),
        max_active_runs=1,
        tags=['bla']) as dag:

    t1 = BashOperator(
        task_id='blabla',
        bash_command="python3 /home/data/blabla.py",
        dag=dag
    )
I have checked the cron expression, which seems to be fine, and start_date is hardcoded, which rules out the issue of it being set to "now". When I check the DAG run history, all other scheduled DAGs are listed there; only this one seems to be invisible to the scheduler.
Triggering the DAG manually works fine and the Python code runs properly; the issue is only with the scheduler.
What was done:
checked CRON expression
checked start_date whether it's hardcoded
tried changing start_date to date couple months ago
tried many schedule_interval values (but always daily)
checked multiple threads here but did not find anything beyond the bullets above
Looks okay. One thing that comes to mind is the once-a-day schedule interval, which sometimes confuses people because the first run starts at the end of the interval, i.e. the next day (illustrated below). Since you set your start_date to more than one day ago, that shouldn't be a problem here.
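To make the interval semantics concrete, here is a tiny illustration using the start_date and cron expression from the question (the run timestamps are approximate and just for illustration):

from datetime import datetime, timedelta

# schedule_interval='45 0 * * *' with start_date=datetime(2021, 8, 5, 0, 45):
# the first data interval is 2021-08-05 00:45 -> 2021-08-06 00:45, and the
# scheduler only creates that run once the interval has *ended*.
start_date = datetime(2021, 8, 5, 0, 45)
first_run_created_at = start_date + timedelta(days=1)
print(first_run_created_at)  # 2021-08-06 00:45:00, not start_date itself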
To find a solution, we would need more information:
Could you post the default_args, or your full DAG?
Any details about your Airflow setup (versions, executor, etc.)
Could you check the scheduler logs for any information/errors? Specifically, $AIRFLOW_HOME/logs/dag_processor_manager.log and $AIRFLOW_HOME/logs/scheduler/[date]/[yourdagfile.py].log
Issue resolved by the steps below, found in another post:
Create a new Python file, copy your DAG code there, rename it so that the file name is unique, and then test again. It could be that the Airflow scheduler got confused by an inconsistency between the previous DAG runs' metadata and the current schedule.

What happens if run same dag multiple times while already running?

What happens if the same dag is triggered concurrently (or such that the run times overlap)?
I'm asking because I recently manually triggered a DAG that was still running when its scheduled run time passed; at that point, from the perspective of the web-server UI, it appeared to start running again from the beginning (and I could no longer track the previous instance). Is this just a case of that "run instance" overloading the dag_id, or is the job literally restarting (i.e. the previous processes are killed)?
As I understand it, it depends on how the run was triggered and whether the DAG has a schedule. If the run is based on the schedule defined in the DAG (say, a task that runs daily), it is still incomplete/working, and you click rerun, then that instance of the task will be rerun, i.e. the one for today. Likewise if the frequency were any other unit of time.
If you want to rerun other instances, you need to delete them from the previous jobs, as described by lars-haughseth in a different question: airflow-re-run-dag-from-beginning-with-new-schedule
If you trigger a DAG run manually, it gets the trigger's execution timestamp, and the run is displayed separately from the scheduled runs, as described in the external-triggers documentation:
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger’s timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In your instance it sounds like the latter. Hope that helps.

airflow cleared tasks not getting executed

Preamble
Yet another airflow tasks not getting executed question...
Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.
I have checked all the standard things e.g. as outlined in this helpful post.
I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.
Environment
version: airflow 1.10.2
os: centos 7
python: python 3.6
virtualenv: yes
executor: LocalExecutor
backend db: mysql
The problem
Here's what happens in my troubleshooting infinite loop / recurring nightmare.
I reset the metadata DB (or possibly the whole virtualenv and config etc) and re-enter connection information.
Tasks will get executed once. They may succeed. If I missed something in the setup, a task may fail.
When task fails, it goes to retry state.
I fix the issue (e.g. I forgot to enter a connection) and manually clear the task instance.
Cleared task instances do not run, but just sit in a "none" state.
Attempts to get dag running again fail.
Before I started having this trouble, after I cleared a task instance, it would always get picked up and executed again very quickly.
But now, clearing the task instance usually results in it getting stuck in a cleared state. It just sits there.
Worse, if I try failing the DAG and all instances and then manually triggering the DAG again, the task instances get created but stay in the 'none' state. Restarting the scheduler doesn't help.
Other observation
This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "task instances" view with the wrong filter: it is set to "string equals null".
But you need to switch it to "string empty yes" in order to have it actually return the task instances that are stuck.
I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
Edit 1
I am noticing that there is some "null operator" going on:
Edit 2
Is null a valid value for task instance state? Or is it an indicator that something is wrong?
Edit 3
More none stuff.
Here are some bits from the task instance details page. Lots of attributes are none:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Unknown
Reason: All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute Value
duration None
end_date None
is_premature False
job_id None
operator None
pid None
queued_dttm None
raw False
run_as_user None
start_date None
state None
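For what it's worth, these stuck task instances can also be listed straight from the metadata DB instead of through the UI filter. A minimal sketch, assuming Airflow 1.10.x and that it runs somewhere with access to the Airflow config and database:

from airflow import settings
from airflow.models import TaskInstance

# Task instances shown as 'none' in the UI have a NULL state in the DB.
session = settings.Session()
stuck = session.query(TaskInstance).filter(TaskInstance.state.is_(None)).all()
for ti in stuck:
    print(ti.dag_id, ti.task_id, ti.execution_date)
session.close()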
Update
I may finally be on to something...
After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.
So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
So I did that, and I finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part because of airflow logs.
So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
Ok I am closing this out and marking the presumptive root cause as server was out of space.
There were a number of contributing factors:
My server did not have a lot of storage. Only 10GB. I did not realize it was so low. Resolution: add more space
Logging in airflow 1.10.2 went a little crazy. An INFO log message was outputting Harvesting DAG parsing results every second or two, which resulted, eventually, in a large log file. Resolution: This is fixed in commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry pick if you are stuck on 1.10.2.
Additionally, some of my scheduler / webserver interval params could have benefited from an increase. As a result I ended up with multi-GB log files. I think this may have been partly due to changing airflow versions without correctly updating airflow.cfg. Solution: when upgrading (or changing versions), temporarily move airflow.cfg so that a cfg compatible with the new version is generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config is always the same as a fresh install's, and the only parameters in your env variables are overrides and, possibly, connections.
Airflow may not log errors anywhere in this case; everything looked fine, except the scheduler was not queuing up jobs, or it would queue one or two and then just stop, without any error message. Solutions can include (1) adding out-of-space alarms with your cloud computing provider (a minimal example is sketched below), and (2) figuring out how to make the scheduler raise a helpful exception in this case and contributing that to airflow.
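As a rough illustration of point (1), something like the following could run from cron or a monitoring job; the log path and the 10% threshold are assumptions for the example, not anything Airflow provides:

import shutil

# Hypothetical path to the volume that holds $AIRFLOW_HOME/logs.
LOGS_VOLUME = "/home/airflow/logs"

total, used, free = shutil.disk_usage(LOGS_VOLUME)
if free / total < 0.10:
    # Wire this into whatever alerting you already have (email, Slack, CloudWatch, ...).
    print("WARNING: less than 10% free space left on the Airflow logs volume")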

Re-running Failed SubDAGs

I've been playing around with SubDAGs. A big problem I've faced is that whenever something within the SubDAG fails and I re-run things by hitting Clear, only the cleared task re-runs; the success does not propagate to the downstream tasks in the SubDAG to get them running.
How do I re-run a failed task in a SubDAG such that the downstream tasks will flow correctly? Right now, I have to literally re-run every task in the SubDAG that is downstream of the failed task.
I think I followed the best practices of SubDAGs; the SubDAG inherits the Parent DAG properties wherever possible (including schedule_interval), and I don't turn the SubDAG on in the UI; the parent DAG is on and triggers it instead.
A bit of a workaround, but if you have given your tasks task_ids consistently, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies.
(remember to add the rest of the command line options, like --start_date).
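If the CLI is hard to reach, roughly the same selection can be done programmatically. This is a sketch of an alternative approach, not what the answer above uses; it assumes Airflow 1.10.x and a hypothetical subdag id and task name:

from datetime import datetime
from airflow.models import DagBag

# Load the subdag by its full dag_id (hypothetical names).
dag = DagBag().get_dag('parent_dag.my_subdag')

# Select the failed task plus everything downstream of it, then clear that
# subset so the scheduler re-runs the whole chain.
partial = dag.sub_dag(task_regex='^failed_task_name$',
                      include_downstream=True,
                      include_upstream=False)
partial.clear(start_date=datetime(2019, 1, 1),
              end_date=datetime(2019, 1, 1))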

Register a celery PeriodicTask after it's created and at runtime

My application creates PeriodicTask objects according to user-defined schedules. That is, the schedule for a PeriodicTask can change at any time. The past couple of days have been spent in frustration trying to figure out how to get Celery to support this. Ultimately, the issue is that, for something to run as a PeriodicTask, it first has to be created and then, second, has to be registered (I have no idea why this is required).
So, for dynamic tasks to work, I need
to register all the tasks when the celery server starts
to register a task when it is newly created.
#1 should be solved easily enough by running a startup script (i.e., something that gets run after ./manage.py celerybeat gets called). Unfortunately, I don't think there's a convenient place to put this. If there were, the script would go something like this:
from djcelery.models import PeriodicTask
from celery.registry import tasks

for task in PeriodicTask.objects.filter(name__startswith='scheduler.'):
    tasks.register(task)
I'm filtering for 'scheduler.' because the names of all my dynamic tasks begin that way.
#2 I have no idea how to do. The issue, as far as I can see, is that celery.registry.tasks is kept in memory, and there is no way, barring some coding magic, to access celerybeat's task registry once it has started running.
Thanks in advance for your help.
