Preamble
Yet another airflow tasks not getting executed question...
Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.
I have checked all the standard things, e.g. as outlined in this helpful post.
I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.
Environment
version: airflow 1.10.2
os: centos 7
python: python 3.6
virtualenv: yes
executor: LocalExecutor
backend db: mysql
The problem
Here's what happens in my troubleshooting infinite loop / recurring nightmare.
I reset the metadata DB (or possibly the whole virtualenv and config etc) and re-enter connection information.
Tasks will get executed once. They may succeed. If I missed something in the setup, a task may fail.
When a task fails, it goes into the retry state.
I fix the issue (e.g. I forgot to enter a connection) and manually clear the task instance.
Cleared task instances do not run; they just sit in a "none" state.
Attempts to get the dag running again fail.
Before I started having this trouble, after I cleared a task instance, it would always very quickly get picked up and executed again.
But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.
Worse, if I try failing the dag and all its instances, and manually triggering the dag again, the task instances get created but stay in the 'none' state. Restarting the scheduler doesn't help.
Other observation
This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "task instances" view with the wrong filter: the filter is set to "string equals null".
You need to switch it to "string empty yes" in order to have it actually return the stuck task instances.
I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
Edit 1
I am noticing that there is some "null operator" going on:
Edit 2
Is null a valid value for task instance state? Or is it an indicator that something is wrong?
Edit 3
More none stuff.
Here are some bits from the task instance details page. Lots of attributes are none:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Unknown
Reason: All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute Value
duration None
end_date None
is_premature False
job_id None
operator None
pid None
queued_dttm None
raw False
run_as_user None
start_date None
state None
Update
I may finally be on to something...
After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.
So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
So I did that, and I finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part due to airflow logs.
So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
OK, I am closing this out and marking the presumptive root cause as: server was out of disk space.
There were a number of contributing factors:
My server did not have a lot of storage, only 10GB; I did not realize it was so low. Resolution: add more space.
Logging in airflow 1.10.2 went a little crazy. An INFO log message, "Harvesting DAG parsing results", was emitted every second or two, which eventually resulted in a large log file. Resolution: this is fixed in commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry-pick if you are stuck on 1.10.2.
Additionally, some of my scheduler / webserver interval params could have benefited from an increase; as a result I ended up with multi-GB log files. I think this may have been partly due to changing airflow versions without correctly updating airflow.cfg. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version is generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config always matches a fresh install and your env variables contain only parameter overrides and, possibly, connections (see the example below).
Airflow may not log errors anywhere in this case; everything looked fine, except the scheduler was not queuing up jobs, or it would queue one or two and then just stop, without any error message. Solutions can include (1) adding out-of-space alarms with your cloud provider, and (2) figuring out how to ensure the scheduler raises a helpful exception in this case and contributing that to airflow.
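As a hedged illustration of that environment-variable strategy (the connection id "my_db" and all values here are placeholders):

    # An AIRFLOW__{SECTION}__{KEY} variable overrides an airflow.cfg parameter:
    export AIRFLOW__CORE__PARALLELISM=8
    # An AIRFLOW_CONN_{CONN_ID} variable defines a connection as a URI:
    export AIRFLOW_CONN_MY_DB=postgres://user:pass@dbhost:5432/mydb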
Related
Background
I have some dags that pull data from a third-party API.
The accounts we need to pull can change over time. To determine which accounts to pull, depending on the process we may need to query a database or make an HTTP request.
Before airflow, we would just get the account list at the start of the python script. Then we would iterate through the account list and pull each account to file or whatever it was we needed to do.
But now, using airflow, it makes sense to define tasks at the account level and let airflow handle retry functionality and date range and parallel execution etc.
Thus my dag might look something like this:
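A hedged sketch; get_account_list() and pull_account() stand in for the real account lookup and pull logic:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def get_account_list():
        # Placeholder: in reality this queries a database or makes an HTTP request.
        return ['acct_a', 'acct_b', 'acct_c']


    def pull_account(account):
        # Placeholder: pull this account's data to a file, etc.
        print('pulling data for %s' % account)


    dag = DAG(
        dag_id='pull_accounts',
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    # One task per account, so airflow handles retries, date ranges and
    # parallelism for each account independently.
    for account in get_account_list():
        PythonOperator(
            task_id='pull_%s' % account,
            python_callable=pull_account,
            op_kwargs={'account': account},
            dag=dag,
        )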
Problem
Since each account is a task, the account list needs to be accessed with every dag parse. But since dag files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every dag parse from every machine all day long. This could be resource intensive, and could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
write to csv or pickle file and use mtime to expire.
the concern with this is that I might get collisions if two processes try to expire the file at the same time. I don't know how likely this is or what the consequences would be, but probably nothing terrible. (See the sketch after this list.)
create a common sqlite DB for all such processes, auto-created the first time a variable is accessed. Each config variable gets a row in the table; a last_modified_datetime column tells when to expire.
requires more elaborate code & dependencies.
use airflow variables
the nice thing about this is that it uses the existing DB, so there would be no cost per query and reasonable network lag, but it still requires a network round trip.
has benefit of being identical across all nodes in a multi-node setup.
determining when to expire would probably be problematic, so I would probably create a config-manager dag to update the config variables periodically.
but then this would add complexity to the deployment and development process -- the variables need to be populated in order to define the DAGs properly -- and all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
never used them, but I have a suspicion they could be used here. But the community seems to discourage their use anyway...
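Here is a minimal sketch of the first option (local file plus mtime TTL). CACHE_PATH and the TTL are arbitrary, and fetch_fn stands in for the real database query or REST call:

    import json
    import os
    import time

    CACHE_PATH = '/tmp/account_list.json'  # arbitrary location
    TTL_SECONDS = 15 * 60                  # arbitrary time-to-live


    def get_account_list_cached(fetch_fn):
        """Return the cached account list, refreshing it when stale."""
        try:
            age = time.time() - os.path.getmtime(CACHE_PATH)
            if age < TTL_SECONDS:
                with open(CACHE_PATH) as f:
                    return json.load(f)
        except (OSError, ValueError):
            pass  # missing or corrupt cache file: fall through and refresh
        accounts = fetch_fn()
        # Write-then-rename is atomic on POSIX, so a concurrent reader never
        # sees a partially written file.
        tmp_path = CACHE_PATH + '.tmp'
        with open(tmp_path, 'w') as f:
            json.dump(accounts, f)
        os.rename(tmp_path, CACHE_PATH)
        return accounts

The write-then-rename should also take care of the collision concern raised in the first option.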
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow's default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase it if your deployment isn't too close to the due times for new DAGs.
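For reference, the settings that control parse frequency live in the [scheduler] section of airflow.cfg (the values below are illustrative; check the defaults for your version):

    [scheduler]
    # minimum number of seconds before the same DAG file is re-parsed
    min_file_process_interval = 300
    # how often (in seconds) to scan the DAGs folder for new files
    dag_dir_list_interval = 300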
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the Webserver will trigger DAG parses as well, though I'm not sure how frequently; probably following the gunicorn worker refresh interval (which is 30 seconds by default). Workers will also parse by default at the start of every task, but that can be avoided if you activate DAG pickling. I'm not sure that's a good idea though; I've heard it is something destined to be deprecated.
One other thing you can try is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python's built-in functools provides lru_cache for that, and together with pickling it might be enough and very much easier than the other options.
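A hedged sketch of that memoization idea (the endpoint URL is a placeholder):

    from functools import lru_cache

    import requests  # assumed HTTP client


    @lru_cache(maxsize=1)
    def get_accounts():
        # The expensive call happens at most once per interpreter; later DAG
        # parses in the same process reuse the cached result.
        resp = requests.get('https://example.com/api/accounts')  # placeholder
        resp.raise_for_status()
        return tuple(resp.json())  # immutable, so safe to share between parses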
I have the same exact scenario.
I have an API call for multiple accounts. Initially I created a python script to iterate the list.
When I started using Airflow I thought about doing what you are planning. I tried 2 of the alternatives you listed. After some experimentation I decided to handle retry logic within python, with simple try-except blocks around the HTTP calls that can fail. The reasons are:
One script to maintain
Less Airflow objects
Restartability is easier with one script in place.
(restarting a failed job in Airflow is not a breeze (no pun intended))
In the end it's up to you; that was my experience.
I have a Python program that I am running as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to Postgresql from within the job, the solution should somehow leverage these two. I thought a bit about it and came up with the following ideas:
Find a setting in Kubernetes that would set this limit; attempts to start a second instance would then fail. I was unable to find such a setting.
Create a shared lock, or mutex. The disadvantage is that if the job crashes, it may not unlock before quitting.
Kubernetes is running etcd, maybe I can use that
Create a 'lock' table in Postgresql; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds while the others quit. I have not fully thought this out, but it should work.
Query kubernetes API for a label I use on the job, see if there are some instances. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the jobserver code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes, while the other waits. You could use the language platform's locking features to achieve this, or use a message queue.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2, is that if you update the job server code, then you might have a situation where 2 job servers might be running at the same time. This would destroy the isolation property you want. Hence, database might work better, or be more idiomatic in the k8s sense (all servers are stateless so all the k8s goodies work; put any shared state in a database that can handle concurrency).
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start things with the same name (in the metadata of the spec). But anything else goes for a job, and k8s will start another job.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) A postgres lock value should work. Even in the case of a job crash, you don't have to worry about the lock remaining set. (A concrete sketch follows this list.)
Querying k8s API server for things that should be atomic is not a good idea like you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted, or again, if I want to push an update to the event-handler server, then there might be 2 event handlers that exist at the same time.
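To make option (b) concrete, here is a minimal sketch using a Postgres session-level advisory lock (assuming psycopg2; the DSN and lock key are placeholders):

    import sys

    import psycopg2

    JOB_LOCK_KEY = 42  # arbitrary application-wide lock id


    def run_job(conn):
        pass  # placeholder for the real work


    def main():
        conn = psycopg2.connect('dbname=mydb user=myuser')  # placeholder DSN
        cur = conn.cursor()
        # pg_try_advisory_lock returns immediately: True if we got the lock.
        cur.execute('SELECT pg_try_advisory_lock(%s)', (JOB_LOCK_KEY,))
        if not cur.fetchone()[0]:
            print('another instance is already running; exiting')
            sys.exit(0)
        try:
            run_job(conn)
        finally:
            # The lock belongs to this session, so it is released automatically
            # if the process crashes and the connection drops.
            cur.execute('SELECT pg_advisory_unlock(%s)', (JOB_LOCK_KEY,))
            conn.close()


    if __name__ == '__main__':
        main()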
I would recommend sticking with what you are best familiar with. In my case that would be implementing a job-server like k8s deployment that runs as a server and listens to events/API calls.
I have been looking for a solution for my app that does not seem to be directly discussed anywhere. My goal is to publish an app and have it reach out, automatically, to a server I am working with. This just needs to be a simple POST. I have everything working fine and am currently solving this problem with a cron job, but it is not quite sufficient: I would like the job to execute automatically once the app has been published, not after a minute (or whatever interval it may be set to).
In concept I am trying to have my app register itself with my server, and to do this I'd like it to run once on publish and never run again.
Is there a solution to this problem? I have looked at Task Queues and am unsure if it is what I am looking for.
Any help will be greatly appreciated.
Thank you.
Personally, this makes more sense to me as a responsibility of your deploy process, rather than of the app itself. If you have your own deploy script, add the post request there (after a successful deploy). If you use google's command line tools, you could wrap that in a script. If you use a 3rd party tool for something like continuous integration, they probably have deploy hooks you could use for this purpose.
The main question will be how to ensure it only runs once for a particular version.
Here is an outline on how you might approach it.
You create a HasRun model, which you use to store the version of the deployed app; this indicates whether the one-time code has been run.
Then make sure you increment your version whenever you deploy new code.
In your warmup handler or appengine_config.py, grab the deployed version,
then, in a transaction, try to fetch the HasRun entity by key (the version number).
If you get the entity, don't run the one-time code.
If you cannot find it, create it and run the one-time code, either in a task (make sure the process is idempotent, as tasks can be retried) or in the warmup/front-facing request.
You will probably want to wrap all of that in a memcache CAS operation to provide a lock of sorts, to prevent another instance from trying to do the same thing.
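A hedged sketch of that flow using the GAE ndb datastore API; run_one_time_code() is a placeholder for your registration POST (e.g. a transactionally enqueued task):

    import os

    from google.appengine.ext import ndb


    class HasRun(ndb.Model):
        ran_at = ndb.DateTimeProperty(auto_now_add=True)


    def run_one_time_code():
        pass  # placeholder: e.g. enqueue a task that POSTs to your server


    @ndb.transactional
    def run_once_for_version():
        version = os.environ.get('CURRENT_VERSION_ID', 'unknown')
        if HasRun.get_by_id(version) is not None:
            return  # the one-time code already ran for this version
        HasRun(id=version).put()
        run_one_time_code()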
Alternatively, if you want to use the task queue, consider naming the task after the version number; you can only submit a task with a particular name once.
It still needs to be idempotent (again, it could be retried), but there will only ever be one task scheduled for that version, at least for a few weeks.
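A hedged sketch of the named-task variant, assuming the GAE Python taskqueue API; '/run-once' is a hypothetical handler:

    import os

    from google.appengine.api import taskqueue


    def enqueue_one_time_task():
        # CURRENT_VERSION_ID looks like "1.123456"; task names may only contain
        # [a-zA-Z0-9_-], so replace the dot.
        version = os.environ.get('CURRENT_VERSION_ID', 'unknown').replace('.', '-')
        try:
            taskqueue.add(url='/run-once', name='register-' + version)
        except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
            # A task with this name was already enqueued (or ran recently),
            # so the one-time work is already covered for this version.
            pass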
Or a combination/variation of all of the above.
[TLDR] What are the best ways to handle scheduled jobs in development where only one job is spun up instead of two? Do you use the 'noreload' option? Do you test to see if a file is locked and then stop the second instance of the job if it is? Are there better alternatives?
[Context]
[edit] We are still in a development environment and are looking at next steps for a production environment.
My team and I are currently developing a Django project, Django 1.9 with Python 3.5. We just discovered that Django's development server spins up two processes to allow for real-time code changes.
APScheduler 3.1.0 is being used to schedule a DB ping every few minutes to see if there is new data for us to process. However, when Django spins up, we noticed that we were pinging twice and that there were two instances of our functions running. We tried to shut down the second job through APS but as they are in two different processes, APS is unable to see the other job.
After researching this, we discovered the 'noreload' option and another suggestion to test if a file has been locked.
The 'noreload' option prevents Django from spinning up the second instance. This solution works but feels weird; we haven't come across any documentation or guides that say whether this is something you want to do (or avoid) in production.
Filelock 2.0.6 is another option we have tested. In this solution, the two scheduled tasks check a local file to see if it is locked. If it isn't locked, one task locks it and runs while the other stops. If the task crashes, the locked file remains locked until a server restart. This feels like a hack.
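For reference, a minimal sketch of that file-lock guard with the filelock package (the lock path and job interval are arbitrary):

    from apscheduler.schedulers.background import BackgroundScheduler
    from filelock import FileLock, Timeout

    lock = FileLock('/tmp/scheduler.lock')  # arbitrary lock path


    def ping_db():
        pass  # placeholder for the DB polling task described above


    def start_scheduler_once():
        """Start APScheduler only in the process that wins the file lock."""
        try:
            lock.acquire(timeout=0)  # fail fast instead of blocking
        except Timeout:
            return  # another process already runs the scheduler
        scheduler = BackgroundScheduler()
        scheduler.add_job(ping_db, 'interval', minutes=5)
        scheduler.start()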
In a production environment, are these good solutions? Are there other alternatives that we should look at for handling scheduled tasks that are better for this? Are there cons to either of these solutions that we haven't thought of?
'noreload' - is this something that is done normally in a production environment?
Ok, here is my confusion/problem:
I develop on localhost, where I can raise exceptions and easily see the logs on the command line.
Then I deploy the code to test, stage, and production servers, and that is where the problem begins: it is not easy to see logs or debug errors and exceptions. For normal errors I guess django-toolbar could be enabled, but I also get silent exceptions that don't crash anything yet still cause the process to fail. For example, I have a payment integration, and a few days ago payments were failing on return (callback) to our site, but nothing was crashing; just a "payment process failed" message appeared, even though the payment gateway vendor was working fine. I had to hunt for failure scenarios that could lead to this and figured out that one db save operation was not saving because a variable was missing.
Now my question: is Sentry (https://github.com/getsentry/sentry) an answer for that? Or is there another option?
Please do ask if any further clarification is needed for my requirement.
Sentry is an option, but honestly it is too limited (I tried it a month or so ago); it is intended to track exceptions, but in the real world we should track important information and events too.
If you haven't set up application logging, I suggest you do so, following this example.
In my app I defined several loggers for different purposes. The python logging configuration via dictionary (the one used by Django) is very powerful and gives you full control over how things get logged: for example, you can write logs to a file or a database, send an email, call a third-party API, or whatever. If your app is running in a load-balanced environment (so several machines run your app), you can use services like Loggly to aggregate the logs coming from your instances in a single place (and since it uses RSYSLOG, it aggregates not only your Django app logs, but also all the logs of the underlying OS).
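For example, a hedged settings.py LOGGING dict (handler paths and logger names are illustrative):

    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'verbose': {
                'format': '%(levelname)s %(asctime)s %(module)s %(message)s',
            },
        },
        'handlers': {
            'file': {
                'level': 'INFO',
                'class': 'logging.FileHandler',
                'filename': '/var/log/myapp/app.log',  # assumed path
                'formatter': 'verbose',
            },
            'mail_admins': {
                'level': 'ERROR',
                'class': 'django.utils.log.AdminEmailHandler',  # emails ADMINS
            },
        },
        'loggers': {
            'payments': {  # e.g. a logger used around the payment callback
                'handlers': ['file', 'mail_admins'],
                'level': 'INFO',
            },
        },
    }

Then in the payment callback you would log through logging.getLogger('payments'), so silent failures like the missing-variable save leave a trace.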
I also suggest using New Relic, which automatically keeps track of a lot of things: queries executed and their timing, template loading time, errors, and a lot of other useful statistics.