I'm trying to implement an SLA in my Airflow DAG.
I know how SLAs work: you set a timedelta object, and if the task does not finish within that duration, Airflow sends an email notifying you that the task is not done yet.
I want similar functionality, but instead of giving a duration, I want to set a specific time of day as the SLA. For example, if the task is not done by 8:00 AM, it sends the email and notifies the manager. Something like this:
'sla': time(hour=8, minute=0, second=0)
I have searched a lot but found nothing.
Is there any solution for this specific problem, or any alternative to SLAs?
Thanks in advance.
The sla param of BaseOperator expects a datetime.timedelta object, so there is nothing more to do there. Take into consideration that the SLA represents a time delta after the scheduled period is over. The example from the docs assumes a DAG scheduled daily:
For example if you set an SLA of 1 hour, the scheduler would send an email soon after 1:00AM on the 2016-01-02 if the 2016-01-01 instance has not succeeded yet.
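In code, that built-in behaviour is just a delta, e.g. declared in default_args (a minimal illustration of the existing feature, not the wall-clock variant you are asking for):

from datetime import timedelta

default_args = {
    # fires relative to the end of the scheduled period, not at a wall-clock time
    'sla': timedelta(hours=1),
}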
The point is, it's always a time delta from the schedule period which is not what you are looking for.
So I think you should take another approach: schedule your DAG whenever you need it, execute the tasks you want, and then add a sensor operator to check whether the condition you are looking for is met. There are a few types of sensors; depending on your context, you can choose among them.
Another option could be to create a new DAG dedicated to checking whether the tasks in your original DAG executed successfully, and act accordingly (for example, send emails). To do this you could use an ExternalTaskSensor; check online for tutorials on how to implement it, although it may be simpler to avoid cross-DAG dependencies, as stated in the docs.
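For illustration, here is a minimal, untested sketch of that second option, assuming Airflow 2.x; the DAG ids, task ids, email address, and the 8-hour execution_delta are hypothetical placeholders you would adjust:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="deadline_watchdog",          # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 8 * * *",       # run the check at 8:00 AM
    catchup=False,
) as dag:
    # Poke the original DAG's task; fail quickly if it hasn't succeeded by now.
    check = ExternalTaskSensor(
        task_id="check_original_task",
        external_dag_id="original_dag",      # hypothetical
        external_task_id="original_task",    # hypothetical
        execution_delta=timedelta(hours=8),  # offset to the original run's schedule
        timeout=60,                          # it's a deadline check, so give up fast
        poke_interval=30,
    )

    # Runs only when the sensor fails, i.e. the 8:00 AM deadline was missed.
    alert = EmailOperator(
        task_id="notify_manager",
        to="manager@example.com",            # hypothetical address
        subject="Task missed its 8:00 AM deadline",
        html_content="original_dag.original_task has not succeeded yet.",
        trigger_rule="one_failed",
    )

    check >> alert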
Hope this points you in the right direction.
I have some DAGs already defined in Airflow that perform queries on third party APIs, pull some data partitioned by date (for example the trending items for yesterday) and write them to the DB. They can also be triggered manually with a bunch of parameters to download the same items without the date-based logic. So far so good, this is a standard scenario for Airflow.
I now want to reuse and adapt some of these DAGs to perform special queries: in Airflow terms this means receiving different job parameters. I can do it one by one manually, but clearly that is not ideal. The main reason is that these third-party APIs have daily quota thresholds that we don't want to cross, so we are not free to run everything every day and need to be deliberate about the executions.
So let's say I want to download 100 entities, whose IDs I can fetch through a service call, and my quota is 10 per day. One solution would be to create a DAG that makes the call and saves the IDs into a database along with the date on which they should be processed, but then I'm doing the Airflow scheduler's job, which seems wrong. There are many things that could go wrong.
I could do the same trick but with something that looks like a queue: one manually triggered DAG puts tasks in the queue and another, daily, DAG pulls from it. This one kinda works in my mind, but it seems like a lot of effort and I'm not sure what should keep track of the queue. Something like Celery seems like overkill, so I would probably have to use a database. Still, it feels like over-engineering and some kind of Airflow anti-pattern, but I don't have much experience with the tool, so feedback is welcome.
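For what it's worth, a minimal sketch of that database-backed-queue idea could look like this; the table, connection id, quota value, and download_entity stub are all hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

DAILY_QUOTA = 10  # hypothetical third-party API quota


def download_entity(entity_id):
    """Hypothetical stand-in for the real third-party API call."""
    ...


def drain_queue(**_):
    hook = PostgresHook(postgres_conn_id="metadata_db")  # hypothetical connection
    # Claim at most DAILY_QUOTA pending ids, oldest first.
    rows = hook.get_records(
        "SELECT id FROM download_queue WHERE done = FALSE "
        "ORDER BY enqueued_at LIMIT %s",
        parameters=(DAILY_QUOTA,),
    )
    for (entity_id,) in rows:
        download_entity(entity_id)
        hook.run(
            "UPDATE download_queue SET done = TRUE WHERE id = %s",
            parameters=(entity_id,),
        )


with DAG(
    dag_id="quota_queue_drain",       # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="drain", python_callable=drain_queue)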
Are there other options? Is there an Airflow feature that would solve this easily?
Suppose I have a model Event. I want to send a notification (email, push, whatever) to all invited users once the event has elapsed. Something along the lines of:
from django.contrib.auth.models import User
from django.db import models

class Event(models.Model):
    start = models.DateTimeField(...)
    end = models.DateTimeField(...)
    invited = models.ManyToManyField(User)

    def onEventElapsed(self):
        for user in self.invited.all():
            my_notification_backend.sendMessage(target=user, message="Event has elapsed")
Now, of course, the crucial part is to invoke onEventElapsed whenever timezone.now() >= event.end.
Keep in mind, end could be months away from the current date.
I have thought about two basic ways of doing this:
Use a periodic cron job (say, every five minutes or so) which checks if any events have elapsed within the last five minutes and executes my method.
Use celery and schedule onEventElapsed using the eta parameter so it runs in the future (from within the model's save method).
Considering option 1, a potential solution could be django-celery-beat. However, it seems a bit odd to run a task at a fixed interval just for sending notifications. In addition, I came up with a (potential) issue that would (probably) result in a not-so-elegant solution:
Check every five minutes for events that have elapsed in the previous five minutes? That seems shaky: maybe some events are missed (or others get their notifications sent twice?). Potential workaround: add a boolean field to the model that is set to True once notifications have been sent.
Then again, option 2 also has its problems:
Manually take care of the situation when an event's start/end datetime is moved. When using celery, one would have to store the task ID (easy, ofc), revoke the task once the dates have changed, and issue a new one. But I have read that celery has (design-specific) problems when dealing with tasks that are scheduled far in the future: Open Issue on github. I realize how this happens and why it is anything but trivial to solve.
Now, I have come across some libraries which could potentially solve my problem:
celery_longterm_scheduler (but does this mean I cannot use celery as I would have before, because of the different Scheduler class? This also ties into the possible usage of django-celery-beat... Using either of the two frameworks, is it still possible to queue jobs that are just a bit longer-running, but not months away?)
django-apscheduler, which uses apscheduler. However, I was unable to find any information on how it handles tasks that run in the far future.
Is there a fundamental flaw in the way I am approaching this? I'm glad for any input you might have.
Notice: I know this is likely to be somewhat opinion-based; however, maybe there is a very basic thing that I have missed, regardless of what could be considered by some as ugly or elegant.
We're doing something like this in the company I work for, and the solution is quite simple.
Have a cron job / celery beat task that runs every hour to check whether any notifications need to be sent.
Then send those notifications and mark them as done. This way, even if your notification time is years ahead, it will still be sent. Using ETA is NOT the way to go for a very long wait time; your cache / AMQP broker might lose the data.
You can reduce the interval depending on your needs, but do make sure the runs don't overlap.
If one hour is too big a time difference, then what you can do is run a scheduler every hour. The logic would be something like:
Run a task (let's call it the scheduler task) hourly via celery beat that fetches all notifications that need to be sent in the next hour.
Schedule those notifications via apply_async(eta=...); this is the actual sending.
Using that methodology gets you the best of both worlds (eta and beat).
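A minimal sketch of that hybrid approach, assuming a hypothetical Notification model with send_at / sent fields and a deliver() method:

from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from myapp.models import Notification  # hypothetical app / model


@shared_task
def schedule_due_notifications():
    """Runs hourly via celery beat; hands the actual sends off to short-ETA tasks."""
    horizon = timezone.now() + timedelta(hours=1)
    for notification in Notification.objects.filter(sent=False, send_at__lt=horizon):
        # The ETA is at most an hour away, avoiding celery's long-countdown issues.
        send_notification.apply_async(args=[notification.pk], eta=notification.send_at)


@shared_task
def send_notification(notification_pk):
    notification = Notification.objects.get(pk=notification_pk)
    if notification.sent:  # guard against double delivery from overlapping runs
        return
    notification.deliver()  # hypothetical send logic
    notification.sent = True
    notification.save(update_fields=["sent"])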
I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. Then it will run the DAG instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is in the roadmap for Airflow, but does not currently exist.
See:
Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As the documentation says, make sure you have set depends_on_past=False (this is the default). I do not have Airflow set up, so I can't test it, but an untested sketch of the idea might look like the one below.
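(Rough, untested sketch; dag/task ids and the staleness cutoff are hypothetical, and it assumes Airflow 2.x import paths.)

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils import timezone

MAX_AGE = timedelta(days=1)  # runs older than this are considered stale


def choose_branch(execution_date, **_):
    # Skip backfilled runs whose schedule date is too old to be worth scraping.
    if timezone.utcnow() - execution_date > MAX_AGE:
        return "skip_stale_run"
    return "scrape_page"


with DAG(
    dag_id="scraper",                  # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    branch = BranchPythonOperator(task_id="check_age", python_callable=choose_branch)
    scrape = EmptyOperator(task_id="scrape_page")      # stand-in for the real scrape
    skip = EmptyOperator(task_id="skip_stale_run")
    branch >> [scrape, skip]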
Airflow was designed with backfilling in mind, so this roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date
http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not made to be stopped. If you run it today, you may as well set your task's start_date to today; that seems logical to me.
Got a simple question, I believe, but it got me stuck anyway.
Say I have a simple model:
from django.db import models

class myModel(models.Model):
    expires = models.DateTimeField(...)
and I want, say, at the specified time to do something: send an email, delete the model, change some of the model's fields... something. Is there a tool in Django core that allows me to do so?
Or, if not, I think some task-queuing tool might be in order. I have djcelery working in my project, though I'm a complete newbie with it; all I have managed so far is to run the django-celery-email package in order to send my mail asynchronously. I can't say I'm fully capable of defining tasks and workers that run in the background reliably.
If you have any ideas on how to solve such a problem, please do not hesitate =)
Write a custom management command to do the task that you desire. When you are done, you should be able to run your task with python manage.py yourtaskname.
Use cron, at, periodic tasks in celery, django-cron, djangotaskscheduler or django-future to schedule your tasks.
I think the best option is a background task that reads the datetime and executes a task once the datetime has been reached.
See the solution given here for a scheduled task
So the workflow would be:
Create the task you want to apply to objects whose datetime has been reached
Create a management command that checks the datetimes in your DB and executes the above task for every object whose datetime has been reached (a sketch follows this list)
Use cron (Linux) or at (Windows) to schedule the command call
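A minimal sketch of step 2, assuming the myModel example above and a hypothetical app layout:

# myapp/management/commands/process_expired.py  (hypothetical path / name)
from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import myModel  # hypothetical app


class Command(BaseCommand):
    help = "Do something with every object whose expiry datetime has been reached"

    def handle(self, *args, **options):
        for obj in myModel.objects.filter(expires__lte=timezone.now()):
            # Hypothetical action: send an email, change fields, or delete.
            self.stdout.write("Processing %s" % obj.pk)
            obj.delete()

A crontab line such as */5 * * * * /path/to/python manage.py process_expired would then run it every five minutes.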
If you're on a UNIX-like machine, you probably have access to cron jobs. If you're on Windows, I hear there's a program called at that can do similar things. If this doesn't suit your needs, there are a number of ways to do things every X hours using the time library: time.sleep(SOME_NUMBER_OF_SECONDS) in a loop, together with whatever else you want to do, will work if you want something done regularly; otherwise, look at time.localtime() and check for conditions.
I have a form with a text field for entering a whole number representing an amount of minutes. That amount will be added to the current time and the result inserted into a table named Alarm.
When the resulting time arrives, my web app must perform an insert into another table.
For example, if the user enters 20 minutes and the current time is 22:10, the resulting time, 22:30, will be inserted into the Alarm table. When 22:30 arrives, a new row must be inserted into the other table.
How can I do this on AppEngine using Python?
Depending on your requirements, you may also want to consider using Tasks with an eta or countdown.
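A small sketch using the legacy App Engine Task Queue API with a countdown; the handler URL and parameters are hypothetical:

from google.appengine.api import taskqueue


def schedule_alarm(alarm_id, minutes_from_now):
    # Enqueue a task that POSTs to the worker handler once the delay elapses.
    taskqueue.add(
        url="/tasks/fire_alarm",          # hypothetical worker that does the insert
        params={"alarm_id": alarm_id},
        countdown=minutes_from_now * 60,  # seconds until execution
    )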
If you plan to allow users to cancel the action, you'd need to use some type of no-op marker the task checks for before adding to the "other" table. Or, make the task check the Alarm table before performing the add.
Also, note that countdown / eta are not precise; they are more like polite requests. So if your queues are backing up with tasks, your adds will happen later than they are supposed to (though cron, particularly 1-minute jobs, also periodically suffers timing issues).
The advantage of this method is that you don't have to figure out how to avoid missing work. Each task represents one add (or a related set of adds). Also, if a write fails the task will retry, which is nice.
Cron may be a better solution for your particular problem though.
You've said that you're storing the target time in the Alarm table. So your cron job just has to run every minute (or every 5 or 10, depending on the resolution of your alarms), check whether there's an alarm matching the current time, and if so, perform the action.
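For the cron route, a rough sketch on the legacy Python runtime (model and handler names are hypothetical) might be:

import datetime

import webapp2
from google.appengine.ext import ndb


class Alarm(ndb.Model):
    fire_at = ndb.DateTimeProperty()
    done = ndb.BooleanProperty(default=False)


class LogEntry(ndb.Model):  # hypothetical "other" table
    created = ndb.DateTimeProperty(auto_now_add=True)


class CheckAlarms(webapp2.RequestHandler):
    """Wired to /tasks/check_alarms in cron.yaml with "schedule: every 1 minutes"."""

    def get(self):
        now = datetime.datetime.utcnow()
        for alarm in Alarm.query(Alarm.fire_at <= now, Alarm.done == False):
            LogEntry().put()   # the insert into the other table
            alarm.done = True
            alarm.put()


app = webapp2.WSGIApplication([("/tasks/check_alarms", CheckAlarms)])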