Using Airflow 1.8.0 and Python 2.7.
Having the following DAG (simplified):
(Phase 1)-->(Phase 2)
In phase 1 I'm triggering an async process that is time-consuming and can run for up to 2 days; when the process ends it writes a payload to S3. During that period I want the DAG to wait, and continue to phase 2 only once the S3 payload exists.
I thought of 2 solutions:
When phase 1 starts, pause the DAG using the experimental REST API and resume it once the process ends.
Wait using an operator that checks for the S3 payload every X minutes.
I can't use option 1 since my admin does not allow use of the experimental API, and option 2 seems like bad practice (checking every X minutes).
Are there any other options to solve my task?
I think option (2) is the "correct" way; you can just optimize it a bit:
BaseSensorOperator supports poke_interval, so you can pass it to S3KeySensor to increase the time between tries.
poke_interval: Time in seconds that the job should wait in between each try.
Additionally, you could try to use mode and switch it to reschedule:
mode: How the sensor operates.
Options are: { poke | reschedule }, default is poke.
When set to poke the sensor is taking up a worker slot for its
whole execution time and sleeps between pokes. Use this mode if the
expected runtime of the sensor is short or if a short poke interval
is required. Note that the sensor will hold onto a worker slot and
a pool slot for the duration of the sensor's runtime in this mode.
When set to reschedule the sensor task frees the worker slot when
the criteria is not yet met and it's rescheduled at a later time. Use
this mode if the time before the criteria is met is expected to be
quite long. The poke interval should be more than one minute to
prevent too much load on the scheduler.
Not sure about Airflow 1.8.0 - couldn't find the old documentation (I assume poke_interval is supported, but not mode).
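To make this concrete, here is a rough sketch of what the DAG could look like. The dag_id, bucket, key and callables are made up, and the import path plus mode='reschedule' assume Airflow 1.10.2+ (on 1.8.0 you would drop mode and rely on poke_interval alone):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.s3_key_sensor import S3KeySensor


def trigger_async_process(**context):
    pass  # kick off the external, long-running job here


def run_phase_2(**context):
    pass  # consume the S3 payload here


dag = DAG('phase1_phase2', start_date=datetime(2019, 1, 1), schedule_interval=None)

phase_1 = PythonOperator(task_id='phase_1', python_callable=trigger_async_process, dag=dag)

wait_for_payload = S3KeySensor(
    task_id='wait_for_s3_payload',
    bucket_name='my-bucket',                      # hypothetical bucket
    bucket_key='payloads/{{ ds }}/result.json',   # hypothetical key
    poke_interval=30 * 60,                        # check every 30 minutes
    timeout=2 * 24 * 60 * 60,                     # give up after 2 days
    mode='reschedule',                            # free the worker slot between pokes
    dag=dag,
)

phase_2 = PythonOperator(task_id='phase_2', python_callable=run_phase_2, dag=dag)

phase_1 >> wait_for_payload >> phase_2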
I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as the task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and that causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify


@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one hasn't finished?
There is definitely a way. It's locking.
There is whole page in the celery documentation - Ensuring a task is only executed one at a time.
Shortly explained: you can use some cache or even the database to store a lock, and then every time the task starts, check whether that lock is still in use or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock; set the lock expiration to be long enough, just in case the task is still running.
There already is a good thread on SO - link.
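For illustration, here is a minimal sketch of that cache-lock pattern applied to the task above, assuming the Django cache is used as the lock store (the lock key and timeout are arbitrary, and whether cache.add is atomic depends on the cache backend):

from celery import shared_task
from django.core.cache import cache

from .utils.database_blockchain import BlockchainVerify

LOCK_EXPIRE = 60 * 15  # seconds; pick something longer than the worst expected runtime


@shared_task()
def run_function():
    # cache.add only succeeds if the key does not exist yet, so it acts as the lock
    if not cache.add('run_function_lock', 'locked', LOCK_EXPIRE):
        return "Previous run still in progress, skipping"
    try:
        BlockchainVerify()
        return "Database updated"
    finally:
        cache.delete('run_function_lock')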
I run a computation graph with long functions (several hours) and big results (several hundred megabytes). This type of load may be atypical for dask.
I try to run this graph on 4 workers. I see the task-to-worker assignment depicted below:
In the first row, the "green" task depends only on a "blue" one, not the "violet" one. Why isn't the green task moved to another worker?
Is it possible to give the scheduler hints so that a task is always moved to a free worker? What information would help to debug this further, and how can I obtain it?
This assignment is suboptimal, and the graph computation takes more time than necessary.
A little more information:
The computation graph is composed using dask.delayed.
Computation is invoked using the following code:
to_compute = [result_of_dask_delayed_1, ... , result_of_dask_delayed_n]
running = client.persist(to_compute)
results = as_completed(
    futures_of(running),
    with_results=True,
    raise_errors=not bool(cmdline_options.partial_fails),
)
Firstly, we should state that scheduling tasks to workers is complicated - please read this description. Roughly, when a batch of tasks comes in, they will be distributed to workers, accounting for dependencies.
Dask has no idea, before running, how long any given task will take or how big the data result it generates might be. Once a task is done, this information is available for the completed task (but not for ones still waiting to run), and dask then needs to decide whether to steal an already-allotted task for an idle worker and copy across the data it needs to run. Please read the link to see how this is done.
Finally, you can indeed have more fine-grained control over where things are run, e.g., this section.
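For example, here is a hedged sketch of pinning work to particular workers via the workers=/allow_other_workers= arguments of submit (the scheduler address, worker address, and function are made up):

from dask.distributed import Client


def long_running_function(x):
    return x  # stand-in for the real multi-hour computation


client = Client('tcp://scheduler:8786')

# Prefer a specific worker for this task, but allow work stealing if it is busy
fut = client.submit(long_running_function, 42,
                    workers=['tcp://worker-1:1234'],
                    allow_other_workers=True)
result = fut.result()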
I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: one out-of-the-box solution provided by celery, and another that you implement yourself.
You can do something like this with celery workers. In particular, you create only two worker processes with concurrency=1 (or one with concurrency=2, but that would be threads, not separate processes); this way, only two jobs can be run asynchronously. Then you need a way to raise an exception when both slots are occupied: use inspect to count the number of active tasks and throw an exception if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
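For completeness, a per-task rate limit in celery looks like this; note that the rate is enforced per worker instance and limits how often the task may start, not how many run at once, and the value here is arbitrary:

from celery import shared_task


@shared_task(rate_limit='2/h')  # at most two task starts per hour, per worker
def huge_computation():
    ...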
You can do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two processes are running, using redis (and redis-py), is as simple as the following (assuming you know redis, since you know celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']

for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity !")
This way you make sure only two running processes can exist in the system. A third process will block on the lock.acquire() line for blocking_timeout seconds; if the locking was successful, acquired will be True, otherwise it's False and you'd tell your user to wait!
I had the same requirement some time in the past, and what I ended up coding was something like the solution above. In particular:
This has the least amount of race conditions possible
It's easy to read
Doesn't depend on a sysadmin suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, just by changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this with vanilla celery.
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help to make sure that you do not have more than 2 active tasks even if you have errors in the code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery#%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store number of currently executing tasks in DB and validate task execution based on it.
The second approach could be better if you want to scale this functionality later - for example, to introduce premium users, or to not allow a single user to execute 2 requests at once.
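A rough sketch of that second approach, assuming a hypothetical Job model with a status field (MyCustomException and my_task are reused from the snippet above):

from django.db import transaction

from .models import Job  # hypothetical model with a `status` field

MAX_RUNNING = 2

def start_job():
    with transaction.atomic():
        # select_for_update locks the matched rows for the duration of the transaction
        running = Job.objects.select_for_update().filter(status='running')
        if len(running) >= MAX_RUNNING:
            raise MyCustomException('It is impossible to start more than 2 tasks.')
        job = Job.objects.create(status='running')
    my_task.delay(job.id)

The task itself would then mark its Job row as finished (or failed) when it completes.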
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue, which is set by CELERY_EVENT_QUEUE_TTL, and a queue length limit according to "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when two jobs are already running and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), that task will time out. And if the queue is already full, newly arriving tasks will be dropped.
Third
You need something extra to deal with notifying "users when jobs are rejected because the server is busy".
Dead Letter Exchanges are what you need: every time a task fails because of the queue length limit or the message timeout, "messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route those messages to another queue; once that queue receives the dead-lettered message, you can send a notification message to the user.
I have a problem regarding Python's APScheduler.
I'm running a task that pulls data from a DB. The DB's response time varies because of different operations on it from different sources, and predicting when the response time will be low is not possible.
For example, when running
scheduler.add_interval_job(self.readFromDb, start_date = now(), seconds=60)
The seconds parameter stops the task if it didn't finish, and starts the next one.
Is there a way of changing the seconds parameter dynamically? Or should I use the default value of 0?
cheers
The "seconds" parameter does not in any way limit how long the job can take, and it certainly does not terminate it prematurely. However, it will, with the default settings, prevent another instance of the job from being spawned if the previous instance is taking longer than the specified interval (60 seconds here). The way I see it, you have two options here:
Ignore the fact that a new instance of the task sometimes fails to start
Increase the max_instances parameter from the default of 1 so more than one instance of the task can run concurrently
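Using the same API as in the question, option 2 could look roughly like this, assuming your APScheduler version accepts max_instances as a job option (the numbers are arbitrary):

# allow a second run of readFromDb to start even if the previous one is still going
scheduler.add_interval_job(self.readFromDb, start_date=now(), seconds=60, max_instances=2)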
I've created a script to monitor the output of a serial port that receives 3-4 lines of data every half hour - the script runs fine and grabs everything that comes off the port which at the end of the day is what matters...
What bugs me, however, is that the cpu usage seems rather high for a program that's just monitoring a single serial port, 1 core will always be at 100% usage while this script is running.
I'm basically running a modified version of the code in this question: pyserial - How to Read Last Line Sent from Serial Device
I've tried polling the inWaiting() function at regular intervals and having it sleep when inWaiting() is 0 - I've tried intervals from 1 second down to 0.001 seconds (basically, as often as I can without driving up the cpu usage) - this will succeed in grabbing the first line but seems to miss the rest of the data.
Adjusting the timeout of the serial port doesn't seem to have any effect on CPU usage, nor does putting the listening function into its own thread (not that I really expected a difference, but it was worth trying).
Should Python/pySerial be using this much CPU? (this seems like overkill)
Am I wasting my time on this quest / Should I just bite the bullet and schedule the script to sleep for the periods that I know no data will be coming?
Maybe you could issue a blocking read(1) call, and when it succeeds use read(inWaiting()) to get the right number of remaining bytes.
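Something along these lines - a minimal sketch, with a made-up port name and baud rate, and no timeout so the read blocks instead of busy-polling:

import serial

ser = serial.Serial('/dev/ttyUSB0', 9600, timeout=None)  # timeout=None -> read() blocks

while True:
    first_byte = ser.read(1)                        # blocks until a byte arrives, near-zero CPU while waiting
    data = first_byte + ser.read(ser.inWaiting())   # then grab whatever else is already buffered
    print(data)                                     # replace with real handling of the received chunk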
Would a system style solution be better? Create the python script and have it executed via Cron/Scheduled Task?
pySerial shouldn't be using that much CPU, but if it's just sitting there polling for an hour I can see how it may happen. Sleeping may be a better option, in conjunction with periodic wakeups and polls.