I currently have a list of tasks that I run through the command
luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
I would like to programmatically get the status of the local scheduler, since I cannot use the central scheduler for my application. How can I get the lists of running, pending, and completed tasks?
At first I checked for the creation of the output files of some known tasks, but that is no longer workable now that the pipeline has grown more complex, and I also have dynamic tasks (yielded by other tasks at runtime).
I noticed that I could list dependencies through:
import luigi.tools.deps
luigi.tools.deps.find_deps(my_main_task, luigi.tools.deps.upstream().family)
but that does not help any more than looking at each task's output().
I have also noticed that Worker objects have a handy _running_tasks attribute, so I could get hold of the worker and list it, but I wonder what happens when I have more than one worker, each with pending tasks alongside running ones.
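For reference, here is a minimal sketch of what the build call above already gives me with detailed_summary=True (it only reports the outcome after the run finishes, so it does not solve the live-status problem):

result = luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
print(result.status)        # a LuigiStatusCode such as SUCCESS or FAILED
print(result.summary_text)  # human-readable breakdown of completed/failed/pending tasks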
I'm new to Airflow.
I'm considering setting up multiple Airflow schedulers (CeleryExecutor), but I'm curious about how multiple schedulers operate.
How do multiple schedulers schedule the serialized DAGs in the metadata database?
Are there any rules for them? Which scheduler gets which DAG, and under which rules?
Is there any load balancing between multiple schedulers?
If you can answer these questions, it will be very helpful. Thanks.
Airflow does not provide a magic mechanism to synchronize the different schedulers, and there is no load balancing; instead, scheduling is done in batches, which lets all schedulers work together on scheduling runs and task instances.
The Airflow scheduler runs in an infinite loop. In each scheduling loop it creates DAG runs (in the queued state) for up to max_dagruns_to_create_per_loop DAGs, checks up to max_dagruns_per_loop_to_schedule DAG runs to see whether they can be scheduled (queued -> scheduled), starting with the runs that have the smallest execution dates, and tries to schedule up to max_tis_per_query task instances (queued -> scheduled).
All these selected objects (DAGs, runs and task instances) are locked in the DB by the scheduler that picked them up, so they are not visible to the other schedulers, which do the same thing with other objects.
With a small number of DAGs, DAG runs or task instances, using large values for these three settings may result in all the scheduling being done by a single scheduler.
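As a rough sketch, these are the three options in the [scheduler] section of airflow.cfg (the values below are illustrative, not necessarily the defaults of your Airflow version):

[scheduler]
max_dagruns_to_create_per_loop = 10
max_dagruns_per_loop_to_schedule = 20
max_tis_per_query = 16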
I run a computation graph with long-running functions (several hours) and big results (several hundred megabytes). This type of load may be atypical for Dask.
I try to run this graph on 4 workers and observe the task-to-worker assignment depicted in the screenshot.
In the first row, the "green" task depends only on a "blue" one, not the "violet" one. Why is the green task not moved to another worker?
Is it possible to give the scheduler hints so that it always moves a task to a free worker? What information would help debug this further, and how can I obtain it?
Such an assignment is suboptimal, and the graph computation takes more time than necessary.
A little bit of information:
The computation graph is composed using dask.delayed.
The computation is invoked with the following code:
from dask.distributed import as_completed, futures_of  # client is an existing dask.distributed.Client

to_compute = [result_of_dask_delayed_1, ... , result_of_dask_delayed_n]
running = client.persist(to_compute)
results = as_completed(
    futures_of(running),
    with_results=True,
    raise_errors=not bool(cmdline_options.partial_fails),
)
Firstly, we should state that scheduling tasks to workers is complicated; please read this description. Roughly, when a batch of tasks comes in, they are distributed to workers, accounting for dependencies.
Dask has no idea, before running, how long any given task will take or how big the result it generates might be. Once a task is done this information is available for the completed task (but not for ones still waiting to run), yet Dask still needs to decide whether to steal an allotted task for an idle worker and copy across the data it needs to run. Please read the link to see how this is done.
Finally, you can indeed get more fine-grained control over where things run; see, e.g., this section.
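For example, a hedged sketch of one such hint, passing workers= and allow_other_workers= to client.persist (the worker addresses below are placeholders for your own):

running = client.persist(
    to_compute,
    workers=['tcp://10.0.0.1:40001', 'tcp://10.0.0.2:40001'],  # prefer these workers
    allow_other_workers=True,  # but still let the scheduler use other workers if needed
)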
I have multiple tasks that pass data objects to each other. In some tasks, if a condition is not met, I raise an exception, which causes that task to fail. When the next DAG run is triggered, the already successful tasks run once again. I'm looking for a way to avoid re-running the previously successful tasks and to resume the DAG run from the failed task in the next DAG run.
As mentioned, every DAG has its set of tasks that are executed on every run. To avoid re-running previously successful tasks, you could check an external flag via Airflow XComs or Airflow Variables, query the metadata database for the status of previous runs, or store a variable in something like Redis or a similar external database.
Using that flag you can then skip the execution of a task and mark it successful directly, until the run reaches the task that still needs to be completed.
Of course you need to be mindful of any potential race conditions if the DAG run times can overlap.
def task_1(**kwargs):
    # Skip the real work when the external flag says a previous run already completed it.
    if external_variable:
        pass
    else:
        perform_task()
    return True
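As a more concrete sketch of the same idea using an Airflow Variable as the external flag (the variable name 'task_1_done' is an assumption; note that AirflowSkipException marks the task skipped rather than successful):

from airflow.exceptions import AirflowSkipException
from airflow.models import Variable

def task_1(**kwargs):
    # 'task_1_done' is a hypothetical Variable set by a previous successful run.
    if Variable.get('task_1_done', default_var='false') == 'true':
        raise AirflowSkipException('task_1 already completed in an earlier DAG run')
    perform_task()
    Variable.set('task_1_done', 'true')
    return True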
I'm testing a pipeline with Luigi and I've noticed strange caching behavior in the task visualizer. For one thing, tasks seem to stay in the cache for a set time, sometimes overlapping with tasks from a second run of the pipeline, causing clutter in the UI. I've also noticed that when two pipelines are run in succession it takes a while for tasks from the new pipeline to appear. Is there a way to manually reset the cache before each run? Is there a configuration variable that sets how long tasks are cached before they expire?
You can use the remove_delay setting for the scheduler. In your config file:
[scheduler]
remove_delay = 10
This setting applies to the scheduler, so you need to restart luigid for it to take effect.
From the doc:
Number of seconds to wait before removing a task that has no
stakeholders. Defaults to 600 (10 minutes).
From experience, stakeholders in that case seem to mean workers and upstream/downstream dependencies.
I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets about different subjects, I want a pool of workers that download Twitter statuses matching the given keywords and store them in a database. I call these workers fetchers.
The other kind of worker is the analyzer, whose job is to analyze tweet contents, extract information from them, and also store the results in a database. As I'll be analyzing a lot of tweets, it would be a good idea to have a pool of these workers too.
I've been thinking of using RabbitMQ and Celery for this, but I have some questions:
General question: is this really a good approach to the problem?
I need at least one fetcher worker per download task, and it could be running for a whole year (actually a 15-minute cycle that repeats for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and used delay to launch some example tasks. The thing is that I don't want to call the ready() method constantly to check whether a task has completed. Is it possible to define a callback? I'm not talking about a Celery task callback, just a function I define myself. I've searched for this and haven't found anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as an infinite task. Instead, you can reschedule a task to run once in a while. Celery has periodic tasks, so you can schedule a task to run at particular times. You don't necessarily need Celery for this; you could also use a cron job if you want.
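A minimal sketch of the periodic-task approach, assuming a task named tasks.fetch_tweets and a RabbitMQ broker on localhost:

from celery import Celery

app = Celery('fetcher', broker='amqp://guest@localhost//')

# Run "celery -A fetcher beat" alongside the workers to trigger this schedule.
app.conf.beat_schedule = {
    'fetch-tweets-every-15-minutes': {
        'task': 'tasks.fetch_tweets',  # hypothetical task name
        'schedule': 15 * 60.0,         # seconds
    },
}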
You can call a function once a task is successfully completed.
from celery.signals import task_success

@task_success.connect(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done(sender=None, result=None, **kwargs):
    pass
Yes, you can have remote workers on different networks.
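A hedged sketch: on each remote machine, point the worker's Celery app at the central broker (host and credentials below are placeholders) and start it the usual way with "celery -A tasks worker --loglevel=info".

from celery import Celery

# Same code base on the remote machine; only the broker URL needs to point at the central RabbitMQ.
app = Celery('tasks', broker='amqp://user:password@central-rabbitmq-host:5672//')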