Clearing the Luigi Task Visualizer Cache - python

I'm testing a pipeline with Luigi and I've noticed strange caching behavior in the task visualizer. For one thing, tasks seem to stay in the cache for a set time, sometimes overlapping with tasks from a second run of the pipeline, causing clutter in the UI. I've also noticed that when two pipelines are run in succession it takes a while for tasks from the new pipeline to appear. Is there a way to manually reset the cache before each run? Is there a configuration variable that sets how long tasks are cached before they expire?

You can use the remove_delay setting for the scheduler. In your config file:
[scheduler]
remove_delay = 10
This applies to the scheduler so you need to restart luigid to enable it.
From the doc:
Number of seconds to wait before removing a task that has no
stakeholders. Defaults to 600 (10 minutes).
From experience, stakeholders in that case seem to mean workers and upstream/downstream dependencies.

Related

How to get all running tasks in luigi, including dynamic ones

I currently have a list of tasks that I run through the command
luigi.build(tasks, workers=N, local_scheduler=True, detailed_summary=True)
I would like to programmatically get the status of the local scheduler, hence I could not use global scheduler for my application. How can I get the list of running, pending, and completed tasks?
At first, I used to check for the creation of output files of some known tasks, but it is not efficient anymore since the complexity is increased and I also have dynamic tasks (yielded by some other tasks at runtime).
I noticed that I could list dependencies through:
import luigi.tools.deps
luigi.tools.deps.find_deps(my_main_task, luigi.tools.deps.upstream().family)
but it is not helping more than looking at task's output().
I have also noticed Workers having a nice _running_tasks attribute, thus I would need to get the worker and list it, but I am also wondering about what happens if I have more than 1 worker with pending tasks along with running ones.

Celery Django running periodic tasks after the previous one is done [django-celery-beat]

I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as the task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and this causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one hasn't finished?
There definitely is a way: locking.
There is a whole page in the Celery documentation about it: Ensuring a task is only executed one at a time.
In short, you can use a cache or even the database to hold a lock, and every time the task starts, check whether the lock is still held or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by giving the lock an expiration time. Set that expiration long enough to cover the case where the task is still running.
There already is a good thread on SO - link.
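The locking idea can be sketched with nothing but the standard library. The class ExpiringLockStore and the helper run_exclusive below are hypothetical names made up for this illustration; they are not part of Celery. The store is an in-process dictionary, so it only demonstrates the logic. Real Celery workers run in separate processes and would need a shared backend instead (for example Django's cache.add, which only stores a key if it is not already present).

```python
import time


class ExpiringLockStore:
    """In-process stand-in for a shared cache (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # lock name -> time at which the lock lapses

    def acquire(self, name, ttl):
        """Take the lock if it is free or expired; return True on success."""
        now = self._clock()
        expires_at = self._expiry.get(name)
        if expires_at is not None and expires_at > now:
            return False  # a previous run still holds the lock
        self._expiry[name] = now + ttl
        return True

    def release(self, name):
        """Drop the lock; harmless if it was never held."""
        self._expiry.pop(name, None)


def run_exclusive(store, name, ttl, func):
    """Run func only if no other run holds the lock; otherwise skip it."""
    if not store.acquire(name, ttl):
        return "skipped"
    try:
        return func()
    finally:
        # Release even if func raises, so a failure does not block later runs.
        store.release(name)
```

With a 10 minute schedule, a ttl somewhat longer than the worst expected run time would let a crashed run's lock lapse on its own while still protecting a slow run.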

Web2py scheduler - Best practices to rerun task continuously and to add task at startup

I want to add a task to the queue at app startup, currently adding a scheduler.queue_task(...) to the main db.py file. This is not ideal as I had to define the task function in this file.
I also want the task to repeat every 2 minutes continuously.
What are the best practices for this?
As stated in the web2py documentation, to rerun a task continuously you just have to specify it at task queuing time:
scheduler.queue_task(your_function,
                     pargs=your_args,
                     timeout=120,     # just in case
                     period=120,      # as you want to run it every 2 minutes
                     immediate=True,  # starts task ASAP
                     repeats=0        # just does the infinite repeat magic
                     )
To queue it at startup, you might want to use the web2py cron feature in this simple way:
@reboot root *your_controller/your_function_that_calls_queue_task
Do not forget to enable this feature (-Y, more details in the doc).
It seems there is no real mechanism for this within web2py.
There are a few hacks one could use to repeat tasks continuously or schedule them at startup, but as far as I can see the web2py scheduler needs a lot of work.
The best option is to just abandon this web2py feature and use Celery or similar for advanced usage.

Appropriate method to define a pool of workers in Python

I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets on different subjects, I want to have a pool of workers that download statuses from Twitter matching the given keywords and store them in a database. I call these workers fetchers.
The other kind of worker is the analyzer, whose job is to analyze tweet contents and extract information from them, also storing the result in a database. As I'll be analyzing a lot of tweets, it would be a good idea to have a pool of this kind of worker too.
I've been thinking of using RabbitMQ and Celery for this, but I have some questions:
General question: is this really a good approach to solve this problem?
I need at least one fetcher worker per downloading task, and this could be running for a whole year (actually it is a 15-minute cycle that repeats and lasts for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and I used delay to launch some example tasks. The thing is that I don't want to call the ready() method constantly to check whether the task is completed. Is it possible to define a callback? I'm not talking about a Celery task callback, just a function defined by myself. I've been searching for this and haven't found anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as an infinite task. You might reschedule a task to run once in a while. Celery has periodic tasks, so you can schedule a task to run at particular times. You don't necessarily need Celery for this. You can also use a cron job if you want.
You can call a function once a task is successfully completed.
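from celery.signals import task_success

@task_success.connect(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done(sender=None, result=None, **kwargs):
    # Signal handlers must accept **kwargs; result is the task's return value.
    pass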
Yes, you can have remote workers on different networks.

Task scheduling in AppEngine dev_appserver.py

I have a [python] AppEngine app which creates multiple tasks and adds them to a custom task queue. dev_appserver.py seems to ignore the rate/scheduling parameters I specify in queue.yaml and executes all the tasks immediately. This is a problem [at least for dev/testing purposes] as my tasks call a rate-throttled URL; immediate execution of all tasks breaches the throttling limits and returns a bunch of errors.
Does anyone know whether task scheduling in dev_appserver.py is disabled? I can't find anything that suggests this in the AppEngine docs. Can anyone suggest a workaround?
Thank you.
When your app is running in the development server, tasks are automatically executed at the appropriate time just as in production.
You can examine and manipulate tasks from the developer console:
http://localhost:8080/_ah/admin/taskqueue
Documentation here
The documentation lies: the development server doesn't appear to support rate limiting. (This is documented for the Java dev server, but not for Python). You can demonstrate this by pausing a queue by giving it a 0/s rate, but you'll find it executes tasks anyway. When such an app is uploaded to production, it behaves as expected.
I opened a defect.
The rate parameter does not set an absolute upper bound on TaskQueue processing. In fact, if you use for example:
rate: 10/s
bucket_size: 20
the processing can burst up to 20/s. Something more useful would be:
max_concurrent_requests: 1
which limits execution to at most 1 task at a time.
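The burst behaviour of rate and bucket_size can be illustrated with a toy token bucket. This is only a sketch of the general technique, not App Engine's actual implementation; the names TokenBucket and tasks_started are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Toy token bucket: refills at `rate` tokens/second up to `bucket_size`."""

    def __init__(self, rate, bucket_size):
        self.rate = rate
        self.bucket_size = bucket_size
        self.tokens = float(bucket_size)  # assume the bucket starts full
        self.last = 0.0                   # time of the last refill

    def try_run(self, now):
        """Consume one token if available; return True if a task may start."""
        elapsed = now - self.last
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def tasks_started(bucket, now, queued):
    """Count how many of `queued` waiting tasks can start at time `now`."""
    return sum(1 for _ in range(queued) if bucket.try_run(now))


bucket = TokenBucket(rate=10, bucket_size=20)
first_burst = tasks_started(bucket, now=0.0, queued=100)   # full bucket drains
next_second = tasks_started(bucket, now=1.0, queued=100)   # only the refill
```

With these numbers, 20 tasks start at once from the full bucket, and one second later only the 10 refilled tokens are available, which is why rate alone does not cap short bursts.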
However, this will not stop tasks from executing. If you are adding multiple tasks at a time but know that they need to be executed later, you should probably use countdown.
_countdown using deferred library
countdown using Task class
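One way to use countdown against a throttled URL is to stagger the delays so that tasks are spaced evenly. The helper below is a hypothetical sketch that only computes the delays in seconds; each value would then be passed as the countdown (or _countdown for the deferred library) when the corresponding task is enqueued. The throttle limit used here is an assumed figure.

```python
def staggered_countdowns(task_count, max_per_second):
    """Return one delay (in seconds) per task so that at most
    `max_per_second` tasks become eligible to run in any second."""
    interval = 1.0 / max_per_second
    return [i * interval for i in range(task_count)]


# Assumed throttle of 2 requests per second, four tasks to enqueue.
delays = staggered_countdowns(task_count=4, max_per_second=2)
```

Countdown only sets the earliest time a task may run, so this spaces out eligibility rather than guaranteeing exact execution times.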
