I'm trying to schedule a job to start every minute.
I have the scheduler defined in a scheduler.py script:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor
executors = {
    'default': ThreadPoolExecutor(10),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 5
}
scheduler = BackgroundScheduler(executors=executors, job_defaults=job_defaults)
I initialize the scheduler in the __init__.py of the module like this:
from scheduler import scheduler
scheduler.start()
I want to start a scheduled job on a specific action, like this:
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.scheduled_job(func=TestScheduler(),
                            trigger='interval',
                            minutes=1,
                            id=job_id
                            )
from time import sleep, time

def TestScheduler():
    for i in range(0, 29):
        starttime = time()
        print "test"
        sleep(1.0 - ((time() - starttime) % 1.0))
First: when I execute the AddJob() function in the Python console, it starts to run as expected, but not in the background; the console is blocked until the TestScheduler function ends after 30 seconds. I expected it to run in the background because it's a background scheduler.
Second: the job never starts again even when specifying a repeat interval of 1 minute.
What am I missing?
UPDATE
I found the issue thanks to another thread. The offending call is this:
scheduler.scheduled_job(func=TestScheduler(),
                        trigger='interval',
                        minutes=1,
                        id=job_id
                        )
I changed it to:
scheduler.add_job(func=TestScheduler,
                  trigger='interval',
                  minutes=1,
                  id=job_id
                  )
TestScheduler() becomes TestScheduler. Using TestScheduler() causes the return value of TestScheduler() to be passed to add_job() as the func argument, instead of the function itself.
The first problem seems to be that you are initializing the scheduler inside __init__.py, which is not the recommended way.
Code that lives in __init__.py gets executed the first time a module from that package gets imported. For example, imagine this structure:
my_module
|--__init__.py
|--test.py
where __init__.py contains:
from scheduler import scheduler
scheduler.start()
the scheduler.start() command gets executed when you run from my_module import something. So it either doesn't start at all from __init__.py or it starts many times (depending on the rest of your code!).
Another problem must be the use of the scheduler.scheduled_job() method. If you read the documentation on adding jobs, you will see that the recommended way is to use the add_job() method; scheduled_job() is a convenience decorator.
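To illustrate the difference, here is a minimal sketch of both styles (the job bodies are just placeholders):

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()

# Decorator style: the job is registered at import time, when the module is loaded.
@scheduler.scheduled_job('interval', minutes=1, id='decorated_job')
def decorated_job():
    print('decorated job fired')

# Runtime style: register a callable whenever you need to, e.g. from a view or the console.
def runtime_job():
    print('runtime job fired')

scheduler.add_job(runtime_job, trigger='interval', minutes=1, id='runtime_job')
scheduler.start()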
I would suggest something like this:
Keep my_scheduler.py as is.
Remove the scheduler.start() line from __init__.py.
Change your main file as follows:
from my_scheduler import scheduler
if not scheduler.running:  # clause suggested by @CyrilleMODIANO
    scheduler.start()
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id
    )
...
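If AddJob() can end up being called more than once for the same dbid, add_job() should raise a ConflictingIdError for the duplicate id; passing replace_existing=True makes the call idempotent. A rough sketch, reusing the names above:

def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id,
        replace_existing=True  # re-registering the same id replaces the old job instead of raising
    )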
Related
I've been using Celery for a couple of months and stumbled upon a case where I just can't find any info, or even an example, of what I intend to achieve.
In this specific case I have a Docker container running an API and two other separate containers with Celery workers.
I have my queues and tasks defined and I call a task with the send_task method. Example:
r = celery_app.send_task('task_a')
Similarly, I have another container with "task_b" that can be called the same way as "task_a".
I'm defining my tasks by updating the configuration of my Celery app and detailing their respective queues, since they run in separate containers.
For example:
celery_app.conf.update({
    'broker_url': 'amqp://admin:mypass@rabbit:5672',
    'result_backend': 'redis://redis:6379/0',
    'imports': (
        'tasks_a_dev',
        'tasks_b_dev',
    ),
    'task_routes': {
        'task_a': {'queue': 'qtasks_a_dev'},
        'task_b': {'queue': 'qtasks_b_dev'},
    },
    'task_serializer': 'json',
    'result_serializer': 'json',
    'accept_content': ['json']
})
Is there any way I can chain these two tasks together while passing the result of task_a to task_b?
If you use Celery > v3.0.0, you can use chaining.
So if you wanted task_b to run with the result of task_a, you could do the following.
import time

from celery import chain, task  # on newer Celery you may prefer app.task or shared_task

@task()
def task_a(a, b):
    time.sleep(5)
    return a + b

@task()
def task_b(a, b):
    time.sleep(5)
    return a + b

# the result of the first job will be the first argument of the second job
res = chain(task_a.s(1, 2), task_b.s(3)).apply_async()

# Alternatively, you could do the following
res_2 = (task_a.s(1, 2) | task_b.s(3)).apply_async()

# check the result status to get the result
if res.status == u'SUCCESS':
    print("result:", res.get())
I'm new to working with RSS feeds.
Every X minutes, I want to add new items from an RSS feed to my database, if the feed has anything new. I have written the code to fetch the feed and update the database, but how do I make that code run every X minutes?
If I put that code inside the view function that renders the home page, it slows down the page load. I want it to happen automatically every X minutes without affecting my website's functionality.
VIEWS.PY
from django.shortcuts import render
from .models import Article, Slide
import feedparser

rss = feedparser.parse('url am passing')
already_updated = False
first_entry = rss.entries[0]
for slide in Slide.objects.all():
    if first_entry.title == slide.title:
        already_updated = True
if not already_updated:
    for entry in rss.entries:
        new = Slide(title=entry.title, article_name=Article.objects.last())
        new.save()
        print(entry['title'])

def test(request):
    articles = Article.objects.all()
    slides = Slide.objects.all()
    return render(request, 'sample/test_amp.html', {'articles': articles, 'slides': slides})
A simple approach is to use the APScheduler library. Once it is installed, you need to start the scheduler from the app's config file (apps.py) when the manage.py runserver command is run. Once the APScheduler process has started this way, it will run at every interval that you have defined. Here is a working example assuming you have an app called Home.
Directory structure:
Basedir
| - ProjectName
| - Home
| - - __init__.py
| - - admin.py
| - - apps.py
| - - models.py
| - - test.py
| - - views.py
| - - jobs.py
| - - BackgroundClass.py
In your BackgroundClass.py, you will define a function that does the processing: fetching the RSS feed and updating the DB with the results.
Home/BackgroundClass.py
class BackgroundClass:

    @staticmethod
    def update_db():
        # Do your update-the-DB-from-RSS task here
        pass
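As a rough sketch of what update_db could contain, reusing the feedparser logic and models from your views.py (the feed URL is your placeholder):

import feedparser

from .models import Article, Slide


class BackgroundClass:

    @staticmethod
    def update_db():
        # Fetch the feed and store any entries we haven't seen yet.
        rss = feedparser.parse('url am passing')  # your placeholder URL
        if not rss.entries:
            return
        first_entry = rss.entries[0]
        if Slide.objects.filter(title=first_entry.title).exists():
            return  # newest entry is already stored, nothing to do
        for entry in rss.entries:
            Slide(title=entry.title, article_name=Article.objects.last()).save()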
Now in your jobs.py, you will define a function that creates an instance of BackgroundScheduler from APScheduler, which keeps running in the background indefinitely and fires at every interval you define. From that job you call the update_db function of BackgroundClass.
Home/jobs.py
from apscheduler.schedulers.background import BackgroundScheduler
from .BackgroundClass import BackgroundClass
def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(BackgroundClass.update_db, 'interval', minutes=1)
    scheduler.start()
Now in apps.py, you call the start function defined in jobs.py when the manage.py runserver command is run, so your background task starts with the server and keeps running as long as the server is running, executing at every interval.
Home/apps.py
from django.apps import AppConfig
class HomeConfig(AppConfig):
    name = 'Home'

    def ready(self):
        import os
        from . import jobs
        # RUN_MAIN check to avoid running the code twice, since manage.py runserver calls ready() twice on startup
        if os.environ.get('RUN_MAIN', None) != 'true':
            jobs.start()
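Note that ready() is only called if the app is installed, so make sure the app (or its config class) is listed in INSTALLED_APPS, e.g. in settings.py:

# ProjectName/settings.py (excerpt)
INSTALLED_APPS = [
    # ... the usual django.contrib apps ...
    'Home.apps.HomeConfig',  # or just 'Home' on Django versions with automatic AppConfig discovery
]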
After reading the documentation on Output Caching based on a file target, I figured this workflow should be an example of output caching:
from time import sleep

from prefect import Flow, task
from prefect.engine.results import LocalResult

@task(target="func_task_target.txt", checkpoint=True,
      result=LocalResult(dir="~/.prefect"))
def func_task():
    sleep(5)
    return 99

with Flow("Test-cache") as flow:
    func_task()

if __name__ == '__main__':
    flow.run()
I would expect func_task to run one time, get cached, and then use the cached value next time I run the flow. However, it seems that func_task runs each time.
Where am I going wrong? Or have I misunderstood the documentation?
Try setting the environment variable PREFECT__FLOWS__CHECKPOINTING to true:
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
You can also change the results directory:
os.environ["PREFECT__HOME_DIR"] = "path to dir"
I have a DAG, and whenever it succeeds or fails, I want it to trigger a method that posts to Slack.
My DAG args look like this:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack there are more than 100 messages each minute, as if it is evaluating at every scheduler heartbeat, and for every log it ran both the success and the failure method, as if the same task instance had both worked and failed (not fine).
How should I properly use on_failure_callback and on_success_callback to handle DAG statuses and call a custom method?
The reason it's creating the messages is that when you define your default_args, you are executing the functions. You need to pass just the function reference without executing it.
Since the function has an argument, it'll get a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial

success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
def success_msg(context):
    # Airflow calls the callback with the task context as its argument.
    slack.slack_message(happy_message)

def failure_msg(context):
    slack.slack_message(sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either approach, note that only the function objects failure_msg and success_msg are passed, not the result of calling them.
default_args is expanded at the task level, so it becomes a per-task callback.
Apply the attribute at the DAG level, outside of "default_args".
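A rough sketch of what that looks like, reusing the names from the question (DAG accepts these callbacks directly, and Airflow passes the run context to them):

def notify_success(context):
    slack.slack_message(happy_message)

def notify_failure(context):
    slack.slack_message(sad_message)

dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False,
    on_success_callback=notify_success,  # fires once per successful DAG run
    on_failure_callback=notify_failure,  # fires once per failed DAG run
)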
What is the slack method you are referring to? The scheduler parses your DAG file at every heartbeat, so if the slack_message call is executed at the top level of your code, it is going to run at every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that will trigger based on the failure or success of the parent task (a rough sketch follows below).
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
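A rough sketch of the trigger-rule approach, reusing dag and the slack helper from the question (run_etl and the task ids are made-up names for illustration):

from airflow.operators.python_operator import PythonOperator

etl = PythonOperator(
    task_id='etl_task',
    python_callable=run_etl,  # hypothetical ETL function
    dag=dag,
)

notify_ok = PythonOperator(
    task_id='notify_success',
    python_callable=lambda: slack.slack_message(happy_message),
    trigger_rule='all_success',  # runs only if etl_task succeeded
    dag=dag,
)

notify_fail = PythonOperator(
    task_id='notify_failure',
    python_callable=lambda: slack.slack_message(sad_message),
    trigger_rule='one_failed',  # runs if etl_task failed
    dag=dag,
)

etl >> [notify_ok, notify_fail]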
I'm trying to solve a problem in Celery:
I have one task that queries an API for ids, and then starts a sub-task for each of these.
I do not know, ahead of time, what the ids are, or how many there are.
For each id, I go through a big calculation that then dumps some data into a database.
After all the sub-tasks are complete, I want to run a summary function (export DB results to an Excel format).
Ideally, I do not want to block my main worker querying the status of the sub-tasks (Celery gets angry if you try this.)
This question looks very similar (if not identical?): Celery: Callback after task hierarchy
So using the "solution" (which is a link to this discussion, I tried the following test script:
# test.py
from celery import Celery, chord
from celery.utils.log import get_task_logger

app = Celery('test', backend='redis://localhost:45000/10?new_join=1', broker='redis://localhost:45000/11')
app.conf.CELERY_ALWAYS_EAGER = False

logger = get_task_logger(__name__)

@app.task(bind=True)
def get_one(self):
    print('hello world')
    self.replace(get_two.s())
    return 1

@app.task
def get_two():
    print('Returning two')
    return 2

@app.task
def sum_all(data):
    print('Logging data')
    logger.error(data)
    return sum(data)

if __name__ == '__main__':
    print('Running test')

    x = chord(get_one.s() for i in range(3))
    body = sum_all.s()
    result = x(body)

    print(result.get())
    print('Finished w/ test')
It doesn't work for me. I get an error:
AttributeError: 'get_one' object has no attribute 'replace'
Note that I do have new_join=1 in my backend URL, though not the broker. If I put it there, I get an error:
TypeError: _init_params() got an unexpected keyword argument 'new_join'
What am I doing wrong? I'm using Python 3.4.3 and the following packages:
amqp==1.4.6
anyjson==0.3.3
billiard==3.3.0.20
celery==3.1.18
kombu==3.0.26
pytz==2015.4
redis==2.10.3
The Task.replace method will be added in Celery 3.2: http://celery.readthedocs.org/en/master/whatsnew-3.2.html#task-replace (that changelog entry is misleading, because it suggests that Task.replace existed before and has been changed.)
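Until you can upgrade, one possible workaround on Celery 3.1 is to launch the chord from inside a task once the ids are known, instead of replacing the running task. A rough sketch, reusing app and logger from your test script (get_ids, process_id, fan_out and summarize are illustrative names):

from celery import chord

@app.task
def get_ids():
    # Query the API for the ids (stubbed here).
    return [1, 2, 3]

@app.task
def process_id(item_id):
    # The big per-id calculation that dumps data into the database.
    return item_id * 2

@app.task
def summarize(results):
    # Runs once after every process_id task has finished.
    logger.error(results)
    return sum(results)

@app.task
def fan_out(ids):
    # Build and launch the chord here, where the ids are finally known; this does not block.
    chord(process_id.s(i) for i in ids)(summarize.s())

# Kick everything off without blocking the caller: get_ids feeds its result into fan_out.
(get_ids.s() | fan_out.s()).apply_async()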