Cron job (using django_cron) not updating objects - python

I'm attempting to use a cron job to update the number of days each record in my database has been "open" for (i.e. the number of days between today and the created date). The logic I'm using is for the cron job to run every night at 23:00 and to update the days_open field (IntegerField) by F('days_open') + 1 each time the job runs.
I've set the run time to once a minute for testing purposes. I've run python manage.py runcrons "request_form_app.cron.DaysOpenCronJob" and python manage.py runcrons --force to force the jobs. I receive no errors, but the field is not updating on any of the records.
cron.py
from django_cron import CronJobBase, Schedule
from django.db.models import F

class DaysOpenCronJob(CronJobBase):
    RUN_EVERY_MINS = 1
    # RUN_AT_TIMES = ['23:00']
    schedule = Schedule(run_at_times=RUN_EVERY_MINS)
    code = 'request_form_app.cron.DaysOpenCronJob'

    def update_days(self, *args, **kwargs):
        data_request = Request.objects.all()
        for record in data_request:
            record.days_open = F('days_open') + 1
            record.save(update_field=['days_open'])

Use update_fields (plural), not update_field.
The save() method of the base class django.db.models.base.Model doesn't accept any keyword arguments other than those in its signature.
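Note also that django_cron only runs a method named do(); a method called update_days() is never invoked, so the job completes without touching any records. A minimal sketch of a working job (assuming the Request model lives in request_form_app.models):

from django_cron import CronJobBase, Schedule
from django.db.models import F

from request_form_app.models import Request  # assumed model location

class DaysOpenCronJob(CronJobBase):
    RUN_EVERY_MINS = 1  # switch back to RUN_AT_TIMES = ['23:00'] once tested
    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = 'request_form_app.cron.DaysOpenCronJob'

    def do(self):
        # A single queryset update increments every row in one SQL statement,
        # so there is no need to loop over records and save each one.
        Request.objects.update(days_open=F('days_open') + 1)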

Related

How to update my Django database with rss feed every X minutes?

I'm new to working with RSS feeds.
Every X minutes, I want to add items from an RSS feed to my database if there is anything new in it. I have written the code that fetches the feed and updates the database, but how do I make that code run every X minutes?
If I put that piece of code inside one of my view functions, which renders the home page, it slows down the page load. I want it to happen automatically every X minutes without affecting my website's functionality.
VIEWS.PY
from django.shortcuts import render
from .models import Article, Slide
import feedparser

rss = feedparser.parse('url am passing')

already_updated = False
first_entry = rss.entries[0]
for slide in Slide.objects.all():
    if first_entry.title == slide.title:
        already_updated = True
if not already_updated:
    for entry in rss.entries:
        new = Slide(title=entry.title, article_name=Article.objects.last())
        new.save()
        print(entry['title'])

def test(request):
    articles = Article.objects.all()
    slides = Slide.objects.all()
    return render(request, 'sample/test_amp.html', {'articles': articles, 'slides': slides})
A simple approach is to use the APScheduler library. Once it is installed, you call the scheduler from the app's config file (apps.py) so that it starts when the manage.py runserver command is run. Once the APScheduler process has started this way, it runs at every interval you have defined. Here is a working example, assuming you have an app called Home.
Directory structure:
Basedir
| - ProjectName
| - Home
| - - __init__.py
| - - admin.py
| - - apps.py
| - - models.py
| - - test.py
| - - views.py
| - - jobs.py
| - - BackgroundClass.py
In your BackgroundClass.py, you define the function that does the actual work: fetching the RSS feed and updating the DB with the results.
Home/BackgroundClass.py
class BackgroundClass:

    @staticmethod
    def update_db():
        # Do your update-the-DB-from-RSS task here
        pass
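As a rough sketch only, update_db could reuse the feed logic from the question (the feed URL is the placeholder from the question, and the Article/Slide models are taken from the question's views.py):

import feedparser

from .models import Article, Slide  # models from the question

class BackgroundClass:

    @staticmethod
    def update_db():
        # Fetch the feed and store any entries that are not already saved
        rss = feedparser.parse('url am passing')  # placeholder URL from the question
        if not rss.entries:
            return
        first_entry = rss.entries[0]
        if Slide.objects.filter(title=first_entry.title).exists():
            return  # already up to date
        for entry in rss.entries:
            Slide.objects.create(title=entry.title, article_name=Article.objects.last())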
Now in your jobs.py, you define a function that creates a BackgroundScheduler instance from APScheduler, which keeps running in the background indefinitely and fires at every X interval you define. From that job you call the update_db function of BackgroundClass.
Home/jobs.py
from apscheduler.schedulers.background import BackgroundScheduler
from .BackgroundClass import BackgroundClass

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(BackgroundClass.update_db, 'interval', minutes=1)
    scheduler.start()
Now in apps.py, you call the start() function defined in jobs.py when the manage.py runserver command is run, so your background task starts with the server and keeps running as long as the server does, executing at every X interval.
Home/apps.py
from django.apps import AppConfig

class HomeConfig(AppConfig):
    name = 'Home'

    def ready(self):
        import os
        from . import jobs
        # RUN_MAIN check to avoid running the code twice, since
        # manage.py runserver calls ready() twice on startup
        if os.environ.get('RUN_MAIN', None) != 'true':
            jobs.start()
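For ready() to fire, Django has to use this config class for the app; one way (an assumption about the project settings, which aren't shown above) is to reference it explicitly in INSTALLED_APPS:

# ProjectName/settings.py (sketch)
INSTALLED_APPS = [
    # ...
    'Home.apps.HomeConfig',
]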

Set user_id in Mlflow

I am running MLflow, and several users can instantiate experiments.
Is there any way to override user_id, which is set to the system username?
I'm looking for a solution that works with the start_run block.
Any ideas?
with mlflow.start_run():
start_run doesn't support user_id as an argument, and its kwargs are actually run tags, so you need to work with other functions.
To do what you need, you have to create the run yourself from a FileStore object, then wrap this new run inside an ActiveRun wrapper, which automatically finishes the run passed as argument once the with-block exits.
The code should look like this:
import time

import mlflow
from mlflow import ActiveRun
from mlflow.store.tracking.file_store import FileStore

now = int(time.time() * 1000)  # mlflow default when start_time is None

fs = FileStore()
run = fs.create_run(
    experiment_id=None,        # defaults to "0"
    user_id="your_user_id",    # your actual user-id
    start_time=now,
    tags={},
)

with ActiveRun(run) as arun:
    ...

# after the with-block, the run is finished
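Note that the fluent mlflow.log_* helpers are tied to runs started via start_run(), so inside this block it is safer to log against the run id explicitly; a small sketch with illustrative parameter and metric names:

from mlflow.tracking import MlflowClient

client = MlflowClient()

with ActiveRun(run):
    # Log against the run id directly rather than relying on an "active" run
    client.log_param(run.info.run_id, "model", "baseline")   # illustrative values
    client.log_metric(run.info.run_id, "accuracy", 0.93)
# the run is marked as finished once the with-block exits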

Celery task does not timeout

I would like to verify that setting time limits for Celery tasks works.
My configuration currently looks like this:
CELERYD_TASK_SOFT_TIME_LIMIT = 30
CELERYD_TASK_TIME_LIMIT = 120
task_soft_time_limit = 29
task_time_limit = 44
I am overloading the timeout parameters because it appears there is a name change coming, and I just want to be sure that I hit at least one timeout.
But when I run a test in the debugger, the Celery app.conf dictionary looks like this:
(Pdb) app.conf['task_time_limit'] == None
True
(Pdb) app.conf['task_soft_time_limit'] == None
True
(Pdb) app.conf['CELERYD_TASK_SOFT_TIME_LIMIT']
30
(Pdb) app.conf['CELERYD_TASK_TIME_LIMIT']
120
I've written a test which I believe should trigger the timeout, but no error is ever raised:
@app.task(soft_time_limit=15)
def time_out_task():
    import time
    count = 0
    # import pdb; pdb.set_trace()
    while count < 1000:
        time.sleep(1)
        count += 1
        print(count)
My questions are as follows:
What are the canonical settings I should use for a soft and a hard time limit?
How can I execute a task in a test that proves to me the time limits are in place?
Thanks
I solved the issue by changing the way I was testing and by changing the way I was importing the Celery configuration.
Initially, I was setting the configuration by importing a Django settings object:
app = Celery('groot')
app.config_from_object('django.conf:settings', namespace='CELERY')
But this was ignoring the settings with the CELERYD_... prefix. So I switched to the new-style names and called the following:
app.conf.update(
    task_soft_time_limit=30,
    task_time_limit=120,
)
I also changed from testing this in the Django test environment to spinning up an actual Celery worker and sending the task to it.
If someone can supply a solution for how to test the timeout settings in a unit test, it would be much appreciated.
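One way to exercise the limits against a real worker is a throwaway task that deliberately overruns its soft limit and catches the resulting exception; this is only a sketch (the task name and limits are illustrative), and note that soft time limits require a worker pool that supports them (e.g. prefork):

import time

from celery.exceptions import SoftTimeLimitExceeded

@app.task(soft_time_limit=5, time_limit=10)
def time_out_probe():
    try:
        time.sleep(30)  # deliberately exceed the 5 second soft limit
    except SoftTimeLimitExceeded:
        # only reached if the worker actually enforces the soft limit
        return "soft limit enforced"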
In Django settings, CELERY_TASK_TIME_LIMIT works for me:
CELERY_TASK_TIME_LIMIT = 60

APScheduler job is not starting as scheduled

I'm trying to schedule a job to start every minute.
I have the scheduler defined in a scheduler.py script:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor

executors = {
    'default': ThreadPoolExecutor(10),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 5
}
scheduler = BackgroundScheduler(executors=executors, job_defaults=job_defaults)
I initialize the scheduler in the __init__.py of the module like this:
from scheduler import scheduler
scheduler.start()
I want to start a scheduled job on a specific action, like this:
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.scheduled_job(func=TestScheduler(),
                            trigger='interval',
                            minutes=1,
                            id=job_id
                            )

def TestScheduler():
    for i in range(0, 29):
        starttime = time()
        print "test"
        sleep(1.0 - ((time() - starttime) % 1.0))
First: when I execute the AddJob() function in the Python console, it starts to run as expected, but not in the background; the console is blocked until the TestScheduler function ends after 30 seconds. I expected it to run in the background because it's a background scheduler.
Second: the job never starts again, even though a repeat interval of 1 minute is specified.
What am I missing?
UPDATE
I found the issue thanks to another thread. The wrong line is this:
scheduler.scheduled_job(func=TestScheduler(),
                        trigger='interval',
                        minutes=1,
                        id=job_id
                        )
I changed it to:
scheduler.add_job(func=TestScheduler,
                  trigger='interval',
                  minutes=1,
                  id=job_id
                  )
TestScheduler() becomes TestScheduler. Using TestScheduler() causes the result of calling TestScheduler() to be passed as the argument to add_job(), rather than the function itself.
The first problem seems to be that you are initializing the scheduler inside __init__.py, which isn't the recommended way.
Code that lives in __init__.py gets executed the first time a module from that package is imported. For example, imagine this structure:
my_module
|--__init__.py
|--test.py
with __init__.py:
from scheduler import scheduler
scheduler.start()
the scheduler.start() command gets executed on from my_module import something. So it either doesn't start at all or it starts many times (depending on the rest of your code!).
Another problem must be the use of the scheduler.scheduled_job() method. If you read the documentation on adding jobs, you will see that the recommended way is to use add_job(); scheduled_job() is a decorator provided for convenience.
I would suggest something like this:
Keep my_scheduler.py as is.
Remove the scheduler.start() line from __init__.py.
Change your main file as follows:
from my_scheduler import scheduler

if not scheduler.running:  # Clause suggested by @CyrilleMODIANO
    scheduler.start()
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id
    )
...
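To verify that the job was actually registered and see when it will fire next, a quick check (sketch) is:

for job in scheduler.get_jobs():
    print(job.id, job.next_run_time)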

django DoesNotExist matching query does not exist with postgres only [duplicate]

This question already has answers here:
Django related objects are missing from celery task (race condition?)
(3 answers)
Closed 5 years ago.
The assets Django app I'm working on runs well with SQLite, but I am facing performance issues with deletes/updates of large sets of records, so I am transitioning to a PostgreSQL database.
To do so, I am starting fresh by updating theapp/settings.py to configure PostgreSQL, starting with a fresh db, and deleting the assets/migrations/ directory. I then run:
./manage.py makemigrations assets
./manage.py migrate --run-syncdb
./manage.py createsuperuser
I have a function that is called from a registered post_create signal; it runs a scan when a Scan object is created. Within the class assets.models.Scan:
@classmethod
def post_create(cls, sender, instance, created, *args, **kwargs):
    if not created:
        return
    from celery.result import AsyncResult
    # get the domains for the project, from scan
    print("debug: task = tasks.populate_endpoints.delay({})".format(instance.pk))
    task = tasks.populate_endpoints.delay(instance.pk)
The offending code:
from celery import shared_task
....
import datetime

@shared_task
def populate_endpoints(scan_pk):
    from .models import Scan, Project
    from anotherapp.plugins.sensual import subdomains

    scan = Scan.objects.get(pk=scan_pk)  # <<<<<<<< django no like
    new_entries_count = 0
    project = Project.objects.get(id=scan.project.id)
    ....
The DoesNotExist exception that is raised:
debug: task = tasks.populate_endpoints.delay(2)
[2017-09-14 23:18:34,950: ERROR/ForkPoolWorker-8] Task assets.tasks.populate_endpoints[4555d329-2873-4184-be60-55e44c46a858] raised unexpected: DoesNotExist('Scan matching query does not exist.',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/src/app/theapp/assets/tasks.py", line 12, in populate_endpoints
scan = Scan.objects.get(pk=scan_pk)
Interacting through ./manage.py shell however indicates that Scan object with pk == 2 exists:
>>> from assets.models import Scan
>>> Scan.objects.all()
<QuerySet [<Scan: ACME Web Test Scan>]>
>>> s = Scan.objects.all().first()
>>> s.pk
2
My only guess is that at the time the post_create function is called, the Scan object does not yet exist in the PostgreSQL database, despite save() having been called.
SQLite does not exhibit this problem.
Also, I haven't found a relevant related problem on Stack Overflow, as the DoesNotExist exception looks fairly generic and can be caused by many things.
Any ideas on this would be much appreciated.
This is a well-known problem resulting from transactions and the isolation level: sometimes the transaction has not yet been committed when the task executes, and if your isolation level is READ COMMITTED then you indeed can't read that record from another process. Django 1.9 introduced the on_commit hook as a solution.
NB: technically this question is a duplicate of Django related objects are missing from celery task (race condition?), but the accepted answer there uses django-transaction-hooks, which has since been merged into Django.
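A hedged sketch of applying that hook to the signal handler from the question (assuming Django 1.9+; the handler body is otherwise unchanged):

from django.db import transaction

@classmethod
def post_create(cls, sender, instance, created, *args, **kwargs):
    if not created:
        return
    # Enqueue the task only after the surrounding transaction commits,
    # so the worker can actually see the new Scan row.
    transaction.on_commit(lambda: tasks.populate_endpoints.delay(instance.pk))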
