I'm new to working with RSS feeds.
Every X minutes, I want to add entries from an RSS feed to my database if the feed contains anything new. I have written the code to fetch the feed and update the database, but how do I make that code run every X minutes?
If I put that code inside one of my view functions (the one that renders the home page), it slows down the page load. I want it to run automatically every X minutes without affecting my website's functionality.
views.py
from django.shortcuts import render
from .models import Article, Slide
import feedparser

# Note: everything above the view runs once at import time, not periodically.
rss = feedparser.parse('url am passing')
already_updated = False
first_entry = rss.entries[0]
for slide in Slide.objects.all():
    if first_entry.title == slide.title:
        already_updated = True
if not already_updated:
    for entry in rss.entries:
        new = Slide(title=entry.title, article_name=Article.objects.last())
        new.save()
        print(entry['title'])

def test(request):
    articles = Article.objects.all()
    slides = Slide.objects.all()
    return render(request, 'sample/test_amp.html', {'articles': articles, 'slides': slides})
A simple approach is to use the APScheduler library. Once it is installed, you call the scheduler from the app's config file (apps.py) so that it starts when the manage.py runserver command is run. Once the APScheduler process has started this way, it will run at every interval you define. Here is a working example, assuming you have an app called Home.
Directory structure:
Basedir
| - ProjectName
| - Home
| - - __init__.py
| - - admin.py
| - - apps.py
| - - models.py
| - - test.py
| - - views.py
| - - jobs.py
| - - BackgroundClass.py
In BackgroundClass.py, you define the function that does the actual processing: fetching the RSS feed and updating the database with the results.
Home/BackgroundClass.py
class BackgroundClass:

    @staticmethod
    def update_db():
        # Do your update-the-DB-from-RSS task here
        pass
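As a minimal sketch of what update_db could contain, here is the feed-handling logic from the question moved into that method (the feed URL and the Slide/Article models come from the question; the duplicate check is illustrative):

import feedparser

from .models import Article, Slide

class BackgroundClass:

    @staticmethod
    def update_db():
        # Fetch the feed and skip the update if the newest entry
        # is already stored (same check as in the question).
        rss = feedparser.parse('url am passing')
        first_entry = rss.entries[0]
        if Slide.objects.filter(title=first_entry.title).exists():
            return
        for entry in rss.entries:
            Slide(title=entry.title, article_name=Article.objects.last()).save()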
Now in jobs.py, you define a function that creates a BackgroundScheduler instance from APScheduler, which keeps running in the background indefinitely and fires at whatever interval you define. From that job you call the update_db function on BackgroundClass.
Home/jobs.py
from apscheduler.schedulers.background import BackgroundScheduler

from .BackgroundClass import BackgroundClass

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(BackgroundClass.update_db, 'interval', minutes=1)
    scheduler.start()
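Optionally, you can register a clean shutdown so the scheduler stops with the server process; this atexit hook is a sketch, not part of the original answer:

import atexit

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(BackgroundClass.update_db, 'interval', minutes=1)
    scheduler.start()
    # Stop the scheduler cleanly when the Django process exits.
    atexit.register(lambda: scheduler.shutdown(wait=False))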
Now in apps.py, you call the start function defined in jobs.py when the manage.py runserver command runs, so your background task starts with the server and keeps running as long as the server is running, executing at every interval.
Home/apps.py
from django.apps import AppConfig

class HomeConfig(AppConfig):
    name = 'Home'

    def ready(self):
        import os
        from . import jobs
        # RUN_MAIN check to avoid running the job twice, since
        # manage.py runserver invokes ready() twice on startup.
        if os.environ.get('RUN_MAIN', None) != 'true':
            jobs.start()
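Depending on your Django version, you may need to point INSTALLED_APPS at this config class explicitly so that ready() is called; a minimal sketch (newer Django versions pick up a single AppConfig subclass automatically when you list plain 'Home'):

# settings.py
INSTALLED_APPS = [
    # ...
    'Home.apps.HomeConfig',
]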
Related
I have two Flask modules, app.py and tasks.py.
I set up Celery in tasks.py to complete a Selenium webdriver request (which takes about 20 seconds). My goal is simply to return the result of that request to app.py.
Running the Celery worker in another terminal, I can see in the console that the Celery task completes successfully and prints all the data I need from the Selenium request. However, now I just want to return the task result to app.py.
How do I obtain the Celery worker's result data from tasks.py and store each result element as a variable in app.py?
app.py:
I define the marketplace, call the task function, and request the indexed results:
import tasks
marketplace = 'cheddar_block_games'
# This is what I am trying to get back:
price_check = tasks.scope(marketplace[0])
image = tasks.scope(marketplace[1])
tasks.py:
from celery import Celery
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

celery = Celery(broker='redis://127.0.0.1:6379')

@celery.task()
def scope(marketplace):
    # 'web' is the selenium WebDriver instance created elsewhere in this module.
    web.get(f'https://magiceden.io/marketplace/{marketplace}')
    price_check = WebDriverWait(web, 30).until(EC.visibility_of_element_located((By.XPATH, "/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[3]/div[2]/div[4]/div/div[2]/div[1]/div[2]/div/div[2]/div/div[2]/div/span/div[2]/div/span[1]"))).text
    image = WebDriverWait(web, 30).until(EC.visibility_of_element_located((By.XPATH, "/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[3]/div[2]/div[4]/div/div[2]/div[1]/div[2]/div/div[1]/div/div/img")))
    return (price_check, image)
This answer might be relevant:
https://stackoverflow.com/a/30760142/9347535
app.py should call the task asynchronously, e.g. using scope.delay or scope.apply_async. You can then fetch the task result with AsyncResult.get():
https://docs.celeryq.dev/en/latest/userguide/tasks.html#result-backends
Since the task returns a tuple, you can store each variable by unpacking it:
https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences
The result would be something like this:
import tasks
marketplace = 'cheddar_block_games'
result = tasks.scope.delay(marketplace)
price_check, image = result.get()
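Note that AsyncResult.get() only works when a result backend is configured, and the question's Celery() call passes only a broker. A minimal sketch, assuming the same Redis instance should also store results:

from celery import Celery

# Without a result backend, result.get() cannot retrieve anything.
celery = Celery(
    broker='redis://127.0.0.1:6379',
    backend='redis://127.0.0.1:6379',
)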
I'm attempting to use a cron job to update the number of days each record in my database has been "open" (i.e. the number of days between today and the created date). The logic I'm using is for the cron job to run every night at 23:00 and to update the days_open field (an IntegerField) by F('days_open') + 1 each time the job runs.
I've set the run time to once a minute for testing purposes. I've run "python manage.py runcrons request_form_app.cron.DaysOpenCronJob" and "python manage.py runcrons --force" to force the jobs. I receive no errors, but the field is not updating on any of the records.
cron.py
from django_cron import CronJobBase, Schedule
from django.db.models import F

class DaysOpenCronJob(CronJobBase):
    RUN_EVERY_MINS = 1
    # RUN_AT_TIMES = ['23:00']
    schedule = Schedule(run_at_times=RUN_EVERY_MINS)
    code = 'request_form_app.cron.DaysOpenCronJob'

    def update_days(self, *args, **kwargs):
        data_request = Request.objects.all()
        for record in data_request:
            record.days_open = F('days_open') + 1
            record.save(update_field=['days_open'])
Use update_fields (plural), not update_field.
The save method of the base class django.db.models.base.Model doesn't support any keyword arguments beyond the ones in its function definition.
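The corrected call, plus an alternative that performs the whole increment in a single query (a sketch; Request is the model from the question):

from django.db.models import F

# Corrected per-record save:
record.save(update_fields=['days_open'])

# Alternative: let the database increment every row in one UPDATE.
Request.objects.update(days_open=F('days_open') + 1)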
I'm trying to schedule a job to start every minute.
I have the scheduler defined in a scheduler.py script:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor

executors = {
    'default': ThreadPoolExecutor(10),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 5
}
scheduler = BackgroundScheduler(executors=executors, job_defaults=job_defaults)
I initialize the scheduler in the __init__.py of the module like this:
from scheduler import scheduler
scheduler.start()
I want to start a scheduled job on a specific action, like this:
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.scheduled_job(func=TestScheduler(),
                            trigger='interval',
                            minutes=1,
                            id=job_id)

def TestScheduler():
    for i in range(0, 29):
        starttime = time()
        print "test"
        sleep(1.0 - ((time() - starttime) % 1.0))
First: when I execute the AddJob() function in the Python console, it starts to run as expected, but not in the background; the console is blocked until the TestScheduler function ends after 30 seconds. I was expecting it to run in the background because it's a background scheduler.
Second: the job never runs again, even though I specified a repeat interval of 1 minute.
What am I missing?
UPDATE
I found the issue thanks to another thread. The wrong line is this:
scheduler.scheduled_job(func=TestScheduler(),
                        trigger='interval',
                        minutes=1,
                        id=job_id)
I changed it to:
scheduler.add_job(func=TestScheduler,
                  trigger='interval',
                  minutes=1,
                  id=job_id)
TestScheduler() becomes TestScheduler. Using TestScheduler() causes the result of calling TestScheduler() to be passed as an argument to add_job(), rather than the function itself.
The first problem seems to be that you are initializing the scheduler inside __init__.py, which is not the recommended way.
Code in __init__.py is executed the first time a module from that package is imported. For example, imagine this structure:
my_module
|--__init__.py
|--test.py
with __init__.py:
from scheduler import scheduler
scheduler.start()
the scheduler.start() command is executed the moment you run from my_module import something. So the scheduler either doesn't start at all from __init__.py, or it starts multiple times (depending on the rest of your code!).
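A minimal illustration of that behavior (hypothetical file contents; Python caches imported modules, so the package initializer runs only on the first import):

# my_module/__init__.py
print('initializing my_module')  # side effect at import time

# main.py
from my_module import test  # prints 'initializing my_module'
from my_module import test  # module is cached: prints nothing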
Another problem must be the use of the scheduler.scheduled_job() method. If you read the documentation on adding jobs, you will see that the recommended way is to use the add_job() method; scheduled_job() is a decorator provided for convenience.
I would suggest something like this:
Keep my_scheduler.py as is.
Remove the scheduler.start() line from __init__.py.
Change your main file as follows:
from my_scheduler import scheduler

if not scheduler.running:  # Clause suggested by @CyrilleMODIANO
    scheduler.start()

def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id
    )
...
I'm trying to solve a problem in celery:
I have one task that queries an API for ids, and then starts a sub-task for each of these.
I do not know, ahead of time, what the ids are, or how many there are.
For each id, I go through a big calculation that then dumps some data into a database.
After all the sub-tasks are complete, I want to run a summary function (export DB results to an Excel format).
Ideally, I do not want to block my main worker querying the status of the sub-tasks (Celery gets angry if you try this.)
This question looks very similar (if not identical?): Celery: Callback after task hierarchy
So using the "solution" (which is a link to this discussion), I tried the following test script:
# test.py
from celery import Celery, chord
from celery.utils.log import get_task_logger

app = Celery('test', backend='redis://localhost:45000/10?new_join=1', broker='redis://localhost:45000/11')
app.conf.CELERY_ALWAYS_EAGER = False

logger = get_task_logger(__name__)

@app.task(bind=True)
def get_one(self):
    print('hello world')
    self.replace(get_two.s())
    return 1

@app.task
def get_two():
    print('Returning two')
    return 2

@app.task
def sum_all(data):
    print('Logging data')
    logger.error(data)
    return sum(data)

if __name__ == '__main__':
    print('Running test')
    x = chord(get_one.s() for i in range(3))
    body = sum_all.s()
    result = x(body)
    print(result.get())
    print('Finished w/ test')
It doesn't work for me. I get an error:
AttributeError: 'get_one' object has no attribute 'replace'
Note that I do have new_join=1 in my backend URL, though not the broker. If I put it there, I get an error:
TypeError: _init_params() got an unexpected keyword argument 'new_join'
What am I doing wrong? I'm using Python 3.4.3 and the following packages:
amqp==1.4.6
anyjson==0.3.3
billiard==3.3.0.20
celery==3.1.18
kombu==3.0.26
pytz==2015.4
redis==2.10.3
The Task.replace method will be added in Celery 3.2: http://celery.readthedocs.org/en/master/whatsnew-3.2.html#task-replace (that changelog entry is misleading, because it suggests that Task.replace existed before and was merely changed).
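Until then, a sketch of a workaround under the same assumptions as the test script: since each get_one exists only to hand off to get_two, you can build the chord over the final tasks directly and drop self.replace:

# Workaround for Celery 3.1 (no Task.replace): chord directly over
# the tasks that produce the values to be summed.
x = chord(get_two.s() for i in range(3))
result = x(sum_all.s())
print(result.get())  # sum_all receives [2, 2, 2] and returns 6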
I want to use pyinotify to watch changes on the filesystem. If a file has changed, I want to update my database file accordingly (re-read tags, other information...)
I put the following code in my app's signals.py
import pyinotify
....

# create filesystem watcher in separate thread
wm = pyinotify.WatchManager()
notifier = pyinotify.ThreadedNotifier(wm, ProcessInotifyEvent())
# notifier.setDaemon(True)
notifier.start()

mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO | pyinotify.IN_MOVED_FROM
dbgprint("Adding path to WatchManager:", settings.MUSIC_PATH)
wdd = wm.add_watch(settings.MUSIC_PATH, mask, rec=True, auto_add=True)

def connect_all():
    """
    to be called from models.py
    """
    rescan_start.connect(rescan_start_callback)
    upload_done.connect(upload_done_callback)
    ....
This works great when Django is run with './manage.py runserver'. However, when run as './manage.py runfcgi', Django won't start. There is no error message; it just hangs and won't daemonize, probably at the line 'notifier.start()'.
When I run './manage.py runfcgi method=threaded' and enable the line 'notifier.setDaemon(True)', the notifier thread is stopped (isAlive() = False).
What is the correct way to start endless threads together with Django when Django is run as FCGI? Is it even possible?
Well, duh. Never start your own endless thread alongside Django. I use Celery, which is much better suited to running such background work.
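For the record, a minimal sketch of what that can look like with a periodic Celery task on a modern Celery (the names, broker URL, and schedule are illustrative, not from the original setup):

# tasks.py -- replaces the pyinotify thread with a celery beat job.
from celery import Celery

app = Celery('watcher', broker='redis://localhost:6379/0')

# Run the rescan every five minutes via celery beat.
app.conf.beat_schedule = {
    'rescan-music-path': {
        'task': 'tasks.rescan_music_path',
        'schedule': 300.0,
    },
}

@app.task
def rescan_music_path():
    # Re-read tags and other information, then update the database.
    ...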