I want to schedule the execution of a crawler where the frequency will be based on user input in the frontend.
This means I need to keep control of the Scrapy execution (scrapy crawl xbot) in the backend and change the scheduler frequency accordingly.
And, based on the status of the crawler execution, I will need to run functions to update the database daily.
I've built an MVP using FastAPI, but the Scrapy job is currently running independently in a Docker container, at a fixed frequency, and writing to the DB. The backend, running in a different container, CRUDs the database.
If I use shell scripts to run cron jobs for the Scrapy command and the Python function commands, how do I modify the cron job frequency?
If I use Python scheduler/crontab packages, how do I modify the frequency based on user input, and check the execution status of the Scrapy job?
Any suggestions would be helpful!
ETA: I'm planning to expose the crawler command via API. The backend will contain the scheduler module which will call the API and then synchronously call the dependent background jobs. The other jobs that are not dependent on crawler execution will run independently.
Only issue now is modifying frequency on user input.
Maybe I can restrict it to custom frequencies like "Once at a Specific Date & Time", "Every Alternate Day at X time", "Daily at X time", "Every Week on Y day at X time", and so on, and write a specific function for each. E.g., schedule.every().day.at(X_time).do(job), or an if...else chain (7 comparisons) like if 'thursday': schedule.every().thursday.at(X_time).do(job), ...
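A minimal sketch of that mapping with the schedule package, assuming the frontend sends a frequency type, a time string, and an optional weekday (all names here are hypothetical):

import schedule

def apply_user_schedule(freq, at_time, weekday, job):
    # Re-register the job according to the user's latest choice (hypothetical helper).
    schedule.clear()  # drop the previously configured frequency
    if freq == "daily":
        schedule.every().day.at(at_time).do(job)
    elif freq == "alternate_day":
        schedule.every(2).days.at(at_time).do(job)
    elif freq == "weekly":
        # getattr avoids writing out the seven weekday branches by hand
        getattr(schedule.every(), weekday.lower()).at(at_time).do(job)

# e.g. apply_user_schedule("weekly", "09:30", "Thursday", crawl_job),
# with schedule.run_pending() called in the scheduler loop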
Related
My (simplified and generalized) need is the following:
There is a main program which starts my Python script every minute. By design, this script is intended to be a short-runner.
Its goal is to call different webhooks - not on every run, but at individual intervals, e.g. every 5 minutes. (The script can read the configuration from a file.)
So it is not possible for me to use apscheduler in a permanently running program.
Instead, the script itself must determine on each run which API calls are overdue and have to be made now.
Is it possible to use apscheduler for this?
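For reference, the overdue check itself can be done without a long-running scheduler at all; here is a minimal sketch that does not use apscheduler, assuming last-run timestamps in a JSON file and a hard-coded interval config (both assumptions, not from the question):

import json
import time
import urllib.request

STATE_FILE = "last_runs.json"  # assumed location of the last-run timestamps
HOOKS = {
    "https://example.com/hook-a": 300,  # webhook URL -> interval in seconds
    "https://example.com/hook-b": 900,
}

def run_overdue_hooks():
    try:
        with open(STATE_FILE) as f:
            last_runs = json.load(f)
    except FileNotFoundError:
        last_runs = {}
    now = time.time()
    for url, interval in HOOKS.items():
        if now - last_runs.get(url, 0) >= interval:
            urllib.request.urlopen(url)  # call the overdue webhook
            last_runs[url] = now
    with open(STATE_FILE, "w") as f:
        json.dump(last_runs, f)

if __name__ == "__main__":
    run_overdue_hooks()  # invoked once per minute by the main program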
I am currently using Airflow to run a DAG (say dag.py) which has a few tasks and then executes a Python script (done via bash_operator). The Python script (say report.py) basically takes data from a cloud (S3) location as a dataframe, does a few transformations, and then sends it out as a report over email.
But the issue I'm having is that Airflow is basically running this Python script, report.py, every time it scans the repository for changes (i.e. every 2 minutes). So the script is being run every 2 minutes (and hence the email is being sent out every two minutes!).
Is there any workaround to this? Can we use something apart from a bash operator (bear in mind that we need to do a few dataframe transformations before sending out the report)?
Thanks!
Just make sure you do everything serious in the tasks, not in the Python script itself. The script will be executed often by the scheduler, but it should simply create tasks and build the dependencies between them. The actual work is done in the 'execute' methods of the tasks.
For example, rather than sending the email in the script, you should add the 'EmailOperator' as a task with the right dependencies, so that the operator's execute method runs not when the file is parsed by the scheduler, but when all of its dependencies (the other tasks) have completed.
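A rough sketch of that structure, assuming Airflow 2.x; the DAG name, S3 path, and recipient below are placeholders, not from the original answer:

from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def build_report():
    # the heavy work lives inside the task, not at module (parse) level
    df = pd.read_csv("s3://my-bucket/input.csv")  # placeholder path
    df.describe().to_html("/tmp/report.html")

with DAG(
    dag_id="daily_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="build_report", python_callable=build_report)
    send = EmailOperator(
        task_id="send_report",
        to="team@example.com",  # placeholder recipient
        subject="Daily report",
        html_content="Report attached.",
        files=["/tmp/report.html"],
    )
    transform >> send  # email is sent only after the transformation task succeeds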
I'm working on a project whose main feature will be periodically running one type of async task for each user. Every user will be able to configure the task (running daily, weekly, etc. at a specified time). The task will also use some data stored by the user. Now I'm wondering which approach is better: allow users to create their own PeriodicTask (through some restricted endpoint, of course), or create a single PeriodicTask (for example running every 5 minutes) which iterates over all users and determines whether the task should be queued for each of them? I think I will use AMQP as the broker.
The periodic task scheduler in celery is not designed to handle thousands of scheduled tasks, so from a performance perspective, a much better solution is to have one task that runs at the smallest interval you allow (e.g. if users can schedule daily, weekly, or monthly, running the task daily is enough).
Such an approach is also more stable - every time the schedule changes, all of the schedule records are reloaded.
It is also more secure, because you do not expose or use any internal mechanisms for task execution.
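A minimal sketch of the single-dispatcher approach, assuming an AMQP broker and hypothetical get_users() / task_due_today() helpers for the per-user schedule check:

from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="amqp://localhost")  # assumed broker URL

app.conf.beat_schedule = {
    "dispatch-user-tasks": {
        "task": "tasks.dispatch_user_tasks",
        "schedule": crontab(hour=0, minute=0),  # the smallest interval offered: daily
    },
}

def get_users():
    # hypothetical data access; replace with the real user query
    return []

@app.task(name="tasks.run_user_task")
def run_user_task(user_id):
    pass  # the actual per-user async work

@app.task(name="tasks.dispatch_user_tasks")
def dispatch_user_tasks():
    # single beat entry: iterate over all users and queue only those due today
    for user in get_users():
        if user.task_due_today():  # hypothetical schedule check on the user model
            run_user_task.delay(user.id)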
I have a Flask web service application with some daily, weekly, and monthly events. I want to store these events and calculate their start times - for example, for an order with a count of two and a weekly period.
The first payment is today and the other one is next week.
I want to store the repeated times and then, for each of them, send a notification at the start time.
What is the best solution?
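For the "calculate their start times" part, a small sketch of one possible approach (using dateutil, which is an assumption, not something from the question):

from datetime import datetime
from dateutil.rrule import rrule, WEEKLY

# an order with a count of two and a weekly period:
# the first occurrence is today, the second one is next week
occurrences = list(rrule(WEEKLY, count=2, dtstart=datetime.now()))
for start_time in occurrences:
    print(start_time)  # store each start_time and notify when it arrives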
I have used the Windows Task Scheduler to schedule a .bat file. The .bat file contained some short code to run the Python script.
This way the script is not idling in the background when you are not using it.
As for storing data in between, I would save it to a file.
I have code which deletes an API token when executed. Now I want it to execute after some time, let's say two weeks. Any ideas or directions on how to implement it?
My code:
authtoken = models.UserApiToken.objects.get(api_token=token)
authtoken.delete()
This is inside a function and executed when a request is made.
There are two main ways to get this done:
Make it a custom management command, and trigger it through crontab.
Use celery, make it a celery task, and use celerybeat to trigger the job after 2 weeks.
I would recommend celery, as it gives you better control over your task queues and jobs.
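A rough sketch of the celery route, wrapping the deletion from the question in a task and queuing it with a two-week countdown (the app/module names here are assumptions):

from celery import shared_task

from myapp import models  # hypothetical import; use the same models module as the view

@shared_task
def delete_api_token(token):
    # same deletion as in the question, deferred to a Celery worker
    authtoken = models.UserApiToken.objects.get(api_token=token)
    authtoken.delete()

# inside the request handler: queue the deletion to run two weeks from now
# delete_api_token.apply_async(args=[token], countdown=14 * 24 * 3600)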