I would like to know how to automatically execute code in Django at a defined interval.
I'm building a web crawler that collects information and stores it in a JSON file, plus a function that reads the file and stores its contents in a SQLite database. This information is rendered and visible on my website. At the moment I have to run the crawler and the database-saving function by clicking a button, which is very inefficient. It would be better if the database were updated automatically, about every 6 hours (of course, only while the server is running).
If you want to keep the code on your web server and you have the permissions, then a cron job (Linux) or Scheduled Task (Windows) will do what you want. Set the cron job to run every six hours and call your Django script.
You can run the script as a Django management command, e.g. manage.py mycommand, or as a stand-alone .py file that imports the required libraries and calls django.setup().
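As a minimal sketch of the management-command route (the app name, command name, and crawler helpers below are assumptions, not your actual code):

```python
# myapp/management/commands/runcrawler.py  (app and command names are made up)
from django.core.management.base import BaseCommand

from myapp.crawler import crawl_to_json, load_json_into_db  # assumed helper functions


class Command(BaseCommand):
    help = "Run the crawler and load the results into the database"

    def handle(self, *args, **options):
        crawl_to_json()        # write the crawled data to the JSON file
        load_json_into_db()    # read the JSON file and store it in SQLite
        self.stdout.write(self.style.SUCCESS("Crawl finished"))
```

The matching crontab entry would then be something along the lines of `0 */6 * * * /path/to/venv/bin/python /path/to/project/manage.py runcrawler`, i.e. every six hours.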
Or, if you want the crawler to run within the context of your web server, your cron job can initiate a GET request using HTTPie. That gives you a nice API approach.
When I run this, I keep a flag in the database (a model called Job) that records what is running, in case my scheduled task is still going when the next cron run fires. I have some very long-running asynchronous tasks, and I tend to use cron rather than Celery.
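A rough sketch of that GET-triggered variant with the Job flag (the model fields, URL, and helper functions are assumptions):

```python
# views.py -- hypothetical endpoint that the cron-driven HTTPie GET would hit
from django.http import JsonResponse

from myapp.models import Job                                 # assumed model with a `running` flag
from myapp.crawler import crawl_to_json, load_json_into_db   # assumed helpers


def trigger_crawl(request):
    job, _ = Job.objects.get_or_create(name="crawler")
    if job.running:
        # A previous run hasn't finished yet; don't start a second one.
        return JsonResponse({"status": "already running"}, status=409)

    job.running = True
    job.save()
    try:
        # Runs synchronously inside the request; fine for a sketch, but a long
        # crawl would normally be handed off to a background worker.
        crawl_to_json()
        load_json_into_db()
    finally:
        job.running = False
        job.save()
    return JsonResponse({"status": "done"})
```

The cron job would then call something like `http GET http://localhost:8000/trigger-crawl/` instead of invoking manage.py.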
If you are running your site on AWS or GCP, both have a cron equivalent. For example, you could create a GCP Cloud Scheduler job that fires a GET at your web server to trigger the crawler code.
I have been using Google Cloud for a few weeks now and I am facing a big problem given my limited GCP knowledge.
I have a Python project whose goal is to scrape data from a website using its API. My project runs a few tens of thousands of requests per execution, and it can take a very long time (a few hours, maybe more).
I have 4 Python scripts in my project, and it is all orchestrated by a bash script.
The execution is as follows (a Python sketch of this flow appears after the list):
The first script checks a CSV file containing all the instructions for the requests, executes the requests, and saves all the results in CSV files
The second script checks the previously created CSV files and builds another CSV instruction file
The first script runs again, but with the new instructions, and again saves the results in CSV files
The second script checks again and does the same again ...
... and so on, a few times
The third script cleans the data, deletes duplicates, and creates a single CSV file
The fourth script uploads the final CSV file to bucket storage
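Purely to illustrate that flow (the script names and the number of request/instruction rounds below are made up), the orchestration amounts to something like:

```python
# orchestrate.py -- illustrative Python version of the bash wrapper; script names are made up
import subprocess
import sys

ROUNDS = 3  # assumed number of request/instruction iterations


def run(script):
    # Run one of the existing scripts and abort the pipeline if it fails.
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        sys.exit(f"{script} failed with exit code {result.returncode}")


for _ in range(ROUNDS):
    run("script1_requests.py")      # read instructions CSV, execute requests, save result CSVs
    run("script2_instructions.py")  # build the next instruction CSV from those results

run("script3_clean.py")             # deduplicate and merge everything into a single CSV
run("script4_upload.py")            # upload the final CSV to bucket storage
```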
Now I want to get rid of that bash script, and I would like to automate the execution of those scripts approximately once a week.
The problem here is the execution time. Here is what I have already tested:
Google App Engine: the timeout of a request on GAE is limited to 10 minutes, and my functions can run for a few hours. GAE is not usable here.
Google Compute Engine: my scripts will run at most 10-15 hours a week; keeping a Compute Engine instance up during all that time would be too pricey.
What could I do to automate the execution of my scripts in a cloud environment? What solutions might I not have thought of, without changing my code?
Thank you
A simple way to accomplish this without the need to get rid of the existing bash script that orchestrates everything would be:
Include the bash script in the startup script for the instance.
At the end of the bash script, include a shutdown command.
Schedule the starting of the instance using Cloud Scheduler. You'll have to make an authenticated call to the GCE API to start the existing instance.
With that, your instance will start on a schedule, it will run the startup script (that will be your existing orchestrating script), and it will shut down once it's finished.
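As a hedged sketch of the third step, the authenticated start call could come from a small Cloud Function that Cloud Scheduler invokes on a weekly schedule (the project, zone, and instance names are placeholders):

```python
# Sketch of an HTTP Cloud Function that starts the existing GCE instance.
# Requires google-api-python-client; all names below are placeholders.
import googleapiclient.discovery

PROJECT = "my-project"
ZONE = "europe-west1-b"
INSTANCE = "scraper-instance"


def start_scraper(request):
    compute = googleapiclient.discovery.build("compute", "v1")
    # Authenticates with the function's default service account.
    compute.instances().start(project=PROJECT, zone=ZONE, instance=INSTANCE).execute()
    return "instance start requested"
```

The last line of the bash script would then be something like `sudo shutdown -h now`, so the instance stops itself once the upload is done.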
I'm working on a Python script that connects to the Twitter API to pull some tweets into an array, then pushes them to a MySQL database. It's a pretty basic script, but I'd like to set it up to run weekly.
I'd like to know the best way to deploy it so that it can automatically run weekly, so that I don't have to manually run it every week.
This depends on the platform where you intend to run your Python code. As martin says, this is not a Python question but more of a scheduling question.
You can create a batch file that activates Python and runs your script, then use Task Scheduler to run that batch file weekly.
I am hoping to gain a basic understanding of scheduled task processes and why things like Celery are recommended for Flask.
My situation is a web-based tool which generates spreadsheets based on user input. I save those spreadsheets to a temp directory, and when the user clicks the "download" button, I use Flask's "send_from_directory" function to serve the file as an attachment. I need a background service to run every 15 minutes or so to clear the temp directory of all files older than 15 minutes.
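For context, the download part described above is roughly the following (the paths and names are illustrative, not the actual app):

```python
# Illustrative version of the spreadsheet download endpoint; names are made up.
import os
from flask import Flask, send_from_directory

app = Flask(__name__)
TEMP_DIR = os.path.join(app.root_path, "temp")  # where generated spreadsheets are written


@app.route("/download/<filename>")
def download(filename):
    # Serve a previously generated spreadsheet as an attachment.
    return send_from_directory(TEMP_DIR, filename, as_attachment=True)
```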
My initial plan was a basic Python script running in a while(True) loop, but I did some research into what people normally do, and everything recommends Celery or other task managers. I looked into Celery and found that I'd also need to learn about Redis, and apparently host Redis in a Unix environment. This is a lot of trouble for a script that just deletes files every 15 minutes.
I'm developing my Flask app locally in Windows with the built-in development server and deploying to a virtual machine on company intranet with IIS. I'm learning as I go, so please explain why this much machinery is needed to regularly call a script that simply deletes things. It seems like a vast overcomplication, but as I said, I'm trying to learn as I go so I want to do/learn it correctly.
Thanks!
You wouldn't use Celery or Redis for this. A cron job would be perfectly appropriate.
Celery is for jobs that need to be run asynchronously but in response to events in the main server processes. For example, if a sign up form requires sending an email notification, that would be scheduled and run via Celery so as not to block the main web response.
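For the temp-directory case, a minimal cron-driven cleanup sketch might look like this (the directory path is a placeholder):

```python
# cleanup_temp.py -- delete generated files older than 15 minutes; the path is a placeholder.
import os
import time

TEMP_DIR = "/path/to/app/temp"
MAX_AGE_SECONDS = 15 * 60

now = time.time()
for name in os.listdir(TEMP_DIR):
    path = os.path.join(TEMP_DIR, name)
    # Only remove regular files whose last modification is older than the cutoff.
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)
```

On the Windows/IIS deployment, Task Scheduler plays the role of cron and can run this script every 15 minutes.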
I'm looking for some advice on running intensive jobs on demand somewhere like AWS or Digital Ocean.
Here's my scenario:
I have a template/configuration of a VM with dependencies (ImageMagick, Ruby, Python, etc.)
I have a codebase that runs a job, e.g. querying a DB and running reports, then emailing those reports to my user base
I want to be able to trigger this job externally (e.g. via some web app somewhere else, or from a command line, maybe a cron job on another cloud instance)
When I run this job, it needs to spin up a copy of this template on AWS or DO and run the job, which could run for any length of time, until all reports are generated and sent out
Once the job has finished, shut down the instance so I'm not paying for something to always be running in the background
I'd like to not have to commit to one service (e.g. AWS) but rather have a template that can be dropped in anywhere to test out the differences between cloud providers
Initially, I was thinking of rubber, but this seems more like something you'd use for CI rather than for spinning up an instance, running a long-running job, and shutting the instance down once finished.
Does anything already exist for this, or would I need to build something myself hooking into the relevant APIs?
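As an illustration of the spin-up/run/shut-down pattern described above, an AWS-specific sketch with boto3 (all IDs, names, and paths are placeholders) could look like this:

```python
# Illustrative only: launch a job instance from a prepared image and let it stop itself.
# All IDs, names, and paths below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
cd /opt/reports && ./run_reports.sh   # the long-running report job baked into the image
shutdown -h now                       # power off once the job has finished
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # the VM template with ImageMagick, Ruby, Python, etc.
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    # Terminate (rather than just stop) when the instance shuts itself down,
    # so nothing keeps accruing charges in the background.
    InstanceInitiatedShutdownBehavior="terminate",
)
```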
I need to run a Python script (which is listening to Twitter) that will call various methods on my Django app when it gets tweets matching a particular hashtag.
At the moment, I just run the script by hand on the command line, but I'd like it to run inside Django if possible, so that I can control it from there and it doesn't have to perform HTTP POSTs when it gets new data.
I've looked at Celery (briefly), but to me it seems to be for performing certain small tasks at regular intervals.
Is there a way to use Celery (or anything else) to control this long-running "listen to Twitter" script that I've got?
You should use Supervisord to run your Django application and your script. Making the script part of the Django project will let you use Django signals: you can write a custom signal that is emitted every time your Twitter logic finishes doing what it is supposed to do. Signals are blocking; if you want them to be asynchronous, use Celery with Django.
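A rough sketch of that custom-signal idea (the signal, receiver, and variable names are made up):

```python
# A hypothetical custom signal emitted by the Twitter listener; names are made up.
import django.dispatch
from django.dispatch import receiver

# Defined once in the Django project, e.g. in myapp/signals.py.
tweet_matched = django.dispatch.Signal()


@receiver(tweet_matched)
def handle_tweet(sender, tweet, **kwargs):
    # Runs synchronously (blocking) every time the listener sends the signal.
    print("matched tweet:", tweet)


# In the long-running listener script (after django.setup()), whenever a
# matching tweet arrives:
# tweet_matched.send(sender=None, tweet=tweet_text)
```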
An alternative would be to run your Django application and the Twitter script via Supervisord, and then expose a REST API so the script makes an HTTP POST to the Django application. You can use Tastypie for that.