Parallel Processing in Django - python

I get a file from the user. Once the file has been uploaded and saved, it has to be analysed.
Since it is a huge file and the analysis takes at least an hour (say), I have a field in the model recording the status of the analysis as Analysing or Analysis Done.
The analysis script is a separate Python file and the analysis has to be done there.
How do I go about doing this? I want this script to run in the background, and I also have to deploy on an Apache server.
How should I proceed?
Should I use threads? How do I go about running external Python scripts in threads?
I came to know about cron tabs, but I don't know how to apply them in this situation.
I can't use Celery, since Celery no longer supports Windows.
I came to know about Django management commands, but since I deploy using an Apache server, I don't know whether I can do that.

I can think of a few ways to solve this problem.
If you can batch the processing of the files, you can run a cron job that invokes a Django management command or a script at regular intervals to process the files.
If you can't batch the processing, look at other queuing systems such as django-rq, or build a simple queuing system using an event dispatch library.
If you really want Celery, you can run your whole project inside a Docker (Linux) container so that the Windows limitation of Celery 4 no longer applies.
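As a rough sketch of the cron-plus-management-command option, assuming the upload model has the status field described in the question (the model, function and file names here are illustrative, not taken from the actual project):

    # myapp/management/commands/process_uploads.py
    # Hypothetical names: UploadedFile and analyse_file stand in for your model
    # and your existing analysis script.
    from django.core.management.base import BaseCommand

    from myapp.analysis import analyse_file
    from myapp.models import UploadedFile


    class Command(BaseCommand):
        help = "Analyse uploaded files that have not been processed yet"

        def handle(self, *args, **options):
            for upload in UploadedFile.objects.filter(status="PENDING"):
                upload.status = "ANALYSING"
                upload.save(update_fields=["status"])
                analyse_file(upload.file.path)
                upload.status = "ANALYSIS_DONE"
                upload.save(update_fields=["status"])

A cron entry (or a Windows Task Scheduler job, given the Windows constraint) would then run python manage.py process_uploads every few minutes; the command runs outside Apache, so the web process is never blocked.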

Related

How can I schedule a Python script in the cloud?

I am developing a Python script that downloads some Excel files from a web service. These two files are combined with another one stored locally on my computer to produce the final file. This final file is loaded into a database and a Power BI dashboard to visualize the data.
My question is: how can I schedule this to run daily if my computer is turned off? As I said, two files are web scraped (so no problem scheduling those), but one file is stored locally.
One solution that comes to mind: store the local file in Google Drive/OneDrive and download it with the API so my script does not depend on my computer. But in that case, how can I schedule it? What service would you use? Heroku, ...?
I am not entirely sure about your context, but I think you could look into using AWS Lambda for this. It is reasonably easy to set up and to create a schedule for running code.
It is even easier to achieve this using the Serverless Framework; this link shows an example built with Python that runs on a schedule.
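If it helps, a Lambda entry point is just a Python function. A minimal sketch (names are illustrative); the daily trigger is configured outside the code, e.g. via an EventBridge rule or a schedule event in serverless.yml:

    # lambda_function.py - minimal sketch of a scheduled Lambda handler
    import logging

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)


    def lambda_handler(event, context):
        logger.info("scheduled run started")
        # 1. download the two Excel files from the web service
        # 2. fetch the third file from cloud storage (Drive/OneDrive/S3)
        #    instead of the local machine
        # 3. combine them and load the result into the database
        return {"status": "ok"}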
I am running the schedule package for exactly something like that.
It's easy to set up and works very well.
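For completeness, a minimal sketch of the schedule package (pip install schedule); note it only fires while the process stays running, so it needs an always-on machine or server:

    # daily_export.py - sketch using the third-party "schedule" package
    import time

    import schedule


    def daily_job():
        # download the Excel files, merge with the cloud-stored one,
        # load the result into the database / Power BI source
        print("running daily export")


    schedule.every().day.at("06:00").do(daily_job)

    while True:
        schedule.run_pending()
        time.sleep(60)  # check once a minute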

How to run Python code periodically that connects to the internet

I'm working on a Python script that connects to the Twitter API to pull some tweets into an array, then pushes them to a MySQL database. It's a pretty basic script, but I'd like to set it up to run weekly.
I'd like to know the best way to deploy it so that it runs automatically every week, without me having to run it manually.
This depends on the platform where you intend to run your Python code. As martin says, this is not really a Python question but a scheduling question.
You can create a batch file that activates Python and runs your script, and then use Task Scheduler to run that batch file weekly.

Scheduled job from Flask application

I am hoping to gain a basic understanding of scheduled task processes and why things like Celery are recommended for Flask.
My situation is a web-based tool which generates spreadsheets based on user input. I save those spreadsheets to a temp directory, and when the user clicks the "download" button, I use Flask's "send_from_directory" function to serve the file as an attachment. I need a background service to run every 15 minutes or so to clear the temp directory of all files older than 15 minutes.
My initial plan was a basic Python script running in a while True loop, but I did some research to find what people normally do, and everything recommends Celery or other task managers. I looked into Celery and found that I would also need to learn about Redis, and apparently need to host Redis in a Unix environment. This is a lot of trouble for a script that just deletes files every 15 minutes.
I'm developing my Flask app locally on Windows with the built-in development server and deploying to a virtual machine on the company intranet with IIS. I'm learning as I go, so please explain why this much machinery is needed to regularly call a script that simply deletes things. It seems like a vast overcomplication, but as I said, I'm trying to learn as I go, so I want to do/learn it correctly.
Thanks!
You wouldn't use Celery or redis for this. A cron job would be perfectly appropriate.
Celery is for jobs that need to be run asynchronously but in response to events in the main server processes. For example, if a sign up form requires sending an email notification, that would be scheduled and run via Celery so as not to block the main web response.
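For this particular case, the whole job can be a short script run every 15 minutes by cron (or by the Windows Task Scheduler on the IIS machine). A sketch, with the directory path as a placeholder:

    # cleanup_temp.py - delete spreadsheets older than 15 minutes
    import os
    import time

    TEMP_DIR = "/path/to/temp"   # the directory the spreadsheets are saved to
    MAX_AGE = 15 * 60            # 15 minutes, in seconds

    now = time.time()
    for name in os.listdir(TEMP_DIR):
        path = os.path.join(TEMP_DIR, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)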

What is a robust way to execute long-running tasks/batches under Django?

I have a Django app that is intended to be run on Virtualbox VMs on LANs. The basic user will be a savvy IT end-user, not a sysadmin.
Part of that app's job is to connect to external databases on the LAN, run some python batches against those databases and save the results in its local db. The user can then explore the systems using Django pages.
Run time for the batches isn't all that long: minutes, potentially tens of minutes, not seconds. Runs are infrequent; you could go days without needing a refresh.
This is not celery's normal use case of long tasks which will eventually push the results back into the web UI via ajax and/or polling. It is more similar to a dev's occasional use of the django-admin commands, but this time intended for an end user.
The user should be able to initiate a run of one or several of those batches when they want in order to refresh the calculations of a given external database (the target db is a parameter to the batch).
Until the batches are done for a given db, the app really isn't useable. You can access its pages, but many functions won't be available.
It is very important, from a support point of view, that the batches remain easily runnable at all times. Dropping to the VM's shell over SSH would probably require frequent handholding, which wouldn't be good; it is best if they can be launched from the Django web pages.
What I currently have:
Each batch is in its own script.
I can run it on the command line (via if __name__ == "__main__":).
The batches are also hooked up as celery tasks and work fine that way.
Given the way I have written them, it would be relatively easy for me to allow running them from subprocess calls in Python. I haven't really looked into it, but I suppose I could make them into django-admin commands as well.
The batches already have their own rudimentary status checks. For example, they can look at the calculated data and tell whether they have been run and display that in Django pages without needing to look at celery task status backends.
The batches themselves are relatively robust and I can make them more so. This is about their launch mechanism.
What's not so great:
In my Mac dev environment I find the celery/celerycam/rabbitmq stack to be somewhat unstable. It seems as if sometimes the rabbitmq daemon balloons in CPU/RAM use and then needs to be terminated. That mightily confuses the celery processes and I find I have to kill -9 various tasks and relaunch them manually. Sometimes celery still works but celerycam doesn't, so there are no task updates. Some of these issues may be OS X specific or may be due to the DEBUG flag being switched on for now, which celery warns about.
So then I need to run the batches on the command line, which is what I was trying to avoid, until the whole celery stack has been reset.
This might be acceptable on a normal website, with an admin watching over it. But I can't have that happen on a remote VM to which only the user has access.
Given that these are somewhat fire-and-forget batches, I am wondering if celery isn't overkill at this point.
Some options I have thought about:
Writing a cleanup shell/Python script to restart rabbitmq/celery/celerycam and generally make them more robust, i.e. whatever is required to make Celery and company more stable. I've already used psutil to check whether the rabbit/celery processes are running and to display their status in Django.
Running the batches via subprocess instead and avoiding Celery. What about django-admin commands here? Does that make a difference? They would still need to be launched from the web pages.
An alternative task/process manager to Celery with less capability but also fewer moving parts?
Not using subprocess but relying on Python's multiprocessing module? To be honest, I have no idea how that compares to launching via subprocess.
Environment:
nginx, WSGI, Ubuntu on VirtualBox, Chef to build the VMs.
I'm not sure how your celery configuration makes it unstable, but it sounds like Celery is still the best fit for your problem. I'm using Redis as the queue system and, in my own experience, it works better than RabbitMQ. Maybe you can try it and see if it improves things.
Otherwise, just use cron as the driver for periodic tasks. Let it run your script periodically and update the database; your UI component can poll the database with no conflict.
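If the runs must stay user-initiated from the web pages rather than periodic, the subprocess option from the question can be as small as a fire-and-forget launch of the existing command-line batch. A rough sketch (paths and names are illustrative):

    # views.py - launch an existing batch script without Celery
    import subprocess
    import sys

    from django.http import JsonResponse
    from django.views.decorators.http import require_POST


    @require_POST
    def launch_batch(request, target_db):
        # Popen returns immediately, so the web request is not blocked;
        # the batch's own status checks tell the UI when it has finished.
        subprocess.Popen(
            [sys.executable, "/opt/app/batches/refresh.py", "--db", target_db],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return JsonResponse({"started": True, "db": target_db})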

How to run a system command from a Django web application?

Does anyone know of a proven and simple way of running a system command from a Django application?
Maybe using celery? ...
From my research, it's a problematic task, since it involves permissions and insecure approaches to the problem. Am I right?
EDIT: Use case: delete some files on a remote machine.
Thanks
Here is one approach: in your Django web application, write a message to a queue (e.g., RabbitMQ) containing the information that you need. In a separate system, read the message from the queue and perform any file actions. You can indeed use Celery for setting up this system.
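A minimal sketch of that split using Celery (the task and host names are illustrative; the actual remote deletion could go over SSH or any transport you already trust):

    # tasks.py - the web process only enqueues; a separate worker runs the command
    import subprocess

    from celery import shared_task


    @shared_task
    def delete_remote_files(host, paths):
        # runs on the worker, which is the only place that needs SSH access
        subprocess.run(["ssh", host, "rm", "-f", *paths], check=True)

The Django view would then call delete_remote_files.delay("files.example.com", ["/tmp/old.csv"]) and return immediately, so permissions and credentials never live in the web process.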
