Simple approach to launching background task in Django

Simple approach to launching background task in Django - python

I have a Django website, and one page has a button (or link) that when clicked will launch a somewhat long running task. Obviously I want to launch this task as a background task and immediately return a result to the user. I want to implement this using a simple approach that will not require me to install and learn a whole new messaging architecture like Celery for example. I do not want to use Celery! I just want to use a simple approach that I can set up and get running over the next half hour or so. Isn't there a simple way to do this in Django without having to add (yet another) 3rd party package?

Just use a thread.
import threading
t = threading.Thread(target=long_process,
args=args,
kwargs=kwargs)
t.setDaemon(True)
t.start()
return HttpResponse()
See this question for more details:
Can Django do multi-thread works?

Have a look at django-background-tasks - it does exactly what you need and doesn't need any additional services to be running like RabbitMQ or Redis. It manages a task queue in the database and has a Django management command which you can run once or as a cron job.

If you're willing to install a 3rd party library, but you want something a whole lot simpler than Celery, check out Redis Queue. It does require Redis, which is pretty easy in itself, but that can provide a lot of other benefits as well.
RQ itself has almost zero configuration. It's startlingly simple.
References:
http://python-rq.org/
http://nvie.com/posts/introducing-rq/
https://devcenter.heroku.com/articles/python-rq (RQ on Heroku)

Related

Does django-celery-beat deal with multiple instances of django processes created by web-servers

I have a fairly complex periodic-tasks that needs to be offloaded from django context. django-celery-beat looks promising. While I was going through celery-beat docs, I found this:
You have to ensure only a single scheduler is running for a schedule at a time, otherwise you’d end up with duplicate tasks. Using a centralized approach means the schedule doesn’t have to be synchronized, and the service can operate without using locks.
A typical production deployment will spawn a pool of worker-processes each running a django instance. Will that result in creation of multiple scheduler processes as well? Do I need to have some synchronisation logic?
Thanks for your time!

It does not.
You can dig into the issues page on their github repo for confirmation. I think it's weird that the documentation doesn't call this out, but I suppose you have to assume that's how all celery beats work unless they specify otherwise.
In theory, you could build your own synchronization, but it will probably be a better experience to use a different scheduler that has that functionality built in, like Heroku's redbeat: https://blog.heroku.com/redbeat-celery-beat-scheduler.

Daemon background tasks on flask (uwsgi) application

Edit for clarify my question:
I want to attach a python service on uwsgi using this feature (I can't understand the examples) and I also want to be able to communicate results between them. Below I present some context and also present my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I pretend to use it in a near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the python application, that uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in questions takes too long. Actually, the creation takes an acceptable amount of time if done once, but, since I'm expecting (maybe) hundreds to thousands of requests simultaneously having to load every object again would be something I want to avoid.
One of the objects is a scikit classifier model, persisted on a pickle file, which takes 3 seconds to load. Each user can create several "job spots" each one will take over 2k documents to be classified, each document will be uploaded on an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production ready solution, I trying to use this uWSGI attach daemon solution

I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N celery workers, which of which can load (or maybe lazy-load) the expensive model as a global. A request comes in to the app, launch a task (e.g., task = predict.delay(args), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little bit of time learning celery for saving having to write a bunch of coordination code.

Google App Engine - run task on publish

I have been looking for a solution for my app that does not seem to be directly discussed anywhere. My goal is to publish an app and have it reach out, automatically, to a server I am working with. This just needs to be a simple Post. I have everything working fine, and am currently solving this problem with a cron job, but it is not quite sufficient - I would like the job to execute automatically once the app has been published, not after a minute (or whichever the specified time it may be set to).
In concept I am trying to have my app register itself with my server and to do this I'd like for it to run once on publish and never be ran again.
Is there a solution to this problem? I have looked at Task Queues and am unsure if it is what I am looking for.
Any help will be greatly appreciated.
Thank you.

Personally, this makes more sense to me as a responsibility of your deploy process, rather than of the app itself. If you have your own deploy script, add the post request there (after a successful deploy). If you use google's command line tools, you could wrap that in a script. If you use a 3rd party tool for something like continuous integration, they probably have deploy hooks you could use for this purpose.

The main question will be how to ensure it only runs once for a particular version.
Here is an outline on how you might approach it.
You create a HasRun module, which you use store each the version of the deployed app and this indicates if the one time code has been run.
Then make sure you increment your version, when ever you deploy your new code.
In you warmup handler or appengine_config.py grab the version deployed,
then in a transaction try and fetch the new HasRun entity by Key (version number).
If you get the Entity then don't run the one time code.
If you can not find it then create it and run the one time code, either in a task (make sure the process is idempotent, as tasks can be retried) or in the warmup/front facing request.
Now you will probably want to wrap all of that in memcache CAS operation to provide a lock or some sort. To prevent some other instance trying to do the same thing.
Alternately if you want to use the task queue, consider naming the task the version number, you can only submit a task with a particular name once.
It still needs to be idempotent (again could be scheduled to retry) but there will only ever be one task scheduled for that version - at least for a few weeks.
Or a combination/variation of all of the above.

python web server and periodic tasks

I am using CherryPy to receive requests through REST API. Apart from handling requests the application should also do some resource management every few seconds. What is the easiest way to do this?
1) run a separate thread
2) cherrypy.process.plugins.PerpetualTimer (not sure how to use it, and it looks like it is heavy on resources?)
3) some other way?
The solution with a separate thread is fine by me, but I was wondering if there is a nicer way to do it?
Note that CherryPy is not a requirement - I have decided to use it primarily because the project looks alive and because it supports multiple simultaneous connections (in other words: I am open to alternatives).

PerpetualTimer is just a repeating version of threading._Timer.
What you really want to use is cherrypy.process.plugins.Monitor, which is little more than a way to run a separate thread for you. You should use it because it plugs into cherrypy.engine, which governs start and stop behavior for CherryPy servers. If you run your own thread, you're going to want to have it stop when CP shuts down anyway; the Monitor class already knows how to do that. It uses PerpetualTimer under the hood, until recent versions, where it was replaced by the BackgroundTask class.
my_task_runner = Monitor(cherrypy.engine, my_task, frequency=3)
my_task_runner.subscribe()

python long running daemon job processor

I want to write a long running process (linux daemon) that serves two purposes:
responds to REST web requests
executes jobs which can be scheduled
I originally had it working as a simple program that would run through runs and do the updates which I then cron’d, but now I have the added REST requirement, and would also like to change the frequency of some jobs, but not others (let’s say all jobs have different frequencies).
I have 0 experience writing long running processes, especially ones that do things on their own, rather than responding to requests.
My basic plan is to run the REST part in a separate thread/process, and figured I’d run the jobs part separately.
I’m wondering if there exists any patterns, specifically python, (I’ve looked and haven’t really found any examples of what I want to do) or if anyone has any suggestions on where to begin with transitioning my project to meet these new requirements.
I’ve seen a few projects that touch on scheduling, but I’m really looking for real world user experience / suggestions here. What works / doesn’t work for you?

If the REST server and the scheduled jobs have nothing in common, do two separate implementations, the REST server and the jobs stuff, and run them as separate processes.
As mentioned previously, look into existing schedulers for the jobs stuff. I don't know if Twisted would be an alternative, but you might want to check this platform.
If, OTOH, the REST interface invokes the same functionality as the scheduled jobs do, you should try to look at them as two interfaces to the same functionality, e.g. like this:
Write the actual jobs as programs the REST server can fork and run.
Have a separate scheduler that handles the timing of the jobs.
If a job is due to run, let the scheduler issue a corresponding REST request to the local server.
This way the scheduler only handles job descriptions, but has no own knowledge how they are implemented.
It's a common trait for long-running, high-availability processes to have an additional "supervisor" process that just checks the necessary demons are up and running, and restarts them as necessary.

One option is to simply choose a lightweight WSGI server from this list:
http://wsgi.org/wsgi/Servers
and let it do the work of a long-running process that serves requests. (I would recommend Spawning.) Your code can concentrate on the REST API and handling requests through the well defined WSGI interface, and scheduling jobs.
There are at least a couple of scheduling libraries you could use, but I don't know much about them:
http://sourceforge.net/projects/pycron/
http://code.google.com/p/scheduler-py/

Here's what we did.
Wrote a simple, pure-wsgi web application to respond to REST requests.
Start jobs
Report status of jobs
Extended the built-in wsgiref server to use the select module to check for incoming requests.
Activity on the socket is ordinary REST request, we let the wsgiref handle this.
It will -- eventually -- call our WSGI applications to respond to status and
submit requests.
Timeout means that we have to do two things:
Check all children that are running to see if they're done. Update their status, etc.
Check a crontab-like schedule to see if there's any scheduled work to do. This is a SQLite database that this server maintains.

I usually use cron for scheduling. As for REST you can use one of the many, many web frameworks out there. But just running SimpleHTTPServer should be enough.
You can schedule the REST service startup with cron #reboot
#reboot (cd /path/to/my/app && nohup python myserver.py&)

The usual design pattern for a scheduler would be:
Maintain a list of scheduled jobs, sorted by next-run-time (as Date-Time value);
When woken up, compare the first job in the list with the current time. If it's due or overdue, remove it from the list and run it. Continue working your way through the list this way until the first job is not due yet, then go to sleep for (next_job_due_date - current_time);
When a job finishes running, re-schedule it if appropriate;
After adding a job to the schedule, wake up the scheduler process.
Tweak as appropriate for your situation (eg. sometimes you might want to re-schedule jobs to run again at the point that they start running rather than finish).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.