Asynchronous background processes with web2py - python

I need to handle a large (time- and memory-consuming) process asynchronously in a web2py application, called from inside a controller method.
My specific use case is to call a process via stdlib.subprocess and wait for it to exit without blocking the web server, but I am open to alternative methods.
Hands-on examples would be a plus.
3rd party library recommendations are welcome.
CRON scheduling is not required/wanted.

Assuming you'll need to start multiple, possibly simultaneous, instances of the background task, the solution is a task queue. I've heard good things about Celery and RabbitMQ if you're looking for 3rd-party options, and web2py includes its own task queue system that might be sufficient for your needs.
With either tool, you'll define a function that encapsulates the operation you want the background process to perform. Then bring the task queue workers online. The web2py manual and forums indicate this can be done with an #reboot statement in the web2py cron system, which is triggered whenever the web server starts. There are probably other ways to start the workers if this is unsatisfactory.
In your controller you'll insert a task into the task queue, passing any necessary parameters as inputs to the function (the background function will not run in the same environment as the controller, so it won't have access to the session, DB, etc. unless you explicitly pass the appropriate values into the task function).
Now, to get the output of the background operation to the user. When you insert a task into the task queue, you should get back a unique ID for the task. You would then implement controller logic (either something that expects an AJAX call, or a page that keeps refreshing until the task completes) that calls the task queue's API to check the status of the specified task. If the task's status is "finished", return the data to the user. If not, keep waiting.
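As a rough illustration of that insert-then-poll pattern, here is a minimal sketch using web2py's built-in scheduler; the task function, field values, and controller actions are assumptions based on the standard scheduler setup, so adjust them to your application. The workers themselves are started separately (e.g. python web2py.py -K yourapp).
# in a model file, e.g. models/scheduler.py
from gluon.scheduler import Scheduler
def long_running_task(arg1):
    # heavy lifting goes here; use only values passed in explicitly
    return do_the_work(arg1)  # hypothetical helper
scheduler = Scheduler(db)
# in a controller
def start():
    row = scheduler.queue_task(long_running_task, pvars=dict(arg1=request.vars.arg1))
    return dict(task_id=row.id)
def check():
    task = db.scheduler_task(request.vars.task_id)
    if task and task.status == 'COMPLETED':
        run = db(db.scheduler_run.task_id == task.id).select().last()
        return dict(done=True, result=run.run_result)
    return dict(done=False)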

Maybe review the book section on running tasks in the background. You can use the new scheduler or create a homemade queue (see the email queue example there). There's also a web2py-celery plugin, though I'm not sure what state that is in.

This is more difficult than one might expect. Note the deadlock warnings in the stdlib subprocess documentation. It's easy if you don't mind blocking: use Popen.communicate. To avoid blocking the request, you can manage the process with subprocess from a separate thread.
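As a minimal sketch of that thread-plus-subprocess approach (the save_result helper is hypothetical, standing in for however you persist the output):
import subprocess
import threading
def run_and_store(cmd, task_id):
    # communicate() reads stdout/stderr to EOF, avoiding the pipe deadlocks
    # the subprocess docs warn about
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    save_result(task_id, proc.returncode, out, err)  # hypothetical: write to db or file
def launch(cmd, task_id):
    # the worker thread blocks on communicate(); the web request does not
    t = threading.Thread(target=run_and_store, args=(cmd, task_id))
    t.start()
    return t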
My favorite way to deal with subprocesses is to use Twisted's spawnProcess. But, it is not easy to get Twisted to play nicely with other frameworks.

Related

Is celery the appropriate tech for running long-running processes I simply need to start/stop?

My python program isn't the sort of thing you'd create an init script for. It's simply a long-running process which needs to run until I tell it to shut down.
I run multiple instances of the program, each with different cmd-line args. Just FYI, the program acts like a Physics Tutor who chats with my users, and each instance represents a different Physics problem.
My Django app communicates with these processes using Redis pub/sub
I'd like to improve how I start/stop and manage these processes from Django views. What I don't know is if Celery is the proper technology to do this for me. A lot of the celery docs make it sound like it's for running short-lived asynchronous tasks, such as their 'add()' example task.
Currently my views are doing some awful 'spawn' stuff to start the processes, and I'm keeping track of which processes are running in a completely ad-hoc way utilizing a Redis hash.
My program actually only daemonizes if I pass it a -d argument, which I suppose I wouldn't pass when using Celery, although it does output to stdout/stderr if I don't pass that option.
All I really need is:
A way to start/stop my processes
information on whether start/stop operation succeeded
information on which of my processes are running
What I don't want is:
multiple instances of a process with the same configuration running
need to replace the way I communicate with Django (Redis pub/sub)
Does celery sound like the proper tech for me to use for my process management?
Maybe you can utilize supervisor for this. It is good at running and monitoring long-running processes and has an XML-RPC interface.
You can view an example of what I did here (example output here).
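As a rough sketch of driving supervisor from Django over its XML-RPC interface; the port and the program name 'physics_tutor' are assumptions you would match to your supervisord.conf:
import xmlrpclib  # Python 2; use xmlrpc.client on Python 3
# assumes supervisord.conf enables [inet_http_server] on port 9001
server = xmlrpclib.ServerProxy('http://localhost:9001/RPC2')
def start_tutor():
    return server.supervisor.startProcess('physics_tutor')
def stop_tutor():
    return server.supervisor.stopProcess('physics_tutor')
def tutor_info():
    # returns a dict including 'statename' such as RUNNING or STOPPED
    return server.supervisor.getProcessInfo('physics_tutor')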

How to fork a process in python/django?

This is more of a general Python question, though asked in the context of Django.
For now I have a view in Django which has to process a lot of data. It usually takes the server (nginx proxying to Django) a couple of minutes to do it, and sometimes the server times out. I don't want to increase the timeout in nginx. I realize that if I can fork a process in Python from the Django view, so that the forked (child) process does all the data crunching independently of the view, then the view can return a response to the user immediately (and therefore never time out) while the child process continues running in the background, finishing up the calculation.
So here is the question:
How can I fork an independent process in python (and if possible for the python code to be in the same file)? And if possible how can I assign a unix process priority level to it?
I looked at some of the ways of forking a process in python and it seems there are a few options. Which one is the most appropriate for this scenario?
Thank you.
The 'best practice' answer is to use a queue manager, typically RabbitMQ or any other backend handled by django-celery.
Still, there are a few lighter options that do spawn a new thread. What these options usually lack is some way to track progress, or to keep the number of threads under control.
Check Django-utils to see if it's enough; if not, go for Celery.
If you really want to fork and set priority, you can use os.fork and os.nice, but I think the multiprocessing module or Celery would be more applicable here.
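If you do go the fork route anyway, a bare sketch might look like the following; crunch_data is an illustrative stand-in for your data processing, and the double fork is there so the finished child is reaped by init rather than lingering as a zombie of the web server process:
import os
def fork_and_crunch(data):
    pid = os.fork()
    if pid == 0:
        # first child: fork again and exit immediately, so the grandchild is
        # reparented to init and never becomes a zombie of the web process
        if os.fork() == 0:
            os.nice(10)  # higher niceness = lower CPU priority
            try:
                crunch_data(data)  # hypothetical long-running function
            finally:
                os._exit(0)  # skip Django/WSGI cleanup in the child
        os._exit(0)
    os.waitpid(pid, 0)  # reap the short-lived first child
    return pid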

How to create a thread-safe singleton in python

I would like to hold running threads in my Django application. Since I cannot do so in the model or in the session, I thought of holding them in a singleton. I've been checking this out for a while and haven't really found a good how-to for this.
Does anyone know how to create a thread-safe singleton in python?
EDIT:
More specifically, what I want to do is implement some kind of "anytime algorithm": when a user presses a button, a response is returned and a new computation begins (a new thread). I want this thread to run until the user presses the button again, at which point my app will return the best solution it managed to find. To do that, I need to save the thread object somewhere - I thought of storing it in the session, which apparently I cannot do.
The bottom line is: I have a FAT computation I want to do on the server side, in different threads, while the user is using my site.
Unless you have a very good reason, you should execute the long-running threads in a different process altogether, and use Celery to execute them:
Celery is an open source asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. The execution units, called tasks, are executed concurrently on one or more worker nodes using multiprocessing, Eventlet or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
Celery guide for djangonauts: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html
For singletons and sharing data between tasks/threads, again, unless you have a good reason, you should use the db layer (aka, models) with some caution regarding db locks and refreshing stale data.
Update: regarding your use case, define a Computation model with a status field. When a user starts a computation, an instance is created and a task starts to run. The task monitors the status field (checking the db once in a while). When the user clicks the button again, a view changes the status to 'stop requested', causing the task to terminate.
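A minimal sketch of that pattern, using the old django-celery style task decorator; the model fields, status strings, and improve_solution step are all illustrative:
# models.py
from django.db import models
class Computation(models.Model):
    STATUSES = (('running', 'running'), ('stop_requested', 'stop requested'), ('done', 'done'))
    status = models.CharField(max_length=20, choices=STATUSES, default='running')
    best_solution = models.TextField(blank=True)
# tasks.py
from celery.task import task
@task
def run_computation(computation_id):
    while True:
        comp = Computation.objects.get(pk=computation_id)  # re-read so we see status changes
        if comp.status == 'stop_requested':
            break
        comp.best_solution = improve_solution(comp.best_solution)  # hypothetical anytime step
        comp.save()
    comp.status = 'done'
    comp.save()
The view handling the second button press would simply set status to 'stop_requested', save, and return best_solution once the task marks itself 'done'.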
If you want asynchronous code in a web application, then holding threads in the web process is the wrong approach. You should run background tasks with a specialist task queue like Celery: http://celeryproject.org/
The biggest problem you have is web server architecture. Unless you go against the recommended Django web server configuration and use a worker thread MPM, you will have no way to track your thread handles between requests as each request typically occupies its own process. This is how Apache normally works: http://httpd.apache.org/docs/2.0/mod/prefork.html
EDIT:
In light of your edit I think you might learn more by creating a custom solution that does this:
Maintain start/stop state in the database
Create a new program that runs as a daemon
Periodically check the start/stop state and begin or end work from there
There's no need for multithreading here unless you need to create a new process for each user. If so, things get more complicated and using Celery will make your life much easier.
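A rough sketch of that custom daemon; the Job model, its status strings, and the start_work/stop_work helpers are assumptions:
# a separate long-running process (e.g. launched from a management command), not a Django view
import time
from myapp.models import Job  # hypothetical model with a 'status' CharField
def main():
    while True:
        for job in Job.objects.filter(status='start_requested'):
            job.status = 'running'
            job.save()
            start_work(job)  # hypothetical: launch the work
        for job in Job.objects.filter(status='stop_requested'):
            stop_work(job)   # hypothetical: tell the work to stop
            job.status = 'stopped'
            job.save()
        time.sleep(5)  # periodically check the start/stop state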

Google App Engine - Use Task Queues or Deferred Jobs

Google App Engine has two methods for running jobs at some later point: Task Queues and deferred jobs.
They support all the same features as far as I can tell (e.g. a deferred job can be placed on a particular task queue so you can throttle execution) - but the deferred jobs look much easier to implement and more flexible.
Anyone know the pros and cons of each method? Any circumstances where you would want to use Task Queues over deferred jobs?
I'm not sure if you noticed it, but the documentation for deferred has this section at the end:
You may be wondering when to use ext.deferred, and when to stick with the built-in task queue API. Here are our suggestions.
You may want to use the deferred library if:
You only use the task queue lightly.
You want to refactor existing code to run on the Task Queue with a minimum of changes.
You're writing a one off maintenance task, such as schema migration.
Your app has many different types of background task, and writing a separate handler for each would be burdensome.
Your task requires complex arguments that aren't easily serialized without using Pickle.
You are writing a library for other apps that needs to do background work.
You may want to use the Task Queue API if:
You need complete control over how tasks are queued and executed.
You need better queue management or monitoring than deferred provides.
You have high throughput, and overhead is important.
You are building larger abstractions and need direct control over tasks.
You like the webhook model better than the RPC model.
Naturally, you can use both the Task Queue API and the deferred library side-by-side, if your app has requirements that fit into both groups.
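For comparison, a minimal sketch of the two styles; the worker URL, queue name, and do_something_expensive function are placeholders:
# deferred library: queue an ordinary function call
from google.appengine.ext import deferred
def do_something_expensive(key, amount):
    pass  # the actual work
deferred.defer(do_something_expensive, 'some-key', 42, _queue='background')
# Task Queue API: enqueue a task that POSTs to a webhook handler you write yourself
from google.appengine.api import taskqueue
taskqueue.add(url='/worker/do-expensive',
              params={'key': 'some-key', 'amount': 42},
              queue_name='background')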

Django - Python: Spawn a process and return

I've been searching for an answer to this for a while; it's possible that I haven't been searching for the right information, though.
I'm trying to send data to a server, and once received, the server executes a Python script based on that data. I have been trying to spawn a thread and return, but I can't figure out how to "detach" the thread. I simply have to wait until the thread returns before I can return an HttpResponse(). This is unacceptable, as the website interface has many other things that need to remain usable while the thread runs on the server.
I'm not certain that was a clear explanation but I'll be more than happy to clarify if any part is confusing.
Have a look at Celery. It's quite nice in that you can accept the request, offload it quickly to workers, and return. It's simple to use.
http://celeryproject.org/
Most simply, you can do this with subprocess.Popen. See here for some information regarding the subprocess module:
http://docs.python.org/library/subprocess.html
There are other (possibly better) methods to doing this, but this one seems to fit your requirements.
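A minimal sketch of that Popen approach from a Django view; the script path and parameter are placeholders:
import subprocess
from django.http import HttpResponse
def start_job(request):
    # Popen returns immediately; the child keeps running while we respond
    subprocess.Popen(['python', '/path/to/process_data.py', request.GET.get('dataset', '')])
    return HttpResponse('started')
Since no output is captured here, the communicate() deadlock issue doesn't arise, but you will want some other way (a db row, a file) to learn whether the job succeeded.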
Use a message queue system, like Celery (django-celery may help you).
Use an RDBMS and background process(es), either periodically invoked by cron or always running.
First, the web server inserts the data required by the background job into a database table. Then a background process (always running, or run periodically by cron) picks up the latest inserted row(s) and processes them.
Spawn a thread.
worker_thread = threading.Thread(target=do_background_job, args=args)
worker_thread.setDaemon(False)
worker_thread.start()
return HttpResponse()
Even after the HttpResponse is sent, do_background_job keeps running. However, because the web server (Apache) may kill its worker processes and their threads, execution of do_background_job is not guaranteed.
