Google App Engine has two methods for running jobs at some later point: Task Queues and deferred jobs.
They seem to support the same feature set as far as I can tell (e.g. a deferred job can be placed on a particular task queue so you can throttle execution), but deferred jobs look much easier to implement and more flexible.
Does anyone know the pros and cons of each method? Are there any circumstances where you would want to use the Task Queue API over deferred jobs?
I'm not sure if you noticed it, but the documentation for the deferred library has this section at the end:
You may be wondering when to use ext.deferred, and when to stick with the built-in task queue API. Here are our suggestions.
You may want to use the deferred library if:
You only use the task queue lightly.
You want to refactor existing code to run on the Task Queue with a minimum of changes.
You're writing a one off maintenance task, such as schema migration.
Your app has many different types of background task, and writing a separate handler for each would be burdensome.
Your task requires complex arguments that aren't easily serialized without using Pickle.
You are writing a library for other apps that needs to do background work.
You may want to use the Task Queue API if:
You need complete control over how tasks are queued and executed.
You need better queue management or monitoring than deferred provides.
You have high throughput, and overhead is important.
You are building larger abstractions and need direct control over tasks.
You like the webhook model better than the RPC model.
Naturally, you can use both the Task Queue API and the deferred library side-by-side, if your app has requirements that fit into both groups.
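For reference, here is a minimal sketch of the two styles side by side. The function name, queue name, worker URL and arguments are made up for illustration; the deferred.defer and taskqueue.add calls themselves are the standard Python App Engine APIs.

    # Deferred (RPC model): enqueue an ordinary function call.
    from google.appengine.ext import deferred
    from google.appengine.api import taskqueue

    def expensive_update(user_id, batch_size=20):
        # ...the actual background work would go here...
        pass

    user_id = "user-123"

    # deferred pickles the function reference and its arguments; the _queue
    # argument still lets you target a named queue for throttling.
    deferred.defer(expensive_update, user_id, batch_size=50, _queue="background")

    # Task Queue (webhook model): enqueue a payload and handle it in a
    # request handler you map to /worker/expensive-update yourself.
    taskqueue.add(url="/worker/expensive-update",
                  params={"user_id": user_id, "batch_size": "50"},
                  queue_name="background")

The trade-off is visible even in this small sketch: deferred hides the handler and serialization behind one call, while the Task Queue version gives you an explicit URL and payload to control and monitor.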
Related
I need to run some parallel computations in python. The only compatible approach I can think of is the multiprocess/fork model, which is less than ideal for several reasons:
from what I understand, forks in Windows are expensive
fine-grained process management (signals, i.e. SIGSTOP/SIGCONT) is clunky (i.e. it happens outside the language)
These are the task requirements:
tasks may spawn new tasks
tasks must be registered with the task manager
tasks do not require shared state
tasks must return a value (python object)
The task manager is responsible for scheduling and limiting the number of concurrent tasks. These are the task manager requirements:
when a new task is started, the task manager may suspend other tasks based on a predetermined limit
when a task returns, the task manager may continue other suspended tasks
when the return value of a task is requested, the task manager may reorganize the task priority (prevent deadlocks)
So you see, the task manager doesn't need to be a parallel/concurrent process. Each task may make synchronous calls to the task manager on starting or stopping. Tasks waiting on other tasks may also make synchronous calls.
I can't seem to think of any other approaches:
asyncio can run parallel processes within a limited pool, but that approach is more suited to data parallelism than task pre-emption. Externally pre-empting (suspending) a task isn't compatible with cooperatively scheduled coroutines. Correct me if I'm wrong, but while I could use asyncio, it wouldn't make my life easier (an abstraction without benefit), as I would still need to use processes, and signals on "task start/stop" events?
stackless python might be suitable, but it isn't really python?
Any ideas?
P.S. My end goal is to automatically parallelize (decorated) function calls. The task manager limits the number of tasks executing in parallel (e.g. recursive functions) to avoid thrashing (fork bombs). I need to use Python, even though a lazy (task waiting), pure (no shared state) and stackless (lightweight threads) language might be more suitable...
Wow, this question is old and I'm surprised a Stackless Python user hasn't chimed in...
Then again, Stackless Python was/is way ahead of its time and there's very few of us out there putting it into use.
Stackless Python is indeed Python. It is a little more than just Python, but it is Python none the less.
Stackless Python Wiki
I think it would suit your needs very well. It is still up-to-date and maintained with a commit as recent as this month. It's rather solid and has worked wonderfully for my needs.
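If it helps, here is a rough sketch of my own (not taken from the Stackless docs) of how the pieces map onto your requirements: tasklets are the lightweight tasks, a channel gives you the "return a value" part, and receive() is the point where a manager yields so other tasklets can run. The worker/manager split and the squaring work are just illustrative.

    import stackless

    def worker(chan, n):
        # A tasklet computes a value and hands it back over a channel.
        result = n * n
        chan.send(result)

    def manager():
        chan = stackless.channel()
        # Spawn a few tasklets; they don't run until the scheduler gives them time.
        for n in range(5):
            stackless.tasklet(worker)(chan, n)
        # receive() blocks only this tasklet and lets the workers run, which is
        # the cooperative suspend/continue behaviour you get for free.
        results = [chan.receive() for _ in range(5)]
        print(results)

    stackless.tasklet(manager)()
    stackless.run()

Limiting concurrency or reprioritizing would then be a matter of how the manager decides which tasklets to spawn or resume, rather than OS-level signals.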
I'm fairly new to Celery/AMQP and am trying to come up with a task/queue/worker design to meet the following requirements.
I have multiple types of "per-user" tasks: e.g., TaskA, TaskB, TaskC. Each of these "per-user" tasks read/write data for one particular user in the system. So at any given time, I might need to create tasks User1_TaskA, User1_TaskB, User1_TaskC, User2_TaskA, User2_TaskB, etc. I need to ensure that, for each user, no two tasks of any task type execute concurrently. I want a system in which no worker can execute User1_TaskA at the same time as any other worker is executing User1_TaskB or User1_TaskC, but while User1_TaskA is executing, other workers shouldn't be blocked from concurrently executing User2_TaskA, User3_TaskA, etc.
I realize this could be implemented using some sort of external locking mechanism (e.g., in the DB), but I'm hoping there's a more elegant task/queue/worker design that would work.
I suppose one possible solution is to implement queues as user buckets such that, when the workers are launched there's config that specifies how many buckets to create, and each "bucket worker" is bound to exactly one bucket. Then an "intermediate worker" would pull off tasks from the main task queue and assign them into the bucketed queues via, say, a hash/mod scheme. So UserA's tasks would always end up in the same queue, and multiple tasks for UserA would back up behind each other. I don't love this approach, as it would require the number of buckets to be defined ahead of time, and would seem to prevent (easily) adding workers dynamically. Seems to me there's got to be a better way -- suggestions would be greatly appreciated.
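For concreteness, a rough sketch of the hash/mod bucketing idea, assuming Celery, integer user ids, and made-up queue and task names:

    from celery import Celery

    app = Celery("tasks", broker="amqp://localhost//")

    NUM_BUCKETS = 8  # fixed ahead of time, which is the part I dislike

    @app.task
    def do_task_a(user_id):
        # Per-user work. No two tasks for the same user run concurrently,
        # because they all land on the same single-consumer queue.
        ...

    def enqueue_task_a(user_id):
        # Route every task for this user to the same bucketed queue
        # (assumes user_id is an integer so the mapping is stable).
        bucket = user_id % NUM_BUCKETS
        do_task_a.apply_async(args=[user_id], queue="user_bucket_%d" % bucket)

    # Each "bucket worker" is then started with concurrency 1, bound to one queue:
    #   celery -A tasks worker -Q user_bucket_0 -c 1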
What's so bad in using an external locking mechanism? It's simple, straightforward, and efficient enough. You can find an example of distributed task locking in Celery here. Extend it by creating a lock per user, and you're done!
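Here is a rough adaptation of that recipe (my own sketch, using Django's cache as the lock store and a hypothetical task name; swap in whatever shared store you have):

    from contextlib import contextmanager
    from celery import Celery
    from django.core.cache import cache

    app = Celery("tasks", broker="amqp://localhost//")

    LOCK_EXPIRE = 60 * 10  # give up the lock after 10 minutes no matter what

    @contextmanager
    def user_lock(user_id):
        lock_id = "user-lock-%s" % user_id
        # cache.add is atomic: it only succeeds if the key doesn't exist yet.
        acquired = cache.add(lock_id, "locked", LOCK_EXPIRE)
        try:
            yield acquired
        finally:
            if acquired:
                cache.delete(lock_id)

    @app.task(bind=True)
    def task_a(self, user_id):
        with user_lock(user_id) as acquired:
            if not acquired:
                # Another task for this user is running; retry later.
                raise self.retry(countdown=10)
            # ... per-user work goes here ...

Because the lock key is per user, User1's tasks serialize against each other while User2's and User3's tasks run freely on any worker, which is exactly the constraint you described.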
I need to handle a large (time- and memory-consuming) process asynchronously in a web2py application, triggered from inside a controller method.
My specific use case is to call a process via stdlib.subprocess and wait for it to exit without blocking the web server, but I am open to alternative methods.
Hands-on examples would be a plus.
3rd-party library recommendations are welcome.
CRON scheduling is not required/wanted.
Assuming you'll need to start multiple, possibly simultaneous, instances of the background task, the solution is a task queue. I've heard good things about Celery and RabbitMQ, if you're looking for 3rd-party options, and web2py includes its own task queue system that might be sufficient for your needs.
With either tool, you'll define a function that encapsulates the operation you want the background process to perform. Then bring the task queue workers online. The web2py manual and forums indicate this can be done with an @reboot statement in the web2py cron system, which is triggered whenever the web server starts. There are probably other ways to start the workers if this is unsatisfactory.
In your controller you'll insert a task into the task queue, passing any necessary parameters as inputs to the function (the background function will not run in the same environment as the controller, so it won't have access to the session, DB, etc. unless you explicitly pass the appropriate values into the task function).
Now, to get the output of the background operation back to the user: when you insert a task into the task queue, you should get back a unique ID for the task. You would then implement controller logic (either something that expects an AJAX call, or a page that keeps refreshing until the task completes) that calls the task queue's API to check the status of the specified task. If the task's status is "finished", return the data to the user. If not, keep waiting.
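As a rough sketch of what this could look like with web2py's built-in scheduler (the function names and command are illustrative, and the exact scheduler API is best checked against the web2py book):

    # In a model file, e.g. models/scheduler.py
    from gluon.scheduler import Scheduler

    def run_long_process(cmd):
        import subprocess
        # The worker process, not the web server, blocks here.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        return dict(returncode=proc.returncode, output=out)

    scheduler = Scheduler(db)

    # In a controller
    def start():
        # Queue the task and hand its id back to the client.
        task = scheduler.queue_task(run_long_process, pvars=dict(cmd=["sleep", "30"]))
        return dict(task_id=task.id)

    def check():
        # Poll this action (e.g. via AJAX) until the task reports COMPLETED.
        status = scheduler.task_status(request.args(0, cast=int), output=True)
        if status and status.scheduler_task.status == 'COMPLETED':
            return dict(done=True, result=status.result)
        return dict(done=False)

The worker itself is started separately (for example with "python web2py.py -K yourapp"), which is the "bring the task queue workers online" step above.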
Maybe review the book section on running tasks in the background. You can use the new scheduler or create a homemade queue (email example). There's also a web2py-celery plugin, though I'm not sure what state that is in.
This is more difficult than one might expect. Note the deadlock warnings in the stdlib.subprocess documentation. It's easy if you don't mind blocking: use Popen.communicate. To work around the blocking, you can manage the process using stdlib.subprocess from a thread.
My favorite way to deal with subprocesses is to use Twisted's spawnProcess. But, it is not easy to get Twisted to play nicely with other frameworks.
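A minimal sketch of the thread-plus-communicate approach mentioned above (the helper and callback names are made up):

    import subprocess
    import threading

    def run_in_background(cmd, on_done):
        """Run cmd without blocking the caller; call on_done(returncode, out, err) when it exits."""
        def target():
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            # communicate() reads both pipes to EOF, which avoids the
            # pipe-buffer deadlock the subprocess docs warn about.
            out, err = proc.communicate()
            on_done(proc.returncode, out, err)
        thread = threading.Thread(target=target)
        thread.daemon = True
        thread.start()
        return thread

    # Example usage:
    def report(returncode, out, err):
        print("background command finished with code %d" % returncode)

    run_in_background(["sleep", "2"], report)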
I have a process which publishes financial tick data through Redis pub/sub in real time. Now I want my Python application to handle the incoming data (JSON), for instance calculations like a moving average and so on. I want the results to be sent back via Redis to other tasks (which do further calculations based on the results of the first task). Further, I want to trigger some tasks regularly, once a day or every second. Faced with this complex and unforeseen structure, I've stumbled upon solutions like gevent, Celery or plain threads.
But what I'm wondering is: what are my options for doing this the right way? How can I structure my Redis pub/sub and worker/task setup most efficiently? Suggestions are welcome, in terms of libraries (if you've used any of the ones mentioned, please share your experiences), techniques (Python structure best practices), and how to make the best use of Redis pub/sub for this job.
If any of those calculations are computationally expensive, you do them in Python, and you want scalability, then Celery makes perfect sense.
gevent would just make your code more efficient in specific cases, but won't help you in terms of scalability. That also holds true if you use threads.
Bear in mind that you can configure Celery to run the worker pool on gevent (or eventlet).
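A rough sketch of how that might be wired up with Celery on a Redis broker (the broker URLs, task bodies and schedule are placeholders):

    from celery import Celery

    app = Celery("ticks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    @app.task
    def moving_average(ticks):
        # First-stage calculation on the incoming JSON-decoded ticks.
        prices = [t["price"] for t in ticks]
        return sum(prices) / len(prices)

    @app.task
    def further_calculation(avg):
        # Consumes the result of the first task.
        return avg * 2

    def process(ticks):
        # Chain the two steps: moving_average's result feeds further_calculation.
        return (moving_average.s(ticks) | further_calculation.s()).delay()

    @app.task
    def daily_summary():
        # Placeholder for the once-a-day job.
        pass

    # Periodic triggers via celery beat (task name assumes this file is ticks.py).
    app.conf.beat_schedule = {
        "daily-summary": {"task": "ticks.daily_summary", "schedule": 24 * 3600},
    }

    # Run the worker pool on gevent, plus the beat scheduler:
    #   celery -A ticks worker -P gevent -c 100
    #   celery -A ticks beat

Your pub/sub listener then stays a thin loop: it decodes each tick message and calls process(), and Celery takes care of distributing the actual calculations across workers.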
I would like to hold running threads in my Django application. Since I cannot do so in the model or in the session, I thought of holding them in a singleton. I've been checking this out for a while and haven't really found a good how-to for this.
Does anyone know how to create a thread-safe singleton in python?
EDIT:
More specifically, what I want to do is implement some kind of "anytime algorithm": when a user presses a button, a response is returned and a new computation begins (a new thread). I want this thread to run until the user presses the button again, at which point my app will return the best solution it has managed to find. To do that, I need to save the thread object somewhere. I thought of storing it in the session, which apparently I cannot do.
The bottom line is: I have a FAT computation I want to do on the server side, in different threads, while the user is using my site.
Unless you have a very good reason, you should execute the long-running work in a different process altogether, and use Celery to execute it:
Celery is an open source asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on one or more worker nodes using multiprocessing, Eventlet or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
Celery guide for djangonauts: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html
For singletons and sharing data between tasks/threads, again, unless you have a good reason, you should use the db layer (aka, models) with some caution regarding db locks and refreshing stale data.
Update: regarding your use case, define a Computation model with a status field. When a user starts a computation, an instance is created and a task starts to run. The task monitors the status field (checking the DB once in a while). When the user clicks the button again, a view changes the status to "stop requested", causing the task to terminate.
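A rough sketch of that pattern (the model fields, app name, and the improve() step are illustrative placeholders):

    # models.py
    from django.db import models

    class Computation(models.Model):
        STATUS_CHOICES = [("running", "running"),
                          ("stop_requested", "stop requested"),
                          ("finished", "finished")]
        status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="running")
        best_solution = models.TextField(blank=True)

    # tasks.py
    import time
    from celery import shared_task

    def improve(best):
        # Placeholder for one step of the real anytime algorithm.
        return (best or 0) + 1

    @shared_task
    def anytime_search(computation_id):
        from myapp.models import Computation  # hypothetical app name
        comp = Computation.objects.get(pk=computation_id)
        best = None
        while True:
            best = improve(best)      # one step of the anytime algorithm
            comp.refresh_from_db()    # pick up a stop request from the view
            if comp.status == "stop_requested":
                break
            time.sleep(0)             # yield; real work replaces this
        comp.best_solution = repr(best)
        comp.status = "finished"
        comp.save()

    # A view flips comp.status to "stop_requested" when the button is pressed
    # again, and returns comp.best_solution once status == "finished".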
If you want asynchronous code in a web application then you're taking the wrong approach. You should run background tasks with a specialist task queue like Celery: http://celeryproject.org/
The biggest problem you have is web server architecture. Unless you go against the recommended Django web server configuration and use a worker thread MPM, you will have no way to track your thread handles between requests as each request typically occupies its own process. This is how Apache normally works: http://httpd.apache.org/docs/2.0/mod/prefork.html
EDIT:
In light of your edit, I think you might learn more by creating a custom solution that:
Maintains start/stop state in the database
Runs as a separate daemon process
Periodically checks the start/stop state and begins or ends work accordingly
There's no need for multithreading here unless you need to create a new process for each user. If so, things get more complicated and using Celery will make your life much easier.