I have a [python] AppEngine app which creates multiple tasks and adds them to a custom task queue. dev_appserver.py seems to ignore the rate/scheduling parameters I specify in queue.yaml and executes all the tasks immediately. This is a problem [at least for dev/testing purposes] as my tasks call a rate-throttled URL; immediate execution of all tasks breaches the throttling limits and returns a bunch of errors.
Does anyone know if task scheduling in dev_appserver.py is disabled? I can't find anything that suggests this in the AppEngine docs. Can anyone suggest a workaround?
Thank you.
When your app is running in the development server, tasks are automatically executed at the appropriate time just as in production.
You can examine and manipulate tasks from the developer console:
http://localhost:8080/_ah/admin/taskqueue
Documentation here
The documentation lies: the development server doesn't appear to support rate limiting. (This is documented for the Java dev server, but not for Python). You can demonstrate this by pausing a queue by giving it a 0/s rate, but you'll find it executes tasks anyway. When such an app is uploaded to production, it behaves as expected.
I opened a defect.
The rate parameter does not set an absolute upper bound on task queue processing. In fact, if you use for example:
rate: 10/s
bucket_size: 20
the processing can burst up to 20/s. Something more useful would be:
max_concurrent_requests: 1
which limits the queue to executing one task at a time.
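To see why a bucket_size of 20 allows a burst, here is a minimal sketch of a generic token bucket using the same numbers. It is only a model for illustration: the TokenBucket class and its method names are made up here, and it is not App Engine's actual implementation.

```python
class TokenBucket:
    """Generic token bucket: a model for illustration, not App Engine's code."""

    def __init__(self, rate, bucket_size):
        self.rate = rate                  # tokens added per second
        self.bucket_size = bucket_size    # maximum tokens that can accumulate
        self.tokens = float(bucket_size)  # assumption: the bucket starts full
        self.last = 0.0                   # time of the last refill, in seconds

    def try_take(self, now):
        """Consume one token at time `now`; return False if none is available."""
        elapsed = now - self.last
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=10, bucket_size=20)
# 25 tasks all arrive at t=0: only the 20 banked tokens can be spent at once.
burst = sum(bucket.try_take(now=0.0) for _ in range(25))
# Half a second later, 0.5 s * 10/s = 5 new tokens have accumulated.
later = sum(bucket.try_take(now=0.5) for _ in range(25))
print(burst, later)
```

In this model the rate only controls how fast tokens come back; how many tasks can start at the same instant is governed by the bucket size.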
However, this will not stop tasks from executing. If you are adding multiple tasks at a time but know that they need to be executed later, you should probably use countdown.
_countdown using deferred library
countdown using Task class
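A minimal sketch of the countdown idea: give each task a growing delay so they do not all hit the throttled URL at once. The helper names (staggered_countdowns, enqueue_all) and the 2-second spacing are assumptions for illustration; the enqueue callable is a stand-in so the sketch runs outside App Engine.

```python
def staggered_countdowns(num_tasks, min_interval_seconds):
    """Return one countdown (in seconds) per task, spaced min_interval apart."""
    return [i * min_interval_seconds for i in range(num_tasks)]


def enqueue_all(payloads, enqueue, min_interval_seconds=2):
    """Call enqueue(payload, countdown) for each payload with a growing delay.

    `enqueue` is a hypothetical callable supplied by the caller. On App Engine
    it might wrap taskqueue.add(url=..., params=..., countdown=countdown) or
    deferred.defer(func, payload, _countdown=countdown).
    """
    countdowns = staggered_countdowns(len(payloads), min_interval_seconds)
    for payload, countdown in zip(payloads, countdowns):
        enqueue(payload, countdown)
    return countdowns


# Stand-in enqueue function that just records what would be scheduled.
scheduled = []
enqueue_all(["a", "b", "c"], lambda p, c: scheduled.append((p, c)))
print(scheduled)
```

Since the countdown is stored with each task, this spacing should not depend on the queue's rate setting being honoured.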
I am using celery for async processing along with Heroku. I would like to be able to determine when Heroku sends SIGTERM prior to shutting down (when we are deploying new code, setting env vars, etc) in specific tasks. This will allow us to do any clean up on long running tasks greater than 10 seconds. I understand that we should strive for short idempotent tasks, but the data we are dealing with is too large to get to that level.
I have run into the following doc:
https://devcenter.heroku.com/articles/celery-heroku#using-remap_sigterm
But the documentation is lacking and gives little context.
If someone could give me an example of how to handle this, I would greatly appreciate it!
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as the backend, PostgreSQL as the DB, GraphQL as the API layer and React as the frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on pre-defined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, where the user specifies what exactly needs to be parsed (i.e. filters) and the parsed data is then displayed in a table so the accuracy of the parsing task can be reviewed.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet whether a signal can run custom Python functions, but from what I have read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure asynchronous tasks are what I'm after here, as I need the task to run immediately after the user triggers the feature from the frontend. I might be wrong here. I'm also not sure whether I need Redis to store the task details or whether I can do that in PostgreSQL.
What is the best practice in implementing this feature? The task can be anything, not necessarily parsing emails; it can also be importing data from excel. Any task that is user generated rather than scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. You just pass the parameters through the request; the server parses them, runs the script, and returns the result.
No setting up of a message queue, task scheduler, or whatever needed.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts take a while, and for some you don't even know whether they will finish, so they should not tie up a request.
Script is no longer dependent on the reliability of the network (imagine running an expensive task, then your internet connection skips or is just plain intermittent; you won't be able to do anything)
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a messaging queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by triggering the script manually.
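To make the pattern concrete, here is a minimal in-process sketch using only the standard library. The queue here is a queue.Queue rather than Redis, RabbitMQ or PostgreSQL, and the "work" (summing numbers) and task names are placeholders chosen for illustration.

```python
import queue
import threading

task_queue = queue.Queue()
results = {}


def consumer():
    """Take tasks off the queue and execute them until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        task_id, numbers = task
        results[task_id] = sum(numbers)   # placeholder for the real work
        task_queue.task_done()


worker = threading.Thread(target=consumer, daemon=True)
worker.start()

# Producer: describe each task and its parameters, then enqueue it.
task_queue.put(("task-1", [1, 2, 3]))
task_queue.put(("task-2", [10, 20]))
task_queue.put(None)   # tell the consumer to stop

task_queue.join()      # block until every queued item has been processed
worker.join()
print(results)
```

Swapping the in-memory queue for an external one changes where tasks are stored, not the shape of the producer and consumer.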
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple. Unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more it depends on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like crontab starting the script again while a previous run is still going, but I won't go into detail without knowing more about the specifics of the task.
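The steps above can be sketched roughly as follows. This uses sqlite3 in memory as a stand-in for PostgreSQL, and the table layout, function names, and the example filter string are all assumptions for illustration. The "claim" update is one possible way to keep two overlapping runs from picking up the same task.

```python
import sqlite3


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, params TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', result TEXT)"
    )


def add_task(conn, params):
    """What the 'add a task' page would do: store the parameters only."""
    cur = conn.execute("INSERT INTO tasks (params) VALUES (?)", (params,))
    conn.commit()
    return cur.lastrowid


def run_pending(conn, execute):
    """What the cron-driven script would do on each run."""
    rows = conn.execute(
        "SELECT id, params FROM tasks WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for task_id, params in rows:
        # Claim the task first; rowcount is 0 if another run already took it.
        claimed = conn.execute(
            "UPDATE tasks SET status = 'running' "
            "WHERE id = ? AND status = 'pending'", (task_id,)
        ).rowcount
        conn.commit()
        if not claimed:
            continue
        result = execute(params)
        conn.execute(
            "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
            (result, task_id),
        )
        conn.commit()
    return len(rows)


def display(conn, task_id):
    """What the task details page would show."""
    status, result = conn.execute(
        "SELECT status, result FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return result if status == "done" else "Processing..."


conn = sqlite3.connect(":memory:")
init_db(conn)
tid = add_task(conn, "from:billing@example.com")
before = display(conn, tid)
run_pending(conn, execute=lambda params: "parsed " + params)
after = display(conn, tid)
print(before, "->", after)
```

In a real deployment, `execute` would be the email parsing function, and run_pending would be the entry point of the script that crontab invokes.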
I am building a REST API with Flask-RESTPlus. One of my endpoints takes a file uploaded from the client and runs some analysis. The job takes up to 30 seconds, and I don't want it to block the main process. So the endpoint will return a 200 or 201 response right away while the job is still running. Results will be saved to a database and retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simple. The problem is that there is a limit on the number of threads for a Flask app. In a standalone Python app, I could use a queue for the threads. But this is a REST API, where each request is independent, and I don't know if there is a way to maintain a global queue for that. So if the requests exceed the thread limit, the app won't be able to take more requests.
A task queue with Celery and Redis is probably the better option. But this is just a proof of concept, and the timeline is tight. Setting up Celery and Redis with Flask is not easy; I am having lots of trouble on my dev machine, which runs Windows. It will be deployed on AWS, which is also fairly complex.
I wonder if there is a third option for this case?
I would HIGHLY recommend using Celery, as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative and there is no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.
Celery is a fantastic solution to this problem; I have used it quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).
The documentation indicates that:
Tasks targeted at an automatic scaled module must finish execution
within 10 minutes. If you have tasks that require more time or
computing resources, they can be sent to manual or basic scaling
modules, where they can run up to 24 hours.
The link surrounding manual or basic scaling modules talks about a target, but doesn't say more about how to have a task that runs for a day.
You guessed my question :) How do I tell GAE that this specific task will be run for a day, not a minute ?
You'll need to configure a module to use basic or manual scaling and deploy your task handling code to an instance of that module.
You can read more about configuring modules/versions/instances on the App Engine Modules page for Python
I want to write a long running process (linux daemon) that serves two purposes:
responds to REST web requests
executes jobs which can be scheduled
I originally had it working as a simple program that would run through runs and do the updates which I then cron’d, but now I have the added REST requirement, and would also like to change the frequency of some jobs, but not others (let’s say all jobs have different frequencies).
I have 0 experience writing long running processes, especially ones that do things on their own, rather than responding to requests.
My basic plan is to run the REST part in a separate thread/process, and figured I’d run the jobs part separately.
I’m wondering if there exist any patterns, specifically for Python (I’ve looked and haven’t really found any examples of what I want to do), or if anyone has any suggestions on where to begin with transitioning my project to meet these new requirements.
I’ve seen a few projects that touch on scheduling, but I’m really looking for real world user experience / suggestions here. What works / doesn’t work for you?
If the REST server and the scheduled jobs have nothing in common, do two separate implementations, the REST server and the jobs stuff, and run them as separate processes.
As mentioned previously, look into existing schedulers for the jobs stuff. I don't know if Twisted would be an alternative, but you might want to check this platform.
If, OTOH, the REST interface invokes the same functionality as the scheduled jobs do, you should try to look at them as two interfaces to the same functionality, e.g. like this:
Write the actual jobs as programs the REST server can fork and run.
Have a separate scheduler that handles the timing of the jobs.
If a job is due to run, let the scheduler issue a corresponding REST request to the local server.
This way the scheduler only handles job descriptions, but has no knowledge of how they are implemented.
It's a common trait for long-running, high-availability processes to have an additional "supervisor" process that just checks the necessary daemons are up and running, and restarts them as necessary.
One option is to simply choose a lightweight WSGI server from this list:
http://wsgi.org/wsgi/Servers
and let it do the work of a long-running process that serves requests. (I would recommend Spawning.) Your code can concentrate on the REST API and handling requests through the well defined WSGI interface, and scheduling jobs.
There are at least a couple of scheduling libraries you could use, but I don't know much about them:
http://sourceforge.net/projects/pycron/
http://code.google.com/p/scheduler-py/
Here's what we did.
Wrote a simple, pure-wsgi web application to respond to REST requests.
Start jobs
Report status of jobs
Extended the built-in wsgiref server to use the select module to check for incoming requests.
Activity on the socket is an ordinary REST request; we let wsgiref handle this. It will -- eventually -- call our WSGI applications to respond to status and submit requests.
Timeout means that we have to do two things:
Check all children that are running to see if they're done. Update their status, etc.
Check a crontab-like schedule to see if there's any scheduled work to do. This is a SQLite database that this server maintains.
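A rough sketch of that select-based loop, using only the standard library. This is not the code from the answer above: the placeholder app, the serve function, and the max_loops parameter (added so the example terminates) are all assumptions for illustration.

```python
import select
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Placeholder WSGI app; the real one would start jobs and report status."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def serve(httpd, on_timeout, timeout=1.0, max_loops=None):
    """Handle requests when the socket is readable; otherwise run on_timeout."""
    loops = 0
    while max_loops is None or loops < max_loops:
        readable, _, _ = select.select([httpd], [], [], timeout)
        if readable:
            httpd.handle_request()   # ordinary REST request, wsgiref handles it
        else:
            on_timeout()             # check children, consult the schedule
        loops += 1


ticks = []
httpd = make_server("127.0.0.1", 0, app)   # port 0: let the OS pick a free port
serve(httpd, on_timeout=lambda: ticks.append("tick"), timeout=0.05, max_loops=3)
httpd.server_close()
print(len(ticks))
```

With max_loops left as None the loop runs forever, and on_timeout is where checking child processes and the schedule database would go.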
I usually use cron for scheduling. As for REST you can use one of the many, many web frameworks out there. But just running SimpleHTTPServer should be enough.
You can schedule the REST service startup with cron's @reboot:
@reboot (cd /path/to/my/app && nohup python myserver.py &)
The usual design pattern for a scheduler would be:
Maintain a list of scheduled jobs, sorted by next-run-time (as Date-Time value);
When woken up, compare the first job in the list with the current time. If it's due or overdue, remove it from the list and run it. Continue working your way through the list this way until the first job is not due yet, then go to sleep for (next_job_due_date - current_time);
When a job finishes running, re-schedule it if appropriate;
After adding a job to the schedule, wake up the scheduler process.
Tweak as appropriate for your situation (eg. sometimes you might want to re-schedule jobs to run again at the point that they start running rather than finish).
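The pattern above can be sketched with a heap ordered by next-run time. The Scheduler class and its method names are made up for illustration; times are plain numbers passed in by the caller so the logic can be followed without real sleeping, and repeating jobs are assumed to have a positive interval.

```python
import heapq
import itertools


class Scheduler:
    def __init__(self):
        self._heap = []                 # entries: (next_run, seq, interval, func)
        self._seq = itertools.count()   # tie-breaker so funcs are never compared

    def add(self, next_run, func, interval=None):
        """Schedule func at next_run; repeat every `interval` if it is given."""
        heapq.heappush(self._heap, (next_run, next(self._seq), interval, func))

    def run_due(self, now):
        """Run every job due at `now`; return seconds to sleep, or None if empty."""
        while self._heap and self._heap[0][0] <= now:
            _, _, interval, func = heapq.heappop(self._heap)
            func()
            if interval is not None:
                # Re-schedule once the job has finished, relative to `now`.
                self.add(now + interval, func, interval)
        return self._heap[0][0] - now if self._heap else None


log = []
s = Scheduler()
s.add(5, lambda: log.append("once"))
s.add(0, lambda: log.append("every-10"), interval=10)

first_sleep = s.run_due(now=0)    # only the repeating job is due
second_sleep = s.run_due(now=5)   # now the one-off job is due
print(log, first_sleep, second_sleep)
```

A real loop would sleep for the returned number of seconds between calls to run_due, and would need some way to cut that sleep short when a new job is added (for example, waiting on a threading.Event with a timeout instead of calling time.sleep).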