I use Celery with Python 3 and Supervisor on Ubuntu.
I've been working on a new API that fetches an image from the internet using PIL (Pillow) and saves it on the server.
However, the problem is that I use Celery as the scheduler, and while the original API returns its result in about a millisecond, once the PIL work is added the response takes almost a second.
So as a solution, I am looking for a way to make the Celery worker run in the background.
Is it possible?
What you probably want is to daemonize your Celery worker.
If you follow the steps in the Celery documentation on running the worker as a daemon, you will be able to do that.
It is a somewhat involved process, but it will allow the Celery worker to run in the background.
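For completeness, here's a rough sketch of what the task side could look like so the API view returns immediately while the daemonized worker does the slow PIL work (the broker URL, module name and save path below are just placeholders):

```python
from io import BytesIO
from urllib.request import urlopen

from celery import Celery
from PIL import Image

# placeholder broker URL and app name; use whatever your project already has
app = Celery("images", broker="redis://localhost:6379/0")

@app.task
def fetch_and_save(url, dest_path):
    # the slow part (network I/O + PIL) now runs inside the worker
    with urlopen(url, timeout=10) as resp:
        Image.open(BytesIO(resp.read())).save(dest_path)

# in the API view:
#     fetch_and_save.delay(url, "/srv/images/example.jpg")
# the view returns in milliseconds; the daemonized worker does the rest
```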
I need to run some tasks in the background of a web app (checking out code, etc.) without blocking the views.
The twist on the typical queue/Celery scenario is that I have to ensure the tasks run to completion, surviving even a web app crash or restart, whatever their final result.
I was thinking about recording the parameters for a multiprocessing.Pool in a database and restarting all incomplete tasks when the web app restarts. It's doable, but I'm wondering if there's a simpler or more cost-effective approach?
UPDATE: Why not Celery itself? Well, I used Celery in some projects and it's really a great solution, but for this task it's on the big side: it requires a separate server, communication, etc., while all I need is spawning a few processes/threads, doing some work in them (git clone ..., svn co ...) and checking whether they succeeded or failed. Another issue is that I need the solution to be as small as possible since I have to make it follow elaborate corporate guidelines, procedures, etc., and the human administrative and bureaucratic overhead I'd have to go through to get Celery onboard is something I'd prefer to avoid if I can.
I would suggest you use Celery.
Celery does not require its own server; you can have a worker running on the same machine. You can also have a "poor man's queue" using an SQL database instead of a "real" queue/messaging server such as RabbitMQ - this setup would look very much like what you're describing, only with a separate process doing the long-running tasks.
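As a rough illustration of that "poor man's queue" (the database URL and module name are placeholders, and the SQLAlchemy broker transport is an experimental Celery feature):

```python
import subprocess
from celery import Celery

# the broker lives in a plain SQLite/Postgres database instead of RabbitMQ;
# "sqla+..." is Celery's experimental SQLAlchemy transport
app = Celery("builds", broker="sqla+sqlite:///celery-broker.sqlite")

@app.task
def checkout(repo_url, dest):
    # the long-running work from the question: git clone, svn co, etc.
    result = subprocess.run(["git", "clone", repo_url, dest])
    return result.returncode == 0

# from the web app:  checkout.delay("https://example.com/repo.git", "/tmp/repo")
# separate process on the same machine:  celery -A builds worker
```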
The problem with starting long-running tasks from the webserver process is that in the production environment the web "workers" are normally managed by the webserver - multiple workers can be spawned or killed at any time. The viability of your approach would highly depend on the web server you're using and its configuration. Also, with multiple workers each trying to do a task you may have some concurrency issues.
Apart from Celery, another option is to look at uWSGI's spooler subsystem, especially if you're already using uWSGI.
I'm using RabbitMQ to make my task pool run sequentially, one task at a time. But how can I add a time parameter so that a task only runs at a defined time in the future (like a scheduled task)?
RabbitMQ is not a task scheduler, even though the documentation talks about "scheduling" a task. You might consider using something like cron. You could also use a library like sched to build a scheduler in a Python process.
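For example, a tiny in-process scheduler built with the standard-library sched module might look like this (the task bodies are placeholders):

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def publish(task_name):
    # placeholder: publish the message to RabbitMQ here
    print(f"running {task_name} at {time.ctime()}")

# run one task 10 seconds from now and another after 60 seconds
scheduler.enter(10, 1, publish, argument=("warm-cache",))
scheduler.enter(60, 1, publish, argument=("send-digest",))
scheduler.run()  # blocks until every scheduled event has fired
```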
FYI, it looks like this question has already been answered:
Delayed message in RabbitMQ
RabbitMQ has a plugin for delayed messages.
Using this plugin, the messages can be delivered to the respective queues after a certain delay. Thanks to this plugin, you can use RabbitMQ as a scheduler, even though it's not a task scheduler by nature.
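With the plugin enabled, publishing a delayed message from Python could look roughly like this (the exchange and queue names are made up; the delay is given in milliseconds via the x-delay header):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# "x-delayed-message" is the exchange type added by the plugin
channel.exchange_declare(
    exchange="delayed",
    exchange_type="x-delayed-message",
    arguments={"x-delayed-type": "direct"},
)
channel.queue_declare(queue="tasks", durable=True)
channel.queue_bind(queue="tasks", exchange="delayed", routing_key="tasks")

# hold this message for 30 seconds before routing it to the queue
channel.basic_publish(
    exchange="delayed",
    routing_key="tasks",
    body=b"run-report",
    properties=pika.BasicProperties(headers={"x-delay": 30000}),
)
connection.close()
```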
You can use Celery with RabbitMQ as the broker for task scheduling. Here is the Celery documentation: http://docs.celeryproject.org/en/master/index.html
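For the "run at a defined time in the future" part specifically, Celery's eta/countdown options do exactly that (the task and broker URL below are placeholders):

```python
from datetime import datetime, timedelta, timezone
from celery import Celery

app = Celery("tasks", broker="amqp://guest:guest@localhost//")

@app.task
def send_report(user_id):
    print(f"sending report for user {user_id}")

# queue it now, run it roughly two hours from now
send_report.apply_async(args=(42,), eta=datetime.now(timezone.utc) + timedelta(hours=2))
# equivalently: send_report.apply_async(args=(42,), countdown=7200)
```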
I'm running a Pyramid application on Windows using Waitress as the app server and IIS as the Web Server (proxy). When I run the application, it works for a (seemingly) random amount of time before it just stops. It can go for days, even weeks at a time and then just stop, leaving IIS to throw a 502 error. When it stops, there's no way to restart it short of restarting Windows.
It's a small application which uses APScheduler to hit a couple of APIs to sync inventory between eBay/Amazon. I'm not entirely sure what is causing this problem, as there is no error shown in the logs. I had an older version of the application running (without APScheduler) and I didn't have this issue, so I'm assuming it has to do with APScheduler.
Has anybody else experienced this?
First let me say that I have not used APScheduler myself yet, and have next to no knowledge about running Python servers (or anything else, really) on Windows and IIS.
So I can only guess here, but it seems clear that your problem is related to APScheduler in some way. I could imagine that something goes wrong in a thread that APScheduler uses for your background tasks, and the hanging thread brings down your whole application, for example by never releasing the GIL (Python's global interpreter lock). This could happen when your thread(s) run into some kind of race condition.
Maybe it happens that processing starts before the previous iteration has finished processing. Or you get a really big backlog, and that leads to problems when processing starts.
Anyway, I think task queues are much better suited for background processing in web applications, because they run separately and out of the context of your web server.
You can schedule a task as soon as it's triggered by some user action, and it will be processed as soon as a worker is available, and not be deferred until a certain point in time.
I would recommend giving Celery a try, but there are also other solutions available, many of them based on Redis.
Celery is really powerful, and has advanced features like periodic tasks and crontab style schedules - so you can probably use it to do what you are doing using APScheduler now.
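To give an idea, the periodic eBay/Amazon sync could be expressed as a beat schedule along these lines (the module name, broker URL and 15-minute interval are assumptions):

```python
# inventory.py (assumed module name)
from celery import Celery
from celery.schedules import crontab

app = Celery("inventory", broker="redis://localhost:6379/0")

@app.task
def sync_inventory():
    # call the eBay/Amazon APIs here
    pass

# celery beat triggers the task every 15 minutes, much like an APScheduler job
app.conf.beat_schedule = {
    "sync-inventory-every-15-min": {
        "task": "inventory.sync_inventory",
        "schedule": crontab(minute="*/15"),
    },
}
```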
This looked very helpful for setting up Celery under Windows: http://mrtn.me/blog/2012/07/04/django-on-windows-run-celery-as-a-windows-service/
This may also prove helpful:
How to create Celery Windows Service?
Note: I haven't tried any of these myself, since I use Linux if I have a choice.
It's probably also possible to get APScheduler to work correctly, but I think it will be much easier to use Celery, because if a problem occurs in a worker you will be able to debug it much more easily than a problem occurring in a background thread. Celery can also be configured to automatically send you an email in case of an error.
I have a webapp that's built on python/Flask and it has a corresponding background job that runs continuously, periodically polling for data for each registered user.
I would like this background job to start when the system starts and keep running until it shuts down. Instead of setting up /etc/rc.d scripts, I just had the Flask app spawn a new process (using the multiprocessing module) when the app starts up.
So with this setup, I only have to deploy the Flask app and that will get the background worker running as well.
What are the downsides of this? Is this a complete and utter hack that is fragile in some way or a nice way to set up a webapp with corresponding background task?
The downside of your approach is that there are many ways it could fail, especially around stopping and restarting your Flask application.
You will have to deal with graceful shutdown to give your worker a chance to finish its current task.
Sometimes your worker won't stop in time and might linger while you start another one when you restart your Flask application.
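To illustrate the kind of plumbing that entails, the spawned worker would need something like this (the polling body is just a stand-in):

```python
import signal
import time

stop_requested = False

def handle_sigterm(signum, frame):
    global stop_requested
    stop_requested = True  # finish the current iteration, then exit cleanly

signal.signal(signal.SIGTERM, handle_sigterm)

while not stop_requested:
    # stand-in for the real per-user polling work
    time.sleep(5)
```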
Here are some approaches I would suggest, depending on your constraints:
script + crontab
You only have to write a script that does whatever task you want and cron will take care of running it for you every few minutes. Advantages: cron will run it for you periodically and will start when the system starts. Disadvantages: if the task takes too long, you might have multiple instances of your script running at the same time. You can find some solutions for this problem here.
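One common guard against overlapping runs is a non-blocking lock file, roughly like this (the lock path and the work itself are placeholders):

```python
import fcntl
import sys
import time

# exclusive, non-blocking lock: if the previous cron run is still going,
# flock() fails immediately and this run exits instead of piling up
lock_file = open("/tmp/mytask.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit(0)

time.sleep(30)  # stand-in for the real long-running task
```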
supervisord
supervisord is a neat way to deal with different daemons. You can set it up to run your app, your background script, or both, and have them start with the server. The only downside is that you have to install supervisord and make sure its daemon is running when the server starts.
uwsgi
uwsgi is a very common way of deploying Flask applications. It has a few features, like the spooler subsystem, that might come in handy for managing background workers.
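If I remember the uwsgidecorators API correctly, offloading work to the spooler looks roughly like this (untested, so double-check against the uWSGI docs; the task name is made up):

```python
from uwsgidecorators import spool

@spool
def process_upload(arguments):
    # runs in a uWSGI spooler process, outside the request/response cycle
    print("processing", arguments)

# from a Flask view; argument values are passed as strings/bytes
# process_upload.spool(filename="report.csv")
```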
Celery
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. I think this is the best solution for scheduling background tasks for a flask application or any other python based application. But using it comes with some extra bulk. You will be introducing at least the following processes:
- a broker (rabbitmq or redis)
- a worker
- a scheduler
You can also get supervisord to manage all of the processes above and get them to start when the server starts.
Conclusion
In your quest to reduce the number of processes, I would highly suggest the crontab-based solution, as it can get you a long way. But please make sure your background script leaves an execution trace or logs of some sort.
In the Tornado documentation they show how they can get very high throughput from 4 frontends. I'd like to run an app in the same way, and would like to have the frontends running as daemon processes managed with an init.d script*.
I'm fairly new to Python so don't really know where to start. Currently I'm starting the Tornado server manually in the terminal, passing in a new port number each time.
I've tried using the python-daemon package in conjunction with the lockfile package, but the lockfiles that are created don't contain the process IDs, so I can't see how to kill the processes gracefully later on.
I don't really know where to go from here, and the Tornado docs leave a large chunk out regarding deployment.
* If there's a better way to manage the processes so that they can be monitored and managed as a group then please let me know.
Try Supervisor. It's great for managing multiple daemon processes. You configure your applications in the supervisord.conf file and supervisord itself is launched from an init.d script.
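As a sketch (paths, ports and the program name are placeholders), a supervisord.conf entry for four Tornado frontends could look like this, using %(process_num)d to hand each instance its own port:

```
[program:tornado]
command=/path/to/venv/bin/python /path/to/app.py --port=88%(process_num)02d
process_name=%(program_name)s-%(process_num)d
numprocs=4
autostart=true
autorestart=true
stopasgroup=true
redirect_stderr=true
stdout_logfile=/var/log/tornado/%(program_name)s-%(process_num)d.log
```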
I can vouch for Supervisor too. We have been running Tornado in production with 4 instances under Supervisor, and it has been working uber smoothly.