I am new to this, so bear with me if I'm asking something completely stupid.
I am developing a basic web app and using Heroku+flask+python.
For the background tasks, Heroku recommends using a worker. I wonder if I could just create new threads for those background tasks? Or is there a reason why a worker+redis is a better solution?
Those background tasks are not critical, really.
The main benefit of doing this in a separate worker is that you completely decouple your app from your background tasks, so if one breaks it can't affect the other. That said, if you don't care about that, or you need your background tasks more tightly coupled to your app for whatever reason, you can use APScheduler to run the background tasks in separate threads without spinning up another worker. A simple example that runs a background job every 10 seconds:
from apscheduler.schedulers.background import BackgroundScheduler

def some_job():
    print("successfully finished job!")

# Start the scheduler, then register the job to run every 10 seconds
apsched = BackgroundScheduler()
apsched.start()
apsched.add_job(some_job, 'interval', seconds=10)
If you want tasks run asynchronously instead of on a schedule, you can use RQ, which has great examples of how to use it on its homepage. RQ is backed by Redis, but you don't need to run it in a separate worker process, although you can if you like.
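For illustration, a minimal RQ sketch could look roughly like this (the some_background_task function and the default local Redis connection are placeholder assumptions, not from the original post):

from redis import Redis
from rq import Queue

def some_background_task(n):
    # Placeholder for whatever non-critical background work needs doing
    return n * 2

# Enqueue the job; a worker process (e.g. started with the `rq worker` command)
# picks it up from Redis and runs it outside the web request cycle
q = Queue(connection=Redis())
job = q.enqueue(some_background_task, 21)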
I have to set up a worker which handles some data after a certain event happens. I know I can start the worker with python manage.py runworker my_worker, but what I would need is to start the worker in the same process as the main Django app on a separate thread.
Why do I need it in a separate thread and not in a separate process? Because the worker would perform a pretty light-weight job which would not overload the server's resources, and, moreover, the effort of making the set up for a new process in the production is not worth the gain in performance. In other words, I would prefer to keep it in the Django's process if possible.
Why not perform the job synchronously? Because it is a separate logic that needs the possibility to be extended, and it is out of the main HTTP request-reply scope. It is a post-processing task which doesn't interfere with the main logic. I need to decouple this task from an infrastructural point-of-view, not only logical (e.g. with plain signals).
Is there any possibility provided by Django Channels to run a worker in such a way?
Would there be any downsides to starting the worker manually on a separate thread?
Right now I have the setup for a message broker consumer thread (without using Channels), so I have an entry point for starting a new worker thread. But as I've seen from the Channels runworker command, it loads the whole application, so it doesn't seem like a naïve worker.run() call is the proper way to do it (I might be wrong about this one).
I found an answer to my question.
The answer is no, you can't just start a worker within the same process. This is because the consumer needs to run inside an event loop thread, and it is not a good idea to have more than one event loop thread in the same process (the Django WSGI application already runs an event loop on the main thread).
The best you can do is to start the worker in a separate process. As I mentioned in my question, I started a message broker consumer on a separate thread, which was not a good approach either, so I changed my configuration to start the consumers as separate processes.
I have a Flask app that uses external scripts to perform certain actions. In one of the scripts, I am using the threading module to run several threads.
I am using the following code for the actual threading:
for a_device in get_devices:
    my_thread = threading.Thread(target=DMCA.do_connect, args=(self, a_device, cmd))
    my_thread.start()

main_thread = threading.currentThread()
for some_thread in threading.enumerate():
    if some_thread != main_thread:
        some_thread.join()
However, when this script gets run (from a form), the request hangs and the webpage just keeps loading.
Is there another way to use multithreading within the app?
Implementing threading by myself in a Flask app has always ended in some kind of disaster for me. You might want to use a distributed task queue such as Celery. Even though it might be tempting to spin off threads by yourself to get it finished faster, you will start to face all kinds of problems along the way and just end up wasting a lot of time (IMHO).
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on a single or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
Here are some good resources that you can use to get started
Using Celery With Flask - Miguel Grinberg
Celery Background Tasks - Flask Documentation
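To make that recommendation concrete, here is a minimal, hedged sketch of offloading work from a Flask view to a Celery task (the broker URL and the long_running_job function are placeholder assumptions, not taken from the original post):

from celery import Celery
from flask import Flask

app = Flask(__name__)
# Assumed broker; point this at your own Redis or RabbitMQ instance
celery = Celery(app.name, broker='redis://localhost:6379/0')

@celery.task
def long_running_job(device_id):
    # Placeholder for the slow work previously done in a request thread
    return device_id

@app.route('/start/<device_id>')
def start(device_id):
    # Returns immediately; a separate Celery worker executes the job
    long_running_job.delay(device_id)
    return 'started', 202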
I am working on a Django application which uses Celery for distributed async processing. Now I have been tasked with integrating a process which was originally written with concurrent.futures. So my question is: can this concurrent.futures code work inside the Celery task queue? Would it cause any problems, and if so, what would be the best way to go forward? The existing concurrent process is resource intensive, as it is able to avoid the GIL, and it is very fast because of that. On top of that, the process uses a concurrent.futures.ProcessPoolExecutor and, inside it, another few (<5) concurrent.futures.ThreadPoolExecutor jobs.
So now the real question is: should we extract all the core functions of the process and rewrite them as Celery app tasks, or keep the original code and run it as one big piece of code within the Celery queue?
As per the design of the system, a user of the system can submit several such celery tasks which would contain the concurrent futures code.
Any help will be appreciated.
Your library should work without modification. There's no harm in having threaded code running within Celery, unless you are mixing in gevent with non-gevent compatible code for example.
Reasons to break the code up would be for resource management (reduce memory/CPU overhead). With threading, the thing you want to monitor is CPU load. Once your concurrency causes enough load (e.g. threads doing CPU intensive work), the OS will start swapping between threads, and your processing gets slower, not faster.
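As a rough illustration of keeping the original code in one task while bounding its resource use (the task name, the process_item helper, and the pool size of 4 are assumptions made for the sketch, not the author's code):

from concurrent.futures import ThreadPoolExecutor

from celery import shared_task

def process_item(item):
    # Hypothetical per-item work standing in for the original pipeline
    return item

@shared_task
def run_legacy_pipeline(items):
    # Keep the existing concurrent.futures code as one unit, but cap the
    # pool size so several of these tasks can run side by side without
    # driving CPU load (and thread-switching overhead) through the roof.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_item, items))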
Sometimes I have a situation where the Celery queue builds up with accidental, unnecessary tasks, clogging the server. E.g. the code fires off 20,000 tasks instead of 1.
How can I inspect which tasks the Celery queue contains and then selectively get rid of certain ones?
Tasks are defined and started with the standard Celery decorators (if that matters):
from celery.task import task

@task()
def update_foobar(foo, bar):
    # Some heavy action here
    pass

update_foobar.delay(foo, bar)
Stack: Django + Celery + RabbitMQ.
Maybe you can use Flower. It's a real-time monitor for Celery with a nice web interface. I think you can shut down tasks from there. Anyway, I would try to avoid queuing those unnecessary tasks in the first place.
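For selectively removing tasks from code rather than the Flower UI, something along these lines should work (assuming a configured Celery app object named app; the name check is only an illustration):

# Ask the workers what they have prefetched (reserved) and what is running
insp = app.control.inspect()
reserved = insp.reserved() or {}
active = insp.active() or {}

# Revoke every reserved instance of the runaway task
for worker, tasks in reserved.items():
    for t in tasks:
        if t['name'].endswith('update_foobar'):
            app.control.revoke(t['id'])

Note that inspect() only sees tasks the workers have already picked up from RabbitMQ; to drop everything still sitting in the broker you can purge the queue (celery purge), but that is not selective.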
Basically I have a lot of tasks (in batches of about 1000) and the execution times of these tasks can vary widely (from less than a second to 10 minutes). I know that if a task has been executing for more than a minute I can kill it. These tasks are steps in the optimization of some data mining model (but are independent of each other) and spend most of their time inside some C extension function, so they would not cooperate if I tried to kill them gracefully.
Is there a distributed task queue that fits that scheme? AFAIK, Celery allows aborting only tasks that are willing to cooperate, but I might be wrong.
I recently asked a similar question about killing hanging functions in pure Python: Kill hanging function in Python in multithreaded environment.
I guess I could subclass the Celery task so it spawns a new process and then executes its payload, aborting the execution if it takes too long, but then I would be killed by the overhead of initializing a new interpreter.
Celery supports time limiting. You can use time limits to kill long-running tasks. Besides killing tasks outright, you can use soft limits, which let you handle the SoftTimeLimitExceeded exception inside the task and terminate it cleanly:
from celery.task import task
from celery.exceptions import SoftTimeLimitExceeded

@task
def mytask():
    try:
        do_work()
    except SoftTimeLimitExceeded:
        clean_up_in_a_hurry()
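The limits themselves can be set per task (or globally in the Celery configuration); the values below are purely illustrative:

# The soft limit raises SoftTimeLimitExceeded inside the task after 60 seconds;
# the hard limit forcefully terminates the process running it after 120 seconds.
@task(soft_time_limit=60, time_limit=120)
def mytask():
    ...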
Pistil allows managing multiple processes, including killing uncooperative tasks.
But:
it's beta software, even if it powers gunicorn which is reliable
I don't know how it handles 1000 processes
Communication between processes is not included yet, so you'll have to set up your own, using for example zeromq
Another possibility is to use the timer signal so that it raises an exception after 36000 seconds. But signals are not triggered while somebody holds the GIL, which your C program might do.
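A rough sketch of the alarm-based approach (the 60-second budget and run_model_step are placeholders; note that signal.alarm works only on Unix and only in the main thread):

import signal
import time

class TaskTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise TaskTimeout("task exceeded its time budget")

def run_model_step():
    # Hypothetical stand-in for the long-running payload
    time.sleep(120)

signal.signal(signal.SIGALRM, _on_alarm)
signal.alarm(60)          # arm a 60-second alarm (illustrative value)
try:
    run_model_step()
except TaskTimeout:
    pass                  # give up on this step
finally:
    signal.alarm(0)       # always disarm the alarm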
When you revoke a celery task, you can provide it with an optional terminate=True keyword.
task.revoke(terminate=True)
It doesn't exactly fit your requirements since it's not done by the process itself, but you should be able to either extend the task class so a task can terminate itself, or have a recurring cleanup task or process that kills off tasks that have not completed on time.
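A recurring cleanup task along those lines might look roughly like this (the 60-second budget, the app object, and the exact meaning of the worker-reported time_start value, which varies between Celery versions, are all assumptions):

import time

MAX_RUNTIME = 60  # seconds; anything running longer gets terminated

@task()
def reap_overdue_tasks():
    # Ask the workers what is currently running and kill anything overdue
    active = app.control.inspect().active() or {}
    for worker, tasks in active.items():
        for t in tasks:
            # Verify how your Celery version reports time_start before relying on it
            if time.time() - t['time_start'] > MAX_RUNTIME:
                app.control.revoke(t['id'], terminate=True)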