Running a long-running Python function from Django - python

Here's the rough workflow:
Request for a job comes in to a particular view -> Job is entered in the database -> requestProcessor() is launched independently of the current process -> Response "Job has been entered" is returned instantly ->
requestProcessor() looks at the database, sees if there are any outstanding jobs to be processed, and begins processing one. This takes ~3 hours to complete.
I've been stuck on this problem for a long time now. Should I be using multiprocessing.Pool's apply_async? I have zero experience with multiple processes, so I'm not sure what the best approach would be.

Celery is a great tool for implementing this exact type of functionality. You can use it as a "task queue", for example:
tasks.py
from celery import task

@task
def do_job(*args, **kwargs):
    """
    This is the function that does a "job".
    """
    # TODO: long-running task here
views.py
from django.shortcuts import render_to_response

from .tasks import do_job

def view(request):
    """
    This is your view.
    """
    do_job.delay(*args, **kwargs)
    return render_to_response('template.html', {'message': 'Job has been entered'})
Calling .delay will register do_job for execution by one of your celery workers but will not block execution of the view. A task is not executed until a worker becomes free, so you should not have any issues with the number of processes spawned by this approach.
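For context, a rough sketch of the wiring this assumes; the project name proj is a placeholder, and the config style below is the one used by recent Celery versions:
# proj/celery.py - minimal sketch, assuming a Django project called "proj"
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')

app = Celery('proj')
# read CELERY_* settings (e.g. the broker URL) from Django settings
app.config_from_object('django.conf:settings', namespace='CELERY')
# discover tasks.py modules in installed apps automatically
app.autodiscover_tasks()
A worker process is then started separately, e.g. with celery -A proj worker --loglevel=info; without a running worker, delayed tasks simply wait in the broker queue.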

You should be able to do this fairly easily. This is the sort of thing one might use Celery for (see Iain Shelvington's answer). To answer your question regarding how the multiprocessing module works, though, you could also simply do something like this:
from django.shortcuts import render
from multiprocessing import Process
import time

def do_job(seconds):
    """
    This is the function that will run your three-hour job.
    """
    time.sleep(seconds)  # just sleep to imitate a long job
    print('done!')  # goes to stdout, so you will see this most easily
                    # when testing on the local dev server

def test(request):
    """
    This is your view.
    """
    # In place of this comment, check the database.
    # If the job is already running, return the appropriate template.
    p = Process(target=do_job, args=(15,))  # sleep for 15 seconds
    p.start()  # but do not join
    message = 'Process started.'
    return render(request, 'test.html', {'message': message})
If you run this on your local test server, you will immediately be taken to the test page, and then in your stdout you will see done! show up 15 seconds later.
If you were to use something like this, you would also need to think about whether you need to notify the user when the job finishes, and whether to block further job requests until the first job is done. I don't think you'll want users to be able to start 500 processes haphazardly! You should check your database to see whether the job is already running; a minimal sketch of that check follows.
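For illustration only, here is one way that check could look. The Job model and its status field are hypothetical and not part of the original code:
from multiprocessing import Process
from django.shortcuts import render
from .models import Job  # hypothetical model with a `status` field

def test(request):
    # refuse to start a second job while one is still marked as running
    if Job.objects.filter(status='running').exists():
        return render(request, 'test.html', {'message': 'A job is already running.'})
    job = Job.objects.create(status='running')
    p = Process(target=do_job, args=(job.pk,))  # do_job should mark the job done when it finishes
    p.start()
    return render(request, 'test.html', {'message': 'Process started.'})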

Related

Django Rest Framework testing with python queue

I have a DRF application with a Python queue that I'm writing tests for. Somehow:
My queue thread cannot find an object that exists in the test database.
The main thread cannot destroy the test DB as it's in use by 1 other session.
To explain the use case a bit further: I use Django's user model and have a table for metadata of files that you can upload. One of these fields is created_by, a ForeignKey to django.conf.settings.AUTH_USER_MODEL. As shown below, I create a user in the TestCase's setUp(), which I then use to create an entry in the files table. The creation of this entry happens in a queue, however. During testing, this results in the error DETAIL: Key (created_by_id)=(4) is not present in table "auth_user".
When the tests are completed and the tearDown tries to destroy the test DB, I get another error: DETAIL: There is 1 other session using the database. The two seem related, and I'm probably handling the queue incorrectly.
The tests are written with Django's TestCase and run with python manage.py test.
from django.contrib.auth.models import User
from rest_framework.test import APIClient
from django.test import TestCase

class MyTest(TestCase):
    def setUp(self):
        self.client = APIClient()
        self.client.force_authenticate()
        user = User.objects.create_user('TestUser', 'test@test.test', 'testpass')
        self.client.force_authenticate(user)

    def test_failing(self):
        self.client.post('/totestapi', data={'files': [open('tmp.txt', 'rt')]})
The queue is defined in a separate file, app/queue.py.
from queue import Queue
from threading import Thread

from app.models import FileMeta

def queue_handler():
    while True:
        user, files = queue.get()
        for file in files:
            upload(file)
            FileMeta(user=user, filename=file.name).save()
        queue.task_done()

queue = Queue()
thread = Thread(target=queue_handler, daemon=True)

def start_upload_thread():
    thread.start()

def put_upload_thread(*args):
    queue.put(args)
Finally, the queue is started from app/views.py, which is always loaded when Django starts and contains all the APIs.
from rest_framework.views import APIView

from app.queue import start_upload_thread, put_upload_thread

start_upload_thread()

class ToTestAPI(APIView):
    def post(self, request):
        put_upload_thread(request.user, request.FILES.getlist('files'))
Apologies that this is not a "real" answer but it was getting longer than a comment would allow.
The new ticket looks good. I did notice that the background thread is never stopped, though; that is probably what is causing the issue with the DB still being active.
You use TestCase, which runs a db transaction and undoes all database changes when the test function ends. That means you won't be able to see data from the test case in another thread using a different connection to the database. You can see it inside your tests and views, since they share a connection.
Celery and RQ are the standard job queues - Celery is more flexible, but RQ is simpler. Start with RQ and keep things simple and isolated.
Some notes:
Pass in the PK of objects, not the whole object (see the sketch after these notes).
Read up on pickle if you do need to pass larger data.
Set the queues to async=False (run like normal code) in tests.
Queue consumers are separate processes that can run anywhere on the system, so data needs to get to them somehow. If you pass full objects, they have to be pickled (serialized) and stored in the queue itself (e.g. Redis) to be retrieved and processed. Be careful not to pass large objects this way: pass the PK, store the file in S3 or another object store, etc.
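As a rough illustration of the PK advice with Django-RQ; the function name below is a placeholder and only the FileMeta model comes from the question:
import django_rq
from app.models import FileMeta

def process_file(file_meta_pk):
    # re-fetch the row inside the worker instead of pickling the whole instance
    file_meta = FileMeta.objects.get(pk=file_meta_pk)
    # ... do the actual upload/processing here

# in the view, enqueue by primary key only
django_rq.enqueue(process_file, file_meta.pk)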
For Django-RQ I use this snippet to set the queues to sync mode when in testing, and then just run things as normal.
if IS_TESTING:
    for q in RQ_QUEUES.keys():
        RQ_QUEUES[q]['ASYNC'] = False
Good luck!
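One more option worth mentioning if you keep the thread-based queue: Django's TransactionTestCase commits test data instead of wrapping each test in a rolled-back transaction, so a separate database connection (such as the one in your queue thread) can actually see rows created in setUp(), at the cost of slower tests. A minimal sketch:
from django.test import TransactionTestCase
from rest_framework.test import APIClient

class MyTest(TransactionTestCase):
    # same setUp/test bodies as before; data created here is committed
    # and therefore visible to other DB connections
    def setUp(self):
        self.client = APIClient()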

Is reading a global collections.deque from within a Flask request safe?

I have a Flask application that is supposed to display the result of a long running function to the user on a specified route. The result changes roughly every hour. To avoid making the user wait for the result, I want to cache it somewhere in the application and re-compute it at specific intervals in the background (e.g. every hour), so that no user request ever has to wait for the long running computation.
The idea I came up with to solve this is as follows, however, I am not completely sure whether this is really "safe" to do in a production environment with a multi-threaded or even multi-processed webserver such as waitress, eventlet, gunicorn or what not.
To re-compute the result in the background, I use a BackgroundScheduler from the APScheduler library.
The result is then left-appended to a collections.deque object which is registered as a module-level variable (since, as far as I know, there is no better way to store application-wide globals in a Flask application). Since the maximum size of the deque is set to 2, old results pop out on the right side of the deque as new ones come in.
A Flask view then returns deque[0] to the requester, which should always be the newest result. I chose deque over Queue since the latter has no built-in way to read the first item without removing it.
Thus, it is guaranteed that no user ever has to wait for the result because the old one only disappears from "cache" in the very moment the new one comes in.
See below for a minimal example. When running the script and hitting http://localhost:5000, you can see the caching in action: "Job finished at" should never lag "Current time" by more than 10 seconds plus a very short recomputation delay, yet you should never have to wait the time.sleep(5) seconds from the job function before the request returns.
Is this a valid implementation for the given requirement that will also work in a production-ready WSGI server setting or should this be accomplished differently?
from flask import Flask
from apscheduler.schedulers.background import BackgroundScheduler
import time
import datetime
from collections import deque

# a global deque that is filled by APScheduler and read by a Flask view
deque = deque(maxlen=2)

# a function filling the deque that is executed at regular intervals by APScheduler
def some_long_running_job():
    print('complicated long running job started...')
    time.sleep(5)
    job_finished_at = datetime.datetime.now()
    deque.appendleft(job_finished_at)

# a function setting up the scheduler
def start_scheduler():
    scheduler = BackgroundScheduler()
    scheduler.add_job(some_long_running_job,
                      trigger='interval',
                      seconds=10,
                      next_run_time=datetime.datetime.utcnow(),
                      id='1',
                      name='Some Job name')
    scheduler.start()

# a flask application
app = Flask(__name__)

# a flask route returning an item from the global deque
@app.route('/')
def display_job_result():
    current_time = datetime.datetime.now()
    job_finished_at = deque[0]
    return '''
    Current time is: {0} <br>
    Job finished at: {1}
    '''.format(current_time, job_finished_at)

# start the scheduler and flask server
if __name__ == '__main__':
    start_scheduler()
    app.run()
Thread-safety is not enough if you run multiple processes:
Even though collections.deque is thread-safe:
Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.
Source: https://docs.python.org/3/library/collections.html#collections.deque
Depending on your configuration, your webserver might run multiple workers in multiple processes, so each of those processes has their own instance of the object.
Even with one worker, thread-safety might not be enough:
You might have selected an asynchronous worker type. The asynchronous worker won't know when it's safe to yield and your code would have to be protected against scenarios like this:
Worker for request 1 reads value a and yields
Worker for request 2 also reads value a, writes a + 1 and yields
Worker for request 1 writes value a + 1, even though it should be a + 1 + 1
Possible solutions:
Use something outside of the Flask app to store the data. This can be a database, in this case preferably an in-memory database like Redis. Or if your worker type is compatible with the multiprocessing module, you can try to use multiprocessing.managers.BaseManager to provide your Python object to all worker processes.
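For illustration, a minimal sketch of the Redis variant; the key name and connection details are assumptions. The scheduled job writes its result to Redis, and every worker process reads the same key:
import datetime
import time
import redis
from flask import Flask

r = redis.Redis(host='localhost', port=6379)
app = Flask(__name__)

def some_long_running_job():
    # executed by APScheduler (or cron / celery beat) in any process
    time.sleep(5)
    r.set('job_result', datetime.datetime.now().isoformat())

@app.route('/')
def display_job_result():
    result = r.get('job_result')  # returns bytes, or None if nothing has been computed yet
    return 'Job finished at: {}'.format(result.decode() if result else 'not yet computed')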
Are global variables thread safe in flask? How do I share data between requests?
How can I provide shared state to my Flask app with multiple workers without depending on additional software?
Store large data or a service connection per Flask session

How to do parallel execution inside flask view function

My python-flask web app runs on port 5001. I need to create an endpoint that executes all the other endpoints' view functions in parallel, aggregates their individual responses, and returns the result within the same request/response cycle.
For example,
The routes in the flask app contain the following view functions:
@app.route('/amazon')
def amazon():
    return "amazon"

@app.route('/flipkart')
def flipkart():
    return "flipkart"

@app.route('/snapdeal')
def sd():
    return "snapdeal"
Note: in the above three endpoints, a significant amount of network I/O is involved.
I am creating another endpoint from which all the other endpoint implementations have to be called collectively.
### This is my endpoint
@app.route('/all')
def do_all():
    # execute all amazon, flipkart, snapdeal implementations
    pass
I am considering two approaches for the above scenario.
Approach-1 (Multiprocessing approach):
Write the worker task as a separate function, call each worker via the multiprocessing module, and collect the responses:
import multiprocessing

def do_all():
    def worker(name):
        # do the network io task for the given source name
        pass

    jobs = []
    for name in ['amazon', 'flipkart', 'snapdeal']:
        p = multiprocessing.Process(target=worker, args=(name,))
        jobs.append(p)
        p.start()

    # Terminating all the processes explicitly, since they freeze after execution completes
    for j in jobs:
        j.terminate()

    return 200
Here I am invoking a child process for each worker; finally, all child processes are terminated explicitly, since I presume they would otherwise linger alongside the WSGI threads.
Approach-2 (grequests):
Call each endpoint explicitly using python-grequests, so each endpoint residing in the same app is called in parallel, and collect the responses:
def do_all():
    grequests.post("http://localhost:5001/amazon", data={})
    grequests.post("http://localhost:5001/flipkart", data={})
    grequests.post("http://localhost:5001/snapdeal", data={})
Each request would be served by its own WSGI thread, but I have no idea how many processes are spawned here and whether they terminate after execution.
Both approaches look similar to me, but which one would be more seamless to implement, and why? Please let me know if there is an alternative way to solve this scenario.
Maybe you could simplify it, using another approach:
Return immediate response to the User
def do_all():
    return 'amazon, ... being processed'
Invoke all relevant methods in the background.
Let the invoked background methods send signals
from blinker import Namespace

background = Namespace()
amazon_finished = background.signal('amazon-finished')

def amazon_bg():
    amazon_finished.send(**interesting)
Subscribe to the signals from the background workers.
def do_all():
    amazon_finished.connect(amazon_logic)
Display the return values of the background workers to the user. This would be a new request (maybe in the same route).
def amazon_logic(sender, **extra):
    sender
The advantage would be that the user gets an immediate response about the status of the background workers on each request, and error handling becomes much easier.
I have not tested it, so you should look up the blinker API for yourself.
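One caveat on Approach-2 from the question: grequests.post only builds an unsent request object; grequests.map is what actually fires the requests concurrently and collects the responses. A rough sketch under that assumption:
import grequests

def do_all():
    urls = ["http://localhost:5001/amazon",
            "http://localhost:5001/flipkart",
            "http://localhost:5001/snapdeal"]
    reqs = [grequests.post(u, data={}) for u in urls]
    responses = grequests.map(reqs)  # sent concurrently via gevent
    return ' | '.join(r.text for r in responses if r is not None)
Note that calling back into the same app can deadlock on a single-threaded development server, so this only works when the server runs with multiple threads or workers.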

Having error queues in celery

Is there any way in Celery by which, if a task execution fails, I can automatically put it into another queue?
For example, if the task is running in a queue x, on exception it should be enqueued to another queue named error_x.
Edit:
Currently I am using celery==3.0.13 along with Django 1.4, with RabbitMQ as the broker.
Sometimes the task fails. Is there a way in Celery to add messages to an error queue and process them later?
The problem when a Celery task fails is that I don't have access to the message queue name, so I can't use self.retry to put it in a different error queue.
Well, you cannot use the retry mechanism if you want to route the task to another queue. From the docs:
retry() can be used to re-execute the task, for example in the event
of recoverable errors.
When you call retry it will send a new message, using the same
task-id, and it will take care to make sure the message is delivered
to the same queue as the originating task.
You'll have to relaunch the task yourself and route it manually to the queue you want when an exception is raised. This seems like a good job for error callbacks.
The main issue is that we need to get the task name in the error callback to be able to relaunch it. Also, we may not want to add the callback each time we launch a task, so a decorator is a good way to automatically add the right callback.
from functools import partial, wraps

import celery

@celery.shared_task
def error_callback(task_id, task_name, retry_queue, retry_routing_key):
    # We must retrieve the task object itself.
    # `tasks` is a dict of 'task_name': celery_task_object
    task = celery.current_app.tasks[task_name]
    # Relaunch the task in the specified queue.
    task.apply_async(queue=retry_queue, routing_key=retry_routing_key)

def retrying_task(retry_queue, retry_routing_key):
    """Decorates a function to automatically add error callbacks."""
    def retrying_decorator(func):
        @celery.shared_task
        @wraps(func)  # just to keep the original task name
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Monkey patch the apply_async method to add the callback.
        wrapper.apply_async = partial(
            wrapper.apply_async,
            link_error=error_callback.s(wrapper.name, retry_queue, retry_routing_key)
        )
        return wrapper
    return retrying_decorator

# Usage:
@retrying_task(retry_queue='another_queue', retry_routing_key='another_routing_key')
def failing_task():
    print('Hi, I will fail!')
    raise Exception("I'm failing!")

failing_task.apply_async()
You can adjust the decorator to pass whatever parameters you need.
I had a similar problem and solved it, maybe not in the most efficient way, but my solution is as follows:
I created a Django model to keep all my Celery task IDs; it is capable of checking the task state.
Then I created another Celery task that runs in an infinite cycle, checks the actual state of all tasks that are 'RUNNING', and if the state is 'FAILED' simply reruns them. I'm not actually changing the queue for the tasks I rerun, but I think you could implement some custom logic to decide where to put every task you rerun this way.
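A rough sketch of what this answer describes; the model fields and rerun logic are assumptions, and a periodic task stands in for the "infinite cycle" the author mentions:
from celery import shared_task, current_app
from celery.result import AsyncResult
from django.db import models

class TaskRecord(models.Model):
    # hypothetical bookkeeping model: one row per launched task
    task_id = models.CharField(max_length=255)
    task_name = models.CharField(max_length=255)

@shared_task
def rerun_failed_tasks():
    # run periodically (e.g. via celery beat) to re-launch anything that failed
    for record in TaskRecord.objects.all():
        if AsyncResult(record.task_id).state == 'FAILURE':
            task = current_app.tasks[record.task_name]
            result = task.apply_async()  # optionally route to an error_<queue> here
            record.task_id = result.id
            record.save()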

django,fastcgi: how to manage a long running process?

I have inherited a django+fastcgi application which needs to be modified to perform a lengthy computation (up to half an hour or more). What I want to do is run the computation in the background and return a "your job has been started" -type response. While the process is running, further hits to the url should return "your job is still running" until the job finishes at which point the results of the job should be returned. Any subsequent hit on the url should return the cached result.
I'm an utter novice at Django and haven't done any significant web work in a decade, so I don't know if there's a built-in way to do what I want. I've tried starting the process via subprocess.Popen(), and that works fine except that it leaves a defunct entry in the process table. I need a clean solution that can remove temporary files and any traces of the process once it has finished.
I've also experimented with fork() and threads and have yet to come up with a viable solution. Is there a canonical solution to what seems to me to be a pretty common use case? FWIW this will only be used on an internal server with very low traffic.
I have to solve a similar problem now. It is not going to be a public site, but similarly, an internal server with low traffic.
Technical constraints:
all input data to the long running process can be supplied on its start
long running process does not require user interaction (except for the initial input to start a process)
the time of the computation is long enough so that the results cannot be served to the client in an immediate HTTP response
some sort of feedback (a sort of progress bar) from the long running process is required.
Hence, we need at least two web “views”: one to initiate the long running process, and the other, to monitor its status/collect the results.
We also need some sort of interprocess communication: send user data from the initiator (the web server, on an HTTP request) to the long running process, and then send its results to the receiver (again the web server, driven by HTTP requests). The former is easy, the latter is less obvious. Unlike in normal unix programming, the receiver is not known initially. The receiver may be a different process from the initiator, and it may start when the long running job is still in progress or is already finished. So pipes do not work and we need some permanence of the results of the long running process.
I see two possible solutions:
dispatch launches of the long running processes to the long running job manager (this is probably what the above-mentioned django-queue-service is);
save the results permanently, either in a file or in DB.
I preferred to use temporary files and to remember their location in the session data. I don't think it can be made simpler.
A job script (this is the long running process), myjob.py:
import sys
from time import sleep

i = 0
while i < 1000:
    print 'myjob:', i
    i = i + 1
    sleep(0.1)
    sys.stdout.flush()
django urls.py mapping:
urlpatterns = patterns('',
    (r'^startjob/$', 'mysite.myapp.views.startjob'),
    (r'^showjob/$', 'mysite.myapp.views.showjob'),
    (r'^rmjob/$', 'mysite.myapp.views.rmjob'),
)
django views:
from tempfile import mkstemp
from os import fdopen, unlink, kill
from subprocess import Popen
import signal
from django.http import HttpResponse, HttpResponseRedirect

def startjob(request):
    """Start a new long running process unless already started."""
    if not request.session.has_key('job'):
        # create a temporary file to save the results
        outfd, outname = mkstemp()
        request.session['jobfile'] = outname
        outfile = fdopen(outfd, 'a+')
        proc = Popen("python myjob.py", shell=True, stdout=outfile)
        # remember pid to terminate the job later
        request.session['job'] = proc.pid
    return HttpResponse('A new job has started.')

def showjob(request):
    """Show the last result of the running job."""
    if not request.session.has_key('job'):
        return HttpResponse('Not running a job.' +
                            'Start a new one?')
    else:
        filename = request.session['jobfile']
        results = open(filename)
        lines = results.readlines()
        try:
            return HttpResponse(lines[-1] +
                                '<p>Terminate?')
        except:
            return HttpResponse('No results yet.' +
                                '<p>Terminate?')

def rmjob(request):
    """Terminate the running job."""
    if request.session.has_key('job'):
        job = request.session['job']
        filename = request.session['jobfile']
        try:
            kill(job, signal.SIGKILL)  # unix only
            unlink(filename)
        except OSError, e:
            pass  # probably the job has finished already
        del request.session['job']
        del request.session['jobfile']
    return HttpResponseRedirect('/startjob/')  # start a new one
Maybe you could look at the problem the other way around.
Maybe you could try DjangoQueueService, and have a "daemon" listening to the queue, seeing if there's something new and processing it.
