I have a DRF application with a Python queue that I'm writing tests for. Somehow, my queue thread cannot find an object that exists in the test database, and the main thread cannot destroy the test database because it is in use by one other session.
To explain the use case a bit further: I use Django's user model and have a table for the metadata of files that you can upload. One of these fields is created_by, a ForeignKey to django.conf.settings.AUTH_USER_MODEL. As shown below, I create a user in the TestCase's setUp(), which I then use to create an entry in the FileMeta table. The creation of this entry happens in a queue, however. During testing, this results in the error DETAIL: Key (created_by_id)=(4) is not present in table "auth_user".
When the tests are completed and the tearDown tries to destroy the test DB, I get another error: DETAIL: There is 1 other session using the database. The two seem related, and I'm probably handling the queue incorrectly.
The tests are written with Django's TestCase and run with python manage.py test.
from django.contrib.auth.models import User
from django.test import TestCase
from rest_framework.test import APIClient


class MyTest(TestCase):
    def setUp(self):
        self.client = APIClient()
        self.client.force_authenticate()
        user = User.objects.create_user('TestUser', 'test@test.test', 'testpass')
        self.client.force_authenticate(user)

    def test_failing(self):
        self.client.post('/totestapi', data={'files': [open('tmp.txt', 'rt')]})
The queue is defined in a separate file, app/queue.py.
from queue import Queue
from threading import Thread

from app.models import FileMeta


def queue_handler():
    while True:
        user, files = queue.get()
        for file in files:
            upload(file)
            FileMeta(user=user, filename=file.name).save()
        queue.task_done()


queue = Queue()
thread = Thread(target=queue_handler, daemon=True)


def start_upload_thread():
    thread.start()


def put_upload_thread(*args):
    queue.put(args)
Finally, the queue is started from app/views.py, which is always loaded when Django starts and contains all the APIs.
from rest_framework.views import APIView

from app.queue import start_upload_thread, put_upload_thread

start_upload_thread()


class ToTestAPI(APIView):
    def post(self, request):
        put_upload_thread(request.user, request.FILES.getlist('files'))
Apologies that this is not a "real" answer, but it was getting longer than a comment would allow.
The new ticket looks good. I did notice that the background thread is never stopped, as you noted. That is probably what is causing the issue with the database still being in use.
You use TestCase, which wraps each test in a database transaction and rolls back all database changes when the test function ends. That means data created in the test won't be visible to another thread that uses a different connection to the database. You can see it inside your tests and views, since they share a connection.
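If the background thread really does need to see the test data over its own connection, one option (my addition, not part of this answer's recommendation) is Django's TransactionTestCase, which commits real transactions and truncates tables afterwards, so rows created in setUp() are visible to other connections. A minimal sketch, reusing the test from the question:
from django.contrib.auth.models import User
from django.test import TransactionTestCase
from rest_framework.test import APIClient


class MyThreadedTest(TransactionTestCase):
    # Sketch only: TransactionTestCase commits for real, so a worker thread's
    # own database connection can see the user created below.
    def setUp(self):
        self.client = APIClient()
        user = User.objects.create_user('TestUser', 'test@test.test', 'testpass')
        self.client.force_authenticate(user)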
Celery and RQ are the standard job queues - Celery is more flexible, but RQ is simpler. Start with RQ and keep things simple and isolated.
Some notes:
Pass in the PK of objects, not the whole object.
Read up on pickle if you do need to pass larger data.
Set the queues to async=False (run like normal code) in tests.
Queue consumers run as separate processes, possibly anywhere in the system, so data needs to get to them somehow. If you use full objects, they need to be pickled (serialized) and stored in the queue backend itself (e.g. Redis) to be retrieved and processed. Just be careful not to pass large objects this way - use the PK, store the file somewhere like S3 or another object storage, etc.
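A minimal sketch of the "pass the PK, not the object" advice, assuming a Django-RQ style job function and the FileMeta model from the question (the function names and the queue name are illustrative):
import django_rq

from app.models import FileMeta


def process_file_meta(file_meta_pk):
    # Re-load the object inside the worker from its primary key.
    file_meta = FileMeta.objects.get(pk=file_meta_pk)
    # ... do the slow work with file_meta here ...


def enqueue_file_meta(file_meta):
    # Enqueue only the PK; the worker may live in a different process.
    django_rq.get_queue('default').enqueue(process_file_meta, file_meta.pk)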
For Django-RQ I use this snippet to set the queues to sync mode when in testing, and then just run things as normal.
if IS_TESTING:
    for q in RQ_QUEUES.keys():
        RQ_QUEUES[q]['ASYNC'] = False
Good luck!
I am writing a Flask application and I am trying to add a multi-threaded implementation for certain server-related features. I noticed this weird behavior, so I wanted to understand why it is happening and how to solve it. I have the following code:
from flask import Blueprint
from flask_login import current_user, login_required
import threading

posts = Blueprint('posts', __name__)


@posts.route("/foo")
@login_required
def foo():
    print(current_user)
    thread = threading.Thread(target=goo)
    thread.start()
    thread.join()
    return ""


def goo():
    print(current_user)
    # ...
The main thread correctly prints the current_user, while the child thread prints None:
User('Username1', 'email1@email.com', 'Username1-ProfilePic.jpg')
None
Why is this happening? How can I obtain the current_user in the child thread as well? I tried passing it as an argument to goo, but I still get the same behavior.
I found this post but I can't understand how to ensure the context is not changing in this situation, so I tried providing a simpler example.
A partially working workaround
I also tried passing, as a parameter, a newly created User object populated with the data from current_user:
def foo():
    # ...
    user = User.query.filter_by(username=current_user.username).first_or_404()
    thread = threading.Thread(target=goo, args=[user])
    # ...


def goo(user):
    print(user)
    # ...
This correctly prints the information of the current user, but since I am also performing database operations inside goo, I get the following error:
RuntimeError: No application found. Either work inside a view function
or push an application context. See
http://flask-sqlalchemy.pocoo.org/contexts/.
So, as I suspected, I assume it's a problem with the application context.
I also tried inserting this inside goo, as suggested by the error:
def goo():
    from myapp import create_app
    app = create_app()
    app.app_context().push()
    # ... database access
But I still get the same error, and if I try to print current_user I get None.
How can I pass the old context to the new thread? Or should I create a new one?
This is because Flask uses thread-local variables to store the request state for each request's thread. That simplifies things in many cases, but it makes it hard to use multiple threads. See https://flask.palletsprojects.com/en/1.1.x/design/#thread-local.
If you want to use multiple threads to handle a single request, Flask might not be the best choice. You can always interact with Flask exclusively on the initial thread if you want and then forward anything you need on other threads back and forth yourself through a shared object of some kind. For database access on secondary threads, you can use a thread-safe database library with multiple threads as long as Flask isn't involved in its usage.
In summary, treat Flask as single-threaded. Any extra threads shouldn't interact directly with Flask, to avoid problems. You can also consider either not using threads at all and running everything sequentially, or trying e.g. Tornado and asyncio for easier concurrency with coroutines, depending on your needs.
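One concrete way to do that hand-off (a sketch, not part of the answer above; posts, goo and User come from the question): copy plain values such as the user's ID out of the proxies on the request thread, pass the real application object along, and push an app context inside the worker thread before touching the database.
import threading

from flask import current_app
from flask_login import current_user, login_required


@posts.route("/foo")
@login_required
def foo():
    app = current_app._get_current_object()  # the real app object, not the thread-local proxy
    user_id = current_user.id                # a plain value, safe to hand to another thread
    thread = threading.Thread(target=goo, args=(app, user_id))
    thread.start()
    thread.join()
    return ""


def goo(app, user_id):
    with app.app_context():                  # makes database access work in this thread
        user = User.query.get(user_id)       # re-load the user over this thread's connection
        print(user)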
Your server serves multiple users, and each request is handled in its own thread.
flask_login was not designed for extra threads spawned inside a request; that's why the child thread prints None.
I suggest using the database to pass data around, and running an additional Docker container if you need a separate process.
That is because current_user is implemented as a thread-local resource:
https://github.com/maxcountryman/flask-login/blob/main/flask_login/utils.py#L26
Read:
https://werkzeug.palletsprojects.com/en/1.0.x/local/#module-werkzeug.local
Here's the rough workflow:
Request for a job comes in to a particular view -> Job entered in Database -> requestProcessor() launched independent of current process -> Response "Job has been entered" is returned instantly ->
requestProcessor() looks at the database, sees if there are any outstanding Jobs to be processed, and begins processing it. Takes ~3 hours to complete.
I've been confused by this problem for a long time now. Should I be using multiprocessing.Pool's apply_async? I have zero experience with multiple processes, so I'm not sure what the best approach would be.
Celery is a great tool for implementing this exact type of functionality. You can use it as a "task queue", for example:
tasks.py
from celery import task


@task
def do_job(*args, **kwargs):
    """
    This is the function that does a "job".
    """
    # TODO: Long running task here
views.py
from django.shortcuts import render_to_response

from .tasks import do_job


def view(request):
    """
    This is your view.
    """
    do_job.delay(*args, **kwargs)
    return render_to_response('template.html', {'message': 'Job has been entered'})
Calling .delay will register do_job for execution by one of your celery workers but will not block execution of the view. A task is not executed until a worker becomes free, so you should not have any issues with the number of processes spawned by this approach.
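For completeness, a hedged usage sketch (the argument name and project name are assumptions, not from the answer): pass only plain, serializable arguments to .delay, and run a separate worker process that picks the task up.
# In a view or a shell: returns immediately, the work happens in a worker.
do_job.delay(job_id=42)

# The worker runs as its own process, started separately, e.g.:
#   celery -A yourproject worker --loglevel=info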
You should be able to do this fairly easily. This is the sort of thing one might use Celery for (see Iain Shelvington's answer). To answer your question regarding how the multiprocessing module works, though, you could also simply do something like this:
from django.shortcuts import render
from multiprocessing import Process
import time


def do_job(seconds):
    """
    This is the function that will run your three-hour job.
    """
    time.sleep(seconds)  # just sleep to imitate a long job
    print('done!')       # goes to stdout, so you will see this most easily
                         # when testing on the local server


def test(request):
    """
    This is your view.
    """
    # In place of this comment, check the database.
    # If a job is already running, return the appropriate template.
    p = Process(target=do_job, args=(15,))  # sleep for 15 seconds
    p.start()  # but do not join
    message = 'Process started.'
    return render(request, 'test.html', {'message': message})
If you run this on your local test server, you will immediately be taken to the test page, and then in your stdout you will see done! show up 15 seconds later.
If you were to use something like this, you would also need to think about whether or not you will need to notify the user when the job finishes. In addition, you would need to think about whether or not to block further job requests until the first job is done. I don't think you'll want users to be able to start 500 processes haphazardly! You should check your database processes to see if the job is already running.
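A rough sketch of that database check (the Job model and its status field are hypothetical, not part of the answer; do_job is the function from the example above):
from django.shortcuts import render
from multiprocessing import Process

from .models import Job  # hypothetical model with a `status` CharField


def test(request):
    # Refuse to start a second job while one is still marked as running.
    if Job.objects.filter(status='running').exists():
        return render(request, 'test.html', {'message': 'A job is already running.'})

    Job.objects.create(status='running')
    p = Process(target=do_job, args=(15,))
    p.start()  # do not join; the view returns immediately
    return render(request, 'test.html', {'message': 'Process started.'})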
I'm using the Django database as the broker instead of RabbitMQ, for concurrency reasons.
But I can't solve the problem of revoking a task before it executes.
I found some answers about this, but they don't seem complete, or I can't get enough help from them.
first answer
second answer
How can I extend the Celery task table using a model, adding a boolean field (revoked) that I set when I don't want the task to execute?
Thanks.
Since Celery tracks tasks by an ID, all you really need is to be able to tell which IDs have been canceled. Rather than modifying kombu internals, you can create your own table (or memcached etc) that just tracks canceled IDs, then check whether the ID for the current cancelable task is in it.
This is what the transports that support a remote revoke command do internally:
All worker nodes keeps a memory of revoked task ids, either in-memory
or persistent on disk (see Persistent revokes). (from Celery docs)
When you use the django transport, you are responsible for doing this yourself. In this case it's up to each task to check whether it has been canceled.
So the basic form of your task (logging added in place of an actual operation) becomes:
from celery import shared_task
from celery.exceptions import Ignore
from celery.utils.log import get_task_logger

from .models import task_canceled

logger = get_task_logger(__name__)


@shared_task
def my_task():
    if task_canceled(my_task.request.id):
        raise Ignore()
    logger.info("Doing my stuff")
You can extend & improve this in various ways, such as by creating a base CancelableTask class as in one of the other answers you linked to, but this is the basic form. What you're missing now is the model and the function to check it.
Note that the ID in this case will be a string ID like a5644f08-7d30-43ff-a61e-81c165ad9e19, not an integer. Your model can be as simple as this:
from django.db import models


class CanceledTask(models.Model):
    task_id = models.CharField(max_length=200)


def cancel_task(request_id):
    CanceledTask.objects.create(task_id=request_id)


def task_canceled(request_id):
    return CanceledTask.objects.filter(task_id=request_id).exists()
You can now check the behavior by watching your celery service's debug logs while doing things like:
my_task.delay()
models.cancel_task(my_task.delay())
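The base-class variant mentioned above could look roughly like this (a sketch only, reusing the task_canceled helper; the class and task names are illustrative):
from celery import Task, shared_task
from celery.exceptions import Ignore

from .models import task_canceled


class CancelableTask(Task):
    def __call__(self, *args, **kwargs):
        # Skip execution entirely if this task id has been canceled.
        if task_canceled(self.request.id):
            raise Ignore()
        return super().__call__(*args, **kwargs)


@shared_task(base=CancelableTask)
def my_cancelable_task():
    ...  # the actual work goes here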
I'm writing a web app. Users can post text, and I need to store it in my DB as well as sync it to a Twitter account.
The problem is that I'd like to respond to the user immediately after inserting the message into the DB, and run the "sync to Twitter" step in the background.
How can I do that? Thanks
Either you go with zrxq's solution, or you can do it with a thread, provided you take care of two things:
you don't tamper with objects from the main thread (be careful with iterators),
you take good care of killing your thread once the job is done.
Something that would look like:
import threading


class TwitterThreadQueue(threading.Thread):
    queue = []

    def run(self):
        while len(self.queue) != 0:
            post_on_twitter(self.queue.pop())  # here is your code to post on twitter

    def add_to_queue(self, msg):
        self.queue.append(msg)
And then you instantiate it in your code:
tweetQueue = TwitterThreadQueue()
# ...
tweetQueue.add_to_queue(message)
tweetQueue.start() # you can check if it's not already started
# ...
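If you want the worker to keep running and still shut down cleanly once the job is done, here is a sketch using the standard queue module (post_on_twitter is your existing posting code; the sentinel approach is my addition, not from the answer above):
import queue
import threading

tweet_queue = queue.Queue()
STOP = object()  # sentinel that tells the worker to exit


def tweet_worker():
    while True:
        msg = tweet_queue.get()   # blocks until a message arrives
        if msg is STOP:
            break
        post_on_twitter(msg)      # your existing code to post on Twitter
        tweet_queue.task_done()


worker = threading.Thread(target=tweet_worker, daemon=True)
worker.start()

# From the request handler:
# tweet_queue.put(message)
# At shutdown:
# tweet_queue.put(STOP)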
I'm writing an application with python and sqlalchemy-0.7. It starts by initializing the sqlalchemy orm (using declarative) and then it starts a multithreaded web server - I'm currently using web.py for rapid prototyping but that could change in the future. I will also add other "threads" for scheduled jobs and so on, probably using other python threads.
From SA documentation I understand I have to use scoped_session() to get a thread-local session, so my web.py app should end up looking something like:
import web

from myapp.model import Session  # scoped_session(sessionmaker(bind=engine))
from myapp.model import This, That, AndSoOn

urls = blah...
app = web.application(urls, globals())


class index:
    def GET(self):
        s = Session()
        # get stuff done
        Session.remove()
        return(stuff)


class foo:
    def GET(self):
        s = Session()
        # get stuff done
        Session.remove()
        return(stuff)
Is that the Right Way to handle the session?
As far as I understand, I should get a scoped_session at every method since it'll give me a thread local session that I could not obtain beforehand (like at the module level).
Also, I should call .remove() or .commit() or something like them at every method end, otherwise the session will still contain Persistent objects and I would not be able to query/access the same objects in other threads?
If that pattern is the correct one, it could probably be made better by writing it only once, maybe using a decorator? Such a decorator could get the session, invoke the method and then make sure to dispose the session properly. How would that pass the session to the decorated function?
Yes, this is the right way.
Example:
The Flask microframework with the Flask-SQLAlchemy extension does what you described. It also calls .remove() automatically at the end of each HTTP request ("view" functions), so the session is released by the current thread. Calling just .commit() is not sufficient; you should also use .remove().
When not using Flask views, I usually use a "with" statement:
from contextlib import contextmanager


@contextmanager
def get_db_session():
    try:
        yield session
    finally:
        session.remove()


with get_db_session() as session:
    # do something with session
    ...
You can create a similar decorator.
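A sketch of such a decorator (assuming the same scoped Session registry as in the question; the keyword argument name is illustrative):
from functools import wraps

from myapp.model import Session  # scoped_session(sessionmaker(bind=engine))


def with_session(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        session = Session()              # thread-local session from the registry
        try:
            return func(*args, session=session, **kwargs)
        finally:
            Session.remove()             # release the session held by this thread
    return wrapper


class index:
    @with_session
    def GET(self, session):
        # use the injected session, e.g. session.query(This).all()
        return "stuff"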
The scoped session reuses the engine's connection pool, so this approach is faster than opening and closing a connection for each HTTP request. It also works nicely with greenlets (gevent or eventlet).
You don't need to create a scoped session if you create a new session for each request and each request is handled by a single thread.
You have to call s.commit() to make pending objects persistent, i.e. to save changes to the database.
You may also want to close the session by calling s.close().
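A minimal sketch of that non-scoped, per-request pattern (assuming a plain sessionmaker bound to the same engine as in the question):
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)  # plain factory instead of scoped_session


class index:
    def GET(self):
        s = Session()
        try:
            # ... get stuff done ...
            s.commit()   # make pending objects persistent
            return "stuff"
        finally:
            s.close()    # give the connection back to the pool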