Questions about Django thread safety - Python

I have a Django app which is used for managing registrations for a survey.
There is a fixed number of slots, and I want to "reserve" slots for users when they sign up.
In one of my views, I get the next available slot and reserve it (or redirect the user if there are no slots available).
I want to protect against the case where two users signing up at the same time get registered for the same slot, because the method "get_next_available_slot" returned the same slot for both users.
For this I am trying to understand the use of processes and threads with Django's views.
1) Is each request handled in a separate thread, and will using the Python threading module's Lock() take care of exclusive access?
2) I am running Apache on RHEL with mod_wsgi. How do I configure Apache/mod_wsgi so that handling the above situation stays easy and simple?
I am not expecting a huge load on the web application at all, so I would prefer a simple solution over a high-performance one.

You should not make assumptions about the thread/process setup of a Django application, because it depends on the web server you're using and on how Django is integrated with it. Interprocess communication therefore should not rely on those details. One good solution is to use the built-in cache framework for locks and shared data.
Here's a good example of a cache lock that ensures only one instance of a Celery task runs at a time. You can apply the same pattern to regular requests as well.
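To illustrate, here is a minimal sketch of that pattern applied directly to the signup view, assuming a hypothetical get_next_available_slot() helper and a shared cache backend such as memcached (a per-process local-memory cache would not protect anything across workers). cache.add() only sets the key if it does not already exist, and does so atomically, which is what makes it usable as a lock:

from django.core.cache import cache

LOCK_EXPIRE = 60  # seconds; the lock self-expires if a request crashes mid-way

def reserve_slot(user):
    # at most one request at a time gets True from cache.add()
    if not cache.add('slot-reservation-lock', 'locked', LOCK_EXPIRE):
        return None  # another signup is in progress; retry or redirect
    try:
        slot = get_next_available_slot()  # hypothetical helper from the question
        if slot is not None:
            slot.user = user
            slot.save()
        return slot
    finally:
        cache.delete('slot-reservation-lock')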

You shouldn't be worrying about this kind of stuff.
These slots are stored in a database, right? The database should handle all the locking mechanisms for you; just make sure you run everything inside a transaction and you will be fine.
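For example, here is a minimal sketch of a row-level lock, assuming Django 1.6+ (for transaction.atomic()) and a hypothetical Slot model with a nullable user foreign key:

from django.db import transaction

def reserve_next_slot(user):
    with transaction.atomic():
        # select_for_update() makes the database lock the selected row until
        # the transaction commits, so two concurrent requests can never both
        # claim the same slot
        slot = (Slot.objects
                    .select_for_update()
                    .filter(user__isnull=True)
                    .order_by('id')
                    .first())
        if slot is None:
            return None  # no free slots left: redirect the user
        slot.user = user
        slot.save()
        return slot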

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context, along with my first thought on the communication matter, hoping for advice or another approach to take.
I have an already developed Python application that uses multiprocessing.Pool to run on-demand tasks. The main reason for using a pool of workers is that I need to share several objects between them.
On top of that, I want to have a Flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using Flask with Python's multiprocessing module. I'm still a bit confused, but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from Flask and what my options are.
This answer shows an uWSGI feature for managing daemons/services. I want to follow this approach so I can use my already developed Python application as a service of the Flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the Python app from there?
Another problem is the communication between the Flask app and the daemon process/service. My first thought is to use Flask-SocketIO to communicate, but then, if my server stops, I need to deal with the connection... Is this a good way to communicate between server and service? What other solutions are possible?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app in which users perform actions that should trigger specific tasks in the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because creating the objects in question takes too long. The creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I am trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distributing work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little time spent learning Celery for not having to write a bunch of coordination code.
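A minimal sketch of that setup, assuming a pickled scikit-learn model on disk (the broker URL, file name and task signature are made up):

# tasks.py
import pickle
from celery import Celery

app = Celery('tasks', broker='amqp://localhost', backend='rpc://')

_model = None  # one lazily loaded copy per worker process

def get_model():
    global _model
    if _model is None:
        # the ~3 second load, paid once per worker instead of once per task
        with open('model.pkl', 'rb') as f:
            _model = pickle.load(f)
    return _model

@app.task
def predict(documents):
    # return plain lists so the result is serializable by the result backend
    return get_model().predict(documents).tolist()

The Flask endpoint then shrinks to task = predict.delay(docs) followed by result = task.get(), with celery -A tasks worker started once per machine.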

Flask/RabbitMQ: consume messages on a separate thread from the app

I have a Flask microservice which serves user requests by an endpoint (say): /getdata
The data can be fetched in one of two ways: 1) from the cache, or 2) from the database directly, if the cache is in the process of being updated.
Another service updates the database (thus making the cache stale). Once that service is done updating the database, it publishes a message to RabbitMQ stating: "update done".
Back to the microservice: I'd like it to have two threads:
Thread 1: runs the app.run()
Thread 2: subscribes to the queue - where "update done" messages are published
Given the two threads, I don't want /getdata to fetch data from the cache while the cache is being updated. At the same time, I don't want to update the cache while data is being fetched from the endpoint.
Here's one solution I can think of:
1) Have a threading.Lock() as a "global"
2) /getdata checks if the lock is available; if so, it acquires it, fetches the data from the cache and releases the lock. If the lock is unavailable, it fetches the data from the database directly, thereby incurring a performance hit but still getting the "latest" data
3) The RabbitMQ "subscriber" checks whether the lock is available; if so, it acquires the lock, updates the cache from the database and releases the lock. If not, it adds the request to a local "queue" and waits, say, one minute before trying to acquire the lock again. When it succeeds, it pops the first item from the queue and updates the cache from the database.
My questions:
1) Given the multitude of libraries and options in Python/Flask, is there a library that allows me to do a task like this in a "safe" way? (I am using pika for RabbitMQ access.)
2) Is it possible to launch the Flask app.run() via one thread and the queue subscriber via another (i.e., in if __name__ == "__main__":)?
3) How do I declare a "global" threading.Lock() which can coordinate the two threads?
Notes:
I expect that in the worst case the lock won't be held for more than one minute.
1) Pika is not thread-safe. You should avoid sharing the connection object across Flask's contexts. Writing your own Flask plugin wouldn't take that much boilerplate, though; it would be very similar to the example plugin in the documentation. Otherwise, a quick search for flask pika on a search engine turns up some existing plugins for this purpose. I have not tried them and they don't seem very popular, but maybe you should give them a go?
2) I don't see why it wouldn't be possible; Flask knows how to deal with this. However, I reckon it would severely degrade performance. Moreover, you might hit some corner cases if the plugins you use are not perfectly written.
3) Just like you would declare any lock for threading, nothing special. Put it at the module level (not in Flask's context) so that it is global; that's it.
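To illustrate points 2) and 3) anyway, a minimal sketch (fetch_from_cache, fetch_from_database and the pika consumer loop are hypothetical stand-ins):

import threading
from flask import Flask

app = Flask(__name__)
cache_lock = threading.Lock()  # module level: both threads see the same object

@app.route('/getdata')
def getdata():
    # non-blocking acquire: fall back to the database instead of waiting
    if cache_lock.acquire(blocking=False):
        try:
            return fetch_from_cache()
        finally:
            cache_lock.release()
    return fetch_from_database()  # slower, but guaranteed fresh

def consume_updates():
    # hypothetical pika consumer loop; it would take cache_lock around each
    # cache rebuild triggered by an "update done" message
    ...

if __name__ == "__main__":
    threading.Thread(target=consume_updates, daemon=True).start()
    app.run()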
That being said, I think you shouldn't proceed this way. You should rather run the update job in a different process from the web server (using the Flask CLI or whatever else, if you need to re-use some functions). It will be better performance-wise, it's easier to reason about, and it's more loosely coupled.
Also, avoid locking headaches as long as you possibly can. Believe me, locks are a real source of problems: a nightmare to test properly, to debug and to maintain, and quite risky in real production use cases. And if you really, really need a lock, don't hold it for one minute; that's way too long.
I don't know your exact requirements, but there surely is a solution that is OK and that does not involve such complexity.

Concurrency doubts in Django

I'm developing a website with Django 1.5.1 and I have two doubts regarding concurrency. Right now I'm running on the development server.
1) When multiple users access the website at the same time, does Django by default run each request in a different execution thread? Or must that be configured in the web server, e.g. Apache?
2) Will I experience issues if more than one user is modifying the same object concurrently? If so, how do you solve this problem? Using locks?
Thanks for your help!
1) It's web-server-specific. If you configure the server to run requests in separate processes, each request will be handled in a new process; if you configure it to use threads, requests will be handled in threads.
2) Yes. Imagine the case where user1 is viewing/editing an object A (retrieved from the DB), user2 deletes that object, and then user1 tries to save it. You need to handle such cases explicitly in your code.
Most likely the issues will be related to the DB, so you can use transactions to help in some cases.
In other cases, you can define a strategy. E.g., in the case mentioned above, when user1 tries to save the object and it's no longer in the DB, you can just create it.
1) Web-server-specific.
2) Take a look at django-concurrency. It handles concurrent editing using the optimistic concurrency control pattern.
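For reference, the underlying pattern can be hand-rolled in a few lines; this is a sketch of the idea, not django-concurrency's actual API. Store a version number alongside the data and refuse writes based on a stale version:

from django.db import models

class Article(models.Model):
    body = models.TextField()
    version = models.IntegerField(default=0)

def save_body(article):
    # the UPDATE only matches if nobody bumped the version since we read it
    updated = Article.objects.filter(
        pk=article.pk, version=article.version,
    ).update(body=article.body, version=article.version + 1)
    if not updated:
        raise RuntimeError('concurrent edit detected: reload and retry')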

Controlling a Twisted Server from Django

I'm trying to build a Twisted/Django mashup that will let me control various client connections managed by a Twisted server via Django's admin interface. Meaning, I want to be able to log in to Django's admin and see what protocols are currently in use, along with any details specific to each connection (e.g., if the server is connected to Freenode via IRC, it should list all the channels currently joined), and I want to be able to disconnect or connect new clients by modifying or creating database records.
What would be the best way to do this? There are lots of posts out there about combining Django with Twisted, but I haven't found any prior art for doing quite what I've outlined. All the Twisted examples I've seen use hardcoded connection parameters, which makes it difficult for me to imagine how I would dynamically run reactor.connectTCP(...) or loseConnection(...) when signalled by a record in the database.
My strategy is to create a custom ClientFactory that polls the Django-managed database every N seconds for commands, modifies/creates/deletes connections as appropriate, and reflects the new status in the database when complete.
Does this seem feasible? Is there a better approach? Does anyone know of any existing projects that implement similar functionality?
Polling the database is lame, but unfortunately databases rarely have good tools (and certainly no database-portable tools) for monitoring changes. So your approach might be okay.
However, if your app is in Django, you're not expecting arbitrary changes to the database from other (non-Django) clients, and your WSGI container is Twisted, then you can do this very simply by calling callFromThread(connectTCP, ...).
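A minimal sketch of that call, assuming the Django view really does run in a thread of the same process as the reactor:

from twisted.internet import reactor

def connect_from_view(host, port, factory):
    # reactor methods are not thread-safe; callFromThread schedules the
    # connectTCP call to run on the reactor thread
    reactor.callFromThread(reactor.connectTCP, host, port, factory)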
I've been working on yet another way of combining Django and Twisted. Feel free to give it a try: https://github.com/kowalski/featdjango.
The way it works is slightly different from the others. It starts a Twisted application and an HTTP site. Requests made to Django are processed inside a special thread pool. What makes it special is that these threads can wait on Deferreds, which makes it easy to combine synchronous Django application code with asynchronous Twisted code.
The reason I came up with a structure like this is that my application needs to perform a lot of HTTP requests from inside the Django views. Instead of performing them one by one, I can delegate all of them at once to "the main application thread", which runs Twisted, and wait for them. The similarity to your problem is that I also have an asynchronous component which is a singleton, and I access it from Django views.
For example, this is how you would initiate the Twisted component and later get a reference to it from a view.
import threading

from django.conf import settings

_initiate_lock = threading.Lock()

def get_component():
    # double-checked locking: cheap check first, re-check under the lock
    if not hasattr(settings, 'YOUR_CLIENT'):
        _initiate_lock.acquire()
        try:
            # another thread might have done our job while we
            # were waiting for the lock
            if not hasattr(settings, 'YOUR_CLIENT'):
                client = YourComponent(**whatever)
                threading.current_thread().wait_for_deferred(
                    client.initiate)
                settings.YOUR_CLIENT = client
        finally:
            _initiate_lock.release()
    return settings.YOUR_CLIENT
The code above initiates my client and calls its initiate method. That method is asynchronous and returns a Deferred; I do all the necessary setup in there. The Django thread will wait for it to finish before proceeding to the next line.
This is how I do it because I only access the component from the request handler. You would probably want to initiate your component at startup instead, so it can call listenTCP/listenSSL. Then your Django request handlers could get the data about the connections just by accessing public methods on your client. These methods could even return a Deferred, in which case you should use wait_for_deferred() to call them.

Concurrency handling in python based webapp

I am developing a web app with Flask, Python, SQLAlchemy and PostgreSQL.
My question is about concurrency handling in this app.
How I wrote the app:
Take the example of adding a user to the database. I post the form and a view is called. I process all the form data and then call add_user(*arg), which uses SQLAlchemy code to insert the user into the database and returns on successful execution, and then I return the response from the view.
What I assumed:
OK, now I assumed that my web server (which I have not decided on yet) will spawn either a thread or a process if two users try to sign up at the same time, and will handle all the concurrency requirements.
Do I need to write threaded code here? By threaded code I mean acquiring a lock before a write and releasing it afterwards.
I am pretty new to web development and multithreading/multiprocessing programming, and would like some guidance on how to write a web app that handles concurrency well.
Is writing concurrency handling from the start the right approach, or is it a concern that should come up only once a large number of concurrent users are using the web app? Even if it should be done later, I would like some pointers about it.
Basically, I have no idea about the concurrency part of web app development. If you can point me to resources from which I can learn more about it, that would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy scoped sessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases you won't let your objects live longer than a request anyway, so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application (which you most likely don't need to do), and if those objects communicate or share state with request-handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.).
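As an illustration, a minimal sketch of thread-safe session handling with plain SQLAlchemy (the model, route and connection string are made up):

from flask import Flask
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

app = Flask(__name__)
engine = create_engine('postgresql:///mydb')
Session = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

@app.route('/adduser/<name>', methods=['POST'])
def add_user(name):
    session = Session()            # each thread transparently gets its own session
    session.add(User(name=name))
    session.commit()
    return 'ok'

@app.teardown_appcontext
def cleanup(exc=None):
    Session.remove()               # hand the session back at the end of the request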
