Concurrency doubts in Django

I'm developing a website with Django 1.5.1 and I have two questions about concurrency. For now I'm running on the development server.
1) When multiple users access the website at the same time, does Django by default run each request in a separate thread? Or must that be configured in the web server, e.g. Apache?
2) Will I experience issues if more than one user modifies the same object concurrently? If so, how do you solve this problem? With locks?
Thanks for your help!

1) It's web server specific. If you configure the server to run separate processes, each request is handled in a separate process. If you configure it to use threads, requests are handled in threads.
2) Yes. Imagine the case where user1 is viewing/editing an object A (retrieved from the DB), user2 deletes that object, and then user1 tries to save it. You need to handle such cases explicitly in your code.
Most likely the issues will be related to the DB, so you can use transactions to help in some cases.
In other cases, you can define a strategy. E.g. in the case mentioned above, when user1 tries to save the object and it's no longer in the DB, you can just create a new one.

1) It's web server specific.
2) Take a look at django-concurrency. It handles concurrent editing using the optimistic concurrency control pattern.
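
To illustrate the optimistic concurrency control pattern in general terms, here is a minimal sketch using only the standard library's sqlite3 module. It is not django-concurrency's API; the `article` table, the `version` column and the `save_with_version_check` helper are all assumptions made up for the example. The idea is that every row carries a version number, and a save only succeeds if the version is still the one the user originally read.

```python
import sqlite3


class StaleObjectError(Exception):
    """Raised when the row changed (or vanished) since it was read."""


def save_with_version_check(conn, obj_id, new_title, expected_version):
    # The WHERE clause only matches if nobody bumped the version meanwhile.
    cur = conn.execute(
        "UPDATE article SET title = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_title, obj_id, expected_version),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise StaleObjectError(f"article {obj_id} was modified or deleted")


# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, version INTEGER)"
)
conn.execute("INSERT INTO article VALUES (1, 'draft', 1)")
conn.commit()

# Both users loaded the row while it was at version 1.
save_with_version_check(conn, 1, "edited by user1", expected_version=1)

conflict_detected = False
try:
    save_with_version_check(conn, 1, "edited by user2", expected_version=1)
except StaleObjectError:
    conflict_detected = True  # user2 has to reload and retry
```

The same check also catches the "object was deleted" case, since an UPDATE against a missing row matches nothing.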

Related

Is there a way to run a separate looping worker process that references a Django app's models?

I have a webapp that monitors sites that users add for any changes. To do this, I need to have some sort of separate background thread/process that is constantly iterating through the list of sites, pinging them one at a time, and emailing any users that are monitoring a site that changes. I am currently using a thread that I initialize at the end of my urls.py file. This works fine with Django's development server, but it begins to break down once I deploy it to Heroku with Gunicorn. As soon as there are multiple connections, multiple copies of the worker thread get started, as Gunicorn starts more worker threads to handle the concurrent connections (at least, this is what I think is the reason behind the extra threads). This causes duplicate emails to be sent out, one from each thread.
I am now trying to find another means of spawning this worker thread/process. I saw a similar inquiry here, but when I tried the posted solution, I was unable to reference the models from my Django app and received this error message when I tried to do so:
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
I have also tried using django-background-tasks, which is frequently recommended as a simple solution for issues like this. However, it doesn't seem suited for looping, continuous processes. The same goes for Celery and other solutions like it. I am just looking for a way to start a separate worker Dyno that continuously runs in the background, without a queue or anything like that, and is able to use the models from my Django app to create QuerySets that can be iterated through. What would be the best way to do something like this? Please let me know if any more information would help.

You could try restructuring the code so that the parts that handle the email aren't tied so closely to the Django model. Both the Django model and this secondary application would then interact with a standard Python class/module/object, instead of you trying to graft the part of Django you need onto something else.
Alternatively, you can try using something like threading.Lock if your app is actually using threads inside one interpreter to prevent multiple messages from sending. There is also a multiprocessing.Lock that may work if the threading one does not.
Another option would be to give each detected change a unique value, preferably something derived from the contents of the change itself. I.e. if you have something like:
    def check_send_email(email_addr, website_url, text_that_changed):
        # `database` and `send_email` are placeholders for your own tools
        database.query('INSERT INTO website_updates VALUES (%s, %s)', (website_url, text_that_changed))
        if database.check_result():  # update was not already present in database
            send_email(email_addr)

    check_send_email('email@example.com', 'website.com', '<div id="watched-div">')
Obviously you'd need to interact with some more concrete tools, but the general idea is that if duplicate requests come in, you don't send multiple emails needlessly. Of course, finding a value that is always generated identically for a given change, yet is unique to that change, may prove difficult.
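
For what it's worth, here is one way the idea could look with concrete tools. This is only a sketch under assumptions: it uses an in-memory sqlite3 database, a hash of recipient + URL + changed text as the unique value, and a `send_email` stub that just records the address. In a real deployment the table would have to live in the database shared by all workers, and the choice of key is up to you.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY constraint is what rejects the duplicate insert.
conn.execute("CREATE TABLE website_updates (change_key TEXT PRIMARY KEY)")

sent = []  # stand-in for a real mail backend


def send_email(email_addr):
    sent.append(email_addr)


def check_send_email(email_addr, website_url, text_that_changed):
    # Same recipient and same change always hash to the same key.
    raw = "\x00".join((email_addr, website_url, text_that_changed))
    change_key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO website_updates (change_key) VALUES (?)",
        (change_key,),
    )
    conn.commit()
    if cur.rowcount == 1:  # the row was new, so nobody has mailed this yet
        send_email(email_addr)
        return True
    return False


first = check_send_email("email@example.com", "website.com", '<div id="watched-div">')
second = check_send_email("email@example.com", "website.com", '<div id="watched-div">')
```

Because the insert and the uniqueness check are a single statement, two workers racing on the same change cannot both see "not present yet".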

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a python service on uwsgi using this feature (I can't understand the examples) and I also want to be able to communicate results between them. Below I present some context and also present my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the python application, which uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the python app.
Edit:
Why do I need to share objects?
Simply because creating the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", and each one will take over 2k documents to be classified. Each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I am trying to use this uWSGI attach-daemon solution.

I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()), and return a response. You're trading a little time spent learning Celery for not having to write a bunch of coordination code.
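
The "lazy-load the expensive model as a global" part can be sketched in plain Python, independent of Celery. Everything here is an assumption for illustration: `load_model` stands in for unpickling the real classifier, and `predict` would be the body of a Celery task in the setup described above. Each worker process gets its own copy of the module-level variable, so the load happens once per worker rather than once per request.

```python
import time

_model = None    # one copy per worker process
load_count = 0   # only here to make the behaviour observable


def load_model():
    """Hypothetical stand-in for unpickling the scikit classifier."""
    global load_count
    load_count += 1
    time.sleep(0.01)  # pretend this is the slow 3-second load
    return lambda document: len(document) % 2  # dummy "classifier"


def get_model():
    # Load on first use, then keep reusing the same object.
    global _model
    if _model is None:
        _model = load_model()
    return _model


def predict(document):
    # In a Celery setup this body would sit inside a task function.
    return get_model()(document)


results = [predict(doc) for doc in ("first document", "second", "third one")]
```

If a worker runs several threads, the first-use check would need a lock around it; with one process per worker it is fine as written.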

Run code on first Django start

I have a Django application that displays a webpage with data from a model, based on the primary key passed in the URL. This all works fine, and the Django component is working perfectly for the most part.
My question is how I can make it so that, when the Django server boots up, code is called that creates a separate thread which then monitors an external source, logging valid data from that source as a model into the database. I have tried multiple methods, such as using an AppConfig.
I have the threading code written, along with the section that creates the model and saves it in the database. My issue is that if I try to use an AppConfig to create the thread, I get a django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet. error and the server does not boot up.
Where would be appropriate to place the code? Is my approach incorrect to the matter?

Trying to use threading to get around blocking processes like web servers is an exercise in pain. I've done it before and it's fragile and often yields unpredictable results.
A much easier idea is to create a separate worker that runs in a totally different process that you start separately. It would have the same database access and could even use your Django models. This is how hosts like Heroku approach this problem. It comes with the added benefit of being able to be tested separately and doesn't need to run at all while you're working on your main Django application.
These days, with a multitude of virtualization options like Vagrant and containerization options like Docker, running parallel processes and workers is trivial. In the wild they may literally be running on separate servers with your database on yet another server. As was mentioned in the comments, starting a worker process could easily be delegated to a separate Django management command. This, in turn, can be fairly easily turned into separate worker processes by gunicorn on your web server.
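
As a rough sketch of what the body of such a separate worker could look like (not taken from the answer above): a plain loop that runs one check, sleeps, and can be stopped cleanly. The names `run_worker` and `check_once` are made up for the example. In a Django project this loop would typically live in a custom management command, or in a standalone script that calls django.setup() before importing any models, since using models before the apps are loaded is what raises AppRegistryNotReady.

```python
import threading


def run_worker(check_once, interval_seconds, stop_event):
    """Call check_once() repeatedly until stop_event is set."""
    iterations = 0
    while not stop_event.is_set():
        check_once()
        iterations += 1
        # wait() returns early once the event is set, so shutdown is prompt.
        stop_event.wait(interval_seconds)
    return iterations


# Demo with a fake check that stops the loop after three passes.
seen = []
stop = threading.Event()


def check_once():
    # In a Django project, this is where the model queries would go.
    seen.append(len(seen))
    if len(seen) == 3:
        stop.set()


total = run_worker(check_once, interval_seconds=0.01, stop_event=stop)
```

Since the worker is started exactly once as its own process, the number of web server workers no longer affects how many copies of the loop are running.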

Questions about django thread safety

I have a django app which is used for managing registrations to a survey.
There are fixed number of slots and I want to "reserve" slots for users when they sign up.
In one of my views, I get the next available slot and reserve it (or redirect the user if there are no slots available.)
I want to protect against the case where two users signing up at the same time get registered for the same slot because the method "get_next_available_slot" returned the same slot for both users.
For this I am trying to understand the use of processes and threads with Django's views.
1) Is each request handled in a separate thread, and will using the Python threading module's Lock() take care of exclusive access?
2) I am running Apache on RHEL with mod_wsgi. How do I configure Apache/mod_wsgi to get a simple solution for the above situation?
I am not expecting a huge load on the web application at all, so I would prefer a simpler solution over a high-performance one.

You should not make assumptions about the thread/process setup of a Django application, because it depends on the web server you're using and how Django is integrated with it. Therefore, for interprocess communication to be reliable, it should not rely on these details. One good solution is to use the built-in cache library for locks and shared data.
Here's a good example of a cache lock ensuring only one instance of a Celery task runs at a time. You can apply it to regular requests as well.
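
The cache-lock idea relies on an atomic "add if absent" operation: whoever manages to add the key holds the lock. Below is a minimal sketch of that pattern. The `InMemoryCache` class is a hypothetical stand-in so the example runs on its own; it only works inside one process. With a real shared backend (memcached, Redis) you would use the cache's own add/delete calls and also set a timeout on the key, so a crashed request can't hold the lock forever.

```python
import threading
from contextlib import contextmanager


class InMemoryCache:
    """Hypothetical stand-in exposing add()/delete() like a cache backend."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value):
        # Store only if the key is absent; report whether we stored it.
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._guard:
            self._data.pop(key, None)


@contextmanager
def cache_lock(cache, key, owner):
    acquired = cache.add(key, owner)
    try:
        yield acquired
    finally:
        if acquired:
            cache.delete(key)  # release only a lock we actually hold


cache = InMemoryCache()
with cache_lock(cache, "reserve-slot", "request-1") as outer:
    # A second request arriving while the first holds the lock is refused.
    with cache_lock(cache, "reserve-slot", "request-2") as inner:
        pass
with cache_lock(cache, "reserve-slot", "request-3") as later:
    pass
```

A view would check the yielded flag and either do the reservation or tell the user to retry.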

You shouldn't be worrying about this kind of stuff.
These slots are stored in a database, right? The database should handle all the locking mechanisms for you. Just make sure you run everything in a transaction and you will be fine.
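
One way to let the database arbitrate, sketched here with the standard library's sqlite3 module and a made-up `slot` table: claim the slot with a conditional UPDATE and check how many rows it changed. If two requests pick the same free slot, only one UPDATE matches and the other retries with the next slot. The function and column names are assumptions for the example. In Django, a row lock via select_for_update() inside a transaction would be another route to the same guarantee.

```python
import sqlite3


def reserve_next_slot(conn, user):
    """Claim the lowest free slot for user; return its id, or None if full."""
    while True:
        row = conn.execute(
            "SELECT id FROM slot WHERE user IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # no slots left
        cur = conn.execute(
            # The "user IS NULL" condition makes the claim atomic.
            "UPDATE slot SET user = ? WHERE id = ? AND user IS NULL",
            (user, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # Someone else took it between the SELECT and the UPDATE: try again.


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slot (id INTEGER PRIMARY KEY, user TEXT)")
conn.executemany("INSERT INTO slot (id, user) VALUES (?, NULL)", [(1,), (2,)])
conn.commit()

alice = reserve_next_slot(conn, "alice")
bob = reserve_next_slot(conn, "bob")
carol = reserve_next_slot(conn, "carol")  # only two slots exist
```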

Concurrency handling in python based webapp

I am developing a web app with Flask, Python, SQLAlchemy and PostgreSQL.
My question is about concurrency handling in this app.
How I wrote the app:
Take the example of adding a user to the database. I post the form and a view is called. I process all the form data and then call add_user(*arg), which uses SQLAlchemy code to insert the user into the database and returns on successful execution. I then return the response from the view.
What I assumed:
I assumed that my web server (which I have not chosen yet) will spawn either a thread or a process if two users try to sign up at the same time, and will handle all the concurrency requirements.
Do I need to write threaded code here? By threaded code I mean acquiring a lock before writing and releasing it after the write.
I am pretty new to web development and multithreaded/multiprocess programming and would like some guidance on how to write a web app that handles concurrency well.
Is it right to write concurrency handling from the start, or should this come later, once a large number of concurrent users are using the web app? Even if it should be done later, I would like some pointers about it.
Basically I have no idea about the concurrency part of web app development. If you can point me to resources where I can learn more about it, that would be really helpful.

Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy ScopedSessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)
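
As a small illustration of the "appropriate precautions" for a long-lived shared object, here is a sketch of a counter that several request-handling threads update. The `HitCounter` class and the thread setup are made up for the example. The lock makes each read-modify-write step exclusive, so no increments are lost.

```python
import threading


class HitCounter:
    """Long-lived object shared by all request-handling threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self):
        # Without the lock, two threads could read the same old value.
        with self._lock:
            self._count += 1

    @property
    def value(self):
        with self._lock:
            return self._count


counter = HitCounter()


def handle_requests(n):
    # Stand-in for n requests being served by one thread.
    for _ in range(n):
        counter.increment()


threads = [threading.Thread(target=handle_requests, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that this only protects state within a single process. If the WSGI server runs several processes, each has its own counter, which is one more reason to keep shared state in the database.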
