Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A makes a request to a view called make_foo, and the operation within that view takes time. We want a flag variable, say making_foo = True, that is visible to User B when they make a request, and to any other user or service within that Django app, and that can be set back to False when the operation is done.
Don't take the example too seriously, I know about task queues but what I am trying to understand is the idea of having a shared mutable variable across processes without the need to use a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your Django server is running in production, there is not just one Django process; there will be several worker processes and/or threads running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because other people have already covered them, but Redis is an in-memory database that uses key-value storage (unlike Django's model layer, Redis is essentially a giant dictionary). Redis is fast, and most of its operations are atomic, which means you are unlikely to encounter race conditions.
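To make the original making_foo idea concrete, here is a rough sketch using the redis-py client against a local Redis instance; the view names and the do_the_slow_work() call are placeholders, not anything from the question:

# Sketch only: assumes redis-py is installed and Redis is running on localhost.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def make_foo(request):
    # Every worker process talks to the same Redis key, so the flag is shared.
    # The expiry keeps the flag from sticking around if the worker dies mid-task.
    r.set("making_foo", 1, ex=600)
    try:
        do_the_slow_work()       # placeholder for the slow operation
    finally:
        r.delete("making_foo")   # clear the flag even if the work fails

def check_foo(request):
    # Any other request, in any process, can read the same flag.
    making_foo = bool(r.exists("making_foo"))
    ...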
Related
I have a webapp that monitors sites that users add for any changes. To do this, I need to have some sort of separate background thread/process that is constantly iterating through the list of sites, pinging them one at a time, and emailing any users that are monitoring a site that changes. I am currently using a thread that I initialize at the end of my urls.py file. This works fine with Django's development server, but it begins to break down once I deploy it to Heroku with Gunicorn. As soon as there are multiple connections, multiple copies of the worker thread get started, as Gunicorn starts more workers to handle the concurrent connections (at least, this is what I think is the reason behind the extra threads). This causes duplicate emails to be sent out, one from each thread.
I am now trying to find another means of spawning this worker thread/process. I saw a similar inquiry here, but when I tried the posted solution, I was unable to reference the models from my Django app and received this error message when I tried to do so:
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
I have also tried using django-background-tasks, which is frequently recommended as a simple solution for issues like this. However, it doesn't seem suited for looping, continuous processes. The same goes for Celery and other solutions like it. I am just looking for a way to start a separate worker Dyno that continuously runs in the background, without a queue or anything like that, and is able to use the models from my Django app to create QuerySets that can be iterated through. What would be the best way to do something like this? Please let me know if any more information would help.
You could try editing the code so that the parts that handle the email specifically aren't tied so intrinsically to the Django model, such that both the Django app and this secondary application interact with a plain Python class/module/object, instead of trying to graft out the part of Django you need and use it elsewhere.
Alternatively, you can try using something like threading.Lock, if your app is actually using threads inside one interpreter, to prevent multiple messages from being sent. There is also multiprocessing.Lock, which may work if the threading one does not.
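A minimal sketch of the threading.Lock idea, assuming the monitoring workers really are threads inside a single interpreter; the names below are invented for illustration, and a lock like this does nothing across separate Gunicorn worker processes:

import threading

send_lock = threading.Lock()
already_notified = set()   # in-process state; only meaningful within one interpreter

def notify_if_changed(email_addr, change_key):
    # Only one thread at a time may check-and-update the shared set,
    # so a given change triggers at most one email per process.
    with send_lock:
        if change_key in already_notified:
            return
        already_notified.add(change_key)
    send_email(email_addr, change_key)   # placeholder email helper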
Another option would be to make it so each requested change has a unique value associated with it, preferably something based on the contents of the change itself, i.e. if you have something like:
def check_send_email(email_addr, website_url, text_that_changed):
    # Pseudocode: "database" stands in for whatever storage layer you use.
    database.query('INSERT INTO website_updates VALUES (%s, %s)', (website_url, text_that_changed,))
    if database.check_result():  # update was not already present in the database
        send_email(email_addr)

check_send_email('email@example.com', 'website.com', '<div id="watched-div">')
Obviously you'd need to interact with some more concrete tools, but the general idea above is that when duplicate requests come in, you don't send multiple emails needlessly. Of course, finding a value that you can always regenerate exactly the same for a given change, but that is also unique for every different change, may prove difficult.
Let's say that I have a Django web application with two users. My web application has a global variable that exists on the server (a Pandas Dataframe created from data in an external SQL database).
Let's say that a user makes an update request to that Dataframe and now that Dataframe is being updated. As the Dataframe is being updated, the other user makes a get request for that Dataframe. Is there a way to 'lock' that Dataframe until user 1 is finished with it and then finish the request made by user 2?
EDIT:
So the order of events should be:
User 1 makes an update request, Dataframe is locked, User 2 makes a get request, Dataframe is finished updating, Dataframe is unlocked, User 2 gets his/her request.
Lines of code would be appreciated!
Ehm... Django is not a server. It ships with a single-threaded development server, but that should not be used for anything beyond development, and maybe not even for that. Django applications are deployed using WSGI. The WSGI server running your app is likely to start several separate worker processes, and will be killing and restarting those workers according to the rules in its configuration.
This means that you cannot rely on multiple requests hitting the same process. The Django app lifecycle is between getting a request and returning a response. Anything that is not explicitly made persistent between those two events should be considered gone.
So, when one of your users updates a global variable, this variable only exists in the one process this user randomly accessed. The second user might or might not hit the same process and therefore might or might not get the same copy of the variable. More than that, the process will sooner or later be killed by the WSGI server and all the updates will be gone.
What I am getting at is that you might want to rethink your architecture before you bother with the atomic update problems.
Don't share in-memory objects if you're going to mutate them. Concurrency is super hard to do right, and premature optimization is evil. Give each user their own view of the data and only share data via the database (using transactions to make your updates atomic). Keep and increment a counter in your database every time you make an update, and make transactions fail if that number has changed since the data was read (as somebody else has mutated it).
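A sketch of that counter idea with Django's ORM, assuming an invented Report model that carries a version field alongside its data:

from django.db import transaction
from django.db.models import F

def save_update(report_id, new_payload, version_read):
    # Optimistic locking: the UPDATE only matches the row if nobody else
    # bumped the version since we read it; otherwise we signal a conflict.
    with transaction.atomic():
        updated = (Report.objects                      # Report is a hypothetical model
                   .filter(pk=report_id, version=version_read)
                   .update(payload=new_payload, version=F("version") + 1))
    if updated == 0:
        raise RuntimeError("Data changed since it was read; reload and retry")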
Also, don't make important architectural decisions when tired! :)
I have a django app which is used for managing registrations to a survey.
There is a fixed number of slots, and I want to "reserve" slots for users when they sign up.
In one of my views, I get the next available slot and reserve it (or redirect the user if there are no slots available.)
I want to protect against the case where two users signing up at the same time get registered for the same slot because the method get_next_available_slot returned the same slot for both users.
For this I am trying to understand the use of processes and threads with Django's views.
1) Is each request handled in a separate thread, and will using the Python threading module's Lock() take care of exclusive access?
2) I am running Apache on RHEL with mod_wsgi. How do I configure Apache/mod_wsgi so that there is an easy, simple way to handle the above situation?
I am not expecting a huge load on the web application at all. So I would like a simpler solution instead of a high performance one.
You should not make assumptions about the thread/process setup of a Django application, because it depends on which web server you're using and how Django is integrated with it. Therefore, interprocess communication methods should not rely on these details if they are to be reliable. One good solution is to use the built-in cache framework for locks and shared data.
Here's a good example of a cache lock ensuring that only one instance of a Celery task runs at a time. You can apply it to regular requests as well.
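That cache-lock pattern looks roughly like this with Django's cache framework; it relies on cache.add being atomic, which holds for backends like Redis or memcached (the lock name and timeout below are arbitrary):

from contextlib import contextmanager
from django.core.cache import cache

LOCK_EXPIRE = 60  # seconds; keeps a crashed worker from holding the lock forever

@contextmanager
def cache_lock(lock_id):
    # cache.add only succeeds if the key does not already exist, which makes
    # it usable as a simple cross-process mutex on atomic cache backends.
    acquired = cache.add(lock_id, "locked", LOCK_EXPIRE)
    try:
        yield acquired
    finally:
        if acquired:
            cache.delete(lock_id)

# usage inside a view or task
with cache_lock("reserve-slot") as got_it:
    if got_it:
        pass  # critical section: only one process at a time gets here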
You shouldn't be worrying about this kind of stuff.
These slots are stored in a database, right? The database should handle all the locking mechanisms for you; just make sure you run everything inside a transaction and you will be fine.
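For the slot example, a sketch of letting the database do the locking with Django's ORM; the Slot model and its fields are invented for illustration:

from django.db import transaction

def reserve_next_slot(user):
    with transaction.atomic():
        # select_for_update() row-locks the candidate slot until the transaction
        # commits, so two concurrent signups cannot both grab the same slot.
        slot = (Slot.objects
                .select_for_update()
                .filter(reserved=False)
                .order_by("number")
                .first())
        if slot is None:
            return None   # no slots left; the view can redirect
        slot.reserved = True
        slot.reserved_by = user
        slot.save()
        return slot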
I am developing a web app with Flask, Python, SQLAlchemy and PostgreSQL.
My question here is about how to handle concurrency in this app.
How I wrote the app:
I'll take the example of adding a user to the database. I post the form and a view is called. I process all the form data and then call add_user(*args), which uses SQLAlchemy code to insert the user into the database and returns on successful execution, and then I return the response from the view.
What I assumed:
OK, now I assumed that my web server (which I have not decided on yet) will either spawn a thread or a process if two users try to sign up at the same time, and will handle all the concurrency requirements.
Do I need to write threaded code here? By threaded code I mean that before writing I acquire a lock, and after the write I release it.
I am pretty new to web development and multithreading/multiprocessing programming, and would like some guidance on how to write a web app that handles concurrency well.
Is it right to handle concurrency from the start, or should that concern come later, when a large number of concurrent users are actually using the webapp? Even if it should be done later, I would like some pointers about it.
Basically I have no idea about concurrency part of webapp development. If you can point to resources from where I can learn more about it would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy ScopedSessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)
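A minimal sketch of the scoped-session setup described above, assuming a plain Flask app; the connection URL and the User model are placeholders:

from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

app = Flask(__name__)
engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))

@app.teardown_appcontext
def remove_session(exc=None):
    # Each request/thread gets its own session; drop it when the request ends.
    Session.remove()

@app.route("/users", methods=["POST"])
def add_user_view():
    user = User(name="example")   # User is a hypothetical mapped class
    Session.add(user)
    Session.commit()              # no explicit application-level lock needed
    return "ok"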
Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops: it reads a document from the queue, extracts the information from it using processdocument, and writes the information to a new entry in a table in the database.
The worker loop breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is that I'm not sure exactly what to put into the worker function for each process so that it can write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The MetaData instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware that I could just skip the multiprocessing and do this in a single process, but I will have to write code that has several processes reading from the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
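Putting that advice together, here is a sketch of a worker that builds its own Engine and Session inside the child process; the sentinel handling and the DocumentEntry mapped class are assumptions layered on top of the question, not a definitive implementation:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool

def worker(queue, db_url):
    # Create the Engine inside the child process so no DBAPI connections
    # cross the fork boundary; NullPool skips connection pooling entirely.
    engine = create_engine(db_url, poolclass=NullPool)
    Session = sessionmaker(bind=engine)
    session = Session()
    while True:
        doc = queue.get()
        if doc is None:                       # sentinel placed by the main routine
            break
        info = processdocument(doc)           # the question's per-document function
        session.add(DocumentEntry(**info))    # DocumentEntry: hypothetical mapped class
        session.commit()
    session.close()
    engine.dispose()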