Django session race condition? - python

Summary: is there a race condition in Django sessions, and how do I prevent it?
I have an interesting problem with Django sessions which I think involves a race condition due to simultaneous requests by the same user.
It has occured in a script for uploading several files at the same time, being tested on localhost. I think this makes simultaneous requests from the same user quite likely (low response times due to localhost, long requests due to file uploads). It's still possible for normal requests outside localhost though, just less likely.
I am sending several (file post) requests that I think do this:
Django automatically retrieves the user's session*
Unrelated code that takes some time
Get request.session['files'] (a dictionary)
Append data about the current file to the dictionary
Store the dictionary in request.session['files'] again
Check that it has indeed been stored
More unrelated code that takes time
Django automatically stores the user's session
Here the check at 6. will indicate that the information has indeed been stored in the session. However, future requests indicate that sometimes it has, sometimes it has not.
What I think is happening is that two of these requests (A and B) happen simultaneously. Request A retrieves request.session['files'] first, then B does the same, changes it and stores it. When A finally finishes, it overwrites the session changes by B.
Two questions:
Is this indeed what is happening? Is the django development server multithreaded? On Google I'm finding pages about making it multithreaded, suggesting that by default it is not? Otherwise, what could be the problem?
If this race condition is the problem, what would be the best way to solve it? It's an inconvenience but not a security concern, so I'd already be happy if the chance can be decreased significantly.
Retrieving the session data right before the changes and saving it right after should decrease the chance significantly I think. However I have not found a way to do this for the request.session, only working around it using django.contrib.sessions.backends.db.SessionStore. However I figure that if I change it that way, Django will just overwrite it with request.session at the end of the request.
So I need a request.session.reload() and request.session.commit(), basically.

Yes, it is possible for a request to start before another has finished. You can check this by printing something at the start and end of a view and launch a bunch of request at the same time.
Indeed the session is loaded before the view and saved after the view. You can reload the session using request.session = engine.SessionStore(session_key) and save it using request.session.save().
Reloading the session however does discard any data added to the session before that (in the view or before it). Saving before reloading would destroy the point of loading late. A better way would be to save the files to the database as a new model.
The essence of the answer is in the discussion of Thomas' answer, which was incomplete so I've posted the complete answer.

Mark just nailed it, only minor addition from me, is how to load that session:
for key in session.keys(): # if you have potential removals
del session[key]
session.update(session.load())
session.modified = False # just making it clean
First line optional, you only need it if certain values might be removed meanwhile from the session.
Last line is optional, if you update the session, then it does not really matter.

That is true. You can confirm it by having a look at the django.contrib.sessions.middleware.SessionMiddleware.
Basically, request.session is loaded before request hits your view (in process_request), and it is updated in the session backend (if needed) after the response has left your view (in process_response).
If what I mean is unclear, you might want to have a look at the django documentation for Middleware.
The best way to solve the issue will depend on what you're trying to achieve with that information. I'll update my answer if you provide that information!

Related

SQLAlchemy session with Celery (Multipart batch writes)

Supposing I have a mobile app which will send a filled form data (which also contains images) to a Commercial software using its API, and this data should be committed all at once.
Since the mobile does not have enough memory to send all the dataset at once, I need to send it as a Multipart batch.
I use transactions in cases where I want to perform a bunch of operations on the database, but I kind of need them to be performed all at once, meaning that I don't want the database to change out from under me while I'm in the middle of making my changes. And if I'm making a bunch of changes, I don't want users to be able to read my set of documents in that partially changed state. And I certainly don't want a set of operations failing halfway through, leaving me in a weird and inconsistent state forever. It's got to be all or nothing.
I know that Firebase provides the batch write operation which does exactly what I need. However, I need to do this into a local database (like redis or postgres).
The first approach I considered is using POST requests identified by a main session_ID.
- POST /session -> returns new SESSION_ID
- POST [image1] /session/<session_id> -> returns new IMG_ID
- POST [image2] /session/<session_id> -> returns new IMG_ID
- PUT /session/<session_id> -> validate/update metadata
However it does not seem very robust to handle errors.
The second approach I was considering is combining SQLAlchemy session with Celery task using Flask or FastAPI. I am not sure if it is common to do this to solve this issue. I just found this question. I would like to know what do you guys recommend for this second case approach (sending all data parts first, and commit all at once) ?

Preserving value of variables between subsequent requests in Python Django

I have a Django application to log the character sequences from an autocomplete interface. Each time a call is made to the server, the parameters are added to a list and when the user submits the query, the list is written to a file.
Since I am not sure how to preserve the list between subsequent calls, I relied on a global variable say query_logger. Now I can preserve the list in the following way:
def log_query(query, completions, submitted=False):
global query_logger
if query_logger is None:
query_logger = list()
query_logger.append(query, completions, submitted)
if submitted:
query_logger = None
While this hack works for a single client sending requests I don't think this is a stable solution when requests come from multiple clients. My question is two-fold:
What is the order of execution of requests: Do they follow first come first serve (especially if the requests are asynchronous)?
What is a better approach for doing this?
If your django server is single-threaded, then yes, it will respond to requests as it receives them. If you're using wsgi or another proxy, that becomes more complicated. Regardless, I think you'll want to use a db to store the information.
I encountered a similar problem and ended up using sqlite to store the data temporarily, because that's super simple and easy to manage. You'll want to use IP addresses or create a unique ID passed as a url parameter in order to identify clients on subsequent requests.
I also scheduled a daily task (using cron on ubuntu) that goes through and removes any incomplete requests that haven't been completed (excluding those started in the last hour).
You must not use global variables for this.
The proper answer is to use the session - that is exactly what it is for.
Simplest (bad) solution would be to have a global variable. Which means you need some in memory location or a db to store this info

update existing cache data with newer items in django

I want to use caching in Django and I am stuck up with how to go about it. I have data in some specific models which are write intensive. records will get added continuously to the model. Each user has some specific data in the model similar to orders table.
Since my model is write intensive I am not sure how effective caching frameworks in Django are going to be. I tried Django view specific caching and I am try to develop a view where first it will pick up data from the cache. Then I will have another call which will bring in data which was added to the model after the caching was done. What I want to do is add the updated data to the original cache data and store it again.
It is like I don't want to expire my cache, I just want to keep adding to my existing cache data. may be once in 3 hrs I can clear it.
Is what I am doing right. Are there better ways than this. Can I really add to items in existing cache.
I will be very glad for your help
You ask about "caching" which is a really broad topic, and the answer is always a mix of opinion, style and the specific app requirements. Here are a few points to consider.
If the data is per user, you can cache it per user:
from django.core.cache import cache
cache.set(request.user.id,"foo")
cache.get(request.user.id)
The common practice it to keep a database flag that tells you if the user's data changed since it was cached. So before you fetch the data from cache, check only this flag from the DB. If the flag says nothing changed, get the data from cache. If it did change, pull from DB, replace the cache, and set the flag again.
The flag check should be fast and simple: one table, indexed by user.id, and a boolean flag field. This will squeeze a lot of index rows into a single DB page, and enables a fast fetching of a single one field row. Yet you still get a persistent updated main storage, that prevents the use of not updated cache data. You can check this flag in a middleware.
You can run expiry in many ways: clear cache when user logs out, run a cron script that clears items, or let the cache backend expire items. If you use a flag check before you use the cache, there is no issue in keeping items in cache except space, and caching backends handle that. If you use the django simple file cache (which is easy, simple and zero config), you will have to clear the cache. A simple cron script will do.

Wait for datastore changes before redirecting

Very similar to this question, except that the answer is not suitable.
I populate a table from a datastore query, then there is a link allowing the user to delete a specific row. Clicking the link goes to a url that deletes the row from the datastore then redirects back to the table.
Changes more often than not aren't shown in the table until reloading again.
Easy solution is to redirect to another page, that uses a javascript redirect to add a delay of a couple of seconds. Other alternative is to send details back to the page like action=delete&key=### and then make sure that item is missed from the table. That's a pain though.
The answer is with ancestor queries.
https://cloud.google.com/appengine/docs/python/datastore/queries#Python_Ancestor_queries
Create the entities with a parent. When one of the entities is deleted, you can run an ancestor query for your table list view which will have strong consistency when data is changed.
Example ancestor query:
tom = Person(key_name='Tom')
photo_query = Photo.all()
photo_query.ancestor(tom)
With the datastore, unless you can use ancestors, you cant guarantee when the indexes will be updated, only the entity itself (for getting by key later) by doing a put without async. Best is a combination of your suggestion where the client takes into account its action to patch the ui, plus maybe using memcache to remember recent actions and patch queries server-side before returning to client.
Here is a different approach. Use Javascript and AJAX. When the user clicks a link, you do two things:
Use Javascript/jQuery to remove the row from the DOM, and
Send an AJAX call to the server to do the appropriate datastore modifications.
It makes for a nice user experience because you are not reloading the page at all.
You might want to consider that there is always room for displaying outdated info in the table: for example displaying the table simultaneously in 2 different windows/tabs, then in one of them deletion is performed, the other will still display a delete link which will cause a 404 if followed.
With this in mind I'd 1st focus on managing the expectations (the user should know that the page may occasionally display outdated info) and then on the user's ability to get an outdated page in sync (refresh button?). Which might make the issue moot.
The delay-based "solutions" are bound to fail sooner or later in race condition scenarios, I wouldn't bother with the extra complexity. Especially for the document where the deletion is done: that's exactly where the user knows that the info is outdated (for free) and would be likely inclined to refresh until the recent change becomes visible.

App engine app design questions

I want to load info from another site (this part is done), but i am doing this every time the page is loaded and that wont do. So i was thinking of having a variable in a table of settings like 'last checked bbc site' and when the page loads it would check if its been long enough since last check to check again. Is there anything silly about doing it that way?
Also do i absolutely have to use tables to store 1 off variables like this setting?
I think there are 2 options that would work for you, besides creating a entity in the datastore to keep track of "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.
If you want some handler on your GAE app (including one for a scheduled task, reception of messages, web page visits, etc) to store some new information in such a way that some handler in the future can recover that information, then GAE's storage is the only good general way (memcache could expire from under you, for example). Not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's storage the answer is "yes". (Under very specific circumstances you might want to put that data to some different place on the network, such as your visitor's browser e.g. via cookies, or an Amazon storage instance, etc, but it does not appear to me that those specific circumstances are appliable to your use case).

Categories

Resources