Let's say that I have a Django web application with two users. My web application has a global variable that exist on the server (a Pandas Dataframe created from data from an external SQL database).
Let's say that a user makes an update request to that Dataframe and now that Dataframe is being updated. As the Dataframe is being updated, the other user makes a get request for that Dataframe. Is there a way to 'lock' that Dataframe until user 1 is finished with it and then finish the request made by user 2?
EDIT:
So the order of events should be:
User 1 makes an update request, Dataframe is locked, User 2 makes a get request, Dataframe is finished updating, Dataframe is unlocked, User 2 gets his/her request.
Lines of code would be appreciated!
Ehm... Django is not a server. It has a single-threaded development server in it, but it should not be used for anything beyond development and maybe not even for that. Django applications are deployed using WSGI. WSGI server running your app is likely to start several separate worker threads and will be killing and restarting these threads according to the rules in its configuration.
This means, that you cannot rely on multiple requests hitting the same process. Django app lifecycle is between getting a request and returning a response. Anything that is not explicitly made persistent between those two events should be considered gone.
So, when one of your users updates a global variable, this variable only exists in the one process this user randomly accessed. The second user might or might not hit the same process and therefore might or might not get the same copy of the variable. More than that, the process will sooner or later be killed by the WSGI server and all the updates will be gone.
What I am getting at is that you might want to rethink your architecture before you bother with the atomic update problems.
Don't share in memory objects if you're going to mutate them. Concurrency is super hard to do right and premature optimization is evil. Give each user their own view of the data and only share data via the database (using transactions to make your updates atomic). Keep and increment counters in your database every time you make an update, make transactions fail if those number have changed since the data was read (as somebody else has mutated it).
Also, don't make important architectural decisions when tired! :)
Related
I have a webapp that monitors sites that users add for any changes. To do this, I need to have some sort of separate background thread/process that is constantly iterating through the list of sites, pinging them one at a time, and emailing any users that are monitoring a site that changes. I am currently using a thread that I initialize at the end of my urls.py file. This works fine with Django's development server, but it begins to break down once I deploy it to Heroku with Gunicorn. As soon as there are multiple connections, multiple copies of the worker thread get started, as Gunicorn starts more worker threads to handle the concurrent connections (at least, this is what I think is the reason behind the extra threads is). This causes duplicate emails to be sent out, one from each thread.
I am now trying to find another means of spawning this worker thread/process. I saw a similar inquiry here, but when I tried the posted solution, I was unable to reference the models from my Django app and received this error message when I tried to do so:
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
I have also tried using django-background-tasks, which is frequently recommended as a simple solution for issues like this. However, it doesn't seem suited for looping, continuous processes. The same goes for Celery and other solutions like it. I am just looking for a way to start a separate worker Dyno that continuously runs in the background, without a queue or anything like that, and is able to use the models from my Django app to create QuerySets that can be iterated through. What would be the best way to do something like this? Please let me know if any more information would help.
You could try editing the code so that the parts that handle the email specifically aren't tried so intrinsically to the django model, such that both the django model and this secondary application interact with the standard python class/module/object/etc, instead of trying to graft out the part of django you need elsewhere.
Alternatively, you can try using something like threading.Lock if your app is actually using threads inside one interpreter to prevent multiple messages from sending. There is also a multiprocessing.Lock that may work if the threading one does not.
Another option would be to make it so each requested change would have a unique value to it, preferably something based on the contents of the changes themselves. IE if you have something like:
def check_send_email(email_addr, website_url, text_that_changed):
database.query('INSERT INTO website_updates VALUES %s, %s', (website_url, text_that_changed,))
if (database.check_result()): # update was not already present in database
send_email(email_addr)
check_send_email('email#example.com', 'website.com', '<div id="watched-div">')
obviously you'd need to interact with some more concrete tools, but the general idea above is that if requests come in, you don't send multiple emails needlessly. Of course, finding a value you can always generate exactly the same given a specific change, but is also unique every time may prove difficult.
I am building an app where each user is assigned a task from a tasks table. In order to do so, we are going to mark an existing entry as deleted (flag) and then add an entry that holds the person responsible for the task in that table.
The issue here is that if multiple users decide to get a task at the same time, the request would prioritize older entries over newer ones, so there is the chance they are going to read the same task and get assigned the same task. Is there a simple way around it?
My first inclination was to create a singleton class that handles job distribution, but I am pretty sure that such issues can be handled by Django direct. What should I try?
Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A made a request to a view called make_foo and the operation within that view takes time. We want to have a flag variable that says making_foo = True that is viewable by User B that will make a request and by any other user or service within that django app and be able to set it to False when done
Don't take the example too seriously, I know about task queues but what I am trying to understand is the idea of having a shared mutable variable across processes without the need to use a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your django server is running in production, there is not just one django process, there will be several worker threads running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because it's already been said before by other people, but Redis is an in-memory database that uses key-value storing (unlike how Django uses a model, Redis is essentially a giant dictionary). Redis is fast and most operations are atomic which means you are unlikely to encounter race conditions.
Inside an web application ( Pyramid ) I create certain objects on POST which need some work done on them ( mainly fetching something from the web ). These objects are persisted to a PostgreSQL database with the help of SQLAlchemy. Since these tasks can take a while it is not done inside the request handler but rather offloaded to a daemon process on a different host. When the object is created I take it's ID ( which is a client side generated UUID ) and send it via ZeroMQ to the daemon process. The daemon receives the ID, and fetches the object from the database, does it's work and writes the result to the database.
Problem: The daemon can receive the ID before it's creating transaction is committed. Since we are using pyramid_tm, all database transactions are committed when the request handler returns without an error and I would rather like to leave it this way. On my dev system everything runs on the same box, so ZeroMQ is lightning fast. On the production system this is most likely not an issue since web application and daemon run on different hosts but I don't want to count on this.
This problem only recently manifested itself since we previously used MongoDB with a write_convern of 2. Having only two database servers the write on the entity always blocked the web-request until the entity was persisted ( which is obviously is not the greatest idea ).
Has anyone run into a similar problem?
How did you solve it?
I see multiple possible solutions, but most of them don't satisfy me:
Flushing the transaction manually before triggering the ZMQ message. However, I currently use SQLAlchemy after_created event to trigger it and this is really nice since it decouples this process completely and thus eliminating the risk of "forgetting" to tell the daemon to work. Also think that I still would need a READ UNCOMMITTED isolation level on the daemon side, is this correct?
Adding a timestamp to the ZMQ message, causing the worker thread that received the message, to wait before processing the object. This obviously limits the throughput.
Dish ZMQ completely and simply poll the database. Noooo!
I would just use PostgreSQL's LISTEN and NOTIFY functionality. The worker can connect to the SQL server (which it already has to do), and issue the appropriate LISTEN. PostgreSQL would then let it know when relevant transactions finished. You trigger for generating the notifications in the SQL server could probably even send the entire row in the payload, so the worker doesn't even have to request anything:
CREATE OR REPLACE FUNCTION magic_notifier() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('stuffdone', row_to_json(new)::text);
RETURN new;
END;
$$ LANGUAGE plpgsql;
With that, right as soon as it knows there is work to do, it has the necessary information, so it can begin work without another round-trip.
This comes close to your second solution:
Create a buffer, drop the ids from your zeromq messages in there and let you worker poll regularly this id-pool. If it fails retrieving an object for the id from the database, let the id sit in the pool until the next poll, else remove the id from the pool.
You have to deal somehow with the asynchronous behaviour of your system. When the ids arrive constantly before persisting the object in the database, it doesnt matter whether pooling the ids (and re-polling the the same id) reduces throughput, because the bottleneck is earlier.
An upside is, you could run multiple frontends in front of this.
I am developing web app on flask, python, sqlalchemy and postgresql.
My question is here regarding concurrency handling in this app.
How I wrote the app :
I take the example of adding user in database. I post the form and a view is called. I process all the form data and then call add_user(*arg) which uses sqlalchemy code to insert user in database and returns on successful execution and I return the response from the view.
What I assumed:
Ok now I assumed that my web server (which I have not decided yet) will either spawn a thread or a process if two users are trying to signup at the same time and will handle all the concurreny requirements.
Do i need to write threaded code here? By threaded code I mean that before writing I acquire a lock and after write release it.
I am pretty new to web development and multithreading/multiprocessing programing and would like some guidance on how write web app which can handle concurrency well.
Writing concurrency handling from start is right or this thought should come when a large number of concurrent users are using the webapp. Even If it should be done later I would like some pointers about it.
Basically I have no idea about concurrency part of webapp development. If you can point to resources from where I can learn more about it would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy ScopedSessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)