Best way to avoid data loss in a high-load Django app?

Imagine a fairly complex Django application with both frontend and backend parts. Users modify data through the frontend, while scripts periodically modify the same data on the backend.
Example:
instance = SomeModel.objects.get(...)
# (long-running part where various fields are changed, takes from 3 to 20 seconds)
instance.field = 123
instance.another_field = 'abc'
instance.save()
If somebody (or something) changes the instance while that part is changing some fields, those changes will be lost, because the instance is saved later, dumping the stale data held in the Python (Django) object. In other words, if some code reads the data, waits for some time, and then saves it back, only the latest 'saver' keeps its data; all the previous ones lose their changes.
It's a "high-load" app, the database load (we use Postgres) is quite high, and I'd like to avoid anything that would cause a significant increase in DB activity or memory usage.
Another issue: we have many signals attached, and even the save() method is overridden, so I'd like to avoid anything that might break the signals or be incompatible with custom save() or update() methods.
What would you recommend in this situation? Any special app for that? Transactions? Anything else?
Thank you!

The correct way to protect against this is to use select_for_update to make sure that the data doesn't change between reading and writing. However, this locks the row for updates, which might slow down your application significantly.
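For reference, the straightforward version holds the row lock for the whole long-running part, something like this minimal sketch (using the model from the question):
from django.db import transaction
with transaction.atomic():
    # the row stays locked until the atomic block exits
    instance = SomeModel.objects.select_for_update().get(...)
    # (long-running part where various fields are changed)
    instance.field = 123
    instance.another_field = 'abc'
    instance.save()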
One solution might be to read the data and perform your long-running tasks. Then, before saving it back, you start a transaction, read the data again, but now with select_for_update, and verify that the original data hasn't changed. If the data is still the same, you save. If the data has changed, you abort and re-run the long-running task. That way you hold the lock for as short a time as possible.
Something like:
from django.db import transaction
success = False
while not success:
    instance1 = SomeModel.objects.get(...)
    # (long-running part)
    with transaction.atomic():
        instance2 = SomeModel.objects.select_for_update().get(...)
        # (compare the relevant data from instance1 vs instance2)
        if unchanged:
            # (make the changes on instance2)
            instance2.field = 123
            instance2.another_field = 'abc'
            instance2.save()
            success = True
Whether this is a viable approach depends on what exactly your long-running task is. And a user might still overwrite the data you save here.

Related

Very poor weakref performance in Python/SQL Alchemy

I've spent the day trying to debug a memory problem in my Python script. I'm using SQL Alchemy as my ORM. There are several confounding issues here, and I'm hoping that if I list them all out, somebody will be able to point me in the right direction.
In order to achieve the performance I'm looking for, I read in all the records in a table (~400k), then loop through a spreadsheet, match the records I've previously read in, then create new records (~800k) into another table. Here's roughly what the code looks like:
dimensionMap = {}
for d in connection.session.query(Dimension):
    dimensionMap[d.businessKey] = d.primarySyntheticKey
# len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB
allfacts = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]
    fact = Fact(dimensionId, metric)
    connection.session.add(fact)
    allfacts.append(fact)
    if row.number % 20000 == 0:
        connection.session.flush()
# len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB
connection.session.commit()
sys.stdout.write('All Done')
400k and 800k don't seem like especially big numbers to me, but I'm nonetheless running into memory problems on a machine with 4GB of memory. This is really strange to me, as I ran sys.getsizeof on my two biggest collections, and they were both well under any size that would cause problems.
While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profile on it, hoping the results would lead me in the direction of the memory problem, and came up with two confounding issues.
First, 87% of the program time is spent in the commit, specifically on this line of code:
self.transaction._new[state] = True
This can be found in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why is weakref:261:__setitem__ taking up so much time?
Second, even when the program is done ('All Done' has been printed to stdout), the script continues, seemingly forever, with 2.2GB of memory used.
I've done some searching on weakrefs, but haven't seen anybody mention the performance issues I'm facing. Ultimately, there isn't a whole lot I can do about this, given it's buried deep in SQL Alchemy, but I'd still appreciate any ideas.
Key Learnings
As mentioned by @zzzeek, there's a lot of overhead required to maintain persistent objects. Here's a little graph to show the growth.
The trendline suggests that each persistent instance takes about 2KB of memory overhead, even though the instance itself is only 30 bytes. This brings me to another thing I learned, which is to take sys.getsizeof with a huge grain of salt.
This function only returns the shallow size of an object, and doesn't take into account any other objects that need to be there for the first object to make sense (__dict__, for example). You really need to use something like Heapy to get a good understanding of the actual memory footprint of an instance.
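To illustrate the shallow-size caveat, here is a small standalone example (not taken from the original code):
import sys
class Fact(object):
    def __init__(self, dimensionId, metric):
        self.dimensionId = dimensionId
        self.metric = metric
f = Fact(1, 2.0)
print(sys.getsizeof(f))           # shallow size of the instance only
print(sys.getsizeof(f.__dict__))  # the attribute dict is counted separately
# a tool like Heapy walks the reference graph to report the real footprint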
The last thing I learned is that, when Python is on the verge of running out of memory, and is thrashing like crazy, weird stuff happens that shouldn't be taken as part of the problem. In my case, the massive slow-down, the profile pointing to the weakref creation, and the hangup after the program completed, are all effects of the memory issue. Once I stopped creating and keeping around persistent instances, and instead just kept around the objects' properties that I needed, all the other issues went away.
800K ORM objects is very large. These are Python objects, each of which has a __dict__ as well as an _sa_instance_state attribute which is itself an object, which then has weakrefs and other things inside of it; then the Session has more than one weakref to your object. An ORM object is identity tracked, a feature which provides a high degree of automation in persistence, but at the cost of lots more memory and function call overhead.
As for why your profiling is all focused on that one weakref line, that seems very strange; I'd be curious to see the actual profile result there (see How can I profile a SQLAlchemy powered application? for background).
Your code example can be modified to not use any ORM identity-mapped objects as follows.
For more detail on bulk inserts, see Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?.
# 1. only load individual columns - loading simple tuples instead
# of full ORM objects with identity tracking. these tuples can be
# passed directly to the dict() constructor
dimensionMap = dict(
    connection.session.query(Dimension.businessKey, Dimension.primarySyntheticKey)
)
# 2. for bulk inserts, use a Table.insert() call with
# multiparams, in chunks
buf = []
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]
    buf.append({"dimensionId": dimensionId, "metric": metric})
    if len(buf) == 20000:
        connection.session.execute(Fact.__table__.insert(), params=buf)
        buf[:] = []
connection.session.execute(Fact.__table__.insert(), params=buf)
sys.stdout.write('All Done')

Efficient approach to catching database errors

I have a desktop app that has 65 modules, about half of which read from or write to an SQLite database. I've found that there are three ways that the database can throw an SQLiteDatabaseError:
SQL logic error or missing database (happens unpredictably every now and then)
Database is locked (if it's being edited by another program, like SQLite Database Browser)
Disk I/O error (also happens unpredictably)
Although these errors don't happen often, when they do they lock up my application entirely, and so I can't just let them stand.
And so I've started re-writing every single access of the database to be a pointer to a common "database-access function" in its own module. That function then can catch these three errors as exceptions and thereby not crash, and also alert the user accordingly. For example, if it is a "database is locked error", it will announce this and ask the user to close any program that is also using the database and then try again. (If it's the other errors, perhaps it will tell the user to try again later...not sure yet). Updating all the database accesses to do this is mostly a matter of copy/pasting the redirect to the common function--easy work.
The problem is: it is not sufficient to just provide this database-access function and its announcements, because at all of the points of database access in the 65 modules there is code that follows the access that assumes the database will successfully return data or complete a write--and when it doesn't, that code has to have a condition for that. But writing those conditionals requires carefully going into each access point and seeing how best to handle it. This is laborious and difficult for the couple of hundred database accesses I'll need to patch in this way.
I'm willing to do that, but I thought I'd inquire if there were a more efficient/clever way or at least heuristics that would help in finishing this fix efficiently and well.
(I should state that there is no particular "architecture" of this application...it's mostly what could be called "ravioli code", where the GUI and database calls and logic are all together in units that "go together". I am not willing to re-write the architecture of the whole project in MVC or something like this at this point, though I'd consider it for future projects.)
Your gut feeling is right. There is no way to add robustness to the application without reviewing each database access point separately.
You still have a lot of important choices about how the application should react to errors, depending on factors like:
Is it attended, or sometimes completely unattended?
Is delay OK, or is it important to report database errors promptly?
What are the relative frequencies of the three types of failure that you describe?
Now that you have a single wrapper, you can use it to do some common configuration and error handling (a rough sketch follows this list), especially:
set reasonable connect timeouts
set reasonable busy timeouts
enforce command timeouts on client side
retry automatically on errors, especially on SQLITE_BUSY (insert large delays between retries, fail after a few retries)
use exceptions to reduce the number of application-level handlers. You may be able to restart the whole application on database errors. However, do that only if you are confident about the state in which you are aborting the application; consistent use of transactions can ensure that the restart does not leave inconsistent data behind.
ask a human for help when you detect a locking error
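A rough sketch of such a wrapper, assuming the standard sqlite3 module; DB_PATH, MAX_RETRIES and run_query are placeholder names, not from the original application:
import sqlite3
import time
DB_PATH = "app.db"
MAX_RETRIES = 3
def run_query(sql, params=(), retries=MAX_RETRIES):
    for attempt in range(1, retries + 1):
        try:
            conn = sqlite3.connect(DB_PATH, timeout=30)  # waits up to 30s on a locked database
            try:
                with conn:  # commits on success, rolls back on error
                    return conn.execute(sql, params).fetchall()
            finally:
                conn.close()
        except sqlite3.OperationalError:
            # "database is locked" and disk I/O errors typically surface here
            if attempt == retries:
                raise  # let the caller (or a top-level handler) deal with it
            time.sleep(2 * attempt)  # back off before retrying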
...but there comes a moment where you need to bite the bullet and let the error out into the application, and see what all the particular callers are likely to do with it.

Ways to reduce loading time of wxPython GUI

This question is a continuation of my question Desktop GUI Loading Slow.
I have a desktop GUI developed in wxPython which uses SQLAlchemy for many record-fetch queries against the database. I am putting the fetched records in Python dictionaries and populating the GUI from them. But since I am reading thousands of records in the background, the GUI gets stuck and loads very slowly. Now the questions are:
Should I create an individual thread for each of the SQLAlchemy data-fetch queries? If so, is wx.CallAfter() the method I have to focus on (for each query)? If someone could give sample/untested code or a link, that would be helpful.
Is there any other way to implement lazy loading in a desktop GUI?
P.S.: Please note that this is the first time I am doing multithreading and wxPython; I was previously a web developer on Python/Django. Also, I can't share code due to restrictions.
You should redesign your app so that the data-loading part and the data-display part are separate. Load data in a separate thread that populates a data model in your app, and use that model to populate the GUI. That way the GUI loads fast, and places where the data has not arrived yet can display 'loading...' or something similar.
Another way to speed things up is to not run queries until they are needed, e.g. wrap them in a class with a get method that queries the DB on first access (a minimal sketch follows); how far you can take this depends on context.
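A minimal sketch of that kind of deferred-query wrapper (the names are illustrative, not from the original app):
class LazyQuery(object):
    """Runs the query function only on first access and caches the result."""
    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._result = None
        self._loaded = False
    def get(self):
        if not self._loaded:
            self._result = self._query_fn()  # hits the database only here
            self._loaded = True
        return self._result
# usage: customers = LazyQuery(lambda: session.query(Customer).all())
# the query runs the first time customers.get() is called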
Also, if the GUI is mostly for viewing, you could load a small first set of data and push the rest to other views that the user reaches through menus or tabs; that way you delay loading until it is needed, or load it in the background.
There are several ways to prevent your GUI from hanging. Of course you can do multi-threading and stuff the records in a global dictionary, but you'd probably run into the global interpreter lock (GIL), which would not help the responsiveness of your GUI.
The first option is to use the event-driven nature of the GUI toolkit and use the "timeout" or "timer" functionality provided by the toolkit to call a function that loads a couple of records every time it is called. A generator function works nicely for that, and this is probably the easiest to implement. How many records you can load in one go depends on the speed of the machine. I would suggest starting with a single record, measuring the loading time, and then increasing the number of records so that each invocation doesn't take longer than, say, 0.1 seconds.
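A rough sketch of the timer-plus-generator approach (the wx calls are real; fetch_batches is a placeholder for your own query code):
import wx
class RecordFrame(wx.Frame):
    def __init__(self):
        wx.Frame.__init__(self, None, title="Records")
        self.list = wx.ListBox(self)
        self.loader = self.load_records()      # generator: one batch per call
        self.timer = wx.Timer(self)
        self.Bind(wx.EVT_TIMER, self.on_tick, self.timer)
        self.timer.Start(50)                   # milliseconds between batches
    def load_records(self):
        for batch in fetch_batches(size=100):  # placeholder for the real DB query
            yield batch
    def on_tick(self, event):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.timer.Stop()
            return
        self.list.AppendItems([str(r) for r in batch])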
The second is to use a separate process for loading data, and then send it to the GUI in small chunks. Using a separate process (via the multiprocessing module) has the advantage that you cannot run into Python's GIL. Note that this method more or less includes the first one, because you still have to process messages from the second process in the event loop of the GUI.
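And a sketch of the separate-process variant, where the GUI drains a multiprocessing queue from its event loop (again, fetch_batches and add_rows are placeholders):
import multiprocessing
import wx
def worker(queue):
    # runs in its own process, so it never competes for the GUI process's GIL
    for batch in fetch_batches(size=500):  # placeholder for the real DB query
        queue.put(batch)
    queue.put(None)                        # sentinel: no more data
class LoaderMixin(object):
    def start_loading(self):
        self.queue = multiprocessing.Queue()
        multiprocessing.Process(target=worker, args=(self.queue,)).start()
        self.poll = wx.Timer(self)
        self.Bind(wx.EVT_TIMER, self.on_poll, self.poll)
        self.poll.Start(100)
    def on_poll(self, event):
        while not self.queue.empty():      # drain without blocking the GUI
            batch = self.queue.get()
            if batch is None:
                self.poll.Stop()
                return
            self.add_rows(batch)           # placeholder: whatever updates the widgets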
You don't mention which widgets you load your data into, but if you use wx.grid.Grid or ListCtrl, then yes, there is some "lazy" loading support in the virtual implementations of those widgets. See the wxPython demo for a grid that can hold a million cells, for example. Also see Anurag's answer. You really don't need to load all the data at once; just load the data that you can actually display. Then you can load more when the user scrolls (or pre-load it in a background thread).
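For the virtual-widget route, a minimal sketch of a virtual wx.ListCtrl that only asks for the rows it actually draws (self.rows is assumed to be data you have already fetched, or fetch on demand):
import wx
class VirtualList(wx.ListCtrl):
    def __init__(self, parent, rows):
        wx.ListCtrl.__init__(self, parent, style=wx.LC_REPORT | wx.LC_VIRTUAL)
        self.rows = rows                   # e.g. a list of tuples
        self.InsertColumn(0, "Name")
        self.InsertColumn(1, "Value")
        self.SetItemCount(len(rows))       # the widget never copies the rows itself
    def OnGetItemText(self, item, col):
        # called by wxPython only for rows that are currently visible
        return str(self.rows[item][col])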

Force commit of nested save() within a transaction

I have a function where I save a large number of models (thousands at a time); this takes several minutes, so I have written a progress bar to display progress to the user. The progress bar works by polling a URL (from JavaScript) and looking at a request.session value to see the state of the first call (the one that is saving).
The problem is that the first call is within a @transaction.commit_on_success decorator, and because I am using database-backed sessions, when I try to force request.session.save(), instead of committing immediately it is appended to the ongoing transaction. This results in the progress bar only being updated once all the saves are complete, thus rendering it useless.
My question is (and I'm 99.99% sure I already know the answer): can you commit individual statements within a transaction without committing the whole lot? I.e. I need to commit just the request.session.save() while leaving all of the others.
Many thanks, Alex
No, both your main saves and the status bar updates will be conducted using the same database connection so they will be part of the same transaction.
I can see two options to avoid this.
Either create your own, separate database connection and save the status bar updates using that.
Or don't save the status bar updates to the database at all, and instead use a cache to store them. As long as you don't use the database cache backend (ideally you'd use memcached) this will work fine.
My preferred option would be the second one. For the first, you'll need to delve into the Django internals to get your own database connection, so that code is likely to end up fragile and messy.
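A minimal sketch of the cache-based option, assuming a non-database cache backend (e.g. memcached) and hypothetical key and function names:
from django.core.cache import cache
from django.http import HttpResponse
# inside the long-running view, under @transaction.commit_on_success
def do_saves(request, objects_to_save):
    key = 'progress-%s' % request.session.session_key
    for i, obj in enumerate(objects_to_save, 1):
        obj.save()
        cache.set(key, int(100.0 * i / len(objects_to_save)), 3600)  # bypasses the DB transaction
# the view polled by the JavaScript progress bar
def progress(request):
    value = cache.get('progress-%s' % request.session.session_key, 0)
    return HttpResponse(str(value))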

Exactly how long do Django/Python/FastCGI Processes last?

I have been working on a website in Django, served using FCGI set up using an autoinstaller and a custom templating system.
As I have it set up now, each View is an instance of a class, which is bound to a template file at load time, not at execution time. That is, the class is bound to the template via a decorator:
@include("page/page.xtag")  # bind template to view class
class Page(Base):
    def main(self):  # main end-point to retrieve web page
        blah = get_some_stuff()
        return self.template.main(data=blah)  # evaluates template using some data
One thing I have noticed is that, since FCGI does not create a new process and reload all the modules/classes on every request, changes to the template do not automatically appear on the website until after I force a restart (i.e. by editing/saving a Python file).
The web pages also contain lots of data that is stored in .txt files in the filesystem. For example, I will load big snippets of code from separate files rather than leaving them in the template (where they clutter it up) or in the database (where it is inconvenient to edit them). Knowing that the process is persistent, I created an ad-hoc memcache by saving the text I loaded in a static dictionary in one of my classes:
import os
class XLoad:
    rawCache = {}  # {name: (mtime, text)}
    @staticmethod
    def loadRaw(source):
        latestTime = os.stat(source).st_mtime
        if source in XLoad.rawCache and latestTime <= XLoad.rawCache[source][0]:
            # if the cached version of the file is up to date, use it
            return XLoad.rawCache[source][1]
        else:
            # otherwise read it from disk, put it in the cache and use that
            text = open(source).read()
            XLoad.rawCache[source] = (latestTime, text)
            return text
This sped everything up considerably, because the two dozen or so code snippets which I was loading one by one from the filesystem were now being taken directly from the process's memory. Every time I forced a restart, it would be slow for one request while the cache filled up, then become blazing fast again.
My question is: what exactly determines how/when the process gets restarted, the classes and modules reloaded, and the data I keep in my static dictionary purged? Is it dependent on my installation of Python, or Django, or Apache, or FastCGI? Is it deterministic, based on time, on number of requests, on load, or pseudo-random? And is it safe to do this sort of in-memory caching (which really is very easy and convenient!), or should I look into some proper way of caching these file reads?
It sounds like you already know this; the process gets restarted:
When you edit a Python file.
When you restart the server.
When there is a nonrecoverable error.
Also known as "only when it has to".
Caching like this is fine -- you're doing it whenever you store anything in a variable. Since the information is read only, how could this not be safe? Try not to write changes to a file right after you've restarted the server; but the worst thing that could happen is one page view gets messed up.
There is a simple way to confirm all this -- logging. Have your decorators log when they are called, and log when you have to load a file from disk.
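For instance, a couple of logging calls dropped into loadRaw() from the question would make disk reads and cache hits visible (a sketch reusing the names above):
import logging
import os
log = logging.getLogger("xload")
class XLoad:
    rawCache = {}  # {name: (mtime, text)}
    @staticmethod
    def loadRaw(source):
        latestTime = os.stat(source).st_mtime
        cached = XLoad.rawCache.get(source)
        if cached and latestTime <= cached[0]:
            log.debug("cache hit for %s", source)
            return cached[1]
        log.info("reading %s from disk", source)  # a burst of these means the process restarted
        text = open(source).read()
        XLoad.rawCache[source] = (latestTime, text)
        return text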
In addition to the already mentioned reasons, Apache can be configured to terminate idle FastCGI processes after a specified timespan.
