SqlAlchemy look-ahead caching? - python

Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? Synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced them to SqlAlchemy. I'm sure there are changes in my declarative and db schema that could improve performance, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and in total my database has 800 records (all Integers and UnicodeText). My database engine is SQLite and the actual size is currently 242 KB.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. The access on these attributes is largely non-linear, and every time I do a lookup, my callstack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.

A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category  # no SQL will be emitted for the Category
This works because the query.get() that a many-to-one emits will look in the local identity map for the primary key first.
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.
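If RAM really is a non-issue, the same trick extends to the whole (small) database: loop over every mapped class up front so everything lands in the Session's identity map. A rough sketch, with hypothetical class names standing in for your eight declarative classes:
# Warm the identity map once, right after creating the Session; later
# many-to-one get()s resolve in memory instead of emitting singleton SELECTs.
for cls in (Category, User, Order, LineItem):
    Session.query(cls).all()
Note that one-to-many collection accesses will still emit SQL unless you also configure eager loading (e.g. joinedload() / subqueryload()) on those relationships.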

Related

The maximum number of objects that can be instantiated with a Django model?

I wrote an app to record user interactions with the website's search box;
the query string is saved as an object of the model SearchQuery. Whenever a user enters some data in the search box, I save the search query and some info related to the query to the database.
This is for the idea of tracking search trends.
The fields in my database model are:
A Character Field (max_length=30)
A PositiveIntegerField
A BooleanField
My questions are:
How many objects can be instantiated from the model SearchQuery? Is there a limit on the number?
As the objects are not related (no db relationships) should I use MongoDB or some kind of NoSQLs for performance?
Is this a good design or should I do some more work to make it efficient?
Django version 1.6.5
Python version 2.7
How many objects can be instantiated from the model SearchQuery? Is there a limit on the number?
As many as your chosen database can handle; this is probably in the millions. If you are concerned, you can use a scheduler to delete older queries when they are no longer useful.
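If you do go the scheduled-cleanup route, the job itself is tiny. A sketch, assuming you add a timestamp field (say created_at) that the model as described does not have yet:
from datetime import timedelta
from django.utils import timezone
from myapp.models import SearchQuery   # hypothetical app path

def prune_old_search_queries(days=90):
    cutoff = timezone.now() - timedelta(days=days)
    # Bulk delete in the database; nothing is loaded into Python memory.
    SearchQuery.objects.filter(created_at__lt=cutoff).delete()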
As the objects are not related (no db relationships) should I use MongoDB or some kind of NoSQLs for performance?
You could, but it's unlikely to give you much (if any) efficiency gain. Because you are doing frequent writes and (presumably) infrequent reads, you're unlikely to hit the database very hard at all.
Is this a good design or should I do some more work to make it efficient?
There are probably two recommendations I'd make.
a. If you are going to be doing frequent reads on the Search log, look at using multiple databases. One for your log, and one for everything else.
b. Consider just using a regular log file for this information. Again, you will probably only be examining this data infrequently, so there are strong arguments for piping it into a log file, probably CSV-like, to make data analysis of it easier.
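A sketch of option (b) using only the standard library (the file path is made up; the columns mirror the fields described in the question):
import csv
from datetime import datetime

def log_search(query_text, result_count, was_successful):
    # One CSV row per search; analyse later with a spreadsheet, awk, pandas, etc.
    with open('/var/log/myapp/search_queries.csv', 'ab') as f:
        csv.writer(f).writerow(
            [datetime.utcnow().isoformat(), query_text, result_count, was_successful])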

Does Python Django support custom SQL and denormalized databases with no Foreign Key relationships?

I've just started learning Python Django and have a lot of experience building high traffic websites using PHP and MySQL. What worries me so far is Python's overly optimistic approach that you will never need to write custom SQL and that it automatically creates all these Foreign Key relationships in your database. The one thing I've learned in the last few years of building Chess.com is that it's impossible NOT to write custom SQL when you're dealing with something like MySQL that frequently needs to be told what indexes it should use (or avoid), and that Foreign Keys are a death sentence. Percona's strongest recommendation was for us to remove all FKs for optimal performance.
Is there a way in Django to do this in the models file - create relationships without creating actual DB FKs? Or is there a way to start at the database level, design/create my database, and then have Django reverse engineer the models file?
If you don't want foreign keys, then avoid using
models.ForeignKey(),
models.ManyToManyField(), and
models.OneToOneField().
Django will automatically create an auto-increment int field named id that you can use to refer to individual records, or you can override that by marking a field as primary_key=True.
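For illustration, a sketch of a model that refers to another table without any database-level constraint (names are invented); note that later Django versions (1.6+) also added ForeignKey(..., db_constraint=False), which keeps ORM navigation but skips the constraint in the database:
from django.db import models

class Game(models.Model):
    # Plain integer columns instead of models.ForeignKey(Player): Django will
    # not create an FK constraint, and you decide yourself which indexes exist.
    white_player_id = models.IntegerField(db_index=True)
    black_player_id = models.IntegerField(db_index=True)
    result = models.CharField(max_length=7)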
There is also documentation on running raw SQL queries on the database.
Raw SQL is as easy as this:
for obj in MyModel.objects.raw('SELECT * FROM myapp_mymodel'):
    print obj
Denormalizing a database is up to you at model definition time.
You can use non-relational databases (MongoDB, ...) too with Django NonRel
django-admin inspectdb allows you to reverse engineer a models file from existing tables. That is only a very partial response to your question ;)
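For the "start at the database level" half of the question, the reverse-engineering step is a one-liner; the generated models.py is only a starting point that you then clean up by hand (the output path here is arbitrary):
python manage.py inspectdb > myapp/models.py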
You can just create models.py yourself and avoid having Django automatically create the tables, leaving it up to you to define the actual tables as you please. So although there are foreign key relationships in models.py, this does not mean that they must exist in the actual tables. This is a very good thing considering how ludicrously foreign key constraints are implemented in MySQL - MyISAM just ignores them and InnoDB creates a non-optional index on every single one regardless of whether it makes sense.
I concur with the 'no foreign keys' advice (with the disclaimer: I also work for Percona).
The reason it is recommended is concurrency / reducing locking internally.
It can be a difficult "optimization" to sell, but if you consider that the database has transactions (and is more or less ACID compliant) then it should only be application-logic errors that cause foreign-key violations. Not to say they don't exist, but if you enable foreign keys in development hopefully you should find at least a few bugs.
In terms of whether or not you need to write custom SQL:
The explanation I usually give is that "optimization rarely decreases complexity". I think it is okay to stick with an ORM by default, but if in a profiler it looks like one particular piece of functionality is taking a lot more time than you suspect it would when written by hand, then you need to be prepared to fix it (assuming the code is called often enough).
The real secret here is that you need good instrumentation / profiling in order to be frugal with your complexity-adding optimization(s).

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
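A rough Django-flavoured sketch of such a timed job, tied back to the pickled Q objects from the question (the model names, the as_q() helper and notify() are hypothetical):
from datetime import datetime
from myapp.models import Item, SavedSearch   # hypothetical models

def run_saved_searches(last_run):
    # Only consider rows touched since the previous run.
    new_items = Item.objects.filter(last_modified__gte=last_run)
    for saved in SavedSearch.objects.all():
        # as_q() would unpickle the Q object stored for this saved search.
        for item in new_items.filter(saved.as_q()):
            notify(saved.user, item)       # hypothetical notification hook
    return datetime.now()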
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be done in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
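A minimal sketch of that generic-relation + post-save idea (SavedSearch, its content_type field, matches() and notify() are assumptions, not an existing API):
from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver
from myapp.models import SavedSearch   # hypothetical model holding the stored search

@receiver(post_save)
def check_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    # Only evaluate the searches registered for this object type.
    for search in SavedSearch.objects.filter(content_type=ct):
        if search.matches(instance):        # hypothetical: applies the stored Q
            notify(search.user, instance)   # hypothetical notification hook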

Overhead of a Round-trip to MySql?

So I've been building django applications for a while now, and drinking the Kool-Aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80% - 90% of their time) was getting slow once you have a large amount of user-specific content (i.e. photos, friends, other data, etc.).
So I popped in the SQL logger (it was pre-installed with Pinax; I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries!! With hand-coded SQL I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good.
...even if only a dozen or so of the queries take 1ms+
So I'm wondering, how much overhead is there on a round trip to mysql? django and mysql are running on the same server so there shouldn't be any networking related overhead.
Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
I had - like you - a home page of one of my applications that had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring more of the data I needed - I went from hundreds of queries to tens.
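For anyone hitting the same wall, the change was essentially of this shape (model and field names are invented):
# Before: each p.album access can emit its own SELECT.
for p in Photo.objects.filter(owner=request.user):
    print p.album.title

# After: one JOINed query fetches the albums along with the photos.
for p in Photo.objects.filter(owner=request.user).select_related('album'):
    print p.album.title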
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
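A minimal sketch of per-view caching with Django's cache framework (the timeout, model and template names are arbitrary):
from django.shortcuts import render
from django.views.decorators.cache import cache_page

@cache_page(60 * 5)                  # reuse the rendered response for five minutes
def popular_photos(request):
    # The expensive ORM work below runs at most once per five minutes.
    photos = Photo.objects.order_by('-like_count')[:20]
    return render(request, 'popular.html', {'photos': photos})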
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a use case where studying and tuning the ORM navigation didn't help, and you are sure that you could do it much better with a standard query: use raw SQL for that case.
The overhead of each query is only part of the picture. The actual round trip time between your Django and MySQL servers is probably very small since most of your queries are coming back in less than one millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much, and even 50 seems like a lot to me. If ten users view complicated pages you're now up to 5,000 queries.
The round trip time to the database server is more of a factor when the caller is accessing the database from a Wide Area Network, where roundtrips can easily be between 20ms and 100ms.
I would definitely look into using some kind of caching.
There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or template via {%if%}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{foo.bar.baz.quux}}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query business with some careful .filter() and Python processing to assemble a useful tuple in the view function.
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.
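A sketch of what that looks like in practice, with invented Foo/Bar names echoing the template example above:
from django.shortcuts import render

def dashboard(request):
    foos = list(Foo.objects.filter(owner=request.user))
    # One extra query for all the related Bars, instead of one query per Foo
    # when the template walks {{ foo.bar }} row by row.
    bars = dict((b.pk, b) for b in
                Bar.objects.filter(pk__in=set(f.bar_id for f in foos)))
    rows = [(f, bars[f.bar_id]) for f in foos]   # assembled in Python
    return render(request, 'dashboard.html', {'rows': rows})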
There is always overhead in database calls. In your case the overhead is not that bad because the application and database are on the same machine, so there is no network latency, but there is still a significant cost.
When you make a request to the database it has to prepare to service that request by doing a number of things including:
Allocating resources (memory buffers, temp tables etc) to the database server connection/thread that will handle the request,
De-serializing the SQL and parameters (this is necessary even on one machine, as this is an inter-process request unless you are using an embedded database),
Checking whether the query exists in the query cache; if not, optimising it and putting it in the cache.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL) this may result in cache misses for statements that should be the same, meaning that each request results in the query being analysed and optimised each time (see the example after this list).
Process the query.
Prepare and return the results to the client.
This is just an overview of the kinds of things most database management systems do to process an SQL request. You incur this overhead 500 times even if the query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.
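To make the parametrisation point concrete (table and column names are invented): the first form lets the server see the same statement shape for every call, while the second produces a new statement string for every distinct value, defeating any statement caching.
from django.db import connection

user_id = 42
cursor = connection.cursor()
# Parametrised: the SQL text is constant; the value travels separately.
cursor.execute("SELECT * FROM myapp_photo WHERE user_id = %s", [user_id])
# Not parametrised: each user_id yields a different statement string,
# so it is parsed and optimised from scratch every time.
cursor.execute("SELECT * FROM myapp_photo WHERE user_id = %d" % user_id)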

Map raw SQL to multiple related Django models

Due to performance reasons I can't use the ORM query methods of Django and I have to use raw SQL for some complex queries. I want to find a way to map the results of a SQL query to several models.
I know I can use the following statement to map the query results to one model, but I can't figure how to use it to be able to map to related models (like I can do by using the select_related statement in Django).
model_instance = MyModel(**dict(zip(field_names, row_data)))
Is there a relatively easy way to be able to map fields of related tables that are also in the query result set?
First, can you prove the ORM is stopping your performance? Sometimes performance problems are simply poor database design, or improper indexes. Usually this comes from trying to force-fit Django's ORM onto a legacy database design. Stored procedures and triggers can have adverse impact on performance -- especially when working with Django where the trigger code is expected to be in the Python model code.
Sometimes poor performance is an application issue. This includes needless order-by operations being done in the database.
The most common performance problem is an application that "over-fetches" data. Casually using the .all() method and creating large in-memory collections. This will crush performance. The Django query sets have to be touched as little as possible so that the query set iterator is given to the template for display.
Once you choose to bypass the ORM, you have to fight out the Object-Relational Impedance Mismatch problem. Again. Specifically, relational "navigation" has no concept of "related": it has to be a first-class fetch of a relational set using foreign keys. To assemble a complex in-memory object model via SQL is simply hard. Circular references make this very hard; resolving FK's into collections is hard.
If you're going to use raw SQL, you have two choices.
Eschew "select related" -- it doesn't exist -- and it's painful to implement.
Invent your own ORM-like "select related" features. A common approach is to add stateful getters that (a) check a private cache to see if they've already fetched the related object and, if not, (b) fetch the related object from the database and update the cache.
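A bare-bones version of such a stateful getter might look like this (class, table and column names are placeholders):
from django.db import connection

class Order(object):
    """Plain object built from a raw SQL row, not a Django model."""
    def __init__(self, id, customer_id, total):
        self.id, self.customer_id, self.total = id, customer_id, total
        self._customer = None                  # private cache for the related row

    def customer(self):
        if self._customer is None:
            # Fetch lazily, once, then reuse on every later access.
            cur = connection.cursor()
            cur.execute("SELECT id, name FROM app_customer WHERE id = %s",
                        [self.customer_id])
            self._customer = cur.fetchone()
        return self._customer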
In the process of inventing your own stateful getters, you'll be reinventing Django's, and you'll probably discover that it isn't the ORM layer, but a database design or an application design issue.
