Overhead of a round-trip to MySQL? - python

So I've been building Django applications for a while now, and drinking the Kool-Aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80%-90% of their time) was getting slow once you have a large amount of user-specific content (i.e. photos, friends, other data, etc.).
So I popped in the SQL logger (it came pre-installed with Pinax; I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries! With hand-coded SQL I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good, even if only a dozen or so of the queries take more than 1 ms.
So I'm wondering: how much overhead is there on a round trip to MySQL? Django and MySQL are running on the same server, so there shouldn't be any networking-related overhead.

Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
Like you, I had an application whose home page had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring back more of the data I needed in one go - I went from hundreds of queries to tens.
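A minimal sketch of the kind of change that made the difference; Photo and its owner foreign key are made-up names for the example:

    # Before: one query for the photos, plus one extra query per photo
    # the moment the template touches photo.owner.
    photos = Photo.objects.filter(album=album)

    # After: a single JOINed query pulls the owners along with the photos,
    # so the template can read photo.owner without hitting the DB again.
    photos = Photo.objects.filter(album=album).select_related("owner")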
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a use case where studying and tuning the ORM navigation didn't help, and you are sure that you could do it much better with a hand-written query: use raw SQL for that case.
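For the record, Django still lets you get model instances back from hand-written SQL; the model and table names below are placeholders, not your schema:

    # Raw SQL through the ORM: returns Photo instances; parameters are
    # passed separately so they are properly escaped.
    photos = Photo.objects.raw(
        "SELECT * FROM app_photo WHERE owner_id = %s ORDER BY created DESC",
        [user.id],
    )

    # Or bypass the models entirely for aggregate-style work:
    from django.db import connection
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM app_photo WHERE owner_id = %s", [user.id])
        total = cursor.fetchone()[0]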

The overhead of each query is only part of the picture. The actual round-trip time between your Django and MySQL servers is probably very small, since most of your queries are coming back in less than a millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much, and even 50 seems like a lot to me. If ten users view complicated pages, you're now up to 5,000 queries.
The round-trip time to the database server is more of a factor when the caller is accessing the database over a wide area network, where round trips can easily be between 20 ms and 100 ms.
I would definitely look into using some kind of caching.

There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or the template via {% if %}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{ foo.bar.baz.quux }}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query traffic with some careful .filter() calls and Python processing to assemble a useful tuple in the view function.
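A rough sketch of that idea; Foo and Bar mirror the template example above and are purely illustrative:

    # Fetch each table in one batch, then join them in Python instead of
    # letting {{ foo.bar... }} issue a query per row in the template.
    foos = list(Foo.objects.filter(active=True))
    bars = {b.pk: b for b in Bar.objects.filter(pk__in=[f.bar_id for f in foos])}

    # One tuple per row, ready for the template to iterate over with no
    # further database work.
    rows = [(foo, bars[foo.bar_id]) for foo in foos]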
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.

There is always overhead in database calls. In your case the overhead is not that bad, because the application and database are on the same machine so there is no network latency, but there is still a significant cost.
When you make a request to the database, it has to prepare to service that request by doing a number of things, including:
Allocating resources (memory buffers, temp tables, etc.) to the server connection/thread that will handle the request;
De-serializing the SQL and parameters (necessary even on one machine, since this is an inter-process request unless you are using an embedded database);
Checking whether the query exists in the query cache and, if not, optimising it and putting it in the cache;
Processing the query;
Preparing and returning the results to the client.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL), statements that should be identical can miss the cache, meaning each request gets analysed and optimised from scratch (see the sketch below).
This is just an overview of the kinds of things most database management systems do to process a SQL request. You incur this overhead 500 times, even if each query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.
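To illustrate the parametrisation point: a hedged sketch using DB-API placeholders, with table and column names invented for the example.

    # Values baked into the SQL text: every distinct user_id produces a
    # different statement, so the query cache treats them as unrelated.
    cursor.execute("SELECT * FROM photos WHERE user_id = %d" % user_id)   # avoid

    # Parametrised: one statement text, values sent separately, so the
    # cached parse/plan can be reused and escaping is handled for you.
    cursor.execute("SELECT * FROM photos WHERE user_id = %s", [user_id])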

Related

Having issues doing fast enough db inserts inside a Flask endpoint

I have an HTTP POST endpoint in Flask which needs to insert whatever data comes in into a database. This endpoint can receive up to hundreds of requests per second. Doing an insert every time a new request comes in takes too much time. I thought that doing a bulk insert every 1000 requests, with the data from all of the previous 1000 requests, should work like some sort of caching mechanism. I have tried saving 1000 incoming data objects into a collection and then doing a bulk insert once the array is 'full'.
Currently my code looks like this:
@app.route('/user', methods=['POST'])
def add_user():
    firstname = request.json['firstname']
    lastname = request.json['lastname']
    email = request.json['email']
    usr = User(firstname, lastname, email)

    global bulk
    bulk.append(usr)
    if len(bulk) > 1000:
        db.session.bulk_save_objects(bulk)
        db.session.commit()
        bulk = []

    return user_schema.jsonify(usr)
The problem I'm having with this is that the database becomes 'locked', and I really don't know if this is a good solution that's just poorly implemented, or a stupid idea altogether.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
Your error message indicates that you are using an SQLite DB with SQLAlchemy. You may want to try changing the SQLite "synchronous" flag to turn syncing OFF. This can speed up INSERT queries dramatically, but it does come with an increased risk of data loss. See https://sqlite.org/pragma.html#pragma_synchronous for more details.
With synchronous OFF (0), SQLite continues without syncing as soon as it has handed data off to the operating system. If the application running SQLite crashes, the data will be safe, but the database might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface. On the other hand, commits can be orders of magnitude faster with synchronous OFF.
If your application and use case can tolerate the increased risks, then disabling syncing may negate the need for bulk inserts.
See "How to set SQLite PRAGMA statements with SQLAlchemy": How to set SQLite PRAGMA statements with SQLAlchemy
Once I moved the code to AWS and used an Aurora instance as the database, the problems went away, so I suppose it's safe to conclude that the issues were solely related to my SQLite instance.
The final solution gave me satisfactory results and I ended up changing only this line:
db.session.bulk_save_objects(bulk)
to this:
db.session.add_all(bulk)
I can now safely handle 400 or more (I haven't tested for more) calls per second to that specific endpoint, all ending in valid inserts.
I'm not an expert on this, but it seems like the database has reached its concurrency limits. You could try using Pony ORM for better concurrency and transaction management:
https://docs.ponyorm.org/transactions.html
By default Pony uses the optimistic concurrency control concept for increasing performance. With this concept, Pony doesn’t acquire locks on database rows. Instead it verifies that no other transaction has modified the data it has read or is trying to modify.
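A minimal Pony sketch of the same endpoint's data layer (entity and field names just mirror your example and are not prescriptive):

    from pony.orm import Database, Required, db_session

    db = Database()

    class User(db.Entity):
        firstname = Required(str)
        lastname = Required(str)
        email = Required(str)

    db.bind(provider="sqlite", filename="users.db", create_db=True)
    db.generate_mapping(create_tables=True)

    # Each call runs in its own db_session; Pony commits on exit and uses
    # optimistic checks instead of holding row locks.
    @db_session
    def add_user(firstname, lastname, email):
        User(firstname=firstname, lastname=lastname, email=email)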

Recommendation for manipulating data with python vs. SQL for a django app

Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information, such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using Python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done before in this web app. I am using a PostgreSQL database to store the data, Django to build the app, and FusionCharts for the visualization. In order to get the information into Postgres, I am using a Python script with SQLAlchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will populate the charts. 1) I can use the same script that exports the data to Postgres to arrange the data the way I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer, for instance) and then perform calculations on the groups by column. I could do this for each different slice I want and then export a different table for each model class to Postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but less flexibility (if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from the raw sources, with an updated schema). With the second approach you (hopefully) have a well-normalized schema, but one that makes queries more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data alongside each other.
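Purely as an illustration (model names invented), that could be as simple as keeping the raw rows normalized and maintaining a pre-aggregated table next to them, refreshed by the same script that loads the data:

    from django.db import models

    class PartSale(models.Model):                 # fully normalized raw data
        customer = models.CharField(max_length=100)
        part = models.CharField(max_length=100)
        quantity = models.IntegerField()
        sold_on = models.DateField()

    class CustomerMonthlySummary(models.Model):   # denormalized, chart-ready
        customer = models.CharField(max_length=100)
        month = models.DateField()
        total_quantity = models.IntegerField()

        class Meta:
            unique_together = ("customer", "month")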
As a side note: the Django ORM is indeed mostly an "80/20" tool - it's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), after which it does become a bit of a PITA - but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
Oh, and yes: your problem is nothing new - you may want to read about OLAP.

Python Django cache vs store in model field? Which is more efficient?

My view displays a table of data (specific to a customer, who may have many users), and this table takes a lot of computational resources to populate. A customer's data changes 4-5 times a week, usually on the same day.
Caching is an obvious solution to this, but I was wondering if Django's cache framework is significantly more efficient than creating a TextField at the customer level and storing the data there instead.
I feel it's easier to implement (and to clear the TextField when the data changes), but what are the drawbacks, and is there anything else I need to look out for? (Problems if the dataset gets too big? Additional fields in the model? Etc.)
Any help would be much appreciated!
A cache is a cache is a cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it lives (well, can live, depending on which backend you choose) outside your database. Whether that's a pro or a con really depends on your database load, infrastructure and whatnot... If you already use the cache framework (for more than plain unconditional full-page caching, I mean) and want to minimize the load on your database, then it's possibly worth the added complexity.
Otherwise, storing your computed result in the db is quite straightforward and doesn't require additional servers, installs, etc. I'd personally go for a dedicated model - to avoid unnecessary overhead at the db level - including both the cached result and a checksum of the params on which this result depends (the canonical memoization pattern), so you can easily detect whether it needs to be recomputed. I found this solution easier to maintain than trying to detect changes to each and every one of those params and invalidate/recompute the cache "on the fly" (which is what can make proper cache invalidation difficult, or at least complex to implement), but this again depends on what those params are and where they come from.
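A rough sketch of that pattern (model and function names invented; expensive_computation stands in for whatever builds your table):

    import hashlib
    import json

    from django.db import models

    class CachedReport(models.Model):
        params_checksum = models.CharField(max_length=64, unique=True)
        result = models.TextField()
        computed_at = models.DateTimeField(auto_now=True)

    def get_report(params):
        # Canonical key for this exact set of parameters.
        checksum = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()
        try:
            return CachedReport.objects.get(params_checksum=checksum).result
        except CachedReport.DoesNotExist:
            result = expensive_computation(params)   # hypothetical heavy work
            CachedReport.objects.create(params_checksum=checksum, result=result)
            return result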
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody except you knows what kind of computation you have, how much data you need to store, how many users you have, and so on.
If you want to implement this with a TextField, it's still some kind of caching system, so I suggest using Django's caching framework with the database backend first: https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve the data with one query like in the TextField case, but you can later replace the database with another cache layer if necessary.
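Roughly, that's one settings entry plus manage.py createcachetable, and then get/set in the view (build_customer_table is a placeholder for your expensive computation):

    # settings.py
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.db.DatabaseCache",
            "LOCATION": "my_cache_table",   # create it with `manage.py createcachetable`
        }
    }

    # views.py
    from django.core.cache import cache

    def customer_table(customer):
        key = "customer_table_%s" % customer.pk
        data = cache.get(key)
        if data is None:
            data = build_customer_table(customer)   # hypothetical heavy computation
            cache.set(key, data, timeout=None)      # clear explicitly when the data changes
        return data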

Optimize Django

I have a problem with the speed of page loading.
Right now it takes about 7 seconds to load a page, and 2-3 seconds of that is Django processing.
The obvious thing to blame is my lack of architectural knowledge: the pages execute on average 50 queries, as shown by the Django Debug Toolbar, but most of the queries are things like "yesterday's snapshot (grouped by something)" or "daily snapshots (grouped by something) before yesterday" and don't have to be recomputed on every request.
The ideas I have come up with are in-memory caching or creating a new table to hold this kind of pre-computable data.
Is there any convention or Design Pattern for this kind of issue?
Sample queries are these (I believe they should not be run on every request against yesterday's or last month's data):
SELECT `sample_salestarget`.`id`, `sample_salestarget`.`country_id`, `sample_salestarget`.`year`, `sample_salestarget`.`month`, `sample_salestarget`.`sales` FROM `sample_salestarget` WHERE (`sample_salestarget`.`country_id` = "abc" AND `sample_salestarget`.`month` = 8 AND `sample_salestarget`.`year` = 2012 )
SELECT `sample_dailysummary`.`id`, `sample_dailysummary`.`country_id`, `sample_dailysummary`.`date`, `sample_dailysummary`.`pv_day`, `sample_dailysummary`.`pv_week`, `sample_dailysummary`.`pv_month`, `sample_dailysummary`.`active_uu_day`, `sample_dailysummary`.`active_uu_week`, `sample_dailysummary`.`active_uu_month`, `sample_dailysummary`.`active_uu_7days`, `sample_dailysummary`.`active_uu_30days`, `sample_dailysummary`.`paid_uu_day`, `sample_dailysummary`.`paid_uu_week`, `sample_dailysummary`.`paid_uu_month`, `sample_dailysummary`.`sales_day`, `sample_dailysummary`.`sales_week`, `sample_dailysummary`.`sales_month`, `sample_dailysummary`.`register_uu_day`, `sample_dailysummary`.`register_uu_week`, `sample_dailysummary`.`register_uu_month`, `sample_dailysummary`.`pay_count_day`, `sample_dailysummary`.`pay_count_week`, `sample_dailysummary`.`pay_count_month`, `sample_dailysummary`.`total_user`, `sample_dailysummary`.`inv_access_uu`, `sample_dailysummary`.`inv_sender_uu`, `sample_dailysummary`.`inv_accepted_uu`, `sample_dailysummary`.`inv_send_count`, `sample_dailysummary`.`memo`, `sample_dailysummary`.`first_charge_uu` FROM `sample_dailysummary` WHERE `sample_dailysummary`.`date` = 2012-09-07 AND `sample_dailysummary`.`country_id` = "abc" )
Using Memcached can really speed things up for you. However, it does come with its own problems. You have to be extra careful on dynamic pages about explicitly invalidating caches whenever required.
Along with Memcached, try johnny-cache, which does a very good job of caching your Django ORM queries.
Also, make use of Django's session variables as far as possible (try the cached_db session engine if you're using Memcached). You could store objects (like the user's profile settings) that stay consistent throughout a session. This way you're again reducing the number of SQL calls.
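A hedged sketch of the relevant settings (assumes a Memcached instance on the default local port):

    # settings.py
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # Sessions are written to the DB but read back through the cache.
    SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"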
And if you really, really need quick page loads, maybe try loading your page first and then asynchronously running your SQL queries with Celery, loading the results in an AJAX-y manner.
If this is a production application exposed to the internet and you can't reduce the number of queries you make, then you should at least reuse the answers. I would suggest using Django's built-in cache support to store database results in RAM using Memcached. If this is a local app, I would suggest Django's local-memory cache. The reason is that Memcached can be scaled a lot further than Django's own cache, but Django's requires almost no setup.
Caching for Django

Reverse Search Best Practices?

I'm making an app that needs reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects are entered into the system, if they match a user's existing saved search parameters, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
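A rough sketch of the "run each saved query against just the new row" idea, to make the shape of it concrete (SavedSearch and its fields are invented for the example):

    import pickle

    from django.contrib.auth.models import User
    from django.db import models

    class SavedSearch(models.Model):
        user = models.ForeignKey(User, on_delete=models.CASCADE)
        pickled_q = models.BinaryField()   # a pickled Q object

    def matching_searches(new_obj):
        # Still loads every saved search, which is the scaling worry above,
        # but each check is a cheap primary-key-restricted query.
        for search in SavedSearch.objects.all():
            q = pickle.loads(search.pickled_q)
            if type(new_obj).objects.filter(q, pk=new_obj.pk).exists():
                yield search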
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database with a last-modified date later than the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the query in the database. However, this is a bit trickier if notifications also need to be sent when items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
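For the queue hand-off, a minimal sketch with the pika AMQP client (queue name and payload are placeholders):

    import json

    import pika

    # Trigger side: publish "this object changed" events onto a queue.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="object_changes", durable=True)

    def publish_change(obj_type, obj_id):
        channel.basic_publish(
            exchange="",
            routing_key="object_changes",
            body=json.dumps({"type": obj_type, "id": obj_id}),
        )

    # A separate consumer process reads these messages, runs the relevant
    # saved searches and sends notifications, keeping that work off the request path.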
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-documents and indexing them based on all of their must-have and may-have terms. A new document's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run, and then only those searches were run against the new documents. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries - often no more than 10 or 15.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the document collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10 million new documents a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64 MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to be a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to build.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That will probably still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
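A hedged sketch of that wiring, reusing the SavedSearch idea from earlier with a content_type field added (notify is a hypothetical helper):

    import pickle

    from django.contrib.contenttypes.models import ContentType
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    @receiver(post_save)
    def run_saved_searches(sender, instance, created, **kwargs):
        if not created:
            return
        ct = ContentType.objects.get_for_model(sender)
        # Only the searches registered for this model type are loaded and run.
        for search in SavedSearch.objects.filter(content_type=ct):
            q = pickle.loads(search.pickled_q)
            if sender.objects.filter(q, pk=instance.pk).exists():
                notify(search.user, instance)   # hypothetical notification helper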
