I have a problem with the speed of page loading.
Currently it takes about 7 seconds to load a page, and 2-3 seconds of that is Django processing.
The obvious thing to blame is my lack of architectural knowledge: each page executes an average of 50 queries, as shown by the Django Debug Toolbar, but most of them are things like "yesterday's snapshot (group by something)" or "daily snapshot (group by something) before yesterday", which don't have to be recomputed on every request.
The ideas I've come up with so far are in-memory caching, or creating a new table for data that can be prepared in advance.
Is there any convention or Design Pattern for this kind of issue?
Sample queries are shown below (I believe they shouldn't run on every request, since they only touch yesterday's or last month's data):
SELECT `sample_salestarget`.`id`, `sample_salestarget`.`country_id`, `sample_salestarget`.`year`, `sample_salestarget`.`month`, `sample_salestarget`.`sales` FROM `sample_salestarget` WHERE (`sample_salestarget`.`country_id` = "abc" AND `sample_salestarget`.`month` = 8 AND `sample_salestarget`.`year` = 2012 )
SELECT `sample_dailysummary`.`id`, `sample_dailysummary`.`country_id`, `sample_dailysummary`.`date`, `sample_dailysummary`.`pv_day`, `sample_dailysummary`.`pv_week`, `sample_dailysummary`.`pv_month`, `sample_dailysummary`.`active_uu_day`, `sample_dailysummary`.`active_uu_week`, `sample_dailysummary`.`active_uu_month`, `sample_dailysummary`.`active_uu_7days`, `sample_dailysummary`.`active_uu_30days`, `sample_dailysummary`.`paid_uu_day`, `sample_dailysummary`.`paid_uu_week`, `sample_dailysummary`.`paid_uu_month`, `sample_dailysummary`.`sales_day`, `sample_dailysummary`.`sales_week`, `sample_dailysummary`.`sales_month`, `sample_dailysummary`.`register_uu_day`, `sample_dailysummary`.`register_uu_week`, `sample_dailysummary`.`register_uu_month`, `sample_dailysummary`.`pay_count_day`, `sample_dailysummary`.`pay_count_week`, `sample_dailysummary`.`pay_count_month`, `sample_dailysummary`.`total_user`, `sample_dailysummary`.`inv_access_uu`, `sample_dailysummary`.`inv_sender_uu`, `sample_dailysummary`.`inv_accepted_uu`, `sample_dailysummary`.`inv_send_count`, `sample_dailysummary`.`memo`, `sample_dailysummary`.`first_charge_uu` FROM `sample_dailysummary` WHERE (`sample_dailysummary`.`date` = '2012-09-07' AND `sample_dailysummary`.`country_id` = "abc")
Using Memcached can really speed things up for you. However, it does come with its own problems: on dynamic pages you have to be extra careful about explicitly invalidating caches whenever required.
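For example, a minimal sketch of that explicit-invalidation pattern with Django's low-level cache API (this assumes a cache backend is already configured; the key name and compute_daily_summary() are placeholders for illustration):

from django.core.cache import cache

SNAPSHOT_KEY = 'daily-summary:abc'  # hypothetical key name

def get_daily_summary():
    data = cache.get(SNAPSHOT_KEY)
    if data is None:
        data = compute_daily_summary()  # your expensive snapshot query/aggregation
        cache.set(SNAPSHOT_KEY, data, 60 * 60 * 24)  # keep it for a day
    return data

def daily_summary_changed():
    # explicit invalidation: call this whenever the underlying data changes
    cache.delete(SNAPSHOT_KEY)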
Along with Memcached, try johnny-cache, which does a very good job of caching your Django ORM queries.
Also, make use of Django's session variables as far as possible (try the cached_db session engine if you're using Memcached). You could save objects (like your user profile settings) that stay consistent throughout a session. This way you reduce the number of SQL calls again.
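Switching to that session engine is a one-line settings change (sketch, assuming Memcached is already your default cache backend):

# settings.py
SESSION_ENGINE = 'django.contrib.sessions.backends.cached_db'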
And if you really, really need quick page loads, maybe try loading your page first and fetching the expensive data asynchronously: run the SQL in a background task with Celery and pull the results into the page in an AJAXy manner.
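A rough sketch of that idea (the task, view and key names are made up, and a working Celery setup is assumed):

# tasks.py
from celery import shared_task
from django.core.cache import cache

@shared_task
def refresh_daily_summary():
    data = compute_daily_summary()  # hypothetical expensive aggregation
    cache.set('daily-summary', data, 60 * 60 * 24)

# views.py
from django.core.cache import cache
from django.http import JsonResponse

def daily_summary_ajax(request):
    data = cache.get('daily-summary')
    if data is None:
        refresh_daily_summary.delay()  # kick off the background job
        return JsonResponse({'status': 'pending'})
    return JsonResponse({'status': 'ok', 'data': data})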
If this is a production application exposed to the internet and you can't reduce the number of queries you make, then you should at least reuse the results. I would suggest using Django's built-in cache framework with the Memcached backend to store database results in RAM. If this is a local app, then I would suggest Django's local-memory cache instead. The reason is that Memcached can be scaled a lot further than Django's own backend, but Django's requires very little setup.
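The two setups differ only in the CACHES setting; something along these lines (the server address is a placeholder, and the exact Memcached backend path depends on your Django version):

# settings.py -- production: Memcached
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

# settings.py -- local app: Django's in-process memory cache
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}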
See the Django documentation: Caching for Django.
My view displays a table of data (specific to a customer who may have many users), and this table takes a lot of computational resources to populate. A customer's data changes 4-5 times a week, usually on the same day.
Caching is an obvious solution to this, but I was wondering whether Django's cache framework is significantly more efficient than adding a TextField at the customer level and storing the data there instead.
I feel the TextField is easier to implement (and to clear when the data changes), but what are the drawbacks, and is there anything else I need to look out for? (Problems if the dataset gets too big? Additional fields in the model, etc.?)
Any help would be much appreciated!
A cache is a cache is a cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it lives (well, can live, depending on which backend you choose) outside your database. Whether that's a pro or a con really depends on your database load, infrastructure and so on. If you already use the cache framework (for more than plain unconditional full-page caching, I mean) and want to minimize the load on your database, then it's possibly worth the added complexity.
Otherwise, storing your computed result in the DB is quite straightforward and doesn't require additional servers, installs, etc. I'd personally go for a dedicated model, to avoid unnecessary overhead at the DB level, holding both the cached result and a checksum of the params this result depends on (the canonical memoization pattern), so you can easily detect whether it needs to be recomputed. I found this solution easier to maintain than trying to detect changes to each and every one of those params and invalidate/recompute the cache on the fly (which is what can make proper cache invalidation difficult, or at least complex to implement), but again this depends on what those params are and where they come from.
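A minimal sketch of that dedicated-model idea (the model, field and compute_report() names are invented; the checksum is just a hash of the parameters the result depends on):

import hashlib
import json

from django.db import models

class CachedReport(models.Model):
    customer_id = models.IntegerField(unique=True)
    params_checksum = models.CharField(max_length=64)
    result = models.TextField()  # or a JSONField
    computed_at = models.DateTimeField(auto_now=True)

def get_report(customer_id, params):
    checksum = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    row = CachedReport.objects.filter(customer_id=customer_id).first()
    if row is None or row.params_checksum != checksum:
        result = compute_report(customer_id, params)  # your expensive computation
        row, _ = CachedReport.objects.update_or_create(
            customer_id=customer_id,
            defaults={'params_checksum': checksum, 'result': result},
        )
    return row.result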
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody except you knows what kind of computation you have, how much data you need to store, how many users you have, and so on.
If you implement this with a TextField, it is still a caching system of sorts, so I suggest using Django's caching framework with the database backend first: https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve the data with one query as you could with a TextField, but later you can swap the database backend for another cache layer if necessary.
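Setting up the database cache backend is one settings entry plus one management command (the table name is arbitrary):

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'my_cache_table',
    }
}

Then create the table with python manage.py createcachetable.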
In a Django + PostgreSQL website of mine, I need to publicly show who is online at a point in time (it's a social website). How do I do this? For instance, is there a way to enumerate all logged-in users who hit my nginx webserver in the previous 10 minutes? Something like that could work. I'm a beginner and fishing for a viable solution at the moment.
Currently, to accomplish this, I store sessions in the database, using an external library to make sessions enumerable. This allows me to query how many unique users are online at a point in time.
But this scheme creates a lot of needless DB traffic. As a result, logging and pruning logs has become ineffective. Moreover, pgFouine shows me that session-related DB calls are the biggest performance bottleneck my website currently has.
There's a proposed solution here, but it uses the database.
Use django's cache framework to save the result of the db query to memory. That way you don't need to do the expensive database query for every page render.
from django.core.cache import cache

def count_current_users():
    users = cache.get('users')
    if users is None:
        # last count has timed out
        users = do_expensive_db_query()
        cache.set('users', users, timeout=500)
    return users
https://docs.djangoproject.com/en/1.10/topics/cache/#basic-usage
You can also use Template fragment caching and write a custom template tag that only runs the db query if the cache is stale. This will cache the result for 500 seconds.
{% cache 500 logged_in_users %}
{% expensive_query_db_for_logged_in_users %}
{% endcache %}
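The custom tag itself could be something like this minimal sketch (the module name and the query inside are placeholders; how you define "logged in" is up to you):

# myapp/templatetags/online_users.py
from django import template
from django.contrib.auth import get_user_model

register = template.Library()

@register.simple_tag
def expensive_query_db_for_logged_in_users():
    # placeholder query -- replace with your actual definition of "online"
    return get_user_model().objects.filter(is_active=True).count()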
If you want your user count to be more real time, you probably have to bypass django's cache framework, and communicate directly with Redis.
Store each logged-in user as a key with a set time to live. Getting the list of currently active keys from Redis would be much cheaper than the equivalent query against a SQL database, and it can be implemented with just a few lines of Python code.
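Something along these lines with the redis-py client (the key prefix and timeout are arbitrary choices, and Redis is assumed to be on localhost):

import redis

r = redis.Redis()  # assumes Redis on localhost:6379

def mark_online(user_id, ttl=600):
    # refresh the key on every request; it expires ttl seconds after the last hit
    r.setex(f'online-user:{user_id}', ttl, 1)

def online_user_ids():
    return [key.split(b':', 1)[1].decode() for key in r.scan_iter('online-user:*')]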
If you're using django-user-sessions, the Session model has a last_activity field.
You may be able to do something like:
from datetime import timedelta

from django.utils import timezone
from user_sessions.models import Session

time_threshold = timezone.now() - timedelta(minutes=10)
qs = Session.objects.filter(last_activity__gt=time_threshold)
Though, django-user-sessions does not have a database index on that field, which means that if you have a very large number of users/sessions, that query may be heavy and take a long time. A more complicated answer might involve creating a materialized view (if you're using Postgres) that refreshes via a cron job.
Currently, I'm trying a different approach. I've written a middleware where, upon each request, the user's user_id is stored in a global sorted set. I do this only if they're authenticated, and I use the Redis key-value store to ensure everything is blazingly fast.
The solution isn't live yet. I'm going to report more here and give a full answer once I'm done. I'll also consider other answers given here before marking the correct solution.
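For what it's worth, the middleware I'm experimenting with looks roughly like this (the key name and time window are my own choices, and this is not the final version):

# middleware.py
import time

import redis

r = redis.Redis()
ONLINE_KEY = 'online-users'  # global sorted set: member = user_id, score = timestamp
WINDOW = 10 * 60             # "online" means active within the last 10 minutes

class OnlineUsersMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.user.is_authenticated:
            now = time.time()
            r.zadd(ONLINE_KEY, {request.user.id: now})
            r.zremrangebyscore(ONLINE_KEY, 0, now - WINDOW)  # prune stale entries
        return self.get_response(request)

def online_count():
    return r.zcard(ONLINE_KEY)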
I have a Django web application that is currently live and receiving a lot of queries. I am looking for ways to optimize its performance and one area that can be improved is how it interacts with its database.
In its current state, each request to a particular view loads an entire database table into a pandas dataframe, against which queries are done. This table consists of over 55,000 rows of text data (co-ordinates mostly).
To avoid needless queries, I have been advised to cache the table in memory the first time it's loaded. This will remove some overhead on the DB side of things. I've never used this feature of Django before, so I am a bit lost.
The Django manual does not seem to have a concrete implementation of what I want to do. Would it be a good idea to just store the entire table in memory or would storing it in a file be a better idea?
I had a similar problem and django-cache-machine worked like a charm. It uses the Django caching features to cache the results of your queries. It is very easy to set up (assuming you have already configured Django's cache backend):
pip install django-cache-machine
Then in the model you want to cache:
from django.db import models
from caching.base import CachingManager, CachingMixin

class MyModel(CachingMixin, models.Model):
    # ... your fields here ...
    objects = CachingManager()
And that's it, your queries will be cached.
I have developed a Shiny Application. When it starts, it loads, ONCE, some datatables. Around 4 GB of datatables. Then, people connecting to the app can use the interface and play with those datatables.
This application is nice but has some limitations. That's why I am looking for another solution.
My idea is to have pandas and Django working together. This way, I could develop an interface and a RESTful API at the same time. But what I would need is for all requests coming to Django to use pandas datatables that have been loaded once. Imagine if 4 GB of data had to be loaded for every request... It would be horrible.
I have looked everywhere but couldn't find any way of doing this. I found this question: https://stackoverflow.com/questions/28661255/pandas-sharing-same-dataframe-across-the-request But it has no responses.
Why do I need to have the data in RAM? Because I need the performance to render the requested results quickly. I can't ask MariaDB to calculate and maintain that data, for example, as it involves calculations that only R, or a specialized package in Python or another language, can do.
I have a similar use case where I want to load (instantiate) a certain object only once and use it in all requests, since it takes some time (seconds) to load and I couldn't afford the lag that would introduce for every request.
I use a feature in Django>=1.7, the AppConfig.ready() method, to load this only once.
Here is the code:
# apps.py
from django.apps import AppConfig
from sexmachine.detector import Detector

class APIConfig(AppConfig):
    name = 'api'

    def ready(self):
        # Singleton utility
        # We load it here to avoid multiple instantiations across other
        # modules, which would take too much time.
        print("Loading gender detector...")
        global gender_detector
        gender_detector = Detector()
        print("ok!")
Then when you want to use it:
from api.apps import gender_detector
gender_detector.get_gender('John')
Load your data table in the ready() method and then use it from anywhere. I reckon the table will be loaded once for each WSGI worker, so be careful here.
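Applied to the pandas case, the same pattern might look like the sketch below (the app name and CSV path are placeholders, and, as noted, each WSGI worker gets its own copy):

# apps.py
import pandas as pd
from django.apps import AppConfig

big_table = None  # module-level singleton, filled in once per worker

class AnalyticsConfig(AppConfig):
    name = 'analytics'

    def ready(self):
        global big_table
        # loaded once when the worker process starts
        big_table = pd.read_csv('/path/to/big_table.csv')

# elsewhere, e.g. views.py:
# from analytics.apps import big_table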
I may be misunderstanding the problem but to me having a 4 GB database table that is readily accessible by users shouldn't be too much of a problem. Is there anything wrong with actually just loading up the data one time upfront like you described? 4GB isn't too much RAM now.
Personally I'd recommend you just use the database system itself instead of loading stuff into memory and crunching with python. If you set up the data properly you can issue many thousands of queries in just seconds. Pandas is actually written to mimic SQL so most of the code you are using can probably be translated directly to SQL. Just recently I had a situation at work where I set up a big join operation essentially to take a couple hundred files (~4GB in total, 600k rows per each file) using pandas. The total execution time ended up being like 72 hours or something and this was an operation that had to be run once an hour. A coworker ended up rewriting the same python/pandas code as a pretty simple SQL query that finished in 5 minutes instead of 72 hours.
Anyway, I'd recommend looking into storing your pandas dataframe in an actual database table. Django is built on a database (usually MySQL or Postgres), and pandas even has a method to insert your dataframe directly into a database table: DataFrame.to_sql(table_name, connection). From there you can write the Django code so that each response makes a single query to the DB, fetches the values and returns the data in a timely manner.
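A sketch of that handoff with SQLAlchemy (the connection string and table name are placeholders, and df is assumed to be the dataframe you currently build):

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/mydb')

# one-off load: push the dataframe into a real table
df.to_sql('coordinates', engine, if_exists='replace', index=False)

From there, a Django model (or raw SQL) can fetch only the rows a given request actually needs.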
So I've been building django applications for a while now, and drinking the cool-aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80% - 90% of their time) was getting slow once you have a large amount of user specific content (ie photos, friends, other data, etc)
So I popped in the sql logger (was pre-installed with pinax, I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries!! With hand coded sql I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good.
...even if only a dozen or so of the queries take 1ms+
So I'm wondering, how much overhead is there on a round trip to mysql? django and mysql are running on the same server so there shouldn't be any networking related overhead.
Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
I had - like you - a home page of one of my applications that had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring more of the data I needed - I went from hundreds of queries to tens.
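For example, a sketch of the kind of change that helped (the model names are invented):

# before: one query for the photos, plus one extra query per photo for its owner
photos = Photo.objects.filter(album__user=request.user)
usernames = [photo.owner.username for photo in photos]  # each .owner hits the DB

# after: the JOIN happens once, in a single query
photos = Photo.objects.filter(album__user=request.user).select_related('owner')
usernames = [photo.owner.username for photo in photos]  # no extra queries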
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a usecase where studying and tuning the ORM navigation didn't help, if you are sure that you could do it much better with a standard query: use raw sql for that case.
The overhead of each query is only part of the picture. The actual round-trip time between your Django and MySQL servers is probably very small, since most of your queries come back in less than a millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much, and even 50 seems like a lot to me. If ten users view complicated pages, you're now up to 5000 queries.
The round trip time to the database server is more of a factor when the caller is accessing the database from a Wide Area Network, where roundtrips can easily be between 20ms and 100ms.
I would definitely look into using some kind of caching.
There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or template via {%if%}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{foo.bar.baz.quux}}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query business with some careful .filter() and Python processing to assemble a useful tuple in the view function (see the sketch below).
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.
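As a sketch of that idea, with invented model names: instead of letting the template walk foo.bar.baz.quux row by row, fetch the related rows in batches and build a plain structure in the view:

from django.shortcuts import render

def foo_list(request):
    foos = list(Foo.objects.filter(owner=request.user))
    # one extra query for all the related rows, then join them up in Python
    bars = Bar.objects.in_bulk([f.bar_id for f in foos])
    rows = [(f.name, bars[f.bar_id].label) for f in foos]  # plain tuples for the template
    return render(request, 'foo_list.html', {'rows': rows})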
There is always overhead in database calls. In your case the overhead is not that bad, because the application and database are on the same machine, so there is no network latency, but there is still a significant cost.
When you make a request to the database, it has to prepare to service that request by doing a number of things, including:
Allocating resources (memory buffers, temp tables etc) to the database server connection/thread that will handle the request,
De-serializing the SQL and parameters (this is necessary even on one machine, as this is an inter-process request, unless you are using an embedded database),
Checking whether the query exists in the query cache; if it doesn't, optimizing it and putting it in the cache.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL text), this may result in cache misses for statements that should be the same, meaning each request gets analysed and optimized all over again (a short sketch of parametrisation follows below).
Process the query.
Prepare and return the results to the client.
This is just an overview of the kinds of things most database management systems do to process an SQL request. You incur this overhead 500 times, even if the query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.
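To make the parametrisation point concrete: the ORM does this for you automatically, and with raw SQL you can keep the values separate yourself so the server sees one reusable statement (a sketch; the table and column names are placeholders):

from django.db import connection

def sales_for_country(country_id, year, month):
    with connection.cursor() as cursor:
        # the placeholders keep the SQL text identical across calls,
        # so the server can reuse the parsed/optimized statement
        cursor.execute(
            "SELECT sales FROM sales_target "
            "WHERE country_id = %s AND year = %s AND month = %s",
            [country_id, year, month],
        )
        return cursor.fetchall()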