I am relatively new to Django and Python, so I would really like to know which of these two implementations is the better or faster approach. I am currently using filter, but because I really like list comprehensions, I wrote the same code using one. Both versions do exactly the same thing; I just want to know from more experienced developers which is better and why. Both versions are below.
posts = Post.objects.filter(approved=True).order_by('-date_posted')
posts = [post for post in Post.objects.all().order_by('-date_posted') if post.approved]
A .filter(..) does not filter at the Django/Python level: the filtering is done by the database, with a WHERE (or HAVING) clause. Databases are systems designed to store, retrieve, and aggregate large amounts of data.
If you often filter on the approved value, you can add an index on the column:
from django.db import models

class Post(models.Model):
    approved = models.BooleanField(db_index=True)
In that case, the database will add an index structure that makes filtering more efficient.
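The effect of such an index can be observed outside Django. The following is only a minimal sketch using the standard library sqlite3 module; the table name post, the title column, the index name post_approved_idx and the helper plan are assumptions made for illustration, and other databases report their query plans differently.

```python
import sqlite3

# Hypothetical stand-in for the Post table, kept in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, approved INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO post (approved, title) VALUES (?, ?)",
    [(int(i % 10 == 0), f"post {i}") for i in range(1000)],
)

QUERY = "SELECT title FROM post WHERE approved = 1"


def plan(sql):
    """Return SQLite's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = plan(QUERY)  # expected to describe a full table scan

# Roughly what db_index=True asks the database to create.
conn.execute("CREATE INDEX post_approved_idx ON post (approved)")
plan_with_index = plan(QUERY)  # expected to mention the new index

print(plan_without_index)
print(plan_with_index)
```

The exact plan text depends on the SQLite version, so it is printed rather than quoted here.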
It is usually better to filter in the database, since the database then has to send fewer records to the Python/Django layer, and Django has to deserialize fewer rows into model objects. So even if filtering with a list comprehension were as fast as filtering in the database, it would still be slower overall, because Python/Django first has to deserialize more elements. As the number of elements grows, it will eventually cause memory problems, since you cannot hold all the records in memory at the same time.
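To make the difference concrete without a Django project, here is a rough sketch with the standard library sqlite3 module. The post table and its columns are assumptions that mirror the Post model from the question; the first query corresponds loosely to .filter(approved=True), the second to the list comprehension.

```python
import sqlite3

# Hypothetical stand-in for the Post table, kept in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, approved INTEGER, date_posted TEXT)"
)
conn.executemany(
    "INSERT INTO post (approved, date_posted) VALUES (?, ?)",
    [(int(i % 10 == 0), f"2024-01-{(i % 28) + 1:02d}") for i in range(1000)],
)

# Database-side filtering: only the matching rows cross into Python.
rows_db = conn.execute(
    "SELECT id FROM post WHERE approved = 1 ORDER BY date_posted DESC, id"
).fetchall()

# Python-side filtering: every row is transferred first, then most are discarded.
all_rows = conn.execute(
    "SELECT id, approved FROM post ORDER BY date_posted DESC, id"
).fetchall()
rows_py = [(row_id,) for row_id, approved in all_rows if approved]

print(len(rows_db), "rows transferred with WHERE")
print(len(all_rows), "rows transferred without WHERE")
```

Both approaches end with the same rows, but the second one moves ten times as many rows out of the database in this example.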
Related
I have a Django model backed by a very large table (Log) containing millions of rows. This model has a foreign key reference to a much smaller table (Host). Example models:
class Host(Model):
    name = CharField()

class Log(Model):
    value = CharField()
    host = ForeignKey(Host)
In reality there are many more fields and also more foreign keys similar to Log.host.
There is an iter_logs() function that efficiently iterates over Log records using a paginated query scheme. Other places in the program use iter_logs() to process large volumes of Log records, doing things like dumping to a file, etc.
For efficient operation, any code that uses iter_logs() should only access fields like value. But problems arise when someone innocently accesses log.host. In this case Django will issue a separate query each time a new Log record's host is accessed, killing the performance of the efficient paginated query in iter_logs().
I know I can use select_related to efficiently fetch the related host records, but all known uses of iter_logs() should not need this, so it would be wasteful. If a use case for accessing log.host did arise in this context I would want to add a parameter to iter_logs() to optionally use .select_related("host"), but that has not become necessary yet.
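The cost being described can be sketched with plain sqlite3. This is not Django code: the tables, the run helper and the queries list are assumptions for illustration. The first pattern imitates a lazy log.host lookup per record, the second imitates the single JOIN that select_related("host") produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY, value TEXT, host_id INTEGER)"
)
conn.executemany(
    "INSERT INTO host (id, name) VALUES (?, ?)", [(1, "alpha"), (2, "beta")]
)
conn.executemany(
    "INSERT INTO log (value, host_id) VALUES (?, ?)",
    [(f"entry {i}", 1 + i % 2) for i in range(100)],
)

queries = []  # every statement sent through run() is recorded here


def run(sql, params=()):
    """Hypothetical helper: execute a statement and record that it ran."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# Lazy pattern: one query for the logs, then one more per log for its host.
logs = run("SELECT id, value, host_id FROM log ORDER BY id")
lazy_names = [
    run("SELECT name FROM host WHERE id = ?", (host_id,))[0][0]
    for _, _, host_id in logs
]
lazy_queries = len(queries)

# JOIN pattern: the host name arrives with each log row in a single query.
queries.clear()
joined = run(
    "SELECT log.id, log.value, host.name FROM log "
    "JOIN host ON host.id = log.host_id ORDER BY log.id"
)
joined_names = [name for _, _, name in joined]
joined_queries = len(queries)

print(lazy_queries, "queries with per-record lookups")
print(joined_queries, "query with a JOIN")
```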
I am looking for a way to tell the underlying query logic in Django to never perform additional database queries except those explicitly allowed in iter_logs(). If such a query becomes necessary it should raise an error instead. Is there a way to do that with Django?
One solution I'd prefer to avoid: wrap or otherwise modify objects yielded by iter_logs() to prevent access to foreign keys.
More generally, Django's deferred query logic breaks encapsulation of code that constructs queries. Dependent code must know about the implementation or risk imposing major inefficiencies. This is usually fine at small scale where a little inefficiency does not matter, but becomes a real problem at larger scale. An early error would be much better because it would be easy to detect in small-scale tests rather than deferring the problem to production run time where it manifests as general slowness.
I know that QuerySets are lazy and are evaluated only under certain conditions, to avoid hitting the database all the time.
What I don't know is whether taking a generic queryset (retrieving all the items) and using it to construct a more refined queryset (adding a filter, for example) leads to multiple SQL queries.
Example:
all_items = MyModel.objects.all()
subset1 = all_items.filter(**some_conditions)
subset2 = subset1.filter(**other_condition)
1) Would this create 3 different SQL queries?
Or does it all depend on whether the 3 variables are evaluated (for example, by iterating over them)?
2) Is this efficient or would it be better to fetch all the items, then convert them into a list and filter them in python?
1) If you iterate over only the final queryset, subset2, then only one database query is executed, which is optimal.
2) Avoid premature optimization (optimize only after measuring with a realistic amount of data, once most of the application code is written). You never know what the most important problem will turn out to be once the database gets bigger. For example, if you ask for a subset, the query is usually faster thanks to caching in the database. Memory use also competes with other optimizations: later you may not be able to hold all the data in memory, and users may access it only one page at a time. Clean, readable code matters more for later optimization than a 20% gain that has to be removed before you can continue.
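The behaviour in point 1 can be imitated with a toy class. LazyQuery below is a hypothetical stand-in, not Django's QuerySet: it only shows the general idea that filter() builds up a description of the query and that SQL is sent only when the result is iterated. The item table and its columns are assumptions as well.

```python
import sqlite3


class LazyQuery:
    """Toy query object: collects conditions, runs SQL only on iteration."""

    executed = 0  # number of SQL statements actually sent to the database

    def __init__(self, conn, table, conditions=()):
        self.conn = conn
        self.table = table
        self.conditions = tuple(conditions)

    def filter(self, **kwargs):
        # Returns a new object; nothing is sent to the database here.
        return LazyQuery(
            self.conn, self.table, self.conditions + tuple(kwargs.items())
        )

    def sql(self):
        where = " AND ".join(f"{column} = ?" for column, _ in self.conditions)
        query = f"SELECT id FROM {self.table}"
        return query + (f" WHERE {where}" if where else "")

    def __iter__(self):
        # Evaluation happens here, with all accumulated conditions at once.
        LazyQuery.executed += 1
        params = [value for _, value in self.conditions]
        return iter(self.conn.execute(self.sql(), params).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, color TEXT, size TEXT)")
conn.executemany(
    "INSERT INTO item (color, size) VALUES (?, ?)",
    [("red", "S"), ("red", "L"), ("blue", "L")],
)

all_items = LazyQuery(conn, "item")
subset1 = all_items.filter(color="red")
subset2 = subset1.filter(size="L")
queries_before_iteration = LazyQuery.executed  # still zero at this point

rows = list(subset2)  # the single combined query runs here
print(subset2.sql())
print(rows, LazyQuery.executed)
```

Three variables are created, but only one statement is executed, and it carries both conditions in its WHERE clause.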
Other important paragraphs about (lazy) evaluation of queries:
When QuerySets are evaluated
QuerySets are lazy
Laziness in Django
I have a City model and fixture data with a list of cities, and I currently clean up the name for URLs in the view and template before loading them. So I do the following in a template to get a URL like this: http://newcity.domain.com.
<a href="http://{{ city.name|lower|cut:" " }}.{{ SITE_URL}}">
The actual city.name is "New City".
Would it be better if I stored already cleaned data (newcity) in a new column (short_name) on MySQL db and just use city.short_name on templates and views?
This seems very opinion-oriented. Is it faster? The only way to know for sure is to measure. Is it faster to a degree that you care about? Probably not. In any event, it's better not to make schema design decisions based on performance unless you've observed measurably bad performance.
All other things being equal, it is generally best to store the data in different columns. It's easier to join it in controller or template code than it is to separate it out into its pieces.
Storing the short name in a MySQL database requires I/O, and I/O is always slow. For such a simple transformation of data, it should be faster to keep things as they are and avoid the extra I/O to the database.
If you really want to know the difference, use timeit (https://docs.python.org/2/library/timeit.html); accessing a database is probably much slower.
It really depends. If you have a fixed number of cities in a list, just hardcode them (unless you have so many cities that they would put real stress on your server's resources, but I don't think that is the case here). Otherwise, you must use some type of persistent store for the cities, and a database will come in handy.
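A measurement along those lines might look like the sketch below. The function short_name is a hypothetical Python equivalent of the template filters lower and cut:" " from the question; timing the database alternative would need the real schema, so only the transformation is measured here.

```python
import timeit


def short_name(name):
    # Same result as applying the template filters lower and cut:" ".
    return name.lower().replace(" ", "")


CALLS = 100_000
elapsed = timeit.timeit(lambda: short_name("New City"), number=CALLS)

print(short_name("New City"))
print(f"{elapsed / CALLS * 1e6:.3f} microseconds per call")
```

The absolute number depends on the machine, which is why it should be measured rather than guessed.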
Let's say there is a table of People, and let's say there are 1000+ of them in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance differences are there between using PickleType and creating a Name table so that the name field of People is a one-to-many relationship?
Yes, this is one good use case of sqlalchemy's PickleType field, documented very well here. There are obvious performance advantages to using this.
Using your example, assume you have a People item which uses a one-to-many database lookup. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you have the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is stored internally as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, which will result in an extremely fast return of data. However, you now incur the "cost" of unpickling each item back into a Python object, which can be significant if you're not storing native datatypes, e.g. string, int, list, dict.
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
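The trade-off can be sketched with the standard library alone. This uses pickle and sqlite3 directly rather than SQLAlchemy's PickleType, and the person and nickname tables and the sample names are assumptions made up for the example.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
# Variant A keeps the nicknames as a pickled blob on the person row.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, nicknames BLOB)"
)
# Variant B keeps them in a separate one-to-many table.
conn.execute("CREATE TABLE nickname (person_id INTEGER, nickname TEXT)")

people = {"Robert": ["Bob", "Rob"], "Elizabeth": ["Liz", "Beth"]}
for name, nicks in people.items():
    cursor = conn.execute(
        "INSERT INTO person (name, nicknames) VALUES (?, ?)",
        (name, pickle.dumps(nicks)),
    )
    conn.executemany(
        "INSERT INTO nickname (person_id, nickname) VALUES (?, ?)",
        [(cursor.lastrowid, nick) for nick in nicks],
    )

# Reading one person's nicknames from the blob: a single-table lookup,
# followed by unpickling in Python.
blob = conn.execute(
    "SELECT nicknames FROM person WHERE name = ?", ("Robert",)
).fetchone()[0]
roberts_nicknames = pickle.loads(blob)

# Finding who is called "Liz" with blobs: the database cannot look inside,
# so every row has to be fetched and unpickled.
owners_via_pickle = [
    name
    for name, data in conn.execute("SELECT name, nicknames FROM person")
    if "Liz" in pickle.loads(data)
]

# The same question with the separate table: the WHERE clause does the work.
owners_via_join = [
    row[0]
    for row in conn.execute(
        "SELECT p.name FROM person p "
        "JOIN nickname n ON n.person_id = p.id WHERE n.nickname = ?",
        ("Liz",),
    )
]

print(roberts_nicknames, owners_via_pickle, owners_via_join)
```

Both variants give the same answers; they differ in where the filtering work happens.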
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but how well it works varies greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, and it works quite nicely.
After enabling Appstats and profiling my application, I went on a panic rage trying to figure out how to reduce costs by any means. A lot of my costs per request came from queries, so I set out to eliminate querying as much as possible.
For example, I had one query where I wanted to get a User's StatusUpdates after a certain date X. I used a query to fetch: statusUpdates = StatusUpdates.query(StatusUpdates.date > X).
So I thought I might outsmart the system and avoid a query, but incur higher write costs for the sake of lower read costs. I thought that every time a user writes a Status, I store the key to that status in a list property of the user. So instead of querying, I would just do ndb.get_multi(user.list_of_status_keys).
The question is, what is the difference for the system between these two approaches? Sure I avoid a query with the second case, but what is happening behind the scenes here? Is what I'm doing in the second case, where I'm collecting keys, just me doing a manual indexing that GAE would have done for me with queries?
In general, what is the difference between get_multi(keys) and a query? Which is more efficient? Which is less costly?
Check the docs on billing:
https://developers.google.com/appengine/docs/billing
It's pretty straightforward. Reads are $0.07/100k, smalls are $0.01/100k, so you want to do smalls.
A query is 1 read + 1 small / entity
A get is 1 read. If you are getting more than 1 entity back with a query, it's cheaper to do a query than reading entities from keys.
Query is likely more efficient too. The only benefit from doing the gets is that they'll be fully consistent (whereas a query is eventually consistent).
Storing the keys does not replace the query, as you cannot do anything with just the keys. You will still have to fetch the Status objects into memory. Also, since you want to query on the date of the Status object, you will need to fetch all the Status objects into memory and compare their dates yourself. If you use a Query, App Engine will fetch only the Status objects with the required date. Since you fetch less, your read costs will be lower.
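The argument can be illustrated with a toy model in plain Python. This is not the ndb API or the real datastore: the statuses dictionary, the key names and the sorted date index are all assumptions, used only to compare how many entities each approach has to load.

```python
import bisect
from datetime import date

# Toy "datastore": key -> entity. Thirty status updates, one per day.
statuses = {
    f"status-{day}": {"date": date(2024, 1, day), "text": f"update {day}"}
    for day in range(1, 31)
}
user_status_keys = list(statuses)  # the key list stored on the user
cutoff = date(2024, 1, 25)

# Approach A: fetch every stored key, then compare dates in application code.
fetched = [statuses[key] for key in user_status_keys]
recent_by_keys = [status for status in fetched if status["date"] > cutoff]

# Approach B: an index sorted by date finds the matching keys first,
# so only the matching entities are loaded.
date_index = sorted((status["date"], key) for key, status in statuses.items())
index_dates = [indexed_date for indexed_date, _ in date_index]
start = bisect.bisect_right(index_dates, cutoff)
recent_by_index = [statuses[key] for _, key in date_index[start:]]

print(len(fetched), "entities loaded when fetching by stored keys")
print(len(recent_by_index), "entities loaded when the index filters first")
```

In this toy model the key-based approach loads all thirty entities to return five, while the indexed approach loads only the five that match.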
As this is basically the same question as you have posed here, I suggest that you look at the answer I gave there.