I am coming from .NET Core and I am curious if Django has anything similar to .NET Core's projections. For instance, I can describe a relationship in a .NET Core model and then later query for it. So, if Articles can have an Author I can do something like:
var articles = dbContext.Articles.Where(a => a.ID == id).Include(a => a.author);
What I would get back are articles that have their author attached.
Is there anything similar to this in Django? How can I load related data in Django that's described in the model?
Yes. The query you wrote is more or less equivalent to:
Article.objects.filter(id=some_id).prefetch_related('author')
or:
Article.objects.filter(id=some_id).select_related('author')
select_related versus prefetch_related
select_related is usually the better choice in case the number of Authors is limited, or the relation is more of a one-to-one relation. In case you pull a large number of Articles, and several Articles map to the same Author, it is usually better to use prefetch_related: this will first collect the Author identifiers, perform a uniqueness filter, and then fetch those Authors into memory.
The same holds if multiple Authors write one article (so a one-to-many or many-to-many relation). Since that would mean that if we perform a JOIN at the database level, we repeat every article by the number of Authors who wrote that article, and we repeat every Author by the number of Articles they wrote. We usually want to avoid this "multiplicative" behavior for such sets. So in that case prefetch_related will have linear behavior: first fetching the relevant Articles, next fetching the relevant Authors.
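To make the "multiplicative" versus "linear" difference concrete, here is a minimal sketch using the standard-library sqlite3 module instead of Django. The table names, columns and sample rows are invented for illustration, and the SQL is not the exact SQL Django generates; it only mimics the two strategies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);
    INSERT INTO article VALUES (1, 'A1'), (2, 'A2'), (3, 'A3');
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    -- every article is written by both authors
    INSERT INTO article_author VALUES (1,1),(1,2),(2,1),(2,2),(3,1),(3,2);
""")

# JOIN strategy: one query, but each article row is repeated per author
# and each author row is repeated per article.
joined = conn.execute("""
    SELECT article.id, article.title, author.id, author.name
    FROM article
    JOIN article_author ON article_author.article_id = article.id
    JOIN author ON author.id = article_author.author_id
""").fetchall()

# Prefetch-style strategy: separate queries, each row fetched only once.
articles = conn.execute("SELECT id, title FROM article").fetchall()
links = conn.execute(
    "SELECT article_id, author_id FROM article_author").fetchall()
author_ids = sorted({author_id for _, author_id in links})  # uniqueness filter
placeholders = ",".join("?" * len(author_ids))
authors = conn.execute(
    f"SELECT id, name FROM author WHERE id IN ({placeholders})", author_ids
).fetchall()

print(len(joined))                   # rows that carry article + author data
print(len(articles) + len(authors))  # article rows plus author rows
```

With these sample rows the JOIN returns one row per (article, author) pair, while the separate queries return each article and each author exactly once.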
Lazy loading of related objects
But you actually do not need to perform a prefetch_related for a single instance. In case you load an article, you can simply use some_article.author. If the corresponding Author instance is not yet loaded, Django will perform an additional query fetching the related Author instance.
So Django can load attributes that correspond to related objects in a lazy manner: it simply loads the Article if you fetch it in memory, and if you later need the Author, or the Journal, or the .editor of the Journal (which is for example an Author as well), Django will each time make a new query and load the related object(s). However, in case you want to process a list of Articles in batch, select_related and prefetch_related are advisable, since they will result in a limited number of queries to fetch all related objects, instead of one query per related instance.
The lazy loading of related objects can be more efficient if one frequently has to fetch zero or at most a few related instances (for example because it depends on some attributes of the Article whether we are really interested in the Author after all).
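The trade-off can be sketched without Django at all. The FakeDB and Article classes below are hypothetical stand-ins (not Django's implementation) that only count queries: lazy access costs one query per instance, while a batch fetch fills the per-instance cache with a single query.

```python
class FakeDB:
    """Stand-in for a database that counts how many queries it receives."""

    def __init__(self, authors):
        self.authors = authors
        self.queries = 0

    def get_author(self, author_id):
        self.queries += 1
        return self.authors[author_id]

    def get_authors(self, author_ids):
        self.queries += 1
        return {i: self.authors[i] for i in author_ids}


class Article:
    def __init__(self, db, author_id):
        self._db = db
        self.author_id = author_id
        self._author_cache = None

    @property
    def author(self):
        # Lazy: query only on first access, then reuse the cached object.
        if self._author_cache is None:
            self._author_cache = self._db.get_author(self.author_id)
        return self._author_cache


db = FakeDB({1: "Ann", 2: "Bob"})

# Lazy loading: one query per article whose author is accessed.
articles = [Article(db, 1), Article(db, 2), Article(db, 1)]
names = [a.author for a in articles]
lazy_queries = db.queries

# Batch loading: one query for all authors, then fill the caches.
db.queries = 0
batch = [Article(db, 1), Article(db, 2), Article(db, 1)]
fetched = db.get_authors({a.author_id for a in batch})
for a in batch:
    a._author_cache = fetched[a.author_id]
batch_names = [a.author for a in batch]  # served from the cache
batch_queries = db.queries
```

If only one of the three authors were ever accessed, the lazy variant would issue a single query as well, which is the case where lazy loading pays off.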
Sounds like you are looking for select_related. This traverses FK relationships based on how they are created in your models.
I like how Django ORM lazy loads related objects in the queryset, but I guess it's quite unpredictable as it is.
The queryset API doesn't keep the related objects when they are used to make a queryset, thereby fetching them again when accessed later.
Suppose I have a ModelA instance (say instance_a) which is a foreign key (say for_a) of some N instances of ModelB. Now I want to perform query on ModelB which has the given ModelA instance as the foreign key.
Django ORM provides two ways:
Using .filter() on ModelB:
b_qs = ModelB.objects.filter(for_a=instance_a)
for instance_b in b_qs:
    instance_b.for_a  # <-- fetches the same row for ModelA again
Results in 1 + N queries here.
Using reverse relations on ModelA instance:
b_qs = instance_a.for_a_set.all()
for instance_b in b_qs:
    instance_b.for_a  # <-- this uses the instance_a from memory
Results in 1 query only here.
While the second way can be used to achieve the result, it's not part of the standard API and not useable for every scenario. For example, if I have instances of 2 foreign keys of ModelB (say, ModelA and ModelC) and I want to get related objects to both of them.
Something like the following works:
ModelB.objects.filter(for_a=instance_a, for_c=instance_c)
I guess it's possible to use .intersection() for this scenario, but I would like a way to achieve this via the standard API. After all, covering such cases would require more code with non-standard queryset functions which may not make sense to the next developer.
So, the first question: is it possible to optimize such scenarios with the standard API itself?
The second question, if it's not possible right now, can it be added with some tweaks with the QuerySet?
PS: It's my first time asking a question here, so forgive me if I made any mistake.
You could improve the query by using select_related():
b_qs = ModelB.objects.select_related('for_a').filter(for_a=instance_a)
or
b_qs = instance_a.for_a_set.select_related('for_a')
Does that help?
You use .select_related(..) for ForeignKeys, or .prefetch_related(..) for something-to-many relations.
With .select_related(..) Django makes a JOIN at the database side (an INNER JOIN, or a LEFT OUTER JOIN if the ForeignKey is nullable), fetches the columns of both models in a single query, and deserializes them into the proper objects.
ModelB.objects.select_related('for_a').filter(for_a=instance_a)
For relations that are one-to-many (so a reversed ForeignKey), or ManyToManyFields, this is not a good idea, since it could result in a large amount of duplicate objects that are retrieved. This would result in a large answer from the database, and a lot of work at the Python end to deserialize these objects. .prefetch_related will make individual queries, and then do the linking itself.
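The "linking itself" step amounts to grouping the rows of the second query by their foreign key. A rough sketch in plain Python follows; the row tuples are made-up sample data standing in for the results of the two queries, not Django internals.

```python
from collections import defaultdict

# Result of query 1: the ModelA rows as (id, name).
a_rows = [(1, "a1"), (2, "a2")]
# Result of query 2: ModelB rows with for_a_id IN (1, 2), as (id, for_a_id).
b_rows = [(10, 1), (11, 1), (12, 2)]

# Group the ModelB ids by the ModelA id they point to.
children_by_parent = defaultdict(list)
for b_id, for_a_id in b_rows:
    children_by_parent[for_a_id].append(b_id)

# Attach each group to its parent; no row was transferred twice.
linked = {name: children_by_parent[a_id] for a_id, name in a_rows}
print(linked)
```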
I have users who have "followers". I need to be able to navigate up and down the tree of users/followers. I'm eventually going to hit AppEngine's 1 MB limit on entities if I use ancestor relations and a user has many followers.
What's the best way to structure this data on AppEngine?
You cannot use ancestor relations for a simple reason that your use case allows circular references (I follow you, you follow me).
The solution depends on your expected usage patterns. You can choose between two options:
(A) In each user entity store a list of IDs of other users that this user is following.
(B) Create a separate entity that has two properties: "User" and "Follower". Every entity will represent a single "connection" between users.
While the first option seems simpler, you may run into exploding indexes problem. Besides, it may turn out to be a more expensive solution as each change in user relationships will require an overwrite of a user entity with updates to all of its other indexes. The second solution does not have these drawbacks, but may require a little extra code.
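As a rough illustration of option (B), the sketch below models each connection as its own small record. FollowStore is a hypothetical in-memory stand-in for a datastore kind with two indexed properties, not App Engine code; the point is that following or unfollowing touches one connection record and never rewrites a user entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Follow:
    """One "connection" entity: `follower` follows `user`."""
    user: str
    follower: str


class FollowStore:
    """Hypothetical in-memory stand-in for a kind with two indexed properties."""

    def __init__(self):
        self._follows = set()

    def follow(self, follower, user):
        self._follows.add(Follow(user=user, follower=follower))

    def unfollow(self, follower, user):
        self._follows.discard(Follow(user=user, follower=follower))

    def followers_of(self, user):
        # Equivalent of a query filtered on the "User" property.
        return sorted(f.follower for f in self._follows if f.user == user)

    def following(self, follower):
        # Equivalent of a query filtered on the "Follower" property.
        return sorted(f.user for f in self._follows if f.follower == follower)


store = FollowStore()
store.follow("alice", "bob")
store.follow("bob", "alice")   # circular follows are fine
store.follow("carol", "bob")
```

Both directions of navigation ("who follows bob" and "whom does alice follow") are then a single filtered lookup.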
In my app, I have the following process:
1. Get a very long list of people
2. Create an entity for each person
3. Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
    pass
for p in Person.all():
    pass  # do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people. I can store a link to the first one, he can link to the second one and so on. It seems that the performance would be poor however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
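One way to lean on the "a get always sees the most recently written data" remark in the quote is to remember every key written in step 2 and fetch by key in step 3 instead of querying. The sketch below is a toy simulation of that idea; ToyDatastore is invented for illustration and is not the App Engine API.

```python
class ToyDatastore:
    """Toy model: gets by key are consistent, queries read a lagging index."""

    def __init__(self):
        self._entities = {}
        self._index = set()   # what a non-ancestor query can currently "see"

    def put(self, key, entity):
        self._entities[key] = entity   # visible to get() immediately
        return key

    def apply_index_updates(self):
        # In the real datastore this happens asynchronously, some time later.
        self._index = set(self._entities)

    def get(self, key):
        return self._entities[key]

    def query_all(self):
        return [self._entities[k] for k in sorted(self._index)]


ds = ToyDatastore()
people = ["ann@example.com", "bob@example.com", "cy@example.com"]

# Step 2: create the entities and remember every key that was written.
keys = [ds.put(f"person-{i}", {"email": e}) for i, e in enumerate(people)]

# Step 3: a query may still lag behind, but gets by key do not.
seen_by_query = ds.query_all()
seen_by_get = [ds.get(k) for k in keys]
```

In this toy the query sees nothing until apply_index_updates() runs, while the key-based gets already return all three people. Whether keeping a key list is practical depends on how the list is produced and how long it is.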
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
from google.appengine.ext import db

# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)
# Kind is the db model you're using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
There is going to be "articles" and "tags" in my App Engine application.
And there are two techniques to implement that (thanks to Nick Johnson's article):
# one entity just refers others
class Article(db.Model):
    tags = db.ListProperty(Tag)
# via separate "join" table
class ArticlesAndTags(db.Model):
    article = db.ReferenceProperty(Article)
    tag = db.ReferenceProperty(Tag)
Which one should I prefer according to the following tasks?
Create tag cloud (frequently),
Select articles by a tag (rather rarely)
Because App Engine's map reduce lacks a 'reduce' feature (and the datastore has no SQL GROUP BY-like query), tag clouds are tricky to implement efficiently: you need to count all your tags manually. Whichever implementation you go with, what I would suggest for the tag cloud is to have a separate model TagCounter that keeps track of how many times each tag is used. Otherwise the tag query could get expensive if you have a lot of them.
class TagCounter(db.Model):
    tag = db.ReferenceProperty(Tag)
    counter = db.IntegerProperty(default=0)
Every time you choose to update your tags on an article, make sure you increment and decrement from this table accordingly.
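The increment/decrement bookkeeping can be sketched in plain Python. Here a collections.Counter stands in for the TagCounter entities, and update_tags is a hypothetical helper name; a real implementation would perform the same adjustments on datastore entities, ideally in a transaction.

```python
from collections import Counter

tag_counts = Counter()   # stand-in for the TagCounter entities


def update_tags(old_tags, new_tags):
    """Adjust counters for the tags added to or removed from one article."""
    old, new = set(old_tags), set(new_tags)
    for tag in new - old:          # tags that were added
        tag_counts[tag] += 1
    for tag in old - new:          # tags that were removed
        tag_counts[tag] -= 1
        if tag_counts[tag] <= 0:
            del tag_counts[tag]    # drop tags no article uses any more


update_tags([], ["python", "gae"])                     # new article
update_tags([], ["python"])                            # another new article
update_tags(["python", "gae"], ["python", "django"])   # first article retagged
```

Only the difference between the old and new tag sets is applied, so unchanged tags cause no counter writes.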
As for selecting articles by a tag, the first implementation is sufficient (the second is overly complex imo).
class Article(db.Model):
    tags = db.ListProperty(Tag)
    @staticmethod
    def select_by_tag(tag):
        return Article.all().filter("tags", tag).run()
I have created a huge tag cloud * on GAEcupboard opting for the first solution:
class Post(db.Model):
    title = db.StringProperty(required = True)
    tags = db.ListProperty(str, required = True)
The tag class has a counter property that is updated each time a new post is created/updated/deleted.
class Tag(db.Model):
    name = db.StringProperty(required = True)
    counter = db.IntegerProperty(required = True)
    last_modified = db.DateTimeProperty(required = True, auto_now = True)
With the tags organized in a ListProperty, it's quite easy to offer a drill-down feature that allows users to combine different tags to search for the desired articles:
Example:
http://www.gaecupboard.com/tag/python/web-frameworks
The search is done using:
posts = Post.all()
posts.filter('tags', 'python').filter('tags', 'web-frameworks')
posts.fetch()
that does not need any custom index at all.
ok, it's too huge, I know :)
Creating a tag cloud in App Engine is really difficult because the datastore doesn't support the GROUP BY construct normally used to express that; nor does it supply a way to order by the length of a list property.
One of the key insights is that you have to show a tag cloud frequently, but you don't have to create one except when there are new articles or articles get retagged, since you'll get the same tag cloud in any other case. In fact, the tag cloud doesn't change very much for each new article: maybe a tag in the cloud becomes a little larger or a little smaller, but not by much, and not in a way that would affect its usefulness.
This suggests that tag-clouds should be created periodically, cached, and displayed much like static content. You should think about doing that in the Task Queue API.
The other query, listing articles by tag, would be utterly unsupported by the first technique you've shown. Inverting it, having a Tag model with an articles ListProperty, does support the query, but will suffer from update contention when popular tags have to get added to it at a high rate. The other technique, using an association model, suffers from neither of these concerns, but makes it harder to make the article listing queries convenient.
The way I would deal with this is to start with the ArticlesAndTags model, but add some additional data to the model to have a useful ordering; an article date, article title, whatever makes sense for the particular kind of site you're making. You'll also need a monotonic sequence (say, a timestamp) on this so you know when the tag applied.
The tag cloud query would be supported using a Tag entity that has only a numeric article count, and also a reference to the same timestamp used in the ArticlesAndTags model.
A task queue can then query for the 1000 oldest ArticlesAndTags that are newer than oldest Tag, sum the frequencies of each and add it to the counts in the Tags. Tag removals are probably rare enough that they can update the Tag model immediately without too much contention, but if that assumption turns out to be wrong, then delete events should be added to the ArticlesAndTags as well.
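The periodic aggregation can be sketched as a small batch job driven by a watermark. This is a simplification under stated assumptions: it uses one global watermark rather than a timestamp per Tag, unique increasing integer timestamps, an in-memory list in place of ArticlesAndTags entities, and a hypothetical run_task function in place of a real Task Queue handler.

```python
from collections import Counter

# (timestamp, tag) pairs standing in for ArticlesAndTags entities.
articles_and_tags = [
    (1, "python"), (2, "gae"), (3, "python"), (4, "django"), (5, "python"),
]

tag_counts = Counter()   # stand-in for the counts stored on Tag entities
watermark = 0            # timestamp of the newest link already counted
BATCH_SIZE = 2           # the answer uses 1000; small here to show batching


def run_task():
    """Count the oldest uncounted links and advance the watermark.

    Returns the number of links processed in this run.
    """
    global watermark
    pending = sorted(link for link in articles_and_tags if link[0] > watermark)
    batch = pending[:BATCH_SIZE]
    for timestamp, tag in batch:
        tag_counts[tag] += 1
        watermark = timestamp
    return len(batch)


# Re-run the task until nothing newer than the watermark is left.
runs = 0
while run_task():
    runs += 1
```

Because each run only looks at links newer than the watermark, re-running the task after a failure does not double count links that were already folded into the totals.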
You don't seem to have very specific/complex requirements so my opinion is it's likely neither method would show significant benefits, or rather, the pros/cons will depend completely on what you're used to, how you want to structure your code, and how you implement caching and counting mechanisms.
The things that come to mind for me are:
- The ListProperty method leaves the data models looking more natural.
- The ArticlesAndTags method will mean you'd have to query for the relationships and then the Articles (ugh..), instead of doing Article.all().filter('tags =', tag).
I have a lot of model classes with relations between them and a CRUD interface to edit them. The problem is that some objects can't be deleted since there are other objects referring to them. Sometimes I can set up an ON DELETE rule to handle this case, but in most cases I don't want automatic deletion of related objects until they are unbound manually. Anyway, I'd like to present the editor with a list of objects referring to the currently viewed one and highlight those that prevent its deletion due to a FOREIGN KEY constraint. Is there a ready solution to automatically discover referrers?
Update
The task seems to be quite common (e.g. the Django ORM shows all dependencies), so I am surprised that there is no solution to it yet.
There are two directions suggested:
1. Enumerate all relations of the current object and go through their backref. But there is no guarantee that all relations have a backref defined. Moreover, there are some cases when a backref is meaningless. Although I can define it everywhere, I don't like doing it this way and it's not reliable.
2. (Suggested by van and stephan) Check all tables of the MetaData object and collect dependencies from their foreign_keys property (the code of sqlalchemy_schemadisplay can be used as an example, thanks to stephan's comments). This will catch all dependencies between tables, but what I need is dependencies between model classes. Some foreign keys are defined in intermediate tables and have no models corresponding to them (used as secondary in relations). Sure, I can go further and find the related model (I have yet to find a way to do it), but it looks too complicated.
Solution
Below is a method of the base model class (designed for the declarative extension) that I use as a solution. It is not perfect and doesn't meet all my requirements, but it works for the current state of my project. The result is collected as a dictionary of dictionaries, so I can show the referrers grouped by objects and their properties. I haven't decided yet whether it's a good idea, since the list of referrers is sometimes huge and I'm forced to limit it to some reasonable number.
# Imports assumed for the (old) SQLAlchemy release this was written against;
# PropertyLoader is the legacy alias of the relationship property class.
from sqlalchemy.orm import object_session, class_mapper
from sqlalchemy.orm.util import identity_key
from sqlalchemy.orm.properties import PropertyLoader

def _get_referers(self):
    db = object_session(self)
    cls, ident = identity_key(instance=self)
    metadata = cls.__table__.metadata
    result = {}
    # _mapped_models is my extension. It is collected by metaclass, so I didn't
    # look for other ways to find all model classes.
    for other_class in metadata._mapped_models:
        queries = {}
        for prop in class_mapper(other_class).iterate_properties:
            if not (isinstance(prop, PropertyLoader) and \
                    issubclass(cls, prop.mapper.class_)):
                continue
            query = db.query(prop.parent)
            comp = prop.comparator
            if prop.uselist:
                query = query.filter(comp.contains(self))
            else:
                query = query.filter(comp==self)
            count = query.count()
            if count:
                queries[prop] = (count, query)
        if queries:
            result[other_class] = queries
    return result
Thanks to all who helped me, especially stephan and van.
SQL: I have to absolutely disagree with S.Lott's answer.
I am not aware of an out-of-the-box solution, but it is definitely possible to discover all the tables that have ForeignKey constraints to a given table. One needs to make proper use of the INFORMATION_SCHEMA views such as REFERENTIAL_CONSTRAINTS, KEY_COLUMN_USAGE, TABLE_CONSTRAINTS, etc. See SQL Server example. With some limitations and extensions, most recent versions of relational databases support the INFORMATION_SCHEMA standard. Once you have all the FK information and the object (row) in the table, it is a matter of running a few SELECT statements to get all the rows in other tables that refer to the given row and prevent it from being deleted.
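The same idea can be sketched with the standard-library sqlite3 module. Note the swap: SQLite has no INFORMATION_SCHEMA, so this sketch reads the declared foreign keys through PRAGMA foreign_key_list instead. The schema, the sample rows and the find_referrers helper are invented for illustration, and the sketch assumes every foreign key points at the target table's id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES author(id)
    );
    CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO article VALUES (10, 'First', 1), (11, 'Second', 1);
""")


def find_referrers(conn, table, row_id):
    """Return {(referring_table, fk_column): row_count} for rows pointing at row_id."""
    referrers = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()]
    for other in tables:
        # Row layout: id, seq, table, from, to, on_update, on_delete, match
        fks = conn.execute(f'PRAGMA foreign_key_list("{other}")').fetchall()
        for fk in fks:
            target_table, from_col = fk[2], fk[3]
            if target_table != table:
                continue
            # Count the rows in the referring table that point at our row.
            count = conn.execute(
                f'SELECT COUNT(*) FROM "{other}" WHERE "{from_col}" = ?',
                (row_id,),
            ).fetchone()[0]
            if count:
                referrers[(other, from_col)] = count
    return referrers


print(find_referrers(conn, "author", 1))
print(find_referrers(conn, "author", 2))
```

A non-empty result means a delete of that row would be blocked by at least one declared foreign key; an empty result means no declared constraint stands in the way.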
SqlAlchemy: As noted by stephan in his comment, if you use orm with backref for relations, then it should be quite easy for you to get the list of parent objects that keep reference to the object you are trying to delete, because those objects are basically mapped properties of your object (child1.Parent).
If you work with Table objects of sql alchemy (or not always use backref for relations), then you would have to get values of foreign_keys for all the tables, and then for all those ForeignKeys call references(...) method, providing your table as a parameter. In this way you will find all the FKs (and tables) that have reference to the table your object maps to. Then you can query all the objects that keep reference to your object by constructing the query for each of those FKs.
In general, there's no way to "discover" all of the references in a relational database.
In some databases, they may use declarative referential integrity in the form of explicit Foreign Key or Check constraints.
But there's no requirement to do this. It can be incomplete or inconsistent.
Any query can include a FK relationship that is not declared. Without the universe of all queries, you can't know the relationships which are used but not declared.
To find "referers" in general, you must actually know the database design and have all queries.
For each model class, you can easily see if all its one-to-many relations are empty simply by asking for the list in each case and seeing how many entries it contains. (There is probably a more efficient way implemented in terms of COUNT, too.) If there are any foreign keys relating to the object, and you have your object relations set up correctly, then at least one of these lists will be non-zero in length.