I'm using subqueryload/subqueryload_all pretty heavily, and I've run into an edge case where I need to define very explicitly the query that is used during the subqueryload. For example, I have a situation with posts and comments. My query looks something like this:
posts_q = db.query(Post).options(subqueryload(Post.comments))
As you can see, I'm loading each Post's comments. The problem is that I don't want all of the posts' comments: I also need to take a deleted field into account, and the comments need to be ordered by create time descending. The only way I have seen this done is by adding options to the relationship() declaration between posts and comments. I would prefer not to do this, because it means that relationship cannot be reused everywhere after that, and I have other places in the app where those constraints may not apply.
What I would love to do is explicitly define the query that subqueryload/subqueryload_all uses to load the posts' comments. I read about DisjointedEagerLoading here, and it looks like I could simply define a special function that takes the base query and a query to load the specified relationship. Is this a good route to take for this situation? Has anyone run into this edge case before?
The answer is that you can define multiple relationships between Posts and Comments:
class Post(...):
    active_comments = relationship(Comment,
        primaryjoin="and_(Comment.post_id==Post.post_id, Comment.deleted==False)",
        order_by=Comment.created.desc())
Then you should be able to subqueryload by that relationship:
posts_q = db.query(Post).options(subqueryload(Post.active_comments))
You can still use the existing .comments relationship elsewhere.
I also had this problem, and it took me some time to realize that this is an issue by design. When you say Post.comments you refer to the relationship that says "these are all the comments of that post". However, now you want to filter them. If you were to specify that condition somewhere on subqueryload, then you would essentially be loading only a subset of values into Post.comments. Thus, there would be values missing. Essentially you would have a faulty representation of your data in the model.
The question here is how to approach this then, because you obviously need this value somewhere. The way I go is building the subquery myself and then specifying the special conditions there. That means you get two objects back: the list of posts and the list of comments. That is not a pretty solution, but at least it does not display the data in a wrong way. If you were to access Post.comments for some reason, you can safely assume it contains all comments.
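A minimal sketch of that manual two-query approach, reusing the attribute names from the question (the grouping helper is just illustrative):

from collections import defaultdict

# Two explicit queries: the posts, and only the comments we actually want.
posts = db.query(Post).all()
comments = (db.query(Comment)
              .filter(Comment.post_id.in_([p.post_id for p in posts]),
                      Comment.deleted == False)
              .order_by(Comment.created.desc())
              .all())

# Group the filtered comments by post without touching Post.comments.
comments_by_post = defaultdict(list)
for c in comments:
    comments_by_post[c.post_id].append(c)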
But there is room for improvement: you might want to have this attached to your class so you don't carry around two variables. The easy way might be to define a second relationship, e.g. published_comments, which specifies extra parameters. You could then also ensure that no one writes to it, e.g. with attribute events. In those events you could, instead of forbidding manipulation, handle how manipulation is allowed. The only problem might be when updates happen: e.g. when you add a comment to Post.comments, published_comments won't be updated automatically because the two are not aware of each other. Again, I'd use events for this if it is a required feature (but with the ugly solution above you would not have that either).
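If you go the second-relationship route, a rough sketch of such a guard with attribute events could look like this (published_comments is the hypothetical relationship named above, not something defined in the question):

from sqlalchemy import event

# Hypothetical guard: force writers to go through Post.comments instead.
@event.listens_for(Post.published_comments, "append")
def _no_direct_append(target, value, initiator):
    raise ValueError("published_comments is read-only; append to Post.comments instead")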
As a last, hybrid, solution you could take the first approach and then just assign those values to your object, e.g. Post.deleted_comments = deleted_comments.
The thing to keep in mind here is that it is generally not a clever idea to manipulate the query the ORM makes, as this could lead to problems later on. I have taken this approach and manipulated the queries (with contains_eager this is easily possible), but it created problems at some points (while generally being functional), so I dropped that approach.
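For reference, the contains_eager variant mentioned above looks roughly like this; it filters the joined rows and stuffs the result into Post.comments, which is exactly the "faulty representation" caveat discussed above (column names follow the question's models):

from sqlalchemy import and_
from sqlalchemy.orm import contains_eager

posts_q = (db.query(Post)
             .outerjoin(Comment, and_(Comment.post_id == Post.post_id,
                                      Comment.deleted == False))
             .options(contains_eager(Post.comments))
             .order_by(Post.post_id, Comment.created.desc()))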
Let's take an example: I run a blog that automatically updates its posts.
I would like to keep an entity of class(=model) BlogPost in two different "groups", one called "FutureBlogPosts" and one called "PastBlogPosts".
This is a reasonable division that will allow me to work with my blog posts efficiently (query them separately etc.).
Basically the problem is that the "kind" of my model will always be "BlogPost". So how can I separate the entities into two different groups?
Here are the options I found so far:
Duplicating the same model class code twice (once as a FutureBlogPost class and once as a PastBlogPost class, so their kinds will be different) -- seems quite ridiculous.
Putting them under different ancestors (FutureBlogPost, "SomeConstantValue", BlogPost, #id) -- but this method also has its implications (1 write per second?) and the whole ancestor-child relationship doesn't seem to fit here. (And why would I have to use "SomeConstantValue" if I chose that option?)
Using different namespaces -- seems too radical for such a simple separation
What is the right way to do it?
Well, it seems I finally found the relevant article.
As I understand it, pulling all entities by a specific kind and pulling them by a specific property makes no difference; both require the same type of work behind the scenes.
(However, querying by a specific full key is still faster.)
So basically, adding a property named "Type" (or any other property you want to use to split your entities into groups) is just as useful as giving them a distinct kind.
Read more here: https://developers.google.com/appengine/articles/storage_breakdown
As you see, both EntitiesByKind and EntitiesByProperty are nothing but index tables to the original key.
Why not just put a boolean in your "BlogPost" entity: 0 if it's past, 1 if it's future? That will let you query them separately easily.
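A minimal sketch of that idea, assuming ndb (the property names here are made up for illustration):

from google.appengine.ext import ndb

class BlogPost(ndb.Model):
    title = ndb.StringProperty()
    publish_at = ndb.DateTimeProperty()
    is_future = ndb.BooleanProperty(default=True)  # flip to False once the post is in the past

# Query each "group" separately -- same kind, no extra ancestors or namespaces.
future_posts = BlogPost.query(BlogPost.is_future == True).fetch()
past_posts = BlogPost.query(BlogPost.is_future == False).fetch()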
I'd like some advice on the best way to do a strongly consistent read/write in Google App Engine.
My data is stored in a class like this.
class UserGroupData(ndb.Model):
users_in_group = ndb.StringProperty(repeated=True)
data = ndb.StringProperty(repeated=True)
I want to write a safe update method for this data. As far as I understand, I need to avoid eventually consistent reads here, because they risk data loss. For example, the following code is unsafe because it uses a vanilla query which is eventually consistent:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group==user_id).get()
    entity.data.append(additional_data)
    entity.put()
If the entity returned by the query is stale, data is lost.
In order to achieve strong consistency, it seems I have a couple of different options. I'd like to know which option is best:
Option 1:
Use get_by_id(), which is always strongly consistent. However, there doesn't seem to be a neat way to do this here. There isn't a clean way to derive the key for UserGroupData directly from a user_id, because the relationship is many-to-one. It also seems kind of brittle and risky to require my external clients to store and send the key for UserGroupData.
Option 2:
Place my entities in an ancestor group, and perform an ancestor query. Something like:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group==user_id,
                                 ancestor=ancestor_for_all_ugd_entities()).get()
    entity.data.append(additional_data)
    entity.put()
I think this should work, but putting all UserGroupData entities into a single ancestor group seems like an extreme thing to do. It results in writes being limited to ~1/sec. This seems like the wrong approach, since each UserGroupData is actually logically independent.
Really what I'd like to do is perform a strongly consistent query for a root entity. Is there some way to do this? I noticed a suggestion in another answer to essentially shard the ancestor group. Is this the best that can be done?
Option 3:
A third option is to do a keys_only query followed by get_by_id(), like so:
def update_data(user_id, additional_data):
    entity_key = UserGroupData.query(UserGroupData.users_in_group==user_id).get(keys_only=True)
    entity = entity_key.get()
    entity.data.append(additional_data)
    entity.put()
As far as I can see this method is safe from data loss, since my keys are not changing and the get() gives strongly consistent results. However, I haven't seen this approach mentioned anywhere. Is this a reasonable thing to do? Does it have any downsides I need to understand?
I think you are also conflating the issue of inconsistent queries with safe updates of the data.
A query like the one in your example, UserGroupData.query(UserGroupData.users_in_group==user_id).get(), will only ever return one entity, if the user_id is in the group.
If it has only just been added and the index is not yet up to date, then you won't get a record and therefore you won't update it.
Any update, irrespective of the method of fetching the entity, should be performed inside a transaction to ensure update consistency.
As to ancestors improving the consistency of the query: it's not obvious whether you plan to have multiple UserGroupData entities, and if so, why are you doing a get()?
So option 3 is probably your best bet: do the keys-only query, then inside a transaction do the Key.get() and the update (there's a sketch of this flow after the list below). Remember that cross-group transactions are limited to 5 entity groups.
Given this approach, if the index the query is based on is out of date, then one of three things can happen:
the record you want isn't found, because the newly added user_id is not yet reflected in the index;
the record you want is found, and the get() fetches it consistently;
the record you want is found, but the user_id has actually been removed and the index is out of date; the get() retrieves the entity consistently and the user_id is not present.
Your code can then decide what course of action to take.
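A rough sketch of that option 3 flow, using ndb and the model from the question (the _append_data helper is just illustrative):

from google.appengine.ext import ndb

@ndb.transactional
def _append_data(key, additional_data):
    # key.get() inside the transaction is strongly consistent.
    entity = key.get()
    if entity is None:
        return
    entity.data.append(additional_data)
    entity.put()

def update_data(user_id, additional_data):
    # Eventually consistent keys-only query just to locate the entity...
    key = UserGroupData.query(UserGroupData.users_in_group == user_id).get(keys_only=True)
    if key is None:
        return  # index not yet updated, or the user is in no group
    # ...then read and update the current data transactionally.
    _append_data(key, additional_data)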
What is the use case for querying all UserGroupData entities that a particular user is a member of that would require updates?
Is there any way to remove select_related from a queryset?
I found that Django adds a JOIN to the SQL query on the count() operation.
So, if we have code like this:
entities = Entities.objects.select_related('subentity').all()
# We will have an INNER JOIN here...
entities.count()
I'm looking for a way to remove join.
One important detail: I pass this queryset into the Django paginator, so I can't simply write
Entities.objects.all().count()
I believe these code comments provide a relatively good answer to the general question asked here:
If select_related(None) is called, the list is cleared.
https://github.com/django/django/blob/stable/1.8.x/django/db/models/query.py#L735
In the general sense, if you want to do something to the entities queryset but first remove the select_related items from it, use entities.select_related(None).
However, that probably doesn't solve your particular situation with the paginator. If you do entities.count(), then it will already remove the select_related items. If you find yourself with extra JOINs taking place, there could be several non-ideal factors at play. It could be that the ORM fails to remove the JOIN because of other logic that may or may not affect the count when combined with the select_related.
As a simple example of one of these non-ideal cases, consider Foo.objects.select_related('bar').count() versus Foo.objects.select_related('bar').distinct().count(). It might be obvious to you that the original queryset does not contain multiple entries, but it is not obvious to the Django ORM. As a result, the SQL that executes contains a JOIN, and there is no universal prescription to work around that. Even applying .select_related(None) will not help you.
Can you show the code where you need this? I think refactoring is the best answer here.
If you want a quick answer: entities.query.select_related = False, but it's rather hacky (and don't forget to restore the value if you need select_related later).
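A minimal sketch of that hack, using the queryset from the question:

entities = Entities.objects.select_related('subentity').all()

saved = entities.query.select_related
entities.query.select_related = False   # the COUNT query now skips the JOIN
total = entities.count()
entities.query.select_related = saved   # restore before fetching the actual rows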
I've got a query that returns a fair number of rows, and have found that
We wind up throwing away most of the associated ORM instances; and
building up those soon-to-be-thrown-away instances is pretty slow.
So I'd like to build only the instances that I need!
Unfortunately, I can't do this by simply restricting the query; I need to do a fair bit of "business logic" processing on each row before I can tell if I'll throw it out; I can't do this in SQL.
So I was thinking that I could use a MapperExtension to handle this: I'd subclass MapperExtension, and then override create_instance; that method would examine the row data, and either return EXT_CONTINUE if the data is worth building into an instance, or ... something else (I haven't yet decided what) otherwise.
Firstly, does this approach even make sense?
Secondly, if it does make sense, I haven't figured out how to find the data I need in the arguments that get passed to create_instance. I suspect it's in there somewhere, but it's hard to find ... instead of getting a row that directly corresponds to the particular class I'm interested in, I'm getting a row that corresponds to the query that SQLAlchemy generated, which happens to be a somewhat complex join between (say) tables A, B, and C.
The problem is that I don't know which elements of the row correspond to the fields in my ORM class: I want to be able to pluck out (e.g.) A.id, B.weight, and C.height.
I assume that somewhere inside the mapper, selectcontext, or class_ arguments is some sort of mapping between columns of my table, and offsets into the row. But I haven't yet found just the right thing. I've come tantalizingly close, though. For example, I've found that selectcontext.statement.columns contains the names of the generated columns ... but not those of the table I'm interested in. For example:
Column(u'A_id', UUID(), ...
...
Column(u'%(32285328 B)s_weight', MSInt(), ...
...
Column(u'%(32285999 C)s_height', MSInt(), ...
So: how do I map column names like C.height to offsets into the row?
The row accepts Column objects as indexes:
row[MyClass.some_element.__clause_element__()]
but that will only get you as far as the classes and aliased() constructs you have access to on the outside. It's very likely that that would be all you'd need for that part of the issue (even though ultimately the idea won't work; read on).
If your statement has had subqueries wrapped around it, from using things like from_self() or join() to a polymorphic target, the create_instance() method doesn't give you access to the translation functions you'd need to accomplish that.
If you're trying to get at rows that are linked to an eagerload(), that's totally not something you should be doing. eagerload() is about optimizing the load of collections. If you want your query to join between two tables and you're looking to filter on the joined table, use join().
But above all, create_instance() dates from version 0.1 of SQLAlchemy, I doubt anyone uses it for anything, and it has no capability to say "skip this row". It has to return something, or the mapper will create the instance on its own. So no matter how well you can interpret the row, there's no hook for what you want to do here.
If I really wanted to do such a thing, it would likely be easier to monkeypatch the fetchall() method of the returned ResultProxy to filter rows, and send the result to Query.instances(). Any result can be sent to this method. Although, if the Query has done translations and such on the mapped selectables, it would need the original QueryContext as well to know how to translate. But this is nothing I'd bother with either.
Overall, if speed is so critical an issue throughout all of this that creating the object makes that big of a difference, I'd make it so that I don't need the mapped objects at all for the whole operation, or I'd use caching, or generate the objects I need manually from a result set. I'd also make sure that I have access to all the targeted columns in the selectable I'm using so I can re-fetch from result rows, which means I either don't use automatic subquery/alias generation functions in the ORM, or I use the expression language directly (if you're really hungry for speed and in the mood to write large tracts of optimizing code, you should probably just be using the expression language).
So the real questions you have to ask here are:
Have you verified that the real difference in speed is creating the object from the row, i.e. not fetching the row, or fetching its columns, etc.?
Does the row just have some expensive columns that you don't need? Have you looked into deferred()?
What are these business rules, and why can't they be done in SQL, as stored procedures, etc.?
How many thousands of rows are you really skipping here, such that it's so "slow" not to "skip" them?
Have you investigated techniques for having the objects already present, like in-memory caches, preloads, etc.? For many scenarios, this fits the bill.
None of this works, and you really want to hack up some home-rolled optimization code. So why not use the SQL expression language directly? If ultimately you're just dealing with a view layer, result rows are quite friendly (they allow "attribute"-style access and such), or you can build some quick "generate an object" routine from them. The ORM presents a very specific use case of the SQL expression language, and if you really need something much more lightweight than it, you're better off skipping it.
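As a rough illustration of that last point, dropping to the expression language might look something like this; the tables and columns match the question's A.id, B.weight, C.height example, but session, passes_business_rules, and handle are placeholders, and the join conditions are assumed to be inferable from foreign keys:

from sqlalchemy import select

a, b, c = A.__table__, B.__table__, C.__table__

stmt = (select([a.c.id, b.c.weight, c.c.height])
        .select_from(a.join(b).join(c)))

for row in session.execute(stmt):
    if not passes_business_rules(row):      # the non-SQL filtering logic
        continue
    handle(row.id, row.weight, row.height)  # plain rows, no ORM instances built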
I'm trying to use SQLAlchemy to implement a basic users-groups model where users can have multiple groups and groups can have multiple users.
When a group becomes empty, I want the group to be deleted, along with other things associated with the group (fortunately, SQLAlchemy's cascade works fine for these simpler situations).
The problem is that cascade='all, delete-orphan' doesn't do exactly what I want; instead of deleting the group when the group becomes empty, it deletes the group when any member leaves the group.
Adding triggers to the database works fine for deleting a group when it becomes empty, except that triggers seem to bypass SQLAlchemy's cascade processing so things associated with the group don't get deleted.
What is the best way to delete a group when all of its members leave, and have this deletion cascade to related entities?
I understand that I could do this manually by finding every place in my code where a user can leave a group and then doing the same thing as the trigger; however, I'm afraid I would miss places in the code (and I'm lazy).
The way I've generally handled this is to have a function on your user or group called leave_group. When you want a user to leave a group, you call that function, and you can add any side effects you want in there. In the long term, this makes it easier to add more side effects (for example, when you want to check that someone is allowed to leave a group).
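A minimal sketch of that idea, assuming a Group.users collection from the many-to-many mapping (the function and argument names are illustrative):

def leave_group(user, group, session):
    """Remove user from group and clean up if the group becomes empty."""
    group.users.remove(user)
    # All "leaving" side effects live in this one place (permission checks, etc.).
    if not group.users:
        session.delete(group)   # ORM-level delete, so configured cascades still run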
I think you want cascade='save-update, merge, expunge, refresh-expire, delete-orphan'. This will drop the "delete" cascade (which you get from "all") but keep "delete-orphan", which is what you're looking for, I think (delete when there are no more parents).
I had the same problem about 3 months ago: I have a Post/Tags relation and wanted to delete unused Tags. I asked on IRC, and SA's author told me that cascades on many-to-many relations are not supported, which kind of makes sense since there is no "parent" in many-to-many.
But extending SA is easy; you can probably use an AttributeExtension to check whether the group became empty when a user is removed, and delete it from there.
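On newer SQLAlchemy versions the same idea can be sketched with the events API instead of AttributeExtension, checking for empty groups at flush time; Group, Group.users, and Session here are assumed names from your mapping, not from the question:

from sqlalchemy import event

@event.listens_for(Session, "before_flush")
def delete_empty_groups(session, flush_context, instances):
    # Any group whose users collection was emptied in this unit of work gets deleted,
    # so the ORM-level cascades to related entities still apply.
    for obj in session.dirty:
        if isinstance(obj, Group) and not obj.users:
            session.delete(obj)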
Could you post a sample of your table and mapper set up? It might be easier to spot what is going on.
Without seeing the code it is hard to tell, but perhaps there is something wrong with the direction of the relationship?