GQL query ReferenceProperty for objects with no references - python

I have the following models, where one Thing can have multiple Action:
class Thing(polymodel.PolyModel):
    create_date = db.DateTimeProperty(required=True)
class Action(db.Model):
    thing = db.ReferenceProperty(Thing, collection_name='action')
    action_by = db.StringProperty(required=True)
I need to use raw GQL to count the number of Things that have no actions. It should be something like:
SELECT * FROM Thing WHERE action = []
I have the following limitations:
I must use raw GQL (because the actual query contains DISTINCT which is not supported on a regular Query).
I can't fetch the data and check the contents because I use the remote API and would like to only count the data to save bandwidth.
Can it be done?

Firstly, you're completely mistaken to say that you have to use GQL because normal queries don't support DISTINCT: there is nothing you can do with GQL that you can't do with a normal query. The datastore is not a database, and does not have an underlying query language that ORM calls must be translated to; on the contrary, in fact GQL calls are translated into RPC calls in exactly the same way as model calls, and there is no benefit to using GQL at all. In this specific case, the Query class has a distinct parameter.
However, another implication of the datastore not being an SQL database is that you cannot do JOINs. There is no way to select instances of Thing based on any property in Action, whether it's a specific field value or the absence of any relation. The only way to do this would be to get all distinct values of Action.thing, then all Things, and work out the set difference.
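The set-difference step described above can be sketched in plain Python. This is only an illustration, not datastore code: the two input lists stand in for the keys returned by a keys-only query over Thing and for the distinct values of Action.thing, and the function name things_without_actions is hypothetical.

```python
def things_without_actions(all_thing_keys, referenced_thing_keys):
    """Return the keys of Things that no Action refers to."""
    return set(all_thing_keys) - set(referenced_thing_keys)


# Hypothetical sample data standing in for the results of the two queries.
all_thing_keys = ["thing-1", "thing-2", "thing-3", "thing-4"]
referenced_thing_keys = ["thing-2", "thing-4", "thing-2"]

orphans = things_without_actions(all_thing_keys, referenced_thing_keys)
orphan_count = len(orphans)
```

Fetching only keys for both sides keeps the transferred data small, which matters when going through the remote API.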

SQLAlchemy MetaData.reflect() vs. automap_base.prepare()

It seems to me that MetaData.reflect() and sqlalchemy.ext.automap.prepare() tables should be able to be used interchangeably for many use cases, but they can't be.
Passing metadata.tables['mytable'] to conn.execute(select(...)) returns a sqlalchemy.engine.cursor.CursorResult, and each row exposes the columns directly (e.g. x.columnA).
But passing automap_base().classes.mytable to the same conn.execute(select(...)) returns a sqlalchemy.engine.result.ChunkedIteratorResult, and you need x.mytable.columnA to get at the column.
The sqlalchemy.engine.Result() documentation says as much:
New in version 1.4: The Result object provides a completely updated
usage model and calling facade for SQLAlchemy Core and SQLAlchemy ORM.
In Core, it forms the basis of the CursorResult object which replaces
the previous ResultProxy interface. When using the ORM, a higher level
object called ChunkedIteratorResult is normally used.
Can I generically convert one to the other? That is, some wrapper that works for every table without needing the table name?
What's the best futureproof way to do this? I want my code to be forward-looking to sqlalchemy 2.0. Does that mean I should move away from either automap or MetaData?
sqlalchemy 1.4.35
This is the difference between the Core and the ORM.
select() from a Table vs. ORM class
While the SQL generated in these examples looks the same whether we
invoke select(user_table) or select(User), in the more general case
they do not necessarily render the same thing, as an ORM-mapped class
may be mapped to other kinds of “selectables” besides tables. The
select() that’s against an ORM entity also indicates that ORM-mapped
instances should be returned in a result, which is not the case when
SELECTing from a Table object.
Don't hesitate to use the ORM. It's higher level, pythonic, cool, and automap is ORM.

Why does Django not allow a user to extract a usable query from a QuerySet as a standard feature?

In Django, you can extract a plain-text SQL query from a QuerySet object like this:
queryset = MyModel.objects.filter(**filters)
sql = str(queryset.query)
In most cases, this query itself is not valid - you can't pop this into a SQL interface of your choice or pass it to MyModel.objects.raw() without exceptions, since quotations (and possibly other features of the query) are not performed by Django but rather by the database interface at execution time. So at best, this is a useful debugging tool.
Coming from a data science background, I often need to write a lot of complex SQL queries to aggregate data into a reporting format. The Django ORM can be awkward at best and impossible at worst when queries need to be very complex. However, it does offer some security and convenience with respect to limiting SQL injection attacks and providing a way to dynamically build a query - for example, generating the WHERE clause for the query using the .filter() method of a model. I want to be able to use the ORM to generate a base data set in the form of a query, then take that query and use it as a subquery/CTE in a larger query that handles more complex logic. For example:
queryset = MyModel.objects.filter(**filters)
sql = str(queryset.query)
more_complex_query = f"""
with filtered_table as ({sql})
select
*
/* add other stuff */
from
filtered_table
"""
results = MyModel.objects.raw(more_complex_query)
In this case, the ORM generates a query that can be used to filter the base table, then the CTE/raw sql can take that result and do whatever calculations need to be done with a tool that is more common among people working with data (SQL) than the Django ORM, while still getting the ORM benefits of stripping bad actors out.
However, this method requires a way to generate a usable SQL query from a QuerySet object. I've found a workaround for postgres databases using the psycopg2 cursor:
from django.db import connections
# Whatever the key is in your settings.DATABASES for the reporting db
WAREHOUSE_CONNECTION_NAME = 'default'
# Get the Query object and separate it into the query and params
filtered_table_query = MyModel.objects.filter(**filters).query
raw_query, params = filtered_table_query.sql_with_params()
# Create a cursor from the relevant connection
cursor = connections[WAREHOUSE_CONNECTION_NAME].cursor()
# Call .mogrify() on the query/params to get an executable query string
# psycopg2's mogrify() returns bytes, so decode before using it in an f-string
usable_sql = cursor.mogrify(raw_query, params).decode()
cursor.execute(usable_sql) # This works
cursor.fetchall() # This works
# Have not tried this yet
MyModel.objects.raw(usable_sql)
# Or this
wrapper_query = f"""
with base_table as ({usable_sql})
select
*
from
base_table
"""
cursor.execute(wrapper_query)
# or
MyModel.objects.raw(wrapper_query)
This method is dependent on the psycopg2 cursor method .mogrify() - I am not sure if this works for other back ends or if the DB API 2.0 spec takes care of that.
Other people have suggested creating a view in the database and then using an unmanaged Django model on top of the view, but I think this does not really work when your queries are dynamic in nature, i.e. need to be filtered differently based on some user input, since often the fields a user wants to filter on are not present in the result set after some aggregation.
So overall, I have two questions:
Is there a reason why Django does not let you extract a usable SQL query as a standard offering?
What other methods do people use when the ORM makes your elegant SQL into an ugly mess?
The Django developers tend to frown on features that aren't cross-compatible across all the databases they support. I can only imagine that one of the supported database engines doesn't have this capability and so they don't provide it as a standard, documented feature of the ORM.
But that's just a guess. You'd really have to ask one of the devs :)
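One way to wrap a generated query in a CTE without producing a fully interpolated SQL string is to keep the placeholders and pass the parameters along with the outer query. The sketch below uses the standard-library sqlite3 module as a stand-in, so it is not Django code: the table mymodel, the helper wrap_in_cte, and the hand-written inner_sql and inner_params (imitating what sql_with_params() returns) are all assumptions made for illustration. Placeholder syntax also differs by driver (sqlite3 uses ?, psycopg2 uses %s).

```python
import sqlite3


def wrap_in_cte(inner_sql, params):
    """Wrap a parameterized query in a CTE, leaving its placeholders intact."""
    outer_sql = (
        f"WITH filtered_table AS ({inner_sql}) "
        "SELECT COUNT(*) FROM filtered_table"
    )
    return outer_sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO mymodel (status) VALUES (?)",
    [("open",), ("closed",), ("open",)],
)

# Hand-written stand-in for the (sql, params) pair an ORM would generate.
inner_sql = "SELECT id, status FROM mymodel WHERE status = ?"
inner_params = ("open",)

outer_sql, outer_params = wrap_in_cte(inner_sql, inner_params)
# The driver binds the values at execution time; nothing is string-formatted in.
count = conn.execute(outer_sql, outer_params).fetchone()[0]
```

In Django, the analogous step would presumably be to pass the pair from sql_with_params() through the same kind of wrapper and hand both the SQL and the params to cursor.execute() or MyModel.objects.raw(), so the database driver still does the quoting.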

How to optimize lazy loading of related object, if we already have its instance?

I like how Django ORM lazy loads related objects in the queryset, but I guess it's quite unpredictable as it is.
The queryset API doesn't keep the related objects when they are used to make a queryset, thereby fetching them again when accessed later.
Suppose I have a ModelA instance (say instance_a) which is a foreign key (say for_a) of some N instances of ModelB. Now I want to perform query on ModelB which has the given ModelA instance as the foreign key.
Django ORM provides two ways:
Using .filter() on ModelB:
b_qs = ModelB.objects.filter(for_a=instance_a)
for instance_b in b_qs:
    instance_b.for_a # <-- fetches the same row for ModelA again
Results in 1 + N queries here.
Using reverse relations on ModelA instance:
b_qs = instance_a.for_a_set.all()
for instance_b in b_qs:
    instance_b.for_a # <-- this uses the instance_a from memory
Results in 1 query only here.
While the second way achieves the result, it's not part of the standard API and not usable in every scenario, for example when I have instances of two foreign keys of ModelB (say, ModelA and ModelC) and want the objects related to both of them.
Something like the following works:
ModelB.objects.filter(for_a=instance_a, for_c=instance_c)
I guess it's possible to use .intersection() for this scenario, but I would like a way to achieve this via the standard API. After all, covering such cases would require more code with non-standard queryset functions which may not make sense to the next developer.
So, the first question: is it possible to optimize such scenarios with the standard API itself?
The second question, if it's not possible right now, can it be added with some tweaks with the QuerySet?
PS: It's my first time asking a question here, so forgive me if I made any mistake.
You could improve the query by using select_related():
b_qs = ModelB.objects.select_related('for_a').filter(for_a=instance_a)
or
b_qs = instance_a.for_a_set.select_related('for_a')
Does that help?
You use .select_related(..) for ForeignKeys, or .prefetch_related(..) for something-to-many relations.
With .select_related(..) Django makes a JOIN at the database side (an INNER JOIN for a non-nullable ForeignKey, a LEFT OUTER JOIN for a nullable one), fetches the columns for both objects in one query, and deserializes them into the proper objects.
ModelB.objects.select_related('for_a').filter(for_a=instance_a)
For relations that are one-to-many (so a reversed ForeignKey), or ManyToManyFields, this is not a good idea, since it could result in a large amount of duplicate objects that are retrieved. This would result in a large answer from the database, and a lot of work at the Python end to deserialize these objects. .prefetch_related will make individual queries, and then do the linking itself.
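The difference between lazy per-row lookups and a single JOIN can be illustrated at the SQL level. This is a rough sketch using the standard-library sqlite3 module rather than Django itself: the tables model_a and model_b, the column for_a_id, and the run helper that logs each query are hypothetical stand-ins for what the ORM would issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE model_a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE model_b (
        id INTEGER PRIMARY KEY,
        for_a_id INTEGER NOT NULL REFERENCES model_a(id)
    );
    INSERT INTO model_a (id, name) VALUES (1, 'a1');
    INSERT INTO model_b (for_a_id) VALUES (1), (1), (1);
    """
)

query_log = []


def run(sql, params=()):
    """Execute a query and record it, so round trips can be counted."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


# Lazy pattern: one query for the B rows, then one more per row for its A.
b_rows = run("SELECT id, for_a_id FROM model_b WHERE for_a_id = ?", (1,))
for _b_id, a_id in b_rows:
    run("SELECT id, name FROM model_a WHERE id = ?", (a_id,))
lazy_query_count = len(query_log)

# JOIN pattern: the related row arrives together with each B row.
query_log.clear()
joined_rows = run(
    "SELECT b.id, a.id, a.name FROM model_b AS b "
    "LEFT OUTER JOIN model_a AS a ON a.id = b.for_a_id "
    "WHERE b.for_a_id = ?",
    (1,),
)
joined_query_count = len(query_log)
```

With three B rows the lazy pattern issues four queries and the JOIN issues one, at the cost of repeating the A columns on every row.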

Google app engine: better way to make query

Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()
class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()
class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query: BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", AEntity.query(ancestor = ndb.Key("RootEntity", 1)).filter(AEntity.ap == int(some_value)).get().key.integer_id()))
How can I optimize this query, or at least make it less complicated?
Upd:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you ability to update multiple entities transactionally, as long as they are a part of the same entity group (this limitation has been somewhat relaxed with the new XG transactions). They also allow you to use queries within transactions (not available via XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
There is a give and take with ancestor queries. They are more verbose and messy to deal with, but you get a better structure to your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key, it already has all of your ancestor information encoded.
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try and use IDs whenever possible, so you can avoid having to filter for entities in your datastore by properties and just reference them by ID:
BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure that you can ensure the IDs you manually create/use will be unique across all instances of that Model that share the same parent.
EDIT:
To clarify my last example: I was suggesting that you restructure the data so that int(some_value) is used as the integer ID of the AEntity rather than stored as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value) and is executed with a get(), which implies that you always expect a single result for that value. That makes it a good candidate for the integer ID in the key of that object, eliminating the need for a query.
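The idea of using the value as the ID can be shown with a toy model. This is not ndb code: the dictionary store, the tuple key paths, and both helper functions are hypothetical stand-ins meant only to contrast finding the ancestor by scanning a property with building its key path directly.

```python
# Toy stand-in for the datastore: key path tuple -> entity properties.
store = {
    ("RootEntity", 1, "AEntity", 42): {"ap": 42},
    ("RootEntity", 1, "AEntity", 43): {"ap": 43},
    ("RootEntity", 1, "AEntity", 42, "BEntity", 1): {"bp": "first"},
    ("RootEntity", 1, "AEntity", 42, "BEntity", 2): {"bp": "second"},
    ("RootEntity", 1, "AEntity", 43, "BEntity", 3): {"bp": "third"},
}


def ancestor_path_via_query(root_id, some_value):
    """Scan the AEntity children of the root for ap == some_value."""
    for path, props in store.items():
        is_a_entity = len(path) == 4 and path[:3] == ("RootEntity", root_id, "AEntity")
        if is_a_entity and props.get("ap") == some_value:
            return path
    return None


def ancestor_path_via_id(root_id, some_value):
    """Build the key path directly; assumes some_value is the AEntity's ID."""
    return ("RootEntity", root_id, "AEntity", some_value)


def b_children(ancestor_path):
    """Return the bp values of BEntity rows stored under the ancestor path."""
    n = len(ancestor_path)
    return sorted(
        props["bp"]
        for path, props in store.items()
        if len(path) == n + 2 and path[:n] == ancestor_path and path[n] == "BEntity"
    )


via_query = b_children(ancestor_path_via_query(1, 42))
via_id = b_children(ancestor_path_via_id(1, 42))
```

Both routes reach the same children, but the second never has to look at the AEntity rows, which mirrors dropping the inner AEntity.query(...) from the original expression.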

Select statement with SqlAlchemy

Yes, very basic question.
I've successfully created my db using declarative_base, and can perform inserts into the db too. I just have a few questions about SqlAlchemy sql statements.
I've created a table called Location.
A few issues/questions (see code below):
For the statement "print row", I have to specify each column name that I want to output, i.e. "print row.name, row.lat, etc." Why? (Otherwise the print statement outputs "<classname.Location at <...>>".)
Also, what is the preferred way to interact with a db and perform queries (select, insert, update, etc.)- there seem to be a bunch of options: using sqlalchemy.orm.select for example, or engine.text(<sql query>).execute().fetchall(), or even conn.execute(<select>). Options are great, but right now they're all just confusing me.
Thanks so much for the tips!
Here's my code:
from sqlalchemy import create_engine
from sqlalchemy.sql import select
from location_db_setup import *
db_path = "sqlite:////volumes/users/shared/programming/python/web/map.db"
engine = create_engine(db_path, echo= True)
Session = sessionmaker(bind= engine)
session = Session()
session.query(Location).fetchall()
for row in locations:
    print row
The code in your sample is incomplete and has errors, so it's impossible to say for sure what Location is here. I assume it's a mapped class, so you are requesting a list of all Location objects, not rows. When you print an object you get its string representation. The string representation of an object can be changed by defining a custom __str__ method.
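A minimal sketch of that last point, using a plain Python class as a stand-in for the mapped Location class (the real class and its columns are not shown in the question, so the name and lat attributes are assumptions taken from the "print row.name, row.lat" example). Defining __repr__ also covers str() and print, since __str__ falls back to it.

```python
class Location:
    """Plain-Python stand-in for the mapped Location class."""

    def __init__(self, name, lat):
        self.name = name
        self.lat = lat

    def __repr__(self):
        # Used by print(), str(), and the interactive prompt.
        return f"Location(name={self.name!r}, lat={self.lat!r})"


loc = Location("Harbor", 47.6)
text = str(loc)
```

The same method can be added to a declarative model class, so printing a query result shows the column values instead of the default object address.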
Although the ORM is the most important part of SQLAlchemy, it's not the only one. It also exposes a lot of functionality not directly related to the ORM. When you work with objects, the preferred way to create queries is the corresponding session method. But sometimes you need selectable objects not bound to a particular session (they are not executed directly, but are used in expressions passed to session methods). That's why there are functions in the sqlalchemy.orm package.
The preferred way to interact with a db when using an ORM is not to use queries but to use objects that correspond to the tables you are manipulating, typically in conjunction with the session object. SELECT queries become get() or find() calls in some ORMs, query() calls in others. INSERT becomes creating a new object of the type you want (and maybe explicitly adding it, eg session.add() in sqlalchemy). UPDATE becomes editing such an object, and DELETE becomes deleting an object (eg. session.delete() ). The ORM is meant to handle the hard work of translating these operations into SQL for you.
Have you read the tutorial?
Denis and Kylotan gave you good answers. I'm just gonna focus on point 2.
Sometimes it depends on your taste. There are times when you need database-specific features that an ORM can't provide; in that case you should use session.execute(<sql here>) or conn.execute(<sql here>). Another case is when you have a very complex query that is beyond you and you can't find a suitable ORM expression for it.
Usually, using ORM features like select([...]).where(... or Session.query(<Model here>).filter(... (declarative base) are enough. Almost every sql query has an ORM equivalent.
