The current value of an Entity's status attribute can be queried as the latest entry in an EntityHistory table for that entity, i.e.
Entities (id) <- EntityHistory (timestamp, entity_id, value)
How do I write an efficient SQLAlchemy expression that eagerly loads the current value from the history table for all entities without causing N+1 queries?
I tried writing a property for my model, but this generates a query for each entity (N+1) when I iterate over them. To my knowledge, there is no way to solve this without a subquery, which still seems inefficient on the database side.
Example EntityHistory data:
timestamp | entity_id | value
==========|===========|======
15:00     |         1 | x
15:01     |         1 | y
15:02     |         2 | x
15:03     |         2 | y
15:04     |         1 | z
So the current value for entity 1 would be z and for entity 2 it would be y. The backing database is Postgres.
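For reference, the models and the naive property I tried look roughly like this (simplified; names and column types are approximations of my actual code):
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import object_session

Base = declarative_base()

class Entities(Base):
    __tablename__ = 'entities'
    id = Column(Integer, primary_key=True)

    @property
    def value(self):
        # One query per instance -> N+1 when iterating over many entities.
        return (object_session(self)
                .query(EntityHistory.value)
                .filter(EntityHistory.entity_id == self.id)
                .order_by(EntityHistory.timestamp.desc())
                .limit(1)
                .scalar())

class EntityHistory(Base):
    __tablename__ = 'entity_history'
    entity_id = Column(Integer, ForeignKey('entities.id'), primary_key=True)
    timestamp = Column(DateTime, primary_key=True)
    value = Column(String)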
I think you could use a column_property to load the latest value as an attribute of an Entities instance alongside the other column-mapped attributes:
from sqlalchemy import select
from sqlalchemy.orm import column_property

class Entities(Base):
    ...
    value = column_property(
        select([EntityHistory.value]).
        where(EntityHistory.entity_id == id).  # the id column from before
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        correlate_except(EntityHistory)
    )
A subquery could of course also be used in a query instead of a column_property.
query = session.query(
    Entities,
    session.query(EntityHistory.value).
        filter(EntityHistory.entity_id == Entities.id).
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        label('value')
)
Performance would naturally depend on a proper index being in place:
Index('entityhistory_entity_id_timestamp_idx',
      EntityHistory.entity_id,
      EntityHistory.timestamp.desc())
In a way this is still your dreaded N+1, as the query uses a subquery per row, but it's hidden in a single round trip to the DB.
If, on the other hand, having value as a property of Entities is not necessary, in PostgreSQL you can join against a DISTINCT ON ... ORDER BY subquery to fetch the latest values:
# The same index from before speeds this up.
# Remember nullslast(), if timestamp can be NULL.
values = session.query(EntityHistory.entity_id,
                       EntityHistory.value).\
    distinct(EntityHistory.entity_id).\
    order_by(EntityHistory.entity_id, EntityHistory.timestamp.desc()).\
    subquery()
query = session.query(Entities, values.c.value).\
    join(values, values.c.entity_id == Entities.id)
Though in limited testing with dummy data, the subquery-as-output-column consistently beat the join by a notable margin when every entity had history values. On the other hand, with millions of entities and a lot of missing history values, a LEFT JOIN was faster. I'd recommend testing against your own data to see which query suits it better. For random access to a single entity, given that the index is in place, the correlated subquery is faster; for bulk fetches, test.
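For completeness, the LEFT JOIN variant mentioned above would look roughly like this, reusing the values subquery from before; entities without any history rows then get None as their value:
query = session.query(Entities, values.c.value).\
    outerjoin(values, values.c.entity_id == Entities.id)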
Related
I have a data model where I store a list of values separated by commas (1,2,3,4,5...).
In my code, in order to work with arrays instead of strings, I have defined the model like this:
class MyModel(db.Model):
    pk = db.Column(db.Integer, primary_key=True)
    __fake_array = db.Column(db.String(500), name="fake_array")

    @property
    def fake_array(self):
        if not self.__fake_array:
            return
        return self.__fake_array.split(',')

    @fake_array.setter
    def fake_array(self, value):
        if value:
            self.__fake_array = ",".join(value)
        else:
            self.__fake_array = None
This works perfectly, and from the point of view of my source code "fake_array" is an array; it's only transformed into a string when it's stored in the database.
The problem appears when I try to filter by that field. Expressions like this don't work:
MyModel.query.filter_by(fake_array="1").all()
It seems that I can't filter using the SQLAlchemy query model.
What can I do here? Is there any way to filter on this kind of field? Is there a better pattern for the "fake_array" problem?
Thanks!
What you're trying to do should really be replaced with a pair of tables and a relationship between them.
The first table (which I'll call A) contains everything BUT the array column, and it should have a primary key of some sort. You should have another table (which I'll call B) that contains a primary key, a foreign key column to A (which I'll call a_id), and an integer value field.
Using this layout, each row in the A table has its associated array in table B where B's a_id == A.id via a join. You can add or remove values from the array by manipulating the rows in table B. You can filter by using a join.
If the order of the values is needed, then create an order column in table B.
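A rough sketch of that layout with Flask-SQLAlchemy (as in your model); the relationship, cascade and the example filter at the end are just one way to wire it up, so adjust names to taste:
class A(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    # One row in B per element of the former fake_array.
    values = db.relationship('B', order_by='B.position',
                             cascade='all, delete-orphan')

class B(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    a_id = db.Column(db.Integer, db.ForeignKey('a.id'), nullable=False)
    value = db.Column(db.Integer, nullable=False)
    position = db.Column(db.Integer)  # only needed if element order matters

# Filtering becomes a join instead of string matching:
A.query.join(B).filter(B.value == 1).all()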
I have two Django models (A and B) which are not related by any foreign key, but both have a geometry field.
class A(Model):
    position = PointField(geography=True)

class B(Model):
    position = PointField(geography=True)
I would like to relate them spatially, i.e. given a queryset of A, being able to obtain a queryset of B containing those records that are at less than a given distance to A.
I haven't found a way to do such a thing using Django's ORM alone.
Of course, I could write a property in A such as this one:
@property
def nearby(self):
    return B.objects.filter(position__dwithin=(self.position, 0.1))
But this only allows me to fetch the nearby records on each instance and not in a single query, which is far from efficient.
I have also tried to do this:
nearby = B.objects.filter(position__dwithin=(OuterRef('position'), 0.1))
query = A.objects.annotate(nearby=Subquery(nearby.values('pk')))
list(query) # error here
However, I get this error for the last line:
ValueError: This queryset contains a reference to an outer query and may only be used in a subquery
Does anybody know a better way (more efficient) of performing such a query or maybe the reason why my code is failing?
Any help would be very much appreciated.
I finally managed to solve it, but I had to perform a raw SQL query in the end.
This will return all A records with an annotation including a list of all nearby B records:
from collections import namedtuple
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute('''SELECT a.id, array_agg(b.id) AS nearby FROM myapp_a a
                      LEFT JOIN myapp_b b ON ST_DWithin(a.position, b.position, 0.1)
                      GROUP BY a.id''')
    nt_result = namedtuple('Result', [col[0] for col in cursor.description])
    results = [nt_result(*row) for row in cursor.fetchall()]
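If you then need actual model instances rather than bare ids, one way (assuming the A and B models above) is to fetch them in bulk afterwards; note that array_agg over a LEFT JOIN produces [None] for records with no nearby match:
a_by_id = A.objects.in_bulk([r.id for r in results])
b_ids = {b_id for r in results for b_id in r.nearby if b_id is not None}
b_by_id = B.objects.in_bulk(b_ids)

nearby = {
    a_by_id[r.id]: [b_by_id[b_id] for b_id in r.nearby if b_id is not None]
    for r in results
}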
References:
Raw queries: https://docs.djangoproject.com/en/2.2/topics/db/sql/#executing-custom-sql-directly
Array aggregation: https://www.postgresql.org/docs/8.4/functions-aggregate.html
ST_DWithin: https://postgis.net/docs/ST_DWithin.html
I have a GIN index set up for full text search. I would like to get a list of records that match a search query, ordered by rank (how well the record matched the search query). For the result, I only need the record and its columns, I do not need the actual rank value that was used for ordering.
I have the following query, which runs fine and returns the expected results from my postgresql db.
SELECT *, ts_rank('{0.1,0.1,0.1,0.1}', users.textsearchable_index_col, to_tsquery('smit:* | ji:*')) AS rank
FROM users
WHERE users.authentication_method != 2
  AND users.textsearchable_index_col @@ to_tsquery('smith:* | ji:*')
ORDER BY rank DESC;
I would like to perform this query using sqlalchemy(SA). I understand that 'ts_rank' does not come ready to use in SA. I have tried a number of things, such as
proxy = self.db_session.query(User, text(
    """ts_rank('{0.1,0.1,0.1,0.1}', users.textsearchable_index_col, to_tsquery(:search_str1)) as rank""")). \
    filter(User.authentication_method != 2,
           text("""users.textsearchable_index_col @@ to_tsquery(:search_str2)""")). \
    params(search_str1=search, search_str2=search). \
    order_by("rank")
I have also read about using column_property, although I'm not sure if/how I would use it in a solution.
I would appreciate a nudge in the right direction.
You can use SQL functions in your queries by using SQLAlchemy's func:
from sqlalchemy.sql.expression import func

rank = func.ts_rank('{0.1,0.1,0.1,0.1}',
                    User.textsearchable_index_col,
                    func.to_tsquery('smit:* | ji:*')).label('rank')

(db.session.query(User, rank)
 .filter(User.authentication_method != 2)
 .filter(User.textsearchable_index_col.op('@@')(func.to_tsquery('smit:* | ji:*')))
 .order_by(rank.desc())
 ).all()
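If the search string comes in at runtime (as with the params() approach in the question), you can pass it straight into func.to_tsquery() and it will be sent as a bound parameter; a small hypothetical wrapper:
from sqlalchemy.sql.expression import func

def search_users(db_session, search_str):
    # search_str ends up as a bound parameter inside to_tsquery()
    tsquery = func.to_tsquery(search_str)
    rank = func.ts_rank('{0.1,0.1,0.1,0.1}',
                        User.textsearchable_index_col, tsquery).label('rank')
    return (db_session.query(User, rank)
            .filter(User.authentication_method != 2)
            .filter(User.textsearchable_index_col.op('@@')(tsquery))
            .order_by(rank.desc())
            .all())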
We have been using Cassandra for a while now and we are trying to get a really optimized table going that will let us quickly query and filter about 100k rows.
Our model looks something like this:
class FailedCDR(Model):
    uuid = columns.UUID(partition_key=True, primary_key=True)
    num_attempts = columns.Integer(index=True)
    datetime = columns.Integer()
If I describe the table, it clearly shows that num_attempts is indexed.
CREATE TABLE cdrs.failed_cdrs (
    uuid uuid PRIMARY KEY,
    datetime int,
    num_attempts int
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX index_failed_cdrs_num_attempts ON cdrs.failed_cdrs (num_attempts);
We want to be able to run a filter similar to this:
failed = FailedCDR.filter(num_attempts__lte=9)
But this happens:
QueryException: Where clauses require either a "=" or "IN" comparison with either a primary key or indexed field
How can we accomplish a similar task?
If you want to do a range query in CQL, you need the field to be a clustering column.
So you'll want the num_attempts field to be a clustering column.
Also if you want to do a single query, you need all the rows you want to query in the same partition (or a small number of partitions that you can access using an IN clause). Since you only have 100K rows, that is small enough to fit in one partition.
So you could define your table like this:
CREATE TABLE test.failed_cdrs (
    partition int,
    num_attempts int,
    uuid uuid,
    datetime int,
    PRIMARY KEY (partition, num_attempts, uuid));
You would insert your data with a constant for the partition key, such as 1.
INSERT INTO failed_cdrs (uuid, datetime, num_attempts, partition)
VALUES ( now(), 123, 5, 1);
Then you can do range queries like this:
SELECT * from failed_cdrs where partition=1 and num_attempts >=8;
The drawback to this method is that to change the value of num_attempts, you need to delete the old row and insert a new row since you are not allowed to update key fields. You could do the delete and insert for that in a batch statement.
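For illustration only, the delete-and-insert could be batched with the DataStax Python driver roughly like this (the contact point, keyspace and row values are assumptions, not from your setup):
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

cluster = Cluster(['127.0.0.1'])   # assumed contact point
session = cluster.connect('test')  # keyspace from the example above

def bump_num_attempts(row_uuid, row_datetime, old_attempts):
    # Delete the row under its old clustering key and re-insert it under the new one.
    batch = BatchStatement()
    batch.add(SimpleStatement(
        "DELETE FROM failed_cdrs "
        "WHERE partition = 1 AND num_attempts = %s AND uuid = %s"),
        (old_attempts, row_uuid))
    batch.add(SimpleStatement(
        "INSERT INTO failed_cdrs (partition, num_attempts, uuid, datetime) "
        "VALUES (1, %s, %s, %s)"),
        (old_attempts + 1, row_uuid, row_datetime))
    session.execute(batch)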
A better option that will become available in Cassandra 3.0 is to make a materialized view that has num_attempts as a clustering column, in which case Cassandra would take care of the delete and insert for you when you updated num_attempts in the base table. The 3.0 release is currently in beta testing.
I want to know if SQLAlchemy has problems querying a view. If I query the view with normal SQL on the server like:
SELECT * FROM ViewMyTable WHERE index1 = '608_56_56';
I get a whole bunch of records. But with SQLAlchemy I get only the first one, even though the count is correct. I have no idea why.
This is my SQLAlchemy code.
myQuery = Session.query(ViewMyTable)
erg = myQuery.filter(ViewMyTable.index1 == index1.strip())
# Contains the correct number of all entries I found with that query.
totalCount = erg.count()
# Contains only the first entry I found with my query.
ergListe = erg.all()
If you've mapped ViewMyTable, the query will only return rows that have a fully non-NULL primary key. This behavior is specific to versions 0.5 and lower; on 0.6, if any of the primary key columns is non-NULL, the row is turned into an instance. Specify the flag allow_null_pks=True on your mappers to ensure that partial primary keys still count:
mapper(ViewMyTable, myview, allow_null_pks=True)
If, on the other hand, the rows returned have all NULLs for the primary key, then SQLAlchemy cannot create an entity, since it can't place it into the identity map. You can instead get at the individual columns by querying for them specifically:
for id, index in session.query(ViewMyTable.id, ViewMyTable.index):
    print(id, index)
I was facing a similar problem: how to filter a view with SQLAlchemy. For the table:
t_v_full_proposals = Table(
    'v_full_proposals', metadata,
    Column('proposal_id', Integer),
    Column('version', String),
    Column('content', String),
    Column('creator_id', String)
)
I'm filtering:
proposals = session.query(t_v_full_proposals).\
    filter(t_v_full_proposals.c.creator_id != 'greatest_admin')
Hopefully it will help:)