How can I speed up hybrid property queries in SQLAlchemy?

Is there a good way to speed up querying hybrid properties in SQLAlchemy that involve relationships? I have the following two tables:
from sqlalchemy import Column, Integer, Boolean, ForeignKey, select, func
from sqlalchemy.orm import relationship
from sqlalchemy.ext.hybrid import hybrid_property

class Child(Base):
    __tablename__ = 'Child'
    id = Column(Integer, primary_key=True)
    is_boy = Column(Boolean, default=False)
    parent_id = Column(Integer, ForeignKey('Parent.id'))

class Parent(Base):
    __tablename__ = 'Parent'
    id = Column(Integer, primary_key=True)
    children = relationship("Child", backref="parent")

    @hybrid_property
    def children_count(self):
        return len(self.children)

    @children_count.expression
    def children_count(cls):
        return (select([func.count(Child.id)])
                .where(Child.parent_id == cls.id)
                .label("children_count"))
When I query Parent.children_count across 50,000 rows (each parent has on average roughly 2 children), it's pretty slow. Is there a good way through indexes or something else for me to speed these queries up?

By default, PostgreSQL doesn't create indexes on foreign keys.
So the first thing I'd do is add an index, which SQLAlchemy makes really easy:
parent_id = Column(Integer, ForeignKey('Parent.id'), index=True)
This will probably give you fast enough retrieval times for a dataset of your current size; try it and see. Be sure to run the query a few times in a row to warm up the PostgreSQL cache.
For a larger dataset, or if the queries still aren't fast enough, you could look into pre-calculating the counts and caching them. There are a number of ways to do this. The easiest hack is to add an extra column to your Parent table and write application logic to increment it whenever a new child is added, though that is a little hacky. Another option is caching the count in Redis/memcached, or even using a materialized view (a great solution if it's okay for the count to occasionally be a few minutes out of date).
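If you go the extra-column route, here is a minimal sketch using ORM attribute events to keep the counter in sync; the child_count column and the listener names are my own illustration, not part of the question's schema:

from sqlalchemy import event

# Hypothetical denormalized column added to Parent:
#   child_count = Column(Integer, default=0, nullable=False)

@event.listens_for(Parent.children, 'append')
def _child_added(parent, child, initiator):
    # Bump the cached count whenever a child is attached.
    parent.child_count = (parent.child_count or 0) + 1

@event.listens_for(Parent.children, 'remove')
def _child_removed(parent, child, initiator):
    # Decrement it whenever a child is detached.
    parent.child_count = (parent.child_count or 1) - 1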

Related

SqlAlchemy doubly linked tables [duplicate]

I'm trying to model the following situation: A program has many versions, and one of the versions is the current one (not necessarily the latest).
This is how I'm doing it now:
class Program(Base):
    __tablename__ = 'programs'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    current_version_id = Column(Integer, ForeignKey('program_versions.id'))
    current_version = relationship('ProgramVersion', foreign_keys=[current_version_id])
    versions = relationship('ProgramVersion', order_by='ProgramVersion.id', back_populates='program')

class ProgramVersion(Base):
    __tablename__ = 'program_versions'
    id = Column(Integer, primary_key=True)
    program_id = Column(Integer, ForeignKey('programs.id'))
    timestamp = Column(DateTime, default=datetime.datetime.utcnow)
    program = relationship('Program', foreign_keys=[program_id], back_populates='versions')
But then I get the error: Could not determine join condition between parent/child tables on relationship Program.versions - there are multiple foreign key paths linking the tables. Specify the 'foreign_keys' argument, providing a list of those columns which should be counted as containing a foreign key reference to the parent table.
But what foreign key should I provide for the 'Program.versions' relationship? Is there a better way to model this situation?
A circular dependency like that is a perfectly valid solution to this problem.
To fix your foreign keys problem, you need to explicitly provide the foreign_keys argument.
class Program(Base):
    ...
    current_version = relationship('ProgramVersion', foreign_keys=current_version_id, ...)
    versions = relationship('ProgramVersion', foreign_keys="ProgramVersion.program_id", ...)

class ProgramVersion(Base):
    ...
    program = relationship('Program', foreign_keys=program_id, ...)
You'll find that when you do a create_all(), SQLAlchemy has trouble creating the tables because each table has a foreign key that depends on a column in the other. SQLAlchemy provides a way to break this circular dependency by using an ALTER statement for one of the tables:
class Program(Base):
    ...
    current_version_id = Column(Integer, ForeignKey('program_versions.id', use_alter=True, name="fk_program_current_version_id"))
    ...
Finally, you'll find that when you add a complete object graph to the session, SQLAlchemy has trouble issuing INSERT statements because each row has a value that depends on the yet-unknown primary key of the other. SQLAlchemy provides a way to break this circular dependency by issuing an UPDATE for one of the columns:
class Program(Base):
    ...
    current_version = relationship('ProgramVersion', foreign_keys=current_version_id, post_update=True, ...)
    ...
This design is not ideal; by having two tables refer to one another, you cannot effectively insert into either table, because the foreign key required in the other will not exist. One possible solution is outlined in the selected answer of
this question related to Microsoft SQL Server, but I will summarize/elaborate on it here.
A better way to model this might be to introduce a third table, VersionHistory, and eliminate your foreign key constraints on the other two tables.
class VersionHistory(Base):
    __tablename__ = 'version_history'
    program_id = Column(Integer, ForeignKey('programs.id'), primary_key=True)
    version_id = Column(Integer, ForeignKey('program_versions.id'), primary_key=True)
    current = Column(Boolean, default=False)
    # I'm not too familiar with SQLAlchemy, but I suspect that relationship
    # information goes here somewhere
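For completeness, a hedged guess at what those relationships might look like inside VersionHistory (untested; the names are mine):

    program = relationship('Program', backref='version_history')
    version = relationship('ProgramVersion')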
This eliminates the circular relationship you have created in your current implementation. You could then query this table by program, and receive all existing versions for the program, etc. Because of the composite primary key in this table, you could access any specific program/version combination. The addition of the current field to this table takes the burden of tracking currency off of the other two tables, although maintaining a single current version per program could require some trigger gymnastics.
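If your database supports partial indexes (PostgreSQL does, for example), one way to avoid triggers is a partial unique index that permits at most one current row per program; a sketch, assuming PostgreSQL:

from sqlalchemy import Index

# At most one VersionHistory row per program may have current=True.
Index('uq_version_history_current',
      VersionHistory.program_id,
      unique=True,
      postgresql_where=(VersionHistory.current == True))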
HTH!

SQLAlchemy Double Inner Join on multiple foreign keys

Please see update at bottom
I have three classes. Let's call them Post, PostVersion, and Tag. (This is for an internal version control system in a web app, perhaps similar to StackOverflow, though I'm unsure of their implementation strategy). I sort of use terminology from git to understand it. These are highly simplified versions of the classes for the purposes of this question:
class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    author_id = db.Column(db.Integer, db.ForeignKey("user.id"))
    author = db.relationship("User", backref="posts")
    head_id = db.Column(db.Integer, db.ForeignKey("post_version.id"))
    HEAD = db.relationship("PostVersion", foreign_keys=[head_id])
    added = db.Column(db.DateTime, default=datetime.utcnow)

class PostVersion(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    editor_id = db.Column(db.Integer, db.ForeignKey("user.id"))
    editor = db.relationship("User")
    previous_id = db.Column(db.Integer, db.ForeignKey("post_version.id"), default=None)
    previous = db.relationship("PostVersion")
    pointer_id = db.Column(db.Integer, db.ForeignKey("post.id"))
    pointer = db.relationship("Post", foreign_keys=[pointer_id])
    post = db.Column(db.Text)
    modified = db.Column(db.DateTime, default=datetime.utcnow)
    tag_1_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_2_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_3_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_4_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_5_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_1 = db.relationship("Tag", foreign_keys=[tag_1_id])
    tag_2 = db.relationship("Tag", foreign_keys=[tag_2_id])
    tag_3 = db.relationship("Tag", foreign_keys=[tag_3_id])
    tag_4 = db.relationship("Tag", foreign_keys=[tag_4_id])
    tag_5 = db.relationship("Tag", foreign_keys=[tag_5_id])

class Tag(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tag = db.Column(db.String(128))
To make a new post, I create both a Post and an initial PostVersion to which Post.head_id points. Every time an edit is made, a new PostVersion is created pointing to the previous PostVersion, and the Post.head_id is reset to point to the new PostVersion. To reset the post version to an earlier version--well, I haven't gotten that far but it seems trivial to either copy the previous version or just reset the pointer to the previous version.
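A hedged sketch of that create/edit flow as I understand it (the helper functions are my own illustration, not from the app):

def create_post(author, text):
    # Create the Post and its initial PostVersion, then point HEAD at it.
    post = Post(author=author)
    db.session.add(post)
    db.session.flush()   # assigns post.id
    version = PostVersion(editor=author, post=text, pointer=post)
    db.session.add(version)
    db.session.flush()   # assigns version.id
    post.HEAD = version
    db.session.commit()
    return post

def edit_post(post, editor, text):
    # Each edit chains a new version to the old HEAD and moves HEAD forward.
    version = PostVersion(editor=editor, post=text, pointer=post,
                          previous_id=post.head_id)
    db.session.add(version)
    post.HEAD = version
    db.session.commit()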
My question is this, though: how can I write a relationship between Post and Tag such that
Post.tags would be a list of all the tags the current PostVersion contains, and
Tag.posts would be a list of all the Posts that currently have that particular tag?
The first condition seems easy enough, a simple method
def get_tags(self):
    t = []
    if self.HEAD.tag_1:
        t.append(self.HEAD.tag_1)
    if self.HEAD.tag_2:
        t.append(self.HEAD.tag_2)
    if self.HEAD.tag_3:
        t.append(self.HEAD.tag_3)
    if self.HEAD.tag_4:
        t.append(self.HEAD.tag_4)
    if self.HEAD.tag_5:
        t.append(self.HEAD.tag_5)
    return t
does the trick just fine for now, but the second condition is almost intractable for me right now. I currently use an obnoxious method in Tag where I query for all the PostVersions with the tag using an or_ filter:
def get_posts(self):
    edits = PostVersion.query.filter(or_(
        PostVersion.tag_1_id == self.id,
        PostVersion.tag_2_id == self.id,
        PostVersion.tag_3_id == self.id,
        PostVersion.tag_4_id == self.id,
        PostVersion.tag_5_id == self.id,
    )).order_by(PostVersion.modified.desc()).all()
    posts = []
    for e in edits:
        if self in e.pointer.get_tags() and e.pointer not in posts:
            posts.append(e.pointer)
    return posts
This is horribly inefficient and I cannot paginate the results.
I know this would be a secondary join from Post to Tag or Tag to Post through PostVersion, but it would have to be a secondary join on an or, and I have no clue how to even start to write that.
Looking back on my code, I'm beginning to wonder why some of these relationships require the foreign_keys parameter while others don't. I suspect it relates to where they're defined (immediately following the FK id column or not), and since foreign_keys takes a list, I'm guessing that's how I could define it. But I'm unsure how to pursue this.
I'm also wondering now if I could dispense with the pointer_id on PostVersion with a well-configured relationship. This, however, is irrelevant to the question (though the circular reference does cause headaches).
For reference, I am using Flask-SQLAlchemy, Flask-migrate, and MariaDB. I am heavily following Miguel Grinberg's Flask Megatutorial.
Any help or advice would be a godsend.
UPDATE
I have devised the following MySQL query that works, and now I need to translate it into SQLAlchemy:

SELECT post.id, tag.tag
FROM post
INNER JOIN post_version ON post.head_id = post_version.id
INNER JOIN tag ON post_version.tag_1_id = tag.id
    OR post_version.tag_2_id = tag.id
    OR post_version.tag_3_id = tag.id
    OR post_version.tag_4_id = tag.id
    OR post_version.tag_5_id = tag.id
WHERE tag.tag = "<tag name>";
Can you change the database design, or do you have to make your app work on a DB that you can't change? If the latter, I can't help you. If you can change the design, you should do it like this:
Replace the linked chain of PostVersions with a one-to-many relationship from Post to PostVersions. Your "Post" class will end up having a relationship "versions" to all instances of PostVersion pertinent to that Post.
Replace the tag_id members with a many-to-many relationship using an additional association table.
Both methods are well-explained in the SQLAlchemy docs. Be sure to start with minimal code, testing in small non-Flask command line programs. Once you have the basic functionality down, transfer the concept to your more complicated classes. After that, ask yourself your original questions again. The answers will come much more easily.
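For illustration, a minimal sketch of the second suggestion, assuming an association table named post_version_tag (the table and attribute names are mine, not from the app):

# Association table replacing the five tag_N_id columns.
post_version_tag = db.Table(
    'post_version_tag',
    db.Column('post_version_id', db.Integer,
              db.ForeignKey('post_version.id'), primary_key=True),
    db.Column('tag_id', db.Integer,
              db.ForeignKey('tag.id'), primary_key=True),
)

# On PostVersion:
#   tags = db.relationship('Tag', secondary=post_version_tag,
#                          backref='post_versions')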
I solved the problem on my own; it really just consists of defining a primaryjoin and a secondaryjoin, with an or_ in the primaryjoin:
posts = db.relationship("Post", secondary="post_version",
                        primaryjoin="or_(Tag.id==post_version.c.tag_1_id,"
                                    "Tag.id==post_version.c.tag_2_id,"
                                    "Tag.id==post_version.c.tag_3_id,"
                                    "Tag.id==post_version.c.tag_4_id,"
                                    "Tag.id==post_version.c.tag_5_id)",
                        secondaryjoin="Post.head_id==post_version.c.id",
                        lazy="dynamic")
As you can see I mix table and class names. I will update the answer as I experiment to make it more regular.

flask sqlalchemy UniqueConstraint with foreignkey attribute

I have an app I am building with Flask that contains models for Projects and Plates, where Plates have Project as a foreign key.
Each project has a year, given as an integer (so 17 for 2017); and each plate has a number and a name, constructed from the plate.project.year and plate.number. For example, Plate 106 from a project done this year would have the name '17-0106'. I would like this name to be unique.
Here are my models:
class Project(Model):
    __tablename__ = 'projects'
    id = Column(Integer, primary_key=True)
    name = Column(String(64), unique=True)
    year = Column(Integer, default=datetime.now().year - 2000)

class Plate(Model):
    __tablename__ = 'plates'
    id = Column(Integer, primary_key=True)
    number = Column(Integer)
    project_id = Column(Integer, ForeignKey('projects.id'))
    project = relationship('Project', backref=backref('plates', cascade='all, delete-orphan'))

    @property
    def name(self):
        return str(self.project.year) + '-' + str(self.number).zfill(4)
My first idea was to make the number unique amongst the plates that have the same project.year attribute, so I have tried variations on
__table_args__ = (UniqueConstraint('project.year', 'number', name='_year_number_uc'),), but this needs to access the other table.
Is there a way to do this in the database? Or, failing that, an __init__ method that checks for uniqueness of either the number/project.year combination, or the name property?
There are multiple solutions to your problem. For example, you can de-normalize the project.year + number combination and store it as a separate Plate field, then put a unique key on it. The question is how you're going to maintain that value. The two obvious options are triggers (assuming your DB supports them and you're OK with using them) or SQLAlchemy events, see http://docs.sqlalchemy.org/en/latest/orm/events.html#
Neither solution emits an extra SELECT query, which I believe is important for you.
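A minimal sketch of the events approach, assuming a hypothetical denormalized name column on Plate and that plate.project is already populated in memory at flush time (the column and listener names are mine):

from sqlalchemy import event

# Hypothetical denormalized column on Plate:
#   name = Column(String(16), unique=True)

@event.listens_for(Plate, 'before_insert')
@event.listens_for(Plate, 'before_update')
def _set_plate_name(mapper, connection, plate):
    # Rebuild the '17-0106'-style name from the parent project's year.
    plate.name = str(plate.project.year) + '-' + str(plate.number).zfill(4)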
Your question is somewhat similar to Can SQLAlchemy events be used to update a denormalized data cache?

How to order relationship objects on query execution

The following code is for Flask-SQLAlchemy, but would be quite similar in SQLAlchemy.
I have two simple classes:
class Thread(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    subject = db.Column(db.String)
    messages = db.relationship('Message', backref='thread', lazy='dynamic')

class Message(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    thread_id = db.Column(db.Integer, db.ForeignKey('thread.id'))
    created = db.Column(db.DateTime, default=datetime.utcnow)
    text = db.Column(db.String, nullable=False)
I would like to query all Threads and have them ordered by last message created. This is simple:
threads = Thread.query.join(Message).order_by(Message.created.desc()).all()
threads is now a correctly ordered list I can iterate over. However, if I iterate over
threads[0].messages, the Message objects are not ordered by Message.created descending.
I can solve this issue while declaring the relationship:
messages = db.relationship('Message', backref='thread', lazy='dynamic',
                           order_by='Message.created.desc()')
However, this is something I'd rather not do; I want to set the ordering explicitly when declaring my query.
I could also call:
threads[0].messages.reverse()
...but this is quite inconvenient in a Jinja template.
Is there a good solution for setting order_by for joined model?
You have Thread.messages marked as lazy='dynamic'. This means that after querying for threads, messages is a query object, not a list yet. So iterate over threads[0].messages.order_by(Message.created.desc()).
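Putting it together, a short usage sketch based on the snippets above:

# Threads ordered by their most recent message...
threads = Thread.query.join(Message).order_by(Message.created.desc()).all()

# ...and each thread's dynamic 'messages' query ordered on demand.
for message in threads[0].messages.order_by(Message.created.desc()):
    print(message.text)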

How to do a paged query with empty collections in SQLAlchemy?

I'm trying to find a memory efficient way to do a paged query to test for an empty collection, but can't seem to figure out how to go about it efficiently on a large database. The table layout uses an Association Object with bi-directional backrefs. It is very similar to the documentation.
class Association(Base):
    __tablename__ = 'Association'
    assoc_id = Column(Integer, primary_key=True, nullable=False, unique=True)
    member_id = Column(Integer, ForeignKey('Member.id'))
    chunk_id = Column(Integer, ForeignKey('Chunk.id'))
    extra = Column(Text)
    chunk = relationship("Chunk", backref=backref("assoc", lazy="dynamic"))

class Member(Base):
    __tablename__ = 'Member'
    id = Column(Integer, primary_key=True, nullable=False, unique=True)
    assocs = relationship("Association", backref="member", cascade="all, delete", lazy="dynamic")

class Chunk(Base):
    __tablename__ = 'Chunk'
    id = Column(Integer, primary_key=True, nullable=False, unique=True)
    name = Column(Text, unique=True)
If the member is deleted, it will cascade and delete the member's associations. However, the chunk objects will be orphaned in the database. To delete the orphaned chunks, I can test for an empty collection using a query like this:
session.query(Chunk).filter(~Chunk.assoc.any())
and then delete the chunks with:
query.delete(synchronize_session=False)
However, if the association and chunk tables are large, it seems the query or subquery loads everything and memory usage skyrockets.
I've seen the concept of using a paged query to limit the memory usage of standard queries here:
def page_query(q, count=1000):
    offset = 0
    while True:
        r = False
        for elem in q.limit(count).offset(offset):
            r = True
            yield elem
        offset += count
        if not r:
            break

for chunk in page_query(Session.query(Chunk)):
    print(chunk.name)
However this doesn't appear to work with the empty collection query as the memory usage is still high. Is there a way to do a paged query for an empty collection like this?
I figured out a couple of things were missing here. The query for the empty chunks appears to be mostly OK. The memory usage spike I was seeing was from a query a few lines earlier in the code when the actual member itself was deleted.
member = session.query(Member).filter(Member.name == membername).one()
session.delete(member)
According to the documentation, the session (by default) can only delete objects that are loaded into the session / memory. When the member is deleted, it will load all of its associations in order to delete them per the cascade rules. What needed to happen was to bypass the association loading by using passive deletes.
I added:
passive_deletes=True
to the association relationship of the Member class and:
ondelete='CASCADE'
to the member_id foreign key of the Association class. I'm using SQLite3 and added foreign key support with an engine connect event per the docs.
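For reference, here is a sketch of those changes together (the model lines mirror the code above; the PRAGMA listener follows the pattern in the SQLAlchemy docs):

from sqlalchemy import event
from sqlalchemy.engine import Engine

# On Member: don't load associations into the session just to delete them.
assocs = relationship("Association", backref="member",
                      cascade="all, delete", passive_deletes=True,
                      lazy="dynamic")

# On Association: let the database cascade the delete instead.
member_id = Column(Integer, ForeignKey('Member.id', ondelete='CASCADE'))

@event.listens_for(Engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    # SQLite only enforces foreign keys when this PRAGMA is on.
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.close()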
As for the orphaned chunks, instead of doing a bulk delete with the query.delete method, I used a page query that doesn't include an offset and deleted the chunks from the session in a loop, as shown below. So far I haven't seen any memory spikes:
def page_query(q):
    while True:
        r = False
        for elem in q.limit(1000):
            r = True
            yield elem
        if not r:
            break

for chunk in page_query(query):
    # Do something with the chunk if needed
    session.delete(chunk)
session.commit()
To make a long story short, it helped greatly to use passive_deletes=True when deleting a parent object that has a large collection. The page query also appears to work well in this situation, except that I had to take out the offset since the chunks were being removed from the session inline.
