I'm trying to find a memory-efficient way to do a paged query that tests for an empty collection, but I can't figure out how to go about it efficiently on a large database. The table layout uses an Association Object with bidirectional backrefs, and is very similar to the one in the documentation.
class Association(Base):
    __tablename__ = 'Association'

    assoc_id = Column(Integer, primary_key=True, nullable=False, unique=True)
    member_id = Column(Integer, ForeignKey('Member.id'))
    chunk_id = Column(Integer, ForeignKey('Chunk.id'))
    extra = Column(Text)
    chunk = relationship("Chunk", backref=backref("assoc", lazy="dynamic"))

class Member(Base):
    __tablename__ = 'Member'

    id = Column(Integer, primary_key=True, nullable=False, unique=True)
    assocs = relationship("Association", backref="member",
                          cascade="all, delete", lazy="dynamic")

class Chunk(Base):
    __tablename__ = 'Chunk'

    id = Column(Integer, primary_key=True, nullable=False, unique=True)
    name = Column(Text, unique=True)
If the member is deleted, it will cascade and delete the member's associations. However, the chunk objects will be orphaned in the database. To delete the orphaned chunks, I can test for an empty collection using a query like this:
session.query(Chunk).filter(~Chunk.assoc.any())
and then delete the chunks with:
query.delete(synchronize_session=False)
However, if the association and chunk tables are large, the query or subquery seems to load everything at once and memory usage skyrockets.
I've seen the concept of using a paged query to limit the memory usage of standard queries here:
def page_query(q, count=1000):
    offset = 0
    while True:
        r = False
        for elem in q.limit(count).offset(offset):
            r = True
            yield elem
        offset += count
        if not r:
            break

for chunk in page_query(Session.query(Chunk)):
    print(chunk.name)
However, this doesn't appear to work with the empty-collection query, as memory usage is still high. Is there a way to do a paged query for an empty collection like this?
I figured out a couple of things were missing here. The query for the empty chunks appears to be mostly OK. The memory usage spike I was seeing was from a query a few lines earlier in the code when the actual member itself was deleted.
member = session.query(Member).filter(Member.name == membername).one()
session.delete(member)
According to the documentation, the session (by default) can only delete objects that are loaded into the session / memory. When the member is deleted, it will load all of its associations in order to delete them per the cascade rules. What needed to happen was to bypass the association loading by using passive deletes.
I added:
passive_deletes=True
to the association relationship of the Member class and:
ondelete='CASCADE'
to the member_id foreign key of the Association class. I'm using SQLite3 and added foreign key support with an engine connect event per the docs.
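Concretely, the changes look roughly like this (a sketch; the event listener is the recipe from the SQLAlchemy docs for enabling foreign key enforcement on SQLite):

class Member(Base):
    __tablename__ = 'Member'

    id = Column(Integer, primary_key=True, nullable=False, unique=True)
    # passive_deletes=True keeps SQLAlchemy from loading every Association
    # into the session just to delete it; the database cascade does the work.
    assocs = relationship("Association", backref="member",
                          cascade="all, delete", passive_deletes=True,
                          lazy="dynamic")

class Association(Base):
    __tablename__ = 'Association'

    assoc_id = Column(Integer, primary_key=True, nullable=False, unique=True)
    # ondelete='CASCADE' lets the database remove the rows itself.
    member_id = Column(Integer, ForeignKey('Member.id', ondelete='CASCADE'))
    chunk_id = Column(Integer, ForeignKey('Chunk.id'))
    extra = Column(Text)
    chunk = relationship("Chunk", backref=backref("assoc", lazy="dynamic"))

from sqlalchemy import event
from sqlalchemy.engine import Engine

# SQLite needs foreign key enforcement switched on per connection:
@event.listens_for(Engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.close()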
In regard to the orphaned chunks: instead of doing a bulk delete with the query.delete method, I used a page query that doesn't include an offset and deleted the chunks from the session in a loop, as shown below. So far I haven't seen any memory spikes:
def page_query(q):
    while True:
        r = False
        for elem in q.limit(1000):
            r = True
            yield elem
        if not r:
            break

for chunk in page_query(query):
    # Do something with the chunk if needed
    session.delete(chunk)
session.commit()
To make a long story short, it helped greatly to use passive_deletes=True when deleting a parent object that has a large collection. The page query also works well in this situation, except that I had to take out the offset since the chunks were being removed from the session inline.
I'm trying to model the following situation: A program has many versions, and one of the versions is the current one (not necessarily the latest).
This is how I'm doing it now:
class Program(Base):
    __tablename__ = 'programs'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    current_version_id = Column(Integer, ForeignKey('program_versions.id'))
    current_version = relationship('ProgramVersion', foreign_keys=[current_version_id])
    versions = relationship('ProgramVersion', order_by='ProgramVersion.id',
                            back_populates='program')

class ProgramVersion(Base):
    __tablename__ = 'program_versions'

    id = Column(Integer, primary_key=True)
    program_id = Column(Integer, ForeignKey('programs.id'))
    timestamp = Column(DateTime, default=datetime.datetime.utcnow)
    program = relationship('Program', foreign_keys=[program_id], back_populates='versions')
But then I get the error: Could not determine join condition between parent/child tables on relationship Program.versions - there are multiple foreign key paths linking the tables. Specify the 'foreign_keys' argument, providing a list of those columns which should be counted as containing a foreign key reference to the parent table.
But what foreign key should I provide for the 'Program.versions' relationship? Is there a better way to model this situation?
A circular dependency like that is a perfectly valid solution to this problem.
To fix your foreign keys problem, you need to explicitly provide the foreign_keys argument.
class Program(Base):
    ...
    current_version = relationship('ProgramVersion', foreign_keys=current_version_id, ...)
    versions = relationship('ProgramVersion', foreign_keys="ProgramVersion.program_id", ...)

class ProgramVersion(Base):
    ...
    program = relationship('Program', foreign_keys=program_id, ...)
You'll find that when you do a create_all(), SQLAlchemy has trouble creating the tables because each table has a foreign key that depends on a column in the other. SQLAlchemy provides a way to break this circular dependency by using an ALTER statement for one of the tables:
class Program(Base):
    ...
    current_version_id = Column(Integer,
                                ForeignKey('program_versions.id', use_alter=True,
                                           name="fk_program_current_version_id"))
    ...
Finally, you'll find that when you add a complete object graph to the session, SQLAlchemy has trouble issuing INSERT statements because each row has a value that depends on the yet-unknown primary key of the other. SQLAlchemy provides a way to break this circular dependency by issuing an UPDATE for one of the columns:
class Program(Base):
    ...
    current_version = relationship('ProgramVersion', foreign_keys=current_version_id,
                                   post_update=True, ...)
    ...
This design is not ideal; by having two tables refer to one another, you cannot effectively insert into either table, because the foreign key required by the other will not exist. One possible solution is outlined in the selected answer of this question related to Microsoft SQL Server, but I will summarize/elaborate on it here.
A better way to model this might be to introduce a third table, VersionHistory, and eliminate your foreign key constraints on the other two tables.
class VersionHistory(Base):
    __tablename__ = 'version_history'

    program_id = Column(Integer, ForeignKey('programs.id'), primary_key=True)
    version_id = Column(Integer, ForeignKey('program_versions.id'), primary_key=True)
    current = Column(Boolean, default=False)
    # I'm not too familiar with SQLAlchemy, but I suspect that relationship
    # information goes here somewhere
This eliminates the circular relationship you have created in your current implementation. You could then query this table by program, and receive all existing versions for the program, etc. Because of the composite primary key in this table, you could access any specific program/version combination. The addition of the current field to this table takes the burden of tracking currency off of the other two tables, although maintaining a single current version per program could require some trigger gymnastics.
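To fill in the relationship piece the comment above gestures at, a hedged sketch (reusing the class and table names from the question, untested) might be:

class VersionHistory(Base):
    __tablename__ = 'version_history'

    program_id = Column(Integer, ForeignKey('programs.id'), primary_key=True)
    version_id = Column(Integer, ForeignKey('program_versions.id'), primary_key=True)
    current = Column(Boolean, default=False)

    # Object access from either side of the association:
    program = relationship('Program', backref='version_links')
    version = relationship('ProgramVersion', backref='version_links')

Querying for the current version of a given program would then look something like:

current = (session.query(ProgramVersion)
           .join(VersionHistory, VersionHistory.version_id == ProgramVersion.id)
           .filter(VersionHistory.program_id == program.id,
                   VersionHistory.current == True)
           .one_or_none())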
HTH!
Please see update at bottom
I have three classes. Let's call them Post, PostVersion, and Tag. (This is for an internal version control system in a web app, perhaps similar to StackOverflow, though I'm unsure of their implementation strategy). I sort of use terminology from git to understand it. These are highly simplified versions of the classes for the purposes of this question:
class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    author_id = db.Column(db.Integer, db.ForeignKey("user.id"))
    author = db.relationship("User", backref="posts")
    head_id = db.Column(db.Integer, db.ForeignKey("post_version.id"))
    HEAD = db.relationship("PostVersion", foreign_keys=[head_id])
    added = db.Column(db.DateTime, default=datetime.utcnow)

class PostVersion(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    editor_id = db.Column(db.Integer, db.ForeignKey("user.id"))
    editor = db.relationship("User")
    previous_id = db.Column(db.Integer, db.ForeignKey("post_version.id"), default=None)
    previous = db.relationship("PostVersion")
    pointer_id = db.Column(db.Integer, db.ForeignKey("post.id"))
    pointer = db.relationship("Post", foreign_keys=[pointer_id])
    post = db.Column(db.Text)
    modified = db.Column(db.DateTime, default=datetime.utcnow)
    tag_1_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_2_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_3_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_4_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_5_id = db.Column(db.Integer, db.ForeignKey("tag.id"), default=None)
    tag_1 = db.relationship("Tag", foreign_keys=[tag_1_id])
    tag_2 = db.relationship("Tag", foreign_keys=[tag_2_id])
    tag_3 = db.relationship("Tag", foreign_keys=[tag_3_id])
    tag_4 = db.relationship("Tag", foreign_keys=[tag_4_id])
    tag_5 = db.relationship("Tag", foreign_keys=[tag_5_id])

class Tag(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tag = db.Column(db.String(128))
To make a new post, I create both a Post and an initial PostVersion to which Post.head_id points. Every time an edit is made, a new PostVersion is created pointing to the previous PostVersion, and the Post.head_id is reset to point to the new PostVersion. To reset the post version to an earlier version--well, I haven't gotten that far but it seems trivial to either copy the previous version or just reset the pointer to the previous version.
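In code, that workflow might look roughly like this (a sketch based on the description above; the helper functions are made up):

def create_post(author, text):
    # A new Post plus its initial PostVersion, with HEAD pointing at it.
    post = Post(author=author)
    first = PostVersion(editor=author, post=text, pointer=post)
    db.session.add_all([post, first])
    db.session.flush()  # assign primary keys before wiring up HEAD
    post.HEAD = first
    db.session.commit()
    return post

def edit_post(post, editor, new_text):
    # Each edit chains a new version onto the old HEAD, then moves HEAD.
    new_version = PostVersion(editor=editor, post=new_text,
                              previous=post.HEAD, pointer=post)
    db.session.add(new_version)
    post.HEAD = new_version
    db.session.commit()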
My question is this, though: how can I write a relationship between Post and Tag such that
Post.tags would be a list of all the tags the current PostVersion contains, and
Tag.posts would be a list of all the Posts that currently have that particular tag?
The first condition seems easy enough, a simple method
def get_tags(self):
    t = []
    if self.HEAD.tag_1:
        t.append(self.HEAD.tag_1)
    if self.HEAD.tag_2:
        t.append(self.HEAD.tag_2)
    if self.HEAD.tag_3:
        t.append(self.HEAD.tag_3)
    if self.HEAD.tag_4:
        t.append(self.HEAD.tag_4)
    if self.HEAD.tag_5:
        t.append(self.HEAD.tag_5)
    return t
does the trick just fine for now, but the second condition is almost intractable for me right now. I currently use an obnoxious method in Tag where I query for all the PostVersions with the tag using an or_ filter:
def get_posts(self):
    edits = PostVersion.query.filter(or_(
        PostVersion.tag_1_id == self.id,
        PostVersion.tag_2_id == self.id,
        PostVersion.tag_3_id == self.id,
        PostVersion.tag_4_id == self.id,
        PostVersion.tag_5_id == self.id,
    )).order_by(PostVersion.modified.desc()).all()
    posts = []
    for e in edits:
        if self in e.pointer.get_tags() and e.pointer not in posts:
            posts.append(e.pointer)
    return posts
This is horribly inefficient and I cannot paginate the results.
I know this would be a secondary join from Post to Tag or Tag to Post through PostVersion, but it would have to be a secondary join on an or, and I have no clue how to even start to write that.
Looking back on my code, I'm beginning to wonder why some of these relationships require the foreign_keys parameter and others don't. I think it relates to where they're defined (immediately following the FK id column or not), and, noticing that foreign_keys takes a list, I suspect that's how I could define it. But I'm unsure how to pursue this.
I'm also wondering now if I could dispense with the pointer_id on PostVersion with a well-configured relationship. This, however, is irrelevant to the question (though the circular reference does cause headaches).
For reference, I am using Flask-SQLAlchemy, Flask-migrate, and MariaDB. I am heavily following Miguel Grinberg's Flask Megatutorial.
Any help or advice would be a godsend.
UPDATE
I have devised the following MySQL query that works, and now I need to translate it into SQLAlchemy:
SELECT
    post.id, tag.tag
FROM
    post
INNER JOIN
    post_version ON post.head_id = post_version.id
INNER JOIN
    tag ON post_version.tag_1_id = tag.id OR
           post_version.tag_2_id = tag.id OR
           post_version.tag_3_id = tag.id OR
           post_version.tag_4_id = tag.id OR
           post_version.tag_5_id = tag.id
WHERE
    tag.tag = "<tag name>";
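A fairly direct translation of that SQL into a SQLAlchemy query (a hedged, untested sketch) would be something like:

from sqlalchemy import or_

results = (db.session.query(Post.id, Tag.tag)
           .join(PostVersion, Post.head_id == PostVersion.id)
           .join(Tag, or_(PostVersion.tag_1_id == Tag.id,
                          PostVersion.tag_2_id == Tag.id,
                          PostVersion.tag_3_id == Tag.id,
                          PostVersion.tag_4_id == Tag.id,
                          PostVersion.tag_5_id == Tag.id))
           .filter(Tag.tag == "<tag name>")
           .all())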
Can you change the database design, or do you have to make your app work on a DB that you can't change? If the latter, I can't help you. If you can change the design, you should do it like this:
Replace the linked chain of PostVersions with a one-to-many relationship from Post to PostVersions. Your "Post" class will end up having a relationship "versions" to all instances of PostVersion pertinent to that Post.
Replace the tag_id members with a many-to-many relationship using an additional association table.
Both methods are well-explained in the SQLAlchemy docs. Be sure to start with minimal code, testing in small non-Flask command line programs. Once you have the basic functionality down, transfer the concept to your more complicated classes. After that, ask yourself your original questions again. The answers will come much more easily.
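A minimal sketch of that redesign (illustrative names, untested) might be:

from datetime import datetime

# Association table for the many-to-many between PostVersion and Tag.
post_version_tag = db.Table(
    'post_version_tag',
    db.Column('post_version_id', db.Integer,
              db.ForeignKey('post_version.id'), primary_key=True),
    db.Column('tag_id', db.Integer, db.ForeignKey('tag.id'), primary_key=True))

class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    head_id = db.Column(db.Integer, db.ForeignKey('post_version.id'))
    # One-to-many: every version of this post, replacing the linked chain.
    versions = db.relationship('PostVersion', foreign_keys='PostVersion.post_id',
                               backref='parent_post')

class PostVersion(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    post_id = db.Column(db.Integer, db.ForeignKey('post.id'))
    text = db.Column(db.Text)
    modified = db.Column(db.DateTime, default=datetime.utcnow)
    # Many-to-many: replaces tag_1_id .. tag_5_id.
    tags = db.relationship('Tag', secondary=post_version_tag, backref='versions')

class Tag(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tag = db.Column(db.String(128))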
I solved the problem on my own, and it really just consists of defining a primary and secondary join with an or_ in the primary:
posts = db.relationship("Post", secondary="post_version",
                        primaryjoin="or_(Tag.id==post_version.c.tag_1_id,"
                                    "Tag.id==post_version.c.tag_2_id,"
                                    "Tag.id==post_version.c.tag_3_id,"
                                    "Tag.id==post_version.c.tag_4_id,"
                                    "Tag.id==post_version.c.tag_5_id)",
                        secondaryjoin="Post.head_id==post_version.c.id",
                        lazy="dynamic")
As you can see I mix table and class names. I will update the answer as I experiment to make it more regular.
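One payoff of lazy="dynamic" here is that Tag.posts is a query object rather than a list, so the pagination the question asked about becomes possible (Flask-SQLAlchemy shown; a small sketch):

# tag.posts is a Query thanks to lazy="dynamic", so it can be paginated:
page = tag.posts.paginate(page=1, per_page=20)
for post in page.items:
    print(post.id)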
I'm writing a simple bookmark manager program that uses SQLAlchemy for data storage. I have database objects for Bookmarks and Tags, and there is a many-to-many relationship between them: a bookmark can use any tags in the database, and each tag can be assigned to any (or even all) bookmarks in the database. Tags are automatically created and removed by the program – if the number of bookmarks referencing a tag drops to zero, the tag should be deleted.
Here's my model code, with unnecessary methods such as __str__() removed:
mark_tag_assoc = Table('mark_tag_assoc', Base.metadata,
                       Column('mark_id', Integer, ForeignKey('bookmarks.id')),
                       Column('tag_id', Integer, ForeignKey('tags.id')),
                       PrimaryKeyConstraint('mark_id', 'tag_id'))

class Bookmark(Base):
    __tablename__ = 'bookmarks'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    url = Column(String)
    description = Column(String)
    tags_rel = relationship("Tag", secondary=mark_tag_assoc,
                            backref="bookmarks", cascade="all, delete")

class Tag(Base):
    __tablename__ = 'tags'

    id = Column(Integer, primary_key=True)
    text = Column(String)
I thought that if I set up a cascade (cascade="all, delete") it would take care of removing tags with no more references for me. What actually happens is that when any bookmark is deleted, all tags referenced by it are automatically removed from all other bookmarks and deleted, which is obviously not the intended behavior.
Is there a simple option to do what I want, or if not, what would be the cleanest way to implement it myself? Although I have a little bit of experience with simple SQL, this is my first time using SQLAlchemy, so details would be appreciated.
I'd still be interested to know if there happens to be a built-in function for this, but after further research it seems more likely to me that there is not, as there don't generally seem to be too many helpful functions for doing complicated stuff with many-to-many relationships. Here's how I solved it:
Remove cascade="all, delete" from the relationship so that no cascades are performed. Even with no cascades configured, SQLAlchemy will still remove rows from the association table when bookmarks are deleted.
Call a function after each delete of a Bookmark to check if the tag still has any relationships, and delete the tag if not:
def maybeExpungeTag(self, tag):
    """
    Delete /tag/ from the tags table if it is no longer referenced by
    any bookmarks.

    Return:
        True if the tag was deleted.
        False if the tag is still referenced and was not deleted.
    """
    if not len(tag.bookmarks):
        self.session.delete(tag)
        return True
    else:
        return False

# and for the actual delete...
mark = ...  # get the bookmark being deleted
tags = mark.tags_rel
self.session.delete(mark)
for tag in tags:
    self.maybeExpungeTag(tag)
self.session.commit()
Is there a good way to speed up querying hybrid properties in SQLAlchemy that involve relationships? I have the following two tables:
class Child(Base):
    __tablename__ = 'Child'

    id = Column(Integer, primary_key=True)
    is_boy = Column(Boolean, default=False)
    parent_id = Column(Integer, ForeignKey('Parent.id'))

class Parent(Base):
    __tablename__ = 'Parent'

    id = Column(Integer, primary_key=True)
    children = relationship("Child", backref="parent")

    @hybrid_property
    def children_count(self):
        return len(self.children)

    @children_count.expression
    def children_count(cls):
        return (select([func.count(Child.id)])
                .where(Child.parent_id == cls.id)
                .label("children_count"))
When I query Parent.children_count across 50,000 rows (each parent has on average roughly 2 children), it's pretty slow. Is there a good way through indexes or something else for me to speed these queries up?
By default, PostgreSQL doesn't create indexes on foreign keys.
So the first thing I'd do is add an index, which SQLAlchemy makes really easy:
parent_id = Column(Integer, ForeignKey('Parent.id'), index=True)
This will probably result in a fast enough retrieval time given the size of your current dataset--try it and see. Be sure to try the query a few times in a row to warm up the PostgreSQL cache.
For a larger dataset, or if the queries still aren't fast enough, you could look into pre-calculating the counts and caching them. There are a number of ways to cache; the easiest hack is probably to add an extra column to your Parent table and make sure your app logic increments the count whenever a new child is added. It's a little hacky that way. Other options are caching the count in Redis/memcached, or even using a materialized view (a great solution if it's okay for the count to occasionally be a few minutes out of date).
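A sketch of that counter-cache idea (the column name and helper function are made up for illustration):

class Parent(Base):
    __tablename__ = 'Parent'

    id = Column(Integer, primary_key=True)
    # Denormalized child count, maintained by application logic.
    children_count_cached = Column(Integer, default=0, nullable=False)
    children = relationship("Child", backref="parent")

def add_child(session, parent):
    # Keep the cached count in step with the real relationship.
    child = Child(parent=parent)
    parent.children_count_cached += 1
    session.add(child)
    return child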
The following code is for Flask-SQLAlchemy, but would be quite similar in SQLAlchemy.
I have two simple classes:
class Thread(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    subject = db.Column(db.String)
    messages = db.relationship('Message', backref='thread', lazy='dynamic')

class Message(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    thread_id = db.Column(db.Integer, db.ForeignKey('thread.id'))
    created = db.Column(db.DateTime, default=datetime.utcnow)
    text = db.Column(db.String, nullable=False)
I would like to query all Threads and have them ordered by last message created. This is simple:
threads = Thread.query.join(Message).order_by(Message.created.desc()).all()
Threads is now a correctly ordered list I can iterate over. However, if I iterate over threads[0].messages, the Message objects are not ordered by Message.created descending.
I can solve this issue while declaring the relationship:
messages = relationship('Message', backref='thread', lazy='dynamic',
                        order_by='Message.created.desc()')
However, this is something I'd rather not do. I want to set this explicitly while declaring my query.
I could also call:
threads[0].messages.reverse()
...but this is quite inconvenient in a Jinja template.
Is there a good solution for setting order_by for joined model?
You have Thread.messages marked as lazy='dynamic'. This means that after querying for threads, messages is a query object, not a list yet. So iterate over threads[0].messages.order_by(Message.created.desc()).
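For example (a small sketch):

threads = Thread.query.join(Message).order_by(Message.created.desc()).all()

# Thread.messages is a Query object, so ordering can be applied per thread:
for message in threads[0].messages.order_by(Message.created.desc()):
    print(message.text)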