Flask-SQLAlchemy slow insert due to multiple foreign keys - python

So I am currently trying to build a new app using flask-restful for the backend, and I am still learning since all of this is quite new to me. I have already set everything up with several different MySQL tables (detailed below) and all the relationships between them, and everything seems to be working fine, except that inserting new data is quite slow.
Here is a (simplified) explanation of the current setup I have. Basically, I first of all have one Flights table.
class FlightModel(db.Model):
    __tablename__ = "Flights"
    flightid = db.Column(db.Integer, primary_key=True, nullable=False)
    [Other properties]
    reviewid = db.Column(db.Integer, db.ForeignKey('Reviews.reviewid'), index=True, nullable=False)
    review = db.relationship("ReviewModel", back_populates="flight", lazy='joined')
This table then points to a Reviews table, in which I store the global review left by the user.
class ReviewModel(db.Model):
    __tablename__ = "Reviews"
    reviewid = db.Column(db.Integer, primary_key=True, nullable=False)
    [Other properties]
    depAirportReviewid = db.Column(db.Integer, db.ForeignKey('DepAirportReviews.reviewid'), index=True, nullable=False)
    arrAirportReviewid = db.Column(db.Integer, db.ForeignKey('ArrAirportReviews.reviewid'), index=True, nullable=False)
    airlineReviewid = db.Column(db.Integer, db.ForeignKey('AirlineReviews.reviewid'), index=True, nullable=False)
    flight = db.relationship("FlightModel", uselist=False, back_populates="review", lazy='joined')
    depAirportReview = db.relationship("DepAirportReviewModel", back_populates="review", lazy='joined')
    arrAirportReview = db.relationship("ArrAirportReviewModel", back_populates="review", lazy='joined')
    airlineReview = db.relationship("AirlineReviewModel", back_populates="review", lazy='joined')
Then, a more detailed review can be left regarding different aspects of the flights, stored in yet another table (for example, in the following DepAirportReviews table: there are three tables in total at this level).
class DepAirportReviewModel(db.Model):
    __tablename__ = "DepAirportReviews"
    reviewid = db.Column(db.Integer, primary_key=True, nullable=False)
    [Other properties]
    review = db.relationship("ReviewModel", uselist=False, back_populates="depAirportReview", lazy='joined')
The insert process is slow (it typically takes 1 second per flight, which is a problem when I try to bulk insert a few hundred of them).
I understand this is because of all these relationships and all the round trips to the database they imply in order to retrieve the different ids for the different tables. Is that correct? Is there anything I could do to solve this, or will I need to redesign the tables to remove some of these relationships?
Thanks for any help!
EDIT: displaying the SQL executed directly showed what I expected: 7 simple queries in total are executed for each insertion, each taking ~300ms. That's quite long; I guess it is mostly due to the time to reach the server. Nothing to be done except removing some foreign keys, right?
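For what it's worth, when per-statement server latency dominates, the usual lever is to pay it once per batch instead of once per row: build all the objects in memory and commit a single transaction, letting the ORM flush them together. A minimal sketch (plain SQLAlchemy against in-memory SQLite, with a simplified one-table model rather than the real schema above):

```python
# Sketch: batch all inserts into one transaction so round-trip cost is
# amortized across the batch. Model is a simplified stand-in, not the
# question's FlightModel.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Flight(Base):
    __tablename__ = "flights"
    id = Column(Integer, primary_key=True)
    name = Column(String(64))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# One commit for the whole batch: the ORM flushes all 500 rows together
# instead of issuing a commit (and a network round trip) per flight.
session.add_all(Flight(name=f"flight-{i}") for i in range(500))
session.commit()

print(session.query(Flight).count())  # 500
```

This does not remove the 7 INSERTs per flight, but it stops paying connection latency per row; over a real network that is usually the bulk of the time.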

Related

understanding session.query() and select() behaviour with order_by in Sqlalchemy 1.4

I am using SQLAlchemy 1.4 along with MySQL 8 as the backend database for one of my projects and have the following model classes.
class CM(Base):
    __tablename__ = "cm"
    cm_id = Column("id", Integer(), primary_key=True)
    status = Column("status", String(32), nullable=False)
    hostname = Column("hostname", String(128), nullable=False)
    created_by = Column("created_by", String(64), default="")
    # One-to-many relationship with Fault class
    faults = relationship(
        "Fault", back_populates="cm", lazy="selectin", cascade="all, delete-orphan",
    )
class Fault(CBase):
    __tablename__ = "fault"
    id = Column("id", Integer(), primary_key=True)
    cm_id = Column(
        Integer,
        ForeignKey("cm.id", ondelete="CASCADE"),
        index=True,
    )
    check_id = Column(
        String(96),
        ForeignKey("faultmap.check_id", onupdate="CASCADE"),
        nullable=False,
    )
    cm = relationship("CM", back_populates="faults", lazy="selectin")
    faultmap = relationship("NewFaultMap", back_populates="faults", lazy="selectin")
    component = Column("component", Text(255), default="")
    created_by = Column("created_by", String(64), default="")
class NewFaultMap(CBase):
    __tablename__ = "faultmap"
    id = Column("id", Integer, primary_key=True)
    check_id = Column(String(96), unique=True, nullable=False)
    detection_system = Column(String(96), nullable=False)
    fault_type = Column(String(64), nullable=False)
    description = Column(Text(255), nullable=False, default="")
    # One-to-many relationship with Fault class
    faults = relationship("Fault", back_populates="faultmap", lazy="selectin")
My goal here is to fetch the rows from the CM table, eagerly fetching the related field (faults) and then its related field (faultmap) as well, and then sort (order_by) and slice based on user-supplied values.
I have tried different approaches with both select and session.query to achieve this, but there are concerns/challenges with every option. Here are some of the approaches I tried.
Using session.query()
q = (
    session.query(CM)
    .where(*cm_filters)
    .join(Fault)
    .where(*fault_filters)
    .join(NewFaultMap)
    .where(*faultmap_filters)
    .order_by(Fault.check_id)
)
cms = session.execute(q.distinct().limit(5000)).all()  # returns 5000 rows
This query works just fine and gives the expected result but I get the following warning
SADeprecationWarning: Object <sqlalchemy.orm.query.Query object at 0x10af82950> should not be used directly in a SQL statement context, such as passing to methods such as session.execute(). This usage will be disallowed in a future release. Please use Core select() / update() / delete() etc. with Session.execute() and other statement execution methods.
cms = session.execute(q.distinct().limit(5000)).all()
To address the above warning, I make the following change:
cms = q.distinct().limit(5000).all() # returns < 5000 results
The query above works but does not give the expected results: it basically applies the limit first and then the distinct, which returns far fewer records.
Also, if I change the distinct() to distinct(CM.cm_id), it does not work and throws this error:
(3065, "Expression #1 of ORDER BY clause is not in SELECT list, references column 'db.fault.check_id' which is not in SELECT list; this is incompatible with DISTINCT")
Using SQLAlchemy 1.4 select
One of my biggest confusions with select is that it does not have a similar interface to session.query(), which makes it difficult to port queries that just work with session.query.
Here is my first version of the query with session.query replaced by select:
q = (
    select(CM)
    .where(*cm_filters)
    .join(Fault)
    .where(*fault_filters)
    .join(NewFaultMap)
    .where(*faultmap_filters)
    .order_by(Fault.check_id)
)
cms = session.execute(q.distinct().limit(5000)).all()
When I run this I get the same error:
(3065, "Expression #1 of ORDER BY clause is not in SELECT list, references column 'db.fault.check_id' which is not in SELECT list; this is incompatible with DISTINCT")
So I thought maybe distinct() and order_by aren't playing nicely when order_by refers to the related field attribute Fault.check_id, so I tried unique() with the following variation:
cms = session.execute(q.slice(0,100)).scalars().unique().all()
This works, but the problem is that it runs slice() first and then unique(), which means I do not get the expected number of results. What I want is, say, 100 unique records.
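One common workaround for this class of problem is to split the query in two: first deduplicate and limit over the parent primary keys only (using GROUP BY instead of DISTINCT, so the ORDER BY can be an aggregate over the joined table), then load the full objects for exactly those ids with eager loading. A sketch against a simplified stand-in schema (not the real models or filters above):

```python
# Sketch: two-step "distinct parents, then eager-load" pattern.
# Simplified models; the real CM/Fault schema has more columns and filters.
from sqlalchemy import (Column, ForeignKey, Integer, String, create_engine,
                        func, select)
from sqlalchemy.orm import (declarative_base, relationship, selectinload,
                            sessionmaker)

Base = declarative_base()

class CM(Base):
    __tablename__ = "cm"
    id = Column(Integer, primary_key=True)
    status = Column(String(32))
    faults = relationship("Fault", back_populates="cm")

class Fault(Base):
    __tablename__ = "fault"
    id = Column(Integer, primary_key=True)
    cm_id = Column(Integer, ForeignKey("cm.id"))
    check_id = Column(String(96))
    cm = relationship("CM", back_populates="faults")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([
    CM(id=1, status="ok", faults=[Fault(check_id="b"), Fault(check_id="a")]),
    CM(id=2, status="ok", faults=[Fault(check_id="c")]),
])
session.commit()

# Step 1: unique parent ids, ordered by an aggregate of the joined column,
# so DISTINCT vs ORDER BY incompatibility never arises; limit applies to ids.
id_q = (
    select(CM.id)
    .join(Fault)
    .group_by(CM.id)
    .order_by(func.min(Fault.check_id))
    .limit(100)
)
ids = session.execute(id_q).scalars().all()

# Step 2: load the full objects (faults eagerly) for exactly those ids.
cms = session.execute(
    select(CM).where(CM.id.in_(ids)).options(selectinload(CM.faults))
).scalars().all()
```

The limit now counts unique CM rows, and the second query is free to eager-load faults and faultmap without inflating the row count.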

SqlAlchemy auto increment composite key for every new combination

I have a scenario where I need to increment the session_number column per user_name. If a user has created a session before, I increment their last session_number, but if a user is creating a session for the first time, session_number should start from 1. I tried to illustrate this below. Right now I handle it with application logic, but I am trying to find a more elegant way to do it in SQLAlchemy.
id - user_name - session_number
 1   user_1      1
 2   user_1      2
 3   user_2      1
 4   user_1      3
 5   user_2      2
Here is my Python code for the table. My database is PostgreSQL and I'm using alembic to upgrade tables. Right now it keeps incrementing session_number regardless of user_name.
class UserSessions(db.Model):
    __tablename__ = 'user_sessions'
    id = db.Column(db.Integer, primary_key=True, unique=True)
    username = db.Column(db.String, nullable=False)
    session_number = db.Column(db.Integer, Sequence('session_number_seq', start=0, increment=1))
    created_at = db.Column(db.DateTime)
    last_edit = db.Column(db.DateTime)
    __table_args__ = (
        db.UniqueConstraint('username', 'session_number', name='_username_session_number_idx_'),
    )
I've searched the internet for this situation, but the cases I found were not like my problem. Is it possible to achieve this with SQLAlchemy/PostgreSQL features?
First, I do not know of any "pure" solution for this situation using either SQLAlchemy or PostgreSQL, or a combination of the two.
Although it might not be exactly the solution you are looking for, I hope it will give you some ideas.
If you wanted to calculate the session_number for the whole table without it being stored, I would use the following query or a variation thereof:
def get_user_sessions_with_rank():
    expr = (
        db.func.rank()
        .over(partition_by=UserSessions.username, order_by=[UserSessions.id])
        .label("session_number")
    )
    subq = db.session.query(UserSessions.id, expr).subquery("subq")
    q = (
        db.session.query(UserSessions, subq.c.session_number)
        .join(subq, UserSessions.id == subq.c.id)
        .order_by(UserSessions.id)
    )
    return q.all()
Alternatively, I would add a column_property to the model to compute it on the fly for each instance of UserSessions. It is not as efficient computationally, but for queries filtering by a specific user it should be good enough:
class UserSessions(db.Model):
    __tablename__ = "user_sessions"
    id = db.Column(db.Integer, primary_key=True, unique=True)
    username = db.Column(db.String, nullable=False)
    created_at = db.Column(db.DateTime)
    last_edit = db.Column(db.DateTime)

# must define this outside of the model definition because of the need for aliased
US2 = db.aliased(UserSessions)
UserSessions.session_number = db.column_property(
    db.select(db.func.count(US2.id))
    .where(US2.username == UserSessions.username)
    .where(US2.id <= UserSessions.id)
    .scalar_subquery()
)
In this case, when you query for UserSessions, session_number will be fetched from the database, while remaining None for newly created (unflushed) instances.
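If the number does need to be stored, a common insert-time pattern (a sketch, not a race-free guarantee on its own) is to compute it with a correlated scalar subquery inside the INSERT itself, and let the (username, session_number) unique constraint catch concurrent writers, retrying on IntegrityError. Shown here with plain SQLAlchemy and SQLite rather than the Flask-SQLAlchemy/PostgreSQL setup above:

```python
# Sketch: per-user auto-increment via coalesce(max(...), 0) + 1 computed
# in the INSERT. The unique constraint catches races; callers would retry
# on IntegrityError in a concurrent setting.
from sqlalchemy import (Column, Integer, String, UniqueConstraint,
                        create_engine, func, select)
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class UserSessions(Base):
    __tablename__ = "user_sessions"
    id = Column(Integer, primary_key=True)
    username = Column(String, nullable=False)
    session_number = Column(Integer, nullable=False)
    __table_args__ = (
        UniqueConstraint("username", "session_number",
                         name="_username_session_number_idx_"),
    )

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def add_session(username):
    # Correlated subquery rendered inside the INSERT: 1 + current max
    # for this user, or 1 if the user has no sessions yet.
    next_number = (
        select(func.coalesce(func.max(UserSessions.session_number), 0) + 1)
        .where(UserSessions.username == username)
        .scalar_subquery()
    )
    session.add(UserSessions(username=username, session_number=next_number))
    session.commit()

for name in ["user_1", "user_1", "user_2", "user_1", "user_2"]:
    add_session(name)
```

This reproduces exactly the id/user_name/session_number table from the question, with the counter restarting at 1 for each new user.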

SQLAlchemy filter query results based on other table's field

I have a not-very-common join-and-filter problem.
Here are my models;
class Order(Base):
    id = Column(Integer, primary_key=True)
    order_id = Column(String(19), nullable=False)
    ... (other fields)

class Discard(Base):
    id = Column(Integer, primary_key=True)
    order_id = Column(String(19), nullable=False)
I want to query all full instances of Order but exclude those that have a match in Discard.order_id based on the Order.order_id field. As you can see, there is no relationship between the order_id fields.
I tried an outer left join and notin_ but had no success.
With this answer I achieved the desired results.
Here is my code;
orders = (
    session.query(Order)
    .outerjoin(Discard, Order.order_id == Discard.order_id)
    .filter(Discard.order_id == None)  # noqa: E711
    .all()
)
I was paying too much attention to flake8's wrong-syntax message at Discard.order_id == None and was using Discard.order_id is None instead. It turned out they are rendered differently by SQLAlchemy.
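To keep both flake8 and SQLAlchemy happy, the .is_() operator renders the same IS NULL SQL as == None without tripping E711. A minimal runnable sketch (simplified models and sample data, in-memory SQLite):

```python
# Sketch: anti-join with an explicit outer join, using .is_(None) instead
# of "== None". Models are minimal stand-ins for the question's.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    order_id = Column(String(19), nullable=False)

class Discard(Base):
    __tablename__ = "discards"
    id = Column(Integer, primary_key=True)
    order_id = Column(String(19), nullable=False)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([Order(order_id="A"), Order(order_id="B"),
                 Discard(order_id="B")])
session.commit()

# .is_(None) renders "IS NULL", same SQL as "== None", no E711 warning.
kept = (
    session.query(Order)
    .outerjoin(Discard, Order.order_id == Discard.order_id)
    .filter(Discard.order_id.is_(None))
    .all()
)
print([o.order_id for o in kept])  # ['A']
```

Unlike Python's `is None` (which is evaluated eagerly to a plain boolean), `.is_(None)` builds a SQL expression, which is why the two behave so differently here.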

Fastest way to insert object if it doesn't exist with SQLAlchemy

So I'm quite new to SQLAlchemy.
I have a model Showing which has about 10,000 rows in the table. Here is the class:
class Showing(Base):
    __tablename__ = "showings"
    id = Column(Integer, primary_key=True)
    time = Column(DateTime)
    link = Column(String)
    film_id = Column(Integer, ForeignKey('films.id'))
    cinema_id = Column(Integer, ForeignKey('cinemas.id'))

    def __eq__(self, other):
        if self.time == other.time and self.cinema == other.cinema and self.film == other.film:
            return True
        else:
            return False
Could anyone give me some guidance on the fastest way to insert a new showing if it doesn't already exist? I think it is slightly more complicated because a showing is only unique if the time, cinema, and film together are unique.
I currently have this code:
def AddShowings(self, showing_times, cinema, film):
    all_showings = self.session.query(Showing).options(joinedload(Showing.cinema), joinedload(Showing.film)).all()
    for showing_time in showing_times:
        tmp_showing = Showing(time=showing_time[0], film=film, cinema=cinema, link=showing_time[1])
        if tmp_showing not in all_showings:
            self.session.add(tmp_showing)
            self.session.commit()
            all_showings.append(tmp_showing)
which works, but seems to be very slow. Any help is much appreciated.
If any such object is unique based on a combination of columns, you should mark those as a composite primary key. Add the primary_key=True keyword parameter to each of these columns and drop your id column altogether:
class Showing(Base):
    __tablename__ = "showings"
    time = Column(DateTime, primary_key=True)
    link = Column(String)
    film_id = Column(Integer, ForeignKey('films.id'), primary_key=True)
    cinema_id = Column(Integer, ForeignKey('cinemas.id'), primary_key=True)
That way your database can handle these rows more efficiently (no need for an incrementing column), and SQLAlchemy now automatically knows if two instances of Showing are the same thing.
I believe you can then just merge your new Showing back into the session:
def AddShowings(self, showing_times, cinema, film):
    for showing_time in showing_times:
        self.session.merge(
            Showing(time=showing_time[0], link=showing_time[1],
                    film=film, cinema=cinema)
        )

Help with Complicated SQL Alchemy Join

First, the database overview:
competitors - people who compete
competitions - things that people compete at
competition_registrations - Competitors registered for a particular competition
event - An "event" at a competition.
events_couples - A couple (2 competitors) competing in an event.
EventCouple, the Python class corresponding to events_couples, is:
class EventCouple(Base):
    __tablename__ = 'events_couples'
    competition_id = Column(Integer, ForeignKey('competitions.id'), primary_key=True)
    event_id = Column(Integer, ForeignKey('events.id'), primary_key=True)
    leader_id = Column(Integer)
    follower_id = Column(Integer)
    __table_args__ = (
        ForeignKeyConstraint(['competition_id', 'leader_id'], ['competition_registrations.competition_id', 'competition_registrations.competitor_id']),
        ForeignKeyConstraint(['competition_id', 'follower_id'], ['competition_registrations.competition_id', 'competition_registrations.competitor_id']),
        {}
    )
I have a Python class, CompetitorRegistration, that corresponds to a record/row in competition_registrations. A competitor, who is registered, can compete in multiple events, but either as a "leader", or a "follower". I'd like to add to CompetitorRegistration an attribute leading, that is a list of EventCouple where the competition_id and leader_id match. This is my CompetitorRegistration class, complete with attempt:
class CompetitorRegistration(Base):
    __tablename__ = 'competition_registrations'
    competition_id = Column(Integer, ForeignKey('competitions.id'), primary_key=True)
    competitor_id = Column(Integer, ForeignKey('competitors.id'), primary_key=True)
    email = Column(String(255))
    affiliation_id = Column(Integer, ForeignKey('affiliation.id'))
    is_student = Column(Boolean)
    registered_time = Column(DateTime)
    leader_number = Column(Integer)
    leading = relationship('EventCouple', primaryjoin=and_('CompetitorRegistration.competition_id == EventCouple.competition_id', 'CompetitorRegistration.competitor_id == EventCouple.leader_id'))
    following = relationship('EventCouple', primaryjoin='CompetitorRegistration.competition_id == EventCouple.competition_id and CompetitorRegistration.competitor_id == EventCouple.follower_id')
However, I get:
ArgumentError: Could not determine relationship direction for primaryjoin
condition 'CompetitorRegistration.competition_id == EventCouple.competition_id
AND CompetitorRegistration.competitor_id == EventCouple.leader_id', on
relationship CompetitorRegistration.leading. Ensure that the referencing Column
objects have a ForeignKey present, or are otherwise part of a
ForeignKeyConstraint on their parent Table, or specify the foreign_keys parameter
to this relationship.
Thanks for any help, & let me know if more info is needed on the schema.
Also, another attempt of mine is visible in following: it did not error, but didn't give correct results either (it only joined on competition_id and completely ignored follower_id).
Your leading condition mixes an expression with strings meant to be eval()ed, and your following condition mixes Python and SQL operators: Python's and does not do what you expect here. Below are corrected examples using both variants:
leading = relationship('EventCouple', primaryjoin=(
    (competition_id == EventCouple.competition_id) &
    (competitor_id == EventCouple.leader_id)))

leading = relationship('EventCouple', primaryjoin=and_(
    competition_id == EventCouple.competition_id,
    competitor_id == EventCouple.leader_id))

following = relationship('EventCouple', primaryjoin=
    '(CompetitorRegistration.competition_id == EventCouple.competition_id) '
    '& (CompetitorRegistration.competitor_id == EventCouple.follower_id)')
