Join two Pandas Dataframes

Join two Pandas Dataframes - python

We have two tables:
Table 1: EventLog
class EventLog(Base):
""""""
__tablename__ = 'event_logs'
id = Column(Integer, primary_key=True, autoincrement=True)
# Keys
event_id = Column(Integer)
data = Column(String)
signature = Column(String)
# Unique constraint
__table_args__ = (UniqueConstraint('event_id', 'signature'),)
Table 2: Machine_Event_Logs
class Machine_Event_Logs(Base):
""""""
__tablename__ = 'machine_event_logs'
id = Column(Integer, primary_key=True, autoincrement=True)
# Keys
machine_id = Column(String, ForeignKey("machines.id"))
event_log_id = Column(String, ForeignKey("event_logs.id"))
event_record_id = Column(Integer)
time_created = Column(String)
# Unique constraint
__table_args__ = (UniqueConstraint('machine_id', 'event_log_id', 'event_record_id', 'time_created'),)
# Relationships
event_logs = relationship("EventLog")
The relationship between EventLogs and Machine_Event_Logs is 1 to many.
Whereby we register a unique event log into the EventLogs table and then register millions of entries into Machine_Event_Logs for every time we encounter that event.
Goal: We're trying to join both table to display the entire timeline of event logs captured.
We've tried multiple combinations of the merge() function in Panda Dataframe but it only returns a bunch of NaN or empty. For example:
pd.merge(event_logs, machine_event_logs, how='left', left_on='id', right_on='event_log_id')
Any ideas on how to solve this?
Thank in in advance for your assistance.

According to your data schema, you have incompatible types where id in event_logs is an Integer and event_log_id in machine_event_logs is a String column. In Python the equality of a string and its equivalent numeric value yields false:
print('0'==0)
# False
Therefore your pandas left join merge returns all NAN on right hand side since no matches are successfully found. Consider converting to align types for proper merging:
event_logs['id'] = event_logs['id'].astype(str)
OR
machine_event_logs['event_log_id'] = machine_event_logs['event_log_id'].astype(int)

Related

SqlAlchemy difference of timestamps

I have a table to log user actions. I dont have different columns for different action types yet for my newest task, I have to find the average time difference inbetween some actions.
class UserAction(db.Model):
__tablename__ = "user_action"
id = Column(Integer, nullable=False, primary_key=True)
user_id = Column(Integer, ForeignKey('user.id))
case_id = Column(Integer, ForeignKey('case.id'))
category_id = Column(Integer)
action_time = Column(DateTime(), server_default=func.now())
So I want to do something like,
x = session.query(func.avg(UserAction.action_time)).filter(UserAction.category_id == 1).all()
y = session.query(func.avg(UserAction.action_time)).filter(UserAction.category_id == 2).all()
dif = x - y
or
session.query(func.avg(func.justify_hours(timestamp_column1) - func.justify_hours(timestamp_column1))).all()
I can create a new table to log each case's actions' in different columns to do it like thw first way but that wont be efficient. But if i do it like the second way, the output won't be correct becasue not all actions are done by all cases. So im a little bit stuck. How can i achieve this?
Thanks

SQLAlchemy filter query results based on other table's field

I have got a not very common join and filter problem.
Here are my models;
class Order(Base):
id = Column(Integer, primary_key=True)
order_id = Column(String(19), nullable=False)
... (other fields)
class Discard(Base):
id = Column(Integer, primary_key=True)
order_id = Column(String(19), nullable=False)
I want to query all and full instances of Order but just exclude those that have a match in Discard.order_id based on Order.order_id field. As you can see there is no relationship between order_id fields.
I've tried outer left join, notin_ but ended up with no success.

With this answer I've achieved desired results.
Here is my code;
orders = (
session.query(Order)
.outerjoin(Discard, Order.order_id == Discard.order_id)
.filter(Discard.order_id == None) # noqa: E711
.all()
)
I was paying too much attention to flake8 wrong syntax message at Discard.order_id == None and was using Discard.order_id is None. It appeared out they were rendered differently by sqlalchemy.

Many-to-many join table with additional field in Flask

I have two tables, Products and Orders, inside my Flask-SqlAlchemy setup, and they are linked so an order can have several products:
class Products(db.Model):
id = db.Column(db.Integer, primary_key=True)
....
class Orders(db.Model):
guid = db.Column(db.String(36), default=generate_uuid, primary_key=True)
products = db.relationship(
"Products", secondary=order_products_table, backref="orders")
....
linked via:
order_products_table = db.Table("order_products_table",
db.Column('orders_guid', db.String(36), db.ForeignKey('orders.guid')),
db.Column('products_id', db.Integer, db.ForeignKey('products.id'))
# db.Column('license', dbString(36))
)
For my purposes, each product in an order will receive a unique license string, which logically should be added to the order_products_table rows of each product in an order.
How do I declare this third license column on the join table order_products_table so it gets populated it as I insert an Order?

I've since found the documentation for the Association Object from the SQLAlchemy docs, which allows for exactly this expansion to the join table.
Updated setup:
# Instead of a table, provide a model for the JOIN table with additional fields
# and explicit keys and back_populates:
class OrderProducts(db.Model):
__tablename__ = 'order_products_table'
orders_guid = db.Column(db.String(36), db.ForeignKey(
'orders.guid'), primary_key=True)
products_id = db.Column(db.Integer, db.ForeignKey(
'products.id'), primary_key=True)
order = db.relationship("Orders", back_populates="products")
products = db.relationship("Products", back_populates="order")
licenses = db.Column(db.String(36), nullable=False)
class Products(db.Model):
id = db.Column(db.Integer, primary_key=True)
order = db.relationship(OrderProducts, back_populates="order")
....
class Orders(db.Model):
guid = db.Column(db.String(36), default=generate_uuid, primary_key=True)
products = db.relationship(OrderProducts, back_populates="products")
....
What is really tricky (but also shown on the documentation page), is how you insert the data. In my case it goes something like this:
o = Orders(...) # insert other data
for id in products:
# Create OrderProducts join rows with the extra data, e.g. licenses
join = OrderProducts(licenses="Foo")
# To the JOIN add the products
join.products = Products.query.get(id)
# Add the populated JOIN as the Order products
o.products.append(join)
# Finally commit to database
db.session.add(o)
db.session.commit()
I was at first trying to populate the Order.products (or o.products in the example code) directly, which will give you an error about using a Products class when it expects a OrderProducts class.
I also struggled with the whole field naming and referencing of the back_populates. Again, the example above and on the docs show this. Note the pluralization is entirely to do with how you want your fields named.

Fastest way to insert object if it doesn't exist with SQLAlchemy

So I'm quite new to SQLAlchemy.
I have a model Showing which has about 10,000 rows in the table. Here is the class:
class Showing(Base):
__tablename__ = "showings"
id = Column(Integer, primary_key=True)
time = Column(DateTime)
link = Column(String)
film_id = Column(Integer, ForeignKey('films.id'))
cinema_id = Column(Integer, ForeignKey('cinemas.id'))
def __eq__(self, other):
if self.time == other.time and self.cinema == other.cinema and self.film == other.film:
return True
else:
return False
Could anyone give me some guidance on the fastest way to insert a new showing if it doesn't exist already. I think it is slightly more complicated because a showing is only unique if the time, cinmea, and film are unique on a showing.
I currently have this code:
def AddShowings(self, showing_times, cinema, film):
all_showings = self.session.query(Showing).options(joinedload(Showing.cinema), joinedload(Showing.film)).all()
for showing_time in showing_times:
tmp_showing = Showing(time=showing_time[0], film=film, cinema=cinema, link=showing_time[1])
if tmp_showing not in all_showings:
self.session.add(tmp_showing)
self.session.commit()
all_showings.append(tmp_showing)
which works, but seems to be very slow. Any help is much appreciated.

If any such object is unique based on a combination of columns, you need to mark these as a composite primary key. Add the primary_key=True keyword parameter to each of these columns, dropping your id column altogether:
class Showing(Base):
__tablename__ = "showings"
time = Column(DateTime, primary_key=True)
link = Column(String)
film_id = Column(Integer, ForeignKey('films.id'), primary_key=True)
cinema_id = Column(Integer, ForeignKey('cinemas.id'), primary_key=True)
That way your database can handle these rows more efficiently (no need for an incrementing column), and SQLAlchemy now automatically knows if two instances of Showing are the same thing.
I believe you can then just merge your new Showing back into the session:
def AddShowings(self, showing_times, cinema, film):
for showing_time in showing_times:
self.session.merge(
Showing(time=showing_time[0], link=showing_time[1],
film=film, cinema=cinema)
)

Help with Complicated SQL Alchemy Join

First, the database overview:
competitors - people who compete
competitions - things that people compete at
competition_registrations - Competitors registered for a particular competition
event - An "event" at a competition.
events_couples - A couple (2 competitors) competing in an event.
First, EventCouple, a Python class corresponding to events_couples, is:
class EventCouple(Base):
__tablename__ = 'events_couples'
competition_id = Column(Integer, ForeignKey('competitions.id'), primary_key=True)
event_id = Column(Integer, ForeignKey('events.id'), primary_key=True)
leader_id = Column(Integer)
follower_id = Column(Integer)
__table_args__ = (
ForeignKeyConstraint(['competition_id', 'leader_id'], ['competition_registrations.competition_id', 'competition_registrations.competitor_id']),
ForeignKeyConstraint(['competition_id', 'follower_id'], ['competition_registrations.competition_id', 'competition_registrations.competitor_id']),
{}
)
I have a Python class, CompetitorRegistration, that corresponds to a record/row in competition_registrations. A competitor, who is registered, can compete in multiple events, but either as a "leader", or a "follower". I'd like to add to CompetitorRegistration an attribute leading, that is a list of EventCouple where the competition_id and leader_id match. This is my CompetitorRegistration class, complete with attempt:
class CompetitorRegistration(Base):
__tablename__ = 'competition_registrations'
competition_id = Column(Integer, ForeignKey('competitions.id'), primary_key=True)
competitor_id = Column(Integer, ForeignKey('competitors.id'), primary_key=True)
email = Column(String(255))
affiliation_id = Column(Integer, ForeignKey('affiliation.id'))
is_student = Column(Boolean)
registered_time = Column(DateTime)
leader_number = Column(Integer)
leading = relationship('EventCouple', primaryjoin=and_('CompetitorRegistration.competition_id == EventCouple.competition_id', 'CompetitorRegistration.competitor_id == EventCouple.leader_id'))
following = relationship('EventCouple', primaryjoin='CompetitorRegistration.competition_id == EventCouple.competition_id and CompetitorRegistration.competitor_id == EventCouple.follower_id')
However, I get:
ArgumentError: Could not determine relationship direction for primaryjoin
condition 'CompetitorRegistration.competition_id == EventCouple.competition_id
AND CompetitorRegistration.competitor_id == EventCouple.leader_id', on
relationship CompetitorRegistration.leading. Ensure that the referencing Column
objects have a ForeignKey present, or are otherwise part of a
ForeignKeyConstraint on their parent Table, or specify the foreign_keys parameter
to this relationship.
Thanks for any help, & let me know if more info is needed on the schema.
Also, another attempt of mine is visible in following — this did not error, but didn't give correct results either. (It only joined on the competition_id, and completely ignored the follower_id)

Your leading's condition mixes expression and string to be eval()ed. And following's condition mixes Python and SQL operators: and in Python is not what you expected here. Below are corrected examples using both variants:
leading = relationship('EventCouple', primaryjoin=(
(competition_id==EventCouple.competition_id) & \
(competitor_id==EventCouple.leader_id)))
leading = relationship('EventCouple', primaryjoin=and_(
competition_id==EventCouple.competition_id,
competitor_id==EventCouple.leader_id))
following = relationship('EventCouple', primaryjoin=\
'(CompetitorRegistration.competition_id==EventCouple.competition_id) '\
'& (CompetitorRegistration.competitor_id==EventCouple.follower_id)')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Join two Pandas Dataframes - python

Related

SqlAlchemy difference of timestamps

SQLAlchemy filter query results based on other table's field

Many-to-many join table with additional field in Flask

Fastest way to insert object if it doesn't exist with SQLAlchemy

Help with Complicated SQL Alchemy Join

Categories

Resources