How can I merge two rows that share a value in one column? Let's say I have a model with ~40 columns like below:
class Model(Base):
    __tablename__ = "table"
    id = Column(Integer, primary_key=True)
    value_a = Column(String)
    value_b = Column(String)
    value_c = Column(String)
    ...
Each run processes ~500k rows of new data, and each run also creates a new table.
After the initial insert (using session.bulk_insert_mappings(Model, data)) there are duplicated value_c values (at most two rows per value), and in each such pair one row has value_a set with value_b empty, while the other has value_b set with value_a empty.
After initial insert:
| id | value_a | value_b | value_c |
| -- | ------- | ------- | ------- |
| 1 | foo | None | xyz |
| 2 | None | bar | xyz |
Having all the rows, I need to merge the rows sharing a value_c value to get rid of the duplicates.
After update:
| id | value_a | value_b | value_c |
| -- | ------- | ------- | ------- |
| 3 | foo | bar | xyz |
What is the most efficient way to do that? From the beginning I was using session.merge(row) for each row, but it is too slow, so I decided to split the process into insert and update stages.
You should be able to insert from a SELECT statement that joins the rows with a non-null value_a to the rows with a non-null value_b. Then, after inserting the combined rows, you can delete the old ones. This matches the case you outlined exactly; you may need to add more conditions to skip entries you don't want inserted or deleted (e.g. (a, b, c) == (None, None, 'value')).
I used aliased so that I can join the table against itself.
import sys

from sqlalchemy import (
    create_engine,
    Integer,
    String,
)
from sqlalchemy.schema import Column
from sqlalchemy.orm import Session, declarative_base, aliased
from sqlalchemy.sql import select, or_, and_, delete, insert
username, password, db = sys.argv[1:4]
Base = declarative_base()
engine = create_engine(f"postgresql+psycopg2://{username}:{password}@/{db}", echo=True)
metadata = Base.metadata
class Model(Base):
    __tablename__ = "table"
    id = Column(Integer, primary_key=True)
    value_a = Column(String)
    value_b = Column(String)
    value_c = Column(String)

metadata.create_all(engine)

def print_models(session):
    for (model,) in session.execute(select(Model)).all():
        print(model.id, model.value_a, model.value_b, model.value_c)

with Session(engine) as session, session.begin():
    for (a, b, c) in [('foo', None, 'xyz'), (None, 'bar', 'xyz'), ('leave', 'it', 'asis')]:
        session.add(Model(value_a=a, value_b=b, value_c=c))
    session.flush()
    print_models(session)
with Session(engine) as session, session.begin():
    #
    # Insert de-nulled entries.
    #
    left = aliased(Model)
    right = aliased(Model)
    nulls_joined_q = select(
        left.value_a,
        right.value_b,
        left.value_c
    ).distinct().select_from(
        left
    ).join(
        right,
        left.value_c == right.value_c
    ).where(
        and_(
            # Ignore entries with no C value.
            left.value_c != None,
            left.value_b == None,
            right.value_a == None))
    stmt = insert(
        Model.__table__
    ).from_select([
        "value_a",
        "value_b",
        "value_c"
    ], nulls_joined_q)
    session.execute(stmt)
    #
    # Remove null entries: all rows where value_c is NOT NULL and either
    # value_a or value_b is empty.
    #
    # NOTE: This also deletes entries where value_a and value_b are BOTH
    # null in the same row.
    #
    stmt = delete(Model.__table__).where(and_(
        # Ignore these like we did in the insert.
        Model.value_c != None,
        or_(
            Model.value_a == None,
            Model.value_b == None),
    ))
    session.execute(stmt)
    session.flush()
    # Output
    print_models(session)
Output
1 foo None xyz
2 None bar xyz
3 leave it asis
#... then
3 leave it asis
4 foo bar xyz
Docs
https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.aliased
https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.delete
https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.insert
Related
I have this case in a Note table:

| id | parent_id |
| -- | --------- |
| 1  | Null      |
| 2  | 1         |
| 3  | 2         |
| 4  | 3         |
What I want to achieve is to get the top-level parent id. In this case, if I pass id 4 I should get id 1, since id 1 is the top-level parent: when parent_id is null, that row is the top-level parent.
I've tried this, but it returns the same id that I pass to the function.
def get_top_level_Note(self, id: int):
    hierarchy = (
        self.db.session.query(Note)
        .filter(Note.id == id)
        .cte(name="hierarchy", recursive=True)
    )
    parent = aliased(hierarchy, name="p")
    children = aliased(Note, name="c")
    hierarchy = hierarchy.union_all(
        self.db.session.query(children)
        .filter(children.parent_id == parent.c.id)
    )
    result = self.db.session.query(Note).select_entity_from(hierarchy).all()
With an existing table named "note"
id parent_id
----------- -----------
11 NULL
22 11
33 22
44 33
55 NULL
66 55
A bit of messing around in PostgreSQL showed that
WITH RECURSIVE parent (i, id, parent_id)
AS (
SELECT 0, id, parent_id FROM note WHERE id=44
UNION ALL
SELECT i + 1, n.id, n.parent_id
FROM note n INNER JOIN parent p ON p.parent_id = n.id
WHERE p.parent_id IS NOT NULL
)
SELECT * FROM parent ORDER BY i;
returned
i id parent_id
----------- ----------- -----------
0 44 33
1 33 22
2 22 11
3 11 NULL
and therefore we could get the top-level parent by changing the last line to
WITH RECURSIVE parent (i, id, parent_id)
AS (
SELECT 0, id, parent_id FROM note WHERE id=44
UNION ALL
SELECT i + 1, n.id, n.parent_id
FROM note n INNER JOIN parent p ON p.parent_id = n.id
WHERE p.parent_id IS NOT NULL
)
SELECT id FROM parent ORDER BY i DESC LIMIT 1 ;
returning
id
-----------
11
So to convert that into SQLAlchemy (1.4):
from sqlalchemy import (
    create_engine,
    Column,
    Integer,
    select,
    literal_column,
)
from sqlalchemy.orm import declarative_base

connection_uri = "postgresql://scott:tiger@192.168.0.199/test"
engine = create_engine(connection_uri, echo=False)
Base = declarative_base()

class Note(Base):
    __tablename__ = "note"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer)

def get_top_level_note_id(start_id):
    note_tbl = Note.__table__
    parent_cte = (
        select(
            literal_column("0").label("i"), note_tbl.c.id, note_tbl.c.parent_id
        )
        .where(note_tbl.c.id == start_id)
        .cte(name="parent_cte", recursive=True)
    )
    parent_cte_alias = parent_cte.alias("parent_cte_alias")
    note_tbl_alias = note_tbl.alias()
    parent_cte = parent_cte.union_all(
        select(
            literal_column("parent_cte_alias.i + 1"),
            note_tbl_alias.c.id,
            note_tbl_alias.c.parent_id,
        )
        .where(note_tbl_alias.c.id == parent_cte_alias.c.parent_id)
        .where(parent_cte_alias.c.parent_id.is_not(None))
    )
    stmt = select(parent_cte.c.id).order_by(parent_cte.c.i.desc()).limit(1)
    with engine.begin() as conn:
        result = conn.execute(stmt).scalar()
    return result

if __name__ == "__main__":
    test_id = 44
    print(
        f"top level id for note {test_id} is {get_top_level_note_id(test_id)}"
    )
    # top level id for note 44 is 11
    test_id = 66
    print(
        f"top level id for note {test_id} is {get_top_level_note_id(test_id)}"
    )
    # top level id for note 66 is 55
Using SQLAlchemy with an SQLite engine, I've got a self-referential hierarchal table that describes a directory structure.
from sqlalchemy import Column, Integer, String, ForeignKey, Index
from sqlalchemy.orm import column_property, aliased, join
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Dr(Base):
    __tablename__ = 'directories'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    parent_id = Column(Integer, ForeignKey('directories.id'))
Each Dr row only knows its own "name" and its "parent_id". I've added a recursive column_property called "path" that returns a string containing all of a Dr's ancestors starting from the root Dr.
root_anchor = (
    select([Dr.id, Dr.name, Dr.parent_id, Dr.name.label('path')])
    .where(Dr.parent_id == None)
    .cte(recursive=True)
)
dir_alias = aliased(Dr)
cte_alias = aliased(root_anchor)
path_table = root_anchor.union_all(
    select([
        dir_alias.id, dir_alias.name,
        dir_alias.parent_id, cte_alias.c.path + "/" + dir_alias.name
    ]).select_from(
        join(dir_alias, cte_alias, onclause=cte_alias.c.id == dir_alias.parent_id)
    )
)
Dr.path = column_property(
    select([path_table.c.path]).where(path_table.c.id == Dr.id)
)
Here's an example of the output:
"""
-----------------------------
| id | name | parent_id |
-----------------------------
| 1 | root | NULL |
-----------------------------
| 2 | kid | 1 |
-----------------------------
| 3 | grandkid | 2 |
-----------------------------
"""
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

sqlite_engine = create_engine('sqlite:///:memory:')
Session = sessionmaker(bind=sqlite_engine)
session = Session()
instance = session.query(Dr).filter(Dr.name == 'grandkid').one()
print(instance.path)
# Outputs: "root/kid/grandkid"
I'd like to be able to add an index, or at least a unique constraint, on the "path" property so that the same path cannot exist more than once in the table. I've tried:
Index('pathindex', Dr.path, unique=True)
...with no luck. No error is raised, but SQLAlchemy doesn't seem to register the index, it just silently ignores it. It still allows adding a duplicate path, e.g.:
session.add(Dr(name='grandkid', parent_id=2))
session.commit()
As further evidence that the Index() was ignored, inspecting the "indexes" property of the table results in an empty set:
print(Dr.__table__.indexes)
#Outputs: set([])
It's essential to me that duplicate paths cannot exist in the database. I'm not sure whether what I'm trying to do with column_property is possible in SQLAlchemy, and if not I'd love to hear some suggestions on how else I can go about this.
I think a unique constraint should suffice. In class Dr:
__table_args__ = (UniqueConstraint('parent_id', 'name'), )
If no two siblings (rows sharing a parent_id) can have the same name, then no two rows can have the same full path.
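A minimal runnable sketch of that constraint in action (illustrative, assuming the Dr model from the question and an in-memory SQLite engine):

```python
from sqlalchemy import (Column, ForeignKey, Integer, String,
                        UniqueConstraint, create_engine)
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dr(Base):
    __tablename__ = 'directories'
    # No two siblings may share a name, so every full path stays unique.
    # Caveat: most databases treat NULLs as distinct in unique constraints,
    # so two roots (parent_id IS NULL) with the same name would still both
    # be accepted.
    __table_args__ = (UniqueConstraint('parent_id', 'name'),)
    id = Column(Integer, primary_key=True)
    name = Column(String)
    parent_id = Column(Integer, ForeignKey('directories.id'))

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)

rejected = False
with Session(engine) as session:
    root = Dr(name='root')
    session.add(root)
    session.flush()  # assigns root.id
    session.add(Dr(name='kid', parent_id=root.id))
    session.commit()
    session.add(Dr(name='kid', parent_id=root.id))  # duplicate sibling
    try:
        session.commit()
    except IntegrityError:
        session.rollback()
        rejected = True

print(rejected)  # True: the duplicate path was rejected by the database
```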
I'm using Flask-SQLAlchemy with a one-to-many relationship between two models:
class Request(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    r_time = db.Column(db.DateTime, index=True, default=datetime.utcnow)
    org = db.Column(db.String(120))
    dest = db.Column(db.String(120))
    buyer_id = db.Column(db.Integer, db.ForeignKey('buyer.id'))
    sale_id = db.Column(db.Integer, db.ForeignKey('sale.id'))
    cost = db.Column(db.Integer)
    sr = db.Column(db.Integer)
    profit = db.Column(db.Integer)

    def __repr__(self):
        return '<Request {} by {}>'.format(self.org, self.buyer_id)

class Buyer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(120), unique=True)
    email = db.Column(db.String(120), unique=True)
    requests = db.relationship('Request', backref='buyer', lazy='dynamic')

    def __repr__(self):
        return '<Buyer {}>'.format(self.name)
I need to identify which Buyer has the minimum number of requests of all the buyers.
I could do it manually by collecting all the requests into lists and searching through them, but I believe there is a simpler way to do it with a SQLAlchemy query.
You can do this with a CTE (common table expression) for a select that produces buyer ids together with their request counts, like so:
buyer_id | request_count
:------- | :------------
1 | 5
2 | 3
3 | 1
4 | 1
Here you can filter on the counts, requiring them to be greater than 0 for a buyer to be listed.
You can then join the buyers table against that to produce:
buyer_id | buyer_name | buyer_email | request_count
:------- | :--------- | :--------------- | :------------
1        | foo        | foo@example.com  | 5
2        | bar        | bar@example.com  | 3
3        | baz        | baz@example.com  | 1
4        | spam       | spam@example.com | 1
but because we are using a CTE, you can also query the CTE for the lowest count value. In the above example, that's 1, and you can add a WHERE clause to the joined buyer-with-cte-counts query to filter the results down to only rows where the request_count value is equal to that minimum number.
The SQL query for this is
WITH request_counts AS (
SELECT request.buyer_id AS buyer_id, count(request.id) AS request_count
FROM request GROUP BY request.buyer_id
HAVING count(request.id) > ?
)
SELECT buyer.*
FROM buyer
JOIN request_counts ON buyer.id = request_counts.buyer_id
WHERE request_counts.request_count = (
SELECT min(request_counts.request_count)
FROM request_counts
)
The WITH request_counts AS (...) defines the CTE; that is the part that produces the first table with buyer_id and request_count. The CTE is then joined against buyer, and the WHERE clause filters on the min(request_counts.request_count) value.
Translating the above to Flask-SQLAlchemy code:
request_count = db.func.count(Request.id).label("request_count")
cte = (
    db.select([Request.buyer_id.label("buyer_id"), request_count])
    .group_by(Request.buyer_id)
    .having(request_count > 0)
    .cte('request_counts')
)
min_request_count = db.select([db.func.min(cte.c.request_count)]).as_scalar()
buyers_with_least_requests = Buyer.query.join(
    cte, Buyer.id == cte.c.buyer_id
).filter(cte.c.request_count == min_request_count).all()
Demo:
>>> __ = db.session.bulk_insert_mappings(
... Buyer, [{"name": n} for n in ("foo", "bar", "baz", "spam", "no requests")]
... )
>>> buyers = Buyer.query.order_by(Buyer.id).all()
>>> requests = [
... Request(buyer_id=b.id)
... for b in [*([buyers[0]] * 3), *([buyers[1]] * 5), *[buyers[2], buyers[3]]]
... ]
>>> __ = db.session.add_all(requests)
>>> request_count = db.func.count(Request.id).label("request_count")
>>> cte = (
... db.select([Request.buyer_id.label("buyer_id"), request_count])
... .group_by(Request.buyer_id)
... .having(request_count > 0)
... .cte("request_counts")
... )
>>> buyers_w_counts = Buyer.query.join(cte, cte.c.buyer_id == Buyer.id)
>>> for buyer, count in buyers_w_counts.add_column(cte.c.request_count):
... # print out buyer and request count for this demo
... print(buyer, count, sep=": ")
<Buyer foo>: 3
<Buyer bar>: 5
<Buyer baz>: 1
<Buyer spam>: 1
>>> min_request_count = db.select([db.func.min(cte.c.request_count)]).as_scalar()
>>> buyers_w_counts.filter(cte.c.request_count == min_request_count).all()
[<Buyer baz>, <Buyer spam>]
I've also created a db<>fiddle here, containing the same queries, to play with.
I've found out that you can use a custom collection class on a relationship in order to change the type of the returned collection; specifically, I was interested in dictionaries.
The documentation gives an example:
class Item(Base):
    __tablename__ = 'item'
    id = Column(Integer, primary_key=True)
    notes = relationship(
        "Note",
        collection_class=attribute_mapped_collection('keyword'),
        cascade="all, delete-orphan",
    )

class Note(Base):
    __tablename__ = 'note'
    id = Column(Integer, primary_key=True)
    item_id = Column(Integer, ForeignKey('item.id'), nullable=False)
    keyword = Column(String)
    text = Column(String)
And it works. However, I was hoping that it would collect the values into lists when more than one row has the same key. Instead, it only keeps the last value under each unique key name.
Here is an example:
The Note table:

| id | keyword |
| -- | ------- |
| 1  | foo     |
| 2  | foo     |
| 3  | bar     |
| 4  | bar     |
item.notes will return something like this:
{'foo': <project.models.note.Note at 0x7fc6840fadd2>,
'bar': <project.models.note.Note at 0x7fc6840fadd4>}
Where ids of foo and bar objects are 2 and 4 respectively.
What I'm looking for is to get something like this:
{'foo': [<project.models.note.Note at 0x7fc6840fadd1>,
         <project.models.note.Note at 0x7fc6840fadd2>],
 'bar': [<project.models.note.Note at 0x7fc6840fadd3>,
         <project.models.note.Note at 0x7fc6840fadd4>]}
Is it possible to get dict of lists from relationship in sqlalchemy?
So, it turns out you can simply inherit from MappedCollection and do whatever you like in __setitem__ there.
from sqlalchemy.orm.collections import (
    MappedCollection,
    collection,
    _instrument_class,
)

# This will ensure that the MappedCollection has been properly
# initialized with custom __setitem__() and __delitem__() methods
# before being used in a custom subclass.
_instrument_class(MappedCollection)

class DictOfListsCollection(MappedCollection):
    @collection.internally_instrumented
    def __setitem__(self, key, value, _sa_initiator=None):
        if not super(DictOfListsCollection, self).get(key):
            super(DictOfListsCollection, self).__setitem__(key, [], _sa_initiator)
        super(DictOfListsCollection, self).__getitem__(key).append(value)
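If subclassing the ORM's collection internals feels too fragile, a plain-Python alternative (a hypothetical sketch, not part of the answer above; `FakeNote` and `notes_by_keyword` are illustrative names) is to keep the ordinary list-based relationship and group it on access:

```python
from collections import defaultdict

class FakeNote:
    """Stand-in for the Note model; only the attributes matter here."""
    def __init__(self, keyword, text):
        self.keyword = keyword
        self.text = text

def notes_by_keyword(notes):
    """Group an iterable of notes into {keyword: [note, ...]}."""
    grouped = defaultdict(list)
    for note in notes:
        grouped[note.keyword].append(note)
    return dict(grouped)

notes = [FakeNote('foo', 'a'), FakeNote('foo', 'b'),
         FakeNote('bar', 'c'), FakeNote('bar', 'd')]
grouped = notes_by_keyword(notes)
print({k: len(v) for k, v in grouped.items()})  # {'foo': 2, 'bar': 2}
```

On Item this could be exposed as a read-only @property that calls the helper over self.notes, trading the instrumented dict-of-lists for a view that needs no ORM internals.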
I am dealing with a many-to-many relationship in SQLAlchemy. My question is how to avoid adding duplicate pairs to the many-to-many association table.
To make things clearer, I will use the example from the official SQLAlchemy documentation.
Base = declarative_base()

Parents2children = Table('parents2children', Base.metadata,
    Column('parents_id', Integer, ForeignKey('parents.id')),
    Column('children_id', Integer, ForeignKey('children.id'))
)

class Parent(Base):
    __tablename__ = 'parents'
    id = Column(Integer, primary_key=True)
    parent_name = Column(String(45))
    child_rel = relationship("Child", secondary=Parents2children,
                             backref="parents_backref")

    def __init__(self, parent_name=""):
        self.parent_name = parent_name

    def __repr__(self):
        return "<parents(id:'%i', parent_name:'%s')>" % (self.id, self.parent_name)

class Child(Base):
    __tablename__ = 'children'
    id = Column(Integer, primary_key=True)
    child_name = Column(String(45))

    def __init__(self, child_name=""):
        self.child_name = child_name

    def __repr__(self):
        return "<experiments(id:'%i', child_name:'%s')>" % (self.id, self.child_name)

###########################################

def setUp():
    global Session
    engine = create_engine('mysql://root:root@localhost/db_name?charset=utf8',
                           pool_recycle=3600, echo=False)
    Session = sessionmaker(bind=engine)

def add_data():
    session = Session()
    name_father1 = Parent(parent_name="Richard")
    name_mother1 = Parent(parent_name="Kate")
    name_daughter1 = Child(child_name="Helen")
    name_son1 = Child(child_name="John")
    session.add(name_father1)
    session.add(name_mother1)
    name_father1.child_rel.append(name_son1)
    name_daughter1.parents_backref.append(name_father1)
    name_son1.parents_backref.append(name_father1)
    session.commit()
    session.close()

setUp()
add_data()
With this code, the data inserted in the tables is the following:
Parents table:
+----+-------------+
| id | parent_name |
+----+-------------+
| 1 | Richard |
| 2 | Kate |
+----+-------------+
Children table:
+----+------------+
| id | child_name |
+----+------------+
| 1 | Helen |
| 2 | John |
+----+------------+
Parents2children table
+------------+-------------+
| parents_id | children_id |
+------------+-------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
+------------+-------------+
As you can see, there's a duplicate in the last table... how could I prevent SQLAlchemy from adding these duplicates?
I've tried to put relationship("Child", secondary=..., collection_class=set) but this error is displayed:
AttributeError: 'InstrumentedSet' object has no attribute 'append'
Add a PrimaryKeyConstraint (or a UniqueConstraint) to your relationship table:
Parents2children = Table('parents2children', Base.metadata,
    Column('parents_id', Integer, ForeignKey('parents.id')),
    Column('children_id', Integer, ForeignKey('children.id')),
    PrimaryKeyConstraint('parents_id', 'children_id'),
)
and your code will raise an error when you try to commit a relationship that has been added from both sides. Doing this is highly recommended.
To avoid raising an error in the first place, just check before appending:
if name_father1 not in name_son1.parents_backref:
    name_son1.parents_backref.append(name_father1)
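As for the AttributeError in the question: with collection_class=set the instrumented collection is a set, so you call .add() instead of .append(), and a duplicate add becomes a no-op. A runnable sketch under those assumptions (in-memory SQLite instead of MySQL, sets on both sides of the backref):

```python
from sqlalchemy import (Column, ForeignKey, Integer, PrimaryKeyConstraint,
                        String, Table, create_engine)
from sqlalchemy.orm import Session, backref, declarative_base, relationship

Base = declarative_base()

Parents2children = Table(
    'parents2children', Base.metadata,
    Column('parents_id', Integer, ForeignKey('parents.id')),
    Column('children_id', Integer, ForeignKey('children.id')),
    PrimaryKeyConstraint('parents_id', 'children_id'),
)

class Parent(Base):
    __tablename__ = 'parents'
    id = Column(Integer, primary_key=True)
    parent_name = Column(String(45))
    # Set-based collections on both sides: duplicate adds are no-ops.
    child_rel = relationship(
        'Child', secondary=Parents2children, collection_class=set,
        backref=backref('parents_backref', collection_class=set))

class Child(Base):
    __tablename__ = 'children'
    id = Column(Integer, primary_key=True)
    child_name = Column(String(45))

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)

with Session(engine) as session:
    father = Parent(parent_name='Richard')
    son = Child(child_name='John')
    session.add(father)
    father.child_rel.add(son)        # .add(), not .append()
    son.parents_backref.add(father)  # same pair from the other side: no-op
    session.commit()
    rows = session.execute(Parents2children.select()).fetchall()

print(len(rows))  # only one association row was written
```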