I'm using SQLAlchemy to manage a database and I'm trying to delete all rows that contain duplicates. The table has an id (primary key) and domain name.
Example:
ID| Domain
1 | example-1.com
2 | example-2.com
3 | example-1.com
In this case I want to delete 1 instance of example-1.com. Sometimes I will need to delete more than 1 but in general the database should not have a domain more than once and if it does, only the first row should be kept and the others should be deleted.
Assuming your model looks something like this:
import sqlalchemy as sa

class Domain(Base):
    __tablename__ = 'domain_names'
    id = sa.Column(sa.Integer, primary_key=True)
    domain = sa.Column(sa.String)
Then you can delete the duplicates like this:
# Create a query that identifies the row for each domain with the lowest id
inner_q = session.query(sa.func.min(Domain.id)).group_by(Domain.domain)
aliased = sa.alias(inner_q)
# Select the rows that do not match the subquery
q = session.query(Domain).filter(~Domain.id.in_(aliased))
# Delete the unmatched rows (each object is loaded and deleted through the
# unit of work; this is not a single bulk DELETE statement)
for domain in q:
    session.delete(domain)
session.commit()
# Show remaining rows
for domain in session.query(Domain):
    print(domain)
print()
If you are not using the ORM, the core equivalent is:
meta = sa.MetaData()
domains = sa.Table('domain_names', meta, autoload=True, autoload_with=engine)
inner_q = sa.select([sa.func.min(domains.c.id)]).group_by(domains.c.domain)
aliased = sa.alias(inner_q)
with engine.connect() as conn:
    conn.execute(domains.delete().where(~domains.c.id.in_(aliased)))
This answer is based on the SQL provided in this answer. There are other ways of deleting duplicates, which you can see in the other answers on the link, or by googling "sql delete duplicates" or similar.
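To sanity-check the underlying SQL independently of SQLAlchemy, here is a minimal sketch using Python's built-in sqlite3 module and the three example rows from the question. The in-memory database and the exact DELETE text are assumptions for illustration; the statement SQLAlchemy emits will differ in detail.

```python
import sqlite3

# In-memory database holding the sample rows from the question (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain_names (id INTEGER PRIMARY KEY, domain TEXT)")
conn.executemany(
    "INSERT INTO domain_names (id, domain) VALUES (?, ?)",
    [(1, "example-1.com"), (2, "example-2.com"), (3, "example-1.com")],
)

# Keep the lowest id for each domain and delete every other row
cur = conn.execute(
    """
    DELETE FROM domain_names
    WHERE id NOT IN (SELECT MIN(id) FROM domain_names GROUP BY domain)
    """
)
deleted = cur.rowcount
conn.commit()

remaining = conn.execute(
    "SELECT id, domain FROM domain_names ORDER BY id"
).fetchall()
print(deleted, remaining)
```

Only the row with id 3 is removed here, because it shares its domain with the lower id 1.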
I have a messy and old query that I'm trying to convert from SQL to Django ORM and I can't seem to figure it out.
As the original query is not something that should be public, here's something similar to what I'm working with:
Table 1
    id
Table 2
    id
    username
    active
    birthday
    table_1_fk
Table 3
    id
    amount
    table_1_fk
I need to end up with a list of active users (username), sorted by date, displaying the amount. Table1 references within table 2 and 3 are not in order. The main issues I'm having are:
How do I retrieve these with just the ORM (no looping/executing, or hardly any if I must)?
If I can't use solely ORM and do decide to just loop over the parts I need to, how would I even create a single object to display in a table without looping over everything multiple times?
My thought processes:
Table 2 is active -> get table 1 -> find table 1 pk in table 3 -> add table 3 info to table 1?
Table 1 -> get table 2 Actives, Table1 -> get table 3 amounts -> loop to match according to table1_fks
You can follow the related references starting from Table1. If your models look something like this:
from django.db import models
from django.db.models import F

class Table1(models.Model):
    ...

class Table2(models.Model):
    username = models.CharField(max_length=100)
    active = models.BooleanField()
    birthday = models.DateField()  # Sorted by date
    # on_delete is required since Django 2.0; CASCADE is an assumption here
    table1 = models.ForeignKey(Table1, on_delete=models.CASCADE, related_name="table2")

class Table3(models.Model):
    amount = models.IntegerField()
    table1 = models.ForeignKey(Table1, on_delete=models.CASCADE, related_name="table3")
You can then query like this:
>>> users = (
...     Table1.objects
...     .filter(table2__active=True)
...     .annotate(
...         username=F("table2__username"),
...         amount=F("table3__amount"),
...         birthday=F("table2__birthday"),
...     )
...     .order_by("-birthday")
...     .values("username", "amount", "birthday")
... )
>>> print(users)
[
["user1", 100.0, "2020-01-13"],
["user2", 890.0, "2020-01-10"],
["user3", None, "2020-01-01"],
]
The exact query depends entirely on how your model classes are implemented.
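As a rough check of what such a query does at the SQL level, the sketch below uses the standard-library sqlite3 module. It assumes that the filter on table2 becomes an inner join and the F() annotation on table3 becomes a left outer join; that is my reading of how Django builds joins, so compare it with print(users.query) on your own models. Table names, column names, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (id INTEGER PRIMARY KEY);
    CREATE TABLE table2 (
        id INTEGER PRIMARY KEY, username TEXT, active INTEGER,
        birthday TEXT, table1_id INTEGER REFERENCES table1(id)
    );
    CREATE TABLE table3 (
        id INTEGER PRIMARY KEY, amount INTEGER,
        table1_id INTEGER REFERENCES table1(id)
    );
    INSERT INTO table1 (id) VALUES (1), (2), (3), (4);
    INSERT INTO table2 (username, active, birthday, table1_id) VALUES
        ('user3', 1, '2020-01-01', 3),
        ('user1', 1, '2020-01-13', 1),
        ('user4', 0, '2020-01-20', 4),
        ('user2', 1, '2020-01-10', 2);
    INSERT INTO table3 (amount, table1_id) VALUES (890, 2), (100, 1);
    """
)

# Active users, newest birthday first; amount is NULL when table3 has no match
rows = conn.execute(
    """
    SELECT t2.username, t3.amount, t2.birthday
    FROM table1 AS t1
    INNER JOIN table2 AS t2 ON t2.table1_id = t1.id
    LEFT OUTER JOIN table3 AS t3 ON t3.table1_id = t1.id
    WHERE t2.active = 1
    ORDER BY t2.birthday DESC
    """
).fetchall()
for row in rows:
    print(row)
```

The left outer join is what lets a user without any table3 row still appear, with None as the amount.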
Say I have the following two SQLAlchemy ORM classes:
import sqlalchemy as sa
from sqlalchemy.orm import relationship

class Address(Base):
    __tablename__ = 'DimAddress'
    AddressKey = sa.Column(sa.Integer, primary_key=True)
    # ... columns ...

class DealerOrganisation(Base):
    __tablename__ = 'DimDealerOrganisation'
    DealerOrganisationKey = sa.Column(sa.Integer, primary_key=True)
    # ... columns ...
    DealerOrganizationAddressKey = sa.Column(sa.Integer, sa.ForeignKey('DimAddress.AddressKey'), nullable=False)
    # ... columns ...
    address = relationship('Address')
I can get dealer organizations and their address, if present, as follows:
session = Session()
query = session.query(DealerOrganisation).outerjoin(DealerOrganisation.address).options(contains_eager(DealerOrganisation.address))
This gives me SQL approximately like this:
SELECT *
FROM DimDealerOrganisation
LEFT JOIN DimAddress ON AddressKey = DealerOrganizationAddressKey
But what if I want to do an ORM query for only a subset of related objects:
SELECT *
FROM DimDealerOrganisation
LEFT JOIN DimAddress ON AddressKey = DealerOrganizationAddressKey AND ZipCode = '90210'
That is, I want all the dealers, but I only want their address if the zip code is 90210. As far as I can tell, join() and outerjoin() let you specify either a relationship or an explicit condition, but not both. In this contrived example I could use an explicit condition and get back rows instead of ORM objects, but it would be unwieldy in a real query involving multiple tables and one-to-many relations. I want to add additional conditions to the on clause but still have it populate the address attribute of the returned DealerOrganisation objects. Is this possible?
You will want to use SQLAlchemy's and_() construct in the join condition. I think it will look something like:
session = Session()
session.query(Table1).join(Table2, and_(Table1.address==Table2.address, Table1.zip == '90210'), isouter=True)
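The difference between putting the zip code test in the ON clause and putting it in a WHERE clause can be seen with a small sqlite3 sketch. The two dealers and two addresses are invented sample data, and only the columns named in the question are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DimAddress (AddressKey INTEGER PRIMARY KEY, ZipCode TEXT);
    CREATE TABLE DimDealerOrganisation (
        DealerOrganisationKey INTEGER PRIMARY KEY,
        DealerOrganizationAddressKey INTEGER NOT NULL
            REFERENCES DimAddress(AddressKey)
    );
    INSERT INTO DimAddress VALUES (1, '90210'), (2, '10001');
    INSERT INTO DimDealerOrganisation VALUES (10, 1), (20, 2);
    """
)

# Extra condition in the ON clause: every dealer is kept, address only if it matches
on_rows = conn.execute(
    """
    SELECT d.DealerOrganisationKey, a.AddressKey, a.ZipCode
    FROM DimDealerOrganisation AS d
    LEFT JOIN DimAddress AS a
        ON a.AddressKey = d.DealerOrganizationAddressKey AND a.ZipCode = '90210'
    ORDER BY d.DealerOrganisationKey
    """
).fetchall()

# Same condition in WHERE: dealers without a matching address are filtered out
where_rows = conn.execute(
    """
    SELECT d.DealerOrganisationKey, a.AddressKey, a.ZipCode
    FROM DimDealerOrganisation AS d
    LEFT JOIN DimAddress AS a
        ON a.AddressKey = d.DealerOrganizationAddressKey
    WHERE a.ZipCode = '90210'
    ORDER BY d.DealerOrganisationKey
    """
).fetchall()

print(on_rows)
print(where_rows)
```

With the question's own models, the ORM form would presumably be session.query(DealerOrganisation).outerjoin(Address, and_(Address.AddressKey == DealerOrganisation.DealerOrganizationAddressKey, Address.ZipCode == '90210')).options(contains_eager(DealerOrganisation.address)). This assumes Address has a ZipCode column, and I have not run it against those models.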
I have two tables, Entry and Group, defined in Python using Flask-SQLAlchemy and connected to a PostgreSQL database:
class Entry(db.Model):
    __tablename__ = "entry"
    id = db.Column('id', db.Integer, primary_key=True)
    group_title = db.Column('group_title', db.Unicode, db.ForeignKey('group.group_title'))
    url = db.Column('url', db.Unicode)
    next_entry_id = db.Column('next_entry_id', db.Integer, db.ForeignKey('entry.id'))
    entry = db.relationship('Entry', foreign_keys=next_entry_id)
    group = db.relationship('Group', foreign_keys=group_title)

class Group(db.Model):
    __tablename__ = "group"
    group_title = db.Column('group_title', db.Unicode, primary_key=True)
    group_start_id = db.Column('group_start_id', db.Integer)
    # etc.
I am trying to combine the two tables with a natural join using the Entry.id and Group.group_start_id as the common field.
I have been able to query a single table for all records. But I want to join tables by foreign key ID to give records relating Group.group_start_id and Group.group_title to a specific Entry record.
I am having trouble with the Flask-SQLAlchemy query syntax or process.
I have tried several approaches (to list a few):
db.session.query(Group, Entry.id).all()
db.session.query(Entry, Group).all()
db.session.query.join(Group).first()
db.session.query(Entry, Group).join(Group)
All of them have returned a list of tuples that is bigger than expected and does not contain what I want.
I am looking for the following result:
(Entry.id, group_title, Group.group_start_id, Entry.url)
I would be grateful for any help.
I used the following query to perform a natural join of the Group and Entry tables:
db.session.query(Entry, Group).join(Group).filter(Group.group_start_id == Entry.id).order_by(Group.order.asc())
I did this using the .join() method in my query, which let me join the Group table to the Entry table. I then filtered the results on Group.group_start_id, a foreign key in the Group table that refers to Entry.id, the primary key of the Entry table.
Since the relationship() calls already describe how the tables are linked, we can focus on getting the data you want. A query such as db.session.query(Entry, Group).all() returns tuples of type (Entry, Group), from which you can easily do something like:
test = db.session.query(Entry, Group).one()
print(test[0].id) #prints entry.id
print(test[1].group_start_id) # prints Group.group_start_id
#...
SQLAlchemy has a great article on how joins work.
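One likely reason the attempts in the question returned more rows than expected is that selecting from two tables without a join condition yields every combination of rows. The sqlite3 sketch below contrasts that with a join on group_start_id = id. The sample rows and URLs are invented, and the next_entry_id column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "group" (group_title TEXT PRIMARY KEY, group_start_id INTEGER);
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY,
        group_title TEXT REFERENCES "group"(group_title),
        url TEXT
    );
    INSERT INTO "group" VALUES ('news', 1), ('blog', 3);
    INSERT INTO entry VALUES
        (1, 'news', 'http://example.com/a'),
        (2, 'news', 'http://example.com/b'),
        (3, 'blog', 'http://example.com/c');
    """
)

# No join condition: 3 entries x 2 groups
cross_count = conn.execute('SELECT COUNT(*) FROM entry, "group"').fetchone()[0]

# Join each group to the entry it starts with
joined = conn.execute(
    """
    SELECT e.id, g.group_title, g.group_start_id, e.url
    FROM entry AS e
    JOIN "group" AS g ON g.group_start_id = e.id
    ORDER BY e.id
    """
).fetchall()

print(cross_count)
print(joined)
```

In Flask-SQLAlchemy the same shape could probably be written as db.session.query(Entry.id, Group.group_title, Group.group_start_id, Entry.url).select_from(Entry).join(Group, Group.group_start_id == Entry.id), which returns plain column tuples instead of (Entry, Group) pairs; treat that as an untested suggestion.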
I have 6 tables in my SQLite database, each table with 6 columns(Date, user, NormalA, specialA, contact, remarks) and 1000+ rows.
How can I use sqlalchemy to sort through the Date column to look for duplicate dates, and delete that row?
Assuming this is your model:
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    user = Column(String)
    # do not really care of columns other than `id` and `date`
    # important here is the fact that `id` is a PK
The following are two ways to delete your data:
Find duplicates, mark them for deletion and commit the transaction
Create a single SQL query which will perform deletion on the database directly.
For both of them a helper sub-query will be used:
# helper subquery: find first row (by primary key) for each unique date
subq = (
    session.query(MyTable.date, func.min(MyTable.id).label("min_id"))
    .group_by(MyTable.date)
).subquery('date_min_id')
Option-1: Find duplicates, mark them for deletion and commit the transaction
# query to find all duplicates
q_duplicates = (
    session
    .query(MyTable)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
)
for x in q_duplicates:
    print("Will delete %s" % x)
    session.delete(x)
session.commit()
Option-2: Create a single SQL query which will perform deletion on the database directly
sq = (
    session
    .query(MyTable.id)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
).subquery("subq")
dq = (
    session
    .query(MyTable)
    .filter(MyTable.id.in_(sq))
).delete(synchronize_session=False)
Inspired by "Find duplicate values in SQL table", this might help you select duplicate dates:
query = (
    session.query(MyTable)
    .group_by(MyTable.date)
    .having(func.count(MyTable.date) > 1)
    .all()
)
If you only want to show unique dates, DISTINCT ON might be what you need.
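Selecting whole rows while grouping by a single column is accepted by SQLite but rejected by stricter databases such as PostgreSQL, so a more portable way to list duplicate dates is to select only the grouped column plus aggregates. A minimal sqlite3 sketch with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, date TEXT, user TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, date, user) VALUES (?, ?, ?)",
    [
        (1, "2016-12-18", "a"),
        (2, "2016-12-18", "b"),
        (3, "2016-12-19", "c"),
        (4, "2016-12-18", "d"),
    ],
)

# One result row per duplicated date: how often it occurs and which id would be kept
duplicates = conn.execute(
    """
    SELECT date, COUNT(*) AS n, MIN(id) AS keep_id
    FROM my_table
    GROUP BY date
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)
```

The keep_id column plays the same role as min_id in the helper subquery approach: every row with that date and a different id is a duplicate.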
While I like the whole object-oriented approach of SQLAlchemy, sometimes I find it easier to use some SQL directly.
And since the records don't have a key, we need the row number (_ROWID_) to delete the targeted records and I don't think the API provides it.
So first we connect to the database:
from sqlalchemy import create_engine
db = create_engine(r'sqlite:///C:\temp\example.db')
eng = db.engine
Then to list all the records:
for row in eng.execute("SELECT * FROM TableA;"):
    print(row)
And to display all the duplicated records where the dates are identical:
for row in eng.execute("""
    SELECT * FROM {table}
    WHERE {field} IN (SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1)
    ORDER BY {field};
    """.format(table="TableA", field="Date")):
    print(row)
Now that we identified all the duplicates, they probably need to be fixed if the other fields are different:
eng.execute("UPDATE TableA SET NormalA=18, specialA=20 WHERE Date = '2016-18-12' ;");
eng.execute("UPDATE TableA SET NormalA=4, specialA=8 WHERE Date = '2015-18-12' ;");
And finally, to keep the first inserted record and delete the more recent duplicates:
print(eng.execute("""
    DELETE FROM {table}
    WHERE _ROWID_ NOT IN (SELECT MIN(_ROWID_) FROM {table} GROUP BY {field});
    """.format(table="TableA", field="Date")).rowcount)
Or to keep the last inserted record and delete the other duplicated records :
print(eng.execute("""
    DELETE FROM {table}
    WHERE _ROWID_ NOT IN (SELECT MAX(_ROWID_) FROM {table} GROUP BY {field});
    """.format(table="TableA", field="Date")).rowcount)
I'm working with a database that has a relationship that looks like:
class Source(Model):
    id = Identifier()

class SourceA(Source):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)

class SourceB(Source):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)

class SourceC(Source, ServerOptions):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)
What I want to do is join all tables Source, SourceA, SourceB, SourceC and then order_by name.
Sounds easy to me, but I've been banging my head on this for a while now and my head's starting to hurt. Also, I'm not very familiar with SQL or sqlalchemy, so there's been a lot of browsing the docs, but to no avail. Maybe I'm just not seeing it. This seems to be close, albeit related to a newer version than what I have available (see versions below).
I feel close, not that that means anything. Here's my latest attempt, which seems good up until the order_by call.
Sources = [SourceA, SourceB, SourceC]
# list of join on Source
joins = [session.query(Source).join(source) for source in Sources]
# union the list of joins
query = joins.pop(0).union_all(*joins)
query seems right at this point as far as I can tell i.e. query.all() works. So now I try to apply order_by which doesn't throw an error until .all is called.
Attempt 1: I just use the attribute I want
query.order_by('name').all()
# throws sqlalchemy.exc.ProgrammingError: (ProgrammingError) column "name" does not exist
Attempt 2: I just use the defined column attribute I want
query.order_by(SourceA.name).all()
# throws sqlalchemy.exc.ProgrammingError: (ProgrammingError) missing FROM-clause entry for table "SourceA"
Is it obvious? What am I missing? Thanks!
versions:
sqlalchemy.version = '0.8.1'
(PostgreSQL) 9.1.3
EDIT
I'm dealing with a framework that wants a handle to a query object. I have a bare query that appears to accomplish what I want but I would still need to wrap it in a query object. Not sure if that's possible. Googling ...
select = """
select s.*, a.name from Source s inner join SourceA a on s.id = a.Source_id
union
select s.*, b.name from Source s inner join SourceB b on s.id = b.Source_id
union
select s.*, c.name from Source s inner join SourceC c on s.id = c.Source_id
ORDER BY "name";
"""
selectText = text(select)
result = session.execute(selectText)
# how to put result into a query. maybe Query(selectText)? googling...
result.fetchall()
Assuming that the coalesce function is good enough, the examples below should point you in the right direction. One option builds the list of child classes automatically, while the other is explicit.
This is not the query you specified in your edit, but it does let you sort, which was your original request:
from sqlalchemy import func
from sqlalchemy.orm import with_polymorphic

def test_explicit():
    # specify all children tables to be queried
    Sources = [SourceA, SourceB, SourceC]
    AllSources = with_polymorphic(Source, Sources)
    name_col = func.coalesce(*(_s.name for _s in Sources)).label("name")
    query = session.query(AllSources).order_by(name_col)
    for x in query:
        print(x)

def test_implicit():
    # get all children tables in the query
    from sqlalchemy.orm import class_mapper
    _map = class_mapper(Source)
    Sources = [
        _smap.class_
        for _smap in _map.self_and_descendants
        if _smap != _map  # note: exclude base class, it has no `name`
    ]
    AllSources = with_polymorphic(Source, Sources)
    name_col = func.coalesce(*(_s.name for _s in Sources)).label("name")
    query = session.query(AllSources).order_by(name_col)
    for x in query:
        print(x)
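The idea behind the coalesce ordering can be tried out in plain SQL. The sqlite3 sketch below assumes the polymorphic query boils down to left outer joins from the base table to each child table, so that exactly one of the three name columns is non-NULL per row. The table names (source, source_a, source_b, source_c) and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER PRIMARY KEY);
    CREATE TABLE source_a (
        source_id INTEGER PRIMARY KEY REFERENCES source(id), name TEXT NOT NULL
    );
    CREATE TABLE source_b (
        source_id INTEGER PRIMARY KEY REFERENCES source(id), name TEXT NOT NULL
    );
    CREATE TABLE source_c (
        source_id INTEGER PRIMARY KEY REFERENCES source(id), name TEXT NOT NULL
    );
    INSERT INTO source (id) VALUES (1), (2), (3);
    INSERT INTO source_a VALUES (1, 'zeta');
    INSERT INTO source_b VALUES (2, 'alpha');
    INSERT INTO source_c VALUES (3, 'mid');
    """
)

# COALESCE picks whichever child table supplied a name for the row
rows = conn.execute(
    """
    SELECT s.id, COALESCE(a.name, b.name, c.name) AS name
    FROM source AS s
    LEFT JOIN source_a AS a ON a.source_id = s.id
    LEFT JOIN source_b AS b ON b.source_id = s.id
    LEFT JOIN source_c AS c ON c.source_id = s.id
    ORDER BY name
    """
).fetchall()
print(rows)
```

Because the ordering expression is computed in the same SELECT, there is a single unambiguous name to sort on, which is what the plain order_by('name') attempt was missing.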
Your first attempt sounds like it isn't working because there is no name in Source, which is the root table of the query. In addition, there will be multiple name columns after your joins, so you will need to be more specific. Try
query.order_by('SourceA.name').all()
As for your second attempt, what is ServerA?
query.order_by(ServerA.name).all()
Probably a typo, but not sure if it's for SO or your code. Try:
query.order_by(SourceA.name).all()