python sqlalchemy distinct column values

I have 6 tables in my SQLite database, each table with 6 columns (Date, user, NormalA, specialA, contact, remarks) and 1000+ rows.
How can I use SQLAlchemy to sort through the Date column, look for duplicate dates, and delete those rows?

Assuming this is your model:
from sqlalchemy import Column, DateTime, Integer, String, and_, func

# `Base` is your declarative base and `session` an already configured Session
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    user = Column(String)
    # we do not really care about columns other than `id` and `date`;
    # what is important here is the fact that `id` is the primary key
Following are two ways to delete your data:
1. Find the duplicates, mark them for deletion and commit the transaction.
2. Create a single SQL query which will perform the deletion on the database directly.
For both of them a helper sub-query will be used:
# helper subquery: find the first row (by primary key) for each unique date
subq = (
    session.query(MyTable.date, func.min(MyTable.id).label("min_id"))
    .group_by(MyTable.date)
).subquery('date_min_id')
Option-1: Find duplicates, mark them for deletion and commit the transaction
# query to find all duplicates
q_duplicates = (
    session
    .query(MyTable)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
)

for x in q_duplicates:
    print("Will delete %s" % x)
    session.delete(x)
session.commit()
Option-2: Create a single SQL query which will perform deletion on the database directly
sq = (
    session
    .query(MyTable.id)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
).subquery("subq")

dq = (
    session
    .query(MyTable)
    .filter(MyTable.id.in_(sq))
).delete(synchronize_session=False)

Inspired by Find duplicate values in SQL table, this might help you to select the duplicate dates:
query = session.query(
    MyTable
).\
    group_by(MyTable.date).\
    having(func.count(MyTable.date) > 1).all()
If you only want to show the unique dates, distinct() is what you might need.
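For example, a minimal sketch of that, assuming the same MyTable model and session as above:
# list each date only once
unique_dates = session.query(MyTable.date).distinct().all()
for (d,) in unique_dates:
    print(d)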

While I like the whole object-oriented approach with SQLAlchemy, sometimes I find it easier to use some SQL directly.
And since the records don't have a primary key, we need the SQLite row number (_ROWID_) to delete the targeted records, and I don't think the ORM exposes it.
So first we connect to the database:
from sqlalchemy import create_engine
db = create_engine(r'sqlite:///C:\temp\example.db')
eng = db.engine
Then to list all the records:
for row in eng.execute("SELECT * FROM TableA;"):
    print(row)
And to display all the duplicated records where the dates are identical:
for row in eng.execute("""
        SELECT * FROM {table}
        WHERE {field} IN (SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1)
        ORDER BY {field};
        """.format(table="TableA", field="Date")):
    print(row)
Now that we have identified all the duplicates, they probably need to be fixed first if the other fields differ:
eng.execute("UPDATE TableA SET NormalA=18, specialA=20 WHERE Date = '2016-18-12' ;");
eng.execute("UPDATE TableA SET NormalA=4, specialA=8 WHERE Date = '2015-18-12' ;");
And finally, to keep the first inserted record and delete the more recent duplicated records:
print(eng.execute("""
    DELETE FROM {table}
    WHERE _ROWID_ NOT IN (SELECT MIN(_ROWID_) FROM {table} GROUP BY {field});
    """.format(table="TableA", field="Date")).rowcount)
Or to keep the last inserted record and delete the other duplicated records:
print(eng.execute("""
    DELETE FROM {table}
    WHERE _ROWID_ NOT IN (SELECT MAX(_ROWID_) FROM {table} GROUP BY {field});
    """.format(table="TableA", field="Date")).rowcount)

Related

Django: Delete duplicate rows and keep the last using an SQL query

I need to execute an SQL query that deletes the duplicated rows based on one column and keeps the last record. Note that it's a large table, so the Django ORM takes a very long time, which is why I need an SQL query instead. The column name is customer_number and the table name is pages_dataupload. I'm using SQLite.
Update: I tried this, but it gives me no such column: row_num.
cursor = connection.cursor()
cursor.execute(
    '''WITH cte AS (
        SELECT
            id,
            customer_number,
            ROW_NUMBER() OVER (
                PARTITION BY
                    id,
                    customer_number
                ORDER BY
                    id,
                    customer_number
            ) row_num
        FROM
            pages.dataupload
    )
    DELETE FROM pages_dataupload
    WHERE row_num > 1;
    '''
)
You can work with an Exists subquery [Django-doc] to efficiently determine whether there is a younger DataUpload:
from django.db.models import Exists, OuterRef

DataUpload.objects.filter(Exists(
    DataUpload.objects.filter(
        pk__gt=OuterRef('pk'), customer_number=OuterRef('customer_number')
    )
)).delete()
This will thus check for each DataUpload if there exists a DataUpload with a larger primary key that has the same customer_number. If that is the case, we will remove that DataUpload.
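Since the original goal was a plain SQL statement, roughly the same logic can also be sent straight through the connection cursor; a sketch, assuming the default id primary key on pages_dataupload:
from django.db import connection

# delete every row for which a newer row (larger id) with the same
# customer_number exists, keeping only the last record per customer_number
with connection.cursor() as cursor:
    cursor.execute(
        '''
        DELETE FROM pages_dataupload
        WHERE EXISTS (
            SELECT 1 FROM pages_dataupload AS newer
            WHERE newer.customer_number = pages_dataupload.customer_number
              AND newer.id > pages_dataupload.id
        );
        '''
    )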
I have solved the problem with the query below; is there any way to reset the id field after removing the duplicates?
cursor = connection.cursor()
cursor.execute(
    '''
    DELETE FROM pages_dataupload WHERE id NOT IN (
        SELECT MAX(id) FROM pages_dataupload GROUP BY Dial
    )
    '''
)

SQLAlchemy: Delete all duplicate rows [duplicate]

I'm using SQLAlchemy to manage a database and I'm trying to delete all rows that contain duplicates. The table has an id (primary key) and domain name.
Example:
ID | Domain
1  | example-1.com
2  | example-2.com
3  | example-1.com
In this case I want to delete one instance of example-1.com. Sometimes I will need to delete more than one, but in general the database should not contain a domain more than once, and if it does, only the first row should be kept and the others deleted.
Assuming your model looks something like this:
import sqlalchemy as sa

class Domain(Base):
    __tablename__ = 'domain_names'
    id = sa.Column(sa.Integer, primary_key=True)
    domain = sa.Column(sa.String)
Then you can delete the duplicates like this:
# Create a query that identifies the row with the lowest id for each domain
inner_q = session.query(sa.func.min(Domain.id)).group_by(Domain.domain)
aliased = sa.alias(inner_q)

# Select the rows that do not match the subquery
q = session.query(Domain).filter(~Domain.id.in_(aliased))

# Mark the unmatched rows for deletion (the DELETE statements are emitted on commit)
for domain in q:
    session.delete(domain)
session.commit()

# Show remaining rows
for domain in session.query(Domain):
    print(domain)
print()
If you are not using the ORM, the core equivalent is:
meta = sa.MetaData()
domains = sa.Table('domain_names', meta, autoload=True, autoload_with=engine)

inner_q = sa.select([sa.func.min(domains.c.id)]).group_by(domains.c.domain)
aliased = sa.alias(inner_q)

with engine.connect() as conn:
    conn.execute(domains.delete().where(~domains.c.id.in_(aliased)))
This answer is based on the SQL provided in this answer. There are other ways of deleting duplicates, which you can see in the other answers on the link, or by googling "sql delete duplicates" or similar.
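For instance, the same clean-up can be done as a single bulk DELETE through the ORM without loading the rows, mirroring Option-2 from the first answer above; a sketch using the same Domain model and session:
# ids to keep: the lowest id per domain
keep_q = session.query(sa.func.min(Domain.id)).group_by(Domain.domain)

# one DELETE ... WHERE id NOT IN (...) statement, no per-row loading
session.query(Domain).filter(
    ~Domain.id.in_(keep_q)
).delete(synchronize_session=False)
session.commit()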

How do I add 1 to a SQLite3 Integer (python)

I am trying to make a table that logs the number of times a player has logged in, by UUID and number of joins.
Here is what my table looks like (for testing).
I would like my program to check whether the UUID is already in the database and, if so, add 1 to the number of joins.
import sqlite3

connection = sqlite3.connect("joins.db")
cursor = connection.cursor()

try:
    cursor.execute("CREATE TABLE joinsdatabase (uuid TEXT, joins INTEGER)")
except:
    print("Table exists: Not creating a new one!")

def addP(player_uuid):
    rows = cursor.execute("SELECT uuid, joins FROM joinsdatabase").fetchall()
    cursor.execute("INSERT INTO joinsdatabase VALUES ('" + player_uuid + "', 1)")
    connection.commit()
If your version of SQLite is 3.24.0+ you can use UPSERT, but first you must define the column uuid as PRIMARY KEY or UNIQUE.
So drop the table that you have with:
DROP TABLE joinsdatabase;
and then recreate it:
CREATE TABLE joinsdatabase (uuid TEXT PRIMARY KEY, joins INTEGER DEFAULT 1)
This way also, the default value of joins for a new row will be 1, so there is no need to set it in the INSERT statement.
Now you can use UPSERT like this:
def addP(player_uuid):
    sql = """INSERT INTO joinsdatabase(uuid) VALUES (?)
             ON CONFLICT(uuid) DO UPDATE
             SET joins = joins + 1"""
    cursor.execute(sql, (player_uuid,))
    connection.commit()
You don't need this line:
rows = cursor.execute("SELECT uuid, joins FROM joinsdatabase").fetchall()
inside addP().
If your version of SQLite does not support UPSERT you will need 2 statements.
def addP(player_uuid):
    sql = "UPDATE joinsdatabase SET joins = joins + 1 WHERE uuid = ?"
    cursor.execute(sql, (player_uuid,))
    sql = "INSERT OR IGNORE INTO joinsdatabase(uuid) VALUES (?)"
    cursor.execute(sql, (player_uuid,))
    connection.commit()
The UPDATE statement will update the row if the uuid exists in the table and the INSERT statement will insert the row if the uuid does not exist in the table.
This will also work if uuid is unique.
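A quick usage sketch under those assumptions (the UUID value is made up for illustration):
# the first call inserts the row with joins = 1, later calls bump the counter
addP("123e4567-e89b-12d3-a456-426614174000")
addP("123e4567-e89b-12d3-a456-426614174000")

for uuid, joins in cursor.execute("SELECT uuid, joins FROM joinsdatabase"):
    print(uuid, joins)   # -> 123e4567-e89b-12d3-a456-426614174000 2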
Try this; you may need to play about with the variables (oldCount here would come from a prior SELECT of the current joins value for that uuid):
cursor.execute("UPDATE joinsdatabase SET joins = ? WHERE uuid = ?", (int(oldCount) + 1, player_uuid))

Insert unique auto-increment ID on python to sql?

I'd like to insert an Order_ID to make each row unique, using Python and pyodbc with SQL Server.
Currently, my code is:
name = input("Your name")

def connectiontoSQL(order_id, name):
    query = f'''\
        insert into order (Order_ID, Name)
        values('{order_id}','{name}')'''
    return execute_query_commit(conn, query)
If my table in the SQL database is empty and I'd like it to increase the Order_ID by 1 every time I execute,
how should I code order_id in Python so that it automatically creates the first Order_ID as OD001, and on the next execution creates OD002?
You can create an INT IDENTITY column as your primary key and add a computed column that holds the order number you display in your application.
create table Orders
(
    [OrderId] [int] IDENTITY(0,1) NOT NULL,
    [OrderNumber] as 'OD' + right('00000' + cast(OrderId as varchar(6)), 6),
    [OrderDate] date,
    PRIMARY KEY CLUSTERED
    (
        [OrderId] ASC
    )
)
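With a table like that, the Python side never has to generate the id at all; a hedged pyodbc sketch (conn_str is an assumed connection string, and the whole batch is sent in one execute so SCOPE_IDENTITY() can see the insert):
import pyodbc

conn = pyodbc.connect(conn_str)   # conn_str is assumed to be defined elsewhere
cursor = conn.cursor()

# insert only the business columns; OrderId and OrderNumber are filled in by SQL Server
row = cursor.execute(
    "SET NOCOUNT ON;"
    "INSERT INTO Orders (OrderDate) VALUES (?);"
    "SELECT OrderNumber FROM Orders WHERE OrderId = SCOPE_IDENTITY();",
    "2023-01-15",
).fetchone()
conn.commit()

print(row.OrderNumber)   # e.g. 'OD000000' for the first row, since IDENTITY starts at 0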

Count the number of non-null values in each column of each table in MySQL

Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors

# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)

# Get column metadata
sql = """SELECT *
         FROM `INFORMATION_SCHEMA`.`COLUMNS`
         WHERE `TABLE_SCHEMA`='my_database'
      """
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()

# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]

# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME == my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',', '')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])

# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match in the linked answer is essentially the method you're already using, just implemented less efficiently in reflective SQL.
I'd do the same as you did - build an SQL statement like
SELECT
    COUNT(*) AS `count`,
    SUM(IF(columnName1 IS NULL, 1, 0)) AS columnName1,
    ...
    SUM(IF(columnNameN IS NULL, 1, 0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
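A hedged Python sketch of that approach, reusing the pymysql connection from the question (my_database is the assumed schema name; the per-column numbers here are NULL counts, matching the query above):
# gather every (table, column) pair of the schema
results = []
with connection.cursor() as cursor:
    cursor.execute(
        """SELECT TABLE_NAME, COLUMN_NAME
           FROM INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = %s
           ORDER BY TABLE_NAME, ORDINAL_POSITION""",
        ('my_database',))
    columns = cursor.fetchall()

tables = {}
for row in columns:
    tables.setdefault(row['TABLE_NAME'], []).append(row['COLUMN_NAME'])

# one aggregate query per table, then flatten the single row returned
with connection.cursor() as cursor:
    for table, cols in tables.items():
        select_list = ", ".join(
            "SUM(IF(`{0}` IS NULL, 1, 0)) AS `{0}`".format(c) for c in cols)
        cursor.execute("SELECT COUNT(*) AS `__total`, {0} FROM `{1}`".format(select_list, table))
        counts = cursor.fetchone()
        total = counts.pop('__total')
        for column_name, nulls in counts.items():
            results.append((table, column_name, total, nulls))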
It is possible, but it's not going to be quick.
As mentioned in a previous answer, you can work your way through the columns table in information_schema to build the queries that get the counts. It's then just a question of how long you are prepared to wait for the answer, because you end up counting every row of every column in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. only keep IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- sname: schema name to check
    -- e.g. call non_nulls('sakila');
    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;
    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        rows BIGINT,
        populated BIGINT
    );

    OPEN cur1;
    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- build and run one COUNT query per column of the schema
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,rows,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;
    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;
call non_nulls('sakila');
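If you want to drive the procedure from Python as well, a hedged sketch with the same pymysql connection as above (the result set comes from the final SELECT inside the procedure):
# call the stored procedure and read back the per-column counts
with connection.cursor() as cursor:
    cursor.callproc('non_nulls', ('my_database',))
    for row in cursor.fetchall():
        print(row['table_name'], row['column_name'], row['rows'], row['populated'])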
