Suppose I have a table with many columns. If two rows have exactly the same values in certain columns, then they are duplicates.
ID | title | link | size | author
Suppose link and size are the same for two or more rows; then those rows are duplicates.
How do I get those duplicates into a list and process them?
This will return all records that have duplicates:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
I like the subquery because I can do things like select all but the first or last row, which is very easy to turn into a delete query (see the sketch after the example below).
Example: select all duplicate records EXCEPT the one with the max ID:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size, max(ID) as maxID
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link
AND theTable.size = dups.size
AND theTable.ID <> dups.maxID
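And a sketch of the corresponding delete, assuming MySQL (where the derived table is materialized, which is what lets theTable appear both as the delete target and inside the subquery):
DELETE theTable
FROM theTable
INNER JOIN (
    SELECT link, size, MAX(ID) AS maxID
    FROM theTable
    GROUP BY link, size
    HAVING COUNT(ID) > 1
) dups ON theTable.link = dups.link
      AND theTable.size = dups.size
      AND theTable.ID <> dups.maxID;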
Assuming that none of id, link, or size can be NULL, and that the id field is the primary key, this gives you the ids of duplicate rows. Beware that the same id can appear in the results several times if there are three or more rows with identical link and size values.
select a.id, b.id
from tbl a, tbl b
where a.id < b.id
and a.link = b.link
and a.size = b.size
After you remove the duplicates from the MySQL table, you can add a unique index
to the table so no more duplicates can be inserted:
create unique index theTable_index on theTable (link,size);
If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator over a SELECT * FROM thetable ORDER BY link, size and process it with itertools.groupby using, as the key, an operator.itemgetter for those two fields; this will present natural groupings of each bunch of 1+ rows with identical values for the fields in question.
I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!
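For illustration, a minimal sketch of the groupby route (assuming sqlite3 and that the table is really named thetable; swap in your own driver, file name, and processing function):
import sqlite3
from itertools import groupby
from operator import itemgetter

con = sqlite3.connect('mydata.db')   # hypothetical database file
con.row_factory = sqlite3.Row        # lets us index rows by column name
cur = con.execute('SELECT * FROM thetable ORDER BY link, size')

# groupby yields one bunch per distinct (link, size) pair, in sort order
for (link, size), rows in groupby(cur, key=itemgetter('link', 'size')):
    rows = list(rows)
    if len(rows) > 1:                # 2+ rows with the same link and size are duplicates
        process_duplicates(rows)     # placeholder for whatever processing you want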
I have the following code in Python to update a db where the first column is "id" INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE:
import sqlite3 as lite

con = lite.connect('test_score.db')
with con:
    cur = con.cursor()
    cur.execute("INSERT INTO scores VALUES (NULL,?,?,?)", (first, last, score))
    item = cur.fetchone()
con.commit()
cur.close()
con.close()
I get a table "scores" with the following data:
1,Adam,Smith,68
2,John,Snow,76
3,Jim,Green,88
Two different users (userA and userB) copy test_score.db and the code to their own computers and use them separately.
I get back two test_score.db files, now with different content:
user A test_score.db :
1,Adam,Smith,68
2,John,Snow,76
3,Jim,Green,88
4,Jim,Green,91
5,Tom,Hanks,15
user B test_score.db :
1,Adam,Smith,68
2,John,Snow,76
3,Jim,Green,88
4,Chris,Prat,99
5,Tom,Hanks,09
6,Tom,Hanks,15
I was trying to use
insert into AuditRecords select * from toMerge.AuditRecords;
to combine the two dbs into one, but it failed because the first column is a unique id. The two dbs now have the same ids, with different or the same data, so the merge fails.
I would like to find the unique rows in both dbs (rows whose values differ, ignoring the id) and merge the results into one full db.
Result should be something like this:
1,Adam,Smith,68
2,John,Snow,76
3,Jim,Green,88
4,Jim,Green,91
5,Tom,Hanks,15
6,Chris,Prat,99
7,Tom,Hanks,09
I could extract each value one by one and compare, but I want to avoid that since I might have longer rows with more columns in the future.
Sorry if this is an obvious and easy question; I'm still learning. I tried to find the answer but failed, so please point me to it if it already exists somewhere else. Thank you very much for your help.
You need to define how to resolve duplicated rows. Will you keep the max score? The min? The first one?
Assuming the table AuditRecords has all the rows from both user A and user B, you can use GROUP BY to deduplicate rows and an aggregation function to resolve the score:
insert into
AuditRecords
select
id,
first_name,
last_name,
max(score) as score
from
toMerge.AuditRecords
group by
id,
first_name,
last_name;
For this requirement you should have defined a UNIQUE constraint for the combination of the columns first, last and score:
CREATE TABLE AuditRecords(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    first TEXT,
    last TEXT,
    score INTEGER,
    UNIQUE(first, last, score)
);
Now you can use INSERT OR IGNORE to merge the tables:
INSERT OR IGNORE INTO AuditRecords(first, last, score)
SELECT first, last, score
FROM toMerge.AuditRecords;
Note that you must explicitly list the columns that will receive the values; the id is omitted from this list because its value is autoincremented on each insertion.
Another way to do it without defining the UNIQUE constraint is to use EXCEPT:
INSERT INTO AuditRecords(first, last, score)
SELECT first, last, score FROM toMerge.AuditRecords
EXCEPT
SELECT first, last, score FROM AuditRecords
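Putting it together, a minimal sketch of the merge in Python with sqlite3 (the file names here are made up; toMerge matches the alias used in the queries above):
import sqlite3

con = sqlite3.connect('test_score_userA.db')                     # hypothetical file name
con.execute("ATTACH DATABASE 'test_score_userB.db' AS toMerge")  # hypothetical file name
with con:  # the with-block commits on success
    con.execute("""
        INSERT OR IGNORE INTO AuditRecords(first, last, score)
        SELECT first, last, score
        FROM toMerge.AuditRecords
    """)
con.execute("DETACH DATABASE toMerge")
con.close()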
I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each. I can't create a dictionary of existing package_id items in memory.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
q = (A
     .select(fn.COUNT(A.id))
     .where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
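Since you mention a LEFT JOIN: the same count can also be written as an anti-join (a sketch with the same models; whether it beats NOT IN depends on your engine and indexes):
from peewee import fn, JOIN

q = (A
     .select(fn.COUNT(A.id))
     .join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id))
     .where(B.package_id.is_null(True)))
count = q.scalar()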
How can I make peewee put the ids of related table rows into an additional list-like field in the resulting query?
I want to make a duplicate-detecting manager for media files. For each file on my PC I have a record in the database with fields like
File name, Size, Path, SHA3-512, Perceptual hash, Tags, Comment, Date added, Date changed, etc...
Depending on the situation I want to use different patterns for deciding which records in the table count as duplicates.
In the simplest case I just want to see all records having the same hash, so I write
subq = Record.select(Record.SHA).group_by(Record.SHA).having(peewee.fn.Count(Record.id) > 1)
subq = subq.alias('jq')
q = Record.select().join(subq, on=(Record.SHA == subq.c.SHA)).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)
and everything is fine.
But there are a lot of cases when I want to use different sets of table columns as grouping patterns. In the worst case I use all of them except the id and "Date added" columns, to detect exactly duplicated rows in the database (when I have just re-added the same file a few times), which leads to a monster like
subq = Record.select(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).having(peewee.fn.Count(Record.id) > 1)
subq = subq.alias('jq')
q = Record.select().join(subq, on=(
    (Record.SHA == subq.c.SHA) &
    (Record.Name == subq.c.Name) &
    (Record.Date == subq.c.Date) &
    (Record.Size == subq.c.Size) &
    (Record.Tags == subq.c.Tags))).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)
and this is not the full list of my fields, just an example.
I have to do the same thing for other patterns of field sets, i.e. duplicate the field list three times: in the select clause and the group-by clause of the subquery, and then again in the join condition.
I wish I could just group the records with the appropriate pattern and have peewee list the ids of all the members of each group in a new list-like field, like
q=Record.select(Record, SOME_MAJIC.alias('duplicates')).group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).having(peewee.fn.Count() > 1).SOME_ANOTHER_MAJIC
for r in q:
    process_group_of_records(r)  # r.duplicates == [23, 44, 45, 56, 100], for example
How can I do this? Listing the same parameters thrice, I really feel like I'm doing something wrong.
You can use GROUP_CONCAT (or for postgres, array_agg) to group and concatenate a list of ids/filenames, whatever.
So for files with the same hash:
query = (Record
         .select(Record.sha, fn.GROUP_CONCAT(Record.id).alias('id_list'))
         .group_by(Record.sha)
         .having(fn.COUNT(Record.id) > 1))
This is a relational database. So you're dealing all the time, everywhere, with tables consisting of rows and columns. There's no "nesting". GROUP_CONCAT is about as close as you can get.
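As for listing the fields three times: you can build the field list once in Python and reuse it. A sketch (the lowercase field names sha, name, date, size, tags are assumed, and process_group_of_records is your own handler):
from peewee import fn

fields = [Record.sha, Record.name, Record.date, Record.size, Record.tags]  # assumed field names
query = (Record
         .select(*fields, fn.GROUP_CONCAT(Record.id).alias('id_list'))
         .group_by(*fields)
         .having(fn.COUNT(Record.id) > 1))

for row in query:
    # id_list comes back as a comma-separated string, e.g. '23,44,45'
    ids = [int(i) for i in row.id_list.split(',')]
    process_group_of_records(ids)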
I have a python program in which I want to read the odd rows from one table and insert them into another table. How can I achieve this?
For example, the first table has 5 rows in total, and I want to insert the first, third, and fifth rows into another table.
Note that the table may contain millions of rows, so performance is very important.
I found a few methods here. Here are two of them, transcribed to psycopg2.
If you have a sequential primary key, you can just use mod on it:
database_cursor.execute('SELECT * FROM table WHERE mod(primary_key_column, 2) = 1')
Otherwise, you can use a subquery to get the row number and use mod:
database_cursor.execute('''SELECT col1, col2, col3
                           FROM (SELECT row_number() OVER () AS rnum, col1, col2, col3
                                 FROM table) AS sub
                           WHERE mod(rnum, 2) = 1''')
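Since the table can have millions of rows, you probably don't want to pull everything into memory at once. Here's a rough sketch that streams the odd rows with a psycopg2 server-side (named) cursor and inserts them in batches; the DSN, source_table, target_table, and column names are placeholders:
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect('dbname=mydb')        # hypothetical DSN
read_cur = conn.cursor(name='odd_rows')       # named cursor: rows are streamed, not buffered client-side
read_cur.itersize = 10000
read_cur.execute('''SELECT col1, col2, col3
                    FROM (SELECT row_number() OVER () AS rnum, col1, col2, col3
                          FROM source_table) AS sub
                    WHERE mod(rnum, 2) = 1''')

write_cur = conn.cursor()
batch = []
for row in read_cur:
    batch.append(row)
    if len(batch) >= 10000:
        execute_values(write_cur, 'INSERT INTO target_table (col1, col2, col3) VALUES %s', batch)
        batch = []
if batch:
    execute_values(write_cur, 'INSERT INTO target_table (col1, col2, col3) VALUES %s', batch)
conn.commit()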
If you have an id-type column that is guaranteed to increment by 1 upon every insert (kinda like an auto-increment index), you could always mod that to select the row. However, this would break when you begin to delete rows from the table you are selecting from.
A more complicated solution would be to use postgresql's row_number() function. The following assumes you have an id column that can be used to sort the rows in the desired order:
select r.*
from (select *,row_number() over(order by id) as row
from <tablename>
) r
where r.row % 2 = 1
Note: regardless of how you do it, this will never really be efficient, since you necessarily have to do a full table scan, and selecting all columns of a table with millions of records via a full table scan is going to be slow.
I don't want to aggregate any columns. I just want the newest row for each foreign key in a table.
I've tried grouping.
Model.query.order_by(Model.created_at.desc()).group_by(Model.foreign_key_id).all()
# column "model.id" must appear in the GROUP BY clause
And I've tried distinct.
Model.query.order_by(Model.created_at.desc()).distinct(Model.foreign_key_id).all()
# SELECT DISTINCT ON expressions must match initial ORDER BY expressions
This is known as greatest-n-per-group, and for PostgreSQL you can use DISTINCT ON, as in your second example:
SELECT DISTINCT ON (foreign_key_id) * FROM model ORDER BY foreign_key_id, created_at DESC;
In your attempt, you were missing the DISTINCT ON column in your ORDER BY list, so all you had to do was:
Model.query.order_by(Model.foreign_key_id, Model.created_at.desc()).distinct(Model.foreign_key_id)
The solution is to left join an aliased copy of the model to itself (with a special join condition), then keep only the rows for which the aliased side has no id, i.e. no newer row exists.
from sqlalchemy import and_
from sqlalchemy.orm import aliased

newer = aliased(Model)  # don't reuse the name "aliased" for the alias itself
query = Model.query.outerjoin(newer, and_(
    newer.foreign_key_id == Model.foreign_key_id,
    newer.created_at > Model.created_at))
query = query.filter(newer.id.is_(None))
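For reference, this roughly corresponds to the following SQL (table and column names taken from the question):
SELECT model.*
FROM model
LEFT OUTER JOIN model AS newer
    ON newer.foreign_key_id = model.foreign_key_id
    AND newer.created_at > model.created_at
WHERE newer.id IS NULL;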