I have the following table:
id as int, prop as text, timestamp as int, json as blob
I want to find all pairs that have the same prop and the same timestamp. Later I want to extend the timestamp match to e.g. +/- 5 seconds.
I tried to do it with an INNER JOIN, but my query runs into an endless loop:
SELECT * FROM myTable c
INNER JOIN myTable c1
ON c.id != c1.id
AND c.prop = c1.prop
AND c.timestamp = c1.timestamp
Maybe my approach is wrong. What is the problem with my query? How can I do it? Actually, I need the groups formed by these pairs.
You could try to see if the query gets faster with a GROUP BY:
SELECT * FROM myTable
WHERE (prop, timestamp) IN (
SELECT prop, timestamp
FROM myTable
GROUP BY prop, timestamp
HAVING COUNT(*) > 1
)
Although it's hard to say without sample data.
If the table is huge you might have to create an index to speed up the query.
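For example, if you are driving this from Python/sqlite3, a composite index on the two columns used for grouping would be one line; the index and database file names below are made up, so adjust them to your schema:
import sqlite3

con = sqlite3.connect("mydata.db")  # hypothetical database file
con.execute("CREATE INDEX IF NOT EXISTS idx_mytable_prop_ts ON myTable (prop, timestamp)")
con.commit()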
rows_order = "SELECT COUNT (*) FROM 'Order'"
cursor.execute(rows_order)
ordernum = cursor.fetchall()
connection.commit()
cursor.execute("INSERT INTO 'Order' (OrderNo, CustomerID, Date, TotalCost) VALUES (?,?,?,?)", (
[ordernum], custid_Sorder, now, total_item_price))
This is what I am trying, but this error popped up:
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
How do I fix this? I want to make it so the OrderNo equals the number of orders before it, which is why I want to assign the OrderNo myself. (I am using sqlite3)
As you have only one value, you need only fetchone:
import sqlite3
con = sqlite3.connect("tutorial.db")
cursor = con.cursor()
rows_order = "SELECT COUNT (*) FROM 'Order'"
cursor.execute(rows_order)
ordernum = cursor.fetchone()[0]
cursor.execute("INSERT INTO 'Order' (OrderNo, CustomerID, Date, TotalCost) VALUES (?,?,?,?)", (
ordernum, custid_Sorder, now, total_item_price))
tl;dr Don't do this. Use an auto-incremented primary key.
fetchall returns all rows as a list, even if there is only one row.
Instead, use fetchone. This will return a single tuple from which you can then select the first item: ordernum = cursor.fetchone()[0]
However, you appear to be writing a query to get the next ID. Using count(*) is wrong. If there are any gaps in OrderNo, for example if something gets deleted, it can return a number that already exists. Consider [1, 3, 4]; count(*) will return 3, which duplicates an existing OrderNo. Use max(OrderNo) + 1 instead.
Furthermore, if you try to insert two orders at the same time you might get a race condition and one will try to duplicate the other.
process 1                    process 2
select max(orderNo)
fetchone  # 4
                             select max(orderNo)
                             fetchone  # 4
insert into orders...
                             insert into orders...  # duplicate OrderNo
To avoid this, you have to do both the select and insert in a transaction.
process 1                    process 2
begin
select max(orderNo)...
fetchone  # 4                begin
                             select max(orderNo)
                             fetchone
insert into orders...        # wait
commit                       # wait
                             # 5
                             insert into orders...
                             commit
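With sqlite3 that could look like the sketch below, reusing the connection and variables from the question; BEGIN IMMEDIATE takes the write lock up front, so a second writer waits (or retries) until the first commits:
con.isolation_level = None                 # manage the transaction manually
cur = con.cursor()
cur.execute("BEGIN IMMEDIATE")             # acquire the write lock before reading
cur.execute('SELECT max(OrderNo) FROM "Order"')
next_no = (cur.fetchone()[0] or 0) + 1     # max() is NULL on an empty table
cur.execute(
    'INSERT INTO "Order" (OrderNo, CustomerID, Date, TotalCost) VALUES (?, ?, ?, ?)',
    (next_no, custid_Sorder, now, total_item_price),
)
cur.execute("COMMIT")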
Better yet, do them as a single query.
insert into "Order" (OrderNo, CustomerID, Date, TotalCost)
select max(orderNo) + 1, ?, ?, ?
from "Order"
Even better, don't do it at all. There is a built-in mechanism for this: use an auto-incremented primary key.
-- order is a keyword; pluralizing table names helps to avoid keywords
create table orders (
    -- It is a special feature of SQLite that this will automatically be unique.
    orderNo integer primary key,
    customerID int,
    -- date is also a keyword, and vague. Use xAt.
    orderedAt timestamp,
    totalCost int
)
-- orderNo will automatically be set to a unique number
insert into orders (customerID, orderedAt, totalCost) values (...)
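From Python/sqlite3 that looks like the sketch below; the connection and the variables are the ones from the question, and cursor.lastrowid gives you the orderNo that SQLite assigned:
cursor.execute(
    "insert into orders (customerID, orderedAt, totalCost) values (?, ?, ?)",
    (custid_Sorder, now, total_item_price),
)
order_no = cursor.lastrowid  # the orderNo SQLite assigned to the new row
con.commit()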
I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this,
union_query = query1.union_all(query2)
What I want to do now is to perform a GROUP BY using several attributes and then get only the rows where COUNT(*) is equal to 1. How can I do this?
I know I can do a GROUP BY like this,
group_query = union_query.group_by(*columns)
But, how do I add the COUNT(*) condition?
So, the final outcome should be the equivalent of this query,
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would also like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this,
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this is to use EXCEPT or EXCEPT ALL, but my database is running on MariaDB 8 and therefore, these operations are not supported.
For the first query, try the following, where final_query is the query you want to run (func is imported from sqlalchemy):
from sqlalchemy import func

union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
For the second query, try the following.
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()
The exact attribute available on subquery.c depends on how the subquery labels its columns; check subquery.c.keys() if the name you expect is not there.
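If you are on SQLAlchemy 1.4+, the same shape can also be written with the 2.0-style Core constructs mentioned in the migration reference below. This is only a sketch under assumptions: a hypothetical items table with link, size and source columns stands in for your two queries and grouping columns:
from sqlalchemy import MetaData, Table, Column, Integer, String, select, union_all, func

metadata = MetaData()
# hypothetical table; substitute your own table or mapped model
items = Table(
    "items", metadata,
    Column("id", Integer, primary_key=True),
    Column("link", String),
    Column("size", Integer),
    Column("source", String),
)

# UNION ALL of the two selects, wrapped as a subquery
u = union_all(
    select(items.c.link, items.c.size).where(items.c.source == "a"),
    select(items.c.link, items.c.size).where(items.c.source == "b"),
).subquery()

# keep only the (link, size) combinations that occur exactly once
unique_rows = (
    select(u.c.link, u.c.size)
    .group_by(u.c.link, u.c.size)
    .having(func.count() == 1)
    .subquery()
)

# distinct values of one column from those unique rows
distinct_links = select(unique_rows.c.link).distinct()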
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries
I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each. I can't create a dictionary of existing package_id items in memory.
For example, this takes lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
from peewee import fn

q = (A
     .select(fn.COUNT(A.id))
     .where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
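Regarding the LEFT JOIN idea from the question: an anti-join can sometimes beat NOT IN on tables this size, although that is an assumption you would want to measure. A sketch of the same count in peewee:
from peewee import fn, JOIN

q = (A
     .select(fn.COUNT(A.id))
     .join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id))
     .where(B.package_id.is_null()))   # rows of A with no match in B
count = q.scalar()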
I am loading a batch CSV file into Postgres using Python (say Table A).
I am using pandas to upload the data in chunks, which is quite fast.
for chunk in pd.read_csv(csv_file, sep='|',chunksize=chunk_size,low_memory=False):
Now I want to update another table (say Table B) using A, based on the following rules:
if there are any new records in Table A which are not in Table B, then insert them as new records in Table B (based on the Id field)
if the values change in Table A for the same ID which exists in Table B, then update the records in Table B using Table A
(There are several tables which I need to update based on Table A.)
I am able to do that using the code below and then loop through each row, but Table A always has around 1,825,172 records and it becomes extremely slow. Can any forum member help to speed this up or suggest an alternate approach to achieve the same?
cursor.execute(sql)
records = cursor.fetchall()
for row in records:
    id = 0 if row[0] is None else row[0]   # Use this to match with Table B and decide insert or update
    id2 = 0 if row[1] is None else row[1]
    id2 = 0 if row[2] is None else row[2]
You could leverage Postgres upsert syntax (INSERT ... ON CONFLICT), like:
insert into tableB (id, col1, col2)
select ta.id, ta.col1, ta.col2 from tableA ta
on conflict (id) do update
set col1 = excluded.col1, col2 = excluded.col2
Note that inside the DO UPDATE clause the incoming row is referenced via excluded, not via the select's alias.
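Once the pandas chunks have landed in Table A, that upsert can be fired from Python as one statement per batch. A minimal sketch, assuming an open psycopg2 connection named conn and the column names used above:
upsert_sql = """
    insert into tableB (id, col1, col2)
    select ta.id, ta.col1, ta.col2
    from tableA ta
    on conflict (id) do update
    set col1 = excluded.col1,
        col2 = excluded.col2
"""
with conn.cursor() as cur:   # conn: an open psycopg2 connection (assumption)
    cur.execute(upsert_sql)
conn.commit()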
You should do this completely inside the DBMS, not loop through the records inside your Python script. That allows your DBMS to optimize better.
UPDATE TableB
SET x = TableA.y
FROM TableA
WHERE TableA.id = TableB.id;

INSERT INTO TableB (id, x)
SELECT id, y
FROM TableA
WHERE TableA.id NOT IN (SELECT id FROM TableB);
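A sketch of running both statements from Python in a single transaction, so Table B never ends up half-synchronized; conn is assumed to be an open psycopg2 connection:
update_sql = """
    UPDATE TableB
    SET x = TableA.y
    FROM TableA
    WHERE TableA.id = TableB.id
"""
insert_sql = """
    INSERT INTO TableB (id, x)
    SELECT id, y FROM TableA
    WHERE TableA.id NOT IN (SELECT id FROM TableB)
"""
with conn:                        # commits on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute(update_sql)
        cur.execute(insert_sql)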
Suppose I have many columns. If the values in two particular columns match between rows, then those rows are duplicates.
ID | title | link | size | author
For example, if link and size are the same for 2 or more rows, then those rows are duplicates.
How do I get those duplicates into a list and process them?
This will return all records that have dups:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
I like the subquery because I can do things like select all but the first or last (which is then very easy to turn into a delete query).
Example: select all duplicate records EXCEPT the one with the max ID:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size, max(ID) as maxID
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link
AND theTable.size = dups.size
AND theTable.ID <> dups.maxID
Assuming that none of id, link, or size can be NULL, and that the id field is the primary key, this gives you the ids of duplicate rows. Beware that the same id can appear in the results several times if there are three or more rows with identical link and size values.
select a.id, b.id
from tbl a, tbl b
where a.id < b.id
and a.link = b.link
and a.size = b.size
After you remove the duplicates from the MySQL table, you can add a unique index
to the table so no more duplicates can be inserted:
create unique index theTable_index on theTable (link,size);
If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator over a `SELECT * FROM thetable ORDER BY link, size` and process it with `itertools.groupby`, using as key an `operator.itemgetter` for those two fields; this will present natural groupings of each bunch of 1+ rows with identical values for the fields in question.
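For the Python route, a minimal sketch of the groupby approach just described; the DB-API cursor and the process_duplicates function are assumptions, not part of your code:
from itertools import groupby
from operator import itemgetter

# any DB-API cursor works here (MySQLdb, pymysql, sqlite3, ...)
cursor.execute("SELECT id, title, link, size, author FROM theTable ORDER BY link, size")

# group consecutive rows sharing (link, size) -- columns 2 and 3 of the SELECT
for (link, size), group in groupby(cursor.fetchall(), key=itemgetter(2, 3)):
    rows = list(group)
    if len(rows) > 1:                  # 2+ rows with the same link and size are duplicates
        process_duplicates(rows)       # hypothetical: whatever processing you need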
I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!