Filter by results of select - python

I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each, so I can't build an in-memory set of existing package_id values.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()

Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
q = (A
.select(fn.COUNT(A.id))
.where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
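The question also asks about a LEFT JOIN. As a minimal sketch, the following uses the standard-library sqlite3 module rather than peewee, with tiny hypothetical versions of A and B, to check that the NOT IN form and the LEFT JOIN ... IS NULL form return the same count. The sample rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical miniature versions of tables A and B, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY);
    CREATE TABLE B (id INTEGER PRIMARY KEY, package_id INTEGER);
    INSERT INTO A (id) VALUES (1), (2), (3), (4);
    INSERT INTO B (package_id) VALUES (1), (1), (3);
""")

# Original form: NOT IN with a subquery.
not_in_count = conn.execute(
    "SELECT COUNT(*) FROM A "
    "WHERE id NOT IN (SELECT DISTINCT package_id FROM B)"
).fetchone()[0]

# Anti-join form: keep only the rows of A that found no partner in B.
left_join_count = conn.execute(
    "SELECT COUNT(*) FROM A "
    "LEFT JOIN B ON B.package_id = A.id "
    "WHERE B.package_id IS NULL"
).fetchone()[0]

print(not_in_count, left_join_count)
```

The two forms stop agreeing if B.package_id can be NULL: NOT IN against a set that contains NULL matches no rows at all, while the LEFT JOIN form still counts the unmatched rows of A.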

Related

Add GROUPBY COUNT(*) to UNION ALL Query in SQLAlchemy - EXCEPT Equivalent

I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this,
union_query = query1.union_all(query2)
What I want to do now is to perform a GROUPBY using several attributes and then get only the rows where COUNT(*) is equal to 1. How can I do this?
I know I can do a GROUPBY like this,
group_query = union_query.group_by(*columns)
But, how do I add the COUNT(*) condition?
So, the final outcome should be the equivalent of this query,
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would also like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this,
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this is to use EXCEPT or EXCEPT ALL, but my database is running on MySQL 8, so these operations are not supported.
For the first query, try the following, where final_query is the query you want to run.
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
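For the second query, wrap the grouped query in a subquery and select the distinct column from it (session is assumed to be your SQLAlchemy Session):
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()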
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries
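To see what the target SQL returns, here is a small sketch that runs both queries from the question against an in-memory SQLite database using the standard-library sqlite3 module instead of SQLAlchemy. The tables t1 and t2 and the columns name and kind are hypothetical stand-ins for query1, query2 and the grouping columns.

```python
import sqlite3

# Hypothetical tables standing in for <query1> and <query2>.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (name TEXT, kind TEXT);
    CREATE TABLE t2 (name TEXT, kind TEXT);
    INSERT INTO t1 VALUES ('a', 'x'), ('b', 'x'), ('c', 'y');
    INSERT INTO t2 VALUES ('b', 'x'), ('c', 'y'), ('d', 'y');
""")

# Rows that appear exactly once across both selects.
unique_rows = conn.execute("""
    SELECT name, kind FROM (
        SELECT name, kind FROM t1
        UNION ALL
        SELECT name, kind FROM t2
    ) AS result
    GROUP BY name, kind
    HAVING COUNT(*) = 1
    ORDER BY name
""").fetchall()

# Distinct values of one column, taken from those unique rows.
distinct_kinds = conn.execute("""
    SELECT DISTINCT kind FROM (
        SELECT name, kind FROM (
            SELECT name, kind FROM t1
            UNION ALL
            SELECT name, kind FROM t2
        ) AS result
        GROUP BY name, kind
        HAVING COUNT(*) = 1
    ) AS unique_rows
    ORDER BY kind
""").fetchall()

print(unique_rows)
print(distinct_kinds)
```

Note that COUNT(*) = 1 only identifies rows missing from the other select when neither select contains internal duplicates; a row that occurs twice in t1 and never in t2 would be dropped.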

sqlite3: my sql query runs into endless loop

I have the following table:
id as int, prop as text, timestamp as int, json as blob
I want to find all pairs of rows that have the same prop and the same timestamp. Later I want to widen the timestamp match to, e.g., +/- 5 sec.
I tried to do it with an INNER JOIN, but my query seems to run forever:
SELECT * FROM myTable c
INNER JOIN myTable c1
ON c.id != c1.id
AND c.prop = c1.prop
AND c.timestamp = c1.timestamp
Maybe my approach is wrong. What is the problem with my query? How can I do it? Actually, I need groups with these pairs.
You could try to see if the query gets faster with a GROUP BY:
SELECT * FROM myTable
WHERE (prop, timestamp) IN (
SELECT prop, timestamp
FROM myTable
GROUP BY prop, timestamp
HAVING COUNT(*) > 1
)
It's hard to say without sample data, though.
If the table is huge, you might have to create an index on (prop, timestamp) to speed up the query.
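A minimal sketch of the GROUP BY approach with the standard-library sqlite3 module follows. The sample rows and the index name idx_prop_timestamp are assumptions for illustration, and the row-value form (prop, timestamp) IN (...) needs SQLite 3.15 or newer. Since the question asks for groups, the matching ids are collected per (prop, timestamp) key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (id INTEGER PRIMARY KEY, prop TEXT, timestamp INTEGER, json BLOB);
    INSERT INTO myTable (id, prop, timestamp) VALUES
        (1, 'a', 100), (2, 'a', 100), (3, 'b', 100),
        (4, 'a', 200), (5, 'b', 300), (6, 'b', 300);
    -- Hypothetical index name; covers the columns used for grouping and lookup.
    CREATE INDEX idx_prop_timestamp ON myTable (prop, timestamp);
""")

# All rows whose (prop, timestamp) combination occurs more than once.
rows = conn.execute("""
    SELECT id, prop, timestamp FROM myTable
    WHERE (prop, timestamp) IN (
        SELECT prop, timestamp FROM myTable
        GROUP BY prop, timestamp
        HAVING COUNT(*) > 1
    )
    ORDER BY prop, timestamp, id
""").fetchall()

# Collect the ids of each duplicate group.
groups = {}
for row_id, prop, ts in rows:
    groups.setdefault((prop, ts), []).append(row_id)

print(groups)
```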

Nested SELECT query in Pyspark DataFrames

Suppose I have two DataFrames in Pyspark and I'd want to run a nested SQL-like SELECT query, on the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a select query by using where, as in
df.where(df.a.isin(my_list))
given I have selected the my_list tuple of values beforehand. How would I perform a nested query in one go instead?
As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join followed by distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
WHERE table1.b = table2.b AND table2.c = '1'
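The "roughly equivalent" claim can be checked on a small example. The sketch below runs both SQL statements against SQLite through the standard-library sqlite3 module, not against Spark, so it only illustrates the SQL semantics. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b INTEGER);
    CREATE TABLE table2 (b INTEGER, c TEXT);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO table2 VALUES (10, '1'), (10, '1'), (20, '2'), (30, '1');
""")

# Nested form from the question.
nested = conn.execute("""
    SELECT * FROM table1
    WHERE b IN (SELECT b FROM table2 WHERE c = '1')
    ORDER BY a
""").fetchall()

# Join form from the answer; DISTINCT removes the rows multiplied by the join.
joined = conn.execute("""
    SELECT DISTINCT table1.*
    FROM table1 JOIN table2
    WHERE table1.b = table2.b AND table2.c = '1'
    ORDER BY table1.a
""").fetchall()

print(nested)
print(joined)
```

The equivalence is only rough because DISTINCT also collapses rows that were genuinely duplicated in table1, which the nested form would keep.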

Retrieving and selecting binary values from Mysql with Python 3

I'm trying to select data from one table, and perform a query on another table using the returned values from the first table.
Both tables are case-sensitive and use the utf8_bin collation.
When I perform my first select, I am returned a tuple of binary values:
query = """SELECT id FROM table1"""
results = (b'1234', b'2345', b'3456')
I'd then like to perform a query on table2 using the ids returned from table1:
query = """SELECT element FROM table2 WHERE id IN (%s) """ % results
Is this the right way to do this?
You need to create the query so that it can be properly parameterized:
query = """SELECT element FROM table2 WHERE id IN (%s) """ % ",".join(['%s'] * len(results))
This will transform the query to:
query = """SELECT element FROM table2 WHERE id IN (%s,%s,%s) """
Then you can just pass query and results to the execute() (or appropriate) method so that results are properly parameterized.
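Here is a runnable sketch of the same idea using the standard-library sqlite3 module. This is an assumption made for the sake of a self-contained example: sqlite3 uses ? as its placeholder where the MySQL drivers use %s, and the ids are plain strings rather than bytes. The helper name build_in_query and the sample rows are hypothetical.

```python
import sqlite3


def build_in_query(values):
    """Return a SELECT with one placeholder per value (sqlite3 uses '?')."""
    placeholders = ",".join(["?"] * len(values))
    return "SELECT element FROM table2 WHERE id IN (%s) ORDER BY element" % placeholders


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (id TEXT, element TEXT);
    INSERT INTO table2 VALUES ('1234', 'alpha'), ('2345', 'beta'), ('9999', 'gamma');
""")

# Stand-in for the ids returned by the first query.
results = ("1234", "2345", "3456")

# Only the placeholders are interpolated into the SQL text;
# the values themselves are passed separately to execute().
query = build_in_query(results)
elements = [row[0] for row in conn.execute(query, results)]

print(query)
print(elements)
```

An empty results tuple would produce IN (), which MySQL rejects as a syntax error, so that case is worth guarding before running the query.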

How to find duplicates in MySQL

Suppose I have many columns. If 2 columns match and are exactly the same, then they are duplicates.
ID | title | link | size | author
If link and size are the same for 2 or more rows, then those rows are duplicates.
How do I get those duplicates into a list and process them?
This returns all records that have duplicates:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
I like the subquery because it lets me do things like select all but the first or last row of each group, which is very easy to turn into a delete query.
Example: select all duplicate records except the one with the max ID:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size, max(ID) as maxID
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link
AND theTable.size = dups.size
AND theTable.ID <> dups.maxID
Assuming that none of id, link or size can be NULL, and that id is the primary key, this gives you the ids of duplicate rows. Beware that the same id can appear in the results several times if there are three or more rows with identical link and size values.
select a.id, b.id
from tbl a, tbl b
where a.id < b.id
and a.link = b.link
and a.size = b.size
After you remove the duplicates from the MySQL table, you can add a unique index
to the table so no more duplicates can be inserted:
create unique index theTable_index on theTable (link,size);
If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator on a 'SELECT * FROM thetable ORDER BY link, size' query and process it with itertools.groupby using, as key, the operator.itemgetter for those two fields; this will present natural groupings of each bunch of 1+ rows with identical values for the fields in question.
I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!
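The Python option can be sketched as follows, using the standard-library sqlite3 module as a stand-in for the MySQL connection. The sample rows are made up, and id is added to the ORDER BY only to make the output deterministic.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thetable (id INTEGER PRIMARY KEY, title TEXT, link TEXT, size INTEGER, author TEXT);
    INSERT INTO thetable (id, title, link, size, author) VALUES
        (1, 't1', 'link-a', 10, 'x'),
        (2, 't2', 'link-a', 10, 'y'),
        (3, 't3', 'link-b', 10, 'x'),
        (4, 't4', 'link-a', 10, 'z'),
        (5, 't5', 'link-c', 20, 'x');
""")

# groupby only merges adjacent rows, so the ORDER BY on the key columns is required.
cursor = conn.execute(
    "SELECT id, title, link, size, author FROM thetable ORDER BY link, size, id"
)

# Key on the link and size columns (positions 2 and 3 of each row tuple).
key = itemgetter(2, 3)

duplicates = {}
for link_size, group in groupby(cursor, key=key):
    rows = list(group)
    if len(rows) > 1:  # keep only groups that really contain duplicates
        duplicates[link_size] = [row[0] for row in rows]

print(duplicates)
```

Each entry maps a (link, size) pair to the ids that share it, which is a convenient shape for further processing such as keeping one id and deleting the rest.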
