How to specify the FROM tables in SQLAlchemy subqueries? - python

I am trying to fetch, in a single query, a fixed set of rows plus some other rows found by a subquery. The query generated by my SQLAlchemy code is incorrect.
It comes out as follows:
SELECT tbl.id AS tbl_id
FROM tbl
WHERE tbl.id IN
(
SELECT t2.id AS t2_id
FROM tbl AS t2, tbl AS t1
WHERE t2.id =
(
SELECT t3.id AS t3_id
FROM tbl AS t3, tbl AS t1
WHERE t3.id < t1.id ORDER BY t3.id DESC LIMIT 1 OFFSET 0
)
AND t1.id IN (4, 8)
)
OR tbl.id IN (0, 8)
while the correct query should not have the second tbl AS t1 (the goal of this query is to select IDs 0 and 8, as well as the IDs just before 4 and 8).
Unfortunately, I can't find how to get SQLAlchemy to generate the correct one (see the code below).
Suggestions for achieving the same result with a simpler query are also welcome (they need to be efficient though -- I tried a few variants and some were a lot slower on my real use case).
The code producing the query:
from sqlalchemy import create_engine, or_
from sqlalchemy import Column, Integer, MetaData, Table
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///:memory:', echo=True)
meta = MetaData(bind=engine)
table = Table('tbl', meta, Column('id', Integer))
session = sessionmaker(bind=engine)()
meta.create_all()
# Insert IDs 0, 2, 4, 6, 8.
i = table.insert()
i.execute(*[dict(id=i) for i in range(0, 10, 2)])
print session.query(table).all()
# output: [(0,), (2,), (4,), (6,), (8,)]
# Subquery of interest: look for the row just before IDs 4 and 8.
sub_query_txt = (
'SELECT t2.id '
'FROM tbl t1, tbl t2 '
'WHERE t2.id = ( '
' SELECT t3.id from tbl t3 '
' WHERE t3.id < t1.id '
' ORDER BY t3.id DESC '
' LIMIT 1) '
'AND t1.id IN (4, 8)')
print session.execute(sub_query_txt).fetchall()
# output: [(2,), (6,)]
# Full query of interest: get the rows mentioned above, as well as more rows.
query_txt = (
'SELECT * '
'FROM tbl '
'WHERE ( '
' id IN (%s) '
'OR id IN (0, 8))'
) % sub_query_txt
print session.execute(query_txt).fetchall()
# output: [(0,), (2,), (6,), (8,)]
# Attempt at an SQLAlchemy translation (from innermost sub-query to full query).
t1 = table.alias('t1')
t2 = table.alias('t2')
t3 = table.alias('t3')
q1 = session.query(t3.c.id).filter(t3.c.id < t1.c.id).order_by(t3.c.id.desc()).\
limit(1)
q2 = session.query(t2.c.id).filter(t2.c.id == q1, t1.c.id.in_([4, 8]))
q3 = session.query(table).filter(
or_(table.c.id.in_(q2), table.c.id.in_([0, 8])))
print list(q3)
# output: [(0,), (6,), (8,)]

What you are missing is a correlation between the innermost sub-query and the next level up; without the correlation, SQLAlchemy will include the t1 alias in the innermost sub-query:
>>> print str(q1)
SELECT t3.id AS t3_id
FROM tbl AS t3, tbl AS t1
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?
>>> print str(q1.correlate(t1))
SELECT t3.id AS t3_id
FROM tbl AS t3
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?
Note that tbl AS t1 is now missing from the query. From the .correlate() method documentation:
Return a Query construct which will correlate the given FROM clauses to that of an enclosing Query or select().
Thus, t1 is assumed to be part of the enclosing query, and isn't listed in the query itself.
Now your query works:
>>> q1 = session.query(t3.c.id).filter(t3.c.id < t1.c.id).order_by(t3.c.id.desc()).\
... limit(1).correlate(t1)
>>> q2 = session.query(t2.c.id).filter(t2.c.id == q1, t1.c.id.in_([4, 8]))
>>> q3 = session.query(table).filter(
... or_(table.c.id.in_(q2), table.c.id.in_([0, 8])))
>>> print list(q3)
2012-10-24 22:16:22,239 INFO sqlalchemy.engine.base.Engine SELECT tbl.id AS tbl_id
FROM tbl
WHERE tbl.id IN (SELECT t2.id AS t2_id
FROM tbl AS t2, tbl AS t1
WHERE t2.id = (SELECT t3.id AS t3_id
FROM tbl AS t3
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?) AND t1.id IN (?, ?)) OR tbl.id IN (?, ?)
2012-10-24 22:16:22,239 INFO sqlalchemy.engine.base.Engine (1, 0, 4, 8, 0, 8)
[(0,), (2,), (6,), (8,)]
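To see the effect of the extra FROM entry without SQLAlchemy in the way, here is a minimal sketch that runs both versions of the SQL against the same five rows using the standard-library sqlite3 module. The QUERY template and the inner_from placeholder are names made up for this illustration; the two FROM clauses are the ones shown in the generated SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER)")
conn.executemany("INSERT INTO tbl (id) VALUES (?)", [(i,) for i in range(0, 10, 2)])

# {inner_from} is the FROM clause of the innermost subquery (illustrative name).
QUERY = """
SELECT tbl.id FROM tbl
WHERE tbl.id IN (
    SELECT t2.id
    FROM tbl AS t2, tbl AS t1
    WHERE t2.id = (
        SELECT t3.id
        FROM {inner_from}
        WHERE t3.id < t1.id
        ORDER BY t3.id DESC
        LIMIT 1
    )
    AND t1.id IN (4, 8)
)
OR tbl.id IN (0, 8)
ORDER BY tbl.id
"""

# Without correlation: the inner t1 shadows the outer one, so the subquery
# always returns the same single value.
uncorrelated = conn.execute(QUERY.format(inner_from="tbl AS t3, tbl AS t1")).fetchall()

# With correlation: t1 refers to the enclosing query's row.
correlated = conn.execute(QUERY.format(inner_from="tbl AS t3")).fetchall()

print(uncorrelated)  # [(0,), (6,), (8,)]
print(correlated)    # [(0,), (2,), (6,), (8,)]
```

The first result matches the incorrect output from the question, and the second matches the output after adding .correlate(t1).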

I'm only kinda sure I understand the query you're asking for. Let's break it down, though:
the goal from this query is to select IDs 0 and 8, as well as the IDs just before 4 and 8.
It looks like you want to query for two kinds of things, and then combine them. The proper operator for that is union. Do the simple queries and add them up at the end. I'll start with the second bit, "ids just before X".
To start with, let's look at all the ids that are before some given value. For this, we'll join the table on itself with a <:
# select t1.id t1_id, t2.id t2_id from tbl t1 join tbl t2 on t1.id < t2.id;
t1_id | t2_id
-------+-------
0 | 2
0 | 4
0 | 6
0 | 8
2 | 4
2 | 6
2 | 8
4 | 6
4 | 8
6 | 8
(10 rows)
That certainly gives us all of the pairs of rows where the left is less than the right. Of all of them, we want, for a given t2_id, the row whose t1_id is as high as possible; we'll group by t2_id and select the maximum t1_id:
# select max(t1.id), t2.id from tbl t1 join tbl t2 on t1.id < t2.id group by t2.id;
max | id
-----+-------
0 | 2
2 | 4
4 | 6
6 | 8
(4 rows)
Your query, using a limit, could achieve this, but it's usually a good idea to avoid this technique when alternatives exist, because partitioning does not have good, portable support across database implementations. SQLite can use this technique, but PostgreSQL doesn't like it; it uses a technique called "analytic queries" (which are both standardised and more general). MySQL can do neither. The above query, though, works consistently across all SQL database engines.
The rest of the work is just using IN or other equivalent filtering queries, which are not difficult to express in SQLAlchemy. The boilerplate...
>>> import sqlalchemy as sa
>>> from sqlalchemy.orm import Query
>>> engine = sa.create_engine('sqlite:///:memory:')
>>> meta = sa.MetaData(bind=engine)
>>> table = sa.Table('tbl', meta, sa.Column('id', sa.Integer))
>>> meta.create_all()
>>> table.insert().execute([{'id':i} for i in range(0, 10, 2)])
>>> t1 = table.alias()
>>> t2 = table.alias()
>>> before_filter = [4, 8]
The first interesting bit is that we give the 'max(id)' expression a name. This is needed so that we can refer to it more than once, and to lift it out of a subquery.
>>> c1 = sa.func.max(t1.c.id).label('max_id')
>>> # ^^^^^^
The 'heavy lifting' portion of the query: join the above aliases, group, and select the max.
>>> q1 = Query([c1, t2.c.id]) \
... .join((t2, t1.c.id < t2.c.id)) \
... .group_by(t2.c.id) \
... .filter(t2.c.id.in_(before_filter))
Because we'll be using a union, we need this to produce the right number of fields: we wrap it in a subquery and project down to the only column we're interested in. This will have the name we gave it in the above label() call.
>>> q2 = Query(q1.subquery().c.max_id)
>>> # ^^^^^^
The other half of the union is much simpler:
>>> t3 = table.alias()
>>> exact_filter = [0, 8]
>>> q3 = Query(t3).filter(t3.c.id.in_(exact_filter))
All that's left is to combine them:
>>> q4 = q2.union(q3)
>>> engine.execute(q4.statement).fetchall()
[(0,), (2,), (6,), (8,)]
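For comparison, the same union can be written as plain SQL. The following is only a sketch using the standard-library sqlite3 module, with the filter values written as literals and an ORDER BY added so the result order is deterministic; it is not the SQL that SQLAlchemy emits for q4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER)")
conn.executemany("INSERT INTO tbl (id) VALUES (?)", [(i,) for i in range(0, 10, 2)])

sql = """
SELECT max(t1.id) AS id              -- the id just before each t2.id
FROM tbl AS t1 JOIN tbl AS t2 ON t1.id < t2.id
WHERE t2.id IN (4, 8)                -- the "before" filter
GROUP BY t2.id
UNION
SELECT id FROM tbl WHERE id IN (0, 8)  -- the "exact" filter
ORDER BY 1
"""
union_rows = conn.execute(sql).fetchall()
print(union_rows)  # [(0,), (2,), (6,), (8,)]
```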

The responses here helped me fix my issue but in my case I had to use both correlate() and subquery():
# ...
subquery = subquery.correlate(OuterCorrelationTable).subquery()
filter_query = db.session.query(func.sum(subquery.c.some_count_column))
filter = filter_query.as_scalar() == as_many_as_some_param
# ...
final_query = db.session.query(OuterCorrelationTable).filter(filter)

Related

Join on a CTE in SQLAlchemy

I'm trying to formulate a SQLAlchemy query that uses a CTE to build a table-like structure of an input list of tuples, and JOIN it with one of my tables (backend DB is Postgres). Conceptually, it would look like:
WITH to_compare AS (
SELECT * FROM (
VALUES
(1, 'flimflam'),
(2, 'fimblefamble'),
(3, 'pigglywiggly'),
(4, 'beepboop')
-- repeat for a couple dozen or hundred rows
) AS t (field1, field2)
)
SELECT b.field1, b.field2, b.field3
FROM my_model b
JOIN to_compare c ON (c.field1 = b.field1) AND (c.field2 = b.field2)
The goal is to see what field3 is for each pair (field1, field2) that is in the table, for a medium-sized list of (field1, field2) pairs.
In SQLAlchemy I'm trying to do it like this:
stmts = [
sa.select(
[
sa.cast(sa.literal(field1), sa.Integer).label("field1"),
sa.cast(sa.literal(field2), sa.Text).label("field2"),
]
)
if idx == 0
else sa.select([sa.literal(field1), sa.literal(field2)])
for idx, (field1, field2) in enumerate(list_of_tuples)
]
cte = sa.union_all(*stmts).cte(name="temporary_table")
already_in_db_query = db.session.query(MyModel)\
.join(cte,
cte.c.field1 == MyModel.field1,
cte.c.field2 == MyModel.field2,
).all()
But it seems like CTEs and JOINs don't play well together: the error is on the join, saying:
sqlalchemy.exc.InvalidRequestError: Don't know how to join to ; please use an ON clause to more clearly establish the left side of this join
And if I try to print the cte, it does look like a non-SQL entity:
$ from pprint import pformat
$ print(pformat(str(cte)), flush=True)
> ''
Is there a way to do this? Or a better way to achieve my goal?
The second argument to Query.join() should in this case be the full ON clause, but instead you pass 3 arguments to join(). Use and_() to combine the predicates, as is done in the raw SQL:
already_in_db_query = db.session.query(MyModel)\
.join(cte,
and_(cte.c.field1 == MyModel.field1,
cte.c.field2 == MyModel.field2),
).all()
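The underlying idea, joining a table against an inline list of pairs with a two-column ON clause, can be tried out without SQLAlchemy or Postgres. Below is a rough sketch using the standard-library sqlite3 module as a stand-in; the my_model rows and the field3 values are invented for the illustration, and the pairs are passed as bound parameters rather than literals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (field1 INTEGER, field2 TEXT, field3 TEXT)")
# Hypothetical contents: only two of the four pairs below exist in the table.
conn.executemany(
    "INSERT INTO my_model VALUES (?, ?, ?)",
    [(1, "flimflam", "a"), (2, "other", "b"), (4, "beepboop", "c")],
)

pairs = [(1, "flimflam"), (2, "fimblefamble"), (3, "pigglywiggly"), (4, "beepboop")]

# One "(?, ?)" group per pair; the values themselves are bound, not interpolated.
values_sql = ", ".join(["(?, ?)"] * len(pairs))
sql = f"""
WITH to_compare(field1, field2) AS (VALUES {values_sql})
SELECT b.field1, b.field2, b.field3
FROM my_model AS b
JOIN to_compare AS c ON c.field1 = b.field1 AND c.field2 = b.field2
ORDER BY b.field1
"""
params = [value for pair in pairs for value in pair]
matches = conn.execute(sql, params).fetchall()
print(matches)  # [(1, 'flimflam', 'a'), (4, 'beepboop', 'c')]
```

The single ON expression with AND plays the same role as the and_() call in the SQLAlchemy version.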

How to use another method to achieve mysql left join?

Because MySQL limits a join to 61 tables, I am looking for an alternative. This is the table:
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for test
-- ----------------------------
DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
`id` int(11) DEFAULT NULL,
`pid` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
-- ----------------------------
-- Records of test
-- ----------------------------
BEGIN;
INSERT INTO `test` VALUES (1, 3);
INSERT INTO `test` VALUES (1, 4);
INSERT INTO `test` VALUES (2, 4);
INSERT INTO `test` VALUES (3, 5);
INSERT INTO `test` VALUES (3, 6);
COMMIT;
SET FOREIGN_KEY_CHECKS = 1;
This is the MySQL query:
SELECT
t1.pid AS lev1,
t2.pid AS lev2,
t3.pid AS lev3,
t4.pid AS lev4
FROM
test AS t1
LEFT JOIN test AS t2 ON ( t2.id = t1.pid )
LEFT JOIN test AS t3 ON ( t3.id = t2.pid )
LEFT JOIN test AS t4 ON ( t4.id = t3.pid )
LEFT JOIN test AS t5 ON ( t5.id = t4.pid )
WHERE t1.id = 1 ;
I want output like this, but without using a MySQL LEFT JOIN:
lev1,lev2,lev3,lev4
3, 6,
3, 5,
4, ,
If this can be achieved in Python, I would like that too!
The limit of 61 is there for a reason and your query might run somewhat slowly, but something like the following should work for your needs:
SELECT
t1.id AS lev1, t2.id AS lev2, t3.id AS lev3, t4.id AS lev4
FROM
t1, t2, t3, t4
WHERE
t2.pid = t1.id AND t3.pid = t1.id AND t4.pid = t1.id
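Since the question also asks for a Python route, here is one possible sketch that avoids joins altogether: load the (id, pid) rows once, index them by id, and walk the chains in Python. The function names (build_index, walk_levels) and the max_depth parameter are made up for this example, and the rows are the ones from the INSERT statements in the question. It assumes the data contains no NULL pid values.

```python
from collections import defaultdict

# (id, pid) rows as inserted in the question
rows = [(1, 3), (1, 4), (2, 4), (3, 5), (3, 6)]


def build_index(rows):
    """Map each id to the list of pid values it points to."""
    children = defaultdict(list)
    for id_, pid in rows:
        children[id_].append(pid)
    return children


def walk_levels(children, start_id, max_depth=4):
    """Yield one tuple per chain starting at start_id, padded with None.

    Mirrors the chained LEFT JOINs: level N+1 is looked up by treating the
    pid found at level N as the next id. max_depth also bounds cycles.
    """
    def _walk(node, path):
        next_pids = children.get(node, [])
        if not next_pids or len(path) == max_depth:
            if path:
                yield tuple(path) + (None,) * (max_depth - len(path))
            return
        for pid in next_pids:
            yield from _walk(pid, path + [pid])

    yield from _walk(start_id, [])


levels = list(walk_levels(build_index(rows), start_id=1))
for lev in levels:
    print(lev)
# (3, 5, None, None)
# (3, 6, None, None)
# (4, None, None, None)
```

The same three chains appear in the desired output above, with None standing in for the empty columns; row order may differ from what MySQL returns.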

Sqlalchemy duplicated WHERE clause to FROM

I wrote a raw query in psql and it works fine, but when I wrote it in SQLAlchemy my WHERE clause was duplicated in the FROM clause.
select id from T1 where arr && array(select l.id from T1 as l where l.box && box '((0,0),(50,50))');
In this query I fetch every id from T1 whose array of ints intersects with the results of the subquery.
class T1():
arr = Column(ARRAY(Integer))
...
class T2():
box = Column(Box) # my geometry type
...
Version 1:
layers_q = select([T2.id]).where(T2.box.op('&&')(box)) # find all T2 rows that intersect with box
chunks = select([T1.id]).where(T1.arr.overlap(layers_q)) # find all T1.id where T1.arr overlaps the result of the first query
SELECT T1.id
FROM T1
WHERE T1.arr && (SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s)
This gives me a PG error about a type cast, which I understand.
Version 2:
layers_q = select([T2.id]).where(T2.box.op('&&')(box))
chunks = select([T1.id]).where(T1.arr.overlap(func.array(layers_q)))
I added func.array() to cast to an array, but the result is not correct:
SELECT T1.id
FROM T1, (SELECT T2.id AS id
FROM T2
WHERE T2.box && %(box_1)s)
WHERE T1.arr && array((SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s))
There you can see the duplicate in the FROM clause. How do I do this correctly?
I found a solution!
func.array(select([T2.id]).where(T2.box.op('&&')(box)).as_scalar())
After adding as_scalar() everything works, because my select needs all the ids in one array.

Why doesn't this work? WHERE IN

(Not a duplicate. I know that there's a way of doing this that works: Parameter substitution for a SQLite "IN" clause.)
I'd like to know what I'm missing in this code. I build a simple table. Then I successfully copy some of its records to a new table where the records are qualified by a WHERE clause that involves two lists. Having tossed that table I attempt to copy the same records but this time I put the list into a variable which I insert into the sql statement. This time no records are copied.
How come?
import sqlite3
conn = sqlite3.connect(':memory:')
curs = conn.cursor()
oldTableRecords = [ [ 15, 3 ], [ 2, 1], [ 44, 2], [ 6, 9 ] ]
curs.execute('create table oldTable (ColA integer, ColB integer)')
curs.executemany('insert into oldTable (ColA, ColB) values (?,?)', oldTableRecords)
print ('This goes ...')
curs.execute('''create table newTable as
select * from oldTable
where ColA in (15,3,44,9) or ColB in (15,3,44,9)''')
for row in curs.execute('select * from newTable'):
print ( row)
curs.execute('''drop table newTable''')
print ('This does not ...')
TextTemp = ','.join("15 3 44 9".split())
print (TextTemp)
curs.execute('''create table newTable as
select * from oldTable
where ColA in (?) or ColB in (?)''', (TextTemp,TextTemp))
for row in curs.execute('select * from newTable'):
print ( row)
Output:
This goes ...
(15, 3)
(44, 2)
(6, 9)
This does not ...
15,3,44,9
TIA!
The whole point of a SQL parameter is to prevent SQL syntax in values from being executed. That includes commas between values; if this weren't the case, you could never use values containing commas in query parameters, and it would probably be a security issue to boot.
You can't just use one ? to insert multiple values into a query; the whole TextTemp value is seen as one value, producing the following equivalent:
create table newTable as
select * from oldTable
where ColA in ('15,3,44,9') or ColB in ('15,3,44,9')
No row has the string value 15,3,44,9 in ColA or ColB, so nothing matches.
You need to use separate placeholders for each of the values in your parameter:
col_values = [int(v) for v in "15 3 44 9".split()]
placeholders = ', '.join(['?'] * len(col_values))
sql = '''create table newTable as
select * from oldTable
where ColA in ({0}) or ColB in ({0})'''.format(placeholders)
curs.execute(sql, col_values * 2)
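A compact way to check this is to run both forms against the question's data as plain SELECTs. This is just a sketch with the standard-library sqlite3 module; the variable names single and multi are illustrative, and ORDER BY rowid is added only to pin down the row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("create table oldTable (ColA integer, ColB integer)")
curs.executemany(
    "insert into oldTable (ColA, ColB) values (?,?)",
    [(15, 3), (2, 1), (44, 2), (6, 9)],
)

# One placeholder: the whole string is bound as a single value.
text_temp = "15,3,44,9"
single = curs.execute(
    "select * from oldTable where ColA in (?) or ColB in (?) order by rowid",
    (text_temp, text_temp),
).fetchall()

# One placeholder per value: each integer is bound separately.
col_values = [int(v) for v in "15 3 44 9".split()]
placeholders = ", ".join(["?"] * len(col_values))
multi = curs.execute(
    "select * from oldTable where ColA in ({0}) or ColB in ({0}) order by rowid".format(
        placeholders
    ),
    col_values * 2,
).fetchall()

print(single)  # []
print(multi)   # [(15, 3), (44, 2), (6, 9)]
```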

Multiple insertion of one value in sqlalchemy statement to pandas

I have constructed a sql clause where I reference the same table as a and b to compare the two geometries as a postgis command.
I would like to pass a value into the SQL statement using the %s operator and read the result into a pandas dataframe using read_sql and its params kwarg. Currently my code allows one value to be passed to one %s, but I'm looking for multiple insertions of the same list of values.
I'm connecting to a postgresql database using psycopg2.
Simplified code is below
sql = """
SELECT
st_distance(a.the_geom, b.the_geom, true) AS dist
FROM
(SELECT
table.*
FROM table
WHERE id in %s) AS a,
(SELECT
table.*
FROM table
WHERE id in %s) AS b
WHERE a.nid <> b.nid """
sampList = (14070,11184)
df = pd.read_sql(sql, con=conn, params = [sampList])
Basically I'm looking to replace both %s with the sampList value in both places. The code as written will only replace the first value, failing with ': list index out of range'. If I adjust it to have one %s and replace the second IN list with literal numbers the code runs, but ultimately I would like a way to repeat those values.
You don't need the subqueries; just join the table with itself:
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM ztable a
JOIN ztable b ON a.nid < b.nid
WHERE a.id IN (%s)
AND b.id IN (%s)
;
Avoid repetition by using a CTE (this may be non-optimal, performance-wise):
WITH zt AS (
SELECT * FROM ztable
WHERE id IN (%s)
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
;
Performance-wise, I would just stick to the first version, and supply the list-argument twice. (or refer to it twice, using a FORMAT() construct)
First of all, I would recommend using the updated SQL from @wildplasser; it's a much better and more efficient way to do that.
Now you can do the following:
sql_ = """\
WITH zt AS (
SELECT * FROM ztable
WHERE id IN ({})
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
"""
sampList = (14070,11184)
sql = sql_.format(','.join(['%s' for x in sampList]))  # psycopg2 uses %s placeholders, not ?
df = pd.read_sql(sql, con=conn, params=sampList)
Dynamically generated SQL with parameters (AKA: prepared statements, bind variables, etc.):
In [27]: print(sql)
WITH zt AS (
SELECT * FROM ztable
WHERE id IN (%s,%s)
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
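If you prefer the self-join version where the list has to be supplied twice, the placeholder expansion and the parameter repetition can be done in one small helper. This is a sketch only: expand_in_list and the {ids} marker are hypothetical names, and it assumes psycopg2's %s placeholder style. Any literal % in the SQL would still need to be written as %%.

```python
def expand_in_list(sql_template, values, field="ids"):
    """Replace every {ids} marker with one %s per value.

    Returns the expanded SQL and a flat parameter list in which the values
    are repeated once for each marker found in the template.
    """
    marker = "{" + field + "}"
    repeats = sql_template.count(marker)
    placeholders = ", ".join(["%s"] * len(values))
    sql = sql_template.replace(marker, placeholders)
    params = list(values) * repeats
    return sql, params


template = """
SELECT st_distance(a.the_geom, b.the_geom, true) AS dist
FROM ztable a
JOIN ztable b ON a.nid < b.nid
WHERE a.id IN ({ids})
  AND b.id IN ({ids})
"""
sampList = (14070, 11184)
sql, params = expand_in_list(template, sampList)
print(sql)
print(params)  # [14070, 11184, 14070, 11184]

# With a live psycopg2 connection (not created here), the call would be:
# df = pd.read_sql(sql, con=conn, params=params)
```

Only the placeholders are spliced into the SQL text; the ids themselves still travel as bound parameters.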
