Multiple insertion of one value in sqlalchemy statement to pandas - python

I have constructed a SQL clause where I reference the same table as a and b in order to compare the two geometries with a PostGIS command.
I would like to pass a value into the SQL statement using the %s placeholder and read the result into a pandas dataframe using read_sql with the params kwarg. Currently my code will allow one value to be passed to one %s, but I'm looking for multiple insertions of the same list of values.
I'm connecting to a PostgreSQL database using psycopg2.
Simplified code is below:
sql = """
SELECT
st_distance(a.the_geom, b.the_geom, true) AS dist
FROM
(SELECT
table.*
FROM table
WHERE id in %s) AS a,
(SELECT
table.*
FROM table
WHERE id in %s) AS b
WHERE a.nid <> b.nid """
sampList = (14070,11184)
df = pd.read_sql(sql, con=conn, params = [sampList])
Basically I'm looking to replace both %s placeholders with the sampList value. The code as written will only fill the first placeholder and then fails with 'list index out of range'. If I adjust to having one %s and replace the second IN clause with literal numbers, the code runs, but ultimately I would like a way to repeat those values.

You don't need the subqueries; just join the table with itself:
SELECT a.*, b.* -- or whatever
     , st_distance(a.the_geom, b.the_geom, true) AS dist
FROM ztable a
JOIN ztable b ON a.nid < b.nid
WHERE a.id IN (%s)
  AND b.id IN (%s)
;
Avoid repetition by using a CTE (this may be non-optimal, performance-wise):
WITH zt AS (
    SELECT * FROM ztable
    WHERE id IN (%s)
)
SELECT a.*, b.* -- or whatever
     , st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
;
Performance-wise, I would just stick to the first version and supply the list argument twice (or refer to it twice, using a FORMAT() construct).
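For example, a minimal sketch of supplying the same tuple twice through pandas and psycopg2 (the connection details are assumptions; note IN %s without parentheses, because psycopg2 expands a Python tuple into a parenthesized list itself):
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

sql = """
SELECT a.*, b.*
     , st_distance(a.the_geom, b.the_geom, true) AS dist
FROM ztable a
JOIN ztable b ON a.nid < b.nid
WHERE a.id IN %s
  AND b.id IN %s
"""

sampList = (14070, 11184)
# the same tuple is passed once per %s placeholder
df = pd.read_sql(sql, con=conn, params=(sampList, sampList))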

First of all, I would recommend using the updated SQL from @wildplasser - it's a much better and more efficient way to do it.
Now you can do the following (note that psycopg2 uses %s placeholders, not ?):
sql_ = """\
WITH zt AS (
SELECT * FROM ztable
WHERE id IN ({})
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
"""
sampList = (14070,11184)
sql = sql_.format(','.join(['%s'] * len(sampList)))
df = pd.read_sql(sql, con=conn, params=sampList)
Dynamically generated SQL with parameters (AKA prepared statements, bind variables, etc.):
In [27]: print(sql)
WITH zt AS (
    SELECT * FROM ztable
    WHERE id IN (%s,%s)
)
SELECT a.*, b.* -- or whatever
     , st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid

Related

Extracting parameters from strings - SQL Server

I have a table with strings in one column, which are actually storing other SQL queries written before and stored to be run at later times. They contain parameters such as '@organisationId' or '@enterDateHere'. I want to be able to extract these.
Example:
ID | Query
1  | SELECT * FROM table WHERE id = @organisationId
2  | SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere
3  | SELECT name + '@' + domain FROM user
I want the following:
ID | Parameters
1  | @organisationId
2  | @startDate, @endDate, @enterOrgHere
3  | NULL
No need to worry about how to separate/list them, as long as they are clearly visible and the query lists all of them, whose number I don't know in advance. Please note that sometimes the queries contain just @, for example when email binding is being done, but that's not a parameter. I want just strings which start with @ and have at least one letter after it, ending with a non-letter character (space, newline, comma, semicolon). If this causes problems, then return all strings starting with @ and I will simply identify the parameters manually.
It can include usage of Excel/Python/C# if needed, but SQL is preferable.
The official way to interrogate the parameters is with sp_describe_undeclared_parameters, e.g.:
exec sp_describe_undeclared_parameters @tsql = N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
It is very simple to implement by using tokenization via XML and XQuery.
Notable points:
The 1st CROSS APPLY tokenizes the Query column as XML.
The 2nd CROSS APPLY filters out tokens that don't contain the "@" symbol.
SQL #1
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, Query VARCHAR(2048));
INSERT INTO @tbl (Query) VALUES
('SELECT * FROM table WHERE id = @organisationId'),
('SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'),
('SELECT name + ''@'' + domain FROM user');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = SPACE(1);
SELECT t.ID
     , Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
    REPLACE(Query, @separator, ']]></r><r><![CDATA[') +
    ']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])').value('text()[1]','VARCHAR(1024)'))) AS t2(Par)
SQL #2
A cleansing step was added first, to handle whitespace characters other than a regular space.
SELECT t.*
     , Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<r><![CDATA[' + Query + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')) AS t0(pure)
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
    REPLACE(Pure, @separator, ']]></r><r><![CDATA[') +
    ']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])')
    .value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
Output
ID | Parameters
1  | @organisationId
2  | @startDate @endDate @enterOrgHere
3  | NULL
You can use STRING_SPLIT and then remove the undesired characters; here's a query:
DROP TABLE IF EXISTS #TEMP
SELECT 1 AS ID ,'SELECT * FROM table WHERE id = @organisationId' AS Query
INTO #TEMP
UNION ALL SELECT 2, 'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
UNION ALL SELECT 3, 'SELECT name + ''@'' + domain FROM user'
;WITH cte as
(
    SELECT ID,
           Query,
           STRING_AGG(REPLACE(REPLACE(REPLACE(value,'<',''),'>',''),'=',''),', ') AS Parameters
    FROM #TEMP
    CROSS APPLY string_split(Query,' ')
    WHERE value LIKE '%@[a-z]%'
    GROUP BY ID,
             Query
)
SELECT #TEMP.*, cte.Parameters
FROM #TEMP
LEFT JOIN cte on #TEMP.ID = cte.ID
Using SQL Server for parsing is a very bad idea because of low performance and a lack of tools. I highly recommend using a .NET assembly or an external language (since your project is in Python anyway) with regex, or any other conversion method.
However, as a last resort, you can use something like this extremely slow and generally horrible code (it works only on SQL Server 2017+, by the way; on earlier versions the code would be even more terrible):
DECLARE @sql TABLE
(
    id INT PRIMARY KEY IDENTITY
    , sql_query NVARCHAR(MAX)
);
INSERT INTO @sql (sql_query)
VALUES (N'SELECT * FROM table WHERE id = @organisationId')
, (N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere')
, (N' SELECT name + ''@'' + domain FROM user')
;
WITH prepared AS
(
    SELECT id
         , IIF(sql_query LIKE '%@%'
               , SUBSTRING(sql_query, CHARINDEX('@', sql_query) + 1, LEN(sql_query))
               , CHAR(32)
           ) prep_string
    FROM @sql
),
parsed AS
(
    SELECT id
         , IIF(CHARINDEX(CHAR(32), value) = 0
               , SUBSTRING(value, 1, LEN(value))
               , SUBSTRING(value, 1, CHARINDEX(CHAR(32), value) - 1)
           ) parsed_value
    FROM prepared p
    CROSS APPLY STRING_SPLIT(p.prep_string, '@')
)
SELECT id, '@' + STRING_AGG(IIF(parsed_value LIKE '[a-zA-Z]%', parsed_value, NULL), ', @')
FROM parsed
GROUP BY id
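Since this answer recommends doing the extraction in an external language, here is a minimal Python sketch of the regex approach (the sample data mirrors the question; the exact pattern is an assumption based on the stated rules):
import re

queries = {
    1: "SELECT * FROM table WHERE id = @organisationId",
    2: ("SELECT * FROM topic WHERE creation_time <=@startDate "
        "AND creation_time >= @endDate AND id = @enterOrgHere"),
    3: "SELECT name + '@' + domain FROM user",
}

# "@" followed by at least one letter, then any further word characters;
# a lone "@" (as in email concatenation) will not match.
param_re = re.compile(r"@[A-Za-z]\w*")

for query_id, query in queries.items():
    params = param_re.findall(query)
    print(query_id, ", ".join(params) if params else "NULL")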

How to iteratively create UNION ALL SQL statement using Python?

I am connecting to Snowflake to query row counts of view tables. I am also querying metadata related to the view tables. My query looks like below. I was wondering if I can iterate through the UNION ALL statement using Python? When I try to run my query below I receive an error that says "view_table_3" does not exist.
Thanks in advance for your time and efforts!
Query to get row count for Snowflake view table (with metadata)
view_tables=['view_table1','view_table2','view_table3','view_table4']
print(f""" SELECT * FROM (SELECT TABLE_SCHEMA,TABLE_NAME,CREATED,LAST_ALTERED FROM SCHEMA='INFORMATION_SCHEMA.VIEWS' WHERE TABLE_SCHEMA='MY_SCHEMA' AND TABLE_NAME IN ({','.join("'" +x+ "'" for x in view_tables)})) t1
LEFT JOIN
(SELECT 'view_table1' table_name2, count(*) as view_row_count from MY_DB.SCHEMA.view_table1
UNION ALL SELECT {','.join("'" +x+ "'" for x in view_tables[1:])},count(*) as view_row_count from MY_DB.SCHEMA.{','.join("" +x+ "" for x.replace("'"," ") in view_tables)})t2
on t1.TABLE_NAME =t2.table_name2 """)
If you want to make a union dynamically, put the entire SELECT query inside the generator, and then join them with ' UNION '.
sql = f'''SELECT * FROM INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN (
{' UNION '.join(f"SELECT '{table}' AS table_name2, COUNT(*) AS view_row_count FROM MY_SCHEMA.{table}" for table in view_tables)}
) AS t2 ON v.TABLE_NAME = t2.table_name2
WHERE v.TABLE_NAME IN ({','.join(f"'{table}'" for table in view_tables)})
'''
print(sql)
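For completeness, a hedged sketch of generating and running the UNION ALL variant into a DataFrame (it assumes an open snowflake.connector connection conn and trusted view names, since identifiers cannot be bound as parameters):
import pandas as pd

view_tables = ['view_table1', 'view_table2', 'view_table3', 'view_table4']

union_sql = ' UNION ALL '.join(
    f"SELECT '{t}' AS table_name2, COUNT(*) AS view_row_count "
    f"FROM MY_DB.MY_SCHEMA.{t}"
    for t in view_tables
)

sql = f'''SELECT * FROM INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN ({union_sql}) AS t2 ON v.TABLE_NAME = t2.table_name2
WHERE v.TABLE_NAME IN ({','.join(f"'{t}'" for t in view_tables)})
'''

df = pd.read_sql(sql, conn)
UNION ALL is sufficient here because every branch returns a distinct table_name2, so deduplication would add cost without changing the result.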

Extracting max date from a database and use output in another query

I want to query the max date in a table and use it as a parameter in a WHERE clause in another query. I am doing this:
query = (""" select
cast(max(order_date) as date)
from
tablename
""")
cursor.execute(query)
d = cursor.fetchone()
As output: [(datetime.date(2021, 9, 8),)]
Then I want to use this output as parameter in another query:
query3=("""select * from anothertable
where order_date = d::date limit 10""")
cursor.execute(query3)
As output: column "d" does not exist
I tried cast(d as date) and d::date, but nothing works. I also tried datetime.date(d), with no success either.
What am I doing wrong here?
There is no reason to select the date and then use it in another query; that requires 2 round trips to the server. Do it in a single query. This has the advantage of removing all client-side processing of that date.
select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
);
I am not exactly sure how this translates into your obfuscation layer, but from what I see, I believe it would be something like:
query = (""" select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
) """)
cursor.execute(query)
Heed the warning by @OneCricketeer: you may need a cast on anothertable's order_date as well, i.e. where cast(order_date as date) = ( select ... ).
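If the date is nevertheless needed client-side, here is a minimal sketch of the two-step version: bind the fetched value as a query parameter instead of writing the Python name d into the SQL text (assuming a psycopg2-style driver with %s placeholders):
# Step 1: fetch the max date; fetchone() returns a one-element tuple.
cursor.execute("select cast(max(order_date) as date) from tablename")
(max_date,) = cursor.fetchone()

# Step 2: pass the date as a bound parameter, not as literal SQL text.
cursor.execute(
    "select * from anothertable where order_date = %s limit 10",
    (max_date,),
)
rows = cursor.fetchall()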

How to replace IN in an SQL query containing a lot of parameters with Postgresql?

I am trying to retrieve information from a database using a Python tuple containing a set of ids (between 1000 and 10000 ids), but my query uses the IN statement and is consequently very slow.
query = """ SELECT *
FROM table1
LEFT JOIN table2 ON table1.id = table2.id
LEFT JOIN ..
LEFT JOIN ...
WHERE table1.id IN {} """.format(my_tuple)
and then I query the database using PostgreSQL to load the result into a pandas dataframe:
with tempfile.TemporaryFile() as tmpfile:
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
        query=query, head="HEADER"
    )
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    cur.copy_expert(copy_sql, tmpfile)
    tmpfile.seek(0)
    df = pd.read_csv(tmpfile, low_memory=False)
I know that IN is not very efficient with a high number of parameters, but I do not have any idea how to optimise this part of the query. Any hint?
You could debug your query using the EXPLAIN statement. Probably you are sequentially scanning a big table while needing only a few rows. Is the field table1.id indexed?
Or you could try to filter table1 first and then start joining:
with t1 as (
    select f1, f2, .... from table1 where id in {}
)
select *
from t1
left join ....
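As a complementary sketch under psycopg2, the ids can also be bound as an array with = ANY instead of string-formatting the tuple in (= ANY is an alternative to IN named here explicitly, not what the answer above prescribes; column names are illustrative):
# "= ANY(%s)" accepts a Python list, which psycopg2 adapts to a
# Postgres array, so the ids never touch the SQL string itself.
query = """
    WITH t1 AS (
        SELECT id, f1, f2   -- only the columns actually needed
        FROM table1
        WHERE id = ANY(%s)
    )
    SELECT *
    FROM t1
    LEFT JOIN table2 ON t1.id = table2.id
"""
cur = db_engine.raw_connection().cursor()
cur.execute(query, (list(my_tuple),))
Note that COPY cannot take bind parameters, so to keep the COPY-to-CSV path the statement would have to be rendered first (for example with cur.mogrify) before being wrapped in COPY.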

How to prevent nested queries in sqlalchemy from selecting a table again?

I wrote a query for mysql that achieved what I wanted. It's structured a bit like this:
select * from table_a where exists(
    select * from table_b where table_a.x = table_b.x and exists(
        select * from table_c where table_a.y = table_c.y and table_b.z = table_c.z
    )
)
I translated the query to sqlalchemy and the result is structured like this:
session.query(table_a).filter(
    session.query(table_b).filter(table_a.x == table_b.x).filter(
        session.query(table_c).filter(table_a.y == table_c.y).filter(table_b.z == table_c.z).exists()
    ).exists()
)
Which generates a query like this:
select * from table_a where exists(
    select * from table_b where table_a.x = table_b.x and exists(
        select * from table_c, table_a where table_a.y = table_c.y and table_b.z = table_c.z
    )
)
Note the re-selection of table_a in the innermost query - which breaks the intended functionality.
How can I stop sqlalchemy from selecting the table again in a nested query?
Tell the innermost query to correlate all except table_c:
session.query(table_a).filter(
    session.query(table_b).filter(table_a.x == table_b.x).filter(
        session.query(table_c).filter(table_a.y == table_c.y).filter(table_b.z == table_c.z)
        .exists().correlate_except(table_c)
    ).exists()
)
In contrast to "auto-correlation", which only considers FROM elements from the enclosing Select, explicit correlation will consider FROM elements from any nesting level as candidates.
