Extracting parameters from strings - SQL Server - python

I have a table with strings in one column, which are actually storing other SQL queries written earlier and stored to be run at later times. They contain parameters such as '@organisationId' or '@enterDateHere'. I want to be able to extract these.
Example:

ID   Query
--   -----
1    SELECT * FROM table WHERE id = @organisationId
2    SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere
3    SELECT name + '@' + domain FROM user
I want the following:

ID   Parameters
--   ----------
1    @organisationId
2    @startDate, @endDate, @enterOrgHere
3    NULL
No need to worry about how to separate/list them, as long as they are clearly visible and as long as the query lists all of them, since I don't know how many there are. Please note that sometimes the queries contain just a bare '@', for example when an email address is being built, but that is not a parameter. I want just the strings which start with '@' and have at least one letter after it, ending at a non-letter character (space, newline, comma, semicolon). If this causes problems, then return all strings starting with '@' and I will simply identify the parameters manually.
It can include usage of Excel/Python/C# if needed, but SQL is preferable.

The official way to interrogate the parameters is with sp_describe_undeclared_parameters, eg
exec sp_describe_undeclared_parameters @tsql = N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
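Since the question allows Python, here is a minimal sketch of driving that procedure from pyodbc; it is not part of the answer above, and the table name dbo.QueryStore, its ID/Query columns, and the connection string are assumptions:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
rows = conn.cursor().execute("SELECT ID, Query FROM dbo.QueryStore").fetchall()
for row_id, query in rows:
    try:
        # one row per undeclared @parameter; the parameter name is the second column of the result set
        params = conn.cursor().execute(
            "EXEC sp_describe_undeclared_parameters @tsql = ?", query
        ).fetchall()
        names = [p[1] for p in params]
    except pyodbc.Error:
        names = []  # the statement has no parameters or cannot be parsed
    print(row_id, names or None)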

It is very simple to implement using tokenization via XML and XQuery.
Notable points:
The 1st CROSS APPLY tokenizes the Query column as XML.
The 2nd CROSS APPLY filters out tokens that don't contain the "@" symbol.
SQL #1
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, Query VARCHAR(2048));
INSERT INTO @tbl (Query) VALUES
('SELECT * FROM table WHERE id = @organisationId'),
('SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'),
('SELECT name + ''@'' + domain FROM user');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = SPACE(1);
SELECT t.ID
, Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(Query, @separator, ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])').value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
SQL #2
A cleansing step was added to first normalize whitespace characters other than a regular space.
SELECT t.*
, Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<r><![CDATA[' + Query + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')) AS t0(pure)
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(Pure, @separator, ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])')
.value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
Output

ID   Parameters
--   ----------
1    @organisationId
2    @startDate @endDate @enterOrgHere
3    NULL

You can use STRING_SPLIT and then remove the undesired characters. Here's a query:
DROP TABLE IF EXISTS #TEMP
SELECT 1 AS ID ,'SELECT * FROM table WHERE id = @organisationId' AS Query
INTO #TEMP
UNION ALL SELECT 2, 'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
UNION ALL SELECT 3, 'SELECT name + ''@'' + domain FROM user'
;WITH cte as
(
SELECT ID,
Query,
STRING_AGG(REPLACE(REPLACE(REPLACE(value,'<',''),'>',''),'=',''),', ') AS Parameters
FROM #TEMP
CROSS APPLY string_split(Query,' ')
WHERE value LIKE '%@[a-z]%'
GROUP BY ID,
Query
)
SELECT #TEMP.*,cte.Parameters
FROM #TEMP
LEFT JOIN cte on #TEMP.ID = cte.ID

Using SQL Server for parsing is a very bad idea because of low performance and the lack of tools. I highly recommend using a .NET assembly or an external language (since your project is in Python anyway) with regular expressions or any other parsing method.
However, as a last resort, you can use something like the following extremely slow and generally horrible code (which, by the way, only works on SQL Server 2017+; on earlier versions the code would be even worse):
DECLARE @sql TABLE
(
id INT PRIMARY KEY IDENTITY
, sql_query NVARCHAR(MAX)
);
INSERT INTO @sql (sql_query)
VALUES (N'SELECT * FROM table WHERE id = @organisationId')
, (N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere')
, (N' SELECT name + ''@'' + domain FROM user')
;
WITH prepared AS
(
SELECT id
, IIF(sql_query LIKE '%@%'
, SUBSTRING(sql_query, CHARINDEX('@', sql_query) + 1, LEN(sql_query))
, CHAR(32)
) prep_string
FROM @sql
),
parsed AS
(
SELECT id
, IIF(CHARINDEX(CHAR(32), value) = 0
, SUBSTRING(value, 1, LEN(VALUE))
, SUBSTRING(value, 1, CHARINDEX(CHAR(32), value) -1)
) parsed_value
FROM prepared p
CROSS APPLY STRING_SPLIT(p.prep_string, '@')
)
SELECT id, '@' + STRING_AGG(IIF(parsed_value LIKE '[a-zA-Z]%', parsed_value, NULL) , ', @')
FROM parsed
GROUP BY id
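For completeness, here is a rough sketch of the regex approach recommended above, done in Python with pandas; it is only an illustration (the inline DataFrame stands in for the real table, which you would normally load with pandas.read_sql or similar):

import re
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Query": [
        "SELECT * FROM table WHERE id = @organisationId",
        "SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere",
        "SELECT name + '@' + domain FROM user",
    ],
})

# '@' followed by at least one letter and then any further word characters;
# a bare '@' (as in the e-mail concatenation) is not matched
pattern = re.compile(r"@[A-Za-z]\w*")

df["Parameters"] = df["Query"].apply(lambda q: ", ".join(pattern.findall(q)) or None)
print(df[["ID", "Parameters"]])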

Related

split column into multi dynamically in python or sql

I'm trying to split the Details column into multiple columns using T-SQL or Python.
The table looks like this:
ID    Details
--    -------
15    Hotel:Campsite;Message:Reservation inquiries
150   Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|
13    NULL
There are a lot of keys (columns) under Details, so I want a dynamic way to split Details into multiple columns using Python or T-SQL.
The desired output:
ID    Details                                                             Hotel      Message                 Page            PageLink
--    -------                                                             -----      -------                 ----            --------
15    Hotel:Campsite;Message:Reservation inquiries                        Campsite   Reservation inquiries   NULL            NULL
150   Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|    NULL       NULL                    45-discount-y   https://xx.xx.net/SS/45-discount-y/|
13    NULL                                                                NULL       NULL                    NULL            NULL
First: split each Details value on ';' with STRING_SPLIT.
Second: split each resulting part on ':' with STRING_SPLIT, using REPLACE to handle the ':' inside the PageLink URL.
Finally: use PIVOT.
DECLARE @cols AS NVARCHAR(MAX),@scols AS NVARCHAR(MAX),
@query AS NVARCHAR(MAX)
set @query = '
;with cte as (
select Id,Details,valuesd,[1],replace( [2],''https//'',''https://'') as [2] from (
select * from (
select Id,Details,value as valuesd
from T
cross apply(
select *
from string_split(Details,'';'')
)d
)t
cross apply (
select RowN=Row_Number() over (Order by (SELECT NULL)), value
from string_split(replace( t.valuesd,''https://'',''https//''), '':'')
) d
) src
pivot (max(value) for src.RowN in([1],[2])) p
)
SELECT T.id,T.Details,Max([Hotel]) as [Hotel],Max([Message]) as Message,Max([Page]) as Page,Max([PageLink]) as PageLink from
(
select Id,Details, valuesd,[1],[2]
from cte
) x
pivot
(
max( [2]) for [1] in ([Hotel],[Message],[Page],[PageLink])
) p
right join T on p.id=T.id
group by T.id,T.Details
'
execute(@query)
You can insert the sample data with the following code:
create table T(id int,Details nvarchar(max))
insert into T
select 15,'Hotel:Campsite;Message:Reservation inquiries' union all
select 150,'Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|' union all
select 13, null
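Since the question also allows Python, here is a rough pandas sketch of the same dynamic split; it is not part of the answer above, and it assumes the table has already been loaded into a DataFrame (for example with pandas.read_sql from table T):

import pandas as pd

df = pd.DataFrame({
    "id": [15, 150, 13],
    "Details": [
        "Hotel:Campsite;Message:Reservation inquiries",
        "Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|",
        None,
    ],
})

def details_to_dict(details):
    # split on ';' into key:value pairs; split each pair on the FIRST ':' only,
    # so the ':' inside URLs (https://...) is preserved
    if not details:
        return {}
    return dict(part.split(":", 1) for part in details.split(";") if ":" in part)

wide = pd.json_normalize(df["Details"].map(details_to_dict).tolist())
result = pd.concat([df, wide], axis=1)
print(result)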

Why does identical code for SQL Merge (Upsert) work in Microsoft SQL Server console but doesn't work in Python?

I have a function in my main Python file which gets called by main() and executes a SQL Merge (Upsert) statement using pyodbc from a different file & function. Concretely, the SQL statement traverses a source table with transaction details by distinct transaction datetimes and merges customers into a separate target table. The function that executes the statement and the function that returns the completed SQL statement are attached below.
When I run my Python script, it doesn't work as expected and inserts only around 70 rows (sometimes 69, 71, or 72) into the target customer table. However, when I use an identical SQL statement and execute it in the Microsoft SQL Server Management Studio console (attached below), it works fine and inserts 4302 rows (as expected).
I'm not sure what's wrong. Would really appreciate any help!
SQL Statement Executor in Python main file:
def stage_to_dim(connection, cursor, now):
    log(f"Filling {cfg.dim_customer} and {cfg.dim_product}")
    try:
        cursor.execute(sql_statements.stage_to_dim_statement(now))
        connection.commit()
    except Exception as e:
        log(f"Error in stage_to_dim: {e}")
        sys.exit(1)
    log("Stage2Dimensions complete.")
SQL Statement formulator in Python:
def stage_to_dim_statement(now):
    return f"""
DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO @dates (date)
SELECT DISTINCT TransactionDateTime FROM {cfg.stage_table} ORDER BY TransactionDateTime;
DECLARE @i INT;
DECLARE @cnt INT;
DECLARE @date DATETIME;
SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
WHILE @i < @cnt
BEGIN
SET @i = @i + 1
SET @date = (SELECT date FROM @dates WHERE id = @i)
MERGE {cfg.dim_customer} AS Target
USING (SELECT * FROM {cfg.stage_table} WHERE TransactionDateTime = @date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '{now}'), Target.LoadSource = '{cfg.ingest_file_path}'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID,
Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'{now}'), '{cfg.ingest_file_path}');
END
    """
SQL Statement from MS SQL Server Console:
DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO @dates (date)
SELECT DISTINCT TransactionDateTime FROM dbo.STG_CustomerTransactions ORDER BY TransactionDateTime;
DECLARE @i INT;
DECLARE @cnt INT;
DECLARE @date DATETIME;
SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
WHILE @i < @cnt
BEGIN
SET @i = @i + 1
SET @date = (SELECT date FROM @dates WHERE id = @i)
MERGE dbo.DIM_CustomerDup AS Target
USING (SELECT * FROM dbo.STG_CustomerTransactions WHERE TransactionDateTime = @date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME,'6/30/2022 11:53:05'), Target.LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID, Source.AcquisitionDate,
Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'6/30/2022 11:53:05'), '../csv/cleaned_original_data.csv');
END
If you think carefully about what your final result ends up being, you are actually just taking the latest row (by date) for each customer. So you can just filter the source using a standard row-number approach.
Exactly why the Python code didn't work properly is unclear, but the query below might work better. You are also building the statement by string interpolation (SQL injection), which is dangerous and can also cause correctness problems.
Also, you should always use an unambiguous date format.
MERGE dbo.DIM_CustomerDup AS t
USING (
SELECT *
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TransactionDateTime DESC)
FROM dbo.STG_CustomerTransactions s
) AS s
WHERE s.rn = 1
) AS s
ON t.CustomerCodeNK = s.CustomerID
WHEN MATCHED THEN
UPDATE SET
AquiredDate = s.AcquisitionDate,
AquiredSource = s.AcquisitionSource,
ZipCode = s.Zipcode,
LoadDate = SYSDATETIME(),
LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, SYSDATETIME(), '../csv/cleaned_original_data.csv')
;
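On the SQL-injection point, here is a minimal sketch (my own, not the answerer's code) of passing run-time values as bound parameters with pyodbc instead of formatting them into the SQL text; it assumes merge_sql is the MERGE statement above with a ? placeholder wherever a run-time value such as the load-source path should appear:

def run_merge(connection, merge_sql, params):
    # connection is a pyodbc connection, as in the question;
    # params supplies one value per ? placeholder, in the order they appear in merge_sql
    cursor = connection.cursor()
    cursor.execute(merge_sql, params)
    connection.commit()

Called, for instance, as run_merge(connection, merge_sql, [ingest_file_path, ingest_file_path]) when the path placeholder appears in both branches of the MERGE.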

How to iteratively create UNION ALL SQL statement using Python?

I am connecting to Snowflake to query the row count of view tables, together with metadata about the views. My query looks like the below. I was wondering if I can iterate through the UNION ALL statement using Python? When I try to run my query below I receive an error that says "view_table_3" does not exist.
Thanks in advance for your time and efforts!
Query to get row count for Snowflake view table (with metadata)
view_tables=['view_table1','view_table2','view_table3','view_table4']
print(f""" SELECT * FROM (SELECT TABLE_SCHEMA,TABLE_NAME,CREATED,LAST_ALTERED FROM SCHEMA='INFORMATION_SCHEMA.VIEWS' WHERE TABLE_SCHEMA='MY_SCHEMA' AND TABLE_NAME IN ({','.join("'" +x+ "'" for x in view_tables)})) t1
LEFT JOIN
(SELECT 'view_table1' table_name2, count(*) as view_row_count from MY_DB.SCHEMA.view_table1
UNION ALL SELECT {','.join("'" +x+ "'" for x in view_tables[1:])},count(*) as view_row_count from MY_DB.SCHEMA.{','.join("" +x+ "" for x.replace("'"," ") in view_tables)})t2
on t1.TABLE_NAME =t2.table_name2 """)
If you want to make a union dynamically, put the entire SELECT query inside the generator, and then join them with ' UNION '.
sql = f'''SELECT * FROM INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN (
{' UNION '.join(f"SELECT '{table}' AS table_name2, COUNT(*) AS view_row_count FROM MY_SCHEMA.{table}" for table in view_tables)}
) AS t2 ON v.TABLE_NAME = t2.table_name2
WHERE v.TABLE_NAME IN ({','.join(f"'{table}'" for table in view_tables)})
'''
print(sql)
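To actually run the generated statement, you could do something like the following sketch with the Snowflake Python connector; the connection parameters are placeholders, not values from the question:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    database="MY_DB", schema="MY_SCHEMA",
)
cur = conn.cursor()
try:
    cur.execute(sql)  # sql is the statement built above
    for row in cur.fetchall():
        print(row)
finally:
    cur.close()
    conn.close()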

Extracting max date from a database and use output in another query

I want to query the max date in a table and use it as a parameter in a WHERE clause in another query. I am doing this:
query = (""" select
cast(max(order_date) as date)
from
tablename
""")
cursor.execute(query)
d = cursor.fetchone()
as output:[(datetime.date(2021, 9, 8),)]
Then I want to use this output as parameter in another query:
query3=("""select * from anothertable
where order_date = d::date limit 10""")
cursor.execute(query3)
as output: column "d" does not exist
I tried cast(d as date) and d::date but nothing works. I also tried datetime.date(d), with no success either.
What I am doing wrong here?
There is no reason to select the date then use it in another query. That requires 2 round trips to the server. Do it in a single query. This has the advantage of removing all client side processing of that date.
select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
);
I am not exactly sure how this translates into your obfuscation layer but, from what I see, I believe it would be something like:
query = (""" select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
) """)
cursor.execute(query)
Heed the warning by @OneCricketeer. You may need a cast on anothertable's order_date as well: where cast(order_date as date) = ( select ... )
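If you do keep the two-step approach anyway, pass the fetched date to the second query as a bound parameter rather than referencing it by its Python variable name. A sketch (the %s placeholder style assumes a psycopg2-style driver; adjust for your DB-API driver):

cursor.execute("select cast(max(order_date) as date) from tablename")
max_date = cursor.fetchone()[0]  # a datetime.date object

cursor.execute(
    "select * from anothertable where order_date = %s limit 10",
    (max_date,),
)
rows = cursor.fetchall()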

Count the number of non-null values in each column of each table in MySQL

Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors
# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)
# Get column metadata
sql = """SELECT *
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='my_database'
"""
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]
# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])
# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a list containing a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match of the answer above is actually the method you're already using, but less efficiently implemented in reflective SQL.
I'd do the same as you did - build a SQL like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
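As a rough illustration of that loop (my own sketch, not the answerer's code, reusing the connection settings from the question's workaround): build one aggregate query per table from information_schema, run it, and unpivot the single returned row into (table, column, total, nulls) tuples.

import pymysql

connection = pymysql.connect(host='localhost', user='root',
                             password='my_password', db='my_database')
results = []
with connection.cursor() as cursor:
    cursor.execute("""SELECT TABLE_NAME, COLUMN_NAME
                      FROM `INFORMATION_SCHEMA`.`COLUMNS`
                      WHERE `TABLE_SCHEMA` = 'my_database'
                      ORDER BY TABLE_NAME, ORDINAL_POSITION""")
    columns_by_table = {}
    for table_name, column_name in cursor.fetchall():
        columns_by_table.setdefault(table_name, []).append(column_name)

    for table_name, columns in columns_by_table.items():
        null_sums = ", ".join(f"SUM(IF(`{c}` IS NULL,1,0))" for c in columns)
        cursor.execute(f"SELECT COUNT(*), {null_sums} FROM `{table_name}`")
        row = cursor.fetchone()
        total, null_counts = row[0], row[1:]
        # one tuple per column: (table, column, total rows, null rows)
        results.extend(
            (table_name, column, total, nulls)
            for column, nulls in zip(columns, null_counts)
        )

for r in results:
    print(r)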
It is possible, but it's not going to be quick.
As mentioned in a previous answer you can work your way through the columns table in the information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer because you end up counting every row, for every column, in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //
DROP PROCEDURE IF EXISTS non_nulls //
CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
-- Parameters:
-- Schema name to check
-- call non_nulls('sakila');
DECLARE vTABLE_NAME varchar(64);
DECLARE vCOLUMN_NAME varchar(64);
DECLARE vIS_NULLABLE varchar(3);
DECLARE vCOLUMN_KEY varchar(3);
DECLARE done BOOLEAN DEFAULT FALSE;
DECLARE cur1 CURSOR FOR
SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
FROM `information_schema`.`columns`
WHERE `TABLE_SCHEMA` = sname
ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;
DROP TEMPORARY TABLE IF EXISTS non_nulls;
CREATE TEMPORARY TABLE non_nulls(
table_name VARCHAR(64),
column_name VARCHAR(64),
column_key CHAR(3),
is_nullable CHAR(3),
rows BIGINT,
populated BIGINT
);
OPEN cur1;
read_loop: LOOP
FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
IF done THEN
LEAVE read_loop;
END IF;
SET @sql := CONCAT('INSERT INTO non_nulls ',
'(table_name,column_name,column_key,is_nullable,rows,populated) ',
'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
'FROM `', sname, '`.`', vTABLE_NAME, '`');
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END LOOP;
CLOSE cur1;
SELECT * FROM non_nulls;
END //
DELIMITER ;
call non_nulls('sakila');
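If you'd rather drive it from the Python side (as in the question's workaround), a minimal sketch, assuming the procedure has already been created in the target database:

import pymysql

connection = pymysql.connect(host='localhost', user='root',
                             password='my_password', db='sakila')
with connection.cursor() as cursor:
    # runs the procedure; its final SELECT is returned as the result set
    cursor.callproc('non_nulls', ('sakila',))
    for table_name, column_name, column_key, is_nullable, rows, populated in cursor.fetchall():
        print(table_name, column_name, rows, populated)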
