I'm trying to split the Details column into multiple columns using T-SQL or Python.
The table looks like this:
ID   Details
---  -----------------------------------------------------------------
15   Hotel:Campsite;Message:Reservation inquiries
150  Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|
13   NULL
There are many keys (columns) inside Details, so I want a dynamic way to split it into multiple columns using Python or T-SQL.
The desired output:
ID   Details                                                           Hotel     Message                Page           PageLink
---  ----------------------------------------------------------------  --------  ---------------------  -------------  ------------------------------------
15   Hotel:Campsite;Message:Reservation inquiries                      Campsite  Reservation inquiries  NULL           NULL
150  Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y    NULL      NULL                   45-discount-y  https://xx.xx.net/SS/45-discount-y/|
13   NULL                                                              NULL      NULL                   NULL           NULL
First: split Details on ';' with STRING_SPLIT.
Second: split each resulting part on ':' with STRING_SPLIT; REPLACE is used to temporarily hide the ':' inside 'https://' so the PageLink value is not split on it.
Finally: use PIVOT.
DECLARE @cols AS NVARCHAR(MAX), @scols AS NVARCHAR(MAX),
        @query AS NVARCHAR(MAX)
set @query = '
;with cte as (
  select Id, Details, valuesd, [1], replace([2],''https//'',''https://'') as [2] from (
    select * from (
      select Id, Details, value as valuesd
      from T
      cross apply (
        select *
        from string_split(Details, '';'')
      ) d
    ) t
    cross apply (
      select RowN = Row_Number() over (Order by (SELECT NULL)), value
      from string_split(replace(t.valuesd, ''https://'', ''https//''), '':'')
    ) d
  ) src
  pivot (max(value) for src.RowN in ([1],[2])) p
)
SELECT T.id, T.Details, Max([Hotel]) as [Hotel], Max([Message]) as [Message], Max([Page]) as [Page], Max([PageLink]) as [PageLink] from
(
  select Id, Details, valuesd, [1], [2]
  from cte
) x
pivot
(
  max([2]) for [1] in ([Hotel],[Message],[Page],[PageLink])
) p
right join T on p.id = T.id
group by T.id, T.Details
'
execute(@query)
You can insert the sample data with the following script:
create table T(id int,Details nvarchar(max))
insert into T
select 15,'Hotel:Campsite;Message:Reservation inquiries' union all
select 150,'Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|' union all
select 13, null
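Since the question also allows Python, here is a minimal pandas sketch of the same dynamic split. The DataFrame mirrors the sample table; splitting each pair on the first ':' only keeps the colon inside the URL intact, so no REPLACE trick is needed:

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "ID": [15, 150, 13],
    "Details": [
        "Hotel:Campsite;Message:Reservation inquiries",
        "Page:45-discount-y;PageLink:https://xx.xx.net/SS/45-discount-y/|",
        None,
    ],
})

def parse_details(s):
    # Split on ';' into key:value pairs, then on the FIRST ':' only,
    # so the ':' inside 'https://' does not break the URL apart.
    if not isinstance(s, str):
        return {}
    return dict(pair.split(":", 1) for pair in s.split(";") if ":" in pair)

# One column per key found anywhere in Details; missing keys become NaN.
parsed = df["Details"].apply(parse_details).apply(pd.Series)
result = df.join(parsed)
```

Because the columns come from the parsed keys themselves, new keys appearing in Details show up automatically without changing the code.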
I have a table with strings in one column; these are actually other SQL queries written earlier and stored to be run at later times. They contain parameters such as '@organisationId' or '@enterDateHere'. I want to be able to extract these.
Example:
ID  Query
--  ----------------------------------------------------------------------------------------------------------
1   SELECT * FROM table WHERE id = @organisationId
2   SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere
3   SELECT name + '@' + domain FROM user
I want the following:
ID  Parameters
--  -----------------------------------
1   @organisationId
2   @startDate, @endDate, @enterOrgHere
3   NULL
No need to worry about how to separate or list them, as long as they are clearly visible and the query lists all of them (I don't know how many there are). Please note that sometimes the queries contain a bare @, for example when email binding is done, but that is not a parameter. I want only strings that start with @ and have at least one letter after it, ending at a non-letter character (space, newline, comma, semicolon). If this causes problems, then return all strings starting with @ and I will identify the parameters manually.
It can include usage of Excel/Python/C# if needed, but SQL is preferable.
The official way to interrogate the parameters is with sp_describe_undeclared_parameters, e.g.
exec sp_describe_undeclared_parameters @tsql = N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
It is very simple to implement by using tokenization via XML and XQuery.
Notable points:
The 1st CROSS APPLY tokenizes the Query column as XML.
The 2nd CROSS APPLY filters out tokens that don't contain the "@" symbol.
SQL #1
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, Query VARCHAR(2048));
INSERT INTO @tbl (Query) VALUES
('SELECT * FROM table WHERE id = @organisationId'),
('SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'),
('SELECT name + ''@'' + domain FROM user');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = SPACE(1);
SELECT t.ID
    , Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
    REPLACE(Query, @separator, ']]></r><r><![CDATA[') +
    ']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])').value('text()[1]','VARCHAR(1024)'))) AS t2(Par)
SQL #2
A cleansing step was added to first normalize whitespace characters other than a regular space.
SELECT t.*
    , Parameters = IIF(t2.Par LIKE '@[a-z]%', t2.Par, NULL)
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<r><![CDATA[' + Query + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')) AS t0(pure)
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
    REPLACE(Pure, @separator, ']]></r><r><![CDATA[') +
    ']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"@")])')
    .value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
Output
ID  Parameters
--  ---------------------------------
1   @organisationId
2   @startDate @endDate @enterOrgHere
3   NULL
You can use STRING_SPLIT and then remove the undesired characters. Here's a query:
DROP TABLE IF EXISTS #TEMP
SELECT 1 AS ID, 'SELECT * FROM table WHERE id = @organisationId' AS Query
INTO #TEMP
UNION ALL SELECT 2, 'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere'
UNION ALL SELECT 3, 'SELECT name + ''@'' + domain FROM user'
;WITH cte as
(
    SELECT ID,
           Query,
           STRING_AGG(REPLACE(REPLACE(REPLACE(value,'<',''),'>',''),'=',''),', ') AS Parameters
    FROM #TEMP
    CROSS APPLY string_split(Query,' ')
    WHERE value LIKE '%@[a-z]%'
    GROUP BY ID,
             Query
)
SELECT #TEMP.*, cte.Parameters
FROM #TEMP
LEFT JOIN cte on #TEMP.ID = cte.ID
Using SQL Server for parsing is a very bad idea because of low performance and lack of tools. I highly recommend using a .NET assembly or an external language (since your project is in Python anyway) with regular expressions or any other conversion method.
However, as a last resort, you can use something like this extremely slow and generally horrible code (this code works only on SQL Server 2017+, by the way; on earlier versions the code would be much more terrible):
DECLARE @sql TABLE
(
    id INT PRIMARY KEY IDENTITY
    , sql_query NVARCHAR(MAX)
);
INSERT INTO @sql (sql_query)
VALUES (N'SELECT * FROM table WHERE id = @organisationId')
    , (N'SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere')
    , (N' SELECT name + ''@'' + domain FROM user')
;
WITH prepared AS
(
    SELECT id
        , IIF(sql_query LIKE '%@%'
            , SUBSTRING(sql_query, CHARINDEX('@', sql_query) + 1, LEN(sql_query))
            , CHAR(32)
        ) prep_string
    FROM @sql
),
parsed AS
(
    SELECT id
        , IIF(CHARINDEX(CHAR(32), value) = 0
            , SUBSTRING(value, 1, LEN(value))
            , SUBSTRING(value, 1, CHARINDEX(CHAR(32), value) - 1)
        ) parsed_value
    FROM prepared p
    CROSS APPLY STRING_SPLIT(p.prep_string, '@')
)
SELECT id, '@' + STRING_AGG(IIF(parsed_value LIKE '[a-zA-Z]%', parsed_value, NULL), ', @')
FROM parsed
GROUP BY id
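Since the question explicitly allows Python, the same extraction can be sketched with a regular expression (the queries below are the question's sample data):

```python
import re

# Sample queries from the question, keyed by ID.
queries = {
    1: "SELECT * FROM table WHERE id = @organisationId",
    2: "SELECT * FROM topic WHERE creation_time <=@startDate AND creation_time >= @endDate AND id = @enterOrgHere",
    3: "SELECT name + '@' + domain FROM user",
}

# '@' followed by at least one letter, then any run of word characters;
# the match ends at the first non-word character (space, comma, newline, ...).
# A bare '@' as in the email-binding example never matches.
param_re = re.compile(r"@[A-Za-z]\w*")

params = {qid: (param_re.findall(q) or None) for qid, q in queries.items()}
```

This matches the requested rule directly: at least one letter after the @, terminated by a non-word character, with NULL (None) when no parameter is present.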
I am connecting to Snowflake to query the row counts of view tables, together with metadata about the views. My query looks like the one below. I was wondering if I can generate the UNION ALL part by iterating in Python? When I try to run the query below I receive an error saying "view_table_3" does not exist.
Thanks in advance for your time and efforts!
Query to get the row count for Snowflake view tables (with metadata):
view_tables=['view_table1','view_table2','view_table3','view_table4']
print(f""" SELECT * FROM (SELECT TABLE_SCHEMA,TABLE_NAME,CREATED,LAST_ALTERED FROM SCHEMA='INFORMATION_SCHEMA.VIEWS' WHERE TABLE_SCHEMA='MY_SCHEMA' AND TABLE_NAME IN ({','.join("'" +x+ "'" for x in view_tables)})) t1
LEFT JOIN
(SELECT 'view_table1' table_name2, count(*) as view_row_count from MY_DB.SCHEMA.view_table1
UNION ALL SELECT {','.join("'" +x+ "'" for x in view_tables[1:])},count(*) as view_row_count from MY_DB.SCHEMA.{','.join("" +x+ "" for x.replace("'"," ") in view_tables)})t2
on t1.TABLE_NAME =t2.table_name2 """)
If you want to make a union dynamically, put the entire SELECT query inside the generator, and then join them with ' UNION '.
sql = f'''SELECT * FROM INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN (
{' UNION '.join(f"SELECT '{table}' AS table_name2, COUNT(*) AS view_row_count FROM MY_SCHEMA.{table}" for table in view_tables)}
) AS t2 ON v.TABLE_NAME = t2.table_name2
WHERE v.TABLE_NAME IN ({','.join(f"'{table}'" for table in view_tables)})
'''
print(sql)
I want to query the max date in a table and use it as a parameter in a WHERE clause of another query. I am doing this:
query = (""" select
cast(max(order_date) as date)
from
tablename
""")
cursor.execute(query)
d = cursor.fetchone()
The output: [(datetime.date(2021, 9, 8),)]
Then I want to use this output as a parameter in another query:
query3=("""select * from anothertable
where order_date = d::date limit 10""")
cursor.execute(query3)
The output: column "d" does not exist
I tried cast(d as date) and d::date, but nothing works. I also tried datetime.date(d), with no success either.
What am I doing wrong here?
There is no reason to select the date and then use it in another query. That requires 2 round trips to the server. Do it in a single query; this has the advantage of removing all client-side processing of that date.
select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
);
I am not exactly sure how this translates into your obfuscation layer but, from what I see, I believe it would be something like:
query = (""" select *
from anothertable
where order_date =
( select max(cast(order_date as date ))
from tablename
) """)
cursor.execute(query)
Heed the warning by @OneCricketeer. You may need a cast on anothertable's order_date as well: where cast(order_date as date) = ( select ... )
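If the two-step approach is kept anyway, the fetched value should be passed to the driver as a bound parameter rather than written into the SQL text by name. A runnable sketch follows; the asker's driver is unknown, so sqlite3 (with `?` placeholders) stands in here, and the table contents are invented for illustration. Other DB-API drivers use `%s` placeholders but the pattern is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tablename (order_date text)")
conn.execute("create table anothertable (order_date text, val int)")
conn.executemany("insert into tablename values (?)",
                 [("2021-09-07",), ("2021-09-08",)])
conn.executemany("insert into anothertable values (?, ?)",
                 [("2021-09-08", 1), ("2021-09-01", 2)])

# Step 1: fetch the max date; fetchone() returns a tuple, so unpack it.
(max_date,) = conn.execute(
    "select max(order_date) from tablename"
).fetchone()

# Step 2: bind the fetched value as a parameter instead of
# referring to the Python variable by name inside the SQL string.
rows = conn.execute(
    "select * from anothertable where order_date = ? limit 10",
    (max_date,),
).fetchall()
```

The original error ("column d does not exist") happened because `d` is a Python variable, which the database cannot see inside the SQL text.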
I'm trying to run the following query through pandasql, but the output is not what I expected. I expected a table with exactly 800 rows, since I select only the employee_day_transmitter keys in employee_days_transmitters, but I get a table with more than 800 rows. What's wrong, and how can I get exactly the 800 rows related to the employee_day_transmitter keys selected in employee_days_transmitters?
query_text = '''WITH employee_days_transmitters AS (
SELECT DISTINCT
employeeId
, theDate
, transmitterId
, employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId AS employee_day_transmitter
FROM
table1
WHERE variable='rpv'
ORDER BY
RANDOM()
LIMIT
800
)
SELECT
*
FROM
table1
WHERE
(employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId) IN (SELECT employee_day_transmitter FROM employee_days_transmitters) AND variable = 'rpv'
'''
table2=pandasql.sqldf(query_text,globals())
You are using DISTINCT in the CTE, so I suspect you have duplicates for the combination of the columns employeeId, theDate, transmitterId, and this is why you get more than 800 rows.
You select 800 rows in the CTE, but when you use the IN operator in the main query, all rows that satisfy your conditions are returned, which is more than 800.
But why do you use the CTE at all?
You could apply the conditions directly in the main query:
SELECT DISTINCT employeeId, theDate, transmitterId
FROM table1
WHERE variable='rpv'
ORDER BY RANDOM()
LIMIT 800
Or maybe with ROW_NUMBER() window function:
WITH cte AS (
SELECT id
FROM (
SELECT rowid id,
ROW_NUMBER() OVER (PARTITION BY employeeId, theDate, transmitterId ORDER BY RANDOM()) rn
FROM table1
WHERE variable='rpv'
)
WHERE rn = 1
ORDER BY RANDOM()
LIMIT 800
)
SELECT *
FROM table1
WHERE rowid IN cte
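The duplicate effect described above can be demonstrated with a tiny script (sqlite3 is used here because pandasql runs on SQLite; the two-row sample is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table1 "
    "(employeeId int, theDate text, transmitterId int, variable text)"
)
# Two physical rows share the same (employeeId, theDate, transmitterId) key.
conn.executemany(
    "insert into table1 values (?,?,?,?)",
    [(1, "2021-01-01", 7, "rpv"), (1, "2021-01-01", 7, "rpv")],
)

# DISTINCT sees one key...
distinct_keys = conn.execute(
    "select count(*) from (select distinct employeeId, theDate, transmitterId "
    "from table1 where variable='rpv')"
).fetchone()[0]

# ...but an IN lookup on that key returns both physical rows.
matching_rows = conn.execute(
    "select count(*) from table1 where variable='rpv'"
).fetchone()[0]
```

One distinct key, two matching rows: scaled up, 800 selected keys can easily match more than 800 rows.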
Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using Python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors

# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)

# Get column metadata
sql = """SELECT *
         FROM `INFORMATION_SCHEMA`.`COLUMNS`
         WHERE `TABLE_SCHEMA`='my_database'
      """
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()

# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]

# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME == my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',', '')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])

# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match in the answer above is actually the method you're already using, just implemented less efficiently in reflective SQL.
I'd do the same as you did: build SQL like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
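That generation step can be sketched in Python; the table and column names below are placeholders, and in practice they would come from information_schema exactly as in the question's workaround:

```python
# Build, for one table, the SELECT described above:
# a total row count plus one SUM(IF(col IS NULL,1,0)) term per column.
def build_null_count_sql(table, columns):
    col_exprs = ",\n  ".join(
        f"SUM(IF(`{c}` IS NULL,1,0)) AS `{c}`" for c in columns
    )
    return (
        "SELECT\n"
        "  COUNT(*) AS `count`,\n"
        f"  {col_exprs}\n"
        f"FROM `{table}`;"
    )

# Hypothetical table/columns; substitute names fetched from information_schema.
sql = build_null_count_sql("my_table", ["col1", "col2"])
```

Running the generated statement per table and subtracting each null count from `count` gives the (tableName, columnName, total, nulls) tuples mentioned above.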
It is possible, but it's not going to be quick.
As mentioned in a previous answer, you can work your way through the columns table in information_schema to build queries that get the counts. It's then just a question of how long you are prepared to wait for the answer, because you end up counting every row, for every column, in every table. You can speed things up a bit by excluding columns that are defined as NOT NULL from the cursor (i.e. keep only IS_NULLABLE = 'YES').
The solution suggested by LSerni will be much faster, particularly if you have very wide tables and/or high row counts, but it requires more work to handle the results.
e.g.
DELIMITER //
DROP PROCEDURE IF EXISTS non_nulls //
CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
-- Parameters:
-- Schema name to check
-- call non_nulls('sakila');
DECLARE vTABLE_NAME varchar(64);
DECLARE vCOLUMN_NAME varchar(64);
DECLARE vIS_NULLABLE varchar(3);
DECLARE vCOLUMN_KEY varchar(3);
DECLARE done BOOLEAN DEFAULT FALSE;
DECLARE cur1 CURSOR FOR
SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
FROM `information_schema`.`columns`
WHERE `TABLE_SCHEMA` = sname
ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;
DROP TEMPORARY TABLE IF EXISTS non_nulls;
CREATE TEMPORARY TABLE non_nulls(
table_name VARCHAR(64),
column_name VARCHAR(64),
column_key CHAR(3),
is_nullable CHAR(3),
rows BIGINT,
populated BIGINT
);
OPEN cur1;
read_loop: LOOP
FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
IF done THEN
LEAVE read_loop;
END IF;
SET @sql := CONCAT('INSERT INTO non_nulls ',
    '(table_name,column_name,column_key,is_nullable,rows,populated) ',
    'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
    vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
    'FROM `', sname, '`.`', vTABLE_NAME, '`');
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END LOOP;
CLOSE cur1;
SELECT * FROM non_nulls;
END //
DELIMITER ;
call non_nulls('sakila');