I have a table with a string column that stores SQL queries written earlier and saved to be run later. They contain parameters such as '#organisationId' or '#enterDateHere'. I want to be able to extract these.
Example:
ID | Query
1 | SELECT * FROM table WHERE id = #organisationId
2 | SELECT * FROM topic WHERE creation_time <=#startDate AND creation_time >= #endDate AND id = #enterOrgHere
3 | SELECT name + '#' + domain FROM user
I want the following:
ID | Parameters
1 | #organisationId
2 | #startDate, #endDate, #enterOrgHere
3 | NULL
No need to worry about how to separate or list them, as long as they are clearly visible and the query returns all of them; I don't know in advance how many there are. Please note that sometimes the queries contain a bare #, for example when an email address is being built, but that is not a parameter. I want only strings that start with #, have at least one letter after it, and end at a non-letter character (space, line break, comma, semicolon). If this causes problems, then return all strings starting with # and I will simply identify the parameters manually.
It can include usage of Excel/Python/C# if needed, but SQL is preferable.
The official way to interrogate parameters is sp_describe_undeclared_parameters. Note that it only recognizes T-SQL parameters, which start with @, so it applies only if the stored queries use that prefix, e.g.
exec sp_describe_undeclared_parameters @tsql = N'SELECT * FROM topic WHERE creation_time <= @startDate AND creation_time >= @endDate AND id = @enterOrgHere'
It is very simple to implement by using tokenization via XML and XQuery.
Notable points:
The 1st CROSS APPLY tokenizes the Query column as XML.
The 2nd CROSS APPLY filters out tokens that don't contain the "#" symbol.
SQL #1
-- DDL and sample data population, start
-- Table variables and scalar variables must be declared with an @ prefix.
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, Query VARCHAR(2048));
INSERT INTO @tbl (Query) VALUES
('SELECT * FROM table WHERE id = #organisationId'),
('SELECT * FROM topic WHERE creation_time <=#startDate AND creation_time >= #endDate AND id = #enterOrgHere'),
('SELECT name + ''#'' + domain FROM user');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = SPACE(1);
SELECT t.ID
    , Parameters = IIF(t2.Par LIKE '#[a-z]%', t2.Par, NULL)
FROM @tbl AS t
    -- split the query on spaces into <r> elements
    CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
        REPLACE(Query, @separator, ']]></r><r><![CDATA[') +
        ']]></r></root>' AS XML)) AS t1(c)
    -- keep only the tokens containing "#" and trim comparison operators
    CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"#")])').value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
SQL #2
A cleansing step was added first to normalize whitespace other than regular spaces (tabs, line breaks).
SELECT t.*
    , Parameters = IIF(t2.Par LIKE '#[a-z]%', t2.Par, NULL)
FROM @tbl AS t
    -- xs:token collapses any whitespace run into a single space
    CROSS APPLY (SELECT TRY_CAST('<r><![CDATA[' + Query + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')) AS t0(pure)
    CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
        REPLACE(Pure, @separator, ']]></r><r><![CDATA[') +
        ']]></r></root>' AS XML)) AS t1(c)
    CROSS APPLY (SELECT TRIM('><=' FROM c.query('data(/root/r[contains(text()[1],"#")])')
        .value('text()[1]','VARCHAR(1024)'))) AS t2(Par);
Output
ID | Parameters
1 | #organisationId
2 | #startDate #endDate #enterOrgHere
3 | NULL
You can use STRING_SPLIT and then remove the undesired characters. Here's a query:
DROP TABLE IF EXISTS #TEMP
SELECT 1 AS ID ,'SELECT * FROM table WHERE id = #organisationId' AS Query
INTO #TEMP
UNION ALL SELECT 2, 'SELECT * FROM topic WHERE creation_time <=#startDate AND creation_time >= #endDate AND id = #enterOrgHere'
UNION ALL SELECT 3, 'SELECT name + ''#'' + domain FROM user'
;WITH cte as
(
SELECT ID,
Query,
STRING_AGG(REPLACE(REPLACE(REPLACE(value,'<',''),'>',''),'=',''),', ') AS Parameters
FROM #TEMP
CROSS APPLY string_split(Query,' ')
WHERE value LIKE '%#[a-z]%'
GROUP BY ID,
Query
)
SELECT #TEMP.*,cte.Parameters
FROM #TEMP
LEFT JOIN cte on #TEMP.ID = cte.ID
Using SQL Server for parsing is a very bad idea because of poor performance and a lack of tools. I highly recommend using a .NET assembly or an external language (since your project is in Python anyway) with a regexp or any other conversion method.
However, as a last resort, you can use something like this extremely slow and generally horrible code (it works only on SQL Server 2017+, by the way; on earlier versions the code would be even worse):
-- A table variable must be declared with an @ prefix.
DECLARE @sql TABLE
(
    id INT PRIMARY KEY IDENTITY
    , sql_query NVARCHAR(MAX)
);
INSERT INTO @sql (sql_query)
VALUES (N'SELECT * FROM table WHERE id = #organisationId')
    , (N'SELECT * FROM topic WHERE creation_time <=#startDate AND creation_time >= #endDate AND id = #enterOrgHere')
    , (N' SELECT name + ''#'' + domain FROM user')
;
WITH prepared AS
(
    -- keep only the text after the first '#'
    SELECT id
        , IIF(sql_query LIKE '%#%'
            , SUBSTRING(sql_query, CHARINDEX('#', sql_query) + 1, LEN(sql_query))
            , CHAR(32)
        ) prep_string
    FROM @sql
),
parsed AS
(
    -- split on '#' and cut each piece at its first space
    SELECT id
        , IIF(CHARINDEX(CHAR(32), value) = 0
            , SUBSTRING(value, 1, LEN(value))
            , SUBSTRING(value, 1, CHARINDEX(CHAR(32), value) - 1)
        ) parsed_value
    FROM prepared p
        CROSS APPLY STRING_SPLIT(p.prep_string, '#')
)
-- keep only the pieces that start with a letter
SELECT id, '#' + STRING_AGG(IIF(parsed_value LIKE '[a-zA-Z]%', parsed_value, NULL), ', #') AS Parameters
FROM parsed
GROUP BY id
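The regexp route recommended at the start of this answer could look roughly like the sketch below. It is only an illustration under assumptions: the pattern `#[A-Za-z]\w*` is my reading of the rule in the question (a `#` followed by at least one letter), and the name `extract_parameters` and the hard-coded sample queries are hypothetical stand-ins for rows read from the real table.

```python
import re

# Assumed rule: '#', then a letter, then any further word characters.
# A bare '#' (as in name + '#' + domain) does not match.
PARAM_RE = re.compile(r"#[A-Za-z]\w*")


def extract_parameters(query):
    """Return the distinct #parameters of a query, in order of first appearance."""
    seen = []
    for name in PARAM_RE.findall(query):
        if name not in seen:
            seen.append(name)
    return seen


# Sample rows from the question; in practice these would come from the table.
queries = {
    1: "SELECT * FROM table WHERE id = #organisationId",
    2: "SELECT * FROM topic WHERE creation_time <=#startDate "
       "AND creation_time >= #endDate AND id = #enterOrgHere",
    3: "SELECT name + '#' + domain FROM user",
}

for query_id, text in queries.items():
    params = extract_parameters(text)
    print(query_id, ", ".join(params) if params else None)
```

Because the pattern is purely textual, it would also pick up things like temporary table names that start with `#`, so the result may still need a manual check.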
I need to execute a SQL query that deletes rows duplicated on one column and keeps the last record. It's a large table, so the Django ORM takes a very long time and I need a raw SQL query instead. The column name is customer_number and the table name is pages_dataupload. I'm using SQLite.
Update: I tried this, but it gives me "no such column: row_num":
cursor = connection.cursor()
cursor.execute(
'''WITH cte AS (
SELECT
id,
customer_number ,
ROW_NUMBER() OVER (
PARTITION BY
id,
customer_number
ORDER BY
id,
customer_number
) row_num
FROM
pages.dataupload
)
DELETE FROM pages_dataupload
WHERE row_num > 1;
'''
)
You can work with an Exists subquery [Django-doc] to determine efficiently if there is a younger DataUpload:
from django.db.models import Exists, OuterRef
DataUpload.objects.filter(Exists(
DataUpload.objects.filter(
pk__gt=OuterRef('pk'), customer_number=OuterRef('customer_number')
)
)).delete()
This will thus check for each DataUpload if there exists a DataUpload with a larger primary key that has the same customer_number. If that is the case, we will remove that DataUpload.
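For readers who want the raw SQL form of the same idea, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database and the sample rows are made up for illustration; only the table and column names come from the question, and the alias `newer` is my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages_dataupload (id INTEGER PRIMARY KEY, customer_number TEXT)"
)
# Made-up sample data: ids 1..6 are assigned automatically.
conn.executemany(
    "INSERT INTO pages_dataupload (customer_number) VALUES (?)",
    [("A",), ("B",), ("A",), ("C",), ("B",), ("A",)],
)

# Delete every row for which a row with the same customer_number and a
# larger id exists, so only the last row per customer_number remains.
conn.execute(
    """
    DELETE FROM pages_dataupload
    WHERE EXISTS (
        SELECT 1
        FROM pages_dataupload AS newer
        WHERE newer.customer_number = pages_dataupload.customer_number
          AND newer.id > pages_dataupload.id
    )
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT id, customer_number FROM pages_dataupload ORDER BY id"
).fetchall()
print(remaining)
```

On a large table, an index on customer_number would probably help the correlated subquery, though that is worth checking with EXPLAIN QUERY PLAN on the real data.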
I have solved the problem with the query below. Is there any way to reset the id field after removing the duplicates?
cursor = connection.cursor()
cursor.execute(
'''
DELETE FROM pages_dataupload WHERE id not in (
SELECT Max(id) FROM pages_dataupload Group By Dial
)
'''
)
I have a SQLite database named StudentDB which has 3 columns: Roll number, Name, Marks. Now I want to fetch only the columns that the user selects in the IDE. The user can select one column, two, or all three. How can I alter the query accordingly using Python?
I tried:
import sqlite3
sel={"Roll Number":12}
query = 'select * from StudentDB Where({seq})'.format(seq=','.join(['?']*len(sel))),[i for k,i in sel.items()]
con = sqlite3.connect(database)
cur = con.cursor()
cur.execute(query)
all_data = cur.fetchall()
all_data
I am getting:
operation parameter must be str
You should control the text of the query. The WHERE clause should always be in the form WHERE colname=value [AND colname2=...] or (better) WHERE colname=? [AND ...] if you want to build a parameterized query.
So you want:
query = 'select * from StudentDB Where ' + ' AND '.join('"{}"=?'.format(col)
for col in sel.keys())
...
cur.execute(query, tuple(sel.values()))
In your code, query is a tuple instead of a str, which is why you get the error.
I assume you want to execute a query like the one below:
select * from StudentDB Where "Roll number"=?
Then you can build the SQL query like this (assuming you want AND rather than OR):
query = "select * from StudentDB Where {seq}".format(seq=" and ".join('"{}"=?'.format(k) for k in sel.keys()))
and execute the query like this:
cur.execute(query, tuple(sel.values()))
Please make sure that the database variable is defined in your code and contains the database file name, and that StudentDB is indeed the table name and not the database name.
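Putting the pieces together, a runnable sketch might look like the following. The helper `build_select`, the `ALLOWED_COLUMNS` whitelist, and the in-memory sample rows are all assumptions added for illustration. The whitelist is there because column names cannot be bound as `?` parameters, so they should be validated before being formatted into the query text.

```python
import sqlite3

# Hypothetical whitelist of the table's real column names.
ALLOWED_COLUMNS = {"Roll number", "Name", "Marks"}


def build_select(filters):
    """Build a parameterized SELECT from a {column: value} dict."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError("unknown column(s): {}".format(sorted(unknown)))
    query = "SELECT * FROM StudentDB"
    if filters:
        # Column names are quoted because they may contain spaces.
        query += " WHERE " + " AND ".join('"{}" = ?'.format(col) for col in filters)
    return query, tuple(filters.values())


# In-memory stand-in for the real database, with made-up rows.
con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE StudentDB ("Roll number" INTEGER, "Name" TEXT, "Marks" INTEGER)'
)
con.executemany(
    "INSERT INTO StudentDB VALUES (?, ?, ?)",
    [(12, "Asha", 81), (13, "Ben", 67)],
)

query, params = build_select({"Roll number": 12})
rows = con.execute(query, params).fetchall()
print(query)
print(rows)
```

With an empty dict the helper returns the plain SELECT without a WHERE clause, which covers the case where the user selects nothing to filter on.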
I am trying to retrieve information from a database using a Python tuple containing a set of ids (between 1000 and 10000 ids), but my query uses the IN operator and is consequently very slow.
query = """ SELECT *
FROM table1
LEFT JOIN table2 ON table1.id = table2.id
LEFT JOIN ..
LEFT JOIN ...
WHERE table1.id IN {} """.format(my_tuple)
and then I query the PostgreSQL database and load the result into a Pandas dataframe:
with tempfile.TemporaryFile() as tmpfile:
copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
query=query, head="HEADER"
)
conn = db_engine.raw_connection()
cur = conn.cursor()
cur.copy_expert(copy_sql, tmpfile)
tmpfile.seek(0)
df = pd.read_csv(tmpfile, low_memory=False)
I know that IN is not very efficient with a high number of parameters, but I do not have any idea to optimise this part of the query. Any hint?
You could debug your query using the EXPLAIN statement. You are probably sequentially scanning a big table while needing only a few rows. Is the field table1.id indexed?
Or you could try to filter table1 first and then start joining:
with t1 as (
select f1,f2, .... from table1 where id in {}
)
select *
from t1
left join ....
My original query is like
select table1.id, table1.value
from some_database.something table1
join some_set table2 on table2.code=table1.code
where table1.date_ >= :_startdate and table1.date_ <= :_enddate
which is saved in a string in Python. If I do
x = session.execute(script_str, {'_startdate': start_date, '_enddate': end_date})
then
x.fetchall()
will give me the table I want.
Now the situation is that table2 is no longer available to me in the Oracle database; instead, it is available in my Python environment as a DataFrame. What is the best way to fetch the same table from the database in this case?
You can use the IN clause instead.
First remove the join from the script_str:
script_str = """
select table1.id, table1.value
from something table1
where table1.date_ >= :_startdate and table1.date_ <= :_enddate
"""
Then, get the codes from the dataframe:
codes = myDataFrame.code_column.values
Now we need to dynamically extend script_str and the query parameters:
param_names = ['_code{}'.format(i) for i in range(len(codes))]
script_str += "AND table1.code IN ({})".format(
", ".join([":{}".format(p) for p in param_names])
)
Create a dict with all parameters:
params = {
'_startdate': start_date,
'_enddate': end_date,
}
params.update(zip(param_names, codes))
And execute the query:
x = session.execute(script_str, params)
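The steps above can be tried end to end with the sketch below. It uses Python's built-in sqlite3 module as a stand-in for the Oracle session, since sqlite3 also accepts `:name` placeholders with a dict of parameters. The helper name `add_code_filter`, the sample rows, and the plain list standing in for the DataFrame column are assumptions for illustration.

```python
import sqlite3


def add_code_filter(script, params, codes):
    """Append 'AND table1.code IN (:_code0, ...)' and extend the parameters."""
    names = ["_code{}".format(i) for i in range(len(codes))]
    placeholders = ", ".join(":" + name for name in names)
    new_script = "{}\nAND table1.code IN ({})".format(script.rstrip(), placeholders)
    new_params = dict(params)
    new_params.update(zip(names, codes))
    return new_script, new_params


# In-memory stand-in for the real table, with made-up rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE something (id INTEGER, value REAL, code TEXT, date_ TEXT)")
con.executemany(
    "INSERT INTO something VALUES (?, ?, ?, ?)",
    [
        (1, 1.5, "A", "2020-01-10"),
        (2, 2.5, "B", "2020-01-20"),
        (3, 3.5, "C", "2020-01-30"),
        (4, 4.5, "A", "2021-06-01"),
    ],
)

base = """
select table1.id, table1.value
from something table1
where table1.date_ >= :_startdate and table1.date_ <= :_enddate
"""
script, params = add_code_filter(
    base,
    {"_startdate": "2020-01-01", "_enddate": "2020-12-31"},
    ["A", "C"],  # stands in for the DataFrame's code column
)
rows = con.execute(script, params).fetchall()
print(rows)
```

Two caveats for the real setup: Oracle limits an IN list to 1000 expressions, so a longer list of codes would need to be split into chunks or loaded into a temporary table, and converting the column with `.tolist()` avoids passing NumPy scalar types to the database driver.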