How to extract table names and column names from sql query?

Let's assume we have a simple query like this:
Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;
The result should look like this:
tb1 col1
tb1 col7
tb2 col2
tb2 col8
I've tried to solve this problem using some Python libraries:
1) Even extracting only the tables using sqlparse can be a huge problem. For example, this official book doesn't work properly at all.
2) Using regular expressions seems really hard to get right.
3) But then I found this, which might help. However, the problem is that I can't connect to any database and execute that query.
Any ideas?

sql-metadata is a Python library that uses a tokenized query returned by python-sqlparse and generates query metadata.
This metadata can return column and table names from your supplied SQL query. Here are a couple of examples from the sql-metadata GitHub README:
>>> sql_metadata.get_query_columns("SELECT test, id FROM foo, bar")
[u'test', u'id']
>>> sql_metadata.get_query_tables("SELECT test, id FROM foo, bar")
[u'foo', u'bar']
>>> sql_metadata.get_query_limit_and_offset('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000')
(50, 1000)
A hosted version of the library exists at sql-app.infocruncher.com to see if it works for you.

Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to get several tokens out of a string. The following code defines these rules for the different parts of your SQL string and then resolves the aliases, since the input string may refer to a table by its alias. As a result, you get a dictionary (result) with the table names as keys and their columns as values.
import ply.lex as lex, re
tokens = (
    "TABLE",
    "JOIN",
    "COLUMN",
    "TRASH"
)
tables = {"tables": {}, "alias": {}}
columns = []
t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"
def t_TABLE(t):
    r"from\s(\w+)\sas\s(\w+)"
    regex = re.compile(t_TABLE.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t
def t_JOIN(t):
    r"inner\s+join\s+(\w+)\s+as\s+(\w+)"
    regex = re.compile(t_JOIN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t
def t_COLUMN(t):
    r"(\w+\.\w+)"
    regex = re.compile(t_COLUMN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        t.value = m.group(1)
        columns.append(t.value)
    return t
def t_error(t):
    # stop at the first piece of text that no rule recognizes
    raise TypeError("Unknown text '%s'" % (t.value,))
# here is where the magic starts
def mylex(inp):
    lexer = lex.lex()
    lexer.input(inp)
    # consuming the tokens fills the tables and columns containers above
    for token in lexer:
        pass
    result = {}
    for col in columns:
        tbl, c = col.split('.')
        # map an alias back to the real table name
        if tbl in tables["alias"].keys():
            key = tables["alias"][tbl]
        else:
            key = tbl
        if key in result:
            result[key].append(c)
        else:
            result[key] = list()
            result[key].append(c)
    print(result)
    # {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}
string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)

moz-sql-parser is a Python library that converts a subset of SQL-92 queries to JSON-izable parse trees. Maybe it is what you want.
Here is an example.
>>> from moz_sql_parser import parse
>>> parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")
{'select': [{'value': 'id'}, {'value': 'name'}], 'from': 'dual', 'where': {'and': [{'gt': ['id', 3]}, {'lt': ['id', 10]}]}, 'orderby': {'value': 'name'}}

I am tackling a similar problem and found a simpler solution that seems to work well.
import re
def tables_in_query(sql_str):
    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)
    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match(r"^\s*(--|#)", line)]
    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])
    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)
    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).
    tables = set()
    get_next = False
    for tok in tokens:
        if get_next:
            if tok.lower() not in ["", "select"]:
                tables.add(tok)
            get_next = False
        get_next = tok.lower() in ["from", "join"]
    # collect every token that starts with a table name (e.g. tb1.col7)
    dictTables = dict()
    for table in tables:
        fields = []
        for token in tokens:
            if token.startswith(table):
                if token != table:
                    fields.append(token)
        if len(list(set(fields))) >= 1:
            dictTables[table] = list(set(fields))
    return dictTables
Code adapted from https://grisha.org/blog/2016/11/14/table-names-from-sql/

Create a list of all the tables that are present in the DB. You can then search each table name in the queries.
This obviously isn't foolproof and the code will break in case any column/alias name matches the table name.
But it can be done as a workaround.
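As a rough sketch of this workaround, assuming you already have the table names from somewhere (the known_tables list and the function name below are made up for illustration), a whole-word, case-insensitive search could look like this:

```python
import re

def known_tables_in_query(sql, known_tables):
    """Return the known table names that appear as whole words in the query."""
    found = []
    for name in known_tables:
        # \b keeps "tb1" from matching inside "tb10"; re.escape guards odd names
        if re.search(r"\b%s\b" % re.escape(name), sql, flags=re.IGNORECASE):
            found.append(name)
    return found

sql = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
print(known_tables_in_query(sql, ["tb1", "tb2", "tb3"]))  # ['tb1', 'tb2']
```

As noted above, a column or alias that shares its name with a table would still be reported as a table.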

import pandas as pd
#%config PPMagics.autolimit=0
#txt = """<your SQL text here>"""
txt_1 = txt
replace_list = ['\n', '(', ')', '*', '=','-',';','/','.']
for i in replace_list:
    txt_1 = txt_1.replace(i, ' ')
txt_1 = txt_1.split()
res = []
# stop one token early so txt_1[i+1] never runs past the end of the list
for i in range(1, len(txt_1) - 1):
    if txt_1[i-1].lower() in ['from', 'join','table'] and txt_1[i].lower() != 'select':
        # '.' was replaced by a space above, so re-join the two tokens after the keyword
        res.append(txt_1[i] + "." + txt_1[i+1])
#df.head()
res_l = res
f_res_l = []
for i in range(0,len(res_l)):
    if len(res_l[i]) > 15 : # change it to 0 if you want all the caught strings
        f_res_l.append(res_l[i])
All_Table_List = f_res_l
print("All the unique tables from the SQL text, in the order of their appearance in the code : \n",100*'*')
df = pd.DataFrame(All_Table_List,columns=['Tables_Names'])
df.reset_index(level=0, inplace=True)
list_=list(df["Tables_Names"].unique())
df_1_Final = pd.DataFrame(list_,columns=['Tables_Names'])
df_1_Final.reset_index(level=0, inplace=True)
df_1_Final

Unfortunately, in order to do this successfully for "complex SQL" queries, you will more or less have to implement a complete parser for the particular database engine you are using.
As an example, consider this very basic complex query:
WITH a AS (
SELECT col1 AS c FROM b
)
SELECT c FROM a
In this case, a is not a table but a common table expression (CTE), and should be excluded from your output. There's no simple way of using regular expressions to realize that b is a table access but a is not. Your code will really have to understand the SQL at a deeper level.
Also consider
SELECT * FROM tbl
You'd have to know the column names actually present in a particular instance of a database (and accessible to a particular user, too) to answer that correctly.
If by "works with complex SQL" you mean that it must work with any valid SQL statement, you also need to specify for which SQL dialect, or implement dialect-specific solutions. A solution which works with any SQL handled by a database that does not implement CTEs would not work in one that does.
I am sorry to say so, but I do not think you will find a complete solution which works for arbitrarily complex SQL queries. You'll have to settle for a solution which works with a subset of a particular SQL-dialect.
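To illustrate the CTE point with the example query above, here is a small sketch. Both regular expressions and both function names are made up for this illustration, and they only cover the simple "WITH name AS (" shape; subqueries, quoted identifiers, or dialect-specific syntax would need a real parser.

```python
import re

# Naive rule: whatever follows FROM or JOIN is treated as a table.
FROM_JOIN = re.compile(r"\b(?:from|join)\s+(\w+)", re.IGNORECASE)
# Names introduced as "WITH name AS (" or ", name AS (".
CTE_NAME = re.compile(r"(?:\bwith\s+|,\s*)(\w+)\s+as\s*\(", re.IGNORECASE)

def naive_tables(sql):
    return set(FROM_JOIN.findall(sql))

def tables_without_ctes(sql):
    # Drop every name that the statement itself defines as a CTE.
    return naive_tables(sql) - set(CTE_NAME.findall(sql))

sql = """WITH a AS (
SELECT col1 AS c FROM b
)
SELECT c FROM a"""

print(sorted(naive_tables(sql)))         # ['a', 'b']
print(sorted(tables_without_ctes(sql)))  # ['b']
```

The naive rule reports the CTE a as a table; the second function only gets it right because it was taught about one more piece of SQL grammar, which is exactly the slippery slope described above.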

For my simple use case (one table in query, no joins), I used the following tweak
lst = "select * from table".split(" ")
lst = [item for item in lst if len(item)>0]
table_name = lst[lst.index("from")+1]

Related

SQL query format

I have a list of strings that I need to pass to an SQL query.
listofinput = []
for i in input:
    listofinput.append(i)
if(len(listofinput)>1):
    listofinput = format(tuple(listofinput))
sql_query = f"""SELECT * FROM countries
where
name in {listofinput};
"""
This works when I have several values, but it fails when there is just one:
listofinput is ['USA'] for one value
but ('USA', 'Germany') for multiple values.
I also need to do this for thousands of inputs. What is the most efficient way to achieve this? name in my table countries is an indexed column.

You can just convert to a tuple and then, if the second-to-last character is a comma, remove it.
listofinput = format(tuple(input))
if listofinput[-2] == ",":
    listofinput = f"{listofinput[:-2]})"
sql_query = f"""SELECT * FROM countries
where name in {listofinput};"""

Change if(len(listofinput)>1): to if(len(listofinput)>=1):
This might work.

Remove the condition if(len(listofinput)>1).
If you don't convert to a tuple, your query will look like this:
... where name in ['USA']
or
... where name in []
and in [...] is not valid SQL; only in (...) is accepted.
You can remove format() too:
listofinput = tuple(listofinput)
Final code:
listofinput = []
for i in input:
    listofinput.append(i)
listofinput = tuple(listofinput)
sql_query = f"""SELECT * FROM countries
WHERE
name IN {listofinput};
"""

Yes, a tuple with one element requires a trailing ",".
To work around the problem, you can build the string yourself in the single-value case by changing your code to the following:
listofinput = []
for i in input:
    listofinput.append(i)
if(len(listofinput)>1):
    listofinput = format(tuple(listofinput))
else:
    # keep the quotes around the value so the result is ('USA'), not (USA)
    listofinput = "('" + listofinput[0] + "')"
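An alternative that none of the answers above uses is to let the database driver substitute the values, which sidesteps the one-element tuple problem and avoids quoting issues. The sketch below uses the standard-library sqlite3 module purely for illustration; the table contents and the helper name countries_by_name are made up, and other drivers may use a different placeholder style (for example %s instead of ?).

```python
import sqlite3

def countries_by_name(conn, names):
    """Select rows whose name is in `names`, using one placeholder per value."""
    names = list(names)
    if not names:
        return []  # "IN ()" is not valid SQL, so skip the query entirely
    placeholders = ", ".join("?" for _ in names)
    sql = f"SELECT name FROM countries WHERE name IN ({placeholders}) ORDER BY name"
    return [row[0] for row in conn.execute(sql, names).fetchall()]

# Throwaway in-memory table, only for this demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (name TEXT)")
conn.executemany("INSERT INTO countries VALUES (?)", [("USA",), ("Germany",), ("France",)])

print(countries_by_name(conn, ["USA"]))             # ['USA']
print(countries_by_name(conn, ["USA", "Germany"]))  # ['Germany', 'USA']
```

For thousands of values, keep in mind that drivers usually cap the number of parameters per statement, so the list may need to be processed in chunks.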

How to replace a number in a string in Python?

I need to search a string and check if it contains numbers. If it does, I want to replace them with nothing. I've started doing something like this but I didn't find a solution for my problem.
table = "table1"
if any(chr.isdigit() for chr in table) == True:
    table = table.replace(chr, "_")
print(table)
# The output should be "table"
Any ideas?

You could do this in many different ways. Here's how it could be done with the re module:
import re
table = 'table1'
table = re.sub(r'\d+', '', table)

This sounds like a task for the .translate method of str. You could do:
table = "table1"
table = table.translate("".maketrans("","","0123456789"))
print(table) # table
The first two arguments of maketrans are for character-for-character replacement; since we do not need that here, we pass empty strings. The third (optional) argument lists the characters to remove.
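To make the role of the three maketrans arguments concrete, here is a short sketch; the variable names are arbitrary.

```python
# Third argument only: every listed character maps to None, i.e. is deleted.
delete_digits = str.maketrans("", "", "0123456789")
print("table123".translate(delete_digits))  # table

# First two arguments: character-for-character replacement (a->x, b->y),
# combined here with deletion of the digits.
swap_and_delete = str.maketrans("ab", "xy", "0123456789")
print("table1".translate(swap_and_delete))  # txyle
```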

If you don't want to import any modules, you could try:
table = "".join([i for i in table if not i.isdigit()])

table = "table123"
for i in table:
    if i.isdigit():
        table = table.replace(i, "")
print(table)

I found this works to remove numbers quickly.
table = "table1"
table_temp = ""
for i in table:
    if i not in "0123456789":
        table_temp += i
print(table_temp)

char_nums = [chr for chr in table if chr.isdigit()]
for i in char_nums:
    table = table.replace(i, "")
print(table)

How to use a list as a parameter in a WHERE clause in Python and CosmosDB

I have a list of IDs and I want to use it as a parameter like this:
list_id = ['4833', '43443', '431431']
qry = f"""SELECT
c.nm_cnae as Nome_CNAE
FROM cnae as c
WHERE c.cod_cnae = '{list_id}'"""
resultado_busca = cosmosengine.query('cnae', qry)
resultado_busca = list(resultado_busca)
How can I make this work?
I'm using Azure Cosmos DB.

Assuming you want to find all the ones that have an id in the list it would be:
list_id = ['4833', '43443', '431431']
qry = f"""SELECT
c.nm_cnae as Nome_CNAE
FROM cnae as c
WHERE c.cod_cnae IN ({','.join(list_id)})"""
resultado_busca = cosmosengine.query('cnae', qry)
resultado_busca = list(resultado_busca)
This uses the IN operator in sql: https://www.w3schools.com/sql/sql_in.asp
','.join(list_id) creates a string where each value is separated by a comma.
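If the client in use accepts query parameters, the values can also be passed separately from the query text instead of being interpolated into it. The sketch below only builds the query string and a parameter list; it assumes the list-of-dicts format with "name" and "value" keys that the azure-cosmos Python SDK takes in the parameters argument of query_items. Whether the cosmosengine.query wrapper from the question accepts parameters is not known, and build_in_query is a hypothetical helper.

```python
def build_in_query(ids):
    """Return (query_text, parameters) with one named placeholder per id."""
    if not ids:
        raise ValueError("need at least one id; 'IN ()' is not a valid clause")
    names = [f"@id{i}" for i in range(len(ids))]
    query = (
        "SELECT c.nm_cnae AS Nome_CNAE "
        "FROM cnae AS c "
        f"WHERE c.cod_cnae IN ({', '.join(names)})"
    )
    # Values keep their Python type, so string ids stay strings.
    parameters = [{"name": n, "value": v} for n, v in zip(names, ids)]
    return query, parameters

list_id = ['4833', '43443', '431431']
qry, params = build_in_query(list_id)
print(qry)
print(params)
```

Because the values travel as parameters, string IDs are compared as strings without any manual quoting.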

Bulk replace with regular expressions in Python

For a Django application, I need to turn all occurrences of a pattern in a string into a link if I have the resource related to the match in my database.
Right now, here's the process:
- I use re.sub to process a very long string of text
- When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database
- If there is a match, it wraps a link around the match.
The problem is that there are sometimes hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.
So: can you do a bulk find and replace using regular expressions in Python?
For reference, here's the code (for the curious, the patterns I'm looking up are for legal citations):
def add_linked_citations(text):
    linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
    return linked_text
def create_citation_link(match_object):
    volume = None
    reporter = None
    page = None
    if match_object.group("volume") not in [None, '']:
        volume = match_object.group("volume")
    if match_object.group("reporter") not in [None, '']:
        reporter = match_object.group("reporter")
    if match_object.group("page") not in [None, '']:
        page = match_object.group("page")
    if volume and reporter and page: # These should all be here...
        # !!! Here's where I keep hitting the database
        citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
        if citations.exists():
            citation = citations[0]
            document = citation.document
            url = document.url()
            return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        else:
            return '%s %s %s' % (volume, reporter, page)

Sorry if this is obvious and wrong (that no-one has suggested it in 4 hours is worrying!), but why not search for all matches, do a batch query for everything (easy once you have all matches), and then call sub with the dictionary of results (so the function pulls the data from the dict)?
You have to run the regexp twice, but it seems like the database access is the expensive part anyway.
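A minimal sketch of this two-pass idea, runnable without Django: FAKE_DB and batch_lookup are hypothetical stand-ins for the Citation table and for a single bulk query, and the URL is invented.

```python
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)

# Hypothetical stand-in for the database: (volume, reporter, page) -> URL
FAKE_DB = {("410", "U.S.", "113"): "/doc/1/"}

def batch_lookup(triplets):
    """Stand-in for ONE bulk query; returns {triplet: url} for known triplets."""
    return {t: FAKE_DB[t] for t in triplets if t in FAKE_DB}

def add_linked_citations(text):
    # Pass 1: collect every unique triplet and resolve them all at once.
    triplets = {m.group("volume", "reporter", "page")
                for m in CITATION_RE.finditer(text)}
    urls = batch_lookup(triplets)

    # Pass 2: substitute, reading from the dict instead of the database.
    def link(m):
        url = urls.get(m.group("volume", "reporter", "page"))
        return '<a href="%s">%s</a>' % (url, m.group()) if url else m.group()

    return CITATION_RE.sub(link, text)

print(add_linked_citations("See 410 U.S. 113 and 5 Foo 7."))
```

The second regex pass is cheap compared with hundreds of individual queries, which is the trade-off this answer describes.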

You can do it with a single regexp pass, by using finditer which returns match objects.
Match objects have:
- a method returning a dict of the named groups, groupdict()
- the start and end positions of the match in the original text, span()
- the original matching text, group()
So I would suggest that you:
1) Make a list of all the matches in your text using finditer
2) Make a list of all the unique volume, reporter, page triplets in the matches
3) Look up those triplets
4) Correlate each match object with the result of the triplet lookup, if found
5) Process the original text, splitting by the match spans and interpolating lookup results.
I've implemented the database lookup by combining a list of Q objects: Q(volume=foo1,reporter=bar2,page=baz3)|Q(volume=foo1,reporter=bar2,page=baz3).... There may be more efficient approaches.
Here's an untested implementation:
import re
from functools import reduce
from django.db.models import Q
from collections import namedtuple
Triplet = namedtuple('Triplet',['volume','reporter','page'])
def lookup_references(matches):
    match_to_triplet = {}
    triplet_to_url = {}
    for m in matches:
        group_dict = m.groupdict()
        if any(not(x) for x in group_dict.values()): # Filter out matches we don't want to lookup
            continue
        match_to_triplet[m] = Triplet(**group_dict)
    # Build query
    unique_triplets = set(match_to_triplet.values())
    if not unique_triplets: # Nothing to look up, so skip the query
        return {}
    # List of Q objects
    q_list = [Q(**trip._asdict()) for trip in unique_triplets]
    # Consolidated Q
    single_q = reduce(Q.__or__,q_list)
    for row in Citation.objects.filter(single_q).values('volume','reporter','page','url'):
        url = row.pop('url')
        triplet_to_url[Triplet(**row)] = url
    # Now pair original match objects with URL where found
    lookups = {}
    for match, triplet in match_to_triplet.items():
        if triplet in triplet_to_url:
            lookups[match] = triplet_to_url[triplet]
    return lookups
def interpolate_citation_matches(text,matches,lookups):
    result = []
    prev = 0
    for m in matches:
        m_start, m_end = m.span()
        if prev != m_start:
            result.append(text[prev:m_start])
        # Now check match
        if m in lookups:
            result.append('<a href="%s">%s</a>' % (lookups[m],m.group()))
        else:
            result.append(m.group())
        prev = m_end # Continue after this match
    # Append whatever follows the last match
    result.append(text[prev:])
    return ''.join(result)
def process_citations(text):
    citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
    matches = list(re.finditer(citation_regex,text))
    lookups = lookup_references(matches)
    new_text = interpolate_citation_matches(text,matches,lookups)
    return new_text

Case-insensitive query that supports multiple search words

I'm trying to perform a case-insensitive query. I would generally use __icontains, but since it doesn't accept the list returned by .split(), I'm stuck with using __in instead:
def search(request):
    query = request.GET.get('q', '')
    query = query.lower()
    product_results = []
    category_results = []
    if query:
        product_results = Product.objects.filter(Q(title__in=query.split())|
                                                 Q(brand__in=query.split())|
                                                 Q(description__in=query.split()))
        category_results = Category.objects.filter(title__in=query.split())
My problem is that the object fields usually have the first letter capitalized, so an all-lowercase query never matches anything.
Any way around this?

I have solved this problem by using exec to generate code from a string using icontains instead of in. I admit, it's sloppy and not elegant, and should be audited for security but it worked.
See the untested pseudocode:
query = "Product.objects.filter("
for word in words:
    query += "Q(title__icontains=%r)|" % word
    query += "Q(brand__icontains=%r)|" % word
    query += "Q(description__icontains=%r)|" % word
query = query[:-1] # remove the trailing |
query += ")"
exec("product_results = "+query)
Again, this is probably not advisable, and I'm sure there's a better way to do this, but this fixed me up in a pinch once so I thought I would share. Also note, I don't use this code anymore as I've switched over to sqlalchemy, which makes these kinds of dynamic queries a bit easier since its "or" object accepts a list.
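The same dynamic OR can be built without exec by creating one Q object per word and field and folding them together with functools.reduce and operator.or_. The sketch below uses a tiny stand-in Q class so that it runs without a Django project; with Django you would import django.db.models.Q instead and pass the result to Product.objects.filter(). The helper name build_search_q is made up.

```python
from functools import reduce
import operator

class Q:
    """Tiny stand-in for django.db.models.Q so the sketch runs without Django."""
    def __init__(self, **lookups):
        self.parts = [lookups] if lookups else []

    def __or__(self, other):
        combined = Q()
        combined.parts = self.parts + other.parts
        return combined

def build_search_q(words, fields=("title", "brand", "description")):
    """OR together <field>__icontains=<word> for every word/field pair."""
    conditions = [Q(**{f"{field}__icontains": word})
                  for word in words for field in fields]
    # reduce needs at least one condition, so words must not be empty
    return reduce(operator.or_, conditions)

q = build_search_q(["red", "shoe"])
print(len(q.parts))  # 6
```

Passing the lookup name through ** is what makes the field name dynamic, so no code has to be generated from strings.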

Thanks for sharing. I wrote up this quick hack, not elegant at all...
def search(request):
    query = request.GET.get('q', '')
    query = query.split()
    product_results = []
    category_results = []
    if query:
        for x in query:
            product_results.extend(Product.objects.filter(Q(title__icontains=x)|
                                                          Q(brand__icontains=x)|
                                                          Q(description__icontains=x)))
            category_results.extend(Category.objects.filter(title__icontains=x))
    query = request.GET.get('q', '')
    product_results = list(set(product_results))
    category_results = list(set(category_results))
    return render_to_response('search_results.html', {'query': query,
                                                      'product_results': product_results,
                                                      'category_results': category_results})
