I am trying to use Python to generate a script that builds an UNLOAD command in Redshift. I am not an expert Python programmer. I need to generate all columns for the unload list, and if a column has a specific name, I need to replace it with a function. The challenge I am facing is that it appends "," to the last item in the dictionary. Is there a way I can avoid the last comma? Any help would be appreciated.
import psycopg2
from psycopg2.extras import RealDictCursor
try:
    conn = psycopg2.connect("dbname='test' port='5439' user='scott' host='something.redshift.amazonaws.com' password='tiger'")
except:
    print "Unable to connect to the database"
conn.cursor_factory = RealDictCursor
cur = conn.cursor()
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
try:
    cur.execute("SELECT column from pg_table_def where schema_name='myschema' and table_name ='tab1' ")
except:
    print "Unable to execute select statement from the database!"
result = cur.fetchall()
print "unload myschema.tab1 (select "
for row in result:
    for key, value in row.items():
        print "%s," % (value)
print ") AWS Credentials here on..."
conn.close()
Use the join function on the list of values in each row:
print ",".join(row.values())
Briefly, the join function is called on a string which we can think of as the "glue", and takes a list of "pieces" as its argument. The result is a string of "pieces" "held together" by the "glue". Example:
>>> glue = "+"
>>> pieces = ["a", "b", "c"]
>>> glue.join(pieces)
"a+b+c"
(since row.values() returns a list, you don't actually need the comprehension, so it's even simpler than I wrote it at first)
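To make the idea concrete, here is a minimal sketch of building the column list with join. The rows list is made-up sample data standing in for what cur.fetchall() would return with a RealDictCursor, and the key name "column" is an assumption based on the query in the question.

```python
# Hypothetical rows, shaped like the dicts a RealDictCursor returns.
rows = [{"column": "id"}, {"column": "name"}, {"column": "created_at"}]

# Collect the column names first, then glue them together once.
column_names = [row["column"] for row in rows]

# join() puts the separator only between items, so there is no trailing comma.
select_list = ", ".join(column_names)

print("select " + select_list)
```

Because the separator is only ever placed between items, the same line also works for a single-column table without any special case.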
In fact, this worked better.
columns = []
for row in result:
    if (row['column_name'] == 'mycol1') or (row['column_name'] == 'mycol2'):
        columns.append("func_sha1(" + row['column_name'] + "||" + salt + ")")
    else:
        columns.append(row['column_name'])
print selstr + ",".join(columns) + " ) TO s3://"
Thanks for your help, Jon
So I'm reading a list of part numbers from excel using Pandas, which can be just about anything, like:
287380274-87
or
ME982394-01
or
HOU8929
The list changes randomly based on what the user is looking for, and it can contain some bad values as well, such as blanks, invalid characters (<, >, or !), and phrases like '12390-01 to 04'. I don't care about filtering the part numbers for all of the random conditions that throw syntax errors in SQL. But I am attempting to query a SAP database WHERE part number IN (list):
import pandas as pd
from hdbcli import dbapi
userFile = r'T:\H01 Cell\Projects\Part Breakdown Update Spreadsheet Improvements\2022.03.21 Part Breakdown update - VK.xlsm'
# read input from excel for part numbers to WHERE in queries
partNums = pd.read_excel(
    io=userFile, sheet_name='Inputs', usecols=lambda x: 'Unnamed' not in x,
    skiprows=1, dtype={'Part List': str})
# Open SAP database connection
conn = dbapi.connect(address="server", port=####, user="XXXXX", password="XXXXXXX")
# Function to convert
def listToString(s):
    # use list comprehension
    listToStr = ', '.join([str(elem) for elem in s])
    # return string
    return listToStr
partNumStr = listToString(partNums['Part List'].drop_duplicates().tolist())
# GetInvOnHand()
# DISTINCT List
queryIOHlist = [
    "PART_NO",
    "PLANT",
    "LOCATION_DESCRIPTION",
    "VALUATION_TYPE"
]
queryIOHstr = listToString(queryIOHlist)
# our SQL query, select all from ' '
queryInvOnHand = (
    "SELECT DISTINCT " +
    queryIOHstr +
    " FROM ZWILLIAMS.ZV_WI_GU_INVENTORY_ON_HAND_IM_THIN INV" +
    " WHERE PART_NO IN " +
    "(" +
    partNumStr +
    ")"
)
# pandas read SQL to store SQL table in dataframe
inventoryOnHand = pd.read_sql(queryInvOnHand, conn)
conn.close()
I'm running into syntax errors for my SQL query because of these bad part numbers, such as:
(257, 'sql syntax error: incorrect syntax near "to": line 1 col 8120 (at pos 8120))
where the part number it doesn't like is: 62219-01 to -04
In SQL, is there a way to just skip that number if not found in the Part Numbers column in the table? Ideally, it would just be something like:
if syntaxError:
continue
and then just not record anything in my dataframe for that part number.
Beginner here.
I have the following circumstances.
A text file with each line containing a name.
A cassandra 3.5 database
A python script
The intention is to have the script read from the file one line (one name) at a time, and query Cassandra with that name.
FYI, everything works fine except for when I try to pass the value of the list to the query.
I currently have something like:
#... driver import, datetime imports done above
#...
with open(fname) as f:
    content = f.readlines()
# Loop for each line from the number of lines in the name list file
# num_of_lines is already set
for x in range(num_of_lines):
    tagname = str(content[x])
    rows = session.execute("""SELECT * FROM tablename where name = %s and date = %s order by time desc limit 1""", (tagname, startDay))
    for row in rows:
        print row.name + ", " + str(row.date)
Everything works fine if I remove the tagname list component and edit the query itself with a name value.
What am I doing wrong here?
Building on the answer from @Vinny: format simply substitutes the literal value into the string, so you need to put quotes around each placeholder.
for x in content:
    rows = session.execute("SELECT * FROM tablename where name ='{}' and date ='{}' order by time desc limit 1".format(x, startDay))
    for row in rows:
        print row.name + ", " + str(row.date)
You can simply iterate over content:
for x in content:
    rows = session.execute("SELECT * FROM tablename where name = {} and date = {} order by time desc limit 1".format(x, startDay))
    for row in rows:
        print row.name + ", " + str(row.date)
....
Also, you don't need triple quotes for the string. Single quotes are good enough (triple quotes are used for docstrings and multi-line strings in Python).
Note that this might end in a different error; but you will be iterating on the lines instead of iterating over an index and reading lines.
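One more thing worth checking, offered as a guess rather than a confirmed diagnosis: readlines() keeps the trailing newline on every line, so the value passed to the query may be "name\n" rather than "name". The sketch below uses io.StringIO as a stand-in for the names file, and the sensor names are made up.

```python
import io

# Stand-in for the names file; io.StringIO behaves like an open text file.
fake_file = io.StringIO(u"sensor_a\nsensor_b\n\nsensor_c\n")

raw_lines = fake_file.readlines()

# Each raw line still ends with "\n"; strip it and drop blank lines.
names = [line.strip() for line in raw_lines if line.strip()]

for tagname in names:
    # With a real session this would be the call from the question:
    # session.execute(query, (tagname, startDay))
    print(repr(tagname))
```

If the names then match, the original %s query with a parameter tuple can stay as it is, with no need to format values into the string.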
I'm looking to take an array list and attach it to a string.
Python 2.7.10, Windows 10
The list is loaded from a mySQL table and the output is this:
skuArray = [('000381001238',) ('000381001238',) ('000381001238',) ('FA200513652',) ('000614400967',)]
I'm wanting to take this list and attach it to a separate query
the problem:
query = "SELECT ItemLookupCode,Description, Quantity, Price, LastReceived "
query = query+"FROM Item "
query = query+"WHERE ItemLookupCode IN ("+skuArray+") "
query = query+"ORDER BY LastReceived ASC;"
I get the error:
TypeError: cannot concatenate 'str' and 'tuple' objects
My guess here is that I need to format the string as:
'000381001238', '000381001238', '000381001238', 'FA200513652','000614400967'
Ultimately the string needs to read:
query = query+"WHERE ItemLookupCode IN ('000381001238', '000381001238', '000381001238', 'FA200513652','000614400967') "
I have tried the following:
skuArray = ''.join(skuArray.split('(', 1))
skuArray = ''.join(skuArray.split(')', 1))
Second Try:
skus = [sku[0] for sku in skuArray]
stubs = ','.join(["'?'"]*len(skuArray))
msconn = pymssql.connect(host=r'*', user=r'*', password=r'*', database=r'*')
cur = msconn.cursor()
query ='''
SELECT ItemLookupCode,Description, Quantity, Price, LastReceived
FROM Item
WHERE ItemLookupCode IN { sku_params }
ORDER BY LastReceived ASC;'''.format(sku_params = stubs)
cur.execute(query, params=skus)
row = cur.fetchone()
print row[3]
cur.close()
msconn.close()
Thanks in advance for your help!
If you want to do the straight inline SQL you could use a list comprehension:
', '.join(["'{}'".format(sku[0]) for sku in skuArray])
Note: You need to add commas between tuples (based on example)
That said, if you want to do some sql, I would encourage you to parameterize your request with ?
Here is an example of how you would do something like that:
skuArray = [('000381001238',), ('000381001238',), ('000381001238',), ('FA200513652',), ('000614400967',)]
skus = [sku[0] for sku in skuArray]
# one bare ? placeholder per value; quoting it as '?' would turn it into a literal string
stubs = ','.join(["?"] * len(skus))
qry = '''
SELECT ItemLookupCode,Description, Quantity, Price, LastReceived
FROM Item
WHERE ItemLookupCode IN ({sku_params})
ORDER BY LastReceived ASC;'''.format(sku_params=stubs)
#assuming pyodbc connection syntax may be off
conn.execute(qry, skus)
Why?
Non-parameterized queries are a bad idea because they leave you vulnerable to SQL injection, which is easy to avoid.
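Here is a small runnable sketch of the same idea. It uses the standard-library sqlite3 module as a stand-in, because it accepts the same ? placeholder style; the table contents and the hostile-looking third value are invented for the demonstration, and your own driver's placeholder syntax may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (ItemLookupCode TEXT, Description TEXT)")
conn.executemany(
    "INSERT INTO Item VALUES (?, ?)",
    [("000381001238", "widget"), ("FA200513652", "gadget"), ("000614400967", "gizmo")],
)

# Values to look up; the last one would break a naively concatenated query.
skus = ["000381001238", "FA200513652", "x') OR ('1'='1"]

# One ? per value; only the placeholders are formatted into the SQL text.
placeholders = ", ".join("?" for _ in skus)
query = (
    "SELECT ItemLookupCode, Description FROM Item "
    "WHERE ItemLookupCode IN (%s) ORDER BY ItemLookupCode" % placeholders
)

# The values travel separately, so quotes inside them are just data.
rows = conn.execute(query, skus).fetchall()
print(rows)
conn.close()
```

The odd third value simply matches nothing instead of changing the meaning of the query or raising a syntax error.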
Assuming that skuArray is a list, like this:
>>> skuArray = [('000381001238',), ('000381001238',), ('000381001238',), ('FA200513652',), ('000614400967',)]
You can format your string like this:
>>> ', '.join(["'{}'".format(x[0]) for x in skuArray])
"'000381001238', '000381001238', '000381001238', 'FA200513652', '000614400967'"
Let's assume we have a simple query like this:
Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;
The result should look like this:
tb1 col1
tb1 col7
tb2 col2
tb2 col8
I've tried to solve this problem using Python libraries:
1) Even extracting only tables using sqlparse might be a huge problem. For example, this official book doesn't work properly at all.
2) Using regular expressions seems really hard to get right.
3) But then I found this, which might help. However, the problem is that I can't connect to any database and execute that query.
Any ideas?
sql-metadata is a Python library that uses a tokenized query returned by python-sqlparse and generates query metadata.
This metadata can return column and table names from your supplied SQL query. Here are a couple of examples from the sql-metadata GitHub readme:
>>> sql_metadata.get_query_columns("SELECT test, id FROM foo, bar")
[u'test', u'id']
>>> sql_metadata.get_query_tables("SELECT test, id FROM foo, bar")
[u'foo', u'bar']
>>> sql_metadata.get_query_limit_and_offset('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000')
(50, 1000)
A hosted version of the library exists at sql-app.infocruncher.com to see if it works for you.
Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to get several tokens out of a string. The following code defines these rules for the different parts of your SQL string and puts them back together as there could be aliases in the input string. As a result, you get a dictionary (result) with the different tablenames as key.
import ply.lex as lex, re
tokens = (
    "TABLE",
    "JOIN",
    "COLUMN",
    "TRASH"
)
tables = {"tables": {}, "alias": {}}
columns = []
t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"
def t_TABLE(t):
    r"from\s(\w+)\sas\s(\w+)"
    regex = re.compile(t_TABLE.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t
def t_JOIN(t):
    r"inner\s+join\s+(\w+)\s+as\s+(\w+)"
    regex = re.compile(t_JOIN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t
def t_COLUMN(t):
    r"(\w+\.\w+)"
    regex = re.compile(t_COLUMN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        t.value = m.group(1)
        columns.append(t.value)
    return t
def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))
    t.lexer.skip(len(t.value))
# here is where the magic starts
def mylex(inp):
    lexer = lex.lex()
    lexer.input(inp)
    for token in lexer:
        pass
    result = {}
    for col in columns:
        tbl, c = col.split('.')
        if tbl in tables["alias"].keys():
            key = tables["alias"][tbl]
        else:
            key = tbl
        if key in result:
            result[key].append(c)
        else:
            result[key] = list()
            result[key].append(c)
    print result
    # {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}
string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)
moz-sql-parser is a Python library that converts a subset of SQL-92 queries to JSON-izable parse trees. Maybe it is what you want.
Here is an example.
>>> from moz_sql_parser import parse
>>> parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")
{'select': [{'value': 'id'}, {'value': 'name'}], 'from': 'dual', 'where': {'and': [{'gt': ['id', 3]}, {'lt': ['id', 10]}]}, 'orderby': {'value': 'name'}}
I am tackling a similar problem and found a simpler solution and it seems to work well.
import re
def tables_in_query(sql_str):
    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)
    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match(r"^\s*(--|#)", line)]
    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])
    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)
    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).
    tables = set()
    get_next = False
    for tok in tokens:
        if get_next:
            if tok.lower() not in ["", "select"]:
                tables.add(tok)
            get_next = False
        get_next = tok.lower() in ["from", "join"]
    # for each table, collect the tokens that start with its name (e.g. table.column)
    dictTables = dict()
    for table in tables:
        fields = []
        for token in tokens:
            if token.startswith(table):
                if token != table:
                    fields.append(token)
        if len(list(set(fields))) >= 1:
            dictTables[table] = list(set(fields))
    return dictTables
code adapted from https://grisha.org/blog/2016/11/14/table-names-from-sql/
Create a list of all the tables that are present in the DB. You can then search each table name in the queries.
This obviously isn't foolproof and the code will break in case any column/alias name matches the table name.
But it can be done as a workaround.
import pandas as pd
#%config PPMagics.autolimit=0
#txt = """<your SQL text here>"""
txt_1 = txt
replace_list = ['\n', '(', ')', '*', '=','-',';','/','.']
count = 0
for i in replace_list:
    txt_1 = txt_1.replace(i, ' ')
txt_1 = txt_1.split()
res = []
for i in range(1, len(txt_1)):
    if txt_1[i-1].lower() in ['from', 'join','table'] and txt_1[i].lower() != 'select':
        count +=1
        str_count = str(count)
        res.append(txt_1[i] + "." + txt_1[i+1])
#df.head()
res_l = res
f_res_l = []
for i in range(0,len(res_l)):
    if len(res_l[i]) > 15 : # change it to 0 if you want all the caught strings
        f_res_l.append(res_l[i])
    else :
        pass
All_Table_List = f_res_l
print("All the unique tables from the SQL text, in the order of their appearance in the code : \n",100*'*')
df = pd.DataFrame(All_Table_List,columns=['Tables_Names'])
df.reset_index(level=0, inplace=True)
list_=list(df["Tables_Names"].unique())
df_1_Final = pd.DataFrame(list_,columns=['Tables_Names'])
df_1_Final.reset_index(level=0, inplace=True)
df_1_Final
Unfortunately, in order to do this successfully for "complex SQL" queries, you will more or less have to implement a complete parser for the particular database engine you are using.
As an example, consider this very basic complex query:
WITH a AS (
SELECT col1 AS c FROM b
)
SELECT c FROM a
In this case, a is not a table but a common table expression (CTE), and should be excluded from your output. There's no simple way of using regular expressions to realize that b is a table access but a is not: your code will really have to understand the SQL at a deeper level.
Also consider
SELECT * FROM tbl
You'd have to know the column names actually present in a particular instance of a database (and accessible to a particular user, too) to answer that correctly.
If by "works with complex SQL" you mean that it must work with any valid SQL statement, you also need to specify for which SQL dialect, or implement dialect-specific solutions. A solution which works with any SQL handled by a database that does not implement CTEs would not work in one that does.
I am sorry to say so, but I do not think you will find a complete solution which works for arbitrarily complex SQL queries. You'll have to settle for a solution which works with a subset of a particular SQL-dialect.
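To illustrate the CTE problem, here is a deliberately naive sketch (naive_tables is a made-up helper, not part of any library) that grabs whatever identifier follows FROM or JOIN. Run against the CTE query above, it reports a as a table even though only b is one.

```python
import re

def naive_tables(sql):
    """Collect whatever identifier follows FROM or JOIN (no real parsing)."""
    return set(re.findall(r"\b(?:from|join)\s+(\w+)", sql, flags=re.IGNORECASE))

cte_query = """
WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a
"""

found = naive_tables(cte_query)
# Both names come back, although "a" is a CTE and not a real table.
print(sorted(found))
```

Telling the two apart requires tracking the WITH clause and its scope, which is exactly the kind of structural knowledge a token scan does not have.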
For my simple use case (one table in query, no joins), I used the following tweak
lst = "select * from table".split(" ")
lst = [item for item in lst if len(item)>0]
table_name = lst[lst.index("from")+1]
I have a text file that contains many different entries. What I'd like to do is take the first column, use each unique value as a key, and then store the second column as values. I actually have this working, sort of, but I'm looking for a better way to do this. Here is my example file:
account_check:"login/auth/broken"
adobe_air_installed:kb_base+"/"+app_name+"/Path"
adobe_air_installed:kb_base+"/"+app_name+"/Version"
adobe_audition_installed:'SMB/Adobe_Audition/'+version+'/Path'
adobe_audition_installed:'SMB/Adobe_Audition/'+version+'/ExePath'
Here is the code I'm using to parse my text file:
val_dict = {}
for row in creader:
    try:
        value = val_dict[row[0]]
        value += row[1] + ", "
    except KeyError:
        value = row[1] + ", "
    val_dict[row[0]] = value
for row in val_dict.items():
    values = row[1][:-1],row[0]
    cursor.execute("UPDATE 'plugins' SET 'sets_kb_item'= ? WHERE filename= ?", values)
And here is the code I use to query + format the data currently:
def kb_item(query):
    db = get_db()
    cur = db.execute("select * from plugins where sets_kb_item like ?", (query,))
    plugins = cur.fetchall()
    for item in plugins:
        for i in item['sets_kb_item'].split(','):
            print i.strip()
Here is the output:
kb_base+"/Installed"
kb_base+"/Path"
kb_base+"/Version"
It took me many tries but I finally got the output the way I wanted it, however I'm looking for critique. Is there a better way to do this? Could my entire for item in plugins.... print i.strip() be done in one line and saved as a variable? I am very new to working with databases, and my python skills could also use refreshing.
NOTE I'm using csvreader in this code because I originally had a .csv file - however I found it was just as easy to use the .txt file I was provided.
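For what it's worth, one possible way to tidy the grouping step is sketched below. It assumes each line is split at the first colon, uses a few lines from the example file as sample input, and group_values is a made-up helper name. Collecting values in lists and joining them once avoids having to trim a trailing ", " afterwards.

```python
from collections import defaultdict

# A few lines from the example file in the question.
sample_lines = [
    'account_check:"login/auth/broken"',
    'adobe_air_installed:kb_base+"/"+app_name+"/Path"',
    'adobe_air_installed:kb_base+"/"+app_name+"/Version"',
]

def group_values(lines):
    """Map each first-column key to its second-column values, joined by ', '."""
    grouped = defaultdict(list)
    for line in lines:
        # Split at the first colon only; later colons stay in the value.
        key, _, value = line.strip().partition(":")
        if key and value:
            grouped[key].append(value)
    # Join once per key, so no separator is left dangling at the end.
    return {key: ", ".join(values) for key, values in grouped.items()}

result = group_values(sample_lines)
for key, joined in sorted(result.items()):
    print(key + " -> " + joined)
```

Each (joined, key) pair could then be passed to the same UPDATE statement as before.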