I have a SQL dump with several INSERT INTO statements like the following one:
query = "INSERT INTO `temptable` VALUES (1773,0,'morne',0),(6004,0,'ATT',0)"
I'm trying to get only the values into a dataframe:
(1773,0,'morne',0)
(6004,0,'ATT',0)
I tried
spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
and get
'InsertIntoTable 'UnresolvedRelation `temptable`, false, false
+- 'UnresolvedInlineTable [col1, col2, col3, col4], [List(1773, 0, morne, 0), List(6004, 0, ATT, 0)]
But I don't know how to retrieve those lists of values.
Is there a way to get them without Hive?
If you are trying to get only the list of values from multiple insert statements, then you may try the below:
from pyspark.sql.functions import substring_index

listOfInserts = [
    ('''INSERT INTO temptable VALUES (1773,0,'morne',0),(6004,0,'ATT',0)''',),
    ('''INSERT INTO temptable VALUES (1673,0,'morne',0),(5004,0,'ATT',0)''',)
]
df = spark.createDataFrame(listOfInserts, ['VALUES'])

# keep everything to the right of the keyword VALUES
df.select(substring_index(df.VALUES, 'VALUES', -1).alias('right')).show(truncate=False)
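If you then want one row per value tuple rather than one long string, a possible next step (a sketch, reusing df and substring_index from above) is to strip the outer parentheses and explode on the "),(" separator:

from pyspark.sql.functions import explode, split, regexp_replace

values_df = (
    df.select(substring_index(df.VALUES, 'VALUES', -1).alias('right'))
      # drop the leading "(" and trailing ")", then split into one tuple per row
      .select(explode(split(regexp_replace('right', r'^\s*\(|\)\s*$', ''), r'\),\(')).alias('tuple'))
)
values_df.show(truncate=False)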
I am using dbduck to run a SQL query on the following dataframe df:
In this SQL, I need to pass the values from the dataframe's col3 using a loop:
aa = ps.sqldf("select * from result where col3= 'id1'")
print(aa)
You can iterate on the values of col3 like this, using Python f-strings:
for v in df["col3"]:
    aa = ps.sqldf(f"select * from result where col3='{v}'")
    print(aa)
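If you'd rather collect everything into a single dataframe instead of printing each result, one option (a sketch, assuming pandas is available alongside the sqldf setup above) is:

import pandas as pd

# run one query per value of col3 and stack the results
frames = [ps.sqldf(f"select * from result where col3='{v}'") for v in df["col3"]]
all_results = pd.concat(frames, ignore_index=True)
print(all_results)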
I have the following situation:
I have multiple tables that look like this:
table1 = pd.DataFrame([[0,1],[0,1],[0,1],[0,1],[0,1]], columns=['v1','v2'])
I have one dataframe in which each element refers to one of these tables, something like this:
df = pd.DataFrame([table1, table2, table3, table4], columns=['tablename'])
I need to create a new column in df that contains, for each table, the values that I get from np.polyfit(table1['v1'],table1['v2'],1)
I have tried to do the following
for x in df['tablename']:
    df.loc[:, 'fit_result'] = np.polyfit(x['v1'], x['v2'], 1)
but it returns
TypeError: string indices must be integers
Is there a way to do it? Or am I writing something that makes no sense?
Note: in fact, these tables are HUGE and contain more than two columns.
You can try something like this:
import numpy as np
import pandas as pd

table1 = pd.DataFrame([[0.0,0.0],[1.0,0.8],[2.0,0.9],[3.0,0.1],[4.0,-0.8],[5.0,-1.0]], columns=['table1_v1','table1_v2'])
df = pd.DataFrame([['some','random'],['values','here']], columns=['example_1','example_2'])

def fit_result(v1, v2):
    return np.polyfit(v1, v2, 1)

# every row gets the fit of table1 here; swap in the right table per row as needed
df['fit_result'] = df.apply(lambda row: fit_result(table1['table1_v1'].values, table1['table1_v2'].values), axis=1)
df.head()
Output
example_1 example_2 fit_result
0 some random [-0.3028571428571428, 0.7571428571428572]
1 values here [-0.3028571428571428, 0.7571428571428572]
You only need to do this over all your dataframes and concat all of them at the end:
df_col = pd.concat([df1, df2], axis=1) (https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
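Alternatively, if each cell of df['tablename'] really holds a DataFrame (rather than a string, which is what the TypeError suggests), a per-row apply is a minimal sketch of the same idea:

# assumes each cell of df['tablename'] is a DataFrame with 'v1' and 'v2' columns
df['fit_result'] = df['tablename'].apply(lambda t: np.polyfit(t['v1'], t['v2'], 1))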
How do I write the UPDATE query so that, if a column initially has no value in the table, the update is performed on that column, and otherwise the column is left alone, using Python?
The update query is mentioned below.
sql_update = """Update table_name1 set column1 = %s, column2 = %s,column3=%s,column4=%s where column5 = %s"""
input = ('a', 'b', 'c', 'd' , 1)
cursor.execute(sql_update , input)
conn.commit()
With SQL you can restrict an update to rows where a column has an empty value, as in:
update table_name set col_1 = 1, col_2 = 2 where col_3 is null
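Applied to the original statement, one way to express "fill only the empty columns" in a single query is COALESCE, which keeps the current value whenever it is not NULL (a sketch, assuming a driver that uses %s placeholders, e.g. mysql-connector or psycopg2):

sql_update = """UPDATE table_name1
                SET column1 = COALESCE(column1, %s),
                    column2 = COALESCE(column2, %s),
                    column3 = COALESCE(column3, %s),
                    column4 = COALESCE(column4, %s)
                WHERE column5 = %s"""
# each column keeps its existing value unless it is NULL
cursor.execute(sql_update, ('a', 'b', 'c', 'd', 1))
conn.commit()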
I have a pandas dataframe grouped by certain columns. Now I want to insert the mean of the numeric values of four adjacent columns into a new column. This is what I did:
df = pd.read_csv(filename)
# in this line I extract a unique ID from the filename
id = re.search(r'(\w\w\w)', filename).group(1)
Files look like this:
col1 | col2 | col3
-----------------------
str1a | str1b | float1
My idea was now the following:
# get the numeric values
df2 = pd.DataFrame(df.groupby(['col1', 'col2']).mean()['col3']).T
# insert the id into a new column
df2.insert(0, 'ID', id)
Now loop over all of them:
for j in range(len(df2.values)):
    for k in df['col1'].unique():
        df2.insert(j+5, (k, 'mean'), df2.values[j])
df2.to_excel('text.xlsx')
But I get the following error, referring to the line with df2.insert:
TypeError: not all arguments converted during string formatting
and
if not allow_duplicates and item in self.items:
    # Should this be a different kind of error??
    raise ValueError('cannot insert %s, already exists' % item)
I am not sure what string formatting refers to here, since I have only numerical values being passed around.
The final output should have all values from col3 in a single row (indexed by id) and every fifth column should be the inserted mean value of the four preceding values.
If I had to work with files like yours, I would code a function to convert them to CSV... something like this:
import pandas

data = []
for lineInFile in file.read().splitlines():
    lineInFile_splited = lineInFile.split('|')
    if len(lineInFile_splited) > 1:  # keep only data rows, not the '-----' separator
        data.append(lineInFile_splited)

df = pandas.DataFrame(data, columns=['A', 'B', 'C'])
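Everything parsed this way arrives as strings, so the numeric column needs converting before the groupby from the question will work (a sketch, using the question's column names):

df = df.iloc[1:]  # drop the header row if it was captured too
df.columns = ['col1', 'col2', 'col3']
df['col3'] = df['col3'].str.strip().astype(float)
means = df.groupby(['col1', 'col2'])['col3'].mean()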
Hope it helps!
I'm new to cursors and I'm trying to practice by building a dynamic python sql insert statement using a sanitized method for sqlite3:
import sqlite3
conn = sqlite3.connect("db.sqlite")
cursor = conn.cursor()
list = ['column1', 'column2', 'column3', 'value1', 'value2', 'value3']
cursor.execute("""insert into table_name (?,?,?)
values (?,?,?)""", list)
When I attempt to use this, I get a syntax error (sqlite3.OperationalError: near "?") on the line with the values. This is despite the fact that when I hard-code the columns (and remove the column names from the list), I have no problem. I could construct the statement with %s, but I know that the sanitized method is preferred.
How do I insert these cleanly? Or am I missing something obvious?
The (?, ?, ?) syntax works only for the tuple containing the values, imho... That would be the reason for the sqlite3.OperationalError.
I believe(!) that you ought to build it similar to this:
cursor.execute("INSERT INTO {tn} ({f1}, {f2}) VALUES (?, ?)".format(tn='testable', f1='foo', f2='bar'), ('test', 'test2',))
But this does not solve the injection problem if the user is allowed to provide the table name or field names himself.
I do not know any inbuilt method to help against that. But you could use a function like that:
def clean(some_string):
    return ''.join(char for char in some_string if char.isalnum())
to sanitize the user-given table name or field names. This should suffice, because table and field names usually consist only of alphanumeric chars.
Perhaps it may be smart to check whether
some_string == clean(some_string)
and, if False, raise a nice exception to be on the safe side.
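A small guard built from that check might look like this (a sketch; validated is a hypothetical helper wrapping the clean() function above):

def validated(name):
    # reject anything that clean() would have altered
    if name != clean(name):
        raise ValueError('suspicious identifier: %r' % name)
    return name

cursor.execute(
    "INSERT INTO {tn} ({f1}) VALUES (?)".format(tn=validated('testable'), f1=validated('foo')),
    ('test',),
)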
During my work with SQL & Python I felt that you won't often need to let the user name his tables and fields himself, though. So it was rarely necessary for me.
If anyone could elaborate some more and give his insights, I would greatly appreciate it.
First, I would start by creating a mapping of columns and values:
data = {'column1': 'value1', 'column2': 'value2', 'column3': 'value3'}
And then get the columns from it:
columns = list(data.keys())
# ['column1', 'column2', 'column3']
Now, we need to create placeholders for both the columns and the values:
placeholder_columns = ", ".join(columns)
# 'column1, column2, column3'
placeholder_values = ", ".join(":{0}".format(col) for col in columns)
# ':column1, :column2, :column3'
Then, we create the INSERT SQL statement:
sql = "INSERT INTO table_name ({placeholder_columns}) VALUES ({placeholder_values})".format(
placeholder_columns=placeholder_columns,
placeholder_values=placeholder_values
)
# 'INSERT INTO table_name (column1, column2, column3) VALUES (:column1, :column2, :column3)'
What we have in sql now is a valid SQL statement with named parameters, and you can execute it with the data:
cursor.execute(sql, data)
And, since data has keys and values, it will use the named placeholders in the query to insert the values into the correct columns.
Have a look at the documentation to see how named parameters are used. From what I can see, you only need to worry about sanitization for the values being inserted, and there are two ways to do that: 1) question mark style, or 2) named parameter style.
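If the column names themselves come from user input, one extra safeguard (a sketch; it assumes the sqlite3 cursor and the sql and data built above) is to whitelist them against the table's actual schema before executing:

# PRAGMA table_info returns one row per column; index 1 holds the column name
allowed = {row[1] for row in cursor.execute("PRAGMA table_info(table_name)")}
if not set(data) <= allowed:
    raise ValueError("unexpected column name in data")

cursor.execute(sql, data)
conn.commit()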
So here's what I ended up implementing. I thought it was pretty pythonic, but I couldn't have answered it without Krysopath's insight:
columns = ['column1', 'column2', 'column3']
values = ['value1', 'value2', 'value3']

columns = ', '.join(columns)
insertString = "insert into table_name (%s) values (?,?,?)" % columns
cursor.execute(insertString, values)
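To keep the number of ? placeholders in sync with the value list, you can also generate them instead of hard-coding them (a sketch building on the same variables):

# one ? per value, however many there are
placeholders = ','.join('?' for _ in values)
insertString = "insert into table_name (%s) values (%s)" % (columns, placeholders)
cursor.execute(insertString, values)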
import sqlite3

conn = sqlite3.connect("db.sqlite")
cursor = conn.cursor()

# break your list into two, one for columns and one for values
columns = ['column1', 'column2', 'column3']
values = ['value1', 'value2', 'value3']

# column names cannot be bound as parameters, so they are interpolated here
# (only safe when you control the names); the values are bound with ?, which
# quotes and escapes them correctly
cursor.execute("insert into table_name (" + ", ".join(columns) + ") values (?,?,?)", values)