Using pandas in Python, I need to be able to generate efficient queries from a dataframe into PostgreSQL. Unfortunately DataFrame.to_sql(...) only performs direct inserts, and the query I wish to make is fairly complicated.
Ideally, I'd like to do this:
WITH my_data AS (
SELECT * FROM (
VALUES
<dataframe data>
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
my_table.col1 = my_data.col1,
my_table.col2 = complex_function(my_table.col2, my_data.col2)
FROM my_data
WHERE my_table.col3 < my_data.col3;
However, to do that, I would need to turn my dataframe into a plain VALUES statement. I could, of course, write my own functions, but past experience has taught me that writing functions to escape and sanitize SQL should never be done manually.
We are using SQLAlchemy, but bound parameters seem to work with only a limited number of arguments, and ideally I would like the serialization of the dataframe into text to be done at C speed.
So, is there a way, either through pandas or through SQLAlchemy, to efficiently turn my dataframe into the VALUES sub-statement and insert it into my query?
You could use psycopg2.extras.execute_values.
For example, given this setup
CREATE TABLE my_table (
col1 int
, col2 text
, col3 int
);
INSERT INTO my_table VALUES
(99, 'X', 1)
, (99, 'Y', 2)
, (99, 'Z', 99);
# | col1 | col2 | col3 |
# |------+------+------|
# | 99 | X | 1 |
# | 99 | Y | 2 |
# | 99 | Z | 99 |
The Python code
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
(1, 'A', 10),
(2, 'B', 20),
(3, 'C', 30)])
with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
            SELECT * FROM (
                VALUES %s
            ) AS data (col1, col2, col3)
        )
        UPDATE my_table
        SET
            col1 = my_data.col1,
            -- col2 = complex_function(col2, my_data.col2)
            col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''
        pge.execute_values(cursor, sql, df.values)
updates my_table to be
# SELECT * FROM my_table
| col1 | col2 | col3 |
|------+------+------|
| 99 | Z | 99 |
| 1 | XA | 1 |
| 1 | YA | 2 |
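Since the question mentions SQLAlchemy, note that you can still use execute_values by borrowing the DBAPI connection from an existing engine. A minimal sketch (the engine variable is an assumption, not part of the original code; sql and df are the objects defined above):
raw_conn = engine.raw_connection()   # the underlying psycopg2 connection from the SQLAlchemy pool
try:
    with raw_conn.cursor() as cursor:
        pge.execute_values(cursor, sql, df.values)
    raw_conn.commit()
finally:
    raw_conn.close()   # return the connection to the pool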
Alternatively, you could use psycopg2 to generate the SQL.
The code in format_values is almost entirely copied from the source code for pge.execute_values.
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
(1, "A'foo'", 10),
(2, 'B', 20),
(3, 'C', 30)])
def format_values(cur, sql, argslist, template=None, page_size=100):
    enc = pge._ext.encodings[cur.connection.encoding]
    if not isinstance(sql, bytes):
        sql = sql.encode(enc)
    pre, post = pge._split_sql(sql)
    result = []
    for page in pge._paginate(argslist, page_size=page_size):
        if template is None:
            template = b'(' + b','.join([b'%s'] * len(page[0])) + b')'
        parts = pre[:]
        for args in page:
            parts.append(cur.mogrify(template, args))
            parts.append(b',')
        parts[-1:] = post
        result.append(b''.join(parts))
    return b''.join(result).decode(enc)
with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
            SELECT * FROM (
                VALUES %s
            ) AS data (col1, col2, col3)
        )
        UPDATE my_table
        SET
            col1 = my_data.col1,
            -- col2 = complex_function(col2, my_data.col2)
            col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''
        print(format_values(cursor, sql, df.values))
yields
WITH my_data AS (
SELECT * FROM (
VALUES (1,'A''foo''',10),(2,'B',20),(3,'C',30)
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
col1 = my_data.col1,
-- col2 = complex_function(col2, my_data.col2)
col2 = my_table.col2 || my_data.col2
FROM my_data
WHERE my_table.col3 < my_data.col3
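Since format_values returns an ordinary SQL string with the values already escaped, you could then run it directly; a usage sketch (not shown in the original answer):
generated_sql = format_values(cursor, sql, df.values)
cursor.execute(generated_sql)   # execute the fully rendered statement
conn.commit()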
I've got a database filled with lots of data.
Let's make it simple and say the schema looks like this:
CREATE TABLE foo (
    col1 CHAR(25) PRIMARY KEY,
    col2 CHAR(2) NOT NULL,
    col3 CHAR(1) NOT NULL,
    CONSTRAINT c_col2 CHECK (col2 = 'an' OR col2 = 'bx' OR col2 = 'zz'),
    CONSTRAINT c_col3 CHECK (col3 = 'a' OR col3 = 'b' OR col3 = 'n')
)
There are lots of rows with lots of values, but let's say I've just done this:
cur.executemany('INSERT INTO foo VALUES(?, ?, ?)', [('xxx', 'bx', 'a'),
('yyy', 'bx', 'b'),
('zzz', 'an', 'b')])
I have match lists for each of the values, and I want to return rows that match the UNION of all list values. For this question, assume no lists are empty.
Say I have these match lists...
row2 = ['bx', 'zz'] # Consider all rows that have 'bx' OR 'zz' in row2
row3 = ['b'] # Consider all rows that have 'b' in row3
I can build a text-based query correctly, using something like this:
s_row2 = 'row2 IN (' + ', '.join('"{}"'.format(x) for x in row2) + ')'
s_row3 = 'row3 IN (' + ', '.join('"{}"'.format(x) for x in row3) + ')'
query = 'SELECT col1 FROM foo WHERE ' + ' AND '.join([s_row2, s_row3])
for row in cur.execute(query):
print(row)
Output should be just yyy.
xxx is NOT chosen because col3 is a and not in the col3 match list.
zzz is NOT chosen because col2 is an and not in the col2 match list.
How would I do this using the safer qmark style, like my 'INSERT' above?
edit: I just realized that I screwed up the notion of 'row' and 'col' here... Sorry for the confusion! I won't change it because it has perpetuated into the answer below...
What you want can be done like this:
# Combine the values into a single list:
vals = row2 + row3
# Create query string with placeholders:
query = """SELECT col1 FROM foo WHERE col2 IN (?, ?) AND col3 IN (?)"""
cur.execute(query, vals)
for row in cur:
    print(row)
Or if the number of values may vary, like this:
rows = [row2, row3]
# Flatten the list of lists to get the scalar values to bind.
vals = [x for y in rows for x in y]
# Generate a comma-separated group of ? placeholders for each list.
placeholders = (', '.join(['?'] * len(row)) for row in rows)
# Create the query string with {} slots for the placeholder groups.
query = """SELECT col1 FROM foo WHERE col2 IN ({}) AND col3 IN ({})"""
# Substitute the placeholder groups into the {} slots.
query = query.format(*placeholders)
for row in cur.execute(query, vals):
    print(row)
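As a quick check with the question's match lists, vals ends up as ['bx', 'zz', 'b'] and the formatted query becomes SELECT col1 FROM foo WHERE col2 IN (?, ?) AND col3 IN (?), so cur.execute(query, vals) again returns only the row for yyy.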
I am learning Databricks and am in an exploration and research phase. I found various tools while triaging Python syntax, e.g. DataFrames with PySpark, the Bamboo library, the Apache Spark library for reading SQL objects, pandas, etc.
But somehow I am mixing up the usage of all these libraries.
I am exploring these alternatives to achieve one task: how to combine or merge multiple table schemas into one table.
For instance, suppose I have 20 tables: Table1, Table2, Table3, ..., Table20.
Table1 has 3 columns.
Col1 | Col2 | Col3 |
Table2 has 4 columns.
Col4 | Col5 | Col6 | Col7
and in that way each of the 20 tables has its own columns.
Can the community provide some insight on how to approach this implementation?
This is greatly appreciated.
Troubleshooting
schema1 = "database string, table string"
schema2 = "table string, column string, datatype string"
tbl_df = spark.createDataFrame([],schema1)
tbl_df3 = spark.createDataFrame([],schema2)
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for db in db_list:
    # Get the list of tables in each database
    db_tables = spark.sql(f"SHOW TABLES in {db}").rdd.map(lambda row: row.tableName).collect()
    # For each table, record the database and table name
    for table in db_tables:
        # switch to the current database
        spark.sql(f"use {db}")
        df = spark.createDataFrame([(db, table.strip())], schema=['database', 'table'])
        tbl_df = tbl_df.union(df)
The above code works fine and gives me a list of all databases and their associated tables. The next thing I am trying to achieve is schema2.
Based on the list of tables, I managed to retrieve the list of columns from all tables, but it comes back as a list of Row objects.
For example, when I iterate over db_tables in a for loop as below,
columns = spark.sql(f"DESCRIBE TABLE {table}").rdd.collect()
this gives me the result below.
[Row(col_name='Col1', data_type='timestamp', comment=None), Row(col_name='Col2', data_type='string', comment=None), Row(col_name='Col3', data_type='string', comment=None)]
[Row(col_name='Col4', data_type='timestamp', comment=None), Row(col_name='Col5', data_type='timestamp', comment=None), Row(col_name='Col6', data_type='timestamp', comment=None), Row(col_name='Col7', data_type='timestamp', comment=None)]
This is my real challenge now: I am trying to figure out how to access the Row format above and transform it into the tabular outcome below (one possible approach is sketched after the two tables below).
Table | Column | Datatype
-------------------------
Table1| Col1 | Timestamp
Table1| Col2 | string
Table1| Col3 | string
Table2| Col4 | Timestamp
Table2| Col5 | string
Table2| Col6 | string
Table2| Col7 | string
Finally I will merge or join the two dataframes on the table name (taking it as the key) and generate the final outcome like below.
Database| Table | Column | Datatype
------------------------------------
Db1 | Table1| Col1 | Timestamp
Db1 | Table1| Col2 | string
Db1 | Table1| Col3 | string
Db1 | Table2| Col4 | Timestamp
Db1 | Table2| Col5 | string
Db1 | Table2| Col6 | string
Db1 | Table2| Col7 | string
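One way to access those Row objects and build the schema2 rows, as mentioned above, is to collect the DESCRIBE output into plain tuples. A minimal sketch (not from the original post), assuming the current database has already been selected and db_tables holds the list of table names collected in the loop above:
col_rows = []
for table in db_tables:
    for r in spark.sql(f"DESCRIBE TABLE {table}").collect():
        col_rows.append((table, r.col_name, r.data_type))   # each Row exposes col_name/data_type as attributes
tbl_df3 = spark.createDataFrame(col_rows, schema=schema2)    # schema2 = "table string, column string, datatype string"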
If each table has unique columns, you can use unionByName. To create a single table with the merged schema, you can use the following code:
#list of table names
tables = ['default.t1','default.t2','default.t3']
final_df = spark.sql(f'select * from {tables[0]}') #load 1st table to a dataframe
#display(final_df)
final = 'final_df'
for table in tables[1:]:
    final = final + f'.unionByName(spark.sql("select * from {table}"),allowMissingColumns=True)' #creating string expression to get final result
#print(final)
req_df = eval(final)
#display(req_df)
req_df.printSchema()
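If you would rather not build an expression string and eval it, here is a sketch of the same merge using functools.reduce over the same tables list (assuming Spark 3.1+ for allowMissingColumns):
from functools import reduce

dfs = [spark.sql(f'select * from {t}') for t in tables]   # load every table into a dataframe
req_df = reduce(lambda left, right: left.unionByName(right, allowMissingColumns=True), dfs)
req_df.printSchema()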
UPDATE:
To get the database name, table name, column name, and column type for each table in each database, you can use the following code:
My table creation code:
%sql
create database d1;
create table d1.t1(id int, gname varchar(40));
create table d1.t2(fname varchar(40),lname varchar(40));
create database d2;
create table d2.tb1(id varchar(40),age int, name varchar(40));
To get the dataframe as per the requirement:
from pyspark.sql.functions import lit  # needed for the lit() calls below

db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
#db_list
db_tables = spark.sql(f"SHOW TABLES in {db_list[0]}")
for i in db_list[1:]:
db_tables = db_tables.union(spark.sql(f"SHOW TABLES in {i}"))
#display(db_tables)
final_df = None
for row in db_tables.collect():
    if final_df is None:
        final_df = spark.sql(f"DESCRIBE TABLE {row.database}.{row.tableName}")\
            .withColumn('database', lit(f'{row.database}'))\
            .withColumn('tablename', lit(f'{row.tableName}'))\
            .select('database', 'tablename', 'col_name', 'data_type')
    else:
        final_df = final_df.union(spark.sql(f"DESCRIBE TABLE {row.database}.{row.tableName}")\
            .withColumn('database', lit(f'{row.database}'))\
            .withColumn('tablename', lit(f'{row.tableName}'))\
            .select('database', 'tablename', 'col_name', 'data_type'))
display(final_df)
I have a database holding names, and I have to create a new list which will hold values such as ID, name, and gender and insert it into the current database. I have to create a list of the names which are not in the database yet, so I simply checked only 3 names and am trying to work with them.
I am not sure what sort of list I am supposed to create, or how I can loop through it to insert all the new values in the proper way.
That's what I have so far:
mylist = [["Betty Beth", "1", "Female"], ["John Cena", "2", "Male"]]
#get("/list_actors")
def list_actors():
with connection.cursor() as cursor:
sql = "INSERT INTO imdb VALUES (mylist)"
cursor.execute(sql)
connection.commit()
return "done"
I am very new to this material so I will appreciate any help. Thanks in advance!
vals = [["TEST1", 1], ["TEST2", 2]]
with connection.cursor() as cursor:
cursor.executemany("insert into test(prop, val) values (%s, %s)", vals )
connection.commit()
mysql> select * from test;
+----+-------+------+---------------------+
| id | prop | val | ts |
+----+-------+------+---------------------+
| 1 | TEST1 | 1 | 2017-05-19 09:46:16 |
| 2 | TEST2 | 2 | 2017-05-19 09:46:16 |
+----+-------+------+---------------------+
Adapted from https://groups.google.com/forum/#!searchin/pymysql-users/insert%7Csort:relevance/pymysql-users/4_D8bYusodc/EHFxjRh89XEJ
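Applied to the question's mylist and imdb table, a sketch might look like the following (the column names name, id and gender are assumptions here, since the question does not show the imdb schema):
mylist = [["Betty Beth", "1", "Female"], ["John Cena", "2", "Male"]]

with connection.cursor() as cursor:
    # one %s placeholder per column; executemany repeats the INSERT for each sublist
    cursor.executemany("INSERT INTO imdb (name, id, gender) VALUES (%s, %s, %s)", mylist)
connection.commit()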
In PL/Python, the "RETURNS SETOF" or "RETURNS TABLE" clauses are used to return table-like structured data. It seems to me that one has to provide the name of each column to get a table returned. If you have a table with a few columns, that is an easy thing. However, if you have a table of 200 columns, what's the best way to do that? Do I have to type the names of all of the columns (as shown below), or is there a way to get around it? Any help would be much appreciated.
Below is an example that uses the "RETURNS TABLE" clause. The code snippet creates a table (mysales) in Postgres, populates it, and then uses PL/Python to fetch it and return the column values. For simplicity I am only returning 4 columns from the table.
DROP TABLE IF EXISTS mysales;
CREATE TABLE mysales (id int, year int, qtr int, day int, region text)
DISTRIBUTED BY (id);
INSERT INTO mysales VALUES
(1, 2014, 1,1, 'north america'),
(2, 2002, 2,2, 'europe'),
(3, 2014, 3,3, 'asia'),
(4, 2010, 4,4, 'north-america'),
(5, 2014, 1,5, 'europe'),
(6, 2009, 2,6, 'asia'),
(7, 2002, 3,7, 'south america');
DROP FUNCTION IF EXISTS myFunc02();
CREATE OR REPLACE FUNCTION myFunc02()
RETURNS TABLE (id integer, x integer, y integer, s text) AS
$$
rv = plpy.execute("SELECT * FROM mysales ORDER BY id", 5)
d = rv.nrows()
return ( (rv[i]['id'],rv[i]['year'], rv[i]['qtr'], rv[i]['region'])
for i in range(0,d) )
$$ LANGUAGE 'plpythonu';
SELECT * FROM myFunc02();
#Here is the output of the SELECT statement:
1; 2014; 1;"north america"
2; 2002; 2;"europe"
3; 2014; 3;"asia"
4; 2010; 4;"north-america"
5; 2014; 1;"europe"
6; 2009; 2;"asia"
7; 2002; 3;"south america"
Try this:
CREATE OR REPLACE FUNCTION myFunc02()
RETURNS TABLE (like mysales) AS
$$
rv = plpy.execute('SELECT * FROM mysales ORDER BY id;', 5)
d = rv.nrows()
return rv[0:d]
$$ LANGUAGE 'plpythonu';
which returns:
gpadmin=# SELECT * FROM myFunc02();
id | year | qtr | day | region
----+------+-----+-----+---------------
1 | 2014 | 1 | 1 | north america
2 | 2002 | 2 | 2 | europe
3 | 2014 | 3 | 3 | asia
4 | 2010 | 4 | 4 | north-america
5 | 2014 | 1 | 5 | europe
(5 rows)
Something to consider for MPP databases like Greenplum and HAWQ is to strive for functions that take data as arguments and return a result, rather than originating the data in the function itself. The same code executes on every segment, so occasionally there can be unintended side effects.
Update for SETOF variant:
CREATE TYPE myType AS (id integer, x integer, y integer, s text);
CREATE OR REPLACE FUNCTION myFunc02a()
RETURNS SETOF myType AS
$$
# column names of myType ['id', 'x', 'y', 's']
rv = plpy.execute("SELECT id, year as x, qtr as y, region as s FROM mysales ORDER BY id", 5)
d = rv.nrows()
return rv[0:d]
$$ LANGUAGE 'plpythonu';
Note, to use the same data from the original example, I had to alias each of the columns to the corresponding names in myType. Also, you'll have to enumerate all of the columns of mysales if going this route - there isn't a straightforward way to CREATE TYPE foo LIKE tableBar, although you might be able to use the following to alleviate some of the manual work of enumerating all the names/types:
select string_agg(t.attname || ' ' || t.format_type || ', ') as columns from
(
SELECT a.attname,
pg_catalog.format_type(a.atttypid, a.atttypmod),
(SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid) for 128)
FROM pg_catalog.pg_attrdef d
WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
a.attnotnull, a.attnum,
a.attstorage ,
pg_catalog.col_description(a.attrelid, a.attnum)
FROM pg_catalog.pg_attribute a
LEFT OUTER JOIN pg_catalog.pg_attribute_encoding e
ON e.attrelid = a .attrelid AND e.attnum = a.attnum
WHERE a.attrelid = (SELECT oid FROM pg_class WHERE relname = 'mysales') AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum
) t ;
which returns:
columns
-------------------------------------------------------------------
id integer, year integer, qtr integer, day integer, region text,
(1 row)
I'm using the Python bindings for sqlite3 and I'm attempting to do a query something like this:
table1
col1 | col2
------------
aaaaa|1
aaabb|2
bbbbb|3
test.py
def get_rows(db, ugc):
    # I want a startswith query, but want to protect against potential SQL injection
    # with the user-generated content
    return db.execute(
        # Does not work :)
        "SELECT * FROM table1 WHERE col1 LIKE ? + '%'",
        [ugc],
    ).fetchall()
Is there a way to do this safely?
Expected behaviour:
>>> get_rows('aa')
[('aaaaa', 1), ('aaabb', 2)]
In SQL, + is used to add numbers.
Your SQL ends up as ... WHERE col1 LIKE 0.
To concatenate strings, use ||:
db.execute(
"SELECT * FROM table1 WHERE col1 LIKE ? || '%'",
[ugc],
)
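An equivalent alternative (not part of the original answer) is to build the pattern in Python and bind it as a single parameter; the value is still passed safely, although any % or _ characters inside ugc would act as LIKE wildcards unless escaped:
db.execute(
    "SELECT * FROM table1 WHERE col1 LIKE ?",
    [ugc + '%'],   # append the wildcard in Python instead of in SQL
)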