Using the SQL MINUS Operator in Python

I want to perform a minus operation like the code below on two tables.
SELECT
column_list_1
FROM
T1
MINUS
SELECT
column_list_2
FROM
T2;
This is after a migration has happened. I have these two databases that I have connected like this:
import cx_Oracle
import pandas as pd
import pypyodbc
source = cx_Oracle.connect(user, password, name)
df = pd.read_sql(""" SELECT * from some_table """, source)
target = pypyodbc.connect(blah, blah, db)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
How can I run a MINUS operation on the source and target databases in Python using a query?

Choose either one:
Use Python in order to perform a "manual" MINUS operation between the two result sets.
Use Oracle by means of a dblink. In this case, you won't need to open two connections from Python.
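If you go with the dblink route, here is a minimal sketch, reusing the connection variables from the question and assuming a database link named target_link (a hypothetical name) already exists on the source database and that the two column lists are union-compatible:
import cx_Oracle
import pandas as pd

source = cx_Oracle.connect(user, password, name)
# The MINUS runs entirely inside Oracle, so only the difference rows come back to Python.
minus_sql = """
    SELECT column_list_1 FROM T1
    MINUS
    SELECT column_list_2 FROM T2@target_link
"""
df_diff = pd.read_sql(minus_sql, source)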

If you have a DB link then you can do a MINUS, or you can use merge from pandas.
df = pd.read_sql(""" SELECT * from some_table """, source)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
df_combine = df.merge(df_2.drop_duplicates(), how='right', indicator=True)
print(df_combine)
There will be a new column _merge created in df_combine, which will contain the values both (row present in both data frames) and right_only (row present only in df_2).
In the same way you can do a left merge (how='left') to find rows that exist only in df.
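If the goal is the actual MINUS result (rows in the source that are missing from the target), a minimal sketch, assuming both frames share the same column names:
combined = df.merge(df_2.drop_duplicates(), how='outer', indicator=True)
# left_only marks rows that exist in df (source) but not in df_2 (target)
minus_result = combined[combined['_merge'] == 'left_only'].drop(columns='_merge')
print(minus_result)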

Related

Spark Driver stuck when using different windows

I'm having problems with Spark (Spark 3.0.1) when trying to use window functions over different windows.
There are no errors, but the driver gets stuck when trying to show or write the DataFrame (no tasks start in the Spark UI).
The problem happens both when running on yarn and local mode.
Example:
select
col_1,
col_2,
col_3,
sum(x) over (partition by col_1, col_2) s,
max(x) over (partition by col_1, col_3) m
from table
My work-around so far has been to split the window functions in different queries and persist the intermediate results to parquet. Something like this:
df = spark.sql("select col_1, col_2, col_3, sum(x) over (partition by col_1, col_2) s from table")
df.write.parquet(path)
df = spark.read.parquet(path)
df.createOrReplaceTempView("table")
df = spark.sql("select *, max(x) over (partition by col_1, col_3) m from table")
But I don't want to keep repeating this work-around.
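For what it's worth, the work-around can at least be wrapped in a small helper so it doesn't have to be repeated by hand; this is only a sketch of the same persist-to-parquet approach (the path is a placeholder), not a fix for the hang itself:
def materialize(spark, sql, path, view_name):
    # run one window query, persist it, and re-register the materialized result
    spark.sql(sql).write.mode("overwrite").parquet(path)
    df = spark.read.parquet(path)
    df.createOrReplaceTempView(view_name)
    return df

materialize(spark, "select col_1, col_2, col_3, sum(x) over (partition by col_1, col_2) s from table", "/tmp/step1", "table")
df = spark.sql("select *, max(x) over (partition by col_1, col_3) m from table")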

Join based on multiple complex conditions in Python

I am wondering if there is a way in Python (within or outside pandas) to do the equivalent of a SQL join on two tables based on multiple complex conditions, such as the value in table 1 being more than 10 less than the value in table 2, or the join only applying when some field in table 1 satisfies certain conditions, etc.
This is for combining some fundamental tables into a joint table with more fields and information. I know that in pandas we can merge two dataframes on some column names, but such a mechanism seems too simple to give the desired results.
For example, the equivalent SQL code could be like:
SELECT
a.*,
b.*
FROM Table1 AS a
JOIN Table2 AS b
ON
a.id = b.id AND
a.sales - b.sales > 10 AND
a.country IN ('US', 'MX', 'GB', 'CA')
I would like an equivalent way to achieve the same joined table in Python on two data frames. Can anyone share insights?
Thanks!
In principle, your query can be rewritten as a join plus a filtering WHERE clause.
SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')
Assuming the DataFrames are gigantic and you don't want a big intermediate table, we can filter DataFrame A first.
import pandas as pd
df_a, df_b = pd.DataFrame(...), pd.DataFrame(...)
# since A.country has nothing to do with the join, we can filter it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]
# join
merged = pd.merge(df_a, df_b, on='id', how='inner')
# filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]
Off-topic: depending on the use case, you may want to take the abs() of the difference.
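A minimal sketch of that variant, reusing merged from the code above:
# keep rows where the sales figures differ by more than 10 in either direction
merged = merged[(merged["sales_x"] - merged["sales_y"]).abs() > 10]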

Joining multiple data frames in one statement and selecting only required columns

I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all of these Spark data frames using Python. This is the SQL statement I need to replicate.
SQL:
select df1.*,df2.salary,df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the code below in PySpark using Python, but I am receiving an error.
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: Can we join all the three DataFrames in one go and select the required columns?
If you have a SQL query that works, why not use pyspark-sql?
First use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrame as a temporary table:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:
query = "select df1.*, df2.salary, df3.dob " \
"from df1 " \
"left join df2 on df1.id=df2.id "\
"left join df3 on df1.name=df3.name"
joined_df = spark.sql(query)
You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')
Then the following should work:
from pyspark.sql.functions import col
joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
.join(df3, col('df1.name') == col('df3.name'), 'left') \
.select('df1.*', 'df2.salary', 'df3.dob')

How do you cleanly pass column names into cursor, Python/SQLite?

I'm new to cursors and I'm trying to practice by building a dynamic Python SQL insert statement using a sanitized method for sqlite3:
import sqlite3
conn = sqlite3.connect("db.sqlite")
cursor = conn.cursor()
list = ['column1', 'column2', 'column3', 'value1', 'value2', 'value3']
cursor.execute("""insert into table_name (?,?,?)
values (?,?,?)""", list)
When I attempt to use this, I get a syntax error (sqlite3.OperationalError: near "?") on the line with the values, despite the fact that when I hard-code the columns (and remove the column names from the list), I have no problem. I could construct the statement with %s, but I know that the sanitized method is preferred.
How do I insert these cleanly? Or am I missing something obvious?
The (?, ?, ?) placeholder syntax works only for the tuple containing the values, imho... That would be the reason for the sqlite3.OperationalError.
I believe(!) that you ought to build it similar to this:
cursor.execute("INSERT INTO {tn} ({f1}, {f2}) VALUES (?, ?)".format(tn='testable', f1='foo', f2='bar'), ('test', 'test2',))
But this does not solve the injection problem if the user is allowed to provide the table name or field names himself.
I do not know of any built-in method to help against that, but you could use a function like this:
def clean(some_string):
    return ''.join(char for char in some_string if char.isalnum())
to sanitize the user-given table name or field names. This should suffice, because table/field names usually consist only of alphanumeric characters.
Perhaps it may be smart to check whether
some_string == clean(some_string)
and, if that is False, raise a nice exception to be on the safe side.
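A minimal sketch of that check (validated and its error message are just illustrative names):
def validated(identifier):
    # reject any identifier that contains characters clean() would strip out
    if identifier != clean(identifier):
        raise ValueError("unsafe identifier: {!r}".format(identifier))
    return identifier

cursor.execute(
    "INSERT INTO {tn} ({f1}, {f2}) VALUES (?, ?)".format(
        tn=validated('testable'), f1=validated('foo'), f2=validated('bar')),
    ('test', 'test2'))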
During my work with SQL & Python I have felt that you rarely need to let the user name tables and fields himself, so this was seldom necessary for me.
If anyone could elaborate some more and share their insights, I would greatly appreciate it.
First, I would start by creating a mapping of columns and values:
data = {'column1': 'value1', 'column2': 'value2', 'column3': 'value3'}
And, then get the columns from here:
columns = data.keys()
# ['column1', 'column3', 'column2']
Now, we need to create placeholders for both columns and for values:
placeholder_columns = ", ".join(data.keys())
# 'column1, column3, column2'
placeholder_values = ", ".join([":{0}".format(col) for col in columns])
# ':column1, :column3, :column2'
Then, we create the INSERT SQL statement:
sql = "INSERT INTO table_name ({placeholder_columns}) VALUES ({placeholder_values})".format(
placeholder_columns=placeholder_columns,
placeholder_values=placeholder_values
)
# 'INSERT INTO table_name (column1, column3, column2) VALUES (:column1, :column3, :column2)'
Now, what we have in sql is a valid SQL statement with named parameters. Now you can execute this SQL query with the data:
cursor.execute(sql, data)
And, since data has keys and values, it will use the named placeholders in the query to insert the values in correct columns.
Have a look at the documentation to see how named parameters are used. From what I can see, you only need to worry about sanitization for the values being inserted, and there are two ways to do that: 1) question mark style, or 2) named parameter style.
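For illustration, the same (hypothetical) two-column insert written in both styles:
# 1) question mark style: positional placeholders, values passed as a sequence
cursor.execute("INSERT INTO table_name (column1, column2) VALUES (?, ?)",
               ('value1', 'value2'))
# 2) named parameter style: :name placeholders, values passed as a mapping
cursor.execute("INSERT INTO table_name (column1, column2) VALUES (:column1, :column2)",
               {'column1': 'value1', 'column2': 'value2'})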
So, here's what I ended up implementing. I thought it was pretty pythonic, but I couldn't have answered it without Krysopath's insight:
columns = ['column1', 'column2', 'column3']
values = ['value1', 'value2', 'value3']
columns = ', '.join(columns)
insertString = "insert into table_name (%s) values (?, ?, ?)" % columns
cursor.execute(insertString, values)
import sqlite3
conn = sqlite3.connect("db.sqlite")
cursor = conn.cursor()
## break your list into two, one for columns and one for values
columns = ['column1', 'column2', 'column3']
values = ['value1', 'value2', 'value3']
## concatenate only the column names into the statement; keep ? placeholders for the values
cursor.execute("insert into table_name (" + ", ".join(columns) + ") values (?, ?, ?)", values)

How to use Pandas.DataFrame as input of SQL query?

I am trying to use Pandas.DataFrame as the intermediate result dataset between two consecutive SQL queries.
I imagine it looks like:
import pandas.io.sql as pisql
import pyodbc
SQL_command1 = """
select * from tab_A
"""
result = pisql.read_frame(SQL_command1)
SQL_command2 = """
select *
from ? A
inner join B
on A.id = B.id
"""
pyodbc.cursor.execute(SQL_command2, result)
The SQL_command2 in the above code is simply pseudo-code, where ? takes the result as input and is given the alias name A.
This is my first time using Pandas, so I'm not confident whether my idea is feasible or efficient. Can anyone enlighten me, please?
Many thanks.
The pseudo-code would look like this:
import pandas as pd
df_a = pd.read_csv('tab_a.csv')  # or read_sql or another read engine
df_b = pd.read_csv('tab_b.csv')
result = pd.merge(left=df_a,
                  right=df_b,
                  how='inner',
                  on='id')  # assuming 'id' is in both tables
And to select columns of a pandas DataFrame, it would be something like df_a[['col1','col2','col3']].
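Since the question actually reads from databases rather than CSV files, the same pattern with pd.read_sql might look like this (the connection object is a placeholder; the table names are taken from the question):
import pandas as pd
import pyodbc

conn = pyodbc.connect(connection_string)  # hypothetical connection string

# the first SQL step becomes a plain DataFrame...
df_a = pd.read_sql("select * from tab_A", conn)
df_b = pd.read_sql("select * from B", conn)

# ...and the second SQL step (the join on A.id = B.id) becomes a pandas merge
result = pd.merge(df_a, df_b, how='inner', on='id')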
