How to run multiple inserts on multiple tables in parallel using PySpark - python

I insert data from staging tables into main tables using SQL queries in PySpark. The problem is that I have inserts into multiple tables. What should be done to achieve parallelism, other than using threading?
spark.sql("INSERT INTO Cls.tbl1 (Contract, Name)
SELECT s.Contract, s.Name
FROM tbl1 AS s LEFT JOIN Cls.tbl1 AS c
ON s.Contract = c.Contract AND s.Adj = c.Adj
WHERE c.Contract IS NULL")
spark.sql("INSERT INTO Cls.tbl2 (Contract, Name)
SELECT s.Contract, s.Name
FROM tbl2 AS s LEFT JOIN Cls.tbl2 AS c
ON s.Contract = c.Contract AND s.Adj = c.Adj
WHERE c.Contract IS NULL")
We have to execute multiple INSERT statements like the ones above, and we want to achieve parallelism when running them through Spark.

In short, you cannot run them in parallel within a single script. But if you run two different jobs, each inserting into one table, you can sort of achieve parallelism with this approach.
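A minimal sketch of that approach, assuming a parameterized driver script (the file name insert_job.py and the argument handling are illustrative), so that each insert becomes its own spark-submit job:
# insert_job.py - one staging/target pair per Spark application
import sys
from pyspark.sql import SparkSession

staging_table, target_table = sys.argv[1], sys.argv[2]

spark = (SparkSession.builder
         .appName(f"insert_{target_table}")
         .enableHiveSupport()
         .getOrCreate())

spark.sql(f"""
    INSERT INTO {target_table} (Contract, Name)
    SELECT s.Contract, s.Name
    FROM {staging_table} AS s LEFT JOIN {target_table} AS c
      ON s.Contract = c.Contract AND s.Adj = c.Adj
    WHERE c.Contract IS NULL
""")

spark.stop()
Submitted twice, e.g. spark-submit insert_job.py tbl1 Cls.tbl1 and spark-submit insert_job.py tbl2 Cls.tbl2, the two inserts run as independent applications and can overlap in time.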

Related

Organizing SQL Queries in Python project

I'm creating a Python script that will replace a set of SQL Server stored procedures to make the process more efficient. However, I have 20-30 queries I need to execute at different points. To keep the main script simpler, I organized them into a dictionary in a separate file and created a function to pull the query to be executed.
My question is: is there a better way to organize them? One idea I had was to put them into a table on the SQL Server. Is my current method best, or is there another, better method? Below is an example of what I'm doing now:
queryDict = {}
queryDict.update({"dbQuery1": "TRUNCATE TABLE MyTable;\
INSERT MyTable (Column1, Column2)\
SELECT Col1, Col2 FROM myTable2;"})
queryDict.update({"dbQuery2": 'SELECT MAX(val) FROM MyTable3;'})
def queryRequest(query):
    return queryDict[query]

How can I use "where not exists" SQL condition in pyspark?

I have a table in Hive and I am trying to insert data into it. I am taking the data from SQL, but I don't want to insert ids that already exist in the Hive table. I am trying to use a condition equivalent to WHERE NOT EXISTS. I am using PySpark on Airflow.
The EXISTS operator doesn't exist in Spark, but there are two join types that can replace it: left_anti and left_semi.
If, for example, you want to insert a dataframe df into a Hive table target, you can do:
new_df = df.join(
    spark.table("target"),
    how='left_anti',
    on='id'
)
Then you write new_df into your table.
left_anti keeps only the rows which do not meet the join condition (the equivalent of NOT EXISTS). The equivalent of EXISTS is left_semi.
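For the final write step, a minimal sketch (assuming target is an existing Hive table; insertInto matches columns by position, while saveAsTable in append mode matches by name):
# append only the rows that had no match in the target table
new_df.write.mode("append").insertInto("target")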
You can use NOT EXISTS directly in Spark SQL on the dataframes through temp views:
table_withNull_df.createOrReplaceTempView("table_withNull")
tblA_NoNull_df.createOrReplaceTempView("tblA_NoNull")
result_df = spark.sql("""
    SELECT * FROM table_withNull
    WHERE NOT EXISTS (
        SELECT 1 FROM tblA_NoNull
        WHERE table_withNull.id = tblA_NoNull.id
    )
""")
This method can be preferable to left anti joins, since those can cause an unexpected BroadcastNestedLoopJoin resulting in a broadcast timeout (even without explicitly requesting the broadcast in the anti join).
After that you can use write.mode("append") to insert the previously unseen data.
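Staying with the temp-view style, the final append can also be written as a SQL INSERT; a sketch, assuming the Hive destination is called target_table (a placeholder name):
# expose the previously unseen rows as a view, then append them in SQL
result_df.createOrReplaceTempView("new_rows")
spark.sql("INSERT INTO target_table SELECT * FROM new_rows")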
IMHO, I don't think such a feature exists in Spark. I think you can use two approaches:
A workaround with a UNIQUE constraint (typical of relational DBs): that way, when you try to insert (in append mode) an already existing record, you'll get an exception that you can handle properly.
Read the table you want to write to, outer join it with the data you want to add, and then write the result in overwrite mode (but I think the first solution may perform better).
For more details feel free to ask

Teradata MERGE yielding no results when executed through SQLAlchemy

I'm attempting to use Python with SQLAlchemy to download some data, create a temporary staging table on a Teradata server, then MERGE that table into another table which I've created to permanently store this data. I'm using sql = sqlalchemy.text(merge) and td_engine.execute(sql), where merge is a string similar to the below:
MERGE INTO perm_table as p
USING temp_table as t
ON p.Id = t.Id
WHEN MATCHED THEN
UPDATE
SET col1 = t.col1,
col2 = t.col2,
...
col50 = t.col50
WHEN NOT MATCHED THEN
INSERT (col1,
col2,
...
col50)
VALUES (t.col1,
t.col2,
...
t.col50)
The script runs all the way to the end without error, and the SQL executes properly through Teradata Studio, but for some reason the table won't update when I execute it through SQLAlchemy. However, I've also run different SQL expressions, like the INSERT that populated perm_table, from the same Python script and they worked fine. Maybe there's something specific to the MERGE and SQLAlchemy combination?
Since you're using the engine directly, without using a transaction, you're probably (barring unseen configuration on your part) relying on SQLAlchemy's version of autocommit, which works by detecting data-changing operations such as INSERTs. Possibly MERGE is not one of the detected operations. Try:
sql = sqlalchemy.text(merge).execution_options(autocommit=True)
td_engine.execute(sql)
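An explicit transaction sidesteps the autocommit detection entirely; a sketch of that alternative, using the same merge string:
# engine.begin() opens a transaction and commits it when the block exits cleanly
with td_engine.begin() as conn:
    conn.execute(sqlalchemy.text(merge))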

Pandas read_sql query with multiple selects

Can a read_sql query handle a SQL script with multiple SELECT statements?
I have an MSSQL script that performs different tasks, but I don't want to have to write an individual query for each case. I would like to write just the one script and pull in the multiple result tables.
I want the multiple queries in the same script because they are related, and it makes updating the script easier.
For example:
SELECT ColumnX_1, ColumnX_2, ColumnX_3
FROM Table_X
INNER JOIN (Etc etc...)
----------------------
SELECT ColumnY_1, ColumnY_2, ColumnY_3
FROM Table_Y
INNER JOIN (Etc etc...)
Which leads to two separate query results.
The subsequent python code is:
scriptFile = open('.../SQL Queries/SQLScript.sql','r')
script = scriptFile.read()
engine = sqlalchemy.create_engine("mssql+pyodbc://UserName:PW!#Table")
connection = engine.connect()
df = pd.read_sql_query(script,connection)
connection.close()
Only the first table from the query is brought in.
Is there any way I can pull in both query results (maybe with a dictionary) that would prevent me from having to separate the query into multiple scripts?
You could do the following:
queries = """
SELECT ColumnX_1, ColumnX_2, ColumnX_3
FROM Table_X
INNER JOIN (Etc etc...)
---
SELECT ColumnY_1, ColumnY_2, ColumnY_3
FROM Table_Y
INNER JOIN (Etc etc...)
""".split("---")
Now you can query each table and concat the result:
df = pd.concat([pd.read_sql_query(q, connection) for q in queries])
Another option is to use UNION on the two results, i.e. do the concat in SQL.
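If you would rather keep the two result sets separate (the dictionary idea from the question), a minimal sketch with illustrative keys:
# one DataFrame per SELECT statement, keyed by a name of your choosing
dfs = {
    "table_x": pd.read_sql_query(queries[0], connection),
    "table_y": pd.read_sql_query(queries[1], connection),
}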

Python script to diff same table in two different databases

I am about to write a Python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, which I can either use directly or use as a starting point for rolling my own. The idea is to diff the data between specific tables, and then store the diff as SQL INSERT statements to be applied to the earlier-version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
    # return all rows in table2 that are not in table1
    pass

def persist_rows_tofile(rows, tablename):
    # save rows to file
    pass

dbnames = ('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')
for table in tables_to_process:
    table1 = dbnames[0] + '.' + table
    table2 = dbnames[1] + '.' + table
    rows = diff_table(table1, table2)
    if len(rows):
        persist_rows_tofile(rows, table)
Is this a good way to write such a script, or could it be improved? I suspect it could be improved by caching database connections, etc. (which I have left out because I am not too familiar with SQLAlchemy).
Any tips on how to add SQLAlchemy and generally improve such a script?
To move data between two databases I use pg_comparator. It's like diff and patch for SQL! You can use it to swap the order of columns, but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron job runs every five minutes and pushes all changes on the "master" database to the "slave" databases. It's especially handy if you only need to distribute a single table, or not all the columns per table, etc.
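If you do roll your own in Python instead, here is a rough SQLAlchemy sketch of the diff_table idea, assuming both versions are reachable from one engine and the backend supports EXCEPT (the connection URL is a placeholder):
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pw@localhost/mydb")  # placeholder URL

def diff_table(table1, table2):
    # rows present in table2 (the newer version) but missing from table1
    with engine.connect() as conn:
        return conn.execute(
            text(f"SELECT * FROM {table2} EXCEPT SELECT * FROM {table1}")
        ).fetchall()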
