Pandas read_sql query with multiple selects - python

Can pandas' read_sql_query handle a SQL script with multiple SELECT statements?
I have an MSSQL script that performs several different tasks, but I don't want to have to write an individual query for each case. I would like to write just the one script and pull in the multiple tables.
I want the multiple queries in the same script because the queries are related, and keeping them together makes updating the script easier.
For example:
SELECT ColumnX_1, ColumnX_2, ColumnX_3
FROM Table_X
INNER JOIN (Etc etc...)
----------------------
SELECT ColumnY_1, ColumnY_2, ColumnY_3
FROM Table_Y
INNER JOIN (Etc etc...)
Which leads to two separate query results.
The subsequent python code is:
scriptFile = open('.../SQL Queries/SQLScript.sql','r')
script = scriptFile.read()
engine = sqlalchemy.create_engine("mssql+pyodbc://UserName:PW!@Table")
connection = engine.connect()
df = pd.read_sql_query(script,connection)
connection.close()
Only the first table from the query is brought in.
Is there any way I can pull in both query results (maybe with a dictionary) that will prevent me from having to separate the query into multiple scripts?

You could do the following:
queries = """
SELECT ColumnX_1, ColumnX_2, ColumnX_3
FROM Table_X
INNER JOIN (Etc etc...)
---
SELECT ColumnY_1, ColumnY_2, ColumnY_3
FROM Table_Y
INNER JOIN (Etc etc...)
""".split("---")
Now you can query each table and concat the result:
df = pd.concat([pd.read_sql_query(q, connection) for q in queries])
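If you would rather keep the two result sets separate, as the question suggests, collect them into a dictionary instead of concatenating (a minimal sketch; the "query_0"/"query_1" keys are made up for illustration):
dfs = {"query_{}".format(i): pd.read_sql_query(q, connection)
       for i, q in enumerate(queries)}
df_x = dfs["query_0"]  # result of the first SELECT
df_y = dfs["query_1"]  # result of the second SELECT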
Another option is to use UNION on the two results, i.e. do the concat in SQL.
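For example (a sketch; UNION only works when both SELECTs return the same number of type-compatible columns, and UNION ALL is needed if you want to keep duplicate rows, matching what concat does):
union_sql = """
SELECT ColumnX_1, ColumnX_2, ColumnX_3
FROM Table_X
UNION ALL
SELECT ColumnY_1, ColumnY_2, ColumnY_3
FROM Table_Y
"""
df = pd.read_sql_query(union_sql, connection)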

Related

How to run multiple inserts on multiple tables in parallel using Pyspark

I am inserting data from staging tables into main tables with SQL queries in PySpark. The problem is that I have inserts into multiple tables. What has to be done to achieve parallelism, other than using threading?
spark.sql("INSERT INTO Cls.tbl1 (Contract, Name)
SELECT s.Contract, s.Name
FROM tbl1 AS s LEFT JOIN Cls.tbl1 AS c
ON s.Contract = c.Contract AND s.Adj = c.Adj
WHERE c.Contract IS NULL")
spark.sql("INSERT INTO Cls.tbl2 (Contract, Name)
SELECT s.Contract, s.Name
FROM tbl2 AS s LEFT JOIN Cls.tbl2 AS c
ON s.Contract = c.Contract AND s.Adj = c.Adj
WHERE c.Contract IS NULL")
We have to execute multiple insert statements as above and also we want to achieve parallelism when running through spark.
In short, you cannot run them in parallel within a single job. But if you run two different jobs, each inserting into one table, you can sort of achieve parallelism with this approach.
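A minimal sketch of that approach, assuming a hypothetical insert_job.py that takes the table name as an argument (the file name and argument handling are made up for illustration):
import sys
from pyspark.sql import SparkSession

# Submit this script once per target table, e.g.:
#   spark-submit insert_job.py tbl1
#   spark-submit insert_job.py tbl2
table = sys.argv[1]

spark = SparkSession.builder.appName("insert_" + table).getOrCreate()
spark.sql("""INSERT INTO Cls.{t} (Contract, Name)
             SELECT s.Contract, s.Name
             FROM {t} AS s LEFT JOIN Cls.{t} AS c
             ON s.Contract = c.Contract AND s.Adj = c.Adj
             WHERE c.Contract IS NULL""".format(t=table))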

SQLAlchemy join with tables created on the fly

I have the following query:
select t2.col as tag, count(*)
from my_table t1 JOIN TABLE(SPLIT(tags,'/')) as t2
where t2.col != ''
group by tag
(TABLE(SPLIT(tags,'/')) creates a temporary table by splitting the tags field.)
The query works just fine when run on the database directly, but I am having trouble building this join clause with SQLAlchemy.
How can I perform a join with a table that is created on the fly and uses functions that aren't defined in SQLAlchemy?
Thanks.
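One possible direction is SQLAlchemy's generic func construct together with table_valued() (a minimal sketch, assuming SQLAlchemy 1.4+; note that this renders SPLIT(tags, '/') directly, so a dialect that insists on the TABLE(...) wrapper may still need a raw text() fragment):
from sqlalchemy import column, func, select, table, true

my_table = table("my_table", column("tags"))

# func.split renders SPLIT(my_table.tags, '/') even though SQLAlchemy
# does not define SPLIT itself; table_valued() exposes the result as a
# joinable table with a single column named "col".
t2 = func.split(my_table.c.tags, "/").table_valued("col")

stmt = (
    select(t2.c.col.label("tag"), func.count())
    .select_from(my_table.join(t2, true()))
    .where(t2.c.col != "")
    .group_by(t2.c.col)
)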

Pass Pandas df as table to SQL query

I want to pass a local df as a table to inner join on a SQL Server, like so:
sql = """
select top 10000 *
from Table1 as t
inner join {} as a on t.id= a.id
""".format(pandas_df)
results = pd.read_sql_query(sql,conn)
This is obviously not the way to do it.
Any ideas?
Thanks!
You need to write your dataframe to a SQL table before joining against it, using pandas_df.to_sql(name_of_table, con).
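For example (a minimal sketch; the table name "temp_ids" is made up for illustration, and to_sql needs a SQLAlchemy connectable rather than a raw DB-API connection):
pandas_df.to_sql("temp_ids", conn, if_exists="replace", index=False)
sql = """
select top 10000 *
from Table1 as t
inner join temp_ids as a on t.id = a.id
"""
results = pd.read_sql_query(sql, conn)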
I see two main options, depending on the data size of your id's. The simplest way would be to add the ids to an IN clause in your SQL statement.
This approach is useful if you don't have write permission on the database, but you are limited by the maximum batch size of a SQL statement (around 256 MB, if I recall correctly).
From your id series, you create a tuple of the ids you're interested in, then cast the tuple to a string to concatenate with your SQL statement:
sql = """
select top 10000 *
from Table1 as t
where t.id in """ + str(tuple(pandas.df['id'].values))
results = pd.read_sql_query(sql,conn)
Otherwise, you can use df.to_sql to load the dataframe into the database first, as described above.

sqlalchemy error when calling mysql stored procedure

I'm using sqlalchemy to run queries on a MySQL server from Python.
I initialize sqlalchemy with:
engine = create_engine("mysql+mysqlconnector://{user}:{password}@{host}:{port}/{database}".format(**connection_params))
conn = engine.connect()
Where connection_params is a dict containing the server access details.
I'm running this query:
SELECT
new_db.asset_specification.identifier_code,
new_db.asset_specification.asset_name,
new_db.asset_specification.asset_type,
new_db.asset_specification.currency_code,
new_db.sector_map.sector_description,
new_db.super_sector_map.super_sector_description,
new_db.country_map.country_description,
new_db.country_map.country_macro_area
FROM new_db.asset_specification
INNER JOIN new_db.identifier_code_legal_entity_map on new_db.asset_specification.identifier_code = new_db.identifier_code_legal_entity_map.identifier_code
INNER JOIN new_db.legal_entity_map on new_db.identifier_code_legal_entity_map.legal_entity_code = new_db.legal_entity_map.legal_entity_code
INNER JOIN new_db.sector_map on new_db.legal_entity_map.legal_entity_sector = new_db.sector_map.sector_code
INNER JOIN new_db.super_sector_map on new_db.legal_entity_map.legal_entity_super_sector = new_db.super_sector_map.super_sector_code
INNER JOIN new_db.country_map on new_db.legal_entity_map.legal_entity_country = new_db.country_map.country_code
WHERE new_db.asset_specification.identifier_code = str_identifier_code;
Using conn.execute(query) (where I set query equal to the string above).
This runs just fine.
I tried to put my query in a stored procedure like:
CREATE DEFINER=`root`@`localhost` PROCEDURE `test_anag`(IN str_identifier_code varchar(100))
BEGIN
SELECT
new_db.asset_specification.identifier_code,
new_db.asset_specification.asset_name,
new_db.asset_specification.asset_type,
new_db.asset_specification.currency_code,
new_db.sector_map.sector_description,
new_db.super_sector_map.super_sector_description,
new_db.country_map.country_description,
new_db.country_map.country_macro_area
FROM new_db.asset_specification
INNER JOIN new_db.identifier_code_legal_entity_map on new_db.asset_specification.identifier_code = new_db.identifier_code_legal_entity_map.identifier_code
INNER JOIN new_db.legal_entity_map on new_db.identifier_code_legal_entity_map.legal_entity_code = new_db.legal_entity_map.legal_entity_code
INNER JOIN new_db.sector_map on new_db.legal_entity_map.legal_entity_sector = new_db.sector_map.sector_code
INNER JOIN new_db.super_sector_map on new_db.legal_entity_map.legal_entity_super_sector = new_db.super_sector_map.super_sector_code
INNER JOIN new_db.country_map on new_db.legal_entity_map.legal_entity_country = new_db.country_map.country_code
WHERE new_db.asset_specification.identifier_code = str_identifier_code;
END
I can run the stored procedure from the query editor in mysql workbench with CALL new_db.test_anag('000000') and I get the desired result (which is a single line).
Now I try to run:
res = conn.execute("CALL new_db.test_anag('000000')")
But it fails with the following exception
sqlalchemy.exc.InterfaceError: (mysql.connector.errors.InterfaceError) Use multi=True when executing multiple statements [SQL: "CALL new_db.test_anag('000000')"]
I looked around but I can't find anything useful on this error, and for the love of me I can't get my head around it. I'm not an expert on either MySQL or sqlalchemy (or anything RDBMS), but this one looks like it should be easy to fix. Let me know if more info is required.
Thanks in advance for the help.
From reading a related question it can be seen that mysql.connector automatically fetches and stores multiple result sets when executing stored procedures that produce them, even if only one result set is produced. SQLAlchemy, on the other hand, does not directly support multiple result sets. To execute stored procedures use callproc(). To access a DB-API cursor in SQLAlchemy you have to use a raw connection. In the case of mysql.connector the produced result sets can be accessed using stored_results():
from contextlib import closing

# Create a raw MySQLConnection
conn = engine.raw_connection()
try:
    # Get a MySQLCursor
    with closing(conn.cursor()) as cursor:
        # Call the stored procedure
        result_args = cursor.callproc('new_db.test_anag', ['000000'])
        # Iterate through the result sets produced by the procedure
        for result in cursor.stored_results():
            result.fetchall()
finally:
    conn.close()
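If the goal is a pandas DataFrame, each stored result can be wrapped directly inside that loop (a sketch, assuming pandas is imported as pd as elsewhere on this page; mysql.connector result objects expose the column names via description):
for result in cursor.stored_results():
    df = pd.DataFrame(result.fetchall(),
                      columns=[col[0] for col in result.description])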

How to copy and paste a SQL query into pandas read_sql

I'm new to Python and am trying to run SQL code in Python and get the results into a pandas dataframe. I'm using the following code, and it runs when I have a simple SQL query. But when I try to run a super long and complex query with proper formatting in SQL, it fails. Can I use any module/option so Python accepts the indentation and spacing of a formatted SQL query?
cnxn = ...  # the connection to my SQL Server database
sql_2=
r'( Select distinct NPI,
practice_code=RIGHT('000'+CAST(newcode AS VARCHAR(3)),3),
SRcode,
StandardZip,
Zipclass,
CountySSA,
PrimaryCountySSA,
PrimaryCounty,
PrimaryCountyClass,
Lat_Clean,
Long_Clean
FROM Docusinporactice a
LEFT JOIN locationInfo b
on a.zip=b.zip
)
sql_data_test=pd.read_sql_query(sql_2, cnxn)
r = """ Select distinct NPI,
practice_code=RIGHT('000'+CAST(newcode AS VARCHAR(3)),3),
SRcode,
StandardZip,
Zipclass,
CountySSA,
PrimaryCountySSA,
PrimaryCounty,
PrimaryCountyClass,
Lat_Clean,
Long_Clean
FROM Docusinporactice a
LEFT JOIN locationInfo b
on a.zip=b.zip
"""
Written this way, with the statement wrapped in triple quotes, the SQL should work.
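Alternatively, keep the formatted SQL in its own file and read it at run time, as in the first question on this page (a sketch; the file name is made up):
from pathlib import Path

sql_2 = Path('my_query.sql').read_text()
sql_data_test = pd.read_sql_query(sql_2, cnxn)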
