Utility to find join columns - python

I have been given several tables in SQL Server and am trying to figure out the best way to join them.
What I've done is:
1) open a connection in R to the database
2) pull all the column names from the INFORMATION_SCHEMA.COLUMNS table
3) build loops in R to try every combination of columns and see what the row count is of the inner join of the 2 columns
I'm wondering if there's a better way to do this or if there's a package or utility that helps with this type of problem.

You could do your joins in Python using pandas. Pandas has a powerful IO engine, so you could import from SQL Server into a pandas dataframe, perform your joins in Python, and write back to SQL Server.
Below is a script I use to perform an import from SQL Server and an export to a MySQL table. I use the python package sqlalchemy for my ORM connections. You could follow this example and read up on joins in pandas.
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# MySQL info
username = 'user'
password = 'pw'
sqlDB = 'mydb'

# Create MSSQL PSS connector
server = 'server'
database = 'mydb'
connMSSQL = pyodbc.connect(
    'DRIVER={ODBC Driver 13 for SQL Server};'
    f'SERVER={server};PORT=1433;DATABASE={database};Trusted_Connection=yes;')

# Read table into pandas dataframe
tsql = '''
    SELECT [Index],
           Tag
    FROM [dbo].[Tags]
'''
df = pd.read_sql(tsql, connMSSQL, index_col='Index')

# Write df to MySQL db
engine = create_engine(
    f'mysql+mysqldb://{username}:{password}@localhost/{sqlDB}', pool_recycle=3600)
with engine.connect() as connMySQL:
    df.to_sql('pss_alarms', connMySQL, if_exists='replace')
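Back to the original question of discovering likely join keys: once both tables are in dataframes, the brute-force search over column pairs can be done with pandas merges instead of SQL inner joins. A minimal sketch with two hypothetical toy dataframes (not the asker's actual tables):

```python
import pandas as pd
from itertools import product

# Toy stand-ins for two tables pulled from SQL Server (hypothetical data)
orders = pd.DataFrame({'cust_id': [1, 2, 2, 3], 'amount': [10, 20, 30, 40]})
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

# Try every column pair and record how many rows the inner join produces
results = {}
for left_col, right_col in product(orders.columns, customers.columns):
    try:
        joined = orders.merge(customers, left_on=left_col,
                              right_on=right_col, how='inner')
        results[(left_col, right_col)] = len(joined)
    except (TypeError, ValueError):
        # Incompatible dtypes (e.g. joining int to str) are skipped
        results[(left_col, right_col)] = 0

# The pair with the highest non-trivial row count is a join-key candidate
best = max(results, key=results.get)
```

With the toy data above, `('cust_id', 'id')` comes out on top. This mirrors the loop described in the question, just in-memory rather than round-tripping every combination through the server.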

Related

Python to MS SQL Error: Error when connecting to SQL using sqlalchemy.create_engine() using pypyodbc

Scenario:
I am trying to convert the SQL output directly to a table using dataframe.to_sql. For that I am using sqlalchemy.create_engine(), and it throws an error when calling create_engine():
sqlchemyparams = urllib.parse.quote_plus(ConnectionString)
sqlchemy_conn_str = 'mssql+pypyodbc:///?odbc_connect={}'.format(sqlchemyparams)
engine_azure = sqlalchemy.create_engine(sqlchemy_conn_str, echo=True,
                                        fast_executemany=True, poolclass=NullPool)
df_top_features.to_sql('Topdata', engine_azure, schema='dbo', index=False,
                       if_exists='replace')
It will work fine if I use pyodbc:
sqlchemy_conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(sqlchemyparams)
So is there any way I can use pypyodbc in sqlchemy_conn_str?
SQLAlchemy does not have a pypyodbc driver defined for the mssql dialect, so
mssql+pypyodbc:// …
simply will not work. There may be some way to "fool" your code into using pypyodbc when you specify mssql+pyodbc://, similar to doing
import pypyodbc as pyodbc
in plain Python, but it is not recommended.
In cases where pyodbc cannot be used, the recommended alternative would be mssql+pymssql://.
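For illustration, a pymssql URL has the same shape as the other dialect URLs but takes the host and database directly rather than an odbc_connect parameter. A sketch with hypothetical credentials, just parsing the URL rather than opening a connection:

```python
from sqlalchemy.engine.url import make_url

# Hypothetical connection URL for the pymssql dialect (no ODBC driver involved)
url = make_url("mssql+pymssql://user:password@dbserver:1433/mydb")

# An Engine would be built from the same string:
#   engine = create_engine("mssql+pymssql://user:password@dbserver:1433/mydb")
```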
Here's what I do
import sqlalchemy as sa
import urllib.parse
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL
Then create variables to hold the server, database, username and password, and pass them to:
params = urllib.parse.quote_plus("DRIVER={SQL Server};"
                                 "SERVER=" + server + ";"
                                 "DATABASE=" + database + ";"
                                 "UID=" + username + ";"
                                 "PWD=" + password + ";")
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))
then upload the data to SQL using:
dfc.to_sql('jobber',con=engine,index=False, if_exists='append')
Using https://www.dataquest.io/blog/sql-insert-tutorial/ as a source.

django panda read sql query map parameters

I am trying to connect to a SQL Server database within the Django framework, to read a SQL query result into a pandas dataframe:
from django.db import connections
query = """SELECT * FROM [dbo].[table] WHERE project=%(Name)s"""
data = pd.read_sql(query, connections[database], params={'Name': input} )
the error message I got is 'format requires a mapping'
If I do something like below, it will work, but I really want to be able to map each parameter by name:
from django.db import connections
query = """SELECT * FROM [dbo].[table] WHERE project=%s"""
data = pd.read_sql(query, connections[database], params=[input])
I was using ODBC Driver 17 for SQL Server.
You can format the query at the string level and then run pd.read_sql.
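A minimal sketch of that string-level formatting, with a hypothetical project name. Note that plain str.format bypasses driver-side parameter escaping, so only do this with trusted, validated input:

```python
# Hypothetical query and value; the placeholder name mirrors the question's %(Name)s
query_template = "SELECT * FROM [dbo].[table] WHERE project = '{Name}'"
query = query_template.format(Name='myproject')

# The formatted string is then passed to pandas as-is:
#   data = pd.read_sql(query, connections[database])
```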

python cx_oracle hang when storing as DataFrame?

I'm trying to store the results of an Oracle SQL query in a dataframe, and the execution hangs indefinitely. But when I print the query results, they come out instantly. What is causing the error when saving this as a DataFrame?
import cx_Oracle
import pandas as pd

dsn_tns = cx_Oracle.makedsn('HOST', 'PORT', service_name='SID')
conn = cx_Oracle.connect(user='USER', password='PASSWORD', dsn=dsn_tns)
curr = conn.cursor()
curr.execute('alter session set current_schema = apps')
df = pd.read_sql('select * from TABLE', curr)

#### THE ALTERNATIVE CODE TO PRINT THE RESULTS
# curr.execute('select * from TABLE')
# for line in curr:
#     print(line)

curr.close()
conn.close()
Pandas' read_sql requires a connection object for its con argument, not the result of a cursor's execute. Also, consider using SQLAlchemy, the recommended interface between pandas and databases, where you define the schema in the engine connection assignment. This engine also allows to_sql calls.
engine = create_engine("oracle+cx_oracle://user:pwd@host:port/dbname")
df = pd.read_sql('select * from TABLE', con=engine)
engine.dispose()
And as mentioned in this DBA post, in Oracle users and schemas are essentially the same thing (unlike other RDBMS). Therefore, try passing apps as the user in the create_engine call with the needed credentials:
engine = create_engine("oracle+cx_oracle://apps:PASSWORD@HOST:PORT/SID")
df = pd.read_sql('select * from TABLE', con=engine)
engine.dispose()

Pandas Change To_SQL Column Mappings

I have a slight problem with regard to pd.to_sql(). My task is to load Excel files into an MSSQL database (the import wizard is not an option). I've used sqlalchemy along with pandas in the past with success, but can't seem to crack this.
from sqlalchemy import create_engine
import pandas as pd
# Parameters for SQL
ServerName = "SERVER_NAME_HERE"
Database = "MY_NAME_HERE"
Driver = "driver=SQL Server Native Client 11.0"
# Create the connection
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database + "?" + Driver)
df1 = pd.read_excel('MY_PATH_HERE')
# do my manipulations below and make sure the dtypes are correct....
#... end my manipulations
df2.to_sql('Auvi-Q_Evzio_Log', engine, if_exists='append', index=False)
ERROR:
pyodbc.ProgrammingError: ('42S22', "[42S22] [Microsoft][SQL Server Native
Client 11.0][SQL Server]Invalid column name 'Created On'. (207)
(SQLExecDirectW)")
My issue is that the schema of the database is already set up and cannot be changed. I have a column in my dataframe, Created On, but the column name in the database is CreatedOn. I have a handful of columns where this issue arises. Is there a way to set the mappings or schema correctly in to_sql? There is a schema parameter in the documentation, but I can't find a valid example.
I could just change the column names of my dataframe to match the schema, but my interest has been piqued otherwise.
I'd try the following approach:
db_tab_cols = pd.read_sql("select * from [Auvi-Q_Evzio_Log] where 1=2", engine) \
.columns.tolist()
df2.columns = db_tab_cols
df2.to_sql('Auvi-Q_Evzio_Log', engine, if_exists='append', index=False)
PS: this solution assumes that df2 has the same column order as the Auvi-Q_Evzio_Log table.
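An alternative that doesn't depend on column order is an explicit rename map before the to_sql call, using only the one mismatch named in the question (the toy dataframe here is hypothetical):

```python
import pandas as pd

# Hypothetical dataframe standing in for the Excel import
df2 = pd.DataFrame({'Created On': ['2017-01-01'], 'Dose': [1]})

# Map dataframe column names to the database schema's names;
# columns not listed in the map are left untouched
column_map = {'Created On': 'CreatedOn'}
df2 = df2.rename(columns=column_map)

# Then append as before:
#   df2.to_sql('Auvi-Q_Evzio_Log', engine, if_exists='append', index=False)
```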

Pyodbc Accessing Multiple Databases on same server

I'm tasked with obtaining data from two MS SQL databases on the same server so I can run a single query that uses info from both databases simultaneously. I am trying to achieve this in Python 2.7 with pyodbc 3.0.7. My query would look like this:
Select forcast.WindGust_Forecast, forcast.Forecast_Date, anoSection.SectionName, refTable.WindGust
FROM [EO1D].[dbo].[Dashboard_Forecast] forcast
JOIN [EO1D].[dbo].[Dashboard_AnoSections] anoSection
ON forcast.Section_ID = anoSection.Record_ID
JOIN [EO1D].[dbo].[Dashboard_AnoCircuits] anoCircuits
ON anoSection.Circuit_Number = anoCircuits.Circuit_Number
JOIN [FTSAutoCaller].[dbo].[ReferenceTable] refTable
ON anoCircuits.StationCode = refTable.StationCode
Where refTable.Circuit IS NOT NULL and refTable.StationCode = 'sil'
the typical connection for pyodbc looks like:
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=SQLSRV01;DATABASE=DATABASE;UID=USER;PWD=PASSWORD')
Which would only allow access to the database name provided.
how would I go about setting up a connection that allows me access to both databases so this query can be ran. The two database names in my case are EO1D and FTSAutoCaller.
You're overthinking it. If you set up the connection as you did above and then simply pass the SQL along to a cursor, it should work.
import pyodbc
conn_string = '<removed>'
conn = pyodbc.connect(conn_string)
cur = conn.cursor()
query = 'select top 10 * from table1 t1 inner join database2..table2 t2 on t1.id = t2.id'
cur.execute(query)
and you are done (tested in my own environment; the connection string and query were different, of course, but it did work).
The query takes care of itself: although I only referenced one of the databases in the connection, the query had no issue reaching both. I'm not 100% sure, but I assume it worked because of the [database]..[table] prefixes.
