Connect to Oracle Database from SQLAlchemy in Python on AWS EC2

I'm using Jupyter-Lab with Python inside a Docker container, which runs on an AWS EC2 instance. This Docker container has Oracle Instant Client installed inside it, so everything is set up. The problem is that I'm still having trouble connecting from this container to my AWS RDS Oracle database, but only when using SQLAlchemy.
When I try the connection using cx-Oracle==8.2.1:
import cx_Oracle

host = '***********************'
user = '*********'
password = '**********'
port = '****'
service = '****'

dsn_tns = cx_Oracle.makedsn(host, port, service)
engine_oracle = cx_Oracle.connect(user=user, password=password, dsn=dsn_tns)
Everything works fine. I can read tables using pandas read_sql(), I can create tables using cx_Oracle execute(), etc.
But when I try to take a DataFrame and send it to my RDS using pandas to_sql(), my cx_Oracle connection returns the error:
DatabaseError: ORA-01036: illegal variable name/number
I then tried to use a SQLAlchemy==1.4.22 engine from the string:
from sqlalchemy import create_engine

tns = """
(DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = %s)(PORT = %s))
    (CONNECT_DATA =
        (SERVER = DEDICATED)
        (SERVICE_NAME = %s)
    )
)
""" % (host, port, service)

engine_alchemy = create_engine('oracle+cx_oracle://%s:%s@%s' % (user, password, tns))
But I get this error:
DatabaseError: ORA-12154: TNS:could not resolve the connect identifier specified
And I keep getting this error even when I try to use pandas read_sql with the SQLAlchemy engine, so I've run out of options. Can somebody help me, please?
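For reference, a minimal sketch (not the original code) of an engine definition that avoids embedding the full TNS descriptor in the URL: the SQLAlchemy 1.4 cx_oracle dialect can build the DSN itself when the service name is passed as a query parameter. The placeholder credentials are the same as above.
import pandas as pd
from sqlalchemy import create_engine

# Hedged sketch: let the cx_oracle dialect assemble the DSN from host, port
# and a service_name query parameter instead of a hand-built TNS descriptor.
engine_alchemy = create_engine(
    'oracle+cx_oracle://%s:%s@%s:%s/?service_name=%s'
    % (user, password, host, port, service)
)

# Quick sanity check
print(pd.read_sql('SELECT 1 FROM dual', engine_alchemy))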
EDIT:
I tried again with SQLAlchemy==1.3.9 and it worked. Does anybody know why?
The code I'm using for reading and sending a test table from and to Oracle is:
sql = """
SELECT *
FROM DADOS_MIS.DR_ACIO_ATIVOS_HASH
WHERE ROWNUM <= 5
"""
df = pd.read_sql(sql, engine_oracle)
dtyp1 = {c: 'VARCHAR2(' + str(df[c].str.len().max()) + ')'
         for c in df.columns[df.dtypes == 'object'].tolist()}
dtyp2 = {c: 'NUMBER'
         for c in df.columns[df.dtypes == 'float64'].tolist()}
dtyp3 = {c: 'DATE'
         for c in df.columns[df.dtypes == 'datetime'].tolist()}
dtyp4 = {c: 'NUMBER'
         for c in df.columns[df.dtypes == 'int64'].tolist()}

dtyp_total = dtyp1
dtyp_total.update(dtyp2)
dtyp_total.update(dtyp3)
dtyp_total.update(dtyp4)
df.to_sql(name='teste', con=engine_oracle, if_exists='replace', dtype=dtyp_total, index=False)
The dtyp_total is:
{'IDENTIFICADOR': 'VARCHAR2(32)',
'IDENTIFICADOR_PRODUTO': 'VARCHAR2(32)',
'DATA_CHAMADA': 'VARCHAR2(19)',
'TABULACAO': 'VARCHAR2(25)'}
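As an aside, raw Oracle type strings such as 'VARCHAR2(32)' are only accepted by some pandas/SQLAlchemy combinations; pandas documents the dtype argument of to_sql() as taking SQLAlchemy type objects. A minimal sketch of the same mapping expressed that way, assuming the same df and the SQLAlchemy engine from above:
import sqlalchemy.types as sat

# Hedged sketch: build the dtype mapping with SQLAlchemy type objects rather
# than raw 'VARCHAR2(n)' strings; column names come from the df loaded above.
dtyp_total = {}
for c in df.columns:
    if df[c].dtype == 'object':
        dtyp_total[c] = sat.VARCHAR(int(df[c].str.len().max()))
    elif df[c].dtype in ('float64', 'int64'):
        dtyp_total[c] = sat.Numeric()
    elif str(df[c].dtype).startswith('datetime'):
        dtyp_total[c] = sat.DateTime()

df.to_sql(name='teste', con=engine_alchemy,  # a SQLAlchemy engine, as to_sql expects
          if_exists='replace', dtype=dtyp_total, index=False)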

Related

How to get the column names in redshift using Python boto3

I want to get the column names in Redshift using Python boto3.
Created a Redshift cluster
Inserted data into it
Configured Secrets Manager
Configured a SageMaker notebook
Opened the Jupyter notebook and wrote the code below
import boto3
import time

client = boto3.client('redshift-data')
response = client.execute_statement(
    ClusterIdentifier="test",
    Database="dev",
    SecretArn="{SECRET-ARN}",
    Sql="SELECT `COLUMN_NAME` FROM `INFORMATION_SCHEMA`.`COLUMNS` WHERE `TABLE_SCHEMA`='dev' AND `TABLE_NAME`='dojoredshift'")
I got the response but there is no table schema inside it
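Note that execute_statement in the Redshift Data API runs asynchronously, so that immediate response only carries a statement Id; the rows have to be fetched afterwards. A hedged sketch of polling for the result (the backtick-free query and the 'public' schema name are assumptions):
import time
import boto3

client = boto3.client('redshift-data')

# Hedged sketch: submit the statement, wait for it to finish, then read the
# result set with get_statement_result.
response = client.execute_statement(
    ClusterIdentifier="test", Database="dev", SecretArn="{SECRET-ARN}",
    Sql="SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = 'public' AND table_name = 'dojoredshift'")

statement_id = response['Id']
while client.describe_statement(Id=statement_id)['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
    time.sleep(1)

result = client.get_statement_result(Id=statement_id)
column_names = [record[0]['stringValue'] for record in result['Records']]
print(column_names)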
Below is the code I used to connect; I am getting a connection timeout:
import psycopg2

HOST = 'xx.xx.xx.xx'
PORT = 5439
USER = 'aswuser'
PASSWORD = 'Password1!'
DATABASE = 'dev'

def db_connection():
    conn = psycopg2.connect(host=HOST, port=PORT, user=USER, password=PASSWORD, database=DATABASE)
    return conn
To get the IP address, go to https://ipinfo.info/html/ip_checker.php and pass in your Redshift cluster hostname (xx.xx.us-east-1.redshift.amazonaws.com), or you can see it on the cluster page itself.
I got this error while running the above code:
OperationalError: could not connect to server: Connection timed out
Is the server running on host "x.xx.xx..xx" and accepting
TCP/IP connections on port 5439?
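A timeout at this point usually means the cluster's security group does not allow inbound traffic on port 5439 from the client. A hedged boto3 sketch of adding such a rule (the group id, region and CIDR range are placeholders):
import boto3

# Hedged sketch: open port 5439 to the client's address range. GroupId and
# CidrIp are placeholders; prefer a narrow range over 0.0.0.0/0.
ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.authorize_security_group_ingress(
    GroupId='sg-xxxxxxxx',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 5439,
        'ToPort': 5439,
        'IpRanges': [{'CidrIp': '203.0.113.0/24', 'Description': 'client access'}],
    }],
)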
I fixed it with the code below, and added the rules mentioned above.
import boto3
import logging
import psycopg2

logger = logging.getLogger(__name__)

# Credentials can be set using different methodologies. For this test,
# I ran from my local machine, where I used the CLI command "aws configure"
# to set my access key and secret access key.
client = boto3.client(service_name='redshift', region_name='us-east-1')

# Using boto3 to get the database password instead of hardcoding it in the code
cluster_creds = client.get_cluster_credentials(
    DbUser='awsuser',
    DbName='dev',
    ClusterIdentifier='redshift-cluster-1',
    AutoCreate=False)

try:
    # Database connection below uses the DbPassword that boto3 returned
    conn = psycopg2.connect(
        host='redshift-cluster-1.cvlywrhztirh.us-east-1.redshift.amazonaws.com',
        port='5439',
        user=cluster_creds['DbUser'],
        password=cluster_creds['DbPassword'],
        database='dev'
    )
    # Verify that the connection worked
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    results = cursor.fetchone()
    ver = results[0]
    if ver is None:
        print("Could not find version")
    else:
        print("The version is " + ver)
except Exception:
    logger.exception('Failed to open database connection.')
    print("Failed")

ODBC for SQL Server in Python

I have a requirement to extract data from SQL Server and create .csv files from numerous tables, so I created a Python script to do this, using a pyodbc/turbodbc connection with the SQL Server ODBC drivers. It works fine sometimes; however, it disconnects when it hits a large table (over 11M rows), and performance-wise it is very slow. I tried FreeTDS, but it looks the same as pyodbc in terms of performance.
This is my connection:
pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',server=systemname,UID=user_name,PWD=pwd)
def connect_to_SQL_Server(logins):
    '''Connects to SQL Server.
    Returns connection object or None.
    '''
    con = None
    try:
        hostname = logins['hostname']
        username = logins['sql_username']
        password = logins['snow_password']
        #con = turbodbc.connect(Driver='/usr/lib64/libtdsodbc.so', server=hostname, UID=username, PWD=password, TDS_Version=8.0)
        #con = pyodbc.connect(Driver='/usr/lib64/libtdsodbc.so', server=hostname, UID=username, PWD=password, TDS_Version=8.0, Trace='Yes', ForceTrace='Yes', TraceFile='/maxbill_mvp_data/all_data/sql.log')
        con = pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1', server=hostname, UID=username, PWD=password)
        #con = turbodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1', server=hostname, UID=username, PWD=password)
        #con = pyodbc.connect(DSN='MSSQLDEV', server=hostname, UID=username, PWD=password)
        return con
    except (pyodbc.ProgrammingError, Exception) as error:
        logging.critical(error)

sqlCon = connect_to_SQL_Server(logins)
sql = 'select * from table'

i = 0
for partial_df in pd.read_sql(sql, sqlCon, chunksize=300000):
    #chunk.to_csv(f + '_' + str(i) + '.csv', index=False, header=False, sep=',', mode='a+')
    partial_df.to_csv(filenamewithpath + '_' + str(i) + '.csv.gz', compression='gzip',
                      index=False, sep='\01', header=False, mode='a+')
    i += 1
Are there any parameters I can try for a performance improvement? Just to let you know, these Python scripts run on a different server than the one hosting SQL Server, and that server is a Linux cloud instance.
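For reference, a minimal sketch of one alternative: stream rows straight from the pyodbc cursor into a gzip-compressed CSV with fetchmany(), which keeps memory bounded and skips the per-chunk pandas overhead. The output path, batch size and reuse of sqlCon are illustrative.
import csv
import gzip

def export_table(con, sql, out_path, batch_size=100000):
    # Hedged sketch: fetch rows in fixed-size batches and write them with
    # csv.writer so no full DataFrame is ever held in memory.
    cursor = con.cursor()
    cursor.execute(sql)
    with gzip.open(out_path, 'wt', newline='') as fh:
        writer = csv.writer(fh, delimiter='\x01')
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            writer.writerows(rows)

export_table(sqlCon, 'select * from table', 'table_0.csv.gz')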

Teradata and sqlalchemy connection

I wish to use sqlalchemy with the teradata dialect to push a csv into a table.
So far I wrote this:
import pandas as pd
from sqlalchemy import create_engine

user = '******'
pasw = '******'
host = 'FTGPRDTD'
DATABASE = 'DB_FTG_SRS_DATALAB'

# connect
td_engine = create_engine('teradata://' + user + ':' + pasw + '@' + DBCNAME + ':1025/')
print('ok step one')
print(td_engine)

# execute sql
df = pd.read_csv(r'C:/Users/c92434/Desktop/Load.csv')
print('df loaded')
df.to_sql(name='mdc_load', con=td_engine, index=False, schema=DATABASE,
          if_exists='replace')
print('ok step two')
This is the error message I get:
DatabaseError: (teradata.api.DatabaseError) (0, '[08001] [TPT][ODBC SQL Server Wire Protocol driver]Invalid Connection Data., [TPT][ODBC SQL Server Wire Protocol driver]Invalid attribute in connection string: DBCNAME.')
(Background on this error at: http://sqlalche.me/e/4xp6)
What can I do?
Hopefully you've solved this by now, but I had success with this. Looking at what you provided, it looks like the host information you set is not being used in the connection string. My example includes the dtype parameter, which I use to define the data type for each column so they don't show up as CLOB.
database = "database_name"
table = "mdc_load"
user = "user"
password = "password"
host = 'FTGPRDTD:1025'
td_engine = create_engine(f'teradata://{user}:{password}#{host}/?database={database}&driver=Teradata&authentication=LDAP')
conn = td_engine.connect()
data.to_sql(name=table, con=conn, index=False, if_exists='replace', dtype=destType)
conn.close()
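destType is not shown in the answer; presumably it is a dictionary mapping column names to SQLAlchemy types so text columns land as VARCHAR instead of CLOB. A hedged guess at its shape (the column names and lengths are purely illustrative):
from sqlalchemy.types import VARCHAR, Integer

# Hypothetical example of destType; replace names and types with your own.
destType = {
    'customer_name': VARCHAR(100),
    'customer_id': Integer(),
}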
The "teradata" dialect (sqlalchemy-teradata module) relies on a Teradata ODBC driver being separately installed on the client platform. If you have multiple ODBC drivers installed that include the word Teradata in the name (for example, because you installed TPT with the Teradata-branded drivers for other database platforms), you may need to explicitly specify the one to be used by appending an optional parameter to your connection string, e.g.
td_engine = create_engine('teradata://' + user + ':' + pasw + '@' + DBCNAME + ':1025/?driver=Teradata Database ODBC Driver 16.20')
Alternatively, you could use the "teradatasql" dialect (teradatasqlalchemy module) which does not require ODBC.
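A minimal sketch of that ODBC-free route, assuming the teradatasqlalchemy package is installed (it registers the teradatasql:// dialect and speaks the Teradata wire protocol directly):
import pandas as pd
from sqlalchemy import create_engine

# Hedged sketch: host, credentials, file path and table/schema names are
# taken from the question above as placeholders.
td_engine = create_engine('teradatasql://user:password@FTGPRDTD')

df = pd.read_csv(r'C:/Users/c92434/Desktop/Load.csv')
df.to_sql(name='mdc_load', con=td_engine, index=False,
          schema='DB_FTG_SRS_DATALAB', if_exists='replace')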

How to Load Data into Amazon Redshift via Python Boto3?

In Amazon Redshift's Getting Started Guide, data is pulled from Amazon S3 and loaded into an Amazon Redshift Cluster utilizing SQLWorkbench/J. I'd like to mimic the same process of connecting to the cluster and loading sample data into the cluster utilizing Boto3.
However, in Boto3's documentation for Redshift, I'm unable to find a method that would allow me to upload data into an Amazon Redshift cluster.
I've been able to connect with Redshift utilizing Boto3 with the following code:
client = boto3.client('redshift')
But I'm not sure what method would allow me to either create tables or upload data to Amazon Redshift the way it's done in the tutorial with SQLWorkbench/J.
Right, you need the psycopg2 Python module to execute the COPY command.
My code looks like this:
import psycopg2

# Amazon Redshift connect string
conn_string = "dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"

# Connect to Redshift (database should be open to the world)
con = psycopg2.connect(conn_string)

sql = """COPY %s FROM '%s' credentials
'aws_access_key_id=%s; aws_secret_access_key=%s'
delimiter '%s' FORMAT CSV %s %s; commit;""" % (
    to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, delim, quote, gzip)

# Here:
#   fn - s3://path_to__input_file.gz
#   gzip = 'gzip'

cur = con.cursor()
cur.execute(sql)
con.close()
I used boto3/psycopg2 to write CSV_Loader_For_Redshift
Go back to step 4 in that tutorial you linked. See where it shows you how to get the URL of the cluster? You have to connect to that URL with a PostgreSQL driver. The AWS SDKs such as Boto3 provide access to the AWS API. You need to connect to Redshift over a PostgreSQL API, just like you would connect to a PostgreSQL database on RDS.
Using psycopg2 & get_cluster_credentials
Prerequisites -
IAM role attached to the respective user
IAM role with the get_cluster_credentials policy LINK
On cloud (EC2) with the appropriate IAM role attached
The code below will work only if you are deploying it on a PC/VM where a user's AWS credentials are already configured [CLI - aws configure], OR
you are on an instance in the same account/VPC.
Have a config.ini file -
[Redshift]
port = 3389
username = please_enter_username
database_name = please_enter_database_name
cluster_id = please_enter_cluster_id_name
url = please_enter_cluster_endpoint_url
region = us-west-2
My Redshift_connection.py
import logging
import psycopg2
import boto3
import ConfigParser

def db_connection():
    logger = logging.getLogger(__name__)

    parser = ConfigParser.ConfigParser()
    parser.read('config.ini')
    RS_PORT = parser.get('Redshift', 'port')
    RS_USER = parser.get('Redshift', 'username')
    DATABASE = parser.get('Redshift', 'database_name')
    CLUSTER_ID = parser.get('Redshift', 'cluster_id')
    RS_HOST = parser.get('Redshift', 'url')
    REGION_NAME = parser.get('Redshift', 'region')

    client = boto3.client('redshift', region_name=REGION_NAME)
    cluster_creds = client.get_cluster_credentials(DbUser=RS_USER,
                                                   DbName=DATABASE,
                                                   ClusterIdentifier=CLUSTER_ID,
                                                   AutoCreate=False)
    try:
        conn = psycopg2.connect(
            host=RS_HOST,
            port=RS_PORT,
            user=cluster_creds['DbUser'],
            password=cluster_creds['DbPassword'],
            database=DATABASE
        )
        return conn
    except psycopg2.Error:
        logger.exception('Failed to open database connection.')
        print "Failed"
Query execution script -
from Redshift_Connection import db_connection

def executescript(redshift_cursor):
    query = "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>"
    cur = redshift_cursor
    cur.execute(query)

conn = db_connection()
conn.set_session(autocommit=False)
cursor = conn.cursor()
executescript(cursor)
conn.close()

create a database using pyodbc

I am trying to create a database using pyodbc; however, it seems to be a paradox, as pyodbc needs to connect to a database first, and the new database would be created within the connected one. Please correct me if I am wrong.
In my case, I used the following code to create a new database:
conn = pyodbc.connect("driver={SQL Server};server= serverName; database=databaseName; trusted_connection=true")
cursor = conn.cursor()
sqlcommand = """
CREATE DATABASE ['+ #IndexDBName +'] ON PRIMARY
( NAME = N'''+ #IndexDBName+''', FILENAME = N''' + #mdfFileName + ''' , SIZE = 4000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'''+ #IndexDBName+'_log'', FILENAME = N''' + #ldfFileName + ''' , SIZE = 1024KB , MAXSIZE = 100GB , FILEGROWTH = 10%)'
"""
cursor.execute(sqlcommand)
cursor.commit()
conn.commit()
The above code runs without errors; however, no database is created.
So how can I create a database using pyodbc?
Thanks a lot.
If you try to create a database with the default autocommit value for the connection, you should receive an error like the following. If you're not seeing this error message, try updating the SQL Server native client for a more descriptive message:
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][SQL Server Native Client 11.0]
[SQL Server]CREATE DATABASE statement not allowed within multi-statement transaction.
(226) (SQLExecDirectW)')
Turn on autocommit for the connection to resolve:
conn = pyodbc.connect("driver={SQL Server};server=serverName; database=master; trusted_connection=true",
autocommit=True)
Note two things:
autocommit is not part of the connection string; it is a separate keyword argument passed to the connect function
the initial connection's database context should be the master system database
As an aside, you may want to check the #IndexDBName, #mdfFileName, and #ldfFileName are being appropriately set in your T-SQL. With the code you provided, a database named '+ #IndexDBName +' would be created.
The accepted answer did not work for me but I managed to create a database using the following code on Ubuntu:
conn_str = r"Driver={/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.9.so.1.1};" + f"""
Server={server_ip};
UID=sa;
PWD=passwd;
"""
conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()
cursor.execute(f"CREATE DATABASE {db_name}")
This uses the default master database when connecting. You can check whether the database was created with this query:
SELECT name FROM master.sys.databases
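For example, run it through the same autocommit connection (a short usage sketch; cursor is the one created above):
# Usage sketch: list databases to confirm the new one exists.
cursor.execute("SELECT name FROM master.sys.databases")
print([row[0] for row in cursor.fetchall()])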
