How to Load Data into Amazon Redshift via Python Boto3?


In Amazon Redshift's Getting Started Guide, data is pulled from Amazon S3 and loaded into an Amazon Redshift cluster using SQLWorkbench/J. I'd like to reproduce that process, connecting to the cluster and loading the sample data, using Boto3.
However, in Boto3's Redshift documentation I can't find a method that would let me upload data into an Amazon Redshift cluster.
I've been able to connect to Redshift with Boto3 using the following code:
client = boto3.client('redshift')
But I'm not sure which method would let me create tables or upload data to Amazon Redshift the way the tutorial does with SQLWorkbench/J.

Right, you need the psycopg2 Python module to execute the COPY command.
My code looks like this:
import psycopg2
# Amazon Redshift connection string
conn_string = "dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"
# Connect to Redshift (the database should be open to the world)
con = psycopg2.connect(conn_string)
# Here:
# to_table - target table name
# fn - s3://path_to__input_file.gz
# gzip = 'gzip'
# The argument tuple sits inside parentheses so the statement can span lines
sql = """COPY %s FROM '%s' credentials
'aws_access_key_id=%s;aws_secret_access_key=%s'
delimiter '%s' FORMAT CSV %s %s; commit;""" % (
    to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, delim, quote, gzip)
cur = con.cursor()
cur.execute(sql)
con.close()
I used boto3/psycopg2 to write CSV_Loader_For_Redshift
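As a variation on the COPY statement above, here is a minimal sketch that authorizes the load with an IAM role attached to the cluster instead of embedding access keys in the SQL. The helper name build_copy_sql, the table name, the bucket path, and the role ARN are hypothetical placeholders, not part of the original answer; the resulting string would be passed to cur.execute() on a psycopg2 cursor.

```python
def build_copy_sql(table, s3_uri, iam_role_arn, delimiter=",", gzip=False):
    """Return a Redshift COPY statement that loads CSV data from S3.

    Hypothetical helper: values are interpolated as-is, so only pass
    trusted table names and paths.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")  # input files are gzip-compressed
    return "\n".join(parts) + ";"


sql = build_copy_sql(
    "public.sample_table",                            # placeholder table
    "s3://my-bucket/path/input_file.gz",              # placeholder S3 object
    "arn:aws:iam::123456789012:role/MyRedshiftRole",  # placeholder role ARN
    gzip=True,
)
print(sql)
```

With psycopg2, remember to call con.commit() after cur.execute(sql), because psycopg2 opens a transaction by default.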

Go back to step 4 in that tutorial you linked. See where it shows you how to get the URL of the cluster? You have to connect to that URL with a PostgreSQL driver. The AWS SDKs such as Boto3 provide access to the AWS API. You need to connect to Redshift over a PostgreSQL API, just like you would connect to a PostgreSQL database on RDS.
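The split described above can be sketched as follows: Boto3 talks to the AWS API (for example, to look up the cluster endpoint), while a PostgreSQL driver such as psycopg2 runs the SQL. This is only a sketch; the function name connection_kwargs and all sample values are hypothetical, and the dictionary below stands in for the response of the Redshift describe_clusters call so the example runs without AWS access.

```python
def connection_kwargs(describe_clusters_response, dbname, user, password):
    """Build psycopg2.connect() keyword arguments from a describe_clusters response."""
    cluster = describe_clusters_response["Clusters"][0]
    endpoint = cluster["Endpoint"]
    return {
        "host": endpoint["Address"],
        "port": endpoint["Port"],
        "dbname": dbname,
        "user": user,
        "password": password,
    }


# Stand-in for boto3.client("redshift").describe_clusters(ClusterIdentifier=...)
sample_response = {
    "Clusters": [
        {
            "ClusterIdentifier": "examplecluster",
            "Endpoint": {
                "Address": "examplecluster.abc123.us-west-2.redshift.amazonaws.com",
                "Port": 5439,
            },
        }
    ]
}

kwargs = connection_kwargs(sample_response, "dev", "awsuser", "example-password")
print(kwargs["host"], kwargs["port"])
# With psycopg2 installed: conn = psycopg2.connect(**kwargs)
```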

Using psycopg2 and get_cluster_credentials
Prerequisites -
An IAM role attached to the respective user
An IAM role with a policy that allows get_cluster_credentials LINK
If running in the cloud (EC2), an appropriate IAM role attached to the instance
The code below works only if you deploy it on a PC/VM where a user's AWS credentials are already configured [ CLI - aws configure ], or
on an instance in the same account and VPC.
Have a config.ini file -
[Redshift]
port = 5439
username = please_enter_username
database_name = please_database-name
cluster_id = please_enter_cluster_id_name
url = please_enter_cluster_endpoint_url
region = us-west-2
My Redshift_connection.py
import configparser
import logging
import boto3
import psycopg2
def db_connection():
    """Open a Redshift connection using temporary credentials from the AWS API."""
    logger = logging.getLogger(__name__)
    parser = configparser.ConfigParser()
    parser.read('config.ini')
    RS_PORT = parser.get('Redshift', 'port')
    RS_USER = parser.get('Redshift', 'username')
    DATABASE = parser.get('Redshift', 'database_name')
    CLUSTER_ID = parser.get('Redshift', 'cluster_id')
    RS_HOST = parser.get('Redshift', 'url')
    REGION_NAME = parser.get('Redshift', 'region')
    client = boto3.client('redshift', region_name=REGION_NAME)
    # Ask the Redshift API for a temporary database user and password
    cluster_creds = client.get_cluster_credentials(DbUser=RS_USER,
                                                   DbName=DATABASE,
                                                   ClusterIdentifier=CLUSTER_ID,
                                                   AutoCreate=False)
    try:
        conn = psycopg2.connect(
            host=RS_HOST,
            port=RS_PORT,
            user=cluster_creds['DbUser'],
            password=cluster_creds['DbPassword'],
            database=DATABASE
        )
        return conn
    except psycopg2.Error:
        logger.exception('Failed to open database connection.')
        print("Failed")
Query Execution script -
# The module name must match the file name Redshift_connection.py exactly
from Redshift_connection import db_connection
def executescript(redshift_cursor):
    query = "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>"
    cur = redshift_cursor
    cur.execute(query)
conn = db_connection()
conn.set_session(autocommit=False)
cursor = conn.cursor()
executescript(cursor)
conn.close()
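As a small companion to the config.ini approach above, here is a sketch that validates the [Redshift] section up front and converts the port to an integer, so a missing key fails early rather than at connection time. The function name load_redshift_settings and the sample values are hypothetical; the config is read from a string here only so the example is self-contained.

```python
import configparser

SAMPLE_CONFIG = """
[Redshift]
port = 5439
username = example_user
database_name = dev
cluster_id = example-cluster
url = example-cluster.abc123.us-west-2.redshift.amazonaws.com
region = us-west-2
"""

REQUIRED_KEYS = ("port", "username", "database_name", "cluster_id", "url", "region")


def load_redshift_settings(text):
    """Parse the [Redshift] section and return a dict with the port as an int."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Redshift"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError(f"config is missing: {', '.join(missing)}")
    settings = {key: section[key] for key in REQUIRED_KEYS}
    settings["port"] = section.getint("port")
    return settings


settings = load_redshift_settings(SAMPLE_CONFIG)
print(settings)
```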

Related

Connect to cloudSQL db using service account with pymysql or mysql.connector

I have a running CloudSQL instance running in another VPC and a nginx proxy to allow cross-vpc access.
I can access the db using a built-in user. But how can I access the DB using a Google Service Account?
import google.auth
import google.auth.transport.requests
import mysql.connector
from mysql.connector import Error
import os
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)
connection = mysql.connector.connect(host=HOST,
                                     database=DB,
                                     user=SA_USER,
                                     password=creds.token)
if connection.is_connected():
    db_Info = connection.get_server_info()
    print("Connected to MySQL Server version ", db_Info)
    cur = connection.cursor()
    cur.execute("""SELECT now()""")
    query_results = cur.fetchall()
    print(query_results)
When using mysql.connector, I get this error:
DatabaseError: 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled
Then I tried using pymysql
import pymysql
import google.auth
import google.auth.transport.requests
import os
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)
try:
    conn = pymysql.connect(host=ENDPOINT, user=SA_USER, passwd=creds.token, port=PORT, database=DBNAME)
    cur = conn.cursor()
    cur.execute("""SELECT now()""")
    query_results = cur.fetchall()
    print(query_results)
except Exception as e:
    print("Database connection failed due to {}".format(e))
Database connection failed due to (1045, "Access denied for user 'xx'@'xxx.xxx.xx.xx' (using password: YES)")
I guess these errors are all related to the token.
Can anyone suggest a proper way to get a service account token for accessing the Cloud SQL DB?
PS: Using the Cloud SQL Auth proxy is not a good option for our architecture.

The error in your description indicates an authentication problem. To narrow down the cause, try the following:
Verify the username and corresponding password.
Check the origin of the connection to see whether it matches the host from which the user has access privileges.
Check the user's grant privileges in the database.
Since you are trying to access the DB with a Google service account, try using the default service account credentials to supply the authorization token. See the Client libraries and sample code page for more information. Alternatively, if you prefer to create the requests manually, you can use an OAuth 2.0 token; the Authorizing requests page explains how to create one. These access tokens are valid for only 60 minutes. An expired token does not disconnect existing clients, but if a client connection is broken and must reconnect to the instance more than an hour later, a new access token must be obtained and supplied on that connection attempt.
For your use case, since you don't want to use the Cloud SQL proxy, a service account IAM user is the better way to go.
Note that to get an appropriate access token, the scope must be set to the Cloud SQL Admin API.
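The 60-minute lifetime described above suggests checking the token's age before every reconnect. Here is a minimal sketch of that check, assuming the credentials object exposes its expiry as a naive UTC datetime (google-auth credentials have an expiry attribute, but verify this for your version). The function name token_needs_refresh and the five-minute margin are hypothetical choices.

```python
from datetime import datetime, timedelta


def token_needs_refresh(expiry, now, margin=timedelta(minutes=5)):
    """Return True when a token with the given expiry should be refreshed.

    expiry: naive UTC datetime when the token expires, or None if no token
            has been fetched yet.
    now:    current naive UTC datetime.
    """
    if expiry is None:
        return True
    return now >= expiry - margin


issued = datetime(2024, 1, 1, 12, 0, 0)
expiry = issued + timedelta(minutes=60)  # tokens last about an hour

print(token_needs_refresh(expiry, issued + timedelta(minutes=10)))  # False
print(token_needs_refresh(expiry, issued + timedelta(minutes=58)))  # True
```

Before opening a new connection, a caller would run this check, call credentials.refresh(auth_req) when it returns True, and then pass credentials.token as the password.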

It finally works.
I had to enforce SSL connection.
import pymysql
from google.oauth2 import service_account
import google.auth.transport.requests
scopes = ["https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/sqlservice.admin"]
credentials = service_account.Credentials.from_service_account_file('key.json', scopes=scopes)
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)
config = {'user': SA_USER,
          'host': ENDPOINT,
          'database': DBNAME,
          'password': credentials.token,
          'ssl_ca': './server-ca.pem',
          'ssl_cert': './client-cert.pem',
          'ssl_key': './client-key.pem'}
try:
    conn = pymysql.connect(**config)
    with conn:
        print("Connected")
        cur = conn.cursor()
        cur.execute("""SELECT now()""")
        query_results = cur.fetchall()
        print(query_results)
except Exception as e:
    print("Database connection failed due to {}".format(e))

I'd recommend using the Cloud SQL Python Connector; it should make your life much easier!
It manages the SSL connection for you (no need for cert files!), takes care of the credentials (it uses Application Default Credentials, which you can easily point at a service account), and lets you log in with Automatic IAM AuthN so that you don't have to pass the credentials token as a password.
Connecting looks like this:
from google.cloud.sql.connector import Connector, IPTypes
import sqlalchemy
import pymysql
# initialize Connector object
connector = Connector(ip_type=IPTypes.PRIVATE, enable_iam_auth=True)
# function to return the database connection
def getconn() -> pymysql.connections.Connection:
    conn: pymysql.connections.Connection = connector.connect(
        "project:region:instance",  # your Cloud SQL instance connection name
        "pymysql",
        user="my-user",
        db="my-db-name"
    )
    return conn
# create connection pool
pool = sqlalchemy.create_engine(
    "mysql+pymysql://",
    creator=getconn,
)
# insert statement
insert_stmt = sqlalchemy.text(
    "INSERT INTO my_table (id, title) VALUES (:id, :title)",
)
# interact with Cloud SQL database using connection pool
with pool.connect() as db_conn:
    # insert into database (bind parameters passed as a dict)
    db_conn.execute(insert_stmt, {"id": "book1", "title": "Book One"})
    # on SQLAlchemy 2.x, also call db_conn.commit() to persist the insert
    # query database
    result = db_conn.execute(sqlalchemy.text("SELECT * from my_table")).fetchall()
    # Do something with the results
    for row in result:
        print(row)
Let me know if you run into any issues! There is also an interactive Cloud SQL Notebook that walks you through things in more detail.

How to get the column names in redshift using Python boto3

I want to get the column names in Redshift using Python boto3. So far I have:
Created a Redshift cluster
Inserted data into it
Configured Secrets Manager
Configured a SageMaker notebook
Opened a Jupyter notebook and wrote the code below
import boto3
import time
client = boto3.client('redshift-data')
response = client.execute_statement(ClusterIdentifier = "test", Database= "dev", SecretArn= "{SECRET-ARN}",Sql= "SELECT `COLUMN_NAME` FROM `INFORMATION_SCHEMA`.`COLUMNS` WHERE `TABLE_SCHEMA`='dev' AND `TABLE_NAME`='dojoredshift'")
I got a response, but there is no table schema in it.
Below is the code I used to connect with psycopg2; the connection times out.
import psycopg2
HOST = 'xx.xx.xx.xx'
PORT = 5439
USER = 'aswuser'
PASSWORD = 'Password1!'
DATABASE = 'dev'
def db_connection():
    conn = psycopg2.connect(host=HOST, port=PORT, user=USER, password=PASSWORD, database=DATABASE)
    return conn
To get the IP address, go to https://ipinfo.info/html/ip_checker.php
and enter the hostname of your Redshift cluster (xx.xx.us-east-1.redshift.amazonaws.com); you can also see it on the cluster page itself.
I got this error while running the code above:
OperationalError: could not connect to server: Connection timed out
Is the server running on host "x.xx.xx..xx" and accepting
TCP/IP connections on port 5439?

I fixed it with the following code, after adding the rules above.
import logging
import boto3
import psycopg2
logger = logging.getLogger(__name__)
# Credentials can be set using different methodologies. For this test,
# I ran from my local machine, where I used the CLI command "aws configure"
# to set my access key and secret access key
client = boto3.client(service_name='redshift',
                      region_name='us-east-1')
#
# Using boto3 to get the database password instead of hardcoding it in the code
#
cluster_creds = client.get_cluster_credentials(
    DbUser='awsuser',
    DbName='dev',
    ClusterIdentifier='redshift-cluster-1',
    AutoCreate=False)
try:
    # Database connection below that uses the DbPassword that boto3 returned
    conn = psycopg2.connect(
        host='redshift-cluster-1.cvlywrhztirh.us-east-1.redshift.amazonaws.com',
        port='5439',
        user=cluster_creds['DbUser'],
        password=cluster_creds['DbPassword'],
        database='dev'
    )
    # Verifies that the connection worked
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    results = cursor.fetchone()
    ver = results[0]
    if ver is None:
        print("Could not find version")
    else:
        print("The version is " + ver)
except psycopg2.Error:
    logger.exception('Failed to open database connection.')
    print("Failed")
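For the original redshift-data attempt, note that execute_statement is asynchronous and returns only a statement Id, which is why the response contains no schema. The sketch below polls describe_statement and then reads get_statement_result. The function name fetch_column_names and the FakeDataClient stand-in are hypothetical (the fake lets the example run offline), and 'public' is only an assumed schema name; 'dev' in the question is the database, not the schema. The query uses standard quoting, since backticks are MySQL syntax, and it ignores result pagination for brevity.

```python
import time


def fetch_column_names(client, cluster_id, database, secret_arn, schema, table,
                       poll_seconds=0.5):
    """Return the column names of schema.table using the Redshift Data API.

    client is expected to behave like boto3.client('redshift-data').
    """
    sql = (
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = :schema AND table_name = :table "
        "ORDER BY ordinal_position"
    )
    statement = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
        Parameters=[
            {"name": "schema", "value": schema},
            {"name": "table", "value": table},
        ],
    )
    # execute_statement is asynchronous: poll until the statement finishes.
    while True:
        status = client.describe_statement(Id=statement["Id"])
        if status["Status"] == "FINISHED":
            break
        if status["Status"] in ("FAILED", "ABORTED"):
            raise RuntimeError(status.get("Error", "statement did not finish"))
        time.sleep(poll_seconds)
    result = client.get_statement_result(Id=statement["Id"])
    return [record[0]["stringValue"] for record in result["Records"]]


class FakeDataClient:
    """Offline stand-in so the sketch can run without AWS access."""

    def execute_statement(self, **kwargs):
        return {"Id": "stmt-1"}

    def describe_statement(self, Id):
        return {"Status": "FINISHED"}

    def get_statement_result(self, Id):
        return {"Records": [[{"stringValue": "id"}], [{"stringValue": "title"}]]}


columns = fetch_column_names(
    FakeDataClient(), "test", "dev", "{SECRET-ARN}", "public", "dojoredshift"
)
print(columns)
```

In real use, pass boto3.client('redshift-data') in place of FakeDataClient().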

Trying to connect to an Azure SQL database from Azure ML Service using MSI authentication (without a username and password)

I am trying to connect to an Azure SQL Database from Azure Machine Learning Service with MSI authentication (without a username and password).
I am trying to train a machine learning model on Azure Machine Learning Service, and I need data for that purpose, which is why I want to connect to the Azure SQL Database from Azure Machine Learning Service using MSI authentication.
But I got the error below:
"error": {"message": "Activity Failed:\n{\n \"error\": {\n \"code\": \"UserError\",\n \"message\": \"User program failed with KeyError: 'MSI_ENDPOINT'\",\n
Please check the below code that I have used for the database connection.
import logging
import struct
import pyodbc
import os
import requests
class AzureDbConnect:
    def __init__(self):
        print("Inside msi database")
        msi_endpoint = os.environ["MSI_ENDPOINT"]
        msi_secret = os.environ["MSI_SECRET"]
        resource_uri = 'https://database.windows.net/'
        logging.info(msi_endpoint)
        print(msi_endpoint)
        logging.info(msi_secret)
        print(msi_secret)
        print("Inside token")
        token_auth_uri = f"{msi_endpoint}?resource={resource_uri}&api-version=2017-09-01"
        head_msi = {'Secret': msi_secret}
        resp = requests.get(token_auth_uri, headers=head_msi)
        access_token = resp.json()['access_token']
        logging.info(access_token)
        print("Token is :- ")
        print(access_token)
        # Expand each token byte to two bytes (byte + zero) and prefix the length
        accesstoken = bytes(access_token, 'utf-8')
        exptoken = b""
        for i in accesstoken:
            exptoken += bytes({i})
            exptoken += bytes(1)
        tokenstruct = struct.pack("=i", len(exptoken)) + exptoken
        # Server and port are separated by a comma in the ODBC connection string
        conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                              "Server=tcp:<Server Name>,"
                              "1433;Database=<Database Name>",
                              attrs_before={1256: bytearray(tokenstruct)})
        print(conn)
        self.sql_db = conn.cursor()
Is there any way to connect to an Azure SQL Database from Azure Machine Learning Service with MSI authentication?

Currently, MSI authentication is not supported for connecting to Azure SQL DB from Azure ML; it is on the roadmap to be added in the future. Usually this comes down to setting up service principals in the DB. Attached is a step-by-step guide for getting this set up for Azure ML.
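Whichever way the access token is obtained (for example through a service principal), the byte-by-byte packing loop in the question can be written more compactly. For an ASCII token, appending a zero byte after each byte is the same as encoding to UTF-16-LE. This is only a sketch: the helper name pack_access_token is hypothetical, and 1256 is the connection attribute number taken from the question's code.

```python
import struct

SQL_COPT_SS_ACCESS_TOKEN = 1256  # connection attribute used in the question


def pack_access_token(access_token):
    """Encode a token as UTF-16-LE bytes prefixed with a 4-byte length."""
    encoded = access_token.encode("utf-16-le")
    return struct.pack("<I", len(encoded)) + encoded


token_struct = pack_access_token("ab")
print(token_struct)
# With pyodbc installed and a real token:
# conn = pyodbc.connect(conn_str, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct})
```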

Why does my login to MS SQL with AzureML dataprep using Windows authentication fail?

I tried connecting to a MS SQL database using azureml.dataprep in an Azure Notebook, as outlined in https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-load-data#load-sql-data, using MSSqlDataSource, using code of the form
import azureml.dataprep as dprep
secret = dprep.register_secret(value="[SECRET-PASSWORD]", id="[SECRET-ID]")
ds = dprep.MSSQLDataSource(server_name="[SERVER-NAME]",
database_name="[DATABASE-NAME], [PORT]",
user_name="[DATABASE-USERNAME]",
password=secret)
I set [DATABASE-USERNAME] to MYWINDOWSDOMAIN\\MYWINDOWSUSERNAME and the password [SECRET-PASSWORD] to my Windows password (i.e. I am trying to use Windows authentication).
After firing a query with
dataflow = dprep.read_sql(ds, "SELECT top 100 * FROM [dbo].[MYTABLE]")
dataflow.head(5)
I get
ExecutionError: Login failed.
I could connect to other databases without Windows Authentication fine. What am I doing wrong?

Consider using SQL server authentication as a workaround/alternative solution to connect to that db (the same dataflow syntax will work):
import azureml.dataprep as dprep
secret = dprep.register_secret(value="[SECRET-PASSWORD]", id="[SECRET-ID]")
ds = dprep.MSSQLDataSource(server_name="[SERVER-NAME],[PORT]",
database_name="[DATABASE-NAME]",
user_name="[DATABASE-USERNAME]",
password=secret)
Note that dataprep is deprecated; sqlalchemy can be used as an alternative:
import pandas as pd
from sqlalchemy import create_engine
def mssql_engine(user="[DATABASE-USERNAME]",
                 password="[SECRET-PASSWORD]",
                 host="[SERVER-NAME],[PORT]",
                 db="[DATABASE-NAME]"):
    # URL format: dialect+driver://user:password@host/database
    engine = create_engine(f'mssql+pyodbc://{user}:{password}@{host}/{db}?driver=SQL+Server')
    return engine
query = "SELECT ..."
df = pd.read_sql(query, mssql_engine())
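One caveat with hand-built SQLAlchemy URLs: a password containing characters such as @ or / breaks URL parsing unless it is percent-encoded. Here is a minimal sketch using only the standard library; the function name mssql_url, the driver name, and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def mssql_url(user, password, host, db, driver="ODBC Driver 17 for SQL Server"):
    """Build a mssql+pyodbc URL, escaping characters that would break parsing."""
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{db}?driver={quote_plus(driver)}"
    )


url = mssql_url("app_user", "p@ss/word", "myserver.example.com", "mydb")
print(url)
```

The resulting string can be passed to sqlalchemy.create_engine() in place of the f-string.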

Here is the MS Doc on MSSQLDataSource. MSSQLDataSource instances have a property, credentials_type which defaults to SERVER. Try explicitly setting this to WINDOWS before you do your query. Also, the port should be specified together with the server name.
import azureml.dataprep as dprep
windows_domain = 'localhost'
windows_user = 'my_user'
windows_password = 'my_password'
secret = dprep.register_secret(value=windows_password, id="password")
ds = dprep.MSSQLDataSource(server_name="localhost",
database_name="myDb",
user_name=f'{windows_domain}\\{windows_user}',
password=secret)
ds.credentials_type = dprep.DatabaseAuthType.WINDOWS
dataflow = dprep.read_sql(ds, "SELECT top 100 * FROM [dbo].[MYTABLE]")
dataflow.head(5)

Unload to S3 with Python using IAM Role credentials

In Redshift, I run the following to unload data from a table into a file in S3:
unload('select * from table')
to 's3://bucket/unload/file_'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>'
I would like to do the same in Python. Any suggestions on how to replicate this? I saw examples using an access key and secret, but that is not an option for me; I need to use role-based credentials on a non-public bucket.

You will need two sets of credentials: IAM credentials, via an IAM role, to access the S3 bucket, and Redshift ODBC credentials to execute SQL commands.
Create a Python program that connects to Redshift, in a manner similar to other databases such as SQL Server, and executes your query. This program needs Redshift login credentials (Redshift username and password), not IAM credentials.
The IAM credentials for S3 are assigned as a role to Redshift so that Redshift can store the results on S3. This is the iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>' part of the Redshift query in your question.
You do not need boto3 (or boto) to access Redshift, unless you plan to actually interface with the Redshift API (which does not access the database stored inside Redshift).
Here is an example Python program to access Redshift. Credit is due to Varun Verma.
There are other examples on the Internet to help you get started.
############ REQUIREMENTS ####################
# sudo apt-get install python-pip
# sudo apt-get install libpq-dev
# sudo pip install psycopg2
# sudo pip install sqlalchemy
# sudo pip install sqlalchemy-redshift
##############################################
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
# Written for SQLAlchemy 1.x, where execute() accepts a plain SQL string
#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<<
DATABASE = "dbname"
USER = "username"
PASSWORD = "password"
HOST = "host"
PORT = "5439" #Redshift default port
SCHEMA = "public" #default is "public"
####### connection and session creation ##############
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###### All Set Session created using provided schema #######
################ write queries from here ######################
query = "unload('select * from table') to 's3://bucket/unload/file_' iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>';"
rr = s.execute(query)
def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ----", r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")
# UNLOAD writes to S3 and returns no rows; only print results for queries that do
if rr.returns_rows:
    pretty(rr.fetchall())
s.commit()
########## close session in the end ###############
s.close()
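Because UNLOAD takes its query as a quoted string literal, any single quotes inside the SELECT must be doubled. Here is a minimal sketch of a helper that assembles the statement with role-based authorization; the name build_unload_sql, the sample query, and the role ARN are hypothetical placeholders.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Return a Redshift UNLOAD statement for the given SELECT.

    Single quotes inside the SELECT are doubled because UNLOAD takes the
    query as a quoted string literal.
    """
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


sql = build_unload_sql(
    "select * from my_table where status = 'active'",  # placeholder query
    "s3://bucket/unload/file_",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",   # placeholder role ARN
)
print(sql)
```

The returned string can be executed over any Redshift SQL connection, such as the session in the program above.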
